This XML document is invalid, likely due to invalid (*5)

techcafe
I Can't Search Before Posting
Posts: 2
Joined: 05 Apr 2013, 06:30

Postby techcafe » 05 Apr 2013, 06:36

Hi All,

Great to be using a great tool after moving away from Google Reader. I have successfully imported the majority of my feeds; sadly, however, 10% of them fail with errors like these:

This XML document is invalid, likely due to invalid characters. XML error: Undeclared entity error at line 121, column 53
This XML document is invalid, likely due to invalid characters. XML error: XML_ERR_NAME_REQUIRED at line 7, column 258
This XML document is invalid, likely due to invalid characters. XML error: Reserved XML Name at line 2, column 38
This XML document is invalid, likely due to invalid characters. XML error: SYSTEM or PUBLIC, the URI is missing at line 3, column 56
This XML document is invalid, likely due to invalid characters. XML error: XML_ERR_NAME_REQUIRED at line 698, column 20

I tried doing research but came to a dead end unless I was searching in the wrong place.

Thanks all

murf
Bear Rating Trainee
Posts: 3
Joined: 05 Apr 2013, 10:56

Re: This XML document is invalid, likely due to invalid (*5)

Postby murf » 05 Apr 2013, 11:04

Your feed url has likely changed, and the old url is returning garbage. Well, not exactly garbage, but probably something like some HTML notice designed to be read by a human (e.g., someone let their domain expire and some squatter is now sitting on it). Or, worse still, someone redesigned the site, and the old rss feed url now redirects back to the home page instead of returning rss.

Basically, you've got to check and probably redo the feeds that have this error. (And that's exactly what I did with my imported feeds. :-( )
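A quick way to triage the failing feeds is to look at what each old URL returns now. The following is only a rough sketch, not anything from Tiny Tiny RSS: the helper name `classify_feed_body` and its three labels are made up for illustration, and it assumes you have already downloaded the response body by whatever means you like.

```python
import xml.etree.ElementTree as ET

# Root element names (namespace stripped) that indicate a real feed.
FEED_ROOTS = {"rss", "feed", "RDF"}


def classify_feed_body(body: bytes) -> str:
    """Return 'feed', 'html', or 'unknown' for a downloaded response body."""
    try:
        root = ET.fromstring(body)
    except ET.ParseError:
        # Not well-formed XML; check whether it is an HTML page instead.
        head = body[:1024].lower()
        if b"<html" in head or b"<!doctype html" in head:
            return "html"
        return "unknown"
    # Tags look like '{namespace}name' when a namespace is declared.
    name = root.tag.rsplit("}", 1)[-1]
    if name in FEED_ROOTS:
        return "feed"
    if name.lower() == "html":
        return "html"
    return "unknown"
```

Anything that comes back as 'html' is a good candidate for a moved or expired feed URL that needs to be looked up again on the site.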

xtaz
Bear Rating Master
Posts: 174
Joined: 24 Dec 2009, 16:48

Re: This XML document is invalid, likely due to invalid (*5)

Postby xtaz » 05 Apr 2013, 14:51

I found that around 10% of my feeds gave similar errors. When I looked into it, I found that they were all FeedBurner feeds. FeedBurner has a habit of returning HTML depending on the user agent. I fixed it by adding ?format=xml to the end of the feed URL. If it's a FeedBurner link, try editing the feed and adding that, then press the f r hotkey while viewing the feed to see if it loads.
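If there are many FeedBurner URLs to fix, the edit can be scripted. This is a minimal sketch under assumptions: the function name `force_feedburner_xml` is hypothetical, and it only recognizes hosts ending in feedburner.com, so FeedBurner feeds served from a custom domain would be left untouched.

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit


def force_feedburner_xml(url: str) -> str:
    """Add format=xml to a FeedBurner URL; return other URLs unchanged."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    # Assumption: FeedBurner feeds are served from feedburner.com hosts.
    if host != "feedburner.com" and not host.endswith(".feedburner.com"):
        return url
    # Drop any existing format parameter, then append format=xml.
    query = [(k, v) for k, v in parse_qsl(parts.query, keep_blank_values=True)
             if k != "format"]
    query.append(("format", "xml"))
    return urlunsplit(parts._replace(query=urlencode(query)))
```

For example, a made-up URL such as http://feeds.feedburner.com/example would become http://feeds.feedburner.com/example?format=xml.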

daweb
Bear Rating Trainee
Posts: 3
Joined: 29 Mar 2013, 19:55

Re: This XML document is invalid, likely due to invalid (*5)

Postby daweb » 05 Apr 2013, 17:36

I had a similar issue and found that yup, some of them were simply no longer valid. Had to check manually each of the bad ones by going to the sites and verifying the feed links. It was worth it for me. :-)

raindog469
Bear Rating Trainee
Posts: 17
Joined: 17 Mar 2013, 22:35

Re: This XML document is invalid, likely due to invalid (*5)

Postby raindog469 » 05 Apr 2013, 18:17

Some sites also just produce invalid XML (unescaped characters like & > <, control characters and non-ASCII characters, unescaped spaces in URLs, etc). Someone on another thread came up with a way to run every fetched feed through xmllint, but I'd already made a feed cleaner proxy in Perl, tiny enough to paste here (and it should be trivial to implement in PHP):

Code:

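#!/usr/bin/perl
# Feed cleaner proxy: fetches a feed and patches up common XML errors.

use strict;
use warnings;
use CGI qw(:standard);

my $url = param("feed");
die "Bad URL" unless defined $url && $url =~ m{^https?://}i;

# List-form open runs wget directly, without going through a shell.
open my $wget, "-|", "wget", "-q", "-O-", $url or die "Cannot run wget: $!";
my $feed = join('', <$wget>);
close $wget;

# Replace control characters (other than tab, LF, CR) and non-ASCII bytes with spaces.
$feed =~ s/[^\x09\x0a\x0d\x20-\x7e]/ /g;
# Percent-encode whitespace inside href values, keeping the closing quote.
1 while $feed =~ s/(href="[^"]+)\s([^"]*)"/$1%20$2"/ig;
# Escape bare ampersands, leaving predefined and numeric entities intact.
$feed =~ s/&(?!(?:amp|lt|gt|quot|apos|#\d+|#x[0-9a-f]+);)/&amp;/gi;
# Force a line break between adjacent tags.
$feed =~ s/></>\n</g;

print header("application/rss+xml");
print $feed;
exit 0;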


It's poorly-written code, but it handles poorly-written XML. I wrote it years ago when akregator or whatever desktop feed reader I was using would choke on feeds, and updated it when I switched from Google Reader to TT-RSS. To use it, I just prepend "http://servername/cgi-bin/feedclean.cgi?feed=" to the feed I'm having problems with. Probably 20-30 of my ~500 feeds have this issue, and usually it's not on every fetch (just when there's an article whose content has issues when parsed as XML). I also force a linebreak between adjacent tags with no text between them, which cleaned up a feed without invalid characters that Simplepie was choking on for some reason that I never figured out.
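One caveat with plain prepending: if the feed URL has its own query string containing "&", the CGI script will see only the part before the first "&" as the feed parameter, because "&" separates parameters in a query string. Percent-encoding the feed URL avoids that. A minimal sketch, assuming the proxy lives at the example address from the post (the helper name `proxied` is hypothetical):

```python
from urllib.parse import urlencode

# Hypothetical proxy location, taken from the example in the post above.
PROXY = "http://servername/cgi-bin/feedclean.cgi"


def proxied(feed_url: str, proxy: str = PROXY) -> str:
    """Build the proxy URL with the feed URL percent-encoded as 'feed'."""
    return proxy + "?" + urlencode({"feed": feed_url})
```

The CGI module decodes the parameter again on the server side, so the script still receives the original feed URL in full.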

raindog469
Bear Rating Trainee
Posts: 17
Joined: 17 Mar 2013, 22:35

Re: This XML document is invalid, likely due to invalid (*5)

Postby raindog469 » 05 Apr 2013, 18:21

(Since I have no edit button, I just want to add the disclaimer that you probably don't want to put that script on a public-facing server. I used "open" in a way that shouldn't spawn a shell and open up the associated vulnerabilities with passing unvalidated input to a shell, but there could be other vulnerabilities I didn't consider since I just wrote it for my own use.)

techcafe
I Can't Search Before Posting
Posts: 2
Joined: 05 Apr 2013, 06:30

Re: This XML document is invalid, likely due to invalid (*5)

Postby techcafe » 05 Apr 2013, 19:34

Excellent, thanks for your comments. I'm going to investigate more. It looks like several of the feeds are there but Tiny just won't take them :\

I'll try all of your suggestions and give it another go.

Thanks all!

sleeper_service
Bear Rating Overlord
Posts: 884
Joined: 30 Mar 2013, 23:50
Location: Dallas, Texas

Re: This XML document is invalid, likely due to invalid (*5)

Postby sleeper_service » 06 Apr 2013, 00:17

raindog469 wrote: Some sites also just produce invalid XML (unescaped characters like & > <, control characters and non-ASCII characters, unescaped spaces in URLs, etc). Someone on another thread came up with a way to run every fetched feed through xmllint, but I'd already made a feed cleaner proxy in Perl, tiny enough to paste here (and it should be trivial to implement in PHP):

....
It's poorly-written code, but it handles poorly-written XML. I wrote it years ago when akregator or whatever desktop feed reader I was using would choke on feeds, and updated it when I switched from Google Reader to TT-RSS. To use it, I just prepend "http://servername/cgi-bin/feedclean.cgi?feed=" to the feed I'm having problems with. Probably 20-30 of my ~500 feeds have this issue, and usually it's not on every fetch (just when there's an article whose content has issues when parsed as XML). I also force a linebreak between adjacent tags with no text between them, which cleaned up a feed without invalid characters that Simplepie was choking on for some reason that I never figured out.


Thanks so much for that, it worked *great* cleaning up bad characters in the feed from http://scienceblogs.com/startswithabang/

dharm0us
Bear Rating Trainee
Posts: 1
Joined: 19 Apr 2013, 11:15

Re: This XML document is invalid, likely due to invalid (*5)

Postby dharm0us » 19 Apr 2013, 11:22

I just came across the same problem with this feed: http://techcircle.vccircle.com/feed/.
The problem here is that there is whitespace at the beginning of the XML output.

So, here is the solution:

1. Open the file lib/simplepie/simplepie.inc
2. Find this piece of code (somewhere around line 1342):

Code:

// Loop through each possible encoding, till we return something, or run out of possibilities
foreach ($encodings as $encoding)
{
   // Change the encoding to UTF-8 (as we always use UTF-8 internally)
   if ($utf8_data = $this->registry->call('Misc', 'change_encoding', array($this->raw_data, $encoding, 'UTF-8')))


Then add this line inside the foreach loop:

Code:

$this->raw_data = trim($this->raw_data);


Your code should then look like this:

Code:

// Loop through each possible encoding, till we return something, or run out of possibilities
foreach ($encodings as $encoding)
{
   // Strip leading/trailing whitespace so the XML declaration comes first
   $this->raw_data = trim($this->raw_data);
   // Change the encoding to UTF-8 (as we always use UTF-8 internally)
   if ($utf8_data = $this->registry->call('Misc', 'change_encoding', array($this->raw_data, $encoding, 'UTF-8')))
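Why the trim helps can be shown outside SimplePie. The sketch below uses Python's standard-library parser, not SimplePie or PHP, so it only illustrates the general XML rule: an XML declaration, if present, must be the very first thing in the document. The sample feed bytes and the helper name `parses` are made up for the demonstration.

```python
import xml.etree.ElementTree as ET

# Made-up feed with blank lines and spaces before the XML declaration.
raw = (
    b"\n\n  <?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
    b"<rss version=\"2.0\"><channel><title>Demo</title></channel></rss>\n"
)


def parses(data: bytes) -> bool:
    """Return True if data is well-formed XML."""
    try:
        ET.fromstring(data)
        return True
    except ET.ParseError:
        return False


print(parses(raw))          # leading whitespace before <?xml ...?>
print(parses(raw.strip()))  # same bytes with surrounding whitespace removed
```

The untrimmed bytes are rejected and the trimmed bytes parse, which matches the behavior described for the feed above.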

fox
^ me reading your posts ^
Posts: 6318
Joined: 27 Aug 2005, 22:53
Location: Saint-Petersburg, Russia

Re: This XML document is invalid, likely due to invalid (*5)

Postby fox » 19 Apr 2013, 12:36

This is something that is better fixed elsewhere, but the idea makes sense.

https://github.com/gothfox/Tiny-Tiny-RS ... 12b7c84388

fox
^ me reading your posts ^
Posts: 6318
Joined: 27 Aug 2005, 22:53
Location: Saint-Petersburg, Russia

Re: This XML document is invalid, likely due to invalid (*5)

Postby fox » 19 Apr 2013, 14:30

I have also added some hacks to try to work around unescaped entities in feeds, which are a common cause of parse errors.
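The actual change is not shown in this thread, so the following is not the tt-rss code. It is a minimal sketch of one common workaround for this class of error: escaping any "&" that does not begin a predefined XML entity or a numeric character reference. The names `BARE_AMP` and `escape_bare_ampersands` are hypothetical.

```python
import re

# An ampersand that does not start a predefined XML entity (&amp; &lt; &gt;
# &quot; &apos;) or a numeric character reference (&#8217; &#x2019;).
BARE_AMP = re.compile(r"&(?!(?:amp|lt|gt|quot|apos|#[0-9]+|#x[0-9a-fA-F]+);)")


def escape_bare_ampersands(xml_text: str) -> str:
    """Rewrite stray '&' as '&amp;' so an XML parser will accept the text."""
    return BARE_AMP.sub("&amp;", xml_text)
```

This is a blunt instrument. HTML-only entities such as &nbsp; end up as literal text, and ampersands inside CDATA sections get escaped even though they were already legal there.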

wet
Bear Rating Trainee
Posts: 1
Joined: 14 Mar 2013, 15:08

Re: This XML document is invalid, likely due to invalid (*5)

Postby wet » 19 Apr 2013, 21:50

I cannot subscribe to some new feeds, apparently since https://github.com/gothfox/Tiny-Tiny-RS ... 57995ee6c7.

Steps to reproduce:

1. Subscribe to e.g. http://en.blog.wordpress.com/feed/ or http://www.heise.de/ix/news/news-atom.xml. Both feeds are fine according to the W3C feed validator.
2. Receive an error box stating "XML validation failed: LibXML error 23 at line 576 (column 90): EntityRef: expecting ';'" or "XML validation failed: LibXML error 5 at line 131 (column 2): Extra content at the end of the document", respectively.

This does not happen for all new subscriptions, e.g. http://textpattern.com/rss/?section=weblog works just fine.

fox
^ me reading your posts ^
Posts: 6318
Joined: 27 Aug 2005, 22:53
Location: Saint-Petersburg, Russia

Re: This XML document is invalid, likely due to invalid (*5)

Postby fox » 20 Apr 2013, 01:34

See the FAQ?

fox
^ me reading your posts ^
Posts: 6318
Joined: 27 Aug 2005, 22:53
Location: Saint-Petersburg, Russia

Re: This XML document is invalid, likely due to invalid (*5)

Postby fox » 20 Apr 2013, 01:36

nvm the above, this was a valid problem with subscription thing.

durval
Bear Rating Trainee
Posts: 26
Joined: 27 Jul 2013, 13:35

Re: This XML document is invalid, likely due to invalid (*5)

Postby durval » 28 Jul 2013, 02:39

Hi RainDog469,

raindog469 wrote: Some sites also just produce invalid XML (unescaped characters like & > <, control characters and non-ASCII characters, unescaped spaces in URLs, etc). Someone on another thread came up with a way to run every fetched feed through xmllint, but I'd already made a feed cleaner proxy in Perl, tiny enough to paste here (and it should be trivial to implement in PHP)

Thank you very much for writing and then posting your proxy code here, it enabled me to immediately solve many of the issues which tt-rss was having with a dozen feeds imported from my deceased Google Reader account, and then with a little fiddling, solved the rest of them. I even posted a topic here in the forum about it, including the fiddled-with code: http://tt-rss.org/forum/viewtopic.php?f=1&t=2482

Cheers,
--
Durval.

