This XML document is invalid, likely due to invalid (*5)

Posted: 05 Apr 2013, 06:36
by techcafe
Hi All,

Great to be using a great tool after moving away from Google Reader. I have successfully imported the majority of my feeds, but sadly about 10% of them fail with errors like these:

This XML document is invalid, likely due to invalid characters. XML error: Undeclared entity error at line 121, column 53
This XML document is invalid, likely due to invalid characters. XML error: XML_ERR_NAME_REQUIRED at line 7, column 258
This XML document is invalid, likely due to invalid characters. XML error: Reserved XML Name at line 2, column 38
This XML document is invalid, likely due to invalid characters. XML error: SYSTEM or PUBLIC, the URI is missing at line 3, column 56
This XML document is invalid, likely due to invalid characters. XML error: XML_ERR_NAME_REQUIRED at line 698, column 20

I tried doing research but came to a dead end unless I was searching in the wrong place.

Thanks all

Re: This XML document is invalid, likely due to invalid (*5)

Posted: 05 Apr 2013, 11:04
by murf
Your feed url has likely changed, and the old url is returning garbage. Well, not exactly garbage, but probably something like some HTML notice designed to be read by a human (e.g., someone let their domain expire and some squatter is now sitting on it). Or, worse still, someone redesigned the site, and the old rss feed url now redirects back to the home page instead of returning rss.

Basically, you've got to check and probably redo the feeds that have this error. (And that's exactly what I did with my imported feeds. :-( )
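
A quick way to spot the case described above, where the old URL now returns an HTML page instead of a feed, is to look at the start of the response body. The following is only a rough sketch: looks_like_feed() is a made-up helper name, and the 1 KB window and the list of root elements are assumptions, not anything tt-rss does.

```php
<?php
// Hypothetical helper: guess whether a fetched body is a feed or an HTML page.
function looks_like_feed(string $body): bool
{
    // Skip a UTF-8 byte order mark and leading whitespace, then inspect the first 1 KB.
    $head = strtolower(substr(ltrim($body, "\xEF\xBB\xBF \t\r\n"), 0, 1024));

    // An HTML doctype or <html> root means a human-readable page, not a feed.
    if (strpos($head, '<!doctype html') !== false || strpos($head, '<html') !== false) {
        return false;
    }

    // Root elements of RSS 2.0, Atom and RSS 1.0 (RDF).
    return (bool) preg_match('/<(rss|feed|rdf:rdf)[\s>]/', $head);
}

var_dump(looks_like_feed('<?xml version="1.0"?><rss version="2.0"><channel/></rss>'));
var_dump(looks_like_feed('<!DOCTYPE html><html><body>Domain for sale</body></html>'));
```

A body that fails this check is a good candidate for a manual visit to the site to find the current feed link.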

Re: This XML document is invalid, likely due to invalid (*5)

Posted: 05 Apr 2013, 14:51
by xtaz
I found that around 10% of my feeds gave similar errors. When I looked into it I found that they were all feedburner feeds. Feedburner have a habit of returning HTML based on the useragent. I fixed it by adding ?format=xml on the end of the feed URL. If it's a feedburner link then try editing the feed and adding that, and then press the f r hotkey when viewing the feed to see if it loads.
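
For anyone who would rather script that change than edit each feed by hand, here is a minimal sketch. force_feedburner_xml() is a hypothetical name, and matching on the feedburner.com host is an assumption; feeds served from a custom domain would not be detected.

```php
<?php
// Hypothetical helper: add format=xml to FeedBurner feed URLs, leave others untouched.
function force_feedburner_xml(string $url): string
{
    $host = strtolower((string) parse_url($url, PHP_URL_HOST));
    $is_feedburner = $host === 'feedburner.com'
        || substr($host, -strlen('.feedburner.com')) === '.feedburner.com';
    if (!$is_feedburner) {
        return $url;
    }

    // Leave the URL alone if a format parameter is already present.
    parse_str((string) parse_url($url, PHP_URL_QUERY), $params);
    if (isset($params['format'])) {
        return $url;
    }

    $separator = strpos($url, '?') === false ? '?' : '&';
    return $url . $separator . 'format=xml';
}

echo force_feedburner_xml('http://feeds.feedburner.com/example'), "\n";
```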

Re: This XML document is invalid, likely due to invalid (*5)

Posted: 05 Apr 2013, 17:36
by daweb
I had a similar issue and found that yup, some of them were simply no longer valid. Had to check manually each of the bad ones by going to the sites and verifying the feed links. It was worth it for me. :-)

Re: This XML document is invalid, likely due to invalid (*5)

Posted: 05 Apr 2013, 18:17
by raindog469
Some sites also just produce invalid XML (unquoted characters like & > <, control characters and non-ASCII characters, unescaped spaces in URLs, etc). Someone on another thread came up with a way to run every fetched feed through xmllint, but I'd already made a feed cleaner proxy in perl, tiny enough to paste here (and should be trivial to implement in php):

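#!/usr/bin/perl
# Feed-cleaning CGI proxy: fetches the URL given in ?feed= and patches up common XML errors.

use strict;
use warnings;
use CGI qw(:standard);

my $url = param("feed");

# Only fetch http(s) URLs.
die "Bad URL" unless defined $url && $url =~ /^https?:/i;

# List-form pipe open: runs wget directly, without going through a shell.
open(my $wget, "-|", "wget", "-O-", $url) or die $!;
my $feed = join('', <$wget>);
close($wget);

# Replace control and non-ASCII characters with spaces (tab, LF and CR are kept).
$feed =~ s/[^\x09\x0a\x0d\x20-\x7e]/ /g;
# Percent-encode whitespace inside href="..." values, one space per href per pass.
# The closing quote is captured in $2 so that it survives the substitution.
1 while $feed =~ s/(href="[^"]+)\s([^"]*")/$1%20$2/ig;
# Escape every ampersand, then undo the double escaping of existing entities.
$feed =~ s/&/&amp;/g;
$feed =~ s/&amp;amp;/&amp;/g;
$feed =~ s/&amp;lt;/&lt;/g;
$feed =~ s/&amp;gt;/&gt;/g;
$feed =~ s/&amp;quot;/&quot;/g;
# Force a line break between adjacent tags.
$feed =~ s/></>\n</g;

print header("application/rss+xml");
print $feed;
exit 0;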


It's poorly-written code, but it handles poorly-written XML. I wrote it years ago when akregator or whatever desktop feed reader I was using would choke on feeds, and updated it when I switched from Google Reader to TT-RSS. To use it, I just prepend "http://servername/cgi-bin/feedclean.cgi?feed=" to the feed I'm having problems with. Probably 20-30 of my ~500 feeds have this issue, and usually it's not on every fetch (just when there's an article whose content has issues when parsed as XML). I also force a linebreak between adjacent tags with no text between them, which cleaned up a feed without invalid characters that Simplepie was choking on for some reason that I never figured out.
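
Since a PHP version is mentioned as being trivial, here is roughly what the string handling could look like. This is a sketch rather than a tested drop-in: clean_feed() is a made-up name, fetching the feed is left out, and the character filter keeps only tab, LF, CR and printable ASCII, which is a choice made for this example.

```php
<?php
// Sketch of the cleaner's string handling in PHP; clean_feed() is a made-up name.
function clean_feed(string $feed): string
{
    // Replace control and non-ASCII bytes with spaces (tab, LF and CR are kept).
    $feed = preg_replace('/[^\x09\x0a\x0d\x20-\x7e]/', ' ', $feed);

    // Percent-encode whitespace inside href="..." values.
    $feed = preg_replace_callback(
        '/href="([^"]*)"/i',
        function (array $m): string {
            return 'href="' . preg_replace('/\s/', '%20', $m[1]) . '"';
        },
        $feed
    );

    // Escape every ampersand, then undo the double escaping of existing entities.
    $feed = str_replace('&', '&amp;', $feed);
    $feed = str_replace(
        ['&amp;amp;', '&amp;lt;', '&amp;gt;', '&amp;quot;'],
        ['&amp;', '&lt;', '&gt;', '&quot;'],
        $feed
    );

    // Force a line break between adjacent tags.
    return str_replace('><', ">\n<", $feed);
}

echo clean_feed('<a href="http://example.com/a b c">x</a><b>AT&T &amp; co</b>'), "\n";
```

Like the Perl version, this works on bytes, so each multi-byte UTF-8 character turns into several spaces, and numeric entities such as &#8217; end up double-escaped.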

Re: This XML document is invalid, likely due to invalid (*5)

Posted: 05 Apr 2013, 18:21
by raindog469
(Since I have no edit button, I just want to add the disclaimer that you probably don't want to put that script on a public-facing server. I used "open" in a way that shouldn't spawn a shell and open up the associated vulnerabilities with passing unvalidated input to a shell, but there could be other vulnerabilities I didn't consider since I just wrote it for my own use.)
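
On the validation point: a proxy like this should at least refuse anything that is not an absolute http or https URL before fetching it. A minimal PHP sketch of such a guard follows; is_fetchable_feed_url() is a hypothetical name, and the check is stricter than the /^https?:/ test in the script because it also requires a host.

```php
<?php
// Hypothetical guard for a feed-cleaning proxy: accept only absolute http(s) URLs.
function is_fetchable_feed_url(string $url): bool
{
    $parts = parse_url($url);
    if ($parts === false || !isset($parts['scheme'], $parts['host'])) {
        return false;
    }
    return in_array(strtolower($parts['scheme']), ['http', 'https'], true);
}

var_dump(is_fetchable_feed_url('http://example.com/feed.xml'));
var_dump(is_fetchable_feed_url('file:///etc/passwd'));
```

This does not stop the proxy from being pointed at hosts on the internal network, which is one more reason to keep it off a public-facing server.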

Re: This XML document is invalid, likely due to invalid (*5)

Posted: 05 Apr 2013, 19:34
by techcafe
Excellent, thanks for your comments. I'm going to investigate more. It looks like several of the feeds are there but Tiny just won't take them :\

I'll try all of your suggestions and give it another go.

Thanks all!

Re: This XML document is invalid, likely due to invalid (*5)

Posted: 06 Apr 2013, 00:17
by sleeper_service
raindog469 wrote:Some sites also just produce invalid XML (unquoted characters like & > <, control characters and non-ASCII characters, unescaped spaces in URLs, etc). Someone on another thread came up with a way to run every fetched feed through xmllint, but I'd already made a feed cleaner proxy in perl, tiny enough to paste here (and should be trivial to implement in php):

....
It's poorly-written code, but it handles poorly-written XML. I wrote it years ago when akregator or whatever desktop feed reader I was using would choke on feeds, and updated it when I switched from Google Reader to TT-RSS. To use it, I just prepend "http://servername/cgi-bin/feedclean.cgi?feed=" to the feed I'm having problems with. Probably 20-30 of my ~500 feeds have this issue, and usually it's not on every fetch (just when there's an article whose content has issues when parsed as XML). I also force a linebreak between adjacent tags with no text between them, which cleaned up a feed without invalid characters that Simplepie was choking on for some reason that I never figured out.


Thanks so much for that, it worked *great* cleaning up bad characters in the feed from http://scienceblogs.com/startswithabang/

Re: This XML document is invalid, likely due to invalid (*5)

Posted: 19 Apr 2013, 11:22
by dharm0us
I just came across the same problem with this feed: http://techcircle.vccircle.com/feed/.
The problem here is that there is whitespace at the beginning of the XML output.

So, here is the solution :

1. Open the file lib/simplepie/simplepie.inc
2. Find this piece of code (somewhere around line 1342):

      // Loop through each possible encoding, till we return something, or run out of possibilities
      foreach ($encodings as $encoding)
      {
         // Change the encoding to UTF-8 (as we always use UTF-8 internally)
         if ($utf8_data = $this->registry->call('Misc', 'change_encoding', array($this->raw_data, $encoding, 'UTF-8')))


3. Add this line inside the foreach loop:

         $this->raw_data = trim($this->raw_data);


Your code should then look like this:

      // Loop through each possible encoding, till we return something, or run out of possibilities
      foreach ($encodings as $encoding)
      {
         $this->raw_data = trim($this->raw_data);
         // Change the encoding to UTF-8 (as we always use UTF-8 internally)
         if ($utf8_data = $this->registry->call('Misc', 'change_encoding', array($this->raw_data, $encoding, 'UTF-8')))
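
To see why the trim() helps, the effect can be reproduced outside SimplePie. The sketch below uses PHP's DOMDocument (libxml2) rather than SimplePie's own parser, so it only illustrates the general XML rule that nothing may precede the XML declaration. parses_as_xml() is a made-up helper and the sample feed is invented.

```php
<?php
// Returns true when libxml can parse $xml as a well-formed document.
function parses_as_xml(string $xml): bool
{
    // Collect parse errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    $doc = new DOMDocument();
    $ok = $doc->loadXML($xml);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $ok === true;
}

// Invented sample: a feed with blank lines and spaces before the XML declaration.
$raw = "\n\n  <?xml version=\"1.0\" encoding=\"UTF-8\"?>\n"
     . "<rss version=\"2.0\"><channel><title>t</title></channel></rss>";

var_dump(parses_as_xml($raw));        // leading whitespace before the declaration
var_dump(parses_as_xml(trim($raw)));  // same data after trim()
```

With libxml2 the first call should report a failure and the second a success.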

Re: This XML document is invalid, likely due to invalid (*5)

Posted: 19 Apr 2013, 12:36
by fox
This is something that is better fixed elsewhere, but the idea makes sense.

https://github.com/gothfox/Tiny-Tiny-RS ... 12b7c84388

Re: This XML document is invalid, likely due to invalid (*5)

Posted: 19 Apr 2013, 14:30
by fox
I have also added some hacks to try working around unescaped entities in feeds which is a common cause of parse errors.
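
For illustration only, one common way to work around unescaped ampersands is to escape every & that does not already start an entity or character reference. This is not the code from the commit, just a sketch of the general idea, and escape_bare_ampersands() is a hypothetical name.

```php
<?php
// Hypothetical workaround: escape ampersands that do not start an entity reference.
function escape_bare_ampersands(string $xml): string
{
    // The lookahead skips &name; as well as decimal and hexadecimal character references.
    return preg_replace(
        '/&(?!(?:[A-Za-z_][A-Za-z0-9._-]*|#[0-9]+|#x[0-9A-Fa-f]+);)/',
        '&amp;',
        $xml
    );
}

echo escape_bare_ampersands('<title>AT&T &amp; Q&A &#8217;s</title>'), "\n";
```

Two limits of this approach: it also rewrites ampersands inside CDATA sections, where they were already legal, and it leaves HTML-only entities such as &nbsp; alone, which an XML parser will still reject as undeclared.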

Re: This XML document is invalid, likely due to invalid (*5)

Posted: 19 Apr 2013, 21:50
by wet
I cannot subscribe to some new feeds, apparently since https://github.com/gothfox/Tiny-Tiny-RS ... 57995ee6c7.

Steps to reproduce:

1. Subscribe to e.g. http://en.blog.wordpress.com/feed/ or http://www.heise.de/ix/news/news-atom.xml. Both feeds are fine according to the W3C feed validator.
2. Receive an error box stating "XML validation failed: LibXML error 23 at line 576 (column 90): EntityRef: expecting ';'" and "XML validation failed: LibXML error 5 at line 131 (column 2): Extra content at the end of the document" respectively.

This does not happen for all new subscriptions, e.g. http://textpattern.com/rss/?section=weblog works just fine.
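
For anyone trying to reproduce such messages outside tt-rss, PHP's libxml functions expose the error code, line and column directly. The sketch below is an assumption about how messages of this shape can be produced, not the tt-rss code itself. xml_errors() is a made-up helper and the sample feed is invented.

```php
<?php
// Hypothetical helper: parse $xml with libxml and return one message per error.
function xml_errors(string $xml): array
{
    // Collect parse errors internally instead of emitting PHP warnings.
    $previous = libxml_use_internal_errors(true);
    libxml_clear_errors();
    $doc = new DOMDocument();
    $doc->loadXML($xml);

    $messages = [];
    foreach (libxml_get_errors() as $error) {
        $messages[] = sprintf(
            'LibXML error %d at line %d (column %d): %s',
            $error->code,
            $error->line,
            $error->column,
            trim($error->message)
        );
    }

    libxml_clear_errors();
    libxml_use_internal_errors($previous);
    return $messages;
}

// Invented sample with an unescaped ampersand in the title.
$feed = "<rss version=\"2.0\">\n<channel><title>AT&T news</title></channel>\n</rss>";
print_r(xml_errors($feed));
```

Running this against the raw bytes that the server actually sends shows whether the parse error is in the feed itself or is introduced somewhere between fetching and parsing.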

Re: This XML document is invalid, likely due to invalid (*5)

Posted: 20 Apr 2013, 01:34
by fox
See the FAQ?

Re: This XML document is invalid, likely due to invalid (*5)

Posted: 20 Apr 2013, 01:36
by fox
nvm the above, this was a valid problem with subscription thing.

Re: This XML document is invalid, likely due to invalid (*5)

Posted: 28 Jul 2013, 02:39
by durval
Hi RainDog469,

raindog469 wrote:Some sites also just produce invalid XML (unquoted characters like & > <, control characters and non-ASCII characters, unescaped spaces in URLs, etc). Someone on another thread came up with a way to run every fetched feed through xmllint, but I'd already made a feed cleaner proxy in perl, tiny enough to paste here (and should be trivial to implement in php)

Thank you very much for writing and then posting your proxy code here, it enabled me to immediately solve many of the issues which tt-rss was having with a dozen feeds imported from my deceased Google Reader account, and then with a little fiddling, solved the rest of them. I even posted a topic here in the forum about it, including the fiddled-with code: http://tt-rss.org/forum/viewtopic.php?f=1&t=2482

Cheers,
--
Durval.