
Fixing LibXML error "Extra content at the end of document"

Posted: 27 Jul 2013, 18:04
by durval
Hello folks,

Thought this might be of interest: today, while reviewing my tt-rss "Feeds with update errors" window for the first time, I found that quite a few feeds had the following error:

Code:

 LibXML error 5 at line 68 (column 1): Extra content at the end of the document

(line and column numbers of course varied).

I examined the XML and found that the webmaster had added a Google Analytics block of code at the end of the RSS code (i.e., right after the closing "</rss>" tag).
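
For anyone who wants to see the failure in isolation, here is a minimal Python sketch. Note the assumptions: it uses Python's standard-library parser (expat-based, not LibXML, so the wording of the error differs), and both sample strings are made up for illustration rather than taken from the actual feeds.

```python
import xml.etree.ElementTree as ET

# Made-up minimal feed; the real feeds were of course much longer.
CLEAN_FEED = (
    '<?xml version="1.0"?>\n'
    '<rss version="2.0"><channel><title>Example</title></channel></rss>\n'
)

# Hypothetical stand-in for an analytics snippet appended after the root element.
BROKEN_FEED = CLEAN_FEED + '<script type="text/javascript">var _gaq = [];</script>\n'


def is_well_formed(xml_text: str) -> bool:
    """Return True if the text parses as a single well-formed XML document."""
    try:
        ET.fromstring(xml_text)
    except ET.ParseError:
        # Raised, among other cases, when markup follows the root element.
        return False
    return True


if __name__ == "__main__":
    print("clean feed parses: ", is_well_formed(CLEAN_FEED))
    print("broken feed parses:", is_well_formed(BROKEN_FEED))
```

An XML document may have only one root element, so any strict parser is expected to reject the second string.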

So I did a little searching here on the forum and found this excellent post by raindog469. Unfortunately it didn't solve my problem right away, so I fiddled a little with it and, by adding one extra line, ended up with something that worked for my case:

Code:

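#!/usr/bin/perl
use strict;
use warnings;
use CGI qw(:standard);

# URL of the feed to clean up, passed as the "feed" CGI parameter
my $url = param("feed");

die "Bad URL" unless defined $url && $url =~ /^https?:/i;

# List-form pipe open: wget is started directly, without a shell
open(my $wget, "-|", "wget", "-O-", $url) or die $!;
my $feed = join('', <$wget>);
close($wget);

# Replace every byte outside 0x0A-0x7E (tabs, DEL, non-ASCII) with a space
$feed =~ s/[^\x0a-\x7e]/ /g;
# Percent-encode whitespace inside href="..." values; the closing quote
# is captured in $2 so that it is kept in the output
1 while $feed =~ s/(href="[^"]+)\s([^"]*")/$1%20$2/ig;
# Escape every ampersand, then undo the double escaping of the common
# entities (any other entity, e.g. a numeric one, stays double-escaped)
$feed =~ s/&/&amp;/g;
$feed =~ s/&amp;amp;/&amp;/g;
$feed =~ s/&amp;lt;/&lt;/g;
$feed =~ s/&amp;gt;/&gt;/g;
$feed =~ s/&amp;quot;/&quot;/g;
# Put adjacent tags on separate lines
$feed =~ s/></>\n</g;
# Drop anything after the closing </rss> tag, e.g. an appended analytics block
$feed =~ s/<\/rss>.*$/<\/rss>\n/si;

print header("application/rss+xml");
print $feed;
exit 0;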


Posting it in case it helps anyone else.
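
For readers who don't speak Perl, the step that deals with the "Extra content" error (cutting off whatever follows the closing </rss> tag) could be sketched in Python roughly as below. This is only an illustration of that one substitution, not a port of the whole script: the function name and the sample feed are made up, and unlike the Perl line it keeps the closing tag exactly as it appeared in the input.

```python
import re
import xml.etree.ElementTree as ET

# Match the first closing </rss> tag and everything after it.
# DOTALL lets "." cross newlines, like the /s flag in the Perl substitution.
_AFTER_RSS_END = re.compile(r"(</rss>).*", re.IGNORECASE | re.DOTALL)


def strip_trailing_content(feed_text: str) -> str:
    """Drop anything that follows the first closing </rss> tag.

    Text without a closing </rss> tag is returned unchanged.
    """
    return _AFTER_RSS_END.sub(r"\1\n", feed_text, count=1)


if __name__ == "__main__":
    # Made-up sample: a tiny feed followed by a stray script block.
    broken = (
        '<rss version="2.0"><channel><title>t</title></channel></rss>\n'
        "<script>var x = 1;</script>\n"
    )
    fixed = strip_trailing_content(broken)
    ET.fromstring(fixed)  # parses without raising once the tail is removed
    print(fixed, end="")
```

Like the Perl version, this only helps with junk after the end of an RSS 2.0 feed; an Atom feed (closing tag </feed>) or junk elsewhere in the document would need different handling.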

Fox, what about incorporating similar code directly in TT-RSS?

Cheers,
--
Durval.

Re: Fixing LibXML error "Extra content at the end of documen

Posted: 28 Jul 2013, 00:06
by fox
>Fox, what about incorporating similar code directly in TT-RSS?

https://en.wikipedia.org/wiki/Garbage_In%2C_Garbage_Out

Re: Fixing LibXML error "Extra content at the end of documen

Posted: 28 Jul 2013, 02:21
by durval
Hi Fox,

fox wrote:>Fox, what about incorporating similar code directly in TT-RSS?
https://en.wikipedia.org/wiki/Garbage_In%2C_Garbage_Out

Humrmrmr... good point, but please consider:

http://en.wikipedia.org/wiki/Be_conservative_in_what_you_send,_be_liberal_in_what_you_accept

instead of Babbage's (who is cited in the Wiki page you linked to and who, despite being a genius, never built much of anything), wouldn't you rather be on Postel's side (who helped build the Internet)?

Cheers,
--
Durval.

Re: Fixing LibXML error "Extra content at the end of documen

Posted: 30 Jul 2013, 08:23
by Sidicas
durval wrote:Humrmrmr... good point, but please consider:

http://en.wikipedia.org/wiki/Be_conservative_in_what_you_send,_be_liberal_in_what_you_accept

instead of Babbage's (who is cited in the Wiki page you linked to and who, despite being a genius, never built much of anything), wouldn't you rather be on Postel's side (who helped build the Internet)?

Cheers,
--
Durval.

That's what Microsoft did when they made Internet Explorer... Generally considered today to be bad decisions all around since you've now got all sorts of websites out there that render fine in IE but don't render properly in any generic standards-compliant browser.

Contact the website and ask them to fix their feed. I'm pretty sure you're not supposed to have any content outside of the XML boundaries. The best part is that there are a lot of open source feed parsers out there besides tt-rss that use libXML, and they'll throw the exact same error. So if you have the author fix the feed, it fixes it for everybody. If you patch tt-rss, it only fixes it for tt-rss users, and that's just not thinking about the bigger picture.

Re: Fixing LibXML error "Extra content at the end of documen

Posted: 01 Aug 2013, 17:59
by durval
Hi Sidicas,

That's what Microsoft did when they made Internet Explorer... Generally considered today to be bad decisions all around since you've now got all sorts of websites out there that render fine in IE but don't render properly in any generic standards-compliant browser


I agree with you that Internet Explorer was a very "bad decision", but I fail to see how it could possibly be related to the principle I mentioned, namely to "be conservative in what you send and liberal in what you accept". If anything, MS's decisions regarding IE were exactly the contrary: not only did they fail to accept a lot of very common HTML and JavaScript at the time (i.e., they were exactly the opposite of "be liberal in what you accept"), but they also pushed their own incompatible extensions (not only in HTML/JavaScript but also ActiveX and other "lock-in" shenanigans), so what they did was also the opposite of "be strict in what you generate". So I'm sorry, but I think that your mention of IE as a "bad example" serves at best to confirm my thesis instead of denying it (and at worst is a complete "non sequitur")...

About contacting the website and asking them to fix the feed: I did it more than once, and the few responses I got back were along the lines of "but it works with RSS Reader X and Y and Z, so the fault must be at your end"... and I agree with them: if the other readers accept it, then it may not be right "de jure", but it indeed is right "de facto", and TT-RSS should consider following suit.

About libXML and other RSS Reader software: can you cite another RSS Reader software which has the same issues with these XML mistakes as TT-RSS? If not, do you agree that they are probably fixing the XML before feeding it to LibXML? That's exactly what I'm suggesting that TT-RSS should do, too.

Cheers,
--
Durval.

Re: Fixing LibXML error "Extra content at the end of documen

Posted: 01 Aug 2013, 18:47
by fox
>and I agree with them: if the other readers accept it, then it may not be right "de jure", but it indeed is right "de facto", and TT-RSS should consider following suit.

Yes, let's all cater to idiots who produce broken content because there seems to be a lot of them and being idiots they are unlikely to change. Excellent idea right there. Instead of raising the bar, let's lower it even further.

I can understand why google reader wannabe services cater to their deranged demographic - they have a monetization strategy involving the cattle of their users. I am interested in nothing of the sort so both people who produce broken ass XML and people who demand support for it can go fuck themselves (or each other, whatever strikes their fancy). I hope I'm making myself clear enough because my position on this issue is not going to change.

People should learn to own up to their shitty programming and fix it instead of dragging everyone else into their cesspool of mediocrity.

>About libXML and other RSS Reader software: can you cite another RSS Reader software which has the same issues with these XML mistakes as TT-RSS? If not, do you agree that they are probably fixing the XML before feeding it to LibXML? That's exactly what I'm suggesting that TT-RSS should do, too.

If you had spent a few minutes searching this forum instead of posting essays on the subject of what tt-rss should do, you would have discovered several ways of doing just so which fit within the overall framework provided by the application.

Then again, that would require intelligence someone blindly assuming invariably broken XML as a de-facto standard would probably lack.

Re: Fixing LibXML error "Extra content at the end of documen

Posted: 01 Aug 2013, 19:54
by durval
Hi Fox,

fox wrote:>and I agree with them: if the other readers accept it, then it may not be right "de jure", but it indeed is right "de facto", and TT-RSS should consider following suit.

Yes, let's all cater to idiots who produce broken content because there seems to be a lot of them and being idiots they are unlikely to change. Excellent idea right there. Instead of raising the bar, let's lower it even further.


That's certainly one (rather radical, IMHO) way of putting it; the other way (which I prefer) is simply to try to be as interoperable as possible and so to cater to as many users as possible.

fox wrote:I can understand why google reader wannabe services cater to their deranged demographic - they have a monetization strategy involving the cattle of their users. I am interested in nothing of the sort so both people who produce broken ass XML and people who demand support for it can go fuck themselves (or each other, whatever strikes their fancy). I hope I'm making myself clear enough because my position on this issue is not going to change.


:-) That's not only radical but also very graphic :-) Anyway, thanks for making yourself crystal clear on this subject. I shall not insist on it further; if TT-RSS ever bothers me so much in this regard, I will just fork it and have a go at it myself (thanks for making it open source).

I should point out that IMHO it's not just about monetization: it's about making the software as useful as possible for as many people as possible. And people sometimes want to access content that resides on servers that return less-than-ideal XML... telling them to go fsck themselves does not solve the issue.

On a side note, if you really don't care about monetization, perhaps you should consider taking out the "donate" button on the TT-RSS Wiki and also quit the flattr thing, saying that you are not interested in monetization and at the same time having these solicitations up might sound hypocritical (and btw, telling everyone who might think it hypocritical to go fsck themselves or each other, along with the "people who produce broken ass XML and the people who demand support for it", also won't solve it).

fox wrote:[...]
>About libXML and other RSS Reader software: can you cite another RSS Reader software which has the same issues with these XML mistakes as TT-RSS? If not, do you agree that they are probably fixing the XML before feeding it to LibXML? That's exactly what I'm suggesting that TT-RSS should do, too.


fox wrote:If you had spent a few minutes searching this forum instead of posting essays on the subject of what tt-rss should do, you would have discovered several ways of doing just so which fit within the overall framework provided by the application.
Then again, that would require intelligence someone blindly assuming invariably broken XML as a de-facto standard would probably lack.


Do you really have to go at it "ad hominem"? It weakens your whole argument, and moreover it's patently false: please notice that the first thing I posted in this thread was a reference to another thread here on the forum (which I found by, yes, searching) where a partial solution was offered, and that I also posted my go at making it more comprehensive... so I'm clearly not only "posting essays on the subject"...

OTOH, perhaps I was not able to locate other solutions that could have been posted here for dealing with these issues that you refuse to code into TT-RSS; if you could be so kind as to post links to them instead of trying to offend me (no, I'm not offended, at least not yet), it would be much more productive not only for both of us but also for the other poor folks who could search for a way to fix this kind of issue in the future... and telling me to go fsck myself won't help anyone either.

Cheers,
--
Durval.

Re: Fixing LibXML error "Extra content at the end of documen

Posted: 01 Aug 2013, 20:29
by fox
>That's certainly one (rather radical, IMHO) way of putting it; the other way (which I prefer) is simply to try to be as interoperable as possible and so to cater to as many users as possible.

GLHF.

>On a side note, if you really don't care about monetization, perhaps you should consider taking out the "donate" button on the TT-RSS Wiki and also quit the flattr thing, saying that you are not interested in monetization and at the same time having these solicitations up might sound hypocritical

The fuck are you talking about? Wait, don't answer, I don't want to know. Stop posting instead. I'm about as interested in reading your wall of text essays as I am in working around broken XML.

Re: Fixing LibXML error "Extra content at the end of documen

Posted: 01 Aug 2013, 20:32
by gbcox
fox wrote:People should learn to own up to their shitty programming and fix it instead of dragging everyone else into their cesspool of mediocrity.

Amen!

durval wrote:but I think that your mention of IE as a "bad example" serves at best to confirm my thesis instead of denying it (and at worst is a complete "non sequitur")...

That's a bit of a reach, and no it doesn't confirm your thesis. The bottom line is "one bad apple spoils the barrel".

In my view, there are people out there who are delusional and don't want to take the extra fraction of a second to do the right thing. Instead, for whatever perverse reason, they would much rather spin their wheels for hours on end coming up with perverse machinations to reach an end result. Then, they expect the rest of us to stand in line and feed the Frankenstein monster they have created.

There are plenty of feeds out there. If someone refuses to own up and fix theirs then dump it and choose another. I've found that most people aren't aware that there is a problem and are happy to fix their stuff.

Re: Fixing LibXML error "Extra content at the end of documen

Posted: 01 Aug 2013, 21:02
by AngryChris
I'm not looking to fan any flames here, but to provide a suggestion. Fox, would it be possible to somehow implement xmllint functionality in the application via official plug-in (meaning a plug-in that is distributed alongside TT-RSS)? I don't mean re-write things so TT-RSS itself "cleans up" or ignores bad XML or whatever, but put, say, a plug-in in the official app that makes xmllint (if installed on the system) easy to enable with a checkbox?

Plugin: af_xmllint
Description: If you have it installed, runs all posts through xmllint prior to insertion into the database.
Version: 1.0
Author: fox (I hope!)

Is this a reasonable feature request?
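
To make the idea concrete, here is a rough Python sketch of what "running a feed through xmllint" could look like. It is a standalone illustration only: it is not the code of any existing plugin, it does not use the tt-rss plugin API, and the function name and sample input are made up. It relies on xmllint's --recover option (output whatever could be parsed from an invalid document) and on "-" meaning "read from standard input".

```python
import shutil
import subprocess
from typing import Optional


def recover_with_xmllint(feed_bytes: bytes) -> Optional[bytes]:
    """Return xmllint's recovered version of the feed, or None.

    None is returned when xmllint is not installed or produced no output.
    """
    xmllint = shutil.which("xmllint")
    if xmllint is None:
        return None
    proc = subprocess.run(
        [xmllint, "--recover", "-"],
        input=feed_bytes,
        stdout=subprocess.PIPE,
        stderr=subprocess.DEVNULL,  # parser complaints are expected here
        check=False,  # a non-zero exit status is normal for broken input
    )
    return proc.stdout or None


if __name__ == "__main__":
    # Made-up sample with a stray element after the root.
    sample = b'<rss version="2.0"><channel/></rss><script>x</script>'
    cleaned = recover_with_xmllint(sample)
    if cleaned is None:
        print("xmllint not available or no output")
    else:
        print(cleaned.decode("utf-8", errors="replace"))
```

How much of a badly broken feed survives recovery depends on the input, so the result would still need to be checked before being trusted.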

Re: Fixing LibXML error "Extra content at the end of documen

Posted: 01 Aug 2013, 21:10
by feader
AngryChris wrote:I'm not looking to fan any flames here, but to provide a suggestion. Fox, would it be possible to somehow implement xmllint functionality in the application via official plug-in

It's not official, but a plugin already exists. I don't think that fetching a zip file and extracting it into the right directory is too much to ask. Someone could even make a Knowledge Base entry for this kind of stuff, so that every person with reasonable search skills can find all available solutions.

Re: Fixing LibXML error "Extra content at the end of documen

Posted: 01 Aug 2013, 21:15
by gbcox
Fox can and will do whatever he wants... The plugin exists, and people can seek it out if they want it. Personally, I don't see the point other than you're just asking him to support a crutch. Seriously, what is so hard about asking people to fix their stuff? Is it really that hard? Is the content in these broken feeds really so compelling and irreplaceable as to insist that the world hack around their sloppy code? I really don't get it.

Re: Fixing LibXML error "Extra content at the end of documen

Posted: 01 Aug 2013, 22:16
by fox
AngryChris wrote:I'm not looking to fan any flames here, but to provide a suggestion. Fox, would it be possible to somehow implement xmllint functionality in the application via official plug-in (meaning a plug-in that is distributed alongside TT-RSS)? I don't mean re-write things so TT-RSS itself "cleans up" or ignores bad XML or whatever, but put, say, a plug-in in the official app that makes xmllint (if installed on the system) easy to enable with a checkbox?


Plugin already exists, why bundle it? It should be in the wiki index even.

Re: Fixing LibXML error "Extra content at the end of documen

Posted: 02 Aug 2013, 17:54
by durval
Hi gbcox,

gbcox wrote:
durval wrote:but I think that your mention of IE as a "bad example" serves at best to confirm my thesis instead of denying it (and at worst is a complete "non sequitur")...

That's a bit of a reach, and no it doesn't confirm your thesis. The bottom line is "one bad apple spoils the barrel".

In my view, there are people out there who are delusional and don't want to take the extra fraction of a second to do the right thing. Instead, for whatever perverse reason, they would much rather spin their wheels for hours on end coming up with perverse machinations to reach an end result. Then, they expect the rest of us to stand in line and feed the Frankenstein monster they have created.

There are plenty of feeds out there. If someone refuses to own up and fix theirs then dump it and choose another. I've found that most people aren't aware that there is a problem and are happy to fix their stuff.


I think we should just agree to disagree on that...

Cheers,
--
Durval.

Re: Fixing LibXML error "Extra content at the end of documen

Posted: 02 Aug 2013, 17:55
by durval
Hi Fox,
fox wrote:>On a side note, if you really don't care about monetization, perhaps you should consider taking out the "donate" button on the TT-RSS Wiki and also quit the flattr thing, saying that you are not interested in monetization and at the same time having these solicitations up might sound hypocritical

The fuck are you talking about? Wait, don't answer, I don't want to know. Stop posting instead. I'm about as interested in reading your wall of text essays as I am in working around broken XML.


Your wish has been granted...

Cheers,
--
Durval.