Problem with rss.sciencedirect.com (and patch)

jmozmoz
Bear Rating Trainee
Posts: 26
Joined: 14 Apr 2013, 18:07


Postby jmozmoz » 15 Apr 2013, 18:32

Hi,

feeds from rss.sciencedirect.com are built in a way that makes tt-rss miss some articles. The cause is that the guid column of the table ttrss_entries is too short, so the guids of different articles are cut to the same value and the articles are filtered out as duplicates:

This feed:
http://rss.sciencedirect.com/publication/science/1000

has (at the moment) several entries that result in the same guid:

Code: Select all

1,http://rss.sciencedirect.com/action/redirectFile?&zone=main&currentActivity=feed&usageType=outward&url=http%3A%2F%2Fwww.sciencedirect.com%2Fscience%3F_ob%3DGatewayURL%26_origin%3DIRSSSEARCH%26_method%3DcitationSearch%26_piikey%3DS1293255813000


The following patch fixes the problem by storing the SHA1 hash of the guid instead of its first 245 characters:

Code: Select all

index 859c575..6af7972 100644
--- a/include/rssfuncs.php
+++ b/include/rssfuncs.php
@@ -561,7 +561,7 @@
                                        $entry_author = db_escape_string($link, $entry_author);
                                }

-                               $entry_guid = db_escape_string($link, mb_substr($entry_guid, 0, 245));
+                               $entry_guid = db_escape_string($link, sha1($entry_guid));

                                $entry_comments = db_escape_string($link, mb_substr($entry_comments, 0, 245));
                                $entry_author = db_escape_string($link, mb_substr($entry_author, 0, 245));
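
To illustrate the difference between the two lines in the patch, here is a minimal sketch. The two guids are made up for the example (they are not taken from the feed); they only share their first 245 characters, which is the situation described above.

```php
<?php
// Hypothetical guids: identical for the first 245 characters, different after that.
$prefix = str_repeat('x', 245);
$guid_a = $prefix . 'article-0001';
$guid_b = $prefix . 'article-0002';

// Old behaviour: keep only the first 245 characters.
$truncated_a = mb_substr($guid_a, 0, 245, 'UTF-8');
$truncated_b = mb_substr($guid_b, 0, 245, 'UTF-8');

// Patched behaviour: hash the complete guid.
$hashed_a = sha1($guid_a);
$hashed_b = sha1($guid_b);

var_dump($truncated_a === $truncated_b); // bool(true): both articles look identical
var_dump($hashed_a === $hashed_b);       // bool(false): the articles stay distinct
var_dump(strlen($hashed_a));             // int(40): fixed length, whatever the input
```

Because the hash always has 40 hexadecimal characters, it fits the existing column without any schema change.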

I searched the source for other places where the guid is created, but I didn't find any in the core of tt-rss (only in the googlereader import plugin).

The other seemingly obvious solution, enlarging the column in the database, didn't work for MySQL, because (re)creating the database failed with the following error message:
MySQL: Specified key was too long; max key length is 767 bytes
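
A rough sketch of the arithmetic behind that error, under two assumptions that are not stated in this thread: the guid column is part of an index, and it uses MySQL's utf8 charset, which reserves up to 3 bytes per character.

```php
<?php
// Limit taken from the error message above.
$max_key_bytes = 767;

// Assumption: utf8 column, at most 3 bytes per character.
$bytes_per_char = 3;

// Longest column (in characters) that still fits into one index key.
$max_indexed_chars = intdiv($max_key_bytes, $bytes_per_char);

echo $max_indexed_chars, "\n"; // 255
```

Under these assumptions an indexed column cannot grow beyond 255 characters, which would explain why simply widening the column fails.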

(Feeds from ScienceDirect are one of my most important uses of tt-rss (previously Google Reader), so this bug/fix matters a lot to me. If this quick solution is not applicable, I am willing to help find another one.)

fox
^ me reading your posts ^
Posts: 6318
Joined: 27 Aug 2005, 22:53
Location: Saint-Petersburg, Russia

Re: Problem with rss.sciencedirect.com (and patch)

Postby fox » 15 Apr 2013, 18:46

Do you realize how many dupes changing guid causes? It happened exactly once and caused massive outrage already.

You can make a plugin to mangle guid in the specific feed, btw.


Re: Problem with rss.sciencedirect.com (and patch)

Postby jmozmoz » 15 Apr 2013, 18:55

fox wrote: Do you realize how many dupes changing guid causes? It happened exactly once and caused massive outrage already.

Actually no.

fox wrote: You can make a plugin to mangle guid in the specific feed, btw.

Do you mean a plugin that hooks into HOOK_ARTICLE_FILTER?

In include/rssfuncs.php there are the following lines:

Code: Select all

$article = array("owner_uid" => $owner_uid, // read only
                 "guid" => $entry_guid, // read only

So the guid is marked read only there, and it also looks like this hook runs too late. Could you please tell me which hook to use?

Thank you!


Re: Problem with rss.sciencedirect.com (and patch)

Postby fox » 15 Apr 2013, 18:57

>Actually no.

Everything that is still in the feed XML will be reimported.

>Do you mean a plugin that hooks into HOOK_ARTICLE_FILTER?

Ah, right, you can't do it with that hook; article filters are not allowed to modify guids. I think you should be able to do it with the feed fetch hook, but this will require manual parsing of the feed data.


Re: Problem with rss.sciencedirect.com (and patch)

Postby jmozmoz » 15 Apr 2013, 19:13

fox wrote: >Actually no.
Everything that is still in the feed XML will be reimported.

Just an idea: there are SQL functions for SHA1 in MySQL and PostgreSQL. Wouldn't it be possible to use them during a tt-rss update to recalculate the guids?

Otherwise: Would it be possible to create a new hook in rssfuncs.php before

Code: Select all

 $result = db_query($link, "SELECT plugin_data,title,content,link,tag_cache,author FROM ttrss_entries, ttrss_user_entries WHERE ref_id = id AND guid = '".db_escape_string($link, $entry_guid)."' AND owner_uid = $owner_uid");

to modify the guid?


Re: Problem with rss.sciencedirect.com (and patch)

Postby fox » 15 Apr 2013, 19:13

I think it's possible to implement prefixed guids to fix this properly once and for all, much like password hash versioning is handled.

Consider plain text guids deprecated, and check for both them and SHA1:(hash).

Recalculating everything is also doable, but some people hoard a shitton of data, which makes doing this on the fly impossible, and they are oblivious to release notes.
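
The prefixed-guid idea could look roughly like the sketch below. It is not taken from the tt-rss source: the function names are hypothetical, and a plain PHP array stands in for the guid column of ttrss_entries.

```php
<?php
// Hypothetical helper: both forms under which an article may already be stored.
function guid_candidates(string $raw_guid): array
{
    return [
        'hashed' => 'SHA1:' . sha1($raw_guid),               // new, prefixed form
        'legacy' => mb_substr($raw_guid, 0, 245, 'UTF-8'),   // deprecated plain text form
    ];
}

// Hypothetical lookup: returns the stored guid that matches, or null if the article is new.
function find_stored_guid(array $stored_guids, string $raw_guid): ?string
{
    foreach (guid_candidates($raw_guid) as $candidate) {
        if (in_array($candidate, $stored_guids, true)) {
            return $candidate;
        }
    }
    return null;
}

// Made-up example data: one article imported before the change, one arriving after it.
$old_guid = 'http://example.org/article/1';
$new_guid = 'http://example.org/article/2';
$stored_guids = [$old_guid];

// The old article is still recognised by its plain text guid, so it is not reimported.
$found_old = find_stored_guid($stored_guids, $old_guid);

// The new article is unknown, so it gets stored under the prefixed hash.
$found_new = find_stored_guid($stored_guids, $new_guid);
if ($found_new === null) {
    $stored_guids[] = guid_candidates($new_guid)['hashed'];
}
$found_again = find_stored_guid($stored_guids, $new_guid);
```

Checking both forms avoids a mass reimport without rewriting existing rows. Articles whose truncated guid already collides with a stored one would presumably still match the legacy form, so such old entries may need to be cleaned up separately.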


Re: Problem with rss.sciencedirect.com (and patch)

Postby fox » 15 Apr 2013, 19:30

https://github.com/gothfox/Tiny-Tiny-RSS/commit/5e3d5480f7e154a897363770327001fe1b72f504


Re: Problem with rss.sciencedirect.com (and patch)

Postby jmozmoz » 15 Apr 2013, 20:38

fox wrote: https://github.com/gothfox/Tiny-Tiny-RSS/commit/5e3d5480f7e154a897363770327001fe1b72f504

Wow, thank you! It works for me :D

Just for reference: if somebody wants to remove all old (incomplete) articles from ScienceDirect, the following SQL command can be used:

Code: Select all

DELETE FROM `ttrss_entries` WHERE guid NOT LIKE 'SHA%' AND link LIKE 'http://rss.science%'

Then run

Code: Select all

update.php --feeds

