Bug in feed purging (git head) of some sort

Development-related discussion, including bundled plugins
troyengel
Bear Rating Trainee
Posts: 18
Joined: 23 Mar 2013, 19:39

Bug in feed purging (git head) of some sort

Postby troyengel » 21 Feb 2016, 18:34

I was preparing to migrate my database and noticed my tt-rss SQL dump was huge (compared to what it should be for a single user who marks feeds read daily), so I started digging and found years-old entries in ttrss_entries that *should have been* purged a long time ago. This might be caused by an older bug (running git head and all), but I'm not sure.

Here's what I figured out so far for my feed with id = 40 -- I followed one article from 2014-02-02 sitting in the DB:

Code: Select all

mysql> select id,owner_uid,update_interval,purge_interval from ttrss_feeds where id='40'\G;
*************************** 1. row ***************************
             id: 40
      owner_uid: 2
update_interval: 0
 purge_interval: 0
1 row in set (0.00 sec)

mysql> select owner_uid,pref_name,value from ttrss_user_prefs where owner_uid='2' and pref_name='PURGE_OLD_DAYS'\G;
*************************** 1. row ***************************
owner_uid: 2
pref_name: PURGE_OLD_DAYS
    value: 7
1 row in set (0.00 sec)

mysql> select id,title,updated,date_entered,date_updated from ttrss_entries limit 1\G;
*************************** 1. row ***************************
          id: 28016
       title: Resident / Episode 143 / February 01 2014
     updated: 2014-02-02 06:00:12
date_entered: 2014-02-02 16:53:00
date_updated: 2016-02-21 15:15:32
1 row in set (0.00 sec)


...so I immediately noticed that date_updated was the outlier, and indeed the purging code uses that column as its key. So that means the bug has to be in the update_rss_feed() function in includes/rssfuncs.php, but that code is really hard to follow if you don't know it well, so I used the --debug-feed option of update.php to give it a run:

Code: Select all

$ /usr/bin/php ./update.php --debug-feed 40
...
[15:15:32/30174] guid 2,http://podcast.hernancattaneo.com/2014/02/02/resident-episode-143-february-01-2014/ / SHA1:f41ef61836ea96508585872ee896a328cbd9c6c3
[15:15:32/30174] orig date: 1391320812
[15:15:32/30174] date 1391320812 [2014/02/02 06:00:12]
[15:15:32/30174] title Resident / Episode 143 / February 01 2014
[15:15:32/30174] link http://podcast.hernancattaneo.com/e/resident-episode-143-february-01-2014/
[15:15:32/30174] author Hernan Cattaneo
[15:15:32/30174] num_comments: 0
[15:15:32/30174] looking for tags...
[15:15:32/30174] tags found: podcast
[15:15:32/30174] done collecting data.
[15:15:32/30174] article hash: b93e9b7a27b4fc6a305a32a8f176159b00a333de [stored=b93e9b7a27b4fc6a305a32a8f176159b00a333de]
[15:15:32/30174] stored article seems up to date [IID: 28016], updating timestamp only


So the error seems to be in here somewhere; this is the specific block:

Code: Select all

                                _debug("article hash: $entry_current_hash [stored=$entry_stored_hash]", $debug_enabled);

                                if ($entry_current_hash == $entry_stored_hash && !isset($_REQUEST["force_rehash"])) {
                                        _debug("stored article seems up to date [IID: $base_entry_id], updating timestamp only", $debug_enabled);

                                        // we keep encountering the entry in feeds, so we need to
                                        // update date_updated column so that we don't get horrible
                                        // dupes when the entry gets purged and reinserted again e.g.
                                        // in the case of SLOW SLOW OMG SLOW updating feeds

                                        $base_entry_id = db_fetch_result($result, 0, "id");

                                        db_query("UPDATE ttrss_entries SET date_updated = NOW()
                                                WHERE id = '$base_entry_id'");

                                        continue;
                                }


...I mean, I think -- whatever is wrong needs someone more familiar with the code to sort out. I've left my database intact; anyone have a clue? Is this an old bug that was since fixed, meaning I just need to somehow flush all these stale entries from my DB? (There are about 900 of them in ttrss_entries, going back to 2013.) Thx!
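A count along these lines should find them -- an untested sketch only, using the column names from the queries above and a 7-day cutoff matching my PURGE_OLD_DAYS:

Code: Select all

-- entries past the purge window by feed date, but kept alive
-- because the purger keys on date_updated instead
SELECT COUNT(*) AS stale
  FROM ttrss_entries
 WHERE updated < NOW() - INTERVAL 7 DAY
   AND date_updated > NOW() - INTERVAL 7 DAY;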

fox
^ me reading your posts ^
Posts: 6318
Joined: 27 Aug 2005, 22:53
Location: Saint-Petersburg, Russia
Contact:

Re: Bug in feed purging (git head) of some sort

Postby fox » 21 Feb 2016, 18:42

wiki -> faq

e: also try reading the comments in that code block you linked

troyengel
Bear Rating Trainee
Posts: 18
Joined: 23 Mar 2013, 19:39

Re: Bug in feed purging (git head) of some sort

Postby troyengel » 21 Feb 2016, 19:06

OK, I think I follow -- what you're saying is that it relies on the server's cache controls (If-Modified-Since, etc.), and this server responds like this:

Code: Select all

$ curl -I -H 'If-Modified-Since: Sat, 20 Feb 2016 00:00:01 GMT' http://podcast.hernancattaneo.com/e/resident-episode-143-february-01-2014/

HTTP/1.1 200 OK
Cache-Control: no-store, no-cache, must-revalidate, post-check=0, pre-check=0
Cache-control: no-cache="set-cookie"
Content-Type: text/html; charset=UTF-8
Date: Sun, 21 Feb 2016 15:55:54 GMT
Expires: Thu, 19 Nov 1981 08:52:00 GMT
Pragma: no-cache
Server: Apache
Set-Cookie: PHPSESSID=2fo1cvu8q0dp05mk0uth2aprf7; path=/
Set-Cookie: AWSELB=830BDD2714C96857D7F5E2533BF21492F4232C6264B263448AA7CFA896F78DE2BABAECEF2E49D2EC34A4BC58BB3F14A5D6CD60C3A80C63CE9C8096D39A29A9FB0B0A81EE3F;PATH=/;MAX-AGE=30
Status: 200 OK
Vary: Accept-Encoding
X-FromPodPressCache: na
X-Pingback: http://www.podbean.com/xmlrpc.php
X-Powered-By: ASP.NET 2.0
Connection: keep-alive


...so since this stupid site sets no caching headers and doesn't present modification dates, we have no way to know whether an entry is "new" or not -- that makes sense. I examined the RSS feed and sure enough, it goes 150 entries deep (all the way back to episode 102 from 2013-04-21).

A question, then, turning this around: is there a way to purge the *content* of these entries? (This feed is just one example; some others in the same situation carry a lot more content.) What exactly is that article hash hashing -- the whole thing? In my mind the logic would go something like this:

a) Is the feed marked read?
a.1) No: business as usual.
a.2) Yes: go to b.

b) Is this article beyond the purge date?
b.1) No: business as usual.
b.2) Yes: update the article's date_updated field to preserve the "don't download again" behavior, but purge the actual content to keep the DB from bloating.

I realize this is an upstream problem with feeds not setting dates, but it seems there should be something that cleans up the content itself while maintaining a tracking cookie (date_updated) to prevent re-downloads. If I were to simply update the content column in ttrss_entries to "" (null it out), would that break the hashing stuff I see going on and cause all the previous content to be downloaded again?
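To be concrete, the kind of one-off cleanup I mean would be something like this (untested sketch; table and column names as in the queries above, 7-day cutoff hypothetical):

Code: Select all

-- blank out the body of old articles but leave the row in place
-- so the guid/date bookkeeping still prevents a re-download
UPDATE ttrss_entries
   SET content = ''
 WHERE date_entered < NOW() - INTERVAL 7 DAY;

...which is exactly why I'm asking whether the stored content_hash would then stop matching a freshly computed hash and trigger a re-download.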

fox
^ me reading your posts ^
Posts: 6318
Joined: 27 Aug 2005, 22:53
Location: Saint-Petersburg, Russia
Contact:

Re: Bug in feed purging (git head) of some sort

Postby fox » 21 Feb 2016, 19:14

my advice would be not bothering with this because it's not worth the effort

also, that scheme you invented is absolutely terrible, and there's no fucking way in hell i'm going to replace currently working 100% bulletproof scheme with this abomination all to save a few bytes in the database or w/e is the stated goal here

in short, the purging works as intended

e: also, just in case, if-modified-since has nothing to do with this (and tt-rss does not support this header currently)

troyengel
Bear Rating Trainee
Posts: 18
Joined: 23 Mar 2013, 19:39

Re: Bug in feed purging (git head) of some sort

Postby troyengel » 21 Feb 2016, 19:53

Here's all I was thinking -- I haven't run it, and I just reused your existing code (I'm not a programmer by trade, so it's rough). It doesn't redesign the world; it just truncates the content, re-hashes, and stores the new hash:

Code: Select all

            if ($entry_current_hash == $entry_stored_hash && !isset($_REQUEST["force_rehash"])) {
               _debug("stored article seems up to date [IID: $base_entry_id], updating timestamp only", $debug_enabled);

               // we keep encountering the entry in feeds, so we need to
               // update date_updated column so that we don't get horrible
               // dupes when the entry gets purged and reinserted again e.g.
               // in the case of SLOW SLOW OMG SLOW updating feeds

               $base_entry_id = db_fetch_result($result, 0, "id");

               // truncate the content
               $entry_content = '(content purged)';

               // update the date cookie and insert the truncated content
               db_query("UPDATE ttrss_entries SET date_updated = NOW(),
                  content = '" . db_escape_string($entry_content) . "'
                  WHERE id = '$base_entry_id'");

               // re-hash with truncated content
               $article = array("owner_uid" => $owner_uid, // read only
                  "guid" => $entry_guid, // read only
                  "guid_hashed" => $entry_guid_hashed, // read only
                  "title" => $entry_title,
                  "content" => $entry_content,
                  "link" => $entry_link,
                  "labels" => $article_labels, // current limitation: can add labels to article, can't remove them
                  "tags" => $entry_tags,
                  "author" => $entry_author,
                  "force_catchup" => false, // ugly hack for the time being
                  "score_modifier" => 0, // no previous value, plugin should recalculate score modifier based on content if needed
                  "language" => $entry_language,
                  "feed" => array("id" => $feed,
                     "fetch_url" => $fetch_url,
                     "site_url" => $site_url)
                  );
               $entry_current_hash = calculate_article_hash($article, $pluginhost);

               // insert our new hash based on truncated content
               db_query("UPDATE ttrss_entries SET content_hash = '$entry_current_hash'
                  WHERE id = '$base_entry_id'");

               continue;
            }


(I was going to give you a diff, but this is cleaner for the forum.) BTW, this is a gig in my DB, not a few bytes. :) I'm guessing it's because I use af_readability, which pulls down these really huge articles in full, but I haven't spent the time to look at the details.
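For what it's worth, a per-feed size breakdown along these lines should show where the bloat lives (untested sketch, assuming the stock schema where ttrss_user_entries.ref_id references ttrss_entries.id):

Code: Select all

-- top 10 feeds by stored article content, in megabytes
SELECT f.id, f.title,
       ROUND(SUM(LENGTH(e.content)) / 1048576, 1) AS mb
  FROM ttrss_entries e
  JOIN ttrss_user_entries ue ON ue.ref_id = e.id
  JOIN ttrss_feeds f ON f.id = ue.feed_id
 GROUP BY f.id, f.title
 ORDER BY mb DESC
 LIMIT 10;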

JustAMacUser
Bear Rating Overlord
Posts: 373
Joined: 20 Aug 2013, 23:13

Re: Bug in feed purging (git head) of some sort

Postby JustAMacUser » 21 Feb 2016, 20:40

fox's playground so fox's rules...

I agree with fox, too: the current code works (and it works well). The content column is not indexed, so there's no concern about excess RAM usage; that stuff stays on disk. And disk space is cheap. Even if the total were several GBs, it wouldn't really matter much.

You'd be better off asking the site admins why they feel they need to serve content that's years old in their feed.

fox
^ me reading your posts ^
Posts: 6318
Joined: 27 Aug 2005, 22:53
Location: Saint-Petersburg, Russia
Contact:

Re: Bug in feed purging (git head) of some sort

Postby fox » 21 Feb 2016, 20:41

op, protip: instead of trying to push horrific hacks into mainline code, you can utilize the plugin system which exists for this exact reason

that said i'd also like to ask you to please stop hurting my brain with your ideas and your code, thanks

>You'd be better off asking the site admins why they feel they need to serve content that's years old in their feed.

^ this

