Update database content

dlohan · Postby **dlohan** » 11 May 2016, 17:37

I have several thousand articles I have archived in TT-RSS. Some of them arrive via Google RSS. Others do not. Some of the links resemble:

"https://www.google.com/url?rct=j&sa=t&url=http://thecork.ie/2016/03/16/cork-airport-get-two-new-routes-to-southampton-and-leeds-bradford/&ct=ga&cd=CAIyGzI5ZDZjMWRhMzczNzBlOTU6aWU6ZW46SUU6Ug&usg=AFQjCNGaSw-_EEoppiW7fQFFjFKSbcISEQ"

All I want is the middle part:

https://www.google.com/url?rct=j&sa=t&url=http://thecork.ie/2016/03/16/cork-airport-get-two-new-routes-to-southampton-and-leeds-bradford/&ct=ga&cd=CAIyGzI5ZDZjMWRhMzczNzBlOTU6aWU6ZW46SUU6Ug&usg=AFQjCNGaSw-_EEoppiW7fQFFjFKSbcISEQ

I have looked at plugins such as af_feedmod and ff_feedcleaner. The first appears to be irrelevant. The second will adjust data before it is parsed and I'm not sure whether it can make two adjustments. Aside from this it will not adjust the data already in my database. I also looked at Google New Cleanup Plugin at https://tt-rss.org/forum/viewtopic.php?t=1606. Unless I am wrong this is not relevant either.

I had thought of trying a mySQL trigger. I wrote one to apply after data had been added to the table "entries".

BEGIN
UPDATE `ttrss_entries` SET `DaveLink` = `link` ,
`DaveFlag` =1 WHERE `DaveLink` = "" AND `DaveFlag` =0;

UPDATE `ttrss_entries` SET `link` = Right( `link` , Char_Length( `link` ) - ( Instr( `link` , "http://" ) -1 ) ) ,
`DaveFlag` =2 WHERE Instr( `link` , "https://www.google.com/url?rct=j&sa=t&url=" ) >0 AND Instr( `link` , "&ct=ga&" ) >0 AND `DaveFlag` =1;

UPDATE `ttrss_entries`
SET `link` = Left (`link`, Instr( `link` , "&ct=ga&")-1), `DaveFlag` = 0
WHERE Instr( `link` , "&ct=ga&" ) >0 and DaveFlag = 2;

END

The field DaveLink is a backup of the original link field and the field DaveFlag is a flag field I incorporated to avoid inconsistencies.

While it changed the data it seemed to kill the application and my feeds stopped downloading. I suspect many of you will know precisely why this occurred. Fortunately I'd made a backup and when I restored everything worked fine again, but my links remain as they were.

My questions then, if someone could assist:

1. Can I modify URLS already in the TinyTinyRSS database?
2. If not, can I use ff_feedcleaner to achieve clean urls for new data (examples from https://github.com/wltb/ff_feedcleaner/blob/master/README.md only show one part of the URL being cleaned. I need to remove the first and last parts.)
3. If ff_feedcleaner can be used can someone with expertise advise the code I should use?

All help greatly appreciated.

Latimer · Postby **Latimer** » 11 May 2016, 18:38

Seems like af_unburn does something similar for the new articles.

dlohan · Postby **dlohan** » 11 May 2016, 20:20

Latimer wrote:Seems like af_unburn does something similar for the new articles.

Thank you very much for that Latimer. I shall try it.

I was thinking that if this works I might also look at using the import_export plugin to export all of the articles, modify their links externally and re-import them. That might be one option.

Many thanks again!

JustAMacUser · Postby **JustAMacUser** » 11 May 2016, 20:57

Handling this exclusively in MySQL probably isn't the way to go.

Either hook article filter on import (feed updating) like the af_unburn plugin or hook the maintenance task (which by default runs hourly) like the cache_starred_images plugin.

If you go with the cache_starred_images approach, you can see how it keeps track of changes (there's an existing system for doing this). If you go the af_unburn approach, well it doesn't really matter since you're handling each entry individually as they come in.

Either way, PHP has good support for this type of thing:

Code: Select all

$new_link = false;
$qs = parse_url( $original_link_here, PHP_URL_QUERY );
if ( $qs ) {
   $parts = array();
   parse_str( $qs, $parts );
   
   if ( array_key_exists( 'url', $parts )
      $new_link = $parts['url'];
}

But this isn't a PHP coding forum so that's as much coding-specific help as I'll provide.

e: I would recommend avoiding the exporting/importing route. It's a lot of extra steps that simply aren't necessary. You already have the information in the database, just go ahead and manipulate it. If you need to track original values, create a new table (ideally with foreign keys to keep the data consistent) and store the info there.

dlohan · Postby **dlohan** » 11 May 2016, 21:33

Latimer wrote:Seems like af_unburn does something similar for the new articles.

As it happens af_unburn was already enabled. It clearly states however that it requires cURL. This goes beyond me.

I did do some further research however on using ff_feedcleaner. It allows for the use of replacing string elements. I got a few hints from https://forum.keyboardmaestro.com/t/stripping-out-junk-tracking-code-from-google-alerts-urls/945 what the coding should look like. I was able to test this at https://regex101.com. Basically for a string of the format:

https://www.google.com/url?rct=j&sa=t&url=http://www.bbc.co.uk/programmes/p03j9wx2&ct=ga&cd=CAIyHDRlZjY2ZGU3MDFiNzk1MjA6Y28udWs6ZW46R0I&usg=AFQjCNFCzVgtYLmAuePh4NA5i6_7B0vePg

The regex expression should be: (?:https\:\/\/www\.google\.com)(?:.*?)url=(.*?)\&

This delivers: http://www.bbc.co.uk/programmes/p03j9wx2

It will need some testing but I may yet work it out. Probably won't be able to salvage records already in the database on this basis. A step forward all-the-same.

dlohan · Postby **dlohan** » 11 May 2016, 21:36

Thanks JustAMacUser.

dlohan · Postby **dlohan** » 11 May 2016, 22:15

I have enabled ff_feedcleaner. Link https://github.com/wltb/ff_feedcleaner

I think the code I should be using is something along the lines of:

[
{
"URL": "www.google.com",
"type": "link_regex",
"pattern": "(?:https\:\/\/www\.google\.com)(?:.*?)url=(.*?)\&",
"replacement": "(?:.*?)url=(.*?)\&"
}
]

I know this isn't quite right yet. If there are any regex experts out there perhaps one of you might spot my mistake.

dlohan · Postby **dlohan** » 12 May 2016, 15:09

As opposed to what I posted earlier I think what I need to do is capture the first and last parts and set them to null. To this end I have worked on a regex pattern which appears (using Regex101.com and Debuggex.com) to be correct. I have tried it in ff_feedcleaner but now I am getting an "invalid JSON" message:

[
{
"URL_re": "#https://(www\\.google\\.com#",
"type": "regex",
"pattern": "(.*?)(?:http:\/\/?[\da-z\.-]+\.[a-z\.]{2,6}[\/\w \.-]*)(.*?&.*$)",
"replacement": ""
}
]

dlohan · Postby **dlohan** » 13 May 2016, 15:49

I give up. Completely stumped.

Tiny Tiny RSS

Update database content

Update database content

Re: Update database content

Re: Update database content

Re: Update database content

Re: Update database content

Re: Update database content

Re: Update database content

Re: Update database content

Re: Update database content

Who is online