Plugin ff_FeedCleaner

roshambo · Postby **roshambo** » 25 Jun 2013, 01:52

I think there's something else wrong: I always get invalid JSON when pressing save, even when the feedcleaner window is blank.

I messaged ESPN in the middle of last week and have had no reply yet; I have a feeling they ignored it. The feed works in GR, feedly, tor and inoreader; ttrss is the only reader I've tried that it doesn't work in.

Latimer · Postby **Latimer** » 25 Jun 2013, 01:54

Thanks for the plugin, I just set it up on the newly updated tt-rss. Messed a bit with regex, looks like it's working.

roshambo wrote:So far I have:
Code:
{ "#^http://feeds.feedburner\\.com/1500espn/sportswire/all\\#" : { "type" : "regex", "pattern" : "/\x80\x99/", "replacement" : "" }, ... }

Which results in an invalid JSON. Any help would be appreciated.

I would guess you need to use double backslashes in your patterns:

Code:

"pattern" : "/\\x80\\x99/"

roshambo wrote:always get invalid JSON when pressing save, even when the feedcleaner window is blank

Try entering

Code:

{}
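For anyone who wants to check both suggestions outside tt-rss, here is a minimal PHP sketch. It assumes the plugin ends up handing the config text to something like json_decode(); how the plugin actually validates the text is not shown in this thread. The variable names are made up for the example.

```php
<?php
// Nowdoc strings keep backslashes exactly as typed, like text pasted into the config box.
$singleBackslash = <<<'JSON'
{ "pattern" : "/\x80\x99/" }
JSON;

$doubleBackslash = <<<'JSON'
{ "pattern" : "/\\x80\\x99/" }
JSON;

// \x is not a JSON escape sequence, so decoding fails and json_decode() returns null.
var_dump(json_decode($singleBackslash, true));

// \\ is the JSON escape for a single backslash, so this decodes fine.
// The decoded pattern contains one backslash before each x.
$config = json_decode($doubleBackslash, true);
var_dump($config['pattern']);

// An empty string is not a JSON document, while {} is an empty object.
var_dump(json_decode('', true));   // null
var_dump(json_decode('{}', true)); // empty array
```

The regex engine only ever sees the decoded string, so the backslash has to survive the JSON layer first.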

roshambo · Postby **roshambo** » 25 Jun 2013, 03:19

Double backslashes fixed the invalid JSON, thanks. I read the whole section on regex syntax on php.net and used this regex tester: http://www.solmetra.com/scripts/regex/index.php. Everything tested out okay there with every delimiter and meta-character I tried, but no luck in ttrss.

Code:

{
  "^http://feeds\\.feedburner\\.com/1500espn/sportswire/all" : {
        "type" : "regex",
        "pattern" : "[&acirc;]",
        "replacement" : ""
   }
}

Should remove just the 2 instances of â if I understand correctly but ttrss still errors at 'Entity 'acirc' not defined'

feader · Postby **feader** » 25 Jun 2013, 20:47

roshambo wrote:Should remove just the 2 instances of â if I understand correctly but ttrss still errors at 'Entity 'acirc' not defined'

Because it doesn't.

roshambo wrote:I read the whole section on regex syntax on php.net

OK. You should have noted then that the regexes need delimiters, so try it with

Code:

  "#^http://feeds\\.feedburner\\.com/1500espn/sportswire/all#" : {
        "type" : "regex",
        "pattern" : "[&acirc;]",
        "replacement" : ""
   }

Works for me at least.
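A minimal PHP sketch of why the delimiters matter, assuming the plugin passes both the URL key and the pattern to PHP's preg functions (not confirmed in this thread). The variable names are made up for the example.

```php
<?php
// Feed URL from this thread, used as the subject string.
$feedUrl = 'http://feeds.feedburner.com/1500espn/sportswire/all';

// Without delimiters, PCRE takes the leading ^ as the delimiter, finds no
// closing ^, and preg_match() returns false (plus a warning, silenced here).
$withoutDelimiters = @preg_match('^http://feeds\.feedburner\.com/1500espn/sportswire/all', $feedUrl);

// With # as the delimiter the same expression matches.
$withDelimiters = preg_match('#^http://feeds\.feedburner\.com/1500espn/sportswire/all#', $feedUrl);

// In "[&acirc;]" the square brackets act as bracket-style delimiters, so the
// expression is the literal text &acirc; and not a character class.
$cleaned = preg_replace('[&acirc;]', '', 'it&acirc;s');

var_dump($withoutDelimiters, $withDelimiters, $cleaned);
```

That would also explain why the pattern itself needed no change between the two configs: it already had delimiters, only the URL key did not.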

roshambo · Postby **roshambo** » 26 Jun 2013, 20:41

Thanks, that worked. I realize now that the feed has to update before a change can be tested, which makes sense; I wasn't waiting before.

feader · Postby **feader** » 17 Jul 2013, 21:47

Version 0.8 was released just now. It includes changes to the configuration format, and while old-style configurations should still work, all users are encouraged to switch to the new style. Details can be found in the README on the GitHub page.

dlohan · Postby **dlohan** » 20 May 2016, 13:24

I want to extract the URL embedded in a feed from Google RSS. Some of these URLs are http, others https. An example of what Google provides me with:

https://www.google.com/url?rct=j&sa=t&url=http://thecork.ie/2016/03/16/cork-airport-get-two-new-routes-to-southampton-and-leeds-bradford/&ct=ga&cd=CAIyGzI5ZDZjMWRhMzczNzBlOTU6aWU6ZW46SUU6Ug&usg=AFQjCNGaSw-_EEoppiW7fQFFjFKSbcISEQ

The part I want to extract (in bold in the original post) is the embedded URL between "url=" and "&ct". I have pretty much worked out that the code I need is:

[
{
"URL": "www.google.com",
"type": "regex",
"pattern": "/^http\S+url=|&ct\S+/",
"replacement": ""
}
]

The problem I'm getting now is one of an "invalid JSON". I have read the earlier posts, though I don't think there is anything I need to "escape". Any feedback would be of great help.

dlohan · Postby **dlohan** » 22 May 2016, 22:07

I've sorted the JSON issue. I believe the pattern ought to have been

[
{
"URL": "www.google.com",
"type": "regex",
"pattern": "/^http\\S+url=|&ct\\S+/",
"replacement": ""
}
]

The configuration will now save BUT it is still not working for me.

dlohan · Postby **dlohan** » 31 May 2016, 23:05

I've managed to get this working partially. What's not clear, even from the instructions, is whether ff_feedcleaner can make two changes to one URL simultaneously. I want to crop the first and last parts of a redirected URL to extract a URL contained within it. The sub-URL is bracketed on the left by "url=" and on the right by "&ct". Using the code snippet below, I can remove everything to the left of the sub-URL, but not what is to the right of it.

[
{
"URL_re" : "#www\\.google\\.ie#",
"type" : "regex",
"pattern": "/http\\S+url=|&ct\\S+/",
"replacement": ""
}
]

I have checked and rechecked this code online with Debuggex. I've tested it too with TT-RSS. I'm not even sure ff_feedcleaner can achieve what I'm trying to do.

feader · Postby **feader** » 01 Jun 2016, 00:00

dlohan wrote:I've managed to get this working partially. What's not clear, even from the instructions, is whether ff_feedcleaner can make two changes to one URL simultaneously.

Programmer here. The plugin can do two or more changes if you can pack them into one regex. If you can't, you can split your changes into several regexes and put these in the config; they will be applied in order (roughly speaking).
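To illustrate the "several regexes applied in order" idea, here is a rough PHP sketch. The rule array and the apply_rules() helper are hypothetical and are not the plugin's real config format or code (see the README for that); the link is a shortened stand-in for the Google URL above.

```php
<?php
// Hypothetical rule list: each entry is one regex replacement.
$rules = [
    ['pattern' => '/http\S+url=/', 'replacement' => ''], // strip everything up to and including url=
    ['pattern' => '/&ct=\S+/',     'replacement' => ''], // then strip &ct=... to the end
];

// Hypothetical helper: run every rule over the text, one after another.
function apply_rules(string $text, array $rules): string
{
    foreach ($rules as $rule) {
        $result = preg_replace($rule['pattern'], $rule['replacement'], $text);
        if ($result !== null) { // null means the regex itself failed
            $text = $result;
        }
    }
    return $text;
}

// Shortened stand-in link with made-up parameter values.
$link = 'https://www.google.com/url?rct=j&sa=t&url=http://example.org/a/&ct=ga&cd=X';
echo apply_rules($link, $rules), "\n";
```

Each rule sees the output of the previous one, so the second pattern only has to deal with what is left after the first.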

dlohan wrote:I have checked and rechecked this code online with Debuggex. I've tested it too with TT-RSS. I'm not even sure ff_feedcleaner can achieve what I'm trying to do.

There is a preview pane titled Show Diff, which might not be the most suitable name, I guess. You can see there what the plugin does at the XML level, but you need Unix diff for now.

Final word from me on this matter: I think you (and forum subscribers) would be better off if you coded your own plugin for your purpose, since the structure of URL query parameters is not that well suited to regexes and PHP's standard library has functions that deal with just that: parse_url and parse_str, if memory serves.

fox · Postby **fox** » 01 Jun 2016, 00:12

op, try (a)|(b) or something?

dlohan · Postby **dlohan** » 01 Jun 2016, 00:45

Thanks Feader & Fox.

I'll try what is suggested. If I do manage to get a fix I'll post it here.

Appreciate what you said too Feader that custom programming might be the best option.

dlohan · Postby **dlohan** » 01 Jun 2016, 20:04

Speculating right now, but I think I know where the problem is. I have been using:

[
{
"URL_re" : "#www\\.google\\.ie#",
"type" : "regex",
"pattern": "/http\\S+url=|&ct\\S+/",
"replacement": ""
}
]

The http\\S+url= part works fine, stripping everything to the left of url= (including url= itself), but the part for the right side (i.e. &ct\\S+) does not. I suspect it has something to do with the first character of this part being an ampersand, which seems to clash with the regex. It's a step closer; I think I'm nearly there if I can isolate this last issue.
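One possible explanation, offered as an assumption and not something confirmed in this thread: if the plugin applies the pattern to the feed's XML, the ampersands in the link are written as &amp;, so "&ct" never appears literally. The PHP sketch below works on bare strings only, with a shortened stand-in link and a hypothetical, more tolerant pattern.

```php
<?php
// Shortened stand-in for the Google redirect link (made-up parameter values).
$raw = 'https://www.google.com/url?rct=j&sa=t&url=http://example.org/a/&ct=ga&cd=X';
// The same link as it would be written inside XML, where & becomes &amp;.
$escaped = htmlspecialchars($raw);

// Pattern from the post above, after JSON decoding.
$pattern = '/http\S+url=|&ct\S+/';
// Hypothetical variant that accepts both & and &amp; before ct=.
$tolerant = '/http\S+url=|&(?:amp;)?ct=\S+/';

echo preg_replace($pattern, '', $raw), "\n";      // both sides stripped
echo preg_replace($pattern, '', $escaped), "\n";  // only the left side stripped
echo preg_replace($tolerant, '', $escaped), "\n"; // both sides stripped again
```

Note that inside real XML the \S+ would also swallow a closing tag that directly follows the link, so the Show Diff preview is worth checking before relying on a pattern like this.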

JustAMacUser · Postby **JustAMacUser** » 02 Jun 2016, 03:58

I'm guessing it's because you're using an OR operator (the vertical bar, |). As far as I know, the preg function is not internally recursive, so it matches the first half and that's it; if the first half were missing, it would strip the second half. Anyway, this is really the wrong approach, as has been mentioned; just use the parse_url/parse_str functions and be done with it.

Code:

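function get_url_query_string_value( $link ) {
  // Take only the query string part of the link.
  $qs = parse_url( $link, PHP_URL_QUERY );

  if ( $qs ) {
    // Split the query string into an associative array of parameters.
    $parts = array();
    parse_str( $qs, $parts );

    // Return the embedded URL only if a "url" parameter exists and is a valid URL.
    if ( array_key_exists( 'url', $parts ) && filter_var( $parts['url'], FILTER_VALIDATE_URL ) ) {
      return $parts['url'];
    }
  }

  // No usable "url" parameter: leave the link unchanged.
  return $link;
}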

Code above is untested, but should be complete enough to do what you want if you wrap it in the plugin class.
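A quick standalone check of that function, as a sketch only: the function body is copied from the post above so the snippet runs on its own, and both test links are made-up stand-ins. It says nothing about how the function would be wired into a tt-rss plugin class.

```php
<?php
// Copy of the function from the post above, included so the snippet is self-contained.
function get_url_query_string_value($link)
{
    $qs = parse_url($link, PHP_URL_QUERY);
    if ($qs) {
        $parts = array();
        parse_str($qs, $parts);
        if (array_key_exists('url', $parts) && filter_var($parts['url'], FILTER_VALIDATE_URL)) {
            return $parts['url'];
        }
    }
    return $link;
}

// Shortened stand-in for a Google redirect link (made-up parameter values).
$redirect = 'https://www.google.com/url?rct=j&sa=t&url=http://example.org/a/&ct=ga&cd=X';
// A link without a url parameter, which should come back unchanged.
$plain = 'http://example.org/page?id=1';

echo get_url_query_string_value($redirect), "\n";
echo get_url_query_string_value($plain), "\n";
```

Because parse_str() splits on & itself, there is no need to guess where the embedded URL ends, which is the part the regex approach struggles with.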

If you insist on using regex, then after matching the contents of url simply create a subsequent filter entry for the feed cleaner that does a regex .* to strip the rest (I don't know what all is available in the feed cleaner plugin so I'm generalizing here).
