Plugin ff_FeedCleaner

**feader** » 27 May 2013, 19:48

I created a plugin whose main purpose is to correct faulty feed data, hence the name FeedCleaner. It can also be used to modify feed URLs for the af_FeedMod plugin. It is available on GitHub. It requires Tiny Tiny RSS version 1.8 or later.

Some documentation is provided on the GitHub page. As a first example, the erroneous feed described here can be corrected with this

[
   {
      "URL": "http://www.iswintercoming.com/feed.php",
      "type" : "regex",
      "pattern" : "/sid=[0-9a-f]{32}/",
      "replacement" : ""
   }
]

as input. The regular expressions are those of the PCRE module.
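To make the effect of that rule concrete, here is a minimal sketch in Python rather than the plugin's own code. It applies the same pattern without the surrounding `/` delimiters, because Python's `re` module does not use them. The sample link and the helper name `strip_session_id` are assumptions for illustration; the thread does not show what the feed's links actually look like.

```python
import re

# Same pattern as in the config above, minus the PCRE "/" delimiters.
SID_PATTERN = re.compile(r"sid=[0-9a-f]{32}")


def strip_session_id(text: str) -> str:
    """Remove every 'sid=<32 lowercase hex digits>' occurrence from text."""
    return SID_PATTERN.sub("", text)


# Hypothetical link: the real feed's URL layout is not given in the thread.
sample = "http://www.iswintercoming.com/item.php?id=7&sid=" + "0123456789abcdef" * 2
cleaned = strip_session_id(sample)

# A value that is not exactly 32 hex digits long is left untouched.
untouched = strip_session_id("sid=abc")
```

The empty replacement simply deletes the match, which is why the rule's "replacement" is "".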

**Latimer** » 28 May 2013, 03:47

Thanks, I'll definitely check it out once 1.7.10, or, rather, 1.8, is out.

**wib** » 13 Jun 2013, 16:01

This is exactly what I needed. Looking forward to testing.

**robinmarlow** » 15 Jun 2013, 13:40

Thank you, this looks great. Sadly I can't get it to work!

For http://adc.bmj.com/rss/ahead.xml

I want to replace the links to the full articles e.g.
http://adc.bmj.com/cgi/content/short/ar ... 59v1?rss=1
to
http://adc.bmj.com/cgi/content/long/arc ... 59v1?rss=1

I'm trying:

{
    "#^http://adc\\.bmj\\.com/rss/ahead\\.xml#" : {
        "type" : "regex",
        "pattern" : "#cgi/content/short#",
        "replacement" : "cgi/content/long"
    }
}

but it's not working. What am I doing wrong? Is there a better way?

Thanks,

Robin

**feader** » 15 Jun 2013, 15:10

robinmarlow wrote: For http://adc.bmj.com/rss/ahead.xml

I want to replace the links to the full articles e.g.
http://adc.bmj.com/cgi/content/short/archdischild-2013-303959v1?rss=1
to
http://adc.bmj.com/cgi/content/long/archdischild-2013-303959v1?rss=1

Hi Robin,

For me, your regex does exactly what you are trying to achieve: I see only links to content/long in Tiny Tiny RSS. When I click on such a link, for example http://adc.bmj.com/cgi/content/long/archdischild-2013-303959v1?rss=1, I get redirected to http://adc.bmj.com/content/early/2013/06/13/archdischild-2013-303959.long?rss=1.

Is that your problem?

**robinmarlow** » 15 Jun 2013, 18:19

That is exactly what I want to happen.... but it isn't!
I wondered if it only applied rules to newly fetched articles.

But after creating a new feed and applying

"#^http://feeds\\.bbci\\.co\\.uk/news/rss\\.xml?edition=uk#" : {
    "type" : "regex",
    "pattern" : "#news#",
    "replacement" : "test"
}

I would have expected to get lots of "test", but again I didn't. Any ideas on how I can troubleshoot it?

I can't see any errors in the tt-rss error log, is there anywhere else I can get a clue?

**feader** » 15 Jun 2013, 18:53

robinmarlow wrote: but creating a new feed & applying:
"#^http://feeds\\.bbci\\.co\\.uk/news/rss\\.xml?edition=uk#" : {
    "type" : "regex",
    "pattern" : "#news#",
    "replacement" : "test"
}
I would have thought I should have got lots of tests... but again I didn't. Any ideas on how I can troubleshoot it?

OK, with this feed, I don't see test in the URLs either. At the moment, the plugin doesn't report anything to the debug log because I don't know how to do it right (if anyone knows a plugin that does this and could post a link to it, I'd be grateful).

The only thing we can do at the moment is to test the code by hand. I will look into it.

**robinmarlow** » 15 Jun 2013, 18:59

Thanks! I was just investigating your (very neat) code to see how it works (I think I get the rough idea).
Adding a way to write to the debug log would be great and, given your code, really easy if we knew how!
I had just started poking around to see what I can find, but nothing yet.
I can't see why my news example doesn't work either.

Robin

**robinmarlow** » 15 Jun 2013, 19:20

Tiny-Tiny-RSS / plugins / af_pennyarcade / init.php

appears to have some logging set up in it, but the same doesn't work when I put it into FeedCleaner.
However, I think this is actually a problem somewhere between computer and chair....

R

**feader** » 15 Jun 2013, 19:32

robinmarlow wrote:[…]
but creating a new feed & applying: "#^http://feeds\\.bbci\\.co\\.uk/news/rss\\.xml?edition=uk#" : {
[…]

Sometimes … The problem is that '?' is a regex metacharacter. Could you try it with

"#^http://feeds\\.bbci\\.co\\.uk/news/rss\\.xml\\?edition=uk#"

as key?
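The effect of the unescaped '?' can be checked by hand with a short sketch. It uses Python's `re` module as a stand-in for PCRE (the two agree on this point) and drops the `#` delimiters, which Python does not use. The backslashes appear singly here because these are Python raw strings; in the JSON config they have to be doubled.

```python
import re

url = "http://feeds.bbci.co.uk/news/rss.xml?edition=uk"

# Unescaped: "l?" means "an optional letter l", so the literal "?" in the
# URL is never matched and the whole key fails to match.
unescaped = r"^http://feeds\.bbci\.co\.uk/news/rss\.xml?edition=uk"

# Escaped: "\?" matches a literal question mark.
escaped = r"^http://feeds\.bbci\.co\.uk/news/rss\.xml\?edition=uk"

unescaped_matches = re.search(unescaped, url) is not None
escaped_matches = re.search(escaped, url) is not None

# The unescaped key would instead match a URL without the question mark.
odd_match = re.search(unescaped, "http://feeds.bbci.co.uk/news/rss.xmledition=uk") is not None
```

Here `unescaped_matches` is False while `escaped_matches` and `odd_match` are True, which is consistent with the rule never being applied to the feed.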

**robinmarlow** » 15 Jun 2013, 20:58

Sorry, that still didn't work.

But your iswintercoming feed & example do work - so at least my computer can deal with regex - it is just choking on the sites I want!

**feader** » 15 Jun 2013, 21:31

robinmarlow wrote:But your iswintercoming feed & example do work - so at least my computer can deal with regex - it is just choking on the sites I want!

Strange. With

"#^http://feeds\\.bbci\\.co\\.uk/news/rss\\.xml\\?edition=uk#" : {
    "type" : "regex",
    "pattern" : "#news#",
    "replacement" : "test"
}

I get tests and a nice 404 handler if I click on the URLs (that's our Beeb ;-) ). Sorry, I'm out of ideas at the moment.

**robinmarlow** » 18 Jun 2013, 14:41

Fixed it. I needed to escape the slashes in my regex pattern, with the backslashes themselves doubled for JSON:

    "#^http://adc\\.bmj\\.com/rss/ahead\\.xml#" : {
        "type" : "regex",
        "pattern" : "#cgi\\/content\\/short#",
        "replacement" : "cgi\/content\/long"
    }

Robin
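For anyone puzzling over the two layers of escaping in that rule, here is a small sketch of what the strings look like after JSON decoding. Python's `json` and `re` modules stand in for the plugin's parser and for PCRE, and the helper `strip_delimiters` is a hypothetical convenience, not part of the plugin. As far as PCRE syntax goes, `/` is not special when `#` is the delimiter, so the thread does not fully explain why the unescaped version failed.

```python
import json
import re

# The pattern and replacement from the post, wrapped in a plain JSON object.
config_text = r'''
{
    "pattern": "#cgi\\/content\\/short#",
    "replacement": "cgi\/content\/long"
}
'''
rule = json.loads(config_text)

# JSON turns "\\/" into backslash + slash and "\/" into a plain slash:
decoded_pattern = rule["pattern"]          # #cgi\/content\/short#
decoded_replacement = rule["replacement"]  # cgi/content/long


def strip_delimiters(pcre_pattern: str) -> str:
    """Drop the first and last character (assumes '#...#' and no flags)."""
    return pcre_pattern[1:-1]


link = "http://adc.bmj.com/cgi/content/short/archdischild-2013-303959v1?rss=1"
rewritten = re.sub(strip_delimiters(decoded_pattern), decoded_replacement, link)
```

With these inputs `rewritten` is the content/long link quoted earlier in the thread.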

**roshambo** » 25 Jun 2013, 00:39

Thanks for this. I'm trying to fix this feed: http://validator.w3.org/feed/check.cgi? ... wire%2Fall but I'm dumbfounded when it comes to regex. Also, tt-rss is complaining about '&acirc' instead; I'm not sure which is correct. So far I have:

{
  "#^http://feeds.feedburner\\.com/1500espn/sportswire/all\\#" : {
        "type" : "regex",
        "pattern" : "/\x80\x99/",
        "replacement" : ""
   },
  "#^http://feeds.feedburner\\.com/1500espn/sportswire/all\\#" : {
        "type" : "regex",
        "pattern" : "/\x80\x98/",
        "replacement" : ""
   },
  "#^http://feeds.feedburner\\.com/1500espn/sportswire/all\\#" : {
        "type" : "regex",
        "pattern" : "/\x85\x94/",
        "replacement" : ""
   },
  "#^http://feeds.feedburner\\.com/1500espn/sportswire/all\\#" : {
        "type" : "regex",
        "pattern" : "/&acirc/",
        "replacement" : ""
   }
}

which results in invalid JSON. Any help would be appreciated.

**feader** » 25 Jun 2013, 01:03

Different objects may not have the same key. This is a mistake on my side; in the next version the configuration will consist of unnamed objects with a url key. In the meantime, try dropping letters

{
  "#^http://feeds.feedburner\\.com/1500espn/sportswire/all\\#" : {
       […]
   },
  "#^http://feeds.feedburner\\.com/1500espn/sportswire/al\\#" : {
        […]
   },
  [etc]
}

or do it in one regex with alternation

{
  "#^http://feeds.feedburner\\.com/1500espn/sportswire/all\\#" : {
        "type" : "regex",
        "pattern" : "/\x80\x99|\x80\x98[|…]/",
        "replacement" : ""
   }
}

I'm not sure what you want to achieve with the \\# at the end, though. I'm also not sure what ESPN wants with &acirc, but I'd remove the whole â including the semicolon. Last but not least, I'm not sure whether the pattern /\x80\x99/ works as intended (consult the documentation), and maybe you should contact the content provider first before using this plugin.
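Both concerns, the repeated key and the \x80 escapes, can be tested by hand outside tt-rss. The sketch below uses Python's `json` module as a stand-in for whatever parser the plugin relies on; parsers differ in how they treat duplicate keys, so this shows one possible behaviour only. The key "#^feed#" and the helper `load_rules` are placeholders invented for the example.

```python
import json


def load_rules(text):
    """Parse a config object, raising ValueError on a repeated key."""
    def reject_duplicates(pairs):
        seen = set()
        for key, _ in pairs:
            if key in seen:
                raise ValueError(f"duplicate key: {key!r}")
            seen.add(key)
        return dict(pairs)
    return json.loads(text, object_pairs_hook=reject_duplicates)


# 1. Duplicate keys: Python's parser silently keeps only the last rule.
duplicated = '{"#^feed#": {"pattern": "/a/"}, "#^feed#": {"pattern": "/b/"}}'
collapsed = json.loads(duplicated)
try:
    load_rules(duplicated)
    duplicate_detected = False
except ValueError:
    duplicate_detected = True

# 2. "\x" is not a JSON escape sequence, so a single backslash is rejected.
raw_escape = r'{"#^feed#": {"pattern": "/\x80\x99/"}}'
try:
    json.loads(raw_escape)
    raw_escape_accepted = True
except json.JSONDecodeError:
    raw_escape_accepted = False

# 3. Doubling the backslash hands the regex engine a literal \x80\x99.
doubled = r'{"#^feed#": {"pattern": "/\\x80\\x99/"}}'
pattern = load_rules(doubled)["#^feed#"]["pattern"]
```

In this sketch the single-backslash form is what fails to parse, so doubling the backslashes may be needed in addition to making the keys unique. Whether the resulting byte pattern then matches the feed content is a separate question.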
