Filter near duplicates

Request new functionality here
PeterDoerrie
Bear Rating Trainee
Posts: 2
Joined: 31 May 2012, 17:34

Postby PeterDoerrie » 31 May 2012, 17:40

My problem: I get a lot of feeds from news sources that publish lots of agency reports (Reuters, AFP, etc.). Each source uses a different headline and often varies the first paragraph slightly, which makes a one-to-one duplicate filter useless. I end up with tons of feed items with essentially the same content.

Solution: Implement a filter that checks whether more than 80% of the words used in two items are the same. If so, import only one of the items and discard the other.
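
To make the 80% word-overlap idea concrete, here is a minimal sketch in Python. It is only an illustration, not how the reader does (or would do) it: the function names, the tokenization, and the choice to measure overlap against the smaller item's word set are all assumptions made for this example.

```python
import re


def word_set(text):
    """Lowercase the text and return its set of distinct words."""
    return set(re.findall(r"\w+", text.lower()))


def word_overlap(a, b):
    """Fraction of the smaller item's distinct words that also occur in the other item."""
    words_a, words_b = word_set(a), word_set(b)
    if not words_a or not words_b:
        return 0.0
    shared = words_a & words_b
    return len(shared) / min(len(words_a), len(words_b))


def is_near_duplicate(a, b, threshold=0.8):
    """Hypothetical filter rule: treat two items as duplicates above the threshold."""
    return word_overlap(a, b) > threshold
```

Comparing every new item against every stored item this way gets expensive quickly, so a real filter would probably limit the comparison to recent items.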

No idea how feasible this is, but it would help me enormously in my use case.

Thanks for the wonderful reader,

Peter

fox
^ me reading your posts ^
Posts: 6318
Joined: 27 Aug 2005, 22:53
Location: Saint-Petersburg, Russia

Re: Filter near duplicates

Postby fox » 01 Jun 2012, 12:01

This needs something like Lucene to extract keywords and operate on them somehow. Unfortunately, that's not available for PHP (I think).

fluffy
Bear Rating Trainee
Posts: 37
Joined: 20 Jun 2012, 09:24

Re: Filter near duplicates

Postby fluffy » 20 Jun 2012, 09:47

How about doing some sort of basic ngram analysis, similar to Amazon's SIPs stuff?
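
One possible reading of "basic ngram analysis" is to compare items by their character trigrams. The Python sketch below is a rough illustration under that assumption; the function names and the Jaccard-style score are made up for this example and are not taken from any particular library or from Amazon's approach.

```python
def trigrams(text):
    """Return the set of 3-character substrings of the whitespace-normalized text."""
    normalized = " ".join(text.lower().split())
    return {normalized[i:i + 3] for i in range(len(normalized) - 2)}


def trigram_similarity(a, b):
    """Shared trigrams divided by all distinct trigrams of both texts (0.0 to 1.0)."""
    grams_a, grams_b = trigrams(a), trigrams(b)
    if not grams_a or not grams_b:
        return 0.0
    return len(grams_a & grams_b) / len(grams_a | grams_b)
```

Unlike a plain word comparison, this score also tolerates small spelling or inflection differences, at the cost of producing larger sets per item.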

fox
^ me reading your posts ^
Posts: 6318
Joined: 27 Aug 2005, 22:53
Location: Saint-Petersburg, Russia

Re: Filter near duplicates

Postby fox » 09 Jul 2012, 19:52

So I did some prototyping on this: http://tt-rss.org/redmine/projects/tt-r ... teChecking

In my limited testing, this works like magic. Mostly.

fluffy
Bear Rating Trainee
Posts: 37
Joined: 20 Jun 2012, 09:24

Re: Filter near duplicates

Postby fluffy » 09 Jul 2012, 21:52

Cool, I didn't know Postgres had an n-gram module in it. That makes a lot of stuff way easier. Now more than ever I wish Dreamhost would support that. Maybe I should just break down and move my TTR instance to my email VPS instead.

ginahoy
Bear Rating Disaster
Posts: 66
Joined: 02 Jan 2010, 07:10

Re: Filter near duplicates

Postby ginahoy » 09 Jul 2012, 22:20

I hope this eventually gets implemented in the Online version :D