
Re: Bayesian classifier for TTRSS

Posted: 20 Nov 2015, 02:56
by masgo
I am also reading articles in languages other than English, mostly German. When I looked at the ttrss_plugin_af_sort_bayes_wordfreqs table, the top 30 entries were nearly identical to the 30 most common words in the German language.

I suggest adding them to the getIgnoreList() array. The question now is: should I make this change only in my installation, or should I open a pull request? Most of the top 30 German words do not exist in English. Some exceptions exist, like "die", "des" (which could be the DES algorithm) or "hat".

Another thing that I observed is that the "in_array" method is used to determine if a token is in the ignore list. Wouldn't using a set be faster?
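
To illustrate the lookup-cost question, here is a minimal sketch in Python rather than the plugin's PHP. The word list and function names are made up for illustration and are not the actual contents of getIgnoreList(). The list version scans every entry for each token, while the set version does a hash lookup; in PHP the rough analogue would be an array keyed by word and checked with isset().

```python
# Hypothetical stopword list for illustration; NOT the plugin's real ignore list.
IGNORE_LIST = ["der", "die", "und", "in", "den", "von", "zu", "das", "mit", "sich"]

# Built once up front; membership tests on a set are hash lookups.
IGNORE_SET = frozenset(IGNORE_LIST)


def filter_tokens_list(tokens):
    """Drop ignored tokens using a linear scan of the list for every token."""
    return [t for t in tokens if t not in IGNORE_LIST]


def filter_tokens_set(tokens):
    """Drop ignored tokens using a constant-time set lookup for every token."""
    return [t for t in tokens if t not in IGNORE_SET]


tokens = "der hund und die katze".split()
kept = filter_tokens_set(tokens)
```

Both functions return the same tokens; only the cost per lookup differs, which matters more as the ignore list grows.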

Re: Bayesian classifier for TTRSS

Posted: 20 Nov 2015, 09:10
by fox
better idea would be making this list configurable instead of hardcoding stuff

Re: Bayesian classifier for TTRSS

Posted: 25 Nov 2015, 18:52
by masgo
You are right. How should I do this?
I would suggest a file (words-to-ignore.txt or similar) containing all the words that should be ignored. I would suggest a simple format such as one word per line, and/or using \n, \r\n, blank and tab as dividers between words. Maybe even comma and ;
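
A rough sketch of how such a format could be parsed, written in Python for illustration only (the plugin itself is PHP, and parse_ignore_words is a hypothetical name). It treats any run of whitespace, commas or semicolons as a divider, so it does not matter whether the text comes from a file or from some other input.

```python
import re


def parse_ignore_words(text):
    """Split user-supplied text into a set of lowercase words to ignore.

    Any run of whitespace (\\n, \\r\\n, blank, tab), commas or semicolons
    counts as a single divider. Empty pieces are discarded.
    """
    pieces = re.split(r"[\s,;]+", text.strip())
    return {p.lower() for p in pieces if p}


sample = "der, die;das\r\nund\tin  zu\n"
ignore_words = parse_ignore_words(sample)
```

Returning a set also removes duplicates, so merging several language lists into one input is harmless.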

Maybe we could provide some example files for common languages which the user could merge into one (cat german-words.txt >> words-to-ignore.txt)

Having checkmarks for each supported (i.e., each language where we have a file) language in the preferences so that the user could simply activate them would be better - but I have no idea how to do this.

Re: Bayesian classifier for TTRSS

Posted: 25 Nov 2015, 19:17
by fox
yeah we're not adding "text files" for user-configurable data, this is a terrible idea on so many levels

>Having checkmarks for each supported (i.e., each language where we have a file) language in the preferences so that the user could simply activate them would be better - but I have no idea how to do this.

i'll tell you how: you add a text area to plugin settings and then your user types this stuff in it

there's maybe three people worldwide who are both using this plugin and are possibly interested in configuring stopwords in it so i really wouldn't start overthinking this whole thing too much

if you or someone else manages to implement this in a clean manner i'll take a look at the diff but beyond that, well, maybe later

Re: Bayesian classifier for TTRSS

Posted: 25 Nov 2015, 20:23
by rknobbe-other
There may be fewer than 3 people using it now. I stopped when the PostgreSQL updates got so slow that I would not get any feed updates and my poor PPC Mac mini was running at load 11.

If I can find the time and a version of redis that compiles on this dinosaur I might look at using this engine instead: https://github.com/tistaharahap/Simple- ... er-for-PHP

Re: Bayesian classifier for TTRSS

Posted: 25 Nov 2015, 21:06
by fox
I agree, it is more of a proof of concept than anything, maybe I should retire it to -attic repository

garden variety sql backend seems to be unsuitable for this kinda workload

e: done, i guess that's the end of that experiment.

Re: Bayesian classifier for TTRSS

Posted: 01 Dec 2015, 13:11
by masgo
So I cloned the attic repo and made some very minor changes which yielded huge performance improvements for me:
1. add index to word in wordfreqs table (I am using MySQL - the change only applies to MySQL since I cannot test other DBs)
2. use isset instead of in_array for better performance
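
To show why the index in point 1 helps, here is a small self-contained sketch using Python's built-in sqlite3 module instead of MySQL. The table layout (wordfreqs with word, category_id, freq) and the index name are assumptions made up for this example; the real schema and the actual change are in the linked commits. The query plan switches from a full table scan to an index search once the index exists.

```python
import sqlite3

# In-memory stand-in for the word frequency table; the columns are assumed.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE wordfreqs (word TEXT, category_id INTEGER, freq INTEGER)")
conn.executemany(
    "INSERT INTO wordfreqs VALUES (?, ?, ?)",
    [(f"word{i}", 1, i) for i in range(1000)],
)


def query_plan(connection):
    """Return SQLite's plan description for a lookup by word."""
    rows = connection.execute(
        "EXPLAIN QUERY PLAN SELECT freq FROM wordfreqs WHERE word = ?",
        ("word42",),
    ).fetchall()
    # The last column of each row holds the human-readable plan detail.
    return " ".join(row[-1] for row in rows)


plan_before = query_plan(conn)  # full scan: every row is examined
conn.execute("CREATE INDEX idx_wordfreqs_word ON wordfreqs (word)")
plan_after = query_plan(conn)   # search via idx_wordfreqs_word
```

Since the classifier looks up one row per token, a scan per lookup adds up quickly, which would fit the slowdown described earlier in the thread.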

Now the plugin is really fast on my system. Everyone who is interested can have a look here:
https://github.com/masgo/tt-rss-attic/commits/master

Re: Bayesian classifier for TTRSS

Posted: 01 Dec 2015, 13:43
by fox
>1. add index to word in wordfreqs table

oh whoops, I didn't think about it

still i think redis or some other fast key/value store is the way to go here, if only for the ungodly amount of queries required to process stuff

Re: Bayesian classifier for TTRSS

Posted: 01 Dec 2015, 17:03
by masgo
There are definitely better ways to implement this, but doing it differently costs time which I do not have at the moment. Since the plugin as it is works quite well for me, I will stick with it for now. After these changes it is also really fast. And since it does not delete articles, the only impact of it not working properly is a skewed score.

I also think that a Bayesian classifier is not the best there is for the job. I would like to have something that does deduplication and grouping of similar articles. This grouping could also influence the score (e.g., more feeds cover the same topic -> might be something interesting). It could also track how often I open the article in a new window and many other things. ... but as stated before: who has the time?

Re: Bayesian classifier for TTRSS

Posted: 01 Dec 2015, 21:32
by rknobbe-other
Automatic grouping of similar articles (kind of like the old SwiftFile for Lotus Notes, https://web.archive.org/web/20120924112 ... swiftfile/) would be great, and was actually what I was originally hoping for when I asked about Bayesian classification at the beginning of this thread. I'm thinking of taking my external script that does categorization and migrating it to redis in order to prove the concept, if anybody is interested.

Re: Bayesian classifier for TTRSS

Posted: 01 Dec 2015, 22:03
by fox
btw you can sort of get similar headlines with trgm plugin (postgres only)
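
For anyone unfamiliar with trigram matching, this is a rough Python approximation of the idea, not the plugin's code and not an exact reimplementation of PostgreSQL's pg_trgm: each word is padded and cut into three-character pieces, and two headlines are scored by how many pieces they share. The function names and sample headlines are made up for illustration.

```python
import re


def trigrams(text):
    """Return the set of three-character pieces for all words in text."""
    grams = set()
    for word in re.findall(r"\w+", text.lower()):
        # Pad with two leading blanks and one trailing blank so that word
        # starts and ends produce their own trigrams.
        padded = f"  {word} "
        grams.update(padded[i:i + 3] for i in range(len(padded) - 2))
    return grams


def similarity(a, b):
    """Shared trigrams divided by all distinct trigrams (0.0 to 1.0)."""
    ta, tb = trigrams(a), trigrams(b)
    if not ta or not tb:
        return 0.0
    return len(ta & tb) / len(ta | tb)


close = similarity("Apple launches new phone", "Apple launches a new phone")
far = similarity("Apple launches new phone", "Stock markets fall")
```

Headlines that differ by a word or two score close to 1, unrelated ones close to 0, so a threshold on this score can flag likely duplicates.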
