I made a thing (again)

Development-related discussion, including bundled plugins
fox
^ me reading your posts ^
Posts: 6318
Joined: 27 Aug 2005, 22:53
Location: Saint-Petersburg, Russia

Re: I made a thing (again)

Postby fox » 14 Jul 2015, 13:56

what this library does is essentially guesswork, so it's not going to work all the time on every html page out there

that's why there's no "enable for all feeds" checkbox

Maru
Bear Rating Trainee
Posts: 40
Joined: 20 Oct 2013, 14:26

Re: I made a thing (again)

Postby Maru » 14 Jul 2015, 14:28

fox wrote:what this library does is essentially guesswork, so it's not going to work all the time on every html page out there

that's why there's no "enable for all feeds" checkbox


Ok, so apparently I clicked on this feed once for testing and forgot about it, sorry.
But I still have a problem with the code: $tmpdoc->encoding is just empty in my case. Blame it on cosmic rays or my version or ... I don't know.

Changing the if to

if (!mb_detect_encoding($tmp, 'UTF-8', true)) {
$tmpxpath = new DOMXPath($tmpdoc);

actually fixes the problem for me. I have not tested it with a non-UTF-8 feed, though, since I could not find one.
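
The check in this post boils down to "is the fetched page already valid UTF-8?". A minimal sketch of that test follows, assuming the mbstring extension is loaded. It uses mb_check_encoding() rather than the mb_detect_encoding() call above, since the former is a plain validity check, and the helper name is hypothetical rather than anything from the plugin.

```php
<?php
// Hypothetical helper, not part of the plugin: true when $html is valid UTF-8.
function is_valid_utf8(string $html): bool {
    // mb_check_encoding() validates the byte sequence against one encoding.
    return mb_check_encoding($html, 'UTF-8');
}

// "h\xC3\xA9llo" is "héllo" encoded as UTF-8; "\xE9t\xE9" is "été" in ISO-8859-1.
$utf8_sample = "h\xC3\xA9llo";
$latin1_sample = "\xE9t\xE9";
```

A page that fails this check would still need its real encoding identified before it can be converted.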


Re: I made a thing (again)

Postby fox » 14 Jul 2015, 14:35

well it's unlikely to detect some random non-unicode feed as utf8 so that's ok i guess

not sure why document->encoding is empty, my theory would be php being garbage as usual


Re: I made a thing (again)

Postby Maru » 14 Jul 2015, 14:49

fox wrote:well it's unlikely to detect some random non-unicode feed as utf8 so that's ok i guess

not sure why document->encoding is empty, my theory would be php being garbage as usual

Reading through the docs shows that a lot of people are struggling with it.
I am wondering if it should always be converted to UTF-8 before being parsed, just to be on the safe side.

pcause
Bear Rating Master
Posts: 144
Joined: 23 Aug 2013, 19:52

Re: I made a thing (again)

Postby pcause » 14 Jul 2015, 16:27

Fox, there is a user-created plugin called af_fullpost that I was using; I turned it off and added yours. I noticed I am getting different results. Looking at the code, I see you use:

Code:

         curl_setopt($ch, CURLOPT_FOLLOWLOCATION,
            !ini_get("safe_mode") && !ini_get("open_basedir"));


The PHP docs say that the safe_mode ini variable is deprecated and was removed in PHP 5.4 (http://php.net/manual/en/curl.constants.php). I think you can just set this to true, and if open_basedir is set the option is ignored.


Re: I made a thing (again)

Postby pcause » 14 Jul 2015, 16:42

One idea for this plugin: how about adding an icon in posts which when the user clicks calls the plugin to fetch the content for that article. Like embed_original but doesn't need to try to insert iframes and create cross site issues. that way if there is a post for a feed where the user didn't enable for the full feed, it is still possible to fetch article contents. And this would be a great feature for the mobile client as well. save switching to the browser in most cases.


Re: I made a thing (again)

Postby fox » 14 Jul 2015, 16:47

pcause wrote:One idea for this plugin: how about adding an icon in posts which when the user clicks calls the plugin to fetch the content for that article. Like embed_original but doesn't need to try to insert iframes and create cross site issues. that way if there is a post for a feed where the user didn't enable for the full feed, it is still possible to fetch article contents.


this is gonna blow your mind but this button already exists: click on an article title and the original will magically open in a new browser tab. amazing but true.

And this would be a great feature for the mobile client as well. save switching to the browser in most cases.


indeed, life is too short to waste a literal second of your valuable time waiting while your phone switches between tt-rss and chrome.

i mean yes i understand that for images and videos because it is faster (and for images it is arguably better UX to display a sliding gallery on a phone) than rendering html be it inside tt-rss or in the browser. but displaying html pages? that's what your browser is actually for.


Re: I made a thing (again)

Postby Maru » 14 Jul 2015, 17:10

Maru wrote:
fox wrote:well it's unlikely to detect some random non-unicode feed as utf8 so that's ok i guess

not sure why document->encoding is empty, my theory would be php being garbage as usual

Reading through the docs shows that a lot of people are struggling with it.
I am wondering if it should always be converted to UTF-8 before being parsed, just to be on the safe side.

Ok, I had a look at the readability implementation that is used and it is EXPECTING UTF-8, so I think it is even better if we always convert to UTF-8, after first trying to figure out the input encoding.

edit1: I have yet to find a non-UTF-8 site I can test my code on :)


Re: I made a thing (again)

Postby fox » 14 Jul 2015, 17:50

http://www.fontanka.ru/fontanka.rss


JustAMacUser
Bear Rating Overlord
Posts: 373
Joined: 20 Aug 2013, 23:13

Re: I made a thing (again)

Postby JustAMacUser » 14 Jul 2015, 19:02

pcause wrote:The PHP docs say that the safe_mode ini variable is deprecated and was removed in PHP 5.4 (http://php.net/manual/en/curl.constants.php). I think you can just set this to true, and if open_basedir is set the option is ignored.


The ini_get() function will return false if the option doesn't exist. Therefore, keeping that setting there changes nothing now or in the future, but does allow for backward compatibility. Keep in mind some Linux distros will be using PHP 5.3 for a while yet (e.g. CentOS 6), so it doesn't hurt to leave that in there.
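
The point about ini_get() can be seen in a small sketch. The function name below is hypothetical, and the option name in the second call is deliberately made up to stand in for any directive PHP does not know about (which is what safe_mode becomes on versions that no longer ship it).

```php
<?php
// Hypothetical wrapper around the check quoted from the plugin: follow
// redirects only when neither safe_mode nor open_basedir is set.
function follow_location_allowed(): bool {
    // ini_get() returns false for a directive that does not exist, so the
    // safe_mode test is simply a no-op where that directive is gone.
    return !ini_get("safe_mode") && !ini_get("open_basedir");
}

// A made-up directive name, used only to show the "unknown option" case.
$unknown = ini_get("hypothetical_missing_option");
```

Since `!false` is true, the expression then depends on open_basedir alone, which is why leaving the safe_mode test in place is harmless.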


Re: I made a thing (again)

Postby Maru » 15 Jul 2015, 00:34

fox wrote:http://www.fontanka.ru/fontanka.rss

Ok, this took much longer than I expected. Nevertheless, I now have a work-in-progress version that should work with all encodings. Have a look at it here:

https://github.com/maru-sama/Tiny-Tiny- ... 470b974692

So what did I change:

* Use file_get_contents. This populates $http_response_header, which can then be used to get the proper charset
* We have to be "clever" here, since there might be redirects with other encodings, so I filtered the headers and only took the last entry
* Completely get rid of DOMDocument, since we no longer need it

Todos:
* Clean it up a little bit
* Use the results of curl instead of fetching everything twice. We can then take both the content AND the Content-Type (and with it the charset) from it
* Test it with more pages
* Do some basic error handling

Of course this will break for pages which send a different charset in the page itself than in the Content-Type header, but there is not much we can do about it. Well, there is: we could also parse the whole page for the meta tag and use that instead. That said, I prefer the header approach for now.
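
The header approach described above could be sketched roughly as follows. This is not the code from the linked commit: both function names are hypothetical, the header array is passed in as an argument (in the scenario above it would be $http_response_header after file_get_contents), and falling back to UTF-8 when no charset is found is an assumption.

```php
<?php
// Hypothetical helper: returns the charset from the LAST Content-Type header,
// so that headers belonging to intermediate redirects are ignored.
function charset_from_headers(array $headers): ?string {
    $charset = null;
    foreach ($headers as $header) {
        if (preg_match('/^Content-Type:.*?charset\s*=\s*"?([^\s;"]+)/i', $header, $m)) {
            $charset = $m[1]; // later headers overwrite earlier ones
        }
    }
    return $charset;
}

// Hypothetical helper: converts $body to UTF-8, assuming UTF-8 input when the
// headers did not name a charset.
function to_utf8(string $body, ?string $charset): string {
    return mb_convert_encoding($body, "UTF-8", $charset ?? "UTF-8");
}

$sample_headers = [
    "HTTP/1.1 301 Moved Permanently",
    "Content-Type: text/html; charset=ISO-8859-1",
    "HTTP/1.1 200 OK",
    "Content-Type: text/html; charset=windows-1251",
];
```

With the sample headers, the redirect's ISO-8859-1 entry is overwritten by the final response's windows-1251 entry. A charset name that mbstring does not recognise is not handled here.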

As for your example page fox, it "looks" ok for me but please check it as well.


Re: I made a thing (again)

Postby fox » 15 Jul 2015, 08:36

you are going to get broken output from readability if meta charset elements are not removed


Re: I made a thing (again)

Postby Maru » 15 Jul 2015, 09:20

I did not notice anything breaking in my case, what's happening exactly?
As for my change, I think using the header is not such a good idea after all, since it can be empty, or rather the charset may not be set. A meta charset tag should always be there. So instead of reading the header and parsing it we can use something like

Code:

 if ($tmp) {
   preg_match("/meta.*charset\s*=\s*\"([^;\"]+)\"/i",$tmp,$charset);
   $tmp = mb_convert_encoding($tmp, "UTF-8", $charset[1]);



which should be there every time and if not we could just default to UTF-8 and hope for the best.
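
For comparison, the same lookup can be done through the DOM instead of a regular expression over the raw page. The following is only a sketch: the function name is hypothetical, it assumes the dom extension is available, and the small regular expression that remains is applied to a single attribute value, not to the HTML.

```php
<?php
// Hypothetical helper: returns the charset declared in a <meta> element, or null.
function meta_charset(string $html): ?string {
    if ($html === '') {
        return null; // loadHTML() rejects empty input
    }
    $doc = new DOMDocument();
    // Collect libxml warnings internally instead of emitting them.
    $previous = libxml_use_internal_errors(true);
    $doc->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    foreach ($doc->getElementsByTagName('meta') as $meta) {
        // HTML5 form: <meta charset="...">
        if ($meta->hasAttribute('charset')) {
            return trim($meta->getAttribute('charset'));
        }
        // Older form: <meta http-equiv="Content-Type" content="...; charset=...">
        if (strcasecmp($meta->getAttribute('http-equiv'), 'Content-Type') === 0
            && preg_match('/charset\s*=\s*["\']?([^\s;"\']+)/i', $meta->getAttribute('content'), $m)) {
            return $m[1];
        }
    }
    return null;
}
```

A null result would correspond to the "default to UTF-8 and hope for the best" case mentioned above.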


Re: I made a thing (again)

Postby fox » 15 Jul 2015, 09:24

yeah parsing html with regular expressions sounds like an excellent idea

for a clown

in a circus

I did not notice anything breaking in my case, what's happening exactly?


try reading my posts

maybe your domdocument is broken in a different way but mine always outputs to whatever the meta is set when using saveHTML() which is used after readability. so you get wonderful and amazing stuff like trying to insert cp1251 html in a utf8 database. short of clearing the meta tags before passing the html to readability nothing helps.
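
Clearing the meta tags before handing the document to readability might look something like the sketch below. The function name is hypothetical and this is not the plugin's actual code. It only shows the removal step on a DOMDocument, and it is an assumption that removing these elements is enough to change what saveHTML() emits on a given PHP/libxml build.

```php
<?php
// Hypothetical helper: removes every <meta> element that declares a charset
// and returns how many were removed.
function strip_meta_charset(DOMDocument $doc): int {
    $removed = 0;
    // Copy the live node list first so removals cannot disturb the iteration.
    foreach (iterator_to_array($doc->getElementsByTagName('meta')) as $meta) {
        $declares_charset = $meta->hasAttribute('charset')
            || strcasecmp($meta->getAttribute('http-equiv'), 'Content-Type') === 0;
        if ($declares_charset && $meta->parentNode !== null) {
            $meta->parentNode->removeChild($meta);
            $removed++;
        }
    }
    return $removed;
}
```

Other meta elements, such as a description, are left in place.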

anyway you are free to hack whatever but i'm not really interested in changes in the charset-related area coming from someone who obviously has little experience with this. i can continue wasting time checking your assumptions and shit but i'd rather not. especially since you just going full retard piling on hacks on top of hacks now for some reason.


Re: I made a thing (again)

Postby Maru » 15 Jul 2015, 12:12

You are right, instead of sending bits and pieces I should have sent a patch that actually looked sane.
I'll no longer waste your time with this.

