MySQL and utf8mb4 (4-byte unicode support)

Development-related discussion, including bundled plugins
JustAMacUser
Bear Rating Overlord
Bear Rating Overlord
Posts: 373
Joined: 20 Aug 2013, 23:13

MySQL and utf8mb4 (4-byte unicode support)

Postby JustAMacUser » 12 Jun 2016, 08:54

This is a follow-up to this thread: Trouble with latest xkcd cartoon

I've spent some time on and off since the issue of MySQL's lack of 4-byte unicode support in the utf8 character set first appeared looking for a better long-term solution. I looked at how applications like WordPress handled their migration from utf8 to utf8mb4 and examined how TT-RSS internally uses the data in the database. In reviewing the available options I wanted to make sure the schema for MySQL and PostgreSQL were as similar as possible, even though there is not always a perfect match for data types.

Obviously we cannot expect users to change the configuration of their MySQL installation and re-create their databases. So that's out. This leaves us with using truncated indices, which I was initially against but am now recommending.

After looking through the database schema I think we can safely change to utf8mb4 (with utf8mb4_unicode_ci collation) with only a few hiccups, which I'll detail shortly.

There are two approaches that can be taken. We can alter only the tables that matter (those tables are open for debate); so ttrss_entries should be enough to provide 4-byte unicode support for feed content, titles, authors, etc. The rest of the data would be restricted to 3-byte unicode characters. This is an easy change, simple and safe. But it means that users could never create a label or filter containing the pile of pool emoji.

¯\_(ツ)_/¯

Modifying only select tables means the tables within the MySQL database will have different character sets and collations. Not a big deal, but for uniformity it makes sense to have the tables all be the same and it provides a more consistent end user experience.

All tables can easily be converted in one query each:

Code: Select all

ALTER TABLE table_name CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;


With the following exceptions:

  • ttrss_filter_actions and ttrss_filter_types have a UNIQUE index on description, which are both varchar(250) columns. As best as I can tell, these tables and their contents are only used internally by TT-RSS for data consistency with user-defined filters (i.e. it is possible to remove a filter action and all user-created filters with said action will be adjusted appropriately). This means we can more or less trust that we know what's going to be in this table. The UNIQUE index can be truncated to 191 (max. allowed for 4-byte unicode characters) and only suffer a performance hit when the description exceeds 191 characters, which is unlikely given the contents of the columns. To be clear, the columns themselves would remain 250 characters in length, just the index length is being modified.
  • ttrss_entries has a UNIQUE index on guid. As best as I can tell its content should always be: ^SHA1:[a-f0-9]{40}$ But it looks like it's longer for backwards compatibility. No problem, we can safely truncate the index to 191 and everything should work just fine.
  • ttrss_linked_instances has a UNIQUE index on access_key. It seems like this column could get by with a much shorter length, but again, we can adjust just the index to 191 and everything should still be fine. Notwithstanding that "linked instances" feature of TT-RSS doesn't appear to be popular.
  • ttrss_sessions has the id column set to the client's session ID as generated by PHP. Again, this seems to be pretty short, so we should be able to use 191.
  • This leaves ttrss_prefs and ttrss_user_prefs. These are the challenging tables because they utilize foreign key constraints that prevent the index from being truncated. This means we have to reduce the column to varchar(191) if we want these to work. The good news is that like the ttrss_filter_* tables the data in here appears to be controllable. But if we reduced the column length it would create inconsistency with PostgreSQL (unless we also modified its schema).

I've created an sql file (attached) that will do the table alterations and it seems to work fine, but I don't feel comfortable enough with it yet because I don't have enough legacy data in the database and am concerned such data would prevent the schema changes. I've also been running my dev environment for months only the ttrss_entries table set to utf8mb4 without issue. I'm mostly posting this so we can discuss if we want to pursue a better long-term solution than preg_replacing the 4-byte characters, which is less than ideal for the end user and a bit slower from a performance perspective. I am prepared to put together the necessary pull requests based on discussions here.
Attachments
mysql-utf8mb4-update.sql.txt
For dev testing only.
(4.02 KiB) Downloaded 143 times

User avatar
fox
^ me reading your posts ^
Posts: 6318
Joined: 27 Aug 2005, 22:53
Location: Saint-Petersburg, Russia
Contact:

Re: MySQL and utf8mb4 (4-byte unicode support)

Postby fox » 12 Jun 2016, 09:54

in all honesty i'm not sure if this is worth the trouble for some worthless emojis and stuff. if people really want those that much they can use a proper database server i.e. postgres.

i however don't see it as something any sizable percentage of tt-rss userbase would care about in any shape or form. i.e. i don't remember anyone here complaining about missing extended utf characters in posts.

also, not only this is an extensive schema overhaul for arguably little gain there's also additional mysql version requirements - i think it's 5.5.something needed for utf8mb4 with maybe additional requirements on client libraries. there's a huge potential here for breakage / tears.

i understand that this change will remove a hack and make things neater in a way and i appreciate the effort you went through but in the end i really don't think its worth it.

JustAMacUser
Bear Rating Overlord
Bear Rating Overlord
Posts: 373
Joined: 20 Aug 2013, 23:13

Re: MySQL and utf8mb4 (4-byte unicode support)

Postby JustAMacUser » 12 Jun 2016, 10:36

Not a problem. The transition is certainly not without its challenges. At the very least this thread can serve to provide more info to others about why things are the way they are.

User avatar
fox
^ me reading your posts ^
Posts: 6318
Joined: 27 Aug 2005, 22:53
Location: Saint-Petersburg, Russia
Contact:

Re: MySQL and utf8mb4 (4-byte unicode support)

Postby fox » 12 Jun 2016, 10:49

yep more input would be appreciated

speak up emoji lovers

DeviousNull
Bear Rating Trainee
Bear Rating Trainee
Posts: 2
Joined: 23 Jun 2016, 21:58

Re: MySQL and utf8mb4 (4-byte unicode support)

Postby DeviousNull » 23 Jun 2016, 22:25

I personally wouldn't mind missing emojis in feed content, but that doesn't seem to be what TTRSS does when it encounters a feed containing them.

For example, the latest post at "http://blog.cryptographyengineering.com/feeds/posts/default" currently contains an emoji. When the post went up, TTRSS stopped updating the feed with (invalid object received - see error console for details). The error console contained the following:

Code: Select all

Query failed (Incorrect string value: '\xF0\x9F\x92\xA9 &...' for column 'errstr' at row 1): INSERT INTO ttrss_error_log (errno, errstr, filename, lineno, context, owner_uid, created_at) VALUES (256, 'Query failed (Incorrect string value: \'\\xF0\\x9F\\x92\\xA9 &...\' for column \'content\' at row 1): INSERT INTO ttrss_entries\n   (title,\n   guid,\n   link,\n   updated,\n   content,\n   content_hash,\n   no_orig_date,\n   date_updated,\n date_entered,\n   comments,\n   num_comments,\n   plugin_data,\n   lang,\n   author)\n   VALUES\n   (\'What is Differential Privacy?\',\n   \'SHA1:d9c6a811e26ef0995ad033140b17455a5f3b028e\',\n   \'http://feedproxy.google.com/~r/AFewThoughtsOnCryptographicEngineering/~3/orEFxcaP7yc/what-is-differential-privacy.html\',\n   \'2016/06/15 14:51:41\',\n   \'Yesterday at the WWDC keynote, Apple announced a series of new security and privacy features, including one feature that\\\'s drawn a bit


From this I deduced that the emoji (F0 9F 92 A9) was the problem. For a quick, hacky fix I changed "include/rssfuncs.php" around line 730 to replace 4-byte codepoints with a placeholder using following:

Code: Select all

$entry_tags = $article["tags"];
$entry_guid = db_escape_string($entry_guid);
$entry_title = preg_replace('/[\x{10000}-\x{10FFFF}]/u', "\xEF\xBF\xBD", db_escape_string($article["title"]));
$entry_author = preg_replace('/[\x{10000}-\x{10FFFF}]/u', "\xEF\xBF\xBD", db_escape_string($article["author"]));
$entry_link = db_escape_string($article["link"]);
$entry_content = preg_replace('/[\x{10000}-\x{10FFFF}]/u', "\xEF\xBF\xBD", $article["content"]); // escaped below


This solved my issue - the feed updates as expected now - but I'm sure it isn't the best solution, and I imagine there are other places in the code (i.g. error logging?) which still have this issue.

User avatar
fox
^ me reading your posts ^
Posts: 6318
Joined: 27 Aug 2005, 22:53
Location: Saint-Petersburg, Russia
Contact:

Re: MySQL and utf8mb4 (4-byte unicode support)

Postby fox » 23 Jun 2016, 22:53

i think that's what we're doing anyway on most article fields (see rssfuncs.php:782).

as far as i could see the pile of poo emoticon gets properly replaced to the utf ? thing, so i'm not sure why additional preg_replaces are needed.

DeviousNull, which tt-rss version are you running?

DeviousNull
Bear Rating Trainee
Bear Rating Trainee
Posts: 2
Joined: 23 Jun 2016, 21:58

Re: MySQL and utf8mb4 (4-byte unicode support)

Postby DeviousNull » 24 Jun 2016, 03:12

Well, now this is embarrassing. My git repo didn't have origin pointed at your repository, and so I wasn't getting updates. When I fixed that and updated, the problem went away.

Sorry about that.


Return to “Development”

Who is online

Users browsing this forum: No registered users and 2 guests