I've spent some time on and off since the issue of MySQL's lack of 4-byte unicode support in the utf8 character set first appeared looking for a better long-term solution. I looked at how applications like WordPress handled their migration from utf8 to utf8mb4 and examined how TT-RSS internally uses the data in the database. In reviewing the available options I wanted to make sure the schema for MySQL and PostgreSQL were as similar as possible, even though there is not always a perfect match for data types.
Obviously we cannot expect users to change the configuration of their MySQL installation and re-create their databases. So that's out. This leaves us with using truncated indices, which I was initially against but am now recommending.
After looking through the database schema I think we can safely change to utf8mb4 (with utf8mb4_unicode_ci collation) with only a few hiccups, which I'll detail shortly.
There are two approaches that can be taken. We can alter only the tables that matter (those tables are open for debate); so ttrss_entries should be enough to provide 4-byte unicode support for feed content, titles, authors, etc. The rest of the data would be restricted to 3-byte unicode characters. This is an easy change, simple and safe. But it means that users could never create a label or filter containing the pile of pool emoji.
¯\_(ツ)_/¯
Modifying only select tables means the tables within the MySQL database will have different character sets and collations. Not a big deal, but for uniformity it makes sense to have the tables all be the same and it provides a more consistent end user experience.
All tables can easily be converted in one query each:
Code: Select all
ALTER TABLE table_name CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;
With the following exceptions:
- ttrss_filter_actions and ttrss_filter_types have a UNIQUE index on description, which are both varchar(250) columns. As best as I can tell, these tables and their contents are only used internally by TT-RSS for data consistency with user-defined filters (i.e. it is possible to remove a filter action and all user-created filters with said action will be adjusted appropriately). This means we can more or less trust that we know what's going to be in this table. The UNIQUE index can be truncated to 191 (max. allowed for 4-byte unicode characters) and only suffer a performance hit when the description exceeds 191 characters, which is unlikely given the contents of the columns. To be clear, the columns themselves would remain 250 characters in length, just the index length is being modified.
- ttrss_entries has a UNIQUE index on guid. As best as I can tell its content should always be: ^SHA1:[a-f0-9]{40}$ But it looks like it's longer for backwards compatibility. No problem, we can safely truncate the index to 191 and everything should work just fine.
- ttrss_linked_instances has a UNIQUE index on access_key. It seems like this column could get by with a much shorter length, but again, we can adjust just the index to 191 and everything should still be fine. Notwithstanding that "linked instances" feature of TT-RSS doesn't appear to be popular.
- ttrss_sessions has the id column set to the client's session ID as generated by PHP. Again, this seems to be pretty short, so we should be able to use 191.
- This leaves ttrss_prefs and ttrss_user_prefs. These are the challenging tables because they utilize foreign key constraints that prevent the index from being truncated. This means we have to reduce the column to varchar(191) if we want these to work. The good news is that like the ttrss_filter_* tables the data in here appears to be controllable. But if we reduced the column length it would create inconsistency with PostgreSQL (unless we also modified its schema).
I've created an sql file (attached) that will do the table alterations and it seems to work fine, but I don't feel comfortable enough with it yet because I don't have enough legacy data in the database and am concerned such data would prevent the schema changes. I've also been running my dev environment for months only the ttrss_entries table set to utf8mb4 without issue. I'm mostly posting this so we can discuss if we want to pursue a better long-term solution than preg_replacing the 4-byte characters, which is less than ideal for the end user and a bit slower from a performance perspective. I am prepared to put together the necessary pull requests based on discussions here.