Last Updated: February 25, 2016
· dive

Remarkably, even after a decade of such pain, Unicode is, in 2012, still “cutting edge.”

http://golem.ph.utexas.edu/~distler/blog/archives/002539.html

For faster performance, Heterotic Beast caches the rendered (X)HTML of each post in the database. Sure enough, the cached XHTML was truncated just before the “𝒜”, a character which, in Unicode, lies in Plane-1 (U+1D49C). Evidently, there was a problem storing characters outside the BMP.
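The symptom described above can be imitated in a few lines of Ruby. This is only a sketch of the observed behaviour, not of what MySQL does internally; the helper name `prefix_before_astral` and the sample sentence are made up for illustration.

```ruby
# Hypothetical helper: returns the part of +text+ that precedes the first
# character outside the BMP (code point above U+FFFF), imitating the
# truncation seen in the cached XHTML.
def prefix_before_astral(text)
  cut = text.each_char.find_index { |ch| ch.ord > 0xFFFF }
  cut ? text[0, cut] : text
end

# Sample text containing U+1D49C (MATHEMATICAL SCRIPT CAPITAL A).
sample = "Let \u{1D49C} be an algebra"
puts prefix_before_astral(sample).inspect
```

Text that stays inside the BMP comes back unchanged, which matches the fact that only posts with astral plane characters were affected.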

Now, Rails 3, by default, creates MySQL database tables with the ‘utf8’ encoding. Since UTF-8 covers all 17 Unicode planes, you might think that would be sufficient. You would be wrong. MySQL’s utf8 encoding only covers the BMP: it stores at most 3 bytes per character, so it can’t handle 4-byte characters at all.
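To see where the 3-byte limit bites, it helps to look at how many bytes a few characters occupy in UTF-8. A minimal sketch in plain Ruby (1.9 or later); the helper name `needs_four_bytes?` is an invention for this example, not part of Rails or mysql2.

```ruby
# Hypothetical helper: true if +text+ contains a character that needs four
# bytes in UTF-8, i.e. one that a 3-byte-per-character encoding cannot hold.
def needs_four_bytes?(text)
  text.each_char.any? { |ch| ch.bytesize == 4 }
end

# A, e-acute, the euro sign, and U+1D49C: 1, 2, 3 and 4 bytes respectively.
["A", "\u00E9", "\u20AC", "\u{1D49C}"].each do |ch|
  puts format("U+%04X takes %d byte(s)", ch.ord, ch.bytesize)
end

puts needs_four_bytes?("plain BMP text \u20AC")  # false
puts needs_four_bytes?("script A: \u{1D49C}")    # true
```

Everything in the BMP fits in three bytes or fewer; only the planes above it need the fourth.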

Fortunately, MySQL 5.5.3 (released in March 2010) introduced a new encoding, ‘utf8mb4’, which actually, y’know, supports Unicode.

ALTER TABLE tablename CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_bin;
did the trick. Now the posts, in the database, didn’t get truncated at the first astral plane character. Unfortunately, instead of astral plane characters, the database entries contained garbage characters. Obviously, Rails had no idea that I had switched encodings in the database. I needed to say so, explicitly, in config/database.yml:

production:
  adapter: mysql2
  host: 127.0.0.1
  database: beast
  username: ...
  password: ...
  encoding: utf8mb4
  port: 3306
Ah, if only life were so simple. The release version of the mysql2 gem doesn’t support the utf8mb4 encoding. Fortunately (as of December, 2011), the development version does. So

gem 'mysql2', :git => 'http://github.com/brianmario/mysql2.git'
(finally!) makes everything work as it should.
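The ALTER TABLE statement has to be run once per table. A small sketch that generates the statements for a list of tables; the helper name `utf8mb4_conversion_sql` and the table names `posts` and `topics` are placeholders, not taken from Heterotic Beast's actual schema.

```ruby
# Hypothetical helper: builds one ALTER TABLE statement per table name,
# following the statement used above. The collation defaults to utf8mb4_bin.
def utf8mb4_conversion_sql(table_names, collation = "utf8mb4_bin")
  table_names.map do |name|
    "ALTER TABLE `#{name}` CONVERT TO CHARACTER SET utf8mb4 COLLATE #{collation};"
  end
end

# Placeholder table names; substitute the tables in your own database.
puts utf8mb4_conversion_sql(%w[posts topics])
```

The output can be reviewed and pasted into the mysql client. It deliberately only prints the statements rather than executing them, since converting a large table can take a while and is worth doing by hand.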