Last Updated: September 16, 2020
·
49.93K
· aleemb

Mysql: Convert encoding to UTF8 without garbled data

It is possible that converting MySQL dataset from one encoding to another can result in garbled data, for example when converting from Latin1 to UTF8. There are a number of approaches. All examples assume we are converting the title VARCHAR(255) column in the comments table.

#1 Convert to blob then UTF8

The first approach is via Mattias and Perconas:

ALTER TABLE comments MODIFY title BLOB;
ALTER TABLE comments MODIFY title VARCHAR(255) CHARACTER SET utf8;

The same Percona link also has more approaches on how to do this without locking the tables.

#2 convert to BINARY then UTF8

Sidecar has a nice simple 1-liner both to test the results and convert it and a little Ruby script to automate over the entire DB. The test snippet is particularly useful. This doesn't change the column (no ALTER TABLE), instead it just converts the data in place.

-- test
SELECT CONVERT(CAST(CONVERT(title USING latin1) AS BINARY) USING utf8) FROM comments WHERE id = 123;

-- convert
UPDATE comments SET title = CONVERT(cast(CONVERT(title USING latin1) AS BINARY) USING utf8);
ALTER TABLE comments MODIFY title VARCHAR(255) CHARACTER SET utf8;

#3: iconv

Another other approach is to use the iconv command to convert the an entire table:

mysqldump comments --add-drop-table users | replace CHARSET=latin1 CHARSET=utf8 | iconv -f latin1 -t utf8 | mysql some_database

Additionally Percona offers a tool to convert the character-set for an entire DB as discussed in their converting characterset post.

#4: Safest approach without double conversion (PREFERRED METHOD)

It is possible that you are still ending up with garbled data or something is going wrong. This might happen for example if you have stored UTF8 data in a LATIN1 column or some other such scenario. Joni Salonen offers a check along with the update.

UPDATE comments SET title = @txt WHERE char_length(title) =  LENGTH(@txt := CONVERT(BINARY CONVERT(title USING latin1) USING utf8));
ALTER TABLE comments MODIFY title VARCHAR(255) CHARACTER SET utf8;

Debugging Tip

If you feel like digging deeper, you can use the HEX function to test the output, example:

SELECT HEX(CONVERT(`title` USING latin1)) FROM `comments` WHERE id=123;
SELECT HEX(CONVERT(CAST(CONVERT(title USING latin1) AS BINARY) USING utf8)) FROM comments WHERE id = 123;