Importing latin1 data to MySQL is being truncated

#mysql

#rails

#charset

#encoding

I was having an issue this week, when trying to upload data with 'special' chars on MySQL 5.1, using Ruby on Rails.

The data was just getting truncated like described in this thread.

There were many places where this issue could be:

Ruby on Rails database configuration;
MySQL 5.1 could have some kind of bug, like the following query wasn't work for me SHOW COLLATION LIKE 'latin1%';
the 'LOAD DATA INFILE' syntax could be some how wrong

The solution

I tried and tried, it was being painful. I decided to call my great friend @Lynx_Eyes who told me exactly where the problem was, in the CSV charset! So the only thing needed is to fix the CSV itself.

So here goes my solution:

1. Change the MySQL import script

In the RoR/MySQL script I just added CHARACTER SET 'utf8' after the INTO TABLE

# Load the import SQL statment from file & set some variables
import_csv_sql = <<-SQL
      LOAD DATA LOCAL INFILE '#{csv_data_file_path}'
        INTO TABLE #{mysql_table_name} CHARACTER SET 'utf8'
         FIELDS
            TERMINATED BY ','
            ENCLOSED BY '"'
         LINES
            TERMINATED BY '\\n'
      (
      #{column_headers.map{|c| c.name}.join(",\n")}
      );
    SQL

2. convert the CSV file

Note that in this project we're using the old and deprecated Ruby 1.8.7.

Then I created a module to pre-process the CSV file before being passed to MySQL.

In the shell you need first to figure out what is the current charset of your CSV file. The file command will give you a hand with that:

2.1 Run $ file -ib <filename> in the command prompt, this will return you something like this:

application/octet-stream; charset=binary

application/octet-stream; charset=iso-8859-1

In Ruby 2.0.0, this is easily achieved by using the IO method #external_encoding

2.2 Now, use the result you got, as a parameter to the iconv. By default iconv spits to STDOUT, so is better to redirect the output to a new file, something like this:

iconv --from-code=iso-8859-1 --to-code=utf-8 <input_file> --output=<output_file>

You can check my module on the gist I created.

As a side note, I figured out that the version file influences the result. In my development environment I have the version 5.11 which gives the result above. I tested with an older version (file-4.17) which doesn't return the charset unfortunately.