Last Updated: February 25, 2016
· calamari

Handle encoding of other sources

Encoding is a bitch. Always. There is no argument about that.

My advice to all you Rubyists out there: if you handle input from other sources, for example when parsing other websites whose content could arrive in a different encoding, re-encode the input to avoid errors like

ArgumentError: invalid byte sequence in UTF-8
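
To make the failure concrete, here is a minimal sketch of how such an error can show up. The helper name `encoding_error_for` and the sample bytes are made up for illustration; the sketch assumes the input is tagged as UTF-8 but contains a stray `\xFF` byte, and that a regexp match is the operation that trips over it.

```ruby
# Hypothetical helper (not from any library): returns the error message
# raised when matching against the string, or nil if matching works.
def encoding_error_for(string)
  string.match(/,/)
  nil
rescue ArgumentError => e
  e.message
end

# "\xFF" can never appear in valid UTF-8, so this string is broken.
broken = "abc,\xFFdef".dup.force_encoding('UTF-8')

puts broken.valid_encoding?      # cheap check before doing real work
puts encoding_error_for(broken)  # message of the ArgumentError, if any
```

Checking `valid_encoding?` up front is one way to find out whether a string needs cleaning before the rest of your code touches it.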

How?

In Ruby 1.9 you can do that with String#encode. The documentation for this method doesn't really explain the options, but just look at this one-line snippet:

unsafe_string.encode!('UTF-8', 'UTF-8', :invalid => :replace, :replace => '')

This re-encodes unsafe_string in place and simply deletes all byte sequences that are not valid UTF-8, so your code is less likely to fail with this ArgumentError.
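
One caveat, as far as I know: some Ruby versions reportedly treat a conversion from UTF-8 to UTF-8 as a no-op, so it is worth checking the result with `valid_encoding?`. On Ruby 2.1 and later there is also `String#scrub`, which exists for exactly this job. The sketch below wraps both in a helper; the name `sanitize_utf8` and the sample input are hypothetical, and it assumes the input is meant to be UTF-8.

```ruby
# Hypothetical helper, not part of the original snippet.
def sanitize_utf8(unsafe_string)
  # Work on a copy tagged as UTF-8 so the caller's string is left alone.
  copy = unsafe_string.dup.force_encoding('UTF-8')
  if copy.respond_to?(:scrub)
    # Ruby 2.1+: replace every invalid byte sequence with ''.
    copy.scrub('')
  else
    # Older Rubies: same call as the one-liner above, non-destructive.
    copy.encode('UTF-8', 'UTF-8', :invalid => :replace, :replace => '')
  end
end

# Valid bytes for an accented e, followed by a stray invalid byte.
dirty = "caf\xC3\xA9 \xFFok".dup.force_encoding('UTF-8')
clean = sanitize_utf8(dirty)

puts clean.valid_encoding?
```

Returning a new string instead of using the bang variant keeps the original bytes around, which helps if you want to log what was thrown away.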

In Ruby 1.8 you have to use Iconv to get this result. Here is how it could look:

require 'iconv'
# Convert UTF-8 to UTF-8; //IGNORE tells iconv to drop invalid bytes
ic = Iconv.new('UTF-8//IGNORE', 'UTF-8')
safe_string = ic.iconv(unsafe_string)

As I understand it, this Iconv approach will still fail if the invalid byte sequence is at the end of the string. But this can be circumvented: Paul Battley shows how.
