Encoding is a bitch. Always. There is no argument about that.
My advice to all of you Rubyists out there: if you handle input from other sources, for example when parsing other websites that may use different encodings, re-encode the input to avoid errors like
ArgumentError: invalid byte sequence in UTF-8
In Ruby 1.9 you can do that with String#encode. The documentation for this method says little about its options, but just look at this one-line snippet:
unsafe_string.encode!('UTF-8', 'UTF-8', :invalid => :replace, :replace => '')
This re-encodes unsafe_string and simply deletes every byte sequence that is not valid UTF-8, so your code is less likely to fail with this ArgumentError.
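To see the effect on a concrete string, here is a minimal sketch. The helper name strip_invalid_utf8 and the sample input are made up for illustration. The round trip through UTF-16BE is my own assumption, not part of the snippet above: some Ruby 1.9 releases reportedly treat a UTF-8 to UTF-8 conversion as a no-op, and going through a different encoding sidesteps that.

```ruby
# Hypothetical helper (not from the original post): strips byte sequences
# that are not valid UTF-8 by round-tripping through UTF-16BE, so the
# conversion is never a same-encoding call.
def strip_invalid_utf8(unsafe_string)
  unsafe_string
    .encode('UTF-16BE', 'UTF-8', :invalid => :replace, :replace => '')
    .encode('UTF-8', 'UTF-16BE')
end

# Example input: 0xE9 is "é" in Latin-1, but on its own it is not valid UTF-8.
unsafe_string = "caf\xE9 au lait".dup.force_encoding('UTF-8')
safe_string = strip_invalid_utf8(unsafe_string)

puts unsafe_string.valid_encoding?  # false
puts safe_string.valid_encoding?    # true
puts safe_string                    # caf au lait
```

Note that the offending byte is dropped, not converted, so the "é" is lost. If you know the real source encoding, transcoding from it is the better option.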
In Ruby 1.8 you have to use Iconv to get this result. Here is how it could look:
ic = Iconv.new('UTF-8//IGNORE', 'UTF-8')
safe_string = ic.iconv(unsafe_string)
As I understand it, this Iconv approach will still fail if the invalid byte sequence is at the end of the string. But this can be circumvented; Paul Battley shows how.