Last Updated: February 25, 2016 · vjt

Recode Windows-1252 characters as UTF-8

Your database contains strings with characters such as "\u008a" or "\u009a". A browser renders them correctly as "Š" or "š", but a command-line console cannot display them, and Ruby still reports them as valid UTF-8.

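In reality, those are windows-1252 encoded strings whose bytes were decoded with the wrong encoding (most likely ISO-8859-1) before being stored as UTF-8. As a result, each byte in the 0x80-0x9F range got mapped to the code point with the same number, at the start of the Unicode Latin-1 Supplement block.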

Luckily, the code points U+0080 to U+009F, which cover exactly the range where windows-1252 differs from ISO-8859-1, are non-printable control characters in Unicode. It is therefore reasonably safe to assume that any of them found in text is a wrongly interpreted windows-1252 character, which makes it possible to match and recode them.
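To see how such strings come about, the following minimal sketch reproduces the problem. It assumes the original bytes were windows-1252 and were decoded as ISO-8859-1 before being converted to UTF-8; the variable names are illustrative only.

```ruby
# The byte 0x8A is "Š" in windows-1252.
raw = "\x8A".b

# Wrong: treat the byte as ISO-8859-1, then convert to UTF-8.
# ISO-8859-1 maps every byte to the code point with the same number.
broken = raw.dup.force_encoding('ISO-8859-1').encode('UTF-8')

# Right: treat the byte as windows-1252, then convert to UTF-8.
correct = raw.dup.force_encoding('windows-1252').encode('UTF-8')

p broken.valid_encoding?            # true, which is why Ruby does not complain
p broken.codepoints                 # [138], i.e. U+008A, a C1 control character
p correct                           # "Š"
p broken.match?(/[\u0080-\u009F]/)  # true, so the pattern detects the damage
```

The same pattern can be used to check whether a column is affected before rewriting any data.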

Use this function to recode them to proper UTF-8:

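def recode_windows_1252_to_utf8(string)
  # Match only the C1 control range, which should not occur in normal text.
  string.gsub(/[\u0080-\u009F]/) do |char|
    # Drop the leading 0xC2 byte; the second byte is the original windows-1252 byte.
    char.getbyte(1).chr.force_encoding('windows-1252').encode('utf-8')
  end
end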

Here we strip the first byte (0xc2) of the wrong UTF-8 encoding, create a new single-character string from the second byte, tell Ruby that it is windows-1252, and let Ruby itself do the conversion to UTF-8.
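The steps can be traced on a single character, as in the sketch below. It also covers a caveat not mentioned above: five byte values (0x81, 0x8D, 0x8F, 0x90 and 0x9D) are unassigned in windows-1252, and as far as I know Ruby raises Encoding::UndefinedConversionError for them. The variant shown here, under the hypothetical name recode_windows_1252_to_utf8_lenient, leaves such characters untouched instead of failing.

```ruby
# Hypothetical lenient variant: characters with no windows-1252 mapping are kept as-is.
def recode_windows_1252_to_utf8_lenient(string)
  string.gsub(/[\u0080-\u009F]/) do |char|
    begin
      char.getbyte(1).chr.force_encoding('windows-1252').encode('utf-8')
    rescue Encoding::UndefinedConversionError
      char # unassigned in windows-1252: keep the character as it was
    end
  end
end

char = "\u008A"
p char.bytes                 # [194, 138]: 0xC2 followed by the original byte
byte = char.getbyte(1)       # 138 (0x8A)
single = byte.chr            # one-byte binary string "\x8A"
p single.force_encoding('windows-1252').encode('utf-8')  # "Š"

# U+0080 becomes the Euro sign; U+0081 has no mapping and stays unchanged.
p recode_windows_1252_to_utf8_lenient("\u0080 and \u0081")
```

Whether to keep, drop, or replace the unmappable characters is a design choice that depends on the data.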

1 Response

It helps me a lot to read the German Postbank CSV. The first conversion gives the umlauts, the second (the method above) gives the Euro sign. What I do not know is why I have to convert twice. Here is the code:

class BankCsv
  def read(file)
    f = File.open(file, "r:ISO-8859-1")
    a = f.read
    a = a.encode('ISO-8859-1', :invalid => :replace, :replace => '').encode('UTF-8')
    a = recode_windows_1252_to_utf8(a)
    puts a
  end

  def recode_windows_1252_to_utf8(string)
    string.gsub(/[\u0080-\u009F]/) {|x| x.getbyte(1).chr.
      force_encoding('windows-1252').encode('utf-8') }
  end
end

BankCsv.new.read("1.csv")

over 1 year ago
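A probable reason two conversions are needed: ISO-8859-1 and windows-1252 agree on the umlauts (for example 0xFC for "ü") but not on the 0x80-0x9F range, where windows-1252 puts the Euro sign (0x80). Reading the file as ISO-8859-1 therefore gets the umlauts right and turns the Euro byte into U+0080, which the function then repairs. Assuming the file really is windows-1252, declaring that encoding when reading should do both in one pass. A minimal sketch, with a made-up file name and sample content:

```ruby
require 'tempfile'

# Write a sample file containing windows-1252 bytes: "ü" (0xFC) and "€" (0x80).
sample = Tempfile.new(['bank', '.csv'])
sample.binmode
sample.write("Geb\xFChr;12,50 \x80\n".b)
sample.close

# External encoding windows-1252, transcoded to UTF-8 while reading.
text = File.read(sample.path, encoding: 'windows-1252:utf-8')
puts text   # Gebühr;12,50 €
```

If the export ever contains one of the bytes that windows-1252 leaves unassigned, the read may raise an error, so this is worth testing against a real file first.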