Last Updated: February 25, 2016
· chluehr

Debugging encodings and character sets.

Garbled text on your screen?

  1. Put your data in a plain text file (use vim, for example: you do not want a BOM in your data!)
  2. Run the command hexdump -C file
  3. Locate the strange characters and note their byte sequences
  4. Look them up, e.g. here: UTF-8 charset table (German)
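
If you would rather not look up every byte by hand, steps 3 and 4 can be roughly sketched in Python. This is only a sketch: the helper name describe_non_ascii is made up for illustration, and it assumes the data is valid UTF-8 (if it is not, the UnicodeDecodeError that gets raised reports the offset of the first invalid byte, which is a useful hint in itself).

```python
import unicodedata


def describe_non_ascii(data: bytes):
    """List every non-ASCII character in UTF-8 data.

    Returns tuples of (byte offset, hex bytes, code point, Unicode name).
    Hypothetical helper; assumes `data` decodes cleanly as UTF-8.
    """
    rows = []
    offset = 0
    for ch in data.decode("utf-8"):
        encoded = ch.encode("utf-8")
        if len(encoded) > 1:  # multi-byte sequence means non-ASCII
            hex_bytes = " ".join(f"{b:02x}" for b in encoded)
            name = unicodedata.name(ch, "<unnamed>")
            rows.append((offset, hex_bytes, f"U+{ord(ch):04X}", name))
        offset += len(encoded)
    return rows


if __name__ == "__main__":
    # In-memory sample; for a real file use open("file", "rb").read()
    for row in describe_non_ascii("für".encode("utf-8")):
        print(row)
```

The byte offsets match the offsets shown in the left column of hexdump -C (which prints them in hex), so you can cross-check both views.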

An example: the German umlaut ü ("ue"):

The correct UTF-8 encoding is (you would see c3 bc in the hexdump):

U+00FC  ü  c3 bc   LATIN SMALL LETTER U WITH DIAERESIS

A valid UTF-8 character sequence that displays identically but is not a single "ü" character (here you would see 75 cc 88 in the hexdump):

U+0075  u   75      LATIN SMALL LETTER U
U+0308  ̈  cc 88   COMBINING DIAERESIS
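
The two forms above are what Unicode calls the composed (NFC) and decomposed (NFD) normalization forms. A minimal Python sketch of how to tell them apart and convert between them, using the standard unicodedata module; the helper names same_text and has_combining_marks are made up for illustration:

```python
import unicodedata

precomposed = "\u00fc"   # ü as one code point, UTF-8 bytes: c3 bc
decomposed = "u\u0308"   # u + COMBINING DIAERESIS, UTF-8 bytes: 75 cc 88


def same_text(a: str, b: str) -> bool:
    """Compare two strings after normalizing both to NFC (hypothetical helper)."""
    return unicodedata.normalize("NFC", a) == unicodedata.normalize("NFC", b)


def has_combining_marks(text: str) -> bool:
    """True if any character is a combining mark (non-zero combining class)."""
    return any(unicodedata.combining(ch) != 0 for ch in text)


if __name__ == "__main__":
    print(precomposed == decomposed)         # raw comparison of code points
    print(same_text(precomposed, decomposed))  # comparison after normalization
    print(has_combining_marks(decomposed))
```

A plain string comparison treats the two as different even though they render the same, so normalizing both sides before comparing or storing is one way to avoid surprises.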