Last Updated: September 25, 2020 · hermanschaaf

Dealing with Unicode in Go

If Go is normally a walk in the park, working with Unicode in Go can be described as unexpectedly strolling through a minefield. Take, for example, this inconspicuous string from the front page: "Hello, 世界". What happens if we get the length of this string?

fmt.Println(len("Hello, 世界"))
>>> 13

Wait, what just happened? Shouldn't the length be 9? Where did the extra 4 come from?

Under the hood, a Go string is a read-only sequence of bytes, and len returns the number of bytes, not the number of characters. Unlike Python 2.x, Go doesn't make you distinguish between plain ASCII strings and Unicode strings, but it doesn't abstract away the underlying byte encoding either. Go source code is UTF-8, in which each of these Chinese characters takes three bytes while each ASCII character takes only one, so Go reports a length of 1*7+3*2=13. This can be really confusing, and it is a huge, juicy trap for those who only test their code with ASCII values. Take, for example:

hello := "Hello, 世界"
for i := range hello {
    fmt.Print(string(hello[i]))
}
>>> Hello, äç

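Err, okay, how did 世界 become äç? Indexing a string returns a single byte, so hello[i] is only the first byte of each character (0xE4 for 世, 0xE7 for 界), and string() turns those bytes into the code points U+00E4 (ä) and U+00E7 (ç). I can already hear you shouting, "but you can just use the second range return value!" Indeed you can!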

hello := "Hello, 世界"
for _, c := range hello {
    fmt.Print(string(c))
}
>>> Hello, 世界

Much better! Ah, but we can't always do it that way, can we? As a simple example, suppose we just want to compare each character with the next character in the string. A naive approach might do the following:

func CompareChars(word string) {
    for i, c := range word {
        if i < len(word)-1 {
            fmt.Print(string(word[i+1]) == string(c), ",")
        }
    }
}
...
CompareChars("hello")
>>> false,false,true,false,

And with tests for only ASCII, it will work perfectly. Now, what if we were saying hello in Chinese?

CompareChars("你好好好")
>>> false,false,false,false,

Oops. Of course, the characters will never be found equal, because word[i+1] is not the next character at all: it is a single byte, here the second byte of the current character (\xBD for 你, \xA5 for 好).
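
To see where each character actually sits in the byte sequence, here is a minimal sketch. It is not part of the original example, and runeStarts is a made-up helper name used only for illustration:

```go
package main

import "fmt"

// runeStarts returns the byte offset at which each character of s begins.
// Hypothetical helper for illustration only.
func runeStarts(s string) []int {
	var starts []int
	// Ranging over a string yields the byte index of each decoded character.
	for i := range s {
		starts = append(starts, i)
	}
	return starts
}

func main() {
	word := "你好好好"
	fmt.Println(len(word))        // 12 (bytes, not characters)
	fmt.Println(runeStarts(word)) // [0 3 6 9]

	// Index 1 falls inside 你, so word[1] is one raw byte of that character.
	fmt.Printf("%x\n", word[1])          // bd
	fmt.Println(string(word[1]) == "好") // false
}
```

The offsets jump by three because each of these characters is three bytes long in UTF-8, so i+1 never lands on the start of the next character.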

So, 怎么办呢 (what to do)? Luckily, if you dig deep enough, you will find that Go ships with the unicode/utf8 package. It doesn't offer much, but let's use it to go back to our first problem: finding the length of the "Hello, 世界" string:

import (
    "fmt"
    "unicode/utf8"
)
...
fmt.Println(utf8.RuneCountInString("Hello, 世界"))
>>> 9

Great, that's the count we expected at first! Now, how about updating our CompareChars function so it works with Unicode?

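func CompareChars(word string) {
    s := []byte(word)
    // Keep going while at least two characters remain.
    for utf8.RuneCount(s) > 1 {
        // Decode the first character and step past its bytes.
        r, size := utf8.DecodeRune(s)
        s = s[size:]
        // Peek at the following character without consuming it.
        nextR, _ := utf8.DecodeRune(s)
        fmt.Print(r == nextR, ",")
    }
}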
...
CompareChars("hello")
>>> false,false,true,false,
CompareChars("你好好好")
>>> false,true,true,

It worked! やった!
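
The same comparison can also be sketched with a []rune conversion, which decodes the string once instead of recounting the remaining characters on every pass. This is an alternative take, not the version above: compareAdjacent is a made-up name, and it returns its results rather than printing them so they are easier to check.

```go
package main

import "fmt"

// compareAdjacent reports, for each pair of neighbouring code points in word,
// whether the two are equal. Hypothetical helper for illustration only.
func compareAdjacent(word string) []bool {
	runes := []rune(word) // decode the UTF-8 bytes into code points once
	var out []bool
	for i := 0; i+1 < len(runes); i++ {
		out = append(out, runes[i] == runes[i+1])
	}
	return out
}

func main() {
	fmt.Println(compareAdjacent("hello"))    // [false false true false]
	fmt.Println(compareAdjacent("你好好好")) // [false true true]
}
```

The trade-off is an extra allocation for the []rune slice, which is usually fine for short strings.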

The moral of the story:
Be very careful when working with Unicode in Go, especially when looping through strings. Most importantly, always write tests that contain both Unicode and ASCII strings, and use the standard library's unicode/utf8 package where appropriate.

4 Responses

When working with Unicode you should be converting your strings to []rune. That's why the utf8 package is so sparse: most things are covered by []rune conversion and the unicode package.

over 1 year ago ·

@cthom06 Yeah, you're right, even the utf8 package itself is like a very thin abstraction layer for using strings as []rune. Putting Unicode in strings and assuming the best is an easy pitfall though, one that took me, being new to Go, slightly by surprise!

over 1 year ago ·

@unnali I wanted to leave the impression that you can never be too sure what you're going to get in a string, so I'm glad to hear it was unsettling :)

over 1 year ago ·

Also, the code is still wrong. Characters can span multiple runes (code points): there are also composed characters (http://en.wikipedia.org/wiki/Unicode#Ready-made_versus_composite_characters).
(Also see http://blog.golang.org/normalization and http://blog.golang.org/strings.)
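
A minimal sketch of that point, assuming é written in two ways, once as a single code point and once as e plus a combining accent (the counts helper is made up for illustration):

```go
package main

import (
	"fmt"
	"unicode/utf8"
)

// counts returns the number of code points and the number of bytes in s.
// Hypothetical helper for illustration only.
func counts(s string) (runes, bytes int) {
	return utf8.RuneCountInString(s), len(s)
}

func main() {
	precomposed := "\u00e9" // é as a single code point
	combining := "e\u0301"  // e followed by U+0301 COMBINING ACUTE ACCENT

	r, b := counts(precomposed)
	fmt.Println(r, b) // 1 2
	r, b = counts(combining)
	fmt.Println(r, b) // 2 3

	// Both render as é, yet they differ rune by rune and byte by byte.
	fmt.Println(precomposed == combining) // false
}
```

So a rune-by-rune comparison would still treat these two spellings of the same character as different.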

And a nice quote by Rob Pike -
"In fact, the definition of "character" is ambiguous and it would be a mistake to try to resolve the ambiguity by defining that strings are made of characters."

over 1 year ago ·