Last Updated: February 25, 2016
· alfateam123

Python: hashlib and Unicode?

Unicode belongs in every coder's toolkit (or at least we should all know it, as Joel says [1]). It is really useful when we're dealing with multilingual applications, but it can cause some problems...

If you try to hash a Unicode object in Python 2, you can get a UnicodeEncodeError:

>>> from hashlib import md5  # this is just an example
>>> uFoo = u"why dòn't ìnsért sòme strànge chàrs? ù.ù"
>>> md5(uFoo).hexdigest()
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
UnicodeEncodeError: 'ascii' codec can't encode character u'\xf2' in position 5: ordinal not in range(128)

Searching the internet, I found an interesting issue in the Python bug tracker [2], where you can read a good explanation of the error.
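
The traceback mentions the 'ascii' codec, which suggests that the Unicode object is implicitly converted to bytes with ASCII before being hashed. Here is a minimal sketch of that failing step on its own, with no hashlib involved. The variable names are made up for illustration, and the string uses an escape (`\xf2` is ò) so the source stays plain ASCII:

```python
# u"d\xf2n't" is u"dòn't"; the escape keeps this snippet ASCII-only.
text = u"d\xf2n't"

try:
    # Roughly what appears to happen implicitly when a Unicode object is hashed.
    text.encode("ascii")
    ascii_failed = False
except UnicodeEncodeError as exc:
    ascii_failed = True
    print(exc)  # same kind of message as in the traceback above

# Encoding explicitly with a charset that covers the character works fine.
utf8_bytes = text.encode("utf-8")
print(repr(utf8_bytes))
```

ASCII simply has no byte for ò, while UTF-8 represents it as two bytes.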

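The solution is really quick: encode your Unicode object to bytes in your favourite charset before hashing it. Keep in mind that the digest depends on the charset you pick, so use the same one everywhere.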

>>> from hashlib import md5  # this is just an example
>>> uFoo = u"why dòn't ìnsért sòme strànge chàrs? ù.ù"
>>> md5(uFoo.encode("utf-8")).hexdigest()
'80a0d8c0e0a53e2e3a9edafa4f0b2c03'
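
If you hash text in many places, it may help to wrap the encode step in a small function. This is only a sketch: `md5_hex` is a hypothetical helper name, not something hashlib provides, and the UTF-8 default is my assumption. It also shows that the charset matters, since the same text encoded differently gives a different digest:

```python
import hashlib


def md5_hex(text, encoding="utf-8"):
    """Hypothetical helper: encode the text explicitly, then hash the bytes."""
    return hashlib.md5(text.encode(encoding)).hexdigest()


sample = u"d\xf2n't"  # u"dòn't", escaped so the snippet stays ASCII-only

utf8_digest = md5_hex(sample)
latin1_digest = md5_hex(sample, "latin-1")

# The two encodings produce different bytes, hence different digests.
print(utf8_digest != latin1_digest)
```

Whichever charset you choose, everything that computes or checks the hash has to use the same one.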

More than one Unicode object? No problem!

>>> uBaz, uBar = u"lòl", u"53è"  # another example
>>> md5((uBaz + uBar).encode("utf-8")).hexdigest()
'81df2a1da0045cf65ce378a61604828f'
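
Concatenating first is not the only option. A hash object can also be fed piece by piece with `update()`. The sketch below uses a hypothetical helper, `md5_hex_many`, and assumes an encoding such as UTF-8 that does not prepend a byte order mark to every encoded piece:

```python
import hashlib


def md5_hex_many(parts, encoding="utf-8"):
    """Hypothetical helper: hash several Unicode objects one after another."""
    hasher = hashlib.md5()
    for part in parts:
        # Each piece is encoded on its own and appended to the running hash.
        hasher.update(part.encode(encoding))
    return hasher.hexdigest()


baz, bar = u"l\xf2l", u"53\xe8"  # u"lòl" and u"53è", escaped to stay ASCII-only

joined = hashlib.md5((baz + bar).encode("utf-8")).hexdigest()
incremental = md5_hex_many([baz, bar])

print(joined == incremental)
```

With UTF-8 both approaches hash the same bytes, so the incremental version is handy when the pieces are too many or too big to join in memory.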

Links:
[1] http://www.joelonsoftware.com/articles/Unicode.html
[2] http://bugs.python.org/issue2948