UTF-8 multibyte characters in Mac OS X filenames
From blog.bmonkeys.net/posts/20
I wrote a small CLI tool that lists the files in a directory, plus some extra information, in an ASCII table. To calculate the column width I need the length of the longest filename, and I ran into issues when filenames contained UTF-8 multibyte characters: the special characters were counted as two characters.
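The issue here is that OS X uses a slightly different UTF-8 than you would expect: filenames come back from the filesystem in a decomposed form, so "ö" arrives as a plain "o" followed by a combining diaeresis. Look at this: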
[1] pry(main)> str = File.basename(Dir["Desktop/*"][2])
=> "möp"
[2] pry(main)> str.length
=> 4
[3] pry(main)> "möp".length
=> 3
[4] pry(main)> str.encoding
=> #<Encoding:UTF-8>
[5] pry(main)> "möp".encoding
=> #<Encoding:UTF-8>
[6] pry(main)> str == "möp"
=> false
At this point I was confused. It looked the same and had the same encoding, yet it was not the same string (and it was longer). So what's the trick here? Treat the path as UTF-8-MAC and convert it to UTF-8, and everything is fine:
[7] pry(main)> str.encode('UTF-8', 'UTF-8-MAC').length
=> 3
[8] pry(main)> str.encode('UTF-8', 'UTF-8-MAC')
=> "möp"
[9] pry(main)> str.encode('UTF-8', 'UTF-8-MAC') == "möp"
=> true
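To see why the lengths differ without needing a Mac filesystem, here is a minimal sketch that builds both forms by hand. It swaps in String#unicode_normalize(:nfc) (Ruby 2.2+) instead of the UTF-8-MAC conversion shown above; for a name like "möp" both should give the same composed string. The helper names normalize_filename and longest_name_length are made up for this illustration and are not part of my tool.

```ruby
# "ö" as one code point (composed) vs. "o" + U+0308 COMBINING DIAERESIS
# (decomposed, the form OS X hands back for filenames).
composed   = "m\u00F6p"
decomposed = "mo\u0308p"

# Hypothetical helper: bring a filename into composed (NFC) form.
# Assumption: NFC normalization is an acceptable stand-in for
# encode('UTF-8', 'UTF-8-MAC') for the names you deal with.
def normalize_filename(name)
  name.unicode_normalize(:nfc)
end

# Hypothetical helper: character count of the longest name after
# normalization, e.g. for sizing a table column. Returns 0 for an empty list.
def longest_name_length(names)
  names.map { |name| normalize_filename(name).length }.max || 0
end

names = [decomposed, "readme.txt"]

puts decomposed.length                          # 4 code points
puts normalize_filename(decomposed).length      # 3 code points
puts normalize_filename(decomposed) == composed # true
puts longest_name_length(names)                 # 10
```

Note that String#length counts code points, not terminal columns, so this only fixes the combining-character case described here.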
Written by Sven Pachnit