Understanding Unicode Encoding in Ruby by Example

April 25, 2017
Unicode is fantastic

Every time I troubleshoot a Unicode problem, I lose time digging through the documentation.

So I compiled this list of methods with examples of how to use them. It saves me time by quickly refreshing my memory.

Tested with Ruby 2.4.1 on macOS Sierra 10.12.4.

# the encoding is a property of String
utf8_resume = "Résumé"
=> "Résumé"
utf8_resume.encoding
=> #<Encoding:UTF-8>

# translate the same string to different encodings
latin1_resume = utf8_resume.encode("ISO-8859-1")
latin9_resume = utf8_resume.encode("ISO-8859-15")
utf8_resume.encoding
=> #<Encoding:UTF-8>
latin1_resume.encoding
=> #<Encoding:ISO-8859-1>
latin9_resume.encoding
=> #<Encoding:ISO-8859-15>
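
The two Latin encodings above differ in only a handful of characters. As a minimal sketch, assuming the euro sign as a test character (it is not part of the example above), this shows what happens when a character has no slot in the target encoding. The variable names are illustrative only.

```ruby
euro = "\u20AC" # the euro sign

# ISO-8859-15 (Latin-9) has a slot for the euro sign: byte 0xA4
euro_latin9_bytes = euro.encode("ISO-8859-15").bytes

# ISO-8859-1 (Latin-1) has no euro sign, so encode raises
begin
  euro.encode("ISO-8859-1")
  latin1_error = nil
rescue Encoding::UndefinedConversionError => e
  latin1_error = e.class
end

# undef: :replace substitutes a replacement character ("?" here) instead of raising
euro_replaced = euro.encode("ISO-8859-1", undef: :replace)
```

Whether to raise or replace depends on whether silently losing a character is acceptable for the data at hand.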

# specify the string using codepoints
lower_spanish_accents = "\u00E1\u00E9\u00ED\u00F3\u00FA\u00F1".encode("UTF-8")
=> "áéíóúñ"
upper_spanish_accents = "\u00C1\u00C9\u00CD\u00D3\u00DA\u00D1".encode("UTF-8")
=> "ÁÉÍÓÚÑ"

# Length of the encoded text

# in UTF-8
#   'z' is 1 byte
#   'ñ' is 2 bytes
z = "\u007A"
=> "z"
z.each_byte.map{|c| "%X" % c}
=> ["7A"]
n_tilde = "\u00F1"
=> "ñ"
n_tilde.each_byte.map{|c| "%X" % c}
=> ["C3", "B1"]
n_tilde.bytesize
=> 2

# but in Latin-1 'ñ' is only 1 byte
n_tilde.encode('iso-8859-1').each_byte.map{|c| "%X" % c}
=> ["F1"]
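
The same difference shows up on a whole word: `length` counts characters, `bytesize` counts bytes. A small sketch, using "España" as the sample word (variable names are illustrative):

```ruby
espana = "Espa\u00F1a" # "España" in UTF-8

char_count        = espana.length                        # 6 characters
utf8_byte_count   = espana.bytesize                      # 7 bytes, the ñ takes 2
latin1_byte_count = espana.encode("ISO-8859-1").bytesize # 6 bytes, 1 per character
```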

# but Latin-1 matches the first 256 Unicode codepoints, so the codepoint
# of 'ñ' is the same in UTF-8 and Latin-1
n_tilde.each_codepoint.map {|c| "%X" % c}
=> ["F1"]
n_tilde.codepoints.size
=> 1
n_tilde.encode('iso-8859-1').each_codepoint.map{|c| "%X" % c}
=> ["F1"]
n_tilde.encode('iso-8859-1').codepoints.size
=> 1

# codepoints are plain integers, shown in base 10 by default
n_tilde.encode('iso-8859-1').codepoints
=> [241]
n_tilde.codepoints
=> [241]
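
The matching codepoints for ñ should not be read as a general rule. For a non-Unicode string, `codepoints` returns the character's code in that encoding. A sketch, assuming the euro sign and ISO-8859-15 as the counterexample:

```ruby
euro = "\u20AC"

# in UTF-8 the codepoint is the Unicode one, U+20AC
utf8_codepoints = euro.codepoints

# in ISO-8859-15 it is the single-byte code 0xA4
latin9_codepoints = euro.encode("ISO-8859-15").codepoints

# transcoding back to UTF-8 recovers the Unicode codepoint
round_trip_codepoints = euro.encode("ISO-8859-15").encode("UTF-8").codepoints
```

To compare codepoints across strings safely, transcode both to UTF-8 first.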

# formats to specify codepoints

# single codepoint
# exactly 4 hex digits
#   \uXXXX            <==> U+XXXX
# multiple codepoints
# hex digits
# leading 0 is optional
#   \u{X XX XXX XXXX} <==> U+000X U+00XX U+0XXX U+XXXX
"\u0045\u0073\u0070\u0061\u00F1\u0061"
=> "España"
"\u{45 73 70 61 F1 61}"
=> "España"
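
When the codepoints arrive as integers rather than as a literal, the string can be built at runtime. A short sketch of two ways to do it, using the same "España" codepoints:

```ruby
codepoints = [0x45, 0x73, 0x70, 0x61, 0xF1, 0x61]

# the U directive packs each integer as a UTF-8 encoded codepoint
from_pack = codepoints.pack("U*")

# Integer#chr with an explicit encoding converts one codepoint at a time
from_chr = codepoints.map { |cp| cp.chr(Encoding::UTF_8) }.join
```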

# sometimes codepoint and byte sequence will match
"\u007f".each_codepoint.map{|c| "%X" % c}
=> ["7F"]
"\u007f".each_byte.map{|c| "%X" % c}
=> ["7F"]

# but this isn't always true
# see also example for ñ above
"\u0080".each_codepoint.map{|c| "%X" % c}
=> ["80"]
"\u0080".each_byte.map{|c| "%X" % c}
=> ["C2", "80"]

# not all byte sequences are valid encodings
"\u3042".valid_encoding?
=> true
"\u3042\x81".valid_encoding?
=> false

# scrub to the rescue
scrubbed = "\u3042\x81".scrub('')
scrubbed.valid_encoding?
=> true
scrubbed.each_codepoint.map{|c| "%X" % c}
=> ["3042"]
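
Passing `''` to `scrub` drops the invalid bytes silently. Two other forms keep a trace of what was removed. A sketch on the same broken string:

```ruby
broken = "\u3042\x81"

# with no argument, each invalid sequence becomes U+FFFD (replacement character)
default_scrub = broken.scrub

# with a block, the invalid bytes are passed in and the result is substituted
block_scrub = broken.scrub { |bytes| "<#{bytes.unpack('H*').first}>" }
```

The block form is handy for logging, since the offending bytes stay visible in the output.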

# building the string using the internal representation, i.e. byte by byte
espana_utf8 = [0x45, 0x73, 0x70, 0x61, 0xc3, 0xb1, 0x61]
=> [69, 115, 112, 97, 195, 177, 97]
espana_utf8.pack('C*').force_encoding('utf-8')
=> "España"

# now in Latin1 (different byte sequence)
espana_latin1 = [0x45, 0x73, 0x70, 0x61, 0xf1, 0x61]
=> [69, 115, 112, 97, 241, 97]
espana_latin1.pack('C*').force_encoding('ISO-8859-1')
=> "Espa\xF1a"
# irb shows the ñ as \xF1 because the string is Latin-1, not UTF-8; the encoding itself is valid
espana_latin1.pack('C*').force_encoding('ISO-8859-1').valid_encoding?
=> true
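
`force_encoding` only relabels the bytes, while `encode` rewrites them. A sketch of the difference, reusing the Latin-1 bytes for "España" (variable names are illustrative):

```ruby
# pack("C*") returns a binary (ASCII-8BIT) string
latin1_bytes = [0x45, 0x73, 0x70, 0x61, 0xF1, 0x61].pack("C*")

# wrong label: the bytes are untouched, but 0xF1 0x61 is not valid UTF-8
mislabeled_valid = latin1_bytes.dup.force_encoding("UTF-8").valid_encoding?

# right label first, then transcode: the ñ becomes the two bytes C3 B1
labeled          = latin1_bytes.dup.force_encoding("ISO-8859-1")
transcoded       = labeled.encode("UTF-8")
transcoded_bytes = transcoded.bytes
```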

# currency symbols in UTF-8
currency_utf8 = "\u{20AC A3 A5}"
=> "€£¥"
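
These symbols also show that UTF-8 is not limited to 1 or 2 bytes per character. A sketch that lists the byte size and hex bytes of each symbol:

```ruby
currency_utf8 = "\u{20AC A3 A5}"

# bytes needed per character in UTF-8
sizes = currency_utf8.each_char.map { |ch| ch.bytesize }

# the euro sign (U+20AC) is above U+07FF, so it takes 3 bytes
euro_hex = currency_utf8[0].each_byte.map { |b| "%X" % b }
```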

# Convert a number from any base to any base
class String
  def convert_base(from, to)
    to_i(from).to_s(to)
  end
end
# example: character "~", from base 16 to base 10
"7E".convert_base(16, 10)
=> "126"
# example: decimal 255 to hexadecimal
'255'.convert_base(10, 16)
=> "ff"
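
The same conversion works without reopening `String`. The sketch below uses a hypothetical standalone `convert_base` method built on `Kernel#Integer`, which raises on malformed input where `String#to_i` returns 0, and ties the numbers back to characters.

```ruby
# hypothetical helper, a standalone variant of the String method above
def convert_base(digits, from, to)
  Integer(digits, from).to_s(to)
end

tilde_decimal = convert_base("7E", 16, 10) # "126"

# from a hex codepoint to the character and back
tilde_char  = "7E".to_i(16).chr      # "~"
n_tilde_hex = "\u00F1".ord.to_s(16)  # "f1"

# Integer() rejects input that to_i would silently turn into 0
begin
  convert_base("zz", 16, 10)
  bad_input_error = nil
rescue ArgumentError => e
  bad_input_error = e.class
end
```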