Understanding Unicode Encoding in Ruby by Example
April 25, 2017
Every time I have to troubleshoot a problem with Unicode, it takes time to go through the documentation.
So I compiled this list of methods with examples of how to use them. It saves me time by quickly refreshing my memory.
Tested with Ruby 2.4.1 on macOS Sierra 10.12.4.
# the encoding is a property of String
utf8_resume = "Résumé"
=> "Résumé"
utf8_resume.encoding
=> #<Encoding:UTF-8>

# translate the same string to different encodings
latin1_resume = utf8_resume.encode("ISO-8859-1")
latin9_resume = utf8_resume.encode("ISO-8859-15")
utf8_resume.encoding
=> #<Encoding:UTF-8>
latin1_resume.encoding
=> #<Encoding:ISO-8859-1>
latin9_resume.encoding
=> #<Encoding:ISO-8859-15>

# specify the string using codepoints
lower_spanish_accents = "\u00E1\u00E9\u00ED\u00F3\u00FA\u00F1".encode("UTF-8")
=> "áéíóúñ"
upper_spanish_accents = "\u00C1\u00C9\u00CD\u00D3\u00DA\u00D1".encode("UTF-8")
=> "ÁÉÍÓÚÑ"

# length of the encoded text
# in UTF-8
#   'z' is 1 byte
#   'ñ' is 2 bytes
z = "\u007A"
=> "z"
z.each_byte.map{|c| "%X" % c}
=> ["7A"]
n_tilde = "\u00F1"
=> "ñ"
n_tilde.each_byte.map{|c| "%X" % c}
=> ["C3", "B1"]
n_tilde.bytesize
=> 2

# but in Latin-1 'ñ' is only 1 byte
n_tilde.encode('iso-8859-1').each_byte.map{|c| "%X" % c}
=> ["F1"]

# Latin-1 maps one-to-one onto the first 256 Unicode codepoints,
# so in codepoints there's no difference between UTF-8 and Latin-1
n_tilde.each_codepoint.map {|c| "%X" % c}
=> ["F1"]
n_tilde.codepoints.size
=> 1
n_tilde.encode('iso-8859-1').each_codepoint.map{|c| "%X" % c}
=> ["F1"]
n_tilde.encode('iso-8859-1').codepoints.size
=> 1

# codepoints are returned as Integers (displayed in base 10)
n_tilde.encode('iso-8859-1').codepoints
=> [241]
n_tilde.codepoints
=> [241]

# formats to specify codepoints
#   single codepoint
#     exactly 4 hex digits
#     \uXXXX <==> U+XXXX
#   multiple codepoints
#     1 to 6 hex digits each
#     leading 0 is optional
#     \u{X XX XXX XXXX} <==> U+000X U+00XX U+0XXX U+XXXX
"\u0045\u0073\u0070\u0061\u00F1\u0061"
=> "España"
"\u{45 73 70 61 F1 61}"
=> "España"

# sometimes codepoint and byte sequence will match
"\u007f".each_codepoint.map{|c| "%X" % c}
=> ["7F"]
"\u007f".each_byte.map{|c| "%X" % c}
=> ["7F"]

# but this isn't always true
# see also example for ñ above
"\u0080".each_codepoint.map{|c| "%X" % c}
=> ["80"]
"\u0080".each_byte.map{|c| "%X" % c}
=> ["C2", "80"]

# not all byte sequences are valid encodings
"\u3042".valid_encoding?
=> true
"\u3042\x81".valid_encoding?
=> false

# scrub to the rescue
scrubbed = "\u3042\x81".scrub('')
scrubbed.valid_encoding?
=> true
scrubbed.each_codepoint.map{|c| "%X" % c}
=> ["3042"]

# building the string using the internal representation, i.e. byte by byte
# ('C*' packs each integer as an unsigned 8-bit byte)
espana_utf8 = [0x45, 0x73, 0x70, 0x61, 0xc3, 0xb1, 0x61]
=> [69, 115, 112, 97, 195, 177, 97]
espana_utf8.pack('C*').force_encoding('utf-8')
=> "España"

# now in Latin-1 (different byte sequence)
espana_latin1 = [0x45, 0x73, 0x70, 0x61, 0xf1, 0x61]
=> [69, 115, 112, 97, 241, 97]
espana_latin1.pack('C*').force_encoding('ISO-8859-1')
=> "Espa\xF1a"

# although the ñ doesn't look correct, the encoding is correct
espana_latin1.pack('C*').force_encoding('ISO-8859-1').valid_encoding?
=> true

# currency symbols in UTF-8
currency_utf8 = "\u{20AC A3 A5}"
=> "€£¥"

# convert a number from any base to any base
class String
  def convert_base(from, to)
    to_i(from).to_s(to)
  end
end

# example: letter "~", from base 16 to base 10
"7E".convert_base(16, 10)
=> "126"

# example: decimal 255 to hexadecimal
'255'.convert_base(10, 16)
=> "ff"
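
The Latin-1, Latin-9 and currency examples above raise a follow-up question: what happens when the target encoding has no slot for a character? The euro sign is a handy test case, since ISO-8859-15 includes it and ISO-8859-1 does not. The sketch below is only an illustration. The helper names `representable?` and `lossy_encode` are made up for this post and are not part of Ruby; they wrap the standard `String#encode` and its `invalid:`, `undef:` and `replace:` options.

```ruby
# Hypothetical helper: true when every character in text exists in target.
def representable?(text, target)
  text.encode(target)
  true
rescue Encoding::UndefinedConversionError, Encoding::InvalidByteSequenceError
  false
end

# Hypothetical helper: transcode, substituting a marker for anything the
# target encoding cannot represent instead of raising an exception.
def lossy_encode(text, target, marker = '?')
  text.encode(target, invalid: :replace, undef: :replace, replace: marker)
end

currency = "\u{20AC A3 A5}" # "€£¥"

representable?(currency, 'ISO-8859-15') # => true
representable?(currency, 'ISO-8859-1')  # => false, Latin-1 has no euro sign

# Latin-9 stores the euro sign as the single byte A4
currency.encode('ISO-8859-15').each_byte.map { |b| "%X" % b }
# => ["A4", "A3", "A5"]

# Latin-1 falls back to the marker ("?" is 3F) for the euro sign only
lossy_encode(currency, 'ISO-8859-1').each_byte.map { |b| "%X" % b }
# => ["3F", "A3", "A5"]
```

Whether a silent replacement is acceptable depends on the data. For anything that must round-trip, letting `encode` raise is probably the safer default.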