Unicode and encodings is always a fun thing. This script encodes an input string using different encodings and shows the output length:

# -*- coding: utf-8 -*-
import sys

if len(sys.argv) > 1:
code_points = [unicode(c, 'utf-8') for c in sys.argv[1:]]
else:
# Testing values
code_points = [u'\U0001F37A\U00000045\U0000039B', u'\U0001F37A']

def handle_encoding(encoding, code_point):
try:
values = ['{:>15}'.format(encoding),
' ---> ',
':'.join('{0:x}'.format(ord(c)) for c in
code_point.encode(encoding)),
' (', str(len(code_point.encode(encoding))), ')']
print ''.join(values)
except Exception as ex:
values = ['{:>15}'.format(encoding),
' ---> ',
'Unable to encode the codepoint in {0}'.format(encoding)]
print ''.join(values)

for code_point in code_points:
print '{:>15}'.format('character') + ' ---> ' + code_point
print '{:>15}'.format('code points') + ' ---> ' + repr(code_point)
for coding in ('ascii', 'latin-1', 'utf-8', 'utf-16', 'utf-16be', 'utf-16le'):
handle_encoding(coding, code_point)

Example:

python encoding.py "OLA KE ASE"
character ---> OLA KE ASE
code points ---> u'OLA KE ASE'
ascii ---> 4f:4c:41:20:4b:45:20:41:53:45 (10)
latin-1 ---> 4f:4c:41:20:4b:45:20:41:53:45 (10)
utf-8 ---> 4f:4c:41:20:4b:45:20:41:53:45 (10)
utf-16 ---> ff:fe:4f:0:4c:0:41:0:20:0:4b:0:45:0:20:0:41:0:53:0:45:0 (22)
utf-16be ---> 0:4f:0:4c:0:41:0:20:0:4b:0:45:0:20:0:41:0:53:0:45 (20)
utf-16le ---> 4f:0:4c:0:41:0:20:0:4b:0:45:0:20:0:41:0:53:0:45:0 (20)

Happy encoding :monkey:

« Home