Spot any errors? Let me know, but unleash your pedant politely please.

Saturday, 27 November 2010

Writing UTF-8 files from Python

As always, there may be better ways to do this (using XML libraries, for example), but it took me far too long to figure this out, given how little there is to do to fix the problem. A lot of the examples I found on Google didn't answer this specifically and just added to the confusion.

The problem:
a = 'âêîôŷ'
print a

gives this error:
UnicodeDecodeError: 'ascii', '\xc3\xa2\xc3\xaa\xc3\xae\xc3\xb4\xc5\xb7', 0, 1, 'ordinal not in range(128)'

The proper way to define a unicode string is this:
a = u'âêîôŷ'
print a

which yields:
âêîôŷ


In my search, though, there was lots of talk of how to convert strings to UTF-8, and this is *not* what you do if you want to write to a UTF-8 file. If you convert to UTF-8 before writing, you'll probably get errors because the result will contain byte values above 127, which the default ASCII codec can't handle.
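To make that concrete, here is a minimal sketch of what an encoded string looks like underneath. It assumes (as far as I understand Python 2) that a codecs writer tries to decode a byte string as ASCII before encoding it, so the sketch performs that decode explicitly. The names text, encoded and decode_failed are just for illustration.

```python
# -*- coding: utf-8 -*-
text = u'\u00e2\u00ea\u00ee\u00f4\u0177'   # the five characters âêîôŷ
encoded = text.encode('utf-8')             # a byte string, not unicode

print(len(text))      # 5 characters
print(len(encoded))   # 10 bytes, every one of them above 127

# Decoding those bytes as ASCII is what fails.
try:
    encoded.decode('ascii')
    decode_failed = False
except UnicodeDecodeError:
    decode_failed = True

print(decode_failed)  # True
```

So the encoded version is fine as raw bytes, but it is no longer something an encoding-aware file object can treat as text.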
This is how you do it...

import codecs

ascii = 'abcdef'
uni = u'⢸ðêƒ'
# codecs.open returns a file object that encodes unicode strings as UTF-8 on write
out = codecs.open('utf-8.xml', mode='w', encoding='utf-8')
out.write(ascii)
out.write(uni)
out.close()


The only differences here are that you must use the 'u' prefix when defining non-ASCII literals, and you need to use codecs.open, with the encoding specified, when opening the file.
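If you want to check that the file really contains what you wrote, a round trip is one way to do it. This is only a sketch: write_utf8 and read_utf8 are helper names I made up, the file goes into a temporary directory rather than the current one, and the non-ASCII characters (ðêƒ) are written as escapes so the snippet doesn't depend on the source file's own encoding.

```python
import codecs
import os
import tempfile


def write_utf8(path, pieces):
    """Write each unicode string in pieces to path, encoded as UTF-8."""
    out = codecs.open(path, mode='w', encoding='utf-8')
    try:
        for piece in pieces:
            out.write(piece)
    finally:
        out.close()


def read_utf8(path):
    """Read the whole file back as one unicode string."""
    inp = codecs.open(path, mode='r', encoding='utf-8')
    try:
        return inp.read()
    finally:
        inp.close()


path = os.path.join(tempfile.mkdtemp(), 'utf-8.xml')
write_utf8(path, [u'abcdef', u'\u00f0\u00ea\u0192'])
round_trip = read_utf8(path)

print(len(round_trip))          # 9 characters
print(os.path.getsize(path))    # 12 bytes: 6 ASCII + 3 two-byte characters
```

The character count and the byte count differ, which is exactly what you'd expect from UTF-8: the ASCII characters take one byte each and these three take two.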

If, when you read the file, it appears to have 2 strange characters rather than the one unicode character you expect, the file is probably OK; it's the viewer that isn't reading UTF-8 properly.
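You can reproduce the "2 strange characters" effect yourself. The sketch below assumes the viewer is treating the file as Latin-1 (a common case, but only a guess about any particular viewer); the variable names are mine.

```python
# -*- coding: utf-8 -*-
original = u'\u00ea'                      # the single character ê
raw_bytes = original.encode('utf-8')      # two bytes: 0xC3 0xAA

misread = raw_bytes.decode('latin-1')     # what a Latin-1 viewer would show
correct = raw_bytes.decode('utf-8')       # what a UTF-8 viewer would show

print(len(misread))          # 2
print(len(correct))          # 1
print(correct == original)   # True
```

The bytes are identical in both cases; only the interpretation changes, which is why the file itself is fine.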
