Spot any errors? Let me know, but unleash your pedant politely, please.

Saturday 27 November 2010

Writing UTF-8 files from Python

As always, there may be better ways to do this (using XML libraries, for example), but it took me far too long to figure this out, given how little is needed to fix the problem. A lot of the examples I found on Google didn't answer this specifically and just added to the confusion.

The problem:
a = 'âêîôŷ'
print a

gives this error:
UnicodeDecodeError: 'ascii', '\xc3\xa2\xc3\xaa\xc3\xae\xc3\xb4\xc5\xb7', 0, 1, 'ordinal not in range(128)'

The proper way to define a unicode string is this:
a = u'âêîôŷ'
print a

which yields:
âêîôŷ
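
To make the difference between the two literals a bit more concrete, here is a small sketch (my own illustration, with made-up variable names) that encodes the unicode string to UTF-8 by hand and then tries to treat the result as ASCII. The bytes it produces are the same ones that appear in the error message above.

```python
# -*- coding: utf-8 -*-
text = u'âêîôŷ'             # unicode string: 5 characters
raw = text.encode('utf-8')  # byte string: 10 bytes, 2 per character here

print(len(text))  # 5
print(len(raw))   # 10

# Treating those bytes as ASCII fails on the very first byte (0xc3).
failed = False
start = end = None
try:
    raw.decode('ascii')
except UnicodeDecodeError as exc:
    failed = True
    start, end = exc.start, exc.end

print('%d %d' % (start, end))  # 0 1
```

The `0 1` is the same "0, 1" position reported in the traceback: the decoder gives up on the first byte.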


In my search, though, there was lots of talk of how to convert strings to UTF-8, and this is *not* what you do if you want to write to a UTF-8 file. If you convert to UTF-8 before writing, you'll probably get errors because the result is a byte string containing values above 127, which the file object then tries to decode as ASCII.
This is how you do it...

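# -*- coding: utf-8 -*-
import codecs

ascii = 'abcdef'
uni = u'⢸ðêƒ'
# codecs.open encodes unicode strings to UTF-8 as they are written
out = codecs.open('utf-8.xml', mode='w', encoding='utf-8')
out.write(ascii)
out.write(uni)
out.close()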


The only differences here are that you must use the 'u' prefix when defining literals, and you need to use codecs.open, with the encoding specified, when opening the file.
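
As a quick sanity check, the same idea can be wrapped in a pair of helpers and tested with a round trip. This is only a sketch: write_utf8 and read_utf8 are names I've made up for illustration, the file goes into a temporary directory, and every piece is passed as a unicode literal so that it should behave the same on newer Pythons too.

```python
# -*- coding: utf-8 -*-
import codecs
import os
import tempfile


def write_utf8(path, pieces):
    """Write each text piece to path, encoded as UTF-8 (hypothetical helper)."""
    with codecs.open(path, mode='w', encoding='utf-8') as out:
        for piece in pieces:
            out.write(piece)


def read_utf8(path):
    """Read the whole file back as a unicode string (hypothetical helper)."""
    with codecs.open(path, mode='r', encoding='utf-8') as src:
        return src.read()


path = os.path.join(tempfile.mkdtemp(), 'utf-8.xml')
write_utf8(path, [u'abcdef', u'⢸ðêƒ'])
round_trip = read_utf8(path)

# What comes back should be exactly what went in.
print(round_trip == u'abcdef' + u'⢸ðêƒ')  # True
```

Using `with` means the file is closed even if one of the writes raises, which saves remembering the explicit close().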

If, when you read the file, it appears to have two strange characters rather than the one Unicode character you expect, the file is probably OK; it's the viewer that isn't reading it as UTF-8.
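
If you'd rather not trust the viewer, you can look at the raw bytes yourself. The sketch below (the file name and variable names are just for illustration) writes a single 'ê', reads the file back in binary mode, and decodes the same two bytes twice: once as UTF-8 and once as Latin-1, which I'm assuming is roughly what a confused viewer does.

```python
# -*- coding: utf-8 -*-
import codecs
import os
import tempfile

path = os.path.join(tempfile.mkdtemp(), 'check.xml')
with codecs.open(path, mode='w', encoding='utf-8') as out:
    out.write(u'ê')

# Read the raw bytes, with no decoding applied.
with open(path, 'rb') as src:
    raw = src.read()

as_utf8 = raw.decode('utf-8')      # what a UTF-8 aware viewer shows
as_latin1 = raw.decode('latin-1')  # what a Latin-1 viewer shows

print(len(raw))        # 2
print(len(as_utf8))    # 1
print(len(as_latin1))  # 2
```

The file holds two bytes (0xc3 0xaa). Read as UTF-8 they are one character; read as Latin-1 they become the two characters 'Ãª', which is the "strange characters" symptom.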

Friday 26 November 2010

More fun with Python

I'd recently written a little Python app to create a load of test data. The test data is XML, and should be UTF-8. I'd not really considered this properly, and for my original purposes, it was irrelevant. For a bit of fun/experimentation/learning, I put a Tk front end on it, and emailed the project team to let them know, just in case it was useful.

Coincidentally, the vendor of the external product that would be producing these XML files in the real world was going to be late by several months, meaning that the XML files would need to be hand-crafted, the test data generator turned into a deliverable, and I briefly turned from tester into nightCoder.

There were a number of feature requests. My testing colleague started testing and raising defects against my code. Testing revealed areas in which it could be improved. I added a log file, properties files, some exception handling and error reporting dialogs. I had a real developer moment when it was deployed and went wrong, and I said (with a tester's smile on my face), "well that doesn't happen on my machine!".

While trying to figure out that problem on a Swiss keyboard, I typed garbage into a mandatory field, and committing it yielded another error, this time caused by non-ASCII characters. As the client is Swiss, and these fields will probably include non-ASCII characters, a fix was definitely required. Had this been just a learning exercise, I might not have been too worried. As I was now delivering this software, I had no option other than to figure it out. This highlights my main problem with self-teaching: I really struggle to find projects, and often abandon them in an unfinished state because nobody is relying on the solution.