# thraxil.org:

## unicodification

by anders pearson Tue 01 Nov 2005 23:05:07

Unicode is a wonderful thing. it is also occasionally the bane of my existence.

Joel Spolsky has a classic article on The Absolute Minimum Every Software Developer Absolutely, Positively Must Know About Unicode and Character Sets (No Excuses!) that covers the basics nicely. he doesn't go much into the specifics of dealing with unicode issues in any particular programming language or platform though.

Python does a decent job of making it possible to write applications that are unicode aware. There are some decent pages out there that cover the basics of python and unicode. it's not very hard. python has two different kinds of internal representations of strings, unicode strings and 8-bit non-unicode strings (basically ASCII). all of python's built-in functionality and core libraries will work with either just fine. you can mix and match them without having to pay much attention to what kind of string you have. it only gets tricky when python has to deal with an outside system, like I/O, network sockets, or databases. unfortunately, that's pretty often and the bugs that pop up can be maddening to track down and fix.

the usual scenario is that you build your application and test it and everything works fine. then you release it to the world and the first user who comes along copies and pastes in some text from MS Word with weird "smart" quotes and assorted non-ASCII junk or tries to write in chinese and your precious application chokes and gurgles and starts spitting up arcane UnicodeDecodeError messages all over the users. then you get to spend some quality time with a pile of tracebacks trying to figure out where in your code (or the code of a library you're using) something isn't getting encoded properly. half the time, fixing the bug that cropped up creates another, more subtle unicode related bug somewhere else. just a fun time all around.
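the failure is easy to reproduce in isolation. this is a minimal sketch in Python 3 syntax, where `bytes` plays the role of the 8-bit string and `str` is the unicode string. it assumes the pasted text arrives as cp1252 bytes, which is common for Word on Windows but is an assumption of the sketch, not something taken from a real traceback:

```python
# "hi" wrapped in Word-style curly quotes, assumed (for this sketch)
# to have arrived from the browser as cp1252 bytes
pasted = b"\x93hi\x94"

try:
    # treating the bytes as ASCII is what blows up
    pasted.decode("ascii")
    failed = False
except UnicodeDecodeError:
    failed = True

# decoding with the encoding the bytes were actually written in works
text = pasted.decode("cp1252")
```

the point is only that the error comes from guessing the wrong encoding at a boundary, not from the curly quotes themselves.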

i've been on a unicode kick lately at work and spent some time experimenting and getting very familiar with the unicode related quirks of the particular technology stack that i prefer to work with at the moment: cherrypy, SQLObject, PostgreSQL, simpleTAL, and textile. here are my notes on how i got them all to play nicely together wrt unicode.

the basic strategy is that application code should try to deal with unicode strings at all times and only encode and decode when talking to the browser or some component that for some reason can't handle unicode strings. whenever a string is encoded, it should be encoded as UTF8 (if you're writing applications that would mostly be used by, eg, chinese speakers, you might want to go with UTF16, which is more compact for that kind of text, but for most of us, UTF8 is all kinds of goodness).
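to put some numbers on the encoding choice, here is a small sketch (Python 3 syntax) that counts bytes for the same text in each encoding. `encoded_sizes` is a hypothetical helper made up for this illustration:

```python
def encoded_sizes(text):
    """Return how many bytes `text` takes in each encoding."""
    # the -le variants are used so that no byte order mark is counted
    return {
        "utf8": len(text.encode("utf-8")),
        "utf16": len(text.encode("utf-16-le")),
        "utf32": len(text.encode("utf-32-le")),
    }

chinese = encoded_sizes("\u738b\u83f2")  # two chinese characters
english = encoded_sizes("hello")         # five ASCII characters
```

the two chinese characters take 6 bytes in UTF8, 4 in UTF16, and 8 in UTF32, while "hello" takes 5, 10, and 20. so UTF16 can win for mostly-CJK text, and UTF8 wins for anything that is mostly ASCII (which includes most HTML markup).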

### postgresql

postgresql supports unicode out of the box. however, on gentoo at least, it doesn't encode databases in UTF8 by default, instead using "SQL_ASCII" or something. i didn't actually test too much to see what went wrong if you didn't use a UTF8 encoded database. i would assume that kittens get murdered and the baby jesus cries and all sorts of other horrible things happen. anyway, just remember to create databases with:

% createdb -Eunicode mydatabase

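and everything should be fine. converting existing databases isn't very hard either using iconv. just dump it, convert it (the example below assumes the existing data is latin1; change the -f argument to match whatever yours really is), drop the database, recreate it with the right encoding and import: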

% pg_dump mydatabase > mydatabase_dump.sql
% iconv -f latin1 -t utf8 mydatabase_dump.sql > mydatabase_dump_utf8.sql
% dropdb mydatabase
% createdb -Eunicode mydatabase
% psql mydatabase -f mydatabase_dump_utf8.sql
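if iconv isn't handy, the conversion step can be approximated in python. this is a rough sketch in Python 3 syntax, not a replacement for the commands above; `reencode_file` and the file names are hypothetical, and it assumes the dump really is latin1:

```python
import io
import os
import tempfile

def reencode_file(src_path, dst_path, src_encoding="latin-1", dst_encoding="utf-8"):
    """Copy a text file, decoding with one encoding and encoding with another."""
    # newline="" keeps the original line endings untouched
    with io.open(src_path, "r", encoding=src_encoding, newline="") as src:
        with io.open(dst_path, "w", encoding=dst_encoding, newline="") as dst:
            # read in chunks so a large dump doesn't have to fit in memory
            for chunk in iter(lambda: src.read(65536), ""):
                dst.write(chunk)

# quick demonstration on a throwaway file
workdir = tempfile.mkdtemp()
src = os.path.join(workdir, "dump_latin1.sql")
dst = os.path.join(workdir, "dump_utf8.sql")
with open(src, "wb") as f:
    f.write(b"INSERT INTO page VALUES ('caf\xe9');\n")
reencode_file(src, dst)
with open(dst, "rb") as f:
    converted = f.read()
```

the single latin1 byte for the accented e comes out as the two-byte UTF8 sequence; everything that was plain ASCII is unchanged.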

### cherrypy

cherrypy has encoding and decoding filters that make it a cinch to ensure that the application <-> browser boundary converts everything properly. as long as you have:

cherrypy.config.update({'encodingFilter.on' : True,
                        'encodingFilter.encoding' : 'utf8',
                        'decodingFilter.on' : True})

in the startup, it should do the right thing. all your output will be encoded as UTF8 when it's sent to the browser, charsets will be set in the headers, and your application will get all its input as nice unicode strings.
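conceptually, the two filters do something like the following at the edges of the application. this is only a sketch of the idea in Python 3 syntax, not cherrypy's actual code; `decode_params` and `encode_response` are hypothetical names:

```python
def decode_params(raw_params, encoding="utf-8"):
    """Turn incoming byte-string parameters into unicode strings."""
    return {
        key: value.decode(encoding) if isinstance(value, bytes) else value
        for key, value in raw_params.items()
    }

def encode_response(body, encoding="utf-8"):
    """Encode an outgoing unicode body and build a matching charset header."""
    headers = {"Content-Type": "text/html; charset=%s" % encoding}
    return body.encode(encoding), headers

# bytes in from the browser, unicode inside the app, bytes back out
params = decode_params({"q": b"\xe7\x8e\x8b\xe8\x8f\xb2"})
payload, headers = encode_response("<p>%s</p>" % params["q"])
```

everything between those two calls only ever sees unicode strings, which is the whole strategy in miniature.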

### SQLObject

SQLObject has the tough job of playing border patrol with the database. for the most part, it just works. it has a UnicodeCol type that makes most operations smooth. so instead of defining a class like:

class Page(SQLObject):
    title = StringCol(length=256)
    body  = StringCol()

you do:

class Page(SQLObject):
    title = UnicodeCol(length=256)
    body  = UnicodeCol()

and all is well. you can do things like:

>>> p = Page(title=u"\u738b\u83f2",body=u"\u738b\u83f2 is a chinese pop star.")
>>> print p.title.encode('utf8')

unicode goes in, unicode comes out. i did discover a few places though that SQLObject wasn't happy about getting unicode. eg, doing:

>>> results = list(Page.select(Page.q.title == u"\u738b\u83f2"))
Traceback ... etc. big ugly traceback ending in:
  File "/usr/lib/python2.4/site-packages/sqlobject/dbconnection.py", line 295, in _executeRetry
    return cursor.execute(query)
TypeError: argument 1 must be str, not unicode

so you do have to be careful to encode your strings before doing a query like that. ie, this works:

>>> results = list(Page.select(Page.q.title == u"\u738b\u83f2".encode('utf8')))

since it's just a wrapper around the same functionality, you need to use the same care with alternateID columns and Table.byColumnName() queries. so

>>> u = User.byUsername(username)

is out and

>>> u = User.byUsername(username.encode('utf8'))

is in.

similarly, it doesn't like unicode for the orderBy parameter:

>>> r = list(Page.select(Page.q.title == "foo", orderBy=u"title"))

gives you another similar error. this only comes up because i frequently do something like:

# in some cherrypy controller class
@cherrypy.expose
def search(self, q="", order_by="modified"):
    r = Page.select(Page.q.title == q, orderBy=order_by)
    # ... format the results and send them to the browser

now, using the cherrypy decodingFilter, which otherwise makes unicode errors disappear, the order_by that gets sent in from the browser is a unicode string. once again, you'll need to make sure you encode it as UTF8.
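since the same encode step shows up in queries, `byColumnName()` lookups, and `orderBy`, it can be wrapped in one small helper. this is a sketch in Python 3 syntax; `to_utf8` is a hypothetical name, and on the Python 2 stack described here the type check would be against `unicode` rather than `str`:

```python
def to_utf8(value, encoding="utf-8"):
    """Encode unicode strings to bytes; pass everything else through untouched."""
    if isinstance(value, str):  # `unicode` on Python 2
        return value.encode(encoding)
    return value

order_by = to_utf8("modified")
title = to_utf8("\u738b\u83f2")
already_bytes = to_utf8(b"title")
```

in the controller above it would be used as `Page.select(Page.q.title == to_utf8(q), orderBy=to_utf8(order_by))`. passing non-strings through unchanged means it is safe to apply to a value that may already be encoded.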

lastly, EnumCols don't get converted automatically:

>>> class Ex(SQLObject):
...   foo = EnumCol(enumValues=['a','b','c'])
...
>>> e = Ex(foo=u"a")

will give the usual TypeError exception. it also appears that you just can't use unicode in EnumCols at all:

>>> class Ex2(SQLObject):
...    foo = EnumCol(enumValues=[u"a",u"b",u"c"])
...
>>> Ex2.createTable()

will fail right from the start.

i haven't really done enough research to determine if those issues are bugs in SQLObject, bugs in the python postgres driver (psycopg), bugs in postgresql, or if there are good reasons to be the way they are or if i'm just doing something obviously foolish. either way, they are easily worked around so it's not that big a deal.

### simpleTAL

the basic pattern for how i use simpleTAL with cherrypy is something like:

def tal_template(filename,values):
    from simpletal import simpleTAL, simpleTALES
    import cStringIO
    context = simpleTALES.Context()
    # omitting some stuff i do to set up macros, etc.
    # ...
    # make each value available to the template under its key
    for k in values.keys():
        context.addGlobal(k,values[k])
    templatefile = open(filename,'r')
    template = simpleTAL.compileXMLTemplate(templatefile)
    templatefile.close()
    f = cStringIO.StringIO()
    template.expand(context,f)
    return f.getvalue()

this, unfortunately, breaks nicely if it comes across any unicode strings in your context. to fix that, you need to specify an outputEncoding on the expand line:

template.expand(context,f,outputEncoding="utf8")

then, since the cherrypy encodingFilter is going to encode all of our output, i change the last line of the function to return a unicode string:

return unicode(f.getvalue(),'utf8')

and it all comes together nicely.
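the shape of that round trip (encoded bytes written into a buffer, then decoded back to unicode before returning) can be shown without simpleTAL installed. in this Python 3 sketch, `fake_expand` is a made-up stand-in for `template.expand()` and `render_to_unicode` is a hypothetical wrapper:

```python
import io

def fake_expand(context, out, outputEncoding="utf-8"):
    """Stand-in for template.expand(): writes encoded bytes to a file-like object."""
    out.write(("<h1>%s</h1>" % context["title"]).encode(outputEncoding))

def render_to_unicode(context, encoding="utf-8"):
    """Render into a byte buffer, then hand back a unicode string."""
    buf = io.BytesIO()
    fake_expand(context, buf, outputEncoding=encoding)
    # decode with the same encoding that was used for writing
    return buf.getvalue().decode(encoding)

html = render_to_unicode({"title": "\u738b\u83f2"})
```

the important detail is that the encoding used to decode matches the one passed as `outputEncoding`; a mismatch there would just move the bug somewhere else.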

### textile

textile, i think, tries to be too clever for its own good. unfortunately, if you give it a unicode string with some nice non-ascii characters, you get the dreaded UnicodeDecodeError when it tries to convert it to ascii internally:

>>> from textile import textile
>>> textile(u"\u201d")
... blah blah blah... UnicodeDecodeError

it fares slightly better if you give it a utf8 encoded string:

>>> textile(u"\u201d".encode('utf8'))
'<p>&#226;&#128;&#157;</p>'

except that that's... wrong. rather than spend too much time trying to figure out what textile's problem was, i reasoned that since its purpose in life is just to spit out html, there was no harm in letting python convert the non-ascii characters to XML numerical entities before running it through textile:

>>> textile(u"\u201d".encode('ascii','xmlcharrefreplace'))
'<p>&#8221;</p>'

which is correct.
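the difference between the two outputs can be reproduced without textile. a small sketch in Python 3 syntax; that textile escapes each UTF8 byte as if it were a character is an inference from the numbers, not something checked in its source:

```python
quote = "\u201d"  # right double quotation mark

# one reference per code point: what xmlcharrefreplace produces
per_character = quote.encode("ascii", "xmlcharrefreplace").decode("ascii")

# one reference per UTF8 byte: the shape of the wrong output
per_byte = "".join("&#%d;" % byte for byte in quote.encode("utf-8"))
```

`per_character` comes out as `&#8221;` and `per_byte` as `&#226;&#128;&#157;`, which are exactly the three UTF8 bytes of the quote mark written out as three separate characters.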

[update: 2005-11-02] as discussed in the comments of a post on Sam Ruby's blog, numerical entities are, in general, not a very good solution. it's better than nothing, but ultimately it looks like i or someone else is going to have to fix textile's unicode support if i really want things done properly.

### memcached (bonus!)

once i'd done all this research, it didn't take me very long to audit one of our applications at work and get fairly confident that it can now handle anything that's thrown at it (and of course it now has a bunch more unicode related unit tests to make sure it stays that way).

so this evening i decided to do the same audit on the thraxil.org code. going through the above checklist i had it more or less unicode clean in short order. the only thing i missed at first is that the site uses memcached to cache things and memcached doesn't automatically marshal unicode strings. so a .encode('utf8') in the set_cache() and a unicode(value,'utf8') in the get_cache() were needed before everything was happy again.
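the cache wrappers end up looking roughly like this. a sketch in Python 3 syntax: `FakeCache` is a dict-backed stand-in invented for the example (it is not the memcache.py API, and its refusal of unicode values is just a way to model the problem), and only the `set_cache()`/`get_cache()` names come from the site code:

```python
class FakeCache(object):
    """Dict-backed stand-in for a cache client that only stores byte strings."""
    def __init__(self):
        self._data = {}

    def set(self, key, value):
        if not isinstance(value, bytes):
            raise TypeError("cache values must be byte strings")
        self._data[key] = value

    def get(self, key):
        return self._data.get(key)

cache = FakeCache()

def set_cache(key, value):
    # encode on the way in
    cache.set(key, value.encode("utf-8"))

def get_cache(key):
    # decode on the way out; a miss stays None
    value = cache.get(key)
    if value is None:
        return None
    return value.decode("utf-8")

set_cache("title", "\u738b\u83f2")
cached_title = get_cache("title")
```

the check for `None` matters: a cache miss should not be run through the decode step.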

i'm probably missing something, but that's basically what's involved in getting a python web application to handle unicode properly. there are some additional shortcuts that i didn't mention like setting your global default encoding to 'utf8' instead of 'ascii' but it doesn't change much, isn't safe to rely on, and i think it's useful to understand the details of what's going on anyway.

for the record, the exact versions i'm using are: Python 2.4, PostgreSQL 8.0.3, cherrypy 2.1, SQLObject 0.7, simpleTAL 3.13, textile 2.0.10, memcache.py 1.2_tummy5, and psycopg 1.1.15.

> i'm probably missing something

Just a dash.


from urllib2 import urlopen
from xml.dom import minidom
minidom.parse(urlopen('http://thraxil.org/feeds/atom.xml'))



thanks, Sam.

it was actually trickier to fix that than i'd thought it would be. the template for the atom feeds had the encoding specified correctly but simpleTAL was replacing it. i figured that that was because the python code did:

template.expand(context,fakeout,outputEncoding="utf8")

so i changed it to "utf-8" there and that made it not include any encoding. same with other values. so it looks like simpleTAL will only give you the wrong encoding or none at all. i had to work around it by manually fixing the encoding in the rendered template with a regexp. ugh.
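the actual regexp isn't shown here, so the following is only a guess at what such a workaround could look like, in Python 3 syntax. `force_xml_encoding` and its pattern are hypothetical, and it assumes the XML declaration, if there is one, is the very first thing in the rendered output:

```python
import re

# matches a leading XML declaration plus any whitespace after it
XML_DECL = re.compile(r'^<\?xml[^>]*\?>\s*')

def force_xml_encoding(rendered, encoding="utf-8"):
    """Replace (or add) the XML declaration so it names the given encoding."""
    body = XML_DECL.sub("", rendered, count=1)
    return '<?xml version="1.0" encoding="%s"?>\n%s' % (encoding, body)

wrong = force_xml_encoding('<?xml version="1.0" encoding="utf8"?>\n<feed/>')
missing = force_xml_encoding('<?xml version="1.0"?><feed/>')
```

stripping the declaration and writing a fresh one covers both failure modes described above: the wrong encoding name and no encoding at all.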

This is what happens if you don't force the programmers to take care of Unicode upfront. Having 2 strings or a string with a flag is broken.
