Friday, August 5, 2011

Gevent pywsgi and UnicodeDecodeError

When switching from gevent's libevent-based wsgi server to the Python-based pywsgi, you might encounter a strange error like the one reported on Google Code: http://code.google.com/p/gevent/issues/detail?id=86.

The problem is that WSGI (like the underlying HTTP) does not understand Unicode. You might think that because you can read other languages just fine on the Internet, certainly HTTP must understand Unicode, right? Wrong! HTTP only transfers bytes. How those bytes are decoded into characters is entirely up to the browser. The Content-Type header carries a charset hint, but WSGI does not use that header to encode your unicode response.
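To make the point about bytes concrete, here is a small sketch. The sample text and the variable names are only for illustration. The same bytes produce different characters depending on which charset the receiver assumes, which is why the encoding step has to be done explicitly by the application.

```python
# A unicode text the application wants to send (illustrative value).
text = u"caf\u00e9"

# Only bytes go over the wire, so the text must be encoded first.
body = text.encode("utf-8")

# The receiver gets the text back only if it decodes with the same charset.
as_utf8 = body.decode("utf-8")      # same text as the original
as_latin1 = body.decode("latin-1")  # different characters, same bytes
```

The charset in Content-Type tells the browser which of these decodings to pick. Nothing in WSGI performs the encode step for you.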

So a WSGI response must not contain unicode objects. Every unicode object must already have been encoded into a plain byte string. This applies to everything: the status code, the status message, the headers, and the body.
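A minimal sketch of an application that follows this rule might look like the one below. The name app and the greeting text are made up for the example. The plain string literals are byte strings on the Python 2 that this post was written for; on Python 3 (PEP 3333) the status and headers are native strings instead, while the body must still be bytes.

```python
def app(environ, start_response):
    # Text the application wants to send (illustrative value).
    text = u"Xin ch\u00e0o"

    # Encode the body explicitly; the server will not do it for us.
    body = text.encode("utf-8")

    # Plain (non-u"") literals keep the status and headers as native strings.
    status = "200 OK"
    headers = [
        ("Content-Type", "text/plain; charset=utf-8"),
        ("Content-Length", str(len(body))),  # length in bytes, not characters
    ]
    start_response(status, headers)
    return [body]
```

With gevent installed, such an application could be served with something like gevent.pywsgi.WSGIServer(("", 8000), app).serve_forever(); the port number here is arbitrary.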

This condition must hold at the WSGI server level. In WSGI, we can stack/wrap several WSGI applications (so-called middleware) on top of each other. The bottommost layer, the one nearest to the WSGI server, must ensure that all strings are byte strings. For example, Beaker might wrap your application. Your application does not return any unicode string, yet you might still run into a UnicodeDecodeError. That is because Beaker may need to add some headers of its own (to set or delete the session cookie). If any attribute of this session cookie (such as the path or the domain) is a unicode string, the whole cookie header becomes a unicode string, because Python 2 turns the result of mixing byte strings with unicode into unicode. And this violates the WSGI specification.
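One possible way to enforce this at the layer nearest to the server is a thin wrapper that coerces the status and every header before they reach the server. The sketch below is only an illustration: NativeStringHeaders and to_native are hypothetical names, not part of gevent or Beaker, and latin-1 is an assumed header encoding. It is written so that it also runs on Python 3, where the "wrong" type is bytes rather than unicode.

```python
import sys

PY2 = sys.version_info[0] == 2


def to_native(value, encoding="latin-1"):
    """Coerce a string to the native str type (hypothetical helper)."""
    if isinstance(value, str):
        return value  # already native: bytes on Python 2, text on Python 3
    if PY2:
        return value.encode(encoding)  # unicode -> byte string
    return value.decode(encoding)      # bytes -> text


class NativeStringHeaders(object):
    """Hypothetical middleware: fixes the status and headers of the wrapped app."""

    def __init__(self, app):
        self.app = app

    def __call__(self, environ, start_response):
        def fixed_start_response(status, headers, exc_info=None):
            # Rebuild the header list so no wrong-typed string slips through.
            headers = [(to_native(k), to_native(v)) for k, v in headers]
            return start_response(to_native(status), headers, exc_info)

        return self.app(environ, fixed_start_response)
```

It would have to wrap everything else, for example NativeStringHeaders(SessionMiddleware(my_app)) with Beaker's session middleware and your own my_app, so that headers added by inner layers are fixed too. The body is left untouched here; the application still has to encode it.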
