Machine Words: Unicode character tables

I decided to change it up a bit and work on something different today, and also do a bit of Python programming - in fact, Python3 programming. So I worked on generating the Unicode character tables. These are the tables that allow you to do things like toUpper(), isAlpha() and so on.

The tables have to be compressed because otherwise they'd be huge: 640k to cover the entire Unicode character set (0-0xfffff) and that's just for 8 bits of properties per character. At the same time, however, you don't want to use a compression method that will slow down character lookups. Rather than invent something new, I decided to look up on the net to discover what other people were doing, and quickly discovered this article: ftp://fox-toolkit.org/pub/FOX_Unicode_Tables.pdf

Basically, what the Fox guys do is slice the character into bitfields, and then do a three-stage lookup, with each stage producing an offset into the lookup table for the next stage. Because the Unicode tables are very sparse, most of the leaf tables can be merged or overlapped with other tables. By tweaking the table size constants a bit, I was able to get my script to generate tables even smaller than the Fox ones. (I even considered doing a simulated annealing algorithm to crunch the tables down to the smallest possible size, but decided that would be overkill.)

Right now, the "category" table is about 15k total, and the three case-mapping tables (upper, lower, title) are about 3.5k each.

Of course, these tables only handle "simple" case mappings, not the more complex context- and locale-sensitive mappings which would need to be coded as exceptions. I also have not worked on any of the other tables - numeric values, combining characters, bidi, and so on - I'll leave those for another time when I actually need them.

The Python script generates Tart source code for all of these tables. Unfortunately, there's a problem in the Tart compiler preventing this from actually compiling, so checking it in will have to wait for another day.

Oh, another Python-related thing I worked on - I decided to update my generated docs to the most recent version of Sphinx, and I made a custom syntax-coloring stylesheet for Tart code - the current styles were a bit hard to read. At some point I want to write a Sphinx extension that reads in the autodoc comments in the Tart library source code, but that's a major project.

Machine Words

Saturday, March 12, 2011

Unicode character tables

No comments:

Post a Comment