Thursday, March 31, 2011

Tart status update

Current tasks:
  • Still working on updating the language manual; I got a few more pages done in the last couple of days.
  • Fixed one bug in DWARF debugging; there's at least one more left.
  • Trying to get "dwarflint" to build. This is a relatively new tool from GCC that validates DWARF metadata, so getting it to configure and compile is a challenge.
  • I added the '+' operator for string concatenation, and also enabled implicit conversion of objects to strings. I have not yet added implicit conversion of primitive types to strings, however.
  • I still want to figure out why things are crashing on 32-bit processors.

Tuesday, March 22, 2011

Initial check-in of API docs

I've still got a lot of work to do on the documentation extractor, but here's a demo of what the output looks like:
Note that I've coded all of the pages in HTML5, taking maximum advantage of <section>, <nav>, web fonts, CSS3 shadows, and so on.

Also note that a lot of the links still lead nowhere, template parameters aren't being rendered, advanced formatting isn't finished, base classes aren't listed, and lots of other stuff remains to be done.

Let me know what you think.

Friday, March 18, 2011

Working on API docs

I've spent pretty much the last week (full time - I took a week of vacation) working on something that has needed to be done for a long time: API documentation for the Tart standard library.

Unfortunately, I spent the first two days chasing a dead end and had to backtrack. The issue revolves around the use of Sphinx for generating docs. I've been using it to generate the Tart language docs, and I'm very happy with it, both in the authoring process and in the way the output looks.

The question was whether I should use Sphinx for the API documentation as well. Now, I've been adding doc comments to classes from the beginning, so all I need is a way to extract and format those docs. While Sphinx does have an auto-doc facility, it is pretty closely tied to the Python language - the various autodoc directives import Python modules and introspect them.

My first approach was to try to use the existing autodoc directives. My plan was to parse Tart source code into Python objects, and then subclass the Sphinx documenter classes to look at the various attributes of those objects and generate the documentation. I essentially re-coded my entire Tart parser in Python to do this.

However, I ran into a number of problems with this approach. The first was that this part of Sphinx is very complicated, and not at all documented. Another problem is that Sphinx is not able to generate multiple output files from a single input file, where by "input file" I mean a .rst file. This is more a limitation of Docutils than of Sphinx. So, for example, if I put an "autopackage" directive in a file telling it to extract documentation from all of the modules in a package, it will put all of the documentation in a single HTML file, rather than producing one HTML file per class, which is what I wanted. To get that, I would have to create a separate .rst file for each class. One way to do this would be to auto-generate the input files, but by that point I was running into so many problems that I started to question the whole approach.
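
For illustration, here's a minimal sketch of what that stub-generation step might have looked like; the class list, output directory, and the use of a generic "autoclass" directive are all placeholders for the sake of the example:

    import os

    def write_class_stubs(class_names, out_dir):
        """Emit one .rst stub per class, so that Sphinx produces one HTML page each."""
        if not os.path.isdir(out_dir):
            os.makedirs(out_dir)
        for name in class_names:
            with open(os.path.join(out_dir, name + '.rst'), 'w') as f:
                f.write(name + '\n')
                f.write('=' * len(name) + '\n\n')
                # Stand-in for whatever directive would actually pull in
                # the extracted Tart doc comments for this class.
                f.write('.. autoclass:: ' + name + '\n')

    write_class_stubs(['tart.annex.GenerateStackTrace', 'tart.reflect.Method'], 'apidoc')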

A third problem was that my Python-based parser was not as smart as the real Tart compiler. Unless I was willing to re-code the entire frontend in Python (and not just the parser), it was never going to be able to do things like proper template argument resolution (i.e., being able to hyperlink to the proper page when you click on a template argument).

My next thought was to write a separate program in C++ that used the real compiler to parse the code, replacing the code-generation module with a backend that would generate .rst source files. The advantage here is that I wouldn't be dependent on any of the internals of Sphinx. However, I decided that before doing that, I should really think more about what I actually wanted - that is, what I wanted the API docs to look like. So I decided to manually write some API docs as .rst files, run them through Sphinx, and tweak them until they looked the way I wanted.

However, as I did this, I started to run into some fairly fundamental limitations of Sphinx. For example, I was never going to get Sphinx to generate the kind of class index I wanted - Sphinx tends to format things as hierarchical tables of contents, and what I wanted was something more like JavaDoc - in particular, I wanted to separate out the classes that are exceptions and attributes into their own categories. Also, it's difficult to embed arbitrary styles in the docs; you have to fit everything into the ReST document model.

So at this point I'm questioning whether I want to use Sphinx for my API docs at all. This is not something I take lightly, because there's a lot of good stuff that you get from Sphinx that would be hard to get otherwise, such as having your intro/language docs share the same namespace as your API docs, which makes it easy to cross-link between them. Also, you get ReST-style inline markup, which is not bad.

Now, I have my own markup language which I have been using. Most doc-comment markup languages (JavaDoc, doxygen, etc.) are based around the principle of maximizing the readability of the generated documentation, but at the expense of adding a lot of syntax to the source-code comments, making the source code less readable. However, it's been my experience that these days programmers are more likely to go to the source code than to a web page for documentation - assuming the source code is available. Most modern IDEs and even some "smart" text editors allow you to navigate to a particular declaration with one or two keystrokes, whereas going to a web page and looking up the symbol you are interested in is generally an order of magnitude more effort. So it makes sense that the doc comments should not impact the source-code readability if possible.

I was inspired by Doc-o-matic, a commercial system in use at EA, which has a very lightweight syntax that doesn't detract from the readability of the source, and my markup is based on that. If I were to continue to use Sphinx, I would want to translate my markup language into the ReST equivalent.

In any case, I switched gears last Wednesday, and decided to go down the path of generating the API documents with my own programs. I would divide the problem into two stages. The first stage would extract all the doc comments and generate XML files containing all of the comments (with the markup converted to XML), as well as all of the class, function, and type declarations, all of which would have fully-qualified names. This is similar to the previous scheme, but outputting XML instead of ReST. I refactored parts of the Compiler class into an AbstractCompiler which parses and analyzes input files but doesn't generate any code, and then created a new DocExtractor subclass to output the XML. The output looks something like this:

  <module name="tart.annex.GenerateStackTrace" visibility="public">
    <typedef type="class" name="GenerateStackTrace" visibility="public">
      <method name="apply" visibility="public">
        <param name="t">
          <type>
            <typename kind="class">tart.reflect.Method</typename>
          </type>
        </param>
      </method>
      <method name="construct" visibility="public"/>
    </typedef>
  </module>

The second pass is to load this XML file and run it through some HTML templates. I noticed that both Jinja and Genshi now work in Python3, so I grabbed a copy of Genshi (since I prefer its style of directives). I have not actually written the templates yet.
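
Here's a rough sketch of what that second pass might look like, just to give the flavor - the XML file name, template name, and template directory are placeholders, since the templates don't exist yet:

    import xml.etree.ElementTree as ET
    from genshi.template import TemplateLoader

    # Load one module's XML, as written by the doc extractor.
    module = ET.parse('tart.annex.GenerateStackTrace.xml').getroot()

    # Run it through a (not-yet-written) Genshi HTML template.
    loader = TemplateLoader(['templates'])
    template = loader.load('module.html')
    html = template.generate(module=module).render('html')

    with open(module.get('name') + '.html', 'w') as out:
        out.write(html)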

Note that the XML approach means that I still have the option of translating the XML into ReST if I decide to use Sphinx after all. So I haven't burned any bridges yet.

Sunday, March 13, 2011

Unicode tables are working

In between doing my taxes and working on a few OS X-related build bugs, I managed to get the Unicode stuff to the point where it is ready to be checked in. Now we can do fun stuff like this:

    assertEq('\u01DE', Character.toUpperCase('\u01DE')); // Latin capital letter A with diaeresis and macron
    assertEq('\u01DE', Character.toUpperCase('\u01DF')); // Latin small letter A with diaeresis and macron
    assertEq('\uA764', Character.toUpperCase('\uA764')); // Latin capital letter thorn with stroke
    assertEq('\uA764', Character.toUpperCase('\uA765')); // Latin small letter thorn with stroke

And the whole set of tables weighs in at around 20k.

Saturday, March 12, 2011

Unicode character tables

I decided to change it up a bit and work on something different today, and also do a bit of Python programming - in fact, Python3 programming. So I worked on generating the Unicode character tables. These are the tables that allow you to do things like toUpper(), isAlpha() and so on.

The tables have to be compressed because otherwise they'd be huge: 640k to cover the entire Unicode character set (0-0xfffff), and that's just for 8 bits of properties per character. At the same time, however, you don't want to use a compression method that will slow down character lookups. Rather than invent something new, I decided to look around on the net to see what other people were doing, and quickly discovered this article: ftp://fox-toolkit.org/pub/FOX_Unicode_Tables.pdf

Basically, what the Fox guys do is slice the character into bitfields, and then do a three-stage lookup, with each stage producing an offset into the lookup table for the next stage. Because the Unicode tables are very sparse, most of the leaf tables can be merged or overlapped with other tables. By tweaking the table size constants a bit, I was able to get my script to generate tables even smaller than the Fox ones. (I even considered doing a simulated annealing algorithm to crunch the tables down to the smallest possible size, but decided that would be overkill.)
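
The lookup itself stays cheap - in outline it looks something like the following (the field widths and table names here are purely illustrative; they aren't the actual constants my script settled on):

    # Illustrative split: the top bits index stage 1, then 6 + 6 bits for stages 2 and 3.
    STAGE2_BITS = 6
    STAGE3_BITS = 6

    def char_props(cp, stage1, stage2, stage3):
        """Look up the property byte for code point cp via three chained tables."""
        i1 = cp >> (STAGE2_BITS + STAGE3_BITS)
        i2 = (cp >> STAGE3_BITS) & ((1 << STAGE2_BITS) - 1)
        i3 = cp & ((1 << STAGE3_BITS) - 1)
        # Each stage yields a base offset into the next table; identical or
        # overlapping leaf blocks are stored only once, which is where the
        # space savings come from.
        return stage3[stage2[stage1[i1] + i2] + i3]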

Right now, the "category" table is about 15k total, and the three case-mapping tables (upper, lower, title) are about 3.5k each.

Of course, these tables only handle "simple" case mappings, not the more complex context- and locale-sensitive mappings, which would need to be coded as exceptions. I also have not worked on any of the other tables - numeric values, combining characters, bidi, and so on - those can wait until I actually need them.

The Python script generates Tart source code for all of these tables. Unfortunately, there's a problem in the Tart compiler preventing this from actually compiling, so checking it in will have to wait for another day.

Oh, another Python-related thing I worked on - I decided to update my generated docs to the most recent version of Sphinx, and I made a custom syntax-coloring stylesheet for Tart code - the current styles were a bit hard to read. At some point I want to write a Sphinx extension that reads in the autodoc comments in the Tart library source code, but that's a major project.

Sunday, March 6, 2011

Tart status update

Over the course of the last week I completely gutted and re-built the statement analyzer and major parts of the code generator. The result is a significantly simpler design, as well as a bunch of additions:
  • Nested 'finally' blocks are now handled.
  • If you break out of a try block via a return, break, or continue statement, the finally block will execute correctly (see the sketch after this list).
  • The generated IR code is now significantly more readable, due to the emission of sensible names for many internal variables.
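
As promised above, here's a sketch of the second point - the guarantee is the same one you'd expect from try/finally in any language, illustrated here in Python rather than Tart:

    def find_first_negative(values):
        try:
            for v in values:
                if v < 0:
                    return v            # returns out of the try block...
        finally:
            print('cleanup runs')       # ...but the finally block still executes

    find_first_negative([3, 1, -4, 1, 5])   # prints 'cleanup runs', then returns -4
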
This change also paves the way for a couple of things - statements returning expressions, and the 'with' statement. It will also make it easier to simplify the logic for generating source line debug information.

Basically, what has changed is that the Tart compiler no longer builds its own control flow graph (CFG) on top of LLVM's CFG. The new compiler goes directly from the expression node tree to the LLVM CFG. This means that when generating code, the compiler always has complete information about block scopes, as opposed to working from a 'flattened' CFG in which everything has been reduced to simple 'goto'-like branches. This does mean that the compiler now generates more branch instructions than before, but LLVM's optimizer will clean those up trivially.

A few other changes made over the course of this week:
  • Streamlined the configuration scripts, which now use 'llvm-config' to determine which libraries need to be linked for each of the various Tart programs (see the sketch after this list). This will also make it easier to adapt to future changes to the LLVM libraries.
  • Fixed a bunch of miscellaneous build problems.
  • Updated everything to work with the latest LLVM head.
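
Regarding the llvm-config change, the query involved is roughly this kind of thing (a minimal illustration in Python; the component names are just examples, and the real logic lives in the configuration scripts themselves):

    import subprocess

    def llvm_link_flags(*components):
        """Ask llvm-config which linker flags and libraries the given LLVM components need."""
        libs = subprocess.check_output(['llvm-config', '--libs'] + list(components))
        ldflags = subprocess.check_output(['llvm-config', '--ldflags'])
        return ldflags.decode().split() + libs.decode().split()

    print(llvm_link_flags('core', 'bitreader', 'bitwriter'))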