Monday, April 18, 2011

Tart status update

First, apologies to Liam - I have not yet committed in his latest patch. although it's on my list of things to do.

For this last week I've been completely absorbed by a different task, which is writing a serializer / deserializer for Tart ASTs. This is something I've wanted to do for a long time (since the beginning of the project, in fact). Basically, there are two classes, ASTWriter, which converts an AST into a string, and ASTReader, which does the inverse. Each serialized AST node starts with a byte indicating the node type, followed immediately by the children of that node (which may be nodes or primitive data types.) Identifiers and other string data are stored using a "define on first use" strategy - that is, the first time a particular character sequence appears, it is embedded directly in the stream, any subsequent reference to that identifier is done via a numeric index.

On top of that, there is a set of classes (MDReader / MDWriter) which can read and write a Tart module as a set of LLVM metadata nodes. These classes use the AST serializers to convert the content of the module into a form that can be stored in a metadata node. Note, however, that not all of the module data is serialized into character strings, because I want to be able to do random-access lookups on certain symbols without deserializing the entire module. Properties which need to be random-accessible are stored as trees of metadata nodes rather than as strings. One drawback of using metadata nodes is that they are much larger, and also harder to work with.

The next piece is the ability of the compiler to load a serialized AST from a compiled bitcode file, instead of from a text source file. The compiler's import path list, which used to be a list of strings, is now a list of Importer objects, of which there are two kinds: DirectoryImporter, which (given a module name) looks for a source file in a specific directory and attempts to parse it; And ArchiveImporter, which attempts to load a serialized module from within a bitcode file. When you use the -i option on the command line, it will create the correct type of importer depending on whether the path is a directory or a bitcode file.

The final piece, that is not finished yet, is updating all of the various analysis phases of the compiler to be able to handle deserialized modules. The issue here is that there are a number of differences between a module that is loaded from a bitcode file and one that is parsed from a source file. These differences mainly stem from the desire to avoid having the compiler repeat work it's already done. For example, many of the symbols in the serialized module are fully-qualified (i.e 'tart.core.Array' instead of just 'Array'), because we don't want the compiler to have to re-figure out which scope Array is defined in. Similarly, the AST tree goes through a number of transformations in the compiler, and the point where the module's AST is written out occurs after those transformations have been done. The good news is that this means that certain analysis passes can be skipped over when dealing with a deserialized module, however this also means that certain passes can now "see" certain kinds of nodes which they wouldn't have been able to see before, and need to know how to handle them.

Another complication has to do with writing out template definitions. The analysis passes referred to in the previous paragraph are never applied to templates directly. Instead, when a template is instantiated, the analysis passes are applied to the template instance, which is a copy of the template definition with certain substitutions. So the serializer has to accommodate both pre-analysis and post-analysis trees, which are substantially different. An example is the fully-qualified names mentioned earlier - in the case of a template, the name hasn't been looked up yet, and in fact the name lookup may give a different result depending on what arguments you give to the template, so we can't store a fully-qualified name in that case.

There are lots of small details to work through - such as the fact that when you define an enumeration, the compiler automatically generates both a 'minVal' and 'maxVal' definition for it. Should we store the min / max values in the serialized form, or should we recalculate them when we deserialize? I opted for the latter. The same is true for the 'self' parameter of methods, or the 'value' parameter of property setters, both of which are synthesized by the compiler during analysis.

Anyway, as I said, this is not finished, but its getting close. The net result is that when you want to distribute or install a library, you'll have the choice of whether to include the source code or not. Of course, we want to encourage library authors to always make the source code available, but from a user's perspective, they may not want to have a bunch of loose source files cluttering up their system.

No comments:

Post a Comment