Thursday, February 24, 2011

ASTs, expression trees, and CFGs

I'm in the middle of a fairly major refactoring of the compiler, in particular the part that handles statements.

Currently, the Tart compiler goes through a number of phases, most of which are pretty standard:
  • Parsing, which produces an AST (Abstract Syntax Tree).
  • Analysis which takes the AST and produces a CFG (Control Flow Graph).
  • Code generation, which produces an LLVM module.
The analysis phase consists of a bunch of different passes, including things like building the expression tree, type inference, macro expansion, and so forth. The output of the analysis phase is a control flow graph for each function. That is, each function will have a list of basic blocks, and each basic block has a list of expressions to be evaluated. At the end of each basic block is a branch or return instruction.

The StmtAnalyzer class is responsible for creating the overall CFG for a function. For example, it transforms an if-statement into a sequence of blocks:
  • Block 1:
    • evaluate test-expression
    • Jump to either Block 2 or Block 3
  • Block 2: ('then' block)
    • execute the body of the if-statement
    • jump to Block 4
  • Block 3: ('else' block)
    • execute the body of the else-statement
    • jump to Block 4
  • Block 4:
Within each block, there is a list of expressions to evaluate. Since this isn't a pure functional language, these expressions will almost always have side effects. The expressions are represented as trees, similar to the original AST, except that each tree node also has information about the expression's type, and AST nodes which contain names have been replaced by nodes which point directly to the named thing.
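
To make the shape of this concrete, here's a rough C++ sketch - not the actual compiler classes, just an illustration of blocks holding flat lists of typed, name-resolved expression trees:

#include <vector>

struct Type;   // A resolved type descriptor.
struct Defn;   // A named definition (variable, function, etc.)

// Expression trees: like AST nodes, but typed and name-resolved.
struct Expr {
  Type * type;                 // Every node knows its own type.
  Defn * target;               // Points directly at the named thing.
  std::vector<Expr *> args;    // Child expressions.
};

// A basic block: a flat list of expressions, then a terminator.
struct Block {
  std::vector<Expr *> exprs;   // Evaluated mostly for side effects.
  Expr * testOrReturnValue;    // Operand of the terminator, if any.
  std::vector<Block *> succ;   // Branch targets (0, 1, or 2 of them).
};

struct FunctionCFG {
  std::vector<Block *> blocks; // The first block is the entry block.
};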

Something to note about the CFG is that several pieces of information that were in the original AST have been lost. The first is what kind of statement it was - that is, you can't determine just by looking at the blocks that the original statement was an 'if' statement. The second thing that is lost is information about scoping - that is, when variables come into and go out of scope.

Another thing to notice about the CFG is that it is a hybrid: It isn't just blocks or expression trees, but a mixture of both.

OK, so all of this is pretty standard - so why is it changing? Well, this has to do with the earlier discussion of being able to evaluate some statements as expressions and have them return a value. It turns out that the data structure described above doesn't work so well in this case. For example, what is the result type of an if-statement? The answer is that it's the common type returned by both the 'true' and 'false' branches. However, in order to determine this, the analyzer needs to know that it is, in fact, an if-statement - which is unfortunate, since we've already thrown that information away.

So in the new scheme, I want to split the analysis phase into two parts. The first part will produce a pure expression tree - that is, all statements will be represented as expression nodes. In the second phase, we'll convert the expression tree into a CFG. The important bit is that all of the complex analysis passes - type inference, macro expansion, and so on - will operate on the expression tree, which is in some ways an easier data structure to work with, since it contains more information and is not flattened into a simple list of blocks.
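
As a minimal sketch of why this helps (again, not the real compiler code - commonType() here is a made-up helper standing in for the actual type-unification logic):

struct Type;
struct Expr { Type * type; };

// Hypothetical helper: the narrowest type both branches convert to.
Type * commonType(Type * a, Type * b);

// In the new scheme, an if-statement is just another expression node,
// so the fact that it *is* an 'if' is never thrown away before inference.
struct IfExpr : public Expr {
  Expr * test;
  Expr * thenVal;
  Expr * elseVal;
};

// The result type of the whole 'if' falls out naturally:
void inferIfType(IfExpr * e) {
  e->type = commonType(e->thenVal->type, e->elseVal->type);
}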

This will also make it easier to track when a variable goes in and out of scope. This is important for two reasons: one is generating correct debugging information, and the other is garbage collection, since we want to zero out any object references that go out of scope.
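
One way this could look (a hypothetical node shape, not a committed design): the expression tree keeps explicit scope nodes, so the lowering pass knows exactly where each variable dies:

#include <vector>

struct Expr;
struct VariableDefn;

// Hypothetical: a scope is itself an expression wrapping a body, with
// an explicit list of the locals it introduces.
struct ScopeExpr {
  std::vector<VariableDefn *> locals;  // In scope for the body only.
  std::vector<Expr *> body;
};

bool isGCReference(VariableDefn * v);  // Hypothetical query.
void emitStoreNull(VariableDefn * v);  // Hypothetical codegen call.

// At the point where the scope closes, null out dead references so the
// collector doesn't keep their objects alive.
void lowerScopeExit(ScopeExpr * scope) {
  for (VariableDefn * v : scope->locals)
    if (isGCReference(v))
      emitStoreNull(v);
}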

Tart release criteria, revisited

The last time we talked about this (November 2010), we discussed what needed to be done before I could make an official "0.1" release of Tart. One of the items discussed was garbage collection, which has now been implemented. Although the collector is very rudimentary (for example, the heap is a fixed size, and there's no way to increase it), I'm satisfied that it is functioning well enough that it should no longer count as a release blocker.

Other features implemented during this period include variadic templates, integer parsing, and improved debugging support.

The next question is, what are the remaining release blockers? Here's my current list.
  • Packaging and installation. At the moment, the only way to run the Tart compiler is to check out the source tree and compile it. There is no "make install" build target or package builder. There are a number of tasks which need to be done:
    • CMake supports installation via the 'install' command - this creates the appropriate 'make install' targets. There is also CPack, a companion to CMake, which will generate package installers for various platforms.
      • TASK: investigate these tools and write the appropriate rules for the Tart build files. This needs to include:
        • Executables tartc and tartln
        • Bitcode files for the standard library and the testing library.
        • The static library for the Tart runtime.
        • Shared objects for the Tart linker plugins (these plug into LLVM tools for people who don't want to use tartln).
        • Possibly docs, although maybe those should have a separate install.
    • Currently the Tart compiler has no way to import a compiled module - it has to reparse every imported module from source. This means that a Tart installation would have to install all of the source files for the standard library somewhere. It would be much better to have a single library file for the standard library. The plan is to serialize the AST for each module as metadata nodes in the LLVM bitcode files; that way, when all of the bitcode files for the standard library are linked together, the compiler can simply import this one file and find the definitions of all the standard library classes within it. (A rough sketch of what the bitcode side might look like appears after this list.)
      • TASK: Implement serialization (both directions) of the AST for a module.
      • TASK: Implement some scheme for determining if the AST in a library is out of sync with its source, so that the compiler can use the up-to-date source if it's available.
  • Samples. It would be good to have some sample programs, in particular samples that have build scripts that don't require the full Tart source tree and don't require CMake.
    • Because the Tart compiler and linker are only part of a tool chain for creating programs, we need to explain to people all of the steps needed to create a program.
  • Input. The Tart I/O library is incomplete, and in particular there's no means to read from stdin or the command line.
    • TASK: Finish the implementation of StdFileStream to allow for reading from a file.
    • TASK: Work on a library for parsing command-line options. I've done a little bit of work on this, but it's not checked in.
Of course, some of the above tasks are not required if we're willing to stipulate that the initial release requires the complete source tree in order to create programs in Tart.
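
For the AST-in-bitcode idea above, here's roughly what the emitting side might look like. This uses the modern LLVM C++ API (the metadata API in 2011 differed slightly), and both the "tart.module.ast" metadata name and the idea of flattening the AST to a single string blob are assumptions for this sketch:

#include <string>
#include "llvm/IR/LLVMContext.h"
#include "llvm/IR/Metadata.h"
#include "llvm/IR/Module.h"

// Attach a serialized AST to the module as named metadata. Named
// metadata survives bitcode linking, so after the standard library is
// linked into one file, that file carries the ASTs of every module.
void emitModuleAST(llvm::Module & m, const std::string & astBlob) {
  llvm::LLVMContext & ctx = m.getContext();
  llvm::NamedMDNode * nmd = m.getOrInsertNamedMetadata("tart.module.ast");
  llvm::Metadata * ops[] = { llvm::MDString::get(ctx, astBlob) };
  nmd->addOperand(llvm::MDNode::get(ctx, ops));
}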

In addition to this, there's a refactoring change that I'm in the middle of that I want to get done before I start on any of the tasks listed above.

There is also a giant TODO list of things that need to be done eventually, but aren't release blockers. One open question is at what point I need to start using the bug tracker on the Tart project page to track these tasks (as opposed to the text file I use now). Some of the items on that list are worthy of note:
  • A documentation extractor for Tart files. I've been writing all of the source code comments in the Tart standard library on the assumption that there will eventually be a JavaDoc-style program for generating documentation from them (although I think my markup syntax is better than the @javadoc style).
  • Unicode support (so we can write things like String.toLower()). My plan here is to write Python scripts that process the Unicode database files and generate Tart code representing the compressed form of those tables.
  • Exception backtracing.
  • Debugging support is still lacking in many ways - for example, you can't currently inspect the contents of a local variable.
  • Foreign Function Interface for calling C functions. (Partly done, but it only works with the most primitive data types. One thing completely missing is pinning in the garbage collector.)
  • Finalizer functions (to be called when an object gets deleted).
  • "finally" and "with" statements.
  • Package up the Tart Eclipse plugin and make it available as an Eclipse package.

Sunday, February 20, 2011

Garbage has been collected

I've managed to successfully run a garbage collection cycle without causing a crash:

def testCollect {
  // Allocate a string that's not a constant.
  var sampleString = String.format("a{0}", "b");
  // Force a collection
  tart.gc.GC.collect();
  // Ensure that the string is still valid.
  assertEq("ab", sampleString);
}

Woot.

Wednesday, February 16, 2011

LLBrowse

There haven't been any Tart updates in a while, because I've been working on a debugging tool to help me locate problems in my compiler's output. The tool is called LLBrowse, and it's a graphical browser for LLVM module files. Here's a link to the docs and some screenshots:


LLBrowse allows you to examine all of the contents of a module - functions, global variables, type definitions, and so on. It also allows you to navigate through all of the DWARF debugging information for those items. This has already led me to discover a bug in my compiler.

Wednesday, February 2, 2011

More on garbage collection

I have to say, one consequence of implementing a garbage collector is that it really forces you to take a holistic view of your code generation process. You have to consider every possible object reference that could be generated, because if you miss one, the garbage collector will cheerfully free your object before you're done with it.

A discussion on the llvm-dev mailing list has yielded some valuable new information: LLVM metadata nodes can refer to global variables. LLVM supports the concept of arbitrary metadata associated with a module, and these metadata can contain arbitrary language-specific information which is ignored by the various LLVM optimization passes. Moreover, if a value gets eliminated during an optimization pass, the pointer in the metadata node will automatically get changed to NULL.

This is exactly what I need - I want the optimizer to be able to eliminate any garbage collection roots that are never used, and then after it's done, I want to take the roots that are left over and string them together into a table.
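
As a sketch of that post-optimization step (again using the modern LLVM C++ API rather than the 2011-era one, and with "tart.gc.roots" as a made-up metadata name):

#include <vector>
#include "llvm/IR/Constants.h"
#include "llvm/IR/Metadata.h"
#include "llvm/IR/Module.h"

// Walk a named metadata list of GC root globals, skipping entries the
// optimizer has nulled out, and collect the survivors for the root table.
std::vector<llvm::Constant *> collectLiveRoots(llvm::Module & m) {
  std::vector<llvm::Constant *> roots;
  llvm::NamedMDNode * nmd = m.getNamedMetadata("tart.gc.roots");
  if (!nmd)
    return roots;
  for (llvm::MDNode * node : nmd->operands()) {
    if (node->getNumOperands() == 0)
      continue;
    // A global that was eliminated shows up here as a null operand.
    llvm::Metadata * op = node->getOperand(0);
    if (auto * cm = llvm::dyn_cast_or_null<llvm::ConstantAsMetadata>(op))
      roots.push_back(cm->getValue());
  }
  // 'roots' can now be strung together into a global table of GC roots.
  return roots;
}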