Thursday, February 24, 2011

ASTs, expression trees, and CFGs

I'm in the middle of doing a fairly major refactoring of the compiler, in particular the part that handles statements.

Currently, the Tart compiler goes through a number of phases, most of which are pretty standard:
  • Parsing, which produces an AST (Abstract Syntax Tree).
  • Analysis which takes the AST and produces a CFG (Control Flow Graph).
  • Code generation, which produces an LLVM module.
The analysis phase consists of a bunch of different passes, including things like building the expression tree, type inference, macro expansion, and so forth. The output of the analysis phase is a code flow graph for each function. That is, each function will have a list of basic blocks, and each basic block has a list of expressions which can be evaluated. At the end of each basic block is a branch or return instruction.

The StmtAnalyzer class is responsible for creating the overall CFG for a function. It is responsible for transforming something like an if-statement into a sequence of blocks:
  • Block 1:
    • evaluate test-expression
    • Jump to either Block 2 or Block 3
  • Block 2: ('then' block)
    • execute the body of the if-statement
    • jump to Block 4
  • Block 3: ('else' block)
    • execute the body of the else-statement
    • jump to Block 4
  • Block 4:
Within each block, there is a list of expressions to evaluate. Since this isn't a pure functional languages, these expressions will almost always have side effects. The expressions are represented as trees, similar to the original AST tree, except that each tree node also has information about the expression's type, and AST nodes which contain names have been replaced by nodes which point directly to the the named thing.

Something to note about the CFG is several pieces of information that were in the original AST are lost. The first is what kind of statement it was - that is, you can't determine just by looking at the blocks that the original statement was an 'if' statement. The second thing that is lost is information about scoping, that is when variables come in and out of scope.

Another thing to notice about the CFG is that it is a hybrid: It isn't just blocks or expression trees, but a mixture of both.

OK so all of this is pretty standard - so why is it changing? Well, this has to do with the earlier discussion of being able to evaluate some statements as expressions and have them return a value. It turns out that the data structure described above doesn't work so well in this case. For example, what is the return type of an if statement? The answer is, it's the common type returned by both the 'true' and 'false' blocks. However, in order to determine this, the analyzer needs to know that it is, in fact, an if-statement, which is unfortunate since we've already thrown that information away.

So in the new scheme, I want to split the analysis phase into two parts. The first part will produce a pure expression tree - that is, all statements will be represented as expression nodes. In the second phase, we'll convert the expression tree into a CFG. The important bit is that all of the complex analysis phases - type inference, macro expansion, and so on, will operate on the expression tree, which is in some ways an easier data structure to work with since it contains more information and is not flattened into a simple list of blocks.

This will also make it easier to track when a variable goes in and out of scope. This is important for two reasons, one of which is generating correct debugging information, and the other reason is related to garbage collection, since we want to zero out any object references that go out of scope.

No comments:

Post a Comment