Monday, April 18, 2011

Tart status update

First, apologies to Liam - I have not yet committed his latest patch, although it's on my list of things to do.

For this last week I've been completely absorbed by a different task: writing a serializer / deserializer for Tart ASTs. This is something I've wanted to do for a long time (since the beginning of the project, in fact). Basically, there are two classes: ASTWriter, which converts an AST into a string, and ASTReader, which does the inverse. Each serialized AST node starts with a byte indicating the node type, followed immediately by the children of that node (which may be nodes or primitive data types). Identifiers and other string data are stored using a "define on first use" strategy - that is, the first time a particular character sequence appears, it is embedded directly in the stream; any subsequent reference to that identifier is made via a numeric index.
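The node-tag-plus-children encoding and the "define on first use" string pool can be sketched in a few lines. This is a minimal Python model, not the actual Tart implementation - the tag values, marker bytes, and one-byte lengths are all invented for illustration:

```python
import io

# Hypothetical node tag; the real ASTWriter defines its own tag set.
NODE_IDENT = 1

class ASTWriter:
    """Each node is a tag byte followed by its children. Strings use a
    define-on-first-use pool: the first occurrence is embedded inline,
    later occurrences are written as a numeric index."""
    def __init__(self):
        self.out = io.BytesIO()
        self.string_pool = {}  # text -> index of first definition

    def write_byte(self, b):
        self.out.write(bytes([b]))

    def write_string(self, text):
        if text in self.string_pool:
            self.write_byte(0)                      # marker: back-reference
            self.write_byte(self.string_pool[text]) # numeric index
        else:
            self.string_pool[text] = len(self.string_pool)
            data = text.encode('utf-8')
            self.write_byte(1)                      # marker: inline definition
            self.write_byte(len(data))
            self.out.write(data)

    def write_ident(self, name):
        self.write_byte(NODE_IDENT)
        self.write_string(name)

class ASTReader:
    """Inverse of ASTWriter, shown here for the identifier case."""
    def __init__(self, data):
        self.data = data
        self.pos = 0
        self.string_pool = []  # index -> text, filled as definitions appear

    def read_byte(self):
        b = self.data[self.pos]
        self.pos += 1
        return b

    def read_string(self):
        if self.read_byte() == 0:                   # back-reference
            return self.string_pool[self.read_byte()]
        length = self.read_byte()                   # inline definition
        text = self.data[self.pos:self.pos + length].decode('utf-8')
        self.pos += length
        self.string_pool.append(text)
        return text

    def read_ident(self):
        assert self.read_byte() == NODE_IDENT
        return self.read_string()
```

The payoff is visible in the stream size: the first occurrence of "Array" costs its full length plus bookkeeping, while every repeat costs only two bytes.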

On top of that, there is a set of classes (MDReader / MDWriter) which can read and write a Tart module as a set of LLVM metadata nodes. These classes use the AST serializers to convert the content of the module into a form that can be stored in a metadata node. Note, however, that not all of the module data is serialized into character strings, because I want to be able to do random-access lookups on certain symbols without deserializing the entire module. Properties which need to be random-accessible are stored as trees of metadata nodes rather than as strings. One drawback of using metadata nodes is that they are much larger, and also harder to work with.
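The trade-off between string blobs and metadata trees comes down to random access. Here is a toy model of that idea (the shapes and names are illustrative, not the actual LLVM metadata layout Tart uses): each symbol's body stays opaque, but the names live in a tree a reader can walk without decoding anything else:

```python
def make_module_metadata(symbols):
    """symbols: {qualified_name: serialized_ast_bytes}.
    Tree form: one ("symbol", name, payload) node per symbol, so a
    reader can find a name without touching any other payload."""
    return ("module", tuple(("symbol", name, payload)
                            for name, payload in sorted(symbols.items())))

def lookup_symbol(metadata, name):
    """Random-access lookup: walk the tree, never deserialize payloads
    for symbols we weren't asked about."""
    _, children = metadata
    for _, sym_name, payload in children:
        if sym_name == name:
            return payload
    return None
```

Had the whole module been one serialized string, finding a single symbol would have required deserializing everything before it - which is exactly the cost the tree representation avoids, at the price of bulkier storage.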

The next piece is the ability of the compiler to load a serialized AST from a compiled bitcode file, instead of from a text source file. The compiler's import path list, which used to be a list of strings, is now a list of Importer objects, of which there are two kinds: DirectoryImporter, which (given a module name) looks for a source file in a specific directory and attempts to parse it; and ArchiveImporter, which attempts to load a serialized module from within a bitcode file. When you use the -i option on the command line, the compiler creates the appropriate type of importer depending on whether the path is a directory or a bitcode file.
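The import-path mechanism is straightforward to sketch. This Python model is an assumption-laden stand-in for the real compiler: the archive is modeled as a plain dict, the `.tart` extension and method names are invented, and `find` returns whatever the importer located rather than a parsed module:

```python
import os

class DirectoryImporter:
    """Maps a dotted module name to a source file under one directory."""
    def __init__(self, path):
        self.path = path

    def find(self, module_name):
        source = os.path.join(self.path, *module_name.split('.')) + '.tart'
        return source if os.path.isfile(source) else None

class ArchiveImporter:
    """Looks up a serialized module inside a compiled archive,
    modeled here as a dict of name -> serialized payload."""
    def __init__(self, archive):
        self.archive = archive

    def find(self, module_name):
        return self.archive.get(module_name)

def make_importer(path_or_archive):
    # Mirrors the -i option: a directory path gets a DirectoryImporter,
    # anything else is treated as an archive.
    if isinstance(path_or_archive, str) and os.path.isdir(path_or_archive):
        return DirectoryImporter(path_or_archive)
    return ArchiveImporter(path_or_archive)

def import_module(importers, module_name):
    # The import path: the first importer that finds the module wins.
    for imp in importers:
        result = imp.find(module_name)
        if result is not None:
            return result
    return None
```

The key design point is that the import path no longer cares where a module comes from - source directory or compiled archive - only that each entry can answer "do you have this module?"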

The final piece, which is not finished yet, is updating all of the various analysis phases of the compiler to handle deserialized modules. The issue here is that there are a number of differences between a module loaded from a bitcode file and one parsed from a source file. These differences mainly stem from the desire to avoid having the compiler repeat work it's already done. For example, many of the symbols in the serialized module are fully qualified (i.e. 'tart.core.Array' instead of just 'Array'), because we don't want the compiler to have to figure out all over again which scope Array is defined in. Similarly, the AST goes through a number of transformations in the compiler, and the module's AST is written out after those transformations have been done. The good news is that certain analysis passes can be skipped entirely when dealing with a deserialized module; however, this also means that some passes can now "see" kinds of nodes which they couldn't have seen before, and need to know how to handle them.

Another complication has to do with writing out template definitions. The analysis passes referred to in the previous paragraph are never applied to templates directly. Instead, when a template is instantiated, the analysis passes are applied to the template instance, which is a copy of the template definition with certain substitutions. So the serializer has to accommodate both pre-analysis and post-analysis trees, which are substantially different. An example is the fully-qualified names mentioned earlier - in the case of a template, the name hasn't been looked up yet, and in fact the name lookup may give a different result depending on what arguments you give to the template, so we can't store a fully-qualified name in that case.

There are lots of small details to work through - such as the fact that when you define an enumeration, the compiler automatically generates both a 'minVal' and 'maxVal' definition for it. Should we store the min / max values in the serialized form, or should we recalculate them when we deserialize? I opted for the latter. The same is true for the 'self' parameter of methods, or the 'value' parameter of property setters, both of which are synthesized by the compiler during analysis.
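The "recalculate rather than store" decision for enum bounds can be sketched in a few lines. This is a hypothetical model - `deserialize_enum` and its dict representation are inventions for illustration, not Tart's actual deserializer:

```python
def deserialize_enum(name, members):
    """members: {member_name: value} as read from the serialized stream.
    minVal / maxVal are not part of the stream; they are synthesized
    again at load time, just as the compiler synthesizes them after
    parsing a source file."""
    enum_def = dict(members)
    enum_def['minVal'] = min(members.values())
    enum_def['maxVal'] = max(members.values())
    return enum_def
```

Recomputing keeps the serialized form smaller and, more importantly, guarantees the derived definitions can never drift out of sync with the members they were derived from.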

Anyway, as I said, this is not finished, but it's getting close. The net result is that when you want to distribute or install a library, you'll have the choice of whether to include the source code or not. Of course, we want to encourage library authors to always make the source code available, but from a user's perspective, they may not want a bunch of loose source files cluttering up their system.

Wednesday, April 6, 2011

Statements as Expressions

I decided to take a break from writing docs, and add a language feature that I've been wanting for a while: The ability for certain statements to act as expressions:

// If statement as an expression
let a = if x { 10 } else { 20 };

// Switch statement as an expression
let a = switch x {
  case 1 { "Hello" }
  case 2 { "World" }
  else   { "!" }
}

// Match statement as an expression
let a = match x {
  as s:String { "String" }
  as w:Widget { "Widget" }
  else        { "Object" }
}

Note that the type inference engine is able to dive deep into the nested block structure to determine the result type of the expression - in the second example, the type of 'a' is determined to be String, since the result of each of the three case blocks is a String. As usual, type constraints flow in both directions:

// Explicitly declare 'a' as type String. The result
// values from each case block will get coerced into
// Strings.
let a:String = match x {
  as s:String { 1 }
  as w:Widget { 2.0 }
  else        { "else" }
}

There are lots of other subtleties here. For example, a case block might contain a return statement, or other non-local exit, in which case the variable never gets assigned:

let a = match x {
  as s:String { doSomething(); 1 } // The last expression in the block is the result.
  as w:Widget { return false }
  else        { throw IllegalArgumentError() }
}

I have unit tests for all of the above cases and dozens more edge cases. For example, what happens if the different branches return different types, and the result type is unconstrained? Well, the compiler tries to find a common type that all of the branch types can be converted into. If there is no solution, or more than one possible solution, then the compiler complains.

The way this was done was to introduce a new type - I call it the "PHI" type, inspired by LLVM's "PHI" nodes - that represents a set of possible types that could be assigned to a variable. I then added code to the type solver to deal with PHI types. The type solver attempts to fill in the unknown types in an expression tree by finding the best possible fit, given the constraints (where constraints are things like "expression A must be convertible into type B", or "type A will be the return type of one of these five overloaded methods", and so on.) In the case of PHI types, the ranking for the PHI type is equal to the worst ranking for any of its member types. In case you were wondering, the rankings are:

enum ConversionRank {
  Incompatible,      // let x:Exception = 1  No conversion possible
  Truncation,        // let x:ubyte = 256    Value will be truncated
  SignedUnsigned,    // let x:uint = -1      Signed / unsigned mismatch
  PrecisionLoss,     // let x:int = 1.2      Loss of decimal precision
  IntegerToBool,     // let x:bool = 256     Exact but costly
  NonPreferred,      // let x:int = 1.0      Requires transformation
  ExactConversion,   // let x:byte = int(1)  Lossless conversion
  IdenticalTypes,    // let x:int = int(1)   No conversion, same type
};
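The "worst ranking of any member" rule is easy to model. Here is a minimal Python sketch using the ranking above (the `IntEnum` values encode the ordering from worst to best; `phi_conversion_rank` is an invented name, not the solver's actual API):

```python
from enum import IntEnum

# Mirrors the ranking above, ordered from worst to best.
class ConversionRank(IntEnum):
    Incompatible = 0
    Truncation = 1
    SignedUnsigned = 2
    PrecisionLoss = 3
    IntegerToBool = 4
    NonPreferred = 5
    ExactConversion = 6
    IdenticalTypes = 7

def phi_conversion_rank(member_ranks):
    """A PHI type converts to a target only as well as its worst member
    does, so its rank is the minimum over the member ranks."""
    return min(member_ranks)
```

So if a match expression has branches ranking IdenticalTypes, ExactConversion, and Truncation against a candidate result type, the PHI type as a whole ranks Truncation - which is what lets the solver reject that candidate in favor of one that fits every branch.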

I have not yet made "try/catch" usable in an expression context, but doing so will be relatively easy now that I have the infrastructure in place. Note that none of the other statement types - for, while, break, continue, and so on - make sense as expressions, so try/catch is the only one left to do.

Monday, April 4, 2011

Tart status update

I'm continuing to spend most of my time working on updating the "Introduction to Tart" document. I'm about 2/3 done at this point.

I'm also trying to move some of the more complicated details into the language reference document. My writing style tends to have a lot of parenthetical remarks and "asides", which can detract from readability, so I'm looking closely at these to see whether they actually need to be there.

On the coding side, I've definitely determined that the DWARF info being produced by LLVM is corrupted in some manner. I'm hoping to get some help on llvm-dev to figure out what is causing this.

I also added "addAll(collection)" and "insertAll(collection)" methods to ArrayList.

The "undef" keyword can now be used to override a method in an interface. Attempting to call the undefined method will throw an UnsupportedOperationError. For example:

interface List[%T] {
  def add(element:T);
}

class ImmutableList[%T] : List[T] {
  // Can't add to an immutable list
  undef add(element:T);
}

Essentially what happens is that when the compiler generates the dispatch table for ImmutableList, the slot for 'add' is filled in with the address of "Object.__undefinedMethod()", which simply throws an exception.
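The dispatch-table trick can be modeled directly. In this hedged Python sketch, the vtable is a dict from method name to callable, and `undefined_method` stands in for Object.__undefinedMethod() (the function and helper names are inventions for illustration):

```python
class UnsupportedOperationError(Exception):
    pass

def undefined_method(*args, **kwargs):
    # Stand-in for Object.__undefinedMethod(): every undef'd slot in a
    # dispatch table points here, so calling it throws unconditionally.
    raise UnsupportedOperationError()

def make_dispatch_table(base_table, undefs):
    """Build a subclass's dispatch table: copy the base class's slots,
    then overwrite each undef'd slot with the thrower."""
    table = dict(base_table)
    for name in undefs:
        table[name] = undefined_method
    return table
```

The important property is that the compiler does no extra work at the call site: the call goes through the table exactly like any other virtual dispatch, and only the slot's target differs.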