Monday, July 26, 2010

Starting to work on Tart again

My life seems to have calmed down quite a bit and lately I've started poking away at Tart again. It's been about 4 months since I worked on it, and although I've forgotten a few things, it's also given me an opportunity to step back and take a fresh look at some of the problems.

One problem that Tart has is that the binary files it produces are huge. There are a number of reasons for this:
  • DWARF debugging information
  • Template bloat
  • Reflection data
For DWARF, there's not much I can do except generate less code.

For templates, a big win will be gained once I start using Nick Lewycky's function merging pass. This merges LLVM functions which would produce the same sequence of machine instructions. It does this by effectively doing type erasure on pointer variables where possible, and then merging the resulting functions if they hash to the same value. Visual Studio has done this for a long time, and it is an effective way to deal with template bloat.

However, there's another kind of bloat caused by template expansion, which is the bloat in the amount of reflection data - that is, for each expansion of a template, you need to generate a copy of all of the reflection metadata for that class.

Reflection data is large because it needs to describe everything about a type, including the names of all methods and fields. A typical method takes around 128 bytes just to describe it's calling signature. My previous strategy was to combine identical definitions - so for example, if two different classes had a method named "foo", then they would both point to the same string. Similarly, if they had the same return type, they would both point to the same type descriptor struct.

The problem with this approach is that on a 64-bit machine, pointers are 8 bytes. So unless the method or field name is longer than 8 bytes, you're not really saving much by putting a pointer to the name instead of embedding the name directly.

I have two strategies for reducing the size of the template data:

1) Store the reflection data for template instances in an unexpanded form: In other words, instead of generating all of the metadata for each expansion of the template, merely store a reference to the original template definition and the list of type parameters. So for example, for type List[String] we simply store a reference to "List", along with the template parameter list "[String]". The reflection metadata for the specialized type will be generated lazily at runtime the first time someone attempts to access it. I would imagine that the large majority of such type descriptors will never be dereferenced, so the amount of memory consumed will be small compared to the amount required to store all templates in expanded form.

Of course, one drawback of this approach is that we now have to store the original template definition in a runtime-accessible form, which we were not doing before. References to template type variables can be represented by a special "type parameter" type (just as it is in the compiler). Creating an expansion of a template at runtime is a process of cloning all of the data structures of the template, replacing type variables with their associated definitions in the current context.

For purposes of reflection, we don't need the method bodies, just the type signatures of the methods. However, there's a reason why we might want to store the bodies of the methods anyway: To support importation of compiled modules. Right now, whenever the Tart compiler sees an import statement, it loads the source file for that import and (partially) compiles it, which may require resolving that module's imports, and so on. Instead I would like it to work more like Java - the compiler can load in a compiled module instead of having to recompile the original source. (Of course, this will also require the modules to have datestamp and size information so that it can detect when the object module is out of date. In essence the compiler needs to incorporate some of the responsibilities of "make".)

For regular (non-template) classes, we could simply use the existing reflection information that the compiler already generates. In other words, the reflection metadata within a module would have two uses: First, for runtime reflection, and second, for importation into the compiler.

With templates the situation is more complex, because currently unexpanded templates produce no output - a module that defines only templates will produce an empty binary module. In order for the compiler to instantiate a template that was imported from a binary module, the module needs to contain the *complete* definition of the template.

My current thinking is that the template data will be stored in two parts, which are the type descriptors and the method body ASTs. The reason for separating them is so that we can throw away the ASTs at linkage time (since they aren't needed in the final binary.) I will probably store the ASTs using LLVM metadata nodes. The type descriptors will be stored as static data structures, just like they are for non-template classes.

2) The other improvement to the reflection metadata is to transition from a pointer-based to an index-based reference system. Each module will contain a constants pool containing a list of strings, common types, and other information that is referenced by integer index. Variable-length integers will be used to refer to these tables, using an encoding scheme similar to UTF-8. This means that the first 128 entries in each table can be referred to with a single byte index; The first 4096 entries require a 2-byte index, and so on. All of these tables will be sorted in order of by number of references, so very few indices will be more than 1 byte.

Long path names will be stored using a tree-structure. So for example, the string "tart.core.Memory.Address" will be stored as (((tart . core) . Memory) . Address), where (X . Y) is a pair of varints. Here's what the tables might look like:

    Strings       Pairs
    ===========   ============
    s1: tart      p1: (s1, s2)
    s2: core      p2: (p1, s3)
    s3: Memory    p3: (p2, s4)
    s4: Address

Thus, to refer to the string 'tart.core.Memory.Address', we need only a single byte containing the value 'p3'. (The low bit of the byte determines whether it's a reference to a string or to a pair) Moreover, if there is another string with the same prefix (e.g. "tart.core.Memory.ptrDiff"), then only the last part needs to be added to the table, and the prefix parts only need to be stored once per module.

An additional side-benefit is that each unique string within a module now has a unique numeric ID which can be compared directly without having to do a string comparison (although a string comparison is still needed for cross-module references.)

The same scheme can be used for types. Each type descriptor is a string of bytes, which are a combination of type codes and indices into the module types table.

Of course, with a scheme like this, there will be a lot of duplication of definitions between modules: If module A has a definition of the tuple type (int, int) and module B has the same definition, then unlike the previous scheme they won't be shared. But that's OK, because each definition is only 3 bytes, whereas sharing a common definition would require an 8-byte pointer for each reference.

These highly-compressed type definitions will need to be expanded into data structures in order to be useful. This can be done lazily as types are inspected.

No comments:

Post a Comment