Friday, July 30, 2010

More thoughts on Arrays

The more I work on collections in Tart, the more I realize how odd Arrays are.

For example, in my work in Java I almost never use an array type. Sequential containers are almost always ArrayList or ImmutableList. Python's 'list' class (and its 'array' class as well) is much more like Java's ArrayList, in that it can grow as needed.

At the same time, fixed-length arrays are needed because the other container types - such as ImmutableList and ArrayList - are implemented in terms of arrays.

Arrays are problematic in other ways as well. For one thing, there is the dreaded array-initialization problem: in a language where the use of null pointers is heavily discouraged, what do you use for the initial fill value of an array? Not all types have an obvious "default value" instance.

One idea that I have been toying with is making the basic array class more like fixed_vector: that is, a container with a fixed capacity, but a variable length up to that capacity. You could grow the array up to its maximum size, but in doing so you would have to provide either an explicit fill value or a list of elements to copy into the array. From an implementation standpoint, this would require adding one extra field to the array, which is the current length. (And moreover, that field would disappear from ArrayList, since it could now use the length field of the underlying array.)
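To make the idea concrete, here is a minimal sketch of those semantics in Python. The names (FixedArray, grow) are purely illustrative, not Tart's actual API; the point is that growing requires an explicit fill value, which sidesteps the default-value problem described above.

```python
class FixedArray:
    """Fixed capacity, variable length up to that capacity."""

    def __init__(self, capacity, elements=()):
        assert len(elements) <= capacity
        self._capacity = capacity
        self._data = list(elements)   # current length = len(self._data)

    def __len__(self):
        return len(self._data)

    def grow(self, new_length, fill):
        # An explicit fill value must be supplied; there is no implicit
        # default for element types that lack an obvious "default value".
        if new_length > self._capacity:
            raise ValueError("capacity exceeded")
        self._data.extend([fill] * (new_length - len(self._data)))

a = FixedArray(4, ["x"])
a.grow(3, "pad")          # length is now 3; capacity is still 4
```

Note that the extra bookkeeping here is exactly the one "current length" field mentioned above.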

One problem with doing this is that "variable size with a maximum size" is a more complicated concept than just a fixed size. People might be surprised when their arrays start throwing capacity-exceeded exceptions.

Wednesday, July 28, 2010

More on templates and reflection

As I mentioned in my last posting, I'm working on a more efficient way to encode reflection data for templates. Rather than pre-generating the reflection data for each expansion of the template, instead I want to store the template definition, and let the expansion be generated lazily at runtime.

Note that under this scheme, the actual compiled methods, and the method dispatch tables, are still being generated by the compiler. It's only the reflection data that isn't generated.

One thing I still need to work out is how to integrate the compiled methods and dispatch tables into the reflection data when it is generated. With non-template classes, this is easy: the reflection data for each method (a static structure generated by the compiler) contains a pointer to that method. In the case of a template method, we need to somehow attach the method pointer to the method descriptor when it is created. This gets tricky because the compiler may have generated multiple copies of the original template method, one for each expansion of the template with different type parameters.

One way to approach this is to have some kind of map of functions: The key consists of all the elements that uniquely identify a function (the name, the arguments, the template arguments, etc) and the value is the function pointer. The trick will be representing the keys compactly.

For the name and template arguments, we can use indices. The name will be a variable-length index into the names table for the module, typically 1 byte. The template arguments are represented as a tuple of types, which is itself a type - given types T1, T2, you can create a new tuple type Tuple(T1, T2), and this new type can be assigned a unique index. This index can also be represented as a variable-length integer, and thus can also (usually) be stored in a single byte (it's rare for a module to have more than 128 unique types). The arguments to the method can be treated similarly, with the caveat that method arguments have additional properties beyond just the type, such as whether they are variadic.

So, assuming we can figure a way to encode those additional properties compactly, we're looking at a key size of only a few bytes. Still, that's a minimum of a key and a pointer for each method, and if there are a lot of methods, that adds up. Now, some of those methods are already in a table - the dispatch table for the class. It would be nice if we could use that table instead of having to generate a separate table for reflection purposes. Unfortunately, not all methods are in that table - for example, methods declared 'final' or 'static' don't need to be in the dispatch table because the compiler can determine at compile-time which method to dispatch to. And I'd like to avoid making the dispatch table longer, since every subclass has to include the dispatch table of its parent class. Also, methods which are not in the dispatch table can be discarded by the linker if they are never called elsewhere, whereas methods in the dispatch table must be included in the final binary.

One approach would be to create an additional table of only the methods that weren't in the dispatch table. Unfortunately, once we do this, it means that those methods can no longer be discarded by the linker. I'm not sure how to get around this, or even if it's important - one of the perils of reflection is that it breaks the compiler's ability to determine what symbols are reachable from other symbols - if you can access a class or method via its name, then how can the compiler know whether or not it's called? It has to be conservative and include all potentially reachable symbols, which might be larger than what is actually needed.

(Another possible approach is to use the DWARF debugging information to look up a function by name. However, I want to preserve the option to strip out all debugging symbols from the binary, which would cause reflection to break.)

In any case, assuming we have some small key that can uniquely identify a method, when we expand a template, we need to loop through all the methods, substitute type parameters, generate tuples from those types and look them up in the module's type registry to generate a unique index. Then combine the indices to generate a method key, look up that method in the map, and then use the resulting pointer to fill in the method slot in the method descriptor.
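That expansion loop can be sketched roughly as follows. This is a hedged illustration in Python, not Tart's runtime: the Method/Template structures, the registry, and the map keys are all hypothetical stand-ins for the compact keys described above.

```python
from collections import namedtuple

# Hypothetical structures standing in for the template's reflection metadata.
Method = namedtuple("Method", ["name_index", "param_types"])
Template = namedtuple("Template", ["methods"])

def expand_template(template, type_args, type_registry, method_map):
    """For each template method: substitute type variables, intern the
    resulting tuple of types to obtain its unique index, combine the
    indices into a key, and look up the compiled function pointer."""
    descriptors = []
    for method in template.methods:
        # Substitute type variables (e.g. "T") with actual types.
        param_types = tuple(type_args.get(t, t) for t in method.param_types)
        # Look up / assign a unique index for this tuple type.
        tuple_index = type_registry.setdefault(param_types, len(type_registry))
        key = (method.name_index, tuple_index)
        # Fill in the method slot with the compiled function's pointer.
        descriptors.append((method.name_index, method_map[key]))
    return descriptors

# Example: expanding a one-method template with T -> String.
tmpl = Template(methods=[Method(name_index=0, param_types=("T", "int"))])
registry = {}
fn_map = {(0, 0): "<compiled List[String].get>"}
desc = expand_template(tmpl, {"T": "String"}, registry, fn_map)
```

The real keys would of course be the few-byte varint-encoded forms rather than Python tuples, but the lookup structure is the same.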

Monday, July 26, 2010

Starting to work on Tart again

My life seems to have calmed down quite a bit and lately I've started poking away at Tart again. It's been about 4 months since I worked on it, and although I've forgotten a few things, it's also given me an opportunity to step back and take a fresh look at some of the problems.

One problem that Tart has is that the binary files it produces are huge. There are a number of reasons for this:
  • DWARF debugging information
  • Template bloat
  • Reflection data
For DWARF, there's not much I can do except generate less code.

For templates, a big win will be gained once I start using Nick Lewycky's function merging pass. This merges LLVM functions which would produce the same sequence of machine instructions. It does this by effectively doing type erasure on pointer variables where possible, and then merging the resulting functions if they hash to the same value. Visual Studio has done this for a long time, and it is an effective way to deal with template bloat.

However, there's another kind of bloat caused by template expansion, which is the bloat in the amount of reflection data - that is, for each expansion of a template, you need to generate a copy of all of the reflection metadata for that class.

Reflection data is large because it needs to describe everything about a type, including the names of all methods and fields. A typical method takes around 128 bytes just to describe its calling signature. My previous strategy was to combine identical definitions - so for example, if two different classes had a method named "foo", then they would both point to the same string. Similarly, if they had the same return type, they would both point to the same type descriptor struct.

The problem with this approach is that on a 64-bit machine, pointers are 8 bytes. So unless the method or field name is longer than 8 bytes, you're not really saving much by putting a pointer to the name instead of embedding the name directly.

I have two strategies for reducing the size of the template data:

1) Store the reflection data for template instances in an unexpanded form: In other words, instead of generating all of the metadata for each expansion of the template, merely store a reference to the original template definition and the list of type parameters. So for example, for type List[String] we simply store a reference to "List", along with the template parameter list "[String]". The reflection metadata for the specialized type will be generated lazily at runtime the first time someone attempts to access it. I would imagine that the large majority of such type descriptors will never be dereferenced, so the amount of memory consumed will be small compared to the amount required to store all templates in expanded form.

Of course, one drawback of this approach is that we now have to store the original template definition in a runtime-accessible form, which we were not doing before. References to template type variables can be represented by a special "type parameter" type (just as it is in the compiler). Creating an expansion of a template at runtime is a process of cloning all of the data structures of the template, replacing type variables with their associated definitions in the current context.

For purposes of reflection, we don't need the method bodies, just the type signatures of the methods. However, there's a reason why we might want to store the bodies of the methods anyway: To support importation of compiled modules. Right now, whenever the Tart compiler sees an import statement, it loads the source file for that import and (partially) compiles it, which may require resolving that module's imports, and so on. Instead I would like it to work more like Java - the compiler can load in a compiled module instead of having to recompile the original source. (Of course, this will also require the modules to have datestamp and size information so that it can detect when the object module is out of date. In essence the compiler needs to incorporate some of the responsibilities of "make".)

For regular (non-template) classes, we could simply use the existing reflection information that the compiler already generates. In other words, the reflection metadata within a module would have two uses: First, for runtime reflection, and second, for importation into the compiler.

With templates the situation is more complex, because currently unexpanded templates produce no output - a module that defines only templates will produce an empty binary module. In order for the compiler to instantiate a template that was imported from a binary module, the module needs to contain the *complete* definition of the template.

My current thinking is that the template data will be stored in two parts, which are the type descriptors and the method body ASTs. The reason for separating them is so that we can throw away the ASTs at linkage time (since they aren't needed in the final binary.) I will probably store the ASTs using LLVM metadata nodes. The type descriptors will be stored as static data structures, just like they are for non-template classes.

2) The other improvement to the reflection metadata is to transition from a pointer-based to an index-based reference system. Each module will contain a constants pool containing a list of strings, common types, and other information that is referenced by integer index. Variable-length integers will be used to refer to these tables, using an encoding scheme similar to UTF-8. This means that the first 128 entries in each table can be referred to with a single-byte index; the first 4096 entries require a 2-byte index, and so on. All of these tables will be sorted by number of references, so very few indices will be more than 1 byte.
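One possible shape for such an encoding is sketched below. The exact bit layout here is my own assumption for illustration (chosen so that values under 128 take one byte and values under 4096 take two); the actual scheme could differ while keeping the same size profile.

```python
def encode_index(n):
    """Variable-length index: 1 byte for n < 128, 2 bytes for n < 4096,
    3 bytes up to 2**22. The high bits of the first byte tag the length,
    UTF-8 style."""
    if n < 0x80:
        return bytes([n])                                  # 0xxxxxxx
    if n < 0x1000:
        return bytes([0x80 | (n >> 8), n & 0xFF])          # 1000xxxx + byte
    assert n < 0x400000
    return bytes([0xC0 | (n >> 16), (n >> 8) & 0xFF, n & 0xFF])

def decode_index(buf, pos=0):
    """Returns (value, next_position)."""
    b = buf[pos]
    if b < 0x80:
        return b, pos + 1
    if b < 0xC0:
        return ((b & 0x3F) << 8) | buf[pos + 1], pos + 2
    return ((b & 0x3F) << 16) | (buf[pos + 1] << 8) | buf[pos + 2], pos + 3
```

With reference-count-sorted tables, the hot entries land in the sub-128 range and almost every index is a single byte.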

Long path names will be stored using a tree-structure. So for example, the string "tart.core.Memory.Address" will be stored as (((tart . core) . Memory) . Address), where (X . Y) is a pair of varints. Here's what the tables might look like:

    Strings       Pairs
    ===========   ============
    s1: tart      p1: (s1, s2)
    s2: core      p2: (p1, s3)
    s3: Memory    p3: (p2, s4)
    s4: Address

Thus, to refer to the string 'tart.core.Memory.Address', we need only a single byte containing the value 'p3'. (The low bit of the byte determines whether it's a reference to a string or to a pair.) Moreover, if there is another string with the same prefix (e.g. "tart.core.Memory.ptrDiff"), then only the last part needs to be added to the table, and the prefix parts only need to be stored once per module.
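Here is a small sketch of that tree-structured name table, including the low-bit string/pair tag. The class and method names are illustrative; in the real module format the references would be varint-encoded rather than Python ints.

```python
class NameTable:
    """Interns dotted names as shared (prefix, part) pairs.
    References are ints: low bit 0 = string index, low bit 1 = pair index."""

    def __init__(self):
        self.strings = []
        self.pairs = []
        self._intern = {}

    def _string_ref(self, part):
        key = ('s', part)
        if key not in self._intern:
            self.strings.append(part)
            self._intern[key] = (len(self.strings) - 1) << 1       # tag 0
        return self._intern[key]

    def _pair_ref(self, left, right):
        key = ('p', left, right)
        if key not in self._intern:
            self.pairs.append((left, right))
            self._intern[key] = ((len(self.pairs) - 1) << 1) | 1   # tag 1
        return self._intern[key]

    def add(self, dotted):
        # "a.b.c" becomes ((a . b) . c); shared prefixes are stored once.
        parts = dotted.split('.')
        ref = self._string_ref(parts[0])
        for part in parts[1:]:
            ref = self._pair_ref(ref, self._string_ref(part))
        return ref

    def resolve(self, ref):
        if ref & 1:
            left, right = self.pairs[ref >> 1]
            return self.resolve(left) + '.' + self.resolve(right)
        return self.strings[ref >> 1]

t = NameTable()
addr = t.add("tart.core.Memory.Address")
# Adding "tart.core.Memory.ptrDiff" now reuses the shared prefix pairs,
# costing only one new string entry and one new pair entry.
```

The interning map also gives the unique-ID comparison property mentioned below: two equal names always produce the same reference.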

An additional side-benefit is that each unique string within a module now has a unique numeric ID which can be compared directly without having to do a string comparison (although a string comparison is still needed for cross-module references.)

The same scheme can be used for types. Each type descriptor is a string of bytes, which are a combination of type codes and indices into the module types table.

Of course, with a scheme like this, there will be a lot of duplication of definitions between modules: If module A has a definition of the tuple type (int, int) and module B has the same definition, then unlike the previous scheme they won't be shared. But that's OK, because each definition is only 3 bytes, whereas sharing a common definition would require an 8-byte pointer for each reference.

These highly-compressed type definitions will need to be expanded into data structures in order to be useful. This can be done lazily as types are inspected.