Thursday, November 11, 2010

Tart: Still Too Large

Although the new reflection system has managed to reduce executable sizes somewhat, I'm still concerned about how large they are overall.

For example, the unit test SwitchStmtTest weighs in at a binary executable size of about 350K. This is with optimization turned on (including dead code and global elimination), and with all debugging info stripped. That seems quite large to me - I'd like the size to be more on the order of 50K (my rough guess at how large an equivalent C++ program would be). Note that with optimization off and debugging turned on, the size balloons to a whopping 1.5 megs.

I decided to do an analysis to see what exactly was taking up so much space. I used 'nm' for this purpose, along with a Python script that computed the size of each symbol in the object file based on the address deltas. These numbers are very rough, since there's a lot of stuff in the object file that is not accounted for. I'm only using the numbers as an order-of-magnitude guesstimate of sizes.
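
For the curious, here's a minimal sketch of that technique. The real script was Python, and the package-prefix grouping shown here is approximate; this C++ rendering just illustrates the address-delta idea:

    #include <cstdint>
    #include <iostream>
    #include <map>
    #include <sstream>
    #include <string>

    // Reads the output of 'nm -n <binary>' from stdin. Since 'nm -n' sorts
    // symbols by address, the gap between one symbol's address and the next
    // is a crude estimate of the symbol's size; each gap is charged to the
    // symbol's package prefix (e.g. "tart.core").
    int main() {
        std::map<std::string, uint64_t> sizes;
        uint64_t prevAddr = 0;
        std::string prevPkg, line;
        while (std::getline(std::cin, line)) {
            std::istringstream in(line);
            std::string addr, type, name;
            if (!(in >> addr >> type >> name))
                continue;  // undefined symbols have no address field
            uint64_t a = std::stoull(addr, nullptr, 16);
            if (!prevPkg.empty())
                sizes[prevPkg] += a - prevAddr;
            prevAddr = a;
            // Take the first two dotted components as the package name.
            std::string::size_type dot = name.find('.');
            prevPkg = name.substr(0, name.find('.', dot + 1));
        }
        for (std::map<std::string, uint64_t>::const_iterator i = sizes.begin();
             i != sizes.end(); ++i)
            std::cout << i->first << ": " << i->second << "\n";
        return 0;
    }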

The contents of the object file can be broken down in several ways. Here's a breakdown by package:
  • tart.atomic: 16
  • tart.collections: 24712
  • tart.core: 37875
  • tart.gc: 288
  • tart.reflect: 34823
  • tart.testing: 3536
  • tart.text: 6179
  • SwitchStmtTest: 4435
  • Total: 111864
The numbers represent the total size, in bytes, of all symbols whose names start with the package name.

One of the first things to notice is that the total is only about a third of the size of the object file. I'm not at all sure why this is.

Also, I'm curious why tart.atomic was even included at all, given that nothing refers to it.

In the case of templates, the template expansions are charged to the package defining the template, not the package that instantiates it. In fact, template expansions account for a large portion of these numbers. Two templates dominate:
  • class tart.core.Array: 26879
  • class tart.collections.ArrayList: 13840
This leads me to believe that I need to think very hard about how to reduce template expansion sizes. I tried Nick's -mergefunc LLVM pass (which coalesces functions that produce identical machine instructions), and it had very little effect. That may be due to a bug in the -mergefunc pass, or it may be that there are in fact tiny differences in the generated code for the different template instantiations.

Another way to break down the numbers is by symbol type. For example, if I add up all of the static metadata for reflection, the total comes to 27645.

Here's a breakdown of some of the symbols (only items > 500 bytes are listed):
  • Non-reflection data (dynamic dispatch tables, instance allocators, etc.)
    • Constant tart.core.String instances: 16256 (average size 66)
    • Type info blocks: 6784 (average size 45)
    • Type base class lists: 1952 (average size 33)
    • Interface dispatch functions: 3168 (average size 36)
    • Compiler-generated instance allocators: 800 (average size 38)
    • Trace tables for garbage collection: 736 (that seems low!)
  • Reflection-related data
    • Type adapter functions (for boxing / unboxing reflected calls): 20912
    • Class and module reflection structures: 13216
    • Compressed name tables: 7864
    • Field offset tables: 1080
    • Tables of pointers to types referred to by other types: 2400
Here's my list of takeaways from all this:
  • 64-bit code is really fat. Even a trivial function can be >1K.
  • Template expansions are taking a large amount of space.
  • I can probably optimize the type adapter functions. (These are functions generated by the compiler that auto-box and unbox function parameters; a sketch of what they look like follows this list.)
  • The reflection structures for classes and methods can probably be trimmed a bit if I am willing to use void* pointers (meaning I can combine multiple tables into one).
  • It may be possible to reduce the number of generated strings somewhat.
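Since the type adapters are the single largest reflection cost in the breakdown above, here's a rough sketch of the shape of one - hypothetical names and object model, not Tart's actual generated code:

    // A boxed value type standing in for the real object model.
    struct Object { virtual ~Object() {} };
    struct BoxedInt : Object {
        int value;
        explicit BoxedInt(int v) : value(v) {}
    };

    // The method being made callable through reflection.
    static int add(int a, int b) { return a + b; }

    // The compiler emits one adapter per reflected method: it unboxes the
    // uniform Object* argument array, calls the real method, and boxes the
    // result. Hundreds of these near-identical thunks add up quickly, which
    // is why they look like a promising target for merging or shrinking.
    Object* add_adapter(Object** args) {
        int a = static_cast<BoxedInt*>(args[0])->value;
        int b = static_cast<BoxedInt*>(args[1])->value;
        return new BoxedInt(add(a, b));
    }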
Coming up with a solution for template expansions is going to be the hardest. Ideally, we want to re-use the generic code as much as possible - so instead of separate definitions for tart.core.Array[Object].Iterator.next() and tart.core.Array[String].Iterator.next() - which do exactly the same thing - we can point them both to a common definition. The problem is that not all template methods are type-independent. For example, the tart.core.Array[Object].copyOf() method allocates a new array instance, and the type of that instance will be different for different template instances of copyOf().
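
To make the distinction concrete, here's a C++ analogue (hypothetical types, not Tart's actual Array implementation):

    #include <cstddef>
    #include <cstring>

    template <typename T>
    struct Array {
        size_t length;
        T* data;

        // Type-independent for all pointer element types: the generated code
        // loads one machine word and bumps an index, whether T is Object* or
        // String*. These instantiations could share a single definition.
        T next(size_t& index) { return data[index++]; }

        // Type-dependent: 'new' bakes the allocation size and type metadata
        // of the new instance into each instantiation, so the generated code
        // genuinely differs from one element type to the next. (The memcpy
        // is fine for the pointer elements this sketch assumes.)
        static Array* copyOf(const Array& src) {
            Array* result = new Array;
            result->length = src.length;
            result->data = new T[src.length];
            std::memcpy(result->data, src.data, src.length * sizeof(T));
            return result;
        }
    };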

Each template class method will be affected to a greater or lesser degree by variation in the type of the template parameter. We can group functions into classes by the degree of this impact:
  • No impact - the function is exactly the same regardless of the template parameter. An example is the method that returns the length of an array, which is the same regardless of the type of the array element.
  • Different type sizes - for example, the array copyElements() method treats all of the array elements opaquely (it just memcpys them), so the only thing it cares about is the size of the elements. Thus, all pointer arrays are the same as far as it is concerned.
    • For these cases, the solution is to generate a version of the function for each distinct element size - one version for 1-byte array elements, one for 2-byte, 4-byte, and so on (see the sketch after this list).
  • Parametrically polymorphic - this means that different instances of the method do the same thing, but with different types - so for example, one version might add two ints, while another adds two floats.
    • One possible approach would be to somehow migrate the type references outside of the function, possibly into the static class data - so, for example, when copyOf() creates a new instance, instead of hard-coding the type it would look the type up (again, see the sketch after this list). However, this seems pretty complicated, and I'm not sure how to do it.
  • Ad hoc polymorphic - this basically means "anything goes" - two different versions of the same template method might end up with completely different behavior. While functional language purists frown heavily on such things, in practice there are a few cases where this can be useful, especially in the metaprogramming domain.
    • An example of how the code can differ: suppose we have a type A which defines "foo" as a constant, and a type B which defines it as a variable. If a template "Example" has a method "bar" which operates on the "foo" property, then the generated code for Example[A].bar() will be very different from the generated code for Example[B].bar().
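Here's a minimal sketch of what the first two remedies might look like, again in C++ (the names and the TypeInfo record are hypothetical, not Tart's actual runtime):

    #include <cstddef>
    #include <cstdlib>
    #include <cstring>

    // Size-based sharing: one out-of-line routine per element size. Every
    // instantiation of the template shim collapses into a call parameterized
    // only by sizeof(T), so e.g. all pointer arrays share a single version.
    void copyElementsRaw(void* dst, const void* src,
                         size_t count, size_t elemSize) {
        std::memcpy(dst, src, count * elemSize);
    }

    template <typename T>
    void copyElements(T* dst, const T* src, size_t count) {
        copyElementsRaw(dst, src, count, sizeof(T));  // thin, inlinable shim
    }

    // Type lookup: move the type reference out of the function body. The
    // TypeInfo record stands in for static class data that the shared code
    // consults at run time, instead of having the type hard-coded into it.
    struct TypeInfo {
        size_t elementSize;
    };

    void* copyOfRaw(const void* src, size_t count, const TypeInfo* type) {
        void* result = std::malloc(count * type->elementSize);
        std::memcpy(result, src, count * type->elementSize);
        return result;
    }

The first transformation is fairly mechanical; the second is the one that seems hard, since every type-dependent operation in the method body has to be rerouted through the type record.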
Detecting which group a method falls into requires tracing through the control flow graph of the function, and analyzing each expression, type, and value to determine how that item will vary based on the template's parameters.

In a language with a JIT, the solution to these problems is simple - you instantiate the template instance methods lazily. So for example, if no one ever calls tart.core.Array[String].copyOf(), then just don't generate that method. The problem with a language that generates native binaries (and this applies to more than just templates) is that everything has to be generated in advance, because you don't know ahead of time what will and won't be used.

A final issue that I need to consider is "when?". Do I need to address the size issues right now, or can it wait until more language features go in? I know that the common wisdom is "optimize last", but at some level a program can perform so poorly that no one will consider it worth the trouble of using, let alone optimizing. I know, for example, that a typical XBox developer would never bother to use Tart in its current state, since program size is a critical issue there (there's no virtual memory on the XBox, and games tend to use all of memory). Also, I don't like building on shaky foundations.
