Friday, October 30, 2009

Factoring reflection

Now that boxing and unboxing are out of the way, that clears the way for me to start working on function reflection. The reason I needed the boxing support is due to the way that reflected methods are called. For example, suppose we have a method foo of some class Bar:

   class Bar {
     def foo(a:int, b:int) -> int { }
   }

Reflecting 'foo' yields an object of type tart.reflect.Method, which among other things has a "call" method:

   class Method {
     def call(obj:Object, args:Object[]) -> Object;
   }

What 'call' needs to do is typecast the arguments from type Object into the correct argument types. For arguments whose type is a subclass of Object, we need to downcast the argument to that type. For arguments that are primitive types, the input argument must be a boxed value, and 'call' must therefore unbox it. In either case, if the input argument is not the right type, then a TypecastException is thrown. Note that all of this code is generated automatically by the compiler - for each function, there is an instance of 'Method' which is a static immutable global containing the information for that function.

One problem here is that the amount of code and data generated for a single function is large - thus, a simple 'getter' that returns a value might only require a handful of instructions, but the 'call' method needed to typecheck and convert its arguments may be an order of magnitude larger. Generating a 'call' function for every user-defined function would lead to major code bloat.

One obvious optimization is to recognize that the argument type checking logic is the same for functions that have the same type. Thus, if I have two methods a(int, int) and b(int, int), they can share the same type-casting code. We can factor out this code into a function, which I will call "invoke", which takes the arguments of "call" plus a reference to the function itself:

  var invoke:static fn (func:fn (), obj:Object, args:Object[]) -> Object;

The syntax is a bit unfamiliar - the sequence 'static fn' is used to declare a reference to a function that does not have a 'self' argument. What the above declaration represents is a pointer to a function which accepts three arguments, the first of which is a function.

The compiler-generated 'Method' object will now contain two pointers: one to the function itself, and one to the 'invoke' function for that function type. The 'invoke' function is shared by all functions that have the same type. This sharing even works across module boundaries, because each 'invoke' function has a unique name which is derived from the argument types, such as ".invoke(int, int)->int". (LLVM allows function names to contain any printable character, a feature which I rely on heavily.)

Next we need to consider how to handle the 'self' argument. The first approach is to treat it like any other argument to 'invoke'. However, this greatly increases the number of permutations of 'invoke' that get generated. Since the 'self' argument is different for every class, it means that no class can share their 'invoke' methods with other classes.

A better way is to hoist the type-checking of self out of the invoke method and put it in a separate function. So now we have 3 pointers in the Method class: the function pointer, the pointer to invoke, and the pointer to the self type check. Actually the last one doesn't need to be a function pointer, it only needs to be a pointer to data needed to perform an 'isa' test.

As an aside, this same organization can be used for the other data stored in the Method object, such as the list of function parameters. For example, if we store the parameter names in a separate array from the parameter types, then the parameter type descriptor objects can be shared with all other methods that have the same parameter types. Similarly, the array of parameter names can be shared with any function that has the same param names, even if the parameter types are different.

One final issue to address: What if the type of 'self' is not an object? You see, in Tart, there are two major branches on the tree of types: Reference types, which are always passed by reference, and Value types, which are always passed by value. Primitive types are value types; so are structs. Reference types are allocated on the heap and are garbage-collectable; Value types can only live inside of other objects, or on the stack.

Normally any value type can be converted into an Object by boxing, so if you need to pass an int to a function that expects an Object, you can just box it. However, boxing creates a *copy* of the object. That's fine for most parameters, but for 'self' we don't want to create a copy, since that would make it impossible to write a method that mutates the fields of the struct. We need a pointer to the original object.

Tart deals with this by handling the 'self' parameter of value types specially: It is always a reference, even if the type is not a reference type. This has to be handled carefully, however. With Reference types, the 'self' pointer always points to the beginning of a heap allocation, but the 'self' pointer of a value type might point into the middle of an object. The garbage collector doesn't know how to handle such pointers. To avoid memory leaks or data corruption, we can't allow the 'self' argument of a method of a value type to leave the function. We can do this by making 'self' a special type that can be member-dereferenced only - that is, you can access any member, but you can't assign the value to anything. This is for value types only, of course.

But now we want to be able to call methods on value types via reflection, which means we *have* to have some way to pass a struct reference around. For structs which live in the middle of an object, the safe thing to do is to create a wrapper that contains both the struct pointer and a 'base' pointer to the object that contains the struct. That way, the garbage collector can trace the 'base' pointer normally, ignoring the struct pointer.

For structs which live on the stack, there's only one remedy, which is to box them ahead of time, and then pass the boxed reference around. Essentially we will have to have the compiler notice if anyone is using the struct in a way that requires a persistent reference to the object, and convert the stack-based object into a heap-based one at compile time. Note that this is not that different from what happens when we create a closure cell - we notice that someone is using the variable from within a closure, in which case we wrap the variable and put it on the heap.

--
-- Talin

No comments:

Post a Comment