Saturday, February 27, 2010
I've come to an impasse on Tart, and I'm not sure what to do. The issue is garbage collection, in particular the issue of finding all roots on the stack.

LLVM provides a built-in mechanism for recording stack roots called "shadow-stack". Essentially what this does is it maintains a separate data structure that mirrors the real stack. This data structure consists of a singly-linked list, where each list entry points to a constant data structure that describes the locations of roots within the a call frame. Each time a function is entered, a shadow-stack entry is pushed onto the shadow stack, and popped on exit to the function.

The main problem with shadow-stack is that it's not thread-safe. The head pointer of the singly-linked list is a global variable, which is modified on each function entry and exit. There's no way this can work with a multi-threaded runtime environment.

What about modifying the shadow-stack to use a thread-local linked list? Well, on Linux this would work fine. GCC on Linux supports fast access to thread-local variables via the __thread attribute. From what I understand, a special register is dedicated to pointing at the base of stack-local variables, so accessing these variables is quick.

However, other operating systems don't support fast access to thread-locals. For example, Mac OS X / Darwin does not have support for __thread. Instead, one must use the pthreads library to gain access to thread-local data. On Windows there is a similar set of functions to pthreads. In order to make shadow stack work on these platforms, one would need to call pthread_getspecific() in every function (although I suppose you could skip functions that had no roots.) Even if the overhead for a single call to pthread_getspecific() is modest, having to call it on every function will likely slow down execution considerably.

Another alternative is to add an additional parameter to every function containing the address of the head of the stack frame list. Alternatively, you could eliminate the global entirely and pass the head of the stack frame list to each child function, which would append to it. This head pointer would be passed in to gc_alloc() or any other function that could potentially cause a collection cycle.

Adding an extra parameter to every function call would not only degrade performance, it also has a problem with calling external C functions which might want to in turn call Tart functions. The stack frame list pointer would need to pass through the non-Tart code (one assumes that the stack frame for the external C function contains no roots.) This limits the ability of Tart to call arbitrary foreign functions.

The ideal solution would be to walk the stack directly, without needing an auxiliary data structure. For example, with _Unwind_Backtrace, I can step through all of the call frames of the current thread. For each of those call frames, I can get either the current IP address (using _Unwind_GetIP) or the start of the function (using _Unwind_GetRegionStart). Theoretically, one could then take the value of these registers and look them up in a map, which would return a constant data structure describing the locations of the roots in the call frame. Only functions actually containing local variable roots would need to be in the map.

I don't know how slow the _Unwind functions are, but even if they are considerably slower than the pthread functions this is still a win - because these functions only need to be called during a collection, not on every function entry.

However, several important pieces are still missing. The various _Unwind functions give me the IP and the function start, but not the address of the call frame. There is an ability to get the value of any arbitrary register, but I don't know which register contains what I would need, which would likely be different on different processors and maybe different OSs as well. Also, even if I had the address of the call frame, I don't know if the table of offsets from the call frame to the stack roots would be reliable, since I imagine things might shift around during code generation. The bigger problem for me is that this is an area of programming that I am unfamiliar with - the last time I did serious poking around with call frames and return addresses, it was in the early 80s and I was doing Z-80 and 6809 assembly. Things have gotten a lot more complicated since then and none of my previous knowledge is applicable.

Another problem with the _Unwind functions is that they aren't supported on all platforms either. (This is a problem I will eventually have to solve anyway, since my exception personality function also uses these functions.)

LLVM has a couple of functions for walking back the stack, however they are accompanied by large warnings saying that these functions are unreliable for anything other than the immediate caller frame. Also, these functions supply the return address and the frame pointer, but not the start of the function - that would mean that my table of function addresses to stack maps would have to contain ranges rather than single addresses, and lookup would be necessarily slower.


sam said...

was wondering if you took a look at the implementation of GC in LDC compiler ( that uses LLVM as backend.They have GC implemented in their library called Tango which is also used by other compilers implemented for D language.

Post a Comment