Machine Words: February 2010

Saturday, February 27, 2010

Stuck

I've come to an impasse on Tart, and I'm not sure what to do. The issue is garbage collection, in particular the issue of finding all roots on the stack.

LLVM provides a built-in mechanism for recording stack roots called "shadow-stack". Essentially what this does is it maintains a separate data structure that mirrors the real stack. This data structure consists of a singly-linked list, where each list entry points to a constant data structure that describes the locations of roots within the a call frame. Each time a function is entered, a shadow-stack entry is pushed onto the shadow stack, and popped on exit to the function.

The main problem with shadow-stack is that it's not thread-safe. The head pointer of the singly-linked list is a global variable, which is modified on each function entry and exit. There's no way this can work with a multi-threaded runtime environment.

What about modifying the shadow-stack to use a thread-local linked list? Well, on Linux this would work fine. GCC on Linux supports fast access to thread-local variables via the __thread attribute. From what I understand, a special register is dedicated to pointing at the base of stack-local variables, so accessing these variables is quick.

However, other operating systems don't support fast access to thread-locals. For example, Mac OS X / Darwin does not have support for __thread. Instead, one must use the pthreads library to gain access to thread-local data. On Windows there is a similar set of functions to pthreads. In order to make shadow stack work on these platforms, one would need to call pthread_getspecific() in every function (although I suppose you could skip functions that had no roots.) Even if the overhead for a single call to pthread_getspecific() is modest, having to call it on every function will likely slow down execution considerably.

Another alternative is to add an additional parameter to every function containing the address of the head of the stack frame list. Alternatively, you could eliminate the global entirely and pass the head of the stack frame list to each child function, which would append to it. This head pointer would be passed in to gc_alloc() or any other function that could potentially cause a collection cycle.

Adding an extra parameter to every function call would not only degrade performance, it also has a problem with calling external C functions which might want to in turn call Tart functions. The stack frame list pointer would need to pass through the non-Tart code (one assumes that the stack frame for the external C function contains no roots.) This limits the ability of Tart to call arbitrary foreign functions.

The ideal solution would be to walk the stack directly, without needing an auxiliary data structure. For example, with _Unwind_Backtrace, I can step through all of the call frames of the current thread. For each of those call frames, I can get either the current IP address (using _Unwind_GetIP) or the start of the function (using _Unwind_GetRegionStart). Theoretically, one could then take the value of these registers and look them up in a map, which would return a constant data structure describing the locations of the roots in the call frame. Only functions actually containing local variable roots would need to be in the map.

I don't know how slow the _Unwind functions are, but even if they are considerably slower than the pthread functions this is still a win - because these functions only need to be called during a collection, not on every function entry.

However, several important pieces are still missing. The various _Unwind functions give me the IP and the function start, but not the address of the call frame. There is an ability to get the value of any arbitrary register, but I don't know which register contains what I would need, which would likely be different on different processors and maybe different OSs as well. Also, even if I had the address of the call frame, I don't know if the table of offsets from the call frame to the stack roots would be reliable, since I imagine things might shift around during code generation. The bigger problem for me is that this is an area of programming that I am unfamiliar with - the last time I did serious poking around with call frames and return addresses, it was in the early 80s and I was doing Z-80 and 6809 assembly. Things have gotten a lot more complicated since then and none of my previous knowledge is applicable.

Another problem with the _Unwind functions is that they aren't supported on all platforms either. (This is a problem I will eventually have to solve anyway, since my exception personality function also uses these functions.)

LLVM has a couple of functions for walking back the stack, however they are accompanied by large warnings saying that these functions are unreliable for anything other than the immediate caller frame. Also, these functions supply the return address and the frame pointer, but not the start of the function - that would mean that my table of function addresses to stack maps would have to contain ranges rather than single addresses, and lookup would be necessarily slower.

Monday, February 15, 2010

New intro document

I've added a new document called "A Taste of Tart" that is intended to be a very brief introduction to the language:

http://code.google.com/p/tart/wiki/TasteOfTart

Sunday, February 14, 2010

Switch statements with string values

I decided to go ahead and implement string case values for switch statements:

def setOption(key:String, value:String) {

  switch key {
   case "TargetLanguage" {

   targetLanguage = value;
   }

   case "TokenType" {

   tokenTypeName = value;
   }


   case "TokenPrefix" {

   tokenNamePrefix = value;
   }


   case "StateType" {

   stateTypeName = value;
   }

   case "StartState" {

   startStateName = value;
   }

    else {

   // log unknown option
   }

  }
}

Currently it just transforms this construct into a sequence of if/else statements. However, it would be possible to optimize this by precomputing a hash value for each string, or making a first-letter index table.

In fact, it wouldn't be hard to extend this to any data type that is (a) hashable, and (b) representable as a constant literal. However, I'm not sure I want to go that far.

Tuesday, February 9, 2010

Some new ideas on garbage collection

Up to now, I haven't been making very much progress on garbage collection, because there are a number of problems that I haven't been able to solve. For example, if each mutator thread has its own private thread-local allocation space (for the young generation), what happens when an object belonging to one thread points to an object in another thread's allocation area? It means that in order to do a collection, both threads would have to be stopped - but how would you know which threads to stop without scanning all of the objects to see which ones refer to other spaces?

So lately I've been researching what other collectors do, and one place I looked was OpenJDK, which lead me to a couple of whitepapers on HotSpot garbage collection, from which I learned a few things.

First off, threads in HotSpot don't each have separate young generations. Rather, you have a single young generation which is divided up into TLABs (Thread-local allocation buffers). The key difference is that *all* of the threads are stopped during a young-generation (YG) collection.

You see, in my original design, each thread had a separate YG allocation space for young objects, and when that space got full, that one thread only would be paused and a collection cycle would be run on that one space. Of course, there might be a few pointers from inside the space pointing to objects outside the YG space, and vice versa, but I figured that those could be handled as special cases - perhaps by keeping a separate list of such references. However, I never came up with a way to do this efficiently.

However, in the HotSpot JVM, there is only a single young generation which is divided into TLABs. Each thread has an assigned TLAB, which is a linear region of memory in which it can use to do efficient, lockless "bump-pointer" allocation. When a TLAB runs out of space, the thread gets a new TLAB from the available pool in the young generation (this action requires a lock). Only when we run out of TLABs do we need to actually perform a YG collection. At this point all threads have to be stopped.

According to the whitepaper, YG collections are fast enough that pause times created by stopping all threads and doing the collection aren't an issue - it's only the OG collection that creates problematic pauses. And because all threads are stopped at once, the collector is free to move objects around as needed, which completely solves my problem.

I already came up with a fairly efficient and portable mechanism for stopping threads as part of Scarcity. It basically involves inserting sync points into the generated code, where each sync point tests a flag that says whether or not we are ready to do a collection. If the flag is true, then we use the conventional synchronization primitives to stop the thread.

One issue with sync points is whether the testing of the flag is going to incur an unacceptable overhead, especially if it has to be synchronized between processors. My take on this is that there are two kinds of sync points: "mandatory" (as in "you need to stop right now") sync points, which occur before heavyweight operations such as memory allocation or i/o, and "advisory" or "probabalistic" sync points (as in "we'd like you to stop sometime soon") sync points which occur in inner loops. In other words for the inner loop case, it's ok if testing the flag is non-deterministic as long as its also low-overhead. (Well, it has to be guaranteed that *eventually* the flag will be read. I don't know if any of the processors with weak memory semantics provide that as an option.)

So for now the solution is to use double-checked locking, sort of. I realize that's a hazardous course, but it's the only technique I know of that won't cause unacceptable overhead.

Monday, February 8, 2010

Atomics

A fairly rudimentary AtomicInt type:

def testAtomicInt {

  var a = AtomicInt32(0);
  assertFalse(a.cas(1, 1));

  assertEq(0, a.value);
  assertTrue(a.cas(0, 1));

  assertEq(1, a.value);
}

Friday, February 5, 2010

Dependency injection

I've written a unit test that demonstrates a very rudimentary form of dependency injection in Tart. Here's a small code snippet:

import tart.inject.Inject;
import tart.inject.Injector;

// A class that has a no-arg constructor.

class NoArgConstructor {
  def construct {}

}

// A class that has a single arg constructor annotated with @Inject.
class SingleArgConstructor {

  @Inject def construct(a0:NoArgConstructor) {}
}

def testInjection() {
  // Request an instance of type 'SingleArgConstructor'.

  let instance = Injector().getInstance(SingleArgConstructor);
}

In the above example, the call to 'getInstance' causes two objects to be created. First, it creates an instance of 'NoArgConstructor', because that is needed as a constructor parameter to the second instance created which is of type 'SingleArgConstructor'.

This example does not yet support any of the Guice-style features such as interface bindings, singletons, and so on. But those could easily be added.

Under the hood, what is happening is this: When the injector's 'getInstance' method is called, it uses reflection to examine the type literal passed to it. It searches the resulting class definition for a constructor annotated with @Inject, or failing that, a constructor with no arguments. It then examines the list of constructor parameter types, and for each one it recursively calls getInstance() for that type. These constructor parameters are then stored in an array, and are then used to invoke the constructor method via reflection.

This is also the first time that Tart has supported runtime attributes (the @Inject annotation), up to this point all attributes have not been retained past compile time. (Only annotations who are specially marked are retained in the output binary.)

I plan to model my dependency injection framework after Guice, up to a point. One deviation is that in a language in which function closures are a first-class type, there's no need for a Provider type, since the injector can simply return a factory function instead. So for example, if you want to create an object lazily, you will eventually be able to say:

let provider = Injector().getProvider(SingleArgConstructor);

// later...
let instance = provider();

The call to 'getProvider' will return a function closure that will construct a new instance of the requested type each time it is called. (Unless the type is marked as a singleton, in which case it will return the same instance each time.)