Friday, September 3, 2010

Tart status update and unsolved problems

I've been making fairly good progress on the reflection stuff. At the moment, I have both the old system and new system running in parallel, working towards a point where the old system can be completely deprecated.

There are two long-standing problems that I have not yet been able to solve, both of which are related to LLVM.

The first big problem is generating DWARF debugging info. I've put several man-months into this over the course of the past several years - I recently did a near-total rewrite of the code for generating source line information - and it still doesn't work.

Part of the problem is that the LLVM docs on source-level debugging, while extensive, are also vague and ambiguous in many places. I often find that the only way to understand how things work is to cross-reference the docs, the clang source code, and the LLVM source code. Even then, there are a lot of things that aren't clear, especially when the code in clang contradicts what's in the doc. For example, according to the source-level debugging doc, the signature of a function is defined using an array of formal parameter descriptors with the DWARF tag DW_TAG_formal_parameter. However, according to Google code search, the identifier "DW_TAG_formal_parameter" appears nowhere in the clang source tree, and if you trace through the code that generates a function type descriptor, you see that the array is in fact an array of type descriptors, not parameter descriptors.

Another difficulty with DWARF symbol generation is isolating a problem. If you make a mistake in your generator and then attempt to debug the compiled program, gdb prints various obscure error messages telling you that your debug info is invalid, but doesn't tell you where in your code the error lies or which debug symbol is having a problem. There's also dwarfdump, which pretty-prints the debug info for your program (typically thousands of pages' worth, so good luck finding the problem by eye). The easiest way to use dwarfdump is to have it print out all of the debugging info using the -a option, and hope that it segfaults (which it usually does) when it gets to the point where the problem is. Based on the last few lines of printout before the segfault, you can sometimes deduce which symbols are having problems. If your problem lies in a call frame descriptor, though, this won't work, because call frame descriptors aren't self-identifying - you can look at the text dump of a call frame descriptor, but it won't tell you which function's call frame is being described.

Tonight I am going to start experimenting with readelf, although that first requires getting it running on OS X.

The other big problem I have is with llvm.gcroot. This is the pseudo-function in LLVM which is used to mark a local variable as a garbage collection root. A linker plugin (which I have written for Tart) collects all of the calls to llvm.gcroot and uses that information to build the stack frame descriptors - a table which the garbage collector can use to trace the stack. The actual call to llvm.gcroot is removed, although it can have an effect on certain optimization passes.

The llvm.gcroot function has a curious limitation: it only works on values that are pointers, and that have been allocated with a local alloca instruction.

Typically, the call frame for a function contains pointers to objects on the heap. llvm.gcroot isn't responsible for the pointers which are contained inside of heap objects - that's the job of your collector. Instead, llvm.gcroot only deals with pointers that are on the stack. (Note that it does not deal with pointers that are SSA values, i.e. only in registers, since there's no way for a tracer to access those anyway. Any pointer must have a location on the stack where it can be traced.)

The LLVM 'alloca' instruction is used to reserve a stack slot for a variable. So typically you would have a stack slot containing a pointer, which points to some heap object. The stack tracer then iterates through all of these stack slots (only the ones containing pointers to heap objects) and traces the objects pointed to.
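To make the tracing step concrete, here is a toy model in Python. It is only a sketch of the idea, not how Tart or LLVM actually represents frames: HeapObject, FrameDescriptor, trace_object and trace_stack are all names invented for this illustration, and a "frame" is just a list of slots paired with a descriptor saying which slots are roots.

```python
from dataclasses import dataclass, field


@dataclass
class HeapObject:
    # Hypothetical heap object: a name plus the heap objects it points to.
    name: str
    refs: list = field(default_factory=list)


@dataclass
class FrameDescriptor:
    # Hypothetical stack frame descriptor: indices of the slots in a frame
    # that hold pointers to heap objects (the slots marked as GC roots).
    root_slots: list


def trace_object(obj, marked):
    """Mark obj and everything reachable from it (the collector's job)."""
    if obj is None or id(obj) in marked:
        return
    marked[id(obj)] = obj
    for child in obj.refs:
        trace_object(child, marked)


def trace_stack(frames):
    """Visit only the slots that each frame descriptor lists as roots."""
    marked = {}
    for slots, descriptor in frames:
        for index in descriptor.root_slots:
            trace_object(slots[index], marked)
    return sorted(obj.name for obj in marked.values())


leaf = HeapObject("leaf")
node = HeapObject("node", [leaf])
orphan = HeapObject("orphan")  # not referenced from any root slot

# Slot 0 holds a plain integer, slots 1 and 2 hold (possibly null) pointers.
frame = ([42, node, None], FrameDescriptor(root_slots=[1, 2]))
print(trace_stack([frame]))  # prints ['leaf', 'node']
```

The integer in slot 0 is never looked at, and the orphan object is never reached, because the tracer only follows the slots named in the descriptor.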

The problem comes when you have a stack slot that contains pointers, but isn't a pointer itself. An example would be a small structure that contains several pointers. Because the structure is not a heap object, it needs to be traced by the stack tracer. Unfortunately, llvm.gcroot won't accept a stack slot that's not a pointer - it simply emits a fatal error when you try. Nor can you call llvm.gcroot on the individual members of the structure - the argument to llvm.gcroot must be the result of an alloca instruction (a stack slot) and not a member of a stack slot.
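A toy Python model of what the tracer would need in this case, purely as an illustration. The StructRoot record and the roots_in_frame function are hypothetical names made up for this sketch; nothing like them is claimed to exist in LLVM, and llvm.gcroot as described above cannot express the struct case.

```python
from dataclasses import dataclass


@dataclass
class HeapObject:
    # Hypothetical heap object, identified only by name.
    name: str


@dataclass
class StructRoot:
    # Hypothetical descriptor entry for a struct stored inline in a stack
    # slot: which slot it is, and which of its fields are pointers.
    slot: int
    pointer_fields: tuple


def roots_in_frame(slots, pointer_slots, struct_roots):
    """Collect the names of heap objects reachable directly from one frame."""
    found = []
    # Plain pointer slots: the case that a pointer-only root can describe.
    for index in pointer_slots:
        if slots[index] is not None:
            found.append(slots[index].name)
    # Inline structs: the tracer has to look at each pointer field.
    for root in struct_roots:
        inline_struct = slots[root.slot]
        for field_index in root.pointer_fields:
            if inline_struct[field_index] is not None:
                found.append(inline_struct[field_index].name)
    return found


# Slot 0 is a pointer; slot 1 is a struct (int, pointer, pointer) stored inline.
slots = [HeapObject("a"), (7, HeapObject("b"), HeapObject("c"))]

print(roots_in_frame(slots, [0], []))                       # prints ['a']
print(roots_in_frame(slots, [0], [StructRoot(1, (1, 2))]))  # prints ['a', 'b', 'c']
```

With only pointer slots described, the objects referenced from inside the struct are missed, which is exactly the situation a collector cannot tolerate.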

Now, this is not a problem for languages like Java - Java only allows stack slots for primitive types and pointers. (There's no such thing as a 'struct' in Java.) It's not a problem for C and C++, which don't have garbage collection. It *is* a problem for C# and D - those compilers get around the problem by simply not using LLVM's garbage collection framework at all. I don't want to do that because it requires writing a whole lot of code that is specific to each CPU architecture, which I'm not up to.

So far the only helpful suggestion from the LLVM mailing list is to allocate an extra stack slot for each local struct variable, containing a pointer to that struct. You would then declare the pointer as a gcroot instead of declaring the struct directly. However, I've been hesitant to go this route because it seems like an ugly hack, not to mention that there's now an additional 8-byte overhead for each local structure variable. Also, it makes the tracer more complicated, because now you have to deal with pointers to stack objects as well as pointers to heap objects.
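Here is a toy Python sketch of that workaround and of the extra case it pushes onto the tracer. All of the names (HeapObject, StackStruct, trace_roots) are invented for illustration, and the isinstance check stands in for whatever a real tracer would use to tell a stack address from a heap address, for example root metadata or an address range check, which is an assumption on my part.

```python
from dataclasses import dataclass


@dataclass
class HeapObject:
    # Hypothetical heap object, identified only by name.
    name: str


@dataclass
class StackStruct:
    # Hypothetical local struct living on the stack: its field values and
    # the indices of the fields that are pointers to heap objects.
    fields: tuple
    pointer_fields: tuple


def trace_roots(root_slots):
    """Each root slot holds a pointer, either to a heap object or to a
    struct on the stack. Return the names of the heap objects found."""
    found = []
    for target in root_slots:
        if target is None:
            continue
        if isinstance(target, StackStruct):
            # Pointer to a stack struct: don't treat it as a heap object,
            # just visit the pointers stored inside it.
            for index in target.pointer_fields:
                if target.fields[index] is not None:
                    found.append(target.fields[index].name)
        else:
            # Ordinary pointer to a heap object.
            found.append(target.name)
    return found


pair = StackStruct(fields=(7, HeapObject("b"), None), pointer_fields=(1, 2))

# The second root is the extra slot that points at the local struct.
print(trace_roots([HeapObject("a"), pair, None]))  # prints ['a', 'b']
```

The branch on the kind of target is the added complexity: every root now has to be classified before it can be traced.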

I've thought about digging in and changing the way llvm.gcroot works. But the code is complicated and I don't entirely understand it, and I don't like mucking around with things I don't understand.
