Sunday, October 30, 2011

Thoughts on a build configuration language

Work on Tart is on hold for the moment while I am waiting to hear back from the LLVM folks about a bug in llc.

In the meantime, I want to write down a few notes about "mint", which is the name of the hypothetical future build tool that I want to create. Although my plan is to write mint in Tart, there's nothing Tart-specific about it, and it could just as easily be written in C++, Python, or some other language.

I've been thinking about this a lot lately, especially because there's a big discussion on llvm-dev about the future of the build system, including some very passionate opinions about the pros and cons of cmake.

Anyway, here are some basic requirements for mint:
  • It should be fast.
  • It should be extensible, especially with respect to 3rd-party tools. One should be able to write an extension module for yacc or antlr that makes it as easy to invoke those tools as it is to invoke gcc.
  • It should support hermetic builds, so that a binary produced on one host machine is identical to one produced on another, even if those machines have different environments.
  • It should support generating project files for popular IDEs such as Eclipse, XCode, Visual Studio, and so on.
I'd also like it to support automatic download of dependent packages, but unfortunately that's an area in which I don't have a lot of experience.

So the basic idea is that the build configuration language (BCL) is a hybrid functional / declarative language that supports prototype inheritance.

The language is declarative in the sense that what it primarily does is describe a graph of objects, the Build Dependency Graph (BDG). Objects in the graph include targets, files, tools, actions, and other entities which are used in the build. Making the language declarative means that a build configuration file describes the relationships between these objects, but doesn't say how these relationships are processed or in what order - that is determined by whatever tool is actually doing the build. This is what makes it feasible to generate a project file for an IDE.

The language is functional in that an object's properties can be expressions which are lazily evaluated. For example, the set of flags passed to a compiler is calculated from a number of different inputs, including the user's choices as to whether to enable debug symbols, assertions, and so on.

Prototype inheritance allows for objects in the graph to be re-used. For example, when you declare a build target, you start by inheriting from an abstract build target prototype, and then fill in the details about your specific target. The abstract build target prototype knows about source files, output files, and dependent targets; your target fills in the details of what specific files are involved. Or you may choose to inherit from a build target prototype that knows how to invoke the C++ compiler, which saves you the trouble of having to write the command-line flags for the compiler.

Because prototype inheritance is used so much in this language, the syntax is very simple:

<prototype> { <properties> }

Where 'prototype' is some previously defined object, and 'properties' is a list of key/value pairs. So for example:

myProgram = program.cpp {
  sources = [ "main.cpp" ]
}

Here we are starting with the standard prototype "program.cpp", and filling in the 'sources' slot ('sources' is defined in the prototype, but it's empty). We are then assigning that to the name 'myProgram'.

We can then, if we wish, create a variation of 'myProgram' that is compiled with optimization:

myProgramOpt = myProgram {
  opt.level = 2
}

The "opt.level" property was also defined in the program.cpp prototype. The program.cpp prototype has a tool definition for the cpp compiler. Here's a glimpse of what the program.cpp prototype might look like, with many details omitted.

program.cpp = program.base {
  opt = dict {
    level = 0
  }
  cflags = []
  tool = compiler.cpp {
    flags = [
      match (opt.level,
        0 => "-O0",
        1 => "-O1",
        2 => "-O2",
        3 => "-O3",
        * => raise("Invalid optimization level: " ++ opt.level))
    ] ++ cflags
  }
}

The first few lines define a property called 'opt' whose value is a dictionary type - basically just a namespace. Within that is the property 'level', which defaults to zero. (I'm undecided as to whether 'level' should have a type as well as a default value - it wouldn't make sense to assign a string to 'level'.)
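If 'level' did carry a type, the declaration might look something like this (purely a sketch of one possible syntax):

opt = dict {
  level : int = 0
}

With that, assigning a string to 'opt.level' could be rejected as soon as the configuration is read, rather than when the flags are evaluated.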

Next we have the tool that will get run when this target gets updated. The "flags" property is a list of flags that gets passed to the compiler on the command line. When the compiler is invoked, each element of 'flags' will be evaluated and transformed into a string; those strings are then passed to the compiler as its shell arguments.

In the case of opt.level, we use a match expression to transform the optimization level numbers 0..3 into compiler flags of the form -On. The idea here is to abstract away the specific syntax of the compiler's command-line arguments, so that the same syntax can be used to control different compilers.

There is also a 'cflags' property that can be overridden by descendant objects, allowing flags specific to a particular compiler to be passed in.
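For example, a target could pass extra compiler-specific flags like this (a sketch - the flag itself is just a stand-in for whatever gcc options you need):

myProgram = program.cpp {
  sources = [ "main.cpp" ]
  cflags = [ "-fno-exceptions" ]
}

Because 'flags' in the prototype ends with '++ cflags', these extra flags simply land at the end of the compiler command line.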

The main idea here is that "builtin" compilers for C and C++ use the same descriptive syntax as for 3rd-party extension compilers. In other words, the built-in rules aren't special.
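To make that concrete, a hypothetical extension module for yacc might define its tool with the same machinery (a sketch - 'tool', 'outputs', and 'args' are invented names here, not part of any existing definition):

compiler.yacc = tool {
  sources = []
  outputs = [ "y.tab.c", "y.tab.h" ]
  args = [ "-d" ] ++ sources
}

A grammar target would then inherit from a prototype wired to this tool, in exactly the same way that program.cpp is wired to compiler.cpp.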

Another kind of prototype is used to define the command-line options to the build script itself:

enable-asserts = option {
  type = bool
  default = false
  help = "Whether to enable assertion checks"
}

This means that the user can type "mint --enable-asserts" to generate a build configuration with assertions enabled, or type "mint --help" to print out the various options available.
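An option is just another object in the graph, so other expressions can refer to it. For instance, a prototype could consume 'enable-asserts' like this (a sketch, reusing the match syntax from the compiler example; defining NDEBUG is the C convention for disabling assert):

program.debuggable = program.cpp {
  cflags = match (enable-asserts,
    true => [],
    false => [ "-DNDEBUG" ])
}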

BTW, one thing that would be neat is if "--enable-asserts" could change just that one option, while leaving other options set to their current values.

Anyway, I'd be curious to know what people think of this idea.

Tuesday, October 18, 2011

JIT / AOT Hybrids

Something I've been pondering a lot lately is the notion of link-time code generation.

The Tart compiler generates a lot of synthetic data and code, compared to a language like C++. This includes reflection tables, stack maps for garbage collection, type adapters, and so on.

What I've been wondering is whether it would make sense to do this work in the linker rather than in the compiler. There's even a precedent for this: one data structure is already generated at link time - the list of static roots, that is, the global data structures that need to be traced by the collector.

There's a variety of things that could be generated at link time, ranging from the "only slightly ambitious" such as interface dispatch methods, to the "wildly ambitious" such as generating DWARF debugging info or even instantiating templates.

In order to do any of this, the linker would need much more information about high-level data structures and types. Ironically, this information - basically a serialized form of the AST - is already generated by the compiler and stored in the module. This information isn't currently used by the linker - instead, it's used by the compiler to allow already-compiled module files to be imported, without needing the original source files.

If we take this idea to its logical extreme, what we end up with is something like a Java class file that's compiled ahead of time (AOT), instead of just in time (JIT). That is, you have a single, comprehensive encoding of all of the types and methods in the module, and this encoded information is then used to produce the reflection tables, tracing tables, dispatch tables, and DWARF debug info. (As opposed to what it does now, which is to generate all of these tables separately, at compile time, and then glue them together in the linker.)

This whole approach basically blurs the line between AOT and JIT compilation, in that the inputs to the linker look a lot like the inputs to a VM.

There are some problems to solve, however.

1) One problem with using the serialized AST is that it lacks a lot of information, since it hasn't been fully analyzed. Type names in the AST are just names - we don't yet know which type is being referred to. Similarly with methods and variables: all we know at this point is the name. The largest part of the compiler is the part that analyzes the AST and converts it into a CFG (control flow graph) in which all types are resolved, template parameters are expanded, and so on. We don't want all of this functionality to live in the linker - instead, we want the compiler to do the analysis and then pass the results of that analysis to the linker.

So the first problem to be solved is to come up with a way to serialize the CFG rather than an AST.

2) A second problem has to do with template instantiation. Currently templates are not analyzed until after they have been instantiated. The reason is that until a template is instantiated, there isn't enough type information to do a successful analysis.

As a simple example, take the max() function whose declaration is "def max[%T](a:T, b:T) -> T". This function requires that its template parameter, T, have a less-than operator defined for it. If there's no less-than operator, the template instantiation fails. Until we actually attempt to instantiate the template and bind some type to T, we don't know whether there's a less-than operator defined or not. That's why templates are written out as un-analyzed ASTs, rather than fully-analyzed CFGs.

To solve this problem, we would need to change the way template parameters work. In order to do any operation on a variable whose type is a template parameter, you would have to declare the template parameter as supporting that operation.

Say we have a template parameter T, and some value 'a' whose type is T. Now we want to call 'a.toString()'. In the current system, we don't know whether 'a' has a toString() method or not, but if we attempt to instantiate the template with some type that doesn't have a toString() method, then an error will occur. In the new system, in order to call a.toString(), you would have to declare template parameter T as [%T <: HasToString], where 'HasToString' is a protocol type that includes a 'toString()' method. In other words, having a 'toString()' method is now an explicit requirement for parameter T, and any attempt to instantiate the template with a type that doesn't have a toString() method will fail early.
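Spelled out in code, that might look like this (a sketch - 'HasToString' and 'describe' are hypothetical names, and the exact protocol syntax is an open question):

protocol HasToString {
  def toString() -> String;
}

def describe[%T <: HasToString](a:T) -> String {
  // Legal: T is declared to support toString(), so this call can be
  // checked when the template is analyzed, not when it's instantiated.
  return a.toString();
}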

(FYI - in the literature on type theory, what I have described corresponds to "parametric polymorphism" whereas the way things work now is "ad-hoc polymorphism".)

Unfortunately, Tart's protocol system only works on instance methods, not on regular functions. Right now there's no way to express the concept "match any type that has a less-than operator defined for it". In other words, what we would need is some way to say "given a type T, return true if there is a function 'def infixLessThan(:T, :T) -> bool' defined somewhere." At the moment, I can't even imagine what that syntax would look like. But for now, let's assume we come up with some way to do this.

By pre-declaring what methods we need, we can do the analysis passes on the template and create a CFG. This CFG won't actually be complete, because some types are still not known. But we can at least do enough analysis to be able to write the CFG out to a file.

3) Currently, you can link Tart programs in two ways: Either by using the Tart linker, 'tartln', or by using the standard LLVM tools (llvm-ld, opt, and llc). If using the latter, however, you need to load in the Tart-specific plugins for reflection and garbage collection.

In order to do the link-time code generation that I'm describing, we would need to put all of the code to do that into the shared library, so that it can be loaded as a plugin as well.

Tuesday, October 11, 2011

Another idea on mutability

I was thinking about the way that 'const' gets used in C++ programs. In particular, it's extremely tedious work to take a body of source code that doesn't use 'const' and refactor it to be const-correct. Typically what happens is you add 'const' modifiers to a small section of the code, and then try to compile, at which point you get a jillion errors. You then go and fix up all of those errors to be const-correct, try to compile, and now you have a jillion-squared errors.

(Of course, this tends to be more true for some uses of const than others - it depends on how widely the thing you are making const is referred to.)

In other words, because const is so viral in C++, you end up having to convert your entire program in one go - there's no easy way to break up the change into manageable chunks.

So I wonder if it makes sense to give programmers a "soft landing" option, in which const-correct and const-incorrect code can interoperate without creating a ton of warnings. There are of course some serious downsides to this idea. The fact that the C++ compiler tells you exactly what values you need to add the 'const' modifier to is pretty useful. And allowing programmers to write non-const-correct code means a loss in safety.

BTW, all of this started because I was trying to figure out how to write the following function:

  def tracePointer[%T <: Object](v:T);

The <: operator means "isSubtype" - basically it's saying that the T parameter must be type Object, or a subtype of Object - it can't be an int, for example.

The problem is that this fails if you attempt to bind T to "readonly(Object)". The reason is that "Object" means the same as "mutable(Object)", which is a specialization of "readonly(Object)". From the compiler's point of view, mutable(Object) <: readonly(Object), not the other way around.
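Concretely, the failing case looks something like this (a hypothetical sketch):

def tracePointer[%T <: Object](v:T);

var obj:readonly(Object);
tracePointer(obj);  // fails: T = readonly(Object) doesn't satisfy T <: Object,
                    // because "Object" here means mutable(Object)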

Unfortunately, while this makes sense from a theoretical point of view, it fails the practicality test. When you say "A <: B", with no mutability qualifiers on either A or B, what you generally mean is "type A is a subtype of type B, and I don't care what the mutability modifiers are". If you'd explicitly given a mutability modifier, like "A <: mutable(B)", that would be an entirely different story.

Since one of the design rules of Tart is that practicality should dominate over theoretical purity, I put in a temporary hack that basically says "when using the subtype operator, if neither side of the expression has an explicit type qualifier, then don't consider the type qualifiers of the values they are bound to when doing the comparison." However, this hack is rather ugly, and is yet one more thing to remember when programming in Tart.

A different, and much more radical idea is this: That a type name with no mutability qualifiers ("Object") doesn't mean "mutable" like it does now. Instead, it means "mutability unspecified", in other words there are no guarantees about mutability or immutability at all. So the hierarchy would look like this:
  • (no qualifier)
    • readonly
      • mutable
      • immutable
Basically it means that you can implicitly convert from readonly, mutable, or immutable types to a (no qualifier) type, and the compiler won't complain. So upcasting is implicit and silent - downcasting (converting from (no qualifier) to a qualified type) requires an explicit cast.
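In code, the asymmetry would look something like this (a sketch):

var r:readonly(Object);
var u:Object = r;             // OK: readonly -> (no qualifier) is an upcast
var r2:readonly(Object) = u;  // error: downcast requires an explicit cast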

What this means is that a lazy programmer can write code without thinking about mutability at all - they just don't use the mutability qualifiers in their code. If at some future point they decide that explicit mutability is a good idea, they can go and start adding the keywords where they are needed. Unfortunately, the compiler won't be as helpful as it would be in the case of C++ - if you forget to add the keyword in some places, the compiler won't always detect that, since it's legal to convert from readonly(Object) to Object without an explicit cast.

It also means that the expression "T <: Object" now says exactly what we intuitively think it ought to mean.

This is an interesting idea, but I think it goes a little too far, in that it does more than just allow the programmer not to be distracted by mutability concerns - it *actively encourages* programmers to write code that doesn't address mutability issues, by making such code much cleaner-looking and easier to type. You see, I think that the easiest option should also be the correct one - that is, you can break the rules if you want, but if you follow the rules you'll not only get the satisfaction of having correct code, you'll also get clean-looking, succinct code as well.

Unfortunately, with mutability this is hard to do - making the compiler smart enough to guess the correct default without the programmer having to explicitly type it means making the rules of the language much more complicated and hard to remember.

Tart status update

I decided to back off on the "make function parameters default to read-only" change because it was getting too big and too complex. I still plan to go ahead with it, but I need to fix some other things first. The whole mutability thing has introduced a lot of small problems that need to be solved one by one, and I've just been working through them.

One question that has come up is: If function parameters are read-only by default, should function return values also be? I even thought about making all object references readonly by default, but I really don't want to drink quite so deeply from the functional programming kool-aid.

Another question that keeps rolling around in my brain is this: Was it the right choice to make Tart an AOT (Ahead of Time) language instead of a JIT? I originally made Tart an AOT language because I was looking for a replacement for C++, not another Java. And AOT compilers do have a couple of obvious advantages: quicker startup time, for example.

VM-based languages, however, have plenty of advantages of their own. For example, in the current incarnation of Tart, up to 4 different versions of a class can be written to an output file: once as executable code, once as DWARF debugging info, once as reflection data, and once as importable symbols for compilation of another module. The reason these 4 different representations are necessary is that any of the last three can be deleted from the module without affecting any of the others - so, for example, you can strip out all the DWARF and reflection will still work. With a JIT, we'd only need to write a single representation, and then produce on demand any or all of those representations as needed. And the whole "dynamic linking" problem completely disappears.

Similarly, a JIT can make all kinds of optimizations that are not available to an AOT compiler. For example, if it knows that class A is never subclassed (because no such class has been loaded into memory), it can optimize calls to A's methods as direct calls rather than dispatched through a jump table. Of course, how much optimization you get depends on how long you are willing to wait at startup.
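As a sketch of that devirtualization case (hypothetical code, not tied to any existing class):

class A {
  def frob() -> int { return 1; }
}

def run(a:A) -> int {
  // AOT: this call must go through A's dispatch table, since a subclass
  // of A could be linked in from another module.
  // JIT: if no subclass of A has been loaded, it can be emitted as a
  // direct call to A.frob (and recompiled if a subclass appears later).
  return a.frob();
}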

One could, in fact, design an intermediate representation that would allow better optimizations than C++, Java, or even LLVM IR, because it would preserve certain high-level invariants - invariants that are normally erased in byte-code representations - which could be used to guide optimizations. (Note: This means that the "compiled" class files would not be LLVM bitcode files, but some higher-level representation.)

Writing a JIT for Tart wouldn't be all that hard, if we used LLVM's JIT engine as a base. And it would be fun. On the other hand - it would be completely, recklessly insane. Therefore I won't do it. That doesn't prevent me, however, from writing a design doc about it.

Wednesday, October 5, 2011

Tart status update

I was out of town most of last week, and didn't get too much done on Tart. However, now that I'm back I'm continuing to work on type qualifiers and mutability.

Tart now allows type qualifiers (readonly, mutable, immutable) to be bound to a template parameter. The syntax looks like this:

def somefunc[%Mod(?)](arg:Mod(Object)) -> Mod(String);

What we're saying here is that "Mod" is essentially a function that operates on types - that is, it takes a type as input, and returns a modified type as output. Moreover, the type inference engine can deduce what kind of function it is by examining the context in which the function is used. If 'arg' is an immutable object, then "Mod(?)" gets bound to "immutable(?)", which in turn causes the result of the function to be immutable as well.

More generally, each place where "Mod" appears in the function definition is a constraint. Function arguments are contravariant, so the type of 'arg' must be a subtype of Mod(Object). Function return values are covariant, so Mod(String) must be a subtype of whatever variable or parameter the return value is going to be assigned to. Given those two constraints, the compiler attempts to find a definition of Mod that satisfies both.
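As a usage sketch (hypothetical code - assume 'obj' arrives from elsewhere in the program):

def somefunc[%Mod(?)](arg:Mod(Object)) -> Mod(String);

var obj:immutable(Object);
var s = somefunc(obj);  // Mod(?) binds to immutable(?), so 's' is immutable(String)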

One downside of the current syntax is that it doesn't let you choose which type qualifiers the parameter is bound to - which is not a big issue right now, because a type currently can only have one qualifier, but may be an issue in the future.

Other than that there's not much to report. I still need to update the exception handling code to work with the new LLVM exception stuff, I just haven't gotten to it yet.