Sunday, July 17, 2011

When object orientation is the wrong answer

One of the major refactorings I did this week was to completely replace the implementation of the "isSubtype" test.

The "is subtype" test is a binary relation between two types. There's even an operator for it in Tart: "<:" which is the same one used by Scala. So the expression "A <: B" is true if A is either the same as B or is a subtype of B. This particular test gets called in a lot of places - for example, deciding which of two functions is a closer match involves doing subtype tests on all of the arguments.

The way I originally implemented this in the compiler was by following the standard object-oriented approach: class Type, which is the base class of all types, has a pure virtual method "isSubtypeOf(const Type * base) const". Each subclass of Type (PrimitiveType, FunctionType, CompositeType, etc) overrides this and provides an implementation that is specific to that type.
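To make the shape of that old design concrete, here is a minimal sketch of the object-oriented structure - the class and method names follow the description above, but the bodies are purely illustrative, not the compiler's actual code:

class Type {
public:
  virtual ~Type() {}

  // Every concrete type class supplies its own subtype logic.
  virtual bool isSubtypeOf(const Type * base) const = 0;
};

class PrimitiveType : public Type {
public:
  // Primitive-specific rules (integer widening and so on) would live here.
  bool isSubtypeOf(const Type * base) const override { return base == this; }
};

class CompositeType : public Type {
public:
  // Class/struct/interface rules (walking the list of base types, etc.)
  // would live here.
  bool isSubtypeOf(const Type * base) const override { return base == this; /* elided */ }
};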

However, the subtype relation isn't quite as cut and dried as all that. For example, what if the type being tested is a type alias that points to some other type? Well, you have to dereference the alias before you can do the subtype test. OK so we do that. But now, what if the base type is also an alias? Well we have to dereference that too. So now every type has to know about aliases - either that, or every place that calls isSubtypeOf has to dereference the types first.

It gets more complicated: What if the base type is a protocol? Any type can be a subtype of a protocol, because a protocol is simply a constraint - it says "this type has a method with such-and-such name and such-and-such signature". So for example, even primitive types have a "toString" method, which means that primitive types are technically considered subtypes of the "HasToString" protocol. OK so now every type has to know about protocols.
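As a rough illustration of what that protocol check amounts to, here is a hedged sketch of a structural conformance test. The structs and the helper function are hypothetical stand-ins, far simpler than the compiler's real representations:

#include <string>
#include <vector>

// Toy representations: a protocol is just a list of required method signatures.
struct MethodSig { std::string name; std::string signature; };
struct TypeDesc  { std::vector<MethodSig> methods; };
struct Protocol  { std::vector<MethodSig> required; };

// A type satisfies a protocol if every required method has a matching
// declaration on the type - no inheritance relationship is needed.
bool satisfiesProtocol(const TypeDesc & ty, const Protocol & proto) {
  for (const MethodSig & req : proto.required) {
    bool found = false;
    for (const MethodSig & m : ty.methods) {
      if (m.name == req.name && m.signature == req.signature) {
        found = true;
        break;
      }
    }
    if (!found)
      return false;
  }
  return true;
}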

What about ambiguous types? These are temporary types, created during the type inference pass, that represent a "cloud" of related types which haven't been fully resolved yet. So for example, if we call "min(a, b)" the return value might be an int, or a float, or even a String depending on which overload of "min" eventually gets chosen. Each ambiguous type has a generator which can iterate through all of the possible types, and that generator will return fewer and fewer results as we tighten up the constraints looking for a singular solution.

The "isSubclassOf" test for ambiguous types requires that all of the "possible" types must each pass the subclass test, if any of them fails then the whole thing does.

There are also "type assignments" such as "%T=int" - meaning that the type variable T is assigned the value "int" within the current environment. Now, what happens if you have a template instance, where one or more of the type parameters is a type assignment or some other funky type? For example: given that ArrayList[int] is a subtype of List[int], is ArrayList[%T=int] also subtype of List[int]? The answer is, it depends on the environment - so now isSubtypeOf() needs to know about environments.

As you can imagine, with all these complexities the code is starting to look pretty nasty. Time for a redo.

In the new scheme, isSubtype(a, b) is now just a regular function - not even a class method. It's a fairly large function, since it has to handle all of the logic I've outlined above. However, it's not as large as you might think. Mainly what it does in complex cases is simplify the case by one step and then call itself recursively with the simplified arguments. Thus, if either of the two arguments is an ambiguous type, it generates the set of possibilities and then calls itself for each possibility.
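Here is a hedged sketch of that shape, using a deliberately dumbed-down type model (a kind tag plus a few fields) rather than the compiler's real Type hierarchy - the point is the dispatch-and-recurse structure, not the details:

#include <vector>

// Toy type model: a kind tag plus the fields each kind needs.
enum TypeKind { K_PRIMITIVE, K_BOOL, K_ADDRESS, K_ALIAS, K_AMBIGUOUS, K_COMPOSITE };

struct Type {
  TypeKind kind;
  const Type * target;                  // alias: the type being aliased
  std::vector<const Type *> prospects;  // ambiguous: the remaining possibilities
  std::vector<const Type *> bases;      // composite: direct base types
};

bool isEqual(const Type * a, const Type * b) { return a == b; }

// One free function owns the whole relation; complex cases are simplified by
// one step and handled by recursion.
bool isSubtype(const Type * ty, const Type * base) {
  // Dereference aliases on either side first.
  if (ty->kind == K_ALIAS)   return isSubtype(ty->target, base);
  if (base->kind == K_ALIAS) return isSubtype(ty, base->target);

  // An ambiguous type passes only if every remaining possibility passes.
  if (ty->kind == K_AMBIGUOUS) {
    for (const Type * prospect : ty->prospects)
      if (!isSubtype(prospect, base))
        return false;
    return true;
  }

  // Types with no notion of inheritance (booleans, machine address types)
  // delegate straight to equality.
  if (ty->kind == K_BOOL || ty->kind == K_ADDRESS)
    return isEqual(ty, base);

  // Composite types: identical, or reachable through a base type.
  if (ty->kind == K_COMPOSITE) {
    if (isEqual(ty, base))
      return true;
    for (const Type * b : ty->bases)
      if (isSubtype(b, base))
        return true;
    return false;
  }

  // (The real function also has to consult the environment for type
  // assignments, handle protocols, and so on.)
  return isEqual(ty, base);
}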

The biggest benefit is that all of the logic for the subtype relation is now located in a single file, which is much easier to understand than having it scattered across two dozen source files. Although some types have different logic, there's a lot of commonality too - for example, there's a whole set of types that have no concept of inheritance or subtyping - booleans and machine address types are an example - and they both handle the "isSubtype" test the same way: by delegating to "isEqual".

New type solver and related progress

The great type solver rewrite is finally finished and all tests are passing! I've been working really hard on this, I even took 3 days off of work this week so that I could work on the compiler - I felt that things were getting really close to a conclusion and I was having trouble focusing on my regular work. (Although I did do *some* vacation-y stuff - like hiking in the redwoods with a friend.) However, it wasn't until yesterday morning that I actually solved the last bug and was able to commit a working version. Since then I've done a bunch of cleanup work, deleting old code that is now no longer needed, and so on.

I also noticed that LLVM's output stream class now supports ANSI color, so I added some code to display error messages in red, warnings in yellow, and so on. Another thing I added was a new macro which helps debug non-fatal errors: in debug builds, it prints out the file and line number at which to set a breakpoint, so you can just cut and paste it from the console instead of having to search for the error message in the source code. (A typical code pattern I use in the compiler is that when an operation fails, after printing out the error message, I reset the state and try the operation again - this makes it easy to trace through it in the debugger.)
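The macro amounts to something like the following - a minimal sketch, with a made-up name and output format rather than the compiler's actual macro:

#include <cstdio>

// In debug builds, print a ready-to-paste breakpoint location next to the
// diagnostic; in release builds, compile to nothing.
#ifdef NDEBUG
  #define DIAG_BREAKPOINT_HINT() ((void) 0)
#else
  #define DIAG_BREAKPOINT_HINT() \
      std::fprintf(stderr, "  (to debug, break at %s:%d)\n", __FILE__, __LINE__)
#endif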

All that being said, there's one caveat - while all this was going on, LLVM just dropped in a massive change to their APIs. I've actually been a supporter of this particular change and have been awaiting it with much anticipation; however, it means that the next time I sync to LLVM I've got a fair bit of work to do.

In terms of release plans: I did a sort of "pre-announcement" of my 0.1 release on the LLVM mailing lists. I didn't get much response, which is no surprise - most people there are focused on their own projects. There are several other places where I could publicize Tart. One is the internal mailing list for language enthusiasts at Google. I will probably get a much larger reaction there, although a lot of it will most likely be negative - I expect a lot of criticism from the functional-language enthusiasts, and a lot of heat from partisans of competing languages like Scala and Go. There are also various newsgroups and mailing lists, although I don't really know my way around those fora and would be concerned about posting in the wrong place. At one point I considered posting on one of the Python mailing lists, based on the rationale that Tart is in many ways "inspired by" Python, but that rationale now seems weak to me.

Of course, I have another outlet: my 1500+ followers on Google+. Most of them, I'm guessing, are non-technical, so my announcement would be meaningless to the vast majority of them.

Friday, July 8, 2011

Fun with Symbolic Reasoning

So I'm about 3-4 weeks into the "great type solver re-write". It really wasn't my plan to re-write the type solver, but things have kind of snowballed. Originally my goal was to allow you to define an anonymous function without having to explicitly specify the types of the parameters - so you could say "fn (x, y) { ... }" instead of "fn (x:int, y:int) { ... }". However, it turns out that in order to make this work, I had to fix a lot of problems in the solver. There were a lot of places in the code where I had hacked around a particular problem instead of thinking it through and doing the correct thing. The new code is more abstract, but in many ways it's easier to understand, which is always a good sign.

One interesting aspect of all this is that the type solver now does a kind of deductive reasoning: the input expression gets converted into a set of Constraint objects, and the solver attempts to optimize the constraints before it searches for a solution. An example would be two constraints such as "T is a subtype of List[X]" and "T is a subtype of ArrayList[X]". Since the first constraint is implied by the second, it is unnecessary and can be removed. A different example would be "T is a subtype of ArrayList[X]" and "T is a supertype of ArrayList[X]". Regardless of what X is, these two constraints can be satisfied only if T is exactly ArrayList[X].
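Here is a hedged sketch of the first kind of simplification - dropping a constraint that is implied by another one. The Constraint struct, the string-keyed types, and the caller-supplied subtype test are all stand-ins for the solver's much richer objects (the second example above, a matching upper and lower bound collapsing to an exact type, would be another such rewrite and is not shown here):

#include <cstddef>
#include <string>
#include <vector>

enum Kind { UPPER_BOUND, LOWER_BOUND, EXACT };
struct Constraint { Kind kind; std::string type; };

// The caller supplies the subtype test,
// e.g. isSubtype("ArrayList[X]", "List[X]") == true.
typedef bool (*SubtypeFn)(const std::string & sub, const std::string & super);

// Does constraint 'b' make constraint 'a' redundant?
//   "T <: A" is implied by "T <: B" when B <: A.
//   "T :> A" is implied by "T :> B" when A <: B.
bool implies(const Constraint & b, const Constraint & a, SubtypeFn isSubtype) {
  if (a.kind == UPPER_BOUND && b.kind == UPPER_BOUND)
    return isSubtype(b.type, a.type);
  if (a.kind == LOWER_BOUND && b.kind == LOWER_BOUND)
    return isSubtype(a.type, b.type);
  return false;
}

// Remove every constraint that is implied by some other constraint in the set.
void simplify(std::vector<Constraint> & cs, SubtypeFn isSubtype) {
  for (std::size_t i = 0; i < cs.size(); ) {
    bool redundant = false;
    for (std::size_t j = 0; j < cs.size(); ++j) {
      if (i != j && implies(cs[j], cs[i], isSubtype)) {
        redundant = true;
        break;
      }
    }
    if (redundant)
      cs.erase(cs.begin() + i);
    else
      ++i;
  }
}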

The data model for the solver now looks like this:
  • Template
    • TypeVariable
  • Environment
    • TypeAssignment
      • Constraint / ConstraintSet
        • Provision / ProvisionSet
A TypeVariable is essentially a template parameter. Type variables can be bound to values within an environment. For each type variable, a TypeAssignment is created which represents the value of that variable within the environment - essentially it's a slot that is filled in with the value.

Associated with each TypeAssignment are an arbitrary number of Constraints. There are three types of Constraints: UPPER_BOUND, LOWER_BOUND, and EXACT. So UPPER_BOUND(X) says that the value of the TypeAssignment must be X or a subtype of X.

With each constraint there are zero or more "Provisions" (think of the provisions of a contract). All of a constraint's provisions must be true; otherwise, the constraint has no effect.
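In code, the data model might look roughly like this - a hedged, heavily simplified sketch in which the names follow the list above but everything else is illustrative:

#include <string>
#include <vector>

struct Type;       // some representation of a type (elided)
struct Provision;  // a condition, e.g. "this particular overload gets chosen" (elided)

enum ConstraintKind { UPPER_BOUND, LOWER_BOUND, EXACT };

// One bound on a type assignment; it only has effect if all its provisions hold.
struct Constraint {
  ConstraintKind kind;
  const Type * value;
  std::vector<const Provision *> provisions;
};

// The slot that holds the value of one type variable within one environment.
struct TypeAssignment {
  std::string variableName;            // e.g. "%T"
  std::vector<Constraint> constraints; // gathered during unification
  const Type * value;                  // filled in once a solution is chosen
};

// An environment binds each of a template's type variables to an assignment.
struct Environment {
  std::vector<TypeAssignment> assignments;
};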

Here's an example of how this all works:

var a:int8;
var b:int16;
var c:int32;

let d = min(a, b + c);

There are two function calls in the last line: a call to 'min' and a call to '+'. The type solver first identifies all of the overloads of these two functions:

def min[%T](lhs:T, rhs:T) -> T;

def infixAdd(a:int8, b:int8) -> int8;
def infixAdd(a:int16, b:int16) -> int16;
def infixAdd(a:int32, b:int32) -> int32;
(and so on)

After unification, the TypeAssignment for %T looks like this:

%T:
  >= int8
  >= int8  provided [infixAdd(a:int8, b:int8) -> int8]
  >= int16 provided [infixAdd(a:int16, b:int16) -> int16]
  >= int32 provided [infixAdd(a:int32, b:int32) -> int32]

Each '>=' represents a LOWER_BOUND constraint. The first constraint says that T must be at least an 'int8' if not larger - because the first argument to 'min' is 'a', which is an int8.

The second constraint says that T must be at least an int8 if the int8 version of infixAdd is chosen as the overload. Similarly, the third constraint says T must be at least int16 if the int16 version of infixAdd is chosen.

We can immediately eliminate the second constraint because it's implied by the first constraint.

After minimizing the constraints, we can now search for a solution by trying different values for 'T' (as suggested by the constraints) and seeing which solution has the highest ranking (where the ranking is calculated from a number of factors, including whether the assignment would produce a warning message).
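As a final illustration, here is a hedged sketch of that search step. The value type and the two callbacks are stand-ins; the real solver searches over whole sets of type assignments and computes its ranking from many more factors:

#include <vector>

typedef int Value;                    // stand-in for a candidate type for 'T'
typedef bool (*SatisfiesFn)(Value);   // does this value meet every remaining constraint?
typedef int  (*RankFn)(Value);        // higher is better; e.g. penalize choices that
                                      // would produce a warning

// Try each candidate suggested by the constraints and keep the best-ranked
// one that actually satisfies the (minimized) constraint set.
Value pickBest(const std::vector<Value> & candidates, SatisfiesFn satisfies,
               RankFn rank, Value fallback) {
  Value best = fallback;
  bool haveBest = false;
  int bestRank = 0;
  for (Value v : candidates) {
    if (!satisfies(v))
      continue;
    int r = rank(v);
    if (!haveBest || r > bestRank) {
      haveBest = true;
      bestRank = r;
      best = v;
    }
  }
  return best;
}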