Saturday, June 18, 2011

A much simpler question: Shortcut 'and' and 'or'

I was recently reading some blog posts comparing various languages, and in particular a "wish list" of language features that was informative. One of the things that the author wished for is to have the 'and' and 'or' operators work like they do in Python / Perl / JavaScript and other dynamic languages - specifically, that they could take non-boolean arguments, and return a non-boolean value.

So for example, a common JavaScript idiom is this:

function makeButton(opt_title) {
  this.title = opt_title || "OK";
}

The 'title' property of the object will be assigned the value of 'opt_title' if it evaluates to a truthy value; otherwise the default string "OK" will be used.
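The same idiom works in Python, where 'or' returns its first operand if that operand is truthy and its second operand otherwise. A minimal sketch (the function name here is just for illustration):

```python
def make_button(title=None):
    # 'or' yields its first truthy operand rather than a bool,
    # so a falsy title (None or "") falls through to the default.
    return title or "OK"

print(make_button("Cancel"))  # Cancel
print(make_button(""))        # OK
print(make_button())          # OK
```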

My question for the readers is - does this make sense in a statically typed language? There are a number of issues to consider:

Both arguments to the logical operator would have to have the same static type (or at least, be convertible to a common type). The resulting value would also have the same static type. From a compiler standpoint, this is fairly easy, no different really than the arithmetic operators which also require that both sides be promotable to the same type. The real question is whether this restriction - that the two sides as well as the result value have the same static type - makes the feature less useful than it would in a dynamic language.

The other issue is that both arguments would need to be implicitly convertible to boolean types. C and C++ implicitly map the integer value 0 and the pointer value NULL to boolean false. Other languages go farther and map the empty string "" and the empty array [] to false as well. And of course, in Tart you also have unions which can have 'void' as a member, so I'm assuming that those would be convertible to false as well. And what about floats? Or decimal numbers? To be completely general, we'd have to allow user-defined types to declare their own implicit conversions to boolean.
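For reference, Python takes the permissive route on all of these points: 0, "", [] and None are all falsy, and a user-defined type can opt in by defining __bool__. A small sketch, where the Maybe class is invented here to mimic a union with a 'void' member:

```python
class Maybe:
    """Invented stand-in for a union type with a 'void' member."""
    def __init__(self, value=None):
        self.value = value

    def __bool__(self):
        # user-defined implicit conversion to boolean:
        # the 'void' case (None) counts as false
        return self.value is not None

print(bool(0), bool(""), bool([]))    # False False False
print(bool(Maybe()), bool(Maybe(3)))  # False True
```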

I've avoided such implicit conversions up to this point because in my experience they are a source of potential confusion. Right now, you have to explicitly compare values with 0 or NULL or the empty string in order to convert the value into a boolean.

Allowing implicit conversion to booleans would require far-reaching changes. For one thing, the 'if', 'for' and 'while' statements would have to change to be consistent with the 'and' and 'or' operators. It would be confusing otherwise - given two strings s1 and s2, if I'm allowed to say 's1 or s2' I would also expect to be able to say 'if s1' or 'while s1'.

Do you think that the convenience is worth the possible confusion?

Type inference

I've written extensively about the type inference algorithm in previous posts, so I'll try not to repeat myself.

The type solver's purpose is to generate solutions for two kinds of problems:
  • For functions which have multiple overloads, select the correct overload based on the input arguments.
  • For functions which have template parameters, determine what type is bound to each template parameter.
I want to focus on this second item. For each templated function call, the solver creates a BindingEnv ("binding environment") object that stores the values of each type variable. So if I have a function such as:

def min[%T](a0:T, a1:T) -> T;

The BindingEnv stores the definition of T. If there is more than one call to "min" in a given expression:

let a = min(b, min(c, d));

...then the algorithm creates two BindingEnvs, each of which may have a different value for T. These values are referred to as substitutions. For example if I add a substitution declaring that "T" is an "int", what I am really saying is that types 'T' and 'int' are equivalent.
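In Python terms, the situation might be sketched like this, with each call site holding its own table of substitutions (the dict representation is an illustrative assumption):

```python
# Each call to 'min' gets its own binding environment,
# so the same variable name 'T' can be bound independently.
inner_env = {'T': 'int'}    # from min(c, d)
outer_env = {'T': 'float'}  # from min(b, ...)

# Adding the substitution "T := int" to inner_env asserts that,
# within that environment, the types T and int are equivalent.
assert inner_env['T'] != outer_env['T']
```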

Unfortunately, having multiple environments creates a number of problems. The first has to do with backtracking.

Here's how backtracking works: Say we call our 'min' function with two incompatible arguments, such as List[String] and float. The solver first attempts to add a substitution for T := List[String]. However, when it gets to the second argument, it realizes that T cannot be equivalent to both List[String] and float, so it needs to 'unbind' the previous substitution. The way this works is that BindingEnv has the ability to backtrack to a previous state - you call env.getCurrentState(), which returns a token (an int, actually), which can then be used to call env.backtrack(token) and remove all of the substitutions that were added subsequent to the call to getCurrentState().
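A minimal Python sketch of this token-based backtracking (the method names mirror the ones described above; representing the environment as an append-only trail of pairs is my own assumption):

```python
class BindingEnv:
    """Sketch of a binding environment with checkpoint/backtrack.
    A token is simply the length of the substitution trail."""
    def __init__(self):
        self.trail = []  # (var, type) pairs, newest last

    def bind(self, var, ty):
        self.trail.append((var, ty))

    def lookup(self, var):
        for v, t in reversed(self.trail):  # newest binding wins
            if v == var:
                return t
        return None

    def get_current_state(self):
        return len(self.trail)             # an int token

    def backtrack(self, token):
        del self.trail[token:]             # drop later substitutions

env = BindingEnv()
token = env.get_current_state()
env.bind('T', 'List[String]')
print(env.lookup('T'))   # List[String]
env.backtrack(token)     # second argument was float: undo the binding
print(env.lookup('T'))   # None
```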

However, where we run into trouble is when more than one BindingEnv is involved, because now we need to backtrack multiple environments and keep them in sync. A more robust approach is to just have a single environment for the whole expression, with one single backtracking state.

But then how do we deal with multiple different occurrences of 'T' in the same expression? The answer is we need to relabel all of the variables in each call: That is, we transform 'T' into 'T1' or 'T2', depending on which T we are talking about. Now all type variables are unique, so we don't have to worry about collisions anymore, and we can have just a single environment.
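The relabeling step might look something like this in Python (the helper name and list-of-parameters representation are invented for illustration):

```python
from itertools import count

fresh = count(1)

def relabel_call(params):
    """Give each call site its own copy of the signature's type
    variables, e.g. T -> T1 at one call and T -> T2 at another,
    so all call sites can share a single environment."""
    mapping = {}
    def rename(v):
        if v not in mapping:
            mapping[v] = f"{v}{next(fresh)}"
        return mapping[v]
    return [rename(p) for p in params]

# min(b, min(c, d)): two call sites, each gets unique variables
print(relabel_call(['T', 'T']))  # ['T1', 'T1']
print(relabel_call(['T', 'T']))  # ['T2', 'T2']
```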

Another limitation of the current solver is that within an environment any given variable can only have a single substitution that is currently active. Say for example we know that 'T' == 'List[S]' where S is another type variable. Later we refine this and add 'T' == 'List[String]'. We don't remove the old substitution - instead we add a new one on top, so that it hides the previous substitution. That way, if we have to backtrack, it will revert to the previous value of 'T' rather than losing it entirely.

The problem with this approach is that it's attempting to do the reconciliation between the new and old substitutions too early - for example if we have 'T' == 'List[A]' and 'T' == 'List[B]', and we don't know what 'A' and 'B' are yet, there's no way to know which of the two definitions should win. Instead, what we should do is keep both definitions, and reconcile them later.

In fact, what I'd like to do is to make the type inference algorithm have clearly defined phases. (It has phases now, but they aren't cleanly separated.) The phases would be:
  • Gather constraints - walk through the entire expression and find all overloaded functions, type variables, and other "choice points" - places in the expression where the type solver is able to choose between several alternatives.
  • Relabeling - replace all type variables with new ones that have unique names.
  • Unification - this is where we identify all possible substitutions. Note that some substitutions are conditional - say there are two overloaded versions of "print", one that takes a String and one that takes a %T; the substitution for %T is only meaningful if the second overload is chosen.
  • Constraint simplification - if we have a set of substitutions {T := S, S := int} we can replace them with {T := int, S := int}.
  • Overload selection - this is a fairly standard constraint minimization algorithm. For each choice point, we compute a ranking that says how good the choice is. For example, if we have a function that returns a uint8, but the variable being assigned to is a uint16, that's not as good of a choice as a function that returns a uint16. The algorithm attempts to maximize the ranking for the whole expression. (Because the choice points are interdependent, choosing the best result for one node in the expression tree may cause a worse choice elsewhere.)
  • Error reporting: If any of the phases fails, or the final result would produce type conversion warnings, we want to clearly inform the user of what went wrong.
  • Finalization: Replace all of the temporary type variables with the actual concrete types that they are bound to in the environment.
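The constraint-simplification phase in particular is easy to sketch. Here is a hedged Python version of the substitution-chasing it would do (the function name and dict representation are assumptions, not Tart's actual implementation):

```python
def simplify(subs):
    """Resolve chains of substitutions, e.g. {T: S, S: int}
    becomes {T: int, S: int}. The 'seen' tuple guards against
    cycles such as {T: S, S: T}."""
    def resolve(ty, seen=()):
        if ty in subs and ty not in seen:
            return resolve(subs[ty], seen + (ty,))
        return ty
    return {v: resolve(t) for v, t in subs.items()}

print(simplify({'T': 'S', 'S': 'int'}))  # {'T': 'int', 'S': 'int'}
```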
In any case, I have a ways to go before this particular vision can be achieved.

Good news

I've finished setting up my new desktop machine, and I've managed to compile Tart and run all of the tests - which passed.

So the overall status looks like this:
  • 64-bit Linux (Ubuntu): Pass
  • 64-bit OS X (Snow Leopard): Pass
  • 32-bit OS X (Leopard): Fail
That's all the data points I have at the moment, but it's a good bet that the problem has to do with 64 vs. 32 bit compatibility.

Thursday, June 16, 2011

Tart status update

So, it's been a while since I've posted to this list. That does not mean, however, that I have not been busy working on Tart.

Well, that's only partly true. I've actually been working on Tart somewhat sporadically, about 2-3 nights a week. I've had a lot of distractions lately - some pleasant (Minecraft / Civilization V), and others not so much (my primary desktop machine had a hard drive failure, and recovering from it and setting up the new system has consumed a lot of my time).

One side effect of the hard drive failure is that my desktop is now running a 64-bit OS (Snow Leopard) instead of a 32-bit OS (Leopard). This means that it will be harder for me to test Tart on a 32-bit system. I do have a spare 32-bit laptop available, but testing and debugging on that system will be somewhat inconvenient.

As far as Tart itself goes: there are still two lingering release-blocker bugs. (1) All of the unit tests segfault on some platforms - specifically in the call to llvm.memcpy(). (2) The DWARF debugging metadata generated by the compiler is invalid on some platforms. (One thing that would be useful to know is whether these crashes occur on platforms other than the ones available to me personally - so if anyone feels like checking out Tart and running "make check" I'd be grateful.)

I've spent a huge amount of personal time on these two bugs, and I'm starting to feel somewhat burned out by my lack of progress. I don't even want to think about how many hours I've spent in gdb, or looking at x86 disassembly listings.

Sometimes when I get particularly frustrated, I go off and work on some part of Tart that's more fun - like adding a new language feature or standard library class. Several minor new features have resulted, such as the addition of collections.Deque.

I've also been thinking a lot about improving Tart's type inference algorithm - the code needs cleaning up, and there are some types of deductions that I want it to be able to do that it can't right now. But I'll save that for another posting.