Wednesday, December 30, 2009

The problem with null pointers is that they are too darned useful

It would be nice if null pointers could be eliminated from the language, or at least integrated into the type system in a way that allows null pointer dereferences to be detected in advance. Unfortunately, 'null' tends to get used a lot - it's a great sentinel value for lazy construction, for example. So far all my ideas for making null type-safe also have the side-effect of making the code more complex and verbose.

Right now, the Tart compiler treats the value 'null' as a distinct type 'Null' (which is different from 'void'). You can compare any pointer or object reference with Null, but you can't assign Null to a variable of reference type. What you can do, however, is create a union of type 'SomeType or Null'.

var n:SomeType or Null = null;

In C++, if we wanted to lazily construct 'n', we'd write something like this:

if (n == NULL) {
  n = new SomeType();
}

return n;

However, the equivalent won't work in Tart, because the final 'return n' returns a 'SomeType or Null' when what we want it to return is 'SomeType'.

We could insert an explicit type cast, except that in Tart type casts are always checked - which means that we'd be checking for null twice.
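
For illustration, the cast version might look roughly like this. The spelling of the cast below is hypothetical - I'm not showing Tart's actual cast syntax here - the point is just that a checked cast has to test for Null at runtime even though the 'if' above it already did:

if n == null {
  n = SomeType();
}

// The checked cast tests for Null a second time, even though the 'if'
// statement just above guarantees that n is a SomeType at this point.
return typecast[SomeType](n);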

Another way to do this in Tart is to use the classify statement:

classify n {
  as r:SomeType { return r; }
  else {
    let r = SomeType();
    n = r;
    return r;
  }
}

This works, but seems fairly clumsy by comparison to the C++ version.

--
-- Talin

Tuesday, December 29, 2009

Status update

Lots of filling in of details lately:
  • Reflection of default constructors now works, allowing objects to be created via reflection.
  • Started implementing the standard hash functions (using MurmurHash). This is needed so I can write a HashTable collection.
  • Sketched out a few more i/o classes and interfaces.
  • Added a new linker pass that attempts to calculate the overhead of reflection. The largest chunk is the strings representing class and method names.
  • Started work on Tart implementation of String.format.
  • Tracking down bugs.
Unfortunately, there are a number of issues that are vexing me and won't go away. These include:
  • CMake is driving me crazy.
    • I have two libraries, almost identical - yet when I declare both libs to be a dependency of a custom target, only one gets rebuilt. I can force CMake to build the second one by naming it explicitly on the command line, but when it's referenced only as a dependency it never gets rebuilt.
    • Another CMake problem is that I can't seem to make a rule that causes dsymutil to run on OS X but not on Linux.
    • And yet another issue is that CMake's lack of extensibility means that there's no way to scan a Tart module for dependencies. I'm constantly running into problems where a module is out of sync with one that it is based on. (Eventually I am going to make my own Tart-based build tool, "mint", but that's a long ways off.) I do have the dependency info embedded in the generated module (using metadata nodes), but there's nothing that can read it.
    • However, being able to generate an Eclipse / MSVC / makefile project is just too useful to give up. So I don't want to migrate away from CMake.
  • Several of the unit tests fail when LLVM optimization is turned on, and I can't figure out why.
  • Still haven't been able to generate correct DWARF info for parameters and local vars. Also, I sort of figured out how to get my exception personality function to "see" the stack trace (using dlsym), but unfortunately none of the LLVM-generated functions appear on it - only the ones written in C.
  • LLVM IR is so hard to read; what the world really needs is a graphical browser for global variables and functions.
--
-- Talin

Saturday, December 26, 2009

Method reflection and unit tests

I just finished v2 of the simple unit test framework. The v1 version simply looked for global functions beginning with "test" and called them. The v2 version supports JUnit-like test fixtures, where unit tests derive from tart.testing.Test.

Here's a short example of a test class:

import tart.collections.ArrayList;
import tart.reflect.ComplexType;
import tart.testing.Test;

@EntryPoint
def main(args:String[]) -> int {
  return ArrayListTest().runTests(ArrayListTest);
}

class ArrayListTest : Test {
  def testConstruct() {
    let a = ArrayList[int](1, 2, 3);
    assertEq(3, a.length);
    assertEq(1, a[0]);
    assertEq(2, a[1]);
    assertEq(3, a[2]);
  }
  def testImplicitType() {
    let a = ArrayList(1, 2, 3);
    assertEq(3, a.length);
    assertEq(1, a[0]);
    assertEq(2, a[1]);
    assertEq(3, a[2]);
  }
}

This demonstrates a number of interesting language features:
  • Reflection of class methods.
  • "import namespace". The 'assertEq' methods are defined in class "Asserts". Class Test does an "import namespace Asserts", which means that all of the assert methods are available without needing a qualifier (i.e. Asserts.assertEq).
The test output looks like this:

[Running] ArrayListTest.testConstruct
[Running] ArrayListTest.testImplicitType
[OK     ]

Tuesday, December 22, 2009

More progress on tuples

I've made a lot of progress on tuple types in the last several days, including the "unpacked" or "destructuring" assignment:

def returnTupleLarge() -> (int, String, int, String) {
  return 3, "Hello", 2, "World";
}

var d1, e1, d2, e2 = returnTupleLarge();

Unfortunately the "Python swap idiom" doesn't quite work yet:

a, b = b, a;

For some reason the code generator isn't generating the store to 'b'. Still tracking this one down.

In order to fully support tuple types I had to make some significant changes to the way that the code generator treats l-values. You see, LLVM currently has a limitation on the way that it supports first-class aggregates: only structs that can fit into two registers can be passed by value - if the struct is larger than that, it has to be passed by reference. Eventually LLVM will support arbitrary-sized parameters and return types, but that support is not there yet.

So the first problem is how to determine whether a struct will fit into two registers. Since my front-end is target-independent, I don't know how big a register is. For the moment I am using a somewhat dodgy heuristic for this.

The second issue is that we now have two kinds of structs as far as the codegen is concerned: "small" and "large". Similarly, we also have small and large tuples, since a tuple is very much like a struct. However, there is one important difference, which is that structs must always be addressable. The reason for this is the handling of 'self'. When you define a method on a struct, the 'self' argument has to be a pointer to the struct, not the value of the struct - otherwise, it would be impossible to write setter methods, for example.
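
To see why, consider a struct with a setter (the type and field names here are made up for illustration):

struct Point {
  var x:int;
  var y:int;

  // 'self' must be a pointer to the caller's Point. If it were passed by
  // value, this assignment would modify a temporary copy and the caller
  // would never see the change.
  def setX(value:int) {
    self.x = value;
  }
}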

Tuples, on the other hand, don't have methods, so they don't have to be l-values since there's never a 'self' pointer to worry about. However, if a tuple contains a struct, then that struct has to be addressable, which means that the tuple has to be treated like a struct.

I created a spreadsheet of all the combinations of Tart types and how they are represented. The columns represented various types - primitive, class, etc. The rows represented various forms of use - parameter, return val, variable, constant, intermediate value, member field, and so on. Using this, I was able to boil down all of the runtime representations into just 5 categories, which for lack of a better term I am calling "Shapes":

Shape_Primitive,          // float, int, pointer, etc.
Shape_Small_RValue,       // Small tuples
Shape_Small_LValue,       // Small structs
Shape_Large_Value,        // Large tuples or structs
Shape_Reference,          // Class or interface

Primitive types are always passed by value in a single register. Small_RValue shapes are also passed by value, but may require more than one register. Neither type is guaranteed to have a memory address, since the value may be contained in registers.

Small_LValue is passed and returned by value, but internally it is stored in memory and is guaranteed to have an address.

Large_Value is passed by reference. When used as a parameter, a local allocation is created and the value copied into it. For function return values, the compiler generates a 'struct return' parameter which holds a pointer to a buffer that the caller provides. The return value is copied into this buffer before exiting the function. Similarly, for variable assignments, the contents of the local allocation are copied from one buffer to another.

Reference types are always passed as a single pointer value. Assigning to a reference type merely assigns to the pointer, it does not affect the contents of the memory being pointed to.

The "shape" is determined for each type based on the contents of the type. For example, the shape of a union type may be different depending on what the types within the union are. So if a union contains a Large_Value shaped type, then the union must also be Large_Value as well.

With each type now able to report its shape, I can define the various operations on types - such as load, store, return, argument, copy, and so on - in terms of these shapes. This means that rather than needing a bunch of special cases for each particular type, the code generator can be considerably simplified.

Saturday, December 19, 2009

Tuple progress

Made some terrific progress today on tuples. Here's the complete unit test so far:

import tart.reflect.Module;
import tart.testing.TestUtils;

@EntryPoint
def main(args:String[]) -> int {
  return TestUtils.runModuleTests(Module.thisModule());
}

def testTupleCreate() {
  var a = 1, 2;
  Debug.assertEq(1, a[0]);
  Debug.assertEq(2, a[1]);
}

def testTupleCreate2() {
  var a = 1, "Hello";
  Debug.assertEq(1, a[0]);
  Debug.assertEq("Hello", a[1]);
}

def testTupleCreateExplicitType() {
  var a:(int, int) = 1, 2;
  Debug.assertEq(1, a[0]);
  Debug.assertEq(2, a[1]);
}

def testTupleAssign() {
  var a = 1, 2;
  var b = a;
  Debug.assertEq(1, b[0]);
  Debug.assertEq(2, b[1]);
}

def testTupleReturn() {
  var a = returnTuple();
  Debug.assertEq(3, a[0]);
  Debug.assertEq("Hello", a[1]);
}

def returnTuple() -> (int, String) {
  return 3, "Hello";
}

Still to do:

-- Test passing tuples as parameters.
-- Test member-wise conversions.
-- Test tuples as members of structs / classes.
-- Write a compilation failure test to ensure tuple immutability.
-- Implement unpacking assignment.

I'm really looking forward to being able to write:

   for key, value in map {
     // Do stuff
   }

As opposed to Java style:

   for (Map.Entry<KeyType, ValueType> entry : map) {
     KeyType key = entry.getKey();
     ValueType value = entry.getValue();
     // Do stuff
   }

The former is so much nicer in my opinion.

Tuesday, December 15, 2009

Doc updates

I recently redid the way that the Tart documentation is being hosted.

Now that Google Code has the ability to host multiple Mercurial repositories per project, I have split off the compiled documentation into a separate repository. The documentation sources are still in the same repo as the compiler sources, as they should be. (Because one should ideally change with the other, although the docs are sadly out of date.)

So here's how it all works: on my laptop I have two local Mercurial repositories, 'tart' and 'tart-docs'. The 'build/html' directory in 'tart' is a symlink to 'tart-docs'. Thus, whenever I do a "make html", it updates the 'tart-docs' repository. Then all I have to do is an hg commit and hg push to update the documentation on the site. Since Google Code's Mercurial support is able to host the raw HTML files straight out of the repository (with images and CSS, no less!), it means that as soon as I check in, the updated docs are immediately visible. Really neat!

You can see the lovely results here:


BTW, did I mention how great Sphinx and Pygments are? And how awesome Georg Brandl is?

--
-- Talin

Monday, December 14, 2009

Status report

There's a pesky bug in the compiler which has been sucking up my time for several days. Essentially I'm having to play whack-a-mole with dependencies whenever I try to enable reflection - the data structures for reflection need certain classes to be imported into the module, and that's not happening for some reason.

A bit of good news is that the tartc compiler now runs under Windows, although the linker does not yet. Even if I get the linker working, there is still the issue of getting the resulting programs to actually run, which will require a lot of work, especially in the area of exception handling.

I have a new language idea that I have been thinking about:

Phil Gossett was trying to educate me recently as to the difference between "adhoc" polymorphism and "parametric" polymorphism, specifically as regards my template syntax. (He was also trying to get me to drink the 'strictly functional' kool-aid, but I wasn't having any of it.)

Parametric polymorphism (which is what Phil considers the "good" kind) is where you have some form of generic or parameterized type that has one or more type parameters. An example is List[T], where T is a type parameter. In parametric polymorphism, you have to declare 'T' as having some sort of upper bound type, and you can only call methods which are defined in that upper bound. An example would be, say, List[Number], which could then be specialized to List[int], List[float] and so on, but not List[String].

Adhoc polymorphism (which he considers the "bad" kind - as in "that way leads to madness") is more like duck typing - once you define List[T], T can be anything, and if you call something like T.toString(), that works as long as T has a toString() method. There's no requirement that the type bound to T have any particular base class as long as it conforms to the implicit contract.
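
In Tart terms, the adhoc flavor looks roughly like this (a made-up function, using the [%T] template syntax):

// T is completely unconstrained; the call to toString() is accepted as
// long as whatever type eventually gets bound to T has a toString() method.
def describe[%T](v:T) -> String {
  return v.toString();
}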

C++ "concepts" are somewhere in between these two - that is, you make the contract explicit rather than implicit (by requiring that T have certain methods) but you don't require that T derive from some particular base class.

Now, given my experience with Python, where adhoc polymorphism has historically been the only kind, I think that such fears are overblown. Just like dynamic typing, it's something that scares the "bondage and dominance" school of language designers - when in practice the kinds of disasters they worry about almost never happen. (I'm surprised that no one has yet done a PhD thesis on whether the lack of strict type checking makes dynamic languages less reliable in practice - and more interestingly, if the answer is "it doesn't" then why?)

However, it occurs to me that there's one concrete benefit to parametric polymorphism, and that is avoiding template bloat. You see, if a template parameter must derive from some class, and can only use methods of that class, then the exact same generated code will work for all subclasses, assuming that those classes have binary compatibility (which is true for all reference types, but not for all value types.) Thus, if a template only calls "toString()" then the same exact set of machine instructions can be used for any subclass of Object.

Note that the savings in space can in some cases cause a small loss in speed - by re-using the same template instance for both Object and String, say, it means that some method calls have to be dispatched dynamically, whereas if the compiler generated a template instance for class String only, it would then be able to resolve certain method calls at compile time and thus generate faster code.

So it seems to me that there is perhaps a use for both types of polymorphism. One way to implement this would be to have two versions of the "<:" (issubclass) operator - a 'strict' and a 'relaxed' version. By making it an operator, it allows you to specify the strictness for each template parameter without affecting the other template parameters. It means that programmers can decide on a case-by-case basis whether to optimize for size or speed.

How to specify this in syntax I really haven't thought too deeply about yet.

-- Talin

Tuesday, December 8, 2009

Enum.toString()

Last night I got the 'toString()' method working for enum constants:

var x = TestEnum.Rock;
Debug.assertEq("Rock", x.toString());

Speaking of toString, I haven't yet figured out what to call the "formatted toString" function. This is a version of toString that takes a single string argument which is the format spec, the equivalent of PEP 3101's '__format__' method. I don't want to use '__format__' because Tart doesn't use double underscores for special methods. (Rather, I just choose names that are sufficiently unique, such as "infixAdd".) Some variations I have considered are: toStringFmt, toStringF, toStringFormatted, but I'm not that happy with any of them.

I also fixed a couple of bugs in Type.of, so now you can say:

var ty:Type = Type.of(String);
Debug.assertEq("tart.core.String", ty.toString());
Debug.assertTrue(ty isa ComplexType);

... and so on.

Monday, December 7, 2009

TypeLiteral

I've been following Guido's suggestion about keeping the methods that relate to reflection in a separate namespace from the actual class methods. For each user-defined class, there is a separate instance of tart.reflect.Type which contains information about the type name, base classes, lists of methods, and so on.

This reflection structure is entirely separate from the regular "class" object which contains the jump tables and superclass array. In fact, the class contains no references to the reflection data at all, as the relationship is one-way only. This means that if you aren't using reflection, the various data structures can be deleted by the linker during optimization. Unfortunately, this also means that if you do need reflection, you have to explicitly register the classes that you want to reflect. Since you can register whole modules and/or packages with a single call, this is not that much of a hardship.

So now that we have these two data structures, how do you get from one to the other? Here's what I implemented today:

Normally, when a type name is used, it is either used as a constructor: Foo(), or as a namespace for statics: Foo.method. However, what happens if we attempt to use Foo as a value? Another way to ask this question is to say "what is the type of the word 'Foo' in the source code?" The answer is that Foo is a type literal. Specifically, its type is TypeLiteral[Foo], meaning that it's a type that has a template parameter which is the type represented by the literal.

There are a couple of different ways you can use a type literal. First, you can convert it into a pointer to the type reflection object using Type.of(). So for example, if I say Type.of(String), what I get is the reflection info for class String. (The 'of' method in class Type is an intrinsic that does this.) Note that this use of 'of' automatically pulls in the reflection data for that type, so there's no need to explicitly register that class.

The other way you can use a type literal is by binding its template argument to a type variable. For example:

   def newInstance[%T](type:TypeLiteral[T]) -> T { return T(); }

If we pass "String" as the argument, what actually gets passed is TypeLiteral[String]. This in turn causes the type variable T to be bound to 'String', at which point the definition of 'T' is now available in the body of the function. Note that we never actually use the value of the type literal in this case, only it's type - which will most often be the case, since TypeLiterals have no properties other than the single type parameter.

Why did we not simply declare the argument as T rather than TypeLiteral[T]? Because if we're passing in, say, a String, then 'T' means an actual instance of a String, whereas TypeLiteral[T] means we are talking about the String *type*.
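
As a usage sketch, assuming SomeClass is a hypothetical class with a no-argument constructor:

// Passing the type literal 'SomeClass' binds T to SomeClass, so this
// call returns a freshly constructed SomeClass instance.
let obj = newInstance(SomeClass);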