Sunday, November 29, 2009

What's next?

I still have a long TODO list, but things are steadily progressing. Here are a few of the items that are up next:

* Compile under Visual Studio. I've set up my new ThinkPad to dual-boot both Windows and Ubuntu (my normal development machine is a Mac), and installed Visual Studio 2008 Express on it. Using cmake to generate a VS project for LLVM worked beautifully, and LLVM compiled without a hitch.

Unfortunately, Tart's cmake config files aren't in such good shape. My FindLLVM.cmake module relies rather heavily on the llvm-config script to generate header and library dependencies - and llvm-config is not available under Windows. I did some code searches, and it appears that many projects specify the dependencies manually for Windows and automatically for other platforms. I don't really want to maintain the dependencies manually; part of why I migrated over to llvm-config was to get away from that. (LLVM has 30 or so separate libs, and keeping the dependencies straight is a pain.)

My current thought is to create a Python script that calls llvm-config when running under Unix, and then transforms the output to cmake syntax. This would then be used to keep FindLLVM.cmake up to date.
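
Roughly, something like the following sketch. (The llvm-config flag is real, but the output file name and the cmake variable name are placeholders I've invented for illustration.)

import subprocess

# Run llvm-config with the given flags and return its output as a
# list of tokens.
def llvm_config(*flags):
    proc = subprocess.Popen(('llvm-config',) + flags,
                            stdout=subprocess.PIPE)
    out, _ = proc.communicate()
    return out.split()

# Emit a cmake fragment defining the list of LLVM libraries, which
# FindLLVM.cmake could then include() instead of invoking llvm-config
# at configure time.
def main():
    f = open('LLVMLibDeps.cmake', 'w')
    f.write('set(LLVM_LIBS\n')
    for lib in llvm_config('--libnames'):
        f.write('  %s\n' % lib)
    f.write(')\n')
    f.close()

if __name__ == '__main__':
    main()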

Note that all this has to be done before I can even *think* about trying to compile Tart. At this point, I have no idea if it can even be done - at the very least, I would assume that my exception personality function will have to be completely rewritten.

* Get source-level debugging working. Most of the code to generate source-level debug info is done, but it's currently turned off because LLVM does the wrong thing when it's turned on.

* Finish up closures.

* Garbage collection. I started sketching out some Tart classes for the garbage collector. (Originally I was going to write the GC in C++, but I think I can do it in Tart nearly as well.) The issue that needs to be solved is how to handle platform differences - on Linux I can call mmap() to get raw blocks of aligned core memory, but on Windows I'll need to call something else (VirtualAlloc, presumably). The important thing is that the decision can't be made at runtime, since the program won't link if it contains calls to APIs that aren't present on that platform.

For swapping out code on different platforms, there are generally two strategies: conditional compilation (#ifdef in C++), and Makefile conditionals (i.e. substituting entire source files based on the current platform). Unfortunately, in a language like Tart, where there's a relationship between the scope hierarchy and the arrangement of files on the filesystem, you have to be extra tricky to support whole-file swapping, although it can be done. (The easiest way is to define a separate package root for each platform and put them in the class path.)
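
For example (a hypothetical layout - these paths and file names are made up), the build would put the common root plus exactly one platform root on the class path, so the same qualified name resolves to a different file on each platform:

    src/common/tart/gc/Collector.tart    // shared collector logic
    src/linux/tart/gc/PageAlloc.tart     // Linux implementation
    src/windows/tart/gc/PageAlloc.tart   // Windows implementation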

For source-level conditional compilation, I had planned to implement something like D's "static if" - an if-statement that can appear outside of a function (or inside), in which the test expression must be a constant, and which causes the code within the 'if' to not even be compiled if the test is false.

As far as the test expressions themselves, my current thinking is to define a package, tart.config, that contains the configuration options for the current host. One issue is how much we care about cross-compilation - it would be much easier if we didn't have to think about that just yet. Cross-compilation requires lots of command-line options to the compiler; for simple host-based compilation, we can let the config script do most of the work.
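
Putting the two ideas together, it might look something like this - purely a sketch, since neither 'static if' nor tart.config is implemented yet, and the constant name 'targetOS' is invented for illustration:

import tart.config;

// The branch whose test is false is never compiled at all, so it
// never has to link against APIs that don't exist on this platform.
static if tart.config.targetOS == "windows" {
  let pathSeparator = "\\";
} else {
  let pathSeparator = "/";
}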

* Some sort of command-line options package. I'd like to do some kind of annotation-based command-line option system, where each module can define its own options - ideally something Guice-like. I know this approach can lead to problems with low-level modules exporting options that are of no interest to the user, but I think that problem is manageable.

One thing I haven't quite figured out is how to collect all of the various command-line variables. The brute-force approach is to do reflection on every single field of every class, see if it has a "CmdLineParameter" annotation, and if so, process it. Another approach would be to not use annotations at all, but instead make the command-line options statically-constructed instances which register themselves at construction time with the command-line argument processor. This would of course require me to get static constructors working :)

The third approach is to have some sort of link-time process that collects all fields which are annotated with some particular annotation class and builds a list of them, which the program can simply iterate through at startup. This has to be done carefully, because we only want to include in that list fields which are actually used by the program. You see, the linker currently traces through all of the symbols reachable from "main" and discards any that aren't reachable - sort of a link-time garbage collector. But this hypothetical list of annotated symbols is itself a reference, which would prevent a symbol from being discarded even if no other part of the program referred to it. So we would want to construct such a list after the dead-global elimination phase.
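
For what it's worth, here's how the annotation-based version might look from a module author's point of view - entirely hypothetical, since the @CmdLineOption annotation and the CmdLine namespace don't exist yet:

// A module declares an option as an annotated field; the option
// processor would find it via the link-time list described above.
namespace LoggingOptions {
  @CmdLineOption(name="verbose", help="Print extra diagnostics")
  var verbose:bool = false;
}

@EntryPoint
def main(args:String[]) -> int {
  CmdLine.parse(args);  // walk the collected option fields
  if LoggingOptions.verbose {
    Debug.writeLn("Verbose mode enabled.");
  }
  return 0;
}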

* Tuple types. These are partly done, but we need to finish them so we can do multiple return values and unpacking assignment.
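
Something along these lines (a sketch of the intended syntax; the exact spelling may change):

// A function returning two values as a tuple, unpacked at the call
// site via unpacking assignment.
def minMax(a:int, b:int) -> (int, int) {
  if a < b {
    return a, b;
  } else {
    return b, a;
  }
}

def testMinMax {
  let lo, hi = minMax(3, 1);
  Debug.assertEq(1, lo);
  Debug.assertEq(3, hi);
}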

* Reflection work. Currently the only declarations that are reflected are global functions; we need to get the other declaration types reflected as well.

* Stack dumps - need a way to print out the current call stack. This one is going to be way hard.

* Finish up constructors - lots of work on inheritance, chaining, final checking, etc.

* Compiled imports. Right now when the compiler sees an "import" statement, it actually loads the source file and parses it, although it analyzes the resulting AST lazily (i.e. only the parts that are actually referred to by the module being compiled). What I'd like to do is take all of the high-level information that is lost when creating a .bc file and encode it using LLVM metadata nodes. The idea is that if a file has already been compiled, then instead of recompiling it, the compiler would just load the bitcode file and read the metadata - assuming the file hadn't changed, of course.

* "finally" statement. Although exceptions are working great, I haven't done 'finally'. However, this will be much easier now that LLVM allows for local indirect branches.

* Enum.toString(), Enum.parse().

* String formatting functions.

* I haven't even started on the I/O libraries. (No "Hello World" yet, unless you count Debug.writeLn().)

* yield / with statements.

* Docs, docs, docs...

--
-- Talin

Anonymous functions

Anonymous functions are now working, as shown in the following example:

def testAnonFn {
  let f = fn (i:int) -> int {
    return i + i;
  };
  
  Debug.assertEq(2, f(1));
  Debug.assertEq(4, f(2));
}

The keyword 'fn' basically means 'lambda'. In this case, the lambda function is being assigned to the variable 'f', which is then called in the two assert statements below.

Note, however, that the lambda is not yet a true closure, although I have started work on that. For now, the anonymous function can't access local variables from the enclosing scope. (It can, however, access variables from global scopes - otherwise the '+' operation would not compile.)

I have thought about allowing the 'fn' keyword to be dropped entirely when there are no arguments and no return values - essentially the equivalent of a Clang block - which would be designated with just { and }. I'm not sure that this is a great idea, though. There are other possible uses of {} (map literals, perhaps, although those could just as easily be written via [ => ]). From a parsing standpoint, there are no ambiguity problems - it's easy to distinguish a "block" from a "suite", except in the case of a standalone suite that is not part of some other statement, in which case they are semantically the same thing anyway.

The advantage would be that you could do things like "onExit({ print("Bye!"); })" as opposed to "onExit(fn { print("Bye!"); })". Not sure if the small increase in brevity is really worth it.

--
-- Talin

Status update

Lots of work got done over the Thanksgiving weekend. The biggest news is that reflected methods are now callable. This makes it possible to write a very simple unit test runner.

Here's how the test runner is invoked:

import tart.reflect.Module;
import tart.testing.TestUtils;

@EntryPoint
def main(args:String[]) -> int {
  return TestUtils.runModuleTests(Module.thisModule());
}

'runModuleTests' simply iterates through all of the global methods defined in the module, and calls any that begin with the letters "test". Here's what the function looks like:


import tart.reflect.Module;

namespace TestUtils {

  /** Simple function to run all of the test functions in a module. */
  def runModuleTests(m:Module) -> int {
    try {
      for method in m.methods {
        if method.name.startsWith("test") {
          Debug.write("Running: ", method.name, "...");
          method.call(m);
          Debug.writeLn("DONE");
        }
      }
    } catch t:Throwable {
      Debug.fail("Unexpected exception: ", t.toString());
      return 1;
    }

    return 0;
  }
}


Obviously, there's a lot of room for refinement here, but it demonstrates the basic principle.

The other thing I worked on is finishing up union types - and adding support for unconditional dynamic casting. More sample code:

  var x:int or float; // A union type

  // Integer type
  x = 1;

  // isa tests
  Debug.assertTrue(x isa int);
  Debug.assertFalse(x isa float);
  Debug.assertFalse(x isa String);

  // Type casts
  Debug.assertEq(1, typecast[int](x));
  try {
    Debug.assertEq(1.0, typecast[float](x));
    Debug.fail("union member test failed");
  } catch t:TypecastException {}

  // classify
  classify x as xi:int {
    Debug.assertTrue(xi == 1);
  } else {
    Debug.fail("union member test failed");
  }

The difference between 'classify' and 'typecast' is that the latter throws an exception if it fails, while the former transfers control to a different suite.

Monday, November 23, 2009

Protocols and template conditions

I spent most of the weekend working on protocols and template conditions, and while they aren't working yet, they are very close.

Template conditions allow you to specify a restriction on which types a template parameter can bind to. For example, say we wanted to define a template that could only be used with subclasses of Exception:

    class ExceptionFilter[%T <: Exception] { ... }

The '<:' operator is equivalent to the Python function "issubclass". %T is a type variable (the percent sign introduces a new type variable).

Tart allows you to have multiple definitions of the same template, as long as it is unambiguous which one to use:

    class LogFormatter[%T <: Exception] { ... }
    class LogFormatter[String] { ... }

Note that the second declaration of LogFormatter did not declare a type variable. In this case, the type argument 'String' is a type literal, not a variable - in other words, the template pattern will only match if the type argument is type 'String' exactly.

The template conditions can also be given outside of the template parameter list, using the 'where' clause:

    class LogFormatter[%T]
      where T <: Exception {
      ...
    }

There can be an arbitrary number of 'where' clauses on a template. (Note: Where clauses are not implemented, but they are not hard to do now that I have the basic condition architecture in place.)

One particularly useful template condition is whether or not a class has a particular method. Rather than defining a "hasmethod" operator, Tart allows you to declare a pseudo-interface called a "protocol". For example, if we wanted to test whether a class has a "toString()" method, we would start by defining a protocol containing the toString() method:

    protocol HasToString {
      def toString() -> String;
    }

Syntactically, protocols are like interfaces; however, they are pseudo-types, not real types - you can't declare a variable whose type is a protocol. You can inherit from a protocol, but all that does is cause the compiler to emit a warning if any of the protocol's methods are not implemented. It does not change the layout or the behavior of the class in any way.

Another difference with protocols is that any class that meets the requirements of the protocol is considered a subtype of the protocol - regardless of whether the class explicitly declares the protocol as a base type or not.

Here's a concrete example: Suppose we want to make a "universal stringifier" that will convert any value to a string:

    def str(obj:Object) -> String { return obj.toString(); }
    def str[%T <: HasToString](v:T) -> String { return v.toString(); }
    def str[%T](v:T) -> String { return "???"; }

The first overload handles all of the types which are subclasses of Object. Since we know "Object" has a toString() method, and since that method is dynamically dispatched (i.e. the equivalent of C++ "virtual"), the toString() call will get routed to the right place.

The next line handles all types that have a 'toString' method, whether or not they explicitly inherit from "HasToString". This includes things like integers and floats, which have a toString() method (in Tart, you can say things like "true.toString()"). Since this is a template, it will create a copy of the function for each different type of T. (That's why we handled 'Object' as a separate case, to limit the number of such copies generated.)

The third overload handles classes which don't have a 'toString' method. (If you are wondering what kinds of types don't have a default toString() method, the answer includes structs, unions, tuples, native C data types, and so on.) In the example, it just returns the string "???" but in fact it could be much smarter - at minimum printing the string name of type T.

The overload resolver will attempt to match the most specific template whose constraints are met. Thus, the first method is preferred over the second since the second requires binding a pattern variable, and the second is preferred over the third because it has a condition.

-- 
-- Talin

Sunday, November 15, 2009

Tart Status

Last week, my new laptop arrived, a ThinkPad T400s, which is turning out to be a pretty nice machine. I initially had some problems with Wubi, the Ubuntu Windows installer. Wubi creates a Linux filesystem as a file within the Windows NTFS filesystem, avoiding the need to repartition the drive. It then modifies the Windows boot loader to allow dual booting into either Windows or Linux.

All of this worked great until the first apt package update, which installed a new version of GRUB, the boot manager. This apparently messed up the Windows 7 bootloader so that it could no longer boot the Linux filesystem, and reading the Ubuntu forums didn't inspire much hope of ever getting it to work again. In order to recover my three days' worth of work, I had to create a rescue CD, mount the NTFS filesystem, and then mount the virtual Linux filesystem within it. Then I had to partition the drive to add a new Linux partition, rsync the data over to it, and configure a boot loader for a more conventional dual-boot system. This took about a day to figure out.

In any case, I now have Ubuntu working well, although I haven't yet learned how to set up the boot table needed to boot back into Windows. That can wait for the moment. I have Google Chrome and Eclipse humming along nicely, with sufficient tweaking and customization that I can be decently productive. The fact that I am able to type this email relatively fast, with no illumination on the keyboard, is an indicator that, from a form-factor standpoint, this laptop was the right choice. And it's nice and light - the case is carbon fiber.

(BTW, with regards to Chrome - at first I had planned to just get by with Firefox, and I spent a couple of days using that. But I'd gotten used to using Chrome at work - in particular, the "control-T + start typing" gesture to open a new tab and search is surprisingly addictive. So I broke down and installed Chrome and I find that I am a lot happier.)

In any case, I was able to spend the rest of the week focused on the reason that I got the laptop in the first place: Getting Tart to compile and run under an operating system other than Darwin/OS X. That took a while, but I was finally able to check in a version that runs all tests on Linux as well as OS X. One issue that bit me a number of times is that the order of link libraries is significant under Linux, but not under OS X - although I think that's probably more a function of the version of GCC that's available on those systems.

In addition to all that, I worked on a number of other minor Tart issues - during those periods where I was stumped on getting it to compile and run on Linux. One thing that I have been working on is a rough sketch of a "pure Tart" garbage collector. Recent changes in the Tart language (or more precisely, in the "unsafe" mode language extensions) have led me to the belief that the performance penalty for writing the collector in Tart might not be as significant as I originally estimated.

Of course, some parts, such as the mutex code, will have to be written in C unless I can figure out a way to get Tart to call POSIX and Win32 functions directly. But the collector algorithms might not have to be. At least, when I look at the Tart classes and their C++ equivalents, I can imagine how the code would be generated in either case, and I'm thinking that it might not be that different. The biggest hurdle will be dealing with pointers that have flags in the low bits - I'll probably have to add some special compiler intrinsics to deal with that case. There are plenty of other potential performance bottlenecks that I can foresee, but those would be present in a C/C++ implementation as well as a Tart one.

The advantage of writing a Tart GC is that it will stress the language design in a way that I would like it to be stressed - that is, the challenge of writing efficient low-level code. The disadvantage is that it's less likely that someone will come along and re-use my GC for another language (and hopefully improve it in the process) - in other words, there might be fewer eyeballs on the code. However, at this point it's by no means certain that such additional eyeballs will ever manifest.

Another thing I've done (which is not checked in yet) is create an Eclipse plugin for Tart source code. It handles syntax highlighting and a few other minor editing tweaks like auto-indentation. It doesn't handle reformatting, refactoring, or any of the advanced stuff. Sadly, it appears that the Eclipse framework doesn't handle this in a generalized way - meaning that each language has to define its own AST, parser, and refactoring engine. Thus JDT, CDT, and PyDev - for example - are all separate plugins that don't share any code beyond the basic Eclipse platform.

For now, the plugin is good enough to edit Tart code with, and I'm hoping that some Eclipse enthusiast will come along and finish what I've started :)

--
-- Talin