Machine Words: type Memory.Address

One of the challenges in the design of Tart is reconciling two competing goals:

The desire to make Tart a "safe" language, in the sense of disallowing invalid pointer accesses and similar errors.
The desire for efficiency and the ability to get "close to the metal" when needed.

There is also a third goal which depends on the second: that the Tart core libraries be written in Tart to the maximum extent possible. Since many of the core library functions require low-level access to the hardware, there has to be some means of expressing that access within Tart.

The route that I have chosen is inspired by C#: a "safe" subset of the language that is used normally, plus an "unsafe" set of language extensions which give more direct access.

One question to be asked is: What keeps a programmer from enabling the unsafe language extensions always? As much as any competent software engineer may be dedicated to the principle of writing reliable code, there is always the temptation to take shortcuts, especially when faced with short deadlines.

My approach is to design the language so that such shortcuts are gently discouraged but not prohibited. In particular, there are a couple of speedbumps:

Unsafe language extensions are enabled via a command-line switch to the compiler, rather than in code. This means that turning them on requires modifying build files.
The unsafe language extensions have been designed such that their syntax is deliberately verbose. (Actually, it would be more accurate to say that when it comes time to dole out the limited number of available syntactical shortcuts to various language features, the unsafe language extensions are last in line.) Thus, a native pointer is declared as "NativePointer[type]" rather than the more convenient "type*" or "type^".

In my initial design, I had two types for dealing with raw memory: NativePointer and NativeArray. NativePointer represented a C style pointer, except without the ability to do pointer arithmetic. NativeArray represented a C style array of fixed length, with no range checking.

However, as I have been working on creating the core libraries, I realized that these two types are too limiting. In particular, there are a lot of low-level functions that deal with ranges of memory of varying size, which were hard to express with the two native types. Up to this point, I had been using NativePointer[NativeArray[type, 0]] to represent the start of a range of memory, but this was both cumbersome and incorrect.

So I have added a new native type, whose full name is "tart.core.Memory.Address". ("Memory" is a module that contains a set of utility functions for dealing with memory - it's the primary source of "unsafe" operations dealing with memory access.)

Semantically, the "Address" type is just like NativePointer, except that it also allows array element dereference like a C pointer: data[index]. Although array element dereference implicitly involves pointer math, Address does not allow arithmetic operators. For example, getting the Address of an array element is done via Memory.addressOf(array[index]). (Memory.addressOf() is an intrinsic that is equivalent to C's '&'.)

However, the real difference between Address and NativePointer is in the connotation. "Address" is meant to represent a *boundary* rather than a pointer to an element. Two such boundaries, or a boundary and a length, define a range of memory.
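
To make the "boundary" idea concrete, here is a rough C analogue (C rather than Tart, since the comparison above is to C pointers). The names ByteRange, range_from_length, range_length and range_sum are hypothetical and are not part of Tart or its Memory module. The sketch only shows how two boundaries, or a boundary and a length, pin down a range, and it forms addresses with &data[index] in the spirit of Memory.addressOf(array[index]).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical C analogue of a memory range delimited by two boundaries.
   Neither pointer is "the element"; each marks an edge of the range. */
typedef struct {
    const uint8_t *begin; /* boundary before the first byte */
    const uint8_t *end;   /* boundary just past the last byte */
} ByteRange;

/* Build a range from a boundary and a length. The end boundary is formed
   with &begin[length], mirroring Memory.addressOf(array[index]). */
static ByteRange range_from_length(const uint8_t *begin, size_t length) {
    ByteRange r = { begin, &begin[length] };
    return r;
}

/* Number of bytes between the two boundaries. */
static size_t range_length(ByteRange r) {
    assert(r.begin <= r.end);
    return (size_t)(r.end - r.begin);
}

/* Sum the bytes in a range using element indexing only. */
static unsigned range_sum(ByteRange r) {
    size_t n = range_length(r);
    unsigned total = 0;
    for (size_t i = 0; i < n; i++) {
        total += r.begin[i];
    }
    return total;
}
```

Given a four-byte buffer buf, both range_from_length(&buf[0], 4) and the pair { &buf[2], &buf[4] } describe valid ranges; the second boundary in each case points one past the last byte and is never dereferenced.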

As an example of what this allows, the UTF8 decoder class initially had a "decode" function that looked like this:

def decode(dst:char[], dstIndex:int, src:ubyte[], srcIndex:int, count:int) -> int;

However, that decode function now delegates the actual work to this one:

def decodeRaw(dst:Address[char], dstLength:int, src:Address[ubyte], srcLength:int) -> int;

There are a number of advantages to doing this:

Classes such as String which contain an internal buffer of characters can call decodeRaw without having to convert their buffers to an array object.
Range checking is done once in the higher-level function, rather than checking the range on each element access individually.
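
The split between a checked entry point and a raw worker can be sketched in C as follows. This is an illustration under stated assumptions, not the Tart library code: CharArray and ByteArray are hypothetical stand-ins for Tart's array objects, the -1 error return is my own convention (the post does not say how the Tart version reports a bad range), and the decoder is deliberately minimal. It stops at the first malformed or truncated sequence and does not reject overlong encodings or surrogates.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Raw worker: trusts that dst and src really cover the given lengths.
   Returns the number of code points written. */
static size_t decode_raw(uint32_t *dst, size_t dst_length,
                         const uint8_t *src, size_t src_length) {
    size_t si = 0, di = 0;
    while (si < src_length && di < dst_length) {
        uint8_t b = src[si];
        uint32_t cp;
        size_t extra; /* number of continuation bytes expected */
        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else break;                              /* invalid lead byte */
        if (src_length - si - 1 < extra) break;  /* truncated sequence */
        size_t k;
        for (k = 1; k <= extra; k++) {
            uint8_t c = src[si + k];
            if ((c & 0xC0) != 0x80) break;       /* not a continuation byte */
            cp = (cp << 6) | (c & 0x3F);
        }
        if (k <= extra) break;                   /* loop exited early */
        dst[di++] = cp;
        si += extra + 1;
    }
    return di;
}

/* Hypothetical stand-ins for array objects that know their own length. */
typedef struct { uint32_t *data; size_t length; } CharArray;
typedef struct { const uint8_t *data; size_t length; } ByteArray;

/* Checked entry point: validates the requested ranges once, then hands
   plain addresses and lengths to the raw worker.
   Returns the number of code points written, or -1 on a bad range. */
static long decode(CharArray dst, size_t dst_index,
                   ByteArray src, size_t src_index, size_t count) {
    if (dst_index > dst.length) return -1;
    if (src_index > src.length || count > src.length - src_index) return -1;
    return (long)decode_raw(&dst.data[dst_index], dst.length - dst_index,
                            &src.data[src_index], count);
}
```

A type with its own internal buffer, such as a string class, could call decode_raw directly with the buffer's address and length, which is the first advantage listed above. The checked decode shows the second: the index and count validation happens once per call instead of on every element access.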

Of course, there is always the possibility of a bug in the "unsafe" code which would have been caught if I had restricted myself to the slower but safer language subset.

My response to that is that the use of "unsafe" language extensions should be restricted to a small number of modules which are subject to additional scrutiny.

--
-- Talin

Sunday, October 4, 2009