The Irken Language

Irken is a simplified, statically-typed dialect of Scheme. It uses an ML-like type system supporting parametric polymorphism (i.e., "let polymorphism") and algebraic datatypes. With type inference and a pattern-matching syntax, it could be considered ML with a lisp/scheme syntax.

Syntax

[A few other discrepancies will remain undocumented in the hope that they will someday be fixed]

Datatypes

Compared to Scheme, the biggest change is that all expressions are statically typed. Expressions like

'(1
"two" three)

will not pass the type checker. All lists must be monomorphic, i.e., a list of ints, or a list of bools, etc... This sounds horribly restrictive, but in practice it is not; indeed the type safety guarantees made by this type system make for a more secure design that can outperform dynamic languages because it needs no runtime type checking.

You may, for example, declare a datatype (i.e., a union) of these three types and make a list of those objects.

;; -*- Mode: Irken -*-

(include "lib/core.scm")
(include "lib/pair.scm")

(datatype thing
  (:int int)
  (:str string)
  (:sym symbol)
  )

Note: int, string, and symbol are builtin base/ground types

Objects of this datatype are created by calling their constructors:

(LIST (thing:int 1) (thing:str "two") (thing:sym 'three))

The list datatype is declared thus:

(datatype list
  (:nil)
  (:cons 'a (list 'a))
 )

In English, this means that the datatype list consists of two different variants. The first, nil, contains no additional data, and corresponds to an empty list. The second, cons, contains two objects. The first element is of type 'a, and the second is of type (list 'a), making this a recursively defined datatype. In other words, a cons object has a CAR of any type, and a CDR of a list of that same type. The type 'a is a type variable, which means it can take on any type.

list is not built into the compiler, but rather declared as part of the prologue in the file "lib/pair.scm".

If you want to write a function that deals with lists, the type checker will force you to deal with each branch of the datatype. This is the source of Irken's type safety: it's impossible to accidentally not handle a nil case.

Constructors

A constructor is an automatically generated function that creates an object based on one of the branches/variants of a datatype declaration. The name of a constructor function is formed from the names of the datatype and variant, delimited with a colon: <datatype>:<constructor>.

(list:cons x (list:nil))

This may seem a little tedious, and it can be. Like ML, Irken has some extra support for the list datatype (like the LIST macro used above) to make it a bit more convenient.

In the ML family of languages, constructors are usually globally unique, so the datatype prefix is unnecessary. A single global namespace for constructors means you can't use the same name more than once. This leads to the use of Hungarian notation (e.g, "TXNil"); I think it's cleaner to just require the name of the datatype to disambiguate.

Pattern Matching

How do you write functions that deal with this datatype? Irken has a builtin vcase special form (similar to a Scheme case), that covers the branches of a datatype constructor. Here's how the length function looks using vcase:


(define (length l)
  (let loop ((l l)
	     (n 0))
    (vcase list l
       ((:nil) n)
       ((:cons _ tl)
	(loop tl (+ n 1))))))

(define length
    ()       -> 0
    (_ . tl) -> (+ 1 (length tl)))

(length '(1 2 3 4 5))

Wow, that looks a lot cleaner! (There's actually something a little unfair about the comparison that I will explain later). The vcase special form isn't really needed in Irken any more; however pattern matching expressions are translated automatically by the compiler into vcase expressions.

Notice a few things about this definition of length. First, there is no argument list; the name of the function follows define unadorned. That's because names for the formal arguments are superfluous - the patterns destructure the cases, (optionally) binding variables to the objects inside each branch.

Secondly, you see some of the builtin language support for the list datatype. Normally, the patterns would have to be written like this:

(define length1
  (list:nil)       -> 0
  (list:cons _ tl) -> (+ 1 (length1 tl)))

The empty list () corresponds to (list:nil), and (_ . tl) corresponds to (list:cons _ tl). (If you're from the ML/Haskell world, (hd . tl) is the same thing as [hd::tl]).

A third thing to note are the wildcard variables, denoted with underscores (_). These are "don't cares", and internally the compiler doesn't even bother to look at them or bind them.

Let's look at another example of pattern matching on something other than lists.

;; -*- Mode: Irken -*-

(include "lib/core.scm")

(datatype tree
  (:empty)
  (:node 'a (tree 'a) (tree 'a))
  )

(define indent
  0 -> #f
  n -> (begin (print-string "  ") (indent (- n 1)))
  )

(define tree/print
  d (tree:empty)               -> #f
  d (tree:node val left right) -> (begin
				    (indent d)
				    (tree/print (+ d 1) left)
				    (print val)
				    (print-string "\n")
				    (tree/print (+ d 1) right)))

(let ((t (tree:node
	  5
	  (tree:node 7 (tree:empty) (tree:node 12 (tree:empty) (tree:empty)))
	  (tree:node 9 (tree:empty) (tree:empty))
	  )))
  (tree/print 0 t)
  )

The datatype tree defines a simple binary tree. A tree is either empty, or a node containing an object of type 'a (i.e., this datatype is polymorphic), and two more trees of that same type - the left and right branches of that node.

The tree/print function uses a helper for indenting its output. We've defined it succinctly using pattern matching. The indent function takes a single integer argument - the amount of indentation to print. The base case of zero simply returns a boolean, which becomes the return type of the whole function. The second case binds the argument to the variable n, which is then used to print two spaces before recursively calling indent.

The tree/print function itself takes two arguments - a depth argument d and the tree to print. It prints the left side of the tree first (indented by one step), the value of that node, then finally the right tree.

The `match` special form

Sometimes you want to pattern match somewhere other than the arguments to a defined function. The match special form facilitates this:

(match <exp0> <exp1> ... with
  <pat0> <pat1> ... -> <result0>
  <pat0> <pat1> ... -> <result1>
  )

  (match (+ 1 x) y with
     12 z -> (+ z 9)
     15 9 -> 0
     x  y -> (+ x y)
     ))

Tail Recursion is Iteration

Like Scheme, Irken is properly tail recursive. Irken doesn't have imperative-style looping constructs. Instead, the compiler detects tail-recursive calls and translates them into a goto (with the C back end, literally). For most people, picking up this style of programming is relatively easy, especially when you learn the 'accumulator' trick. Let's take another look at that pattern-matching length function:

(define length
    ()       -> 0
    (_ . tl) -> (+ 1 (length tl)))

This function is recursive, but it's not tail-recursive. Why? Look at the recursive call to length. After calling length, it adds one to the result. When length is called, the stack will fill up with a bunch of +1 calls until it reaches the end of the list. It's O(N) time, but it's also O(N) space. The accumulator trick will fix the problem:

(define length2
  ()       acc -> acc
  (_ . tl) acc -> (length2 tl (+ 1 acc))
  )

(define (length3 l)
  (length2 l 0))

Note here that the call to the + function is inside the recursive call (i.e., one of its arguments), rather than outside. This version of the length function is properly tail recursive, and will compute the length of a list in O(N) time and O(1) space; just like the loop you might have written in C. [It's called a tail call because it's the very last thing the function does]. Another thing: length2 takes two arguments, not one. Note how the acc variable passes through the pattern match untouched.

Notice the auxiliary function length3. It provides the same interface as the original function, and provides the initial value for the accumulator. You could hide the definition of length2 inside length3, like this:

(define (length4 l)
  (define recur
    ()       acc -> acc
    (_ . tl) acc -> (recur tl (+ 1 acc))
    )
  (recur l 0)
  )

Another approach would be to bind acc inside length4, avoiding the acc argument inside recur. These are style issues, ones that I'm still working out myself.

Now you understand why the vcase comparison above was a little unfair. The vcase version was written with an accumulator loop, and the first pattern-matching version was not.

Macros

Irken includes a simplified version of Scheme's syntax-rules macro facility. The main purpose of macros in Irken is for syntax extension, not for inlining of functions (the compiler already inlines aggressively).

The macro facility is of the 'macro by example' variety. The syntax superficially resembles normal pattern matching in Irken, but it's important to remember that you are manipulating source code, before it gets to the compiler.

Here's the definition of the and special form (see "lib/derived.scm"):

(defmacro and
  (and)                 -> #t
  (and test)            -> test
  (and test1 test2 ...) -> (if test1 (and test2 ...) #f)
  )

This macro defines three cases of translation. An empty and is converted into the boolean #t. An and with only one case in it simplifies to that case. The normal use of and is converted into an if conditional. Notice that the macro itself is recursive!

Here's another useful macro. Given a datatype for a lisp 'association list'... (see "lib/alist.scm")

(datatype alist
  (:nil)
  (:entry 'a 'b (alist 'a 'b))
 )

...you might want to create a table. But the constructor syntax is pretty annoying:

(define numbers
  (literal
   (alist:entry
    0 'zero
    (alist:entry
     1 'one
     (alist:entry
      2 'two
      (alist:entry
       3 'three
       (alist:entry
	4 'four
	(alist:entry
	 5 'five
	 (alist:entry
	  6 'six
	  (alist:entry
	   7 'seven
	   (alist:entry
	    8 'eight
	    (alist:entry
	     9 'nine
	     (alist:nil)))))))))))))

Here's a macro to the rescue:

(defmacro alist/make
  (alist/make)                     -> (alist:nil)
  (alist/make (k0 v0) (k1 v1) ...) -> (alist:entry k0 v0 (alist/make (k1 v1) ...))
 )

And the much cleaner table definition

(define numbers2
  (literal
   (alist/make 
    (0 'zero)
    (1 'one)
    (2 'two)
    (3 'three)
    (4 'four)
    (5 'five)
    (6 'six)
    (7 'seven)
    (8 'eight)
    (9 'nine)
    )))

Polymorphic Records

Irken supports polymorphic records based on rows. A great introduction to Row Polymorphism (especially to support object-oriented features) is a slide set from Neel Krishnaswami at CMU. Rows are a very powerful feature, one that you won't find in Standard ML or Haskell. Of the mainstream ML's, only OCaml supports them, and uses them as the base for their OO/class features.

Ok, now in English/Tutorialese.

Here's how you can create a record ('struct' in C parlance) in Irken:

;; -*- Mode: Irken -*-

(include "lib/core.scm")

(define my-record {f0=#t f1="stringy" f2=1234})

(printn my-record.f0)
(printn my-record.f1)

The brackets { and } surround the record literal. Each slot in the record is named with a label.

When you run this file through the compiler, you'll see the type assigned to my-record:

my-record: {f2=int, f1=string, f0=bool}

(Note: at the moment the compiler is printing row types in a much more frightening way, this will be fixed soon).

As far as the compiler is concerned, the ordering of these fields is arbitrary.

Here's a function that will deal with that record:

(define (fun r)
  (+ r.f2 34))

If you look at the type of fun as discovered by the compiler, you'll see that it is polymorphic on the field f2.:

fun: {f2=int, ...}

What this means is that fun will accept for its argument any record type containing an integer field named "f2".

There is no need to declare record types, just use them. Here's an example of pattern matching on records:

(include "lib/core.scm")

(define thing
  {a=x b=2} -> x
  {a=3 b=y} -> y
  {a=m b=n} -> (+ m n)
  )

(printn (thing {a=3 b=1}))
(printn (thing {a=3 b=2}))
(printn (thing {a=4 b=5}))

Polymorphic Variants

There's a type-theoretic flip side to polymorphic records: polymorphic variants are the dual of polymorphic records. Mathematically, records are a "product type", and variants are a "sum type". The datatype declaration in Irken declares a boring, statically defined variant/sum type. But in Irken, you can create and use polymorphic variants on the fly:

;; -*- Mode: Irken -*-

(include "lib/core.scm")

(define v0 (:pair 3 #t))
(define v1 (:thingy 12))

The constructor syntax is very similar to the normal datatype constructor; only that the name of the datatype is left off.

(define fun
  (:pair n _) -> n
  (:thingy n) -> n
  )

(printn v0)
(printn (fun v0))
(printn (fun v1))

The type for the function fun is:

Σ{thingy='a pair=('a,'b)}

This means it'll accept any polymorphic variant of that shape: a variant called 'thingy' containing one object (of any type), or a variant named 'pair' containing two objects (where the first one matches the type held by the thingy variant)

Polymorphic variants are very handy; their ease of use, without declaration, anywhere, reminds me of a dynamically typed language like Python. They feel very Pythonic, yet have all the type safety you could ask for. In fact, I spent some months trying to get by in Irken with polymorphic variants as the main datatype abstraction. Two issues stopped me: First, they're inappropriate for systems-level programming - i.e., they are underspecified and would never fly in something like an OS Kernel. The second issue had to do with performance of the type solver given huge data structures (like parser tables) built using them.

OCaml supports polymorphic variants, they are introduced with a backquote in OCaml's syntax.

Type Inference

Irken uses a constraint-based type inference algorithm, based directly on Chapter 10 of "Advanced Topics in Types and Programming Languages". An extended version of the chapter is available online. Currently, the type solver does not support subtyping (it uses equality where a subtype rule would apply), but this should be relatively easy to change when the time is right.

The Compiler

The Irken compiler is written in Python, and compiles to C. It's a whole-program compiler that generates a single C file, in fact, a single C function, to represent the Irken program. The compiler translates Irken source to a core lambda-calculus language. After typing and optimization, the program is transformed into a CPS register-machine language before the final output to C.

The C output relies critically on two gcc extensions: the ability to take the address of a label, and inline/lexical functions. Other C compilers support the address-of-label extension, but fewer support lexical functions. The garbage collector is implemented as a lexical function to give it access to local heap-related variables, while still allowing the compiler to put them in registers. I've experimented with a non-lexical gc() function (and thus putting the heap-related variables into globals), and the impact on performance is hugely bad.

In practice, there are three compilers that can be used on the output from Irken:

gcc
llvm-gcc
dragonegg

The LLVM compiler actually does a better job of compiling the Irken output. Both compilers are able to see right through the integer tags and generate amazingly good code for integer math, for example.

The CPS/RTL output is extremely well suited for an LLVM backend, and I hope to someday explore this possibility.

In the meanwhile, the best performance to be had is with the dragonegg compiler. This is an LLVM backend jammed into gcc using its new plugin architecture.

The Runtime

Irken has a very lisp/scheme-like runtime. It uses runtime type tagging to enable the garbage collector to work without needing extra data from the compiler on the composition of the stack. This simplifies debugging enormously, and allows the runtime to print out objects in a useful manner. Do not underestimate the importance of this! However, there is nothing in the language that requires runtime type tagging. At some point in the future, a tagless, unboxed runtime could be designed (similar to OCaml), that would cut down on some of the heap and runtime overhead.

Integers

Integers are tagged with a single bit, the least significant. This allows the runtime to distinguish an integer by looking only at this bit. Any odd value is an integer. This means our int datatype loses one bit of information. In the 16-bit days, this was horribly painful. In the 32-bit days, it was a mild inconvenience. Today, with 64 bits, I would claim that it hardly matters at all.

Other immediate types

Characters, #t, #f, #undefined, nil, enum types, etc... are all represented with the second bit set and lowest bit cleared (i.e., multiples of 2, but not 4). A dataype declaration with an empty variant constructor is also represented from this code space. This allows all these values to be stored in a single word, and thus in a register.

Pointer Types

All pointer types have a one-word header stored on the heap. The lowest byte of this header gives a type code (allowing the runtime to distinguish between datatypes, and between builtin types). The remainder of the word gives the number of words to follow. Pointer types are distinguished at runtime by having their lowest two bits cleared - in other words, they are aligned to (at least) a 32-bit boundary.

The Garbage Collector

The garbage collector is a very simple Cheney two-space copying collector. It consists of 153 lines of C. Someday, someone will write a generational collector for Irken.

Image Dump/Load

The copying collector enables Irken to have a dump/load facility. An image of the running system can be dumped to disk and reloaded later very quickly. This would be useful for things like moving or distributing a computation, checkpointing, or to get a CGI to load quickly.

Features/Motivation

Why have I written this compiler? The compiler's main purpose is to compile a VM for a python-like language. The continuation-passing design in combination with the address-of-label extension just happens to generate code identical to what someone would write by hand when implementing a threaded virtual machine. I hope to be able to write a high-performance VM in Irken directly. The other main advantage is scalability. Both the VM (for a python-like language) and its low-level runtime (written in Irken) will support full-blown continuations. This will enable the design of massively scalable systems built on user/green/non-preemptive threads. The engineering goal is to support 100,000+ threads. This goal cannot be efficiently reached with a runtime that uses pre-allocated C stacks for each thread, so we avoid using the C stack completely.

Look here to get an idea what the VM code will look like.

It has been my experience that having a green-threaded VM is not enough to build world-class systems. Usually there is a requirement for performance that dictates dropping down into C. With a system like Stackless Python, this means writing C in continuation-passing style. This is not a job for mortals. Now, to write CPS code in C without introducing bugs that impact stability and security is even harder... a job for the gods themselves.

The plan is thus to drop down to Irken when extra performance is needed, and this will be much safer than C. It will also be possible to push performance-critical networking code down to the Irken layer.

History

Irken began life as a straightforward implementation of Scheme, intended for writing a VM without having to do so in C (which I despise). One of my favorite Scheme implementations is scheme48, which uses a similar implementation technique - a restricted version of Scheme called 'PreScheme' that uses type inference to cut down on runtime type checking as much as possible. I decided to try to learn about type inference. After about 18 months, I finally had something working well enough for my purposes.

Related Work

A similar project, with much bigger goals, is the BitC project. It is in many ways much further along than Irken, but less suited to my purposes because it uses the C stack. And they dumped the lisp syntax. 8^) But it's definitely worth following the project, I very much look forward to seeing kernels written in BitC soon!

References

This project is the culmination of about 15 years of work with Scheme, originally inspired by The Essentials of Programming Languages. All three editions are excellent, and very different books. The first edition is the largest and most challenging; but it contains a lot more information about continuation-passing style than the others.

Other critical references:

Friedman, Wand, & Haynes, "The Essentials of Progamming Languages". (all three editions)
Appel, "Compiling with Continuations". (also available online)
Appel, "Modern Compiler Implemenation in ML".
Jones & Lins, "Garbage Collection: Algorithms for Automatic Dynamic Memory Management".
Pierce, "Types and Programming Languages".
Pierce, "Advanced Topics in Types and Programming Languages".
Okasaki, "Purely Functional Data Structures".

The following books have the additional advantage of being available online for free. Many thanks to the authors!

Peyton-Jones, "The Implementation of Functional Programming Languages".
Krishnamurthi, "Programming Languages: Application and Interpretation"

[XXX: I have a stack of 20 or so papers that should also be listed here]

Last modified: Thu Feb 28 19:10:16 PST 2013