How to Bootstrap a Compiler for a Functional Programming Language

Chapter 1: S-Expressions

Jan 31, 2024

S-Expressions will be the primary data structure used to represent almost everything in the compiler.

The Type Signature

In a functional language we might define an S-Expression as follows.

type S = Nil | Atom(String) | Cons(S,S)

In C we can define the implementation as follows.

struct S {
   void* head,
   void* tail
};

The data structure is meant to represent three cases: Nil, Atom, and Cons. Nil represents the concept of emptiness, end, or nothing. Atom represents a single character string. Cons represents a pair of S-Expressions that have been stuck together.

Why are S-Expressions good?

The simple S-Expression is surprisingly powerful enough to represent the following data structures which are used in the compiler:

Strings
Rope
Lists
Tuples
Tagged Unions
Trees

How do we construct an S-Expression in assembly?

To generate the assembly for constructing S-Expressions, we first must make a few decisions about representation. For starters, we need to introduce the concept of "this expression". "this expression" is an S-Expression that will be stored in registers %r12 and %r13 (head and tail respectively). After evaluating any expression, the result of the expression will be left in these registers. Now we should be able to define the constructors for our S-expressions:

Nil sets the value of this expression to zeroes.

nil := λ. (mov $0 %r12) (mov $0 %r13);

Atom sets the value of head to the string pointer.

atom := λs. (mov s %r12) (mov $0 %r13);

Cons closes two expressions before moving the result into head and tail. close moves an expression from the registers onto the heap with the resulting pointer stored in %r8.

cons := λh t. (close h) (mov %r8 %r12) (close t) (mov %r8 %r13);

The close operation for moving data onto the heap is defined as:

close := λe. e (allocate-heap 16) (mov %r12 0\[%r8\]) (mov %r13 8\[%r8\]);

What common functions operate on S-Expressions?

The head function returns the head of the cons cell, if this is a cons cell, otherwise it returns nil.

head:
   (cmp $0 %r13)
   (je head_is_nil)
   (mov 8\[%r12\] %r13)
   (mov 0\[%r12\] %r12)
   ret
head_is_nil:
   (mov $0 %r12)
   ret

The tail function returns the tail of the cons cell, if this is a cons cell, otherwise it returns nil.

tail:
   (cmp $0 %r13)
   (je tail_is_nil)
   (mov 0\[%r13\] %r12)
   (mov 8\[%r13\] %r13)
   ret
tail_is_nil:
   (mov $0 %r12)
   ret

The eq function returns this if this references two equivalent atoms, otherwise it returns nil.

eq:
   (_string_eq())
   (jne eq_return_nil)
   ret
eq_return_nil:
   (mov $0 %r12)
   (mov $0 %r13)
   ret

This is an Excerpt from The Bootstrap Book

The Bootstrap Book is released under the terms of the permissive MIT license. If you quote or copy the book, please make a reference. That is all I ask, please.

Andrew’s Substack

Discussion about this post