How to Bootstrap a Compiler for a Functional Programming Language

Chapter 5: Tokenization

Feb 07, 2024

Tokenization is the process of breaking source text into a sequence of tokens, the smallest symbols that the later stages of the compiler work with.

The Tokenizer Function

tokenize := λbuff text. match text (
   # \o is an escape sequence that represents the octothorpe "#" character
   # octothorpe is the beginning of line comments
   ((\o ts) seq-if-nonempty (clone-rope buff) (tokenize-comment ts))
   # newlines, tabs, or spaces end the current token and clear the buffer
   ((\n ts) seq-if-nonempty (clone-rope buff) (tokenize () ts))
   ((\t ts) seq-if-nonempty (clone-rope buff) (tokenize () ts))
   ((\s ts) seq-if-nonempty (clone-rope buff) (tokenize () ts))
   # some characters, like parens, always become standalone tokens
   ((\[ ts) seq-if-nonempty (clone-rope buff) \[ (tokenize () ts)) 
   ((\] ts) seq-if-nonempty (clone-rope buff) \] (tokenize () ts))
   ((\' ts) seq-if-nonempty (clone-rope buff) \' (tokenize () ts))
   ((; ts) seq-if-nonempty (clone-rope buff) ; (tokenize () ts))
   ((. ts) seq-if-nonempty (clone-rope buff) . (tokenize () ts))
   # otherwise put this character into the buffer
   ((c ts) tokenize (buff c) ts)
   # if this is the end of the text, return whatever is left in the buffer
   (c (clone-rope (buff c)))
);

tokenize-comment := λtext. match text (
   # a newline ends the comment
   ((\n ts) tokenize () ts)      
   # otherwise keep dropping comment characters
   ((_ ts) tokenize-comment ts)  
);

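# two-argument form: prepend t1 to ts only when t1 is nonempty
seq-if-nonempty := λt1 ts. if t1 (t1 ts) ts;
# three-argument form: t2 is always emitted, t1 only when it is nonempty
seq-if-nonempty := λt1 t2 ts. if t1 (t1 (t2 ts)) (t2 ts);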
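
For readers who want to experiment outside the bootstrap language, here is a minimal Python sketch of the same idea: a buffer collects characters, whitespace and standalone characters end the current token, and comments are dropped. It is not the book's implementation. The names WHITESPACE, STANDALONE, and flush are made up for this sketch, and it assumes that \[ and \] in the code above stand for the paren characters. The sketch also emits any pending buffer before a comment starts and treats the end of input inside a comment as the end of the token stream; both are choices made for this sketch.

```python
# Illustrative only: a Python rendering of the buffer-based tokenizer idea.
WHITESPACE = {"\n", "\t", " "}
# Assumption: \[ and \] in the original code denote "(" and ")".
STANDALONE = {"(", ")", "'", ";", "."}


def tokenize(text: str) -> list[str]:
    tokens: list[str] = []
    buff: list[str] = []  # characters of the token currently being built
    in_comment = False

    def flush() -> None:
        # Plays the role of seq-if-nonempty: emit the buffer only if nonempty.
        if buff:
            tokens.append("".join(buff))
            buff.clear()

    for c in text:
        if in_comment:
            # Keep dropping characters; a newline ends the comment.
            in_comment = c != "\n"
        elif c == "#":
            flush()
            in_comment = True
        elif c in WHITESPACE:
            flush()
        elif c in STANDALONE:
            # Standalone characters end the current token and become their own.
            flush()
            tokens.append(c)
        else:
            buff.append(c)
    flush()  # end of text: return whatever is left in the buffer
    return tokens
```

The loop replaces the recursion of the original, and the tokens list replaces the sequence that the recursive calls build up, but the cases line up one to one with the match arms.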

Tokenization by Example

Here is an example file of source code.

a.b(c de);#comment

These are the tokens that the tokenizer returns for this file. The comment is dropped.

a . b ( c de ) ;
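
To see where each token comes from, it can help to watch the buffer change one character at a time. The Python sketch below is only an illustration, not part of the book's compiler: the function name trace and the tuple layout are invented here, and the set of standalone characters is assumed to be the parens, the single quote, the semicolon, and the period.

```python
def trace(text: str) -> list[tuple[str, str, list[str]]]:
    """Return one (character, buffer after the step, tokens emitted) per step."""
    standalone = set("()';.")  # assumed set of always-standalone characters
    steps = []
    buff = ""
    in_comment = False
    for c in text:
        emitted: list[str] = []
        if in_comment:
            in_comment = c != "\n"  # a newline ends the comment
        elif c == "#":
            if buff:
                emitted.append(buff)
            buff = ""
            in_comment = True
        elif c in " \t\n" or c in standalone:
            if buff:
                emitted.append(buff)  # the buffered token comes out first
            buff = ""
            if c in standalone:
                emitted.append(c)  # then the standalone character itself
        else:
            buff += c
        steps.append((c, buff, emitted))
    # A leftover buffer at end of input would still need a final flush;
    # the example below ends with an empty buffer, so the trace omits it.
    return steps


for c, buff, emitted in trace("a.b(c de);#comment"):
    print(repr(c), repr(buff), emitted)
```

In the first step the buffer holds a and nothing is emitted. At the period, the buffer is emptied and both a and . come out in that order, which is the three-argument seq-if-nonempty case.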

That is all that tokenization does. Not so bad, right?

This is an Excerpt from The Bootstrap Book

The Bootstrap Book is released under the terms of the permissive MIT license. If you quote or copy the book, please cite it. That is all I ask.

Andrew’s Substack
