Tokenization is the process of breaking a text document down into a sequence of symbols, called tokens.
The Tokenizer Function
tokenize := λbuff text. match text (
   # \o is an escape sequence that represents the octothorpe "#" character
   # an octothorpe begins a line comment
   ((\o ts) tokenize-comment ts)
   # whitespace (\n, \t, \s) ends the current token:
   # the buffer is emitted if it is nonempty, then a fresh buffer is started
   ((\n ts) seq-if-nonempty (clone-rope buff) (tokenize () ts))
   ((\t ts) seq-if-nonempty (clone-rope buff) (tokenize () ts))
   ((\s ts) seq-if-nonempty (clone-rope buff) (tokenize () ts))
   # some characters, like parens, always become standalone tokens
   ((\[ ts) seq-if-nonempty (clone-rope buff) \[ (tokenize () ts))
   ((\] ts) seq-if-nonempty (clone-rope buff) \] (tokenize () ts))
   ((\' ts) seq-if-nonempty (clone-rope buff) \' (tokenize () ts))
   ((; ts) seq-if-nonempty (clone-rope buff) ; (tokenize () ts))
   ((. ts) seq-if-nonempty (clone-rope buff) . (tokenize () ts))
   # otherwise put this character into the buffer
   ((c ts) tokenize (buff c) ts)
   # if this is the last character, add it to the buffer and return the buffer
   (c (clone-rope (buff c)))
);
tokenize-comment := λtext. match text (
   # a newline ends the comment
   ((\n ts) tokenize () ts)
   # otherwise keep dropping comment characters
   ((_ ts) tokenize-comment ts)
);
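# two-argument form: put t1 in front of ts only if t1 is nonempty
seq-if-nonempty := λt1 ts. if t1 (t1 ts) ts;
# three-argument form: t2 is always emitted; t1 goes in front of it only if t1 is nonempty
seq-if-nonempty := λt1 t2 ts. if t1 (t1 (t2 ts)) (t2 ts);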
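For readers more at home in a mainstream language, the same rules can be sketched in Python. This is an illustrative translation, not part of the bootstrap code. The names `tokenize`, `WHITESPACE`, `STANDALONE` and `flush` are chosen for this sketch, and it assumes that `\[`, `\]` and `\'` are escapes for "(", ")" and a single quote. It also simplifies the end of input: whatever is left in the buffer is emitted as the last token.

```python
WHITESPACE = {"\n", "\t", " "}
# Assumption: \[ \] \' in the listing stand for "(", ")" and "'".
STANDALONE = {"(", ")", "'", ";", "."}


def tokenize(text):
    """Split text into tokens, following the rules of the listing above."""
    tokens = []
    buff = []          # characters of the token currently being built
    in_comment = False

    def flush():
        # Counterpart of seq-if-nonempty: emit the buffer only if it holds something.
        if buff:
            tokens.append("".join(buff))
            buff.clear()

    for c in text:
        if in_comment:
            # A newline ends the comment; every other character is dropped.
            if c == "\n":
                in_comment = False
        elif c == "#":
            # The listing does not emit a pending buffer when a comment starts,
            # so this sketch discards it as well.
            buff.clear()
            in_comment = True
        elif c in WHITESPACE:
            flush()
        elif c in STANDALONE:
            flush()
            tokens.append(c)
        else:
            buff.append(c)

    # Simplified end-of-input handling: emit whatever is still buffered.
    flush()
    return tokens
```

The recursive `match` becomes a loop here, and the `buff` argument becomes a local list, but each branch corresponds to one group of cases in the listing.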
Tokenization by Example
Here is an example source file.
a.b(c de);#comment
These are the tokens that will be returned from this file.
a . b ( c de ) ;
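To see where each token comes from, it can help to watch the buffer while the example is read one character at a time. The Python sketch below does that. The helper `trace` is hypothetical and written only for this illustration. It assumes that the parens are the standalone characters written `\[` and `\]` in the tokenizer, and it does not handle a buffer left over at the end of input, because the example ends inside a comment.

```python
def trace(text):
    """Return (char, tokens emitted at this step, buffer afterwards) per character."""
    steps = []
    buff = ""
    in_comment = False
    for c in text:
        emitted = []
        if in_comment:
            # Stay in the comment until a newline is seen.
            in_comment = c != "\n"
        elif c == "#":
            # A comment starts; any pending buffer is discarded, as in the tokenizer.
            in_comment = True
            buff = ""
        elif c in " \t\n":
            # Whitespace emits the buffer if it is nonempty.
            if buff:
                emitted.append(buff)
            buff = ""
        elif c in "()';.":
            # Standalone characters emit the buffer (if nonempty), then themselves.
            if buff:
                emitted.append(buff)
            emitted.append(c)
            buff = ""
        else:
            buff += c
        steps.append((c, emitted, buff))
    return steps


if __name__ == "__main__":
    for c, emitted, buff in trace("a.b(c de);#comment"):
        print(repr(c), emitted, repr(buff))
```

Reading "d" and then "e" only grows the buffer to "de". The closing paren that follows emits two tokens at once, "de" and ")". Nothing is emitted once the "#" has been read.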
That is all that tokenization does. Not so bad, right?
This is an Excerpt from The Bootstrap Book
The Bootstrap Book is released under the terms of the permissive MIT license. If you quote or copy the book, please cite it. That is all I ask.