MLisp - Handling strings and the tokenizer
In this post I'll go through the changes I needed to make so MLisp can handle strings, and I'll take a look at my extremely simple tokenizer.
Handling strings
To handle strings I needed to update the representation of an expression in my code. I have an enum Expr that represents an expression. In Rust, enums can have values associated with their variants. Thanks to this you can do all sorts of nifty stuff!
Here is what the Expr enum looks like:
#[derive(Debug, Clone)]
pub enum Expr {
    Symbol(String),
    Number(f64),
    String(String),
    List(Vec<Expr>),
    // `Error` stands in here for the interpreter's error type
    Func(String, fn(&[Expr]) -> Result<Expr, Error>),
    Lambda(Vec<Expr>, Vec<Expr>),
}
I added the String(String) variant to the enum, which allows an expression to be of type String and store a String value.
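To make this a bit more concrete, here is roughly how a string can now show up inside an expression (the print symbol is just a name I'm using for illustration, not necessarily something MLisp provides):

// The string literal "hello" on its own becomes a String expression:
let greeting = Expr::String(String::from("hello"));

// A call like (print "hello") becomes a list holding a symbol and a string:
let call = Expr::List(vec![
    Expr::Symbol(String::from("print")),
    Expr::String(String::from("hello")),
]);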
Aside: The beauty of enums and match
This is an aside just about Rust. I think enums combined with the match control flow are a very beautiful thing.
If we have our Expr from above, we can use match to do very cool things:
let e = Expr::Number(9.0);
match e {
    Expr::Number(n) => println!("Expression is a number with value {}", n),
}
The example above creates a Number variant of our Expr enum and stores the value 9 inside. In the match statement, we can check which variant the enum e is. We also temporarily extract the value into the variable n so that we can use it in the arm of the match that is executed.
As Rust is a very strict language, the example above won't compile. After all, we did not explicitly handle all variants of the enum! To fix this, we need to handle all variants, or add a catch-all clause:
let e = Expr::Number(9.0);
match e {
    Expr::Number(n) => println!("Expression is a number with value {}", n),
    Expr::String(s) => println!("Expression is a string with value {}", s),
    _ => println!("Expression is some other type"),
}
Moving on!
Now that we can represent strings and their values in our interpreter, we need to turn source code into expressions. This happens by splitting the source code into tokens and then parsing those tokens into expressions. Later, we can evaluate the expressions to run our program.
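To make the pipeline a bit more concrete, here is roughly what the intermediate stages look like for a tiny program (the tokens are what the tokenizer below produces; the parsed shape is a sketch of how I'd expect the parser to build the list):

// The source "(+ 1 2)" is first split into a flat list of tokens...
let tokens = vec!["(", "+", "1", "2", ")"];

// ...which the parser then turns into a single nested expression, roughly:
let expr = Expr::List(vec![
    Expr::Symbol("+".to_string()),
    Expr::Number(1.0),
    Expr::Number(2.0),
]);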
Tokens
What are tokens? Simply put, tokens are the smallest units of source code. A number is a token, a string is a token, an operator is a token, etc.
Expressions are then created by parsing the tokens. In MLisp, each token currently maps to a single expression. But there might be use cases in the future where two tokens lead to a single expression, or a single token leads to multiple expressions.
I haven't dived into it any further, but one use case I can think of is the Lisp code (first '(1 2 3)). The ' token isn't supported in MLisp currently; we do have the quote function that does the same. So at some point, I might update the token parser to parse the tokens '(1 2 3) into the expressions that match (quote (1 2 3)).
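Purely as a sketch of how that could work on top of the token list (this is just an idea, not current MLisp behaviour), a small pass could rewrite a ' token that precedes a ( into an explicit (quote ...) form:

// Sketch: expand '( ... ) into ( quote ( ... ) ) on the token list.
fn expand_quotes(tokens: &[String]) -> Vec<String> {
    let mut out = Vec::new();
    let mut i = 0;
    while i < tokens.len() {
        if tokens[i] == "'" && i + 1 < tokens.len() && tokens[i + 1] == "(" {
            // Open the quote form and re-open the quoted list.
            out.push("(".to_string());
            out.push("quote".to_string());
            out.push("(".to_string());
            // Copy the quoted list, tracking nesting so we stop at the matching ).
            let mut depth = 1;
            i += 2;
            while i < tokens.len() && depth > 0 {
                if tokens[i] == "(" {
                    depth += 1;
                } else if tokens[i] == ")" {
                    depth -= 1;
                }
                out.push(tokens[i].clone());
                i += 1;
            }
            // Close the (quote ...) form we opened above.
            out.push(")".to_string());
        } else {
            out.push(tokens[i].clone());
            i += 1;
        }
    }
    out
}

With a pass like this, the tokens for '(1 2 3) end up identical to the tokens for (quote (1 2 3)), so the parser never has to know the ' shorthand exists.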
The tokenizer
The tokenizer used in the current version of MLisp is extremely, possibly embarrassingly, simple. I just add extra spaces around some predefined characters and then split on the whitespace.
This is a very simple, somewhat clever approach to creating tokens. The string (+ 1 2) is padded with spaces to become ( + 1 2 ). After splitting on the whitespace I get the tokens (, +, 1, 2 and ).
The Rust code for my tokenizer function:
// Very simple tokenizer
fn tokenize(input: &str) -> Vec<String> {
    input
        .replace("(", " ( ")
        .replace(")", " ) ")
        .replace("\"", " \" ")
        .split_whitespace()
        .map(String::from)
        .collect()
}
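As a quick usage example, here is what the function returns for two small inputs; the expected values follow directly from the replacements above, and print is just an arbitrary symbol name:

fn main() {
    let tokens = tokenize("(+ 1 2)");
    assert_eq!(tokens, vec!["(", "+", "1", "2", ")"]);

    // A string literal is delimited by two separate " tokens, and a literal
    // containing spaces is split into multiple tokens between those quotes.
    let tokens = tokenize("(print \"hi there\")");
    assert_eq!(tokens, vec!["(", "print", "\"", "hi", "there", "\"", ")"]);
}

Note that the tokenizer doesn't treat anything between quotes specially, so a string literal with spaces comes back as several tokens; I assume the parser is responsible for stitching those back together into a single Expr::String.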