Things I learned from Crafting Interpreters

If you've not already heard of Crafting Interpreters, it's a book about writing your very own "full-featured, efficient scripting language". I am currently following along with it and learning a lot about both PL (Programming Languages) and software design.

Because I love the language and want an extra challenge, I decided to write my interpreter in Rust. This meant coming up with my own solutions for how to organize my code and which data structures would be most appropriate. This was most apparent in the later chapters, as more and more pieces of code came to depend on each other and I adjusted the structure accordingly.

Initial implementation

For the first few chapters, I used CodeCrafters to check that my code does what it's supposed to. They use a simple string compare, which meant that at some points I had to write a printing function for the numbers, as the expected output has them end in a ".0". Other than that, it was quite a rewarding way to write code, and my first time really using something like test-driven development: I'm usually either on a really short timeline that wouldn't allow any unnecessary code, or working on code that doesn't have strictly defined behaviour or interfaces.
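
For the number printing, a minimal sketch of such a helper (the function name and exact rules here are my own, not from my actual code) could look like this:

// Hypothetical helper: Rust's default f64 formatting prints 7.0 as "7",
// but the expected output ends in ".0" for whole numbers.
fn format_number(n: f64) -> String {
    if n.fract() == 0.0 {
        format!("{:.1}", n) // "7" becomes "7.0"
    } else {
        n.to_string()
    }
}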

In those first few chapters, I mainly just replaced Java's class inheritance with enums and its visitor pattern with matches or pre-made traits, like Debug and Display.
I also chose a different global structure: each phase of the interpreter is its own struct, instead of everything being linked to a global Interpreter class.
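
To give a rough idea of the visitor replacement (with a made-up, much smaller Expr enum than the real one), the Java boilerplate collapses into a single recursive match:

enum Expr {
    Number(f64),
    Grouping(Box<Expr>),
    Unary(char, Box<Expr>),
}

// what would be a Visitor implementation in Java is just a recursive match
fn print_expr(expr: &Expr) -> String {
    match expr {
        Expr::Number(n) => n.to_string(),
        Expr::Grouping(inner) => format!("(group {})", print_expr(inner)),
        Expr::Unary(op, inner) => format!("({} {})", op, print_expr(inner)),
    }
}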

Starting with the lexer, I added one big TokenType enum for all the Tokens, making use of Rust's sum types to contain the data for the Literals:

enum TokenType {
    ...
    Identifier(String),
    String(String),
    Number(f64),
    ...
}

The tokenize function takes the source code as a String and returns a Vec<Token>. Its body is a big match statement inside a while loop, matching over characters and pushing Tokens into a Vec that is returned at the end of the function. At first I tried to iterate over the lines of the input, but that doesn't allow for multi-line strings, so I went back to the solution from the book and counted lines manually.
Sometimes you really need to loop over the data by hand to get the control you need, even if there is a function in std that almost does what you need.
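
Stripped down to its skeleton (with a placeholder Token struct and assuming a LeftParen variant on the TokenType enum from above; the real loop handles many more cases), it looks something like this:

struct Token {
    typ: TokenType,
    line: usize,
}

fn tokenize(source: &str) -> Vec<Token> {
    let mut tokens = Vec::new();
    let mut line = 1;
    // peekable so two-character tokens like != can look ahead
    let mut chars = source.chars().peekable();
    while let Some(c) = chars.next() {
        match c {
            // counting lines by hand keeps multi-line strings correct
            '\n' => line += 1,
            '(' => tokens.push(Token { typ: TokenType::LeftParen, line }),
            c if c.is_whitespace() => {}
            // ... operators, identifiers, numbers, strings, error reporting ...
            _ => {}
        }
    }
    tokens
}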

To track where errors in the user code occur, I added the SouceCodeRange struct, which saves the position and length within the line, in addition to the line itself:

struct SouceCodeRange {
    line: usize,
    start_column: usize,
    length: usize,
}

In the Parser, the merge function can then be used to keep track of which AST node corresponds to which range in the source code. Unfortunately, this proved somewhat difficult, and I mostly neglected making sure the right ranges get merged. The error printing also currently ignores the start and length.
I hope to add that functionality once I'm done with the basics of the language, though. And in preparation, it's probably good to have the structure in place, even if it's not used yet.
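
For ranges on a single line, such a merge boils down to taking the smallest start and the largest end; this is only a sketch of the idea, not my actual implementation:

impl SouceCodeRange {
    // merge two ranges on the same line into one that covers both
    fn merge(&self, other: &SouceCodeRange) -> SouceCodeRange {
        let start = self.start_column.min(other.start_column);
        let end = (self.start_column + self.length).max(other.start_column + other.length);
        SouceCodeRange {
            line: self.line,
            start_column: start,
            length: end - start,
        }
    }
}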

As the Parser is a recursive descent parser, it lends itself quite nicely to recursive function calls with a &mut self parameter. It is quite rare that I get to write recursive functions in the code I usually work on, but it turned out not to be difficult to wrap my head around at all. All the parser code was quickly ported from Java to Rust with one exception: the mtch function (renamed, because match is a keyword in Rust) takes a Vec<TokenType>, and because I decided to store the literal data inside those variants, I can't just use that function to match on a kind of token anymore. There were several ways to solve this problem:

I could have made the mtch function take a closure and used the matches! macro inside the closure, but this would have added clutter to all calls to mtch.

Another option would have been to use mem::discriminant to check which enum variant should be matched against, but again, this would add clutter to all calls.

Without adding more code to each call, but at the cost of much more, and less maintainable, code, I could have added a parallel enum without data in the variants. There is the possibility of some macro magic here to generate it automatically, but alas, I chose the most straightforward solution:

Just match directly in the few places where I need to match on a literal. This does add more, and in my opinion ugly, code in a few places, but it's a compromise I'm ok with.
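
In practice that means a handful of matches like the following (the Expr enum here is a placeholder for illustration, not my actual AST type):

enum Expr {
    Number(f64),
    Str(String),
    // ...
}

// instead of going through mtch, literal tokens are matched on directly
fn literal_to_expr(token: &TokenType) -> Option<Expr> {
    match token {
        TokenType::Number(n) => Some(Expr::Number(*n)),
        TokenType::String(s) => Some(Expr::Str(s.clone())),
        _ => None,
    }
}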

With the parser working, it was time to interpret the ast! That's the responsibility of the eval module in my implementation. For interpreting, I added the Eval trait:

pub trait Eval {
    fn eval(&self, ctx: &mut EvalCtx) -> ExecResult<Literal>;
}

The idea is to, again recursively, call methods for each AST node. The EvalCtx can then be accessed and modified by each call, and the program can do its magic. Statements, however, don't return a Literal, so they had to implement a different method that does essentially the same thing but doesn't return anything:

pub fn eval(&self, ctx: &mut EvalCtx) -> ExecResult<()>;

Well, "nothing" is not quite right. They do return an ExecResult, which allows for errors to be handled correctly.

Both of those functions are implemented very similarly to the parser: each one matches on the type of Statement/Expression and calls their eval methods until it reaches something to do, like a Literal, which of course just returns itself.
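
Condensed into a self-contained sketch (the Expr, Literal, EvalCtx and ExecResult here are small placeholders, not my real types), the pattern looks like this:

type ExecResult<T> = Result<T, String>; // stand-in for my real error type

#[derive(Clone, Debug)]
enum Literal {
    Number(f64),
    Bool(bool),
}

struct EvalCtx; // the real one holds scopes, break/return state, ...

enum Expr {
    Literal(Literal),
    Negate(Box<Expr>),
}

trait Eval {
    fn eval(&self, ctx: &mut EvalCtx) -> ExecResult<Literal>;
}

impl Eval for Expr {
    fn eval(&self, ctx: &mut EvalCtx) -> ExecResult<Literal> {
        match self {
            // a literal just returns itself
            Expr::Literal(lit) => Ok(lit.clone()),
            // everything else recurses until there is something to do
            Expr::Negate(inner) => match inner.eval(ctx)? {
                Literal::Number(n) => Ok(Literal::Number(-n)),
                other => Err(format!("cannot negate {:?}", other)),
            },
        }
    }
}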

One major point where my implementation deviates from the book is the break, continue and return handling. Each of those statements simply sets a variable to true (or Some(Literal) in the case of return). The "opposing" code then reads and resets them, "opposing" meaning a loop in the case of break and continue, and a function call in the case of return.
The top-level eval for statements then just returns immediately if any of them are set.
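
Concretely, that means a few extra fields on the evaluation context; the names below are made up for illustration and my actual struct differs in the details:

struct EvalCtx {
    // set by a break/continue statement, read and reset by the innermost loop
    break_hit: bool,
    continue_hit: bool,
    // set by return, read and reset by the function call currently executing
    return_value: Option<Literal>,
    // ... variable scopes etc. ...
}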

This is also one of the places I made use of the Rust From trait to convert any literal into a bool.
Now that I think about it, I could have probably written the parser using the TryInto trait... A thing to consider should I ever write another parser.
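
As for the literal-to-bool conversion, here is a sketch with a placeholder Literal and Lox-style truthiness (only nil and false are falsey); my actual implementation may differ in the details:

enum Literal {
    Nil,
    Bool(bool),
    Number(f64),
}

impl From<&Literal> for bool {
    fn from(lit: &Literal) -> bool {
        match lit {
            Literal::Nil => false,
            Literal::Bool(b) => *b,
            _ => true, // numbers, strings, ... are all truthy
        }
    }
}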

Rewrite to make it my own

Now that the basics are working, it's time to do some refactoring on the codebase and change up the language a bit.

The first step is to remove the LexerInstance and just have the tokenize function free-standing, returning a Result<Vec<Token>, ()>. I also used the TryFrom trait to convert from char and &str to TokenType more concisely.
Step two is a quick one: improve the error handling in the main function by using map_err and the ? operator.
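
For the char case, the conversion is roughly the following (assuming LeftParen/RightParen/Plus variants exist on TokenType; my actual variant names and error type may differ):

impl TryFrom<char> for TokenType {
    type Error = ();

    fn try_from(c: char) -> Result<Self, Self::Error> {
        match c {
            '(' => Ok(TokenType::LeftParen),
            ')' => Ok(TokenType::RightParen),
            '+' => Ok(TokenType::Plus),
            // ... the rest of the single-character tokens ...
            _ => Err(()),
        }
    }
}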

In the ast, I was able to reduce the manual Display and TryFrom<&str> implementations by using strum_macros to generate them, changing the "token" with #[strum(serialize = "[token]")]. Unfortunately, as of writing this, version 0.26 of strum has a bug that means braces can't be used as the serialized name of an enum variant, so I had to use 0.25.
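
As an example of the pattern (with a made-up operator enum rather than my real ast types, and needing both strum and strum_macros as dependencies):

use strum_macros::{Display, EnumString};

#[derive(Debug, PartialEq, Display, EnumString)]
enum UnaryOp {
    #[strum(serialize = "-")]
    Minus,
    #[strum(serialize = "!")]
    Bang,
}

fn main() {
    assert_eq!(UnaryOp::Minus.to_string(), "-");        // Display from the serialize string
    assert_eq!("!".parse::<UnaryOp>().unwrap(), UnaryOp::Bang); // and back again via FromStr
}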

All good languages need good tooling, so I added a super basic autoformatter. The only really hard part about formatting, once you have an ast, is the indentation, provided you accept that any user formatting will be overwritten. If you want to preserve user formatting, the entire ast needs to support that, which in practice probably means having separate tokenizer and parser implementations that don't discard whitespace.
For now though, I just wanted to get anything out, so the current ast has to do. At first I tried just using the Display trait, but quickly realized that I needed to keep track of the current indentation level, so I made a new struct which holds a reference to the Statement and the current indent level, and implemented Display on that.

pub struct StmtFormatter<'a> {
    pub stmt: &'a Stmt,
    pub indent_lvl: usize,
}

After writing the Display implementation, I realized that it'd be much easier to just render the inner content into a variable and then write out each of its lines indented:

let result = format!("{}", inner);
for line in result.lines() {
    writeln!(f, "  {}", line)?;
}

Bytecode rewrite

In the book, the second half rewrites the language in C and does so using a different parsing technique. Instead of that, I reused the parser written in Rust to implement the Bytecode interpreter described in part two.

This obviously means that I don't get to learn quite as much about the data structures discussed there, but I feel like I already have a decent grasp of those, and for me this is a learning exercise in writing a language. So the decision to deviate more was made.

The most obvious difference between the implementation in the book and mine is my use of the ustr library, which implements string interning for me. It uses pointers as the inner type, as it stores the interned strings in global storage. This means that my strings are accessed through a pointer-sized type, and not, like in the book, an 8-bit integer.
That has the advantage that I can effectively store an unlimited number of globals, but it uses 4-8 times as many bytes in the bytecode. For now, I think that's fine, but it might be worth investigating whether a custom implementation using a smaller number type is worth it.
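
In case you haven't seen ustr before, this is roughly how it's used: interning hands back a pointer-sized handle, and identical strings share a single entry in the global table:

use ustr::ustr;

fn main() {
    let a = ustr("my_global");
    let b = ustr("my_global");
    assert_eq!(a, b);           // same interned entry, effectively a pointer comparison
    println!("{}", a.as_str()); // still usable like a normal string
}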

I made one mistake that took me about half an hour of looking at the disassembly and stack to debug: I mistakenly popped a value when setting local variables instead of just getting the top of the stack.
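
The fix boils down to peeking instead of popping, since an assignment is an expression whose value might still be needed on the stack. A hypothetical stand-alone version of the corrected logic (my VM does this inline in its opcode loop):

fn set_local<T: Clone>(stack: &mut Vec<T>, slot: usize) {
    // peek at the top of the stack instead of popping it
    let value = stack.last().expect("stack underflow").clone();
    stack[slot] = value;
}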

For function definitions, it's much easier for a scripting language to put them into a separate chunk, because otherwise the interpreter would execute right into the function body when running the assignment to the global variable... Although it should be possible to start the function with a "jump over the function body" instruction, so that execution skips past it and a call jumps to just after that jump. With the current implementation, that would limit the size of functions to ~65k bytes (the jump operand is 16 bits), but that should be fine for now. A proper language would surely support long jumps and therefore be fine.
This is something I'd like to try out once I've finished following along with compiling into separate chunks.

But for now, I'm leaving Lox behind and starting my own language from scratch: sola.
It will (at least for now) use LALRPOP for the lexer and parser, because it is super easy to get running and makes it easy to experiment with new features.