Writing a Scheme interpreter in Rust, Part I: The Lexer

Here is just another blog post, explaining how to write a Scheme interpreter. This is a part of my Rust learning process, so take everything written here with a grain of salt. But at the end of the day, we will have a working Scheme interpreter in Rust. Of course I’ll do it for a subset of Scheme but my main goal will be writing an extensible interpreter. So some of the things may seem overcomplicated for such a small job, but it’s because I’m trying to do this interpreter a future-proof one. Maybe one day, it will cover whole R5RS, who knows?

In this part, I’m just going to talk about the lexer. But before that, if you don’t know how a interpreter works, here is the general idea: There are two main components of an interpreter: The parser and the evaluator. Parsing is a process of turning raw source code into an internal representation. Basically, you take the source code and turn it into some structs and/or enums, in case of Rust.

There are two parts of parsing: Lexical analysis and Syntactic analysis. In lexical analysis, the lexer divides the source code into tokens, smallest meaningful units of a language. Basically, it creates a very basic internal representation of the raw source code. Let’s say you have this source code of written in some made-up language:

string str = "hello world!"

The lexers job is turning this into something like this:

[Symbol "string", Symbol "str", Operator "=", String "hello, world!"]

There is no information of hierarchy here, the lexer just went trough the source code and find the smallest meaningful units then turned them into an internal representation. Think like dividing a sentence into its words.

After lexical analysis, the syntactic analysis starts. It uses the output of lexer to build an Abstract Syntax Tree, which is a true representation of the source code. The parser constructs a hierarchy, which is in fact just a tree. A tree is a best way to represent a program, because you can see where it starts and where it goes on certain conditions, what is the scope of this part of the program etc. A tree can easily represent these kind of information.

Scheme source code is much like a tree, so it is really easy to parse it. While writing a parser for a language like above will be somewhat hard, writing a parser for Scheme is pretty straightforward.

One last note, I’m not going to use any library and just going to use Rust’s standard library. Let’s start.

The Lexer

Let’s see a piece of Scheme source code, to just get some feel of it:

;; This function will print "Fizz" if the input is a multiple of 3,
;;                          "Fuzz" if the input is multiple of 5,
;;                          "FizzBuzz" if input is both multiple of 3 and 5.
(define fizzbuzz
    (lambda (n)
        (cond ((= (modulo n 15) 0) (display "FizzBuzz") (newline))
              ((= (modulo n  3) 0) (display "Fizz")     (newline))
              ((= (modulo n  5) 0) (display "Buzz")     (newline))
              (else (display n) (newline)))))

The thing we are going to do is just looping over all the characters and try to find tokens. To achieve that we need a function named tokenize. But before starting to write tokenize, we should conclude what the tokens are.

#[derive(Debug)]
enum Token {
    LParen,
    RParen,
    Symbol(String),
    Integer(i64),
    Float(f64),
    Boolean(bool),
    Chr(char),
    Str(String),
    //QuoteSugar,
    //UnQuoteSugar
}

Based on these definitions, the fizzbuzz function will be represented as this:

The source code is a string. Best way to loop over all characters is using iterators. Getting an iterator over strings is easy:

let mut iter = input.chars();

Iterators have a lot of handy functions, like take_while() which is a function that takes a predicate and returns set of characters based on that predicate. See this:

let a = [1, 2, 3, 4];
let mut iter = a.into_iter();

let result: Vec<i32> = iter.by_ref()
                           .take_while(|n| **n != 3)
                           .cloned()
                           .collect();

assert_eq!(result, &[1, 2]);

Here you can see that we were able to take the numbers until 3. But there is a problem with take_while, see this:

let result: Vec<i32> = iter.cloned().collect();
assert_eq!(result, &[4]);

As you can see our original iterator only have 4, but it should’ve contain 3 too. The problem is take_while uses next() function of Iterator which consumes the current element. In the last step of iteration, take_while used next() and tested 3 to conclude if the iteration should stop or not. Because of that, it consumed 3. This is not a thing we want.

Instead of using next(), we can use peek() function of Peekable struct. It lets us peek at next char without consuming it. Here is how we can get a peekable iterator:

let mut iter = input.chars().peekable();

But there aren’t any function like take_while that uses peek(). So we need to write one. To make things more compact, I wrapped the function inside a trait and named it take_until() to avoid name clashes, then I wrote an implementation for Peekable:

// I should find a better name for this trait
trait GentleIterator<I: Iterator> {
    fn take_until<F>(&mut self, predicate: F) -> IntoIter<I::Item>
        where F: Fn(&I::Item) -> bool;
}

impl<I: Iterator> GentleIterator<I> for Peekable<I> {
    fn take_until<F>(&mut self, predicate: F) -> IntoIter<I::Item>
        where F: Fn(&I::Item) -> bool {

        let mut v: Vec<I::Item> = vec![];
        while self.peek().map_or(false, &predicate) {
                v.push(self.next().unwrap());
        }

        v.into_iter()
    }
}

I choose to return IntoIter to mimic the behavior of take_while() so that caller can collect() anyway s/he wants. If we re-run the example above with this, we can see that we don’t loose any information:

let a = [1, 2, 3, 4];
let mut iter = a.into_iter().peekable();

let result: Vec<i32> = iter.by_ref()
                           .take_until(|n| **n != 3) // Here we used take_until.
                           .cloned()                 // Thanks to GentleIterator trait
                           .collect();               // we can use it this way.

assert_eq!(result, &[1, 2]);
let result: Vec<i32> = iter.cloned().collect();
assert_eq!(result, &[3, 4]);

Now we are ready to write our lexer:

fn tokenize(input: &str) -> Vec<Token> {
    let mut tokens: Vec<Token> = vec![];
    let mut iter = input.chars().peekable();

    loop {
        let token = match iter.peek() {
            Some(&x) => match x {
                ' ' | '\n' => { 
                    iter.next(); 
                    continue // Simply consume
                }
                '(' => { 
                    iter.next(); 
                    Token::LParen 
                },
                ')' => { 
                    iter.next(); 
                    Token::RParen 
                },
                '"' => {
                    iter.next(); // Consume the opening "
                    let value = iter
                        .take_until(|c| *c != '"')
                        .collect();
                    iter.next(); // Consume the closing "
                    Token::Str(value)
                }
                '#' => match iter.next() {
                    Some('t') => Token::Boolean(true),  // #t means true
                    Some('f') => Token::Boolean(false), // #f means false
                    Some('\\') => {
                        let value = iter.next()
                            .expect("Expected a char, got nothing.");
                        Token::Chr(value)
                    },
                    Some(c) => {
                        // #\a represents char 'a'
                        // #\b represents char 'b'
                        // ...
                        panic!("Expected #t, #f or #\\<char> got: #{}", c)
                    },
                    None => {
                        panic!("Expected something , got nothing: ....")
                    } 
                },
                _ => {
                    if x.is_numeric() {
                        let value = iter
                            .take_until(|c| *c != ' ' && *c != ')')
                            .collect::<String>()
                            .parse::<i64>()
                            .expect("Expected a number, got something else");

                        Token::Integer(value)
                    } else {
                        let value = iter
                            .take_until(|c| *c != ' ' && *c != ')')
                            .collect();

                        Token::Symbol(value)
                    }
                }
            },
            None => break
        };

        tokens.push(token);
    };

    tokens
}

This lexer has some flaws. For instance, it does not handle escape characters and it does not handle quote sugars. I will eventually add them but, implementing them would be a good exercise.

Leave a Reply

Your email address will not be published. Required fields are marked *