Skip to content

Parsing in Rust

Overview

This lecture covers the implementation of a recursive descent parser in Rust. We examine how Rust's enum system with associated data provides a natural way to represent ASTs, how ownership and Box<T> handle tree structure, and how pattern matching simplifies parsing logic compared to C's if-else chains.

Learning Objectives

  • Understand AST representation using Rust enums with associated data
  • Use Box<T> for recursive data structures
  • Implement parsing functions using Rust's ownership model
  • Apply pattern matching for token and node handling
  • Compare Rust and C approaches to parsing

Prerequisites

  • Understanding of parsing concepts (previous lecture)
  • Rust programming basics (enums, structs, ownership)
  • Familiarity with the NTLang scanner interface

1. AST Node Structure in Rust

The ParseNode Enum

In Rust, we use an enum with associated data to represent different node types:

/// Parse tree node types
pub enum ParseNode {
    IntVal { value: i32 },
    Oper1 { oper: Operator, operand: Box<ParseNode> },
    Oper2 { oper: Operator, left: Box<ParseNode>, right: Box<ParseNode> },
}

Each variant directly contains its data - no separate type field needed!

The Operator Enum

#[derive(Debug, Clone, Copy)]
pub enum Operator {
    Plus,
    Minus,
    Mult,
    Div,
}

impl Operator {
    pub fn name(&self) -> &str {
        match self {
            Operator::Plus => "PLUS",
            Operator::Minus => "MINUS",
            Operator::Mult => "MULT",
            Operator::Div => "DIV",
        }
    }
}

Comparison: C vs Rust Node Representation

Aspect C Rust
Type identification enum + switch Pattern matching on enum variant
Data storage union (manual type tracking) Enum variants with associated data
Memory All nodes same size (union) Each variant sized to its data
Type safety Manual (can access wrong union member) Enforced by compiler

2. Why Box<T>?

The Problem with Recursive Types

This doesn't compile:

pub enum ParseNode {
    IntVal { value: i32 },
    Oper2 { left: ParseNode, right: ParseNode },  // ERROR!
}

Error: ParseNode has infinite size - it would contain itself!

The Solution: Heap Allocation with Box

pub enum ParseNode {
    IntVal { value: i32 },
    Oper2 { left: Box<ParseNode>, right: Box<ParseNode> },  // OK!
}

Box<T> is a pointer to heap-allocated data:

Stack:                 Heap:
+-----------+          +------------+
| Oper2     |          | ParseNode  |
|   left ---+--------> | (IntVal)   |
|   right --+--+       +------------+
+-----------+  |
               |       +------------+
               +-----> | ParseNode  |
                       | (IntVal)   |
                       +------------+

Memory Model Comparison

C Rust
All nodes in pre-allocated array Nodes allocated on heap with Box
Pointers to array elements Owned Box<ParseNode>
Manual memory tracking Automatic memory management
No allocator calls Uses allocator (but hidden)

3. Constructor Methods

Creating Nodes

impl ParseNode {
    /// Create a new IntVal node
    pub fn new_intval(value: i32) -> Self {
        ParseNode::IntVal { value }
    }

    /// Create a new Oper2 node
    pub fn new_oper2(oper: Operator, left: ParseNode, right: ParseNode) -> Self {
        ParseNode::Oper2 {
            oper,
            left: Box::new(left),
            right: Box::new(right),
        }
    }
}

The new_oper2 function: 1. Takes ownership of left and right (they're moved) 2. Wraps them in Box::new() to allocate on heap 3. Returns the new Oper2 node


4. Token Interface for Parser

Token and TokenType

Rust uses two separate types for tokens:

/// Token type identifier for matching
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum TokenType {
    IntLit,
    Plus,
    Minus,
    Eot,
}

/// Token enum with associated data
#[derive(Debug, Clone)]
pub enum Token {
    IntLit(String),
    Plus,
    Minus,
    Eot,
}

impl Token {
    /// Returns the token type for matching
    pub fn token_type(&self) -> TokenType {
        match self {
            Token::IntLit(_) => TokenType::IntLit,
            Token::Plus => TokenType::Plus,
            Token::Minus => TokenType::Minus,
            Token::Eot => TokenType::Eot,
        }
    }

    /// Returns the string value of the token
    pub fn value(&self) -> &str {
        match self {
            Token::IntLit(s) => s,
            Token::Plus => "+",
            Token::Minus => "-",
            Token::Eot => "",
        }
    }
}

ScanTable Methods

impl ScanTable {
    /// Get the token at position cur + i
    pub fn get(&self, i: isize) -> &Token {
        let index = (self.cur as isize + i) as usize;
        &self.tokens[index]
    }

    /// Accept the current token if it matches
    pub fn accept(&mut self, expected: TokenType) -> bool {
        if self.cur >= self.tokens.len() {
            return false;
        }

        if self.tokens[self.cur].token_type() == expected {
            self.cur += 1;
            return true;
        }

        false
    }

    /// Accept any token (wildcard)
    pub fn accept_any(&mut self) -> bool {
        if self.cur >= self.tokens.len() {
            return false;
        }
        self.cur += 1;
        true
    }
}

5. Parser Implementation

parse_program

/// Parse a complete program: expression EOT
pub fn parse_program(scan_table: &mut ScanTable) -> ParseNode {
    let expr = parse_expression(scan_table);

    if !scan_table.accept(TokenType::Eot) {
        parse_error("Expecting EOT");
    }

    expr
}

parse_expression

/// Parse an expression: operand (operator operand)*
fn parse_expression(scan_table: &mut ScanTable) -> ParseNode {
    let mut left = parse_operand(scan_table);

    loop {
        let token = scan_table.get(0);
        match token.token_type() {
            TokenType::Plus => {
                scan_table.accept_any();
                let right = parse_operand(scan_table);
                left = ParseNode::new_oper2(Operator::Plus, left, right);
            }
            TokenType::Minus => {
                scan_table.accept_any();
                let right = parse_operand(scan_table);
                left = ParseNode::new_oper2(Operator::Minus, left, right);
            }
            _ => break,
        }
    }

    left
}

Key Points in parse_expression

  1. let mut left: We need to reassign left each iteration
  2. Pattern matching: match token.token_type() is cleaner than if-else chains
  3. Ownership transfer: left is moved into new_oper2, returned value assigned back to left
  4. Wildcard _: Catches all non-operator tokens

parse_operand

/// Parse an operand: intlit
fn parse_operand(scan_table: &mut ScanTable) -> ParseNode {
    if scan_table.accept(TokenType::IntLit) {
        let token = scan_table.get(-1);
        let value: i32 = token.value().parse().unwrap_or_else(|_| {
            parse_error("Invalid integer literal");
        });
        ParseNode::new_intval(value)
    } else {
        parse_error("Bad operand");
    }
}

Extended parse_operand (with unary and parentheses)

fn parse_operand(scan_table: &mut ScanTable) -> ParseNode {
    if scan_table.accept(TokenType::LParen) {
        // Parenthesized expression
        let expr = parse_expression(scan_table);
        if !scan_table.accept(TokenType::RParen) {
            parse_error("Missing right paren");
        }
        expr
    } else if scan_table.accept(TokenType::IntLit) {
        // Integer literal
        let token = scan_table.get(-1);
        let value: i32 = token.value().parse().unwrap_or_else(|_| {
            parse_error("Invalid integer literal");
        });
        ParseNode::new_intval(value)
    } else if scan_table.accept(TokenType::Minus) {
        // Unary minus
        let operand = parse_operand(scan_table);
        ParseNode::Oper1 {
            oper: Operator::Minus,
            operand: Box::new(operand),
        }
    } else {
        parse_error("Bad operand");
    }
}

6. Ownership Transfer in Tree Building

Understanding Move Semantics

When building "1 + 2":

let mut left = parse_operand(scan_table);  // left owns IntVal(1)

// In the loop:
let right = parse_operand(scan_table);     // right owns IntVal(2)
left = ParseNode::new_oper2(Operator::Plus, left, right);
//     ^-- left and right are MOVED into new_oper2
//     The new Oper2 is assigned to left

After the assignment: - Original IntVal(1) and IntVal(2) are now owned by Boxes inside Oper2 - left now refers to the Oper2 node

Ownership Flow Diagram

Before new_oper2:
  left -----> IntVal(1)
  right ----> IntVal(2)

During new_oper2:
  left -----> [moved]
  right ----> [moved]

  new Oper2 {
      left: Box::new(IntVal(1)),   // takes ownership
      right: Box::new(IntVal(2)),  // takes ownership
  }

After assignment:
  left -----> Oper2 {
                left: Box -> IntVal(1)
                right: Box -> IntVal(2)
              }

7. Pattern Matching for Parsing

Matching Token Types

let token = scan_table.get(0);
match token.token_type() {
    TokenType::Plus => { /* handle + */ }
    TokenType::Minus => { /* handle - */ }
    TokenType::Mult => { /* handle * */ }
    _ => break,  // No match, exit loop
}

Matching on ParseNode

fn eval(node: &ParseNode) -> i32 {
    match node {
        ParseNode::IntVal { value } => *value,
        ParseNode::Oper1 { oper, operand } => {
            let val = eval(operand);
            match oper {
                Operator::Minus => -val,
                Operator::Not => !val,
                _ => panic!("Unknown unary operator"),
            }
        }
        ParseNode::Oper2 { oper, left, right } => {
            let l = eval(left);
            let r = eval(right);
            match oper {
                Operator::Plus => l + r,
                Operator::Minus => l - r,
                Operator::Mult => l * r,
                Operator::Div => l / r,
            }
        }
    }
}

8. C vs Rust Comparison

Node Creation

Aspect C Rust
Allocation parse_node_new(pt) Implicit with Box::new()
Type setting np->type = EX_INTVAL Variant determines type
Data access np->intval.value ParseNode::IntVal { value }

Code Comparison

C:

np = parse_node_new(pt);
np->type = EX_OPER2;
np->oper2.oper = OP_PLUS;
np->oper2.left = np1;
np->oper2.right = parse_operand(pt, st);
np1 = np;

Rust:

left = ParseNode::new_oper2(Operator::Plus, left, right);

Key Differences

Feature C Rust
Memory management Manual (parse table) Automatic (Box/ownership)
Type safety Runtime (check type field) Compile-time (enum variants)
Node access Union member access Pattern matching
Child pointers Raw pointers Owned Box<T>
Mutability Always mutable Explicit mut

Key Concepts

Concept Description
enum with data Rust enums can hold different data per variant
Box<T> Heap-allocated pointer for recursive types
Ownership transfer Values moved into function calls
Pattern matching match on enum variants extracts data
&mut Mutable reference (for scan_table)

Practice Problems

Problem 1: Add Multiplication

Extend parse_expression to handle TokenType::Mult. Show the modified code.

Click to reveal solution
fn parse_expression(scan_table: &mut ScanTable) -> ParseNode {
    let mut left = parse_operand(scan_table);

    loop {
        let token = scan_table.get(0);
        match token.token_type() {
            TokenType::Plus => {
                scan_table.accept_any();
                let right = parse_operand(scan_table);
                left = ParseNode::new_oper2(Operator::Plus, left, right);
            }
            TokenType::Minus => {
                scan_table.accept_any();
                let right = parse_operand(scan_table);
                left = ParseNode::new_oper2(Operator::Minus, left, right);
            }
            TokenType::Mult => {  // NEW
                scan_table.accept_any();
                let right = parse_operand(scan_table);
                left = ParseNode::new_oper2(Operator::Mult, left, right);
            }
            _ => break,
        }
    }

    left
}

Problem 2: Trace Ownership

For input "1 + 2", trace the ownership of each ParseNode through parsing.

Click to reveal solution
1. parse_operand() returns:
   -> ParseNode::IntVal { value: 1 }
   Owned by: left variable in parse_expression

2. parse_operand() returns:
   -> ParseNode::IntVal { value: 2 }
   Owned by: right variable (temporary)

3. ParseNode::new_oper2(Plus, left, right)
   -> left is MOVED into new_oper2
   -> right is MOVED into new_oper2
   -> Box::new(left) creates Box owning IntVal(1)
   -> Box::new(right) creates Box owning IntVal(2)
   -> Returns Oper2 { left: Box, right: Box }
   Owned by: left variable (reassigned)

4. parse_expression returns left
   -> Ownership transferred to caller

Final ownership chain:
  returned Oper2
    |-- left: Box owns IntVal(1)
    |-- right: Box owns IntVal(2)

Problem 3: Write new_oper1

Implement a constructor for unary operator nodes.

Click to reveal solution
impl ParseNode {
    /// Create a new Oper1 node
    pub fn new_oper1(oper: Operator, operand: ParseNode) -> Self {
        ParseNode::Oper1 {
            oper,
            operand: Box::new(operand),
        }
    }
}

// Usage:
let neg_five = ParseNode::new_oper1(
    Operator::Minus,
    ParseNode::new_intval(5)
);

Problem 4: Pattern Match Practice

Write a function that returns true if a ParseNode is a leaf node (IntVal).

Click to reveal solution
fn is_leaf(node: &ParseNode) -> bool {
    match node {
        ParseNode::IntVal { .. } => true,
        _ => false,
    }
}

// Or using matches! macro:
fn is_leaf(node: &ParseNode) -> bool {
    matches!(node, ParseNode::IntVal { .. })
}

Summary

  1. Rust enums with associated data replace C's struct+union pattern, providing compile-time type safety.

  2. Box<T> enables recursive data structures by moving node data to the heap.

  3. Ownership transfer happens naturally when passing nodes to constructors - no explicit memory management needed.

  4. Pattern matching with match provides exhaustive handling of all node types and token types.

  5. Constructors like new_oper2 encapsulate Box::new() calls and provide clean APIs.

  6. Mutable references (&mut) are used for the scan table since parsing modifies the cursor position.

  7. The overall parsing logic is identical to C - only the data representation and memory management differ.