Parsing in Rust¶

Overview¶

This lecture covers the implementation of a recursive descent parser in Rust. We examine how Rust's enum system with associated data provides a natural way to represent ASTs, how ownership and Box<T> handle tree structure, and how pattern matching simplifies parsing logic compared to C's if-else chains.

Learning Objectives¶

Understand AST representation using Rust enums with associated data
Use Box<T> for recursive data structures
Implement parsing functions using Rust's ownership model
Apply pattern matching for token and node handling
Compare Rust and C approaches to parsing

Prerequisites¶

Understanding of parsing concepts (previous lecture)
Rust programming basics (enums, structs, ownership)
Familiarity with the NTLang scanner interface

1. AST Node Structure in Rust¶

The ParseNode Enum¶

In Rust, we use an enum with associated data to represent different node types:

/// Parse tree node types
pub enum ParseNode {
    IntVal { value: i32 },
    Oper1 { oper: Operator, operand: Box<ParseNode> },
    Oper2 { oper: Operator, left: Box<ParseNode>, right: Box<ParseNode> },
}

Each variant directly contains its data - no separate type field needed!

The Operator Enum¶

#[derive(Debug, Clone, Copy)]
pub enum Operator {
    Plus,
    Minus,
    Mult,
    Div,
}

impl Operator {
    pub fn name(&self) -> &str {
        match self {
            Operator::Plus => "PLUS",
            Operator::Minus => "MINUS",
            Operator::Mult => "MULT",
            Operator::Div => "DIV",
        }
    }
}

Comparison: C vs Rust Node Representation¶

Aspect	C	Rust
Type identification	`enum` + `switch`	Pattern matching on enum variant
Data storage	`union` (manual type tracking)	Enum variants with associated data
Memory	All nodes same size (union)	Each variant sized to its data
Type safety	Manual (can access wrong union member)	Enforced by compiler

2. Why Box<T>?¶

The Problem with Recursive Types¶

This doesn't compile:

pub enum ParseNode {
    IntVal { value: i32 },
    Oper2 { left: ParseNode, right: ParseNode },  // ERROR!
}

Error: ParseNode has infinite size - it would contain itself!

The Solution: Heap Allocation with Box¶

pub enum ParseNode {
    IntVal { value: i32 },
    Oper2 { left: Box<ParseNode>, right: Box<ParseNode> },  // OK!
}

Box<T> is a pointer to heap-allocated data:

Stack:                 Heap:
+-----------+          +------------+
| Oper2     |          | ParseNode  |
|   left ---+--------> | (IntVal)   |
|   right --+--+       +------------+
+-----------+  |
               |       +------------+
               +-----> | ParseNode  |
                       | (IntVal)   |
                       +------------+

Memory Model Comparison¶

C	Rust
All nodes in pre-allocated array	Nodes allocated on heap with `Box`
Pointers to array elements	Owned `Box<ParseNode>`
Manual memory tracking	Automatic memory management
No allocator calls	Uses allocator (but hidden)

3. Constructor Methods¶

Creating Nodes¶

impl ParseNode {
    /// Create a new IntVal node
    pub fn new_intval(value: i32) -> Self {
        ParseNode::IntVal { value }
    }

    /// Create a new Oper2 node
    pub fn new_oper2(oper: Operator, left: ParseNode, right: ParseNode) -> Self {
        ParseNode::Oper2 {
            oper,
            left: Box::new(left),
            right: Box::new(right),
        }
    }
}

The new_oper2 function: 1. Takes ownership of left and right (they're moved) 2. Wraps them in Box::new() to allocate on heap 3. Returns the new Oper2 node

4. Token Interface for Parser¶

Token and TokenType¶

Rust uses two separate types for tokens:

/// Token type identifier for matching
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum TokenType {
    IntLit,
    Plus,
    Minus,
    Eot,
}

/// Token enum with associated data
#[derive(Debug, Clone)]
pub enum Token {
    IntLit(String),
    Plus,
    Minus,
    Eot,
}

impl Token {
    /// Returns the token type for matching
    pub fn token_type(&self) -> TokenType {
        match self {
            Token::IntLit(_) => TokenType::IntLit,
            Token::Plus => TokenType::Plus,
            Token::Minus => TokenType::Minus,
            Token::Eot => TokenType::Eot,
        }
    }

    /// Returns the string value of the token
    pub fn value(&self) -> &str {
        match self {
            Token::IntLit(s) => s,
            Token::Plus => "+",
            Token::Minus => "-",
            Token::Eot => "",
        }
    }
}

ScanTable Methods¶

impl ScanTable {
    /// Get the token at position cur + i
    pub fn get(&self, i: isize) -> &Token {
        let index = (self.cur as isize + i) as usize;
        &self.tokens[index]
    }

    /// Accept the current token if it matches
    pub fn accept(&mut self, expected: TokenType) -> bool {
        if self.cur >= self.tokens.len() {
            return false;
        }

        if self.tokens[self.cur].token_type() == expected {
            self.cur += 1;
            return true;
        }

        false
    }

    /// Accept any token (wildcard)
    pub fn accept_any(&mut self) -> bool {
        if self.cur >= self.tokens.len() {
            return false;
        }
        self.cur += 1;
        true
    }
}

5. Parser Implementation¶

parse_program¶

/// Parse a complete program: expression EOT
pub fn parse_program(scan_table: &mut ScanTable) -> ParseNode {
    let expr = parse_expression(scan_table);

    if !scan_table.accept(TokenType::Eot) {
        parse_error("Expecting EOT");
    }

    expr
}

parse_expression¶

/// Parse an expression: operand (operator operand)*
fn parse_expression(scan_table: &mut ScanTable) -> ParseNode {
    let mut left = parse_operand(scan_table);

    loop {
        let token = scan_table.get(0);
        match token.token_type() {
            TokenType::Plus => {
                scan_table.accept_any();
                let right = parse_operand(scan_table);
                left = ParseNode::new_oper2(Operator::Plus, left, right);
            }
            TokenType::Minus => {
                scan_table.accept_any();
                let right = parse_operand(scan_table);
                left = ParseNode::new_oper2(Operator::Minus, left, right);
            }
            _ => break,
        }
    }

    left
}

Key Points in parse_expression¶

let mut left: We need to reassign left each iteration
Pattern matching: match token.token_type() is cleaner than if-else chains
Ownership transfer: left is moved into new_oper2, returned value assigned back to left
Wildcard _: Catches all non-operator tokens

parse_operand¶

/// Parse an operand: intlit
fn parse_operand(scan_table: &mut ScanTable) -> ParseNode {
    if scan_table.accept(TokenType::IntLit) {
        let token = scan_table.get(-1);
        let value: i32 = token.value().parse().unwrap_or_else(|_| {
            parse_error("Invalid integer literal");
        });
        ParseNode::new_intval(value)
    } else {
        parse_error("Bad operand");
    }
}

Extended parse_operand (with unary and parentheses)¶

fn parse_operand(scan_table: &mut ScanTable) -> ParseNode {
    if scan_table.accept(TokenType::LParen) {
        // Parenthesized expression
        let expr = parse_expression(scan_table);
        if !scan_table.accept(TokenType::RParen) {
            parse_error("Missing right paren");
        }
        expr
    } else if scan_table.accept(TokenType::IntLit) {
        // Integer literal
        let token = scan_table.get(-1);
        let value: i32 = token.value().parse().unwrap_or_else(|_| {
            parse_error("Invalid integer literal");
        });
        ParseNode::new_intval(value)
    } else if scan_table.accept(TokenType::Minus) {
        // Unary minus
        let operand = parse_operand(scan_table);
        ParseNode::Oper1 {
            oper: Operator::Minus,
            operand: Box::new(operand),
        }
    } else {
        parse_error("Bad operand");
    }
}

6. Ownership Transfer in Tree Building¶

Understanding Move Semantics¶

When building "1 + 2":

let mut left = parse_operand(scan_table);  // left owns IntVal(1)

// In the loop:
let right = parse_operand(scan_table);     // right owns IntVal(2)
left = ParseNode::new_oper2(Operator::Plus, left, right);
//     ^-- left and right are MOVED into new_oper2
//     The new Oper2 is assigned to left

After the assignment: - Original IntVal(1) and IntVal(2) are now owned by Boxes inside Oper2 - left now refers to the Oper2 node

Ownership Flow Diagram¶

Before new_oper2:
  left -----> IntVal(1)
  right ----> IntVal(2)

During new_oper2:
  left -----> [moved]
  right ----> [moved]

  new Oper2 {
      left: Box::new(IntVal(1)),   // takes ownership
      right: Box::new(IntVal(2)),  // takes ownership
  }

After assignment:
  left -----> Oper2 {
                left: Box -> IntVal(1)
                right: Box -> IntVal(2)
              }

7. Pattern Matching for Parsing¶

Matching Token Types¶

let token = scan_table.get(0);
match token.token_type() {
    TokenType::Plus => { /* handle + */ }
    TokenType::Minus => { /* handle - */ }
    TokenType::Mult => { /* handle * */ }
    _ => break,  // No match, exit loop
}

Matching on ParseNode¶

fn eval(node: &ParseNode) -> i32 {
    match node {
        ParseNode::IntVal { value } => *value,
        ParseNode::Oper1 { oper, operand } => {
            let val = eval(operand);
            match oper {
                Operator::Minus => -val,
                Operator::Not => !val,
                _ => panic!("Unknown unary operator"),
            }
        }
        ParseNode::Oper2 { oper, left, right } => {
            let l = eval(left);
            let r = eval(right);
            match oper {
                Operator::Plus => l + r,
                Operator::Minus => l - r,
                Operator::Mult => l * r,
                Operator::Div => l / r,
            }
        }
    }
}

8. C vs Rust Comparison¶

Node Creation¶

Aspect	C	Rust
Allocation	`parse_node_new(pt)`	Implicit with `Box::new()`
Type setting	`np->type = EX_INTVAL`	Variant determines type
Data access	`np->intval.value`	`ParseNode::IntVal { value }`

Code Comparison¶

C:

np = parse_node_new(pt);
np->type = EX_OPER2;
np->oper2.oper = OP_PLUS;
np->oper2.left = np1;
np->oper2.right = parse_operand(pt, st);
np1 = np;

Rust:

left = ParseNode::new_oper2(Operator::Plus, left, right);

Key Differences¶

Feature	C	Rust
Memory management	Manual (parse table)	Automatic (Box/ownership)
Type safety	Runtime (check type field)	Compile-time (enum variants)
Node access	Union member access	Pattern matching
Child pointers	Raw pointers	Owned `Box<T>`
Mutability	Always mutable	Explicit `mut`

Key Concepts¶

Concept	Description
`enum` with data	Rust enums can hold different data per variant
`Box<T>`	Heap-allocated pointer for recursive types
Ownership transfer	Values moved into function calls
Pattern matching	`match` on enum variants extracts data
`&mut`	Mutable reference (for scan_table)

Practice Problems¶

Problem 1: Add Multiplication¶

Extend parse_expression to handle TokenType::Mult. Show the modified code.

Click to reveal solution

fn parse_expression(scan_table: &mut ScanTable) -> ParseNode {
    let mut left = parse_operand(scan_table);

    loop {
        let token = scan_table.get(0);
        match token.token_type() {
            TokenType::Plus => {
                scan_table.accept_any();
                let right = parse_operand(scan_table);
                left = ParseNode::new_oper2(Operator::Plus, left, right);
            }
            TokenType::Minus => {
                scan_table.accept_any();
                let right = parse_operand(scan_table);
                left = ParseNode::new_oper2(Operator::Minus, left, right);
            }
            TokenType::Mult => {  // NEW
                scan_table.accept_any();
                let right = parse_operand(scan_table);
                left = ParseNode::new_oper2(Operator::Mult, left, right);
            }
            _ => break,
        }
    }

    left
}

Problem 2: Trace Ownership¶

For input "1 + 2", trace the ownership of each ParseNode through parsing.

Click to reveal solution

1. parse_operand() returns:
   -> ParseNode::IntVal { value: 1 }
   Owned by: left variable in parse_expression

2. parse_operand() returns:
   -> ParseNode::IntVal { value: 2 }
   Owned by: right variable (temporary)

3. ParseNode::new_oper2(Plus, left, right)
   -> left is MOVED into new_oper2
   -> right is MOVED into new_oper2
   -> Box::new(left) creates Box owning IntVal(1)
   -> Box::new(right) creates Box owning IntVal(2)
   -> Returns Oper2 { left: Box, right: Box }
   Owned by: left variable (reassigned)

4. parse_expression returns left
   -> Ownership transferred to caller

Final ownership chain:
  returned Oper2
    |-- left: Box owns IntVal(1)
    |-- right: Box owns IntVal(2)

Problem 3: Write new_oper1¶

Implement a constructor for unary operator nodes.

Click to reveal solution

impl ParseNode {
    /// Create a new Oper1 node
    pub fn new_oper1(oper: Operator, operand: ParseNode) -> Self {
        ParseNode::Oper1 {
            oper,
            operand: Box::new(operand),
        }
    }
}

// Usage:
let neg_five = ParseNode::new_oper1(
    Operator::Minus,
    ParseNode::new_intval(5)
);

Problem 4: Pattern Match Practice¶

Write a function that returns true if a ParseNode is a leaf node (IntVal).

Click to reveal solution

fn is_leaf(node: &ParseNode) -> bool {
    match node {
        ParseNode::IntVal { .. } => true,
        _ => false,
    }
}

// Or using matches! macro:
fn is_leaf(node: &ParseNode) -> bool {
    matches!(node, ParseNode::IntVal { .. })
}

Summary¶

Rust enums with associated data replace C's struct+union pattern, providing compile-time type safety.
Box<T> enables recursive data structures by moving node data to the heap.
Ownership transfer happens naturally when passing nodes to constructors - no explicit memory management needed.
Pattern matching with match provides exhaustive handling of all node types and token types.
Constructors like new_oper2 encapsulate Box::new() calls and provide clean APIs.
Mutable references (&mut) are used for the scan table since parsing modifies the cursor position.
The overall parsing logic is identical to C - only the data representation and memory management differ.