Parsing in Rust¶
Overview¶
This lecture covers the implementation of a recursive descent parser in Rust. We examine how Rust's enum system with associated data provides a natural way to represent ASTs, how ownership and Box<T> handle tree structure, and how pattern matching simplifies parsing logic compared to C's if-else chains.
Learning Objectives¶
- Understand AST representation using Rust enums with associated data
- Use
Box<T>for recursive data structures - Implement parsing functions using Rust's ownership model
- Apply pattern matching for token and node handling
- Compare Rust and C approaches to parsing
Prerequisites¶
- Understanding of parsing concepts (previous lecture)
- Rust programming basics (enums, structs, ownership)
- Familiarity with the NTLang scanner interface
1. AST Node Structure in Rust¶
The ParseNode Enum¶
In Rust, we use an enum with associated data to represent different node types:
/// Parse tree node types
pub enum ParseNode {
IntVal { value: i32 },
Oper1 { oper: Operator, operand: Box<ParseNode> },
Oper2 { oper: Operator, left: Box<ParseNode>, right: Box<ParseNode> },
}
Each variant directly contains its data - no separate type field needed!
The Operator Enum¶
#[derive(Debug, Clone, Copy)]
pub enum Operator {
Plus,
Minus,
Mult,
Div,
}
impl Operator {
pub fn name(&self) -> &str {
match self {
Operator::Plus => "PLUS",
Operator::Minus => "MINUS",
Operator::Mult => "MULT",
Operator::Div => "DIV",
}
}
}
Comparison: C vs Rust Node Representation¶
| Aspect | C | Rust |
|---|---|---|
| Type identification | enum + switch |
Pattern matching on enum variant |
| Data storage | union (manual type tracking) |
Enum variants with associated data |
| Memory | All nodes same size (union) | Each variant sized to its data |
| Type safety | Manual (can access wrong union member) | Enforced by compiler |
2. Why Box<T>?¶
The Problem with Recursive Types¶
This doesn't compile:
pub enum ParseNode {
IntVal { value: i32 },
Oper2 { left: ParseNode, right: ParseNode }, // ERROR!
}
Error: ParseNode has infinite size - it would contain itself!
The Solution: Heap Allocation with Box¶
pub enum ParseNode {
IntVal { value: i32 },
Oper2 { left: Box<ParseNode>, right: Box<ParseNode> }, // OK!
}
Box<T> is a pointer to heap-allocated data:
Stack: Heap:
+-----------+ +------------+
| Oper2 | | ParseNode |
| left ---+--------> | (IntVal) |
| right --+--+ +------------+
+-----------+ |
| +------------+
+-----> | ParseNode |
| (IntVal) |
+------------+
Memory Model Comparison¶
| C | Rust |
|---|---|
| All nodes in pre-allocated array | Nodes allocated on heap with Box |
| Pointers to array elements | Owned Box<ParseNode> |
| Manual memory tracking | Automatic memory management |
| No allocator calls | Uses allocator (but hidden) |
3. Constructor Methods¶
Creating Nodes¶
impl ParseNode {
/// Create a new IntVal node
pub fn new_intval(value: i32) -> Self {
ParseNode::IntVal { value }
}
/// Create a new Oper2 node
pub fn new_oper2(oper: Operator, left: ParseNode, right: ParseNode) -> Self {
ParseNode::Oper2 {
oper,
left: Box::new(left),
right: Box::new(right),
}
}
}
The new_oper2 function:
1. Takes ownership of left and right (they're moved)
2. Wraps them in Box::new() to allocate on heap
3. Returns the new Oper2 node
4. Token Interface for Parser¶
Token and TokenType¶
Rust uses two separate types for tokens:
/// Token type identifier for matching
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum TokenType {
IntLit,
Plus,
Minus,
Eot,
}
/// Token enum with associated data
#[derive(Debug, Clone)]
pub enum Token {
IntLit(String),
Plus,
Minus,
Eot,
}
impl Token {
/// Returns the token type for matching
pub fn token_type(&self) -> TokenType {
match self {
Token::IntLit(_) => TokenType::IntLit,
Token::Plus => TokenType::Plus,
Token::Minus => TokenType::Minus,
Token::Eot => TokenType::Eot,
}
}
/// Returns the string value of the token
pub fn value(&self) -> &str {
match self {
Token::IntLit(s) => s,
Token::Plus => "+",
Token::Minus => "-",
Token::Eot => "",
}
}
}
ScanTable Methods¶
impl ScanTable {
/// Get the token at position cur + i
pub fn get(&self, i: isize) -> &Token {
let index = (self.cur as isize + i) as usize;
&self.tokens[index]
}
/// Accept the current token if it matches
pub fn accept(&mut self, expected: TokenType) -> bool {
if self.cur >= self.tokens.len() {
return false;
}
if self.tokens[self.cur].token_type() == expected {
self.cur += 1;
return true;
}
false
}
/// Accept any token (wildcard)
pub fn accept_any(&mut self) -> bool {
if self.cur >= self.tokens.len() {
return false;
}
self.cur += 1;
true
}
}
5. Parser Implementation¶
parse_program¶
/// Parse a complete program: expression EOT
pub fn parse_program(scan_table: &mut ScanTable) -> ParseNode {
let expr = parse_expression(scan_table);
if !scan_table.accept(TokenType::Eot) {
parse_error("Expecting EOT");
}
expr
}
parse_expression¶
/// Parse an expression: operand (operator operand)*
fn parse_expression(scan_table: &mut ScanTable) -> ParseNode {
let mut left = parse_operand(scan_table);
loop {
let token = scan_table.get(0);
match token.token_type() {
TokenType::Plus => {
scan_table.accept_any();
let right = parse_operand(scan_table);
left = ParseNode::new_oper2(Operator::Plus, left, right);
}
TokenType::Minus => {
scan_table.accept_any();
let right = parse_operand(scan_table);
left = ParseNode::new_oper2(Operator::Minus, left, right);
}
_ => break,
}
}
left
}
Key Points in parse_expression¶
let mut left: We need to reassignlefteach iteration- Pattern matching:
match token.token_type()is cleaner than if-else chains - Ownership transfer:
leftis moved intonew_oper2, returned value assigned back toleft - Wildcard
_: Catches all non-operator tokens
parse_operand¶
/// Parse an operand: intlit
fn parse_operand(scan_table: &mut ScanTable) -> ParseNode {
if scan_table.accept(TokenType::IntLit) {
let token = scan_table.get(-1);
let value: i32 = token.value().parse().unwrap_or_else(|_| {
parse_error("Invalid integer literal");
});
ParseNode::new_intval(value)
} else {
parse_error("Bad operand");
}
}
Extended parse_operand (with unary and parentheses)¶
fn parse_operand(scan_table: &mut ScanTable) -> ParseNode {
if scan_table.accept(TokenType::LParen) {
// Parenthesized expression
let expr = parse_expression(scan_table);
if !scan_table.accept(TokenType::RParen) {
parse_error("Missing right paren");
}
expr
} else if scan_table.accept(TokenType::IntLit) {
// Integer literal
let token = scan_table.get(-1);
let value: i32 = token.value().parse().unwrap_or_else(|_| {
parse_error("Invalid integer literal");
});
ParseNode::new_intval(value)
} else if scan_table.accept(TokenType::Minus) {
// Unary minus
let operand = parse_operand(scan_table);
ParseNode::Oper1 {
oper: Operator::Minus,
operand: Box::new(operand),
}
} else {
parse_error("Bad operand");
}
}
6. Ownership Transfer in Tree Building¶
Understanding Move Semantics¶
When building "1 + 2":
let mut left = parse_operand(scan_table); // left owns IntVal(1)
// In the loop:
let right = parse_operand(scan_table); // right owns IntVal(2)
left = ParseNode::new_oper2(Operator::Plus, left, right);
// ^-- left and right are MOVED into new_oper2
// The new Oper2 is assigned to left
After the assignment:
- Original IntVal(1) and IntVal(2) are now owned by Boxes inside Oper2
- left now refers to the Oper2 node
Ownership Flow Diagram¶
Before new_oper2:
left -----> IntVal(1)
right ----> IntVal(2)
During new_oper2:
left -----> [moved]
right ----> [moved]
new Oper2 {
left: Box::new(IntVal(1)), // takes ownership
right: Box::new(IntVal(2)), // takes ownership
}
After assignment:
left -----> Oper2 {
left: Box -> IntVal(1)
right: Box -> IntVal(2)
}
7. Pattern Matching for Parsing¶
Matching Token Types¶
let token = scan_table.get(0);
match token.token_type() {
TokenType::Plus => { /* handle + */ }
TokenType::Minus => { /* handle - */ }
TokenType::Mult => { /* handle * */ }
_ => break, // No match, exit loop
}
Matching on ParseNode¶
fn eval(node: &ParseNode) -> i32 {
match node {
ParseNode::IntVal { value } => *value,
ParseNode::Oper1 { oper, operand } => {
let val = eval(operand);
match oper {
Operator::Minus => -val,
Operator::Not => !val,
_ => panic!("Unknown unary operator"),
}
}
ParseNode::Oper2 { oper, left, right } => {
let l = eval(left);
let r = eval(right);
match oper {
Operator::Plus => l + r,
Operator::Minus => l - r,
Operator::Mult => l * r,
Operator::Div => l / r,
}
}
}
}
8. C vs Rust Comparison¶
Node Creation¶
| Aspect | C | Rust |
|---|---|---|
| Allocation | parse_node_new(pt) |
Implicit with Box::new() |
| Type setting | np->type = EX_INTVAL |
Variant determines type |
| Data access | np->intval.value |
ParseNode::IntVal { value } |
Code Comparison¶
C:
np = parse_node_new(pt);
np->type = EX_OPER2;
np->oper2.oper = OP_PLUS;
np->oper2.left = np1;
np->oper2.right = parse_operand(pt, st);
np1 = np;
Rust:
Key Differences¶
| Feature | C | Rust |
|---|---|---|
| Memory management | Manual (parse table) | Automatic (Box/ownership) |
| Type safety | Runtime (check type field) | Compile-time (enum variants) |
| Node access | Union member access | Pattern matching |
| Child pointers | Raw pointers | Owned Box<T> |
| Mutability | Always mutable | Explicit mut |
Key Concepts¶
| Concept | Description |
|---|---|
enum with data |
Rust enums can hold different data per variant |
Box<T> |
Heap-allocated pointer for recursive types |
| Ownership transfer | Values moved into function calls |
| Pattern matching | match on enum variants extracts data |
&mut |
Mutable reference (for scan_table) |
Practice Problems¶
Problem 1: Add Multiplication¶
Extend parse_expression to handle TokenType::Mult. Show the modified code.
Click to reveal solution
fn parse_expression(scan_table: &mut ScanTable) -> ParseNode {
let mut left = parse_operand(scan_table);
loop {
let token = scan_table.get(0);
match token.token_type() {
TokenType::Plus => {
scan_table.accept_any();
let right = parse_operand(scan_table);
left = ParseNode::new_oper2(Operator::Plus, left, right);
}
TokenType::Minus => {
scan_table.accept_any();
let right = parse_operand(scan_table);
left = ParseNode::new_oper2(Operator::Minus, left, right);
}
TokenType::Mult => { // NEW
scan_table.accept_any();
let right = parse_operand(scan_table);
left = ParseNode::new_oper2(Operator::Mult, left, right);
}
_ => break,
}
}
left
}
Problem 2: Trace Ownership¶
For input "1 + 2", trace the ownership of each ParseNode through parsing.
Click to reveal solution
1. parse_operand() returns:
-> ParseNode::IntVal { value: 1 }
Owned by: left variable in parse_expression
2. parse_operand() returns:
-> ParseNode::IntVal { value: 2 }
Owned by: right variable (temporary)
3. ParseNode::new_oper2(Plus, left, right)
-> left is MOVED into new_oper2
-> right is MOVED into new_oper2
-> Box::new(left) creates Box owning IntVal(1)
-> Box::new(right) creates Box owning IntVal(2)
-> Returns Oper2 { left: Box, right: Box }
Owned by: left variable (reassigned)
4. parse_expression returns left
-> Ownership transferred to caller
Final ownership chain:
returned Oper2
|-- left: Box owns IntVal(1)
|-- right: Box owns IntVal(2)
Problem 3: Write new_oper1¶
Implement a constructor for unary operator nodes.
Click to reveal solution
Problem 4: Pattern Match Practice¶
Write a function that returns true if a ParseNode is a leaf node (IntVal).
Click to reveal solution
Summary¶
-
Rust enums with associated data replace C's struct+union pattern, providing compile-time type safety.
-
Box<T>enables recursive data structures by moving node data to the heap. -
Ownership transfer happens naturally when passing nodes to constructors - no explicit memory management needed.
-
Pattern matching with
matchprovides exhaustive handling of all node types and token types. -
Constructors like
new_oper2encapsulateBox::new()calls and provide clean APIs. -
Mutable references (
&mut) are used for the scan table since parsing modifies the cursor position. -
The overall parsing logic is identical to C - only the data representation and memory management differ.