Parsing and Abstract Syntax Trees¶
Overview¶
This lecture covers the second phase of language processing: parsing. After the scanner produces a stream of tokens, the parser analyzes the token sequence according to a grammar and constructs an Abstract Syntax Tree (AST). The AST represents the hierarchical structure of the program and serves as the foundation for interpretation or code generation.
Learning Objectives¶
- Understand the role of parsing in language processing
- Learn different parser types (LL, LR, GLR, PEG) and their trade-offs
- Recognize and eliminate left recursion in grammars
- Read and apply EBNF grammar rules for parsing
- Implement a recursive descent parser that constructs an AST
- Understand the structure and purpose of Abstract Syntax Trees
Prerequisites¶
- Understanding of scanning and token streams (Lecture 3)
- Basic understanding of formal grammars and EBNF notation
- Familiarity with tree data structures
- C or Rust programming experience
1. Language Processing Pipeline¶
Recall the language processing pipeline from scanning:
flowchart LR
A[Input String] --> B[Scanning]
B --> C[Tokens]
C --> D[Parsing]
D --> E[AST]
E --> F["Evaluation / Code Gen"]
F --> G[Result]
style D fill:#f9f,stroke:#333,stroke-width:2px
style E fill:#f9f,stroke:#333,stroke-width:2px
Today's focus: The parsing phase that converts tokens into an AST.
The scanner converts input like "1 + 2 + 3" into tokens:
The parser takes these tokens and builds a tree that captures the expression's structure.
2. From Tokens to Trees - Why Structure Matters¶
Why Not Just Use a Flat Token List?¶
Consider the expression 1 + 2 * 3. Without structure, we might evaluate left-to-right getting 9, but mathematically the answer should be 7 (multiply first).
Trees capture structure explicitly:
Left-to-right evaluation (no precedence):
Mathematical precedence:
In NTLang, we use a simpler approach: all operators have the same precedence and associate left-to-right. Parentheses control grouping explicitly.
NTLang Evaluation Rules¶
| Expression | Interpretation | Result |
|---|---|---|
1 + 2 * 3 |
(1 + 2) * 3 |
9 |
1 + (2 * 3) |
1 + (2 * 3) |
7 |
1 + 2 + 3 |
(1 + 2) + 3 |
6 |
3. Parser Types Overview¶
Parsers transform a sequence of tokens into a structured representation (typically an AST). Different parsing strategies offer various trade-offs:
| Parser Type | Description | Strengths | Limitations |
|---|---|---|---|
| LL | Top-down, left-to-right, leftmost derivation | Simple to implement by hand | Cannot handle left recursion |
| LR | Bottom-up, left-to-right, rightmost derivation | Fast, handles more grammars | Complex to implement manually |
| GLR | Generalized LR | Handles ambiguous grammars | More complex, slower |
| PEG | Parsing Expression Grammar | Deterministic, handles left recursion | Different semantics than CFGs |
For NTLang, we use an LL parser because: - Simple to implement by hand using recursive descent - Each grammar rule becomes a function - Easy to understand and debug
LL(k) Notation¶
The notation LL(k) indicates how many tokens of lookahead are needed:
- LL(1): One token lookahead (most common)
- LL(2): Two tokens lookahead
- LL(k): k tokens lookahead
Our NTLang parser is LL(1) - we only need to look at the current token to decide what to do.
4. EBNF Grammar for NTLang Parsing¶
The parser follows this grammar:
program ::= expression EOT
expression ::= operand (operator operand)*
operand ::= intlit
| hexlit
| binlit
| '-' operand
| '~' operand
| '(' expression ')'
operator ::= '+' | '-' | '*' | '/' | '>>' | '<<' | '&' | '|' | '^' | '>-'
Understanding the Grammar¶
- program: A complete NTLang program is an expression followed by end-of-text
- expression: An operand optionally followed by any number of (operator, operand) pairs
- operand: Can be a literal, a unary operation, or a parenthesized expression
The key insight is that (operator operand)* means "zero or more" pairs of operator and operand.
Grammar to Code Mapping¶
| Grammar Element | Code Construct |
|---|---|
| Rule name | Function name |
::= |
Function body |
A B (sequence) |
Call A then B |
A \| B (alternative) |
if/else or match |
(A)* (repetition) |
while loop |
[A] (optional) |
if statement |
| Terminal (token) | accept(TOKEN) |
5. Recursive Descent Parsing¶
Recursive descent is a top-down parsing technique where:
- One function per grammar rule - Each non-terminal in the grammar becomes a function
- Functions call each other - Following the grammar structure
- Recursion handles nesting - Parenthesized expressions call back to
parse_expression
The Accept/Get Pattern¶
Two key operations for consuming tokens:
| Operation | Purpose | Returns |
|---|---|---|
accept(TOKEN) |
Check if current token matches; if so, consume it | true/false |
get(i) |
Look at token at position cur + i without consuming |
Token reference |
Common pattern:
if (accept(TK_PLUS)) {
// Token was TK_PLUS and has been consumed
tp = get(-1); // Get the token we just consumed
}
Parsing Function Structure¶
Each parsing function follows this pattern:
parse_X:
1. Check what token we're looking at
2. Based on the token, decide which alternative to take
3. Consume expected tokens
4. Recursively parse sub-expressions
5. Build and return AST node
6. The Left Recursion Problem¶
What is Left Recursion?¶
Left recursion occurs when a grammar rule references itself as its leftmost symbol:
This grammar correctly describes expressions like 1 + 2 + 3, but causes infinite recursion in a top-down parser:
void parse_expression() {
parse_expression(); // Infinite recursion!
accept(TK_PLUS);
parse_operand();
}
The parser calls itself immediately without consuming any input, causing a stack overflow.
Eliminating Left Recursion¶
We rewrite left-recursive grammars using iteration (the Kleene star *):
Original (left-recursive):
Transformed (iteration):
Both grammars accept the same language, but the second can be parsed top-down:
parse_node_st *parse_expression(...) {
np = parse_operand(...); // Parse first operand
while (is_operator(current_token)) {
accept(operator);
right = parse_operand(...);
np = create_oper2_node(op, np, right);
}
return np;
}
7. Abstract Syntax Trees¶
What is an AST?¶
An Abstract Syntax Tree (AST) is a tree representation of the syntactic structure of source code. Unlike a parse tree, it:
- Abstracts away syntactic details (parentheses, keywords)
- Preserves essential structure (operations, operands)
- Enables interpretation or code generation
AST Node Types in NTLang¶
| Node Type | Description | Children |
|---|---|---|
INTVAL |
Integer literal (leaf node) | None |
OPER1 |
Unary operator | 1 child (operand) |
OPER2 |
Binary operator | 2 children (left, right) |
AST Examples¶
Expression: 1 + 2
Expression: 1 + 2 + 3 (left-associative)
Expression: (1 + 2) * 3
Expression: -5 (unary minus)
Building Left-Associative Trees¶
The key to left-associativity is making the previous result the left child of each new operator:
flowchart TD
subgraph Step1["Step 1: Parse '1'"]
A1["np = (1)"]
end
subgraph Step2["Step 2: See '+', parse '2'"]
A2["np1 = (+)"]
B2["left = (1)"]
C2["right = (2)"]
A2 --> B2
A2 --> C2
D2["np = np1"]
end
subgraph Step3["Step 3: See '+', parse '3'"]
A3["np1 = (+)"]
B3["left = prev (+)"]
C3["right = (3)"]
A3 --> B3
A3 --> C3
end
Step1 --> Step2 --> Step3
Key Concepts¶
| Concept | Definition | Example |
|---|---|---|
| Parsing | Converting tokens to structured representation | Tokens → AST |
| AST | Abstract Syntax Tree - hierarchical program representation | Tree with operator nodes |
| Recursive Descent | Parsing technique with one function per grammar rule | parse_expression(), parse_operand() |
| Left Associativity | Operators group from left to right | 1+2+3 = (1+2)+3 |
| Left Recursion | Grammar rule that calls itself first | E ::= E '+' T |
| Unary Operator | Operator with one operand | -x, ~x |
| Binary Operator | Operator with two operands | x + y |
| Lookahead | Examining tokens without consuming them | get(0) |
| Accept | Consuming a token if it matches | accept(TK_PLUS) |
Practice Problems¶
Problem 1: Draw the AST¶
Draw the AST for the expression "(1 + 2) * 3".
Click to reveal solution
Problem 2: Left-Associative Tree¶
Draw the AST for "1 - 2 - 3" and explain why it evaluates to -4, not 2.
Click to reveal solution
Problem 3: Trace the Parser¶
For input "-5", trace through the parsing functions called.
Click to reveal solution
1. parse_program() called
2. parse_expression() called
3. parse_operand() called
4. accept(TK_INTLIT) -> false
5. accept(TK_MINUS) -> true (consumes -)
6. Create np1: OPER1, oper=MINUS
7. parse_operand() called recursively
8. accept(TK_INTLIT) -> true (consumes 5)
9. Create np: INTVAL, value=5
10. Return np
11. np1.operand = np
12. Return np1
13. No operator found, return np (the MINUS node)
14. accept(TK_EOT) -> true
15. Return tree
Final tree:
(-) <- OPER1
|
(5) <- INTVAL
Problem 4: Node Count¶
For expression "1 + 2 + 3 + 4", how many nodes are in the AST?
Click to reveal solution
**7 nodes total:** - 4 INTVAL nodes (1, 2, 3, 4) - 3 OPER2 nodes (three + operators) **Tree structure:** This represents `((1 + 2) + 3) + 4`.Problem 5: Grammar Transformation¶
Transform this left-recursive grammar into one suitable for recursive descent:
Click to reveal solution
**Transformed grammar:** **Implementation pattern:**Further Reading¶
- Crafting Interpreters - Parsing Expressions
- Recursive Descent Parsing (Wikipedia)
- Abstract Syntax Trees (Wikipedia)
- Dragon Book Chapter 4: Syntax Analysis
Summary¶
-
The parser converts tokens into an AST that represents the hierarchical structure of the program.
-
LL parsers are simple to implement by hand but cannot handle left-recursive grammars directly.
-
Recursive descent parsing uses one function per grammar rule, with functions calling each other recursively to build the tree.
-
Left recursion is eliminated by transforming the grammar to use iteration (
*instead of recursion). -
The AST has three node types:
INTVALfor values,OPER1for unary operators, andOPER2for binary operators. -
Left-associative tree construction means
1 + 2 + 3becomes(1 + 2) + 3, built by always making the previous result the left child. -
Parentheses create subtrees by recursively calling
parse_expression(), ensuring proper grouping regardless of the surrounding context.