Parsing and Abstract Syntax Trees¶

Overview¶

This lecture covers the second phase of language processing: parsing. After the scanner produces a stream of tokens, the parser analyzes the token sequence according to a grammar and constructs an Abstract Syntax Tree (AST). The AST represents the hierarchical structure of the program and serves as the foundation for interpretation or code generation.

Learning Objectives¶

Understand the role of parsing in language processing
Learn different parser types (LL, LR, GLR, PEG) and their trade-offs
Recognize and eliminate left recursion in grammars
Read and apply EBNF grammar rules for parsing
Implement a recursive descent parser that constructs an AST
Understand the structure and purpose of Abstract Syntax Trees

Prerequisites¶

Understanding of scanning and token streams (Lecture 3)
Basic understanding of formal grammars and EBNF notation
Familiarity with tree data structures
C or Rust programming experience

1. Language Processing Pipeline¶

Recall the language processing pipeline from scanning:

flowchart LR
    A[Input String] --> B[Scanning]
    B --> C[Tokens]
    C --> D[Parsing]
    D --> E[AST]
    E --> F["Evaluation / Code Gen"]
    F --> G[Result]

    style D fill:#f9f,stroke:#333,stroke-width:2px
    style E fill:#f9f,stroke:#333,stroke-width:2px

Today's focus: The parsing phase that converts tokens into an AST.

The scanner converts input like "1 + 2 + 3" into tokens:

TK_INTLIT("1") | TK_PLUS("+") | TK_INTLIT("2") | TK_PLUS("+") | TK_INTLIT("3") | TK_EOT

The parser takes these tokens and builds a tree that captures the expression's structure.

2. From Tokens to Trees - Why Structure Matters¶

Why Not Just Use a Flat Token List?¶

Consider the expression 1 + 2 * 3. Without structure, we might evaluate left-to-right getting 9, but mathematically the answer should be 7 (multiply first).

Trees capture structure explicitly:

Left-to-right evaluation (no precedence):

      *
     / \
    +   3
   / \
  1   2
Result: (1 + 2) * 3 = 9

Mathematical precedence:

      +
     / \
    1   *
       / \
      2   3
Result: 1 + (2 * 3) = 7

In NTLang, we use a simpler approach: all operators have the same precedence and associate left-to-right. Parentheses control grouping explicitly.

NTLang Evaluation Rules¶

Expression	Interpretation	Result
`1 + 2 * 3`	`(1 + 2) * 3`	9
`1 + (2 * 3)`	`1 + (2 * 3)`	7
`1 + 2 + 3`	`(1 + 2) + 3`	6

3. Parser Types Overview¶

Parsers transform a sequence of tokens into a structured representation (typically an AST). Different parsing strategies offer various trade-offs:

Parser Type	Description	Strengths	Limitations
LL	Top-down, left-to-right, leftmost derivation	Simple to implement by hand	Cannot handle left recursion
LR	Bottom-up, left-to-right, rightmost derivation	Fast, handles more grammars	Complex to implement manually
GLR	Generalized LR	Handles ambiguous grammars	More complex, slower
PEG	Parsing Expression Grammar	Deterministic, handles left recursion	Different semantics than CFGs

For NTLang, we use an LL parser because: - Simple to implement by hand using recursive descent - Each grammar rule becomes a function - Easy to understand and debug

LL(k) Notation¶

The notation LL(k) indicates how many tokens of lookahead are needed:

LL(1): One token lookahead (most common)
LL(2): Two tokens lookahead
LL(k): k tokens lookahead

Our NTLang parser is LL(1) - we only need to look at the current token to decide what to do.

4. EBNF Grammar for NTLang Parsing¶

The parser follows this grammar:

program    ::= expression EOT

expression ::= operand (operator operand)*

operand    ::= intlit
             | hexlit
             | binlit
             | '-' operand
             | '~' operand
             | '(' expression ')'

operator   ::= '+' | '-' | '*' | '/' | '>>' | '<<' | '&' | '|' | '^' | '>-'

Understanding the Grammar¶

program: A complete NTLang program is an expression followed by end-of-text
expression: An operand optionally followed by any number of (operator, operand) pairs
operand: Can be a literal, a unary operation, or a parenthesized expression

The key insight is that (operator operand)* means "zero or more" pairs of operator and operand.

Grammar to Code Mapping¶

Grammar Element	Code Construct
Rule name	Function name
`::=`	Function body
`A B` (sequence)	Call A then B
`A \\| B` (alternative)	if/else or match
`(A)*` (repetition)	while loop
`[A]` (optional)	if statement
Terminal (token)	`accept(TOKEN)`

5. Recursive Descent Parsing¶

Recursive descent is a top-down parsing technique where:

One function per grammar rule - Each non-terminal in the grammar becomes a function
Functions call each other - Following the grammar structure
Recursion handles nesting - Parenthesized expressions call back to parse_expression

The Accept/Get Pattern¶

Two key operations for consuming tokens:

Operation	Purpose	Returns
`accept(TOKEN)`	Check if current token matches; if so, consume it	`true`/`false`
`get(i)`	Look at token at position `cur + i` without consuming	Token reference

Common pattern:

if (accept(TK_PLUS)) {
    // Token was TK_PLUS and has been consumed
    tp = get(-1);  // Get the token we just consumed
}

Parsing Function Structure¶

Each parsing function follows this pattern:

parse_X:
    1. Check what token we're looking at
    2. Based on the token, decide which alternative to take
    3. Consume expected tokens
    4. Recursively parse sub-expressions
    5. Build and return AST node

6. The Left Recursion Problem¶

What is Left Recursion?¶

Left recursion occurs when a grammar rule references itself as its leftmost symbol:

expression ::= expression '+' operand
             | operand

This grammar correctly describes expressions like 1 + 2 + 3, but causes infinite recursion in a top-down parser:

void parse_expression() {
    parse_expression();  // Infinite recursion!
    accept(TK_PLUS);
    parse_operand();
}

The parser calls itself immediately without consuming any input, causing a stack overflow.

Eliminating Left Recursion¶

We rewrite left-recursive grammars using iteration (the Kleene star *):

Original (left-recursive):

expression ::= expression operator operand
             | operand

Transformed (iteration):

expression ::= operand (operator operand)*

Both grammars accept the same language, but the second can be parsed top-down:

parse_node_st *parse_expression(...) {
    np = parse_operand(...);        // Parse first operand

    while (is_operator(current_token)) {
        accept(operator);
        right = parse_operand(...);
        np = create_oper2_node(op, np, right);
    }

    return np;
}

7. Abstract Syntax Trees¶

What is an AST?¶

An Abstract Syntax Tree (AST) is a tree representation of the syntactic structure of source code. Unlike a parse tree, it:

Abstracts away syntactic details (parentheses, keywords)
Preserves essential structure (operations, operands)
Enables interpretation or code generation

AST Node Types in NTLang¶

Node Type	Description	Children
`INTVAL`	Integer literal (leaf node)	None
`OPER1`	Unary operator	1 child (operand)
`OPER2`	Binary operator	2 children (left, right)

AST Examples¶

Expression: 1 + 2

         (+)          <- OPER2, oper=PLUS
        /   \
      (1)   (2)       <- INTVAL nodes

Expression: 1 + 2 + 3 (left-associative)

            (+)           <- OPER2
           /   \
         (+)   (3)        <- OPER2, INTVAL
        /   \
      (1)   (2)           <- INTVAL nodes

Evaluates as: (1 + 2) + 3 = 6

Expression: (1 + 2) * 3

            (*)
           /   \
         (+)   (3)
        /   \
      (1)   (2)

Evaluates as: (1 + 2) * 3 = 9

Expression: -5 (unary minus)

         (-)          <- OPER1, oper=MINUS
          |
         (5)          <- INTVAL

Building Left-Associative Trees¶

The key to left-associativity is making the previous result the left child of each new operator:

flowchart TD
    subgraph Step1["Step 1: Parse '1'"]
        A1["np = (1)"]
    end

    subgraph Step2["Step 2: See '+', parse '2'"]
        A2["np1 = (+)"]
        B2["left = (1)"]
        C2["right = (2)"]
        A2 --> B2
        A2 --> C2
        D2["np = np1"]
    end

    subgraph Step3["Step 3: See '+', parse '3'"]
        A3["np1 = (+)"]
        B3["left = prev (+)"]
        C3["right = (3)"]
        A3 --> B3
        A3 --> C3
    end

    Step1 --> Step2 --> Step3

Key Concepts¶

Concept	Definition	Example
Parsing	Converting tokens to structured representation	Tokens → AST
AST	Abstract Syntax Tree - hierarchical program representation	Tree with operator nodes
Recursive Descent	Parsing technique with one function per grammar rule	`parse_expression()`, `parse_operand()`
Left Associativity	Operators group from left to right	`1+2+3` = `(1+2)+3`
Left Recursion	Grammar rule that calls itself first	`E ::= E '+' T`
Unary Operator	Operator with one operand	`-x`, `~x`
Binary Operator	Operator with two operands	`x + y`
Lookahead	Examining tokens without consuming them	`get(0)`
Accept	Consuming a token if it matches	`accept(TK_PLUS)`

Practice Problems¶

Problem 1: Draw the AST¶

Draw the AST for the expression "(1 + 2) * 3".

Click to reveal solution

        (*)
       /   \
     (+)   (3)
    /   \
  (1)   (2)

Node types:
- Root: OPER2, oper=MULT
  - Left child: OPER2, oper=PLUS
    - Left: INTVAL, value=1
    - Right: INTVAL, value=2
  - Right child: INTVAL, value=3

Problem 2: Left-Associative Tree¶

Draw the AST for "1 - 2 - 3" and explain why it evaluates to -4, not 2.

Click to reveal solution

            (-)
           /   \
         (-)   (3)
        /   \
      (1)   (2)

Evaluation (left-associative):
- Inner subtraction: 1 - 2 = -1
- Outer subtraction: -1 - 3 = -4

If it were right-associative (1 - (2 - 3)):
- Inner: 2 - 3 = -1
- Outer: 1 - (-1) = 2

The tree structure determines the evaluation order!

Problem 3: Trace the Parser¶

For input "-5", trace through the parsing functions called.

Click to reveal solution

1. parse_program() called
2.   parse_expression() called
3.     parse_operand() called
4.       accept(TK_INTLIT) -> false
5.       accept(TK_MINUS) -> true (consumes -)
6.       Create np1: OPER1, oper=MINUS
7.       parse_operand() called recursively
8.         accept(TK_INTLIT) -> true (consumes 5)
9.         Create np: INTVAL, value=5
10.        Return np
11.      np1.operand = np
12.      Return np1
13.   No operator found, return np (the MINUS node)
14. accept(TK_EOT) -> true
15. Return tree

Final tree:
    (-)         <- OPER1
     |
    (5)         <- INTVAL

Problem 4: Node Count¶

For expression "1 + 2 + 3 + 4", how many nodes are in the AST?

Click to reveal solution

**7 nodes total:** - 4 INTVAL nodes (1, 2, 3, 4) - 3 OPER2 nodes (three + operators) **Tree structure:**

                (+)         <- node 7, root
               /   \
             (+)   (4)      <- nodes 6, 4
            /   \
          (+)   (3)         <- nodes 5, 3
         /   \
       (1)   (2)            <- nodes 1, 2

This represents `((1 + 2) + 3) + 4`.

Problem 5: Grammar Transformation¶

Transform this left-recursive grammar into one suitable for recursive descent:

term ::= term '*' factor
       | factor

Click to reveal solution

**Transformed grammar:**

term ::= factor ('*' factor)*

**Implementation pattern:**

parse_node_st *parse_term(...) {
    np = parse_factor(...);

    while (accept(TK_MULT)) {
        right = parse_factor(...);
        np = create_oper2(MULT, np, right);
    }

    return np;
}

Summary¶

The parser converts tokens into an AST that represents the hierarchical structure of the program.
LL parsers are simple to implement by hand but cannot handle left-recursive grammars directly.
Recursive descent parsing uses one function per grammar rule, with functions calling each other recursively to build the tree.
Left recursion is eliminated by transforming the grammar to use iteration (* instead of recursion).
The AST has three node types: INTVAL for values, OPER1 for unary operators, and OPER2 for binary operators.
Left-associative tree construction means 1 + 2 + 3 becomes (1 + 2) + 3, built by always making the previous result the left child.
Parentheses create subtrees by recursively calling parse_expression(), ensuring proper grouping regardless of the surrounding context.

Parsing and Abstract Syntax Trees¶

Overview¶

Learning Objectives¶

Prerequisites¶

1. Language Processing Pipeline¶

2. From Tokens to Trees - Why Structure Matters¶

Why Not Just Use a Flat Token List?¶

NTLang Evaluation Rules¶

3. Parser Types Overview¶

LL(k) Notation¶

4. EBNF Grammar for NTLang Parsing¶

Understanding the Grammar¶

Grammar to Code Mapping¶

5. Recursive Descent Parsing¶

The Accept/Get Pattern¶

Parsing Function Structure¶

6. The Left Recursion Problem¶

What is Left Recursion?¶

Eliminating Left Recursion¶

7. Abstract Syntax Trees¶

What is an AST?¶

AST Node Types in NTLang¶

AST Examples¶

Building Left-Associative Trees¶

Key Concepts¶

Practice Problems¶

Problem 1: Draw the AST¶

Problem 2: Left-Associative Tree¶

Problem 3: Trace the Parser¶

Problem 4: Node Count¶

Problem 5: Grammar Transformation¶

Further Reading¶

Summary¶