Parsing in C¶

Overview¶

This lecture covers the implementation of a recursive descent parser in C. We examine how to represent Abstract Syntax Trees using C's struct and union constructs, manage memory efficiently using a parse table, and implement parsing functions that follow the grammar rules.

Learning Objectives¶

Understand AST node representation using structs and unions in C
Implement a memory-efficient parse table for node allocation
Use the scanner's accept/get interface for token consumption
Build a recursive descent parser following the EBNF grammar
Construct left-associative expression trees

Prerequisites¶

Understanding of parsing concepts (previous lecture)
C programming with structs, pointers, and unions
Familiarity with the NTLang scanner interface

1. AST Node Structure in C¶

Node Type Enumeration¶

The AST has three types of nodes, identified by an enumeration:

enum parse_expr_enum {
    EX_INTVAL,  /* Integer value (leaf node) */
    EX_OPER1,   /* Unary operator (one child) */
    EX_OPER2    /* Binary operator (two children) */
};

Operator Enumeration¶

Operators are also identified by an enumeration:

enum parse_oper_enum {
    OP_PLUS,   /* + */
    OP_MINUS,  /* - */
    OP_MULT,   /* * */
    OP_DIV     /* / */
    /* Add more operators as needed */
};

The parse_node_st Structure¶

Each AST node uses a union to store type-specific data:

struct parse_node_st {
    enum parse_expr_enum type;  /* Node type */
    union {
        /* For EX_INTVAL: stores the integer value */
        struct {
            int value;
        } intval;

        /* For EX_OPER1: unary operator with one child */
        struct {
            enum parse_oper_enum oper;
            struct parse_node_st *operand;
        } oper1;

        /* For EX_OPER2: binary operator with two children */
        struct {
            enum parse_oper_enum oper;
            struct parse_node_st *left;
            struct parse_node_st *right;
        } oper2;
    };
};

Why Use a Union?¶

A union allows different "views" of the same memory:

Node Type	Active Union Member	Contents
`EX_INTVAL`	`intval`	Integer value
`EX_OPER1`	`oper1`	Operator + 1 child pointer
`EX_OPER2`	`oper2`	Operator + 2 child pointers

Memory efficiency: All nodes are the same size (the size of the largest union member), but we only use the relevant fields for each node type.

2. Memory Management with parse_table_st¶

The Parse Table Structure¶

Instead of using malloc() for each node, we pre-allocate an array:

#define PARSE_TABLE_LEN 1024

struct parse_table_st {
    struct parse_node_st table[PARSE_TABLE_LEN];  /* Node storage */
    int len;  /* Number of nodes allocated */
};

Table Initialization¶

void parse_table_init(struct parse_table_st *pt) {
    pt->len = 0;
}

Allocating New Nodes¶

struct parse_node_st * parse_node_new(struct parse_table_st *pt) {
    struct parse_node_st *np;

    np = &(pt->table[pt->len]);
    pt->len += 1;

    return np;
}

This function: 1. Gets a pointer to the next available slot in the array 2. Increments the length counter 3. Returns the pointer to the caller

Memory Layout Visualization¶

For parsing "1 + 2":

parse_table_st:
+------------------+
| table[0]         |  <- INTVAL, value=1
|   type: EX_INTVAL|
|   intval.value: 1|
+------------------+
| table[1]         |  <- OPER2, PLUS
|   type: EX_OPER2 |
|   oper2.oper: +  |
|   oper2.left ----+--> table[0]
|   oper2.right ---+--> table[1]
+------------------+
| table[2]         |  <- INTVAL, value=2
|   type: EX_INTVAL|
|   intval.value: 2|
+------------------+
| len = 3          |
+------------------+

Benefits of This Approach¶

Benefit	Explanation
No heap allocation	All nodes on stack (in `main`)
No memory leaks	No `free()` needed
Cache-friendly	Nodes are contiguous in memory
Fast allocation	Just increment a counter

3. Token Interface for Parser¶

scan_table_get¶

Get a token at a relative position without consuming it:

struct scan_token_st * scan_table_get(struct scan_table_st *st, int i) {
    return &(st->table[st->cur + i]);
}

Call	Returns
`get(0)`	Current token (at position `cur`)
`get(-1)`	Previous token (just consumed)
`get(1)`	Next token (lookahead)

scan_table_accept¶

Accept a token if it matches the expected type:

bool scan_table_accept(struct scan_table_st *st,
                       enum scan_token_enum tk_expected) {
    struct scan_token_st *tp;

    /* TK_ANY is a wildcard - always matches */
    if (tk_expected == TK_ANY) {
        st->cur += 1;
        return true;
    }

    tp = scan_table_get(st, 0);

    if (tp->id == tk_expected) {
        st->cur += 1;
        return true;
    }

    return false;
}

The TK_ANY Wildcard¶

TK_ANY is useful when you've already checked the token type:

tp = scan_table_get(st, 0);      /* Peek at current token */
if (tp->id == TK_PLUS) {
    scan_table_accept(st, TK_ANY);  /* Consume it (we know it's PLUS) */
    /* ... */
}

4. Parser Implementation¶

parse_program¶

The top-level parsing function:

struct parse_node_st * parse_program(struct parse_table_st *pt,
                                     struct scan_table_st *st) {
    struct parse_node_st *np1;

    /* A program is a single expression followed by EOT */
    np1 = parse_expression(pt, st);

    if (!scan_table_accept(st, TK_EOT)) {
        parse_error("Expecting EOT");
    }

    return np1;
}

parse_expression¶

Implements the grammar rule expression ::= operand (operator operand)*:

struct parse_node_st * parse_expression(struct parse_table_st *pt,
                                        struct scan_table_st *st) {
    struct scan_token_st *tp;
    struct parse_node_st *np1, *np2;

    /* An expression must start with an operand */
    np1 = parse_operand(pt, st);

    while (true) {
        tp = scan_table_get(st, 0);  /* Peek at current token */

        if (tp->id == TK_PLUS) {
            /* Found a plus operator */
            scan_table_accept(st, TK_ANY);  /* Consume the operator */

            /* Create a new OPER2 node */
            np2 = parse_node_new(pt);
            np2->type = EX_OPER2;
            np2->oper2.oper = OP_PLUS;
            np2->oper2.left = np1;          /* Previous result is left child */
            np2->oper2.right = parse_operand(pt, st);  /* Parse right operand */

            np1 = np2;  /* New node becomes current result */
        } else {
            break;  /* No more operators */
        }
    }

    return np1;
}

Key Points in parse_expression¶

Start with an operand: np1 = parse_operand(pt, st)
Loop while operators found: while (true) { ... if not operator, break }
Build left-associative tree: np2->oper2.left = np1 (previous is left child)
Update current node: np1 = np2 (new node becomes current)

parse_operand¶

Handles literals, unary operators, and parentheses:

struct parse_node_st * parse_operand(struct parse_table_st *pt,
                                     struct scan_table_st *st) {
    struct scan_token_st *tp;
    struct parse_node_st *np1;

    if (scan_table_accept(st, TK_INTLIT)) {
        /* Integer literal */
        tp = scan_table_get(st, -1);  /* Get the token we just consumed */
        np1 = parse_node_new(pt);
        np1->type = EX_INTVAL;
        np1->intval.value = atoi(tp->value);
    } else {
        parse_error("Bad operand");
    }

    return np1;
}

Extended parse_operand (with unary and parentheses)¶

struct parse_node_st * parse_operand(struct parse_table_st *pt,
                                     struct scan_table_st *st) {
    struct scan_token_st *tp;
    struct parse_node_st *np1;

    if (scan_table_accept(st, TK_LPAREN)) {
        /* Parenthesized expression */
        np1 = parse_expression(pt, st);  /* Recursive call! */
        if (!scan_table_accept(st, TK_RPAREN)) {
            parse_error("Missing right paren");
        }
    } else if (scan_table_accept(st, TK_INTLIT)) {
        /* Integer literal */
        tp = scan_table_get(st, -1);
        np1 = parse_node_new(pt);
        np1->type = EX_INTVAL;
        np1->intval.value = atoi(tp->value);
    } else if (scan_table_accept(st, TK_MINUS)) {
        /* Unary minus */
        np1 = parse_node_new(pt);
        np1->type = EX_OPER1;
        np1->oper1.oper = OP_MINUS;
        np1->oper1.operand = parse_operand(pt, st);  /* Recursive! */
    } else {
        parse_error("Bad operand");
    }

    return np1;
}

5. Building Left-Associative Trees¶

Step-by-Step Example: "1 + 2 + 3"¶

Initial state after scanning:

Tokens: [INTLIT(1), PLUS, INTLIT(2), PLUS, INTLIT(3), EOT]
        ^
        cur=0

Step 1: Parse first operand

np1 = parse_operand()  → Creates INTVAL(1)

table[0]: INTVAL, value=1
np1 → table[0]

Step 2: See PLUS, parse second operand

tp->id == TK_PLUS, so enter loop
scan_table_accept(TK_ANY)  → consumes PLUS
parse_operand()  → Creates INTVAL(2)

table[0]: INTVAL, value=1
table[1]: INTVAL, value=2

Create np2 = OPER2(PLUS, np1, new_operand)

table[2]: OPER2, oper=PLUS
          left → table[0]
          right → table[1]

np1 = np2 (np1 now points to OPER2)

Step 3: See PLUS, parse third operand

tp->id == TK_PLUS, so continue loop
scan_table_accept(TK_ANY)  → consumes PLUS
parse_operand()  → Creates INTVAL(3)

table[3]: INTVAL, value=3

Create np2 = OPER2(PLUS, np1, new_operand)

table[4]: OPER2, oper=PLUS
          left → table[2]  (the previous OPER2!)
          right → table[3]

np1 = np2

Step 4: No more operators

tp->id == TK_EOT, break from loop
Return np1 (points to table[4])

Final Tree Structure¶

            (+)  [table[4]]
           /   \
         (+)   (3) [table[3]]
    [table[2]]
        /   \
      (1)   (2)
 [table[0]] [table[1]]

6. Complete Example¶

main.c¶

#include "ntlang.h"

int main(int argc, char **argv) {
    struct scan_table_st scan_table;
    struct parse_table_st parse_table;
    struct parse_node_st *parse_tree;
    char *input = "1 + 2 + 3";

    /* Initialize tables */
    scan_table_init(&scan_table);
    parse_table_init(&parse_table);

    /* Scan the input */
    scan_table_scan(&scan_table, input);

    /* Parse the tokens */
    parse_tree = parse_program(&parse_table, &scan_table);

    /* Print the tree */
    parse_tree_print(parse_tree);

    return 0;
}

Output¶

EXPR OPER2 PLUS
..EXPR OPER2 PLUS
....EXPR INTVAL 1
....EXPR INTVAL 2
..EXPR INTVAL 3

Key Concepts¶

Concept	Description
`parse_node_st`	AST node structure with union for type-specific data
`parse_table_st`	Pre-allocated array for node storage
`parse_node_new()`	Allocate next node from table
`scan_table_get(i)`	Get token at offset i from current position
`scan_table_accept()`	Consume token if it matches expected type
`TK_ANY`	Wildcard token type for unconditional accept

Practice Problems¶

Problem 1: Trace Node Allocation¶

For input "(1 + 2) * 3", show the contents of the parse table after parsing.

Click to reveal solution

table[0]: EX_INTVAL, value=1
table[1]: EX_INTVAL, value=2
table[2]: EX_OPER2, oper=PLUS, left→[0], right→[1]
table[3]: EX_INTVAL, value=3
table[4]: EX_OPER2, oper=MULT, left→[2], right→[3]

len = 5

Tree structure:
        (*)
       /   \
     (+)   (3)
    /   \
  (1)   (2)

Note: The parentheses don't create nodes - they just control which subtree becomes the left child.

Problem 2: Extend the Parser¶

Add support for TK_MINUS operator in parse_expression. Show the modified code.

Click to reveal solution

struct parse_node_st * parse_expression(struct parse_table_st *pt,
                                        struct scan_table_st *st) {
    struct scan_token_st *tp;
    struct parse_node_st *np1, *np2;

    np1 = parse_operand(pt, st);

    while (true) {
        tp = scan_table_get(st, 0);

        if (tp->id == TK_PLUS) {
            scan_table_accept(st, TK_ANY);
            np2 = parse_node_new(pt);
            np2->type = EX_OPER2;
            np2->oper2.oper = OP_PLUS;
            np2->oper2.left = np1;
            np2->oper2.right = parse_operand(pt, st);
            np1 = np2;
        } else if (tp->id == TK_MINUS) {   /* NEW */
            scan_table_accept(st, TK_ANY);
            np2 = parse_node_new(pt);
            np2->type = EX_OPER2;
            np2->oper2.oper = OP_MINUS;    /* Use OP_MINUS */
            np2->oper2.left = np1;
            np2->oper2.right = parse_operand(pt, st);
            np1 = np2;
        } else {
            break;
        }
    }

    return np1;
}

Problem 3: Fix the Bug¶

The following code has a bug. What is it?

if (scan_table_accept(st, TK_LPAREN)) {
    np = parse_expression(pt, st);
    // What's missing here?
}

Click to reveal solution

**Bug**: Missing check for closing parenthesis! **Fixed code**:

if (scan_table_accept(st, TK_LPAREN)) {
    np = parse_expression(pt, st);
    if (!scan_table_accept(st, TK_RPAREN)) {
        parse_error("Missing right paren");
    }
}

Without this check: - Input `"(1 + 2"` would parse without error - The `)` would be left unconsumed, causing problems later

Summary¶

AST nodes use a union to store type-specific data efficiently - all nodes are the same size but hold different information.
The parse table pre-allocates nodes in a contiguous array, avoiding heap allocation and memory management complexity.
scan_table_accept() consumes a token if it matches, returning true/false. Use TK_ANY as a wildcard.
scan_table_get(-1) retrieves the most recently consumed token to access its value.
Left-associative trees are built by making the previous result the left child of each new operator node.
Recursive descent naturally handles nested constructs - parse_operand calls parse_expression for parenthesized expressions.