Skip to content

Parsing in C

Overview

This lecture covers the implementation of a recursive descent parser in C. We examine how to represent Abstract Syntax Trees using C's struct and union constructs, manage memory efficiently using a parse table, and implement parsing functions that follow the grammar rules.

Learning Objectives

  • Understand AST node representation using structs and unions in C
  • Implement a memory-efficient parse table for node allocation
  • Use the scanner's accept/get interface for token consumption
  • Build a recursive descent parser following the EBNF grammar
  • Construct left-associative expression trees

Prerequisites

  • Understanding of parsing concepts (previous lecture)
  • C programming with structs, pointers, and unions
  • Familiarity with the NTLang scanner interface

1. AST Node Structure in C

Node Type Enumeration

The AST has three types of nodes, identified by an enumeration:

enum parse_expr_enum {
    EX_INTVAL,  /* Integer value (leaf node) */
    EX_OPER1,   /* Unary operator (one child) */
    EX_OPER2    /* Binary operator (two children) */
};

Operator Enumeration

Operators are also identified by an enumeration:

enum parse_oper_enum {
    OP_PLUS,   /* + */
    OP_MINUS,  /* - */
    OP_MULT,   /* * */
    OP_DIV     /* / */
    /* Add more operators as needed */
};

The parse_node_st Structure

Each AST node uses a union to store type-specific data:

struct parse_node_st {
    enum parse_expr_enum type;  /* Node type */
    union {
        /* For EX_INTVAL: stores the integer value */
        struct {
            int value;
        } intval;

        /* For EX_OPER1: unary operator with one child */
        struct {
            enum parse_oper_enum oper;
            struct parse_node_st *operand;
        } oper1;

        /* For EX_OPER2: binary operator with two children */
        struct {
            enum parse_oper_enum oper;
            struct parse_node_st *left;
            struct parse_node_st *right;
        } oper2;
    };
};

Why Use a Union?

A union allows different "views" of the same memory:

Node Type Active Union Member Contents
EX_INTVAL intval Integer value
EX_OPER1 oper1 Operator + 1 child pointer
EX_OPER2 oper2 Operator + 2 child pointers

Memory efficiency: All nodes are the same size (the size of the largest union member), but we only use the relevant fields for each node type.


2. Memory Management with parse_table_st

The Parse Table Structure

Instead of using malloc() for each node, we pre-allocate an array:

#define PARSE_TABLE_LEN 1024

struct parse_table_st {
    struct parse_node_st table[PARSE_TABLE_LEN];  /* Node storage */
    int len;  /* Number of nodes allocated */
};

Table Initialization

void parse_table_init(struct parse_table_st *pt) {
    pt->len = 0;
}

Allocating New Nodes

struct parse_node_st * parse_node_new(struct parse_table_st *pt) {
    struct parse_node_st *np;

    np = &(pt->table[pt->len]);
    pt->len += 1;

    return np;
}

This function: 1. Gets a pointer to the next available slot in the array 2. Increments the length counter 3. Returns the pointer to the caller

Memory Layout Visualization

For parsing "1 + 2":

parse_table_st:
+------------------+
| table[0]         |  <- INTVAL, value=1
|   type: EX_INTVAL|
|   intval.value: 1|
+------------------+
| table[1]         |  <- OPER2, PLUS
|   type: EX_OPER2 |
|   oper2.oper: +  |
|   oper2.left ----+--> table[0]
|   oper2.right ---+--> table[1]
+------------------+
| table[2]         |  <- INTVAL, value=2
|   type: EX_INTVAL|
|   intval.value: 2|
+------------------+
| len = 3          |
+------------------+

Benefits of This Approach

Benefit Explanation
No heap allocation All nodes on stack (in main)
No memory leaks No free() needed
Cache-friendly Nodes are contiguous in memory
Fast allocation Just increment a counter

3. Token Interface for Parser

scan_table_get

Get a token at a relative position without consuming it:

struct scan_token_st * scan_table_get(struct scan_table_st *st, int i) {
    return &(st->table[st->cur + i]);
}
Call Returns
get(0) Current token (at position cur)
get(-1) Previous token (just consumed)
get(1) Next token (lookahead)

scan_table_accept

Accept a token if it matches the expected type:

bool scan_table_accept(struct scan_table_st *st,
                       enum scan_token_enum tk_expected) {
    struct scan_token_st *tp;

    /* TK_ANY is a wildcard - always matches */
    if (tk_expected == TK_ANY) {
        st->cur += 1;
        return true;
    }

    tp = scan_table_get(st, 0);

    if (tp->id == tk_expected) {
        st->cur += 1;
        return true;
    }

    return false;
}

The TK_ANY Wildcard

TK_ANY is useful when you've already checked the token type:

tp = scan_table_get(st, 0);      /* Peek at current token */
if (tp->id == TK_PLUS) {
    scan_table_accept(st, TK_ANY);  /* Consume it (we know it's PLUS) */
    /* ... */
}

4. Parser Implementation

parse_program

The top-level parsing function:

struct parse_node_st * parse_program(struct parse_table_st *pt,
                                     struct scan_table_st *st) {
    struct parse_node_st *np1;

    /* A program is a single expression followed by EOT */
    np1 = parse_expression(pt, st);

    if (!scan_table_accept(st, TK_EOT)) {
        parse_error("Expecting EOT");
    }

    return np1;
}

parse_expression

Implements the grammar rule expression ::= operand (operator operand)*:

struct parse_node_st * parse_expression(struct parse_table_st *pt,
                                        struct scan_table_st *st) {
    struct scan_token_st *tp;
    struct parse_node_st *np1, *np2;

    /* An expression must start with an operand */
    np1 = parse_operand(pt, st);

    while (true) {
        tp = scan_table_get(st, 0);  /* Peek at current token */

        if (tp->id == TK_PLUS) {
            /* Found a plus operator */
            scan_table_accept(st, TK_ANY);  /* Consume the operator */

            /* Create a new OPER2 node */
            np2 = parse_node_new(pt);
            np2->type = EX_OPER2;
            np2->oper2.oper = OP_PLUS;
            np2->oper2.left = np1;          /* Previous result is left child */
            np2->oper2.right = parse_operand(pt, st);  /* Parse right operand */

            np1 = np2;  /* New node becomes current result */
        } else {
            break;  /* No more operators */
        }
    }

    return np1;
}

Key Points in parse_expression

  1. Start with an operand: np1 = parse_operand(pt, st)
  2. Loop while operators found: while (true) { ... if not operator, break }
  3. Build left-associative tree: np2->oper2.left = np1 (previous is left child)
  4. Update current node: np1 = np2 (new node becomes current)

parse_operand

Handles literals, unary operators, and parentheses:

struct parse_node_st * parse_operand(struct parse_table_st *pt,
                                     struct scan_table_st *st) {
    struct scan_token_st *tp;
    struct parse_node_st *np1;

    if (scan_table_accept(st, TK_INTLIT)) {
        /* Integer literal */
        tp = scan_table_get(st, -1);  /* Get the token we just consumed */
        np1 = parse_node_new(pt);
        np1->type = EX_INTVAL;
        np1->intval.value = atoi(tp->value);
    } else {
        parse_error("Bad operand");
    }

    return np1;
}

Extended parse_operand (with unary and parentheses)

struct parse_node_st * parse_operand(struct parse_table_st *pt,
                                     struct scan_table_st *st) {
    struct scan_token_st *tp;
    struct parse_node_st *np1;

    if (scan_table_accept(st, TK_LPAREN)) {
        /* Parenthesized expression */
        np1 = parse_expression(pt, st);  /* Recursive call! */
        if (!scan_table_accept(st, TK_RPAREN)) {
            parse_error("Missing right paren");
        }
    } else if (scan_table_accept(st, TK_INTLIT)) {
        /* Integer literal */
        tp = scan_table_get(st, -1);
        np1 = parse_node_new(pt);
        np1->type = EX_INTVAL;
        np1->intval.value = atoi(tp->value);
    } else if (scan_table_accept(st, TK_MINUS)) {
        /* Unary minus */
        np1 = parse_node_new(pt);
        np1->type = EX_OPER1;
        np1->oper1.oper = OP_MINUS;
        np1->oper1.operand = parse_operand(pt, st);  /* Recursive! */
    } else {
        parse_error("Bad operand");
    }

    return np1;
}

5. Building Left-Associative Trees

Step-by-Step Example: "1 + 2 + 3"

Initial state after scanning:

Tokens: [INTLIT(1), PLUS, INTLIT(2), PLUS, INTLIT(3), EOT]
        ^
        cur=0

Step 1: Parse first operand

np1 = parse_operand()  → Creates INTVAL(1)

table[0]: INTVAL, value=1
np1 → table[0]

Step 2: See PLUS, parse second operand

tp->id == TK_PLUS, so enter loop
scan_table_accept(TK_ANY)  → consumes PLUS
parse_operand()  → Creates INTVAL(2)

table[0]: INTVAL, value=1
table[1]: INTVAL, value=2

Create np2 = OPER2(PLUS, np1, new_operand)

table[2]: OPER2, oper=PLUS
          left → table[0]
          right → table[1]

np1 = np2 (np1 now points to OPER2)

Step 3: See PLUS, parse third operand

tp->id == TK_PLUS, so continue loop
scan_table_accept(TK_ANY)  → consumes PLUS
parse_operand()  → Creates INTVAL(3)

table[3]: INTVAL, value=3

Create np2 = OPER2(PLUS, np1, new_operand)

table[4]: OPER2, oper=PLUS
          left → table[2]  (the previous OPER2!)
          right → table[3]

np1 = np2

Step 4: No more operators

tp->id == TK_EOT, break from loop
Return np1 (points to table[4])

Final Tree Structure

            (+)  [table[4]]
           /   \
         (+)   (3) [table[3]]
    [table[2]]
        /   \
      (1)   (2)
 [table[0]] [table[1]]

6. Complete Example

main.c

#include "ntlang.h"

int main(int argc, char **argv) {
    struct scan_table_st scan_table;
    struct parse_table_st parse_table;
    struct parse_node_st *parse_tree;
    char *input = "1 + 2 + 3";

    /* Initialize tables */
    scan_table_init(&scan_table);
    parse_table_init(&parse_table);

    /* Scan the input */
    scan_table_scan(&scan_table, input);

    /* Parse the tokens */
    parse_tree = parse_program(&parse_table, &scan_table);

    /* Print the tree */
    parse_tree_print(parse_tree);

    return 0;
}

Output

EXPR OPER2 PLUS
..EXPR OPER2 PLUS
....EXPR INTVAL 1
....EXPR INTVAL 2
..EXPR INTVAL 3

Key Concepts

Concept Description
parse_node_st AST node structure with union for type-specific data
parse_table_st Pre-allocated array for node storage
parse_node_new() Allocate next node from table
scan_table_get(i) Get token at offset i from current position
scan_table_accept() Consume token if it matches expected type
TK_ANY Wildcard token type for unconditional accept

Practice Problems

Problem 1: Trace Node Allocation

For input "(1 + 2) * 3", show the contents of the parse table after parsing.

Click to reveal solution
table[0]: EX_INTVAL, value=1
table[1]: EX_INTVAL, value=2
table[2]: EX_OPER2, oper=PLUS, left→[0], right→[1]
table[3]: EX_INTVAL, value=3
table[4]: EX_OPER2, oper=MULT, left→[2], right→[3]

len = 5

Tree structure:
        (*)
       /   \
     (+)   (3)
    /   \
  (1)   (2)
Note: The parentheses don't create nodes - they just control which subtree becomes the left child.

Problem 2: Extend the Parser

Add support for TK_MINUS operator in parse_expression. Show the modified code.

Click to reveal solution
struct parse_node_st * parse_expression(struct parse_table_st *pt,
                                        struct scan_table_st *st) {
    struct scan_token_st *tp;
    struct parse_node_st *np1, *np2;

    np1 = parse_operand(pt, st);

    while (true) {
        tp = scan_table_get(st, 0);

        if (tp->id == TK_PLUS) {
            scan_table_accept(st, TK_ANY);
            np2 = parse_node_new(pt);
            np2->type = EX_OPER2;
            np2->oper2.oper = OP_PLUS;
            np2->oper2.left = np1;
            np2->oper2.right = parse_operand(pt, st);
            np1 = np2;
        } else if (tp->id == TK_MINUS) {   /* NEW */
            scan_table_accept(st, TK_ANY);
            np2 = parse_node_new(pt);
            np2->type = EX_OPER2;
            np2->oper2.oper = OP_MINUS;    /* Use OP_MINUS */
            np2->oper2.left = np1;
            np2->oper2.right = parse_operand(pt, st);
            np1 = np2;
        } else {
            break;
        }
    }

    return np1;
}

Problem 3: Fix the Bug

The following code has a bug. What is it?

if (scan_table_accept(st, TK_LPAREN)) {
    np = parse_expression(pt, st);
    // What's missing here?
}
Click to reveal solution **Bug**: Missing check for closing parenthesis! **Fixed code**:
if (scan_table_accept(st, TK_LPAREN)) {
    np = parse_expression(pt, st);
    if (!scan_table_accept(st, TK_RPAREN)) {
        parse_error("Missing right paren");
    }
}
Without this check: - Input `"(1 + 2"` would parse without error - The `)` would be left unconsumed, causing problems later

Summary

  1. AST nodes use a union to store type-specific data efficiently - all nodes are the same size but hold different information.

  2. The parse table pre-allocates nodes in a contiguous array, avoiding heap allocation and memory management complexity.

  3. scan_table_accept() consumes a token if it matches, returning true/false. Use TK_ANY as a wildcard.

  4. scan_table_get(-1) retrieves the most recently consumed token to access its value.

  5. Left-associative trees are built by making the previous result the left child of each new operator node.

  6. Recursive descent naturally handles nested constructs - parse_operand calls parse_expression for parenthesized expressions.