Parsing in C¶
Overview¶
This lecture covers the implementation of a recursive descent parser in C. We examine how to represent Abstract Syntax Trees using C's struct and union constructs, manage memory efficiently using a parse table, and implement parsing functions that follow the grammar rules.
Learning Objectives¶
- Understand AST node representation using structs and unions in C
- Implement a memory-efficient parse table for node allocation
- Use the scanner's accept/get interface for token consumption
- Build a recursive descent parser following the EBNF grammar
- Construct left-associative expression trees
Prerequisites¶
- Understanding of parsing concepts (previous lecture)
- C programming with structs, pointers, and unions
- Familiarity with the NTLang scanner interface
1. AST Node Structure in C¶
Node Type Enumeration¶
The AST has three types of nodes, identified by an enumeration:
enum parse_expr_enum {
EX_INTVAL, /* Integer value (leaf node) */
EX_OPER1, /* Unary operator (one child) */
EX_OPER2 /* Binary operator (two children) */
};
Operator Enumeration¶
Operators are also identified by an enumeration:
enum parse_oper_enum {
OP_PLUS, /* + */
OP_MINUS, /* - */
OP_MULT, /* * */
OP_DIV /* / */
/* Add more operators as needed */
};
The parse_node_st Structure¶
Each AST node uses a union to store type-specific data:
struct parse_node_st {
enum parse_expr_enum type; /* Node type */
union {
/* For EX_INTVAL: stores the integer value */
struct {
int value;
} intval;
/* For EX_OPER1: unary operator with one child */
struct {
enum parse_oper_enum oper;
struct parse_node_st *operand;
} oper1;
/* For EX_OPER2: binary operator with two children */
struct {
enum parse_oper_enum oper;
struct parse_node_st *left;
struct parse_node_st *right;
} oper2;
};
};
Why Use a Union?¶
A union allows different "views" of the same memory:
| Node Type | Active Union Member | Contents |
|---|---|---|
EX_INTVAL |
intval |
Integer value |
EX_OPER1 |
oper1 |
Operator + 1 child pointer |
EX_OPER2 |
oper2 |
Operator + 2 child pointers |
Memory efficiency: All nodes are the same size (the size of the largest union member), but we only use the relevant fields for each node type.
2. Memory Management with parse_table_st¶
The Parse Table Structure¶
Instead of using malloc() for each node, we pre-allocate an array:
#define PARSE_TABLE_LEN 1024
struct parse_table_st {
struct parse_node_st table[PARSE_TABLE_LEN]; /* Node storage */
int len; /* Number of nodes allocated */
};
Table Initialization¶
Allocating New Nodes¶
struct parse_node_st * parse_node_new(struct parse_table_st *pt) {
struct parse_node_st *np;
np = &(pt->table[pt->len]);
pt->len += 1;
return np;
}
This function: 1. Gets a pointer to the next available slot in the array 2. Increments the length counter 3. Returns the pointer to the caller
Memory Layout Visualization¶
For parsing "1 + 2":
parse_table_st:
+------------------+
| table[0] | <- INTVAL, value=1
| type: EX_INTVAL|
| intval.value: 1|
+------------------+
| table[1] | <- OPER2, PLUS
| type: EX_OPER2 |
| oper2.oper: + |
| oper2.left ----+--> table[0]
| oper2.right ---+--> table[1]
+------------------+
| table[2] | <- INTVAL, value=2
| type: EX_INTVAL|
| intval.value: 2|
+------------------+
| len = 3 |
+------------------+
Benefits of This Approach¶
| Benefit | Explanation |
|---|---|
| No heap allocation | All nodes on stack (in main) |
| No memory leaks | No free() needed |
| Cache-friendly | Nodes are contiguous in memory |
| Fast allocation | Just increment a counter |
3. Token Interface for Parser¶
scan_table_get¶
Get a token at a relative position without consuming it:
struct scan_token_st * scan_table_get(struct scan_table_st *st, int i) {
return &(st->table[st->cur + i]);
}
| Call | Returns |
|---|---|
get(0) |
Current token (at position cur) |
get(-1) |
Previous token (just consumed) |
get(1) |
Next token (lookahead) |
scan_table_accept¶
Accept a token if it matches the expected type:
bool scan_table_accept(struct scan_table_st *st,
enum scan_token_enum tk_expected) {
struct scan_token_st *tp;
/* TK_ANY is a wildcard - always matches */
if (tk_expected == TK_ANY) {
st->cur += 1;
return true;
}
tp = scan_table_get(st, 0);
if (tp->id == tk_expected) {
st->cur += 1;
return true;
}
return false;
}
The TK_ANY Wildcard¶
TK_ANY is useful when you've already checked the token type:
tp = scan_table_get(st, 0); /* Peek at current token */
if (tp->id == TK_PLUS) {
scan_table_accept(st, TK_ANY); /* Consume it (we know it's PLUS) */
/* ... */
}
4. Parser Implementation¶
parse_program¶
The top-level parsing function:
struct parse_node_st * parse_program(struct parse_table_st *pt,
struct scan_table_st *st) {
struct parse_node_st *np1;
/* A program is a single expression followed by EOT */
np1 = parse_expression(pt, st);
if (!scan_table_accept(st, TK_EOT)) {
parse_error("Expecting EOT");
}
return np1;
}
parse_expression¶
Implements the grammar rule expression ::= operand (operator operand)*:
struct parse_node_st * parse_expression(struct parse_table_st *pt,
struct scan_table_st *st) {
struct scan_token_st *tp;
struct parse_node_st *np1, *np2;
/* An expression must start with an operand */
np1 = parse_operand(pt, st);
while (true) {
tp = scan_table_get(st, 0); /* Peek at current token */
if (tp->id == TK_PLUS) {
/* Found a plus operator */
scan_table_accept(st, TK_ANY); /* Consume the operator */
/* Create a new OPER2 node */
np2 = parse_node_new(pt);
np2->type = EX_OPER2;
np2->oper2.oper = OP_PLUS;
np2->oper2.left = np1; /* Previous result is left child */
np2->oper2.right = parse_operand(pt, st); /* Parse right operand */
np1 = np2; /* New node becomes current result */
} else {
break; /* No more operators */
}
}
return np1;
}
Key Points in parse_expression¶
- Start with an operand:
np1 = parse_operand(pt, st) - Loop while operators found:
while (true) { ... if not operator, break } - Build left-associative tree:
np2->oper2.left = np1(previous is left child) - Update current node:
np1 = np2(new node becomes current)
parse_operand¶
Handles literals, unary operators, and parentheses:
struct parse_node_st * parse_operand(struct parse_table_st *pt,
struct scan_table_st *st) {
struct scan_token_st *tp;
struct parse_node_st *np1;
if (scan_table_accept(st, TK_INTLIT)) {
/* Integer literal */
tp = scan_table_get(st, -1); /* Get the token we just consumed */
np1 = parse_node_new(pt);
np1->type = EX_INTVAL;
np1->intval.value = atoi(tp->value);
} else {
parse_error("Bad operand");
}
return np1;
}
Extended parse_operand (with unary and parentheses)¶
struct parse_node_st * parse_operand(struct parse_table_st *pt,
struct scan_table_st *st) {
struct scan_token_st *tp;
struct parse_node_st *np1;
if (scan_table_accept(st, TK_LPAREN)) {
/* Parenthesized expression */
np1 = parse_expression(pt, st); /* Recursive call! */
if (!scan_table_accept(st, TK_RPAREN)) {
parse_error("Missing right paren");
}
} else if (scan_table_accept(st, TK_INTLIT)) {
/* Integer literal */
tp = scan_table_get(st, -1);
np1 = parse_node_new(pt);
np1->type = EX_INTVAL;
np1->intval.value = atoi(tp->value);
} else if (scan_table_accept(st, TK_MINUS)) {
/* Unary minus */
np1 = parse_node_new(pt);
np1->type = EX_OPER1;
np1->oper1.oper = OP_MINUS;
np1->oper1.operand = parse_operand(pt, st); /* Recursive! */
} else {
parse_error("Bad operand");
}
return np1;
}
5. Building Left-Associative Trees¶
Step-by-Step Example: "1 + 2 + 3"¶
Initial state after scanning:
Step 1: Parse first operand
Step 2: See PLUS, parse second operand
tp->id == TK_PLUS, so enter loop
scan_table_accept(TK_ANY) → consumes PLUS
parse_operand() → Creates INTVAL(2)
table[0]: INTVAL, value=1
table[1]: INTVAL, value=2
Create np2 = OPER2(PLUS, np1, new_operand)
table[2]: OPER2, oper=PLUS
left → table[0]
right → table[1]
np1 = np2 (np1 now points to OPER2)
Step 3: See PLUS, parse third operand
tp->id == TK_PLUS, so continue loop
scan_table_accept(TK_ANY) → consumes PLUS
parse_operand() → Creates INTVAL(3)
table[3]: INTVAL, value=3
Create np2 = OPER2(PLUS, np1, new_operand)
table[4]: OPER2, oper=PLUS
left → table[2] (the previous OPER2!)
right → table[3]
np1 = np2
Step 4: No more operators
Final Tree Structure¶
6. Complete Example¶
main.c¶
#include "ntlang.h"
int main(int argc, char **argv) {
struct scan_table_st scan_table;
struct parse_table_st parse_table;
struct parse_node_st *parse_tree;
char *input = "1 + 2 + 3";
/* Initialize tables */
scan_table_init(&scan_table);
parse_table_init(&parse_table);
/* Scan the input */
scan_table_scan(&scan_table, input);
/* Parse the tokens */
parse_tree = parse_program(&parse_table, &scan_table);
/* Print the tree */
parse_tree_print(parse_tree);
return 0;
}
Output¶
Key Concepts¶
| Concept | Description |
|---|---|
parse_node_st |
AST node structure with union for type-specific data |
parse_table_st |
Pre-allocated array for node storage |
parse_node_new() |
Allocate next node from table |
scan_table_get(i) |
Get token at offset i from current position |
scan_table_accept() |
Consume token if it matches expected type |
TK_ANY |
Wildcard token type for unconditional accept |
Practice Problems¶
Problem 1: Trace Node Allocation¶
For input "(1 + 2) * 3", show the contents of the parse table after parsing.
Click to reveal solution
Note: The parentheses don't create nodes - they just control which subtree becomes the left child.Problem 2: Extend the Parser¶
Add support for TK_MINUS operator in parse_expression. Show the modified code.
Click to reveal solution
struct parse_node_st * parse_expression(struct parse_table_st *pt,
struct scan_table_st *st) {
struct scan_token_st *tp;
struct parse_node_st *np1, *np2;
np1 = parse_operand(pt, st);
while (true) {
tp = scan_table_get(st, 0);
if (tp->id == TK_PLUS) {
scan_table_accept(st, TK_ANY);
np2 = parse_node_new(pt);
np2->type = EX_OPER2;
np2->oper2.oper = OP_PLUS;
np2->oper2.left = np1;
np2->oper2.right = parse_operand(pt, st);
np1 = np2;
} else if (tp->id == TK_MINUS) { /* NEW */
scan_table_accept(st, TK_ANY);
np2 = parse_node_new(pt);
np2->type = EX_OPER2;
np2->oper2.oper = OP_MINUS; /* Use OP_MINUS */
np2->oper2.left = np1;
np2->oper2.right = parse_operand(pt, st);
np1 = np2;
} else {
break;
}
}
return np1;
}
Problem 3: Fix the Bug¶
The following code has a bug. What is it?
Click to reveal solution
**Bug**: Missing check for closing parenthesis! **Fixed code**: Without this check: - Input `"(1 + 2"` would parse without error - The `)` would be left unconsumed, causing problems laterSummary¶
-
AST nodes use a union to store type-specific data efficiently - all nodes are the same size but hold different information.
-
The parse table pre-allocates nodes in a contiguous array, avoiding heap allocation and memory management complexity.
-
scan_table_accept()consumes a token if it matches, returning true/false. UseTK_ANYas a wildcard. -
scan_table_get(-1)retrieves the most recently consumed token to access its value. -
Left-associative trees are built by making the previous result the left child of each new operator node.
-
Recursive descent naturally handles nested constructs -
parse_operandcallsparse_expressionfor parenthesized expressions.