← Back to Course
# Parsing and Abstract Syntax Trees ## CS 631 Systems Foundations --- ## Language Processing Pipeline
flowchart LR A["Input"] --> B["Scanning"] B --> C["Tokens"] C --> D["Parsing"] D --> E["AST"] E --> F["Eval/CodeGen"] F --> G["Result"]
**Today's focus**: Parsing - converting tokens to an AST --- ## Scanning vs Parsing | Phase | Input | Output | Purpose | |-------|-------|--------|---------| | **Scanning** | Characters | Tokens | Recognize words | | **Parsing** | Tokens | AST | Understand structure | ```text Input: "1 + 2 + 3" Tokens: TK_INTLIT("1") | TK_PLUS | TK_INTLIT("2") | TK_PLUS | TK_INTLIT("3") | TK_EOT AST: (+) / \ (+) (3) / \ (1) (2) ``` --- ## Why We Need Trees Consider: `1 + 2 * 3` Without structure, evaluation order is ambiguous!
**Left-to-right:** ```text * / \ + 3 / \ 1 2 (1 + 2) * 3 = 9 ```
**Math precedence:** ```text + / \ 1 * / \ 2 3 1 + (2 * 3) = 7 ```
--- ## NTLang Approach In NTLang, all operators have **same precedence** and associate **left-to-right**. Use parentheses to control grouping: | Expression | Interpretation | Result | |------------|----------------|--------| | `1 + 2 * 3` | `(1 + 2) * 3` | 9 | | `1 + (2 * 3)` | `1 + (2 * 3)` | 7 | | `1 + 2 + 3` | `(1 + 2) + 3` | 6 | --- ## Parser Types | Type | Description | We Use? | |------|-------------|---------| | **LL** | Top-down, left-to-right | Yes! | | **LR** | Bottom-up, rightmost derivation | No | | **GLR** | Generalized LR | No | | **PEG** | Parsing Expression Grammar | No |
We use
LL(1) recursive descent
because it's simple to implement by hand.
--- ## EBNF Grammar: Program ```text program ::= expression EOT ``` A program is an expression followed by end-of-text. ```text Input: "1 + 2" Tokens: [INTLIT(1), PLUS, INTLIT(2), EOT] └──────── expression ────────┘ └─┘ └──────────── program ───────────┘ ``` --- ## EBNF Grammar: Expression ```text expression ::= operand (operator operand)* ``` An operand followed by zero or more (operator, operand) pairs. ```text "1" → operand "1 + 2" → operand operator operand "1 + 2 + 3" → operand operator operand operator operand └─────────────────────────────────────────┘ expression ``` --- ## EBNF Grammar: Operand ```text operand ::= intlit | hexlit | binlit | '-' operand | '~' operand | '(' expression ')' ``` An operand can be: - A literal value (`42`, `0xFF`, `0b1010`) - A unary operation (`-5`, `~x`) - A parenthesized expression (`(1 + 2)`) --- ## Recursive Descent Concept **One function per grammar rule:** | Grammar Rule | Function | |--------------|----------| | `program` | `parse_program()` | | `expression` | `parse_expression()` | | `operand` | `parse_operand()` | Functions call each other following the grammar structure. --- ## Grammar to Code Mapping | Grammar | Code | |---------|------| | Rule name | Function name | | `A B` (sequence) | Call A, then B | | `A \| B` (choice) | if/else or match | | `(A)*` (repetition) | while loop | | Terminal (token) | `accept(TOKEN)` | --- ## Accept/Get Pattern Two key operations: | Operation | Purpose | |-----------|---------| | `accept(TOKEN)` | If current matches, consume it, return true | | `get(i)` | Look at token at position `cur + i` | ```c if (accept(TK_PLUS)) { // Token was PLUS and was consumed tp = get(-1); // Get the token we just consumed } ``` --- ## The Left Recursion Problem This grammar causes infinite recursion: ```text expression ::= expression '+' operand | operand ``` ```c void parse_expression() { parse_expression(); // Calls itself immediately! accept(TK_PLUS); parse_operand(); } ``` Stack overflow! --- ## Eliminating Left Recursion **Transform recursion into iteration:**
**Before (recursive):** ```text expr ::= expr '+' operand | operand ```
**After (iterative):** ```text expr ::= operand ('+' operand)* ```
Now we can use a while loop instead of recursion! --- ## AST Node Types | Type | Description | Children | |------|-------------|----------| | `INTVAL` | Integer value | None (leaf) | | `OPER1` | Unary operator | 1 child | | `OPER2` | Binary operator | 2 children | --- ## AST Example: "1 + 2" ```text Token Stream: INTLIT("1") | PLUS | INTLIT("2") AST: (+) <- OPER2, oper=PLUS / \ (1) (2) <- INTVAL nodes ``` 3 nodes total: 2 INTVAL + 1 OPER2 --- ## AST Example: "1 + 2 + 3" Left-associative: `(1 + 2) + 3` ```text (+) <- Root: OPER2 / \ (+) (3) <- OPER2, INTVAL / \ (1) (2) <- INTVAL, INTVAL ``` 5 nodes total: 3 INTVAL + 2 OPER2 --- ## Left-Associative Building For `"1 + 2 + 3"`:
flowchart LR subgraph Step1 A["np = 1"] end subgraph Step2 B["np = PLUS"] B1["1"] --> B B2["2"] --> B end subgraph Step3 C["np = PLUS"] C1["prev PLUS"] --> C C2["3"] --> C end Step1 --> Step2 --> Step3
**Key**: Previous result becomes left child of new node --- ## Parsing Flowchart
flowchart TD A[parse_program] --> B[parse_expression] B --> C[parse_operand] C --> D{Token?} D --> E[INTVAL] D --> F[recurse] D --> G[OPER1] G --> C B --> I{Operator?} I --> J[OPER2] J --> C I --> K[Return]
--- ## Unary Operators **Expression: `-5`** ```text Tokens: MINUS | INTLIT("5") AST: (-) <- OPER1, oper=MINUS | (5) <- INTVAL ``` Unary operators have exactly one child. --- ## Parentheses Create Subtrees **Expression: `7 + (4 * 2)`** ```text (+) / \ (7) (*) / \ (4) (2) ``` The parenthesized part becomes a subtree attached as the right child. --- ## Key Concepts | Concept | Definition | |---------|------------| | Parsing | Tokens → AST | | Recursive Descent | One function per rule | | Left Associativity | `1+2+3` = `(1+2)+3` | | Accept | Consume matching token | | Get | Look at token without consuming | --- ## Summary 1. **Parser converts tokens to AST** - hierarchical structure 2. **LL parsers** are simple but can't handle left recursion 3. **Recursive descent** - one function per grammar rule 4. **Transform left recursion** to iteration with `*` 5. **Three node types**: INTVAL, OPER1, OPER2 6. **Left-associative** - previous result is always left child