Parsing and Abstract Syntax Trees

# Parsing and Abstract Syntax Trees

## CS 631 Systems Foundations

---

## Language Processing Pipeline

<div class="mermaid">
flowchart LR
    A["Input"] --> B["Scanning"]
    B --> C["Tokens"]
    C --> D["Parsing"]
    D --> E["AST"]
    E --> F["Eval/CodeGen"]
    F --> G["Result"]
</div>

**Today's focus**: Parsing - converting tokens to an AST

---

## Scanning vs Parsing

| Phase | Input | Output | Purpose |
|-------|-------|--------|---------|
| **Scanning** | Characters | Tokens | Recognize words |
| **Parsing** | Tokens | AST | Understand structure |

```text
Input:  "1 + 2 + 3"

AST:        (+)
           /   \
         (+)   (3)
        /   \
      (1)   (2)
```

---

## Why We Need Trees

Consider: `1 + 2 * 3`

Without structure, evaluation order is ambiguous!

**Left-to-right:**
```text
      *
     / \
    +   3
   / \
  1   2

(1 + 2) * 3 = 9
```

</div>
<div>

**Math precedence:**
```text
      +
     / \
    1   *
       / \
      2   3

1 + (2 * 3) = 7
```

</div>
</div>

---

## NTLang Approach

In NTLang, all operators have **same precedence** and associate **left-to-right**.

Use parentheses to control grouping:

| Expression | Interpretation | Result |
|------------|----------------|--------|
| `1 + 2 * 3` | `(1 + 2) * 3` | 9 |
| `1 + (2 * 3)` | `1 + (2 * 3)` | 7 |
| `1 + 2 + 3` | `(1 + 2) + 3` | 6 |

---

## Parser Types

| Type | Description | We Use? |
|------|-------------|---------|
| **LL** | Top-down, left-to-right | Yes! |
| **LR** | Bottom-up, rightmost derivation | No |
| **GLR** | Generalized LR | No |
| **PEG** | Parsing Expression Grammar | No |

<div class="info-box">
We use <strong>LL(1) recursive descent</strong> because it's simple to implement by hand.
</div>

---

## EBNF Grammar: Program

```text
program    ::= expression EOT
```

A program is an expression followed by end-of-text.

```text
Input: "1 + 2"

Tokens: [INTLIT(1), PLUS, INTLIT(2), EOT]
        └──────── expression ────────┘ └─┘
        └──────────── program ───────────┘
```

---

## EBNF Grammar: Expression

```text
expression ::= operand (operator operand)*
```

An operand followed by zero or more (operator, operand) pairs.

```text
"1"           →  operand
"1 + 2"       →  operand operator operand
"1 + 2 + 3"   →  operand operator operand operator operand
               └─────────────────────────────────────────┘
                           expression
```

---

## EBNF Grammar: Operand

An operand can be:
- A literal value (`42`, `0xFF`, `0b1010`)
- A unary operation (`-5`, `~x`)
- A parenthesized expression (`(1 + 2)`)

---

## Recursive Descent Concept

**One function per grammar rule:**

| Grammar Rule | Function |
|--------------|----------|
| `program` | `parse_program()` |
| `expression` | `parse_expression()` |
| `operand` | `parse_operand()` |

Functions call each other following the grammar structure.

---

## Grammar to Code Mapping

| Grammar | Code |
|---------|------|
| Rule name | Function name |
| `A B` (sequence) | Call A, then B |
| `A \| B` (choice) | if/else or match |
| `(A)*` (repetition) | while loop |
| Terminal (token) | `accept(TOKEN)` |

---

## Accept/Get Pattern

Two key operations:

| Operation | Purpose |
|-----------|---------|
| `accept(TOKEN)` | If current matches, consume it, return true |
| `get(i)` | Look at token at position `cur + i` |

```c
if (accept(TK_PLUS)) {
    // Token was PLUS and was consumed
    tp = get(-1);  // Get the token we just consumed
}
```

---

## The Left Recursion Problem

This grammar causes infinite recursion:

```text
expression ::= expression '+' operand
             | operand
```

```c
void parse_expression() {
    parse_expression();  // Calls itself immediately!
    accept(TK_PLUS);
    parse_operand();
}
```

Stack overflow!

---

## Eliminating Left Recursion

**Transform recursion into iteration:**

**Before (recursive):**
```text
expr ::= expr '+' operand
       | operand
```

</div>
<div>

**After (iterative):**
```text
expr ::= operand ('+' operand)*
```

</div>
</div>

Now we can use a while loop instead of recursion!

---

## AST Node Types

| Type | Description | Children |
|------|-------------|----------|
| `INTVAL` | Integer value | None (leaf) |
| `OPER1` | Unary operator | 1 child |
| `OPER2` | Binary operator | 2 children |

---

## AST Example: "1 + 2"

```text
Token Stream: INTLIT("1") | PLUS | INTLIT("2")

AST:
         (+)          <- OPER2, oper=PLUS
        /   \
      (1)   (2)       <- INTVAL nodes
```

3 nodes total: 2 INTVAL + 1 OPER2

---

## AST Example: "1 + 2 + 3"

Left-associative: `(1 + 2) + 3`

```text
            (+)           <- Root: OPER2
           /   \
         (+)   (3)        <- OPER2, INTVAL
        /   \
      (1)   (2)           <- INTVAL, INTVAL
```

5 nodes total: 3 INTVAL + 2 OPER2

---

## Left-Associative Building

For `"1 + 2 + 3"`:

<div class="mermaid">
flowchart LR
    subgraph Step1
        A["np = 1"]
    end
    subgraph Step2
        B["np = PLUS"]
        B1["1"] --> B
        B2["2"] --> B
    end
    subgraph Step3
        C["np = PLUS"]
        C1["prev PLUS"] --> C
        C2["3"] --> C
    end
    Step1 --> Step2 --> Step3
</div>

**Key**: Previous result becomes left child of new node

---

## Parsing Flowchart

<div class="mermaid">
flowchart TD
    A[parse_program] --> B[parse_expression]
    B --> C[parse_operand]
    C --> D{Token?}
    D --> E[INTVAL]
    D --> F[recurse]
    D --> G[OPER1]
    G --> C
    B --> I{Operator?}
    I --> J[OPER2]
    J --> C
    I --> K[Return]
</div>

---

## Unary Operators

**Expression: `-5`**

```text
Tokens: MINUS | INTLIT("5")

AST:
     (-)          <- OPER1, oper=MINUS
      |
     (5)          <- INTVAL
```

Unary operators have exactly one child.

---

## Parentheses Create Subtrees

**Expression: `7 + (4 * 2)`**

```text
        (+)
       /   \
     (7)   (*)
          /   \
        (4)   (2)
```

The parenthesized part becomes a subtree attached as the right child.

---

## Key Concepts

| Concept | Definition |
|---------|------------|
| Parsing | Tokens → AST |
| Recursive Descent | One function per rule |
| Left Associativity | `1+2+3` = `(1+2)+3` |
| Accept | Consume matching token |
| Get | Look at token without consuming |

---

## Summary

1. **Parser converts tokens to AST** - hierarchical structure

2. **LL parsers** are simple but can't handle left recursion

3. **Recursive descent** - one function per grammar rule

4. **Transform left recursion** to iteration with `*`

5. **Three node types**: INTVAL, OPER1, OPER2

6. **Left-associative** - previous result is always left child