← Back to Course
# EBNF and Scanning ## CS 631 Systems Foundations --- ## What is Scanning? **Scanning** (also called **lexical analysis** or **lexing**) is the first phase of a compiler or interpreter. It converts a **stream of characters** into a **stream of tokens**.
flowchart LR A["Source Text
#quot;1 + 2#quot;"] --> B[Scanner/Lexer] B --> C["Token Stream"]
--- ## What is a Token? A **token** is the smallest meaningful unit of a program. Each token has: - **Type**: What kind of token (number, operator, etc.) - **Value**: The actual text that was scanned Examples: - `42` → Type: INTEGER, Value: "42" - `+` → Type: PLUS, Value: "+" - `hello` → Type: IDENTIFIER, Value: "hello" --- ## NTlang Overview **NTlang** (Number Tool Language) is a simple expression language. Supports: - Integer literals: `42`, `100`, `0` - Hexadecimal: `0xFF`, `0x2A` - Binary: `0b1010`, `0b11` - Arithmetic: `+`, `-`, `*`, `/` - Bitwise: `>>`, `<<`, `&`, `|`, `^`, `~` - Grouping: `(`, `)` --- ## Scanning Example 1 **Input:** `"1 + 2"`
INT_LIT
"1"
PLUS
"+"
INT_LIT
"2"
EOT
""
The scanner reads each element, skipping whitespace, and produces a token for each. --- ## Scanning Example 2 **Input:** `"512 + 1024"`
INT_LIT
"512"
PLUS
"+"
INT_LIT
"1024"
EOT
""
Multi-digit numbers are read as a single token! --- ## Scanning Example 3 **Input:** `"(0b1111 & 0xF) >> 2"`
| Token Type | Value | |------------|-------| | LPAREN | "(" | | BINLIT | "1111" | | AND | "&" | | HEXLIT | "F" | | RPAREN | ")" | | LSR | ">>" | | INTLIT | "2" | | EOT | "" |
--- ## What is EBNF? **Extended Backus-Naur Form** is a notation for describing the syntax of languages. It defines rules for what character sequences form valid tokens. EBNF is like a **recipe** that tells the scanner how to recognize different token types. --- ## EBNF Notation | Symbol | Meaning | |--------|---------| | `::=` | "is defined as" | | `\|` | "or" (alternative) | | `()` | Grouping | | `*` | Zero or more repetitions | | `+` | One or more repetitions | | `[]` | Optional (zero or one) | | `'x'` | Literal character x | --- ## NTlang Scanner Grammar ```text tokenlist ::= (token)* token ::= intlit | hexlit | binlit | symbol symbol ::= '+' | '-' | '*' | '/' | '>>' | '>-' | '<<' | '~' | '&' | '|' | '^' | '(' | ')' intlit ::= digit (digit)* hexlit ::= '0x' hexdigit (hexdigit)* binlit ::= '0b' bindigit (bindigit)* hexdigit ::= 'a' | ... | 'f' | 'A' | ... | 'F' | digit bindigit ::= '0' | '1' digit ::= '0' | ... | '9' whitespace ::= (' ' | '\t') (' ' | '\t')* ``` --- ## Reading EBNF: Integer Literals ```text intlit ::= digit (digit)* digit ::= '0' | ... | '9' ``` **Translation:** - An integer literal is **one digit** followed by **zero or more digits** **Valid examples:** `0`, `1`, `42`, `12345` **Invalid examples:** (empty), `abc` --- ## Reading EBNF: Hex Literals ```text hexlit ::= '0x' hexdigit (hexdigit)* hexdigit ::= 'a' | ... | 'f' | 'A' | ... | 'F' | digit ``` **Translation:** - A hex literal starts with `0x` - Followed by **one or more** hex digits (0-9, a-f, A-F) **Valid examples:** `0x0`, `0xFF`, `0x2A3F` **Invalid examples:** `0x` (no digits), `xFF` (missing 0) --- ## Reading EBNF: Binary Literals ```text binlit ::= '0b' bindigit (bindigit)* bindigit ::= '0' | '1' ``` **Translation:** - A binary literal starts with `0b` - Followed by **one or more** binary digits (0 or 1) **Valid examples:** `0b0`, `0b1`, `0b1010`, `0b11111111` **Invalid examples:** `0b` (no digits), `0b123` (invalid digit) --- ## Reading EBNF: Symbols ```text symbol ::= '+' | '-' | '*' | '/' | '>>' | '>-' | '<<' | '~' | '&' | '|' | '^' | '(' | ')' ``` **Translation:** - A symbol is exactly **one** of the listed characters or character pairs Note: `>>`, `>-`, and `<<` are **two-character** symbols! --- ## EBNF to Code Pattern The EBNF grammar directly guides implementation: | EBNF | Code Pattern | |------|-------------| | `'x'` | Check if current char is 'x' | | `A \| B` | if-else or switch statement | | `A*` | while loop | | `A+` | read once, then while loop | | `[A]` | if statement (optional) | --- ## Summary - **Scanning** converts characters to tokens - Each **token** has a type and value - **EBNF** formally describes token syntax - EBNF symbols: `::=`, `|`, `()`, `*`, `+`, `[]` - The grammar guides scanner implementation Next: Implementing scanners in C and Rust!