# Advanced Architecture

### CS 631 Systems Foundations — Apr 21, 2026

Based on Jason Patterson's *Modern Microprocessors: A 90-Minute Guide*

---

## Today's Agenda

1. Beyond megahertz: CPI and IPC
2. Pipelining & instruction-level parallelism
3. Superpipelining, superscalar, VLIW
4. Branches & branch prediction
5. Out-of-order execution & register renaming
6. The Power Wall & the ILP Wall
7. x86 and μops
8. SMT, multi-core, big-little
9. SIMD vectors
10. Memory wall & caches; latency vs. bandwidth

---

## Where We Left Off

Last class: a **single-cycle** RISC-V processor.

- One instruction per rising edge
- Clock period = longest combinational path
- PC, register file, data memory, ALU, decoder — all done in **one** cycle

Nothing you buy works like this. Today: why.

---

## The Megahertz Myth

| CPU | Clock   | CPI | Effective Perf |
|-----|---------|-----|----------------|
| A   | 4.0 GHz | 1.6 | 2.5 G-inst/s   |
| B   | 3.0 GHz | 0.8 | 3.75 G-inst/s  |

CPU B wins by **50%** with a 25% lower clock.
Performance = IPC × clock.
IPC comes from microarchitecture, not MHz.
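A quick numeric check of Performance = IPC × clock, using the illustrative CPU A/B figures from the table above (a minimal C sketch, not a benchmark):

```c
/* Minimal sketch: effective performance from clock and CPI,
 * using the illustrative numbers from the table above. */
#include <stdio.h>

int main(void) {
    double clock_a = 4.0e9, cpi_a = 1.6;   /* CPU A */
    double clock_b = 3.0e9, cpi_b = 0.8;   /* CPU B */

    double perf_a = clock_a / cpi_a;       /* instructions per second */
    double perf_b = clock_b / cpi_b;

    printf("A: %.2f G-inst/s\n", perf_a / 1e9);   /* 2.50 */
    printf("B: %.2f G-inst/s\n", perf_b / 1e9);   /* 3.75 */
    printf("B / A = %.2f\n", perf_b / perf_a);    /* 1.50, i.e. B wins by 50% */
    return 0;
}
```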
---

## CPI vs. IPC

| Metric     | Formula               | Better if |
|------------|------------------------|-----------|
| CPI        | cycles / instruction   | lower     |
| IPC        | instructions / cycle   | higher    |
| Wall-clock | inst × CPI / clock     | lower     |

Two ways to raise performance: raise clock (hit the Power Wall) or raise IPC (hit the ILP Wall). Today: how to raise IPC.

---

## Pipelining: The Classic 5-Stage Pipe
```mermaid
flowchart LR
    IF["IF"] --> ID["ID"]
    ID --> EX["EX"]
    EX --> MEM["MEM"]
    MEM --> WB["WB"]
```
- Instruction enters IF every cycle
- Five instructions in flight at once
- After fill: **one instruction retires per cycle** → CPI → 1

---

## Pipeline Space-Time

| Cycle | 1  | 2  | 3  | 4   | 5   | 6   | 7   |
|-------|----|----|----|-----|-----|-----|-----|
| `add` | IF | ID | EX | MEM | WB  |     |     |
| `sub` |    | IF | ID | EX  | MEM | WB  |     |
| `and` |    |    | IF | ID  | EX  | MEM | WB  |
| `or`  |    |    |    | IF  | ID  | EX  | MEM |
| `xor` |    |    |    |     | IF  | ID  | EX  |

5× speedup over unpipelined — *when it works*.

---

## Pipeline Hazards

| Hazard     | Cause                       | Fix                          |
|------------|-----------------------------|------------------------------|
| Structural | Two stages want the same HW | Duplicate (split I/D caches) |
| Data       | B needs A's result too soon | Forwarding / stall           |
| Control    | Branch target unknown       | Branch prediction            |

Without fixes, a hazard = pipeline bubble = lost cycle.

---

## Forwarding (Bypassing)

```
add t0, t1, t2   # t0 ready at end of EX
sub t3, t0, t4   # t0 wanted at start of EX next cycle
```

Bypass wire: ALU output → ALU input. No stall.
Load-use hazard: `lw t0, 0(t1)` then `add t3, t0, t4` still needs a 1-cycle bubble. Compilers fill the slot with unrelated work.
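At the source level the same principle applies: dependent operations serialize, while independent ones give the pipeline something to do between them. A minimal, hypothetical C sketch (the function names and the two-accumulator split are illustrative, not from the lecture):

```c
/* Illustrative only: how dependences limit pipelining.
 * sum_dep is one long dependency chain; sum_indep gives the
 * hardware two independent chains it can overlap. */
#include <stddef.h>

double sum_dep(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];                 /* every add waits on the previous one */
    return s;
}

double sum_indep(const double *a, size_t n) {
    double s0 = 0.0, s1 = 0.0;
    for (size_t i = 0; i + 1 < n; i += 2) {
        s0 += a[i];                /* two shorter, independent chains */
        s1 += a[i + 1];
    }
    if (n & 1) s0 += a[n - 1];
    return s0 + s1;
}
```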
---

## Superpipelining

Split each stage into shorter ones → higher clock.

| Processor          | Stages |
|--------------------|--------|
| UltraSPARC T1      | 6      |
| ARM Cortex-A53     | 8      |
| Intel Core 2       | 14     |
| AMD Zen 4          | ~19    |
| Pentium 4 Prescott | **31** |

Price: deeper pipes → longer mispredict flush.

---

## Superscalar: Issue Width
```mermaid
flowchart LR
    IF["IF (4 wide)"] --> ID["ID"]
    ID --> A1["INT ALU"]
    ID --> A2["INT ALU"]
    ID --> FPU["FP Unit"]
    ID --> LS["Load/Store"]
    A1 --> WB["Writeback"]
    A2 --> WB
    FPU --> WB
    LS --> WB
```
Ideal IPC → issue width. Typical modern: 4–8 wide.

---

## Issue Widths in the Wild

| Processor          | Width                    |
|--------------------|--------------------------|
| UltraSPARC T1      | 1                        |
| ARM Cortex-A53     | 2                        |
| Pentium            | 2                        |
| Intel Core 2       | 4                        |
| Apple M1 Firestorm | 8                        |
| Intel Golden Cove  | 6 decode → ~12 dispatch  |

Modern chips are **superpipelined *and* superscalar**.

---

## VLIW: Let the Compiler Do It

- "Instructions" = bundles of independent sub-ops (128+ bits)
- Compiler packs them; HW just executes
- No runtime scheduling hardware

**Shipped**: Itanium (failed), Transmeta (niche), GPUs, DSPs.
**Never won** in general-purpose CPUs: cache misses, branches, and separately-compiled binaries wreck static scheduling.

---

## Instruction Latencies

| Operation                    | Cycles      |
|------------------------------|-------------|
| Integer add, sub, and/or/xor | 1           |
| Integer multiply             | 3–5         |
| Integer divide               | 12–40+      |
| FP add/mul                   | 3–6         |
| L1 load (hit)                | 3–5         |
| L2 load                      | 10–15       |
| **DRAM load (miss)**         | **100–300+** |

Load latency is the silent killer of IPC.

---

## Branches and Speculation

- Branch every ~6 instructions on average
- Deep pipeline: 10–20 instructions fetched before branch resolves
- Guess, fetch speculatively, check later
- Wrong → **flush** every speculative instruction

---

## Mispredict Cost

20-stage pipe, 15-cycle penalty, 95% accuracy, branches 1-in-6:

```
wasted cycles / instruction = 15 × 0.05 / 6 = 0.125
```

On a 4-IPC core that is **~half** a retirement slot lost per instruction — enormous.

---

## Branch Predictor Evolution

| Predictor                | Accuracy |
|--------------------------|----------|
| Always-taken             | ~60%     |
| Back-taken / Forward-not | ~65%     |
| 1-bit dynamic            | ~80%     |
| 2-bit saturating         | ~90%     |
| Two-level / gshare       | ~95%     |
| TAGE / perceptron        | ~97%+    |

Even 97% accuracy leaves real money on the table in a deep pipe.

---

## Predication

Replace the branch with a conditional instruction:

```asm
# Branchy:              # Predicated:
    beq t0, t1, skip        sub     t2, t3, t4
    sub t2, t3, t4          cmov.eq t2, t5, t0, t1
skip:
```

- ARM: fully predicated
- x86 / MIPS / SPARC: `cmov` added later
- Itanium: predicate bit on *every* instruction

Best for short conditionals; wasted work on large blocks.

---

## Out-of-Order Execution
```mermaid
flowchart LR
    IF["Fetch"] --> DEC["Decode"]
    DEC --> REN["Rename"]
    REN --> RS["Scheduler"]
    RS --> FU["Functional Units"]
    FU --> ROB["Reorder Buffer"]
    ROB --> RET["Retire (in order)"]
```
Issue when ready, retire in order. Window size: 100–500 in-flight instructions on modern cores.

---

## Register Renaming

- Architectural regs: 8 (x86-32), 16 (x86-64), 32 (RISC-V, ARM64)
- Physical regs: 150–400
- Map each arch write to a fresh physical reg → breaks false dependencies
Critical for x86. Without renaming, eight architectural registers would serialize nearly everything.
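A toy sketch of the mechanism, assuming a simple bump-allocated free list and made-up register counts; real rename hardware also reclaims physical registers at retire, which is omitted here:

```c
/* Toy register-rename table: illustrative sketch only.
 * Maps architectural register numbers to physical registers,
 * handing out a fresh physical register for every destination write. */
#include <stdio.h>

#define NUM_ARCH 16      /* e.g. x86-64 integer registers */
#define NUM_PHYS 64      /* made-up physical register count */

static int map_table[NUM_ARCH];   /* arch reg -> current phys reg */
static int next_free = NUM_ARCH;  /* trivial "free list": bump allocator */

static void rename_init(void) {
    for (int a = 0; a < NUM_ARCH; a++)
        map_table[a] = a;         /* identity mapping at reset */
}

/* Rename one instruction "dst = src1 op src2". */
static void rename(int dst, int src1, int src2) {
    int p1 = map_table[src1];          /* sources read the current mapping */
    int p2 = map_table[src2];
    int pd = next_free++ % NUM_PHYS;   /* fresh physical dest (no reclamation) */
    map_table[dst] = pd;
    printf("arch r%d = r%d op r%d  ->  p%d = p%d op p%d\n",
           dst, src1, src2, pd, p1, p2);
}

int main(void) {
    rename_init();
    rename(1, 2, 3);   /* two writes to r1 ...                         */
    rename(1, 4, 5);   /* ... get different phys regs: no false dep.   */
    rename(6, 1, 7);   /* a reader of r1 sees the latest mapping       */
    return 0;
}
```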
> "The dirty little secret of OOO is that we are often not very much OOO at all." > — Andy Glew, Pentium Pro architect --- ## Brainiac vs. Speed-Demon | Style | Strategy | Example | |-------|----------|---------| | Brainiac | Wide OOO, high IPC, moderate clock | Pentium Pro, POWER4, Apple M1 | | Speed-Demon | Simple, high clock, lean on compiler | Alpha 21164, Pentium 4, Niagara T1 | Design philosophies oscillate. Today: **moderate brainiac with aggressive power management**. --- ## The Power Wall $$ P \propto C \cdot V^2 \cdot f $$ - 30% faster clock → ~2× power - Leakage rises with voltage and temperature - Ceilings today: | Form factor | TDP | |-------------|-----| | Server | 250–400 W | | Desktop | 65–150 W | | Ultralight laptop | 15–25 W | | Phone | 3–10 W | Pentium 4 hit the Power Wall at 3.8 GHz. --- ## The ILP Wall Real integer code sustains only **~1–2 IPC**. - Loads miss unpredictably - Branches limit window - True dependency chains are irreducible You cannot just widen the issue to 20 and expect 20× — the programs do not have that much parallelism. → Need **thread-level** parallelism: SMT + multi-core. --- ## What About x86? x86 is CISC: variable length, complex addressing, memory operands. How do you build a 6-wide OOO core for that?
```mermaid
flowchart LR
    MEM["x86 bytes"] --> DEC["Complex Decoders"]
    DEC --> UOP["μop Cache"]
    UOP --> REN["Rename + Scheduler"]
    REN --> FU["Exec Units"]
    FU --> RET["Retire"]
```
Answer: crack each x86 instruction into 1–4 simple **μops** and run those through a RISC-like OOO core.

---

## The μop Cache

Decoding x86 is slow and power-hungry. Since Sandy Bridge (2011): already-decoded μops are cached.

- Tight loops run entirely from the μop cache
- Decoders powered down
- Large power and latency savings

AMD Zen has an equivalent.

---

## Simultaneous Multithreading (SMT)
```mermaid
flowchart LR
    T0["Thread 0 PC/regs"] --> IF
    T1["Thread 1 PC/regs"] --> IF
    IF["Shared Fetch/Decode"] --> BACK["Shared Backend"]
    BACK --> RET["Retire (per thread)"]
```
- Duplicate: PC, arch regs, TLB tags
- Share: decoders, FUs, caches
- Cost: ~10% area. Benefit: −10% to +30% throughput.

Intel: **Hyper-Threading**. Niagara T3: 8-way SMT, 128 threads/chip.

---

## Multi-Core

Duplicate the *entire* core. Each core = its own pipeline, L1, L2. Shared L3 and memory controller.
**SMT**

- Cheap (~10% area)
- Fills bubbles
- Throughput only

**Multi-Core**

- Expensive (N×)
- True parallelism (see the sketch below)
- Real throughput + latency
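To make "true parallelism" concrete, here is a minimal POSIX-threads sketch that splits an array sum across worker threads; the thread count, array size, and names are illustrative assumptions, not from the lecture:

```c
/* Illustrative only: thread-level parallelism with POSIX threads.
 * Each worker sums its own slice; the main thread combines results. */
#include <pthread.h>
#include <stdio.h>

#define N_THREADS 4
#define N         (1 << 20)

static double data[N];

struct slice { int begin, end; double sum; };

static void *worker(void *arg) {
    struct slice *s = arg;
    double local = 0.0;
    for (int i = s->begin; i < s->end; i++)
        local += data[i];
    s->sum = local;                  /* one write at the end, no hot-loop sharing */
    return NULL;
}

int main(void) {
    for (int i = 0; i < N; i++) data[i] = 1.0;

    pthread_t tid[N_THREADS];
    struct slice s[N_THREADS];
    int chunk = N / N_THREADS;

    for (int t = 0; t < N_THREADS; t++) {
        s[t].begin = t * chunk;
        s[t].end   = (t == N_THREADS - 1) ? N : (t + 1) * chunk;
        pthread_create(&tid[t], NULL, worker, &s[t]);
    }

    double total = 0.0;
    for (int t = 0; t < N_THREADS; t++) {
        pthread_join(tid[t], NULL);
        total += s[t].sum;
    }
    printf("total = %.0f\n", total);   /* expect 1048576 */
    return 0;
}
```

Compile with `-pthread`. Each worker accumulates into a local variable and writes its result once, so the cores spend the hot loop on independent data.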
Modern chips: both — 6–128 cores with 2–8 SMT threads each.

---

## More Cores or Wider Cores?

Width scales as **O(n²)** in the scheduler and bypass network.

| Design            | Area | Peak IPC       | Single-thread |
|-------------------|------|----------------|---------------|
| One 10-wide core  | 6×   | 10 (in theory) | Sublinear     |
| Two 5-wide cores  | 2×   | 2 × 5 = 10     | Linear        |

Two medium cores beat one giant core for almost all workloads.

---

## Asymmetric Designs

Big cores + little cores on the same die:

- **ARM big.LITTLE** — originally for phone battery life
- **Apple M-series**: P-cores + E-cores
- **Intel 12th gen+**: P-cores + E-cores with a thread director
- **AMD Zen 4c / 5c**: compact cores for density

OS scheduler picks which type to use per thread.

---

## SoC: System on Chip

Integrate CPU + GPU + I/O + DSP + security + NPU on one die.

- Dominant in phones, tablets
- Now everywhere: Apple M-series, AMD Ryzen APUs, Snapdragon X
- Saves power, area, cost

One chip, whole computer.

---

## SIMD: Data Parallelism

One instruction, many data. A 128-bit register = 16 bytes = 4 ints = 2 doubles.

```
  [ a0    | a1    | a2    | a3    ]
+ [ b0    | b1    | b2    | b3    ]
= [ a0+b0 | a1+b1 | a2+b2 | a3+b3 ]
```

Single instruction, four independent adds.

---

## The SIMD Zoo

| ISA    | Extensions                 | Width (bits)          |
|--------|----------------------------|-----------------------|
| x86    | MMX → SSE → AVX → AVX-512  | 64 → 512              |
| POWER  | AltiVec / VSX              | 128                   |
| ARM    | NEON                       | 128                   |
| ARM    | **SVE / SVE2**             | 128–2048 (scalable)   |
| RISC-V | V extension                | scalable              |

SVE / RISC-V V: same binary, any hardware width.

---

## When SIMD Wins

- **Big wins**: image/video/audio, matrix math, crypto, string scanning
- **Small wins**: simple array loops the compiler can vectorize
- **Nothing**: pointer-chasing, tree walks, parsing, business logic

Most real benefit flows through **hand-tuned libraries** (BLAS, libjpeg, OpenSSL) you call without knowing.

---

## The Memory Wall

CPU speed: up 1000× in 25 years. DRAM latency: barely changed.
A single DRAM miss costs as many cycles as ~100 ALU ops.
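One way to feel the wall: the same arithmetic is far slower when every step waits on a dependent, likely-missing load. A minimal, illustrative C sketch (sizes and layout are arbitrary; in a real experiment the list nodes would be shuffled so each hop misses):

```c
/* Illustrative only: latency-bound pointer chasing vs. a plain array walk.
 * Each list step must wait for the previous load; the array walk has
 * predictable addresses the hardware can prefetch. */
#include <stdio.h>

struct node { struct node *next; long value; };

long chase(struct node *head) {
    long sum = 0;
    for (struct node *p = head; p != NULL; p = p->next)
        sum += p->value;           /* next address unknown until the load returns */
    return sum;
}

long scan(const long *a, long n) {
    long sum = 0;
    for (long i = 0; i < n; i++)
        sum += a[i];               /* streaming, prefetch-friendly */
    return sum;
}

int main(void) {
    enum { N = 1 << 16 };
    static struct node nodes[N];
    static long arr[N];
    for (long i = 0; i < N; i++) {
        nodes[i].value = i;
        nodes[i].next  = (i + 1 < N) ? &nodes[i + 1] : NULL;
        arr[i] = i;
    }
    /* In a real measurement the nodes would be scattered in memory
     * so each ->next hop is likely a DRAM miss. */
    printf("%ld %ld\n", chase(&nodes[0]), scan(arr, N));
    return 0;
}
```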
Most "CPU performance" today is actually *memory system* performance. --- ## Memory Hierarchy
```mermaid
flowchart LR
    CPU["Core"] --> L1["L1: 32 KB, 4 cyc"]
    L1 --> L2["L2: 1 MB, 14 cyc"]
    L2 --> L3["L3: 32 MB, 50 cyc"]
    L3 --> DRAM["DRAM: 16 GB, 200 cyc"]
    DRAM --> SSD["SSD: 1 TB, 50k cyc"]
```
L1 hit rate ~95%. LLC often **>50% of chip area**.

---

## Caches (Recap)

From our [Cache Memory lecture](10-cs631-2026-03-31-cache-memory.md):

- **Temporal** and **spatial locality** make caches work
- Direct-mapped → set-associative → fully associative
- Tag / index / offset split
- Hit-rate dominates performance

Modern twist: 3D V-Cache (AMD) stacks SRAM dies on top of compute.

---

## Latency vs. Bandwidth

| Dimension | Measures            | Improved by                      |
|-----------|----------------------|----------------------------------|
| Latency   | Time for one access  | Caches, shorter wires            |
| Bandwidth | Bytes per second     | Wider buses, more channels, HBM  |

**Highway**: adding lanes doubles cars/hour (bandwidth) but does not shorten the drive (latency).
"You can't bribe God." Latency is bounded by physics. Bandwidth scales with parallelism.
---

## Workload Sensitivity

| Workload                              | Bound by                  |
|---------------------------------------|---------------------------|
| Compilers, databases, tree walks      | Latency                   |
| Image/video processing                | Bandwidth                 |
| Web server (many unrelated requests)  | Bandwidth (across threads) |
| Interactive UI                        | Latency                   |

Design the memory system for the workload you care about.

---

## The Modern CPU

A tower of tricks on top of the single-cycle datapath:

1. **Pipeline** it
2. **Make it wider** (superscalar)
3. **Make it deeper** (superpipeline)
4. **Predict branches**
5. **Reorder** dynamically; **rename** registers
6. **Decode CISC to μops**
7. **Add SMT**, then **more cores**
8. **Add SIMD**
9. **Stack caches** in a hierarchy

Every trick answers a specific bottleneck. Every trick has a cost.

---

## Key Takeaways

- **Clock speed is one lever among many.** IPC comes from microarchitecture.
- **Pipelining, superscalar, OOO** together push single-thread IPC to the ILP Wall (~2–4 for real code).
- **Branch prediction** is what makes deep pipelines viable — misprediction is still the biggest single-thread cost.
- **Power Wall + ILP Wall** forced the industry to **multi-core and SMT**; the free-lunch era of rising clocks is over.
- **x86** is a RISC-like core wearing a CISC hat — μops and renaming do the heavy lifting.
- **Memory** is now the main bottleneck. Caches, bandwidth, and latency-hiding dominate real performance.

---

## Further Reading

- [Modern Microprocessors: A 90-Minute Guide](https://www.lighterra.com/papers/modernmicroprocessors/) — Patterson's survey (the basis for today)
- [Cache Memory lecture](10-cs631-2026-03-31-cache-memory.md)
- Patterson & Hennessy, *Computer Organization and Design*, ch. 4–5
- Hennessy & Patterson, *Computer Architecture: A Quantitative Approach*, ch. 3–5
- [Agner Fog's microarchitecture manuals](https://www.agner.org/optimize/) — pipelines and latencies for every real chip