# Advanced Architecture

### CS 631 Systems Foundations — Apr 21, 2026

Based on Jason Patterson's *Modern Microprocessors: A 90-Minute Guide*

---

## Today's Agenda

1. Beyond megahertz: CPI and IPC
2. Pipelining & instruction-level parallelism
3. Superpipelining, superscalar, VLIW
4. Branches & branch prediction
5. Out-of-order execution & register renaming
6. The Power Wall & the ILP Wall
7. x86 and μops
8. SMT, multi-core, big-little
9. SIMD vectors
10. Memory wall & caches; latency vs. bandwidth

---

## Where We Left Off

Last class: a **single-cycle** RISC-V processor.

- One instruction per rising edge
- Clock period = longest combinational path
- PC, register file, data memory, ALU, decoder — all done in **one** cycle

Nothing you buy works like this. Today: why.

---

## The Megahertz Myth

| CPU | Clock   | CPI | Effective Perf |
|-----|---------|-----|----------------|
| A   | 4.0 GHz | 1.6 | 2.5 G-inst/s   |
| B   | 3.0 GHz | 0.8 | 3.75 G-inst/s  |

CPU B wins by **50%** with a 25% lower clock.
Performance = IPC × clock.
IPC comes from microarchitecture, not MHz.
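A quick numeric check of Performance = IPC × clock, using the illustrative CPU A/B figures from the table above (a minimal C sketch, not a benchmark):

```c
/* Minimal sketch: effective performance from clock and CPI,
 * using the illustrative numbers from the table above. */
#include <stdio.h>

int main(void) {
    double clock_a = 4.0e9, cpi_a = 1.6;   /* CPU A */
    double clock_b = 3.0e9, cpi_b = 0.8;   /* CPU B */

    double perf_a = clock_a / cpi_a;       /* instructions per second */
    double perf_b = clock_b / cpi_b;

    printf("A: %.2f G-inst/s\n", perf_a / 1e9);   /* 2.50 */
    printf("B: %.2f G-inst/s\n", perf_b / 1e9);   /* 3.75 */
    printf("B / A = %.2f\n", perf_b / perf_a);    /* 1.50, i.e. B wins by 50% */
    return 0;
}
```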
---

## CPI vs. IPC

| Metric     | Formula               | Better if |
|------------|------------------------|-----------|
| CPI        | cycles / instruction   | lower     |
| IPC        | instructions / cycle   | higher    |
| Wall-clock | inst × CPI / clock     | lower     |

Two ways to raise performance: raise clock (hit the Power Wall) or raise IPC (hit the ILP Wall). Today: how to raise IPC.

---

## Pipelining: The Classic 5-Stage Pipe
```mermaid
flowchart LR
    IF["IF"] --> ID["ID"]
    ID --> EX["EX"]
    EX --> MEM["MEM"]
    MEM --> WB["WB"]
```
- Instruction enters IF every cycle
- Five instructions in flight at once
- After fill: **one instruction retires per cycle** → CPI → 1

---

## Pipeline Space-Time

| Cycle | 1  | 2  | 3  | 4   | 5   | 6   | 7   |
|-------|----|----|----|-----|-----|-----|-----|
| `add` | IF | ID | EX | MEM | WB  |     |     |
| `sub` |    | IF | ID | EX  | MEM | WB  |     |
| `and` |    |    | IF | ID  | EX  | MEM | WB  |
| `or`  |    |    |    | IF  | ID  | EX  | MEM |
| `xor` |    |    |    |     | IF  | ID  | EX  |

5× speedup over unpipelined — *when it works*.

---

## Pipeline Hazards

| Hazard     | Cause                       | Fix                          |
|------------|-----------------------------|------------------------------|
| Structural | Two stages want the same HW | Duplicate (split I/D caches) |
| Data       | B needs A's result too soon | Forwarding / stall           |
| Control    | Branch target unknown       | Branch prediction            |

Without fixes, a hazard = pipeline bubble = lost cycle.

---

## Forwarding (Bypassing)

```
add t0, t1, t2   # t0 ready at end of EX
sub t3, t0, t4   # t0 wanted at start of EX next cycle
```

Bypass wire: ALU output → ALU input. No stall.
Load-use hazard: `lw t0, 0(t1)` then `add t3, t0, t4` still needs a 1-cycle bubble. Compilers fill the slot with unrelated work.
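At the source level the same principle applies: dependent operations serialize, while independent ones give the pipeline something to do between them. A minimal, hypothetical C sketch (the function names and the two-accumulator split are illustrative, not from the lecture):

```c
/* Illustrative only: how dependences limit pipelining.
 * sum_dep is one long dependency chain; sum_indep gives the
 * hardware two independent chains it can overlap. */
#include <stddef.h>

double sum_dep(const double *a, size_t n) {
    double s = 0.0;
    for (size_t i = 0; i < n; i++)
        s += a[i];                 /* every add waits on the previous one */
    return s;
}

double sum_indep(const double *a, size_t n) {
    double s0 = 0.0, s1 = 0.0;
    for (size_t i = 0; i + 1 < n; i += 2) {
        s0 += a[i];                /* two shorter, independent chains */
        s1 += a[i + 1];
    }
    if (n & 1) s0 += a[n - 1];
    return s0 + s1;
}
```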
---

## Superpipelining

Split each stage into shorter ones → higher clock.

| Processor          | Stages |
|--------------------|--------|
| UltraSPARC T1      | 6      |
| ARM Cortex-A53     | 8      |
| Intel Core 2       | 14     |
| AMD Zen 4          | ~19    |
| Pentium 4 Prescott | **31** |

Price: deeper pipes → longer mispredict flush.

---

## Superscalar: Issue Width
```mermaid
flowchart LR
    IF["IF (4 wide)"] --> ID["ID"]
    ID --> A1["INT ALU"]
    ID --> A2["INT ALU"]
    ID --> FPU["FP Unit"]
    ID --> LS["Load/Store"]
    A1 --> WB["Writeback"]
    A2 --> WB
    FPU --> WB
    LS --> WB
```
Ideal IPC → issue width. Typical modern: 4–8 wide.

---

## Issue Widths in the Wild

| Processor          | Width                    |
|--------------------|--------------------------|
| UltraSPARC T1      | 1                        |
| ARM Cortex-A53     | 2                        |
| Pentium            | 2                        |
| Intel Core 2       | 4                        |
| Apple M1 Firestorm | 8                        |
| Intel Golden Cove  | 6 decode → ~12 dispatch  |

Modern chips are **superpipelined *and* superscalar**.

---

## VLIW: Let the Compiler Do It

- "Instructions" = bundles of independent sub-ops (128+ bits)
- Compiler packs them; HW just executes
- No runtime scheduling hardware

**Shipped**: Itanium (failed), Transmeta (niche), GPUs, DSPs.
**Never won** in general-purpose CPUs: cache misses, branches, and separately-compiled binaries wreck static scheduling.

---

## Instruction Latencies

| Operation                    | Cycles      |
|------------------------------|-------------|
| Integer add, sub, and/or/xor | 1           |
| Integer multiply             | 3–5         |
| Integer divide               | 12–40+      |
| FP add/mul                   | 3–6         |
| L1 load (hit)                | 3–5         |
| L2 load                      | 10–15       |
| **DRAM load (miss)**         | **100–300+** |

Load latency is the silent killer of IPC.

---

## Branches and Speculation

- Branch every ~6 instructions on average
- Deep pipeline: 10–20 instructions fetched before branch resolves
- Guess, fetch speculatively, check later
- Wrong → **flush** every speculative instruction

---

## Mispredict Cost

20-stage pipe, 15-cycle penalty, 95% accuracy, branches 1-in-6:

```
wasted cycles / instruction = 15 × 0.05 / 6 = 0.125
```

On a 4-IPC core that is **~half** a retirement slot lost per instruction — enormous.

---

## Branch Predictor Evolution

| Predictor                | Accuracy |
|--------------------------|----------|
| Always-taken             | ~60%     |
| Back-taken / Forward-not | ~65%     |
| 1-bit dynamic            | ~80%     |
| 2-bit saturating         | ~90%     |
| Two-level / gshare       | ~95%     |
| TAGE / perceptron        | ~97%+    |

Even 97% accuracy leaves real money on the table in a deep pipe.

---

## Predication

Replace the branch with a conditional instruction:

```asm
# Branchy:              # Predicated:
    beq t0, t1, skip        sub     t2, t3, t4
    sub t2, t3, t4          cmov.eq t2, t5, t0, t1
skip:
```

- ARM: fully predicated
- x86 / MIPS / SPARC: `cmov` added later
- Itanium: predicate bit on *every* instruction

Best for short conditionals; wasted work on large blocks.

---

## Out-of-Order Execution
```mermaid
flowchart LR
    IF["Fetch"] --> DEC["Decode"]
    DEC --> REN["Rename"]
    REN --> RS["Scheduler"]
    RS --> FU["Functional Units"]
    FU --> ROB["Reorder Buffer"]
    ROB --> RET["Retire (in order)"]
```
Issue when ready, retire in order. Window size: 100–500 in-flight instructions on modern cores.

---

## Register Renaming

- Architectural regs: 8 (x86-32), 16 (x86-64), 32 (RISC-V, ARM64)
- Physical regs: 150–400
- Map each arch write to a fresh physical reg → breaks false dependencies
Critical for x86. Without renaming, eight architectural registers would serialize nearly everything.
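A toy sketch of the mechanism, assuming a simple bump-allocated free list and made-up register counts; real rename hardware also reclaims physical registers at retire, which is omitted here:

```c
/* Toy register-rename table: illustrative sketch only.
 * Maps architectural register numbers to physical registers,
 * handing out a fresh physical register for every destination write. */
#include <stdio.h>

#define NUM_ARCH 16      /* e.g. x86-64 integer registers */
#define NUM_PHYS 64      /* made-up physical register count */

static int map_table[NUM_ARCH];   /* arch reg -> current phys reg */
static int next_free = NUM_ARCH;  /* trivial "free list": bump allocator */

static void rename_init(void) {
    for (int a = 0; a < NUM_ARCH; a++)
        map_table[a] = a;         /* identity mapping at reset */
}

/* Rename one instruction "dst = src1 op src2". */
static void rename(int dst, int src1, int src2) {
    int p1 = map_table[src1];          /* sources read the current mapping */
    int p2 = map_table[src2];
    int pd = next_free++ % NUM_PHYS;   /* fresh physical dest (no reclamation) */
    map_table[dst] = pd;
    printf("arch r%d = r%d op r%d  ->  p%d = p%d op p%d\n",
           dst, src1, src2, pd, p1, p2);
}

int main(void) {
    rename_init();
    rename(1, 2, 3);   /* two writes to r1 ...                         */
    rename(1, 4, 5);   /* ... get different phys regs: no false dep.   */
    rename(6, 1, 7);   /* a reader of r1 sees the latest mapping       */
    return 0;
}
```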
> "The dirty little secret of OOO is that we are often not very much OOO at all." > — Andy Glew, Pentium Pro architect --- ## Brainiac vs. Speed-Demon | Style | Strategy | Example | |-------|----------|---------| | Brainiac | Wide OOO, high IPC, moderate clock | Pentium Pro, POWER4, Apple M1 | | Speed-Demon | Simple, high clock, lean on compiler | Alpha 21164, Pentium 4, Niagara T1 | Design philosophies oscillate. Today: **moderate brainiac with aggressive power management**. --- ## The Power Wall $$ P \propto C \cdot V^2 \cdot f $$ - 30% faster clock → ~2× power - Leakage rises with voltage and temperature - Ceilings today: | Form factor | TDP | |-------------|-----| | Server | 250–400 W | | Desktop | 65–150 W | | Ultralight laptop | 15–25 W | | Phone | 3–10 W | Pentium 4 hit the Power Wall at 3.8 GHz. --- ## The ILP Wall Real integer code sustains only **~1–2 IPC**. - Loads miss unpredictably - Branches limit window - True dependency chains are irreducible You cannot just widen the issue to 20 and expect 20× — the programs do not have that much parallelism. → Need **thread-level** parallelism: SMT + multi-core. --- ## What About x86? x86 is CISC: variable length, complex addressing, memory operands. How do you build a 6-wide OOO core for that?
```mermaid
flowchart LR
    MEM["x86 bytes"] --> DEC["Complex Decoders"]
    DEC --> UOP["μop Cache"]
    UOP --> REN["Rename + Scheduler"]
    REN --> FU["Exec Units"]
    FU --> RET["Retire"]
```
Answer: crack each x86 instruction into 1–4 simple **μops** and run those through a RISC-like OOO core.

---

## The μop Cache

Decoding x86 is slow and power-hungry. Since Sandy Bridge (2011): already-decoded μops are cached.

- Tight loops run entirely from the μop cache
- Decoders powered down
- Large power and latency savings

AMD Zen has an equivalent.

---

## Simultaneous Multithreading (SMT)
```mermaid
flowchart LR
    T0["Thread 0 PC/regs"] --> IF
    T1["Thread 1 PC/regs"] --> IF
    IF["Shared Fetch/Decode"] --> BACK["Shared Backend"]
    BACK --> RET["Retire (per thread)"]
```
- Duplicate: PC, arch regs, TLB tags
- Share: decoders, FUs, caches
- Cost: ~10% area. Benefit: −10% to +30% throughput.

Intel: **Hyper-Threading**. Niagara T3: 8-way SMT, 128 threads/chip.

---

## Multi-Core

Duplicate the *entire* core. Each core = its own pipeline, L1, L2. Shared L3 and memory controller.
**SMT**

- Cheap (~10% area)
- Fills bubbles
- Throughput only

**Multi-Core**

- Expensive (N×)
- True parallelism (see the sketch below)
- Real throughput + latency
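To make "true parallelism" concrete, here is a minimal POSIX-threads sketch that splits an array sum across worker threads; the thread count, array size, and names are illustrative assumptions, not from the lecture:

```c
/* Illustrative only: thread-level parallelism with POSIX threads.
 * Each worker sums its own slice; the main thread combines results. */
#include <pthread.h>
#include <stdio.h>

#define N_THREADS 4
#define N         (1 << 20)

static double data[N];

struct slice { int begin, end; double sum; };

static void *worker(void *arg) {
    struct slice *s = arg;
    double local = 0.0;
    for (int i = s->begin; i < s->end; i++)
        local += data[i];
    s->sum = local;                  /* one write at the end, no hot-loop sharing */
    return NULL;
}

int main(void) {
    for (int i = 0; i < N; i++) data[i] = 1.0;

    pthread_t tid[N_THREADS];
    struct slice s[N_THREADS];
    int chunk = N / N_THREADS;

    for (int t = 0; t < N_THREADS; t++) {
        s[t].begin = t * chunk;
        s[t].end   = (t == N_THREADS - 1) ? N : (t + 1) * chunk;
        pthread_create(&tid[t], NULL, worker, &s[t]);
    }

    double total = 0.0;
    for (int t = 0; t < N_THREADS; t++) {
        pthread_join(tid[t], NULL);
        total += s[t].sum;
    }
    printf("total = %.0f\n", total);   /* expect 1048576 */
    return 0;
}
```

Compile with `-pthread`. Each worker accumulates into a local variable and writes its result once, so the cores spend the hot loop on independent data.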
Modern chips: both — 6–128 cores with 2–8 SMT threads each.

---

## More Cores or Wider Cores?

Width scales as **O(n²)** in the scheduler and bypass network.

| Design            | Area | Peak IPC       | Single-thread |
|-------------------|------|----------------|---------------|
| One 10-wide core  | 6×   | 10 (in theory) | Sublinear     |
| Two 5-wide cores  | 2×   | 2 × 5 = 10     | Linear        |

Two medium cores beat one giant core for almost all workloads.

---

## Asymmetric Designs

Big cores + little cores on the same die:

- **ARM big.LITTLE** — originally for phone battery life
- **Apple M-series**: P-cores + E-cores
- **Intel 12th gen+**: P-cores + E-cores with a thread director
- **AMD Zen 4c / 5c**: compact cores for density

OS scheduler picks which type to use per thread.

---

## SoC: System on Chip

Integrate CPU + GPU + I/O + DSP + security + NPU on one die.

- Dominant in phones, tablets
- Now everywhere: Apple M-series, AMD Ryzen APUs, Snapdragon X
- Saves power, area, cost

One chip, whole computer.

---

## SIMD: Data Parallelism

One instruction, many data. A 128-bit register = 16 bytes = 4 ints = 2 doubles.

```
  [ a0    | a1    | a2    | a3    ]
+ [ b0    | b1    | b2    | b3    ]
= [ a0+b0 | a1+b1 | a2+b2 | a3+b3 ]
```

Single instruction, four independent adds.

---

## The SIMD Zoo

| ISA    | Extensions                 | Width (bits)          |
|--------|----------------------------|-----------------------|
| x86    | MMX → SSE → AVX → AVX-512  | 64 → 512              |
| POWER  | AltiVec / VSX              | 128                   |
| ARM    | NEON                       | 128                   |
| ARM    | **SVE / SVE2**             | 128–2048 (scalable)   |
| RISC-V | V extension                | scalable              |

SVE / RISC-V V: same binary, any hardware width.

---

## When SIMD Wins

- **Big wins**: image/video/audio, matrix math, crypto, string scanning
- **Small wins**: simple array loops the compiler can vectorize
- **Nothing**: pointer-chasing, tree walks, parsing, business logic

Most real benefit flows through **hand-tuned libraries** (BLAS, libjpeg, OpenSSL) you call without knowing.

---

## The Memory Wall

CPU speed: up 1000× in 25 years. DRAM latency: barely changed.
A single DRAM miss costs as many cycles as ~100 ALU ops.
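One way to feel the wall: the same arithmetic is far slower when every step waits on a dependent, likely-missing load. A minimal, illustrative C sketch (sizes and layout are arbitrary; in a real experiment the list nodes would be shuffled so each hop misses):

```c
/* Illustrative only: latency-bound pointer chasing vs. a plain array walk.
 * Each list step must wait for the previous load; the array walk has
 * predictable addresses the hardware can prefetch. */
#include <stdio.h>

struct node { struct node *next; long value; };

long chase(struct node *head) {
    long sum = 0;
    for (struct node *p = head; p != NULL; p = p->next)
        sum += p->value;           /* next address unknown until the load returns */
    return sum;
}

long scan(const long *a, long n) {
    long sum = 0;
    for (long i = 0; i < n; i++)
        sum += a[i];               /* streaming, prefetch-friendly */
    return sum;
}

int main(void) {
    enum { N = 1 << 16 };
    static struct node nodes[N];
    static long arr[N];
    for (long i = 0; i < N; i++) {
        nodes[i].value = i;
        nodes[i].next  = (i + 1 < N) ? &nodes[i + 1] : NULL;
        arr[i] = i;
    }
    /* In a real measurement the nodes would be scattered in memory
     * so each ->next hop is likely a DRAM miss. */
    printf("%ld %ld\n", chase(&nodes[0]), scan(arr, N));
    return 0;
}
```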
Most "CPU performance" today is actually *memory system* performance. --- ## Memory Hierarchy
```mermaid
flowchart LR
    CPU["Core"] --> L1["L1: 32 KB, 4 cyc"]
    L1 --> L2["L2: 1 MB, 14 cyc"]
    L2 --> L3["L3: 32 MB, 50 cyc"]
    L3 --> DRAM["DRAM: 16 GB, 200 cyc"]
    DRAM --> SSD["SSD: 1 TB, 50k cyc"]
```
L1 hit rate ~95%. LLC often **>50% of chip area**.

---

## Caches (Recap)

From our [Cache Memory lecture](10-cs631-2026-03-31-cache-memory.md):

- **Temporal** and **spatial locality** make caches work
- Direct-mapped → set-associative → fully associative
- Tag / index / offset split
- Hit-rate dominates performance

Modern twist: 3D V-Cache (AMD) stacks SRAM dies on top of compute.

---

## Latency vs. Bandwidth

| Dimension | Measures            | Improved by                      |
|-----------|----------------------|----------------------------------|
| Latency   | Time for one access  | Caches, shorter wires            |
| Bandwidth | Bytes per second     | Wider buses, more channels, HBM  |

**Highway**: adding lanes doubles cars/hour (bandwidth) but does not shorten the drive (latency).
"You can't bribe God." Latency is bounded by physics. Bandwidth scales with parallelism.
---

## Workload Sensitivity

| Workload                              | Bound by                  |
|---------------------------------------|---------------------------|
| Compilers, databases, tree walks      | Latency                   |
| Image/video processing                | Bandwidth                 |
| Web server (many unrelated requests)  | Bandwidth (across threads) |
| Interactive UI                        | Latency                   |

Design the memory system for the workload you care about.

---

## The Modern CPU

A tower of tricks on top of the single-cycle datapath:

1. **Pipeline** it
2. **Make it wider** (superscalar)
3. **Make it deeper** (superpipeline)
4. **Predict branches**
5. **Reorder** dynamically; **rename** registers
6. **Decode CISC to μops**
7. **Add SMT**, then **more cores**
8. **Add SIMD**
9. **Stack caches** in a hierarchy

Every trick answers a specific bottleneck. Every trick has a cost.

---

## Key Takeaways

- **Clock speed is one lever among many.** IPC comes from microarchitecture.
- **Pipelining, superscalar, OOO** together push single-thread IPC to the ILP Wall (~2–4 for real code).
- **Branch prediction** is what makes deep pipelines viable — misprediction is still the biggest single-thread cost.
- **Power Wall + ILP Wall** forced the industry to **multi-core and SMT**; the free-lunch era of rising clocks is over.
- **x86** is a RISC-like core wearing a CISC hat — μops and renaming do the heavy lifting.
- **Memory** is now the main bottleneck. Caches, bandwidth, and latency-hiding dominate real performance.

---

## Further Reading

- [Modern Microprocessors: A 90-Minute Guide](https://www.lighterra.com/papers/modernmicroprocessors/) — Patterson's survey (the basis for today)
- [Cache Memory lecture](10-cs631-2026-03-31-cache-memory.md)
- Patterson & Hennessy, *Computer Organization and Design*, ch. 4–5
- Hennessy & Patterson, *Computer Architecture: A Quantitative Approach*, ch. 3–5
- [Agner Fog's microarchitecture manuals](https://www.agner.org/optimize/) — pipelines and latencies for every real chip