
Advanced Architecture

Overview

Last class we finished a single-cycle RISC-V processor: one rising edge, one instruction, end of story. Nothing you buy is built that way. Modern CPUs — the chip in your laptop, your phone, a cloud VM — use pipelining, superscalar issue, branch prediction, out-of-order execution, multiple cores, simultaneous multithreading, SIMD vector units, and multi-level caches to extract orders of magnitude more performance from the same basic datapath. This lecture is a conceptual tour of those techniques.

We follow the structure of Jason Patterson's Modern Microprocessors: A 90-Minute Guide (2026 edition). The point is not to design these structures in Digital — a single modern CPU core contains several billion transistors — but to give you the vocabulary and mental model to read a CPU spec sheet, interpret benchmark numbers, and understand why those billion transistors exist. By the end of the lecture you should be able to explain why doubling the clock speed rarely doubles performance, why branch mispredictions are expensive, and why the memory system — not the ALU — usually determines how fast your code runs.

Learning Objectives

  • Explain why clock speed alone is a poor measure of processor performance; compute CPI and IPC
  • Describe a classic 5-stage pipeline and identify structural, data, and control hazards
  • Distinguish superpipelining (deeper pipes) from superscalar (wider issue) and from VLIW (compiler-scheduled parallelism)
  • Explain how branch prediction, speculative execution, and predication hide control hazards
  • Describe out-of-order execution, the reorder buffer, and why register renaming is essential on x86
  • State the Power Wall and the ILP Wall, and explain what they imply for design trade-offs
  • Describe how modern x86 chips decode into internal μops and why the μop cache matters
  • Contrast simultaneous multithreading (SMT) with multi-core and explain when each helps
  • Explain the n² cost of wider issue, motivating multi-core and asymmetric big-little designs
  • Describe SIMD (SSE, AVX, NEON, SVE) and identify code that benefits
  • Explain the memory wall, the L1/L2/L3/DRAM hierarchy, and the difference between latency and bandwidth

Prerequisites


More Than Just Megahertz

Naive performance measure: "my CPU is 4 GHz, yours is 3 GHz, so mine is 33% faster." Almost always wrong. What matters is not cycles per second but useful work per cycle.

CPI and IPC

Metric Formula Meaning
CPI cycles / instruction Average cycles each instruction consumes
IPC instructions / cycle Reciprocal of CPI; higher is better
MIPS IPC × clock (MHz) Instructions retired per second (millions)
Wall-clock time instructions × CPI / clock What the user actually experiences

A 3 GHz CPU with IPC = 4 retires 12 billion instructions per second. A 4 GHz CPU with IPC = 2 retires 8 billion. The 3 GHz part wins by 50%.
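
The same comparison as a minimal C sketch, using a hypothetical 10-billion-instruction program (the instruction count is arbitrary; only the ratio matters):

#include <stdio.h>

/* Wall-clock time = instructions * CPI / clock frequency.
   Illustrative numbers: the two hypothetical CPUs discussed above. */
int main(void) {
    double instructions = 10e9;

    double time_a = instructions * (1.0 / 4.0) / 3.0e9;  /* IPC 4 -> CPI 0.25, 3 GHz */
    double time_b = instructions * (1.0 / 2.0) / 4.0e9;  /* IPC 2 -> CPI 0.5,  4 GHz */

    printf("3 GHz, IPC 4: %.3f s\n", time_a);  /* ~0.833 s */
    printf("4 GHz, IPC 2: %.3f s\n", time_b);  /* ~1.250 s */
    return 0;
}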

Historical Example

In 1997, a 250 MHz MIPS R10000 outperformed a 300 MHz Pentium II on integer benchmarks, and a 600 MHz Alpha 21164 obliterated both. Same era, same semiconductor process generation, wildly different performance per cycle. The difference was microarchitecture — the subject of today's lecture.

The "megahertz myth" is why Apple's 2006 transition from PowerPC G5 (2.7 GHz) to Intel Core Duo (1.8 GHz) still delivered better performance. IPC more than compensated for the clock drop.

Pipelining and Instruction-Level Parallelism

The single-cycle design wastes most of the chip most of the time: while the ALU is computing, the instruction fetch logic sits idle; while data memory is writing back, the PC increment sits idle. Pipelining fixes this by overlapping successive instructions so that every unit is busy every cycle.

The Classic 5-Stage Pipeline

flowchart LR
    IF["IF<br/>Instruction<br/>Fetch"] --> ID["ID<br/>Decode +<br/>Reg Read"]
    ID --> EX["EX<br/>ALU"]
    EX --> MEM["MEM<br/>Data Memory"]
    MEM --> WB["WB<br/>Writeback"]

Each stage is separated from the next by a pipeline register (a bank of flip-flops). A new instruction enters IF every cycle; five instructions are in flight at once.

Space-Time View

After the pipeline fills, one instruction retires per cycle — CPI approaches 1.0, a 5× speedup over an unpipelined 5-cycle implementation (at the same clock).

Cycle 1 2 3 4 5 6 7
add IF ID EX MEM WB
sub IF ID EX MEM WB
and IF ID EX MEM WB
or IF ID EX MEM
xor IF ID EX
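
Counting cycles makes the "approaches 1.0" claim concrete: a k-stage pipeline spends k cycles filling, then retires one instruction per cycle. A minimal sketch with hypothetical instruction counts:

#include <stdio.h>

/* Cycles to run n instructions: unpipelined vs. a k-stage pipeline
   with no hazards (fill once, then one retirement per cycle). */
int main(void) {
    int k = 5;
    long counts[] = {5, 100, 1000000};

    for (int i = 0; i < 3; i++) {
        long n = counts[i];
        long unpiped = n * k;          /* each instruction takes k cycles */
        long piped   = k + (n - 1);    /* fill once, then 1 per cycle */
        printf("n=%8ld  speedup = %.2fx\n", n, (double)unpiped / piped);
    }
    return 0;
}
/* Speedup tends toward 5x as n grows; short runs never reach it. */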

Hazards

Three things break the one-instruction-per-cycle promise:

Hazard Cause Mitigation
Structural Two stages want the same hardware Duplicate the resource (separate I-cache and D-cache)
Data Instruction B needs a result that A has not written yet Forwarding (bypass) or stall
Control Branch target unknown until EX — what do we fetch? Branch prediction

Forwarding (Bypassing)

Consider add t0, t1, t2 followed by sub t3, t0, t4. Without help, sub reads t0 from the register file before add's WB has written it. Forwarding wires the ALU output back to the ALU input so that the fresh value is available one cycle earlier — no stall needed. This is the "bypass" you will see on every modern pipeline diagram.

What if the producer is a load? `lw t0, 0(t1)` followed by `add t3, t0, t4` has a **load-use hazard**. The value is not available until the end of MEM, but `add` wants it at the start of EX. Even with forwarding, the pipeline must stall one cycle (insert a **bubble**). Good compilers reorder instructions to fill this slot with something useful.

Deeper and Wider: Superpipelining and Superscalar

Superpipelining

Shorter stages allow a higher clock frequency. Split each classic stage into two or three shorter ones — 10, 15, even 31 stages in total. The total work per instruction is unchanged, but each stage does less of it, so the clock can tick faster and more instructions are in flight at once.

Processor Stages
UltraSPARC T1 6
ARM Cortex-A53 8
Intel Core 2 14
Intel Skylake / Sunny Cove 14–19
AMD Zen 4 ~19
Pentium 4 Prescott 31

Trade-off: a deeper pipe has a bigger mispredict penalty. Pentium 4 paid dearly for its 31 stages.

Superscalar

Add issue width: fetch, decode, and execute multiple instructions in parallel each cycle. A 4-wide superscalar has four ALUs (or a mix of integer/FP/load-store units) and can retire up to 4 instructions per cycle — IPC up to 4.0.

flowchart LR
    IF["IF<br/>(4 wide)"] --> ID["ID"]
    ID --> ALU1["INT ALU"]
    ID --> ALU2["INT ALU"]
    ID --> FPU["FP Unit"]
    ID --> LS["Load/Store"]
    ALU1 --> WB["Writeback<br/>(4 wide)"]
    ALU2 --> WB
    FPU --> WB
    LS --> WB
Processor Issue Width
UltraSPARC T1 1
ARM Cortex-A53 2
Pentium (original P5) 2
Intel Core 2 4
Apple M1 Firestorm 8
Intel Golden Cove 6 (decode) → ~12 (μop dispatch)

Superpipelined + Superscalar

Modern CPUs combine both. A Zen 4 core is ~19 stages deep and roughly 6 instructions wide, extracting parallelism in two dimensions at once.


VLIW: Let the Compiler Do It

Very Long Instruction Word: each "instruction" is a bundle of independent sub-operations, and the compiler statically packs them. The hardware just executes what it is told — no dependency checking, no dynamic scheduling.

  • Pros: simpler (and therefore faster, cooler) hardware; the compiler has a whole-program view
  • Cons: loses when runtime behavior is unpredictable (cache misses, branches); binaries must be recompiled for every new chip width; compilers struggle to find parallelism in irregular code

Who shipped it: Intel Itanium (IA-64, EPIC) — failed in the market. Transmeta Crusoe — niche. GPUs and DSPs — still widely used, because those workloads are regular.

Who did not: anyone selling a general-purpose CPU today.


Dependencies and Latencies

Even with a wide, deep pipeline, you cannot execute instructions in parallel if they depend on each other.

Operation Typical Latency (cycles)
Integer add, sub, and, or, xor 1
Integer multiply 3–5
Integer divide 12–40+
FP add/multiply 3–6
FP divide 10–25
L1 load (cache hit) 3–5
L2 load 10–15
DRAM load (cache miss) 100–300+

Load latency hurts most. Loads happen early in most code sequences and almost everything else depends on what they return. A 4-wide superscalar that stalls 5 cycles on each load is no faster than a scalar in-order machine if every instruction is a load.
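
Dependency chains are easy to create by accident in ordinary code. Both loops below perform the same additions, but the first threads every add through a single accumulator (one long chain), while the second keeps four independent chains that a superscalar core can overlap. A sketch; actual gains depend on the compiler and what floating-point reassociation it is allowed to do:

#include <stddef.h>

/* One accumulator: every add depends on the previous add. */
double sum_serial(const double *a, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* Four accumulators: four independent dependency chains,
   so several adds can be in flight at once. */
double sum_unrolled(const double *a, size_t n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);
}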


Branches and Branch Prediction

Every ~5–6 instructions is a branch. In a deep pipeline, by the time we learn whether a branch is taken, we have already fetched 10–20 instructions from somewhere. Which somewhere?

Speculation

Guess the target, fetch speculatively, check later. If the guess was right, free performance. If wrong, flush the pipeline — all speculatively-fetched instructions are discarded and the hardware starts over.

Mispredict Penalty

Penalty ≈ pipeline depth between fetch and the stage that resolves the branch. On a 20-stage pipeline with a 15-cycle penalty, even 95% prediction accuracy costs you:

wasted cycles per branch      = 15 × 0.05 = 0.75
branches per instruction      ≈ 1/6
wasted cycles per instruction = 0.75 / 6 ≈ 0.125  →  ~12.5% of peak performance gone

Predictor Evolution

Predictor Accuracy Notes
Always-taken ~60% Trivial
Backward taken, forward not ~65% "Loops loop, ifs fall through"
1-bit dynamic ~80% Remember last outcome
2-bit saturating counter ~90% Resist single flips
Two-level adaptive / gshare ~95% History of the last N branches
TAGE / perceptron (modern) ~97%+ Multiple predictors voted
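
The difference a good predictor makes is easy to feel from ordinary code: run the same data-dependent branch over randomly ordered data (roughly a coin flip per element) and over sorted data (long predictable runs). A rough C benchmark sketch; note that at higher optimization levels the compiler may replace the branch with a conditional move and flatten the difference:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N    1000000
#define REPS 100

static int v[N];

static int cmp(const void *a, const void *b) {
    return *(const int *)a - *(const int *)b;
}

/* The branch under test: taken for roughly half the values. */
static long count_big(void) {
    long count = 0;
    for (int i = 0; i < N; i++)
        if (v[i] >= 128)
            count++;
    return count;
}

static double time_reps(void) {
    clock_t t0 = clock();
    long sink = 0;
    for (int r = 0; r < REPS; r++)
        sink += count_big();
    clock_t t1 = clock();
    return sink ? (double)(t1 - t0) / CLOCKS_PER_SEC : 0.0;
}

int main(void) {
    for (int i = 0; i < N; i++)
        v[i] = rand() % 256;               /* random order: near-unpredictable branch */
    printf("random: %.3f s\n", time_reps());

    qsort(v, N, sizeof v[0], cmp);         /* sorted: long predictable runs */
    printf("sorted: %.3f s\n", time_reps());
    return 0;
}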

Predication

An alternative: replace the branch with a predicated instruction that does the work either way and conditionally commits the result.

# Before (branchy): skip the subtraction when t0 == t1
    beq  t0, t1, skip
    sub  t2, t3, t4
skip:

# After (predicated): do the work unconditionally, commit conditionally
# (cmov.eq is illustrative pseudo-assembly: if t0 == t1, t2 <- t5)
    sub      t2, t3, t4
    cmov.eq  t2, t5, t0, t1
  • Classic 32-bit ARM has been fully predicated since day one (AArch64 dropped most of it, keeping conditional selects)
  • Alpha, x86, MIPS, SPARC added conditional moves (cmov)
  • IA-64 (Itanium) had a predicate bit on every instruction

Great for short if-then-else. Terrible for large blocks — you pay to execute the unused side.
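
Ordinary compilers make the same trade-off in C. Below is a sketch of the branchy and the predicated ("branchless") style of the same loop; whether the compiler actually emits a conditional move or keeps the branch is its own decision:

/* Branchy: mispredicts when the data is unpredictable. */
long count_negative_branchy(const int *v, long n) {
    long count = 0;
    for (long i = 0; i < n; i++)
        if (v[i] < 0)
            count++;
    return count;
}

/* Predicated style: the comparison result (0 or 1) is added
   unconditionally, so there is no branch to predict. */
long count_negative_branchless(const int *v, long n) {
    long count = 0;
    for (long i = 0; i < n; i++)
        count += (v[i] < 0);
    return count;
}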


Out-of-Order Execution and Register Renaming

Even with a perfect predictor and a 6-wide pipeline, straight-line code has too many dependency chains to feed six functional units every cycle. Out-of-order (OOO) execution lets the hardware pick ready instructions from a window of 100–500 in-flight operations and issue them as soon as their inputs are available, regardless of program order.

OOO Pipeline

flowchart LR
    IF["Fetch"] --> DEC["Decode"]
    DEC --> REN["Rename<br/>(arch &rarr; phys regs)"]
    REN --> RS["Reservation<br/>Station /<br/>Scheduler"]
    RS --> FU["Functional<br/>Units (many)"]
    FU --> ROB["Reorder<br/>Buffer"]
    ROB --> RET["Retire<br/>(in order)"]

Key ideas:

  • Fetch, decode, rename, dispatch happen in order
  • Execute happens out of order — whoever is ready, goes
  • Retire happens in order — the architectural state only updates when the oldest in-flight instruction completes

Register Renaming

Architectural registers (rax, rbx, ..., or x0–x31) are few. The pipeline has hundreds of operations in flight that need temporary scratch. Renaming maps each architectural register write to a physical register from a pool of 150–400. Two instructions that write rax get mapped to different physical registers and can execute in parallel.

This matters most on x86: in 32-bit mode the ISA exposes only 8 general-purpose registers. Renaming expands that to hundreds internally — without it, x86 OOO would be almost worthless.
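
To make the bookkeeping concrete, here is a toy C sketch of a rename table plus free list, the front-end structure that hands every architectural destination a fresh physical register. Purely illustrative (real hardware also reclaims registers at retirement and repairs the map on mispredicts):

#include <stdio.h>

#define ARCH_REGS 16     /* e.g. x86-64's 16 GPRs */
#define PHYS_REGS 128    /* a hypothetical physical register file */

static int rename_map[ARCH_REGS];   /* arch reg -> current phys reg */
static int free_list[PHYS_REGS];
static int free_top;

static void rename_init(void) {
    for (int a = 0; a < ARCH_REGS; a++)
        rename_map[a] = a;                      /* identity mapping at reset */
    free_top = 0;
    for (int p = PHYS_REGS - 1; p >= ARCH_REGS; p--)
        free_list[free_top++] = p;              /* the rest start out free */
}

/* Rename one instruction "dst = src1 op src2".
   Sources read the current mapping; the destination gets a new
   physical register, breaking write-after-write/read hazards. */
static void rename_instr(int dst, int src1, int src2) {
    int p1 = rename_map[src1];
    int p2 = rename_map[src2];
    int pd = free_list[--free_top];             /* allocate (sketch: assume not empty) */
    rename_map[dst] = pd;
    printf("arch r%d = r%d op r%d   ->   phys p%d = p%d op p%d\n",
           dst, src1, src2, pd, p1, p2);
}

int main(void) {
    rename_init();
    rename_instr(0, 1, 2);   /* both write arch r0 ... */
    rename_instr(0, 3, 4);   /* ... but get different phys regs: can overlap */
    return 0;
}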

Andy Glew's Confession

"The dirty little secret of OOO is that we are often not very much OOO at all." — Andy Glew, Pentium Pro architect

Typical OOO cores extract only ~20–40% more IPC than a well-tuned in-order core running the same code. It is expensive (millions of transistors, significant power) for a modest but real win.


The Brainiac Debate

Two design philosophies, coined by Linley Gwennap (1993):

Style Strategy Example
Brainiac Complex OOO, wide issue, aggressive speculation; high IPC, moderate clock Pentium Pro, POWER4, Apple M1
Speed-Demon Simple in-order core, high clock, lean on compiler Alpha 21164, Pentium 4, Niagara T1

Designs have oscillated between the two:

  • Intel: Pentium Pro (brainiac) → Pentium 4 (speed-demon, failed) → Core / Core i (brainiac) → today (brainiac, tempered by power)
  • DEC Alpha: 21064/21164 (speed-demon) → 21264 (brainiac) → project cancelled
  • SPARC: SuperSPARC (brainiac) → UltraSPARC / Niagara (speed-demon + SMT)
  • ARM: Cortex-A7 (tiny speed-demon) → Cortex-X / Apple M-series (brainiac)

Modern reality: power constraints make pure speed-demon untenable (can't raise clocks past ~5 GHz economically) and pure brainiac wasteful. Everyone is now a moderate brainiac with power management.


The Power Wall and the ILP Wall

The Power Wall

Dynamic switching power:

P ∝ C × V² × f

Raising the clock 30% typically requires a voltage bump to keep timing margins, and voltage enters squared. Net: ~2× power and heat for a 30% frequency increase. Leakage adds a temperature-dependent floor.
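
Plugging numbers into that relation shows where the factor of ~2 comes from. A minimal sketch, assuming the voltage has to rise roughly in proportion to the frequency:

#include <stdio.h>

/* Relative dynamic power for a frequency bump that also needs a
   proportional voltage bump: P ratio = (V ratio)^2 * (f ratio). */
int main(void) {
    double f_ratio = 1.30;           /* +30% clock */
    double v_ratio = 1.30;           /* assumption: voltage rises in step */
    double p_ratio = v_ratio * v_ratio * f_ratio;
    printf("power ratio: %.2fx\n", p_ratio);   /* ~2.20x */
    return 0;
}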

Practical ceilings today:

Form factor Sustained TDP
Server (per socket) 250–400 W
Desktop 65–150 W
Laptop (workstation) 28–45 W
Ultralight laptop 15–25 W
Phone / tablet 3–10 W
Watch <1 W

Pentium 4 hit the Power Wall at 3.8 GHz and was cancelled. IBM POWER6 and AMD Bulldozer hit it too.

The ILP Wall

Real programs have limited fine-grained parallelism. After every architectural trick — OOO, wide issue, big reorder buffer — typical integer code (SPECint) sustains only ~1–2 IPC. Scientific / vectorizable code can do much better, but that is not what most software looks like.

Once you cannot raise clock (Power Wall) and cannot extract more IPC from a single thread (ILP Wall), the only way to get more performance is more threads — SMT and multi-core.


What About x86?

x86 is a CISC ISA: variable-length instructions, complex addressing modes, read-modify-write memory operands. It looks nothing like the clean RISC pipelines we have been drawing. How does Intel ship a 6-wide OOO x86 core?

Decoupled Front-End: RISCy x86

flowchart LR
    MEM["x86 bytes<br/>(variable length)"] --> DEC["Complex<br/>Decoders"]
    DEC --> UOP["μop Cache<br/>(decoded)"]
    UOP --> ROB["Rename +<br/>Scheduler"]
    ROB --> FU["Functional<br/>Units"]
    FU --> RET["Retire"]

Steps:

  1. Fetch x86 bytes from I-cache
  2. Crack each x86 instruction into 1–4 simple μops (micro-operations) that resemble RISC ops
  3. Rename, schedule, and execute μops in an OOO core
  4. Retire in original x86 instruction order
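
As a purely conceptual sketch (not any real μop encoding, which is undocumented and differs per microarchitecture), step 2 might turn a read-modify-write instruction such as add dword [rbx], eax into three RISC-like operations:

#include <stdio.h>

/* Conceptual only: one x86 read-modify-write instruction cracked
   into load / ALU / store micro-ops with internal temporaries. */
enum uop_kind { UOP_LOAD, UOP_ALU_ADD, UOP_STORE };

struct uop {
    enum uop_kind kind;
    const char   *dst, *src1, *src2;
};

static const struct uop cracked[] = {
    { UOP_LOAD,    "tmp0", "rbx",  ""     },   /* tmp0 <- mem[rbx]   */
    { UOP_ALU_ADD, "tmp1", "tmp0", "eax"  },   /* tmp1 <- tmp0 + eax */
    { UOP_STORE,   "",     "rbx",  "tmp1" },   /* mem[rbx] <- tmp1   */
};

int main(void) {
    static const char *names[] = { "load", "add", "store" };
    for (int i = 0; i < 3; i++)
        printf("uop %d: %-5s dst=%-4s src1=%-4s src2=%s\n",
               i, names[cracked[i].kind], cracked[i].dst,
               cracked[i].src1, cracked[i].src2);
    return 0;
}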

The μop Cache

Decoding x86 is expensive (power-hungry, gates-deep). Since Sandy Bridge (2011), Intel caches already-decoded μops so that loops run straight from the μop cache, skipping the decoders entirely. AMD Zen has an equivalent μop cache.

Issue Width Ambiguity

Modern x86 cores decode up to 5–6 x86 instructions per cycle, producing ~6–8 fused μops, which the backend issues to 10–12 execution ports. Calling it "6-wide" or "12-wide" depends on which stage you measure.

Historical Note

The RISCy-x86 approach was invented twice independently in the mid-1990s: NexGen Nx586 (1994) and Intel Pentium Pro / P6 (1995). Transmeta tried to do the same thing in software instead of hardware; it worked but was too slow.


Threads: SMT and Multi-Core

If one thread cannot keep a 6-wide core busy (ILP Wall), try running two threads through it at the same time. Any cycle one thread stalls, the other fills the gap.

Simultaneous Multithreading (SMT)

flowchart LR
    T0["Thread 0<br/>PC, regs"] --> IF
    T1["Thread 1<br/>PC, regs"] --> IF
    IF["Shared<br/>Fetch/Decode"] --> BACK["Shared<br/>Backend"]
    BACK --> RET["Retire<br/>(per thread)"]
  • Duplicate: PC, architectural registers, TLB tags
  • Share: everything else — decoders, functional units, caches, rename tables

Cost: ~5–10% extra core area. Benefit: −10% to +30% performance on the same core depending on workload.

Intel calls it Hyper-Threading. POWER, Apple, and ARM have various flavors. AMD Zen has 2-way SMT. Sun UltraSPARC T3 Niagara pushed it to 8-way SMT per core — 128 threads on a chip.

Multi-Core

Duplicate the entire core. Each core has its own pipeline, L1, and L2; cores share the L3 and the memory controller.

Approach Area cost Independence
SMT only ~10% Logical threads share all resources
Multi-core ~100% per extra core True parallelism; no resource contention
Both (modern) Big Multi-core at top level, SMT within each core

Today's shipping chips: 6–16 cores for consumer, 32–128 for server, often with 2–4 SMT threads per core.
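
From software's point of view, all of this shows up as thread-level parallelism: spawn more threads and let the OS spread them across cores and SMT threads. A minimal POSIX-threads sketch in C (thread count and data size are arbitrary):

/* build: cc -O2 -pthread threads.c */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define N 1000000

static double data[N];

struct job { long lo, hi; double partial; };

static void *worker(void *arg) {
    struct job *j = arg;
    double s = 0.0;
    for (long i = j->lo; i < j->hi; i++)
        s += data[i];
    j->partial = s;          /* each thread writes only its own slot */
    return NULL;
}

int main(void) {
    for (long i = 0; i < N; i++)
        data[i] = 1.0;

    pthread_t tid[NTHREADS];
    struct job jobs[NTHREADS];
    long chunk = N / NTHREADS;

    for (int t = 0; t < NTHREADS; t++) {
        jobs[t].lo = t * chunk;
        jobs[t].hi = (t == NTHREADS - 1) ? N : (t + 1) * chunk;
        pthread_create(&tid[t], NULL, worker, &jobs[t]);
    }

    double total = 0.0;
    for (int t = 0; t < NTHREADS; t++) {
        pthread_join(tid[t], NULL);
        total += jobs[t].partial;
    }
    printf("sum = %.0f\n", total);   /* 1000000 */
    return 0;
}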


More Cores or Wider Cores?

Why not just make one really wide 20-issue core?

Quadratic Cost of Width

Dependency checking, the rename tables, the scheduler, and the bypass network all scale roughly as O(n²) in issue width. Doubling from 4-wide to 8-wide roughly quadruples the logic and adds wire delay that hurts clock speed.

Design Area (vs. one 5-wide core) Peak IPC Scaling
One 10-wide core ~6× 10 (theoretical) Sublinear (diminishing single-thread ILP)
Two 5-wide cores ~2× 2 × 5 = 10 Linear (throughput scales with threads)

For the same transistor budget, two medium cores beat one giant core — and run two threads truly in parallel.

Hybrid / Asymmetric Designs

Modern consumer CPUs mix big and small cores on one die:

  • ARM big.LITTLE: fast "big" cores for interactive workloads, tiny "LITTLE" cores for background (saves battery)
  • Apple M-series: P-cores (performance) + E-cores (efficiency)
  • Intel 12th gen+: P-cores + E-cores with a hardware thread director
  • AMD Zen 4c / Zen 5c: smaller cores on the same ISA for density

System-on-Chip (SoC)

Integrate CPU + GPU + I/O + DSP + security + networking on one die. Dominant in phones and tablets; increasingly common everywhere (Apple M-series, AMD Ryzen APUs). Saves power and area vs. separate chips.


Data Parallelism: SIMD

One instruction, many data. Pack multiple values into a wide register and operate on all of them simultaneously.

The Idea

A 128-bit register can hold:

  • 16 × 8-bit bytes (pixel RGBA, audio samples)
  • 8 × 16-bit shorts
  • 4 × 32-bit ints or floats
  • 2 × 64-bit doubles

A single SIMD add instruction applies the operation to all lanes in parallel.

  [ a0 | a1 | a2 | a3 ]
+ [ b0 | b1 | b2 | b3 ]
= [ a0+b0 | a1+b1 | a2+b2 | a3+b3 ]
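
In C this maps directly onto compiler intrinsics. A minimal SSE2 sketch (SSE2 is baseline on every x86-64 machine) adding four 32-bit integers with one instruction:

/* build on x86-64: cc -O2 simd_add.c */
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdio.h>

int main(void) {
    int a[4] = { 1, 2, 3, 4 };
    int b[4] = { 10, 20, 30, 40 };
    int r[4];

    __m128i va = _mm_loadu_si128((const __m128i *)a);  /* load 4 lanes */
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i vr = _mm_add_epi32(va, vb);                /* 4 adds at once */
    _mm_storeu_si128((__m128i *)r, vr);

    printf("%d %d %d %d\n", r[0], r[1], r[2], r[3]);   /* 11 22 33 44 */
    return 0;
}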

The SIMD Zoo

ISA Extension Width
x86 MMX (1997) 64
x86 SSE / SSE2 / SSE3 / SSE4 128
x86 AVX / AVX2 256
x86 AVX-512 512
x86 AVX10 (2024+) 256/512 (flexible)
POWER AltiVec / VSX 128
ARM NEON 128
ARM SVE / SVE2 128–2048 (length-agnostic)
RISC-V V extension scalable

ARM SVE and RISC-V V take a different approach: the same binary runs on any hardware vector width. Each loop iteration simply consumes whatever the hardware can offer.

When SIMD Helps

  • Huge win: image/video/audio processing, matrix math, cryptography, string scanning
  • Small win: regular array loops the compiler can auto-vectorize
  • No win: pointer-chasing, tree walks, parsing, general business logic

Auto-vectorization in compilers is still limited; most of the benefit comes from hand-tuned libraries (BLAS, libjpeg, OpenSSL, JVM intrinsics) that your code calls without knowing.


The Memory Wall

DRAM latency has barely improved in decades, while processor clock speeds grew by roughly three orders of magnitude over that time. The gap — the memory wall — is now the #1 performance bottleneck for most real workloads.

Rough Numbers (2026 consumer system)

Level Size Latency (CPU cycles)
Register 16–32 values 0
L1 cache 32–128 KB 3–5
L2 cache 256 KB–2 MB 10–15
L3 / LLC 4–96 MB 40–60
DRAM 8–128 GB 150–300
SSD (NVMe) 256 GB–4 TB ~50,000
HDD 1–20 TB ~5,000,000

A single cache miss to DRAM costs as many cycles as hundreds of single-cycle ALU operations.

Why Caches Work

Real programs exhibit locality (review from Cache Memory):

  • Temporal: recently-used data will be used again soon
  • Spatial: data near a recent access will be accessed soon

Modern L1 hit rates are ~90–95%; L2 catches most of the rest; L3 catches most of what is left. Only a few percent of loads ever go to DRAM.
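
Locality is visible from ordinary code. The two functions below read exactly the same bytes, but the row-major walk streams through memory one cache line after another, while the column-major walk strides across lines and, for large matrices, misses far more often. (A sketch; where the difference kicks in depends on your cache sizes.)

#include <stddef.h>

#define ROWS 4096
#define COLS 4096

/* Row-major walk: consecutive accesses hit the same cache line. */
long sum_rows(const int m[ROWS][COLS]) {
    long s = 0;
    for (size_t i = 0; i < ROWS; i++)
        for (size_t j = 0; j < COLS; j++)
            s += m[i][j];
    return s;
}

/* Column-major walk: each access jumps COLS * sizeof(int) bytes,
   so nearly every access touches a new cache line. */
long sum_cols(const int m[ROWS][COLS]) {
    long s = 0;
    for (size_t j = 0; j < COLS; j++)
        for (size_t i = 0; i < ROWS; i++)
            s += m[i][j];
    return s;
}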

Cache Hierarchy

flowchart LR
    CPU["Core"] --> L1["L1<br/>~32 KB<br/>~4 cycles"]
    L1 --> L2["L2<br/>~1 MB<br/>~14 cycles"]
    L2 --> L3["L3 / LLC<br/>~32 MB<br/>~50 cycles"]
    L3 --> DRAM["DRAM<br/>~16 GB<br/>~200 cycles"]
    DRAM --> SSD["SSD<br/>~1 TB<br/>~50000 cycles"]

Last-level cache (LLC) often consumes more than half of the chip area. AMD's 3D V-Cache bonds an extra SRAM die on top of the compute die to boost LLC size dramatically.


Bandwidth vs. Latency

Memory has two performance dimensions that are often conflated:

Dimension What it measures How to improve
Latency Time for one access Faster DRAM, more cache, shorter wires
Bandwidth Total bytes per second Wider buses, more channels, stacked DRAM

Highway Analogy

  • Lanes = bus width → bandwidth: adding lanes doubles cars/hour but does not make any one car go faster
  • Speed limit = signaling rate → latency: physics sets an upper bound
  • Distance from A to B = physical path length → the latency floor; adding lanes does not move the destination closer

"You Can't Bribe God"

Physical distance + speed-of-light + capacitive wire loading sets a hard floor on DRAM access time. Bandwidth can scale by parallelism (more channels, HBM stacks, DDR5-6400). Latency cannot.

Workload Sensitivity

Workload Latency-bound or bandwidth-bound?
Pointer-chasing (linked lists, trees, compilers, databases) Latency
Image/video processing, scientific dense code Bandwidth
Web server handling unrelated requests Bandwidth (across threads)
Interactive UI Latency

Practice Problems

Problem 1: Megahertz vs. IPC

CPU A runs at 4.0 GHz with CPI = 1.6. CPU B runs at 3.0 GHz with CPI = 0.8. Which has higher performance on the same code, and by how much?

Solution

Wall-clock time ∝ CPI / clock.

  • CPU A: 1.6 / 4.0 = 0.40 ns per instruction
  • CPU B: 0.8 / 3.0 ≈ 0.267 ns per instruction

CPU B is 0.40 / 0.267 ≈ **1.5× faster** despite a 25% lower clock. Higher IPC wins.

Problem 2: Mispredict Penalty

A 20-stage pipeline resolves branches at stage 15. Branches occur every 6 instructions and the predictor is 95% accurate. Approximately what fraction of peak IPC is lost to mispredictions?

Solution

  • Penalty per mispredict: 15 stages flushed = 15 cycles wasted
  • Mispredict rate: 5% of branches
  • Branch rate: 1 / 6 instructions

Wasted cycles per instruction = 15 × 0.05 / 6 = **0.125**

On a 1-IPC baseline, that is ~12.5% of peak lost just to mispredictions. On a 4-IPC superscalar it is worse — a flushed cycle discards *four* potential retirements.

Problem 3: Why OOO Exists

Why can't a clever compiler statically schedule code well enough to make OOO hardware unnecessary?

Solution

The compiler cannot know, at compile time:

  1. Which loads will hit or miss the cache — latency varies by 100×
  2. Which branches will be taken on a given run — input-dependent
  3. How contention with other threads / cores will play out
  4. What code will follow across a function call boundary into a separately-compiled binary

OOO hardware sees the actual runtime dependencies and latencies and reorders based on what is **actually ready now**. A compiler has to plan for the worst case. This is also why Itanium's VLIW approach struggled on general-purpose code.

Problem 4: x86 Registers

Why does x86 depend more heavily on register renaming than RISC-V does?

Solution

x86 exposes only 8 GPRs in 32-bit mode and 16 in 64-bit mode. With that few architectural names, the compiler is forced to reuse the same register repeatedly, creating artificial **false dependencies** (write-after-write, write-after-read) that block parallelism. Renaming maps those reuses to distinct physical registers, recovering the parallelism the ISA hides.

RISC-V exposes 32 GPRs — more breathing room, fewer artificial conflicts. Renaming still helps but is less of a rescue mission.

Problem 5: Average Memory Access Time (AMAT)

A system has L1 hit rate 95% at 4 cycles, L2 hit rate 80% (of L1 misses) at 14 cycles, L3 hit rate 50% (of L2 misses) at 50 cycles, DRAM at 200 cycles. Compute AMAT.

Solution
AMAT = L1_latency + P(miss L1) × (L2_latency + P(miss L2) × (L3_latency + P(miss L3) × DRAM))
     = 4 + 0.05 × (14 + 0.20 × (50 + 0.50 × 200))
     = 4 + 0.05 × (14 + 0.20 × 150)
     = 4 + 0.05 × (14 + 30)
     = 4 + 0.05 × 44
     = 4 + 2.2
     = 6.2 cycles
A 95% L1 hit rate keeps the effective latency close to L1 even though a DRAM miss is 200 cycles. This is the magic of caches.

Problem 6: SIMD Applicability

You have two workloads to speed up with AVX-512. Which is a good candidate and why?

  1. Convert an RGB image to grayscale (one arithmetic operation per pixel)
  2. Walk an on-disk B-tree to find a record matching a key
Solution

  • **(1) wins big**: independent per-pixel work, regular memory access, arithmetic that maps cleanly to packed 8-bit SIMD. Expect a 4–16× speedup.
  • **(2) loses**: each step of the walk depends on the *previous* load (pointer chase); the branch at each node is data-dependent and unpredictable; there is no parallel work to pack into a vector lane. SIMD cannot help.

Data parallelism requires independent work. Control-dependent serial work needs different tools (branch prediction, prefetching, out-of-order execution).

Problem 7: Cores vs. Width

You have a transistor budget for either a single 8-issue OOO core or four 3-issue in-order cores. A customer is running (a) a single-threaded spreadsheet, (b) a web server handling 200 concurrent connections. Which chip should they buy?

Solution

  • **(a)** The spreadsheet is single-threaded and latency-sensitive. Only one thread at a time; the 8-issue OOO core wins by extracting ILP within that thread.
  • **(b)** The web server has abundant thread-level parallelism and each request is memory-latency-bound. Four simpler cores (better still: four cores × 2 SMT threads) win by running eight requests truly in parallel. Dedicated OOO logic would sit idle most of the time.

There is no "best" CPU — there is best-for-the-workload. Modern consumer chips attempt to be good at both by mixing P-cores and E-cores.

Key Concepts

Concept Description
CPI / IPC Cycles per instruction; instructions per cycle — better measure than MHz
Pipelining Overlap successive instructions; targets CPI = 1
Superpipelining Deeper stages, higher clock, bigger mispredict penalty
Superscalar Issue > 1 instruction per cycle through parallel functional units
VLIW Compiler packs bundles; works for regular code, not general-purpose
Branch prediction Speculate the path; misprediction flushes the pipeline
Predication Replace branches with conditional instructions
OOO execution Issue instructions in data-ready order; retire in program order
Register renaming Map few architectural registers to many physical; critical on x86
Power Wall P ∝ f V² — cannot raise clock forever
ILP Wall Real code sustains ~1–2 IPC regardless of width
μops Internal RISC-like ops that x86 decodes to; cached after decode
SMT One core presents multiple logical processors; fills bubbles with other threads
Multi-core Duplicate entire cores; true thread-level parallelism
Big-little Asymmetric cores on one die for performance + efficiency
SIMD One instruction operates on a packed vector of values
Memory Wall CPU speed far outran DRAM latency; caches hide the gap
Latency vs. Bandwidth Time-per-access vs. bytes-per-second; different techniques address each

Summary

  1. Clock speed is only one factor. Performance = IPC × clock. IPC comes from microarchitecture tricks, not MHz.
  2. Pipelining overlaps instructions so that each pipeline stage stays busy every cycle, pushing CPI toward 1 at the cost of exposing hazards.
  3. Superpipelining (deeper) and superscalar (wider) attack the single-thread performance problem from two directions; modern CPUs do both.
  4. Branch prediction and speculative execution hide control hazards; mispredictions flush a deep pipeline and cost real performance.
  5. Out-of-order execution with register renaming extracts another 20–40% IPC by reordering dynamically; it is essential on x86 because of its tiny architectural register set.
  6. The Power Wall caps clock speed and the ILP Wall caps single-thread IPC. Neither goes away with more transistors.
  7. Beyond those walls, we scale with SMT (cheap extra threads through one core) and multi-core (duplicate cores). Hybrid big-little designs balance throughput and efficiency.
  8. SIMD vector instructions exploit data parallelism — huge for media and scientific code, nothing for pointer-chasing.
  9. The memory wall is now the dominant bottleneck. Multi-level caches hide DRAM latency; most performance tuning in real software comes down to cache behavior.
  10. Modern CPUs are a tower of tricks stacked on top of the single-cycle datapath you built in Digital. Every trick is a response to a specific bottleneck — and every trick has a cost.

Further Reading