
Advanced Architecture

Overview

Last class we finished a single-cycle RISC-V processor: one rising edge, one instruction, end of story. Nothing you buy is built that way. Modern CPUs — the chip in your laptop, your phone, a cloud VM — use pipelining, superscalar issue, branch prediction, out-of-order execution, multiple cores, simultaneous multithreading, SIMD vector units, and multi-level caches to extract orders of magnitude more performance from the same basic datapath. This lecture is a conceptual tour of those techniques.

We follow the structure of Jason Patterson's Modern Microprocessors: A 90-Minute Guide (2026 edition). The point is not to design these structures in Digital — a single modern CPU core contains several billion transistors — but to give you the vocabulary and mental model to read a CPU spec sheet, interpret benchmark numbers, and understand why those billion transistors exist. By the end of the lecture you should be able to explain why doubling the clock speed rarely doubles performance, why branch mispredictions are expensive, and why the memory system — not the ALU — usually determines how fast your code runs.

Learning Objectives

  • Explain why clock speed alone is a poor measure of processor performance; compute CPI and IPC
  • Describe a classic 5-stage pipeline and identify structural, data, and control hazards
  • Distinguish superpipelining (deeper pipes) from superscalar (wider issue) and from VLIW (compiler-scheduled parallelism)
  • Explain how branch prediction, speculative execution, and predication hide control hazards
  • Describe out-of-order execution, the reorder buffer, and why register renaming is essential on x86
  • State the Power Wall and the ILP Wall, and explain what they imply for design trade-offs
  • Describe how modern x86 chips decode into internal μops and why the μop cache matters
  • Contrast simultaneous multithreading (SMT) with multi-core and explain when each helps
  • Explain the n² cost of wider issue, motivating multi-core and asymmetric big-little designs
  • Describe SIMD (SSE, AVX, NEON, SVE) and identify code that benefits
  • Explain the memory wall, the L1/L2/L3/DRAM hierarchy, and the difference between latency and bandwidth

Prerequisites


More Than Just Megahertz

Naive performance measure: "my CPU is 4 GHz, yours is 3 GHz, so mine is 33% faster." Almost always wrong. What matters is not cycles per second but useful work per cycle.

CPI and IPC

Metric Formula Meaning
CPI cycles / instruction Average cycles each instruction consumes
IPC instructions / cycle Reciprocal of CPI; higher is better
MIPS IPC × clock (MHz) Instructions retired per second (millions)
Wall-clock time instructions × CPI / clock What the user actually experiences

A 3 GHz CPU with IPC = 4 retires 12 billion instructions per second. A 4 GHz CPU with IPC = 2 retires 8 billion. The 3 GHz part wins by 50%.
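
The same comparison as a minimal C sketch, using a hypothetical 10-billion-instruction program (the instruction count is arbitrary; only the ratio matters):

#include <stdio.h>

/* Wall-clock time = instructions * CPI / clock frequency.
   Illustrative numbers: the two hypothetical CPUs discussed above. */
int main(void) {
    double instructions = 10e9;

    double time_a = instructions * (1.0 / 4.0) / 3.0e9;  /* IPC 4 -> CPI 0.25, 3 GHz */
    double time_b = instructions * (1.0 / 2.0) / 4.0e9;  /* IPC 2 -> CPI 0.5,  4 GHz */

    printf("3 GHz, IPC 4: %.3f s\n", time_a);  /* ~0.833 s */
    printf("4 GHz, IPC 2: %.3f s\n", time_b);  /* ~1.250 s */
    return 0;
}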

Historical Example

In 1997, a 250 MHz MIPS R10000 outperformed a 300 MHz Pentium II on integer benchmarks, and a 600 MHz Alpha 21164 obliterated both. Same era, same semiconductor process generation, wildly different performance per cycle. The difference was microarchitecture — the subject of today's lecture.

The "megahertz myth" is why Apple's 2006 transition from PowerPC G5 (2.7 GHz) to Intel Core Duo (1.8 GHz) still delivered better performance. IPC more than compensated for the clock drop.

Pipelining and Instruction-Level Parallelism

The single-cycle design wastes most of the chip most of the time: while the ALU is computing, the instruction fetch logic sits idle; while data memory is writing back, the PC increment sits idle. Pipelining fixes this by overlapping successive instructions so that every unit is busy every cycle.

The Classic 5-Stage Pipeline

flowchart LR
    IF["IF<br/>Instruction<br/>Fetch"] --> ID["ID<br/>Decode +<br/>Reg Read"]
    ID --> EX["EX<br/>ALU"]
    EX --> MEM["MEM<br/>Data Memory"]
    MEM --> WB["WB<br/>Writeback"]

Each stage is separated from the next by a pipeline register (a bank of flip-flops). A new instruction enters IF every cycle; five instructions are in flight at once.

Space-Time View

After the pipeline fills, one instruction retires per cycle — CPI approaches 1.0, a 5× speedup over an unpipelined 5-cycle implementation (at the same clock).

Cycle 1 2 3 4 5 6 7
add IF ID EX MEM WB
sub IF ID EX MEM WB
and IF ID EX MEM WB
or IF ID EX MEM
xor IF ID EX
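
Counting cycles makes the "approaches 1.0" claim concrete: a k-stage pipeline spends k cycles filling, then retires one instruction per cycle. A minimal sketch with hypothetical instruction counts:

#include <stdio.h>

/* Cycles to run n instructions: unpipelined vs. a k-stage pipeline
   with no hazards (fill once, then one retirement per cycle). */
int main(void) {
    int k = 5;
    long counts[] = {5, 100, 1000000};

    for (int i = 0; i < 3; i++) {
        long n = counts[i];
        long unpiped = n * k;          /* each instruction takes k cycles */
        long piped   = k + (n - 1);    /* fill once, then 1 per cycle */
        printf("n=%8ld  speedup = %.2fx\n", n, (double)unpiped / piped);
    }
    return 0;
}
/* Speedup tends toward 5x as n grows; short runs never reach it. */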

Hazards

Three things break the one-instruction-per-cycle promise:

Hazard Cause Mitigation
Structural Two stages want the same hardware Duplicate the resource (separate I-cache and D-cache)
Data Instruction B needs a result that A has not written yet Forwarding (bypass) or stall
Control Branch target unknown until EX — what do we fetch? Branch prediction

Forwarding (Bypassing)

Consider add t0, t1, t2 followed by sub t3, t0, t4. Without help, sub reads t0 from the register file before add's WB has written it. Forwarding wires the ALU output back to the ALU input so that the fresh value is available one cycle earlier — no stall needed. This is the "bypass" you will see on every modern pipeline diagram.

What if the producer is a load? `lw t0, 0(t1)` followed by `add t3, t0, t4` has a **load-use hazard**. The value is not available until the end of MEM, but `add` wants it at the start of EX. Even with forwarding, the pipeline must stall one cycle (insert a **bubble**). Good compilers reorder instructions to fill this slot with something useful.

Deeper and Wider: Superpipelining and Superscalar

Superpipelining

Shorter stages allow a higher clock frequency. Split each classic stage into two or three shorter ones — 10, 15, even 31 stages in total. The total work per instruction is unchanged, but each stage does less of it, so the clock can tick faster and more instructions are in flight at once.

Processor Stages
UltraSPARC T1 6
ARM Cortex-A53 8
Intel Core 2 14
Intel Skylake / Sunny Cove 14–19
AMD Zen 4 ~19
Pentium 4 Prescott 31

Trade-off: a deeper pipe has a bigger mispredict penalty. Pentium 4 paid dearly for its 31 stages.

Superscalar

Add issue width: fetch, decode, and execute multiple instructions in parallel each cycle. A 4-wide superscalar has four ALUs (or a mix of integer/FP/load-store units) and can retire up to 4 instructions per cycle — IPC up to 4.0.

flowchart LR
    IF["IF<br/>(4 wide)"] --> ID["ID"]
    ID --> ALU1["INT ALU"]
    ID --> ALU2["INT ALU"]
    ID --> FPU["FP Unit"]
    ID --> LS["Load/Store"]
    ALU1 --> WB["Writeback<br/>(4 wide)"]
    ALU2 --> WB
    FPU --> WB
    LS --> WB
Processor Issue Width
UltraSPARC T1 1
ARM Cortex-A53 2
Pentium (original P5) 2
Intel Core 2 4
Apple M1 Firestorm 8
Intel Golden Cove 6 (decode) → ~12 (μop dispatch)

Superpipelined + Superscalar

Modern CPUs combine both. A Zen 4 core is ~19 stages deep and roughly 6 instructions wide, extracting parallelism in two dimensions at once.


VLIW: Let the Compiler Do It

Very Long Instruction Word: each "instruction" is a bundle of independent sub-operations, and the compiler statically packs them. The hardware just executes what it is told — no dependency checking, no dynamic scheduling.

  • Pros: simpler (and therefore faster, cooler) hardware; the compiler has a whole-program view
  • Cons: loses when runtime behavior is unpredictable (cache misses, branches); binaries must be recompiled for every new chip width; compilers struggle to find parallelism in irregular code

Who shipped it: Intel Itanium (IA-64, EPIC) — failed in the market. Transmeta Crusoe — niche. GPUs and DSPs — still widely used, because those workloads are regular.

Who did not: anyone selling a general-purpose CPU today.


Dependencies and Latencies

Even with a wide, deep pipeline, you cannot execute instructions in parallel if they depend on each other.

Operation Typical Latency (cycles)
Integer add, sub, and, or, xor 1
Integer multiply 3–5
Integer divide 12–40+
FP add/multiply 3–6
FP divide 10–25
L1 load (cache hit) 3–5
L2 load 10–15
DRAM load (cache miss) 100–300+

Load latency hurts most. Loads happen early in most code sequences and almost everything else depends on what they return. A 4-wide superscalar that stalls 5 cycles on each load is no faster than a scalar in-order machine if every instruction is a load.
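
Dependency chains are easy to create by accident in ordinary code. Both loops below perform the same additions, but the first threads every add through a single accumulator (one long chain), while the second keeps four independent chains that a superscalar core can overlap. A sketch; actual gains depend on the compiler and what floating-point reassociation it is allowed to do:

#include <stddef.h>

/* One accumulator: every add depends on the previous add. */
double sum_serial(const double *a, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += a[i];
    return sum;
}

/* Four accumulators: four independent dependency chains,
   so several adds can be in flight at once. */
double sum_unrolled(const double *a, size_t n) {
    double s0 = 0.0, s1 = 0.0, s2 = 0.0, s3 = 0.0;
    size_t i = 0;
    for (; i + 4 <= n; i += 4) {
        s0 += a[i];
        s1 += a[i + 1];
        s2 += a[i + 2];
        s3 += a[i + 3];
    }
    for (; i < n; i++)
        s0 += a[i];
    return (s0 + s1) + (s2 + s3);
}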


Branches and Branch Prediction

Every ~5–6 instructions is a branch. In a deep pipeline, by the time we learn whether a branch is taken, we have already fetched 10–20 instructions from somewhere. Which somewhere?

Speculation

Guess the target, fetch speculatively, check later. If the guess was right, free performance. If wrong, flush the pipeline — all speculatively-fetched instructions are discarded and the hardware starts over.

Mispredict Penalty

Penalty ≈ pipeline depth between fetch and the stage that resolves the branch. On a 20-stage pipeline with a 15-cycle penalty, even 95% prediction accuracy costs you:

wasted cycles per branch      = 15 × 0.05 = 0.75
branches per instruction      ≈ 1/6
wasted cycles per instruction = 0.75 / 6 ≈ 0.125  →  ~12.5% of peak performance gone

Predictor Evolution

Predictor Accuracy Notes
Always-taken ~60% Trivial
Backward taken, forward not ~65% "Loops loop, ifs fall through"
1-bit dynamic ~80% Remember last outcome
2-bit saturating counter ~90% Resist single flips
Two-level adaptive / gshare ~95% History of the last N branches
TAGE / perceptron (modern) ~97%+ Multiple predictors voted
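
The difference a good predictor makes is easy to feel from ordinary code: run the same data-dependent branch over randomly ordered data (roughly a coin flip per element) and over sorted data (long predictable runs). A rough C benchmark sketch; note that at higher optimization levels the compiler may replace the branch with a conditional move and flatten the difference:

#include <stdio.h>
#include <stdlib.h>
#include <time.h>

#define N    1000000
#define REPS 100

static int v[N];

static int cmp(const void *a, const void *b) {
    return *(const int *)a - *(const int *)b;
}

/* The branch under test: taken for roughly half the values. */
static long count_big(void) {
    long count = 0;
    for (int i = 0; i < N; i++)
        if (v[i] >= 128)
            count++;
    return count;
}

static double time_reps(void) {
    clock_t t0 = clock();
    long sink = 0;
    for (int r = 0; r < REPS; r++)
        sink += count_big();
    clock_t t1 = clock();
    return sink ? (double)(t1 - t0) / CLOCKS_PER_SEC : 0.0;
}

int main(void) {
    for (int i = 0; i < N; i++)
        v[i] = rand() % 256;               /* random order: near-unpredictable branch */
    printf("random: %.3f s\n", time_reps());

    qsort(v, N, sizeof v[0], cmp);         /* sorted: long predictable runs */
    printf("sorted: %.3f s\n", time_reps());
    return 0;
}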

Predication

An alternative: replace the branch with a predicated instruction that does the work either way and conditionally commits the result.

# Before (branchy): skip the subtraction when t0 == t1
    beq  t0, t1, skip
    sub  t2, t3, t4
skip:

# After (predicated): do the work unconditionally, commit conditionally
# (cmov.eq is illustrative pseudo-assembly: if t0 == t1, t2 <- t5)
    sub      t2, t3, t4
    cmov.eq  t2, t5, t0, t1
  • Classic 32-bit ARM has been fully predicated since day one (AArch64 dropped most of it, keeping conditional selects)
  • Alpha, x86, MIPS, SPARC added conditional moves (cmov)
  • IA-64 (Itanium) had a predicate bit on every instruction

Great for short if-then-else. Terrible for large blocks — you pay to execute the unused side.
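
Ordinary compilers make the same trade-off in C. Below is a sketch of the branchy and the predicated ("branchless") style of the same loop; whether the compiler actually emits a conditional move or keeps the branch is its own decision:

/* Branchy: mispredicts when the data is unpredictable. */
long count_negative_branchy(const int *v, long n) {
    long count = 0;
    for (long i = 0; i < n; i++)
        if (v[i] < 0)
            count++;
    return count;
}

/* Predicated style: the comparison result (0 or 1) is added
   unconditionally, so there is no branch to predict. */
long count_negative_branchless(const int *v, long n) {
    long count = 0;
    for (long i = 0; i < n; i++)
        count += (v[i] < 0);
    return count;
}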


Out-of-Order Execution and Register Renaming

Even with a perfect predictor and a 6-wide pipeline, straight-line code has too many dependency chains to feed six functional units every cycle. Out-of-order (OOO) execution lets the hardware pick ready instructions from a window of 100–500 in-flight operations and issue them as soon as their inputs are available, regardless of program order.

OOO Pipeline

flowchart LR
    IF["Fetch"] --> DEC["Decode"]
    DEC --> REN["Rename<br/>(arch &rarr; phys regs)"]
    REN --> RS["Reservation<br/>Station /<br/>Scheduler"]
    RS --> FU["Functional<br/>Units (many)"]
    FU --> ROB["Reorder<br/>Buffer"]
    ROB --> RET["Retire<br/>(in order)"]

Key ideas:

  • Fetch, decode, rename, dispatch happen in order
  • Execute happens out of order — whoever is ready, goes
  • Retire happens in order — the architectural state only updates when the oldest in-flight instruction completes

Register Renaming

Architectural registers (rax, rbx, ..., or x0–x31) are few. The pipeline has hundreds of operations in flight that need temporary scratch. Renaming maps each architectural register write to a physical register from a pool of 150–400. Two instructions that write rax get mapped to different physical registers and can execute in parallel.

This matters most on x86: in 32-bit mode the ISA exposes only 8 general-purpose registers. Renaming expands that to hundreds internally — without it, x86 OOO would be almost worthless.
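
To make the bookkeeping concrete, here is a toy C sketch of a rename table plus free list, the front-end structure that hands every architectural destination a fresh physical register. Purely illustrative (real hardware also reclaims registers at retirement and repairs the map on mispredicts):

#include <stdio.h>

#define ARCH_REGS 16     /* e.g. x86-64's 16 GPRs */
#define PHYS_REGS 128    /* a hypothetical physical register file */

static int rename_map[ARCH_REGS];   /* arch reg -> current phys reg */
static int free_list[PHYS_REGS];
static int free_top;

static void rename_init(void) {
    for (int a = 0; a < ARCH_REGS; a++)
        rename_map[a] = a;                      /* identity mapping at reset */
    free_top = 0;
    for (int p = PHYS_REGS - 1; p >= ARCH_REGS; p--)
        free_list[free_top++] = p;              /* the rest start out free */
}

/* Rename one instruction "dst = src1 op src2".
   Sources read the current mapping; the destination gets a new
   physical register, breaking write-after-write/read hazards. */
static void rename_instr(int dst, int src1, int src2) {
    int p1 = rename_map[src1];
    int p2 = rename_map[src2];
    int pd = free_list[--free_top];             /* allocate (sketch: assume not empty) */
    rename_map[dst] = pd;
    printf("arch r%d = r%d op r%d   ->   phys p%d = p%d op p%d\n",
           dst, src1, src2, pd, p1, p2);
}

int main(void) {
    rename_init();
    rename_instr(0, 1, 2);   /* both write arch r0 ... */
    rename_instr(0, 3, 4);   /* ... but get different phys regs: can overlap */
    return 0;
}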

Andy Glew's Confession

"The dirty little secret of OOO is that we are often not very much OOO at all." — Andy Glew, Pentium Pro architect

Typical OOO cores extract only ~20–40% more IPC than a well-tuned in-order core running the same code. It is expensive (millions of transistors, significant power) for a modest but real win.


The Brainiac Debate

Two design philosophies, coined by Linley Gwennap (1993):

Style Strategy Example
Brainiac Complex OOO, wide issue, aggressive speculation; high IPC, moderate clock Pentium Pro, POWER4, Apple M1
Speed-Demon Simple in-order core, high clock, lean on compiler Alpha 21164, Pentium 4, Niagara T1

Designs have oscillated between the two:

  • Intel: Pentium Pro (brainiac) → Pentium 4 (speed-demon, failed) → Core / Core i (brainiac) → today (brainiac, tempered by power)
  • DEC Alpha: 21064/21164 (speed-demon) → 21264 (brainiac) → project cancelled
  • SPARC: SuperSPARC (brainiac) → UltraSPARC / Niagara (speed-demon + SMT)
  • ARM: Cortex-A7 (tiny speed-demon) → Cortex-X / Apple M-series (brainiac)

Modern reality: power constraints make pure speed-demon untenable (can't raise clocks past ~5 GHz economically) and pure brainiac wasteful. Everyone is now a moderate brainiac with power management.


The Power Wall and the ILP Wall

The Power Wall

Dynamic switching power:

P ∝ C × V² × f

Raising the clock 30% typically requires a voltage bump to keep timing margins, and voltage enters squared. Net: ~2× power and heat for a 30% frequency increase. Leakage adds a temperature-dependent floor.
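
Plugging numbers into that relation shows where the factor of ~2 comes from. A minimal sketch, assuming the voltage has to rise roughly in proportion to the frequency:

#include <stdio.h>

/* Relative dynamic power for a frequency bump that also needs a
   proportional voltage bump: P ratio = (V ratio)^2 * (f ratio). */
int main(void) {
    double f_ratio = 1.30;           /* +30% clock */
    double v_ratio = 1.30;           /* assumption: voltage rises in step */
    double p_ratio = v_ratio * v_ratio * f_ratio;
    printf("power ratio: %.2fx\n", p_ratio);   /* ~2.20x */
    return 0;
}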

Practical ceilings today:

Form factor Sustained TDP
Server (per socket) 250–400 W
Desktop 65–150 W
Laptop (workstation) 28–45 W
Ultralight laptop 15–25 W
Phone / tablet 3–10 W
Watch <1 W

Pentium 4 hit the Power Wall at 3.8 GHz and was cancelled. IBM POWER6 and AMD Bulldozer hit it too.

The ILP Wall

Real programs have limited fine-grained parallelism. After every architectural trick — OOO, wide issue, big reorder buffer — typical integer code (SPECint) sustains only ~1–2 IPC. Scientific / vectorizable code can do much better, but that is not what most software looks like.

Once you cannot raise clock (Power Wall) and cannot extract more IPC from a single thread (ILP Wall), the only way to get more performance is more threads — SMT and multi-core.


What About x86?

x86 is a CISC ISA: variable-length instructions, complex addressing modes, read-modify-write memory operands. It looks nothing like the clean RISC pipelines we have been drawing. How does Intel ship a 6-wide OOO x86 core?

Decoupled Front-End: RISCy x86

flowchart LR
    MEM["x86 bytes<br/>(variable length)"] --> DEC["Complex<br/>Decoders"]
    DEC --> UOP["μop Cache<br/>(decoded)"]
    UOP --> ROB["Rename +<br/>Scheduler"]
    ROB --> FU["Functional<br/>Units"]
    FU --> RET["Retire"]

Steps:

  1. Fetch x86 bytes from I-cache
  2. Crack each x86 instruction into 1–4 simple μops (micro-operations) that resemble RISC ops
  3. Rename, schedule, and execute μops in an OOO core
  4. Retire in original x86 instruction order
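
As a purely conceptual sketch (not any real μop encoding, which is undocumented and differs per microarchitecture), step 2 might turn a read-modify-write instruction such as add dword [rbx], eax into three RISC-like operations:

#include <stdio.h>

/* Conceptual only: one x86 read-modify-write instruction cracked
   into load / ALU / store micro-ops with internal temporaries. */
enum uop_kind { UOP_LOAD, UOP_ALU_ADD, UOP_STORE };

struct uop {
    enum uop_kind kind;
    const char   *dst, *src1, *src2;
};

static const struct uop cracked[] = {
    { UOP_LOAD,    "tmp0", "rbx",  ""     },   /* tmp0 <- mem[rbx]   */
    { UOP_ALU_ADD, "tmp1", "tmp0", "eax"  },   /* tmp1 <- tmp0 + eax */
    { UOP_STORE,   "",     "rbx",  "tmp1" },   /* mem[rbx] <- tmp1   */
};

int main(void) {
    static const char *names[] = { "load", "add", "store" };
    for (int i = 0; i < 3; i++)
        printf("uop %d: %-5s dst=%-4s src1=%-4s src2=%s\n",
               i, names[cracked[i].kind], cracked[i].dst,
               cracked[i].src1, cracked[i].src2);
    return 0;
}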

The μop Cache

Decoding x86 is expensive (power-hungry, gates-deep). Since Sandy Bridge (2011), Intel caches already-decoded μops so that loops run straight from the μop cache, skipping the decoders entirely. AMD Zen has an equivalent μop cache.

Issue Width Ambiguity

Modern x86 cores decode up to 5–6 x86 instructions per cycle, producing ~6–8 fused μops, which the backend issues to 10–12 execution ports. Calling it "6-wide" or "12-wide" depends on which stage you measure.

Historical Note

The RISCy-x86 approach was invented twice independently in the mid-1990s: NexGen Nx586 (1994) and Intel Pentium Pro / P6 (1995). Transmeta tried to do the same thing in software instead of hardware; it worked but was too slow.


Threads: SMT and Multi-Core

If one thread cannot keep a 6-wide core busy (ILP Wall), try running two threads through it at the same time. Any cycle one thread stalls, the other fills the gap.

Simultaneous Multithreading (SMT)

flowchart LR
    T0["Thread 0<br/>PC, regs"] --> IF
    T1["Thread 1<br/>PC, regs"] --> IF
    IF["Shared<br/>Fetch/Decode"] --> BACK["Shared<br/>Backend"]
    BACK --> RET["Retire<br/>(per thread)"]
  • Duplicate: PC, architectural registers, TLB tags
  • Share: everything else — decoders, functional units, caches, rename tables

Cost: ~5–10% extra core area. Benefit: −10% to +30% performance on the same core depending on workload.

Intel calls it Hyper-Threading. POWER, Apple, and ARM have various flavors. AMD Zen has 2-way SMT. Sun UltraSPARC T3 Niagara pushed it to 8-way SMT per core — 128 threads on a chip.

Multi-Core

Duplicate the entire core. Each core has its own pipeline, L1, and L2; cores share the L3 and the memory controller.

Approach Area cost Independence
SMT only ~10% Logical threads share all resources
Multi-core ~100% per extra core True parallelism; no resource contention
Both (modern) Big Multi-core at top level, SMT within each core

Today's shipping chips: 6–16 cores for consumer, 32–128 for server, often with 2–4 SMT threads per core.
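
From software's point of view, all of this shows up as thread-level parallelism: spawn more threads and let the OS spread them across cores and SMT threads. A minimal POSIX-threads sketch in C (thread count and data size are arbitrary):

/* build: cc -O2 -pthread threads.c */
#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4
#define N 1000000

static double data[N];

struct job { long lo, hi; double partial; };

static void *worker(void *arg) {
    struct job *j = arg;
    double s = 0.0;
    for (long i = j->lo; i < j->hi; i++)
        s += data[i];
    j->partial = s;          /* each thread writes only its own slot */
    return NULL;
}

int main(void) {
    for (long i = 0; i < N; i++)
        data[i] = 1.0;

    pthread_t tid[NTHREADS];
    struct job jobs[NTHREADS];
    long chunk = N / NTHREADS;

    for (int t = 0; t < NTHREADS; t++) {
        jobs[t].lo = t * chunk;
        jobs[t].hi = (t == NTHREADS - 1) ? N : (t + 1) * chunk;
        pthread_create(&tid[t], NULL, worker, &jobs[t]);
    }

    double total = 0.0;
    for (int t = 0; t < NTHREADS; t++) {
        pthread_join(tid[t], NULL);
        total += jobs[t].partial;
    }
    printf("sum = %.0f\n", total);   /* 1000000 */
    return 0;
}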


More Cores or Wider Cores?

Why not just make one really wide 20-issue core?

Quadratic Cost of Width

Dependency checking, the rename tables, the scheduler, and the bypass network all scale roughly as O(n²) in issue width. Doubling from 4-wide to 8-wide roughly quadruples the logic and adds wire delay that hurts clock speed.

Design Area (vs. one 5-wide core) Peak IPC Scaling
One 10-wide core ~6× 10 (theoretical) Sublinear (diminishing single-thread ILP)
Two 5-wide cores ~2× 2 × 5 = 10 Linear (throughput scales with threads)

For the same transistor budget, two medium cores beat one giant core — and run two threads truly in parallel.

Hybrid / Asymmetric Designs

Modern consumer CPUs mix big and small cores on one die:

  • ARM big.LITTLE: fast "big" cores for interactive workloads, tiny "LITTLE" cores for background (saves battery)
  • Apple M-series: P-cores (performance) + E-cores (efficiency)
  • Intel 12th gen+: P-cores + E-cores with a hardware thread director
  • AMD Zen 4c / Zen 5c: smaller cores on the same ISA for density

System-on-Chip (SoC)

Integrate CPU + GPU + I/O + DSP + security + networking on one die. Dominant in phones and tablets; increasingly common everywhere (Apple M-series, AMD Ryzen APUs). Saves power and area vs. separate chips.


Data Parallelism: SIMD

One instruction, many data. Pack multiple values into a wide register and operate on all of them simultaneously.

The Idea

A 128-bit register can hold:

  • 16 × 8-bit bytes (pixel RGBA, audio samples)
  • 8 × 16-bit shorts
  • 4 × 32-bit ints or floats
  • 2 × 64-bit doubles

A single SIMD add instruction applies the operation to all lanes in parallel.

  [ a0 | a1 | a2 | a3 ]
+ [ b0 | b1 | b2 | b3 ]
= [ a0+b0 | a1+b1 | a2+b2 | a3+b3 ]
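
In C this maps directly onto compiler intrinsics. A minimal SSE2 sketch (SSE2 is baseline on every x86-64 machine) adding four 32-bit integers with one instruction:

/* build on x86-64: cc -O2 simd_add.c */
#include <emmintrin.h>   /* SSE2 intrinsics */
#include <stdio.h>

int main(void) {
    int a[4] = { 1, 2, 3, 4 };
    int b[4] = { 10, 20, 30, 40 };
    int r[4];

    __m128i va = _mm_loadu_si128((const __m128i *)a);  /* load 4 lanes */
    __m128i vb = _mm_loadu_si128((const __m128i *)b);
    __m128i vr = _mm_add_epi32(va, vb);                /* 4 adds at once */
    _mm_storeu_si128((__m128i *)r, vr);

    printf("%d %d %d %d\n", r[0], r[1], r[2], r[3]);   /* 11 22 33 44 */
    return 0;
}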

The SIMD Zoo

ISA Extension Width
x86 MMX (1997) 64
x86 SSE / SSE2 / SSE3 / SSE4 128
x86 AVX / AVX2 256
x86 AVX-512 512
x86 AVX10 (2024+) 256/512 (flexible)
POWER AltiVec / VSX 128
ARM NEON 128
ARM SVE / SVE2 128–2048 (length-agnostic)
RISC-V V extension scalable

ARM SVE and RISC-V V take a different approach: the same binary runs on any hardware vector width. Each loop iteration simply consumes whatever the hardware can offer.

When SIMD Helps

  • Huge win: image/video/audio processing, matrix math, cryptography, string scanning
  • Small win: regular array loops the compiler can auto-vectorize
  • No win: pointer-chasing, tree walks, parsing, general business logic

Auto-vectorization in compilers is still limited; most of the benefit comes from hand-tuned libraries (BLAS, libjpeg, OpenSSL, JVM intrinsics) that your code calls without knowing.


The Memory Wall

DRAM latency has barely improved in decades, while processor clock speeds grew by roughly three orders of magnitude over that time. The gap — the memory wall — is now the #1 performance bottleneck for most real workloads.

Rough Numbers (2026 consumer system)

Level Size Latency (CPU cycles)
Register 16–32 values 0
L1 cache 32–128 KB 3–5
L2 cache 256 KB–2 MB 10–15
L3 / LLC 4–96 MB 40–60
DRAM 8–128 GB 150–300
SSD (NVMe) 256 GB–4 TB ~50,000
HDD 1–20 TB ~5,000,000

A single cache miss to DRAM costs as many cycles as hundreds of single-cycle ALU operations.

Why Caches Work

Real programs exhibit locality (review from Cache Memory):

  • Temporal: recently-used data will be used again soon
  • Spatial: data near a recent access will be accessed soon

Modern L1 hit rates are ~90–95%; L2 catches most of the rest; L3 catches most of what is left. Only a few percent of loads ever go to DRAM.
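
Locality is visible from ordinary code. The two functions below read exactly the same bytes, but the row-major walk streams through memory one cache line after another, while the column-major walk strides across lines and, for large matrices, misses far more often. (A sketch; where the difference kicks in depends on your cache sizes.)

#include <stddef.h>

#define ROWS 4096
#define COLS 4096

/* Row-major walk: consecutive accesses hit the same cache line. */
long sum_rows(const int m[ROWS][COLS]) {
    long s = 0;
    for (size_t i = 0; i < ROWS; i++)
        for (size_t j = 0; j < COLS; j++)
            s += m[i][j];
    return s;
}

/* Column-major walk: each access jumps COLS * sizeof(int) bytes,
   so nearly every access touches a new cache line. */
long sum_cols(const int m[ROWS][COLS]) {
    long s = 0;
    for (size_t j = 0; j < COLS; j++)
        for (size_t i = 0; i < ROWS; i++)
            s += m[i][j];
    return s;
}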

Cache Hierarchy

flowchart LR
    CPU["Core"] --> L1["L1<br/>~32 KB<br/>~4 cycles"]
    L1 --> L2["L2<br/>~1 MB<br/>~14 cycles"]
    L2 --> L3["L3 / LLC<br/>~32 MB<br/>~50 cycles"]
    L3 --> DRAM["DRAM<br/>~16 GB<br/>~200 cycles"]
    DRAM --> SSD["SSD<br/>~1 TB<br/>~50000 cycles"]

Last-level cache (LLC) often consumes more than half of the chip area. AMD's 3D V-Cache bonds an extra SRAM die on top of the compute die to boost LLC size dramatically.


Bandwidth vs. Latency

Memory has two performance dimensions that are often conflated:

Dimension What it measures How to improve
Latency Time for one access Faster DRAM, more cache, shorter wires
Bandwidth Total bytes per second Wider buses, more channels, stacked DRAM

Highway Analogy

  • Lanes = bus width → bandwidth: adding lanes doubles cars/hour but does not make any one car go faster
  • Speed limit = signaling rate → latency: physics sets an upper bound
  • Distance from A to B = physical path length → the latency floor; adding lanes does not move the destination closer

"You Can't Bribe God"

Physical distance + speed-of-light + capacitive wire loading sets a hard floor on DRAM access time. Bandwidth can scale by parallelism (more channels, HBM stacks, DDR5-6400). Latency cannot.

Workload Sensitivity

Workload Latency-bound or bandwidth-bound?
Pointer-chasing (linked lists, trees, compilers, databases) Latency
Image/video processing, scientific dense code Bandwidth
Web server handling unrelated requests Bandwidth (across threads)
Interactive UI Latency

Practice Problems

Problem 1: Megahertz vs. IPC

CPU A runs at 4.0 GHz with CPI = 1.6. CPU B runs at 3.0 GHz with CPI = 0.8. Which has higher performance on the same code, and by how much?

Solution

Wall-clock time ∝ CPI / clock.

  • CPU A: 1.6 / 4.0 = 0.40 ns per instruction
  • CPU B: 0.8 / 3.0 ≈ 0.267 ns per instruction

CPU B is 0.40 / 0.267 ≈ **1.5× faster** despite a 25% lower clock. Higher IPC wins.

Problem 2: Mispredict Penalty

A 20-stage pipeline resolves branches at stage 15. Branches occur every 6 instructions and the predictor is 95% accurate. Approximately what fraction of peak IPC is lost to mispredictions?

Solution

  • Penalty per mispredict: 15 stages flushed = 15 cycles wasted
  • Mispredict rate: 5% of branches
  • Branch rate: 1 / 6 instructions

Wasted cycles per instruction = 15 × 0.05 / 6 = **0.125**

On a 1-IPC baseline, that is ~12.5% of peak lost just to mispredictions. On a 4-IPC superscalar it is worse — a flushed cycle discards *four* potential retirements.

Problem 3: Why OOO Exists

Why can't a clever compiler statically schedule code well enough to make OOO hardware unnecessary?

Solution

The compiler cannot know, at compile time:

  1. Which loads will hit or miss the cache — latency varies by 100×
  2. Which branches will be taken on a given run — input-dependent
  3. How contention with other threads / cores will play out
  4. What code will follow across a function call boundary into a separately-compiled binary

OOO hardware sees the actual runtime dependencies and latencies and reorders based on what is **actually ready now**. A compiler has to plan for the worst case. This is also why Itanium's VLIW approach struggled on general-purpose code.

Problem 4: x86 Registers

Why does x86 depend more heavily on register renaming than RISC-V does?

Solution

x86 exposes only 8 GPRs in 32-bit mode and 16 in 64-bit mode. With that few architectural names, the compiler is forced to reuse the same register repeatedly, creating artificial **false dependencies** (write-after-write, write-after-read) that block parallelism. Renaming maps those reuses to distinct physical registers, recovering the parallelism the ISA hides.

RISC-V exposes 32 GPRs — more breathing room, fewer artificial conflicts. Renaming still helps but is less of a rescue mission.

Problem 5: Average Memory Access Time (AMAT)

A system has L1 hit rate 95% at 4 cycles, L2 hit rate 80% (of L1 misses) at 14 cycles, L3 hit rate 50% (of L2 misses) at 50 cycles, DRAM at 200 cycles. Compute AMAT.

Solution
AMAT = L1_latency + P(miss L1) × (L2_latency + P(miss L2) × (L3_latency + P(miss L3) × DRAM))
     = 4 + 0.05 × (14 + 0.20 × (50 + 0.50 × 200))
     = 4 + 0.05 × (14 + 0.20 × 150)
     = 4 + 0.05 × (14 + 30)
     = 4 + 0.05 × 44
     = 4 + 2.2
     = 6.2 cycles
A 95% L1 hit rate keeps the effective latency close to L1 even though a DRAM miss is 200 cycles. This is the magic of caches.

Problem 6: SIMD Applicability

You have two workloads to speed up with AVX-512. Which is a good candidate and why?

  1. Convert an RGB image to grayscale (one arithmetic operation per pixel)
  2. Walk an on-disk B-tree to find a record matching a key
Solution

  • **(1) wins big**: independent per-pixel work, regular memory access, arithmetic that maps cleanly to packed 8-bit SIMD. Expect a 4–16× speedup.
  • **(2) loses**: each step of the walk depends on the *previous* load (pointer chase); the branch at each node is data-dependent and unpredictable; there is no parallel work to pack into a vector lane. SIMD cannot help.

Data parallelism requires independent work. Control-dependent serial work needs different tools (branch prediction, prefetching, out-of-order execution).

Problem 7: Cores vs. Width

You have a transistor budget for either a single 8-issue OOO core or four 3-issue in-order cores. A customer is running (a) a single-threaded spreadsheet, (b) a web server handling 200 concurrent connections. Which chip should they buy?

Solution

  • **(a)** The spreadsheet is single-threaded and latency-sensitive. Only one thread at a time; the 8-issue OOO core wins by extracting ILP within that thread.
  • **(b)** The web server has abundant thread-level parallelism and each request is memory-latency-bound. Four simpler cores (better still: four cores × 2 SMT threads) win by running eight requests truly in parallel. Dedicated OOO logic would sit idle most of the time.

There is no "best" CPU — there is best-for-the-workload. Modern consumer chips attempt to be good at both by mixing P-cores and E-cores.

Key Concepts

Concept Description
CPI / IPC Cycles per instruction; instructions per cycle — better measure than MHz
Pipelining Overlap successive instructions; targets CPI = 1
Superpipelining Deeper stages, higher clock, bigger mispredict penalty
Superscalar Issue > 1 instruction per cycle through parallel functional units
VLIW Compiler packs bundles; works for regular code, not general-purpose
Branch prediction Speculate the path; misprediction flushes the pipeline
Predication Replace branches with conditional instructions
OOO execution Issue instructions in data-ready order; retire in program order
Register renaming Map few architectural registers to many physical; critical on x86
Power Wall P ∝ f V² — cannot raise clock forever
ILP Wall Real code sustains ~1–2 IPC regardless of width
μops Internal RISC-like ops that x86 decodes to; cached after decode
SMT One core presents multiple logical processors; fills bubbles with other threads
Multi-core Duplicate entire cores; true thread-level parallelism
Big-little Asymmetric cores on one die for performance + efficiency
SIMD One instruction operates on a packed vector of values
Memory Wall CPU speed far outran DRAM latency; caches hide the gap
Latency vs. Bandwidth Time-per-access vs. bytes-per-second; different techniques address each

Summary

  1. Clock speed is only one factor. Performance = IPC × clock. IPC comes from microarchitecture tricks, not MHz.
  2. Pipelining overlaps instructions so that each pipeline stage stays busy every cycle, pushing CPI toward 1 at the cost of exposing hazards.
  3. Superpipelining (deeper) and superscalar (wider) attack the single-thread performance problem from two directions; modern CPUs do both.
  4. Branch prediction and speculative execution hide control hazards; mispredictions flush a deep pipeline and cost real performance.
  5. Out-of-order execution with register renaming extracts another 20–40% IPC by reordering dynamically; it is essential on x86 because of its tiny architectural register set.
  6. The Power Wall caps clock speed and the ILP Wall caps single-thread IPC. Neither goes away with more transistors.
  7. Beyond those walls, we scale with SMT (cheap extra threads through one core) and multi-core (duplicate cores). Hybrid big-little designs balance throughput and efficiency.
  8. SIMD vector instructions exploit data parallelism — huge for media and scientific code, nothing for pointer-chasing.
  9. The memory wall is now the dominant bottleneck. Multi-level caches hide DRAM latency; most performance tuning in real software comes down to cache behavior.
  10. Modern CPUs are a tower of tricks stacked on top of the single-cycle datapath you built in Digital. Every trick is a response to a specific bottleneck — and every trick has a cost.

Further Reading