Cache Memory

Overview

This lecture introduces cache memory concepts and their role in bridging the speed gap between processors and main memory. We examine why caches are effective through the principles of temporal and spatial locality, and explore three cache organizations: direct-mapped, fully-associative, and set-associative. We cover address decomposition, cache lookup algorithms, LRU replacement, and multi-word block sizes. Understanding cache memory is essential for writing efficient code and for implementing the cache simulator in Project03.

Learning Objectives

  • Understand the memory hierarchy and the processor-memory speed gap
  • Explain the principles of temporal and spatial locality
  • Describe how direct-mapped caches use address decomposition
  • Calculate tag, index, and offset bits for a given cache configuration
  • Implement cache lookup algorithms for all three cache types in Rust
  • Understand and implement LRU replacement policy using timestamps
  • Decompose addresses for caches with multi-word blocks
  • Trace through cache simulation scenarios to determine hits and misses
  • Compute hit rates and miss rates from simulation data

The Memory Hierarchy

The Processor-Memory Speed Gap

Modern processors can execute billions of instructions per second, but accessing data from main memory (DRAM) takes many more cycles than a simple arithmetic operation. This creates a fundamental bottleneck:

graph LR
    subgraph "Speed Over Time"
        A["Processor Speed"] -->|"Increasing<br>rapidly"| B["Performance Gap"]
        C["Memory Speed"] -->|"Increasing<br>slowly"| B
    end

    B -->|"Growing wider<br>each year"| D["Solution: Cache Memory"]

Memory Hierarchy Levels

To address this gap, computer systems use a memory hierarchy:

graph TD
    subgraph "Memory Hierarchy"
        A["CPU Registers<br>Fastest, smallest<br>~1 cycle"]
        B["L1 Cache (SRAM)<br>Very fast, small<br>~1-4 cycles"]
        C["L2 Cache (SRAM)<br>Fast, medium<br>~10-20 cycles"]
        D["L3 Cache (SRAM)<br>Moderate, larger<br>~40-75 cycles"]
        E["Main Memory (DRAM)<br>Slower, large<br>~100+ cycles"]
        F["Disk/SSD<br>Slowest, huge<br>~millions of cycles"]
    end

    A --> B
    B --> C
    C --> D
    D --> E
    E --> F
| Level | Technology | Size | Access Time | Cost/Byte |
|-------|------------|------|-------------|-----------|
| Registers | Flip-flops | ~32 x 64 bits | ~0.5 ns | Highest |
| L1 Cache | SRAM | 32-64 KB | ~1-2 ns | Very High |
| L2 Cache | SRAM | 256 KB-1 MB | ~5-10 ns | High |
| L3 Cache | SRAM | 2-32 MB | ~20-40 ns | Medium |
| Main Memory | DRAM | 4-64 GB | ~50-100 ns | Low |
| SSD | Flash | 256 GB-4 TB | ~100 us | Very Low |
| HDD | Magnetic | 1-16 TB | ~10 ms | Lowest |

Cache Memory Position

A cache sits between the processor and main memory, holding recently accessed data:

graph LR
    subgraph "Processor"
        A["CPU<br>lw, sw"]
    end

    subgraph "Cache"
        B["SRAM<br>Fast memory"]
    end

    subgraph "Main Memory"
        C["DRAM<br>STACK"]
        D["DATA"]
        E["CODE"]
    end

    A -->|"addr"| B
    B -->|"hit"| A
    B <-->|"miss"| C
    B <--> D
    B <--> E

Principles of Locality

Why Caches Work

Caches achieve high hit rates (often >97%) because programs exhibit locality of reference.

Temporal Locality

Temporal locality: If a data item is accessed, it is likely to be accessed again soon.

Examples:

  • Loop counters: Variable i in a loop is accessed on every iteration
  • Function variables: Local variables used throughout a function
  • Recursive calls: The same function code is executed repeatedly
// Temporal locality example
let mut sum: i64 = 0;
for i in 0..1000 {       // 'i' accessed 1000+ times
    sum += array[i];      // 'sum' accessed 1000 times
}

Spatial Locality

Spatial locality: If a data item is accessed, nearby data items are likely to be accessed soon.

Examples:

  • Array traversal: Accessing array[i] often followed by array[i+1]
  • Sequential code: Instruction at PC followed by instruction at PC+4
  • Struct access: Accessing one field often followed by neighboring fields
// Spatial locality example
for i in 0..1000 {
    total += array[i];  // array[0], array[1], array[2], ... accessed sequentially
}

Locality in Code Execution

graph TD
    subgraph "Instruction Fetch"
        A["PC = 0x1000<br>fetch instruction"]
        B["PC = 0x1004<br>fetch next (spatial)"]
        C["PC = 0x1008<br>fetch next (spatial)"]
        D["PC = 0x1000<br>loop back (temporal)"]
    end

    A --> B
    B --> C
    C -->|"branch"| D
    D --> A

Cache Concepts and Terminology

Key Terms

| Term | Definition |
|------|------------|
| Hit | Requested data is found in the cache |
| Miss | Requested data is not in the cache |
| Hit Rate | Fraction of accesses that are hits |
| Miss Rate | Fraction of accesses that are misses |
| Block | Unit of data transferred between cache and memory |
| Slot/Line | A single entry in the cache (holds one block) |
| Tag | Identifier stored with each block to identify which address it holds |
| Valid Bit | Indicates whether a slot contains valid data |
| Cold Miss | First access to a slot (valid is false) |
| Hot Miss | Slot is valid but contains data for a different address (tag mismatch) |

Hit and Miss Metrics

hit_rate  = num_hits / num_refs
miss_rate = num_misses / num_refs

// Alternative formulas
hit_rate  = 1 - miss_rate
miss_rate = 1 - hit_rate

Average Memory Access Time (AMAT)

The AMAT formula combines hit and miss times:

AMAT = hit_time + (miss_rate x miss_penalty)

Where:

  • hit_time: Time to access the cache (typically 1-2 cycles)
  • miss_rate: Probability of a cache miss
  • miss_penalty: Additional time to fetch from memory on a miss (typically 50-200 cycles)

Example: If hit_time = 1 cycle, miss_rate = 5%, miss_penalty = 100 cycles:

AMAT = 1 + (0.05 x 100) = 1 + 5 = 6 cycles
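The worked example above can be checked with a small Rust sketch; the `amat` helper is illustrative, not part of the simulator:

```rust
// Average Memory Access Time: hit_time + miss_rate * miss_penalty.
fn amat(hit_time: f64, miss_rate: f64, miss_penalty: f64) -> f64 {
    hit_time + miss_rate * miss_penalty
}

fn main() {
    // The lecture's example: 1-cycle hit, 5% miss rate, 100-cycle penalty.
    let t = amat(1.0, 0.05, 100.0);
    println!("AMAT = {t} cycles"); // approximately 6 cycles
}
```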

Memory Addressing

Byte vs Word Addressing

Memory is byte-addressable, but caches often work with words (4 bytes):

Address Conversion

// Byte address to word address
let word_addr = addr / 4;
let word_addr = addr >> 2;  // Equivalent using bit shift

// Word address to byte address
let addr = word_addr * 4;
let addr = word_addr << 2;  // Equivalent using bit shift

Word Alignment

For efficiency, word accesses should be word-aligned (addresses divisible by 4):

| Byte Address | Word Address | Aligned? |
|--------------|--------------|----------|
| 0x00 | 0 | Yes |
| 0x04 | 1 | Yes |
| 0x08 | 2 | Yes |
| 0x03 | - | No |
| 0x05 | - | No |
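The alignment test itself is a one-line bit check; `is_word_aligned` below is a hypothetical helper, a minimal sketch of the rule in the table:

```rust
// An address is word-aligned when its two low-order bits are zero,
// which is equivalent to being divisible by 4.
fn is_word_aligned(addr: u64) -> bool {
    addr & 0b11 == 0
}

fn main() {
    for addr in [0x00u64, 0x04, 0x08, 0x03, 0x05] {
        println!("{addr:#04x}: aligned = {}", is_word_aligned(addr));
    }
}
```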

Direct-Mapped Caches

Overview

A direct-mapped cache is the simplest cache organization. Each memory address maps to exactly one cache slot, determined by the address itself. This makes the replacement policy trivial: if the data is not in the cache, you evict whatever is in the slot that the address maps to.

Cache Structure

struct CacheSlot {
    tag: u64,
    block: [u32; 1],  // One word per block
    valid: bool,
}

let slots: [CacheSlot; N];  // N slots in the cache

Address Decomposition

For a 4-slot direct-mapped cache with 1-word blocks:

64-bit byte address:
┌─────────────────────────┬─────────┬───────────┐
│         TAG             │ INDEX   │ BYTE OFF  │
│       [63:4]            │ [3:2]   │  [1:0]    │
└─────────────────────────┴─────────┴───────────┘
         60 bits            2 bits    2 bits
  • Byte Offset (2 bits): Selects byte within a word (for word-aligned access, always 00)
  • Index (2 bits): Selects which cache slot (0-3 for 4 slots)
  • Tag (60 bits): Identifies the specific memory block stored in the slot

Computing Index and Tag

We can use modulus or bit manipulation to extract these fields:

// For a 4-slot cache with 1-word blocks
let word_addr = addr >> 2;
let slot_index = word_addr & 0b11;  // Same as word_addr % 4

Why does this work? Look at the binary form of word addresses:

dec bin
 0  00000 <-- slot 0
 1  00001 <-- slot 1
 2  00010 <-- slot 2
 3  00011 <-- slot 3
 4  00100 <-- slot 0
 5  00101 <-- slot 1
 6  00110 <-- slot 2
 7  00111 <-- slot 3
 8  01000 <-- slot 0

The last two bits of each word address are exactly word_addr % 4. The remaining upper bits form the tag, which distinguishes different addresses that map to the same slot.
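The mapping table above can be reproduced in a few lines of Rust; `slot_and_tag` is an illustrative helper, not part of the project API:

```rust
// For a 4-slot direct-mapped cache with 1-word blocks: the low two bits
// of a word address select the slot, the remaining bits form the tag.
fn slot_and_tag(word_addr: u64) -> (u64, u64) {
    (word_addr & 0b11, word_addr >> 2) // slot = word_addr % 4
}

fn main() {
    for word_addr in 0u64..9 {
        let (slot, tag) = slot_and_tag(word_addr);
        println!("word {word_addr} -> slot {slot}, tag {tag}");
    }
}
```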

Direct-Mapped Cache Lookup

flowchart TD
    A["Memory Address"] --> B["Extract Index Bits"]
    B --> C["Access Cache Slot"]
    C --> D{"Valid?"}
    D -->|"No"| E["MISS (cold)"]
    D -->|"Yes"| F{"Tag Match?"}
    F -->|"No"| G["MISS (hot)"]
    F -->|"Yes"| H["HIT: Return Data"]
    E --> I["Fetch from Memory"]
    G --> I
    I --> J["Update Cache Slot"]
    J --> K["Return Data"]

Direct-Mapped Lookup Algorithm

let addr_tag = addr >> 4;  // Remove byte_offset (2) and slot_index bits (2)
let index_mask: u64 = 0b11;
let slot_index = ((addr >> 2) & index_mask) as usize;
let slot = &mut slots[slot_index];

if slot.valid && slot.tag == addr_tag {
    // Hit
    num_hits += 1;
    slot.block[0]
} else {
    // Miss (cold if !slot.valid, hot if slot.valid)
    num_misses += 1;
    slot.block[0] = mem_read_word(addr);
    slot.tag = addr_tag;
    slot.valid = true;
    slot.block[0]
}

Limitation: Conflict Misses

Direct-mapped caches suffer from conflict misses when multiple addresses map to the same slot:

Addresses 0, 16, 32, 48 all map to slot 0 (for 4-slot cache)
If accessed alternately: 0, 16, 0, 16 -> all misses (thrashing)
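The thrashing pattern can be reproduced with a stripped-down direct-mapped simulator; the `Slot` struct and `access` function below are a minimal sketch under the 4-slot, 1-word-block configuration, not the Project03 interface:

```rust
// Minimal direct-mapped cache: 4 slots, 1-word blocks, byte addresses.
#[derive(Clone, Copy, Default)]
struct Slot {
    tag: u64,
    valid: bool,
}

// Returns true on a hit; installs the block on a miss.
fn access(slots: &mut [Slot; 4], addr: u64) -> bool {
    let word_addr = addr >> 2;
    let index = (word_addr & 0b11) as usize; // low 2 bits pick the slot
    let tag = word_addr >> 2;                // remaining bits are the tag
    let slot = &mut slots[index];
    if slot.valid && slot.tag == tag {
        true
    } else {
        slot.tag = tag;
        slot.valid = true;
        false
    }
}

fn main() {
    let mut slots = [Slot::default(); 4];
    // Byte addresses 0 and 16 both map to slot 0, so alternating
    // accesses evict each other on every reference.
    let trace = [0u64, 16, 0, 16, 0, 16];
    let hits = trace.iter().filter(|&&a| access(&mut slots, a)).count();
    println!("hits: {hits} / {}", trace.len()); // 0 hits: thrashing
}
```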

Fully-Associative Caches

Concept

In a fully-associative cache, an address can be stored in ANY slot. This eliminates conflict misses but requires checking all slots on every access. While too expensive for regular cache memory, fully-associative caches are used in specific situations like hardware support for virtual memory.

Address Decomposition

No index bits needed — only tag and byte offset:

┌─────────────────────────────────┬───────────┐
│             TAG                 │ BYTE OFF  │
│           [63:2]                │  [1:0]    │
└─────────────────────────────────┴───────────┘

Cache Structure with Timestamp

We add a timestamp field for tracking when each slot was last accessed, which is needed for the LRU replacement policy:

struct CacheSlot {
    tag: u64,
    block: [u32; 1],
    valid: bool,
    timestamp: u64,
}

let slots: [CacheSlot; N];

Fully-Associative Lookup Algorithm

Must compare the address tag against ALL slot tags:

refs += 1;
let addr_tag = addr >> 2;  // Remove byte_offset bits (2)

for i in 0..4 {
    let slot = &mut slots[i];
    if slot.valid && slot.tag == addr_tag {
        // Hit
        num_hits += 1;
        slot.timestamp = refs;
        return slot.block[0];
    }
}

// Miss
num_misses += 1;
let slot = find_lru_slot(&mut slots);
slot.block[0] = mem_read_word(addr);
slot.tag = addr_tag;
slot.valid = true;
slot.timestamp = refs;
slot.block[0]

LRU Replacement Policy

When all slots are full, we need to choose which to evict. LRU (Least Recently Used) evicts the slot that hasn't been accessed for the longest time.

We use a logical timestamp (the number of cache references). Each time we access a cache slot, we set the slot's timestamp to the current reference count. To find the LRU slot, we look for the slot with the smallest timestamp — this is the slot accessed least recently.

First, look for an unused slot (cold miss). If all slots are valid, pick the slot with the oldest timestamp (hot miss):

fn find_lru_slot(slots: &mut [CacheSlot]) -> &mut CacheSlot {
    // First, look for an invalid (unused) slot
    for slot in slots.iter_mut() {
        if !slot.valid {
            return slot;
        }
    }

    // All valid — find slot with oldest timestamp
    let mut lru_index = 0;
    let mut min_timestamp = u64::MAX;

    for (i, slot) in slots.iter().enumerate() {
        if slot.timestamp < min_timestamp {
            min_timestamp = slot.timestamp;
            lru_index = i;
        }
    }

    &mut slots[lru_index]
}

Trade-offs

| Aspect | Fully-Associative |
|--------|-------------------|
| Hit Rate | Highest (no conflict misses) |
| Lookup Speed | Slowest (check all slots) |
| Hardware Cost | Highest (parallel comparators) |
| Replacement | Flexible (LRU, FIFO, Random) |

Set-Associative Caches

Concept

A set-associative cache is a compromise between direct-mapped and fully-associative:

  • Cache is divided into sets
  • Each set contains multiple ways (slots)
  • An address maps to one specific set (like direct-mapped)
  • Within the set, the address can go in any way (like fully-associative)

Terminology

  • N-way set-associative: Each set has N slots
  • 2-way: Each set has 2 slots (most common)
  • 4-way: Each set has 4 slots
  • 8-way: Each set has 8 slots

Address Decomposition for 2-Way Set-Associative

For 4 slots total organized as 2-way (2 sets, 2 ways each):

┌──────────────────────────┬──────────┬───────────┐
│          TAG             │ SET IDX  │ BYTE OFF  │
│        [63:3]            │   [2]    │  [1:0]    │
└──────────────────────────┴──────────┴───────────┘
         61 bits             1 bit      2 bits

Set-Associative Formulas

let num_slots = N;           // Total number of cache slots
let num_ways = W;            // Number of ways (slots per set)
let num_sets = N / W;        // Number of sets

// Bit widths
// set_index_bits = log2(num_sets)
// byte_offset_bits = log2(block_size_in_bytes)
// tag_bits = addr_bits - set_index_bits - byte_offset_bits
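These formulas can be sketched in Rust; `field_widths` is a hypothetical helper that assumes power-of-two sizes and uses `trailing_zeros` as an integer log2:

```rust
// Derive field widths (tag, set index, byte offset) from a cache
// configuration with 1-word blocks. Sizes must be powers of two.
fn field_widths(num_slots: u64, num_ways: u64, block_bytes: u64, addr_bits: u32) -> (u32, u32, u32) {
    let num_sets = num_slots / num_ways;
    let offset_bits = block_bytes.trailing_zeros(); // log2(block_bytes)
    let set_bits = num_sets.trailing_zeros();       // log2(num_sets)
    let tag_bits = addr_bits - set_bits - offset_bits;
    (tag_bits, set_bits, offset_bits)
}

fn main() {
    // 4 slots, 2-way, 4-byte blocks, 64-bit addresses (the layout above):
    println!("{:?}", field_widths(4, 2, 4, 64)); // (61, 1, 2)
}
```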

Cache Layout

                Way 1              Way 0
         ┌───────┬─────┬─────┬───────┬─────┬─────┐
  Set 1  │ valid │ tag │data │ valid │ tag │data │
         ├───────┼─────┼─────┼───────┼─────┼─────┤
  Set 0  │ valid │ tag │data │ valid │ tag │data │
         └───────┴─────┴─────┴───────┴─────┴─────┘

Set-Associative Lookup Algorithm

We logically treat the cache slot array as groups of num_ways slots. The first group is set 0, the second is set 1, and so on. Once we find the set index, we multiply by num_ways to get the first slot in the set:

refs += 1;
let num_ways = 4;
let addr_tag = addr >> 5;  // Remove byte_offset (2) and set_index bits (3)
let set_index = ((addr >> 2) & 0b111) as usize;

// 4 ways per set, so set_index * num_ways gives us the first slot in the set
let set_base = set_index * num_ways;

for i in 0..num_ways {
    let slot = &mut slots[set_base + i];
    if slot.valid && slot.tag == addr_tag {
        // Hit
        num_hits += 1;
        slot.timestamp = refs;
        return slot.block[0];
    }
}

// Miss
num_misses += 1;
let slot = find_lru_slot_in_set(&mut slots, set_base, num_ways);
slot.block[0] = mem_read_word(addr);
slot.tag = addr_tag;
slot.valid = true;
slot.timestamp = refs;
slot.block[0]

LRU Within a Set

The find_lru_slot_in_set function works just like the fully-associative version, but only searches within the set:

fn find_lru_slot_in_set(
    slots: &mut [CacheSlot],
    set_base: usize,
    num_ways: usize,
) -> &mut CacheSlot {
    // First, look for an invalid slot in the set
    for i in 0..num_ways {
        if !slots[set_base + i].valid {
            return &mut slots[set_base + i];
        }
    }

    // All valid — find LRU within the set
    let mut lru_index = set_base;
    let mut min_timestamp = u64::MAX;

    for i in 0..num_ways {
        let idx = set_base + i;
        if slots[idx].timestamp < min_timestamp {
            min_timestamp = slots[idx].timestamp;
            lru_index = idx;
        }
    }

    &mut slots[lru_index]
}

Example: 4-Way Set-Associative with 32 Slots

Configuration:

  • 32 slots total
  • 4-way set-associative
  • 1-word blocks

Calculations:

num_sets = 32 / 4 = 8 sets
set_index_bits = log2(8) = 3 bits
byte_offset_bits = log2(4) = 2 bits
tag_bits = 64 - 3 - 2 = 59 bits

Address decomposition:

┌──────────────────────────┬───────────┬───────────┐
│          TAG             │ SET INDEX │ BYTE OFF  │
│        [63:5]            │  [4:2]    │  [1:0]    │
└──────────────────────────┴───────────┴───────────┘
         59 bits             3 bits      2 bits

Block Size

Multi-Word Blocks

To exploit spatial locality, caches typically store multiple words per slot. When we miss on one word, we load an entire block of neighboring words into the cache — so subsequent accesses to nearby addresses are likely to hit.

struct CacheSlot {
    tag: u64,
    block: [u32; 4],  // 4-word block (16 bytes)
    valid: bool,
    timestamp: u64,
}

let slots: [CacheSlot; N];

Address Decomposition with Block Size

For a 4-slot direct-mapped cache with 4-word blocks:

┌──────────────────────┬───────────┬───────────┬───────────┐
│        TAG           │ SLOT IDX  │ BLOCK IDX │ BYTE OFF  │
│      [63:6]          │  [5:4]    │  [3:2]    │  [1:0]    │
└──────────────────────┴───────────┴───────────┴───────────┘
       58 bits           2 bits      2 bits      2 bits
  • Byte Offset (2 bits): Selects byte within word
  • Block Index (2 bits): Selects word within block
  • Slot Index (2 bits): Selects cache slot
  • Tag (58 bits): Identifies memory block

For other cache types with 4-word blocks:

Fully Associative:

tag[63:4] block_index[3:2] byte_offset[1:0]

Set Associative (2-way, 4 slots total = 2 sets):

tag[63:5] set_index[4] block_index[3:2] byte_offset[1:0]

Loading a Block on Miss

When a miss occurs, we load the entire block from memory. We need to find the block-aligned address:

let word_addr = addr / 4;
let block_index = word_addr % 4;  // Where 4 is the block size in words
let block_base = word_addr - block_index;

Now load 4 words from memory starting at block_base. Note that block_base is a word address and needs to be converted to a byte address to dereference properly.
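A sketch of that block load, assuming the `mem_read_word` helper from the earlier lookup pseudocode (stubbed here to return the word address as data so the example is self-contained):

```rust
// Stand-in for real memory: returns the word address as the data value.
fn mem_read_word(byte_addr: u64) -> u32 {
    (byte_addr / 4) as u32
}

// Load the entire 4-word block containing byte address `addr`.
fn load_block(addr: u64) -> [u32; 4] {
    let word_addr = addr / 4;
    let block_base = word_addr - word_addr % 4; // block-aligned word address
    let mut block = [0u32; 4];
    for i in 0..4u64 {
        // Convert the word address back to a byte address before reading.
        block[i as usize] = mem_read_word((block_base + i) * 4);
    }
    block
}

fn main() {
    // Byte address 0x1C is word 7, which lives in the block of words 4..=7.
    println!("{:?}", load_block(0x1C)); // [4, 5, 6, 7]
}
```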


Cache Comparison

Summary Table

| Feature | Direct-Mapped | Fully-Associative | N-Way Set-Assoc |
|---------|---------------|-------------------|-----------------|
| Address maps to | 1 slot | Any slot | 1 set (N slots) |
| Index bits | log2(slots) | 0 | log2(sets) |
| Comparisons | 1 | All slots | N (ways) |
| Conflict misses | Many | None | Reduced |
| Hardware cost | Lowest | Highest | Medium |
| Replacement | None needed | LRU/FIFO/Random | LRU within set |

Special Cases

  • Direct-mapped = 1-way set-associative (each set has 1 slot)
  • Fully-associative = N-way where N = total slots (one set containing all slots)
graph LR
    subgraph "Direct-Mapped<br>(1-way)"
        DM1["Slot 0"]
        DM2["Slot 1"]
        DM3["Slot 2"]
        DM4["Slot 3"]
    end

    subgraph "2-Way Set-Associative"
        SA1["Set 0: Way 0 | Way 1"]
        SA2["Set 1: Way 0 | Way 1"]
    end

    subgraph "Fully-Associative<br>(4-way, 1 set)"
        FA["Way 0 | Way 1 | Way 2 | Way 3"]
    end

Cache Questions to Answer

When designing or analyzing a cache, consider three fundamental questions:

1. Where to Look?

Given an address, which cache slot(s) could contain the data?

| Cache Type | Answer |
|------------|--------|
| Direct-Mapped | Exactly one slot (determined by index bits) |
| Fully-Associative | Any slot (must check all) |
| Set-Associative | One set of slots (determined by set index) |

2. Is It There?

How do we know if the requested data is actually in the cache?

Requirements for a hit:

  1. The slot must be valid (valid is true)
  2. The slot's tag must match the address's tag
let is_hit = slot.valid && slot.tag == addr_tag;

3. What to Replace?

On a miss, if we need to bring in new data, what do we evict?

| Cache Type | Replacement Policy |
|------------|--------------------|
| Direct-Mapped | No choice — only one possible slot |
| Fully-Associative | LRU, FIFO, Random, etc. |
| Set-Associative | LRU, FIFO, Random within the set |

Address Bit Calculation

General Formula

For a cache with:

  • S = number of slots (lines)
  • B = block size in bytes
  • A = address size in bits (typically 64)

The number of bits for each field:

  • Byte Offset bits = log2(B)
  • Index bits = log2(S)
  • Tag bits = A - Index bits - Byte Offset bits

Example Calculations

Example 1: 4-slot cache, 4-byte blocks (1 word), 64-bit addresses

Byte Offset = log2(4) = 2 bits
Index       = log2(4) = 2 bits
Tag         = 64 - 2 - 2 = 60 bits

Example 2: 16-slot cache, 16-byte blocks (4 words), 64-bit addresses

Byte Offset = log2(16) = 4 bits
Index       = log2(16) = 4 bits
Tag         = 64 - 4 - 4 = 56 bits

Example 3: 256-slot cache, 64-byte blocks (16 words), 64-bit addresses

Byte Offset = log2(64) = 6 bits
Index       = log2(256) = 8 bits
Tag         = 64 - 6 - 8 = 50 bits

Example 4: 64KB cache, 64-byte blocks, 48-bit addresses (direct-mapped)

Number of slots = 65536 / 64 = 1024 slots
Byte Offset = log2(64) = 6 bits
Index       = log2(1024) = 10 bits
Tag         = 48 - 6 - 10 = 32 bits

Cache Simulation Trace Examples

Example 1: Direct-Mapped Thrashing

A 4-slot direct-mapped cache with 1-word blocks is initially empty. Address trace (word addresses): 0, 4, 0, 8, 0, 12, 0

| Access | Word Addr | Slot (addr % 4) | Tag | Result | Cache State |
|--------|-----------|-----------------|-----|--------|-------------|
| 1 | 0 | 0 | 0 | MISS | [0: tag=0] |
| 2 | 4 | 0 | 1 | MISS | [0: tag=1] (evicts tag=0) |
| 3 | 0 | 0 | 0 | MISS | [0: tag=0] (evicts tag=1) |
| 4 | 8 | 0 | 2 | MISS | [0: tag=2] (evicts tag=0) |
| 5 | 0 | 0 | 0 | MISS | [0: tag=0] (evicts tag=2) |
| 6 | 12 | 0 | 3 | MISS | [0: tag=3] (evicts tag=0) |
| 7 | 0 | 0 | 0 | MISS | [0: tag=0] (evicts tag=3) |

Result: All misses! Hit rate: 0%. This is thrashing — all addresses map to the same slot, causing constant evictions.

Example 2: 2-Way Set-Associative Trace

A 4-slot, 2-way set-associative cache (2 sets), 1-word blocks, LRU replacement. Initially empty.

Word addresses: 0, 8, 0, 6, 8

Configuration:

  • Set index = word_addr % 2
  • Tag = word_addr / 2

Access 1: Word Addr 0

  • Set = 0 % 2 = 0, Tag = 0 / 2 = 0
  • Set 0: Both invalid → MISS (cold)
  • Load into Way 0
| Set | Way 0 (V, Tag, TS) | Way 1 (V, Tag, TS) |
|-----|--------------------|--------------------|
| 0 | (true, 0, 1) | (false, -, -) |
| 1 | (false, -, -) | (false, -, -) |

Access 2: Word Addr 8

  • Set = 8 % 2 = 0, Tag = 8 / 2 = 4
  • Set 0: Way 0 tag=0 ≠ 4, Way 1 invalid → MISS (cold)
  • Load into Way 1
| Set | Way 0 (V, Tag, TS) | Way 1 (V, Tag, TS) |
|-----|--------------------|--------------------|
| 0 | (true, 0, 1) | (true, 4, 2) |
| 1 | (false, -, -) | (false, -, -) |

Access 3: Word Addr 0

  • Set = 0, Tag = 0
  • Set 0: Way 0 tag=0 matches → HIT
  • Update timestamp
| Set | Way 0 (V, Tag, TS) | Way 1 (V, Tag, TS) |
|-----|--------------------|--------------------|
| 0 | (true, 0, 3) | (true, 4, 2) |
| 1 | (false, -, -) | (false, -, -) |

Access 4: Word Addr 6

  • Set = 6 % 2 = 0, Tag = 6 / 2 = 3
  • Set 0: Way 0 tag=0 ≠ 3, Way 1 tag=4 ≠ 3 → MISS (hot)
  • Both valid, LRU is Way 1 (timestamp 2 < 3) → evict Way 1
| Set | Way 0 (V, Tag, TS) | Way 1 (V, Tag, TS) |
|-----|--------------------|--------------------|
| 0 | (true, 0, 3) | (true, 3, 4) |
| 1 | (false, -, -) | (false, -, -) |

Access 5: Word Addr 8

  • Set = 0, Tag = 4
  • Set 0: Way 0 tag=0 ≠ 4, Way 1 tag=3 ≠ 4 → MISS (hot)
  • LRU is Way 0 (timestamp 3 < 4) → evict Way 0
| Set | Way 0 (V, Tag, TS) | Way 1 (V, Tag, TS) |
|-----|--------------------|--------------------|
| 0 | (true, 4, 5) | (true, 3, 4) |
| 1 | (false, -, -) | (false, -, -) |

Summary: Hits: 1 (access 3), Misses: 4 → Hit rate: 20%

Example 3: Direct-Mapped vs 2-Way Comparison

Address sequence: 0, 16, 0, 16, 0 with a 2-slot cache.

Direct-Mapped (2 slots):

Both 0 and 16 map to slot 0 (since (0/4)%2=0 and (16/4)%2=0).

| Access | Addr | Slot | Tag | Result |
|--------|------|------|-----|--------|
| 1 | 0 | 0 | 0 | MISS |
| 2 | 16 | 0 | 4 | MISS (evicts 0) |
| 3 | 0 | 0 | 0 | MISS (evicts 16) |
| 4 | 16 | 0 | 4 | MISS (evicts 0) |
| 5 | 0 | 0 | 0 | MISS (evicts 16) |

Hit rate: 0% (thrashing)

2-Way Set-Associative (1 set with 2 ways):

Both addresses map to the same set, but can coexist in different ways.

| Access | Addr | Tag | Way 0 | Way 1 | Result |
|--------|------|-----|-------|-------|--------|
| 1 | 0 | 0 | 0 | - | MISS |
| 2 | 16 | 4 | 0 | 4 | MISS |
| 3 | 0 | 0 | 0 | 4 | HIT |
| 4 | 16 | 4 | 0 | 4 | HIT |
| 5 | 0 | 0 | 0 | 4 | HIT |

Hit rate: 60%

The 2-way associative cache eliminates the conflict misses that plagued the direct-mapped cache.


Project03 Cache Simulator

Overview

Project03 extends the RISC-V emulator to include dynamic analysis and cache simulation:

  1. Additional Instructions: Complete RISC-V instruction set
  2. Dynamic Analysis: Count instructions, branches taken/not-taken
  3. Cache Simulation: Simulate I-cache behavior during execution

Cache Simulator Structure

struct CacheSlot {
    tag: u64,
    block: [u32; 1],   // or larger for multi-word blocks
    valid: bool,
    timestamp: u64,     // For LRU replacement
}

struct Cache {
    slots: Vec<CacheSlot>,
    num_slots: usize,
    block_size: usize,    // Block size in words
    num_ways: usize,      // Associativity (1 = direct-mapped)
    num_refs: u64,
    num_hits: u64,
    num_misses: u64,
}

Simulation Flow

flowchart TD
    A["Fetch Instruction"] --> B["Check I-Cache"]
    B --> C{"Hit?"}
    C -->|"Yes"| D["Increment hits"]
    C -->|"No"| E["Increment misses"]
    D --> F["Execute Instruction"]
    E --> G["Update Cache"]
    G --> F
    F --> H{"More instructions?"}
    H -->|"Yes"| A
    H -->|"No"| I["Report Statistics"]

Practice Problems

Problem 1: Address Decomposition

Given a 16-slot direct-mapped cache with 4-byte blocks and 64-bit addresses, decompose the address 0x0000000000001A3C.

Solution

**Step 1: Calculate bit widths**

Byte Offset = log2(4) = 2 bits
Index = log2(16) = 4 bits
Tag = 64 - 2 - 4 = 58 bits

**Step 2: Convert address to binary**

0x1A3C = 0001 1010 0011 1100

**Step 3: Extract fields**

Byte Offset [1:0]: 00 = 0
Index [5:2]: 1111 = 15
Tag [63:6]: 0x68 = 104 (remaining upper bits)

**Answer:** Byte Offset: 0, Index: 15 (slot 15), Tag: 0x68

Problem 2: Hit Rate Calculation

A cache simulation reports: 950 hits, 50 misses. Calculate the hit rate, miss rate, and AMAT assuming hit_time = 1 cycle and miss_penalty = 100 cycles.

Solution
num_refs = 950 + 50 = 1000
hit_rate = 950 / 1000 = 0.95 = 95%
miss_rate = 50 / 1000 = 0.05 = 5%
AMAT = 1 + (0.05 x 100) = 1 + 5 = 6 cycles

Problem 3: Set-Associative Address Decomposition

For an 8-way set-associative cache with 64 slots and 4-word blocks, calculate the number of bits for tag, set index, block index, and byte offset (assuming 64-bit addresses).

Solution

**Step 1: Calculate number of sets**

num_sets = 64 slots / 8 ways = 8 sets

**Step 2: Calculate bit widths**

Byte offset = log2(4 bytes/word) = 2 bits
Block index = log2(4 words/block) = 2 bits
Set index = log2(8 sets) = 3 bits
Tag = 64 - 2 - 2 - 3 = 57 bits

**Address format:**

┌──────────────────┬───────────┬───────────┬───────────┐
│       TAG        │ SET INDEX │ BLOCK IDX │ BYTE OFF  │
│     [63:7]       │  [6:4]    │  [3:2]    │  [1:0]    │
└──────────────────┴───────────┴───────────┴───────────┘
      57 bits         3 bits      2 bits      2 bits

Problem 4: Cache Trace Analysis

A 4-slot, 2-way set-associative cache (2 sets) with LRU replacement processes these byte addresses: 0, 4, 8, 12, 0, 16. Assume 1-word blocks.

Solution

**Configuration:**

  • 2 sets, 2 ways each
  • Set index = (addr >> 2) & 0b1
  • Tag = addr >> 3

| Access | Byte Addr | Word Addr | Set | Tag | Result |
|--------|-----------|-----------|-----|-----|--------|
| 1 | 0 | 0 | 0 | 0 | MISS |
| 2 | 4 | 1 | 1 | 0 | MISS |
| 3 | 8 | 2 | 0 | 1 | MISS |
| 4 | 12 | 3 | 1 | 1 | MISS |
| 5 | 0 | 0 | 0 | 0 | HIT |
| 6 | 16 | 4 | 0 | 2 | MISS (LRU evicts tag=1) |

**Result:** M, M, M, M, H, M — Hit rate: 1/6 ≈ 16.7%

Problem 5: Locality Analysis

Which type of locality does each code pattern exhibit?

// Pattern A
for _ in 0..100 {
    count += 1;
}

// Pattern B
for i in 0..100 {
    sum += array[i];
}

// Pattern C
for i in 0..100 {
    process(array[i]);
}
Solution

**Pattern A: Temporal Locality** — Variable `count` is accessed repeatedly (100 times); the same location is accessed multiple times.

**Pattern B: Spatial + Temporal Locality** — `array[i]`, `array[i+1]`, etc. are adjacent in memory (spatial). Variable `sum` is accessed 100 times (temporal).

**Pattern C: Spatial + Temporal Locality** — The array is accessed sequentially (spatial). The code of function `process` is executed 100 times (temporal).

Key Concepts

| Concept | Definition |
|---------|------------|
| Cache | Small, fast memory that holds recently accessed data |
| Hit | Data found in cache; fast access |
| Miss | Data not in cache; must fetch from memory |
| Hit Rate | num_hits / num_refs |
| Miss Rate | num_misses / num_refs = 1 - hit_rate |
| AMAT | Average Memory Access Time = hit_time + (miss_rate x miss_penalty) |
| Temporal Locality | Recently accessed data likely to be accessed again soon |
| Spatial Locality | Data near recently accessed data likely to be accessed soon |
| Direct-Mapped | Each address maps to exactly one cache slot |
| Fully-Associative | Address can be stored in any slot |
| Set-Associative | Address maps to one set containing multiple ways |
| Tag | Upper address bits identifying which block is in a slot |
| Index | Address bits selecting which slot/set to access |
| Offset | Address bits selecting byte/word within a block |
| Valid Bit | Indicates whether cache slot contains valid data |
| LRU | Least Recently Used replacement policy |
| Cold Miss | First access to a slot (valid is false) |
| Hot Miss | Slot is valid but tag does not match |
| Thrashing | Repeated evictions when addresses conflict on the same slot |

Summary

  1. Memory Hierarchy: Computer systems use a hierarchy of memory (registers, caches, DRAM, disk) to balance speed and cost. Caches bridge the processor-memory speed gap by keeping recently accessed data close to the CPU.

  2. Principles of Locality: Caches work because programs exhibit temporal locality (recently accessed data will be accessed again) and spatial locality (nearby data will be accessed soon). These principles explain why caches achieve hit rates exceeding 95%.

  3. Direct-Mapped Caches: The simplest cache organization maps each address to exactly one slot using the index bits. The tag identifies which of the many addresses mapping to that slot is currently stored. Simple but suffers from conflict misses.

  4. Fully-Associative Caches: An address can map to any slot, eliminating conflict misses. Requires checking all slots on every access and a replacement policy (LRU). Most flexible but most expensive in hardware.

  5. Set-Associative Caches: A compromise — addresses map to one set containing multiple ways. Reduces conflict misses while keeping hardware cost reasonable.

  6. Address Decomposition: A memory address is split into tag, index/set index, and offset. With multi-word blocks, a block index selects the word within the block. The number of bits for each field depends on cache size, block size, and associativity.

  7. LRU Replacement: When all slots (or ways in a set) are full, LRU evicts the slot accessed least recently. Implemented in software using timestamps.


Further Reading

  1. Computer Organization and Design: RISC-V Edition (Patterson & Hennessy) — Chapter 5: Large and Fast: Exploiting Memory Hierarchy

  2. Course Guide: Cache Memory Guide — Complete pseudocode for all three cache types with Rust examples

  3. Project03 Assignment — Implementation of I-cache simulation in the RISC-V emulator