Cache Memory¶
Overview¶
This lecture introduces cache memory concepts and their role in bridging the speed gap between processors and main memory. We examine why caches are effective through the principles of temporal and spatial locality, and explore three cache organizations: direct-mapped, fully-associative, and set-associative. We cover address decomposition, cache lookup algorithms, LRU replacement, and multi-word block sizes. Understanding cache memory is essential for writing efficient code and for implementing the cache simulator in Project03.
Learning Objectives¶
- Understand the memory hierarchy and the processor-memory speed gap
- Explain the principles of temporal and spatial locality
- Describe how direct-mapped caches use address decomposition
- Calculate tag, index, and offset bits for a given cache configuration
- Implement cache lookup algorithms for all three cache types in Rust
- Understand and implement LRU replacement policy using timestamps
- Decompose addresses for caches with multi-word blocks
- Trace through cache simulation scenarios to determine hits and misses
- Compute hit rates and miss rates from simulation data
Prerequisites¶
- RISC-V Emulation — emulator state and memory operations
- RISC-V Machine Code — instruction encoding and bit extraction
The Memory Hierarchy¶
The Processor-Memory Speed Gap¶
Modern processors can execute billions of instructions per second, but accessing data from main memory (DRAM) takes many more cycles than a simple arithmetic operation. This creates a fundamental bottleneck:
graph LR
subgraph "Speed Over Time"
A["Processor Speed"] -->|"Increasing<br>rapidly"| B["Performance Gap"]
C["Memory Speed"] -->|"Increasing<br>slowly"| B
end
B -->|"Growing wider<br>each year"| D["Solution: Cache Memory"]
Memory Hierarchy Levels¶
To address this gap, computer systems use a memory hierarchy:
graph TD
subgraph "Memory Hierarchy"
A["CPU Registers<br>Fastest, smallest<br>~1 cycle"]
B["L1 Cache (SRAM)<br>Very fast, small<br>~1-4 cycles"]
C["L2 Cache (SRAM)<br>Fast, medium<br>~10-20 cycles"]
D["L3 Cache (SRAM)<br>Moderate, larger<br>~40-75 cycles"]
E["Main Memory (DRAM)<br>Slower, large<br>~100+ cycles"]
F["Disk/SSD<br>Slowest, huge<br>~millions of cycles"]
end
A --> B
B --> C
C --> D
D --> E
E --> F
| Level | Technology | Size | Access Time | Cost/Byte |
|---|---|---|---|---|
| Registers | Flip-flops | ~32 x 64 bits | ~0.5 ns | Highest |
| L1 Cache | SRAM | 32-64 KB | ~1-2 ns | Very High |
| L2 Cache | SRAM | 256 KB-1 MB | ~5-10 ns | High |
| L3 Cache | SRAM | 2-32 MB | ~20-40 ns | Medium |
| Main Memory | DRAM | 4-64 GB | ~50-100 ns | Low |
| SSD | Flash | 256 GB-4 TB | ~100 us | Very Low |
| HDD | Magnetic | 1-16 TB | ~10 ms | Lowest |
Cache Memory Position¶
A cache sits between the processor and main memory, holding recently accessed data:
graph LR
subgraph "Processor"
A["CPU<br>lw, sw"]
end
subgraph "Cache"
B["SRAM<br>Fast memory"]
end
subgraph "Main Memory"
C["DRAM<br>STACK"]
D["DATA"]
E["CODE"]
end
A -->|"addr"| B
B -->|"hit"| A
B <-->|"miss"| C
B <--> D
B <--> E
Principles of Locality¶
Why Caches Work¶
Caches achieve high hit rates (often >97%) because programs exhibit locality of reference.
Temporal Locality¶
Temporal locality: If a data item is accessed, it is likely to be accessed again soon.
Examples:
- Loop counters: Variable `i` in a loop is accessed on every iteration
- Function variables: Local variables used throughout a function
- Recursive calls: The same function code is executed repeatedly
// Temporal locality example
let mut sum: i64 = 0;
for i in 0..1000 { // 'i' accessed 1000+ times
sum += array[i]; // 'sum' accessed 1000 times
}
Spatial Locality¶
Spatial locality: If a data item is accessed, nearby data items are likely to be accessed soon.
Examples:
- Array traversal: Accessing `array[i]` often followed by `array[i+1]`
- Sequential code: Instruction at PC followed by instruction at PC+4
- Struct access: Accessing one field often followed by neighboring fields
// Spatial locality example
for i in 0..1000 {
total += array[i]; // array[0], array[1], array[2], ... accessed sequentially
}
Locality in Code Execution¶
graph TD
subgraph "Instruction Fetch"
A["PC = 0x1000<br>fetch instruction"]
B["PC = 0x1004<br>fetch next (spatial)"]
C["PC = 0x1008<br>fetch next (spatial)"]
D["PC = 0x1000<br>loop back (temporal)"]
end
A --> B
B --> C
C -->|"branch"| D
D --> A
Cache Concepts and Terminology¶
Key Terms¶
| Term | Definition |
|---|---|
| Hit | Requested data is found in the cache |
| Miss | Requested data is not in the cache |
| Hit Rate | Fraction of accesses that are hits |
| Miss Rate | Fraction of accesses that are misses |
| Block | Unit of data transferred between cache and memory |
| Slot/Line | A single entry in the cache (holds one block) |
| Tag | Identifier stored with each block to identify which address it holds |
| Valid Bit | Indicates whether a slot contains valid data |
| Cold Miss | First access to a slot (valid is false) |
| Hot Miss | Slot is valid but contains data for a different address (tag mismatch) |
Hit and Miss Metrics¶
hit_rate = num_hits / num_refs
miss_rate = num_misses / num_refs
// Alternative formulas
hit_rate = 1 - miss_rate
miss_rate = 1 - hit_rate
Average Memory Access Time (AMAT)¶
The AMAT formula combines hit and miss times:

AMAT = hit_time + (miss_rate x miss_penalty)

Where:
- hit_time: Time to access the cache (typically 1-2 cycles)
- miss_rate: Probability of a cache miss
- miss_penalty: Additional time to fetch from memory on a miss (typically 50-200 cycles)

Example: If hit_time = 1 cycle, miss_rate = 5%, miss_penalty = 100 cycles:

AMAT = 1 + (0.05 x 100) = 1 + 5 = 6 cycles
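As a quick sanity check, the formula translates directly to code. A minimal sketch (the `amat` function name and signature are illustrative, not part of any project API):

fn amat(hit_time: f64, miss_rate: f64, miss_penalty: f64) -> f64 {
    // AMAT = hit_time + (miss_rate x miss_penalty), all values in cycles
    hit_time + miss_rate * miss_penalty
}

fn main() {
    // The example above: 1 cycle hit time, 5% miss rate, 100 cycle penalty
    println!("{}", amat(1.0, 0.05, 100.0)); // prints 6
}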
Memory Addressing¶
Byte vs Word Addressing¶
Memory is byte-addressable, but caches often work with words (4 bytes). Word N occupies byte addresses 4N through 4N + 3.
Address Conversion¶
// Byte address to word address
let word_addr = addr / 4;
let word_addr = addr >> 2; // Equivalent using bit shift
// Word address to byte address
let addr = word_addr * 4;
let addr = word_addr << 2; // Equivalent using bit shift
Word Alignment¶
For efficiency, word accesses should be word-aligned (addresses divisible by 4):
| Byte Address | Word Address | Aligned? |
|---|---|---|
| 0x00 | 0 | Yes |
| 0x04 | 1 | Yes |
| 0x08 | 2 | Yes |
| 0x03 | - | No |
| 0x05 | - | No |
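Alignment is easy to test in code. A small sketch, where `addr` is a byte address as in the conversions above:

let aligned = addr % 4 == 0;      // divisible by 4
let aligned = (addr & 0b11) == 0; // equivalent: low two bits are zero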
Direct-Mapped Caches¶
Overview¶
A direct-mapped cache is the simplest cache organization. Each memory address maps to exactly one cache slot, determined by the address itself. This makes the replacement policy trivial: if the data is not in the cache, you evict whatever is in the slot that the address maps to.
Cache Structure¶
struct CacheSlot {
tag: u64,
block: [u32; 1], // One word per block
valid: bool,
}
let slots: [CacheSlot; N]; // N slots in the cache
Address Decomposition¶
For a 4-slot direct-mapped cache with 1-word blocks:
64-bit byte address:
┌─────────────────────────┬─────────┬───────────┐
│ TAG │ INDEX │ BYTE OFF │
│ [63:4] │ [3:2] │ [1:0] │
└─────────────────────────┴─────────┴───────────┘
60 bits 2 bits 2 bits
- Byte Offset (2 bits): Selects byte within a word (for word-aligned access, always 00)
- Index (2 bits): Selects which cache slot (0-3 for 4 slots)
- Tag (60 bits): Identifies the specific memory block stored in the slot
Computing Index and Tag¶
We can use modulus or bit manipulation to extract these fields:
// For a 4-slot cache with 1-word blocks
let word_addr = addr >> 2;
let slot_index = word_addr & 0b11; // Same as word_addr % 4
Why does this work? Look at the binary form of word addresses:
dec bin
0 00000 <-- slot 0
1 00001 <-- slot 1
2 00010 <-- slot 2
3 00011 <-- slot 3
4 00100 <-- slot 0
5 00101 <-- slot 1
6 00110 <-- slot 2
7 00111 <-- slot 3
8 01000 <-- slot 0
The last two bits of each word address are exactly word_addr % 4. The remaining upper bits form the tag, which distinguishes different addresses that map to the same slot.
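The same idea can be written as a small helper for any power-of-two slot count. This is an illustrative sketch (the `decompose` function is not part of the project API) assuming 1-word blocks:

fn decompose(addr: u64, num_slots: u64) -> (u64, u64) {
    let index_bits = num_slots.trailing_zeros();  // log2 for powers of two
    let word_addr = addr >> 2;                    // strip the 2 byte-offset bits
    let slot_index = word_addr & (num_slots - 1); // same as word_addr % num_slots
    let tag = word_addr >> index_bits;            // remaining upper bits
    (tag, slot_index)
}
// decompose(16, 4) == (1, 0): byte address 16 maps to slot 0 with tag 1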
Direct-Mapped Cache Lookup¶
flowchart TD
A["Memory Address"] --> B["Extract Index Bits"]
B --> C["Access Cache Slot"]
C --> D{"Valid?"}
D -->|"No"| E["MISS (cold)"]
D -->|"Yes"| F{"Tag Match?"}
F -->|"No"| G["MISS (hot)"]
F -->|"Yes"| H["HIT: Return Data"]
E --> I["Fetch from Memory"]
G --> I
I --> J["Update Cache Slot"]
J --> K["Return Data"]
Direct-Mapped Lookup Algorithm¶
let addr_tag = addr >> 4; // Remove byte_offset (2) and slot_index bits (2)
let index_mask: u64 = 0b11;
let slot_index = ((addr >> 2) & index_mask) as usize;
let slot = &mut slots[slot_index];
if slot.valid && slot.tag == addr_tag {
// Hit
num_hits += 1;
slot.block[0]
} else {
// Miss (cold if !slot.valid, hot if slot.valid)
num_misses += 1;
slot.block[0] = mem_read_word(addr);
slot.tag = addr_tag;
slot.valid = true;
slot.block[0]
}
Limitation: Conflict Misses¶
Direct-mapped caches suffer from conflict misses when multiple addresses map to the same slot:
Addresses 0, 16, 32, 48 all map to slot 0 (for 4-slot cache)
If accessed alternately: 0, 16, 0, 16 -> all misses (thrashing)
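The thrashing pattern is easy to reproduce. Below is a minimal, self-contained sketch (the `Slot` type here is a stripped-down stand-in for the `CacheSlot` above) that replays the alternating trace on a 4-slot direct-mapped cache:

#[derive(Clone, Copy, Default)]
struct Slot {
    valid: bool,
    tag: u64,
}

fn main() {
    let mut slots = [Slot::default(); 4];
    let trace: [u64; 4] = [0, 16, 0, 16]; // byte addresses
    let (mut hits, mut misses) = (0, 0);
    for addr in trace {
        let index = ((addr >> 2) & 0b11) as usize; // slot index bits [3:2]
        let tag = addr >> 4;                       // remaining upper bits
        let slot = &mut slots[index];
        if slot.valid && slot.tag == tag {
            hits += 1;
        } else {
            misses += 1;
            *slot = Slot { valid: true, tag }; // evict and refill
        }
    }
    println!("hits={hits} misses={misses}"); // hits=0 misses=4
}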
Fully-Associative Caches¶
Concept¶
In a fully-associative cache, an address can be stored in ANY slot. This eliminates conflict misses but requires checking all slots on every access. While too expensive for regular cache memory, fully-associative caches are used in specific situations like hardware support for virtual memory.
Address Decomposition¶
No index bits needed — only tag and byte offset:
┌─────────────────────────────────┬───────────┐
│ TAG │ BYTE OFF │
│ [63:2] │ [1:0] │
└─────────────────────────────────┴───────────┘
Cache Structure with Timestamp¶
We add a timestamp field for tracking when each slot was last accessed, which is needed for the LRU replacement policy:
struct CacheSlot {
tag: u64,
block: [u32; 1],
valid: bool,
timestamp: u64,
}
let slots: [CacheSlot; N];
Fully-Associative Lookup Algorithm¶
Must compare the address tag against ALL slot tags:
refs += 1;
let addr_tag = addr >> 2; // Remove byte_offset bits (2)
for i in 0..4 {
let slot = &mut slots[i];
if slot.valid && slot.tag == addr_tag {
// Hit
num_hits += 1;
slot.timestamp = refs;
return slot.block[0];
}
}
// Miss
num_misses += 1;
let slot = find_lru_slot(&mut slots);
slot.block[0] = mem_read_word(addr);
slot.tag = addr_tag;
slot.valid = true;
slot.timestamp = refs;
slot.block[0]
LRU Replacement Policy¶
When all slots are full, we need to choose which to evict. LRU (Least Recently Used) evicts the slot that hasn't been accessed for the longest time.
We use a logical timestamp (the number of cache references). Each time we access a cache slot, we set the slot's timestamp to the current reference count. To find the LRU slot, we look for the slot with the smallest timestamp — this is the slot accessed least recently.
First, look for an unused slot (cold miss). If all slots are valid, pick the slot with the oldest timestamp (hot miss):
fn find_lru_slot(slots: &mut [CacheSlot]) -> &mut CacheSlot {
    // First, look for an invalid (unused) slot. We loop over indices and
    // return &mut slots[i] so the borrow checker accepts the re-borrow of
    // `slots` below (returning from an iter_mut loop would not compile).
    for i in 0..slots.len() {
        if !slots[i].valid {
            return &mut slots[i];
        }
    }
// All valid — find slot with oldest timestamp
let mut lru_index = 0;
let mut min_timestamp = u64::MAX;
for (i, slot) in slots.iter().enumerate() {
if slot.timestamp < min_timestamp {
min_timestamp = slot.timestamp;
lru_index = i;
}
}
&mut slots[lru_index]
}
Trade-offs¶
| Aspect | Fully-Associative |
|---|---|
| Hit Rate | Highest (no conflict misses) |
| Lookup Speed | Slowest (check all slots) |
| Hardware Cost | Highest (parallel comparators) |
| Replacement | Flexible (LRU, FIFO, Random) |
Set-Associative Caches¶
Concept¶
A set-associative cache is a compromise between direct-mapped and fully-associative:
- Cache is divided into sets
- Each set contains multiple ways (slots)
- An address maps to one specific set (like direct-mapped)
- Within the set, the address can go in any way (like fully-associative)
Terminology¶
- N-way set-associative: Each set has N slots
- 2-way: Each set has 2 slots (most common)
- 4-way: Each set has 4 slots
- 8-way: Each set has 8 slots
Address Decomposition for 2-Way Set-Associative¶
For 4 slots total organized as 2-way (2 sets, 2 ways each):
┌──────────────────────────┬──────────┬───────────┐
│ TAG │ SET IDX │ BYTE OFF │
│ [63:3] │ [2] │ [1:0] │
└──────────────────────────┴──────────┴───────────┘
61 bits 1 bit 2 bits
Set-Associative Formulas¶
let num_slots = N; // Total number of cache slots
let num_ways = W; // Number of ways (slots per set)
let num_sets = N / W; // Number of sets
// Bit widths
// set_index_bits = log2(num_sets)
// byte_offset_bits = log2(block_size_in_bytes)
// tag_bits = addr_bits - set_index_bits - byte_offset_bits
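These formulas translate directly to code, assuming power-of-two sizes throughout. A sketch (`field_widths` is an illustrative name, not a required function):

fn field_widths(num_slots: u32, num_ways: u32, block_bytes: u32, addr_bits: u32) -> (u32, u32, u32) {
    let num_sets = num_slots / num_ways;
    let set_index_bits = num_sets.ilog2();      // log2(num_sets)
    let byte_offset_bits = block_bytes.ilog2(); // log2(block size in bytes)
    let tag_bits = addr_bits - set_index_bits - byte_offset_bits;
    (tag_bits, set_index_bits, byte_offset_bits)
}
// field_widths(4, 2, 4, 64) == (61, 1, 2), matching the 2-way layout above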
Cache Layout¶
Way 1 Way 0
┌───────┬─────┬─────┬───────┬─────┬─────┐
Set 1 │ valid │ tag │data │ valid │ tag │data │
├───────┼─────┼─────┼───────┼─────┼─────┤
Set 0 │ valid │ tag │data │ valid │ tag │data │
└───────┴─────┴─────┴───────┴─────┴─────┘
Set-Associative Lookup Algorithm¶
We logically treat the cache slot array as groups of num_ways slots. The first group is set 0, the second is set 1, and so on. Once we find the set index, we multiply by num_ways to get the first slot in the set:
refs += 1;
let num_ways = 4;
let addr_tag = addr >> 5; // Remove byte_offset (2) and set_index bits (3)
let set_index = ((addr >> 2) & 0b111) as usize;
// 4 ways per set, so set_index * num_ways gives us the first slot in the set
let set_base = set_index * num_ways;
for i in 0..num_ways {
let slot = &mut slots[set_base + i];
if slot.valid && slot.tag == addr_tag {
// Hit
num_hits += 1;
slot.timestamp = refs;
return slot.block[0];
}
}
// Miss
num_misses += 1;
let slot = find_lru_slot_in_set(&mut slots, set_base, num_ways);
slot.block[0] = mem_read_word(addr);
slot.tag = addr_tag;
slot.valid = true;
slot.timestamp = refs;
slot.block[0]
LRU Within a Set¶
The find_lru_slot_in_set function works just like the fully-associative version, but only searches within the set:
fn find_lru_slot_in_set(
slots: &mut [CacheSlot],
set_base: usize,
num_ways: usize,
) -> &mut CacheSlot {
// First, look for an invalid slot in the set
for i in 0..num_ways {
if !slots[set_base + i].valid {
return &mut slots[set_base + i];
}
}
// All valid — find LRU within the set
let mut lru_index = set_base;
let mut min_timestamp = u64::MAX;
for i in 0..num_ways {
let idx = set_base + i;
if slots[idx].timestamp < min_timestamp {
min_timestamp = slots[idx].timestamp;
lru_index = idx;
}
}
&mut slots[lru_index]
}
Example: 4-Way Set-Associative with 32 Slots¶
Configuration:
- 32 slots total
- 4-way set-associative
- 1-word blocks
Calculations:
num_sets = 32 / 4 = 8 sets
set_index_bits = log2(8) = 3 bits
byte_offset_bits = log2(4) = 2 bits
tag_bits = 64 - 3 - 2 = 59 bits
Address decomposition:
┌──────────────────────────┬───────────┬───────────┐
│ TAG │ SET INDEX │ BYTE OFF │
│ [63:5] │ [4:2] │ [1:0] │
└──────────────────────────┴───────────┴───────────┘
59 bits 3 bits 2 bits
Block Size¶
Multi-Word Blocks¶
To exploit spatial locality, caches typically store multiple words per slot. When we miss on one word, we load an entire block of neighboring words into the cache — so subsequent accesses to nearby addresses are likely to hit.
struct CacheSlot {
tag: u64,
block: [u32; 4], // 4-word block (16 bytes)
valid: bool,
timestamp: u64,
}
let slots: [CacheSlot; N];
Address Decomposition with Block Size¶
For a 4-slot direct-mapped cache with 4-word blocks:
┌──────────────────────┬───────────┬───────────┬───────────┐
│ TAG │ SLOT IDX │ BLOCK IDX │ BYTE OFF │
│ [63:6] │ [5:4] │ [3:2] │ [1:0] │
└──────────────────────┴───────────┴───────────┴───────────┘
58 bits 2 bits 2 bits 2 bits
- Byte Offset (2 bits): Selects byte within word
- Block Index (2 bits): Selects word within block
- Slot Index (2 bits): Selects cache slot
- Tag (58 bits): Identifies memory block
For other cache types with 4-word blocks:

Fully Associative:

┌──────────────────────────────┬───────────┬───────────┐
│             TAG              │ BLOCK IDX │ BYTE OFF  │
│           [63:4]             │   [3:2]   │   [1:0]   │
└──────────────────────────────┴───────────┴───────────┘
          60 bits                  2 bits      2 bits

Set Associative (2-way, 4 slots total = 2 sets):

┌──────────────────────────┬─────────┬───────────┬───────────┐
│           TAG            │ SET IDX │ BLOCK IDX │ BYTE OFF  │
│         [63:5]           │   [4]   │   [3:2]   │   [1:0]   │
└──────────────────────────┴─────────┴───────────┴───────────┘
        59 bits               1 bit      2 bits      2 bits
Loading a Block on Miss¶
When a miss occurs, we load the entire block from memory. We need to find the block-aligned address:
let word_addr = addr / 4;
let block_index = word_addr % 4; // Where 4 is the block size in words
let block_base = word_addr - block_index;
Now load 4 words from memory starting at block_base. Note that block_base is a word address and needs to be converted to a byte address to dereference properly.
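Putting that together, a sketch of the block fill (reusing the `mem_read_word(byte_addr)` helper from the lookup fragments above; `slot` is assumed to be the victim slot chosen on the miss):

let block_size: u64 = 4; // words per block
let word_addr = addr / 4;
let block_base = word_addr - (word_addr % block_size); // block-aligned word address
for i in 0..block_size {
    // Convert each word address back to a byte address before reading memory
    slot.block[i as usize] = mem_read_word((block_base + i) * 4);
}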
Cache Comparison¶
Summary Table¶
| Feature | Direct-Mapped | Fully-Associative | N-Way Set-Assoc |
|---|---|---|---|
| Address maps to | 1 slot | Any slot | 1 set (N slots) |
| Index bits | log2(slots) | 0 | log2(sets) |
| Comparisons | 1 | All slots | N (ways) |
| Conflict misses | Many | None | Reduced |
| Hardware cost | Lowest | Highest | Medium |
| Replacement | None needed | LRU/FIFO/Random | LRU within set |
Special Cases¶
- Direct-mapped = 1-way set-associative (each set has 1 slot)
- Fully-associative = N-way where N = total slots (one set containing all slots); a short sketch after the diagram below shows how one mapping formula covers all three cases
graph LR
subgraph "Direct-Mapped<br>(1-way)"
DM1["Slot 0"]
DM2["Slot 1"]
DM3["Slot 2"]
DM4["Slot 3"]
end
subgraph "2-Way Set-Associative"
SA1["Set 0: Way 0 | Way 1"]
SA2["Set 1: Way 0 | Way 1"]
end
subgraph "Fully-Associative<br>(4-way, 1 set)"
FA["Way 0 | Way 1 | Way 2 | Way 3"]
end
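The special cases above can be captured in code: one mapping formula covers all three organizations. An illustrative sketch (1-word blocks assumed; `set_for` is not a project function): with `num_ways = 1` it degenerates to direct-mapped, and with `num_ways = num_slots` it becomes fully-associative.

fn set_for(addr: u64, num_slots: u64, num_ways: u64) -> (u64, u64) {
    let num_sets = num_slots / num_ways;
    let word_addr = addr >> 2;      // 1-word blocks: strip the byte offset
    let set_index = word_addr % num_sets;
    let tag = word_addr / num_sets; // remaining upper bits
    (set_index * num_ways, tag)     // (first slot of the set, address tag)
}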
Cache Questions to Answer¶
When designing or analyzing a cache, consider three fundamental questions:
1. Where to Look?¶
Given an address, which cache slot(s) could contain the data?
| Cache Type | Answer |
|---|---|
| Direct-Mapped | Exactly one slot (determined by index bits) |
| Fully-Associative | Any slot (must check all) |
| Set-Associative | One set of slots (determined by set index) |
2. Is It There?¶
How do we know if the requested data is actually in the cache?
Requirements for a hit:
- The slot must be valid (`valid` is `true`)
- The slot's tag must match the address's tag
3. What to Replace?¶
On a miss, if we need to bring in new data, what do we evict?
| Cache Type | Replacement Policy |
|---|---|
| Direct-Mapped | No choice — only one possible slot |
| Fully-Associative | LRU, FIFO, Random, etc. |
| Set-Associative | LRU, FIFO, Random within the set |
Address Bit Calculation¶
General Formula¶
For a cache with:
- S = number of slots (lines)
- B = block size in bytes
- A = address size in bits (typically 64)
The number of bits for each field:
- Byte Offset bits = log2(B)
- Index bits = log2(S)
- Tag bits = A - Index bits - Byte Offset bits
Example Calculations¶
Example 1: 4-slot cache, 4-byte blocks (1 word), 64-bit addresses

Byte Offset = log2(4) = 2 bits
Index = log2(4) = 2 bits
Tag = 64 - 2 - 2 = 60 bits

Example 2: 16-slot cache, 16-byte blocks (4 words), 64-bit addresses

Byte Offset = log2(16) = 4 bits
Index = log2(16) = 4 bits
Tag = 64 - 4 - 4 = 56 bits

Example 3: 256-slot cache, 64-byte blocks (16 words), 64-bit addresses

Byte Offset = log2(64) = 6 bits
Index = log2(256) = 8 bits
Tag = 64 - 8 - 6 = 50 bits
Example 4: 64KB cache, 64-byte blocks, 48-bit addresses (direct-mapped)
Number of slots = 65536 / 64 = 1024 slots
Byte Offset = log2(64) = 6 bits
Index = log2(1024) = 10 bits
Tag = 48 - 6 - 10 = 32 bits
Cache Simulation Trace Examples¶
Example 1: Direct-Mapped Thrashing¶
A 4-slot direct-mapped cache with 1-word blocks is initially empty. Address trace (word addresses): 0, 4, 0, 8, 0, 12, 0
| Access | Word Addr | Slot (addr % 4) | Tag | Result | Cache State |
|---|---|---|---|---|---|
| 1 | 0 | 0 | 0 | MISS | [0: tag=0] |
| 2 | 4 | 0 | 1 | MISS | [0: tag=1] (evicts tag=0) |
| 3 | 0 | 0 | 0 | MISS | [0: tag=0] (evicts tag=1) |
| 4 | 8 | 0 | 2 | MISS | [0: tag=2] (evicts tag=0) |
| 5 | 0 | 0 | 0 | MISS | [0: tag=0] (evicts tag=2) |
| 6 | 12 | 0 | 3 | MISS | [0: tag=3] (evicts tag=0) |
| 7 | 0 | 0 | 0 | MISS | [0: tag=0] (evicts tag=3) |
Result: All misses! Hit rate: 0%. This is thrashing — all addresses map to the same slot, causing constant evictions.
Example 2: 2-Way Set-Associative Trace¶
A 4-slot, 2-way set-associative cache (2 sets), 1-word blocks, LRU replacement. Initially empty.
Word addresses: 0, 8, 0, 6, 8
Configuration:
- Set index = word_addr % 2
- Tag = word_addr / 2
Access 1: Word Addr 0
- Set = 0 % 2 = 0, Tag = 0 / 2 = 0
- Set 0: Both invalid → MISS (cold)
- Load into Way 0
| Set | Way 0 (V, Tag, TS) | Way 1 (V, Tag, TS) |
|---|---|---|
| 0 | (true, 0, 1) | (false, -, -) |
| 1 | (false, -, -) | (false, -, -) |
Access 2: Word Addr 8
- Set = 8 % 2 = 0, Tag = 8 / 2 = 4
- Set 0: Way 0 tag=0 ≠ 4, Way 1 invalid → MISS (cold)
- Load into Way 1
| Set | Way 0 (V, Tag, TS) | Way 1 (V, Tag, TS) |
|---|---|---|
| 0 | (true, 0, 1) | (true, 4, 2) |
| 1 | (false, -, -) | (false, -, -) |
Access 3: Word Addr 0
- Set = 0, Tag = 0
- Set 0: Way 0 tag=0 matches → HIT
- Update timestamp
| Set | Way 0 (V, Tag, TS) | Way 1 (V, Tag, TS) |
|---|---|---|
| 0 | (true, 0, 3) | (true, 4, 2) |
| 1 | (false, -, -) | (false, -, -) |
Access 4: Word Addr 6
- Set = 6 % 2 = 0, Tag = 6 / 2 = 3
- Set 0: Way 0 tag=0 ≠ 3, Way 1 tag=4 ≠ 3 → MISS (hot)
- Both valid, LRU is Way 1 (timestamp 2 < 3) → evict Way 1
| Set | Way 0 (V, Tag, TS) | Way 1 (V, Tag, TS) |
|---|---|---|
| 0 | (true, 0, 3) | (true, 3, 4) |
| 1 | (false, -, -) | (false, -, -) |
Access 5: Word Addr 8
- Set = 0, Tag = 4
- Set 0: Way 0 tag=0 ≠ 4, Way 1 tag=3 ≠ 4 → MISS (hot)
- LRU is Way 0 (timestamp 3 < 4) → evict Way 0
| Set | Way 0 (V, Tag, TS) | Way 1 (V, Tag, TS) |
|---|---|---|
| 0 | (true, 4, 5) | (true, 3, 4) |
| 1 | (false, -, -) | (false, -, -) |
Summary: Hits: 1 (access 3), Misses: 4 → Hit rate: 20%
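The whole trace can be checked mechanically. Below is a self-contained sketch (the local `Way` type and names are stand-ins, not the project API) that replays the five word addresses through a 2-set, 2-way LRU cache:

#[derive(Clone, Copy, Default)]
struct Way {
    valid: bool,
    tag: u64,
    timestamp: u64,
}

fn main() {
    let mut sets = [[Way::default(); 2]; 2]; // 2 sets x 2 ways
    let trace: [u64; 5] = [0, 8, 0, 6, 8];   // word addresses
    let (mut hits, mut refs) = (0u64, 0u64);
    for word_addr in trace {
        refs += 1;
        let set = &mut sets[(word_addr % 2) as usize];
        let tag = word_addr / 2;
        let hit = set.iter().position(|w| w.valid && w.tag == tag);
        match hit {
            Some(i) => {
                hits += 1;
                set[i].timestamp = refs; // refresh the timestamp on a hit
            }
            None => {
                // Prefer an invalid way (timestamps start at 1, so key 0 wins);
                // otherwise evict the way with the smallest timestamp (LRU).
                let victim = (0..set.len())
                    .min_by_key(|&i| if set[i].valid { set[i].timestamp } else { 0 })
                    .unwrap();
                set[victim] = Way { valid: true, tag, timestamp: refs };
            }
        }
    }
    println!("hits={hits} misses={}", refs - hits); // hits=1 misses=4
}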
Example 3: Direct-Mapped vs 2-Way Comparison¶
Address sequence: 0, 16, 0, 16, 0 with a 2-slot cache.
Direct-Mapped (2 slots):
Both 0 and 16 map to slot 0 (since (0/4)%2=0 and (16/4)%2=0).
| Access | Addr | Slot | Tag | Result |
|---|---|---|---|---|
| 1 | 0 | 0 | 0 | MISS |
| 2 | 16 | 0 | 2 | MISS (evicts 0) |
| 3 | 0 | 0 | 0 | MISS (evicts 16) |
| 4 | 16 | 0 | 2 | MISS (evicts 0) |
| 5 | 0 | 0 | 0 | MISS (evicts 16) |
Hit rate: 0% (thrashing)
2-Way Set-Associative (1 set with 2 ways):
Both addresses map to the same set, but can coexist in different ways.
| Access | Addr | Tag | Way 0 | Way 1 | Result |
|---|---|---|---|---|---|
| 1 | 0 | 0 | 0 | - | MISS |
| 2 | 16 | 4 | 0 | 4 | MISS |
| 3 | 0 | 0 | 0 | 4 | HIT |
| 4 | 16 | 4 | 0 | 4 | HIT |
| 5 | 0 | 0 | 0 | 4 | HIT |
Hit rate: 60%
The 2-way associative cache eliminates the conflict misses that plagued the direct-mapped cache.
Project03 Cache Simulator¶
Overview¶
Project03 extends the RISC-V emulator to include dynamic analysis and cache simulation:
- Additional Instructions: Complete RISC-V instruction set
- Dynamic Analysis: Count instructions, branches taken/not-taken
- Cache Simulation: Simulate I-cache behavior during execution
Cache Simulator Structure¶
struct CacheSlot {
tag: u64,
block: [u32; 1], // or larger for multi-word blocks
valid: bool,
timestamp: u64, // For LRU replacement
}
struct Cache {
slots: Vec<CacheSlot>,
num_slots: usize,
block_size: usize, // Block size in words
num_ways: usize, // Associativity (1 = direct-mapped)
num_refs: u64,
num_hits: u64,
num_misses: u64,
}
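With the counters in place, reporting is a one-liner. A hypothetical convenience method (not a required part of the assignment):

impl Cache {
    fn hit_rate(&self) -> f64 {
        // hit_rate = num_hits / num_refs, guarding against division by zero
        if self.num_refs == 0 {
            return 0.0;
        }
        self.num_hits as f64 / self.num_refs as f64
    }
}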
Simulation Flow¶
flowchart TD
A["Fetch Instruction"] --> B["Check I-Cache"]
B --> C{"Hit?"}
C -->|"Yes"| D["Increment hits"]
C -->|"No"| E["Increment misses"]
D --> F["Execute Instruction"]
E --> G["Update Cache"]
G --> F
F --> H{"More instructions?"}
H -->|"Yes"| A
H -->|"No"| I["Report Statistics"]
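In rough shape, the fetch loop interleaves a cache lookup with normal execution. The names below (`cpu`, `icache`, `fetch`, `execute`) are placeholders for this sketch, not the emulator's actual API:

loop {
    icache.lookup(cpu.pc);        // record the I-cache hit or miss
    let inst = fetch(&cpu);       // read the instruction word at pc
    if !execute(&mut cpu, inst) { // assume execute returns false on halt
        break;
    }
}
println!("refs={} hits={} misses={}", icache.num_refs, icache.num_hits, icache.num_misses);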
Practice Problems¶
Problem 1: Address Decomposition¶
Given a 16-slot direct-mapped cache with 4-byte blocks and 64-bit addresses, decompose the address 0x0000000000001A3C.
Solution
**Step 1: Calculate bit widths**

Byte Offset = log2(4) = 2 bits
Index = log2(16) = 4 bits
Tag = 64 - 4 - 2 = 58 bits

**Step 2: Convert address to binary**

0x0000000000001A3C = ...0001 1010 0011 1100

**Step 3: Extract fields**

Byte Offset = bits [1:0] = 00 = 0
Index = bits [5:2] = 1111 = 15
Tag = bits [63:6] = 0x1A3C >> 6 = 0x68

**Answer:** Byte Offset: 0, Index: 15 (slot 15), Tag: 0x68

Problem 2: Hit Rate Calculation¶
A cache simulation reports: 950 hits, 50 misses. Calculate the hit rate, miss rate, and AMAT assuming hit_time = 1 cycle and miss_penalty = 100 cycles.
Solution

hit_rate = 950 / (950 + 50) = 950 / 1000 = 0.95 = 95%
miss_rate = 50 / 1000 = 0.05 = 5%
AMAT = hit_time + (miss_rate x miss_penalty) = 1 + (0.05 x 100) = 6 cycles
Problem 3: Set-Associative Address Decomposition¶
For an 8-way set-associative cache with 64 slots and 4-word blocks, calculate the number of bits for tag, set index, block index, and byte offset (assuming 64-bit addresses).
Solution
**Step 1: Calculate number of sets**

num_sets = 64 / 8 = 8 sets

**Step 2: Calculate bit widths**

Byte Offset = log2(4) = 2 bits
Block Index = log2(4) = 2 bits
Set Index = log2(8) = 3 bits
Tag = 64 - 3 - 2 - 2 = 57 bits

**Address format:**

┌──────────────────────────┬───────────┬───────────┬───────────┐
│           TAG            │ SET INDEX │ BLOCK IDX │ BYTE OFF  │
│         [63:7]           │   [6:4]   │   [3:2]   │   [1:0]   │
└──────────────────────────┴───────────┴───────────┴───────────┘
        57 bits               3 bits       2 bits      2 bits

Problem 4: Cache Trace Analysis¶
A 4-slot, 2-way set-associative cache (2 sets) with LRU replacement processes these byte addresses: 0, 4, 8, 12, 0, 16. Assume 1-word blocks.
Solution
**Configuration:**
- 2 sets, 2 ways each
- Set index = (addr >> 2) & 0b1
- Tag = addr >> 3

| Access | Byte Addr | Word Addr | Set | Tag | Result |
|---|---|---|---|---|---|
| 1 | 0 | 0 | 0 | 0 | MISS |
| 2 | 4 | 1 | 1 | 0 | MISS |
| 3 | 8 | 2 | 0 | 1 | MISS |
| 4 | 12 | 3 | 1 | 1 | MISS |
| 5 | 0 | 0 | 0 | 0 | HIT |
| 6 | 16 | 4 | 0 | 2 | MISS (LRU evicts tag=1) |

**Result:** M, M, M, M, H, M — Hit rate: 1/6 = 16.7%

Problem 5: Locality Analysis¶
Which type of locality does each code pattern exhibit?
// Pattern A
for _ in 0..100 {
count += 1;
}
// Pattern B
for i in 0..100 {
sum += array[i];
}
// Pattern C
for i in 0..100 {
process(array[i]);
}
Solution
**Pattern A: Temporal Locality** — Variable `count` is accessed repeatedly (100 times). Same location accessed multiple times.

**Pattern B: Spatial + Temporal Locality** — `array[i]`, `array[i+1]`, etc. are adjacent in memory (spatial). Variable `sum` is accessed 100 times (temporal).

**Pattern C: Spatial + Temporal Locality** — Array accessed sequentially (spatial). Function `process` code accessed 100 times (temporal).

Key Concepts¶
| Concept | Definition |
|---|---|
| Cache | Small, fast memory that holds recently accessed data |
| Hit | Data found in cache; fast access |
| Miss | Data not in cache; must fetch from memory |
| Hit Rate | num_hits / num_refs |
| Miss Rate | num_misses / num_refs = 1 - hit_rate |
| AMAT | Average Memory Access Time = hit_time + (miss_rate x miss_penalty) |
| Temporal Locality | Recently accessed data likely to be accessed again soon |
| Spatial Locality | Data near recently accessed data likely to be accessed soon |
| Direct-Mapped | Each address maps to exactly one cache slot |
| Fully-Associative | Address can be stored in any slot |
| Set-Associative | Address maps to one set containing multiple ways |
| Tag | Upper address bits identifying which block is in a slot |
| Index | Address bits selecting which slot/set to access |
| Offset | Address bits selecting byte/word within a block |
| Valid Bit | Indicates whether cache slot contains valid data |
| LRU | Least Recently Used replacement policy |
| Cold Miss | First access to a slot (valid is false) |
| Hot Miss | Slot is valid but tag does not match |
| Thrashing | Repeated evictions when addresses conflict on the same slot |
Summary¶
-
Memory Hierarchy: Computer systems use a hierarchy of memory (registers, caches, DRAM, disk) to balance speed and cost. Caches bridge the processor-memory speed gap by keeping recently accessed data close to the CPU.
-
Principles of Locality: Caches work because programs exhibit temporal locality (recently accessed data will be accessed again) and spatial locality (nearby data will be accessed soon). These principles explain why caches achieve hit rates exceeding 95%.
-
Direct-Mapped Caches: The simplest cache organization maps each address to exactly one slot using the index bits. The tag identifies which of the many addresses mapping to that slot is currently stored. Simple but suffers from conflict misses.
-
Fully-Associative Caches: An address can map to any slot, eliminating conflict misses. Requires checking all slots on every access and a replacement policy (LRU). Most flexible but most expensive in hardware.
-
Set-Associative Caches: A compromise — addresses map to one set containing multiple ways. Reduces conflict misses while keeping hardware cost reasonable.
-
Address Decomposition: A memory address is split into tag, index/set index, and offset. With multi-word blocks, a block index selects the word within the block. The number of bits for each field depends on cache size, block size, and associativity.
-
LRU Replacement: When all slots (or ways in a set) are full, LRU evicts the slot accessed least recently. Implemented in software using timestamps.
Further Reading¶
-
Computer Organization and Design: RISC-V Edition (Patterson & Hennessy) — Chapter 5: Large and Fast: Exploiting Memory Hierarchy
-
Course Guide: Cache Memory — Cache Memory Guide — Complete pseudocode for all three cache types with Rust examples
-
Project03 Assignment — Project03 — Implementation of I-cache simulation in the RISC-V emulator