Cache Memory¶
Overview¶
This lecture introduces cache memory concepts and their role in bridging the speed gap between processors and main memory. We examine why caches are effective through the principles of temporal and spatial locality, and explore three cache organizations: direct-mapped, fully-associative, and set-associative. We cover address decomposition, cache lookup algorithms, LRU replacement, and multi-word block sizes. Understanding cache memory is essential for writing efficient code and for implementing the cache simulator in Project03.
Learning Objectives¶
- Understand the memory hierarchy and the processor-memory speed gap
- Explain the principles of temporal and spatial locality
- Describe how direct-mapped caches use address decomposition
- Calculate tag, index, and offset bits for a given cache configuration
- Implement cache lookup algorithms for all three cache types in Rust
- Understand and implement LRU replacement policy using timestamps
- Decompose addresses for caches with multi-word blocks
- Trace through cache simulation scenarios to determine hits and misses
- Compute hit rates and miss rates from simulation data
Prerequisites¶
- RISC-V Emulation — emulator state and memory operations
- RISC-V Machine Code — instruction encoding and bit extraction
The Memory Hierarchy¶
The Processor-Memory Speed Gap¶
Modern processors can execute billions of instructions per second, but accessing data from main memory (DRAM) takes many more cycles than a simple arithmetic operation. This creates a fundamental bottleneck:
graph LR
subgraph "Speed Over Time"
A["Processor Speed"] -->|"Increasing<br>rapidly"| B["Performance Gap"]
C["Memory Speed"] -->|"Increasing<br>slowly"| B
end
B -->|"Growing wider<br>each year"| D["Solution: Cache Memory"]
Memory Hierarchy Levels¶
To address this gap, computer systems use a memory hierarchy:
graph TD
subgraph "Memory Hierarchy"
A["CPU Registers<br>Fastest, smallest<br>~1 cycle"]
B["L1 Cache (SRAM)<br>Very fast, small<br>~1-4 cycles"]
C["L2 Cache (SRAM)<br>Fast, medium<br>~10-20 cycles"]
D["L3 Cache (SRAM)<br>Moderate, larger<br>~40-75 cycles"]
E["Main Memory (DRAM)<br>Slower, large<br>~100+ cycles"]
F["Disk/SSD<br>Slowest, huge<br>~millions of cycles"]
end
A --> B
B --> C
C --> D
D --> E
E --> F
| Level | Technology | Size | Access Time | Cost/Byte |
|---|---|---|---|---|
| Registers | Flip-flops | ~32 x 64 bits | ~0.5 ns | Highest |
| L1 Cache | SRAM | 32-64 KB | ~1-2 ns | Very High |
| L2 Cache | SRAM | 256 KB-1 MB | ~5-10 ns | High |
| L3 Cache | SRAM | 2-32 MB | ~20-40 ns | Medium |
| Main Memory | DRAM | 4-64 GB | ~50-100 ns | Low |
| SSD | Flash | 256 GB-4 TB | ~100 us | Very Low |
| HDD | Magnetic | 1-16 TB | ~10 ms | Lowest |
Cache Memory Position¶
A cache sits between the processor and main memory, holding recently accessed data:
graph LR
subgraph "Processor"
A["CPU<br>lw, sw"]
end
subgraph "Cache"
B["SRAM<br>Fast memory"]
end
subgraph "Main Memory"
C["DRAM<br>STACK"]
D["DATA"]
E["CODE"]
end
A -->|"addr"| B
B -->|"hit"| A
B <-->|"miss"| C
B <--> D
B <--> E
Principles of Locality¶
Why Caches Work¶
Caches achieve high hit rates (often >97%) because programs exhibit locality of reference.
Temporal Locality¶
Temporal locality: If a data item is accessed, it is likely to be accessed again soon.
Examples:
- Loop counters: Variable `i` in a loop is accessed on every iteration
- Function variables: Local variables used throughout a function
- Recursive calls: The same function code is executed repeatedly
// Temporal locality example
let mut sum: i64 = 0;
for i in 0..1000 { // 'i' accessed 1000+ times
sum += array[i]; // 'sum' accessed 1000 times
}
Spatial Locality¶
Spatial locality: If a data item is accessed, nearby data items are likely to be accessed soon.
Examples:
- Array traversal: Accessing `array[i]` often followed by `array[i+1]`
- Sequential code: Instruction at PC followed by instruction at PC+4
- Struct access: Accessing one field often followed by neighboring fields
// Spatial locality example
for i in 0..1000 {
total += array[i]; // array[0], array[1], array[2], ... accessed sequentially
}
Locality in Code Execution¶
graph TD
subgraph "Instruction Fetch"
A["PC = 0x1000<br>fetch instruction"]
B["PC = 0x1004<br>fetch next (spatial)"]
C["PC = 0x1008<br>fetch next (spatial)"]
D["PC = 0x1000<br>loop back (temporal)"]
end
A --> B
B --> C
C -->|"branch"| D
D --> A
Cache Concepts and Terminology¶
Key Terms¶
| Term | Definition |
|---|---|
| Hit | Requested data is found in the cache |
| Miss | Requested data is not in the cache |
| Hit Rate | Fraction of accesses that are hits |
| Miss Rate | Fraction of accesses that are misses |
| Block | Unit of data transferred between cache and memory |
| Slot/Line | A single entry in the cache (holds one block) |
| Tag | Identifier stored with each block to identify which address it holds |
| Valid Bit | Indicates whether a slot contains valid data |
| Cold Miss | First access to a slot (valid is false) |
| Hot Miss | Slot is valid but contains data for a different address (tag mismatch) |
Hit and Miss Metrics¶
hit_rate = num_hits / num_refs
miss_rate = num_misses / num_refs
// Alternative formulas
hit_rate = 1 - miss_rate
miss_rate = 1 - hit_rate
Average Memory Access Time (AMAT)¶
The AMAT formula combines hit and miss times:

AMAT = hit_time + (miss_rate x miss_penalty)

Where:
- hit_time: Time to access the cache (typically 1-2 cycles)
- miss_rate: Probability of a cache miss
- miss_penalty: Additional time to fetch from memory on a miss (typically 50-200 cycles)

Example: If hit_time = 1 cycle, miss_rate = 5%, miss_penalty = 100 cycles:

AMAT = 1 + (0.05 x 100) = 1 + 5 = 6 cycles
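As a quick sanity check, the formula translates directly to code. A minimal sketch (the `amat` function name and signature are illustrative, not part of any project API):

fn amat(hit_time: f64, miss_rate: f64, miss_penalty: f64) -> f64 {
    // AMAT = hit_time + (miss_rate x miss_penalty), all values in cycles
    hit_time + miss_rate * miss_penalty
}

fn main() {
    // The example above: 1 cycle hit time, 5% miss rate, 100 cycle penalty
    println!("{}", amat(1.0, 0.05, 100.0)); // prints 6
}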
Memory Addressing¶
Byte vs Word Addressing¶
Memory is byte-addressable, but caches often work with words (4 bytes). Word N occupies byte addresses 4N through 4N + 3.
Address Conversion¶
// Byte address to word address
let word_addr = addr / 4;
let word_addr = addr >> 2; // Equivalent using bit shift
// Word address to byte address
let addr = word_addr * 4;
let addr = word_addr << 2; // Equivalent using bit shift
Word Alignment¶
For efficiency, word accesses should be word-aligned (addresses divisible by 4):
| Byte Address | Word Address | Aligned? |
|---|---|---|
| 0x00 | 0 | Yes |
| 0x04 | 1 | Yes |
| 0x08 | 2 | Yes |
| 0x03 | - | No |
| 0x05 | - | No |
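Alignment is easy to test in code. A small sketch, where `addr` is a byte address as in the conversions above:

let aligned = addr % 4 == 0;      // divisible by 4
let aligned = (addr & 0b11) == 0; // equivalent: low two bits are zero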
Direct-Mapped Caches¶
Overview¶
A direct-mapped cache is the simplest cache organization. Each memory address maps to exactly one cache slot, determined by the address itself. This makes the replacement policy trivial: if the data is not in the cache, you evict whatever is in the slot that the address maps to.
Cache Structure¶
struct CacheSlot {
tag: u64,
block: [u32; 1], // One word per block
valid: bool,
}
let slots: [CacheSlot; N]; // N slots in the cache
Address Decomposition¶
For a 4-slot direct-mapped cache with 1-word blocks:
64-bit byte address:
┌─────────────────────────┬─────────┬───────────┐
│ TAG │ INDEX │ BYTE OFF │
│ [63:4] │ [3:2] │ [1:0] │
└─────────────────────────┴─────────┴───────────┘
60 bits 2 bits 2 bits
- Byte Offset (2 bits): Selects byte within a word (for word-aligned access, always 00)
- Index (2 bits): Selects which cache slot (0-3 for 4 slots)
- Tag (60 bits): Identifies the specific memory block stored in the slot
Computing Index and Tag¶
We can use modulus or bit manipulation to extract these fields:
// For a 4-slot cache with 1-word blocks
let word_addr = addr >> 2;
let slot_index = word_addr & 0b11; // Same as word_addr % 4
Why does this work? Look at the binary form of word addresses:
dec bin
0 00000 <-- slot 0
1 00001 <-- slot 1
2 00010 <-- slot 2
3 00011 <-- slot 3
4 00100 <-- slot 0
5 00101 <-- slot 1
6 00110 <-- slot 2
7 00111 <-- slot 3
8 01000 <-- slot 0
The last two bits of each word address are exactly word_addr % 4. The remaining upper bits form the tag, which distinguishes different addresses that map to the same slot.
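The same idea can be written as a small helper for any power-of-two slot count. This is an illustrative sketch (the `decompose` function is not part of the project API) assuming 1-word blocks:

fn decompose(addr: u64, num_slots: u64) -> (u64, u64) {
    let index_bits = num_slots.trailing_zeros();  // log2 for powers of two
    let word_addr = addr >> 2;                    // strip the 2 byte-offset bits
    let slot_index = word_addr & (num_slots - 1); // same as word_addr % num_slots
    let tag = word_addr >> index_bits;            // remaining upper bits
    (tag, slot_index)
}
// decompose(16, 4) == (1, 0): byte address 16 maps to slot 0 with tag 1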
Direct-Mapped Cache Lookup¶
flowchart TD
A["Memory Address"] --> B["Extract Index Bits"]
B --> C["Access Cache Slot"]
C --> D{"Valid?"}
D -->|"No"| E["MISS (cold)"]
D -->|"Yes"| F{"Tag Match?"}
F -->|"No"| G["MISS (hot)"]
F -->|"Yes"| H["HIT: Return Data"]
E --> I["Fetch from Memory"]
G --> I
I --> J["Update Cache Slot"]
J --> K["Return Data"]
Direct-Mapped Lookup Algorithm¶
let addr_tag = addr >> 4; // Remove byte_offset (2) and slot_index bits (2)
let index_mask: u64 = 0b11;
let slot_index = ((addr >> 2) & index_mask) as usize;
let slot = &mut slots[slot_index];
if slot.valid && slot.tag == addr_tag {
// Hit
num_hits += 1;
slot.block[0]
} else {
// Miss (cold if !slot.valid, hot if slot.valid)
num_misses += 1;
slot.block[0] = mem_read_word(addr);
slot.tag = addr_tag;
slot.valid = true;
slot.block[0]
}
Limitation: Conflict Misses¶
Direct-mapped caches suffer from conflict misses when multiple addresses map to the same slot:
Addresses 0, 16, 32, 48 all map to slot 0 (for 4-slot cache)
If accessed alternately: 0, 16, 0, 16 -> all misses (thrashing)
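The thrashing pattern is easy to reproduce. Below is a minimal, self-contained sketch (the `Slot` type here is a stripped-down stand-in for the `CacheSlot` above) that replays the alternating trace on a 4-slot direct-mapped cache:

#[derive(Clone, Copy, Default)]
struct Slot {
    valid: bool,
    tag: u64,
}

fn main() {
    let mut slots = [Slot::default(); 4];
    let trace: [u64; 4] = [0, 16, 0, 16]; // byte addresses
    let (mut hits, mut misses) = (0, 0);
    for addr in trace {
        let index = ((addr >> 2) & 0b11) as usize; // slot index bits [3:2]
        let tag = addr >> 4;                       // remaining upper bits
        let slot = &mut slots[index];
        if slot.valid && slot.tag == tag {
            hits += 1;
        } else {
            misses += 1;
            *slot = Slot { valid: true, tag }; // evict and refill
        }
    }
    println!("hits={hits} misses={misses}"); // hits=0 misses=4
}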
Fully-Associative Caches¶
Concept¶
In a fully-associative cache, an address can be stored in ANY slot. This eliminates conflict misses but requires checking all slots on every access. While too expensive for regular cache memory, fully-associative caches are used in specific situations like hardware support for virtual memory.
Address Decomposition¶
No index bits needed — only tag and byte offset:
┌─────────────────────────────────┬───────────┐
│ TAG │ BYTE OFF │
│ [63:2] │ [1:0] │
└─────────────────────────────────┴───────────┘
Cache Structure with Timestamp¶
We add a timestamp field for tracking when each slot was last accessed, which is needed for the LRU replacement policy:
struct CacheSlot {
tag: u64,
block: [u32; 1],
valid: bool,
timestamp: u64,
}
let slots: [CacheSlot; N];
Fully-Associative Lookup Algorithm¶
Must compare the address tag against ALL slot tags:
refs += 1;
let addr_tag = addr >> 2; // Remove byte_offset bits (2)
for i in 0..4 {
let slot = &mut slots[i];
if slot.valid && slot.tag == addr_tag {
// Hit
num_hits += 1;
slot.timestamp = refs;
return slot.block[0];
}
}
// Miss
num_misses += 1;
let slot = find_lru_slot(&mut slots);
slot.block[0] = mem_read_word(addr);
slot.tag = addr_tag;
slot.valid = true;
slot.timestamp = refs;
slot.block[0]
LRU Replacement Policy¶
When all slots are full, we need to choose which to evict. LRU (Least Recently Used) evicts the slot that hasn't been accessed for the longest time.
We use a logical timestamp (the number of cache references). Each time we access a cache slot, we set the slot's timestamp to the current reference count. To find the LRU slot, we look for the slot with the smallest timestamp — this is the slot accessed least recently.
First, look for an unused slot (cold miss). If all slots are valid, pick the slot with the oldest timestamp (hot miss):
fn find_lru_slot(slots: &mut [CacheSlot]) -> &mut CacheSlot {
    // First, look for an invalid (unused) slot. We loop over indices and
    // return &mut slots[i] so the borrow checker accepts the re-borrow of
    // `slots` below (returning from an iter_mut loop would not compile).
    for i in 0..slots.len() {
        if !slots[i].valid {
            return &mut slots[i];
        }
    }
// All valid — find slot with oldest timestamp
let mut lru_index = 0;
let mut min_timestamp = u64::MAX;
for (i, slot) in slots.iter().enumerate() {
if slot.timestamp < min_timestamp {
min_timestamp = slot.timestamp;
lru_index = i;
}
}
&mut slots[lru_index]
}
Trade-offs¶
| Aspect | Fully-Associative |
|---|---|
| Hit Rate | Highest (no conflict misses) |
| Lookup Speed | Slowest (check all slots) |
| Hardware Cost | Highest (parallel comparators) |
| Replacement | Flexible (LRU, FIFO, Random) |
Set-Associative Caches¶
Concept¶
A set-associative cache is a compromise between direct-mapped and fully-associative:
- Cache is divided into sets
- Each set contains multiple ways (slots)
- An address maps to one specific set (like direct-mapped)
- Within the set, the address can go in any way (like fully-associative)
Terminology¶
- N-way set-associative: Each set has N slots
- 2-way: Each set has 2 slots (most common)
- 4-way: Each set has 4 slots
- 8-way: Each set has 8 slots
Address Decomposition for 2-Way Set-Associative¶
For 4 slots total organized as 2-way (2 sets, 2 ways each):
┌──────────────────────────┬──────────┬───────────┐
│ TAG │ SET IDX │ BYTE OFF │
│ [63:3] │ [2] │ [1:0] │
└──────────────────────────┴──────────┴───────────┘
61 bits 1 bit 2 bits
Set-Associative Formulas¶
let num_slots = N; // Total number of cache slots
let num_ways = W; // Number of ways (slots per set)
let num_sets = N / W; // Number of sets
// Bit widths
// set_index_bits = log2(num_sets)
// byte_offset_bits = log2(block_size_in_bytes)
// tag_bits = addr_bits - set_index_bits - byte_offset_bits
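These formulas translate directly to code, assuming power-of-two sizes throughout. A sketch (`field_widths` is an illustrative name, not a required function):

fn field_widths(num_slots: u32, num_ways: u32, block_bytes: u32, addr_bits: u32) -> (u32, u32, u32) {
    let num_sets = num_slots / num_ways;
    let set_index_bits = num_sets.ilog2();      // log2(num_sets)
    let byte_offset_bits = block_bytes.ilog2(); // log2(block size in bytes)
    let tag_bits = addr_bits - set_index_bits - byte_offset_bits;
    (tag_bits, set_index_bits, byte_offset_bits)
}
// field_widths(4, 2, 4, 64) == (61, 1, 2), matching the 2-way layout above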
Cache Layout¶
Way 1 Way 0
┌───────┬─────┬─────┬───────┬─────┬─────┐
Set 1 │ valid │ tag │data │ valid │ tag │data │
├───────┼─────┼─────┼───────┼─────┼─────┤
Set 0 │ valid │ tag │data │ valid │ tag │data │
└───────┴─────┴─────┴───────┴─────┴─────┘
Set-Associative Lookup Algorithm¶
We logically treat the cache slot array as groups of num_ways slots. The first group is set 0, the second is set 1, and so on. Once we find the set index, we multiply by num_ways to get the first slot in the set:
refs += 1;
let num_ways = 4;
let addr_tag = addr >> 5; // Remove byte_offset (2) and set_index bits (3)
let set_index = ((addr >> 2) & 0b111) as usize;
// 4 ways per set, so set_index * num_ways gives us the first slot in the set
let set_base = set_index * num_ways;
for i in 0..num_ways {
let slot = &mut slots[set_base + i];
if slot.valid && slot.tag == addr_tag {
// Hit
num_hits += 1;
slot.timestamp = refs;
return slot.block[0];
}
}
// Miss
num_misses += 1;
let slot = find_lru_slot_in_set(&mut slots, set_base, num_ways);
slot.block[0] = mem_read_word(addr);
slot.tag = addr_tag;
slot.valid = true;
slot.timestamp = refs;
slot.block[0]
LRU Within a Set¶
The find_lru_slot_in_set function works just like the fully-associative version, but only searches within the set:
fn find_lru_slot_in_set(
slots: &mut [CacheSlot],
set_base: usize,
num_ways: usize,
) -> &mut CacheSlot {
// First, look for an invalid slot in the set
for i in 0..num_ways {
if !slots[set_base + i].valid {
return &mut slots[set_base + i];
}
}
// All valid — find LRU within the set
let mut lru_index = set_base;
let mut min_timestamp = u64::MAX;
for i in 0..num_ways {
let idx = set_base + i;
if slots[idx].timestamp < min_timestamp {
min_timestamp = slots[idx].timestamp;
lru_index = idx;
}
}
&mut slots[lru_index]
}
Example: 4-Way Set-Associative with 32 Slots¶
Configuration:
- 32 slots total
- 4-way set-associative
- 1-word blocks
Calculations:
num_sets = 32 / 4 = 8 sets
set_index_bits = log2(8) = 3 bits
byte_offset_bits = log2(4) = 2 bits
tag_bits = 64 - 3 - 2 = 59 bits
Address decomposition:
┌──────────────────────────┬───────────┬───────────┐
│ TAG │ SET INDEX │ BYTE OFF │
│ [63:5] │ [4:2] │ [1:0] │
└──────────────────────────┴───────────┴───────────┘
59 bits 3 bits 2 bits
Block Size¶
Multi-Word Blocks¶
To exploit spatial locality, caches typically store multiple words per slot. When we miss on one word, we load an entire block of neighboring words into the cache — so subsequent accesses to nearby addresses are likely to hit.
struct CacheSlot {
tag: u64,
block: [u32; 4], // 4-word block (16 bytes)
valid: bool,
timestamp: u64,
}
let slots: [CacheSlot; N];
Address Decomposition with Block Size¶
For a 4-slot direct-mapped cache with 4-word blocks:
┌──────────────────────┬───────────┬───────────┬───────────┐
│ TAG │ SLOT IDX │ BLOCK IDX │ BYTE OFF │
│ [63:6] │ [5:4] │ [3:2] │ [1:0] │
└──────────────────────┴───────────┴───────────┴───────────┘
58 bits 2 bits 2 bits 2 bits
- Byte Offset (2 bits): Selects byte within word
- Block Index (2 bits): Selects word within block
- Slot Index (2 bits): Selects cache slot
- Tag (58 bits): Identifies memory block
For other cache types with 4-word blocks:

Fully Associative:

┌──────────────────────────────┬───────────┬───────────┐
│             TAG              │ BLOCK IDX │ BYTE OFF  │
│           [63:4]             │   [3:2]   │   [1:0]   │
└──────────────────────────────┴───────────┴───────────┘
          60 bits                  2 bits      2 bits

Set Associative (2-way, 4 slots total = 2 sets):

┌──────────────────────────┬─────────┬───────────┬───────────┐
│           TAG            │ SET IDX │ BLOCK IDX │ BYTE OFF  │
│         [63:5]           │   [4]   │   [3:2]   │   [1:0]   │
└──────────────────────────┴─────────┴───────────┴───────────┘
        59 bits               1 bit      2 bits      2 bits
Loading a Block on Miss¶
When a miss occurs, we load the entire block from memory. We need to find the block-aligned address:
let word_addr = addr / 4;
let block_index = word_addr % 4; // Where 4 is the block size in words
let block_base = word_addr - block_index;
Now load 4 words from memory starting at block_base. Note that block_base is a word address and needs to be converted to a byte address to dereference properly.
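Putting that together, a sketch of the block fill (reusing the `mem_read_word(byte_addr)` helper from the lookup fragments above; `slot` is assumed to be the victim slot chosen on the miss):

let block_size: u64 = 4; // words per block
let word_addr = addr / 4;
let block_base = word_addr - (word_addr % block_size); // block-aligned word address
for i in 0..block_size {
    // Convert each word address back to a byte address before reading memory
    slot.block[i as usize] = mem_read_word((block_base + i) * 4);
}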
Cache Comparison¶
Summary Table¶
| Feature | Direct-Mapped | Fully-Associative | N-Way Set-Assoc |
|---|---|---|---|
| Address maps to | 1 slot | Any slot | 1 set (N slots) |
| Index bits | log2(slots) | 0 | log2(sets) |
| Comparisons | 1 | All slots | N (ways) |
| Conflict misses | Many | None | Reduced |
| Hardware cost | Lowest | Highest | Medium |
| Replacement | None needed | LRU/FIFO/Random | LRU within set |
Special Cases¶
- Direct-mapped = 1-way set-associative (each set has 1 slot)
- Fully-associative = N-way where N = total slots (one set containing all slots); a short sketch after the diagram below shows how one mapping formula covers all three cases
graph LR
subgraph "Direct-Mapped<br>(1-way)"
DM1["Slot 0"]
DM2["Slot 1"]
DM3["Slot 2"]
DM4["Slot 3"]
end
subgraph "2-Way Set-Associative"
SA1["Set 0: Way 0 | Way 1"]
SA2["Set 1: Way 0 | Way 1"]
end
subgraph "Fully-Associative<br>(4-way, 1 set)"
FA["Way 0 | Way 1 | Way 2 | Way 3"]
end
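The special cases above can be captured in code: one mapping formula covers all three organizations. An illustrative sketch (1-word blocks assumed; `set_for` is not a project function): with `num_ways = 1` it degenerates to direct-mapped, and with `num_ways = num_slots` it becomes fully-associative.

fn set_for(addr: u64, num_slots: u64, num_ways: u64) -> (u64, u64) {
    let num_sets = num_slots / num_ways;
    let word_addr = addr >> 2;      // 1-word blocks: strip the byte offset
    let set_index = word_addr % num_sets;
    let tag = word_addr / num_sets; // remaining upper bits
    (set_index * num_ways, tag)     // (first slot of the set, address tag)
}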
Cache Questions to Answer¶
When designing or analyzing a cache, consider three fundamental questions:
1. Where to Look?¶
Given an address, which cache slot(s) could contain the data?
| Cache Type | Answer |
|---|---|
| Direct-Mapped | Exactly one slot (determined by index bits) |
| Fully-Associative | Any slot (must check all) |
| Set-Associative | One set of slots (determined by set index) |
2. Is It There?¶
How do we know if the requested data is actually in the cache?
Requirements for a hit:
- The slot must be valid (`valid` is `true`)
- The slot's tag must match the address's tag
3. What to Replace?¶
On a miss, if we need to bring in new data, what do we evict?
| Cache Type | Replacement Policy |
|---|---|
| Direct-Mapped | No choice — only one possible slot |
| Fully-Associative | LRU, FIFO, Random, etc. |
| Set-Associative | LRU, FIFO, Random within the set |
Address Bit Calculation¶
General Formula¶
For a cache with:
- S = number of slots (lines)
- B = block size in bytes
- A = address size in bits (typically 64)
The number of bits for each field:
- Byte Offset bits = log2(B)
- Index bits = log2(S)
- Tag bits = A - Index bits - Byte Offset bits
Example Calculations¶
Example 1: 4-slot cache, 4-byte blocks (1 word), 64-bit addresses

Byte Offset = log2(4) = 2 bits
Index = log2(4) = 2 bits
Tag = 64 - 2 - 2 = 60 bits

Example 2: 16-slot cache, 16-byte blocks (4 words), 64-bit addresses

Byte Offset = log2(16) = 4 bits
Index = log2(16) = 4 bits
Tag = 64 - 4 - 4 = 56 bits

Example 3: 256-slot cache, 64-byte blocks (16 words), 64-bit addresses

Byte Offset = log2(64) = 6 bits
Index = log2(256) = 8 bits
Tag = 64 - 8 - 6 = 50 bits
Example 4: 64KB cache, 64-byte blocks, 48-bit addresses (direct-mapped)
Number of slots = 65536 / 64 = 1024 slots
Byte Offset = log2(64) = 6 bits
Index = log2(1024) = 10 bits
Tag = 48 - 6 - 10 = 32 bits
Cache Simulation Trace Examples¶
Example 1: Direct-Mapped Thrashing¶
A 4-slot direct-mapped cache with 1-word blocks is initially empty. Address trace (word addresses): 0, 4, 0, 8, 0, 12, 0
| Access | Word Addr | Slot (addr % 4) | Tag | Result | Cache State |
|---|---|---|---|---|---|
| 1 | 0 | 0 | 0 | MISS | [0: tag=0] |
| 2 | 4 | 0 | 1 | MISS | [0: tag=1] (evicts tag=0) |
| 3 | 0 | 0 | 0 | MISS | [0: tag=0] (evicts tag=1) |
| 4 | 8 | 0 | 2 | MISS | [0: tag=2] (evicts tag=0) |
| 5 | 0 | 0 | 0 | MISS | [0: tag=0] (evicts tag=2) |
| 6 | 12 | 0 | 3 | MISS | [0: tag=3] (evicts tag=0) |
| 7 | 0 | 0 | 0 | MISS | [0: tag=0] (evicts tag=3) |
Result: All misses! Hit rate: 0%. This is thrashing — all addresses map to the same slot, causing constant evictions.
Example 2: 2-Way Set-Associative Trace¶
A 4-slot, 2-way set-associative cache (2 sets), 1-word blocks, LRU replacement. Initially empty.
Word addresses: 0, 8, 0, 6, 8
Configuration:
- Set index = word_addr % 2
- Tag = word_addr / 2
Access 1: Word Addr 0
- Set = 0 % 2 = 0, Tag = 0 / 2 = 0
- Set 0: Both invalid → MISS (cold)
- Load into Way 0
| Set | Way 0 (V, Tag, TS) | Way 1 (V, Tag, TS) |
|---|---|---|
| 0 | (true, 0, 1) | (false, -, -) |
| 1 | (false, -, -) | (false, -, -) |
Access 2: Word Addr 8
- Set = 8 % 2 = 0, Tag = 8 / 2 = 4
- Set 0: Way 0 tag=0 ≠ 4, Way 1 invalid → MISS (cold)
- Load into Way 1
| Set | Way 0 (V, Tag, TS) | Way 1 (V, Tag, TS) |
|---|---|---|
| 0 | (true, 0, 1) | (true, 4, 2) |
| 1 | (false, -, -) | (false, -, -) |
Access 3: Word Addr 0
- Set = 0, Tag = 0
- Set 0: Way 0 tag=0 matches → HIT
- Update timestamp
| Set | Way 0 (V, Tag, TS) | Way 1 (V, Tag, TS) |
|---|---|---|
| 0 | (true, 0, 3) | (true, 4, 2) |
| 1 | (false, -, -) | (false, -, -) |
Access 4: Word Addr 6
- Set = 6 % 2 = 0, Tag = 6 / 2 = 3
- Set 0: Way 0 tag=0 ≠ 3, Way 1 tag=4 ≠ 3 → MISS (hot)
- Both valid, LRU is Way 1 (timestamp 2 < 3) → evict Way 1
| Set | Way 0 (V, Tag, TS) | Way 1 (V, Tag, TS) |
|---|---|---|
| 0 | (true, 0, 3) | (true, 3, 4) |
| 1 | (false, -, -) | (false, -, -) |
Access 5: Word Addr 8
- Set = 0, Tag = 4
- Set 0: Way 0 tag=0 ≠ 4, Way 1 tag=3 ≠ 4 → MISS (hot)
- LRU is Way 0 (timestamp 3 < 4) → evict Way 0
| Set | Way 0 (V, Tag, TS) | Way 1 (V, Tag, TS) |
|---|---|---|
| 0 | (true, 4, 5) | (true, 3, 4) |
| 1 | (false, -, -) | (false, -, -) |
Summary: Hits: 1 (access 3), Misses: 4 → Hit rate: 20%
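The whole trace can be checked mechanically. Below is a self-contained sketch (the local `Way` type and names are stand-ins, not the project API) that replays the five word addresses through a 2-set, 2-way LRU cache:

#[derive(Clone, Copy, Default)]
struct Way {
    valid: bool,
    tag: u64,
    timestamp: u64,
}

fn main() {
    let mut sets = [[Way::default(); 2]; 2]; // 2 sets x 2 ways
    let trace: [u64; 5] = [0, 8, 0, 6, 8];   // word addresses
    let (mut hits, mut refs) = (0u64, 0u64);
    for word_addr in trace {
        refs += 1;
        let set = &mut sets[(word_addr % 2) as usize];
        let tag = word_addr / 2;
        let hit = set.iter().position(|w| w.valid && w.tag == tag);
        match hit {
            Some(i) => {
                hits += 1;
                set[i].timestamp = refs; // refresh the timestamp on a hit
            }
            None => {
                // Prefer an invalid way (timestamps start at 1, so key 0 wins);
                // otherwise evict the way with the smallest timestamp (LRU).
                let victim = (0..set.len())
                    .min_by_key(|&i| if set[i].valid { set[i].timestamp } else { 0 })
                    .unwrap();
                set[victim] = Way { valid: true, tag, timestamp: refs };
            }
        }
    }
    println!("hits={hits} misses={}", refs - hits); // hits=1 misses=4
}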
Example 3: Direct-Mapped vs 2-Way Comparison¶
Address sequence: 0, 16, 0, 16, 0 with a 2-slot cache.
Direct-Mapped (2 slots):
Both 0 and 16 map to slot 0 (since (0/4)%2=0 and (16/4)%2=0).
| Access | Addr | Slot | Tag | Result |
|---|---|---|---|---|
| 1 | 0 | 0 | 0 | MISS |
| 2 | 16 | 0 | 2 | MISS (evicts 0) |
| 3 | 0 | 0 | 0 | MISS (evicts 16) |
| 4 | 16 | 0 | 2 | MISS (evicts 0) |
| 5 | 0 | 0 | 0 | MISS (evicts 16) |
Hit rate: 0% (thrashing)
2-Way Set-Associative (1 set with 2 ways):
Both addresses map to the same set, but can coexist in different ways.
| Access | Addr | Tag | Way 0 | Way 1 | Result |
|---|---|---|---|---|---|
| 1 | 0 | 0 | 0 | - | MISS |
| 2 | 16 | 4 | 0 | 4 | MISS |
| 3 | 0 | 0 | 0 | 4 | HIT |
| 4 | 16 | 4 | 0 | 4 | HIT |
| 5 | 0 | 0 | 0 | 4 | HIT |
Hit rate: 60%
The 2-way associative cache eliminates the conflict misses that plagued the direct-mapped cache.
Project03 Cache Simulator¶
Overview¶
Project03 extends the RISC-V emulator to include dynamic analysis and cache simulation:
- Additional Instructions: Complete RISC-V instruction set
- Dynamic Analysis: Count instructions, branches taken/not-taken
- Cache Simulation: Simulate I-cache behavior during execution
Cache Simulator Structure¶
struct CacheSlot {
tag: u64,
block: [u32; 1], // or larger for multi-word blocks
valid: bool,
timestamp: u64, // For LRU replacement
}
struct Cache {
slots: Vec<CacheSlot>,
num_slots: usize,
block_size: usize, // Block size in words
num_ways: usize, // Associativity (1 = direct-mapped)
num_refs: u64,
num_hits: u64,
num_misses: u64,
}
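With the counters in place, reporting is a one-liner. A hypothetical convenience method (not a required part of the assignment):

impl Cache {
    fn hit_rate(&self) -> f64 {
        // hit_rate = num_hits / num_refs, guarding against division by zero
        if self.num_refs == 0 {
            return 0.0;
        }
        self.num_hits as f64 / self.num_refs as f64
    }
}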
Simulation Flow¶
flowchart TD
A["Fetch Instruction"] --> B["Check I-Cache"]
B --> C{"Hit?"}
C -->|"Yes"| D["Increment hits"]
C -->|"No"| E["Increment misses"]
D --> F["Execute Instruction"]
E --> G["Update Cache"]
G --> F
F --> H{"More instructions?"}
H -->|"Yes"| A
H -->|"No"| I["Report Statistics"]
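In rough shape, the fetch loop interleaves a cache lookup with normal execution. The names below (`cpu`, `icache`, `fetch`, `execute`) are placeholders for this sketch, not the emulator's actual API:

loop {
    icache.lookup(cpu.pc);        // record the I-cache hit or miss
    let inst = fetch(&cpu);       // read the instruction word at pc
    if !execute(&mut cpu, inst) { // assume execute returns false on halt
        break;
    }
}
println!("refs={} hits={} misses={}", icache.num_refs, icache.num_hits, icache.num_misses);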
Practice Problems¶
Problem 1: Address Decomposition¶
Given a 16-slot direct-mapped cache with 4-byte blocks and 64-bit addresses, decompose the address 0x0000000000001A3C.
Solution
**Step 1: Calculate bit widths**

Byte Offset = log2(4) = 2 bits
Index = log2(16) = 4 bits
Tag = 64 - 4 - 2 = 58 bits

**Step 2: Convert address to binary**

0x0000000000001A3C = ...0001 1010 0011 1100

**Step 3: Extract fields**

Byte Offset = bits [1:0] = 00 = 0
Index = bits [5:2] = 1111 = 15
Tag = bits [63:6] = 0x1A3C >> 6 = 0x68

**Answer:** Byte Offset: 0, Index: 15 (slot 15), Tag: 0x68

Problem 2: Hit Rate Calculation¶
A cache simulation reports: 950 hits, 50 misses. Calculate the hit rate, miss rate, and AMAT assuming hit_time = 1 cycle and miss_penalty = 100 cycles.
Solution

hit_rate = 950 / (950 + 50) = 950 / 1000 = 0.95 = 95%
miss_rate = 50 / 1000 = 0.05 = 5%
AMAT = hit_time + (miss_rate x miss_penalty) = 1 + (0.05 x 100) = 6 cycles
Problem 3: Set-Associative Address Decomposition¶
For an 8-way set-associative cache with 64 slots and 4-word blocks, calculate the number of bits for tag, set index, block index, and byte offset (assuming 64-bit addresses).
Solution
**Step 1: Calculate number of sets**

num_sets = 64 / 8 = 8 sets

**Step 2: Calculate bit widths**

Byte Offset = log2(4) = 2 bits
Block Index = log2(4) = 2 bits
Set Index = log2(8) = 3 bits
Tag = 64 - 3 - 2 - 2 = 57 bits

**Address format:**

┌──────────────────────────┬───────────┬───────────┬───────────┐
│           TAG            │ SET INDEX │ BLOCK IDX │ BYTE OFF  │
│         [63:7]           │   [6:4]   │   [3:2]   │   [1:0]   │
└──────────────────────────┴───────────┴───────────┴───────────┘
        57 bits               3 bits       2 bits      2 bits

Problem 4: Cache Trace Analysis¶
A 4-slot, 2-way set-associative cache (2 sets) with LRU replacement processes these byte addresses: 0, 4, 8, 12, 0, 16. Assume 1-word blocks.
Solution
**Configuration:**
- 2 sets, 2 ways each
- Set index = (addr >> 2) & 0b1
- Tag = addr >> 3

| Access | Byte Addr | Word Addr | Set | Tag | Result |
|---|---|---|---|---|---|
| 1 | 0 | 0 | 0 | 0 | MISS |
| 2 | 4 | 1 | 1 | 0 | MISS |
| 3 | 8 | 2 | 0 | 1 | MISS |
| 4 | 12 | 3 | 1 | 1 | MISS |
| 5 | 0 | 0 | 0 | 0 | HIT |
| 6 | 16 | 4 | 0 | 2 | MISS (LRU evicts tag=1) |

**Result:** M, M, M, M, H, M — Hit rate: 1/6 = 16.7%

Problem 5: Locality Analysis¶
Which type of locality does each code pattern exhibit?
// Pattern A
for _ in 0..100 {
count += 1;
}
// Pattern B
for i in 0..100 {
sum += array[i];
}
// Pattern C
for i in 0..100 {
process(array[i]);
}
Solution
**Pattern A: Temporal Locality** — Variable `count` is accessed repeatedly (100 times). Same location accessed multiple times.

**Pattern B: Spatial + Temporal Locality** — `array[i]`, `array[i+1]`, etc. are adjacent in memory (spatial). Variable `sum` is accessed 100 times (temporal).

**Pattern C: Spatial + Temporal Locality** — Array accessed sequentially (spatial). Function `process` code accessed 100 times (temporal).

Key Concepts¶
| Concept | Definition |
|---|---|
| Cache | Small, fast memory that holds recently accessed data |
| Hit | Data found in cache; fast access |
| Miss | Data not in cache; must fetch from memory |
| Hit Rate | num_hits / num_refs |
| Miss Rate | num_misses / num_refs = 1 - hit_rate |
| AMAT | Average Memory Access Time = hit_time + (miss_rate x miss_penalty) |
| Temporal Locality | Recently accessed data likely to be accessed again soon |
| Spatial Locality | Data near recently accessed data likely to be accessed soon |
| Direct-Mapped | Each address maps to exactly one cache slot |
| Fully-Associative | Address can be stored in any slot |
| Set-Associative | Address maps to one set containing multiple ways |
| Tag | Upper address bits identifying which block is in a slot |
| Index | Address bits selecting which slot/set to access |
| Offset | Address bits selecting byte/word within a block |
| Valid Bit | Indicates whether cache slot contains valid data |
| LRU | Least Recently Used replacement policy |
| Cold Miss | First access to a slot (valid is false) |
| Hot Miss | Slot is valid but tag does not match |
| Thrashing | Repeated evictions when addresses conflict on the same slot |
Summary¶
-
Memory Hierarchy: Computer systems use a hierarchy of memory (registers, caches, DRAM, disk) to balance speed and cost. Caches bridge the processor-memory speed gap by keeping recently accessed data close to the CPU.
-
Principles of Locality: Caches work because programs exhibit temporal locality (recently accessed data will be accessed again) and spatial locality (nearby data will be accessed soon). These principles explain why caches achieve hit rates exceeding 95%.
-
Direct-Mapped Caches: The simplest cache organization maps each address to exactly one slot using the index bits. The tag identifies which of the many addresses mapping to that slot is currently stored. Simple but suffers from conflict misses.
-
Fully-Associative Caches: An address can map to any slot, eliminating conflict misses. Requires checking all slots on every access and a replacement policy (LRU). Most flexible but most expensive in hardware.
-
Set-Associative Caches: A compromise — addresses map to one set containing multiple ways. Reduces conflict misses while keeping hardware cost reasonable.
-
Address Decomposition: A memory address is split into tag, index/set index, and offset. With multi-word blocks, a block index selects the word within the block. The number of bits for each field depends on cache size, block size, and associativity.
-
LRU Replacement: When all slots (or ways in a set) are full, LRU evicts the slot accessed least recently. Implemented in software using timestamps.
Further Reading¶
-
Computer Organization and Design: RISC-V Edition (Patterson & Hennessy) — Chapter 5: Large and Fast: Exploiting Memory Hierarchy
-
Course Guide: Cache Memory — Cache Memory Guide — Complete pseudocode for all three cache types with Rust examples
-
Project03 Assignment — Project03 — Implementation of I-cache simulation in the RISC-V emulator