# Cache Memory ## CS 631 Systems Foundations — Mar 31, 2026 --- ## Today's Agenda 1. The memory hierarchy 2. Principles of locality 3. Cache concepts and terminology 4. Memory addressing 5. Direct-mapped caches 6. Fully-associative caches 7. Set-associative caches 8. Block size 9. Cache simulation traces 10. Address bit calculations --- ## Memory Hierarchy Numbers | Level | Technology | Size | Access Time | |-------|-----------|------|-------------| | Registers | Flip-flops | ~32 x 64 bits | ~0.5 ns | | L1 Cache | SRAM | 32-64 KB | ~1-2 ns | | L2 Cache | SRAM | 256 KB-1 MB | ~5-10 ns | | L3 Cache | SRAM | 2-32 MB | ~20-40 ns | | Main Memory | DRAM | 4-64 GB | ~50-100 ns | | SSD | Flash | 256 GB-4 TB | ~100 us | | HDD | Magnetic | 1-16 TB | ~10 ms |
A cache sits between the CPU and main memory, holding recently accessed data.
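To get a feel for the gap the cache has to bridge, here is a rough back-of-the-envelope sketch using the approximate latencies from the table above (treating the ~0.5 ns register access as one CPU cycle; real numbers vary by processor):

```rust
fn main() {
    // Approximate latencies from the table above (order-of-magnitude values)
    let cycle_ns = 0.5;   // one CPU cycle ~ register access time
    let l1_ns = 2.0;      // L1 cache
    let dram_ns = 100.0;  // main memory (DRAM)

    println!("L1 hit      ~ {:.0} cycles", l1_ns / cycle_ns);   // ~4 cycles
    println!("DRAM access ~ {:.0} cycles", dram_ns / cycle_ns); // ~200 cycles
}
```

Without a cache, every load would pay something like the ~200-cycle DRAM cost.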
--- ## Principles of Locality ### Why do caches work so well? Programs exhibit **locality of reference**: - **Temporal locality**: If data is accessed, it will likely be accessed again soon - **Spatial locality**: If data is accessed, nearby data will likely be accessed soon
Caches exploit both types of locality to achieve hit rates > 97%.
--- ## Locality Examples ### Temporal Locality ```rust let mut sum: i64 = 0; for i in 0..1000 { // 'i' accessed 1000+ times sum += array[i]; // 'sum' accessed 1000 times } ``` ### Spatial Locality ```rust for i in 0..1000 { total += array[i]; // array[0], array[1], ... sequential } ``` Code also exhibits spatial locality: after executing PC, we usually execute PC+4. --- ## Cache Terminology | Term | Definition | |------|------------| | **Hit** | Data found in cache | | **Miss** | Data not in cache | | **Hit Rate** | num_hits / num_refs | | **Miss Rate** | num_misses / num_refs | | **Block** | Unit of data transferred between cache and memory | | **Slot/Line** | One entry in the cache | | **Tag** | Identifies which address block is stored | | **Valid Bit** | Whether a slot contains valid data | | **Cold Miss** | First access to a slot (valid is false) | | **Hot Miss** | Slot valid but tag doesn't match | --- ## Average Memory Access Time (AMAT) ``` AMAT = hit_time + (miss_rate × miss_penalty) ``` **Example:** hit_time = 1 cycle, miss_rate = 5%, miss_penalty = 100 cycles ``` AMAT = 1 + (0.05 × 100) = 1 + 5 = 6 cycles ```
Even a small miss rate has a big impact because the miss penalty is large.
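To see how sharply AMAT reacts to the miss rate, here is a minimal sketch that evaluates the formula for a few miss rates, keeping the same 1-cycle hit time and 100-cycle miss penalty as the example above:

```rust
fn amat(hit_time: f64, miss_rate: f64, miss_penalty: f64) -> f64 {
    hit_time + miss_rate * miss_penalty
}

fn main() {
    // Same hit time (1 cycle) and miss penalty (100 cycles) as the example above
    for miss_rate in [0.01, 0.02, 0.05, 0.10] {
        println!("miss rate {:>2.0}% -> AMAT = {:>5.1} cycles",
                 miss_rate * 100.0, amat(1.0, miss_rate, 100.0));
    }
}
```

Cutting the miss rate from 5% to 1% drops AMAT from 6 cycles to 2 cycles, which is why cache organization matters so much.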
--- ## Memory Addressing Memory is **byte-addressable**, but caches work with **words** (4 bytes): ```rust // Byte address to word address let word_addr = addr >> 2; // Same as addr / 4 // Word address to byte address let addr = word_addr << 2; // Same as word_addr * 4 ``` | Byte Addr | Word Addr | Aligned? | |-----------|-----------|----------| | 0x00 | 0 | Yes | | 0x04 | 1 | Yes | | 0x08 | 2 | Yes | | 0x03 | - | No | --- ## Three Cache Organizations
```mermaid
graph TD
    subgraph DM[Direct-Mapped]
        A1[Address maps to exactly ONE slot]
    end
    subgraph FA[Fully-Associative]
        A2[Address can map to ANY slot]
    end
    subgraph SA[Set-Associative]
        A3[Address maps to ONE SET of slots]
    end
```
--- ## Direct-Mapped Cache Each address maps to **exactly one slot**. Simplest form of cache. ### Cache Slot Structure ```rust struct CacheSlot { tag: u64, block: [u32; 1], // One word per block valid: bool, } let slots: [CacheSlot; N]; ``` ### Mapping Function ```rust let word_addr = addr >> 2; let slot_index = word_addr & 0b11; // For 4 slots: word_addr % 4 ``` --- ## Address Decomposition (Direct-Mapped) For a 4-slot cache with 1-word blocks: ``` 64-bit byte address: ┌─────────────────────────┬─────────┬───────────┐ │ TAG │ INDEX │ BYTE OFF │ │ [63:4] │ [3:2] │ [1:0] │ └─────────────────────────┴─────────┴───────────┘ 60 bits 2 bits 2 bits ``` - **Byte Offset**: Selects byte within a word - **Index**: Selects which cache slot - **Tag**: Identifies which address block is stored --- ## Why Bit Masking Works ``` dec bin slot 0 00000 0 1 00001 1 2 00010 2 3 00011 3 4 00100 0 (same slot as 0) 5 00101 1 (same slot as 1) 6 00110 2 (same slot as 2) 7 00111 3 (same slot as 3) 8 01000 0 (same slot as 0, 4) ``` The last 2 bits of the word address = `word_addr % 4`. The remaining upper bits = the **tag** (unique per address in the same slot). --- ## Direct-Mapped Lookup
```mermaid
flowchart TD
    A["Memory Address"] --> B["Extract Index"]
    B --> C["Access Cache Slot"]
    C --> D{"Valid?"}
    D -->|No| E["MISS cold"]
    D -->|Yes| F{"Tag Match?"}
    F -->|No| G["MISS hot"]
    F -->|Yes| H["HIT"]
    E --> I["Fetch from Memory<br>Update Slot"]
    G --> I
```
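As a quick sanity check before the lookup code on the next slide, this sketch reproduces the index/tag split from the bit-masking table above for the 4-slot, 1-word-block cache:

```rust
fn main() {
    // 4-slot direct-mapped cache, 1-word blocks: the low 2 bits of the
    // word address select the slot, the remaining upper bits form the tag.
    for word_addr in 0u64..9 {
        let slot = word_addr & 0b11;  // same as word_addr % 4
        let tag = word_addr >> 2;     // unique per address within the same slot
        println!("word addr {:2} ({:05b}) -> slot {}, tag {}",
                 word_addr, word_addr, slot, tag);
    }
}
```

Word addresses 0, 4, and 8 all land in slot 0 but carry tags 0, 1, and 2 — the tag is how the cache tells them apart.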
--- ## Direct-Mapped Lookup Code ```rust let addr_tag = addr >> 4; // Remove byte_offset (2) + index (2) let index_mask: u64 = 0b11; let slot_index = ((addr >> 2) & index_mask) as usize; let slot = &mut slots[slot_index]; if slot.valid && slot.tag == addr_tag { // Hit num_hits += 1; slot.block[0] } else { // Miss (cold if !slot.valid, hot if slot.valid) num_misses += 1; slot.block[0] = mem_read_word(addr); slot.tag = addr_tag; slot.valid = true; slot.block[0] } ``` --- ## Conflict Misses and Thrashing Direct-mapped caches suffer from **conflict misses**: ``` Addresses 0, 16, 32, 48 all map to slot 0 ``` **Thrashing example** with 4 slots — word addrs: 0, 4, 0, 8, 0, 12, 0 | Access | Addr | Slot | Tag | Result | |--------|------|------|-----|--------| | 1 | 0 | 0 | 0 | MISS | | 2 | 4 | 0 | 1 | MISS (evicts 0) | | 3 | 0 | 0 | 0 | MISS (evicts 1) | | 4 | 8 | 0 | 2 | MISS (evicts 0) | | 5-7 | ... | 0 | ... | ALL MISS | Hit rate: **0%** — all addresses map to slot 0! --- ## Fully-Associative Cache Address can be stored in **ANY** slot — no conflict misses. ### Address Decomposition No index bits — only tag and byte offset: ``` ┌─────────────────────────────────┬───────────┐ │ TAG │ BYTE OFF │ │ [63:2] │ [1:0] │ └─────────────────────────────────┴───────────┘ ``` Must compare address tag against **all** slot tags on every access. --- ## Fully-Associative Lookup Code ```rust refs += 1; let addr_tag = addr >> 2; // Remove byte_offset only for i in 0..NUM_SLOTS { let slot = &mut slots[i]; if slot.valid && slot.tag == addr_tag { // Hit num_hits += 1; slot.timestamp = refs; return slot.block[0]; } } // Miss — find slot to replace num_misses += 1; let slot = find_lru_slot(&mut slots); slot.block[0] = mem_read_word(addr); slot.tag = addr_tag; slot.valid = true; slot.timestamp = refs; slot.block[0] ``` --- ## LRU Replacement Policy **Least Recently Used**: evict the slot not accessed for the longest time. Use a logical timestamp (reference count). Each access updates the slot's timestamp. The slot with the **smallest** timestamp is the LRU. ```rust fn find_lru_slot(slots: &mut [CacheSlot]) -> &mut CacheSlot { // First look for unused slot (cold miss) for slot in slots.iter_mut() { if !slot.valid { return slot; } } // All valid — find oldest timestamp (hot miss) let mut lru_idx = 0; let mut min_ts = u64::MAX; for (i, slot) in slots.iter().enumerate() { if slot.timestamp < min_ts { min_ts = slot.timestamp; lru_idx = i; } } &mut slots[lru_idx] } ``` --- ## Fully-Associative Trade-offs | Aspect | Fully-Associative | |--------|-------------------| | Hit Rate | Highest (no conflict misses) | | Lookup Speed | Slowest (check all slots) | | Hardware Cost | Highest (parallel comparators) | | Replacement | Flexible (LRU, FIFO, Random) |
Too expensive for general cache memory, but used in specific cases like TLBs (translation lookaside buffers) for virtual memory.
--- ## Set-Associative Cache A **compromise** between direct-mapped and fully-associative: - Cache divided into **sets** - Each set contains multiple **ways** (slots) - Address maps to **one set** (like direct-mapped) - Within a set, can go in **any way** (like fully-associative) **N-way set-associative**: each set has N slots ``` Way 1 Way 0 ┌───────┬─────┬─────┬───────┬─────┬─────┐ Set 1 │ valid │ tag │data │ valid │ tag │data │ ├───────┼─────┼─────┼───────┼─────┼─────┤ Set 0 │ valid │ tag │data │ valid │ tag │data │ └───────┴─────┴─────┴───────┴─────┴─────┘ ``` --- ## Set-Associative Formulas ```rust let num_sets = num_slots / num_ways; // set_index_bits = log2(num_sets) // byte_offset_bits = log2(block_size_in_bytes) // tag_bits = addr_bits - set_index_bits - byte_offset_bits ``` **Example**: 4 slots, 2-way → 2 sets ``` ┌──────────────────────────┬──────────┬───────────┐ │ TAG │ SET IDX │ BYTE OFF │ │ [63:3] │ [2] │ [1:0] │ └──────────────────────────┴──────────┴───────────┘ 61 bits 1 bit 2 bits ``` --- ## Set-Associative Lookup Code ```rust refs += 1; let num_ways = 4; let addr_tag = addr >> 5; // Remove byte_off (2) + set_idx (3) let set_index = ((addr >> 2) & 0b111) as usize; let set_base = set_index * num_ways; for i in 0..num_ways { let slot = &mut slots[set_base + i]; if slot.valid && slot.tag == addr_tag { // Hit num_hits += 1; slot.timestamp = refs; return slot.block[0]; } } // Miss — find LRU within the set num_misses += 1; let slot = find_lru_slot_in_set(&mut slots, set_base, num_ways); slot.block[0] = mem_read_word(addr); slot.tag = addr_tag; slot.valid = true; slot.timestamp = refs; slot.block[0] ``` --- ## LRU Within a Set Same logic as fully-associative, but only search within the set: ```rust fn find_lru_slot_in_set( slots: &mut [CacheSlot], set_base: usize, num_ways: usize, ) -> &mut CacheSlot { for i in 0..num_ways { if !slots[set_base + i].valid { return &mut slots[set_base + i]; // Cold miss } } let mut lru_idx = set_base; let mut min_ts = u64::MAX; for i in 0..num_ways { let idx = set_base + i; if slots[idx].timestamp < min_ts { min_ts = slots[idx].timestamp; lru_idx = idx; } } &mut slots[lru_idx] // Hot miss } ``` --- ## Block Size To exploit **spatial locality**, cache slots hold multiple words: ```rust struct CacheSlot { tag: u64, block: [u32; 4], // 4-word block (16 bytes) valid: bool, timestamp: u64, } ``` On a miss, load the **entire block** from memory — neighbors come for free. --- ## Address Decomposition with Blocks 4-slot direct-mapped, 4-word blocks: ``` ┌──────────────────────┬───────────┬───────────┬───────────┐ │ TAG │ SLOT IDX │ BLOCK IDX │ BYTE OFF │ │ [63:6] │ [5:4] │ [3:2] │ [1:0] │ └──────────────────────┴───────────┴───────────┴───────────┘ 58 bits 2 bits 2 bits 2 bits ``` **Block Index** selects which word within the block. 
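As a sanity check on the layout above, here is a small sketch that extracts all four fields from a byte address for this 4-slot, 4-word-block cache (the shift amounts and masks follow directly from the diagram; the example address is arbitrary):

```rust
fn main() {
    // 4-slot direct-mapped cache, 4-word (16-byte) blocks:
    //   [1:0] byte offset, [3:2] block index, [5:4] slot index, [63:6] tag
    let addr: u64 = 0xAC; // arbitrary example address
    let byte_off    =  addr       & 0b11;
    let block_index = (addr >> 2) & 0b11;
    let slot_index  = (addr >> 4) & 0b11;
    let tag         =  addr >> 6;
    println!("addr {:#x}: tag={}, slot={}, word-in-block={}, byte={}",
             addr, tag, slot_index, block_index, byte_off);
    // Prints: addr 0xac: tag=2, slot=2, word-in-block=3, byte=0
}
```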
### Finding the block base on a miss: ```rust let word_addr = addr / 4; let block_index = word_addr % 4; // Word position in block let block_base = word_addr - block_index; // Load 4 words starting at block_base ``` --- ## Cache Comparison | Feature | Direct-Mapped | Fully-Assoc | N-Way Set-Assoc | |---------|:---:|:---:|:---:| | Maps to | 1 slot | Any slot | 1 set (N slots) | | Index bits | log2(slots) | 0 | log2(sets) | | Comparisons | 1 | All slots | N (ways) | | Conflict misses | Many | None | Reduced | | Hardware cost | Lowest | Highest | Medium | | Replacement | None needed | LRU/FIFO | LRU in set | ### Special Cases - **Direct-mapped** = 1-way set-associative - **Fully-associative** = N-way where N = total slots --- ## Three Cache Questions ### 1. Where to Look? | Cache Type | Answer | |------------|--------| | Direct-Mapped | Exactly one slot | | Fully-Associative | Any slot | | Set-Associative | One set of slots | ### 2. Is It There? ```rust let is_hit = slot.valid && slot.tag == addr_tag; ``` ### 3. What to Replace? Direct-mapped: no choice. Associative: use LRU. --- ## Trace: Direct-Mapped vs 2-Way Address sequence: 0, 16, 0, 16, 0 — 2-slot cache **Direct-Mapped:** Both map to slot 0 | Access | Addr | Result | |--------|------|--------| | 1 | 0 | MISS | | 2 | 16 | MISS (evicts 0) | | 3 | 0 | MISS (evicts 16) | | 4 | 16 | MISS (evicts 0) | | 5 | 0 | MISS (evicts 16) | Hit rate: **0%** (thrashing!) --- ## Trace: 2-Way Eliminates Thrashing Same sequence: 0, 16, 0, 16, 0 — 2-slot, 2-way (1 set) Both addresses can coexist in different ways: | Access | Addr | Tag | Way 0 | Way 1 | Result | |--------|------|-----|-------|-------|--------| | 1 | 0 | 0 | 0 | - | MISS | | 2 | 16 | 4 | 0 | 4 | MISS | | 3 | 0 | 0 | 0 | 4 | **HIT** | | 4 | 16 | 4 | 0 | 4 | **HIT** | | 5 | 0 | 0 | 0 | 4 | **HIT** | Hit rate: **60%**
2-way associativity eliminates the conflict misses.
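Both traces can be reproduced with a few lines of simulation. This is a minimal sketch (not the Project03 simulator), assuming 1-word blocks and LRU replacement within each set, parameterized only by the number of sets and ways:

```rust
#[derive(Clone, Copy, Default)]
struct Slot { valid: bool, tag: u64, ts: u64 }

/// Run a trace of byte addresses through a tiny cache and return the hit count.
fn run(trace: &[u64], num_sets: usize, num_ways: usize) -> usize {
    let mut sets = vec![vec![Slot::default(); num_ways]; num_sets];
    let (mut hits, mut now) = (0, 0u64);
    for &addr in trace {
        now += 1;
        let word = addr >> 2;                    // 1-word blocks
        let set = (word as usize) % num_sets;
        let tag = word / num_sets as u64;
        if let Some(way) = sets[set].iter().position(|s| s.valid && s.tag == tag) {
            hits += 1;
            sets[set][way].ts = now;             // refresh LRU timestamp
        } else {
            // Miss: fill an invalid way, or evict the least recently used one
            let victim = sets[set]
                .iter()
                .enumerate()
                .min_by_key(|(_, s)| if s.valid { s.ts } else { 0 })
                .map(|(i, _)| i)
                .unwrap();
            sets[set][victim] = Slot { valid: true, tag, ts: now };
        }
    }
    hits
}

fn main() {
    let trace = [0u64, 16, 0, 16, 0];            // byte addresses from the traces above
    println!("direct-mapped (2 slots): {}/5 hits", run(&trace, 2, 1)); // 0/5
    println!("2-way (1 set of 2):      {}/5 hits", run(&trace, 1, 2)); // 3/5
}
```

Total capacity is identical in both runs (two slots); only the placement rule changes, and the hit count goes from 0/5 to 3/5.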
--- ## 2-Way Set-Associative Trace 4-slot, 2-way (2 sets), LRU. Word addrs: 0, 8, 0, 6, 8 | # | Addr | Set | Tag | Result | Set 0 Way0 | Set 0 Way1 | |---|------|-----|-----|--------|:---:|:---:| | 1 | 0 | 0 | 0 | MISS | (0, ts=1) | - | | 2 | 8 | 0 | 4 | MISS | (0, ts=1) | (4, ts=2) | | 3 | 0 | 0 | 0 | **HIT** | (0, ts=3) | (4, ts=2) | | 4 | 6 | 0 | 3 | MISS | (0, ts=3) | (3, ts=4) | | 5 | 8 | 0 | 4 | MISS | (4, ts=5) | (3, ts=4) | Hits: 1, Misses: 4, Hit rate: 20% --- ## Address Bit Calculation ### General Formula - **Byte Offset bits** = log2(block_size_bytes) - **Index bits** = log2(num_slots) or log2(num_sets) - **Tag bits** = address_bits - index_bits - offset_bits ### Example: 16-slot, 16-byte blocks, 64-bit ``` Byte Offset = log2(16) = 4 bits Index = log2(16) = 4 bits Tag = 64 - 4 - 4 = 56 bits ``` ### Example: 8-way, 64 slots, 4-word blocks, 64-bit ``` num_sets = 64 / 8 = 8 Byte Offset = log2(4) = 2 bits Block Index = log2(4) = 2 bits Set Index = log2(8) = 3 bits Tag = 64 - 2 - 2 - 3 = 57 bits ``` --- ## Project03: Cache Simulator ```rust struct Cache { slots: Vec
<CacheSlot>,
    num_slots: usize,
    block_size: usize, // Block size in words
    num_ways: usize,   // 1 = direct-mapped
    num_refs: u64,
    num_hits: u64,
    num_misses: u64,
}
```
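One piece the simulator needs is the address-bit split implied by its configuration. Here is a hypothetical helper (not part of any required interface) that computes it the same way as the Address Bit Calculation slide, assuming power-of-two sizes and 4-byte words:

```rust
/// Returns (tag_bits, set_index_bits, block_index_bits, byte_offset_bits)
/// for a 64-bit address. Hypothetical helper; assumes power-of-two sizes.
fn address_bits(num_slots: u32, block_size_words: u32, num_ways: u32) -> (u32, u32, u32, u32) {
    let num_sets = num_slots / num_ways;
    let byte_offset_bits = 2;                        // 4 bytes per word
    let block_index_bits = block_size_words.ilog2(); // which word within the block
    let set_index_bits = num_sets.ilog2();
    let tag_bits = 64 - set_index_bits - block_index_bits - byte_offset_bits;
    (tag_bits, set_index_bits, block_index_bits, byte_offset_bits)
}

fn main() {
    // The 8-way example from the Address Bit Calculation slide:
    // 64 slots, 4-word blocks, 8 ways -> 8 sets -> (57, 3, 2, 2)
    println!("{:?}", address_bits(64, 4, 8));
}
```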
```mermaid
flowchart LR
    A["Fetch"] --> B["Check I-Cache"]
    B --> C{"Hit?"}
    C -->|Yes| D["Count hit"]
    C -->|No| E["Count miss<br>Update cache"]
    D --> F["Execute"]
    E --> F
    F --> A
```
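As a self-contained toy version of that loop, the sketch below fetches 64 sequential instructions (PC, PC+4, ...) through a tiny direct-mapped I-cache with 4-word blocks; the cache size, block size, and starting PC are arbitrary choices for illustration, not Project03 parameters:

```rust
fn main() {
    const NUM_SLOTS: usize = 4;
    let mut valid = [false; NUM_SLOTS];
    let mut tags = [0u64; NUM_SLOTS];
    let (mut hits, mut refs) = (0u64, 0u64);

    let mut pc: u64 = 0x1000;
    for _ in 0..64 {
        refs += 1;
        // 4-word (16-byte) blocks: bits [3:0] are the offset, the next 2 bits the index
        let index = ((pc >> 4) as usize) % NUM_SLOTS;
        let tag = pc >> 6;
        if valid[index] && tags[index] == tag {
            hits += 1;                // "Count hit"
        } else {
            valid[index] = true;      // "Count miss / Update cache"
            tags[index] = tag;
        }
        pc += 4;                      // sequential execution: the next instruction
    }
    println!("I-cache hit rate: {}/{} = {:.0}%",
             hits, refs, 100.0 * hits as f64 / refs as f64);
}
```

Only the first word of each block misses and the next three fetches hit, so purely sequential code already yields a 75% hit rate here — instruction fetch spatial locality at work.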
--- ## Key Takeaways - **Memory hierarchy** bridges the processor-memory speed gap - **Locality** (temporal + spatial) is why caches achieve > 95% hit rates - **Direct-mapped**: simple, one slot per address, suffers conflict misses - **Fully-associative**: any slot, no conflicts, expensive - **Set-associative**: compromise — one set, multiple ways - **LRU replacement**: evict least recently used (timestamps) - **Block size**: larger blocks exploit spatial locality - **AMAT** = hit_time + (miss_rate × miss_penalty) --- ## Next Lecture - Cache simulation implementation in the emulator - Dynamic analysis (instruction counting, branch statistics) - Project03 details