# Cache Memory ## CS 631 Systems Foundations — Mar 31, 2026 --- ## Today's Agenda 1. The memory hierarchy 2. Principles of locality 3. Cache concepts and terminology 4. Memory addressing 5. Direct-mapped caches 6. Fully-associative caches 7. Set-associative caches 8. Block size 9. Cache simulation traces 10. Address bit calculations --- ## Memory Hierarchy Numbers | Level | Technology | Size | Access Time | |-------|-----------|------|-------------| | Registers | Flip-flops | ~32 x 64 bits | ~0.5 ns | | L1 Cache | SRAM | 32-64 KB | ~1-2 ns | | L2 Cache | SRAM | 256 KB-1 MB | ~5-10 ns | | L3 Cache | SRAM | 2-32 MB | ~20-40 ns | | Main Memory | DRAM | 4-64 GB | ~50-100 ns | | SSD | Flash | 256 GB-4 TB | ~100 us | | HDD | Magnetic | 1-16 TB | ~10 ms |
A cache sits between the CPU and main memory, holding recently accessed data.
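To get a feel for the gap the cache has to bridge, here is a rough back-of-the-envelope sketch using the approximate latencies from the table above (treating the ~0.5 ns register access as one CPU cycle; real numbers vary by processor):

```rust
fn main() {
    // Approximate latencies from the table above (order-of-magnitude values)
    let cycle_ns = 0.5;   // one CPU cycle ~ register access time
    let l1_ns = 2.0;      // L1 cache
    let dram_ns = 100.0;  // main memory (DRAM)

    println!("L1 hit      ~ {:.0} cycles", l1_ns / cycle_ns);   // ~4 cycles
    println!("DRAM access ~ {:.0} cycles", dram_ns / cycle_ns); // ~200 cycles
}
```

Without a cache, every load would pay something like the ~200-cycle DRAM cost.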
--- ## Principles of Locality ### Why do caches work so well? Programs exhibit **locality of reference**: - **Temporal locality**: If data is accessed, it will likely be accessed again soon - **Spatial locality**: If data is accessed, nearby data will likely be accessed soon
Caches exploit both types of locality to achieve hit rates > 97%.
--- ## Locality Examples ### Temporal Locality ```rust let mut sum: i64 = 0; for i in 0..1000 { // 'i' accessed 1000+ times sum += array[i]; // 'sum' accessed 1000 times } ``` ### Spatial Locality ```rust for i in 0..1000 { total += array[i]; // array[0], array[1], ... sequential } ``` Code also exhibits spatial locality: after executing PC, we usually execute PC+4. --- ## Cache Terminology | Term | Definition | |------|------------| | **Hit** | Data found in cache | | **Miss** | Data not in cache | | **Hit Rate** | num_hits / num_refs | | **Miss Rate** | num_misses / num_refs | | **Block** | Unit of data transferred between cache and memory | | **Slot/Line** | One entry in the cache | | **Tag** | Identifies which address block is stored | | **Valid Bit** | Whether a slot contains valid data | | **Cold Miss** | First access to a slot (valid is false) | | **Hot Miss** | Slot valid but tag doesn't match | --- ## Average Memory Access Time (AMAT) ``` AMAT = hit_time + (miss_rate × miss_penalty) ``` **Example:** hit_time = 1 cycle, miss_rate = 5%, miss_penalty = 100 cycles ``` AMAT = 1 + (0.05 × 100) = 1 + 5 = 6 cycles ```
Even a small miss rate has a big impact because the miss penalty is large.
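To see how sharply AMAT reacts to the miss rate, here is a minimal sketch that evaluates the formula for a few miss rates, keeping the same 1-cycle hit time and 100-cycle miss penalty as the example above:

```rust
fn amat(hit_time: f64, miss_rate: f64, miss_penalty: f64) -> f64 {
    hit_time + miss_rate * miss_penalty
}

fn main() {
    // Same hit time (1 cycle) and miss penalty (100 cycles) as the example above
    for miss_rate in [0.01, 0.02, 0.05, 0.10] {
        println!("miss rate {:>2.0}% -> AMAT = {:>5.1} cycles",
                 miss_rate * 100.0, amat(1.0, miss_rate, 100.0));
    }
}
```

Cutting the miss rate from 5% to 1% drops AMAT from 6 cycles to 2 cycles, which is why cache organization matters so much.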
--- ## Memory Addressing Memory is **byte-addressable**, but caches work with **words** (4 bytes): ```rust // Byte address to word address let word_addr = addr >> 2; // Same as addr / 4 // Word address to byte address let addr = word_addr << 2; // Same as word_addr * 4 ``` | Byte Addr | Word Addr | Aligned? | |-----------|-----------|----------| | 0x00 | 0 | Yes | | 0x04 | 1 | Yes | | 0x08 | 2 | Yes | | 0x03 | - | No | --- ## Three Cache Organizations
```mermaid
graph TD
    subgraph DM[Direct-Mapped]
        A1[Address maps to exactly ONE slot]
    end
    subgraph FA[Fully-Associative]
        A2[Address can map to ANY slot]
    end
    subgraph SA[Set-Associative]
        A3[Address maps to ONE SET of slots]
    end
```
--- ## Direct-Mapped Cache Each address maps to **exactly one slot**. Simplest form of cache. ### Cache Slot Structure ```rust struct CacheSlot { tag: u64, block: [u32; 1], // One word per block valid: bool, } let slots: [CacheSlot; N]; ``` ### Mapping Function ```rust let word_addr = addr >> 2; let slot_index = word_addr & 0b11; // For 4 slots: word_addr % 4 ``` --- ## Address Decomposition (Direct-Mapped) For a 4-slot cache with 1-word blocks: ``` 64-bit byte address: ┌─────────────────────────┬─────────┬───────────┐ │ TAG │ INDEX │ BYTE OFF │ │ [63:4] │ [3:2] │ [1:0] │ └─────────────────────────┴─────────┴───────────┘ 60 bits 2 bits 2 bits ``` - **Byte Offset**: Selects byte within a word - **Index**: Selects which cache slot - **Tag**: Identifies which address block is stored --- ## Why Bit Masking Works ``` dec bin slot 0 00000 0 1 00001 1 2 00010 2 3 00011 3 4 00100 0 (same slot as 0) 5 00101 1 (same slot as 1) 6 00110 2 (same slot as 2) 7 00111 3 (same slot as 3) 8 01000 0 (same slot as 0, 4) ``` The last 2 bits of the word address = `word_addr % 4`. The remaining upper bits = the **tag** (unique per address in the same slot). --- ## Direct-Mapped Lookup
```mermaid
flowchart TD
    A["Memory Address"] --> B["Extract Index"]
    B --> C["Access Cache Slot"]
    C --> D{"Valid?"}
    D -->|No| E["MISS cold"]
    D -->|Yes| F{"Tag Match?"}
    F -->|No| G["MISS hot"]
    F -->|Yes| H["HIT"]
    E --> I["Fetch from Memory<br>Update Slot"]
    G --> I
```
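As a quick sanity check before the lookup code on the next slide, this sketch reproduces the index/tag split from the bit-masking table above for the 4-slot, 1-word-block cache:

```rust
fn main() {
    // 4-slot direct-mapped cache, 1-word blocks: the low 2 bits of the
    // word address select the slot, the remaining upper bits form the tag.
    for word_addr in 0u64..9 {
        let slot = word_addr & 0b11;  // same as word_addr % 4
        let tag = word_addr >> 2;     // unique per address within the same slot
        println!("word addr {:2} ({:05b}) -> slot {}, tag {}",
                 word_addr, word_addr, slot, tag);
    }
}
```

Word addresses 0, 4, and 8 all land in slot 0 but carry tags 0, 1, and 2 — the tag is how the cache tells them apart.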
--- ## Direct-Mapped Lookup Code ```rust let addr_tag = addr >> 4; // Remove byte_offset (2) + index (2) let index_mask: u64 = 0b11; let slot_index = ((addr >> 2) & index_mask) as usize; let slot = &mut slots[slot_index]; if slot.valid && slot.tag == addr_tag { // Hit num_hits += 1; slot.block[0] } else { // Miss (cold if !slot.valid, hot if slot.valid) num_misses += 1; slot.block[0] = mem_read_word(addr); slot.tag = addr_tag; slot.valid = true; slot.block[0] } ``` --- ## Conflict Misses and Thrashing Direct-mapped caches suffer from **conflict misses**: ``` Addresses 0, 16, 32, 48 all map to slot 0 ``` **Thrashing example** with 4 slots — word addrs: 0, 4, 0, 8, 0, 12, 0 | Access | Addr | Slot | Tag | Result | |--------|------|------|-----|--------| | 1 | 0 | 0 | 0 | MISS | | 2 | 4 | 0 | 1 | MISS (evicts 0) | | 3 | 0 | 0 | 0 | MISS (evicts 1) | | 4 | 8 | 0 | 2 | MISS (evicts 0) | | 5-7 | ... | 0 | ... | ALL MISS | Hit rate: **0%** — all addresses map to slot 0! --- ## Fully-Associative Cache Address can be stored in **ANY** slot — no conflict misses. ### Address Decomposition No index bits — only tag and byte offset: ``` ┌─────────────────────────────────┬───────────┐ │ TAG │ BYTE OFF │ │ [63:2] │ [1:0] │ └─────────────────────────────────┴───────────┘ ``` Must compare address tag against **all** slot tags on every access. --- ## Fully-Associative Lookup Code ```rust refs += 1; let addr_tag = addr >> 2; // Remove byte_offset only for i in 0..NUM_SLOTS { let slot = &mut slots[i]; if slot.valid && slot.tag == addr_tag { // Hit num_hits += 1; slot.timestamp = refs; return slot.block[0]; } } // Miss — find slot to replace num_misses += 1; let slot = find_lru_slot(&mut slots); slot.block[0] = mem_read_word(addr); slot.tag = addr_tag; slot.valid = true; slot.timestamp = refs; slot.block[0] ``` --- ## LRU Replacement Policy **Least Recently Used**: evict the slot not accessed for the longest time. Use a logical timestamp (reference count). Each access updates the slot's timestamp. The slot with the **smallest** timestamp is the LRU. ```rust fn find_lru_slot(slots: &mut [CacheSlot]) -> &mut CacheSlot { // First look for unused slot (cold miss) for slot in slots.iter_mut() { if !slot.valid { return slot; } } // All valid — find oldest timestamp (hot miss) let mut lru_idx = 0; let mut min_ts = u64::MAX; for (i, slot) in slots.iter().enumerate() { if slot.timestamp < min_ts { min_ts = slot.timestamp; lru_idx = i; } } &mut slots[lru_idx] } ``` --- ## Fully-Associative Trade-offs | Aspect | Fully-Associative | |--------|-------------------| | Hit Rate | Highest (no conflict misses) | | Lookup Speed | Slowest (check all slots) | | Hardware Cost | Highest (parallel comparators) | | Replacement | Flexible (LRU, FIFO, Random) |
Too expensive for general cache memory, but used in specific cases like TLBs (translation lookaside buffers) for virtual memory.
--- ## Set-Associative Cache A **compromise** between direct-mapped and fully-associative: - Cache divided into **sets** - Each set contains multiple **ways** (slots) - Address maps to **one set** (like direct-mapped) - Within a set, can go in **any way** (like fully-associative) **N-way set-associative**: each set has N slots ``` Way 1 Way 0 ┌───────┬─────┬─────┬───────┬─────┬─────┐ Set 1 │ valid │ tag │data │ valid │ tag │data │ ├───────┼─────┼─────┼───────┼─────┼─────┤ Set 0 │ valid │ tag │data │ valid │ tag │data │ └───────┴─────┴─────┴───────┴─────┴─────┘ ``` --- ## Set-Associative Formulas ```rust let num_sets = num_slots / num_ways; // set_index_bits = log2(num_sets) // byte_offset_bits = log2(block_size_in_bytes) // tag_bits = addr_bits - set_index_bits - byte_offset_bits ``` **Example**: 4 slots, 2-way → 2 sets ``` ┌──────────────────────────┬──────────┬───────────┐ │ TAG │ SET IDX │ BYTE OFF │ │ [63:3] │ [2] │ [1:0] │ └──────────────────────────┴──────────┴───────────┘ 61 bits 1 bit 2 bits ``` --- ## Set-Associative Lookup Code ```rust refs += 1; let num_ways = 4; let addr_tag = addr >> 5; // Remove byte_off (2) + set_idx (3) let set_index = ((addr >> 2) & 0b111) as usize; let set_base = set_index * num_ways; for i in 0..num_ways { let slot = &mut slots[set_base + i]; if slot.valid && slot.tag == addr_tag { // Hit num_hits += 1; slot.timestamp = refs; return slot.block[0]; } } // Miss — find LRU within the set num_misses += 1; let slot = find_lru_slot_in_set(&mut slots, set_base, num_ways); slot.block[0] = mem_read_word(addr); slot.tag = addr_tag; slot.valid = true; slot.timestamp = refs; slot.block[0] ``` --- ## LRU Within a Set Same logic as fully-associative, but only search within the set: ```rust fn find_lru_slot_in_set( slots: &mut [CacheSlot], set_base: usize, num_ways: usize, ) -> &mut CacheSlot { for i in 0..num_ways { if !slots[set_base + i].valid { return &mut slots[set_base + i]; // Cold miss } } let mut lru_idx = set_base; let mut min_ts = u64::MAX; for i in 0..num_ways { let idx = set_base + i; if slots[idx].timestamp < min_ts { min_ts = slots[idx].timestamp; lru_idx = idx; } } &mut slots[lru_idx] // Hot miss } ``` --- ## Block Size To exploit **spatial locality**, cache slots hold multiple words: ```rust struct CacheSlot { tag: u64, block: [u32; 4], // 4-word block (16 bytes) valid: bool, timestamp: u64, } ``` On a miss, load the **entire block** from memory — neighbors come for free. --- ## Address Decomposition with Blocks 4-slot direct-mapped, 4-word blocks: ``` ┌──────────────────────┬───────────┬───────────┬───────────┐ │ TAG │ SLOT IDX │ BLOCK IDX │ BYTE OFF │ │ [63:6] │ [5:4] │ [3:2] │ [1:0] │ └──────────────────────┴───────────┴───────────┴───────────┘ 58 bits 2 bits 2 bits 2 bits ``` **Block Index** selects which word within the block. 
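As a sanity check on the layout above, here is a small sketch that extracts all four fields from a byte address for this 4-slot, 4-word-block cache (the shift amounts and masks follow directly from the diagram; the example address is arbitrary):

```rust
fn main() {
    // 4-slot direct-mapped cache, 4-word (16-byte) blocks:
    //   [1:0] byte offset, [3:2] block index, [5:4] slot index, [63:6] tag
    let addr: u64 = 0xAC; // arbitrary example address
    let byte_off    =  addr       & 0b11;
    let block_index = (addr >> 2) & 0b11;
    let slot_index  = (addr >> 4) & 0b11;
    let tag         =  addr >> 6;
    println!("addr {:#x}: tag={}, slot={}, word-in-block={}, byte={}",
             addr, tag, slot_index, block_index, byte_off);
    // Prints: addr 0xac: tag=2, slot=2, word-in-block=3, byte=0
}
```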
### Finding the block base on a miss: ```rust let word_addr = addr / 4; let block_index = word_addr % 4; // Word position in block let block_base = word_addr - block_index; // Load 4 words starting at block_base ``` --- ## Cache Comparison | Feature | Direct-Mapped | Fully-Assoc | N-Way Set-Assoc | |---------|:---:|:---:|:---:| | Maps to | 1 slot | Any slot | 1 set (N slots) | | Index bits | log2(slots) | 0 | log2(sets) | | Comparisons | 1 | All slots | N (ways) | | Conflict misses | Many | None | Reduced | | Hardware cost | Lowest | Highest | Medium | | Replacement | None needed | LRU/FIFO | LRU in set | ### Special Cases - **Direct-mapped** = 1-way set-associative - **Fully-associative** = N-way where N = total slots --- ## Three Cache Questions ### 1. Where to Look? | Cache Type | Answer | |------------|--------| | Direct-Mapped | Exactly one slot | | Fully-Associative | Any slot | | Set-Associative | One set of slots | ### 2. Is It There? ```rust let is_hit = slot.valid && slot.tag == addr_tag; ``` ### 3. What to Replace? Direct-mapped: no choice. Associative: use LRU. --- ## Trace: Direct-Mapped vs 2-Way Address sequence: 0, 16, 0, 16, 0 — 2-slot cache **Direct-Mapped:** Both map to slot 0 | Access | Addr | Result | |--------|------|--------| | 1 | 0 | MISS | | 2 | 16 | MISS (evicts 0) | | 3 | 0 | MISS (evicts 16) | | 4 | 16 | MISS (evicts 0) | | 5 | 0 | MISS (evicts 16) | Hit rate: **0%** (thrashing!) --- ## Trace: 2-Way Eliminates Thrashing Same sequence: 0, 16, 0, 16, 0 — 2-slot, 2-way (1 set) Both addresses can coexist in different ways: | Access | Addr | Tag | Way 0 | Way 1 | Result | |--------|------|-----|-------|-------|--------| | 1 | 0 | 0 | 0 | - | MISS | | 2 | 16 | 4 | 0 | 4 | MISS | | 3 | 0 | 0 | 0 | 4 | **HIT** | | 4 | 16 | 4 | 0 | 4 | **HIT** | | 5 | 0 | 0 | 0 | 4 | **HIT** | Hit rate: **60%**
2-way associativity eliminates the conflict misses.
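Both traces can be reproduced with a few lines of simulation. This is a minimal sketch (not the Project03 simulator), assuming 1-word blocks and LRU replacement within each set, parameterized only by the number of sets and ways:

```rust
#[derive(Clone, Copy, Default)]
struct Slot { valid: bool, tag: u64, ts: u64 }

/// Run a trace of byte addresses through a tiny cache and return the hit count.
fn run(trace: &[u64], num_sets: usize, num_ways: usize) -> usize {
    let mut sets = vec![vec![Slot::default(); num_ways]; num_sets];
    let (mut hits, mut now) = (0, 0u64);
    for &addr in trace {
        now += 1;
        let word = addr >> 2;                    // 1-word blocks
        let set = (word as usize) % num_sets;
        let tag = word / num_sets as u64;
        if let Some(way) = sets[set].iter().position(|s| s.valid && s.tag == tag) {
            hits += 1;
            sets[set][way].ts = now;             // refresh LRU timestamp
        } else {
            // Miss: fill an invalid way, or evict the least recently used one
            let victim = sets[set]
                .iter()
                .enumerate()
                .min_by_key(|(_, s)| if s.valid { s.ts } else { 0 })
                .map(|(i, _)| i)
                .unwrap();
            sets[set][victim] = Slot { valid: true, tag, ts: now };
        }
    }
    hits
}

fn main() {
    let trace = [0u64, 16, 0, 16, 0];            // byte addresses from the traces above
    println!("direct-mapped (2 slots): {}/5 hits", run(&trace, 2, 1)); // 0/5
    println!("2-way (1 set of 2):      {}/5 hits", run(&trace, 1, 2)); // 3/5
}
```

Total capacity is identical in both runs (two slots); only the placement rule changes, and the hit count goes from 0/5 to 3/5.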
--- ## 2-Way Set-Associative Trace 4-slot, 2-way (2 sets), LRU. Word addrs: 0, 8, 0, 6, 8 | # | Addr | Set | Tag | Result | Set 0 Way0 | Set 0 Way1 | |---|------|-----|-----|--------|:---:|:---:| | 1 | 0 | 0 | 0 | MISS | (0, ts=1) | - | | 2 | 8 | 0 | 4 | MISS | (0, ts=1) | (4, ts=2) | | 3 | 0 | 0 | 0 | **HIT** | (0, ts=3) | (4, ts=2) | | 4 | 6 | 0 | 3 | MISS | (0, ts=3) | (3, ts=4) | | 5 | 8 | 0 | 4 | MISS | (4, ts=5) | (3, ts=4) | Hits: 1, Misses: 4, Hit rate: 20% --- ## Address Bit Calculation ### General Formula - **Byte Offset bits** = log2(block_size_bytes) - **Index bits** = log2(num_slots) or log2(num_sets) - **Tag bits** = address_bits - index_bits - offset_bits ### Example: 16-slot, 16-byte blocks, 64-bit ``` Byte Offset = log2(16) = 4 bits Index = log2(16) = 4 bits Tag = 64 - 4 - 4 = 56 bits ``` ### Example: 8-way, 64 slots, 4-word blocks, 64-bit ``` num_sets = 64 / 8 = 8 Byte Offset = log2(4) = 2 bits Block Index = log2(4) = 2 bits Set Index = log2(8) = 3 bits Tag = 64 - 2 - 2 - 3 = 57 bits ``` --- ## Project03: Cache Simulator ```rust struct Cache { slots: Vec
<CacheSlot>,
    num_slots: usize,
    block_size: usize, // Block size in words
    num_ways: usize,   // 1 = direct-mapped
    num_refs: u64,
    num_hits: u64,
    num_misses: u64,
}
```
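One piece the simulator needs is the address-bit split implied by its configuration. Here is a hypothetical helper (not part of any required interface) that computes it the same way as the Address Bit Calculation slide, assuming power-of-two sizes and 4-byte words:

```rust
/// Returns (tag_bits, set_index_bits, block_index_bits, byte_offset_bits)
/// for a 64-bit address. Hypothetical helper; assumes power-of-two sizes.
fn address_bits(num_slots: u32, block_size_words: u32, num_ways: u32) -> (u32, u32, u32, u32) {
    let num_sets = num_slots / num_ways;
    let byte_offset_bits = 2;                        // 4 bytes per word
    let block_index_bits = block_size_words.ilog2(); // which word within the block
    let set_index_bits = num_sets.ilog2();
    let tag_bits = 64 - set_index_bits - block_index_bits - byte_offset_bits;
    (tag_bits, set_index_bits, block_index_bits, byte_offset_bits)
}

fn main() {
    // The 8-way example from the Address Bit Calculation slide:
    // 64 slots, 4-word blocks, 8 ways -> 8 sets -> (57, 3, 2, 2)
    println!("{:?}", address_bits(64, 4, 8));
}
```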
```mermaid
flowchart LR
    A["Fetch"] --> B["Check I-Cache"]
    B --> C{"Hit?"}
    C -->|Yes| D["Count hit"]
    C -->|No| E["Count miss<br>Update cache"]
    D --> F["Execute"]
    E --> F
    F --> A
```
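As a self-contained toy version of that loop, the sketch below fetches 64 sequential instructions (PC, PC+4, ...) through a tiny direct-mapped I-cache with 4-word blocks; the cache size, block size, and starting PC are arbitrary choices for illustration, not Project03 parameters:

```rust
fn main() {
    const NUM_SLOTS: usize = 4;
    let mut valid = [false; NUM_SLOTS];
    let mut tags = [0u64; NUM_SLOTS];
    let (mut hits, mut refs) = (0u64, 0u64);

    let mut pc: u64 = 0x1000;
    for _ in 0..64 {
        refs += 1;
        // 4-word (16-byte) blocks: bits [3:0] are the offset, the next 2 bits the index
        let index = ((pc >> 4) as usize) % NUM_SLOTS;
        let tag = pc >> 6;
        if valid[index] && tags[index] == tag {
            hits += 1;                // "Count hit"
        } else {
            valid[index] = true;      // "Count miss / Update cache"
            tags[index] = tag;
        }
        pc += 4;                      // sequential execution: the next instruction
    }
    println!("I-cache hit rate: {}/{} = {:.0}%",
             hits, refs, 100.0 * hits as f64 / refs as f64);
}
```

Only the first word of each block misses and the next three fetches hit, so purely sequential code already yields a 75% hit rate here — instruction fetch spatial locality at work.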
--- ## Key Takeaways - **Memory hierarchy** bridges the processor-memory speed gap - **Locality** (temporal + spatial) is why caches achieve > 95% hit rates - **Direct-mapped**: simple, one slot per address, suffers conflict misses - **Fully-associative**: any slot, no conflicts, expensive - **Set-associative**: compromise — one set, multiple ways - **LRU replacement**: evict least recently used (timestamps) - **Block size**: larger blocks exploit spatial locality - **AMAT** = hit_time + (miss_rate × miss_penalty) --- ## Next Lecture - Cache simulation implementation in the emulator - Dynamic analysis (instruction counting, branch statistics) - Project03 details