Cache Memory¶

Overview¶

In Computer Science, a cache is construct that is used to speed up access to data. Often data is stored somewhere, say in memory, on the disk, or in a database. To keep the costs of large storage reasonable, access time to the data is slower than what it could be if a more expensive technology was used for the data storage. A cache is designed to be much smaller than the primary data storage, but uses more expensive technology to make access fast. A cache will keep recently accessed data so that future requests for the data are served quickly. Caches can be very effective if the access pattern to data favors recently requested data items and also data items that are physically near one and other. While we are going to focus on cache memory found in computer processors, caches are widely used in other areas of computer systems, including operating system kernels, databases, and cloud computing. This guide will explain cache memory concepts and look at three types of cache memory structures: direct mapped, fully associative, and set associative.

Cache Memory Concepts¶

Processors need to access data that resides in memory. This memory is sometimes called main memory or RAM. When we execute programs, the memory consists of instructions, global data, heap data, and stack data. A RISC-V processor uses the PC (program counter) register to point to the next instruction in memory to execute. Executing programs use load and store instructions to access data in memory, such as ld, sd, lw, and sw. The purpose of cache memory is to speed up access to main memory by holding recently used data in the cache. A cache can hold either data (called a D-Cache), instructions, (called an I-Cache), or both (called a Unified Cache).

A cache memory will take an address as input and decide if the data associated with the address is in the cache. If the data is in the cache, we call this a hit, and the data can be returned immediately without going to main memory. If the data associated with the address is not in the cache, we call this a miss. Each time the processor requests data from the cache, we call this a memory reference, or ref. We can keep track of each of these metrics during the use of a cache. If we do this, we can then compute two important values: the hit rate and the miss rate. We compute these values as follows:

hit_rate = num_of_hits / num_of_refs
miss_rate = num_of_misses / num_of_refs

and

hit_rate = 1 - (miss_rate)
miss_rate =  1 - (hit_rate)

Our goal is to build caches that achieve high hit rates. In fact, cache memories are wildly successful and can achieve extremely high hit rates in practice, e.g., greater than 97% for many types of programs. The reason for this is that executing code adheres to principles of locality: the principle of temporal locality and the principle of spatial locality. The principle of temporal locality states that if a data element is accessed, there is a good chance it will be accessed again in the near future. Think about an array index (i). If we have a loop with many iterations, the array index will need to be accessed on each iteration. Similarly, if we call a function in the loop, the code for that function will be accessed each time through the loop. The same can be said for recursive functions. The principle of spatial locality states that if a data element is accessed, there is a good chance a close neighbor of the data element will be access in the near future. Traversing an array in a loop is a good example. If we access array[i] there is a good chance will access array[i+1]. Code also exhibits a high degree of spatial locality. If we execute an instruction at the current PC, there is a good chance we will execute instruction at PC + 4 (the next instruction) unless the instruction at PC is a branch or jump. Spatial locality suggests that when we access a data element we should also load some of that data elements neighbors into the cache at the same time reduce the latency of going to main memory for smaller data element access.

Cache Mechanisms¶

In order for a cache memory to operate properly it needs to support some basic mechanisms. At a high level, a cache memory will receive a memory address request (memory reference) from the processor. It needs to determine if the data associated with the memory address is in the cache (a hit). If so, the cache needs to respond with the data requested. If the data is not in the cache (a miss) the cache needs to retrieve the data from main memory, populate the cache with this data, and also return the data back to the processor. When the cache retrieves data from memory it needs to find a place in the cache for this data. In many cases, since the cache is much smaller than main memory, the cache will need to remove some existing data in order to bring the new data into the cache. We call the process of choosing which data to remove the cache replacement policy. When we select data to be removed or overwritten in the cache, we call this eviction.

To support spatial locality, we need to define the size of data that is moved between main memory and the cache when we populate the cache with data. We call this data size the block size. The block size will be typically be defined in terms of number of bytes or words.

When the processor updates data in the cache, i.e., with sd and sw, we have to decide when the update will be reflected in main memory. We can chose to update main memory at the same time that the update is made in the cache. This technique is called write-through. We can also delay the update to main memory until we need to evict the updated data on a cache miss. This technique is called write-back and is the most common approach because it avoids having to write to memory more than is necessary. However, the write-back approach does introduce challenges for multi-processors and multi-core processors in which each processor has its own cache, but wants to see the most up to data value in memory. To solve this problem, caches need to be coordinated and employ a cache coherency protocol to keep caches in sync with memory updates.

Note that instruction memory is generally read only. That is, once code is loaded into memory, it is not modified for the duration of the executing program. In this case we can assume there will be no updates to the code in memory, and thus no need for write-through or write-back updates.

Addresses and Memory¶

A memory request from the processor to the cache is just an address. In modern computers an address is usually a byte address. This means that each unique address points to a different byte (8-bits) in memory. You can think of memory as an array of bytes:

let mem: [u8; N];

Where N is the size of memory and the index into the array is an address. However, typically, a cache does not hold individual bytes, but instead, groups of bytes. Recall from the discussion above, to take advantage of spatial locality, a cache will define a block size so that when we move data from memory to the cache we do it a block at a time, where a block will be multiple bytes or words.

We will start by considering caches whose block size is one word (32 bit or 4 bytes). In this way, we will assume that the processor will only generate requests for word values. We will also assume that when the processor requests a word from memory that the address for the word will be word-aligned. That is, the address will be a multiple of 4. This is generally the case for most word data like instruction words (32 bits) and int values (32 bits). With these assumptions we can view memory as a sequence of words, where each word occupies four bytes.

For example, word 0 occupies mem[0], mem[1], mem[2], mem[3]. Then word 1 occupies mem[4], mem[5], mem[6], mem[7]. And so on. If we have a byte address, we can get the word address like this:

let word_addr = addr / 4;

To get a byte address from the word address we can do:

let addr = word_addr * 4;

Direct-Mapped Caches¶

A direct-mapped cache is considered the simplest form of a cache memory because every memory address maps to a specific slot in the cache. This makes the replacement policy very simple. If the data associated with an address is not in the cache, then you evict the data in the slot that the address maps to.

The most important aspect of a direct mapped cache is the mapping function. Let's consider a direct mapped cache that can hold 4 words of data:

let slots: [u32; 4];

In this way, we are saying that our cache holds 4 words of data. We refer to a position in the cache as a cache slot. In order to map an address to a cache slot, we need to first get the word address then compute the following:

let slot_index = word_addr % 4;

This is a very basic hash function that uses modulus (remainder) of the word address divided by the cache size to determine the slot in the cache. Using this mapping function the following word addresses all map to the same slot, slot 0: 0, 4, 8, 12, 16, etc. Similarly the following word addresses all map to slot 1: 1, 5, 9, 13, 17, etc.

Using the modulo operator is one way to get the slot index based on the cache size. However, we can also use bit manipulation to get the same value:

let word_addr = addr >> 2;

We know that a right shift by 1 is the same as dividing by 2, so a right shift by 2 is the same as dividing by 4. Now let's consider the slot_index. We know we can use modulus (%) to get the remainder, but another way to get the remainder is to look at the lowest bits in the word address to get the slot_index like this:

let slot_index = word_addr & 0b11;

Why does this work? Take a look at the binary form of the following word addresses in both decimal and binary form (we only show the bottom 5 bits of each word address):

dec bin
 0  00000 <-- slot 0
 1  00001 <-- slot 1
 2  00010 <-- slot 2
 3  00011 <-- slot 3
 4  00100 <-- slot 0
 5  00101 <-- slot 1
 6  00110 <-- slot 2
 7  00111 <-- slot 3
 8  01000 <-- slot 0
 9  01001 <-- slot 1
10  01010 <-- slot 2

Notice that the last two bits of each word address is exactly the same as word_addr % 4. So, we can just mask off those last two bits to get the slot_index. If we increase the number of slots in our cache, this technique still works. For example, if we increase our cache to hold 8 slots, we can now get the slot_index using word_addr % 8 or we can mask off the last 3 bits (2^3 = 8) to get the slot index. This holds for all cache sizes that are a power of 2, which is usually the case.

Now we have a way to map an address to a cache slot index. Our next challenge is to figure out if the address and data we are looking for is actually in the cache. Since multiple addresses map to the same cache slot, we need a way to know the requested address and data are in the cache. To do this we need to store additional information in the cache slot. So now, we can view a cache slot as a struct:

struct CacheSlot {
    addr: u64,
    block: [u32; 1],
}

let slots: [CacheSlot; N];

That is, we associate the requesting address with the 32-bit word stored in block[0]. This technically works, but we don't need the entire byte address to determine if we have a hit. Look at the word addresses that map to slot 0 from above:

dec bin
 0  00000 <-- slot 0
 4  00100 <-- slot 0
 8  01000 <-- slot 0
12  01100 <-- slot 0

Look at the upper 3 bits for each of these word addresses; 000, 001, 010, 011. They are all unique. While we are only showing 5 bits, all the word addresses that map to slot 0 will have unique values for the upper bits. We call the unique upper bits the tag.

We can now view a byte address in the following way for a 64-bit byte address:

tag[63:4] slot_index[3:2] byte_offset[1:0]

The byte_offset is just the lower two bits that we lose when we compute the word address. If we wanted to access a specific byte from a word in the cache, we would need to use the byte_offset.

In addition to the block data and tag, we need one other piece of information in every cache slot. Typically when we power up a computer or start a new program, the cache does not have valid data. On power up, the computer zeros out all values in the cache (and other memory elements as well). With the cache zeroed out and without any memory requests, there is no valid data in the cache. However a tag of value 0 and data of value 0 are both legitimate values, so without something else, the cache would return the value zero for a request with tag equal to 0. To accommodate this initialization problem we introduce a valid field to each cache slot. So now, our cache slot struct looks like this:

struct CacheSlot {
    tag: u64,
    block: [u32; 1],
    valid: bool,
}

let slots: [CacheSlot; N];

If valid is false, the cache slot does not hold valid data. A request that maps to this slot will result in a miss and a subsequent memory access. If valid is true, then we need to compare the tag of the memory request address to the tag in the cache slot. If they are equal, then we have a cache hit.

We also distinguish between two types of misses. A cold miss occurs when the slot has never been used (valid is false). A hot miss occurs when the slot is valid but contains data for a different address (the tag does not match). This distinction is useful for understanding cache behavior during program execution.

Here is the pseudo code for handling an addr request to a 4 slot direct-mapped cache:

let addr_tag = addr >> 4;  // remove byte_offset (2) and slot_index bits (2)
let index_mask: u64 = 0b11;
let slot_index = ((addr >> 2) & index_mask) as usize;
let slot = &mut slots[slot_index];

if slot.valid && slot.tag == addr_tag {
    // hit
    slot.block[0]
} else {
    // miss (cold if !slot.valid, hot if slot.valid)
    slot.block[0] = mem_read_word(addr);
    slot.tag = addr_tag;
    slot.valid = true;
    slot.block[0]
}

Fully-Associative Caches¶

Before describing set-associative caches, it is useful to first understand how fully-associative caches work. In a fully-associative cache, an address can map to any cache slot, not just one like in the direct-mapped cache. In a direct-mapped cache it is possible to have cache slots that go unused because no addresses mapped to the unused slots during program execution. This usually does not happen for small caches, but can happen as the size of the cache increases. To avoid a situation in which we have unused cache slots a fully associative cache is the most flexible. However, this flexibility comes at a cost. When we get an address request we need to compare the address tag with all the slot tags. In code this requires a loop to check the tag in each cache slot. In hardware this comparison can be done in parallel, but it requires quite a bit of logic to achieve this. While fully-associative caches are too expensive for regular cache memory, they are used in very specific situations, such as in hardware support for virtual memory.

We can use the same slot structure we did for direct mapped caches, but now we add a timestamp field for tracking when each slot was last accessed:

struct CacheSlot {
    tag: u64,
    block: [u32; 1],
    valid: bool,
    timestamp: u64,
}

let slots: [CacheSlot; N];

However, we now view the request address like this:

tag[63:2] byte_offset[1:0]

That is, we no longer need a slot_index because the address can be mapped to any slot in the cache. Here is the pseudo code for handling an addr request to a 4 slot fully-associative cache:

refs += 1;
let addr_tag = addr >> 2;  // remove byte_offset bits (2)

for i in 0..4 {
    let slot = &mut slots[i];
    if slot.valid && slot.tag == addr_tag {
        // hit
        slot.timestamp = refs;
        return slot.block[0];
    }
}

// miss
let slot = find_lru_slot(&mut slots);
slot.block[0] = mem_read_word(addr);
slot.tag = addr_tag;
slot.valid = true;
slot.timestamp = refs;
slot.block[0]

On a miss, we need to find a slot to use for the new request. Because an address can map to any slot in the cache, we should first look for an unused cache slot. We can look for the first slot whose valid field is false, as this means the slot is unused (a cold miss). If all the slots are valid, we need a way to choose a slot to evict (a hot miss). A good choice is to pick the slot that has been least recently used (LRU). In software we could use a linked list to keep track of recently used slots by moving the currently accessed slot to the beginning of the list. Over time the least recently used slot will be at the end of the list. Another option is to associate a timestamp with each slot. This could be a logical timestamp like the number of cache references. This value will always increase. Now, each time we access a cache slot, we can set the slot timestamp to the current number of references. To find the least recently used slot, we look for the slot with the lowest (smallest) timestamp, as this will be the slot accessed least recently compared to all the other slots.

In hardware we can't use a linked list or a full timestamp value. Instead of perfect LRU an approximate form of LRU is used.

The find_lru_slot() function can first look for an unused slot (where valid is false), then use LRU via timestamps to find the slot to evict.

Set Associative Caches¶

Direct-mapped caches are simple, but rigid. A fully-associative cache is flexible, but expensive. A set-associative cache is a compromise between direct and fully associative. The idea is to think about a cache as an array of sets. Now we map an address to a set using a direct mapped approach, but we allow the address to map to any slot in the set. We refer to the number of slots in a set as the number of ways. We refer to a specific set-associative cache as an n-way set associative cache, such as 4-way set associative or 8-way set-associative.

We can still use the same slot structure for our set-associative cache:

struct CacheSlot {
    tag: u64,
    block: [u32; 1],
    valid: bool,
    timestamp: u64,
}

let slots: [CacheSlot; N];

The number of sets in the cache will be:

let num_sets = N / num_ways;

So if we have 4 ways and 32 cache slots, we have 32 / 4 = 8 sets total.

With a set-associative cache, we now need to find the set index given an addr. So our addr now looks like this:

tag[63:5] set_index[4:2] byte_offset[1:0]

Note that the set_index is 3 bits because we have 8 total sets (2^3 = 8). Here is the pseudo code for handling a set-associative lookup for a 4-way cache with 32 cache slots, which is 8 sets total. We logically consider the cache slot array to be groups of 4 slots. That is, the first 4 slots will be set 0, the second 4 slots will be set 1, and so on. In this way, once we find the set index, we can use this value to find the first slot in a set.

refs += 1;
let num_ways = 4;
let addr_tag = addr >> 5;  // remove byte_offset (2) and set_index bits (3)
let set_index = ((addr >> 2) & 0b111) as usize;

// 4 ways per set, so set_index * num_ways gives us the first slot in the set
let set_base = set_index * num_ways;

for i in 0..num_ways {
    let slot = &mut slots[set_base + i];
    if slot.valid && slot.tag == addr_tag {
        // hit
        slot.timestamp = refs;
        return slot.block[0];
    }
}

// miss
let slot = find_lru_slot_in_set(&mut slots, set_base, num_ways);
slot.block[0] = mem_read_word(addr);
slot.tag = addr_tag;
slot.valid = true;
slot.timestamp = refs;
slot.block[0]

The key point with a set-associative cache is that we only need to do a full lookup on the slots within a set (4-way or 8-way for example) and not all the slots in the cache. This gives some flexibility to choose where an address should go within a set. The find_lru_slot_in_set() function should first look for an unused slot in the set (where valid is false, a cold miss), then pick a slot to evict based on LRU for the slots in the set (a hot miss).

Block Size¶

In all the caches described above we assume that each cache slot holds one word (a 32-bit value). However, to take advantage of spatial locality and to reduce the latency cost for transferring smaller amounts of data from memory to the cache, we can have each slot hold multiple words instead of a single word. For a 4 word block size we can define our CacheSlot like this:

struct CacheSlot {
    tag: u64,
    block: [u32; 4],  // 4-word block
    valid: bool,
    timestamp: u64,
}

let slots: [CacheSlot; N];

Similar to how we treat 4 bytes as a word in memory, we now treat 4 words as a block in memory. That is we can view memory logically as an array of blocks where each block consists of 4 words. Consider a 4 slot cache. Now our addresses can be viewed as follows for the different cache types.

Direct Mapped

tag[63:6] slot_index[5:4] block_index[3:2] byte_offset[1:0]

Fully Associative

tag[63:4] block_index[3:2] byte_offset[1:0]

Set Associative (2-way)

tag[63:5] set_index[4] block_index[3:2] byte_offset[1:0]

The block_index selects which word within the block is being accessed. So, we associate a tag with a block of words. When we get a miss we need to find the block the addr is pointing to and load all the words in that block into the cache slot. We can find the base word of a block like this:

let word_addr = addr / 4;
let block_index = word_addr % 4;  // where 4 is the block size in words
let block_base = word_addr - block_index;

Now you can load 4 words from memory starting at the block_base. Note that block_base is a word address and you will need to convert it to a byte address so you can dereference the address properly. Alternatively, you can perform similar calculations using byte addresses or use bitwise operators to get the same values.