OS Page Tables

How the MMU and the kernel collaborate to give each process its own private address space, and how Octox implements that machinery on RISC-V Sv39. We start with a hypothetical 32-bit RISC architecture so the arithmetic stays small, motivate multi-level page tables from a single example, then descend into the real Octox code — the walker, the mappers, and the satp swaps that bracket every trap.

Overview

Virtual memory does three jobs:

  1. Isolation. User A cannot read user B's pages.
  2. Translation. A process's 0x1000 can be different physical memory from another process's 0x1000.
  3. Abstraction. Sparse address spaces become possible — malloc can hand back 0x5_0000_0000 without physically reserving 20 GB of empty memory.

The data structure that makes all three work is the page table. Hardware (the MMU) walks it on every memory access; software (the OS) builds and edits it. Today we build the mental model on a small 32-bit example, then read the Octox code that implements the real RV64 Sv39 scheme.

Learning Objectives

After this lecture you should be able to:

  • Decompose a virtual address into page-table indices and a page offset, for both a hypothetical 32-bit architecture and RISC-V Sv39.
  • Explain why a single-level page table is wasteful for a 32-bit address space — and exactly how a two-level scheme reduces the per-process page-table footprint to kilobytes for sparse processes.
  • State what the TLB caches and what sfence.vma does.
  • Decode the layout of a RISC-V Sv39 PTE (PPN bits, flag bits) and the satp register.
  • Read Octox's walk() and trace the three-level descent it performs.
  • Describe how mappages(), Uvm::create(), and Uvm::copy() are composed to implement fork, exec, and the trampoline / trapframe setup.
  • Explain why the trampoline page is mapped at the same virtual address in every page table.

Prerequisites

  • OS Kernel: Inside Octox — yesterday's tour of privilege modes, the trapframe, and the syscall round trip. We rely on the same usertrap / userret skeleton.
  • Octox Guide §4.4 (Sv39 page tables) and §4.2 (memory layout).
  • RISC-V assembly lectures — CSRs (satp, sfence.vma).

1. Why Virtual Memory

Every modern OS gives each process the illusion of owning the entire address space. Two processes can both store something at 0x1000 and neither knows the other exists. The MMU, sitting between every load/store instruction and physical RAM, performs address translation — mapping each virtual address (VA) the program emits to the physical address (PA) where the bytes actually live.

flowchart LR
    CPU["CPU<br/>load/store VA"] --> MMU["MMU<br/>VA → PA"]
    MMU --> RAM["physical RAM"]
    PT["page table<br/>(per process)"] -.-> MMU

The page table is the per-process data structure that drives the translation. The OS builds it; the MMU consults it. To make this practical, memory is divided into fixed-size pages — 4 KB on RISC-V. A virtual address is split into

   [   virtual page number (VPN)   ][ page offset ]

Translation maps the VPN to a physical page number (PPN); the offset bits pass through unchanged. So we only ever need to translate page numbers, not full addresses.
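The split-and-pass-through rule is worth one concrete sketch. This is standalone toy Rust, not Octox code; the 12-bit PGSHIFT matches the 4 KB pages above, and the identity ppn_of closure is made up for illustration:

```rust
// Toy VA → PA translation: split off the offset, map only the VPN.
const PGSHIFT: u32 = 12; // 4 KB pages → 12-bit offset

// Split a VA into (virtual page number, page offset).
fn split(va: u32) -> (u32, u32) {
    (va >> PGSHIFT, va & ((1 << PGSHIFT) - 1))
}

// Translate: look up the PPN for the VPN, glue the offset back on.
fn translate(va: u32, ppn_of: impl Fn(u32) -> u32) -> u32 {
    let (vpn, off) = split(va);
    (ppn_of(vpn) << PGSHIFT) | off // offset passes through unchanged
}

fn main() {
    assert_eq!(split(0x1234_5678), (0x12345, 0x678));
    // With an identity mapping, PA == VA.
    assert_eq!(translate(0x1234_5678, |vpn| vpn), 0x1234_5678);
}
```

Note that `translate` only ever maps page numbers; the low 12 bits never enter the lookup.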

Why 4 KB?

Big enough that the per-page bookkeeping (one PTE) is small relative to the data; small enough that internal fragmentation inside a page is bearable. RISC-V also supports 2 MB and 1 GB "huge pages" by treating an entry at level 1 or 2 as a leaf, but Octox uses only 4 KB pages.


2. A Hypothetical 32-bit RISC: Single-Level Page Tables

Let's imagine a simple 32-bit RISC ISA with 4 KB pages. The arithmetic is small enough to do in your head.

  • Virtual address: 32 bits.
  • Page size: 4 KB = 2^12 bytes → 12-bit offset.
  • Page number: 32 − 12 = 20 bits.
  31                                12 11               0
  +------------------------------------+----------------+
  |        VPN (20 bits)               |  offset (12)   |
  +------------------------------------+----------------+

A single-level page table is just a flat array indexed by VPN. With 20-bit VPNs that's

\[ 2^{20} \text{ entries} = 1{,}048{,}576 \text{ PTEs} \]

If each PTE is 4 bytes (32 bits, enough for a 20-bit PPN plus some flag bits), the table is

\[ 2^{20} \times 2^2 = 2^{22} \text{ bytes} = 4 \text{ MB} \]

Per process. Even a tiny program that uses 8 KB of code and 4 KB of stack pays a 4 MB tax in page-table memory. Run a thousand processes and that is 4 GB of page tables alone — for an architecture whose entire physical address space is 4 GB.

   single-level PT
   +--------------+   PTE 0
   |              |
   +--------------+   PTE 1
   |              |
   +--------------+
   |     ...      |
   +--------------+   PTE (2^20 - 1)
   |              |
   +--------------+
   <-- 4 MB total, per process -->

The waste is structural: the table has an entry for every possible page number, even pages the process never touches. We need a representation that lets us skip over unused regions.


3. A Hypothetical 32-bit RISC: Two-Level Page Tables

Split the 20-bit VPN into two 10-bit indices:

  31                  22 21                12 11               0
  +---------------------+--------------------+----------------+
  |   L1 index (10)     |    L0 index (10)   |  offset (12)   |
  +---------------------+--------------------+----------------+

Now we have a two-level trie:

  • The root (level 1) page table has 2^10 = 1024 entries.
  • Each entry points to a leaf (level 0) page table, also with 1024 entries.
  • Each leaf entry holds the PPN of an actual data page (or is invalid).

With 4-byte PTEs:

   Table   Entries   Bytes
   Root    1024      4 KB
   Leaf    1024      4 KB

flowchart LR
    VA["VA: [L1=4][L0=2][off]"] --> ROOT
    ROOT["Root PT<br/>1024 entries"] -->|"L1=4"| LEAF["Leaf PT 4<br/>1024 entries"]
    ROOT -.->|"L1=0..3, 5..1023<br/>(NULL)"| X["(no leaf<br/>allocated)"]
    LEAF -->|"L0=2"| PAGE["data page<br/>(4 KB)"]

The savings. The root is mandatory. But leaf tables only exist for regions the process actually uses. A process whose virtual layout is

  • 1 MB of code at 0x0000_0000,
  • 64 KB of stack at the top of memory near 0xFFFF_0000,

uses two regions, each touching one or two L1 entries. So:

   Component          Size
   Root PT            4 KB
   1 leaf for code    4 KB
   1 leaf for stack   4 KB
   Total              12 KB (vs. 4 MB flat)

A factor of 300+ smaller, for the same translation power. The cost is that the MMU now does two memory loads per translation instead of one — which is exactly what the TLB will fix.
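The footprint arithmetic above is easy to check mechanically. A standalone sketch in plain Rust, with the numbers taken from the text:

```rust
// Page-table footprint for the 32-bit toy ISA: flat vs. two-level.
const PTE_SIZE: usize = 4;

// Flat: one PTE per possible 20-bit page number.
fn flat_bytes() -> usize {
    (1 << 20) * PTE_SIZE
}

// Two-level: the mandatory root plus one 1024-entry leaf table
// per region of the address space the process actually uses.
fn two_level_bytes(leaf_tables: usize) -> usize {
    (1 + leaf_tables) * (1 << 10) * PTE_SIZE
}

fn main() {
    assert_eq!(flat_bytes(), 4 << 20);        // 4 MB per process
    assert_eq!(two_level_bytes(2), 12 << 10); // root + code + stack = 12 KB
    assert!(flat_bytes() / two_level_bytes(2) > 300); // the "factor of 300+"
}
```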

The trie idea

Multi-level page tables are tries indexed by chunks of the page number. The level-1 index picks a sub-trie; the level-0 index picks an entry inside it. A NULL (invalid) entry at any level means "this whole sub-region is unmapped" — no leaf table needs to exist.


4. The TLB

A two-level page table is small, but every load and store now needs two extra memory accesses just to translate the address. That would be a disaster.

The fix is a hardware cache: the Translation Lookaside Buffer (TLB). The MMU keeps a small table of recently-used VPN → PPN mappings. Most memory accesses hit the TLB and skip the page-table walk entirely.

  • TLB hit rates in real workloads are typically > 99%.
  • A miss triggers a page-table walk (in hardware on RISC-V), and the resulting translation is inserted into the TLB.
  • The OS owns the page tables, so when it changes a PTE, it must tell the MMU to throw out stale TLB entries: sfence.vma.

sfence.vma is not optional

Forgetting sfence.vma after a PTE edit means the TLB will keep serving the old translation. The bug only shows up when a translation that was cached gets reused after the OS changed it — classic "works most of the time" hazard.

We will see Octox bracket every csrw satp with sfence.vma in §15.


5. From the Toy Architecture to RISC-V Sv39

A flat 64-bit page table would need 2^52 entries — petabytes. Even a two-level scheme is far too big. RISC-V's solution for RV64 is Sv39: 39-bit virtual addresses, three levels of page tables, 9-bit indices.

  38       30 29       21 20       12 11             0
  +----------+-----------+-----------+----------------+
  | L2 (9)   |   L1 (9)  |   L0 (9)  |  offset (12)   |
  +----------+-----------+-----------+----------------+

  bits 63..39 must equal bit 38 (sign-extended canonical addresses)
  • Page size: 4 KB. 12-bit offset.
  • Three levels. Each level uses 9 bits to index 2^9 = 512 entries.
  • PTE size: 8 bytes on RV64. So one page-table page is exactly 512 × 8 = 4096 bytes — one 4 KB page.
  • Total reach: 2^39 bytes = 512 GB of virtual address space.
flowchart LR
    VA["39-bit VA"] --> L2T["L2 PT<br/>512 entries<br/>4 KB"]
    L2T -->|"px(2)"| L1T["L1 PT<br/>4 KB"]
    L1T -->|"px(1)"| L0T["L0 PT<br/>4 KB"]
    L0T -->|"px(0)"| PAGE["4 KB data page"]

The same trie idea as our toy architecture, just deeper.

The Sv39 PTE layout

 63       54 53                                        10 9    8 7 6 5 4 3 2 1 0
  +----------+-------------------------------------------+----+--+-+-+-+-+-+-+-+
  | reserved |              PPN (44 bits)                | RSW | D|A|G|U|X|W|R|V|
  +----------+-------------------------------------------+----+--+-+-+-+-+-+-+-+
   Bit   Name   Meaning
   0     V      Valid — entry is in use
   1     R      Readable
   2     W      Writable
   3     X      Executable
   4     U      User-mode accessible
   5     G      Global (shared across address spaces)
   6     A      Accessed (set by hardware)
   7     D      Dirty (set by hardware)

Leaf vs. interior PTEs. An interior PTE (level 2 or 1, pointing to the next-level table) sets only V. A leaf PTE additionally sets at least one of R, W, X, marking it as the final mapping for a page. Octox encodes this cleanly:

pub fn is_leaf(&self) -> bool {
    (self.0 & (PTE_R | PTE_W | PTE_X)) != 0
}

(src/kernel/vm.rs:247-283).

The flag constants live in src/kernel/riscv.rs:616-622:

pub mod pteflags {
    pub const PTE_V: usize = 1 << 0; // valid
    pub const PTE_R: usize = 1 << 1;
    pub const PTE_W: usize = 1 << 2;
    pub const PTE_X: usize = 1 << 3;
    pub const PTE_U: usize = 1 << 4; // user can access
}
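A standalone check of the leaf-vs-interior rule, using the same flag constants; the sample PTE values are made up:

```rust
// Interior PTEs set only V; leaf PTEs also set at least one of R, W, X.
const PTE_V: usize = 1 << 0;
const PTE_R: usize = 1 << 1;
const PTE_W: usize = 1 << 2;
const PTE_X: usize = 1 << 3;

fn is_leaf(pte: usize) -> bool {
    pte & (PTE_R | PTE_W | PTE_X) != 0
}

fn main() {
    let interior = PTE_V;                  // points at a next-level table
    let code_leaf = PTE_V | PTE_R | PTE_X; // final mapping of a code page
    assert!(!is_leaf(interior));
    assert!(is_leaf(code_leaf));
    assert!(!is_leaf(0)); // an invalid (zero) entry is not a leaf either
}
```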

6. The satp Register

satp ("Supervisor Address Translation and Protection") is the CSR that points the MMU at the current page table. Writing satp switches address spaces; that is exactly what happens on every trap into and out of the kernel.

 63      60 59      44 43                                              0
  +--------+----------+-------------------------------------------------+
  |  MODE  |   ASID   |                  PPN of root PT                 |
  +--------+----------+-------------------------------------------------+
   4 bits     16 bits                       44 bits
  • MODE: Sv39 = 8. (Bare = 0 disables translation entirely.)
  • ASID: address-space identifier, used by the TLB to tag entries.
  • PPN: the physical page number of the root page table. The MMU starts every walk here.

Octox wraps this in a typed module (src/kernel/riscv.rs:357-408):

pub mod satp {
    pub enum Mode {
        Bare = 0, Sv39 = 8, Sv48 = 9, Sv57 = 10, Sv64 = 11,
    }

    pub unsafe fn write(bits: usize) {
        asm!("csrw satp, {}", in(reg) bits);
    }

    pub fn make(mode: Mode, asid: usize, ppn: usize) -> usize {
        let mut bits: usize = 0;
        bits |= (mode as usize) << 60;
        bits |= asid << 44;
        bits |= ppn >> 12;
        bits
    }
}

satp::make(Sv39, 0, ptr_to_root) builds the usize you write to the CSR. Note the ppn >> 12: ppn here is a byte address that gets shifted into a page number.
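To see the field packing concretely, here is satp::make rebuilt as a free function plus the inverse decoding. The root-table address is hypothetical; any page-aligned PA works:

```rust
// Pack MODE / ASID / PPN into a satp value, then take it apart again.
fn make(mode: u64, asid: u64, root_pa: u64) -> u64 {
    (mode << 60) | (asid << 44) | (root_pa >> 12) // byte address → PPN
}

fn main() {
    let satp = make(8 /* Sv39 */, 0, 0x8020_0000); // hypothetical root PA
    assert_eq!(satp >> 60, 8);                     // MODE: bits 63..60
    assert_eq!((satp >> 44) & 0xFFFF, 0);          // ASID: bits 59..44
    let ppn = satp & ((1 << 44) - 1);              // PPN: bits 43..0
    assert_eq!(ppn << 12, 0x8020_0000);            // back to the byte address
}
```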


7. Octox: Typed Addresses

A whole class of OS bugs comes from accidentally treating a user pointer as a kernel pointer or vice versa. Octox catches these at compile time with newtypes (src/kernel/vm.rs:24-42):

#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
#[repr(transparent)]
pub struct PAddr(usize);

#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
#[repr(transparent)]
pub struct KVAddr(pub usize);

#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Default)]
#[repr(transparent)]
pub struct UVAddr(usize);

#[repr(transparent)] makes them zero-cost — identical layout to usize. The VAddr trait (src/kernel/vm.rs:99-108) gives every virtual address kind a px(level) method that extracts the 9-bit page-table index for that level:

pub trait VAddr: Addr + Debug {
    const PXMASK: usize = 0x1FF; // 9 bits
    fn px(&self, level: usize) -> usize;
    const MAXVA: usize = 1 << (9 + 9 + 9 + 12 - 1);
}

And the implementation (src/kernel/vm.rs:159-174):

fn px(&self, level: usize) -> usize {
    (self.0 >> (PGSHIFT + 9 * level)) & Self::PXMASK
}

So va.px(2) is bits 38..30, va.px(1) is bits 29..21, va.px(0) is bits 20..12 — exactly the three Sv39 indices.
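You can replicate px() outside the kernel and check those bit ranges. One instructive input is the trampoline VA, MAXVA - PGSIZE: its level-2 index is 255, because Octox's MAXVA is 2^38 rather than the full 2^39 reach of Sv39:

```rust
// A standalone px(): extract the 9-bit page-table index for a level.
const PGSHIFT: usize = 12;

fn px(va: usize, level: usize) -> usize {
    (va >> (PGSHIFT + 9 * level)) & 0x1FF
}

fn main() {
    let trampoline = (1usize << 38) - 4096; // MAXVA - PGSIZE = 0x3F_FFFF_F000
    assert_eq!(px(trampoline, 2), 255); // bits 38..30 (bit 38 is 0 here)
    assert_eq!(px(trampoline, 1), 511); // bits 29..21
    assert_eq!(px(trampoline, 0), 511); // bits 20..12
    assert_eq!(px(0x0000_1000, 0), 1);  // second page of the address space
}
```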

The PageTable itself is generic over the address kind (src/kernel/vm.rs:286-289):

pub struct PageTable<V: VAddr> {
    ptr: *mut RawPageTable,
    _marker: PhantomData<V>,
}

A PageTable<UVAddr> will not let you walk it with a KVAddr, and vice versa. The compiler refuses to mix them up.


8. Octox: RawPageTable and PageTableEntry

A page-table page is a 4 KB array of 512 PTEs (src/kernel/vm.rs:214-218):

#[repr(C, align(4096))]
struct RawPageTable {
    entries: [PageTableEntry; 512],
}

#[repr(align(4096))] guarantees the whole structure is page-aligned — the MMU requires it.

A PTE is a single usize carrying both the PPN and the flag bits (src/kernel/vm.rs:247-283):

#[repr(transparent)]
pub struct PageTableEntry(usize);

impl PageTableEntry {
    pub fn is_v(&self) -> bool         { self.0 & PTE_V != 0 }
    pub fn is_leaf(&self) -> bool      { (self.0 & (PTE_R|PTE_W|PTE_X)) != 0 }
    pub fn flags(&self) -> usize       { self.0 & 0x3FF }
    pub fn to_pa(&self) -> PAddr       { ((self.0 >> 10) << 12).into() }
    pub fn set(&mut self, pa: usize, attr: usize) {
        self.0 = ((pa >> 12) << 10) | attr;
    }
}

Note the bit gymnastics in to_pa/set:

  • A PTE stores the PPN in bits 53..10. (pa >> 12) << 10 packs a byte-address pa into that field.
  • to_pa() reverses it: (self.0 >> 10) << 12 turns the PPN back into a byte address.
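The packing is easy to get wrong by one shift, so here is a round-trip check with set/to_pa/flags transcribed as free functions; the physical address is made up and page-aligned:

```rust
// PTE packing: PPN lives in bits 53..10, flags in bits 9..0.
const PTE_V: usize = 1 << 0;
const PTE_R: usize = 1 << 1;

fn set(pa: usize, attr: usize) -> usize { ((pa >> 12) << 10) | attr }
fn to_pa(pte: usize) -> usize { (pte >> 10) << 12 }
fn flags(pte: usize) -> usize { pte & 0x3FF }

fn main() {
    let pa = 0x8010_2000; // hypothetical page-aligned physical address
    let pte = set(pa, PTE_R | PTE_V);
    assert_eq!(to_pa(pte), pa);            // exact round-trip for aligned PAs
    assert_eq!(flags(pte), PTE_R | PTE_V); // flags untouched by the PPN shift
}
```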

9. Octox: walk() — Three Levels in 30 Lines

walk() is the heart of every page-table operation in Octox (src/kernel/vm.rs:310-343):

// The risc-v Sv39 scheme has three levels of page-table pages.
// A page-table page contains 512 64-bit PTEs.
// A 64-bit virtual address is split into five fields:
//   39..63 -- must be zero.
//   30..38 -- 9 bits of level-2 index.
//   21..29 -- 9 bits of level-1 index.
//   12..20 -- 9 bits of level-0 index.
//    0..11 -- 12 bits of byte offset within the page.
pub fn walk(&mut self, va: V, alloc: bool) -> Option<&mut PageTableEntry> {
    let mut pagetable = self.ptr;
    if va.into_usize() >= V::MAXVA {
        panic!("walk");
    }

    unsafe {
        for level in (1..3).rev() {
            let pte = (*pagetable).get_mut(va.px(level))?;
            if pte.is_v() {
                pagetable = pte.to_pa().into_usize() as *mut RawPageTable;
            } else {
                if !alloc {
                    return None;
                }
                pagetable = RawPageTable::try_new_zeroed()?;
                pte.set(pagetable as usize, PTE_V);
            }
        }
        (*pagetable).get_mut(va.px(0))
    }
}

The descent. Start at the root (self.ptr). For levels 2, then 1:

  1. Compute the index for this level: va.px(level).
  2. Fetch that PTE.
  3. If it's valid, follow it: load the next-level table from its PPN.
  4. If it's invalid and the caller asked us to alloc, allocate a new page-table page (zero-filled), store its PA into the PTE with PTE_V, and follow it.
  5. After the loop, return &mut to the leaf-level PTE at px(0).
sequenceDiagram
    participant W as walk(va, alloc)
    participant L2 as L2 table (root)
    participant L1 as L1 table
    participant L0 as L0 table
    W->>L2: index px(2)
    L2-->>W: PTE_V → PA of L1
    W->>L1: index px(1)
    L1-->>W: PTE_V → PA of L0
    W->>L0: index px(0)
    L0-->>W: &mut leaf PTE

The caller fills in the leaf PTE (with a PA and flags); walk() itself never sets leaf bits.

Option<&mut PTE> is the return

walk returning &mut PTE rather than copying the entry is what lets mappages set it in place. The Option lets the caller distinguish "no such mapping" (when alloc=false) from a found leaf.
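To internalize the descent without unsafe pointer chasing, here is a toy model of the same algorithm in safe Rust: page-table pages live in a Vec, and the "PPN" stored in an interior PTE is just a Vec index. Purely illustrative; the kernel's walk() follows real physical addresses.

```rust
// A Vec-backed model of the three-level walk. Interior PTEs hold a
// table index (stand-in for a PPN) in the bits above 10 and PTE_V in bit 0.
const PTE_V: u64 = 1;

struct Tables(Vec<[u64; 512]>); // Tables.0[0] is the root

impl Tables {
    // Returns (leaf-table index, slot), like walk() returning &mut leaf PTE.
    fn walk(&mut self, va: usize, alloc: bool) -> Option<(usize, usize)> {
        let mut table = 0;
        for level in (1..3).rev() {
            let idx = (va >> (12 + 9 * level)) & 0x1FF;
            let pte = self.0[table][idx];
            table = if pte & PTE_V != 0 {
                (pte >> 10) as usize // valid: follow to the next level
            } else {
                if !alloc {
                    return None; // no mapping and not allowed to build one
                }
                self.0.push([0u64; 512]); // zeroed next-level table
                let new = self.0.len() - 1;
                self.0[table][idx] = ((new as u64) << 10) | PTE_V;
                new
            };
        }
        Some((table, (va >> 12) & 0x1FF))
    }
}

fn main() {
    let mut pt = Tables(vec![[0u64; 512]]); // root only, like PageTable::new
    assert!(pt.walk(0x4000, false).is_none()); // nothing mapped yet
    let leaf = pt.walk(0x4000, true);          // allocates an L1 and an L0
    assert_eq!(leaf, Some((2, 4)));            // px(0) of 0x4000 is 4
    assert_eq!(pt.0.len(), 3);                 // root + L1 + L0
}
```

The alloc=false and alloc=true calls mirror the two ways Octox uses walk(): lookups by Uvm::copy versus installs by mappages.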


10. Octox: mappages() — Installing a VA Range

mappages() is the workhorse for everything that creates mappings (src/kernel/vm.rs:360-385):

pub fn mappages(&mut self, mut va: V, mut pa: PAddr,
                size: usize, perm: usize) -> Result<()> {
    if size == 0 { panic!("mappages: size"); }

    let mut last = va + size - 1;
    va.rounddown();
    last.rounddown();
    loop {
        let pte = self.walk(va, true).ok_or(OutOfMemory)?;
        if pte.is_v() {
            panic!("mappages: remap");
        }
        pte.set(pa.into_usize(), perm | PTE_V);
        if va == last { break; }
        va += PGSIZE;
        pa += PGSIZE;
    }
    Ok(())
}

Algorithm:

  1. Round va and last down to page boundaries.
  2. For each page in the range:
    • walk(va, alloc=true) to find or allocate the leaf PTE.
    • Refuse to remap an already-valid PTE (panic!).
    • Store the PA and permissions in the leaf, with PTE_V.

So mappages builds every mapping in the system: kernel direct maps, user code/data, the trampoline, the trapframe.
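The rounddown-and-step loop determines how many leaf PTEs one call touches. A standalone sketch of just that arithmetic:

```rust
// mappages() maps every page that [va, va + size) overlaps:
// round va and the last byte down, then step by PGSIZE until they meet.
const PGSIZE: usize = 4096;

fn rounddown(a: usize) -> usize { a & !(PGSIZE - 1) }

fn pages_touched(va: usize, size: usize) -> usize {
    assert!(size > 0, "mappages: size"); // same precondition as the kernel
    let last = rounddown(va + size - 1);
    let mut va = rounddown(va);
    let mut n = 1;
    while va != last {
        va += PGSIZE;
        n += 1;
    }
    n
}

fn main() {
    assert_eq!(pages_touched(0x1000, PGSIZE), 1); // one aligned page
    assert_eq!(pages_touched(0x1FFF, 2), 2);      // 2 bytes straddling a boundary
    assert_eq!(pages_touched(0, 3 * PGSIZE), 3);
}
```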


11. Octox: Memory Layout

Constants from src/kernel/memlayout.rs:1-82:

pub const UART0:    usize = 0x1000_0000;
pub const VIRTIO0:  usize = 0x1000_1000;
pub const PLIC:     usize = 0x0C00_0000;
pub const KERNBASE: usize = 0x8000_0000;
pub const PHYSTOP:  usize = KERNBASE + 512 * 1024 * 1024;
pub const TRAMPOLINE: usize = KVAddr::MAXVA - PGSIZE;
pub const TRAPFRAME:  usize = TRAMPOLINE - PGSIZE;
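A quick arithmetic check of where those two top pages land, with the constants reproduced and MAXVA = 2^38 as defined in vm.rs:

```rust
// TRAMPOLINE and TRAPFRAME are the top two pages below MAXVA.
const PGSIZE: usize = 4096;
const MAXVA: usize = 1 << 38;
const TRAMPOLINE: usize = MAXVA - PGSIZE;
const TRAPFRAME: usize = TRAMPOLINE - PGSIZE;

fn main() {
    assert_eq!(TRAMPOLINE, 0x3F_FFFF_F000);
    assert_eq!(TRAPFRAME, 0x3F_FFFF_E000);
    // Both pages share their L2 and L1 indices, so a single L1 table and
    // a single L0 table cover the whole top-of-address-space region.
    assert_eq!((TRAMPOLINE >> 30) & 0x1FF, (TRAPFRAME >> 30) & 0x1FF);
    assert_eq!((TRAMPOLINE >> 21) & 0x1FF, (TRAPFRAME >> 21) & 0x1FF);
}
```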

Every process — user and kernel — lays out its address space the same way at the top:

   MAXVA (= 2^38)  +------------------------+
                   |  TRAMPOLINE (4 KB)     |  same VA in EVERY PT
   MAXVA - PGSIZE  +------------------------+
                   |  TRAPFRAME  (4 KB)     |  per-process register save
                   +------------------------+
                   |  ...                   |

Kernel address space below those two pages: kernel code and data direct-mapped at KERNBASE, MMIO regions for UART/virtio/PLIC direct-mapped at their physical addresses, kernel stacks (one per process) sandwiched between guard pages.

User address space below those two pages: text at 0, then data, heap, stack — all with PTE_U set.

   user space              kernel space
   +-----------+           +-----------+
   | TRAMPOL.  |           | TRAMPOL.  |    <- same VA, same PA, in BOTH
   +-----------+           +-----------+
   | TRAPFRAME |           | TRAPFRAME |    <- different physical pages,
   +-----------+           +-----------+       same VA, mapped only here
   |    ...    |           | kstack[i] |
   |  stack    |           |  guard    |
   |   ↑       |           | kstack[j] |
   |           |           |  guard    |
   |   ↓       |           |    ...    |
   |  heap     |           |  data     |
   |  data     |           |  text     |
   |  text     |           +-----------+ KERNBASE
   +-----------+ 0

The trampoline is the bridge that lets us execute across a csrw satp write — we'll see why in §15.


12. Octox: Building the Kernel Page Table

kinit() builds the singleton kernel page table; kinithart() turns on paging by writing satp. From src/kernel/vm.rs:698-714:

pub fn kinit() {
    unsafe {
        KVM.set(Kvm::new().unwrap()).unwrap();
        KVM.get_mut().unwrap().make();
    }
}

pub fn kinithart() {
    unsafe {
        satp::write(KVM.get().unwrap().as_satp());
        sfence_vma();
    }
}

Kvm::make() does the heavy lifting (src/kernel/vm.rs:658-694). Each self.map(...) is a thin wrapper around mappages:

unsafe fn make(&mut self) {
    self.map(UART0.into(),   UART0.into(),   PGSIZE, PTE_R|PTE_W);
    self.map(VIRTIO0.into(), VIRTIO0.into(), PGSIZE, PTE_R|PTE_W);
    self.map(PLIC.into(),    PLIC.into(),    0x40_0000, PTE_R|PTE_W);

    // kernel text: read + execute
    self.map(KERNBASE.into(), KERNBASE.into(),
             (etext as usize) - KERNBASE, PTE_R|PTE_X);

    // kernel data + remaining RAM: read + write
    self.map((etext as usize).into(), (etext as usize).into(),
             PHYSTOP - (etext as usize), PTE_R|PTE_W);

    // trampoline at MAXVA - PGSIZE, same VA as in every user PT
    self.map(TRAMPOLINE.into(), (trampoline as usize).into(),
             PGSIZE, PTE_R|PTE_X);

    PROCS.mapstacks();   // per-process kernel stacks
}

Notice that none of these mappings include PTE_U. The kernel's pages are unreachable from user mode — any user attempt to access them faults.

The matching switch happens in kinithart():

satp::write(self.as_satp());   // turn on paging
sfence_vma();                  // flush stale (Bare-mode) TLB

as_satp() is just satp::make(Mode::Sv39, 0, root_pt_pa) (src/kernel/vm.rs:306-308).


13. Octox: Building a User Page Table

A fresh user page table starts almost empty. Uvm::create() (src/kernel/vm.rs:452-458):

pub fn create() -> Result<Uvm> {
    Ok(Uvm { page_table: PageTable::new().ok_or(OutOfMemory)? })
}

That allocates only the root PT page. Proc::uvmcreate() then adds the two universal mappings (src/kernel/proc.rs:445-478):

let mut uvm = Uvm::create()?;

// Trampoline: kernel-only (no PTE_U), read + execute
uvm.mappages(UVAddr::from(TRAMPOLINE),
             PAddr::from(trampoline as usize),
             PGSIZE, PTE_R | PTE_X)?;

// Trapframe: kernel-only (no PTE_U), read + write
uvm.mappages(UVAddr::from(TRAPFRAME),
             PAddr::from(self.data().trapframe.as_deref().unwrap()
                          as *const _ as usize),
             PGSIZE, PTE_R | PTE_W)?;

Both pages are mapped without PTE_U, so user code cannot read or write them — only the kernel's own code (which by definition is running in S-mode) can. The walk()/mappages() we just studied are how those PTEs get installed.

After uvmcreate, the user page table looks like:

   MAXVA            +---------------+
                    |  TRAMPOLINE   |  R|X       (no U, kernel only)
   MAXVA - PGSIZE   +---------------+
                    |  TRAPFRAME    |  R|W       (no U, kernel only)
                    +---------------+
                    |               |
                    |  unmapped     |
                    |               |
   0x0              +---------------+

The text/data/stack will be added by exec() or fork().

fork(): copy parent's pages into child's PT

fork() (src/kernel/proc.rs:820-859) calls Uvm::copy() to duplicate the parent's mappings into the freshly-allocated child:

let p_uvm = p_data.uvm.as_mut().unwrap();
let c_uvm = c_data.uvm.as_mut().unwrap();
p_uvm.copy(c_uvm, p_data.sz)?;

Uvm::copy() (src/kernel/vm.rs:517-555) walks the parent's page table, allocates fresh physical pages for the child, memcpys each page, and mappages-installs matching PTEs in the child:

pub fn copy(&mut self, new: &mut Self, size: usize) -> Result<()> {
    let mut va = UVAddr::from(0);
    while va.into_usize() < size {
        let pte = self.walk(va, false).expect("uvmcopy: pte");
        let pa = pte.to_pa();
        let flags = pte.flags();
        let mem = unsafe { Page::try_new_zeroed() }.ok_or(OutOfMemory)?;
        unsafe { *mem = (*(pa.into_usize() as *mut Page)).clone(); }
        new.mappages(va, (mem as usize).into(), PGSIZE, flags)?;
        va += PGSIZE;
    }
    Ok(())
}

(Octox does not implement copy-on-write — conceptually simpler for a teaching kernel.)

exec(): build a fresh PT, swap atomically

exec() (src/kernel/exec.rs:57-200) builds an entirely new Uvm populated with the new program's segments, then swaps it for the old one. Sketch:

let mut new_uvm = p.uvmcreate()?;          // fresh user PT
for ph in elf.program_headers().filter(|h| h.p_type == PT_LOAD) {
    let sz = new_uvm.alloc(sz, ph.p_vaddr + ph.p_msize,
                            flags2perm(ph.p_flags))?;
    new_uvm.loadseg(ph.p_vaddr.into(), &mut inode,
                     ph.p_offset, ph.p_fsize)?;
}
// allocate stack, push argv/envp ...
let old_uvm = proc_data.uvm.replace(new_uvm);   // single-line commit
old_uvm.proc_uvmfree(oldsz);

Every step before the replace is reversible by dropping new_uvm. The replace itself is the irreversible commit point.


14. Putting It Together: the Three Page Tables in Play

flowchart LR
    subgraph user_A["proc A user mode"]
        A1["UVM A (root PT page)"]
    end
    subgraph user_B["proc B user mode"]
        B1["UVM B (root PT page)"]
    end
    subgraph kernel["S-mode"]
        K1["KVM (root PT page)"]
    end
    A1 -->|"trampoline same VA<br/>same PA"| TRAMP["trampoline page"]
    B1 -->|"trampoline same VA<br/>same PA"| TRAMP
    K1 -->|"trampoline same VA<br/>same PA"| TRAMP
    A1 -->|"trapframe<br/>(A's)"| TFA["trapframe A"]
    B1 -->|"trapframe<br/>(B's)"| TFB["trapframe B"]

Three different page tables. The trampoline page is mapped into all three at the same virtual address. Each user PT also maps that process's own trapframe.

satp is the only thing that says which one is active right now.


15. Switching Page Tables at Trap Boundaries

Every trap into the kernel must swap from a user PT to the kernel PT; every return must swap back. The danger: between the csrw satp and the next instruction, the CPU keeps fetching instructions — and those instructions had better still be mapped at the same address.

That is why Octox places uservec and userret on the trampoline page, mapped at the same VA in every page table.

uservec: user → kernel

From src/kernel/trampoline.rs:22-89:

csrw sscratch, a0           # park user a0 (sscratch held TRAPFRAME VA before)
li   a0, TRAPFRAME          # a0 ← TRAPFRAME (same VA in user and kernel PT)
sd   ra,  40(a0)            # save 31 user GPRs to trapframe
sd   sp,  48(a0)
... (28 more)
csrr t0, sscratch           # recover user a0
sd   t0, 112(a0)            # save it last

ld   t1, 0(a0)              # kernel_satp from trapframe
sfence.vma zero, zero       # invalidate stale user TLB
csrw satp, t1               # install kernel page table  ← THE SWITCH
sfence.vma zero, zero       # publish new translations

ld   sp, 8(a0)              # kernel_sp
ld   tp, 32(a0)             # kernel_hartid
ld   t0, 16(a0)             # kernel_trap (= usertrap)
jr   t0                     # jump to usertrap

The page that holds these instructions is mapped at the same address in both PTs. So the instruction after csrw satp is at the same VA — the CPU's next fetch lands in the trampoline page, just now resolved through the kernel's PT instead of the user's. No fault.

userret: kernel → user

Mirror image (src/kernel/trampoline.rs:95-149):

sfence.vma zero, zero       # invalidate stale kernel TLB
csrw satp, a0               # a0 = user satp (passed in)
sfence.vma zero, zero
li   a0, TRAPFRAME
ld   ra,  40(a0)            # restore 31 user GPRs
ld   sp,  48(a0)
... (28 more)
ld   a0, 112(a0)            # restore user a0 LAST
sret                        # PC ← sepc, mode ← U

Same pattern: bracket the swap with sfence.vmas, depend on the trampoline being mapped identically in both PTs.

Two sfence.vmas, both required

The first flush kills stale entries from the outgoing PT. The second forces the next access to consult the incoming PT (avoiding a stale entry that the MMU might have speculatively cached). Skip either and you get a Heisenbug.


Key Concepts

   Concept                          Takeaway
   Pages, VPN, offset               Translate page numbers, leave the offset alone.
   Single-level cost                4 MB per process even for tiny programs — structural waste.
   Two-level / multi-level          Allocate leaf tables only where the address space is occupied.
   TLB                              Hardware cache of recent translations; keeps the cost of multi-level tables low.
   sfence.vma                       Tell the MMU to drop stale TLB entries. Required after every PTE edit.
   Sv39 layout                      9 + 9 + 9 + 12 bits, three levels, 8-byte PTEs, 4 KB tables.
   satp register                    [mode 4][asid 16][PPN 44]; writing it changes the active page table.
   Octox typed addresses            PAddr / KVAddr / UVAddr are checked at compile time.
   walk()                           Three-level descent; allocates intermediate tables on demand.
   mappages()                       Loop over pages, call walk(alloc=true), set the leaf PTE.
   Trampoline same VA in every PT   The CPU keeps fetching instructions across csrw satp writes.
   Trapframe                        Per-process page mapped without PTE_U; saves user GPRs and kernel scratch.

Practice Problems

Problem 1 — Decode an Sv39 virtual address

Given the 64-bit virtual address

va = 0x0000_0040_0123_45AB

(only the low 39 bits are significant; the high bits are zero), compute:

(a) va.px(2) — the level-2 index.
(b) va.px(1) — the level-1 index.
(c) va.px(0) — the level-0 index.
(d) The page offset.

Solution
Write the low 39 bits of va in binary, then group them 9 + 9 + 9 + 12:

0x40_0123_45AB = 0100 0000 0000 0001 0010 0011 0100 0101 1010 1011

   bits 38..30 (L2)   | bits 29..21 (L1)   | bits 20..12 (L0)   | bits 11..0 (offset)
   1 0000 0000        | 0 0000 1001        | 0 0011 0100        | 0101 1010 1011
   = 0x100 = 256      | = 0x009 = 9        | = 0x034 = 52       | = 0x5AB = 1451

So px(2) = 256, px(1) = 9, px(0) = 52, offset = 0x5AB. You can also do this with arithmetic directly:

va.px(2) = (va >> (12 + 9*2)) & 0x1FF = (va >> 30) & 0x1FF = 256
va.px(1) = (va >> (12 + 9*1)) & 0x1FF = (va >> 21) & 0x1FF = 9
va.px(0) = (va >> (12 + 9*0)) & 0x1FF = (va >> 12) & 0x1FF = 52
offset   = va & 0xFFF = 0x5AB

Problem 2 — How many PT pages does Octox actually allocate?

A user process has been built by exec() and now has the following mappings in its Uvm:

  • 8 KB of code at virtual address 0x0000_0000 (two pages).
  • 4 KB of stack at virtual address 0x3F_FFFF_E000 (one page).
  • The trampoline at MAXVA - PGSIZE.
  • The trapframe at MAXVA - 2*PGSIZE.

How many 4 KB page-table pages (root + intermediate + leaf) has Octox allocated for this process? Walk through which L2/L1/L0 tables are needed. Assume Octox has not been clever about merging neighbouring regions.

Solution
MAXVA = 2^38, so the trampoline VA is 0x3F_FFFF_F000 and the trapframe VA is 0x3F_FFFF_E000. The stated stack address 0x3F_FFFF_E000 would collide with the trapframe; in a real Octox process the user stack sits below the trapframe, so treat the stack as one page lower, at 0x3F_FFFF_D000.

Each region's indices (note that px(2) = 255, not 511, for the top region: MAXVA is 2^38, so bit 38 of every valid VA is 0):

   Region       VA               px(2)   px(1)   px(0)
   code (p1)    0x0000_0000      0       0       0
   code (p2)    0x0000_1000      0       0       1
   stack        0x3F_FFFF_D000   255     511     509
   trapframe    0x3F_FFFF_E000   255     511     510
   trampoline   0x3F_FFFF_F000   255     511     511

So the trie has these tables:

  • 1 root (L2) table — mandatory.
  • 2 L1 tables: one at L2-index 0, one at L2-index 255.
  • 2 L0 tables: one at (L2=0, L1=0), one at (L2=255, L1=511).

The three top-region pages share the same L0 table since they differ only in px(0).

Total: 1 + 2 + 2 = 5 page-table pages = 20 KB of bookkeeping. For comparison, a single-level Sv39 page table would need 2^27 × 8 bytes = 1 GB.

Problem 3 — Trace walk() from a fresh Uvm

You have just called p.uvmcreate(), so the user PT contains only the trampoline and trapframe mappings. Now:

uvm.mappages(UVAddr::from(0),
             PAddr::from(0x8010_0000),
             PGSIZE,
             PTE_R | PTE_W | PTE_U);

Trace the sequence of allocations and PTE writes performed by mappages (including any walk does inside). Which intermediate PTEs become valid for the first time? Reuse the existing root PT.

Solution
va = 0 ⇒ px(2) = 0, px(1) = 0, px(0) = 0.

Inside walk(va=0, alloc=true):

  1. Start at the root (already allocated by uvmcreate). Look at entries[px(2)] = entries[0]. The root's index 255 was filled in by uvmcreate (for the trampoline / trapframe path), but entries[0] is still zero — not valid. Allocate a new L1 page-table page (zero-filled); call its PA pa_L1. Set root.entries[0] = ((pa_L1 >> 12) << 10) | PTE_V.
  2. Descend to the new L1 table. Look at entries[px(1)] = entries[0]. It is zero — not valid. Allocate a new L0 page-table page; call its PA pa_L0. Set L1.entries[0] = ((pa_L0 >> 12) << 10) | PTE_V.
  3. The loop ends. Return &mut L0.entries[px(0)] = &mut L0.entries[0].

Back in mappages:

  4. pte.is_v() is false (we just allocated this leaf).
  5. pte.set(0x8010_0000, PTE_R|PTE_W|PTE_U|PTE_V).

Newly valid PTEs:

  • root.entries[0] → points to L1 (interior, PTE_V only).
  • L1.entries[0] → points to L0 (interior, PTE_V only).
  • L0.entries[0] → leaf, PA 0x8010_0000, flags PTE_V|PTE_R|PTE_W|PTE_U.

Two new page-table pages allocated; three PTEs written.

Problem 4 — Why same-VA trampoline?

If we mapped the trampoline page at different virtual addresses in the user PT versus the kernel PT (same physical page, different virtual addresses), what concrete instruction in uservec would fault, and why?

Solution
The instruction immediately after csrw satp, t1 would fault. Just before the csrw, the CPU's PC is some VA inside the trampoline page in the user PT. After the csrw, the kernel PT is active. The CPU advances PC and fetches the next instruction at PC = (old VA) + 4. If the trampoline page is mapped at a different VA in the kernel PT, that VA is not mapped (or maps to something else), and the instruction fetch faults — while we are halfway through trap entry, with the user's registers already saved but the trap handler not yet running. There is no graceful recovery.

Mapping the trampoline at the same VA in every PT means the next fetch resolves through whichever PT is now active, but to the same physical page — the instruction sequence keeps running.

Further Reading

  • Octox Guide §4.4 — Sv39 page tables in the Octox guide; complements this lecture's source-level detail with a higher-level overview.
  • Octox Guide §4.2 — trampoline and kernel/user address spaces.
  • OS Kernel: Inside Octox — uses the page-table machinery built today; revisit §§4 and 9 (mode switching, timer preemption) once Sv39 is comfortable.
  • xv6 book, ch. 3 — "Page tables." Octox is a Rust port; the ideas are 1:1.
  • The RISC-V Privileged ISA Spec, §4.3 (Sv39) and §4.1.11 (satp).
  • Octox source — src/kernel/vm.rs, src/kernel/riscv.rs, src/kernel/memlayout.rs, src/kernel/trampoline.rs. Today's lecture cites them with line numbers; reading them end-to-end takes about an hour and is the best follow-up.