OS Page Tables¶
How the MMU and the kernel collaborate to give each process its own
private address space, and how Octox implements that machinery on
RISC-V Sv39. We start with a hypothetical 32-bit RISC architecture so
the arithmetic stays small, motivate multi-level page tables from a
single example, then descend into the real Octox code — the
walker, the mappers, and the satp swaps that bracket every trap.
Overview¶
Virtual memory does three jobs:
- Isolation. User A cannot read user B's pages.
- Translation. A process's `0x1000` can be different physical memory from another process's `0x1000`.
- Abstraction. Sparse address spaces become possible — `malloc` can hand back `0x5_0000_0000` without physically reserving 20 GB of empty memory.
The data structure that makes all three work is the page table. Hardware (the MMU) walks it on every memory access; software (the OS) builds and edits it. Today we build the mental model on a small 32-bit example, then read the Octox code that implements the real RV64 Sv39 scheme.
Learning Objectives¶
After this lecture you should be able to:
- Decompose a virtual address into page-table indices and a page offset, for both a hypothetical 32-bit architecture and RISC-V Sv39.
- Explain why a single-level page table is wasteful for a 32-bit address space — and exactly how a two-level scheme reduces the per-process page-table footprint to kilobytes for sparse processes.
- State what the TLB caches and what `sfence.vma` does.
- Decode the layout of a RISC-V Sv39 PTE (PPN bits, flag bits) and the `satp` register.
- Read Octox's `walk()` and trace the three-level descent it performs.
- Describe how `mappages()`, `Uvm::create()`, and `Uvm::copy()` are composed to implement `fork`, `exec`, and the trampoline / trapframe setup.
- Explain why the trampoline page is mapped at the same virtual address in every page table.
Prerequisites¶
- OS Kernel: Inside Octox — yesterday's tour of privilege modes, the trapframe, and the syscall round trip. We rely on the same `usertrap`/`userret` skeleton.
- Octox Guide §4.4 (Sv39 page tables) and §4.2 (memory layout).
- RISC-V assembly lectures — CSRs (`satp`, `sfence.vma`).
1. Why Virtual Memory¶
Every modern OS gives each process the illusion of owning the entire
address space. Two processes can both store something at 0x1000 and
neither knows the other exists. The MMU, sitting between every
load/store instruction and physical RAM, performs address
translation — mapping each virtual address (VA) the program
emits to the physical address (PA) where the bytes actually live.
flowchart LR
CPU["CPU<br/>load/store VA"] --> MMU["MMU<br/>VA → PA"]
MMU --> RAM["physical RAM"]
PT["page table<br/>(per process)"] -.-> MMU
The page table is the per-process data structure that drives the translation. The OS builds it; the MMU consults it. To make this practical, memory is divided into fixed-size pages — 4 KB on RISC-V. A virtual address is split into a virtual page number (VPN) and an offset within the page.
Translation maps the VPN to a physical page number (PPN); the offset bits pass through unchanged. So we only ever need to translate page numbers, not full addresses.
Why 4 KB?
Big enough that the per-page bookkeeping (one PTE) is small relative to the data; small enough that internal fragmentation inside a page is bearable. RISC-V also supports 2 MB and 1 GB "huge pages" by treating an entry at level 1 or 2 as a leaf, but Octox uses only 4 KB pages.
2. A Hypothetical 32-bit RISC: Single-Level Page Tables¶
Let's imagine a simple 32-bit RISC ISA with 4 KB pages. The arithmetic is small enough to do in your head.
- Virtual address: 32 bits.
- Page size: 4 KB = 2^12 bytes → 12-bit offset.
- Page number: 32 − 12 = 20 bits.
31 12 11 0
+------------------------------------+----------------+
| VPN (20 bits) | offset (12) |
+------------------------------------+----------------+
A single-level page table is just a flat array indexed by VPN. With 20-bit VPNs that's 2^20 = 1,048,576 entries.
If each PTE is 4 bytes (32 bits, enough for a 20-bit PPN plus some flag bits), the table is 2^20 × 4 B = 4 MB.
Per process. Even a tiny program that uses 8 KB of code and 4 KB of stack pays a 4 MB tax in page-table memory. Run a thousand processes and that is 4 GB of page tables alone — for an architecture whose entire physical address space is 4 GB.
single-level PT
+--------------+ PTE 0
| |
+--------------+ PTE 1
| |
+--------------+
| ... |
+--------------+ PTE (2^20 - 1)
| |
+--------------+
<-- 4 MB total, per process -->
The waste is structural: the table has an entry for every possible page number, even pages the process never touches. We need a representation that lets us skip over unused regions.
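To make the arithmetic concrete, here is a small standalone Rust sketch of the flat-table numbers. The constants and the `split` helper are illustrative, not Octox code:

```rust
// Hypothetical 32-bit RISC, 4 KB pages: split a VA and size the flat table.
const OFFSET_BITS: u32 = 12;            // 4 KB = 2^12
const VPN_BITS: u32 = 32 - OFFSET_BITS; // 20

fn split(va: u32) -> (u32, u32) {
    (va >> OFFSET_BITS, va & ((1 << OFFSET_BITS) - 1)) // (VPN, offset)
}

fn main() {
    let (vpn, off) = split(0x1234_5ABC);
    assert_eq!(vpn, 0x12345); // high 20 bits
    assert_eq!(off, 0xABC);   // low 12 bits pass through untranslated

    // Flat table: one 4-byte PTE per possible VPN, used or not.
    let entries = 1usize << VPN_BITS; // 1,048,576
    let bytes = entries * 4;          // 4 MB per process
    assert_eq!(bytes, 4 * 1024 * 1024);
}
```

The last two lines are the whole indictment: the table's size depends only on the width of the VPN, never on how much memory the process actually uses.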
3. A Hypothetical 32-bit RISC: Two-Level Page Tables¶
Split the 20-bit VPN into two 10-bit indices:
31 22 21 12 11 0
+---------------------+--------------------+----------------+
| L1 index (10) | L0 index (10) | offset (12) |
+---------------------+--------------------+----------------+
Now we have a two-level trie:
- The root (level 1) page table has 2^10 = 1024 entries.
- Each entry points to a leaf (level 0) page table, also with 1024 entries.
- Each leaf entry holds the PPN of an actual data page (or is invalid).
With 4-byte PTEs:
| Table | Entries | Bytes |
|---|---|---|
| Root | 1024 | 4 KB |
| Leaf | 1024 | 4 KB |
flowchart LR
VA["VA: [L1=4][L0=2][off]"] --> ROOT
ROOT["Root PT<br/>1024 entries"] -->|"L1=4"| LEAF["Leaf PT 4<br/>1024 entries"]
ROOT -.->|"L1=0..3, 5..1023<br/>(NULL)"| X["(no leaf<br/>allocated)"]
LEAF -->|"L0=2"| PAGE["data page<br/>(4 KB)"]
The savings. The root is mandatory. But leaf tables only exist for regions the process actually uses. A process whose virtual layout is
- 1 MB of code at `0x0000_0000`,
- 64 KB of stack at the top of memory near `0xFFFF_0000`,
uses two regions, each touching one or two L1 entries. So:
| Component | Size |
|---|---|
| Root PT | 4 KB |
| 1 leaf for code | 4 KB |
| 1 leaf for stack | 4 KB |
| Total | 12 KB (vs. 4 MB flat) |
A factor of 300+ smaller, for the same translation power. The cost is that the MMU now does two memory loads per translation instead of one — which is exactly what the TLB will fix.
The trie idea
Multi-level page tables are tries indexed by chunks of the page number. The level-1 index picks a sub-trie; the level-0 index picks an entry inside it. A NULL (invalid) entry at any level means "this whole sub-region is unmapped" — no leaf table needs to exist.
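The two-level split can be checked in a few lines. A sketch with hypothetical `l1`/`l0` helpers mirroring the diagram above (not Octox code, which is 64-bit):

```rust
// Two-level split of the 20-bit VPN: [L1:10][L0:10][offset:12].
fn l1(va: u32) -> u32 { (va >> 22) & 0x3FF }
fn l0(va: u32) -> u32 { (va >> 12) & 0x3FF }

fn main() {
    // Code at 0x0000_0000 and stack near 0xFFFF_0000 touch two L1 entries.
    assert_eq!(l1(0x0000_0000), 0);
    assert_eq!(l1(0xFFFF_0000), 1023);
    assert_eq!(l0(0xFFFF_0000), 0x3F0);
    // 1 MB of code stays under L1 entry 0: each L1 entry covers 4 MB.
    assert_eq!(l1(0x000F_F000), 0);
    // Footprint: root + one leaf per used L1 entry, 4 KB each.
    let footprint = (1 + 2) * 4 * 1024;
    assert_eq!(footprint, 12 * 1024); // 12 KB vs. the 4 MB flat table
}
```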
4. The TLB¶
A two-level page table is small, but every load and store now needs two extra memory accesses just to translate the address. That would be a disaster.
The fix is a hardware cache: the Translation Lookaside Buffer (TLB). The MMU keeps a small table of recently-used VPN → PPN mappings. Most memory accesses hit the TLB and skip the page-table walk entirely.
- TLB hit rates in real workloads are typically > 99%.
- A miss triggers a page-table walk (in hardware on RISC-V), and the resulting translation is inserted into the TLB.
- The OS owns the page tables, so when it changes a PTE, it must tell the MMU to throw out stale TLB entries: `sfence.vma`.
sfence.vma is not optional
Forgetting sfence.vma after a PTE edit means the TLB will keep
serving the old translation. The bug only shows up when a
translation that was cached gets reused after the OS changed
it — classic "works most of the time" hazard.
We will see Octox bracket every csrw satp with sfence.vma in
§15.
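The hit/miss/invalidate mechanics can be modeled in a few lines of software. A toy sketch, with a `HashMap` standing in for the hardware TLB and a closure standing in for the page-table walk (none of this is Octox code):

```rust
use std::collections::HashMap;

// Toy TLB: caches VPN → PPN; `sfence` models sfence.vma by dropping everything.
struct Tlb { cache: HashMap<usize, usize>, hits: usize, misses: usize }

impl Tlb {
    fn new() -> Self { Tlb { cache: HashMap::new(), hits: 0, misses: 0 } }
    // `walk` stands in for the hardware page-table walk on a miss.
    fn translate(&mut self, vpn: usize, walk: impl Fn(usize) -> usize) -> usize {
        if let Some(&ppn) = self.cache.get(&vpn) {
            self.hits += 1;
            ppn
        } else {
            self.misses += 1;
            let ppn = walk(vpn);         // slow path: walk the page table
            self.cache.insert(vpn, ppn); // fill the TLB
            ppn
        }
    }
    fn sfence(&mut self) { self.cache.clear(); } // invalidate after a PTE edit
}

fn main() {
    let mut tlb = Tlb::new();
    let walk = |vpn: usize| vpn + 0x8_0000; // made-up mapping
    tlb.translate(0x10, walk);              // miss: walks
    tlb.translate(0x10, walk);              // hit: no walk
    assert_eq!((tlb.hits, tlb.misses), (1, 1));
    tlb.sfence();                           // a PTE changed somewhere
    tlb.translate(0x10, walk);              // must re-walk
    assert_eq!(tlb.misses, 2);
}
```

The last three lines are exactly the stale-translation bug in miniature: without the `sfence()`, the third `translate` would happily keep returning the old PPN.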
5. From the Toy Architecture to RISC-V Sv39¶
A flat 64-bit page table would need 2^52 entries — petabytes. Even two-level is far too big. RISC-V's solution for RV64 is Sv39: 39-bit virtual addresses, three levels of page tables, 9-bit indices.
38 30 29 21 20 12 11 0
+----------+-----------+-----------+----------------+
| L2 (9) | L1 (9) | L0 (9) | offset (12) |
+----------+-----------+-----------+----------------+
bits 63..39 must equal bit 38 (sign-extended canonical addresses)
- Page size: 4 KB. 12-bit offset.
- Three levels. Each level uses 9 bits to index 2^9 = 512 entries.
- PTE size: 8 bytes on RV64. So one page-table page is exactly 512 × 8 = 4096 bytes — one 4 KB page.
- Total reach: 2^39 bytes = 512 GB of virtual address space.
flowchart LR
VA["39-bit VA"] --> L2T["L2 PT<br/>512 entries<br/>4 KB"]
L2T -->|"px(2)"| L1T["L1 PT<br/>4 KB"]
L1T -->|"px(1)"| L0T["L0 PT<br/>4 KB"]
L0T -->|"px(0)"| PAGE["4 KB data page"]
The same trie idea as our toy architecture, just deeper.
The Sv39 PTE layout¶
63 54 53 10 9 8 7 6 5 4 3 2 1 0
+----------+-------------------------------------------+----+--+-+-+-+-+-+-+-+
| reserved | PPN (44 bits) | RSW | D|A|G|U|X|W|R|V|
+----------+-------------------------------------------+----+--+-+-+-+-+-+-+-+
| Bit | Name | Meaning |
|---|---|---|
| 0 | V | Valid — entry is in use |
| 1 | R | Readable |
| 2 | W | Writable |
| 3 | X | Executable |
| 4 | U | User-mode accessible |
| 5 | G | Global (shared across address spaces) |
| 6 | A | Accessed (set by hardware) |
| 7 | D | Dirty (set by hardware) |
Leaf vs. interior PTEs. An interior PTE (level 2 or 1, pointing to
the next-level table) sets only V. A leaf PTE additionally sets at
least one of R, W, X, marking it as the final mapping for a page.
Octox encodes this cleanly in `PageTableEntry::is_leaf()` (src/kernel/vm.rs:247-283).
The flag constants live in src/kernel/riscv.rs:616-622:
pub mod pteflags {
pub const PTE_V: usize = 1 << 0; // valid
pub const PTE_R: usize = 1 << 1;
pub const PTE_W: usize = 1 << 2;
pub const PTE_X: usize = 1 << 3;
pub const PTE_U: usize = 1 << 4; // user can access
}
6. The satp Register¶
satp ("Supervisor Address Translation and Protection")
is the CSR that points the MMU at the current page table. Writing
satp switches address spaces; that is exactly what happens on every
trap into and out of the kernel.
63 60 59 44 43 0
+--------+----------+-------------------------------------------------+
| MODE | ASID | PPN of root PT |
+--------+----------+-------------------------------------------------+
4 bits 16 bits 44 bits
- MODE: `Sv39 = 8`. (`Bare = 0` disables translation entirely.)
- ASID: address-space identifier, used by the TLB to tag entries.
- PPN: the physical page number of the root page table. The MMU starts every walk here.
Octox wraps this in a typed module (src/kernel/riscv.rs:357-408):
pub mod satp {
pub enum Mode {
Bare = 0, Sv39 = 8, Sv48 = 9, Sv57 = 10, Sv64 = 11,
}
pub unsafe fn write(bits: usize) {
asm!("csrw satp, {}", in(reg) bits);
}
pub fn make(mode: Mode, asid: usize, ppn: usize) -> usize {
let mut bits: usize = 0;
bits |= (mode as usize) << 60;
bits |= asid << 44;
bits |= ppn >> 12;
bits
}
}
`satp::make(Mode::Sv39, 0, ptr_to_root)` builds the `usize` you write to the CSR. Note the `ppn >> 12`: `ppn` here is a byte address that gets shifted into a page number.
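You can re-derive the field packing in plain Rust. This standalone sketch mirrors `satp::make` and checks that each field round-trips; the `root_pa` value is made up for the demo:

```rust
// Re-derive the satp packing by hand: [MODE:4][ASID:16][PPN:44].
const SV39: usize = 8;

fn make(mode: usize, asid: usize, root_pa: usize) -> usize {
    (mode << 60) | (asid << 44) | (root_pa >> 12) // byte address → PPN
}

fn main() {
    let root_pa = 0x8020_1000usize; // a page-aligned root PT address (made up)
    let satp = make(SV39, 0, root_pa);
    assert_eq!(satp >> 60, SV39);         // MODE field
    assert_eq!((satp >> 44) & 0xFFFF, 0); // ASID field
    // PPN field, shifted back into a byte address:
    assert_eq!((satp & ((1 << 44) - 1)) << 12, root_pa);
}
```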
7. Octox: Typed Addresses¶
A whole class of OS bugs comes from accidentally treating a user
pointer as a kernel pointer or vice versa. Octox catches these at
compile time with newtypes (src/kernel/vm.rs:24-42):
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
#[repr(transparent)]
pub struct PAddr(usize);
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
#[repr(transparent)]
pub struct KVAddr(pub usize);
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Default)]
#[repr(transparent)]
pub struct UVAddr(usize);
#[repr(transparent)] makes them zero-cost — identical layout
to usize. The VAddr trait (src/kernel/vm.rs:99-108) gives every
virtual address kind a px(level) method that extracts the 9-bit
page-table index for that level:
pub trait VAddr: Addr + Debug {
const PXMASK: usize = 0x1FF; // 9 bits
fn px(&self, level: usize) -> usize;
const MAXVA: usize = 1 << (9 + 9 + 9 + 12 - 1);
}
The implementation (src/kernel/vm.rs:159-174) shifts the address right by 12 + 9 × level and masks off nine bits.
So `va.px(2)` is bits 38..30, `va.px(1)` is bits 29..21, `va.px(0)` is bits 20..12 — exactly the three Sv39 indices.
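A standalone sketch of the same shift-and-mask arithmetic; this `px` is a free function for illustration, not the trait method:

```rust
// Sv39 index extraction: shift by 12 + 9·level, mask 9 bits (PXMASK = 0x1FF).
fn px(va: usize, level: usize) -> usize {
    (va >> (12 + 9 * level)) & 0x1FF
}

fn main() {
    // Build a VA from known indices, then extract them back.
    let va: usize = (256usize << 30) | (2 << 21) | (282 << 12) | 0x5AB;
    assert_eq!(px(va, 2), 256);    // bits 38..30
    assert_eq!(px(va, 1), 2);      // bits 29..21
    assert_eq!(px(va, 0), 282);    // bits 20..12
    assert_eq!(va & 0xFFF, 0x5AB); // offset passes through untouched
}
```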
The `PageTable` type itself is generic over the address kind, written `PageTable<V>` where `V` is the virtual-address type (src/kernel/vm.rs:286-289).
A PageTable<UVAddr> will not let you walk it with a KVAddr, and
vice versa. The compiler refuses to mix them up.
8. Octox: RawPageTable and PageTableEntry¶
A page-table page, `RawPageTable`, is a 4 KB array of 512 PTEs (src/kernel/vm.rs:214-218). Its `#[repr(align(4096))]` attribute guarantees the whole structure is page-aligned — the MMU requires it.
A PTE is a single usize carrying both the PPN and the flag bits
(src/kernel/vm.rs:247-283):
#[repr(transparent)]
pub struct PageTableEntry(usize);
impl PageTableEntry {
pub fn is_v(&self) -> bool { self.0 & PTE_V != 0 }
pub fn is_leaf(&self) -> bool { (self.0 & (PTE_R|PTE_W|PTE_X)) != 0 }
pub fn flags(&self) -> usize { self.0 & 0x3FF }
pub fn to_pa(&self) -> PAddr { ((self.0 >> 10) << 12).into() }
pub fn set(&mut self, pa: usize, attr: usize) {
self.0 = ((pa >> 12) << 10) | attr;
}
}
Note the bit gymnastics in `to_pa`/`set`:
- A PTE stores the PPN in bits 53..10. `(pa >> 12) << 10` packs a byte-address `pa` into that field.
- `to_pa()` reverses it: `(self.0 >> 10) << 12` turns the PPN back into a byte address.
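A standalone round-trip check of that packing, using free functions that mirror `set`/`to_pa`/`flags` (not the Octox impl itself):

```rust
// PTE packing: PPN in bits 53..10, flags in bits 9..0.
const PTE_V: usize = 1 << 0;
const PTE_R: usize = 1 << 1;
const PTE_W: usize = 1 << 2;

fn set(pa: usize, attr: usize) -> usize { ((pa >> 12) << 10) | attr }
fn to_pa(pte: usize) -> usize { (pte >> 10) << 12 }
fn flags(pte: usize) -> usize { pte & 0x3FF }

fn main() {
    let pa = 0x8010_2000usize; // a page-aligned physical address (made up)
    let pte = set(pa, PTE_V | PTE_R | PTE_W);
    assert_eq!(to_pa(pte), pa); // the PPN survives the round trip
    assert_eq!(flags(pte), PTE_V | PTE_R | PTE_W);
    // The low 12 bits of `pa` are dropped: PTEs map whole pages.
    assert_eq!(to_pa(set(pa | 0x123, PTE_V)), pa);
}
```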
9. Octox: walk() — Three Levels in 30 Lines¶
walk() is the heart of every page-table operation in Octox
(src/kernel/vm.rs:310-343):
// The risc-v Sv39 scheme has three levels of page-table pages.
// A page-table page contains 512 64-bit PTEs.
// A 64-bit virtual address is split into five fields:
// 39..63 -- must be zero.
// 30..38 -- 9 bits of level-2 index.
// 21..29 -- 9 bits of level-1 index.
// 12..20 -- 9 bits of level-0 index.
// 0..11 -- 12 bits of byte offset within the page.
pub fn walk(&mut self, va: V, alloc: bool) -> Option<&mut PageTableEntry> {
let mut pagetable = self.ptr;
if va.into_usize() >= V::MAXVA {
panic!("walk");
}
unsafe {
for level in (1..3).rev() {
let pte = (*pagetable).get_mut(va.px(level))?;
if pte.is_v() {
pagetable = pte.to_pa().into_usize() as *mut RawPageTable;
} else {
if !alloc {
return None;
}
pagetable = RawPageTable::try_new_zeroed()?;
pte.set(pagetable as usize, PTE_V);
}
}
(*pagetable).get_mut(va.px(0))
}
}
The descent. Start at the root (`self.ptr`). For levels 2, then 1:
- Compute the index for this level: `va.px(level)`.
- Fetch that PTE.
- If it's valid, follow it: load the next-level table from its PPN.
- If it's invalid and the caller asked us to `alloc`, allocate a new page-table page (zero-filled), store its PA into the PTE with `PTE_V`, and follow it.
- After the loop, return `&mut` to the leaf-level PTE at `px(0)`.
sequenceDiagram
participant W as walk(va, alloc)
participant L2 as L2 table (root)
participant L1 as L1 table
participant L0 as L0 table
W->>L2: index px(2)
L2-->>W: PTE_V → PA of L1
W->>L1: index px(1)
L1-->>W: PTE_V → PA of L0
W->>L0: index px(0)
L0-->>W: &mut leaf PTE
The caller fills in the leaf PTE (with a PA and flags); walk()
itself never sets leaf bits.
Why return Option<&mut PTE>?
`walk` returning `&mut PTE` rather than copying the entry is what lets `mappages` set it in place. The `Option` lets the caller distinguish "no such mapping" (when `alloc=false`) from a found leaf.
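The descent can also be modeled entirely in safe Rust. In this sketch, "physical memory" is a `Vec` of 512-entry tables and a table's index plays the role of its PPN; it is a simulation of the algorithm above, not the real pointer-chasing code:

```rust
// Software model of the three-level walk. tables[0] is the root.
const V: u64 = 1;      // PTE_V
const R: u64 = 1 << 1; // PTE_R (marks a leaf here)

fn px(va: u64, level: usize) -> usize { ((va >> (12 + 9 * level)) & 0x1FF) as usize }

struct Mem { tables: Vec<[u64; 512]> }

impl Mem {
    fn walk(&mut self, va: u64, alloc: bool) -> Option<(usize, usize)> {
        let mut t = 0; // start at the root table
        for level in (1..3).rev() {
            let pte = self.tables[t][px(va, level)];
            if pte & V != 0 {
                t = (pte >> 10) as usize; // follow to the next level
            } else if alloc {
                self.tables.push([0u64; 512]); // new zeroed table
                let new = self.tables.len() - 1;
                self.tables[t][px(va, level)] = ((new as u64) << 10) | V;
                t = new;
            } else {
                return None;
            }
        }
        Some((t, px(va, 0))) // (table, index) of the leaf PTE
    }
}

fn main() {
    let mut m = Mem { tables: vec![[0u64; 512]] };
    let va = 0x40_1000u64;
    assert!(m.walk(va, false).is_none());      // nothing mapped yet
    let (t, i) = m.walk(va, true).unwrap();    // allocates L1 + L0 on the way
    m.tables[t][i] = (0x8_0100 << 10) | R | V; // the caller installs the leaf
    assert_eq!(m.tables.len(), 3);             // root + L1 + L0
    assert_eq!(m.walk(va, false), Some((t, i)));
}
```

Note how it reproduces both behaviors called out above: `alloc=false` returns `None` for an unmapped VA, and the walker never sets leaf bits itself.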
10. Octox: mappages() — Installing a VA Range¶
mappages() is the workhorse for everything that creates mappings
(src/kernel/vm.rs:360-385):
pub fn mappages(&mut self, mut va: V, mut pa: PAddr,
size: usize, perm: usize) -> Result<()> {
if size == 0 { panic!("mappages: size"); }
let mut last = va + size - 1;
va.rounddown();
last.rounddown();
loop {
let pte = self.walk(va, true).ok_or(OutOfMemory)?;
if pte.is_v() {
panic!("mappages: remap");
}
pte.set(pa.into_usize(), perm | PTE_V);
if va == last { break; }
va += PGSIZE;
pa += PGSIZE;
}
Ok(())
}
Algorithm:
- Round `va` and `last` down to page boundaries.
- For each page in the range:
  - `walk(va, alloc=true)` to find or allocate the leaf PTE.
  - Refuse to remap an already-valid PTE (`panic!`).
  - Store the PA and permissions in the leaf, with `PTE_V`.
So mappages builds every mapping in the system: kernel direct
maps, user code/data, the trampoline, the trapframe.
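The rounding determines how many leaf PTEs one call writes. A quick sketch of that arithmetic, with hypothetical helpers in the style of xv6's `PGROUNDDOWN`:

```rust
// How many leaf PTEs does a mappages(va, pa, size, perm) call write?
const PGSIZE: usize = 4096;

fn pg_round_down(a: usize) -> usize { a & !(PGSIZE - 1) }

// Round va and va+size-1 down, then count pages inclusively.
fn pages_touched(va: usize, size: usize) -> usize {
    (pg_round_down(va + size - 1) - pg_round_down(va)) / PGSIZE + 1
}

fn main() {
    assert_eq!(pages_touched(0x1000, PGSIZE), 1);  // aligned, one page
    assert_eq!(pages_touched(0x1FFF, 2), 2);       // two bytes straddle a boundary
    assert_eq!(pages_touched(0x0, 3 * PGSIZE), 3); // aligned multi-page
    assert_eq!(pages_touched(0x1800, PGSIZE), 2);  // unaligned start spills over
}
```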
11. Octox: Memory Layout¶
Constants from src/kernel/memlayout.rs:1-82:
pub const UART0: usize = 0x1000_0000;
pub const VIRTIO0: usize = 0x1000_1000;
pub const PLIC: usize = 0x0C00_0000;
pub const KERNBASE: usize = 0x8000_0000;
pub const PHYSTOP: usize = KERNBASE + 512 * 1024 * 1024;
pub const TRAMPOLINE: usize = KVAddr::MAXVA - PGSIZE;
pub const TRAPFRAME: usize = TRAMPOLINE - PGSIZE;
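These constants can be sanity-checked numerically. A standalone sketch that redefines them locally rather than importing Octox's:

```rust
// Top-of-address-space constants under Sv39 with MAXVA = 2^38.
const PGSIZE: usize = 4096;
const MAXVA: usize = 1 << 38; // bit 38 stays clear: no sign-extension hazard
const TRAMPOLINE: usize = MAXVA - PGSIZE;
const TRAPFRAME: usize = TRAMPOLINE - PGSIZE;

fn px(va: usize, level: usize) -> usize { (va >> (12 + 9 * level)) & 0x1FF }

fn main() {
    assert_eq!(TRAMPOLINE, 0x3F_FFFF_F000);
    assert_eq!(TRAPFRAME, 0x3F_FFFF_E000);
    // Because MAXVA is 2^38 (not 2^39), the top region sits at L2 index 255,
    // then fills the last L1 and L0 slots.
    assert_eq!((px(TRAMPOLINE, 2), px(TRAMPOLINE, 1), px(TRAMPOLINE, 0)),
               (255, 511, 511));
    assert_eq!(px(TRAPFRAME, 0), 510);
}
```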
Every process — user and kernel — lays out its address space the same way at the top:
MAXVA (= 2^38) +------------------------+
| TRAMPOLINE (4 KB) | same VA in EVERY PT
MAXVA - PGSIZE +------------------------+
| TRAPFRAME (4 KB) | per-process register save
+------------------------+
| ... |
Kernel address space below those two pages: kernel code and data
direct-mapped at KERNBASE, MMIO regions for UART/virtio/PLIC
direct-mapped at their physical addresses, kernel stacks (one per
process) sandwiched between guard pages.
User address space below those two pages: text at 0, then data,
heap, stack — all with PTE_U set.
user space kernel space
+-----------+ +-----------+
| TRAMPOL. | | TRAMPOL. | <- same VA, same PA, in BOTH
+-----------+ +-----------+
| TRAPFRAME | | TRAPFRAME | <- different physical pages,
+-----------+ +-----------+ same VA, mapped only here
| ... | | kstack[i] |
| stack | | guard |
| ↑ | | kstack[j] |
| | | guard |
| ↓ | | ... |
| heap | | data |
| data | | text |
| text | +-----------+ KERNBASE
+-----------+ 0
The trampoline is the bridge that lets us execute across a csrw satp
write — we'll see why in §15.
12. Octox: Building the Kernel Page Table¶
kinit() builds the singleton kernel page table; kinithart() turns
on paging by writing satp. From src/kernel/vm.rs:698-714:
pub fn kinit() {
unsafe {
KVM.set(Kvm::new().unwrap()).unwrap();
KVM.get_mut().unwrap().make();
}
}
pub fn kinithart() {
unsafe {
satp::write(KVM.get().unwrap().as_satp());
sfence_vma();
}
}
Kvm::make() does the heavy lifting (src/kernel/vm.rs:658-694).
Each self.map(...) is a thin wrapper around mappages:
unsafe fn make(&mut self) {
self.map(UART0.into(), UART0.into(), PGSIZE, PTE_R|PTE_W);
self.map(VIRTIO0.into(), VIRTIO0.into(), PGSIZE, PTE_R|PTE_W);
self.map(PLIC.into(), PLIC.into(), 0x40_0000, PTE_R|PTE_W);
// kernel text: read + execute
self.map(KERNBASE.into(), KERNBASE.into(),
(etext as usize) - KERNBASE, PTE_R|PTE_X);
// kernel data + remaining RAM: read + write
self.map((etext as usize).into(), (etext as usize).into(),
PHYSTOP - (etext as usize), PTE_R|PTE_W);
// trampoline at MAXVA - PGSIZE, same VA as in every user PT
self.map(TRAMPOLINE.into(), (trampoline as usize).into(),
PGSIZE, PTE_R|PTE_X);
PROCS.mapstacks(); // per-process kernel stacks
}
Notice that none of these mappings include PTE_U. The kernel's
pages are unreachable from user mode — any user attempt to
access them faults.
The matching switch happens in kinithart():
as_satp() is just satp::make(Mode::Sv39, 0, root_pt_pa)
(src/kernel/vm.rs:306-308).
13. Octox: Building a User Page Table¶
A fresh user page table starts almost empty. `Uvm::create()` (src/kernel/vm.rs:452-458) allocates only the root PT page. `Proc::uvmcreate()` then adds the two universal mappings (src/kernel/proc.rs:445-478):
let mut uvm = Uvm::create()?;
// Trampoline: kernel-only (no PTE_U), read + execute
uvm.mappages(UVAddr::from(TRAMPOLINE),
PAddr::from(trampoline as usize),
PGSIZE, PTE_R | PTE_X)?;
// Trapframe: kernel-only (no PTE_U), read + write
uvm.mappages(UVAddr::from(TRAPFRAME),
PAddr::from(self.data().trapframe.as_deref().unwrap()
as *const _ as usize),
PGSIZE, PTE_R | PTE_W)?;
Both pages are mapped without PTE_U, so user code cannot read or
write them — only the kernel's own code (which by definition is
running in S-mode) can. The walk()/mappages() we just studied are
how those PTEs get installed.
After uvmcreate, the user page table looks like:
MAXVA +---------------+
| TRAMPOLINE | R|X (no U, kernel only)
MAXVA - PGSIZE +---------------+
| TRAPFRAME | R|W (no U, kernel only)
+---------------+
| |
| unmapped |
| |
0x0 +---------------+
The text/data/stack will be added by exec() or fork().
fork(): copy parent's pages into child's PT¶
fork() (src/kernel/proc.rs:820-859) calls Uvm::copy() to
duplicate the parent's mappings into the freshly-allocated child:
let p_uvm = p_data.uvm.as_mut().unwrap();
let c_uvm = c_data.uvm.as_mut().unwrap();
p_uvm.copy(c_uvm, p_data.sz)?;
Uvm::copy() (src/kernel/vm.rs:517-555) walks the parent's page
table, allocates fresh physical pages for the child, memcpys each
page, and mappages-installs matching PTEs in the child:
pub fn copy(&mut self, new: &mut Self, size: usize) -> Result<()> {
let mut va = UVAddr::from(0);
while va.into_usize() < size {
let pte = self.walk(va, false).expect("uvmcopy: pte");
let pa = pte.to_pa();
let flags = pte.flags();
let mem = unsafe { Page::try_new_zeroed() }.ok_or(OutOfMemory)?;
unsafe { *mem = (*(pa.into_usize() as *mut Page)).clone(); }
new.mappages(va, (mem as usize).into(), PGSIZE, flags)?;
va += PGSIZE;
}
Ok(())
}
(Octox does not implement copy-on-write — conceptually simpler for a teaching kernel.)
exec(): build a fresh PT, swap atomically¶
exec() (src/kernel/exec.rs:57-200) builds an entirely new Uvm
populated with the new program's segments, then swaps it for the old
one. Sketch:
let mut new_uvm = p.uvmcreate()?;   // fresh user PT
let mut sz = 0;                     // highest mapped user VA so far
for ph in elf.program_headers().filter(|h| h.p_type == PT_LOAD) {
    sz = new_uvm.alloc(sz, ph.p_vaddr + ph.p_msize,
                       flags2perm(ph.p_flags))?;
    new_uvm.loadseg(ph.p_vaddr.into(), &mut inode,
                    ph.p_offset, ph.p_fsize)?;
}
// allocate stack, push argv/envp ...
let old_uvm = proc_data.uvm.replace(new_uvm); // single-line commit
old_uvm.proc_uvmfree(oldsz);
Every step before the replace is reversible by dropping new_uvm.
The replace itself is the irreversible commit point.
14. Putting It Together: the Three Page Tables in Play¶
flowchart LR
subgraph user_A["proc A user mode"]
A1["UVM A (root PT page)"]
end
subgraph user_B["proc B user mode"]
B1["UVM B (root PT page)"]
end
subgraph kernel["S-mode"]
K1["KVM (root PT page)"]
end
A1 -->|"trampoline same VA<br/>same PA"| TRAMP["trampoline page"]
B1 -->|"trampoline same VA<br/>same PA"| TRAMP
K1 -->|"trampoline same VA<br/>same PA"| TRAMP
A1 -->|"trapframe<br/>(A's)"| TFA["trapframe A"]
B1 -->|"trapframe<br/>(B's)"| TFB["trapframe B"]
Three different page tables. The trampoline page is mapped into all three at the same virtual address. Each user PT also maps that process's own trapframe.
satp is the only thing that says which one is active right now.
15. Switching Page Tables at Trap Boundaries¶
Every trap into the kernel must swap from a user PT to the kernel PT;
every return must swap back. The danger: between the csrw satp and
the next instruction, the CPU keeps fetching instructions — and
those instructions had better still be mapped at the same address.
That is why Octox places uservec and userret on the trampoline
page, mapped at the same VA in every page table.
uservec: user → kernel¶
From src/kernel/trampoline.rs:22-89:
csrw sscratch, a0 # park user a0 (sscratch held TRAPFRAME VA before)
li a0, TRAPFRAME # a0 ← TRAPFRAME (same VA in user and kernel PT)
sd ra, 40(a0) # save 31 user GPRs to trapframe
sd sp, 48(a0)
... (28 more)
csrr t0, sscratch # recover user a0
sd t0, 112(a0) # save it last
ld t1, 0(a0) # kernel_satp from trapframe
sfence.vma zero, zero # invalidate stale user TLB
csrw satp, t1 # install kernel page table ← THE SWITCH
sfence.vma zero, zero # publish new translations
ld sp, 8(a0) # kernel_sp
ld tp, 32(a0) # kernel_hartid
ld t0, 16(a0) # kernel_trap (= usertrap)
jr t0 # jump to usertrap
The page that holds these instructions is mapped at the same address
in both PTs. So the instruction after csrw satp is at the same VA
— the CPU's next fetch lands in the trampoline page, just now
resolved through the kernel's PT instead of the user's. No fault.
userret: kernel → user¶
Mirror image (src/kernel/trampoline.rs:95-149):
sfence.vma zero, zero # invalidate stale kernel TLB
csrw satp, a0 # a0 = user satp (passed in)
sfence.vma zero, zero
li a0, TRAPFRAME
ld ra, 40(a0) # restore 31 user GPRs
ld sp, 48(a0)
... (28 more)
ld a0, 112(a0) # restore user a0 LAST
sret # PC ← sepc, mode ← U
Same pattern: bracket the swap with sfence.vmas, depend on the
trampoline being mapped identically in both PTs.
Two sfence.vmas, both required
The first flush kills stale entries from the outgoing PT. The second forces the next access to consult the incoming PT (avoiding a stale entry that the MMU might have speculatively cached). Skip either and you get a Heisenbug.
Key Concepts¶
| Concept | Takeaway |
|---|---|
| Pages, VPN, offset | Translate page numbers, leave the offset alone. |
| Single-level cost | 4 MB per process even for tiny programs — structural waste. |
| Two-level / multi-level | Allocate leaf tables only where the address space is occupied. |
| TLB | Hardware cache of recent translations; keeps the cost of multi-level tables low. |
| `sfence.vma` | Tell the MMU to drop stale TLB entries. Required after every PTE edit. |
| Sv39 layout | 9 + 9 + 9 + 12 bits, three levels, 8-byte PTEs, 4 KB tables. |
| `satp` register | `[mode 4][asid 16][PPN 44]`; writing it changes the active page table. |
| Octox typed addresses | `PAddr` / `KVAddr` / `UVAddr` are checked at compile time. |
| `walk()` | Three-level descent; allocates intermediate tables on demand. |
| `mappages()` | Loop over pages, call `walk(alloc=true)`, set the leaf PTE. |
| Trampoline same VA in every PT | The MMU keeps fetching instructions across `csrw satp` writes. |
| Trapframe | Per-process page mapped without `PTE_U`; saves user GPRs and kernel scratch. |
Practice Problems¶
Problem 1 — Decode an Sv39 virtual address¶
Given the 64-bit virtual address
(only the low 39 bits are significant; the high bits are zero), compute:
(a) va.px(2) — the level-2 index.
(b) va.px(1) — the level-1 index.
(c) va.px(0) — the level-0 index.
(d) The page offset.
Solution
Write the low 39 bits of `va` in binary, then group into 9 + 9 + 9 + 12 bits: `px(2) = 256`, `px(1) = 2`, `px(0) = 282`, offset = `0x5AB`. You can also do the extraction with arithmetic directly: `px(level) = (va >> (12 + 9 * level)) & 0x1FF`.
Problem 2 — How many PT pages does Octox actually allocate?¶
A user process has been built by exec() and now has the following
mappings in its Uvm:
- 8 KB of code at virtual address
0x0000_0000(two pages). - 4 KB of stack at virtual address
0x3F_FFFF_E000(one page). - The trampoline at
MAXVA - PGSIZE. - The trapframe at
MAXVA - 2*PGSIZE.
How many 4 KB page-table pages (root + intermediate + leaf) has Octox allocated for this process? Walk through which L2/L1/L0 tables are needed. Assume Octox has not been clever about merging neighbouring regions.
Solution
`MAXVA = 2^38`, so the trampoline VA is `0x3F_FFFF_F000` and the trapframe VA is `0x3F_FFFF_E000`. The stack as stated would overlap the trapframe; in a real Octox process the user stack sits **below** the trapframe, so treat the stack as one page lower: `0x3F_FFFF_D000`.

Each region's L2/L1/L0 indices (note that because `MAXVA = 2^38`, the top region's L2 index is 255, not 511):

| Region | VA | px(2) | px(1) | px(0) |
|---|---|---|---|---|
| code (p1) | `0x0000_0000` | 0 | 0 | 0 |
| code (p2) | `0x0000_1000` | 0 | 0 | 1 |
| stack | `0x3F_FFFF_D000` | 255 | 511 | 509 |
| trapframe | `0x3F_FFFF_E000` | 255 | 511 | 510 |
| trampoline | `0x3F_FFFF_F000` | 255 | 511 | 511 |

So the trie has these tables:

- 1 root (L2) table — mandatory.
- 2 L1 tables: one at L2-index `0`, one at L2-index `255`.
- 2 L0 tables: one at (L2=0, L1=0), one at (L2=255, L1=511). All three top-region pages share the same L0 table since they differ only in `px(0)`.

**Total: 1 + 2 + 2 = 5 page-table pages = 20 KB of bookkeeping.** For comparison, a single-level Sv39 PT would need 2^27 × 8 bytes = 1 GB.
Problem 3 — Trace walk() from a fresh Uvm¶
You have just called `p.uvmcreate()`, so the user PT contains only the trampoline and trapframe mappings. Now you call `mappages(UVAddr::from(0), PAddr::from(0x8010_0000), PGSIZE, PTE_R | PTE_W | PTE_U)` to map one page of user data.
Trace the sequence of allocations and PTE writes performed by `mappages` (including those `walk` does inside). Which intermediate PTEs become valid for the first time? Reuse the existing root PT.
Solution
`va = 0` ⇒ `px(2) = 0`, `px(1) = 0`, `px(0) = 0`.

`walk(va=0, alloc=true)`:

1. Start at the root (already allocated by `uvmcreate`). Look at `entries[px(2)] = entries[0]`. The root's high index (255, the trampoline / trapframe path) was filled in by `uvmcreate`, but `entries[0]` is still zero. **Not valid.** Allocate a new L1 page-table page (zero-filled); call its PA `pa_L1`. Set `root.entries[0] = (pa_L1 >> 12) << 10 | PTE_V`.
2. Descend to the new L1 table. Look at `entries[px(1)] = entries[0]`. It is zero. **Not valid.** Allocate a new L0 page-table page; call its PA `pa_L0`. Set `L1.entries[0] = (pa_L0 >> 12) << 10 | PTE_V`.
3. Loop ends. Return `&mut L0.entries[px(0)] = &mut L0.entries[0]`.

Back in `mappages`:

4. `pte.is_v()` ⇒ false (we just allocated this leaf).
5. `pte.set(0x8010_0000, PTE_R|PTE_W|PTE_U|PTE_V)`.

**Newly valid PTEs:**

- `root.entries[0]` → points to L1 (interior, `PTE_V` only).
- `L1.entries[0]` → points to L0 (interior, `PTE_V` only).
- `L0.entries[0]` → leaf, PA `0x8010_0000`, flags `PTE_V|PTE_R|PTE_W|PTE_U`.

Two new page-table pages allocated; three PTEs written.
Problem 4 — Why same-VA trampoline?¶
If we mapped the trampoline page at different virtual addresses in
the user PT versus the kernel PT (same physical page, different
virtual addresses), what concrete instruction in uservec would
fault, and why?
Solution
The instruction immediately **after** `csrw satp, t1` would fault.

Just before the `csrw`, the CPU's PC is some VA inside the trampoline page in the **user** PT. After the `csrw`, the kernel PT is active. The CPU advances PC and fetches the next instruction at PC = (old VA) + 4. If the trampoline page is mapped at a *different* VA in the kernel PT, that VA is not mapped (or maps to something else), and the instruction fetch faults — while we are halfway through trap entry, with the user's registers already saved but the trap handler not yet running. There is no graceful recovery.

Mapping the trampoline at the same VA in every PT means the next fetch resolves through whichever PT is now active, but to the **same physical page** — the instruction sequence keeps running.
Further Reading¶
- Octox Guide §4.4 — Sv39 page tables in the Octox guide; complements this lecture's source-level detail with a higher-level overview.
- Octox Guide §4.2 — trampoline and kernel/user address spaces.
- OS Kernel: Inside Octox — uses the page-table machinery built today; revisit §§4 and 9 (mode switching, timer preemption) once Sv39 is comfortable.
- xv6 book, ch. 3 — "Page tables." Octox is a Rust port; the ideas are 1:1.
- The RISC-V Privileged ISA Spec, §4.3 (Sv39) and §4.1.11 (`satp`).
- Octox source — `src/kernel/vm.rs`, `src/kernel/riscv.rs`, `src/kernel/memlayout.rs`, `src/kernel/trampoline.rs`. Today's lecture cites them with line numbers; reading them end-to-end takes about an hour and is the best follow-up.