7. Octox: `walk()` — three levels in 30 lines
8. Octox: `mappages()`, memory layout
9. Building the kernel PT, building user PTs
10. Switching PTs at trap boundaries
---
## Why Virtual Memory
Three jobs:
- **Isolation.** Process A cannot read process B's pages.
- **Translation.** A's `0x1000` ≠ B's `0x1000`.
- **Abstraction.** Sparse layouts — `malloc` can return
`0x5_0000_0000` without 20 GB of RAM.
```mermaid
flowchart LR
    CPU["CPU
load/store VA"] --> MMU["MMU
VA → PA"]
    MMU --> RAM["physical RAM"]
    PT["page table
(per process)"] -.-> MMU
```
---
## VA Anatomy
```text
[ virtual page number (VPN) ][ page offset ]
```
Translate the VPN; the offset passes through.
- 4 KB pages on RISC-V ⇒ **12-bit offset**.
- Internal fragmentation tolerable; per-page bookkeeping reasonable.
We translate **page numbers**, never full addresses.
---
## A Hypothetical 32-bit RISC
32-bit VA, 4 KB pages.
```text
31 12 11 0
+------------------------------------+----------------+
| VPN (20 bits) | offset (12) |
+------------------------------------+----------------+
```
- 20-bit VPN → 2^20 = 1,048,576 entries.
- 4-byte PTEs → **4 MB** per process for the table itself.
---
## Single-Level: The Cost
```text
single-level PT (per process)
+--------------+ PTE 0
| |
+--------------+ PTE 1
| |
+--------------+
| ... |
+--------------+ PTE (2^20 - 1)
| |
+--------------+
<-- 4 MB total -->
```
4 MB even if the process uses 8 KB. 1000 processes → 4 GB of
page tables — for an architecture whose whole physical address
space is 4 GB.
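The arithmetic can be checked mechanically; this throwaway snippet (not octox code) just encodes the slide's numbers:

```rust
// Size of a flat single-level page table on the hypothetical 32-bit machine.
const VPN_BITS: usize = 20; // 32-bit VA minus 12-bit offset
const PTE_SIZE: usize = 4;  // bytes per PTE

fn flat_table_bytes() -> usize {
    (1 << VPN_BITS) * PTE_SIZE
}

fn main() {
    assert_eq!(1 << VPN_BITS, 1_048_576);    // 2^20 entries
    assert_eq!(flat_table_bytes(), 4 << 20); // 4 MB per process
    // 1000 processes' worth of page tables ≈ the whole 4 GB PA space:
    assert_eq!(1000 * flat_table_bytes(), 4_194_304_000);
}
```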
---
## Two-Level: Split the VPN
```text
31 22 21 12 11 0
+---------------------+--------------------+----------------+
| L1 index (10) | L0 index (10) | offset (12) |
+---------------------+--------------------+----------------+
```
- Root (L1): 1024 × 4 B = **4 KB**
- Each leaf (L0): 1024 × 4 B = **4 KB**
A leaf table only exists for regions actually used.
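Extracting the two indices is just shifts and masks; as a sketch (hypothetical helpers, not octox code):

```rust
// Two-level index extraction for the 32-bit toy architecture.
const PGSHIFT: usize = 12;                   // 4 KB pages
const IDX_BITS: usize = 10;                  // 1024 entries per table
const IDX_MASK: usize = (1 << IDX_BITS) - 1; // 0x3FF

fn l0(va: usize) -> usize { (va >> PGSHIFT) & IDX_MASK }
fn l1(va: usize) -> usize { (va >> (PGSHIFT + IDX_BITS)) & IDX_MASK }

fn main() {
    // VA 0x0040_2ABC: L1 index 1, L0 index 2, offset 0xABC
    let va = 0x0040_2ABC;
    assert_eq!(l1(va), 1);
    assert_eq!(l0(va), 2);
    assert_eq!(va & 0xFFF, 0xABC);
}
```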
---
## Two-Level: The Trie
```mermaid
flowchart LR
    VA["VA: [L1=4][L0=2][off]"] --> ROOT
    ROOT["Root PT
1024 entries"] -->|"L1=4"| LEAF["Leaf PT 4
1024 entries"]
    ROOT -.->|"L1=0..3, 5..1023
(NULL)"| X["(no leaf
allocated)"]
    LEAF -->|"L0=2"| PAGE["data page
(4 KB)"]
```
NULL at any level → whole sub-region unmapped, no table needed.
---
## Two-Level: The Savings
Process: 1 MB code at `0x0000_0000`, 64 KB stack near `0xFFFF_0000`.
| Component | Size |
|------------------|----------------------------|
| Root PT | 4 KB |
| 1 leaf for code | 4 KB |
| 1 leaf for stack | 4 KB |
| **Total** | **12 KB** (vs. 4 MB flat) |
~300× smaller. Cost: two memory loads per translation
— which is exactly what the TLB will fix.
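A quick sanity check of the table above (throwaway code, not octox): count which L1 slots the two regions touch, then add the root.

```rust
use std::collections::BTreeSet;

const LEAF_SPAN: usize = 1024 * 4096; // one L0 table maps 4 MB of VA
const PT_PAGE: usize = 4096;          // every table is one 4 KB page

// Count distinct L1 indices touched by each (start, len) region.
fn leaves_needed(regions: &[(usize, usize)]) -> usize {
    regions
        .iter()
        .flat_map(|&(start, len)| (start / LEAF_SPAN)..=((start + len - 1) / LEAF_SPAN))
        .collect::<BTreeSet<_>>()
        .len()
}

fn main() {
    // 1 MB code at 0, 64 KB stack near 0xFFFF_0000
    let leaves = leaves_needed(&[(0, 1 << 20), (0xFFFF_0000, 64 << 10)]);
    assert_eq!(leaves, 2);                     // code leaf + stack leaf
    assert_eq!(PT_PAGE * (1 + leaves), 12 << 10); // root + 2 leaves = 12 KB
}
```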
---
## The TLB
A hardware cache of recent VPN → PPN mappings.
- Hit rate typically > 99%.
- Miss ⇒ hardware walks the page table; result is inserted.
- OS edits the PT ⇒ OS must invalidate stale TLB entries.
```asm
sfence.vma zero, zero # flush all TLB entries
```
Forgetting `sfence.vma` after a PTE edit ⇒ the TLB serves
stale translations ⇒ a "works most of the time" Heisenbug.
---
## From the Toy to RISC-V Sv39
A flat 64-bit table is petabytes. Two-level is too big.
**Sv39:** 39-bit VAs, **3 levels**, **9-bit indices**.
```text
38 30 29 21 20 12 11 0
+----------+-----------+-----------+----------------+
| L2 (9) | L1 (9) | L0 (9) | offset (12) |
+----------+-----------+-----------+----------------+
bits 63..39 must equal bit 38 (sign-extended canonical addresses)
```
- 8-byte PTEs ⇒ one PT page = 512 × 8 = **4 KB**.
- 2^39 bytes = **512 GB** of virtual reach.
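Pulling the three 9-bit indices out of a VA is again shifts and masks; a standalone sketch (not octox's actual code):

```rust
// Decompose a 39-bit Sv39 VA into (L2, L1, L0, offset).
const PGSHIFT: usize = 12;

fn sv39_split(va: usize) -> (usize, usize, usize, usize) {
    let idx = |lvl: usize| (va >> (PGSHIFT + 9 * lvl)) & 0x1FF;
    (idx(2), idx(1), idx(0), va & 0xFFF)
}

fn main() {
    // Build a VA with L2=1, L1=2, L0=3, offset=0x45, then take it apart.
    let va = (1 << 30) | (2 << 21) | (3 << 12) | 0x45;
    assert_eq!(sv39_split(va), (1, 2, 3, 0x45));
}
```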
---
## Sv39: The Three-Level Trie
```mermaid
flowchart LR
    VA["39-bit VA"] --> L2T["L2 PT
512 entries
4 KB"]
    L2T -->|"px(2)"| L1T["L1 PT
4 KB"]
    L1T -->|"px(1)"| L0T["L0 PT
4 KB"]
    L0T -->|"px(0)"| PAGE["4 KB data page"]
```
Same trie idea as the toy architecture, just deeper.
---
## Sv39 PTE Layout
```text
63 54 53 10 9 8 7 6 5 4 3 2 1 0
+----------+-------------------------------------------+----+--+-+-+-+-+-+-+-+
| reserved | PPN (44 bits) | RSW | D|A|G|U|X|W|R|V|
+----------+-------------------------------------------+----+--+-+-+-+-+-+-+-+
```
| Bit | Name | Meaning |
|-----|------|---------|
| 0 | `V` | valid |
| 1..3 | `R/W/X` | readable / writable / executable |
| 4 | `U` | user-mode accessible |
| 5..7 | `G/A/D` | global / accessed / dirty |
**Leaf** PTE: any of `R`, `W`, `X` set. **Interior** PTE: only `V`.
---
## The `satp` Register
```text
63 60 59 44 43 0
+--------+----------+-------------------------------------------------+
| MODE | ASID | PPN of root PT |
+--------+----------+-------------------------------------------------+
```
```rust
pub fn make(mode: Mode, asid: usize, pa: usize) -> usize {
    let mut bits = 0;
    bits |= (mode as usize) << 60; // Sv39 = 8
    bits |= asid << 44;
    bits |= pa >> 12;              // PPN = physical byte addr >> 12
    bits
}
```
`csrw satp, x` switches address spaces. **Every** trap boundary does this.
`riscv.rs:357-408`.
---
## Octox: Typed Addresses
```rust
#[repr(transparent)] pub struct PAddr(usize);
#[repr(transparent)] pub struct KVAddr(pub usize);
#[repr(transparent)] pub struct UVAddr(usize);
```
Zero-cost newtypes — identical layout to `usize`.
The compiler refuses to walk a `PageTable<UVAddr>`
with a `KVAddr`. A whole class of OS bugs caught at
compile time.
`vm.rs:24-42`.
---
## Octox: `VAddr::px()`
```rust
pub trait VAddr: Addr + Debug {
    const PXMASK: usize = 0x1FF; // 9 bits
    const MAXVA: usize = 1 << 38;
    fn px(&self, level: usize) -> usize;
}
// the implementation (same for UVAddr and KVAddr):
fn px(&self, level: usize) -> usize {
    (self.0 >> (PGSHIFT + 9 * level)) & Self::PXMASK
}
```
- `va.px(2)` → bits 38..30
- `va.px(1)` → bits 29..21
- `va.px(0)` → bits 20..12
`vm.rs:99-108, 159-174`.
---
## Octox: `PageTable` & `RawPageTable`
```rust
#[repr(C, align(4096))]
struct RawPageTable {
    entries: [PageTableEntry; 512],
}
pub struct PageTable<V: VAddr> {
    ptr: *mut RawPageTable,
    _marker: PhantomData<V>,
}
```
- `align(4096)` ⇒ the MMU's required page alignment.
- 512 × 8 = exactly one page.
- `PhantomData` carries the address-kind type at zero cost.
`vm.rs:214-218, 286-289`.
---
## Octox: `PageTableEntry`
```rust
#[repr(transparent)]
pub struct PageTableEntry(usize);

impl PageTableEntry {
    pub fn is_v(&self) -> bool { self.0 & PTE_V != 0 }
    pub fn is_leaf(&self) -> bool { (self.0 & (PTE_R | PTE_W | PTE_X)) != 0 }
    pub fn flags(&self) -> usize { self.0 & 0x3FF }
    pub fn to_pa(&self) -> PAddr { ((self.0 >> 10) << 12).into() }
    pub fn set(&mut self, pa: usize, attr: usize) {
        self.0 = ((pa >> 12) << 10) | attr;
    }
}
```
PPN packed in bits 53..10; flags in 9..0. `set` and `to_pa` do the
shift dance.
`vm.rs:247-283`.
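To see the shift dance round-trip, here is a self-contained restatement of `set` / `to_pa` (free functions for illustration; the `PTE_*` values follow the RISC-V spec):

```rust
// Flag bits per the RISC-V privileged spec.
const PTE_V: usize = 1 << 0;
const PTE_R: usize = 1 << 1;
const PTE_W: usize = 1 << 2;

// Pack a physical address + flags into a PTE, and unpack it again.
fn pte_set(pa: usize, attr: usize) -> usize { ((pa >> 12) << 10) | attr }
fn pte_to_pa(pte: usize) -> usize { (pte >> 10) << 12 }

fn main() {
    let pa = 0x8020_3000; // a page-aligned physical address
    let pte = pte_set(pa, PTE_R | PTE_W | PTE_V);
    assert_eq!(pte_to_pa(pte), pa);                  // PA round-trips exactly
    assert_eq!(pte & 0x3FF, PTE_R | PTE_W | PTE_V);  // flags land in bits 9..0
}
```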
---
## Octox: `walk()` — the Walker
```rust
pub fn walk(&mut self, va: V, alloc: bool) -> Option<&mut PageTableEntry> {
    let mut pagetable = self.ptr;
    if va.into_usize() >= V::MAXVA { panic!("walk"); }
    unsafe {
        for level in (1..3).rev() {
            let pte = (*pagetable).get_mut(va.px(level))?;
            if pte.is_v() {
                pagetable = pte.to_pa().into_usize() as *mut RawPageTable;
            } else {
                if !alloc { return None; }
                pagetable = RawPageTable::try_new_zeroed()?;
                pte.set(pagetable as usize, PTE_V);
            }
        }
        (*pagetable).get_mut(va.px(0))
    }
}
```
`vm.rs:310-343`. The whole 3-level descent — ~30 lines.
---
## `walk()` — What It Does
```mermaid
sequenceDiagram
    participant W as walk(va, alloc)
    participant L2 as L2 (root)
    participant L1 as L1
    participant L0 as L0
    W->>L2: index px(2)
    L2-->>W: PTE_V → PA of L1
    W->>L1: index px(1)
    L1-->>W: PTE_V → PA of L0
    W->>L0: index px(0)
    L0-->>W: &mut leaf PTE
```
- Iterate L2, L1; at each, follow or `alloc` an interior PTE.
- Return `&mut` to the leaf at L0.
- Caller fills in the leaf's PA + flags.
---
## Octox: `mappages()`
```rust
pub fn mappages(&mut self, mut va: V, mut pa: PAddr,
                size: usize, perm: usize) -> Result<()> {
    let mut last = va + size - 1;
    va.rounddown();
    last.rounddown();
    loop {
        let pte = self.walk(va, true).ok_or(OutOfMemory)?;
        if pte.is_v() { panic!("mappages: remap"); }
        pte.set(pa.into_usize(), perm | PTE_V);
        if va == last { break; }
        va += PGSIZE;
        pa += PGSIZE;
    }
    Ok(())
}
```
Loop over pages → `walk(alloc=true)` → set leaf.
Builds **every** mapping in the system. `vm.rs:360-385`.
---
## Octox Memory Layout
```text
KVAddr::MAXVA +-----------------------------+
| TRAMPOLINE (4 KB) | same VA in EVERY PT
MAXVA - PGSZ +-----------------------------+
| TRAPFRAME (per proc) |
+-----------------------------+
| ... |
+-----------------------------+
KERNBASE | kernel text + data + RAM |
0x8000_0000 +-----------------------------+
| MMIO (UART, virtio, PLIC) |
0x0000_0000 +-----------------------------+
```
`memlayout.rs:1-82`. `KERNBASE = 0x8000_0000`,
`PHYSTOP = KERNBASE + 512 MiB`.
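For reference, the top-of-space constants can be written out as below (a sketch using the values on this slide; check `memlayout.rs` for octox's exact definitions):

```rust
// Top of the Sv39 address space as used here (see the VAddr trait's MAXVA).
const PGSIZE: usize = 4096;
const MAXVA: usize = 1 << 38;
const TRAMPOLINE: usize = MAXVA - PGSIZE;   // highest page, same VA everywhere
const TRAPFRAME: usize = TRAMPOLINE - PGSIZE; // one page below it
const KERNBASE: usize = 0x8000_0000;
const PHYSTOP: usize = KERNBASE + 512 * 1024 * 1024; // 512 MiB of RAM

fn main() {
    assert_eq!(TRAMPOLINE, 0x3F_FFFF_F000);
    assert_eq!(TRAPFRAME, 0x3F_FFFF_E000);
    assert_eq!(PHYSTOP, 0xA000_0000);
}
```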
---
## Building the Kernel PT
```rust
unsafe fn make(&mut self) {
    self.map(UART0.into(), UART0.into(), PGSIZE, PTE_R | PTE_W);
    self.map(VIRTIO0.into(), VIRTIO0.into(), PGSIZE, PTE_R | PTE_W);
    self.map(PLIC.into(), PLIC.into(), 0x40_0000, PTE_R | PTE_W);
    // kernel text: read + execute
    self.map(KERNBASE.into(), KERNBASE.into(),
             (etext as usize) - KERNBASE, PTE_R | PTE_X);
    // kernel data + RAM: read + write
    self.map((etext as usize).into(), (etext as usize).into(),
             PHYSTOP - (etext as usize), PTE_R | PTE_W);
    // trampoline at MAXVA - PGSIZE
    self.map(TRAMPOLINE.into(), (trampoline as usize).into(),
             PGSIZE, PTE_R | PTE_X);
    PROCS.mapstacks();
}
```
**No** `PTE_U` — user mode cannot reach kernel pages. `vm.rs:658-694`.
---
## Turning On Paging
```rust
pub fn kinit() {
    KVM.set(Kvm::new().unwrap()).unwrap();
    KVM.get_mut().unwrap().make();             // build mappings
}
pub fn kinithart() {
    satp::write(KVM.get().unwrap().as_satp()); // ← THE moment
    sfence_vma();                              // flush stale TLB
}
```
`as_satp()` is `satp::make(Sv39, 0, root_pt_pa)`.
After `kinithart`, every fetch and load is translated through the
kernel PT. `vm.rs:698-714`.
---
## Building a User PT
```rust
let mut uvm = Uvm::create()?; // root PT only
// Trampoline: kernel-only (no PTE_U), R + X
uvm.mappages(UVAddr::from(TRAMPOLINE),
             PAddr::from(trampoline as usize),
             PGSIZE, PTE_R | PTE_X)?;
// Trapframe: kernel-only (no PTE_U), R + W
uvm.mappages(UVAddr::from(TRAPFRAME),
             PAddr::from(/* per-proc trapframe page */),
             PGSIZE, PTE_R | PTE_W)?;
```
Empty user space below; `exec` or `fork` will fill it in.
`proc.rs:445-478`.
---
## `fork()` — Deep-Copy User PT
```rust
let p_uvm = p_data.uvm.as_mut().unwrap();
let c_uvm = c_data.uvm.as_mut().unwrap();
p_uvm.copy(c_uvm, p_data.sz)?; // walk parent, alloc child pages
```
`Uvm::copy()` (`vm.rs:517-555`):
```rust
while va < size {
    let pte = self.walk(va, false).expect("uvmcopy");
    let pa = pte.to_pa();
    let flags = pte.flags();
    let mem = Page::try_new_zeroed()?;
    *mem = (*(pa.into_usize() as *mut Page)).clone(); // memcpy
    new.mappages(va, mem.into(), PGSIZE, flags)?;
    va += PGSIZE;
}
```
No copy-on-write — teaching kernel.
---
## `exec()` — Build, Then Atomic Swap
```rust
let mut new_uvm = p.uvmcreate()?;             // fresh user PT
for ph in elf.PT_LOAD_segments() {
    new_uvm.alloc(sz, ph.p_vaddr + ph.p_memsz, perm)?;
    new_uvm.loadseg(ph.p_vaddr.into(), &mut inode,
                    ph.p_offset, ph.p_filesz)?;
}
// allocate stack, push argv/envp ...
let old_uvm = proc_data.uvm.replace(new_uvm); // ← THE commit
old_uvm.proc_uvmfree(oldsz);
```
Everything before `replace` is reversible by dropping `new_uvm`.
`exec.rs:57-200`.
---
## Three Page Tables in Play
```mermaid
flowchart LR
    A1["UVM A"] -->|"trampoline same VA, same PA"| TRAMP["trampoline page"]
    B1["UVM B"] -->|"trampoline same VA, same PA"| TRAMP
    K1["KVM"] -->|"trampoline same VA, same PA"| TRAMP
    A1 -->|"trapframe A"| TFA["TF A"]
    B1 -->|"trapframe B"| TFB["TF B"]
```
`satp` says *which one is active right now*.
---
## The Crux: Switching `satp` at Trap Time
The CPU keeps fetching instructions across `csrw satp`.
The instruction **after** the swap had better still be mapped at the
same virtual address.
Solution: place `uservec` / `userret` on a
"trampoline" page, and map that page at the same VA
in every page table. Same physical page; same VA. The fetch keeps
working.
---
## `uservec`: User → Kernel
```asm
csrw sscratch, a0 # park user a0
li a0, TRAPFRAME # (same VA in user and kernel PT!)
sd ra, 40(a0) # save 31 user GPRs
... (29 more sd's)
csrr t0, sscratch
sd t0, 112(a0) # save user a0 last
ld t1, 0(a0) # kernel_satp from trapframe
sfence.vma zero, zero # invalidate stale user TLB
csrw satp, t1 # ← THE SWITCH
sfence.vma zero, zero # publish new translations
ld sp, 8(a0) # kernel_sp
ld t0, 16(a0) # usertrap addr
jr t0 # next instruction is in kernel PT now
```
`trampoline.rs:22-89`.
---
## `userret`: Kernel → User
```asm
sfence.vma zero, zero # flush kernel TLB
csrw satp, a0 # a0 = user satp
sfence.vma zero, zero
li a0, TRAPFRAME
ld ra, 40(a0) # restore 31 user GPRs
... (29 more ld's)
ld a0, 112(a0) # restore user a0 LAST
sret # PC ← sepc, mode ← U
```
Mirror image of `uservec`. Two `sfence.vma`s bracket the swap.
`trampoline.rs:95-149`.
---
## Why TWO `sfence.vma`s?
```asm
sfence.vma zero, zero # 1: flush stale entries from OLD PT
csrw satp, t1 # switch
sfence.vma zero, zero # 2: force consult NEW PT
```
- The **first** kills entries from the outgoing PT.
- The **second** prevents speculative use of any entry the MMU
cached *before* the switch.
Skip either → bug surfaces only when a pre-cached translation
gets reused. The kind of bug that hits production once an hour.
---
## Key Concepts Recap
| Concept | Takeaway |
|---------|----------|
| Pages, VPN, offset | Translate page numbers, leave the offset alone |
| Single-level cost | 4 MB per process even for tiny programs |
| Multi-level / trie | Allocate leaf tables only where used |
| TLB & `sfence.vma` | Cache + explicit invalidation on PT edits |
| Sv39 layout | 9 + 9 + 9 + 12 bits, 8-byte PTEs, 4 KB tables |
| `satp` register | `[mode 4][asid 16][PPN 44]`; switch = address space change |
| Octox typed addrs | `PAddr` / `KVAddr` / `UVAddr` checked at compile time |
| `walk()` | 3-level descent, allocates interior tables on demand |
| `mappages()` | Loop → walk(alloc) → set leaf PTE |
| Trampoline same VA | Lets fetches keep working across `csrw satp` |
---
## Practice Problems
1. **Decode** `va = 0x0000_0040_0123_45AB` into Sv39 indices and
offset.
2. **Count PT pages** for a process with 8 KB code at VA `0`,
1 page of stack near `MAXVA`, plus trampoline + trapframe.
3. **Trace `walk()`** from a fresh `Uvm` (only trampoline + trapframe)
when `mappages(va=0, pa=0x80100000, PGSIZE, R|W|U)` is called.
Which intermediate PTEs become valid?
4. **Why same-VA trampoline?** Which instruction would fault if we
mapped it at different VAs in user vs kernel PT?
(Full solutions in the lecture notes.)
---
## Further Reading
- [Octox Guide](../guides/octox-guide.md) §4.4 — Sv39 page
tables; complements today's source-level detail.
- [Octox Guide](../guides/octox-guide.md) §4.2 —
trampoline / kernel layout.
- [OS Kernel: Inside Octox](13-cs631-2026-04-30-os-kernel.md) —
uses today's machinery; revisit it once Sv39 is comfortable.
- xv6 book ch. 3 — "Page tables." Octox is a Rust port; ideas 1:1.
- RISC-V Privileged Spec, §4.3 (Sv39), §4.1.11 (`satp`).
- Octox: `src/kernel/vm.rs`, `riscv.rs`, `memlayout.rs`,
`trampoline.rs`. End-to-end read ≈ 1 hour.