# OS Kernel: Inside Octox

### CS 631 Systems Foundations — Apr 30, 2026

Last time: the user side of the syscall interface. Today: the kernel side of the same boundary.

---

## Today's Agenda

1. Privilege levels (M / S / U)
2. Process isolation: page tables & TRAMPOLINE
3. The system-call mechanism (`ecall` round-trip)
4. Mode switching: trapframe, `sscratch`, `sfence.vma`
5. Context switching: `swtch`, scheduler, `fork_ret`
6. §8.1 walkthrough — `getpid`
7. §8.2 walkthrough — `fork` + `wait`
8. §8.3 walkthrough — `exec`
9. §8.4 walkthrough — timer preemption

---

## Three Privilege Levels

| Mode | Name | Who runs here |
|------|------|---------------|
| **M** | Machine | firmware: timer, delegate to S |
| **S** | Supervisor | kernel (Octox `usertrap`, scheduler) |
| **U** | User | every application |

The mode bit lives in `mstatus` / `sstatus`. Only a **trap** (in) or `mret` / `sret` (out) moves the CPU between levels.
That single fact is what makes process isolation possible.
---

## Why We Want an OS

- **Isolation.** A buggy program cannot corrupt another program.
- **Arbitration.** Only the kernel speaks to disk, timer, console.
- **Fault containment.** A page fault in U is recoverable (kill the process); the same fault in S is a kernel panic.

Octox boots in **M**, programs the timer, then `mret`s to **S**. User processes run in **U** and trap *up* to S whenever they need a privileged operation.

---

## Two Page Tables, One Kernel

Every process has its own `Uvm` (user page table). The kernel has one `Kvm` of its own. The `satp` register selects which is active.
```mermaid
flowchart LR
    P1["proc A user"] -->|"satp=A"| PTA["Uvm A"]
    P2["proc B user"] -->|"satp=B"| PTB["Uvm B"]
    K["kernel S-mode"] -->|"satp=K"| PTK["Kvm"]
    PTA --> PHYS["physical RAM"]
    PTB --> PHYS
    PTK --> PHYS
```
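The picture above can be reduced to a toy model — a hypothetical `translate` over plain hash maps, not Octox code — where `satp` is simply "which table is active" and a per-page user bit gates U-mode access:

```rust
use std::collections::HashMap;

// Toy model (not Octox code): a "page table" maps a virtual page number to
// (physical page number, user-accessible?). `satp` in this sketch is just a
// reference to whichever table is currently active.
pub type PageTable = HashMap<usize, (usize, bool)>;

pub const PGSIZE: usize = 4096;

// A U-mode access to a page without the user bit faults (returns None),
// exactly like a PTE that lacks PTE_U.
pub fn translate(satp: &PageTable, va: usize, user_mode: bool) -> Option<usize> {
    let (pa_page, user_ok) = *satp.get(&(va / PGSIZE))?;
    if user_mode && !user_ok {
        return None; // page fault: kernel-only page reached from U-mode
    }
    Some(pa_page * PGSIZE + va % PGSIZE)
}
```

Switching `satp` between two such tables is all it takes to give A and B disjoint views of the same physical RAM.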
PTE flag `PTE_U` = "this page is reachable from user mode." The kernel's pages have **no** `PTE_U`.

---

## The Memory Map

```text
MAXVA       +-----------------------------+
            | TRAMPOLINE (asm trap)       |  <- mapped in EVERY PT
MAXVA-PGSZ  +-----------------------------+
            | TRAPFRAME (per proc)        |
            +-----------------------------+
            | guard page (no PTE_U)       |
            | user stack                  |
            | heap                        |
            | .bss / .data                |
            | .text                       |
0x0         +-----------------------------+
```

Same skeleton in every process — only `.text/.data/...` differ.

---

## PTE Bits Cheat Sheet

| Bit | Name | Meaning |
|-----|------|---------|
| 0 | `V` | valid (entry is in use) |
| 1 | `R` | readable |
| 2 | `W` | writable |
| 3 | `X` | executable |
| 4 | `U` | user-mode accessible |
| 5 | `G` | global (shared across address spaces) |
| 6 | `A` | accessed (set by hardware) |
| 7 | `D` | dirty (set by hardware) |

Defined in `riscv.rs:615`. Octox uses `V|R|X` for code, `V|R|W` for data, plus `|U` if user-visible.

---

## The TRAMPOLINE Trick

The page that contains `uservec` and `userret` — the asm that swaps `satp` — must be mapped at the **same virtual address** in every page table.

Why?
The CPU is fetching instructions while we change `satp`. If the next instruction lives at a different VA in the new PT, the very next fetch lands in unmapped memory.
Octox places it at `MAXVA - PGSIZE` and maps it identically in `Kvm::make()` (`vm.rs:685`) and every `uvmcreate()` (`proc.rs:445`).

---

## `ecall` — The Trap Instruction

A user program executes:

```rust
sys::getpid()
// wraps:
//   li a7, 11
//   ecall
//   (return value in a0)
```

**What hardware does on `ecall`:**

1. Save user PC into `sepc`.
2. Set `scause = UserEnvCall`.
3. Switch CPU privilege from U to S.
4. Jump to the address in `stvec` — that's `uservec`.

The kernel programmed `stvec` at boot. From here on, *software* is in charge.

---

## The Trapframe

Per-process page (`proc.rs:170`). The leading fields (`kernel_satp`, `kernel_sp`, `kernel_trap`, `kernel_hartid`) are kernel scratch — populated *before* the process runs.
| off | field | populated by |
|-----|-------|--------------|
| 0 | `kernel_satp` | kernel before sret |
| 8 | `kernel_sp` | kernel before sret |
| 16 | `kernel_trap` (= `usertrap`) | kernel before sret |
| 24 | `epc` | kernel saves on entry |
| 32 | `kernel_hartid` | kernel before sret |
| 40 | `ra` | uservec saves on entry |
| 48 | `sp` | uservec saves on entry |
| ... | ... | ... |
| 112 | `a0` | uservec via `sscratch` |
| ... | (a1..t6) | uservec saves on entry |
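The offsets in this table are just `#[repr(C)]` field positions. A hosted sanity-check sketch — field names between `sp` and `a0` are an assumption based on the standard xv6-style register order, and a 64-bit target is assumed:

```rust
// Hosted sketch, not Octox source. With #[repr(C)] and usize fields on a
// 64-bit target, field i sits at byte offset 8*i — the numbers uservec uses.
#[repr(C)]
pub struct TrapFrame {
    pub kernel_satp: usize,   // offset 0
    pub kernel_sp: usize,     // offset 8
    pub kernel_trap: usize,   // offset 16
    pub epc: usize,           // offset 24
    pub kernel_hartid: usize, // offset 32
    pub ra: usize,            // offset 40
    pub sp: usize,            // offset 48
    pub gp: usize,            // fields from here to a0 assumed (xv6 order)
    pub tp: usize,
    pub t0: usize,
    pub t1: usize,
    pub t2: usize,
    pub s0: usize,
    pub s1: usize,
    pub a0: usize,            // offset 112 — matches `sd t0, 112(a0)`
    // a1..a7, s2..s11, t3..t6 would follow
}
```

If the asserted offsets ever drift from the asm constants, traps corrupt registers silently — which is why pinning the layout with `#[repr(C)]` matters.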
---

## `uservec` — Saving User Registers

```asm
csrrw a0, sscratch, a0  # atomic swap: a0 ↔ sscratch
                        # (sscratch held TRAPFRAME VA)
sd ra, 40(a0)           # save user ra
sd sp, 48(a0)           # save user sp
sd gp, 56(a0)
... (28 more sd's)
csrr t0, sscratch       # recover original user a0
sd t0, 112(a0)          # save it last
```

`sscratch` was the only "free" register on entry. Swapping it with `a0` gives us a working pointer to the trapframe.

---

## `uservec` — Switching satp

```asm
ld sp, 8(a0)            # kernel_sp
ld tp, 32(a0)           # kernel_hartid
ld t0, 16(a0)           # kernel_trap (= usertrap)
ld t1, 0(a0)            # kernel_satp from trapframe
                        # (all loads through a0 must happen BEFORE the
                        #  switch — the trapframe isn't mapped in Kvm)
sfence.vma zero, zero   # invalidate stale user TLB
csrw satp, t1           # install kernel page table
sfence.vma zero, zero   # publish new translations
jr t0
```
The two `sfence.vma`s bracket the swap. The TRAMPOLINE is mapped at the same VA in both PTs — that's why the next instruction fetch still works.
---

## `usertrap` — Rust Side

```rust
fn usertrap() {
    // we are in S-mode on the kernel page table
    let p = myproc();
    let tf = p.trapframe_mut();
    tf.epc = sepc::read();  // save user PC
    tf.epc += 4;            // return PAST the ecall
    intr_on();              // re-enable interrupts
    match scause::read().cause() {
        Trap::Exception(UserEnvCall) => syscall(),
        Trap::Interrupt(_) => devintr_dispatch(),
        ...
    }
    usertrap_ret();
}
```

`trap.rs:44`. `epc += 4` is critical: without it, `sret` would just re-execute the `ecall` forever.

---

## The Syscall Dispatch Table

```rust
pub const TABLE: [(Fn, &str); 24] = [
    (Fn::N(Self::invalid), ""),     // 0
    (Fn::I(Self::fork), "()"),      // 1
    (Fn::I(Self::exit), "(i32)"),   // 2
    (Fn::I(Self::wait), "(i32*)"),  // 3
    ...
    (Fn::I(Self::exec), "(s,a,e)"), // 7
    ...
    (Fn::I(Self::getpid), "()"),    // 11
];

pub fn syscall() {
    let n = tf.a7;
    tf.a0 = TABLE[n].0.call() as usize;
}
```

`syscall.rs:115`. The user-side stubs (`gen_usys`, `syscall.rs:837`) are generated from the **same enum** — one source of truth.

---

## `userret` & `sret` — The Return Path

```asm
csrw satp, a0           # install user PT (a0 = user satp)
sfence.vma zero, zero
ld ra, 40(a1)           # restore user GPRs from trapframe
ld sp, 48(a1)
... (28 more ld's)
csrw sscratch, a1       # park trapframe VA for next entry
ld a0, 112(a1)          # restore user a0 LAST
sret                    # PC <- sepc, mode <- U
```

`sret` reads `sepc` (which we set to `tf.epc`) and lands the CPU exactly one instruction past the `ecall`, in user mode, with `a0` holding the return value.

---

## The Context Struct

`proc.rs:136`. Just **14 callee-saved registers**:

```rust
#[repr(C)]
pub struct Context {
    pub ra: usize,
    pub sp: usize,
    pub s0: usize, pub s1: usize, pub s2: usize,  pub s3: usize,
    pub s4: usize, pub s5: usize, pub s6: usize,  pub s7: usize,
    pub s8: usize, pub s9: usize, pub s10: usize, pub s11: usize,
}
```

Caller-saved regs (`t0..t6`, `a0..a7`) are **not** here — the compiler already spilled anything live across the `swtch` call.
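The 14 slots above line up exactly with the offsets `swtch` hard-codes (`0`, `8`, `16`, …, `104`). A quick hosted-Rust sanity check of that claim — not kernel code, 64-bit target assumed:

```rust
// Hosted sanity check: with #[repr(C)] and usize fields on a 64-bit target,
// field i of Context sits at byte offset 8*i — the constants in swtch.
#[repr(C)]
pub struct Context {
    pub ra: usize,  // sd ra, 0(a0)
    pub sp: usize,  // sd sp, 8(a0)
    pub s0: usize,  // sd s0, 16(a0)
    pub s1: usize,  // sd s1, 24(a0)
    pub s2: usize, pub s3: usize, pub s4: usize, pub s5: usize,
    pub s6: usize, pub s7: usize, pub s8: usize, pub s9: usize,
    pub s10: usize,
    pub s11: usize, // sd s11, 104(a0)
}
```

Fourteen fields × 8 bytes = 112 bytes total, which is why `swtch` is exactly 14 stores, 14 loads, and a `ret`.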
---

## `swtch` — All 28 Instructions

```asm
swtch:                  # fn swtch(old: *mut Context,
                        #          new: *const Context)
    sd ra, 0(a0)        # save outgoing
    sd sp, 8(a0)
    sd s0, 16(a0)
    sd s1, 24(a0)
    ... (s2..s11)       # 10 more sd's
    ld ra, 0(a1)        # load incoming
    ld sp, 8(a1)
    ld s0, 16(a1)
    ld s1, 24(a1)
    ... (s2..s11)       # 10 more ld's
    ret                 # JUMPS TO NEW ra
```

The `ret` pops the incoming `ra` — we return wherever the *other* thread last called `swtch`.

---

## The Scheduler Loop

```rust
fn scheduler(c: &mut Cpu) -> ! {
    loop {
        intr_on();  // ← enabled while scanning
        for p in PROCS.pool.iter() {
            let mut inner = p.inner.lock();
            if inner.state == State::Runnable {
                inner.state = State::Running;
                c.proc = Some(p.clone());
                swtch(&mut c.context, &p.data().context); // ← lock held!
                c.proc = None;
                // (lock released by scope exit on resume)
            }
        }
    }
}
```

`proc.rs:662`. One scheduler thread **per hart**.

---

## The Lock-Across-`swtch` Invariant

`sched()` (`proc.rs:692`) asserts:

```rust
assert!(c.noff == 1, "sched: multiple locks");
assert!(!intr_get(), "sched interruptible");
```

Exactly **one** lock held: this proc's `inner` lock. Released by whichever code resumes on the other side:

- previously-yielded: scope exit of `sched`
- brand-new fork: `fork_ret::force_unlock`
Without it, two harts could observe the same Runnable proc and both swtch into it — one process running on two CPUs sharing one kernel stack. Catastrophic.
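A toy spinlock — hypothetical, not Octox's `Mutex` — makes the "resumer releases" discipline concrete: `force_unlock` releases a lock that some *other* code path acquired, which is exactly what `fork_ret` must do on a brand-new kernel stack with no guard object in scope:

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Toy illustration only. The scheduler acquires the lock, swtches away, and
// whoever resumes on the other side is responsible for releasing it.
pub struct SpinLock {
    locked: AtomicBool,
}

impl SpinLock {
    pub const fn new() -> Self {
        SpinLock { locked: AtomicBool::new(false) }
    }

    /// Try to take the lock; false if someone else holds it.
    pub fn try_acquire(&self) -> bool {
        self.locked
            .compare_exchange(false, true, Ordering::Acquire, Ordering::Relaxed)
            .is_ok()
    }

    /// What fork_ret needs: release a lock acquired by *other* code
    /// (the scheduler), with no guard object on this stack.
    pub fn force_unlock(&self) {
        self.locked.store(false, Ordering::Release);
    }
}
```

While the lock is held across the `swtch`, no other hart's scheduler can mark the same proc Running — that is the whole invariant.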
---

## `fork_ret` — Where a New Process Starts

A brand-new child has an empty kernel stack — there is no "previous swtch" to return through. `fork()` plants `context.ra = fork_ret`. So `swtch`'s final `ret` jumps straight there.

```rust
extern "C" fn fork_ret() {
    let p = myproc();
    p.inner.force_unlock();      // release scheduler's lock
    if FIRST_PROC { fsinit(); }  // one-shot
    usertrap_ret();              // → userret → sret
}
```

`proc.rs:619`. First user instruction = right after the parent's `ecall`.

---

## `myproc()` and the IntrLock

"Which process am I?" lives in `Cpu::proc` for *this hart*, found via the `tp` register.

```rust
pub fn myproc() -> Arc<Proc> {
    let _intr = lock_mycpu();  // intr_off + IntrLock
    let id = cpu_id();
    CPUS.0[id].proc.clone().unwrap()
    // _intr drops here → intr_on
}
```

If a timer fires between reading `tp` and the index, the scheduler can migrate the thread — `tp` becomes stale — we'd return the *wrong* process. `proc.rs:72`.

---

## §8.1 — `getpid()`

```rust
fn main() {
    let pid = sys::getpid().unwrap();
    println!("my pid is {}", pid);
}
```

The simplest end-to-end trace. No arguments, no side effects, one register field read. Every step here happens in **every other syscall too**.

---

## `getpid` — Sequence
```mermaid
sequenceDiagram
    participant U as user
    participant TR as uservec
    participant UT as usertrap
    participant SC as syscall
    participant H as SysCalls::getpid
    participant P as Proc
    U->>TR: ecall a7=11
    TR->>TR: save regs, swap satp
    TR->>UT: jump usertrap
    UT->>UT: tf.epc=sepc+4, intr_on
    UT->>SC: syscall()
    SC->>H: TABLE[11].0.call()
    H->>P: myproc().pid()
    P-->>H: pid
    H-->>SC: Ok(pid)
    SC-->>UT: tf.a0 = pid
    UT->>TR: usertrap_ret → userret
    TR->>U: sret, a0 = pid
```
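The `TABLE[11].0.call()` hop in the middle of the diagram is plain array indexing on `a7`. A toy version — simplified signatures, not Octox's `Fn` enum, and a made-up pid:

```rust
// Toy dispatch: a7 indexes a table of handlers; the result is what the
// kernel writes back into tf.a0. Slot numbering mirrors the lecture's TABLE.
type Handler = fn() -> usize;

fn sys_invalid() -> usize { usize::MAX }
fn sys_getpid() -> usize { 3 } // pretend this process has pid 3

// Index 11 = getpid, as in the lecture's table; other slots stubbed out.
const TABLE: [Handler; 12] = [
    sys_invalid, sys_invalid, sys_invalid, sys_invalid,
    sys_invalid, sys_invalid, sys_invalid, sys_invalid,
    sys_invalid, sys_invalid, sys_invalid, sys_getpid,
];

fn syscall(a7: usize) -> usize {
    TABLE[a7]() // dispatched result lands in tf.a0
}
```

One array, one index, one indirect call — that is the entire "mechanism" between `ecall` and the handler.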
---

## `getpid` — Kernel Trace

1. User stub: `a7=11`, `ecall`. (`trampoline.rs:22`)
2. `uservec`: save 31 GPRs, swap `satp`, jump `usertrap`.
3. `usertrap` (`trap.rs:44`): `tf.epc=sepc+4`, `intr_on`, `syscall()`.
4. `syscall()` (`syscall.rs:115`): `TABLE[11]` → `SysCalls::getpid`.
5. `getpid` (`syscall.rs:321`): `myproc()` (intr off), read `inner.pid.0`.
6. Return `Ok(pid)`. `Fn::I::call` writes `tf.a0 = pid`.
7. `usertrap_ret` → `userret` → `sret`.
8. CPU resumes 1 instruction past `ecall` with `a0 = pid`.

**State delta:** none observable. `tf.epc += 4` and `tf.a0 = pid` are the only writes — both happen in *every* syscall.

---

## Teaching Aside: Why `myproc` Disables Interrupts
"Which process am I?" sounds trivial — it's the subtlest read in the kernel.
- Answer lives in this hart's `Cpu::proc`.
- "Which hart am I?" = `tp` register.
- Timer fires between read of `tp` and index of `CPUS`? Scheduler migrates this thread → `tp` is stale → we return the *other* hart's current proc.

`IntrLock` brackets the lookup atomically. Same hazard, in miniature, as the lock-across-`swtch` trick.

---

## §8.2 — `fork()` + `wait()`

```rust
match sys::fork().unwrap() {
    0 => {
        println!("child says hi");
        sys::exit(0);
    }
    child_pid => {
        let mut status: i32 = -1;
        let pid = sys::wait(&mut status).unwrap();
        println!("reaped {} status={}", pid, status);
    }
}
```

A single call returns **twice**: `child_pid` in the parent, `0` in the child. The asymmetry is one byte — `tf.a0 = 0` in the child.

---

## `fork` — Sequence
```mermaid
sequenceDiagram
    participant P as parent user
    participant SC as syscall
    participant PR as proc::fork
    participant SCH as scheduler
    participant C as child user
    P->>SC: ecall a7=1
    SC->>PR: SysCalls::fork
    PR->>PR: allocproc, uvmcopy
    PR->>PR: tf clone, tf.a0 = 0
    PR->>PR: state = Runnable
    PR-->>SC: Ok(child_pid)
    SC-->>P: tf.a0 = child_pid, sret
    Note over SCH,C: scheduler picks child
    SCH->>C: swtch into fork_ret
    C->>C: runs with a0 = 0, exits
    C->>PR: exit wakes parent
    P->>PR: wait reaps zombie
```
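The `uvmcopy` step in the diagram is an eager deep copy — no copy-on-write. A toy model, assuming a "page" is just a byte vector:

```rust
// Toy model of uvmcopy: walk the parent's pages, allocate a fresh frame for
// each, and memcpy the contents. Child and parent share nothing afterward.
pub type Uvm = Vec<Vec<u8>>;

pub fn uvmcopy(parent: &Uvm) -> Uvm {
    parent
        .iter()
        .map(|page| page.clone()) // fresh frame + memcpy, per page
        .collect()
}
```

Because every frame is duplicated, a write on either side after `fork` is invisible to the other — the isolation the walkthrough relies on.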
---

## `fork` — Kernel Trace (1/2)

1. User: `a7=1`, `ecall`.
2. `SysCalls::fork` (`syscall.rs:329`) → `proc::fork()` (`proc.rs:814`).
3. `PROCS.alloc()` finds an `Unused` slot, fresh pid, fresh user PT. `child.context.ra = fork_ret`, `context.sp = top of new kstack`.
4. `parent_uvm.copy(&child_uvm, sz)`: walk parent PTEs, alloc fresh pages, `memcpy`, install matching PTEs in child. (Deep copy — no COW.)
5. `child_tf.clone_from(parent_tf)`: same `epc`, same `sp`. Then:

```rust
child_tf.a0 = 0; // ← the one byte of asymmetry
```

---

## `fork` — Kernel Trace (2/2)

6. Clone the `ofile` array (each `Arc` refcount bumps), `cwd`, set `parents[c.idx]`. State → `Runnable`.
7. Return `Ok(child_pid)`. `Fn::I::call` writes parent's `tf.a0`.
8. Parent: `usertrap_ret` → `userret` → `sret` with `a0 = child_pid`.
9. Some hart's scheduler picks child, `swtch`es while holding lock.
10. `context.ra = fork_ret`. Child's first instruction: `fork_ret` (`proc.rs:619`). Force-unlock, `usertrap_ret`.
11. Child `sret`s to **same user PC** as parent — with `tf.a0 = 0`. Its `match` arm runs.
12. Child exits. Parent's `wait` reaps zombie.

State: `Unused → Used → Runnable → Running → Zombie → Unused`.

---

## Teaching Aside: Lock Across `swtch`
The scheduler `swtch`es into a runnable process **while holding** that process's `inner` lock.
The lock is **not** released inside `swtch`. It is released by whichever code resumes on the other side:

- previously yielded → scope exit of `sched()`
- brand-new fork → `fork_ret::force_unlock`

In Rust we *have to* `force_unlock` on the newborn branch: the `MutexGuard` object simply doesn't exist there.

---

## §8.3 — `exec()`

```rust
let argv = ["echo", "hi"];
sys::exec("/bin/_echo", &argv, None).unwrap();
// reachable only if exec failed
```

`exec` **replaces** the current program's image. **It does not return on success.**

`argv[0]` is conventionally the program name. The fd table survives. The pid survives. The user address space does not.

---

## `exec` — Sequence
```mermaid
sequenceDiagram
    participant U as child user
    participant SC as syscall
    participant EX as exec.rs
    participant FS as fs
    participant VM as vm
    U->>SC: ecall a7=7
    SC->>SC: copy path/argv/envp INTO kernel
    SC->>EX: exec(path, argv, envp)
    EX->>FS: namei, read ELF, validate magic
    EX->>VM: uvmcreate (fresh PT)
    EX->>VM: per PT_LOAD: alloc + loadseg
    EX->>VM: alloc stack, clear guard
    EX->>EX: copyout argv strings, ustack, slice
    EX->>EX: tf.epc=e_entry, tf.sp, tf.a1=argv
    EX->>VM: COMMIT: replace uvm, free old
    EX-->>SC: Ok(argc)
    SC-->>U: sret into e_entry
```
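The COMMIT step at the end of the diagram follows a build-then-swap discipline: construct the complete new image first, and only replace the old one once nothing can fail. A toy sketch with hypothetical types (not Octox's `Uvm`):

```rust
// Toy model of exec's commit discipline. Any failure before the assignment
// just drops the half-built image; the old one survives untouched.
pub struct Image {
    pub name: String,
}

pub fn exec(current: &mut Image, path: &str) -> Result<(), ()> {
    // "load the ELF": fail early on a bad path, before touching `current`
    if !path.starts_with("/bin/") {
        return Err(());
    }
    let new_image = Image { name: path.to_string() }; // fully built first
    *current = new_image; // THE commit moment — old image dropped here
    Ok(())
}
```

This is why a failed `exec` returns to the *old* program: until the single commit line runs, the old address space is still fully intact.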
---

## `exec` — The Stack at User Entry

```text
high  +--------------------------------+
      | "hi\0"                         |  argv strings
      | "echo\0"                       |
      +--------------------------------+
      | ustack[0..1] = (ptr"echo", 4)  |  pointer table
      | ustack[2..3] = (ptr"hi", 2)    |
      | (zeros padding to MAXARG)      |
      +--------------------------------+
      | slice = (ptr_ustack, MAXARG)   |  &[&str]
sp -> +--------------------------------+  user main() starts here
low
```

`tf.a1` points at the slice header. `tf.a0` carries `argc` from the syscall return path. RISC-V ABI loves `a1` for argv.

---

## `exec` — The "Commit" Step

Everything before this point can be unwound by **dropping `new_uvm`**:

```rust
let olduvm = proc_data.uvm.replace(new_uvm); // ← THE moment
proc_data.sz = new_sz;
proc_data.name = "echo";
olduvm.proc_uvmfree(oldsz); // free old pages
```

After this single line, the old program is gone. `sret` will use `tf.epc = elf.e_entry`, putting the CPU at the new program's entry.

---

## Teaching Aside: fork+exec vs. spawn

UNIX **splits** process creation (`fork`) from program loading (`exec`). Why? Between the two, the child can rearrange its own FDs, cwd, env — with **ordinary user code**.

```text
fork() -> child:
    fd = open("out", WRONLY|CREATE|TRUNC)
    dup2(fd, 1)
    close(fd)
    exec("/bin/cmd", argv)
```

A combined `posix_spawn` would have to grow a mini DSL to do the same. fork+exec gets it for free.

---

## §8.4 — Timer Preemption

Two procs A and B, both runnable. A is in user mode. Timer fires. What has to happen for B to run next?
```mermaid
flowchart LR
    CLINT["CLINT mtimer"] --> TV["timervec (M-mode asm)"]
    TV --> SIP["set sip.SSIP, mret"]
    SIP --> UV["uservec"]
    UV --> UT["usertrap"]
    UT --> YD["yielding"]
    YD --> SCH["scheduler"]
    SCH --> NEW["swtch into B"]
```
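The chain above ends in the scheduler's scan. A toy round-robin model of just that selection logic — no locks, no harts, plain hosted Rust:

```rust
// Toy model: A yields on the timer tick, the "scheduler" scans round-robin
// for the next Runnable proc and marks it Running.
#[derive(Clone, Copy, PartialEq, Debug)]
pub enum State {
    Runnable,
    Running,
}

pub fn yield_and_pick(states: &mut [State], current: usize) -> usize {
    states[current] = State::Runnable; // `yielding`: A goes back to Runnable
    let n = states.len();
    // scheduler loop: scan starting just after `current`
    for i in 1..=n {
        let cand = (current + i) % n;
        if states[cand] == State::Runnable {
            states[cand] = State::Running;
            return cand;
        }
    }
    unreachable!("at least `current` is Runnable");
}
```

With A running and B runnable, one tick hands the CPU to B and parks A as Runnable — exactly the two-proc scenario posed above.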
---

## Timer Path: M → S

**Boot** (`start.rs:55`): program CLINT `mtimecmp`, install `timervec` as M-mode trap vector, set `mie.mtimer`.

**Tick fires in M-mode.** `timervec` asm (`kernelvec.rs:87`):

```asm
# save a0..a3 to mscratch
# bump mtimecmp for next tick
li a1, 2
csrw sip, a1   # raise SSIP (S-mode soft int)
mret           # back to S-mode
```

The M-mode handler **never touches the scheduler.** It hands the event down to S-mode as a software interrupt.

---

## `yielding` → `sched` → `swtch`

```rust
pub fn yielding() {
    let p = myproc();
    let mut inner = p.inner.lock();
    inner.state = State::Runnable;
    sched(inner, &mut p.data.context); // returns the guard
    // (when we resume, the lock is still held)
}

fn sched(guard, ctx) {
    assert!(c.noff == 1);
    assert!(!intr_get());
    let intena = c.intena;
    swtch(ctx, &c.context); // ← off to scheduler
    c.intena = intena;
    guard
}
```

`proc.rs:692`, `proc.rs:718`.

---

## Teaching Aside: Why Bounce Through Scheduler?

Octox could in principle `swtch` straight from A to B. It deliberately doesn't. Two interlocking reasons:

1. The scheduler needs **interrupts on** while scanning (so a `wakeup` from another hart is delivered). `yielding` runs with **interrupts off** (holds A's lock). They cannot share a stack.
2. The lock-across-`swtch` invariant only composes if the *other side* of every `swtch` is a known, trusted site (scheduler or `fork_ret`) that knows to release it. A direct A → B `swtch` would force every process to anticipate every other's locking state. Combinatorial nightmare.
---

## Key Concepts Recap

| Concept | Takeaway |
|---------|----------|
| M / S / U | Mode bit; only traps move you up |
| `satp` swap | Per-proc page table; bracketed by `sfence.vma` |
| TRAMPOLINE | Same VA in every PT — that's why the swap is safe |
| TRAPFRAME | Per-proc; saves user GPRs + kernel scratch |
| `ecall` / `sret` | The two boundary-crossing instructions |
| Syscall TABLE | `tf.a7` indexes; one source of truth (gen_usys) |
| `swtch` | 14 sd / 14 ld / ret — jumps to new `ra` |
| Lock-across-`swtch` | Resumer releases; pinning invariant |
| `fork_ret` | Brand-new proc's first kernel instruction |
| SSIP | M-mode timer hands events to S-mode via `sip` |

---

## Practice Problems

1. Trace `tf.a0` over `fork`'s lifetime — parent and child trapframes, at six different moments. Where is the asymmetry?
2. Construct a sequence of events that breaks `myproc()` if you *remove* its IntrLock.
3. Why hold the proc lock across `swtch`? What concrete bug appears if you release it before the swap?
4. Where does a brand-new forked process first execute, in **kernel** mode? In **user** mode?

(Full solutions in the lecture notes.)

---

## Further Reading

- [Octox Guide](../guides/octox-guide.md) §§4.5–4.8 — process model, scheduler, traps, syscall dispatch
- [Octox Guide](../guides/octox-guide.md) §§8.1–8.4 — the operation walkthroughs
- xv6 book ch. 4 (Traps), ch. 5 (Interrupts), ch. 7 (Scheduling)
- RISC-V Privileged Spec §4.1.1 (`sepc`, `stvec`, `sret`), §4.2 (Sv39)
- [UNIX System Calls lecture](12-cs631-2026-04-23-os-syscalls.md)