
OS Kernel: Inside Octox

A guided tour of how an operating-system kernel actually works, using the Octox kernel (a Rust port of xv6) as the running example. The previous lecture introduced the user side of the syscall interface; today we cross the trap boundary and look at what the kernel does to enforce isolation, dispatch system calls, switch privilege modes, and multiplex CPUs across processes.

Overview

A modern OS rests on three hardware-supported ideas. The CPU has more than one privilege mode; the MMU gives each process its own virtual address space; and a single instruction (ecall) crosses the boundary between them in a controlled way. Octox shows the entire mechanism in roughly 7,000 lines of Rust — small enough to read end-to-end, big enough to be a real kernel. Today we walk that mechanism in two passes:

  1. The four conceptual blocks — user vs. kernel space, process isolation, system calls / mode switching, context switching.
  2. Four end-to-end traces from the Octox guide: getpid, fork, exec, and timer-driven preemption.

Learning Objectives

After this lecture you should be able to:

  • Distinguish the three RISC-V privilege levels (M, S, U) and explain why hardware enforces them.
  • Describe how Octox uses one page table per process plus a separate kernel page table, and why both must map the TRAMPOLINE at the same virtual address.
  • Walk an ecall from user mode through uservec, usertrap, the syscall dispatch table, usertrap_ret, userret, and sret — naming what the hardware does at each step and what the software does.
  • Explain the trapframe: what gets saved, where, and why.
  • Read swtch.rs and explain why exactly one lock is held across it.
  • Trace what tf.a0 holds at every step of fork() from the parent's ecall to the child's first instruction back in user space.
  • Describe how a hardware timer interrupt in M-mode becomes a process preemption in S-mode.

Prerequisites

  • UNIX System Calls — the user side of the same interface we descend into today.
  • Octox Guide — build/run, workspace layout, and the memory map (§2.3).
  • RISC-V assembly lectures — ecall, CSRs (sepc, stvec, satp, sstatus), and the calling convention.

1. What Lives Where: Privilege Modes

RISC-V defines three privilege levels:

  Mode  Name        Who runs here
  ----  ----------  ------------------------------------------------------
  M     Machine     firmware: timer setup, delegating traps to S-mode
  S     Supervisor  the kernel itself (Octox's usertrap, scheduler, ...)
  U     User        every application binary (ulib, the shell, programs)

The mode bit is part of mstatus (in M) and sstatus (in S). User code cannot change it directly; only a trap (in) or an mret/sret (out) moves the CPU between levels. That single fact is what makes process isolation possible: any attempt to access a kernel resource has to go through code the kernel itself wrote.

Why we want this.

  • Isolation. A buggy program cannot corrupt another program's memory or crash the OS.
  • Arbitration. Only the kernel can speak to disks, the timer, or the console controller; programs go through it.
  • Fault containment. A page fault in U-mode is recoverable (kill the process); the same fault in S-mode is a kernel panic.

Octox boots in M-mode in src/kernel/start.rs, programs the timer and delegates the rest of the trap surface to S-mode, then mrets into the kernel main — the kernel itself runs in S-mode from then on. User processes run in U-mode and trap up to S-mode whenever they need a privileged operation.


2. Process Isolation in Octox

Each process has its own page table — a Uvm (user virtual memory) in src/kernel/vm.rs. The kernel keeps a separate Kvm of its own. The MMU's satp register selects which page table is active. When the CPU traps into the kernel we swap satp to the kernel table; on the way out we swap back.

Each PTE (page table entry) carries a PTE_U flag. The kernel's own pages have no PTE_U; user mode trying to read them faults instantly. Each process's page table only maps that process's own memory plus two shared kernel pages (TRAMPOLINE and TRAPFRAME, below).

Memory map (per-process address space, top down):

  MAXVA  ┌────────────────────────────┐
         │  TRAMPOLINE                │  asm trap glue, mapped in every PT
  -PGSZ  ├────────────────────────────┤
         │  TRAPFRAME                 │  per-process register save area
  ...    ├────────────────────────────┤
         │  guard page (no PTE_U)     │
         │  user stack                │
         │  heap                      │
         │  .bss / .data              │
         │  .text                     │
   0x0   └────────────────────────────┘

The TRAMPOLINE trick. The page that contains uservec and userret (the asm that swaps satp) must be mapped at the same virtual address in every page table — user and kernel — so that the instructions can keep executing while the page table changes underneath them. Octox places it at MAXVA - PGSIZE and maps it identically in Kvm::make (vm.rs:685) and every uvmcreate (proc.rs:445). Without this, the csrw satp, ... instruction would suddenly land in unmapped memory on the next fetch.

Code sites.

  • src/kernel/vm.rs:322 — walk() does the 3-level Sv39 walk.
  • src/kernel/vm.rs:364 — mappages() installs a VA → PA mapping.
  • src/kernel/vm.rs:454 — Uvm::create() allocates a fresh user PT.
  • src/kernel/proc.rs:445 — per-process TRAMPOLINE / TRAPFRAME setup.
  • src/kernel/riscv.rs:615 — PTE flag bits (PTE_V, _R, _W, _X, _U).
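
These flag bits and the trampoline arithmetic are small enough to check outside the kernel. The sketch below uses the standard Sv39 encodings and the xv6-style MAXVA = 1 << 38; constant names mirror riscv.rs, but this is illustrative code, not Octox's:

```rust
// PTE flag bits as described above — names mirror riscv.rs, values are
// the standard Sv39 encodings. MAXVA is the xv6-style 1 << 38.
const PTE_V: usize = 1 << 0; // valid
const PTE_R: usize = 1 << 1; // readable
const PTE_W: usize = 1 << 2; // writable
const PTE_X: usize = 1 << 3; // executable
const PTE_U: usize = 1 << 4; // user-accessible

const PGSIZE: usize = 4096;
const MAXVA: usize = 1 << 38;
const TRAMPOLINE: usize = MAXVA - PGSIZE; // topmost page of every AS
const TRAPFRAME: usize = TRAMPOLINE - PGSIZE;

/// Can user-mode code touch a page with these PTE flags?
fn user_accessible(flags: usize) -> bool {
    flags & PTE_V != 0 && flags & PTE_U != 0
}

fn main() {
    // Kernel text: valid and executable, but no PTE_U — a user fetch faults.
    assert!(!user_accessible(PTE_V | PTE_R | PTE_X));
    // An ordinary user page is fine.
    assert!(user_accessible(PTE_V | PTE_R | PTE_W | PTE_U));
    // TRAMPOLINE and TRAPFRAME are the two pages just below MAXVA.
    assert_eq!(TRAMPOLINE % PGSIZE, 0);
    println!("TRAMPOLINE = {:#x}, TRAPFRAME = {:#x}", TRAMPOLINE, TRAPFRAME);
}
```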

3. The System Call Mechanism

A system call is a function whose body lives on the other side of a hardware-enforced privilege boundary. On RISC-V, ecall is the instruction that crosses the boundary. The kernel installed the address of its trap vector in stvec, so ecall jumps there with the CPU now in S-mode.

In Octox the round trip looks like this:

flowchart LR
    U["user main()"] --> W["sys::xyz wrapper"]
    W --> E["ecall"]
    E --> UV["uservec (asm)"]
    UV --> UT["usertrap"]
    UT --> SC["syscall()"]
    SC --> H["SysCalls::xyz"]
    H --> URET["usertrap_ret"]
    URET --> UR["userret (asm)"]
    UR --> S["sret"]
    S --> U

Step by step.

  1. The user wrapper (auto-generated stub from gen_usys at src/kernel/syscall.rs:837) loads a7 with the syscall number, fills a0..a5 with arguments, and executes ecall.
  2. Hardware: the PC jumps to stvec → uservec (src/kernel/trampoline.rs:22), privilege moves to S, sepc captures the user PC, scause records UserEnvCall.
  3. uservec saves all 31 user GPRs into the per-process Trapframe at fixed offsets, swaps satp to the kernel page table, loads the kernel sp and tp from the trapframe, and jumps to usertrap.
  4. usertrap (src/kernel/trap.rs:44) saves sepc → tf.epc, advances tf.epc += 4 so the eventual sret returns past the ecall, calls intr_on(), and dispatches on the cause. For UserEnvCall it calls syscall().
  5. syscall() (src/kernel/syscall.rs:115) reads tf.a7, indexes SysCalls::TABLE, and invokes the handler through the Fn::I::call trampoline (syscall.rs:59). The handler returns Result<usize>; Fn::I::call unwraps it (positive = success, negative = -errno) and syscall() writes the result into tf.a0.
  6. Back through usertrap_ret (src/kernel/trap.rs:120) → userret (in the trampoline) → sret. Because hardware loads pc from sepc and we set sepc = tf.epc, the user resumes one instruction past the ecall with a0 holding the return value.

The dispatch table. A peek at syscall.rs:80:

pub const TABLE: [(Fn, &'static str); /* count */] = [
    (Fn::N(Self::invalid), ""),       // 0
    (Fn::I(Self::fork),    "()"),     // 1
    (Fn::I(Self::exit),    "(i32)"),  // 2
    (Fn::I(Self::wait),    "(i32*)"), // 3
    // ... 24 entries total
    (Fn::I(Self::getpid),  "()"),     // 11
];

tf.a7 is the syscall number; TABLE[tf.a7] selects the handler. The matching user-side stubs are generated from the same enum by gen_usys — if you add a syscall to SysCalls, the user library learns about it for free.
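
The dispatch-and-fold convention of steps 4–5 can be modeled in a few lines of ordinary Rust. Everything here is a stand-in — a trimmed trapframe, made-up syscall numbers, a fabricated pid; only the shape (a7 indexes a table of handlers, the Result folds into a0) follows the text:

```rust
// A toy model of syscall() — a trimmed trapframe and made-up handlers.
// Only the shape is real: a7 picks the handler, the Result is folded
// into a0 (non-negative = success, negative = -errno).
#[derive(Default)]
struct Trapframe {
    a0: isize, // return-value slot
    a7: usize, // syscall number
}

type Handler = fn(&mut Trapframe) -> Result<usize, isize>;

fn sys_getpid(_tf: &mut Trapframe) -> Result<usize, isize> {
    Ok(3) // a made-up pid
}

fn sys_invalid(_tf: &mut Trapframe) -> Result<usize, isize> {
    Err(-38) // an example errno value
}

fn syscall(tf: &mut Trapframe, table: &[Handler]) {
    let handler = table.get(tf.a7).copied().unwrap_or(sys_invalid as Handler);
    // What Fn::I::call does conceptually with the handler's Result:
    tf.a0 = match handler(tf) {
        Ok(v) => v as isize,
        Err(e) => e,
    };
}

fn main() {
    let table: [Handler; 2] = [sys_invalid, sys_getpid];
    let mut tf = Trapframe { a7: 1, ..Default::default() };
    syscall(&mut tf, &table);
    assert_eq!(tf.a0, 3); // lands in user a0 after sret
}
```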


4. Mode Switching, in Detail

The hardest 90 lines of code in any xv6-style kernel live in trampoline.rs. They are the bridge between two address spaces.

The trapframe. Every process owns a Trapframe page (proc.rs:170). The fields ahead of the GPR area are kernel bookkeeping — the kernel populates kernel_satp, kernel_sp, and kernel_trap before entering user mode so the trampoline can find the kernel's page table, stack, and usertrap; epc holds the saved user PC:

  off   field
  ----  ----------------
    0   kernel_satp     // kernel page table (loaded on trap entry)
    8   kernel_sp       // kernel stack pointer
   16   kernel_trap     // address of usertrap()
   24   epc             // user PC (saved/restored by the kernel)
   32   kernel_hartid   // this hart's id
   40   ra              // user GPRs from here on...
   48   sp
  ...   ...
  280   t6
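
One way to convince yourself the offsets hang together is to write the table as a #[repr(C)] struct and measure it. The declaration below is a sketch (regs lumps the 31 GPRs into one array; Octox's real Trapframe names each one), but the arithmetic matches the table:

```rust
// The offset table as a #[repr(C)] struct. `regs` lumps the 31 GPRs
// (ra at offset 40 through t6 at 280) into one array.
#[repr(C)]
struct Trapframe {
    kernel_satp: u64,   //   0
    kernel_sp: u64,     //   8
    kernel_trap: u64,   //  16
    epc: u64,           //  24
    kernel_hartid: u64, //  32
    regs: [u64; 31],    //  40..=280: ra, sp, ..., t6
}

/// Byte offsets of epc, the first GPR slot (ra), and the last (t6),
/// measured from a zeroed instance.
fn layout() -> (usize, usize, usize) {
    let tf = Trapframe {
        kernel_satp: 0, kernel_sp: 0, kernel_trap: 0,
        epc: 0, kernel_hartid: 0, regs: [0; 31],
    };
    let base = &tf as *const Trapframe as usize;
    (
        &tf.epc as *const u64 as usize - base,
        &tf.regs[0] as *const u64 as usize - base,
        &tf.regs[30] as *const u64 as usize - base,
    )
}

fn main() {
    assert_eq!(layout(), (24, 40, 280)); // matches the table above
}
```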

The sscratch swap. Saving 31 user registers needs at least one working register, which means we have to free one up first. The trick is sscratch: just before entering user mode, the kernel writes the TRAPFRAME virtual address into sscratch. On the way back into the kernel, uservec does:

csrrw a0, sscratch, a0      # atomic swap: a0 ↔ sscratch
sd ra,  40(a0)              # save user ra into trapframe
sd sp,  48(a0)              # save user sp
... (28 more registers)
csrr t0, sscratch           # recover the original user a0
sd   t0, 112(a0)            # store it last

The sfence.vma bracket. The page-table swap is the moment of maximum hazard. Two sfence.vma instructions surround it:

ld   t1, 0(a0)              # kernel_satp from trapframe
sfence.vma zero, zero       # invalidate stale user TLB entries
csrw satp, t1               # install kernel page table
sfence.vma zero, zero       # ensure new translations take effect
jr   t0                     # jump to usertrap

The first fence flushes any user PTEs the TLB cached; the second ensures the kernel PTEs we just installed are actually used by the next fetch. Skipping either one is a classic source of "the bug only happens once an hour under load."


5. Context Switching

The other half of an OS kernel is multiplexing. Octox runs one scheduler thread per hart; user processes are kernel threads that yield to it whenever they need to wait or are preempted.

The Context struct (proc.rs:136) holds 14 registers: ra, sp, and the callee-saved s0..s11. Caller-saved registers are not saved — that is the whole point of the callee/caller split: by the time we are about to call swtch, the compiler has already spilled anything live across the call. So swtch.rs is just 14 sd + 14 ld + ret:

swtch:
    sd ra,    0(a0)        # save outgoing ctx
    sd sp,    8(a0)
    sd s0,   16(a0)
    ... (s1..s11)
    ld ra,    0(a1)        # load incoming ctx
    ld sp,    8(a1)
    ld s0,   16(a1)
    ... (s1..s11)
    ret                    # jumps to the *new* ra

swtch(a, b) saves the current registers into *a and loads new ones from *b. The ret at the end pops the incoming ra — so we return to wherever the other thread last called swtch.
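
The 14-slot claim is easy to check with a #[repr(C)] sketch — field set taken from the text (ra, sp, s0..s11), not necessarily Octox's literal declaration:

```rust
// The Context described above, as a #[repr(C)] sketch.
#[repr(C)]
#[derive(Default)]
struct Context {
    ra: u64,      // where the incoming thread resumes: swtch's ret target
    sp: u64,      // that thread's kernel stack
    s: [u64; 12], // s0..s11, the callee-saved registers
}

fn main() {
    // 14 slots x 8 bytes — exactly the 14 sd / 14 ld pairs in swtch.rs.
    assert_eq!(std::mem::size_of::<Context>(), 14 * 8);
}
```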

The scheduler loop (proc.rs:662):

fn scheduler(c: &mut Cpu) -> ! {
    loop {
        intr_on();
        for p in PROCS.pool.iter() {
            let mut inner = p.inner.lock();
            if inner.state == State::Runnable {
                inner.state = State::Running;
                c.proc = Some(p.clone());
                swtch(&mut c.context, &p.data().context);
                c.proc = None;
                // (lock released by scope exit on the way back)
            }
        }
    }
}

The lock-across-swtch invariant. When the scheduler picks a runnable process it swtches to it while holding that process's inner lock. The lock is released by whichever code resumes on the other side — either the scope-exit at sched's return, or fork_ret::force_unlock for a brand-new child. sched() asserts:

assert!(c.noff == 1, "sched: multiple locks");
assert!(!intr_get(), "sched interruptible");

Holding the lock across the switch is the only thing that keeps another hart from grabbing this process and trying to run it on two CPUs at once.

fork_ret (proc.rs:619). When a brand-new child runs for the first time, its kernel stack is empty — there is no "previous swtch to return through." The fork code planted context.ra = fork_ret, so the child's first instruction is fork_ret. It force-unlocks the inherited lock, does one-shot FS init if it is the very first user process, and falls through into usertrap_ret to enter user space.

myproc() and IntrLock (proc.rs:72). "Which process am I?" sounds trivial, but the answer lives in this hart's Cpu::proc. To find which Cpu is "mine" we read the tp register, which the boot code populated. If a timer fires between reading tp and indexing into CPUS, the scheduler may migrate this thread — and tp becomes stale. lock_mycpu() disables interrupts before the read; the returned IntrLock guard re-enables them on drop.


6. Walkthrough — getpid()

The simplest end-to-end trace: a syscall with no arguments and no side effects. Every step here is part of every other syscall too.

Scenario.

#![no_std]
use ulib::{println, sys};

fn main() {
    let pid = sys::getpid().unwrap();
    println!("my pid is {}", pid);
}

Diagram.

sequenceDiagram
    participant U as user
    participant TR as uservec
    participant UT as usertrap
    participant SC as syscall
    participant H as SysCalls::getpid
    participant P as Proc
    U->>TR: ecall a7=11
    TR->>TR: save user regs to TRAPFRAME, swap satp
    TR->>UT: jump usertrap
    UT->>UT: tf.epc = sepc + 4, intr_on
    UT->>SC: syscall()
    SC->>H: TABLE[11].0.call()
    H->>P: Cpus::myproc().pid()
    P-->>H: pid
    H-->>SC: Ok(pid)
    SC-->>UT: tf.a0 = pid
    UT->>TR: usertrap_ret → userret
    TR->>U: sret to user PC = epc, a0 = pid

Kernel trace.

  1. User stub loads a7 = 11, executes ecall.
  2. uservec (trampoline.rs:22) saves 31 user GPRs to TRAPFRAME, swaps satp to kernel PT, loads kernel sp/tp, jumps to usertrap.
  3. usertrap (trap.rs:44) sets tf.epc = sepc, tf.epc += 4, enables interrupts, calls syscall().
  4. syscall() (syscall.rs:115) reads tf.a7 = 11, dispatches via TABLE[11] to SysCalls::getpid (syscall.rs:321).
  5. getpid calls Cpus::myproc(). myproc (proc.rs:72) calls lock_mycpu() to disable interrupts (returning an IntrLock), reads tp, indexes CPUS.0[id], clones the Arc<Proc> from Cpu::proc. The IntrLock re-enables interrupts on drop.
  6. getpid calls .pid() (proc.rs:416), briefly takes the per-proc inner mutex, reads inner.pid.0, returns it.
  7. Fn::I::call writes the result into tf.a0.
  8. usertrap_ret (trap.rs:120) → userret → sret. CPU resumes one instruction past the ecall with a0 = pid.

State deltas. None observable. Every write is bookkeeping (tf.epc += 4, tf.a0 = pid).

Teaching aside — why myproc() disables interrupts

"Which process am I?" sounds like a one-liner, but the answer lives in Cpu::proc for this hart, found via the tp register. If a timer fires between reading tp and indexing into CPUS, the scheduler can migrate this thread to a different hart — at which point tp is stale. lock_mycpu() brackets the lookup with interrupts disabled. This is the same hazard, in miniature, as the "lock held across swtch" trick in §5.


7. Walkthrough — fork() + wait()

The most famous syscall in UNIX. A single call returns twice: once in the parent with the child's pid, once in the child with 0.

Scenario.

match sys::fork().unwrap() {
    0 => {
        println!("child says hi");
        sys::exit(0);
    }
    child_pid => {
        let mut status: i32 = -1;
        let pid = sys::wait(&mut status).unwrap();
        println!("reaped pid={} status={}", pid, status);
    }
}

Diagram.

sequenceDiagram
    participant P as parent user
    participant TR as uservec
    participant UT as usertrap
    participant SC as syscall
    participant PR as proc::fork
    participant SCH as scheduler
    participant C as child user
    P->>TR: ecall a7=1
    TR->>UT: usertrap
    UT->>SC: syscall dispatch
    SC->>PR: SysCalls::fork
    PR->>PR: allocproc, uvmcopy, tf clone, tf.a0 = 0
    PR->>PR: state = Runnable
    PR-->>SC: Ok(child_pid)
    SC-->>UT: tf.a0 = child_pid
    UT-->>P: userret, sret
    Note over SCH,C: on some hart, scheduler picks child
    SCH->>C: swtch into fork_ret, then userret
    C->>C: runs with a0 = 0, calls exit
    C->>PR: proc::exit wakes parent
    P->>PR: sys_wait finds zombie, reads xstate, frees

Kernel trace.

  1. User stub: a7 = 1, ecall.
  2. uservec → usertrap → syscall().
  3. SysCalls::fork (syscall.rs:329) calls free-function fork() (proc.rs:814).
  4. fork() allocates an Unused proc slot, assigns a fresh pid, creates a fresh user page table, sets child.context.ra = fork_ret and context.sp to the top of the new kernel stack.
  5. parent_uvm.copy(&child_uvm, sz) deep-copies every PTE: walks the parent, allocates fresh physical pages for the child, memcpys each, installs matching PTEs. (No copy-on-write — conceptually simpler; performance penalty acceptable for teaching.)
  6. child_tf.clone_from(parent_tf) — same epc, same user sp. Then child_tf.a0 = 0. That single byte is the only asymmetry between parent and child.
  7. The ofile array is cloned (each Arc<File> refcount bumps); cwd is cloned; parents[c.idx] is set to the parent's Arc; child state → Runnable.
  8. fork() returns Ok(child_pid). Fn::I::call writes it into the parent's tf.a0.
  9. usertrap_ret → userret → sret. Parent resumes with a0 = child_pid.
  10. Meanwhile on some hart, scheduler (proc.rs:662) finds the Runnable child, takes its inner lock, sets state to Running, swtches into it.
  11. Because context.ra == fork_ret, the child's first instruction is fork_ret (proc.rs:619). It force-unlocks the inherited lock and falls through to usertrap_ret.
  12. The child srets to the same user PC as the parent — but tf.a0 = 0. Its match arm runs.
  13. Child calls sys::exit(0). exit() (proc.rs:725) closes files, drops cwd, re-parents children to INITPROC, calls wakeup(&parent), sets state → Zombie, calls sched() — never returns.
  14. The parent, blocked in wait() (proc.rs:875), wakes up, scans PROCS.pool, finds its Zombie child, copies the exit status out with copyout, frees the child slot, returns the child's pid.

State deltas. ProcInner.state walks Unused → Used → Runnable → Running → Zombie → Unused. tf.a0 is written twice — once to the child's pid in the parent, once to 0 in the child. Every Arc<File> refcount bumps in the child, drops on exit.
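
The trapframe half of this story (steps 6 and 8) fits in a dozen lines of stand-alone Rust. The trapframe below is trimmed to the three fields the trace touches, and the pid and addresses are made up:

```rust
// Steps 6 and 8 in miniature: clone, then the two diverging writes.
#[derive(Clone)]
struct Trapframe {
    epc: u64, // resume PC — one instruction past the ecall
    sp: u64,  // user stack pointer
    a0: u64,  // syscall return slot
}

/// What fork does to the two trapframes; returns (parent, child).
fn fork_trapframes(mut parent: Trapframe, child_pid: u64) -> (Trapframe, Trapframe) {
    let mut child = parent.clone(); // child_tf.clone_from(parent_tf)
    child.a0 = 0;                   // the single asymmetry (step 6)
    parent.a0 = child_pid;          // Fn::I::call's write (step 8)
    (parent, child)
}

fn main() {
    let before = Trapframe { epc: 0x1004, sp: 0x3fe0, a0: 0 };
    let (p, c) = fork_trapframes(before, 5);
    assert_eq!(p.epc, c.epc);         // both sret to the same user PC
    assert_eq!((p.a0, c.a0), (5, 0)); // one call, two return values
}
```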

Teaching aside — the lock held across swtch

The scheduler swtches into a runnable process while holding that process's inner lock. The lock is not released inside swtch; it is released by whichever code the process resumes at — either fork_ret (for a brand-new child) or the scope exit of sched() (for a process that previously yielded). In Rust we have to force_unlock on the newborn-process branch because the MutexGuard object simply does not exist there — we never ran the code that would have created it.


8. Walkthrough — exec()

exec replaces the current program's image with a new one. It does not return on success — the caller's code is gone.

Scenario. The shell forks; the child runs /bin/_echo:

let argv = ["echo", "hi"];
sys::exec("/bin/_echo", &argv, None).unwrap();
// reachable only if exec failed

Diagram.

sequenceDiagram
    participant U as child user
    participant SC as syscall
    participant EX as exec.rs
    participant FS as fs (namei/read)
    participant VM as vm (uvmcreate/alloc)
    participant TR as userret
    U->>SC: ecall a7=7, path/argv/envp
    SC->>SC: Path::from_arg, Argv::from_arg → kernel Vec
    SC->>EX: exec(path, argv, envp)
    EX->>FS: namei, read ELF header, validate magic
    EX->>VM: uvmcreate, per PT_LOAD alloc + loadseg
    EX->>VM: alloc stack, clear guard page
    EX->>EX: copyout argv, envp, pointer array, slice descriptor
    EX->>EX: tf.epc = e_entry, tf.sp, tf.a1 = argv ptr
    EX->>VM: swap uvm, free old pages
    EX-->>SC: Ok(argc)
    SC->>TR: tf.a0 = argc, userret
    TR->>U: sret into e_entry

Kernel trace.

  1. User stub: a7 = 7, a0 = path, a1 = argv, a2 = envp, ecall.
  2. SysCalls::exec (syscall.rs:605) uses Path::from_arg, Argv::from_arg, Envp::from_arg to copy the path and every argv/envp string into kernel-owned Strings before doing any work. This matters: the user page table is about to disappear.
  3. Calls free-function exec() (exec.rs:90).
  4. path.namei() walks the FS to an Arc<Inode>. Reads the first 64 bytes as ElfHdr. Validates magic 0x7F 'E' 'L' 'F'.
  5. p.uvmcreate() builds a fresh user page table — initially only TRAMPOLINE and TRAPFRAME mappings. The current address space is untouched.
  6. For each PT_LOAD program header: uvm.alloc(...) grows the new address space, then loadseg() (exec.rs:32) reads p_filesz bytes from the inode into the freshly mapped pages by walking the new page table to resolve each VA. Pages from p_filesz to p_memsz are left zero — that is the BSS.
  7. Stack. One more uvm.alloc adds (1 + STACK_PAGE_NUM) * PGSIZE at the top of user space. The bottom page has PTE_U stripped by uvm.clear(guard) — a stack overflow page-faults instead of silently sliding into data.
  8. Push argv/envp strings. For each, decrement sp past the string and its NUL terminator, round down to 16-byte alignment, copyout into the new user stack, remember (sp, len) in a kernel ustack: [usize; MAXARG*2].
  9. Push the ustack array itself, then a 16-byte slice descriptor (ptr_to_ustack, MAXARG) — that is the &[&str] user-side main will receive.
  10. Rewrite the trapframe. tf.epc = elf.e_entry, tf.sp = sp, tf.a1 = pointer to slice descriptor (argv); argc rides out in tf.a0 via the syscall return.
  11. Commit. let olduvm = proc_data.uvm.replace(new_uvm); proc_data.sz = new_sz; olduvm.proc_uvmfree(oldsz);. This is the only irreversible step. Everything before it can be unwound by dropping new_uvm.
  12. Any FD with FD_CLOEXEC is closed.
  13. Fn::I::call writes argc to tf.a0; usertrap_ret → userret → sret. Because sepc = tf.epc = e_entry, the CPU jumps into the new program. The old program is gone.

Stack at user entry (high → low):

   "hi"\0           <-- highest address
   "echo"\0
   ustack[0..1] = (ptr_to_"echo", 4)
   ustack[2..3] = (ptr_to_"hi",   2)
   (zeros padding to MAXARG)
   slice header  = (ptr_to_ustack, MAXARG)
   sp ->                                <-- user main() starts here

State deltas. ProcData.uvm swapped (old freed). tf.epc/sp/a0/a1 rewritten. ofile entries with FD_CLOEXEC gone. Pid, parent link, cwd, surviving FDs all preserved.
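
The push arithmetic of step 8 is worth a sanity check on its own. The sketch below models only the sp bookkeeping (decrement past the string and its NUL, round down to 16-byte alignment, record a (ptr, len) pair); the starting address is made up and no bytes are actually copied:

```rust
// The sp bookkeeping behind each argv/envp push.
fn push_str(sp: &mut usize, s: &str) -> (usize, usize) {
    *sp -= s.len() + 1; // room for the string and its NUL terminator
    *sp &= !0xf;        // round down to 16-byte alignment
    (*sp, s.len())      // the ustack entry for this argument
}

fn main() {
    let mut sp: usize = 0x4000; // pretend top of the new user stack
    let mut ustack = Vec::new();
    for arg in ["echo", "hi"] {
        ustack.push(push_str(&mut sp, arg));
    }
    assert_eq!(sp % 16, 0);             // alignment maintained throughout
    assert!(ustack[1].0 < ustack[0].0); // later pushes land lower
    println!("{:x?}", ustack);
}
```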

Teaching aside — fork+exec vs spawn

UNIX splits process creation (fork) from program loading (exec) so that, between the two, the child can rearrange its own FDs, cwd, env. Every shell pipeline and redirection is built that way. A combined posix_spawn would have to grow a mini DSL for "close this, dup that, open the other" — fork+exec gets it for free by just running ordinary user code in the child.


9. Walkthrough — Timer-Driven Preemption

Two processes A and B are both runnable. A is in user mode; the timer fires; A is preempted; B runs next.

Diagram.

sequenceDiagram
    participant A as proc A user
    participant TV as timervec (M-mode)
    participant TRAP as usertrap
    participant YD as yielding
    participant SW as swtch
    participant SCH as scheduler
    participant B as proc B user
    TV->>TV: CLINT mtimer fires
    TV->>TV: bump mtimecmp, set sip SSIP, mret
    A->>TRAP: SSIP delivered, uservec then usertrap
    TRAP->>YD: devintr returns Timer
    YD->>YD: inner.lock, state = Runnable
    YD->>SW: swtch A.ctx → scheduler.ctx
    SW->>SCH: ra loaded, return into scheduler loop
    SCH->>SCH: release A lock, find Runnable B
    SCH->>SW: swtch scheduler.ctx → B.ctx
    SW->>B: ret lands at B's prior swtch call site (inside sched)
    B->>B: returns through yielding, usertrap_ret
    B->>B: userret, sret back to user

Kernel trace.

  1. Boot setup (start.rs:55): program CLINT mtimecmp for the next tick, install timervec (kernelvec.rs:87) as the M-mode trap vector, set mie.mtimer. Per-CPU scratch holds the mtimecmp address and the interval.
  2. Timer fires in M-mode. timervec (naked asm): saves a0..a3 to mscratch, bumps mtimecmp for the next tick, writes 2 to sip (SSIP — supervisor software interrupt pending), restores its scratch registers, mret. The M-mode handler never touches the scheduler.
  3. Back in S-mode with SSIP pending. If A was in user mode the trap funnels through uservecusertrap. devintr() classifies the cause as SupervisorSoft, clears SSIP, returns Some(Intr::Timer).
  4. The trap handler calls proc::yielding() (proc.rs:718).
  5. yielding() acquires p.inner.lock(), sets state = Runnable, calls sched(guard, &mut p.data.context).
  6. sched() (proc.rs:692) sanity-checks: exactly one lock held (c.noff == 1), state is not Running, interrupts disabled. Saves c.intena. Calls swtch(&mut p.context, &c.context).
  7. swtch (swtch.rs) saves ra, sp, s0..s11 into A's context, loads them from the scheduler's, rets — jumping to the scheduler's saved ra. Process A is now frozen with its stack intact and its proc lock still held.
  8. The scheduler resumes at proc.rs:662. The proc lock A's yielding passed in is released (by scope exit on the swtch-return path). The scheduler intr_on()s — it wants interrupts enabled while scanning so a wakeup from another hart is delivered — loops over PROCS.pool, finds B, takes B.inner.lock(), sets state = Running, swtch(&mut cpu.context, &B.context).
  9. B resumes wherever its previous swtch left it:
    • Previously yielded: resumes inside sched() after swtch. Returning up the stack drops B's lock, returns to yielding, returns to usertrap, calls usertrap_ret.
    • Brand-new fork: context.ra still points to fork_ret. It force-unlocks and heads to usertrap_ret.
  10. usertrap_ret → userret restores B's trapframe, switches satp to B's Uvm, sret — B is in user space.

State deltas. ProcInner.state flips Running → Runnable for A, Runnable → Running for B. Per-CPU cpu.proc swaps. cpu.intena is preserved across the round trip so interrupt-enable nesting stays balanced.
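
The M-to-S handoff in steps 2–3 boils down to two small integer conventions: sip bit 1 is SSIP (hence "writes 2 to sip"), and scause for a supervisor software interrupt is the interrupt bit plus cause code 1. A sketch with the standard RISC-V encodings — Octox's real devintr also classifies external interrupts, omitted here:

```rust
// The timer path of devintr, with standard RV64 scause/sip encodings.
const SSIP: u64 = 1 << 1;               // "writes 2 to sip"
const SCAUSE_INTERRUPT: u64 = 1 << 63;  // MSB set = interrupt, not exception
const SUPERVISOR_SOFT: u64 = SCAUSE_INTERRUPT | 1;

#[derive(Debug, PartialEq)]
enum Intr {
    Timer,
}

fn devintr(scause: u64) -> Option<Intr> {
    if scause == SUPERVISOR_SOFT {
        Some(Intr::Timer) // a tick forwarded from M-mode as SSIP
    } else {
        None
    }
}

fn main() {
    assert_eq!(SSIP, 2);
    assert_eq!(devintr(SUPERVISOR_SOFT), Some(Intr::Timer));
    assert_eq!(devintr(SCAUSE_INTERRUPT | 9), None); // e.g. supervisor external
}
```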

Teaching aside — why bounce through a scheduler thread

Octox could in principle swtch straight from A to B. It doesn't for two interlocking reasons. First, the scheduler needs interrupts enabled while it scans (so a wakeup from another hart is delivered promptly), but yielding runs with interrupts disabled while holding A's lock — they cannot share a stack. Second, the "lock held across swtch" invariant only composes cleanly if the code on the other side of every swtch is a known, trusted site (the scheduler or fork_ret) that knows to release it. If any process could swtch to any other, every process would have to anticipate every other's locking state — combinatorial nightmare.


Key Concepts

  Concept                      Takeaway
  ---------------------------  ----------------------------------------------------------------------
  Privilege modes (M / S / U)  Mode bit in mstatus/sstatus; only traps move you up.
  satp switch                  Each process has its own page table; trap glue swaps satp.
  TRAMPOLINE                   Mapped at the same VA in every page table so the swap is safe.
  TRAPFRAME                    Per-proc page; saves user GPRs and pre-populated kernel scratch.
  ecall / sret                 The two instructions that cross the user/kernel boundary.
  Syscall dispatch             tf.a7 indexes SysCalls::TABLE; Fn::I::call adapts return values.
  swtch                        14 sd / 14 ld / ret — saves callee-saved regs, jumps to new ra.
  Lock-across-swtch            Scheduler holds the proc lock across the switch; resumer releases it.
  fork_ret                     First instruction a brand-new process executes; force-unlocks the lock.
  myproc() and IntrLock        Disable interrupts while reading tp to avoid mid-read migration.
  SSIP delivery from M to S    M-mode timer handler raises SSIP; S-mode handles it as a software int.

Practice Problems

Problem 1 — tf.a0 over the lifetime of fork

Trace the value of tf.a0 (in the parent's trapframe and in the child's trapframe) at each of these moments. Some entries do not exist yet.

  Moment                                               Parent tf.a0  Child tf.a0
  ---------------------------------------------------  ------------  -----------
  User code right before ecall                         ?             ?
  Inside usertrap, just after tf.epc += 4              ?             ?
  Inside proc::fork, just after c_tf.clone_from(p_tf)  ?             ?
  Inside proc::fork, after c_tf.a0 = 0                 ?             ?
  Just after Fn::I::call writes the return value       ?             ?
  Back in user mode after sret                         ?             ?
Solution.

  Moment                                   Parent tf.a0  Child tf.a0
  ---------------------------------------  ------------  ---------------------
  Right before ecall                       undefined     (no child yet)
  Inside usertrap after tf.epc += 4        undefined     (no child yet)
  After c_tf.clone_from(p_tf)              undefined     undefined (== parent)
  After c_tf.a0 = 0                        undefined     0
  After Fn::I::call writes return value    child_pid     0
  Back in user mode after sret             child_pid     0

  The single asymmetric write is c_tf.a0 = 0. Everything else is identical in the two trapframes — that is why both processes return from the same user-side ecall to the same user PC.

Problem 2 — Why does myproc() disable interrupts?

Suppose we removed the IntrLock and let myproc() read tp without disabling interrupts. Construct a sequence of events that would cause myproc() to return the wrong Arc<Proc>.

Solution.

  1. Hart 0 enters myproc() and reads tp = 0.
  2. Before it can index CPUS[0], a timer interrupt fires.
  3. The trap calls yielding(), the scheduler runs, and the current thread happens to be migrated to hart 1 (e.g., the scheduler on hart 1 picked it).
  4. Hart 1's boot code already set tp = 1 for itself. But the in-progress myproc() had its tp value (0) sitting in a register and resumes with that stale value.
  5. It indexes CPUS[0] and returns hart 0's current process — which is some other thread now.

  Disabling interrupts pins the read to a single hart; the IntrLock guard re-enables them on drop.

Problem 3 — Why hold the proc lock across swtch?

When the scheduler swtches into a runnable process, it holds that process's inner lock. Why? What concrete bug would happen if we released the lock right before swtch instead?

Solution. Without the lock held across swtch, between releasing the lock and the actual register swap, another hart's scheduler could observe the same Runnable process, set its state to Running, and swtch into it — while we were also about to swtch into it. The process would now be running on two CPUs, sharing one kernel stack. Catastrophic. Holding the lock across the swap means only one scheduler can ever own the process during the transition; the lock is released by whichever code resumes on the other side.

Problem 4 — Where does a brand-new forked process first run?

A child has just been created by fork(). The scheduler picks it and swtches to it. What is the first instruction the child executes in kernel mode? What is its first instruction in user mode?

Solution. The child's context.ra was planted by fork() to point at fork_ret (proc.rs:619). So swtch's final ret jumps to fork_ret, which is the child's first kernel instruction. fork_ret force-unlocks the proc lock the scheduler left held, does one-shot FS init if this is the first user process, and falls through into usertrap_ret → userret → sret. Because the child's tf.epc is a clone of the parent's, the first user instruction is the one immediately after the parent's ecall — the same address the parent will return to. The only difference is tf.a0, which is 0 for the child.

Further Reading

  • Octox Guide §§4.5–4.8 (process model, scheduler, traps, syscall dispatch).
  • Octox Guide §§8.1–8.4 — the operation walkthroughs this lecture is built on.
  • xv6 book, ch. 4 (Traps), ch. 5 (Interrupts and device drivers), ch. 7 (Scheduling). Octox's Rust differs from the C original in names and idioms, but the ideas match line by line.
  • The RISC-V Privileged Architecture Spec, §4.1.1 (sepc, stvec, sret) and §4.2 (Sv39).
  • UNIX System Calls — the user-side companion to today's lecture.