OS Kernel: Inside Octox¶
A guided tour of how an operating-system kernel actually works, using the Octox kernel (a Rust port of xv6) as the running example. The previous lecture introduced the user side of the syscall interface; today we cross the trap boundary and look at what the kernel does to enforce isolation, dispatch system calls, switch privilege modes, and multiplex CPUs across processes.
Overview¶
A modern OS rests on three hardware-supported ideas. The CPU has more than
one privilege mode; the MMU gives each process its own virtual
address space; and a single instruction (ecall) crosses the boundary
between them in a controlled way. Octox shows the entire mechanism in
roughly 7,000 lines of Rust — small enough to read end-to-end, big
enough to be a real kernel. Today we walk that mechanism in two passes:
- The four conceptual blocks — user vs. kernel space, process isolation, system calls / mode switching, context switching.
- Four end-to-end traces from the Octox guide: `getpid`, `fork`, `exec`, and timer-driven preemption.
Learning Objectives¶
After this lecture you should be able to:
- Distinguish the three RISC-V privilege levels (M, S, U) and explain why hardware enforces them.
- Describe how Octox uses one page table per process plus a separate kernel page table, and why both must map the TRAMPOLINE at the same virtual address.
- Walk an `ecall` from user mode through `uservec`, `usertrap`, the syscall dispatch table, `usertrap_ret`, `userret`, and `sret` — naming what the hardware does at each step and what the software does.
- Explain the trapframe: what gets saved, where, and why.
- Read `swtch.rs` and explain why exactly one lock is held across it.
- Trace what `tf.a0` holds at every step of `fork()` from the parent's `ecall` to the child's first instruction back in user space.
- Describe how a hardware timer interrupt in M-mode becomes a process preemption in S-mode.
Prerequisites¶
- UNIX System Calls — the user side of the same interface we descend into today.
- Octox Guide — build/run, workspace layout, and the memory map (§2.3).
- RISC-V assembly lectures — `ecall`, CSRs (`sepc`, `stvec`, `satp`, `sstatus`), and the calling convention.
1. What Lives Where: Privilege Modes¶
RISC-V defines three privilege levels:
| Mode | Name | Who runs here |
|---|---|---|
| M | Machine | firmware: timer setup, delegating traps to S-mode |
| S | Supervisor | the kernel itself (Octox's usertrap, scheduler, ...) |
| U | User | every application binary (ulib, the shell, programs) |
The current privilege mode is not directly readable by software; what the
CSRs record is the *previous* mode (`mstatus.MPP` in M, `sstatus.SPP` in
S). User code cannot raise its own privilege; only a trap (in) or an
`mret`/`sret` (out) moves the CPU between levels. That single fact is
what makes process isolation possible: any attempt to access a kernel
resource has to go through code the kernel itself wrote.
Why we want this.
- Isolation. A buggy program cannot corrupt another program's memory or crash the OS.
- Arbitration. Only the kernel can speak to disks, the timer, or the console controller; programs go through it.
- Fault containment. A page fault in U-mode is recoverable (kill the process); the same fault in S-mode is a kernel panic.
Octox boots in M-mode in src/kernel/start.rs, programs the timer and
delegates the rest of the trap surface to S-mode, then mrets into the
kernel main — the kernel itself runs in S-mode from then on. User
processes run in U-mode and trap up to S-mode whenever they need a
privileged operation.
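To make the one-way door concrete, here is a toy state machine (my own sketch, not Octox code) with exactly the three legal transitions described above: traps raise the privilege level, `mret`/`sret` lower it, and nothing else touches it.

```rust
// Toy model of RISC-V privilege transitions. The enum and helpers are
// illustrative sketches, not Octox types.
#[derive(Clone, Copy, PartialEq, Debug)]
pub enum Mode { U, S, M }

// A U-mode trap lands in S-mode because start.rs delegated traps there;
// an S-mode trap stays in S (stvec); M-mode keeps M (the timer handler).
pub fn after_trap(m: Mode) -> Mode {
    match m {
        Mode::U | Mode::S => Mode::S,
        Mode::M => Mode::M,
    }
}

// sret returns to user mode (assuming sstatus.SPP recorded U);
// mret is used once at boot to drop from M into the S-mode kernel.
pub fn after_sret(_m: Mode) -> Mode { Mode::U }
pub fn after_mret(_m: Mode) -> Mode { Mode::S }

fn main() {
    // Boot: M --mret--> S; launch init: S --sret--> U; syscall: U --trap--> S.
    let mut m = Mode::M;
    m = after_mret(m);
    assert_eq!(m, Mode::S);
    m = after_sret(m);
    assert_eq!(m, Mode::U);
    m = after_trap(m);
    assert_eq!(m, Mode::S);
    println!("transitions ok");
}
```

Note what is missing: there is no `set_mode` function. That absence is the security property.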
2. Process Isolation in Octox¶
Each process has its own page table — a Uvm (user virtual
memory) in src/kernel/vm.rs. The kernel keeps a separate Kvm of its
own. The MMU's satp register selects which page table is active. When
the CPU traps into the kernel we swap satp to the kernel table; on the
way out we swap back.
Each PTE (page table entry) carries a PTE_U flag. The kernel's own pages
have no PTE_U; user mode trying to read them faults instantly. Each
process's page table only maps that process's own memory plus two shared
kernel pages (TRAMPOLINE and TRAPFRAME, below).
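The check the MMU performs can be sketched with the standard Sv39 flag bits (the bit positions are from the RISC-V spec and match `riscv.rs`; the helper function is mine, not Octox's):

```rust
// Sv39 PTE flag bits (standard RISC-V encoding).
const PTE_V: usize = 1 << 0; // valid
const PTE_R: usize = 1 << 1; // readable
const PTE_W: usize = 1 << 2; // writable
const PTE_X: usize = 1 << 3; // executable
const PTE_U: usize = 1 << 4; // accessible from U-mode

// Sketch of the hardware's check: a U-mode load succeeds only if the
// PTE is valid, readable, AND marked user-accessible.
fn user_can_read(flags: usize) -> bool {
    flags & PTE_V != 0 && flags & PTE_U != 0 && flags & PTE_R != 0
}

fn main() {
    let kernel_text = PTE_V | PTE_R | PTE_X;       // no PTE_U
    let user_heap = PTE_V | PTE_R | PTE_W | PTE_U;
    assert!(!user_can_read(kernel_text)); // user touching kernel memory faults
    assert!(user_can_read(user_heap));
    println!("pte checks ok");
}
```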
Memory map (per-process address space, top down):
```text
MAXVA ┌────────────────────────────────────────────────┐
      │ TRAMPOLINE (asm trap glue, mapped in every PT) │
-PGSZ ├────────────────────────────────────────────────┤
      │ TRAPFRAME (per-process register save area)     │
  ... ├────────────────────────────────────────────────┤
      │ guard page (no PTE_U)                          │
      │ user stack                                     │
      │ heap                                           │
      │ .bss / .data                                   │
      │ .text                                          │
0x0   └────────────────────────────────────────────────┘
```
The TRAMPOLINE trick. The page that contains uservec and userret
(the asm that swaps satp) must be mapped at the same virtual address
in every page table — user and kernel — so that the
instructions can keep executing while the page table changes underneath
them. Octox places it at MAXVA - PGSIZE and maps it identically in
Kvm::make (vm.rs:685) and every uvmcreate (proc.rs:445). Without
this, the csrw satp, ... instruction would suddenly land in unmapped
memory on the next fetch.
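A quick sanity check of where those shared pages land, using xv6-style Sv39 constants (`MAXVA = 1 << 38`; Octox's actual constant names may differ):

```rust
// Sv39 has 39 VA bits, but the top bit must not be used by the kernel's
// layout, so the usable ceiling is 1 << 38 (the xv6 convention).
const PGSIZE: usize = 4096;
const MAXVA: usize = 1 << 38;
const TRAMPOLINE: usize = MAXVA - PGSIZE;  // same VA in *every* page table
const TRAPFRAME: usize = TRAMPOLINE - PGSIZE;

fn main() {
    assert_eq!(TRAMPOLINE % PGSIZE, 0); // page-aligned, as a mapping must be
    assert_eq!(TRAMPOLINE, 0x3F_FFFF_F000);
    assert_eq!(TRAPFRAME, 0x3F_FFFF_E000);
    println!("TRAMPOLINE = {:#x}, TRAPFRAME = {:#x}", TRAMPOLINE, TRAPFRAME);
}
```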
Code sites.
- `src/kernel/vm.rs:322` — `walk()` does the 3-level Sv39 walk.
- `src/kernel/vm.rs:364` — `mappages()` installs a VA → PA mapping.
- `src/kernel/vm.rs:454` — `Uvm::create()` allocates a fresh user PT.
- `src/kernel/proc.rs:445` — per-process TRAMPOLINE / TRAPFRAME setup.
- `src/kernel/riscv.rs:615` — PTE flag bits (`PTE_V`, `_R`, `_W`, `_X`, `_U`).
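The 3-level walk that `walk()` performs boils down to slicing three 9-bit indices out of a 39-bit virtual address. A sketch of the standard Sv39 arithmetic (the helper name is mine):

```rust
// Sv39 VA layout: 12 offset bits, then three 9-bit VPN fields.
// walk() uses vpn(va, 2) at the root, then 1, then 0.
fn vpn(va: usize, level: usize) -> usize {
    (va >> (12 + 9 * level)) & 0x1ff
}

fn main() {
    // TRAMPOLINE in the xv6/Octox layout: MAXVA - PGSIZE.
    let va: usize = 0x3F_FFFF_F000;
    // Top-level index 255, then 511, then 511: the very last slot
    // of the last two tables, i.e. the top of the address space.
    assert_eq!([vpn(va, 2), vpn(va, 1), vpn(va, 0)], [255, 511, 511]);
    println!("indices ok");
}
```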
3. The System Call Mechanism¶
A system call is a function whose body lives on the other side of a
hardware-enforced privilege boundary. On RISC-V, ecall is the
instruction that crosses the boundary. The kernel installed the address of
its trap vector in stvec, so ecall jumps there with the CPU now in
S-mode.
In Octox the round trip looks like this:
```mermaid
flowchart LR
    U["user main()"] --> W["sys::xyz wrapper"]
    W --> E["ecall"]
    E --> UV["uservec (asm)"]
    UV --> UT["usertrap"]
    UT --> SC["syscall()"]
    SC --> H["SysCalls::xyz"]
    H --> URET["usertrap_ret"]
    URET --> UR["userret (asm)"]
    UR --> S["sret"]
    S --> U
```
Step by step.
- The user wrapper (auto-generated stub from `gen_usys` at `src/kernel/syscall.rs:837`) loads `a7` with the syscall number, fills `a0..a5` with arguments, and executes `ecall`.
- Hardware: `stvec` → `uservec` (`src/kernel/trampoline.rs:22`), privilege moves to S, `sepc` captures the user PC, `scause` records `UserEnvCall`.
- `uservec` saves all 31 user GPRs into the per-process `Trapframe` at fixed offsets, swaps `satp` to the kernel page table, loads the kernel `sp` and `tp` from the trapframe, and jumps to `usertrap`.
- `usertrap` (`src/kernel/trap.rs:44`) saves `sepc → tf.epc`, advances `tf.epc += 4` so the eventual `sret` returns past the `ecall`, calls `intr_on()`, and dispatches on the cause. For `UserEnvCall` it calls `syscall()`.
- `syscall()` (`src/kernel/syscall.rs:115`) reads `tf.a7`, indexes `SysCalls::TABLE`, and invokes the handler through the `Fn::I::call` trampoline (`syscall.rs:59`). The handler returns `Result<usize>`; `Fn::I::call` unwraps it (positive = success, negative = `-errno`) and `syscall()` writes the result into `tf.a0`.
- Back through `usertrap_ret` (`src/kernel/trap.rs:120`) → `userret` (in the trampoline) → `sret`. Because hardware loads `pc` from `sepc` and we set `sepc = tf.epc`, the user resumes one instruction past the `ecall` with `a0` holding the return value.
The dispatch table. A peek at syscall.rs:80:
```rust
pub const TABLE: [(Fn, &'static str); /* count */] = [
    (Fn::N(Self::invalid), ""),       // 0
    (Fn::I(Self::fork),    "()"),     // 1
    (Fn::I(Self::exit),    "(i32)"),  // 2
    (Fn::I(Self::wait),    "(i32*)"), // 3
    // ... 24 entries total
    (Fn::I(Self::getpid),  "()"),     // 11
];
```
tf.a7 is the syscall number; TABLE[tf.a7] selects the handler. The
matching user-side stubs are generated from the same enum by gen_usys
— if you add a syscall to SysCalls, the user library learns about
it for free.
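The whole dispatch-and-encode path fits in a few lines. This is a toy stand-in, not Octox's actual types: handlers return `Result`, and the adapter folds that into the single signed value the user stub sees (negative = `-errno`), just as `Fn::I::call` does.

```rust
// Toy syscall dispatch. Names and types are sketches, not Octox's.
type Handler = fn(&mut Trapframe) -> Result<usize, usize>;

struct Trapframe { a0: isize, a7: usize }

fn sys_invalid(_tf: &mut Trapframe) -> Result<usize, usize> { Err(1) }
fn sys_getpid(_tf: &mut Trapframe) -> Result<usize, usize> { Ok(42) }

const TABLE: [Handler; 2] = [sys_invalid, sys_getpid];

fn syscall(tf: &mut Trapframe) {
    // tf.a7 indexes the table; out-of-range numbers hit the invalid handler.
    let handler = TABLE.get(tf.a7).copied().unwrap_or(sys_invalid as Handler);
    tf.a0 = match handler(tf) {
        Ok(v) => v as isize,     // success: non-negative value
        Err(e) => -(e as isize), // failure: -errno
    };
}

fn main() {
    let mut tf = Trapframe { a0: 0, a7: 1 };
    syscall(&mut tf);
    assert_eq!(tf.a0, 42); // the "pid" rides back in a0
    tf.a7 = 0;
    syscall(&mut tf);
    assert_eq!(tf.a0, -1); // errors come back negative
    println!("dispatch ok");
}
```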
4. Mode Switching, in Detail¶
The hardest 90 lines of code in any xv6-style kernel live in
trampoline.rs. They are the bridge between two address spaces.
The trapframe. Every process owns a Trapframe page (proc.rs:170).
The first 32 bytes are kernel scratch — the kernel populates them
before starting the process so the trampoline can find the kernel's
page table and stack:
```text
off  field
---- ----------------
0    kernel_satp    // kernel page table (loaded on trap entry)
8    kernel_sp      // kernel stack pointer
16   kernel_trap    // address of usertrap()
24   epc            // user PC (saved/restored by the kernel)
32   kernel_hartid  // this hart's id
40   ra             // user GPRs from here on...
48   sp
...  ...
280  t6
```
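The offsets follow directly from the layout: five 8-byte kernel words, then 31 GPRs in order. A sketch that checks the arithmetic (the helper name is mine):

```rust
// Trapframe layout arithmetic: 5 kernel words, then GPRs ra, sp, gp, tp,
// t0-t2, s0-s1, a0-a7, s2-s11, t3-t6 at 8 bytes each.
const KERNEL_WORDS: usize = 5; // kernel_satp .. kernel_hartid

fn gpr_offset(idx: usize) -> usize { // idx 0 = ra, 1 = sp, ..., 30 = t6
    KERNEL_WORDS * 8 + idx * 8
}

fn main() {
    assert_eq!(gpr_offset(0), 40);   // ra, first saved GPR
    assert_eq!(gpr_offset(1), 48);   // sp
    assert_eq!(gpr_offset(9), 112);  // a0: the slot `sd t0, 112(a0)` targets
    assert_eq!(gpr_offset(30), 280); // t6, the last saved register
    println!("offsets ok");
}
```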
The sscratch swap. Saving 31 user registers needs at least one
working register, which means we have to free one up first. The
trick is sscratch: just before entering user mode, the kernel writes
the TRAPFRAME virtual address into sscratch. On the way back into the
kernel, uservec does:
```asm
csrrw a0, sscratch, a0   # atomic swap: a0 ↔ sscratch
sd ra, 40(a0)            # save user ra into trapframe
sd sp, 48(a0)            # save user sp
... (28 more registers)
csrr t0, sscratch        # recover the original user a0
sd t0, 112(a0)           # store it last
```
The sfence.vma bracket. The page-table swap is the moment of
maximum hazard. Two sfence.vma instructions surround it:
```asm
ld t1, 0(a0)           # kernel_satp from trapframe
sfence.vma zero, zero  # invalidate stale user TLB entries
csrw satp, t1          # install kernel page table
sfence.vma zero, zero  # ensure new translations take effect
jr t0                  # jump to usertrap
```
The first fence flushes any user PTEs the TLB cached; the second ensures the kernel PTEs we just installed are actually used by the next fetch. Skipping either one is a classic source of "the bug only happens once an hour under load."
5. Context Switching¶
The other half of an OS kernel is multiplexing. Octox runs one scheduler thread per hart; user processes are kernel threads that yield to it whenever they need to wait or are preempted.
The Context struct (proc.rs:136) holds 14 callee-saved registers:
ra, sp, and s0..s11. Caller-saved registers are not saved —
that is the whole point of the callee/caller split: by the time we are
about to call swtch, the compiler has already spilled anything live
across the call. So swtch.rs is just 14 sd + 14 ld + ret:
```asm
swtch:
    sd ra, 0(a0)     # save outgoing ctx
    sd sp, 8(a0)
    sd s0, 16(a0)
    ... (s1..s11)
    ld ra, 0(a1)     # load incoming ctx
    ld sp, 8(a1)
    ld s0, 16(a1)
    ... (s1..s11)
    ret              # jumps to the *new* ra
```
swtch(a, b) saves the current registers into *a and loads new ones
from *b. The ret at the end pops the incoming ra — so we
return to wherever the other thread last called swtch.
The scheduler loop (proc.rs:662):
```rust
fn scheduler(c: &mut Cpu) -> ! {
    loop {
        intr_on();
        for p in PROCS.pool.iter() {
            let mut inner = p.inner.lock();
            if inner.state == State::Runnable {
                inner.state = State::Running;
                c.proc = Some(p.clone());
                swtch(&mut c.context, &p.data().context);
                c.proc = None;
                // (lock released by scope exit on the way back)
            }
        }
    }
}
```
The lock-across-swtch invariant. When the scheduler picks a
runnable process it `swtch`es to it while holding that process's
`inner` lock. The lock is released by whichever code resumes on the
other side — either the scope-exit at `sched`'s return, or
`fork_ret`'s `force_unlock` for a brand-new child. `sched()` asserts
that exactly one lock is held, that the state is no longer `Running`,
and that interrupts are disabled before it switches.
Holding the lock across the switch is the only thing that keeps another
hart from grabbing this process and trying to run it on two CPUs at once.
fork_ret (proc.rs:619). When a brand-new child runs for the first
time, its kernel stack is empty — there is no "previous swtch to
return through." The fork code planted context.ra = fork_ret, so the
child's first instruction is fork_ret. It force-unlocks the inherited
lock, does one-shot FS init if it is the very first user process, and
falls through into usertrap_ret to enter user space.
myproc() and IntrLock (proc.rs:72). "Which process am I?"
sounds trivial, but the answer lives in this hart's Cpu::proc. To
find which Cpu is "mine" we read the tp register, which the boot
code populated. If a timer fires between reading tp and indexing
into CPUS, the scheduler may migrate this thread — and tp
becomes stale. lock_mycpu() disables interrupts before the read; the
returned IntrLock guard re-enables them on drop.
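The guard pattern is ordinary RAII. A sketch with an atomic flag standing in for the hart's interrupt-enable bit (the real Octox guard also tracks nesting depth, which this omits; names are mine):

```rust
// RAII sketch of IntrLock: construct = interrupts off, drop = back on.
use std::sync::atomic::{AtomicBool, Ordering};

// Stand-in for sstatus.SIE on this hart.
static INTR_ENABLED: AtomicBool = AtomicBool::new(true);

pub struct IntrLock; // zero-sized guard, like a MutexGuard without data

pub fn lock_mycpu() -> IntrLock {
    INTR_ENABLED.store(false, Ordering::SeqCst); // intr_off() before reading tp
    IntrLock
}

impl Drop for IntrLock {
    fn drop(&mut self) {
        INTR_ENABLED.store(true, Ordering::SeqCst); // re-enable on scope exit
    }
}

pub fn intr_enabled() -> bool {
    INTR_ENABLED.load(Ordering::SeqCst)
}

fn main() {
    {
        let _guard = lock_mycpu();
        // ...safe to read tp and index CPUS here: no preemption possible...
        assert!(!intr_enabled());
    } // guard dropped: interrupts back on, even on early return
    assert!(intr_enabled());
    println!("guard ok");
}
```

The point of the guard over a bare `intr_off()`/`intr_on()` pair is that the re-enable cannot be forgotten on any exit path.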
6. Walkthrough — getpid()¶
The simplest end-to-end trace: a syscall with no arguments and no side effects. Every step here is part of every other syscall too.
Scenario.
```rust
#![no_std]
use ulib::{println, sys};

fn main() {
    let pid = sys::getpid().unwrap();
    println!("my pid is {}", pid);
}
```
Diagram.
```mermaid
sequenceDiagram
    participant U as user
    participant TR as uservec
    participant UT as usertrap
    participant SC as syscall
    participant H as SysCalls::getpid
    participant P as Proc
    U->>TR: ecall a7=11
    TR->>TR: save user regs to TRAPFRAME, swap satp
    TR->>UT: jump usertrap
    UT->>UT: tf.epc = sepc + 4, intr_on
    UT->>SC: syscall()
    SC->>H: TABLE[11].0.call()
    H->>P: Cpus::myproc().pid()
    P-->>H: pid
    H-->>SC: Ok(pid)
    SC-->>UT: tf.a0 = pid
    UT->>TR: usertrap_ret → userret
    TR->>U: sret to user PC = epc, a0 = pid
```
Kernel trace.
- User stub loads `a7 = 11`, executes `ecall`.
- `uservec` (`trampoline.rs:22`) saves 31 user GPRs to TRAPFRAME, swaps `satp` to the kernel PT, loads kernel `sp`/`tp`, jumps to `usertrap`.
- `usertrap` (`trap.rs:44`) sets `tf.epc = sepc`, `tf.epc += 4`, enables interrupts, calls `syscall()`.
- `syscall()` (`syscall.rs:115`) reads `tf.a7 = 11`, dispatches via `TABLE[11]` to `SysCalls::getpid` (`syscall.rs:321`).
- `getpid` calls `Cpus::myproc()`. `myproc()` (`proc.rs:72`) calls `lock_mycpu()` to disable interrupts (returning an `IntrLock`), reads `tp`, indexes `CPUS.0[id]`, clones the `Arc<Proc>` from `Cpu::proc`. The `IntrLock` re-enables interrupts on drop.
- `getpid` calls `.pid()` (`proc.rs:416`), briefly takes the per-proc `inner` mutex, reads `inner.pid.0`, returns it.
- `Fn::I::call` writes the result into `tf.a0`.
- `usertrap_ret` (`trap.rs:120`) → `userret` → `sret`. The CPU resumes one instruction past the `ecall` with `a0 = pid`.
State deltas. None observable. Every write is bookkeeping
(tf.epc += 4, tf.a0 = pid).
Teaching aside — why myproc() disables interrupts
"Which process am I?" sounds like a one-liner, but the answer lives
in Cpu::proc for this hart, found via the tp register. If a
timer fires between reading tp and indexing into CPUS, the
scheduler can migrate this thread to a different hart — at
which point tp is stale. lock_mycpu() brackets the lookup with
interrupts disabled. This is the same hazard, in miniature, as the
"lock held across swtch" trick in §5.
7. Walkthrough — fork() + wait()¶
The most famous syscall in UNIX. A single call returns twice: once in
the parent with the child's pid, once in the child with 0.
Scenario.
```rust
match sys::fork().unwrap() {
    0 => {
        println!("child says hi");
        sys::exit(0);
    }
    child_pid => {
        let mut status: i32 = -1;
        let pid = sys::wait(&mut status).unwrap();
        println!("reaped pid={} status={}", pid, status);
    }
}
```
Diagram.
```mermaid
sequenceDiagram
    participant P as parent user
    participant TR as uservec
    participant UT as usertrap
    participant SC as syscall
    participant PR as proc::fork
    participant SCH as scheduler
    participant C as child user
    P->>TR: ecall a7=1
    TR->>UT: usertrap
    UT->>SC: syscall dispatch
    SC->>PR: SysCalls::fork
    PR->>PR: allocproc, uvmcopy, tf clone, tf.a0 = 0
    PR->>PR: state = Runnable
    PR-->>SC: Ok(child_pid)
    SC-->>UT: tf.a0 = child_pid
    UT-->>P: userret, sret
    Note over SCH,C: on some hart, scheduler picks child
    SCH->>C: swtch into fork_ret, then userret
    C->>C: runs with a0 = 0, calls exit
    C->>PR: proc::exit wakes parent
    P->>PR: sys_wait finds zombie, reads xstate, frees
```
Kernel trace.
- User stub: `a7 = 1`, `ecall`. `uservec` → `usertrap` → `syscall()`. `SysCalls::fork` (`syscall.rs:329`) calls free-function `fork()` (`proc.rs:814`).
- `fork()` allocates an `Unused` proc slot, assigns a fresh pid, creates a fresh user page table, sets `child.context.ra = fork_ret` and `context.sp` to the top of the new kernel stack.
- `parent_uvm.copy(&child_uvm, sz)` deep-copies every PTE: walks the parent, allocates fresh physical pages for the child, `memcpy`s each, installs matching PTEs. (No copy-on-write — conceptually simpler; the performance penalty is acceptable for teaching.)
- `child_tf.clone_from(parent_tf)` — same `epc`, same user `sp`. Then `child_tf.a0 = 0`. That single write is the only asymmetry between parent and child.
- The `ofile` array is cloned (each `Arc<File>` refcount bumps); `cwd` is cloned; `parents[c.idx]` is set to the parent's `Arc`; child state → `Runnable`.
- `fork()` returns `Ok(child_pid)`. `Fn::I::call` writes it into the parent's `tf.a0`. `usertrap_ret` → `userret` → `sret`. The parent resumes with `a0 = child_pid`.
- Meanwhile on some hart, `scheduler` (`proc.rs:662`) finds the `Runnable` child, takes its `inner` lock, sets its state to `Running`, `swtch`es into it.
- Because `context.ra == fork_ret`, the child's first instruction is `fork_ret` (`proc.rs:619`). It force-unlocks the inherited lock and falls through to `usertrap_ret`.
- The child `sret`s to the same user PC as the parent — but `tf.a0 = 0`. Its `match` arm runs.
- The child calls `sys::exit(0)`. `exit()` (`proc.rs:725`) closes files, drops cwd, re-parents children to `INITPROC`, calls `wakeup(&parent)`, sets state → `Zombie`, calls `sched()` — never returns.
- The parent, blocked in `wait()` (`proc.rs:875`), wakes up, scans `PROCS.pool`, finds its `Zombie` child, `copyout`s the exit status, frees the child slot, returns the child's pid.
State deltas. ProcInner.state walks
Unused → Used → Runnable → Running → Zombie → Unused.
tf.a0 is written twice — once to the child's pid in the
parent, once to 0 in the child. Every Arc<File> refcount bumps in
the child, drops on exit.
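The asymmetry is small enough to model directly. A sketch with a stripped-down trapframe (the field subset and function are mine; the behavior matches the trace above):

```rust
// Toy trapframe: just the fields needed to show fork's return values.
#[derive(Clone, PartialEq, Debug)]
struct Trapframe { epc: usize, sp: usize, a0: usize }

// Clone the parent's frame, then perform the two writes fork() and
// Fn::I::call make: 0 into the child's a0, the pid into the parent's.
fn fork_trapframes(parent: &Trapframe, child_pid: usize) -> (Trapframe, Trapframe) {
    let mut p = parent.clone();
    let mut c = parent.clone(); // child_tf.clone_from(parent_tf)
    c.a0 = 0;                   // child will return 0
    p.a0 = child_pid;           // parent will return the child's pid
    (p, c)
}

fn main() {
    let before = Trapframe { epc: 0x1000, sp: 0x7000, a0: 0 };
    let (p, c) = fork_trapframes(&before, 12);
    assert_eq!(p.a0, 12);
    assert_eq!(c.a0, 0);
    // Everything else is identical: both sret to the same user PC.
    assert_eq!(p.epc, c.epc);
    assert_eq!(p.sp, c.sp);
    println!("fork asymmetry ok");
}
```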
Teaching aside — the lock held across swtch
The scheduler swtches into a runnable process while holding
that process's inner lock. The lock is not released inside
swtch; it is released by whichever code the process resumes at
— either fork_ret (for a brand-new child) or the scope exit
of sched() (for a process that previously yielded). In Rust we
have to force_unlock on the newborn-process branch because the
MutexGuard object simply does not exist there — we never
ran the code that would have created it.
8. Walkthrough — exec()¶
exec replaces the current program's image with a new one. It does
not return on success — the caller's code is gone.
Scenario. The shell forks; the child runs /bin/_echo:
```rust
let argv = ["echo", "hi"];
sys::exec("/bin/_echo", &argv, None).unwrap();
// reachable only if exec failed
```
Diagram.
```mermaid
sequenceDiagram
    participant U as child user
    participant SC as syscall
    participant EX as exec.rs
    participant FS as fs (namei/read)
    participant VM as vm (uvmcreate/alloc)
    participant TR as userret
    U->>SC: ecall a7=7, path/argv/envp
    SC->>SC: Path::from_arg, Argv::from_arg → kernel Vec
    SC->>EX: exec(path, argv, envp)
    EX->>FS: namei, read ELF header, validate magic
    EX->>VM: uvmcreate, per PT_LOAD alloc + loadseg
    EX->>VM: alloc stack, clear guard page
    EX->>EX: copyout argv, envp, pointer array, slice descriptor
    EX->>EX: tf.epc = e_entry, tf.sp, tf.a1 = argv ptr
    EX->>VM: swap uvm, free old pages
    EX-->>SC: Ok(argc)
    SC->>TR: tf.a0 = argc, userret
    TR->>U: sret into e_entry
```
Kernel trace.
- User stub: `a7 = 7`, `a0 = path`, `a1 = argv`, `a2 = envp`, `ecall`.
- `SysCalls::exec` (`syscall.rs:605`) uses `Path::from_arg`, `Argv::from_arg`, `Envp::from_arg` to copy the path and every argv/envp string into kernel-owned `String`s before doing any work. This matters: the user page table is about to disappear.
- Calls free-function `exec()` (`exec.rs:90`). `path.namei()` walks the FS to an `Arc<Inode>`. Reads the first 64 bytes as `ElfHdr`. Validates the magic `0x7F 'E' 'L' 'F'`.
- `p.uvmcreate()` builds a fresh user page table — initially only the TRAMPOLINE and TRAPFRAME mappings. The current address space is untouched.
- For each `PT_LOAD` program header: `uvm.alloc(...)` grows the new address space, then `loadseg()` (`exec.rs:32`) reads `p_filesz` bytes from the inode into the freshly mapped pages by walking the new page table to resolve each VA. Pages from `p_filesz` to `p_memsz` are left zero — that is the BSS.
- Stack. One more `uvm.alloc` adds `(1 + STACK_PAGE_NUM) * PGSIZE` at the top of user space. The bottom page has `PTE_U` stripped by `uvm.clear(guard)` — a stack overflow page-faults instead of silently sliding into data.
- Push argv/envp strings. For each, decrement `sp` by `arg.len()`, round to 16 bytes, `copyout` into the new user stack, remember `(sp, len)` in a kernel `ustack: [usize; MAXARG*2]`.
- Push the `ustack` array itself, then a 16-byte slice descriptor `(ptr_to_ustack, MAXARG)` — that is the `&[&str]` the user-side `main` will receive.
- Rewrite the trapframe. `tf.epc = elf.e_entry`, `tf.sp = sp`, `tf.a1` = pointer to the slice descriptor (argv); argc rides out in `tf.a0` via the syscall return.
- Commit. `let olduvm = proc_data.uvm.replace(new_uvm); proc_data.sz = new_sz; olduvm.proc_uvmfree(oldsz);` This is the only irreversible step. Everything before it can be unwound by dropping `new_uvm`.
- Any FD with `FD_CLOEXEC` is closed.
- `Fn::I::call` writes `argc` to `tf.a0`; `usertrap_ret` → `userret` → `sret`. Because `sepc = tf.epc = e_entry`, the CPU jumps into the new program. The old program is gone.
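The `p_filesz`/`p_memsz` split is worth seeing in miniature. A sketch where a `Vec` stands in for the freshly mapped (zeroed) pages (the function is mine, not Octox's `loadseg`):

```rust
// Model of loading one PT_LOAD segment: copy p_filesz bytes from the
// file image, leave the tail (up to p_memsz) zero. That tail is the BSS.
fn load_segment(file_bytes: &[u8], memsz: usize) -> Vec<u8> {
    let mut seg = vec![0u8; memsz]; // uvm.alloc hands out zeroed pages
    seg[..file_bytes.len()].copy_from_slice(file_bytes); // the loadseg copy
    seg
}

fn main() {
    // 3 bytes of initialized data, 8 bytes of memory: 5 bytes of BSS.
    let seg = load_segment(&[1, 2, 3], 8);
    assert_eq!(&seg[..3], &[1, 2, 3]);
    assert!(seg[3..].iter().all(|&b| b == 0)); // BSS reads as zero
    println!("segment ok");
}
```

This is why an ELF file can be much smaller than the program's memory image: zeros are not stored, only described by `p_memsz - p_filesz`.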
Stack at user entry (high → low):
```text
      "hi"\0                                  <-- highest address
      "echo"\0
      ustack[0..1] = (ptr_to_"echo", 4)
      ustack[2..3] = (ptr_to_"hi", 2)
      (zeros padding to MAXARG)
      slice header = (ptr_to_ustack, MAXARG)
sp ->                                         <-- user main() starts here
```
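The `sp` arithmetic for one pushed string, following the trace above: decrement by the length, then round *down* to a 16-byte boundary (the RISC-V stack alignment). The helper name and starting address are mine:

```rust
// Model of pushing one argv string onto the new user stack.
// Returns the new sp plus the (ptr, len) pair remembered in ustack.
fn push_str(sp: usize, s: &str) -> (usize, (usize, usize)) {
    let sp = sp - s.len(); // make room for the bytes
    let sp = sp & !0xf;    // round down to 16 bytes, as the trace describes
    (sp, (sp, s.len()))
}

fn main() {
    let (sp, (ptr, len)) = push_str(0x4000, "echo");
    // 0x4000 - 4 = 0x3ffc, rounded down to 0x3ff0.
    assert_eq!((sp, ptr, len), (0x3ff0, 0x3ff0, 4));
    assert_eq!(sp % 16, 0); // sret lands in main() with an aligned stack
    println!("sp = {:#x}", sp);
}
```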
State deltas. ProcData.uvm swapped (old freed). tf.epc/sp/a0/a1
rewritten. ofile entries with FD_CLOEXEC gone. Pid, parent link,
cwd, surviving FDs all preserved.
Teaching aside — fork+exec vs spawn
UNIX splits process creation (fork) from program loading
(exec) so that, between the two, the child can rearrange its
own FDs, cwd, env. Every shell pipeline and redirection is built
that way. A combined posix_spawn would have to grow a mini
DSL for "close this, dup that, open the other" — fork+exec
gets it for free by just running ordinary user code in the child.
9. Walkthrough — Timer-Driven Preemption¶
Two processes A and B are both runnable. A is in user mode; the timer fires; A is preempted; B runs next.
Diagram.
```mermaid
sequenceDiagram
    participant A as proc A user
    participant TV as timervec (M-mode)
    participant TRAP as usertrap
    participant YD as yielding
    participant SW as swtch
    participant SCH as scheduler
    participant B as proc B user
    TV->>TV: CLINT mtimer fires
    TV->>TV: bump mtimecmp, set sip SSIP, mret
    A->>TRAP: SSIP delivered, uservec then usertrap
    TRAP->>YD: devintr returns Timer
    YD->>YD: inner.lock, state = Runnable
    YD->>SW: swtch A.ctx → scheduler.ctx
    SW->>SCH: ra loaded, return into scheduler loop
    SCH->>SCH: release A lock, find Runnable B
    SCH->>SW: swtch scheduler.ctx → B.ctx
    SW->>B: ra loaded into B prior sched call
    B->>B: returns through yielding, usertrap_ret
    B->>B: userret, sret back to user
```
Kernel trace.
- Boot setup (`start.rs:55`): program CLINT `mtimecmp` for the next tick, install `timervec` (`kernelvec.rs:87`) as the M-mode trap vector, set `mie.mtimer`. Per-CPU scratch holds the mtimecmp address and the interval.
- Timer fires in M-mode. `timervec` (naked asm): saves `a0..a3` to `mscratch`, bumps `mtimecmp` for the next tick, writes `2` to `sip` (SSIP — supervisor software interrupt pending), restores its scratch registers, `mret`. The M-mode handler never touches the scheduler.
- Back in S-mode with SSIP pending. If A was in user mode the trap funnels through `uservec` → `usertrap`. `devintr()` classifies the cause as `SupervisorSoft`, clears SSIP, returns `Some(Intr::Timer)`.
- The trap handler calls `proc::yielding()` (`proc.rs:718`). `yielding()` acquires `p.inner.lock()`, sets `state = Runnable`, calls `sched(guard, &mut p.data.context)`.
- `sched()` (`proc.rs:692`) sanity-checks: exactly one lock held (`c.noff == 1`), state is not `Running`, interrupts disabled. Saves `c.intena`. Calls `swtch(&mut p.context, &c.context)`.
- `swtch` (`swtch.rs`) saves `ra, sp, s0..s11` into A's context, loads them from the scheduler's, `ret`s — jumping to the scheduler's saved `ra`. Process A is now frozen with its stack intact and its proc lock still held.
- The scheduler resumes at `proc.rs:662`. The proc lock A's `yielding` passed in is released (by scope exit on the `swtch`-return path). The scheduler `intr_on()`s — it wants interrupts enabled while scanning so a `wakeup` from another hart is delivered — loops over `PROCS.pool`, finds B, takes `B.inner.lock()`, sets `state = Running`, calls `swtch(&mut cpu.context, &B.context)`.
- B resumes wherever its previous `swtch` left it:
    - Previously yielded: resumes inside `sched()` after `swtch`. Returning up the stack drops B's lock, returns to `yielding`, returns to `usertrap`, calls `usertrap_ret`.
    - Brand-new fork: `context.ra` still points to `fork_ret`. It force-unlocks and heads to `usertrap_ret`.
- `usertrap_ret` → `userret` restores B's trapframe, switches `satp` to B's `Uvm`, `sret` — B is in user space.
State deltas. ProcInner.state flips
Running → Runnable for A, Runnable → Running for B. Per-CPU
cpu.proc swaps. cpu.intena is preserved across the round trip so
interrupt-enable nesting stays balanced.
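The entire M-to-S handoff rides on one bit: `timervec` writing `2` into `sip` is exactly setting bit 1, the SSIP bit. A minimal sketch of that bit arithmetic:

```rust
// SSIP is bit 1 of the supervisor interrupt-pending CSR (RISC-V spec).
const SSIP: usize = 1 << 1;

fn main() {
    let mut sip: usize = 0;
    sip |= 2;                     // what timervec writes from M-mode
    assert_eq!(sip & SSIP, SSIP); // S-mode now sees a pending software intr
    sip &= !SSIP;                 // devintr() clears it after classifying
    assert_eq!(sip, 0);
    println!("ssip ok");
}
```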
Teaching aside — why bounce through a scheduler thread
Octox could in principle swtch straight from A to B. It doesn't
for two interlocking reasons. First, the scheduler needs
interrupts enabled while it scans (so a wakeup from another
hart is delivered promptly), but yielding runs with interrupts
disabled while holding A's lock — they cannot share a stack.
Second, the "lock held across swtch" invariant only composes
cleanly if the code on the other side of every swtch is a
known, trusted site (the scheduler or fork_ret) that knows to
release it. If any process could swtch to any other, every
process would have to anticipate every other's locking state
— combinatorial nightmare.
Key Concepts¶
| Concept | Takeaway |
|---|---|
| Privilege modes (M / S / U) | Mode bit in `mstatus`/`sstatus`; only traps move you up. |
| `satp` switch | Each process has its own page table; trap glue swaps `satp`. |
| TRAMPOLINE | Mapped at the same VA in every page table so the swap is safe. |
| TRAPFRAME | Per-proc page; saves user GPRs and pre-populated kernel scratch. |
| `ecall` / `sret` | The two instructions that cross the user/kernel boundary. |
| Syscall dispatch | `tf.a7` indexes `SysCalls::TABLE`; `Fn::I::call` adapts return values. |
| `swtch` | 14 `sd` / 14 `ld` / `ret` — saves callee-saved regs, jumps to new `ra`. |
| Lock-across-`swtch` | Scheduler holds the proc lock across the switch; resumer releases it. |
| `fork_ret` | First instruction a brand-new process executes; force-unlocks the lock. |
| `myproc()` and `IntrLock` | Disable interrupts while reading `tp` to avoid mid-read migration. |
| SSIP delivery from M to S | M-mode timer handler raises SSIP; S-mode handles it as a software int. |
Practice Problems¶
Problem 1 — tf.a0 over the lifetime of fork¶
Trace the value of `tf.a0` (in the parent's trapframe and in the
child's trapframe) at each of these moments. Some entries do not
exist yet.

| Moment | Parent `tf.a0` | Child `tf.a0` |
|---|---|---|
| User code right before `ecall` | ? | ? |
| Inside `usertrap`, just after `tf.epc += 4` | ? | ? |
| Inside `proc::fork`, just after `c_tf.clone_from(p_tf)` | ? | ? |
| Inside `proc::fork`, after `c_tf.a0 = 0` | ? | ? |
| Just after `Fn::I::call` writes the return value | ? | ? |
| Back in user mode after `sret` | ? | ? |
Solution
| Moment | Parent `tf.a0` | Child `tf.a0` |
|---|---|---|
| Right before `ecall` | undefined | (no child yet) |
| Inside `usertrap` after `tf.epc += 4` | undefined | (no child yet) |
| After `c_tf.clone_from(p_tf)` | undefined | undefined (== parent) |
| After `c_tf.a0 = 0` | undefined | `0` |
| After `Fn::I::call` writes return value | `child_pid` | `0` |
| Back in user mode after `sret` | `child_pid` | `0` |

The single asymmetric write is `c_tf.a0 = 0`. Everything else is identical in the two trapframes — that is *why* both processes return from the same user-side `ecall` to the same user PC.

Problem 2 — Why does myproc() disable interrupts?¶
Suppose we removed the IntrLock and let myproc() read tp
without disabling interrupts. Construct a sequence of events that
would cause myproc() to return the wrong Arc<Proc>.
Solution
1. Hart 0 enters `myproc()` and reads `tp = 0`.
2. Before it can index `CPUS[0]`, a timer interrupt fires.
3. The trap calls `yielding()`, the scheduler runs, and the *current* thread happens to be migrated to hart 1 (e.g., the scheduler on hart 1 picked it).
4. Hart 1's boot code already set `tp = 1` for itself. But the in-progress `myproc()` had its `tp` value (`0`) sitting in a register and resumes with that stale value.
5. It indexes `CPUS[0]` and returns *hart 0's* current process — which is some other thread now.

Disabling interrupts pins the read to a single hart; the `IntrLock` guard re-enables them on drop.

Problem 3 — Why hold the proc lock across swtch?¶
When the scheduler swtches into a runnable process, it holds that
process's inner lock. Why? What concrete bug would happen if we
released the lock right before swtch instead?
Solution
Without the lock held across `swtch`, between releasing the lock and the actual register swap, **another hart's** scheduler could observe the same `Runnable` process, set its state to `Running`, and `swtch` into it — while we were *also* about to `swtch` into it. The process would now be running on two CPUs, sharing one kernel stack. Catastrophic.

Holding the lock across the swap means only one scheduler can ever own the process during the transition; the lock is released by whichever code resumes on the other side.

Problem 4 — Where does a brand-new forked process first run?¶
A child has just been created by fork(). The scheduler picks it and
swtches to it. What is the first instruction the child executes in
kernel mode? What is its first instruction in user mode?
Solution
The child's `context.ra` was planted by `fork()` to point at **`fork_ret`** (`proc.rs:619`). So `swtch`'s final `ret` jumps to `fork_ret`, which is the child's first kernel instruction. `fork_ret` force-unlocks the proc lock the scheduler left held, does one-shot FS init if this is the first user process, and falls through into `usertrap_ret` → `userret` → `sret`. Because the child's `tf.epc` is a clone of the parent's, the first *user* instruction is the one immediately after the parent's `ecall` — the same address the parent will return to. The only difference is `tf.a0`, which is `0` for the child.

Further Reading¶
- Octox Guide §§4.5–4.8 (process model, scheduler, traps, syscall dispatch).
- Octox Guide §§8.1–8.4 — the operation walkthroughs this lecture is built on.
- xv6 book, ch. 4 (Traps), ch. 5 (Interrupts), ch. 7 (Scheduling). Octox is closer to the Rust port than to the C original, but the ideas match line by line.
- The RISC-V Privileged Architecture Spec, §4.1.1 (`sepc`, `stvec`, `sret`) and §4.2 (Sv39).
- UNIX System Calls — the user-side companion to today's lecture.