# OS Kernel: Inside Octox

### CS 631 Systems Foundations — Apr 30, 2026

Last time: the user side of the syscall interface. Today: the kernel side of the same boundary.

---

## Today's Agenda

1. Privilege levels (M / S / U)
2. Process isolation: page tables & TRAMPOLINE
3. The system-call mechanism (`ecall` round-trip)
4. Mode switching: trapframe, `sscratch`, `sfence.vma`
5. Context switching: `swtch`, scheduler, `fork_ret`
6. §8.1 walkthrough — `getpid`
7. §8.2 walkthrough — `fork` + `wait`
8. §8.3 walkthrough — `exec`
9. §8.4 walkthrough — timer preemption

---

## Three Privilege Levels

| Mode | Name | Who runs here |
|------|------|---------------|
| **M** | Machine | firmware: timer, delegate to S |
| **S** | Supervisor | kernel (Octox `usertrap`, scheduler) |
| **U** | User | every application |

The mode bit lives in `mstatus` / `sstatus`. Only a **trap** (in) or `mret` / `sret` (out) moves the CPU between levels.
That single fact is what makes process isolation possible.
---

## Why We Want an OS

- **Isolation.** A buggy program cannot corrupt another program.
- **Arbitration.** Only the kernel speaks to disk, timer, console.
- **Fault containment.** A page fault in U is recoverable (kill the process); the same fault in S is a kernel panic.

Octox boots in **M**, programs the timer, then `mret`s to **S**. User processes run in **U** and trap *up* to S whenever they need a privileged operation.

---

## Two Page Tables, One Kernel

Every process has its own `Uvm` (user page table). The kernel has one `Kvm` of its own. The `satp` register selects which is active.
```mermaid
flowchart LR
    P1["proc A user"] -->|"satp=A"| PTA["Uvm A"]
    P2["proc B user"] -->|"satp=B"| PTB["Uvm B"]
    K["kernel S-mode"] -->|"satp=K"| PTK["Kvm"]
    PTA --> PHYS["physical RAM"]
    PTB --> PHYS
    PTK --> PHYS
```
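The picture above can be reduced to a toy model — a hypothetical `translate` over plain hash maps, not Octox code — where `satp` is simply "which table is active" and a per-page user bit gates U-mode access:

```rust
use std::collections::HashMap;

// Toy model (not Octox code): a "page table" maps a virtual page number to
// (physical page number, user-accessible?). `satp` in this sketch is just a
// reference to whichever table is currently active.
pub type PageTable = HashMap<usize, (usize, bool)>;

pub const PGSIZE: usize = 4096;

// A U-mode access to a page without the user bit faults (returns None),
// exactly like a PTE that lacks PTE_U.
pub fn translate(satp: &PageTable, va: usize, user_mode: bool) -> Option<usize> {
    let (pa_page, user_ok) = *satp.get(&(va / PGSIZE))?;
    if user_mode && !user_ok {
        return None; // page fault: kernel-only page reached from U-mode
    }
    Some(pa_page * PGSIZE + va % PGSIZE)
}
```

Switching `satp` between two such tables is all it takes to give A and B disjoint views of the same physical RAM.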
PTE flag `PTE_U` = "this page is reachable from user mode." The kernel's pages have **no** `PTE_U`.

---

## The Memory Map

```text
MAXVA       +-----------------------------+
            | TRAMPOLINE (asm trap)       |  <- mapped in EVERY PT
MAXVA-PGSZ  +-----------------------------+
            | TRAPFRAME (per proc)        |
            +-----------------------------+
            | guard page (no PTE_U)       |
            | user stack                  |
            | heap                        |
            | .bss / .data                |
            | .text                       |
0x0         +-----------------------------+
```

Same skeleton in every process — only `.text/.data/...` differ.

---

## PTE Bits Cheat Sheet

| Bit | Name | Meaning |
|-----|------|---------|
| 0 | `V` | valid (entry is in use) |
| 1 | `R` | readable |
| 2 | `W` | writable |
| 3 | `X` | executable |
| 4 | `U` | user-mode accessible |
| 5 | `G` | global (shared across address spaces) |
| 6 | `A` | accessed (set by hardware) |
| 7 | `D` | dirty (set by hardware) |

Defined in `riscv.rs:615`. Octox uses `V|R|X` for code, `V|R|W` for data, plus `|U` if user-visible.

---

## The TRAMPOLINE Trick

The page that contains `uservec` and `userret` — the asm that swaps `satp` — must be mapped at the **same virtual address** in every page table.

Why?
The CPU is fetching instructions while we change `satp`. If the next instruction lives at a different VA in the new PT, the very next fetch lands in unmapped memory.
Octox places it at `MAXVA - PGSIZE` and maps it identically in `Kvm::make()` (`vm.rs:685`) and every `uvmcreate()` (`proc.rs:445`).

---

## `ecall` — The Trap Instruction

A user program executes:

```rust
sys::getpid()
// wraps:
//   li a7, 11
//   ecall
//   (return value in a0)
```

**What hardware does on `ecall`:**

1. Save user PC into `sepc`.
2. Set `scause = UserEnvCall`.
3. Switch CPU privilege from U to S.
4. Jump to the address in `stvec` — that's `uservec`.

The kernel programmed `stvec` at boot. From here on, *software* is in charge.

---

## The Trapframe

Per-process page (`proc.rs:170`). The leading fields (`kernel_satp`, `kernel_sp`, `kernel_trap`, `kernel_hartid`) are kernel scratch — populated *before* the process runs.
| off | field | populated by |
|-----|-------|--------------|
| 0 | `kernel_satp` | kernel before sret |
| 8 | `kernel_sp` | kernel before sret |
| 16 | `kernel_trap` (= `usertrap`) | kernel before sret |
| 24 | `epc` | kernel saves on entry |
| 32 | `kernel_hartid` | kernel before sret |
| 40 | `ra` | uservec saves on entry |
| 48 | `sp` | uservec saves on entry |
| ... | ... | ... |
| 112 | `a0` | uservec via `sscratch` |
| ... | (a1..t6) | uservec saves on entry |
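The offsets in this table are just `#[repr(C)]` field positions. A hosted sanity-check sketch — field names between `sp` and `a0` are an assumption based on the standard xv6-style register order, and a 64-bit target is assumed:

```rust
// Hosted sketch, not Octox source. With #[repr(C)] and usize fields on a
// 64-bit target, field i sits at byte offset 8*i — the numbers uservec uses.
#[repr(C)]
pub struct TrapFrame {
    pub kernel_satp: usize,   // offset 0
    pub kernel_sp: usize,     // offset 8
    pub kernel_trap: usize,   // offset 16
    pub epc: usize,           // offset 24
    pub kernel_hartid: usize, // offset 32
    pub ra: usize,            // offset 40
    pub sp: usize,            // offset 48
    pub gp: usize,            // fields from here to a0 assumed (xv6 order)
    pub tp: usize,
    pub t0: usize,
    pub t1: usize,
    pub t2: usize,
    pub s0: usize,
    pub s1: usize,
    pub a0: usize,            // offset 112 — matches `sd t0, 112(a0)`
    // a1..a7, s2..s11, t3..t6 would follow
}
```

If the asserted offsets ever drift from the asm constants, traps corrupt registers silently — which is why pinning the layout with `#[repr(C)]` matters.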
---

## `uservec` — Saving User Registers

```asm
csrrw a0, sscratch, a0  # atomic swap: a0 ↔ sscratch
                        # (sscratch held TRAPFRAME VA)
sd ra, 40(a0)           # save user ra
sd sp, 48(a0)           # save user sp
sd gp, 56(a0)
... (28 more sd's)
csrr t0, sscratch       # recover original user a0
sd t0, 112(a0)          # save it last
```

`sscratch` was the only "free" register on entry. Swapping it with `a0` gives us a working pointer to the trapframe.

---

## `uservec` — Switching satp

```asm
ld sp, 8(a0)            # kernel_sp
ld tp, 32(a0)           # kernel_hartid
ld t0, 16(a0)           # kernel_trap (= usertrap)
ld t1, 0(a0)            # kernel_satp from trapframe
                        # (all loads through a0 must happen BEFORE the
                        #  switch — the trapframe isn't mapped in Kvm)
sfence.vma zero, zero   # invalidate stale user TLB
csrw satp, t1           # install kernel page table
sfence.vma zero, zero   # publish new translations
jr t0
```
The two `sfence.vma`s bracket the swap. The TRAMPOLINE is mapped at the same VA in both PTs — that's why the next instruction fetch still works.
---

## `usertrap` — Rust Side

```rust
fn usertrap() {
    // we are in S-mode on the kernel page table
    let p = myproc();
    let tf = p.trapframe_mut();
    tf.epc = sepc::read();  // save user PC
    tf.epc += 4;            // return PAST the ecall
    intr_on();              // re-enable interrupts
    match scause::read().cause() {
        Trap::Exception(UserEnvCall) => syscall(),
        Trap::Interrupt(_) => devintr_dispatch(),
        ...
    }
    usertrap_ret();
}
```

`trap.rs:44`. `epc += 4` is critical: without it, `sret` would just re-execute the `ecall` forever.

---

## The Syscall Dispatch Table

```rust
pub const TABLE: [(Fn, &str); 24] = [
    (Fn::N(Self::invalid), ""),     // 0
    (Fn::I(Self::fork), "()"),      // 1
    (Fn::I(Self::exit), "(i32)"),   // 2
    (Fn::I(Self::wait), "(i32*)"),  // 3
    ...
    (Fn::I(Self::exec), "(s,a,e)"), // 7
    ...
    (Fn::I(Self::getpid), "()"),    // 11
];

pub fn syscall() {
    let n = tf.a7;
    tf.a0 = TABLE[n].0.call() as usize;
}
```

`syscall.rs:115`. The user-side stubs (`gen_usys`, `syscall.rs:837`) are generated from the **same enum** — one source of truth.

---

## `userret` & `sret` — The Return Path

```asm
csrw satp, a0           # install user PT (a0 = user satp)
sfence.vma zero, zero
ld ra, 40(a1)           # restore user GPRs from trapframe
ld sp, 48(a1)
... (28 more ld's)
csrw sscratch, a1       # park trapframe VA for next entry
ld a0, 112(a1)          # restore user a0 LAST
sret                    # PC <- sepc, mode <- U
```

`sret` reads `sepc` (which we set to `tf.epc`) and lands the CPU exactly one instruction past the `ecall`, in user mode, with `a0` holding the return value.

---

## The Context Struct

`proc.rs:136`. Just **14 callee-saved registers**:

```rust
#[repr(C)]
pub struct Context {
    pub ra: usize,
    pub sp: usize,
    pub s0: usize, pub s1: usize, pub s2: usize,  pub s3: usize,
    pub s4: usize, pub s5: usize, pub s6: usize,  pub s7: usize,
    pub s8: usize, pub s9: usize, pub s10: usize, pub s11: usize,
}
```

Caller-saved regs (`t0..t6`, `a0..a7`) are **not** here — the compiler already spilled anything live across the `swtch` call.
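The 14 slots above line up exactly with the offsets `swtch` hard-codes (`0`, `8`, `16`, …, `104`). A quick hosted-Rust sanity check of that claim — not kernel code, 64-bit target assumed:

```rust
// Hosted sanity check: with #[repr(C)] and usize fields on a 64-bit target,
// field i of Context sits at byte offset 8*i — the constants in swtch.
#[repr(C)]
pub struct Context {
    pub ra: usize,  // sd ra, 0(a0)
    pub sp: usize,  // sd sp, 8(a0)
    pub s0: usize,  // sd s0, 16(a0)
    pub s1: usize,  // sd s1, 24(a0)
    pub s2: usize, pub s3: usize, pub s4: usize, pub s5: usize,
    pub s6: usize, pub s7: usize, pub s8: usize, pub s9: usize,
    pub s10: usize,
    pub s11: usize, // sd s11, 104(a0)
}
```

Fourteen fields × 8 bytes = 112 bytes total, which is why `swtch` is exactly 14 stores, 14 loads, and a `ret`.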
---

## `swtch` — All 28 Instructions

```asm
swtch:                  # fn swtch(old: *mut Context,
                        #          new: *const Context)
    sd ra, 0(a0)        # save outgoing
    sd sp, 8(a0)
    sd s0, 16(a0)
    sd s1, 24(a0)
    ... (s2..s11)       # 10 more sd's
    ld ra, 0(a1)        # load incoming
    ld sp, 8(a1)
    ld s0, 16(a1)
    ld s1, 24(a1)
    ... (s2..s11)       # 10 more ld's
    ret                 # JUMPS TO NEW ra
```

The `ret` pops the incoming `ra` — we return wherever the *other* thread last called `swtch`.

---

## The Scheduler Loop

```rust
fn scheduler(c: &mut Cpu) -> ! {
    loop {
        intr_on();  // ← enabled while scanning
        for p in PROCS.pool.iter() {
            let mut inner = p.inner.lock();
            if inner.state == State::Runnable {
                inner.state = State::Running;
                c.proc = Some(p.clone());
                swtch(&mut c.context, &p.data().context); // ← lock held!
                c.proc = None;
                // (lock released by scope exit on resume)
            }
        }
    }
}
```

`proc.rs:662`. One scheduler thread **per hart**.

---

## The Lock-Across-`swtch` Invariant

`sched()` (`proc.rs:692`) asserts:

```rust
assert!(c.noff == 1, "sched: multiple locks");
assert!(!intr_get(), "sched interruptible");
```

Exactly **one** lock held: this proc's `inner` lock. Released by whichever code resumes on the other side:

- previously-yielded: scope exit of `sched`
- brand-new fork: `fork_ret::force_unlock`
Without it, two harts could observe the same Runnable proc and both swtch into it — one process running on two CPUs sharing one kernel stack. Catastrophic.
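A toy spinlock — hypothetical, not Octox's `Mutex` — makes the "resumer releases" discipline concrete: `force_unlock` releases a lock that some *other* code path acquired, which is exactly what `fork_ret` must do on a brand-new kernel stack with no guard object in scope:

```rust
use std::sync::atomic::{AtomicBool, Ordering};

// Toy illustration only. The scheduler acquires the lock, swtches away, and
// whoever resumes on the other side is responsible for releasing it.
pub struct SpinLock {
    locked: AtomicBool,
}

impl SpinLock {
    pub const fn new() -> Self {
        SpinLock { locked: AtomicBool::new(false) }
    }

    /// Try to take the lock; false if someone else holds it.
    pub fn try_acquire(&self) -> bool {
        self.locked
            .compare_exchange(false, true, Ordering::Acquire, Ordering::Relaxed)
            .is_ok()
    }

    /// What fork_ret needs: release a lock acquired by *other* code
    /// (the scheduler), with no guard object on this stack.
    pub fn force_unlock(&self) {
        self.locked.store(false, Ordering::Release);
    }
}
```

While the lock is held across the `swtch`, no other hart's scheduler can mark the same proc Running — that is the whole invariant.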
---

## `fork_ret` — Where a New Process Starts

A brand-new child has an empty kernel stack — there is no "previous swtch" to return through. `fork()` plants `context.ra = fork_ret`. So `swtch`'s final `ret` jumps straight there.

```rust
extern "C" fn fork_ret() {
    let p = myproc();
    p.inner.force_unlock();      // release scheduler's lock
    if FIRST_PROC { fsinit(); }  // one-shot
    usertrap_ret();              // → userret → sret
}
```

`proc.rs:619`. First user instruction = right after the parent's `ecall`.

---

## `myproc()` and the IntrLock

"Which process am I?" lives in `Cpu::proc` for *this hart*, found via the `tp` register.

```rust
pub fn myproc() -> Arc<Proc> {
    let _intr = lock_mycpu();  // intr_off + IntrLock
    let id = cpu_id();
    CPUS.0[id].proc.clone().unwrap()
    // _intr drops here → intr_on
}
```

If a timer fires between reading `tp` and the index, the scheduler can migrate the thread — `tp` becomes stale — we'd return the *wrong* process. `proc.rs:72`.

---

## §8.1 — `getpid()`

```rust
fn main() {
    let pid = sys::getpid().unwrap();
    println!("my pid is {}", pid);
}
```

The simplest end-to-end trace. No arguments, no side effects, one register field read. Every step here happens in **every other syscall too**.

---

## `getpid` — Sequence
```mermaid
sequenceDiagram
    participant U as user
    participant TR as uservec
    participant UT as usertrap
    participant SC as syscall
    participant H as SysCalls::getpid
    participant P as Proc
    U->>TR: ecall a7=11
    TR->>TR: save regs, swap satp
    TR->>UT: jump usertrap
    UT->>UT: tf.epc=sepc+4, intr_on
    UT->>SC: syscall()
    SC->>H: TABLE[11].0.call()
    H->>P: myproc().pid()
    P-->>H: pid
    H-->>SC: Ok(pid)
    SC-->>UT: tf.a0 = pid
    UT->>TR: usertrap_ret → userret
    TR->>U: sret, a0 = pid
```
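The `TABLE[11].0.call()` hop in the middle of the diagram is plain array indexing on `a7`. A toy version — simplified signatures, not Octox's `Fn` enum, and a made-up pid:

```rust
// Toy dispatch: a7 indexes a table of handlers; the result is what the
// kernel writes back into tf.a0. Slot numbering mirrors the lecture's TABLE.
type Handler = fn() -> usize;

fn sys_invalid() -> usize { usize::MAX }
fn sys_getpid() -> usize { 3 } // pretend this process has pid 3

// Index 11 = getpid, as in the lecture's table; other slots stubbed out.
const TABLE: [Handler; 12] = [
    sys_invalid, sys_invalid, sys_invalid, sys_invalid,
    sys_invalid, sys_invalid, sys_invalid, sys_invalid,
    sys_invalid, sys_invalid, sys_invalid, sys_getpid,
];

fn syscall(a7: usize) -> usize {
    TABLE[a7]() // dispatched result lands in tf.a0
}
```

One array, one index, one indirect call — that is the entire "mechanism" between `ecall` and the handler.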
---

## `getpid` — Kernel Trace

1. User stub: `a7=11`, `ecall`. (`trampoline.rs:22`)
2. `uservec`: save 31 GPRs, swap `satp`, jump `usertrap`.
3. `usertrap` (`trap.rs:44`): `tf.epc=sepc+4`, `intr_on`, `syscall()`.
4. `syscall()` (`syscall.rs:115`): `TABLE[11]` → `SysCalls::getpid`.
5. `getpid` (`syscall.rs:321`): `myproc()` (intr off), read `inner.pid.0`.
6. Return `Ok(pid)`. `Fn::I::call` writes `tf.a0 = pid`.
7. `usertrap_ret` → `userret` → `sret`.
8. CPU resumes 1 instruction past `ecall` with `a0 = pid`.

**State delta:** none observable. `tf.epc += 4` and `tf.a0 = pid` are the only writes — both happen in *every* syscall.

---

## Teaching Aside: Why `myproc` Disables Interrupts
"Which process am I?" sounds trivial — it's the subtlest read in the kernel.
- Answer lives in this hart's `Cpu::proc`.
- "Which hart am I?" = `tp` register.
- Timer fires between read of `tp` and index of `CPUS`? Scheduler migrates this thread → `tp` is stale → we return the *other* hart's current proc.

`IntrLock` brackets the lookup atomically. Same hazard, in miniature, as the lock-across-`swtch` trick.

---

## §8.2 — `fork()` + `wait()`

```rust
match sys::fork().unwrap() {
    0 => {
        println!("child says hi");
        sys::exit(0);
    }
    child_pid => {
        let mut status: i32 = -1;
        let pid = sys::wait(&mut status).unwrap();
        println!("reaped {} status={}", pid, status);
    }
}
```

A single call returns **twice**: `child_pid` in the parent, `0` in the child. The asymmetry is one byte — `tf.a0 = 0` in the child.

---

## `fork` — Sequence
```mermaid
sequenceDiagram
    participant P as parent user
    participant SC as syscall
    participant PR as proc::fork
    participant SCH as scheduler
    participant C as child user
    P->>SC: ecall a7=1
    SC->>PR: SysCalls::fork
    PR->>PR: allocproc, uvmcopy
    PR->>PR: tf clone, tf.a0 = 0
    PR->>PR: state = Runnable
    PR-->>SC: Ok(child_pid)
    SC-->>P: tf.a0 = child_pid, sret
    Note over SCH,C: scheduler picks child
    SCH->>C: swtch into fork_ret
    C->>C: runs with a0 = 0, exits
    C->>PR: exit wakes parent
    P->>PR: wait reaps zombie
```
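The `uvmcopy` step in the diagram is an eager deep copy — no copy-on-write. A toy model, assuming a "page" is just a byte vector:

```rust
// Toy model of uvmcopy: walk the parent's pages, allocate a fresh frame for
// each, and memcpy the contents. Child and parent share nothing afterward.
pub type Uvm = Vec<Vec<u8>>;

pub fn uvmcopy(parent: &Uvm) -> Uvm {
    parent
        .iter()
        .map(|page| page.clone()) // fresh frame + memcpy, per page
        .collect()
}
```

Because every frame is duplicated, a write on either side after `fork` is invisible to the other — the isolation the walkthrough relies on.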
---

## `fork` — Kernel Trace (1/2)

1. User: `a7=1`, `ecall`.
2. `SysCalls::fork` (`syscall.rs:329`) → `proc::fork()` (`proc.rs:814`).
3. `PROCS.alloc()` finds an `Unused` slot, fresh pid, fresh user PT. `child.context.ra = fork_ret`, `context.sp = top of new kstack`.
4. `parent_uvm.copy(&child_uvm, sz)`: walk parent PTEs, alloc fresh pages, `memcpy`, install matching PTEs in child. (Deep copy — no COW.)
5. `child_tf.clone_from(parent_tf)`: same `epc`, same `sp`. Then:

```rust
child_tf.a0 = 0; // ← the one byte of asymmetry
```

---

## `fork` — Kernel Trace (2/2)

6. Clone the `ofile` array (each `Arc` refcount bumps), `cwd`, set `parents[c.idx]`. State → `Runnable`.
7. Return `Ok(child_pid)`. `Fn::I::call` writes parent's `tf.a0`.
8. Parent: `usertrap_ret` → `userret` → `sret` with `a0 = child_pid`.
9. Some hart's scheduler picks child, `swtch`es while holding lock.
10. `context.ra = fork_ret`. Child's first instruction: `fork_ret` (`proc.rs:619`). Force-unlock, `usertrap_ret`.
11. Child `sret`s to **same user PC** as parent — with `tf.a0 = 0`. Its `match` arm runs.
12. Child exits. Parent's `wait` reaps zombie.

State: `Unused → Used → Runnable → Running → Zombie → Unused`.

---

## Teaching Aside: Lock Across `swtch`
The scheduler `swtch`es into a runnable process **while holding** that process's `inner` lock.
The lock is **not** released inside `swtch`. It is released by whichever code resumes on the other side:

- previously yielded → scope exit of `sched()`
- brand-new fork → `fork_ret::force_unlock`

In Rust we *have to* `force_unlock` on the newborn branch: the `MutexGuard` object simply doesn't exist there.

---

## §8.3 — `exec()`

```rust
let argv = ["echo", "hi"];
sys::exec("/bin/_echo", &argv, None).unwrap();
// reachable only if exec failed
```

`exec` **replaces** the current program's image. **It does not return on success.**

`argv[0]` is conventionally the program name. The fd table survives. The pid survives. The user address space does not.

---

## `exec` — Sequence
```mermaid
sequenceDiagram
    participant U as child user
    participant SC as syscall
    participant EX as exec.rs
    participant FS as fs
    participant VM as vm
    U->>SC: ecall a7=7
    SC->>SC: copy path/argv/envp INTO kernel
    SC->>EX: exec(path, argv, envp)
    EX->>FS: namei, read ELF, validate magic
    EX->>VM: uvmcreate (fresh PT)
    EX->>VM: per PT_LOAD: alloc + loadseg
    EX->>VM: alloc stack, clear guard
    EX->>EX: copyout argv strings, ustack, slice
    EX->>EX: tf.epc=e_entry, tf.sp, tf.a1=argv
    EX->>VM: COMMIT: replace uvm, free old
    EX-->>SC: Ok(argc)
    SC-->>U: sret into e_entry
```
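The COMMIT step at the end of the diagram follows a build-then-swap discipline: construct the complete new image first, and only replace the old one once nothing can fail. A toy sketch with hypothetical types (not Octox's `Uvm`):

```rust
// Toy model of exec's commit discipline. Any failure before the assignment
// just drops the half-built image; the old one survives untouched.
pub struct Image {
    pub name: String,
}

pub fn exec(current: &mut Image, path: &str) -> Result<(), ()> {
    // "load the ELF": fail early on a bad path, before touching `current`
    if !path.starts_with("/bin/") {
        return Err(());
    }
    let new_image = Image { name: path.to_string() }; // fully built first
    *current = new_image; // THE commit moment — old image dropped here
    Ok(())
}
```

This is why a failed `exec` returns to the *old* program: until the single commit line runs, the old address space is still fully intact.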
---

## `exec` — The Stack at User Entry

```text
high  +--------------------------------+
      | "hi\0"                         |  argv strings
      | "echo\0"                       |
      +--------------------------------+
      | ustack[0..1] = (ptr"echo", 4)  |  pointer table
      | ustack[2..3] = (ptr"hi", 2)    |
      | (zeros padding to MAXARG)      |
      +--------------------------------+
      | slice = (ptr_ustack, MAXARG)   |  &[&str]
sp -> +--------------------------------+  user main() starts here
low
```

`tf.a1` points at the slice header. `tf.a0` carries `argc` from the syscall return path. RISC-V ABI loves `a1` for argv.

---

## `exec` — The "Commit" Step

Everything before this point can be unwound by **dropping `new_uvm`**:

```rust
let olduvm = proc_data.uvm.replace(new_uvm); // ← THE moment
proc_data.sz = new_sz;
proc_data.name = "echo";
olduvm.proc_uvmfree(oldsz); // free old pages
```

After this single line, the old program is gone. `sret` will use `tf.epc = elf.e_entry`, putting the CPU at the new program's entry.

---

## Teaching Aside: fork+exec vs. spawn

UNIX **splits** process creation (`fork`) from program loading (`exec`). Why? Between the two, the child can rearrange its own FDs, cwd, env — with **ordinary user code**.

```text
fork() -> child:
    fd = open("out", WRONLY|CREATE|TRUNC)
    dup2(fd, 1)
    close(fd)
    exec("/bin/cmd", argv)
```

A combined `posix_spawn` would have to grow a mini DSL to do the same. fork+exec gets it for free.

---

## §8.4 — Timer Preemption

Two procs A and B, both runnable. A is in user mode. Timer fires. What has to happen for B to run next?
```mermaid
flowchart LR
    CLINT["CLINT mtimer"] --> TV["timervec (M-mode asm)"]
    TV --> SIP["set sip.SSIP, mret"]
    SIP --> UV["uservec"]
    UV --> UT["usertrap"]
    UT --> YD["yielding"]
    YD --> SCH["scheduler"]
    SCH --> NEW["swtch into B"]
```
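The chain above ends in the scheduler's scan. A toy round-robin model of just that selection logic — no locks, no harts, plain hosted Rust:

```rust
// Toy model: A yields on the timer tick, the "scheduler" scans round-robin
// for the next Runnable proc and marks it Running.
#[derive(Clone, Copy, PartialEq, Debug)]
pub enum State {
    Runnable,
    Running,
}

pub fn yield_and_pick(states: &mut [State], current: usize) -> usize {
    states[current] = State::Runnable; // `yielding`: A goes back to Runnable
    let n = states.len();
    // scheduler loop: scan starting just after `current`
    for i in 1..=n {
        let cand = (current + i) % n;
        if states[cand] == State::Runnable {
            states[cand] = State::Running;
            return cand;
        }
    }
    unreachable!("at least `current` is Runnable");
}
```

With A running and B runnable, one tick hands the CPU to B and parks A as Runnable — exactly the two-proc scenario posed above.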
---

## Timer Path: M → S

**Boot** (`start.rs:55`): program CLINT `mtimecmp`, install `timervec` as M-mode trap vector, set `mie.mtimer`.

**Tick fires in M-mode.** `timervec` asm (`kernelvec.rs:87`):

```asm
# save a0..a3 to mscratch
# bump mtimecmp for next tick
li a1, 2
csrw sip, a1   # raise SSIP (S-mode soft int)
mret           # back to S-mode
```

The M-mode handler **never touches the scheduler.** It hands the event down to S-mode as a software interrupt.

---

## `yielding` → `sched` → `swtch`

```rust
pub fn yielding() {
    let p = myproc();
    let mut inner = p.inner.lock();
    inner.state = State::Runnable;
    sched(inner, &mut p.data.context); // returns the guard
    // (when we resume, the lock is still held)
}

fn sched(guard, ctx) {
    assert!(c.noff == 1);
    assert!(!intr_get());
    let intena = c.intena;
    swtch(ctx, &c.context); // ← off to scheduler
    c.intena = intena;
    guard
}
```

`proc.rs:692`, `proc.rs:718`.

---

## Teaching Aside: Why Bounce Through Scheduler?

Octox could in principle `swtch` straight from A to B. It deliberately doesn't. Two interlocking reasons:

1. The scheduler needs **interrupts on** while scanning (so a `wakeup` from another hart is delivered). `yielding` runs with **interrupts off** (holds A's lock). They cannot share a stack.
2. The lock-across-`swtch` invariant only composes if the *other side* of every `swtch` is a known, trusted site (scheduler or `fork_ret`) that knows to release it. A direct A → B `swtch` would force every process to anticipate every other's locking state. Combinatorial nightmare.
---

## Key Concepts Recap

| Concept | Takeaway |
|---------|----------|
| M / S / U | Mode bit; only traps move you up |
| `satp` swap | Per-proc page table; bracketed by `sfence.vma` |
| TRAMPOLINE | Same VA in every PT — that's why the swap is safe |
| TRAPFRAME | Per-proc; saves user GPRs + kernel scratch |
| `ecall` / `sret` | The two boundary-crossing instructions |
| Syscall TABLE | `tf.a7` indexes; one source of truth (gen_usys) |
| `swtch` | 14 sd / 14 ld / ret — jumps to new `ra` |
| Lock-across-`swtch` | Resumer releases; pinning invariant |
| `fork_ret` | Brand-new proc's first kernel instruction |
| SSIP | M-mode timer hands events to S-mode via `sip` |

---

## Practice Problems

1. Trace `tf.a0` over `fork`'s lifetime — parent and child trapframes, at six different moments. Where is the asymmetry?
2. Construct a sequence of events that breaks `myproc()` if you *remove* its IntrLock.
3. Why hold the proc lock across `swtch`? What concrete bug appears if you release it before the swap?
4. Where does a brand-new forked process first execute, in **kernel** mode? In **user** mode?

(Full solutions in the lecture notes.)

---

## Further Reading

- [Octox Guide](../guides/octox-guide.md) §§4.5–4.8 — process model, scheduler, traps, syscall dispatch
- [Octox Guide](../guides/octox-guide.md) §§8.1–8.4 — the operation walkthroughs
- xv6 book ch. 4 (Traps), ch. 5 (Interrupts), ch. 7 (Scheduling)
- RISC-V Privileged Spec §4.1.1 (`sepc`, `stvec`, `sret`), §4.2 (Sv39)
- [UNIX System Calls lecture](12-cs631-2026-04-23-os-syscalls.md)