UNIX System Calls¶
A tour of the core UNIX system-call interface — fork, exec, wait,
open, read, write, close, dup2, pipe — through nine small
teaching programs in the Octox kernel (a Rust port
of xv6). By the end of this lecture you should be able to read, modify, and
write Octox user programs of your own.
Overview¶
Everything that a program does outside its own address space goes through a system call: opening files, reading bytes, spawning processes, sending data to another process. UNIX made two unusual design choices that have shaped every major operating system since:
- Files, pipes, sockets, and devices are all "file descriptors." You
  read and write them with the same two syscalls (read, write).
- Creating a new process is split into two steps. fork duplicates the caller, then exec replaces the program running in the duplicate. This seam is what lets a shell wire up redirection and pipelines without cooperation from the programs it launches.
The nine example programs in src/user/bin/ex_*.rs each isolate one idea.
We'll walk them in order.
Learning Objectives¶
After this lecture you should be able to:
- Explain what a system call is, and what happens on a RISC-V ecall trap.
- List the core UNIX process and file syscalls and what each one does.
- Trace fork/wait/exit and predict what each process sees.
- Describe why exec does not return on success, and what survives across it.
- Use dup2 to rewire a file descriptor, and show how a shell uses this to implement > redirection.
- Set up a pipe between two child processes and explain why everyone must close the unused end.
Prerequisites¶
- Advanced Architecture — the hardware story the kernel is sitting on top of.
- Octox Guide — build, run, and user-program layout.
- RISC-V assembly lectures, in particular the ecall instruction.
What Is a System Call?¶
A system call is a function whose body runs in the kernel's address space
instead of yours. The hardware forces a protection transition. On RISC-V
that transition is the ecall instruction: it saves the user PC in the sepc
register, switches the CPU to supervisor mode, and jumps to the trap vector
whose address the kernel stored in stvec.
flowchart LR
U["User program<br/>fn main()"] --> W["sys::fork wrapper<br/>(ulib, user mode)"]
W --> E["ecall<br/>trap"]
E --> K["kernel trap<br/>handler"]
K --> S["fork impl<br/>(kernel)"]
S --> R["sret<br/>return"]
R --> W
W --> U
The RISC-V syscall ABI Octox uses:
| Register | Holds |
|---|---|
| a7 | syscall number |
| a0..a5 | arguments |
| a0 | return value after sret: Ok or negative error code |
The ulib user library hides this: you call sys::fork(), it emits the
ecall, the kernel runs, you get back a Result<usize>.
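The register convention can be modeled in ordinary Rust on a host machine. This is a sketch only: `SyscallFrame`, `dispatch`, `decode`, and the syscall numbers are hypothetical names and values invented for illustration, not Octox's. It shows the number travelling in `a7`, the arguments in `a0..a5`, and a signed result coming back in `a0` that the wrapper turns into a `Result`.

```rust
/// Hypothetical snapshot of the registers a wrapper fills before `ecall`.
#[derive(Debug, Clone, Copy)]
struct SyscallFrame {
    a7: usize,     // syscall number
    a: [usize; 6], // a0..a5: arguments
}

// Made-up syscall numbers, for this sketch only.
const SYS_GETPID: usize = 1;
const SYS_ADD: usize = 2; // toy call: returns a0 + a1

/// Stand-in for the kernel side: inspect a7, compute the value left in a0.
/// A negative value means "error", mirroring the table above.
fn dispatch(frame: &SyscallFrame) -> isize {
    match frame.a7 {
        SYS_GETPID => 42,
        SYS_ADD => (frame.a[0] + frame.a[1]) as isize,
        _ => -1, // unknown syscall number
    }
}

/// Stand-in for the user-side wrapper: decode a0 into a Result.
fn decode(a0: isize) -> Result<usize, isize> {
    if a0 < 0 {
        Err(a0)
    } else {
        Ok(a0 as usize)
    }
}

/// The full round trip: fill the frame, "trap", decode the answer.
fn syscall(a7: usize, args: [usize; 6]) -> Result<usize, isize> {
    decode(dispatch(&SyscallFrame { a7, a: args }))
}

fn main() {
    println!("{:?}", syscall(SYS_GETPID, [0; 6]));
    println!("{:?}", syscall(SYS_ADD, [2, 3, 0, 0, 0, 0]));
    println!("{:?}", syscall(99, [0; 6]));
}
```

In the real system the `dispatch` step is separated from the caller by the privilege boundary; here it is a plain function call so the data flow is easy to follow.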
The Octox Syscall Surface¶
Every program in this lecture uses only these wrappers from ulib::sys
(generated from src/kernel/syscall.rs into usys.rs):
pub fn fork() -> Result<usize> // child: 0; parent: child pid
pub fn exit(xstatus: i32) -> !
pub fn wait(xstatus: &mut i32) -> Result<usize> // returns child pid
pub fn pipe(p: &mut [usize]) -> Result<()> // p[0]=read fd, p[1]=write fd
pub fn read(fd: usize, buf: &mut [u8]) -> Result<usize>
pub fn write(fd: usize, b: &[u8]) -> Result<usize>
pub fn exec(filename: &str, argv: &[&str], envp: Option<&[Option<&str>]>)
-> Result<usize> // does not return on success
pub fn open(filename: &str, flags: usize) -> Result<usize>
pub fn close(fd: usize) -> Result<()>
pub fn dup(fd: usize) -> Result<usize>
pub fn dup2(src: usize, dst: usize) -> Result<usize>
pub fn getpid() -> Result<usize>
pub fn sleep(n: usize) -> Result<()>
The print! and println! macros in ulib are not syscalls — they
are thin wrappers that ultimately call sys::write(STDOUT_FILENO, ...). When
we want to teach a syscall, we call it directly.
File Descriptors and Open Modes¶
A file descriptor is a small non-negative integer that indexes into a
per-process table the kernel maintains. When you open a file you get back
a fresh fd; when you fork, the child inherits a copy of that whole table;
when you exec, the fd table survives (this is the key to how
redirection works).
Three fds are already open when your program starts:
| fd | Constant | Normally |
|---|---|---|
| 0 | STDIN_FILENO | keyboard / pipe input |
| 1 | STDOUT_FILENO | terminal / pipe output |
| 2 | STDERR_FILENO | terminal |
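A toy model of the per-process table may help fix the idea. The `FdTable` type below is hypothetical and runs on a host machine; it assumes the traditional UNIX rule that `open` returns the lowest unused descriptor, which is taken here as an assumption about Octox rather than a documented fact. A real kernel also shares the underlying open file (and its offset) between parent and child, which this simplified model does not capture.

```rust
/// Toy per-process descriptor table: index = fd, value = what it refers to.
#[derive(Clone, Debug)]
struct FdTable {
    slots: Vec<Option<String>>,
}

impl FdTable {
    /// A fresh process starts with fds 0, 1, and 2 already open.
    fn new() -> Self {
        FdTable {
            slots: vec![Some("console".to_string()); 3],
        }
    }

    /// open: hand back the lowest unused slot.
    fn open(&mut self, name: &str) -> usize {
        if let Some(fd) = self.slots.iter().position(|s| s.is_none()) {
            self.slots[fd] = Some(name.to_string());
            fd
        } else {
            self.slots.push(Some(name.to_string()));
            self.slots.len() - 1
        }
    }

    /// close: free the slot so a later open can reuse the number.
    fn close(&mut self, fd: usize) {
        if let Some(slot) = self.slots.get_mut(fd) {
            *slot = None;
        }
    }

    /// fork: the child receives a copy of the whole table.
    fn fork(&self) -> FdTable {
        self.clone()
    }
}

fn main() {
    let mut parent = FdTable::new();
    let fd = parent.open("a.txt");
    let mut child = parent.fork();
    child.close(fd); // affects only the child's table
    println!("parent: {:?}", parent.slots);
    println!("child : {:?}", child.slots);
}
```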
open takes a flag mask built from ulib::sys::fcntl::omode:
| Flag | Value | Meaning |
|---|---|---|
| RDONLY | 0x000 | read only |
| WRONLY | 0x001 | write only |
| RDWR | 0x002 | read and write |
| CREATE | 0x200 | create file if it does not exist |
| TRUNC | 0x400 | shrink file to 0 bytes on open |
| APPEND | 0x800 | position writes at end of file |
The canonical "open for output" combination is
WRONLY | CREATE | TRUNC — create it if missing, wipe it if present.
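To see how the mask composes, here is a small host-side sketch that uses the values from the table. The `describe` helper is hypothetical and exists only for illustration. One detail worth noticing: `RDONLY` is zero, so the access mode has to be read as a value from the low two bits; it cannot be tested with a bitwise AND the way `CREATE` or `TRUNC` can.

```rust
// Values copied from the flag table above.
const RDONLY: usize = 0x000;
const WRONLY: usize = 0x001;
const RDWR: usize = 0x002;
const CREATE: usize = 0x200;
const TRUNC: usize = 0x400;
const APPEND: usize = 0x800;

/// Hypothetical helper: spell out a flag mask in words.
fn describe(flags: usize) -> String {
    // The access mode is a small number in the low two bits, not a bit set.
    let mut parts = vec![match flags & 0x3 {
        RDONLY => "read only",
        WRONLY => "write only",
        RDWR => "read and write",
        _ => "invalid access mode",
    }];
    // The remaining flags are independent bits.
    if (flags & CREATE) != 0 {
        parts.push("create");
    }
    if (flags & TRUNC) != 0 {
        parts.push("truncate");
    }
    if (flags & APPEND) != 0 {
        parts.push("append");
    }
    parts.join(", ")
}

fn main() {
    let flags = WRONLY | CREATE | TRUNC;
    println!("{:#x}: {}", flags, describe(flags));
}
```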
1. ex_args — Command-Line Arguments¶
When the kernel executes a program, it places the argv array on the new
stack. The lang_start wrapper in ulib publishes it through
env::args(). argv[0] is the program name; argv[1..] are what the user
typed.
#![no_std]
use ulib::{env, print, println};
fn main() {
// Iterate every entry (including argv[0]) and echo it on its own line.
for arg in env::args() {
println!("{}", arg);
}
}
Try it:
2. ex_count — open + read Loop + close¶
A stripped-down wc -c. The lesson is the read loop: read may return
fewer bytes than the buffer holds, and returns Ok(0) at end-of-file, so
the caller must loop until it sees that zero.
#![no_std]
use ulib::{env, print, println, sys, sys::fcntl::omode};
fn main() {
let mut args = env::args().skip(1);
let path = args.next().expect("usage: ex_count FILE");
// open() returns a small integer file descriptor. RDONLY means we
// only plan to read from it.
let fd = sys::open(path, omode::RDONLY).expect("open");
// read() is allowed to return fewer bytes than the buffer holds,
// and returns Ok(0) at end-of-file — so the caller must loop.
let mut buf = [0u8; 512];
let mut count: usize = 0;
loop {
let n = sys::read(fd, &mut buf).expect("read");
if n == 0 {
break; // EOF
}
count += n;
}
// Always release the descriptor when done.
sys::close(fd).expect("close");
println!("{}: {} bytes", path, count);
}
Key idea: "partial reads are normal." A single read call is one trip
into the kernel, and the kernel is allowed to give you whatever is
conveniently available right now. Loop until EOF.
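Partial reads are easy to reproduce on a host machine with the standard `Read` trait. In the sketch below, `Trickle` is a hypothetical reader that never returns more than a few bytes per call, and `count_bytes` uses the same loop shape as ex_count. It also reports how many calls were needed, to make the point that one call is rarely enough.

```rust
use std::io::{self, Read};

/// A reader that hands back at most `chunk` bytes per call, imitating a
/// kernel that returns whatever happens to be available right now.
struct Trickle<'a> {
    data: &'a [u8],
    chunk: usize,
}

impl<'a> Read for Trickle<'a> {
    fn read(&mut self, buf: &mut [u8]) -> io::Result<usize> {
        let n = self.chunk.min(buf.len()).min(self.data.len());
        buf[..n].copy_from_slice(&self.data[..n]);
        self.data = &self.data[n..];
        Ok(n) // Ok(0) once the data is exhausted: end-of-file
    }
}

/// Same loop as ex_count: keep reading until a call returns 0.
/// Returns (total bytes, number of read calls made).
fn count_bytes<R: Read>(mut src: R) -> io::Result<(usize, usize)> {
    let mut buf = [0u8; 512];
    let mut total = 0;
    let mut calls = 0;
    loop {
        let n = src.read(&mut buf)?;
        calls += 1;
        if n == 0 {
            break; // EOF
        }
        total += n;
    }
    Ok((total, calls))
}

fn main() -> io::Result<()> {
    let src = Trickle { data: b"0123456789", chunk: 3 };
    let (total, calls) = count_bytes(src)?;
    println!("{} bytes in {} read calls", total, calls);
    Ok(())
}
```

Even though the buffer holds 512 bytes, the ten input bytes arrive over several calls, and the final call that returns zero is what ends the loop.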
3. ex_write — open(... CREATE | TRUNC) + write¶
Mirror image of ex_count: open for output, write bytes. write is also
allowed to accept fewer bytes than you offered, so we loop.
#![no_std]
use ulib::{env, sys, sys::fcntl::omode};
fn main() {
let mut args = env::args().skip(1);
let path = args.next().expect("usage: ex_write FILE TEXT");
let text = args.next().expect("usage: ex_write FILE TEXT");
// Flag semantics:
// WRONLY — open for writing only
// CREATE — create the file if it does not exist
// TRUNC — if it already exists, shrink it back to 0 bytes
let fd = sys::open(path, omode::WRONLY | omode::CREATE | omode::TRUNC)
.expect("open");
// write() is permitted to accept fewer bytes than we offered, so
// loop until every byte of TEXT has been delivered.
let bytes = text.as_bytes();
let mut off = 0;
while off < bytes.len() {
let n = sys::write(fd, &bytes[off..]).expect("write");
off += n;
}
sys::close(fd).expect("close");
}
Try it:
4. ex_fork — Two Processes from One¶
fork is the most famous syscall in UNIX and the one that confuses students
the most. A single call returns twice: once in the parent with the
child's pid, once in the child with 0. After fork, parent and child are
two independent processes with separate copies of every variable.
#![no_std]
use ulib::{print, println, sys};
fn main() {
// Both parent and child will see x == 100 immediately after fork.
// The child then bumps its own copy; the parent's copy is untouched.
let mut x: i32 = 100;
println!("before fork: x={}", x);
// fork() returns:
// Ok(0) in the child
// Ok(child_pid) in the parent
match sys::fork().expect("fork") {
0 => {
// --- child ---
x += 1;
println!("child : pid={} x={}", sys::getpid().unwrap(), x);
// Exit explicitly so the child never falls through to the
// parent branch below.
sys::exit(0);
}
child_pid => {
// --- parent ---
// wait() blocks until some child exits and writes that
// child's exit status into the i32 we hand it.
let mut status: i32 = 0;
sys::wait(&mut status).expect("wait");
println!(
"parent: pid={} x={} (child {} exited with {})",
sys::getpid().unwrap(), x, child_pid, status
);
}
}
}
Copy-on-write
A real UNIX kernel does not literally duplicate every page of memory at
fork. It shares the pages between parent and child read-only, and only
copies a page when one side writes to it. From a correctness standpoint
you can pretend it was a full copy — the mutations are invisible
to the other process either way.
Try it:
Notice the parent's x is still 100.
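The copy-on-write idea from the note above can be imitated in ordinary Rust with `Rc::make_mut`, which clones the shared value only when someone writes to it while it is still shared. This is an analogy under stated assumptions, not how any kernel is implemented: `Process` and `fork` here are hypothetical names, and a single `Rc` stands in for one page of memory.

```rust
use std::rc::Rc;

/// Toy process: one "page" of memory holding a single variable.
struct Process {
    page: Rc<[i32; 1]>,
}

/// fork in this model: share the page instead of copying it.
fn fork(parent: &Process) -> Process {
    Process { page: Rc::clone(&parent.page) }
}

/// Returns (shared before write, shared after write, parent x, child x).
fn demo() -> (bool, bool, i32, i32) {
    let parent = Process { page: Rc::new([100]) };
    let mut child = fork(&parent);
    let shared_before = Rc::ptr_eq(&parent.page, &child.page);

    // First write by the child: make_mut sees that the page is shared and
    // copies it before handing out a mutable reference.
    Rc::make_mut(&mut child.page)[0] += 1;

    let shared_after = Rc::ptr_eq(&parent.page, &child.page);
    (shared_before, shared_after, parent.page[0], child.page[0])
}

fn main() {
    let (before, after, parent_x, child_x) = demo();
    println!("shared before write: {}", before);
    println!("shared after write : {}", after);
    println!("parent x={} child x={}", parent_x, child_x);
}
```

The page is shared until the child's first write, and the parent's value is untouched afterwards, which is the same observable result as a full copy at fork time.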
5. ex_exec — Replace the Running Program¶
exec takes a filename plus an argv and replaces the current process's
program image with that binary. On success it does not return; the old code
is gone, and execution starts at the new program's entry point.
#![no_std]
use ulib::{print, println, sys};
fn main() {
match sys::fork().expect("fork") {
0 => {
// --- child ---
// The "sleep" command lives at /bin/sleep. sleep takes one
// argument: seconds to pause.
let argv = ["sleep", "10"];
sys::exec("/bin/sleep", &argv, None).expect("exec");
// Defensive fallback: exec() does not return on success, and
// expect() panics if it fails, so this line should never run.
sys::exit(1);
}
child => {
// --- parent ---
println!("parent: launched child pid={}, waiting...", child);
let mut status: i32 = 0;
let reaped = sys::wait(&mut status).expect("wait");
println!("parent: child {} exited with {}", reaped, status);
}
}
}
Key idea: fork + exec is the UNIX answer to "run another program."
fork gives you a process you own; exec gives it a different program to
run. Between the two calls is where a shell does its setup work
(redirection, pipes, close-on-exec, setuid, ...). That's the point of
splitting the operation.
About the binary names
In Octox every user binary is built as _<name> (see
src/user/Cargo.toml) so it does not clash with host-system binaries
of the same name during the cross-build. mkfs strips the leading
_ when writing the program into fs.img, so on the running system
it lives at /bin/<name>. exec requires a full path — there
is no PATH lookup in the kernel.
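On a host UNIX system, Rust's standard library wraps the same launch-then-wait pattern in `std::process::Command`. The sketch below assumes a UNIX host with `sh` available; `run_and_wait` is a hypothetical helper name. Unlike Octox's exec, `Command` searches PATH for the program, so no full path is needed here.

```rust
use std::io;
use std::process::Command;

/// Launch `sh -c "exit N"` as a child process and return the exit status
/// that the parent collects when it waits.
fn run_and_wait(code: i32) -> io::Result<Option<i32>> {
    let status = Command::new("sh")
        .arg("-c")
        .arg(format!("exit {}", code))
        .status()?; // start the child, then block until it exits
    Ok(status.code())
}

fn main() -> io::Result<()> {
    println!("child exited with {:?}", run_and_wait(3)?);
    Ok(())
}
```

`status()` plays the role of both halves of ex_exec: it creates the child running the new program and then waits for it, returning the status the child passed to exit.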
6. ex_redir — Rewiring Your Own Stdout¶
Before we can redirect other programs we need one more syscall: dup2.
dup2(src, dst) atomically:
- closes whatever dst currently refers to (if anything), then
- makes dst an alias for the same underlying file as src.

After dup2(fd, 1), writes to fd 1 go wherever fd (typically fd 3 here) was
pointing. The program doesn't have to know it is "redirected." This
example only forks — the child keeps running ex_redir, but with
its stdout replaced.
#![no_std]
use ulib::{env, print, println, stdio::STDOUT_FILENO, sys, sys::fcntl::omode};
fn main() {
let mut args = env::args().skip(1);
let path = args.next().expect("usage: ex_redir OUTFILE");
match sys::fork().expect("fork") {
0 => {
// --- child ---
let fd = sys::open(path, omode::WRONLY | omode::CREATE | omode::TRUNC)
.expect("open");
// Redirect stdout. After this call, fd 1 refers to the file.
sys::dup2(fd, STDOUT_FILENO).expect("dup2");
// fd 3 (or whatever the original was) is now redundant: the
// file is still referenced by fd 1. Closing it avoids leaking
// a descriptor across exec or just within this process.
sys::close(fd).expect("close");
// println! writes to fd 1 — which is now the file.
println!("hello from child: my stdout is redirected");
sys::exit(0);
}
_ => {
// --- parent ---
let mut status: i32 = 0;
sys::wait(&mut status).expect("wait");
// The parent never touched its own fd 1, so this prints to
// the terminal, not to the file.
println!("parent: child finished; my stdout is still the terminal");
}
}
}
Try it:
$ ex_redir /tmp/r
parent: child finished; my stdout is still the terminal
$ cat /tmp/r
hello from child: my stdout is redirected
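The aliasing that dup2 creates can be modeled on a host machine with reference-counted handles. The sketch below is illustrative only: `FdTable` and `OpenFile` are hypothetical types, and a byte vector stands in for both the terminal and the output file. Overwriting a slot drops the old handle, which is the "close dst" half of dup2; storing a clone of the source handle is the "alias" half.

```rust
use std::cell::RefCell;
use std::rc::Rc;

/// An open file, shared by every descriptor that aliases it.
type OpenFile = Rc<RefCell<Vec<u8>>>;

struct FdTable {
    slots: Vec<Option<OpenFile>>,
}

impl FdTable {
    /// dup2: make `dst` refer to the same open file as `src`.
    fn dup2(&mut self, src: usize, dst: usize) -> Result<usize, &'static str> {
        let file = self.slots.get(src).and_then(|s| s.clone()).ok_or("bad src fd")?;
        if dst >= self.slots.len() {
            self.slots.resize(dst + 1, None);
        }
        // Overwriting the slot drops (closes) whatever dst referred to.
        self.slots[dst] = Some(file);
        Ok(dst)
    }

    fn close(&mut self, fd: usize) {
        if let Some(slot) = self.slots.get_mut(fd) {
            *slot = None;
        }
    }

    fn write(&self, fd: usize, bytes: &[u8]) -> Result<usize, &'static str> {
        let file = self.slots.get(fd).and_then(|s| s.as_ref()).ok_or("bad fd")?;
        file.borrow_mut().extend_from_slice(bytes);
        Ok(bytes.len())
    }
}

/// Returns (bytes that reached the terminal, bytes that reached the file).
fn demo() -> (Vec<u8>, Vec<u8>) {
    let terminal: OpenFile = Rc::new(RefCell::new(Vec::new()));
    let file: OpenFile = Rc::new(RefCell::new(Vec::new()));
    let mut table = FdTable {
        slots: vec![
            Some(terminal.clone()), // fd 0
            Some(terminal.clone()), // fd 1
            Some(terminal.clone()), // fd 2
            Some(file.clone()),     // fd 3: the freshly opened output file
        ],
    };
    table.dup2(3, 1).unwrap(); // stdout now aliases the file
    table.close(3);            // the file stays open through fd 1
    table.write(1, b"hello").unwrap();
    table.write(2, b"oops").unwrap(); // stderr was never redirected
    let term_bytes = terminal.borrow().clone();
    let file_bytes = file.borrow().clone();
    (term_bytes, file_bytes)
}

fn main() {
    let (term_bytes, file_bytes) = demo();
    println!("terminal: {:?}", String::from_utf8_lossy(&term_bytes));
    println!("file    : {:?}", String::from_utf8_lossy(&file_bytes));
}
```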
7. ex_redir2 — Redirection Survives exec¶
The point of this variant is a single fact: a process's fd table is
preserved across exec. So if we set up fd 1 before calling exec,
the new program's writes to stdout land in the file. The program being
exec'd does not need to know, or cooperate.
This is how cmd > file works in essentially every UNIX shell.
#![no_std]
use ulib::{env, print, println, stdio::STDOUT_FILENO, sys, sys::fcntl::omode};
fn main() {
let mut args = env::args().skip(1);
let path = args.next().expect("usage: ex_redir2 OUTFILE");
match sys::fork().expect("fork") {
0 => {
// --- child ---
// Open the output file and splice it onto stdout.
let fd = sys::open(path, omode::WRONLY | omode::CREATE | omode::TRUNC)
.expect("open");
sys::dup2(fd, STDOUT_FILENO).expect("dup2");
sys::close(fd).expect("close");
// Now exec() into /bin/echo. Because fd 1 is inherited, the
// echo program's output lands in the file rather than on
// the terminal. This is why redirection "just works" for
// arbitrary programs — they never need to know they are
// being redirected.
let argv = ["echo", "hello", "from", "exec"];
sys::exec("/bin/echo", &argv, None).expect("exec");
sys::exit(1); // defensive: expect() above already panics if exec fails
}
_ => {
// --- parent ---
let mut status: i32 = 0;
sys::wait(&mut status).expect("wait");
println!("parent: child exited with {}", status);
}
}
}
Recipe for cmd > file:
fork()
if child:
    fd = open(file, WRONLY | CREATE | TRUNC)
    dup2(fd, 1)
    close(fd)
    exec(cmd, argv)
else:
    wait()
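The same recipe can be tried on a host UNIX system through the Rust standard library, which carries out the descriptor setup in the child before running the new program. This is a sketch, assuming a UNIX host with `echo` on PATH; `echo_to_file`, `demo`, and the temporary file name are hypothetical. `File::create` corresponds to the WRONLY | CREATE | TRUNC open, and passing the file as the child's stdout corresponds to the dup2(fd, 1) step.

```rust
use std::fs::{self, File};
use std::io;
use std::path::Path;
use std::process::{Command, Stdio};

/// Run `echo hello from exec` with stdout redirected to `path`,
/// then return what ended up in the file.
fn echo_to_file(path: &Path) -> io::Result<String> {
    // Write-only, create if missing, truncate if present.
    let out = File::create(path)?;
    // The child gets this file as its fd 1; echo never knows.
    let status = Command::new("echo")
        .args(["hello", "from", "exec"])
        .stdout(Stdio::from(out))
        .status()?;
    assert!(status.success());
    fs::read_to_string(path)
}

/// Wrapper that picks a scratch path and cleans up afterwards.
fn demo() -> io::Result<String> {
    let path = std::env::temp_dir().join("ex_redir2_sketch.txt");
    let text = echo_to_file(&path)?;
    fs::remove_file(&path)?;
    Ok(text)
}

fn main() -> io::Result<()> {
    print!("{}", demo()?);
    Ok(())
}
```

Nothing is printed by the child on the terminal; the text only appears when the parent reads the file back.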
8. ex_pipe — Talking Through a Pipe¶
A pipe is a one-way in-kernel byte buffer exposed as a pair of file
descriptors. sys::pipe(&mut p) fills in p[0] (the read end) and p[1]
(the write end). Bytes written to p[1] come back out of p[0].
After fork, both processes hold both ends. By convention each side
closes the end it does not use. This is not a stylistic choice:
The EOF rule
read returns Ok(0) (EOF) on a pipe only when every open write end
has been closed. If the reader forgets to close its own copy of the
write end, that EOF can never arrive: the kernel still counts the reader
itself as a writer, so its read blocks forever.
#![no_std]
use ulib::{print, println, sys, stdio::STDOUT_FILENO};
fn main() {
let mut p = [0usize; 2];
sys::pipe(&mut p).expect("pipe");
let (read_fd, write_fd) = (p[0], p[1]);
match sys::fork().expect("fork") {
0 => {
// --- child: the writer ---
// We will not read from the pipe, so close that end.
sys::close(read_fd).expect("close");
let msg = b"hello from child\n";
sys::write(write_fd, msg).expect("write");
// Closing the write end is what lets the parent's read()
// return 0 (EOF) once we are done.
sys::close(write_fd).expect("close");
sys::exit(0);
}
_ => {
// --- parent: the reader ---
// Symmetric: we will not write, so close the write end.
// This is important — if we left it open and then tried to
// read until EOF, we would block forever because the kernel
// thinks *we* are still a writer.
sys::close(write_fd).expect("close");
let mut buf = [0u8; 64];
let n = sys::read(read_fd, &mut buf).expect("read");
sys::close(read_fd).expect("close");
// Echo exactly what we received to our own stdout.
sys::write(STDOUT_FILENO, &buf[..n]).expect("write");
let mut status: i32 = 0;
sys::wait(&mut status).expect("wait");
println!("parent: child exited with {}", status);
}
}
}
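The EOF rule has a close analogue in Rust's standard channels, which makes it easy to observe on a host machine: a receiver reports "disconnected" only after every clone of the sender has been dropped. The sketch below is an analogy, not a pipe; the variable names are chosen to mirror ex_pipe, and `try_recv` is used instead of a blocking receive so the "would hang" case shows up as a value rather than an actual hang.

```rust
use std::sync::mpsc::{channel, TryRecvError};

type Step = Result<u8, TryRecvError>;

/// Returns what the reader sees at three moments:
/// after the child wrote, while the parent still holds a write end,
/// and after the parent closed its write end.
fn demo() -> (Step, Step, Step) {
    let (write_end, read_end) = channel::<u8>();

    // "fork": a second holder of the write end now exists.
    let child_write_end = write_end.clone();
    child_write_end.send(b'h').unwrap();
    drop(child_write_end); // the child closes its write end and exits

    let first = read_end.try_recv(); // the byte the child wrote

    // The parent has not closed its own write end yet: there is no data,
    // but no end-of-stream either. A blocking read would wait here forever.
    let while_parent_holds = read_end.try_recv();

    drop(write_end); // the parent finally closes its copy
    let after_close = read_end.try_recv();

    (first, while_parent_holds, after_close)
}

fn main() {
    let (first, while_parent_holds, after_close) = demo();
    println!("after child wrote        : {:?}", first);
    println!("parent still holds writer: {:?}", while_parent_holds);
    println!("all writers closed       : {:?}", after_close);
}
```

The middle result is the interesting one: `Empty` means "nothing yet, but a writer still exists", which is exactly the state that turns a forgotten close into a permanent block.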
9. ex_pipe2 — The Shell's Pipeline: ls | wc¶
Combine every idea so far. Two children, two execs, one pipe. Each child
dup2s one end of the pipe onto its stdin or stdout before exec, so
ls and wc run as if they had a normal terminal on the other side.
flowchart LR
P1["child1<br/>ls"] -- "stdout = write_fd" --> PIPE["pipe<br/>buffer"]
PIPE -- "read_fd = stdin" --> P2["child2<br/>wc"]
PARENT["parent<br/>(closes both ends,<br/>waits twice)"] -.-> P1
PARENT -.-> P2
#![no_std]
use ulib::{
print, println,
stdio::{STDIN_FILENO, STDOUT_FILENO},
sys,
};
fn main() {
let mut p = [0usize; 2];
sys::pipe(&mut p).expect("pipe");
let (read_fd, write_fd) = (p[0], p[1]);
// --- first child: the producer (`ls`) ---
match sys::fork().expect("fork") {
0 => {
// Redirect stdout to the pipe's write end.
sys::dup2(write_fd, STDOUT_FILENO).expect("dup2");
// After dup2, fd 1 already refers to the write end. The
// original pipe fds are redundant in this process, and we
// must close the read end because we are not going to read.
sys::close(read_fd).expect("close");
sys::close(write_fd).expect("close");
let argv = ["ls"];
sys::exec("/bin/ls", &argv, None).expect("exec");
sys::exit(1);
}
_ => {}
}
// --- second child: the consumer (`wc`) ---
match sys::fork().expect("fork") {
0 => {
// Redirect stdin to the pipe's read end.
sys::dup2(read_fd, STDIN_FILENO).expect("dup2");
// Same cleanup as above, mirror image.
sys::close(read_fd).expect("close");
sys::close(write_fd).expect("close");
let argv = ["wc"];
sys::exec("/bin/wc", &argv, None).expect("exec");
sys::exit(1);
}
_ => {}
}
// --- parent ---
// Close BOTH pipe ends here. If we did not, the kernel would still
// count us as a writer, and wc would block on read() forever
// waiting for an EOF that never comes.
sys::close(read_fd).expect("close");
sys::close(write_fd).expect("close");
// Reap both children. wait() returns whichever finishes first.
let mut status: i32 = 0;
sys::wait(&mut status).expect("wait 1");
sys::wait(&mut status).expect("wait 2");
println!("parent: both children finished");
}
The classic deadlock
The parent calls pipe before any of the children exist, so the parent
holds both ends too. If the parent forgets to close them, wc will hang
forever on read waiting for an EOF that never comes — because the
kernel still sees the parent as a live writer.
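For comparison, here is a host-side sketch of a two-stage pipeline using `std::process::Command`, assuming a UNIX host with `echo` and `wc` on PATH. To keep the result predictable, `echo hello` stands in for `ls` and the consumer is `wc -c`; the function name `echo_pipe_wc` is hypothetical. The standard library closes the parent's copies of the pipe ends for us, which is the step ex_pipe2 has to do by hand.

```rust
use std::io;
use std::process::{Command, Stdio};

/// Run the equivalent of `echo hello | wc -c` and return the byte count.
fn echo_pipe_wc() -> io::Result<usize> {
    // Producer: its stdout is the write end of a fresh pipe.
    let mut producer = Command::new("echo")
        .arg("hello")
        .stdout(Stdio::piped())
        .spawn()?;
    let pipe_read_end = producer
        .stdout
        .take()
        .expect("producer stdout was piped");

    // Consumer: its stdin is the read end of that same pipe.
    let consumer = Command::new("wc")
        .arg("-c")
        .stdin(Stdio::from(pipe_read_end))
        .stdout(Stdio::piped())
        .spawn()?;

    // Reap both children, as the parent in ex_pipe2 does.
    let output = consumer.wait_with_output()?;
    producer.wait()?;

    // Some wc implementations pad the number with spaces, so trim first.
    let text = String::from_utf8_lossy(&output.stdout);
    text.trim()
        .parse::<usize>()
        .map_err(|e| io::Error::new(io::ErrorKind::InvalidData, e))
}

fn main() -> io::Result<()> {
    println!("wc -c counted {} bytes", echo_pipe_wc()?);
    Ok(())
}
```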
Building a Shell from Seven Syscalls¶
Every feature a minimalist shell provides reduces to a small fixed recipe over these syscalls:
| Shell feature | Syscall recipe |
|---|---|
| Run a command | fork → child exec; parent wait |
| Run in background | fork → child exec; parent does not wait |
| cmd > file | fork → open + dup2(fd,1) + close → exec |
| cmd < file | fork → open + dup2(fd,0) + close → exec |
| cmd1 \| cmd2 | pipe; fork twice; each child rewires one end; parent closes both, wait twice |
| Exit a subprocess | exit(status) |
| Reap child | wait(&mut status) |
The Octox shell (src/user/bin/sh.rs) is implemented with exactly this
toolkit.
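The pipeline row generalizes by simple counting. The sketch below extrapolates from the two-command recipe: a pipeline of n commands needs n - 1 pipes and n forks, and the parent must close both ends of every pipe and wait once per child. `PipelinePlan` and `plan` are hypothetical names used only for this illustration.

```rust
/// How much bookkeeping the parent does for an n-command pipeline.
#[derive(Debug, PartialEq)]
struct PipelinePlan {
    pipes: usize,         // one pipe between each adjacent pair of commands
    forks: usize,         // one child per command
    parent_closes: usize, // the parent closes both ends of every pipe
    waits: usize,         // the parent reaps every child
}

fn plan(commands: usize) -> PipelinePlan {
    let pipes = commands.saturating_sub(1);
    PipelinePlan {
        pipes,
        forks: commands,
        parent_closes: 2 * pipes,
        waits: commands,
    }
}

fn main() {
    for n in 1..=3 {
        println!("{} command(s): {:?}", n, plan(n));
    }
}
```

For two commands this gives one pipe, two forks, two parent closes, and two waits, matching the table row above.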
Key Concepts¶
| Concept | Takeaway |
|---|---|
| System call | User code asks the kernel to do something via ecall; returns to user. |
| File descriptor | Small integer indexing a per-process open-file table. |
| fork returns twice | 0 in child, child_pid in parent; memory is (logically) copied. |
| exec does not return | Success replaces the image; only failure paths run the code after it. |
| fds survive fork | Child inherits a copy of the whole fd table. |
| fds survive exec | New program runs with the fd table the caller set up. |
| dup2(src, dst) | Atomically closes dst, then aliases it onto src. |
| Pipe EOF rule | Readers see EOF only after every write-end fd is closed. |
| Partial read/write | Always loop; never assume a single call moved all bytes. |
Practice Problems¶
Problem 1 — What does ex_fork print?¶
If you change x += 1; in the child branch to x += 50;, what does the
parent line print? Why?
Solution
The parent still prints `x=100`. `fork` gives the child a separate copy of `x`; whatever the child does to its copy is invisible to the parent. Only the child's line changes (to `x=150`).
Problem 2 — The hanging pipeline¶
In ex_pipe2, delete the two sys::close calls the parent makes on the
pipe fds (lines 72–73). The program now hangs. Which process is stuck,
on which call, and why?
Solution
`wc` (child 2) hangs inside its final `read` on stdin. The parent still holds the write end of the pipe open, so the kernel cannot signal EOF to the reader even after `ls` has exited and closed its own write end. `read` waits for more data forever. The parent's first `sys::wait` returns once `ls` exits, but the second blocks waiting for `wc`, so the whole program is stuck. Lesson: **everyone** who ever held a pipe fd must close the ends they do not use. The reader needs to see the *last* write-end fd go away.
Problem 3 — A three-command pipeline¶
Sketch the syscalls needed to implement cmd1 | cmd2 | cmd3. How many
pipes? How many forks? How many close calls does the parent make?
Solution
Two pipes, three `fork`s. Let `p1` and `p2` be the two pipes.
- child 1: `dup2(p1[1], 1)`; close `p1[0]`, `p1[1]`, `p2[0]`, `p2[1]`; `exec(cmd1)`.
- child 2: `dup2(p1[0], 0)`; `dup2(p2[1], 1)`; close all four raw fds; `exec(cmd2)`.
- child 3: `dup2(p2[0], 0)`; close all four raw fds; `exec(cmd3)`.
- parent: close all four raw fds (`p1[0]`, `p1[1]`, `p2[0]`, `p2[1]`); `wait` three times.

Four closes in the parent (one per pipe end), plus the four each child did for cleanup. Missing any of the closes in the parent can deadlock the pipeline.
Further Reading¶
- Octox repo: src/user/bin/ex_*.rs — the nine programs above, in one place.
- Octox Guide — architecture, build, and how to add a user program.
- xv6 book, Chapter 1 (Operating System Interfaces) and Chapter 8 (File System) — the C ancestors of these examples.
- On a real Linux box, compare: man 2 fork, man 2 execve, man 2 open, man 2 pipe, man 2 dup2. The semantics we covered today are essentially unchanged from 1970s UNIX.