In-Process Threading Contract

This page freezes the 7.1.0 design contract for kernel-managed threads inside one process. The 7.1.1 park authority contract is frozen separately in Park Authority. These pages are the handoff from the single-thread runtime checkpoint to the 7.2 implementation work. The 7.2.3 checkpoint implements the basic single-CPU lifecycle plus private ParkSpace wait/wake.

Scope

The first threading milestone stays single-CPU. It changes the scheduler’s unit of execution from process to thread while keeping the process as the authority, address-space, and resource-accounting boundary. SMP, per-CPU run queues, TLB shootdown, SQPOLL, and scheduler-policy services remain later milestones.

This contract covers:

  • process-owned versus thread-owned state;
  • the initial thread creation ABI;
  • per-thread FS-base/TLS rules;
  • thread exit and join semantics;
  • the ring-blocking constraint needed before a sharded or per-thread ring design exists;
  • the handoff to the 7.1.1 park authority design.

Ownership Split

The process remains the security boundary. All threads in one process share the same address space and capability table, so a thread has the same authority as its sibling threads.

| Process-owned state | Thread-owned state |
| --- | --- |
| Process id and process generation | Thread id and thread generation |
| User address space and CR3 | Saved CPU context and user register state |
| Capability table and resource ledger | Kernel stack and syscall stack top |
| Capability ring page and ring scratch | FS base |
| Read-only CapSet page | Scheduling/blocking state |
| ProcessHandle exit state | ThreadHandle join/exit state |
| Endpoint owner state and process-wide cleanup hooks | Future scheduling-context binding |

The implementation migrates incrementally. The 7.2.0 slice makes each process contain a single initial Thread, with saved context, kernel stack, FS base, and blocking state stored on that thread. The 7.2.1 slice changes scheduler-owned queues, current execution, direct IPC handoff, and wake records to generation-checked ThreadRef values while still allowing exactly one thread per process. Later slices widen creation and lifecycle. The single-thread intermediate state must preserve existing QEMU behavior.

Scheduler Contract

The scheduler stores runnable execution contexts as thread references, not process ids. A thread reference is (pid, process_generation, tid, thread_generation). The process generation keeps handles from naming a reused process; the thread generation keeps handles from naming a reused thread slot inside a live process.

The 7.2.1 checkpoint applies this identity to Scheduler.current, run queues, direct IPC targets, Timer sleep waiters, process/terminal waiters, and endpoint caller/receiver wake records while preserving one initial thread per process.

The run queue, current, direct IPC target, and blocked waiter scans become thread-oriented. Address-space switches happen only when the next runnable thread belongs to a different process. TSS.RSP0, the syscall kernel stack, and FS base are updated on every thread switch because those are thread-local machine resources.
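The switch rule above can be sketched as a small decision function. This is an illustration only: `SwitchPlan` and `plan_switch` are hypothetical names, and the `u32` field widths of `ThreadRef` are assumptions rather than the kernel's actual types.

```rust
/// Generation-checked thread identity as described on this page.
/// The field widths are assumptions made for this sketch.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct ThreadRef {
    pub pid: u32,
    pub process_generation: u32,
    pub tid: u32,
    pub thread_generation: u32,
}

/// Which machine state a thread switch has to touch (hypothetical type).
#[derive(Debug, PartialEq, Eq)]
pub struct SwitchPlan {
    /// Reload CR3 because the next thread lives in another process.
    pub switch_address_space: bool,
    /// Update TSS.RSP0 and the syscall kernel stack top.
    pub update_kernel_stack: bool,
    /// Save the outgoing FS base and restore the incoming one.
    pub swap_fs_base: bool,
}

/// Decide what a switch from `prev` to a different thread `next` must do.
pub fn plan_switch(prev: ThreadRef, next: ThreadRef) -> SwitchPlan {
    let same_process =
        prev.pid == next.pid && prev.process_generation == next.process_generation;
    SwitchPlan {
        // The address space only changes across a process boundary.
        switch_address_space: !same_process,
        // Thread-local machine resources change on every thread switch.
        update_kernel_stack: true,
        swap_fs_base: true,
    }
}
```

Two sibling threads therefore skip the CR3 reload but still swap their kernel stack pointers and FS base.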

The idle process can remain the existing special user-mode idle process until the kernel-mode/per-CPU idle work lands. It should still be treated as a kernel-owned execution context that cannot block, exit, or hold ordinary caps.

Thread Creation ABI

Thread creation is exposed through a process-local ThreadSpawner capability. It creates threads only in the caller’s current process. It does not grant authority to another process and is non-transferable across IPC in the initial implementation.

The initial control-plane shape is:

interface ThreadSpawner {
    create @0 (
        entry :UInt64,
        stackTop :UInt64,
        arg :UInt64,
        fsBase :UInt64,
        flags :UInt64
    ) -> (handleIndex :UInt16);
}

interface ThreadHandle {
    join @0 () -> (exitCode :Int64);
    exitCode @1 () -> (exited :Bool, exitCode :Int64);
}

interface ThreadControl {
    getFsBase @0 () -> (fsBase :UInt64);
    setFsBase @1 (fsBase :UInt64) -> ();
    exitThread @2 (code :Int64) -> ();
}

Any 7.2 schema adjustment must update this page in the same branch before implementation review. The stable semantics are that creation is in-process, the returned handle is an observed result cap, ThreadHandle observes one thread rather than the whole process, and current-thread exit stays in the capability-ring transport rather than adding a syscall.

The new thread starts in Ring 3 at entry with:

  • RDI = arg;
  • RSI = tid;
  • RDX = pid;
  • RCX = RING_VADDR;
  • R8 = CAPSET_VADDR, or zero if the process has no CapSet.

The runtime supplies the user stack and TLS block. The kernel validates that entry, stackTop, and fsBase are user-canonical, that stackTop is 16-byte aligned at entry, and that reserved flags bits are zero. Page presence and stack-growth policy remain process address-space questions; before a page-fault subsystem exists, an invalid thread stack can fault the process.
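A minimal sketch of those argument checks follows. It assumes that "user-canonical" means the lower canonical half of a 48-bit virtual address space and that no flag bits are defined yet; both constants, the `CreateError` type, and the function name are hypothetical and may not match the kernel.

```rust
/// Hypothetical error type; the page only states that invalid arguments
/// return `Failed`.
#[derive(Debug, PartialEq, Eq)]
pub enum CreateError {
    Failed(&'static str),
}

/// Assumption: user-canonical addresses lie below the lower-half boundary
/// of a 48-bit virtual address space. The real kernel bound may differ.
pub const USER_CANONICAL_END: u64 = 0x0000_8000_0000_0000;

/// Assumption: no flag bits are defined yet, so every bit is reserved.
pub const RESERVED_FLAGS_MASK: u64 = u64::MAX;

fn is_user_canonical(addr: u64) -> bool {
    addr < USER_CANONICAL_END
}

/// Check the `ThreadSpawner.create` arguments before any allocation.
/// Page presence is deliberately not checked here.
pub fn validate_create_args(
    entry: u64,
    stack_top: u64,
    fs_base: u64,
    flags: u64,
) -> Result<(), CreateError> {
    if !is_user_canonical(entry) {
        return Err(CreateError::Failed("entry is not user-canonical"));
    }
    if !is_user_canonical(stack_top) {
        return Err(CreateError::Failed("stackTop is not user-canonical"));
    }
    if stack_top % 16 != 0 {
        return Err(CreateError::Failed("stackTop is not 16-byte aligned"));
    }
    if !is_user_canonical(fs_base) {
        return Err(CreateError::Failed("fsBase is not user-canonical"));
    }
    if flags & RESERVED_FLAGS_MASK != 0 {
        return Err(CreateError::Failed("reserved flags bits are set"));
    }
    Ok(())
}
```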

Resource Accounting

Thread creation allocates kernel memory and is quota-backed by process-owned ledger state, not per-capability helper counters. The 7.2.0 checkpoint charges the initial thread during process creation; ThreadSpawner.create extends the same ledgers to additional threads. The ledger of record is:

  • PROCESS_THREAD_LIMIT, the maximum live or retained thread records in one process, initially 16;
  • PROCESS_THREAD_KERNEL_STACK_PAGES, initially matching the current per-thread kernel stack allocation size of 32 pages;
  • thread_records_used / thread_records_max;
  • thread_kernel_stack_pages_used / thread_kernel_stack_pages_max.

The initial process thread charges one thread record and one kernel-stack allocation during process creation. ThreadSpawner.create reserves a thread record and kernel-stack page budget before allocating the stack or publishing a ThreadHandle; every later failure rolls both reservations back before returning. Cap-slot reservation for the result handle remains charged to the existing process cap-table ledger.
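The reserve-then-roll-back rule can be sketched as below. The field and constant names come from this page; the `ThreadLedger` type, its methods, the `CreateError` enum, and the assumption that the stack-page budget is sized for the full thread limit are illustrative only. The `allocate_stack` closure stands in for every fallible step after reservation.

```rust
pub const PROCESS_THREAD_LIMIT: u32 = 16;
pub const PROCESS_THREAD_KERNEL_STACK_PAGES: u32 = 32;

/// Hypothetical error type mirroring the `Overloaded` exception class.
#[derive(Debug, PartialEq, Eq)]
pub enum CreateError {
    Overloaded(&'static str),
}

#[derive(Debug)]
pub struct ThreadLedger {
    pub thread_records_used: u32,
    pub thread_records_max: u32,
    pub thread_kernel_stack_pages_used: u32,
    pub thread_kernel_stack_pages_max: u32,
}

impl ThreadLedger {
    /// Ledger of a fresh process with its initial thread already charged.
    pub fn new_process() -> Self {
        ThreadLedger {
            thread_records_used: 1,
            thread_records_max: PROCESS_THREAD_LIMIT,
            thread_kernel_stack_pages_used: PROCESS_THREAD_KERNEL_STACK_PAGES,
            // Assumption: the page budget covers the full thread limit.
            thread_kernel_stack_pages_max: PROCESS_THREAD_LIMIT
                * PROCESS_THREAD_KERNEL_STACK_PAGES,
        }
    }

    /// Reserve one thread record and one kernel stack, or change nothing.
    fn reserve(&mut self) -> Result<(), CreateError> {
        if self.thread_records_used >= self.thread_records_max {
            return Err(CreateError::Overloaded("thread limit reached"));
        }
        let pages = self.thread_kernel_stack_pages_used + PROCESS_THREAD_KERNEL_STACK_PAGES;
        if pages > self.thread_kernel_stack_pages_max {
            return Err(CreateError::Overloaded("kernel stack budget exhausted"));
        }
        self.thread_records_used += 1;
        self.thread_kernel_stack_pages_used = pages;
        Ok(())
    }

    /// Undo both reservations made by `reserve`.
    fn rollback(&mut self) {
        self.thread_records_used -= 1;
        self.thread_kernel_stack_pages_used -= PROCESS_THREAD_KERNEL_STACK_PAGES;
    }

    /// Reserve first, then run the fallible step; roll back if it fails.
    pub fn create_thread(
        &mut self,
        allocate_stack: impl FnOnce() -> bool,
    ) -> Result<(), CreateError> {
        self.reserve()?;
        if !allocate_stack() {
            self.rollback();
            return Err(CreateError::Overloaded("kernel stack allocation failed"));
        }
        Ok(())
    }
}
```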

Creation failures are controlled application exceptions. Thread count, kernel-stack budget, handle cap-slot exhaustion, and kernel stack allocation failure return Overloaded with a specific message and no partially runnable thread. Invalid entry, stack, FS base, or flags return Failed.

Thread exit releases the kernel stack only after the scheduler is running on a different kernel stack. The thread record remains charged while a live ThreadHandle, pending join waiter, or unjoined exit status can still observe it. Once the handle is released without a pending join, or once a one-shot join has consumed the status and no wait record pins it, the retained record charge is released. Process exit releases all thread records and stack charges once.

FS Base And TLS

FS base is thread-owned. The existing ThreadControl.getFsBase and ThreadControl.setFsBase operations keep their names, but after threading they refer to the current thread, not the whole process. setFsBase continues to reject non-user-canonical values and writes the CPU FS-base MSR immediately when called by the running thread.

The initial process thread uses the PT_TLS block installed by ELF loading. Additional threads receive an FS base from ThreadSpawner.create; the runtime is responsible for allocating and initializing each thread’s TLS/TCB data. There is no process-global FS base. Current-thread FS-base operations are useful for the single-thread runtime checkpoint, but they must not be treated as the final threading ABI for language runtimes. True multi-threaded Go or C/POSIX-like runtime support requires each ThreadRef to own a distinct TLS block and FS base.

Context switching must save the outgoing thread’s FS base and restore the next thread’s FS base even when both threads belong to the same process and no CR3 switch is needed.

Thread Identity In Waiters And Dispatch

The concrete identity type for in-process scheduling is:

ThreadRef {
    pid,
    process_generation,
    tid,
    thread_generation,
}

Process identity still governs authority and accounting, but wakeup and blocking state must name a thread. 7.2 changes context-aware capability dispatch so CapCallContext carries both the caller process id for authority checks and the caller ThreadRef for wake/cancel decisions. Existing pid-only records that can resume execution or write a caller CQE must be widened before multiple threads can run in one process.

The migration target is:

  • TimerSleepWaiter stores the sleeping ThreadRef and validates the generation before waking it;
  • endpoint CALL, RECV, RETURN target, deferred-cancel, current-caller, and direct IPC handoff records store the blocked or target ThreadRef;
  • terminal line input and any other ProcessWaiter consumer store the waiting ThreadRef and validate the generation before writing a CQE;
  • ProcessHandle.wait records the waiting ThreadRef while the handle still names the child process;
  • ThreadHandle.join records the waiting ThreadRef and the target ThreadRef;
  • the single process-ring cap_enter waiter is stored as Option<ThreadRef>;
  • process-exit cleanup cancels every waiter whose pid and process_generation match the exiting process, regardless of thread id.

A generation mismatch on wake or completion is a stale waiter and must be drained without writing to userspace. This mirrors current process-generation behavior and prevents one thread slot reuse from receiving another thread’s Timer, endpoint, join, or ring completion.
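The stale-waiter rule and the process-exit cleanup rule can be illustrated together. In this sketch the `Waiter` record, the `WakeOutcome` enum, both function names, and the `u32` field widths are assumptions; `user_data` stands in for whatever completion payload a real wait record carries.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct ThreadRef {
    pub pid: u32,
    pub process_generation: u32,
    pub tid: u32,
    pub thread_generation: u32,
}

/// Hypothetical wait record naming the blocked thread.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Waiter {
    pub thread: ThreadRef,
    pub user_data: u64,
}

#[derive(Debug, PartialEq, Eq)]
pub enum WakeOutcome {
    /// Identity matched: the completion may be delivered.
    Woken(u64),
    /// Identity mismatched: drop the record, write nothing to userspace.
    StaleDrained,
}

/// `live` is the identity currently occupying the waiter's thread slot,
/// or `None` if the slot is empty.
pub fn complete_waiter(waiter: Waiter, live: Option<ThreadRef>) -> WakeOutcome {
    match live {
        Some(current) if current == waiter.thread => WakeOutcome::Woken(waiter.user_data),
        _ => WakeOutcome::StaleDrained,
    }
}

/// Process-exit cleanup: remove every waiter of the exiting process,
/// regardless of thread id. Returns how many records were cancelled.
pub fn cancel_for_process_exit(
    waiters: &mut Vec<Waiter>,
    pid: u32,
    process_generation: u32,
) -> usize {
    let before = waiters.len();
    waiters.retain(|w| {
        !(w.thread.pid == pid && w.thread.process_generation == process_generation)
    });
    before - waiters.len()
}
```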

Exit And Join

The current exit(code) syscall remains process exit. It terminates the whole process, releases the shared capability table, cancels process-owned endpoint state, removes all timer/park/ring waiters for every thread in the process, and completes the parent-facing ProcessHandle.

Thread exit is separate and does not add a syscall. The initial implementation adds ThreadControl.exitThread(code) as a terminal capability-ring operation on the current thread. A successful invocation does not post a CQE back to the exiting thread, because cap_enter will not return to that execution context. It records the exit code, wakes or completes any valid join waiter, and removes only the current thread from scheduling. If the last non-idle thread in a process exits through exitThread, the process exits with that thread’s code.

ThreadHandle.join is process-local and one-shot. If the target thread already exited and its status is retained, join returns its code immediately and marks the status joined. If it is still live, join blocks the caller’s thread until the target exits. Self-join returns Failed. A second waiter, join after a successful join, or join after detach returns Failed; it must not park an ambiguous waiter. ThreadHandle.exitCode is nonblocking and may observe the retained status while the handle is live, but it does not consume the one-shot join right.
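The one-shot join rules read naturally as a small state machine. The sketch below is a model under stated assumptions, not the kernel's data structure: `JoinState`, `JoinResult`, and the method names are hypothetical, threads are identified by a bare tid for brevity, and reporting an exit code of 0 while the target is still live is an assumption about the `exitCode` result.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum JoinResult {
    /// Target already exited; the status is consumed by this call.
    Exited(i64),
    /// Target is live; the caller becomes the single join waiter.
    Blocked,
    /// Self-join, second waiter, repeated join, or join after detach.
    Failed,
}

#[derive(Debug, Default)]
pub struct JoinState {
    exit_code: Option<i64>,
    waiter: Option<u32>,
    joined: bool,
    detached: bool,
}

impl JoinState {
    pub fn join(&mut self, caller_tid: u32, target_tid: u32) -> JoinResult {
        let ambiguous = self.joined || self.detached || self.waiter.is_some();
        if caller_tid == target_tid || ambiguous {
            return JoinResult::Failed;
        }
        match self.exit_code {
            Some(code) => {
                self.joined = true;
                JoinResult::Exited(code)
            }
            None => {
                self.waiter = Some(caller_tid);
                JoinResult::Blocked
            }
        }
    }

    /// Nonblocking observation; never consumes the one-shot join right.
    pub fn exit_code(&self) -> (bool, i64) {
        match self.exit_code {
            Some(code) => (true, code),
            None => (false, 0),
        }
    }

    /// Record the target's exit; returns the waiter tid to wake, if any.
    pub fn on_exit(&mut self, code: i64) -> Option<u32> {
        self.exit_code = Some(code);
        let waiter = self.waiter.take();
        if waiter.is_some() {
            self.joined = true;
        }
        waiter
    }

    /// Model of releasing the last handle before the target exits.
    pub fn detach(&mut self) {
        self.detached = true;
    }
}
```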

Releasing the last ThreadHandle before the target exits detaches the target: the thread continues to run, but no exit status is retained after it exits unless a join waiter already pins the state. Releasing the handle after exit but before join drops the retained status and releases the thread-record charge. A pending join waiter pins the handle state until completion or process exit, so cap release cannot create a use-after-free. The exiting thread’s kernel stack must not be freed while it is still executing on that stack; final drop follows the existing process-exit rule and happens after another kernel stack is active.

Fatal user faults remain process-fatal in the first implementation. Per-thread fault isolation can be designed later, after the basic scheduler and futex paths are stable.

Capability Ring And Blocking

The first threading implementation keeps one capability ring per process. The runtime’s single-owner ring-client invariant remains part of the contract: well-formed userspace serializes ring submission and completion matching through capos-rt.

The kernel must not admit multiple blocked cap_enter waiters on the same process ring in 7.2. If a second thread in the same process asks to block in cap_enter while another thread is already the process ring waiter, the kernel returns the current available completion count without blocking that second thread. This preserves the existing syscall return shape and forces the runtime to retry or wait through a runtime-level mutex/park rather than letting two threads race to consume the same CQEs. A thread blocked in Park is separate from the process ring’s CapEnter waiter; it must not consume the one blocked ring-waiter slot.
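The single-waiter rule can be modelled in a few lines. This sketch covers only the case where a thread asks to block; `ProcessRing`, `CapEnterOutcome`, and `request_block` are hypothetical names, and the real cap_enter argument and return shapes are not reproduced here.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct ThreadRef {
    pub pid: u32,
    pub process_generation: u32,
    pub tid: u32,
    pub thread_generation: u32,
}

#[derive(Debug, PartialEq, Eq)]
pub enum CapEnterOutcome {
    /// Return at once with the number of completions currently available.
    Return(u32),
    /// The caller became the single blocked waiter on the process ring.
    Blocked,
}

/// Hypothetical per-process ring state relevant to blocking.
#[derive(Debug, Default)]
pub struct ProcessRing {
    pub available_cqes: u32,
    pub waiter: Option<ThreadRef>,
}

impl ProcessRing {
    /// A thread asks to block in cap_enter on the process ring.
    pub fn request_block(&mut self, caller: ThreadRef) -> CapEnterOutcome {
        match self.waiter {
            // A sibling already holds the one waiter slot: do not block.
            Some(existing) if existing != caller => {
                CapEnterOutcome::Return(self.available_cqes)
            }
            // The slot is free, or the caller already owns it.
            _ => {
                self.waiter = Some(caller);
                CapEnterOutcome::Blocked
            }
        }
    }
}
```

The second thread sees an ordinary non-blocking return, so the runtime, not the kernel, decides how it waits.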

This constraint avoids freezing a premature per-thread or sharded completion queue ABI. A later runtime/ring milestone can add per-thread rings, completion steering, or a process-level dispatcher thread if measurements show that the single ring waiter is too restrictive.

The full-SMP target is now recorded in Ring v2 For Full SMP: each thread gets its own complete SQ/CQ endpoint, and cap_enter waits on the current thread’s CQ rather than a shared process CQ. The current process-ring rule remains a compatibility constraint for 7.2 and for any runtime reactor bridge built before Ring v2.

Park Handoff

Park authority is defined in Park Authority. The scheduler changes above must leave room for a thread block reason that is not tied to the process ring CQ. The frozen handoff is:

  • park wait blocks the current thread, not the whole process;
  • park wake makes selected generation-checked ThreadRef values runnable;
  • timeouts use the same monotonic time base as Timer;
  • private park keys are based on address-space identity plus user virtual address;
  • shared-memory park keys are MemoryObject-derived identity plus offset;
  • the first implementation starts with compact CAP_OP_PARK and CAP_OP_UNPARK operations rather than generic Cap’n Proto methods;
  • park wait SQEs are thread-owned so ring dispatch cannot park a sibling thread under the waiter’s user_data;
  • blocking park wait is a syscall-context operation that releases runtime ring-client ownership before the thread parks, while capos-rt demultiplexes reserved park CQEs back to the waiting thread.
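The two park key shapes in the list above can be sketched as an enum that keys a waiter table. Everything here is illustrative: the `ParkKey` and `ParkTable` names, the `u64` identity fields, the FIFO wake order, and the use of `std::collections::HashMap` (a kernel would likely use a no_std structure) are assumptions, not the Park Authority design.

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct ThreadRef {
    pub pid: u32,
    pub process_generation: u32,
    pub tid: u32,
    pub thread_generation: u32,
}

/// Park key identity: private keys are scoped by address space,
/// shared keys by the backing memory object.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub enum ParkKey {
    Private { address_space: u64, vaddr: u64 },
    Shared { memory_object: u64, offset: u64 },
}

#[derive(Default)]
pub struct ParkTable {
    waiters: HashMap<ParkKey, Vec<ThreadRef>>,
}

impl ParkTable {
    /// Park wait blocks one thread, never the whole process.
    pub fn wait(&mut self, key: ParkKey, thread: ThreadRef) {
        self.waiters.entry(key).or_default().push(thread);
    }

    /// Park wake selects up to `max` waiters on `key` (FIFO is an assumption).
    pub fn wake(&mut self, key: ParkKey, max: usize) -> Vec<ThreadRef> {
        let queue = match self.waiters.get_mut(&key) {
            Some(queue) => queue,
            None => return Vec::new(),
        };
        let count = max.min(queue.len());
        let woken: Vec<ThreadRef> = queue.drain(..count).collect();
        if queue.is_empty() {
            self.waiters.remove(&key);
        }
        woken
    }
}
```

Because the address-space identity is part of a private key, the same user virtual address in two processes never collides.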

The pre-thread 4.5.4 measurement chose the compact capability-authorized shape for failed wait and empty wake. The 4.5.5 checkpoint measured the real blocked/resume path through thread-lifecycle under make run-measure, so the compact ParkSpace opcodes remain the runtime ABI target for this slice.

Security Invariants

  • A thread never owns a separate capability table in the initial model.
  • A thread cannot escape the authority of its containing process.
  • A ThreadHandle names only a thread in the same process and is non-transferable in the first implementation.
  • Thread creation is charged to one process-owned thread/kernel-stack ledger of record before the thread can become runnable.
  • Process exit releases shared authority once, after all live threads are removed from scheduling.
  • Per-process resource quotas are shared by all threads.
  • ThreadControl changes only the current thread’s FS base.
  • ThreadControl.exitThread terminates only the current thread and is a capability-ring operation, not a syscall.
  • Every waiter or direct handoff that can resume execution stores a generation-checked ThreadRef.
  • Process-owned user-buffer validation/copy/read paths hold the process AddressSpace lock; future shared-memory thread primitives still need mapping provenance or object pins when they derive keys from shared backing.

Implementation Order

  1. Add internal Thread state, make each process own one initial thread, move saved context / kernel stack / FS base / block state onto that thread, and charge the initial thread against private process ledgers. Done 2026-04-24 23:09 UTC.
  2. Change scheduler queues, blocking, exit cleanup, and direct IPC targets from pid-oriented state to thread references while preserving one thread per process. Done 2026-04-24 23:33 UTC.
  3. Add ThreadSpawner, ThreadHandle, and ThreadControl.exitThread with a QEMU smoke for create, join, detach, self-join rejection, second join rejection, and last-thread process exit. Done 2026-04-25.
  4. Implement the ParkSpace private wait/wake path from Park Authority after the scheduler can block and wake individual threads, then run 4.5.5 blocked/resume measurements before declaring the park ABI stable. Done 2026-04-25.

Validation Plan

The first implementation smoke should create two threads in one process, prove they share the address space and CapSet, prove each has an independent FS base, join one thread from another, then let the last thread exit the process. The existing make run-spawn path should keep covering runtime-fs-base and single-thread-runtime so regressions in the pre-thread runtime contract stay visible. make run-measure additionally records the private ParkSpace blocked/resume timings and proves that process exit succeeds while a thread is still blocked in park wait.