In-Process Threading Contract
This page freezes the 7.1.0 design contract for kernel-managed threads inside one process. The 7.1.1 park authority contract is frozen separately in Park Authority. These pages are the handoff from the single-thread runtime checkpoint to the 7.2 implementation work. The 7.2.3 checkpoint implements the basic single-CPU lifecycle plus private ParkSpace wait/wake.
Scope
The first threading milestone stays single-CPU. It changes the scheduler’s unit of execution from process to thread while keeping the process as the authority, address-space, and resource-accounting boundary. SMP, per-CPU run queues, TLB shootdown, SQPOLL, and scheduler-policy services remain later milestones.
This contract covers:
- process-owned versus thread-owned state;
- the initial thread creation ABI;
- per-thread FS-base/TLS rules;
- thread exit and join semantics;
- the ring-blocking constraint needed before a sharded or per-thread ring design exists;
- the handoff to the 7.1.1 park authority design.
Ownership Split
The process remains the security boundary. All threads in one process share the same address space and capability table, so a thread has the same authority as its sibling threads.
| Process-owned state | Thread-owned state |
|---|---|
| Process id and process generation | Thread id and thread generation |
| User address space and CR3 | Saved CPU context and user register state |
| Capability table and resource ledger | Kernel stack and syscall stack top |
| Capability ring page and ring scratch | FS base |
| Read-only CapSet page | Scheduling/blocking state |
| ProcessHandle exit state | ThreadHandle join/exit state |
| Endpoint owner state and process-wide cleanup hooks | Future scheduling-context binding |
The implementation migrates incrementally. The 7.2.0 slice makes each process
contain a single initial Thread, with saved context, kernel stack, FS base,
and blocking state stored on that thread. The 7.2.1 slice changes
scheduler-owned queues, current execution, direct IPC handoff, and wake records
to generation-checked ThreadRef values while still allowing exactly one
thread per process. Later slices widen creation and lifecycle. The
single-thread intermediate state must preserve existing QEMU behavior.
Scheduler Contract
The scheduler stores runnable execution contexts as thread
references, not process ids. A thread reference is (pid, process_generation, tid, thread_generation). The process generation keeps handles from naming a
reused process; the thread generation keeps handles from naming a reused
thread slot inside a live process.
The 7.2.1 checkpoint applies this identity to Scheduler.current, run queues,
direct IPC targets, Timer sleep waiters, process/terminal waiters, and endpoint
caller/receiver wake records while preserving one initial thread per process.
The run queue, current, direct IPC target, and blocked waiter scans become
thread-oriented. Address-space switches happen only when the next runnable
thread belongs to a different process. TSS.RSP0, the syscall kernel stack, and
FS base are updated on every thread switch because those are thread-local
machine resources.
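The switch rule above can be sketched as a small decision function. This is an illustrative sketch rather than kernel code: `SwitchPlan`, `plan_switch`, and the `u32` field widths are assumptions made for the example, not names taken from the implementation.

```rust
/// Thread identity as described in this contract (field widths are assumed).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct ThreadRef {
    pub pid: u32,
    pub process_generation: u32,
    pub tid: u32,
    pub thread_generation: u32,
}

/// Hypothetical summary of what a context switch has to reload.
#[derive(Debug, PartialEq, Eq)]
pub struct SwitchPlan {
    /// Reload CR3 (address-space switch).
    pub reload_cr3: bool,
    /// Reload TSS.RSP0, the syscall kernel stack, and FS base.
    pub reload_thread_state: bool,
}

/// Decide what to reload when moving from `prev` to `next`.
pub fn plan_switch(prev: ThreadRef, next: ThreadRef) -> SwitchPlan {
    let same_process = prev.pid == next.pid
        && prev.process_generation == next.process_generation;
    SwitchPlan {
        // CR3 changes only when the next thread lives in another process.
        reload_cr3: !same_process,
        // Thread-local machine state changes on every real thread switch.
        reload_thread_state: prev != next,
    }
}
```

Switching between two sibling threads therefore reloads the thread-local state but leaves CR3 alone.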
The idle process can remain the existing special user-mode idle process until the kernel-mode/per-CPU idle work lands. It should still be treated as a kernel-owned execution context that cannot block, exit, or hold ordinary caps.
Thread Creation ABI
Thread creation is exposed through a process-local ThreadSpawner capability.
It creates threads only in the caller’s current process. It does not grant
authority to another process and is non-transferable across IPC in the initial
implementation.
The initial control-plane shape is:
interface ThreadSpawner {
  create @0 (
    entry :UInt64,
    stackTop :UInt64,
    arg :UInt64,
    fsBase :UInt64,
    flags :UInt64
  ) -> (handleIndex :UInt16);
}

interface ThreadHandle {
  join @0 () -> (exitCode :Int64);
  exitCode @1 () -> (exited :Bool, exitCode :Int64);
}

interface ThreadControl {
  getFsBase @0 () -> (fsBase :UInt64);
  setFsBase @1 (fsBase :UInt64) -> ();
  exitThread @2 (code :Int64) -> ();
}
Any 7.2 schema adjustment must update this page in the same branch before
implementation review. The stable semantics are that creation is in-process,
the returned handle is an observed result cap, ThreadHandle observes one
thread rather than the whole process, and current-thread exit stays in the
capability-ring transport rather than adding a syscall.
The new thread starts in Ring 3 at entry with:
- RDI = arg;
- RSI = tid;
- RDX = pid;
- RCX = RING_VADDR;
- R8 = CAPSET_VADDR, or zero if the process has no CapSet.
The runtime supplies the user stack and TLS block. The kernel validates that
entry, stackTop, and fsBase are user-canonical, that stackTop is
16-byte aligned at entry, and that reserved flags bits are zero. Page
presence and stack-growth policy remain process address-space questions;
before a page-fault subsystem exists, an invalid thread stack can fault the
process.
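The argument checks described above might look like the following sketch. It assumes 4-level paging (so "user-canonical" means the lower canonical half) and assumes that no flag bits are defined yet; `validate_create_args`, `CreateError`, and both constants are hypothetical names for illustration only.

```rust
/// Assumption: 4-level paging, so user-canonical means the lower half.
const USER_CANONICAL_MAX: u64 = 0x0000_7FFF_FFFF_FFFF;
/// Assumption: no flag bits are defined yet, so every bit is reserved.
const DEFINED_FLAGS: u64 = 0;

/// Hypothetical error type mirroring the Failed exception in this contract.
#[derive(Debug, PartialEq, Eq)]
pub enum CreateError {
    Failed(&'static str),
}

fn is_user_canonical(addr: u64) -> bool {
    addr <= USER_CANONICAL_MAX
}

/// Validate ThreadSpawner.create arguments before any allocation happens.
pub fn validate_create_args(
    entry: u64,
    stack_top: u64,
    fs_base: u64,
    flags: u64,
) -> Result<(), CreateError> {
    if !is_user_canonical(entry) {
        return Err(CreateError::Failed("entry is not user-canonical"));
    }
    if !is_user_canonical(stack_top) {
        return Err(CreateError::Failed("stackTop is not user-canonical"));
    }
    if stack_top % 16 != 0 {
        return Err(CreateError::Failed("stackTop is not 16-byte aligned"));
    }
    if !is_user_canonical(fs_base) {
        return Err(CreateError::Failed("fsBase is not user-canonical"));
    }
    if flags & !DEFINED_FLAGS != 0 {
        return Err(CreateError::Failed("reserved flags bits are set"));
    }
    Ok(())
}
```

Page presence is deliberately not checked here, matching the rule that mapping and stack-growth policy stay with the address space.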
Resource Accounting
Thread creation allocates kernel memory and is quota-backed by process-owned
ledger state, not per-capability helper counters. The 7.2.0 checkpoint charges
the initial thread during process creation; ThreadSpawner.create extends the
same ledgers to additional threads. The ledger of record is:
- PROCESS_THREAD_LIMIT, the maximum live or retained thread records in one process, initially 16;
- PROCESS_THREAD_KERNEL_STACK_PAGES, initially matching the current per-thread kernel stack allocation size of 32 pages;
- thread_records_used / thread_records_max;
- thread_kernel_stack_pages_used / thread_kernel_stack_pages_max.
The initial process thread charges one thread record and one kernel-stack
allocation during process creation. ThreadSpawner.create reserves a thread
record and kernel-stack page budget before allocating the stack or publishing a
ThreadHandle; every later failure rolls both reservations back before
returning. Cap-slot reservation for the result handle remains charged to the
existing process cap-table ledger.
Creation failures are controlled application exceptions. Thread count,
kernel-stack budget, handle cap-slot exhaustion, and kernel stack allocation
failure return Overloaded with a specific message and no partially runnable
thread. Invalid entry, stack, FS base, or flags return Failed.
Thread exit releases the kernel stack only after the scheduler is running on a
different kernel stack. The thread record remains charged while a live
ThreadHandle, pending join waiter, or unjoined exit status can still observe
it. Once the handle is released without a pending join, or once a one-shot join
has consumed the status and no wait record pins it, the retained record charge
is released. Process exit releases all thread records and stack charges once.
FS Base And TLS
FS base is thread-owned. The existing ThreadControl.getFsBase and
ThreadControl.setFsBase operations keep their names, but after threading they
refer to the current thread, not the whole process. setFsBase continues to
reject non-user-canonical values and writes the CPU FS-base MSR immediately
when called by the running thread.
The initial process thread uses the PT_TLS block installed by ELF loading.
Additional threads receive an FS base from ThreadSpawner.create; the runtime
is responsible for allocating and initializing each thread’s TLS/TCB data.
There is no process-global FS base. Current-thread FS-base operations are useful
for the single-thread runtime checkpoint, but they must not be treated as the
final threading ABI for language runtimes. True multi-threaded Go or
C/POSIX-like runtime support requires each ThreadRef to own a distinct TLS
block and FS base.
Context switching must save the outgoing thread’s FS base and restore the next thread’s FS base even when both threads belong to the same process and no CR3 switch is needed.
Thread Identity In Waiters And Dispatch
The concrete identity type for in-process scheduling is:
ThreadRef {
    pid,
    process_generation,
    tid,
    thread_generation,
}
Process identity still governs authority and accounting, but wakeup and
blocking state must name a thread. 7.2 changes context-aware capability
dispatch so CapCallContext carries both the caller process id for authority
checks and the caller ThreadRef for wake/cancel decisions. Existing pid-only
records that can resume execution or write a caller CQE must be widened before
multiple threads can run in one process.
The migration target is:
- TimerSleepWaiter stores the sleeping ThreadRef and validates the generation before waking it;
- endpoint CALL, RECV, RETURN target, deferred-cancel, current-caller, and direct IPC handoff records store the blocked or target ThreadRef;
- terminal line input and any other ProcessWaiter consumer store the waiting ThreadRef and validate the generation before writing a CQE;
- ProcessHandle.wait records the waiting ThreadRef while the handle still names the child process;
- ThreadHandle.join records the waiting ThreadRef and the target ThreadRef;
- the single process-ring cap_enter waiter is stored as Option<ThreadRef>;
- process-exit cleanup cancels every waiter whose pid and process_generation match the exiting process, regardless of thread id.
A generation mismatch on wake or completion marks a stale waiter, which must be drained without writing to userspace. This mirrors current process-generation behavior and prevents a reused thread slot from receiving another thread’s Timer, endpoint, join, or ring completion.
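The stale-waiter rule and the process-exit cancellation rule can be sketched as two small predicates. The names `WakeOutcome`, `check_waiter`, and `cancelled_by_process_exit` are hypothetical, as are the `u32` field widths; the caller is assumed to have looked up whichever thread currently occupies the waiter's slot.

```rust
/// Thread identity as described in this contract (field widths are assumed).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct ThreadRef {
    pub pid: u32,
    pub process_generation: u32,
    pub tid: u32,
    pub thread_generation: u32,
}

#[derive(Debug, PartialEq, Eq)]
pub enum WakeOutcome {
    /// The waiter still names the live thread: wake it or write its CQE.
    Wake,
    /// The slot is empty or was reused: drop the record, touch no user memory.
    DrainStale,
}

/// `live` is the thread currently occupying the waiter's slot, if any.
pub fn check_waiter(waiter: ThreadRef, live: Option<ThreadRef>) -> WakeOutcome {
    match live {
        Some(current) if current == waiter => WakeOutcome::Wake,
        _ => WakeOutcome::DrainStale,
    }
}

/// Process-exit cleanup matches on process identity only, ignoring thread id.
pub fn cancelled_by_process_exit(
    waiter: ThreadRef,
    pid: u32,
    process_generation: u32,
) -> bool {
    waiter.pid == pid && waiter.process_generation == process_generation
}
```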
Exit And Join
The current exit(code) syscall remains process exit. It terminates the whole
process, releases the shared capability table, cancels process-owned endpoint
state, removes all timer/park/ring waiters for every thread in the process,
and completes the parent-facing ProcessHandle.
Thread exit is separate and does not add a syscall. The initial implementation
adds ThreadControl.exitThread(code) as a terminal capability-ring operation
on the current thread. A successful invocation does not post a CQE back to the
exiting thread, because cap_enter will not return to that execution context.
It records the exit code, wakes or completes any valid join waiter, and removes
only the current thread from scheduling. If the last non-idle thread in a
process exits through exitThread, the process exits with that thread’s code.
ThreadHandle.join is process-local and one-shot. If the target thread already
exited and its status is retained, join returns its code immediately and marks
the status joined. If it is still live, join blocks the caller’s thread until
the target exits. Self-join returns Failed. A second waiter, join after a
successful join, or join after detach returns Failed; it must not park an
ambiguous waiter. ThreadHandle.exitCode is nonblocking and may observe the
retained status while the handle is live, but it does not consume the one-shot
join right.
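The one-shot join rules can be modeled as a small state machine. This sketch is a simplified model under assumptions: it identifies threads by a bare `tid`, omits detach, and reports no status from `exit_code` once a join has consumed it. `JoinState`, `JoinResult`, and `on_target_exit` are hypothetical names.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum JoinResult {
    /// The target already exited; the retained code is consumed.
    Ready(i64),
    /// The caller's thread blocks until the target exits.
    Blocked,
    /// Self-join, second waiter, or join after a successful join.
    Failed,
}

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Status {
    Live,
    Exited(i64),
    Joined,
}

pub struct JoinState {
    target_tid: u32,
    status: Status,
    waiter: Option<u32>,
}

impl JoinState {
    pub fn new(target_tid: u32) -> Self {
        JoinState { target_tid, status: Status::Live, waiter: None }
    }

    pub fn join(&mut self, caller_tid: u32) -> JoinResult {
        // Never park an ambiguous waiter: reject self-join and second waiters.
        if caller_tid == self.target_tid || self.waiter.is_some() {
            return JoinResult::Failed;
        }
        match self.status {
            Status::Exited(code) => {
                self.status = Status::Joined;
                JoinResult::Ready(code)
            }
            Status::Live => {
                self.waiter = Some(caller_tid);
                JoinResult::Blocked
            }
            Status::Joined => JoinResult::Failed,
        }
    }

    /// Nonblocking observation; does not consume the one-shot join right.
    pub fn exit_code(&self) -> (bool, i64) {
        match self.status {
            Status::Exited(code) => (true, code),
            _ => (false, 0),
        }
    }

    /// The target exits; returns the waiter to wake with `code`, if any.
    pub fn on_target_exit(&mut self, code: i64) -> Option<u32> {
        let waiter = self.waiter.take();
        self.status = if waiter.is_some() {
            Status::Joined
        } else {
            Status::Exited(code)
        };
        waiter
    }
}
```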
Releasing the last ThreadHandle before the target exits detaches the target:
the thread continues to run, but no exit status is retained after it exits
unless a join waiter already pins the state. Releasing the handle after exit
but before join drops the retained status and releases the thread-record
charge. A pending join waiter pins the handle state until completion or process
exit, so cap release cannot create a use-after-free. The exiting thread’s
kernel stack must not be freed while it is still executing on that stack; final
drop follows the existing process-exit rule and happens after another kernel
stack is active.
Fatal user faults remain process-fatal in the first implementation. Per-thread fault isolation can be designed later, after the basic scheduler and futex paths are stable.
Capability Ring And Blocking
The first threading implementation keeps one capability ring per process. The
runtime’s single-owner ring-client invariant remains part of the contract:
well-formed userspace serializes ring submission and completion matching
through capos-rt.
The kernel must not admit multiple blocked cap_enter waiters on the same
process ring in 7.2. If a second thread in the same process asks to block in
cap_enter while another thread is already the process ring waiter, the kernel
returns the current available completion count without blocking that second
thread. This preserves the existing syscall return shape and forces the
runtime to retry or wait through a runtime-level mutex/park rather than
letting two threads race to consume the same CQEs. A thread blocked in
Park is separate from the process ring’s CapEnter waiter; it must not
consume the one blocked ring-waiter slot.
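The single ring-waiter rule can be sketched as follows. `ProcessRing`, `EnterOutcome`, and the `min_complete` parameter are assumptions for illustration and do not describe the real cap_enter signature; the point is only that a second thread gets the current completion count back instead of blocking.

```rust
/// Thread identity as described in this contract (field widths are assumed).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct ThreadRef {
    pub pid: u32,
    pub process_generation: u32,
    pub tid: u32,
    pub thread_generation: u32,
}

#[derive(Debug, PartialEq, Eq)]
pub enum EnterOutcome {
    /// The caller became the one blocked ring waiter.
    Blocked,
    /// Return immediately with the currently available completion count.
    Returned(u32),
}

/// Hypothetical per-process ring state.
pub struct ProcessRing {
    pub waiter: Option<ThreadRef>,
    pub completions_available: u32,
}

impl ProcessRing {
    pub fn cap_enter_wait(&mut self, caller: ThreadRef, min_complete: u32) -> EnterOutcome {
        // Enough completions are already posted: no need to block.
        if self.completions_available >= min_complete {
            return EnterOutcome::Returned(self.completions_available);
        }
        match self.waiter {
            // Another thread already holds the waiter slot: do not block.
            Some(existing) if existing != caller => {
                EnterOutcome::Returned(self.completions_available)
            }
            // The slot is free (or already ours): take it and block.
            _ => {
                self.waiter = Some(caller);
                EnterOutcome::Blocked
            }
        }
    }
}
```

A runtime that sees the early return is expected to retry or wait on its own mutex or park primitive, as the paragraph above describes.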
This constraint avoids freezing a premature per-thread or sharded completion queue ABI. A later runtime/ring milestone can add per-thread rings, completion steering, or a process-level dispatcher thread if measurements show that the single ring waiter is too restrictive.
The full-SMP target is now recorded in
Ring v2 For Full SMP: each thread gets
its own complete SQ/CQ endpoint, and cap_enter waits on the current thread’s
CQ rather than a shared process CQ. The current process-ring rule remains a
compatibility constraint for 7.2 and for any runtime reactor bridge built
before Ring v2.
Park Handoff
Park authority is defined in Park Authority. The scheduler changes above must leave room for a thread block reason that is not tied to the process ring CQ. The frozen handoff is:
- park wait blocks the current thread, not the whole process;
- park wake makes selected generation-checked ThreadRef values runnable;
- timeouts use the same monotonic time base as Timer;
- private park keys are based on address-space identity plus user virtual address;
- shared-memory park keys are MemoryObject-derived identity plus offset;
- the first implementation starts with compact CAP_OP_PARK and CAP_OP_UNPARK operations rather than generic Cap’n Proto methods;
- park wait SQEs are thread-owned so ring dispatch cannot park a sibling thread under the waiter’s user_data;
- blocking park wait is a syscall-context operation that releases runtime ring-client ownership before the thread parks, while capos-rt demultiplexes reserved park CQEs back to the waiting thread.
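The two park key shapes in the list above can be illustrated with a small wait table. This is a sketch under assumptions: `ParkKey`, `ParkTable`, the numeric id fields, and FIFO wake order are all hypothetical, and timeouts and generation checks are left out for brevity.

```rust
use std::collections::{HashMap, VecDeque};

/// Thread identity as described in this contract (field widths are assumed).
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct ThreadRef {
    pub pid: u32,
    pub process_generation: u32,
    pub tid: u32,
    pub thread_generation: u32,
}

/// Hypothetical key type covering both park key shapes.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub enum ParkKey {
    /// Address-space identity plus user virtual address.
    Private { address_space_id: u64, vaddr: u64 },
    /// MemoryObject-derived identity plus offset.
    Shared { memory_object_id: u64, offset: u64 },
}

#[derive(Default)]
pub struct ParkTable {
    waiters: HashMap<ParkKey, VecDeque<ThreadRef>>,
}

impl ParkTable {
    /// Park one thread on `key`; only that thread blocks, not its process.
    pub fn wait(&mut self, key: ParkKey, thread: ThreadRef) {
        self.waiters.entry(key).or_default().push_back(thread);
    }

    /// Wake up to `max` threads parked on `key`, oldest first (an assumption).
    pub fn wake(&mut self, key: ParkKey, max: usize) -> Vec<ThreadRef> {
        let mut woken = Vec::new();
        if let Some(queue) = self.waiters.get_mut(&key) {
            let take = max.min(queue.len());
            woken.extend(queue.drain(..take));
            if queue.is_empty() {
                self.waiters.remove(&key);
            }
        }
        woken
    }
}
```

Keying private waits on address-space identity means the same virtual address in two processes never aliases.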
The pre-thread 4.5.4 measurement chose the compact capability-authorized shape for
failed wait and empty wake. 4.5.5 measured the real blocked/resume path through
thread-lifecycle under make run-measure, so the compact ParkSpace opcodes
remain the runtime ABI target for this slice.
Security Invariants
- A thread never owns a separate capability table in the initial model.
- A thread cannot escape the authority of its containing process.
- A ThreadHandle names only a thread in the same process and is non-transferable in the first implementation.
- Thread creation is charged to one process-owned thread/kernel-stack ledger of record before the thread can become runnable.
- Process exit releases shared authority once, after all live threads are removed from scheduling.
- Per-process resource quotas are shared by all threads.
- ThreadControl changes only the current thread’s FS base.
- ThreadControl.exitThread terminates only the current thread and is a capability-ring operation, not a syscall.
- Every waiter or direct handoff that can resume execution stores a generation-checked ThreadRef.
- Process-owned user-buffer validation/copy/read paths hold the process AddressSpace lock; future shared-memory thread primitives still need mapping provenance or object pins when they derive keys from shared backing.
Implementation Order
- Add internal Thread state, make each process own one initial thread, move saved context / kernel stack / FS base / block state onto that thread, and charge the initial thread against private process ledgers. Done 2026-04-24 23:09 UTC.
- Change scheduler queues, blocking, exit cleanup, and direct IPC targets from pid-oriented state to thread references while preserving one thread per process. Done 2026-04-24 23:33 UTC.
- Add ThreadSpawner, ThreadHandle, and ThreadControl.exitThread with a QEMU smoke for create, join, detach, self-join rejection, second join rejection, and last-thread process exit. Done 2026-04-25.
- Implement the ParkSpace private wait/wake path from Park Authority after the scheduler can block and wake individual threads, then run 4.5.5 blocked/resume measurements before declaring the park ABI stable. Done 2026-04-25.
Validation Plan
The first implementation smoke should create two threads in one process, prove
they share the address space and CapSet, prove each has an independent FS base,
join one thread from another, then let the last thread exit the process. The
existing make run-spawn path should keep covering runtime-fs-base and
single-thread-runtime so regressions in the pre-thread runtime contract stay
visible. make run-measure additionally records the private ParkSpace
blocked/resume timings and proves that process exit works while a park waiter is still blocked.