# In-Process Threading Contract

This page freezes the 7.1.0 design contract for kernel-managed threads inside
one process. The 7.1.1 park authority contract is frozen separately in
[Park Authority](park.md). These pages are the handoff from the
single-thread runtime checkpoint to the 7.2 implementation work. The 7.2.3
checkpoint implements the basic single-CPU lifecycle plus private ParkSpace
wait/wake.

## Scope

The first threading milestone stays single-CPU. It changes the scheduler's
unit of execution from process to thread while keeping the process as the
authority, address-space, and resource-accounting boundary. SMP, per-CPU run
queues, TLB shootdown, SQPOLL, and scheduler-policy services remain later
milestones.

This contract covers:

- process-owned versus thread-owned state;
- the initial thread creation ABI;
- per-thread FS-base/TLS rules;
- thread exit and join semantics;
- the ring-blocking constraint needed before a sharded or per-thread ring
  design exists;
- the handoff to the 7.1.1 park authority design.

## Ownership Split

The process remains the security boundary. All threads in one process share
the same address space and capability table, so a thread has the same
authority as its sibling threads.

| Process-owned state | Thread-owned state |
| --- | --- |
| Process id and process generation | Thread id and thread generation |
| User address space and CR3 | Saved CPU context and user register state |
| Capability table and resource ledger | Kernel stack and syscall stack top |
| Capability ring page and ring scratch | FS base |
| Read-only CapSet page | Scheduling/blocking state |
| ProcessHandle exit state | ThreadHandle join/exit state |
| Endpoint owner state and process-wide cleanup hooks | Future scheduling-context binding |

The implementation migrates incrementally. The 7.2.0 slice makes each process
contain a single initial `Thread`, with saved context, kernel stack, FS base,
and blocking state stored on that thread. The 7.2.1 slice changes
scheduler-owned queues, current execution, direct IPC handoff, and wake records
to generation-checked `ThreadRef` values while still allowing exactly one
thread per process. Later slices widen creation and lifecycle. The
single-thread intermediate state must preserve existing QEMU behavior.

## Scheduler Contract

`Scheduler` will store runnable execution contexts as thread
references, not process ids. A thread reference is `(pid, process_generation,
tid, thread_generation)`. The process generation keeps handles from naming a
reused process; the thread generation keeps handles from naming a reused
thread slot inside a live process.

The 7.2.1 checkpoint applies this identity to `Scheduler.current`, run queues,
direct IPC targets, Timer sleep waiters, process/terminal waiters, and endpoint
caller/receiver wake records while preserving one initial thread per process.

The run queue, `current`, direct IPC target, and blocked waiter scans become
thread-oriented. Address-space switches happen only when the next runnable
thread belongs to a different process. TSS.RSP0, the syscall kernel stack, and
FS base are updated on every thread switch because those are thread-local
machine resources.
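The per-switch split above can be sketched as a small decision helper. The `SwitchActions` type and `plan_switch` name are illustrative assumptions, not kernel API; only the rule they encode comes from the contract:

```rust
// Sketch of the per-switch work described above, with assumed names.
// CR3 reloads only across processes; thread-local machine state is
// updated on every thread switch.

#[derive(Debug, PartialEq)]
struct SwitchActions {
    reload_cr3: bool,        // address-space switch
    set_tss_rsp0: bool,      // interrupt kernel stack
    set_syscall_stack: bool, // syscall kernel stack top
    set_fs_base: bool,       // thread-owned FS base
}

fn plan_switch(prev_pid: u32, next_pid: u32) -> SwitchActions {
    SwitchActions {
        // Only a cross-process switch changes the address space.
        reload_cr3: prev_pid != next_pid,
        // Thread-local machine resources change on every switch.
        set_tss_rsp0: true,
        set_syscall_stack: true,
        set_fs_base: true,
    }
}
```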

The existing special user-mode idle process can remain in place until the
kernel-mode/per-CPU idle work lands. It should still be treated as a
kernel-owned execution context that cannot block, exit, or hold ordinary caps.

## Thread Creation ABI

Thread creation is exposed through a process-local `ThreadSpawner` capability.
It creates threads only in the caller's current process. It does not grant
authority to another process and is non-transferable across IPC in the initial
implementation.

The initial control-plane shape is:

```capnp
interface ThreadSpawner {
    create @0 (
        entry :UInt64,
        stackTop :UInt64,
        arg :UInt64,
        fsBase :UInt64,
        flags :UInt64
    ) -> (handleIndex :UInt16);
}

interface ThreadHandle {
    join @0 () -> (exitCode :Int64);
    exitCode @1 () -> (exited :Bool, exitCode :Int64);
}

interface ThreadControl {
    getFsBase @0 () -> (fsBase :UInt64);
    setFsBase @1 (fsBase :UInt64) -> ();
    exitThread @2 (code :Int64) -> ();
}
```

Any 7.2 schema adjustment must update this page in the same branch before
implementation review. The stable semantics are that creation is in-process,
the returned handle is an observed result cap, `ThreadHandle` observes one
thread rather than the whole process, and current-thread exit stays in the
capability-ring transport rather than adding a syscall.

The new thread starts in Ring 3 at `entry` with:

- `RDI = arg`;
- `RSI = tid`;
- `RDX = pid`;
- `RCX = RING_VADDR`;
- `R8 = CAPSET_VADDR`, or zero if the process has no CapSet.

The runtime supplies the user stack and TLS block. The kernel validates that
`entry`, `stackTop`, and `fsBase` are user-canonical, that `stackTop` is
16-byte aligned at entry, and that reserved `flags` bits are zero. Page
presence and stack-growth policy remain process address-space questions;
before a page-fault subsystem exists, an invalid thread stack can fault the
process.
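The argument checks above can be sketched as follows. `USER_VADDR_MAX`, `KNOWN_FLAGS`, and the function name are illustrative assumptions, not frozen ABI constants; only the three rules (user-canonical addresses, 16-byte stack alignment, reserved flags zero) come from the contract:

```rust
// Hypothetical sketch of the create() argument validation. The canonical
// boundary and flag mask are illustrative, not frozen ABI values.

const USER_VADDR_MAX: u64 = 0x0000_7fff_ffff_ffff; // lower canonical half
const KNOWN_FLAGS: u64 = 0; // no flags defined yet; every bit is reserved

#[derive(Debug, PartialEq)]
enum CreateError {
    Failed, // invalid entry, stack, FS base, or flags
}

fn validate_create(
    entry: u64,
    stack_top: u64,
    fs_base: u64,
    flags: u64,
) -> Result<(), CreateError> {
    let user_canonical = |a: u64| a <= USER_VADDR_MAX;
    // entry, stackTop, and fsBase must be user-canonical.
    if !user_canonical(entry) || !user_canonical(stack_top) || !user_canonical(fs_base) {
        return Err(CreateError::Failed);
    }
    // stackTop must be 16-byte aligned at entry.
    if stack_top % 16 != 0 {
        return Err(CreateError::Failed);
    }
    // Reserved flags bits must be zero.
    if flags & !KNOWN_FLAGS != 0 {
        return Err(CreateError::Failed);
    }
    Ok(())
}
```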

## Resource Accounting

Thread creation allocates kernel memory and is quota-backed by process-owned
ledger state, not per-capability helper counters. The 7.2.0 checkpoint charges
the initial thread during process creation; `ThreadSpawner.create` extends the
same ledgers to additional threads. The ledger of record is:

- `PROCESS_THREAD_LIMIT`, the maximum live or retained thread records in one
  process, initially 16;
- `PROCESS_THREAD_KERNEL_STACK_PAGES`, initially matching the current
  per-thread kernel stack allocation size of 32 pages;
- `thread_records_used` / `thread_records_max`;
- `thread_kernel_stack_pages_used` / `thread_kernel_stack_pages_max`.

The initial process thread charges one thread record and one kernel-stack
allocation during process creation. `ThreadSpawner.create` reserves a thread
record and kernel-stack page budget before allocating the stack or publishing a
`ThreadHandle`; every later failure rolls both reservations back before
returning. Cap-slot reservation for the result handle remains charged to the
existing process cap-table ledger.
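The reserve-then-rollback rule can be sketched as below. The `ThreadLedger` shape and method names are assumptions for illustration; the frozen contract is only the ordering: reserve both budgets before allocating, and roll both back on any later failure:

```rust
// Illustrative sketch of the per-process thread ledger. Field names are
// assumptions; the used/max pairs mirror the ledger entries listed above.

struct ThreadLedger {
    records_used: u32,
    records_max: u32, // PROCESS_THREAD_LIMIT
    stack_pages_used: u32,
    stack_pages_max: u32,
}

impl ThreadLedger {
    /// Reserve one thread record plus a kernel-stack page budget before
    /// any allocation or handle publication happens.
    fn reserve(&mut self, stack_pages: u32) -> Result<(), &'static str> {
        if self.records_used + 1 > self.records_max {
            return Err("Overloaded: thread record limit");
        }
        if self.stack_pages_used + stack_pages > self.stack_pages_max {
            return Err("Overloaded: kernel stack page budget");
        }
        self.records_used += 1;
        self.stack_pages_used += stack_pages;
        Ok(())
    }

    /// Roll both reservations back after a later creation failure.
    fn rollback(&mut self, stack_pages: u32) {
        self.records_used -= 1;
        self.stack_pages_used -= stack_pages;
    }
}
```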

Creation failures are controlled application exceptions. Exhaustion of the
thread-record limit, the kernel-stack budget, or the handle cap slot, as well
as kernel-stack allocation failure, returns `Overloaded` with a specific
message and no partially runnable thread. An invalid entry, stack, FS base, or
flags value returns `Failed`.

Thread exit releases the kernel stack only after the scheduler is running on a
different kernel stack. The thread record remains charged while a live
`ThreadHandle`, pending join waiter, or unjoined exit status can still observe
it. Once the handle is released without a pending join, or once a one-shot join
has consumed the status and no wait record pins it, the retained record charge
is released. Process exit releases all thread records and stack charges once.

## FS Base And TLS

FS base is thread-owned. The existing `ThreadControl.getFsBase` and
`ThreadControl.setFsBase` operations keep their names, but after threading they
refer to the current thread, not the whole process. `setFsBase` continues to
reject non-user-canonical values and writes the CPU FS-base MSR immediately
when called by the running thread.

The initial process thread uses the PT_TLS block installed by ELF loading.
Additional threads receive an FS base from `ThreadSpawner.create`; the runtime
is responsible for allocating and initializing each thread's TLS/TCB data.
There is no process-global FS base. Current-thread FS-base operations are useful
for the single-thread runtime checkpoint, but they must not be treated as the
final threading ABI for language runtimes. True multi-threaded Go or
C/POSIX-like runtime support requires each `ThreadRef` to own a distinct TLS
block and FS base.

Context switching must save the outgoing thread's FS base and restore the next
thread's FS base even when both threads belong to the same process and no CR3
switch is needed.

## Thread Identity In Waiters And Dispatch

The concrete identity type for in-process scheduling is:

```rust
ThreadRef {
    pid,
    process_generation,
    tid,
    thread_generation,
}
```

Process identity still governs authority and accounting, but wakeup and
blocking state must name a thread. 7.2 changes context-aware capability
dispatch so `CapCallContext` carries both the caller process id for authority
checks and the caller `ThreadRef` for wake/cancel decisions. Existing pid-only
records that can resume execution or write a caller CQE must be widened before
multiple threads can run in one process.

The migration target is:

- `TimerSleepWaiter` stores the sleeping `ThreadRef` and validates the
  generation before waking it;
- endpoint CALL, RECV, RETURN target, deferred-cancel, current-caller, and
  direct IPC handoff records store the blocked or target `ThreadRef`;
- terminal line input and any other `ProcessWaiter` consumer store the waiting
  `ThreadRef` and validate the generation before writing a CQE;
- `ProcessHandle.wait` records the waiting `ThreadRef` while the handle still
  names the child process;
- `ThreadHandle.join` records the waiting `ThreadRef` and the target
  `ThreadRef`;
- the single process-ring `cap_enter` waiter is stored as `Option<ThreadRef>`;
- process-exit cleanup cancels every waiter whose `pid` and
  `process_generation` match the exiting process, regardless of thread id.

A generation mismatch on wake or completion is a stale waiter and must be
drained without writing to userspace. This mirrors current process-generation
behavior and prevents one thread slot reuse from receiving another thread's
Timer, endpoint, join, or ring completion.
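The generation check on wake can be sketched like this. The flat lookup table and function names are illustrative assumptions; the rule itself, drain on mismatch without touching userspace, comes from the contract:

```rust
// Minimal sketch of the stale-waiter check. The live table shape is an
// assumption; a real kernel would consult its process/thread slots.

#[derive(Clone, Copy, PartialEq, Debug)]
struct ThreadRef {
    pid: u32,
    process_generation: u64,
    tid: u32,
    thread_generation: u64,
}

/// Live (process_generation, thread_generation) for (pid, tid), or None
/// if the slot is currently free.
fn live_generations(
    table: &[(u32, u64, u32, u64)],
    pid: u32,
    tid: u32,
) -> Option<(u64, u64)> {
    table
        .iter()
        .find(|&&(p, _, t, _)| p == pid && t == tid)
        .map(|&(_, pg, _, tg)| (pg, tg))
}

/// True when the waiter may be woken; false means a stale waiter that
/// must be drained without writing a CQE.
fn may_wake(table: &[(u32, u64, u32, u64)], waiter: ThreadRef) -> bool {
    matches!(
        live_generations(table, waiter.pid, waiter.tid),
        Some((pg, tg)) if pg == waiter.process_generation
                       && tg == waiter.thread_generation
    )
}
```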

## Exit And Join

The current `exit(code)` syscall remains process exit. It terminates the whole
process, releases the shared capability table, cancels process-owned endpoint
state, removes all timer/park/ring waiters for every thread in the process,
and completes the parent-facing `ProcessHandle`.

Thread exit is separate and does not add a syscall. The initial implementation
adds `ThreadControl.exitThread(code)` as a terminal capability-ring operation
on the current thread. A successful invocation does not post a CQE back to the
exiting thread, because `cap_enter` will not return to that execution context.
It records the exit code, wakes or completes any valid join waiter, and removes
only the current thread from scheduling. If the last non-idle thread in a
process exits through `exitThread`, the process exits with that thread's code.

`ThreadHandle.join` is process-local and one-shot. If the target thread already
exited and its status is retained, join returns its code immediately and marks
the status joined. If it is still live, join blocks the caller's thread until
the target exits. Self-join returns `Failed`. A second waiter, join after a
successful join, or join after detach returns `Failed`; it must not park an
ambiguous waiter. `ThreadHandle.exitCode` is nonblocking and may observe the
retained status while the handle is live, but it does not consume the one-shot
join right.
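The one-shot join rules above can be modeled as a small state machine. The `JoinState`/`JoinResult` names and tid-based self-join check are assumptions for the sketch; the transitions mirror the contract text:

```rust
// Illustrative state machine for one-shot ThreadHandle.join.

#[derive(Debug, PartialEq)]
enum JoinState {
    Live { waiter: Option<u32> },      // target running; at most one waiter
    Exited { code: i64, joined: bool }, // retained status, consumed once
}

#[derive(Debug, PartialEq)]
enum JoinResult {
    Ready(i64), // retained status consumed immediately
    Blocked,    // caller parks until the target exits
    Failed,     // self-join, second waiter, or join after join
}

fn join(state: &mut JoinState, caller_tid: u32, target_tid: u32) -> JoinResult {
    if caller_tid == target_tid {
        return JoinResult::Failed; // self-join is rejected
    }
    match state {
        JoinState::Live { waiter } => {
            if waiter.is_some() {
                JoinResult::Failed // a second waiter must not park
            } else {
                *waiter = Some(caller_tid); // first waiter blocks until exit
                JoinResult::Blocked
            }
        }
        JoinState::Exited { code, joined } => {
            if *joined {
                JoinResult::Failed // join after a successful join
            } else {
                *joined = true; // one-shot: mark the status joined
                JoinResult::Ready(*code)
            }
        }
    }
}
```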

Releasing the last `ThreadHandle` before the target exits detaches the target:
the thread continues to run, but no exit status is retained after it exits
unless a join waiter already pins the state. Releasing the handle after exit
but before join drops the retained status and releases the thread-record
charge. A pending join waiter pins the handle state until completion or process
exit, so cap release cannot create a use-after-free. The exiting thread's
kernel stack must not be freed while it is still executing on that stack; final
drop follows the existing process-exit rule and happens after another kernel
stack is active.

Fatal user faults remain process-fatal in the first implementation. Per-thread
fault isolation can be designed later, after the basic scheduler and futex
paths are stable.

## Capability Ring And Blocking

The first threading implementation keeps one capability ring per process. The
runtime's single-owner ring-client invariant remains part of the contract:
well-formed userspace serializes ring submission and completion matching
through `capos-rt`.

The kernel must not admit multiple blocked `cap_enter` waiters on the same
process ring in 7.2. If a second thread in the same process asks to block in
`cap_enter` while another thread is already the process ring waiter, the kernel
returns the current available completion count without blocking that second
thread. This preserves the existing syscall return shape and forces the
runtime to retry or wait through a runtime-level mutex/park rather than
letting two threads race to consume the same CQEs. A thread blocked in
`Park` is separate from the process ring's `CapEnter` waiter; it must not
consume the one blocked ring-waiter slot.
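The one-blocked-ring-waiter rule can be sketched as an admission check. The `ProcRing` shape and outcome enum are assumed names; only the behavior, a second `cap_enter` observes the completion count instead of blocking, is contractual:

```rust
// Sketch of cap_enter admission under the single-ring-waiter rule.

struct ProcRing {
    ring_waiter: Option<u32>,    // tid of the one blocked cap_enter waiter
    completions_available: u32,  // CQEs currently consumable
}

#[derive(Debug, PartialEq)]
enum CapEnterOutcome {
    Blocked,          // this thread became the process ring waiter
    Completions(u32), // returned immediately with the CQ count
}

fn cap_enter_wait(ring: &mut ProcRing, tid: u32) -> CapEnterOutcome {
    if ring.completions_available > 0 {
        return CapEnterOutcome::Completions(ring.completions_available);
    }
    match ring.ring_waiter {
        None => {
            ring.ring_waiter = Some(tid); // first waiter may block
            CapEnterOutcome::Blocked
        }
        // Slot taken: preserve the syscall return shape, do not block.
        Some(_) => CapEnterOutcome::Completions(ring.completions_available),
    }
}
```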

This constraint avoids freezing a premature per-thread or sharded completion
queue ABI. A later runtime/ring milestone can add per-thread rings, completion
steering, or a process-level dispatcher thread if measurements show that the
single ring waiter is too restrictive.

The full-SMP target is now recorded in
[Ring v2 For Full SMP](../proposals/ring-v2-smp-proposal.md): each thread gets
its own complete SQ/CQ endpoint, and `cap_enter` waits on the current thread's
CQ rather than a shared process CQ. The current process-ring rule remains a
compatibility constraint for 7.2 and for any runtime reactor bridge built
before Ring v2.

## Park Handoff

Park authority is defined in [Park Authority](park.md). The scheduler
changes above must leave room for a thread block reason that is not tied to the
process ring CQ. The frozen handoff is:

- park wait blocks the current thread, not the whole process;
- park wake makes selected generation-checked `ThreadRef` values runnable;
- timeouts use the same monotonic time base as `Timer`;
- private park keys are based on address-space identity plus user virtual
  address;
- shared-memory park keys are MemoryObject-derived identity plus offset;
- the first implementation starts with compact `CAP_OP_PARK` and
  `CAP_OP_UNPARK` operations rather than generic Cap'n Proto methods;
- park wait SQEs are thread-owned so ring dispatch cannot park a sibling
  thread under the waiter's `user_data`;
- blocking park wait is a syscall-context operation that releases runtime
  ring-client ownership before the thread parks, while `capos-rt` demultiplexes
  reserved park CQEs back to the waiting thread.
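The two park-key families above can be sketched as one key type. The enum shape and id field types are assumptions for illustration:

```rust
// Illustrative park-key derivation: private keys are address-space
// identity plus user vaddr; shared keys are MemoryObject-derived
// identity plus offset. Field types are assumed for the sketch.

#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
enum ParkKey {
    Private { address_space_id: u64, vaddr: u64 },
    Shared { memory_object_id: u64, offset: u64 },
}
```

Keeping the two families as distinct variants means a private key can never collide with a shared key even when the raw numbers match.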

Pre-thread 4.5.4 measurement chose the compact capability-authorized shape for
failed wait and empty wake. 4.5.5 measured the real blocked/resume path through
`thread-lifecycle` under `make run-measure`, so the compact ParkSpace opcodes
remain the runtime ABI target for this slice.

## Security Invariants

- A thread never owns a separate capability table in the initial model.
- A thread cannot escape the authority of its containing process.
- A `ThreadHandle` names only a thread in the same process and is
  non-transferable in the first implementation.
- Thread creation is charged to one process-owned thread/kernel-stack ledger of
  record before the thread can become runnable.
- Process exit releases shared authority once, after all live threads are
  removed from scheduling.
- Per-process resource quotas are shared by all threads.
- `ThreadControl` changes only the current thread's FS base.
- `ThreadControl.exitThread` terminates only the current thread and is a
  capability-ring operation, not a syscall.
- Every waiter or direct handoff that can resume execution stores a
  generation-checked `ThreadRef`.
- Process-owned user-buffer validation/copy/read paths hold the process
  `AddressSpace` lock; future shared-memory thread primitives still need
  mapping provenance or object pins when they derive keys from shared backing.

## Implementation Order

1. Add internal `Thread` state, make each process own one initial thread, move
   saved context / kernel stack / FS base / block state onto that thread, and
   charge the initial thread against private process ledgers.
   Done 2026-04-24 23:09 UTC.
2. Change scheduler queues, blocking, exit cleanup, and direct IPC targets from
   pid-oriented state to thread references while preserving one thread per
   process.
   Done 2026-04-24 23:33 UTC.
3. Add `ThreadSpawner`, `ThreadHandle`, and `ThreadControl.exitThread` with a
   QEMU smoke for create, join, detach, self-join rejection, second join
   rejection, and last-thread process exit.
   Done 2026-04-25.
4. Implement the ParkSpace private wait/wake path from
   [Park Authority](park.md) after the scheduler can block and wake
   individual threads, then run 4.5.5 blocked/resume measurements before
   declaring the park ABI stable.
   Done 2026-04-25.

## Validation Plan

The first implementation smoke should create two threads in one process, prove
they share the address space and CapSet, prove each has an independent FS base,
join one thread from another, then let the last thread exit the process. The
existing `make run-spawn` path should keep covering `runtime-fs-base` and
`single-thread-runtime` so regressions in the pre-thread runtime contract stay
visible. `make run-measure` additionally records the private ParkSpace
blocked/resume timings and proves process exit with a parked park waiter.
