In-Process Threading Contract
This page records the implemented contract for kernel-managed threads inside
one process. The park authority contract is frozen separately in
Park Authority. These pages are the handoff from the initial
single-thread runtime checkpoint to same-process SMP work. The current slice
has per-thread completion rings for spawned child threads, per-CPU WFQ run
queues with bounded stealing, a caller-thread-bound SchedulingPolicyCap,
and a SchedulingContext cap that records identity, bind/revoke,
dispatcher budget charging/replenishment, bounded endpoint donation/return,
and fixed depletion/deadline notification cells. Same-process sibling
scheduling has formal accepted 1-to-2 evidence on capos-bench 2026-05-02
21:38 UTC against main commit 374f8556 (capOS work 1.883x / total
1.787x, both clearing the configured 1.6x gates; matching Linux pthread
baseline 1.988x/1.987x on the same physical-core pin set). The
2026-05-02 1-to-4 row was the diagnostic that justified Phase D’s fair-share
enqueue policy: capOS sat at 1.566x/1.538x while Linux scaled to
3.963x/3.858x. Phase D now runs per-CPU WFQ queues with bounded stealing
and manually accepted the 2026-05-10 1-to-4 diagnostic row
(3.088x/2.700x) while the harness-enforced gate remains 1-to-2
work/total speedup; see docs/benchmarks.md for the full evidence table
including historical pre-collapse rows. Phase F has landed the
one-SQ-consumer prerequisite, nohz telemetry, housekeeping/deferred-work
placement, the clockevent/deadline substrate, and bounded SQPOLL ring mode
including the non-periodic SQPOLL producer-wake progress path; the first
automatic nohz activation increment is closed via
docs/tasks/done/2026/scheduler-phase-f-auto-nohz-activation.md and
SQPOLL-driven auto-nohz activation is also closed via
docs/tasks/done/2026/scheduler-phase-f-auto-nohz-sqpoll.md; generic
full-nohz for ordinary budgeted compute leases and timeout-based auto-revoke are
landed; policy-service AutoNoHz issuance remains future work.
Scope
The threading milestone changes the scheduler’s unit of execution from process
to thread while keeping the process as the authority, address-space, and
resource-accounting boundary. Same-process sibling scheduling on multiple CPUs
is functional for per-thread-ring processes. The accepted 1-to-2 performance
claim is now the formal capos-bench 5-run pair recorded on 2026-05-02
21:38 UTC against main commit 374f8556: capOS work 1.883x and total
1.787x clear the configured 1.6x gates; the matching Linux pthread
baseline on the same physical-core pin set (0,1,2,3) records
1.988x/1.987x, validating the workload shape. The 2026-05-02 1-to-4 row
was the diagnostic that justified Phase D: capOS sat at 1.566x/1.538x
while Linux scaled to 3.963x/3.858x. Phase D now runs per-CPU WFQ queues
with bounded stealing and its 2026-05-10 1-to-4 row (3.088x/2.700x) was
manually accepted from recorded diagnostics; the harness-enforced gate remains
1-to-2 work/total speedup. Historical pre-collapse rows and the post-collapse
3-run diagnostic remain in docs/benchmarks.md for reference. Phase E adds
the SchedulingContext cap (identity, caller-thread bind, revoke, budget
charging/replenishment, bounded synchronous endpoint donation/return, and
fixed depletion/deadline notification cells with drain observer results),
and Phase F has landed the bounded SQPOLL ring mode plus the
clockevent/deadline substrate. Policy-service AutoNoHz issuance, realtime
admission, and privileged userspace scheduler-policy services remain later
work.
This contract covers:
- process-owned versus thread-owned state;
- the initial thread creation ABI;
- per-thread FS-base/TLS rules;
- thread exit and join semantics;
- the per-thread ring blocking and completion-routing contract;
- the caller-thread-bound SchedulingPolicyCap and SchedulingContext surfaces that mutate per-thread WFQ weight/latency-class and per-thread scheduling-context binding;
- the handoff to the 7.1.1 park authority design.
Ownership Split
The process remains the security boundary. All threads in one process share the same address space and capability table, so a thread has the same authority as its sibling threads.
| Process-owned state | Thread-owned state |
|---|---|
| Process id and process generation | Thread id and thread generation |
| User address space and CR3 | Saved CPU context and user register state |
| Capability table and resource ledger | Kernel stack and syscall stack top |
| Initial compatibility ring and ring arena ownership | Per-thread ring endpoint, scratch, and FS base |
| Read-only CapSet page | Scheduling/blocking state |
| ProcessHandle exit state | ThreadHandle join/exit state |
| Endpoint owner state and process-wide cleanup hooks | WFQ weight, latency class, virtual runtime, and virtual_finish_ns enqueue tag |
| Process-wide resource ledgers (thread records, kernel stacks, cap-table slots) | SchedulingContext binding (identity/generation, remaining budget, replenish/deadline timestamps, donation/return slot, notification recorder) |
The implementation migrated incrementally. The 7.2.0 slice made each process
contain a single initial Thread, with saved context, kernel stack, FS base,
and blocking state stored on that thread. Later slices changed scheduler-owned
queues, current execution, direct IPC handoff, and wake records to
generation-checked ThreadRef values, added creation and lifecycle caps, and
then assigned per-thread rings to spawned children.
Scheduler Contract
The scheduler stores runnable execution contexts as thread references, not
process ids. A thread reference is
(pid, process_generation, tid, thread_generation). The process generation
keeps handles from naming a reused process; the thread generation keeps
handles from naming a reused thread slot inside a live process.
This identity applies to Scheduler.current, run queues, direct IPC targets,
Timer sleep waiters, process/terminal waiters, endpoint caller/receiver wake
records, and deferred cancellation state.
Runnable ownership is split across per-CPU run queues
(SCHEDULER_CPUS = 4). Each queue is ordered ascending by
virtual_finish_ns, which is recomputed per enqueue from
virtual_runtime_ns, the thread’s WFQ weight (clamped to
[MIN_WEIGHT, MAX_WEIGHT] in capos-abi::scheduler), and a per-class
slice scaled by LatencyClass (Interactive divides the slice,
Batch multiplies it, Normal/IpcServer pass it through). Default
placement targets the current CPU; a bounded steal path balances when a
CPU’s local queue is empty, recomputes the WFQ tag at the destination,
and records placement-spread / steal migrations under the measure
feature. Each per-CPU queue is reserved at thread-create time to the live
runnable-capable thread count so timer-tick, unblock, direct-IPC fallback,
and steal-requeue paths never allocate.
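The queue ordering and bounded steal described above can be sketched as follows. This is a minimal illustration, not the kernel's code: `RunQueues`, `Picked`, and `MAX_STEAL_PROBES` are hypothetical names, the probe order is an assumption, and the `BTreeSet` allocates where the real queues are pre-reserved. Recomputing the WFQ tag at the destination is left to the caller.

```rust
use std::collections::BTreeSet;

const SCHEDULER_CPUS: usize = 4;
/// Hypothetical bound on how many other CPUs one steal attempt may probe.
const MAX_STEAL_PROBES: usize = 3;

/// Which thread runs next and, if it was stolen, from which CPU.
#[derive(Debug, PartialEq, Eq)]
struct Picked {
    tid: u32,
    stolen_from: Option<usize>,
}

/// One ordered queue per CPU. Entries are (virtual_finish_ns, tid), so the
/// smallest finish tag sorts first.
struct RunQueues {
    queues: [BTreeSet<(u64, u32)>; SCHEDULER_CPUS],
}

impl RunQueues {
    fn new() -> Self {
        RunQueues { queues: Default::default() }
    }

    fn enqueue(&mut self, cpu: usize, virtual_finish_ns: u64, tid: u32) {
        self.queues[cpu].insert((virtual_finish_ns, tid));
    }

    /// Prefer the local queue; only when it is empty, probe a bounded number
    /// of other CPUs and take the entry with the smallest finish tag.
    fn pick_next(&mut self, cpu: usize) -> Option<Picked> {
        if let Some((_, tid)) = self.queues[cpu].pop_first() {
            return Some(Picked { tid, stolen_from: None });
        }
        for step in 1..=MAX_STEAL_PROBES {
            let victim = (cpu + step) % SCHEDULER_CPUS;
            if let Some((_, tid)) = self.queues[victim].pop_first() {
                return Some(Picked { tid, stolen_from: Some(victim) });
            }
        }
        None
    }
}
```

A CPU with local work never steals in this sketch, which keeps the steal path off the common dispatch path.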
The run queue, current, direct IPC target, and blocked waiter scans are
thread-oriented. Address-space switches happen only when the next runnable
thread belongs to a different process. TSS.RSP0, the syscall kernel stack, and
FS base are updated on every thread switch because those are thread-local
machine resources. Per-thread runtime_ns advances 1:1 with elapsed CPU
time; virtual_runtime_ns advances by
elapsed_ns * REFERENCE_WEIGHT / weight so weight changes the cumulative
WFQ share rather than just an enqueue tie-breaker.
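The accounting rule above can be illustrated with a small sketch. The constant values, the class scaling factors, and the finish-tag formula (virtual runtime plus a weighted class slice) are assumptions made for the example; only the `elapsed_ns * REFERENCE_WEIGHT / weight` advance is taken from this page.

```rust
// Hypothetical values; the real bounds live in capos-abi::scheduler.
const REFERENCE_WEIGHT: u64 = 1024;
const MIN_WEIGHT: u64 = 1;
const MAX_WEIGHT: u64 = 10_000;
const BASE_SLICE_NS: u64 = 4_000_000;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum LatencyClass {
    Normal,
    Interactive,
    IpcServer,
    Batch,
}

/// Virtual time charged for `elapsed_ns` of real CPU time at a given weight.
fn virtual_advance_ns(elapsed_ns: u64, weight: u64) -> u64 {
    let w = weight.clamp(MIN_WEIGHT, MAX_WEIGHT);
    // Widen to u128 so the multiplication cannot overflow.
    ((elapsed_ns as u128 * REFERENCE_WEIGHT as u128) / w as u128) as u64
}

/// Per-class slice: Interactive divides, Batch multiplies (factor 2 is assumed).
fn class_slice_ns(class: LatencyClass) -> u64 {
    match class {
        LatencyClass::Interactive => BASE_SLICE_NS / 2,
        LatencyClass::Batch => BASE_SLICE_NS * 2,
        LatencyClass::Normal | LatencyClass::IpcServer => BASE_SLICE_NS,
    }
}

/// Assumed enqueue tag: current virtual runtime plus one weighted class slice.
fn virtual_finish_ns(virtual_runtime_ns: u64, weight: u64, class: LatencyClass) -> u64 {
    virtual_runtime_ns.saturating_add(virtual_advance_ns(class_slice_ns(class), weight))
}
```

Under these assumptions a thread at twice the reference weight accrues virtual runtime at half speed, so over time it receives roughly twice the CPU share of a reference-weight sibling.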
SchedulingContext bindings layer dispatcher budget on top of WFQ. A
thread may carry at most one SchedulingContextThreadBinding. While
bound, the dispatcher charges elapsed time against the binding’s
remaining_budget_ns, replenishes from period_ns at the next replenish
boundary, records deadline_or_timeout and budget_depleted
notifications in the per-context fixed cells, and routes synchronous
endpoint donation/return for passive receiver threads (donated_holder
in the notification snapshot tracks whether the holder is the donor or
the receiver). Stale-generation or revoked caps fail closed before
mutating scheduler state. Realtime-island admission, CPU placement
enforcement, and overrun-fault policy remain deferred.
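A minimal sketch of budget charging and replenishment follows. The `Binding` layout, the separate `budget_ns` field, and the policy of skipping whole missed periods are assumptions for illustration; this page only states that elapsed time is charged against `remaining_budget_ns`, that replenishment happens at a period boundary, and that depletion is recorded in a fixed notification cell.

```rust
/// Stand-in for the fixed per-context notification cell.
#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
struct Notifications {
    budget_depleted: bool,
}

/// Hypothetical per-thread binding state.
#[derive(Debug, Clone, Copy)]
struct Binding {
    budget_ns: u64,
    period_ns: u64,
    remaining_budget_ns: u64,
    next_replenish_ns: u64,
}

impl Binding {
    fn new(budget_ns: u64, period_ns: u64, now_ns: u64) -> Self {
        Binding {
            budget_ns,
            period_ns,
            remaining_budget_ns: budget_ns,
            next_replenish_ns: now_ns.saturating_add(period_ns),
        }
    }

    /// Replenish if a boundary has passed, then charge `elapsed_ns`.
    /// Returns true while budget remains.
    fn charge(&mut self, now_ns: u64, elapsed_ns: u64, cell: &mut Notifications) -> bool {
        if self.period_ns > 0 && now_ns >= self.next_replenish_ns {
            // Skip every period boundary that has already passed.
            let missed = (now_ns - self.next_replenish_ns) / self.period_ns + 1;
            self.next_replenish_ns = self
                .next_replenish_ns
                .saturating_add(missed.saturating_mul(self.period_ns));
            self.remaining_budget_ns = self.budget_ns;
        }
        self.remaining_budget_ns = self.remaining_budget_ns.saturating_sub(elapsed_ns);
        if self.remaining_budget_ns == 0 {
            // Written in place; no allocation on this path.
            cell.budget_depleted = true;
        }
        self.remaining_budget_ns > 0
    }
}
```

The notification is a flag write into existing storage, mirroring the requirement that hard paths update the cells without allocating.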
The idle path is a per-CPU CPL0 (kernel-mode) idle thread; the former
special user-mode idle process has been removed. Each CPU’s idle thread is a
kernel-owned execution context — it runs on the kernel PML4 with a dedicated
idle kernel stack and cannot block, exit, or hold ordinary caps. A lightweight
synthetic idle Process record is retained per CPU only so the idle
ThreadRef resolves through scheduler bookkeeping; it maps no user code,
stack, or cap ring. See the “Idle paths” section of
docs/architecture/scheduling.md.
Phase F has landed the one-SQ-consumer prerequisite, nohz telemetry,
housekeeping/deferred-work placement, the clockevent/deadline substrate,
and a bounded SQPOLL ring-mode worker (MAX_SQPOLL_WORKERS = 16,
request_sqpoll_start_for_thread / finalize_pending_sqpoll_start_for_thread
with stale-owner rollback). Tick suppression now exists behind explicit
CpuIsolationLease admission, including ordinary budgeted compute leases that
target a live SchedulingContext; policy-service AutoNoHz issuance and generic
SQPOLL nohz for arbitrary rings remain future work.
Thread Creation ABI
Thread creation is exposed through a process-local ThreadSpawner capability.
It creates threads only in the caller’s current process. It does not grant
authority to another process and is non-transferable across IPC in the initial
implementation.
The initial control-plane shape is:
interface ThreadSpawner {
  create @0 (
    entry :UInt64,
    stackTop :UInt64,
    arg :UInt64,
    fsBase :UInt64,
    flags :UInt64
  ) -> (handleIndex :UInt16);
}
interface ThreadHandle {
  join @0 () -> (exitCode :Int64);
  exitCode @1 () -> (exited :Bool, exitCode :Int64);
}
interface ThreadControl {
  getFsBase @0 () -> (fsBase :UInt64);
  setFsBase @1 (fsBase :UInt64) -> ();
  exitThread @2 (code :Int64) -> ();
}
Any 7.2 schema adjustment must update this page in the same branch before
implementation review. The stable semantics are that creation is in-process,
the returned handle is an observed result cap, ThreadHandle observes one
thread rather than the whole process, and current-thread exit is available
through both ThreadControl.exitThread and the raw exit(code) syscall.
The new thread starts in Ring 3 at entry with:
- RDI = arg;
- RSI = tid;
- RDX = pid;
- RCX = the current thread's ring address;
- R8 = CAPSET_VADDR, or zero if the process has no CapSet.
The runtime supplies the user stack and TLS block. The kernel validates that
entry, stackTop, and fsBase are user-canonical, that stackTop is
16-byte aligned at entry, and that reserved flags bits are zero. Page
presence and stack-growth policy remain process address-space questions;
before a page-fault subsystem exists, an invalid thread stack can fault the
process.
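The argument checks listed above might look like the following sketch. `USER_CANONICAL_END`, `KNOWN_FLAGS`, `CreateError`, and `validate_create` are hypothetical names; the lower-half bound is the x86-64 4-level-paging limit, and whether capOS exposes that entire range to userspace is an assumption.

```rust
/// Exclusive upper bound of lower-half canonical addresses on x86-64 with
/// 4-level paging (assumed here to be the user range).
const USER_CANONICAL_END: u64 = 0x0000_8000_0000_0000;
/// Hypothetical: no flag bits are defined yet, so every bit is reserved.
const KNOWN_FLAGS: u64 = 0;

#[derive(Debug, PartialEq, Eq)]
enum CreateError {
    BadEntry,
    BadStack,
    BadFsBase,
    ReservedFlags,
}

fn is_user_canonical(addr: u64) -> bool {
    addr < USER_CANONICAL_END
}

/// Checks only the ABI fields; page presence is deliberately not examined.
fn validate_create(entry: u64, stack_top: u64, fs_base: u64, flags: u64) -> Result<(), CreateError> {
    if !is_user_canonical(entry) {
        return Err(CreateError::BadEntry);
    }
    if !is_user_canonical(stack_top) || stack_top % 16 != 0 {
        return Err(CreateError::BadStack);
    }
    if !is_user_canonical(fs_base) {
        return Err(CreateError::BadFsBase);
    }
    if flags & !KNOWN_FLAGS != 0 {
        return Err(CreateError::ReservedFlags);
    }
    Ok(())
}
```

The sketch never touches the pages behind the addresses, matching the statement that an unmapped stack can still fault the process later.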
Resource Accounting
Thread creation allocates kernel memory and is quota-backed by process-owned
ledger state, not per-capability helper counters. The 7.2.0 checkpoint charges
the initial thread during process creation; ThreadSpawner.create extends the
same ledgers to additional threads. The ledger of record is:
- PROCESS_THREAD_LIMIT, the maximum live or retained thread records in one process, initially 16;
- PROCESS_THREAD_KERNEL_STACK_PAGES, initially matching the current per-thread kernel stack allocation size of 32 pages;
- thread_records_used / thread_records_max;
- thread_kernel_stack_pages_used / thread_kernel_stack_pages_max.
The initial process thread charges one thread record and one kernel-stack
allocation during process creation. ThreadSpawner.create reserves a thread
record and kernel-stack page budget before allocating the stack or publishing a
ThreadHandle; every later failure rolls both reservations back before
returning. Cap-slot reservation for the result handle remains charged to the
existing process cap-table ledger.
Creation failures are controlled application exceptions. Thread count,
kernel-stack budget, handle cap-slot exhaustion, and kernel stack allocation
failure return Overloaded with a specific message and no partially runnable
thread. Invalid entry, stack, FS base, or flags return Failed.
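The reserve-then-rollback discipline can be sketched as below. `Ledger`, `CreateFailure`, `create_thread`, and the message strings are hypothetical; the real ledger fields are the ones listed above, and the stack allocator is modeled as a closure so that the failure path can be exercised.

```rust
/// Per-thread kernel stack size in pages, as stated on this page.
const STACK_PAGES_PER_THREAD: u32 = 32;

#[derive(Debug, PartialEq, Eq)]
enum CreateFailure {
    Overloaded(&'static str),
}

struct Ledger {
    thread_records_used: u32,
    thread_records_max: u32,
    stack_pages_used: u32,
    stack_pages_max: u32,
}

impl Ledger {
    fn with_limit(max_threads: u32) -> Self {
        Ledger {
            thread_records_used: 0,
            thread_records_max: max_threads,
            stack_pages_used: 0,
            stack_pages_max: max_threads * STACK_PAGES_PER_THREAD,
        }
    }

    /// Reserve both charges, or neither.
    fn reserve_thread(&mut self) -> Result<(), CreateFailure> {
        if self.thread_records_used >= self.thread_records_max {
            return Err(CreateFailure::Overloaded("thread record limit reached"));
        }
        if self.stack_pages_used + STACK_PAGES_PER_THREAD > self.stack_pages_max {
            return Err(CreateFailure::Overloaded("kernel stack budget exhausted"));
        }
        self.thread_records_used += 1;
        self.stack_pages_used += STACK_PAGES_PER_THREAD;
        Ok(())
    }

    fn rollback_thread(&mut self) {
        self.thread_records_used -= 1;
        self.stack_pages_used -= STACK_PAGES_PER_THREAD;
    }
}

/// Reserve first, allocate second, and undo the reservation on failure.
fn create_thread(ledger: &mut Ledger, alloc_stack: impl FnOnce() -> bool) -> Result<(), CreateFailure> {
    ledger.reserve_thread()?;
    if !alloc_stack() {
        ledger.rollback_thread();
        return Err(CreateFailure::Overloaded("kernel stack allocation failed"));
    }
    Ok(())
}
```

Because the reservation happens before any allocation, a failed create leaves the counters exactly where they started.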
Thread exit releases the kernel stack only after the scheduler is running on a
different kernel stack. The thread record remains charged while a live
ThreadHandle, pending join waiter, or unjoined exit status can still observe
it. Once the handle is released without a pending join, or once a one-shot join
has consumed the status and no wait record pins it, the retained record charge
is released. Process exit releases all thread records and stack charges once.
The off-stack property is enforced by an OffStackToken witness on every stack
frame release path: the deferred per-thread drain calls
Process::release_thread_kernel_stack, whole-process teardown calls
Process::release_all_thread_kernel_stacks, and pre-publication rollback calls
Process::rollback_created_thread. The token constructor is private to the
scheduler module. Implicit Thread::Drop is deliberately not a release path;
if a Thread value reaches its destructor with a nonzero stack, it fails
closed by leaving the frames allocated instead of freeing a stack without an
off-stack witness.
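The witness pattern can be sketched in a few lines. Only the `OffStackToken` name comes from this page; the `scheduler::after_stack_switch` constructor, the `Thread` layout, and the `LEAKED_PAGES` counter are hypothetical stand-ins that make the fail-closed destructor observable.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Pages deliberately left allocated by the fail-closed destructor.
static LEAKED_PAGES: AtomicU32 = AtomicU32::new(0);

mod scheduler {
    /// Witness that the caller is no longer running on the stack being freed.
    /// The private field keeps code outside this module from constructing it.
    pub struct OffStackToken {
        _private: (),
    }

    /// Hypothetical mint point, reached only after switching kernel stacks.
    pub fn after_stack_switch() -> OffStackToken {
        OffStackToken { _private: () }
    }
}

struct Thread {
    kernel_stack_pages: u32,
}

impl Thread {
    /// The only release path: it cannot be called without a witness.
    fn release_kernel_stack(&mut self, _witness: &scheduler::OffStackToken) -> u32 {
        std::mem::take(&mut self.kernel_stack_pages)
    }
}

impl Drop for Thread {
    fn drop(&mut self) {
        // Fail closed: a stack that was never released with a witness is
        // leaked (and counted here) rather than freed.
        if self.kernel_stack_pages != 0 {
            LEAKED_PAGES.fetch_add(self.kernel_stack_pages, Ordering::Relaxed);
        }
    }
}
```

The type system carries the proof: any function that frees stack frames must name the token in its signature, so a missing witness is a compile error rather than a runtime bug.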
FS Base And TLS
FS base is thread-owned. The existing ThreadControl.getFsBase and
ThreadControl.setFsBase operations keep their names, but after threading they
refer to the current thread, not the whole process. setFsBase continues to
reject non-user-canonical values and writes the CPU FS-base MSR immediately
when called by the running thread. Both methods route through
context-aware dispatch (CapCallContext::caller_thread) so the
operation always targets the caller, never a different thread; calling
ThreadControl from a non-live caller returns
ProcessFsBaseError::CallerNotLive.
The initial process thread uses the PT_TLS block installed by ELF loading.
Additional threads receive an FS base from ThreadSpawner.create; the runtime
is responsible for allocating and initializing each thread’s TLS/TCB data.
There is no process-global FS base. Current-thread FS-base operations are useful
for the single-thread runtime checkpoint, but they must not be treated as the
final threading ABI for language runtimes. True multi-threaded Go or
C/POSIX-like runtime support requires each ThreadRef to own a distinct TLS
block and FS base.
Context switching must save the outgoing thread’s FS base and restore the next thread’s FS base even when both threads belong to the same process and no CR3 switch is needed.
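The switch rule can be summarized as a small decision function. This is a sketch: `SwitchPlan` and `plan_switch` are hypothetical names and the `ThreadRef` field widths are assumed. It encodes only what this page states: CR3 changes when the process changes, while FS base and the kernel stack change on every thread switch.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct ThreadRef {
    pid: u32,
    process_generation: u64,
    tid: u32,
    thread_generation: u64,
}

/// Which machine state must be reloaded when switching from `prev` to `next`.
#[derive(Debug, PartialEq, Eq)]
struct SwitchPlan {
    switch_address_space: bool,
    restore_fs_base: bool,
    update_kernel_stack: bool,
}

fn plan_switch(prev: ThreadRef, next: ThreadRef) -> SwitchPlan {
    let same_process =
        prev.pid == next.pid && prev.process_generation == next.process_generation;
    let same_thread = prev == next;
    SwitchPlan {
        // CR3 is process-owned: reload only when the process changes.
        switch_address_space: !same_process,
        // FS base and the kernel stack are thread-owned: reload on every
        // switch to a different thread, including a same-process sibling.
        restore_fs_base: !same_thread,
        update_kernel_stack: !same_thread,
    }
}
```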
Thread Identity In Waiters And Dispatch
The concrete identity type for in-process scheduling is:
ThreadRef {
    pid,
    process_generation,
    tid,
    thread_generation,
}
Process identity still governs authority and accounting, but wakeup and
blocking state must name a thread. 7.2 changes context-aware capability
dispatch so CapCallContext carries both the caller process id for authority
checks and the caller ThreadRef for wake/cancel decisions. Existing pid-only
records that can resume execution or write a caller CQE must be widened before
multiple threads can run in one process.
The migration target is:
- TimerSleepWaiter stores the sleeping ThreadRef and validates the generation before waking it;
- endpoint CALL, RECV, RETURN target, deferred-cancel, current-caller, and direct IPC handoff records store the blocked or target ThreadRef;
- terminal line input and any other ProcessWaiter consumer store the waiting ThreadRef and validate the generation before writing a CQE;
- ProcessHandle.wait records the waiting ThreadRef while the handle still names the child process;
- ThreadHandle.join records the waiting ThreadRef and the target ThreadRef;
- cap_enter blocks the current ThreadRef on that thread’s ring endpoint;
- process-exit cleanup cancels every waiter whose pid and process_generation match the exiting process, regardless of thread id.
A generation mismatch on wake or completion is a stale waiter and must be drained without writing to userspace. This mirrors current process-generation behavior and prevents a reused thread slot from receiving another thread’s Timer, endpoint, join, or ring completion.
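The stale-waiter rule and the process-exit sweep can be sketched as follows. `ThreadTable`, `WakeOutcome`, `complete_waiter`, and `cancel_for_process_exit` are hypothetical names, the `ThreadRef` field widths are assumed, and a `Vec` stands in for the target thread's completion queue.

```rust
use std::collections::HashMap;

#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
struct ThreadRef {
    pid: u32,
    process_generation: u64,
    tid: u32,
    thread_generation: u64,
}

/// Live thread slots: (pid, tid) -> current (process_generation, thread_generation).
struct ThreadTable {
    live: HashMap<(u32, u32), (u64, u64)>,
}

impl ThreadTable {
    fn is_current(&self, r: ThreadRef) -> bool {
        self.live.get(&(r.pid, r.tid)) == Some(&(r.process_generation, r.thread_generation))
    }
}

#[derive(Debug, PartialEq, Eq)]
enum WakeOutcome {
    Delivered,
    StaleDrained,
}

/// Deliver a completion only if the waiter still names the live thread.
fn complete_waiter(
    table: &ThreadTable,
    waiter: ThreadRef,
    user_data: u64,
    cq: &mut Vec<(ThreadRef, u64)>,
) -> WakeOutcome {
    if !table.is_current(waiter) {
        // Stale: drop the record without writing anything to userspace.
        return WakeOutcome::StaleDrained;
    }
    cq.push((waiter, user_data));
    WakeOutcome::Delivered
}

/// Process exit cancels every waiter of that process, whatever its thread id.
fn cancel_for_process_exit(waiters: &mut Vec<ThreadRef>, pid: u32, process_generation: u64) -> usize {
    let before = waiters.len();
    waiters.retain(|w| !(w.pid == pid && w.process_generation == process_generation));
    before - waiters.len()
}
```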
Exit And Join
The current exit(code) syscall terminates the current thread. This preserves
single-thread process exit because the process exits when its last non-idle
thread exits, and it avoids tearing down a shared address space while sibling
threads are still current on other CPUs.
Thread exit does not add a new syscall. The initial implementation added
ThreadControl.exitThread(code) as a terminal capability-ring operation on
the current thread, with the same current-thread termination semantics as the
raw syscall. A successful invocation does not post a CQE back to the exiting
thread, because cap_enter will not return to that execution context. It
records the exit code, wakes or completes any valid join waiter, and removes
only the current thread from scheduling. If the last non-idle thread in a
process exits through exit(code) or exitThread, the process exits with that
thread’s code and completes the parent-facing ProcessHandle.
Whole-process termination remains a ProcessHandle operation. It releases the
shared capability table, cancels process-owned endpoint state, removes
timer/park/ring waiters for every thread in the process, and completes the
parent-facing ProcessHandle after the process is no longer current on any
CPU.
ThreadHandle.join is process-local and one-shot. If the target thread already
exited and its status is retained, join returns its code immediately and marks
the status joined. If it is still live, join blocks the caller’s thread until
the target exits. Self-join returns Failed. A second waiter, join after a
successful join, or join after detach returns Failed; it must not park an
ambiguous waiter. ThreadHandle.exitCode is nonblocking and may observe the
retained status while the handle is live, but it does not consume the one-shot
join right.
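The one-shot join rules above can be modeled as a small state machine. The `Target` and `JoinResult` types are hypothetical, threads are identified by bare ids for brevity, and reporting no status after a successful join is an assumption, since this page does not say whether a joined status stays observable.

```rust
#[derive(Debug, PartialEq, Eq)]
enum JoinResult {
    Exited(i64),
    Blocked,
    Failed,
}

/// Hypothetical per-handle state for the target thread.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Target {
    Live { waiter: Option<u32> },
    Exited { code: i64 },
    Joined,
    Detached,
}

/// One-shot join: at most one waiter, at most one successful join.
fn join(target: &mut Target, caller_tid: u32, target_tid: u32) -> JoinResult {
    if caller_tid == target_tid {
        return JoinResult::Failed; // self-join
    }
    match *target {
        Target::Exited { code } => {
            *target = Target::Joined; // consume the one-shot status
            JoinResult::Exited(code)
        }
        Target::Live { waiter: None } => {
            *target = Target::Live { waiter: Some(caller_tid) };
            JoinResult::Blocked
        }
        // Second waiter, join after join, or join after detach.
        Target::Live { waiter: Some(_) } | Target::Joined | Target::Detached => JoinResult::Failed,
    }
}

/// Nonblocking observer; does not consume the join right.
fn exit_code(target: &Target) -> Option<i64> {
    match *target {
        Target::Exited { code } => Some(code),
        _ => None, // assumption: nothing is reported once joined or detached
    }
}
```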
Releasing the last ThreadHandle before the target exits detaches the target:
the thread continues to run, but no exit status is retained after it exits
unless a join waiter already pins the state. Releasing the handle after exit
but before join drops the retained status and releases the thread-record
charge. A pending join waiter pins the handle state until completion or process
exit, so cap release cannot create a use-after-free. The exiting thread’s
kernel stack must not be freed while it is still executing on that stack; final
process teardown performs an explicit token-gated stack release after another
kernel stack is active, before the deferred Process value is dropped.
Fatal user faults remain process-fatal in the first implementation. Per-thread fault isolation can be designed later, after the basic scheduler and futex paths are stable.
Capability Ring And Blocking
The first Ring v2 implementation keeps the initial thread’s compatibility
ring at RING_VADDR and gives each spawned child thread a kernel-chosen ring
mapping inside the reserved process ring arena. Runtime-selected ring address
ranges remain a later VirtualMemory reservation extension.
ThreadSpawner.create allocates a ring record and user mapping for the new
thread, stores that mapping on the child ThreadRef, and passes the ring
address in the child start registers. cap_enter blocks the current thread
against that thread’s own CQ, so same-process sibling threads may block in
cap_enter independently. Timer, endpoint, join, park, and cancellation paths
must route completions by generation-checked ThreadRef to the target
thread’s ring endpoint.
The runtime’s single-owner ring-client invariant remains local to each ring
client. Well-formed userspace serializes submission and completion matching per
thread ring through capos-rt; it must not have two consumers racing on the
same SQ/CQ. The scheduler still refuses to run the exact same ThreadRef on
two CPUs at once, but it no longer treats every multithreaded pid as tied to
one scheduler CPU.
This is sufficient for functional same-process sibling scheduling. The formal
accepted 1-to-2 make run-thread-scale capOS evidence is the capos-bench
2026-05-02 21:38 UTC pair (work 1.883x, total 1.787x, both clearing the
configured 1.6x gates). The guest result row’s accepted field remains
diagnostic; the host summary enforces the work-window and total-time gates, and
refuses speedup enforcement unless CAPOS_THREAD_SCALE_QEMU_TASKSET_CPUS
records the QEMU CPU pin set. Linux validates the repaired benchmark shape
through four workers on physical cores (3.963x/3.858x). The 2026-05-02 capOS
4-worker row was diagnostic (1.566x/1.538x) and justified Phase D’s
per-CPU WFQ queues plus bounded stealing. The 2026-05-10 Phase D rerun
recorded 1-to-4 work/total diagnostics 3.088x/2.700x, manually accepted
for closeout; remaining risks are the shared scheduler lock, temporary CPU
pinning, CQ/join/exit/block/schedule overhead, broader workload classes, and
higher-thread-count evidence.
Scheduling Policy And Context Authority
SchedulingPolicyCap is the caller-thread-bound surface for WFQ knobs.
Every method routes through CapCallContext::caller_thread; there is no
per-cap-object ThreadHandle, no badge-encoded thread id, and no
cross-thread mutation in this slice. Cross-thread authority is deferred to
the privileged scheduler-policy service plan. The schema shape is:
interface SchedulingPolicyCap {
  setWeight @0 (weight :UInt16) -> ();
  setLatencyClass @1 (class :LatencyClass) -> ();
  snapshot @2 () -> (
    weight :UInt16,
    class :LatencyClass,
    runtimeNs :UInt64,
    virtualRuntimeNs :UInt64,
  );
}
setWeight validates against [MIN_WEIGHT, MAX_WEIGHT] at the cap
boundary and updates the caller thread’s WFQ weight; the new weight
applies to the next enqueue’s virtual_finish_ns tag and to subsequent
virtual_runtime_ns accounting. setLatencyClass swaps the per-thread
LatencyClass (Normal, Interactive, IpcServer, Batch) used to
scale the dispatcher slice. snapshot is a read-only observer over the
core WFQ state and does not expose the measure-only counters.
SchedulingContext is the schema-typed cap for dispatcher budget
authority:
interface SchedulingContext {
  info @0 () -> (info :SchedulingContextInfo);
  create @1 (spec :SchedulingContextSpec) -> (
    contextIndex :UInt16,
    identity :SchedulingContextIdentity,
    result :SchedulingContextOperationResult,
    dispatchEffect :SchedulingContextDispatchEffect,
  );
  bindCallerThread @2 () -> (
    identity :SchedulingContextIdentity,
    binding :SchedulingContextBinding,
    result :SchedulingContextOperationResult,
    dispatchEffect :SchedulingContextDispatchEffect,
  );
  revoke @3 () -> (
    identity :SchedulingContextIdentity,
    previousGeneration :UInt64,
    result :SchedulingContextOperationResult,
    dispatchEffect :SchedulingContextDispatchEffect,
  );
  drainNotifications @4 () -> (
    notifications :SchedulingContextNotificationSnapshot,
  );
}
create returns a same-interface child context as transferred result
cap 0 and becomes chargeable only after bindCallerThread. revoke
bumps the generation and clears any matching thread binding; later calls
through the stale cap generation report staleGeneration or fail closed
before mutating scheduler state. drainNotifications reads the fixed
per-context budget-depleted and deadline-or-timeout slots; the
scheduler updates these in place from hard paths without allocation,
including the holder identity and a donatedHolder bit for endpoint
donation/return. The bootstrap manifest grants SchedulingPolicyCap and
SchedulingContext only to focused-proof manifests; the default boot
manifest does not grant them.
Userspace API Surface
The capos-rt runtime exposes the threading caps as typed clients on top
of the per-thread ring:
- ThreadControlClient – get_fs_base / set_fs_base / exit_thread, including *_wait blocking variants over RuntimeRingClient.
- ThreadSpawnerClient::create – submits the entry / stackTop / arg / fsBase / flags ABI and returns an OwnedCapability<ThreadHandle> delivered as transferred result cap 0 in the CQE.
- ThreadHandleClient – join, exit_code (nonblocking observer), and their finish_* helpers; finish_join decodes the one-shot exit code.
- SchedulingPolicyClient – set_weight, set_latency_class, and snapshot, all caller-thread-bound.
- SchedulingContextClient – info, create, bind_caller_thread, revoke, and drain_notifications.
A typical spawn/join pseudocode against these clients is:
// Spawn a sibling thread; the returned handle observes only that thread.
let thread_handle = thread_spawner.create_wait(
    &mut ring,
    entry_addr,
    user_stack_top,
    arg,
    fs_base,
    /* flags */ 0,
    timeout_ns,
)?;
// ... runtime work on the parent thread ...
// One-shot join: blocks the calling thread until the target exits.
let exit_code = thread_handle
    .join_wait(&mut ring, timeout_ns)?;
The userspace runtime is responsible for the user stack, TLS/TCB, and any free-list bookkeeping for retired handles; the kernel only validates the ABI fields and charges the per-process ledgers.
Park Handoff
Park authority is defined in Park Authority. The scheduler changes above must leave room for a thread block reason that is not tied to the process ring CQ. The frozen handoff is:
- park wait blocks the current thread, not the whole process;
- park wake makes selected generation-checked ThreadRef values runnable;
- timeouts use the same monotonic time base as Timer;
- private park keys are based on address-space identity plus user virtual address;
- shared-memory park keys are MemoryObject-derived identity plus offset;
- the first implementation starts with compact CAP_OP_PARK and CAP_OP_UNPARK operations rather than generic Cap’n Proto methods;
- park wait SQEs are thread-owned so ring dispatch cannot park a sibling thread under the waiter’s user_data;
- blocking park wait is a syscall-context operation that releases runtime ring-client ownership before the thread parks, while capos-rt demultiplexes reserved park CQEs back to the waiting thread.
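The two key shapes in the handoff list can be sketched as an enum plus a waiter table. `ParkKey`, `ParkTable`, the field names, and the FIFO wake order are hypothetical; the frozen definitions live in Park Authority. Generation checks on wake are omitted here for brevity.

```rust
use std::collections::HashMap;

/// Hypothetical key shapes: private keys are scoped to one address space,
/// shared keys follow the backing object rather than any one mapping.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
enum ParkKey {
    Private { address_space_id: u64, vaddr: u64 },
    Shared { memory_object_id: u64, offset: u64 },
}

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct ThreadRef {
    pid: u32,
    process_generation: u64,
    tid: u32,
    thread_generation: u64,
}

#[derive(Default)]
struct ParkTable {
    waiters: HashMap<ParkKey, Vec<ThreadRef>>,
}

impl ParkTable {
    /// Park blocks one thread on one key, never the whole process.
    fn park(&mut self, key: ParkKey, waiter: ThreadRef) {
        self.waiters.entry(key).or_default().push(waiter);
    }

    /// Wake up to `max_wake` waiters on `key`, oldest first.
    fn unpark(&mut self, key: ParkKey, max_wake: usize) -> Vec<ThreadRef> {
        let Some(queue) = self.waiters.get_mut(&key) else {
            return Vec::new();
        };
        let n = max_wake.min(queue.len());
        let woken: Vec<ThreadRef> = queue.drain(..n).collect();
        if queue.is_empty() {
            self.waiters.remove(&key);
        }
        woken
    }
}
```

Two processes parking on the same virtual address get distinct private keys, so a wake in one address space cannot resume a waiter in another.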
Pre-thread 4.5.4 measurement chose the compact capability-authorized shape for
failed wait and empty wake. 4.5.5 measured the real blocked/resume path through
thread-lifecycle under make run-measure, so the compact ParkSpace opcodes
remain the runtime ABI target for this slice.
Security Invariants
- A thread never owns a separate capability table in the initial model.
- A thread cannot escape the authority of its containing process.
- A ThreadHandle names only a thread in the same process and is non-transferable in the first implementation.
- Thread creation is charged to one process-owned thread/kernel-stack ledger of record before the thread can become runnable.
- Process exit releases shared authority once, after all live threads are removed from scheduling.
- Per-process resource quotas are shared by all threads.
- ThreadControl changes only the current thread’s FS base.
- ThreadControl.exitThread terminates only the current thread and is a capability-ring operation, not a syscall.
- Every waiter or direct handoff that can resume execution stores a generation-checked ThreadRef.
- Process-owned user-buffer validation/copy/read paths hold the process AddressSpace lock; future shared-memory thread primitives still need mapping provenance or object pins when they derive keys from shared backing.
Implementation Order
- Add internal Thread state, make each process own one initial thread, move saved context / kernel stack / FS base / block state onto that thread, and charge the initial thread against private process ledgers. Done 2026-04-24 23:09 UTC.
- Change scheduler queues, blocking, exit cleanup, and direct IPC targets from pid-oriented state to thread references while preserving one thread per process. Done 2026-04-24 23:33 UTC.
- Add ThreadSpawner, ThreadHandle, and ThreadControl.exitThread with a QEMU smoke for create, join, detach, self-join rejection, second join rejection, and last-thread process exit. Done 2026-04-25.
- Implement the ParkSpace private wait/wake path from Park Authority after the scheduler can block and wake individual threads, then run 4.5.5 blocked/resume measurements before declaring the park ABI stable. Done 2026-04-25.
Validation
The thread-lifecycle proof creates multiple threads in one process, proves
they share the address space and CapSet, proves each has an independent FS
base, rejects invalid join cases, joins one thread from another, and lets the
last thread exit the process. The existing make run-spawn path keeps covering
runtime-fs-base and single-thread-runtime so regressions in the pre-thread
runtime contract stay visible. make run-measure additionally records the
private ParkSpace blocked/resume timings and proves that process exit
completes while a park waiter is still parked. Phase D fairness/Interactive/weight-change smokes
(make run-thread-fairness, make run-thread-fairness-interactive,
make run-thread-fairness-weight-change) exercise the SchedulingPolicyCap
caller-thread-bound surface; the thread-scale proof carries the recorded
WFQ scaling evidence. The recorded 1-to-2 work/total speedup gate is the
host-enforced Phase D acceptance criterion; the 1-to-4 row remains a
manually accepted diagnostic. Safe runtime park wrappers and a focused
SchedulingContext budget/donation/notification smoke remain future
capos-rt and harness work.