Park Authority Contract

This page freezes the 7.1.1 design contract for thread-park (park/unpark) authority. It is the handoff from the in-process threading contract to the 7.2 implementation work and records the first 7.2.3 implementation status.

Linux prior art. Park solves the same problem as Linux futex(2): userspace owns the uncontended fast path through atomic operations on a 32-bit word, and the kernel parks/wakes threads only on contention. capOS uses the distinct name Park because the contract differs in important ways from Linux’s: it is capability-gated (no ambient authority), there is no priority inheritance, no requeue, no robust lists, and the shared variant is keyed by MemoryObject identity rather than (inode, pgoff). References to “Linux futex” in this page point to that prior art, not to the capOS API surface.

Scope

The first park milestone stays single-CPU and in-process. It gives a multi-threaded runtime one kernel primitive: park the current thread when a userspace word still has an expected value, and wake parked threads associated with that word. Userspace owns the uncontended path through ordinary atomic operations; the kernel owns only the contended sleep/wake path and timeout integration.
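The split between the userspace fast path and the kernel slow path can be sketched as a three-state lock over one 32-bit word. This is a minimal illustration, not capos-rt code: the `Parker` trait, `YieldParker`, `ParkMutex`, and `contended_increments` are hypothetical names, and `YieldParker` only yields the CPU where a real runtime would submit CAP_OP_PARK and CAP_OP_UNPARK.

```rust
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::Arc;
use std::thread;

/// Hypothetical stand-in for the kernel park/unpark pair.
pub trait Parker {
    /// Sleep only if `word` still holds `expected`; may return spuriously.
    fn park(&self, word: &AtomicU32, expected: u32);
    /// Wake up to `max` waiters associated with `word`.
    fn unpark(&self, word: &AtomicU32, max: u32);
}

/// Test double: yields instead of sleeping, so every return is spurious.
pub struct YieldParker;

impl Parker for YieldParker {
    fn park(&self, word: &AtomicU32, expected: u32) {
        if word.load(Ordering::Relaxed) == expected {
            thread::yield_now();
        }
    }
    fn unpark(&self, _word: &AtomicU32, _max: u32) {}
}

const UNLOCKED: u32 = 0;
const LOCKED: u32 = 1;
const CONTENDED: u32 = 2;

pub struct ParkMutex<P: Parker> {
    word: AtomicU32,
    parker: P,
}

impl<P: Parker> ParkMutex<P> {
    pub fn new(parker: P) -> Self {
        Self { word: AtomicU32::new(UNLOCKED), parker }
    }

    pub fn lock(&self) {
        // Uncontended fast path: one atomic operation, no kernel entry.
        if self
            .word
            .compare_exchange(UNLOCKED, LOCKED, Ordering::Acquire, Ordering::Relaxed)
            .is_ok()
        {
            return;
        }
        // Contended path: mark the word, then park while it stays marked.
        while self.word.swap(CONTENDED, Ordering::Acquire) != UNLOCKED {
            self.parker.park(&self.word, CONTENDED);
        }
    }

    pub fn unlock(&self) {
        // Only enter the wake path if some thread marked contention.
        if self.word.swap(UNLOCKED, Ordering::Release) == CONTENDED {
            self.parker.unpark(&self.word, 1);
        }
    }
}

/// Runs `threads` workers that each bump a counter `per_thread` times.
pub fn contended_increments(threads: u32, per_thread: u32) -> u32 {
    let mutex = Arc::new(ParkMutex::new(YieldParker));
    let counter = Arc::new(AtomicU32::new(0));
    let handles: Vec<_> = (0..threads)
        .map(|_| {
            let (m, c) = (Arc::clone(&mutex), Arc::clone(&counter));
            thread::spawn(move || {
                for _ in 0..per_thread {
                    m.lock();
                    // Deliberately split read/write: only the lock protects it.
                    let v = c.load(Ordering::Relaxed);
                    c.store(v + 1, Ordering::Relaxed);
                    m.unlock();
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    counter.load(Ordering::Relaxed)
}
```

Because `park` re-checks nothing on return, the lock loop itself must tolerate spurious returns, which is why it re-swaps the word on every iteration.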

This contract covers:

  • production park authority objects;
  • private and shared park key identity;
  • the provisional compact wait/wake transport ABI;
  • scheduler, timeout, and process-exit interactions;
  • resource-accounting and security invariants;
  • the 4.5.5 measurement loop after real thread blocking exists.

This is not a Linux futex(2) compatibility surface. Priority inheritance, requeue, robust lists, shared-memory park-words before MemoryObject mapping identity is exposed, and SMP-safe user-buffer pinning remain later work.

Implementation Status

The 2026-04-25 7.2.3 slice implements:

  • schema marker interfaces for ParkSpace and SharedParkSpace;
  • compact CAP_OP_PARK and CAP_OP_UNPARK opcodes;
  • process-local, non-transferable ParkSpace grants through boot/spawn manifests;
  • private wait/wake keyed by the caller process address space and user virtual address;
  • per-thread Park block state with finite timeout integration;
  • one reserved CQE credit per parked waiter so wake/timeout delivery cannot be crowded out by ordinary completions;
  • QEMU correctness coverage in thread-lifecycle for mismatch, immediate timeout, wake-one, and wake-many;
  • 4.5.5 QEMU timing coverage in run-measure.

SharedParkSpace is a marker only. capos-rt has the marker type but no safe park client wrapper yet; the current correctness and measurement demos use raw compact SQEs so the ABI can settle before runtime synchronization wrappers claim the user_data namespace.

Design Grounding

The reviewed project documents for this contract are:

  • WORKPLAN.md;
  • docs/roadmap.md;
  • REVIEW.md;
  • REVIEW_FINDINGS.md;
  • docs/architecture/threading.md;
  • docs/architecture/scheduling.md;
  • docs/architecture/userspace-runtime.md;
  • docs/proposals/go-runtime-proposal.md.

The docs/research/ directory was listed before the milestone was selected. The relevant research grounding is:

  • docs/research/out-of-kernel-scheduling.md for the kernel-assisted wait/wake split used by language runtimes;
  • docs/research/llvm-target.md for the Go/runtime syscall surface that needs thread creation, per-thread TLS, and futexes;
  • docs/research/genode.md for typed capability precedent and resource-accounted session state.

Authority Objects

ParkBench remains measurement-only. It is not a production authority and must not be granted by normal boot manifests.

The first production model has two authority objects:

interface ParkSpace {}
interface SharedParkSpace {}

These schema interfaces are marker interfaces for typed CapSet/result-cap identity. The wait and wake operations use compact ring opcodes rather than Cap’n Proto methods, because the pre-thread 4.5.4 measurement showed the generic Cap’n Proto path is not the right default for the park hot path.

ParkSpace is the first object to implement. It will be minted for a process by the same bootstrap/spawn path that grants ThreadControl and ThreadSpawner. It is process-local and non-transferable in the initial implementation. Holding it authorizes private park wait/wake only in the caller’s own address space; it does not grant memory access, cross-process wake authority, or the right to name arbitrary kernel wait queues.

SharedParkSpace is the shared-park object for a later MemoryObject-derived slice. A MemoryObject holder can derive a SharedParkSpace scoped to that MemoryObject’s backing identity. Shared park operations through that SharedParkSpace are keyed by object offset, not by one process’s virtual address. The first 7.2 implementation may leave SharedParkSpace unimplemented, but it must not choose a private-key ABI that prevents this shared-key model.

Park Keys

Private park keys are address-space scoped:

ParkKey::Private {
    address_space_id,
    address_space_generation,
    uaddr,
}

The first implementation can derive address_space_id and generation from the process id/generation while each process owns exactly one address space. The contract names address-space identity deliberately so a later fork/shared-AS model does not inherit a pid-shaped key.

Private parks are synchronization inside one address space. wake for a private key may wake only waiters in the same address space generation; a raw virtual address alone is never cross-process synchronization authority.
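As a rough illustration of why the generation is part of the key, the sketch below models the private key as a plain comparable value. The field widths, the `private_key_for` helper, and the pid-based derivation are assumptions drawn from the single-address-space case described above, not the kernel's actual types.

```rust
/// Illustrative private key; all field widths are assumed to be u64.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct PrivateKey {
    pub address_space_id: u64,
    pub address_space_generation: u64,
    pub uaddr: u64,
}

/// Hypothetical derivation while each process owns exactly one address space:
/// the process id and generation stand in for address-space identity.
pub fn private_key_for(pid: u64, process_generation: u64, uaddr: u64) -> PrivateKey {
    PrivateKey {
        address_space_id: pid,
        address_space_generation: process_generation,
        uaddr,
    }
}

/// A wake may only touch waiters registered under exactly the same key.
pub fn wake_matches(waiter: PrivateKey, wake: PrivateKey) -> bool {
    waiter == wake
}
```

A waiter registered under generation 1 is therefore invisible to a wake issued under generation 2, even when both name the same numeric uaddr.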

Shared park keys are MemoryObject scoped:

ParkKey::Shared {
    memory_object_id,
    memory_object_generation,
    offset,
}

Shared keys are disabled until the kernel can prove, while handling a park operation, that the submitted user address maps the MemoryObject backing the SharedParkSpace and can compute the byte offset in that backing object. Virtual aliases of the same shared page must converge on the same shared key. Private aliases within one address space do not converge unless they use the same user virtual address.
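The alias-convergence rule can be sketched as follows. Shared keys are not implemented, so everything here is hypothetical: the `Mapping` record, its fields, and `derive_shared_key` only illustrate how two virtual aliases of one object could resolve to the same (object, offset) key. Alignment checking is left out of this sketch.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub struct SharedKey {
    pub memory_object_id: u64,
    pub memory_object_generation: u64,
    pub offset: u64,
}

/// Hypothetical view of one mapping of a MemoryObject into an address space.
#[derive(Clone, Copy, Debug)]
pub struct Mapping {
    pub base: u64,          // first user virtual address of the mapping
    pub len: u64,           // mapping length in bytes
    pub memory_object_id: u64,
    pub memory_object_generation: u64,
    pub object_offset: u64, // offset of `base` inside the backing object
}

/// Derive a shared key, or None if the address does not map the object that
/// backs the SharedParkSpace (`space_object_id`) or the word does not fit.
pub fn derive_shared_key(m: &Mapping, space_object_id: u64, uaddr: u64) -> Option<SharedKey> {
    if m.memory_object_id != space_object_id {
        return None;
    }
    let delta = uaddr.checked_sub(m.base)?;
    // The whole 32-bit word must lie inside the mapping.
    if delta.checked_add(4)? > m.len {
        return None;
    }
    Some(SharedKey {
        memory_object_id: m.memory_object_id,
        memory_object_generation: m.memory_object_generation,
        offset: m.object_offset.checked_add(delta)?,
    })
}
```

Keying on the object offset rather than on `uaddr` is what lets two processes, or two aliases in one process, meet on the same wait queue.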

Shared parks require explicit shared-memory authority through the MemoryObject-derived SharedParkSpace. Never use a raw virtual address alone for cross-process park/futex keys.

All park words are 32-bit and must be 4-byte aligned. wait validates the word as a readable user mapping before reading it. wake validates that the address is user-canonical and aligned; shared wake additionally validates the MemoryObject mapping identity so a caller cannot wake an unrelated object by guessing an offset.
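The address-shape part of that validation might look like the sketch below. The `USER_ADDR_END` bound assumes an x86_64-style lower-half user range and is not taken from the capOS sources; `AddrError` and `check_park_word_addr` are illustrative names. The readable-mapping check that wait performs, and the mapping-identity check for shared wake, are not modelled.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum AddrError {
    Misaligned,
    NotUserCanonical,
}

/// Assumed exclusive upper bound of user addresses (x86_64 lower half).
pub const USER_ADDR_END: u64 = 0x0000_8000_0000_0000;

/// Check that `uaddr` can name a 4-byte-aligned 32-bit user park word.
pub fn check_park_word_addr(uaddr: u64) -> Result<(), AddrError> {
    if uaddr % 4 != 0 {
        return Err(AddrError::Misaligned);
    }
    // The word occupies [uaddr, uaddr + 4); the end must stay in user space.
    match uaddr.checked_add(4) {
        Some(end) if end <= USER_ADDR_END => Ok(()),
        _ => Err(AddrError::NotUserCanonical),
    }
}
```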

Private-key cleanup is part of the ParkSpace contract, not an implementation detail of the Go runtime. Unmap, revoke, address-space generation change, and address-space teardown must drain or fail waiters for the old private key before the same virtual address can be reused as unrelated state. A stale private waiter may complete only against the address-space generation it was registered under; it must not observe or wake a later mapping with the same numeric uaddr.

Current implementation status: process/thread-exit cleanup exists, but VirtualMemory unmap/revoke draining for stale private keys is not implemented yet. Until that lands, the implemented private path is suitable for process lifetime park words and Go runtime bring-up, not for memory regions that are unmapped and reused while waiters may still exist.

Provisional Ring ABI

The 7.2 implementation starts with compact capability-authorized operations:

  • CAP_OP_PARK;
  • CAP_OP_UNPARK.

The numeric opcode values are assigned when the implementation edits capos-config/src/ring.rs. CAP_OP_PARK_BENCH remains reserved for measurement-only kernels and must not be repurposed.

CAP_OP_PARK uses the existing 64-byte SQE fields as:

| SQE field | Meaning |
|---|---|
| cap_id | ParkSpace for private wait, or SharedParkSpace for shared wait |
| user_data | returned in the wait completion CQE |
| addr | user virtual address of the 32-bit park word |
| len | expected 32-bit value |
| pipeline_dep | relative timeout in monotonic nanoseconds; u64::MAX means no timeout |
| flags | must be CAP_SQE_THREAD_OWNED |
| call_id | owning thread id; a different thread leaves the SQE at the ring head |

CAP_OP_UNPARK uses:

| SQE field | Meaning |
|---|---|
| cap_id | ParkSpace for private wake, or SharedParkSpace for shared wake |
| user_data | returned in the wake caller’s completion CQE |
| addr | user virtual address of the 32-bit park word |
| len | maximum number of waiters to wake; zero is malformed |

Both operations require method_id, result_addr, result_len, pipeline_field, xfer_cap_count, and _reserved0 to be zero. CAP_OP_UNPARK also requires flags == 0, pipeline_dep == 0, and call_id == 0. Park operations are not promise-pipelineable in this slice. pipeline_dep is used as the wait timeout storage only for CAP_OP_PARK; future promise pipelining must keep rejecting CAP_SQE_PIPELINE on park opcodes or replace the park ABI in a reviewed branch.
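The shape rules above can be read as a small classifier. The sketch below is an assumption-heavy model: the `Sqe` struct is a logical view with guessed field widths rather than the real 64-byte layout, the `CAP_SQE_THREAD_OWNED` bit value is a placeholder, and `Dispatch`, `classify_park`, and `classify_unpark` are hypothetical names.

```rust
/// Logical view of the SQE fields named in this section (widths assumed).
#[derive(Clone, Copy, Debug, Default)]
pub struct Sqe {
    pub cap_id: u32,
    pub user_data: u64,
    pub addr: u64,
    pub len: u32,
    pub pipeline_dep: u64,
    pub flags: u32,
    pub call_id: u32,
    pub method_id: u16,
    pub result_addr: u64,
    pub result_len: u32,
    pub pipeline_field: u16,
    pub xfer_cap_count: u16,
    pub _reserved0: u32,
}

/// Placeholder bit; the real value lives in the ring ABI definitions.
pub const CAP_SQE_THREAD_OWNED: u32 = 1 << 0;

#[derive(Debug, PartialEq, Eq)]
pub enum Dispatch {
    Consume,     // well-formed and owned by the entering thread
    LeaveAtHead, // well-formed, but owned by a different thread
    Malformed,   // rejected with a transport error
}

/// Fields that must be zero for both park opcodes.
fn unused_fields_zero(s: &Sqe) -> bool {
    s.method_id == 0
        && s.result_addr == 0
        && s.result_len == 0
        && s.pipeline_field == 0
        && s.xfer_cap_count == 0
        && s._reserved0 == 0
}

pub fn classify_park(s: &Sqe, current_thread: u32) -> Dispatch {
    if !unused_fields_zero(s) || s.flags != CAP_SQE_THREAD_OWNED {
        return Dispatch::Malformed;
    }
    if s.call_id != current_thread {
        return Dispatch::LeaveAtHead;
    }
    Dispatch::Consume
}

pub fn classify_unpark(s: &Sqe) -> Dispatch {
    let ok = unused_fields_zero(s)
        && s.flags == 0
        && s.pipeline_dep == 0
        && s.call_id == 0
        && s.len != 0; // zero waiters to wake is malformed
    if ok { Dispatch::Consume } else { Dispatch::Malformed }
}
```

Keeping the ownership mismatch separate from the malformed case reflects the table above: a sibling thread does not fail the SQE, it simply does not consume it.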

Wait completions use non-negative CQE.result statuses:

| Result | Meaning |
|---|---|
| PARK_WOKEN = 0 | a wake operation made the thread runnable |
| PARK_VALUE_MISMATCH = 1 | the loaded word did not equal expected |
| PARK_TIMED_OUT = 2 | the timeout expired before a wake |
| PARK_INTERRUPTED = 3 | a future cancellation/interrupt path aborted the wait |

Wake completions return the non-negative number of threads woken. Malformed SQEs, invalid caps, unreadable wait words, unsupported cap object types, and stale authority use the existing negative transport errors until a later ABI adds a more specific compact-error namespace.
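A runtime-side decoder for these results could be as small as the sketch below. It assumes CQE.result is a signed 32-bit value; `ParkWait`, `decode_wait`, and `decode_wake` are illustrative names, and negative values are passed through untouched because this page does not enumerate the transport errors.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum ParkWait {
    Woken,
    ValueMismatch,
    TimedOut,
    Interrupted,
}

/// Decode a wait completion. Err carries the raw value: either a negative
/// transport error or a non-negative status this sketch does not know.
pub fn decode_wait(result: i32) -> Result<ParkWait, i32> {
    match result {
        0 => Ok(ParkWait::Woken),
        1 => Ok(ParkWait::ValueMismatch),
        2 => Ok(ParkWait::TimedOut),
        3 => Ok(ParkWait::Interrupted),
        other => Err(other),
    }
}

/// Decode a wake completion: non-negative is the number of threads woken.
pub fn decode_wake(result: i32) -> Result<u32, i32> {
    u32::try_from(result).map_err(|_| result)
}
```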

Ring Ownership And Dispatch Context

Park operations use the process capability ring for submission and CQE delivery, but blocking wait is not an ordinary long-lived runtime call. A runtime must not hold RuntimeRingClient while the thread is parked in CAP_OP_PARK; otherwise no sibling thread in the same process can borrow the same ring client to submit CAP_OP_UNPARK.

The runtime contract for park operations is:

  • capos-rt owns a process-wide park submission/completion path separate from the generic request-buffer RuntimeRingClient pending-call list;
  • park wait reserves a unique user_data value, writes the SQE while holding the runtime’s ring-submission lock, records a park-wait completion slot in runtime-owned memory, and releases the ring-submission lock before entering cap_enter;
  • park wait sets CAP_SQE_THREAD_OWNED and call_id to the current thread id so a sibling thread cannot drain the wait and park the wrong ThreadRef;
  • the park user_data namespace is reserved by the runtime so ordinary generic clients cannot accidentally claim a park completion;
  • all runtime CQ draining must route reserved park user_data completions to the park-wait slot instead of treating them as generic client completions;
  • if another thread drains the waiter CQE before the waiting thread returns from cap_enter, the waiting thread reads the already-recorded status from that park-wait slot;
  • park wake may use the ordinary serialized ring submission path because it completes without parking the caller’s thread.
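The routing rule in the list above can be sketched as a single-threaded model. The safe capos-rt wrapper does not exist yet, so this is only one possible shape: the high-bit `PARK_USER_DATA_TAG`, the `Runtime` struct, and its methods are hypothetical, and a real runtime would guard this state with its ring-submission lock.

```rust
use std::collections::HashMap;

/// Hypothetical reserved namespace: the top bit of user_data marks park waits.
pub const PARK_USER_DATA_TAG: u64 = 1 << 63;

pub struct Cqe {
    pub user_data: u64,
    pub result: i32,
}

#[derive(Default)]
pub struct Runtime {
    next_park_id: u64,
    /// One slot per outstanding park wait; None until its CQE is drained.
    park_slots: HashMap<u64, Option<i32>>,
    /// Completions that belong to ordinary generic clients.
    generic: Vec<Cqe>,
}

impl Runtime {
    /// Reserve a unique park user_data value and its completion slot.
    pub fn reserve_park_user_data(&mut self) -> u64 {
        let user_data = PARK_USER_DATA_TAG | self.next_park_id;
        self.next_park_id += 1;
        self.park_slots.insert(user_data, None);
        user_data
    }

    /// Called by whichever thread drains the CQ.
    pub fn route(&mut self, cqe: Cqe) {
        if cqe.user_data & PARK_USER_DATA_TAG != 0 {
            // Park completions go to the waiter's slot, never to generic clients.
            if let Some(slot) = self.park_slots.get_mut(&cqe.user_data) {
                *slot = Some(cqe.result);
            }
        } else {
            self.generic.push(cqe);
        }
    }

    /// The waiting thread reads its status after returning from cap_enter.
    pub fn take_park_status(&mut self, user_data: u64) -> Option<i32> {
        match self.park_slots.get(&user_data) {
            Some(Some(_)) => self.park_slots.remove(&user_data).flatten(),
            _ => None, // unknown slot, or the CQE has not been drained yet
        }
    }

    pub fn generic_len(&self) -> usize {
        self.generic.len()
    }
}
```

The slot is what makes the "sibling drained my CQE" case harmless: the status is already recorded by the time the waiter looks for it.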

CAP_OP_PARK is syscall-context only. Timer ring polling and any future interrupt-context ring drain must leave it unconsumed because consuming it can block the current thread and mutate scheduler state. CAP_OP_UNPARK also starts as syscall-context only; widening wake to timer polling would need a separate review of scheduler locking and completion delivery.

This design preserves one process ring and the single blocked cap_enter waiter rule. A thread blocked in Park is not the process ring’s CapEnter waiter, so a sibling can still enter the kernel to submit wake, Timer, IPC, or ordinary capability work through the same process ring.

Wait And Wake Semantics

wait is atomic with respect to wake for the same key:

  1. validate the SQE shape, including thread ownership, and authority cap;
  2. verify call_id names the current thread so a sibling cannot park on behalf of the waiter;
  3. validate the user address shape and derive the private or shared park key;
  4. lock the current process AddressSpace across validation and the user-word read for private keys; future shared keys must additionally prove mapping identity or pin the backing object;
  5. take the park bucket lock;
  6. read the 32-bit user word while the bucket lock is held;
  7. compare the loaded value with expected;
  8. if the value differs, post PARK_VALUE_MISMATCH without blocking;
  9. if the value matches and the timeout is zero, post PARK_TIMED_OUT without blocking;
  10. otherwise, record the current ThreadRef, key, timeout deadline, and user_data, then block only the current thread.

The user-word read, comparison, and enqueue are serialized with wake by the park scheduler path, and the read itself occurs while the process AddressSpace mutex is held. This prevents a page-table validation/use race and the classic lost wake where a waiter reads the old value, a sibling stores the new value and wakes no one, and the waiter then parks based on the stale read. Shared park-words still need mapping provenance or object pinning so a MemoryObject-derived key cannot be swapped out from under key derivation. The user word is not a kernel-owned mutex. Runtime code must use normal atomic load/store and memory-ordering rules around the park word.

wake derives the same key, removes up to maxWake valid waiters (the len field of the CAP_OP_UNPARK SQE) from that key’s FIFO list, posts PARK_WOKEN completions to the waiting process ring using the completion credits reserved when those waiters parked, and marks those ThreadRef values runnable after generation checks. A wake SQE is consumed only when the kernel can also post the wake caller’s own CQE; if that ordinary CQ slot is not available, no waiters are removed and the SQE remains pending like other uncompletable ring work. Stale waiters caused by thread or process generation mismatch are drained without writing to userspace, release their reserved completion credits, and do not count as successfully woken.
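The wait decision order and FIFO wake can be modelled in a few lines. This is a toy, not the kernel path: `ParkTable`, `Waiter`, and `WaitOutcome` are hypothetical, exclusive `&mut self` access stands in for the bucket lock, a `u64` stands in for the full park key, and the `HashMap` allocates where the real contract requires fixed storage. Generation checks, completion credits, and the wake caller's own CQE are not modelled.

```rust
use std::collections::{HashMap, VecDeque};
use std::sync::atomic::{AtomicU32, Ordering};

#[derive(Debug, PartialEq, Eq)]
pub enum WaitOutcome {
    ValueMismatch, // PARK_VALUE_MISMATCH posted, thread keeps running
    TimedOut,      // PARK_TIMED_OUT posted immediately (zero timeout)
    Blocked,       // waiter recorded; the thread would block here
}

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Waiter {
    pub thread: u32,
    pub user_data: u64,
}

#[derive(Default)]
pub struct ParkTable {
    buckets: HashMap<u64, VecDeque<Waiter>>,
}

impl ParkTable {
    /// `&mut self` models holding the bucket lock across read and enqueue.
    pub fn wait(
        &mut self,
        key: u64,
        word: &AtomicU32,
        expected: u32,
        timeout_ns: u64,
        waiter: Waiter,
    ) -> WaitOutcome {
        // The word is read while the "lock" is held, so a wake cannot slip
        // in between the comparison and the enqueue.
        if word.load(Ordering::SeqCst) != expected {
            return WaitOutcome::ValueMismatch;
        }
        if timeout_ns == 0 {
            return WaitOutcome::TimedOut;
        }
        self.buckets.entry(key).or_default().push_back(waiter);
        WaitOutcome::Blocked
    }

    /// Remove up to `max_wake` waiters in FIFO order and return them.
    pub fn wake(&mut self, key: u64, max_wake: u32) -> Vec<Waiter> {
        let mut woken = Vec::new();
        if let Some(queue) = self.buckets.get_mut(&key) {
            while (woken.len() as u32) < max_wake {
                match queue.pop_front() {
                    Some(w) => woken.push(w),
                    None => break,
                }
            }
        }
        woken
    }
}
```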

Timeouts use the same monotonic time base as Timer. The kernel may convert nanoseconds to scheduler ticks internally, but the ABI remains nanoseconds. Finite deadlines post PARK_TIMED_OUT through the waiting process ring using the waiter’s reserved completion credit and wake the blocked thread if the thread generation still matches.
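A sketch of the deadline arithmetic, under two assumptions that this page does not state: that an overflowing deadline saturates rather than wraps, and that nanoseconds are rounded up to whole ticks so a waiter never times out early. `NO_TIMEOUT`, `deadline_ns`, and `ns_to_ticks_ceil` are illustrative names, and the tick period is a caller-supplied parameter.

```rust
/// ABI sentinel from the CAP_OP_PARK table: u64::MAX means no timeout.
pub const NO_TIMEOUT: u64 = u64::MAX;

/// Absolute monotonic deadline, or None for an indefinite wait.
pub fn deadline_ns(now_ns: u64, timeout_ns: u64) -> Option<u64> {
    if timeout_ns == NO_TIMEOUT {
        None
    } else {
        // Saturate instead of wrapping so a huge timeout never lands in the past.
        Some(now_ns.saturating_add(timeout_ns))
    }
}

/// Convert nanoseconds to scheduler ticks, rounding up (assumed policy).
pub fn ns_to_ticks_ceil(ns: u64, tick_ns: u64) -> u64 {
    assert!(tick_ns > 0, "tick period must be non-zero");
    ns / tick_ns + u64::from(ns % tick_ns != 0)
}
```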

A race among explicit wake, timeout, cancellation, process exit, and unmap/revoke cleanup must produce exactly one waiter completion or cleanup-consumption path. Once any path consumes the waiter record, the other racing paths must observe it as gone and must not post a second CQE or wake a later ThreadRef.
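One way to picture the exactly-once rule is a single claim on the waiter record that only the first path can win. The sketch uses an atomic compare-exchange for that claim; this is a swapped-in illustration technique, since the page describes the kernel serializing these paths under the park scheduler path rather than with a lock-free flag. `WaiterRecord`, `Consumer`, and `race_winners` are hypothetical names.

```rust
use std::sync::atomic::{AtomicU8, Ordering};
use std::thread;

/// The paths that may try to consume one waiter record.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
#[repr(u8)]
pub enum Consumer {
    Wake = 1,
    Timeout = 2,
    Cancel = 3,
    Exit = 4,
    Unmap = 5,
}

pub struct WaiterRecord {
    /// 0 while the waiter is pending, otherwise the winning Consumer.
    state: AtomicU8,
}

impl WaiterRecord {
    pub fn new() -> Self {
        Self { state: AtomicU8::new(0) }
    }

    /// Returns true for exactly one caller; every later caller sees it gone.
    pub fn try_consume(&self, by: Consumer) -> bool {
        self.state
            .compare_exchange(0, by as u8, Ordering::AcqRel, Ordering::Acquire)
            .is_ok()
    }
}

/// Race all `paths` against one record and count how many of them won.
pub fn race_winners(paths: &[Consumer]) -> usize {
    let record = WaiterRecord::new();
    thread::scope(|s| {
        let handles: Vec<_> = paths
            .iter()
            .map(|&path| {
                let record = &record;
                s.spawn(move || record.try_consume(path))
            })
            .collect();
        handles
            .into_iter()
            .map(|h| h.join().unwrap())
            .filter(|&won| won)
            .count()
    })
}
```

Only the winner posts a CQE or releases the reserved credit; a loser does nothing.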

Process exit removes every park waiter whose pid/process generation matches the exiting process. Thread exit removes that thread’s own park waiter before the thread record can be retained for join observation. These cleanup paths must not allocate.

Unmap, mapping revoke, and address-space teardown remove or fail private waiters for the affected key/generation before the old virtual address range is made reusable for unrelated mappings. A wake or timeout racing with cleanup must either complete the old waiter under its original generation or observe that cleanup already consumed it; it must not post a completion to a new owner of the same numeric address.

Resource Accounting

Park waits are bounded by the process thread ledger. A thread can be in only one scheduler block reason, so live park waiters cannot exceed live threads. The first private ParkSpace implementation stores the wait node in thread-owned block state and links it into a fixed process-owned waiter table. That is valid only because private ParkSpace caps are process-local and the first key is the process address space plus user virtual address. SharedParkSpace support must move to object-owned fixed buckets scoped to MemoryObject identity.

Wait, wake, timeout, and process-exit cleanup must not allocate. Registering a blocking wait reserves one deferred CQE credit in the waiting process. Ordinary completion posting treats reserved credits as unavailable, so wake and timeout paths can always post the waiter completion without losing the waiter. If the kernel cannot reserve that credit, it must not enqueue or block the wait; it either leaves the SQE pending until capacity exists or posts a negative completion for the wait attempt without consuming a waiter slot.
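The credit rule can be sketched as a small ledger over the completion queue. The `CqLedger` type and its method names are hypothetical; the sketch only shows the invariant that occupied slots plus reserved credits never exceed capacity, which is what lets a waiter completion always find room.

```rust
/// Toy accounting for one process completion queue.
pub struct CqLedger {
    capacity: u32,
    occupied: u32, // CQEs posted and not yet consumed by userspace
    reserved: u32, // credits held for parked waiters
}

impl CqLedger {
    pub fn new(capacity: u32) -> Self {
        Self { capacity, occupied: 0, reserved: 0 }
    }

    /// Slots that ordinary completions and new reservations may still use.
    fn free(&self) -> u32 {
        self.capacity - self.occupied - self.reserved
    }

    /// Ordinary completions treat reserved credits as unavailable.
    pub fn post_ordinary(&mut self) -> bool {
        if self.free() == 0 {
            return false;
        }
        self.occupied += 1;
        true
    }

    /// Called before a wait is enqueued; on false the wait must not block.
    pub fn reserve_for_waiter(&mut self) -> bool {
        if self.free() == 0 {
            return false;
        }
        self.reserved += 1;
        true
    }

    /// Wake or timeout converts the waiter's credit into a posted CQE.
    pub fn post_waiter_completion(&mut self) -> bool {
        if self.reserved == 0 {
            return false;
        }
        self.reserved -= 1;
        self.occupied += 1;
        true
    }

    /// Thread or process exit drops the credit without posting a CQE.
    pub fn release_reservation(&mut self) -> bool {
        if self.reserved == 0 {
            return false;
        }
        self.reserved -= 1;
        true
    }

    /// Userspace consumed one CQE.
    pub fn consume(&mut self) -> bool {
        if self.occupied == 0 {
            return false;
        }
        self.occupied -= 1;
        true
    }
}
```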

ParkSpace creation is charged as ordinary process capability/table state. If the first implementation needs per-process bucket storage beyond the cap object itself, that storage must be reserved before the ParkSpace is published and released when the process exits or the cap is finally dropped.

In the first private implementation, the waiter table is process-owned and survives release of the ParkSpace handle. CAP_OP_RELEASE of the last capability handle removes submit authority but cannot free a parked waiter’s storage. A waiter can still receive a PARK_WOKEN CQE from a wake operation that already resolved the authority object, a PARK_TIMED_OUT CQE from a finite deadline, or a future PARK_INTERRUPTED CQE from an explicit cancellation path. Thread or process exit drains the wait node without posting a CQE to the exiting thread/process and releases the reserved completion credit. If a runtime drops the last ParkSpace while it has indefinite waiters, it can deadlock its own process, but it cannot create a use-after-free or leak authority outside that process. Future SharedParkSpace storage must use explicit non-cap-table waiter pins so object-owned buckets are not freed while parked waiters remain.

SharedParkSpace storage is charged to the MemoryObject-derived object when shared parking lands. It must not create a second unbounded resource path where a holder can allocate wait queues by touching many offsets.

Security Invariants

  • Holding a ParkSpace or SharedParkSpace authorizes blocking/waking, not memory access. Wait still requires a readable user word.
  • Private ParkSpace caps are process-local and non-transferable in the first implementation.
  • Shared park authority must be derived from MemoryObject identity and offset, not from another process’s virtual address.
  • Park wait blocks the current thread, not the whole process.
  • Park wait SQEs are thread-owned; a non-owner cap_enter leaves the SQE at the ring head instead of parking the wrong thread.
  • Park wake can only make generation-checked ThreadRef values runnable.
  • Park completions are posted to the waiting process ring using the waiter SQE’s user_data.
  • Blocking wait registration reserves one CQE credit for the eventual waiter completion, and wake must not remove a waiter unless that credit exists.
  • CAP_OP_PARK is dispatched only from syscall-context cap_enter and never from timer or interrupt-context ring polling.
  • A parked private ParkSpace waiter is stored in process-owned fixed storage; future SharedParkSpace waiters must pin the authority object backing their bucket table until wake, timeout, thread exit, or process exit removes the waiter.
  • One process ring still has at most one blocked cap_enter waiter in 7.2; park wait does not create an extra blocked ring waiter.
  • Private ParkSpace wait reads hold the process AddressSpace lock across validation and the user-word read. SharedParkSpace park-words remain blocked until MemoryObject mapping provenance or explicit object pins cover shared key derivation.

Measurement Handoff

4.5.4 measured failed wait and empty wake before real threads existed. That result chooses a compact capability-authorized operation as the starting ABI for 7.2 rather than a generic Cap’n Proto wait/wake method pair.

4.5.5 is closed for the first real thread-blocking path. It measures:

  • value-mismatch wait;
  • empty wake;
  • wait-to-block;
  • wake-to-runnable;
  • wake-to-resume through cap_enter.

The 2026-04-25 QEMU sample printed:

[thread-lifecycle] park path avg cycles: failed_wait=6778 empty_wake=6840 wait_to_block=55994326 wake_to_runnable=28219 wake_to_resume=28000684

The compact shape still holds for this slice: CAP_OP_PARK and CAP_OP_UNPARK remain the production runtime ABI target, while ParkBench remains measurement-only.

Implementation Order

  1. Add ParkSpace and SharedParkSpace marker interfaces plus compact opcode constants.
  2. Add a process-local ParkSpace grant path next to ThreadControl and ThreadSpawner; keep it non-transferable.
  3. Add thread-owned Park block state and fixed private waiter storage with no wait/wake allocation.
  4. Dispatch CAP_OP_PARK and CAP_OP_UNPARK against ParkSpace for private address-space keys.
  5. Add QEMU smoke coverage for mismatch, timeout, wake-one, and wake-many. Safe runtime park wrappers remain a later capos-rt slice.
  6. Run 4.5.5 blocked/resume measurements and fold the result into the final ABI decision.
  7. Drain or fail private waiters on VirtualMemory unmap, mapping revoke, and address-space generation change before the affected virtual address range can be reused.
  8. Add MemoryObject-derived SharedParkSpace support only after mapping provenance or object pins cover shared key derivation under the same validation/use discipline.

Validation Plan

The first implementation smoke test should create multiple threads in one process, park one or more threads on a userspace park word, wake them through the same ParkSpace, prove timeout and value-mismatch paths, and show that process exit drains pending waits. The runtime smoke test should use the same capability through capos-rt so future Go work has a direct handoff.