# Park Authority Contract

This page freezes the 7.1.1 design contract for thread-park (`park`/`unpark`)
authority. It is the handoff from the in-process threading contract to the 7.2
implementation work and records the first 7.2.3 implementation status.

**Linux prior art.** `Park` solves the same problem as Linux `futex(2)`:
userspace owns the uncontended fast path through atomic operations on a 32-bit
word, and the kernel parks/wakes threads only on contention. capOS uses the
distinct name `Park` because the contract differs in important ways from
Linux's: it is capability-gated (no ambient authority), there is no priority
inheritance, no requeue, no robust lists, and the shared variant is keyed by
`MemoryObject` identity rather than `(inode, pgoff)`. References to "Linux
futex" in this page point to that prior art, not to the capOS API surface.

## Scope

The first park milestone stays single-CPU and in-process. It gives a
multi-threaded runtime one kernel primitive: park the current thread when a
userspace word still has an expected value, and wake parked threads associated
with that word. Userspace owns the uncontended path through ordinary atomic
operations; the kernel owns only the contended sleep/wake path and timeout
integration.
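A minimal sketch of that split, assuming a hypothetical `Parker` trait in place of a runtime park wrapper (`capos-rt` has none yet). `Parker`, `WordLock`, and `CountingParker` are illustrative names, not capOS APIs. The uncontended lock and unlock paths touch only the atomic word; the parker is called only on contention.

```rust
use std::sync::atomic::{AtomicU32, AtomicUsize, Ordering};

/// Hypothetical stand-in for the kernel park path; not a capos-rt API.
pub trait Parker {
    /// Block the caller only if `word` still holds `expected`.
    fn park(&self, word: &AtomicU32, expected: u32);
    /// Wake up to `max` threads parked on `word`.
    fn unpark(&self, word: &AtomicU32, max: u32);
}

pub const UNLOCKED: u32 = 0;
pub const LOCKED: u32 = 1;
pub const CONTENDED: u32 = 2;

pub struct WordLock {
    /// Public only so the sketch is easy to inspect.
    pub word: AtomicU32,
}

impl WordLock {
    pub const fn new() -> Self {
        Self { word: AtomicU32::new(UNLOCKED) }
    }

    pub fn lock<P: Parker>(&self, parker: &P) {
        // Fast path: one compare-exchange, no kernel entry.
        if self
            .word
            .compare_exchange(UNLOCKED, LOCKED, Ordering::Acquire, Ordering::Relaxed)
            .is_ok()
        {
            return;
        }
        // Slow path: advertise contention, then park while the word is unchanged.
        while self.word.swap(CONTENDED, Ordering::Acquire) != UNLOCKED {
            parker.park(&self.word, CONTENDED);
        }
    }

    pub fn unlock<P: Parker>(&self, parker: &P) {
        // Only a contended word needs a wake.
        if self.word.swap(UNLOCKED, Ordering::Release) == CONTENDED {
            parker.unpark(&self.word, 1);
        }
    }
}

/// Test double that counts slow-path calls instead of blocking.
#[derive(Default)]
pub struct CountingParker {
    pub parks: AtomicUsize,
    pub unparks: AtomicUsize,
}

impl Parker for CountingParker {
    fn park(&self, _word: &AtomicU32, _expected: u32) {
        self.parks.fetch_add(1, Ordering::Relaxed);
    }
    fn unpark(&self, _word: &AtomicU32, _max: u32) {
        self.unparks.fetch_add(1, Ordering::Relaxed);
    }
}
```

The three-state word is one common userspace lock shape; the contract itself does not prescribe a lock algorithm.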

This contract covers:

- production park authority objects;
- private and shared park key identity;
- the provisional compact wait/wake transport ABI;
- scheduler, timeout, and process-exit interactions;
- resource-accounting and security invariants;
- the 4.5.5 measurement loop after real thread blocking exists.

This is not a Linux `futex(2)` compatibility surface. Priority inheritance,
requeue, robust lists, and SMP-safe user-buffer pinning remain later work, as
do shared-memory park words, which wait on MemoryObject mapping identity being
exposed.

## Implementation Status

The 2026-04-25 7.2.3 slice implements:

- schema marker interfaces for `ParkSpace` and `SharedParkSpace`;
- compact `CAP_OP_PARK` and `CAP_OP_UNPARK` opcodes;
- process-local, non-transferable ParkSpace grants through boot/spawn
  manifests;
- private wait/wake keyed by the caller process address space and user virtual
  address;
- per-thread `Park` block state with finite timeout integration;
- one reserved CQE credit per parked waiter so wake/timeout delivery cannot be
  crowded out by ordinary completions;
- QEMU correctness coverage in `thread-lifecycle` for mismatch, immediate
  timeout, wake-one, and wake-many;
- 4.5.5 QEMU timing coverage in `run-measure`.

`SharedParkSpace` is a marker only. `capos-rt` has the marker type but no safe park
client wrapper yet; the current correctness and measurement demos use raw
compact SQEs so the ABI can settle before runtime synchronization wrappers
claim the `user_data` namespace.

## Design Grounding

The reviewed project documents for this contract are:

- `WORKPLAN.md`;
- `docs/roadmap.md`;
- `REVIEW.md`;
- `REVIEW_FINDINGS.md`;
- `docs/architecture/threading.md`;
- `docs/architecture/scheduling.md`;
- `docs/architecture/userspace-runtime.md`;
- `docs/proposals/go-runtime-proposal.md`.

`docs/research/` was listed before selecting the milestone. The relevant
research grounding is:

- `docs/research/out-of-kernel-scheduling.md` for the kernel-assisted
  wait/wake split used by language runtimes;
- `docs/research/llvm-target.md` for the Go/runtime syscall surface that needs
  thread creation, per-thread TLS, and futexes;
- `docs/research/genode.md` for typed capability precedent and
  resource-accounted session state.

## Authority Objects

`ParkBench` remains measurement-only. It is not a production authority and
must not be granted by normal boot manifests.

The first production model has two authority objects:

```capnp
interface ParkSpace {}
interface SharedParkSpace {}
```

These schema interfaces are marker interfaces for typed CapSet/result-cap
identity. The wait and wake operations use compact ring opcodes rather than
Cap'n Proto methods, because the pre-thread 4.5.4 measurement showed the
generic Cap'n Proto path is not the right default for the park hot path.

`ParkSpace` is the first object to implement. It will be minted for a process
by the same bootstrap/spawn path that grants `ThreadControl` and
`ThreadSpawner`. It is process-local and non-transferable in the initial
implementation. Holding it authorizes private park wait/wake only in the
caller's own address space; it does not grant memory access, cross-process wake
authority, or the right to name arbitrary kernel wait queues.

`SharedParkSpace` is the shared-park object for a later MemoryObject-derived slice. A
MemoryObject holder can derive a SharedParkSpace scoped to that MemoryObject's backing
identity. Shared park operations through that SharedParkSpace are keyed by object
offset, not by one process's virtual address. The first 7.2 implementation may
leave `SharedParkSpace` unimplemented, but it must not choose a private-key ABI that
prevents this shared-key model.

## Park Keys

Private park keys are address-space scoped:

```rust
ParkKey::Private {
    address_space_id,
    address_space_generation,
    uaddr,
}
```

The first implementation can derive `address_space_id` and generation from the
process id/generation while each process owns exactly one address space. The
contract names address-space identity deliberately so a later fork/shared-AS
model does not inherit a pid-shaped key.

Private parks synchronize threads inside one address space. `wake` for a
private key may wake only waiters in the same address space generation; a raw
virtual address alone is never cross-process synchronization authority.

Shared park keys are MemoryObject scoped:

```rust
ParkKey::Shared {
    memory_object_id,
    memory_object_generation,
    offset,
}
```

Shared keys are disabled until the kernel can prove, while handling a park
operation, that the submitted user address maps the MemoryObject backing the
SharedParkSpace and can compute the byte offset in that backing object. Virtual aliases
of the same shared page must converge on the same shared key. Private aliases
within one address space do not converge unless they use the same user virtual
address.
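The convergence rules can be sketched as follows. The field widths, the `SharedMapping` descriptor, and both helper functions are assumptions made for illustration; the contract names the key fields but not their types or the derivation API.

```rust
/// Field names follow the contract; the `u64` widths are assumptions.
#[derive(Clone, Copy, Debug, PartialEq, Eq, Hash)]
pub enum ParkKey {
    Private {
        address_space_id: u64,
        address_space_generation: u64,
        uaddr: u64,
    },
    Shared {
        memory_object_id: u64,
        memory_object_generation: u64,
        offset: u64,
    },
}

/// Hypothetical description of one process's mapping of a MemoryObject.
#[derive(Clone, Copy, Debug)]
pub struct SharedMapping {
    pub memory_object_id: u64,
    pub memory_object_generation: u64,
    /// User virtual address where this mapping starts.
    pub base_uaddr: u64,
    /// Offset in the backing object of the first mapped byte.
    pub object_offset: u64,
    /// Mapping length in bytes.
    pub len: u64,
}

pub fn private_key(as_id: u64, as_generation: u64, uaddr: u64) -> ParkKey {
    ParkKey::Private {
        address_space_id: as_id,
        address_space_generation: as_generation,
        uaddr,
    }
}

/// Returns `None` when `uaddr` does not fall inside the mapping.
pub fn shared_key(mapping: &SharedMapping, uaddr: u64) -> Option<ParkKey> {
    let delta = uaddr.checked_sub(mapping.base_uaddr)?;
    if delta >= mapping.len {
        return None;
    }
    Some(ParkKey::Shared {
        memory_object_id: mapping.memory_object_id,
        memory_object_generation: mapping.memory_object_generation,
        offset: mapping.object_offset.checked_add(delta)?,
    })
}
```

Two mappings of the same object at different base addresses yield the same `Shared` key for the same byte, while two private keys with the same `uaddr` but different generations never compare equal.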

Shared parks require explicit shared-memory authority through the
MemoryObject-derived `SharedParkSpace`. A raw virtual address alone must never
be used as a cross-process park/futex key.

All park words are 32-bit and must be 4-byte aligned. `wait` validates the
word as a readable user mapping before reading it. `wake` validates that the
address is user-canonical and aligned; shared `wake` additionally validates
the MemoryObject mapping identity so a caller cannot wake an unrelated object
by guessing an offset.
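A sketch of the address-shape check shared by `wait` and `wake`. The `USER_ADDR_END` boundary assumes an x86_64-style lower-half user range, which this page does not state, and `check_park_word_addr` is a hypothetical helper name. The readable-mapping check for `wait` and the MemoryObject identity check for shared `wake` are separate steps not shown here.

```rust
/// Park words are 32-bit, so they occupy four bytes.
pub const PARK_WORD_SIZE: u64 = 4;

/// Assumption: user addresses live below this x86_64-style boundary.
pub const USER_ADDR_END: u64 = 0x0000_8000_0000_0000;

#[derive(Debug, PartialEq, Eq)]
pub enum AddrError {
    Misaligned,
    NotUserCanonical,
}

/// Shape check only: alignment and user-range containment.
pub fn check_park_word_addr(uaddr: u64) -> Result<(), AddrError> {
    if uaddr % PARK_WORD_SIZE != 0 {
        return Err(AddrError::Misaligned);
    }
    // The whole word must fit below the user boundary without overflow.
    let end = uaddr
        .checked_add(PARK_WORD_SIZE)
        .ok_or(AddrError::NotUserCanonical)?;
    if end > USER_ADDR_END {
        return Err(AddrError::NotUserCanonical);
    }
    Ok(())
}
```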

Private-key cleanup is part of the ParkSpace contract, not an implementation
detail of the Go runtime. Unmap, revoke, address-space generation change, and
address-space teardown must drain or fail waiters for the old private key
before the same virtual address can be reused as unrelated state. A stale
private waiter may complete only against the address-space generation it was
registered under; it must not observe or wake a later mapping with the same
numeric `uaddr`.

Current implementation status: process/thread-exit cleanup exists, but
VirtualMemory unmap/revoke draining for stale private keys is not implemented
yet. Until that lands, the implemented private path is suitable for process
lifetime park words and Go runtime bring-up, not for memory regions that are
unmapped and reused while waiters may still exist.

## Provisional Ring ABI

The 7.2 implementation starts with compact capability-authorized operations:

- `CAP_OP_PARK`;
- `CAP_OP_UNPARK`.

The numeric opcode values are assigned when the implementation edits
`capos-config/src/ring.rs`. `CAP_OP_PARK_BENCH` remains reserved for
measurement-only kernels and must not be repurposed.

`CAP_OP_PARK` uses the existing 64-byte SQE fields as:

| SQE field | Meaning |
| --- | --- |
| `cap_id` | `ParkSpace` for private wait, or `SharedParkSpace` for shared wait |
| `user_data` | returned in the wait completion CQE |
| `addr` | user virtual address of the 32-bit park word |
| `len` | expected 32-bit value |
| `pipeline_dep` | relative timeout in monotonic nanoseconds; `u64::MAX` means no timeout |
| `flags` | must be `CAP_SQE_THREAD_OWNED` |
| `call_id` | owning thread id; a different thread leaves the SQE at the ring head |

`CAP_OP_UNPARK` uses:

| SQE field | Meaning |
| --- | --- |
| `cap_id` | `ParkSpace` for private wake, or `SharedParkSpace` for shared wake |
| `user_data` | returned in the wake caller's completion CQE |
| `addr` | user virtual address of the 32-bit park word |
| `len` | maximum number of waiters to wake; zero is malformed |

Both operations require `method_id`, `result_addr`, `result_len`,
`pipeline_field`, `xfer_cap_count`, and `_reserved0` to be zero.
`CAP_OP_UNPARK` also requires `flags == 0`, `pipeline_dep == 0`, and
`call_id == 0`. Park operations are not promise-pipelineable in this slice.
`pipeline_dep` is used as the wait timeout storage only for
`CAP_OP_PARK`; future promise pipelining must keep rejecting
`CAP_SQE_PIPELINE` on park opcodes or replace the park ABI in a reviewed
branch.
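The shape rules above can be sketched as two predicates. The `Sqe` struct below is a field-only model: the field names come from the tables, but the widths, the field order, and the `CAP_SQE_THREAD_OWNED` value are placeholders, not the layout in `capos-config/src/ring.rs`.

```rust
/// Hypothetical model of the SQE fields; widths and order are assumptions.
#[derive(Clone, Copy, Debug, Default)]
pub struct Sqe {
    pub cap_id: u32,
    pub method_id: u16,
    pub flags: u16,
    pub user_data: u64,
    pub addr: u64,
    pub len: u32,
    pub result_addr: u64,
    pub result_len: u32,
    pub call_id: u32,
    pub pipeline_dep: u64,
    pub pipeline_field: u16,
    pub xfer_cap_count: u16,
    pub _reserved0: u32,
}

/// Placeholder bit; the real flag value is defined by the ring ABI.
pub const CAP_SQE_THREAD_OWNED: u16 = 1 << 0;

/// Fields that both park opcodes require to be zero.
fn unused_fields_zero(s: &Sqe) -> bool {
    s.method_id == 0
        && s.result_addr == 0
        && s.result_len == 0
        && s.pipeline_field == 0
        && s.xfer_cap_count == 0
        && s._reserved0 == 0
}

/// `CAP_OP_PARK`: flags must be exactly the thread-owned flag.
pub fn park_shape_ok(s: &Sqe) -> bool {
    unused_fields_zero(s) && s.flags == CAP_SQE_THREAD_OWNED
}

/// `CAP_OP_UNPARK`: no flags, no timeout, no owner, non-zero wake count.
pub fn unpark_shape_ok(s: &Sqe) -> bool {
    unused_fields_zero(s)
        && s.flags == 0
        && s.pipeline_dep == 0
        && s.call_id == 0
        && s.len != 0
}
```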

Wait completions use non-negative `CQE.result` statuses:

| Result | Meaning |
| --- | --- |
| `PARK_WOKEN = 0` | a wake operation made the thread runnable |
| `PARK_VALUE_MISMATCH = 1` | the loaded word did not equal `expected` |
| `PARK_TIMED_OUT = 2` | the timeout expired before a wake |
| `PARK_INTERRUPTED = 3` | a future cancellation/interrupt path aborted the wait |

Wake completions return the non-negative number of threads woken. Malformed
SQEs, invalid caps, unreadable wait words, unsupported cap object types, and
stale authority use the existing negative transport errors until a later ABI
adds a more specific compact-error namespace.
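A caller-side decoding sketch. The status values are the ones in the table; the `i32` width of `CQE.result` and the `WaitOutcome`, `decode_wait`, and `decode_wake` names are assumptions for illustration.

```rust
pub const PARK_WOKEN: i32 = 0;
pub const PARK_VALUE_MISMATCH: i32 = 1;
pub const PARK_TIMED_OUT: i32 = 2;
pub const PARK_INTERRUPTED: i32 = 3;

#[derive(Debug, PartialEq, Eq)]
pub enum WaitOutcome {
    Woken,
    ValueMismatch,
    TimedOut,
    Interrupted,
    /// Negative results reuse the existing transport error space.
    TransportError(i32),
    /// Non-negative value this sketch does not know about.
    UnknownStatus(i32),
}

pub fn decode_wait(result: i32) -> WaitOutcome {
    match result {
        PARK_WOKEN => WaitOutcome::Woken,
        PARK_VALUE_MISMATCH => WaitOutcome::ValueMismatch,
        PARK_TIMED_OUT => WaitOutcome::TimedOut,
        PARK_INTERRUPTED => WaitOutcome::Interrupted,
        r if r < 0 => WaitOutcome::TransportError(r),
        r => WaitOutcome::UnknownStatus(r),
    }
}

/// Wake completions carry the woken-thread count, or a negative error.
pub fn decode_wake(result: i32) -> Result<u32, i32> {
    if result >= 0 {
        Ok(result as u32)
    } else {
        Err(result)
    }
}
```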

## Ring Ownership And Dispatch Context

Park operations use the process capability ring for submission and CQE
delivery, but blocking wait is not an ordinary long-lived runtime call. A
runtime must not hold `RuntimeRingClient` while the thread is parked in
`CAP_OP_PARK`; otherwise no sibling thread in the same process can borrow
the same ring client to submit `CAP_OP_UNPARK`.

The runtime contract for park operations is:

- `capos-rt` owns a process-wide park submission/completion path separate
  from the generic request-buffer `RuntimeRingClient` pending-call list;
- park wait reserves a unique `user_data` value, writes the SQE while holding
  the runtime's ring-submission lock, records a park-wait completion slot in
  runtime-owned memory, and releases the ring-submission lock before entering
  `cap_enter`;
- park wait sets `CAP_SQE_THREAD_OWNED` and `call_id` to the current thread id
  so a sibling thread cannot drain the wait and park the wrong `ThreadRef`;
- the park `user_data` namespace is reserved by the runtime so ordinary
  generic clients cannot accidentally claim a park completion;
- all runtime CQ draining must route reserved park `user_data` completions to
  the park-wait slot instead of treating them as generic client completions;
- if another thread drains the waiter CQE before the waiting thread returns
  from `cap_enter`, the waiting thread reads the already-recorded status from
  that park-wait slot;
- park wake may use the ordinary serialized ring submission path because it
  completes without parking the caller's thread.
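The `user_data` routing rules in the list above can be modelled as below. The tag bit, `CompletionRouter`, and its methods are hypothetical; the contract reserves a namespace but does not define its encoding, and a real runtime would likely use fixed storage rather than a `HashMap`.

```rust
use std::collections::HashMap;

/// Hypothetical namespace split: the top bit marks park-wait completions.
pub const PARK_USER_DATA_TAG: u64 = 1 << 63;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Cqe {
    pub user_data: u64,
    pub result: i32,
}

#[derive(Default)]
pub struct CompletionRouter {
    next_park_id: u64,
    /// One slot per outstanding park wait; `None` until its CQE is drained.
    park_slots: HashMap<u64, Option<i32>>,
    /// Completions that belong to ordinary generic clients.
    pub generic: Vec<Cqe>,
}

impl CompletionRouter {
    /// Reserve a unique tagged `user_data` before writing the wait SQE.
    pub fn reserve_park_slot(&mut self) -> u64 {
        let user_data = PARK_USER_DATA_TAG | self.next_park_id;
        self.next_park_id += 1;
        self.park_slots.insert(user_data, None);
        user_data
    }

    /// Called by whichever thread drains the completion queue.
    pub fn route(&mut self, cqe: Cqe) {
        if (cqe.user_data & PARK_USER_DATA_TAG) != 0 {
            if let Some(slot) = self.park_slots.get_mut(&cqe.user_data) {
                *slot = Some(cqe.result);
            }
        } else {
            self.generic.push(cqe);
        }
    }

    /// Called by the waiter after `cap_enter` returns; frees the slot.
    pub fn take_park_status(&mut self, user_data: u64) -> Option<i32> {
        let status = (*self.park_slots.get(&user_data)?)?;
        self.park_slots.remove(&user_data);
        Some(status)
    }
}
```

If a sibling thread drains the waiter's CQE first, the waiter still finds its status in the slot, which is the behavior the sixth bullet requires.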

`CAP_OP_PARK` is syscall-context only. Timer ring polling and any future
interrupt-context ring drain must leave it unconsumed because consuming it can
block the current thread and mutate scheduler state. `CAP_OP_UNPARK` also
starts as syscall-context only; widening wake to timer polling would need a
separate review of scheduler locking and completion delivery.

This design preserves one process ring and the single blocked `cap_enter`
waiter rule. A thread blocked in `Park` is not the process ring's
`CapEnter` waiter, so a sibling can still enter the kernel to submit wake,
Timer, IPC, or ordinary capability work through the same process ring.

## Wait And Wake Semantics

`wait` is atomic with respect to `wake` for the same key:

1. validate the SQE shape (including thread ownership) and the authority cap;
2. verify `call_id` names the current thread so a sibling cannot park on behalf
   of the waiter;
3. validate the user address shape and derive the private or shared park key;
4. lock the current process `AddressSpace` across validation and the user-word
   read for private keys; future shared keys must additionally prove mapping
   identity or pin the backing object;
5. take the park bucket lock;
6. read the 32-bit user word while the bucket lock is held;
7. compare the loaded value with `expected`;
8. if the value differs, post `PARK_VALUE_MISMATCH` without blocking;
9. if the value matches and the timeout is zero, post
   `PARK_TIMED_OUT` without blocking;
10. otherwise, record the current `ThreadRef`, key, timeout deadline, and
   `user_data`, then block only the current thread.

The user-word read, comparison, and enqueue are serialized with `wake` by the
park scheduler path, and the read itself occurs while the process
`AddressSpace` mutex is held. This prevents a page-table validation/use race
and the classic lost wake where a waiter reads the old value, a sibling stores
the new value and wakes no one, and the waiter then parks based on the stale
read. Shared park-words still need mapping provenance or object pinning so a
MemoryObject-derived key cannot be swapped out from under key derivation. The
user word is not a kernel-owned mutex. Runtime code must use normal atomic
load/store and memory-ordering rules around the park word.
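A userspace model of steps 5 through 10, assuming a `Mutex`-guarded bucket stands in for the kernel's park bucket lock. `Bucket`, `Waiter`, and `WaitStep` are illustrative names, and the model uses a heap-backed map only for brevity (the kernel paths must not allocate). The point it shows is ordering: the word is loaded only after the bucket lock is taken, so a waker that stores a new value and then takes the same lock cannot slip between the load and the enqueue.

```rust
use std::collections::{HashMap, VecDeque};
use std::sync::atomic::{AtomicU32, Ordering};
use std::sync::Mutex;

#[derive(Debug, PartialEq, Eq)]
pub enum WaitStep {
    ValueMismatch,
    TimedOut,
    /// The waiter was enqueued; the real kernel would now block the thread.
    Blocked,
}

#[derive(Debug)]
pub struct Waiter {
    pub thread_ref: u32,
    pub user_data: u64,
}

/// Model of one park bucket; keys are opaque `u64` values here.
#[derive(Default)]
pub struct Bucket {
    waiters: Mutex<HashMap<u64, VecDeque<Waiter>>>,
}

impl Bucket {
    pub fn wait(
        &self,
        key: u64,
        word: &AtomicU32,
        expected: u32,
        timeout_ns: u64,
        waiter: Waiter,
    ) -> WaitStep {
        // Step 5: take the bucket lock first.
        let mut queues = self.waiters.lock().unwrap();
        // Steps 6 and 7: load and compare while the lock is held.
        if word.load(Ordering::SeqCst) != expected {
            return WaitStep::ValueMismatch; // step 8
        }
        if timeout_ns == 0 {
            return WaitStep::TimedOut; // step 9
        }
        // Step 10: enqueue under the same lock, then block.
        queues.entry(key).or_default().push_back(waiter);
        WaitStep::Blocked
    }

    pub fn queued(&self, key: u64) -> usize {
        self.waiters.lock().unwrap().get(&key).map_or(0, |q| q.len())
    }
}
```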

`wake` derives the same key, removes at most `len` valid waiters (the SQE's wake limit) from that
key's FIFO list, posts `PARK_WOKEN` completions to the waiting process
ring using the completion credits reserved when those waiters parked, and marks
those `ThreadRef` values runnable after generation checks. A wake SQE is
consumed only when the kernel can also post the wake caller's own CQE; if that
ordinary CQ slot is not available, no waiters are removed and the SQE remains
pending like other uncompletable ring work. Stale waiters caused by thread or
process generation mismatch are drained without writing to userspace, release
their reserved completion credits, and do not count as successfully woken.
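A sketch of the wake rules under simplifying assumptions: one key's FIFO is a `VecDeque`, thread liveness is a caller-supplied lookup, and stale waiters are drained only as they are encountered while searching for valid ones. All names (`Waiter`, `WakeReport`, `wake`) are hypothetical.

```rust
use std::collections::VecDeque;

#[derive(Debug)]
pub struct Waiter {
    pub thread: u32,
    /// Thread generation recorded when the waiter parked.
    pub generation: u32,
    pub user_data: u64,
}

#[derive(Debug, Default)]
pub struct WakeReport {
    /// `user_data` of each waiter that gets a `PARK_WOKEN` completion.
    pub woken: Vec<u64>,
    /// Reserved credits released for stale waiters; no CQE is written.
    pub stale_credits_released: u32,
}

/// Returns `None` when the wake SQE must stay pending.
pub fn wake(
    queue: &mut VecDeque<Waiter>,
    max_wake: u32,
    caller_cq_slot_free: bool,
    live_generation: impl Fn(u32) -> Option<u32>,
) -> Option<WakeReport> {
    // Without room for the caller's own CQE, remove no waiters at all.
    if !caller_cq_slot_free {
        return None;
    }
    let mut report = WakeReport::default();
    while (report.woken.len() as u32) < max_wake {
        let waiter = match queue.pop_front() {
            Some(w) => w,
            None => break,
        };
        if live_generation(waiter.thread) == Some(waiter.generation) {
            report.woken.push(waiter.user_data);
        } else {
            // Stale: drained silently and not counted as woken.
            report.stale_credits_released += 1;
        }
    }
    Some(report)
}
```

The wake caller's completion would then carry `report.woken.len()` as its result.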

Timeouts use the same monotonic time base as `Timer`. The kernel may convert
nanoseconds to scheduler ticks internally, but the ABI remains nanoseconds.
Finite deadlines post `PARK_TIMED_OUT` through the waiting process ring
using the waiter's reserved completion credit and wake the blocked thread if
the thread generation still matches.
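A sketch of the timeout encoding, assuming the relative timeout arrives in `pipeline_dep` as described in the ABI table. Treating an overflowing deadline as "never" and rounding ticks up are choices of this sketch, not stated contract behavior; `Deadline`, `deadline_from_sqe`, and `ns_to_ticks_ceil` are hypothetical names.

```rust
/// `u64::MAX` in `pipeline_dep` means no timeout.
pub const NO_TIMEOUT: u64 = u64::MAX;

#[derive(Debug, PartialEq, Eq)]
pub enum Deadline {
    /// Zero timeout: post `PARK_TIMED_OUT` without blocking.
    Immediate,
    /// Block until woken.
    Never,
    /// Absolute monotonic deadline in nanoseconds.
    At(u64),
}

pub fn deadline_from_sqe(now_ns: u64, relative_ns: u64) -> Deadline {
    match relative_ns {
        0 => Deadline::Immediate,
        NO_TIMEOUT => Deadline::Never,
        // Assumption: a deadline past the end of the clock never fires.
        rel => match now_ns.checked_add(rel) {
            Some(at) => Deadline::At(at),
            None => Deadline::Never,
        },
    }
}

/// Internal tick conversion; rounds up so a wait does not expire early.
pub fn ns_to_ticks_ceil(ns: u64, tick_ns: u64) -> u64 {
    assert!(tick_ns > 0, "tick length must be non-zero");
    ns / tick_ns + u64::from(ns % tick_ns != 0)
}
```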

Explicit wake, timeout, cancellation, process exit, and unmap/revoke cleanup
can race. Any such race must produce exactly one waiter completion or
cleanup-consumption path.
Once any path consumes the waiter record, the other racing paths must observe it
as gone and must not post a second CQE or wake a later `ThreadRef`.
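The exactly-once rule can be modelled with a single claim on the waiter record. This sketch uses an atomic compare-exchange; the kernel may instead serialize the claim under its park or scheduler locks. `WaiterRecord` and `Consumer` are hypothetical names.

```rust
use std::sync::atomic::{AtomicU8, Ordering};

const REGISTERED: u8 = 0;

/// The racing paths named in the contract.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
#[repr(u8)]
pub enum Consumer {
    Wake = 1,
    Timeout = 2,
    Cancel = 3,
    Exit = 4,
    Unmap = 5,
}

pub struct WaiterRecord {
    state: AtomicU8,
}

impl WaiterRecord {
    pub const fn new() -> Self {
        Self { state: AtomicU8::new(REGISTERED) }
    }

    /// Returns true for exactly one caller; that caller owns the completion
    /// or cleanup. Every other path sees the record as already gone.
    pub fn try_consume(&self, by: Consumer) -> bool {
        self.state
            .compare_exchange(REGISTERED, by as u8, Ordering::AcqRel, Ordering::Acquire)
            .is_ok()
    }

    /// Raw tag of the path that consumed the record, if any.
    pub fn consumed_by(&self) -> Option<u8> {
        match self.state.load(Ordering::Acquire) {
            REGISTERED => None,
            tag => Some(tag),
        }
    }
}
```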

Process exit removes every park waiter whose pid/process generation matches
the exiting process. Thread exit removes that thread's own park waiter before
the thread record can be retained for join observation. These cleanup paths
must not allocate.

Unmap, mapping revoke, and address-space teardown remove or fail private
waiters for the affected key/generation before the old virtual address range is
made reusable for unrelated mappings. A wake or timeout racing with cleanup
must either complete the old waiter under its original generation or observe
that cleanup already consumed it; it must not post a completion to a new owner
of the same numeric address.

## Resource Accounting

Park waits are bounded by the process thread ledger. A thread can be in only
one scheduler block reason, so live park waiters cannot exceed live threads.
The first private ParkSpace implementation stores the wait node in
thread-owned block state and links it into a fixed process-owned waiter table.
That is valid only because private ParkSpace caps are process-local and the
first key is the process address space plus user virtual address.
`SharedParkSpace` support must move to object-owned fixed buckets scoped to
MemoryObject identity. Wait, wake, timeout, and process-exit cleanup must not
allocate.

Registering a blocking wait reserves one deferred CQE credit in the waiting
process. Ordinary completion posting treats reserved credits as unavailable, so
wake and timeout paths can always post the waiter completion without losing the
waiter. If the kernel cannot reserve that credit, it must not enqueue or block
the wait; it either leaves the SQE pending until capacity exists or posts a
negative completion for the wait attempt without consuming a waiter slot.
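The credit rule can be sketched as a small ledger over the completion queue. `CqLedger` and its methods are hypothetical; the sketch only shows that ordinary posting sees reserved slots as unavailable while waiter completions always have room.

```rust
/// Model of completion-queue capacity for one process ring.
#[derive(Debug)]
pub struct CqLedger {
    capacity: u32,
    /// Slots holding CQEs that userspace has not consumed yet.
    occupied: u32,
    /// Slots promised to parked waiters.
    reserved: u32,
}

impl CqLedger {
    pub fn new(capacity: u32) -> Self {
        Self { capacity, occupied: 0, reserved: 0 }
    }

    fn free_for_ordinary(&self) -> u32 {
        self.capacity - self.occupied - self.reserved
    }

    /// Must succeed before a wait is enqueued or the thread is blocked.
    pub fn reserve_for_waiter(&mut self) -> bool {
        if self.free_for_ordinary() == 0 {
            return false;
        }
        self.reserved += 1;
        true
    }

    /// Ordinary completions may not use reserved slots.
    pub fn post_ordinary(&mut self) -> bool {
        if self.free_for_ordinary() == 0 {
            return false;
        }
        self.occupied += 1;
        true
    }

    /// Wake or timeout delivery converts the reservation into a CQE.
    pub fn post_waiter_completion(&mut self) {
        assert!(self.reserved > 0, "waiter completion without a credit");
        self.reserved -= 1;
        self.occupied += 1;
    }

    /// Exit or stale-waiter drain: the credit is returned, no CQE is written.
    pub fn release_reservation(&mut self) {
        assert!(self.reserved > 0, "no reservation to release");
        self.reserved -= 1;
    }

    /// Userspace consumed one CQE.
    pub fn consume(&mut self) {
        assert!(self.occupied > 0, "no CQE to consume");
        self.occupied -= 1;
    }
}
```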

`ParkSpace` creation is charged as ordinary process capability/table state.
If the first implementation needs per-process bucket storage beyond the cap
object itself, that storage must be reserved before the ParkSpace is
published and released when the process exits or the cap is finally dropped.

In the first private implementation, the waiter table is process-owned and
survives release of the ParkSpace handle. `CAP_OP_RELEASE` of the last
capability handle removes submit authority but cannot free a parked waiter's
storage. A waiter can still receive a `PARK_WOKEN` CQE from a wake
operation that already resolved the authority object, a `PARK_TIMED_OUT`
CQE from a finite deadline, or a future `PARK_INTERRUPTED` CQE from an
explicit cancellation path. Thread or process exit drains the wait node without
posting a CQE to the exiting thread/process and releases the reserved
completion credit. If a runtime drops the last ParkSpace while it has
indefinite waiters, it can deadlock its own process, but it cannot create a
use-after-free or leak authority outside that process. Future SharedParkSpace
storage must use explicit non-cap-table waiter pins so object-owned buckets are
not freed while parked waiters remain.

`SharedParkSpace` storage is charged to the MemoryObject-derived object when shared
parking lands. It must not create a second unbounded resource path where a
holder can allocate wait queues by touching many offsets.

## Security Invariants

- Holding a ParkSpace or SharedParkSpace authorizes blocking/waking, not memory
  access. Wait still requires a readable user word.
- Private ParkSpace caps are process-local and non-transferable in the first
  implementation.
- Shared park authority must be derived from MemoryObject identity and offset,
  not from another process's virtual address.
- Park wait blocks the current thread, not the whole process.
- Park wait SQEs are thread-owned; a non-owner `cap_enter` leaves the SQE at
  the ring head instead of parking the wrong thread.
- Park wake can only make generation-checked ThreadRef values runnable.
- Park completions are posted to the waiting process ring using the waiter
  SQE's `user_data`.
- Blocking wait registration reserves one CQE credit for the eventual waiter
  completion, and wake must not remove a waiter unless that credit exists.
- `CAP_OP_PARK` is dispatched only from syscall-context `cap_enter` and
  never from timer or interrupt-context ring polling.
- A parked private ParkSpace waiter is stored in process-owned fixed storage;
  future SharedParkSpace waiters must pin the authority object backing their
  bucket table until wake, timeout, thread exit, or process exit removes the
  waiter.
- One process ring still has at most one blocked `cap_enter` waiter in 7.2;
  park wait does not create an extra blocked ring waiter.
- Private ParkSpace wait reads hold the process `AddressSpace` lock across
  validation and the user-word read. SharedParkSpace park-words remain disabled
  until MemoryObject mapping provenance or explicit object pins cover shared
  key derivation.

## Measurement Handoff

4.5.4 measured failed wait and empty wake before real threads existed. That
result chooses a compact capability-authorized operation as the starting ABI
for 7.2 rather than a generic Cap'n Proto `wait`/`wake` method pair.

4.5.5 is closed for the first real thread-blocking path. It measures:

- value-mismatch wait;
- empty wake;
- wait-to-block;
- wake-to-runnable;
- wake-to-resume through `cap_enter`.

The 2026-04-25 QEMU sample printed:

```text
[thread-lifecycle] park path avg cycles: failed_wait=6778 empty_wake=6840 wait_to_block=55994326 wake_to_runnable=28219 wake_to_resume=28000684
```

The compact shape still holds for this slice: `CAP_OP_PARK` and
`CAP_OP_UNPARK` remain the production runtime ABI target, while
`ParkBench` remains measurement-only.

## Implementation Order

1. [x] Add `ParkSpace` and `SharedParkSpace` marker interfaces plus compact opcode
   constants.
2. [x] Add a process-local ParkSpace grant path next to `ThreadControl` and
   `ThreadSpawner`; keep it non-transferable.
3. [x] Add thread-owned `Park` block state and fixed private waiter
   storage with no wait/wake allocation.
4. [x] Dispatch `CAP_OP_PARK` and `CAP_OP_UNPARK` against
   ParkSpace for private address-space keys.
5. [x] Add QEMU smoke coverage for mismatch, timeout, wake-one, and wake-many.
   Safe runtime park wrappers remain a later capos-rt slice.
6. [x] Run 4.5.5 blocked/resume measurements and fold the result into the
   final ABI decision.
7. [ ] Drain or fail private waiters on VirtualMemory unmap, mapping revoke,
   and address-space generation change before the affected virtual address
   range can be reused.
8. [ ] Add MemoryObject-derived SharedParkSpace support only after mapping provenance
   or object pins cover shared key derivation under the same validation/use
   discipline.

## Validation Plan

The first implementation smoke should create multiple threads in one process,
park one or more threads on a userspace park word, wake them through the same
ParkSpace, prove timeout and value-mismatch paths, and show that process exit
drains pending waits. The runtime smoke should use the same capability through
`capos-rt` so future Go work has a direct handoff.
