# Proposal: Resource Accounting and Quotas

Cross-cutting resource profiles, ledgers, reservation semantics, and
verification gates for bounded capOS sessions, services, drivers, storage,
networking, tests, and future language runtimes.


## Related

- [Authority Accounting](../authority-accounting-transfer-design.md) records
  the current transfer and resource-accounting invariants.
- [Memory Management](../architecture/memory.md) documents the current
  frame-grant and MemoryObject accounting baseline.
- [Go VirtualMemory Contract](../backlog/go-virtual-memory-contract.md)
  provides the first concrete virtual-reservation versus physical-commit
  ledger split for a future language runtime.

## Problem

capOS already has several resource limits: cap slots, frame grants, timer
waiters, thread and kernel-stack quotas, ring scratch, and spawn preflight
checks. Those are useful but fragmented. Local accounts, guests, anonymous
callers, external sessions, service accounts, drivers, storage services,
network stacks, tests, and future runtimes all need the same rule:

```text
No workload receives implicit unlimited consumption of finite system resources.
```

This proposal defines the common model. It extends the Stage S.9 authority
graph and per-process `ResourceLedger` design rather than replacing it.

## Principles

- A `ResourceProfile` is a policy template, not authority.
- Actual enforcement happens through ledgers, capability wrappers, brokers,
  supervisors, and kernel/resource-service admission checks.
- Every resource class has one ledger of record. Mirrors for status, metrics,
  or audit are derived views and must not be used for enforcement.
- Reservation happens before side effects. Commit publishes the resource.
  Every charge must eventually be released or rolled back, whether the
  operation succeeds, fails, times out, is revoked, or its process exits.
- Identity metadata selects policy. It never consumes, releases, or bypasses
  quota by itself.
- Quota donation is explicit. A caller may donate budget to a service call,
  but a service cannot silently spend the caller's unrelated budget.

## Resource Profiles

Resource profiles are named templates selected by account records, manifest
seed data, service policy, external admission rules, or test manifests. A
profile should contain policy intent, not raw authority:

```capnp
struct ResourceProfile {
  profileId @0 :Data;
  versionId @1 :Data;
  epoch @2 :UInt64;
  homeQuotaBytes @3 :UInt64;
  tempQuotaBytes @4 :UInt64;
  processLimit @5 :UInt32;
  threadLimit @6 :UInt32;
  capLimit @7 :UInt32;
  memoryCommitLimitBytes @8 :UInt64;
  frameGrantLimitPages @9 :UInt64;
  memoryVirtualReservationLimitBytes @20 :UInt64;
  endpointQueueLimit @10 :UInt32;
  inFlightCallLimit @11 :UInt32;
  pendingIpcSubmissionLimit @12 :UInt32;
  ringScratchLimitBytes @13 :UInt64;
  logQuotaBytesPerWindow @14 :UInt64;
  networkProfile @15 :Text;
  cpuBudgetUsPerWindow @16 :UInt64;
  cpuWindowUs @17 :UInt64;
  timerWaiterLimit @18 :UInt32;
  launcherProfile @19 :Text;
}
```

The profile is evaluated by a broker or supervisor. The result is a set of
ledger limits, wrapper caps, service-specific budgets, and spawn constraints.
Changing a profile does not change a running workload until a trusted service
issues new limits, revokes old caps, or starts a replacement workload.
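The translation step can be sketched as a pure function from a profile to a snapshot of limits. This is a minimal illustration, not the capOS broker API: `ProfileSubset`, `LedgerLimits`, `TranslateError`, and `translate` are hypothetical names, and only a few schema fields are modelled.

```rust
/// Hypothetical subset of `ResourceProfile`; names mirror the schema above.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct ProfileSubset {
    pub epoch: u64,
    pub process_limit: u32,
    pub thread_limit: u32,
    pub cap_limit: u32,
    pub memory_commit_limit_bytes: u64,
}

/// Concrete limits issued to ledger owners (hypothetical shape).
/// This is a copied snapshot, not a reference back to the profile.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct LedgerLimits {
    pub policy_epoch: u64,
    pub max_processes: u32,
    pub max_threads: u32,
    pub max_caps: u32,
    pub max_commit_bytes: u64,
}

#[derive(Debug, PartialEq, Eq)]
pub enum TranslateError {
    /// The profile was written under a different policy epoch.
    StaleEpoch { profile: u64, current: u64 },
}

/// Broker-side translation: fail closed on a stale epoch, otherwise
/// snapshot the limits that ledger owners will enforce.
pub fn translate(
    profile: &ProfileSubset,
    current_epoch: u64,
) -> Result<LedgerLimits, TranslateError> {
    if profile.epoch != current_epoch {
        return Err(TranslateError::StaleEpoch {
            profile: profile.epoch,
            current: current_epoch,
        });
    }
    Ok(LedgerLimits {
        policy_epoch: profile.epoch,
        max_processes: profile.process_limit,
        max_threads: profile.thread_limit,
        max_caps: profile.cap_limit,
        max_commit_bytes: profile.memory_commit_limit_bytes,
    })
}
```

Because `LedgerLimits` is a copy, editing the profile afterwards leaves already-issued limits untouched, which matches the rule that a running workload changes only when a trusted service issues new limits.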

## Ledgers of Record

The ledger of record depends on the resource owner:

| Resource | Ledger of record |
| --- | --- |
| Capability slots | Process `CapTable` / process resource ledger |
| Processes and child subtrees | Supervisor or `ProcessSpawner` ledger |
| Threads and kernel stacks | Process-owned thread/kernel-stack ledger |
| Anonymous virtual reservations | Address-space or VM service reservation ledger |
| Anonymous committed memory | Address-space or VM service ledger |
| Physical frames and frame grants | Frame allocator / holder ledger |
| MemoryObject mappings | Per-process frame-grant ledger plus address-space tracking |
| Endpoint queues | Endpoint object ledger |
| In-flight calls and result caps | Caller/callee transport ledger |
| Pending IPC submissions | Process ring/resource ledger |
| Ring scratch and request buffers | Process ring/resource ledger |
| Timer sleeps and waiters | Timer service waiter ledger |
| Log bytes | Log service token bucket / retention ledger |
| Storage bytes and namespace entries | Store/Namespace service ledger |
| Temporary, cache, and home storage | Store/Namespace scoped sub-ledgers |
| Network listeners, sockets, and bytes | Network service or socket cap ledger |
| CPU share and runtime budget | Scheduler or scheduling-context ledger |
| DMA pool bytes, DMA buffer count, descriptor/ring depth, MMIO mappings, interrupt holds, in-flight DMA submissions | Device-manager ledgers, later |
| Model tokens, provider calls, tool calls | Provider/agent gateway ledgers, later |

No second module should maintain an independent enforcement counter for the
same resource. A status service may cache values for display only if it treats
the ledger owner as authoritative and never grants or rejects based on stale
cache state.

## Relationship To Tickless And Realtime CPU Authority

The CPU terms in
[Tickless and Realtime Scheduling](tickless-realtime-scheduling-proposal.md)
reuse this resource-accounting model:

```text
ResourceProfile.cpuBudgetUsPerWindow:
  coarse policy template only. Selecting a profile does not mint executable
  CPU-time authority.

ResourceLedger CPU budget:
  coarse best-effort accounting before realtime contexts exist, and the ledger
  of record for non-realtime CPU share/runtime limits.

SchedulingContext:
  spendable CPU-time object for realtime or admitted execution. It carries
  budget, period, relative deadline, priority/criticality, CPU mask, and
  overrun policy.

CpuIsolationLease:
  CPU placement, exclusivity, and noise/nohz authority. It is not CPU budget
  and must charge consumed runtime to a SchedulingContext or scheduler
  ResourceLedger.

NoHzEligibility / NoHzActivation:
  reviewed eligibility plus scheduler-proven current CPU state. They do not
  grant resource credit.

RealtimeIsland:
  admitted bundle consuming SchedulingContexts plus memory, device, ring, and
  optional CpuIsolationLease reservations.
```

Do not create a second CPU budget system under nohz, SQPOLL, or realtime
terminology. Those features select placement and execution mode; CPU time is
still charged through scheduling-context or scheduler-ledger authority.

## Reservation Lifecycle

Every resource allocation follows the same lifecycle:

```text
reserve(request, limits, expected_state)
  -> reserved(token)
  -> denied(reason)

commit(token)
  -> committed(resource)
  -> rollback(token, reason)

release(resource)
  -> released
```

Rules:

- `reserve` validates structure, bounds, ownership, and available quota before
  any externally visible mutation.
- `commit` publishes exactly the resource that was reserved.
- `rollback` restores all ledgers touched by the reservation.
- `release` is idempotent from the caller's perspective but changes ledger
  state at most once.
- Process exit and cap revocation bulk-release all resources owned only by the
  exiting process or revoked hold edge.
- Stale handles, exhausted quotas, malformed limits, and unknown profile
  versions fail closed with typed errors or denials, not panics.

The S.9 transfer transaction is the concrete model for cap transfer and spawn.
Other services should reuse the same preflight, reservation, commit, rollback,
and audit vocabulary.
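The lifecycle rules above can be sketched for a single counted resource class. The types below (`Ledger`, `Token`, `Resource`, `Denied`) are hypothetical and are not the S.9 implementation; the sketch also assumes a token or resource is always handed back to the ledger that issued it.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum Denied {
    ZeroAmount,
    Overflow,
    OverLimit,
}

/// Proof that `amount` units are held but not yet published.
/// Not `Clone`, so it can be committed or rolled back only once.
#[derive(Debug)]
pub struct Token {
    amount: u64,
}

/// A published resource; remembers whether its charge was released.
#[derive(Debug)]
pub struct Resource {
    amount: u64,
    released: bool,
}

#[derive(Debug)]
pub struct Ledger {
    limit: u64,
    reserved: u64,
    committed: u64,
}

impl Ledger {
    pub fn new(limit: u64) -> Self {
        Self { limit, reserved: 0, committed: 0 }
    }

    /// Units currently charged, whether reserved or committed.
    pub fn in_use(&self) -> u64 {
        self.reserved + self.committed
    }

    /// Validate and charge before any externally visible mutation.
    pub fn reserve(&mut self, amount: u64) -> Result<Token, Denied> {
        if amount == 0 {
            return Err(Denied::ZeroAmount);
        }
        let total = self.in_use().checked_add(amount).ok_or(Denied::Overflow)?;
        if total > self.limit {
            return Err(Denied::OverLimit);
        }
        self.reserved += amount;
        Ok(Token { amount })
    }

    /// Publish exactly what was reserved.
    pub fn commit(&mut self, token: Token) -> Resource {
        self.reserved -= token.amount;
        self.committed += token.amount;
        Resource { amount: token.amount, released: false }
    }

    /// Undo a reservation that never became visible.
    pub fn rollback(&mut self, token: Token) {
        self.reserved -= token.amount;
    }

    /// Safe to call repeatedly; the ledger changes at most once.
    pub fn release(&mut self, resource: &mut Resource) {
        if !resource.released {
            self.committed -= resource.amount;
            resource.released = true;
        }
    }
}
```

Taking `Token` by value in `commit` and `rollback` lets the compiler reject double-commit and commit-after-rollback, while the `released` flag gives `release` its idempotent behaviour.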

## Donation and Shared Services

Shared services handle many sessions in one process. They need bounded
server-side state without treating caller identity as authority.

Donation is a lease from one ledger to another for a named operation:

```text
Donation {
  donorSessionId
  donorLedgerId
  receiverServiceId
  resourceClass
  amount
  expiresAtMs
  callId
}
```

A donation can pay for queue entries, scratch bytes, temporary storage,
outbound bytes, model tokens, or CPU budget needed to serve one request. It
does not grant unrelated authority to the service and does not let the caller
spend the service's own management budget. When the call finishes, times out,
is cancelled, or the session exits, unused donation is returned and used
donation is charged to the donor's accounting record.

Services may also have their own base budgets for resident state. Per-client
budgets and service base budgets are separate ledger entries so a single
client cannot hide consumption inside the service account.
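A donation lease for one resource class might look like the following sketch. `DonorLedger`, `Donation`, and `DonationError` are hypothetical names; the sketch models only `amount`, `expiresAtMs`, and `callId` from the record above and assumes the caller supplies the current time.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum DonationError {
    InsufficientBudget,
    Expired,
    Exhausted,
}

/// The donor's ledger for one resource class.
#[derive(Debug)]
pub struct DonorLedger {
    available: u64,
    charged: u64,
}

/// Budget leased to a service for a single call.
#[derive(Debug)]
pub struct Donation {
    pub call_id: u64,
    remaining: u64,
    used: u64,
    expires_at_ms: u64,
}

impl DonorLedger {
    pub fn new(available: u64) -> Self {
        Self { available, charged: 0 }
    }

    pub fn available(&self) -> u64 {
        self.available
    }

    pub fn charged(&self) -> u64 {
        self.charged
    }

    /// Move `amount` out of the donor's budget into a lease.
    pub fn donate(
        &mut self,
        call_id: u64,
        amount: u64,
        expires_at_ms: u64,
    ) -> Result<Donation, DonationError> {
        if amount > self.available {
            return Err(DonationError::InsufficientBudget);
        }
        self.available -= amount;
        Ok(Donation { call_id, remaining: amount, used: 0, expires_at_ms })
    }

    /// End of call, timeout, cancel, or session exit: return the unused
    /// part and charge the used part. Consumes the lease.
    pub fn settle(&mut self, donation: Donation) {
        self.available += donation.remaining;
        self.charged += donation.used;
    }
}

impl Donation {
    /// The service spends only from the lease, never from the donor's
    /// other budget.
    pub fn spend(&mut self, amount: u64, now_ms: u64) -> Result<(), DonationError> {
        if now_ms >= self.expires_at_ms {
            return Err(DonationError::Expired);
        }
        if amount > self.remaining {
            return Err(DonationError::Exhausted);
        }
        self.remaining -= amount;
        self.used += amount;
        Ok(())
    }
}
```

The service holds only the `Donation`, so it has no path to the donor's remaining budget, and `settle` taking the lease by value means each donation is settled once.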

## Profile Binding

Profiles are selected by policy inputs:

- manifest-seeded operators and recovery identities,
- local account records,
- service account records,
- guest and anonymous admission rules,
- external identity bindings,
- test manifests and QEMU smoke profiles,
- future driver, storage, network, and runtime launch policies.

The broker or supervisor translates those profiles into concrete limits at
session creation, spawn, service start, or cap minting time. The translation
must record:

- profile ID, version ID, and policy epoch,
- ledger owner and resource class,
- hard limit and optional token-bucket window,
- source policy and approving broker/supervisor,
- audit record ID for the grant,
- expiry or revocation epoch if the budget is leased.

A session can carry profile summaries for audit and display, but the summaries
do not enforce quota. Enforcement lives where the resource is created or used.

## Resource Classes

### Kernel and Process Resources

Cap slots, process count, thread count, kernel stacks, pending IPC
submissions, ring scratch, outstanding calls, and endpoint queue entries are
kernel or kernel-object resources. Their checks belong before spawn, thread
creation, transfer, IPC, and ring dispatch side effects.

### Memory

Current `VirtualMemory` mappings and held `MemoryObject` caps charge the
process-owned frame-grant ledger of record. The address space also tracks
borrowed object-backed pages, under the same limit, so unmap and teardown can
distinguish them from anonymous pages; that tracking is not a second
enforcement counter. Future reserve/commit/decommit semantics split virtual
reservation from committed physical memory: `VirtualMemory.reserve` charges a
virtual-reservation ledger, while `VirtualMemory.commit` and the compatibility
call `VirtualMemory.map` charge the committed-memory/frame ledger before pages
become accessible. Decommit releases physical commit budget while preserving
virtual reservation budget until unmap.
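The reservation/commit split can be illustrated with aggregate byte counters. This is a simplified sketch: `VmLedger` and `VmDenied` are hypothetical, address ranges are not tracked, and requiring decommit before unmap is an assumption of the sketch rather than a stated capOS rule.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum VmDenied {
    ReservationLimit,
    CommitLimit,
    NotReserved,
    NotCommitted,
    StillCommitted,
}

/// Aggregate byte counters; invariant: committed <= reserved.
#[derive(Debug)]
pub struct VmLedger {
    reservation_limit: u64,
    commit_limit: u64,
    reserved: u64,
    committed: u64,
}

impl VmLedger {
    pub fn new(reservation_limit: u64, commit_limit: u64) -> Self {
        Self { reservation_limit, commit_limit, reserved: 0, committed: 0 }
    }

    pub fn reserved(&self) -> u64 {
        self.reserved
    }

    pub fn committed(&self) -> u64 {
        self.committed
    }

    /// Charge address space only; no physical memory is consumed.
    pub fn reserve(&mut self, bytes: u64) -> Result<(), VmDenied> {
        match self.reserved.checked_add(bytes) {
            Some(total) if total <= self.reservation_limit => {
                self.reserved = total;
                Ok(())
            }
            _ => Err(VmDenied::ReservationLimit),
        }
    }

    /// Charge physical commit before pages become accessible.
    pub fn commit(&mut self, bytes: u64) -> Result<(), VmDenied> {
        let total = self.committed.checked_add(bytes).ok_or(VmDenied::CommitLimit)?;
        if total > self.reserved {
            return Err(VmDenied::NotReserved);
        }
        if total > self.commit_limit {
            return Err(VmDenied::CommitLimit);
        }
        self.committed = total;
        Ok(())
    }

    /// Release commit budget; the reservation stays charged.
    pub fn decommit(&mut self, bytes: u64) -> Result<(), VmDenied> {
        if bytes > self.committed {
            return Err(VmDenied::NotCommitted);
        }
        self.committed -= bytes;
        Ok(())
    }

    /// Release reservation budget for bytes that are no longer committed.
    pub fn unmap(&mut self, bytes: u64) -> Result<(), VmDenied> {
        if bytes > self.reserved - self.committed {
            return Err(VmDenied::StillCommitted);
        }
        self.reserved -= bytes;
        Ok(())
    }
}
```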

### Storage

Storage services own byte, object, namespace-entry, and snapshot ledgers.
`home`, `config`, `cache`, and `tmp` are separate sub-ledgers even when backed
by the same Store. Temporary session storage expires on logout or session
expiry. Cache quota may be reclaimed by policy. Home/config quota should not
be reclaimed without explicit account/storage policy.

### Logging and Audit

Log volume uses token buckets and retention limits. Audit entries required for
security state transitions should have a protected emergency path; ordinary
application logs must not starve audit. If audit storage is unavailable, the
system enters a bounded emergency mode rather than silently dropping mandatory
security events.
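A token bucket for ordinary log volume could be sketched as below. `LogBucket` is a hypothetical type; `capacity` and `window_us` stand in for a per-window byte quota such as `logQuotaBytesPerWindow`, and the protected audit path is deliberately out of scope.

```rust
/// Byte-denominated token bucket that refills `capacity` per `window_us`.
#[derive(Debug)]
pub struct LogBucket {
    capacity: u64,
    window_us: u64,
    tokens: u64,
    last_us: u64,
    suppressed: u64,
}

impl LogBucket {
    /// Returns `None` for a zero window, which would make the rate undefined.
    pub fn new(capacity: u64, window_us: u64) -> Option<Self> {
        if window_us == 0 {
            return None;
        }
        Some(Self { capacity, window_us, tokens: capacity, last_us: 0, suppressed: 0 })
    }

    /// Count of denied log attempts, suitable for a status view.
    pub fn suppressed(&self) -> u64 {
        self.suppressed
    }

    fn refill(&mut self, now_us: u64) {
        let elapsed = now_us.saturating_sub(self.last_us) as u128;
        // Widen to u128 so elapsed * capacity cannot overflow.
        let add = elapsed * self.capacity as u128 / self.window_us as u128;
        if add > 0 {
            let total = (self.tokens as u128 + add).min(self.capacity as u128);
            self.tokens = total as u64;
            // Sketch simplification: any fractional remainder is dropped.
            self.last_us = now_us;
        }
    }

    /// Admit `bytes` of log output at `now_us`, or count a suppression.
    pub fn try_log(&mut self, bytes: u64, now_us: u64) -> bool {
        self.refill(now_us);
        if bytes <= self.tokens {
            self.tokens -= bytes;
            true
        } else {
            self.suppressed += 1;
            false
        }
    }
}
```

Denials increment a bounded counter instead of emitting more log output, so a quota failure cannot itself generate unbounded diagnostics.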

### Network

Network profiles select listener authority, outbound connection classes,
socket counts, byte windows, and remote scopes. A normal local account may
receive client network caps; listener authority requires service policy,
operator policy, or an application-specific grant. Anonymous remote sessions
receive only protocol state needed to authenticate or create an account.

### CPU and Scheduling

CPU share and runtime budget belong to the scheduler or future scheduling
context. Until full scheduling-context donation exists, CPU limits can be
enforced with coarse token buckets and supervisor policy. Later realtime,
media, and driver work should use explicit period/budget/deadline records
rather than ad hoc sleep or polling loops.

### Devices and Providers

DMA pools, MMIO mappings, interrupts, cloud provider calls, LLM tokens, media
frames, and external API calls are scarce resources too. The first proof may
use service-level ledgers, but the rule is the same: one ledger of record,
typed reservation, explicit release, audit-visible denial.

For the S.11.2 userspace-driver transition, device ledgers must account for at
least DMA pool bytes, DMA buffer count, descriptor or ring depth, MMIO mappings,
interrupt holds, and in-flight DMA submissions. A `DMAPool` reservation is not
only memory allocation; it is also device-visible write authority and must be
released through the same revoke/quiesce/reset path that makes future reuse
safe.

Canonical device ledger concepts:

```text
dma_pool_bytes
dma_buffer_count
dma_descriptor_count
mmio_mapping_count
interrupt_hold_count
inflight_dma_submission_count
```

These fields are device-manager accounting concepts even if the first
implementation uses different internal names. They must have one ledger of
record. DMA pool bytes and buffer counts are not interchangeable with ordinary
`MemoryObject` ownership, because device-visible memory also carries IOVA,
descriptor, reset, and stale-completion obligations.

## Failure Semantics

Quota failure is a normal result, not a crash:

| Condition | Result |
| --- | --- |
| Malformed request | Invalid input / typed transport error |
| Caller exceeds hard limit | Quota denied / overloaded |
| Service base budget exhausted | Service overloaded |
| Donated budget exhausted | Request denied or partial result |
| Stale profile version | Denied; refresh session/profile |
| Ledger mismatch or rollback failure | Enter recovery/emergency mode |

Retry policy belongs to the caller or supervisor. Kernel and service code must
not spin, allocate unbounded retry queues, or emit unbounded diagnostics after
quota failure.
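One way to keep these results typed is an exhaustive mapping from condition to outcome. The enum and function names below are hypothetical; where the table lists alternatives (for example "Request denied or partial result"), the sketch picks the denial.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Condition {
    MalformedRequest,
    CallerOverHardLimit,
    ServiceBaseExhausted,
    DonationExhausted,
    StaleProfileVersion,
    LedgerMismatch,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Outcome {
    InvalidInput,
    QuotaDenied,
    ServiceOverloaded,
    RequestDenied,
    RefreshProfile,
    EnterRecovery,
}

/// Exhaustive match with no wildcard arm: adding a `Condition` variant
/// forces an explicit decision here instead of a silent default.
pub fn outcome(condition: Condition) -> Outcome {
    match condition {
        Condition::MalformedRequest => Outcome::InvalidInput,
        Condition::CallerOverHardLimit => Outcome::QuotaDenied,
        Condition::ServiceBaseExhausted => Outcome::ServiceOverloaded,
        Condition::DonationExhausted => Outcome::RequestDenied,
        Condition::StaleProfileVersion => Outcome::RefreshProfile,
        Condition::LedgerMismatch => Outcome::EnterRecovery,
    }
}

/// Every outcome except recovery is an ordinary result returned to the caller.
pub fn is_normal_result(outcome: Outcome) -> bool {
    outcome != Outcome::EnterRecovery
}
```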

## Audit and Status

Auditable events:

- profile-to-ledger translation,
- reservation denial,
- successful budget grant,
- donation start/commit/release,
- cap or resource revocation,
- process-exit cleanup,
- rollback or recovery-mode entry,
- administrative profile change.

Status views should expose current usage, limits, denial counts, and suppressed
diagnostic counts by resource class. They must redact sensitive account,
network, provider, and object identifiers unless the viewer holds a suitable
audit/status cap.

## Verification Gates

Before treating resource profiles as complete for any caller class, add checks
at the affected resource owner:

- Host tests for limit parsing, stale profile rejection, reservation/rollback,
  and one-ledger-of-record invariants.
- QEMU smokes proving quota denial for process/thread/cap, pending IPC
  submissions, endpoint queue, timer waiter, memory, storage, log, and network
  resources as they exist.
- Hostile exhaustion tests that do not panic, leak frames, leak cap slots, or
  leave partial child processes.
- Process-exit and revocation tests proving all charges release exactly once.
- Audit/status tests showing denial and cleanup are visible without exposing
  secrets.
- Kani or property tests for small pure ledger primitives when bounds are
  fixed enough to model.

## Relationships

- **Authority Accounting:** S.9 defines the current authority graph and
  process-ledger transaction model. This proposal generalizes the quota
  vocabulary to services, storage, networking, sessions, and future devices.
- **User Identity and Policy:** account and session resource profiles select
  templates. Brokers and supervisors translate them into ledgers and wrapper
  caps.
- **OOM Handling and Swap:** memory commitment, reclaim, and swap policy are
  the memory-specific part of this model.
- **Storage and Naming:** Store/Namespace services own storage ledgers for
  homes, config, cache, tmp, snapshots, and imports.
- **System Monitoring:** status and metrics expose derived ledger views, not
  parallel enforcement counters.

## Non-Goals

- No Unix cgroups clone as the primary abstraction.
- No identity-based quota enforcement in the kernel.
- No global mutable quota database trusted by every subsystem.
- No claim that existing code already enforces every resource class above.
- No unbounded best-effort mode for guests, anonymous callers, tests, or
  service accounts.

## Open Questions

- Which ledger IDs and status schemas should become stable ABI first?
- How much CPU-budget enforcement is useful before scheduling contexts exist?
- Should quota donation be represented as a general capability type or as
  method-specific sideband on selected service calls?
- Which storage quota primitive is first: bytes, object count, namespace
  entries, or snapshots?
- Which proofs belong in `capos-lib` versus resource-service-specific tests?
