Proposal: Resource Accounting and Quotas

Cross-cutting resource profiles, ledgers, reservation semantics, and verification gates for bounded capOS sessions, services, drivers, storage, networking, tests, and future language runtimes.

Resource Governance is the stable current-design authority for control classification, effective limits, pre-identity accounting, availability, and current enforcement gaps.
Authority Accounting records the current transfer and resource-accounting invariants.
Memory Management documents the current frame-grant and MemoryObject accounting baseline.
Go VirtualMemory Contract provides the first concrete virtual-reservation versus physical-commit ledger split for a future language runtime.

Problem

capOS already has several finite structural limits and some enforced ledgers: cap slots, frame grants, virtual reservations, threads, ring scratch, endpoint queues, timer waiters, and spawn preflight checks. Coverage and scope are fragmented: some profile fields are not wired, several limits are per object or proof workload rather than aggregate per owner, and fixed backing can exceed the selected logical bound. Local accounts, guests, anonymous callers, external sessions, service accounts, drivers, storage services, network stacks, tests, and future runtimes all need the same rule:

No workload receives implicit unlimited consumption of finite system resources.

This proposal defines the common model. It extends the Security Verification Track S.9 authority graph and per-process ResourceLedger design rather than replacing it.

Principles

A ResourceProfile is a policy template, not authority.
Actual enforcement happens through ledgers, capability wrappers, brokers, supervisors, and kernel/resource-service admission checks.
Every resource class has one ledger of record. Mirrors for status, metrics, or audit are derived views and must not be used for enforcement.
Reservation happens before side effects. Commit publishes the resource. Release and rollback are mandatory on all success, failure, timeout, revocation, and process-exit paths.
Identity metadata selects policy. It never consumes, releases, or bypasses quota by itself.
Quota donation is explicit. A caller may donate budget to a service call, but a service cannot silently spend the caller’s unrelated budget.
Availability is a security outcome. A bound that lets one cheap hostile request occupy every global slot, verifier, waiter, or accept loop is damage containment, not fair admission or DoS protection.
Proof constants stay labeled by manifest/workload until a production owner charges the same physical cost through an aggregate ledger.

Resource Profiles

Resource profiles are named templates selected by account records, manifest seed data, service policy, external admission rules, or test manifests. A profile should contain policy intent, not raw authority:

struct ResourceProfile {
  profileId @0 :Data;
  versionId @1 :Data;
  epoch @2 :UInt64;
  homeQuotaBytes @3 :UInt64;
  tempQuotaBytes @4 :UInt64;
  processLimit @5 :UInt32;
  threadLimit @6 :UInt32;
  capLimit @7 :UInt32;
  memoryCommitLimitBytes @8 :UInt64;
  frameGrantLimitPages @9 :UInt64;
  memoryVirtualReservationLimitBytes @20 :UInt64;
  endpointQueueLimit @10 :UInt32;
  inFlightCallLimit @11 :UInt32;
  retired12 @12 :UInt32; # was pending IPC submission quota; do not reuse
  ringScratchLimitBytes @13 :UInt64;
  logQuotaBytesPerWindow @14 :UInt64;
  networkProfile @15 :Text;
  cpuBudgetUsPerWindow @16 :UInt64;
  cpuWindowUs @17 :UInt64;
  timerWaiterLimit @18 :UInt32;
  launcherProfile @19 :Text;
}

The profile is evaluated by a broker or supervisor. The result is a set of ledger limits, wrapper caps, service-specific budgets, and spawn constraints. Changing a profile does not change a running workload until a trusted service issues new limits, revokes old caps, or starts a replacement workload.
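
The snapshot behaviour described here can be sketched in Rust. This is an illustration only, not capOS code: `ProfileTemplate`, `IssuedLimits`, and `issue_limits` are hypothetical names, and only three of the profile fields are modeled.

```rust
/// Hypothetical subset of a ResourceProfile; fields mirror the schema names.
#[derive(Clone, Debug, PartialEq)]
pub struct ProfileTemplate {
    pub epoch: u64,
    pub cap_limit: u32,
    pub thread_limit: u32,
}

/// Hypothetical concrete limits handed to a ledger at spawn or session start.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct IssuedLimits {
    pub policy_epoch: u64,
    pub cap_limit: u32,
    pub thread_limit: u32,
}

/// Copies policy values out of the template. The result holds no reference
/// to the template, so later template edits cannot reach a running workload.
pub fn issue_limits(profile: &ProfileTemplate) -> IssuedLimits {
    IssuedLimits {
        policy_epoch: profile.epoch,
        cap_limit: profile.cap_limit,
        thread_limit: profile.thread_limit,
    }
}
```

Because the issued limits are a copy, tightening the template affects only limits issued afterwards; an already running workload keeps its old values until a trusted service replaces them.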

Current kernel coverage includes manifest profile decoding, caller-session profile resolution during spawn, cap and thread limits on process construction, per-thread ring and reply scratch sizing, and per-endpoint queue/in-flight limits for profile-created endpoint caps. Services do not yet select an explicit profile, unknown profile resolution can fall back to defaults, endpoint limits are not aggregate, and processLimit, memory/frame, timer, network, log, and CPU policy fields do not all reach their resource owners. The QEMU proof make run-resource-profile covers an in-limit spawn, an over-cap spawn rejection before result authority escapes, rollback after that rejection, and a thread-limit rejection through ThreadSpawner.create; it does not prove the whole profile is enforced.

Ledgers of Record

The ledger of record depends on the resource owner:

Resource	Ledger of record
Capability slots	Process `CapTable` / process resource ledger
Processes and child subtrees	Supervisor or `ProcessSpawner` ledger
Threads and kernel stacks	Process-owned thread/kernel-stack ledger
Anonymous virtual reservations	Address-space or VM service reservation ledger
Anonymous committed memory	Address-space or VM service ledger
Physical frames and frame grants	Frame allocator / holder ledger
MemoryObject mappings	Per-process frame-grant ledger plus address-space tracking
Endpoint queues	Endpoint structural ledger plus aggregate process/session credit (aggregate layer is future)
In-flight calls and result caps	Caller/callee transport ledger
Deferred waiters, notifications, pipe waits, and promised answers	Caller/process or service ledger plus a system structural backstop and protected lifecycle reserve
Ring submissions	Fixed ring depth and per-dispatch budget; no profile ledger
Ring scratch and request buffers	Per-thread ring allocation plus aggregate process ledger (aggregate charging is future)
Timer sleeps and waiters	Timer service waiter ledger
Log bytes	Log service token bucket / retention ledger
Diagnostic formatting and output work	Producer/process or anonymous-ingress work budget plus protected audit lane
Storage bytes and namespace entries	Store/Namespace service ledger
Temporary, cache, and home storage	Store/Namespace scoped sub-ledgers
Network listeners, sockets, and bytes	Network service or socket cap ledger
Pre-auth connection, parsing, cookie rejection, and login work	Service/anonymous-ingress ledger
Password hashing and credential arena occupancy	Credential service anonymous/authenticated work ledger with protected recovery/setup reserve
Browser/session registry records and proxy state	Authenticated session ledger after successful admission; service base ledger before it
CPU share and runtime budget	Scheduler or scheduling-context ledger
Reclaim scans and swap I/O	Triggering owner donation or bounded system-pressure service budget
DMA pool bytes, DMA buffer count, descriptor/ring depth, MMIO mappings, interrupt holds, in-flight DMA submissions	Device-manager ledgers, later
Model tokens, provider calls, tool calls	Provider/agent gateway ledgers, later

No second module should maintain an independent enforcement counter for the same resource. A status service may cache values for display only if it treats the ledger owner as authoritative and never grants or rejects based on stale cache state.

Relationship To Tickless And Realtime CPU Authority

The CPU terms in Tickless and Realtime Scheduling reuse this resource-accounting model:

ResourceProfile.cpuBudgetUsPerWindow: coarse policy template only. Selecting a profile does not mint executable CPU-time authority.
ResourceLedger CPU budget: coarse best-effort accounting before realtime contexts exist, and the ledger of record for non-realtime CPU share/runtime limits.
SchedulingContext: spendable CPU-time object for realtime or admitted execution. It carries budget, period, relative deadline, priority/criticality, CPU mask, and overrun policy.
CpuIsolationLease: CPU placement, exclusivity, and noise/nohz authority. It is not CPU budget and must charge consumed runtime to a SchedulingContext or scheduler ResourceLedger.
NoHzEligibility / NoHzActivation: reviewed eligibility plus scheduler-proven current CPU state. They do not grant resource credit.
RealtimeIsland: admitted bundle consuming SchedulingContexts plus memory, device, ring, and optional CpuIsolationLease reservations.

Do not create a second CPU budget system under nohz, SQPOLL, or realtime terminology. Those features select placement and execution mode; CPU time is still charged through scheduling-context or scheduler-ledger authority. WFQ weight remains a relative share, SchedulingContext budget/period is a hard throttle, and CpuIsolationLease is placement authority. None of these is a minimum-service guarantee or SLA until aggregate per-CPU feasibility admission accounts for kernel, IRQ, housekeeping, and background work.

Reservation Lifecycle

Every resource allocation follows the same lifecycle:

reserve(request, limits, expected_state)
  -> reserved(token)
  -> denied(reason)

commit(token)
  -> committed(resource)
  -> rollback(token, reason)

release(resource)
  -> released

Rules:

reserve validates structure, bounds, ownership, and available quota before any externally visible mutation.
commit publishes exactly the resource that was reserved.
rollback restores all ledgers touched by the reservation.
A high-level close/release API may be idempotent from the caller’s perspective, but the internal reservation token changes ledger state exactly once. Duplicate, stale, oversized, and missing releases are detected and audited or recovered rather than hidden by saturating arithmetic.
Process exit and cap revocation bulk-release all resources owned only by the exiting process or revoked hold edge.
Stale handles, exhausted quotas, malformed limits, and unknown profile versions fail closed with typed errors or denials, not panics.
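
The rules above can be sketched as a small single-resource ledger in Rust. All names (`Ledger`, `Token`, `Committed`, `Denied`) are hypothetical, and the sketch makes simplifying assumptions: one resource class, no expected-state check, no audit hook, and tokens that are not bound to a particular ledger instance or generation, which a real design would need.

```rust
/// Typed denial reasons: quota failure is a value, not a panic.
#[derive(Debug, PartialEq, Eq)]
pub enum Denied {
    ZeroAmount,
    QuotaExceeded,
    StaleToken,
}

/// Proof of a pending reservation. It is neither Clone nor Copy, so commit
/// or rollback consumes it and it cannot be replayed.
#[derive(Debug)]
pub struct Token {
    id: u64,
}

/// Proof of a committed charge, consumed by release.
#[derive(Debug)]
pub struct Committed {
    amount: u64,
}

/// Ledger for one resource class with a hard limit.
#[derive(Debug)]
pub struct Ledger {
    limit: u64,
    reserved: u64,
    committed: u64,
    next_id: u64,
    pending: Vec<(u64, u64)>, // (token id, amount) for open reservations
}

impl Ledger {
    pub fn new(limit: u64) -> Self {
        Ledger { limit, reserved: 0, committed: 0, next_id: 0, pending: Vec::new() }
    }

    /// Reserved plus committed; never exceeds `limit`.
    pub fn in_use(&self) -> u64 {
        self.reserved + self.committed
    }

    /// Validates the request against the limit before anything is published.
    pub fn reserve(&mut self, amount: u64) -> Result<Token, Denied> {
        if amount == 0 {
            return Err(Denied::ZeroAmount);
        }
        let total = self.in_use().checked_add(amount).ok_or(Denied::QuotaExceeded)?;
        if total > self.limit {
            return Err(Denied::QuotaExceeded);
        }
        let id = self.next_id;
        self.next_id += 1;
        self.reserved += amount;
        self.pending.push((id, amount));
        Ok(Token { id })
    }

    /// Closes an open reservation and returns the amount the ledger recorded.
    fn close(&mut self, token: Token) -> Result<u64, Denied> {
        let pos = self
            .pending
            .iter()
            .position(|&(id, _)| id == token.id)
            .ok_or(Denied::StaleToken)?;
        let (_, amount) = self.pending.swap_remove(pos);
        self.reserved -= amount;
        Ok(amount)
    }

    /// Publishes exactly the amount that was reserved.
    pub fn commit(&mut self, token: Token) -> Result<Committed, Denied> {
        let amount = self.close(token)?;
        self.committed += amount;
        Ok(Committed { amount })
    }

    /// Restores the ledger to its state before the reservation.
    pub fn rollback(&mut self, token: Token) -> Result<(), Denied> {
        self.close(token).map(|_| ())
    }

    /// A mismatched release is reported instead of being clamped to zero.
    pub fn release(&mut self, resource: Committed) -> Result<(), Denied> {
        self.committed = self
            .committed
            .checked_sub(resource.amount)
            .ok_or(Denied::StaleToken)?;
        Ok(())
    }
}
```

Moving the token into `commit` or `rollback` lets the type system enforce the exactly-once state change, and checked arithmetic surfaces a bad release as a typed error rather than a silently saturated counter.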

The Security Verification Track S.9 transfer transaction is the concrete model for cap transfer and spawn. Other services should reuse the same preflight, reservation, commit, rollback, and audit vocabulary.

Donation and Shared Services

Shared services handle many sessions in one process. They need bounded server-side state without treating caller identity as authority.

Donation is a lease from one ledger to another for a named operation:

Donation {
  donorLedgerGrant
  receiverServiceGrant
  resourceClass
  amount
  expiresAtMs
  callId
}

A donation can pay for queue entries, scratch bytes, temporary storage, outbound bytes, model tokens, or CPU budget needed to serve one request. It does not grant unrelated authority to the service and does not let the caller spend the service’s own management budget. When the call finishes, times out, is cancelled, or the session exits, unused donation is returned and used donation is charged to the donor’s accounting record.

Services may also have their own base budgets for resident state. Per-client budgets and service base budgets are separate ledger entries so a single client cannot hide consumption inside the service account. The grant fields above are unforgeable generation-bound references. Session, ledger, account, and service IDs may appear in redacted audit records, but caller-supplied IDs never select the charged ledger.
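
A minimal Rust sketch of the donation lease follows. It assumes a consumable resource class (bytes or tokens rather than slots that are handed back after use), omits expiry and the unforgeable grant references, and uses the hypothetical names `DonorBudget` and `DonationLease`.

```rust
/// Donor-side budget for one resource class.
#[derive(Debug)]
pub struct DonorBudget {
    available: u64,
    charged: u64, // budget actually consumed on the donor's behalf
}

/// Budget leased to one service call. Consumed by `settle`.
#[derive(Debug)]
pub struct DonationLease {
    call_id: u64,
    amount: u64,
    used: u64,
}

impl DonorBudget {
    pub fn new(available: u64) -> Self {
        DonorBudget { available, charged: 0 }
    }

    pub fn available(&self) -> u64 {
        self.available
    }

    pub fn charged(&self) -> u64 {
        self.charged
    }

    /// Moves `amount` out of the donor budget for one call, or refuses.
    pub fn donate(&mut self, call_id: u64, amount: u64) -> Option<DonationLease> {
        if amount == 0 || amount > self.available {
            return None;
        }
        self.available -= amount;
        Some(DonationLease { call_id, amount, used: 0 })
    }

    /// Ends the lease: used budget is charged, the remainder is returned.
    pub fn settle(&mut self, lease: DonationLease) {
        self.charged += lease.used;
        self.available += lease.amount - lease.used;
    }
}

impl DonationLease {
    pub fn call_id(&self) -> u64 {
        self.call_id
    }

    /// The service spends from the lease only; it has no handle on the
    /// donor's remaining budget.
    pub fn spend(&mut self, amount: u64) -> bool {
        match self.used.checked_add(amount) {
            Some(total) if total <= self.amount => {
                self.used = total;
                true
            }
            _ => false,
        }
    }
}
```

The service receives only the lease, so the most it can spend is the donated amount, and settling on completion, timeout, or cancellation is the single point where the donor's record changes.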

Profile Binding

Profiles are selected by policy inputs:

manifest-seeded operators and recovery identities,
local account records,
service account records,
guest and anonymous admission rules,
external identity bindings,
test manifests and QEMU smoke profiles,
future driver, storage, network, and runtime launch policies.

The broker or supervisor translates those profiles into concrete limits at session creation, spawn, service start, or cap minting time. The translation must record:

profile ID, version ID, and policy epoch,
ledger owner and resource class,
hard limit and optional token-bucket window,
source policy and approving broker/supervisor,
audit record ID for the grant,
expiry or revocation epoch if the budget is leased.

A session can carry profile summaries for audit and display, but the summaries do not enforce quota. Enforcement lives where the resource is created or used.

Resource Classes

Kernel and Process Resources

Cap slots, process count, thread count, kernel stacks, ring scratch, outstanding calls, and endpoint queue entries are kernel or kernel-object resources. Ring submissions are bounded separately by the fixed SQ depth and the per-dispatch budget, so they do not have a profile quota. The remaining checks belong before spawn, thread creation, transfer, IPC, and ring dispatch side effects.

Memory

Current VirtualMemory mappings and held MemoryObject caps charge the process-owned frame-grant ledger of record. The address space records borrowed object-backed pages at the same tracking limit so unmap and teardown can distinguish them from anonymous pages, but that tracking is not a second enforcement counter. Future reserve/commit/decommit semantics split virtual reservation from committed physical memory: VirtualMemory.reserve charges a virtual-reservation ledger, while VirtualMemory.commit and compatibility VirtualMemory.map charge the committed-memory/frame ledger before pages become accessible. Decommit releases physical commit budget while preserving virtual reservation budget until unmap.
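
The future reserve/commit/decommit split can be illustrated with two counters. This Rust sketch is an assumption-laden model, not the VirtualMemory interface: it counts pages in aggregate without tracking address ranges, requires decommit before unmap, leaves out the compatibility map path, and uses the hypothetical names `VmLedger` and `VmDenied`.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum VmDenied {
    VirtualLimit,
    CommitLimit,
    NotReserved,
    NotCommitted,
    StillCommitted,
}

/// Separate counters for virtual reservation and physical commit, in pages.
#[derive(Debug)]
pub struct VmLedger {
    virtual_limit: u64,
    commit_limit: u64,
    reserved: u64,
    committed: u64,
}

impl VmLedger {
    pub fn new(virtual_limit: u64, commit_limit: u64) -> Self {
        VmLedger { virtual_limit, commit_limit, reserved: 0, committed: 0 }
    }

    pub fn reserved(&self) -> u64 {
        self.reserved
    }

    pub fn committed(&self) -> u64 {
        self.committed
    }

    /// Charges only the virtual-reservation budget.
    pub fn reserve(&mut self, pages: u64) -> Result<(), VmDenied> {
        let total = self.reserved.checked_add(pages).ok_or(VmDenied::VirtualLimit)?;
        if total > self.virtual_limit {
            return Err(VmDenied::VirtualLimit);
        }
        self.reserved = total;
        Ok(())
    }

    /// Charges physical commit budget; the pages must already be reserved.
    pub fn commit(&mut self, pages: u64) -> Result<(), VmDenied> {
        let total = self.committed.checked_add(pages).ok_or(VmDenied::CommitLimit)?;
        if total > self.reserved {
            return Err(VmDenied::NotReserved);
        }
        if total > self.commit_limit {
            return Err(VmDenied::CommitLimit);
        }
        self.committed = total;
        Ok(())
    }

    /// Releases commit budget and leaves the reservation untouched.
    pub fn decommit(&mut self, pages: u64) -> Result<(), VmDenied> {
        self.committed = self.committed.checked_sub(pages).ok_or(VmDenied::NotCommitted)?;
        Ok(())
    }

    /// Releases reservation budget for pages that are no longer committed.
    pub fn unmap(&mut self, pages: u64) -> Result<(), VmDenied> {
        let remaining = self.reserved.checked_sub(pages).ok_or(VmDenied::NotReserved)?;
        if remaining < self.committed {
            return Err(VmDenied::StillCommitted);
        }
        self.reserved = remaining;
        Ok(())
    }
}
```

A runtime that reserves a large heap range but commits only a little of it is therefore limited by two independent bounds, and a decommit gives physical budget back without shrinking the address-space reservation.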

Storage

Storage services own byte, object, namespace-entry, and snapshot ledgers. home, config, cache, and tmp are separate sub-ledgers even when backed by the same Store. Temporary session storage expires on logout or session expiry. Cache quota may be reclaimed by policy. Home/config quota should not be reclaimed without explicit account/storage policy.

Logging and Audit

Log volume uses token buckets and retention limits. Audit entries required for security state transitions should have a protected emergency path; ordinary application logs must not starve audit. If audit storage is unavailable, the system enters a bounded emergency mode rather than silently dropping mandatory security events.
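
As an illustration of the per-window log budget, here is a coarse token bucket in Rust that refills in whole windows. It is a sketch under assumptions: `LogBucket` is a hypothetical name, the burst size equals the per-window quota, time is passed in by the caller in microseconds, and the protected audit path is not modeled.

```rust
/// Coarse token bucket for log bytes. `now_us` is supplied by the caller,
/// so the sketch has no clock dependency.
#[derive(Debug)]
pub struct LogBucket {
    bytes_per_window: u64,
    window_us: u64,
    tokens: u64,
    last_refill_us: u64,
    suppressed: u64, // bytes denied, kept for status reporting
}

impl LogBucket {
    /// Returns None for a zero-length window, which has no meaningful rate.
    pub fn new(bytes_per_window: u64, window_us: u64, now_us: u64) -> Option<Self> {
        if window_us == 0 {
            return None;
        }
        Some(LogBucket {
            bytes_per_window,
            window_us,
            tokens: bytes_per_window,
            last_refill_us: now_us,
            suppressed: 0,
        })
    }

    /// Refills once per elapsed window; unused budget does not accumulate
    /// beyond one window's worth.
    fn refill(&mut self, now_us: u64) {
        let elapsed = now_us.saturating_sub(self.last_refill_us);
        let windows = elapsed / self.window_us;
        if windows > 0 {
            self.tokens = self.bytes_per_window;
            // windows * window_us <= elapsed, so this cannot overflow.
            self.last_refill_us += windows * self.window_us;
        }
    }

    /// Charges `len` bytes if budget remains; otherwise counts them as
    /// suppressed and refuses without queueing or retrying.
    pub fn try_log(&mut self, len: u64, now_us: u64) -> bool {
        self.refill(now_us);
        if len <= self.tokens {
            self.tokens -= len;
            true
        } else {
            self.suppressed = self.suppressed.saturating_add(len);
            false
        }
    }

    pub fn suppressed(&self) -> u64 {
        self.suppressed
    }
}
```

The suppressed counter gives a status view something bounded to report after a denial, instead of emitting one diagnostic per rejected message.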

Network

Network profiles select listener authority, outbound connection classes, socket counts, byte windows, and remote scopes. A normal local account may receive client network caps; listener authority requires service policy, operator policy, or an application-specific grant. Anonymous remote sessions receive only protocol state needed to authenticate or create an account. TCP/TLS state, parsing, rejected cookies, login hashing, and rejection bytes are charged to a service or anonymous-ingress ledger before identity exists. Authenticated work may consume only explicit session donation; already-spent pre-auth work is never charged retroactively. Source addresses can narrow transport reachability or feed anti-abuse telemetry but are not browser identity or resource-account identity, especially behind a load balancer.

CPU and Scheduling

CPU share and runtime budget belong to the scheduler or scheduling context. The current WFQ weight is relative arbitration and current budget/period contexts throttle spend; neither guarantees minimum service. Before any reservation or SLA claim, the scheduler must admit aggregate utilization plus kernel/IRQ/housekeeping reserve. Later realtime, media, and driver work should use explicit period/budget/deadline records rather than ad hoc sleep or polling loops.
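
The admission step can be sketched as a utilization check in Rust. This is a hedged illustration for a single CPU: `CpuAdmission`, `ContextRequest`, and the parts-per-million scale are assumptions made for the example, and a utilization sum is only a feasibility bound, not a full schedulability analysis.

```rust
/// Utilization scale: 1_000_000 ppm means 100% of one CPU.
const PPM: u64 = 1_000_000;

#[derive(Clone, Copy, Debug)]
pub struct ContextRequest {
    pub budget_us: u64,
    pub period_us: u64,
}

#[derive(Debug, PartialEq, Eq)]
pub enum AdmitError {
    InvalidRequest,
    Overcommitted,
}

/// Per-CPU admission state. `reserve_ppm` stands for the share held back
/// for kernel, IRQ, and housekeeping work.
#[derive(Debug)]
pub struct CpuAdmission {
    reserve_ppm: u64,
    admitted_ppm: u64,
}

impl CpuAdmission {
    pub fn new(reserve_ppm: u64) -> Option<Self> {
        if reserve_ppm > PPM {
            return None;
        }
        Some(CpuAdmission { reserve_ppm, admitted_ppm: 0 })
    }

    /// Admits a budget/period pair only if the aggregate still fits.
    /// Returns the utilization charged, in ppm.
    pub fn admit(&mut self, req: ContextRequest) -> Result<u64, AdmitError> {
        if req.period_us == 0 || req.budget_us == 0 || req.budget_us > req.period_us {
            return Err(AdmitError::InvalidRequest);
        }
        let scaled = req.budget_us.checked_mul(PPM).ok_or(AdmitError::InvalidRequest)?;
        // Round up so rounding never under-counts utilization.
        let ppm = scaled / req.period_us + u64::from(scaled % req.period_us != 0);
        // Each term is at most PPM, so the sum cannot overflow.
        let total = self.reserve_ppm + self.admitted_ppm + ppm;
        if total > PPM {
            return Err(AdmitError::Overcommitted);
        }
        self.admitted_ppm += ppm;
        Ok(ppm)
    }
}
```

A request that would push the sum past 100% is denied up front, which is the difference between an admitted reservation and a throttle that merely caps spend after the fact.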

Devices and Providers

DMA pools, MMIO mappings, interrupts, cloud provider calls, LLM tokens, media frames, and external API calls are scarce resources too. The first proof may use service-level ledgers, but the rule is the same: one ledger of record, typed reservation, explicit release, audit-visible denial.

For the Security Verification Track S.11.2 userspace-driver transition, device ledgers must account at least DMA pool bytes, DMA buffer count, descriptor or ring depth, MMIO mappings, interrupt holds, and in-flight DMA submissions. A DMAPool reservation is not only memory allocation; it is also device-visible write authority and must be released through the same revoke/quiesce/reset path that makes future reuse safe.

Canonical device ledger concepts:

dma_pool_bytes
dma_buffer_count
dma_descriptor_count
mmio_mapping_count
interrupt_hold_count
inflight_dma_submission_count

These fields are device-manager accounting concepts even if the first implementation uses different internal names. They must have one ledger of record. DMA pool bytes and buffer counts are not interchangeable with ordinary MemoryObject ownership, because device-visible memory also carries IOVA, descriptor, reset, and stale-completion obligations.

Failure Semantics

Quota failure is a normal result, not a crash:

Condition	Result
Malformed request	Invalid input / typed transport error
Caller exceeds hard limit	Quota denied / overloaded
Service base budget exhausted	Service overloaded
Donated budget exhausted	Request denied or partial result
Stale profile version	Denied; refresh session/profile
Ledger mismatch or rollback failure	Enter recovery/emergency mode

At L4 hard emergency capacity, the network stack may drop or reset a connection before it can frame a response. Once HTTP framing exists, ordinary policy exhaustion should produce a bounded 429 or 503. Neither a transport drop nor an HTTP overload response is an authentication result, and any retry must preserve the original absolute deadline.

Retry policy belongs to the caller or supervisor. Kernel and service code must not spin, allocate unbounded retry queues, or emit unbounded diagnostics after quota failure.

Audit and Status

Auditable events:

profile-to-ledger translation,
reservation denial,
successful budget grant,
donation start/commit/release,
cap or resource revocation,
process-exit cleanup,
rollback or recovery-mode entry,
administrative profile change.

Status views should expose current usage, limits, denial counts, and suppressed diagnostic counts by resource class. They must redact sensitive account, network, provider, and object identifiers unless the viewer holds a suitable audit/status cap.

Verification Gates

Before treating resource profiles as complete for any caller class, add checks at the affected resource owner:

Host tests for limit parsing, stale profile rejection, reservation/rollback, and one-ledger-of-record invariants.
QEMU smokes proving quota denial for process/thread/cap, endpoint queue, timer waiter, memory, storage, log, and network resources as they exist.
Hostile exhaustion tests that do not panic, leak frames, leak cap slots, or leave partial child processes.
Physical-cost and fan-out tests proving that many individually bounded objects cannot multiply uncharged backing, continuations, scans, hashing, or diagnostics.
Cross-owner liveness tests proving a slow, idle, or hostile workload cannot consume every shared slot while unrelated admitted lifecycle, health, recovery, or login work makes no progress.
Process-exit and revocation tests proving all charges release exactly once.
Audit/status tests showing denial and cleanup are visible without exposing secrets.
Kani or property tests for small pure ledger primitives when bounds are fixed enough to model.

Relationships

Authority Accounting: Security Verification Track S.9 defines the current authority graph and process-ledger transaction model. This proposal generalizes the quota vocabulary to services, storage, networking, sessions, and future devices.
User Identity and Policy: account and session resource profiles select templates. Brokers and supervisors translate them into ledgers and wrapper caps.
OOM Handling and Swap: memory commitment, reclaim, and swap policy are the memory-specific part of this model.
Storage and Naming: Store/Namespace services own storage ledgers for homes, config, cache, tmp, snapshots, and imports.
System Monitoring: status and metrics expose derived ledger views, not parallel enforcement counters.

Non-Goals

No Unix cgroups clone as the primary abstraction.
No identity-based quota enforcement in the kernel.
No global mutable quota database trusted by every subsystem.
No claim that existing code already enforces every resource class above.
No unbounded best-effort mode for guests, anonymous callers, tests, or service accounts.
No use of an arbitrary low global limit, global try-lock, peer address, or listener-wide backoff as a substitute for authorization, accounting, or fair admission.

Open Questions

Which ledger IDs and status schemas should become stable ABI first?
How much CPU-budget enforcement is useful before scheduling contexts exist?
Should quota donation be represented as a general capability type or as method-specific sideband on selected service calls?
Which storage quota primitive is first: bytes, object count, namespace entries, or snapshots?
Which proofs belong in capos-lib versus resource-service-specific tests?
