Proposal: Resource Accounting and Quotas

Cross-cutting resource profiles, ledgers, reservation semantics, and verification gates for bounded capOS sessions, services, drivers, storage, networking, tests, and future language runtimes.

  • Authority Accounting records the current transfer and resource-accounting invariants.
  • Memory Management documents the current frame-grant and MemoryObject accounting baseline.
  • Go VirtualMemory Contract provides the first concrete virtual-reservation versus physical-commit ledger split for a future language runtime.

Problem

capOS already has several resource limits: cap slots, frame grants, timer waiters, thread and kernel-stack quotas, ring scratch, and spawn preflight checks. Those are useful but fragmented. Local accounts, guests, anonymous callers, external sessions, service accounts, drivers, storage services, network stacks, tests, and future runtimes all need the same rule:

No workload receives implicit unlimited consumption of finite system resources.

This proposal defines the common model. It extends the Stage S.9 authority graph and per-process ResourceLedger design rather than replacing it.

Principles

  • A ResourceProfile is a policy template, not authority.
  • Actual enforcement happens through ledgers, capability wrappers, brokers, supervisors, and kernel/resource-service admission checks.
  • Every resource class has one ledger of record. Mirrors for status, metrics, or audit are derived views and must not be used for enforcement.
  • Reservation happens before side effects. Commit publishes the resource. Every path (success, failure, timeout, revocation, and process exit) must end in an explicit release or rollback; no charge may be left outstanding.
  • Identity metadata selects policy. It never consumes, releases, or bypasses quota by itself.
  • Quota donation is explicit. A caller may donate budget to a service call, but a service cannot silently spend the caller’s unrelated budget.

Resource Profiles

Resource profiles are named templates selected by account records, manifest seed data, service policy, external admission rules, or test manifests. A profile should contain policy intent, not raw authority:

struct ResourceProfile {
  profileId @0 :Data;
  versionId @1 :Data;
  epoch @2 :UInt64;
  homeQuotaBytes @3 :UInt64;
  tempQuotaBytes @4 :UInt64;
  processLimit @5 :UInt32;
  threadLimit @6 :UInt32;
  capLimit @7 :UInt32;
  memoryCommitLimitBytes @8 :UInt64;
  frameGrantLimitPages @9 :UInt64;
  memoryVirtualReservationLimitBytes @20 :UInt64;
  endpointQueueLimit @10 :UInt32;
  inFlightCallLimit @11 :UInt32;
  pendingIpcSubmissionLimit @12 :UInt32;
  ringScratchLimitBytes @13 :UInt64;
  logQuotaBytesPerWindow @14 :UInt64;
  networkProfile @15 :Text;
  cpuBudgetUsPerWindow @16 :UInt64;
  cpuWindowUs @17 :UInt64;
  timerWaiterLimit @18 :UInt32;
  launcherProfile @19 :Text;
}

The profile is evaluated by a broker or supervisor. The result is a set of ledger limits, wrapper caps, service-specific budgets, and spawn constraints. Changing a profile does not change a running workload until a trusted service issues new limits, revokes old caps, or starts a replacement workload.

Ledgers of Record

The ledger of record depends on the resource owner:

| Resource | Ledger of record |
| --- | --- |
| Capability slots | Process CapTable / process resource ledger |
| Processes and child subtrees | Supervisor or ProcessSpawner ledger |
| Threads and kernel stacks | Process-owned thread/kernel-stack ledger |
| Anonymous virtual reservations | Address-space or VM service reservation ledger |
| Anonymous committed memory | Address-space or VM service ledger |
| Physical frames and frame grants | Frame allocator / holder ledger |
| MemoryObject mappings | Per-process frame-grant ledger plus address-space tracking |
| Endpoint queues | Endpoint object ledger |
| In-flight calls and result caps | Caller/callee transport ledger |
| Pending IPC submissions | Process ring/resource ledger |
| Ring scratch and request buffers | Process ring/resource ledger |
| Timer sleeps and waiters | Timer service waiter ledger |
| Log bytes | Log service token bucket / retention ledger |
| Storage bytes and namespace entries | Store/Namespace service ledger |
| Temporary, cache, and home storage | Store/Namespace scoped sub-ledgers |
| Network listeners, sockets, and bytes | Network service or socket cap ledger |
| CPU share and runtime budget | Scheduler or scheduling-context ledger |
| DMA pool bytes, DMA buffer count, descriptor/ring depth, MMIO mappings, interrupt holds, in-flight DMA submissions | Device-manager ledgers, later |
| Model tokens, provider calls, tool calls | Provider/agent gateway ledgers, later |

No second module should maintain an independent enforcement counter for the same resource. A status service may cache values for display only if it treats the ledger owner as authoritative and never grants or rejects based on stale cache state.

Relationship To Tickless And Realtime CPU Authority

The CPU terms in Tickless and Realtime Scheduling reuse this resource-accounting model:

ResourceProfile.cpuBudgetUsPerWindow:
  coarse policy template only. Selecting a profile does not mint executable
  CPU-time authority.

ResourceLedger CPU budget:
  coarse best-effort accounting before realtime contexts exist, and the ledger
  of record for non-realtime CPU share/runtime limits.

SchedulingContext:
  spendable CPU-time object for realtime or admitted execution. It carries
  budget, period, relative deadline, priority/criticality, CPU mask, and
  overrun policy.

CpuIsolationLease:
  CPU placement, exclusivity, and noise/nohz authority. It is not CPU budget
  and must charge consumed runtime to a SchedulingContext or scheduler
  ResourceLedger.

NoHzEligibility / NoHzActivation:
  reviewed eligibility plus scheduler-proven current CPU state. They do not
  grant resource credit.

RealtimeIsland:
  admitted bundle consuming SchedulingContexts plus memory, device, ring, and
  optional CpuIsolationLease reservations.

Do not create a second CPU budget system under nohz, SQPOLL, or realtime terminology. Those features select placement and execution mode; CPU time is still charged through scheduling-context or scheduler-ledger authority.
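The separation between placement authority and CPU budget can be sketched in Rust. This is a minimal illustration, not the capOS scheduler: the field names, the `account_run` helper, and the choice to deny on overrun (one of several possible overrun policies) are assumptions made for the example. The point it shows is that the lease carries no budget of its own and must charge runtime to a SchedulingContext.

```rust
/// Returned when a charge would exceed the remaining budget.
/// Denying is an assumed overrun policy for this sketch.
#[derive(Debug, PartialEq, Eq)]
pub struct Overrun {
    pub requested_us: u64,
    pub remaining_us: u64,
}

/// Spendable CPU time for one period (hypothetical, simplified fields).
pub struct SchedulingContext {
    budget_us: u64,
    consumed_us: u64,
}

impl SchedulingContext {
    pub fn new(budget_us: u64) -> Self {
        SchedulingContext { budget_us, consumed_us: 0 }
    }

    pub fn remaining_us(&self) -> u64 {
        self.budget_us - self.consumed_us
    }

    /// Charge consumed runtime; the context is the only place budget changes.
    pub fn charge(&mut self, us: u64) -> Result<(), Overrun> {
        let remaining = self.remaining_us();
        if us > remaining {
            return Err(Overrun { requested_us: us, remaining_us: remaining });
        }
        self.consumed_us += us;
        Ok(())
    }

    /// Start a new period with a full budget.
    pub fn replenish(&mut self) {
        self.consumed_us = 0;
    }
}

/// Placement and exclusivity only. Deliberately has no budget field.
pub struct CpuIsolationLease {
    pub cpu: u32,
}

impl CpuIsolationLease {
    /// Account for time run on the leased CPU by charging the context.
    /// Returns the CPU the work was placed on.
    pub fn account_run(&self, ctx: &mut SchedulingContext, ran_us: u64) -> Result<u32, Overrun> {
        ctx.charge(ran_us)?;
        Ok(self.cpu)
    }
}
```

Because the lease type has nowhere to store budget, a second CPU budget system cannot grow inside it by accident.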

Reservation Lifecycle

Every resource allocation follows the same lifecycle:

reserve(request, limits, expected_state)
  -> reserved(token)
  -> denied(reason)

commit(token)
  -> committed(resource)
  -> rollback(token, reason)

release(resource)
  -> released

Rules:

  • reserve validates structure, bounds, ownership, and available quota before any externally visible mutation.
  • commit publishes exactly the resource that was reserved.
  • rollback restores all ledgers touched by the reservation.
  • release is idempotent from the caller’s perspective but changes ledger state at most once.
  • Process exit and cap revocation bulk-release all resources owned only by the exiting process or revoked hold edge.
  • Stale handles, exhausted quotas, malformed limits, and unknown profile versions fail closed with typed errors or denials, not panics.

The S.9 transfer transaction is the concrete model for cap transfer and spawn. Other services should reuse the same preflight, reservation, commit, rollback, and audit vocabulary.
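As an illustration of the lifecycle rules, the following Rust sketch models a single-resource ledger with reserve, commit, rollback, and idempotent release. It is not the S.9 implementation; the `Ledger` type, the `Denied` variants, and the use of plain `u64` handles are assumptions chosen to keep the example small.

```rust
use std::collections::HashMap;

/// Typed denials (hypothetical names); failures never panic.
#[derive(Debug, PartialEq, Eq)]
pub enum Denied {
    MalformedRequest,
    QuotaExceeded,
    StaleHandle,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum State {
    Reserved,
    Committed,
}

/// Ledger of record for one resource class.
/// Invariant: `in_use` (reserved plus committed) never exceeds `limit`.
#[derive(Debug)]
pub struct Ledger {
    limit: u64,
    in_use: u64,
    next_id: u64,
    entries: HashMap<u64, (u64, State)>,
}

impl Ledger {
    pub fn new(limit: u64) -> Self {
        Ledger { limit, in_use: 0, next_id: 0, entries: HashMap::new() }
    }

    pub fn in_use(&self) -> u64 {
        self.in_use
    }

    /// Validate and charge quota before any externally visible mutation.
    pub fn reserve(&mut self, amount: u64) -> Result<u64, Denied> {
        if amount == 0 {
            return Err(Denied::MalformedRequest);
        }
        if amount > self.limit - self.in_use {
            return Err(Denied::QuotaExceeded);
        }
        let id = self.next_id;
        self.next_id += 1;
        self.in_use += amount;
        self.entries.insert(id, (amount, State::Reserved));
        Ok(id)
    }

    /// Publish exactly what was reserved; the charge is unchanged.
    pub fn commit(&mut self, id: u64) -> Result<(), Denied> {
        match self.entries.get_mut(&id) {
            Some(entry) if entry.1 == State::Reserved => {
                entry.1 = State::Committed;
                Ok(())
            }
            _ => Err(Denied::StaleHandle),
        }
    }

    /// Undo a reservation that was never committed.
    pub fn rollback(&mut self, id: u64) -> Result<(), Denied> {
        match self.entries.get(&id).copied() {
            Some((amount, State::Reserved)) => {
                self.entries.remove(&id);
                self.in_use -= amount;
                Ok(())
            }
            _ => Err(Denied::StaleHandle),
        }
    }

    /// Idempotent for the caller: returns true only on the call that
    /// actually changed ledger state.
    pub fn release(&mut self, id: u64) -> bool {
        match self.entries.get(&id).copied() {
            Some((amount, State::Committed)) => {
                self.entries.remove(&id);
                self.in_use -= amount;
                true
            }
            _ => false,
        }
    }
}
```

A caller that reserves 6 units against a limit of 10 is denied a further 5 with `QuotaExceeded`; after commit and release, a second release of the same handle returns `false` and leaves the ledger untouched.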

Donation and Shared Services

Shared services handle many sessions in one process. They need bounded server-side state without treating caller identity as authority.

Donation is a lease from one ledger to another for a named operation:

Donation {
  donorSessionId
  donorLedgerId
  receiverServiceId
  resourceClass
  amount
  expiresAtMs
  callId
}

A donation can pay for queue entries, scratch bytes, temporary storage, outbound bytes, model tokens, or CPU budget needed to serve one request. It does not grant unrelated authority to the service and does not let the caller spend the service’s own management budget. When the call finishes, times out, is cancelled, or the session exits, unused donation is returned and used donation is charged to the donor’s accounting record.

Services may also have their own base budgets for resident state. Per-client budgets and service base budgets are separate ledger entries so a single client cannot hide consumption inside the service account.
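The donation lease described above might look like the following in Rust. This is a sketch under assumptions: `DonorLedger`, `spend`, and `settle` are hypothetical names, a single resource class is modeled, and time is passed in explicitly as milliseconds rather than read from a clock.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum DonationError {
    InsufficientDonorBudget,
    Exhausted,
    Expired,
}

/// The donor's ledger for one resource class (hypothetical shape).
#[derive(Debug)]
pub struct DonorLedger {
    pub available: u64,
    pub charged: u64,
}

/// A lease of donor budget to one service call.
#[derive(Debug)]
pub struct Donation {
    pub call_id: u64,
    remaining: u64,
    used: u64,
    expires_at_ms: u64,
}

impl DonorLedger {
    /// Move `amount` out of the donor's available budget into a lease.
    pub fn donate(
        &mut self,
        call_id: u64,
        amount: u64,
        expires_at_ms: u64,
    ) -> Result<Donation, DonationError> {
        if amount > self.available {
            return Err(DonationError::InsufficientDonorBudget);
        }
        self.available -= amount;
        Ok(Donation { call_id, remaining: amount, used: 0, expires_at_ms })
    }

    /// Return unused budget and charge used budget to the donor.
    /// Taking the donation by value means it cannot be settled twice.
    pub fn settle(&mut self, donation: Donation) {
        self.available += donation.remaining;
        self.charged += donation.used;
    }
}

impl Donation {
    /// The service spends only from the lease, never from the donor's
    /// other budget and never past expiry.
    pub fn spend(&mut self, amount: u64, now_ms: u64) -> Result<(), DonationError> {
        if now_ms >= self.expires_at_ms {
            return Err(DonationError::Expired);
        }
        if amount > self.remaining {
            return Err(DonationError::Exhausted);
        }
        self.remaining -= amount;
        self.used += amount;
        Ok(())
    }
}
```

The service holds only the `Donation`, so it has no handle on the donor's unrelated budget, and the service's own base budget would live in a separate ledger entry.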

Profile Binding

Profiles are selected by policy inputs:

  • manifest-seeded operators and recovery identities,
  • local account records,
  • service account records,
  • guest and anonymous admission rules,
  • external identity bindings,
  • test manifests and QEMU smoke profiles,
  • future driver, storage, network, and runtime launch policies.

The broker or supervisor translates those profiles into concrete limits at session creation, spawn, service start, or cap minting time. The translation must record:

  • profile ID, version ID, and policy epoch,
  • ledger owner and resource class,
  • hard limit and optional token-bucket window,
  • source policy and approving broker/supervisor,
  • audit record ID for the grant,
  • expiry or revocation epoch if the budget is leased.

A session can carry profile summaries for audit and display, but the summaries do not enforce quota. Enforcement lives where the resource is created or used.
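A broker's translation step could be sketched as below. The types and the `bind_profile` function are hypothetical, only two resource classes are shown, and treating any epoch mismatch as stale is an assumption of this example rather than a stated capOS rule. Each grant records the fields listed above that the sketch can model: profile ID, version ID, epoch, resource class, hard limit, approver, and audit record ID.

```rust
/// Subset of a resource profile relevant to this sketch.
pub struct ProfileSummary {
    pub profile_id: String,
    pub version_id: String,
    pub epoch: u64,
    pub thread_limit: u32,
    pub cap_limit: u32,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum ResourceClass {
    Threads,
    CapSlots,
}

/// One concrete limit issued to a ledger owner, with its provenance.
#[derive(Debug, PartialEq, Eq)]
pub struct LedgerGrant {
    pub profile_id: String,
    pub version_id: String,
    pub epoch: u64,
    pub resource_class: ResourceClass,
    pub hard_limit: u64,
    pub approver: String,
    pub audit_record_id: u64,
}

#[derive(Debug, PartialEq, Eq)]
pub enum BindError {
    /// The profile was evaluated under a different policy epoch.
    EpochMismatch { profile: u64, current: u64 },
}

/// Translate a profile into per-class grants, failing closed on a
/// stale profile. Audit IDs are assigned sequentially for illustration.
pub fn bind_profile(
    profile: &ProfileSummary,
    current_epoch: u64,
    approver: &str,
    first_audit_id: u64,
) -> Result<Vec<LedgerGrant>, BindError> {
    if profile.epoch != current_epoch {
        return Err(BindError::EpochMismatch {
            profile: profile.epoch,
            current: current_epoch,
        });
    }
    let limits = [
        (ResourceClass::Threads, u64::from(profile.thread_limit)),
        (ResourceClass::CapSlots, u64::from(profile.cap_limit)),
    ];
    Ok(limits
        .iter()
        .enumerate()
        .map(|(i, &(resource_class, hard_limit))| LedgerGrant {
            profile_id: profile.profile_id.clone(),
            version_id: profile.version_id.clone(),
            epoch: profile.epoch,
            resource_class,
            hard_limit,
            approver: approver.to_string(),
            audit_record_id: first_audit_id + i as u64,
        })
        .collect())
}
```

The returned grants are inputs to the ledger owners; the profile summary itself stays a display and audit artifact.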

Resource Classes

Kernel and Process Resources

Cap slots, process count, thread count, kernel stacks, pending IPC submissions, ring scratch, outstanding calls, and endpoint queue entries are kernel or kernel-object resources. Their checks belong before spawn, thread creation, transfer, IPC, and ring dispatch side effects.

Memory

Current VirtualMemory mappings and held MemoryObject caps charge the process-owned frame-grant ledger of record. The address space records borrowed object-backed pages at the same tracking limit so unmap and teardown can distinguish them from anonymous pages, but that tracking is not a second enforcement counter. Future reserve/commit/decommit semantics split virtual reservation from committed physical memory: VirtualMemory.reserve charges a virtual-reservation ledger, while VirtualMemory.commit and compatibility VirtualMemory.map charge the committed-memory/frame ledger before pages become accessible. Decommit releases physical commit budget while preserving virtual reservation budget until unmap.
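The future reserve/commit/decommit split can be modeled with two counters. The sketch below is an aggregate byte-count model, not the VirtualMemory interface: `VmLedger` and its error names are hypothetical, per-range tracking is omitted, and requiring decommit before unmap is a simplification of this example.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum VmDenied {
    ReservationLimit,
    CommitLimit,
    NotReserved,
    NotCommitted,
    StillCommitted,
}

/// Virtual reservation and physical commit are charged separately.
/// Invariant: committed <= reserved, and each stays within its limit.
pub struct VmLedger {
    reserve_limit: u64,
    commit_limit: u64,
    reserved: u64,
    committed: u64,
}

impl VmLedger {
    pub fn new(reserve_limit: u64, commit_limit: u64) -> Self {
        VmLedger { reserve_limit, commit_limit, reserved: 0, committed: 0 }
    }

    pub fn reserved(&self) -> u64 {
        self.reserved
    }

    pub fn committed(&self) -> u64 {
        self.committed
    }

    /// Charge address space only; no physical memory is consumed.
    pub fn reserve(&mut self, bytes: u64) -> Result<(), VmDenied> {
        if bytes > self.reserve_limit - self.reserved {
            return Err(VmDenied::ReservationLimit);
        }
        self.reserved += bytes;
        Ok(())
    }

    /// Charge physical commit before pages become accessible.
    pub fn commit(&mut self, bytes: u64) -> Result<(), VmDenied> {
        if bytes > self.reserved - self.committed {
            return Err(VmDenied::NotReserved);
        }
        if bytes > self.commit_limit - self.committed {
            return Err(VmDenied::CommitLimit);
        }
        self.committed += bytes;
        Ok(())
    }

    /// Release commit budget while keeping the virtual reservation.
    pub fn decommit(&mut self, bytes: u64) -> Result<(), VmDenied> {
        if bytes > self.committed {
            return Err(VmDenied::NotCommitted);
        }
        self.committed -= bytes;
        Ok(())
    }

    /// Release reservation budget. In this simplified model only
    /// uncommitted reservation can be unmapped.
    pub fn unmap(&mut self, bytes: u64) -> Result<(), VmDenied> {
        if bytes > self.reserved {
            return Err(VmDenied::NotReserved);
        }
        if bytes > self.reserved - self.committed {
            return Err(VmDenied::StillCommitted);
        }
        self.reserved -= bytes;
        Ok(())
    }
}
```

With a 1 MiB reservation limit and a 64 KiB commit limit, a runtime can reserve 256 KiB of address space but commit at most 64 KiB of it at a time, which is the property a language runtime with a large sparse heap needs.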

Storage

Storage services own byte, object, namespace-entry, and snapshot ledgers. home, config, cache, and tmp are separate sub-ledgers even when backed by the same Store. Temporary session storage expires on logout or session expiry. Cache quota may be reclaimed by policy. Home/config quota should not be reclaimed without explicit account/storage policy.
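A minimal Rust sketch of scoped sub-ledgers follows. It models bytes only, and the `StoreLedger` type and its method names are hypothetical. It shows the three behaviours described above: scopes are charged independently, temporary storage is dropped at session end, and only cache is subject to policy reclaim.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Scope {
    Home = 0,
    Config = 1,
    Cache = 2,
    Tmp = 3,
}

#[derive(Debug, Clone, Copy)]
struct Sub {
    limit: u64,
    used: u64,
}

/// One sub-ledger per scope, even when all share a backing Store.
pub struct StoreLedger {
    subs: [Sub; 4],
}

impl StoreLedger {
    pub fn new(home: u64, config: u64, cache: u64, tmp: u64) -> Self {
        let mk = |limit: u64| Sub { limit, used: 0 };
        StoreLedger { subs: [mk(home), mk(config), mk(cache), mk(tmp)] }
    }

    pub fn used(&self, scope: Scope) -> u64 {
        self.subs[scope as usize].used
    }

    /// Charge one scope; on denial the error names the exhausted scope.
    pub fn charge(&mut self, scope: Scope, bytes: u64) -> Result<(), Scope> {
        let sub = &mut self.subs[scope as usize];
        if bytes > sub.limit - sub.used {
            return Err(scope);
        }
        sub.used += bytes;
        Ok(())
    }

    /// Logout or session expiry: all temporary bytes are released.
    /// Returns the number of bytes freed.
    pub fn end_session(&mut self) -> u64 {
        std::mem::take(&mut self.subs[Scope::Tmp as usize].used)
    }

    /// Policy reclaim touches cache only; home and config are untouched.
    /// Returns the number of bytes actually reclaimed.
    pub fn reclaim_cache(&mut self, bytes: u64) -> u64 {
        let sub = &mut self.subs[Scope::Cache as usize];
        let freed = bytes.min(sub.used);
        sub.used -= freed;
        freed
    }
}
```

There is deliberately no reclaim method for home or config in this sketch; adding one would need the explicit account/storage policy the text calls for.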

Logging and Audit

Log volume uses token buckets and retention limits. Audit entries required for security state transitions should have a protected emergency path; ordinary application logs must not starve audit. If audit storage is unavailable, the system enters a bounded emergency mode rather than silently dropping mandatory security events.
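One way to keep application logs from starving audit is to hold back part of each window's bucket for audit entries. The sketch below assumes that design; the `LogBucket` type, the fixed per-window refill, and the `AuditReserveExhausted` signal are hypothetical and stand in for whatever emergency path the log service actually provides.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Kind {
    Application,
    Audit,
}

#[derive(Debug, PartialEq, Eq)]
pub enum LogDenied {
    /// Ordinary log dropped and counted; not an error for the system.
    ApplicationThrottled,
    /// Mandatory audit entry could not be paid for; the caller must
    /// escalate (for example, enter emergency mode) rather than drop it.
    AuditReserveExhausted,
}

/// Per-window byte bucket with a floor that only audit may go below.
pub struct LogBucket {
    capacity: u64,
    audit_reserve: u64,
    tokens: u64,
    suppressed: u64,
}

impl LogBucket {
    pub fn new(capacity: u64, audit_reserve: u64) -> Self {
        assert!(audit_reserve <= capacity, "reserve must fit in the bucket");
        LogBucket { capacity, audit_reserve, tokens: capacity, suppressed: 0 }
    }

    /// Number of application entries suppressed so far.
    pub fn suppressed(&self) -> u64 {
        self.suppressed
    }

    pub fn try_emit(&mut self, kind: Kind, bytes: u64) -> Result<(), LogDenied> {
        // Application logs may not spend the last `audit_reserve` bytes.
        let floor = match kind {
            Kind::Application => self.audit_reserve,
            Kind::Audit => 0,
        };
        if self.tokens >= floor && bytes <= self.tokens - floor {
            self.tokens -= bytes;
            return Ok(());
        }
        match kind {
            Kind::Application => {
                self.suppressed += 1;
                Err(LogDenied::ApplicationThrottled)
            }
            Kind::Audit => Err(LogDenied::AuditReserveExhausted),
        }
    }

    /// Start of a new accounting window: refill to capacity.
    pub fn refill_window(&mut self) {
        self.tokens = self.capacity;
    }
}
```

Suppressed application entries are counted instead of logged, which keeps the diagnostics about throttling bounded as well.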

Network

Network profiles select listener authority, outbound connection classes, socket counts, byte windows, and remote scopes. A normal local account may receive client network caps; listener authority requires service policy, operator policy, or an application-specific grant. Anonymous remote sessions receive only protocol state needed to authenticate or create an account.

CPU and Scheduling

CPU share and runtime budget belong to the scheduler or future scheduling context. Until full scheduling-context donation exists, CPU limits can be coarse token buckets and supervisor policy. Later realtime, media, and driver work should use explicit period/budget/deadline records rather than ad hoc sleep or polling loops.

Devices and Providers

DMA pools, MMIO mappings, interrupts, cloud provider calls, LLM tokens, media frames, and external API calls are scarce resources too. The first proof may use service-level ledgers, but the rule is the same: one ledger of record, typed reservation, explicit release, audit-visible denial.

For the S.11.2 userspace-driver transition, device ledgers must account at least DMA pool bytes, DMA buffer count, descriptor or ring depth, MMIO mappings, interrupt holds, and in-flight DMA submissions. A DMAPool reservation is not only memory allocation; it is also device-visible write authority and must be released through the same revoke/quiesce/reset path that makes future reuse safe.

Canonical device ledger concepts:

dma_pool_bytes
dma_buffer_count
dma_descriptor_count
mmio_mapping_count
interrupt_hold_count
inflight_dma_submission_count

These fields are device-manager accounting concepts even if the first implementation uses different internal names. They must have one ledger of record. DMA pool bytes and buffer counts are not interchangeable with ordinary MemoryObject ownership, because device-visible memory also carries IOVA, descriptor, reset, and stale-completion obligations.
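The canonical counters can be grouped into one device-manager ledger, as in the Rust sketch below. The struct reuses the concept names above as field names, but `DeviceLedger`, its methods, and the `quiesced` flag are hypothetical. The example shows two points from the text: a DMA buffer reservation charges bytes and buffer count together or not at all, and release is refused until the device has been quiesced.

```rust
/// The six canonical device accounting concepts, used both for
/// limits and for current usage.
#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
pub struct DeviceCounts {
    pub dma_pool_bytes: u64,
    pub dma_buffer_count: u32,
    pub dma_descriptor_count: u32,
    pub mmio_mapping_count: u32,
    pub interrupt_hold_count: u32,
    pub inflight_dma_submission_count: u32,
}

/// Single ledger of record for one device.
pub struct DeviceLedger {
    limits: DeviceCounts,
    used: DeviceCounts,
}

impl DeviceLedger {
    pub fn new(limits: DeviceCounts) -> Self {
        DeviceLedger { limits, used: DeviceCounts::default() }
    }

    pub fn used(&self) -> DeviceCounts {
        self.used
    }

    /// All-or-nothing: both new values are computed and checked before
    /// either counter is written. The error names the exhausted counter.
    pub fn reserve_dma_buffer(&mut self, bytes: u64) -> Result<(), &'static str> {
        let new_bytes = self
            .used
            .dma_pool_bytes
            .checked_add(bytes)
            .filter(|b| *b <= self.limits.dma_pool_bytes)
            .ok_or("dma_pool_bytes")?;
        let new_count = self
            .used
            .dma_buffer_count
            .checked_add(1)
            .filter(|c| *c <= self.limits.dma_buffer_count)
            .ok_or("dma_buffer_count")?;
        self.used.dma_pool_bytes = new_bytes;
        self.used.dma_buffer_count = new_count;
        Ok(())
    }

    /// Release only after the revoke/quiesce/reset path has run, since
    /// the buffer is device-visible write authority, not just memory.
    pub fn release_dma_buffer(&mut self, bytes: u64, quiesced: bool) -> Result<(), &'static str> {
        if !quiesced {
            return Err("device not quiesced");
        }
        if bytes > self.used.dma_pool_bytes || self.used.dma_buffer_count == 0 {
            return Err("ledger mismatch");
        }
        self.used.dma_pool_bytes -= bytes;
        self.used.dma_buffer_count -= 1;
        Ok(())
    }
}
```

A real device manager would track each buffer individually so that release cannot name an arbitrary byte count; the aggregate form here is only meant to show the charging discipline.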

Failure Semantics

Quota failure is a normal result, not a crash:

| Condition | Result |
| --- | --- |
| Malformed request | Invalid input / typed transport error |
| Caller exceeds hard limit | Quota denied / overloaded |
| Service base budget exhausted | Service overloaded |
| Donated budget exhausted | Request denied or partial result |
| Stale profile version | Denied; refresh session/profile |
| Ledger mismatch or rollback failure | Enter recovery/emergency mode |

Retry policy belongs to the caller or supervisor. Kernel and service code must not spin, allocate unbounded retry queues, or emit unbounded diagnostics after quota failure.

Audit and Status

Auditable events:

  • profile-to-ledger translation,
  • reservation denial,
  • successful budget grant,
  • donation start/commit/release,
  • cap or resource revocation,
  • process-exit cleanup,
  • rollback or recovery-mode entry,
  • administrative profile change.

Status views should expose current usage, limits, denial counts, and suppressed diagnostic counts by resource class. They must redact sensitive account, network, provider, and object identifiers unless the viewer holds a suitable audit/status cap.

Verification Gates

Before treating resource profiles as complete for any caller class, add checks at the affected resource owner:

  • Host tests for limit parsing, stale profile rejection, reservation/rollback, and one-ledger-of-record invariants.
  • QEMU smokes proving quota denial for process/thread/cap, pending IPC submissions, endpoint queue, timer waiter, memory, storage, log, and network resources as they exist.
  • Hostile exhaustion tests that do not panic, leak frames, leak cap slots, or leave partial child processes.
  • Process-exit and revocation tests proving all charges release exactly once.
  • Audit/status tests showing denial and cleanup are visible without exposing secrets.
  • Kani or property tests for small pure ledger primitives when bounds are fixed enough to model.

Relationships

  • Authority Accounting: S.9 defines the current authority graph and process-ledger transaction model. This proposal generalizes the quota vocabulary to services, storage, networking, sessions, and future devices.
  • User Identity and Policy: account and session resource profiles select templates. Brokers and supervisors translate them into ledgers and wrapper caps.
  • OOM Handling and Swap: memory commitment, reclaim, and swap policy are the memory-specific part of this model.
  • Storage and Naming: Store/Namespace services own storage ledgers for homes, config, cache, tmp, snapshots, and imports.
  • System Monitoring: status and metrics expose derived ledger views, not parallel enforcement counters.

Non-Goals

  • No Unix cgroups clone as the primary abstraction.
  • No identity-based quota enforcement in the kernel.
  • No global mutable quota database trusted by every subsystem.
  • No claim that existing code already enforces every resource class above.
  • No unbounded best-effort mode for guests, anonymous callers, tests, or service accounts.

Open Questions

  • Which ledger IDs and status schemas should become stable ABI first?
  • How much CPU-budget enforcement is useful before scheduling contexts exist?
  • Should quota donation be represented as a general capability type or as method-specific sideband on selected service calls?
  • Which storage quota primitive is first: bytes, object count, namespace entries, or snapshots?
  • Which proofs belong in capos-lib versus resource-service-specific tests?