Proposal: Capability-Native System Monitoring
How capOS should expose logs, metrics, health, traces, crash records, and
service status without introducing global /proc, ambient log access, or a
privileged monitoring daemon that bypasses the capability model.
Problem
The current system is observable mostly through serial output, QEMU exit status, smoke-test lines, CQE error codes, and a small measurement-only build feature. That is enough for early kernel work, but it is not enough for a system whose claims depend on service decomposition, explicit authority, restart policy, auditability, and later cloud operation.
Monitoring is also not harmless. A monitoring service can reveal capability
topology, service names, badges, timing, crash context, request payloads, and
security decisions. If capOS imports a Unix-style “read everything under
/proc” or “global syslog” model, monitoring becomes an ambient authority
escape hatch. If it imports a kernel-programmable tracing model too early, it
adds a large privileged execution surface before the basic service graph is
stable.
The design target is narrower: make operational state visible through typed, attenuable capabilities. A process should observe only the services, logs, and signals it was granted authority to inspect.
Current State
Implemented signal sources:
- Kernel diagnostics are printed through COM1 serial via kprintln!, timestamped with the PIT tick counter. Panic and fault paths use a mutex-free emergency serial writer.
- Userspace logging currently goes through the kernel Console capability, backed directly by serial and bounded per call.
- Runtime panics can use an emergency console path, then exit with a fixed code.
- Capability-ring CQEs carry structured transport results, including negative CAP_ERR_* values and serialized CapException payloads.
- The ring tracks cq_overflow, corrupted SQ/CQ recovery, and bounded SQE dispatch, but these facts are not exported as normal metrics.
- ProcessSpawner and ProcessHandle.wait expose basic child lifecycle observation, but restart policy, health checks, and exported-cap lifecycle are future work.
- capos-lib::ResourceLedger tracks cap slots, outstanding calls, scratch bytes, and frame grants, but only as local accounting state.
- The measure feature adds benchmark-only counters and TSC helpers for controlled make run-measure boots.
- SystemConfig.logLevel exists in the schema and is printed at boot, but there is no filtering, routing, or retention policy behind it.
That means the system has useful raw signals but lacks a capability-shaped monitoring architecture.
Design Principles
- Observation is authority. Reading logs, status, metrics, traces, crash records, or audit entries requires a capability.
- No global monitoring root. SystemStatus(all), LogReader(all), and ServiceSupervisor(all) are powerful caps. Normal sessions receive scoped wrappers.
- Kernel facts, userspace policy. The kernel may expose bounded facts about processes, rings, resources, and faults. Retention, filtering, aggregation, health semantics, restart policy, and user-facing views belong in userspace.
- Separate signal classes. Logs, metrics, lifecycle events, traces, health, crash records, and audit logs have different readers, retention rules, and security properties.
- Bounded by construction. Every producer path has a byte, entry, or time budget. Loss is explicit and summarized.
- Payload capture is exceptional. Default tracing records headers, interface IDs, method IDs, sizes, result codes, badges when authorized, and timing. Capturing method payloads needs a stronger cap because payloads may contain secrets.
- Serial remains emergency plumbing. Early boot, panic, and recovery still need direct serial output. Normal services should receive log caps rather than broad Console.
- Audit is not debug logging. Audit records security-relevant decisions and capability lifecycle events. It is append-only from producers and exposed through scoped readers.
- Pull by default, push when justified. Status and metrics are pull-shaped (reader polls a snapshot cap). Logs, lifecycle events, crash records, and audit entries are push-shaped (producer calls into a sink). Traces are pull with an explicit arm/drain lifecycle because capture is expensive. Each direction has its own cap surface; do not generalize one shape to cover all signals.
- Narrow kernel stats caps over one god-cap. The kernel exposes bounded facts through several small read-only caps (ring, scheduler, resource ledger, frames, endpoints, caps, crash) rather than one KernelDiagnostics that grants everything. Narrow caps let an init-owned status service be assembled by composition, and let a broker lease a subset to an operator without handing over the rest.
Signal Taxonomy
Logs
Human-oriented diagnostic records:
- severity, component, service name, pid, optional session/service badge, monotonic timestamp, message text;
- rate-limited at producer and log service boundaries;
- suitable for serial forwarding, ring-buffer retention, and later storage;
- not a source of truth for security decisions.
Metrics
Low-cardinality numeric state:
- per-process ring SQ/CQ occupancy, cq_overflow, invalid SQE counts, opcode counts, transport error counts;
- scheduler runnable/blocked counts, direct IPC handoffs, cap-enter timeouts, process exits;
- resource ledger usage: cap slots, outstanding calls, scratch bytes, frame grants, endpoint queue occupancy, VM mapped pages;
- heap/frame allocator pressure;
- later device, network, storage, and CPU-time counters.
Metric shape is fixed to three forms:
- Counter: monotonic u64, reset only by reboot. Cumulative semantics make aggregation composable.
- Gauge: i64 that moves both ways. Used for queue depths, free-frame counts, mapped-page counts.
- Histogram: fixed bucket layout carried in the descriptor, u64 per bucket. Used for ring-dispatch duration, context-switch latency, IPC RTT.
Richer shapes (top-k tables, exponential histograms) are emitted as opaque typed payloads through the producer-scoped envelope described under “Core Interfaces”; the generic reader treats them as data, and a schema-aware viewer decodes them. Metrics should be snapshots or monotonic counters, not unbounded label streams.
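The three metric shapes can be sketched in Rust as follows. This is an illustrative sketch only: the type names Counter, Gauge, and Histogram and their methods are assumptions made for this example, not existing capOS types, and the histogram keeps an explicit overflow bucket as one possible way to keep the layout fixed.

```rust
use std::sync::atomic::{AtomicI64, AtomicU64, Ordering};

/// Monotonic counter: only ever increases; reset only by reboot.
pub struct Counter(AtomicU64);

impl Counter {
    pub const fn new() -> Self {
        Counter(AtomicU64::new(0))
    }
    pub fn add(&self, n: u64) {
        self.0.fetch_add(n, Ordering::Relaxed);
    }
    pub fn get(&self) -> u64 {
        self.0.load(Ordering::Relaxed)
    }
}

/// Gauge: signed value that moves both ways (queue depth, free frames).
pub struct Gauge(AtomicI64);

impl Gauge {
    pub const fn new() -> Self {
        Gauge(AtomicI64::new(0))
    }
    pub fn set(&self, v: i64) {
        self.0.store(v, Ordering::Relaxed);
    }
    pub fn adjust(&self, delta: i64) {
        self.0.fetch_add(delta, Ordering::Relaxed);
    }
    pub fn get(&self) -> i64 {
        self.0.load(Ordering::Relaxed)
    }
}

/// Histogram with a fixed bucket layout. `bounds[i]` is the inclusive upper
/// bound of bucket `i`; values above the last bound land in `overflow`.
pub struct Histogram<const N: usize> {
    bounds: [u64; N],
    buckets: [u64; N],
    overflow: u64,
}

impl<const N: usize> Histogram<N> {
    pub const fn new(bounds: [u64; N]) -> Self {
        Histogram { bounds, buckets: [0; N], overflow: 0 }
    }
    pub fn observe(&mut self, value: u64) {
        for i in 0..N {
            if value <= self.bounds[i] {
                self.buckets[i] += 1;
                return;
            }
        }
        self.overflow += 1;
    }
    pub fn buckets(&self) -> &[u64; N] {
        &self.buckets
    }
    pub fn overflow(&self) -> u64 {
        self.overflow
    }
}
```

Because the bucket bounds are fixed at construction, a reader needs only the descriptor and the bucket array to interpret a snapshot; no per-sample labels are created.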
Events
Discrete lifecycle facts:
- process spawned, started, exited, waited, killed, or failed to load;
- service declared healthy, unhealthy, restarting, quiescing, or upgraded;
- endpoint queue overflow, cancellation, disconnected holder, transfer rollback;
- resource quota rejection;
- device reset, interrupt storm, link up/down, block I/O error once devices exist.
Events are useful for supervisors and status views. They may also feed logs.
Traces
Bounded high-detail capture for debugging:
- SQE/CQE records around one pid, service subtree, endpoint, cap id, or error class;
- optional capnp payload capture only with explicit authority;
- offline schema-aware viewer for reproducing and explaining a failure;
- short retention by default.
This is the Ring as Black Box milestone from WORKPLAN.md, not full replay.
Health
Declared service state:
- ready, starting, degraded, draining, failed, stopped;
- last successful health check and last failure reason;
- dependency health summaries;
- supervisor-owned restart intent and backoff state.
Health is not inferred only from process liveness. A process can be alive and unhealthy, or intentionally draining and still useful.
Crash Records
Panic, exception, and fatal userspace runtime records:
- boot stage, current pid if known, fault vector, RIP/CR2/error code where applicable, recent SQE context when safe, and last serial line cursor;
- bounded, redacted, and readable through a crash/debug capability;
- serial fallback remains mandatory when no reader exists.
Audit
Security and policy records:
- session creation, approval request, policy decision, cap grant, cap transfer, cap release/revocation, denial, declassification/relabel operation;
- no raw authentication proofs, private keys, bearer tokens, or full environment dumps;
- query access is scoped by session, service subtree, or operator role.
Proposed Architecture
flowchart TD
Kernel[Kernel primitives] --> KD[KernelDiagnostics cap]
Kernel --> Serial[Emergency serial]
Init[init / root supervisor] --> LogSvc[Log service]
Init --> MetricsSvc[Metrics service]
Init --> StatusSvc[Status service]
Init --> AuditSvc[Audit log]
Init --> TraceSvc[Trace capture service]
KD --> MetricsSvc
KD --> StatusSvc
KD --> TraceSvc
Services[Services and drivers] --> LogSink[Scoped LogSink caps]
Services --> Health[Health caps]
Services --> AuditWriter[Scoped AuditWriter caps]
LogSink --> LogSvc
Health --> StatusSvc
AuditWriter --> AuditSvc
Broker[AuthorityBroker] --> Readers[Scoped readers]
Readers --> Shell[Shell / agent / operator tools]
StatusSvc --> Readers
LogSvc --> Readers
MetricsSvc --> Readers
TraceSvc --> Readers
AuditSvc --> Readers
The important property is that there is no ambient monitoring namespace. The graph is assembled by init and supervisors. Readers are capabilities, not paths.
Core Interfaces
These are conceptual interfaces. They should not be added to
schema/capos.capnp until the current manifest-executor work is complete and a
specific implementation slice needs them.
enum Severity {
debug @0;
info @1;
warn @2;
error @3;
critical @4;
}
struct LogRecord {
tick @0 :UInt64;
severity @1 :Severity;
component @2 :Text;
pid @3 :UInt32;
badge @4 :UInt64;
message @5 :Text;
}
struct LogFilter {
minSeverity @0 :Severity;
componentPrefix @1 :Text;
pid @2 :UInt32;
includeDebug @3 :Bool;
}
interface LogSink {
write @0 (record :LogRecord) -> ();
}
interface LogReader {
read @0 (cursor :UInt64, maxRecords :UInt32, filter :LogFilter)
-> (records :List(LogRecord), nextCursor :UInt64, dropped :UInt64);
}
LogSink is what ordinary services receive. LogReader is what shells,
operators, supervisors, and diagnostic tools receive. A scoped reader can filter
to one service subtree or session before the caller ever sees the record.
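A scoped reader can be modelled as a wrapper that filters before the caller sees anything. The Rust sketch below is a simplified in-memory model under stated assumptions: BufferReader, ScopedLogReader, and the reduced LogRecord are hypothetical names for illustration, scoping is by component prefix only, and the filter argument is collapsed to a minimum severity.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub enum Severity { Debug, Info, Warn, Error, Critical }

#[derive(Clone, Debug, PartialEq)]
pub struct LogRecord {
    pub tick: u64,
    pub severity: Severity,
    pub component: String,
    pub message: String,
}

pub struct ReadResult {
    pub records: Vec<LogRecord>,
    pub next_cursor: u64,
    pub dropped: u64,
}

pub trait LogReader {
    fn read(&self, cursor: u64, max_records: u32, min_severity: Severity) -> ReadResult;
}

/// Broad reader over a bounded buffer. `first_seq` is the sequence number of
/// `records[0]`; anything older has already been evicted.
pub struct BufferReader {
    pub first_seq: u64,
    pub records: Vec<LogRecord>,
}

impl LogReader for BufferReader {
    fn read(&self, cursor: u64, max_records: u32, min_severity: Severity) -> ReadResult {
        // Records the caller asked for that no longer exist: loss is explicit.
        let dropped = self.first_seq.saturating_sub(cursor);
        let start = cursor.max(self.first_seq);
        let offset = (start - self.first_seq) as usize;
        let mut records = Vec::new();
        let mut next_cursor = start;
        for rec in self.records.iter().skip(offset) {
            if records.len() >= max_records as usize {
                break;
            }
            next_cursor += 1;
            if rec.severity >= min_severity {
                records.push(rec.clone());
            }
        }
        ReadResult { records, next_cursor, dropped }
    }
}

/// Attenuated reader: only records whose component starts with the prefix
/// ever reach the holder of this wrapper.
pub struct ScopedLogReader<R: LogReader> {
    inner: R,
    component_prefix: String,
}

impl<R: LogReader> ScopedLogReader<R> {
    pub fn new(inner: R, component_prefix: &str) -> Self {
        ScopedLogReader { inner, component_prefix: component_prefix.to_string() }
    }
}

impl<R: LogReader> LogReader for ScopedLogReader<R> {
    fn read(&self, cursor: u64, max_records: u32, min_severity: Severity) -> ReadResult {
        let mut res = self.inner.read(cursor, max_records, min_severity);
        res.records.retain(|r| r.component.starts_with(self.component_prefix.as_str()));
        res
    }
}
```

This sketch passes the broad reader's dropped count through unchanged. A real wrapper would need to decide whether that count itself reveals activity outside the scope.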
struct ProcessStatus {
pid @0 :UInt32;
serviceName @1 :Text;
state @2 :Text;
capSlotsUsed @3 :UInt32;
capSlotsMax @4 :UInt32;
outstandingCalls @5 :UInt32;
cqReady @6 :UInt32;
cqOverflow @7 :UInt64;
lastExitCode @8 :Int64;
}
struct ServiceStatus {
name @0 :Text;
health @1 :Text;
pid @2 :UInt32;
restartCount @3 :UInt32;
lastError @4 :Text;
}
interface SystemStatus {
listProcesses @0 () -> (processes :List(ProcessStatus));
listServices @1 () -> (services :List(ServiceStatus));
service @2 (name :Text) -> (status :ServiceStatus);
}
SystemStatus is read-only. A broad instance can see the system; wrappers can
expose one service, one supervision subtree, or one session.
enum MetricKind {
counter @0;
gauge @1;
histogram @2;
}
struct MetricSample {
# Well-known fixed-name slot for counters and gauges the aggregator
# understands without additional schema lookup. Use this for stable
# kernel counters to keep the hot path allocation-free.
name @0 :Text;
kind @1 :MetricKind;
value @2 :Int64;
tick @3 :UInt64;
# Producer-scoped typed envelope for richer samples (histograms,
# top-k tables, per-subsystem structs). Payload is a capnp message;
# the schema is identified by `schemaHash` (capnp node id) and keyed
# per producer. Opaque to the generic reader; a schema-aware viewer
# decodes it.
producerId @4 :UInt64;
schemaHash @5 :UInt64;
payload @6 :Data;
}
struct MetricFilter {
prefix @0 :Text;
service @1 :Text;
}
interface MetricsReader {
snapshot @0 (filter :MetricFilter, maxSamples :UInt32)
-> (samples :List(MetricSample), truncated :Bool);
}
Early metrics should be fixed-name counters and gauges in the name/value
slot. Avoid arbitrary labels until there is a concrete memory and cardinality
policy. The producer-scoped envelope exists so richer samples do not force the
generic reader to learn a string-key taxonomy — if a producer needs per-queue
or per-device detail, it ships a typed capnp struct keyed by schemaHash
rather than synthesizing name strings.
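The bounded snapshot call for the name/value slot could look like the Rust sketch below. The function snapshot and the reduced MetricSample are assumptions for illustration; the envelope fields are left out, and names are static strings to reflect the fixed-name policy.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum MetricKind { Counter, Gauge }

#[derive(Clone, Debug, PartialEq)]
pub struct MetricSample {
    /// Fixed, compile-time name: never built from user-controlled strings.
    pub name: &'static str,
    pub kind: MetricKind,
    pub value: i64,
    pub tick: u64,
}

/// Returns at most `max_samples` samples whose name starts with `prefix`,
/// plus a flag telling the caller that more matching samples existed.
pub fn snapshot(
    all: &[MetricSample],
    prefix: &str,
    max_samples: u32,
) -> (Vec<MetricSample>, bool) {
    let mut out = Vec::new();
    let mut truncated = false;
    for sample in all.iter().filter(|s| s.name.starts_with(prefix)) {
        if out.len() >= max_samples as usize {
            truncated = true;
            break;
        }
        out.push(sample.clone());
    }
    (out, truncated)
}
```

The truncated flag keeps the reply size bounded by the caller's budget while still making the loss visible.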
struct TraceSelector {
pid @0 :UInt32;
serviceName @1 :Text;
errorCode @2 :Int32;
includePayloadBytes @3 :Bool;
}
struct TraceRecord {
tick @0 :UInt64;
pid @1 :UInt32;
opcode @2 :UInt16;
capId @3 :UInt32;
methodId @4 :UInt16;
interfaceId @5 :UInt64;
result @6 :Int32;
flags @7 :UInt16;
payload @8 :Data;
}
interface TraceCapture {
arm @0 (selector :TraceSelector, maxRecords :UInt32, maxBytes :UInt32)
-> (captureId :UInt64);
drain @1 (captureId :UInt64, maxRecords :UInt32)
-> (records :List(TraceRecord), complete :Bool, dropped :UInt64);
}
Payload capture should default off. A capture cap that can read payload bytes is closer to a debug privilege than a normal status cap.
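The arm/drain lifecycle and the payload rule can be illustrated with a small in-memory model in Rust. Everything here is a sketch under assumptions: the TraceCapture struct, its payload_authorized flag (standing in for holding the stronger cap), and the record hook are hypothetical, and selection is by pid only.

```rust
use std::collections::{HashMap, VecDeque};

#[derive(Clone, Debug, PartialEq)]
pub struct TraceRecord {
    pub tick: u64,
    pub pid: u32,
    pub opcode: u16,
    pub result: i32,
    pub payload: Vec<u8>,
}

pub struct Selector {
    pub pid: u32,
    pub include_payload: bool,
}

struct Capture {
    selector: Selector,
    max_records: usize,
    records: VecDeque<TraceRecord>,
    dropped: u64,
}

pub struct TraceCapture {
    next_id: u64,
    captures: HashMap<u64, Capture>,
    /// Stands in for "this cap may read payload bytes".
    payload_authorized: bool,
}

impl TraceCapture {
    pub fn new(payload_authorized: bool) -> Self {
        TraceCapture { next_id: 1, captures: HashMap::new(), payload_authorized }
    }

    /// Arms a bounded capture. Payload capture is refused without authority.
    pub fn arm(&mut self, selector: Selector, max_records: u32) -> Option<u64> {
        if selector.include_payload && !self.payload_authorized {
            return None;
        }
        let id = self.next_id;
        self.next_id += 1;
        self.captures.insert(id, Capture {
            selector,
            max_records: max_records as usize,
            records: VecDeque::new(),
            dropped: 0,
        });
        Some(id)
    }

    /// Producer hook: stores the record in every matching capture,
    /// counting drops once a capture's budget is exhausted.
    pub fn record(&mut self, rec: &TraceRecord) {
        for cap in self.captures.values_mut() {
            if cap.selector.pid != rec.pid {
                continue;
            }
            if cap.records.len() >= cap.max_records {
                cap.dropped += 1;
                continue;
            }
            let mut copy = rec.clone();
            if !cap.selector.include_payload {
                copy.payload.clear(); // metadata-only capture
            }
            cap.records.push_back(copy);
        }
    }

    /// Drains up to `max_records`; `complete` means the buffer is now empty.
    pub fn drain(&mut self, id: u64, max_records: u32) -> Option<(Vec<TraceRecord>, bool, u64)> {
        let cap = self.captures.get_mut(&id)?;
        let n = (max_records as usize).min(cap.records.len());
        let out: Vec<TraceRecord> = cap.records.drain(..n).collect();
        Some((out, cap.records.is_empty(), cap.dropped))
    }
}
```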
enum HealthState {
starting @0;
ready @1;
degraded @2;
draining @3;
failed @4;
stopped @5;
}
interface Health {
check @0 () -> (state :HealthState, reason :Text);
}
interface ServiceSupervisor {
status @0 () -> (status :ServiceStatus);
restart @1 () -> ();
}
ServiceSupervisor is authority-changing. Normal monitoring readers should not
receive it. A broker can mint a leased ServiceSupervisor(net-stack) for one
operator action.
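One way to picture a leased, single-action supervisor cap is a value that is consumed by use and carries an expiry tick. The Rust sketch below assumes hypothetical names (SupervisorLease, mint, LeaseError) and a tick-based lease. It models the idea only and says nothing about how capOS would represent leases in the cap table.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum LeaseError {
    Expired,
    WrongService,
}

pub struct Service {
    pub name: String,
    pub restart_count: u32,
}

/// Authority-changing cap scoped to one service and one time window.
pub struct SupervisorLease {
    service: String,
    expires_tick: u64,
}

impl SupervisorLease {
    /// Stands in for a broker minting the lease after a policy decision.
    pub fn mint(service: &str, now_tick: u64, ttl_ticks: u64) -> Self {
        SupervisorLease {
            service: service.to_string(),
            expires_tick: now_tick.saturating_add(ttl_ticks),
        }
    }

    /// Takes `self` by value, so the lease covers exactly one restart.
    pub fn restart(self, target: &mut Service, now_tick: u64) -> Result<(), LeaseError> {
        if now_tick > self.expires_tick {
            return Err(LeaseError::Expired);
        }
        if target.name != self.service {
            return Err(LeaseError::WrongService);
        }
        target.restart_count += 1;
        Ok(())
    }
}
```

Consuming the lease on use means the type system, rather than a runtime flag, rules out a second restart with the same grant.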
Kernel Diagnostics Contract
The kernel should expose a small read-only diagnostics surface for facts only the kernel can know:
- process table snapshot: pid, state, service name if known, wait state, exit code, ring physical identity hidden or omitted;
- ring snapshot: SQ/CQ head/tail-derived occupancy, overflow count, corrupted head/tail recovery counts, opcode/error counters;
- resource snapshot: cap slot usage, outstanding calls, scratch reservation, frame grant pages, mapped VM pages, free frame count, heap pressure;
- scheduler snapshot: tick count, current pid, run queue length, blocked count, direct IPC handoff count, timeout wake count;
- crash record: last panic/fault metadata and early boot stage.
The kernel should not implement log routing, alerting, dashboards, retention policy, restart decisions, RBAC, ABAC, or text search. Those are userspace service responsibilities.
Implementation shape:
- Maintain fixed-size counters in existing kernel structures where the source event already occurs.
- Prefer snapshots computed from existing state over duplicate counters when the cost is bounded.
- Expose snapshots through a small set of narrow read-only capabilities, not one KernelDiagnostics god-cap. The initial decomposition:
  - SchedStats: tick count, current pid, run queue length, blocked count, direct IPC handoff count, cap_enter timeout/wake counts.
  - FrameStats: free/used frame counts, frame-grant pages, allocator pressure histogram.
  - RingStats: per-process SQ/CQ occupancy, cq_overflow, corrupted-head recovery counts, opcode counters, transport-error counters.
  - CapTableStats: per-process slot occupancy, generation-rollover counts, insertion/removal rates.
  - EndpointStats: per-endpoint waiter depth, RECV/RETURN match rate, abort/cancellation counts.
  - CrashSnapshot: last panic/fault metadata, early boot stage, recent SQE context when safe.
- Each narrow cap exposes snapshot() -> (sample :MetricSample) or a typed struct; none of them enumerates processes or reads cap tables beyond what the subsystem owns. A trusted status service composes the ones it needs; a broker leases a subset for operator sessions without the rest.
- ProcessInspector (pid-scoped process table, cap-table enumeration, VM map) is a distinct, stronger cap and lives with process-management authority, not with monitoring.
- Convert broad diagnostics into scoped userspace wrappers before handing them to shells or applications.
- Keep panic/fault serial writes independent of any diagnostics service.
Promotion from the measure feature: the benchmark counters in
kernel/src/measure.rs graduate to always-on in RingStats / SchedStats
when the per-event cost is provably a single relaxed atomic add. Cycle-counter
instrumentation (rdtsc/rdtscp) stays behind cfg(feature = "measure")
because ordered timestamp reads add fencing cost on the dispatch path and are
useful only for benchmarks. The promotion threshold keeps normal dispatch
builds free of that instrumentation cost without forcing monitoring into a
second build configuration.
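The promotion rule can be sketched in Rust as follows. The RingStats layout and method names are assumptions for this example, and std atomics stand in for the core atomics a kernel build would use. The point is only that each always-on event is a single relaxed fetch_add, while the cycle-counter helper compiles in only under the measure feature.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Always-on counters: each event costs one relaxed atomic add.
pub struct RingStats {
    sqes_processed: AtomicU64,
    cq_overflow: AtomicU64,
}

impl RingStats {
    pub const fn new() -> Self {
        RingStats {
            sqes_processed: AtomicU64::new(0),
            cq_overflow: AtomicU64::new(0),
        }
    }

    #[inline]
    pub fn on_sqe(&self) {
        self.sqes_processed.fetch_add(1, Ordering::Relaxed);
    }

    #[inline]
    pub fn on_cq_overflow(&self) {
        self.cq_overflow.fetch_add(1, Ordering::Relaxed);
    }

    /// Bounded snapshot: (sqes_processed, cq_overflow).
    pub fn snapshot(&self) -> (u64, u64) {
        (
            self.sqes_processed.load(Ordering::Relaxed),
            self.cq_overflow.load(Ordering::Relaxed),
        )
    }
}

/// A `const fn` constructor lets the counters live in a plain static.
pub static RING_STATS: RingStats = RingStats::new();

/// Cycle-counter reads stay out of default builds entirely.
#[cfg(all(feature = "measure", target_arch = "x86_64"))]
pub fn cycles() -> u64 {
    unsafe { core::arch::x86_64::_rdtsc() }
}
```

Relaxed ordering is enough here because readers only need eventually consistent totals, not ordering relative to other memory operations.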
Logging Model
Early boot has only serial. After init starts the log service, ordinary services
should receive LogSink rather than raw Console unless they need emergency
console access.
Recommended path:
- Kernel serial remains for boot, panic, and fault records.
- Init starts a userspace log service and passes scoped LogSink caps to children.
- The log service forwards selected records to Console until persistent storage exists.
- SystemConfig.logLevel becomes an initial policy input for which records the log service forwards and retains.
- Session and operator tools receive scoped LogReader caps from a broker.
Services should not put secrets, raw capability payloads, full auth proofs, or arbitrary user input into logs without explicit redaction. Log records are data, not commands.
Metrics and Status
Status answers “what is alive and what state is it in.” Metrics answer “what is the numeric behavior over time.” Keeping them separate avoids a common failure mode where a human-readable status API grows into an unbounded metrics store.
Initial status fields should cover:
- pid, service name, binary name, process state, exit code;
- process handle wait state;
- supervisor health and restart policy once supervision exists;
- cap table occupancy and outstanding call count;
- ring CQ availability and overflow;
- endpoint queue occupancy where authorized.
Initial metrics should cover:
- ring dispatches, SQEs processed, per-op counts, transport error counts;
- cap-enter wait count, timeout count, wake count;
- scheduler context switches and direct IPC handoffs;
- frame free/used counts, frame grant pages, VM mapped pages;
- log records accepted, suppressed, dropped, and forwarded;
- trace records captured and dropped.
Avoid per-method, per-cap-id, per-badge, or per-user high-cardinality metrics by default. Those belong in short-lived traces or scoped logs.
Ring as Black Box
The first concrete monitoring milestone should be the existing WORKPLAN.md
Ring-as-Black-Box item:
- define a bounded capture format for SQE/CQE and endpoint transition records;
- export capture through a debug capability or QEMU-only debug path;
- build a host-side viewer that decodes records and capnp payloads when payload capture is authorized;
- add one failing-call smoke whose captured log can be inspected offline.
This buys immediate debugging value without committing to durable audit, network export, service restart policy, or replay semantics.
This is inspection, not record/replay. Replay requires stronger determinism, payload retention, timer/input modeling, and capability-state checkpoints.
Capture path cost. The capture cap (working name RingTap) is
feature-gated (cfg(feature = "debug_tap"), analogous to measure). Every
armed tap imposes a serializing fan-out on dispatch; keeping it out of the
default kernel feature set prevents always-on cost. Arming a tap is itself
an auditable event: both the tapped process and the audit log observe it.
Tap grants respect move semantics, so a tap cannot be silently cloned
past its intended holder. Payload-capturing taps require a separately
leased cap, distinct from metadata-only capture, because payloads may
contain secrets.
Health and Supervision
Health and restart policy should live with supervisors, not in a central kernel daemon.
Each supervisor owns:
- a narrowed ProcessSpawner;
- child ProcessHandle caps;
- the cap bundle needed to restart its subtree;
- optional Health caps exported by children;
- a LogSink and AuditWriter for its own decisions.
Status services aggregate supervisor-reported health. They should distinguish:
- no process exists;
- process exists but never reported ready;
- process is alive and ready;
- process is alive but degraded;
- process exited normally;
- process failed and supervisor is backing off;
- process was intentionally stopped or draining.
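The distinctions above can be expressed as a small classification function. This Rust sketch uses hypothetical types (ProcessObs, ServiceView, classify) and makes two simplifying assumptions that the proposal does not state: any non-zero exit is treated as a failure under backoff, and a live process that self-reports failed is shown as degraded.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum ProcessObs {
    Absent,
    Running,
    Exited(i64),
}

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum HealthState { Starting, Ready, Degraded, Draining, Failed, Stopped }

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum ServiceView {
    NoProcess,
    NeverReady,
    Ready,
    Degraded,
    ExitedNormally,
    FailedBackingOff,
    StoppedOrDraining,
}

/// Combines process liveness, the last reported health (if any), and the
/// supervisor's intent into one status view.
pub fn classify(
    process: ProcessObs,
    health: Option<HealthState>,
    intentional_stop: bool,
) -> ServiceView {
    if intentional_stop {
        return ServiceView::StoppedOrDraining;
    }
    match (process, health) {
        (ProcessObs::Absent, _) => ServiceView::NoProcess,
        (ProcessObs::Exited(0), _) => ServiceView::ExitedNormally,
        // Assumption: the supervisor backs off after any non-zero exit.
        (ProcessObs::Exited(_), _) => ServiceView::FailedBackingOff,
        (ProcessObs::Running, None)
        | (ProcessObs::Running, Some(HealthState::Starting)) => ServiceView::NeverReady,
        (ProcessObs::Running, Some(HealthState::Ready)) => ServiceView::Ready,
        // Assumption: alive but self-reported failed is shown as degraded.
        (ProcessObs::Running, Some(HealthState::Degraded))
        | (ProcessObs::Running, Some(HealthState::Failed)) => ServiceView::Degraded,
        (ProcessObs::Running, Some(HealthState::Draining))
        | (ProcessObs::Running, Some(HealthState::Stopped)) => ServiceView::StoppedOrDraining,
    }
}
```

Liveness alone never yields Ready in this sketch; a running process without a health report stays in NeverReady, which matches the rule that health is declared rather than inferred.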
Restart authority should be a separate ServiceSupervisor cap. A read-only
SystemStatus cap must not be able to restart anything.
Audit Integration
Audit should share infrastructure with logging only at the storage or transport layer. Its semantics are different.
Audit producers:
- AuthorityBroker for policy decisions and leased grants;
- supervisors for restarts and service lifecycle actions;
- session manager for session creation and logout;
- kernel or status service for cap transfer/release/revocation summaries when those events become part of the exported authority graph;
- recovery tools for repair actions.
Audit readers are scoped:
- a user can read records for its own session;
- an operator can read a service subtree;
- a recovery or security role can read broader streams after policy approval.
Audit entries must avoid secrets and payload dumps. They should record object identity, service identity, policy decision summaries, and capability interface classes rather than raw data.
Security and Backpressure
Monitoring must not become the easiest denial-of-service path.
Required controls:
- Per-process log token buckets, matching the S.9 diagnostic aggregation design.
- Suppression summaries for repeated invalid submissions.
- Fixed-size ring buffers with explicit dropped counts.
- Maximum record size for logs, events, crash records, and traces.
- Bounded formatting outside interrupt context.
- No heap allocation in timer or panic paths.
- No unbounded metric label creation from user-controlled strings.
- Payload tracing disabled by default.
- Redaction rules at producer boundaries and at reader wrappers.
- Capability-scoped readers; no unauthenticated “debug all” endpoint.
When pressure forces dropping, preserve first-observation diagnostics and later summaries. Losing detailed logs is acceptable; corrupting scheduler progress or blocking the kernel on log I/O is not.
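A per-process token bucket with a suppression summary might look like the Rust sketch below. The names LogBucket and Admit, the tick-based refill, and the summary-on-next-accept behavior are assumptions for illustration; the S.9 design may differ in detail.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum Admit {
    /// Record accepted; nothing was suppressed since the last accept.
    Accept,
    /// Record accepted; the count says how many were suppressed before it.
    AcceptWithSummary(u64),
    /// Record suppressed; only the counter changes.
    Suppress,
}

pub struct LogBucket {
    capacity: u32,
    tokens: u32,
    refill_per_tick: u32,
    last_tick: u64,
    suppressed: u64,
}

impl LogBucket {
    pub fn new(capacity: u32, refill_per_tick: u32) -> Self {
        LogBucket { capacity, tokens: capacity, refill_per_tick, last_tick: 0, suppressed: 0 }
    }

    /// Constant-time admission check with no allocation, so it can sit on
    /// the producer path.
    pub fn admit(&mut self, now_tick: u64) -> Admit {
        let elapsed = now_tick.saturating_sub(self.last_tick);
        self.last_tick = self.last_tick.max(now_tick);
        let refill = elapsed.saturating_mul(self.refill_per_tick as u64);
        let refilled = (self.tokens as u64).saturating_add(refill);
        self.tokens = refilled.min(self.capacity as u64) as u32;

        if self.tokens == 0 {
            self.suppressed += 1;
            return Admit::Suppress;
        }
        self.tokens -= 1;
        if self.suppressed > 0 {
            // Report the loss once, then reset the summary counter.
            let n = std::mem::take(&mut self.suppressed);
            Admit::AcceptWithSummary(n)
        } else {
            Admit::Accept
        }
    }
}
```

A flood costs the producer only a counter increment per suppressed record, and the first record admitted after the flood carries the summary, so loss stays explicit without blocking anyone.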
Relationship to Existing Proposals
- Service Architecture: monitoring services are ordinary userspace services spawned by init or supervisors. Logging policy and service topology stay out of the pre-init kernel path.
- Shell: the native and agent shells should receive scoped SystemStatus and LogReader caps in daily profiles, not global supervisor authority.
- User Identity and Policy: AuthorityBroker mints scoped readers and leased supervisor caps based on session policy; AuditLog records the decisions.
- Error Handling: transport errors and CapException payloads are monitoring signals, but retry policy remains userspace.
- Authority Accounting: resource ledgers provide the first metrics substrate and define quota/backpressure boundaries.
- Security and Verification: hostile-input tests should cover log flood aggregation and bounded diagnostic paths.
- Live Upgrade: health, audit, and service status become prerequisites for credible upgrade orchestration.
Implementation Plan
- Document the model. Keep monitoring as a future architecture proposal and do not disturb the current manifest-executor milestone.
- Ring as Black Box. Implement bounded CQE/SQE capture, host-side decoding, and one failing-call smoke. This is the first useful monitoring artifact.
- Userspace log service. Add LogSink and LogReader schemas, start a log service from init, forward selected records to Console, and enforce logLevel, record size, and drop summaries.
- Narrow kernel stats caps and SystemStatus. Add the narrow read-only caps (SchedStats, FrameStats, RingStats, CapTableStats, EndpointStats, CrashSnapshot) as bounded snapshot surfaces. A userspace SystemStatus service composes the ones it needs and exposes scoped wrappers to shells and operator tools. Leave ProcessInspector out of this step; it belongs with process-management authority, not monitoring.
- Metrics snapshots. Add fixed counters and gauges for ring, scheduler, resource, log, and trace state. Keep labels static until a cardinality policy exists.
- Health and supervisor status. Add Health and read-only supervisor status once restart policy and exported service caps are concrete. Keep restart authority in separate ServiceSupervisor caps.
- Audit path. Add append-only audit records for broker decisions, cap grants, releases, revocations, restarts, and recovery actions. Start serial or memory backed; move to storage once the storage substrate exists.
- Crash records. Preserve bounded panic/fault metadata across the current boot where possible; later store records durably.
- Device, network, and storage metrics. Add driver metrics only after those drivers exist: interrupts, DMA/bounce usage, queue depth, RX/TX/drop/error counts, block latency, and reset events.
Non-Goals
- No global /proc or /sys equivalent with ambient read access.
- No kernel-resident dashboard, alert manager, text search, or policy engine.
- No programmable kernel tracing language in the first monitoring design.
- No promise of durable log retention before storage exists.
- No default payload tracing.
- No service restart authority bundled into ordinary read-only status caps.
- No network export path until networking and policy can constrain it.
Open Questions
- Should KernelDiagnostics expose snapshots only, or also a bounded event cursor?
- What is the minimum timestamp model before wall-clock time exists?
- Should log records carry local cap IDs, stable object IDs, or only interface and service metadata by default?
- How should schema-aware trace decoding find schemas before a full SchemaRegistry exists?
- Which crash fields are safe to expose to non-recovery sessions?
- What retention policy is acceptable before persistent storage?
- Should MetricsReader use typed structs for each subsystem instead of generic name/value samples?
- Where should remote monitoring export fit once network transparency exists: a dedicated exporter service, capnp-rpc forwarding, or storage replication?