Proposal: Capability-Native System Monitoring

How capOS should expose logs, metrics, health, traces, crash records, and service status without introducing global /proc, ambient log access, or a privileged monitoring daemon that bypasses the capability model.

Problem

The current system is observable mostly through serial output, QEMU exit status, smoke-test lines, CQE error codes, and a small measurement-only build feature. That is enough for early kernel work, but it is not enough for a system whose claims depend on service decomposition, explicit authority, restart policy, auditability, and later cloud operation.

Monitoring is also not harmless. A monitoring service can reveal capability topology, service names, badges, timing, crash context, request payloads, and security decisions. If capOS imports a Unix-style “read everything under /proc” or “global syslog” model, monitoring becomes an ambient authority escape hatch. If it imports a kernel-programmable tracing model too early, it adds a large privileged execution surface before the basic service graph is stable.

The design target is narrower: make operational state visible through typed, attenuable capabilities. A process should observe only the services, logs, and signals it was granted authority to inspect.

Current State

Implemented signal sources:

  • Kernel diagnostics are printed through COM1 serial via kprintln!, timestamped with the PIT tick counter. Panic and fault paths use a mutex-free emergency serial writer.
  • Userspace logging currently goes through the kernel Console capability, backed directly by serial and bounded per call.
  • Runtime panics can use an emergency console path, then exit with a fixed code.
  • Capability-ring CQEs carry structured transport results, including negative CAP_ERR_* values and serialized CapException payloads.
  • The ring tracks cq_overflow, corrupted SQ/CQ recovery, and bounded SQE dispatch, but these facts are not exported as normal metrics.
  • ProcessSpawner and ProcessHandle.wait expose basic child lifecycle observation, but restart policy, health checks, and exported-cap lifecycle are future work.
  • capos-lib::ResourceLedger tracks cap slots, outstanding calls, scratch bytes, and frame grants, but only as local accounting state.
  • The measure feature adds benchmark-only counters and TSC helpers for controlled make run-measure boots.
  • SystemConfig.logLevel exists in the schema and is printed at boot, but there is no filtering, routing, or retention policy behind it.

That means the system has useful raw signals but lacks a capability-shaped monitoring architecture.

Design Principles

  1. Observation is authority. Reading logs, status, metrics, traces, crash records, or audit entries requires a capability.
  2. No global monitoring root. SystemStatus(all), LogReader(all), and ServiceSupervisor(all) are powerful caps. Normal sessions receive scoped wrappers.
  3. Kernel facts, userspace policy. The kernel may expose bounded facts about processes, rings, resources, and faults. Retention, filtering, aggregation, health semantics, restart policy, and user-facing views belong in userspace.
  4. Separate signal classes. Logs, metrics, lifecycle events, traces, health, crash records, and audit logs have different readers, retention rules, and security properties.
  5. Bounded by construction. Every producer path has a byte, entry, or time budget. Loss is explicit and summarized.
  6. Payload capture is exceptional. Default tracing records headers, interface IDs, method IDs, sizes, result codes, badges when authorized, and timing. Capturing method payloads needs a stronger cap because payloads may contain secrets.
  7. Serial remains emergency plumbing. Early boot, panic, and recovery still need direct serial output. Normal services should receive log caps rather than broad Console.
  8. Audit is not debug logging. Audit records security-relevant decisions and capability lifecycle events. It is append-only from producers and exposed through scoped readers.
  9. Pull by default, push when justified. Status and metrics are pull-shaped (reader polls a snapshot cap). Logs, lifecycle events, crash records, and audit entries are push-shaped (producer calls into a sink). Traces are pull with an explicit arm/drain lifecycle because capture is expensive. Each direction has its own cap surface; do not generalize one shape to cover all signals.
  10. Narrow kernel stats caps over one god-cap. The kernel exposes bounded facts through several small read-only caps (ring, scheduler, resource ledger, frames, endpoints, caps, crash) rather than one KernelDiagnostics that grants everything. Narrow caps let an init-owned status service be assembled by composition, and let a broker lease a subset to an operator without handing over the rest.
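Principle 5 can be made concrete with a small sketch. The following is a minimal, hypothetical fixed-capacity record buffer in Rust — `BoundedBuf` is an illustrative name, not an existing capOS type — showing the intended shape: the producer is never blocked, the buffer never grows, and loss is counted rather than silent.

```rust
/// Minimal sketch of "bounded by construction": a fixed-capacity
/// record buffer that never blocks the producer and never grows.
/// Loss is explicit: dropped records are counted, not silently lost.
/// `BoundedBuf` is an illustrative name, not an existing capOS type.
struct BoundedBuf<T> {
    slots: Vec<T>,
    cap: usize,
    dropped: u64,
}

impl<T> BoundedBuf<T> {
    fn new(cap: usize) -> Self {
        BoundedBuf { slots: Vec::with_capacity(cap), cap, dropped: 0 }
    }

    /// Push a record; on overflow, drop it and bump the counter.
    fn push(&mut self, item: T) -> bool {
        if self.slots.len() < self.cap {
            self.slots.push(item);
            true
        } else {
            self.dropped += 1;
            false
        }
    }

    /// Drain everything plus the loss summary, resetting the counter
    /// so each drain reports loss since the previous drain.
    fn drain(&mut self) -> (Vec<T>, u64) {
        let records = std::mem::take(&mut self.slots);
        let dropped = std::mem::replace(&mut self.dropped, 0);
        (records, dropped)
    }
}
```

The same drop-and-count shape recurs in the log, trace, and crash-record paths below.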

Signal Taxonomy

Logs

Human-oriented diagnostic records:

  • severity, component, service name, pid, optional session/service badge, monotonic timestamp, message text;
  • rate-limited at producer and log service boundaries;
  • suitable for serial forwarding, ring-buffer retention, and later storage;
  • not a source of truth for security decisions.

Metrics

Low-cardinality numeric state:

  • per-process ring SQ/CQ occupancy, cq_overflow, invalid SQE counts, opcode counts, transport error counts;
  • scheduler runnable/blocked counts, direct IPC handoffs, cap-enter timeouts, process exits;
  • resource ledger usage: cap slots, outstanding calls, scratch bytes, frame grants, endpoint queue occupancy, VM mapped pages;
  • heap/frame allocator pressure;
  • later device, network, storage, and CPU-time counters.

Metric shape is fixed to three forms:

  • Counter — monotonic u64, reset only by reboot. Cumulative semantics make aggregation composable.
  • Gauge — i64 that moves both ways. Used for queue depths, free-frame counts, mapped-page counts.
  • Histogram — fixed bucket layout carried in the descriptor, u64 per bucket. Used for ring-dispatch duration, context-switch latency, IPC RTT.

Richer shapes (top-k tables, exponential histograms) are emitted as opaque typed payloads through the producer-scoped envelope described under “Core Interfaces”; the generic reader treats them as data, and a schema-aware viewer decodes them. Metrics should be snapshots or monotonic counters, not unbounded label streams.
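The three fixed shapes can be sketched as a single Rust enum. This is illustrative, not decided representation: the `Metric` type and its bucket-search logic are assumptions, but they show why the histogram's fixed bucket layout (bounds in the descriptor, one u64 per bucket plus a trailing overflow bucket) keeps the hot path to an index computation and an increment.

```rust
/// Sketch of the three fixed metric shapes. Names are illustrative;
/// the real types would live behind the MetricsReader cap surface.
enum Metric {
    /// Monotonic u64, reset only by reboot.
    Counter(u64),
    /// i64 that moves both ways (queue depth, free frames).
    Gauge(i64),
    /// Fixed bucket layout carried in the descriptor; u64 per bucket.
    /// `bounds[i]` is the inclusive upper bound of bucket i; the last
    /// bucket is the overflow bucket (no upper bound of its own).
    Histogram { bounds: Vec<u64>, counts: Vec<u64> },
}

impl Metric {
    /// Record a duration-style observation into a histogram.
    fn observe(&mut self, value: u64) {
        if let Metric::Histogram { bounds, counts } = self {
            // First bucket whose upper bound covers the value,
            // else the trailing overflow bucket.
            let idx = bounds
                .iter()
                .position(|&b| value <= b)
                .unwrap_or(counts.len() - 1);
            counts[idx] += 1;
        }
    }
}
```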

Events

Discrete lifecycle facts:

  • process spawned, started, exited, waited, killed, or failed to load;
  • service declared healthy, unhealthy, restarting, quiescing, or upgraded;
  • endpoint queue overflow, cancellation, disconnected holder, transfer rollback;
  • resource quota rejection;
  • device reset, interrupt storm, link up/down, block I/O error once devices exist.

Events are useful for supervisors and status views. They may also feed logs.

Traces

Bounded high-detail capture for debugging:

  • SQE/CQE records around one pid, service subtree, endpoint, cap id, or error class;
  • optional capnp payload capture only with explicit authority;
  • offline schema-aware viewer for reproducing and explaining a failure;
  • short retention by default.

This is the Ring as Black Box milestone from WORKPLAN.md, not full replay.

Health

Declared service state:

  • ready, starting, degraded, draining, failed, stopped;
  • last successful health check and last failure reason;
  • dependency health summaries;
  • supervisor-owned restart intent and backoff state.

Health is not inferred only from process liveness. A process can be alive and unhealthy, or intentionally draining and still useful.

Crash Records

Panic, exception, and fatal userspace runtime records:

  • boot stage, current pid if known, fault vector, RIP/CR2/error code where applicable, recent SQE context when safe, and last serial line cursor;
  • bounded, redacted, and readable through a crash/debug capability;
  • serial fallback remains mandatory when no reader exists.

Audit

Security and policy records:

  • session creation, approval request, policy decision, cap grant, cap transfer, cap release/revocation, denial, declassification/relabel operation;
  • no raw authentication proofs, private keys, bearer tokens, or full environment dumps;
  • query access is scoped by session, service subtree, or operator role.

Proposed Architecture

flowchart TD
    Kernel[Kernel primitives] --> KD[Narrow kernel stats caps]
    Kernel --> Serial[Emergency serial]

    Init[init / root supervisor] --> LogSvc[Log service]
    Init --> MetricsSvc[Metrics service]
    Init --> StatusSvc[Status service]
    Init --> AuditSvc[Audit log]
    Init --> TraceSvc[Trace capture service]

    KD --> MetricsSvc
    KD --> StatusSvc
    KD --> TraceSvc

    Services[Services and drivers] --> LogSink[Scoped LogSink caps]
    Services --> Health[Health caps]
    Services --> AuditWriter[Scoped AuditWriter caps]

    LogSink --> LogSvc
    Health --> StatusSvc
    AuditWriter --> AuditSvc

    Broker[AuthorityBroker] --> Readers[Scoped readers]
    Readers --> Shell[Shell / agent / operator tools]

    StatusSvc --> Readers
    LogSvc --> Readers
    MetricsSvc --> Readers
    TraceSvc --> Readers
    AuditSvc --> Readers

The important property is that there is no ambient monitoring namespace. The graph is assembled by init and supervisors. Readers are capabilities, not paths.

Core Interfaces

These are conceptual interfaces. They should not be added to schema/capos.capnp until the current manifest-executor work is complete and a specific implementation slice needs them.

enum Severity {
  debug @0;
  info @1;
  warn @2;
  error @3;
  critical @4;
}

struct LogRecord {
  tick @0 :UInt64;
  severity @1 :Severity;
  component @2 :Text;
  pid @3 :UInt32;
  badge @4 :UInt64;
  message @5 :Text;
}

struct LogFilter {
  minSeverity @0 :Severity;
  componentPrefix @1 :Text;
  pid @2 :UInt32;
  includeDebug @3 :Bool;
}

interface LogSink {
  write @0 (record :LogRecord) -> ();
}

interface LogReader {
  read @0 (cursor :UInt64, maxRecords :UInt32, filter :LogFilter)
      -> (records :List(LogRecord), nextCursor :UInt64, dropped :UInt64);
}

LogSink is what ordinary services receive. LogReader is what shells, operators, supervisors, and diagnostic tools receive. A scoped reader can filter to one service subtree or session before the caller ever sees the record.
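The "filter before the caller ever sees the record" property can be sketched in Rust. The field names mirror the conceptual LogFilter/LogRecord schema above; `read_scoped` and the grant/request split are hypothetical, but they show the intended rule: a scoped reader intersects its grant-time filter with the caller's request-time filter, so the caller can only narrow, never widen.

```rust
/// Sketch of the filtering a scoped LogReader wrapper would apply
/// before records reach the caller. Field names mirror the conceptual
/// schema; the wrapper shape itself is hypothetical.
#[derive(Clone)]
struct LogRecord {
    severity: u8, // Severity ordinal: debug=0 .. critical=4
    component: String,
    pid: u32,
    message: String,
}

struct LogFilter {
    min_severity: u8,
    component_prefix: String,
    pid: Option<u32>, // None = any pid
}

fn matches(f: &LogFilter, r: &LogRecord) -> bool {
    r.severity >= f.min_severity
        && r.component.starts_with(&f.component_prefix)
        && f.pid.map_or(true, |p| p == r.pid)
}

/// The grant-time filter (baked into the wrapper at mint time) is
/// intersected with the request-time filter, so a holder can only
/// narrow what its grant already allows.
fn read_scoped(grant: &LogFilter, req: &LogFilter, log: &[LogRecord]) -> Vec<LogRecord> {
    log.iter()
        .filter(|r| matches(grant, r) && matches(req, r))
        .cloned()
        .collect()
}
```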

struct ProcessStatus {
  pid @0 :UInt32;
  serviceName @1 :Text;
  state @2 :Text;
  capSlotsUsed @3 :UInt32;
  capSlotsMax @4 :UInt32;
  outstandingCalls @5 :UInt32;
  cqReady @6 :UInt32;
  cqOverflow @7 :UInt64;
  lastExitCode @8 :Int64;
}

struct ServiceStatus {
  name @0 :Text;
  health @1 :Text;
  pid @2 :UInt32;
  restartCount @3 :UInt32;
  lastError @4 :Text;
}

interface SystemStatus {
  listProcesses @0 () -> (processes :List(ProcessStatus));
  listServices @1 () -> (services :List(ServiceStatus));
  service @2 (name :Text) -> (status :ServiceStatus);
}

SystemStatus is read-only. A broad instance can see the system; wrappers can expose one service, one supervision subtree, or one session.

enum MetricKind {
  counter @0;
  gauge @1;
  histogram @2;
}

struct MetricSample {
  # Well-known fixed-name slot for counters and gauges the aggregator
  # understands without additional schema lookup. Use this for stable
  # kernel counters to keep the hot path allocation-free.
  name @0 :Text;
  kind @1 :MetricKind;
  value @2 :Int64;
  tick @3 :UInt64;

  # Producer-scoped typed envelope for richer samples (histograms,
  # top-k tables, per-subsystem structs). Payload is a capnp message;
  # the schema is identified by `schemaHash` (capnp node id) and keyed
  # per producer. Opaque to the generic reader; a schema-aware viewer
  # decodes it.
  producerId @4 :UInt64;
  schemaHash @5 :UInt64;
  payload    @6 :Data;
}

struct MetricFilter {
  prefix @0 :Text;
  service @1 :Text;
}

interface MetricsReader {
  snapshot @0 (filter :MetricFilter, maxSamples :UInt32)
      -> (samples :List(MetricSample), truncated :Bool);
}

Early metrics should be fixed-name counters and gauges in the name/value slot. Avoid arbitrary labels until there is a concrete memory and cardinality policy. The producer-scoped envelope exists so richer samples do not force the generic reader to learn a string-key taxonomy — if a producer needs per-queue or per-device detail, it ships a typed capnp struct keyed by schemaHash rather than synthesizing name strings.

struct TraceSelector {
  pid @0 :UInt32;
  serviceName @1 :Text;
  errorCode @2 :Int32;
  includePayloadBytes @3 :Bool;
}

struct TraceRecord {
  tick @0 :UInt64;
  pid @1 :UInt32;
  opcode @2 :UInt16;
  capId @3 :UInt32;
  methodId @4 :UInt16;
  interfaceId @5 :UInt64;
  result @6 :Int32;
  flags @7 :UInt16;
  payload @8 :Data;
}

interface TraceCapture {
  arm @0 (selector :TraceSelector, maxRecords :UInt32, maxBytes :UInt32)
      -> (captureId :UInt64);
  drain @1 (captureId :UInt64, maxRecords :UInt32)
      -> (records :List(TraceRecord), complete :Bool, dropped :UInt64);
}

Payload capture should be off by default. A capture cap that can read payload bytes is closer to a debug privilege than a normal status cap.
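The arm/drain lifecycle implies a particular reader loop: drain in bounded batches until the capture reports completion, then take the loss summary. The sketch below uses a hypothetical `MockTap` standing in for whatever backs the TraceCapture cap; the loop shape is the point, not the mock.

```rust
/// Mock capture source standing in for the TraceCapture backend.
/// Records are stand-in u64s; a real tap would hold TraceRecords.
struct MockTap {
    pending: Vec<u64>,
    dropped: u64,
}

impl MockTap {
    /// Drain up to max_records; complete=true when nothing remains,
    /// at which point the drop summary is reported and reset.
    fn drain(&mut self, max_records: usize) -> (Vec<u64>, bool, u64) {
        let n = max_records.min(self.pending.len());
        let batch: Vec<u64> = self.pending.drain(..n).collect();
        let complete = self.pending.is_empty();
        let dropped = if complete {
            std::mem::replace(&mut self.dropped, 0)
        } else {
            0
        };
        (batch, complete, dropped)
    }
}

/// Reader loop: bounded batches until completion, accumulating
/// records plus the final explicit loss count.
fn collect_all(tap: &mut MockTap, batch: usize) -> (Vec<u64>, u64) {
    let mut out = Vec::new();
    loop {
        let (records, complete, dropped) = tap.drain(batch);
        out.extend(records);
        if complete {
            return (out, dropped);
        }
    }
}
```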

enum HealthState {
  starting @0;
  ready @1;
  degraded @2;
  draining @3;
  failed @4;
  stopped @5;
}

interface Health {
  check @0 () -> (state :HealthState, reason :Text);
}

interface ServiceSupervisor {
  status @0 () -> (status :ServiceStatus);
  restart @1 () -> ();
}

ServiceSupervisor is authority-changing. Normal monitoring readers should not receive it. A broker can mint a leased ServiceSupervisor(net-stack) for one operator action.

Kernel Diagnostics Contract

The kernel should expose a small read-only diagnostics surface for facts only the kernel can know:

  • process table snapshot: pid, state, service name if known, wait state, exit code, ring physical identity hidden or omitted;
  • ring snapshot: SQ/CQ head/tail-derived occupancy, overflow count, corrupted head/tail recovery counts, opcode/error counters;
  • resource snapshot: cap slot usage, outstanding calls, scratch reservation, frame grant pages, mapped VM pages, free frame count, heap pressure;
  • scheduler snapshot: tick count, current pid, run queue length, blocked count, direct IPC handoff count, timeout wake count;
  • crash record: last panic/fault metadata and early boot stage.

The kernel should not implement log routing, alerting, dashboards, retention policy, restart decisions, RBAC, ABAC, or text search. Those are userspace service responsibilities.

Implementation shape:

  • Maintain fixed-size counters in existing kernel structures where the source event already occurs.
  • Prefer snapshots computed from existing state over duplicate counters when the cost is bounded.
  • Expose snapshots through a small set of narrow read-only capabilities, not one KernelDiagnostics god-cap. The initial decomposition:
    • SchedStats — tick count, current pid, run queue length, blocked count, direct IPC handoff count, cap_enter timeout/wake counts.
    • FrameStats — free/used frame counts, frame-grant pages, allocator pressure histogram.
    • RingStats — per-process SQ/CQ occupancy, cq_overflow, corrupted-head recovery counts, opcode counters, transport-error counters.
    • CapTableStats — per-process slot occupancy, generation-rollover counts, insertion/remove rates.
    • EndpointStats — per-endpoint waiter depth, RECV/RETURN match rate, abort/cancellation counts.
    • CrashSnapshot — last panic/fault metadata, early boot stage, recent SQE context when safe.
  • Each narrow cap exposes snapshot() -> (sample :MetricSample) or a typed struct; none of them enumerates processes or reads cap tables beyond what the subsystem owns. A trusted status service composes the ones it needs; a broker leases a subset for operator sessions without the rest.
  • ProcessInspector (pid-scoped process table, cap-table enumeration, VM map) is a distinct, stronger cap and lives with process-management authority, not with monitoring.
  • Convert broad diagnostics into scoped userspace wrappers before handing them to shells or applications.
  • Keep panic/fault serial writes independent of any diagnostics service.

Promotion from the measure feature: the benchmark counters in kernel/src/measure.rs graduate to always-on in RingStats / SchedStats when the per-event cost is provably a single relaxed atomic add. Cycle-counter instrumentation (rdtsc/rdtscp) stays behind cfg(feature = "measure") because it is serializing and benchmark-only. The promotion threshold keeps normal dispatch builds free of instrumentation cost without forcing monitoring into a second build configuration.
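The "single relaxed atomic add" threshold can be shown concretely. The struct and field names below are illustrative, not the actual kernel layout: the hot path pays one relaxed `fetch_add` with no fence or lock, and only the snapshot side performs loads, which is sufficient for monotonic counters read as point-in-time approximations.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Sketch of the always-on counter shape the promotion threshold
/// allows. Field names are illustrative, not the kernel's layout.
struct RingStats {
    dispatches: AtomicU64,
    cq_overflow: AtomicU64,
}

impl RingStats {
    const fn new() -> Self {
        RingStats {
            dispatches: AtomicU64::new(0),
            cq_overflow: AtomicU64::new(0),
        }
    }

    /// Hot path: a single relaxed add, no fence, no lock.
    #[inline]
    fn count_dispatch(&self) {
        self.dispatches.fetch_add(1, Ordering::Relaxed);
    }

    /// Snapshot side: relaxed loads are acceptable because each
    /// counter is monotonic and read as an approximation, not as a
    /// consistent cross-counter transaction.
    fn snapshot(&self) -> (u64, u64) {
        (
            self.dispatches.load(Ordering::Relaxed),
            self.cq_overflow.load(Ordering::Relaxed),
        )
    }
}
```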

Logging Model

Early boot has only serial. After init starts the log service, ordinary services should receive LogSink rather than raw Console unless they need emergency console access.

Recommended path:

  1. Kernel serial remains for boot, panic, and fault records.
  2. Init starts a userspace log service and passes scoped LogSink caps to children.
  3. The log service forwards selected records to Console until persistent storage exists.
  4. SystemConfig.logLevel becomes an initial policy input for which records the log service forwards and retains.
  5. Session and operator tools receive scoped LogReader caps from a broker.

Services should not put secrets, raw capability payloads, full auth proofs, or arbitrary user input into logs without explicit redaction. Log records are data, not commands.
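What "explicit redaction" might look like at a producer boundary, as a sketch: a pass that masks the values of known-sensitive keys before a message reaches a LogSink. The `key=value` token convention and the key list are assumptions for illustration; real policy would come from the service or its supervisor.

```rust
/// Hypothetical producer-side redaction pass: mask the values of
/// known-sensitive keys before a message reaches a LogSink. The
/// key=value convention and key list are illustrative only.
fn redact(msg: &str, secret_keys: &[&str]) -> String {
    msg.split_whitespace()
        .map(|tok| match tok.split_once('=') {
            // Keep the key so the record stays diagnosable, but
            // never forward the value.
            Some((k, _)) if secret_keys.contains(&k) => format!("{}=<redacted>", k),
            _ => tok.to_string(),
        })
        .collect::<Vec<_>>()
        .join(" ")
}
```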

Metrics and Status

Status answers “what is alive and what state is it in.” Metrics answer “what is the numeric behavior over time.” Keeping them separate avoids a common failure mode where a human-readable status API grows into an unbounded metrics store.

Initial status fields should cover:

  • pid, service name, binary name, process state, exit code;
  • process handle wait state;
  • supervisor health and restart policy once supervision exists;
  • cap table occupancy and outstanding call count;
  • ring CQ availability and overflow;
  • endpoint queue occupancy where authorized.

Initial metrics should cover:

  • ring dispatches, SQEs processed, per-op counts, transport error counts;
  • cap-enter wait count, timeout count, wake count;
  • scheduler context switches and direct IPC handoffs;
  • frame free/used counts, frame grant pages, VM mapped pages;
  • log records accepted, suppressed, dropped, and forwarded;
  • trace records captured and dropped.

Avoid per-method, per-cap-id, per-badge, or per-user high-cardinality metrics by default. Those belong in short-lived traces or scoped logs.

Ring as Black Box

The first concrete monitoring milestone should be the existing WORKPLAN.md Ring-as-Black-Box item:

  • define a bounded capture format for SQE/CQE and endpoint transition records;
  • export capture through a debug capability or QEMU-only debug path;
  • build a host-side viewer that decodes records and capnp payloads when payload capture is authorized;
  • add one failing-call smoke whose captured log can be inspected offline.

This buys immediate debugging value without committing to durable audit, network export, service restart policy, or replay semantics.

This is inspection, not record/replay. Replay requires stronger determinism, payload retention, timer/input modeling, and capability-state checkpoints.

Capture path cost. The capture cap (working name RingTap) is feature-gated (cfg(feature = "debug_tap") analogous to measure). Every armed tap imposes a serializing fan-out on dispatch; keeping it out of the default kernel feature set prevents always-on cost. Arming a tap is itself an auditable event — the tapped process and the audit log observe it — and tap grants respect move-semantics so a tap cannot be silently cloned past its intended holder. Payload-capturing taps require a separately leased cap distinct from metadata-only capture because payloads may contain secrets.

Health and Supervision

Health and restart policy should live with supervisors, not in a central kernel daemon.

Each supervisor owns:

  • a narrowed ProcessSpawner;
  • child ProcessHandle caps;
  • the cap bundle needed to restart its subtree;
  • optional Health caps exported by children;
  • a LogSink and AuditWriter for its own decisions.

Status services aggregate supervisor-reported health. They should distinguish:

  • no process exists;
  • process exists but never reported ready;
  • process is alive and ready;
  • process is alive but degraded;
  • process exited normally;
  • process failed and supervisor is backing off;
  • process was intentionally stopped or draining.

Restart authority should be a separate ServiceSupervisor cap. A read-only SystemStatus cap must not be able to restart anything.
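The supervisor-owned backoff state mentioned above can be sketched as follows. The doubling-with-cap policy and the tick units are illustrative, not decided policy; the point is that backoff is plain supervisor-local state, visible to status services as "failed and supervisor is backing off" without any kernel involvement.

```rust
/// Sketch of supervisor-owned restart backoff. Doubling-with-cap
/// and tick units are illustrative, not decided policy.
struct Backoff {
    base_ticks: u64,
    max_ticks: u64,
    failures: u32,
}

impl Backoff {
    fn new(base_ticks: u64, max_ticks: u64) -> Self {
        Backoff { base_ticks, max_ticks, failures: 0 }
    }

    /// Called on each failed exit; returns the delay before the
    /// next restart attempt: base * 2^failures, saturating at the
    /// cap so a crash loop settles into a steady retry interval.
    fn next_delay(&mut self) -> u64 {
        let delay = self
            .base_ticks
            .saturating_mul(1u64 << self.failures.min(32))
            .min(self.max_ticks);
        self.failures += 1;
        delay
    }

    /// A healthy run resets the failure count.
    fn note_healthy(&mut self) {
        self.failures = 0;
    }
}
```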

Audit Integration

Audit should share infrastructure with logging only at the storage or transport layer. Its semantics are different.

Audit producers:

  • AuthorityBroker for policy decisions and leased grants;
  • supervisors for restarts and service lifecycle actions;
  • session manager for session creation and logout;
  • kernel or status service for cap transfer/release/revocation summaries when those events become part of the exported authority graph;
  • recovery tools for repair actions.

Audit readers are scoped:

  • a user can read records for its own session;
  • an operator can read a service subtree;
  • a recovery or security role can read broader streams after policy approval.

Audit entries must avoid secrets and payload dumps. They should record object identity, service identity, policy decision summaries, and capability interface classes rather than raw data.

Security and Backpressure

Monitoring must not become the easiest denial-of-service path.

Required controls:

  • Per-process log token buckets, matching the S.9 diagnostic aggregation design.
  • Suppression summaries for repeated invalid submissions.
  • Fixed-size ring buffers with explicit dropped counts.
  • Maximum record size for logs, events, crash records, and traces.
  • Bounded formatting outside interrupt context.
  • No heap allocation in timer or panic paths.
  • No unbounded metric label creation from user-controlled strings.
  • Payload tracing disabled by default.
  • Redaction rules at producer boundaries and at reader wrappers.
  • Capability-scoped readers; no unauthenticated “debug all” endpoint.

When pressure forces dropping, preserve first-observation diagnostics and later summaries. Losing detailed logs is acceptable; corrupting scheduler progress or blocking the kernel on log I/O is not.
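The per-process token bucket from the list above can be sketched as follows. Refill rate, capacity, and tick units are illustrative; the essential properties are that admission is O(1) on the producer path and that suppressed records are counted, so the reader later sees an explicit "N records suppressed" summary instead of silence.

```rust
/// Sketch of a per-process log token bucket. Parameters are
/// illustrative; suppressed records are counted so loss becomes an
/// explicit summary rather than silent disappearance.
struct TokenBucket {
    tokens: u64,
    capacity: u64,
    refill_per_tick: u64,
    last_tick: u64,
    suppressed: u64,
}

impl TokenBucket {
    fn new(capacity: u64, refill_per_tick: u64) -> Self {
        TokenBucket {
            tokens: capacity,
            capacity,
            refill_per_tick,
            last_tick: 0,
            suppressed: 0,
        }
    }

    /// Admit one record at `tick`, or count it as suppressed.
    fn admit(&mut self, tick: u64) -> bool {
        let elapsed = tick.saturating_sub(self.last_tick);
        self.tokens = (self.tokens + elapsed * self.refill_per_tick).min(self.capacity);
        self.last_tick = tick;
        if self.tokens > 0 {
            self.tokens -= 1;
            true
        } else {
            self.suppressed += 1;
            false
        }
    }

    /// Drain the suppression summary (e.g. when emitting one
    /// "N records suppressed" line).
    fn take_suppressed(&mut self) -> u64 {
        std::mem::replace(&mut self.suppressed, 0)
    }
}
```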

Relationship to Existing Proposals

  • Service Architecture: monitoring services are ordinary userspace services spawned by init or supervisors. Logging policy and service topology stay out of the pre-init kernel path.
  • Shell: the native and agent shell should receive scoped SystemStatus and LogReader caps in daily profiles, not global supervisor authority.
  • User Identity and Policy: AuthorityBroker mints scoped readers and leased supervisor caps based on session policy; AuditLog records the decisions.
  • Error Handling: transport errors and CapException payloads are monitoring signals, but retry policy remains userspace.
  • Authority Accounting: resource ledgers provide the first metrics substrate and define quota/backpressure boundaries.
  • Security and Verification: hostile-input tests should cover log flood aggregation and bounded diagnostic paths.
  • Live Upgrade: health, audit, and service status become prerequisites for credible upgrade orchestration.

Implementation Plan

  1. Document the model. Keep monitoring as a future architecture proposal and do not disturb the current manifest-executor milestone.

  2. Ring as Black Box. Implement bounded CQE/SQE capture, host-side decoding, and one failing-call smoke. This is the first useful monitoring artifact.

  3. Userspace log service. Add LogSink and LogReader schemas, start a log service from init, forward selected records to Console, and enforce logLevel, record size, and drop summaries.

  4. Narrow kernel stats caps and SystemStatus. Add the narrow read-only caps (SchedStats, FrameStats, RingStats, CapTableStats, EndpointStats, CrashSnapshot) as bounded snapshot surfaces. A userspace SystemStatus service composes the ones it needs and exposes scoped wrappers to shells and operator tools. Leave ProcessInspector out of this step — it belongs with process-management authority, not monitoring.

  5. Metrics snapshots. Add fixed counters and gauges for ring, scheduler, resource, log, and trace state. Keep labels static until a cardinality policy exists.

  6. Health and supervisor status. Add Health and read-only supervisor status once restart policy and exported service caps are concrete. Keep restart authority in separate ServiceSupervisor caps.

  7. Audit path. Add append-only audit records for broker decisions, cap grants, releases, revocations, restarts, and recovery actions. Start serial or memory backed; move to storage once the storage substrate exists.

  8. Crash records. Preserve bounded panic/fault metadata across the current boot where possible; later store records durably.

  9. Device, network, and storage metrics. Add driver metrics only after those drivers exist: interrupts, DMA/bounce usage, queue depth, RX/TX/drop/error counts, block latency, and reset events.

Non-Goals

  • No global /proc or /sys equivalent with ambient read access.
  • No kernel-resident dashboard, alert manager, text search, or policy engine.
  • No programmable kernel tracing language in the first monitoring design.
  • No promise of durable log retention before storage exists.
  • No default payload tracing.
  • No service restart authority bundled into ordinary read-only status caps.
  • No network export path until networking and policy can constrain it.

Open Questions

  • Should the narrow kernel stats caps expose snapshots only, or also a bounded event cursor?
  • What is the minimum timestamp model before wall-clock time exists?
  • Should log records carry local cap IDs, stable object IDs, or only interface and service metadata by default?
  • How should schema-aware trace decoding find schemas before a full SchemaRegistry exists?
  • Which crash fields are safe to expose to non-recovery sessions?
  • What retention policy is acceptable before persistent storage?
  • Should MetricsReader use typed structs for each subsystem instead of generic name/value samples?
  • Where should remote monitoring export fit once network transparency exists: a dedicated exporter service, capnp-rpc forwarding, or storage replication?