# Proposal: Capability-Native System Monitoring

How capOS should expose logs, metrics, health, traces, crash records, and
service status without introducing global `/proc`, ambient log access, or a
privileged monitoring daemon that bypasses the capability model.


## Problem

The current system is observable mostly through serial output, QEMU exit status,
smoke-test lines, CQE error codes, and a small measurement-only build feature.
That is enough for early kernel work, but it is not enough for a system whose
claims depend on service decomposition, explicit authority, restart policy,
auditability, and later cloud operation.

Monitoring is also not harmless. A monitoring service can reveal capability
topology, service names, scoped subject references, transport metadata, timing,
crash context, request payloads, and security decisions. If capOS imports a
Unix-style "read everything under `/proc`" or "global syslog" model, monitoring
becomes an ambient authority escape hatch. If it imports a kernel-programmable
tracing model too early, it adds a large privileged execution surface before
the basic service graph is stable.

The design target is narrower: make operational state visible through typed,
attenuable capabilities. A process should observe only the services, logs, and
signals it was granted authority to inspect.

## Current State

Implemented signal sources:

- Kernel diagnostics are printed through COM1 serial via `kprintln!`, timestamped
  with the PIT tick counter. Panic and fault paths use a mutex-free emergency
  serial writer.
- Userspace logging currently goes through the kernel `Console` capability,
  backed directly by serial and bounded per call.
- A Phase 1 capability log surface has landed: `LogSink`/`LogReader` over a
  bounded drop-oldest kernel ring (`kernel/src/cap/log.rs`), with
  `SystemConfig.logLevel` drop enforcement at the sink, serial forwarding of
  accepted records, and scoped sink/reader caps granted at spawn (proof:
  `make run-monitoring-log-smoke`). Metrics, status, health, traces, crash
  records, the narrow kernel stats caps, and persistent retention remain
  future phases.
- Runtime panics can use an emergency console path, then exit with a fixed code.
- Capability-ring CQEs carry structured transport results, including negative
  `CAP_ERR_*` values and serialized `CapException` payloads.
- The ring tracks `cq_overflow`, corrupted SQ/CQ recovery, and bounded SQE
  dispatch, but these facts are not exported as normal metrics.
- `ProcessSpawner` and `ProcessHandle.wait` expose basic child lifecycle
  observation, but restart policy, health checks, and exported-cap lifecycle are
  future work.
- `capos-lib::ResourceLedger` tracks cap slots, outstanding calls, scratch
  bytes, and frame grants, but only as local accounting state.
- The `measure` feature adds benchmark-only counters and TSC helpers for
  controlled `make run-measure` boots.
- `SystemConfig.logLevel` exists in the schema, is printed at boot, and is
  enforced as a drop threshold at the Phase 1 `LogSink`, but there is no
  routing or retention policy behind it yet.
- An `AuditLog` capability exists in the schema and kernel (`kernel/src/cap/audit_log.rs`),
  used by `AuthorityBroker` to record auth, setup, session, broker, and shell-launch
  events. It currently writes to serial via `kprintln!`; there is no ring-buffer
  reader cap or persistent retention yet.
- A `HardwareAuditLog` capability with a bounded volatile ring buffer and
  drain/snapshot readers exists for DMA/MMIO/Interrupt cap lifecycle events
  (`kernel/src/cap/hardware_audit.rs`), including sequence numbers and
  dropped-record counts. A userspace `hardware-audit-service` drains it into a
  Store/Namespace-backed hash-chained segment ring and exposes scoped
  `HardwareAuditReader` snapshots; the current backing `StoreCap` is
  RAM-backed, so post-reboot retention is still a storage-backend concern.
- The `hardware_release_log` module (`kernel/src/cap/hardware_release_log.rs`) emits
  DMA pool, DMA buffer, DeviceMmio, and Interrupt release outcomes to serial; there
  is no reader cap or retention yet.

That means the system has useful raw signals and partial audit infrastructure but lacks
a unified capability-shaped monitoring architecture with log routing, metrics export,
and reader caps for most signal classes.

## Design Principles

1. **Observation is authority.** Reading logs, status, metrics, traces, crash
   records, or audit entries requires a capability.
2. **No global monitoring root.** `SystemStatus(all)`, `LogReader(all)`, and
   `ServiceSupervisor(all)` are powerful caps. Normal sessions receive scoped
   wrappers.
3. **Kernel facts, userspace policy.** The kernel may expose bounded facts about
   processes, rings, resources, and faults. Retention, filtering, aggregation,
   health semantics, restart policy, and user-facing views belong in userspace.
4. **Separate signal classes.** Logs, metrics, lifecycle events, traces, health,
   crash records, and audit logs have different readers, retention rules, and
   security properties.
5. **Bounded by construction.** Every producer path has a byte, entry, or time
   budget. Loss is explicit and summarized.
6. **Payload capture is exceptional.** Default tracing records headers,
   interface IDs, method IDs, sizes, result codes, and scoped transport
   identifiers only when authorized. Capturing method payloads needs a stronger
   cap because payloads may contain secrets.
7. **Serial remains emergency plumbing.** Early boot, panic, and recovery still
   need direct serial output. Normal services should receive log caps rather
   than broad `Console`.
8. **Audit is not debug logging.** Audit records security-relevant decisions and
   capability lifecycle events. It is append-only from producers and exposed
   through scoped readers.
9. **Pull by default, push when justified.** Status and metrics are pull-shaped
   (reader polls a snapshot cap). Logs, lifecycle events, crash records, and
   audit entries are push-shaped (producer calls into a sink). Traces are pull
   with an explicit arm/drain lifecycle because capture is expensive. Each
   direction has its own cap surface; do not generalize one shape to cover all
   signals.
10. **Narrow kernel stats caps over one god-cap.** The kernel exposes bounded
    facts through several small read-only caps (ring, scheduler, resource
    ledger, frames, endpoints, caps, crash) rather than one `KernelDiagnostics`
    that grants everything. Narrow caps let an init-owned status service be
    assembled by composition, and let a broker lease a subset to an operator
    without handing over the rest.

## Signal Taxonomy

### Logs

Human-oriented diagnostic records:

- severity, component, service name, pid, optional subject/service reference,
  monotonic timestamp, message text;
- rate-limited at producer and log service boundaries;
- suitable for serial forwarding, ring-buffer retention, and later storage;
- not a source of truth for security decisions.

### Metrics

Low-cardinality numeric state:

- per-process ring SQ/CQ occupancy, `cq_overflow`, invalid SQE counts, opcode
  counts, transport error counts;
- scheduler runnable/blocked counts, direct IPC handoffs, cap-enter timeouts,
  process exits;
- resource ledger usage: cap slots, outstanding calls, scratch bytes, frame
  grants, endpoint queue occupancy, VM mapped pages;
- heap/frame allocator pressure;
- later device, network, storage, and CPU-time counters.

Metric shape is fixed to three forms:

- **Counter** — monotonic `u64`, reset only by reboot. Cumulative semantics
  make aggregation composable.
- **Gauge** — `i64` that moves both ways. Used for queue depths, free-frame
  counts, mapped-page counts.
- **Histogram** — fixed bucket layout carried in the descriptor, `u64` per
  bucket. Used for ring-dispatch duration, context-switch latency, IPC RTT.

Richer shapes (top-k tables, exponential histograms) are emitted as opaque
typed payloads through the producer-scoped envelope described under "Core
Interfaces"; the generic reader treats them as data, and a schema-aware
viewer decodes them. Metrics should be snapshots or monotonic counters, not
unbounded label streams.
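
The three fixed shapes can be sketched in Rust as follows. This is a minimal illustration, not the kernel's implementation: the type names, the saturating-add choice for counters, and the explicit overflow bucket on the histogram are all assumptions made for the example.

```rust
/// Monotonic counter: only ever increases; a fresh value means a reboot.
#[derive(Debug, Default)]
pub struct Counter(u64);

impl Counter {
    pub fn add(&mut self, n: u64) {
        // Assumption: saturate instead of wrapping so readers never see a decrease.
        self.0 = self.0.saturating_add(n);
    }
    pub fn get(&self) -> u64 {
        self.0
    }
}

/// Gauge: signed value that moves both ways (queue depth, free frames).
#[derive(Debug, Default)]
pub struct Gauge(i64);

impl Gauge {
    pub fn set(&mut self, v: i64) {
        self.0 = v;
    }
    pub fn get(&self) -> i64 {
        self.0
    }
}

/// Histogram with a fixed bucket layout. `bounds[i]` is the inclusive upper
/// bound of bucket `i`; values above the last bound land in `overflow`.
pub struct Histogram<const N: usize> {
    bounds: [u64; N],
    buckets: [u64; N],
    overflow: u64,
}

impl<const N: usize> Histogram<N> {
    pub fn new(bounds: [u64; N]) -> Self {
        Self { bounds, buckets: [0; N], overflow: 0 }
    }

    pub fn observe(&mut self, value: u64) {
        // Linear scan is fine for the small, fixed bucket counts assumed here.
        match self.bounds.iter().position(|&b| value <= b) {
            Some(i) => self.buckets[i] += 1,
            None => self.overflow += 1,
        }
    }

    pub fn buckets(&self) -> &[u64; N] {
        &self.buckets
    }

    pub fn overflow(&self) -> u64 {
        self.overflow
    }
}
```

Because the bucket count is a const generic, a histogram's memory footprint is known at compile time, which matches the "bounded by construction" principle.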

### Events

Discrete lifecycle facts:

- process spawned, started, exited, waited, killed, or failed to load;
- service declared healthy, unhealthy, restarting, quiescing, or upgraded;
- endpoint queue overflow, cancellation, disconnected holder, transfer rollback;
- resource quota rejection;
- device reset, interrupt storm, link up/down, block I/O error once devices
  exist.

Events are useful for supervisors and status views. They may also feed logs.

### Traces

Bounded high-detail capture for debugging:

- SQE/CQE records around one pid, service subtree, endpoint, cap id, or error
  class;
- optional capnp payload capture only with explicit authority;
- offline schema-aware viewer for reproducing and explaining a failure;
- short retention by default.

This is the `Ring as Black Box` milestone from `docs/tasks/README.md`, not full replay.

### Health

Declared service state:

- ready, starting, degraded, draining, failed, stopped;
- last successful health check and last failure reason;
- dependency health summaries;
- supervisor-owned restart intent and backoff state.

Health is not inferred only from process liveness. A process can be alive and
unhealthy, or intentionally draining and still useful.

### Crash Records

Panic, exception, and fatal userspace runtime records:

- boot stage, current pid if known, fault vector, RIP/CR2/error code where
  applicable, recent SQE context when safe, and last serial line cursor;
- bounded, redacted, and readable through a crash/debug capability;
- serial fallback remains mandatory when no reader exists.

### Audit

Security and policy records:

- session creation, approval request, policy decision, cap grant, cap transfer,
  cap release/revocation, denial, declassification/relabel operation;
- no raw authentication proofs, private keys, bearer tokens, or full environment
  dumps;
- query access is scoped by session, service subtree, or operator role.

## ITU-T X.700 Series Alignment

The ITU-T X.700 Systems Management framework (OSI management) predates
modern observability stacks by two decades but still offers a cleaner
decomposition than ad-hoc log/metric/trace categorization. capOS is not
implementing CMIS/CMIP (X.710/X.711 assume ASN.1 BER over an OSI stack
capOS will never speak); the value is the **signal taxonomy and field
model**, not the transport.

| capOS signal class | Closest ITU-T | What we take from it |
|---|---|---|
| Logs            | X.735 *Log control function* | Log record identity (moRef analog = `component`+`pid`+`service_ref`), severity mapping, scoped reader model. |
| Metrics         | X.739 *Metric objects and attributes* | Fixed metric shapes (counter / gauge / histogram) as opposed to open-ended label streams. |
| Events          | X.734 *Event report management function* | Discriminator-driven filtering, event-type taxonomy, producer/consumer separation. |
| Alarms (events) | X.733 *Alarm reporting function* | Perceived severity (cleared/indeterminate/warning/minor/major/critical), probable cause, specific problem, trend indication, proposed repair action. |
| Health          | X.731 *State management function* | Operational / administrative / usage state model (enabled/disabled, unlocked/locked, idle/active/busy) feeding `HealthState`. |
| Audit           | X.740 *Security audit trail function* | Audit record field model: event type, time, initiator, target, outcome, evidence chain. |
| Crash records   | X.733 + X.736 *Security alarm reporting function* | Structured cause + severity for fatal/integrity events; security-relevant crashes flow through both the crash cap and the audit cap. |

**FCAPS coverage.** X.700/X.701 define the five management functional
areas: Fault, Configuration, Accounting, Performance, Security. This
proposal covers **Fault** (crash records, alarms), **Performance**
(metrics), and **Security** (audit). **Configuration** and
**Accounting** are deliberately out of scope here:

- **Configuration management** (X.700 "C") — versioned, signed
  configuration deltas applied to running services. Partially covered
  by `cloud-metadata-proposal.md` (`ManifestDelta`) but capOS has no
  general configuration-management proposal yet. Candidate for a
  separate proposal once the manifest-executor and live-upgrade work
  stabilize.
- **Accounting management** (X.700 "A") — per-principal, per-session,
  per-service resource-usage ledgers with retention and export. The
  kernel's `ResourceLedger` is the lowest layer; aggregation,
  persistence, and audit-grade usage records are undesigned. Candidate
  for a separate proposal; would compose with the audit cap and the
  user-identity session model.

### Updated Field Mappings

`LogRecord` maps roughly onto X.735 `logRecord`:

```
X.735 logRecord                    capOS LogRecord
---------------                    ---------------
logRecordId                        (cursor + pid + tick)
managedObjectClass                 component + service name
managedObjectInstance              pid + service_ref
eventType                          Severity (lossy; add explicit
                                    eventType once alarm/security
                                    records share the pipe)
eventTime                          tick (monotonic; wall-clock when
                                    available)
notificationIdentifier             not modeled; add when events need
                                    correlation IDs
```

Audit records should adopt X.740 fields explicitly. Proposed schema
extension once the audit service ships:

```capnp
enum AuditEventType {
  # X.740 §6.1 event categories, pruned to what capOS actually records.
  authentication    @0;   # login, logout, auth failure
  accessControl     @1;   # grant, deny, revoke, transfer
  policyDecision    @2;   # broker decision with plan + constraints
  objectLifecycle   @3;   # capability create/destroy, object reap
  securityAlarm     @4;   # X.736-shaped: integrity/confidentiality violation
  serviceControl    @5;   # restart, upgrade, quiesce, resume
  administrative    @6;   # manifest update, role change
}

enum AuditOutcome {
  success           @0;
  failure           @1;
  denied            @2;
  pending           @3;   # multi-party approval outstanding
}

struct AuditRecord {
  tick        @0 :UInt64;
  eventType   @1 :AuditEventType;
  initiator   @2 :Data;        # opaque principal/session ID
  target      @3 :Text;        # interface + service identity
  outcome     @4 :AuditOutcome;
  reason      @5 :Text;
  evidence    @6 :Data;        # opaque, bounded; no secrets
}
```

Alarms (X.733) are a structured subset of Events, not a new signal
class. The `ServiceStatus` / `Health` path emits alarms when degraded,
failed, or security-relevant thresholds trip:

```capnp
enum PerceivedSeverity {
  cleared        @0;
  indeterminate  @1;
  warning        @2;
  minor          @3;
  major          @4;
  critical       @5;
}

enum ProbableCause {
  # X.733 Annex A lists ~50 values; capOS starts with the handful that
  # match known failure modes and extends as needed.
  communicationsError    @0;
  integrityViolation     @1;
  operationalViolation   @2;
  softwareError          @3;
  underlyingResourceUnavailable @4;
  qualityOfServiceAlarm  @5;
  securityAlarmIntegrity @6;
  securityAlarmAccess    @7;
}

enum AlarmTrend {
  # X.733 trend indication values.
  lessSevere     @0;
  noChange       @1;
  moreSevere     @2;
}

struct Alarm {
  tick            @0 :UInt64;
  managedObject   @1 :Text;           # service or cap identity
  severity        @2 :PerceivedSeverity;
  probableCause   @3 :ProbableCause;
  specificProblem @4 :Text;
  trend           @5 :AlarmTrend;
  proposedRepair  @6 :Text;
}
```

The taxonomy buys two things the Unix-style "syslog + Prometheus +
Jaeger" tower does not: (1) alarms as a first-class signal with a
defined severity lattice and probable-cause field, which is how
operators actually triage, and (2) audit as a distinct record type with
fixed fields rather than a convention-layer over free-form log
messages.

### ITU-T references

- ITU-T Rec. X.700 (09/92) — Management framework
- ITU-T Rec. X.701 (08/97) — Systems management overview
- ITU-T Rec. X.733 (02/92) — Alarm reporting function
- ITU-T Rec. X.734 (09/92) — Event report management function
- ITU-T Rec. X.735 (09/92) — Log control function
- ITU-T Rec. X.736 (01/92) — Security alarm reporting function
- ITU-T Rec. X.740 (01/92) — Security audit trail function
- ITU-T Rec. X.731 (01/92) — State management function
- ITU-T Rec. X.739 (11/93) — Metric objects and attributes

## Proposed Architecture

```mermaid
flowchart TD
    Kernel[Kernel primitives] --> KD[Narrow kernel stats caps]
    Kernel --> Serial[Emergency serial]

    Init[init / root supervisor] --> LogSvc[Log service]
    Init --> MetricsSvc[Metrics service]
    Init --> StatusSvc[Status service]
    Init --> AuditSvc[Audit log]
    Init --> TraceSvc[Trace capture service]

    KD --> MetricsSvc
    KD --> StatusSvc
    KD --> TraceSvc

    Services[Services and drivers] --> LogSink[Scoped LogSink caps]
    Services --> Health[Health caps]
    Services --> AuditWriter[Scoped AuditWriter caps]

    LogSink --> LogSvc
    Health --> StatusSvc
    AuditWriter --> AuditSvc

    Broker[AuthorityBroker] --> Readers[Scoped readers]
    Readers --> Shell[Shell / agent / operator tools]

    StatusSvc --> Readers
    LogSvc --> Readers
    MetricsSvc --> Readers
    TraceSvc --> Readers
    AuditSvc --> Readers
```

The important property is that there is no ambient monitoring namespace. The
graph is assembled by init and supervisors. Readers are capabilities, not paths.

## Core Interfaces

These are conceptual interfaces. They should not be added to
`schema/capos.capnp` until the current manifest-executor work is complete and a
specific implementation slice needs them.

```capnp
enum Severity {
  debug @0;
  info @1;
  warn @2;
  error @3;
  critical @4;
}

struct LogRecord {
  tick @0 :UInt64;
  severity @1 :Severity;
  component @2 :Text;
  pid @3 :UInt32;
  subjectRef @4 :Data;   # privacy-preserving subject/session correlation
  sessionRef @5 :Data;   # optional scoped session correlation
  serviceRef @6 :Data;   # optional authorized service/component correlation
  transportId @7 :Data;  # debug-only ring/endpoint metadata, not identity
  message @8 :Text;
}

struct LogFilter {
  minSeverity @0 :Severity;
  componentPrefix @1 :Text;
  pid @2 :UInt32;
  includeDebug @3 :Bool;
}

interface LogSink {
  write @0 (record :LogRecord) -> ();
}

interface LogReader {
  read @0 (cursor :UInt64, maxRecords :UInt32, filter :LogFilter)
      -> (records :List(LogRecord), nextCursor :UInt64, dropped :UInt64);
}
```

`LogSink` is what ordinary services receive. `LogReader` is what shells,
operators, supervisors, and diagnostic tools receive. A scoped reader can filter
to one service subtree or session before the caller ever sees the record.
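
The cursor and `dropped` contract of `LogReader.read` can be sketched with a bounded drop-oldest ring. This is an illustrative model only: `LogRing`, its sequence-number cursor, and the reduced `LogRecord` are assumptions for the example, and filtering beyond a minimum severity is left out.

```rust
use std::collections::VecDeque;

#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub enum Severity { Debug, Info, Warn, Error, Critical }

#[derive(Clone, Debug, PartialEq)]
pub struct LogRecord {
    pub tick: u64,
    pub severity: Severity,
    pub message: String,
}

/// Bounded drop-oldest ring. Every accepted record gets a sequence number;
/// a reader's cursor is the sequence number it wants next (assumption).
pub struct LogRing {
    capacity: usize,
    min_severity: Severity,
    next_seq: u64,
    records: VecDeque<LogRecord>,
}

impl LogRing {
    pub fn new(capacity: usize, min_severity: Severity) -> Self {
        assert!(capacity > 0);
        Self { capacity, min_severity, next_seq: 0, records: VecDeque::with_capacity(capacity) }
    }

    /// Sink side. Returns false when the record is below the level threshold.
    pub fn write(&mut self, record: LogRecord) -> bool {
        if record.severity < self.min_severity {
            return false;
        }
        if self.records.len() == self.capacity {
            self.records.pop_front(); // drop the oldest record to stay bounded
        }
        self.records.push_back(record);
        self.next_seq += 1;
        true
    }

    /// Reader side. Returns (records, next_cursor, dropped), where `dropped`
    /// counts records between `cursor` and the oldest one still retained.
    pub fn read(&self, cursor: u64, max_records: usize) -> (Vec<LogRecord>, u64, u64) {
        let oldest = self.next_seq - self.records.len() as u64;
        let start = cursor.clamp(oldest, self.next_seq);
        let dropped = start.saturating_sub(cursor);
        let skip = (start - oldest) as usize;
        let out: Vec<LogRecord> =
            self.records.iter().skip(skip).take(max_records).cloned().collect();
        let next = start + out.len() as u64;
        (out, next, dropped)
    }
}
```

A reader that falls behind learns exactly how many records it missed instead of silently skipping them, which is the "loss is explicit and summarized" property from the design principles.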

Monitoring terminology should use snake_case names in prose and map them to
schema-style camelCase fields only at the Cap'n Proto boundary:

```text
subject_ref / session_ref:
  privacy-preserving identity or session correlation fields.

service_ref:
  service instance or component correlation where the reader is authorized.

transport_id:
  debug-only ring, endpoint, SQE/CQE, or waiter metadata; never subject
  identity.
```

Legacy endpoint badge terminology must not leak into user-facing monitoring
identity. If a low-level transport path still stores a badge-shaped selector,
monitoring may expose it only as debug `transport_id` under an appropriate
diagnostic cap, not as `subject_ref`, `session_ref`, or `service_ref`.

```capnp
struct ProcessStatus {
  pid @0 :UInt32;
  serviceName @1 :Text;
  state @2 :Text;
  capSlotsUsed @3 :UInt32;
  capSlotsMax @4 :UInt32;
  outstandingCalls @5 :UInt32;
  cqReady @6 :UInt32;
  cqOverflow @7 :UInt64;
  lastExitCode @8 :Int64;
}

struct ServiceStatus {
  name @0 :Text;
  health @1 :Text;
  pid @2 :UInt32;
  restartCount @3 :UInt32;
  lastError @4 :Text;
}

interface SystemStatus {
  listProcesses @0 () -> (processes :List(ProcessStatus));
  listServices @1 () -> (services :List(ServiceStatus));
  service @2 (name :Text) -> (status :ServiceStatus);
}
```

`SystemStatus` is read-only. A broad instance can see the system; wrappers can
expose one service, one supervision subtree, or one session.
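
A scoped wrapper over a broad status cap might look like the sketch below. The trait, the `BroadStatus` and `ScopedStatus` types, and the use of a name prefix such as `net/` to identify a supervision subtree are hypothetical choices for illustration; the proposal does not fix a naming scheme.

```rust
#[derive(Clone, Debug, PartialEq)]
pub struct ServiceStatus {
    pub name: String,
    pub health: String,
    pub pid: u32,
    pub restart_count: u32,
}

/// Read-only status surface mirroring part of the conceptual `SystemStatus`.
pub trait SystemStatus {
    fn list_services(&self) -> Vec<ServiceStatus>;
    fn service(&self, name: &str) -> Option<ServiceStatus>;
}

/// Broad instance: sees every service (held by init or a trusted status service).
pub struct BroadStatus {
    pub services: Vec<ServiceStatus>,
}

impl SystemStatus for BroadStatus {
    fn list_services(&self) -> Vec<ServiceStatus> {
        self.services.clone()
    }
    fn service(&self, name: &str) -> Option<ServiceStatus> {
        self.services.iter().find(|s| s.name == name).cloned()
    }
}

/// Scoped wrapper: exposes only services whose name starts with `prefix`.
pub struct ScopedStatus<S: SystemStatus> {
    inner: S,
    prefix: String,
}

impl<S: SystemStatus> ScopedStatus<S> {
    pub fn new(inner: S, prefix: &str) -> Self {
        Self { inner, prefix: prefix.to_string() }
    }
    fn in_scope(&self, name: &str) -> bool {
        name.starts_with(self.prefix.as_str())
    }
}

impl<S: SystemStatus> SystemStatus for ScopedStatus<S> {
    fn list_services(&self) -> Vec<ServiceStatus> {
        self.inner
            .list_services()
            .into_iter()
            .filter(|s| self.in_scope(&s.name))
            .collect()
    }
    fn service(&self, name: &str) -> Option<ServiceStatus> {
        // Out-of-scope names look identical to absent ones, so the wrapper
        // does not reveal which services exist elsewhere.
        if !self.in_scope(name) {
            return None;
        }
        self.inner.service(name)
    }
}
```

Since the wrapper implements the same trait as the broad instance, a holder cannot tell whether it was given the full view or a narrowed one.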

```capnp
enum MetricKind {
  counter @0;
  gauge @1;
  histogram @2;
}

struct MetricSample {
  # Well-known fixed-name slot for counters and gauges the aggregator
  # understands without additional schema lookup. Use this for stable
  # kernel counters to keep the hot path allocation-free.
  name @0 :Text;
  kind @1 :MetricKind;
  value @2 :Int64;
  tick @3 :UInt64;

  # Producer-scoped typed envelope for richer samples (histograms,
  # top-k tables, per-subsystem structs). Payload is a capnp message;
  # the schema is identified by `schemaHash` (capnp node id) and keyed
  # per producer. Opaque to the generic reader; a schema-aware viewer
  # decodes it.
  producerId @4 :UInt64;
  schemaHash @5 :UInt64;
  payload    @6 :Data;
}

struct MetricFilter {
  prefix @0 :Text;
  service @1 :Text;
}

interface MetricsReader {
  snapshot @0 (filter :MetricFilter, maxSamples :UInt32)
      -> (samples :List(MetricSample), truncated :Bool);
}
```

Early metrics should be fixed-name counters and gauges in the `name`/`value`
slot. Avoid arbitrary labels until there is a concrete memory and cardinality
policy. The producer-scoped envelope exists so richer samples do not force the
generic reader to learn a string-key taxonomy — if a producer needs per-queue
or per-device detail, it ships a typed capnp struct keyed by `schemaHash`
rather than synthesizing `name` strings.
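
The `snapshot` contract (prefix filter, `maxSamples` bound, explicit `truncated` flag) can be modeled as a plain function. The sample names used in the usage note and the reduced `MetricSample` struct are assumptions for the sketch; the typed envelope fields are omitted.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum MetricKind { Counter, Gauge }

#[derive(Clone, Debug, PartialEq)]
pub struct MetricSample {
    pub name: &'static str,
    pub kind: MetricKind,
    pub value: i64,
    pub tick: u64,
}

/// Returns at most `max_samples` samples whose name starts with `prefix`,
/// plus a flag saying whether more matching samples existed.
pub fn snapshot(
    all: &[MetricSample],
    prefix: &str,
    max_samples: usize,
) -> (Vec<MetricSample>, bool) {
    let mut matching = all.iter().filter(|s| s.name.starts_with(prefix));
    let samples: Vec<MetricSample> =
        matching.by_ref().take(max_samples).cloned().collect();
    // Anything left in the iterator means the reply was cut short.
    let truncated = matching.next().is_some();
    (samples, truncated)
}
```

With hypothetical names such as `ring.cq_overflow` and `sched.runnable`, a reader holding a cap scoped to the `ring.` prefix would call `snapshot(&all, "ring.", 8)` and never see scheduler samples.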

```capnp
struct TraceSelector {
  pid @0 :UInt32;
  serviceName @1 :Text;
  errorCode @2 :Int32;
  includePayloadBytes @3 :Bool;
}

struct TraceRecord {
  tick @0 :UInt64;
  pid @1 :UInt32;
  opcode @2 :UInt16;
  capId @3 :UInt32;
  methodId @4 :UInt16;
  interfaceId @5 :UInt64;
  result @6 :Int32;
  flags @7 :UInt16;
  payload @8 :Data;
}

interface TraceCapture {
  arm @0 (selector :TraceSelector, maxRecords :UInt32, maxBytes :UInt32)
      -> (captureId :UInt64);
  drain @1 (captureId :UInt64, maxRecords :UInt32)
      -> (records :List(TraceRecord), complete :Bool, dropped :UInt64);
}
```

Payload capture should default off. A capture cap that can read payload bytes is
closer to a debug privilege than a normal status cap.
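
The arm/drain lifecycle with record and byte budgets can be sketched as below. Everything here is illustrative: the `Capture` type, the fixed `HEADER_BYTES` cost per record, and the reading of `complete` as "no buffered records remain" are assumptions, since the proposal does not define them.

```rust
use std::collections::VecDeque;

/// Assumed fixed accounting cost of one record's metadata.
const HEADER_BYTES: usize = 32;

#[derive(Clone, Debug, PartialEq)]
pub struct TraceRecord {
    pub tick: u64,
    pub pid: u32,
    pub method_id: u16,
    pub result: i32,
    pub payload: Vec<u8>, // stays empty unless payload capture was authorized
}

pub struct Capture {
    pid: u32,
    include_payload: bool,
    max_records: usize,
    max_bytes: usize,
    captured: usize,
    used_bytes: usize,
    dropped: u64,
    queue: VecDeque<TraceRecord>,
}

impl Capture {
    /// Arms a capture for one pid with explicit budgets; payload capture is opt-in.
    pub fn arm(pid: u32, include_payload: bool, max_records: usize, max_bytes: usize) -> Self {
        Self {
            pid, include_payload, max_records, max_bytes,
            captured: 0, used_bytes: 0, dropped: 0, queue: VecDeque::new(),
        }
    }

    /// Called for each completed call; records outside the selector are ignored.
    pub fn observe(&mut self, mut rec: TraceRecord) {
        if rec.pid != self.pid {
            return;
        }
        if !self.include_payload {
            rec.payload.clear(); // metadata-only capture never retains payload bytes
        }
        let cost = HEADER_BYTES + rec.payload.len();
        if self.captured >= self.max_records || self.used_bytes + cost > self.max_bytes {
            self.dropped += 1; // over budget: count the loss instead of growing
            return;
        }
        self.captured += 1;
        self.used_bytes += cost;
        self.queue.push_back(rec);
    }

    /// Returns (records, complete, dropped).
    pub fn drain(&mut self, max_records: usize) -> (Vec<TraceRecord>, bool, u64) {
        let n = max_records.min(self.queue.len());
        let out: Vec<TraceRecord> = self.queue.drain(..n).collect();
        (out, self.queue.is_empty(), self.dropped)
    }
}
```

Stripping the payload before budget accounting means a metadata-only capture spends its byte budget on headers alone, so the same budget retains more records.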

```capnp
enum HealthState {
  starting @0;
  ready @1;
  degraded @2;
  draining @3;
  failed @4;
  stopped @5;
}

interface Health {
  check @0 () -> (state :HealthState, reason :Text);
}

interface ServiceSupervisor {
  status @0 () -> (status :ServiceStatus);
  restart @1 () -> ();
}
```

`ServiceSupervisor` is authority-changing. Normal monitoring readers should not
receive it. A broker can mint a leased `ServiceSupervisor(net-stack)` for one
operator action.

## Kernel Diagnostics Contract

The kernel should expose a small read-only diagnostics surface for facts only
the kernel can know:

- process table snapshot: pid, state, service name if known, wait state, exit
  code, ring physical identity hidden or omitted;
- ring snapshot: SQ/CQ head/tail-derived occupancy, overflow count, corrupted
  head/tail recovery counts, opcode/error counters;
- resource snapshot: cap slot usage, outstanding calls, scratch reservation,
  frame grant pages, mapped VM pages, free frame count, heap pressure;
- scheduler snapshot: tick count, current pid, run queue length, blocked count,
  direct IPC handoff count, timeout wake count;
- crash record: last panic/fault metadata and early boot stage.

The kernel should not implement log routing, alerting, dashboards, retention
policy, restart decisions, RBAC, ABAC, or text search. Those are userspace
service responsibilities.
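
As an example of a snapshot computed from existing state, ring occupancy can be derived from head and tail indices without a separate counter. The sketch assumes free-running `u32` indices that wrap naturally, which this proposal does not specify; `RingView` and `ring_occupancy` are hypothetical names.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum RingView {
    /// Number of entries currently between head and tail.
    Occupancy(u32),
    /// The indices imply more entries than the ring can hold.
    Corrupted,
}

/// Derives occupancy from free-running indices (assumption: indices are
/// incremented without masking, so wrapping subtraction gives the distance).
pub fn ring_occupancy(head: u32, tail: u32, capacity: u32) -> RingView {
    let used = tail.wrapping_sub(head);
    if used > capacity {
        // Userspace shares the ring, so the kernel must treat this as
        // untrusted input and report it rather than act on it.
        RingView::Corrupted
    } else {
        RingView::Occupancy(used)
    }
}
```

The same check that protects dispatch from a corrupted ring also yields the occupancy gauge, so the diagnostic costs nothing beyond the subtraction.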

Implementation shape:

- Maintain fixed-size counters in existing kernel structures where the source
  event already occurs.
- Prefer snapshots computed from existing state over duplicate counters when
  the cost is bounded.
- Expose snapshots through a **small set of narrow read-only capabilities**,
  not one `KernelDiagnostics` god-cap. The initial decomposition:
  - `SchedStats` — tick count, current pid, run queue length, blocked count,
    direct IPC handoff count, `cap_enter` timeout/wake counts.
  - `FrameStats` — free/used frame counts, frame-grant pages, allocator
    pressure histogram.
  - `RingStats` — per-process SQ/CQ occupancy, `cq_overflow`, corrupted-head
    recovery counts, opcode counters, transport-error counters.
  - `CapTableStats` — per-process slot occupancy, generation-rollover
    counts, insertion/remove rates.
  - `EndpointStats` — per-endpoint waiter depth, RECV/RETURN match rate,
    abort/cancellation counts.
  - `CrashSnapshot` — last panic/fault metadata, early boot stage, recent
    SQE context when safe.
- Each narrow cap exposes `snapshot() -> (sample :MetricSample)` or a typed
  struct; none of them enumerates processes or reads cap tables beyond what
  the subsystem owns. A trusted status service composes the ones it needs;
  a broker leases a subset for operator sessions without the rest.
- `ProcessInspector` (pid-scoped process table, cap-table enumeration,
  VM map) is a **distinct, stronger** cap and lives with process-management
  authority, not with monitoring.
- Convert broad diagnostics into scoped userspace wrappers before handing them
  to shells or applications.
- Keep panic/fault serial writes independent of any diagnostics service.

Promotion from the `measure` feature: the benchmark counters in
`kernel/src/measure.rs` graduate to always-on in `RingStats` / `SchedStats`
when the per-event cost is provably a single relaxed atomic add. Cycle-counter
instrumentation (`rdtsc`/`rdtscp`) stays behind `cfg(feature = "measure")`
because precise timing needs ordered reads (`rdtscp` or a fenced `rdtsc`) that
stall the pipeline, and the results are benchmark-only. The promotion threshold
bounds the instrumentation cost of normal dispatch builds without forcing
monitoring into a second build configuration.
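
An always-on counter that meets the "single relaxed atomic add" threshold could look like the following sketch. `RingCounters` and its fields are hypothetical names, and the example uses `std` so it runs on a host; kernel code would use the identical `core::sync::atomic` types.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Always-on per-ring counters. Each event costs one relaxed atomic add.
pub struct RingCounters {
    sqes_processed: AtomicU64,
    cq_overflow: AtomicU64,
}

impl RingCounters {
    pub const fn new() -> Self {
        Self {
            sqes_processed: AtomicU64::new(0),
            cq_overflow: AtomicU64::new(0),
        }
    }

    #[inline]
    pub fn on_sqe(&self) {
        // Relaxed is enough: the counter orders nothing else, it only counts.
        self.sqes_processed.fetch_add(1, Ordering::Relaxed);
    }

    #[inline]
    pub fn on_cq_overflow(&self) {
        self.cq_overflow.fetch_add(1, Ordering::Relaxed);
    }

    /// Reader side: returns (sqes_processed, cq_overflow). The two loads are
    /// independent, so the pair is not an atomic snapshot of both values.
    pub fn snapshot(&self) -> (u64, u64) {
        (
            self.sqes_processed.load(Ordering::Relaxed),
            self.cq_overflow.load(Ordering::Relaxed),
        )
    }
}
```

Readers should treat such snapshots as approximately consistent; a metrics service that needs exact cross-counter relationships would need a different mechanism.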

## Logging Model

Early boot has only serial. After init starts the log service, ordinary services
should receive `LogSink` rather than raw `Console` unless they need emergency
console access.

Recommended path:

1. Kernel serial remains for boot, panic, and fault records.
2. Init starts a userspace log service and passes scoped `LogSink` caps to
   children.
3. The log service forwards selected records to `Console` until persistent
   storage exists.
4. `SystemConfig.logLevel` becomes an initial policy input for which records
   the log service forwards and retains.
5. Session and operator tools receive scoped `LogReader` caps from a broker.

Services should not put secrets, raw capability payloads, full auth proofs, or
arbitrary user input into logs without explicit redaction. Log records are data,
not commands.
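
Steps 3 and 4 of the recommended path can be sketched as a small policy object inside the log service. This is a hedged illustration: `LogPolicy`, the separate `forward_min` threshold, and the per-window budget are assumptions; the proposal only states that `SystemConfig.logLevel` is an initial policy input.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub enum Severity { Debug, Info, Warn, Error, Critical }

#[derive(Debug, PartialEq, Eq)]
pub enum Decision { Suppress, Retain, RetainAndForward }

/// Counters matching the "accepted, suppressed, forwarded" log metrics.
#[derive(Debug, Default, PartialEq, Eq)]
pub struct LogStats {
    pub accepted: u64,
    pub suppressed: u64,
    pub forwarded: u64,
}

pub struct LogPolicy {
    retain_min: Severity,   // would be seeded from SystemConfig.logLevel
    forward_min: Severity,  // hypothetical threshold for Console forwarding
    budget_per_window: u32, // hypothetical rate limit per time window
    used_in_window: u32,
    pub stats: LogStats,
}

impl LogPolicy {
    pub fn new(retain_min: Severity, forward_min: Severity, budget_per_window: u32) -> Self {
        Self { retain_min, forward_min, budget_per_window, used_in_window: 0, stats: LogStats::default() }
    }

    /// Decides what happens to one record of the given severity.
    pub fn decide(&mut self, severity: Severity) -> Decision {
        if severity < self.retain_min || self.used_in_window >= self.budget_per_window {
            self.stats.suppressed += 1;
            return Decision::Suppress;
        }
        self.used_in_window += 1;
        self.stats.accepted += 1;
        if severity >= self.forward_min {
            self.stats.forwarded += 1;
            Decision::RetainAndForward
        } else {
            Decision::Retain
        }
    }

    /// Called when the rate-limit window rolls over.
    pub fn new_window(&mut self) {
        self.used_in_window = 0;
    }
}
```

Keeping the counters next to the decision makes suppression visible through the metrics path rather than leaving it as silent loss.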

## Metrics and Status

Status answers "what is alive and what state is it in." Metrics answer "what is
the numeric behavior over time." Keeping them separate avoids a common failure
mode where a human-readable status API grows into an unbounded metrics store.

Initial status fields should cover:

- pid, service name, binary name, process state, exit code;
- process handle wait state;
- supervisor health and restart policy once supervision exists;
- cap table occupancy and outstanding call count;
- ring CQ availability and overflow;
- endpoint queue occupancy where authorized.

Initial metrics should cover:

- ring dispatches, SQEs processed, per-op counts, transport error counts;
- cap-enter wait count, timeout count, wake count;
- scheduler context switches and direct IPC handoffs;
- frame free/used counts, frame grant pages, VM mapped pages;
- log records accepted, suppressed, dropped, and forwarded;
- trace records captured and dropped.

Once timer, nohz, and realtime features exist, their metrics should be owned
by monitoring rather than left as one-off debug prints:

- `scheduler_tick_count{cpu}`;
- `ticks_suppressed{cpu,mode}`;
- `nohz_enter_count{cpu,kind}`;
- `nohz_exit_count{cpu,reason}`;
- `oneshot_deadline_miss_count`;
- `sqpoll_busy_ns`;
- `sqpoll_sleep_count`;
- `deadline_expired_count`;
- `budget_exhausted_count`;
- `realtime_overrun_count`;
- `donation_depth_max`;
- `housekeeping_offload_count`.

These are correctness signals for nohz/realtime admission, not only
performance counters. A scoped monitoring reader may observe them only under
the same authority rules as other scheduler and service telemetry.

Current state alignment. Scheduler Phase D WFQ and Phase E
`SchedulingContext` have landed per `docs/changelog.md` (Phase D closed
2026-05-10). Phase F is delivering one-SQ-consumer, nohz telemetry counters,
and housekeeping/deferred-work placement. Landed so far:

- **The first increment of automatic nohz activation is closed** via
  `docs/tasks/done/2026/scheduler-phase-f-auto-nohz-activation.md` (per the
  scheduler bullet in `docs/tasks/README.md`).
- **SQPOLL-driven auto-nohz activation is also closed** via
  `docs/tasks/done/2026/scheduler-phase-f-auto-nohz-sqpoll.md`: a ring-coupled
  `kernelSqpoll` lease whose bound ring is in SQPOLL running/sleeping mode with
  a live owner is admitted for tick suppression, with the SQPOLL ring-state
  re-check as the decisive rollback gate.
- The `CpuIsolationLease` preflight performs real per-CPU periodic-tick
  suppression for the narrow single-runnable-entity window, with fail-closed
  rollback.
- Timeout-based auto-revoke and generic full-nohz for ordinary budgeted compute
  leases have also landed.

The nohz/realtime counter families above describe the **target** monitoring
surface for those signals. The kernel may already maintain some counters
internally as Phase F lands them, but until the narrow read-only stats caps
(`SchedStats` / `RingStats` and friends) and a userspace metrics service ship,
those counters are scheduler-internal facts and are not yet exported through a
monitoring cap. The metrics service is **not** authority to trigger nohz mode
changes; it observes counters under the authority rules in this proposal.

Metric labels such as `mode`, `kind`, and `reason` must be fixed enums, not
free-form strings:

```rust
enum NoHzKind {
    Idle,
    KernelSqpoll,
    AutoCompute,
    AutoUserspacePoller,
    RealtimeIsland,
}

enum TickSuppressionMode {
    Idle,
    SqpollNoHz,
    AutoNoHz,
    RealtimeIsland,
}

enum NoHzExitReason {
    TimerDeadline,
    Ipi,
    DeviceIrq,
    SecondRunnable,
    NetworkForcedPeriodic,
    DeferredWork,
    LeaseRevoked,
    ClocksourceUnsafe,
    DebugWatchdog,
}
```

Future metric schemas should add enum variants through reviewed ABI changes
rather than accepting arbitrary labels.
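
One consequence of fixed-enum labels is that a labeled counter family becomes a fixed-size array indexed by the enum, with no string keys or allocation. The sketch below repeats `NoHzExitReason` from above with explicit discriminants; `NoHzExitCounts` and `NOHZ_EXIT_REASONS` are hypothetical names introduced for the example.

```rust
/// Number of variants in `NoHzExitReason`; must be updated with the enum.
pub const NOHZ_EXIT_REASONS: usize = 9;

#[derive(Clone, Copy, Debug, PartialEq, Eq)]
#[repr(u8)]
pub enum NoHzExitReason {
    TimerDeadline = 0,
    Ipi = 1,
    DeviceIrq = 2,
    SecondRunnable = 3,
    NetworkForcedPeriodic = 4,
    DeferredWork = 5,
    LeaseRevoked = 6,
    ClocksourceUnsafe = 7,
    DebugWatchdog = 8,
}

/// `nohz_exit_count{reason}` for one CPU: one slot per enum variant.
pub struct NoHzExitCounts {
    counts: [u64; NOHZ_EXIT_REASONS],
}

impl NoHzExitCounts {
    pub const fn new() -> Self {
        Self { counts: [0; NOHZ_EXIT_REASONS] }
    }

    pub fn record(&mut self, reason: NoHzExitReason) {
        // The discriminant is the array index, so cardinality is fixed by the type.
        self.counts[reason as usize] += 1;
    }

    pub fn get(&self, reason: NoHzExitReason) -> u64 {
        self.counts[reason as usize]
    }

    pub fn total(&self) -> u64 {
        self.counts.iter().sum()
    }
}
```

Adding a variant changes the array size and therefore the exported layout, which is why new labels have to go through a reviewed ABI change.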

Avoid per-method, per-cap-id, per-transport-id, or per-user high-cardinality
metrics by default. Those belong in short-lived traces or scoped logs.

Benchmark outputs follow the same cardinality rule. A completed, validated
benchmark run may import a small summary such as latest median, p95, sample
count, and pass/fail status for a named benchmark profile. Raw samples,
transcripts, host/QEMU configuration, correctness evidence, and comparison
tables are benchmark artifacts, not always-on monitoring metrics. Running a
profile that needs `measure`, debug taps, broad status readers, or other
diagnostic authority should emit an audit record because the act of measuring
can expose timing and topology data that ordinary services should not see.

## Ring as Black Box

The first concrete monitoring milestone is the completed `docs/tasks/README.md`
Ring-as-Black-Box item. The visible milestone was achieved by commit `da5f5e9`
at `2026-04-24 03:13 UTC`. Its scope was to:

- define a bounded capture format for SQE/CQE records;
- export capture through a QEMU-only debug path;
- build a host-side viewer that decodes records and capnp payloads when payload
  capture is authorized;
- add one failing-call smoke whose captured log can be inspected offline.

This buys immediate debugging value without committing to durable audit,
network export, service restart policy, or replay semantics.

This is inspection, not record/replay. Replay requires stronger determinism,
payload retention, timer/input modeling, and capability-state checkpoints.
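
To illustrate what a bounded capture format and a host-side decoder might look like, here is a sketch using a fixed 24-byte little-endian record. The field set, layout, and the names `CaptureRecord`, `RecordKind`, and `decode_all` are assumptions for illustration only; they do not describe the format landed in `da5f5e9`.

```rust
/// Hypothetical record kinds; discriminants double as the wire tag.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum RecordKind {
    Sqe = 1,
    Cqe = 2,
}

/// Hypothetical metadata-only capture record (no payload bytes).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct CaptureRecord {
    kind: RecordKind,
    tick: u64,
    user_data: u64,
    result: i32, // CQE result code; zero for SQEs in this sketch
}

/// Every record has the same size, so capture memory is trivially bounded.
const RECORD_LEN: usize = 24;

fn read_u64(buf: &[u8], at: usize) -> u64 {
    let mut b = [0u8; 8];
    b.copy_from_slice(&buf[at..at + 8]);
    u64::from_le_bytes(b)
}

fn read_i32(buf: &[u8], at: usize) -> i32 {
    let mut b = [0u8; 4];
    b.copy_from_slice(&buf[at..at + 4]);
    i32::from_le_bytes(b)
}

impl CaptureRecord {
    /// Layout: [0] kind, [1..4] reserved, [4..8] result, [8..16] tick,
    /// [16..24] user_data.
    fn encode(&self) -> [u8; RECORD_LEN] {
        let mut out = [0u8; RECORD_LEN];
        out[0] = self.kind as u8;
        out[4..8].copy_from_slice(&self.result.to_le_bytes());
        out[8..16].copy_from_slice(&self.tick.to_le_bytes());
        out[16..24].copy_from_slice(&self.user_data.to_le_bytes());
        out
    }

    /// Rejects wrong-sized buffers and unknown kind tags.
    fn decode(buf: &[u8]) -> Option<CaptureRecord> {
        if buf.len() != RECORD_LEN {
            return None;
        }
        let kind = match buf[0] {
            1 => RecordKind::Sqe,
            2 => RecordKind::Cqe,
            _ => return None,
        };
        Some(CaptureRecord {
            kind,
            result: read_i32(buf, 4),
            tick: read_u64(buf, 8),
            user_data: read_u64(buf, 16),
        })
    }
}

/// Host-side viewer step: split a dump into records, rejecting a dump with
/// trailing bytes or any malformed record.
fn decode_all(dump: &[u8]) -> Option<Vec<CaptureRecord>> {
    if dump.len() % RECORD_LEN != 0 {
        return None;
    }
    dump.chunks_exact(RECORD_LEN).map(CaptureRecord::decode).collect()
}
```

A fixed record size keeps the viewer simple: a failing call shows up offline as an SQE followed by a CQE with the same `user_data` and a negative `result`.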

Capture path cost. The capture cap (working name `RingTap`) is
feature-gated (`cfg(feature = "debug_tap")`, analogous to `measure`). Every
armed tap imposes a serializing fan-out on dispatch; keeping it out of the
default kernel feature set prevents always-on cost. Arming a tap is itself
an auditable event, visible to both the tapped process and the audit log,
and tap grants respect move semantics so a tap cannot be silently cloned
past its intended holder. Payload-capturing taps require a separately
leased cap, distinct from metadata-only capture, because payloads may
contain secrets.
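
The split between metadata-only and payload capture can be sketched with two distinct types. Only `RingTap` is a name from this proposal (and a working name at that); `PayloadLease`, `CapturedCall`, `TapView`, and all fields are hypothetical. The tap type deliberately does not implement `Clone` or `Copy`, so passing it to another holder moves it.

```rust
/// Metadata-only tap. Not Clone/Copy: transferring it moves it.
struct RingTap {
    target_pid: u32,
}

/// Hypothetical separately leased cap that unlocks payload bytes until expiry.
struct PayloadLease {
    target_pid: u32,
    expires_tick: u64,
}

/// Hypothetical captured call as seen by the tap.
struct CapturedCall {
    method_id: u16,
    payload: Vec<u8>,
}

/// What the tap holder is allowed to see for one call.
#[derive(Debug, PartialEq, Eq)]
enum TapView<'a> {
    MetadataOnly { method_id: u16 },
    WithPayload { method_id: u16, payload: &'a [u8] },
}

impl RingTap {
    /// Payload is exposed only with a lease for the same target that has not
    /// expired; every other case falls back to metadata only.
    fn view<'a>(
        &self,
        call: &'a CapturedCall,
        lease: Option<&PayloadLease>,
        now_tick: u64,
    ) -> TapView<'a> {
        match lease {
            Some(l) if l.target_pid == self.target_pid && now_tick < l.expires_tick => {
                TapView::WithPayload { method_id: call.method_id, payload: &call.payload }
            }
            _ => TapView::MetadataOnly { method_id: call.method_id },
        }
    }
}
```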

## Health and Supervision

Health and restart policy should live with supervisors, not in a central kernel
daemon.

Each supervisor owns:

- a narrowed `ProcessSpawner`;
- child `ProcessHandle` caps;
- the cap bundle needed to restart its subtree;
- optional `Health` caps exported by children;
- a `LogSink` and `AuditWriter` for its own decisions.

Status services aggregate supervisor-reported health. They should distinguish:

- no process exists;
- process exists but never reported ready;
- process is alive and ready;
- process is alive but degraded;
- process exited normally;
- process failed and supervisor is backing off;
- process was intentionally stopped or draining.

Restart authority should be a separate `ServiceSupervisor` cap. A read-only
`SystemStatus` cap must not be able to restart anything.
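
A minimal sketch of the state distinctions and the read/restart split, assuming a single in-memory supervisor. `ServiceState`, `Supervisor`, and `StatusView` are hypothetical stand-ins: `StatusView` plays the role of a read-only `SystemStatus` cap and has no restart method, while restart requires mutable access to the supervisor itself.

```rust
use std::collections::BTreeMap;

/// The seven states a status service should be able to tell apart.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum ServiceState {
    Absent,
    NeverReady,
    Ready,
    Degraded,
    ExitedNormally,
    FailedBackingOff,
    StoppedOrDraining,
}

/// Hypothetical supervisor holding the reported state of its children.
struct Supervisor {
    children: BTreeMap<String, ServiceState>,
}

impl Supervisor {
    fn new() -> Supervisor {
        Supervisor { children: BTreeMap::new() }
    }

    /// Record a state reported by, or observed for, a child.
    fn report(&mut self, service: &str, state: ServiceState) {
        self.children.insert(service.to_string(), state);
    }

    /// Restart authority: only reachable with mutable access to the supervisor.
    /// Returns false when the service is unknown.
    fn restart(&mut self, service: &str) -> bool {
        match self.children.get_mut(service) {
            Some(state) => {
                *state = ServiceState::NeverReady;
                true
            }
            None => false,
        }
    }

    /// Hand out a read-only view; it cannot restart anything.
    fn status_view<'a>(&'a self) -> StatusView<'a> {
        StatusView { inner: self }
    }
}

/// Read-only wrapper standing in for a scoped status cap.
struct StatusView<'a> {
    inner: &'a Supervisor,
}

impl<'a> StatusView<'a> {
    fn state(&self, service: &str) -> ServiceState {
        self.inner
            .children
            .get(service)
            .copied()
            .unwrap_or(ServiceState::Absent)
    }
}
```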

## Audit Integration

Audit should share infrastructure with logging only at the storage or transport
layer. Its semantics are different.

Audit producers:

- `AuthorityBroker` for policy decisions and leased grants;
- supervisors for restarts and service lifecycle actions;
- session manager for session creation and logout;
- kernel or status service for cap transfer/release/revocation summaries when
  those events become part of the exported authority graph;
- recovery tools for repair actions.

Audit readers are scoped:

- a user can read records for its own session;
- an operator can read a service subtree;
- a recovery or security role can read broader streams after policy approval.

Audit entries must avoid secrets and payload dumps. They should record object
identity, service identity, policy decision summaries, and capability interface
classes rather than raw data.
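
The reader scoping above can be sketched as a filter over payload-free entries. Everything here is hypothetical: the `AuditEntry` fields, the slash-separated service path used to express a subtree, and the `AuditScope` variants are illustration, not a defined schema.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
enum Decision {
    Granted,
    Denied,
    Restarted,
}

/// Hypothetical audit entry: identities and summaries only, no payload field.
#[derive(Clone, Debug, PartialEq, Eq)]
struct AuditEntry {
    session: u32,
    service_path: String, // assumed form: "parent/child"
    decision: Decision,
    interface_class: &'static str,
}

/// What a given reader cap is allowed to see.
enum AuditScope {
    Session(u32),
    Subtree(String),
    Broad,
}

/// True when `entry` falls inside `scope`.
fn visible(scope: &AuditScope, entry: &AuditEntry) -> bool {
    match scope {
        AuditScope::Session(id) => entry.session == *id,
        AuditScope::Subtree(root) => {
            // Match the root itself or a path below it; "net" must not match
            // an unrelated sibling such as "network".
            let below = format!("{}/", root);
            entry.service_path == *root || entry.service_path.starts_with(below.as_str())
        }
        AuditScope::Broad => true,
    }
}

/// Return only the entries the scope permits, preserving order.
fn read_scoped<'a>(scope: &AuditScope, log: &'a [AuditEntry]) -> Vec<&'a AuditEntry> {
    log.iter().filter(|e| visible(scope, e)).collect()
}
```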

## Security and Backpressure

Monitoring must not become the easiest denial-of-service path.

Required controls:

- Per-process log token buckets, matching the Security Verification Track S.9
  diagnostic aggregation design.
- Suppression summaries for repeated invalid submissions.
- Fixed-size ring buffers with explicit dropped counts.
- Maximum record size for logs, events, crash records, and traces.
- Bounded formatting outside interrupt context.
- No heap allocation in timer or panic paths.
- No unbounded metric label creation from user-controlled strings.
- Payload tracing disabled by default.
- Redaction rules at producer boundaries and at reader wrappers.
- Capability-scoped readers; no unauthenticated "debug all" endpoint.

When pressure forces dropping, preserve first-observation diagnostics and later
summaries. Losing detailed logs is acceptable; corrupting scheduler progress or
blocking the kernel on log I/O is not.
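
A per-process token bucket with a suppression summary could look like the sketch below. It uses only integer arithmetic and no allocation. `LogBucket`, `Admit`, and the tick-based refill are assumptions for illustration; they are not the Track S.9 design itself.

```rust
/// Outcome of offering one log record to the bucket.
#[derive(Debug, PartialEq, Eq)]
enum Admit {
    /// Record accepted, nothing was suppressed since the last accepted one.
    Accept,
    /// Record accepted; `suppressed` earlier records were dropped and should
    /// be reported as a single summary line.
    AcceptWithSummary { suppressed: u32 },
    /// Record dropped and counted toward the next summary.
    Drop,
}

/// Hypothetical per-process bucket, refilled by elapsed ticks.
struct LogBucket {
    capacity: u32,
    refill_per_tick: u32,
    tokens: u32,
    last_tick: u64,
    suppressed: u32,
}

impl LogBucket {
    /// Starts full so the first observations of a problem are preserved.
    fn new(capacity: u32, refill_per_tick: u32) -> LogBucket {
        LogBucket { capacity, refill_per_tick, tokens: capacity, last_tick: 0, suppressed: 0 }
    }

    fn admit(&mut self, now_tick: u64) -> Admit {
        // Refill for elapsed ticks; a non-monotonic tick refills nothing.
        let elapsed = now_tick.saturating_sub(self.last_tick);
        self.last_tick = self.last_tick.max(now_tick);
        let refill = elapsed.saturating_mul(self.refill_per_tick as u64);
        let refilled = (self.tokens as u64).saturating_add(refill);
        self.tokens = refilled.min(self.capacity as u64) as u32;

        if self.tokens == 0 {
            self.suppressed = self.suppressed.saturating_add(1);
            return Admit::Drop;
        }
        self.tokens -= 1;
        if self.suppressed > 0 {
            let n = self.suppressed;
            self.suppressed = 0;
            Admit::AcceptWithSummary { suppressed: n }
        } else {
            Admit::Accept
        }
    }
}
```

The caller never blocks: a flood costs one counter increment per dropped record, and the next accepted record carries the count of what was lost.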

## Relationship to Existing Proposals

- Service Architecture: monitoring services are ordinary userspace services
  spawned by init or supervisors. Logging policy and service topology stay out
  of the pre-init kernel path.
- Shell: the native and agent shells should receive scoped `SystemStatus` and
  `LogReader` caps in daily profiles, not global supervisor authority.
- User Identity and Policy: `AuthorityBroker` mints scoped readers and leased
  supervisor caps based on session policy; `AuditLog` records the decisions.
- Error Handling: transport errors and `CapException` payloads are monitoring
  signals, but retry policy remains userspace.
- Authority Accounting: resource ledgers provide the first metrics substrate and
  define quota/backpressure boundaries.
- Security and Verification: hostile-input tests should cover log flood
  aggregation and bounded diagnostic paths. Each new monitoring boundary
  (kernel stats caps, log/metrics/trace/audit services, scheduler nohz
  telemetry exports) must be carried into the
  `docs/proposals/security-and-verification-proposal.md` Track S.7
  trust-boundary inventory before downstream services rely on it; the
  inventory is the canonical record that a boundary has been reviewed,
  not this proposal.
- Live Upgrade: health, audit, and service status become prerequisites for
  credible upgrade orchestration.
- System Performance Benchmarks: benchmark runners may read scoped status and
  metrics before and after a run, but benchmark artifacts and OS-comparison
  reports live outside the always-on metrics service. Only low-cardinality,
  validated summaries should be imported into monitoring.

## Implementation Plan

1. **Document the model.**
   Keep monitoring as a future architecture proposal and do not disturb the
   current manifest-executor milestone.

2. **Ring as Black Box.**
   Completed by commit `da5f5e9` at `2026-04-24 03:13 UTC`: bounded
   SQE/CQE capture, host-side decoding, and one failing-call smoke form the
   first useful monitoring artifact.

3. **Userspace log service.** *(Phase 1 landed.)*
   `LogSink`/`LogReader` schemas plus `LogRecord`/`LogFilter` exist (additive
   ordinals, reusing `LogLevel` as the severity type). A bounded drop-oldest
   kernel ring (`kernel/src/cap/log.rs`) backs both caps: the sink stamps the
   monotonic tick, drops records below the boot-seeded `SystemConfig.logLevel`
   threshold (`accepted = false`), bounds record size, and forwards accepted
   records to serial; the reader returns cursor/filtered records with
   `nextCursor` and a `dropped` overflow count. Scoped `LogSink`/`LogReader`
   caps are granted to children at spawn; `make run-monitoring-log-smoke`
   proves the drop, the read-back, and the reader-side `minLevel` filter.
   Remaining: the wider `Severity` (with `critical`), the correlation fields
   (`subjectRef`/`sessionRef`/`serviceRef`/`transportId`), per-process token
   buckets / suppression summaries, and persistent retention.

4. **Narrow kernel stats caps and SystemStatus.**
   Add the narrow read-only caps (`SchedStats`, `FrameStats`, `RingStats`,
   `CapTableStats`, `EndpointStats`, `CrashSnapshot`) as bounded snapshot
   surfaces. A userspace `SystemStatus` service composes the ones it needs
   and exposes scoped wrappers to shells and operator tools. Leave
   `ProcessInspector` out of this step; it belongs with process-management
   authority, not monitoring.

5. **Metrics snapshots.**
   Add fixed counters and gauges for ring, scheduler, resource, log, and trace
   state. Keep labels static until a cardinality policy exists.

6. **Health and supervisor status.**
   Add `Health` and read-only supervisor status once restart policy and exported
   service caps are concrete. Keep restart authority in separate
   `ServiceSupervisor` caps.

7. **Audit path.**
   Add append-only audit records for broker decisions, cap grants, releases,
   revocations, restarts, and recovery actions. Start serial or memory backed;
   move to storage once the storage substrate exists.

8. **Crash records.**
   Preserve bounded panic/fault metadata across the current boot where possible;
   later store records durably.

9. **Device, network, and storage metrics.**
   Add driver metrics only after those drivers exist: interrupts, DMA/bounce
   usage, queue depth, RX/TX/drop/error counts, block latency, and reset events.
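
The behaviour described in step 3 (threshold drop at the sink, bounded drop-oldest storage, cursor reads with a `dropped` count and a reader-side minimum level) can be modelled in a few lines. This is a userspace sketch with assumed names and an assumed four-level `LogLevel`; it is not the code in `kernel/src/cap/log.rs` and omits serial forwarding, tick stamping, and record-size bounds.

```rust
use std::collections::VecDeque;

/// Assumed severity ordering, lowest first.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
enum LogLevel {
    Debug,
    Info,
    Warn,
    Error,
}

#[derive(Clone, Debug, PartialEq, Eq)]
struct LogRecord {
    seq: u64,
    level: LogLevel,
    msg: String,
}

/// Result of one reader call.
struct ReadResult {
    records: Vec<LogRecord>,
    next_cursor: u64,
    dropped: u64,
}

/// Hypothetical bounded drop-oldest ring shared by sink and reader.
struct LogRing {
    records: VecDeque<LogRecord>,
    capacity: usize,
    next_seq: u64,
    threshold: LogLevel,
}

impl LogRing {
    fn new(capacity: usize, threshold: LogLevel) -> LogRing {
        assert!(capacity > 0);
        LogRing { records: VecDeque::with_capacity(capacity), capacity, next_seq: 0, threshold }
    }

    /// Sink side: returns false when the record is below the threshold.
    fn write(&mut self, level: LogLevel, msg: &str) -> bool {
        if level < self.threshold {
            return false;
        }
        if self.records.len() == self.capacity {
            self.records.pop_front(); // drop-oldest on overflow
        }
        self.records.push_back(LogRecord { seq: self.next_seq, level, msg: msg.to_string() });
        self.next_seq += 1;
        true
    }

    /// Reader side: records at or after `cursor` with level >= `min_level`.
    /// `dropped` counts records between the cursor and the oldest retained one.
    fn read(&self, cursor: u64, min_level: LogLevel) -> ReadResult {
        let oldest = self.records.front().map(|r| r.seq).unwrap_or(self.next_seq);
        let records = self
            .records
            .iter()
            .filter(|r| r.seq >= cursor && r.level >= min_level)
            .cloned()
            .collect();
        ReadResult { records, next_cursor: self.next_seq, dropped: oldest.saturating_sub(cursor) }
    }
}
```

A reader that falls behind learns how many records it missed from `dropped` rather than silently seeing a gap.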

## Non-Goals

- No global `/proc` or `/sys` equivalent with ambient read access.
- No kernel-resident dashboard, alert manager, text search, or policy engine.
- No programmable kernel tracing language in the first monitoring design.
- No promise of durable log retention before storage exists.
- No default payload tracing.
- No service restart authority bundled into ordinary read-only status caps.
- No network export path until networking and policy can constrain it.

## Open Questions

- Should `KernelDiagnostics` expose snapshots only, or also a bounded event
  cursor?
- What is the minimum timestamp model before wall-clock time exists?
- Should log records carry local cap IDs, stable object IDs, or only interface
  and service metadata by default?
- How should schema-aware trace decoding find schemas before a full
  `SchemaRegistry` exists?
- Which crash fields are safe to expose to non-recovery sessions?
- What retention policy is acceptable before persistent storage?
- Should `MetricsReader` use typed structs for each subsystem instead of generic
  name/value samples?
- Where should remote monitoring export fit once network transparency exists:
  a dedicated exporter service, capnp-rpc forwarding, or storage replication?

## Cross-References

This proposal is reader-facing target design. The canonical trackers for the
observability-adjacent risks and verification obligations it depends on live
elsewhere:

- `docs/proposals/security-and-verification-proposal.md` **Track S.7 --
  Stage-6-aware refresh** owns the trust-boundary inventory that any new
  monitoring boundary (kernel stats caps, log/metrics/trace/audit services,
  scheduler nohz telemetry exports, payload-capturing taps) must be carried
  into before downstream services rely on it. Track S.7 already lists the
  active scheduler-evolution surfaces (Phase D WFQ, Phase E
  `SchedulingContext`, Phase F one-SQ-consumer and nohz telemetry) plus the
  WASI host-adapter Phase W.4 entropy/argv boundary as inventory items to
  carry forward.
- `docs/design-risks-register.md` **R12 -- Verification coverage is partial,
  not full proof** is the canonical caveat for any monitoring claim that
  could be read as a verified property. Bounded Kani/Loom/Miri/proptest
  coverage plus the panic-surface inventory are not whole-system functional
  refinement; monitoring records and audit entries describing security-
  relevant decisions must respect that distinction in their wording.
- `docs/design-risks-register.md` **Q9 -- CPU accounting and scheduling
  contexts** is the canonical answer for the CPU-time, weighted-vruntime,
  and `SchedulingContext` budget/donation/depletion semantics that monitoring
  metrics should observe rather than redefine. The nohz/realtime counter
  families in this proposal target the same surfaces; cross-service donation
  policy, full nohz activation, isolation leases, and fairness across
  principals remain proposal-shaped per Q9 and are tracked in
  `docs/proposals/scheduler-evolution-proposal.md` and
  `docs/backlog/scheduler-evolution.md`.

Adjacent risk-register entries observed by monitoring but owned elsewhere
include R4 (Resource accounting fragmentation, source of the
`ResourceLedger` metrics substrate), R8 (Networking lives inside the kernel
TCB, gating exporter-service placement), and R11 (Pre-auth and post-auth
share a shell process, gating who may receive scoped `LogReader` /
`SystemStatus` / `AuditLog` readers).