# Proposal: System Performance Benchmarks

How capOS should benchmark system performance against other operating systems
without producing misleading numbers, rewarding special-case optimizations, or
treating speed as a substitute for correct capability behavior.

## Problem

capOS already has smoke tests, QEMU boot proofs, ring-tap debugging, and a
`measure` feature for focused cycle measurements. Those are necessary, but they
do not answer the product-level question: can capOS remain effective on common
workloads while preserving its capability model?

Generic OS benchmark suites are useful but dangerous in this project. Most
assume POSIX process, file, pipe, socket, and shell semantics. capOS should not
fake broad ambient Unix authority just to run a familiar benchmark. It also
should not compare a capability-native path against Linux, FreeBSD, or a
microkernel by publishing a single blended score that hides unsupported
semantics, incorrect outputs, or different isolation boundaries.

The benchmark system needs to produce three kinds of evidence:

- **Primitive cost:** capability calls, IPC, scheduling, park waits, VM changes,
  process creation, memory copy, and later device I/O.
- **Common workload adequacy:** database, compression, build, network, storage,
  shell/session, service graph, and runtime workloads that users recognize.
- **Correctness under load:** workload outputs, service boundaries, capability
  denial paths, and data integrity must remain correct while performance is
  measured.

## Current State

Implemented measurement and comparison hooks:

- `make run-measure` builds a separate measurement kernel feature and boots
  `system-measure.cue`.
- `kernel/src/measure.rs` records benchmark-only dispatch counters and cycle
  segments for ring processing, SQE validation, cap lookup, Cap'n Proto
  encode/decode, method body dispatch, CQE posting, and waiter wake checks.
- The measurement manifest grants `ring-nop` a measurement-only `NullCap` and
  `ParkBench` capability through `ProcessSpawner`.
- `demos/ring-nop` measures `CAP_OP_NOP`, empty and small `NullCap` calls, and
  compact-versus-generic park-shaped operations.
- `demos/thread-lifecycle` measures private `ParkSpace` failed wait, empty
  wake, wait-to-block, wake-to-runnable, and wake-to-resume paths.
- `make run-smoke`, `make run-spawn`, `make run-net`, and focused service smokes
  provide correctness and user-visible behavior proofs, but they do not yet
  emit structured performance results.

That is enough for local dispatch decisions. It is not enough for comparing
capOS with Linux, FreeBSD, seL4-based systems, Genode scenarios, or other OS
baselines on common workloads.

## Design Principles

1. **Correctness gates first.** A benchmark result is publishable only when the
   workload's output verifier passes and capOS-specific authority checks still
   hold.
2. **No semantic laundering.** Unsupported POSIX features are reported as
   unsupported or not applicable, not silently emulated through broad authority.
3. **Benchmark artifacts are not normal metrics.** Always-on monitoring may
   expose low-cost counters. Benchmark logs, raw samples, host configuration,
   and per-run outputs are retained as explicit benchmark artifacts.
4. **Compare like mechanisms where possible.** Compare capOS capability IPC to
   Linux pipes, Unix domain sockets, io_uring, or futexes only when the semantic
   differences are declared in the result.
5. **Use common suites as references, not design masters.** lmbench, UnixBench,
   fio, iperf3, SQLite speedtest, Phoronix/OpenBenchmarking profiles, and SPEC
   CPU are valuable precedent. capOS should adopt their methodology where it
   fits and reject assumptions that would distort capOS.
6. **Publish raw context.** Results include kernel commit, manifest, QEMU
   command, CPU model, host OS, compiler, build flags, feature flags, warmup,
   run count, and raw logs.
7. **Separate hosted and native comparisons.** Early capOS runs in QEMU. Compare
   against Linux/FreeBSD guests under the same QEMU/KVM envelope, and separately
   against native host OS runs when the question is absolute hardware
   performance.
8. **Regression gates are narrower than claims.** CI gates should catch local
   regressions in stable paths. Public OS comparisons need controlled machines,
   repeated runs, and manual review.
9. **Security posture is part of the result.** A fast result that requires a
   broader cap bundle, disabled validation, payload tracing, or a special
   kernel build must be labeled as such.
10. **No single score.** capOS should publish a matrix of workload results and
    ratios, not an aggregate score that implies all workloads matter equally.

## Benchmark Tiers

### Tier 0: Existing Correctness Smokes

Tier 0 is not a performance suite. It is the mandatory correctness floor:

- default boot/login/shell smoke;
- focused spawn, shell, terminal, credential, login, chat, adventure,
  revocable-read, memory-object, ringtap, networking, and measurement smokes;
- host tests for config, ring Loom, capos-lib, mkmanifest, generated code, and
  runtime surface checks.

No performance result should be retained when the relevant Tier 0 proof fails.

### Tier 1: capOS-Native Primitive Benchmarks

These benchmarks measure the cost of capOS mechanisms directly:

| Area | Initial measurements | Correctness condition |
|---|---|---|
| Ring transport | `CAP_OP_NOP`, empty `NullCap`, small payload `NullCap`, CQE post | expected CQE result, no overflow, bounded dropped count |
| Cap dispatch | cap lookup, generation rejection, revoked cap rejection, invalid method | correct `CAP_ERR_*` or `CapException` |
| IPC | endpoint CALL/RECV/RETURN round trip, direct handoff, transfer copy/move | reply payload and transferred-cap identity match oracle |
| Park/threading | failed wait, timeout, wake-one, wake-many, wake-to-resume | waiter count and join status match oracle |
| Scheduler | context switch latency, timer wake latency, direct IPC handoff latency | no runnable-thread loss or unexpected starvation |
| Process lifecycle | spawn, ELF load, wait, failed spawn rejection | child output and exit code match manifest oracle |
| VM/memory | map/protect/unmap, MemoryObject map, frame allocation/free | data visibility, W^X, quota, and cleanup checks pass |
| Terminal/session | readLine/write latency and throughput under foreground ownership | echo/cancellation/stale-input checks pass |

These are capOS results first. Linux or FreeBSD baselines can use matching
native mechanisms, but the report must describe the mapping. For example, a
capOS endpoint IPC round trip can be compared with Linux `pipe`, Unix-domain
socket, `eventfd`, or futex ping-pong results, but none is a perfect semantic
match.
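For illustration, a Linux-side pipe ping-pong baseline for that comparison could be sketched as below. This is a hedged sketch, not part of the harness: it measures a one-byte pipe round trip between a parent and a forked child, which carries none of the capability lookup or transfer semantics of a capOS endpoint call, exactly the kind of declared semantic gap the report must state.

```python
# Sketch of a Linux pipe ping-pong baseline for comparison with a capOS
# endpoint CALL/RECV/RETURN round trip. Imperfect mapping: a pipe round
# trip has no capability lookup, generation check, or transfer semantics.
import os
import time

def pipe_ping_pong(rounds: int) -> float:
    """Return the median one-byte pipe round-trip time in seconds."""
    p2c_r, p2c_w = os.pipe()   # parent -> child
    c2p_r, c2p_w = os.pipe()   # child -> parent
    pid = os.fork()
    if pid == 0:               # child: echo every byte back
        os.close(p2c_w)
        os.close(c2p_r)
        while True:
            b = os.read(p2c_r, 1)
            if not b:          # parent closed its write end: exit
                os._exit(0)
            os.write(c2p_w, b)
    os.close(p2c_r)
    os.close(c2p_w)
    samples = []
    for _ in range(rounds):
        t0 = time.perf_counter_ns()
        os.write(p2c_w, b"x")
        os.read(c2p_r, 1)
        samples.append(time.perf_counter_ns() - t0)
    os.close(p2c_w)            # signals the child to exit
    os.waitpid(pid, 0)
    samples.sort()
    return samples[len(samples) // 2] / 1e9
```

A futex, eventfd, or Unix-domain-socket variant would follow the same shape, and each should be reported as its own labeled transport rather than folded into one number.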

### Tier 2: Translated OS Microbenchmarks

lmbench and UnixBench are useful because they isolate OS primitives such as
system-call overhead, process creation, context switching, pipes, networking,
and filesystem reads. They are also Unix-shaped.

capOS should implement a `capos-osbench` harness that translates the benchmark
intent into capability-native operations:

- `fork/exec/wait` intent becomes `ProcessSpawner.spawn` plus
  `ProcessHandle.wait`.
- `pipe` throughput/context switching becomes Endpoint or a future byte-stream
  or socket capability round trip, labeled by transport.
- `getpid` syscall overhead becomes a minimal kernel fact cap or `CAP_OP_NOP`,
  labeled as "capOS ring entry" rather than "POSIX syscall".
- file reread and mmap benchmarks remain unsupported until Store/Namespace and
  file-backed mappings exist.
- networking tests map to `TcpSocket`/`TcpListener` once the Telnet and socket
  capability work lands.

The translated suite must emit `not_applicable` for missing capability
subsystems instead of adding compatibility shims that change the OS being
measured.
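One way such a harness could gate each translated intent is sketched below. The subsystem and intent names are illustrative assumptions for this proposal, not an implemented `capos-osbench` interface; the point is only that a missing subsystem yields a `not_applicable` record instead of a shim.

```python
# Sketch of not_applicable gating for a translated microbench harness.
# Subsystem and intent names are illustrative assumptions, not a real API.
AVAILABLE_SUBSYSTEMS = {"ring", "endpoint", "process", "park"}

INTENT_REQUIREMENTS = {
    "fork_exec_wait": {"process"},          # ProcessSpawner.spawn + wait
    "pipe_roundtrip": {"endpoint"},         # Endpoint round trip
    "syscall_overhead": {"ring"},           # labeled "capOS ring entry"
    "file_reread": {"store"},               # Store/Namespace not yet built
    "mmap_read": {"store", "vm_file_backed"},
}

def run_intent(name: str) -> dict:
    """Run one translated intent, or report it as not applicable."""
    missing = INTENT_REQUIREMENTS[name] - AVAILABLE_SUBSYSTEMS
    if missing:
        return {
            "benchmark": name,
            "status": "not_applicable",
            "reason": "missing subsystems: " + ", ".join(sorted(missing)),
        }
    # ... run the capability-native translation and verify its output ...
    return {"benchmark": name, "status": "passed"}
```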

### Tier 3: Portable Common Workloads

These benchmarks answer whether capOS is useful on recognizable work:

| Workload | Candidate benchmark | capOS prerequisite | Result verifier |
|---|---|---|---|
| SQLite database | SQLite `speedtest1`, optionally via a Phoronix profile on reference OSes | C runtime or native port, Store/Namespace or RAM-backed DB | SQLite exit status, optional SQL result checksum |
| OLTP database | TPC-C/TPC-E-inspired profile, not an official TPC result until disclosure and durability rules are met | durable Store/block I/O, SQL/database stack, transaction integrity, terminal/client driver model | committed transaction counts, invariant checks, ACID/error-injection proof |
| Decision-support database | TPC-H/TPC-DS-inspired profile at declared scale factors, not an official TPC result until rules are met | SQL/query engine, bulk data load, durable or explicitly memory-backed storage, query result verifier | query answer hashes, load status, scale factor, refresh/query stream status |
| Key-value serving | YCSB-style read/update/scan/insert mixes | Store/Namespace, KV service, stable client driver | operation counts, latency distribution, value/hash verifier |
| Storage engine | RocksDB/LevelDB `db_bench`-style fill/read/overwrite/seek profiles | file/store semantics, fsync/sync policy, storage engine port | key/value integrity, database reopen, configured write durability |
| Compression | `xz`, `zstd`, or small native compressor corpus | C/Rust userspace runtime and file/store access | compressed output hash and decompression hash |
| Build/developer workload | small Rust/C package build, later IX package build | process spawning, Store/Namespace, toolchain support | output artifact hash and build log status |
| Network throughput | iperf3-equivalent TCP stream and request/response latency | `TcpSocket`, network harness | byte count, JSON/structured summary, peer checksum |
| Storage I/O | fio-equivalent sequential/random read/write, verify mode | block device, Store/Namespace, direct I/O policy | fio-style verify/checksum result |
| File service | SPECstorage-inspired workload profile | network filesystem or capOS file-service equivalent, durable storage, client load generation | throughput, response time, data integrity |
| Java/server runtime | SPECjbb 2015 or Renaissance-inspired profiles | JVM or Java compatibility profile, timers, threads, networking/storage as needed | benchmark verifier and SLA/throughput summary |
| HTTP service | `wrk`-style request load against a capOS HTTP service | TCP, HTTP service, stable response corpus | response checksum/status mix, latency distribution, error rate |
| Cloud services | CloudSuite-inspired data caching/serving/search/web profiles | multi-service graph, storage/network/runtime support | workload-specific answer checks and service SLOs |
| Microservices | DeathStarBench/TailBench-inspired tail-latency profiles | service graph, network or local RPC, load generator, tracing/status caps | request correctness, p95/p99 latency, no unauthorized cap exposure |
| ML storage | MLPerf Storage-inspired data feeding profile | high-throughput storage path, dataset loader, accelerator or simulated training reader | records/images delivered, latency/throughput, data checksum |
| ML inference/training | MLPerf-inspired inference/training profile | model runtime, accelerator/GPU capability or CPU baseline, dataset and accuracy harness | accuracy/quality target plus throughput or time-to-train |
| Shell/session | boot-to-shell, Telnet shell, command launch latency | current shell plus terminal/socket path | transcript oracle and authority denial checks |
| Service graph | chat/adventure/resident service load | shared-service demos | scripted transcript and service identity checks |
| Runtime/library | Go/Lua/Wasm micro and app kernels | relevant runtime proposal milestones | language-level test suite or checksum oracle |

Early capOS should start with RAM-backed variants where storage is not ready,
but those results must be labeled as memory-backed. A RAM-backed database result
is not comparable to a Linux disk-backed SQLite result.

Industry benchmark families belong later than SQLite speedtest and simple
compression/build profiles. TPC-C/TPC-E and TPC-H/TPC-DS are database-system
references with strict workload, disclosure, pricing, and correctness
expectations. SPEC, MLPerf, CloudSuite, TailBench, and DeathStarBench bring
similar setup and disclosure obligations in their domains. capOS can use
inspired profiles to exercise the same workload classes before it can make
official or directly comparable claims, but reports must label them as such and
state which upstream rules are not yet satisfied.

### Tier 4: User-Story Benchmarks

User-story benchmarks measure complete workflows that a person, operator, or
service owner would recognize. They are intentionally broader than a single
primitive or portable benchmark profile, and they should be described by the
user outcome they prove rather than by the current demo implementation.

Initial user stories:

| Story | Example capOS proof | Result verifier |
|---|---|---|
| Start a local session | boot to an interactive shell or terminal prompt | transcript reaches ready prompt with expected cap bundle |
| Authenticate and receive authority | anonymous session upgrades to an operator/session profile | wrong credential denied, right credential grants exact profile |
| Run a delegated task | launch a child process with a narrow cap bundle | child output, exit code, and denied extra authority match oracle |
| Use a remote terminal | host-local TCP terminal reaches the same shell/session model | connect, authenticate, run command, clean disconnect |
| Use a resident service | client talks to a long-running service through scoped authority | request/reply transcript and service-visible identity match oracle |
| Serve a network request | network-facing service handles requests while local work continues | response checksum, latency, and no unauthorized cap exposure |
| Complete a developer workflow | build or transform an artifact from declared inputs | output hash, logs, and resource profile match declared policy |
| Recover from expected failure | service fault, rejected grant, timeout, or restart path | failure is bounded, audited, and visible through status |

User-story results report latency distribution, success rate, resource usage,
and authority outcome. They are the closest evidence for "effective on common
workloads," but they are not substitutes for primitive measurements when a
regression appears.

## Reference Operating Systems

Initial comparisons should use these environments:

| Reference | Why include it | Caveat |
|---|---|---|
| Linux guest under same QEMU/KVM flags | Stable baseline with broad benchmark support | Linux has mature drivers, filesystems, VM, scheduler, and libc |
| FreeBSD guest under same QEMU/KVM flags | Second mature Unix-like baseline, useful for POSIX-independent signal | Not every benchmark profile has equal FreeBSD support |
| Linux native host | Shows absolute host hardware ceiling | Not directly comparable to capOS-in-QEMU latency |
| seL4 or Genode reports/scenarios | Prior art for capability/microkernel IPC and service decomposition | Often not the same hardware, workload, or application stack |

The default published table should show capOS versus Linux guest first. Native
host and external microkernel data belong in separate context columns, not the
primary ratio.

## Correctness Model

Every benchmark definition carries:

- expected input corpus hash;
- command or manifest used to run the workload;
- output verifier;
- allowed nondeterminism, such as timestamps or generated IDs;
- capOS authority profile;
- unsupported-feature policy;
- result parser version.

A result is invalid when:

- the output verifier fails;
- QEMU exits abnormally;
- the kernel panics or reports an unexpected fault;
- the benchmark had to grant broader authority than its declared profile;
- host logs show dropped records that invalidate the measurement;
- the run used a special fast path not available in the declared configuration;
- the reference OS result used a materially different workload size or dataset.

Correctness should be stored alongside the performance value. A fast failed run
is not a slow successful run; it is no result.
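Applied mechanically, the invalidity rules above could look like the following sketch. The run-record field names are assumptions for illustration; a real harness would bind them to its artifact schema.

```python
# Sketch: classify a run record against the invalidity rules above.
# Field names are illustrative assumptions, not an implemented schema.
def classify_run(run: dict) -> str:
    if not run.get("verifier_passed"):
        return "invalid: output verifier failed"
    if run.get("qemu_exit_code", 0) != 0:
        return "invalid: abnormal QEMU exit"
    if run.get("kernel_panic") or run.get("unexpected_fault"):
        return "invalid: kernel panic or unexpected fault"
    # Authority must not exceed the declared profile.
    extra = set(run.get("granted_caps", [])) - set(run.get("declared_caps", []))
    if extra:
        return "invalid: authority broader than declared profile"
    if run.get("dropped_records", 0) > 0:
        return "invalid: dropped host records"
    if run.get("special_fast_path"):
        return "invalid: fast path absent from declared configuration"
    return "valid"
```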

## Measurement Method

Controlled runs should use:

- fixed capOS commit, reference OS image hash, benchmark source hash, compiler
  version, and toolchain flags;
- fixed QEMU version, machine type, CPU model, memory size, SMP count, KVM/TCG
  mode, disk image type, and network backend;
- warmup runs for workloads with caches, JITs, connection setup, or first-use
  allocation;
- at least 5 measured runs for primitive and user-story benchmarks, more when
  coefficient of variation is high;
- median, min, max, standard deviation, and p95/p99 for latency where sample
  count supports it;
- raw logs retained for the benchmark artifact;
- no performance claim from one isolated run unless explicitly labeled as a
  smoke measurement.
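The summary statistics above can be sketched as a small host-side helper, including a coefficient-of-variation check for whether more than the minimum five runs are needed. The 5% CV threshold here is an illustrative assumption, not a decided policy.

```python
# Sketch of the per-benchmark summary: median, min, max, stdev, and p95
# only when the sample count supports it. Thresholds are assumptions.
import statistics

def summarize(samples: list[float], cv_threshold: float = 0.05) -> dict:
    s = sorted(samples)
    n = len(s)
    mean = statistics.fmean(s)
    stdev = statistics.stdev(s) if n > 1 else 0.0
    summary = {
        "runs": n,
        "median": statistics.median(s),
        "min": s[0],
        "max": s[-1],
        "stdev": stdev,
        "cv": stdev / mean if mean else 0.0,
    }
    if n >= 20:  # report tail latency only with enough samples
        summary["p95"] = s[int(0.95 * (n - 1))]
    summary["needs_more_runs"] = n < 5 or summary["cv"] > cv_threshold
    return summary
```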

Cycle-counter measurements remain inside `cfg(feature = "measure")` and are
used for relative path decisions. Wall-clock user-story and workload
comparisons use host-side timestamps around QEMU transcripts or in-guest
monotonic timers when the timer contract is adequate.

## Result Schema

The benchmark harness should emit a structured artifact, not a free-form log:

```capnp
enum BenchmarkStatus {
  passed       @0;
  failed       @1;
  unsupported  @2;
  invalid      @3;
}

struct BenchmarkResult {
  runId          @0 :Text;
  benchmarkName  @1 :Text;
  tier           @2 :UInt16;
  status         @3 :BenchmarkStatus;
  correctnessId  @4 :Text;
  configHash     @5 :Data;
  artifactHash   @6 :Data;
  notes          @7 :Text;

  result :union {
    measurement @8 :MeasurementSummary;
    failure     @9 :RunFailure;
    unsupported @10 :RunFailure;
    invalid     @11 :RunFailure;
  }
}

struct MeasurementSummary {
  unit           @0 :Text;
  lowerIsBetter  @1 :Bool;
  median         @2 :Float64;
  p95            @3 :Float64;
  samples        @4 :List(Float64);
}

struct RunFailure {
  reason  @0 :Text;
  detail  @1 :Text;
}
```

This schema is conceptual. It should not be added to `schema/capos.capnp` until
a concrete benchmark-runner service exists; until then, host scripts can emit
JSON with the same shape. The important property is that measurement values
exist only in the passed/publishable branch; failed, unsupported, and invalid
runs carry reasons instead of zero-valued scalar defaults.
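An interim host-side JSON emitter with that shape could be sketched as below. Field names mirror the conceptual schema; the function names and the hash arguments are assumptions for illustration.

```python
# Sketch of interim JSON artifacts mirroring the conceptual schema:
# measurement values exist only on the passed branch, failures carry reasons.
def passed_result(run_id, name, tier, unit, samples, config_hash):
    s = sorted(samples)
    return {
        "runId": run_id, "benchmarkName": name, "tier": tier,
        "status": "passed", "configHash": config_hash,
        "result": {"measurement": {
            "unit": unit,
            "lowerIsBetter": True,
            "median": s[len(s) // 2],
            "samples": samples,
        }},
    }

def failed_result(run_id, name, tier, reason, detail, config_hash):
    return {
        "runId": run_id, "benchmarkName": name, "tier": tier,
        "status": "failed", "configHash": config_hash,
        "result": {"failure": {"reason": reason, "detail": detail}},
    }
```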

## Integration With System Monitoring

System Monitoring should expose operational state; the benchmark system should
store explicit run artifacts. The overlap is narrow:

- benchmark runs may read scoped `MetricsReader`, `SystemStatus`, `RingStats`,
  `SchedStats`, and later device stats before and after a run;
- benchmark summaries may be imported into a metrics service as low-cardinality
  gauges such as `benchmark.last_median_ms`, keyed by benchmark name and
  profile, after validation;
- raw samples, transcripts, QEMU logs, host environment, and correctness
  evidence belong in a `BenchmarkStore` or CI artifact store, not in
  always-on metrics;
- starting a privileged benchmark profile is an auditable event because it may
  require measurement-only caps, debug taps, or broad status readers;
- benchmark readers should receive scoped read-only caps, not global monitoring
  roots.
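The gauge-import rule above can be sketched as a filter that admits only validated summaries; the `gauges` dict stands in for a future metrics-service capability, and the keying follows the `benchmark.last_median_ms` example rather than a decided naming scheme.

```python
# Sketch: import only validated benchmark summaries as low-cardinality
# gauges. The gauges dict is a stand-in for a future metrics service.
def import_summaries(results: list[dict], gauges: dict) -> None:
    for r in results:
        if r.get("status") != "passed":
            continue  # failed/unsupported/invalid runs never become gauges
        key = ("benchmark.last_median_ms",
               r["benchmarkName"],
               r.get("profile", "default"))
        gauges[key] = r["result"]["measurement"]["median"]
```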

The existing `system-monitoring-proposal.md` boundary remains correct:
cycle-counter instrumentation stays behind `measure`, while cheap counters can
later graduate into narrow stats caps.

## External Grounding

Relevant local design grounding:

- `docs/build-run-test.md`
- `docs/status.md`
- `docs/proposals/system-monitoring-proposal.md`
- `docs/architecture/capability-ring.md`
- `docs/architecture/park.md`
- `docs/architecture/scheduling.md`
- `docs/research/sel4.md`
- `docs/research/zircon.md`
- `docs/research/genode.md`
- `docs/research/out-of-kernel-scheduling.md`

External sources checked:

- USENIX lmbench paper page:
  `https://www.usenix.org/conference/usenix-1996-annual-technical-conference/lmbench-portable-tools-performance-analysis`
- fio documentation:
  `https://fio.readthedocs.io/en/master/fio_doc.html`
- iperf3 documentation:
  `https://software.es.net/iperf/`
- SPEC CPU 2017 overview and run rules:
  `https://www.spec.org/osg/cpu2017/`
  and `https://www.spec.org/cpu2017/Docs/runrules.html`
- Byte UnixBench repository:
  `https://github.com/kdlucas/byte-unixbench`
- SQLite testing documentation and OpenBenchmarking SQLite speedtest profile:
  `https://www.sqlite.org/testing.html`
  and `https://openbenchmarking.org/test/pts/sqlite-speedtest`
- TPC benchmark overview, TPC-C, TPC-H, and TPC-DS descriptions:
  `https://www.tpc.org/information/benchmarks5.asp`,
  `https://www.tpc.org/tpcc/default5.asp`,
  `https://www.tpc.org/tpch/default5.asp`,
  and `https://www.tpc.org/tpcds/`
- YCSB and storage-engine benchmark references:
  `https://hse-project.github.io/apps/ycsb/`,
  `https://github.com/facebook/rocksdb/wiki/Benchmarking-tools`,
  and `https://github.com/google/leveldb`
- SPECjbb 2015, Renaissance, and HTTP service benchmark references:
  `https://www.spec.org/jbb2015/`,
  `https://renaissance.dev/`,
  and `https://github.com/wg/wrk`
- Cloud/service benchmark references:
  `https://github.com/parsa-epfl/cloudsuite`,
  `https://github.com/delimitrou/DeathStarBench`,
  and `https://tailbench.csail.mit.edu/`
- Storage and ML benchmark references:
  `https://www.spec.org/storage2020/`,
  `https://mlcommons.org/working-groups/benchmarks/storage/`,
  `https://mlcommons.org/benchmarks/training/`,
  and `https://docs.mlcommons.org/inference/index_gh/`
- OpenBenchmarking test-suite/profile descriptions:
  `https://openbenchmarking.org/suites/`
  and `https://openbenchmarking.org/tests`

The relevant lessons are straightforward:

- lmbench isolates OS primitives from larger application behavior and was
  explicitly used to compare system implementations.
- fio and iperf3 provide flexible, parameterized I/O and network workload
  models with machine-readable output and verification options.
- SPEC CPU's run rules show why disclosure, correct output, and configuration
  control matter when publishing comparative results.
- UnixBench is useful as a historical system benchmark, but its own workload
  descriptions reveal Unix assumptions that capOS must translate carefully.
- SQLite speedtest is a recognizable application workload with broad public
  baseline data, but database benchmarking must distinguish RAM-backed and
  storage-backed results.
- TPC-C/TPC-E and TPC-H/TPC-DS are the right industry references for later
  OLTP and decision-support database claims, but capOS should treat early runs
  as TPC-inspired unless it can satisfy the relevant TPC rules and disclosure
  requirements.
- YCSB and `db_bench` are useful earlier data-system pressure tests because
  they can exercise key-value, read/write mix, and storage-engine behavior
  before capOS has a full SQL system.
- SPECjbb and Renaissance become relevant only when a Java profile exists;
  until then they are runtime targets, not near-term OS benchmarks.
- CloudSuite, DeathStarBench, and TailBench are good references for cloud,
  microservice, and tail-latency user stories, but they require a mature
  service graph, load generation, and workload-specific correctness checks.
- SPECstorage and MLPerf Storage are later storage references once capOS has
  durable storage and enough client/load infrastructure to avoid misleading
  fio-only claims.
- MLPerf inference/training is relevant only after model runtimes and
  accelerator or CPU-baseline execution are credible, and any result must carry
  the benchmark's accuracy or quality target rather than only throughput.
- OpenBenchmarking/Phoronix-style test profiles are useful precedent for
  packaging benchmark definitions separately from result storage.

## Implementation Plan

1. **Structured parser for current `run-measure`.**
   Add a host parser that converts existing `measure:` and demo output lines
   into JSON artifacts with config hash, raw log path, and verifier status.

2. **Primitive benchmark manifest set.**
   Split ring, park, IPC, process, VM, and scheduler benchmarks into focused
   manifests so each can be repeated independently without running unrelated
   demos.

3. **Reference guest harness.**
   Add Linux guest scripts that run equivalent primitive tests under the same
   QEMU/KVM settings. Keep these scripts outside the capOS boot image.

4. **Translated OS microbench suite.**
   Implement `capos-osbench` for the subset of lmbench/UnixBench intents that
   capOS can represent honestly. Emit unsupported results for missing Store,
   file, mmap, and socket primitives until those subsystems exist.

5. **Common workload pilots.**
   Start with workloads that can be made deterministic early: compression,
   SQLite speedtest against RAM-backed storage once Store exists, shell/session
   latency, and remote-terminal user-story latency after the current milestone.

6. **Network and storage workloads.**
   Add iperf3/fio-equivalent profiles only after socket and block/storage
   capabilities exist. Use verification modes for write workloads.

7. **Benchmark store and monitoring bridge.**
   Add a `BenchmarkStore` service or CI artifact convention. Import only
   validated summary values into monitoring metrics, and audit privileged
   benchmark starts.

8. **Regression gates.**
   Add narrow CI thresholds for stable primitive paths. Use review-only
   warnings for noisy or hardware-dependent workloads until enough history
   exists.
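Step 1's parser could be sketched as follows. The `measure: <name> cycles=<n>` line format is an assumption based on the feature name, not the actual kernel log format, and should be replaced with the real one.

```python
# Sketch for step 1: convert measure output lines into a JSON-shaped
# artifact. The "measure: <name> cycles=<n>" format is an assumed
# illustration; substitute the real kernel output format.
import hashlib
import re

MEASURE_LINE = re.compile(r"^measure:\s+(?P<name>\S+)\s+cycles=(?P<cycles>\d+)$")

def parse_measure_log(lines, raw_log_path, verifier_status):
    samples = {}
    for line in lines:
        m = MEASURE_LINE.match(line.strip())
        if m:
            samples.setdefault(m["name"], []).append(int(m["cycles"]))
    # Hash the raw log so the artifact is tied to exactly this run.
    config_hash = hashlib.sha256("\n".join(lines).encode()).hexdigest()
    return {
        "configHash": config_hash,
        "rawLogPath": raw_log_path,
        "verifierStatus": verifier_status,
        "measurements": samples,
    }
```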

## Reporting Format

Published reports should include:

- executive table with benchmark, status, unit, capOS median, Linux guest
  median, ratio, and notes;
- separate sections for primitive, common workload, and user-story results;
- correctness summary with failed/unsupported/invalid runs;
- configuration appendix with hashes and QEMU commands;
- raw artifact links;
- explicit warning for benchmark-only builds, debug tap runs, or special caps.

Do not publish a capOS "system score." The useful output is a workload matrix
with enough context to explain the result.

## Non-Goals

- No POSIX compatibility layer purely to run Unix benchmarks.
- No public comparison that treats unsupported workloads as zero performance.
- No single aggregate score.
- No benchmark-only fast paths in normal dispatch builds.
- No always-on cycle-counter tracing.
- No network result publication before the network path has correctness and
  authority proofs.
- No storage result publication before write verification and crash/error
  semantics are defined.

## Open Questions

- Which Linux primitive baselines should be first-class: pipe, Unix socket,
  futex, eventfd, io_uring, or all of them?
- Should the benchmark store be a capOS service, a host CI artifact convention,
  or both?
- What variance threshold should turn a benchmark from a CI gate into a
  review-only signal?
- How should reference OS images be pinned and distributed without bloating the
  repository?
- What is the earliest honest SQLite storage profile: RAM-only, MemoryObject
  backed, Store-backed, or block-backed?
- Should benchmark definitions be modeled as manifest fragments, host-side
  YAML/JSON, or capOS service objects?
