Proposal: System Performance Benchmarks

This proposal describes how capOS should benchmark system performance against other operating systems without producing misleading numbers, rewarding special-case optimizations, or treating speed as a substitute for correct capability behavior.

Problem

capOS already has smoke tests, QEMU boot proofs, ring-tap debugging, and a measure feature for focused cycle measurements. Those are necessary, but they do not answer the product-level question: can capOS remain effective on common workloads while preserving its capability model?

Generic OS benchmark suites are useful but dangerous in this project. Most assume POSIX process, file, pipe, socket, and shell semantics. capOS should not fake broad ambient Unix authority just to run a familiar benchmark. It also should not compare a capability-native path against Linux, FreeBSD, or a microkernel by publishing a single blended score that hides unsupported semantics, incorrect outputs, or different isolation boundaries.

The benchmark system needs to produce three kinds of evidence:

  • Primitive cost: capability calls, IPC, scheduling, park waits, VM changes, process creation, memory copy, and later device I/O.
  • Common workload adequacy: database, compression, build, network, storage, shell/session, service graph, and runtime workloads that users recognize.
  • Correctness under load: workload outputs, service boundaries, capability denial paths, and data integrity must remain correct while performance is measured.

Current State

Implemented measurement and comparison hooks:

  • make run-measure builds a separate measurement kernel feature and boots system-measure.cue.
  • kernel/src/measure.rs records benchmark-only dispatch counters and cycle segments for ring processing, SQE validation, cap lookup, Cap’n Proto encode/decode, method body dispatch, CQE posting, and waiter wake checks.
  • The measurement manifest grants ring-nop a measurement-only NullCap and ParkBench capability through ProcessSpawner.
  • demos/ring-nop measures CAP_OP_NOP, empty and small NullCap calls, and compact-versus-generic park-shaped operations.
  • demos/thread-lifecycle measures private ParkSpace failed wait, empty wake, wait-to-block, wake-to-runnable, and wake-to-resume paths.
  • make run-smoke, make run-spawn, make run-net, and focused service smokes provide correctness and user-visible behavior proofs, but they do not yet emit structured performance results.

That is enough for local dispatch decisions. It is not enough for comparing capOS with Linux, FreeBSD, seL4-based systems, Genode scenarios, or other OS baselines on common workloads.

Design Principles

  1. Correctness gates first. A benchmark result is publishable only when the workload’s output verifier passes and capOS-specific authority checks still hold.
  2. No semantic laundering. Unsupported POSIX features are reported as unsupported or not applicable, not silently emulated through broad authority.
  3. Benchmark artifacts are not normal metrics. Always-on monitoring may expose low-cost counters. Benchmark logs, raw samples, host configuration, and per-run outputs are retained as explicit benchmark artifacts.
  4. Compare like mechanisms where possible. Compare capOS capability IPC to Linux pipes, Unix domain sockets, io_uring, or futexes only when the semantic differences are declared in the result.
  5. Use common suites as references, not design masters. lmbench, UnixBench, fio, iperf3, SQLite speedtest, Phoronix/OpenBenchmarking profiles, and SPEC CPU are valuable precedent. capOS should adopt their methodology where it fits and reject assumptions that would distort capOS.
  6. Publish raw context. Results include kernel commit, manifest, QEMU command, CPU model, host OS, compiler, build flags, feature flags, warmup, run count, and raw logs.
  7. Separate hosted and native comparisons. Early capOS runs in QEMU. Compare against Linux/FreeBSD guests under the same QEMU/KVM envelope, and separately against native host OS runs when the question is absolute hardware performance.
  8. Regression gates are narrower than claims. CI gates should catch local regressions in stable paths. Public OS comparisons need controlled machines, repeated runs, and manual review.
  9. Security posture is part of the result. A fast result that requires a broader cap bundle, disabled validation, payload tracing, or a special kernel build must be labeled as such.
  10. No single score. capOS should publish a matrix of workload results and ratios, not an aggregate score that implies all workloads matter equally.
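
Principle 6 implies that every published number travels with a machine-readable context record rather than a prose footnote. A minimal host-side sketch of that record follows; the field names are illustrative only and would be refined once a benchmark harness exists.

```rust
// Illustrative context record for principle 6 (publish raw context).
// Field names are hypothetical, not an existing capOS type.
struct RunContext {
    kernel_commit: String,
    manifest: String,
    qemu_command: String,
    cpu_model: String,
    host_os: String,
    compiler: String,
    build_flags: Vec<String>,
    feature_flags: Vec<String>,
    warmup_runs: u32,
    measured_runs: u32,
    raw_log_path: String,
}
```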

Benchmark Tiers

Tier 0: Existing Correctness Smokes

Tier 0 is not a performance suite. It is the mandatory correctness floor:

  • default boot/login/shell smoke;
  • focused spawn, shell, terminal, credential, login, chat, adventure, revocable-read, memory-object, ringtap, networking, and measurement smokes;
  • host tests for config, ring Loom, capos-lib, mkmanifest, generated code, and runtime surface checks.

No performance result should be retained when the relevant Tier 0 proof fails.

Tier 1: capOS-Native Primitive Benchmarks

These benchmarks measure the cost of capOS mechanisms directly:

| Area | Initial measurements | Correctness condition |
| --- | --- | --- |
| Ring transport | CAP_OP_NOP, empty NullCap, small payload NullCap, CQE post | expected CQE result, no overflow, bounded dropped count |
| Cap dispatch | cap lookup, generation rejection, revoked cap rejection, invalid method | correct CAP_ERR_* or CapException |
| IPC | endpoint CALL/RECV/RETURN round trip, direct handoff, transfer copy/move | reply payload and transferred-cap identity match oracle |
| Park/threading | failed wait, timeout, wake-one, wake-many, wake-to-resume | waiter count and join status match oracle |
| Scheduler | context switch latency, timer wake latency, direct IPC handoff latency | no runnable-thread loss or unexpected starvation |
| Process lifecycle | spawn, ELF load, wait, failed spawn rejection | child output and exit code match manifest oracle |
| VM/memory | map/protect/unmap, MemoryObject map, frame allocation/free | data visibility, W^X, quota, and cleanup checks pass |
| Terminal/session | readLine/write latency and throughput under foreground ownership | echo/cancellation/stale-input checks pass |

These are capOS results first. Linux or FreeBSD baselines can use matching native mechanisms, but the report must describe the mapping. For example, a capOS endpoint IPC round trip can be compared with Linux pipe, Unix-domain socket, eventfd, or futex ping-pong results, but none is a perfect semantic match.
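
As an illustration of how small such a declared baseline can be, a Linux-guest Unix-domain-socket ping-pong is one candidate comparison point for the endpoint round trip. The sketch below uses only the Rust standard library; the iteration count and output line are illustrative, not a prescribed harness.

```rust
// Hypothetical Linux-guest baseline: Unix-domain-socket ping-pong round-trip
// latency, one of the "closest native mechanism" comparisons named above.
use std::io::{Read, Write};
use std::os::unix::net::UnixStream;
use std::thread;
use std::time::Instant;

fn main() -> std::io::Result<()> {
    let (mut a, mut b) = UnixStream::pair()?;
    let iterations = 100_000;

    // Echo peer: reads one byte and writes it back.
    let echo = thread::spawn(move || {
        let mut buf = [0u8; 1];
        for _ in 0..iterations {
            b.read_exact(&mut buf).unwrap();
            b.write_all(&buf).unwrap();
        }
    });

    let mut buf = [0u8; 1];
    let start = Instant::now();
    for _ in 0..iterations {
        a.write_all(&[0x42])?;
        a.read_exact(&mut buf)?;
    }
    let elapsed = start.elapsed();
    echo.join().unwrap();

    // One round trip = local write + remote read + remote write + local read.
    println!(
        "uds_pingpong_ns_per_rt {}",
        elapsed.as_nanos() / iterations as u128
    );
    Ok(())
}
```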

Tier 2: Translated OS Microbenchmarks

lmbench and UnixBench are useful because they isolate OS primitives such as system-call overhead, process creation, context switching, pipes, networking, and filesystem reads. They are also Unix-shaped.

capOS should implement a capos-osbench harness that translates the benchmark intent into capability-native operations:

  • fork/exec/wait intent becomes ProcessSpawner.spawn plus ProcessHandle.wait.
  • pipe throughput/context switching becomes Endpoint or a future byte-stream or socket capability round trip, labeled by transport.
  • getpid syscall overhead becomes a minimal kernel fact cap or CAP_OP_NOP, labeled as “capOS ring entry” rather than “POSIX syscall”.
  • file reread and mmap benchmarks remain unsupported until Store/Namespace and file-backed mappings exist.
  • networking tests map to TcpSocket/TcpListener once the Telnet and socket capability work lands.

The translated suite must emit not_applicable for missing capability subsystems instead of adding compatibility shims that change the OS being measured.
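
A sketch of how capos-osbench might classify translated intents appears below; the enum and variant names are hypothetical, since the harness does not exist yet. The point is only that missing subsystems become an explicit not-applicable result rather than a shim.

```rust
// Illustrative-only classification for a future capos-osbench harness.
// Names are hypothetical, not an existing API.
enum UnixIntent {
    ForkExecWait,
    PipeContextSwitch,
    SyscallOverhead,
    FileReread,
    MmapBandwidth,
    TcpLatency,
}

enum Translation {
    /// Runs against a capability-native operation, labeled by transport.
    Native { capos_op: &'static str, label: &'static str },
    /// Reported as not_applicable until the capability subsystem exists.
    NotApplicable { missing: &'static str },
}

fn translate(intent: UnixIntent) -> Translation {
    match intent {
        UnixIntent::ForkExecWait => Translation::Native {
            capos_op: "ProcessSpawner.spawn + ProcessHandle.wait",
            label: "process lifecycle",
        },
        UnixIntent::PipeContextSwitch => Translation::Native {
            capos_op: "Endpoint round trip",
            label: "endpoint transport",
        },
        UnixIntent::SyscallOverhead => Translation::Native {
            capos_op: "CAP_OP_NOP",
            label: "capOS ring entry",
        },
        UnixIntent::FileReread | UnixIntent::MmapBandwidth => Translation::NotApplicable {
            missing: "Store/Namespace and file-backed mappings",
        },
        UnixIntent::TcpLatency => Translation::NotApplicable {
            missing: "TcpSocket/TcpListener capabilities",
        },
    }
}
```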

Tier 3: Portable Common Workloads

These benchmarks answer whether capOS is useful on recognizable work:

| Workload | Candidate benchmark | capOS prerequisite | Result verifier |
| --- | --- | --- | --- |
| SQLite database | SQLite speedtest1, optionally via a Phoronix profile on reference OSes | C runtime or native port, Store/Namespace or RAM-backed DB | SQLite exit status, optional SQL result checksum |
| OLTP database | TPC-C/TPC-E-inspired profile, not an official TPC result until disclosure and durability rules are met | durable Store/block I/O, SQL/database stack, transaction integrity, terminal/client driver model | committed transaction counts, invariant checks, ACID/error-injection proof |
| Decision-support database | TPC-H/TPC-DS-inspired profile at declared scale factors, not an official TPC result until rules are met | SQL/query engine, bulk data load, durable or explicitly memory-backed storage, query result verifier | query answer hashes, load status, scale factor, refresh/query stream status |
| Key-value serving | YCSB-style read/update/scan/insert mixes | Store/Namespace, KV service, stable client driver | operation counts, latency distribution, value/hash verifier |
| Storage engine | RocksDB/LevelDB db_bench-style fill/read/overwrite/seek profiles | file/store semantics, fsync/sync policy, storage engine port | key/value integrity, database reopen, configured write durability |
| Compression | xz, zstd, or small native compressor corpus | C/Rust userspace runtime and file/store access | compressed output hash and decompression hash |
| Build/developer workload | small Rust/C package build, later IX package build | process spawning, Store/Namespace, toolchain support | output artifact hash and build log status |
| Network throughput | iperf3-equivalent TCP stream and request/response latency | TcpSocket, network harness | byte count, JSON/structured summary, peer checksum |
| Storage I/O | fio-equivalent sequential/random read/write, verify mode | block device, Store/Namespace, direct I/O policy | fio-style verify/checksum result |
| File service | SPECstorage-inspired workload profile | network filesystem or capOS file-service equivalent, durable storage, client load generation | throughput, response time, data integrity |
| Java/server runtime | SPECjbb 2015 or Renaissance-inspired profiles | JVM or Java compatibility profile, timers, threads, networking/storage as needed | benchmark verifier and SLA/throughput summary |
| HTTP service | wrk-style request load against a capOS HTTP service | TCP, HTTP service, stable response corpus | response checksum/status mix, latency distribution, error rate |
| Cloud services | CloudSuite-inspired data caching/serving/search/web profiles | multi-service graph, storage/network/runtime support | workload-specific answer checks and service SLOs |
| Microservices | DeathStarBench/TailBench-inspired tail-latency profiles | service graph, network or local RPC, load generator, tracing/status caps | request correctness, p95/p99 latency, no unauthorized cap exposure |
| ML storage | MLPerf Storage-inspired data feeding profile | high-throughput storage path, dataset loader, accelerator or simulated training reader | records/images delivered, latency/throughput, data checksum |
| ML inference/training | MLPerf-inspired inference/training profile | model runtime, accelerator/GPU capability or CPU baseline, dataset and accuracy harness | accuracy/quality target plus throughput or time-to-train |
| Shell/session | boot-to-shell, Telnet shell, command launch latency | current shell plus terminal/socket path | transcript oracle and authority denial checks |
| Service graph | chat/adventure/resident service load | shared-service demos | scripted transcript and service identity checks |
| Runtime/library | Go/Lua/Wasm micro and app kernels | relevant runtime proposal milestones | language-level test suite or checksum oracle |

Early capOS should start with RAM-backed variants where storage is not ready, but those results must be labeled as memory-backed. A RAM-backed database result does not compare to a Linux disk-backed SQLite result.

The larger industry benchmark families belong later in the plan than SQLite speedtest and simple compression/build profiles. TPC-C/TPC-E and TPC-H/TPC-DS are database-system references with strict workload, disclosure, pricing, and correctness expectations. SPEC, MLPerf, CloudSuite, TailBench, and DeathStarBench bring similar setup and disclosure obligations in their domains. capOS can use inspired profiles to exercise the same workload classes before it can make official or directly comparable claims, but reports must label them as such and state which upstream rules are not yet satisfied.

Tier 4: User-Story Benchmarks

User-story benchmarks measure complete workflows that a person, operator, or service owner would recognize. They are intentionally broader than a single primitive or portable benchmark profile, and they should be described by the user outcome they prove rather than by the current demo implementation.

Initial user stories:

| Story | Example capOS proof | Result verifier |
| --- | --- | --- |
| Start a local session | boot to an interactive shell or terminal prompt | transcript reaches ready prompt with expected cap bundle |
| Authenticate and receive authority | anonymous session upgrades to an operator/session profile | wrong credential denied, right credential grants exact profile |
| Run a delegated task | launch a child process with a narrow cap bundle | child output, exit code, and denied extra authority match oracle |
| Use a remote terminal | host-local TCP terminal reaches the same shell/session model | connect, authenticate, run command, clean disconnect |
| Use a resident service | client talks to a long-running service through scoped authority | request/reply transcript and service-visible identity match oracle |
| Serve a network request | network-facing service handles requests while local work continues | response checksum, latency, and no unauthorized cap exposure |
| Complete a developer workflow | build or transform an artifact from declared inputs | output hash, logs, and resource profile match declared policy |
| Recover from expected failure | service fault, rejected grant, timeout, or restart path | failure is bounded, audited, and visible through status |

User-story results report latency distribution, success rate, resource usage, and authority outcome. They are the closest evidence for “effective on common workloads,” but they are not substitutes for primitive measurements when a regression appears.

Reference Operating Systems

Initial comparisons should use these environments:

| Reference | Why include it | Caveat |
| --- | --- | --- |
| Linux guest under same QEMU/KVM flags | Stable baseline with broad benchmark support | Linux has mature drivers, filesystems, VM, scheduler, and libc |
| FreeBSD guest under same QEMU/KVM flags | Second mature Unix-like baseline, useful for POSIX-independent signal | Not every benchmark profile has equal FreeBSD support |
| Linux native host | Shows absolute host hardware ceiling | Not directly comparable to capOS-in-QEMU latency |
| seL4 or Genode reports/scenarios | Prior art for capability/microkernel IPC and service decomposition | Often not the same hardware, workload, or application stack |

The default published table should show capOS versus Linux guest first. Native host and external microkernel data belong in separate context columns, not the primary ratio.

Correctness Model

Every benchmark definition carries:

  • expected input corpus hash;
  • command or manifest used to run the workload;
  • output verifier;
  • allowed nondeterminism, such as timestamps or generated IDs;
  • capOS authority profile;
  • unsupported-feature policy;
  • result parser version.
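
One way to capture such a definition is a small host-side record; the sketch below mirrors the list above, with otherwise illustrative field names and no claim about where it is eventually stored.

```rust
// Hypothetical host-side benchmark definition; fields mirror the list above.
struct BenchmarkDefinition {
    name: String,
    tier: u16,
    /// Hash of the expected input corpus.
    input_corpus_hash: [u8; 32],
    /// Command line or manifest used to run the workload.
    invocation: String,
    /// Name/version of the output verifier that must pass.
    output_verifier: String,
    /// Allowed nondeterminism, e.g. timestamps or generated IDs.
    allowed_nondeterminism: Vec<String>,
    /// Declared capOS authority profile for the run.
    authority_profile: String,
    /// Policy for unsupported features: report, skip, or fail.
    unsupported_policy: String,
    result_parser_version: String,
}
```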

A result is invalid when:

  • the output verifier fails;
  • QEMU exits abnormally;
  • the kernel panics or reports an unexpected fault;
  • the benchmark had to grant broader authority than its declared profile;
  • host logs show dropped records that invalidate the measurement;
  • the run used a special fast path not available in the declared configuration;
  • the reference OS result used a materially different workload size or dataset.

Correctness should be stored alongside the performance value. A fast failed run is not a slow successful run; it is no result.
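
These invalidity rules reduce to a single publish gate. A sketch follows, with a hypothetical RunRecord shape; the real harness would derive these flags from logs and verifier output rather than set them by hand.

```rust
// Sketch of the publish gate implied above; RunRecord fields are hypothetical.
struct RunRecord {
    verifier_passed: bool,
    qemu_exit_ok: bool,
    kernel_fault: bool,
    authority_escalated: bool,
    dropped_records: bool,
    special_fast_path: bool,
    reference_workload_matches: bool,
}

/// A fast failed run is not a slow successful run; it is no result.
fn publishable(run: &RunRecord) -> bool {
    run.verifier_passed
        && run.qemu_exit_ok
        && !run.kernel_fault
        && !run.authority_escalated
        && !run.dropped_records
        && !run.special_fast_path
        && run.reference_workload_matches
}
```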

Measurement Method

Controlled runs should use:

  • fixed capOS commit, reference OS image hash, benchmark source hash, compiler version, and toolchain flags;
  • fixed QEMU version, machine type, CPU model, memory size, SMP count, KVM/TCG mode, disk image type, and network backend;
  • warmup runs for workloads with caches, JITs, connection setup, or first-use allocation;
  • at least 5 measured runs for primitive and user-story benchmarks, more when coefficient of variation is high;
  • median, min, max, standard deviation, and p95/p99 for latency where sample count supports it;
  • raw logs retained for the benchmark artifact;
  • no performance claim from one isolated run unless explicitly labeled as a smoke measurement.
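
A small host-side summary helper covering the statistics listed above can stay in the standard library; the nearest-rank p95 and the coefficient-of-variation check below are illustrative choices, not a fixed method.

```rust
// Sketch: median, nearest-rank p95, and coefficient of variation over
// repeated runs; a high CV is the signal to schedule more runs.
fn summarize(samples: &mut Vec<f64>) -> (f64, f64, f64) {
    assert!(samples.len() >= 5, "at least 5 measured runs");
    samples.sort_by(|a, b| a.partial_cmp(b).unwrap());
    let median = samples[samples.len() / 2];
    // Nearest-rank p95; only meaningful when the sample count supports it.
    let p95_idx = ((samples.len() as f64) * 0.95).ceil() as usize - 1;
    let p95 = samples[p95_idx.min(samples.len() - 1)];
    let mean = samples.iter().sum::<f64>() / samples.len() as f64;
    let var = samples.iter().map(|s| (s - mean).powi(2)).sum::<f64>()
        / samples.len() as f64;
    let cv = var.sqrt() / mean;
    (median, p95, cv)
}
```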

Cycle-counter measurements remain inside cfg(feature = "measure") and are used for relative path decisions. Wall-clock user-story and workload comparisons use host-side timestamps around QEMU transcripts or in-guest monotonic timers when the timer contract is adequate.
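
On the kernel side, that boundary could look like the feature-gated helper below. This is a sketch only, assuming an x86_64 target; the actual kernel/src/measure.rs API may differ, and the unserialized counter read is why the values are treated as relative, not absolute.

```rust
// Sketch only: benchmark-only cycle helper compiled solely under the measure
// feature. Assumes x86_64; kernel/src/measure.rs may be structured differently.
#[cfg(feature = "measure")]
pub fn cycles_now() -> u64 {
    // SAFETY: _rdtsc has no memory effects; serialization is intentionally
    // omitted, so treat results as relative path costs only.
    unsafe { core::arch::x86_64::_rdtsc() }
}

#[cfg(feature = "measure")]
pub struct Segment {
    start: u64,
}

#[cfg(feature = "measure")]
impl Segment {
    pub fn begin() -> Self {
        Segment { start: cycles_now() }
    }
    pub fn end(self) -> u64 {
        cycles_now() - self.start
    }
}
```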

Result Schema

The benchmark harness should emit a structured artifact, not a free-form log:

```capnp
enum BenchmarkStatus {
  passed       @0;
  failed       @1;
  unsupported  @2;
  invalid      @3;
}

struct BenchmarkResult {
  runId          @0 :Text;
  benchmarkName  @1 :Text;
  tier           @2 :UInt16;
  status         @3 :BenchmarkStatus;
  correctnessId  @4 :Text;
  configHash     @5 :Data;
  artifactHash   @6 :Data;
  notes          @7 :Text;

  result :union {
    measurement @8 :MeasurementSummary;
    failure     @9 :RunFailure;
    unsupported @10 :RunFailure;
    invalid     @11 :RunFailure;
  }
}

struct MeasurementSummary {
  unit           @0 :Text;
  lowerIsBetter  @1 :Bool;
  median         @2 :Float64;
  p95            @3 :Float64;
  samples        @4 :List(Float64);
}

struct RunFailure {
  reason  @0 :Text;
  detail  @1 :Text;
}
```

This schema is conceptual. It should not be added to schema/capos.capnp until a concrete benchmark-runner service exists; until then, host scripts can emit JSON with the same shape. The important property is that measurement values exist only in the passed/publishable branch; failed, unsupported, and invalid runs carry reasons instead of zero-valued scalar defaults.
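
A host-side sketch of that JSON shape is below, assuming the serde and serde_json crates as dependencies; field names mirror the conceptual schema and the variant tags are illustrative.

```rust
// Host-side JSON artifact mirroring the conceptual schema. Assumes serde and
// serde_json; field and tag names are illustrative.
use serde::{Deserialize, Serialize};

#[derive(Serialize, Deserialize)]
#[serde(rename_all = "snake_case")]
enum BenchmarkStatus {
    Passed,
    Failed,
    Unsupported,
    Invalid,
}

#[derive(Serialize, Deserialize)]
#[serde(tag = "kind", rename_all = "snake_case")]
enum RunResult {
    // Measurement values exist only in the passed branch.
    Measurement { unit: String, lower_is_better: bool, median: f64, p95: f64, samples: Vec<f64> },
    Failure { reason: String, detail: String },
    Unsupported { reason: String, detail: String },
    Invalid { reason: String, detail: String },
}

#[derive(Serialize, Deserialize)]
struct BenchmarkResult {
    run_id: String,
    benchmark_name: String,
    tier: u16,
    status: BenchmarkStatus,
    correctness_id: String,
    config_hash: String,
    artifact_hash: String,
    notes: String,
    result: RunResult,
}

fn emit(result: &BenchmarkResult) -> serde_json::Result<String> {
    serde_json::to_string_pretty(result)
}
```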

Integration With System Monitoring

System Monitoring should expose operational state; the benchmark system should store explicit run artifacts. The overlap is narrow:

  • benchmark runs may read scoped MetricsReader, SystemStatus, RingStats, SchedStats, and later device stats before and after a run;
  • benchmark summaries may be imported into a metrics service as low-cardinality gauges such as benchmark.last_median_ms, keyed by benchmark name and profile, after validation;
  • raw samples, transcripts, QEMU logs, host environment, and correctness evidence belong in a BenchmarkStore or CI artifact store, not in always-on metrics;
  • starting a privileged benchmark profile is an auditable event because it may require measurement-only caps, debug taps, or broad status readers;
  • benchmark readers should receive scoped read-only caps, not global monitoring roots.

The existing system-monitoring-proposal.md boundary remains correct: cycle-counter instrumentation stays behind measure, while cheap counters can later graduate into narrow stats caps.
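
A sketch of the narrow summary import described above follows; MetricsWriter and the gauge name are hypothetical stand-ins for whatever scoped caps the monitoring work eventually defines.

```rust
// Sketch only: gate a benchmark summary before exporting it as a monitoring
// gauge. The trait and gauge name are hypothetical.
trait MetricsWriter {
    fn set_gauge(&mut self, name: &str, labels: &[(&str, &str)], value: f64);
}

fn export_validated_median(
    metrics: &mut dyn MetricsWriter,
    benchmark: &str,
    profile: &str,
    verifier_passed: bool,
    median_ms: f64,
) {
    // Only validated, passed runs graduate into always-on metrics; raw samples
    // and transcripts stay in the benchmark artifact store.
    if verifier_passed {
        metrics.set_gauge(
            "benchmark.last_median_ms",
            &[("benchmark", benchmark), ("profile", profile)],
            median_ms,
        );
    }
}
```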

External Grounding

Relevant local design grounding:

  • docs/build-run-test.md
  • docs/status.md
  • docs/proposals/system-monitoring-proposal.md
  • docs/architecture/capability-ring.md
  • docs/architecture/park.md
  • docs/architecture/scheduling.md
  • docs/research/sel4.md
  • docs/research/zircon.md
  • docs/research/genode.md
  • docs/research/out-of-kernel-scheduling.md

External sources checked:

  • USENIX lmbench paper page: https://www.usenix.org/conference/usenix-1996-annual-technical-conference/lmbench-portable-tools-performance-analysis
  • fio documentation: https://fio.readthedocs.io/en/master/fio_doc.html
  • iperf3 documentation: https://software.es.net/iperf/
  • SPEC CPU 2017 overview and run rules: https://www.spec.org/osg/cpu2017/ and https://www.spec.org/cpu2017/Docs/runrules.html
  • Byte UnixBench repository: https://github.com/kdlucas/byte-unixbench
  • SQLite testing documentation and OpenBenchmarking SQLite speedtest profile: https://www.sqlite.org/testing.html and https://openbenchmarking.org/test/pts/sqlite-speedtest
  • TPC benchmark overview, TPC-C, TPC-H, and TPC-DS descriptions: https://www.tpc.org/information/benchmarks5.asp, https://www.tpc.org/tpcc/default5.asp, https://www.tpc.org/tpch/default5.asp, and https://www.tpc.org/tpcds/
  • YCSB and storage-engine benchmark references: https://hse-project.github.io/apps/ycsb/, https://github.com/facebook/rocksdb/wiki/Benchmarking-tools, and https://github.com/google/leveldb
  • SPECjbb 2015, Renaissance, and HTTP service benchmark references: https://www.spec.org/jbb2015/, https://renaissance.dev/, and https://github.com/wg/wrk
  • Cloud/service benchmark references: https://github.com/parsa-epfl/cloudsuite, https://github.com/delimitrou/DeathStarBench, and https://tailbench.csail.mit.edu/
  • Storage and ML benchmark references: https://www.spec.org/storage2020/, https://mlcommons.org/working-groups/benchmarks/storage/, https://mlcommons.org/benchmarks/training/, and https://docs.mlcommons.org/inference/index_gh/
  • OpenBenchmarking test-suite/profile descriptions: https://openbenchmarking.org/suites/ and https://openbenchmarking.org/tests

The relevant lessons are straightforward:

  • lmbench isolates OS primitives from larger application behavior and was explicitly used to compare system implementations.
  • fio and iperf3 provide flexible, parameterized I/O and network workload models with machine-readable output and verification options.
  • SPEC CPU’s run rules show why disclosure, correct output, and configuration control matter when publishing comparative results.
  • UnixBench is useful as a historical system benchmark, but its own workload descriptions reveal Unix assumptions that capOS must translate carefully.
  • SQLite speedtest is a recognizable application workload with broad public baseline data, but database benchmarking must distinguish RAM-backed and storage-backed results.
  • TPC-C/TPC-E and TPC-H/TPC-DS are the right industry references for later OLTP and decision-support database claims, but capOS should treat early runs as TPC-inspired unless it can satisfy the relevant TPC rules and disclosure requirements.
  • YCSB and db_bench are useful earlier data-system pressure tests because they can exercise key-value, read/write mix, and storage-engine behavior before capOS has a full SQL system.
  • SPECjbb and Renaissance become relevant only when a Java profile exists; until then they are runtime targets, not near-term OS benchmarks.
  • CloudSuite, DeathStarBench, and TailBench are good references for cloud, microservice, and tail-latency user stories, but they require a mature service graph, load generation, and workload-specific correctness checks.
  • SPECstorage and MLPerf Storage are later storage references once capOS has durable storage and enough client/load infrastructure to avoid misleading fio-only claims.
  • MLPerf inference/training is relevant only after model runtimes and accelerator or CPU-baseline execution are credible, and any result must carry the benchmark’s accuracy or quality target rather than only throughput.
  • OpenBenchmarking/Phoronix-style test profiles are useful precedent for packaging benchmark definitions separately from result storage.

Implementation Plan

  1. Structured parser for current run-measure. Add a host parser that converts existing measure: and demo output lines into JSON artifacts with config hash, raw log path, and verifier status; a parser sketch follows this list.

  2. Primitive benchmark manifest set. Split ring, park, IPC, process, VM, and scheduler benchmarks into focused manifests so each can be repeated independently without running unrelated demos.

  3. Reference guest harness. Add Linux guest scripts that run equivalent primitive tests under the same QEMU/KVM settings. Keep these scripts outside the capOS boot image.

  4. Translated OS microbench suite. Implement capos-osbench for the subset of lmbench/UnixBench intents that capOS can represent honestly. Emit unsupported results for missing Store, file, mmap, and socket primitives until those subsystems exist.

  5. Common workload pilots. Start with workloads that can be made deterministic early: compression, SQLite speedtest against RAM-backed storage once Store exists, shell/session latency, and remote-terminal user-story latency after the current milestone.

  6. Network and storage workloads. Add iperf3/fio-equivalent profiles only after socket and block/storage capabilities exist. Use verification modes for write workloads.

  7. Benchmark store and monitoring bridge. Add a BenchmarkStore service or CI artifact convention. Import only validated summary values into monitoring metrics, and audit privileged benchmark starts.

  8. Regression gates. Add narrow CI thresholds for stable primitive paths. Use review-only warnings for noisy or hardware-dependent workloads until enough history exists.
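
The step-1 parser referenced above could start as small as the sketch below, using only the standard library; the measure: line grammar shown here is an assumption and would need to match the real run-measure output before use.

```rust
// Sketch of the step-1 host parser; the "measure: <name> <cycles>" line shape
// is assumed, not the exact current output grammar.
use std::io::{self, BufRead};

fn main() {
    let stdin = io::stdin();
    let mut records = Vec::new();
    for line in stdin.lock().lines().map_while(Result::ok) {
        let Some(rest) = line.strip_prefix("measure: ") else { continue };
        let mut parts = rest.split_whitespace();
        let (Some(name), Some(value)) = (parts.next(), parts.next()) else { continue };
        if let Ok(cycles) = value.parse::<u64>() {
            records.push((name.to_string(), cycles));
        }
    }
    // Emit a minimal JSON artifact; a real harness would also attach the
    // config hash, raw log path, and verifier status described in step 1.
    print!("[");
    for (i, (name, cycles)) in records.iter().enumerate() {
        if i > 0 { print!(","); }
        print!("{{\"name\":\"{}\",\"cycles\":{}}}", name, cycles);
    }
    println!("]");
}
```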

Reporting Format

Published reports should include:

  • executive table with benchmark, status, unit, capOS median, Linux guest median, ratio, and notes;
  • separate sections for primitive, common workload, and user-story results;
  • correctness summary with failed/unsupported/invalid runs;
  • configuration appendix with hashes and QEMU commands;
  • raw artifact links;
  • explicit warning for benchmark-only builds, debug tap runs, or special caps.

Do not publish a capOS “system score.” The useful output is a workload matrix with enough context to explain the result.

Non-Goals

  • No POSIX compatibility layer purely to run Unix benchmarks.
  • No public comparison that treats unsupported workloads as zero performance.
  • No single aggregate score.
  • No benchmark-only fast paths in normal dispatch builds.
  • No always-on cycle-counter tracing.
  • No network result publication before the network path has correctness and authority proofs.
  • No storage result publication before write verification and crash/error semantics are defined.

Open Questions

  • Which Linux primitive baselines should be first-class: pipe, Unix socket, futex, eventfd, io_uring, or all of them?
  • Should the benchmark store be a capOS service, a host CI artifact convention, or both?
  • What variance threshold should turn a benchmark from a CI gate into a review-only signal?
  • How should reference OS images be pinned and distributed without bloating the repository?
  • What is the earliest honest SQLite storage profile: RAM-only, MemoryObject backed, Store-backed, or block-backed?
  • Should benchmark definitions be modeled as manifest fragments, host-side YAML/JSON, or capOS service objects?