Benchmarks

capOS benchmark rows are evidence records. Each row should say what workload ran, what was verified, how time was measured, what machine envelope was used, and where the raw artifacts were stored. A faster row whose verifier did not complete is not a performance result.
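The rule above can be sketched as a small record type plus a filter. This is only an illustration: the field names, the `accepted_rows` helper, and the example artifact path are hypothetical and do not describe the harness's actual schema.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class BenchmarkRow:
    """One evidence record. Field names are illustrative, not the harness schema."""
    workload: str          # what ran, e.g. "run-thread-scale"
    verifier_passed: bool  # did the result check complete successfully?
    timing_method: str     # how time was measured
    machine_envelope: str  # host, pinning, and VM shape
    artifact_dir: str      # where the raw artifacts were stored
    elapsed: float         # measured value, in the unit named by timing_method


def accepted_rows(rows):
    """Keep only rows that are complete evidence: verified and fully described."""
    kept = []
    for row in rows:
        described = all([row.workload, row.timing_method,
                         row.machine_envelope, row.artifact_dir])
        if row.verifier_passed and described:
            kept.append(row)
    return kept


rows = [
    # Hypothetical values, including the artifact path.
    BenchmarkRow("run-thread-scale", True, "work-window ns",
                 "4 vCPUs pinned to 0,1,2,3", "target/thread-scale/example/", 120.0),
    # Faster, but the verifier did not complete, so it is not a result.
    BenchmarkRow("run-thread-scale", False, "work-window ns",
                 "4 vCPUs pinned to 0,1,2,3", "target/thread-scale/example/", 80.0),
]
print([row.elapsed for row in accepted_rows(rows)])  # prints [120.0]
```

The faster 80.0 row is dropped because its verifier did not complete, which is the behavior the paragraph asks for.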

The broader benchmark model is in System Performance Benchmarks. Future parallel-pattern coverage is in HPC Parallel Processing Patterns.

Current CPU Workloads

capOS currently has two CPU-scaling workloads:

| Workload | Target | Timed region | Verifier | Primary use |
| --- | --- | --- | --- | --- |
| run-smp-process-scale | Independent worker processes | worker compute only, after setup and before result reporting | aggregate prime count and checksum | Exercises multiple process-owned rings running CPU work on more than one scheduler CPU. |
| run-thread-scale | Sibling threads in one process | checksum work window, separate from spawn/join/shutdown totals | deterministic root checksum and metadata checks | Measures same-process thread scheduling, per-thread rings, and scheduler overhead. |

Both workloads keep serial and harness artifacts under target/. The capOS rows below were collected under QEMU/KVM. The matching Linux rows use the same workload shape where possible, but units differ by harness and should not be compared directly across systems. Compare speedup ratios within a row.

Process-Scale SMP

make run-smp-process-scale boots a focused manifest, runs independent worker processes, and times the CPU-bound worker window. Each worker owns its own process ring. The timed section avoids syscalls and serial output; the coordinator verifies the aggregate result after workers finish.

The current workload counts primes over 2..3_000_000 using balanced contiguous splits. capOS reports a worker-side user-mode cycle counter shifted right by 20 bits. Linux reports guest clock_gettime nanoseconds.
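A minimal sketch of the workload shape, assuming "balanced" means contiguous ranges whose sizes differ by at most one. The function names are hypothetical, the prime test is plain trial division rather than whatever the capOS worker uses, and the limit is reduced so the sketch runs quickly.

```python
def balanced_splits(lo, hi, workers):
    """Split inclusive [lo, hi] into contiguous chunks whose sizes differ by at most one."""
    total = hi - lo + 1
    base, extra = divmod(total, workers)
    chunks = []
    start = lo
    for index in range(workers):
        size = base + (1 if index < extra else 0)
        chunks.append((start, start + size - 1))
        start += size
    return chunks


def count_primes(lo, hi):
    """Count primes in inclusive [lo, hi] by trial division (simple, not fast)."""
    count = 0
    for n in range(max(lo, 2), hi + 1):
        d = 2
        is_prime = True
        while d * d <= n:
            if n % d == 0:
                is_prime = False
                break
            d += 1
        if is_prime:
            count += 1
    return count


def scale_cycles(raw_cycles):
    """Express a raw cycle count in units of 2**20 cycles (right shift by 20 bits)."""
    return raw_cycles >> 20


LIMIT = 10_000  # small stand-in for 3_000_000
chunks = balanced_splits(2, LIMIT, 4)
per_worker = [count_primes(lo, hi) for lo, hi in chunks]
print(chunks)           # [(2, 2501), (2502, 5001), (5002, 7501), (7502, 10000)]
print(sum(per_worker))  # 1229, the aggregate the coordinator would verify
print(scale_cycles(3 * (1 << 20) + 500_000))  # 3
```

Because the shift discards the low 20 bits, one scaled cycle stands for 1,048,576 raw cycles, so scaled values are coarse and are only comparable with other capOS scaled values.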

Controlled benchmark-VM reruns were recorded on GCE n2-highcpu-8 at capOS commit 0d89a91b (2026-04-30 11:09 UTC) with nested QEMU/KVM on Ubuntu 6.17.0-1012-gcp, QEMU 8.2.2, Rust nightly 1.97.0-nightly (c935696dd 2026-04-29), and host logical CPUs 0,1,2,3 mapped to distinct physical cores with SMT siblings 4,5,6,7.

| System | smp1 median | smp2 median | smp4 median | 1-to-2 speedup | 1-to-4 speedup |
| --- | --- | --- | --- | --- | --- |
| capOS | 1,639 scaled cycles | 875 scaled cycles | 1,111 scaled cycles | 1.873x | 1.475x |
| Linux | 1,275,187,210 ns | 659,218,025 ns | 337,877,986 ns | 1.934x | 3.774x |

The capOS 4-vCPU row improved over the 1-vCPU row but was slower than the 2-vCPU row. Linux continued improving through 4 vCPUs under the same pinning and workload. Raw capOS artifacts are under target/smp-process-scale/pinned-20260430T1113Z/; raw Linux artifacts are under target/linux-smp-process-scale/pinned-20260430T1118Z/.
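The speedup columns are the smp1 median divided by the smpN median, so the unit cancels within each system. The short check below recomputes them from the recorded medians; the `speedup` helper is illustrative and is not part of the harness.

```python
def speedup(baseline, measured):
    """Ratio of the 1-vCPU median to the N-vCPU median; unitless."""
    return baseline / measured


# Medians copied from the pinned process-scale table.
capos = {1: 1_639, 2: 875, 4: 1_111}                         # scaled cycles
linux = {1: 1_275_187_210, 2: 659_218_025, 4: 337_877_986}   # nanoseconds

for name, row in (("capOS", capos), ("Linux", linux)):
    ratios = {n: round(speedup(row[1], row[n]), 3) for n in (2, 4)}
    print(name, ratios)
# capOS {2: 1.873, 4: 1.475}
# Linux {2: 1.934, 4: 3.774}
```

Dividing a capOS scaled-cycle value by a Linux nanosecond value would not produce a meaningful number, which is why only the per-system ratios are compared.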

SMT Run

The same harness can run an eight-logical-CPU case on the benchmark VM. That machine exposes four physical cores and eight SMT threads, so the smp8-smt row is an SMT measurement on a 4-core host.

The SMT run was recorded at commit 7c15dd47 (2026-04-30 11:45 UTC) with QEMU pinned to logical CPUs 0,1,2,3,4,5,6,7.

| System | smp1 median | smp2 median | smp4 median | smp8-smt median |
| --- | --- | --- | --- | --- |
| capOS | 1,500 scaled cycles | 787 scaled cycles | 1,052 scaled cycles | 1,595 scaled cycles |
| Linux | 1,274,507,854 ns | 647,611,418 ns | 337,479,795 ns | 198,903,231 ns |

| System | 1-to-2 speedup | 1-to-4 speedup | 1-to-8 speedup |
| --- | --- | --- | --- |
| capOS | 1.906x | 1.426x | 0.940x |
| Linux | 1.968x | 3.777x | 6.408x |

Raw capOS SMT artifacts are under target/smp-process-scale/smt8-20260430T1148Z/. Raw Linux SMT artifacts are under target/linux-smp-process-scale/smt8-20260430T1151Z/.
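One way to read the SMT speedups is as efficiency per logical CPU, meaning speedup divided by the CPU count. This metric is not reported by the harness; the sketch below is only an illustration built from the recorded speedups.

```python
def efficiency(speedup, logical_cpus):
    """Speedup per logical CPU; 1.0 would be ideal linear scaling."""
    return speedup / logical_cpus


# Speedups copied from the SMT-run table.
smt_run = {
    "capOS": {2: 1.906, 4: 1.426, 8: 0.940},
    "Linux": {2: 1.968, 4: 3.777, 8: 6.408},
}

for system, row in smt_run.items():
    for cpus, value in row.items():
        note = " (slower than smp1)" if value < 1.0 else ""
        print(f"{system} smp{cpus}: efficiency {efficiency(value, cpus):.2f}{note}")
```

For the eight-CPU row, the divisor counts SMT siblings that share a physical core, so an efficiency below 1.0 is expected even for a well-scaling system. A speedup below 1.0, as in the capOS smp8-smt row, means the run was slower than the single-vCPU run.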

In-Process Thread Scaling

make run-thread-scale runs sibling threads inside one process. Child threads use per-thread rings. The workload computes fixed-size checksum blocks; the default shape is a blocking parent join, 262,144 blocks (16 MiB), and work_rounds=64.

The harness records both a work-window time and a total time. The work window brackets the checksum computation. Total time includes thread startup, synchronization, shutdown, and join overhead. For scheduler analysis, both numbers matter: work speedup shows CPU placement and dispatch during the syscall-free section, while total speedup shows the cost of the surrounding thread lifecycle.
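The sketch below shows where the two timers sit relative to the thread lifecycle. It is not the capOS harness: the checksum function, the barrier-based start and finish gates, and the XOR root checksum are all assumptions chosen to keep the example small. Python threads share the interpreter lock, so this illustrates the measurement boundaries only, not scaling.

```python
import threading
import time


def checksum_block(seed, rounds):
    """Stand-in CPU work: a few rounds of a 32-bit linear congruential step."""
    value = seed
    for _ in range(rounds):
        value = (value * 1103515245 + 12345) & 0xFFFFFFFF
    return value


def run(workers, blocks, rounds):
    """Return (root checksum, work-window seconds, total seconds)."""
    results = [0] * workers
    start_gate = threading.Barrier(workers + 1)  # workers plus the parent
    done_gate = threading.Barrier(workers + 1)

    def worker(index):
        start_gate.wait()                 # wait until every thread is ready
        acc = 0
        for block in range(index, blocks, workers):
            acc ^= checksum_block(block, rounds)
        results[index] = acc
        done_gate.wait()                  # signal that the work is finished

    total_start = time.perf_counter()     # total time includes thread startup
    threads = [threading.Thread(target=worker, args=(i,)) for i in range(workers)]
    for thread in threads:
        thread.start()
    start_gate.wait()
    work_start = time.perf_counter()      # work window opens once all are ready
    done_gate.wait()
    work_end = time.perf_counter()        # work window closes before join
    for thread in threads:
        thread.join()
    total_end = time.perf_counter()       # total time includes join and shutdown

    root = 0
    for partial in results:
        root ^= partial                   # same root for any worker count
    return root, work_end - work_start, total_end - total_start


root, work_seconds, total_seconds = run(workers=4, blocks=256, rounds=64)
print(f"root={root:#010x} work={work_seconds:.6f}s total={total_seconds:.6f}s")
```

The root checksum does not depend on the worker count, so it can serve as a verifier. The total is never smaller than the work window, because the work window is nested inside it.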

The old 1 MiB workload with a spinning parent is kept as historical context only: the matching Linux pthread baseline also stayed flat at four workers, so that shape is not used as scaling evidence. The current rows use the repaired 16 MiB blocking-parent shape unless noted.

Recorded evidence:

| System / mode | Placement | Runs | 1-to-2 work | 1-to-2 total | 1-to-4 work | 1-to-4 total | Notes |
| --- | --- | --- | --- | --- | --- | --- | --- |
| Linux pthread baseline (benchmark VM, 2026-05-10 19:46 UTC) | physical-core logical CPUs 0,1,2,3 | 5 | 1.996x | 1.995x | 3.974x | 3.850x | Same checksum workload and pin set as the 2026-05-10 capOS row. |
| capOS (Phase D WFQ, benchmark VM, 2026-05-10 19:32 UTC) | physical-core logical CPUs 0,1,2,3 | 5 | 1.809x | 1.774x | 3.088x | 2.700x | Per-thread weights/latency classes, per-CPU WFQ queues, bounded steal path. |
| Linux pthread baseline (benchmark VM, 2026-05-02 21:34 UTC) | physical-core logical CPUs 0,1,2,3 | 5 | 1.988x | 1.987x | 3.963x | 3.858x | Same repaired workload before Phase D. |
| capOS (single global queue, benchmark VM, 2026-05-02 21:35 UTC) | physical-core logical CPUs 0,1,2,3 | 5 | 1.883x | 1.787x | 1.566x | 1.538x | Shows the four-worker cost of the single global runnable queue. |
| Linux pthread baseline (2026-05-01 report) | physical-core logical CPUs | 5 | 1.991x | 1.990x | 3.958x | 3.834x | Repaired-shape baseline recorded in docs/changelog.md; target artifact directory is not named in the source record. |
| capOS (pre-collapse placement, 2026-05-01 report) | physical-core logical CPUs | 5 | 1.828x | 1.687x | 3.029x | 2.386x | Commit 136b72de; per-CPU placement model later replaced by the queue-collapse cleanup; target artifact directory is not named in the source record. |
| capOS, switch logs suppressed (pre-collapse, 2026-05-01 report) | physical-core logical CPUs | 5 | 1.913x | 1.636x | 3.272x | 2.303x | Same commit and model with scheduler switch logs suppressed; target artifact directory is not named in the source record. |
| capOS (post-collapse, single global queue, 2026-05-02 10:42 UTC) | physical-core logical CPUs 0,1,2,3 on the benchmark VM | 3 | 1.890x | 1.792x | 1.504x | 1.436x | Queue-collapse row recorded in docs/backlog/scheduler-evolution.md; target artifact directory is not named in the source record. |

The 2026-05-10 Phase D WFQ row uses the same repaired shape as the 2026-05-02 pair: blocking parent join, 262,144 blocks, work_rounds=64, five runs, KVM-backed QEMU pinned to physical-core logical CPUs 0,1,2,3, and a matching Linux pthread baseline on the same pin set. Raw capOS artifacts are under target/thread-scale/20260510T193200Z/; raw Linux artifacts are under target/linux-thread-scale/20260510T194600Z/.

The 2026-05-02 capOS/Linux pair used main commit 374f8556; raw capOS artifacts are under target/thread-scale/20260502T213544Z/, and raw Linux artifacts are under target/linux-thread-scale/20260502T213445Z/.

Compared with the 2026-05-02 single-global-queue row, the Phase D WFQ row improved the four-worker work-window speedup from 1.566x to 3.088x and the four-worker total-time speedup from 1.538x to 2.700x. Linux on the same host and pin set recorded 3.974x work and 3.850x total at four workers. The remaining difference is the scheduler/runtime optimization target for later work.
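The remaining gap can be expressed as the fraction of the matching Linux speedup that each capOS row reached at four workers. This ratio is not a harness output; the sketch below only rearranges the recorded numbers, pairing each capOS row with the Linux baseline recorded alongside it.

```python
def fraction_of_baseline(capos_speedup, linux_speedup):
    """Share of the Linux four-worker speedup that the capOS row achieved."""
    return capos_speedup / linux_speedup


# Four-worker speedups copied from the recorded-evidence table:
# (capOS work, capOS total, Linux work, Linux total)
pairs = {
    "single global queue (2026-05-02)": (1.566, 1.538, 3.963, 3.858),
    "Phase D WFQ (2026-05-10)": (3.088, 2.700, 3.974, 3.850),
}

for label, (work, total, linux_work, linux_total) in pairs.items():
    print(label,
          f"work={fraction_of_baseline(work, linux_work):.1%}",
          f"total={fraction_of_baseline(total, linux_total):.1%}")
```

With these figures, the Phase D row reaches roughly 78% of the Linux work-window speedup and roughly 70% of the Linux total-time speedup. The single-global-queue row reached roughly 40% of both.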

Guest-side attribution is available with CAPOS_THREAD_SCALE_GUEST_MEASURE=1. It emits aggregate and per-phase measurements for spawn_ready, work, shutdown, and final_total, including scheduler choice, lock, timer, TLB, serial, shared-kernel-lock, network-poll, thread-placement, and sampled user-PC buckets. Host-side QEMU profiling is available with CAPOS_THREAD_SCALE_PROFILE=1.

Interpreting CPU Counts

CPU-count rows are meaningful only with a recorded topology:

  • Physical-core rows require enough physical cores for the vCPU count.
  • SMT rows should say they are SMT rows and list the logical CPU set.
  • Pinning QEMU with taskset is useful, but it is not CPU isolation by itself. Stronger runs should record isolcpus/nohz_full/rcu_nocbs, cpuset, or systemd affinity policy when used.
  • Pinning QEMU to fewer host logical CPUs than guest vCPUs measures oversubscription behavior, not core scaling.
  • Current QEMU/KVM results should stay separate from future direct cloud or bare-metal runs.

The current capOS benchmark table reaches four physical-core rows and an eight-logical-CPU SMT row on a 4-core/8-thread VM. It does not yet measure 16-core or 32-core systems.
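The labeling rules above can be sketched as a small classifier. The function name is hypothetical, and the sibling map assumes that logical CPU n pairs with CPU n+4 on the benchmark VM. The recorded topology says that CPUs 0-3 sit on distinct cores and that 4-7 are their SMT siblings, but it does not spell out the exact pairing.

```python
def classify_pin_set(pinned_cpus, guest_vcpus, smt_siblings):
    """Label a run as 'physical-core', 'smt', or 'oversubscribed'."""
    pinned = set(pinned_cpus)
    if len(pinned) < guest_vcpus:
        # Fewer host logical CPUs than guest vCPUs: not a core-scaling row.
        return "oversubscribed"
    # If any pinned CPU's sibling is also pinned, two vCPUs can share a core.
    shares_core = any(smt_siblings.get(cpu) in pinned for cpu in pinned)
    return "smt" if shares_core else "physical-core"


# Assumed pairing for the 4-core/8-thread benchmark VM.
SIBLINGS = {0: 4, 1: 5, 2: 6, 3: 7, 4: 0, 5: 1, 6: 2, 7: 3}

print(classify_pin_set([0, 1, 2, 3], 4, SIBLINGS))   # physical-core
print(classify_pin_set(range(8), 8, SIBLINGS))       # smt
print(classify_pin_set([0, 1], 4, SIBLINGS))         # oversubscribed
```

A check like this only validates the label. It says nothing about whether other host work was competing for the same CPUs, so isolation still has to be recorded separately.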

Next CPU-Scaling Work

The next CPU-scaling milestone should be designed around direct hardware or a dedicated perf runner rather than nested QEMU as the primary evidence source. The benchmark suite needs:

  • hardware discovery records for socket/core/SMT topology, APIC mode, timer source, frequency policy, memory size, and firmware/device model;
  • workload rows at 1, 2, 4, 8, 16, and 32 workers where the machine has enough physical cores, plus separately labeled SMT rows;
  • at least one static map/reduce checksum workload, one uneven dynamic-task workload, one barrier-heavy phase loop, and one IPC/service-bound workload;
  • work-window and total-time reporting for every workload;
  • matching Linux native baselines on the same hardware where a comparable workload exists;
  • scheduler/runtime counters for queue depth, migrations, steals, reschedule IPIs, TLB shootdowns, timer ticks, lock wait/hold time, blocked time, and runnable but not running time;
  • raw artifacts with source commit, toolchain, kernel config, host topology, run count, warmup policy, and verifier output.
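The worker-count requirement in the list above can be sketched as a row planner that emits physical-core rows only where the machine has enough cores and labels the rest as SMT rows. The function name and its output shape are hypothetical.

```python
def plan_rows(physical_cores, logical_cpus, worker_counts=(1, 2, 4, 8, 16, 32)):
    """Return (workers, label) pairs that the given machine can support."""
    rows = []
    for workers in worker_counts:
        if workers <= physical_cores:
            rows.append((workers, "physical-core"))
        elif workers <= logical_cpus:
            rows.append((workers, "smt"))
        # Counts beyond the logical CPUs are skipped: they would measure
        # oversubscription rather than core scaling.
    return rows


print(plan_rows(physical_cores=4, logical_cpus=8))
# [(1, 'physical-core'), (2, 'physical-core'), (4, 'physical-core'), (8, 'smt')]
print(plan_rows(physical_cores=32, logical_cpus=64))
```

For the current 4-core/8-thread benchmark VM, this yields the same row set the document already records: three physical-core rows and one SMT row.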

QEMU should remain useful for boot and regression coverage, but it should not be the primary source for a 16/32-core SMP scalability milestone.

Commands

Run the capOS process-scale workload:

make run-smp-process-scale

Run the process-scale workload with QEMU pinned to selected host CPUs:

CAPOS_SMP_SCALE_QEMU_TASKSET_CPUS=0,1 make run-smp-process-scale

Run the process-scale SMT row on a host with at least eight logical CPUs:

CAPOS_SMP_SCALE_INCLUDE_SMT=1 \
  CAPOS_SMP_SCALE_QEMU_TASKSET_CPUS=0,1,2,3,4,5,6,7 \
  make run-smp-process-scale

Run the thread-scale workload:

CAPOS_THREAD_SCALE_RUNS=5 \
  CAPOS_THREAD_SCALE_QEMU_TASKSET_CPUS=0,1,2,3 \
  make run-thread-scale

Run the larger-workload Amdahl row:

CAPOS_THREAD_SCALE_RUNS=5 \
  CAPOS_THREAD_SCALE_TOTAL_BLOCKS=1048576 \
  CAPOS_THREAD_SCALE_QEMU_TASKSET_CPUS=0,1,2,3 \
  make run-thread-scale

Run a one-sample host-side QEMU profiling pass:

CAPOS_THREAD_SCALE_PROFILE=1 \
  CAPOS_THREAD_SCALE_RUNS=1 \
  CAPOS_THREAD_SCALE_QEMU_TASKSET_CPUS=0,1,2,3 \
  make run-thread-scale

Run a one-sample guest-side measurement pass:

CAPOS_THREAD_SCALE_GUEST_MEASURE=1 \
  CAPOS_THREAD_SCALE_RUNS=1 \
  CAPOS_THREAD_SCALE_QEMU_TASKSET_CPUS=0,1,2,3 \
  make run-thread-scale

Run only the host summary parser against an existing results.csv without booting QEMU:

CAPOS_THREAD_SCALE_SUMMARY_ONLY=1 \
  CAPOS_THREAD_SCALE_SUMMARY_CSV=<results.csv> \
  CAPOS_THREAD_SCALE_SUMMARY_KVM_EVIDENCE=1 \
  CAPOS_THREAD_SCALE_QEMU_TASKSET_CPUS=0,1,2,3 \
  CAPOS_THREAD_SCALE_TOTAL_BLOCKS=262144 \
  CAPOS_THREAD_SCALE_PARENT_WAIT=join \
  CAPOS_THREAD_SCALE_WORK_ROUNDS=64 \
  tools/qemu-thread-scale-harness.sh

Run the native Linux pthread baseline for the thread-scale checksum workload:

LINUX_THREAD_SCALE_TASKSET_CPUS=0,1,2,3 \
  make run-linux-thread-scale-baseline

Run the Linux process-scale comparison:

LINUX_SMP_SCALE_KERNEL=target/linux-smp-process-scale/kernel/vmlinuz \
  tools/linux-smp-process-scale-baseline.sh

On hosts where /boot/vmlinuz is not readable by the current user, copy a kernel image into ignored target/ storage first through the host’s normal administrative path, then pass it as LINUX_SMP_SCALE_KERNEL. The script does not invoke sudo itself.