Proposal: HPC Parallel Processing Patterns

capOS should grow from focused SMP/threading speedup proofs into a correctness-gated suite of generic parallel processing patterns. The suite should cover the single-node and multi-node algorithm shapes commonly used in HPC without pretending that capOS already supports MPI, POSIX files, shared memory libraries, or cluster networking.

This proposal extends System Performance Benchmarks. It also defines the workload matrix that the future full-SMP scalability milestone tracked in Scheduler Evolution Phase F.5 should use when capOS is ready for 16/32-core evidence on top of the SMP substrate and the in-process threading contract. The old single checksum workload remains useful as one static map/reduce row, but it is too narrow to stand in for broad multicore behavior.

Design Grounding

Local grounding:

Benchmarks
System Performance Benchmarks
SMP Phase C
SMP
Ring v2 For Full SMP
Scheduler Evolution, in particular the full-SMP scalability focus that names Phase F.5 as the 16/32-core milestone this suite feeds
In-Process Threading
HPC Parallel Patterns

External grounding summarized in the research note covers Berkeley dwarfs, NAS Parallel Benchmarks, HPL/LINPACK, HPCG, Graph500, MPI collectives, and OpenMP loop/task/reduction constructs.

Current capOS Benchmark Analysis

Current CPU-scaling evidence is useful but narrow:

make run-smp-process-scale exercises independent worker processes under QEMU/KVM. Its prime-counting workload is static partition plus final verification. Current rows reach 1, 2, and 4 vCPUs, plus one 8-logical-CPU SMT row on a 4-core/8-thread host.
make run-thread-scale uses a fixed-size checksum workload with per-thread rings and guest phase counters. The strongest current row records capOS 1-to-4 work/total speedups 3.088x / 2.700x under QEMU/KVM, while the matching Linux pthread baseline on the same host and pin set records 3.974x / 3.850x.
Native Linux pthread baselines show the checksum shape can scale on the benchmark host, but also expose coordinator and oversubscription sensitivity. Larger workloads help separate Amdahl effects from thread lifecycle overhead.
Guest measurement now covers scheduler, serial, scheduler-lock, timer, TLB, and user-PC attribution, but the workload still represents only static partitioned CPU work.

So the current suite covers one pattern well: independent fixed chunks with a final checksum/reduction. It does not yet cover dynamic task scheduling, barriers, prefix/scan, all-to-all movement, stencils, sparse/dense kernels, graph frontiers, pipelines, or multi-node communication. It also does not yet produce direct-hardware 16/32-core rows, which is the bar that Scheduler Evolution Phase F.5 sets for full-SMP scalability evidence on top of the SMP bring-up substrate.

Goals

Classify parallel benchmark coverage by algorithm pattern, not by a single score or one “HPC benchmark” label.
Keep correctness and authority gates ahead of speed claims.
Provide single-node pattern kernels before multi-node transport exists.
Give scheduler, runtime, memory, IPC, storage, and networking work concrete future coverage targets.
Allow Linux/FreeBSD/MPI/OpenMP comparisons only when the semantic mapping is declared.

Non-Goals

Do not port MPI or OpenMP as a prerequisite for the first pattern kernels.
Do not run full HPL, HPCG, NAS, or Graph500 before capOS has the required runtime, memory, file/store, and network substrate.
Do not add POSIX compatibility or ambient filesystem authority only to run a familiar suite.
Do not count SMT diagnostics as core-count scaling evidence.
Do not add benchmark-only fast paths to normal kernel dispatch.

Pattern Coverage Matrix

Each pattern should have a capOS-native kernel, a result verifier, and a declared authority profile. Multi-node variants stay future until network transport and distributed-capability authority are explicit.

Pattern	Single-node kernel	Multi-node shape	Verification
Static map/reduce	split fixed-size byte/block ranges across threads or processes	scatter chunks, local compute, reduce root	deterministic root hash or numeric reduction
Dynamic task pool	variable-cost tasks in a bounded deque or queue	work requests between nodes or delegated task shards	all task ids completed once, result hash, cancellation proof
Barrier phase loop	repeated phase computation with a barrier between phases	barrier across ranks or services	phase count, no early phase observation
Prefix/scan	per-thread prefix over numeric blocks	distributed scan over rank partitions	prefix checksum and boundary carry checks
Stencil/halo	1D/2D/3D grid update with neighbor halo buffers	halo exchange between rank partitions	final grid checksum and boundary oracle
Dense tiled compute	tiled matrix multiply or small LU-like update	2D block-cyclic tile distribution	matrix checksum/residual bound
Sparse iterative compute	CSR-like sparse matrix-vector plus dot products	partitioned sparse rows with global reductions	residual/checksum and iteration count
FFT/transpose	staged local FFT-like butterflies plus matrix transpose	all-to-all transpose between ranks	output checksum against reference
Sort/partition	integer bucket partition plus local sort	sample/splitter exchange and all-to-all buckets	sortedness, permutation checksum
Graph frontier	BFS-like frontier over synthetic graph	distributed frontier exchange	parent tree/level validation
Pipeline/stream	bounded producer/stage/consumer service graph	service pipeline across nodes	ordered records, backpressure, no dropped records
Collective-only	barrier, broadcast, gather, scatter, reduce, allreduce	same operations over networked ranks	collective-specific oracle and timeout behavior

Proposed Stages

Stage 0: Keep Current CPU Rows Explainable

Keep make run-thread-scale as the fixed-size checksum workload so historical rows remain comparable. Add new pattern targets alongside it rather than changing the meaning of the existing table. The benchmark page should describe the workload, timed region, verifier, environment, and limitations directly.

Stage 1: Single-Node Pattern Kernels

Add a parallel-patterns demo crate and host harness that can run small single-node kernels under one process and multiple worker processes. Workers should be expressible both as same-process threads via the in-process threading contract and as independent processes over the SMP substrate, so the rows distinguish thread-local pick/wake costs from process/IPC boundaries. These are the first rows needed for the future full-SMP hardware profile that Scheduler Evolution Phase F.5 treats as its 16/32-core success bar:

static_reduce: successor to the checksum workload, reusable as the sanity baseline.
dynamic_pool: uneven task sizes to force runtime scheduling and fairness.
barrier_loop: repeated phases to expose barrier and wakeup overhead.
scan: prefix computation to exercise ordered fan-in/fan-out.
stencil_2d: shared-buffer or private-buffer halo copies inside one node.

Each kernel prints compact structured lines with pattern, workers, cpus, input_class, verified, work, total, and relevant counters. Host harness summaries must keep raw logs under target/parallel-patterns/. The hardware profile should run these kernels at 1, 2, 4, 8, 16, and 32 workers when the machine has enough physical cores, with SMT rows separated.

Stage 2: Memory And IPC Intensive Kernels

After MemoryObject/shared-buffer and IPC paths mature, add:

sparse_spmv: CSR-style row partition with deterministic matrix generator;
graph_bfs: synthetic graph frontier with visited-set validation;
sort_bucket: bucket partition, prefix counts, local sort, and merge verification;
pipeline_stream: bounded service stages with backpressure telemetry.

These kernels should run both thread-local and process/service forms so capOS can distinguish scheduler overhead from IPC, cap-table, shared-buffer, and service-boundary costs. The thread form follows the in-process threading contract; the process and service forms exercise the cross-CPU wake, migration, and stale-context paths that Scheduler Evolution Phase F.5 expects to harden on top of the SMP substrate.

Stage 3: Capability-Native Collectives

Introduce a small collective service or library abstraction before pretending to support MPI. The first operations are:

barrier;
broadcast;
scatter/gather;
reduce/allreduce;
scan;
all-to-all for fixed-size blocks.

Collectives are benchmark subjects and future runtime building blocks, not ambient cluster authority. A caller receives only the communicator/session cap for its benchmark group. Membership, timeout, cancellation, and stale-session behavior are part of the verifier.

Stage 4: Multi-Node Harness

After capOS has a network-capability path suitable for services, add a multi-node harness that can start N capOS guests or capOS plus Linux reference guests under a controlled topology. The first target is not full MPI; it is a capOS-native rank/session model:

rank membership is represented by explicit capabilities;
transport authority is scoped to the benchmark group;
result collection includes per-node raw logs and topology metadata;
failed, slow, or stale ranks produce controlled errors instead of hanging the harness indefinitely.

Only then should capOS attempt NAS-like, HPL-like, HPCG-like, or Graph500-like profiles with clearly labeled deviations from upstream rules.

Authority And Safety Rules

A benchmark group cap grants participation, not ambient network or process authority.
Distributed pattern kernels must separate control-plane capabilities from data-plane buffers or sockets.
Every kernel must have bounded allocation, queue, and message sizes.
Timeouts and cancellation are correctness paths, not harness afterthoughts.
Result verification must fail closed before speed summaries are accepted.
Measurement features may add counters, but normal dispatch must remain the code path being evaluated unless the result is labeled as a measure build.

Reporting Format

Pattern results should extend the existing benchmark artifact conventions:

source commit, manifest, input class, worker/rank count, CPU count, run count;
host, QEMU/KVM, pinning, SMT/core topology, and network topology;
capOS authority profile and any benchmark-only feature flags;
per-run raw logs and results.csv;
median work and total windows, plus variance;
verifier status and reason for any not_applicable or diagnostic result;
comparison-system mapping, such as OpenMP taskloop, pthreads, MPI collectives, or native Linux process/thread equivalents.

Near-Term Recommendation

Do not start with HPL, HPCG, or Graph500 ports. Start with a small capOS-native parallel-patterns harness after the current thread-scale milestone closes. The first five kernels should be static_reduce, dynamic_pool, barrier_loop, scan, and stencil_2d. That set broadens coverage from “static independent chunks” to synchronization, irregular scheduling, ordered reductions, and neighbor exchange while staying within single-node capOS mechanisms.

When networking and storage mature, extend the same pattern definitions to multi-node and data-intensive variants rather than creating a parallel, unrelated benchmark suite. Pattern adoption stays paced by the substrate it exercises: Scheduler Evolution Phase F.5 gates the 16/32-core single-node rows, the SMP proposal gates the per-CPU substrate those rows depend on, and the in-process threading contract gates the same-process worker forms each pattern kernel needs.

Keyboard shortcuts

capOS Documentation