Proposal: HPC Parallel Processing Patterns
capOS should grow from focused SMP/threading speedup proofs into a correctness-gated suite of generic parallel processing patterns. The suite should cover the single-node and multi-node algorithm shapes commonly used in HPC without pretending that capOS already supports MPI, POSIX files, shared memory libraries, or cluster networking.
This proposal extends System Performance Benchmarks. It also defines the workload matrix that the future full-SMP scalability milestone tracked in Scheduler Evolution Phase F.5 should use when capOS is ready for 16/32-core evidence on top of the SMP substrate and the in-process threading contract. The old single checksum workload remains useful as one static map/reduce row, but it is too narrow to stand in for broad multicore behavior.
Design Grounding
Local grounding:
- Benchmarks
- System Performance Benchmarks
- SMP Phase C
- SMP
- Ring v2 For Full SMP
- Scheduler Evolution, in particular the full-SMP scalability focus that names Phase F.5 as the 16/32-core milestone this suite feeds
- In-Process Threading
- HPC Parallel Patterns
External grounding summarized in the research note covers Berkeley dwarfs, NAS Parallel Benchmarks, HPL/LINPACK, HPCG, Graph500, MPI collectives, and OpenMP loop/task/reduction constructs.
Current capOS Benchmark Analysis
Current CPU-scaling evidence is useful but narrow:
make run-smp-process-scaleexercises independent worker processes under QEMU/KVM. Its prime-counting workload is static partition plus final verification. Current rows reach 1, 2, and 4 vCPUs, plus one 8-logical-CPU SMT row on a 4-core/8-thread host.make run-thread-scaleuses a fixed-size checksum workload with per-thread rings and guest phase counters. The strongest current row records capOS 1-to-4 work/total speedups3.088x/2.700xunder QEMU/KVM, while the matching Linux pthread baseline on the same host and pin set records3.974x/3.850x.- Native Linux pthread baselines show the checksum shape can scale on the benchmark host, but also expose coordinator and oversubscription sensitivity. Larger workloads help separate Amdahl effects from thread lifecycle overhead.
- Guest measurement now covers scheduler, serial, scheduler-lock, timer, TLB, and user-PC attribution, but the workload still represents only static partitioned CPU work.
So the current suite covers one pattern well: independent fixed chunks with a final checksum/reduction. It does not yet cover dynamic task scheduling, barriers, prefix/scan, all-to-all movement, stencils, sparse/dense kernels, graph frontiers, pipelines, or multi-node communication. It also does not yet produce direct-hardware 16/32-core rows, which is the bar that Scheduler Evolution Phase F.5 sets for full-SMP scalability evidence on top of the SMP bring-up substrate.
Goals
- Classify parallel benchmark coverage by algorithm pattern, not by a single score or one “HPC benchmark” label.
- Keep correctness and authority gates ahead of speed claims.
- Provide single-node pattern kernels before multi-node transport exists.
- Give scheduler, runtime, memory, IPC, storage, and networking work concrete future coverage targets.
- Allow Linux/FreeBSD/MPI/OpenMP comparisons only when the semantic mapping is declared.
Non-Goals
- Do not port MPI or OpenMP as a prerequisite for the first pattern kernels.
- Do not run full HPL, HPCG, NAS, or Graph500 before capOS has the required runtime, memory, file/store, and network substrate.
- Do not add POSIX compatibility or ambient filesystem authority only to run a familiar suite.
- Do not count SMT diagnostics as core-count scaling evidence.
- Do not add benchmark-only fast paths to normal kernel dispatch.
Pattern Coverage Matrix
Each pattern should have a capOS-native kernel, a result verifier, and a declared authority profile. Multi-node variants stay future until network transport and distributed-capability authority are explicit.
| Pattern | Single-node kernel | Multi-node shape | Verification |
|---|---|---|---|
| Static map/reduce | split fixed-size byte/block ranges across threads or processes | scatter chunks, local compute, reduce root | deterministic root hash or numeric reduction |
| Dynamic task pool | variable-cost tasks in a bounded deque or queue | work requests between nodes or delegated task shards | all task ids completed once, result hash, cancellation proof |
| Barrier phase loop | repeated phase computation with a barrier between phases | barrier across ranks or services | phase count, no early phase observation |
| Prefix/scan | per-thread prefix over numeric blocks | distributed scan over rank partitions | prefix checksum and boundary carry checks |
| Stencil/halo | 1D/2D/3D grid update with neighbor halo buffers | halo exchange between rank partitions | final grid checksum and boundary oracle |
| Dense tiled compute | tiled matrix multiply or small LU-like update | 2D block-cyclic tile distribution | matrix checksum/residual bound |
| Sparse iterative compute | CSR-like sparse matrix-vector plus dot products | partitioned sparse rows with global reductions | residual/checksum and iteration count |
| FFT/transpose | staged local FFT-like butterflies plus matrix transpose | all-to-all transpose between ranks | output checksum against reference |
| Sort/partition | integer bucket partition plus local sort | sample/splitter exchange and all-to-all buckets | sortedness, permutation checksum |
| Graph frontier | BFS-like frontier over synthetic graph | distributed frontier exchange | parent tree/level validation |
| Pipeline/stream | bounded producer/stage/consumer service graph | service pipeline across nodes | ordered records, backpressure, no dropped records |
| Collective-only | barrier, broadcast, gather, scatter, reduce, allreduce | same operations over networked ranks | collective-specific oracle and timeout behavior |
Proposed Stages
Stage 0: Keep Current CPU Rows Explainable
Keep make run-thread-scale as the fixed-size checksum workload so historical
rows remain comparable. Add new pattern targets alongside it rather than
changing the meaning of the existing table. The benchmark page should describe
the workload, timed region, verifier, environment, and limitations directly.
Stage 1: Single-Node Pattern Kernels
Add a parallel-patterns demo crate and host harness that can run small
single-node kernels under one process and multiple worker processes. Workers
should be expressible both as same-process threads via the
in-process threading contract and as
independent processes over the SMP substrate, so the rows
distinguish thread-local pick/wake costs from process/IPC boundaries. These
are the first rows needed for the future full-SMP hardware profile that
Scheduler Evolution Phase F.5
treats as its 16/32-core success bar:
static_reduce: successor to the checksum workload, reusable as the sanity baseline.dynamic_pool: uneven task sizes to force runtime scheduling and fairness.barrier_loop: repeated phases to expose barrier and wakeup overhead.scan: prefix computation to exercise ordered fan-in/fan-out.stencil_2d: shared-buffer or private-buffer halo copies inside one node.
Each kernel prints compact structured lines with pattern, workers,
cpus, input_class, verified, work, total, and relevant counters.
Host harness summaries must keep raw logs under target/parallel-patterns/.
The hardware profile should run these kernels at 1, 2, 4, 8, 16, and 32
workers when the machine has enough physical cores, with SMT rows separated.
Stage 2: Memory And IPC Intensive Kernels
After MemoryObject/shared-buffer and IPC paths mature, add:
sparse_spmv: CSR-style row partition with deterministic matrix generator;graph_bfs: synthetic graph frontier with visited-set validation;sort_bucket: bucket partition, prefix counts, local sort, and merge verification;pipeline_stream: bounded service stages with backpressure telemetry.
These kernels should run both thread-local and process/service forms so capOS can distinguish scheduler overhead from IPC, cap-table, shared-buffer, and service-boundary costs. The thread form follows the in-process threading contract; the process and service forms exercise the cross-CPU wake, migration, and stale-context paths that Scheduler Evolution Phase F.5 expects to harden on top of the SMP substrate.
Stage 3: Capability-Native Collectives
Introduce a small collective service or library abstraction before pretending to support MPI. The first operations are:
- barrier;
- broadcast;
- scatter/gather;
- reduce/allreduce;
- scan;
- all-to-all for fixed-size blocks.
Collectives are benchmark subjects and future runtime building blocks, not ambient cluster authority. A caller receives only the communicator/session cap for its benchmark group. Membership, timeout, cancellation, and stale-session behavior are part of the verifier.
Stage 4: Multi-Node Harness
After capOS has a network-capability path suitable for services, add a multi-node harness that can start N capOS guests or capOS plus Linux reference guests under a controlled topology. The first target is not full MPI; it is a capOS-native rank/session model:
- rank membership is represented by explicit capabilities;
- transport authority is scoped to the benchmark group;
- result collection includes per-node raw logs and topology metadata;
- failed, slow, or stale ranks produce controlled errors instead of hanging the harness indefinitely.
Only then should capOS attempt NAS-like, HPL-like, HPCG-like, or Graph500-like profiles with clearly labeled deviations from upstream rules.
Authority And Safety Rules
- A benchmark group cap grants participation, not ambient network or process authority.
- Distributed pattern kernels must separate control-plane capabilities from data-plane buffers or sockets.
- Every kernel must have bounded allocation, queue, and message sizes.
- Timeouts and cancellation are correctness paths, not harness afterthoughts.
- Result verification must fail closed before speed summaries are accepted.
- Measurement features may add counters, but normal dispatch must remain the code path being evaluated unless the result is labeled as a measure build.
Reporting Format
Pattern results should extend the existing benchmark artifact conventions:
- source commit, manifest, input class, worker/rank count, CPU count, run count;
- host, QEMU/KVM, pinning, SMT/core topology, and network topology;
- capOS authority profile and any benchmark-only feature flags;
- per-run raw logs and
results.csv; - median work and total windows, plus variance;
- verifier status and reason for any
not_applicableordiagnosticresult; - comparison-system mapping, such as OpenMP taskloop, pthreads, MPI collectives, or native Linux process/thread equivalents.
Near-Term Recommendation
Do not start with HPL, HPCG, or Graph500 ports. Start with a small
capOS-native parallel-patterns harness after the current thread-scale
milestone closes. The first five kernels should be static_reduce,
dynamic_pool, barrier_loop, scan, and stencil_2d. That set broadens
coverage from “static independent chunks” to synchronization, irregular
scheduling, ordered reductions, and neighbor exchange while staying within
single-node capOS mechanisms.
When networking and storage mature, extend the same pattern definitions to multi-node and data-intensive variants rather than creating a parallel, unrelated benchmark suite. Pattern adoption stays paced by the substrate it exercises: Scheduler Evolution Phase F.5 gates the 16/32-core single-node rows, the SMP proposal gates the per-CPU substrate those rows depend on, and the in-process threading contract gates the same-process worker forms each pattern kernel needs.