
Proposal: HPC Parallel Processing Patterns

capOS should grow from focused SMP/threading speedup proofs into a correctness-gated suite of generic parallel processing patterns. The suite should cover the single-node and multi-node algorithm shapes commonly used in HPC without pretending that capOS already supports MPI, POSIX files, shared memory libraries, or cluster networking.

This proposal extends System Performance Benchmarks. It does not replace the selected In-Process Threading Scalability milestone. The immediate milestone remains proving same-process sibling thread speedup on the existing fixed-size checksum workload.

Design Grounding

Local grounding:

External grounding summarized in the research note covers Berkeley dwarfs, NAS Parallel Benchmarks, HPL/LINPACK, HPCG, Graph500, MPI collectives, and OpenMP loop/task/reduction constructs.

Current capOS Benchmark Analysis

Current CPU-scaling evidence is useful but narrow:

  • make run-smp-process-scale proves independent worker processes can improve from one to two CPUs under QEMU/KVM. It is accepted for the completed Multi-Process SMP Concurrency milestone. Its prime-counting workload is static partition plus final verification, and the 4-vCPU/SMT rows remain diagnostic rather than accepted scaling claims.
  • make run-thread-scale uses a fixed-size checksum workload with per-thread rings and guest phase counters. It remains accepted=false: two-thread work is still roughly flat against one thread, total/shutdown cost grows with worker count, and SMT8 rows are informational on the 4-core/8-thread benchmark host.
  • Native Linux pthread baselines show the same checksum shape can scale in some cases, but also expose coordinator/oversubscription sensitivity. Larger workloads help separate Amdahl effects from thread lifecycle overhead.
  • Guest measurement now covers scheduler, serial, scheduler-lock, timer, TLB, and user-PC attribution, but the workload still represents only static partitioned CPU work.

So the current suite covers one pattern well: independent fixed chunks with a final checksum/reduction. It does not yet cover dynamic task scheduling, barriers, prefix/scan, all-to-all movement, stencils, sparse/dense kernels, graph frontiers, pipelines, or multi-node communication.
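The "roughly flat" two-thread result is consistent with a significant serial or lifecycle fraction, and Amdahl's law makes the ceiling concrete. A minimal sketch (the serial fractions below are illustrative, not measured capOS numbers):

```rust
// Amdahl's-law speedup for a workload with serial fraction `s`.
// Illustrative only: the `s` values are hypothetical, not capOS measurements.
fn amdahl_speedup(s: f64, workers: f64) -> f64 {
    1.0 / (s + (1.0 - s) / workers)
}

fn main() {
    // Even a modest serial fraction caps two-thread speedup well below 2x,
    // and fixed per-thread lifecycle cost pushes observed results lower still.
    for s in [0.05, 0.25, 0.5] {
        println!(
            "s={:.2} -> 2 workers: {:.2}x, 4 workers: {:.2}x",
            s,
            amdahl_speedup(s, 2.0),
            amdahl_speedup(s, 4.0)
        );
    }
}
```

This is why larger input classes matter: growing the parallel region shrinks the effective serial fraction and separates algorithmic limits from thread lifecycle overhead.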

Goals

  • Classify parallel benchmark coverage by algorithm pattern, not by a single score or one “HPC benchmark” label.
  • Keep correctness and authority gates ahead of speed claims.
  • Provide single-node pattern kernels before multi-node transport exists.
  • Give scheduler, runtime, memory, IPC, storage, and networking work concrete future coverage targets.
  • Allow Linux/FreeBSD/MPI/OpenMP comparisons only when the semantic mapping is declared.

Non-Goals

  • Do not port MPI or OpenMP as a prerequisite for the first pattern kernels.
  • Do not run full HPL, HPCG, NAS, or Graph500 before capOS has the required runtime, memory, file/store, and network substrate.
  • Do not add POSIX compatibility or ambient filesystem authority only to run a familiar suite.
  • Do not count SMT diagnostics as core-count scaling evidence.
  • Do not add benchmark-only fast paths to normal kernel dispatch.

Pattern Coverage Matrix

Each pattern should have a capOS-native kernel, a result verifier, and a declared authority profile. Multi-node variants stay future until network transport and distributed-capability authority are explicit.

| Pattern | Single-node kernel | Multi-node shape | Verification |
|---|---|---|---|
| Static map/reduce | split fixed-size byte/block ranges across threads or processes | scatter chunks, local compute, reduce at root | deterministic root hash or numeric reduction |
| Dynamic task pool | variable-cost tasks in a bounded deque or queue | work requests between nodes or delegated task shards | all task ids completed once, result hash, cancellation proof |
| Barrier phase loop | repeated phase computation with a barrier between phases | barrier across ranks or services | phase count, no early phase observation |
| Prefix/scan | per-thread prefix over numeric blocks | distributed scan over rank partitions | prefix checksum and boundary carry checks |
| Stencil/halo | 1D/2D/3D grid update with neighbor halo buffers | halo exchange between rank partitions | final grid checksum and boundary oracle |
| Dense tiled compute | tiled matrix multiply or small LU-like update | 2D block-cyclic tile distribution | matrix checksum/residual bound |
| Sparse iterative compute | CSR-like sparse matrix-vector plus dot products | partitioned sparse rows with global reductions | residual/checksum and iteration count |
| FFT/transpose | staged local FFT-like butterflies plus matrix transpose | all-to-all transpose between ranks | output checksum against reference |
| Sort/partition | integer bucket partition plus local sort | sample/splitter exchange and all-to-all buckets | sortedness, permutation checksum |
| Graph frontier | BFS-like frontier over synthetic graph | distributed frontier exchange | parent tree/level validation |
| Pipeline/stream | bounded producer/stage/consumer service graph | service pipeline across nodes | ordered records, backpressure, no dropped records |
| Collective-only | barrier, broadcast, gather, scatter, reduce, allreduce | same operations over networked ranks | collective-specific oracle and timeout behavior |
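As a concrete instance of the first row, a static map/reduce kernel with a deterministic reduction verifier can be sketched with plain scoped threads. All names here are illustrative, not existing capOS APIs:

```rust
use std::thread;

// Sketch of the static map/reduce pattern: split a fixed byte buffer into
// equal chunks, checksum each chunk on its own thread, then reduce.
fn static_reduce(data: &[u8], workers: usize) -> u64 {
    let chunk = data.len().div_ceil(workers);
    thread::scope(|s| {
        let handles: Vec<_> = data
            .chunks(chunk)
            .map(|part| s.spawn(move || part.iter().map(|&b| b as u64).sum::<u64>()))
            .collect();
        // Deterministic reduction: an order-independent sum over chunk results.
        handles.into_iter().map(|h| h.join().unwrap()).sum()
    })
}

fn main() {
    let data: Vec<u8> = (0..=255u8).cycle().take(1 << 16).collect();
    // Verifier: the parallel result must match the serial oracle exactly.
    let serial: u64 = data.iter().map(|&b| b as u64).sum();
    assert_eq!(static_reduce(&data, 4), serial);
    println!("static_reduce verified: {serial}");
}
```

The verifier runs before any timing claim, matching the "correctness gates ahead of speed claims" rule above.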

Proposed Stages

Stage 0: Preserve Current Milestone Semantics

Do not broaden make run-thread-scale while it is the selected milestone gate. The current target should remain fixed-size checksum work with work/total speedup thresholds and raw KVM evidence. Pattern expansion begins after the same-process scaling milestone closes or as separate diagnostic targets that cannot be mistaken for the milestone.

Stage 1: Single-Node Pattern Kernels

Add a parallel-patterns demo crate and host harness that can run small single-node kernels both within one process and across multiple worker processes:

  • static_reduce: successor to the checksum workload, reusable as the sanity baseline.
  • dynamic_pool: uneven task sizes to force runtime scheduling and fairness.
  • barrier_loop: repeated phases to expose barrier and wakeup overhead.
  • scan: prefix computation to exercise ordered fan-in/fan-out.
  • stencil_2d: shared-buffer or private-buffer halo copies inside one node.

Each kernel prints compact structured lines with pattern, workers, cpus, input_class, verified, work, total, and relevant counters. The host harness must keep per-run raw logs under target/parallel-patterns/ alongside its summaries.
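As one concrete shape, dynamic_pool's uneven-cost scheduling plus its "all task ids completed once" verifier can be sketched with a shared queue. Names are illustrative, and a real kernel would use the bounded deque described above:

```rust
use std::collections::HashSet;
use std::sync::Mutex;
use std::thread;

// Sketch of dynamic_pool: uneven task costs pulled from one shared queue,
// so fast workers pick up extra tasks instead of idling at a static split.
fn dynamic_pool(task_costs: &[u64], workers: usize) -> (u64, HashSet<usize>) {
    let queue = Mutex::new((0..task_costs.len()).collect::<Vec<usize>>());
    thread::scope(|s| {
        let handles: Vec<_> = (0..workers)
            .map(|_| {
                let queue = &queue;
                s.spawn(move || {
                    let mut local_sum = 0u64;
                    let mut done = HashSet::new();
                    loop {
                        // Take one task id; the lock drops at end of statement.
                        let next = queue.lock().unwrap().pop();
                        let Some(id) = next else { break };
                        local_sum += task_costs[id]; // stand-in for uneven work
                        done.insert(id);
                    }
                    (local_sum, done)
                })
            })
            .collect();
        let mut total = 0;
        let mut all_done = HashSet::new();
        for h in handles {
            let (sum, done) = h.join().unwrap();
            total += sum;
            // Verifier: each task id completes exactly once across all workers.
            assert!(done.is_disjoint(&all_done));
            all_done.extend(done);
        }
        (total, all_done)
    })
}
```

The per-worker `done` sets double as fairness telemetry: a heavily skewed split under even task costs would flag a scheduling problem.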

Stage 2: Memory And IPC Intensive Kernels

After MemoryObject/shared-buffer and IPC paths mature, add:

  • sparse_spmv: CSR-style row partition with deterministic matrix generator;
  • graph_bfs: synthetic graph frontier with visited-set validation;
  • sort_bucket: bucket partition, prefix counts, local sort, and merge verification;
  • pipeline_stream: bounded service stages with backpressure telemetry.

These kernels should run both thread-local and process/service forms so capOS can distinguish scheduler overhead from IPC, cap-table, shared-buffer, and service-boundary costs.
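The thread-local form of sparse_spmv can be sketched with a standard CSR layout and a static row partition; the Csr type and function names are hypothetical, not existing capOS interfaces:

```rust
use std::thread;

// Standard CSR (compressed sparse row) layout.
struct Csr {
    row_ptr: Vec<usize>, // row i spans row_ptr[i]..row_ptr[i + 1]
    col_idx: Vec<usize>,
    vals: Vec<f64>,
}

// Multiply one contiguous range of rows against x.
fn spmv_rows(m: &Csr, x: &[f64], rows: std::ops::Range<usize>) -> Vec<f64> {
    rows.map(|r| {
        (m.row_ptr[r]..m.row_ptr[r + 1])
            .map(|k| m.vals[k] * x[m.col_idx[k]])
            .sum()
    })
    .collect()
}

// Partition rows across scoped threads and reassemble y in row order.
fn sparse_spmv(m: &Csr, x: &[f64], workers: usize) -> Vec<f64> {
    let n = m.row_ptr.len() - 1;
    let chunk = n.div_ceil(workers);
    thread::scope(|s| {
        (0..n)
            .step_by(chunk)
            .map(|start| {
                let end = (start + chunk).min(n);
                s.spawn(move || spmv_rows(m, x, start..end))
            })
            .collect::<Vec<_>>()
            .into_iter()
            .flat_map(|h| h.join().unwrap())
            .collect()
    })
}
```

The process/service form would replace the scoped threads with worker processes over shared buffers, which is exactly the scheduler-versus-IPC cost split this stage is meant to expose.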

Stage 3: Capability-Native Collectives

Introduce a small collective service or library abstraction before pretending to support MPI. The first operations are:

  • barrier;
  • broadcast;
  • scatter/gather;
  • reduce/allreduce;
  • scan;
  • all-to-all for fixed-size blocks.

Collectives are benchmark subjects and future runtime building blocks, not ambient cluster authority. A caller receives only the communicator/session cap for its benchmark group. Membership, timeout, cancellation, and stale-session behavior are part of the verifier.
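A minimal in-process sketch of the communicator/session idea, with allreduce built from a barrier-separated contribute/read cycle. CollectiveGroup is a hypothetical stand-in for the capability-scoped session, not an existing capOS type:

```rust
use std::sync::{Barrier, Mutex};
use std::thread;

// A caller holding a &CollectiveGroup can participate in its group's
// collectives and nothing else: membership is the reference itself.
struct CollectiveGroup {
    barrier: Barrier,
    slots: Mutex<Vec<u64>>,
}

impl CollectiveGroup {
    fn new(ranks: usize) -> Self {
        Self {
            barrier: Barrier::new(ranks),
            slots: Mutex::new(vec![0; ranks]),
        }
    }

    // Allreduce as contribute -> barrier -> read: every rank sees the
    // same sum because the barrier orders all writes before any read.
    fn allreduce_sum(&self, rank: usize, value: u64) -> u64 {
        self.slots.lock().unwrap()[rank] = value;
        self.barrier.wait(); // all contributions visible past this point
        let sum = self.slots.lock().unwrap().iter().sum();
        self.barrier.wait(); // keep slots stable until everyone has read
        sum
    }
}

fn main() {
    let ranks = 4;
    let group = CollectiveGroup::new(ranks);
    let results: Vec<u64> = thread::scope(|s| {
        (0..ranks)
            .map(|r| {
                let g = &group;
                s.spawn(move || g.allreduce_sum(r, (r as u64 + 1) * 10))
            })
            .collect::<Vec<_>>()
            .into_iter()
            .map(|h| h.join().unwrap())
            .collect()
    });
    // Verifier: every rank must observe the identical reduced value.
    assert!(results.iter().all(|&x| x == 100));
}
```

The multi-node version keeps the same verifier contract but must additionally handle the membership, timeout, and stale-session cases named above.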

Stage 4: Multi-Node Harness

After capOS has a network-capability path suitable for services, add a multi-node harness that can start N capOS guests or capOS plus Linux reference guests under a controlled topology. The first target is not full MPI; it is a capOS-native rank/session model:

  • rank membership is represented by explicit capabilities;
  • transport authority is scoped to the benchmark group;
  • result collection includes per-node raw logs and topology metadata;
  • failed, slow, or stale ranks produce controlled errors instead of hanging the harness indefinitely.

Only then should capOS attempt NAS-like, HPL-like, HPCG-like, or Graph500-like profiles with clearly labeled deviations from upstream rules.

Authority And Safety Rules

  • A benchmark group cap grants participation, not ambient network or process authority.
  • Distributed pattern kernels must separate control-plane capabilities from data-plane buffers or sockets.
  • Every kernel must have bounded allocation, queue, and message sizes.
  • Timeouts and cancellation are correctness paths, not harness afterthoughts.
  • Result verification must fail closed before speed summaries are accepted.
  • Measurement features may add counters, but normal dispatch must remain the code path being evaluated unless the result is labeled as a measure build.

Reporting Format

Pattern results should extend the existing benchmark artifact conventions:

  • source commit, manifest, input class, worker/rank count, CPU count, run count;
  • host, QEMU/KVM, pinning, SMT/core topology, and network topology;
  • capOS authority profile and any benchmark-only feature flags;
  • per-run raw logs and results.csv;
  • median work and total windows, plus variance;
  • verifier status and reason for any not_applicable or diagnostic result;
  • comparison-system mapping, such as OpenMP taskloop, pthreads, MPI collectives, or native Linux process/thread equivalents.

Near-Term Recommendation

Do not start with HPL, HPCG, or Graph500 ports. Start with a small capOS-native parallel-patterns harness after the current thread-scale milestone closes. The first five kernels should be static_reduce, dynamic_pool, barrier_loop, scan, and stencil_2d. That set broadens coverage from “static independent chunks” to synchronization, irregular scheduling, ordered reductions, and neighbor exchange while staying within single-node capOS mechanisms.

When networking and storage mature, extend the same pattern definitions to multi-node and data-intensive variants rather than creating a parallel, unrelated benchmark suite.