# Proposal: HPC Parallel Processing Patterns

capOS should grow from focused SMP/threading speedup proofs into a
correctness-gated suite of generic parallel processing patterns. The suite
should cover the single-node and multi-node algorithm shapes commonly used in
HPC without pretending that capOS already supports MPI, POSIX files, shared
memory libraries, or cluster networking.

This proposal extends [System Performance Benchmarks](system-performance-benchmarks-proposal.md).
It does not replace the selected **In-Process Threading Scalability**
milestone. The immediate milestone remains proving same-process sibling thread
speedup on the existing fixed-size checksum workload.

## Design Grounding

Local grounding:

- [Benchmarks](../benchmarks.md)
- [System Performance Benchmarks](system-performance-benchmarks-proposal.md)
- [SMP Phase C](../backlog/smp-phase-c.md)
- [SMP](smp-proposal.md)
- [Ring v2 For Full SMP](ring-v2-smp-proposal.md)
- [Scheduler Evolution](scheduler-evolution-proposal.md)
- [In-Process Threading](../architecture/threading.md)
- [HPC Parallel Patterns](../research/hpc-parallel-patterns.md)

External grounding summarized in the research note covers Berkeley dwarfs,
NAS Parallel Benchmarks, HPL/LINPACK, HPCG, Graph500, MPI collectives, and
OpenMP loop/task/reduction constructs.

## Current capOS Benchmark Analysis

Current CPU-scaling evidence is useful but narrow:

- `make run-smp-process-scale` proves that independent worker processes improve
  throughput when scaling from one to two CPUs under QEMU/KVM. It is accepted
  evidence for the completed
  Multi-Process SMP Concurrency milestone. Its prime-counting workload is
  static partition plus final verification, and the 4-vCPU/SMT rows remain
  diagnostic rather than accepted scaling claims.
- `make run-thread-scale` uses a fixed-size checksum workload with per-thread
  rings and guest phase counters. It remains `accepted=false`: two-thread work
  is still roughly flat against one thread, total/shutdown cost grows with
  worker count, and SMT8 rows are informational on the 4-core/8-thread
  benchmark host.
- Native Linux pthread baselines show the same checksum shape can scale in
  some cases, but also expose coordinator/oversubscription sensitivity. Larger
  workloads help separate Amdahl effects from thread lifecycle overhead.
- Guest measurement now covers scheduler, serial, scheduler-lock, timer,
  TLB, and user-PC attribution, but the workload still represents only
  statically partitioned CPU work.

So the current suite covers one pattern well: independent fixed chunks with a
final checksum/reduction. It does not yet cover dynamic task scheduling,
barriers, prefix/scan, all-to-all movement, stencils, sparse/dense kernels,
graph frontiers, pipelines, or multi-node communication.

## Goals

- Classify parallel benchmark coverage by algorithm pattern, not by a single
  score or one "HPC benchmark" label.
- Keep correctness and authority gates ahead of speed claims.
- Provide single-node pattern kernels before multi-node transport exists.
- Give scheduler, runtime, memory, IPC, storage, and networking work concrete
  future coverage targets.
- Allow Linux/FreeBSD/MPI/OpenMP comparisons only when the semantic mapping is
  declared.

## Non-Goals

- Do not port MPI or OpenMP as a prerequisite for the first pattern kernels.
- Do not run full HPL, HPCG, NAS, or Graph500 before capOS has the required
  runtime, memory, file/store, and network substrate.
- Do not add POSIX compatibility or ambient filesystem authority only to run a
  familiar suite.
- Do not count SMT diagnostics as core-count scaling evidence.
- Do not add benchmark-only fast paths to normal kernel dispatch.

## Pattern Coverage Matrix

Each pattern should have a capOS-native kernel, a result verifier, and a
declared authority profile. Multi-node variants remain future work until
network transport and distributed-capability authority are explicit. A
single-node sketch of the first row follows the table.

| Pattern | Single-node kernel | Multi-node shape | Verification |
| --- | --- | --- | --- |
| Static map/reduce | split fixed-size byte/block ranges across threads or processes | scatter chunks, local compute, reduce root | deterministic root hash or numeric reduction |
| Dynamic task pool | variable-cost tasks in a bounded deque or queue | work requests between nodes or delegated task shards | all task ids completed once, result hash, cancellation proof |
| Barrier phase loop | repeated phase computation with a barrier between phases | barrier across ranks or services | phase count, no early phase observation |
| Prefix/scan | per-thread prefix over numeric blocks | distributed scan over rank partitions | prefix checksum and boundary carry checks |
| Stencil/halo | 1D/2D/3D grid update with neighbor halo buffers | halo exchange between rank partitions | final grid checksum and boundary oracle |
| Dense tiled compute | tiled matrix multiply or small LU-like update | 2D block-cyclic tile distribution | matrix checksum/residual bound |
| Sparse iterative compute | CSR-like sparse matrix-vector plus dot products | partitioned sparse rows with global reductions | residual/checksum and iteration count |
| FFT/transpose | staged local FFT-like butterflies plus matrix transpose | all-to-all transpose between ranks | output checksum against reference |
| Sort/partition | integer bucket partition plus local sort | sample/splitter exchange and all-to-all buckets | sortedness, permutation checksum |
| Graph frontier | BFS-like frontier over synthetic graph | distributed frontier exchange | parent tree/level validation |
| Pipeline/stream | bounded producer/stage/consumer service graph | service pipeline across nodes | ordered records, backpressure, no dropped records |
| Collective-only | barrier, broadcast, gather, scatter, reduce, allreduce | same operations over networked ranks | collective-specific oracle and timeout behavior |
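
As an illustration of the verification discipline in the first row, here is a
minimal single-node sketch of the static map/reduce pattern. The FNV-style
hash, chunking, and worker count are illustrative assumptions rather than
existing capOS code; the point is that the parallel result must match a
sequential oracle that uses the same chunking before any timing is reported.

```rust
use std::thread;

// Illustrative FNV-1a style hash over one chunk; any deterministic reduction works.
fn chunk_hash(bytes: &[u8]) -> u64 {
    bytes.iter().fold(0xcbf2_9ce4_8422_2325u64, |h, &b| {
        (h ^ b as u64).wrapping_mul(0x0000_0100_0000_01b3)
    })
}

// Parallel form: one scoped thread per fixed-size chunk, XOR-combined partials.
fn parallel_reduce(data: &[u8], chunk: usize) -> u64 {
    thread::scope(|s| {
        let handles: Vec<_> = data
            .chunks(chunk)
            .map(|c| s.spawn(move || chunk_hash(c)))
            .collect();
        handles.into_iter().map(|h| h.join().unwrap()).fold(0, |a, b| a ^ b)
    })
}

// Sequential oracle: the same chunking and reduction on one thread.
fn serial_oracle(data: &[u8], chunk: usize) -> u64 {
    data.chunks(chunk).map(chunk_hash).fold(0, |a, b| a ^ b)
}

fn main() {
    let data: Vec<u8> = (0..1_000_000u32).map(|i| (i % 251) as u8).collect();
    let workers = 4;
    let chunk = data.len().div_ceil(workers);
    let parallel = parallel_reduce(&data, chunk);
    // Fail closed: a speed summary is only meaningful if the oracle matches.
    assert_eq!(parallel, serial_oracle(&data, chunk));
    println!("pattern=static_reduce workers={workers} verified=true hash={parallel:#x}");
}
```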

## Proposed Stages

### Stage 0: Preserve Current Milestone Semantics

Do not broaden `make run-thread-scale` while it is the selected milestone gate.
The current target should remain fixed-size checksum work with work/total
speedup thresholds and raw KVM evidence. Pattern expansion begins after the
same-process scaling milestone closes or as separate diagnostic targets that
cannot be mistaken for the milestone.

### Stage 1: Single-Node Pattern Kernels

Add a `parallel-patterns` demo crate and host harness that can run small
single-node kernels both within one process and across multiple worker
processes:

- `static_reduce`: successor to the checksum workload, reusable as the sanity
  baseline.
- `dynamic_pool`: uneven task sizes to force runtime scheduling and fairness.
- `barrier_loop`: repeated phases to expose barrier and wakeup overhead.
- `scan`: prefix computation to exercise ordered fan-in/fan-out.
- `stencil_2d`: shared-buffer or private-buffer halo copies inside one node.

Each kernel prints compact structured lines with `pattern`, `workers`,
`cpus`, `input_class`, `verified`, `work`, `total`, and relevant counters.
The host harness must keep per-run raw logs under `target/parallel-patterns/`
alongside its summaries.
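
The field set is not final; assuming the key=value line style, a result record
and its emitter could look like this sketch. The struct and field names
(`PatternResult`, `work_us`, `total_us`) are placeholders for this proposal,
not an existing capOS interface.

```rust
/// Illustrative result record for one kernel run; keys mirror the list above.
struct PatternResult {
    pattern: &'static str,
    workers: u32,
    cpus: u32,
    input_class: &'static str,
    verified: bool,
    work_us: u64,  // work window only
    total_us: u64, // includes spawn/join/shutdown cost
}

impl PatternResult {
    /// One compact key=value line that the host harness can grep and parse.
    fn to_line(&self) -> String {
        format!(
            "pattern={} workers={} cpus={} input_class={} verified={} work_us={} total_us={}",
            self.pattern, self.workers, self.cpus, self.input_class,
            self.verified, self.work_us, self.total_us
        )
    }
}

fn main() {
    let r = PatternResult {
        pattern: "dynamic_pool",
        workers: 4,
        cpus: 4,
        input_class: "small",
        verified: true,
        work_us: 18_250,
        total_us: 21_400,
    };
    println!("{}", r.to_line());
}
```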

### Stage 2: Memory And IPC Intensive Kernels

After `MemoryObject`/shared-buffer and IPC paths mature, add:

- `sparse_spmv`: CSR-style row partition with deterministic matrix generator;
- `graph_bfs`: synthetic graph frontier with visited-set validation;
- `sort_bucket`: bucket partition, prefix counts, local sort, and merge
  verification;
- `pipeline_stream`: bounded service stages with backpressure telemetry.

These kernels should run both thread-local and process/service forms so capOS
can distinguish scheduler overhead from IPC, cap-table, shared-buffer, and
service-boundary costs.
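
As a sketch of the `sparse_spmv` shape, assuming nothing about the eventual
shared-buffer or service APIs: a deterministic CSR generator, a row-partitioned
matrix-vector product across scoped threads, and an exact check against a
sequential pass before any timing is reported.

```rust
use std::thread;

/// Minimal CSR storage; the generator below stands in for the deterministic
/// matrix generator, which this proposal does not fix yet.
struct Csr {
    row_ptr: Vec<usize>,
    col_idx: Vec<usize>,
    vals: Vec<f64>,
}

fn generate(n: usize) -> Csr {
    let (mut row_ptr, mut col_idx, mut vals) = (vec![0], Vec::new(), Vec::new());
    for i in 0..n {
        // Three deterministic nonzeros per row: diagonal plus two fixed offsets.
        for j in [i, (i + 1) % n, (i + 7) % n] {
            col_idx.push(j);
            vals.push(((i * 31 + j) % 17) as f64 + 1.0);
        }
        row_ptr.push(col_idx.len());
    }
    Csr { row_ptr, col_idx, vals }
}

// Multiply the given row range into `y`, which holds exactly those rows.
fn spmv_rows(m: &Csr, x: &[f64], rows: std::ops::Range<usize>, y: &mut [f64]) {
    for (out, i) in y.iter_mut().zip(rows) {
        *out = (m.row_ptr[i]..m.row_ptr[i + 1])
            .map(|k| m.vals[k] * x[m.col_idx[k]])
            .sum();
    }
}

fn main() {
    let (n, workers) = (10_000usize, 4usize);
    let (m, x) = (generate(n), vec![1.0f64; n]);
    let mut y = vec![0.0f64; n];
    let rows_per = n.div_ceil(workers);
    thread::scope(|s| {
        for (w, slice) in y.chunks_mut(rows_per).enumerate() {
            let (m, x) = (&m, &x);
            let start = w * rows_per;
            s.spawn(move || spmv_rows(m, x, start..start + slice.len(), slice));
        }
    });
    // Per-row sums are order-stable, so the oracle check is exact equality.
    let mut y_ref = vec![0.0f64; n];
    spmv_rows(&m, &x, 0..n, &mut y_ref);
    assert_eq!(y, y_ref);
    println!("pattern=sparse_spmv workers={workers} verified=true");
}
```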

### Stage 3: Capability-Native Collectives

Introduce a small collective service or library abstraction rather than
pretending to support MPI. The first operations are:

- barrier;
- broadcast;
- scatter/gather;
- reduce/allreduce;
- scan;
- all-to-all for fixed-size blocks.

Collectives are benchmark subjects and future runtime building blocks, not
ambient cluster authority. A caller receives only the communicator/session cap
for its benchmark group. Membership, timeout, cancellation, and stale-session
behavior are part of the verifier.
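
A minimal shape for that abstraction might be the trait below, assuming
placeholder names (`GroupCap`, `Collective`, the error variants) that are not
existing capOS types. The intent it illustrates: every operation is scoped by
the group capability and bounded by a deadline, and timeout, cancellation, and
stale-session outcomes are explicit results the verifier exercises, not hangs.

```rust
use std::time::Duration;

/// Placeholder for the communicator/session capability a caller receives for
/// its benchmark group; the real type would be a capOS capability handle.
pub struct GroupCap;

/// Failure modes the verifier treats as correctness paths, not harness bugs.
#[derive(Debug)]
pub enum CollectiveError {
    Timeout,
    Cancelled,
    StaleSession,
    MemberFailed { rank: u32 },
}

/// Sketch of the first collective operations; no call grants ambient authority.
pub trait Collective {
    fn barrier(&self, group: &GroupCap, deadline: Duration) -> Result<(), CollectiveError>;

    fn broadcast(
        &self,
        group: &GroupCap,
        root: u32,
        buf: &mut [u8],
        deadline: Duration,
    ) -> Result<(), CollectiveError>;

    /// Returns `Some(total)` only at the root rank.
    fn reduce_u64(
        &self,
        group: &GroupCap,
        root: u32,
        local: u64,
        deadline: Duration,
    ) -> Result<Option<u64>, CollectiveError>;

    fn allreduce_u64(
        &self,
        group: &GroupCap,
        local: u64,
        deadline: Duration,
    ) -> Result<u64, CollectiveError>;
}
```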

### Stage 4: Multi-Node Harness

After capOS has a network-capability path suitable for services, add a
multi-node harness that can start N capOS guests or capOS plus Linux reference
guests under a controlled topology. The first target is not full MPI; it is a
capOS-native rank/session model, sketched after the list below:

- rank membership is represented by explicit capabilities;
- transport authority is scoped to the benchmark group;
- result collection includes per-node raw logs and topology metadata;
- failed, slow, or stale ranks produce controlled errors instead of hanging
  the harness indefinitely.
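
To make the rank/session bullets concrete, here is a sketch of the per-rank
record the harness might collect, with status variants for the controlled-error
cases. All names are illustrative assumptions, not an existing harness format.

```rust
use std::time::Duration;

/// Illustrative per-rank status; any non-Ok variant is a controlled error.
#[derive(Debug)]
enum RankStatus {
    Ok,
    Failed { reason: String },
    Stale { last_seen: Duration },
}

/// Illustrative per-rank record collected by the multi-node harness.
#[derive(Debug)]
struct RankReport {
    rank: u32,
    node: String,    // guest identifier from the topology metadata
    raw_log: String, // path to this rank's raw log
    status: RankStatus,
}

/// The harness fails closed: a run is accepted only if every rank reports Ok;
/// anything else becomes a controlled error rather than an indefinite hang.
fn run_accepted(reports: &[RankReport]) -> bool {
    !reports.is_empty() && reports.iter().all(|r| matches!(r.status, RankStatus::Ok))
}
```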

Only then should capOS attempt NAS-like, HPL-like, HPCG-like, or Graph500-like
profiles with clearly labeled deviations from upstream rules.

## Authority And Safety Rules

- A benchmark group cap grants participation, not ambient network or process
  authority.
- Distributed pattern kernels must separate control-plane capabilities from
  data-plane buffers or sockets.
- Every kernel must have bounded allocation, queue, and message sizes.
- Timeouts and cancellation are correctness paths, not harness afterthoughts.
- Result verification must fail closed before speed summaries are accepted.
- Measurement features may add counters, but normal dispatch must remain the
  code path being evaluated unless the result is labeled as a measure build.

## Reporting Format

Pattern results should extend the existing benchmark artifact conventions,
with an illustrative `results.csv` header sketched after the list:

- source commit, manifest, input class, worker/rank count, CPU count, run count;
- host, QEMU/KVM, pinning, SMT/core topology, and network topology;
- capOS authority profile and any benchmark-only feature flags;
- per-run raw logs and `results.csv`;
- median work and total windows, plus variance;
- verifier status and reason for any `not_applicable` or `diagnostic` result;
- comparison-system mapping, such as OpenMP taskloop, pthreads, MPI collectives,
  or native Linux process/thread equivalents.
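
An illustrative `results.csv` header with one example row; the column names are
an assumption for this proposal and would follow whatever column conventions
the existing benchmark artifacts already use.

```
pattern,workers,ranks,cpus,input_class,runs,verified,work_us_median,total_us_median,status
static_reduce,4,1,4,small,5,true,18250,21400,diagnostic
```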

## Near-Term Recommendation

Do not start with HPL, HPCG, or Graph500 ports. Start with a small
capOS-native `parallel-patterns` harness after the current thread-scale
milestone closes. The first five kernels should be `static_reduce`,
`dynamic_pool`, `barrier_loop`, `scan`, and `stencil_2d`. That set broadens
coverage from "static independent chunks" to synchronization, irregular
scheduling, ordered reductions, and neighbor exchange while staying within
single-node capOS mechanisms.

When networking and storage mature, extend the same pattern definitions to
multi-node and data-intensive variants rather than creating a parallel,
unrelated benchmark suite.
