# Research: HPC Parallel Patterns

This note grounds the capOS proposal for generic parallel-processing pattern
coverage. It is not a request to port full HPC suites immediately; the point is
to classify which algorithm shapes capOS benchmarks should eventually cover, so
that future SMP, threading, runtime, storage, network, and multi-node claims do
not rest solely on embarrassingly parallel worker loops.

## Source Set

- The Berkeley "View" report argues that parallel programming systems need
  multiple styles of parallelism, and uses its taxonomy of thirteen "dwarfs"
  to describe the important computational kernels rather than relying on a
  single benchmark score:
  <https://people.eecs.berkeley.edu/~krste/papers/BerkeleyView.pdf>.
- NASA's NAS Parallel Benchmarks cover EP, IS, CG, MG, FT, BT, SP, LU, UA,
  DC, and DT across MPI, OpenMP, serial, and hybrid variants:
  <https://www.nas.nasa.gov/software/npb.html>.
- TOP500 describes LINPACK/HPL as dense linear-system evidence and warns that
  one number cannot describe overall system performance:
  <https://www.top500.org/project/linpack/>.
- Netlib's HPL page identifies HPL as a distributed-memory double-precision
  dense linear-system implementation:
  <https://netlib.sandia.gov/benchmark/hpl/>.
- HPCG complements HPL by exercising sparse matrix-vector multiplication,
  vector updates, global dot products, Gauss-Seidel smoothing, triangular
  solve, and multigrid-preconditioned conjugate gradient:
  <https://www.hpcg-benchmark.org/>.
- Graph500 covers graph construction, breadth-first search, and shortest-path
  kernels, with shared-memory, distributed-memory, and external-memory/cloud
  variants:
  <https://graph500.org/?page_id=12>.
- MPI 4.1's collective communication chapter defines the standard multi-rank
  data-movement and computation patterns: barrier, broadcast, gather, scatter,
  allgather, all-to-all, reduce/allreduce, reduce-scatter, and scan:
  <https://www.mpi-forum.org/docs/mpi-4.1/mpi41-report/node114.htm>.
- OpenMP 5.2 covers node-local loop/task/SIMD/reduction mechanisms, including
  `taskloop` partitioning loop iterations into explicit tasks and reduction
  clauses for parallel recurrence calculations:
  <https://www.openmp.org/spec-html/5.2/openmp.html>,
  <https://www.openmp.org/spec-html/5.2/openmpse74.html>, and
  <https://www.openmp.org/spec-html/5.2/openmpse27.html>.

## Consequences For capOS

The current capOS CPU-scaling benchmarks are necessary but narrow. They
exercise static worker partitioning, final result verification, and a small
amount of spawn/join or process-wait coordination. That covers one important
HPC pattern: independent tasks with a final reduction. It does not cover:

- structured grids and stencil/halo exchange;
- dense tiled matrix work;
- sparse matrix and irregular memory access;
- FFT/transposes and global all-to-all style communication;
- graph frontier expansion and high-fanout irregular queues;
- task graphs with dependency scheduling and cancellation;
- collectives as first-class operations;
- multi-node communication and authority boundaries.

The benchmark plan should therefore treat "parallel processing" as a matrix of
patterns rather than a single scaling demo. A useful capOS coverage target is:

| Pattern family | Source precedent | capOS evidence it should force |
| --- | --- | --- |
| Static map/reduce | OpenMP loop/reduction, NAS EP | low-overhead thread/process creation, result aggregation, no hot-path syscalls |
| Dynamic task graph | OpenMP tasks, Berkeley composition point | work queues, cancellation, dependency fan-in/fan-out, scheduler fairness under uneven tasks |
| Stencil and halo exchange | NAS MG/BT/SP/LU | shared buffers, neighbor exchange, barriers, cache locality, future network transport |
| Dense tiled linear algebra | HPL/LINPACK | compute locality, tile scheduling, reductions, optional SIMD/library runtime support |
| Sparse iterative solver | HPCG, NAS CG | irregular memory access, sparse matrix-vector work, global dot-product reductions |
| FFT/transpose | NAS FT | all-to-all movement, temporary buffers, memory pressure, future multi-node transpose |
| Sort/partition | NAS IS | all-to-all buckets, prefix/scan, allocator and queue pressure |
| Graph frontier | Graph500 | irregular frontier queues, atomic-like visited updates, high fanout, load imbalance |
| Collective communication | MPI collectives | barrier, broadcast, scatter/gather, reduce/allreduce, all-to-all semantics |
| Pipeline/stream | Berkeley composition point, future service graphs | bounded queues, backpressure, stage-local authority, telemetry |

The near-term capOS subset should stay CPU-only and single-node until the
selected in-process threading milestone is closed. The first expansion should
add pattern kernels that reuse existing userspace/runtime mechanisms, then let
future networking and storage milestones add multi-node and data-intensive
variants.
