SMP Phase C Backlog
ARCHIVED: milestones complete; residual full-SMP-hardware work tracked in Scheduler Evolution “Phase F.5: Full-SMP Hardware Scalability”. Both visible milestones this backlog tracks landed: Multi-Process SMP Concurrency (the make run-smp-process-scale proof is complete) and In-Process Threading Scalability (closed at commit 136b72de, 2026-05-01 14:58 UTC). No SMP track is active in docs/tasks/README.md. This file is retained as historical context and as the proof-contract reference; do not select new work from it. The next visible SMP milestone is the planning slot in scheduler-evolution.md Phase F.5.
Detailed context for the selected SMP Phase C AP scheduler-owner proof and the remaining full-concurrent-SMP and in-process thread-scaling follow-on work.
Visible Goal
Move from a single scheduler owner to multiple CPUs that can run independent scheduler-owned kernel/user work concurrently, and prove that capability-owned processes can improve wall-clock performance on a deterministic CPU-bound workload under QEMU/KVM.
This backlog tracks two distinct visible milestones:
- Multi-Process SMP Concurrency: make run-smp-process-scale should boot a focused manifest, run a deterministic SMP scaling demo across independent worker processes, print verified workload output, and report comparable 1/2/4-process timing. The proof is complete only when repeated KVM-backed -smp 1 and -smp 2 runs show near-linear speedup for the selected workload, while the ordinary manifest, ring, thread, park, and process-exit smokes still pass under -smp 2.
- In-Process Threading Scalability: make run-thread-scale should run a deterministic workload across sibling threads inside one process, verify the result, and report comparable 1/2/4-thread timing. This milestone closed at commit 136b72de (2026-05-01 14:58 UTC) against the pre-collapse per-CPU placement model: caller-aware child publication and the existing timer fast-path slices produced repeated KVM-backed physical-core evidence above the configured 1-to-2 work and total speedup thresholds. The 4-worker row remained diagnostic rather than a linear-scaling claim. The 2026-05-02 per-CPU run-queue collapse retired that placement chain (caller-aware publication, per-CPU runnable queues, local-first stealing, the WakePolicy::QueueCpu(usize) variant). A post-collapse 3-run diagnostic on capos-bench 2026-05-02 10:42 UTC measured 1-to-2 work/total 1.890x/1.792x (slight improvement) and 1-to-4 work/total 1.504x/1.436x (clear regression on single-queue scheduler-lock contention). The formal capOS+Linux accepted-evidence pair landed against the same single-global-queue scheduler on capos-bench 2026-05-02 21:38 UTC against main commit 374f8556: capOS work 1.883x / total 1.787x clear the configured 1-to-2 gates, while the 1-to-4 row (capOS 1.566x/1.538x vs Linux 3.963x/3.858x) is the diagnostic gating Phase D’s fair-share enqueue policy. Reintroducing per-CPU runnable queues with that policy must materially close the capOS-vs-Linux 1-to-4 gap before per-CPU queues land back in the scheduler. See docs/architecture/scheduling.md, docs/benchmarks.md, and docs/backlog/scheduler-evolution.md for the current state.
Full concurrent SMP scheduling remains the underlying kernel goal for the multi-process milestone. It means more than one CPU can own scheduler work simultaneously, including per-CPU runnable ownership, cross-CPU idle-to-runnable handoff, reschedule IPIs, safe current-thread tracking, and reviewed lock/residency rules. The multi-process scaling demo is the first user-visible acceptance test for that kernel capability.
Completed Gates
- Ground the multi-CPU scheduling slice in the SMP proposal, scheduler and threading docs, and relevant docs/research/ files.
- Migrate syscall entry/exit to the GS-base/swapgs per-CPU path, including non-sysretq scheduler/exit paths.
- Add LAPIC timer, EOI, and IPI support for per-CPU ticks and cross-CPU coordination. The active backend is PIT-calibrated xAPIC MMIO with PIT/PIC fallback; x2APIC remains a later backend.
- Add TLB shootdown before any user address space can run on more than one CPU over its lifetime.
- Extend scheduler state from BSP-only ownership to per-CPU current-thread tracking with AP idle/runnable handoff. The first AP scheduler proof uses one AP as scheduler owner while the BSP stays in kernel idle, preserving the process-wide ring invariant.
- Add QEMU proof that AP cpu=1 executes scheduler-owned work and the existing manifest/ring/thread/park smokes still pass under -smp 2.
In-Process Threading Closeout Rules
- Resolve the scheduler hot-lock blocker before calling the selected milestone a scalability proof. The implementation at the time had per-CPU runnable queues and dispatch state, but they remained under one global Scheduler lock. A closeout branch should either split the hot dispatch path so ordinary timer preemption, local run-queue selection, and sibling CPU-bound thread requeue do not serialize on one global lock, or explicitly narrow the milestone to “functional in-process threading” and select a follow-on scheduler-lock scalability milestone. Completed 2026-05-01 14:58 UTC after repairing the benchmark shape against Linux baseline evidence and tightening caller-aware child publication: the repaired blocking-parent 16 MiB/64-round shape scales on Linux, and controlled physical-core capOS evidence passed the enforced 1-to-2 work and total gates. Four-worker capOS scaling remained a separate follow-up because total time still showed scheduler/exit/join overhead. (Update 2026-05-02: the per-CPU runnable queues and the caller-aware child publication described here were later collapsed into a single global runnable queue with the per-CPU run-queue-collapse cleanup slice; the recorded 1-to-2 capOS gates were against that pre-collapse placement model. The current single-global-queue scheduler now has its own formal accepted 1-to-2 pair on capos-bench 2026-05-02 21:38 UTC against main commit 374f8556 (capOS work 1.883x / total 1.787x; Linux baseline 1.988x/1.987x); the 1-to-4 row remains the diagnostic gating Phase D’s fair-share enqueue policy. Per-CPU queues and caller-aware placement return when that policy ships and materially closes the capOS-vs-Linux 1-to-4 gap. See docs/architecture/scheduling.md, docs/benchmarks.md, and docs/backlog/scheduler-evolution.md for current state.)
- Add a bounded timer continuation fast path as a conservative split-prep slice. Completed 2026-05-01 10:29 UTC: a user-mode LAPIC timer tick may keep running the current non-idle thread without entering sched::schedule() only when a previous locked slow path has published a clean hard-work summary, the CPU has no pending reschedule IPI, and the per-CPU one-skip budget has not been exhausted. The 2026-05-01 11:40 UTC follow-up keeps every dirty producer forcing at least one locked timer pass, then allows remaining run queues and handoff-current markers alone to be treated as fairness/protection state for one continued tick. Direct IPC, deferred cleanup, Timer sleeps, and timed cap-enter/Park waiters still keep the hard slow-path bit set. The full scheduler path remains authoritative and still runs regularly for ring SQEs, cap-wait scans, cleanup, and accounting. This narrows timer-side scheduler-lock contention but does not by itself close the selected scalability milestone. Controlled capos-bench physical-core 0-3 before/after evidence for the initial strict-clean version stayed accepted=false: baseline target/thread-scale/timer-fastpath-baseline-main-physical-20260501T102938/ reported work speedups 0.998x and 0.998x; after-change target/thread-scale/timer-fastpath-after-physical-20260501T104700/ reported work speedups 1.001x and 0.999x. Controlled capos-bench physical-core 0-3 evidence for the fairness-only follow-up also stayed accepted=false: baseline target/thread-scale/20260501T120224Z/ recorded work speedups 1.001x and 0.999x plus total speedups 0.913x and 0.587x; after-change target/thread-scale/20260501T120709Z/ recorded work speedups 1.001x and 1.000x plus total speedups 1.125x and 0.828x.
- Add timer-fast-path attribution counters for guest-measure thread-scale runs. Completed 2026-05-01 10:58 UTC: aggregate and per-phase timer lines now report fast-path attempts, continues, and fallback reasons for slow-required/dirty summaries, skip-budget exhaustion, pending reschedule IPIs, no-current/non-idle CPUs, and inactive/invalid scheduler CPUs. These counters answer whether the bounded continuation path fires inside benchmark phases. They are benchmark-only instrumentation and do not close the current accepted=false speedup gate. Local one-run evidence in target/thread-scale/20260501T110157Z/ passed with the new fields present in every 1/2/4-thread measure.log; the timed work phase recorded fast_path_continues=0 for all three rows.
- Add timer slow-summary reason attribution for guest-measure thread-scale runs. Completed 2026-05-01 11:28 UTC: aggregate and per-phase timer_slow_summary lines now report required/clean counts and the predicate reasons that keep TIMER_SLOW_PATH_REQUIRED set after a locked timer slow path. Reason fields cover nonempty run queues, direct IPC targets, handoff-current state, pending process termination/drop/stack release, timer sleeps, and timed cap-enter versus park waiters. Local one-run evidence in target/thread-scale/20260501T112359Z/ passed; the work phase showed required=2/4/8, clean=0, run_queue_nonempty=2/4/8, handoff_current=2/4/8, and zero timer sleeps/timed waiters for the 1/2/4-thread rows. The behavior follow-up keeps the output shape but changes required to mean hard timer work, not run queues or handoff markers alone. This attribution does not close the selected accepted=false speedup gate.
- Add explicit thread-placement evidence and conservative new-child publication spreading. Completed 2026-05-01 12:37 UTC, refined 2026-05-01 13:20 UTC, and repaired 2026-05-01 14:58 UTC after the blocking-parent benchmark exposed a placement regression. Guest-measure runs now emit aggregate and per-phase thread_placement lines for publish targets, caller-current publish buckets, caller-aware avoid, fallback, and strict-load fallback counts, selected CPUs, first-selected CPUs, and migration events across CPU slots 0-3. Newly created non-single-owner threads avoid the caller’s current CPU only when another active ready scheduler CPU has a strictly lower non-idle dispatch load under the scheduler lock; on equal load, an active-ready caller CPU wins the tie instead of falling through to CPU0-biased least-loaded scanning. Single-owner processes stay pinned to CPU0. Timer, unblock, direct-IPC fallback, steal retry, and steal requeue paths keep their existing allocation-free targeting behavior. (Update 2026-05-02: the per-CPU run queues described here were later collapsed into a single global run queue, retiring the caller-aware placement and steal scans. See docs/architecture/scheduling.md and the per-CPU run-queue collapse entry in docs/backlog/scheduler-evolution.md for current state. Per-CPU queues return with the fair-share enqueue policy that Phase D will own.)

  The earlier avoid-caller rule passed the old spinning-parent 1-to-2 gate but was wrong for the repaired blocking-parent benchmark: a controlled run before the strict-load fix regressed to 1-to-2 work/total speedups `0.886x`/`0.928x` because the children were biased away from an otherwise available caller CPU. After the strict-load fix, controlled physical-core evidence passed the enforced 1-to-2 work/total gates with `1.828x`/`1.687x`. The same run recorded diagnostic 1-to-4 work/total speedups `3.029x`/`2.386x`; with scheduler switch diagnostics suppressed, those 1-to-4 diagnostics recorded `3.272x`/`2.303x`. Four-worker capOS scaling remains a follow-up, not a completed linear-scaling claim.
- Preserve correctness gates while narrowing the lock: generation-checked ThreadRef ownership, no stale runnable queue entries after process or thread exit, direct-IPC preference without bypassing ownership checks, allocation-free timer/unblock runnable publication, and clean run-smp2-smokes evidence. Completed 2026-05-01 14:58 UTC: the caller-aware publication change preserves single-owner pinning and leaves timer/unblock/requeue/direct-IPC targeting unchanged; ordinary -smp 2 regression coverage passed.
- Rerun controlled physical-core evidence after any scheduler hot-lock change. The milestone should stay open until host-summary work and total gates pass, or until the milestone scope is intentionally changed and recorded in docs/tasks/README.md, docs/roadmap.md, and this backlog. Completed 2026-05-01 14:58 UTC after benchmark repair: the matching Linux baseline validated the repaired blocking-parent 16 MiB/64-round shape on the selected physical CPU set with 1-to-2 work/total speedups 1.991x/1.990x and 1-to-4 work/total speedups 3.958x/3.834x. Controlled capOS evidence passed the enforced 1-to-2 work/total gates with 1.828x/1.687x.
- Track post-closeout 4-worker scalability caveats separately from the recorded 1-to-2 milestone. The repaired benchmark proved the configured 1-to-2 work and total thresholds only against the pre-2026-05-02 per-CPU placement model. Linux now scales under the same repaired shape, so the remaining 4-worker capOS gap was not a benchmark-shape excuse. The strongest evidence at that time was: unsuppressed capOS 1-to-4 work/total speedups 3.029x/2.386x, scheduler-switch-log-suppressed diagnostics 3.272x/2.303x, and guest-measure runs that showed global Scheduler lock wait/hold cycles plus exit/join/block/schedule overhead while shared kernel locks were not visibly contended. Treat those numbers as historical; superseded by the formal capos-bench 2026-05-02 21:38 UTC pair against main commit 374f8556 (capOS work 1.883x / total 1.787x clears the configured 1-to-2 gates; 1-to-4 capOS 1.566x/1.538x vs Linux 3.963x/3.858x remains the diagnostic that gates Phase D’s fair-share enqueue policy). Future four-core scaling claims should add an explicit 1-to-4 gate, keep placement evidence enabled, separate work-window from total-time attribution, and continue splitting hot scheduler metadata/lock paths.
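The bounded timer continuation rule in the closeout list above reads as a three-condition predicate. The following is a minimal Python model of that decision, not the kernel code. The `CpuTimerState` fields and both function names are hypothetical, and the assumption that the locked slow path refills the one-skip budget is an illustration rather than a documented behavior.

```python
from dataclasses import dataclass


@dataclass
class CpuTimerState:
    """Hypothetical per-CPU view consulted on a user-mode timer tick."""
    slow_path_required: bool   # hard-work summary is dirty
    resched_ipi_pending: bool  # a reschedule IPI targets this CPU
    skips_left: int            # remaining one-skip budget


def timer_tick_may_continue(cpu: CpuTimerState) -> bool:
    """Return True when the tick may skip the locked scheduler slow path."""
    if cpu.slow_path_required or cpu.resched_ipi_pending:
        return False
    if cpu.skips_left <= 0:
        return False
    cpu.skips_left -= 1  # spend the budget for this continued tick
    return True


def locked_slow_path(cpu: CpuTimerState, hard_work_pending: bool) -> None:
    """Model the slow path publishing a fresh summary (assumed to refill the budget)."""
    cpu.slow_path_required = hard_work_pending
    cpu.skips_left = 1
```

In this model a CPU can continue for at most one tick between locked passes, which matches the intent that the full scheduler path stays authoritative and still runs regularly.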
Multi-Process SMP Concurrency Gates
- Split the current one-owner scheduler latch into per-CPU scheduler run queues or equivalent ownership that can keep more than one CPU executing scheduler-owned work at the same time. Completed in commit 20f6894 (2026-04-30 05:30 UTC) with per-CPU scheduler ownership, current and handoff tracking, per-CPU idle/fallback cleanup slots, and temporary BSP pinning for endpoint-, launcher-, spawner-, and thread-authority holders so process-wide ring paths remain single-owner during this milestone.
- Add reschedule IPIs for idle-to-runnable handoff across scheduler owners. The current scheduler tree tracks pending reschedule IPIs per target CPU, wakes halted scheduler-owner loops for newly runnable work, and uses the same serialized fixed LAPIC IPI send path as TLB shootdown without claiming a general preemptive reschedule interrupt.
- Prove concurrent scheduler-owned work on more than one CPU with independent worker processes first. This avoids process-wide capability ring races while still proving real multi-core execution. The focused proof harness is on mainline as of commit c2790c0 (2026-04-30 07:38 UTC), and the completed milestone is recorded at commit 3fb89923 (2026-04-30 09:45 UTC).
- Add an SMP scaling demo binary and focused manifest. The first workload is segmented prime counting over generated ranges. It partitions work statically by worker index, avoids hot-path syscalls and serial output, produces aggregate prime-count/checksum verification, and prints one compact result line per accepted case.
- Add a host harness for make run-smp-process-scale that runs the same workload under -smp 1, -smp 2, and optionally -smp 4, captures raw logs, and reports worker count, CPU count, ticks or cycles, output checksum, and speedup. A single noisy QEMU run is not enough evidence for a scaling claim; keep raw repeated-run artifacts for review. tools/qemu-smp-process-scale-harness.sh builds/uses capos-smp-process-scale.iso, stores serial logs under target/smp-process-scale/<timestamp>/, defaults to five repetitions, reports per-case medians, and enforces the 1.6x 1-to-2 median threshold only when KVM-backed evidence is available.
- Treat near-linear 1-to-2 CPU speedup as the first publishable target. Use a threshold high enough to reject accidental concurrency illusions but low enough for QEMU/KVM variance, for example at least 1.6x median speedup over repeated runs. Record the exact threshold in the harness when this milestone is selected for implementation.
make run-smp-process-scale Proof Contract
This target is the acceptance test for Multi-Process SMP Concurrency. It must stay narrower than the later in-process threading milestone: one process ring per worker process, no sibling threads in the timed section, no shared ParkSpace words, no IPC throughput loop, and no completion-ring demux claim.
The first implementation should add:
- a focused system-smp-process-scale.cue manifest;
- a coordinator binary that receives the manifest-granted ProcessSpawner, spawns a fixed set of worker process cases, waits for each child, verifies aggregate results, and prints the compact result lines;
- a worker binary or a small family of worker binaries that execute one static partition of the deterministic workload and report only their final result through a parent endpoint or other existing spawn-result path after the timed section finishes;
- a tools/qemu-smp-process-scale-harness.sh host harness wired to make run-smp-process-scale.
The workload should be segmented prime counting over generated integer ranges.
Each run case divides the same total range into contiguous segments, one per worker.
Worker i handles segment i without terminal output, IPC calls, heap-heavy
allocation, or capability operations in the timed region. The coordinator
collects one post-compute result per worker and verifies the aggregate prime
count plus a stable checksum or hash against known constants before it accepts
timing evidence.
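The partition-and-verify flow described above can be sketched on the host in Python. This is an illustrative model only: the function names are hypothetical, the range is assumed to be half-open, the checksum is assumed to be a wrapping 64-bit sum of the primes found, and the split is by equal size rather than the cost-balanced contiguous split the accepted evidence used.

```python
MASK64 = 0xFFFFFFFFFFFFFFFF


def is_prime(n: int) -> bool:
    """Trial division; adequate for a small illustrative range."""
    if n < 2:
        return False
    if n % 2 == 0:
        return n == 2
    d = 3
    while d * d <= n:
        if n % d == 0:
            return False
        d += 2
    return True


def split_range(lo: int, hi: int, workers: int) -> list:
    """Split [lo, hi) into contiguous segments whose sizes differ by at most one."""
    base, extra = divmod(hi - lo, workers)
    segments, start = [], lo
    for i in range(workers):
        size = base + (1 if i < extra else 0)
        segments.append((start, start + size))
        start += size
    return segments


def run_worker(segment: tuple) -> tuple:
    """Worker i: count primes in its segment and fold them into a checksum."""
    count, checksum = 0, 0
    for n in range(*segment):
        if is_prime(n):
            count += 1
            checksum = (checksum + n) & MASK64
    return count, checksum


def run_case(lo: int, hi: int, workers: int) -> tuple:
    """Coordinator: collect one result per worker and aggregate."""
    results = [run_worker(s) for s in split_range(lo, hi, workers)]
    count = sum(c for c, _ in results)
    checksum = 0
    for _, part in results:
        checksum = (checksum + part) & MASK64
    return count, checksum
```

Because the aggregate is a sum, every worker count must reproduce the same count and checksum for a given range, which is the property the coordinator checks against known constants before accepting timing evidence.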
The guest must print one line per accepted run case in this shape:
[smp-process-scale] cpus=<n> workers=<n> range=<lo>..<hi> primes=<count> checksum=<hex> elapsed=<ticks-or-cycles> verified=true
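A host-side parser for this line shape might look like the sketch below. It is not the harness's actual parser. It assumes decimal integers for every field except the checksum, accepts the checksum with or without a `0x` prefix, and rejects any line that is not marked `verified=true`. The function name is hypothetical.

```python
import re
from typing import Optional

LINE_RE = re.compile(
    r"^\[smp-process-scale\] cpus=(?P<cpus>\d+) workers=(?P<workers>\d+) "
    r"range=(?P<lo>\d+)\.\.(?P<hi>\d+) primes=(?P<primes>\d+) "
    r"checksum=(?P<checksum>(?:0x)?[0-9a-fA-F]+) elapsed=(?P<elapsed>\d+) "
    r"verified=true$"
)


def parse_result_line(line: str) -> Optional[dict]:
    """Return the parsed fields, or None when the line is not an accepted result."""
    match = LINE_RE.match(line.strip())
    if match is None:
        return None
    fields = match.groupdict()
    return {
        "cpus": int(fields["cpus"]),
        "workers": int(fields["workers"]),
        "range": (int(fields["lo"]), int(fields["hi"])),
        "primes": int(fields["primes"]),
        "checksum": int(fields["checksum"], 16),  # base 16 tolerates a 0x prefix
        "elapsed": int(fields["elapsed"]),
    }
```

Returning `None` for anything that does not match keeps the caller in a fail-closed position: a missing or unverified line yields no timing sample.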
The exact time source can be monotonic ticks or a cycle counter, but it must be an in-guest measurement that brackets only the worker-process computation after spawn/setup and before serial reporting. If timer granularity makes the proof too noisy, increase the total range instead of measuring host wall time as the primary signal. Host wall time may be reported as secondary harness metadata.
The host harness policy is:
- default to CAPOS_SMP_SCALE_RUNS=5 complete repetitions per CPU-count case;
- run and report the advertised 1/2/4-worker timing cases. At minimum that means -smp 1/one worker, -smp 2/two workers, and a 4-worker timing case; the preferred 4-worker case is -smp 4 when the local QEMU/KVM host exposes four usable vCPUs, otherwise the harness must still report the 4-worker case under the largest available SMP count and mark why a 4-vCPU run was not collected;
- require KVM for a speedup claim. If /dev/kvm or QEMU KVM acceleration is unavailable, the target may run a functional verification mode, but it must report that publishable speedup evidence was not collected;
- keep raw serial and terminal logs under a stable target/ subdirectory such as target/smp-process-scale/<timestamp>/;
- summarize the median verified elapsed value for each case and require at least 1.6x median speedup from the -smp 1/one-worker baseline to the -smp 2/two-worker case before accepting the near-linear 1-to-2 speedup claim;
- rerun the ordinary manifest, ring, thread, park, and process-exit smokes under -smp 2 before marking the selected milestone complete.
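The median and threshold part of this policy can be sketched as follows. This is a hedged Python model, not the shell harness itself; the function names, the `(accepted, reason)` return shape, and the reason strings are hypothetical. Only the 1.6x threshold, the five-run default, and the KVM requirement come from the policy above.

```python
from statistics import median


def one_to_two_speedup(smp1_elapsed: list, smp2_elapsed: list) -> float:
    """Median -smp 1 elapsed divided by median -smp 2 elapsed."""
    return median(smp1_elapsed) / median(smp2_elapsed)


def accept_speedup(smp1_elapsed, smp2_elapsed, kvm_available,
                   threshold=1.6, min_runs=5):
    """Return (accepted, reason); fail closed when evidence is missing."""
    if not kvm_available:
        return False, "functional mode only: no KVM-backed evidence"
    if len(smp1_elapsed) < min_runs or len(smp2_elapsed) < min_runs:
        return False, "not enough verified repetitions"
    speedup = one_to_two_speedup(smp1_elapsed, smp2_elapsed)
    if speedup < threshold:
        return False, f"median speedup {speedup:.3f}x below {threshold}x"
    return True, f"median speedup {speedup:.3f}x"
```

Comparing medians rather than means keeps a single noisy QEMU run from deciding the outcome in either direction.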
As of commit 3fb89923 (2026-04-30 09:45 UTC), the focused manifest,
process-scale demo, and
host-side harness wiring produce passing default repeated KVM-backed speedup
evidence. The accepted run in
target/smp-process-scale/cycle-balanced-default/ recorded medians
smp1=1693, smp2=1053, smp4=2314, a 1.608x 1-to-2 median speedup that satisfies the required
1.6x threshold. The worker-reported elapsed value is a scaled user-mode cycle
count, and the static worker ranges are contiguous but cost-balanced for the
prime-counting loop. The ordinary -smp 2 smoke gate also passed:
target/smp2-smokes/run-smoke.log covers the default manifest smoke, and
target/smp2-smokes/run-spawn.log covers endpoint roundtrip, ring-reserved
opcodes, timer/runtime children, thread lifecycle, park cleanup, generic child
waits, and process exit. The Multi-Process SMP Concurrency milestone is
complete. The harness fails closed when the focused manifest, ISO, expected
compact proof lines, or speedup evidence are unavailable instead of fabricating
timing evidence.
tools/linux-smp-process-scale-baseline.sh is the reference-OS comparison for
this proof. It builds a tiny static Linux initramfs that runs the same forked,
deterministic prime-counting workload under the same QEMU/KVM CPU and memory
envelope, records raw logs under target/linux-smp-process-scale/, and uses
the same default five-run median policy. The script defaults now match capOS’
balanced contiguous splits; rerun the Linux comparison before publishing a new
OS-comparison table for the accepted capOS evidence.
The process-scale harnesses also expose an opt-in smp8-smt diagnostic through
CAPOS_SMP_SCALE_INCLUDE_SMT=1 and LINUX_SMP_SCALE_INCLUDE_SMT=1. It uses
the same range and aggregate verifier with eight contiguous ranges and is
collected only when the host reports at least eight logical CPUs. This case is
for SMT behavior on 4-core/8-thread hosts; it must not be treated as 8-core
evidence or included in the accepted 1-to-2 speedup gate.
The proof must not depend on KVM paravirtual APIC, IPI, or TLB-flush features. The current architectural xAPIC MMIO LAPIC timer/IPI path remains the correctness surface; paravirtual APIC acceleration is future performance work.
Before the scheduler implementation branch claims this target, review the non-blocked findings that could invalidate the evidence:
- panic-surface hardening for guarded unwraps, stale queues, blocking waits, process/thread exit, endpoint cancellation, and rollback restoration paths touched by scheduler ownership changes;
- quota/exhaustion behavior for the child-process, process-handle, outstanding call, scratch, frame, and invalid-SQE paths used by the coordinator and workers;
- release/revoke epoch behavior only for capabilities the demo actually grants.
Findings unrelated to this proof, such as DMA provenance, shared ParkSpace unmap/reuse, or same-process per-thread ring routing, should stay tracked in the migrated review-finding task records but must not be represented as blockers for independent worker-process SMP scaling.
SMP Review-Finding Reconciliation
This section classifies the review-finding task records for the selected multi-process SMP proof. It does not close those findings; it defines what the next scheduler and harness branches must satisfy before they can depend on the paths involved in the proof.
Blocking or proof-invalidating for this milestone:
- Scheduler panic surfaces touched by ownership changes. A branch that changes scheduler ownership, per-CPU queues, idle-to-runnable handoff, or process/thread exit cleanup must audit and either harden or explicitly test the relevant docs/panic-surface-inventory.md scheduler rows: block_current_on_cap_enter, capos_block_current_syscall, stale run-queue process references, exit_current, current_ring_and_caps, scheduler start, and context-restore CR3 assumptions. The branch should add targeted host or QEMU coverage for each panic surface it claims to close.
- Process/resource exhaustion on paths used by the coordinator. The proof depends on ProcessSpawner, ProcessHandle.wait, result-cap adoption, and likely a parent endpoint or equivalent post-compute result path. Those paths must keep controlled failures for cap-slot exhaustion, process-handle exhaustion, endpoint queue pressure, scratch/result-buffer pressure, outstanding call pressure, and frame-grant/frame-exhaustion pressure from loading worker ELF pages, stacks, and TLS. Existing endpoint pending-RECV and queued-CALL overload coverage can be reused, but new coordinator-specific resource pressure introduced by the demo needs matching coverage before the proof is used as milestone evidence.
- Runtime invalid-SQE flood handling if the harness exercises malformed submissions. The process-scaling demo should not need malformed SQEs. If a future scheduler or harness branch adds invalid-submission stress to this target, it inherits whatever invalid-submission review-finding task records remain open at that time. Runtime flood handling and log/rate-limit suppression should be evaluated separately because active remediation may close one without closing the other. Otherwise invalid-submission remediation remains a separate track and should not block the pure scaling proof.
Guardrails that must be preserved but are not standalone blockers for the independent worker-process proof:
- Explicit revoke/epoch tests. The demo should use only the capabilities needed to spawn workers and collect their final results. It must not claim peer revocation, stale session rejection, or object-epoch behavior unless it grants revocable/session-sensitive authority and adds flow-specific revoke or expiry tests.
- ParkSpace unmap/reuse enforcement. Independent worker processes should avoid shared ParkSpace words in the timed workload. The ordinary park smoke still has to pass under -smp 2 before milestone completion.
- Process-wide capability ring constraint. The proof remains valid only because each worker has its own process ring and the timed section avoids ring traffic. It must not be cited as evidence for same-process sibling thread scalability, per-thread completion routing, or Ring v2.
- Raw evidence retention. Local repeated KVM logs are enough for this development milestone, but production/reproducibility claims remain governed by the provenance finding. Keep raw target/smp-process-scale/<timestamp>/ artifacts for review and avoid implying third-party reproducibility.
Out of scope for this milestone unless a branch expands the demo surface:
- DMA owner state, generation-checked DMA/MMIO/IRQ handles, stale interrupt proofs, and DMA ResourceLedger/OOM implementation;
- shared ParkSpace unmap/reuse beyond preserving existing park smokes;
- same-process thread creation, join, TLS, per-thread rings, and Ring v2 completion routing.
In-Process Threading Scalability Gates
- Define the per-thread capability-ring/completion-routing contract needed before same-process sibling threads can claim independent scaling. Completed 2026-04-30 10:19 UTC in docs/proposals/ring-v2-smp-proposal.md: the first Ring v2 slice uses kernel-chosen child-thread ring mappings, a shared RingEndpoint record for initial and child rings, and ThreadRef -> RingEndpoint as the routing model.
- Move capability-ring waiting/completion routing to the per-thread ThreadRef model before claiming same-process sibling threads scale independently on different CPUs. Endpoint, timer, park, process-wait, thread-join, deferred-cancel, and direct IPC completion paths must all route through the target thread’s RingEndpoint before same-process scaling can be claimed. Completed through the Ring v2/thread-scale substrate: spawned child threads receive independent ring endpoints, and local/controlled thread-scale evidence verifies child rings.
- Ensure thread creation, FS/TLS setup, thread exit, join, park waits, and process exit remain generation-checked and safe when sibling threads can be resident on different CPUs. Completed through the reviewed thread-scale implementation and the closeout run-smp2-smokes pass.
- Add an in-process thread scaling demo that uses the same class of deterministic CPU-bound workload as the multi-process proof, but splits work across sibling threads in one process. Prefer fixed-size parallel hashing/checksum chunks over prime counting for this milestone: equal-byte chunks have much more uniform work than trial division over increasing integer ranges, still keep the timed region syscall-free, and verify through one deterministic root hash. Print one compact result line per run. Completed with the demos/thread-scale proof and reusable demos/thread-scale-workload crate.
- Add a host harness for
make run-thread-scale that runs 1/2/4-thread cases under matching QEMU CPU counts, captures raw logs, and rejects results until the verified median speedup reaches the accepted threshold. Completed 2026-05-01 14:58 UTC after benchmark repair: the harness enforces KVM-backed 1-to-2 work and total thresholds when requested, carries parent_wait and work_rounds through CSV metadata, and the repaired blocking-parent 16 MiB/64-round run passed both enforced physical-core gates.

  2026-04-30 12:34 UTC functional checkpoint: this branch adds the same-process demo and QEMU harness as diagnostic evidence only. The harness retains raw serial logs under target/thread-scale/<timestamp>/, parses exactly one verified [thread-scale] line per 1/2/4-thread case, and reports median elapsed values plus diagnostic speedups. Focused phase diagnostics now add guest cycle fields for spawn_ready, work, shutdown, and total to separate thread creation/ready time, the syscall-free workload window, and thread exit/join time. elapsed remains the workload value and is an alias of work, so harness speedup calculations continue to use only the timed workload. The retained artifacts are raw QEMU serial/terminal/stdout/stderr logs plus results.csv and summary.log.

  Host-side QEMU profiling is opt-in through CAPOS_THREAD_SCALE_PROFILE=1; it requires perf and stores perf.data, perf.script, perf.report.txt, and profile-command.txt plus qemu.status in each case-run artifact directory. These are host samples of the QEMU process and the preserved workload exit status, not guest symbol attribution by themselves, so the guest phase counters remain the default diagnostic. Guest-side kernel measurement is separately opt-in through CAPOS_THREAD_SCALE_GUEST_MEASURE=1; it rebuilds the thread-scale ISO with the benchmark-only kernel measure feature and retains release symbols for that benchmark build only.
It writes the kernel measure: segment summaries from each case-run serial log to that case-run’s measure.log and records the per-case userspace symbol map path in results.csv under guest_symbol_map. It also writes a user-pc-symbols.log report beside each measure.log and records that path under user_pc_symbol_report; the report maps aggregate and per-phase user_pc_samples exact-RIP buckets to the nearest userspace symbol address not greater than the PC. Those segment counters cover scheduler choice, schedule save/requeue, timer and park wake paths, cap-wait scans, thread exit/join cleanup, and process exit/drop cleanup.

  First-slice shared-kernel contention counters now add aggregate and per-phase shared_kernel_lock lines for frame allocator alloc/free lock acquisitions, contention, and spin loops, plus the ring-dispatch cap-table and ring-scratch locks before cap::ring::process_ring. Follow-up counters also cover endpoint inner queue locks, endpoint cancellation scratch locks, and all direct per-process address-space lock sites. Heap attribution now routes the global allocator mutex through SharedKernelLock::Heap in measure builds; one-run guest-measure evidence recorded zero timed-work-phase heap acquisitions for the syscall-free benchmark and nonzero spawn/shutdown allocator activity. These remain benchmark-only measure attribution and do not close the broader shared-service contention finding.

  Fresh result rows now explicitly classify the benchmark hot section as syscall-free CPU work with ring and allocator activity limited to setup/shutdown, no endpoint or network activity, and result-only logging. The harness requires those benchmark-class fields for new QEMU parses, validates the expected values for this benchmark, carries them into results.csv, and keeps summary-only replay tolerant of legacy CSV files that predate the class columns. Local one-run evidence is retained in target/thread-scale/20260501T083254Z/.
Network/polling attribution now adds aggregate and per-phase measure: network_poll lines for initialized virtio-net scheduler, runtime, and interface polling; the built-in TCP HTTP proof poll; virtqueue poll spins and completions; and pending network waiter scans. The guest-measure harness requires those lines. Local one-run evidence in target/thread-scale/20260501T093505Z/ passed and retained zero aggregate and per-phase network/poll counters for the 1/2/4-thread rows. The default thread-scale manifest has no virtio-net device, and the scheduler poll entry returns before the driver mutex in that no-device case. For this CPU-bound benchmark they are zero-evidence guardrails, not service-throughput proof and not milestone acceptance.
The symbol map and resolved report are benchmark-only nearest-symbol attribution aids for interpreting raw user_pc_samples buckets, not line-level profiling, a complete guest profiler, or normal-build guest attribution. These diagnostics are for reviewers, not speedup acceptance.
The guest result line deliberately prints accepted=false as diagnostic guest-side state. Host acceptance is a separate summary decision: CAPOS_THREAD_SCALE_REQUIRE_SPEEDUP=1 requires KVM-backed evidence and the configured 1-to-2 median work/elapsed speedup threshold, but it does not fail merely because parsed guest rows carry accepted=false. The total-case summary gate is separate and opt-in: CAPOS_THREAD_SCALE_REQUIRE_TOTAL_SPEEDUP=1 requires KVM-backed evidence and the configured CAPOS_THREAD_SCALE_TOTAL_SPEEDUP_THRESHOLD against the 1-to-2 median total speedup. It is also supported by summary-only replay and is not enforced by default.
capos-bench diagnostic run target/thread-scale/capos-bench-thread-20260430T125613Z/ used n2-highcpu-8 KVM with QEMU pinned to physical CPUs 0-3 for five runs per case.
Median elapsed cycles were thread1 56244112, thread2 84429072, and thread4 140666438; diagnostic speedups were thread1-to-thread2 0.666x and thread1-to-thread4 0.400x, with all rows still accepted=false.
After phase diagnostics landed, capos-bench run target/thread-scale/capos-bench-phase-20260430T134301Z/ used the same pinned physical CPU set and recorded five-run medians: thread1 elapsed/work=56285136, spawn_ready=43054612, shutdown=57693626, total=157008630; thread2 84432724, 76247932, 142200058, 303096216; thread4 140768008, 205527230, 395434364, 741943554. The phase output shows shutdown/join cost increasing sharply with worker count, but all rows still remain accepted=false.
After child-ring endpoints and the optional SMT8 diagnostic landed, capos-bench physical-core run target/thread-scale/20260430T151909Z/ recorded five-run medians pinned to logical CPUs 0-3: thread1 elapsed/work=56215128, spawn_ready=41692656, shutdown=57753172, total=155536564; thread2 84420848, 74791942, 142065130, 301170274; thread4 140697028, 143691606, 395397620, 679786606. Final SMT diagnostic run target/thread-scale/capos-bench-final-smt8-20260430T154058Z/ at commit 19f2fc66 used logical CPUs 0-7 and recorded medians: thread1 56272620, 54277322, 57824172, 168448508; thread2 84343990, 72757730, 142229724, 299693446; thread4 140992614, 144614212, 396264522, 681167764; thread8 253352976, 290422132, 1239856304, 1786188514. All rows remain accepted=false, and thread8 is informational SMT evidence only.
Scheduler-unpin final diagnostic run target/thread-scale/scheduler-unpin-final2-20260430T160700Z/ removed the scheduler’s transient same-pid pinning and verified 1/2/4-thread cases without the child-ring map/unmap TLB shootdown panics seen during this slice. One-run medians were thread1 elapsed/work=56293734, spawn_ready=39202342, shutdown=34848540, total=130344694; thread2 57101752, 95921604, 69869786, 222894030; thread4 274828354, 275826356, 407818252, 958473044.
Diagnostic speedups were thread1-to-thread2 0.986x and thread1-to-thread4 0.205x; all rows remain accepted=false. Follow-up local checks passed make run-smp2-smokes in target/smp2-smokes/20260430T160936Z/ and reran three thread-scale samples in target/thread-scale/scheduler-unpin-rerun-20260430T161104Z/. That rerun kept correctness intact but recorded thread4 902520658 cycles under local oversubscription, so it remains diagnostic only.
After guest-side measurement landed, capos-bench runs at commit a5c4f789 recorded five-run medians with QEMU pinned to host logical CPUs 0-3, which map to distinct physical cores on that host: thread1 56341030, thread2 56166300, thread4 70122044 (1.003x, 0.803x). The SMT diagnostic pinned to logical CPUs 0-7 recorded medians thread1 56315082, thread2 56233080, thread4 62630052, thread8 125488946 (1.001x, 0.899x, 0.449x). The one-run guest-measure pass in target/thread-scale/20260430T182824Z/ recorded per-case measure.log files. Top measured guest-side cycle totals were ring_processing and method_body, with sched_choose_next and thread_exit_join_cleanup growing at higher thread counts. A follow-up local phase-aware guest-measure pass in target/thread-scale/20260430T184532Z/ verified that each case measure.log now includes final-summary measure: checkpoint and measure: phase attribution for spawn_ready, work, shutdown, and final_total; the harness rejects guest-measure runs missing any of those phase summaries. These runs remain diagnostic and accepted=false.
After phase-aware guest measurement landed on main at commit da92ed42, capos-bench reran the diagnostic with QEMU pinned to host logical CPUs 0-3, which map to distinct physical cores on that host. Run target/thread-scale/capos-bench-phase-main-20260430T191146Z/ recorded five-run medians: thread1 elapsed/work=56242252, spawn_ready=38789562, shutdown=34859130, total=130093430; thread2 56233998, 91718518, 61923280, 205126974; thread4 62926552, 109723566, 119015960, 297970796.
SMT diagnostic run target/thread-scale/capos-bench-phase-smt8-main-20260430T191408Z/ pinned QEMU to logical CPUs 0-7 and recorded medians: thread1 56198166, 41134070, 34781494, 132161420; thread2 56196302, 42453050, 63546086, 162449504; thread4 62361512, 87093620, 109458814, 258043804; thread8 125378372, 249877254, 528656458, 904149404. A one-run host-profile plus guest-measure sample in target/thread-scale/capos-bench-profile-phase-main-20260430T191703Z/ used temporary host perf access with QEMU pinned to logical CPUs 0-3, then restored kernel.perf_event_paranoid=4. The host reports still show QEMU/KVM execution, ioctl, QEMU mutexes, and MMIO/read helpers near the top; guest phase counters show no ring dispatches in the measured work phase, while shutdown/join and scheduler choice costs grow with worker count. These results remain diagnostic and accepted=false. Artifact content verification after collection checked summary.log and results.csv for the two five-run diagnostics and the one-run profile sample, plus the profile sample’s measure.log and perf.report.txt, against the recorded medians, pinning, accepted=false status, guest phase claims, and host-profile claims.
Join-cleanup optimization follow-up on branch workplan/thread-scale-join-cleanup adds per-thread pending join-waiter accounting so exiting worker threads that never blocked in ThreadHandle.join skip the thread-handle waiter scan. Local evidence: target/thread-scale/join-cleanup-local-20260430T193657Z/ passed functional guest-measure verification, and target/thread-scale-join-cleanup-run-spawn.log passed make run-spawn; local timing remains diagnostic because the host was not a controlled benchmark environment.
Controlled capos-bench reruns for this branch kept all rows accepted=false: physical-core run target/thread-scale/capos-bench-join-cleanup-20260430T194536Z/ recorded medians thread1 56173118, thread2 56166224, thread4 62070170 (1.000x, 0.905x), and SMT diagnostic target/thread-scale/capos-bench-join-cleanup-smt8-20260430T194734Z/ recorded medians thread1 56251116, thread2 56197306, thread4 62519276, thread8 122089762 (1.001x, 0.900x, 0.461x).
Scheduler-choice cleanup follow-up on branch workplan/thread-scale-scheduler-choice removes a redundant blocked-thread scan from the idle fallback in choose_next_locked. Local functional evidence: target/thread-scale/scheduler-choice-local-20260430T200257Z/ passed guest-measure verification. Controlled capos-bench run target/thread-scale/capos-bench-scheduler-choice-20260430T201041Z/ recorded medians thread1 56171526, thread2 56301462, thread4 62433702 (0.998x, 0.900x), so the cleanup does not close the milestone.
The immediate review-finding note that the scheduler still had a two-CPU owner mask is addressed by raising the temporary scheduler-owned CPU slot count and wake mask to four, so the 4-thread diagnostic can exercise four scheduler owners. This is only a blocker-removal step. The open attribution, serial/logging, scheduler-lock counter, workload-baseline, and per-CPU run-queue findings in the migrated review-finding task records remain required before accepting a speedup claim. Initial local build gates passed. The first make run-smp2-smokes attempt in target/smp2-smokes/four-scheduler-cpus-20260430T202129Z/ exposed an early boot failure after the enlarged static scheduler value crossed a fragile initialization path. The implementation now uses a capacity-reserved deferred process-drop queue instead of embedding one Process slot per scheduler CPU in the Scheduler static. Bounded run-spawn smoke evidence passed in target/smp2-smokes/four-scheduler-cpus-spawn-pending-vec-20260430T203055Z/.
Full make run-smp2-smokes passed in target/smp2-smokes/four-scheduler-cpus-full-20260430T203214Z/. Local thread-scale guest-measure verification passed in target/thread-scale/four-scheduler-cpus-local-20260430T203356Z/ with CAPOS_THREAD_SCALE_RUNS=1, QEMU pinned to local CPUs 0-1, and cases through -smp 4; local timing remains noisy and is not controlled speedup evidence. Controlled capos-bench runs then verified the effect on the benchmark host. Physical-core run target/thread-scale/capos-bench-four-scheduler-cpus-20260430T203733Z/ used QEMU pinned to logical CPUs 0-3, recorded medians thread1 56144884, thread2 56190496, thread4 36386164 (0.999x, 1.543x), and kernel logs show AP scheduler owners on CPUs 1-3 starting benchmark threads. SMT diagnostic target/thread-scale/capos-bench-four-scheduler-cpus-smt8-20260430T203945Z/ used logical CPUs 0-7, recorded medians thread1 56181720, thread2 56191504, thread4 56213928, thread8 116270280 (1.000x, 0.999x, 0.483x). Both rows remain accepted=false; the physical 4-thread speedup is close to but below the 1.6x threshold, and the SMT8 row is informational because the scheduler owner mask remains four CPUs.
Scheduler-attribution follow-up branch workplan/thread-scale-scheduler-attribution adds guest-side total and per-phase scheduler counters for direct-target, run-queue, and idle candidate classes; runnable/retry/drop outcomes; and reschedule IPI target/sent/skipped/failure counts. Local functional verification in target/thread-scale/scheduler-attribution-local-20260430T210322Z/ passed all 1/2/4-thread cases with CAPOS_THREAD_SCALE_GUEST_MEASURE=1, CAPOS_THREAD_SCALE_RUNS=1, and QEMU pinned to local CPUs 0-1; the shell wrapper reported failure only because it reused zsh’s read-only status parameter after the harness had already written a successful summary.log. The 4-thread work phase now records scheduler retry pressure (55 run-queue candidate checks, 7 idle candidate checks, 28 runnable outcomes, and 34 retry outcomes) while still recording zero ring dispatches.
This materially improves attribution but does not close the broader scheduler-lock, serial, CR3/TLB, guest-symbol, or workload-baseline requirements in the migrated review-finding task records.
Serial-attribution follow-up adds guest-side total and per-phase serial byte counters to CAPOS_THREAD_SCALE_GUEST_MEASURE=1. Bytes are counted after LF-to-CRLF expansion and after a UART byte is emitted, including emergency writes in measure kernels. Local functional verification in target/thread-scale/serial-attribution-local-20260430T212243Z/ passed all 1/2/4-thread cases with CAPOS_THREAD_SCALE_RUNS=1 and QEMU pinned to local CPUs 0-1; the stricter harness now requires aggregate and per-phase serial lines. The run recorded total serial bytes of 4161, 4788, and 6295; work-phase serial bytes stayed at 74 in each case, while shutdown serial bytes rose from 70 to 145 to 631. This closes the serial-byte counter blind spot, but it does not close scheduler-lock, CR3/TLB, guest-symbol, workload-baseline, or logging-suppression A/B requirements in the migrated review-finding task records.
Scheduler-lock attribution follow-up adds guest-side total and per-phase global scheduler-lock counters to CAPOS_THREAD_SCALE_GUEST_MEASURE=1. It records acquisitions, contended acquisitions, try-lock failures as spin_loops, contended wait cycles, and hold cycles. Local functional verification in target/thread-scale/lock-attribution-local-20260430T214854Z/ passed all 1/2/4-thread cases with CAPOS_THREAD_SCALE_RUNS=1 and QEMU pinned to local CPUs 0-1; the stricter harness now requires aggregate and per-phase scheduler-lock lines. The local 4-thread final-total counters were 234 acquisitions, 104 contended acquisitions, 2,161,691 spin loops, 1,239,033,542 wait cycles, and 570,372,812 hold cycles; the 4-thread work phase still had 15 acquisitions, 5 contended acquisitions, 95,047 spin loops, 37,181,792 wait cycles, and 32,762,392 hold cycles.
This closes the first scheduler-lock counter blind spot; hold cycles include measure acquisition-counter update overhead and exclude release-counter update and unlock overhead, so they are first-pass attribution rather than exact critical-section time. At that point, CR3/TLB, guest-symbol, workload-baseline, logging-suppression A/B, and controlled benchmark-host confirmation requirements in migrated review-finding task records remained open; timer tick count attribution was queued for the follow-up recorded below.
Controlled capos-bench reruns after this landed on main at commit 6eff7ae4 used QEMU pinned to logical CPUs 0-3 for physical-core evidence and 0-7 for the informational SMT diagnostic. Physical-core run target/thread-scale/capos-bench-lock-main-physical-20260430T220944Z/ recorded medians thread1 56309194, thread2 56302666, thread4 28301916 (1.000x, 1.990x); SMT diagnostic target/thread-scale/capos-bench-lock-main-smt8-20260430T221246Z/ recorded medians thread1 56379514, thread2 56186566, thread4 28259776, thread8 131264324 (1.003x, 1.995x, 0.430x). A one-run guest-measure confirmation in target/thread-scale/capos-bench-lock-main-measure-20260430T221543Z/ verified scheduler, serial, and scheduler-lock lines on the benchmark host. Host perf profiling was not collected because perf_event_paranoid=4 blocked unprivileged perf on the restarted VM.
Timer-attribution follow-up on branch workplan/thread-scale-timer-attribution adds guest-side total and per-phase timer counters to CAPOS_THREAD_SCALE_GUEST_MEASURE=1, distinguishing user-mode timer interrupts entering the scheduler path from kernel-mode timer interrupts that only advance time and EOI, with separate BSP tick-advance counts. The harness now requires aggregate and per-phase timer lines. Local functional verification in target/thread-scale/timer-attribution-local-20260430T223441Z/ passed all 1/2/4-thread cases with CAPOS_THREAD_SCALE_RUNS=1, QEMU pinned to local CPUs 0-1, and guest measurement enabled.
Aggregate timer counters were 7/7/0/7, 25/17/8/9, and 132/101/31/23 (interrupts/user_scheduler/kernel_only/bsp_tick_advances); the 4-thread work phase recorded 7/7/0/1. The remaining attribution requirements at that point were CR3/TLB, guest-symbol or guest-PC sampling, workload-baseline, and logging-suppression A/B evidence.
CR3/TLB-attribution follow-up on branch workplan/thread-scale-tlb-attribution adds guest-side total and per-phase TLB counters to CAPOS_THREAD_SCALE_GUEST_MEASURE=1, covering runtime CR3 writes, pending-flush checks, pending full TLB flushes, remote shootdown requests, target CPUs, shootdown IPIs, and deferred completion drains. The harness now requires aggregate and per-phase TLB lines. Local functional verification in target/thread-scale/tlb-attribution-local-20260430T225628Z/ passed all 1/2/4-thread cases with CAPOS_THREAD_SCALE_RUNS=1, QEMU pinned to local CPUs 0-1, and guest measurement enabled. Aggregate TLB counters were 3/28/0/0/0/0/0, 7/52/3/3/3/3/2, and 14/139/17/7/17/17/4 (cr3_writes/pending_flush_checks/pending_flush_all/shootdown_requests/shootdown_target_cpus/shootdown_ipis/deferred_completion_drains); the 4-thread work phase recorded 0/10/0/0/0/0/0. The remaining attribution requirements at that point were guest-symbol or guest-PC sampling, workload-baseline evidence, and logging-suppression A/B evidence.
Logging-suppression A/B follow-up adds CAPOS_THREAD_SCALE_SUPPRESS_SWITCH_LOGS=1 to make run-thread-scale. The knob suppresses scheduler transition diagnostics in the benchmark kernel while preserving proof, error, and measurement output. Local one-run A/B verification with CAPOS_THREAD_SCALE_GUEST_MEASURE=1, CAPOS_THREAD_SCALE_RUNS=1, and QEMU pinned to local CPUs 0-1 produced artifacts in target/thread-scale/logging-ab-baseline-local-20260430T231800Z/ and target/thread-scale/logging-ab-suppressed-local-20260430T232600Z/.
Targeted scheduler diagnostic line counts dropped from 7/12/18 to 0/0/0 for the 1/2/4-thread cases, and aggregate serial bytes dropped from 4161/4743/5889 to 3894/4280/5047. This closes only the logging A/B blind spot; guest-symbol or guest-PC sampling and workload/cacheline baseline evidence remained open.
Linux pthread baseline follow-up adds make run-linux-thread-scale-baseline for the exact fixed-size thread-scale checksum workload. Controlled native capos-bench runs at commit 370ce145 with taskset pinned to physical-core logical CPUs 0-3 recorded padded-slot capOS-shaped work-window medians of 306776, 152293, and 1120024 ns for 1/2/4 workers (2.014x, 0.274x). Compact-slot medians were similar at 316388, 152291, and 1123534 ns (2.078x, 0.282x), so result-slot false sharing is not the visible differentiator for the current workload shape. The SMT diagnostic pinned to 0-7 recorded padded work medians 303877, 155565, 170019, and 243481 ns for 1/2/4/8 workers (1.953x, 1.787x, 1.248x). The exact baseline shows the one-megabyte workload and coordinator spin window are not a clean four-core linear-scaling reference. This closes the exact Linux pthread baseline and result-slot padding blind spots only; guest-symbol or guest-PC sampling and larger-workload/Amdahl-sensitivity evidence remain open.
Benchmark repair follow-up completed 2026-05-01 14:58 UTC: the default host baselines now use blocking parent join, 262,144 blocks (16 MiB), and work_rounds=64 instead of the old 1 MiB/spinning-parent shape. Controlled Linux evidence on the selected physical CPU set recorded 1-to-2 work/total speedups 1.991x/1.990x and 1-to-4 work/total speedups 3.958x/3.834x, proving the repaired benchmark shape can scale on the host before capOS results are interpreted as scheduler evidence.
Guest-PC sampling follow-up adds a measure-only exact-RIP histogram for user-mode timer interrupts while a thread-scale case is active.
The harness now requires aggregate and per-phase user_pc_samples lines for CAPOS_THREAD_SCALE_GUEST_MEASURE=1. Local one-run verification in target/thread-scale/guest-pc-sampling-local-20260501T001500Z/ used CAPOS_THREAD_SCALE_RUNS=1 with QEMU pinned to local CPUs 0-1 and passed all 1/2/4-thread cases. Aggregate PC sample counts were 6, 17, and 55 with zero overflow; the 4-thread phase counts were spawn-ready 13, work 9, shutdown 33, and final-total 55. This closes the guest-PC sampling blind spot only; the later symbol-map harness slice preserves a benchmark-only userspace map for interpreting those raw PC buckets, and larger-workload Amdahl-sensitivity evidence remained open until the follow-up below.
Resolved PC attribution report follow-up completed 2026-05-01 06:13 UTC on branch workplan/thread-scale-pc-symbol-report: guest-measure case-runs now write user-pc-symbols.log beside measure.log and record it in results.csv under user_pc_symbol_report. Local verification in target/thread-scale/20260501T060822Z/ used CAPOS_THREAD_SCALE_GUEST_MEASURE=1, CAPOS_THREAD_SCALE_RUNS=1, and QEMU pinned to local CPUs 0-1; the thread4 report resolves sampled PCs to worker_entry, run_case, and RingClient::wait nearest symbols and keeps PCs below the first symbol as explicit <unmapped> rows.
Larger-workload/Amdahl follow-up adds CAPOS_THREAD_SCALE_TOTAL_BLOCKS and LINUX_THREAD_SCALE_TOTAL_BLOCKS so the same deterministic checksum workload can run beyond the default one-megabyte case. Controlled capos-bench runs at commit 32c066b8 used 1,048,576 blocks (64 MiB). With QEMU pinned to physical-core logical CPUs 0-3, capOS work medians were 112590712, 112511206, and 36369098 cycles for 1/2/4 workers (1.001x, 3.096x), while total medians were 189204910, 218898002, and 205640850 cycles (0.864x, 0.920x). The matching native Linux physical-core baseline recorded work medians 17766664, 8961256, and 7442107 ns (1.983x, 2.387x) and total medians 17883289, 9094596, and 10090354 ns (1.966x, 1.772x).
SMT diagnostic rows pinned to 0-7 recorded capOS 1/2/4/8-worker work speedups of 1.002x, 2.870x, and 0.644x and Linux speedups of 1.993x, 2.458x, and 2.658x. Raw artifacts are under target/thread-scale/amdahl-1048576-physical-20260501T003700Z/, target/thread-scale/amdahl-1048576-smt8-20260501T004200Z/, target/linux-thread-scale/amdahl-1048576-physical-20260501T003400Z/, and target/linux-thread-scale/amdahl-1048576-smt8-20260501T004000Z/. This closes the larger-workload evidence blind spot, but the milestone remains open because 1-to-2 work scaling is flat and total-case scaling remains below 1x for 2/4 workers. The guest rows still carry diagnostic accepted=false; host-summary acceptance remains gated by KVM evidence and the configured 1-to-2 median work and opt-in total thresholds. Guest-measure runs now preserve the benchmark-only userspace symbol map needed to interpret raw PC buckets after collection.
Post-threshold-policy capos-bench reruns at main commit f198b099 verified the host-summary total-speedup fields while keeping the milestone open. Physical-core pinning 0-3 recorded work speedups 1.002x and 1.002x plus total speedups 0.911x and 0.601x for 2/4 workers in target/thread-scale/total-threshold-main-physical-20260501T065028Z/. SMT diagnostic pinning 0-7 recorded 1/2/4/8 work speedups 1.001x, 0.998x, and 0.333x plus total speedups 0.913x, 0.621x, and 0.200x in target/thread-scale/total-threshold-main-smt8-20260501T065443Z/.
Scheduler-lock site attribution follow-up completed 2026-05-01 09:52 UTC: guest-measure kernels keep the existing aggregate measure: scheduler_lock line and add aggregate plus per-phase measure: scheduler_lock_site counters for generic, timer pre-ring, timer select, blocking, process exit, thread exit, start/idle selection, wake/unblock, and metadata classes. The harness requires those lines for CAPOS_THREAD_SCALE_GUEST_MEASURE=1.
Local one-run evidence in target/thread-scale/20260501T100202Z/ verified the new lines and still reported accepted=false with 1-to-2/1-to-4 work speedups 0.998x and 1.001x and total speedups 0.921x and 0.509x. This is bounded split-prep attribution for the known global scheduler-lock bottleneck, not speedup evidence; the later caller-aware placement closeout above is the controlled evidence that passed the work and total gates.
- Record aggregate same-process worker placement for
make run-thread-scale and fix creation-time local concentration. Completed 2026-05-01 12:37 UTC: guest-measure output recorded aggregate publish, selected-CPU, first-selected CPU, and migration buckets for CPU slots 0-3. Newly created non-single-owner threads were published to the least-loaded active scheduler CPU slot, while single-owner capability pinning, generation checks, direct-IPC preference, and allocation-free timer/unblock paths were preserved. This aggregate evidence proved the 4-worker first-selected distribution reached all four scheduler CPU slots, but it was not per-worker identity tracking and it was not speedup evidence. (Update 2026-05-02: the publish counters and the caller-aware placement chain were retired with the per-CPU run-queue collapse; make run-thread-scale and the kernel measure printer no longer emit the publish__cpu / publish_caller_* fields. Selected-CPU, first-selected CPU, and migration buckets remain. Per-CPU placement evidence returns with the fair-share enqueue policy that Phase D will own.)
- If later attribution needs individual worker histories, add per-worker placement output for first scheduled CPU, latest scheduled CPU, migration count, and runnable-owner distribution without replacing the aggregate counters used by the thread-scale harness.
- Treat same-process speedup as a separate claim from multi-process SMP concurrency. Passing make run-smp-process-scale must not imply this milestone is complete. Completed: same-process speedup was accepted only after make run-thread-scale controlled evidence on the thread-scale harness, separate from the earlier process-scale milestone.
- Keep the ordinary -smp 2 regression gate repeatable while the thread-scaling implementation evolves. The make run-smp2-smokes target runs the default manifest smoke and the spawn manifest smoke with -smp 2, retaining raw per-target logs under the configured target directory. Closeout evidence passed.
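The host-summary decision described in the harness item above (per-case medians, a 1-to-2 work speedup gate, and a separate opt-in total gate, both requiring KVM-backed evidence) can be illustrated with a minimal Rust sketch. This is not the harness code: the `Gate` struct, the function names, the even-sample median rule, and any threshold values passed in are assumptions made for illustration only.

```rust
/// Median of the per-run samples for one case; `None` when no run was parsed.
/// Assumption: even-sized sample sets average the two middle samples.
fn median(samples: &[u64]) -> Option<f64> {
    if samples.is_empty() {
        return None;
    }
    let mut sorted = samples.to_vec();
    sorted.sort_unstable();
    let mid = sorted.len() / 2;
    if sorted.len() % 2 == 1 {
        Some(sorted[mid] as f64)
    } else {
        Some((sorted[mid - 1] as f64 + sorted[mid] as f64) / 2.0)
    }
}

/// Speedup of an N-worker case relative to the 1-worker case.
fn speedup(one_worker_median: f64, n_worker_median: f64) -> f64 {
    one_worker_median / n_worker_median
}

/// Hypothetical mirror of the opt-in host gates.
struct Gate {
    kvm_backed: bool,
    require_work: bool,  // models CAPOS_THREAD_SCALE_REQUIRE_SPEEDUP=1
    require_total: bool, // models CAPOS_THREAD_SCALE_REQUIRE_TOTAL_SPEEDUP=1
    work_threshold: f64,
    total_threshold: f64,
}

/// Host acceptance ignores the guest's diagnostic `accepted=false` field and
/// looks only at KVM backing plus the requested 1-to-2 median speedups.
fn host_accepts(gate: &Gate, work_1_to_2: f64, total_1_to_2: f64) -> bool {
    let any_required = gate.require_work || gate.require_total;
    if any_required && !gate.kvm_backed {
        return false;
    }
    if gate.require_work && work_1_to_2 < gate.work_threshold {
        return false;
    }
    if gate.require_total && total_1_to_2 < gate.total_threshold {
        return false;
    }
    true
}
```

As a consistency check against the recorded evidence, `speedup(56244112.0, 84429072.0)` evaluates to roughly 0.666, which matches the thread1-to-thread2 diagnostic value quoted for the 2026-04-30 `capos-bench` run.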
Task Selection
Choose a task that isolates scheduler and CPU parallelism rather than a subsystem bottleneck. Both milestones should use workload shapes with these properties:
- CPU-bound and deterministic, with no network, disk, terminal, or heap-heavy hot path.
- Naturally partitionable into independent chunks so workers do not share a lock, mutable buffer, or capability ring while the timed section runs.
- Verifiable by a compact checksum, count, or known-answer oracle.
- Long enough to dominate boot, process spawn, timer granularity, and serial logging overhead.
- Runnable as independent worker processes for the multi-process milestone, and runnable as sibling threads through the per-thread completion-routing model used by the in-process milestone.
Avoid using IPC throughput, capability-ring dispatch, park wake storms, console logging, or allocator stress as the first SMP scaling claim. Those are valid later benchmarks, but they measure shared kernel bottlenecks as much as CPU scheduling. Same-process thread scaling remains a separate milestone because it needs accepted per-thread-ring timing evidence, not only functional sibling execution.
For the in-process milestone, the default workload should be a uniform fixed-size chunk workload such as BLAKE3-style tree hashing, CRC32C over disjoint buffers, or a small native deterministic block-hash loop. The first implementation does not need a cryptographic dependency; it does need fixed-size chunks, per-thread private output slots, and a root checksum that detects missing, duplicated, or reordered chunks. Prime counting remains valid historical evidence for multi-process concurrency, but it is a weaker same-process scaling workload because numeric range cost is not uniform.
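A minimal host-side Rust sketch of that workload shape follows: fixed-size chunks, one private output slot per chunk written by exactly one worker, and a chained root checksum whose value changes if a chunk digest is missing, duplicated, or reordered. It uses `std::thread::scope` rather than capOS thread capabilities, and the mixer (the splitmix64 finalizer), `CHUNK_WORDS`, and all function names are illustrative assumptions, not the benchmark's actual code.

```rust
use std::thread;

/// Hypothetical fixed chunk size, in 64-bit words.
const CHUNK_WORDS: u64 = 1024;

/// splitmix64 finalizer, used here only as a cheap deterministic mixer.
fn mix(mut x: u64) -> u64 {
    x ^= x >> 30;
    x = x.wrapping_mul(0xbf58_476d_1ce4_e5b9);
    x ^= x >> 27;
    x = x.wrapping_mul(0x94d0_49bb_1331_11eb);
    x ^ (x >> 31)
}

/// Uniform-cost digest of one chunk; the cost depends only on `rounds`.
fn chunk_digest(index: u64, rounds: u32) -> u64 {
    let mut acc = mix(index.wrapping_add(1));
    for _ in 0..rounds {
        for word in 0..CHUNK_WORDS {
            acc = mix(acc ^ word);
        }
    }
    acc
}

/// Chained fold seeded with the chunk count, so the result depends on both
/// the number of digests and their order.
fn root_checksum(digests: &[u64]) -> u64 {
    digests
        .iter()
        .fold(digests.len() as u64, |root, &d| mix(root ^ d))
}

/// Runs the workload on `workers` threads; each worker owns a disjoint slice
/// of output slots and shares no lock or mutable buffer while it runs.
fn run_case(workers: usize, total_chunks: usize, rounds: u32) -> u64 {
    assert!(workers > 0 && total_chunks > 0 && total_chunks % workers == 0);
    let per_worker = total_chunks / workers;
    let mut digests = vec![0u64; total_chunks];
    thread::scope(|s| {
        for (w, slots) in digests.chunks_mut(per_worker).enumerate() {
            s.spawn(move || {
                for (j, slot) in slots.iter_mut().enumerate() {
                    *slot = chunk_digest((w * per_worker + j) as u64, rounds);
                }
            });
        }
    });
    root_checksum(&digests)
}
```

Because every chunk digest depends only on its index, `run_case(1, n, r)`, `run_case(2, n, r)`, and `run_case(4, n, r)` must return the same root, which gives the known-answer oracle for comparing 1/2/4-worker rows.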
Grounding Files
- docs/proposals/smp-proposal.md
- docs/proposals/ring-v2-smp-proposal.md
- docs/architecture/scheduling.md
- docs/architecture/threading.md
- docs/research/completion-ring-threading.md
- docs/research/out-of-kernel-scheduling.md
- docs/research/sel4.md
- docs/research/zircon.md
- docs/research/x2apic-and-virtualization.md
Notes
Initial multi-CPU scheduling may keep the current process ring while the
runtime serializes process-ring consumption. Full SMP where sibling threads
from one process wait independently on different CPUs should not keep the
process-wide CQ as the kernel ABI endpoint. The target transport model is
per-thread capability rings: cap_enter(min_complete, timeout_ns) waits on the
current thread’s CQ, kernel waiters route completions by generation-checked
ThreadRef, and SQPOLL becomes a per-ring mode with one kernel SQ consumer.
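The routing rule in the paragraph above, completions delivered to a per-thread CQ only when a generation-checked ThreadRef still names the live thread, can be sketched as a small user-space model. This is an illustration under assumptions, not the kernel's data structures: the `ThreadTable`, `ThreadSlot`, and `RouteError` types, their fields, and the `u64` completion payload are all hypothetical.

```rust
use std::collections::VecDeque;

/// Hypothetical generation-checked thread reference.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
struct ThreadRef {
    slot: usize,
    generation: u32,
}

#[derive(Debug, PartialEq, Eq)]
enum RouteError {
    UnknownSlot,
    StaleGeneration,
}

struct ThreadSlot {
    generation: u32,
    live: bool,
    cq: VecDeque<u64>, // stand-in for the thread's completion queue
}

struct ThreadTable {
    slots: Vec<ThreadSlot>,
}

impl ThreadTable {
    fn with_slots(count: usize) -> Self {
        let slots = (0..count)
            .map(|_| ThreadSlot { generation: 0, live: false, cq: VecDeque::new() })
            .collect();
        Self { slots }
    }

    /// Reuses a free slot and bumps its generation so older refs go stale.
    fn create(&mut self) -> Option<ThreadRef> {
        let slot = self.slots.iter().position(|s| !s.live)?;
        let entry = &mut self.slots[slot];
        entry.generation = entry.generation.wrapping_add(1);
        entry.live = true;
        entry.cq.clear();
        Some(ThreadRef { slot, generation: entry.generation })
    }

    fn retire(&mut self, thread: ThreadRef) {
        if let Some(entry) = self.slots.get_mut(thread.slot) {
            if entry.live && entry.generation == thread.generation {
                entry.live = false;
            }
        }
    }

    /// Delivers a completion only if `target` still names the live thread.
    fn route(&mut self, target: ThreadRef, completion: u64) -> Result<(), RouteError> {
        let entry = self.slots.get_mut(target.slot).ok_or(RouteError::UnknownSlot)?;
        if !entry.live || entry.generation != target.generation {
            return Err(RouteError::StaleGeneration);
        }
        entry.cq.push_back(completion);
        Ok(())
    }

    /// Non-blocking stand-in for the wait side of the current thread's CQ.
    fn pop(&mut self, thread: ThreadRef) -> Option<u64> {
        let entry = self.slots.get_mut(thread.slot)?;
        if entry.live && entry.generation == thread.generation {
            entry.cq.pop_front()
        } else {
            None
        }
    }
}
```

The point of the generation field in this model is that a waiter registered for an exited thread cannot deliver into the CQ of a later thread that happens to reuse the same slot; the stale route fails instead.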
SharedParkSpace park-words still need MemoryObject mapping provenance or object pins before shared-key derivation lands.
2026-04-25 11:36 UTC: commit d88bca7 recorded the First AP Scheduler proof.
AP cpu=1 can run scheduler-owned user contexts under -smp 2, and a one-way
scheduler-owner latch prevents the BSP and AP from both entering
scheduler-owned user work while the process-wide ring remains the active
transport.
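One way to picture a one-way scheduler-owner latch is a single atomic word that moves from "no owner" to a CPU id exactly once and never resets. The sketch below is an assumption about the general technique, not the kernel's implementation: `OwnerLatch`, `NO_OWNER`, and the method names are hypothetical.

```rust
use std::sync::atomic::{AtomicUsize, Ordering};

/// Hypothetical sentinel meaning that no CPU has claimed ownership yet.
const NO_OWNER: usize = usize::MAX;

/// One-way latch: the first CPU to claim it stays the owner forever.
struct OwnerLatch(AtomicUsize);

impl OwnerLatch {
    const fn new() -> Self {
        Self(AtomicUsize::new(NO_OWNER))
    }

    /// Returns true if `cpu` is, or has just become, the scheduler owner.
    /// There is deliberately no release operation.
    fn try_claim(&self, cpu: usize) -> bool {
        match self
            .0
            .compare_exchange(NO_OWNER, cpu, Ordering::AcqRel, Ordering::Acquire)
        {
            Ok(_) => true,
            Err(current) => current == cpu,
        }
    }

    fn owner(&self) -> Option<usize> {
        match self.0.load(Ordering::Acquire) {
            NO_OWNER => None,
            cpu => Some(cpu),
        }
    }
}
```

In this model, once AP cpu=1 wins the claim, a later `try_claim(0)` from the BSP returns false, so only one CPU enters scheduler-owned user work while the process-wide ring remains the transport.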