Research: NO_HZ, SQPOLL, and Realtime Scheduling

This note records the external grounding for capOS tickless idle, SQPOLL-oriented full-nohz CPU isolation, and future realtime scheduling contexts. It was written from the 2026-04-29 shared design discussion and checked against primary Linux/seL4 documentation.

Local Grounding

The local research files present before adding this note were:

capnp-error-handling.md
completion-ring-threading.md
eros-capros-coyotos.md
genode.md
hosted-agent-harnesses.md
ix-on-capos-hosting.md
llvm-target.md
multimedia-pipeline-latency.md
os-error-handling.md
out-of-kernel-scheduling.md
pingora.md
plan9-inferno.md
realtime-multimodal-agent-apis.md
robotics-realtime-control.md
sel4.md
small-llm-survey.md
x2apic-and-virtualization.md
zircon.md

Relevant local docs:

  • Scheduling: current LAPIC tick, bounded timeout waiters, timer-side ring polling, AP scheduler-owner proof, and the user-mode idle blocker.
  • SMP: LAPIC/IPI foundation and deferred per-CPU run queue/concurrent scheduler ownership work.
  • Ring v2 For Full SMP: per-thread rings and the rule that SQPOLL must have exactly one SQ consumer.
  • Out-of-kernel scheduling: scheduling contexts, user-space policy, and kernel budget enforcement split.
  • Multimedia pipeline latency: admitted realtime island model for media graphs.
  • Robotics realtime control: scheduling-context authority, control-loop admission, and passive-server donation lessons.
  • x2APIC and APIC virtualization: x2APIC as a later backend, not a prerequisite for the current xAPIC LAPIC timer path.

External Sources Checked

NO_HZ Findings

Linux separates three timer policies:

  • periodic scheduler ticks;
  • tick suppression only while a CPU is idle (NO_HZ_IDLE);
  • adaptive tick suppression for CPUs with one runnable task (NO_HZ_FULL).

The first capOS target should match the conservative shape of NO_HZ_IDLE, not Linux NO_HZ_FULL. The Linux docs describe idle tick suppression as the common, generally useful default, while NO_HZ_FULL is specialized for realtime and HPC loads and requires at least one non-adaptive CPU for timekeeping. The conservative target suits capOS because the current scheduler tick still performs too much work to suppress while tasks are running: timeout expiry, waiter wakeup, run-queue rotation, timer-side ring dispatch, and transitional network polling.

Linux also records a cost: dyntick-idle adds instructions on idle entry/exit and may require expensive clockevent reprogramming. capOS should therefore add counters before changing behavior and should retain a runtime ForcedPeriodic fallback.
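A minimal sketch of that idle-entry decision, with counters recorded before any behavior changes. Only ForcedPeriodic comes from this note; IdleTickless, TickAction, TickCounters, and on_idle_entry are hypothetical names chosen for illustration.

```rust
/// Hypothetical per-CPU tick policy. Only `ForcedPeriodic` is named in the note.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TickMode {
    /// Runtime fallback: always keep the periodic tick.
    ForcedPeriodic,
    /// Stop the tick only while the CPU is idle (NO_HZ_IDLE-like).
    IdleTickless,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TickAction {
    KeepPeriodic,
    /// Program a one-shot event at this absolute monotonic time.
    OneShotAt(u64),
    /// No pending deadline: sleep until some interrupt arrives.
    StopUntilInterrupt,
}

/// Counters that let the cost of dyntick-idle be measured.
#[derive(Debug, Default)]
pub struct TickCounters {
    pub idle_entries: u64,
    pub ticks_stopped: u64,
    pub reprograms: u64,
}

pub fn on_idle_entry(
    mode: TickMode,
    runnable: usize,
    next_deadline_ns: Option<u64>,
    counters: &mut TickCounters,
) -> TickAction {
    counters.idle_entries += 1;
    // The fallback mode and any runnable work both keep the periodic tick.
    if mode == TickMode::ForcedPeriodic || runnable > 0 {
        return TickAction::KeepPeriodic;
    }
    counters.ticks_stopped += 1;
    match next_deadline_ns {
        Some(t) => {
            // Clockevent reprogramming is the potentially expensive step.
            counters.reprograms += 1;
            TickAction::OneShotAt(t)
        }
        None => TickAction::StopUntilInterrupt,
    }
}
```

Comparing reprograms against idle_entries under ForcedPeriodic and IdleTickless is one way to check whether the extra idle-path work pays for itself on a given workload.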

Timekeeping Findings

Linux’s timer stack distinguishes:

  • clock sources: monotonic timeline counters;
  • clock events: hardware devices that interrupt at selected future times;
  • scheduler ticks: one user of clock events, not the timebase itself.

This split is the important design point for capOS. The current TICK_COUNT-style timekeeping is adequate for periodic scheduling, but a tick counter becomes the wrong owner of time once the scheduler can stop the tick. capOS should introduce a monotonic now_ns clocksource layer before enabling tickless idle.
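The split can be sketched as two small traits. Only now_ns appears in this note; ClockSource, ClockEvent, CounterClock, RecordingEvent, and arm_after are hypothetical names, and the counter-backed source is an assumption about how a clocksource might be built.

```rust
/// Monotonic timeline: answers "what time is it?" and never fires interrupts.
pub trait ClockSource {
    fn now_ns(&self) -> u64;
}

/// Interrupt device: answers "interrupt me at this time" and never defines time.
pub trait ClockEvent {
    fn program_oneshot(&mut self, deadline_ns: u64);
}

/// Hypothetical clocksource backed by a free-running counter of known frequency.
pub struct CounterClock {
    pub cycles: u64,
    pub freq_hz: u64,
}

impl ClockSource for CounterClock {
    fn now_ns(&self) -> u64 {
        // Widen to u128 so cycles * 1e9 cannot overflow.
        ((self.cycles as u128 * 1_000_000_000u128) / self.freq_hz as u128) as u64
    }
}

/// Stand-in clockevent that only records what it was asked to do.
pub struct RecordingEvent {
    pub programmed: Option<u64>,
}

impl ClockEvent for RecordingEvent {
    fn program_oneshot(&mut self, deadline_ns: u64) {
        self.programmed = Some(deadline_ns);
    }
}

/// The scheduler is just one client: it reads time from the source and
/// requests an interrupt from the event device.
pub fn arm_after(src: &impl ClockSource, ev: &mut impl ClockEvent, delay_ns: u64) -> u64 {
    let deadline = src.now_ns().saturating_add(delay_ns);
    ev.program_oneshot(deadline);
    deadline
}
```

With this shape, stopping the tick changes only how often program_oneshot is called; now_ns keeps advancing regardless.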

Linux hrtimers provide two lessons without requiring capOS to clone the whole subsystem:

  • waiters should be stored by absolute expiry time, not by periodic tick count;
  • time-ordered expiry structures simplify deadline-based wakeup and avoid scanning every timer on every tick.

capOS already bounds waiter counts, so the first implementation can use a small ordered array, BTreeMap, or heap. The security property is bounded, non-allocating interrupt-path expiry, not a specific data structure.
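As one illustration of the small ordered array option, the sketch below keeps waiters sorted by absolute expiry in fixed storage, so insertion and expiry never allocate. DeadlineQueue, QueueFull, and the u32 waiter id are hypothetical; the capacity N stands in for whatever waiter bound capOS already enforces.

```rust
/// Returned when the fixed waiter bound is reached.
#[derive(Debug, PartialEq, Eq)]
pub struct QueueFull;

/// Fixed-capacity timeout queue sorted by absolute expiry, earliest first.
pub struct DeadlineQueue<const N: usize> {
    entries: [(u64, u32); N], // (deadline_ns, waiter id)
    len: usize,
}

impl<const N: usize> DeadlineQueue<N> {
    pub const fn new() -> Self {
        Self { entries: [(0, 0); N], len: 0 }
    }

    /// Insertion sort step: O(N) shifts, no allocation.
    pub fn insert(&mut self, deadline_ns: u64, waiter: u32) -> Result<(), QueueFull> {
        if self.len == N {
            return Err(QueueFull);
        }
        let mut i = self.len;
        // Shift later deadlines right; equal deadlines keep insertion order.
        while i > 0 && self.entries[i - 1].0 > deadline_ns {
            self.entries[i] = self.entries[i - 1];
            i -= 1;
        }
        self.entries[i] = (deadline_ns, waiter);
        self.len += 1;
        Ok(())
    }

    /// Earliest pending expiry: the value a one-shot clockevent would be armed with.
    pub fn next_deadline(&self) -> Option<u64> {
        if self.len == 0 { None } else { Some(self.entries[0].0) }
    }

    /// Pop one waiter whose deadline has passed, if any.
    pub fn pop_expired(&mut self, now_ns: u64) -> Option<u32> {
        if self.len == 0 || self.entries[0].0 > now_ns {
            return None;
        }
        let waiter = self.entries[0].1;
        self.entries.copy_within(1..self.len, 0);
        self.len -= 1;
        Some(waiter)
    }
}
```

The interrupt path only ever inspects the head of the array, so expiry work is bounded by the number of waiters that are actually due rather than by the total number of timers.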

CPU Isolation and Housekeeping Findings

Linux CPU isolation treats housekeeping as first-class work: unbound timers, workqueues, maintenance, statistics, deferred cleanup, watchdog work, and remote scheduler ticks must move away from isolated CPUs or be explicitly disabled. Linux also requires at least one housekeeping CPU.

For capOS this means full-nohz must not be modeled as a timer flag. It is a CPU ownership contract:

isolated CPU = no unrelated runnable work + no unbound kernel work + explicit
wake/deadline events only

The same rule applies whether the isolated entity is a kernel SQPOLL worker, a userspace poller, or a future admitted realtime loop. A future CpuLease or CpuIsolationLease capability should name the owner, allowed CPU set, allowed mode, accounting target, and revocation policy. Without such authority, a latency-sensitive hint must not grant exclusive CPU access.
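A rough sketch of what such a lease might carry, and of the check that a hint alone grants nothing. The field layout, the 64-CPU bitmask, and the names IsolationMode, Revocation, and may_isolate are assumptions for illustration, not a settled capOS interface.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum IsolationMode {
    Shared,
    FullNoHz,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Revocation {
    Immediate,
    AtQuiescentPoint,
}

/// Hypothetical lease naming the fields listed in the text.
pub struct CpuLease {
    pub owner: u32,             // owning process or thread id
    pub cpu_mask: u64,          // allowed CPU set, one bit per CPU
    pub mode: IsolationMode,    // allowed mode
    pub accounting_target: u32, // who is charged for the isolated CPU
    pub revocation: Revocation, // how the lease can be taken back
}

/// A CPU may be isolated only with a matching lease, and never a housekeeping CPU.
pub fn may_isolate(
    lease: Option<&CpuLease>,
    requester: u32,
    cpu: u32,
    housekeeping_mask: u64,
) -> bool {
    let lease = match lease {
        Some(l) => l,
        None => return false, // a latency-sensitive hint alone grants nothing
    };
    if cpu >= 64 {
        return false;
    }
    let bit = 1u64 << cpu;
    lease.owner == requester
        && lease.mode == IsolationMode::FullNoHz
        && (lease.cpu_mask & bit) != 0
        && (housekeeping_mask & bit) == 0
}
```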

io_uring SQPOLL Findings

Linux IORING_SETUP_SQPOLL creates a kernel thread that polls the submission queue. While it remains active, applications can publish SQEs and observe CQEs without entering the kernel on each submission. When the poller sleeps after its idle period, it sets IORING_SQ_NEED_WAKEUP; userspace must then call io_uring_enter(..., IORING_ENTER_SQ_WAKEUP), or let liburing make that call on its behalf.
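The handshake can be modeled with three shared words. This is a simplified model of the protocol, not Linux's implementation: SqShared, publish, try_sleep, wake, and drain are hypothetical names, SQ_NEED_WAKEUP stands in for IORING_SQ_NEED_WAKEUP, and sequentially consistent atomics are assumed in place of the kernel's explicit barriers.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Models IORING_SQ_NEED_WAKEUP in the shared flags word.
pub const SQ_NEED_WAKEUP: u32 = 1;

pub struct SqShared {
    pub tail: AtomicU32,  // written by the producer
    pub head: AtomicU32,  // written by the poller
    pub flags: AtomicU32, // written by the poller, read by the producer
}

impl SqShared {
    pub fn new() -> Self {
        Self {
            tail: AtomicU32::new(0),
            head: AtomicU32::new(0),
            flags: AtomicU32::new(0),
        }
    }

    /// Producer: publish one entry. Returns true if the poller must be woken.
    pub fn publish(&self) -> bool {
        self.tail.fetch_add(1, Ordering::SeqCst);
        (self.flags.load(Ordering::SeqCst) & SQ_NEED_WAKEUP) != 0
    }

    /// Poller: called after its idle period. Returns true if it may sleep.
    pub fn try_sleep(&self) -> bool {
        self.flags.fetch_or(SQ_NEED_WAKEUP, Ordering::SeqCst);
        // Re-check after raising the flag so a racing publish is not lost.
        if self.head.load(Ordering::SeqCst) != self.tail.load(Ordering::SeqCst) {
            self.flags.fetch_and(!SQ_NEED_WAKEUP, Ordering::SeqCst);
            return false;
        }
        true
    }

    /// Poller: clear the flag once it has been woken.
    pub fn wake(&self) {
        self.flags.fetch_and(!SQ_NEED_WAKEUP, Ordering::SeqCst);
    }

    /// Poller: consume everything published so far; returns the entry count.
    pub fn drain(&self) -> u32 {
        let tail = self.tail.load(Ordering::SeqCst);
        let head = self.head.swap(tail, Ordering::SeqCst);
        tail.wrapping_sub(head)
    }
}
```

The ordering is the point of the sketch: the producer writes the tail and then reads the flag, while the poller writes the flag and then reads the tail, so at least one side always notices the other.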

The capOS consequence is not “copy io_uring”. It is an ownership rule:

SQPOLL ring: kernel worker owns SQ head; userspace owns SQ tail and CQ head;
cap_enter does not become a second SQ consumer.

This requires Ring v2 or an equivalent per-thread ring endpoint. The current process-wide ring and timer-side ring polling are incompatible with safe SQPOLL because they cannot prevent two consumers from draining the same SQ.
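One way to picture the single-consumer rule is a ring whose mode selects the only party allowed to advance the SQ head. All names below (RingMode, Consumer, RingError, Ring) are hypothetical, and "quiescent" is simplified to mean an empty SQ; a real transition would presumably also require the poll worker to be parked.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RingMode {
    Syscall, // cap_enter drains the SQ
    SqPoll,  // the kernel poll worker drains the SQ
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Consumer {
    CapEnter,
    SqPollWorker,
}

#[derive(Debug, PartialEq, Eq)]
pub enum RingError {
    WrongConsumer,
    NotQuiescent,
}

pub struct Ring {
    mode: RingMode,
    sq_head: u32,
    sq_tail: u32,
}

impl Ring {
    pub fn new(mode: RingMode) -> Self {
        Self { mode, sq_head: 0, sq_tail: 0 }
    }

    /// Userspace side: publish one SQE by advancing the tail.
    pub fn submit(&mut self) {
        self.sq_tail = self.sq_tail.wrapping_add(1);
    }

    /// Only the consumer selected by the current mode may advance the head.
    pub fn consume(&mut self, who: Consumer) -> Result<u32, RingError> {
        let allowed = match self.mode {
            RingMode::Syscall => Consumer::CapEnter,
            RingMode::SqPoll => Consumer::SqPollWorker,
        };
        if who != allowed {
            return Err(RingError::WrongConsumer);
        }
        let n = self.sq_tail.wrapping_sub(self.sq_head);
        self.sq_head = self.sq_tail;
        Ok(n)
    }

    /// Mode changes are allowed only when no SQEs are in flight.
    pub fn set_mode(&mut self, mode: RingMode) -> Result<(), RingError> {
        if self.sq_head != self.sq_tail {
            return Err(RingError::NotQuiescent);
        }
        self.mode = mode;
        Ok(())
    }
}
```

In SqPoll mode a cap_enter call could still wake the worker or wait for completions; it is rejected only as an SQ consumer.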

SQPOLL full-nohz should therefore be staged behind:

  • per-thread rings;
  • a ring mode bit and quiescent mode transitions;
  • per-CPU scheduler ownership and reschedule IPIs;
  • a housekeeping CPU;
  • removal or explicit placement of scheduler-tick-polled networking.

Realtime Findings

Linux SCHED_DEADLINE uses runtime, deadline, and period parameters and depends on admission/bandwidth management. Its documentation is explicit that without admission control, no scheduling guarantee follows. For capOS, this is the reason to keep per-request deadline metadata separate from CPU budget authority: a deadline attached to a request guarantees nothing unless admitted CPU time stands behind it.
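A utilization-based admission test in the spirit of that bandwidth management can be sketched as follows. It sums runtime/period over existing reservations and rejects a candidate that would exceed a configured cap. Reservation, utilization_ppm, admit, and the parts-per-million cap are hypothetical, and the test is a bandwidth check for a single CPU, not a full schedulability analysis.

```rust
#[derive(Debug, Clone, Copy)]
pub struct Reservation {
    pub runtime_ns: u64,
    pub deadline_ns: u64,
    pub period_ns: u64,
}

impl Reservation {
    /// Requires 0 < runtime <= deadline <= period.
    pub fn is_well_formed(&self) -> bool {
        self.runtime_ns > 0
            && self.runtime_ns <= self.deadline_ns
            && self.deadline_ns <= self.period_ns
    }

    /// runtime / period in parts per million, rounded up so the test stays conservative.
    /// Assumes a well-formed reservation (period_ns > 0).
    pub fn utilization_ppm(&self) -> u64 {
        let num = self.runtime_ns as u128 * 1_000_000;
        let den = self.period_ns as u128;
        ((num + den - 1) / den) as u64
    }
}

/// Admit `candidate` only if total utilization stays within `cap_ppm`.
/// `existing` is assumed to contain only previously admitted, well-formed entries.
pub fn admit(existing: &[Reservation], candidate: Reservation, cap_ppm: u64) -> bool {
    if !candidate.is_well_formed() {
        return false;
    }
    let used: u64 = existing.iter().map(|r| r.utilization_ppm()).sum();
    used.saturating_add(candidate.utilization_ppm()) <= cap_ppm
}
```

For example, with a cap of 950,000 ppm, reservations of 2 ms and 5 ms per 10 ms period fit together (700,000 ppm), but adding a third of 3 ms per 10 ms would reach 1,000,000 ppm and be refused.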

PREEMPT_RT’s main lesson is that realtime latency is destroyed by long non-preemptible sections, unbounded interrupt handling, and priority inversion. Linux addresses this by making most kernel execution schedulable, using priority-inheritance-aware locks, and threading interrupts. capOS does not need to clone PREEMPT_RT, but any realtime path must keep IRQ top halves short, avoid blocking locks in admitted hot paths, and provide donation or inheritance for capability service calls.

seL4 MCS provides the strongest capability-OS precedent. Scheduling contexts are kernel objects representing CPU-time authority; they carry budget and period, are configured through per-CPU scheduling-control authority, and are enforced with a sporadic-server model. Passive servers can run on a caller’s donated scheduling context and return it on reply.

For capOS:

  • SQE.deadline_ns is request freshness metadata.
  • SchedulingContext is CPU-time authority.
  • RealtimeIsland is the admission object for a whole graph/loop.
  • Scheduling-context donation is how timing survives synchronous capability calls through passive services.
  • SQPOLL and AutoNoHz are executor/isolation backends, not the realtime authority itself.
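The separation between the first two roles can be made concrete with a small dispatch sketch. The deadline_ns field and the SchedulingContext name come from the list above; the other fields, the Option encoding for "no deadline", and the Decision and dispatch names are assumptions.

```rust
/// Request metadata: says when a result stops being useful, nothing more.
pub struct Sqe {
    pub opcode: u16,
    pub deadline_ns: Option<u64>, // None = no freshness bound (assumed encoding)
}

/// CPU-time authority: what the caller is actually allowed to consume.
pub struct SchedulingContext {
    pub budget_ns: u64,
    pub period_ns: u64,
    pub remaining_ns: u64,
}

impl SchedulingContext {
    /// Charge consumed time against the current period's budget.
    pub fn charge(&mut self, ran_ns: u64) {
        self.remaining_ns = self.remaining_ns.saturating_sub(ran_ns);
    }

    /// Refill at the start of a new period.
    pub fn replenish(&mut self) {
        self.remaining_ns = self.budget_ns;
    }
}

#[derive(Debug, PartialEq, Eq)]
pub enum Decision {
    Run,
    DropStale, // deadline passed: freshness policy, costs no budget
    Throttled, // budget exhausted: an urgent deadline does not buy CPU time
}

pub fn dispatch(sqe: &Sqe, sc: &SchedulingContext, now_ns: u64) -> Decision {
    if let Some(deadline) = sqe.deadline_ns {
        if now_ns > deadline {
            return Decision::DropStale;
        }
    }
    if sc.remaining_ns == 0 {
        return Decision::Throttled;
    }
    Decision::Run
}
```

A tight deadline on an SQE never yields Run by itself; only remaining budget in the scheduling context does.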

capOS Design Consequences

  1. Implement tickless idle before full-nohz.
  2. Split clocksource from clockevent before stopping periodic ticks.
  3. Convert timeout waiters to absolute monotonic deadlines before one-shot scheduling.
  4. Replace user-mode idle with kernel/per-CPU idle before real tickless idle.
  5. Keep periodic preemption while there is runnable contention.
  6. Keep networking in ForcedPeriodic or move it to explicit IRQ/deadline polling before enabling tickless on network-active CPUs.
  7. Treat full-nohz as a CPU lease and housekeeping design, not a standalone timer optimization.
  8. Add SQPOLL only after per-thread rings and per-CPU scheduler ownership.
  9. Require one SQ consumer per ring mode.
  10. Use SQE.deadline_ns only for freshness/drop/propagation policy; put budget, period, priority, CPU mask, and overrun policy in SchedulingContext.
  11. Use realtime islands for media/robotics/control graphs; reject hard realtime claims until kernel path, IRQ, device, and WCET evidence exist.