Research: NO_HZ, SQPOLL, and Realtime Scheduling
This note records the external grounding for capOS tickless idle, SQPOLL-oriented full-nohz CPU isolation, and future realtime scheduling contexts. It was written from the 2026-04-29 shared design discussion and checked against primary Linux/seL4 documentation.
Local Grounding
The local research files present before adding this note were:
capnp-error-handling.md
completion-ring-threading.md
eros-capros-coyotos.md
genode.md
hosted-agent-harnesses.md
ix-on-capos-hosting.md
llvm-target.md
multimedia-pipeline-latency.md
os-error-handling.md
out-of-kernel-scheduling.md
pingora.md
plan9-inferno.md
realtime-multimodal-agent-apis.md
robotics-realtime-control.md
sel4.md
small-llm-survey.md
x2apic-and-virtualization.md
zircon.md
Relevant local docs:
- Scheduling: current LAPIC tick, bounded timeout waiters, timer-side ring polling, AP scheduler-owner proof, and the user-mode idle blocker.
- SMP: LAPIC/IPI foundation and deferred per-CPU run queue/concurrent scheduler ownership work.
- Ring v2 For Full SMP: per-thread rings and the rule that SQPOLL must have exactly one SQ consumer.
- Out-of-kernel scheduling: scheduling contexts, user-space policy, and kernel budget enforcement split.
- Multimedia pipeline latency: admitted realtime island model for media graphs.
- Robotics realtime control: scheduling-context authority, control-loop admission, and passive-server donation lessons.
- x2APIC and APIC virtualization: x2APIC as a later backend, not a prerequisite for the current xAPIC LAPIC timer path.
External Sources Checked
- Linux kernel documentation, NO_HZ: Reducing Scheduling-Clock Ticks.
- Linux kernel documentation, Clock sources, Clock events, sched_clock() and delay timers.
- Linux kernel documentation, High resolution timers and dynamic ticks design notes.
- Linux kernel documentation, hrtimers - subsystem for high-resolution kernel timers.
- Linux kernel documentation, CPU Isolation.
- Linux kernel documentation, Housekeeping.
- Linux man-pages project, io_uring_setup(2).
- Linux kernel documentation, Deadline Task Scheduling.
- Linux kernel documentation, PREEMPT_RT theory of operation.
- seL4 documentation, MCS Extensions tutorial.
NO_HZ Findings
Linux separates three timer policies:
- periodic scheduler ticks;
- tick suppression only while a CPU is idle (NO_HZ_IDLE);
- adaptive tick suppression for CPUs with one runnable task (NO_HZ_FULL).
The first capOS target should match the conservative shape of NO_HZ_IDLE,
not Linux NO_HZ_FULL. The Linux docs describe idle tick suppression as the
most common approach and the sensible default, while NO_HZ_FULL is
specialized for realtime and HPC loads and requires at least one
non-adaptive CPU for timekeeping. That maps to capOS because the current
scheduler tick still performs too much work to be dropped on a busy CPU:
timeout expiry, waiter wakeup, run-queue rotation, timer-side ring dispatch,
and transitional network polling.
Linux also records a cost: dyntick-idle adds instructions on idle entry/exit
and may require expensive clockevent reprogramming. capOS should therefore
add counters before changing behavior and should retain a runtime
ForcedPeriodic fallback.
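A minimal sketch of that decision point, assuming a per-CPU hook that runs whenever the scheduler is about to idle. Only ForcedPeriodic comes from this note; `TickMode::TicklessIdle`, `TimerPlan`, `IdleTickStats`, and `plan_timer` are hypothetical names chosen for illustration, not existing capOS APIs.

```rust
/// Hypothetical tick policy; only `ForcedPeriodic` is named in this note.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TickMode {
    /// Runtime fallback: keep the periodic tick even when idle.
    ForcedPeriodic,
    /// Periodic tick while work is runnable, one-shot timer while idle.
    TicklessIdle,
}

/// What the CPU should program into its timer (illustrative).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TimerPlan {
    Periodic,
    /// Sleep until this absolute monotonic deadline, in nanoseconds.
    OneShotAt(u64),
    /// No pending deadline: stop the tick and wait for an interrupt.
    Stopped,
}

/// Counters to collect before and after the behavior change.
#[derive(Debug, Default)]
pub struct IdleTickStats {
    pub idle_entries: u64,
    pub ticks_stopped: u64,
    pub timer_reprograms: u64,
}

pub fn plan_timer(
    mode: TickMode,
    runnable: usize,
    next_deadline_ns: Option<u64>,
    stats: &mut IdleTickStats,
) -> TimerPlan {
    // Conservative NO_HZ_IDLE shape: any runnable work keeps the tick.
    if runnable > 0 {
        return TimerPlan::Periodic;
    }
    stats.idle_entries += 1;
    if mode == TickMode::ForcedPeriodic {
        return TimerPlan::Periodic;
    }
    stats.ticks_stopped += 1;
    match next_deadline_ns {
        Some(deadline) => {
            // Reprogramming is the cost the Linux docs warn about; count it.
            stats.timer_reprograms += 1;
            TimerPlan::OneShotAt(deadline)
        }
        None => TimerPlan::Stopped,
    }
}
```

Comparing `ticks_stopped` against `timer_reprograms` under a real workload is one way to judge whether tickless idle pays for its entry and exit cost on a given machine.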
Timekeeping Findings
Linux’s timer stack distinguishes:
- clock sources: monotonic timeline counters;
- clock events: hardware devices that interrupt at selected future times;
- scheduler ticks: one user of clock events, not the timebase itself.
This split is the important design point for capOS. Current TICK_COUNT style
timekeeping is adequate for periodic scheduling but becomes the wrong owner
once the scheduler can stop the tick. capOS should introduce a monotonic
now_ns clocksource layer before enabling tickless idle.
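The split can be sketched as two small traits. This is an illustration under assumptions, not the capOS interface: `ClockSource`, `ClockEvent`, `CounterClock`, `RecordingEvent`, and `arm_for_deadline` are hypothetical names, and `CounterClock` stands in for whatever free-running hardware counter the kernel ends up using.

```rust
/// Monotonic timeline: answers "what time is it?" and never interrupts.
pub trait ClockSource {
    fn now_ns(&self) -> u64;
}

/// Event device: interrupts at a chosen future time, knows nothing about
/// the timeline beyond the delta it is given.
pub trait ClockEvent {
    fn arm_oneshot(&mut self, delta_ns: u64);
}

/// Stand-in clocksource: a free-running counter at a fixed frequency.
pub struct CounterClock {
    pub freq_hz: u64,
    pub raw: u64,
}

impl ClockSource for CounterClock {
    fn now_ns(&self) -> u64 {
        // Widen to u128 so raw * 1e9 cannot overflow before the divide.
        ((self.raw as u128 * 1_000_000_000u128) / self.freq_hz as u128) as u64
    }
}

/// Stand-in clockevent that only records what it was asked to do.
pub struct RecordingEvent {
    pub last_delta_ns: Option<u64>,
}

impl ClockEvent for RecordingEvent {
    fn arm_oneshot(&mut self, delta_ns: u64) {
        self.last_delta_ns = Some(delta_ns);
    }
}

/// The scheduler is just one user of the event device: it converts an
/// absolute deadline into a delta against the clocksource.
pub fn arm_for_deadline(
    src: &impl ClockSource,
    ev: &mut impl ClockEvent,
    deadline_ns: u64,
) -> u64 {
    // A deadline already in the past arms an immediate (zero-delta) event.
    let delta = deadline_ns.saturating_sub(src.now_ns());
    ev.arm_oneshot(delta);
    delta
}
```

With this shape, stopping the tick only changes who calls `arm_oneshot`; `now_ns` keeps advancing regardless.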
Linux hrtimers provide two lessons without requiring capOS to clone the whole subsystem:
- waiters should be stored by absolute expiry time, not by periodic tick count;
- time-ordered expiry structures simplify deadline-based wakeup and avoid scanning every timer on every tick.
capOS already bounds waiter counts, so the first implementation can use a
small ordered array, BTreeMap, or heap. The security property to preserve is
bounded, non-allocating expiry on the interrupt path, not any particular
data structure.
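As one illustration of the small-ordered-array option, the sketch below keeps waiters sorted by absolute expiry in a fixed-capacity array, so insertion and expiry never allocate and the earliest deadline is always at index 0. `Waiter` and `DeadlineQueue` are hypothetical names, and the capacity `N` is an assumed compile-time bound.

```rust
/// Hypothetical waiter record keyed by absolute monotonic expiry.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Waiter {
    pub deadline_ns: u64,
    pub id: u32,
}

/// Fixed-capacity queue ordered by ascending deadline; no heap allocation.
pub struct DeadlineQueue<const N: usize> {
    slots: [Waiter; N],
    len: usize,
}

impl<const N: usize> DeadlineQueue<N> {
    pub const fn new() -> Self {
        Self { slots: [Waiter { deadline_ns: 0, id: 0 }; N], len: 0 }
    }

    /// Insertion sort step: O(N) worst case, bounded by the capacity.
    /// Returns the waiter back to the caller when the queue is full.
    pub fn insert(&mut self, w: Waiter) -> Result<(), Waiter> {
        if self.len == N {
            return Err(w);
        }
        let mut i = self.len;
        // Shift later deadlines right; equal deadlines keep FIFO order.
        while i > 0 && self.slots[i - 1].deadline_ns > w.deadline_ns {
            self.slots[i] = self.slots[i - 1];
            i -= 1;
        }
        self.slots[i] = w;
        self.len += 1;
        Ok(())
    }

    /// Earliest pending deadline, i.e. what a one-shot timer should target.
    pub fn next_deadline(&self) -> Option<u64> {
        if self.len == 0 { None } else { Some(self.slots[0].deadline_ns) }
    }

    /// Remove and return the earliest waiter if it has expired.
    pub fn pop_expired(&mut self, now_ns: u64) -> Option<Waiter> {
        if self.len == 0 || self.slots[0].deadline_ns > now_ns {
            return None;
        }
        let w = self.slots[0];
        self.slots.copy_within(1..self.len, 0);
        self.len -= 1;
        Some(w)
    }
}
```

The interrupt path calls `pop_expired` until it returns `None`, then reprograms the timer from `next_deadline`; it never scans waiters that are not yet due.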
CPU Isolation and Housekeeping Findings
Linux CPU isolation treats housekeeping as first-class work: unbound timers, workqueues, maintenance, statistics, deferred cleanup, watchdog work, and remote scheduler ticks must move away from isolated CPUs or be explicitly disabled. Linux also requires at least one housekeeping CPU.
For capOS this means full-nohz must not be modeled as a timer flag. It is a CPU ownership contract:
isolated CPU = no unrelated runnable work + no unbound kernel work + explicit
wake/deadline events only
The same rule applies whether the isolated entity is a kernel SQPOLL worker,
a userspace poller, or a future admitted realtime loop. A future CpuLease or
CpuIsolationLease capability should name the owner, allowed CPU set, allowed
mode, accounting target, and revocation policy. Without such authority, a
latency-sensitive hint must not grant exclusive CPU access.
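A sketch of what such a lease might carry and how the isolation check could use it. Every name here (`CpuLease` fields, `IsolationMode`, `Revocation`, `may_isolate`) is hypothetical, and the 64-bit CPU mask is an assumption made to keep the example short.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum IsolationMode {
    SharedPeriodic,
    FullNoHz,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Revocation {
    Immediate,
    AtQuiescentPoint,
}

/// Hypothetical lease shape: the five items the note says it should name.
pub struct CpuLease {
    pub owner: u32,
    pub cpu_mask: u64,
    pub mode: IsolationMode,
    pub accounting_target: u32,
    pub revocation: Revocation,
}

/// Decide whether `requester` may take `cpu` out of general scheduling.
/// `housekeeping_mask` is the set of CPUs currently doing housekeeping.
pub fn may_isolate(
    lease: Option<&CpuLease>,
    requester: u32,
    cpu: u32,
    housekeeping_mask: u64,
) -> bool {
    // No lease means no authority: a latency hint alone grants nothing.
    let lease = match lease {
        Some(l) => l,
        None => return false,
    };
    if cpu >= 64 {
        return false;
    }
    let bit = 1u64 << cpu;
    lease.owner == requester
        && lease.mode == IsolationMode::FullNoHz
        && (lease.cpu_mask & bit) != 0
        // At least one housekeeping CPU must remain after isolation.
        && (housekeeping_mask & !bit) != 0
}
```

The last condition encodes the Linux requirement quoted above: isolating the only housekeeping CPU is refused even with a valid lease.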
io_uring SQPOLL Findings
Linux IORING_SETUP_SQPOLL creates a kernel thread that polls the submission
queue. While it remains active, applications can publish SQEs and observe CQEs
without entering the kernel on each submission. When the poller sleeps after
its idle period, it sets IORING_SQ_NEED_WAKEUP; userspace must call
io_uring_enter(..., IORING_ENTER_SQ_WAKEUP) or let liburing do that wake.
The capOS consequence is not “copy io_uring”. It is an ownership rule:
SQPOLL ring: kernel worker owns SQ head; userspace owns SQ tail and CQ head;
cap_enter does not become a second SQ consumer.
This requires Ring v2 or an equivalent per-thread ring endpoint. The current process-wide ring and timer-side ring polling are incompatible with safe SQPOLL because they cannot prevent two consumers from draining the same SQ.
SQPOLL full-nohz should therefore be staged behind:
- per-thread rings;
- a ring mode bit and quiescent mode transitions;
- per-CPU scheduler ownership and reschedule IPIs;
- a housekeeping CPU;
- removal or explicit placement of scheduler-tick-polled networking.
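The ownership rule and the wakeup handshake can be sketched with a few atomics. This is a simplified model under assumptions, not the capOS or Linux ring layout: `SqRing`, `RingMode`, `SecondConsumer`, and all method names are hypothetical, entries themselves are omitted, and `SeqCst` is used throughout so the publish/sleep race is ordered without further argument.

```rust
use core::sync::atomic::{AtomicBool, AtomicU32, Ordering::SeqCst};

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RingMode {
    Enter,
    SqPoll,
}

/// Error for the enter path trying to drain a ring the poller owns.
#[derive(Debug, PartialEq, Eq)]
pub struct SecondConsumer;

pub struct SqRing {
    mode: RingMode,
    head: AtomicU32,         // advanced only by the single SQ consumer
    tail: AtomicU32,         // advanced only by userspace
    need_wakeup: AtomicBool, // set by the poller before it sleeps
}

impl SqRing {
    pub fn new(mode: RingMode) -> Self {
        Self {
            mode,
            head: AtomicU32::new(0),
            tail: AtomicU32::new(0),
            need_wakeup: AtomicBool::new(false),
        }
    }

    /// Userspace: publish `n` entries. Returns true when the caller must
    /// enter the kernel to wake a sleeping poller.
    pub fn publish(&self, n: u32) -> bool {
        self.tail.fetch_add(n, SeqCst);
        self.mode == RingMode::SqPoll && self.need_wakeup.load(SeqCst)
    }

    fn drain(&self) -> u32 {
        let tail = self.tail.load(SeqCst);
        let n = tail.wrapping_sub(self.head.load(SeqCst));
        self.head.store(tail, SeqCst);
        n
    }

    /// Kernel poller: consume everything published so far.
    pub fn poller_drain(&self) -> u32 {
        self.drain()
    }

    /// Kernel poller: announce sleep, then re-check so a concurrent
    /// publish is never lost. Returns true if it is safe to sleep.
    pub fn poller_try_sleep(&self) -> bool {
        self.need_wakeup.store(true, SeqCst);
        if self.tail.load(SeqCst) != self.head.load(SeqCst) {
            self.need_wakeup.store(false, SeqCst);
            return false;
        }
        true
    }

    /// Wake path: clear the flag once the poller is running again.
    pub fn wake_poller(&self) {
        self.need_wakeup.store(false, SeqCst);
    }

    /// Enter path: must not become a second consumer of an SQPOLL ring.
    pub fn enter_drain(&self) -> Result<u32, SecondConsumer> {
        match self.mode {
            RingMode::SqPoll => Err(SecondConsumer),
            RingMode::Enter => Ok(self.drain()),
        }
    }
}
```

Because `mode` is fixed at construction in this sketch, the single-consumer property holds trivially; a real ring that switches modes would need the quiescent transition listed above before the consumer changes.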
Realtime Findings
Linux SCHED_DEADLINE uses runtime, deadline, and period parameters and
depends on admission/bandwidth management. Its documentation is explicit that
without admission control no scheduling guarantee can be given. For capOS,
that separates per-request deadline metadata, which only describes a request,
from CPU budget authority, which has to pass admission.
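The simplest form of such an admission test is a utilization bound: the sum of runtime/period over all reservations must stay under a cap. The sketch below is an illustration of that idea in integer arithmetic, not a copy of the Linux implementation; `Reservation`, `util_ppm`, `admit`, and the 95% cap used in the usage note are assumptions.

```rust
/// Hypothetical CPU reservation: `runtime_ns` of CPU every `period_ns`.
#[derive(Debug, Clone, Copy)]
pub struct Reservation {
    pub runtime_ns: u64,
    pub period_ns: u64,
}

/// Utilization in parts per million, rounded up so the test never
/// under-counts. Returns None for malformed reservations.
fn util_ppm(r: Reservation) -> Option<u64> {
    if r.period_ns == 0 || r.runtime_ns > r.period_ns {
        return None;
    }
    let num = r.runtime_ns as u128 * 1_000_000;
    let den = r.period_ns as u128;
    Some(((num + den - 1) / den) as u64)
}

/// Admit `new` only if total utilization stays within `cap_ppm`.
pub fn admit(existing: &[Reservation], new: Reservation, cap_ppm: u64) -> bool {
    let mut total = match util_ppm(new) {
        Some(u) => u,
        None => return false,
    };
    for r in existing {
        match util_ppm(*r) {
            Some(u) => total += u,
            None => return false,
        }
    }
    total <= cap_ppm
}
```

With existing reservations of 10 ms per 100 ms and 20 ms per 50 ms and a cap of 950_000 ppm, a new 40 ms per 100 ms reservation fits (900_000 ppm total) while 50 ms per 100 ms does not. A request that merely carries a deadline never passes through this check, which is why deadline metadata alone confers no guarantee.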
PREEMPT_RT’s main lesson is that realtime latency is destroyed by long non-preemptible sections, unbounded interrupt handling, and priority inversion. Linux addresses this by making most kernel execution schedulable, using priority-inheritance-aware locks, and threading interrupts. capOS does not need to clone PREEMPT_RT, but any realtime path must keep IRQ top halves short, avoid blocking locks in admitted hot paths, and provide donation or inheritance for capability service calls.
seL4 MCS provides the strongest capability-OS precedent. Scheduling contexts are kernel objects representing CPU-time authority; they carry budget and period, are configured through per-CPU scheduling-control authority, and are enforced with a sporadic-server model. Passive servers can run on a caller’s donated scheduling context and return it on reply.
For capOS:
- SQE.deadline_ns is request freshness metadata.
- SchedulingContext is CPU-time authority.
- RealtimeIsland is the admission object for a whole graph/loop.
- Scheduling-context donation is how timing survives synchronous capability calls through passive services.
- SQPOLL and AutoNoHz are executor/isolation backends, not the realtime authority itself.
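Donation can be illustrated with a deliberately small model in which a scheduling context is a value that moves from caller to passive server and back. This is a sketch under assumptions, not seL4's or capOS's mechanism: `Thread`, `CallError`, `call_passive`, and the field names are hypothetical, and budget exhaustion is reduced to an error return where a real kernel would involve replenishment and timeout handling.

```rust
/// Hypothetical CPU-time authority: budget per period plus what is left.
#[derive(Debug)]
pub struct SchedulingContext {
    pub budget_ns: u64,
    pub period_ns: u64,
    pub remaining_ns: u64,
}

impl SchedulingContext {
    pub fn new(budget_ns: u64, period_ns: u64) -> Self {
        Self { budget_ns, period_ns, remaining_ns: budget_ns }
    }
}

/// A thread is runnable only while it holds a scheduling context.
pub struct Thread {
    pub sc: Option<SchedulingContext>,
}

#[derive(Debug, PartialEq, Eq)]
pub enum CallError {
    NoSchedulingContext,
    ServerNotPassive,
    BudgetExhausted,
}

/// Synchronous call into a passive server that needs `cost_ns` of CPU.
pub fn call_passive(
    caller: &mut Thread,
    server: &mut Thread,
    cost_ns: u64,
) -> Result<(), CallError> {
    if server.sc.is_some() {
        return Err(CallError::ServerNotPassive);
    }
    // Donate: the context moves, so the caller cannot run meanwhile.
    let sc = caller.sc.take().ok_or(CallError::NoSchedulingContext)?;
    server.sc = Some(sc);

    // The server's work is charged to the donated context.
    let result = {
        let sc = server.sc.as_mut().expect("context was just donated");
        if cost_ns <= sc.remaining_ns {
            sc.remaining_ns -= cost_ns;
            Ok(())
        } else {
            sc.remaining_ns = 0;
            Err(CallError::BudgetExhausted)
        }
    };

    // Reply: the context returns to the caller and the server is passive again.
    caller.sc = server.sc.take();
    result
}
```

The point of the model is that the server never holds CPU-time authority of its own: whatever it spends is visible in the caller's `remaining_ns` after the reply.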
capOS Design Consequences
- Implement tickless idle before full-nohz.
- Split clocksource from clockevent before stopping periodic ticks.
- Convert timeout waiters to absolute monotonic deadlines before one-shot scheduling.
- Replace user-mode idle with kernel/per-CPU idle before real tickless idle.
- Keep periodic preemption while there is runnable contention.
- Keep networking in ForcedPeriodic or move it to explicit IRQ/deadline polling before enabling tickless on network-active CPUs.
- Treat full-nohz as a CPU lease and housekeeping design, not a standalone timer optimization.
- Add SQPOLL only after per-thread rings and per-CPU scheduler ownership.
- Require exactly one SQ consumer per ring, whichever mode the ring is in.
- Use SQE.deadline_ns only for freshness/drop/propagation policy; put budget, period, priority, CPU mask, and overrun policy in SchedulingContext.
- Use realtime islands for media/robotics/control graphs; reject hard realtime claims until kernel path, IRQ, device, and WCET evidence exist.