Research: NO_HZ, SQPOLL, and Realtime Scheduling
This note records the external grounding for capOS tickless idle, SQPOLL-oriented full-nohz CPU isolation, and future realtime scheduling contexts. It was written from the 2026-04-29 shared design discussion and checked against primary Linux/seL4 documentation.
Local Grounding
Relevant local docs:
- Scheduling: current LAPIC tick, bounded timeout waiters, timer-side ring polling, AP scheduler-owner proof, CPL0 idle-thread paths, and Phase F nohz/SQPOLL activation state machine.
- SMP: LAPIC/IPI foundation and deferred per-CPU run queue/concurrent scheduler ownership work.
- Ring v2 For Full SMP: per-thread rings and the rule that SQPOLL must have exactly one SQ consumer.
- Out-of-kernel scheduling: scheduling contexts, user-space policy, and kernel budget enforcement split.
- Multimedia pipeline latency: admitted realtime island model for media graphs.
- Robotics realtime control: scheduling-context authority, control-loop admission, and passive-server donation lessons.
- x2APIC and APIC virtualization: x2APIC as a later backend, not a prerequisite for the current xAPIC LAPIC timer path.
External Sources Checked
- Linux kernel documentation, NO_HZ: Reducing Scheduling-Clock Ticks.
- Linux kernel documentation, Clock sources, Clock events, sched_clock() and delay timers.
- Linux kernel documentation, High resolution timers and dynamic ticks design notes.
- Linux kernel documentation, hrtimers - subsystem for high-resolution kernel timers.
- Linux kernel documentation, CPU Isolation.
- Linux kernel documentation, Housekeeping.
- Linux man-pages project, io_uring_setup(2).
- Linux kernel documentation, Deadline Task Scheduling.
- Linux kernel documentation, PREEMPT_RT theory of operation.
- seL4 documentation, MCS Extensions tutorial.
NO_HZ Findings
Linux separates three timer policies:
- periodic scheduler ticks;
- tick suppression only while a CPU is idle (NO_HZ_IDLE);
- adaptive tick suppression for CPUs with one runnable task (NO_HZ_FULL).
The first capOS target should match the conservative shape of NO_HZ_IDLE,
not Linux NO_HZ_FULL. The Linux docs describe idle tick suppression as the
common case and a sensible default, while NO_HZ_FULL is specialized for
realtime and HPC loads and requires at least one non-adaptive CPU for
timekeeping. The conservative choice fits capOS because the current scheduler
tick still performs too much work to stop on a busy CPU: timeout expiry,
waiter wakeup, run-queue rotation, timer-side ring dispatch, and transitional
network polling.
Linux also records a cost: dyntick-idle adds instructions on idle entry/exit
and may require expensive clockevent reprogramming. capOS should therefore
add counters before changing behavior and should retain a runtime
ForcedPeriodic fallback.
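A minimal sketch of the idle-entry decision with counters and the ForcedPeriodic fallback, assuming a per-CPU policy value. TickMode, TickDecision, TickCounters, and on_idle_entry are hypothetical names for illustration, not the capOS API.

```rust
/// Hypothetical per-CPU tick policy (illustrative names only).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TickMode {
    /// The tick always runs; also serves as the runtime fallback.
    ForcedPeriodic,
    /// The tick may stop only while the CPU is idle.
    TicklessIdle,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TickDecision {
    KeepPeriodic,
    /// Stop the periodic tick and arm a one-shot event at this absolute time.
    OneShotAt(u64),
    /// Stop the tick with no timer armed; only an interrupt wakes the CPU.
    StopUntilInterrupt,
}

/// Counters added before behavior changes, so the cost can be measured.
#[derive(Debug, Default)]
pub struct TickCounters {
    pub idle_entries: u64,
    pub ticks_stopped: u64,
    pub kept_periodic: u64,
}

/// Decide what to do with the tick when a CPU is about to idle.
pub fn on_idle_entry(
    mode: TickMode,
    runnable: usize,
    next_deadline_ns: Option<u64>,
    counters: &mut TickCounters,
) -> TickDecision {
    counters.idle_entries += 1;
    // Fallback mode or runnable work: leave the periodic tick alone.
    if mode == TickMode::ForcedPeriodic || runnable > 0 {
        counters.kept_periodic += 1;
        return TickDecision::KeepPeriodic;
    }
    counters.ticks_stopped += 1;
    match next_deadline_ns {
        Some(t) => TickDecision::OneShotAt(t),
        None => TickDecision::StopUntilInterrupt,
    }
}
```

Comparing ticks_stopped against idle_entries gives a first estimate of how often reprogramming would be paid for, before any behavior is switched on.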
Timekeeping Findings
Linux’s timer stack distinguishes:
- clock sources: monotonic timeline counters;
- clock events: hardware devices that interrupt at selected future times;
- scheduler ticks: one user of clock events, not the timebase itself.
This split is the important design point for capOS. The current
TICK_COUNT-style timekeeping is adequate for periodic scheduling, but a tick
counter becomes the wrong owner of time once the scheduler can stop the tick,
because the count stops advancing with it. capOS should introduce a monotonic
now_ns clocksource layer before enabling tickless idle.
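The clocksource/clockevent split can be sketched as two small traits. Everything here is an assumption for illustration: the trait names, CycleCounter, RecordedEvent, and program_next_wakeup are hypothetical, and real backends would read hardware counters and program the LAPIC timer instead.

```rust
/// Monotonic timeline counter (hypothetical trait).
pub trait ClockSource {
    fn now_ns(&self) -> u64;
}

/// Device that can interrupt at a selected future time (hypothetical trait).
pub trait ClockEvent {
    fn arm_oneshot(&mut self, deadline_ns: u64);
    fn disarm(&mut self);
}

/// Toy clock source: converts a raw cycle count to nanoseconds.
/// `hz` must be nonzero.
pub struct CycleCounter {
    pub cycles: u64,
    pub hz: u64,
}

impl ClockSource for CycleCounter {
    fn now_ns(&self) -> u64 {
        // The u128 intermediate avoids overflow in cycles * 1e9.
        ((self.cycles as u128 * 1_000_000_000u128) / self.hz as u128) as u64
    }
}

/// Toy clock event that only records what was programmed.
#[derive(Default)]
pub struct RecordedEvent {
    pub armed: Option<u64>,
}

impl ClockEvent for RecordedEvent {
    fn arm_oneshot(&mut self, deadline_ns: u64) {
        self.armed = Some(deadline_ns);
    }
    fn disarm(&mut self) {
        self.armed = None;
    }
}

/// The scheduler is one user of a clock event; it reads time from the
/// clock source and never derives time from tick counts.
pub fn program_next_wakeup(
    src: &impl ClockSource,
    ev: &mut impl ClockEvent,
    next_deadline_ns: Option<u64>,
) {
    match next_deadline_ns {
        // Never program a deadline in the past; clamp it to now.
        Some(d) => ev.arm_oneshot(d.max(src.now_ns())),
        None => ev.disarm(),
    }
}
```

With this shape, stopping the periodic tick only changes which deadlines are handed to the clock event; now_ns keeps advancing regardless.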
Linux hrtimers provide two lessons without requiring capOS to clone the whole subsystem:
- waiters should be stored by absolute expiry time, not by periodic tick count;
- time-ordered expiry structures simplify deadline-based wakeup and avoid scanning every timer on every tick.
capOS already bounds waiter counts, so the first implementation can use a
small ordered array, BTreeMap, or heap. The security property is bounded,
non-allocating interrupt-path expiry, not a specific data structure.
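As one possible shape for the small ordered array option, the sketch below keeps waiters sorted by absolute expiry in a fixed-capacity table, so insertion and expiry never allocate. Waiter, WaiterTable, and the method names are hypothetical, not the capOS types.

```rust
/// One timeout waiter keyed by absolute monotonic expiry (hypothetical type).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Waiter {
    pub deadline_ns: u64,
    pub id: u32,
}

/// Fixed-capacity table ordered by deadline, earliest first.
pub struct WaiterTable<const N: usize> {
    slots: [Waiter; N],
    len: usize,
}

impl<const N: usize> WaiterTable<N> {
    pub const fn new() -> Self {
        Self { slots: [Waiter { deadline_ns: 0, id: 0 }; N], len: 0 }
    }

    /// Insert in sorted position; hands the waiter back when the table is full.
    pub fn insert(&mut self, w: Waiter) -> Result<(), Waiter> {
        if self.len == N {
            return Err(w);
        }
        // First index with a strictly later deadline, so ties stay FIFO.
        let pos = self.slots[..self.len]
            .partition_point(|s| s.deadline_ns <= w.deadline_ns);
        // Shift the tail right by one slot to make room.
        self.slots.copy_within(pos..self.len, pos + 1);
        self.slots[pos] = w;
        self.len += 1;
        Ok(())
    }

    /// Earliest pending deadline: the value to program into a one-shot timer.
    pub fn next_deadline_ns(&self) -> Option<u64> {
        self.slots[..self.len].first().map(|w| w.deadline_ns)
    }

    /// Remove and return the earliest waiter if it has expired.
    pub fn pop_expired(&mut self, now_ns: u64) -> Option<Waiter> {
        if self.len == 0 || self.slots[0].deadline_ns > now_ns {
            return None;
        }
        let w = self.slots[0];
        self.slots.copy_within(1..self.len, 0);
        self.len -= 1;
        Some(w)
    }
}
```

Insertion and removal cost O(N) in element moves, which is acceptable only because N is a small compile-time bound; a heap would trade that for O(log N) at the cost of more code on the interrupt path.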
CPU Isolation and Housekeeping Findings
Linux CPU isolation treats housekeeping as first-class work: unbound timers, workqueues, maintenance, statistics, deferred cleanup, watchdog work, and remote scheduler ticks must move away from isolated CPUs or be explicitly disabled. Linux also requires at least one housekeeping CPU.
For capOS this means full-nohz must not be modeled as a timer flag. It is a CPU ownership contract:
isolated CPU = no unrelated runnable work + no unbound kernel work + explicit
wake/deadline events only
The same rule applies whether the isolated entity is a kernel SQPOLL worker,
a userspace poller, or a future admitted realtime loop. CpuIsolationLease
names the owner, allowed CPU set, allowed mode, accounting target, and
revocation policy. It performs real per-CPU periodic-tick suppression for the
narrow single-runnable-entity window (Phase F closed), and a ring-coupled
kernelSqpoll lease suppresses ticks while its bound ring is in SQPOLL
running/sleeping mode with a live owner (SQPOLL-driven auto-nohz closed).
Without a CpuIsolationLease, a latency-sensitive hint must not grant exclusive
CPU access. Generic full-nohz for explicitly budgeted compute threads, a
generic SQPOLL nohz state machine for explicitly leased caller-thread rings,
and timeout-based auto-revoke have since landed. Broader
userspace-poller/device-queue issuance remains future work.
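The ownership contract can be read as a predicate over per-CPU state. The sketch below is an assumption-laden illustration: the field layout of CpuIsolationLease, the IsolationMode and RevocationPolicy variants, CpuState, and may_suppress_tick are all hypothetical and do not describe the landed capOS implementation.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum IsolationMode {
    KernelSqpoll,
    BudgetedCompute,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RevocationPolicy {
    OnOwnerExit,
    AfterTimeoutNs(u64),
}

/// Hypothetical lease record: owner, CPU set, mode, accounting, revocation.
pub struct CpuIsolationLease {
    pub owner: u32,
    pub cpu_mask: u64,
    pub mode: IsolationMode,
    pub accounting_target: u32,
    pub revocation: RevocationPolicy,
}

/// Hypothetical snapshot of what is queued on one CPU.
pub struct CpuState {
    pub cpu: u32,
    pub runnable_owner_tasks: usize,
    pub runnable_other_tasks: usize,
    pub unbound_kernel_work: usize,
    pub is_housekeeping: bool,
}

/// Tick suppression is allowed only when the whole contract holds.
pub fn may_suppress_tick(lease: Option<&CpuIsolationLease>, s: &CpuState) -> bool {
    let lease = match lease {
        Some(l) => l,
        // A latency hint without a lease never grants exclusive CPU access.
        None => return false,
    };
    // checked_shl returns None for CPU ids that do not fit the 64-bit mask.
    let covered = 1u64
        .checked_shl(s.cpu)
        .map_or(false, |bit| lease.cpu_mask & bit != 0);
    covered
        && !s.is_housekeeping
        && s.runnable_owner_tasks == 1 // the single-runnable-entity window
        && s.runnable_other_tasks == 0 // no unrelated runnable work
        && s.unbound_kernel_work == 0 // no unbound kernel work
}
```

Any term turning false (a second runnable task, queued unbound work, a revoked lease) is a reason to restart the periodic tick rather than a reason to fail.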
io_uring SQPOLL Findings
Linux IORING_SETUP_SQPOLL creates a kernel thread that polls the submission
queue. While it remains active, applications can publish SQEs and observe CQEs
without entering the kernel on each submission. When the poller sleeps after
its idle period, it sets IORING_SQ_NEED_WAKEUP; userspace must call
io_uring_enter(..., IORING_ENTER_SQ_WAKEUP) or let liburing do that wake.
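The sleep/wake handshake can be modeled with three shared words. This is a simplified model for reasoning about the protocol, not the Linux ring layout or its code: SqShared, NEED_WAKEUP, publish, prepare_sleep, and on_wake are hypothetical names, and the sketch uses std atomics so it runs on a host.

```rust
use std::sync::atomic::{fence, AtomicU32, Ordering};

/// Simplified shared state of a submission ring (not the Linux layout).
pub struct SqShared {
    pub tail: AtomicU32,  // advanced by the producer
    pub head: AtomicU32,  // advanced by the poller
    pub flags: AtomicU32, // poller raises NEED_WAKEUP before sleeping
}

pub const NEED_WAKEUP: u32 = 1;

/// Producer: publish entries, then report whether a wake call is required.
pub fn publish(sq: &SqShared, new_tail: u32) -> bool {
    sq.tail.store(new_tail, Ordering::Release);
    // Full fence: the tail store must be ordered before the flag load.
    fence(Ordering::SeqCst);
    sq.flags.load(Ordering::Relaxed) & NEED_WAKEUP != 0
}

/// Poller: returns true only if it is safe to go to sleep.
pub fn prepare_sleep(sq: &SqShared) -> bool {
    sq.flags.fetch_or(NEED_WAKEUP, Ordering::Relaxed);
    // Full fence: the flag store must be ordered before the tail re-check.
    fence(Ordering::SeqCst);
    if sq.tail.load(Ordering::Acquire) != sq.head.load(Ordering::Relaxed) {
        // Work arrived while preparing to sleep: stay awake.
        sq.flags.fetch_and(!NEED_WAKEUP, Ordering::Relaxed);
        return false;
    }
    true
}

/// Poller: clear the flag once woken.
pub fn on_wake(sq: &SqShared) {
    sq.flags.fetch_and(!NEED_WAKEUP, Ordering::Relaxed);
}
```

The two fences make the race safe in both directions: either the producer observes the flag and issues the wake, or the poller observes the new tail and does not sleep.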
The capOS consequence is not “copy io_uring”. It is an ownership rule:
SQPOLL ring: kernel worker owns SQ head; userspace owns SQ tail and CQ head;
cap_enter does not become a second SQ consumer.
This requires Ring v2 or an equivalent per-thread ring endpoint. The current process-wide ring and timer-side ring polling are incompatible with safe SQPOLL because they cannot prevent two consumers from draining the same SQ.
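A minimal sketch of the one-consumer rule, assuming a per-thread ring with a mode field. RingMode, Consumer, Ring, and the error strings are hypothetical; the point is only that the consumer allowed to advance the SQ head is a function of the ring mode, and that the mode changes only when the ring is quiescent.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RingMode {
    /// cap_enter drains the SQ on behalf of the calling thread.
    Enter,
    /// A kernel SQPOLL worker drains the SQ.
    Sqpoll,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Consumer {
    CapEnter,
    SqpollWorker,
}

/// Hypothetical per-thread ring endpoint.
pub struct Ring {
    mode: RingMode,
    sq_head: u32,
    sq_tail: u32,
}

impl Ring {
    pub fn new(mode: RingMode) -> Self {
        Self { mode, sq_head: 0, sq_tail: 0 }
    }

    /// Userspace side: publish `n` new entries by advancing the tail.
    pub fn publish(&mut self, n: u32) {
        self.sq_tail = self.sq_tail.wrapping_add(n);
    }

    /// Only the consumer matching the current mode may advance the SQ head.
    pub fn drain(&mut self, who: Consumer) -> Result<u32, &'static str> {
        let allowed = match self.mode {
            RingMode::Enter => Consumer::CapEnter,
            RingMode::Sqpoll => Consumer::SqpollWorker,
        };
        if who != allowed {
            return Err("not the SQ consumer for this ring mode");
        }
        let n = self.sq_tail.wrapping_sub(self.sq_head);
        self.sq_head = self.sq_tail;
        Ok(n)
    }

    /// Mode transitions happen only at a quiescent point: empty SQ.
    pub fn set_mode(&mut self, mode: RingMode) -> Result<(), &'static str> {
        if self.sq_head != self.sq_tail {
            return Err("ring not quiescent");
        }
        self.mode = mode;
        Ok(())
    }
}
```

In Sqpoll mode a cap_enter call can still wake the worker or wait for completions; what it cannot do is touch the SQ head.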
SQPOLL full-nohz required: per-thread rings; a ring mode bit and quiescent mode
transitions; per-CPU scheduler ownership and reschedule IPIs; a housekeeping
CPU; removal or explicit placement of scheduler-tick-polled networking. Those
prerequisites are now closed (Phase F one-SQ-consumer, bounded SQPOLL ring
mode, housekeeping/deferred-work placement, per-CPU idle thread). SQPOLL-driven
nohz activation is implemented for explicitly leased caller-thread
kernelSqpoll rings, including producer wake, bounded service progress,
rollback, and stale-owner rejection. Broad userspace-poller/device-queue policy
issuance remains future work.
Realtime Findings
Linux SCHED_DEADLINE uses runtime, deadline, and period parameters and
depends on admission/bandwidth management. Its documentation is explicit that
without admission control, no scheduling guarantee follows. That directly
separates per-request deadline metadata from CPU budget authority.
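To make the admission point concrete, here is a sketch of a simple utilization-sum test for one CPU. It is not the Linux algorithm: it assumes implicit deadlines (deadline equal to period), and Reservation, util_ppm, admit, and the cap value are hypothetical.

```rust
/// CPU-time reservation: `runtime_ns` of execution every `period_ns`.
#[derive(Debug, Clone, Copy)]
pub struct Reservation {
    pub runtime_ns: u64,
    pub period_ns: u64,
}

/// Utilization in parts per million, rounded up so admission errs on
/// the safe side. Returns None for malformed reservations.
fn util_ppm(r: &Reservation) -> Option<u64> {
    if r.period_ns == 0 || r.runtime_ns > r.period_ns {
        return None;
    }
    let num = r.runtime_ns as u128 * 1_000_000;
    let den = r.period_ns as u128;
    Some(((num + den - 1) / den) as u64)
}

/// Admit the candidate only if total utilization stays within `cap_ppm`.
pub fn admit(existing: &[Reservation], candidate: Reservation, cap_ppm: u64) -> bool {
    let mut total: u64 = 0;
    for r in existing.iter().chain(std::iter::once(&candidate)) {
        match util_ppm(r) {
            Some(u) => total = total.saturating_add(u),
            None => return false,
        }
    }
    total <= cap_ppm
}
```

A request deadline never appears in this test. Only runtime and period do, which is why deadline metadata on a request cannot stand in for budget authority.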
PREEMPT_RT’s main lesson is that realtime latency is destroyed by long non-preemptible sections, unbounded interrupt handling, and priority inversion. Linux addresses this by making most kernel execution schedulable, using priority-inheritance-aware locks, and threading interrupts. capOS does not need to clone PREEMPT_RT, but any realtime path must keep IRQ top halves short, avoid blocking locks in admitted hot paths, and provide donation or inheritance for capability service calls.
seL4 MCS provides the strongest capability-OS precedent. Scheduling contexts are kernel objects representing CPU-time authority; they carry budget and period, are configured through per-CPU scheduling-control authority, and are enforced with a sporadic-server model. Passive servers can run on a caller’s donated scheduling context and return it on reply.
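The donation pattern can be sketched as moving a scheduling context between threads. This is an illustrative model, not seL4's API or its sporadic-server accounting: SchedulingContext, Thread, call, charge, and reply are hypothetical, and budget refill at period boundaries is deliberately left out.

```rust
/// CPU-time authority: a budget that applies per period (refill not modeled).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct SchedulingContext {
    pub budget_ns: u64,
    pub period_ns: u64,
    pub remaining_ns: u64,
}

/// A thread can run only while it holds a scheduling context.
pub struct Thread {
    pub sc: Option<SchedulingContext>,
}

/// Call: the client donates its context to a passive server.
pub fn call(client: &mut Thread, server: &mut Thread) -> Result<(), &'static str> {
    if server.sc.is_some() {
        return Err("server is not passive");
    }
    let sc = client.sc.take().ok_or("caller has no scheduling context")?;
    server.sc = Some(sc);
    Ok(())
}

/// Charge execution time to whichever context the thread currently holds.
pub fn charge(t: &mut Thread, ran_ns: u64) -> Result<(), &'static str> {
    let sc = t.sc.as_mut().ok_or("no scheduling context")?;
    if ran_ns > sc.remaining_ns {
        sc.remaining_ns = 0;
        return Err("budget exhausted");
    }
    sc.remaining_ns -= ran_ns;
    Ok(())
}

/// Reply: the server returns the donated context and becomes passive again.
pub fn reply(server: &mut Thread, client: &mut Thread) -> Result<(), &'static str> {
    let sc = server.sc.take().ok_or("server holds no donated context")?;
    client.sc = Some(sc);
    Ok(())
}
```

Time spent inside the server is charged to the caller's budget, so a synchronous capability call neither escapes the caller's reservation nor needs a reservation of its own.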
For capOS:
- SQE.deadline_ns is request freshness metadata.
- SchedulingContext is CPU-time authority.
- RealtimeIsland is the admission object for a whole graph/loop.
- Scheduling-context donation is how timing survives synchronous capability calls through passive services.
- SQPOLL and AutoNoHz are executor/isolation backends, not the realtime authority itself.
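A short sketch of the freshness role of the request deadline, assuming an optional absolute monotonic deadline per request. Sqe here is a hypothetical two-field view, and Disposition and check_freshness are illustrative names.

```rust
/// Hypothetical SQE view: only the fields needed for a freshness check.
pub struct Sqe {
    pub user_data: u64,
    /// Absolute monotonic deadline; None means the request never goes stale.
    pub deadline_ns: Option<u64>,
}

#[derive(Debug, PartialEq, Eq)]
pub enum Disposition {
    Dispatch,
    DropStale,
}

/// Freshness check only: it can drop a request, never grant it CPU time.
pub fn check_freshness(sqe: &Sqe, now_ns: u64) -> Disposition {
    match sqe.deadline_ns {
        Some(d) if now_ns > d => Disposition::DropStale,
        _ => Disposition::Dispatch,
    }
}
```

Nothing in this check consults a budget or a priority. Whether a fresh request actually runs in time is decided by the scheduling context of the thread that services it.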
capOS Design Consequences
- Implement tickless idle before full-nohz.
- Split clocksource from clockevent before stopping periodic ticks.
- Convert timeout waiters to absolute monotonic deadlines before one-shot scheduling.
- Replace user-mode idle with kernel/per-CPU idle before real tickless idle. Done: the scheduler idle path is a CPL0 per-CPU kernel idle thread; the user-mode idle process is removed.
- Keep periodic preemption while there is runnable contention.
- Keep networking in ForcedPeriodic or move it to explicit IRQ/deadline polling before enabling tickless on network-active CPUs. Network-polling placement is landed as a fail-closed admission gate; placement routing for arbitrary network-active CPUs remains future work.
- Treat full-nohz as a CPU lease and housekeeping design, not a standalone timer optimization. CpuIsolationLease is now implemented, generic full-nohz is landed for explicitly budgeted compute leases, and policy-service issuance remains future work.
- Add SQPOLL only after per-thread rings and per-CPU scheduler ownership. Done: one-SQ-consumer ring ownership, bounded SQPOLL ring mode, and SQPOLL-driven auto-nohz activation are all closed.
- Require one SQ consumer per ring mode. Done: enforced by the Phase F one-SQ-consumer ring ownership gate.
- Use SQE.deadline_ns only for freshness/drop/propagation policy; put budget, period, priority, CPU mask, and overrun policy in SchedulingContext.
- Use realtime islands for media/robotics/control graphs; reject hard realtime claims until kernel path, IRQ, device, and WCET evidence exist.