Research: NO_HZ, SQPOLL, and Realtime Scheduling

This note records the external grounding for capOS tickless idle, SQPOLL-oriented full-nohz CPU isolation, and future realtime scheduling contexts. It was written from the 2026-04-29 shared design discussion and checked against primary Linux/seL4 documentation.

Local Grounding

Relevant local docs:

  • Scheduling: current LAPIC tick, bounded timeout waiters, timer-side ring polling, AP scheduler-owner proof, CPL0 idle-thread paths, and Phase F nohz/SQPOLL activation state machine.
  • SMP: LAPIC/IPI foundation and deferred per-CPU run queue/concurrent scheduler ownership work.
  • Ring v2 For Full SMP: per-thread rings and the rule that SQPOLL must have exactly one SQ consumer.
  • Out-of-kernel scheduling: scheduling contexts, user-space policy, and kernel budget enforcement split.
  • Multimedia pipeline latency: admitted realtime island model for media graphs.
  • Robotics realtime control: scheduling-context authority, control-loop admission, and passive-server donation lessons.
  • x2APIC and APIC virtualization: x2APIC as a later backend, not a prerequisite for the current xAPIC LAPIC timer path.

External Sources Checked

NO_HZ Findings

Linux separates three timer policies:

  • periodic scheduler ticks;
  • tick suppression only while a CPU is idle (NO_HZ_IDLE);
  • adaptive tick suppression for CPUs with one runnable task (NO_HZ_FULL).

The first capOS target should match the conservative shape of NO_HZ_IDLE, not Linux NO_HZ_FULL. The Linux documentation describes idle tick suppression as the commonly useful default, while NO_HZ_FULL is specialized for realtime and HPC workloads and requires at least one non-adaptive CPU for timekeeping. That caution applies to capOS because the current scheduler tick still does too much work to be suppressed while tasks are running: timeout expiry, waiter wakeup, run-queue rotation, timer-side ring dispatch, and transitional network polling.

Linux also records a cost: dyntick-idle adds instructions on idle entry/exit and may require expensive clockevent reprogramming. capOS should therefore add counters before changing behavior and should retain a runtime ForcedPeriodic fallback.
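The idle-entry decision with counters and a ForcedPeriodic override can be sketched as follows. This is a minimal illustration, not capOS code: `TickMode`, `TickCounters`, `CpuTickState`, and `on_idle_entry` are hypothetical names, and the real scheduler state is certainly richer.

```rust
/// Hypothetical tick modes for one CPU (illustrative names, not the capOS API).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TickMode {
    /// The periodic scheduler tick stays armed.
    Periodic,
    /// The tick is stopped while the CPU idles; a one-shot event covers the next deadline.
    IdleTickless,
}

/// Counters that can be collected before any behavior change is enabled.
#[derive(Debug, Default)]
pub struct TickCounters {
    pub idle_entries: u64,
    pub tick_stops: u64,
    pub forced_periodic_hits: u64,
}

/// Assumed per-CPU state consulted on idle entry.
#[derive(Debug, Default)]
pub struct CpuTickState {
    pub runnable: usize,
    /// Runtime fallback switch: when set, the tick is never stopped.
    pub forced_periodic: bool,
    pub counters: TickCounters,
}

impl CpuTickState {
    pub fn new() -> Self {
        Self::default()
    }

    /// Decide the tick mode when the CPU is about to idle, updating counters.
    pub fn on_idle_entry(&mut self) -> TickMode {
        self.counters.idle_entries += 1;
        if self.forced_periodic {
            self.counters.forced_periodic_hits += 1;
            return TickMode::Periodic;
        }
        if self.runnable == 0 {
            self.counters.tick_stops += 1;
            TickMode::IdleTickless
        } else {
            TickMode::Periodic
        }
    }
}
```

Comparing `tick_stops` against `idle_entries` gives a first estimate of how often the reprogramming cost would actually be paid.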

Timekeeping Findings

Linux’s timer stack distinguishes:

  • clock sources: monotonic timeline counters;
  • clock events: hardware devices that interrupt at selected future times;
  • scheduler ticks: one user of clock events, not the timebase itself.

This split is the important design point for capOS. The current TICK_COUNT-style timekeeping is adequate for periodic scheduling, but a tick counter becomes the wrong owner of time once the scheduler can stop the tick. capOS should introduce a monotonic now_ns clocksource layer before enabling tickless idle.
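One way to picture the split is as two separate traits. The sketch below is an assumption-laden illustration: only `now_ns` comes from this note, while `ClockSource`, `ClockEvent`, `CounterClock`, `FakeOneShot`, and `arm_for_idle` are hypothetical names, and the counter-backed source stands in for whatever hardware counter capOS actually uses.

```rust
/// Monotonic timeline counter: answers "what time is it?".
pub trait ClockSource {
    fn now_ns(&self) -> u64;
}

/// Device that interrupts at a selected future time: answers "wake me at T".
pub trait ClockEvent {
    fn program_oneshot(&mut self, deadline_ns: u64);
    fn cancel(&mut self);
}

/// Hypothetical clocksource backed by a free-running counter.
/// Assumes `hz` is nonzero.
pub struct CounterClock {
    pub ticks: u64,
    pub hz: u64,
}

impl ClockSource for CounterClock {
    fn now_ns(&self) -> u64 {
        // Widen to u128 so ticks * 1e9 cannot overflow before the division.
        ((self.ticks as u128 * 1_000_000_000u128) / self.hz as u128) as u64
    }
}

/// Test double standing in for a one-shot timer device.
#[derive(Debug, Default)]
pub struct FakeOneShot {
    pub armed: Option<u64>,
}

impl ClockEvent for FakeOneShot {
    fn program_oneshot(&mut self, deadline_ns: u64) {
        self.armed = Some(deadline_ns);
    }
    fn cancel(&mut self) {
        self.armed = None;
    }
}

/// On idle entry, arm the event device for the next deadline, or disarm it
/// when nothing is pending. A deadline already in the past is clamped to now.
pub fn arm_for_idle<S: ClockSource, E: ClockEvent>(
    src: &S,
    ev: &mut E,
    next_deadline_ns: Option<u64>,
) {
    match next_deadline_ns {
        Some(deadline) => ev.program_oneshot(deadline.max(src.now_ns())),
        None => ev.cancel(),
    }
}
```

The scheduler tick then becomes just one caller of `program_oneshot`, and stopping it does not disturb `now_ns`.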

Linux hrtimers provide two lessons without requiring capOS to clone the whole subsystem:

  • waiters should be stored by absolute expiry time, not by periodic tick count;
  • time-ordered expiry structures simplify deadline-based wakeup and avoid scanning every timer on every tick.

capOS already bounds waiter counts, so the first implementation can use a small ordered array, BTreeMap, or heap. The security property is bounded, non-allocating interrupt-path expiry, not a specific data structure.
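As an illustration of the small-ordered-array option, the following sketch keeps waiters sorted by absolute deadline in a fixed-capacity array, so neither insertion nor expiry allocates. The names `Waiter`, `WaiterQueue`, and `MAX_WAITERS`, and the capacity of 8, are assumptions made for the example.

```rust
/// Assumed bound on concurrent timeout waiters (illustrative value).
pub const MAX_WAITERS: usize = 8;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Waiter {
    /// Absolute monotonic expiry time, not a tick count.
    pub deadline_ns: u64,
    pub id: u32,
}

/// Fixed-capacity queue kept in ascending deadline order.
pub struct WaiterQueue {
    slots: [Waiter; MAX_WAITERS],
    len: usize,
}

impl WaiterQueue {
    pub const fn new() -> Self {
        Self {
            slots: [Waiter { deadline_ns: 0, id: 0 }; MAX_WAITERS],
            len: 0,
        }
    }

    /// Insert in sorted position; returns the waiter back if the queue is full.
    pub fn insert(&mut self, w: Waiter) -> Result<(), Waiter> {
        if self.len == MAX_WAITERS {
            return Err(w);
        }
        // Shift later deadlines one slot to the right (insertion sort step).
        let mut i = self.len;
        while i > 0 && self.slots[i - 1].deadline_ns > w.deadline_ns {
            self.slots[i] = self.slots[i - 1];
            i -= 1;
        }
        self.slots[i] = w;
        self.len += 1;
        Ok(())
    }

    /// Earliest pending deadline, usable as the next one-shot target.
    pub fn next_deadline(&self) -> Option<u64> {
        if self.len == 0 {
            None
        } else {
            Some(self.slots[0].deadline_ns)
        }
    }

    /// Remove and return the earliest waiter if it has expired at `now_ns`.
    pub fn pop_expired(&mut self, now_ns: u64) -> Option<Waiter> {
        if self.len == 0 || self.slots[0].deadline_ns > now_ns {
            return None;
        }
        let w = self.slots[0];
        self.slots.copy_within(1..self.len, 0);
        self.len -= 1;
        Some(w)
    }
}
```

The interrupt path only calls `pop_expired` in a loop and then `next_deadline`, so its work is bounded by `MAX_WAITERS` regardless of how far time has advanced.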

CPU Isolation and Housekeeping Findings

Linux CPU isolation treats housekeeping as first-class work: unbound timers, workqueues, maintenance, statistics, deferred cleanup, watchdog work, and remote scheduler ticks must move away from isolated CPUs or be explicitly disabled. Linux also requires at least one housekeeping CPU.

For capOS this means full-nohz must not be modeled as a timer flag. It is a CPU ownership contract:

isolated CPU = no unrelated runnable work + no unbound kernel work + explicit
wake/deadline events only

The same rule applies whether the isolated entity is a kernel SQPOLL worker, a userspace poller, or a future admitted realtime loop. CpuIsolationLease names the owner, allowed CPU set, allowed mode, accounting target, and revocation policy. Without a CpuIsolationLease, a latency-sensitive hint must not grant exclusive CPU access.

Status: CpuIsolationLease performs real per-CPU periodic-tick suppression for the narrow single-runnable-entity window (Phase F closed), and a ring-coupled kernelSqpoll lease suppresses ticks while its bound ring is in SQPOLL running/sleeping mode with a live owner (SQPOLL-driven auto-nohz closed). Generic full-nohz for explicitly budgeted compute threads, a generic SQPOLL nohz state machine for explicitly leased caller-thread rings, and timeout-based auto-revoke have since landed. Broader userspace-poller/device-queue issuance remains future work.
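The ownership contract can be sketched as a predicate over a lease and per-CPU state. Only the `CpuIsolationLease` name and its list of fields (owner, CPU set, mode, accounting target, revocation policy) come from this note; the field types, the enums, `CpuState`, and `may_suppress_tick` are hypothetical and chosen only to make the rule concrete.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum IsolationMode {
    KernelSqpoll,
    BudgetedCompute,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Revocation {
    OnOwnerExit,
    AfterTimeoutNs(u64),
}

/// Field layout is an assumption; the note only lists what the lease names.
#[derive(Debug)]
pub struct CpuIsolationLease {
    pub owner: u32,
    /// Allowed CPU set as a bitmask (bit n = CPU n), limited to 64 CPUs here.
    pub cpu_mask: u64,
    pub mode: IsolationMode,
    pub accounting_target: u32,
    pub revocation: Revocation,
}

/// Hypothetical snapshot of one CPU's scheduler state.
#[derive(Debug)]
pub struct CpuState {
    pub cpu: u32,
    pub runnable_entities: usize,
    pub unbound_kernel_work: usize,
    pub is_housekeeping: bool,
}

/// The tick may be suppressed only under a lease, on a leased non-housekeeping
/// CPU, with the lease owner as the single runnable entity and no unbound
/// kernel work queued. A missing lease always answers "no".
pub fn may_suppress_tick(
    lease: Option<&CpuIsolationLease>,
    cpu: &CpuState,
    current_owner: u32,
) -> bool {
    let lease = match lease {
        Some(l) => l,
        None => return false,
    };
    cpu.cpu < 64
        && lease.cpu_mask & (1u64 << cpu.cpu) != 0
        && lease.owner == current_owner
        && !cpu.is_housekeeping
        && cpu.runnable_entities == 1
        && cpu.unbound_kernel_work == 0
}
```

Because every condition is checked against current CPU state rather than a stored flag, losing any one of them (a second runnable task, new unbound work) immediately withdraws permission to suppress the tick.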

io_uring SQPOLL Findings

Linux IORING_SETUP_SQPOLL creates a kernel thread that polls the submission queue. While it remains active, applications can publish SQEs and observe CQEs without entering the kernel on each submission. When the poller sleeps after its idle period, it sets IORING_SQ_NEED_WAKEUP; userspace must call io_uring_enter(..., IORING_ENTER_SQ_WAKEUP) or let liburing do that wake.
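The sleep/wake handshake can be modeled with two shared words. This is a simplified model of the protocol, not the Linux implementation: `SqShared`, `publish`, and `poller_try_sleep` are hypothetical, and `SQ_NEED_WAKEUP` only mirrors the role of IORING_SQ_NEED_WAKEUP. The point it shows is the ordering: the producer stores the tail before reading the flag, and the poller sets the flag before re-reading the tail, so at least one side notices the other.

```rust
use std::sync::atomic::{fence, AtomicU32, Ordering};

/// Flag bit modeled on the role of Linux's IORING_SQ_NEED_WAKEUP.
pub const SQ_NEED_WAKEUP: u32 = 1 << 0;

/// Hypothetical shared SQ words visible to both producer and poller.
pub struct SqShared {
    pub tail: AtomicU32,
    pub flags: AtomicU32,
}

/// Producer side: publish a new tail, then report whether the poller must be
/// woken explicitly (in Linux, via io_uring_enter with IORING_ENTER_SQ_WAKEUP).
pub fn publish(sq: &SqShared, new_tail: u32) -> bool {
    sq.tail.store(new_tail, Ordering::Release);
    // Full fence: the tail store must be visible before the flag is read.
    fence(Ordering::SeqCst);
    sq.flags.load(Ordering::Relaxed) & SQ_NEED_WAKEUP != 0
}

/// Poller side: announce the intent to sleep, then re-check for late
/// submissions. Returns true if it is safe to sleep, false if work arrived.
pub fn poller_try_sleep(sq: &SqShared, head: u32) -> bool {
    sq.flags.fetch_or(SQ_NEED_WAKEUP, Ordering::SeqCst);
    fence(Ordering::SeqCst);
    if sq.tail.load(Ordering::Acquire) != head {
        // A submission raced with the sleep attempt: stay awake.
        sq.flags.fetch_and(!SQ_NEED_WAKEUP, Ordering::SeqCst);
        false
    } else {
        true
    }
}
```

While the poller is running, `publish` returns false and the producer never enters the kernel, which is the property the SQPOLL mode exists to provide.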

The capOS consequence is not “copy io_uring”. It is an ownership rule:

SQPOLL ring: kernel worker owns SQ head; userspace owns SQ tail and CQ head;
cap_enter does not become a second SQ consumer.

This requires Ring v2 or an equivalent per-thread ring endpoint. The current process-wide ring and timer-side ring polling are incompatible with safe SQPOLL because they cannot prevent two consumers from draining the same SQ.
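A minimal sketch of the one-consumer rule, assuming a per-ring mode word: each potential consumer checks the mode before touching the SQ head, and the mode may only change while the SQ is quiescent. `RingMode`, `Consumer`, `may_consume`, and `switch_to_sqpoll` are hypothetical names, not the Ring v2 interface.

```rust
use std::sync::atomic::{AtomicU8, Ordering};

const MODE_ENTER: u8 = 0; // cap_enter drains the SQ
const MODE_SQPOLL: u8 = 1; // the kernel SQPOLL worker drains the SQ

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Consumer {
    CapEnter,
    SqpollWorker,
}

/// Per-ring mode word (hypothetical); exactly one consumer is valid per mode.
pub struct RingMode(AtomicU8);

impl RingMode {
    pub fn new() -> Self {
        RingMode(AtomicU8::new(MODE_ENTER))
    }

    /// A consumer may advance the SQ head only if the ring mode names it.
    pub fn may_consume(&self, who: Consumer) -> bool {
        match (self.0.load(Ordering::Acquire), who) {
            (MODE_ENTER, Consumer::CapEnter) => true,
            (MODE_SQPOLL, Consumer::SqpollWorker) => true,
            _ => false,
        }
    }

    /// Switch to SQPOLL mode. The caller asserts that the SQ is quiescent
    /// (no drain in progress); the compare-exchange rejects a second switch.
    pub fn switch_to_sqpoll(&self, sq_quiescent: bool) -> bool {
        sq_quiescent
            && self
                .0
                .compare_exchange(MODE_ENTER, MODE_SQPOLL, Ordering::AcqRel, Ordering::Acquire)
                .is_ok()
    }
}
```

In SQPOLL mode a `cap_enter` call can still wake the worker or reap completions, but `may_consume(Consumer::CapEnter)` answers false, so it never drains the SQ itself.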

SQPOLL full-nohz required:

  • per-thread rings;
  • a ring mode bit and quiescent mode transitions;
  • per-CPU scheduler ownership and reschedule IPIs;
  • a housekeeping CPU;
  • removal or explicit placement of scheduler-tick-polled networking.

Those prerequisites are now closed (Phase F one-SQ-consumer, bounded SQPOLL ring mode, housekeeping/deferred-work placement, per-CPU idle thread). SQPOLL-driven nohz activation is implemented for explicitly leased caller-thread kernelSqpoll rings, including producer wake, bounded service progress, rollback, and stale-owner rejection. Broad userspace-poller/device-queue policy issuance remains future work.

Realtime Findings

Linux SCHED_DEADLINE uses runtime, deadline, and period parameters and depends on admission/bandwidth management. Its documentation is explicit that without admission control, no scheduling guarantee follows. For capOS, that directly separates per-request deadline metadata from CPU budget authority: a deadline attached to a request does not by itself entitle anyone to CPU time.
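A utilization-based admission check in this spirit can be sketched as follows: each task contributes runtime/period, and a new task is admitted only if the sum stays under a configured cap. This is a simplified single-CPU illustration, not the Linux algorithm in full; `DlParams`, `utilization_ppm`, `admit`, and the parts-per-million representation are assumptions for the example.

```rust
/// Runtime, deadline, and period in nanoseconds (SCHED_DEADLINE-style parameters).
#[derive(Debug, Clone, Copy)]
pub struct DlParams {
    pub runtime_ns: u64,
    pub deadline_ns: u64,
    pub period_ns: u64,
}

/// Utilization in parts per million, rounded up so the check stays conservative.
/// Returns None unless 0 < runtime <= deadline <= period.
pub fn utilization_ppm(p: &DlParams) -> Option<u64> {
    if p.runtime_ns == 0 || p.runtime_ns > p.deadline_ns || p.deadline_ns > p.period_ns {
        return None;
    }
    let num = p.runtime_ns as u128 * 1_000_000u128;
    let den = p.period_ns as u128;
    Some(((num + den - 1) / den) as u64)
}

/// Admit `candidate` only if total utilization stays within `cap_ppm`.
/// Any malformed parameter set causes rejection.
pub fn admit(existing: &[DlParams], candidate: &DlParams, cap_ppm: u64) -> bool {
    let mut total = match utilization_ppm(candidate) {
        Some(u) => u,
        None => return false,
    };
    for p in existing {
        match utilization_ppm(p) {
            Some(u) => total = total.saturating_add(u),
            None => return false,
        }
    }
    total <= cap_ppm
}
```

For example, with two admitted tasks at 10% and 30% utilization and an illustrative cap of 95%, a 50% candidate is admitted (90% total) while a 60% candidate is rejected (100% total).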

PREEMPT_RT’s main lesson is that realtime latency is destroyed by long non-preemptible sections, unbounded interrupt handling, and priority inversion. Linux addresses this by making most kernel execution schedulable, using priority-inheritance-aware locks, and threading interrupts. capOS does not need to clone PREEMPT_RT, but any realtime path must keep IRQ top halves short, avoid blocking locks in admitted hot paths, and provide donation or inheritance for capability service calls.

seL4 MCS provides the strongest capability-OS precedent. Scheduling contexts are kernel objects representing CPU-time authority; they carry budget and period, are configured through per-CPU scheduling-control authority, and are enforced with a sporadic-server model. Passive servers can run on a caller’s donated scheduling context and return it on reply.

For capOS:

  • SQE.deadline_ns is request freshness metadata.
  • SchedulingContext is CPU-time authority.
  • RealtimeIsland is the admission object for a whole graph/loop.
  • Scheduling-context donation is how timing survives synchronous capability calls through passive services.
  • SQPOLL and AutoNoHz are executor/isolation backends, not the realtime authority itself.
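The separation in the list above can be sketched with Rust ownership: the request carries only a freshness deadline, while CPU time lives in a scheduling context that moves to a passive server for the duration of a call and moves back on reply. This is a deliberately simplified model. Only the `SQE.deadline_ns` and `SchedulingContext` names come from this note; the fields, `Thread`, `is_fresh`, `charge`, and `call_passive` are hypothetical, and a real sporadic-server implementation would handle replenishment and overrun rather than just failing.

```rust
/// CPU-time authority (fields are illustrative).
#[derive(Debug, PartialEq, Eq)]
pub struct SchedulingContext {
    pub budget_ns: u64,
    pub period_ns: u64,
    pub remaining_ns: u64,
}

impl SchedulingContext {
    /// Consume budget; fails without charging if not enough remains.
    pub fn charge(&mut self, cost_ns: u64) -> Result<(), &'static str> {
        if self.remaining_ns >= cost_ns {
            self.remaining_ns -= cost_ns;
            Ok(())
        } else {
            Err("budget exhausted")
        }
    }
}

/// A thread runs only while it holds a scheduling context.
#[derive(Debug, Default)]
pub struct Thread {
    pub sc: Option<SchedulingContext>,
}

/// Request metadata: a freshness deadline, carrying no CPU authority.
pub struct Sqe {
    pub deadline_ns: u64,
}

/// Freshness check: a stale request can be dropped without touching any budget.
pub fn is_fresh(sqe: &Sqe, now_ns: u64) -> bool {
    now_ns <= sqe.deadline_ns
}

/// Synchronous call into a passive server: the caller's scheduling context is
/// donated for the call and returned on reply, whether or not the call succeeded.
pub fn call_passive(
    caller: &mut Thread,
    server: &mut Thread,
    cost_ns: u64,
) -> Result<(), &'static str> {
    if server.sc.is_some() {
        return Err("server is not passive");
    }
    let sc = caller.sc.take().ok_or("caller has no scheduling context")?;
    server.sc = Some(sc); // donation
    let result = server
        .sc
        .as_mut()
        .map_or(Err("donation lost"), |sc| sc.charge(cost_ns));
    caller.sc = server.sc.take(); // return on reply
    result
}
```

The server's work is charged to the caller's budget, so the time spent in a capability call is accounted to the party that was admitted for it.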

capOS Design Consequences

  1. Implement tickless idle before full-nohz.
  2. Split clocksource from clockevent before stopping periodic ticks.
  3. Convert timeout waiters to absolute monotonic deadlines before one-shot scheduling.
  4. Replace user-mode idle with kernel/per-CPU idle before real tickless idle. Done: the scheduler idle path is a CPL0 per-CPU kernel idle thread; the user-mode idle process is removed.
  5. Keep periodic preemption while there is runnable contention.
  6. Keep networking in ForcedPeriodic or move it to explicit IRQ/deadline polling before enabling tickless on network-active CPUs. Network-polling placement is landed as a fail-closed admission gate; placement routing for arbitrary network-active CPUs remains future work.
  7. Treat full-nohz as a CPU lease and housekeeping design, not a standalone timer optimization. CpuIsolationLease is now implemented, generic full-nohz is landed for explicitly budgeted compute leases, and policy-service issuance remains future work.
  8. Add SQPOLL only after per-thread rings and per-CPU scheduler ownership. Done: one-SQ-consumer ring ownership, bounded SQPOLL ring mode, and SQPOLL-driven auto-nohz activation are all closed.
  9. Require one SQ consumer per ring mode. Done: enforced by the Phase F one-SQ-consumer ring ownership gate.
  10. Use SQE.deadline_ns only for freshness/drop/propagation policy; put budget, period, priority, CPU mask, and overrun policy in SchedulingContext.
  11. Use realtime islands for media/robotics/control graphs; reject hard realtime claims until kernel path, IRQ, device, and WCET evidence exist.