# Research: NO_HZ, SQPOLL, and Realtime Scheduling

This note records the external grounding for capOS tickless idle,
SQPOLL-oriented full-nohz CPU isolation, and future realtime scheduling
contexts. It was written from the 2026-04-29 shared design discussion and
checked against primary Linux/seL4 documentation.

## Local Grounding

The local research files present before adding this note were:

```text
capnp-error-handling.md
completion-ring-threading.md
eros-capros-coyotos.md
genode.md
hosted-agent-harnesses.md
ix-on-capos-hosting.md
llvm-target.md
multimedia-pipeline-latency.md
os-error-handling.md
out-of-kernel-scheduling.md
pingora.md
plan9-inferno.md
realtime-multimodal-agent-apis.md
robotics-realtime-control.md
sel4.md
small-llm-survey.md
x2apic-and-virtualization.md
zircon.md
```

Relevant local docs:

- [Scheduling](../architecture/scheduling.md): current LAPIC tick, bounded
  timeout waiters, timer-side ring polling, AP scheduler-owner proof, and the
  user-mode idle blocker.
- [SMP](../proposals/smp-proposal.md): LAPIC/IPI foundation and deferred
  per-CPU run queue/concurrent scheduler ownership work.
- [Ring v2 For Full SMP](../proposals/ring-v2-smp-proposal.md): per-thread
  rings and the rule that SQPOLL must have exactly one SQ consumer.
- [Out-of-kernel scheduling](out-of-kernel-scheduling.md): scheduling contexts,
  user-space policy, and kernel budget enforcement split.
- [Multimedia pipeline latency](multimedia-pipeline-latency.md): admitted
  realtime island model for media graphs.
- [Robotics realtime control](robotics-realtime-control.md): scheduling-context
  authority, control-loop admission, and passive-server donation lessons.
- [x2APIC and APIC virtualization](x2apic-and-virtualization.md): x2APIC as a
  later backend, not a prerequisite for the current xAPIC LAPIC timer path.

## External Sources Checked

- Linux kernel documentation,
  [NO_HZ: Reducing Scheduling-Clock Ticks](https://docs.kernel.org/timers/no_hz.html).
- Linux kernel documentation,
  [Clock sources, Clock events, sched_clock() and delay timers](https://www.kernel.org/doc/html/v6.0/timers/timekeeping.html).
- Linux kernel documentation,
  [High resolution timers and dynamic ticks design notes](https://docs.kernel.org/timers/highres.html).
- Linux kernel documentation,
  [hrtimers - subsystem for high-resolution kernel timers](https://www.kernel.org/doc/html/latest/timers/hrtimers.html).
- Linux kernel documentation,
  [CPU Isolation](https://www.kernel.org/doc/html/next/admin-guide/cpu-isolation.html).
- Linux kernel documentation,
  [Housekeeping](https://www.kernel.org/doc/html/latest/core-api/housekeeping.html).
- Linux man-pages project,
  [io_uring_setup(2)](https://man7.org/linux/man-pages/man2/io_uring_setup.2.html).
- Linux kernel documentation,
  [Deadline Task Scheduling](https://www.kernel.org/doc/html/latest/scheduler/sched-deadline.html).
- Linux kernel documentation,
  [PREEMPT_RT theory of operation](https://docs.kernel.org/core-api/real-time/theory.html).
- seL4 documentation,
  [MCS Extensions tutorial](https://docs.sel4.systems/Tutorials/mcs.html).

## NO_HZ Findings

Linux separates three timer policies:

- periodic scheduler ticks;
- tick suppression only while a CPU is idle (`NO_HZ_IDLE`);
- adaptive tick suppression for CPUs with one runnable task (`NO_HZ_FULL`).

The first capOS target should match the conservative shape of `NO_HZ_IDLE`,
not Linux `NO_HZ_FULL`. The Linux docs describe idle tick suppression as the
most common approach and a sensible default, while `NO_HZ_FULL` is specialized
for realtime and HPC loads and requires at least one non-adaptive CPU for
timekeeping. The conservative target fits capOS because the current scheduler
tick still carries work that has to move elsewhere before the tick can stop on
a busy CPU: timeout expiry, waiter wakeup, run-queue rotation, timer-side ring
dispatch, and transitional network polling.

Linux also records a cost: dyntick-idle adds instructions on idle entry/exit
and may require expensive clockevent reprogramming. capOS should therefore
add counters before changing behavior and should retain a runtime
`ForcedPeriodic` fallback.
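
A minimal sketch of that idle-entry decision, assuming a two-mode policy. Only
`ForcedPeriodic` is a name from this note; `TickMode::TicklessIdle`,
`TickAction`, `TickStats`, and `on_idle_entry` are hypothetical names used for
illustration, not existing capOS APIs.

```rust
/// Hypothetical tick policy. Only `ForcedPeriodic` is named in this note.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TickMode {
    /// Runtime fallback: never stop the periodic tick.
    ForcedPeriodic,
    /// Stop the tick only while the CPU has nothing runnable.
    TicklessIdle,
}

/// What the idle path asks the clockevent layer to do (hypothetical).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum TickAction {
    KeepPeriodic,
    /// Program a one-shot event at this absolute monotonic time (ns).
    OneShotAt(u64),
    /// No pending deadline: stop the tick and wait for an interrupt.
    Stop,
}

/// Counters that can be collected before any behavior changes.
#[derive(Debug, Default, Clone, Copy, PartialEq, Eq)]
pub struct TickStats {
    pub idle_entries: u64,
    pub ticks_stopped: u64,
    pub oneshots_programmed: u64,
}

/// Decide what to do with the tick when a CPU enters idle.
pub fn on_idle_entry(
    mode: TickMode,
    runnable_tasks: usize,
    next_deadline_ns: Option<u64>,
    stats: &mut TickStats,
) -> TickAction {
    stats.idle_entries += 1;
    // The fallback mode and any runnable work both keep the periodic tick.
    if mode == TickMode::ForcedPeriodic || runnable_tasks > 0 {
        return TickAction::KeepPeriodic;
    }
    stats.ticks_stopped += 1;
    match next_deadline_ns {
        Some(deadline) => {
            stats.oneshots_programmed += 1;
            TickAction::OneShotAt(deadline)
        }
        None => TickAction::Stop,
    }
}
```

Keeping the counters in the same function as the decision means the cost of
idle entry can be measured under `ForcedPeriodic` before the tickless path is
ever taken.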

## Timekeeping Findings

Linux's timer stack distinguishes:

- clock sources: monotonic timeline counters;
- clock events: hardware devices that interrupt at selected future times;
- scheduler ticks: one user of clock events, not the timebase itself.

This split is the important design point for capOS. The current
`TICK_COUNT`-style timekeeping is adequate for periodic scheduling but becomes
the wrong owner of time once the scheduler can stop the tick, because a counter
that only advances on ticks stops advancing with them. capOS should introduce a
monotonic `now_ns` clocksource layer before enabling tickless idle.
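
One possible shape for that layer, sketched under the assumption that the
clocksource is a free-running cycle counter with a known frequency. The
`ClockSource` trait, `cycles_to_ns`, and `FixedClock` are hypothetical names;
only `now_ns` comes from this note.

```rust
/// Hypothetical clocksource abstraction: a free-running monotonic counter.
pub trait ClockSource {
    /// Raw counter value; must never go backwards.
    fn read_cycles(&self) -> u64;
    /// Counter frequency in Hz.
    fn freq_hz(&self) -> u64;
}

/// Convert cycles to nanoseconds using a 128-bit intermediate so the
/// multiplication cannot overflow. Returns `None` for a zero frequency or a
/// result that does not fit in `u64`.
pub fn cycles_to_ns(cycles: u64, freq_hz: u64) -> Option<u64> {
    if freq_hz == 0 {
        return None;
    }
    let ns = (cycles as u128 * 1_000_000_000u128) / freq_hz as u128;
    if ns > u64::MAX as u128 {
        None
    } else {
        Some(ns as u64)
    }
}

/// Monotonic time derived from the counter, independent of scheduler ticks.
pub fn now_ns<C: ClockSource>(clock: &C) -> Option<u64> {
    cycles_to_ns(clock.read_cycles(), clock.freq_hz())
}

/// Test double with a fixed counter value (illustration only).
pub struct FixedClock {
    pub cycles: u64,
    pub freq_hz: u64,
}

impl ClockSource for FixedClock {
    fn read_cycles(&self) -> u64 {
        self.cycles
    }
    fn freq_hz(&self) -> u64 {
        self.freq_hz
    }
}
```

Because `now_ns` reads the counter directly, it keeps advancing while the
scheduler tick is stopped, which is the property tick-count timekeeping lacks.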

Linux hrtimers provide two lessons without requiring capOS to clone the whole
subsystem:

- waiters should be stored by absolute expiry time, not by periodic tick
  count;
- time-ordered expiry structures simplify deadline-based wakeup and avoid
  scanning every timer on every tick.

capOS already bounds waiter counts, so the first implementation can use a
small ordered array, `BTreeMap`, or heap. The security property is bounded,
non-allocating interrupt-path expiry, not a specific data structure.
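
As an illustration of the small ordered array option, the sketch below keeps
waiters sorted by absolute deadline in a fixed-capacity array, so insertion and
expiry never allocate and are bounded by the capacity `N`. `DeadlineQueue`,
`QueueFull`, and the `u32` waiter id are hypothetical; the real waiter
representation may differ.

```rust
/// Returned when the bounded queue is already at capacity.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct QueueFull;

/// Fixed-capacity queue of (absolute deadline in ns, waiter id), kept in
/// ascending deadline order. No heap allocation anywhere.
pub struct DeadlineQueue<const N: usize> {
    entries: [(u64, u32); N],
    len: usize,
}

impl<const N: usize> DeadlineQueue<N> {
    pub const fn new() -> Self {
        Self { entries: [(0, 0); N], len: 0 }
    }

    /// Insert by shifting later deadlines right; at most N moves.
    /// Equal deadlines keep insertion (FIFO) order.
    pub fn insert(&mut self, deadline_ns: u64, waiter: u32) -> Result<(), QueueFull> {
        if self.len == N {
            return Err(QueueFull);
        }
        let mut i = self.len;
        while i > 0 && self.entries[i - 1].0 > deadline_ns {
            self.entries[i] = self.entries[i - 1];
            i -= 1;
        }
        self.entries[i] = (deadline_ns, waiter);
        self.len += 1;
        Ok(())
    }

    /// Earliest pending deadline: the value a one-shot timer would be
    /// programmed with.
    pub fn next_deadline(&self) -> Option<u64> {
        if self.len == 0 {
            None
        } else {
            Some(self.entries[0].0)
        }
    }

    /// Remove and return one waiter whose deadline is at or before `now_ns`.
    pub fn pop_expired(&mut self, now_ns: u64) -> Option<u32> {
        if self.len == 0 || self.entries[0].0 > now_ns {
            return None;
        }
        let waiter = self.entries[0].1;
        self.entries.copy_within(1..self.len, 0);
        self.len -= 1;
        Some(waiter)
    }
}
```

Expiry only ever inspects the front entry, so an interrupt handler does work
proportional to the number of waiters that actually expired rather than the
number that exist.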

## CPU Isolation and Housekeeping Findings

Linux CPU isolation treats housekeeping as first-class work: unbound timers,
workqueues, maintenance, statistics, deferred cleanup, watchdog work, and
remote scheduler ticks must move away from isolated CPUs or be explicitly
disabled. Linux also requires at least one housekeeping CPU.

For capOS this means full-nohz must not be modeled as a timer flag. It is a
CPU ownership contract:

```text
isolated CPU = no unrelated runnable work + no unbound kernel work + explicit
wake/deadline events only
```

The same rule applies whether the isolated entity is a kernel SQPOLL worker,
a userspace poller, or a future admitted realtime loop. A future `CpuLease` or
`CpuIsolationLease` capability should name the owner, allowed CPU set, allowed
mode, accounting target, and revocation policy. Without such authority, a
latency-sensitive hint must not grant exclusive CPU access.
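
The lease fields listed above could be modeled roughly as follows. This is a
sketch under assumptions: the field types, the 64-CPU bitmask, the
`IsolationMode` and `RevocationPolicy` variants, and the `may_isolate` check
are all hypothetical, chosen only to show that a request without a matching
lease is refused.

```rust
/// Hypothetical isolation modes, one per kind of isolated entity.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum IsolationMode {
    SqPoll,
    UserPoller,
    RealtimeLoop,
}

/// Hypothetical revocation policies.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RevocationPolicy {
    Immediate,
    AtQuiescentPoint,
}

/// Sketch of a `CpuLease`: the authority to own a CPU exclusively.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct CpuLease {
    pub owner: u64,
    /// Bit n set means CPU n is covered by the lease (assumes <= 64 CPUs).
    pub cpu_mask: u64,
    pub mode: IsolationMode,
    pub accounting_target: u64,
    pub revocation: RevocationPolicy,
}

/// Grant isolation only when a lease covers the requester, CPU, and mode,
/// and the CPU is not reserved for housekeeping.
pub fn may_isolate(
    lease: Option<&CpuLease>,
    requester: u64,
    cpu: u32,
    mode: IsolationMode,
    housekeeping_mask: u64,
) -> bool {
    // A hint without a lease grants nothing.
    let lease = match lease {
        Some(l) => l,
        None => return false,
    };
    // At least one housekeeping CPU must exist, and CPU ids must fit the mask.
    if housekeeping_mask == 0 || cpu >= 64 {
        return false;
    }
    let bit = 1u64 << cpu;
    lease.owner == requester
        && lease.mode == mode
        && lease.cpu_mask & bit != 0
        && housekeeping_mask & bit == 0
}
```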

## io_uring SQPOLL Findings

Linux `IORING_SETUP_SQPOLL` creates a kernel thread that polls the submission
queue. While it remains active, applications can publish SQEs and observe CQEs
without entering the kernel on each submission. When the poller goes to sleep
after its idle period, it sets `IORING_SQ_NEED_WAKEUP` in the SQ ring flags;
userspace must then call `io_uring_enter(..., IORING_ENTER_SQ_WAKEUP)`, or rely
on liburing to issue that wakeup.

The capOS consequence is not "copy io_uring". It is an ownership rule:

```text
SQPOLL ring: kernel worker owns SQ head; userspace owns SQ tail and CQ head;
cap_enter does not become a second SQ consumer.
```

This requires Ring v2 or an equivalent per-thread ring endpoint. The current
process-wide ring and timer-side ring polling are incompatible with safe
SQPOLL because they cannot prevent two consumers from draining the same SQ.
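
The ownership rule and the wakeup handshake can be sketched with plain atomics.
Everything here is an assumption made for illustration: `SqState`, `RingMode`,
`Consumer`, and `SQ_NEED_WAKEUP` are hypothetical names (the flag is modeled on
Linux's `IORING_SQ_NEED_WAKEUP`), and the sketch tracks only indices, not SQE
contents.

```rust
use std::sync::atomic::{AtomicU32, Ordering};

/// Hypothetical flag bit, modeled on Linux's IORING_SQ_NEED_WAKEUP.
pub const SQ_NEED_WAKEUP: u32 = 1;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum RingMode {
    /// Submissions are drained by the `cap_enter` path.
    Enter,
    /// Submissions are drained only by the kernel poll worker.
    SqPoll,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Consumer {
    CapEnter,
    PollWorker,
}

pub struct SqState {
    head: AtomicU32,  // written only by the single SQ consumer
    tail: AtomicU32,  // written only by userspace
    flags: AtomicU32, // written only by the poll worker side
    mode: RingMode,
}

impl SqState {
    pub fn new(mode: RingMode) -> Self {
        SqState {
            head: AtomicU32::new(0),
            tail: AtomicU32::new(0),
            flags: AtomicU32::new(0),
            mode,
        }
    }

    /// Userspace publishes `n` entries. Returns true if the worker is asleep
    /// and an explicit wakeup call is required.
    pub fn publish(&self, n: u32) -> bool {
        let tail = self.tail.load(Ordering::Relaxed);
        // SeqCst so the tail store is ordered before the flags load.
        self.tail.store(tail.wrapping_add(n), Ordering::SeqCst);
        self.flags.load(Ordering::SeqCst) & SQ_NEED_WAKEUP != 0
    }

    /// Drain pending entries. Only the consumer that matches the ring mode
    /// may advance the head; any other caller gets `None`.
    pub fn drain(&self, who: Consumer) -> Option<u32> {
        let allowed = matches!(
            (self.mode, who),
            (RingMode::Enter, Consumer::CapEnter) | (RingMode::SqPoll, Consumer::PollWorker)
        );
        if !allowed {
            return None;
        }
        let tail = self.tail.load(Ordering::Acquire);
        let head = self.head.load(Ordering::Relaxed);
        self.head.store(tail, Ordering::Release);
        Some(tail.wrapping_sub(head))
    }

    /// Worker announces sleep, then rechecks the tail so a racing publish is
    /// not lost. Returns true if it is safe to sleep.
    pub fn worker_try_sleep(&self) -> bool {
        self.flags.fetch_or(SQ_NEED_WAKEUP, Ordering::SeqCst);
        if self.tail.load(Ordering::SeqCst) != self.head.load(Ordering::Relaxed) {
            self.flags.fetch_and(!SQ_NEED_WAKEUP, Ordering::SeqCst);
            return false;
        }
        true
    }

    /// Called when the worker is woken and resumes polling.
    pub fn worker_wake(&self) {
        self.flags.fetch_and(!SQ_NEED_WAKEUP, Ordering::SeqCst);
    }
}
```

The mode check in `drain` is the part a process-wide ring with timer-side
polling cannot provide: there, nothing stops a second path from advancing the
same head.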

SQPOLL full-nohz should therefore be staged behind:

- per-thread rings;
- a ring mode bit and quiescent mode transitions;
- per-CPU scheduler ownership and reschedule IPIs;
- a housekeeping CPU;
- removal or explicit placement of scheduler-tick-polled networking.

## Realtime Findings

Linux `SCHED_DEADLINE` uses runtime, deadline, and period parameters and
depends on admission/bandwidth management. Its documentation is explicit that
without admission control, no scheduling guarantee follows. For capOS, that
directly separates per-request deadline metadata from CPU budget authority: a
deadline attached to a request guarantees nothing unless an admitted budget
backs the work.
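
A minimal sketch of a utilization-based admission test, assuming a single CPU
and deadlines equal to periods. `Reservation`, `utilization_ppm`, `admit`, and
the `cap_ppm` parameter are hypothetical names; this is not the Linux
implementation, only the general idea that the sum of runtime/period must stay
under a configured bound.

```rust
/// Hypothetical CPU reservation: `runtime_ns` of CPU time every `period_ns`.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct Reservation {
    pub runtime_ns: u64,
    pub period_ns: u64,
}

/// Utilization in parts per million, rounded up so admission errs on the
/// safe side. Returns `None` for malformed reservations.
pub fn utilization_ppm(r: &Reservation) -> Option<u64> {
    if r.period_ns == 0 || r.runtime_ns > r.period_ns {
        return None;
    }
    let num = r.runtime_ns as u128 * 1_000_000u128 + (r.period_ns as u128 - 1);
    Some((num / r.period_ns as u128) as u64)
}

/// Admit `candidate` only if total utilization, including everything already
/// admitted, stays at or below `cap_ppm`.
pub fn admit(existing: &[Reservation], candidate: &Reservation, cap_ppm: u64) -> bool {
    let mut total = match utilization_ppm(candidate) {
        Some(u) => u,
        None => return false,
    };
    for r in existing {
        match utilization_ppm(r) {
            Some(u) => total = total.saturating_add(u),
            None => return false,
        }
    }
    total <= cap_ppm
}
```

For example, with 20% and 25% already admitted and a 95% cap, a 40% candidate
is accepted (85% total) and a 60% candidate is rejected (105% total).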

PREEMPT_RT's main lesson is that realtime latency is destroyed by long
non-preemptible sections, unbounded interrupt handling, and priority
inversion. Linux addresses this by making most kernel execution schedulable,
using priority-inheritance-aware locks, and threading interrupts. capOS does
not need to clone PREEMPT_RT, but any realtime path must keep IRQ top halves
short, avoid blocking locks in admitted hot paths, and provide donation or
inheritance for capability service calls.

seL4 MCS provides the strongest capability-OS precedent. Scheduling contexts
are kernel objects representing CPU-time authority; they carry budget and
period, are configured through per-CPU scheduling-control authority, and are
enforced with a sporadic-server model. Passive servers can run on a caller's
donated scheduling context and return it on reply.

For capOS:

- `SQE.deadline_ns` is request freshness metadata.
- `SchedulingContext` is CPU-time authority.
- `RealtimeIsland` is the admission object for a whole graph/loop.
- Scheduling-context donation is how timing survives synchronous capability
  calls through passive services.
- SQPOLL and AutoNoHz are executor/isolation backends, not the realtime
  authority itself.
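
To make the donation point concrete, the sketch below models a scheduling
context as an owned value that a passive server borrows for the duration of a
call and hands back on reply. It is a simplification under stated assumptions:
the budget is refilled at period boundaries rather than with seL4's
sporadic-server replenishments, and `PassiveServer`, `call`, and `charge` are
hypothetical names.

```rust
/// Sketch of CPU-time authority: a budget that refills every period.
#[derive(Debug, PartialEq, Eq)]
pub struct SchedulingContext {
    budget_ns: u64,
    period_ns: u64,
    remaining_ns: u64,
    next_refill_ns: u64,
}

impl SchedulingContext {
    /// Returns `None` for a zero period or a budget larger than the period.
    pub fn new(budget_ns: u64, period_ns: u64, now_ns: u64) -> Option<Self> {
        if period_ns == 0 || budget_ns > period_ns {
            return None;
        }
        Some(SchedulingContext {
            budget_ns,
            period_ns,
            remaining_ns: budget_ns,
            next_refill_ns: now_ns.checked_add(period_ns)?,
        })
    }

    pub fn remaining_ns(&self) -> u64 {
        self.remaining_ns
    }

    /// Charge `used_ns` at time `now_ns`. Returns false on overrun.
    /// Simplified periodic refill, not a sporadic server.
    pub fn charge(&mut self, now_ns: u64, used_ns: u64) -> bool {
        if now_ns >= self.next_refill_ns {
            let missed = (now_ns - self.next_refill_ns) / self.period_ns + 1;
            self.next_refill_ns = self
                .next_refill_ns
                .saturating_add(missed.saturating_mul(self.period_ns));
            self.remaining_ns = self.budget_ns;
        }
        if used_ns > self.remaining_ns {
            self.remaining_ns = 0;
            return false;
        }
        self.remaining_ns -= used_ns;
        true
    }
}

/// A passive server has no CPU-time authority of its own.
#[derive(Default)]
pub struct PassiveServer {
    donated: Option<SchedulingContext>,
}

impl PassiveServer {
    /// The caller donates its context for the call; the server's work is
    /// charged to it, and the context is returned with the reply.
    pub fn call(
        &mut self,
        sc: SchedulingContext,
        now_ns: u64,
        cost_ns: u64,
    ) -> (SchedulingContext, bool) {
        self.donated = Some(sc);
        let within_budget = match self.donated.as_mut() {
            Some(ctx) => ctx.charge(now_ns, cost_ns),
            None => false,
        };
        let sc = self.donated.take().expect("context was donated at call entry");
        (sc, within_budget)
    }

    pub fn is_passive(&self) -> bool {
        self.donated.is_none()
    }
}
```

Moving the context by value mirrors the intent: while the server runs, the
caller no longer holds the budget, and the server cannot keep it past the
reply.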

## capOS Design Consequences

1. Implement tickless idle before full-nohz.
2. Split clocksource from clockevent before stopping periodic ticks.
3. Convert timeout waiters to absolute monotonic deadlines before one-shot
   scheduling.
4. Replace user-mode idle with kernel/per-CPU idle before real tickless idle.
5. Keep periodic preemption while there is runnable contention.
6. Keep networking in `ForcedPeriodic` or move it to explicit IRQ/deadline
   polling before enabling tickless on network-active CPUs.
7. Treat full-nohz as a CPU lease and housekeeping design, not a standalone
   timer optimization.
8. Add SQPOLL only after per-thread rings and per-CPU scheduler ownership.
9. Require exactly one SQ consumer per ring in every ring mode.
10. Use `SQE.deadline_ns` only for freshness/drop/propagation policy; put
    budget, period, priority, CPU mask, and overrun policy in
    `SchedulingContext`.
11. Use realtime islands for media/robotics/control graphs; reject hard
    realtime claims until kernel path, IRQ, device, and WCET evidence exist.

