Proposal: Tickless and Realtime Scheduling
This proposal captures the scheduling design from the 2026-04-29 discussion: tickless idle is useful soon, generic full-nohz is premature, SQPOLL-oriented full-nohz belongs behind Ring v2 and CPU isolation, and realtime requires scheduling contexts rather than only per-request deadlines.
Design Grounding
The local docs/research/ contents were checked before adding this proposal.
The directly relevant grounding is:
- NO_HZ, SQPOLL, and Realtime Scheduling
- Out-of-kernel scheduling
- Completion rings and threaded runtimes
- Multimedia pipeline latency
- Robotics realtime control
- x2APIC and APIC virtualization
- Scheduling
- Ring v2 For Full SMP
- SMP
- Realtime Voice Agent Shell
External grounding is recorded in the research note so reviewers can audit the prior-art claims without treating this proposal as the source of truth.
Goals
- Add tickless idle: when a CPU has no runnable work, stop the periodic scheduler tick and program the local timer for the earliest known deadline.
- Split monotonic timekeeping from timer interrupt delivery.
- Convert scheduler timeout waiters to absolute monotonic deadlines.
- Stage full-nohz as an explicit CPU isolation/lease mode for SQPOLL and realtime executors, not as a generic scheduler default.
- Define SQE.deadline_ns as request freshness metadata.
- Define SchedulingContext as CPU-time authority.
- Define RealtimeIsland as the admission object for media, robotics, provider, and other bounded realtime graphs.
Non-Goals
- No generic NO_HZ_FULL for arbitrary user threads in the near term.
- No SQPOLL on the current process-wide ring.
- No second SQ consumer through timer-side polling for SQPOLL rings.
- No TSC-deadline or x2APIC requirement for the first tickless-idle milestone.
- No hard realtime claim before kernel-path, IRQ, device, locking, and WCET evidence exists.
- No full realtime policy blob inside every SQE.
CPU Authority Taxonomy
These terms must not drift into overlapping authority systems:
ResourceProfile:
policy template selected by identity, session, account, or service profile;
it is not spendable authority by itself.
ResourceLedger:
coarse accounting and quota owner for a resource class. It records and
enforces limits, including non-realtime CPU share/runtime budgets where the
scheduler has not minted finer scheduling contexts.
SchedulingContext:
spendable CPU-time authority with budget, period, relative deadline,
priority/criticality, CPU mask, and overrun policy.
CpuIsolationLease:
placement, exclusivity, and nohz/noise-isolation authority for a CPU or CPU
set. It does not grant CPU-time credit and must charge consumed time through
a SchedulingContext or coarse scheduler ResourceLedger.
NoHzEligibility:
a reviewed claim or hint that a thread, ring, poller, or island may use nohz
isolation if the scheduler can prove the current CPU state allows it.
NoHzActivation:
the scheduler-proven current CPU state that actually suppresses ticks.
RealtimeIsland:
admitted bundle of SchedulingContexts, memory reservations, device
reservations, rings, endpoint/service constraints, and optional
CpuIsolationLeases.
Scheduling-context donation is not generic resource donation. It donates only execution budget/deadline along a synchronous capability path; it does not donate capability authority, invocation subject identity, disclosure scope, memory budget, network budget, storage budget, or service-management authority.
Layer 1: Tickless Idle
Tickless idle should be the first behavioral milestone. It applies only when the CPU has no runnable thread and no local work that still depends on a periodic scheduler tick.
Clocksource
Add a monotonic clock layer:
pub fn monotonic_ns() -> u64;
The first backend can use the current periodic tick as a compatibility source while the system is still periodic. The selected QEMU/x86_64 backend should eventually use a calibrated stable counter, with SMP consistency handled when multiple scheduler owners exist.
Required invariant:
monotonic_ns() never moves backwards on one CPU.
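This invariant can be enforced in the clocksource layer itself. A minimal host-level sketch, assuming a hypothetical raw backend reading (`raw_ns`) that may jitter backwards during calibration changes; the wrapper clamps so callers never observe time moving backwards on one CPU:

```rust
/// Per-CPU monotonic clock wrapper (sketch): clamps a possibly
/// jittery raw counter so monotonic_ns() is non-decreasing.
struct MonotonicClock {
    last_ns: u64, // last value handed out on this CPU
}

impl MonotonicClock {
    fn new() -> Self {
        MonotonicClock { last_ns: 0 }
    }

    /// Clamp the raw reading so successive calls never go backwards.
    fn monotonic_ns(&mut self, raw_ns: u64) -> u64 {
        if raw_ns > self.last_ns {
            self.last_ns = raw_ns;
        }
        self.last_ns
    }
}

fn main() {
    let mut clk = MonotonicClock::new();
    assert_eq!(clk.monotonic_ns(100), 100);
    // A backward glitch in the backend is absorbed, not exposed.
    assert_eq!(clk.monotonic_ns(90), 100);
    assert_eq!(clk.monotonic_ns(150), 150);
}
```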
Clockevent
Add a small scheduler timer backend boundary:
trait ClockEvent {
    fn program_periodic(period_ns: u64);
    fn program_oneshot(delta_ns: u64);
    fn stop();
    fn min_delta_ns() -> u64;
    fn max_delta_ns() -> u64;
}
The first backend is the current PIT-calibrated xAPIC LAPIC timer on vector 48. PIT/PIC and periodic LAPIC remain fallback paths.
Deadline Waiters
Convert timeout state from tick counts to absolute deadlines:
struct DeadlineWaiter {
    deadline_ns: u64,
    target: ThreadRef,
    kind: WaiterKind,
    user_data: u64,
}
Affected paths:
- Timer.sleep;
- cap_enter(timeout_ns);
- ParkSpace timeout;
- future process/thread wait timeouts;
- network poll deadline through NetworkPollClock.
Waiter storage remains bounded. No interrupt path may allocate.
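The bounded waiter storage can be sketched as a fixed-capacity min-heap keyed by deadline_ns. The capacity bound and the u64 thread id below are illustrative stand-ins for the real storage limit and ThreadRef; rejecting inserts at capacity models the "no interrupt path may allocate" rule:

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

const MAX_WAITERS: usize = 64; // illustrative bound

#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord, Debug)]
struct Waiter {
    deadline_ns: u64,
    thread_id: u64, // stand-in for ThreadRef
}

struct WaiterQueue {
    heap: BinaryHeap<Reverse<Waiter>>, // min-heap by deadline_ns
}

impl WaiterQueue {
    fn new() -> Self {
        WaiterQueue { heap: BinaryHeap::with_capacity(MAX_WAITERS) }
    }

    /// Fail closed at capacity instead of allocating.
    fn insert(&mut self, w: Waiter) -> Result<(), Waiter> {
        if self.heap.len() >= MAX_WAITERS {
            return Err(w);
        }
        self.heap.push(Reverse(w));
        Ok(())
    }

    /// Feeds earliest_global_deadline() for one-shot programming.
    fn earliest_deadline_ns(&self) -> Option<u64> {
        self.heap.peek().map(|Reverse(w)| w.deadline_ns)
    }

    /// Pop every waiter whose absolute deadline has passed.
    fn expire(&mut self, now_ns: u64) -> Vec<Waiter> {
        let mut woken = Vec::new();
        while let Some(Reverse(w)) = self.heap.peek().copied() {
            if w.deadline_ns > now_ns {
                break;
            }
            self.heap.pop();
            woken.push(w);
        }
        woken
    }
}

fn main() {
    let mut q = WaiterQueue::new();
    q.insert(Waiter { deadline_ns: 300, thread_id: 1 }).unwrap();
    q.insert(Waiter { deadline_ns: 100, thread_id: 2 }).unwrap();
    assert_eq!(q.earliest_deadline_ns(), Some(100));
    // At now = 150 only the deadline-100 waiter has expired.
    assert_eq!(q.expire(150), vec![Waiter { deadline_ns: 100, thread_id: 2 }]);
    assert_eq!(q.earliest_deadline_ns(), Some(300));
}
```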
Network Poll Clock
The current kernel-resident networking path is scheduler-polled, so it keeps a
CPU in ForcedPeriodic unless networking exposes an explicit poll clock. The
intermediate interface should be:
trait NetworkPollClock {
    fn next_poll_deadline_ns(now_ns: u64) -> Option<u64>;
    fn poll_until_budget(now_ns: u64, budget_ns: u64) -> PollResult;
}
next_poll_deadline_ns lets the scheduler include TCP/runtime timers in
earliest_global_deadline(). poll_until_budget prevents network progress
from becoming an unbounded idle-exit or interrupt path. A CPU with active
networking may enter tickless idle only when the network runtime is inactive or
has exposed a bounded deadline through this interface.
Kernel Idle
Tickless idle depends on replacing the user-mode idle process with a kernel/per-CPU idle context. Timer IRQ handling must distinguish:
IRQ from CPL3 user thread -> save/restore user context
IRQ from CPL0 idle -> wake/check scheduler without fake user context
Idle entry shape:
if no runnable work:
deadline = earliest_global_deadline()
clockevent.program_oneshot(deadline - now)
enter_kernel_idle()
The idle loop enables interrupts, halts, wakes on timer/IPI/device interrupt, then rechecks runnable work and deadline expiry.
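The `deadline - now` step in the idle entry needs clamping against the clockevent bounds: a deadline that is already due still programs min_delta_ns so the timer fires as soon as the hardware allows, and a distant deadline clamps to max_delta_ns and simply re-arms on expiry. A sketch with illustrative bound values:

```rust
/// One-shot programming helper (sketch). The min/max bounds come from
/// ClockEvent::min_delta_ns()/max_delta_ns(); values here are examples.
fn oneshot_delta_ns(now_ns: u64, deadline_ns: u64, min_delta: u64, max_delta: u64) -> u64 {
    // saturating_sub handles a deadline at or before "now";
    // clamp keeps the delta within what the hardware can program.
    deadline_ns.saturating_sub(now_ns).clamp(min_delta, max_delta)
}

fn main() {
    assert_eq!(oneshot_delta_ns(1_000, 1_500, 100, 1_000_000), 500);
    // Already due: program the minimum delta, never underflow.
    assert_eq!(oneshot_delta_ns(1_000, 900, 100, 1_000_000), 100);
    // Far future: clamp and re-arm when the timer fires.
    assert_eq!(oneshot_delta_ns(0, u64::MAX, 100, 1_000_000), 1_000_000);
}
```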
Tickless State
Per CPU:
Periodic:
normal scheduler tick active
TicklessIdle:
no runnable thread
one-shot local timer programmed for earliest deadline
CPU in kernel idle
ForcedPeriodic:
fallback when a subsystem still needs regular polling
Enter TicklessIdle only when:
run queue empty
no direct IPC target
no deferred completion work
no timer-side ring work required
clockevent supports one-shot
kernel idle context available
network runtime inactive or deadline-driven
Keep periodic preemption whenever there is runnable contention. Even one runnable user thread remains periodic until Ring v2, CPU accounting, and timer-side polling dependencies are resolved.
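The entry conditions above can be written as a pure predicate over a per-CPU snapshot. The field names below are illustrative stand-ins, not the kernel's actual types:

```rust
/// Illustrative per-CPU snapshot for the tickless-idle entry check.
struct CpuSnapshot {
    runnable_threads: usize,
    pending_ipc_target: bool,
    deferred_completions: usize,
    timer_side_ring_work: bool,
    oneshot_supported: bool,
    kernel_idle_available: bool,
    network_inactive_or_deadline_driven: bool,
}

/// All conditions must hold; any failure keeps the CPU periodic.
fn can_enter_tickless_idle(s: &CpuSnapshot) -> bool {
    s.runnable_threads == 0
        && !s.pending_ipc_target
        && s.deferred_completions == 0
        && !s.timer_side_ring_work
        && s.oneshot_supported
        && s.kernel_idle_available
        && s.network_inactive_or_deadline_driven
}

fn main() {
    let idle = CpuSnapshot {
        runnable_threads: 0,
        pending_ipc_target: false,
        deferred_completions: 0,
        timer_side_ring_work: false,
        oneshot_supported: true,
        kernel_idle_available: true,
        network_inactive_or_deadline_driven: true,
    };
    assert!(can_enter_tickless_idle(&idle));
    // Even one runnable thread keeps the CPU periodic.
    assert!(!can_enter_tickless_idle(&CpuSnapshot { runnable_threads: 1, ..idle }));
}
```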
Layer 2: SQPOLL NoHz
SQPOLL full-nohz is a later CPU ownership mode:
full-nohz is not a timer feature here;
it is part of the SQPOLL CPU ownership contract.
Required prerequisites:
- Ring v2 or equivalent per-thread rings;
- one SQ consumer per ring;
- per-CPU scheduler ownership;
- reschedule IPI and idle-to-runnable handoff;
- at least one housekeeping CPU;
- explicit placement of network polling away from isolated CPUs.
Ring mode:
enum RingMode {
    Syscall,
    SqpollStarting,
    Sqpoll,
    SqpollStopping,
}
In syscall mode, the owner thread’s cap_enter drains SQ. In SQPOLL mode, a
kernel worker owns SQ head; userspace owns SQ tail and CQ head; cap_enter
waits for completions and may wake a sleeping poller, but it does not drain
SQ.
SQPOLL state:
Disabled -> Starting -> Running -> IdleSpinning -> Sleeping -> Stopping
The wake protocol uses a NEED_WAKEUP flag. Userspace release-stores the SQ
tail, acquire-loads flags, and invokes a wake path only if the poller has gone
to sleep.
The race-free sequence is normative.
Poller before sleeping:
flags.fetch_or(NEED_WAKEUP, SeqCst);
let tail = sq_tail.load(Acquire);
if sq_head != tail {
    flags.fetch_and(!NEED_WAKEUP, Release);
    continue;
}
park();
Producer:
write_sqe();
sq_tail.store(new_tail, Release);
fence(SeqCst);
let flags = flags.load(Acquire);
if flags & NEED_WAKEUP != 0 {
    wake_poller();
}
The poller must set NEED_WAKEUP before the final tail recheck. Otherwise a
producer can publish a new SQE after the poller checks the tail but before it
parks, losing the wake.
The NEED_WAKEUP publication must also be ordered before the final tail
recheck by a full store-to-load barrier. A SeqCst RMW is the simplest
portable rule for the ABI text; an implementation may substitute an explicitly
reviewed architecture-specific fence or park primitive that provides the same
ordering. A plain release store or release-only RMW is not sufficient for this
protocol.
The producer must likewise order the SQ tail publication before checking
NEED_WAKEUP. The normative sequence uses a full fence between
sq_tail.store(..., Release) and flags.load(Acquire); an implementation may
substitute an explicitly reviewed equivalent that prevents the producer from
missing NEED_WAKEUP while the poller misses the new tail before parking.
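The normative sequence can be exercised as a host-level model before any real SQPOLL worker exists. The sketch below uses std atomics and `std::thread::park`/`unpark` as the token-based park primitive; the kernel would use its own park/wake path, and a Loom-style exhaustive run of the same shape is the stronger check. Because of the SeqCst RMW on the poller side and the SeqCst fence on the producer side, at least one side always observes the other, so no published SQE is ever stranded behind a parked poller:

```rust
use std::sync::atomic::{fence, AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

const NEED_WAKEUP: u64 = 1;

struct Ring {
    sq_tail: AtomicU64,
    flags: AtomicU64,
}

/// Run one producer against one poller; returns SQEs consumed.
/// A lost wakeup would leave the poller parked and hang the join.
fn run_model(total: u64) -> u64 {
    let ring = Arc::new(Ring {
        sq_tail: AtomicU64::new(0),
        flags: AtomicU64::new(0),
    });

    let r = Arc::clone(&ring);
    let poller = thread::spawn(move || {
        let mut sq_head = 0u64;
        while sq_head < total {
            let tail = r.sq_tail.load(Ordering::Acquire);
            if sq_head < tail {
                sq_head = tail; // "process" the published SQEs
                continue;
            }
            // Normative sleep sequence: the SeqCst RMW publishes
            // NEED_WAKEUP before the final tail recheck.
            r.flags.fetch_or(NEED_WAKEUP, Ordering::SeqCst);
            if sq_head != r.sq_tail.load(Ordering::Acquire) {
                r.flags.fetch_and(!NEED_WAKEUP, Ordering::Release);
                continue;
            }
            thread::park(); // token-based: an early unpark is not lost
            r.flags.fetch_and(!NEED_WAKEUP, Ordering::Release);
        }
        sq_head
    });

    let poller_thread = poller.thread().clone();
    for i in 1..=total {
        // write_sqe() would fill the entry here.
        ring.sq_tail.store(i, Ordering::Release); // publish tail
        fence(Ordering::SeqCst); // order publish before the flag check
        if ring.flags.load(Ordering::Acquire) & NEED_WAKEUP != 0 {
            poller_thread.unpark();
        }
    }

    poller.join().unwrap()
}

fn main() {
    assert_eq!(run_model(100_000), 100_000);
}
```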
An SQPOLL CPU may suppress the periodic tick only if:
cpu role is SqpollIsolated
exactly one runnable entity is the poller
no ordinary user thread is runnable there
no timer-side SQ polling is enabled
no network scheduler polling is pinned there
no deferred cleanup is pinned there
stable clocksource/accounting exists
housekeeping CPU is online
If any condition fails, restore periodic tick or migrate the unrelated work.
NoHz Activation Proof Obligations
To enter SqpollNoHz or future AutoNoHz, the scheduler must prove:
exactly one runnable entity is assigned to the CPU
at least one housekeeping CPU is online
no local network polling dependency remains
no timer-side SQ polling can run for the active ring
no local deferred cleanup or unbound kernel worker is pinned there
no unmigratable IRQ targets that CPU unless explicitly allowed
clocksource and CPU accounting are boundary/counter driven, not tick driven
revocation latency is within the lease policy
The proof is dynamic. If any condition stops holding, the scheduler must restore periodic tick, migrate unrelated work, revoke the lease, or leave nohz mode before continuing.
Layer 3: AutoNoHz CPU Lease
The long-term design should split eligibility from activation.
Eligibility says a thread, process, ring, or realtime island may use nohz isolation:
enum NoHzKind {
    Idle,
    KernelSqpoll,
    AutoCompute,
    AutoUserspacePoller,
    RealtimeIsland,
}

struct NoHzEligibility {
    kind: NoHzKind,
    max_revocation_latency_ns: u64,
    preferred_cpus: CpuSet,
    allow_busy_spin: bool,
    accounting_target: CpuAccountingTarget,
}

enum CpuAccountingTarget {
    CurrentSchedulingContext,
    SchedulerResourceLedger,
}
Activation is a scheduler proof that a CPU currently satisfies isolation conditions. Without a lease, a latency-sensitive hint may influence placement but must not grant exclusive CPU access.
Future lease shape:
CpuIsolationLease:
owner process/session
allowed CPU set
allowed mode: poller/compute/kernel-worker
accounting target, not CPU-time credit
revocation policy
Housekeeping must be explicit:
Housekeeping CPU set:
global timers
deferred frees
cleanup
statistics
non-critical kernel workers
debug/watchdog
load balancing and migration control
Layer 4: Deadline Metadata
Deadline metadata lives in fixed ring ABI fields, not in a Cap’n Proto SQE
envelope and not in variable side metadata. The current fixed SQE layout should
not be silently reinterpreted; add these fields through a versioned
CapSqeV2/ring ABI gate when the transport is ready.
#[repr(C)]
struct CapSqeV2 {
    // existing fixed CapSqe fields, unchanged in order and meaning
    deadline_ns: u64,  // absolute monotonic deadline, 0 = none
    qos_flags: u32,    // drop/allow/reorder/propagate semantics
    sched_ctx_id: u32, // 0 = current/default scheduling context
}
deadline_ns is an absolute monotonic timestamp. It is request freshness
metadata, not a promise of nanosecond wakeup precision. The kernel may round
timer programming to clockevent granularity, coalesce timers where policy
allows, or report a miss when dispatch observes the timestamp has already
expired. The field remains u64 nanoseconds because absolute u64 ns values
are simple, tracing-friendly, and shared with existing timeout surfaces; a
u64 microsecond field saves no ABI space.
Only consider a compact profile if SQE space becomes critical:
deadline_delta_us: u32
That profile would be a soft-deadline compact transport shape only. It is not
the primary realtime or SchedulingContext ABI and must not replace
deadline_ns for admitted realtime work.
ABI negotiation uses both bootstrap metadata and a runtime query surface:
struct RuntimeBootInfo {
    ring_addr: u64,
    ring_abi_version: u32,
    sqe_size: u16,
    cqe_size: u16,
}
- Process bootstrap passes the ring ABI version and fixed entry sizes alongside the ring address.
- RuntimeBootInfo, ring ABI version constants, and fixed SQE/CQE layouts live in capos-config/src/ring.rs; the kernel and capos-rt import the same definition rather than carrying local copies.
- A future RuntimeInfo/SystemInfo query returns the kernel-supported ring ABI range so language runtimes can fail before mapping incompatible rings.
- cap_enter rejects unsupported SQE versions or entry sizes with stable transport errors such as CAP_ERR_UNSUPPORTED_RING_ABI and CAP_ERR_UNSUPPORTED_SQE_VERSION.
- Runtimes in Rust, C, Go, and other languages must generate or mirror the exact fixed layout for the negotiated version.
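The runtime-side check can be sketched as follows. The supported version range and the per-version entry sizes below are illustrative assumptions, not the real constants from capos-config/src/ring.rs:

```rust
/// Illustrative transport errors; stand-ins for the stable codes
/// CAP_ERR_UNSUPPORTED_RING_ABI / CAP_ERR_UNSUPPORTED_SQE_VERSION.
#[derive(Debug, PartialEq)]
enum CapErr {
    UnsupportedRingAbi,
    UnsupportedSqeVersion,
}

struct RuntimeBootInfo {
    ring_addr: u64,
    ring_abi_version: u32,
    sqe_size: u16,
    cqe_size: u16,
}

const MIN_RING_ABI: u32 = 1; // illustrative supported range
const MAX_RING_ABI: u32 = 2;

/// Illustrative fixed entry sizes per negotiated version.
fn expected_entry_sizes(version: u32) -> (u16, u16) {
    match version {
        1 => (64, 32),
        _ => (80, 32), // v2 adds deadline_ns + qos_flags + sched_ctx_id
    }
}

/// Fail before mapping the ring if version or sizes disagree.
fn negotiate(info: &RuntimeBootInfo) -> Result<u32, CapErr> {
    if !(MIN_RING_ABI..=MAX_RING_ABI).contains(&info.ring_abi_version) {
        return Err(CapErr::UnsupportedRingAbi);
    }
    let (sqe, cqe) = expected_entry_sizes(info.ring_abi_version);
    if info.sqe_size != sqe || info.cqe_size != cqe {
        return Err(CapErr::UnsupportedSqeVersion);
    }
    Ok(info.ring_abi_version) // safe to map the ring at ring_addr
}

fn main() {
    let ok = RuntimeBootInfo { ring_addr: 0x1000, ring_abi_version: 2, sqe_size: 80, cqe_size: 32 };
    assert_eq!(negotiate(&ok), Ok(2));
    let bad = RuntimeBootInfo { ring_abi_version: 9, ..ok };
    assert_eq!(negotiate(&bad), Err(CapErr::UnsupportedRingAbi));
}
```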
Suggested flags:
DROP_IF_LATE:
if now > deadline_ns before dispatch, post DEADLINE_EXPIRED
ALLOW_LATE:
dispatch anyway, but CQE/telemetry marks late
PROPAGATE_DEADLINE:
endpoint CALL/RETURN carries deadline metadata to server-side request
DEADLINE_ORDERED:
SQPOLL may reorder within a bounded window only when all reorder-safety
checks below pass
NO_BLOCKING_PATH:
reject if target method/op is not declared realtime-safe
Do not put budget, period, priority, criticality, or CPU affinity into each SQE. Deadline is per request. Budget is execution authority.
DEADLINE_ORDERED is valid only when all of the following are true:
the ring mode permits reordering
the SQE marks this request reorderable
the target capability interface and method declare reorder-safe semantics
the reordering window is bounded
the operation does not depend on earlier same-ring requests for correctness
Ordered side effects such as write A; write B; flush or lock; mutate; unlock must not be deadline-reordered unless the target method contract
explicitly defines that sequence as reorder-safe.
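The dispatch-time freshness decision for DROP_IF_LATE and ALLOW_LATE can be sketched as a small pure function. The flag bit values, the outcome enum, and the default treatment of late work with neither flag set are illustrative assumptions, not the final qos_flags encoding:

```rust
const DROP_IF_LATE: u32 = 1 << 0; // illustrative bit assignments
const ALLOW_LATE: u32 = 1 << 1;

#[derive(Debug, PartialEq)]
enum Dispatch {
    OnTime,
    LateButDispatched, // CQE/telemetry marks the request late
    DeadlineExpired,   // post DEADLINE_EXPIRED without dispatching
}

fn check_freshness(now_ns: u64, deadline_ns: u64, qos_flags: u32) -> Dispatch {
    if deadline_ns == 0 || now_ns <= deadline_ns {
        return Dispatch::OnTime; // 0 means "no deadline"
    }
    if qos_flags & DROP_IF_LATE != 0 {
        Dispatch::DeadlineExpired
    } else if qos_flags & ALLOW_LATE != 0 {
        Dispatch::LateButDispatched
    } else {
        Dispatch::LateButDispatched // default late policy: assumption
    }
}

fn main() {
    assert_eq!(check_freshness(100, 0, DROP_IF_LATE), Dispatch::OnTime);
    assert_eq!(check_freshness(100, 200, DROP_IF_LATE), Dispatch::OnTime);
    assert_eq!(check_freshness(300, 200, DROP_IF_LATE), Dispatch::DeadlineExpired);
    assert_eq!(check_freshness(300, 200, ALLOW_LATE), Dispatch::LateButDispatched);
}
```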
Layer 5: SchedulingContext
CPU time should become a capability-controlled object:
struct SchedulingContext {
    budget_ns: u64,
    period_ns: u64,
    relative_deadline_ns: u64,
    priority: u16,
    criticality: u8,
    cpu_mask: CpuSet,
    overrun_policy: OverrunPolicy,
    timeout_endpoint: Option<EndpointRef>,
}
Kernel responsibilities:
- decrement remaining budget by actual runtime;
- replenish budget by period;
- throttle or fault a thread on depletion;
- enforce CPU mask and scheduling eligibility;
- dispatch among eligible contexts by the selected realtime policy;
- prevent untrusted SQE bytes from minting budget.
Policy-service responsibilities:
- admission control;
- budget/period/priority selection;
- CPU-isolation lease policy;
- overload response;
- telemetry and retuning.
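The kernel-side budget rules can be sketched as a small accounting state machine, assuming a simple throttle-on-depletion overrun policy; the real behavior dispatches on OverrunPolicy and the selected realtime policy:

```rust
/// Budget accounting sketch for one scheduling context.
struct BudgetState {
    budget_ns: u64,
    period_ns: u64,
    remaining_ns: u64,
    next_replenish_ns: u64,
    throttled: bool,
}

impl BudgetState {
    fn new(budget_ns: u64, period_ns: u64, now_ns: u64) -> Self {
        BudgetState {
            budget_ns,
            period_ns,
            remaining_ns: budget_ns,
            next_replenish_ns: now_ns + period_ns,
            throttled: false,
        }
    }

    /// Decrement remaining budget by actual runtime; throttle on depletion.
    fn charge(&mut self, ran_ns: u64) {
        self.remaining_ns = self.remaining_ns.saturating_sub(ran_ns);
        if self.remaining_ns == 0 {
            self.throttled = true;
        }
    }

    /// Replenish budget by period; also catches up over missed periods.
    fn replenish(&mut self, now_ns: u64) {
        while now_ns >= self.next_replenish_ns {
            self.remaining_ns = self.budget_ns;
            self.throttled = false;
            self.next_replenish_ns += self.period_ns;
        }
    }

    fn eligible(&self) -> bool {
        !self.throttled
    }
}

fn main() {
    // 2 ms budget every 10 ms period.
    let mut ctx = BudgetState::new(2_000_000, 10_000_000, 0);
    ctx.charge(2_000_000);
    assert!(!ctx.eligible()); // depleted: throttled until replenishment
    ctx.replenish(10_000_000);
    assert!(ctx.eligible());
    assert_eq!(ctx.remaining_ns, 2_000_000);
}
```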
Layer 6: Donation
Synchronous capability calls need scheduling-context donation:
client SchedulingContext -> passive server endpoint
server runs on donated budget/deadline
context returns on reply
timeout/overrun reports to caller or island policy
Without donation or inheritance, a realtime caller can be defeated by a normal-priority server that holds the capability implementation path.
Donation semantics must be fixed before implementation:
max donation call depth:
bounded per SchedulingContext or RealtimeIsland; overflow fails closed.
nested donation:
nested synchronous calls carry the current donated context until the depth
bound, unless a callee uses its own admitted context by explicit policy.
cycle handling:
a donated context may not re-enter a thread already on its donation stack;
cycles fail with a typed realtime/donation error.
partial failure:
budget already consumed stays charged to the context that ran the work.
rollback of authority or memory is separate from CPU charge rollback.
timeout propagation:
the earliest of request deadline, scheduling-context deadline, and explicit
call timeout bounds downstream execution.
server-side blocking:
a passive server running on donated context may block only on approved
realtime-safe waits or synchronous calls that continue donation.
return on exception:
application exceptions, transport errors, and cancellation return the
context to its previous owner before CQE/error delivery.
async endpoint queues:
donation does not cross ordinary async endpoint enqueue by default. Async
donation requires an explicit future token/lease design.
Hot admitted paths should avoid blocking locks. If a shared resource cannot be modeled as a passive service, it needs a reviewed priority/deadline-inheritance primitive or a bounded try-lock/fail/drop policy.
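The depth-bound and cycle rules above can be sketched as donation-stack bookkeeping. The error names and u64 thread ids are illustrative stand-ins for the typed donation errors and real thread references:

```rust
#[derive(Debug, PartialEq)]
enum DonationErr {
    DepthExceeded, // overflow fails closed
    Cycle,         // re-entering a thread already on the donation stack
}

/// Threads currently running on one donated context, in call order.
struct DonationStack {
    max_depth: usize,
    threads: Vec<u64>,
}

impl DonationStack {
    fn new(max_depth: usize) -> Self {
        DonationStack { max_depth, threads: Vec::new() }
    }

    /// Synchronous call: donate the context to the callee's thread.
    fn donate_to(&mut self, thread_id: u64) -> Result<(), DonationErr> {
        if self.threads.len() >= self.max_depth {
            return Err(DonationErr::DepthExceeded);
        }
        if self.threads.contains(&thread_id) {
            return Err(DonationErr::Cycle);
        }
        self.threads.push(thread_id);
        Ok(())
    }

    /// Reply, exception, or cancellation returns the context.
    fn return_from(&mut self) {
        self.threads.pop();
    }
}

fn main() {
    let mut stack = DonationStack::new(2);
    assert_eq!(stack.donate_to(1), Ok(()));
    assert_eq!(stack.donate_to(2), Ok(()));
    assert_eq!(stack.donate_to(3), Err(DonationErr::DepthExceeded));
    stack.return_from();
    // Thread 1 is still on the stack, so re-entry is a cycle.
    assert_eq!(stack.donate_to(1), Err(DonationErr::Cycle));
}
```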
Layer 7: RealtimeIsland
RealtimeIsland admits a whole loop or graph:
struct RealtimeIslandSpec {
    period_ns: u64,
    deadline_ns: u64,
    cpu_set: CpuSet,
    nodes: Vec<NodeBudget>,
    rings: Vec<RingSpec>,
    memory: Vec<PreallocSpec>,
    devices: Vec<DeviceReservation>,
    overrun_policy: OverrunPolicy,
}
Admission requires:
- total budget fits period/deadline constraints;
- all hot-path buffers are preallocated;
- hot-path memory is committed and resident before start;
- guaranteed hot-path memory uses the OOM proposal’s MemoryResidency policy as pinned or secret; normal memory is not admitted for guaranteed hot paths. A future lock-resident operation may transition ordinary memory into a pinned reservation before admission, but the admitted island sees the result as pinned, not as normal;
- all caps and policy decisions are resolved before start;
- no expected page faults on the hot path;
- no unbounded lock acquisition;
- no blocking endpoint calls inside callback loops;
- no allocation, logging, service discovery, or provider credential work on the realtime path;
- IRQ and deferred work are bounded or moved outside the island.
Failure semantics must be typed:
CAP_ERR_DEADLINE_EXPIRED
CAP_ERR_BUDGET_EXHAUSTED
CAP_ERR_REALTIME_UNSAFE_PATH
CAP_ERR_REALTIME_ADMISSION_DENIED
CAP_ERR_OVERRUN
CAP_ERR_STALE_INPUT
CQE/status should distinguish not-started-late, completed-late, dropped by policy, throttled, and dependency-cancelled.
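Of those distinctions, the lateness trio can be derived purely from timestamps; dropped-by-policy, throttled, and dependency-cancelled are policy events carried separately. A sketch with illustrative names:

```rust
#[derive(Debug, PartialEq)]
enum LateOutcome {
    OnTime,
    NotStartedLate, // deadline had already passed when dispatch began
    CompletedLate,  // started in time but finished past the deadline
}

fn classify(deadline_ns: u64, started_ns: u64, completed_ns: u64) -> LateOutcome {
    if completed_ns <= deadline_ns {
        LateOutcome::OnTime
    } else if started_ns > deadline_ns {
        LateOutcome::NotStartedLate
    } else {
        LateOutcome::CompletedLate
    }
}

fn main() {
    assert_eq!(classify(100, 10, 90), LateOutcome::OnTime);
    assert_eq!(classify(100, 120, 130), LateOutcome::NotStartedLate);
    assert_eq!(classify(100, 50, 130), LateOutcome::CompletedLate);
}
```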
Telemetry Requirements
Tickless, nohz, SQPOLL, and realtime behavior must be observable through future monitoring/status capability surfaces, not only through ad hoc debug logs. The first counters should include:
scheduler_tick_count{cpu}
ticks_suppressed{cpu,mode}
nohz_enter_count{cpu,kind}
nohz_exit_count{cpu,reason}
oneshot_deadline_miss_count
sqpoll_busy_ns
sqpoll_sleep_count
deadline_expired_count
budget_exhausted_count
realtime_overrun_count
donation_depth_max
housekeeping_offload_count
These counters are correctness evidence. Missing or surprising values should fail focused nohz/realtime proofs rather than being treated as performance-only diagnostics.
Implementation Sequence
- Add timer/scheduler instrumentation around the existing periodic tick.
- Add monotonic_ns() and switch Timer.now to the clocksource layer while keeping periodic scheduling.
- Convert timeout waiters to deadline_ns.
- Add LAPIC one-shot programming and a focused one-shot smoke.
- Replace user-mode idle with kernel/per-CPU idle while keeping periodic ticks.
- Enable tickless idle only when there is no runnable work.
- Keep networking in ForcedPeriodic or add explicit network poll deadlines.
- Add ring mode state and refuse timer-side SQ processing for SQPOLL rings.
- Land Ring v2 per-thread ring ownership and completion routing.
- Add the SQPOLL wake/sleep protocol and a host or Loom-style lost-wakeup model.
- Add kernel SQPOLL without full-nohz, under normal scheduler ticks.
- Add CPU isolation leases and housekeeping CPU placement.
- Enable SQPOLL nohz on isolated CPUs.
- Add request deadline_ns metadata and typed late/drop CQE outcomes.
- Add SchedulingContext and admission-controlled realtime islands.
Verification
Tickless idle gates:
make fmt-check
cargo test-lib
cargo test-config
make run-smoke
make run-spawn
Additional tickless proof:
1 second idle interval does not produce 100 scheduler ticks
Timer.sleep still completes
cap_enter timeout still completes
ParkSpace timeout still completes
preemption fairness unchanged with runnable contention
SQPOLL gates:
thread-lifecycle
timer-smoke
timer-flood
park wake/timeout
endpoint CALL/RECV/RETURN
mandatory host or Loom-style lost-wakeup model before any real SQPOLL worker:
poller: set NEED_WAKEUP -> full barrier -> recheck tail -> park
producer: write SQE -> publish tail -> full barrier -> check NEED_WAKEUP -> wake
Realtime gates:
deadline ordering tests
budget depletion tests
donation/return tests through passive endpoint
admission denial tests
QEMU proof for late/drop/overrun behavior
telemetry counters prove ticks suppressed, deadlines expired, budgets
exhausted, and donation depth bounded as expected
Decision
Adopt this staged direction:
Tickless idle:
yes, after clocksource/clockevent split and kernel idle.
Generic full-nohz:
defer. It depends on per-CPU scheduling, Ring v2, accounting, and
housekeeping.
SQPOLL nohz:
yes, but only as explicit CPU-isolation authority after Ring v2.
Realtime:
`SQE.deadline_ns` is useful metadata, but `SchedulingContext` is the
authority that provides CPU time.