Proposal: Tickless and Realtime Scheduling
This proposal captures the scheduling design from the 2026-04-29 discussion and the subsequent implementation status: tickless idle is useful, full-nohz belongs behind explicit CPU isolation authority, and realtime requires scheduling contexts rather than only per-request deadlines.
Design Grounding
The directly relevant grounding is:
- NO_HZ, SQPOLL, and Realtime Scheduling
- Out-of-kernel scheduling
- Completion rings and threaded runtimes
- Multimedia pipeline latency
- Robotics realtime control
- x2APIC and APIC virtualization
- Scheduling
- Ring v2 For Full SMP
- SMP
- Realtime Voice Agent Shell
External grounding is recorded in the research note so reviewers can audit the prior-art claims without treating this proposal as the source of truth.
Goals
- Add tickless idle: when a CPU has no runnable work, stop the periodic scheduler tick and program the local timer for the earliest known deadline.
- Split monotonic timekeeping from timer interrupt delivery.
- Convert scheduler timeout waiters to absolute monotonic deadlines.
- Stage full-nohz as an explicit CPU isolation/lease mode for SQPOLL and realtime executors, not as a generic scheduler default.
- Define SQE.deadline_ns as request freshness metadata.
- Define SchedulingContext as CPU-time authority.
- Define RealtimeIsland as the admission object for media, robotics, provider, and other bounded realtime graphs.
Non-Goals
- No ambient Linux-style NO_HZ_FULL for arbitrary unbudgeted user threads. Ordinary-thread full-nohz requires an explicit budgeted SchedulingContext target and a CpuIsolationLease.
- No SQPOLL on the current process-wide ring.
- No second SQ consumer through timer-side polling for SQPOLL rings.
- No TSC-deadline or x2APIC requirement for the first tickless-idle milestone.
- No hard realtime claim before kernel-path, IRQ, device, locking, and WCET evidence exists.
- No full realtime policy blob inside every SQE.
CPU Authority Taxonomy
These terms must not drift into overlapping authority systems:
ResourceProfile:
policy template selected by identity, session, account, or service profile;
it is not spendable authority by itself.
ResourceLedger:
coarse accounting and quota owner for a resource class. It records and
enforces limits, including non-realtime CPU share/runtime budgets where the
scheduler has not minted finer scheduling contexts.
SchedulingContext:
spendable CPU-time authority with budget, period, relative deadline,
priority/criticality, CPU mask, and overrun policy.
CpuIsolationLease:
placement, exclusivity, and nohz/noise-isolation authority for a CPU or CPU
set. It does not grant CPU-time credit and must charge consumed time through
a SchedulingContext or coarse scheduler ResourceLedger.
NoHzEligibility:
a reviewed claim or hint that a thread, ring, poller, or island may use nohz
isolation if the scheduler can prove the current CPU state allows it.
NoHzActivation:
the scheduler-proven current CPU state that actually suppresses ticks.
RealtimeIsland:
admitted bundle of SchedulingContexts, memory reservations, device
reservations, rings, endpoint/service constraints, and optional
CpuIsolationLeases.
Scheduling-context donation is not generic resource donation. It donates only execution budget/deadline along a synchronous capability path; it does not donate capability authority, invocation subject identity, disclosure scope, memory budget, network budget, storage budget, or service-management authority.
Layer 1: Tickless Idle
Tickless idle should be the first behavioral milestone. It applies only when the CPU has no runnable thread and no local work that still depends on a periodic scheduler tick.
Clocksource
Add a monotonic clock layer:
pub fn monotonic_ns() -> u64;
The first backend can use the current periodic tick as a compatibility source while the system is still periodic. The selected QEMU/x86_64 backend should eventually use a calibrated stable counter, with SMP consistency handled when multiple scheduler owners exist.
Required invariant:
monotonic_ns() never moves backwards on one CPU.
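One way to hold this invariant regardless of the backend is to clamp every published reading against the last published value. The sketch below is illustrative only: MonotonicClock and publish are hypothetical names, and it uses a single shared atomic, which gives a stronger (global) guarantee than the per-CPU invariant requires.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical publication point for monotonic time (illustration only).
pub struct MonotonicClock {
    last_ns: AtomicU64,
}

impl MonotonicClock {
    pub const fn new() -> Self {
        Self { last_ns: AtomicU64::new(0) }
    }

    /// Publish a raw backend reading and return the value callers may observe.
    /// A reading that lags the last published value is clamped, so the
    /// returned time never decreases.
    pub fn publish(&self, raw_ns: u64) -> u64 {
        let prev = self.last_ns.fetch_max(raw_ns, Ordering::AcqRel);
        prev.max(raw_ns)
    }

    /// Read the most recently published time.
    pub fn monotonic_ns(&self) -> u64 {
        self.last_ns.load(Ordering::Acquire)
    }
}
```

Clamping only guarantees non-decreasing time. It does not guarantee a minimum forward rate, which is why a separate discipline step is still needed when the backend counter misbehaves.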
Clockevent
Add a small scheduler timer backend boundary:
trait ClockEvent {
    fn program_periodic(period_ns: u64);
    fn program_oneshot(delta_ns: u64);
    fn stop();
    fn min_delta_ns() -> u64;
    fn max_delta_ns() -> u64;
}
The first backend is the current PIT-calibrated xAPIC LAPIC timer on vector 48. PIT/PIC and periodic LAPIC remain fallback paths.
Deadline Waiters
Convert timeout state from tick counts to absolute deadlines:
struct DeadlineWaiter {
    deadline_ns: u64,
    target: ThreadRef,
    kind: WaiterKind,
    user_data: u64,
}
Affected paths:
- Timer.sleep;
- cap_enter(timeout_ns);
- ParkSpace timeout;
- future process/thread wait timeouts;
- network poll deadline through NetworkPollClock.
Waiter storage remains bounded. No interrupt path may allocate.
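A minimal sketch of bounded, allocation-free waiter storage follows. It assumes a simplified waiter (a plain u32 stands in for ThreadRef, and WaiterKind is omitted), and WaiterTable with its method names is hypothetical. A linear scan is used for clarity; a real table might keep waiters ordered.

```rust
/// Simplified waiter; `target` stands in for the proposal's ThreadRef.
#[derive(Clone, Copy, Debug, PartialEq)]
pub struct DeadlineWaiter {
    pub deadline_ns: u64,
    pub target: u32,
    pub user_data: u64,
}

/// Fixed-capacity waiter storage: no heap allocation on any path.
pub struct WaiterTable<const N: usize> {
    slots: [Option<DeadlineWaiter>; N],
}

impl<const N: usize> WaiterTable<N> {
    pub const fn new() -> Self {
        Self { slots: [None; N] }
    }

    /// Register a waiter; fails closed (returns the waiter) when full.
    pub fn insert(&mut self, w: DeadlineWaiter) -> Result<(), DeadlineWaiter> {
        for slot in self.slots.iter_mut() {
            if slot.is_none() {
                *slot = Some(w);
                return Ok(());
            }
        }
        Err(w)
    }

    /// Earliest pending absolute deadline, used to program the one-shot timer.
    pub fn earliest_deadline_ns(&self) -> Option<u64> {
        self.slots.iter().flatten().map(|w| w.deadline_ns).min()
    }

    /// Remove and return one waiter whose deadline has passed, if any.
    pub fn pop_expired(&mut self, now_ns: u64) -> Option<DeadlineWaiter> {
        for slot in self.slots.iter_mut() {
            if let Some(w) = *slot {
                if w.deadline_ns <= now_ns {
                    *slot = None;
                    return Some(w);
                }
            }
        }
        None
    }
}
```

Because the capacity is a compile-time constant, a full table surfaces as an error at registration time rather than as an allocation inside a timer interrupt.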
Network Poll Clock
The kernel-resident networking path is scheduler-polled. Rather than keep every
network-coupled lease in ForcedPeriodic, the in-kernel virtio-net poll is now
routed off a lease-isolated CPU (landed 2026-06-04,
scheduler-nohz-network-poll-housekeeping-routing): virtio::poll_scheduler
consults sched::current_cpu_lease_nohz_active() and skips driving the poll
from a CPU inside a lease-backed tick-suppression window, so that CPU no longer
needs the periodic tick to make network progress. The always-ticking
housekeeping CPU the lease admission already requires keeps servicing virtqueue
completions and pending network-waiter scans. The CpuIsolationLease activation
preflight reflects this with a network_polling=routed-periodic-network-polling-to-housekeeping-cpu admit label when a housekeeping CPU is available, failing
closed (rejected-network-polling-no-housekeeping-cpu-to-relocate, and the lease
is refused at create when no housekeeping CPU exists) otherwise. The longer-term
explicit poll-deadline interface below remains the target for fully removing the
dependency on a housekeeping CPU continuing to tick:
trait NetworkPollClock {
    fn next_poll_deadline_ns(now_ns: u64) -> Option<u64>;
    fn poll_until_budget(now_ns: u64, budget_ns: u64) -> PollResult;
}
next_poll_deadline_ns lets the scheduler include TCP/runtime timers in
earliest_global_deadline(). poll_until_budget prevents network progress
from becoming an unbounded idle-exit or interrupt path. A CPU with active
networking may enter tickless idle only when the network runtime is inactive or
has exposed a bounded deadline through this interface.
Kernel Idle
Tickless idle depends on replacing the user-mode idle process with a kernel/per-CPU idle context. Timer IRQ handling must distinguish:
IRQ from CPL3 user thread -> save/restore user context
IRQ from CPL0 idle -> wake/check scheduler without fake user context
Idle entry shape:
if no runnable work:
deadline = earliest_global_deadline()
clockevent.program_oneshot(deadline - now)
enter_kernel_idle()
The idle loop enables interrupts, halts, wakes on timer/IPI/device interrupt, then rechecks runnable work and deadline expiry.
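The subtraction in the idle entry shape needs care when the deadline has already passed or lies beyond what the timer hardware can express. A minimal sketch of that computation, assuming min_delta_ns <= max_delta_ns and using a hypothetical helper name:

```rust
/// Delta to hand to a one-shot clockevent for an absolute deadline.
/// `min_delta_ns` and `max_delta_ns` mirror the ClockEvent limits and are
/// assumed to satisfy min_delta_ns <= max_delta_ns.
pub fn oneshot_delta_ns(
    now_ns: u64,
    deadline_ns: Option<u64>,
    min_delta_ns: u64,
    max_delta_ns: u64,
) -> u64 {
    match deadline_ns {
        // No known deadline: sleep as long as the hardware allows, then recheck.
        None => max_delta_ns,
        // saturating_sub maps an already-expired deadline to 0, which the
        // clamp raises to the shortest programmable delta.
        Some(d) => d.saturating_sub(now_ns).clamp(min_delta_ns, max_delta_ns),
    }
}
```

A deadline farther away than max_delta_ns simply produces an early wake; the idle loop rechecks deadlines on every wake, so the timer is reprogrammed for the remainder.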
Tickless State
Per CPU:
Periodic:
normal scheduler tick active
TicklessIdle:
no runnable thread
one-shot local timer programmed for earliest deadline
CPU in kernel idle
ForcedPeriodic:
fallback when a subsystem still needs regular polling
Enter TicklessIdle only when:
run queue empty
no direct IPC target
no deferred completion work
no timer-side ring work required
clockevent supports one-shot
kernel idle context available
network runtime inactive or deadline-driven
Keep periodic preemption whenever there is runnable contention. Even one runnable user thread remains periodic until Ring v2, CPU accounting, and timer-side polling dependencies are resolved.
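The entry conditions can be read as a pure function from per-CPU facts to a tick mode. The sketch below is one possible reading, not the kernel's implementation: the field names are illustrative, and the split between Periodic (runnable or pending local work) and ForcedPeriodic (idle, but something still depends on the tick) is an assumption made for this example.

```rust
#[derive(Debug, PartialEq)]
pub enum TickMode {
    Periodic,
    TicklessIdle,
    ForcedPeriodic,
}

/// Snapshot of the per-CPU facts listed above (field names are illustrative).
#[derive(Clone, Copy)]
pub struct IdleFacts {
    pub run_queue_empty: bool,
    pub direct_ipc_target: bool,
    pub deferred_completion_work: bool,
    pub timer_side_ring_work: bool,
    pub oneshot_supported: bool,
    pub kernel_idle_available: bool,
    pub network_inactive_or_deadline_driven: bool,
}

pub fn select_tick_mode(f: &IdleFacts) -> TickMode {
    // Runnable or pending local work keeps ordinary periodic preemption.
    if !f.run_queue_empty || f.direct_ipc_target || f.deferred_completion_work {
        return TickMode::Periodic;
    }
    // Idle, but a subsystem or backend still depends on the regular tick.
    if f.timer_side_ring_work
        || !f.oneshot_supported
        || !f.kernel_idle_available
        || !f.network_inactive_or_deadline_driven
    {
        return TickMode::ForcedPeriodic;
    }
    TickMode::TicklessIdle
}
```

Every condition has to hold at once for TicklessIdle; any single failure keeps the tick running, which matches the fail-closed stance of the rest of the design.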
Layer 2: SQPOLL NoHz
SQPOLL full-nohz is a later CPU ownership mode:
full-nohz is not a timer feature here;
it is part of the SQPOLL CPU ownership contract.
Required prerequisites:
- Ring v2 or equivalent per-thread rings;
- one SQ consumer per ring, including implemented syscall-mode leases and bounded SQPOLL mode transitions;
- per-CPU scheduler ownership;
- reschedule IPI and idle-to-runnable handoff;
- at least one housekeeping CPU;
- explicit placement of network polling away from isolated CPUs.
Current Phase F status: CpuIsolationLease and nohz telemetry exist, the
housekeeping/deferred-work placement child records selected online
housekeeping CPU masks plus deferred cleanup, timer/deadline, network polling,
IRQ-affinity, accounting-target, and cleanup-latency placement or rejection
labels, bounded SQPOLL ring mode can progress from periodic service or one
current-thread syscall/producer-wake batch, and the clockevent/deadline
substrate has split monotonic clocksource reads from LAPIC clockevent
programming. The clockevent one-shot’s firing precision is proven, not just its
programming: a runtime-reprogrammed TICK_NS/2 one-shot armed over the live
periodic timer is measured to fire at its requested sub-tick instant (~5 ms
for a 5 ms request, far under the 10 ms tick, with the current-count correctly
reset to the sub-tick value), and the kernel-mode-fire path restores a live
periodic timer so a one-shot consumed without running schedule() cannot
strand the CPU with no timer source (make run-scheduling-context).
The monotonic clocksource discipline is now sub-tick-accurate as well. The
periodic discipline step previously floored every fire to epoch + TICK_NS
(max(tsc_interpolated, epoch + TICK_NS)), which inflated a real sub-tick
interval to a full tick and hid sub-tick deadlines from the accounting clock.
discipline_clocksource_tick now trusts the TSC interpolation at sub-tick
granularity and falls back to the TICK_NS floor only when the interpolated
advance is implausibly small (below MIN_DISCIPLINED_ADVANCE_NS), preserving a
minimum forward rate against a degenerate TSC (publish_monotonic_ns enforces
only non-decreasing time, not a minimum rate). A boot proof advances a real
TICK_NS/2 interval through one discipline step and asserts monotonic_ns()
tracked the sub-tick delta rather than the full-tick floor
(make run-scheduling-context).
The first activation increment is now real: the CpuIsolationLease
activation preflight performs real per-CPU periodic-tick suppression for
the narrow single-runnable-entity window. When the preflight finds every
proof obligation satisfied – exactly one runnable caller on the target CPU,
ready housekeeping CPU, no local deferred-cleanup/timer dependency, valid
accounting target, live monotonic clocksource, non-stale one-SQ-consumer, and
bounded revocation latency – and the target CPU is the CPU running the
preflight, it masks the periodic LAPIC tick and arms a bounded one-shot
deadline at min(nearest pending timer wakeup, now + max revocation latency).
Network polling is now routed to a housekeeping CPU rather than kept read-only
fail-closed (landed 2026-06-04): the in-kernel virtio-net poll skips driving
from a lease-isolated CPU (virtio::poll_scheduler consulting
sched::current_cpu_lease_nohz_active()), so the admission network_polling
gate flips to a routed-periodic-network-polling-to-housekeeping-cpu admit when
a housekeeping CPU is available and fails closed otherwise. IRQ affinity is now
routable in a bounded form (landed 2026-06-04): when a lease opts in, the
activation path reprograms the leased CPU’s legacy IO-APIC redirection-entry
destinations onto the selected housekeeping CPU (mask-before-reprogram +
read-back, restored on rollback/revoke) before admitting tick suppression, and
keeps the conservative rejected-irq-affinity-not-routed-to-housekeeping refusal
for any ring-coupled lease whose IRQ dependency cannot be safely rerouted. The
live reroute is presently scoped to a quiescent housekeeping destination: under
the in-kernel KVM irqchip, reprogramming an IO-APIC redirection-entry destination
onto a CPU that is actively scheduling stalls forward progress on that
destination CPU, so a general “reroute onto any housekeeping CPU regardless of
occupancy” admission remains future work behind a real destination-quiescence
gate or a delivery backend without that re-evaluation cost. Every disqualifying
change (stale lease generation, a
second runnable entity, stealable sibling work, a local deferred-cleanup
dependency, a target-CPU mismatch, or a one-shot backend that can no longer
arm a deadline) rolls the CPU back to the periodic LAPIC tick first, before
ordinary work continues. Generic full-nohz for ordinary budgeted compute threads
is now admitted through explicit SchedulingContext-targeted compute leases. A
generic SQPOLL nohz state machine now admits explicitly leased caller-thread
rings when the ring is in SQPOLL running/sleeping mode with a live owner, one
SQ consumer, and bounded producer-wake/deadline rollback. Broader
userspace-poller/device-queue admission and production realtime island
admission remain future work; the periodic tick stays the fail-closed fallback
everywhere else. Timeout-based auto-revoke has since landed:
a lease created
with leaseLifetimeNs > 0 auto-revokes on first observation past its deadline
(reason=lease-expired) and a tickless CPU under it rolls back at the next
recheck (lease-lifetime-expired)
(docs/tasks/done/2026-05-30/scheduler-cpu-isolation-lease-timeout-auto-revoke.md).
SQPOLL-driven activation is now proven by
make run-scheduler-generic-sqpoll-nohz: a ring-coupled kernelSqpoll lease
whose bound ring is in SQPOLL running/sleeping mode with a live owner is
admitted for tick suppression, producer wake drives bounded non-periodic
service, and revoke/stale-owner rollback fails closed. The per-CPU
idle thread has also landed – the scheduler idle path is now a CPL0 per-CPU
kernel idle thread and the user-mode idle process is gone (docs/tasks/README.md).
The non-atomic createLease-vs-revokeGrant SMP window
(kernel/src/cap/cpu_isolation_pool_grant.rs:472-483) – a createLease that
passes the grant live-check on one CPU can register its lease just after a
concurrent revokeGrant on another CPU snapshotted the registry, so that lease
is not cascade-terminated and lingers until its own leaseLifetimeNs or an
explicit revoke – is now a modeled, bounded residual rather than a prose-only
caveat. The Alloy lease/grant authority model represents it explicitly as the
WindowLingering set and checks that no live lease reaches a revoked grant
outside it. That the lingering lease was nonetheless legitimately authorized
(no lease is ever minted through an already-revoked grant) is a temporal
mint-time-vs-revoke property the static relational model does not itself check;
it rests on the code’s create-time minted_grant_live gate
(cpu_isolation_pool_grant.rs:484), which fails closed before admission. Taken
together this is a bounded capacity-hold window, not an authority escalation. The
companion TLA+ model checks the two-lock teardown the cascade and prune share
(generation advances exactly once, no capacity double-free, no stranded
generation). Both run under make model-scheduler-lease-alloy /
make model-scheduler-lease-tla; see models/scheduler/README.md.
The nohz/tickless activation-rollback path – the lock-free NOHZ_ACTIVE_CPUS
bit read from ISR context against the locked dispatch.nohz_activation[slot]
record, with IPI-delivered cross-CPU activation/rollback – is likewise now a
checked model rather than a prose-only invariant. The TLA+ lifecycle model
(models/scheduler/nohz_activation.tla) checks that no scheduler CPU is ever
left timer-less (a fired one-shot always has the contention fallback re-arm
enabled, and is always eventually re-armed), that the lock-free bit and the
locked record always reconcile (the bit-set/record-cleared and
record-present/bit-cleared divergences the rollback and contention paths produce
are transient), and that a staled remote activation is dropped rather than
applied to a newer lease (a staled generation is never committed, and a
recorded generation staled by the cap-side maybe_expire path is always rolled
back by the stale-lease-generation disqualifier). A focused Loom test pins the
lock-free-bit ↔ locked-record reconciliation under the C11 memory model. Both
run under make model-scheduler-nohz-tla / make model-scheduler-nohz-loom;
see models/scheduler/README.md.
Ring mode:
enum RingMode {
    Syscall,
    SqpollStarting,
    Sqpoll,
    SqpollStopping,
}
In syscall mode, the owner thread’s cap_enter drains SQ. In SQPOLL mode, a
kernel worker owns SQ head; userspace owns SQ tail and CQ head; cap_enter
waits for completions and may wake a sleeping poller, but it does not drain
SQ.
SQPOLL state:
Disabled -> Starting -> Running -> IdleSpinning -> Sleeping -> Stopping
The wake protocol uses a NEED_WAKEUP flag. Userspace release-stores the SQ
tail, acquire-loads flags, and invokes a wake path only if the poller has gone
to sleep.
The race-free sequence is normative.
Poller before sleeping:
flags.fetch_or(NEED_WAKEUP, SeqCst);
let tail = sq_tail.load(Acquire);
if sq_head != tail {
    flags.fetch_and(!NEED_WAKEUP, Release);
    continue;
}
park();
Producer:
write_sqe();
sq_tail.store(new_tail, Release);
fence(SeqCst);
let flags = flags.load(Acquire);
if flags & NEED_WAKEUP != 0 {
    wake_poller();
}
The poller must set NEED_WAKEUP before the final tail recheck. Otherwise a
producer can publish a new SQE after the poller checks the tail but before it
parks, losing the wake.
The NEED_WAKEUP publication must also be ordered before the final tail
recheck by a full store-to-load barrier. A SeqCst RMW is the simplest
portable rule for the ABI text; an implementation may substitute an explicitly
reviewed architecture-specific fence or park primitive that provides the same
ordering. A plain release store or release-only RMW is not sufficient for this
protocol.
The producer must likewise order the SQ tail publication before checking
NEED_WAKEUP. The normative sequence uses a full fence between
sq_tail.store(..., Release) and flags.load(Acquire); an implementation may
substitute an explicitly reviewed equivalent that prevents the producer from
missing NEED_WAKEUP while the poller misses the new tail before parking.
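The protocol can be exercised in user space with standard-library threads. The sketch below is a toy model under stated assumptions: Ring and run_wake_protocol are hypothetical names, the SQ is reduced to a tail counter with no entries, and std::thread::park/unpark stands in for the kernel park primitive. It also places an explicit SeqCst fence after the poller's RMW; that is a conservative choice made for this illustration, not a requirement taken from the text above.

```rust
use std::sync::atomic::{fence, AtomicU32, AtomicU64, Ordering::{Acquire, Release, SeqCst}};
use std::sync::Arc;
use std::thread;

const NEED_WAKEUP: u32 = 1;

struct Ring {
    sq_tail: AtomicU64, // written by the producer
    flags: AtomicU32,   // written by the poller
}

/// Publish `n` entries one at a time and return how many the poller consumed.
pub fn run_wake_protocol(n: u64) -> u64 {
    let ring = Arc::new(Ring { sq_tail: AtomicU64::new(0), flags: AtomicU32::new(0) });

    let poller_ring = Arc::clone(&ring);
    let poller = thread::spawn(move || {
        let mut sq_head = 0u64; // poller-owned
        while sq_head < n {
            let tail = poller_ring.sq_tail.load(Acquire);
            if sq_head != tail {
                sq_head = tail; // "consume" everything published so far
                continue;
            }
            // Announce the intent to sleep, then recheck the tail.
            poller_ring.flags.fetch_or(NEED_WAKEUP, SeqCst);
            fence(SeqCst); // conservative extra barrier (assumption of this sketch)
            if poller_ring.sq_tail.load(Acquire) != sq_head {
                poller_ring.flags.fetch_and(!NEED_WAKEUP, Release);
                continue;
            }
            thread::park(); // may wake spuriously; the loop rechecks
            poller_ring.flags.fetch_and(!NEED_WAKEUP, Release);
        }
        sq_head
    });

    for new_tail in 1..=n {
        ring.sq_tail.store(new_tail, Release);
        fence(SeqCst);
        if ring.flags.load(Acquire) & NEED_WAKEUP != 0 {
            poller.thread().unpark();
        }
    }
    poller.join().unwrap()
}
```

If the poller set NEED_WAKEUP after its final tail check instead of before it, a run like this could hang with the poller parked and the last entry unconsumed. Note that std's unpark token makes this toy more forgiving than a raw wait queue, so it demonstrates the ordering rather than proving it.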
An SQPOLL CPU may suppress the periodic tick only if:
cpu role is SqpollIsolated
exactly one runnable entity is the poller
no ordinary user thread is runnable there
no timer-side SQ polling is enabled
no network scheduler polling is pinned there
no deferred cleanup is pinned there
stable clocksource/accounting exists
housekeeping CPU is online
If any condition fails, restore periodic tick or migrate the unrelated work.
NoHz Activation Proof Obligations
To enter SqpollNoHz or future AutoNoHz, the scheduler must prove:
exactly one runnable entity is assigned to the CPU
at least one housekeeping CPU is online
no local network polling dependency remains
no timer-side SQ polling can run for the active ring
no local deferred cleanup or unbound kernel worker is pinned there
no unmigratable IRQ targets that CPU unless explicitly allowed
clocksource and CPU accounting are boundary/counter driven, not tick driven
revocation latency is within the lease policy
The proof is dynamic. If any condition stops holding, the scheduler must restore periodic tick, migrate unrelated work, revoke the lease, or leave nohz mode before continuing.
Layer 3: AutoNoHz CPU Lease
The long-term design should split eligibility from activation.
Eligibility says a thread, process, ring, or realtime island may use nohz isolation:
enum NoHzKind {
    Idle,
    KernelSqpoll,
    AutoCompute,
    AutoUserspacePoller,
    RealtimeIsland,
}
struct NoHzEligibility {
    kind: NoHzKind,
    max_revocation_latency_ns: u64,
    preferred_cpus: CpuSet,
    allow_busy_spin: bool,
    accounting_target: CpuAccountingTarget,
}
enum CpuAccountingTarget {
    CurrentSchedulingContext,
    SchedulerResourceLedger,
}
Activation is a scheduler proof that a CPU currently satisfies isolation conditions. Without a lease, a latency-sensitive hint may influence placement but must not grant exclusive CPU access.
Future lease shape:
CpuIsolationLease:
owner process/session
allowed CPU set
allowed mode: poller/compute/kernel-worker
accounting target, not CPU-time credit
revocation policy
Housekeeping must be explicit:
Housekeeping CPU set:
global timers
deferred frees
cleanup
statistics
non-critical kernel workers
debug/watchdog
load balancing and migration control
Layer 4: Deadline Metadata
Deadline metadata lives in fixed ring ABI fields, not in a Cap’n Proto SQE
envelope and not in variable side metadata. The current fixed SQE layout should
not be silently reinterpreted; add these fields through a versioned
CapSqeV2/ring ABI gate when the transport is ready.
#[repr(C)]
struct CapSqeV2 {
    // existing fixed CapSqe fields, unchanged in order and meaning
    deadline_ns: u64,   // absolute monotonic deadline, 0 = none
    qos_flags: u32,     // drop/allow/reorder/propagate semantics
    sched_ctx_id: u32,  // 0 = current/default scheduling context
}
deadline_ns is an absolute monotonic timestamp. It is request freshness
metadata, not a promise of nanosecond wakeup precision. The kernel may round
timer programming to clockevent granularity, coalesce timers where policy
allows, or report a miss when dispatch observes the timestamp has already
expired. The field remains u64 nanoseconds because absolute u64 ns values
are simple, tracing-friendly, and shared with existing timeout surfaces; a
u64 microsecond field saves no ABI space.
Only consider a compact profile if SQE space becomes critical:
deadline_delta_us: u32
That profile would be a soft-deadline compact transport shape only. It is not
the primary realtime or SchedulingContext ABI and must not replace
deadline_ns for admitted realtime work.
ABI negotiation uses both bootstrap metadata and a runtime query surface:
struct RuntimeBootInfo {
    ring_addr: u64,
    ring_abi_version: u32,
    sqe_size: u16,
    cqe_size: u16,
}
- Process bootstrap passes the ring ABI version and fixed entry sizes alongside the ring address.
- RuntimeBootInfo, ring ABI version constants, and fixed SQE/CQE layouts live in capos-config/src/ring.rs; the kernel and capos-rt import the same definition rather than carrying local copies.
- A future RuntimeInfo/SystemInfo query returns the kernel-supported ring ABI range so language runtimes can fail before mapping incompatible rings.
- cap_enter rejects unsupported SQE versions or entry sizes with stable transport errors such as CAP_ERR_UNSUPPORTED_RING_ABI and CAP_ERR_UNSUPPORTED_SQE_VERSION.
- Runtimes in Rust, C, Go, and other languages must generate or mirror the exact fixed layout for the negotiated version.
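On the runtime side, the bootstrap check might look like the sketch below. The layout table, its version numbers and entry sizes, and the RingAbiError names are placeholders invented for illustration; the proposal fixes none of them.

```rust
#[repr(C)]
#[derive(Clone, Copy, Debug)]
pub struct RuntimeBootInfo {
    pub ring_addr: u64,
    pub ring_abi_version: u32,
    pub sqe_size: u16,
    pub cqe_size: u16,
}

/// Hypothetical runtime-side errors (not the kernel's CAP_ERR_* codes).
#[derive(Debug, PartialEq)]
pub enum RingAbiError {
    UnknownVersion,
    EntrySizeMismatch,
}

/// (version, sqe_size, cqe_size) layouts this runtime was built against.
/// All values are placeholders.
const KNOWN_LAYOUTS: &[(u32, u16, u16)] = &[(1, 64, 32), (2, 80, 32)];

/// Refuse to touch the ring unless the kernel's advertised layout matches
/// one this runtime mirrors exactly.
pub fn check_boot_info(info: &RuntimeBootInfo) -> Result<(), RingAbiError> {
    let (_, sqe, cqe) = KNOWN_LAYOUTS
        .iter()
        .copied()
        .find(|(v, _, _)| *v == info.ring_abi_version)
        .ok_or(RingAbiError::UnknownVersion)?;
    if info.sqe_size != sqe || info.cqe_size != cqe {
        return Err(RingAbiError::EntrySizeMismatch);
    }
    Ok(())
}
```

Checking sizes as well as the version number catches a runtime whose mirrored layout has drifted from the shared definition even though the version constant still matches.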
Suggested flags:
DROP_IF_LATE:
if now > deadline_ns before dispatch, post DEADLINE_EXPIRED
ALLOW_LATE:
dispatch anyway, but CQE/telemetry marks late
PROPAGATE_DEADLINE:
endpoint CALL/RETURN carries deadline metadata to server-side request
DEADLINE_ORDERED:
SQPOLL may reorder within a bounded window only when all reorder-safety
checks below pass
NO_BLOCKING_PATH:
reject if target method/op is not declared realtime-safe
Do not put budget, period, priority, criticality, or CPU affinity into each SQE. Deadline is per request. Budget is execution authority.
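A dispatch-time freshness check under these flags could be sketched as follows. The bit position of DROP_IF_LATE is hypothetical, since the proposal names the flags but not their encoding, and treating a late request without DROP_IF_LATE as dispatched-and-marked-late is an assumption of this example.

```rust
// Hypothetical bit assignment; the proposal does not define the encoding.
pub const DROP_IF_LATE: u32 = 1 << 0;

#[derive(Debug, PartialEq)]
pub enum Dispatch {
    OnTime,
    Late,            // dispatched; CQE/telemetry marks it late
    DeadlineExpired, // not dispatched; DEADLINE_EXPIRED is posted
}

pub fn classify(now_ns: u64, deadline_ns: u64, qos_flags: u32) -> Dispatch {
    // 0 means "no deadline" in the CapSqeV2 sketch; a request is late only
    // when now is strictly past the deadline.
    if deadline_ns == 0 || now_ns <= deadline_ns {
        return Dispatch::OnTime;
    }
    if qos_flags & DROP_IF_LATE != 0 {
        Dispatch::DeadlineExpired
    } else {
        Dispatch::Late
    }
}
```

The check reads only the request's own fields. Budget, period, and priority stay in the SchedulingContext, so an untrusted SQE can make a request less likely to run but cannot grant itself more CPU time.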
DEADLINE_ORDERED is valid only when all of the following are true:
the ring mode permits reordering
the SQE marks this request reorderable
the target capability interface and method declare reorder-safe semantics
the reordering window is bounded
the operation does not depend on earlier same-ring requests for correctness
Ordered side effects such as “write A; write B; flush” or “lock; mutate; unlock” must not be deadline-reordered unless the target method contract
explicitly defines that sequence as reorder-safe.
Layer 5: SchedulingContext
CPU time should become a capability-controlled object:
struct SchedulingContext {
    budget_ns: u64,
    period_ns: u64,
    relative_deadline_ns: u64,
    priority: u16,
    criticality: u8,
    cpu_mask: CpuSet,
    overrun_policy: OverrunPolicy,
    timeout_endpoint: Option<EndpointRef>,
}
Kernel responsibilities:
- decrement remaining budget by actual runtime;
- replenish budget by period;
- throttle or fault a thread on depletion;
- enforce CPU mask and scheduling eligibility;
- dispatch among eligible contexts by the selected realtime policy;
- prevent untrusted SQE bytes from minting budget.
Policy-service responsibilities:
- admission control;
- budget/period/priority selection;
- CPU-isolation lease policy;
- overload response;
- telemetry and retuning.
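The first three kernel responsibilities amount to a small piece of per-context accounting state. The sketch below assumes the simplest possible policy, a full refill at each period boundary, and uses hypothetical names (BudgetState, Charge); a real scheduler may prefer a sporadic-server style replenishment.

```rust
#[derive(Debug, PartialEq)]
pub enum Charge {
    Ok { remaining_ns: u64 },
    Depleted,
}

pub struct BudgetState {
    budget_ns: u64,
    period_ns: u64,
    remaining_ns: u64,
    next_replenish_ns: u64,
}

impl BudgetState {
    pub fn new(budget_ns: u64, period_ns: u64, now_ns: u64) -> Self {
        assert!(period_ns > 0 && budget_ns <= period_ns);
        Self {
            budget_ns,
            period_ns,
            remaining_ns: budget_ns,
            next_replenish_ns: now_ns + period_ns,
        }
    }

    /// Refill to the full budget once a period boundary has passed,
    /// skipping over any boundaries that were missed entirely.
    pub fn replenish(&mut self, now_ns: u64) {
        if now_ns >= self.next_replenish_ns {
            let periods = (now_ns - self.next_replenish_ns) / self.period_ns + 1;
            self.next_replenish_ns += periods * self.period_ns;
            self.remaining_ns = self.budget_ns;
        }
    }

    /// Charge measured runtime; depletion is where throttle/fault policy applies.
    pub fn charge(&mut self, ran_ns: u64) -> Charge {
        self.remaining_ns = self.remaining_ns.saturating_sub(ran_ns);
        if self.remaining_ns == 0 {
            Charge::Depleted
        } else {
            Charge::Ok { remaining_ns: self.remaining_ns }
        }
    }
}
```

Budget enters this state only through new and replenish, both driven by kernel-held parameters, so nothing a thread writes into a ring can increase remaining_ns.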
Layer 6: Donation
Synchronous capability calls need scheduling-context donation:
client SchedulingContext -> passive server endpoint
server runs on donated budget/deadline
context returns on reply
timeout/overrun reports to caller or island policy
Without donation or inheritance, a realtime caller can be defeated by a normal-priority server that holds the capability implementation path.
Donation semantics must be fixed before implementation:
max donation call depth:
bounded per SchedulingContext or RealtimeIsland; overflow fails closed.
nested donation:
nested synchronous calls carry the current donated context until the depth
bound, unless a callee uses its own admitted context by explicit policy.
cycle handling:
a donated context may not re-enter a thread already on its donation stack;
cycles fail with a typed realtime/donation error.
partial failure:
budget already consumed stays charged to the context that ran the work.
rollback of authority or memory is separate from CPU charge rollback.
timeout propagation:
the earliest of request deadline, scheduling-context deadline, and explicit
call timeout bounds downstream execution.
server-side blocking:
a passive server running on donated context may block only on approved
realtime-safe waits or synchronous calls that continue donation.
return on exception:
application exceptions, transport errors, and cancellation return the
context to its previous owner before CQE/error delivery.
async endpoint queues:
donation does not cross ordinary async endpoint enqueue by default. Async
donation requires an explicit future token/lease design.
Hot admitted paths should avoid blocking locks. If a shared resource cannot be modeled as a passive service, it needs a reviewed priority/deadline-inheritance primitive or a bounded try-lock/fail/drop policy.
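Three of the rules above (bounded depth, cycle rejection, and earliest-deadline propagation) are mechanical enough to sketch. The names below are hypothetical, thread identities are reduced to u32 values, and the order in which the cycle and depth checks run is an arbitrary choice for this example.

```rust
#[derive(Debug, PartialEq)]
pub enum DonationError {
    DepthExceeded,
    Cycle,
}

/// Threads currently running on one donated context, original owner first.
pub struct DonationStack<const MAX: usize> {
    threads: [u32; MAX],
    depth: usize,
}

impl<const MAX: usize> DonationStack<MAX> {
    pub fn new(owner: u32) -> Self {
        assert!(MAX > 0);
        let mut threads = [0; MAX];
        threads[0] = owner;
        Self { threads, depth: 1 }
    }

    /// Donate to a callee; re-entry and overflow both fail closed.
    pub fn donate_to(&mut self, callee: u32) -> Result<(), DonationError> {
        if self.threads[..self.depth].contains(&callee) {
            return Err(DonationError::Cycle);
        }
        if self.depth == MAX {
            return Err(DonationError::DepthExceeded);
        }
        self.threads[self.depth] = callee;
        self.depth += 1;
        Ok(())
    }

    /// Reply, exception, transport error, or cancellation all take this path:
    /// the context goes back to the previous holder, whose id is returned.
    pub fn return_to_caller(&mut self) -> Option<u32> {
        if self.depth > 1 {
            self.depth -= 1;
            Some(self.threads[self.depth - 1])
        } else {
            None
        }
    }
}

/// Earliest of request deadline, scheduling-context deadline, and call timeout.
pub fn effective_deadline_ns(request: Option<u64>, context: u64, call_timeout: Option<u64>) -> u64 {
    let mut deadline = context;
    if let Some(r) = request {
        deadline = deadline.min(r);
    }
    if let Some(t) = call_timeout {
        deadline = deadline.min(t);
    }
    deadline
}
```

The stack tracks only who currently holds the context. CPU already consumed is charged separately and is never rolled back when a donation unwinds.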
Layer 7: RealtimeIsland
RealtimeIsland admits a whole loop or graph:
struct RealtimeIslandSpec {
    period_ns: u64,
    deadline_ns: u64,
    cpu_set: CpuSet,
    nodes: Vec<NodeBudget>,
    rings: Vec<RingSpec>,
    memory: Vec<PreallocSpec>,
    devices: Vec<DeviceReservation>,
    overrun_policy: OverrunPolicy,
}
Admission requires:
- total budget fits period/deadline constraints;
- all hot-path buffers are preallocated;
- hot-path memory is committed and resident before start;
- guaranteed hot-path memory uses the OOM proposal’s MemoryResidency policy as pinned or secret; normal memory is not admitted for guaranteed hot paths. A future lock-resident operation may transition ordinary memory into a pinned reservation before admission, but the admitted island sees the result as pinned, not as normal;
- all caps and policy decisions are resolved before start;
- no expected page faults on the hot path;
- no unbounded lock acquisition;
- no blocking endpoint calls inside callback loops;
- no allocation, logging, service discovery, or provider credential work on the realtime path;
- IRQ and deferred work are bounded or moved outside the island.
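The first admission requirement can be illustrated with a deliberately simple test. The sketch assumes all nodes run back to back on one CPU once per period and that deadline_ns in the spec is relative to the period start; both are assumptions of this example, and NodeBudget here carries only a budget. Multi-CPU islands and pipelined graphs need a real schedulability analysis.

```rust
pub struct NodeBudget {
    pub budget_ns: u64,
}

#[derive(Debug, PartialEq)]
pub enum Admission {
    Admitted { slack_ns: u64 },
    Denied,
}

/// Serial single-CPU check: the summed node budgets must fit inside the
/// tighter of the period and the relative deadline.
pub fn admit_budget(period_ns: u64, deadline_ns: u64, nodes: &[NodeBudget]) -> Admission {
    let window_ns = period_ns.min(deadline_ns);
    let mut total_ns: u64 = 0;
    for n in nodes {
        total_ns = match total_ns.checked_add(n.budget_ns) {
            Some(t) => t,
            None => return Admission::Denied, // arithmetic overflow fails closed
        };
    }
    if total_ns <= window_ns {
        Admission::Admitted { slack_ns: window_ns - total_ns }
    } else {
        Admission::Denied
    }
}
```

For example, a 10 ms period with an 8 ms deadline admits nodes budgeted at 3 ms and 4 ms with 1 ms of slack, and denies 5 ms plus 4 ms.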
Failure semantics must be typed:
CAP_ERR_DEADLINE_EXPIRED
CAP_ERR_BUDGET_EXHAUSTED
CAP_ERR_REALTIME_UNSAFE_PATH
CAP_ERR_REALTIME_ADMISSION_DENIED
CAP_ERR_OVERRUN
CAP_ERR_STALE_INPUT
CQE/status should distinguish not-started-late, completed-late, dropped by policy, throttled, and dependency-cancelled.
Policy-Service Userstories: AutoNoHz Placement for Compute-Capable Threads
The Layer 1-7 primitives above are mechanism: NoHzEligibility is a reviewed
claim, CpuIsolationLease is the placement authority, SchedulingContext and
the coarse ResourceLedger own CPU-time budget, and NoHzActivation is the
scheduler proof that current CPU state allows tick suppression. They do not
answer who decides to issue an eligibility hint for an ordinary user thread
that was not pre-declared as a realtime island or kernel SQPOLL worker, or
what observation justifies the issuance. That decision is policy, and it
belongs in the user-space scheduler policy service described in
Stage 7 of scheduler-evolution-proposal.
This section records the userstories that motivate the responsibility and the
bounds the policy service must enforce so auto-promotion never becomes an
implicit “unlimited CPU-hold” grant.
Core property: promotion is placement, not budget
Auto-promotion adds isolation; it never mints CPU-time authority. A
policy-issued CpuIsolationLease only removes tick and scheduler noise while
its bound thread consumes time that was already authorized through its
SchedulingContext or coarse ResourceLedger. SchedulingContext budget
exhaustion is now folded into the same nearest-deadline timer as nohz
revocation/timer work, so a tick-masked CPU is re-observed at the budget
deadline rather than at a later periodic tick. When budget exhausts, or when any
existing Layer 3 activation obligation stops holding, the existing fail-closed
rollback path restores the periodic tick. Priority-aware revocation of the lease
itself when an equal-or-higher-priority runnable arrives is new Phase H surface
(see “Bounds the policy service must enforce” below); today’s Phase F rollback
only restores ticks on the leased CPU and does not terminate the lease.
This separation answers the obvious objection. A busy-spinning thread cannot escalate itself into permanent CPU exclusivity, because the spin drains its allotted budget at the same rate periodic scheduling would have drained it. If the operator has granted enough budget to saturate a core, auto-promotion removes tick interference while that budget is consumed; if not, the same authority that would have throttled the thread under periodic scheduling still throttles it under nohz.
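The folding of budget exhaustion into the nearest-deadline timer can be written as a single minimum over the candidate wake times. This is a sketch under assumptions: the function name and parameters are hypothetical, and None stands for "no pending timer" or "no budgeted context" respectively.

```rust
/// Absolute time at which a tick-masked CPU must be re-observed: the earliest
/// of the revocation-latency bound, the nearest pending timer wakeup, and the
/// instant the running context's remaining budget would run out.
pub fn nohz_oneshot_deadline_ns(
    now_ns: u64,
    nearest_timer_wakeup_ns: Option<u64>,
    max_revocation_latency_ns: u64,
    budget_remaining_ns: Option<u64>,
) -> u64 {
    let mut deadline = now_ns.saturating_add(max_revocation_latency_ns);
    if let Some(t) = nearest_timer_wakeup_ns {
        deadline = deadline.min(t);
    }
    if let Some(b) = budget_remaining_ns {
        deadline = deadline.min(now_ns.saturating_add(b));
    }
    deadline
}
```

Because the revocation bound is always a candidate, the CPU is never left without a programmed wake, even when no timers are pending and the context has ample budget.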
Trigger: “thread appears capable of utilizing a full CPU core”
The trigger is not a fixed percentage threshold inside the kernel. The kernel exports per-thread observation; the policy service synthesizes a saturation-capability signal from those observations and decides what “capable of utilizing a full CPU core” means for a given account, session, or service profile. Plausible inputs the policy service may combine:
- runtime accumulated over a rolling window approaches the wall-clock window the thread had on its assigned CPU;
- voluntary-block count over the same window stays low (the thread is not IPC- or IO-bound at a rate that would lose the benefit);
- runnable-but-not-running time stays low when the thread is the only runnable entity on its CPU, or correlates with placement contention rather than IO when it is not.
Concrete window length, smoothing, and the synthesis rule are policy-service
choices, replaceable without ABI churn. As of 2026-05-30 the kernel exports
the observation inputs the heuristic consumes as ordinary (non-measure)
per-thread state: runtime_ns/virtual_runtime_ns, voluntary_blocks,
preemptions, and a cumulative runnable_accumulated_ns
(runnable-but-not-running time) are all returned by
SchedulingPolicyCap.snapshot @2. voluntary_blocks and preemptions were
promoted out of cfg(feature = "measure") and runnable_accumulated_ns was
added at the run-queue enqueue/select boundary; only migrations remains
measure-gated. This closes the Phase H “monitoring/status surface that
exports per-thread saturation observation” prerequisite. The surface exports
raw cumulative counters only: no fixed threshold and no windowing live in the
kernel – the policy service synthesizes the saturation signal.
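Since the kernel exports only cumulative counters, a policy service would difference two snapshots and apply its own thresholds. The sketch below shows one possible synthesis rule. The at_ns sample timestamp, the SaturationPolicy struct, and every threshold value are assumptions of this example and belong to the policy service rather than the ABI.

```rust
/// Two reads of the cumulative per-thread counters; `at_ns` is the
/// policy service's own sample timestamp (an assumption of this sketch).
#[derive(Clone, Copy)]
pub struct ThreadSnapshot {
    pub at_ns: u64,
    pub runtime_ns: u64,
    pub voluntary_blocks: u64,
    pub runnable_accumulated_ns: u64,
}

/// Policy-owned thresholds; none of these live in the kernel.
pub struct SaturationPolicy {
    pub min_runtime_permille: u64,
    pub max_blocks_per_window: u64,
    pub max_runnable_wait_permille: u64,
}

pub fn looks_saturating(prev: &ThreadSnapshot, cur: &ThreadSnapshot, p: &SaturationPolicy) -> bool {
    let window_ns = cur.at_ns.saturating_sub(prev.at_ns);
    if window_ns == 0 {
        return false;
    }
    let ran = cur.runtime_ns.saturating_sub(prev.runtime_ns);
    let blocks = cur.voluntary_blocks.saturating_sub(prev.voluntary_blocks);
    let waited = cur.runnable_accumulated_ns.saturating_sub(prev.runnable_accumulated_ns);
    // Share of the window in permille, computed in u128 to avoid overflow.
    let permille = |part: u64| (part as u128 * 1000 / window_ns as u128) as u64;
    permille(ran) >= p.min_runtime_permille
        && blocks <= p.max_blocks_per_window
        && permille(waited) <= p.max_runnable_wait_permille
}
```

Changing the window length, adding smoothing, or weighting the inputs differently only touches this function, which is the sense in which the rule is replaceable without ABI churn.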
User stories
- Long-running compute tenant with declared budget. A model-training, video-encoding, or HPC build job is admitted with a SchedulingContext (or coarse ResourceLedger allocation) sized for sustained near-core utilization on a declared CPU pool. The policy service observes the thread saturating the pool’s CPU share, issues a bounded CpuIsolationLease against the pool, the scheduler proves the activation obligations from Layer 2/3, and ticks are suppressed for as long as the thread keeps consuming the granted budget. The lease ends when the budget exhausts, the job completes, the operator revokes the pool, or the saturation signal subsides.
- Userspace poller that earned isolation. A service polls a ring or device queue (a candidate AutoUserspacePoller in the NoHzKind taxonomy). The policy service sees consistent saturation with low voluntary blocking, recognizes the AutoUserspacePoller eligibility kind, and issues a lease. The bounds are the same as for the kernel SQPOLL path; only the consumer differs.
- Account-scoped auto-claim pool. An operator pre-declares “account X may auto-claim up to N isolated CPUs from pool P, maximum auto-lease lifetime L, with revocation latency R, charging to ledger E.” The policy service monitors threads owned by X, issues leases against P when saturation capability is observed, and refuses promotion when X already holds N leases or when no CPU in P currently satisfies the activation proof. Without the operator declaration the policy service does not auto-promote.
- Background agent that bursts to full-core compute. A general-purpose agent process does not normally saturate a core. When it briefly does (a planning phase, a build step, a local inference call), the policy service may issue a short-lifetime lease if the agent’s account has authorized auto-promotion. When the burst ends the signal subsides; the lease is not renewed.
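The account-scoped story above can be sketched as a policy-service decision function. Everything here is a hypothetical illustration: `AutoClaimDeclaration`, `PromotionDecision`, and `decide_promotion` are not names from the implementation, and the inputs (live lease count, eligible CPU count) are assumed to come from bookkeeping the policy service already holds.

```rust
/// Operator declaration: "account X may auto-claim up to N isolated CPUs
/// from pool P, maximum auto-lease lifetime L, with revocation latency R".
/// Hypothetical policy-service type, not a kernel schema.
#[derive(Clone, Copy, Debug)]
pub struct AutoClaimDeclaration {
    pub account_id: u64,                // X
    pub pool_id: u32,                   // P
    pub max_leases: u32,                // N
    pub max_lease_lifetime_ns: u64,     // L
    pub max_revocation_latency_ns: u64, // R
}

#[derive(Debug, PartialEq, Eq)]
pub enum PromotionDecision {
    Issue { pool_id: u32, lifetime_ns: u64 },
    RefuseNoDeclaration,
    RefuseNotSaturating,
    RefuseAccountAtCapacity,
    RefuseNoEligibleCpu,
}

/// Decide whether to auto-promote one observed thread.
pub fn decide_promotion(
    declaration: Option<&AutoClaimDeclaration>,
    saturating: bool,
    live_leases_for_account: u32,
    eligible_cpus_in_pool: u32,
    requested_lifetime_ns: u64,
) -> PromotionDecision {
    // Without the operator declaration there is no auto-promotion at all.
    let Some(decl) = declaration else {
        return PromotionDecision::RefuseNoDeclaration;
    };
    if !saturating {
        return PromotionDecision::RefuseNotSaturating;
    }
    if live_leases_for_account >= decl.max_leases {
        return PromotionDecision::RefuseAccountAtCapacity;
    }
    if eligible_cpus_in_pool == 0 {
        return PromotionDecision::RefuseNoEligibleCpu;
    }
    PromotionDecision::Issue {
        pool_id: decl.pool_id,
        // The lease lifetime is clamped to the operator's ceiling.
        lifetime_ns: requested_lifetime_ns.min(decl.max_lease_lifetime_ns),
    }
}
```

The refusal variants are kept distinct so that an operator reading policy-service logs can tell a missing declaration from an exhausted account quota.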
Bounds the policy service must enforce
For every auto-issued lease the policy service records:
lifetime_ns: bounded; shorter than admin-issued leases by
default; renewal requires re-observing the
saturation signal.
max_revocation_latency_ns: bounded by NoHzEligibility.max_revocation_latency_ns;
cannot exceed the operator/account policy.
accounting_target: a live SchedulingContext or coarse ResourceLedger;
the lease does not mint CPU-time authority.
auto_claim_pool: the pre-authorized CPU set; no implicit fallback to
system-wide isolation.
fairness_preemption: another runnable entity at equal-or-higher policy
priority terminates the lease if no other CPU
authorized by both the pool and lease mask is
eligible.
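As a rough illustration, the record above could be modeled and checked in the policy service before a lease is requested. The type names, the `RecordLimits` inputs, and the reading of "shorter than admin-issued leases" as an upper bound are assumptions made for this sketch; `fairness_preemption` is omitted because it is a runtime condition rather than a static field check.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum AccountingTarget {
    SchedulingContext(u64),
    ResourceLedger(u64),
}

/// Hypothetical policy-service record for one auto-issued lease.
#[derive(Clone, Copy, Debug)]
pub struct AutoLeaseRecord {
    pub lifetime_ns: u64,
    pub max_revocation_latency_ns: u64,
    pub accounting_target: AccountingTarget,
    pub auto_claim_pool_mask: u64, // pre-authorized CPU set
    pub lease_cpu_mask: u64,       // CPUs the lease may occupy
}

/// Limits the record is checked against (assumed inputs).
#[derive(Clone, Copy, Debug)]
pub struct RecordLimits {
    pub admin_lease_lifetime_ns: u64,
    pub eligibility_max_revocation_latency_ns: u64,
    pub policy_max_revocation_latency_ns: u64,
}

#[derive(Debug, PartialEq, Eq)]
pub enum RecordError {
    UnboundedLifetime,
    LifetimeTooLong,
    RevocationLatencyTooLong,
    EmptyCpuMask,
    CpuOutsidePool,
}

pub fn validate(record: &AutoLeaseRecord, limits: &RecordLimits) -> Result<(), RecordError> {
    // An auto-issued lease must always carry a finite lifetime.
    if record.lifetime_ns == 0 {
        return Err(RecordError::UnboundedLifetime);
    }
    if record.lifetime_ns > limits.admin_lease_lifetime_ns {
        return Err(RecordError::LifetimeTooLong);
    }
    // Revocation latency is bounded by both the eligibility and the policy.
    let latency_cap = limits
        .eligibility_max_revocation_latency_ns
        .min(limits.policy_max_revocation_latency_ns);
    if record.max_revocation_latency_ns > latency_cap {
        return Err(RecordError::RevocationLatencyTooLong);
    }
    // No implicit fallback: every leased CPU must be inside the pool.
    if record.lease_cpu_mask == 0 {
        return Err(RecordError::EmptyCpuMask);
    }
    if record.lease_cpu_mask & !record.auto_claim_pool_mask != 0 {
        return Err(RecordError::CpuOutsidePool);
    }
    Ok(())
}
```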
Two of these bounds map to existing kernel-enforced surfaces:
max_revocation_latency_ns is already a field on NoHzEligibility and the
closed Phase F activation preflight; accounting_target is already a field
on NoHzEligibility and the live SchedulingContext/ResourceLedger
authority.
The other three bounds needed new kernel-enforced surfaces before the heuristic could ship and were named as Phase H prerequisites. Their status:
- lifetime_ns: LANDED 2026-05-30. CpuIsolationLeaseSpec now carries leaseLifetimeNs @6 (0 = no expiry, the default). A lease records an absolute monotonic expires_at_ns at creation; the first observation past the deadline auto-revokes through the existing generation-advancing cleanup (reason=lease-expired), and the nohz activation record carries the lifetime deadline so a tickless CPU rolls back at the next timer/IPI recheck (lease-lifetime-expired), bounded by maxRevocationLatencyNs. This is the bounded-lifetime guarantee the auto-issued placement lease needs, so a compromised, blocked, or malfunctioning policy service cannot leave an auto-issued lease holding the CPU indefinitely. The bounded renewal primitive LANDED on top of this: CpuIsolationLease.renew @4 pushes expires_at_ns forward to now + leaseLifetimeNs (clamped to the same one-hour ceiling read_spec enforces), keeping the same (leaseId, generation), accounting binding, and nohz activation state – distinct from re-minting a fresh lease. It is callable only before expiry (a revoked, auto-revoked, or past-deadline lease stays staleGeneration and is not resurrected; an unbounded leaseLifetimeNs = 0 lease reports notRenewable), and the renewed deadline is propagated to a tickless CPU’s nohz activation record so the lease-lifetime-expired disqualifier no longer rolls it back at the old deadline; CpuIsolationLeaseInfo.expiresAtNs echoes the deadline read-only.

  Only the Phase H renewal heuristic – re-observing the saturation signal to decide whether to call renew on a near-expiry lease – remains future policy-service work on top of this primitive.
- auto_claim_pool and per-account capacity (N in user story 3): the operator-declared CPU-pool descriptor LANDED 2026-05-30, making a non-default poolId meaningful for the first time. CpuIsolationLeaseSpec carries poolId @7 (0 = the implicit default pool over every scheduler CPU), and the kernel seeds a fixed declared-pool registry (CpuIsolationPoolDescriptor: the default pool 0 plus exactly one declared non-default pool 1 over a single CPU). The create-time admission gate now looks the pool up: an undeclared poolId is rejected invalidSpec; a declared pool whose CPU mask the lease’s allowedCpuMask exceeds is rejected invalidSpec; a declared pool with a subset mask is admitted and its id/mask are echoed read-only through CpuIsolationLeaseInfo (admittedPoolId/admittedPoolCpuMask) (proof make run-scheduler-cpu-isolation-lease: nondefault_pool=invalidSpec for the undeclared id, declared_pool=ok admitted_pool_id=1 admitted_pool_cpu_mask_subset=true, declared_pool_mask_violation=invalidSpec, default_pool_id=0). The declared-pool table is now operator-sourced (LANDED 2026-05-30): the kernel installs it from the boot manifest SystemConfig.cpuIsolationPools @14 (a List(CpuIsolationPoolDescriptor)), with the in-kernel constant as the fail-closed default when the manifest omits the list, and validates each entry fail-closed at boot (canonical CPU mask subset of the scheduler mask, default pool 0 synthesized if omitted, duplicate ids rejected). The boot line cpu-isolation: declared-pools source=manifest count=3 default_pool_id=0 nondefault_pool_id=1 nondefault_pool_cpu_mask=0x2 proves the source (proof make run-scheduler-cpu-isolation-lease; the kernel-default fallback is proven by cargo test-config decode/empty assertions).

  The descriptor now also carries a per-pool live-lease capacity bound (poolMaxLeases @2, LANDED 2026-05-31): a non-zero value caps the number of simultaneously live (non-revoked, current-generation) leases the kernel admits against that pool at create-time, counted from the existing LEASE_REGISTRY after prune_dead, rejecting an over-capacity create fail-closed resourceExhausted (0 = unbounded, preserving the default pool 0 and every existing producer). The manifest bounds pool 2 at poolMaxLeases: 2; the proof admits two live leases, refuses a third non-overlapping create (cpu-isolation: pool-capacity-rejected admitted_pool_id=2 live_leases=2 pool_max_leases=2 result=resourceExhausted, pool_capacity_exceeded=resourceExhausted), then reclaims after a revoke (pool_capacity_reclaimed=ok), proving the bound is live-count not cumulative. The account identity and per-account N then landed on top of this counter (LANDED 2026-05-31): CpuIsolationLeaseSpec carries accountId @8 :UInt64 (0 = unattributed, caller-asserted and inert until counted, echoed read-only through CpuIsolationLeaseInfo.accountId @6) and CpuIsolationPoolDescriptor carries poolMaxLeasesPerAccount @3 :UInt32 (0 = unbounded per account). After the pool-wide check, register counts the requesting account’s live entries (matching both admitted_pool_id and account_id) against the per-account bound and rejects an over-bound create fail-closed resourceExhausted (0 account or 0 bound skips the gate). The manifest bounds pool 2 at poolMaxLeasesPerAccount: 1; the proof admits one account-7 lease, refuses a second account-7 create (cpu-isolation: account-capacity-rejected admitted_pool_id=2 account_id=7 account_live_leases=1 pool_max_leases_per_account=1 result=resourceExhausted, account_capacity_exceeded=resourceExhausted), admits a different account-9 lease on that CPU (account_capacity_other_account=ok – per-account, not pool-wide), and reclaims after revoking account-7 (account_capacity_reclaimed=ok).

  The account id is caller-asserted on the plain lease path. The authentication half LANDED 2026-05-31: CpuIsolationPoolGrant (schema/capos.capnp; source cpu_isolation_pool_grant; kernel kernel/src/cap/cpu_isolation_pool_grant.rs) introduced a bootstrap-staged grant that binds one authenticated account to one declared pool. Its createLease stamps the bound account/pool onto the minted lease, overriding any caller-asserted accountId/poolId, and reuses the same lease-create admission path (cpu_isolation::create_lease_for_caller) – so the per-account bound is unforgeable by cap-possession: a holder cannot assert another account to evade poolMaxLeasesPerAccount. The initial single-grant proof used account 7 bound to pool 2; the current make run-scheduler-cpu-isolation-pool-grant proof boots manifest-declared grants. The grant binding is now operator-declared (LANDED 2026-06-01): the manifest SystemConfig.cpuIsolationPoolGrants table seeds the bound (account, pool) pairs (mirroring the cpuIsolationPools table), and the cpu_isolation_pool_grant/cpu_isolation_pool_grant_secondary sources stage seeded binding index 0/1, so an operator can pre-authorize multiple distinct accounts/pools, each staged as its own bootstrap grant cap. An absent/empty list falls back to one in-kernel binding at index 0: account 7 bound to preferred pool 1 when active, otherwise account 7 bound to synthesized default pool 0, preserving a usable single default grant when a manifest-sourced pool table omits pool 1. make run-scheduler-cpu-isolation-pool-grant now boots a two-entry table (account 5/pool 1, account 8/pool 2) and proves each grant stamps its OWN bound account with the per-account bound still enforced. make run-scheduler-cpu-isolation-pool-grant-default boots the empty-list fallback with pool 1 omitted and proves the synthesized (account 7, pool 0) grant is usable.

  Runtime grant minting landed 2026-06-02 22:24 UTC (CpuIsolationGrantMinter): one cap mints a fresh CpuIsolationPoolGrant for an operator-chosen (account, pool) at call time, bounded by the declared SystemConfig.cpuIsolationGrantMinterAllowlist (an out-of-allowlist pair is refused unauthorized, so the minter is never an ambient grant-any authority; the minted grant reuses the same unforgeable createLease admission path). The same make run-scheduler-cpu-isolation-pool-grant smoke mints a grant for the allowed (account 6, pool 2), proves its createLease stamps account 6 and stays bounded by the per-account gate, and proves an out-of-allowlist mint is refused. Grant-revocation lifecycle landed 2026-06-03 17:11 UTC (CpuIsolationGrantMinter.revokeGrant), closing (c): a runtime-minted grant carries a revocable (grantId, generation); revokeGrant(grantId) advances the grant generation so a stale grant handle’s createLease fails staleGeneration and mints nothing, and revocation cascades to every live lease minted through that grant – reusing the landed fairness-termination cleanup (reason=grant-revoked, periodic-tick rollback, registry unregister) once per tagged lease, so per-pool/per-account live-lease capacity frees immediately and a fresh grant is admitted into the reclaimed slot. Double-revoke is alreadyRevoked and an unknown grantId is unknownGrant, both fail-closed; seeded bootstrap grants are not minter-owned and stay un-revocable. The same make run-scheduler-cpu-isolation-pool-grant smoke proves the full lifecycle. No pool authority is minted from holding a lease cap; the kernel stays the fail-closed admission gate.
- fairness_preemption: LANDED 2026-06-02 21:17 UTC.

  The Phase F rollback path now compares policy priority at the existing nohz recheck site: when a second runnable entity appears on the leased CPU at equal-or-higher WFQ policy priority (latency_class, weight) than the captured leased thread, and no sibling CPU authorized by both the admitted pool and the lease allowedCpuMask is eligible to host the lease, the kernel terminates the CpuIsolationLease itself (fairness-preempted ... result=lease-terminated) rather than only restoring the periodic tick, bounded by maxRevocationLatencyNs. The termination runs the same generation-advancing cleanup leaseLifetimeNs expiry uses (reason=fairness-preempted) immediately after the scheduler restores the periodic tick, so a subsequent info/revoke reports staleGeneration and placement/account capacity is freed without waiting for the holder’s next cap call; a strictly-lower arrival or an eligible sibling CPU inside both masks keeps the existing tick-restore-only behavior. The kernel supplies the comparison and fail-closed termination; the policy service remains the issuer and bookkeeper of the saturation signal. Re-placement of the leased thread onto an eligible sibling CPU (instead of terminating) remains generic-full-nohz work; the “no sibling eligible” condition is recorded.
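The fairness decision described for fairness_preemption reduces to one priority comparison and one mask intersection. The sketch below is a simplified model rather than the kernel code: `PolicyPriority`, `RecheckOutcome`, and `on_second_runnable` are hypothetical names, and it assumes that a larger (latency_class, weight) tuple means higher priority, which the text does not specify.

```rust
/// WFQ policy priority. ASSUMPTION: derived lexicographic ordering, with a
/// larger (latency_class, weight) tuple meaning higher priority.
#[derive(Clone, Copy, Debug, PartialEq, Eq, PartialOrd, Ord)]
pub struct PolicyPriority {
    pub latency_class: u8,
    pub weight: u32,
}

#[derive(Debug, PartialEq, Eq)]
pub enum RecheckOutcome {
    /// Existing behavior: restore the periodic tick, keep the lease.
    RestoreTickOnly,
    /// Restore the tick, then terminate the lease (fairness-preempted).
    TerminateLease,
}

/// Decide what happens when a second runnable entity appears on the
/// leased CPU. Masks are bit sets over CPU indices 0..64.
pub fn on_second_runnable(
    leased: PolicyPriority,
    arrival: PolicyPriority,
    leased_cpu: u32,
    admitted_pool_mask: u64,
    lease_allowed_mask: u64,
    eligible_mask: u64,
) -> RecheckOutcome {
    // A strictly-lower arrival never terminates the lease.
    if arrival < leased {
        return RecheckOutcome::RestoreTickOnly;
    }
    // A sibling must be authorized by BOTH the pool and the lease mask.
    let authorized = admitted_pool_mask & lease_allowed_mask;
    let own_bit = 1u64.checked_shl(leased_cpu).unwrap_or(0);
    let eligible_siblings = authorized & eligible_mask & !own_bit;
    if eligible_siblings != 0 {
        RecheckOutcome::RestoreTickOnly
    } else {
        RecheckOutcome::TerminateLease
    }
}
```

Keeping the sibling check as a pure mask intersection mirrors the fail-closed framing: a CPU that is eligible but outside either mask does not rescue the lease.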
The policy service is the issuer and the bookkeeper of the synthesized saturation signal; the kernel remains the authority gate, the activation prover, and the fail-closed rollback path – including for the three surfaces above, which have since landed.
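To make the kernel's role as admission gate concrete, here is a hedged model of the create-time order described for auto_claim_pool: pool lookup, mask subset check, pool-wide live count, then per-account live count. It is a standalone illustration with hypothetical Rust names (`PoolDescriptor`, `LiveLease`, `admit`); the real path goes through the lease registry and is not reproduced here.

```rust
#[derive(Clone, Copy, Debug)]
pub struct PoolDescriptor {
    pub pool_id: u32,
    pub cpu_mask: u64,
    pub pool_max_leases: u32,             // 0 = unbounded
    pub pool_max_leases_per_account: u32, // 0 = unbounded per account
}

/// One live (non-revoked, current-generation) lease in the registry model.
#[derive(Clone, Copy, Debug)]
pub struct LiveLease {
    pub admitted_pool_id: u32,
    pub account_id: u64, // 0 = unattributed
}

#[derive(Debug, PartialEq, Eq)]
pub enum AdmitError {
    InvalidSpec,
    ResourceExhausted,
}

pub fn admit(
    pools: &[PoolDescriptor],
    live: &[LiveLease],
    pool_id: u32,
    allowed_cpu_mask: u64,
    account_id: u64,
) -> Result<(), AdmitError> {
    // 1. An undeclared pool id is rejected.
    let pool = pools
        .iter()
        .find(|p| p.pool_id == pool_id)
        .ok_or(AdmitError::InvalidSpec)?;
    // 2. The lease mask must be a subset of the pool mask.
    if allowed_cpu_mask & !pool.cpu_mask != 0 {
        return Err(AdmitError::InvalidSpec);
    }
    // 3. Pool-wide bound counts live leases, not cumulative creates.
    let in_pool = live.iter().filter(|l| l.admitted_pool_id == pool_id).count();
    if pool.pool_max_leases != 0 && in_pool >= pool.pool_max_leases as usize {
        return Err(AdmitError::ResourceExhausted);
    }
    // 4. Per-account bound; a zero account or zero bound skips the gate.
    if account_id != 0 && pool.pool_max_leases_per_account != 0 {
        let mine = live
            .iter()
            .filter(|l| l.admitted_pool_id == pool_id && l.account_id == account_id)
            .count();
        if mine >= pool.pool_max_leases_per_account as usize {
            return Err(AdmitError::ResourceExhausted);
        }
    }
    Ok(())
}
```

Because both bounds count entries in a live list, removing a lease from that list frees capacity on the next call, which is the reclaim behavior the proofs assert.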
Explicit non-goals
- The kernel does not contain a saturation-detection rule of its own. It exports observation; it does not synthesize the signal.
- Auto-promotion does not grant unlimited CPU-hold. The lease is bounded by lifetime, budget, revocation, and pool capacity; absent a pre-authorized pool, no auto-promotion occurs.
- Auto-promotion does not grant realtime authority. RealtimeIsland admission remains a separate, stricter path with preallocation, deadline, and no-blocking proofs.
- Auto-promotion does not bypass donation, fairness, or session-lifecycle invariants. Process exit, session logout, and explicit revoke still tear the lease down through the existing Layer 3 rollback.
Telemetry Requirements
Tickless, nohz, SQPOLL, and realtime behavior must be observable through future monitoring/status capability surfaces, not only through ad hoc debug logs. The first counters should include:
scheduler_tick_count{cpu}
ticks_suppressed{cpu,mode}
nohz_enter_count{cpu,kind}
nohz_exit_count{cpu,reason}
oneshot_deadline_miss_count
sqpoll_busy_ns
sqpoll_sleep_count
deadline_expired_count
budget_exhausted_count
realtime_overrun_count
donation_depth_max
housekeeping_offload_count
These counters are correctness evidence. Missing or surprising values should fail focused nohz/realtime proofs rather than being treated as performance-only diagnostics.
The ticks_suppressed{cpu,mode} / scheduler_tick_count{cpu} evidence is
realized as an asserted proof line on the lease path:
make run-scheduler-cpu-isolation-lease now counts genuine periodic LAPIC
fires per CPU (a fire is counted only when neither the lease-backed nor the
idle tick-suppression bit is set, so the one-shot replacement is never
miscounted) and, on lease nohz rollback, emits
cpu-isolation: nohz suppressed-ticks cpu=<n> window_ns=<w> expected_periodic=<e> actual_periodic=<a> suppressed=<e-a>. The harness
asserts that over a bounded masked window the leased CPU recorded actual near
zero while expected was substantial – the periodic tick demonstrably stopped,
not merely that the mask write was issued – and that a bounded post-rollback
cpu-isolation: nohz restored-rate window shows the periodic rate returning.
This is bounded proof-line evidence, not yet a durable
SchedulingPolicyCap/monitoring telemetry field; the persistent
ticks_suppressed surface and the generic-full-nohz path’s inheritance of the
same measured assertion remain future telemetry work.
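The arithmetic behind the suppressed-ticks proof line can be sketched as follows. This is an illustration under stated assumptions: it assumes expected_periodic is the window length divided by the periodic tick period, and the helper names and harness thresholds are hypothetical rather than taken from the harness.

```rust
/// Fields of the proof line, derived from a measured window.
#[derive(Debug, PartialEq, Eq)]
pub struct SuppressedTicks {
    pub expected_periodic: u64,
    pub actual_periodic: u64,
    pub suppressed: u64,
}

/// ASSUMPTION: expected = window_ns / tick_period_ns. Returns None for a
/// zero tick period, which would make the expectation undefined.
pub fn suppressed_ticks(
    window_ns: u64,
    tick_period_ns: u64,
    actual_periodic: u64,
) -> Option<SuppressedTicks> {
    if tick_period_ns == 0 {
        return None;
    }
    let expected_periodic = window_ns / tick_period_ns;
    Some(SuppressedTicks {
        expected_periodic,
        actual_periodic,
        // suppressed = e - a, saturating in case of timer jitter overshoot.
        suppressed: expected_periodic.saturating_sub(actual_periodic),
    })
}

/// Harness-style check: the expectation was substantial while the observed
/// periodic count stayed near zero. Thresholds are caller-chosen.
pub fn proves_tick_stopped(s: &SuppressedTicks, min_expected: u64, max_actual: u64) -> bool {
    s.expected_periodic >= min_expected && s.actual_periodic <= max_actual
}
```

Requiring a substantial expected count guards against a vacuous pass on a window too short to contain any periodic fires.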
Implementation Sequence
- Add timer/scheduler instrumentation around the existing periodic tick.
- Add monotonic_ns() backed by a clocksource that is not derived from the scheduler tick, and switch Timer.now plus scheduler accounting to that clocksource while keeping periodic scheduling. Completed for normal QEMU/x86_64 by the Phase F clockevent/deadline substrate.
- Convert timeout waiters to deadline_ns. Completed for Timer.sleep, finite cap_enter, and park timeouts by the Phase F clockevent/deadline substrate.
- Add LAPIC one-shot programming, periodic restore state, and a focused one-shot smoke. Completed as a disabled-nohz substrate proof by the Phase F clockevent/deadline substrate.
- Replace user-mode idle with kernel/per-CPU idle while keeping periodic ticks. Completed: the scheduler idle path is now a CPL0 per-CPU kernel idle thread and the user-mode idle process is gone (docs/tasks/README.md).
- Enable tickless idle only when there is no runnable work. Completed by docs/tasks/done/2026/scheduler-tickless-idle-step6.md: true-idle CPUs with no runnable non-idle work, no active nohz lease, no local deferred cleanup, no cap-enter polling dependency, and a one-shot LAPIC clockevent mask the periodic tick and arm a bounded one-shot at the next Timer/ParkSpace deadline or the 100 ms idle housekeeping floor. The scheduler restores the periodic tick before ordinary non-idle dispatch, on reschedule IPIs, and on backend/refusal rollback. Cap-enter polling waiters and ready-but-budget-throttled SchedulingContext retry windows remain periodic until the legacy terminal/network/IRQ polling and scheduling-context retry surfaces move behind explicit deadlines or housekeeping placement.
- Route the in-kernel virtio-net poll off a lease-isolated CPU to the housekeeping CPU (landed 2026-06-04); an explicit NetworkPollClock poll deadline remains the longer-term target.
- Add ring mode state and refuse timer-side SQ processing for SQPOLL rings.
- Land Ring v2 per-thread ring ownership and completion routing.
- Add the SQPOLL wake/sleep protocol and a host or Loom-style lost-wakeup model.
- Add kernel SQPOLL without full-nohz, under normal scheduler ticks.
- Add CPU isolation leases and housekeeping CPU placement.
- Prove SQPOLL progress through a wake/deadline path that does not depend on periodic scheduler ticks. Completed for bounded current-thread syscall/producer-wake progress by the Phase F SQPOLL nohz-progress child.
- Enable SQPOLL nohz on isolated CPUs for explicitly leased caller-thread rings. Landed 2026-06-07 09:45 UTC; broader userspace-poller/device-queue policy issuance remains separate.
- Add request deadline_ns metadata and typed late/drop CQE outcomes.
- Add SchedulingContext and admission-controlled realtime islands.
- Add generic full-nohz admission for ordinary budgeted compute threads through explicit SchedulingContext-targeted CpuIsolationLease preflight. Landed 2026-06-06 09:44 UTC; policy-service issuance remains separate.
- Add the user-space policy-service AutoNoHz placement heuristic. The kernel exports per-thread saturation observation through the monitoring/status surface; the policy service synthesizes the “thread appears capable of utilizing a full CPU core” decision and issues bounded CpuIsolationLease grants against pre-authorized account or session CPU pools. The auto-revoke timeout primitive (leaseLifetimeNs) landed 2026-05-30 15:22 UTC at 84c1c5ba, priority-aware fairness lease termination landed 2026-06-02 21:28 UTC at cae825a4 with immediate release remediation at ca28ef63, runtime grant minting (CpuIsolationGrantMinter) landed 2026-06-02 22:25 UTC at 5c5c63cc, and the grant-revocation lifecycle (CpuIsolationGrantMinter.revokeGrant with cascade-to-leases) landed 2026-06-03 17:11 UTC, completing the pool-grant authority surface. The local userspace policy-service proof landed 2026-06-07: it reads the per-thread saturation counters, denies a voluntarily blocking worker, issues a finite grant-stamped full-nohz lease only after a saturated local window, renews only after re-observation, and lets stopped renewal expire fail-closed. A reusable production policy daemon with profile-driven smoothing, cross-process target discovery, and richer operator policy remains future work.
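The tickless-idle step in the sequence above combines an eligibility check with a deadline choice. A minimal sketch, assuming the one-shot is armed at whichever comes first, the earliest Timer/ParkSpace deadline or now plus the 100 ms housekeeping floor; `IdleState` and `idle_oneshot_deadline` are hypothetical names, and the real scheduler tracks these conditions differently.

```rust
/// 100 ms idle housekeeping floor, as stated in the sequence above.
pub const IDLE_HOUSEKEEPING_FLOOR_NS: u64 = 100_000_000;

/// Per-CPU conditions that gate tickless idle (hypothetical model).
#[derive(Clone, Copy, Debug)]
pub struct IdleState {
    pub runnable_non_idle_work: bool,
    pub active_nohz_lease: bool,
    pub local_deferred_cleanup: bool,
    pub cap_enter_polling_dependency: bool,
    pub oneshot_clockevent_available: bool,
}

/// Returns Some(absolute deadline) when the CPU may mask the periodic tick
/// and arm a one-shot, or None when it must stay periodic.
pub fn idle_oneshot_deadline(
    now_ns: u64,
    state: &IdleState,
    next_timer_deadline_ns: Option<u64>,
    next_park_deadline_ns: Option<u64>,
) -> Option<u64> {
    let must_stay_periodic = state.runnable_non_idle_work
        || state.active_nohz_lease
        || state.local_deferred_cleanup
        || state.cap_enter_polling_dependency
        || !state.oneshot_clockevent_available;
    if must_stay_periodic {
        return None;
    }
    // The housekeeping floor bounds how long the CPU may sleep.
    let floor = now_ns.saturating_add(IDLE_HOUSEKEEPING_FLOOR_NS);
    let earliest_waiter = [next_timer_deadline_ns, next_park_deadline_ns]
        .iter()
        .flatten()
        .copied()
        .min();
    let deadline = earliest_waiter.map_or(floor, |d| d.min(floor));
    // An already-expired waiter fires immediately rather than in the past.
    Some(deadline.max(now_ns))
}
```

Returning `None` for every disqualifier keeps the periodic tick as the default, matching the fail-closed framing used elsewhere in this proposal.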
Verification
Tickless idle gates:
make fmt-check
cargo test-lib
cargo test-config
make run-smoke
make run-spawn
Additional tickless proof:
1 second idle interval does not produce 100 scheduler ticks
Timer.sleep still completes
cap_enter timeout still completes
ParkSpace timeout still completes
preemption fairness unchanged with runnable contention
SQPOLL gates:
thread-lifecycle
timer-smoke
timer-flood
park wake/timeout
endpoint CALL/RECV/RETURN
mandatory host or Loom-style lost-wakeup model before any real SQPOLL worker:
poller: set NEED_WAKEUP -> full barrier -> recheck tail -> park
producer: write SQE -> publish tail -> full barrier -> check NEED_WAKEUP -> wake
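The two orderings above can be written down with standard Rust atomics as a starting point for the host model. This is a sketch, not the ring implementation: `RingModel`, its fields, and the wake counter are hypothetical, SQE contents are elided, and a real model would run both sides concurrently under Loom or a similar checker rather than single-threaded.

```rust
use std::sync::atomic::{fence, AtomicBool, AtomicU32, Ordering};

/// Minimal shared state for the wake/sleep handshake (hypothetical).
pub struct RingModel {
    pub tail: AtomicU32,
    pub need_wakeup: AtomicBool,
    pub wakes_issued: AtomicU32,
}

impl RingModel {
    pub fn new() -> Self {
        RingModel {
            tail: AtomicU32::new(0),
            need_wakeup: AtomicBool::new(false),
            wakes_issued: AtomicU32::new(0),
        }
    }

    /// Poller: set NEED_WAKEUP -> full barrier -> recheck tail -> park.
    /// Returns true when parking is safe, false when new work was seen.
    pub fn poller_should_park(&self, consumed_head: u32) -> bool {
        self.need_wakeup.store(true, Ordering::Relaxed);
        fence(Ordering::SeqCst);
        if self.tail.load(Ordering::Acquire) != consumed_head {
            // Work arrived before the park: cancel the wake request.
            self.need_wakeup.store(false, Ordering::Relaxed);
            return false;
        }
        true
    }

    /// Producer: write SQE -> publish tail -> full barrier ->
    /// check NEED_WAKEUP -> wake. Returns true when a wake was issued.
    pub fn producer_publish(&self) -> bool {
        // (SQE write elided in this model.)
        self.tail.fetch_add(1, Ordering::Release);
        fence(Ordering::SeqCst);
        if self.need_wakeup.load(Ordering::Relaxed) {
            self.wakes_issued.fetch_add(1, Ordering::Relaxed);
            return true;
        }
        false
    }
}
```

The two sequentially consistent fences are what rule out the lost wakeup: either the producer observes NEED_WAKEUP and wakes, or the poller's recheck observes the new tail and declines to park.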
Realtime gates:
deadline ordering tests
budget depletion tests
donation/return tests through passive endpoint
admission denial tests
QEMU proof for late/drop/overrun behavior
telemetry counters prove ticks suppressed, deadlines expired, budgets
exhausted, and donation depth bounded as expected
Decision
Adopt this staged direction:
Tickless idle:
yes, after the kernel/per-CPU idle context and activation proof. The
clocksource/clockevent split is implemented.
Generic full-nohz:
implemented for explicit budgeted compute leases targeting a live
SchedulingContext. Automatic issuance and unbudgeted ordinary threads remain
out of scope.
SQPOLL nohz:
yes, for explicitly leased caller-thread rings whose SQPOLL poller is live,
single-consumer, and bounded by producer wake plus rollback deadlines.
AutoNoHz placement for ordinary threads:
yes, but only as a user-space policy-service decision that issues a
bounded CpuIsolationLease against a pre-authorized CPU pool. The lease
adds isolation; it never mints CPU-time authority. The "thread appears
capable of utilizing a full CPU core" signal is synthesized in the
policy service from observations the future monitoring/status surface
must export, not as a fixed kernel threshold.
Realtime:
`SQE.deadline_ns` is useful metadata, but `SchedulingContext` is the
authority that provides CPU time.