Proposal: Tickless and Realtime Scheduling

This proposal captures the scheduling design from the 2026-04-29 discussion and the subsequent implementation status: tickless idle is useful, full-nohz belongs behind explicit CPU isolation authority, and realtime requires scheduling contexts rather than only per-request deadlines.

Design Grounding

The directly relevant grounding is:

External grounding is recorded in the research note so reviewers can audit the prior-art claims without treating this proposal as the source of truth.

Goals

  • Add tickless idle: when a CPU has no runnable work, stop the periodic scheduler tick and program the local timer for the earliest known deadline.
  • Split monotonic timekeeping from timer interrupt delivery.
  • Convert scheduler timeout waiters to absolute monotonic deadlines.
  • Stage full-nohz as an explicit CPU isolation/lease mode for SQPOLL and realtime executors, not as a generic scheduler default.
  • Define SQE.deadline_ns as request freshness metadata.
  • Define SchedulingContext as CPU-time authority.
  • Define RealtimeIsland as the admission object for media, robotics, provider, and other bounded realtime graphs.

Non-Goals

  • No ambient Linux-style NO_HZ_FULL for arbitrary unbudgeted user threads. Ordinary-thread full-nohz requires an explicit budgeted SchedulingContext target and a CpuIsolationLease.
  • No SQPOLL on the current process-wide ring.
  • No second SQ consumer through timer-side polling for SQPOLL rings.
  • No TSC-deadline or x2APIC requirement for the first tickless-idle milestone.
  • No hard realtime claim before kernel-path, IRQ, device, locking, and WCET evidence exists.
  • No full realtime policy blob inside every SQE.

CPU Authority Taxonomy

These terms must not drift into overlapping authority systems:

ResourceProfile:
  policy template selected by identity, session, account, or service profile;
  it is not spendable authority by itself.

ResourceLedger:
  coarse accounting and quota owner for a resource class. It records and
  enforces limits, including non-realtime CPU share/runtime budgets where the
  scheduler has not minted finer scheduling contexts.

SchedulingContext:
  spendable CPU-time authority with budget, period, relative deadline,
  priority/criticality, CPU mask, and overrun policy.

CpuIsolationLease:
  placement, exclusivity, and nohz/noise-isolation authority for a CPU or CPU
  set. It does not grant CPU-time credit and must charge consumed time through
  a SchedulingContext or coarse scheduler ResourceLedger.

NoHzEligibility:
  a reviewed claim or hint that a thread, ring, poller, or island may use nohz
  isolation if the scheduler can prove the current CPU state allows it.

NoHzActivation:
  the scheduler-proven current CPU state that actually suppresses ticks.

RealtimeIsland:
  admitted bundle of SchedulingContexts, memory reservations, device
  reservations, rings, endpoint/service constraints, and optional
  CpuIsolationLeases.

Scheduling-context donation is not generic resource donation. It donates only execution budget/deadline along a synchronous capability path; it does not donate capability authority, invocation subject identity, disclosure scope, memory budget, network budget, storage budget, or service-management authority.

Layer 1: Tickless Idle

Tickless idle should be the first behavioral milestone. It applies only when the CPU has no runnable thread and no local work that still depends on a periodic scheduler tick.

Clocksource

Add a monotonic clock layer:

pub fn monotonic_ns() -> u64;

The first backend can use the current periodic tick as a compatibility source while the system is still periodic. The selected QEMU/x86_64 backend should eventually use a calibrated stable counter, with SMP consistency handled when multiple scheduler owners exist.

Required invariant:

monotonic_ns() never moves backwards on one CPU.
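
One way to hold this invariant independently of the backend is to clamp at the point where a reading is published. The following is a minimal sketch, not the proposal's implementation: `MonotonicClock` and `publish` are hypothetical names, and `std` atomics stand in for the `core` atomics a kernel would use.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

/// Hypothetical per-CPU clock cell (illustrative, not the proposal's type).
pub struct MonotonicClock {
    last_ns: AtomicU64,
}

impl MonotonicClock {
    pub const fn new() -> Self {
        Self { last_ns: AtomicU64::new(0) }
    }

    /// Publishes a raw backend reading and returns the value readers will see.
    /// A reading older than the last published value is clamped, so time
    /// observed through this cell never moves backwards.
    pub fn publish(&self, raw_ns: u64) -> u64 {
        let prev = self.last_ns.fetch_max(raw_ns, Ordering::AcqRel);
        prev.max(raw_ns)
    }

    /// Reads the last published monotonic value.
    pub fn monotonic_ns(&self) -> u64 {
        self.last_ns.load(Ordering::Acquire)
    }
}
```

The clamp guarantees only non-decreasing time. It does not guarantee a minimum forward rate, so a stalled counter would still need separate handling.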

Clockevent

Add a small scheduler timer backend boundary:

trait ClockEvent {
    fn program_periodic(period_ns: u64);
    fn program_oneshot(delta_ns: u64);
    fn stop();
    fn min_delta_ns() -> u64;
    fn max_delta_ns() -> u64;
}

The first backend is the current PIT-calibrated xAPIC LAPIC timer on vector 48. PIT/PIC and periodic LAPIC remain fallback paths.

Deadline Waiters

Convert timeout state from tick counts to absolute deadlines:

struct DeadlineWaiter {
    deadline_ns: u64,
    target: ThreadRef,
    kind: WaiterKind,
    user_data: u64,
}

Affected paths:

  • Timer.sleep;
  • cap_enter(timeout_ns);
  • ParkSpace timeout;
  • future process/thread wait timeouts;
  • network poll deadline through NetworkPollClock.

Waiter storage remains bounded. No interrupt path may allocate.
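
The bounded, allocation-free requirement can be illustrated with a fixed-capacity table. This is a sketch under stated assumptions: `Waiter`, `WaiterTable`, the `thread_id` field, and the linear scan are hypothetical simplifications of `DeadlineWaiter`, chosen only to show that insertion fails closed instead of allocating.

```rust
/// Simplified stand-in for DeadlineWaiter (hypothetical fields).
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct Waiter {
    pub deadline_ns: u64,
    pub thread_id: u32,
}

/// Fixed-capacity waiter storage; N is chosen at build time.
pub struct WaiterTable<const N: usize> {
    slots: [Option<Waiter>; N],
}

impl<const N: usize> WaiterTable<N> {
    pub const fn new() -> Self {
        Self { slots: [None; N] }
    }

    /// Inserts without allocating; returns the waiter back when the table is full.
    pub fn insert(&mut self, waiter: Waiter) -> Result<(), Waiter> {
        for slot in self.slots.iter_mut() {
            if slot.is_none() {
                *slot = Some(waiter);
                return Ok(());
            }
        }
        Err(waiter)
    }

    /// Earliest pending absolute deadline, if any waiter is registered.
    pub fn earliest_deadline_ns(&self) -> Option<u64> {
        self.slots.iter().flatten().map(|w| w.deadline_ns).min()
    }

    /// Removes and returns one waiter whose deadline has passed, if any.
    pub fn pop_expired(&mut self, now_ns: u64) -> Option<Waiter> {
        for slot in self.slots.iter_mut() {
            if let Some(w) = *slot {
                if w.deadline_ns <= now_ns {
                    *slot = None;
                    return Some(w);
                }
            }
        }
        None
    }
}
```

A linear scan keeps the sketch short. A real table might prefer a bounded heap or timer wheel, as long as the interrupt path still never allocates.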

Network Poll Clock

The kernel-resident networking path is scheduler-polled. Rather than keep every network-coupled lease in ForcedPeriodic, the in-kernel virtio-net poll is now routed off a lease-isolated CPU (landed 2026-06-04, scheduler-nohz-network-poll-housekeeping-routing): virtio::poll_scheduler consults sched::current_cpu_lease_nohz_active() and skips driving the poll from a CPU inside a lease-backed tick-suppression window, so that CPU no longer needs the periodic tick to make network progress. The always-ticking housekeeping CPU that lease admission already requires keeps servicing virtqueue completions and pending network-waiter scans. The CpuIsolationLease activation preflight reflects this with a network_polling=routed-periodic-network-polling-to-housekeeping-cpu admit label when a housekeeping CPU is available. Otherwise it fails closed with rejected-network-polling-no-housekeeping-cpu-to-relocate, and the lease is refused at create when no housekeeping CPU exists. The longer-term explicit poll-deadline interface below remains the target for fully removing the dependency on a housekeeping CPU continuing to tick:

trait NetworkPollClock {
    fn next_poll_deadline_ns(now_ns: u64) -> Option<u64>;
    fn poll_until_budget(now_ns: u64, budget_ns: u64) -> PollResult;
}

next_poll_deadline_ns lets the scheduler include TCP/runtime timers in earliest_global_deadline(). poll_until_budget prevents network progress from becoming an unbounded idle-exit or interrupt path. A CPU with active networking may enter tickless idle only when the network runtime is inactive or has exposed a bounded deadline through this interface.

Kernel Idle

Tickless idle depends on replacing the user-mode idle process with a kernel/per-CPU idle context. Timer IRQ handling must distinguish:

IRQ from CPL3 user thread -> save/restore user context
IRQ from CPL0 idle        -> wake/check scheduler without fake user context

Idle entry shape:

if no runnable work:
    deadline = earliest_global_deadline()
    clockevent.program_oneshot(deadline - now)
    enter_kernel_idle()

The idle loop enables interrupts, halts, wakes on timer/IPI/device interrupt, then rechecks runnable work and deadline expiry.
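
The `deadline - now` step in the idle entry shape needs care when the deadline has already passed or lies beyond the timer's range. The following is a minimal sketch, assuming a hypothetical `ClockEventLimits` value built from `min_delta_ns()` and `max_delta_ns()`. Arming the maximum delta when no deadline is pending is an assumption made here so the CPU always keeps a timer source; it is not stated by the proposal.

```rust
/// Hypothetical copy of the limits a clockevent backend reports.
#[derive(Clone, Copy)]
pub struct ClockEventLimits {
    pub min_delta_ns: u64,
    pub max_delta_ns: u64,
}

/// Computes the one-shot delta to program before entering kernel idle.
/// Requires `min_delta_ns <= max_delta_ns` (`clamp` panics otherwise).
pub fn idle_oneshot_delta_ns(
    now_ns: u64,
    earliest_deadline_ns: Option<u64>,
    limits: ClockEventLimits,
) -> u64 {
    match earliest_deadline_ns {
        // An already-expired deadline saturates to 0 and is raised to the
        // backend minimum, so the timer fires as soon as the hardware allows.
        Some(deadline_ns) => deadline_ns
            .saturating_sub(now_ns)
            .clamp(limits.min_delta_ns, limits.max_delta_ns),
        // Assumption: with no pending deadline, arm the longest supported
        // one-shot instead of leaving the CPU without a timer.
        None => limits.max_delta_ns,
    }
}
```

A deadline beyond `max_delta_ns` results in an early wake. The idle loop then rechecks the deadline and reprograms the timer, which matches the recheck step described above.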

Tickless State

Per CPU:

Periodic:
  normal scheduler tick active

TicklessIdle:
  no runnable thread
  one-shot local timer programmed for earliest deadline
  CPU in kernel idle

ForcedPeriodic:
  fallback when a subsystem still needs regular polling

Enter TicklessIdle only when:

run queue empty
no direct IPC target
no deferred completion work
no timer-side ring work required
clockevent supports one-shot
kernel idle context available
network runtime inactive or deadline-driven

Keep periodic preemption whenever there is runnable contention. A CPU with even one runnable user thread also stays periodic until Ring v2, CPU accounting, and timer-side polling dependencies are resolved.
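
The entry conditions can be read as a pure function over a per-CPU snapshot. This sketch uses hypothetical names (`IdleSnapshot`, `select_tick_state`). The mapping of each failed condition to Periodic or ForcedPeriodic is an assumption: the proposal lists the conditions but does not say which fallback each one selects.

```rust
/// Per-CPU tick mode, mirroring the three states listed above.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum TickState {
    Periodic,
    TicklessIdle,
    ForcedPeriodic,
}

/// Hypothetical snapshot of the TicklessIdle entry conditions for one CPU.
#[derive(Clone, Copy)]
pub struct IdleSnapshot {
    pub run_queue_empty: bool,
    pub direct_ipc_target: bool,
    pub deferred_completion_work: bool,
    pub timer_side_ring_work: bool,
    pub clockevent_oneshot: bool,
    pub kernel_idle_available: bool,
    pub network_inactive_or_deadline_driven: bool,
}

pub fn select_tick_state(s: IdleSnapshot) -> TickState {
    // Assumption: runnable work or a missing backend capability keeps the
    // normal scheduler tick.
    if !s.run_queue_empty
        || s.direct_ipc_target
        || !s.clockevent_oneshot
        || !s.kernel_idle_available
    {
        return TickState::Periodic;
    }
    // Assumption: a subsystem that still needs regular polling selects the
    // ForcedPeriodic fallback.
    if s.deferred_completion_work
        || s.timer_side_ring_work
        || !s.network_inactive_or_deadline_driven
    {
        return TickState::ForcedPeriodic;
    }
    TickState::TicklessIdle
}
```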

Layer 2: SQPOLL NoHz

SQPOLL full-nohz is a later CPU ownership mode:

full-nohz is not a timer feature here;
it is part of the SQPOLL CPU ownership contract.

Required prerequisites:

  • Ring v2 or equivalent per-thread rings;
  • one SQ consumer per ring, including implemented syscall-mode leases and bounded SQPOLL mode transitions;
  • per-CPU scheduler ownership;
  • reschedule IPI and idle-to-runnable handoff;
  • at least one housekeeping CPU;
  • explicit placement of network polling away from isolated CPUs.

Current Phase F status: CpuIsolationLease and nohz telemetry exist, the housekeeping/deferred-work placement child records selected online housekeeping CPU masks plus deferred cleanup, timer/deadline, network polling, IRQ-affinity, accounting-target, and cleanup-latency placement or rejection labels, bounded SQPOLL ring mode can progress from periodic service or one current-thread syscall/producer-wake batch, and the clockevent/deadline substrate has split monotonic clocksource reads from LAPIC clockevent programming. The clockevent one-shot’s firing precision is proven, not just its programming: a runtime-reprogrammed TICK_NS/2 one-shot armed over the live periodic timer is measured to fire at its requested sub-tick instant (~5 ms for a 5 ms request, far under the 10 ms tick, with the current-count correctly reset to the sub-tick value), and the kernel-mode-fire path restores a live periodic timer so a one-shot consumed without running schedule() cannot strand the CPU with no timer source (make run-scheduling-context).

The monotonic clocksource discipline is now sub-tick-accurate as well. The periodic discipline step previously floored every fire to epoch + TICK_NS (max(tsc_interpolated, epoch + TICK_NS)), which inflated a real sub-tick interval to a full tick and hid sub-tick deadlines from the accounting clock. discipline_clocksource_tick now trusts the TSC interpolation at sub-tick granularity and falls back to the TICK_NS floor only when the interpolated advance is implausibly small (below MIN_DISCIPLINED_ADVANCE_NS), preserving a minimum forward rate against a degenerate TSC (publish_monotonic_ns enforces only non-decreasing time, not a minimum rate). A boot proof advances a real TICK_NS/2 interval through one discipline step and asserts monotonic_ns() tracked the sub-tick delta rather than the full-tick floor (make run-scheduling-context).

The first activation increment is now real: the CpuIsolationLease activation preflight performs real per-CPU periodic-tick suppression for the narrow single-runnable-entity window. When the preflight finds every proof obligation satisfied – exactly one runnable caller on the target CPU, ready housekeeping CPU, no local deferred-cleanup/timer dependency, valid accounting target, live monotonic clocksource, non-stale one-SQ-consumer, and bounded revocation latency – and the target CPU is the CPU running the preflight, it masks the periodic LAPIC tick and arms a bounded one-shot deadline at min(nearest pending timer wakeup, now + max revocation latency). Network polling is now routed to a housekeeping CPU rather than kept read-only fail-closed (landed 2026-06-04): the in-kernel virtio-net poll skips driving from a lease-isolated CPU (virtio::poll_scheduler consulting sched::current_cpu_lease_nohz_active()), so the admission network_polling gate flips to a routed-periodic-network-polling-to-housekeeping-cpu admit when a housekeeping CPU is available and fails closed otherwise. IRQ affinity is now routable in a bounded form (landed 2026-06-04): when a lease opts in, the activation path reprograms the leased CPU’s legacy IO-APIC redirection-entry destinations onto the selected housekeeping CPU (mask-before-reprogram + read-back, restored on rollback/revoke) before admitting tick suppression, and keeps the conservative rejected-irq-affinity-not-routed-to-housekeeping refusal for any ring-coupled lease whose IRQ dependency cannot be safely rerouted. 
The live reroute is presently scoped to a quiescent housekeeping destination: under the in-kernel KVM irqchip, reprogramming an IO-APIC redirection-entry destination onto a CPU that is actively scheduling stalls forward progress on that destination CPU, so a general “reroute onto any housekeeping CPU regardless of occupancy” admission remains future work behind a real destination-quiescence gate or a delivery backend without that re-evaluation cost. Every disqualifying change (stale lease generation, a second runnable entity, stealable sibling work, a local deferred-cleanup dependency, a target-CPU mismatch, or a one-shot backend that can no longer arm a deadline) rolls the CPU back to the periodic LAPIC tick first, before ordinary work continues. Generic full-nohz for ordinary budgeted compute threads is now admitted through explicit SchedulingContext-targeted compute leases. A generic SQPOLL nohz state machine now admits explicitly leased caller-thread rings when the ring is in SQPOLL running/sleeping mode with a live owner, one SQ consumer, and bounded producer-wake/deadline rollback. Broader userspace-poller/device-queue admission and production realtime island admission remain future work; the periodic tick stays the fail-closed fallback everywhere else. Timeout-based auto-revoke has since landed: a lease created with leaseLifetimeNs > 0 auto-revokes on first observation past its deadline (reason=lease-expired) and a tickless CPU under it rolls back at the next recheck (lease-lifetime-expired) (docs/tasks/done/2026-05-30/scheduler-cpu-isolation-lease-timeout-auto-revoke.md). SQPOLL-driven activation is now proven by make run-scheduler-generic-sqpoll-nohz: a ring-coupled kernelSqpoll lease whose bound ring is in SQPOLL running/sleeping mode with a live owner is admitted for tick suppression, producer wake drives bounded non-periodic service, and revoke/stale-owner rollback fails closed. 
The per-CPU idle thread has also landed – the scheduler idle path is now a CPL0 per-CPU kernel idle thread and the user-mode idle process is gone (docs/tasks/README.md).

The non-atomic createLease-vs-revokeGrant SMP window (kernel/src/cap/cpu_isolation_pool_grant.rs:472-483) – a createLease that passes the grant live-check on one CPU can register its lease just after a concurrent revokeGrant on another CPU snapshotted the registry, so that lease is not cascade-terminated and lingers until its own leaseLifetimeNs or an explicit revoke – is now a modeled, bounded residual rather than a prose-only caveat. The Alloy lease/grant authority model represents it explicitly as the WindowLingering set and checks that no live lease reaches a revoked grant outside it. That the lingering lease was nonetheless legitimately authorized (no lease is ever minted through an already-revoked grant) is a temporal mint-time-vs-revoke property the static relational model does not itself check; it rests on the code’s create-time minted_grant_live gate (cpu_isolation_pool_grant.rs:484), which fails closed before admission. Taken together this is a bounded capacity-hold window, not an authority escalation. The companion TLA+ model checks the two-lock teardown the cascade and prune share (generation advances exactly once, no capacity double-free, no stranded generation). Both run under make model-scheduler-lease-alloy / make model-scheduler-lease-tla; see models/scheduler/README.md.

The nohz/tickless activation-rollback path – the lock-free NOHZ_ACTIVE_CPUS bit read from ISR context against the locked dispatch.nohz_activation[slot] record, with IPI-delivered cross-CPU activation/rollback – is likewise now a checked model rather than a prose-only invariant. The TLA+ lifecycle model (models/scheduler/nohz_activation.tla) checks that no scheduler CPU is ever left timer-less (a fired one-shot always has the contention fallback re-arm enabled, and is always eventually re-armed), that the lock-free bit and the locked record always reconcile (the bit-set/record-cleared and record-present/bit-cleared divergences the rollback and contention paths produce are transient), and that a staled remote activation is dropped rather than applied to a newer lease (a staled generation is never committed, and a recorded generation staled by the cap-side maybe_expire path is always rolled back by the stale-lease-generation disqualifier). A focused Loom test pins the lock-free-bit ↔ locked-record reconciliation under the C11 memory model. Both run under make model-scheduler-nohz-tla / make model-scheduler-nohz-loom; see models/scheduler/README.md.

Ring mode:

enum RingMode {
    Syscall,
    SqpollStarting,
    Sqpoll,
    SqpollStopping,
}

In syscall mode, the owner thread’s cap_enter drains SQ. In SQPOLL mode, a kernel worker owns SQ head; userspace owns SQ tail and CQ head; cap_enter waits for completions and may wake a sleeping poller, but it does not drain SQ.

SQPOLL state:

Disabled -> Starting -> Running -> IdleSpinning -> Sleeping -> Stopping

The wake protocol uses a NEED_WAKEUP flag. Userspace release-stores the SQ tail, acquire-loads flags, and invokes a wake path only if the poller has gone to sleep.

The race-free sequence is normative.

Poller before sleeping:

// Publish the intent to sleep before the final tail recheck.
flags.fetch_or(NEED_WAKEUP, SeqCst);

let tail = sq_tail.load(Acquire);
if sq_head != tail {
    // New work arrived: cancel the sleep and keep polling.
    flags.fetch_and(!NEED_WAKEUP, Release);
    continue;
}

park();

Producer:

write_sqe();
sq_tail.store(new_tail, Release);
fence(SeqCst);

let flags = flags.load(Acquire);
if flags & NEED_WAKEUP != 0 {
    wake_poller();
}

The poller must set NEED_WAKEUP before the final tail recheck. Otherwise a producer can publish a new SQE after the poller checks the tail but before it parks, losing the wake.

The NEED_WAKEUP publication must also be ordered before the final tail recheck by a full store-to-load barrier. A SeqCst RMW is the simplest portable rule for the ABI text; an implementation may substitute an explicitly reviewed architecture-specific fence or park primitive that provides the same ordering. A plain release store or release-only RMW is not sufficient for this protocol.

The producer must likewise order the SQ tail publication before checking NEED_WAKEUP. The normative sequence uses a full fence between sq_tail.store(..., Release) and flags.load(Acquire); an implementation may substitute an explicitly reviewed equivalent that prevents the producer from missing NEED_WAKEUP while the poller misses the new tail before parking.
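
Both halves of the handshake can be written as decision functions over shared atomics. This is a sketch, not the ring ABI: `SqShared` and its method names are hypothetical, and `std` atomics stand in for the shared ring memory. It also adds an explicit SeqCst fence after the poller's RMW as a conservative choice; the normative text above requires only the SeqCst RMW or a reviewed equivalent.

```rust
use std::sync::atomic::{fence, AtomicU32, AtomicU64, Ordering};

/// Illustrative flag bit; the real ABI value is not defined here.
pub const NEED_WAKEUP: u32 = 1;

/// Hypothetical view of the shared SQ tail and flags words.
pub struct SqShared {
    pub sq_tail: AtomicU64,
    pub flags: AtomicU32,
}

impl SqShared {
    pub const fn new() -> Self {
        Self { sq_tail: AtomicU64::new(0), flags: AtomicU32::new(0) }
    }

    /// Poller side: returns true only if the poller may park.
    pub fn poller_prepare_sleep(&self, sq_head: u64) -> bool {
        // Publish the intent to sleep before the final tail recheck.
        self.flags.fetch_or(NEED_WAKEUP, Ordering::SeqCst);
        // Conservative addition in this sketch: an explicit full fence.
        fence(Ordering::SeqCst);
        let tail = self.sq_tail.load(Ordering::Acquire);
        if sq_head != tail {
            // Work arrived in the window: cancel the sleep.
            self.flags.fetch_and(!NEED_WAKEUP, Ordering::Release);
            return false;
        }
        true
    }

    /// Poller side: clears the flag after being woken.
    pub fn poller_resume(&self) {
        self.flags.fetch_and(!NEED_WAKEUP, Ordering::Release);
    }

    /// Producer side: publishes the tail and returns true if a wake is needed.
    pub fn producer_publish(&self, new_tail: u64) -> bool {
        self.sq_tail.store(new_tail, Ordering::Release);
        fence(Ordering::SeqCst);
        self.flags.load(Ordering::Acquire) & NEED_WAKEUP != 0
    }
}
```

With a full barrier on both sides, at least one of two outcomes holds: the poller's recheck observes the new tail, or the producer observes NEED_WAKEUP. Either outcome prevents the lost wake.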

An SQPOLL CPU may suppress the periodic tick only if:

cpu role is SqpollIsolated
exactly one runnable entity is the poller
no ordinary user thread is runnable there
no timer-side SQ polling is enabled
no network scheduler polling is pinned there
no deferred cleanup is pinned there
stable clocksource/accounting exists
housekeeping CPU is online

If any condition fails, restore periodic tick or migrate the unrelated work.

NoHz Activation Proof Obligations

To enter SqpollNoHz or future AutoNoHz, the scheduler must prove:

exactly one runnable entity is assigned to the CPU
at least one housekeeping CPU is online
no local network polling dependency remains
no timer-side SQ polling can run for the active ring
no local deferred cleanup or unbound kernel worker is pinned there
no unmigratable IRQ targets that CPU unless explicitly allowed
clocksource and CPU accounting are boundary/counter driven, not tick driven
revocation latency is within the lease policy

The proof is dynamic. If any condition stops holding, the scheduler must restore periodic tick, migrate unrelated work, revoke the lease, or leave nohz mode before continuing.
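
Because the proof is dynamic, it helps to picture it as a check that is re-run on every relevant state change and names the first obligation that failed. The sketch below is illustrative only: `NoHzProofInput`, `NoHzRejection`, and the field names are hypothetical and do not correspond to the kernel's actual rejection labels.

```rust
/// Hypothetical rejection reasons, one per obligation listed above.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum NoHzRejection {
    NotSingleRunnable,
    NoHousekeepingCpu,
    LocalNetworkPolling,
    TimerSideSqPolling,
    LocalDeferredWork,
    UnmigratableIrq,
    TickDrivenAccounting,
    RevocationLatencyTooHigh,
}

/// Hypothetical snapshot of the state the scheduler must prove.
#[derive(Clone, Copy)]
pub struct NoHzProofInput {
    pub runnable_entities: u32,
    pub housekeeping_cpus_online: u32,
    pub local_network_polling: bool,
    pub timer_side_sq_polling: bool,
    pub local_deferred_work: bool,
    pub unmigratable_irqs: u32,
    pub unmigratable_irqs_allowed: bool,
    pub accounting_tick_driven: bool,
    pub revocation_latency_ns: u64,
    pub lease_max_revocation_latency_ns: u64,
}

/// Returns Ok only when every obligation holds; otherwise the first failure.
pub fn check_nohz_activation(p: &NoHzProofInput) -> Result<(), NoHzRejection> {
    use NoHzRejection::*;
    if p.runnable_entities != 1 { return Err(NotSingleRunnable); }
    if p.housekeeping_cpus_online == 0 { return Err(NoHousekeepingCpu); }
    if p.local_network_polling { return Err(LocalNetworkPolling); }
    if p.timer_side_sq_polling { return Err(TimerSideSqPolling); }
    if p.local_deferred_work { return Err(LocalDeferredWork); }
    if p.unmigratable_irqs > 0 && !p.unmigratable_irqs_allowed {
        return Err(UnmigratableIrq);
    }
    if p.accounting_tick_driven { return Err(TickDrivenAccounting); }
    if p.revocation_latency_ns > p.lease_max_revocation_latency_ns {
        return Err(RevocationLatencyTooHigh);
    }
    Ok(())
}
```

Any `Err` from a re-check maps onto one of the responses named above: restore the periodic tick, migrate unrelated work, revoke the lease, or leave nohz mode.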

Layer 3: AutoNoHz CPU Lease

The long-term design should split eligibility from activation.

Eligibility says a thread, process, ring, or realtime island may use nohz isolation:

enum NoHzKind {
    Idle,
    KernelSqpoll,
    AutoCompute,
    AutoUserspacePoller,
    RealtimeIsland,
}

struct NoHzEligibility {
    kind: NoHzKind,
    max_revocation_latency_ns: u64,
    preferred_cpus: CpuSet,
    allow_busy_spin: bool,
    accounting_target: CpuAccountingTarget,
}

enum CpuAccountingTarget {
    CurrentSchedulingContext,
    SchedulerResourceLedger,
}

Activation is a scheduler proof that a CPU currently satisfies isolation conditions. Without a lease, a latency-sensitive hint may influence placement but must not grant exclusive CPU access.

Future lease shape:

CpuIsolationLease:
  owner process/session
  allowed CPU set
  allowed mode: poller/compute/kernel-worker
  accounting target, not CPU-time credit
  revocation policy

Housekeeping must be explicit:

Housekeeping CPU set:
  global timers
  deferred frees
  cleanup
  statistics
  non-critical kernel workers
  debug/watchdog
  load balancing and migration control

Layer 4: Deadline Metadata

Deadline metadata lives in fixed ring ABI fields, not in a Cap’n Proto SQE envelope and not in variable side metadata. The current fixed SQE layout should not be silently reinterpreted; add these fields through a versioned CapSqeV2/ring ABI gate when the transport is ready.

#[repr(C)]
struct CapSqeV2 {
    // existing fixed CapSqe fields, unchanged in order and meaning

    deadline_ns: u64,  // absolute monotonic deadline, 0 = none
    qos_flags: u32,    // drop/allow/reorder/propagate semantics
    sched_ctx_id: u32, // 0 = current/default scheduling context
}

deadline_ns is an absolute monotonic timestamp. It is request freshness metadata, not a promise of nanosecond wakeup precision. The kernel may round timer programming to clockevent granularity, coalesce timers where policy allows, or report a miss when dispatch observes the timestamp has already expired. The field remains u64 nanoseconds because absolute u64 ns values are simple, tracing-friendly, and shared with existing timeout surfaces; a u64 microsecond field saves no ABI space.

Only consider a compact profile if SQE space becomes critical:

deadline_delta_us: u32

That profile would be a soft-deadline compact transport shape only. It is not the primary realtime or SchedulingContext ABI and must not replace deadline_ns for admitted realtime work.

ABI negotiation uses both bootstrap metadata and a runtime query surface:

struct RuntimeBootInfo {
    ring_addr: u64,
    ring_abi_version: u32,
    sqe_size: u16,
    cqe_size: u16,
}

  • Process bootstrap passes the ring ABI version and fixed entry sizes alongside the ring address.
  • RuntimeBootInfo, ring ABI version constants, and fixed SQE/CQE layouts live in capos-config/src/ring.rs; the kernel and capos-rt import the same definition rather than carrying local copies.
  • A future RuntimeInfo/SystemInfo query returns the kernel-supported ring ABI range so language runtimes can fail before mapping incompatible rings.
  • cap_enter rejects unsupported SQE versions or entry sizes with stable transport errors such as CAP_ERR_UNSUPPORTED_RING_ABI and CAP_ERR_UNSUPPORTED_SQE_VERSION.
  • Runtimes in Rust, C, Go, and other languages must generate or mirror the exact fixed layout for the negotiated version.
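
A runtime-side check of the negotiated layout might look like the sketch below. The version numbers and entry sizes in `layout_for` are invented for illustration and are not the real ABI values. Mapping a size mismatch to the SQE-version error is also an assumption, since the text above names the errors but not which condition selects each one.

```rust
/// Mirrors the RuntimeBootInfo fields shown above.
pub struct RuntimeBootInfo {
    pub ring_addr: u64,
    pub ring_abi_version: u32,
    pub sqe_size: u16,
    pub cqe_size: u16,
}

/// Stand-ins for CAP_ERR_UNSUPPORTED_RING_ABI and
/// CAP_ERR_UNSUPPORTED_SQE_VERSION.
#[derive(Debug, PartialEq, Eq)]
pub enum TransportError {
    UnsupportedRingAbi,
    UnsupportedSqeVersion,
}

/// Hypothetical table of (sqe_size, cqe_size) per ring ABI version.
fn layout_for(ring_abi_version: u32) -> Option<(u16, u16)> {
    match ring_abi_version {
        1 => Some((64, 32)), // invented sizes for illustration
        2 => Some((80, 32)), // invented sizes for illustration
        _ => None,
    }
}

/// Fails before any ring is mapped when the advertised layout is unknown.
pub fn validate_ring_abi(info: &RuntimeBootInfo) -> Result<(), TransportError> {
    let (sqe_size, cqe_size) =
        layout_for(info.ring_abi_version).ok_or(TransportError::UnsupportedRingAbi)?;
    if info.sqe_size != sqe_size || info.cqe_size != cqe_size {
        // Assumption: a size mismatch reports the SQE-version error.
        return Err(TransportError::UnsupportedSqeVersion);
    }
    Ok(())
}
```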

Suggested flags:

DROP_IF_LATE:
  if now > deadline_ns before dispatch, post DEADLINE_EXPIRED

ALLOW_LATE:
  dispatch anyway, but CQE/telemetry marks late

PROPAGATE_DEADLINE:
  endpoint CALL/RETURN carries deadline metadata to server-side request

DEADLINE_ORDERED:
  SQPOLL may reorder within a bounded window only when all reorder-safety
  checks below pass

NO_BLOCKING_PATH:
  reject if target method/op is not declared realtime-safe

Do not put budget, period, priority, criticality, or CPU affinity into each SQE. Deadline is per request. Budget is execution authority.
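
The late-request flags can be sketched as a dispatch-time decision. The bit positions and the `DispatchDecision` type are hypothetical. Treating a late request with neither flag set like ALLOW_LATE is an assumption, not a rule stated by the proposal.

```rust
/// Illustrative bit assignments; the real qos_flags encoding is not defined here.
pub const DROP_IF_LATE: u32 = 1 << 0;
pub const ALLOW_LATE: u32 = 1 << 1;

#[derive(Debug, PartialEq, Eq)]
pub enum DispatchDecision {
    /// Deadline absent or not yet passed.
    Dispatch,
    /// Dispatch anyway; CQE/telemetry should mark the request late.
    DispatchMarkedLate,
    /// Do not dispatch; post a DEADLINE_EXPIRED completion.
    DeadlineExpired,
}

/// Decides what to do with one SQE at dispatch time.
/// `deadline_ns == 0` means no deadline, as in CapSqeV2.
pub fn decide_dispatch(now_ns: u64, deadline_ns: u64, qos_flags: u32) -> DispatchDecision {
    if deadline_ns == 0 || now_ns <= deadline_ns {
        return DispatchDecision::Dispatch;
    }
    if qos_flags & DROP_IF_LATE != 0 {
        return DispatchDecision::DeadlineExpired;
    }
    // Assumption: ALLOW_LATE, or no lateness flag at all, dispatches and
    // marks the completion late.
    DispatchDecision::DispatchMarkedLate
}
```

The function reads only per-request metadata. It never consults budget, period, or priority, which stay with the SchedulingContext.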

DEADLINE_ORDERED is valid only when all of the following are true:

the ring mode permits reordering
the SQE marks this request reorderable
the target capability interface and method declare reorder-safe semantics
the reordering window is bounded
the operation does not depend on earlier same-ring requests for correctness

Ordered side effects such as write A; write B; flush or lock; mutate; unlock must not be deadline-reordered unless the target method contract explicitly defines that sequence as reorder-safe.

Layer 5: SchedulingContext

CPU time should become a capability-controlled object:

struct SchedulingContext {
    budget_ns: u64,
    period_ns: u64,
    relative_deadline_ns: u64,
    priority: u16,
    criticality: u8,
    cpu_mask: CpuSet,
    overrun_policy: OverrunPolicy,
    timeout_endpoint: Option<EndpointRef>,
}

Kernel responsibilities:

  • decrement remaining budget by actual runtime;
  • replenish budget by period;
  • throttle or fault a thread on depletion;
  • enforce CPU mask and scheduling eligibility;
  • dispatch among eligible contexts by the selected realtime policy;
  • prevent untrusted SQE bytes from minting budget.
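
The first three responsibilities (charge, replenish, throttle) can be sketched as a small accounting object. `BudgetAccount` and its methods are hypothetical. Replenishing to the full budget at each period boundary, and skipping whole missed periods, is one simple policy chosen for illustration; the proposal does not fix the replenishment rule.

```rust
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub enum BudgetState {
    Runnable,
    Throttled,
}

/// Hypothetical runtime accounting for one SchedulingContext.
pub struct BudgetAccount {
    budget_ns: u64,
    period_ns: u64,
    remaining_ns: u64,
    next_replenish_ns: u64,
}

impl BudgetAccount {
    pub fn new(budget_ns: u64, period_ns: u64, now_ns: u64) -> Self {
        assert!(period_ns > 0, "period must be non-zero");
        Self { budget_ns, period_ns, remaining_ns: budget_ns, next_replenish_ns: now_ns + period_ns }
    }

    /// Charges actual runtime; the context is throttled once budget hits zero.
    pub fn charge(&mut self, ran_ns: u64) -> BudgetState {
        self.remaining_ns = self.remaining_ns.saturating_sub(ran_ns);
        if self.remaining_ns == 0 { BudgetState::Throttled } else { BudgetState::Runnable }
    }

    /// Refills the budget at a period boundary; returns true if it refilled.
    pub fn replenish_if_due(&mut self, now_ns: u64) -> bool {
        if now_ns < self.next_replenish_ns {
            return false;
        }
        // Skip whole missed periods so unused budget never accumulates.
        let missed = (now_ns - self.next_replenish_ns) / self.period_ns + 1;
        self.next_replenish_ns += missed * self.period_ns;
        self.remaining_ns = self.budget_ns;
        true
    }

    pub fn remaining_ns(&self) -> u64 {
        self.remaining_ns
    }
}
```

Nothing in this object is writable from SQE bytes. Budget changes only through `charge` and `replenish_if_due`, which is the property the last bullet asks for.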

Policy-service responsibilities:

  • admission control;
  • budget/period/priority selection;
  • CPU-isolation lease policy;
  • overload response;
  • telemetry and retuning.

Layer 6: Donation

Synchronous capability calls need scheduling-context donation:

client SchedulingContext -> passive server endpoint
server runs on donated budget/deadline
context returns on reply
timeout/overrun reports to caller or island policy

Without donation or inheritance, a realtime caller can be defeated by a normal-priority server that holds the capability implementation path.

Donation semantics must be fixed before implementation:

max donation call depth:
  bounded per SchedulingContext or RealtimeIsland; overflow fails closed.

nested donation:
  nested synchronous calls carry the current donated context until the depth
  bound, unless a callee uses its own admitted context by explicit policy.

cycle handling:
  a donated context may not re-enter a thread already on its donation stack;
  cycles fail with a typed realtime/donation error.

partial failure:
  budget already consumed stays charged to the context that ran the work.
  rollback of authority or memory is separate from CPU charge rollback.

timeout propagation:
  the earliest of request deadline, scheduling-context deadline, and explicit
  call timeout bounds downstream execution.

server-side blocking:
  a passive server running on donated context may block only on approved
  realtime-safe waits or synchronous calls that continue donation.

return on exception:
  application exceptions, transport errors, and cancellation return the
  context to its previous owner before CQE/error delivery.

async endpoint queues:
  donation does not cross ordinary async endpoint enqueue by default. Async
  donation requires an explicit future token/lease design.
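
Three of these rules (bounded depth, cycle rejection, and earliest-deadline propagation) are mechanical enough to sketch. `DonationStack`, `DonationError`, and the `u32` thread identifiers are hypothetical. Checking for a cycle before checking depth is an arbitrary ordering chosen for this sketch.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum DonationError {
    DepthExceeded,
    Cycle,
}

/// Hypothetical donation stack for one scheduling context.
/// MAX bounds the number of nested callees; the owner is tracked separately.
pub struct DonationStack<const MAX: usize> {
    owner: u32,
    callees: [u32; MAX],
    depth: usize,
}

impl<const MAX: usize> DonationStack<MAX> {
    pub fn new(owner: u32) -> Self {
        Self { owner, callees: [0; MAX], depth: 0 }
    }

    /// Thread currently running on the context.
    pub fn holder(&self) -> u32 {
        if self.depth == 0 { self.owner } else { self.callees[self.depth - 1] }
    }

    /// Donates to a callee; fails closed on re-entry or depth overflow.
    pub fn donate_to(&mut self, callee: u32) -> Result<(), DonationError> {
        if callee == self.owner || self.callees[..self.depth].contains(&callee) {
            return Err(DonationError::Cycle);
        }
        if self.depth == MAX {
            return Err(DonationError::DepthExceeded);
        }
        self.callees[self.depth] = callee;
        self.depth += 1;
        Ok(())
    }

    /// Returns the context to its previous holder (reply, error, or cancel).
    pub fn return_to_previous(&mut self) -> u32 {
        self.depth = self.depth.saturating_sub(1);
        self.holder()
    }
}

/// Earliest of the request deadline, context deadline, and call timeout,
/// all expressed as absolute monotonic nanoseconds; None means unbounded.
pub fn downstream_bound_ns(
    request_deadline_ns: Option<u64>,
    context_deadline_ns: Option<u64>,
    call_timeout_deadline_ns: Option<u64>,
) -> Option<u64> {
    [request_deadline_ns, context_deadline_ns, call_timeout_deadline_ns]
        .iter()
        .flatten()
        .copied()
        .min()
}
```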

Hot admitted paths should avoid blocking locks. If a shared resource cannot be modeled as a passive service, it needs a reviewed priority/deadline-inheritance primitive or a bounded try-lock/fail/drop policy.

Layer 7: RealtimeIsland

RealtimeIsland admits a whole loop or graph:

struct RealtimeIslandSpec {
    period_ns: u64,
    deadline_ns: u64,
    cpu_set: CpuSet,
    nodes: Vec<NodeBudget>,
    rings: Vec<RingSpec>,
    memory: Vec<PreallocSpec>,
    devices: Vec<DeviceReservation>,
    overrun_policy: OverrunPolicy,
}

Admission requires:

  • total budget fits period/deadline constraints;
  • all hot-path buffers are preallocated;
  • hot-path memory is committed and resident before start;
  • guaranteed hot-path memory uses the OOM proposal’s MemoryResidency policy as pinned or secret; normal memory is not admitted for guaranteed hot paths. A future lock-resident operation may transition ordinary memory into a pinned reservation before admission, but the admitted island sees the result as pinned, not as normal;
  • all caps and policy decisions are resolved before start;
  • no expected page faults on the hot path;
  • no unbounded lock acquisition;
  • no blocking endpoint calls inside callback loops;
  • no allocation, logging, service discovery, or provider credential work on the realtime path;
  • IRQ and deferred work are bounded or moved outside the island.
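
The first admission requirement can be sketched as an arithmetic check. This is deliberately conservative and rests on a loud assumption: all nodes run back to back on a single CPU, so their budgets must sum to no more than the smaller of the period and the deadline. The `budget_ns` field of `NodeBudget` and the `AdmissionError` variants are hypothetical. A real admission test for multi-CPU islands or parallel graph branches would differ.

```rust
/// NodeBudget appears in RealtimeIslandSpec; this single field is assumed.
#[derive(Clone, Copy)]
pub struct NodeBudget {
    pub budget_ns: u64,
}

#[derive(Debug, PartialEq, Eq)]
pub enum AdmissionError {
    InvalidWindow,
    BudgetOverflow,
    BudgetExceedsWindow,
}

/// Returns the slack in nanoseconds when the summed budget fits.
/// Assumption: serial execution of every node on one CPU per period.
pub fn check_island_budget(
    period_ns: u64,
    deadline_ns: u64,
    nodes: &[NodeBudget],
) -> Result<u64, AdmissionError> {
    if period_ns == 0 || deadline_ns == 0 {
        return Err(AdmissionError::InvalidWindow);
    }
    let window_ns = period_ns.min(deadline_ns);
    // Checked addition so a hostile spec cannot wrap the total to a small value.
    let total_ns = nodes
        .iter()
        .try_fold(0u64, |acc, n| acc.checked_add(n.budget_ns))
        .ok_or(AdmissionError::BudgetOverflow)?;
    if total_ns > window_ns {
        return Err(AdmissionError::BudgetExceedsWindow);
    }
    Ok(window_ns - total_ns)
}
```

The remaining requirements (residency, preallocation, bounded locks, IRQ placement) are evidence-based and cannot be reduced to arithmetic in the same way.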

Failure semantics must be typed:

CAP_ERR_DEADLINE_EXPIRED
CAP_ERR_BUDGET_EXHAUSTED
CAP_ERR_REALTIME_UNSAFE_PATH
CAP_ERR_REALTIME_ADMISSION_DENIED
CAP_ERR_OVERRUN
CAP_ERR_STALE_INPUT

CQE/status should distinguish not-started-late, completed-late, dropped by policy, throttled, and dependency-cancelled.

Policy-Service User Stories: AutoNoHz Placement for Compute-Capable Threads

The Layer 1-7 primitives above are mechanism: NoHzEligibility is a reviewed claim, CpuIsolationLease is the placement authority, SchedulingContext and the coarse ResourceLedger own CPU-time budget, and NoHzActivation is the scheduler proof that current CPU state allows tick suppression. They do not answer who decides to issue an eligibility hint for an ordinary user thread that was not pre-declared as a realtime island or kernel SQPOLL worker, or what observation justifies the issuance. That decision is policy, and it belongs in the user-space scheduler policy service described in Stage 7 of scheduler-evolution-proposal. This section records the user stories that motivate the responsibility and the bounds the policy service must enforce so auto-promotion never becomes an implicit “unlimited CPU-hold” grant.

Core property: promotion is placement, not budget

Auto-promotion adds isolation; it never mints CPU-time authority. A policy-issued CpuIsolationLease only removes tick and scheduler noise while its bound thread consumes time that was already authorized through its SchedulingContext or coarse ResourceLedger. SchedulingContext budget exhaustion is now folded into the same nearest-deadline timer as nohz revocation/timer work, so a tick-masked CPU is re-observed at the budget deadline rather than at a later periodic tick. When budget exhausts, or when any existing Layer 3 activation obligation stops holding, the existing fail-closed rollback path restores the periodic tick. Priority-aware revocation of the lease itself when an equal-or-higher-priority runnable arrives is new Phase H surface (see “Bounds the policy service must enforce” below); today’s Phase F rollback only restores ticks on the leased CPU and does not terminate the lease.

This separation answers the obvious objection. A busy-spinning thread cannot escalate itself into permanent CPU exclusivity, because the spin drains its allotted budget at the same rate periodic scheduling would have drained it. If the operator has granted enough budget to saturate a core, auto-promotion removes tick interference while that budget is consumed; if not, the same authority that would have throttled the thread under periodic scheduling still throttles it under nohz.

Trigger: “thread appears capable of utilizing a full CPU core”

The trigger is not a fixed percentage threshold inside the kernel. The kernel exports per-thread observation; the policy service synthesizes a saturation-capability signal from those observations and decides what “capable of utilizing a full CPU core” means for a given account, session, or service profile. Plausible inputs the policy service may combine:

  • runtime accumulated over a rolling window approaches the wall-clock window the thread had on its assigned CPU;
  • voluntary-block count over the same window stays low (the thread is not IPC- or IO-bound at a rate that would lose the benefit);
  • runnable-but-not-running time stays low when the thread is the only runnable entity on its CPU, or correlates with placement contention rather than IO when it is not.

Concrete window length, smoothing, and the synthesis rule are policy-service choices, replaceable without ABI churn. As of 2026-05-30 the kernel exports the observation inputs the heuristic consumes as ordinary (non-measure) per-thread state: runtime_ns/virtual_runtime_ns, voluntary_blocks, preemptions, and a cumulative runnable_accumulated_ns (runnable-but-not-running time) are all returned by SchedulingPolicyCap.snapshot @2. voluntary_blocks and preemptions were promoted out of cfg(feature = "measure") and runnable_accumulated_ns was added at the run-queue enqueue/select boundary; only migrations remains measure-gated. This closes the Phase H “monitoring/status surface that exports per-thread saturation observation” prerequisite. The surface exports raw cumulative counters only: no fixed threshold and no windowing live in the kernel – the policy service synthesizes the saturation signal.
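
Since the kernel exports only cumulative counters, the policy service has to difference two snapshots itself. The sketch below shows one possible synthesis rule. The three counter names come from the snapshot surface described above, while `ThreadCounters`, `SaturationPolicy`, and the threshold fields are hypothetical policy-service choices.

```rust
/// Subset of the cumulative per-thread counters described above.
#[derive(Clone, Copy)]
pub struct ThreadCounters {
    pub runtime_ns: u64,
    pub voluntary_blocks: u64,
    pub runnable_accumulated_ns: u64,
}

/// Hypothetical per-account thresholds; none of these live in the kernel.
#[derive(Clone, Copy)]
pub struct SaturationPolicy {
    pub min_runtime_permille: u64,
    pub max_voluntary_blocks: u64,
    pub max_runnable_wait_permille: u64,
}

/// Decides whether a thread looked capable of saturating a core over the
/// window between two snapshots taken `window_ns` apart.
pub fn appears_saturating(
    prev: ThreadCounters,
    curr: ThreadCounters,
    window_ns: u64,
    policy: SaturationPolicy,
) -> bool {
    if window_ns == 0 {
        return false;
    }
    // Share of the window in parts per thousand, computed in u128 to avoid overflow.
    let permille = |delta_ns: u64| (delta_ns as u128 * 1000 / window_ns as u128) as u64;
    let ran = permille(curr.runtime_ns.saturating_sub(prev.runtime_ns));
    let waited =
        permille(curr.runnable_accumulated_ns.saturating_sub(prev.runnable_accumulated_ns));
    let blocks = curr.voluntary_blocks.saturating_sub(prev.voluntary_blocks);
    ran >= policy.min_runtime_permille
        && blocks <= policy.max_voluntary_blocks
        && waited <= policy.max_runnable_wait_permille
}
```

Because the thresholds are plain data on the policy side, an operator can retune them per account or service profile without any change to the kernel surface.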

User stories

  1. Long-running compute tenant with declared budget. A model-training, video-encoding, or HPC build job is admitted with a SchedulingContext (or coarse ResourceLedger allocation) sized for sustained near-core utilization on a declared CPU pool. The policy service observes the thread saturating the pool’s CPU share, issues a bounded CpuIsolationLease against the pool, the scheduler proves the activation obligations from Layer 2/3, and ticks are suppressed for as long as the thread keeps consuming the granted budget. The lease ends when the budget exhausts, the job completes, the operator revokes the pool, or the saturation signal subsides.

  2. Userspace poller that earned isolation. A service polls a ring or device queue (a candidate AutoUserspacePoller in the NoHzKind taxonomy). The policy service sees consistent saturation with low voluntary blocking, recognizes the AutoUserspacePoller eligibility kind, and issues a lease. The bounds are the same as for the kernel SQPOLL path; only the consumer differs.

  3. Account-scoped auto-claim pool. An operator pre-declares “account X may auto-claim up to N isolated CPUs from pool P, maximum auto-lease lifetime L, with revocation latency R, charging to ledger E.” The policy service monitors threads owned by X, issues leases against P when saturation capability is observed, and refuses promotion when X already holds N leases or when no CPU in P currently satisfies the activation proof. Without the operator declaration the policy service does not auto-promote.

  4. Background agent that bursts to full-core compute. A general-purpose agent process does not normally saturate a core. When it briefly does (a planning phase, a build step, a local inference call), the policy service may issue a short-lifetime lease if the agent’s account has authorized auto-promotion. When the burst ends the signal subsides; the lease is not renewed.

Bounds the policy service must enforce

For every auto-issued lease the policy service records:

lifetime_ns:               bounded; shorter than admin-issued leases by
                           default; renewal requires re-observing the
                           saturation signal.
max_revocation_latency_ns: bounded by NoHzEligibility.max_revocation_latency_ns;
                           cannot exceed the operator/account policy.
accounting_target:         a live SchedulingContext or coarse ResourceLedger;
                           the lease does not mint CPU-time authority.
auto_claim_pool:           the pre-authorized CPU set; no implicit fallback to
                           system-wide isolation.
fairness_preemption:       another runnable entity at equal-or-higher policy
                           priority terminates the lease if no other CPU
                           authorized by both the pool and lease mask is
                           eligible.
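
A policy service could check these bounds before it asks the kernel for a lease. The sketch below is one hedged way to express that pre-check; every type and field name is hypothetical, the operator limits are assumed inputs, and `fairness_preemption` is left out because it is a run-time kernel behavior rather than a value the issuer can validate up front. Treating a zero lifetime as invalid is an assumption that follows from the "bounded" requirement for auto-issued leases.

```rust
#[derive(Debug, PartialEq)]
pub enum BoundsError {
    UnboundedLifetime,
    LifetimeExceedsPolicy,
    RevocationLatencyExceedsPolicy,
    MissingAccountingTarget,
    CpuMaskOutsidePool,
}

/// The lease charges an existing authority; it never mints CPU time.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum AccountingTarget {
    SchedulingContext(u64),
    ResourceLedger(u64),
}

/// Hypothetical record the policy service keeps per auto-issued lease.
pub struct AutoLeaseBounds {
    pub lifetime_ns: u64,
    pub max_revocation_latency_ns: u64,
    pub accounting_target: Option<AccountingTarget>,
    pub auto_claim_pool_mask: u64,
    pub lease_cpu_mask: u64,
}

/// Hypothetical operator/account limits for auto-issued leases.
pub struct OperatorPolicy {
    pub max_auto_lifetime_ns: u64,
    pub max_revocation_latency_ns: u64,
}

pub fn check_bounds(b: &AutoLeaseBounds, p: &OperatorPolicy) -> Result<(), BoundsError> {
    if b.lifetime_ns == 0 {
        return Err(BoundsError::UnboundedLifetime); // auto-issued leases must expire
    }
    if b.lifetime_ns > p.max_auto_lifetime_ns {
        return Err(BoundsError::LifetimeExceedsPolicy);
    }
    if b.max_revocation_latency_ns > p.max_revocation_latency_ns {
        return Err(BoundsError::RevocationLatencyExceedsPolicy);
    }
    if b.accounting_target.is_none() {
        return Err(BoundsError::MissingAccountingTarget);
    }
    // No implicit fallback: the lease mask must be a non-empty subset of the pool.
    if b.lease_cpu_mask == 0 || b.lease_cpu_mask & !b.auto_claim_pool_mask != 0 {
        return Err(BoundsError::CpuMaskOutsidePool);
    }
    Ok(())
}
```

A pre-check like this is only a convenience for the issuer; the kernel remains the admission gate for every bound it enforces.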

Two of these bounds map to existing kernel-enforced surfaces: max_revocation_latency_ns is already a field on NoHzEligibility and the closed Phase F activation preflight; accounting_target is already a field on NoHzEligibility and the live SchedulingContext/ResourceLedger authority.

The other three bounds needed new kernel-enforced surfaces before the heuristic could ship and were named as Phase H prerequisites. Their status:

  • lifetime_ns: LANDED 2026-05-30. CpuIsolationLeaseSpec now carries leaseLifetimeNs @6 (0 = no expiry, the default). A lease records an absolute monotonic expires_at_ns at creation; the first observation past the deadline auto-revokes through the existing generation-advancing cleanup (reason=lease-expired), and the nohz activation record carries the lifetime deadline so a tickless CPU rolls back at the next timer/IPI recheck (lease-lifetime-expired), bounded by maxRevocationLatencyNs. This is the bounded-lifetime guarantee the auto-issued placement lease needs, so a compromised, blocked, or malfunctioning policy service cannot leave an auto-issued lease holding the CPU indefinitely. The bounded renewal primitive LANDED on top of this: CpuIsolationLease.renew @4 pushes expires_at_ns forward to now + leaseLifetimeNs (clamped to the same one-hour ceiling read_spec enforces), keeping the same (leaseId, generation), accounting binding, and nohz activation state – distinct from re-minting a fresh lease. It is callable only before expiry (a revoked, auto-revoked, or past-deadline lease stays staleGeneration and is not resurrected; an unbounded leaseLifetimeNs = 0 lease reports notRenewable), and the renewed deadline is propagated to a tickless CPU’s nohz activation record so the lease-lifetime-expired disqualifier no longer rolls it back at the old deadline; CpuIsolationLeaseInfo.expiresAtNs echoes the deadline read-only. Only the Phase H renewal heuristic – re-observing the saturation signal to decide whether to call renew on a near-expiry lease – remains future policy-service work on top of this primitive.
  • auto_claim_pool and per-account capacity (N in user story 3): the operator-declared CPU-pool descriptor LANDED 2026-05-30, making a non-default poolId meaningful for the first time. CpuIsolationLeaseSpec carries poolId @7 (0 = the implicit default pool over every scheduler CPU), and the kernel seeds a fixed declared-pool registry (CpuIsolationPoolDescriptor: the default pool 0 plus exactly one declared non-default pool 1 over a single CPU). The create-time admission gate now looks the pool up: an undeclared poolId is rejected invalidSpec; a lease whose allowedCpuMask exceeds the declared pool’s CPU mask is rejected invalidSpec; a lease whose mask is a subset of the declared pool’s mask is admitted, and the pool id/mask are echoed read-only through CpuIsolationLeaseInfo (admittedPoolId/admittedPoolCpuMask) (proof make run-scheduler-cpu-isolation-lease: nondefault_pool=invalidSpec for the undeclared id, declared_pool=ok admitted_pool_id=1 admitted_pool_cpu_mask_subset=true, declared_pool_mask_violation=invalidSpec, default_pool_id=0). The declared-pool table is now operator-sourced (LANDED 2026-05-30): the kernel installs it from the boot manifest SystemConfig.cpuIsolationPools @14 (a List(CpuIsolationPoolDescriptor)), with the in-kernel constant as the fail-closed default when the manifest omits the list, and validates each entry fail-closed at boot (canonical CPU mask subset of the scheduler mask, default pool 0 synthesized if omitted, duplicate ids rejected). The boot line cpu-isolation: declared-pools source=manifest count=3 default_pool_id=0 nondefault_pool_id=1 nondefault_pool_cpu_mask=0x2 proves the source (proof make run-scheduler-cpu-isolation-lease; the kernel-default fallback is proven by cargo test-config decode/empty assertions).
The descriptor now also carries a per-pool live-lease capacity bound (poolMaxLeases @2, LANDED 2026-05-31): a non-zero value caps the number of simultaneously live (non-revoked, current-generation) leases the kernel admits against that pool at create-time, counted from the existing LEASE_REGISTRY after prune_dead, rejecting an over-capacity create fail-closed resourceExhausted (0 = unbounded, preserving the default pool 0 and every existing producer). The manifest bounds pool 2 at poolMaxLeases: 2; the proof admits two live leases, refuses a third non-overlapping create (cpu-isolation: pool-capacity-rejected admitted_pool_id=2 live_leases=2 pool_max_leases=2 result=resourceExhausted, pool_capacity_exceeded=resourceExhausted), then reclaims after a revoke (pool_capacity_reclaimed=ok), proving the bound is live-count not cumulative. The account identity and per-account N then landed on top of this counter (LANDED 2026-05-31): CpuIsolationLeaseSpec carries accountId @8 :UInt64 (0 = unattributed, caller-asserted and inert until counted, echoed read-only through CpuIsolationLeaseInfo.accountId @6) and CpuIsolationPoolDescriptor carries poolMaxLeasesPerAccount @3 :UInt32 (0 = unbounded per account). After the pool-wide check, register counts the requesting account’s live entries (matching both admitted_pool_id and account_id) against the per-account bound and rejects an over-bound create fail-closed resourceExhausted (0 account or 0 bound skips the gate). The manifest bounds pool 2 at poolMaxLeasesPerAccount: 1; the proof admits one account-7 lease, refuses a second account-7 create (cpu-isolation: account-capacity-rejected admitted_pool_id=2 account_id=7 account_live_leases=1 pool_max_leases_per_account=1 result=resourceExhausted, account_capacity_exceeded=resourceExhausted), admits a different account-9 lease on that CPU (account_capacity_other_account=ok – per-account, not pool-wide), and reclaims after revoking account-7 (account_capacity_reclaimed=ok). 
The account id is caller-asserted on the plain lease path. The authentication half LANDED 2026-05-31: CpuIsolationPoolGrant (schema/capos.capnp; source cpu_isolation_pool_grant; kernel kernel/src/cap/cpu_isolation_pool_grant.rs) introduced a bootstrap-staged grant that binds one authenticated account to one declared pool. Its createLease stamps the bound account/pool onto the minted lease, overriding any caller-asserted accountId/poolId, and reuses the same lease-create admission path (cpu_isolation::create_lease_for_caller) – so the per-account bound is unforgeable by cap-possession: a holder cannot assert another account to evade poolMaxLeasesPerAccount. The initial single-grant proof used account 7 bound to pool 2; the current make run-scheduler-cpu-isolation-pool-grant proof boots manifest-declared grants. The grant binding is now operator-declared (LANDED 2026-06-01): the manifest SystemConfig.cpuIsolationPoolGrants table seeds the bound (account, pool) pairs (mirroring the cpuIsolationPools table), and the cpu_isolation_pool_grant / cpu_isolation_pool_grant_secondary sources stage seeded binding index 0 / 1, so an operator can pre-authorize multiple distinct accounts/pools, each staged as its own bootstrap grant cap. An absent/empty list falls back to one in-kernel binding at index 0: account 7 bound to preferred pool 1 when active, otherwise account 7 bound to synthesized default pool 0, preserving a usable single default grant when a manifest-sourced pool table omits pool 1. make run-scheduler-cpu-isolation-pool-grant now boots a two-entry table (account 5/pool 1, account 8/pool 2) and proves each grant stamps its OWN bound account with the per-account bound still enforced. make run-scheduler-cpu-isolation-pool-grant-default boots the empty-list fallback with pool 1 omitted and proves the synthesized (account 7, pool 0) grant is usable. 
Runtime grant minting landed 2026-06-02 22:24 UTC (CpuIsolationGrantMinter): one cap mints a fresh CpuIsolationPoolGrant for an operator-chosen (account, pool) at call time, bounded by the declared SystemConfig.cpuIsolationGrantMinterAllowlist (an out-of-allowlist pair is refused unauthorized, so the minter is never an ambient grant-any authority; the minted grant reuses the same unforgeable createLease admission path). The same make run-scheduler-cpu-isolation-pool-grant smoke mints a grant for the allowed (account 6, pool 2), proves its createLease stamps account 6 and stays bounded by the per-account gate, and proves an out-of-allowlist mint is refused. Grant-revocation lifecycle landed 2026-06-03 17:11 UTC (CpuIsolationGrantMinter.revokeGrant), closing (c): a runtime-minted grant carries a revocable (grantId, generation); revokeGrant(grantId) advances the grant generation so a stale grant handle’s createLease fails staleGeneration and mints nothing, and revocation cascades to every live lease minted through that grant – reusing the landed fairness-termination cleanup (reason=grant-revoked, periodic-tick rollback, registry unregister) once per tagged lease, so per-pool/per-account live-lease capacity frees immediately and a fresh grant is admitted into the reclaimed slot. Double-revoke is alreadyRevoked and an unknown grantId is unknownGrant, both fail-closed; seeded bootstrap grants are not minter-owned and stay un-revocable. The same make run-scheduler-cpu-isolation-pool-grant smoke proves the full lifecycle. No pool authority is minted from holding a lease cap; the kernel stays the fail-closed admission gate.
  • fairness_preemption: LANDED 2026-06-02 21:17 UTC. The Phase F rollback path now compares policy priority at the existing nohz recheck site: when a second runnable entity appears on the leased CPU at equal-or-higher WFQ policy priority (latency_class, weight) than the captured leased thread, and no sibling CPU authorized by both the admitted pool and the lease allowedCpuMask is eligible to host the lease, the kernel terminates the CpuIsolationLease itself (fairness-preempted ... result=lease-terminated) rather than only restoring the periodic tick, bounded by maxRevocationLatencyNs. The termination runs the same generation-advancing cleanup leaseLifetimeNs expiry uses (reason=fairness-preempted) immediately after the scheduler restores the periodic tick, so a subsequent info/revoke reports staleGeneration and placement/account capacity is freed without waiting for the holder’s next cap call; a strictly-lower arrival or an eligible sibling CPU inside both masks keeps the existing tick-restore-only behavior. The kernel supplies the comparison and fail-closed termination; the policy service remains the issuer and bookkeeper of the saturation signal. Re-placement of the leased thread onto an eligible sibling CPU (instead of terminating) remains generic-full-nohz work; the “no sibling eligible” condition is recorded.
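
The lifetime and renewal rules in the lifetime_ns bullet can be modeled in a few lines. This is a sketch under stated assumptions, not the kernel code: the `Lease` type is hypothetical, "past the deadline" is read as `now >= expires_at_ns`, and the order in which a revoked unbounded lease reports its error is a guess. The one-hour ceiling, the `0 = no expiry` convention, and the staleGeneration/notRenewable outcomes come from the text.

```rust
/// One-hour ceiling on lease lifetimes, as described for read_spec.
pub const MAX_LEASE_LIFETIME_NS: u64 = 3_600_000_000_000;

#[derive(Debug, PartialEq)]
pub enum RenewError {
    StaleGeneration,
    NotRenewable,
}

/// Hypothetical model of a lease's lifetime state.
pub struct Lease {
    pub lifetime_ns: u64,           // 0 = no expiry
    pub expires_at_ns: Option<u64>, // absolute monotonic deadline
    pub revoked: bool,
}

impl Lease {
    pub fn create(now_ns: u64, lifetime_ns: u64) -> Lease {
        let lifetime_ns = lifetime_ns.min(MAX_LEASE_LIFETIME_NS);
        let expires_at_ns = if lifetime_ns == 0 {
            None
        } else {
            Some(now_ns.saturating_add(lifetime_ns))
        };
        Lease { lifetime_ns, expires_at_ns, revoked: false }
    }

    /// Pushes the deadline to now + lifetime; never resurrects a dead lease.
    pub fn renew(&mut self, now_ns: u64) -> Result<u64, RenewError> {
        if self.revoked {
            return Err(RenewError::StaleGeneration);
        }
        let deadline = match self.expires_at_ns {
            Some(deadline) => deadline,
            None => return Err(RenewError::NotRenewable), // unbounded lease
        };
        if now_ns >= deadline {
            // First observation past the deadline auto-revokes (assumed boundary).
            self.revoked = true;
            return Err(RenewError::StaleGeneration);
        }
        let renewed = now_ns.saturating_add(self.lifetime_ns);
        self.expires_at_ns = Some(renewed);
        Ok(renewed)
    }
}
```

The important property is that a policy service which stops calling `renew` loses the lease by default, which is what lets a stalled issuer fail closed.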

The policy service is the issuer and the bookkeeper of the synthesized saturation signal; the kernel remains the authority gate, the activation prover, and the fail-closed rollback path – including for the three Phase H surfaces above.
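
The pool-wide and per-account live-lease gates described for auto_claim_pool can be sketched as a counting check over a registry of live leases. The types below are hypothetical and model only the two capacity counters; pool declaration, CPU-mask subset checks, and grant authentication are deliberately omitted. The `0 = unbounded` convention and the rule that account 0 skips the per-account gate follow the text.

```rust
#[derive(Debug, PartialEq)]
pub enum AdmitError {
    ResourceExhausted,
}

/// Hypothetical mirror of the capacity fields on a declared pool.
pub struct Pool {
    pub id: u32,
    pub max_leases: u32,             // 0 = unbounded
    pub max_leases_per_account: u32, // 0 = unbounded per account
}

struct LiveLease {
    lease_id: u64,
    pool_id: u32,
    account_id: u64,
}

#[derive(Default)]
pub struct LeaseRegistry {
    live: Vec<LiveLease>,
    next_lease_id: u64,
}

impl LeaseRegistry {
    /// Admits a lease only if both live-count bounds still have room.
    pub fn create(&mut self, pool: &Pool, account_id: u64) -> Result<u64, AdmitError> {
        let in_pool = self.live.iter().filter(|l| l.pool_id == pool.id).count();
        if pool.max_leases != 0 && in_pool >= pool.max_leases as usize {
            return Err(AdmitError::ResourceExhausted);
        }
        // An unattributed account (0) or a zero bound skips the per-account gate.
        if account_id != 0 && pool.max_leases_per_account != 0 {
            let by_account = self
                .live
                .iter()
                .filter(|l| l.pool_id == pool.id && l.account_id == account_id)
                .count();
            if by_account >= pool.max_leases_per_account as usize {
                return Err(AdmitError::ResourceExhausted);
            }
        }
        self.next_lease_id += 1;
        let lease_id = self.next_lease_id;
        self.live.push(LiveLease { lease_id, pool_id: pool.id, account_id });
        Ok(lease_id)
    }

    /// Revocation frees capacity immediately: the bound is live-count, not cumulative.
    pub fn revoke(&mut self, lease_id: u64) {
        self.live.retain(|l| l.lease_id != lease_id);
    }
}
```

With a pool bounded at two leases and one lease per account, a second create for the same account is refused, a different account is admitted, and revoking the first lease reopens the slot, which matches the shape of the proof output quoted above.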

Explicit non-goals

  • The kernel does not contain a saturation-detection rule of its own. It exports observation; it does not synthesize the signal.
  • Auto-promotion does not grant unlimited CPU-hold. The lease is bounded by lifetime, budget, revocation, and pool capacity; absent a pre-authorized pool, no auto-promotion occurs.
  • Auto-promotion does not grant realtime authority. RealtimeIsland admission remains a separate, stricter path with preallocation, deadline, and no-blocking proofs.
  • Auto-promotion does not bypass donation, fairness, or session-lifecycle invariants. Process exit, session logout, and explicit revoke still tear the lease down through the existing Layer 3 rollback.

Telemetry Requirements

Tickless, nohz, SQPOLL, and realtime behavior must be observable through future monitoring/status capability surfaces, not only through ad hoc debug logs. The first counters should include:

scheduler_tick_count{cpu}
ticks_suppressed{cpu,mode}
nohz_enter_count{cpu,kind}
nohz_exit_count{cpu,reason}
oneshot_deadline_miss_count
sqpoll_busy_ns
sqpoll_sleep_count
deadline_expired_count
budget_exhausted_count
realtime_overrun_count
donation_depth_max
housekeeping_offload_count

These counters are correctness evidence. Missing or surprising values should fail focused nohz/realtime proofs rather than being treated as performance-only diagnostics.

The ticks_suppressed{cpu,mode} / scheduler_tick_count{cpu} evidence is realized as an asserted proof line on the lease path: make run-scheduler-cpu-isolation-lease now counts genuine periodic LAPIC fires per CPU (a fire is counted only when neither the lease-backed nor the idle tick-suppression bit is set, so the one-shot replacement is never miscounted) and, on lease nohz rollback, emits cpu-isolation: nohz suppressed-ticks cpu=<n> window_ns=<w> expected_periodic=<e> actual_periodic=<a> suppressed=<e-a>. The harness asserts that over a bounded masked window the leased CPU recorded actual near zero while expected was substantial – the periodic tick demonstrably stopped, not merely that the mask write was issued – and that a bounded post-rollback cpu-isolation: nohz restored-rate window shows the periodic rate returning. This is bounded proof-line evidence, not yet a durable SchedulingPolicyCap/monitoring telemetry field; the persistent ticks_suppressed surface and the generic-full-nohz path’s inheritance of the same measured assertion remain future telemetry work.
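
The arithmetic behind the suppressed-ticks proof line is simple enough to sketch. The function names and the harness thresholds below are hypothetical; the line format is copied from the paragraph above, and the 10 ms tick period used in the usage note is an assumption inferred from the "100 scheduler ticks" per second figure in the Verification section.

```rust
/// Result of comparing expected and observed periodic fires over a window.
#[derive(Debug, PartialEq)]
pub struct SuppressedTicks {
    pub expected_periodic: u64,
    pub actual_periodic: u64,
    pub suppressed: u64,
}

/// Computes expected periodic fires for the window and the suppressed count.
/// Returns None for a zero tick period, which would make the ratio meaningless.
pub fn suppressed_ticks(
    window_ns: u64,
    tick_period_ns: u64,
    actual_periodic: u64,
) -> Option<SuppressedTicks> {
    if tick_period_ns == 0 {
        return None;
    }
    let expected_periodic = window_ns / tick_period_ns;
    Some(SuppressedTicks {
        expected_periodic,
        actual_periodic,
        suppressed: expected_periodic.saturating_sub(actual_periodic),
    })
}

/// Harness-side check (thresholds are assumptions): the window must have been
/// long enough to expect many ticks, and almost none may have fired.
pub fn tick_demonstrably_stopped(r: &SuppressedTicks, min_expected: u64, max_actual: u64) -> bool {
    r.expected_periodic >= min_expected && r.actual_periodic <= max_actual
}

/// Formats the proof line in the shape quoted in the text.
pub fn proof_line(cpu: u32, window_ns: u64, r: &SuppressedTicks) -> String {
    format!(
        "cpu-isolation: nohz suppressed-ticks cpu={} window_ns={} expected_periodic={} actual_periodic={} suppressed={}",
        cpu, window_ns, r.expected_periodic, r.actual_periodic, r.suppressed
    )
}
```

For a 500 ms masked window at an assumed 10 ms period, 50 periodic fires are expected; observing one gives `suppressed=49`, and a harness requiring at least 20 expected and at most 2 actual fires would accept the window.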

Implementation Sequence

  1. Add timer/scheduler instrumentation around the existing periodic tick.
  2. Add monotonic_ns() backed by a clocksource that is not derived from the scheduler tick, and switch Timer.now plus scheduler accounting to that clocksource while keeping periodic scheduling. Completed for normal QEMU/x86_64 by the Phase F clockevent/deadline substrate.
  3. Convert timeout waiters to deadline_ns. Completed for Timer.sleep, finite cap_enter, and park timeouts by the Phase F clockevent/deadline substrate.
  4. Add LAPIC one-shot programming, periodic restore state, and a focused one-shot smoke. Completed as a disabled-nohz substrate proof by the Phase F clockevent/deadline substrate.
  5. Replace user-mode idle with kernel/per-CPU idle while keeping periodic ticks. Completed: the scheduler idle path is now a CPL0 per-CPU kernel idle thread and the user-mode idle process is gone (docs/tasks/README.md).
  6. Enable tickless idle only when there is no runnable work. Completed by docs/tasks/done/2026/scheduler-tickless-idle-step6.md: true-idle CPUs with no runnable non-idle work, no active nohz lease, no local deferred cleanup, no cap-enter polling dependency, and a one-shot LAPIC clockevent mask the periodic tick and arm a bounded one-shot at the next Timer/ParkSpace deadline or the 100 ms idle housekeeping floor. The scheduler restores the periodic tick before ordinary non-idle dispatch, on reschedule IPIs, and on backend/refusal rollback. Cap-enter polling waiters and ready-but-budget-throttled SchedulingContext retry windows remain periodic until the legacy terminal/network/IRQ polling and scheduling-context retry surfaces move behind explicit deadlines or housekeeping placement.
  7. Route the in-kernel virtio-net poll off a lease-isolated CPU to the housekeeping CPU (landed 2026-06-04); an explicit NetworkPollClock poll deadline remains the longer-term target.
  8. Add ring mode state and refuse timer-side SQ processing for SQPOLL rings.
  9. Land Ring v2 per-thread ring ownership and completion routing.
  10. Add the SQPOLL wake/sleep protocol and a host or Loom-style lost-wakeup model.
  11. Add kernel SQPOLL without full-nohz, under normal scheduler ticks.
  12. Add CPU isolation leases and housekeeping CPU placement.
  13. Prove SQPOLL progress through a wake/deadline path that does not depend on periodic scheduler ticks. Completed for bounded current-thread syscall/producer-wake progress by the Phase F SQPOLL nohz-progress child.
  14. Enable SQPOLL nohz on isolated CPUs for explicitly leased caller-thread rings. Landed 2026-06-07 09:45 UTC; broader userspace-poller/device-queue policy issuance remains separate.
  15. Add request deadline_ns metadata and typed late/drop CQE outcomes.
  16. Add SchedulingContext and admission-controlled realtime islands.
  17. Add generic full-nohz admission for ordinary budgeted compute threads through explicit SchedulingContext-targeted CpuIsolationLease preflight. Landed 2026-06-06 09:44 UTC; policy-service issuance remains separate.
  18. Add the user-space policy-service AutoNoHz placement heuristic. The kernel exports per-thread saturation observation through the monitoring/status surface; the policy service synthesizes the “thread appears capable of utilizing a full CPU core” decision and issues bounded CpuIsolationLease grants against pre-authorized account or session CPU pools. The auto-revoke timeout primitive (leaseLifetimeNs) landed 2026-05-30 15:22 UTC at 84c1c5ba, priority-aware fairness lease termination landed 2026-06-02 21:28 UTC at cae825a4 with immediate release remediation at ca28ef63, runtime grant minting (CpuIsolationGrantMinter) landed 2026-06-02 22:25 UTC at 5c5c63cc, and the grant-revocation lifecycle (CpuIsolationGrantMinter.revokeGrant with cascade-to-leases) landed 2026-06-03 17:11 UTC, completing the pool-grant authority surface. The local userspace policy-service proof landed 2026-06-07: it reads the per-thread saturation counters, denies a voluntarily blocking worker, issues a finite grant-stamped full-nohz lease only after a saturated local window, renews only after re-observation, and lets stopped renewal expire fail-closed. A reusable production policy daemon with profile-driven smoothing, cross-process target discovery, and richer operator policy remains future work.
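
Step 6 above combines an eligibility check with a bounded one-shot deadline. A minimal sketch of that decision follows, assuming the 100 ms housekeeping floor acts as an upper bound on how far ahead the one-shot may be armed; the struct, field, and function names are hypothetical and do not mirror the kernel's actual types.

```rust
/// Idle housekeeping floor from step 6 (100 ms).
pub const IDLE_HOUSEKEEPING_FLOOR_NS: u64 = 100_000_000;

/// Hypothetical per-CPU view of the conditions step 6 lists.
pub struct IdleState {
    pub runnable_non_idle: bool,
    pub nohz_lease_active: bool,
    pub deferred_cleanup_pending: bool,
    pub cap_enter_polling: bool,
    pub oneshot_clockevent: bool,
}

/// Returns the absolute one-shot deadline to arm, or None when the CPU must
/// keep its periodic tick.
pub fn idle_oneshot_deadline(
    now_ns: u64,
    state: &IdleState,
    next_timer_deadline_ns: Option<u64>,
    next_park_deadline_ns: Option<u64>,
) -> Option<u64> {
    let eligible = !state.runnable_non_idle
        && !state.nohz_lease_active
        && !state.deferred_cleanup_pending
        && !state.cap_enter_polling
        && state.oneshot_clockevent;
    if !eligible {
        return None;
    }
    // Start from the housekeeping bound, then pull in any earlier deadline.
    let mut deadline = now_ns.saturating_add(IDLE_HOUSEKEEPING_FLOOR_NS);
    for candidate in [next_timer_deadline_ns, next_park_deadline_ns].into_iter().flatten() {
        deadline = deadline.min(candidate);
    }
    // A deadline already in the past fires as soon as possible.
    Some(deadline.max(now_ns))
}
```

Returning `None` for any disqualifier keeps the fail-closed shape of the real gate: when in doubt, the CPU stays periodic.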

Verification

Tickless idle gates:

make fmt-check
cargo test-lib
cargo test-config
make run-smoke
make run-spawn

Additional tickless proof:

1 second idle interval does not produce 100 scheduler ticks
Timer.sleep still completes
cap_enter timeout still completes
ParkSpace timeout still completes
preemption fairness unchanged with runnable contention

SQPOLL gates:

thread-lifecycle
timer-smoke
timer-flood
park wake/timeout
endpoint CALL/RECV/RETURN
mandatory host or Loom-style lost-wakeup model before any real SQPOLL worker:
  poller: set NEED_WAKEUP -> full barrier -> recheck tail -> park
  producer: write SQE -> publish tail -> full barrier -> check NEED_WAKEUP -> wake
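
The poller/producer ordering above is the classic store-then-check-the-other-side handshake. The following host-side sketch exercises it with `std` atomics and thread parking. It is a stress smoke under assumptions, not the mandated lost-wakeup model: a Loom-style model would explore interleavings exhaustively, which this does not. The `Ring` type, the function names, and the simplification that an SQE is just a tail increment are all hypothetical.

```rust
use std::sync::atomic::{fence, AtomicBool, AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

/// Hypothetical stand-in for the shared SQ state.
pub struct Ring {
    tail: AtomicU64,
    need_wakeup: AtomicBool,
}

/// Poller: set NEED_WAKEUP -> full barrier -> recheck tail -> park.
fn poller(ring: &Ring, target: u64) -> u64 {
    let mut head = 0;
    while head < target {
        let tail = ring.tail.load(Ordering::Acquire);
        if tail != head {
            head = tail; // "consume" everything published so far
            continue;
        }
        ring.need_wakeup.store(true, Ordering::Relaxed);
        fence(Ordering::SeqCst);
        if ring.tail.load(Ordering::Acquire) != head {
            // Work arrived between the first check and the flag store.
            ring.need_wakeup.store(false, Ordering::Relaxed);
            continue;
        }
        // park() may also return spuriously; the loop rechecks the tail.
        thread::park();
        ring.need_wakeup.store(false, Ordering::Relaxed);
    }
    head
}

/// Producer: write SQE -> publish tail -> full barrier -> check NEED_WAKEUP -> wake.
fn producer(ring: &Ring, poller_thread: &thread::Thread, count: u64) {
    for _ in 0..count {
        ring.tail.fetch_add(1, Ordering::Release);
        fence(Ordering::SeqCst);
        if ring.need_wakeup.swap(false, Ordering::Relaxed) {
            poller_thread.unpark();
        }
    }
}

/// Runs one producer against one poller and returns the poller's final head.
pub fn run_model(count: u64) -> u64 {
    let ring = Arc::new(Ring {
        tail: AtomicU64::new(0),
        need_wakeup: AtomicBool::new(false),
    });
    let poller_ring = Arc::clone(&ring);
    let handle = thread::spawn(move || poller(&poller_ring, count));
    producer(&ring, handle.thread(), count);
    handle.join().expect("poller thread panicked")
}
```

The two `SeqCst` fences guarantee that either the poller's recheck sees the new tail or the producer sees NEED_WAKEUP set. Removing either fence, or parking without the recheck, reintroduces the lost wakeup this gate exists to rule out.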

Realtime gates:

deadline ordering tests
budget depletion tests
donation/return tests through passive endpoint
admission denial tests
QEMU proof for late/drop/overrun behavior
telemetry counters prove ticks suppressed, deadlines expired, budgets
exhausted, and donation depth bounded as expected

Decision

Adopt this staged direction:

Tickless idle:
  yes, after the kernel/per-CPU idle context and activation proof. The
  clocksource/clockevent split is implemented.

Generic full-nohz:
  implemented for explicit budgeted compute leases targeting a live
  SchedulingContext. Automatic issuance and unbudgeted ordinary threads remain
  out of scope.

SQPOLL nohz:
  yes, for explicitly leased caller-thread rings whose SQPOLL poller is live,
  single-consumer, and bounded by producer wake plus rollback deadlines.

AutoNoHz placement for ordinary threads:
  yes, but only as a user-space policy-service decision that issues a
  bounded CpuIsolationLease against a pre-authorized CPU pool. The lease
  adds isolation; it never mints CPU-time authority. The "thread appears
  capable of utilizing a full CPU core" signal is synthesized in the
  policy service from observations the kernel exports through the
  monitoring/status surface, not as a fixed kernel threshold.

Realtime:
  `SQE.deadline_ns` is useful metadata, but `SchedulingContext` is the
  authority that provides CPU time.