Proposal: Tickless and Realtime Scheduling

This proposal captures the scheduling design from the 2026-04-29 discussion: tickless idle is worth building soon; generic full-nohz is premature; SQPOLL-oriented full-nohz belongs behind Ring v2 and CPU isolation; and realtime requires scheduling contexts rather than only per-request deadlines.

Design Grounding

The local docs/research/ contents were checked before adding this proposal. The directly relevant grounding is:

External grounding is recorded in the research note so reviewers can audit the prior-art claims without treating this proposal as the source of truth.

Goals

  • Add tickless idle: when a CPU has no runnable work, stop the periodic scheduler tick and program the local timer for the earliest known deadline.
  • Split monotonic timekeeping from timer interrupt delivery.
  • Convert scheduler timeout waiters to absolute monotonic deadlines.
  • Stage full-nohz as an explicit CPU isolation/lease mode for SQPOLL and realtime executors, not as a generic scheduler default.
  • Define SQE.deadline_ns as request freshness metadata.
  • Define SchedulingContext as CPU-time authority.
  • Define RealtimeIsland as the admission object for media, robotics, provider, and other bounded realtime graphs.

Non-Goals

  • No generic NO_HZ_FULL for arbitrary user threads in the near term.
  • No SQPOLL on the current process-wide ring.
  • No second SQ consumer through timer-side polling for SQPOLL rings.
  • No TSC-deadline or x2APIC requirement for the first tickless-idle milestone.
  • No hard realtime claim before kernel-path, IRQ, device, locking, and WCET evidence exists.
  • No full realtime policy blob inside every SQE.

CPU Authority Taxonomy

These terms must not drift into overlapping authority systems:

ResourceProfile:
  policy template selected by identity, session, account, or service profile;
  it is not spendable authority by itself.

ResourceLedger:
  coarse accounting and quota owner for a resource class. It records and
  enforces limits, including non-realtime CPU share/runtime budgets where the
  scheduler has not minted finer scheduling contexts.

SchedulingContext:
  spendable CPU-time authority with budget, period, relative deadline,
  priority/criticality, CPU mask, and overrun policy.

CpuIsolationLease:
  placement, exclusivity, and nohz/noise-isolation authority for a CPU or CPU
  set. It does not grant CPU-time credit and must charge consumed time through
  a SchedulingContext or coarse scheduler ResourceLedger.

NoHzEligibility:
  a reviewed claim or hint that a thread, ring, poller, or island may use nohz
  isolation if the scheduler can prove the current CPU state allows it.

NoHzActivation:
  the scheduler-proven current CPU state that actually suppresses ticks.

RealtimeIsland:
  admitted bundle of SchedulingContexts, memory reservations, device
  reservations, rings, endpoint/service constraints, and optional
  CpuIsolationLeases.

Scheduling-context donation is not generic resource donation. It donates only execution budget/deadline along a synchronous capability path; it does not donate capability authority, invocation subject identity, disclosure scope, memory budget, network budget, storage budget, or service-management authority.

Layer 1: Tickless Idle

Tickless idle should be the first behavioral milestone. It applies only when the CPU has no runnable thread and no local work that still depends on a periodic scheduler tick.

Clocksource

Add a monotonic clock layer:

pub fn monotonic_ns() -> u64;

The first backend can use the current periodic tick as a compatibility source while the system is still periodic. The selected QEMU/x86_64 backend should eventually use a calibrated stable counter, with SMP consistency handled when multiple scheduler owners exist.

Required invariant:

monotonic_ns() never moves backwards on one CPU.
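A minimal sketch of the compatibility backend described above, driven by the existing periodic tick. `TICK_COUNT`, `TICK_PERIOD_NS`, and `on_tick` are illustrative names, not existing kernel symbols; the real backend would eventually read a calibrated stable counter instead.

```rust
use std::sync::atomic::{AtomicU64, Ordering};

static TICK_COUNT: AtomicU64 = AtomicU64::new(0);
const TICK_PERIOD_NS: u64 = 1_000_000; // assumed 1 ms periodic tick

/// Called from the periodic timer interrupt while the system is periodic.
pub fn on_tick() {
    TICK_COUNT.fetch_add(1, Ordering::Relaxed);
}

/// Monotonic time derived from the tick count. Resolution is one tick,
/// but the value never moves backwards on one CPU.
pub fn monotonic_ns() -> u64 {
    TICK_COUNT.load(Ordering::Relaxed) * TICK_PERIOD_NS
}
```

The invariant holds trivially because the counter only ever increments; a counter-based backend must preserve the same property across calibration updates.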

Clockevent

Add a small scheduler timer backend boundary:

trait ClockEvent {
    fn program_periodic(period_ns: u64);
    fn program_oneshot(delta_ns: u64);
    fn stop();
    fn min_delta_ns() -> u64;
    fn max_delta_ns() -> u64;
}

The first backend is the current PIT-calibrated xAPIC LAPIC timer on vector 48. PIT/PIC and periodic LAPIC remain fallback paths.

Deadline Waiters

Convert timeout state from tick counts to absolute deadlines:

struct DeadlineWaiter {
    deadline_ns: u64,
    target: ThreadRef,
    kind: WaiterKind,
    user_data: u64,
}

Affected paths:

  • Timer.sleep;
  • cap_enter(timeout_ns);
  • ParkSpace timeout;
  • future process/thread wait timeouts;
  • network poll deadline through NetworkPollClock.

Waiter storage remains bounded. No interrupt path may allocate.
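A sketch of what bounded, allocation-free waiter storage could look like, with the earliest-deadline lookup the one-shot timer needs. `MAX_WAITERS`, the slot scheme, and this simplified `DeadlineWaiter` (only the fields used here) are illustrative assumptions.

```rust
const MAX_WAITERS: usize = 64;

#[derive(Clone, Copy)]
struct DeadlineWaiter {
    deadline_ns: u64,
    user_data: u64,
}

struct WaiterTable {
    slots: [Option<DeadlineWaiter>; MAX_WAITERS],
}

impl WaiterTable {
    const fn new() -> Self {
        Self { slots: [None; MAX_WAITERS] }
    }

    /// Fails closed when the bounded table is full; never allocates.
    fn insert(&mut self, w: DeadlineWaiter) -> Result<usize, ()> {
        for (i, slot) in self.slots.iter_mut().enumerate() {
            if slot.is_none() {
                *slot = Some(w);
                return Ok(i);
            }
        }
        Err(())
    }

    /// Earliest absolute deadline across all armed waiters, if any.
    fn earliest_deadline_ns(&self) -> Option<u64> {
        self.slots.iter().flatten().map(|w| w.deadline_ns).min()
    }

    /// Clear and count all waiters whose deadline has passed.
    fn expire(&mut self, now_ns: u64) -> usize {
        let mut fired = 0;
        for slot in self.slots.iter_mut() {
            if matches!(slot, Some(w) if w.deadline_ns <= now_ns) {
                *slot = None;
                fired += 1;
            }
        }
        fired
    }
}
```

The linear scans are acceptable for a small fixed table; a real implementation might keep a sorted structure, but the bound and the no-allocation rule are the load-bearing properties.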

Network Poll Clock

The current kernel-resident networking path is scheduler-polled, so it keeps a CPU in ForcedPeriodic unless networking exposes an explicit poll clock. The intermediate interface should be:

trait NetworkPollClock {
    fn next_poll_deadline_ns(now_ns: u64) -> Option<u64>;
    fn poll_until_budget(now_ns: u64, budget_ns: u64) -> PollResult;
}

next_poll_deadline_ns lets the scheduler include TCP/runtime timers in earliest_global_deadline(). poll_until_budget prevents network progress from becoming an unbounded idle-exit or interrupt path. A CPU with active networking may enter tickless idle only when the network runtime is inactive or has exposed a bounded deadline through this interface.
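A small sketch of how `earliest_global_deadline()` could fold the network poll deadline in with the timer-waiter deadline; the two-argument shape is an illustrative simplification of whatever deadline sources the scheduler actually tracks.

```rust
/// Combine the earliest timer-waiter deadline with the network poll
/// deadline. None means that source imposes no wakeup requirement.
fn earliest_global_deadline(
    timer_deadline_ns: Option<u64>,
    network_deadline_ns: Option<u64>,
) -> Option<u64> {
    match (timer_deadline_ns, network_deadline_ns) {
        (Some(a), Some(b)) => Some(a.min(b)),
        (Some(a), None) => Some(a),
        (None, Some(b)) => Some(b),
        // No known deadline: the CPU may still idle, but with no
        // one-shot timer armed; only an interrupt can wake it.
        (None, None) => None,
    }
}
```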

Kernel Idle

Tickless idle depends on replacing the user-mode idle process with a kernel/per-CPU idle context. Timer IRQ handling must distinguish:

IRQ from CPL3 user thread -> save/restore user context
IRQ from CPL0 idle        -> wake/check scheduler without fake user context

Idle entry shape:

if no runnable work:
    deadline = earliest_global_deadline()
    clockevent.program_oneshot(deadline - now)
    enter_kernel_idle()

The idle loop enables interrupts, halts, wakes on timer/IPI/device interrupt, then rechecks runnable work and deadline expiry.

Tickless State

Per CPU:

Periodic:
  normal scheduler tick active

TicklessIdle:
  no runnable thread
  one-shot local timer programmed for earliest deadline
  CPU in kernel idle

ForcedPeriodic:
  fallback when a subsystem still needs regular polling

Enter TicklessIdle only when:

run queue empty
no direct IPC target
no deferred completion work
no timer-side ring work required
clockevent supports one-shot
kernel idle context available
network runtime inactive or deadline-driven

Keep periodic preemption whenever there is runnable contention. A CPU with even one runnable user thread stays periodic until the Ring v2, CPU accounting, and timer-side polling dependencies are resolved.
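The entry conditions above can be collapsed into a single predicate. The `CpuState` fields here are illustrative stand-ins for whatever per-CPU bookkeeping the scheduler actually keeps.

```rust
/// Illustrative per-CPU state; field names mirror the entry conditions.
struct CpuState {
    runnable: usize,
    direct_ipc_target: bool,
    deferred_completions: bool,
    timer_side_ring_work: bool,
    oneshot_capable: bool,
    kernel_idle_ready: bool,
    network_deadline_driven: bool, // inactive or deadline-exposed
}

/// All conditions must hold; any failure keeps the CPU periodic.
fn can_enter_tickless_idle(cpu: &CpuState) -> bool {
    cpu.runnable == 0
        && !cpu.direct_ipc_target
        && !cpu.deferred_completions
        && !cpu.timer_side_ring_work
        && cpu.oneshot_capable
        && cpu.kernel_idle_ready
        && cpu.network_deadline_driven
}
```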

Layer 2: SQPOLL NoHz

SQPOLL full-nohz is a later CPU ownership mode:

full-nohz is not a timer feature here;
it is part of the SQPOLL CPU ownership contract.

Required prerequisites:

  • Ring v2 or equivalent per-thread rings;
  • one SQ consumer per ring;
  • per-CPU scheduler ownership;
  • reschedule IPI and idle-to-runnable handoff;
  • at least one housekeeping CPU;
  • explicit placement of network polling away from isolated CPUs.

Ring mode:

enum RingMode {
    Syscall,
    SqpollStarting,
    Sqpoll,
    SqpollStopping,
}

In syscall mode, the owner thread’s cap_enter drains SQ. In SQPOLL mode, a kernel worker owns SQ head; userspace owns SQ tail and CQ head; cap_enter waits for completions and may wake a sleeping poller, but it does not drain SQ.

SQPOLL state:

Disabled -> Starting -> Running -> IdleSpinning -> Sleeping -> Stopping

The wake protocol uses a NEED_WAKEUP flag. Userspace release-stores the SQ tail, acquire-loads flags, and invokes a wake path only if the poller has gone to sleep.

The race-free sequence is normative.

Poller before sleeping:

flags.fetch_or(NEED_WAKEUP, SeqCst);

let tail = sq_tail.load(Acquire);
if sq_head != tail {
    flags.fetch_and(!NEED_WAKEUP, Release);
    continue;
}

park();

Producer:

write_sqe();
sq_tail.store(new_tail, Release);
fence(SeqCst);

let flags = flags.load(Acquire);
if flags & NEED_WAKEUP != 0 {
    wake_poller();
}

The poller must set NEED_WAKEUP before the final tail recheck. Otherwise a producer can publish a new SQE after the poller checks the tail but before it parks, losing the wake.

The NEED_WAKEUP publication must also be ordered before the final tail recheck by a full store-to-load barrier. A SeqCst RMW is the simplest portable rule for the ABI text; an implementation may substitute an explicitly reviewed architecture-specific fence or park primitive that provides the same ordering. A plain release store or release-only RMW is not sufficient for this protocol.

The producer must likewise order the SQ tail publication before checking NEED_WAKEUP. The normative sequence uses a full fence between sq_tail.store(..., Release) and flags.load(Acquire); an implementation may substitute an explicitly reviewed equivalent that prevents the producer from missing NEED_WAKEUP while the poller misses the new tail before parking.
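A host model of this protocol, in the spirit of the lost-wakeup model the verification section calls for, can be built with std atomics and `thread::park`/`unpark` standing in for kernel park and `wake_poller()`. This is a sketch of the normative ordering, not the kernel code; `run_model` and its shapes are illustrative.

```rust
use std::sync::atomic::{fence, AtomicBool, AtomicU64, Ordering::*};
use std::sync::Arc;
use std::thread;

const NEED_WAKEUP: u64 = 1;

/// Drive `items` SQEs through the poller; returns how many were consumed.
pub fn run_model(items: u64) -> u64 {
    let sq_tail = Arc::new(AtomicU64::new(0));
    let flags = Arc::new(AtomicU64::new(0));
    let consumed = Arc::new(AtomicU64::new(0));
    let done = Arc::new(AtomicBool::new(false));

    let poller = {
        let (sq_tail, flags, consumed, done) =
            (sq_tail.clone(), flags.clone(), consumed.clone(), done.clone());
        thread::spawn(move || {
            let mut sq_head = 0u64;
            loop {
                // Drain everything published so far.
                while sq_head < sq_tail.load(Acquire) {
                    sq_head += 1;
                    consumed.fetch_add(1, Relaxed);
                }
                if done.load(Acquire) && sq_head == sq_tail.load(Acquire) {
                    return;
                }
                // Normative sleep sequence: publish NEED_WAKEUP with a
                // SeqCst RMW, then recheck the tail before parking.
                flags.fetch_or(NEED_WAKEUP, SeqCst);
                if sq_head != sq_tail.load(Acquire) || done.load(Acquire) {
                    flags.fetch_and(!NEED_WAKEUP, Release);
                    continue;
                }
                thread::park(); // spurious wakeups are safe: we re-loop
                flags.fetch_and(!NEED_WAKEUP, Release);
            }
        })
    };

    for i in 0..items {
        // Producer: publish tail, full fence, then check NEED_WAKEUP.
        sq_tail.store(i + 1, Release);
        fence(SeqCst);
        if flags.load(Acquire) & NEED_WAKEUP != 0 {
            poller.thread().unpark();
        }
    }
    done.store(true, Release);
    poller.thread().unpark();
    poller.join().unwrap();
    consumed.load(Relaxed)
}
```

`unpark` stores a token, so a wake delivered just before `park` is not lost; the model still depends on the SeqCst RMW/fence pairing for the tail-vs-flag ordering, which is exactly the property a Loom-style exhaustive model should check.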

An SQPOLL CPU may suppress the periodic tick only if:

cpu role is SqpollIsolated
exactly one runnable entity is the poller
no ordinary user thread is runnable there
no timer-side SQ polling is enabled
no network scheduler polling is pinned there
no deferred cleanup is pinned there
stable clocksource/accounting exists
housekeeping CPU is online

If any condition fails, restore periodic tick or migrate the unrelated work.

NoHz Activation Proof Obligations

To enter SqpollNoHz or future AutoNoHz, the scheduler must prove:

exactly one runnable entity is assigned to the CPU
at least one housekeeping CPU is online
no local network polling dependency remains
no timer-side SQ polling can run for the active ring
no local deferred cleanup or unbound kernel worker is pinned there
no unmigratable IRQ targets that CPU unless explicitly allowed
clocksource and CPU accounting are boundary/counter driven, not tick driven
revocation latency is within the lease policy

The proof is dynamic. If any condition stops holding, the scheduler must restore periodic tick, migrate unrelated work, revoke the lease, or leave nohz mode before continuing.

Layer 3: AutoNoHz CPU Lease

The long-term design should split eligibility from activation.

Eligibility says a thread, process, ring, or realtime island may use nohz isolation:

enum NoHzKind {
    Idle,
    KernelSqpoll,
    AutoCompute,
    AutoUserspacePoller,
    RealtimeIsland,
}

struct NoHzEligibility {
    kind: NoHzKind,
    max_revocation_latency_ns: u64,
    preferred_cpus: CpuSet,
    allow_busy_spin: bool,
    accounting_target: CpuAccountingTarget,
}

enum CpuAccountingTarget {
    CurrentSchedulingContext,
    SchedulerResourceLedger,
}

Activation is a scheduler proof that a CPU currently satisfies isolation conditions. Without a lease, a latency-sensitive hint may influence placement but must not grant exclusive CPU access.

Future lease shape:

CpuIsolationLease:
  owner process/session
  allowed CPU set
  allowed mode: poller/compute/kernel-worker
  accounting target, not CPU-time credit
  revocation policy

Housekeeping must be explicit:

Housekeeping CPU set:
  global timers
  deferred frees
  cleanup
  statistics
  non-critical kernel workers
  debug/watchdog
  load balancing and migration control

Layer 4: Deadline Metadata

Deadline metadata lives in fixed ring ABI fields, not in a Cap’n Proto SQE envelope and not in variable side metadata. The current fixed SQE layout should not be silently reinterpreted; add these fields through a versioned CapSqeV2/ring ABI gate when the transport is ready.

#[repr(C)]
struct CapSqeV2 {
    // existing fixed CapSqe fields, unchanged in order and meaning

    deadline_ns: u64,   // absolute monotonic deadline, 0 = none
    qos_flags: u32,     // drop/allow/reorder/propagate semantics
    sched_ctx_id: u32,  // 0 = current/default scheduling context
}

deadline_ns is an absolute monotonic timestamp. It is request freshness metadata, not a promise of nanosecond wakeup precision. The kernel may round timer programming to clockevent granularity, coalesce timers where policy allows, or report a miss when dispatch observes the timestamp has already expired. The field remains u64 nanoseconds because absolute u64 ns values are simple, tracing-friendly, and shared with existing timeout surfaces; a u64 microsecond field saves no ABI space.
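Two helpers sketch the arithmetic implied here: building an absolute deadline from a relative timeout, and rounding one-shot timer programming up to clockevent granularity so the timer never fires early. The function names and the nonzero-granularity assumption are illustrative.

```rust
/// Convert a relative timeout into an absolute monotonic deadline.
fn absolute_deadline(now_ns: u64, timeout_ns: u64) -> u64 {
    now_ns.saturating_add(timeout_ns)
}

/// Round the one-shot delta up to clockevent granularity. The stored
/// deadline_ns stays exact; only the programmed timer is coarser.
/// Assumes granularity_ns > 0.
fn program_delta(now_ns: u64, deadline_ns: u64, granularity_ns: u64) -> u64 {
    let delta = deadline_ns.saturating_sub(now_ns);
    // Round up so the timer never fires before the deadline.
    (delta + granularity_ns - 1) / granularity_ns * granularity_ns
}
```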

Only consider a compact profile if SQE space becomes critical:

deadline_delta_us: u32

That profile would be a soft-deadline compact transport shape only. It is not the primary realtime or SchedulingContext ABI and must not replace deadline_ns for admitted realtime work.

ABI negotiation uses both bootstrap metadata and a runtime query surface:

struct RuntimeBootInfo {
    ring_addr: u64,
    ring_abi_version: u32,
    sqe_size: u16,
    cqe_size: u16,
}

  • Process bootstrap passes the ring ABI version and fixed entry sizes alongside the ring address.
  • RuntimeBootInfo, ring ABI version constants, and fixed SQE/CQE layouts live in capos-config/src/ring.rs; the kernel and capos-rt import the same definition rather than carrying local copies.
  • A future RuntimeInfo/SystemInfo query returns the kernel-supported ring ABI range so language runtimes can fail before mapping incompatible rings.
  • cap_enter rejects unsupported SQE versions or entry sizes with stable transport errors such as CAP_ERR_UNSUPPORTED_RING_ABI and CAP_ERR_UNSUPPORTED_SQE_VERSION.
  • Runtimes in Rust, C, Go, and other languages must generate or mirror the exact fixed layout for the negotiated version.

Suggested flags:

DROP_IF_LATE:
  if now > deadline_ns before dispatch, post DEADLINE_EXPIRED

ALLOW_LATE:
  dispatch anyway, but CQE/telemetry marks late

PROPAGATE_DEADLINE:
  endpoint CALL/RETURN carries deadline metadata to server-side request

DEADLINE_ORDERED:
  SQPOLL may reorder within a bounded window only when all reorder-safety
  checks below pass

NO_BLOCKING_PATH:
  reject if target method/op is not declared realtime-safe

Do not put budget, period, priority, criticality, or CPU affinity into each SQE. Deadline is per request. Budget is execution authority.
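A dispatch-time sketch of how DROP_IF_LATE and ALLOW_LATE could be evaluated against `deadline_ns`. The flag values, the `Disposition` type, and the default-when-unflagged branch are assumptions for illustration, not decided ABI.

```rust
const DROP_IF_LATE: u32 = 1 << 0;
const ALLOW_LATE: u32 = 1 << 1;

#[derive(Debug, PartialEq)]
enum Disposition {
    Dispatch { late: bool },
    PostDeadlineExpired,
}

fn classify(now_ns: u64, deadline_ns: u64, qos_flags: u32) -> Disposition {
    // deadline_ns == 0 means "no deadline" per the CapSqeV2 sketch.
    if deadline_ns == 0 || now_ns <= deadline_ns {
        return Disposition::Dispatch { late: false };
    }
    if qos_flags & DROP_IF_LATE != 0 {
        Disposition::PostDeadlineExpired
    } else if qos_flags & ALLOW_LATE != 0 {
        // Dispatch anyway; CQE/telemetry marks the request late.
        Disposition::Dispatch { late: true }
    } else {
        // Default policy is an assumption here: fail closed and drop.
        Disposition::PostDeadlineExpired
    }
}
```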

DEADLINE_ORDERED is valid only when all of the following are true:

the ring mode permits reordering
the SQE marks this request reorderable
the target capability interface and method declare reorder-safe semantics
the reordering window is bounded
the operation does not depend on earlier same-ring requests for correctness

Ordered side effects such as write A; write B; flush or lock; mutate; unlock must not be deadline-reordered unless the target method contract explicitly defines that sequence as reorder-safe.

Layer 5: SchedulingContext

CPU time should become a capability-controlled object:

struct SchedulingContext {
    budget_ns: u64,
    period_ns: u64,
    relative_deadline_ns: u64,
    priority: u16,
    criticality: u8,
    cpu_mask: CpuSet,
    overrun_policy: OverrunPolicy,
    timeout_endpoint: Option<EndpointRef>,
}

Kernel responsibilities:

  • decrement remaining budget by actual runtime;
  • replenish budget by period;
  • throttle or fault a thread on depletion;
  • enforce CPU mask and scheduling eligibility;
  • dispatch among eligible contexts by the selected realtime policy;
  • prevent untrusted SQE bytes from minting budget.
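The first two kernel responsibilities can be sketched as simple bookkeeping. The `BudgetState` shape and the whole-period replenishment scheme are illustrative assumptions; the real replenishment policy is part of the admitted contract.

```rust
struct BudgetState {
    budget_ns: u64,        // full budget per period
    remaining_ns: u64,     // spendable budget in the current period
    period_ns: u64,
    next_replenish_ns: u64,
}

impl BudgetState {
    /// Charge actual runtime; returns false once depleted, at which
    /// point the kernel throttles or faults per overrun policy.
    fn charge(&mut self, ran_ns: u64) -> bool {
        self.remaining_ns = self.remaining_ns.saturating_sub(ran_ns);
        self.remaining_ns > 0
    }

    /// Replenish the budget at period boundaries.
    fn maybe_replenish(&mut self, now_ns: u64) {
        while now_ns >= self.next_replenish_ns {
            self.remaining_ns = self.budget_ns;
            self.next_replenish_ns += self.period_ns;
        }
    }
}
```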

Policy-service responsibilities:

  • admission control;
  • budget/period/priority selection;
  • CPU-isolation lease policy;
  • overload response;
  • telemetry and retuning.

Layer 6: Donation

Synchronous capability calls need scheduling-context donation:

client SchedulingContext -> passive server endpoint
server runs on donated budget/deadline
context returns on reply
timeout/overrun reports to caller or island policy

Without donation or inheritance, a realtime caller can be defeated by a normal-priority server that holds the capability implementation path.

Donation semantics must be fixed before implementation:

max donation call depth:
  bounded per SchedulingContext or RealtimeIsland; overflow fails closed.

nested donation:
  nested synchronous calls carry the current donated context until the depth
  bound, unless a callee uses its own admitted context by explicit policy.

cycle handling:
  a donated context may not re-enter a thread already on its donation stack;
  cycles fail with a typed realtime/donation error.

partial failure:
  budget already consumed stays charged to the context that ran the work.
  rollback of authority or memory is separate from CPU charge rollback.

timeout propagation:
  the earliest of request deadline, scheduling-context deadline, and explicit
  call timeout bounds downstream execution.

server-side blocking:
  a passive server running on donated context may block only on approved
  realtime-safe waits or synchronous calls that continue donation.

return on exception:
  application exceptions, transport errors, and cancellation return the
  context to its previous owner before CQE/error delivery.

async endpoint queues:
  donation does not cross ordinary async endpoint enqueue by default. Async
  donation requires an explicit future token/lease design.

Hot admitted paths should avoid blocking locks. If a shared resource cannot be modeled as a passive service, it needs a reviewed priority/deadline-inheritance primitive or a bounded try-lock/fail/drop policy.
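The depth-bound and cycle rules above can be sketched as donation-stack bookkeeping. `ThreadId`, the error type, and the bound of 8 are illustrative assumptions; the real bound is per SchedulingContext or RealtimeIsland policy.

```rust
const MAX_DONATION_DEPTH: usize = 8;

type ThreadId = u64;

#[derive(Debug, PartialEq)]
enum DonationError {
    DepthExceeded,
    Cycle,
}

struct DonationStack {
    threads: Vec<ThreadId>, // threads currently running on this context
}

impl DonationStack {
    fn new(owner: ThreadId) -> Self {
        Self { threads: vec![owner] }
    }

    /// Donate the context to `callee` for a synchronous call.
    fn push(&mut self, callee: ThreadId) -> Result<(), DonationError> {
        if self.threads.len() >= MAX_DONATION_DEPTH {
            return Err(DonationError::DepthExceeded); // fail closed
        }
        if self.threads.contains(&callee) {
            return Err(DonationError::Cycle); // re-entry on donation stack
        }
        self.threads.push(callee);
        Ok(())
    }

    /// Return the context on reply, exception, or cancellation.
    fn pop(&mut self) {
        if self.threads.len() > 1 {
            self.threads.pop();
        }
    }
}
```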

Layer 7: RealtimeIsland

RealtimeIsland admits a whole loop or graph:

struct RealtimeIslandSpec {
    period_ns: u64,
    deadline_ns: u64,
    cpu_set: CpuSet,
    nodes: Vec<NodeBudget>,
    rings: Vec<RingSpec>,
    memory: Vec<PreallocSpec>,
    devices: Vec<DeviceReservation>,
    overrun_policy: OverrunPolicy,
}

Admission requires:

  • total budget fits period/deadline constraints;
  • all hot-path buffers are preallocated;
  • hot-path memory is committed and resident before start;
  • guaranteed hot-path memory uses the OOM proposal’s MemoryResidency policy as pinned or secret; normal memory is not admitted for guaranteed hot paths. A future lock-resident operation may transition ordinary memory into a pinned reservation before admission, but the admitted island sees the result as pinned, not as normal;
  • all caps and policy decisions are resolved before start;
  • no expected page faults on the hot path;
  • no unbounded lock acquisition;
  • no blocking endpoint calls inside callback loops;
  • no allocation, logging, service discovery, or provider credential work on the realtime path;
  • IRQ and deferred work are bounded or moved outside the island.
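The first admission requirement can be sketched as a utilization test: all node budgets together must fit inside the island's effective window. This single-CPU, no-interference bound is an illustrative assumption; a real admission test is policy-defined and must account for placement and preemption.

```rust
struct NodeBudget {
    budget_ns: u64,
}

/// Admit only if the summed node budgets complete within the smaller
/// of the island deadline and period (simplest single-CPU bound).
fn admit(period_ns: u64, deadline_ns: u64, nodes: &[NodeBudget]) -> bool {
    let total: u64 = nodes.iter().map(|n| n.budget_ns).sum();
    total <= deadline_ns.min(period_ns)
}
```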

Failure semantics must be typed:

CAP_ERR_DEADLINE_EXPIRED
CAP_ERR_BUDGET_EXHAUSTED
CAP_ERR_REALTIME_UNSAFE_PATH
CAP_ERR_REALTIME_ADMISSION_DENIED
CAP_ERR_OVERRUN
CAP_ERR_STALE_INPUT

CQE/status should distinguish not-started-late, completed-late, dropped by policy, throttled, and dependency-cancelled.

Telemetry Requirements

Tickless, nohz, SQPOLL, and realtime behavior must be observable through future monitoring/status capability surfaces, not only through ad hoc debug logs. The first counters should include:

scheduler_tick_count{cpu}
ticks_suppressed{cpu,mode}
nohz_enter_count{cpu,kind}
nohz_exit_count{cpu,reason}
oneshot_deadline_miss_count
sqpoll_busy_ns
sqpoll_sleep_count
deadline_expired_count
budget_exhausted_count
realtime_overrun_count
donation_depth_max
housekeeping_offload_count

These counters are correctness evidence. Missing or surprising values should fail focused nohz/realtime proofs rather than being treated as performance-only diagnostics.

Implementation Sequence

  1. Add timer/scheduler instrumentation around the existing periodic tick.
  2. Add monotonic_ns() and switch Timer.now to the clocksource layer while keeping periodic scheduling.
  3. Convert timeout waiters to deadline_ns.
  4. Add LAPIC one-shot programming and a focused one-shot smoke.
  5. Replace user-mode idle with kernel/per-CPU idle while keeping periodic ticks.
  6. Enable tickless idle only when there is no runnable work.
  7. Keep networking in ForcedPeriodic or add explicit network poll deadlines.
  8. Add ring mode state and refuse timer-side SQ processing for SQPOLL rings.
  9. Land Ring v2 per-thread ring ownership and completion routing.
  10. Add the SQPOLL wake/sleep protocol and a host or Loom-style lost-wakeup model.
  11. Add kernel SQPOLL without full-nohz, under normal scheduler ticks.
  12. Add CPU isolation leases and housekeeping CPU placement.
  13. Enable SQPOLL nohz on isolated CPUs.
  14. Add request deadline_ns metadata and typed late/drop CQE outcomes.
  15. Add SchedulingContext and admission-controlled realtime islands.

Verification

Tickless idle gates:

make fmt-check
cargo test-lib
cargo test-config
make run-smoke
make run-spawn

Additional tickless proof:

1 second idle interval does not produce 100 scheduler ticks
Timer.sleep still completes
cap_enter timeout still completes
ParkSpace timeout still completes
preemption fairness unchanged with runnable contention

SQPOLL gates:

thread-lifecycle
timer-smoke
timer-flood
park wake/timeout
endpoint CALL/RECV/RETURN
mandatory host or Loom-style lost-wakeup model before any real SQPOLL worker:
  poller: set NEED_WAKEUP -> full barrier -> recheck tail -> park
  producer: write SQE -> publish tail -> full barrier -> check NEED_WAKEUP -> wake

Realtime gates:

deadline ordering tests
budget depletion tests
donation/return tests through passive endpoint
admission denial tests
QEMU proof for late/drop/overrun behavior
telemetry counters prove ticks suppressed, deadlines expired, budgets
exhausted, and donation depth bounded as expected

Decision

Adopt this staged direction:

Tickless idle:
  yes, after clocksource/clockevent split and kernel idle.

Generic full-nohz:
  defer. It depends on per-CPU scheduling, Ring v2, accounting, and
  housekeeping.

SQPOLL nohz:
  yes, but only as explicit CPU-isolation authority after Ring v2.

Realtime:
  `SQE.deadline_ns` is useful metadata, but `SchedulingContext` is the
  authority that provides CPU time.