# Proposal: Tickless and Realtime Scheduling

This proposal captures the scheduling design from the 2026-04-29 discussion:
tickless idle is worth building soon, generic full-nohz is premature, SQPOLL-oriented
full-nohz belongs behind Ring v2 and CPU isolation, and realtime requires
scheduling contexts rather than only per-request deadlines.

## Design Grounding

The local `docs/research/` contents were checked before adding this proposal.
The directly relevant grounding is:

- [NO_HZ, SQPOLL, and Realtime Scheduling](../research/nohz-sqpoll-realtime.md)
- [Out-of-kernel scheduling](../research/out-of-kernel-scheduling.md)
- [Completion rings and threaded runtimes](../research/completion-ring-threading.md)
- [Multimedia pipeline latency](../research/multimedia-pipeline-latency.md)
- [Robotics realtime control](../research/robotics-realtime-control.md)
- [x2APIC and APIC virtualization](../research/x2apic-and-virtualization.md)
- [Scheduling](../architecture/scheduling.md)
- [Ring v2 For Full SMP](ring-v2-smp-proposal.md)
- [SMP](smp-proposal.md)
- [Realtime Voice Agent Shell](realtime-voice-agent-shell-proposal.md)

External grounding is recorded in the research note so reviewers can audit the
prior-art claims without treating this proposal as the source of truth.

## Goals

- Add tickless idle: when a CPU has no runnable work, stop the periodic
  scheduler tick and program the local timer for the earliest known deadline.
- Split monotonic timekeeping from timer interrupt delivery.
- Convert scheduler timeout waiters to absolute monotonic deadlines.
- Stage full-nohz as an explicit CPU isolation/lease mode for SQPOLL and
  realtime executors, not as a generic scheduler default.
- Define `SQE.deadline_ns` as request freshness metadata.
- Define `SchedulingContext` as CPU-time authority.
- Define `RealtimeIsland` as the admission object for media, robotics,
  provider, and other bounded realtime graphs.

## Non-Goals

- No generic `NO_HZ_FULL` for arbitrary user threads in the near term.
- No SQPOLL on the current process-wide ring.
- No second SQ consumer through timer-side polling for SQPOLL rings.
- No TSC-deadline or x2APIC requirement for the first tickless-idle milestone.
- No hard realtime claim before kernel-path, IRQ, device, locking, and WCET
  evidence exists.
- No full realtime policy blob inside every SQE.

## CPU Authority Taxonomy

These terms must not drift into overlapping authority systems:

```text
ResourceProfile:
  policy template selected by identity, session, account, or service profile;
  it is not spendable authority by itself.

ResourceLedger:
  coarse accounting and quota owner for a resource class. It records and
  enforces limits, including non-realtime CPU share/runtime budgets where the
  scheduler has not minted finer scheduling contexts.

SchedulingContext:
  spendable CPU-time authority with budget, period, relative deadline,
  priority/criticality, CPU mask, and overrun policy.

CpuIsolationLease:
  placement, exclusivity, and nohz/noise-isolation authority for a CPU or CPU
  set. It does not grant CPU-time credit and must charge consumed time through
  a SchedulingContext or coarse scheduler ResourceLedger.

NoHzEligibility:
  a reviewed claim or hint that a thread, ring, poller, or island may use nohz
  isolation if the scheduler can prove the current CPU state allows it.

NoHzActivation:
  the scheduler-proven current CPU state that actually suppresses ticks.

RealtimeIsland:
  admitted bundle of SchedulingContexts, memory reservations, device
  reservations, rings, endpoint/service constraints, and optional
  CpuIsolationLeases.
```

Scheduling-context donation is not generic resource donation. It donates only
execution budget/deadline along a synchronous capability path; it does not
donate capability authority, invocation subject identity, disclosure scope,
memory budget, network budget, storage budget, or service-management authority.

## Layer 1: Tickless Idle

Tickless idle should be the first behavioral milestone. It applies only when
the CPU has no runnable thread and no local work that still depends on a
periodic scheduler tick.

### Clocksource

Add a monotonic clock layer:

```rust
pub fn monotonic_ns() -> u64;
```

The first backend can use the current periodic tick as a compatibility source
while the system is still periodic. The selected QEMU/x86_64 backend should
eventually use a calibrated stable counter, with SMP consistency handled when
multiple scheduler owners exist.

Required invariant:

```text
monotonic_ns() never moves backwards on one CPU.
```
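One way to hold that invariant can be sketched with a per-CPU clamp built on `fetch_max`, so a jittery raw backend reading can never publish a backwards value. The names `monotonic_ns_from_raw` and `LAST_NS` are illustrative, not the real clocksource layer:

```rust
use std::sync::atomic::{AtomicU64, Ordering};

// Hypothetical per-CPU monotonic clamp: even if the raw backend briefly
// reads low (calibration jitter, a rewinding source), the published value
// never moves backwards on this CPU.
static LAST_NS: AtomicU64 = AtomicU64::new(0);

fn monotonic_ns_from_raw(raw_ns: u64) -> u64 {
    // fetch_max returns the previous published value; the result is the
    // max of the raw reading and everything published before it.
    let prev = LAST_NS.fetch_max(raw_ns, Ordering::Relaxed);
    raw_ns.max(prev)
}

fn main() {
    let samples = [100u64, 250, 240, 260, 259, 300]; // raw source jitters
    let mut last = 0u64;
    for raw in samples {
        let now = monotonic_ns_from_raw(raw);
        assert!(now >= last, "monotonic_ns moved backwards");
        last = now;
    }
    println!("final monotonic reading: {last}");
}
```

`Relaxed` suffices here because the invariant is per-CPU ordering of a single counter; cross-CPU consistency is the separate SMP concern noted above.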

### Clockevent

Add a small scheduler timer backend boundary:

```rust
trait ClockEvent {
    fn program_periodic(&mut self, period_ns: u64);
    fn program_oneshot(&mut self, delta_ns: u64);
    fn stop(&mut self);
    fn min_delta_ns(&self) -> u64;
    fn max_delta_ns(&self) -> u64;
}
```

The first backend is the current PIT-calibrated xAPIC LAPIC timer on vector
48. PIT/PIC and periodic LAPIC remain fallback paths.

### Deadline Waiters

Convert timeout state from tick counts to absolute deadlines:

```rust
struct DeadlineWaiter {
    deadline_ns: u64,
    target: ThreadRef,
    kind: WaiterKind,
    user_data: u64,
}
```

Affected paths:

- `Timer.sleep`;
- `cap_enter(timeout_ns)`;
- ParkSpace timeout;
- future process/thread wait timeouts;
- network poll deadline through `NetworkPollClock`.

Waiter storage remains bounded. No interrupt path may allocate.
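One bounded shape for that storage, as a sketch only (the capacity and field set are illustrative): a fixed-size slot table whose insert fails closed instead of allocating, with a linear scan for the earliest deadline when programming the one-shot timer.

```rust
// Sketch of bounded waiter storage: fixed capacity, no allocation, so it
// is safe to touch from interrupt context. Insert fails closed when full.
const MAX_WAITERS: usize = 8; // illustrative bound

#[derive(Clone, Copy)]
struct DeadlineWaiter {
    deadline_ns: u64, // absolute monotonic deadline
    user_data: u64,   // completion cookie for the waiter
}

struct WaiterTable {
    slots: [Option<DeadlineWaiter>; MAX_WAITERS],
}

impl WaiterTable {
    fn new() -> Self {
        Self { slots: [None; MAX_WAITERS] }
    }

    /// Returns false instead of allocating when the table is full.
    fn insert(&mut self, w: DeadlineWaiter) -> bool {
        for slot in self.slots.iter_mut() {
            if slot.is_none() {
                *slot = Some(w);
                return true;
            }
        }
        false
    }

    /// Earliest absolute deadline, used to program the one-shot timer.
    fn earliest_deadline_ns(&self) -> Option<u64> {
        self.slots.iter().flatten().map(|w| w.deadline_ns).min()
    }
}

fn main() {
    let mut table = WaiterTable::new();
    assert!(table.insert(DeadlineWaiter { deadline_ns: 5_000_000, user_data: 1 }));
    assert!(table.insert(DeadlineWaiter { deadline_ns: 2_000_000, user_data: 2 }));
    assert_eq!(table.earliest_deadline_ns(), Some(2_000_000));
    println!("bounded waiters ok");
}
```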

### Network Poll Clock

The current kernel-resident networking path is scheduler-polled, so it keeps a
CPU in `ForcedPeriodic` unless networking exposes an explicit poll clock. The
intermediate interface should be:

```rust
trait NetworkPollClock {
    fn next_poll_deadline_ns(&self, now_ns: u64) -> Option<u64>;
    fn poll_until_budget(&mut self, now_ns: u64, budget_ns: u64) -> PollResult;
}
```

`next_poll_deadline_ns` lets the scheduler include TCP/runtime timers in
`earliest_global_deadline()`. `poll_until_budget` prevents network progress
from becoming an unbounded idle-exit or interrupt path. A CPU with active
networking may enter tickless idle only when the network runtime is inactive or
has exposed a bounded deadline through this interface.
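The folding itself is a simple minimum over optional deadlines. A sketch, with `earliest_global_deadline` reduced to two inputs for illustration:

```rust
// Sketch: folding an optional network poll deadline into the scheduler's
// global earliest deadline. Names are illustrative, not the kernel API.
fn earliest_global_deadline(
    timer_deadline_ns: Option<u64>,
    network_deadline_ns: Option<u64>,
) -> Option<u64> {
    match (timer_deadline_ns, network_deadline_ns) {
        (Some(a), Some(b)) => Some(a.min(b)),
        (a, b) => a.or(b),
    }
}

fn main() {
    // Bounded network deadline: tickless idle may cover TCP/runtime timers.
    assert_eq!(earliest_global_deadline(Some(10_000), Some(4_000)), Some(4_000));
    // Network inactive: timer waiters alone program the one-shot.
    assert_eq!(earliest_global_deadline(Some(10_000), None), Some(10_000));
    // Nothing pending: idle until an interrupt arrives.
    assert_eq!(earliest_global_deadline(None, None), None);
    println!("deadline folding ok");
}
```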

### Kernel Idle

Tickless idle depends on replacing the user-mode idle process with a
kernel/per-CPU idle context. Timer IRQ handling must distinguish:

```text
IRQ from CPL3 user thread -> save/restore user context
IRQ from CPL0 idle        -> wake/check scheduler without fake user context
```

Idle entry shape:

```text
if no runnable work:
    deadline = earliest_global_deadline()
    clockevent.program_oneshot(deadline - now)
    enter_kernel_idle()
```

The idle loop enables interrupts, halts, wakes on timer/IPI/device interrupt,
then rechecks runnable work and deadline expiry.

### Tickless State

Per CPU:

```text
Periodic:
  normal scheduler tick active

TicklessIdle:
  no runnable thread
  one-shot local timer programmed for earliest deadline
  CPU in kernel idle

ForcedPeriodic:
  fallback when a subsystem still needs regular polling
```

Enter `TicklessIdle` only when:

```text
run queue empty
no direct IPC target
no deferred completion work
no timer-side ring work required
clockevent supports one-shot
kernel idle context available
network runtime inactive or deadline-driven
```

Keep periodic preemption whenever there is runnable contention. Even a CPU with
a single runnable user thread stays periodic until the Ring v2, CPU accounting,
and timer-side polling dependencies are resolved.
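The entry conditions above can be sketched as one gate function. Every field name here is an assumption about per-CPU scheduler state, not the real layout:

```rust
// Sketch of the tickless-idle entry gate; field names are illustrative.
struct CpuSchedState {
    runnable: usize,               // run queue depth
    direct_ipc_target: bool,       // CPU is a direct IPC target
    deferred_completions: bool,    // deferred completion work pending
    timer_side_ring_work: bool,    // timer-side ring work still required
    oneshot_capable: bool,         // clockevent supports one-shot
    kernel_idle_ready: bool,       // kernel idle context available
    network_deadline_driven: bool, // network inactive or deadline-driven
}

fn may_enter_tickless_idle(s: &CpuSchedState) -> bool {
    s.runnable == 0
        && !s.direct_ipc_target
        && !s.deferred_completions
        && !s.timer_side_ring_work
        && s.oneshot_capable
        && s.kernel_idle_ready
        && s.network_deadline_driven
}

fn main() {
    let idle = CpuSchedState {
        runnable: 0,
        direct_ipc_target: false,
        deferred_completions: false,
        timer_side_ring_work: false,
        oneshot_capable: true,
        kernel_idle_ready: true,
        network_deadline_driven: true,
    };
    assert!(may_enter_tickless_idle(&idle));
    // Even one runnable user thread keeps the CPU periodic.
    assert!(!may_enter_tickless_idle(&CpuSchedState { runnable: 1, ..idle }));
    println!("gate ok");
}
```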

## Layer 2: SQPOLL NoHz

SQPOLL full-nohz is a later CPU ownership mode:

```text
full-nohz is not a timer feature here;
it is part of the SQPOLL CPU ownership contract.
```

Required prerequisites:

- Ring v2 or equivalent per-thread rings;
- one SQ consumer per ring;
- per-CPU scheduler ownership;
- reschedule IPI and idle-to-runnable handoff;
- at least one housekeeping CPU;
- explicit placement of network polling away from isolated CPUs.

Ring mode:

```rust
enum RingMode {
    Syscall,
    SqpollStarting,
    Sqpoll,
    SqpollStopping,
}
```

In syscall mode, the owner thread's `cap_enter` drains SQ. In SQPOLL mode, a
kernel worker owns SQ head; userspace owns SQ tail and CQ head; `cap_enter`
waits for completions and may wake a sleeping poller, but it does not drain
SQ.

SQPOLL state:

```text
Disabled -> Starting -> Running -> IdleSpinning -> Sleeping -> Stopping
```

The wake protocol uses a `NEED_WAKEUP` flag. Userspace release-stores the SQ
tail, acquire-loads flags, and invokes a wake path only if the poller has gone
to sleep.

The race-free sequence is normative.

Poller before sleeping:

```rust
flags.fetch_or(NEED_WAKEUP, SeqCst);

let tail = sq_tail.load(Acquire);
if sq_head != tail {
    flags.fetch_and(!NEED_WAKEUP, Release);
    continue;
}

park();
```

Producer:

```rust
write_sqe();
sq_tail.store(new_tail, Release);
fence(SeqCst);

let flags = flags.load(Acquire);
if flags & NEED_WAKEUP != 0 {
    wake_poller();
}
```

The poller must set `NEED_WAKEUP` before the final tail recheck. Otherwise a
producer can publish a new SQE after the poller checks the tail but before it
parks, losing the wake.

The `NEED_WAKEUP` publication must also be ordered before the final tail
recheck by a full store-to-load barrier. A `SeqCst` RMW is the simplest
portable rule for the ABI text; an implementation may substitute an explicitly
reviewed architecture-specific fence or park primitive that provides the same
ordering. A plain release store or release-only RMW is not sufficient for this
protocol.

The producer must likewise order the SQ tail publication before checking
`NEED_WAKEUP`. The normative sequence uses a full fence between
`sq_tail.store(..., Release)` and `flags.load(Acquire)`; an implementation may
substitute an explicitly reviewed equivalent that prevents the producer from
missing `NEED_WAKEUP` while the poller misses the new tail before parking.
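Under sequential consistency, the lost-wakeup argument can be checked exhaustively on a host. The sketch below is a hypothetical model, not kernel code: each actor's loads, stores, and decisions become indivisible steps, and every interleaving is enumerated to confirm a parked poller with a published tail is always woken.

```rust
// Host model of the wake protocol under sequential consistency. Each
// actor's loads, stores, and decisions are split into indivisible steps;
// explore() enumerates every interleaving and asserts the safety property.
#[derive(Clone, Copy, Default)]
struct Model {
    tail: u64,         // shared: SQ tail published by the producer
    need_wakeup: bool, // shared: poller's NEED_WAKEUP flag
    parked: bool,      // poller committed to sleep
    woken: bool,       // producer invoked the wake path
    p_tail: u64,       // poller-local: tail value from the final recheck
    flag_seen: bool,   // producer-local: NEED_WAKEUP value it observed
}

const STEPS: usize = 3;

fn poller(step: usize, m: &mut Model) {
    match step {
        0 => m.need_wakeup = true, // flags.fetch_or(NEED_WAKEUP, SeqCst)
        1 => m.p_tail = m.tail,    // final tail recheck after setting the flag
        _ => {
            if m.p_tail != 0 {
                m.need_wakeup = false; // new work arrived: clear flag, skip park
            } else {
                m.parked = true; // park()
            }
        }
    }
}

fn producer(step: usize, m: &mut Model) {
    match step {
        0 => m.tail = 1,                  // write_sqe(); publish tail
        1 => m.flag_seen = m.need_wakeup, // flags load after the fence
        _ => {
            if m.flag_seen {
                m.woken = true; // wake_poller()
            }
        }
    }
}

fn explore(m: Model, pi: usize, qi: usize) {
    if pi == STEPS && qi == STEPS {
        // Safety property: no interleaving may strand a parked poller.
        assert!(!(m.parked && !m.woken), "lost wakeup");
        return;
    }
    if pi < STEPS {
        let mut n = m;
        poller(pi, &mut n);
        explore(n, pi + 1, qi);
    }
    if qi < STEPS {
        let mut n = m;
        producer(qi, &mut n);
        explore(n, pi, qi + 1);
    }
}

fn main() {
    explore(Model::default(), 0, 0);
    println!("all interleavings wake-safe");
}
```

Swapping poller steps 0 and 1 (recheck before setting the flag) makes the assertion fail, which is exactly the race the prose above describes.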

An SQPOLL CPU may suppress the periodic tick only if:

```text
cpu role is SqpollIsolated
exactly one runnable entity is the poller
no ordinary user thread is runnable there
no timer-side SQ polling is enabled
no network scheduler polling is pinned there
no deferred cleanup is pinned there
stable clocksource/accounting exists
housekeeping CPU is online
```

If any condition fails, restore periodic tick or migrate the unrelated work.

### NoHz Activation Proof Obligations

To enter `SqpollNoHz` or future `AutoNoHz`, the scheduler must prove:

```text
exactly one runnable entity is assigned to the CPU
at least one housekeeping CPU is online
no local network polling dependency remains
no timer-side SQ polling can run for the active ring
no local deferred cleanup or unbound kernel worker is pinned there
no unmigratable IRQ targets that CPU unless explicitly allowed
clocksource and CPU accounting are boundary/counter driven, not tick driven
revocation latency is within the lease policy
```

The proof is dynamic. If any condition stops holding, the scheduler must
restore periodic tick, migrate unrelated work, revoke the lease, or leave nohz
mode before continuing.

## Layer 3: AutoNoHz CPU Lease

The long-term design should split eligibility from activation.

Eligibility says a thread, process, ring, or realtime island may use nohz
isolation:

```rust
enum NoHzKind {
    Idle,
    KernelSqpoll,
    AutoCompute,
    AutoUserspacePoller,
    RealtimeIsland,
}

struct NoHzEligibility {
    kind: NoHzKind,
    max_revocation_latency_ns: u64,
    preferred_cpus: CpuSet,
    allow_busy_spin: bool,
    accounting_target: CpuAccountingTarget,
}

enum CpuAccountingTarget {
    CurrentSchedulingContext,
    SchedulerResourceLedger,
}
```

Activation is a scheduler proof that a CPU currently satisfies isolation
conditions. Without a lease, a latency-sensitive hint may influence placement
but must not grant exclusive CPU access.

Future lease shape:

```text
CpuIsolationLease:
  owner process/session
  allowed CPU set
  allowed mode: poller/compute/kernel-worker
  accounting target, not CPU-time credit
  revocation policy
```

Housekeeping must be explicit:

```text
Housekeeping CPU set:
  global timers
  deferred frees
  cleanup
  statistics
  non-critical kernel workers
  debug/watchdog
  load balancing and migration control
```

## Layer 4: Deadline Metadata

Deadline metadata lives in fixed ring ABI fields, not in a Cap'n Proto SQE
envelope and not in variable side metadata. The current fixed SQE layout should
not be silently reinterpreted; add these fields through a versioned
`CapSqeV2`/ring ABI gate when the transport is ready.

```rust
#[repr(C)]
struct CapSqeV2 {
    // existing fixed CapSqe fields, unchanged in order and meaning

    deadline_ns: u64,   // absolute monotonic deadline, 0 = none
    qos_flags: u32,     // drop/allow/reorder/propagate semantics
    sched_ctx_id: u32,  // 0 = current/default scheduling context
}
```

`deadline_ns` is an absolute monotonic timestamp. It is request freshness
metadata, not a promise of nanosecond wakeup precision. The kernel may round
timer programming to clockevent granularity, coalesce timers where policy
allows, or report a miss when dispatch observes the timestamp has already
expired. The field remains `u64` nanoseconds because absolute `u64` ns values
are simple, tracing-friendly, and shared with existing timeout surfaces; a
`u64` microsecond field saves no ABI space.

Only consider a compact profile if SQE space becomes critical:

```rust
deadline_delta_us: u32
```

That profile would be a soft-deadline compact transport shape only. It is not
the primary realtime or `SchedulingContext` ABI and must not replace
`deadline_ns` for admitted realtime work.
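If that profile ever lands, the transport would expand it back to the canonical absolute form at submission time. A sketch, with saturating math so a large delta cannot wrap:

```rust
// Sketch: expanding the compact soft-deadline profile to the canonical
// absolute u64-nanosecond form; saturating so a huge delta cannot wrap.
fn deadline_ns_from_delta_us(now_ns: u64, deadline_delta_us: u32) -> u64 {
    now_ns.saturating_add(u64::from(deadline_delta_us) * 1_000)
}

fn main() {
    assert_eq!(deadline_ns_from_delta_us(1_000_000, 500), 1_500_000);
    // Near the end of the range, saturate instead of wrapping to a past deadline.
    assert_eq!(deadline_ns_from_delta_us(u64::MAX - 10, 1), u64::MAX);
    println!("compact expansion ok");
}
```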

ABI negotiation uses both bootstrap metadata and a runtime query surface:

```rust
struct RuntimeBootInfo {
    ring_addr: u64,
    ring_abi_version: u32,
    sqe_size: u16,
    cqe_size: u16,
}
```

- Process bootstrap passes the ring ABI version and fixed entry sizes alongside
  the ring address.
- `RuntimeBootInfo`, ring ABI version constants, and fixed SQE/CQE layouts live
  in `capos-config/src/ring.rs`; the kernel and `capos-rt` import the same
  definition rather than carrying local copies.
- A future `RuntimeInfo`/`SystemInfo` query returns the kernel-supported ring
  ABI range so language runtimes can fail before mapping incompatible rings.
- `cap_enter` rejects unsupported SQE versions or entry sizes with stable
  transport errors such as `CAP_ERR_UNSUPPORTED_RING_ABI` and
  `CAP_ERR_UNSUPPORTED_SQE_VERSION`.
- Runtimes in Rust, C, Go, and other languages must generate or mirror the
  exact fixed layout for the negotiated version.
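A hypothetical runtime-side check before mapping the ring might look like the following. The error names come from this proposal; their numeric values, the supported-version constants, and the entry sizes are placeholders:

```rust
// Placeholder error values; only the names come from the proposal.
const CAP_ERR_UNSUPPORTED_RING_ABI: i32 = -1001;
const CAP_ERR_UNSUPPORTED_SQE_VERSION: i32 = -1002;

struct RuntimeBootInfo {
    ring_addr: u64,
    ring_abi_version: u32,
    sqe_size: u16,
    cqe_size: u16,
}

// Hypothetical shape for the kernel-supported range and fixed entry sizes.
struct SupportedRingAbi {
    min_version: u32,
    max_version: u32,
    sqe_size: u16,
    cqe_size: u16,
}

fn check_ring_abi(info: &RuntimeBootInfo, sup: &SupportedRingAbi) -> Result<(), i32> {
    if info.ring_abi_version < sup.min_version || info.ring_abi_version > sup.max_version {
        return Err(CAP_ERR_UNSUPPORTED_RING_ABI);
    }
    if info.sqe_size != sup.sqe_size || info.cqe_size != sup.cqe_size {
        return Err(CAP_ERR_UNSUPPORTED_SQE_VERSION);
    }
    Ok(())
}

fn main() {
    let sup = SupportedRingAbi { min_version: 2, max_version: 2, sqe_size: 80, cqe_size: 32 };
    let good = RuntimeBootInfo { ring_addr: 0x1000, ring_abi_version: 2, sqe_size: 80, cqe_size: 32 };
    assert!(check_ring_abi(&good, &sup).is_ok());
    // A runtime built for an older ring ABI fails before mapping anything.
    let old = RuntimeBootInfo { ring_abi_version: 1, ..good };
    assert_eq!(check_ring_abi(&old, &sup), Err(CAP_ERR_UNSUPPORTED_RING_ABI));
    println!("negotiation ok");
}
```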

Suggested flags:

```text
DROP_IF_LATE:
  if now > deadline_ns before dispatch, post DEADLINE_EXPIRED

ALLOW_LATE:
  dispatch anyway, but CQE/telemetry marks late

PROPAGATE_DEADLINE:
  endpoint CALL/RETURN carries deadline metadata to server-side request

DEADLINE_ORDERED:
  SQPOLL may reorder within a bounded window only when all reorder-safety
  checks below pass

NO_BLOCKING_PATH:
  reject if target method/op is not declared realtime-safe
```

Do not put budget, period, priority, criticality, or CPU affinity into each
SQE. Deadline is per request. Budget is execution authority.

`DEADLINE_ORDERED` is valid only when all of the following are true:

```text
the ring mode permits reordering
the SQE marks this request reorderable
the target capability interface and method declare reorder-safe semantics
the reordering window is bounded
the operation does not depend on earlier same-ring requests for correctness
```

Ordered side effects such as `write A; write B; flush` or `lock; mutate;
unlock` must not be deadline-reordered unless the target method contract
explicitly defines that sequence as reorder-safe.

## Layer 5: SchedulingContext

CPU time should become a capability-controlled object:

```rust
struct SchedulingContext {
    budget_ns: u64,
    period_ns: u64,
    relative_deadline_ns: u64,
    priority: u16,
    criticality: u8,
    cpu_mask: CpuSet,
    overrun_policy: OverrunPolicy,
    timeout_endpoint: Option<EndpointRef>,
}
```

Kernel responsibilities:

- decrement remaining budget by actual runtime;
- replenish budget by period;
- throttle or fault a thread on depletion;
- enforce CPU mask and scheduling eligibility;
- dispatch among eligible contexts by the selected realtime policy;
- prevent untrusted SQE bytes from minting budget.

Policy-service responsibilities:

- admission control;
- budget/period/priority selection;
- CPU-isolation lease policy;
- overload response;
- telemetry and retuning.
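The kernel-side budget mechanics can be sketched as follows; the struct shape, the replenish rule, and the throttle decision are illustrative assumptions, not the final scheduler types:

```rust
// Sketch of budget accounting: decrement by actual runtime, replenish by
// period, throttle on depletion. Names and shapes are illustrative.
struct BudgetState {
    budget_ns: u64,         // full budget granted per period
    period_ns: u64,
    remaining_ns: u64,      // budget left in the current period
    next_replenish_ns: u64,
}

enum RunDecision {
    Run,
    Throttle,
}

impl BudgetState {
    /// Charge actual runtime against the context that ran the work.
    fn charge(&mut self, ran_ns: u64) {
        self.remaining_ns = self.remaining_ns.saturating_sub(ran_ns);
    }

    /// Replenish by period, then decide eligibility at this instant.
    fn tick(&mut self, now_ns: u64) -> RunDecision {
        if now_ns >= self.next_replenish_ns {
            self.remaining_ns = self.budget_ns;
            self.next_replenish_ns = now_ns + self.period_ns;
        }
        if self.remaining_ns == 0 {
            RunDecision::Throttle
        } else {
            RunDecision::Run
        }
    }
}

fn main() {
    let mut ctx = BudgetState {
        budget_ns: 100_000,
        period_ns: 1_000_000,
        remaining_ns: 100_000,
        next_replenish_ns: 1_000_000,
    };
    ctx.charge(100_000); // budget exhausted mid-period
    assert!(matches!(ctx.tick(500_000), RunDecision::Throttle));
    assert!(matches!(ctx.tick(1_000_000), RunDecision::Run)); // replenished
    println!("budget mechanics ok");
}
```

Note that untrusted SQE bytes never touch this state: `sched_ctx_id` only selects an already-minted context, which is the "no minting budget" rule above.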

## Layer 6: Donation

Synchronous capability calls need scheduling-context donation:

```text
client SchedulingContext -> passive server endpoint
server runs on donated budget/deadline
context returns on reply
timeout/overrun reports to caller or island policy
```

Without donation or inheritance, a realtime caller can be defeated by a
normal-priority server that holds the capability implementation path.

Donation semantics must be fixed before implementation:

```text
max donation call depth:
  bounded per SchedulingContext or RealtimeIsland; overflow fails closed.

nested donation:
  nested synchronous calls carry the current donated context until the depth
  bound, unless a callee uses its own admitted context by explicit policy.

cycle handling:
  a donated context may not re-enter a thread already on its donation stack;
  cycles fail with a typed realtime/donation error.

partial failure:
  budget already consumed stays charged to the context that ran the work.
  rollback of authority or memory is separate from CPU charge rollback.

timeout propagation:
  the earliest of request deadline, scheduling-context deadline, and explicit
  call timeout bounds downstream execution.

server-side blocking:
  a passive server running on donated context may block only on approved
  realtime-safe waits or synchronous calls that continue donation.

return on exception:
  application exceptions, transport errors, and cancellation return the
  context to its previous owner before CQE/error delivery.

async endpoint queues:
  donation does not cross ordinary async endpoint enqueue by default. Async
  donation requires an explicit future token/lease design.
```
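The depth and cycle rules can be sketched as a per-context donation stack; the bound, the error names, and the `u64` thread ids here are illustrative:

```rust
// Sketch: bounded donation stack that fails closed on overflow and
// refuses re-entry of a thread already holding the donated context.
const MAX_DONATION_DEPTH: usize = 4; // illustrative bound

#[derive(Debug, PartialEq)]
enum DonationError {
    DepthExceeded,
    Cycle,
}

#[derive(Default)]
struct DonationStack {
    threads: Vec<u64>, // threads currently running on the donated context
}

impl DonationStack {
    /// Donate along one synchronous call hop; overflow and cycles fail closed.
    fn donate_to(&mut self, thread_id: u64) -> Result<(), DonationError> {
        if self.threads.len() >= MAX_DONATION_DEPTH {
            return Err(DonationError::DepthExceeded);
        }
        if self.threads.contains(&thread_id) {
            return Err(DonationError::Cycle);
        }
        self.threads.push(thread_id);
        Ok(())
    }

    /// Reply, exception, or cancellation returns the context one hop.
    fn return_from(&mut self) {
        self.threads.pop();
    }
}

fn main() {
    let mut stack = DonationStack::default();
    assert!(stack.donate_to(1).is_ok());
    assert_eq!(stack.donate_to(1), Err(DonationError::Cycle)); // re-entry
    for t in 2..=4 {
        assert!(stack.donate_to(t).is_ok());
    }
    assert_eq!(stack.donate_to(5), Err(DonationError::DepthExceeded));
    stack.return_from(); // reply: context returns to the previous owner
    assert!(stack.donate_to(5).is_ok());
    println!("donation rules ok");
}
```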

Hot admitted paths should avoid blocking locks. If a shared resource cannot be
modeled as a passive service, it needs a reviewed priority/deadline-inheritance
primitive or a bounded try-lock/fail/drop policy.

## Layer 7: RealtimeIsland

`RealtimeIsland` admits a whole loop or graph:

```rust
struct RealtimeIslandSpec {
    period_ns: u64,
    deadline_ns: u64,
    cpu_set: CpuSet,
    nodes: Vec<NodeBudget>,
    rings: Vec<RingSpec>,
    memory: Vec<PreallocSpec>,
    devices: Vec<DeviceReservation>,
    overrun_policy: OverrunPolicy,
}
```

Admission requires:

- total budget fits period/deadline constraints;
- all hot-path buffers are preallocated;
- hot-path memory is committed and resident before start;
- guaranteed hot-path memory uses the OOM proposal's `MemoryResidency` policy
  as `pinned` or `secret`; `normal` memory is not admitted for guaranteed hot
  paths. A future lock-resident operation may transition ordinary memory into a
  pinned reservation before admission, but the admitted island sees the result
  as `pinned`, not as `normal`;
- all caps and policy decisions are resolved before start;
- no expected page faults on the hot path;
- no unbounded lock acquisition;
- no blocking endpoint calls inside callback loops;
- no allocation, logging, service discovery, or provider credential work on
  the realtime path;
- IRQ and deferred work are bounded or moved outside the island.

Failure semantics must be typed:

```text
CAP_ERR_DEADLINE_EXPIRED
CAP_ERR_BUDGET_EXHAUSTED
CAP_ERR_REALTIME_UNSAFE_PATH
CAP_ERR_REALTIME_ADMISSION_DENIED
CAP_ERR_OVERRUN
CAP_ERR_STALE_INPUT
```

CQE/status should distinguish not-started-late, completed-late, dropped by
policy, throttled, and dependency-cancelled.

## Telemetry Requirements

Tickless, nohz, SQPOLL, and realtime behavior must be observable through
future monitoring/status capability surfaces, not only through ad hoc debug
logs. The first counters should include:

```text
scheduler_tick_count{cpu}
ticks_suppressed{cpu,mode}
nohz_enter_count{cpu,kind}
nohz_exit_count{cpu,reason}
oneshot_deadline_miss_count
sqpoll_busy_ns
sqpoll_sleep_count
deadline_expired_count
budget_exhausted_count
realtime_overrun_count
donation_depth_max
housekeeping_offload_count
```

These counters are correctness evidence. Missing or surprising values should
fail focused nohz/realtime proofs rather than being treated as performance-only
diagnostics.

## Implementation Sequence

1. Add timer/scheduler instrumentation around the existing periodic tick.
2. Add `monotonic_ns()` and switch `Timer.now` to the clocksource layer while
   keeping periodic scheduling.
3. Convert timeout waiters to `deadline_ns`.
4. Add LAPIC one-shot programming and a focused one-shot smoke.
5. Replace user-mode idle with kernel/per-CPU idle while keeping periodic
   ticks.
6. Enable tickless idle only when there is no runnable work.
7. Keep networking in `ForcedPeriodic` or add explicit network poll deadlines.
8. Add ring mode state and refuse timer-side SQ processing for SQPOLL rings.
9. Land Ring v2 per-thread ring ownership and completion routing.
10. Add the SQPOLL wake/sleep protocol and a host or Loom-style lost-wakeup
    model.
11. Add kernel SQPOLL without full-nohz, under normal scheduler ticks.
12. Add CPU isolation leases and housekeeping CPU placement.
13. Enable SQPOLL nohz on isolated CPUs.
14. Add request `deadline_ns` metadata and typed late/drop CQE outcomes.
15. Add `SchedulingContext` and admission-controlled realtime islands.

## Verification

Tickless idle gates:

```text
make fmt-check
cargo test-lib
cargo test-config
make run-smoke
make run-spawn
```

Additional tickless proof:

```text
1 second idle interval does not produce 100 scheduler ticks
Timer.sleep still completes
cap_enter timeout still completes
ParkSpace timeout still completes
preemption fairness unchanged with runnable contention
```

SQPOLL gates:

```text
thread-lifecycle
timer-smoke
timer-flood
park wake/timeout
endpoint CALL/RECV/RETURN
mandatory host or Loom-style lost-wakeup model before any real SQPOLL worker:
  poller: set NEED_WAKEUP -> full barrier -> recheck tail -> park
  producer: write SQE -> publish tail -> full barrier -> check NEED_WAKEUP -> wake
```

Realtime gates:

```text
deadline ordering tests
budget depletion tests
donation/return tests through passive endpoint
admission denial tests
QEMU proof for late/drop/overrun behavior
telemetry counters prove ticks suppressed, deadlines expired, budgets
exhausted, and donation depth bounded as expected
```

## Decision

Adopt this staged direction:

```text
Tickless idle:
  yes, after clocksource/clockevent split and kernel idle.

Generic full-nohz:
  defer. It depends on per-CPU scheduling, Ring v2, accounting, and
  housekeeping.

SQPOLL nohz:
  yes, but only as explicit CPU-isolation authority after Ring v2.

Realtime:
  `SQE.deadline_ns` is useful metadata, but `SchedulingContext` is the
  authority that provides CPU time.
```
