# Proposal: Realtime Voice Agent Shell

How capOS should support web-shell and native-shell voice interaction when
modern multimodal models can consume realtime audio and emit both audio streams
and structured tool calls.


## Problem

The existing language-model proposal defines a text-oriented agent runner:
messages, streamed text, structured tool calls, and per-tool permission policy.
That model still works, but it is incomplete for modern voice agents. Current
provider APIs can run stateful realtime sessions in which the model listens to
audio directly, speaks audio, handles voice activity detection (VAD) and
barge-in, and emits function calls in the same interaction.

If capOS models voice only as "ASR into a text shell, then TTS the answer," it
forfeits the lower latency and richer interaction model of native realtime
audio. If capOS instead lets provider-native sessions execute tools directly,
it breaks the capability model. The design needs a middle path.

## Goals

- Support native realtime audio model sessions alongside chained ASR/text/TTS
  pipelines.
- Preserve the existing agent-shell security rule: the model never holds
  session caps or tool caps.
- Let `WebShellGateway` host terminal and voice transport without becoming an
  authority sink.
- Keep microphone/speaker media out of `TerminalSession` text APIs.
- Minimize media-stack latency and guarantee a bound on it for admitted
  capOS-controlled realtime islands, preferring enforceable bounds over
  optimistic nominal latency.
- Support provider adapters for OpenAI Realtime, Gemini Live API, Vertex AI
  Live API, local ASR/TTS, and future local realtime multimodal models.
- Carry timestamps, deadlines, transcripts, interruptions, and tool-call ids as
  first-class session data.
- Make direct browser-to-provider media an optional optimization guarded by
  broker-minted ephemeral credentials.
- Allow a browser agent to be the web-shell UI and orchestrate the realtime
  provider loop, while keeping capOS tool execution gateway-enforced.

## Non-Goals

- Implementing provider SDKs in the kernel.
- Giving a browser any capOS capability handle.
- Treating voice recognition, wake words, or VAD as authorization.
- Making a realtime model's free-form speech or text executable.
- Guaranteeing full-path realtime behavior for browser, network, or remote
  provider segments. Native local media can enter guaranteed realtime islands
  only after scheduling contexts and device isolation mature.

## Architecture

```mermaid
flowchart LR
    Browser[Browser UI] -->|terminal frames| Gateway[WebShellGateway]
    Browser -->|mic/playback frames| Gateway

    Gateway --> Terminal[TerminalSession]
    Gateway --> Voice[VoiceSession]

    Shell[capos-shell agent mode] --> Terminal
    Shell --> Voice
    Shell --> Runner[Agent Runner]

    Runner --> RT[RealtimeModelSession]
    Runner --> Broker[AuthorityBroker]
    Runner --> Audit[AuditLog]
    Runner --> Tools[Session tool caps]

    RT --> Provider[Realtime provider adapter]
    Provider --> Remote[OpenAI / Gemini / Vertex]
    Provider --> Local[Local model backend]
```

Principal split:

- `WebShellGateway` authenticates browser sessions, owns browser transport,
  creates terminal and voice session objects, and tears down resources.
- `capos-shell` in agent mode owns the session bundle and acts as the trusted
  runner for capOS-side agent sessions.
- A browser agent UI may own the web conversation and provider session loop,
  but only as an untrusted client of `WebShellGateway`'s tool proxy.
- `RealtimeModelSession` is a model I/O object. It carries audio, text,
  transcripts, tool calls, and tool results. It has no authority over capOS
  tools.
- Provider adapters hold narrow provider credentials or model-runtime caps.
- The browser holds no capOS session caps, no tool caps, no long-lived provider
  API keys, and no bearer tokens other than short-lived provider-scoped tokens
  when a direct-media optimization is explicitly enabled.

## Interfaces

The exact schema belongs to the implementation milestone. The shape should be:

```capnp
interface RealtimeModel {
  info @0 () -> (info :RealtimeModelInfo);
  open @1 (config :RealtimeSessionConfig)
      -> (session :RealtimeModelSession);
}

interface RealtimeModelSession {
  send @0 (event :RealtimeInputEvent) -> ();
  next @1 () -> (event :RealtimeOutputEvent, done :Bool);
  sendToolResult @2 (result :RealtimeToolResult) -> ();
  cancel @3 (reason :CancelReason) -> ();
  close @4 () -> ();
}
```

`RealtimeInputEvent` should cover:

- audio frame reference;
- text input;
- image/video frame reference;
- push-to-talk start/end;
- playback-position feedback;
- tool result;
- cancel, truncate, close.

`RealtimeOutputEvent` should cover:

- audio frame reference;
- text delta;
- partial and final transcript;
- tool call delta and complete tool call;
- interruption/barge-in;
- session warning/error;
- provider usage/cost metadata;
- close/go-away/reconnect notice.

Audio frames should not be copied through Cap'n Proto payloads in the hot path.
Use `MemoryObject`-backed media rings or provider-owned stream handles. Cap'n
Proto remains the control plane.
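
The control-plane loop implied by `next()` can be sketched against a scripted
fake session, in the spirit of a provider-free test adapter. This is a minimal
illustration, not the capOS schema: the event tags, the `offset`/`length`
payload keys, and the reading of `done` as "this was the last event" are all
assumptions made for the example.

```python
from dataclasses import dataclass, field


@dataclass
class OutputEvent:
    kind: str  # hypothetical tags: "audio", "text_delta", "tool_call", "close"
    payload: dict = field(default_factory=dict)


class FakeRealtimeSession:
    """Scripted stand-in for RealtimeModelSession (no provider involved)."""

    def __init__(self, script):
        self._script = list(script)

    def next(self):
        # Mirrors next() -> (event, done); done marks the last scripted event.
        event = self._script.pop(0)
        return event, not self._script


def drain(session):
    """Consume output events until done, keeping audio as ring references."""
    text, frame_refs, tool_calls = [], [], []
    while True:
        event, done = session.next()
        if event.kind == "text_delta":
            text.append(event.payload["text"])
        elif event.kind == "audio":
            # Only (offset, length) into a shared ring; no sample bytes here.
            frame_refs.append((event.payload["offset"], event.payload["length"]))
        elif event.kind == "tool_call":
            tool_calls.append(event.payload)
        if done:
            return "".join(text), frame_refs, tool_calls
```

The point of the sketch is the split: control events carry small structured
values, while audio events carry only a reference into memory that the media
path already shares.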

## Tool Calls

Realtime tool calls use the same policy as text agent calls.

```mermaid
sequenceDiagram
    participant Model as RealtimeModelSession
    participant Runner as Agent Runner
    participant Broker as AuthorityBroker
    participant Tool as Typed Tool Cap
    participant Audit as AuditLog

    Model->>Runner: tool_call(name, args, provider_call_id)
    Runner->>Runner: validate ToolDescriptor
    Runner->>Broker: authorize tool call
    Broker-->>Runner: auto / consent / stepUp / forbidden
    Runner->>Tool: invoke if allowed
    Tool-->>Runner: typed result
    Runner->>Audit: record decision and outcome
    Runner->>Model: tool result
```

The runner owns the mapping from provider call ids to capOS audit/tool-call
ids in capOS-side mode. In browser-agent UI mode, `WebShellGateway`'s tool
proxy owns that mapping. Provider ids are useful correlation metadata, but
they are not authority.

Tool execution must be time-boxed. If a tool blocks too long, the runner or
gateway tool proxy sends a typed timeout result back to the realtime model and
continues or ends the turn according to policy.
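
A minimal sketch of the runner-side flow described above, assuming a
hypothetical `ToolDescriptor` with a flat set of required argument names and a
per-tool timeout. The `Runner` class, its result dictionaries, and the status
strings are illustrative only; the broker is modelled as a plain callable that
returns one of the four decisions.

```python
import concurrent.futures
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolDescriptor:
    name: str
    required_args: frozenset
    timeout_s: float


class Runner:
    def __init__(self, descriptors, tools, authorize):
        self.descriptors = {d.name: d for d in descriptors}
        self.tools = tools          # name -> callable; held by the runner only
        self.authorize = authorize  # (name, args) -> auto/consent/stepUp/forbidden
        self.audit = []
        self.call_ids = {}          # provider call id -> capOS audit id
        self._pool = concurrent.futures.ThreadPoolExecutor(max_workers=4)

    def handle(self, name, args, provider_call_id):
        audit_id = len(self.audit) + 1
        # The provider id is correlation metadata; it grants nothing.
        self.call_ids[provider_call_id] = audit_id
        result = self._decide_and_run(name, args)
        self.audit.append({"id": audit_id, "tool": name,
                           "provider_call_id": provider_call_id,
                           "status": result["status"]})
        return result

    def _decide_and_run(self, name, args):
        desc = self.descriptors.get(name)
        if desc is None or set(args) != set(desc.required_args):
            return {"status": "rejected", "reason": "descriptor_mismatch"}
        decision = self.authorize(name, args)
        if decision == "forbidden":
            return {"status": "denied", "decision": decision}
        if decision != "auto":
            # consent / stepUp are resolved elsewhere before invocation.
            return {"status": "pending", "decision": decision}
        future = self._pool.submit(self.tools[name], **args)
        try:
            return {"status": "ok", "value": future.result(timeout=desc.timeout_s)}
        except concurrent.futures.TimeoutError:
            # A running thread cannot be killed; a real tool call would need
            # its own cancellation path. The model still gets a typed timeout.
            future.cancel()
            return {"status": "timeout"}
```

Every path, including rejection and timeout, produces an audit entry and a
typed result that can be sent back to the model, so a stalled tool never leaves
the turn hanging.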

## Voice Session

`VoiceSession` is the shell-facing media session object created by
`WebShellGateway` or a native terminal host.

```capnp
interface VoiceSession {
  describe @0 () -> (info :VoiceSessionInfo);
  openCapture @1 (format :AudioFormat) -> (stream :AudioInputStream);
  openPlayback @2 (format :AudioFormat) -> (stream :AudioOutputStream);
  event @3 () -> (event :VoiceSessionEvent);
  close @4 () -> ();
}
```

For the web shell, `VoiceSession` is backed by browser media APIs. For native
capOS, it can be backed by an audio device service. Either way, it is separate
from `TerminalSession`:

- terminal input/output remains text and presentation;
- voice capture/playback is timestamped binary media;
- transcripts can be rendered into the terminal, but they are not terminal
  input until the runner accepts them as a user turn.

## Media Graph

The local media graph is a userspace service/library layer, not a kernel
feature. Its latency goal is the lowest guaranteed-stable operating point for
the selected device, graph, and policy: a fixed quantum with admitted CPU,
memory, device, and wakeup budgets, not the smallest buffer value that can be
configured.

```mermaid
flowchart LR
    Capture[Capture source] --> Convert[format converter / resampler]
    Convert --> Gate[VAD or push-to-talk gate]
    Gate --> Input[realtime provider adapter or local ASR]
    Input --> Runner[agent runner]
    Runner --> Output[realtime provider adapter or local TTS]
    Output --> Playback[playback sink]
```

For browser voice, the graph may partly live in browser JavaScript and partly
in capOS services. For native hardware, the graph eventually uses audio driver
services that hold `DeviceMmio`, `DMAPool`, and `Interrupt` capabilities.

Graph control operations are ordinary endpoint calls:

- create node;
- connect port;
- set format;
- allocate buffer pool;
- start/stop stream;
- set deadline and latency policy.

Graph data uses `MemoryObject` pools and notification/futex wakeups. Audio
frames carry:

```text
sequence
capture_time_ns
playback_time_ns
deadline_ns
format
offset
length
flags
```

The realtime data path should not perform allocation, blocking IPC, logging,
permission checks, provider credential work, or graph mutation. Those remain
control-plane operations. Any bridge that crosses process, clock, network,
provider, or browser boundaries must declare its extra latency so the graph can
report the full stack rather than burying delay in queues. A non-guaranteed
bridge must not backpressure a guaranteed island; it must drop, silence,
bypass, stop, or renegotiate.
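
As a rough illustration of the frame metadata and the no-backpressure rule, the
sketch below packs the listed fields into a fixed-size header and models a
bridge that evicts its oldest entry instead of blocking the producer. The
field widths (four 64-bit values followed by four 32-bit values), the
`DroppingBridge` name, and its counters are assumptions for the example; a real
data path would also avoid the allocations Python performs here.

```python
import struct
from collections import deque
from dataclasses import astuple, dataclass

# Assumed layout: sequence, capture, playback, deadline as u64;
# format, offset, length, flags as u32. Little-endian, 48 bytes.
HEADER = struct.Struct("<QQQQIIII")


@dataclass(frozen=True)
class FrameHeader:
    sequence: int
    capture_time_ns: int
    playback_time_ns: int
    deadline_ns: int
    format: int
    offset: int
    length: int
    flags: int

    def pack(self):
        return HEADER.pack(*astuple(self))

    @classmethod
    def unpack(cls, raw):
        return cls(*HEADER.unpack(raw))


class DroppingBridge:
    """Non-guaranteed bridge: declares its latency and never blocks push()."""

    def __init__(self, capacity, declared_latency_ns):
        self.capacity = capacity
        self.declared_latency_ns = declared_latency_ns
        self.dropped = 0
        self._queue = deque()

    def push(self, header):
        if len(self._queue) >= self.capacity:
            self._queue.popleft()  # drop the oldest frame, keep the producer moving
            self.dropped += 1
        self._queue.append(header)

    def pop(self):
        return self._queue.popleft() if self._queue else None
```

The `dropped` counter and `declared_latency_ns` are the kind of values the
graph would surface in telemetry rather than hide inside a queue.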

## WebShellGateway Modes

### Gateway-Mediated Provider Session

```mermaid
flowchart LR
    Browser[Browser] <--> Gateway[WebShellGateway]
    Gateway <--> Adapter[ProviderAdapter]
    Adapter <--> Provider[Provider API]
```

Properties:

- provider long-lived credentials remain server-side;
- tool-call events remain server-side unless explicitly proxied to a browser
  agent UI under broker policy;
- gateway can record/drop/rate-limit media;
- easier audit and teardown;
- higher latency because audio crosses the gateway.

This is the baseline mode.

### Direct Browser Provider Media

```mermaid
flowchart LR
    Browser[Browser] <--> Provider[Provider API]
    Browser <--> Gateway[WebShellGateway control/audit path]
```

Properties:

- lower media latency;
- browser receives provider-specific ephemeral credential;
- gateway may not see every media frame or provider control event;
- allowed only when broker policy says direct media is acceptable;
- provider tool declarations are disabled unless either a trusted server-side
  control channel handles tool calls and results, or the session is explicitly
  in browser-agent UI mode and every tool call is routed through
  `WebShellGateway`'s server-side tool proxy.

Direct mode requires:

- provider token scoped to model/config/session;
- short expiration;
- no capOS capability material in the token;
- one of: provider tools disabled; provider-supported server-side receipt of
  tool calls plus server-side submission of tool results; or browser-agent UI
  mode, in which JavaScript receives provider tool calls but can only send
  structured `ToolRequest` values to `WebShellGateway`;
- trusted revocation or session close path; if the provider exposes only a
  browser-held connection, the kill switch is best-effort and must not be
  described as authoritative;
- audit that records direct-media mode, token issuance metadata, disabled tool
  status, and any uninspected media/control scope;
- fallback to gateway-mediated mode.
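
The gateway-side bookkeeping for these requirements might look like the sketch
below. It models only the record the gateway keeps and audits; the actual
credential would come from the provider's own ephemeral-token mechanism, which
is not shown. `DirectMediaGrant`, `issue_grant`, `grant_is_live`, the audit
dictionary keys, and the 60-second default lifetime are all hypothetical.

```python
import secrets
from dataclasses import dataclass


@dataclass(frozen=True)
class DirectMediaGrant:
    grant_id: str        # opaque id; contains no capOS capability material
    provider: str
    model: str
    session_id: str
    expires_at_s: float
    tools_enabled: bool


def issue_grant(policy_allows_direct, provider, model, session_id,
                now_s, audit, ttl_s=60.0):
    if not policy_allows_direct:
        # Broker said no: record it and let the caller use the baseline mode.
        audit.append({"event": "direct_media_denied", "session": session_id,
                      "fallback": "gateway-mediated"})
        return None
    grant = DirectMediaGrant(secrets.token_hex(8), provider, model,
                             session_id, now_s + ttl_s, False)
    audit.append({"event": "direct_media_granted", "grant": grant.grant_id,
                  "session": session_id, "expires_at_s": grant.expires_at_s,
                  "tools_enabled": False})
    return grant


def grant_is_live(grant, session_id, now_s):
    """A grant is usable only for its own session and before expiry."""
    return (grant is not None and grant.session_id == session_id
            and now_s < grant.expires_at_s)
```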

### Browser Agent UI Direct Provider Session

This mode is distinct from merely moving media off the gateway. The browser
agent is the UI: it owns the visible conversation, calls the realtime provider
with an ephemeral credential, receives provider tool-call events, and feeds
tool results back to the provider. It still does not receive capOS caps.

```mermaid
flowchart LR
    BrowserAgent[Browser Agent UI] <--> Provider[Provider API]
    BrowserAgent -->|ToolRequest| Gateway[WebShellGateway ToolProxy]
    Gateway --> Broker[AuthorityBroker]
    Gateway --> Tools[Session tool caps]
    Gateway --> Audit[AuditLog]
    Gateway -. "ToolResult" .-> BrowserAgent
```

Rules:

- the browser credential is scoped to provider, model/config, session,
  conversation, media mode, and short expiration;
- the gateway publishes a signed or MACed tool descriptor snapshot for the
  current turn;
- browser tool requests must carry the descriptor snapshot id, provider call
  id, conversation id, turn id, and typed arguments;
- gateway rejects stale snapshots, replay, unknown tools, schema mismatches,
  missing consent, missing step-up, and requests after session teardown;
- gateway performs all real capOS capability invocations server-side and
  records that the request was browser-agent-proposed;
- broker policy may deny browser-agent UI mode when prompt, transcript, media,
  or tool-result confidentiality requires capOS-side provider mediation.

This is lower latency and can use provider-native browser APIs, but it gives
up gateway inspection of some media/control frames. Audit must record that
fact instead of implying full gateway mediation.
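
The snapshot and rejection rules can be sketched with an HMAC over the
published descriptor set. This is a simplified model under stated assumptions:
the `ToolProxy` class, the request field names, the JSON canonicalization, and
the flat "list of argument names" schema are invented for the example, and an
`"accepted"` result here only means the request may proceed to broker consent
and step-up checks.

```python
import hashlib
import hmac
import json


def snapshot_mac(key, snapshot_id, turn_id, tools):
    body = json.dumps([snapshot_id, turn_id, tools], sort_keys=True).encode()
    return hmac.new(key, body, hashlib.sha256).hexdigest()


class ToolProxy:
    def __init__(self, key):
        self.key = key
        self.snapshot = None  # (snapshot_id, turn_id, tools) for the current turn
        self.seen = set()     # (conversation_id, provider_call_id) already used
        self.closed = False

    def publish(self, snapshot_id, turn_id, tools):
        self.snapshot = (snapshot_id, turn_id, tools)
        return snapshot_mac(self.key, snapshot_id, turn_id, tools)

    def check(self, req):
        if self.closed:
            return "session_closed"
        if (self.snapshot is None
                or (req["snapshot_id"], req["turn_id"]) != self.snapshot[:2]):
            return "stale_snapshot"
        expected = snapshot_mac(self.key, *self.snapshot)
        if not hmac.compare_digest(expected, req["snapshot_mac"]):
            return "bad_mac"
        tools = self.snapshot[2]
        if req["tool"] not in tools:
            return "unknown_tool"
        if sorted(req["args"]) != sorted(tools[req["tool"]]):
            return "schema_mismatch"
        call = (req["conversation_id"], req["provider_call_id"])
        if call in self.seen:
            return "replay"
        self.seen.add(call)
        return "accepted"
```

A request that names a tool outside the snapshot, or reuses a provider call id
within the same conversation, is rejected before any capability is touched.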

## Realtime Provider Adapter

A provider adapter is a normal service process. It should expose
`RealtimeModel`, not provider-specific credentials.

OpenAI adapter:

- uses WebRTC for browser direct mode or WebSocket for server-side mode;
- maps provider function-call events either to server-side capOS
  `RealtimeToolCall` values or to browser-agent `ToolRequest` forwarding;
- maps `RealtimeToolResult` to provider `function_call_output` items;
- handles response cancellation and output-audio truncation.

Gemini developer adapter:

- uses Live API WebSocket;
- supports ephemeral-token direct mode when broker policy allows;
- maps `RealtimeToolResult` to provider `FunctionResponse` messages;
- models synchronous and non-blocking function-call behavior explicitly.

Vertex adapter:

- uses cloud auth and Vertex AI Live API;
- exposes deployment metadata such as project/location/model id;
- respects enterprise logging, quota, and provisioned-throughput policy;
- should not leak Google credentials to browser or shell.

Local adapter:

- may start as ASR plus text model plus TTS;
- can later become native realtime audio if a local model supports it;
- keeps all media on-device and is the correct anonymous/guest fallback.

## Scheduling And Deadlines

Web shell and remote-provider voice need bounded soft realtime. Native local
voice can use guaranteed realtime islands once scheduling contexts exist. In
both cases:

- Capture frames older than their deadline should be dropped.
- Playback frames that miss the output deadline should be skipped or replaced
  with silence.
- Barge-in should cancel model output promptly.
- Tool calls should not block capture/playback loops.
- The terminal path must remain responsive under model or provider stalls.
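
The first two rules above can be sketched as two small functions. Frames are
modelled as dictionaries with a `deadline_ns` key and, for playback, a `pcm`
payload; those names, and the use of zero bytes as silence (which assumes
signed PCM samples), are assumptions for the example.

```python
def admit_capture(frames, now_ns):
    """Drop capture frames whose deadline has already passed."""
    fresh = [f for f in frames if f["deadline_ns"] >= now_ns]
    return fresh, len(frames) - len(fresh)


def next_playback(frame, now_ns, frame_bytes):
    """Return (pcm, substituted) for this output quantum.

    A missing or late frame is replaced with silence so the sink keeps its
    cadence instead of waiting for the producer.
    """
    if frame is None or frame["deadline_ns"] < now_ns:
        return bytes(frame_bytes), True
    return frame["pcm"], False
```

Both functions return a count or flag alongside the data so drops and
substitutions can be reported as telemetry rather than silently absorbed.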

Future scheduling contexts should represent:

```text
voice-capture budget/period
provider-adapter budget/period
agent-runner interactive priority
playback budget/period
```

SQE-level deadlines are useful metadata for stale request handling, but they
do not create CPU budget. A provider adapter may reject or drop stale media
frames using deadlines before the scheduler grows true budget enforcement.
Native media graph scheduling should eventually map graph quantum to scheduling
period and per-node CPU budget. Web shell and remote providers cannot provide a
capOS guarantee across the full path, so their jitter must be measured and
surfaced separately from the local guaranteed island latency.

The general realtime scheduling model is tracked in
[Tickless and Realtime Scheduling](tickless-realtime-scheduling-proposal.md):
`SQE.deadline_ns` is request freshness metadata for stale frame/tool handling,
while `SchedulingContext` carries CPU-time authority and `RealtimeIsland`
admits the local media graph. Voice paths must not treat deadline metadata as a
budget reservation.

## Consent And Voice Confirmation

Voice can participate in consent UX, but it is not sufficient for strong
authorization.

Rules:

- Read-only tools may run automatically if broker policy allows.
- Mutating tools need explicit consent; spoken "yes" can satisfy only low-risk
  consent when the user is already authenticated and the prompt context is
  active.
- Destructive tools require `stepUp`; WebAuthn/passkey is the likely web-shell
  path.
- Wake words, speaker identity estimates, VAD, and ASR confidence are never
  authentication factors.
- The spoken confirmation transcript and confidence are audit data.
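
These rules reduce to a small decision table. The sketch below is one possible
encoding, with assumptions called out: the `Risk` enum and its split of
mutating tools into low-risk and ordinary are invented for the example, and a
read-only tool that the broker does not allow to run automatically is assumed
to fall back to `consent`.

```python
from enum import Enum


class Risk(Enum):
    READ_ONLY = "read_only"
    MUTATING_LOW = "mutating_low"  # assumed tier for "low-risk consent"
    MUTATING = "mutating"
    DESTRUCTIVE = "destructive"


def required_decision(risk, broker_allows_auto):
    """Map a tool's risk class to the decision the broker must obtain."""
    if risk is Risk.READ_ONLY:
        return "auto" if broker_allows_auto else "consent"
    if risk is Risk.DESTRUCTIVE:
        return "stepUp"
    return "consent"


def spoken_yes_satisfies(risk, authenticated, prompt_active):
    """A spoken "yes" counts only for low-risk consent in an active prompt."""
    return risk is Risk.MUTATING_LOW and authenticated and prompt_active
```

Note that neither function takes a speaker-identity or ASR-confidence input:
those values are audit data, not authorization inputs.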

## Security Invariants

- Browser never receives capOS caps.
- Model services never receive session caps.
- Provider adapters never receive broad process-spawn or terminal authority.
- Free-form model text and speech are never parsed as commands.
- Tool calls are structured values and must match advertised descriptors.
- Provider credentials are caps or service-private secrets, never transcript
  text or terminal output.
- Browser-held provider credentials are short-lived, provider-scoped, and
  contain no capOS capability material.
- Voice transcripts are untrusted user input until the runner or gateway
  accepts them.
- Prompt-injection rules from the text agent apply unchanged to transcripts,
  web results, tool results, and model-generated speech.
- On logout, tab close, timeout, shell exit, or failed auth, the gateway closes
  terminal, voice, pending tool consent, and server-side model streams. For
  browser-held provider sessions, gateway teardown authoritatively ends capOS
  tool execution and rejects future tool requests; provider session revocation
  is authoritative only when the provider exposes a server-side close API,
  otherwise it is best-effort and must be audited as such.

## Interaction Examples

### Low-Risk Read

```text
user speaks: "what services are running?"
model emits tool_call(systemStatus.list, {})
runner policy: auto
runner executes status cap
runner sends tool result
model speaks summary and emits text transcript
```

### Mutating Action

```text
user speaks: "restart the network stack"
model emits tool_call(service.restart, {"name":"net-stack"})
runner policy: consent
gateway renders and speaks confirmation prompt
user says: "yes"
runner executes restart
runner audits transcript, consent, tool args, result
model speaks outcome
```

### Barge-In

```text
model speaking long answer
user starts speaking
VoiceSession emits bargeIn
runner cancels provider output
provider adapter truncates unplayed audio if supported
new user audio starts a new turn
```
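
The barge-in sequence can be sketched as a small state transition. The
`OutputTurn` class, the 20 ms frame duration, and the `cancel_output` and
`truncate` callback signatures are hypothetical; `truncate` is optional to
reflect that not every provider supports trimming unplayed audio.

```python
class OutputTurn:
    def __init__(self, turn_id, queued_frames):
        self.turn_id = turn_id
        self.queued = list(queued_frames)  # frames not yet played
        self.played_ms = 0
        self.cancelled = False

    def play_one(self, frame_ms=20):
        if self.queued:
            self.queued.pop(0)
            self.played_ms += frame_ms


def on_barge_in(turn, cancel_output, truncate=None):
    """Cancel the current output turn and open the next one."""
    cancel_output(turn.turn_id)
    dropped = len(turn.queued)
    turn.queued.clear()        # nothing already queued may still play
    turn.cancelled = True
    if truncate is not None:
        # Tell the provider how much audio the user actually heard.
        truncate(turn.turn_id, turn.played_ms)
    return OutputTurn(turn.turn_id + 1, []), dropped
```

Reporting `played_ms` back matters because the model's view of the
conversation should match what the user heard, not what was generated.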

## Implementation Sequence

1. Document and freeze `RealtimeModelSession` and `VoiceSession` schemas.
2. Add a fake local provider adapter using text-only model responses and
   synthetic audio events so the shell/gateway state machine can be tested
   without provider credentials.
3. Extend `WebShellGateway` protocol with a voice side channel and lifecycle
   events, still with no direct provider media.
4. Implement chained local ASR/text/TTS adapter or browser-ASR demo shim for
   the first visible voice shell proof.
5. Add provider adapter for one remote realtime API behind broker-issued model
   caps and server-side credentials.
6. Add direct browser provider media only after ephemeral-token minting,
   teardown, and audit are proven in gateway-mediated mode.
7. Add browser-agent UI mode after the `WebShellGateway` tool proxy can bind
   descriptor snapshots, enforce consent/step-up server-side, reject replay,
   and audit browser-agent-proposed tool requests.
8. Add media-ring deadlines and underrun/drop telemetry.
9. Later, bind media and provider loops to scheduling contexts once scheduler
   policy exists.

## Open Questions

- Does `VoiceSession` belong to the terminal host family or the media graph
  service family?
- Should provider adapters expose raw provider events for diagnostics behind a
  privileged debug cap?
- Should a model be allowed to continue speaking while a non-blocking tool is
  pending, or should capOS pause speech at every tool-call boundary by default?
- How should cross-provider tool-call deltas be normalized when providers emit
  partial arguments differently?
- Which mode is acceptable for the operator web shell by default:
  gateway-mediated, direct provider media, browser-agent UI, or
  broker-policy-dependent?
- Should model-output audio be stored in audit, summarized, or only referenced
  by transcript and provider event ids?
- How should media graph buffer quotas interact with session quotas and future
  resource donation?

## Relationship To Existing Proposals

- [Language Models and Agent Runtime](llm-and-agent-proposal.md): this proposal
  adds a realtime multimodal session sibling to the text `LanguageModel` and
  follows its browser-agent UI versus gateway-enforced tool execution split.
- [Multimedia Pipeline Latency](../research/multimedia-pipeline-latency.md):
  gives the local media graph its guaranteed-stable latency goal,
  realtime-island admission model, PipeWire/JACK grounding, and telemetry
  requirements.
- [Boot to Shell](boot-to-shell-proposal.md): `WebShellGateway` remains the web
  entry point and session authority boundary.
- [Interactive Command Surfaces](interactive-command-surface-proposal.md):
  voice transcripts can invoke command sessions only through typed command
  descriptors, not free-form shell text.
- [Browser/WASM](browser-wasm-proposal.md): direct browser media and
  browser-agent UI resemble the existing host-backed capability pattern, but
  real capOS tool execution must remain gateway-mediated.
- [GPU Capability](gpu-capability-proposal.md): local realtime models may later
  need GPU/NPU sessions, but the interface should not expose accelerator
  details to agent-shell.
- [Formal MAC/MIC](formal-mac-mic-proposal.md): remote realtime provider use
  must be denied when session confidentiality labels forbid off-device media.

## References

- [Realtime multimodal agent APIs research](../research/realtime-multimodal-agent-apis.md)
- [Multimedia pipeline latency research](../research/multimedia-pipeline-latency.md)
- OpenAI, [Voice agents](https://developers.openai.com/api/docs/guides/voice-agents)
- OpenAI, [Realtime conversations](https://developers.openai.com/api/docs/guides/realtime-conversations)
- Google AI for Developers, [Gemini Live API overview](https://ai.google.dev/gemini-api/docs/live-api)
- Google AI for Developers, [Tool use with Live API](https://ai.google.dev/gemini-api/docs/live-api/tools)
- Google Cloud Vertex AI, [Gemini Live API overview](https://docs.cloud.google.com/vertex-ai/generative-ai/docs/live-api)
