Proposal: Realtime Voice Agent Shell
How capOS should support web-shell and native-shell voice interaction when modern multimodal models can consume realtime audio and emit both audio streams and structured tool calls.
Problem
The existing language-model proposal defines a text-oriented agent runner: messages, streamed text, structured tool calls, and per-tool permission policy. That model still works, but it is incomplete for modern voice agents. Current provider APIs can run stateful realtime sessions where the model directly listens to audio, speaks audio, performs VAD/barge-in handling, and emits function calls in the same interaction.
If capOS models voice as only “ASR into text shell, then TTS the answer,” it will miss the better latency and interaction model of native realtime audio. If capOS lets provider-native sessions execute tools directly, it breaks the capability model. The design needs a middle path.
Goals
- Support native realtime audio model sessions alongside chained ASR/text/TTS pipelines.
- Preserve the existing agent-shell security rule: the model never holds session caps or tool caps.
- Let WebShellGateway host terminal and voice transport without becoming an authority sink.
- Keep microphone/speaker media out of TerminalSession text APIs.
- Minimize and guarantee media stack latency for admitted capOS-controlled realtime islands, preferring enforceable bounds over optimistic nominal latency.
- Support provider adapters for OpenAI Realtime, Gemini Live API, Vertex AI Live API, local ASR/TTS, and future local realtime multimodal models.
- Carry timestamps, deadlines, transcripts, interruptions, and tool-call ids as first-class session data.
- Make direct browser-to-provider media an optional optimization guarded by broker-minted ephemeral credentials.
- Allow a browser agent to be the web-shell UI and orchestrate the realtime provider loop, while keeping capOS tool execution gateway-enforced.
Non-Goals
- Implementing provider SDKs in the kernel.
- Giving a browser any capOS capability handle.
- Treating voice recognition, wake words, or VAD as authorization.
- Making a realtime model’s free-form speech or text executable.
- Guaranteeing full-path realtime behavior for browser, network, or remote provider segments. Native local media can enter guaranteed realtime islands only after scheduling contexts and device isolation mature.
Architecture
flowchart LR
Browser[Browser UI] -->|terminal frames| Gateway[WebShellGateway]
Browser -->|mic/playback frames| Gateway
Gateway --> Terminal[TerminalSession]
Gateway --> Voice[VoiceSession]
Shell[capos-shell agent mode] --> Terminal
Shell --> Voice
Shell --> Runner[Agent Runner]
Runner --> RT[RealtimeModelSession]
Runner --> Broker[AuthorityBroker]
Runner --> Audit[AuditLog]
Runner --> Tools[Session tool caps]
RT --> Provider[Realtime provider adapter]
Provider --> Remote[OpenAI / Gemini / Vertex]
Provider --> Local[Local model backend]
Principal split:
- WebShellGateway authenticates browser sessions, owns browser transport, creates terminal and voice session objects, and tears down resources.
- capos-shell in agent mode owns the session bundle and acts as the trusted runner for capOS-side agent sessions.
- A browser agent UI may own the web conversation and provider session loop, but only as an untrusted client of WebShellGateway’s tool proxy.
- RealtimeModelSession is a model I/O object. It carries audio, text, transcripts, tool calls, and tool results. It has no authority over capOS tools.
- Provider adapters hold narrow provider credentials or model-runtime caps.
- The browser holds no capOS session caps, no tool caps, no provider long-lived API keys, and no bearer tokens other than short-lived provider-scoped tokens when a direct-media optimization is explicitly enabled.
Interfaces
The exact schema belongs to the implementation milestone. The shape should be:
interface RealtimeModel {
info @0 () -> (info :RealtimeModelInfo);
open @1 (config :RealtimeSessionConfig)
-> (session :RealtimeModelSession);
}
interface RealtimeModelSession {
send @0 (event :RealtimeInputEvent) -> ();
next @1 () -> (event :RealtimeOutputEvent, done :Bool);
sendToolResult @2 (result :RealtimeToolResult) -> ();
cancel @3 (reason :CancelReason) -> ();
close @4 () -> ();
}
RealtimeInputEvent should cover:
- audio frame reference;
- text input;
- image/video frame reference;
- push-to-talk start/end;
- playback-position feedback;
- tool result;
- cancel, truncate, close.
RealtimeOutputEvent should cover:
- audio frame reference;
- text delta;
- partial and final transcript;
- tool call delta and complete tool call;
- interruption/barge-in;
- session warning/error;
- provider usage/cost metadata;
- close/go-away/reconnect notice.
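The two event families can be sketched as tagged unions. This is an illustrative shape only, not the frozen schema; every type and field name below is an assumption made for the sketch.

```python
from dataclasses import dataclass
from typing import Union

# Illustrative event shapes; the frozen Cap'n Proto schema is a later milestone.

@dataclass
class AudioFrameRef:
    ring_id: int          # MemoryObject-backed media ring identifier
    offset: int           # byte offset of the frame within the ring
    length: int           # frame payload length in bytes
    capture_time_ns: int  # timestamp carried as first-class session data

@dataclass
class TextInput:
    text: str

@dataclass
class Transcript:
    text: str
    final: bool           # partial vs final transcript

@dataclass
class ToolCall:
    provider_call_id: str  # correlation metadata only, never authority
    name: str
    args_json: str

@dataclass
class BargeIn:
    playback_position_ns: int  # where model output was interrupted

# Input events also cover push-to-talk, playback feedback, tool results,
# and cancel/truncate/close; output events also cover warnings, usage
# metadata, and reconnect notices. Elided here for brevity.
RealtimeInputEvent = Union[AudioFrameRef, TextInput]
RealtimeOutputEvent = Union[AudioFrameRef, Transcript, ToolCall, BargeIn]
```

The union keeps audio as a frame *reference* rather than inline bytes, matching the rule below that media payloads stay out of the control plane.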
Audio frames should not be copied through Cap’n Proto payloads in the hot path.
Use MemoryObject-backed media rings or provider-owned stream handles. Cap’n
Proto remains the control plane.
Tool Calls
Realtime tool calls use the same policy as text agent calls.
sequenceDiagram
participant Model as RealtimeModelSession
participant Runner as Agent Runner
participant Broker as AuthorityBroker
participant Tool as Typed Tool Cap
participant Audit as AuditLog
Model->>Runner: tool_call(name, args, provider_call_id)
Runner->>Runner: validate ToolDescriptor
Runner->>Broker: authorize tool call
Broker-->>Runner: auto / consent / stepUp / forbidden
Runner->>Tool: invoke if allowed
Tool-->>Runner: typed result
Runner->>Audit: record decision and outcome
Runner->>Model: tool result
The runner owns the mapping from provider call ids to capOS audit/tool-call
ids in capOS-side mode. In browser-agent UI mode, WebShellGateway’s tool
proxy owns that mapping. Provider ids are useful correlation metadata, but
they are not authority.
Tool execution must be time-boxed. If a tool blocks too long, the runner or gateway tool proxy sends a typed timeout result back to the realtime model and continues or ends the turn according to policy.
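A minimal sketch of the time-boxing rule above, assuming the runner invokes tools on a worker thread and that `RealtimeToolResult` carries a typed error field (both assumptions of this sketch, not the real schema):

```python
import concurrent.futures
import time  # used by the timeout demonstration in the test
from dataclasses import dataclass
from typing import Any, Callable

@dataclass
class RealtimeToolResult:
    provider_call_id: str
    ok: bool
    value: Any = None
    error: str = ""

def run_tool_time_boxed(call_id: str, tool: Callable[[], Any],
                        timeout_s: float) -> RealtimeToolResult:
    """Invoke a tool under a deadline. On expiry, return a typed timeout
    result so the realtime turn can continue or end per policy instead of
    stalling the voice loop on a stuck tool."""
    pool = concurrent.futures.ThreadPoolExecutor(max_workers=1)
    fut = pool.submit(tool)
    try:
        return RealtimeToolResult(call_id, True, fut.result(timeout=timeout_s))
    except concurrent.futures.TimeoutError:
        return RealtimeToolResult(call_id, False, error="timeout")
    finally:
        # Do not join the worker here; a blocked tool must not block the turn.
        pool.shutdown(wait=False)
</n```

Note the `shutdown(wait=False)`: the design choice is that the turn proceeds immediately with the timeout result, while the stuck tool is abandoned to finish or fail on its own.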
Voice Session
VoiceSession is the shell-facing media session object created by
WebShellGateway or a native terminal host.
interface VoiceSession {
describe @0 () -> (info :VoiceSessionInfo);
openCapture @1 (format :AudioFormat) -> (stream :AudioInputStream);
openPlayback @2 (format :AudioFormat) -> (stream :AudioOutputStream);
event @3 () -> (event :VoiceSessionEvent);
close @4 () -> ();
}
For web shell, VoiceSession is backed by browser media APIs. For native capOS
it can be backed by an audio device service. Either way, it is separate from
TerminalSession:
- terminal input/output remains text and presentation;
- voice capture/playback is timestamped binary media;
- transcripts can be rendered into the terminal, but they are not terminal input until the runner accepts them as a user turn.
Media Graph
The local media graph is a userspace service/library layer, not a kernel feature. Its latency goal is the lowest guaranteed-stable operating point for the selected device, graph, and policy: a fixed quantum with admitted CPU, memory, device, and wakeup budgets, not the smallest buffer value that can be configured.
flowchart LR
Capture[Capture source] --> Convert[format converter / resampler]
Convert --> Gate[VAD or push-to-talk gate]
Gate --> Input[realtime provider adapter or local ASR]
Input --> Runner[agent runner]
Runner --> Output[realtime provider adapter or local TTS]
Output --> Playback[playback sink]
For browser voice, the graph may partly live in browser JavaScript and partly
in capOS services. For native hardware, the graph eventually uses audio driver
services that hold DeviceMmio, DMAPool, and Interrupt capabilities.
Graph control operations are ordinary endpoint calls:
- create node;
- connect port;
- set format;
- allocate buffer pool;
- start/stop stream;
- set deadline and latency policy.
Graph data uses MemoryObject pools and notification/futex wakeups. Audio
frames carry:
sequence
capture_time_ns
playback_time_ns
deadline_ns
format
offset
length
flags
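The frame fields above could pack into a fixed-size header at the front of each ring slot. The layout below is a hypothetical sketch (field order, widths, and little-endian packing are all assumptions; the real layout belongs to the implementation milestone):

```python
import struct

# Hypothetical fixed header: sequence, capture/playback/deadline timestamps,
# then format id, ring offset, payload length, flags.
FRAME_HDR = struct.Struct("<QqqqIIII")

def pack_frame_hdr(seq, capture_ns, playback_ns, deadline_ns,
                   fmt, offset, length, flags=0):
    return FRAME_HDR.pack(seq, capture_ns, playback_ns, deadline_ns,
                          fmt, offset, length, flags)

def is_stale(hdr: bytes, now_ns: int) -> bool:
    """Freshness check on the header alone: a frame whose deadline has
    passed should be dropped (capture) or silenced (playback) without
    touching the payload."""
    _, _, _, deadline_ns, _, _, _, _ = FRAME_HDR.unpack(hdr)
    return now_ns > deadline_ns
```

A fixed header lets the hot path test staleness with one unpack and no allocation, consistent with the data-plane rules below.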
The realtime data path should not perform allocation, blocking IPC, logging, permission checks, provider credential work, or graph mutation. Those remain control-plane operations. Any bridge that crosses process, clock, network, provider, or browser boundaries must declare its extra latency so the graph can report the full stack rather than burying delay in queues. A non-guaranteed bridge must not backpressure a guaranteed island; it must drop, silence, bypass, stop, or renegotiate.
WebShellGateway Modes
Gateway-Mediated Provider Session
flowchart LR
Browser[Browser] <--> Gateway[WebShellGateway]
Gateway <--> Adapter[ProviderAdapter]
Adapter <--> Provider[Provider API]
Properties:
- provider long-lived credentials remain server-side;
- tool-call events remain server-side unless explicitly proxied to a browser agent UI under broker policy;
- gateway can record/drop/rate-limit media;
- easier audit and teardown;
- higher latency because audio crosses the gateway.
This is the baseline mode.
Direct Browser Provider Media
flowchart LR
Browser[Browser] <--> Provider[Provider API]
Browser <--> Gateway[WebShellGateway control/audit path]
Properties:
- lower media latency;
- browser receives provider-specific ephemeral credential;
- gateway may not see every media frame or provider control event;
- allowed only when broker policy says direct media is acceptable;
- provider tool declarations are disabled unless either a trusted server-side
control channel handles tool calls and results, or the session is explicitly
in browser-agent UI mode and every tool call is routed through
WebShellGateway’s server-side tool proxy.
Direct mode requires:
- provider token scoped to model/config/session;
- short expiration;
- no capOS capability material in the token;
- provider tools disabled, provider-supported server-side receipt of tool calls plus server-side submission of tool results, or browser-agent UI mode where JavaScript receives provider tool calls but can only send structured ToolRequest values to WebShellGateway;
- trusted revocation or session close path; if the provider exposes only a browser-held connection, the kill switch is best-effort and must not be described as authoritative;
- audit that records direct-media mode, token issuance metadata, disabled tool status, and any uninspected media/control scope;
- fallback to gateway-mediated mode.
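A broker-side sketch of the token requirements above, assuming an HMAC-signed claim set (real deployments would prefer the provider's own ephemeral-key API where one exists; all names here are illustrative). The point is what the token contains: provider scope and a short expiry, and never capOS capability material.

```python
import base64
import hashlib
import hmac
import json
import time
from typing import Optional

def mint_ephemeral_token(secret: bytes, provider: str, model: str,
                         session_id: str, ttl_s: int = 60) -> str:
    """Mint a short-lived, provider-scoped token for direct browser media.
    Scope and expiry only -- no capOS capability material."""
    claims = {"provider": provider, "model": model,
              "session": session_id, "exp": int(time.time()) + ttl_s}
    body = base64.urlsafe_b64encode(json.dumps(claims).encode()).decode()
    mac = hmac.new(secret, body.encode(), hashlib.sha256).hexdigest()
    return body + "." + mac

def check_token(secret: bytes, token: str) -> Optional[dict]:
    """Verify MAC and expiry; return claims or None."""
    body, _, mac = token.rpartition(".")
    expected = hmac.new(secret, body.encode(), hashlib.sha256).hexdigest()
    if not hmac.compare_digest(expected, mac):
        return None
    claims = json.loads(base64.urlsafe_b64decode(body))
    return claims if claims["exp"] > time.time() else None
```

Token issuance metadata (session id, expiry, scope) is exactly what the audit requirement above asks the broker to record.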
Browser Agent UI Direct Provider Session
This mode is distinct from merely moving media off the gateway. The browser agent is the UI: it owns the visible conversation, calls the realtime provider with an ephemeral credential, receives provider tool-call events, and feeds tool results back to the provider. It still does not receive capOS caps.
flowchart LR
BrowserAgent[Browser Agent UI] <--> Provider[Provider API]
BrowserAgent -->|ToolRequest| Gateway[WebShellGateway ToolProxy]
Gateway --> Broker[AuthorityBroker]
Gateway --> Tools[Session tool caps]
Gateway --> Audit[AuditLog]
Gateway -. "ToolResult" .-> BrowserAgent
Rules:
- the browser credential is scoped to provider, model/config, session, conversation, media mode, and short expiration;
- the gateway publishes a signed or MACed tool descriptor snapshot for the current turn;
- browser tool requests must carry the descriptor snapshot id, provider call id, conversation id, turn id, and typed arguments;
- gateway rejects stale snapshots, replay, unknown tools, schema mismatches, missing consent, missing step-up, and requests after session teardown;
- gateway performs all real capOS capability invocations server-side and records that the request was browser-agent-proposed;
- broker policy may deny browser-agent UI mode when prompt, transcript, media, or tool-result confidentiality requires capOS-side provider mediation.
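The gateway-side rejection rules can be sketched as a small state machine over browser ToolRequests. This is a simplified illustration (real checks also cover typed-argument schemas, consent, step-up, and MAC verification of the snapshot; all names are assumptions):

```python
class ToolProxy:
    """WebShellGateway-side validation sketch for browser-agent
    ToolRequest values. Only requests that pass every check ever reach
    a real capOS capability invocation, which stays server-side."""

    def __init__(self, snapshot_id, tools):
        self.snapshot_id = snapshot_id  # current signed descriptor snapshot
        self.tools = tools              # advertised tool names for this turn
        self.seen_call_ids = set()      # replay rejection
        self.closed = False             # set on session teardown

    def validate(self, req: dict) -> str:
        if self.closed:
            return "reject: session torn down"
        if req.get("snapshot_id") != self.snapshot_id:
            return "reject: stale descriptor snapshot"
        if req.get("provider_call_id") in self.seen_call_ids:
            return "reject: replay"
        if req.get("tool") not in self.tools:
            return "reject: unknown tool"
        self.seen_call_ids.add(req["provider_call_id"])
        return "ok"  # proceed to schema, consent, and step-up checks
```

Binding every request to a snapshot id means a descriptor rotation between turns invalidates in-flight browser requests automatically.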
This is lower latency and can use provider-native browser APIs, but it gives up gateway inspection of some media/control frames. Audit must record that fact instead of implying full gateway mediation.
Realtime Provider Adapter
A provider adapter is a normal service process. It should expose
RealtimeModel, not provider-specific credentials.
OpenAI adapter:
- uses WebRTC for browser direct mode or WebSocket for server-side mode;
- maps provider function-call events either to server-side capOS RealtimeToolCall values or to browser-agent ToolRequest forwarding;
- maps function_call_output to RealtimeToolResult;
- handles response cancellation and output-audio truncation.
Gemini developer adapter:
- uses Live API WebSocket;
- supports ephemeral-token direct mode when broker policy allows;
- maps FunctionResponse to RealtimeToolResult;
- models synchronous and non-blocking function-call behavior explicitly.
Vertex adapter:
- uses cloud auth and Vertex AI Live API;
- exposes deployment metadata such as project/location/model id;
- respects enterprise logging, quota, and provisioned-throughput policy;
- should not leak Google credentials to browser or shell.
Local adapter:
- may start as ASR plus text model plus TTS;
- can later become native realtime audio if a local model supports it;
- keeps all media on-device and is the correct anonymous/guest fallback.
Scheduling And Deadlines
Web shell and remote-provider voice need bounded soft realtime. Native local voice can use guaranteed realtime islands once scheduling contexts exist:
- Capture frames older than their deadline should be dropped.
- Playback frames that miss the output deadline should be skipped or replaced with silence.
- Barge-in should cancel model output promptly.
- Tool calls should not block capture/playback loops.
- The terminal path must remain responsive under model or provider stalls.
Future scheduling contexts should represent:
voice-capture budget/period
provider-adapter budget/period
agent-runner interactive priority
playback budget/period
SQE-level deadlines are useful metadata for stale request handling, but they do not create CPU budget. A provider adapter may reject or drop stale media frames using deadlines before the scheduler grows true budget enforcement. Native media graph scheduling should eventually map graph quantum to scheduling period and per-node CPU budget. Web shell and remote providers cannot provide a capOS guarantee across the full path, so their jitter must be measured and surfaced separately from the local guaranteed island latency.
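The deadline-as-freshness rule can be shown with a small pacing sketch (illustrative names; `frames` here is a list of `(deadline_ns, samples)` pairs rather than real ring slots):

```python
def pace_playback(frames, now_ns, silence):
    """Deadlines used as freshness metadata, not CPU budget: a playback
    frame that has already missed its deadline is replaced with silence
    rather than played late. Capture-side frames past deadline would be
    dropped entirely instead."""
    return [samples if deadline_ns >= now_ns else silence
            for deadline_ns, samples in frames]
```

No scheduling guarantee is implied: this policy only bounds how stale audio can get, which is all an SQE-level deadline can promise before true budget enforcement exists.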
The general realtime scheduling model is tracked in
Tickless and Realtime Scheduling:
SQE.deadline_ns is request freshness metadata for stale frame/tool handling,
while SchedulingContext carries CPU-time authority and RealtimeIsland
admits the local media graph. Voice paths must not treat deadline metadata as a
budget reservation.
Consent And Voice Confirmation
Voice can participate in consent UX, but it is not sufficient for strong authorization.
Rules:
- Read-only tools may run automatically if broker policy allows.
- Mutating tools need explicit consent; spoken “yes” can satisfy only low-risk consent when the user is already authenticated and the prompt context is active.
- Destructive tools require stepUp; WebAuthn/passkey is the likely web-shell path.
- Wake words, speaker identity estimates, VAD, and ASR confidence are never authentication factors.
- The spoken confirmation transcript and confidence are audit data.
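The rules above reduce to a small policy table. A sketch, assuming three hypothetical risk tiers (broker policy is the real source of truth, and the tier names are this sketch's invention):

```python
def consent_decision(risk: str, spoken_yes: bool, authenticated: bool) -> str:
    """Map tool risk plus voice context to a broker-style outcome.
    Voice participates in consent UX but never substitutes for stepUp."""
    if risk == "read":
        return "auto"            # if broker policy allows
    if risk == "mutate":
        # spoken "yes" satisfies only low-risk consent, and only for an
        # already-authenticated user with an active prompt context
        return "allow" if (spoken_yes and authenticated) else "needConsent"
    if risk == "destructive":
        return "stepUp"          # e.g. WebAuthn/passkey; voice never suffices
    return "forbidden"
```

The asymmetry is deliberate: a spoken "yes" can only ever lower friction within an already-authenticated session, never raise authority.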
Security Invariants
- Browser never receives capOS caps.
- Model services never receive session caps.
- Provider adapters never receive broad process-spawn or terminal authority.
- Free-form model text and speech are never parsed as commands.
- Tool calls are structured values and must match advertised descriptors.
- Provider credentials are caps or service-private secrets, never transcript text or terminal output.
- Browser-held provider credentials are short-lived, provider-scoped, and contain no capOS capability material.
- Voice transcripts are untrusted user input until the runner or gateway accepts them.
- Prompt-injection rules from the text agent apply unchanged to transcripts, web results, tool results, and model-generated speech.
- On logout, tab close, timeout, shell exit, or failed auth, the gateway closes terminal, voice, pending tool consent, and server-side model streams. For browser-held provider sessions, gateway teardown authoritatively ends capOS tool execution and rejects future tool requests; provider session revocation is authoritative only when the provider exposes a server-side close API, otherwise it is best-effort and must be audited as such.
Interaction Examples
Low-Risk Read
user speaks: "what services are running?"
model emits tool_call(systemStatus.list, {})
runner policy: auto
runner executes status cap
runner sends tool result
model speaks summary and emits text transcript
Mutating Action
user speaks: "restart the network stack"
model emits tool_call(service.restart, {"name":"net-stack"})
runner policy: consent
gateway renders and speaks confirmation prompt
user says: "yes"
runner executes restart
runner audits transcript, consent, tool args, result
model speaks outcome
Barge-In
model speaking long answer
user starts speaking
VoiceSession emits bargeIn
runner cancels provider output
provider adapter truncates unplayed audio if supported
new user audio starts a new turn
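The barge-in trace above could map to a runner handler roughly like this (a sketch with invented action names and session-state keys; the real path is provider-adapter calls, not strings):

```python
def handle_barge_in(session_state: dict) -> list:
    """Runner-side barge-in path: cancel provider output promptly,
    truncate unplayed audio where the adapter supports it, then open
    a new user turn."""
    actions = ["cancelProviderResponse"]
    if session_state.get("adapter_supports_truncate"):
        # tell the provider how much audio was actually heard, so its
        # conversation state matches what the user experienced
        actions.append(f"truncateAt:{session_state['playback_position_ns']}")
    actions.append("beginUserTurn")
    return actions
```

Truncation matters because without it the provider believes the user heard the whole answer, and follow-up turns drift from the real conversation.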
Implementation Sequence
- Document and freeze RealtimeModelSession and VoiceSession schemas.
- Add a fake local provider adapter using text-only model responses and synthetic audio events so the shell/gateway state machine can be tested without provider credentials.
- Extend the WebShellGateway protocol with a voice side channel and lifecycle events, still with no direct provider media.
- Implement a chained local ASR/text/TTS adapter or browser-ASR demo shim for the first visible voice shell proof.
- Add provider adapter for one remote realtime API behind broker-issued model caps and server-side credentials.
- Add direct browser provider media only after ephemeral-token minting, teardown, and audit are proven in gateway-mediated mode.
- Add browser-agent UI mode after the WebShellGateway tool proxy can bind descriptor snapshots, enforce consent/step-up server-side, reject replay, and audit browser-agent-proposed tool requests.
- Add media-ring deadlines and underrun/drop telemetry.
- Later, bind media and provider loops to scheduling contexts once scheduler policy exists.
Open Questions
- Does VoiceSession belong to the terminal host family or the media graph service family?
- Should provider adapters expose raw provider events for diagnostics behind a privileged debug cap?
- Should a model be allowed to continue speaking while a non-blocking tool is pending, or should capOS pause speech at every tool-call boundary by default?
- How should cross-provider tool-call deltas be normalized when providers emit partial arguments differently?
- Which mode is acceptable for operator web shell by default: gateway-mediated, direct provider media, browser-agent UI, or broker policy dependent?
- Should model-output audio be stored in audit, summarized, or only referenced by transcript and provider event ids?
- How should media graph buffer quotas interact with session quotas and future resource donation?
Relationship To Existing Proposals
- Language Models and Agent Runtime: this proposal adds a realtime multimodal session sibling to the text LanguageModel and follows its browser-agent UI versus gateway-enforced tool execution split.
- Multimedia Pipeline Latency: gives the local media graph its guaranteed-stable latency goal, realtime-island admission model, PipeWire/JACK grounding, and telemetry requirements.
- Boot to Shell: WebShellGateway remains the web entry point and session authority boundary.
- Interactive Command Surfaces: voice transcripts can invoke command sessions only through typed command descriptors, not free-form shell text.
- Browser/WASM: direct browser media and browser-agent UI resemble the existing host-backed capability pattern, but real capOS tool execution must remain gateway-mediated.
- GPU Capability: local realtime models may later need GPU/NPU sessions, but the interface should not expose accelerator details to agent-shell.
- Formal MAC/MIC: remote realtime provider use must be denied when session confidentiality labels forbid off-device media.
References
- Realtime multimodal agent APIs research
- Multimedia pipeline latency research
- OpenAI, Voice agents
- OpenAI, Realtime conversations
- Google AI for Developers, Gemini Live API overview
- Google AI for Developers, Tool use with Live API
- Google Cloud Vertex AI, Gemini Live API overview