# Research: Realtime Multimodal Agent APIs

Survey of provider APIs for realtime native-audio, multimodal, tool-using
agents, and the consequences for capOS voice agent-shell, web shell, media
graph, scheduling, and capability boundaries.


## Scope

This report focuses on APIs where a model can consume realtime audio and emit
both audio output and structured tool calls in one session. That is distinct
from a chained pipeline where the application separately runs ASR, a text
model, and TTS.

The immediate capOS question is whether the earlier agent-shell design should
remain text-first with optional ASR/TTS wrappers, or whether it needs a
first-class realtime multimodal model session.

## Source Snapshot

All source observations below were checked against official provider
documentation on 2026-04-25.

- The companion [multimedia pipeline latency](multimedia-pipeline-latency.md)
  note covers PipeWire and JACK lessons for low-latency graph scheduling,
  latency reporting, realtime callbacks, and stable quantum selection.
- OpenAI Realtime API docs describe speech-to-speech sessions, WebRTC and
  WebSocket transports, realtime function calling, interruption/truncation, and
  the `gpt-realtime` model family.
- OpenAI Voice Agents docs explicitly frame the architecture choice as direct
  live audio sessions versus chained speech-to-text, text-agent, and
  text-to-speech pipelines.
- Google AI Gemini Live API docs describe realtime audio/image/text input,
  audio output, WebSocket transport, VAD, barge-in, tool use, and ephemeral
  tokens for client-to-server browser use.
- Vertex AI Gemini Live API docs describe the enterprise/cloud variant with
  realtime voice/video, native audio, transcriptions, function calling,
  Google Search grounding, and provisioned-throughput-oriented deployment
  considerations.

## Provider Findings

### OpenAI Realtime API

OpenAI's Realtime API is a stateful session API for low-latency interactions
with realtime models. The docs describe calling models such as
`gpt-realtime` for speech-to-speech conversations over WebRTC or WebSocket,
with the session carrying model, voice, conversation items, and generated
responses.

Important details for capOS:

- Browser clients are steered toward WebRTC for more consistent media
  performance; server-to-server integrations are steered toward WebSocket.
- WebRTC media and control are split: audio is handled by the peer connection,
  while other events travel over a data channel.
- WebSocket integrations carry JSON events and require the application to
  manage input and output audio buffers directly.
- Realtime function calling is configured at the session or response level.
  The model emits a `function_call` item with a name, JSON arguments, and a
  generated call id. The application executes the function and sends back a
  `function_call_output` conversation item keyed by that call id; a
  provider-neutral sketch of this turn follows this list.
- Realtime interruption is a first-class path. With VAD, user speech can cancel
  an ongoing model response. WebRTC/SIP paths have server-side knowledge of
  played audio; WebSocket paths require the client to stop playback and send
  truncation metadata for unplayed audio.
- `gpt-realtime-1.5` is documented as a realtime audio-in/audio-out model with
  text, audio, and image input; text and audio output; and function calling.
  The current model page marks video as unsupported.

OpenAI's Voice Agents docs expose the architectural tradeoff directly: live
speech-to-speech sessions are the natural low-latency path, while chained ASR
plus text-agent plus TTS gives stronger intermediate control and is often more
appropriate for approval-heavy workflows.

### Google AI Gemini Live API

Google AI's Gemini Live API is a realtime stateful WebSocket API. The developer
docs describe audio, image, and text input; audio output; VAD; barge-in;
transcriptions; proactive audio; affective dialog; and tool use.

Important details for capOS:

- The Google AI developer API lists input audio as raw 16-bit PCM at 16 kHz
  little-endian, image input as JPEG at up to 1 FPS, and output audio as raw
  16-bit PCM at 24 kHz little-endian; these constants are captured in the
  sketch after this list.
- The public developer API supports server-to-server and client-to-server
  approaches. Client-to-server avoids backend media proxy latency but requires
  ephemeral tokens rather than long-lived API keys in client code.
- Ephemeral tokens are short-lived credentials valid only for the Live API.
  Google documents default windows of roughly one minute in which a token can
  start a new session and thirty minutes of message sending over an
  established connection, and tokens can be locked to specific Live API
  models and session configs.
- Tool use supports function calling and Google Search. Function declarations
  are installed in session configuration, and the client must manually send
  tool responses. Google AI docs distinguish synchronous function calls from
  non-blocking function declarations on models that support them, with response
  scheduling options such as interrupting current model output, waiting until
  idle, or staying silent.
- Tool support differs by model family and revision. The Google AI docs list
  Gemini 3.1 Flash Live Preview and Gemini 2.5 Flash Live Preview with
  function calling, but not every model supports the non-blocking
  declarations and their scheduling behavior.
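
An adapter could pin these documented constraints down as explicit
configuration. In the capnp sketch below, only the sample rates, bit depth,
and the three scheduling behaviors come from the Google AI docs; every type,
field, and enumerant name is an assumption:

```capnp
# Hypothetical adapter-side constants for the documented Gemini Live media
# formats. All names are illustrative assumptions.
struct GeminiLiveAudioFormat {
    inputSampleRateHz @0 :UInt32 = 16000;   # raw 16-bit PCM, little-endian
    outputSampleRateHz @1 :UInt32 = 24000;  # raw 16-bit PCM, little-endian
    bitsPerSample @2 :UInt8 = 16;
}

# Mirrors the documented response-scheduling options for non-blocking
# function declarations.
enum ToolResponseScheduling {
    interrupt @0;  # barge into current model output
    whenIdle @1;   # wait until the model is idle
    silent @2;     # apply the result without speaking
}
```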

### Vertex AI Gemini Live API

Vertex AI's Live API docs describe the Google Cloud deployment path. The docs
currently present `gemini-live-2.5-flash-native-audio` as generally available
and recommended for low-latency voice agents, with native audio,
transcriptions, VAD, affective dialog, proactive audio, and tool use. They also
document a preview native-audio model and state a deprecation date for the
older preview native-audio release.

The Vertex AI page is relevant to capOS for enterprise deployment:

- It documents raw PCM input/output rates and a stateful WSS protocol.
- It describes realtime voice/video agents, tool use through function calling
  and Google Search, audio transcriptions, barge-in, and proactive audio.
- It points at partner WebRTC integrations, while the core Vertex API remains
  WebSocket-oriented in the referenced docs.
- It exposes cloud operational concerns not present in the simple developer API
  view: access management, request logging, provisioned throughput, PayGo
  variants, quotas, and regional/cloud deployment policy.

## Comparison

| Axis | OpenAI Realtime | Gemini Live API | Vertex AI Live API |
| --- | --- | --- | --- |
| Primary low-latency model shape | Realtime model session | Live model session | Cloud Live model session |
| Browser media path | WebRTC recommended | WebSocket with ephemeral token; partner WebRTC integrations exist | Partner WebRTC integrations; core docs emphasize WSS |
| Server path | WebSocket | WebSocket via Gen AI SDK/raw protocol | WebSocket via Gen AI SDK/raw protocol |
| Input | Text/audio/image on current realtime models | Audio/image/text | Audio/video/text |
| Output | Text/audio | Audio in Google AI overview | Audio/text in Vertex overview |
| Tool calls | Function calling, client executes and returns output | Function calling, client sends `FunctionResponse` | Function calling and Google Search grounding |
| Interruption | VAD, cancellation, output truncation | VAD/barge-in | VAD/barge-in |
| Client credential pattern | OpenAI ephemeral client secrets for browser realtime | Live-API ephemeral tokens | Cloud auth/service identity; client direct path depends on deployment |

The practical conclusion is that a capOS abstraction should not bake in a
single provider transport. OpenAI's best browser path is WebRTC; Gemini's core
developer path is WebSocket with ephemeral tokens; Vertex AI adds enterprise
auth and throughput controls. The common semantic layer is not "WebRTC" or
"WebSocket." It is a realtime model session carrying media frames, transcripts,
model audio output, structured tool calls, tool results, cancellation, and
session policy.

## Consequences For capOS

### A First-Class `RealtimeModelSession`

The existing language-model proposal is text-centric:

- `LanguageModel.complete`
- `LanguageModel.stream`
- tool calls emitted in assistant messages
- runner executes tools

That remains useful. It should not be stretched to pretend realtime audio is
just a token stream. Native realtime voice models need a sibling interface:

```capnp
interface RealtimeModel {
    info @0 () -> (info :RealtimeModelInfo);
    open @1 (config :RealtimeSessionConfig) -> (session :RealtimeModelSession);
}

interface RealtimeModelSession {
    sendInput @0 (event :RealtimeInputEvent) -> ();
    next @1 () -> (event :RealtimeOutputEvent, done :Bool);
    sendToolResult @2 (result :RealtimeToolResult) -> ();
    cancel @3 (reason :CancelReason) -> ();
    close @4 () -> ();
}
```
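
The `RealtimeInputEvent` and `RealtimeOutputEvent` payloads are deliberately
left open above. One hedged shape, assuming capOS-normalized event ids (one
answer to the open question at the end of this note), could look like the
following; every field name here is an assumption:

```capnp
# Hedged sketch of the session event payloads; all fields are assumptions
# about what a provider-neutral realtime session needs to carry.
struct RealtimeInputEvent {
    deadlineNanos @0 :UInt64;  # absolute monotonic deadline; stale frames drop
    union {
        audioFrame @1 :Data;     # raw PCM in the negotiated session format
        imageFrame @2 :Data;     # encoded still frame, e.g. JPEG
        text @3 :Text;           # typed input injected into the conversation
        activityStart @4 :Void;  # push-to-talk or VAD open
        activityEnd @5 :Void;    # push-to-talk or VAD close
    }
}

struct RealtimeOutputEvent {
    eventId @0 :Text;  # capOS-generated; provider ids live in audit metadata
    union {
        audioFrame @1 :Data;            # model speech output
        partialTranscript @2 :Text;     # may be revised until marked final
        finalTranscript @3 :Text;
        toolCall @4 :RealtimeToolCall;  # structured proposal, never authority
        interrupted @5 :Void;           # barge-in truncated this response
        turnComplete @6 :Void;
    }
}

struct RealtimeToolCall {
    callId @0 :Text;  # normalized call id; provider id retained for audit
    name @1 :Text;
    argumentsJson @2 :Text;
}
```

The deliberate asymmetry is that tool results flow back through
`sendToolResult`, never as ordinary input events, so the tool path stays
separately auditable.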

This interface lets a provider adapter hide whether it is OpenAI WebRTC,
OpenAI WebSocket, Gemini WebSocket, Vertex AI, a local model, or a future GPU
pipeline. It also keeps the existing capOS rule: the model never receives
session authority. It emits structured tool calls, and the trusted runner
executes or refuses them.

### Direct Native Audio Versus Chained Pipeline

capOS should support both.

Use a direct native-audio session when:

- the user expects conversational voice with low latency;
- barge-in and expressive speech matter;
- the provider model can safely handle tool-call turns in the same session;
- provider telemetry, cost, and policy permit streaming user audio off-box.

Use a chained pipeline when:

- the workflow is approval-heavy or destructive;
- deterministic transcript capture is mandatory before reasoning;
- ASR and TTS need to be local for privacy;
- the agent runner needs to inspect, redact, or transform text before model
  inference;
- the session is anonymous or guest and broker policy forbids remote live
  audio.

For web-shell voice, direct native audio is a better interactive experience,
but the chained path is the safer fallback and the better first local proof.
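
A session-open call could make that choice explicit rather than implicit. The
capnp sketch below is one hedged way to encode it; all names are assumptions:

```capnp
# Hypothetical per-session pipeline selection. Names are assumptions.
struct VoicePipelineConfig {
    union {
        nativeSession :group {
            providerService @0 :Text;  # e.g. an OpenAI or Gemini adapter
        }
        chained :group {
            localAsr @1 :Bool;  # transcribe on-box before model inference
            localTts @2 :Bool;  # synthesize speech on-box
        }
    }
}
```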

### Tool Calls Remain Proposals

Realtime providers can emit tool calls while producing or pausing audio. capOS
must still treat those calls exactly like text-agent tool calls:

1. The model emits a structured call name and arguments.
2. The agent runner validates the call against advertised tool descriptors.
3. Broker policy decides `auto`, `consent`, `stepUp`, or `forbidden`.
4. The runner invokes the underlying typed capability if allowed.
5. The runner sends a tool result back into the realtime session.
6. Audit records bind model id, session id, tool descriptor revision, typed
   arguments, permission decision, outcome, and any spoken/user confirmation.

The model must not hold the tool caps. The provider session must not receive
raw `TerminalSession`, `Launcher`, `ProcessSpawner`, tokens, credentials, or
session bundle authority.

### Audio Is Not Terminal Text

Voice input should not be encoded as `TerminalSession.readLine`, and output
audio should not be `TerminalSession.writeLine`. The terminal stream remains a
presentation channel. Voice is a sibling media channel bound to the same
authenticated session id.

This separation matters because realtime audio has properties terminal text
does not (see the frame-metadata sketch after this list):

- frame timestamps;
- playback positions;
- output truncation;
- VAD and barge-in events;
- partial transcripts;
- deadline and stale-frame handling;
- binary frame formats;
- provider-specific session ids and event ids.
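
A hedged sketch of the per-frame metadata implied by that list; none of these
fields exist yet, and all names are assumptions:

```capnp
# Hypothetical media-frame metadata that a terminal text channel cannot
# carry. All names are assumptions.
struct MediaFrameMeta {
    captureTimestampNanos @0 :UInt64;  # monotonic capture time
    deadlineNanos @1 :UInt64;          # play or drop by this point, never late
    sequence @2 :UInt64;               # detects gaps, reorders, and replays
    truncatedAtNanos @3 :UInt64;       # playback position where barge-in cut
    providerEventId @4 :Text;          # retained for audit correlation only
}
```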

### Media Graph Substrate

Provider-native realtime sessions do not eliminate the need for a local media
graph. The graph becomes the local routing and policy layer, with the explicit
goal of minimizing, and making guarantees about, the portion of end-to-end
latency that capOS itself controls inside admitted realtime islands:

```mermaid
flowchart LR
    Mic[BrowserMic / DeviceMic] --> Capture[capture buffer]
    Capture --> Gate[VAD or push-to-talk gate]
    Gate --> Adapter[provider adapter or local ASR]
    Adapter --> Session[RealtimeModelSession]
    Session --> Runner[tool-call gate in agent runner]
    Runner --> Output[model audio output / local TTS]
    Output --> Playback[playback buffer]
    Playback --> Speaker[BrowserSpeaker / DeviceSpeaker]
```

On native capOS, device-facing audio eventually needs `DeviceMmio`, `DMAPool`,
and `Interrupt` authority. On WebShellGateway, browser WebAudio/WebRTC handles
physical microphone/speaker I/O, while capOS still owns the session authority
and tool execution boundary. The graph should follow the multimedia latency
research rule: use admitted realtime islands, preallocated media rings,
declared async-link latency, fail-closed overrun policy, and xrun/deadline
telemetry rather than hidden buffering.
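
Under those rules, ring admission could carry its constraints explicitly
instead of relying on hidden buffering. The capnp sketch below is an
assumption-heavy illustration, not a settled design:

```capnp
# Hypothetical admission-time ring descriptor; names are assumptions.
struct MediaRingConfig {
    frameBytes @0 :UInt32;            # fixed frame size, preallocated
    frameCount @1 :UInt32;            # ring depth; never grows after admission
    declaredLatencyNanos @2 :UInt64;  # async-link latency declared up front
    overrunPolicy @3 :OverrunPolicy;
}

enum OverrunPolicy {
    failClosed @0;  # drop, raise xrun telemetry, never buffer silently
    dropOldest @1;  # keep the newest frames for live voice
}
```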

### Scheduling And Deadlines

Realtime voice is soft realtime for web-shell use:

- capture frames should be forwarded before they become stale;
- model output audio should be played or discarded, not accumulated without
  bound;
- barge-in must beat model momentum;
- tool execution must not block media handling forever.

Per-SQE or per-media-frame deadlines are useful metadata, but not authority.
CPU guarantees still belong to future scheduling contexts. The media graph and
realtime provider adapter should attach absolute monotonic deadlines to frames,
tool calls, and playback events so stale work can be dropped deterministically.
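
The `CancelReason` parameter in the earlier `RealtimeModelSession` sketch is a
natural place to encode these drop paths. A hedged shape, with members drawn
from the cases above:

```capnp
# Hypothetical cancellation reasons; enumerant names are assumptions.
enum CancelReason {
    userBargeIn @0;       # VAD or push-to-talk interrupted model output
    deadlineExceeded @1;  # a frame, turn, or tool call went stale
    policyRefusal @2;     # broker refused a pending tool call mid-turn
    sessionTeardown @3;   # shell or gateway closed the session
}
```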

### Browser/WebShellGateway Implications

Provider docs support two deployment shapes:

- Browser connects directly to provider using provider-issued ephemeral
  credentials. This minimizes media latency but exposes provider session
  traffic directly to browser JavaScript.
- Browser streams media to `WebShellGateway`, which connects to the provider
  server-side. This keeps provider credentials off the browser and lets capOS
  inspect/redact/rate-limit audio, but adds gateway latency.

For capOS, direct browser-to-provider media should be treated as an optimized
media path, not the baseline authority model. The baseline should keep
`WebShellGateway` and the agent runner in control of session lifecycle,
tool-call gating, audit, and teardown. If direct provider media is later used,
it should initially be media-only unless the provider offers a trusted
server-side control channel that lets the capOS adapter receive tool calls,
send tool results, and revoke the provider session without relying on browser
JavaScript.

The later browser-agent UI model is a separate policy choice: browser
JavaScript may receive provider tool-call events and orchestrate the provider
loop, but it still receives no capOS session caps or tool authority. Every
provider tool call must be forwarded as a structured `ToolRequest` to
`WebShellGateway`, and the gateway must validate descriptor freshness, session
state, consent/step-up, quotas, replay protection, and audit before invoking
real capOS capabilities. If those gateway controls are unavailable, provider
tool declarations must be disabled in the direct browser session and all
tool-capable turns must use gateway-mediated provider sessions. The browser
receives only short-lived, provider-scoped, model/config-locked tokens minted
by a broker-controlled service.
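
A hedged capnp sketch of that `ToolRequest` envelope, with one assumed field
per gateway check named above:

```capnp
# Hypothetical browser-to-gateway envelope; all field names are assumptions.
struct ToolRequest {
    sessionId @0 :Text;           # authenticated web-shell session state check
    descriptorRevision @1 :Text;  # must match the gateway's current descriptors
    providerCallId @2 :Text;      # echoed back so the browser can close the turn
    name @3 :Text;
    argumentsJson @4 :Text;
    nonce @5 :Data;               # replay protection
    issuedAtNanos @6 :UInt64;     # freshness window for quota and audit
}
```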

## Recommended capOS Direction

1. Keep `LanguageModel` for text and chained workflows.
2. Add `RealtimeModel` / `RealtimeModelSession` for native realtime multimodal
   sessions.
3. Model provider adapters should be ordinary services:
   - `OpenAIRealtimeProvider`
   - `GeminiLiveProvider`
   - `VertexLiveProvider`
   - `LocalRealtimeProvider`
4. A capOS-side agent runner or `WebShellGateway`'s server-side tool proxy
   remains the only holder of session caps and the only executor of real
   capOS tools.
5. WebShellGateway owns browser transport, media channels, and browser-agent
   tool proxy enforcement, but browser JavaScript owns no tool authority.
6. Media graph primitives should use `MemoryObject`, notifications, futexes,
   and scheduling contexts as they land.
7. Direct browser-to-provider connections require broker-minted ephemeral
   credentials and explicit audit of what bypasses gateway media inspection;
   one possible credential shape is sketched after this list.
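
For point 7, one hedged shape for the broker-minted credential; the constraint
fields mirror the documented ephemeral-token behavior, but every name is an
assumption:

```capnp
# Hypothetical broker-minted client credential; names are assumptions.
struct EphemeralProviderCredential {
    providerService @0 :Text;  # which adapter family the token targets
    token @1 :Text;            # short-lived secret handed to the browser
    modelLock @2 :Text;        # session must use exactly this model
    configLock @3 :Data;       # serialized session config the token permits
    newSessionDeadlineNanos @4 :UInt64;  # e.g. ~1 minute to start a session
    useDeadlineNanos @5 :UInt64;         # e.g. ~30 minutes of use thereafter
}
```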

## Open Design Questions

- Should `RealtimeModelSession` expose provider event ids verbatim, or should
  it normalize them to capOS-generated ids and retain provider ids only in
  audit metadata?
- Should direct provider WebRTC be allowed for operator sessions, or should
  all production web-shell voice flow through WebShellGateway?
- How much partial transcript text is trusted enough to render before the
  provider marks it final?
- Can a provider-generated audio response be spoken before pending `consent`
  or `stepUp` decisions are resolved, or must speech pause at tool-call gates?
- How should local wake-word/VAD models be sandboxed so they can improve UX
  without becoming an authorization factor?
- Should media-frame deadlines be added to the existing SQE reserved field, or
  kept in media-ring metadata until the scheduler has scheduling contexts?

## References

- OpenAI, [Realtime conversations](https://developers.openai.com/api/docs/guides/realtime-conversations)
- OpenAI, [Realtime API with WebRTC](https://developers.openai.com/api/docs/guides/realtime-webrtc)
- OpenAI, [Realtime API with WebSocket](https://developers.openai.com/api/docs/guides/realtime-websocket)
- OpenAI, [Voice agents](https://developers.openai.com/api/docs/guides/voice-agents)
- OpenAI, [gpt-realtime-1.5 model page](https://developers.openai.com/api/docs/models/gpt-realtime-1.5)
- Google AI for Developers, [Gemini Live API overview](https://ai.google.dev/gemini-api/docs/live-api)
- Google AI for Developers, [Tool use with Live API](https://ai.google.dev/gemini-api/docs/live-api/tools)
- Google AI for Developers, [Ephemeral tokens](https://ai.google.dev/gemini-api/docs/live-api/ephemeral-tokens)
- Google Cloud Vertex AI, [Gemini Live API overview](https://docs.cloud.google.com/vertex-ai/generative-ai/docs/live-api)
