Research: Realtime Multimodal Agent APIs
Survey of provider APIs for realtime native-audio, multimodal, tool-using agents, and the consequences for capOS voice agent-shell, web shell, media graph, scheduling, and capability boundaries.
Scope
This report focuses on APIs where a model can consume realtime audio and emit both audio output and structured tool calls in one session. That is distinct from a chained pipeline where the application separately runs ASR, a text model, and TTS.
The immediate capOS question is whether the earlier agent-shell design should remain text-first with optional ASR/TTS wrappers, or whether it needs a first-class realtime multimodal model session.
Source Snapshot
All source observations below were checked against official provider documentation on 2026-04-25.
- The companion multimedia pipeline latency note covers PipeWire and JACK lessons for low-latency graph scheduling, latency reporting, realtime callbacks, and stable quantum selection.
- OpenAI Realtime API docs describe speech-to-speech sessions, WebRTC and WebSocket transports, realtime function calling, interruption/truncation, and the gpt-realtime model family.
- OpenAI Voice Agents docs explicitly frame the architecture choice as direct live audio sessions versus chained speech-to-text, text-agent, and text-to-speech pipelines.
- Google AI Gemini Live API docs describe realtime audio/image/text input, audio output, WebSocket transport, VAD, barge-in, tool use, and ephemeral tokens for client-to-server browser use.
- Vertex AI Gemini Live API docs describe the enterprise/cloud variant with realtime voice/video, native audio, transcriptions, function calling, Google Search grounding, and provisioned-throughput-oriented deployment considerations.
Provider Findings
OpenAI Realtime API
OpenAI’s Realtime API is a stateful session API for low-latency interactions
with realtime models. The docs describe calling models such as
gpt-realtime for speech-to-speech conversations over WebRTC or WebSocket,
with the session carrying model, voice, conversation items, and generated
responses.
Important details for capOS:
- Browser clients are steered toward WebRTC for more consistent media performance; server-to-server integrations are steered toward WebSocket.
- WebRTC media and control are split: audio is handled by the peer connection, while other events travel over a data channel.
- WebSocket integrations carry JSON events and require the application to manage input and output audio buffers directly.
- Realtime function calling is configured per session/response. The model emits a function_call item with a name, JSON arguments, and a generated call id. The application executes the function and sends back a function_call_output conversation item keyed by that call id (a minimal sketch follows this list).
- Realtime interruption is a first-class path. With VAD, user speech can cancel an ongoing model response. WebRTC/SIP paths have server-side knowledge of played audio; WebSocket paths require the client to stop playback and send truncation metadata for unplayed audio.
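A minimal sketch of that round trip over the WebSocket transport, in TypeScript: the client listens for a completed function-call event, executes the call in a trusted runner, and returns a function_call_output item keyed by the call id. Endpoint, event, and item names follow the Realtime docs cited below, but the payload handling is simplified and runToolInTrustedRunner is a stand-in, not a real capOS API:

// Sketch only: one function-call round trip over the OpenAI Realtime WebSocket
// transport. Error handling, audio buffers, and session.update configuration
// are omitted.
import WebSocket from "ws";

const ws = new WebSocket(
  "wss://api.openai.com/v1/realtime?model=gpt-realtime",
  { headers: { Authorization: `Bearer ${process.env.OPENAI_API_KEY}` } },
);

// Stand-in for the trusted capOS runner: in this report's model it validates
// the call against advertised tool descriptors and broker policy first.
async function runToolInTrustedRunner(name: string, args: unknown): Promise<unknown> {
  return { ok: true, tool: name, args }; // placeholder result
}

ws.on("message", async (raw) => {
  const event = JSON.parse(raw.toString());

  // The model proposes a call; it never executes anything itself.
  if (event.type === "response.function_call_arguments.done") {
    const result = await runToolInTrustedRunner(event.name, JSON.parse(event.arguments));

    // Return the result as a function_call_output item keyed by call_id,
    // then request a new response so the model can continue speaking.
    ws.send(JSON.stringify({
      type: "conversation.item.create",
      item: {
        type: "function_call_output",
        call_id: event.call_id,
        output: JSON.stringify(result),
      },
    }));
    ws.send(JSON.stringify({ type: "response.create" }));
  }
});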
gpt-realtime-1.5 is documented as a realtime audio-in/audio-out model with text, audio, and image input; text and audio output; and function calling. The current model page marks video as unsupported.
OpenAI’s Voice Agents docs expose the architectural tradeoff directly: live speech-to-speech sessions are the natural low-latency path, while chained ASR plus text-agent plus TTS gives stronger intermediate control and is often more appropriate for approval-heavy workflows.
Google AI Gemini Live API
Google AI’s Gemini Live API is a realtime stateful WebSocket API. The developer docs describe audio, image, and text input; audio output; VAD; barge-in; transcriptions; proactive audio; affective dialog; and tool use.
Important details for capOS:
- The Google AI developer API lists input audio as raw 16-bit PCM at 16 kHz little-endian, image input as JPEG at up to 1 FPS, and output audio as raw 16-bit PCM at 24 kHz little-endian.
- The public developer API supports server-to-server and client-to-server approaches. Client-to-server avoids backend media proxy latency but requires ephemeral tokens rather than long-lived API keys in client code.
- Ephemeral tokens are Live-API-only, short-lived credentials. Google documents default timing behavior of roughly one minute to start a new session and thirty minutes for sending messages over a connection, with the ability to restrict tokens to Live API model/config constraints.
- Tool use supports function calling and Google Search. Function declarations are installed in session configuration, and the client must manually send tool responses. Google AI docs distinguish synchronous function calls from non-blocking function declarations on models that support them, with response scheduling options such as interrupting current model output, waiting until idle, or staying silent.
- Tool support differs by model family and revision. The Google AI docs list Gemini 3.1 Flash Live Preview and Gemini 2.5 Flash Live Preview with function calling, but not all asynchronous behavior is supported by every model.
Vertex AI Gemini Live API
Vertex AI’s Live API docs describe the Google Cloud deployment path. The docs
currently present gemini-live-2.5-flash-native-audio as generally available
and recommended for low-latency voice agents, with native audio,
transcriptions, VAD, affective dialog, proactive audio, and tool use. They also
document a preview native-audio model and state a deprecation date for the
older preview native-audio release.
The Vertex AI page is relevant to capOS for enterprise deployment:
- It documents raw PCM input/output rates and a stateful WSS protocol.
- It describes realtime voice/video agents, tool use through function calling and Google Search, audio transcriptions, barge-in, and proactive audio.
- It points at partner WebRTC integrations, while the core Vertex API remains WebSocket-oriented in the referenced docs.
- It exposes cloud operational concerns not present in the simple developer API view: access management, request logging, provisioned throughput, PayGo variants, quotas, and regional/cloud deployment policy.
Comparison
| Axis | OpenAI Realtime | Gemini Live API | Vertex AI Live API |
|---|---|---|---|
| Primary low-latency model shape | Realtime model session | Live model session | Cloud Live model session |
| Browser media path | WebRTC recommended | WebSocket with ephemeral token; partner WebRTC integrations exist | Partner WebRTC integrations; core docs emphasize WSS |
| Server path | WebSocket | WebSocket via Gen AI SDK/raw protocol | WebSocket via Gen AI SDK/raw protocol |
| Input | Text/audio/image on current realtime models | Audio/image/text | Audio/video/text |
| Output | Text/audio | Audio in Google AI overview | Audio/text in Vertex overview |
| Tool calls | Function calling, client executes and returns output | Function calling, client sends FunctionResponse | Function calling and Google Search grounding |
| Interruption | VAD, cancellation, output truncation | VAD/barge-in | VAD/barge-in |
| Client credential pattern | OpenAI ephemeral client secrets for browser realtime | Live-API ephemeral tokens | Cloud auth/service identity; client direct path depends on deployment |
The practical conclusion is that a capOS abstraction should not bake in a single provider transport. OpenAI’s best browser path is WebRTC; Gemini’s core developer path is WebSocket with ephemeral tokens; Vertex AI adds enterprise auth and throughput controls. The common semantic layer is not “WebRTC” or “WebSocket.” It is a realtime model session carrying media frames, transcripts, model audio output, structured tool calls, tool results, cancellation, and session policy.
Consequences For capOS
A First-Class RealtimeModelSession
The existing language-model proposal is text-centric:
- LanguageModel.complete
- LanguageModel.stream
- tool calls emitted in assistant messages
- runner executes tools
That remains useful. It should not be stretched to pretend realtime audio is just a token stream. Native realtime voice models need a sibling interface:
interface RealtimeModel {
  # Static description of the provider/model: modalities, formats, limits.
  info @0 () -> (info :RealtimeModelInfo);
  # Opens a stateful realtime session with the requested media and tool config.
  open @1 (config :RealtimeSessionConfig) -> (session :RealtimeModelSession);
}
interface RealtimeModelSession {
  # Pushes client input: audio/image frames, text, or control events.
  sendInput @0 (event :RealtimeInputEvent) -> ();
  # Pulls the next output event: audio, transcript, tool call, or lifecycle marker.
  next @1 () -> (event :RealtimeOutputEvent, done :Bool);
  # Returns the trusted runner's result for a previously emitted tool call.
  sendToolResult @2 (result :RealtimeToolResult) -> ();
  # Cancels in-flight model output, e.g. on barge-in.
  cancel @3 (reason :CancelReason) -> ();
  close @4 () -> ();
}
This interface lets a provider adapter hide whether it is OpenAI WebRTC, OpenAI WebSocket, Gemini WebSocket, Vertex AI, a local model, or a future GPU pipeline. It also keeps the existing capOS rule: the model never receives session authority. It emits structured tool calls, and the trusted runner executes or refuses them.
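To make the transport-hiding concrete, here is an illustrative TypeScript adapter skeleton. The event shapes, method signatures, and WebSocket wiring are assumptions for this sketch, not a settled capOS schema or any provider's protocol:

// Illustration only: a provider adapter exposing the RealtimeModelSession shape
// over an arbitrary transport. Everything provider-specific lives in translate().
type RealtimeInputEvent =
  | { kind: "audioFrame"; pcm: Uint8Array; captureTimestampNs: bigint }
  | { kind: "text"; text: string };

type RealtimeOutputEvent =
  | { kind: "audioFrame"; pcm: Uint8Array }
  | { kind: "transcriptDelta"; text: string; final: boolean }
  | { kind: "toolCall"; callId: string; name: string; argumentsJson: string };

interface RealtimeModelSession {
  sendInput(event: RealtimeInputEvent): Promise<void>;
  next(): Promise<{ event?: RealtimeOutputEvent; done: boolean }>;
  sendToolResult(callId: string, resultJson: string): Promise<void>;
  cancel(reason: string): Promise<void>;
  close(): Promise<void>;
}

class WebSocketProviderSession implements RealtimeModelSession {
  private pending: RealtimeOutputEvent[] = [];
  private closed = false;

  constructor(private ws: { send(data: string): void; close(): void }) {}

  // Called by the transport layer for every provider event.
  onProviderEvent(raw: unknown): void {
    const mapped = this.translate(raw);
    if (mapped) this.pending.push(mapped);
  }

  // Provider-specific mapping (OpenAI Realtime, Gemini Live, Vertex, local model).
  private translate(_raw: unknown): RealtimeOutputEvent | undefined {
    return undefined; // omitted in this sketch
  }

  async sendInput(event: RealtimeInputEvent): Promise<void> {
    this.ws.send(JSON.stringify(event)); // real adapters re-encode per provider
  }

  async next(): Promise<{ event?: RealtimeOutputEvent; done: boolean }> {
    return { event: this.pending.shift(), done: this.closed && this.pending.length === 0 };
  }

  async sendToolResult(callId: string, resultJson: string): Promise<void> {
    this.ws.send(JSON.stringify({ kind: "toolResult", callId, resultJson }));
  }

  async cancel(reason: string): Promise<void> {
    this.ws.send(JSON.stringify({ kind: "cancel", reason }));
  }

  async close(): Promise<void> {
    this.closed = true;
    this.ws.close();
  }
}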
Direct Native Audio Versus Chained Pipeline
capOS should support both.
Use a direct native-audio session when:
- the user expects conversational voice with low latency;
- barge-in and expressive speech matter;
- the provider model can safely handle tool-call turns in the same session;
- provider telemetry, cost, and policy permit streaming user audio off-box.
Use a chained pipeline when:
- the workflow is approval-heavy or destructive;
- deterministic transcript capture is mandatory before reasoning;
- ASR and TTS need to be local for privacy;
- the agent runner needs to inspect, redact, or transform text before model inference;
- the session is anonymous or guest and broker policy forbids remote live audio.
For web-shell voice, direct native audio is a better interactive experience, but the chained path is the safer fallback and the better first local proof.
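A small policy helper makes the split mechanical. The attribute names below are hypothetical, not a settled broker schema; the logic simply encodes the criteria above, defaulting to the chained path whenever a control or privacy constraint applies:

// Illustrative policy check only; attribute names are hypothetical.
interface VoiceSessionPolicyInput {
  approvalHeavy: boolean;        // destructive or approval-heavy workflow
  requiresLocalSpeech: boolean;  // ASR/TTS must stay on-box for privacy
  needsPreInferenceRedaction: boolean;
  guestSession: boolean;         // anonymous/guest session under broker policy
  remoteLiveAudioAllowed: boolean;
}

function chooseVoicePipeline(p: VoiceSessionPolicyInput): "chained" | "nativeRealtime" {
  if (p.approvalHeavy || p.requiresLocalSpeech || p.needsPreInferenceRedaction) return "chained";
  if (p.guestSession || !p.remoteLiveAudioAllowed) return "chained";
  return "nativeRealtime"; // low-latency conversational default when permitted
}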
Tool Calls Remain Proposals
Realtime providers can emit tool calls while producing or pausing audio. capOS must still treat those calls exactly like text-agent tool calls:
- The model emits a structured call name and arguments.
- The agent runner validates the call against advertised tool descriptors.
- Broker policy decides auto, consent, stepUp, or forbidden.
- The runner invokes the underlying typed capability if allowed.
- The runner sends a tool result back into the realtime session.
- Audit records bind model id, session id, tool descriptor revision, typed arguments, permission decision, outcome, and any spoken/user confirmation.
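A sketch of that gating sequence, with every collaborator (descriptors, broker, tools, audit) as a hypothetical stand-in for the corresponding capOS service:

// Sketch of the gating sequence above; names are illustrative stand-ins.
type Decision = "auto" | "consent" | "stepUp" | "forbidden";

interface ToolCall { callId: string; name: string; argumentsJson: string }

async function gateToolCall(
  call: ToolCall,
  descriptors: Map<string, { revision: number; validate(argsJson: string): unknown }>,
  broker: { decide(tool: string, args: unknown): Promise<Decision> },
  tools: { invoke(tool: string, args: unknown): Promise<unknown> },
  audit: { record(entry: object): Promise<void> },
  session: { sendToolResult(callId: string, resultJson: string): Promise<void> },
): Promise<void> {
  const descriptor = descriptors.get(call.name);
  if (!descriptor) {
    return session.sendToolResult(call.callId, JSON.stringify({ error: "unknown tool" }));
  }

  const args = descriptor.validate(call.argumentsJson);   // typed argument validation
  const decision = await broker.decide(call.name, args);  // auto | consent | stepUp | forbidden

  let outcome: unknown = { refused: true, decision };
  if (decision === "auto") {
    outcome = await tools.invoke(call.name, args);         // runner holds the capability, not the model
  }
  // consent/stepUp paths would pause here for user confirmation before invoking.

  await audit.record({
    callId: call.callId,
    tool: call.name,
    descriptorRevision: descriptor.revision,
    decision,
    outcome,
  });
  await session.sendToolResult(call.callId, JSON.stringify(outcome));
}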
The model must not hold the tool caps. The provider session must not receive
raw TerminalSession, Launcher, ProcessSpawner, tokens, credentials, or
session bundle authority.
Audio Is Not Terminal Text
Voice input should not be encoded as TerminalSession.readLine, and output
audio should not be TerminalSession.writeLine. The terminal stream remains a
presentation channel. Voice is a sibling media channel bound to the same
authenticated session id.
This separation matters because realtime audio has properties terminal text does not:
- frame timestamps;
- playback positions;
- output truncation;
- VAD and barge-in events;
- partial transcripts;
- deadline and stale-frame handling;
- binary frame formats;
- provider-specific session ids and event ids.
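An illustrative event shape for that sibling channel shows why it cannot be squeezed into a line-oriented terminal stream. The type names are hypothetical:

// Hypothetical shapes for the sibling voice channel; none of this metadata
// fits a line-oriented terminal stream.
type VoiceChannelEvent =
  | { kind: "captureFrame"; sessionId: string; pcm: Uint8Array; captureTimestampNs: bigint; deadlineNs: bigint }
  | { kind: "playbackPosition"; sessionId: string; itemId: string; playedMs: number }
  | { kind: "outputTruncated"; sessionId: string; itemId: string; truncatedAtMs: number }
  | { kind: "vad"; sessionId: string; state: "speechStart" | "speechEnd" | "bargeIn" }
  | { kind: "partialTranscript"; sessionId: string; text: string; final: boolean }
  | { kind: "providerEvent"; sessionId: string; providerSessionId: string; providerEventId: string };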
Media Graph Substrate
Provider-native realtime sessions do not eliminate the need for a local media graph. The graph becomes the local routing and policy layer, with the explicit goal of minimizing and bounding the portion of end-to-end latency that capOS controls inside admitted realtime islands:
flowchart LR
Mic[BrowserMic / DeviceMic] --> Capture[capture buffer]
Capture --> Gate[VAD or push-to-talk gate]
Gate --> Adapter[provider adapter or local ASR]
Adapter --> Session[RealtimeModelSession]
Session --> Runner[tool-call gate in agent runner]
Runner --> Output[model audio output / local TTS]
Output --> Playback[playback buffer]
Playback --> Speaker[BrowserSpeaker / DeviceSpeaker]
On native capOS, device-facing audio eventually needs DeviceMmio, DMAPool,
and Interrupt authority. On WebShellGateway, browser WebAudio/WebRTC handles
physical microphone/speaker I/O, while capOS still owns the session authority
and tool execution boundary. The graph should follow the multimedia latency
research rule: use admitted realtime islands, preallocated media rings,
declared async-link latency, fail-closed overrun policy, and xrun/deadline
telemetry rather than hidden buffering.
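A minimal sketch of one of those rules, a preallocated media ring with a fail-closed overrun policy: when the consumer falls behind, frames are dropped and counted as xruns instead of growing a hidden queue. Slot counts and frame sizes are placeholders:

// Sketch only: preallocated frame ring, fail-closed on overrun, xrun telemetry.
class MediaRing {
  private frames: Uint8Array[];
  private head = 0; // next write slot
  private tail = 0; // next read slot
  xruns = 0;        // overrun counter exposed as telemetry

  constructor(slots: number, frameBytes: number) {
    // Preallocate every slot up front; no allocation on the realtime path.
    this.frames = Array.from({ length: slots }, () => new Uint8Array(frameBytes));
  }

  push(frame: Uint8Array): boolean {
    const next = (this.head + 1) % this.frames.length;
    if (next === this.tail) {
      this.xruns += 1; // fail closed: drop and count, never block or grow
      return false;
    }
    this.frames[this.head].set(frame); // assumes frame.length <= frameBytes
    this.head = next;
    return true;
  }

  pop(): Uint8Array | undefined {
    if (this.tail === this.head) return undefined; // ring empty
    const frame = this.frames[this.tail];           // consumer must finish before slot reuse
    this.tail = (this.tail + 1) % this.frames.length;
    return frame;
  }
}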
Scheduling And Deadlines
Realtime voice is soft realtime for web-shell use:
- capture frames should be forwarded before they become stale;
- model output audio should be played or discarded, not accumulated without bound;
- barge-in must beat model momentum;
- tool execution must not block media handling forever.
Per-SQE or per-media-frame deadlines are useful metadata, but not authority. CPU guarantees still belong to future scheduling contexts. The media graph and realtime provider adapter should attach absolute monotonic deadlines to frames, tool calls, and playback events so stale work can be dropped deterministically.
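A sketch of deadline tagging with deterministic stale-frame drops. The clock source and budget values are placeholders, not capOS policy:

// Sketch of absolute-deadline tagging; stale work is dropped, not queued.
interface DeadlineTagged<T> {
  payload: T;
  deadlineNs: bigint; // absolute monotonic deadline
}

const now = (): bigint => process.hrtime.bigint(); // monotonic clock stand-in

function tagWithDeadline<T>(payload: T, budgetNs: bigint): DeadlineTagged<T> {
  return { payload, deadlineNs: now() + budgetNs };
}

// Drop stale work instead of letting it accumulate behind the consumer.
function takeFresh<T>(
  queue: DeadlineTagged<T>[],
  onDrop: (stale: DeadlineTagged<T>) => void,
): T | undefined {
  while (queue.length > 0) {
    const item = queue.shift()!;
    if (item.deadlineNs >= now()) return item.payload;
    onDrop(item); // counted in xrun/deadline telemetry rather than silently lost
  }
  return undefined;
}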
Browser/WebShellGateway Implications
Provider docs support two deployment shapes:
- Browser connects directly to provider using provider-issued ephemeral credentials. This minimizes media latency but exposes provider session traffic directly to browser JavaScript.
- Browser streams media to WebShellGateway, which connects to the provider server-side. This keeps provider credentials off the browser and lets capOS inspect/redact/rate-limit audio, but adds gateway latency.
For capOS, direct browser-to-provider media should be treated as an optimized
media path, not the baseline authority model. The baseline should keep
WebShellGateway and the agent runner in control of session lifecycle,
tool-call gating, audit, and teardown. If direct provider media is later used,
it should initially be media-only unless the provider offers a trusted
server-side control channel that lets the capOS adapter receive tool calls,
send tool results, and revoke the provider session without relying on browser
JavaScript.
The later browser-agent UI model is a separate policy choice: browser
JavaScript may receive provider tool-call events and orchestrate the provider
loop, but it still receives no capOS session caps or tool authority. Every
provider tool call must be forwarded as a structured ToolRequest to
WebShellGateway, and the gateway must validate descriptor freshness, session
state, consent/step-up, quotas, replay protection, and audit before invoking
real capOS capabilities. If those gateway controls are unavailable, provider
tool declarations must be disabled in the direct browser session and all
tool-capable turns must use gateway-mediated provider sessions. The browser
receives only short-lived, provider-scoped, model/config-locked tokens minted
by a broker-controlled service.
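A sketch of the broker-controlled minting path: the browser never sees a long-lived provider key, only a short-lived, model/config-locked credential, and the bypass of gateway media inspection is audited. ProviderTokenClient, BrokerPolicy, and the option names are hypothetical stand-ins; a real adapter would call the provider's documented ephemeral-credential endpoint:

// Sketch only: broker-gated minting of a short-lived, config-locked token.
interface ProviderTokenClient {
  mintEphemeralToken(opts: {
    model: string;
    lockedConfig: object; // session config the token is constrained to
    ttlSeconds: number;
  }): Promise<{ token: string; expiresAt: string }>;
}

interface BrokerPolicy {
  allowDirectProviderMedia(sessionId: string): Promise<boolean>;
}

async function mintBrowserVoiceToken(
  sessionId: string,
  broker: BrokerPolicy,
  provider: ProviderTokenClient,
  audit: { record(entry: object): Promise<void> },
): Promise<{ token: string; expiresAt: string }> {
  if (!(await broker.allowDirectProviderMedia(sessionId))) {
    throw new Error("direct provider media not permitted for this session");
  }
  const credential = await provider.mintEphemeralToken({
    model: "provider-realtime-model",       // placeholder; chosen per provider adapter
    lockedConfig: { toolDeclarations: [] }, // media-only: no tools in the direct browser session
    ttlSeconds: 60,
  });
  // Record what bypasses gateway media inspection, per the audit requirement above.
  await audit.record({ sessionId, event: "directProviderTokenMinted", expiresAt: credential.expiresAt });
  return credential;
}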
Recommended capOS Direction
- Keep LanguageModel for text and chained workflows.
- Add RealtimeModel/RealtimeModelSession for native realtime multimodal sessions.
- Model provider adapters should be ordinary services: OpenAIRealtimeProvider, GeminiLiveProvider, VertexLiveProvider, LocalRealtimeProvider.
- A capOS-side agent runner or WebShellGateway’s server-side tool proxy remains the only holder of session caps and the only executor of real capOS tools.
- WebShellGateway owns browser transport, media channels, and browser-agent tool proxy enforcement, but browser JavaScript owns no tool authority.
- Media graph primitives should use MemoryObject, notifications, futexes, and scheduling contexts as they land.
- Direct browser-to-provider connections require broker-minted ephemeral credentials and explicit audit of what bypasses gateway media inspection.
Open Design Questions
- Should RealtimeModelSession expose provider event ids verbatim, or should it normalize them to capOS-generated ids and retain provider ids only in audit metadata?
- Should direct provider WebRTC be allowed for operator sessions, or should all production web-shell voice flow through WebShellGateway?
- How much partial transcript text is trusted enough to render before the provider marks it final?
- Can a provider-generated audio response be spoken before pending consent or stepUp decisions are resolved, or must speech pause at tool-call gates?
- How should local wake-word/VAD models be sandboxed so they can improve UX without becoming an authorization factor?
- Should media-frame deadlines be added to the existing SQE reserved field, or kept in media-ring metadata until the scheduler has scheduling contexts?
References
- OpenAI, Realtime conversations
- OpenAI, Realtime API with WebRTC
- OpenAI, Realtime API with WebSocket
- OpenAI, Voice agents
- OpenAI, gpt-realtime-1.5 model page
- Google AI for Developers, Gemini Live API overview
- Google AI for Developers, Tool use with Live API
- Google AI for Developers, Ephemeral tokens
- Google Cloud Vertex AI, Gemini Live API overview