Research: Realtime Multimodal Agent APIs
Survey of provider APIs for realtime native-audio, multimodal, tool-using agents, and the consequences for capOS voice agent-shell, web shell, media graph, scheduling, and capability boundaries.
Scope
This report focuses on APIs where a model can consume realtime audio and emit both audio output and structured tool calls in one session. That is distinct from a chained pipeline where the application separately runs ASR, a text model, and TTS.
The immediate capOS question is whether the earlier agent-shell design should remain text-first with optional ASR/TTS wrappers, or whether it needs a first-class realtime multimodal model session.
Source Snapshot
All source observations below were checked against official provider documentation on 2026-04-25.
- The companion multimedia pipeline latency note covers PipeWire and JACK lessons for low-latency graph scheduling, latency reporting, realtime callbacks, and stable quantum selection.
- OpenAI Realtime API docs describe speech-to-speech sessions, WebRTC and WebSocket transports, realtime function calling, interruption/truncation, and the gpt-realtime model family.
- OpenAI Voice Agents docs explicitly frame the architecture choice as direct live audio sessions versus chained speech-to-text, text-agent, and text-to-speech pipelines.
- Google AI Gemini Live API docs describe realtime audio/image/text input, audio output, WebSocket transport, VAD, barge-in, tool use, and ephemeral tokens for client-to-server browser use.
- Vertex AI Gemini Live API docs describe the enterprise/cloud variant with realtime voice/video, native audio, transcriptions, function calling, Google Search grounding, and provisioned-throughput-oriented deployment considerations.
Provider Findings
OpenAI Realtime API
OpenAI’s Realtime API is a stateful session API for low-latency interactions
with realtime models. The docs describe calling models such as
gpt-realtime for speech-to-speech conversations over WebRTC or WebSocket,
with the session carrying model, voice, conversation items, and generated
responses.
Important details for capOS:
- Browser clients are steered toward WebRTC for more consistent media performance; server-to-server integrations are steered toward WebSocket.
- WebRTC media and control are split: audio is handled by the peer connection, while other events travel over a data channel.
- WebSocket integrations carry JSON events and require the application to manage input and output audio buffers directly.
- Realtime function calling is configured per session/response. The model emits a function_call item with a name, JSON arguments, and a generated call id. The application executes the function and sends back a function_call_output conversation item keyed by that call id (a minimal sketch follows this list).
- Realtime interruption is a first-class path. With VAD, user speech can cancel an ongoing model response. WebRTC/SIP paths have server-side knowledge of played audio; WebSocket paths require the client to stop playback and send truncation metadata for unplayed audio.
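A minimal sketch of that round trip over the WebSocket transport, in TypeScript: the client listens for a completed function-call event, executes the call in a trusted runner, and returns a function_call_output item keyed by the call id. Endpoint, event, and item names follow the Realtime docs cited below, but the payload handling is simplified and runToolInTrustedRunner is a stand-in, not a real capOS API:

// Sketch only: one function-call round trip over the OpenAI Realtime WebSocket
// transport. Error handling, audio buffers, and session.update configuration
// are omitted.
import WebSocket from "ws";

const ws = new WebSocket(
  "wss://api.openai.com/v1/realtime?model=gpt-realtime",
  { headers: { Authorization: `Bearer ${process.env.OPENAI_API_KEY}` } },
);

// Stand-in for the trusted capOS runner: in this report's model it validates
// the call against advertised tool descriptors and broker policy first.
async function runToolInTrustedRunner(name: string, args: unknown): Promise<unknown> {
  return { ok: true, tool: name, args }; // placeholder result
}

ws.on("message", async (raw) => {
  const event = JSON.parse(raw.toString());

  // The model proposes a call; it never executes anything itself.
  if (event.type === "response.function_call_arguments.done") {
    const result = await runToolInTrustedRunner(event.name, JSON.parse(event.arguments));

    // Return the result as a function_call_output item keyed by call_id,
    // then request a new response so the model can continue speaking.
    ws.send(JSON.stringify({
      type: "conversation.item.create",
      item: {
        type: "function_call_output",
        call_id: event.call_id,
        output: JSON.stringify(result),
      },
    }));
    ws.send(JSON.stringify({ type: "response.create" }));
  }
});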
gpt-realtime-1.5 is documented as a realtime audio-in/audio-out model with text, audio, and image input; text and audio output; and function calling. The current model page marks video as unsupported.
OpenAI’s Voice Agents docs expose the architectural tradeoff directly: live speech-to-speech sessions are the natural low-latency path, while chained ASR plus text-agent plus TTS gives stronger intermediate control and is often more appropriate for approval-heavy workflows.
Google AI Gemini Live API
Google AI’s Gemini Live API is a realtime stateful WebSocket API. The developer docs describe audio, image, and text input; audio output; VAD; barge-in; transcriptions; proactive audio; affective dialog; and tool use.
Important details for capOS:
- The Google AI developer API lists input audio as raw 16-bit PCM at 16 kHz little-endian, image input as JPEG at up to 1 FPS, and output audio as raw 16-bit PCM at 24 kHz little-endian.
- The public developer API supports server-to-server and client-to-server approaches. Client-to-server avoids backend media proxy latency but requires ephemeral tokens rather than long-lived API keys in client code.
- Ephemeral tokens are Live-API-only, short-lived credentials. Google documents default timing behavior of roughly one minute to start a new session and thirty minutes for sending messages over a connection, with the ability to restrict tokens to Live API model/config constraints.
- Tool use supports function calling and Google Search. Function declarations are installed in session configuration, and the client must manually send tool responses. Google AI docs distinguish synchronous function calls from non-blocking function declarations on models that support them, with response scheduling options such as interrupting current model output, waiting until idle, or staying silent.
- Tool support differs by model family and revision. The Google AI docs list Gemini 3.1 Flash Live Preview and Gemini 2.5 Flash Live Preview with function calling, but not all asynchronous behavior is supported by every model.
Vertex AI Gemini Live API
Vertex AI’s Live API docs describe the Google Cloud deployment path. The docs
currently present gemini-live-2.5-flash-native-audio as generally available
and recommended for low-latency voice agents, with native audio,
transcriptions, VAD, affective dialog, proactive audio, and tool use. They also
document a preview native-audio model and state a deprecation date for the
older preview native-audio release.
The Vertex AI page is relevant to capOS for enterprise deployment:
- It documents raw PCM input/output rates and a stateful WSS protocol.
- It describes realtime voice/video agents, tool use through function calling and Google Search, audio transcriptions, barge-in, and proactive audio.
- It points at partner WebRTC integrations, while the core Vertex API remains WebSocket-oriented in the referenced docs.
- It exposes cloud operational concerns not present in the simple developer API view: access management, request logging, provisioned throughput, PayGo variants, quotas, and regional/cloud deployment policy.
Comparison
| Axis | OpenAI Realtime | Gemini Live API | Vertex AI Live API |
|---|---|---|---|
| Primary low-latency model shape | Realtime model session | Live model session | Cloud Live model session |
| Browser media path | WebRTC recommended | WebSocket with ephemeral token; partner WebRTC integrations exist | Partner WebRTC integrations; core docs emphasize WSS |
| Server path | WebSocket | WebSocket via Gen AI SDK/raw protocol | WebSocket via Gen AI SDK/raw protocol |
| Input | Text/audio/image on current realtime models | Audio/image/text | Audio/video/text |
| Output | Text/audio | Audio in Google AI overview | Audio/text in Vertex overview |
| Tool calls | Function calling, client executes and returns output | Function calling, client sends FunctionResponse | Function calling and Google Search grounding |
| Interruption | VAD, cancellation, output truncation | VAD/barge-in | VAD/barge-in |
| Client credential pattern | OpenAI ephemeral client secrets for browser realtime | Live-API ephemeral tokens | Cloud auth/service identity; client direct path depends on deployment |
The practical conclusion is that a capOS abstraction should not bake in a single provider transport. OpenAI’s best browser path is WebRTC; Gemini’s core developer path is WebSocket with ephemeral tokens; Vertex AI adds enterprise auth and throughput controls. The common semantic layer is not “WebRTC” or “WebSocket.” It is a realtime model session carrying media frames, transcripts, model audio output, structured tool calls, tool results, cancellation, and session policy.
Consequences For capOS
A First-Class RealtimeModelSession
The existing language-model proposal is text-centric:
- LanguageModel.complete
- LanguageModel.stream
- tool calls emitted in assistant messages
- runner executes tools
That remains useful. It should not be stretched to pretend realtime audio is just a token stream. Native realtime voice models need a sibling interface:
interface RealtimeModel {
  # Static description of the provider/model: modalities, formats, limits.
  info @0 () -> (info :RealtimeModelInfo);
  # Opens a stateful realtime session with the requested media and tool config.
  open @1 (config :RealtimeSessionConfig) -> (session :RealtimeModelSession);
}
interface RealtimeModelSession {
  # Pushes client input: audio/image frames, text, or control events.
  sendInput @0 (event :RealtimeInputEvent) -> ();
  # Pulls the next output event: audio, transcript, tool call, or lifecycle marker.
  next @1 () -> (event :RealtimeOutputEvent, done :Bool);
  # Returns the trusted runner's result for a previously emitted tool call.
  sendToolResult @2 (result :RealtimeToolResult) -> ();
  # Cancels in-flight model output, e.g. on barge-in.
  cancel @3 (reason :CancelReason) -> ();
  close @4 () -> ();
}
This interface lets a provider adapter hide whether it is OpenAI WebRTC, OpenAI WebSocket, Gemini WebSocket, Vertex AI, a local model, or a future GPU pipeline. It also keeps the existing capOS rule: the model never receives session authority. It emits structured tool calls, and the trusted runner executes or refuses them.
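To make the transport-hiding concrete, here is an illustrative TypeScript adapter skeleton. The event shapes, method signatures, and WebSocket wiring are assumptions for this sketch, not a settled capOS schema or any provider's protocol:

// Illustration only: a provider adapter exposing the RealtimeModelSession shape
// over an arbitrary transport. Everything provider-specific lives in translate().
type RealtimeInputEvent =
  | { kind: "audioFrame"; pcm: Uint8Array; captureTimestampNs: bigint }
  | { kind: "text"; text: string };

type RealtimeOutputEvent =
  | { kind: "audioFrame"; pcm: Uint8Array }
  | { kind: "transcriptDelta"; text: string; final: boolean }
  | { kind: "toolCall"; callId: string; name: string; argumentsJson: string };

interface RealtimeModelSession {
  sendInput(event: RealtimeInputEvent): Promise<void>;
  next(): Promise<{ event?: RealtimeOutputEvent; done: boolean }>;
  sendToolResult(callId: string, resultJson: string): Promise<void>;
  cancel(reason: string): Promise<void>;
  close(): Promise<void>;
}

class WebSocketProviderSession implements RealtimeModelSession {
  private pending: RealtimeOutputEvent[] = [];
  private closed = false;

  constructor(private ws: { send(data: string): void; close(): void }) {}

  // Called by the transport layer for every provider event.
  onProviderEvent(raw: unknown): void {
    const mapped = this.translate(raw);
    if (mapped) this.pending.push(mapped);
  }

  // Provider-specific mapping (OpenAI Realtime, Gemini Live, Vertex, local model).
  private translate(_raw: unknown): RealtimeOutputEvent | undefined {
    return undefined; // omitted in this sketch
  }

  async sendInput(event: RealtimeInputEvent): Promise<void> {
    this.ws.send(JSON.stringify(event)); // real adapters re-encode per provider
  }

  async next(): Promise<{ event?: RealtimeOutputEvent; done: boolean }> {
    return { event: this.pending.shift(), done: this.closed && this.pending.length === 0 };
  }

  async sendToolResult(callId: string, resultJson: string): Promise<void> {
    this.ws.send(JSON.stringify({ kind: "toolResult", callId, resultJson }));
  }

  async cancel(reason: string): Promise<void> {
    this.ws.send(JSON.stringify({ kind: "cancel", reason }));
  }

  async close(): Promise<void> {
    this.closed = true;
    this.ws.close();
  }
}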
Direct Native Audio Versus Chained Pipeline
capOS should support both.
Use a direct native-audio session when:
- the user expects conversational voice with low latency;
- barge-in and expressive speech matter;
- the provider model can safely handle tool-call turns in the same session;
- provider telemetry, cost, and policy permit streaming user audio off-box.
Use a chained pipeline when:
- the workflow is approval-heavy or destructive;
- deterministic transcript capture is mandatory before reasoning;
- ASR and TTS need to be local for privacy;
- the agent runner needs to inspect, redact, or transform text before model inference;
- the session is anonymous or guest and broker policy forbids remote live audio.
For web-shell voice, direct native audio is a better interactive experience, but the chained path is the safer fallback and the better first local proof.
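A small policy helper makes the split mechanical. The attribute names below are hypothetical, not a settled broker schema; the logic simply encodes the criteria above, defaulting to the chained path whenever a control or privacy constraint applies:

// Illustrative policy check only; attribute names are hypothetical.
interface VoiceSessionPolicyInput {
  approvalHeavy: boolean;        // destructive or approval-heavy workflow
  requiresLocalSpeech: boolean;  // ASR/TTS must stay on-box for privacy
  needsPreInferenceRedaction: boolean;
  guestSession: boolean;         // anonymous/guest session under broker policy
  remoteLiveAudioAllowed: boolean;
}

function chooseVoicePipeline(p: VoiceSessionPolicyInput): "chained" | "nativeRealtime" {
  if (p.approvalHeavy || p.requiresLocalSpeech || p.needsPreInferenceRedaction) return "chained";
  if (p.guestSession || !p.remoteLiveAudioAllowed) return "chained";
  return "nativeRealtime"; // low-latency conversational default when permitted
}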
Tool Calls Remain Proposals
Realtime providers can emit tool calls while producing or pausing audio. capOS must still treat those calls exactly like text-agent tool calls:
- The model emits a structured call name and arguments.
- The agent runner validates the call against advertised tool descriptors.
- Broker policy decides auto, consent, stepUp, or forbidden.
- The runner invokes the underlying typed capability if allowed.
- The runner sends a tool result back into the realtime session.
- Audit records bind model id, session id, tool descriptor revision, typed arguments, permission decision, outcome, and any spoken/user confirmation.
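A sketch of that gating sequence, with every collaborator (descriptors, broker, tools, audit) as a hypothetical stand-in for the corresponding capOS service:

// Sketch of the gating sequence above; names are illustrative stand-ins.
type Decision = "auto" | "consent" | "stepUp" | "forbidden";

interface ToolCall { callId: string; name: string; argumentsJson: string }

async function gateToolCall(
  call: ToolCall,
  descriptors: Map<string, { revision: number; validate(argsJson: string): unknown }>,
  broker: { decide(tool: string, args: unknown): Promise<Decision> },
  tools: { invoke(tool: string, args: unknown): Promise<unknown> },
  audit: { record(entry: object): Promise<void> },
  session: { sendToolResult(callId: string, resultJson: string): Promise<void> },
): Promise<void> {
  const descriptor = descriptors.get(call.name);
  if (!descriptor) {
    return session.sendToolResult(call.callId, JSON.stringify({ error: "unknown tool" }));
  }

  const args = descriptor.validate(call.argumentsJson);   // typed argument validation
  const decision = await broker.decide(call.name, args);  // auto | consent | stepUp | forbidden

  let outcome: unknown = { refused: true, decision };
  if (decision === "auto") {
    outcome = await tools.invoke(call.name, args);         // runner holds the capability, not the model
  }
  // consent/stepUp paths would pause here for user confirmation before invoking.

  await audit.record({
    callId: call.callId,
    tool: call.name,
    descriptorRevision: descriptor.revision,
    decision,
    outcome,
  });
  await session.sendToolResult(call.callId, JSON.stringify(outcome));
}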
The model must not hold the tool caps. The provider session must not receive
raw TerminalSession, Launcher, ProcessSpawner, tokens, credentials, or
session bundle authority.
Audio Is Not Terminal Text
Voice input should not be encoded as TerminalSession.readLine, and output
audio should not be TerminalSession.writeLine. The terminal stream remains a
presentation channel. Voice is a sibling media channel bound to the same
authenticated session id.
This separation matters because realtime audio has properties terminal text does not:
- frame timestamps;
- playback positions;
- output truncation;
- VAD and barge-in events;
- partial transcripts;
- deadline and stale-frame handling;
- binary frame formats;
- provider-specific session ids and event ids.
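An illustrative event shape for that sibling channel shows why it cannot be squeezed into a line-oriented terminal stream. The type names are hypothetical:

// Hypothetical shapes for the sibling voice channel; none of this metadata
// fits a line-oriented terminal stream.
type VoiceChannelEvent =
  | { kind: "captureFrame"; sessionId: string; pcm: Uint8Array; captureTimestampNs: bigint; deadlineNs: bigint }
  | { kind: "playbackPosition"; sessionId: string; itemId: string; playedMs: number }
  | { kind: "outputTruncated"; sessionId: string; itemId: string; truncatedAtMs: number }
  | { kind: "vad"; sessionId: string; state: "speechStart" | "speechEnd" | "bargeIn" }
  | { kind: "partialTranscript"; sessionId: string; text: string; final: boolean }
  | { kind: "providerEvent"; sessionId: string; providerSessionId: string; providerEventId: string };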
Media Graph Substrate
Provider-native realtime sessions do not eliminate the need for a local media graph. The graph becomes the local routing and policy layer, with the explicit goal of minimizing and bounding the portion of end-to-end latency that capOS controls inside admitted realtime islands:
flowchart LR
Mic[BrowserMic / DeviceMic] --> Capture[capture buffer]
Capture --> Gate[VAD or push-to-talk gate]
Gate --> Adapter[provider adapter or local ASR]
Adapter --> Session[RealtimeModelSession]
Session --> Runner[tool-call gate in agent runner]
Runner --> Output[model audio output / local TTS]
Output --> Playback[playback buffer]
Playback --> Speaker[BrowserSpeaker / DeviceSpeaker]
On native capOS, device-facing audio eventually needs DeviceMmio, DMAPool,
and Interrupt authority. On WebShellGateway, browser WebAudio/WebRTC handles
physical microphone/speaker I/O, while capOS still owns the session authority
and tool execution boundary. The graph should follow the multimedia latency
research rule: use admitted realtime islands, preallocated media rings,
declared async-link latency, fail-closed overrun policy, and xrun/deadline
telemetry rather than hidden buffering.
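A minimal sketch of one of those rules, a preallocated media ring with a fail-closed overrun policy: when the consumer falls behind, frames are dropped and counted as xruns instead of growing a hidden queue. Slot counts and frame sizes are placeholders:

// Sketch only: preallocated frame ring, fail-closed on overrun, xrun telemetry.
class MediaRing {
  private frames: Uint8Array[];
  private head = 0; // next write slot
  private tail = 0; // next read slot
  xruns = 0;        // overrun counter exposed as telemetry

  constructor(slots: number, frameBytes: number) {
    // Preallocate every slot up front; no allocation on the realtime path.
    this.frames = Array.from({ length: slots }, () => new Uint8Array(frameBytes));
  }

  push(frame: Uint8Array): boolean {
    const next = (this.head + 1) % this.frames.length;
    if (next === this.tail) {
      this.xruns += 1; // fail closed: drop and count, never block or grow
      return false;
    }
    this.frames[this.head].set(frame); // assumes frame.length <= frameBytes
    this.head = next;
    return true;
  }

  pop(): Uint8Array | undefined {
    if (this.tail === this.head) return undefined; // ring empty
    const frame = this.frames[this.tail];           // consumer must finish before slot reuse
    this.tail = (this.tail + 1) % this.frames.length;
    return frame;
  }
}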
Scheduling And Deadlines
Realtime voice is soft realtime for web-shell use:
- capture frames should be forwarded before they become stale;
- model output audio should be played or discarded, not accumulated without bound;
- barge-in must beat model momentum;
- tool execution must not block media handling forever.
Per-SQE or per-media-frame deadlines are useful metadata, but not authority. CPU guarantees still belong to future scheduling contexts. The media graph and realtime provider adapter should attach absolute monotonic deadlines to frames, tool calls, and playback events so stale work can be dropped deterministically.
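A sketch of deadline tagging with deterministic stale-frame drops. The clock source and budget values are placeholders, not capOS policy:

// Sketch of absolute-deadline tagging; stale work is dropped, not queued.
interface DeadlineTagged<T> {
  payload: T;
  deadlineNs: bigint; // absolute monotonic deadline
}

const now = (): bigint => process.hrtime.bigint(); // monotonic clock stand-in

function tagWithDeadline<T>(payload: T, budgetNs: bigint): DeadlineTagged<T> {
  return { payload, deadlineNs: now() + budgetNs };
}

// Drop stale work instead of letting it accumulate behind the consumer.
function takeFresh<T>(
  queue: DeadlineTagged<T>[],
  onDrop: (stale: DeadlineTagged<T>) => void,
): T | undefined {
  while (queue.length > 0) {
    const item = queue.shift()!;
    if (item.deadlineNs >= now()) return item.payload;
    onDrop(item); // counted in xrun/deadline telemetry rather than silently lost
  }
  return undefined;
}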
Browser/WebShellGateway Implications
Provider docs support two deployment shapes:
- Browser connects directly to provider using provider-issued ephemeral credentials. This minimizes media latency but exposes provider session traffic directly to browser JavaScript.
- Browser streams media to WebShellGateway, which connects to the provider server-side. This keeps provider credentials off the browser and lets capOS inspect/redact/rate-limit audio, but adds gateway latency.
For capOS, direct browser-to-provider media should be treated as an optimized
media path, not the baseline authority model. The baseline should keep
WebShellGateway and the agent runner in control of session lifecycle,
tool-call gating, audit, and teardown. If direct provider media is later used,
it should initially be media-only unless the provider offers a trusted
server-side control channel that lets the capOS adapter receive tool calls,
send tool results, and revoke the provider session without relying on browser
JavaScript.
The later browser-agent UI model is a separate policy choice: browser
JavaScript may receive provider tool-call events and orchestrate the provider
loop, but it still receives no capOS session caps or tool authority. Every
provider tool call must be forwarded as a structured ToolRequest to
WebShellGateway, and the gateway must validate descriptor freshness, session
state, consent/step-up, quotas, replay protection, and audit before invoking
real capOS capabilities. If those gateway controls are unavailable, provider
tool declarations must be disabled in the direct browser session and all
tool-capable turns must use gateway-mediated provider sessions. The browser
receives only short-lived, provider-scoped, model/config-locked tokens minted
by a broker-controlled service.
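A sketch of the broker-controlled minting path: the browser never sees a long-lived provider key, only a short-lived, model/config-locked credential, and the bypass of gateway media inspection is audited. ProviderTokenClient, BrokerPolicy, and the option names are hypothetical stand-ins; a real adapter would call the provider's documented ephemeral-credential endpoint:

// Sketch only: broker-gated minting of a short-lived, config-locked token.
interface ProviderTokenClient {
  mintEphemeralToken(opts: {
    model: string;
    lockedConfig: object; // session config the token is constrained to
    ttlSeconds: number;
  }): Promise<{ token: string; expiresAt: string }>;
}

interface BrokerPolicy {
  allowDirectProviderMedia(sessionId: string): Promise<boolean>;
}

async function mintBrowserVoiceToken(
  sessionId: string,
  broker: BrokerPolicy,
  provider: ProviderTokenClient,
  audit: { record(entry: object): Promise<void> },
): Promise<{ token: string; expiresAt: string }> {
  if (!(await broker.allowDirectProviderMedia(sessionId))) {
    throw new Error("direct provider media not permitted for this session");
  }
  const credential = await provider.mintEphemeralToken({
    model: "provider-realtime-model",       // placeholder; chosen per provider adapter
    lockedConfig: { toolDeclarations: [] }, // media-only: no tools in the direct browser session
    ttlSeconds: 60,
  });
  // Record what bypasses gateway media inspection, per the audit requirement above.
  await audit.record({ sessionId, event: "directProviderTokenMinted", expiresAt: credential.expiresAt });
  return credential;
}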
Recommended capOS Direction
- Keep LanguageModel for text and chained workflows.
- Add RealtimeModel/RealtimeModelSession for native realtime multimodal sessions.
- Model provider adapters should be ordinary services: OpenAIRealtimeProvider, GeminiLiveProvider, VertexLiveProvider, LocalRealtimeProvider.
- A capOS-side agent runner or WebShellGateway’s server-side tool proxy remains the only holder of session caps and the only executor of real capOS tools.
- WebShellGateway owns browser transport, media channels, and browser-agent tool proxy enforcement, but browser JavaScript owns no tool authority.
- Media graph primitives should use MemoryObject, notifications, futexes, and scheduling contexts as they land.
- Direct browser-to-provider connections require broker-minted ephemeral credentials and explicit audit of what bypasses gateway media inspection.
Open Design Questions
- Should RealtimeModelSession expose provider event ids verbatim, or should it normalize them to capOS-generated ids and retain provider ids only in audit metadata?
- Should direct provider WebRTC be allowed for operator sessions, or should all production web-shell voice flow through WebShellGateway?
- How much partial transcript text is trusted enough to render before the provider marks it final?
- Can a provider-generated audio response be spoken before pending consent or stepUp decisions are resolved, or must speech pause at tool-call gates?
- How should local wake-word/VAD models be sandboxed so they can improve UX without becoming an authorization factor?
- Should media-frame deadlines be added to the existing SQE reserved field, or kept in media-ring metadata until the scheduler has scheduling contexts?
References
- OpenAI, Realtime conversations
- OpenAI, Realtime API with WebRTC
- OpenAI, Realtime API with WebSocket
- OpenAI, Voice agents
- OpenAI, gpt-realtime-1.5 model page
- Google AI for Developers, Gemini Live API overview
- Google AI for Developers, Tool use with Live API
- Google AI for Developers, Ephemeral tokens
- Google Cloud Vertex AI, Gemini Live API overview