Proposal: Browser Capability and Agent Web Sessions

How capOS should expose the web without turning a browser into an ambiently privileged desktop escape hatch.

This proposal is intentionally split into three tracks:

After GUI: a full visual browser for humans, with windows, input, rendering, profiles, downloads, extensions, and ordinary web compatibility.
Agent/shell usage: a standard BrowserSession capability that lets shells and AI agents navigate, inspect, screenshot, fill forms, download, and collect evidence through a brokered browser service before capOS has a native GUI browser.
Cap-native document engine: an intermediate path that runs JS, DOM/CSS, layout, and rendering over caller-provided document/resource data, with fetch, storage, permissions, clipboard, downloads, and host I/O wired to native capOS capabilities instead of a browser-owned ambient platform.

The existing Browser/WASM proposal runs capOS in a browser tab. This proposal is the inverse: capOS exposes browser capabilities to users, services, and agents.

Grounding research: Browser Engines, Document Engines, and Agent Browsers.

Problem

The web is both a user interface substrate and a huge authority boundary. A browser can read credentials, perform network requests, upload local files, download untrusted bytes, run JavaScript from hostile origins, track users through profiles, and expose debug protocols powerful enough to rewrite page state.

On a conventional OS, that power is hidden behind process permissions, profile directories, and implicit user intent. capOS needs a browser model that fits the capability system:

Profiles and sessions are explicit authority.
Network routes, downloads, uploads, credentials, and automation are scoped.
Browser JavaScript does not get shell or storage authority by accident.
Agents can use the web as a tool without receiving raw CDP, filesystem, or network capabilities.

Non-Goals

Writing a new browser engine for the first capOS browser milestone.
Porting Chromium, WebKit, Gecko, Servo, or Ladybird before the GUI, userspace networking/storage, fonts, and driver-safety prerequisites exist.
Treating anti-detection, fingerprint evasion, scraping at scale, or bot bypass as a capOS product goal.
Exposing raw Chrome DevTools Protocol, WebDriver BiDi, or Playwright handles as ordinary user/session capabilities.
Letting browser-hosted JavaScript hold raw capOS shell, launch, file, or network capabilities.

Design Principles

Browser state is authority. A profile’s cookies, local storage, permissions, saved credentials, cache, proxy route, and downloads are not implementation details. They are held through BrowserProfile and BrowserContext capabilities.
The interface is the permission. A caller that can navigate does not automatically get DOM inspection, screenshot, input, download, upload, network interception, profile mutation, or automation-debug authority.
Agents receive tools, not admin ports. CDP and WebDriver BiDi are backend protocols for the trusted browser service. The agent-facing ABI is a typed, narrowed capability surface.
Origins become visible policy inputs. Browser decisions should record origin, top-level site, profile, user session, persona, network route, and initiator. URL strings alone are not enough.
Downloads and uploads cross explicit caps. A download returns a BrowserArtifact or writes through a granted DownloadSink. Uploading a file requires a granted read cap for that object and a per-action policy decision.
Automation is auditable. Browser actions initiated by an agent are logged with the page/session, operation, typed arguments, permission mode, result, and artifacts captured for later review.
Visual browsing waits for GUI. A human browser is a real app, not a terminal command. It should land only after compositor/input/font/storage and userspace networking foundations are credible.
A browser can be headless before it is native. The early agent/shell-facing capability may be served by a host-side browser, a development-machine sidecar, a Linux companion process, or a remote browser service. The capOS ABI should not expose which backend serves it.
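
As an illustration of the "automation is auditable" and "origins become visible policy inputs" principles, the following is a minimal sketch of what an audit record might carry. Everything here is an assumption for illustration: `PermissionMode`, `AuditRecord`, `AuditLog`, and their field names are hypothetical and not part of any capOS schema.

```rust
/// How an action was authorized. Variant names are hypothetical.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum PermissionMode {
    PolicyGranted,
    UserConsented,
}

/// One audited browser action, carrying the fields the principles call for.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct AuditRecord {
    pub session_id: u64,
    /// Origin and top-level site are recorded separately from the raw URL.
    pub origin: String,
    pub top_level_site: String,
    pub operation: String,
    /// Typed arguments flattened to name/value pairs for this sketch.
    pub args: Vec<(String, String)>,
    pub mode: PermissionMode,
    pub result: Result<(), String>,
    /// Identifiers of artifacts (screenshots, downloads) captured by the action.
    pub artifact_ids: Vec<u64>,
}

/// Append-only in-memory log; a real service would persist through a store cap.
#[derive(Default)]
pub struct AuditLog {
    records: Vec<AuditRecord>,
}

impl AuditLog {
    /// Append a record and return its sequence number.
    pub fn append(&mut self, record: AuditRecord) -> usize {
        self.records.push(record);
        self.records.len() - 1
    }

    /// All records for one session, in the order they were appended.
    pub fn for_session(&self, session_id: u64) -> Vec<&AuditRecord> {
        self.records
            .iter()
            .filter(|r| r.session_id == session_id)
            .collect()
    }
}
```

The log exposes no mutation or deletion method, so a reviewer reading `for_session` sees every action that was appended for that session.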

Track 1: Agent/Shell Browser Capability

This is the near-term conceptual track. It gives capOS agents and shells a standard web tool without waiting for a compositor or native browser port.

Conceptual interfaces:

```capnp
interface BrowserBroker {
  createProfile @0 (request :BrowserProfileRequest) -> (profile :BrowserProfile);
  openContext @1 (profile :BrowserProfile, policy :BrowserContextPolicy)
      -> (context :BrowserContext);
}

interface BrowserContext {
  openSession @0 (persona :BrowserPersona) -> (session :BrowserSession);
  snapshot @1 () -> (profileSnapshot :BrowserProfileSnapshot);
  destroy @2 () -> ();
}

interface BrowserSession {
  close @0 () -> ();
}

interface BrowserNavigate {
  navigate @0 (url :Text, wait :NavigationWait) -> (result :NavigationResult);
}

interface BrowserReadPage {
  readPage @0 (budget :PageReadBudget) -> (snapshot :PageSnapshot);
}

interface BrowserScreenshot {
  screenshot @0 (options :ScreenshotOptions) -> (image :BrowserArtifact);
}

interface BrowserInput {
  input @0 (action :InputAction) -> (result :InputResult);
}

interface BrowserDownload {
  download @0 (selector :DownloadSelector, sink :DownloadSink)
      -> (artifact :BrowserArtifact);
}
```

The exact schema belongs in a later implementation slice. The important rule is that BrowserSession is only a lifetime handle for one browsing context. It does not imply navigation, inspection, screenshot, input, download, upload, network-observer, or debug authority. The broker mints only the operation facets allowed by the caller’s session policy, and the shell/agent runner advertises only tools backed by facets it actually holds.
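
The facet-minting rule can be sketched as a set intersection: the caller gets only the requested facets that its session policy allows. This is a minimal sketch under stated assumptions, not the broker's implementation; `Facet`, `SessionPolicy`, `SessionBundle`, and `open_session` are hypothetical names, and the integer `session_id` stands in for a real `BrowserSession` capability.

```rust
use std::collections::BTreeSet;

/// Operation facets corresponding to the conceptual interfaces above.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Facet {
    Navigate,
    ReadPage,
    Screenshot,
    Input,
    Download,
}

/// Hypothetical session policy: the facets the broker is allowed to mint.
pub struct SessionPolicy {
    pub allowed: BTreeSet<Facet>,
}

/// What the caller ends up holding: a lifetime handle plus minted facets.
pub struct SessionBundle {
    /// Stand-in for the BrowserSession lifetime handle.
    pub session_id: u64,
    pub facets: BTreeSet<Facet>,
}

impl SessionBundle {
    /// An operation is possible only when its facet was minted.
    pub fn can(&self, facet: Facet) -> bool {
        self.facets.contains(&facet)
    }
}

/// Mint only the requested facets that the policy allows.
pub fn open_session(
    session_id: u64,
    policy: &SessionPolicy,
    requested: &[Facet],
) -> SessionBundle {
    let facets: BTreeSet<Facet> = requested
        .iter()
        .copied()
        .filter(|facet| policy.allowed.contains(facet))
        .collect();
    SessionBundle { session_id, facets }
}
```

A caller that requests `Navigate` and `Input` under a policy allowing `Navigate` and `ReadPage` would hold only `Navigate`: disallowed requests are dropped, and allowed facets are not minted unless requested.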

| Capability | Authority |
| --- | --- |
| `BrowserBroker` | Mint profiles and contexts according to session policy. |
| `BrowserProfile` | Own persistent browser state and profile lifecycle. |
| `BrowserContext` | Own one isolated browsing context under a profile. |
| `BrowserSession` | Hold and close one session lifetime; no operation authority by itself. |
| `BrowserNavigate` | Navigate within one session. |
| `BrowserReadPage` | Inspect page state under output budgets. |
| `BrowserScreenshot` | Capture screenshot artifacts under policy. |
| `BrowserInput` | Click, type, select, upload only with explicit grants. |
| `BrowserDownload` | Initiate browser downloads into a granted sink. |
| `DownloadSink` | Receive bytes/artifacts from browser downloads. |
| `BrowserNetworkObserver` | Read network metadata or bodies under redaction policy. |
| `BrowserAdmin` | Backend-only: raw CDP/BiDi, crash dumps, trace, profile mutation. |

Agent Tool Shape

The native shell or agent runner advertises browser operations as ordinary tools:

```text
browser.open(url)
browser.snapshot()
browser.screenshot()
browser.click(ref)
browser.type(ref, text)
browser.select(ref, value)
browser.download(ref)
browser.close()
```

The tool result is structured:

page title, URL, origin, load state
accessibility/DOM references under stable short IDs
visible text and form fields under a token/byte budget
screenshot artifact cap, when requested
network/download artifacts only when separately allowed

The model never receives the BrowserSession cap. It proposes tool calls; the runner executes them after policy and consent checks, then feeds bounded results back to the model. This matches the model/tool-call loop owned by the Language Models and the Agent Runtime proposal.
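
The runner-side gate described above might look like the following sketch. It assumes a hypothetical `Runner` that maps each tool name to the facet backing it, refuses calls without that facet, and truncates the backend's text to a byte budget before it reaches the model. The tool-to-facet mapping, the `backend` closure, and all type names are assumptions for illustration; consent prompts and audit logging are left out.

```rust
use std::collections::BTreeSet;

/// Operation facets the runner may hold (mirrors the interfaces above).
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Facet {
    Navigate,
    ReadPage,
    Screenshot,
    Input,
    Download,
}

/// A tool call proposed by the model: names and text only, never a cap.
pub struct ToolCall<'a> {
    pub name: &'a str,
    pub arg: &'a str,
}

#[derive(Debug, PartialEq, Eq)]
pub enum ToolOutcome {
    /// Bounded text that is safe to feed back to the model.
    Result(String),
    /// The runner holds no facet backing this tool.
    Denied,
    UnknownTool,
}

/// Truncate to at most `budget` bytes without splitting a UTF-8 character.
pub fn bound_text(text: &str, budget: usize) -> &str {
    if text.len() <= budget {
        return text;
    }
    let mut end = budget;
    while !text.is_char_boundary(end) {
        end -= 1;
    }
    &text[..end]
}

/// Hypothetical runner state: held facets plus a per-result byte budget.
pub struct Runner {
    pub held: BTreeSet<Facet>,
    pub byte_budget: usize,
}

impl Runner {
    /// Check the facet, run the call through `backend`, and bound the result.
    /// Consent prompts and audit logging are omitted from this sketch.
    pub fn execute<F>(&self, call: &ToolCall<'_>, backend: F) -> ToolOutcome
    where
        F: Fn(&str, &str) -> String,
    {
        let needed = match call.name {
            "browser.open" => Some(Facet::Navigate),
            "browser.snapshot" => Some(Facet::ReadPage),
            "browser.screenshot" => Some(Facet::Screenshot),
            "browser.click" | "browser.type" | "browser.select" => Some(Facet::Input),
            "browser.download" => Some(Facet::Download),
            // Closing needs only the session lifetime handle.
            "browser.close" => None,
            _ => return ToolOutcome::UnknownTool,
        };
        if let Some(facet) = needed {
            if !self.held.contains(&facet) {
                return ToolOutcome::Denied;
            }
        }
        let raw = backend(call.name, call.arg);
        ToolOutcome::Result(bound_text(&raw, self.byte_budget).to_string())
    }
}
```

The model-facing types contain only strings, so the model cannot name or forward a capability, and a denied call never reaches the backend.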

Backend Strategy

The first implementation should be a userspace service or host-side harness that owns a real browser and exposes the typed capOS surface:

Browser service launches or attaches to Chromium/Firefox/WebKit through Playwright, WebDriver BiDi, or CDP.
The service stores profile state in a host directory or capOS Store backend, but callers see only BrowserProfile caps.
The service enforces per-session operation grants and output budgets before returning DOM text, screenshots, network metadata, or downloads.
An MCP adapter can present the same tools to external agents, but MCP is an adapter, not the authority model.

This makes browser usage testable while capOS still lacks native GUI pieces. It also creates a practical compatibility path for agents that need the modern web during capOS development.

Track 1.5: Cap-Native Document Engine

The most capOS-shaped browser work may not be “port a full browser” first. There is a meaningful middle target: run the parts of the web stack that turn provided data into an interactive document – JavaScript, DOM, CSS, layout, rendering, and perhaps WebAssembly – while replacing browser-owned host APIs with capability-backed services.

In this model, the engine does not own raw networking, files, profile directories, clipboard, permissions, downloads, credentials, or extension installation. It receives a document/resource graph and a bundle of explicit host caps. Each document bundle also needs a broker- or ResourceLoader-minted web principal: an explicit origin, package origin, or opaque origin plus base URL policy used for relative URLs, storage partitioning, fetch checks, audit records, and user-facing permission prompts. Opaque origins are the default for caller-provided bundles; a real web or package origin requires authority or attestation from the loader that supplied the bytes. Web APIs become host bindings:

| Web-facing operation | capOS-backed authority |
| --- | --- |
| `fetch()` / subresource load | `HttpEndpoint`, `Fetch`, or content-addressed `ResourceLoader` cap. |
| cookies / local storage / IndexedDB | `BrowserProfileStore` or narrower origin-scoped `KvStore` cap. |
| file picker / upload | user-approved `FileRead` or artifact cap. |
| downloads | `DownloadSink` / `StoreWriter` cap. |
| clipboard | explicit `ClipboardRead` / `ClipboardWrite` caps. |
| geolocation, camera, microphone | future sensor/media caps, never implicit. |
| workers / timers | scheduler and resource-budget caps. |
| WebAssembly imports | explicit host import caps, not ambient syscalls. |

Document-engine Wasm hosting has the same shape as the WASI Host Adapter: a userspace process holds the Wasm runtime and binds each import to an explicit capOS capability passed in through its bootstrap CapSet, rather than letting module code reach for ambient syscalls. Phases W.3/W.4 of that proposal already grant per-instance bounded text (argv, environment) and typed EntropySource-backed random_get through narrowed broker grants. The cap-native document engine should reuse the same bootstrap CapSet convention and per-instance grant shape when it eventually hosts JS/Wasm runtimes inside the browser stack, so that fetch, storage, clipboard, and random_get bindings stay authority-by-grant.
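
One way to picture authority-by-grant host bindings is a lookup that fails closed when no cap was passed in. The sketch below is only an illustration: `HostBinding`, `CapHandle`, `CapSet`, and `BindingError` are hypothetical stand-ins, and a real bootstrap CapSet would carry typed capOS capabilities rather than integer handles.

```rust
use std::collections::HashMap;

/// Hypothetical names for a few of the web-facing operations in the table above.
#[derive(Debug, Clone, Copy, PartialEq, Eq, Hash)]
pub enum HostBinding {
    Fetch,
    Storage,
    ClipboardRead,
    ClipboardWrite,
    Download,
}

#[derive(Debug, PartialEq, Eq)]
pub enum BindingError {
    /// No capability was granted for this binding.
    NotGranted(HostBinding),
}

/// Stand-in for a capability handle; a real CapSet would hold typed caps.
#[derive(Debug, Clone, PartialEq, Eq)]
pub struct CapHandle(pub u32);

/// Hypothetical per-instance CapSet handed to the engine at bootstrap.
#[derive(Default)]
pub struct CapSet {
    granted: HashMap<HostBinding, CapHandle>,
}

impl CapSet {
    /// Builder-style grant, used by whoever constructs the instance.
    pub fn grant(mut self, binding: HostBinding, cap: CapHandle) -> Self {
        self.granted.insert(binding, cap);
        self
    }

    /// Resolve a binding at call time; an absent cap is an error, never a fallback.
    pub fn resolve(&self, binding: HostBinding) -> Result<&CapHandle, BindingError> {
        self.granted
            .get(&binding)
            .ok_or(BindingError::NotGranted(binding))
    }
}
```

With an empty `CapSet`, every web-facing operation resolves to `NotGranted`: no network, storage, clipboard, or downloads exist until someone grants them.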

This track is useful for three reasons:

It gives capOS a native HTML/CSS/JS application substrate without waiting for all of ordinary web browsing. Documentation, setup flows, dashboards, adventure/Paperclips UIs, and local admin apps could be rendered from trusted or packaged resources before arbitrary internet browsing is safe.
It lets the project design web API host bindings around capabilities from the start. A later full browser can reuse the same profile, fetch, storage, permission, and artifact services instead of hiding them inside an engine.
It is a smaller research target for engine embedding. Servo, Ladybird, and WebKit/WPE can be evaluated as document/rendering substrates, while SpiderMonkey, JavaScriptCore, Boa, or QuickJS can be evaluated as JS/Wasm runtime components or host-binding proof substrates without committing to an entire general-purpose browser port.

The accepted first shape should be conservative:

Load documents from a DocumentBundle or ResourceLoader cap, not from a URL bar.
Require every bundle principal to be minted or validated by the broker or ResourceLoader, and partition fetch, storage, cache, and audit state by profile/context/session plus that principal.
Disable arbitrary internet subresource fetch until a caller grants a narrowed Fetch/HttpEndpoint.
Produce a rendered surface or screenshot artifact plus a bounded accessibility/DOM snapshot.
Treat every Web API host binding as a separate facet and require explicit broker grants.
Avoid extension APIs, service workers, persistent background sync, notifications, WebRTC, and device APIs until their capOS authority model is clear.
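
The principal rule in this list can be sketched as a minter that defaults to an opaque origin and issues a package or web origin only when the loader vouches for the bytes. All names are hypothetical. Downgrading an unattested claim to an opaque origin, rather than rejecting the bundle, is an assumption of this sketch and not something the proposal decides.

```rust
/// Web principal attached to a document bundle.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub enum Principal {
    /// Default for caller-provided bundles; each opaque origin is unique.
    Opaque(u64),
    /// Package origin vouched for by the loader.
    Package(String),
    /// Real web origin vouched for by the loader.
    Web(String),
}

/// What the caller claims about the bundle (hypothetical).
pub enum ClaimedOrigin {
    None,
    Package(String),
    Web(String),
}

/// Hypothetical minter owned by the broker or ResourceLoader.
pub struct PrincipalMinter {
    next_opaque: u64,
}

impl PrincipalMinter {
    pub fn new() -> Self {
        Self { next_opaque: 0 }
    }

    /// Issue a non-opaque principal only when the loader attests the claim.
    /// Sketch assumption: unattested claims are downgraded, not rejected.
    pub fn mint(&mut self, claim: ClaimedOrigin, loader_attested: bool) -> Principal {
        match (claim, loader_attested) {
            (ClaimedOrigin::Package(name), true) => Principal::Package(name),
            (ClaimedOrigin::Web(origin), true) => Principal::Web(origin),
            _ => {
                let id = self.next_opaque;
                self.next_opaque += 1;
                Principal::Opaque(id)
            }
        }
    }
}

/// Partition key for fetch, storage, cache, and audit state.
#[derive(Debug, Clone, PartialEq, Eq, Hash)]
pub struct PartitionKey {
    pub profile: u64,
    pub context: u64,
    pub session: u64,
    pub principal: Principal,
}
```

Because every opaque principal gets a fresh identifier, two caller-provided bundles in the same session still land in different partitions.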

The self-served remote-session web UI is an application-hosting instance of this middle track, not a general browser milestone. The UI bundle is an immutable boot-package resource served by a capOS service through scoped listener authority; browser JavaScript is still ordinary untrusted page code. The capOS service, not the page, holds the remote session CapSet and service proxies, then exposes browser-safe view models and user-event commands over same-origin HTTP routes. This keeps the first proof aligned with the browser capability rule that JavaScript never receives raw capOS caps, shell or spawn authority, endpoint owner handles, storage roots, or host identity hints. The Remote Session UI Security proposal owns the concrete web-security posture for that bridge – per-browser-session isolation, CSRF/CSP/cookie posture, transcript redaction, and the Tauri desktop wrapper’s reduced webview surface – and is the load-bearing precedent for how a cap-native document engine should treat its same-origin DTO channel: the Rust/backend authority boundary, not page JavaScript, holds upstream capOS handles.

This is still not a toy scripting widget. Running hostile JavaScript against a DOM/layout engine remains a large TCB, and rendering bugs can be security bugs. The point is to narrow the host-platform surface: provided data in, rendered surface/snapshot/artifacts out, and every side effect through typed caps.

Track 2: Visual Browser After GUI

A human-facing browser should be a normal capOS GUI application once these prerequisites exist:

compositor and input service
font discovery/rasterization
userspace networking and TLS
Store/Namespace-backed profile persistence
download/upload mediation
shared-memory graphics buffers or GPU session caps
process crash/restart handling
brokered user-session profile policy

Candidate engine paths:

| Engine path | Role | capOS assessment |
| --- | --- | --- |
| Chromium Ozone / CEF | Maximum compatibility and automation ecosystem | Best external/backend choice; native port is very large. |
| WPE WebKit | Embedded visual browser candidate | Plausible post-GUI engine because WPE is designed for embedded backends. |
| Gecko / GeckoView | Browser diversity and principal-model precedent | Good external backend; GeckoView itself is Android-specific. |
| Servo | Rust/modular research-aligned engine | Track closely; not first broad-compatibility choice. |
| Ladybird / LibWeb | Independent-engine precedent | Track for architecture; not a near-term dependency. |

The visual browser should reuse the agent/shell profile/session model instead of inventing a second profile stack. A GUI tab is a BrowserSession with a visual BrowserView surface attached. Closing the window should not silently destroy profile state unless the profile cap is ephemeral.

Donut Browser Ideas To Adapt

Donut Browser is useful because it treats browser profiles as first-class, scriptable objects and exposes local REST/MCP automation. capOS should adapt the capability-shaped parts:

Unlimited local profiles map to broker-minted BrowserProfile caps.
Profile groups map to policy bundles and user-session grants.
Per-profile cookies/storage/extensions map to Store-backed state owned by the profile cap.
Per-profile proxy/VPN selection maps to explicit network-route caps.
Local REST/MCP maps to a typed capOS service plus optional external adapter.
Persistent automation sessions map to BrowserContext lifetimes and snapshots.
Default-browser link routing maps to a broker decision: which profile/context should open a URL for this user/session?

capOS should not adopt Donut’s anti-detect promise. If capOS supports persona controls such as viewport, locale, timezone, user agent, geolocation, WebRTC policy, or fingerprint reduction, those controls should be explicit BrowserPersona policy with audit and user-facing disclosure.

Security Boundary

Browser work adds these trust boundaries:

Web content to browser engine. Untrusted JavaScript, media, fonts, and documents hit a large engine TCB. Native browser work should keep renderer, network, image decode, and profile services separated where the backend permits it.
Browser engine to capOS. The engine must not receive broad shell caps. Its only capOS authorities should be its granted network route, profile store, artifact sink, and visual/input surfaces.
Agent to browser service. The agent sees tool descriptors and bounded snapshots, not backend debug ports.
Browser downloads to storage. Downloaded bytes are untrusted artifacts until a user or policy process imports them into a namespace.
Browser uploads to web origin. Upload requires explicit file/artifact authority and must record the destination origin.
Profile to profile. Cookies, storage, cache, extension state, and persona policy must not bleed across profiles unless a broker grants an explicit clone/import/export operation.

Raw CDP or BiDi access is BrowserAdmin authority. It should be held only by the browser service supervisor and developer harnesses, not by ordinary shell sessions.

Phased Plan

Phase A: Host-Backed Agent Browser

Add a host-side or userspace browser service proof that exposes a narrowed BrowserSession over an existing browser backend.
Use a fake-model or scripted-agent QEMU/host proof first: navigate to a local page, read a bounded snapshot, click/type, capture a screenshot artifact, and close the session.
Record audit output for each action and show that the caller never receives raw CDP/BiDi.

Phase B: Standard Shell Tool

Add native shell and agent-runner integration so browser.open, browser.snapshot, and browser.screenshot are standard tools when the broker grants a browser bundle.
Add MCP adapter support for external agents using the same typed operation set.
Add download/upload gates once Store/Namespace and artifact caps exist.

Phase C: Cap-Native Document Engine Proof

Add a restricted DocumentBundle proof that renders packaged HTML/CSS/JS to a screenshot or simple surface and emits a bounded accessibility/DOM snapshot.
Wire at least one host API, such as fetch from a preloaded resource bundle or a profile-scoped key/value store, through a typed capability.
Prove that absent caps fail closed: no network, no profile storage, no clipboard, and no downloads by default.

Phase D: In-capOS Headless Browser Backend

Port or package a browser backend process once userspace networking, storage, fonts, and threads are mature enough.
Prefer a backend that can run without a full visible GUI surface but still supports screenshots and accessibility/DOM snapshots.
Preserve the same BrowserSession ABI so agents do not notice the backend change.

Phase E: Visual Browser

Add BrowserView/window integration after compositor/input support exists.
Reuse BrowserProfile and BrowserSession for tabs/windows.
Add user-facing profile picker, permissions UI, downloads UI, and audit view.

Relationship To Existing Proposals

Browser/WASM is about capOS as a browser-hosted runtime. This proposal is about capOS exposing browser capability services.
Language Models and the Agent Runtime owns the model/tool-call loop. Browser sessions are one tool family.
Shell and Interactive Command Surfaces own command exposure. Browser operations should appear there as typed tools, not string commands tunneled to an automation port.
Networking, Storage and Naming, and GPU Capability provide prerequisites for a native visual browser. The networking proposal owns the userspace TCP/IP and TLS authority the broker eventually narrows into Fetch, HttpEndpoint, and per-profile proxy/route caps; a browser engine never sees raw socket authority.
Remote Session UI Security defines the web-security posture for the trusted local remote-session-ui bridge and its Tauri desktop wrapper. It is the concrete precedent for the cap-native document engine’s “Rust/backend authority boundary, not page JavaScript, holds capOS handles” rule.
WASI Host Adapter ships the typed capability boundary for sandboxed WebAssembly imports. The cap-native document engine’s Wasm bindings should reuse the same bootstrap CapSet convention and per-instance grant shape (argv, env, entropy, and – once their authority surfaces exist – filesystem and sockets) rather than inventing a parallel browser-only Wasm host.

Open Questions

Should the first implementation wrap Playwright for breadth, raw CDP for smaller dependencies, or WebDriver BiDi for standards alignment?
What is the minimal page snapshot that remains useful to an LLM while limiting token use and accidental data disclosure?
Should BrowserPersona support fingerprint reduction only, or also compatibility personas for testing?
How should extensions be represented: profile-owned package state, separately granted extension caps, or both?
How should a visual browser present capOS capability prompts without training users to approve every web-origin request blindly?
