# Proposal: Browser Capability and Agent Web Sessions

How capOS should expose the web without turning a browser into an ambiently
privileged desktop escape hatch.

This proposal is intentionally split into three tracks:

- **After GUI:** a full visual browser for humans, with windows, input,
  rendering, profiles, downloads, extensions, and ordinary web compatibility.
- **Agent/shell usage:** a standard `BrowserSession` capability that lets
  shells and AI agents navigate, inspect, screenshot, fill forms, download,
  and collect evidence through a brokered browser service before capOS has a
  native GUI browser.
- **Cap-native document engine:** an intermediate path that runs JS, DOM/CSS,
  layout, and rendering over caller-provided document/resource data, with
  fetch, storage, permissions, clipboard, downloads, and host I/O wired to
  native capOS capabilities instead of a browser-owned ambient platform.

The existing [Browser/WASM](browser-wasm-proposal.md) proposal runs capOS in a
browser tab. This proposal is the inverse: capOS exposes browser capabilities
to users, services, and agents.

Grounding research:
[Browser Engines, Document Engines, and Agent Browsers](../research/browser-engines-and-agent-browsers.md).

## Problem

The web is both a user interface substrate and a huge authority boundary. A
browser can read credentials, perform network requests, upload local files,
download untrusted bytes, run JavaScript from hostile origins, track users
through profiles, and expose debug protocols powerful enough to rewrite page
state.

On a conventional OS, that power is hidden behind process permissions, profile
directories, and implicit user intent. capOS needs a browser model that fits
the capability system:

- Profiles and sessions are explicit authority.
- Network routes, downloads, uploads, credentials, and automation are scoped.
- Browser JavaScript does not get shell or storage authority by accident.
- Agents can use the web as a tool without receiving raw CDP, filesystem, or
  network capabilities.

## Non-Goals

- Writing a new browser engine for the first capOS browser milestone.
- Porting Chromium, WebKit, Gecko, Servo, or Ladybird before the GUI,
  userspace networking/storage, fonts, and driver-safety prerequisites exist.
- Treating anti-detection, fingerprint evasion, scraping at scale, or bot
  bypass as a capOS product goal.
- Exposing raw Chrome DevTools Protocol, WebDriver BiDi, or Playwright handles
  as ordinary user/session capabilities.
- Letting browser-hosted JavaScript hold raw capOS shell, launch, file, or
  network capabilities.

## Design Principles

1. **Browser state is authority.** A profile's cookies, local storage,
   permissions, saved credentials, cache, proxy route, and downloads are not
   implementation details. They are held through `BrowserProfile` and
   `BrowserContext` capabilities.

2. **The interface is the permission.** A caller that can navigate does not
   automatically get DOM inspection, screenshot, input, download, upload,
   network interception, profile mutation, or automation-debug authority.

3. **Agents receive tools, not admin ports.** CDP and WebDriver BiDi are
   backend protocols for the trusted browser service. The agent-facing ABI is
   a typed, narrowed capability surface.

4. **Origins become visible policy inputs.** Browser decisions should record
   origin, top-level site, profile, user session, persona, network route, and
   initiator. URL strings alone are not enough.

5. **Downloads and uploads cross explicit caps.** A download returns a
   `BrowserArtifact` or writes through a granted `DownloadSink`. Uploading a
   file requires a granted read cap for that object and a per-action policy
   decision.

6. **Automation is auditable.** Browser actions initiated by an agent are
   logged with the page/session, operation, typed arguments, permission mode,
   result, and artifacts captured for later review.

7. **Visual browsing waits for GUI.** A human browser is a real app, not a
   terminal command. It should land only after compositor/input/font/storage
   and userspace networking foundations are credible.

8. **A browser can be headless before it is native.** The early
   agent/shell-facing capability may be served by a host-side browser,
   a development-machine sidecar, a Linux companion process, or a remote
   browser service. The capOS ABI should not expose which backend serves it.
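
As a rough illustration of principles 4 and 6, the sketch below shows one possible shape for an audit record. It is written in Python purely for readability; the `AuditRecord` and `AuditLog` names, the field names, and all example values are assumptions made for this sketch, not part of any capOS schema.

```python
import json
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class AuditRecord:
    # Policy inputs listed in principle 4 (field names are hypothetical).
    origin: str
    top_level_site: str
    profile: str
    user_session: str
    persona: str
    network_route: str
    initiator: str
    # Action details listed in principle 6.
    session: str
    operation: str
    arguments: dict
    permission_mode: str
    result: str
    artifacts: tuple = ()


class AuditLog:
    """Append-only log that stores one JSON object per record."""

    def __init__(self) -> None:
        self._lines = []

    def append(self, record: AuditRecord) -> None:
        self._lines.append(json.dumps(asdict(record), sort_keys=True))

    def lines(self) -> tuple:
        return tuple(self._lines)


log = AuditLog()
log.append(AuditRecord(
    origin="https://example.test",
    top_level_site="example.test",
    profile="profile-1",
    user_session="user-session-1",
    persona="default",
    network_route="direct",
    initiator="agent-runner",
    session="session-1",
    operation="screenshot",
    arguments={"fullPage": "false"},
    permission_mode="granted-by-policy",
    result="ok",
    artifacts=("artifact-1",),
))
print(log.lines()[0])
```

The record carries the origin and initiator alongside the URL-independent identifiers, so a reviewer can reconstruct who asked for what without parsing URL strings.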

## Track 1: Agent/Shell Browser Capability

This is the near-term conceptual track. It gives capOS agents and shells a
standard web tool without waiting for a compositor or native browser port.

Conceptual interfaces:

```capnp
interface BrowserBroker {
  # Mints profiles and contexts according to session policy.
  createProfile @0 (request :BrowserProfileRequest) -> (profile :BrowserProfile);
  openContext @1 (profile :BrowserProfile, policy :BrowserContextPolicy)
      -> (context :BrowserContext);
}

interface BrowserContext {
  # One isolated browsing context under a profile.
  openSession @0 (persona :BrowserPersona) -> (session :BrowserSession);
  snapshot @1 () -> (profileSnapshot :BrowserProfileSnapshot);
  destroy @2 () -> ();
}

interface BrowserSession {
  # Lifetime handle only; carries no operation authority by itself.
  close @0 () -> ();
}

interface BrowserNavigate {
  # Navigate within one session.
  navigate @0 (url :Text, wait :NavigationWait) -> (result :NavigationResult);
}

interface BrowserReadPage {
  # Inspect page state under an output budget.
  readPage @0 (budget :PageReadBudget) -> (snapshot :PageSnapshot);
}

interface BrowserScreenshot {
  # Capture a screenshot artifact under policy.
  screenshot @0 (options :ScreenshotOptions) -> (image :BrowserArtifact);
}

interface BrowserInput {
  # Click, type, and select; upload needs an explicit grant.
  input @0 (action :InputAction) -> (result :InputResult);
}

interface BrowserDownload {
  # Initiate a browser download into a granted sink.
  download @0 (selector :DownloadSelector, sink :DownloadSink)
      -> (artifact :BrowserArtifact);
}
```

The exact schema belongs in a later implementation slice. The important rule is
that `BrowserSession` is only a lifetime handle for one browsing context. It
does not imply navigation, inspection, screenshot, input, download, upload,
network-observer, or debug authority. The broker mints only the operation
facets allowed by the caller's session policy, and the shell/agent runner
advertises only tools backed by facets it actually holds.

| Capability | Authority |
| --- | --- |
| `BrowserBroker` | Mint profiles and contexts according to session policy. |
| `BrowserProfile` | Own persistent browser state and profile lifecycle. |
| `BrowserContext` | Own one isolated browsing context under a profile. |
| `BrowserSession` | Hold and close one session lifetime; no operation authority by itself. |
| `BrowserNavigate` | Navigate within one session. |
| `BrowserReadPage` | Inspect page state under output budgets. |
| `BrowserScreenshot` | Capture screenshot artifacts under policy. |
| `BrowserInput` | Click, type, and select; upload only with an explicit file/artifact grant. |
| `BrowserDownload` | Initiate browser downloads into a granted sink. |
| `DownloadSink` | Receive bytes/artifacts from browser downloads. |
| `BrowserNetworkObserver` | Read network metadata or bodies under redaction policy. |
| `BrowserAdmin` | Backend-only: raw CDP/BiDi, crash dumps, trace, profile mutation. |
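
The facet rule can be sketched in a few lines. This is a minimal Python model, not the capOS ABI: `Facet`, `Session`, `mint`, and the facet names are hypothetical stand-ins for the conceptual interfaces above, and Python object references only approximate real capabilities.

```python
# Hypothetical facet names mirroring the conceptual interfaces above.
ALL_FACETS = {"navigate", "read_page", "screenshot", "input", "download"}


class Session:
    """Lifetime handle only: it can be closed, and nothing else."""

    def __init__(self) -> None:
        self.closed = False

    def close(self) -> None:
        self.closed = True


class Facet:
    """One operation capability bound to a single session."""

    def __init__(self, name: str, session: Session) -> None:
        self.name = name
        self._session = session

    def invoke(self, *args: object) -> str:
        if self._session.closed:
            raise PermissionError(f"{self.name}: session is closed")
        return f"{self.name}{args!r}"


def mint(policy: set) -> tuple:
    """Return a session plus only the facets the policy allows."""
    unknown = policy - ALL_FACETS
    if unknown:
        raise ValueError(f"unknown facets: {sorted(unknown)}")
    session = Session()
    return session, {name: Facet(name, session) for name in sorted(policy)}


session, facets = mint({"navigate", "read_page"})
print(sorted(facets))           # the only tools a runner may advertise
print("screenshot" in facets)   # not granted, so there is no object to call
```

Holding `session` alone gives the caller nothing to navigate or inspect with; authority exists only where a facet object was actually minted.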

### Agent Tool Shape

The native shell or agent runner advertises browser operations as ordinary
tools:

- `browser.open(url)`
- `browser.snapshot()`
- `browser.screenshot()`
- `browser.click(ref)`
- `browser.type(ref, text)`
- `browser.select(ref, value)`
- `browser.download(ref)`
- `browser.close()`

The tool result is structured:

- page title, URL, origin, load state
- accessibility/DOM references under stable short IDs
- visible text and form fields under a token/byte budget
- screenshot artifact cap, when requested
- network/download artifacts only when separately allowed

The model never receives the `BrowserSession` cap or any operation facet. It
proposes tool calls; the runner executes them after policy and consent checks,
then feeds bounded results back to the model. This matches
[Language Models and the Agent Runtime](llm-and-agent-proposal.md).
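
A minimal sketch of that runner loop follows, in Python for illustration. The `Runner` class, the `TOOL_FACET` mapping, and the byte budget are assumptions of this sketch; consent prompts and structured page references are left out, and plain callables stand in for facets.

```python
from typing import Callable

# Hypothetical mapping from tool names to the facet each one requires.
TOOL_FACET = {
    "browser.open": "navigate",
    "browser.snapshot": "read_page",
    "browser.screenshot": "screenshot",
    "browser.click": "input",
}


class Runner:
    def __init__(self, facets: dict, byte_budget: int) -> None:
        self._facets: dict[str, Callable[..., str]] = facets
        self._byte_budget = byte_budget

    def advertised_tools(self) -> list:
        """Only tools backed by a held facet are shown to the model."""
        return sorted(t for t, f in TOOL_FACET.items() if f in self._facets)

    def execute(self, tool: str, *args: str) -> dict:
        """Run one model-proposed call and return a bounded result."""
        facet = TOOL_FACET.get(tool)
        if facet is None or facet not in self._facets:
            return {"ok": False, "error": f"tool not granted: {tool}"}
        raw = self._facets[facet](*args).encode("utf-8")
        clipped = raw[: self._byte_budget]
        return {
            "ok": True,
            # Drop any partial multi-byte character left by the byte cut.
            "text": clipped.decode("utf-8", errors="ignore"),
            "truncated": len(raw) > len(clipped),
        }


runner = Runner(
    facets={
        "navigate": lambda url: f"loaded {url}",
        "read_page": lambda: "visible text " * 50,
    },
    byte_budget=64,
)
print(runner.advertised_tools())
print(runner.execute("browser.snapshot"))
print(runner.execute("browser.click", "e12"))
```

The model only ever sees the dictionaries returned by `execute`; the callables that hold authority stay inside the runner.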

### Backend Strategy

The first implementation should be a userspace service or host-side harness
that owns a real browser and exposes the typed capOS surface:

1. Browser service launches or attaches to Chromium/Firefox/WebKit through
   Playwright, WebDriver BiDi, or CDP.
2. The service stores profile state in a host directory or capOS Store backend,
   but callers see only `BrowserProfile` caps.
3. The service enforces per-session operation grants and output budgets before
   returning DOM text, screenshots, network metadata, or downloads.
4. An MCP adapter can present the same tools to external agents, but MCP is an
   adapter, not the authority model.

This makes browser usage testable while capOS still lacks native GUI pieces.
It also creates a practical compatibility path for agents that need the modern
web during capOS development.
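
One way to picture steps 1 and 2 is a service that keeps the backend and the profile storage location private. The Python sketch below assumes a hypothetical `Backend` protocol, a `FakeBackend` test double, and an illustrative `/var/browser-state` root; a real adapter would wrap Playwright, WebDriver BiDi, or CDP behind the same methods, and a real capability would be an unforgeable reference rather than a token.

```python
import secrets
from typing import Protocol


class Backend(Protocol):
    """What the service needs from any browser backend (hypothetical)."""

    def goto(self, profile_dir: str, url: str) -> str: ...


class FakeBackend:
    """In-memory stand-in for a real browser, for tests only."""

    def __init__(self) -> None:
        self.calls = []

    def goto(self, profile_dir: str, url: str) -> str:
        self.calls.append((profile_dir, url))
        return "loaded"


class ProfileCap:
    """Opaque handle: carries no path and no backend detail."""

    def __init__(self, token: str) -> None:
        self.token = token


class BrowserService:
    def __init__(self, backend: Backend, state_root: str) -> None:
        self._backend = backend
        self._state_root = state_root
        self._profiles = {}  # token -> backend-private state directory

    def create_profile(self) -> ProfileCap:
        token = secrets.token_hex(8)
        self._profiles[token] = f"{self._state_root}/{token}"
        return ProfileCap(token)

    def navigate(self, profile: ProfileCap, url: str) -> dict:
        profile_dir = self._profiles.get(profile.token)
        if profile_dir is None:
            raise PermissionError("unknown profile cap")
        return {"url": url, "load_state": self._backend.goto(profile_dir, url)}


backend = FakeBackend()
service = BrowserService(backend, state_root="/var/browser-state")
cap = service.create_profile()
result = service.navigate(cap, "https://example.test")
print(result)
```

Swapping `FakeBackend` for a real adapter would not change what callers hold or see, which is the property the typed surface is meant to preserve.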

## Track 1.5: Cap-Native Document Engine

The most capOS-shaped browser work may not start with porting a full browser.
There is a meaningful middle target: run the parts of the web stack that turn
provided data into an interactive document -- JavaScript, DOM, CSS, layout,
rendering, and perhaps WebAssembly -- while replacing browser-owned host APIs
with capability-backed services.

In this model, the engine does **not** own raw networking, files, profile
directories, clipboard, permissions, downloads, credentials, or extension
installation. It receives a document/resource graph and a bundle of explicit
host caps. Each document bundle also needs a broker- or `ResourceLoader`-minted
web principal: an explicit origin, package origin, or opaque origin plus base
URL policy used for relative URLs, storage partitioning, fetch checks, audit
records, and user-facing permission prompts. Opaque origins are the default for
caller-provided bundles; a real web or package origin requires authority or
attestation from the loader that supplied the bytes. Web APIs become host
bindings:

| Web-facing operation | capOS-backed authority |
| --- | --- |
| `fetch()` / subresource load | `HttpEndpoint`, `Fetch`, or content-addressed `ResourceLoader` cap. |
| cookies / local storage / IndexedDB | `BrowserProfileStore` or narrower origin-scoped `KvStore` cap. |
| file picker / upload | user-approved `FileRead` or artifact cap. |
| downloads | `DownloadSink` / `StoreWriter` cap. |
| clipboard | explicit `ClipboardRead` / `ClipboardWrite` caps. |
| geolocation, camera, microphone | future sensor/media caps, never implicit. |
| workers / timers | scheduler and resource-budget caps. |
| WebAssembly imports | explicit host import caps, not ambient syscalls. |
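
The binding table can be read as "each Web API entry point checks for a backing cap and fails closed without one". The Python sketch below illustrates that reading under stated assumptions: `HostBindings`, `MissingCapability`, and the `app://docs/index.html` URL are invented for the example, and only three of the table's rows are modeled.

```python
from typing import Callable, Optional


class MissingCapability(Exception):
    """Raised when page code calls a Web API with no backing cap."""


class HostBindings:
    """Web API entry points an engine would call (hypothetical shape)."""

    def __init__(
        self,
        fetch: Optional[Callable[[str], bytes]] = None,
        kv_store: Optional[dict] = None,
        clipboard_write: Optional[Callable[[str], None]] = None,
    ) -> None:
        self._fetch = fetch
        self._kv_store = kv_store
        self._clipboard_write = clipboard_write

    def fetch(self, url: str) -> bytes:
        if self._fetch is None:
            raise MissingCapability("fetch")
        return self._fetch(url)

    def local_storage_set(self, key: str, value: str) -> None:
        if self._kv_store is None:
            raise MissingCapability("storage")
        self._kv_store[key] = value

    def clipboard_write(self, text: str) -> None:
        if self._clipboard_write is None:
            raise MissingCapability("clipboard")
        self._clipboard_write(text)


# A preloaded bundle stands in for a content-addressed ResourceLoader cap.
bundle = {"app://docs/index.html": b"<h1>Setup</h1>"}
bindings = HostBindings(fetch=lambda url: bundle[url])

print(bindings.fetch("app://docs/index.html"))
try:
    bindings.local_storage_set("theme", "dark")
except MissingCapability as missing:
    print("denied:", missing)
```

Here the document can load packaged resources but has no storage or clipboard at all, because no such cap was passed in.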

This track is useful for three reasons:

1. It gives capOS a native HTML/CSS/JS application substrate without waiting
   for all of ordinary web browsing. Documentation, setup flows, dashboards,
   adventure/Paperclips UIs, and local admin apps could be rendered from
   trusted or packaged resources before arbitrary internet browsing is safe.
2. It lets the project design web API host bindings around capabilities from
   the start. A later full browser can reuse the same profile, fetch, storage,
   permission, and artifact services instead of hiding them inside an engine.
3. It is a smaller research target for engine embedding. Servo, Ladybird, and
   WebKit/WPE can be evaluated as document/rendering substrates, while
   SpiderMonkey, JavaScriptCore, Boa, or QuickJS can be evaluated as JS/Wasm
   runtime components or host-binding proof substrates without committing to an
   entire general-purpose browser port.

The accepted first shape should be conservative:

- Load documents from a `DocumentBundle` or `ResourceLoader` cap, not from a
  URL bar.
- Require every bundle principal to be minted or validated by the broker or
  `ResourceLoader`, and partition fetch, storage, cache, and audit state by
  profile/context/session plus that principal.
- Disable arbitrary internet subresource fetch until a caller grants a
  narrowed `Fetch`/`HttpEndpoint`.
- Produce a rendered surface or screenshot artifact plus a bounded
  accessibility/DOM snapshot.
- Treat every Web API host binding as a separate facet and require explicit
  broker grants.
- Avoid extension APIs, service workers, persistent background sync,
  notifications, WebRTC, and device APIs until their capOS authority model is
  clear.
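
The principal and partitioning rules from the list above can be sketched as follows. This Python model is an assumption-laden illustration: `Principal`, `mint_principal`, and `partition_key` are hypothetical names, package origins are omitted, and attestation is reduced to a boolean that a real loader would have to establish itself.

```python
import secrets
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Principal:
    kind: str   # "opaque" or "web" in this sketch
    value: str  # unique token for opaque, origin string for web


def mint_principal(claimed_origin: Optional[str] = None,
                   attested: bool = False) -> Principal:
    """Opaque by default; a real origin needs loader attestation."""
    if claimed_origin is not None and attested:
        return Principal("web", claimed_origin)
    # An unattested claim is ignored rather than trusted.
    return Principal("opaque", secrets.token_hex(8))


def partition_key(profile: str, context: str, session: str,
                  principal: Principal) -> tuple:
    """Key under which fetch, storage, cache, and audit state is filed."""
    return (profile, context, session, principal.kind, principal.value)


first = mint_principal()
second = mint_principal("https://example.test")  # claim without attestation
trusted = mint_principal("https://example.test", attested=True)

print(first.kind, second.kind, trusted.kind)
print(partition_key("profile-1", "ctx-1", "session-1", trusted))
```

Two caller-provided bundles never share a partition by accident, because each opaque principal gets its own random value.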

This is still not a toy scripting widget. Running hostile JavaScript against a
DOM/layout engine remains a large TCB, and rendering bugs can be security bugs.
The point is to narrow the host-platform surface: provided data in, rendered
surface/snapshot/artifacts out, and every side effect through typed caps.

## Track 2: Visual Browser After GUI

A human-facing browser should be a normal capOS GUI application once these
prerequisites exist:

- compositor and input service
- font discovery/rasterization
- userspace networking and TLS
- Store/Namespace-backed profile persistence
- download/upload mediation
- shared-memory graphics buffers or GPU session caps
- process crash/restart handling
- brokered user-session profile policy

Candidate engine paths:

| Engine path | Role | capOS assessment |
| --- | --- | --- |
| Chromium Ozone / CEF | Maximum compatibility and automation ecosystem | Best external/backend choice; native port is very large. |
| WPE WebKit | Embedded visual browser candidate | Plausible post-GUI engine because WPE is designed for embedded backends. |
| Gecko / GeckoView | Browser diversity and principal-model precedent | Good external backend; GeckoView itself is Android-specific. |
| Servo | Rust/modular research-aligned engine | Track closely; not first broad-compatibility choice. |
| Ladybird / LibWeb | Independent-engine precedent | Track for architecture; not a near-term dependency. |

The visual browser should reuse the agent/shell profile/session model instead
of inventing a second profile stack. A GUI tab is a `BrowserSession` with a
visual `BrowserView` surface attached. Closing the window should not silently
destroy profile state unless the profile cap is ephemeral.
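
The window-close rule is small enough to state as code. The Python sketch below uses hypothetical `Profile` and `Tab` classes and assumes one tab per profile; a real design would also have to decide what happens when several tabs share an ephemeral profile.

```python
class Profile:
    def __init__(self, ephemeral: bool) -> None:
        self.ephemeral = ephemeral
        self.cookies = {}
        self.destroyed = False


class Tab:
    """A session with a visual surface attached (hypothetical shape)."""

    def __init__(self, profile: Profile) -> None:
        self.profile = profile
        self.open = True

    def close_window(self) -> None:
        self.open = False
        # Only an ephemeral profile loses its state when the window closes.
        if self.profile.ephemeral:
            self.profile.cookies.clear()
            self.profile.destroyed = True


persistent = Profile(ephemeral=False)
persistent.cookies["sid"] = "abc"
Tab(persistent).close_window()

scratch = Profile(ephemeral=True)
scratch.cookies["sid"] = "xyz"
Tab(scratch).close_window()

print(persistent.cookies, persistent.destroyed)
print(scratch.cookies, scratch.destroyed)
```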

## Donut Browser Ideas To Adapt

Donut Browser is useful because it treats browser profiles as first-class,
scriptable objects and exposes local REST/MCP automation. capOS should adapt
the capability-shaped parts:

- Unlimited local profiles map to broker-minted `BrowserProfile` caps.
- Profile groups map to policy bundles and user-session grants.
- Per-profile cookies/storage/extensions map to Store-backed state owned by
  the profile cap.
- Per-profile proxy/VPN selection maps to explicit network-route caps.
- Local REST/MCP maps to a typed capOS service plus optional external adapter.
- Persistent automation sessions map to `BrowserContext` lifetimes and
  snapshots.
- Default-browser link routing maps to a broker decision: which profile/context
  should open a URL for this user/session?

capOS should not adopt Donut's anti-detect promise. If capOS supports persona
controls such as viewport, locale, timezone, user agent, geolocation, WebRTC
policy, or fingerprint reduction, those controls should be explicit
`BrowserPersona` policy with audit and user-facing disclosure.

## Security Boundary

Browser work adds these trust boundaries:

- **Web content to browser engine.** Untrusted JavaScript, media, fonts, and
  documents hit a large engine TCB. Native browser work should keep renderer,
  network, image decode, and profile services separated where the backend
  permits it.
- **Browser engine to capOS.** The engine must not receive broad shell caps.
  Its only capOS authorities should be its granted network route, profile
  store, artifact sink, and visual/input surfaces.
- **Agent to browser service.** The agent sees tool descriptors and bounded
  snapshots, not backend debug ports.
- **Browser downloads to storage.** Downloaded bytes are untrusted artifacts
  until a user or policy process imports them into a namespace.
- **Browser uploads to web origin.** Upload requires explicit file/artifact
  authority and must record the destination origin.
- **Profile to profile.** Cookies, storage, cache, extension state, and
  persona policy must not bleed across profiles unless a broker grants an
  explicit clone/import/export operation.

Raw CDP or BiDi access is `BrowserAdmin` authority. It should be held only by
the browser service supervisor and developer harnesses, not by ordinary shell
sessions.
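
The download boundary can be illustrated with a small quarantine model. In this Python sketch, `Quarantine`, `Artifact`, and the `approved` flag are assumptions: the flag stands in for whatever user or policy decision capOS eventually requires, and the upload boundary is not modeled.

```python
import hashlib
from dataclasses import dataclass


@dataclass(frozen=True)
class Artifact:
    digest: str
    source_origin: str
    data: bytes


class Quarantine:
    """Holds downloaded bytes until an explicit import decision."""

    def __init__(self) -> None:
        self._held = {}
        self.namespace = {}  # stands in for a capOS namespace

    def receive(self, data: bytes, source_origin: str) -> str:
        digest = hashlib.sha256(data).hexdigest()
        self._held[digest] = Artifact(digest, source_origin, data)
        return digest

    def import_artifact(self, digest: str, name: str, approved: bool) -> None:
        if not approved:
            raise PermissionError("import requires a user or policy decision")
        self.namespace[name] = self._held.pop(digest).data


quarantine = Quarantine()
digest = quarantine.receive(b"report bytes", "https://example.test")
print(quarantine.namespace)  # still empty: nothing is imported implicitly
```

Calling `quarantine.import_artifact(digest, "report.pdf", approved=True)` is the only path by which the bytes become visible under a name.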

## Phased Plan

### Phase A: Host-Backed Agent Browser

- Add a host-side or userspace browser service proof that exposes a narrowed
  `BrowserSession` over an existing browser backend.
- Use a fake-model or scripted-agent QEMU/host proof first: navigate to a local
  page, read a bounded snapshot, click/type, capture a screenshot artifact,
  and close the session.
- Record audit output for each action and show that the caller never receives
  raw CDP/BiDi.

### Phase B: Standard Shell Tool

- Add native shell and agent-runner integration so `browser.open`,
  `browser.snapshot`, and `browser.screenshot` are standard tools when the
  broker grants a browser bundle.
- Add MCP adapter support for external agents using the same typed operation
  set.
- Add download/upload gates once `Store`/`Namespace` and artifact caps exist.

### Phase C: Cap-Native Document Engine Proof

- Add a restricted `DocumentBundle` proof that renders packaged HTML/CSS/JS
  to a screenshot or simple surface and emits a bounded accessibility/DOM
  snapshot.
- Wire at least one host API, such as fetch from a preloaded resource bundle
  or a profile-scoped key/value store, through a typed capability.
- Prove that absent caps fail closed: no network, no profile storage, no
  clipboard, and no downloads by default.

### Phase D: In-capOS Headless Browser Backend

- Port or package a browser backend process once userspace networking,
  storage, fonts, and threads are mature enough.
- Prefer a backend that can run without a full visible GUI surface but still
  supports screenshots and accessibility/DOM snapshots.
- Preserve the same `BrowserSession` ABI so agents do not notice the backend
  change.

### Phase E: Visual Browser

- Add `BrowserView`/window integration after compositor/input support exists.
- Reuse `BrowserProfile` and `BrowserSession` for tabs/windows.
- Add user-facing profile picker, permissions UI, downloads UI, and audit view.

## Relationship To Existing Proposals

- [Browser/WASM](browser-wasm-proposal.md) is about capOS as a browser-hosted
  runtime. This proposal is about capOS exposing browser capability services.
- [Language Models and the Agent Runtime](llm-and-agent-proposal.md) owns the
  model/tool-call loop. Browser sessions are one tool family.
- [Shell](shell-proposal.md) and
  [Interactive Command Surfaces](interactive-command-surface-proposal.md) own
  command exposure. Browser operations should appear there as typed tools, not
  string commands tunneled to an automation port.
- [Networking](networking-proposal.md), [Storage and Naming](storage-and-naming-proposal.md),
  and [GPU Capability](gpu-capability-proposal.md) provide prerequisites for a
  native visual browser.

## Open Questions

- Should the first implementation wrap Playwright for breadth, raw CDP for
  smaller dependencies, or WebDriver BiDi for standards alignment?
- What is the minimal page snapshot that remains useful to an LLM while
  limiting token use and accidental data disclosure?
- Should `BrowserPersona` support fingerprint reduction only, or also
  compatibility personas for testing?
- How should extensions be represented: profile-owned package state,
  separately granted extension caps, or both?
- How should a visual browser present capOS capability prompts without
  training users to approve every web-origin request blindly?