Proposal: Browser Capability and Agent Web Sessions

How capOS should expose the web without turning a browser into an ambiently privileged desktop escape hatch.

This proposal is intentionally split into three tracks:

  • After GUI: a full visual browser for humans, with windows, input, rendering, profiles, downloads, extensions, and ordinary web compatibility.
  • Agent/shell usage: a standard BrowserSession capability that lets shells and AI agents navigate, inspect, screenshot, fill forms, download, and collect evidence through a brokered browser service before capOS has a native GUI browser.
  • Cap-native document engine: an intermediate path that runs JS, DOM/CSS, layout, and rendering over caller-provided document/resource data, with fetch, storage, permissions, clipboard, downloads, and host I/O wired to native capOS capabilities instead of a browser-owned ambient platform.

The existing Browser/WASM proposal runs capOS in a browser tab. This proposal is the inverse: capOS exposes browser capabilities to users, services, and agents.

Grounding research: Browser Engines, Document Engines, and Agent Browsers.

Problem

The web is both a user interface substrate and a huge authority boundary. A browser can read credentials, perform network requests, upload local files, download untrusted bytes, run JavaScript from hostile origins, track users through profiles, and expose debug protocols powerful enough to rewrite page state.

On a conventional OS that power is hidden behind process permissions, profile directories, and implicit user intent. capOS needs a browser model that fits the capability system:

  • Profiles and sessions are explicit authority.
  • Network routes, downloads, uploads, credentials, and automation are scoped.
  • Browser JavaScript does not get shell or storage authority by accident.
  • Agents can use the web as a tool without receiving raw CDP, filesystem, or network capabilities.

Non-Goals

  • Writing a new browser engine for the first capOS browser milestone.
  • Porting Chromium, WebKit, Gecko, Servo, or Ladybird before the GUI, userspace networking/storage, fonts, and driver-safety prerequisites exist.
  • Treating anti-detection, fingerprint evasion, scraping at scale, or bot bypass as a capOS product goal.
  • Exposing raw Chrome DevTools Protocol, WebDriver BiDi, or Playwright handles as ordinary user/session capabilities.
  • Letting browser-hosted JavaScript hold raw capOS shell, launch, file, or network capabilities.

Design Principles

  1. Browser state is authority. A profile’s cookies, local storage, permissions, saved credentials, cache, proxy route, and downloads are not implementation details. They are held through BrowserProfile and BrowserContext capabilities.

  2. The interface is the permission. A caller that can navigate does not automatically get DOM inspection, screenshot, input, download, upload, network interception, profile mutation, or automation-debug authority.

  3. Agents receive tools, not admin ports. CDP and WebDriver BiDi are backend protocols for the trusted browser service. The agent-facing ABI is a typed, narrowed capability surface.

  4. Origins become visible policy inputs. Browser decisions should record origin, top-level site, profile, user session, persona, network route, and initiator. URL strings alone are not enough.

  5. Downloads and uploads cross explicit caps. A download returns a BrowserArtifact or writes through a granted DownloadSink. Uploading a file requires a granted read cap for that object and a per-action policy decision.

  6. Automation is auditable. Browser actions initiated by an agent are logged with the page/session, operation, typed arguments, permission mode, result, and artifacts captured for later review.

  7. Visual browsing waits for GUI. A human browser is a real app, not a terminal command. It should land only after compositor/input/font/storage and userspace networking foundations are credible.

  8. A browser can be headless before it is native. The early agent/shell-facing capability may be served by a host-side browser, a development-machine sidecar, a Linux companion process, or a remote browser service. The capOS ABI should not expose which backend serves it.
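Principles 4 and 6 together imply that every agent-initiated action leaves a record carrying more than a URL string. The following is a minimal Python sketch of what such a record might hold. The class names `PolicyInputs`, `AuditRecord`, and `AuditLog`, all field names, and the `origin_of` helper are assumptions for illustration, not part of any capOS schema.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass(frozen=True)
class PolicyInputs:
    """Inputs a browser decision records (principle 4). Names are hypothetical."""
    origin: str          # scheme://host[:port] of the acting document
    top_level_site: str  # site of the top-level page
    profile: str
    user_session: str
    persona: str
    network_route: str
    initiator: str       # for example "agent", "user", or "script"


@dataclass(frozen=True)
class AuditRecord:
    """One logged agent action (principle 6). Names are hypothetical."""
    session_id: str
    operation: str
    arguments: dict
    permission_mode: str
    result: str
    inputs: PolicyInputs
    artifacts: tuple = ()


def origin_of(url: str) -> str:
    """Derive scheme://host[:port] from a URL.

    Simplified: no default-port normalization and no opaque origins.
    """
    parts = urlsplit(url)
    port = f":{parts.port}" if parts.port is not None else ""
    return f"{parts.scheme}://{parts.hostname}{port}"


class AuditLog:
    """Append-only in-memory log; a real service would persist records."""

    def __init__(self):
        self._records = []

    def append(self, record: AuditRecord) -> None:
        self._records.append(record)

    def records(self) -> tuple:
        return tuple(self._records)
```

Keeping the policy inputs as a separate structure means the same values can feed both the permission decision and the audit entry, so the log reflects what the decision actually saw.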

Track 1: Agent/Shell Browser Capability

This is the near-term conceptual track. It gives capOS agents and shells a standard web tool without waiting for a compositor or native browser port.

Conceptual interfaces:

interface BrowserBroker {
  createProfile @0 (request :BrowserProfileRequest) -> (profile :BrowserProfile);
  openContext @1 (profile :BrowserProfile, policy :BrowserContextPolicy)
      -> (context :BrowserContext);
}

interface BrowserContext {
  openSession @0 (persona :BrowserPersona) -> (session :BrowserSession);
  snapshot @1 () -> (profileSnapshot :BrowserProfileSnapshot);
  destroy @2 () -> ();
}

interface BrowserSession {
  close @0 () -> ();
}

interface BrowserNavigate {
  navigate @0 (url :Text, wait :NavigationWait) -> (result :NavigationResult);
}

interface BrowserReadPage {
  readPage @0 (budget :PageReadBudget) -> (snapshot :PageSnapshot);
}

interface BrowserScreenshot {
  screenshot @0 (options :ScreenshotOptions) -> (image :BrowserArtifact);
}

interface BrowserInput {
  input @0 (action :InputAction) -> (result :InputResult);
}

interface BrowserDownload {
  download @0 (selector :DownloadSelector, sink :DownloadSink)
      -> (artifact :BrowserArtifact);
}

The exact schema belongs in a later implementation slice. The important rule is that BrowserSession is only a lifetime handle for one browsing context. It does not imply navigation, inspection, screenshot, input, download, upload, network-observer, or debug authority. The broker mints only the operation facets allowed by the caller’s session policy, and the shell/agent runner advertises only tools backed by facets it actually holds.
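The facet rule can be modeled in a few lines. This Python sketch is only an illustration of the idea, not the capOS ABI: the `FACET_TOOLS` mapping, `SessionPolicy`, `SessionBundle`, `mint_session`, and `advertised_tools` are hypothetical names, and real facets would be capabilities rather than strings.

```python
from dataclasses import dataclass

# Hypothetical mapping from operation facet to the tools it backs.
FACET_TOOLS = {
    "BrowserNavigate": ["browser.open"],
    "BrowserReadPage": ["browser.snapshot"],
    "BrowserScreenshot": ["browser.screenshot"],
    "BrowserInput": ["browser.click", "browser.type", "browser.select"],
    "BrowserDownload": ["browser.download"],
}


@dataclass(frozen=True)
class SessionPolicy:
    allowed_facets: frozenset


@dataclass(frozen=True)
class SessionBundle:
    session_id: str    # stands in for the BrowserSession lifetime handle
    facets: frozenset  # operation facets actually minted


def mint_session(session_id, policy, requested):
    """Grant only facets that are requested, known, and allowed by policy."""
    granted = frozenset(requested) & policy.allowed_facets & frozenset(FACET_TOOLS)
    return SessionBundle(session_id, granted)


def advertised_tools(bundle):
    """List tools backed by held facets; closing needs only the session."""
    tools = ["browser.close"]
    for facet in sorted(bundle.facets):
        tools.extend(FACET_TOOLS[facet])
    return tools
```

In this model a caller that asks for `BrowserInput` or `BrowserAdmin` without policy backing simply receives a smaller bundle, and the runner never advertises a tool it cannot execute.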

| Capability | Authority |
| --- | --- |
| BrowserBroker | Mint profiles and contexts according to session policy. |
| BrowserProfile | Own persistent browser state and profile lifecycle. |
| BrowserContext | Own one isolated browsing context under a profile. |
| BrowserSession | Hold and close one session lifetime; no operation authority by itself. |
| BrowserNavigate | Navigate within one session. |
| BrowserReadPage | Inspect page state under output budgets. |
| BrowserScreenshot | Capture screenshot artifacts under policy. |
| BrowserInput | Click, type, select, upload only with explicit grants. |
| BrowserDownload | Initiate browser downloads into a granted sink. |
| DownloadSink | Receive bytes/artifacts from browser downloads. |
| BrowserNetworkObserver | Read network metadata or bodies under redaction policy. |
| BrowserAdmin | Backend-only: raw CDP/BiDi, crash dumps, trace, profile mutation. |

Agent Tool Shape

The native shell or agent runner advertises browser operations as ordinary tools:

  • browser.open(url)
  • browser.snapshot()
  • browser.screenshot()
  • browser.click(ref)
  • browser.type(ref, text)
  • browser.select(ref, value)
  • browser.download(ref)
  • browser.close()

The tool result is structured:

  • page title, URL, origin, load state
  • accessibility/DOM references under stable short IDs
  • visible text and form fields under a token/byte budget
  • screenshot artifact cap, when requested
  • network/download artifacts only when separately allowed
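A bounded result of this shape could be assembled as follows. This is a Python sketch under stated assumptions: the `PageSnapshot` fields, the `e1`, `e2`, ... short-ID scheme, and the `build_snapshot` and `truncate_utf8` helpers are hypothetical, and a byte budget stands in for whatever token accounting the runner actually uses.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class PageSnapshot:
    title: str
    url: str
    origin: str
    load_state: str
    refs: dict         # short ID -> element description
    visible_text: str
    truncated: bool    # True when any budget cut the output


def truncate_utf8(text, max_bytes):
    """Cut text to at most max_bytes of UTF-8 without splitting a character."""
    data = text.encode("utf-8")
    if len(data) <= max_bytes:
        return text, False
    # errors="ignore" drops an incomplete multi-byte sequence at the cut point.
    return data[:max_bytes].decode("utf-8", errors="ignore"), True


def build_snapshot(title, url, origin, load_state, elements, text,
                   max_text_bytes, max_refs):
    """Assign short reference IDs and apply both budgets."""
    refs = {f"e{i}": desc for i, desc in enumerate(elements[:max_refs], start=1)}
    visible, cut = truncate_utf8(text, max_text_bytes)
    return PageSnapshot(title, url, origin, load_state, refs, visible,
                        truncated=cut or len(elements) > max_refs)
```

The `truncated` flag lets the model know that the page held more than it was shown, so it can request a narrower read instead of assuming the snapshot is complete.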

The model never receives the BrowserSession cap. It proposes tool calls; the runner executes them after policy and consent checks, then feeds bounded results back to the model. This matches Language Models and the Agent Runtime.
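The propose-then-execute split might look like the sketch below. It is a simplified Python model, not the agent runtime: `ToolCall`, `Runner`, the consent callback, and the result dictionaries are all hypothetical. The property it illustrates is that the model only ever produces `ToolCall` values and only ever sees returned dictionaries.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolCall:
    """What the model proposes: a tool name plus arguments, no capability."""
    name: str
    args: tuple  # (key, value) pairs, kept immutable for logging


class Runner:
    """Holds the real handlers; the model never sees them."""

    def __init__(self, handlers, needs_consent, ask_user):
        self._handlers = dict(handlers)        # tool name -> callable(dict) -> dict
        self._needs_consent = set(needs_consent)
        self._ask_user = ask_user              # callable(ToolCall) -> bool

    def execute(self, call):
        handler = self._handlers.get(call.name)
        if handler is None:
            # No facet behind this tool: fail closed.
            return {"ok": False, "error": "tool not granted"}
        if call.name in self._needs_consent and not self._ask_user(call):
            return {"ok": False, "error": "consent denied"}
        return {"ok": True, "result": handler(dict(call.args))}
```

A refusal comes back to the model as an ordinary structured result, so a denied call is indistinguishable in kind from any other tool output and carries no handle that could be retried out of band.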

Backend Strategy

The first implementation should be a userspace service or host-side harness that owns a real browser and exposes the typed capOS surface:

  1. Browser service launches or attaches to Chromium/Firefox/WebKit through Playwright, WebDriver BiDi, or CDP.
  2. The service stores profile state in a host directory or capOS Store backend, but callers see only BrowserProfile caps.
  3. The service enforces per-session operation grants and output budgets before returning DOM text, screenshots, network metadata, or downloads.
  4. An MCP adapter can present the same tools to external agents, but MCP is an adapter, not the authority model.

This makes browser usage testable while capOS still lacks native GUI pieces. It also creates a practical compatibility path for agents that need the modern web during capOS development.

Track 1.5: Cap-Native Document Engine

The most capOS-shaped browser work may not begin with porting a full browser. There is a meaningful middle target: run the parts of the web stack that turn provided data into an interactive document (JavaScript, DOM, CSS, layout, rendering, and perhaps WebAssembly) while replacing browser-owned host APIs with capability-backed services.

In this model, the engine does not own raw networking, files, profile directories, clipboard, permissions, downloads, credentials, or extension installation. It receives a document/resource graph and a bundle of explicit host caps. Each document bundle also needs a broker- or ResourceLoader-minted web principal: an explicit origin, package origin, or opaque origin plus base URL policy used for relative URLs, storage partitioning, fetch checks, audit records, and user-facing permission prompts. Opaque origins are the default for caller-provided bundles; a real web or package origin requires authority or attestation from the loader that supplied the bytes. Web APIs become host bindings:

| Web-facing operation | capOS-backed authority |
| --- | --- |
| fetch() / subresource load | HttpEndpoint, Fetch, or content-addressed ResourceLoader cap. |
| cookies / local storage / IndexedDB | BrowserProfileStore or narrower origin-scoped KvStore cap. |
| file picker / upload | user-approved FileRead or artifact cap. |
| downloads | DownloadSink / StoreWriter cap. |
| clipboard | explicit ClipboardRead / ClipboardWrite caps. |
| geolocation, camera, microphone | future sensor/media caps, never implicit. |
| workers / timers | scheduler and resource-budget caps. |
| WebAssembly imports | explicit host import caps, not ambient syscalls. |
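One way to picture these bindings is a host object that can only forward a Web API call to a capability it was explicitly handed. The Python sketch below is illustrative: `HostBindings`, `MissingCapability`, and the use of plain callables and dictionaries as stand-ins for `ResourceLoader`, `KvStore`, and `ClipboardRead` caps are assumptions, not capOS interfaces.

```python
class MissingCapability(Exception):
    """Raised when a Web API is called without the backing cap."""


class HostBindings:
    """Routes engine-side Web API calls to explicitly granted caps."""

    def __init__(self, caps):
        # caps: mapping from capability name to a stand-in object.
        self._caps = dict(caps)

    def _require(self, name):
        try:
            return self._caps[name]
        except KeyError:
            # Absent caps fail closed; there is no ambient fallback.
            raise MissingCapability(name) from None

    def fetch(self, url):
        # Stand-in for fetch(): the loader is a callable url -> bytes.
        return self._require("ResourceLoader")(url)

    def storage_get(self, key):
        # Stand-in for local storage: the store is a dict-like object.
        return self._require("KvStore").get(key)

    def clipboard_read(self):
        # Stand-in for clipboard access: a zero-argument callable.
        return self._require("ClipboardRead")()
```

A document handed only a preloaded bundle as its loader can resolve its own resources, while any call to storage or clipboard raises instead of silently reaching a host default.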

This track is useful for three reasons:

  1. It gives capOS a native HTML/CSS/JS application substrate without waiting for all of ordinary web browsing. Documentation, setup flows, dashboards, adventure/Paperclips UIs, and local admin apps could be rendered from trusted or packaged resources before arbitrary internet browsing is safe.
  2. It lets the project design web API host bindings around capabilities from the start. A later full browser can reuse the same profile, fetch, storage, permission, and artifact services instead of hiding them inside an engine.
  3. It is a smaller research target for engine embedding. Servo, Ladybird, and WebKit/WPE can be evaluated as document/rendering substrates, while SpiderMonkey, JavaScriptCore, Boa, or QuickJS can be evaluated as JS/Wasm runtime components or host-binding proof substrates without committing to an entire general-purpose browser port.

The accepted first shape should be conservative:

  • Load documents from a DocumentBundle or ResourceLoader cap, not from a URL bar.
  • Require every bundle principal to be minted or validated by the broker or ResourceLoader, and partition fetch, storage, cache, and audit state by profile/context/session plus that principal.
  • Disable arbitrary internet subresource fetch until a caller grants a narrowed Fetch/HttpEndpoint.
  • Produce a rendered surface or screenshot artifact plus a bounded accessibility/DOM snapshot.
  • Treat every Web API host binding as a separate facet and require explicit broker grants.
  • Avoid extension APIs, service workers, persistent background sync, notifications, WebRTC, and device APIs until their capOS authority model is clear.
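The principal and partitioning rules in this list can be sketched together. In the Python model below, `WebPrincipal`, `mint_principal`, `PartitionKey`, and `PartitionedStore` are hypothetical names, and the `loader_attested` flag is a stand-in for whatever authority or attestation mechanism the loader really provides.

```python
import secrets
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class WebPrincipal:
    kind: str               # "opaque", "web", or "package"
    origin: Optional[str]   # None for opaque principals
    nonce: str              # makes each opaque principal distinct


def mint_principal(claimed_origin=None, kind="web", loader_attested=False):
    """Honor a claimed origin only when the loader attests to it."""
    if claimed_origin is not None and loader_attested and kind in ("web", "package"):
        return WebPrincipal(kind, claimed_origin, "")
    # Default for caller-provided bundles: a fresh opaque principal.
    return WebPrincipal("opaque", None, secrets.token_hex(16))


@dataclass(frozen=True)
class PartitionKey:
    profile: str
    context: str
    session: str
    principal: WebPrincipal


class PartitionedStore:
    """Storage keyed by the full partition, never by origin string alone."""

    def __init__(self):
        self._data = {}

    def put(self, key, name, value):
        self._data.setdefault(key, {})[name] = value

    def get(self, key, name):
        return self._data.get(key, {}).get(name)
```

Because two unattested bundles receive different opaque principals even when they claim the same origin, neither can read state the other wrote.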

This is still not a toy scripting widget. Running hostile JavaScript against a DOM/layout engine remains a large TCB, and rendering bugs can be security bugs. The point is to narrow the host-platform surface: provided data in, rendered surface/snapshot/artifacts out, and every side effect through typed caps.

Track 2: Visual Browser After GUI

A human-facing browser should be a normal capOS GUI application once these prerequisites exist:

  • compositor and input service
  • font discovery/rasterization
  • userspace networking and TLS
  • Store/Namespace-backed profile persistence
  • download/upload mediation
  • shared-memory graphics buffers or GPU session caps
  • process crash/restart handling
  • brokered user-session profile policy

Candidate engine paths:

| Engine path | Role | capOS assessment |
| --- | --- | --- |
| Chromium Ozone / CEF | Maximum compatibility and automation ecosystem | Best external/backend choice; native port is very large. |
| WPE WebKit | Embedded visual browser candidate | Plausible post-GUI engine because WPE is designed for embedded backends. |
| Gecko / GeckoView | Browser diversity and principal-model precedent | Good external backend; GeckoView itself is Android-specific. |
| Servo | Rust/modular research-aligned engine | Track closely; not first broad-compatibility choice. |
| Ladybird / LibWeb | Independent-engine precedent | Track for architecture; not a near-term dependency. |

The visual browser should reuse the agent/shell profile/session model instead of inventing a second profile stack. A GUI tab is a BrowserSession with a visual BrowserView surface attached. Closing the window should not silently destroy profile state unless the profile cap is ephemeral.
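The lifetime rule for tabs can be stated as a small model. This Python sketch uses hypothetical `Profile` and `Tab` classes and assumes, for simplicity, that an ephemeral profile backs exactly one tab; a real broker would track all sessions on a profile before discarding state.

```python
class Profile:
    """Stand-in for state held behind a BrowserProfile cap."""

    def __init__(self, name, ephemeral=False):
        self.name = name
        self.ephemeral = ephemeral
        self.cookies = {}
        self.destroyed = False


class Tab:
    """A session with an optional visual surface attached."""

    def __init__(self, profile):
        self.profile = profile
        self.view = None
        self.closed = False

    def attach_view(self, view_id):
        # view_id stands in for a BrowserView surface handle.
        self.view = view_id

    def close(self):
        """Detach the view and end the session.

        Persistent profile state survives; only an ephemeral profile
        is wiped (simplified: one tab per ephemeral profile).
        """
        self.view = None
        self.closed = True
        if self.profile.ephemeral:
            self.profile.cookies.clear()
            self.profile.destroyed = True
```

The same `Tab` object works with or without a view attached, which is the sense in which a headless agent session and a GUI tab share one profile and session model.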

Donut Browser Ideas To Adapt

Donut Browser is useful because it treats browser profiles as first-class, scriptable objects and exposes local REST/MCP automation. capOS should adapt the capability-shaped parts:

  • Unlimited local profiles map to broker-minted BrowserProfile caps.
  • Profile groups map to policy bundles and user-session grants.
  • Per-profile cookies/storage/extensions map to Store-backed state owned by the profile cap.
  • Per-profile proxy/VPN selection maps to explicit network-route caps.
  • Local REST/MCP maps to a typed capOS service plus optional external adapter.
  • Persistent automation sessions map to BrowserContext lifetimes and snapshots.
  • Default-browser link routing maps to a broker decision: which profile/context should open a URL for this user/session?

capOS should not adopt Donut’s anti-detect promise. If capOS supports persona controls such as viewport, locale, timezone, user agent, geolocation, WebRTC policy, or fingerprint reduction, those controls should be explicit BrowserPersona policy with audit and user-facing disclosure.
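Explicit persona policy with disclosure could take the following shape. The Python sketch is an assumption-laden illustration: the `BrowserPersona` fields shown are a subset chosen for brevity, and `overrides` and `disclosure_lines` are hypothetical helpers, not a defined capOS interface.

```python
from dataclasses import dataclass, fields
from typing import Optional


@dataclass(frozen=True)
class BrowserPersona:
    """Each field is None unless policy explicitly overrides it."""
    viewport: Optional[tuple] = None
    locale: Optional[str] = None
    timezone: Optional[str] = None
    user_agent: Optional[str] = None
    webrtc_policy: Optional[str] = None

    def overrides(self):
        """Return only the controls this persona actually changes."""
        return {
            f.name: getattr(self, f.name)
            for f in fields(self)
            if getattr(self, f.name) is not None
        }


def disclosure_lines(persona):
    """Render overrides for an audit entry or a user-facing notice."""
    return [
        f"{name} overridden to {value!r}"
        for name, value in sorted(persona.overrides().items())
    ]
```

Making the default persona an empty set of overrides keeps any deviation from the real environment visible by construction: nothing is changed unless a field is set, and every set field appears in the disclosure.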

Security Boundary

Browser work adds these trust boundaries:

  • Web content to browser engine. Untrusted JavaScript, media, fonts, and documents hit a large engine TCB. Native browser work should keep renderer, network, image decode, and profile services separated where the backend permits it.
  • Browser engine to capOS. The engine must not receive broad shell caps. Its only capOS authorities should be its granted network route, profile store, artifact sink, and visual/input surfaces.
  • Agent to browser service. The agent sees tool descriptors and bounded snapshots, not backend debug ports.
  • Browser downloads to storage. Downloaded bytes are untrusted artifacts until a user or policy process imports them into a namespace.
  • Browser uploads to web origin. Upload requires explicit file/artifact authority and must record the destination origin.
  • Profile to profile. Cookies, storage, cache, extension state, and persona policy must not bleed across profiles unless a broker grants an explicit clone/import/export operation.

Raw CDP or BiDi access is BrowserAdmin authority. It should be held only by the browser service supervisor and developer harnesses, not by ordinary shell sessions.
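The download and upload boundaries listed above can be modeled as two explicit gates. This Python sketch is illustrative only: `BrowserArtifact`, `Namespace`, `UploadRecord`, and `upload` are hypothetical, a zero-argument callable stands in for a granted read cap, and a boolean stands in for the user or policy decision.

```python
from dataclasses import dataclass


@dataclass
class BrowserArtifact:
    """Downloaded bytes; untrusted until imported into a namespace."""
    data: bytes
    source_origin: str
    imported: bool = False


class Namespace:
    def __init__(self):
        self.objects = {}

    def import_artifact(self, name, artifact, approved):
        # approved stands in for a user or policy-process decision.
        if not approved:
            raise PermissionError("import requires a user or policy decision")
        self.objects[name] = artifact.data
        artifact.imported = True


@dataclass(frozen=True)
class UploadRecord:
    object_name: str
    destination_origin: str


def upload(object_name, read_cap, destination_origin, policy_allows, log):
    """Read through a granted cap and record where the bytes are going."""
    if read_cap is None:
        raise PermissionError("no read cap granted for this object")
    if not policy_allows(destination_origin):
        raise PermissionError("policy denies upload to this origin")
    data = read_cap()
    log.append(UploadRecord(object_name, destination_origin))
    return data
```

Both gates fail before any bytes move: a rejected import leaves the namespace untouched, and a rejected upload never invokes the read cap.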

Phased Plan

Phase A: Host-Backed Agent Browser

  • Add a host-side or userspace browser service proof that exposes a BrowserSession and a narrowed set of operation facets over an existing browser backend.
  • Use fake-model or scripted-agent QEMU/host proof first: navigate to a local page, read a bounded snapshot, click/type, capture a screenshot artifact, and close the session.
  • Record audit output for each action and show that the caller never receives raw CDP/BiDi.

Phase B: Standard Shell Tool

  • Add native shell and agent-runner integration so browser.open, browser.snapshot, and browser.screenshot are standard tools when the broker grants a browser bundle.
  • Add MCP adapter support for external agents using the same typed operation set.
  • Add download/upload gates once Store/Namespace and artifact caps exist.

Phase C: Cap-Native Document Engine Proof

  • Add a restricted DocumentBundle proof that renders packaged HTML/CSS/JS to a screenshot or simple surface and emits a bounded accessibility/DOM snapshot.
  • Wire at least one host API, such as fetch from a preloaded resource bundle or a profile-scoped key/value store, through a typed capability.
  • Prove that absent caps fail closed: no network, no profile storage, no clipboard, and no downloads by default.

Phase D: In-capOS Headless Browser Backend

  • Port or package a browser backend process once userspace networking, storage, fonts, and threads are mature enough.
  • Prefer a backend that can run without a full visible GUI surface but still supports screenshots and accessibility/DOM snapshots.
  • Preserve the same BrowserSession and operation-facet ABI so agents do not notice the backend change.

Phase E: Visual Browser

  • Add BrowserView/window integration after compositor/input support exists.
  • Reuse BrowserProfile and BrowserSession for tabs/windows.
  • Add user-facing profile picker, permissions UI, downloads UI, and audit view.

Relationship To Existing Proposals

Open Questions

  • Should the first implementation wrap Playwright for breadth, raw CDP for smaller dependencies, or WebDriver BiDi for standards alignment?
  • What is the minimal page snapshot that remains useful to an LLM while limiting token use and accidental data disclosure?
  • Should BrowserPersona support fingerprint reduction only, or also compatibility personas for testing?
  • How should extensions be represented: profile-owned package state, separately granted extension caps, or both?
  • How should a visual browser present capOS capability prompts without training users to approve every web-origin request blindly?