Proposal: capOS Repository Harness Engineering

This proposal applies OpenAI-style harness engineering to the capOS repository itself. The goal is not to add agent features to the operating system. The goal is to make this repository a better, safer work environment for long-running agents and human reviewers.

The related capOS-Hosted Agent Swarms proposal describes capOS as a future host for OpenClaw-like agent services. This proposal describes the repository infrastructure needed so agents can work on capOS without repeatedly rediscovering project state, extending superseded designs, choosing the wrong QEMU proof, or silently drifting documentation.

Why This Proposal Exists

The capOS repo is already heavily agent-shaped:

  • AGENTS.md and CLAUDE.md define workflow rules.
  • docs/tasks/state.toml selects the current milestone, and task records under docs/tasks/ define immediate gates.
  • docs/tasks/** records open remediation and review-finding work.
  • docs/proposals/, docs/backlog/, and docs/research/ hold design context.
  • docs/topics.md, docs/SUMMARY.md, and proposal indexes make docs navigable.
  • Make targets and QEMU harnesses prove behavior.
  • CUE manifests define focused system configurations.

That is enough for a careful agent to work, but it is not yet a complete harness. Too much project state still requires fragile human-style inference: which document is authoritative, which proposal is stale, which run target proves which behavior, which open finding blocks a task, and which design pivot explains why old text should not be extended.

OpenAI’s harness engineering lesson is direct: what an agent cannot inspect in its working context effectively does not exist. capOS should therefore compile its project state into repo-local, versioned, mechanically checked artifacts.

Two existing tracker documents already shape the harness contract this proposal builds on, and the artifacts below must stay consistent with them rather than re-derive their state:

  • Trusted Build Inputs inventories the toolchain, generated bindings, dependency policy, Limine pin, QEMU/OVMF observation, and host-tool surface the repo currently trusts. Any run-target, proof, or generated-code claim the harness exposes to agents must point back to that inventory rather than restate pinning or drift status independently.
  • Design Risks and Open Questions Register is the consolidated index of long-horizon design risks (including the supply-chain risk R13, the harness-coverage gaps, and the open-question pointers for proposal/backlog/design ownership). Harness artifacts that claim a risk is “tracked” should cite the register row, and new risks surfaced by harness checks should be filed there rather than buried in this proposal.

Scope

In scope:

  • agent-facing repository map;
  • task-selection and milestone state;
  • proposal/research/status consistency checks;
  • run-target and QEMU proof inventory;
  • machine-readable design relationships;
  • agent-maintained but reviewed knowledge compilation;
  • deterministic evals for future coding agents;
  • active-work and shared-resource visibility;
  • review and security handoff artifacts.

Out of scope:

  • capOS-hosted agent runtime implementation;
  • model provider selection;
  • browser, MCP, or A2A runtime integration;
  • replacing human review;
  • changing the current mandatory worktree workflow.

Design Principles

  1. Repository-local context wins. Important design and workflow state should live in tracked files, not in chat history or operator memory.

  2. Indexes are harness inputs. docs/topics.md, docs/SUMMARY.md, proposal indexes, backlog pointers, and run-target tables are not cosmetic; they are how agents find the right context.

  3. Status must be checkable. Proposal status, supersession, implementation status, selected milestone, and review findings should fail checks when they drift.

  4. Proofs need names and ownership. A QEMU harness target should say what it proves, which manifest it uses, which proposal/backlog owns it, and what transcript shape is expected.

  5. Compiled knowledge is non-authoritative until reviewed. Agent-generated wiki pages can help navigation, but proposals, architecture docs, schemas, code, and review findings remain authoritative.

  6. Prefer generation over duplicate hand-maintained state. When possible, sidecars and indexes should be generated from front matter, Makefile metadata, manifests, or explicit source files.

  7. Expose replacement paths. If a proposal is superseded, an agent should see the replacement before acting on stale text.

  8. Make unsafe shortcuts hard. The harness should steer agents away from main-worktree edits, stale branches, missing review, unverified QEMU claims, and undocumented design pivots.

  9. Agents must know when they are not alone. Shared resources such as git branches, worktrees, docs indexes, task lists, generated files, and review queues need visible ownership, lease, and version state before agents mutate them.

Proposed Artifacts

docs/agent-harness.md

A concise entry point for future agents. It should answer:

  • where current project state lives;
  • how to choose a task;
  • how to create a compliant worktree;
  • how to find relevant proposals, backlog, research, and review findings;
  • how to choose checks;
  • how to handle docs/status updates;
  • how to hand off verification and review.

This file should link to authoritative docs rather than duplicate them. It is a map, not a new policy source.

docs/run-targets.md

Generated or maintained inventory of run/check targets:

| Target | Manifest | Purpose | Expected proof | Owner |
|---|---|---|---|---|
| make run-session-context | system-session-context.cue | one immutable session context proof | hostile second-session attempts fail closed | session-bound invocation context |
| make run-chat | system-chat.cue | resident chat service proof | session-scoped chat transcript | chat/shared-service proposal |

The table should cover make run-*, make qemu-*, docs checks, generated-code checks, and security checks. Agents should not infer target meaning from target names alone.
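The inventory check can be sketched as follows, assuming the table has already been parsed into rows. The names `RunTarget`, `makefile_targets`, and `check_inventory` are hypothetical, and the regular expression is a deliberate simplification of Make syntax rather than a full parser.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class RunTarget:
    """One inventory row (hypothetical shape mirroring the table columns)."""
    target: str          # Make target name without the leading "make "
    manifest: str        # CUE manifest file name
    purpose: str
    expected_proof: str
    owner: str


def makefile_targets(makefile_text: str) -> set[str]:
    """Collect rule names from lines such as 'run-chat: build'.

    Simplification: ignores pattern rules, includes, and generated rules.
    The (?!=) lookahead skips ':=' variable assignments.
    """
    pattern = re.compile(r"^([A-Za-z0-9_.-]+)\s*:(?!=)", re.MULTILINE)
    return set(pattern.findall(makefile_text))


def check_inventory(rows, makefile_text, known_manifests):
    """Return human-readable problems; an empty list means the rows are consistent."""
    targets = makefile_targets(makefile_text)
    problems = []
    for row in rows:
        if row.target not in targets:
            problems.append(f"{row.target}: no such Make target")
        if row.manifest not in known_manifests:
            problems.append(f"{row.target}: manifest {row.manifest} not found")
        if not row.expected_proof.strip():
            problems.append(f"{row.target}: expected proof is empty")
    return problems
```

A caller would pass the Makefile text and the set of manifest file names found on disk. How rows are extracted from docs/run-targets.md is left open here.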

Active Work Registry

Add a small generated or reviewed active-work registry for concurrent agents. It should be derived from git worktrees where possible and supplemented by task metadata:

| Task | Branch | Worktree | Claimed resources | Mode | Expires | Status |
|---|---|---|---|---|---|---|
| example-session-model | feat/session-model-proof | <worktree-root>/session-model-proof | src/capos/service.rs, docs/proposals/session-context.md | exclusive source, shared docs | 2026-05-01 | checking |

The registry is not a replacement for git or human review. It is a harness surface for “another agent is already touching this shared resource.” The row above is synthetic sample data, not live project state.

The same registry should also feed the daily development-performance report defined in capOS Agentic Development Experiment. Git can explain what merged, but the registry explains live ownership, intended role, claimed resource surface, and whether a task was implementation, review, verification, recovery, or metrics processing.

Minimum fields:

  • task or issue id;
  • owner identity or runner id;
  • actor class when known: claude, codex, human/manual, mixed, or unknown;
  • role: implementation, review, planning/design, verification, recovery/integration, or recap/metrics processing;
  • attribution confidence: direct, corroborated, inferred, or unknown;
  • branch and worktree path;
  • claimed paths, subsystems, generated outputs, todo items, or review queues;
  • exclusive/shared mode;
  • observed base revision;
  • lease expiry and renewal time;
  • status: planning, editing, checking, review, merge, blocked, abandoned.

Rows should keep attribution confidence explicit. A direct session id, commit trailer, or operator-created row is higher-confidence than timestamp overlap. Low-confidence rows should stay unknown or mixed rather than assigning work to a specific tool.

For the current repo workflow, this would make the existing worktree policy queryable. For a future capOS-hosted swarm, the same shape becomes a SharedResource/ResourceLease service: git repos, shared todo items, wiki pages, generated docs, and merge queues all get visible claims and versioned writes.
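A minimal sketch of how registry rows could be checked for overlapping claims. It assumes one mode per claim (the sample row mixes modes per resource), exact path matching with no globs or directory prefixes, and that the `merge` status corresponds to "waiting for merge". `Claim`, `live`, and `conflicts` are hypothetical names.

```python
from dataclasses import dataclass
from datetime import date

# Assumption: claims in these statuses no longer block other workers.
NON_BLOCKING = {"blocked", "abandoned", "merge"}


@dataclass(frozen=True)
class Claim:
    task: str
    branch: str
    paths: frozenset      # claimed paths, compared by exact string match
    mode: str             # "exclusive" or "shared"
    expires: date         # lease expiry
    status: str


def live(claim: Claim, today: date) -> bool:
    """A claim holds its resources while unexpired and in a blocking status."""
    return claim.status not in NON_BLOCKING and claim.expires >= today


def conflicts(claims, today: date):
    """Return (task_a, task_b, path) for live claims that overlap on a path
    where at least one side is exclusive."""
    active = [c for c in claims if live(c, today)]
    found = []
    for i, a in enumerate(active):
        for b in active[i + 1:]:
            if "exclusive" not in (a.mode, b.mode):
                continue  # two shared claims may coexist
            for path in sorted(a.paths & b.paths):
                found.append((a.task, b.task, path))
    return found
```

Expired leases drop out of the comparison rather than raising an error, so a stale row cannot block other workers indefinitely. Whether expiry should also trigger a cleanup report is a separate design choice.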

Proposal Relationship Metadata

Add or standardize front matter fields:

status: "Future design. No implementation."
last_reviewed: "2026-04-28 00:00 UTC"
supersedes:
  - old-proposal.md
superseded_by: new-proposal.md
implemented_by:
  - commit-or-target
owned_backlog: docs/backlog/example.md
proof_targets:
  - make run-example

The exact schema can be narrower at first. The important requirement is that replacement and proof relationships become queryable.
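To illustrate what "queryable" could mean for replacement relationships, the sketch below follows `superseded_by` links to the proposal an agent should actually read. It assumes the front matter has already been parsed into dictionaries keyed by file name; `resolve_replacement` is a hypothetical helper, not existing tooling.

```python
def resolve_replacement(name, metadata):
    """Follow superseded_by links from `name` to the current proposal.

    `metadata` maps proposal file name -> parsed front matter dict.
    Raises ValueError on a dangling link or a supersession cycle.
    """
    seen = [name]
    current = name
    while True:
        if current not in metadata:
            raise ValueError(f"dangling link: {' -> '.join(seen)}")
        nxt = metadata[current].get("superseded_by")
        if not nxt:
            return current  # no replacement recorded: this is the live document
        if nxt in seen:
            raise ValueError(f"cycle: {' -> '.join(seen + [nxt])}")
        seen.append(nxt)
        current = nxt
```

The same traversal can back a linter (fail on ValueError) and an agent-facing lookup (return the final name), so both read one source of relationship data.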

Design Pivot Records

Add short ADR-style files under docs/decisions/ for high-impact pivots:

  • endpoint badges as service identity rejected;
  • service-object capabilities superseded by session-bound invocation context;
  • SSH work paused behind session-bound invocation context;
  • hosted agents split from shell agent mode.

Each record should state context, decision, consequences, superseded docs, and current replacement docs.

docs/agent-wiki/

A generated or agent-maintained compiled knowledge tree:

  • index.md: current topic map;
  • capability-model.md: current “interface is permission” model;
  • session-model.md: implemented session-bound invocation context summary;
  • shell-and-remote-access.md: shell, Telnet, SSH, WebShellGateway status;
  • qemu-proofs.md: proof target summaries;
  • open-findings.md: current review findings summarized with links.

This tree must be clearly labeled as compiled navigation, not authority. It can be hidden from public docs until reviewed.

Agent Evals

Add deterministic repository-workflow evals:

  • identify selected milestone from docs/tasks/state.toml;
  • find the relevant backlog and proposal;
  • reject editing the main worktree;
  • detect another active worker claiming the same exclusive path or generated output;
  • choose a non-overlapping task or wait when a shared resource is already leased;
  • identify required checks for a doc-only proposal change;
  • detect a superseded proposal and follow replacement;
  • update proposal index and summary when adding a proposal;
  • avoid claiming full tests passed when only docs built;
  • surface open review-finding task records before unrelated feature work.

These evals can start as scripted fixtures. They do not need live model calls.
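A scripted fixture could be as small as a recorded list of actions plus a pure grading function. The sketch below covers two of the evals above (main-worktree edits and overstated test claims). The `Action` shape, the claim string "tests passed", and the use of `make check` as the full-test command are all assumptions for illustration. It also assumes task worktrees are not nested under the main worktree path.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Action:
    kind: str     # "edit", "run", or "claim"
    detail: str   # path edited, command run, or statement claimed


def grade(actions, main_worktree):
    """Return the rules violated by one recorded fixture transcript."""
    failures = []
    commands = {a.detail for a in actions if a.kind == "run"}
    main_prefix = main_worktree.rstrip("/") + "/"
    for a in actions:
        if a.kind == "edit" and a.detail.startswith(main_prefix):
            failures.append(f"edited main worktree: {a.detail}")
        if a.kind == "claim" and a.detail == "tests passed" and "make check" not in commands:
            failures.append("claimed tests passed without running make check")
    return failures
```

Because the grader only inspects recorded actions, the same fixtures can later score transcripts produced by a live agent without changing the rules.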

Mechanical Checks

Extend existing documentation tooling to check:

  • every proposal in docs/proposals/ is present in docs/proposals/index.md or an explicit archive section;
  • every proposal linked in docs/SUMMARY.md exists;
  • every proposal with topics appears in docs/topics.md after generation;
  • superseded_by points to an existing file;
  • superseded proposals display a replacement link near the top;
  • the selected milestone in docs/tasks/state.toml matches the orientation given in docs/tasks/README.md and the backlog;
  • run-target inventory entries point to existing Make targets and manifests;
  • research-backed proposals link at least one docs/research/*.md note;
  • external source snapshots in research notes include a review date;
  • QEMU proof claims name a target;
  • active-work registry entries point to existing branches/worktrees when local;
  • no two active registry entries claim the same exclusive resource unless one is marked blocked, abandoned, or waiting for merge;
  • daily metrics rows that cite an active-work entry use a known actor class, role, and confidence label.

These checks should start warning-only if needed, then become required once the metadata is in place.
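The warning-only to required transition can be modelled as a single flag on each check. The sketch below applies it to the first check in the list (every proposal appears in the index). `unindexed_proposals` and `check_index` are hypothetical names, and matching by plain substring is a simplification that a real linter would replace with link parsing.

```python
def unindexed_proposals(proposal_names, index_text):
    """Return proposal file names that the index text never mentions."""
    return sorted(
        name for name in proposal_names
        if name != "index.md" and name not in index_text
    )


def check_index(proposal_names, index_text, warn_only=True):
    """Return (exit_code, messages).

    In warning-only mode the check reports problems but always exits 0;
    once required, any missing entry yields exit code 1.
    """
    missing = unindexed_proposals(proposal_names, index_text)
    level = "warning" if warn_only else "error"
    messages = [f"{level}: {name} is not listed in the proposal index" for name in missing]
    exit_code = 0 if warn_only or not missing else 1
    return exit_code, messages
```

Keeping the mode outside the detection logic means the promotion to a required check is a one-line configuration change rather than a rewrite.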

The harness checks above stop at proposal/index/run-target/active-work hygiene. They are deliberately not a substitute for the security review process. Trust-boundary review, threat-model refresh, per-boundary CWE/CAPEC tagging, tiered tooling (CI hygiene, miri/proptest/fuzzing, Loom, Kani), and the Security Verification Track registry live in Security Review and Formal Verification. When a harness check (for example “proof claim names a target” or “active-work registry attributes a generated output”) touches trust-boundary or supply-chain authority, it must route the finding to the matching security verification track or design-risks register row rather than absorb the authority claim into agent-facing harness metadata.

Workflow Impact

For agents:

  • start at docs/agent-harness.md;
  • read the selected milestone state through stable headings or a generated sidecar;
  • inspect active-work/resource claims before choosing or mutating shared files;
  • follow proposal relationship metadata to avoid stale design;
  • choose checks from run-target inventory;
  • update docs/status through mechanically checked indexes;
  • hand off with proof target names and transcript artifacts.

For humans:

  • less repeated explanation of repo rules;
  • easier review of whether an agent chose the right context;
  • clearer detection of stale docs;
  • explicit locations for “why did we change direction?” records.

Implementation Phases

Phase 1 - Map and Inventory

  • Add docs/agent-harness.md.
  • Add initial docs/run-targets.md by hand for major run targets.
  • Link both from docs/SUMMARY.md, docs/topics.md, and README.md.
  • Add a short section in docs/tasks/README.md pointing future agents to the harness map.

Phase 2 - Metadata and Checks

  • Standardize front matter for proposals and research notes.
  • Extend mdBook metadata tooling to validate proposal index, topic membership, summary links, status fields, and supersession links.
  • Add run-target inventory validation against Makefile and manifest paths.

Phase 3 - Decision Records

  • Add docs/decisions/ and initial pivot records for the session-bound invocation context change and hosted-agent split.
  • Link decisions from affected proposals and backlog files.

Phase 4 - Compiled Agent Wiki

  • Create a reviewed docs/agent-wiki/ seed for the current selected milestone.
  • Add lint for stale links, missing citations, and “compiled, not authority” labels.
  • Decide whether generated wiki pages are published in mdBook or kept as repo-internal harness files.

Phase 5 - Agent Workflow Evals

  • Add fixtures and scripts for repository-workflow evals.
  • Run them in a docs/check target.
  • Use failures to improve docs/agent-harness.md, metadata, and run-target inventory.

Open Questions

  • Should proposal relationship metadata live only in front matter, or should there be a generated JSON sidecar for fast agent/tool consumption?
  • Should docs/agent-wiki/ be generated on demand or checked in after review?
  • How much QEMU transcript output should be retained as proof artifacts without bloating the repository?
  • Should run-target metadata live in Makefile comments, a CUE file, or docs/run-targets.md front matter blocks?
  • How strict should the first status linter be, given existing historical docs?
  • Should agent evals be part of make docs, a separate make agent-harness-check, or a broader make check?

Relationship to Existing Documents

  • Hosted agent harnesses research records the external harness research and the initial checklist.
  • capOS-Hosted Agent Swarms uses this repo harness as precedent for future capOS-hosted agents.
  • mdBook Documentation Site owns public docs structure and status vocabulary; this proposal adds agent-legibility and mechanical checks on top.
  • Trusted Build Inputs is the source of truth for toolchain pinning, generated-code drift, dependency policy, Limine binary pinning, observed-only QEMU/OVMF surface, and host-tool inventory. The harness run-target inventory, proof-target metadata, and generated-output active-work claims in this proposal must cite the relevant row there rather than re-derive trust status.
  • Security Review and Formal Verification owns the trust-boundary model, per-boundary CWE/CAPEC checklist, tiered tooling (CI hygiene, miri/proptest/fuzzing, Loom, Kani), and the Security Verification Track registry. Harness mechanical checks must hand security-bearing findings to that proposal’s tracks rather than redefine review authority.
  • Design Risks and Open Questions Register is the consolidated index of long-horizon design risks and open architectural questions. New harness-surfaced risks should be filed against existing rows there (for example R13 for supply-chain pinning gaps) or added as new rows, not buried in harness artifacts.
  • CLAUDE.md, AGENTS.md, docs/tasks/README.md, and the task ledger remain authoritative workflow inputs. docs/agent-harness.md should route to them, not replace them.