Proposal: capOS Repository Harness Engineering

This proposal applies OpenAI-style harness engineering to the capOS repository itself. The goal is not to add agent features to the operating system. The goal is to make this repository a better, safer work environment for long-running agents and human reviewers.

The related capOS-Hosted Agent Swarms proposal describes capOS as a future host for OpenClaw-like agent services. This proposal describes the repository infrastructure needed so agents can work on capOS without repeatedly rediscovering project state, extending superseded designs, choosing the wrong QEMU proof, or silently drifting documentation.

Why This Proposal Exists

The capOS repo is already heavily agent-shaped:

AGENTS.md and CLAUDE.md define workflow rules.
The loopyard project setting selected_milestone selects the current milestone, and tasks tracked in loopyard define immediate gates.
The loopyard backlog records open remediation and review-finding work.
docs/proposals/, docs/backlog/, and docs/research/ hold design context.
docs/topics.md, docs/SUMMARY.md, and proposal indexes make docs navigable.
Make targets and QEMU harnesses prove behavior.
CUE manifests define focused system configurations.

That is enough for a careful agent to work, but it is not yet a complete harness. Too much project state still requires fragile human-style inference: which document is authoritative, which proposal is stale, which run target proves which behavior, which open finding blocks a task, and which design pivot explains why old text should not be extended.

OpenAI’s harness engineering lesson is direct: what an agent cannot inspect in its working context effectively does not exist. capOS should therefore compile its project state into repo-local, versioned, mechanically checked artifacts.

The implementation now distinguishes three classes of state instead of copying all of them into repository prose:

durable project knowledge lives in code, schemas, design authority, architecture/proposal/backlog documents, and generated documentation indexes;
live work state lives in loopyard tasks, project settings, dependencies, conflict domains, and dispatch locks;
ephemeral execution state lives in Git branches/worktrees and local proof artifacts.

Repository Agent Harness is the compact routing map across those authorities. It deliberately contains no current milestone, active-task table, or copied gate inventory.

Two existing tracker documents already shape the harness contract this proposal builds on, and the artifacts below must stay consistent with them rather than re-derive their state:

Trusted Build Inputs inventories the toolchain, generated bindings, dependency policy, Limine pin, QEMU/OVMF observation, and host-tool surface the repo currently trusts. Any run-target, proof, or generated-code claim the harness exposes to agents must point back to that inventory rather than restate pinning or drift status independently.
Design Risks and Open Questions Register is the consolidated index of long-horizon design risks (including the supply-chain risk R13, the harness-coverage gaps, and the open-question pointers for proposal/backlog/design ownership). Harness artifacts that claim a risk is “tracked” should cite the register row, and new risks surfaced by harness checks should be filed there rather than buried in this proposal.

Scope

In scope:

agent-facing repository map;
task-selection and milestone state;
proposal/research/status consistency checks;
named gate and QEMU proof discovery;
machine-readable design relationships;
progressively disclosed, reviewed knowledge navigation;
deterministic evals for future coding agents;
active-work and shared-resource visibility;
review and security handoff artifacts.

Out of scope:

capOS-hosted agent runtime implementation;
model provider selection;
browser, MCP, or A2A runtime integration;
replacing human review;
changing the current mandatory worktree workflow.

Design Principles

Repository-local context wins. Important design and workflow state should live in tracked files, not in chat history or operator memory.
Indexes are harness inputs. docs/topics.md, docs/SUMMARY.md, proposal indexes, backlog pointers, and workflow-gate metadata are not cosmetic; they are how agents find the right context.
Status must be checkable. Proposal status, supersession, implementation status, selected milestone, and review findings should fail checks when they drift.
Proofs need names and ownership. A QEMU harness target should say what it proves, which manifest it uses, which proposal/backlog owns it, and what transcript shape is expected.
Navigation must not become another authority. Generated views can help discovery, but code, schemas, architecture/proposal/backlog documents, loopyard, and review records remain authoritative.
Prefer generation over duplicate hand-maintained state. When possible, sidecars and indexes should be generated from front matter, Makefile metadata, manifests, or explicit source files.
Expose replacement paths. If a proposal is superseded, an agent should see the replacement before acting on stale text.
Make unsafe shortcuts hard. The harness should steer agents away from main-worktree edits, stale branches, missing review, unverified QEMU claims, and undocumented design pivots.
Agents must know when they are not alone. Shared resources such as git branches, worktrees, docs indexes, task lists, generated files, and review queues need visible ownership, lease, and version state before agents mutate them.

Implemented Harness Surfaces

Compact routing map

Repository Agent Harness answers where authoritative project state lives, how to select context progressively, how to inspect live ownership, and how to choose the first validation gates. make workflow-check enforces a context budget on both the mandatory AGENTS.md + CLAUDE.md entrypoint and the routing map itself, so reference detail has to move to the owning document instead of accumulating there.
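
A context budget of this kind can be sketched as a line-count cap over a named group of documents. This is a minimal illustration, not the logic behind make workflow-check: the function name check_budget, the line-based unit, and the limits used in the example are all assumptions.

```python
def check_budget(label: str, documents: dict[str, str], max_lines: int) -> list[str]:
    """Return failure messages for one budget group; an empty list means it fits.

    `documents` maps a file name to its text. The unit (lines) and the limit
    are hypothetical; a real check might count bytes or tokens instead.
    """
    counts = {name: len(text.splitlines()) for name, text in documents.items()}
    total = sum(counts.values())
    if total <= max_lines:
        return []
    breakdown = ", ".join(f"{name}={n}" for name, n in sorted(counts.items()))
    return [f"{label}: {total} lines exceeds budget of {max_lines} ({breakdown})"]


if __name__ == "__main__":
    # Illustrative inputs only; a real check would read the tracked files.
    entrypoint = {"AGENTS.md": "rule one\nrule two\n", "CLAUDE.md": "rule three\n"}
    for failure in check_budget("entrypoint", entrypoint, max_lines=2):
        print(failure)
```

Reporting the per-file breakdown alongside the total makes it clear which document should shed detail to its owning reference.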

Gate inventory and path view

docs/workflow/gates.toml is the machine-readable authority for named gates: command, proof meaning, enforcement point, and optional slice/hazard/path applicability. The Makefile remains authoritative for concrete targets and .cargo/config.toml for host-test aliases. This replaces the earlier proposal for a hand-maintained docs/run-targets.md table.

tools/workflow_gates.py for-paths emits the registered named gates made mandatory by repository-relative path globs. It is intentionally a lower bound: task acceptance, trust-boundary hazards, generated outputs, and focused QEMU behavior proofs still require explicit selection.
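
One plausible reading of path-glob selection is sketched below: `*` never crosses a `/`, and `**` spans zero or more whole segments. This is not the code in tools/workflow_gates.py. The gate names, the globs, and the dictionary shape are hypothetical stand-ins for whatever docs/workflow/gates.toml actually declares.

```python
from fnmatch import fnmatchcase


def glob_matches(pattern: str, path: str) -> bool:
    """Match a repository-relative path against a glob, one segment at a time."""
    return _match(pattern.split("/"), path.split("/"))


def _match(pat: list[str], segs: list[str]) -> bool:
    if not pat:
        return not segs
    if pat[0] == "**":
        # `**` may consume zero, one, or many leading segments.
        return any(_match(pat[1:], segs[i:]) for i in range(len(segs) + 1))
    if not segs:
        return False
    # Any other pattern segment must match exactly one path segment.
    return fnmatchcase(segs[0], pat[0]) and _match(pat[1:], segs[1:])


def gates_for_paths(gates: dict[str, list[str]], paths: list[str]) -> list[str]:
    """Return the sorted names of gates whose globs match any changed path."""
    return sorted(
        name
        for name, globs in gates.items()
        if any(glob_matches(g, p) for g in globs for p in paths)
    )


# Hypothetical registry contents, for illustration only.
GATES = {
    "docs-check": ["docs/**", "*.md"],
    "kernel-build": ["kernel/**/*.rs"],
}

if __name__ == "__main__":
    print(gates_for_paths(GATES, ["docs/proposals/example.md"]))
```

Sorting the result keeps the output deterministic regardless of registry order, and matching per segment means a top-level `*.md` glob does not accidentally capture files in subdirectories.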

Live work and resource view

Loopyard task records, dependencies, conflict domains, dispatch locks, and Git worktrees already contain the live ownership data this proposal originally placed in a checked-in active-work registry. The harness derives current state from those sources instead of committing a second, immediately stale table.

Run provenance and worklogs remain derived from task records, lock/run data, commit refs, and commit trailers. Metrics consumers should query those sources rather than introduce another attribution ledger in this proposal.
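
Deriving live state rather than committing it can be illustrated with the Git side alone. The sketch below parses the output of git worktree list --porcelain into records and reports which branches are already checked out somewhere. The helper names are hypothetical, and the loopyard side (tasks, locks, conflict domains) is deliberately not modeled because its client surface is not described here.

```python
import subprocess


def parse_worktrees(porcelain: str) -> list[dict[str, str]]:
    """Parse `git worktree list --porcelain` output into one dict per worktree.

    Records are separated by blank lines; each line is `key value`, or a bare
    key such as `detached`, which is stored with an empty value.
    """
    records: list[dict[str, str]] = []
    current: dict[str, str] = {}
    for line in porcelain.splitlines():
        if not line.strip():
            if current:
                records.append(current)
                current = {}
            continue
        key, _, value = line.partition(" ")
        current[key] = value
    if current:
        records.append(current)
    return records


def branches_in_use(records: list[dict[str, str]]) -> dict[str, str]:
    """Map each checked-out branch name to the worktree path that holds it."""
    return {
        rec["branch"].removeprefix("refs/heads/"): rec["worktree"]
        for rec in records
        if "branch" in rec
    }


def live_worktrees() -> list[dict[str, str]]:
    """Query the current repository; must be run inside a Git checkout."""
    out = subprocess.run(
        ["git", "worktree", "list", "--porcelain"],
        capture_output=True, text=True, check=True,
    ).stdout
    return parse_worktrees(out)
```

Because the view is recomputed on every call, it cannot go stale the way a checked-in registry would.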

Planned Harness Surfaces

Proposal Relationship Metadata

Add or standardize front matter fields:

status: "Future design. No implementation."
last_reviewed: "2026-04-28 00:00 UTC"
supersedes:
  - old-proposal.md
superseded_by: new-proposal.md
implemented_by:
  - commit-or-target
owned_backlog: docs/backlog/example.md
proof_targets:
  - make test-example

The exact schema can be narrower at first. The important requirement is that replacement and proof relationships become queryable.
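
To show what “queryable” could mean in practice, the sketch below parses only the two shapes used in the example above (scalar `key: value` pairs and `- item` lists) and follows superseded_by links to the current replacement. It is an illustration under stated assumptions: real tooling would more likely use a full YAML parser, and the function names are hypothetical.

```python
def parse_front_matter(block: str) -> dict[str, object]:
    """Parse `key: value` scalars and `- item` lists; nothing else is supported."""
    data: dict[str, object] = {}
    key = None
    for raw in block.splitlines():
        line = raw.strip()
        if not line or line.startswith("#"):
            continue
        if line.startswith("- "):
            if key is None or not isinstance(data[key], list):
                raise ValueError(f"list item without a list key: {raw!r}")
            data[key].append(line[2:].strip().strip('"'))
            continue
        key, sep, value = line.partition(":")  # split at the first colon only
        if not sep:
            raise ValueError(f"expected 'key: value': {raw!r}")
        key = key.strip()
        value = value.strip()
        # A key with no inline value starts a list.
        data[key] = value.strip('"') if value else []
    return data


def current_replacement(name: str, proposals: dict[str, dict]) -> str:
    """Follow superseded_by links until reaching a proposal with no replacement."""
    seen: list[str] = []
    while True:
        if name in seen:
            raise ValueError("supersession cycle: " + " -> ".join(seen + [name]))
        seen.append(name)
        nxt = proposals.get(name, {}).get("superseded_by")
        if not nxt:
            return name
        name = nxt
```

Resolving the chain to its end, and failing loudly on a cycle, lets an agent land on the live design even when a replacement has itself been superseded.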

Design Pivot Records

Add short ADR-style files under docs/decisions/ for high-impact pivots:

endpoint badges as service identity rejected;
service-object capabilities superseded by session-bound invocation context;
SSH work paused behind session-bound invocation context;
hosted agents split from shell agent mode.

Each record should state context, decision, consequences, superseded docs, and current replacement docs.

Compiled knowledge views

The current harness does not add docs/agent-wiki/. docs/SUMMARY.md, generated topics, the repository map, and the design-authority map already provide progressively disclosed navigation without paraphrasing architecture or live findings. A new compiled view should be added only after a workflow eval demonstrates a repeated retrieval gap; it should be generated from named authorities and remain disposable.

Agent Evals

Add deterministic repository-workflow evals:

identify selected milestone from the loopyard project setting selected_milestone;
find the relevant backlog and proposal;
reject editing the main worktree;
detect another active worker claiming the same exclusive path or generated output;
choose a non-overlapping task or wait when a shared resource is already leased;
identify required checks for a doc-only proposal change;
detect a superseded proposal and follow replacement;
update proposal index and summary when adding a proposal;
avoid claiming full tests passed when only docs built;
surface open review-finding task records before unrelated feature work.

These evals can start as scripted fixtures. They do not need live model calls.
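
A scripted fixture of this kind can be as small as a frozen description of repository state plus the expected decision. The sketch below covers two of the listed cases (main-worktree edits and overlapping leases). The field names, the three decision labels, and the decide reference policy are hypothetical; in a real eval the policy argument would be replaced by the action recorded from an agent run.

```python
from dataclasses import dataclass
from typing import Callable, Iterable


@dataclass(frozen=True)
class Fixture:
    name: str
    worktree_is_main: bool
    wanted_paths: frozenset[str]
    leased_paths: frozenset[str]  # paths claimed by other active workers
    expected: str                 # "proceed", "wait", or "reject"


def decide(fx: Fixture) -> str:
    """Reference policy: never edit main, never touch a path someone else holds."""
    if fx.worktree_is_main:
        return "reject"
    if fx.wanted_paths & fx.leased_paths:
        return "wait"
    return "proceed"


def run_evals(
    fixtures: Iterable[Fixture],
    policy: Callable[[Fixture], str] = decide,
) -> list[str]:
    """Return the names of fixtures where the policy disagreed with `expected`."""
    return [fx.name for fx in fixtures if policy(fx) != fx.expected]


FIXTURES = [
    Fixture("main-worktree-edit", True,
            frozenset({"docs/a.md"}), frozenset(), "reject"),
    Fixture("leased-generated-output", False,
            frozenset({"docs/topics.md"}), frozenset({"docs/topics.md"}), "wait"),
    Fixture("disjoint-paths", False,
            frozenset({"docs/a.md"}), frozenset({"docs/b.md"}), "proceed"),
]

if __name__ == "__main__":
    print(run_evals(FIXTURES))
```

Since every input is fixed data, the result is the same on every run, which is what lets these evals gate changes without live model calls.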

Mechanical Checks

Current harness checks establish:

workflow-gate names and registered commands resolve consistently;
repository-relative path globs select gates with segment-aware, deterministic behavior;
the mandatory instruction entrypoint and agent-harness routing map stay within explicit context budgets;
gate references on commit Evidence: lines resolve through the registry;
documentation links, front matter, topics, and agent-facing assets can be regenerated and checked by the documentation workflow.

Remaining metadata checks should cover:

every proposal in docs/proposals/ is present in docs/proposals/index.md or an explicit archive section;
every proposal linked in docs/SUMMARY.md exists;
every proposal with topics appears in docs/topics.md after generation;
superseded_by points to an existing file;
superseded proposals display a replacement link near the top;
the milestone named by the loopyard project setting selected_milestone has a matching loopyard board and backlog orientation;
proposal proof-target metadata resolves through the gate registry or an existing concrete target;
research-backed proposals link at least one docs/research/*.md note;
external source snapshots in research notes include a review date;
QEMU proof claims name a target;
selected-milestone and task-source retrieval is available through supported loopyard client surfaces rather than direct database inspection;
metrics derived from task/run provenance retain known actor class, role, and confidence labels where those fields are available.

These checks should start warning-only if needed, then become required once the metadata is in place.
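
The warning-only rollout can be sketched by giving each finding a required flag and letting only required findings fail the run. The example uses the proposal-index check from the list above. The Finding type, the function names, and the link-matching regular expression are assumptions rather than the existing metadata tooling.

```python
import re
from dataclasses import dataclass


@dataclass(frozen=True)
class Finding:
    check: str
    message: str
    required: bool  # False means warning-only: reported but not fatal


def missing_from_index(
    proposal_files: list[str], index_text: str, *, required: bool = False
) -> list[Finding]:
    """Report proposal files that the index never links to."""
    # Capture the target of markdown links such as [Title](name.md) or
    # [Title](name.md#section); anchors are ignored.
    linked = set(re.findall(r"\]\(([^)#\s]+\.md)", index_text))
    return [
        Finding("proposal-index", f"{name} is not linked from the proposal index", required)
        for name in sorted(proposal_files)
        if name not in linked
    ]


def exit_code(findings: list[Finding]) -> int:
    """Fail only when at least one finding comes from a required check."""
    return 1 if any(f.required for f in findings) else 0
```

Promoting a check from warning to required then becomes a one-argument change at the call site, with no change to the check logic itself.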

The harness checks above stop at repository context, metadata, gate, and live-work hygiene. They are deliberately not a substitute for the security review process. Trust-boundary review, threat-model refresh, per-boundary CWE/CAPEC tagging, tiered tooling (CI hygiene, miri/proptest/fuzzing, Loom, Kani), and the Security Verification Track registry live in Security Review and Formal Verification. When a harness check touches trust-boundary or supply-chain authority, it must route the finding to the matching security verification track or design-risks register row rather than absorb the authority claim into agent-facing harness metadata.

Workflow Impact

For agents:

start at docs/agent-harness.md;
read selected milestone and task state from loopyard;
inspect task conflict domains, dispatch locks, branches, and worktrees before mutating shared files;
follow the design-authority map, proposal status, and explicit supersession links to avoid stale design;
derive path-mandatory checks from the workflow-gates registry, then add task/hazard/behavior-specific gates;
update docs/status through mechanically checked indexes;
hand off with proof target names and transcript artifacts.

For humans:

less repeated explanation of repo rules;
easier review of whether an agent chose the right context;
clearer detection of stale docs;
explicit locations for “why did we change direction?” records.

Implementation Phases

Phase 1 - Routing and Existing Authorities (implemented)

Route agents through docs/agent-harness.md, AGENTS.md, CLAUDE.md, and REVIEW.md without copying live state.
Use loopyard tasks/settings/locks and Git worktrees for live work ownership.
Use docs/workflow/gates.toml, the Makefile, and Cargo aliases for gate authority; expose the path-mandatory view with workflow_gates.py for-paths.
Keep documentation navigation generated from front matter, SUMMARY.md, and existing indexes.
Enforce context budgets for the mandatory entrypoint and routing map.

Phase 2 - Metadata and Checks

Standardize front matter for proposals and research notes.
Extend mdBook metadata tooling to validate proposal index, topic membership, summary links, status fields, and supersession links.
Validate proof-target metadata through the existing gate/target authorities.

Phase 3 - Decision Records

Add docs/decisions/ and initial pivot records for the session-bound invocation context change and hosted-agent split.
Link decisions from affected proposals and backlog files.

Phase 4 - Agent Workflow Evals

Add fixtures and scripts for repository-workflow evals.
Run them from a docs or check Make target; which one is left to the Open Questions.
Use failures to improve docs/agent-harness.md, metadata, and derived gate views.

Open Questions

Should proposal relationship metadata live only in front matter, or should there be a generated JSON sidecar for fast agent/tool consumption?
How much QEMU transcript output should be retained as proof artifacts without bloating the repository?
How strict should the first status linter be, given existing historical docs?
Which supported loopyard CLI/API command should expose project settings such as selected_milestone without requiring a database-specific query?
Should agent evals be part of make docs, a separate make agent-harness-check, or a broader make check?

Relationship to Existing Documents

Hosted agent harnesses research records the external harness research and the initial checklist.
capOS-Hosted Agent Swarms uses this repo harness as precedent for future capOS-hosted agents.
mdBook Documentation Site owns public docs structure and status vocabulary; this proposal adds agent-legibility and mechanical checks on top.
Trusted Build Inputs is the source of truth for toolchain pinning, generated-code drift, dependency policy, Limine binary pinning, observed-only QEMU/OVMF surface, and host-tool inventory. The harness proof-target metadata and generated-output claims must cite the relevant row there rather than re-derive trust status.
Security Review and Formal Verification owns the trust-boundary model, per-boundary CWE/CAPEC checklist, tiered tooling (CI hygiene, miri/proptest/fuzzing, Loom, Kani), and the Security Verification Track registry. Harness mechanical checks must hand security-bearing findings to that proposal’s tracks rather than redefine review authority.
Design Risks and Open Questions Register is the consolidated index of long-horizon design risks and open architectural questions. New harness-surfaced risks should be filed against existing rows there (for example R13 for supply-chain pinning gaps) or added as new rows, not buried in harness artifacts.
CLAUDE.md, AGENTS.md, and loopyard (the task ledger) remain authoritative workflow inputs. docs/agent-harness.md should route to them, not replace them.
