Proposal: capOS Agentic Development Experiment
This proposal treats capOS development as a longitudinal field experiment in agentic software engineering. The experiment studies whether persistent coding agents, subagents, review agents, recovery routines, and session-recap tooling can make sustained progress on a nontrivial operating-system project while preserving engineering quality, reviewability, and coordination safety.
The core question is not whether an AI can produce isolated code changes. The stronger question is whether an agentic workflow can maintain a coherent project over many sessions, interruptions, branches, reviews, and handoffs, and which process controls keep that workflow reliable.
This proposal studies the development-time workflow that produced capOS, not the in-system agent runtime that capOS itself targets. The capability-served language-model, embedder, and agent-runner surface lives in Language Models and Agent Runtime; that proposal is the authority on tool-use loops, per-tool permission modes, and how a future agentic capOS user surface holds model authority. The experiment described here uses external Claude and Codex sessions running against the repo and records observations about their behavior for later analysis.
Motivation
capOS is a useful setting because it is systems software with real correctness constraints: kernel behavior, capability discipline, QEMU evidence, generated schemas, docs, reviews, and integration rules all matter. It is a stronger testbed than toy programming tasks because the work has long dependency chains and observable integration gates.
The immediate practical need is session memory. Raw ~/.codex and ~/.claude
logs contain the evidence, but they are too large and operationally noisy for
routine recovery or research analysis. The recap tooling creates a derived
evidence layer: structured metadata, compact evidence packets, plain-text
summaries, parent/child session graphs, and freshness tracking.
Research Questions
- Can agentic development produce sustained, reviewable progress on capOS across many sessions and subagents?
- Which controls reduce coordination failures such as stale ownership, duplicate work, unsafe branch cleanup, live-process confusion, and review drift?
- How should parent sessions and subagent sessions be summarized so project history remains useful without recursively flooding the recap system?
- How reliable are LLM-generated factual recaps when grounded in compact evidence packets rather than full transcripts?
- What failure modes remain visible after adding stronger evidence fields, prompt examples, routing rules, and summary comparison snapshots?
Hypotheses
- Dedicated worktrees, explicit ownership rules, and mandatory review gates reduce destructive interference between concurrent agents.
- Root-session summaries plus compact child-session evidence are more useful than treating every subagent as an independent top-level recap by default.
- Small summarizer models can handle simple review sessions when given exact paths, strict output scope, and good/bad examples, but routine/recovery and child-heavy parent sessions need stronger models.
- Derived recaps can support research and operations if treated as coded observations, while raw transcripts remain the authority for audits.
- Iterative prompt and evidence changes can measurably reduce recap defects such as bootstrap-boilerplate summaries, queue-processing self-references, and “limited evidence” outputs.
Experimental Setting
The setting spans more than one development machine. Session identity therefore needs an explicit source-machine dimension: a session captured through one machine but originating on another must remain attributed to the originating machine in raw manifests and derived data.
Observed source classes:
- Claude transcripts under ~/.claude/projects/.../*.jsonl.
- Claude live metadata under ~/.claude/sessions/*.json.
- Codex thread metadata in ~/.codex/state_5.sqlite.
- Codex parent/child relationships in thread_spawn_edges.
- Codex rollout transcripts under ~/.codex/sessions/YYYY/MM/DD/.
- Git branch, worktree, commit, review, and check evidence from this repo.
Raw collection keeps source_host separate from capture_host. A central
machine may perform the capture, but the manifest records where each source
file originated.
An initial private pilot inventory found a large child-session skew: most Codex sessions were spawned subagents rather than root sessions. This motivates the default policy of indexing every session while queuing only primary/root sessions for standalone summaries.
Tooling
The repo-tracked tools live under tools/agent-session-recaps/:
- maintain_recap_store.py inventories local Claude/Codex sessions, writes script-owned metadata and evidence JSON, maintains summary queues, maps live PIDs conservatively, and ingests LLM-owned summary.txt freshness metadata.
- archive_raw_sessions.py snapshots raw session sources with host provenance, checksums, compression, optional project filtering, and optional upload to private object storage.
The default derived recap store remains outside the repo:
- ~/ai-session-recaps/index.json
- ~/ai-session-recaps/by-session/{tool}/{session_id}/meta.json
- ~/ai-session-recaps/by-session/{tool}/{session_id}/evidence.json
- ~/ai-session-recaps/by-session/{tool}/{session_id}/summary.txt
- ~/ai-session-recaps/by-session/{tool}/{session_id}/summary.meta.json
- ~/ai-session-recaps/queue/*.json
Important design choices:
- Summary prose lives only in summary.txt.
- JSON files remain script-owned metadata/evidence/freshness files.
- The index tracks source updated_at timestamps for staleness.
- Parent/root sessions are queued by default.
- Spawned child sessions remain indexed and linked, but are not queued by default.
- Parent evidence includes compact child-session evidence so root summaries can include meaningful subagent outcomes.
- Codex task_complete.last_agent_message is extracted to improve final review and implementation verdicts.
- Live Claude/Codex PIDs are mapped conservatively using /proc, Claude procStart, Codex wrapper/native process relationships, and explicit Codex resume evidence when available.
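The queueing and staleness choices above can be sketched as a single predicate. This is a minimal illustration, not the logic of maintain_recap_store.py: the record type and every field name (parent_id, source_updated_at, summarized_at_source) are assumptions made for the example.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class SessionRecord:
    # All field names are illustrative assumptions, not the real index schema.
    session_id: str
    parent_id: Optional[str]                      # None for a parent/root session
    source_updated_at: float                      # source timestamp tracked by the index
    summarized_at_source: Optional[float] = None  # source timestamp the summary was written against


def needs_summary(rec: SessionRecord, include_children: bool = False) -> bool:
    """Return True when a session should be queued for a standalone summary."""
    if rec.parent_id is not None and not include_children:
        return False  # children stay indexed and linked, but are not queued by default
    if rec.summarized_at_source is None:
        return True   # no summary has been ingested yet
    return rec.summarized_at_source < rec.source_updated_at  # summary is stale


records = [
    SessionRecord("root-a", None, 200.0, 200.0),  # summary is current
    SessionRecord("root-b", None, 300.0, 250.0),  # source changed after the summary
    SessionRecord("root-c", None, 100.0),         # never summarized
    SessionRecord("child-1", "root-b", 310.0),    # spawned child session
]
queue = [r.session_id for r in records if needs_summary(r)]
print(queue)  # ['root-b', 'root-c']
```

Keeping the child check first means a child session can never enter the queue by accident; opting in is an explicit flag on the call.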
Data Products
The experiment distinguishes four layers:
- Raw logs: private source of truth.
- Evidence packets: compact redacted excerpts, metadata, child-session packets, and command/check summaries.
- LLM summaries: qualitative coded observations, not ground truth.
- Analysis snapshots: immutable comparison runs that evaluate prompt and evidence changes.
Daily development-performance reports are analysis snapshots. They combine git, worktree, check, review, and session evidence for a bounded reporting window. They are not raw logs and should not contain private prompts, unredacted transcripts, local credentials, or unrelated operator context.
Raw transcripts should not be committed to the public source history. Evidence packets and summaries may be committed only after redaction policy and privacy review. Tooling, schemas, prompts, synthetic examples, and methodology docs can be tracked first.
Raw Evidence Archival
The recap store is derived data; it is not enough for auditability. Raw session sources should be archived separately, with checksums and a manifest that lets a later analysis reproduce which transcript version produced each evidence packet and summary.
Preferred raw archive design:
- Use private object storage, such as a locked-down GCS bucket, as the default archive for raw session logs.
- Store compressed snapshots by capture time and source host, for example:
  gs://<private-bucket>/capos-agentic-dev/raw-sessions/YYYY/MM/DD/<snapshot-id>/
    manifest.json
    sha256sums.txt
    hosts/primary-dev/.codex/sessions/....jsonl.zst
    hosts/primary-dev/.claude/projects/....jsonl.zst
    hosts/portable-dev/.codex/sessions/....jsonl.zst
    hosts/portable-dev/.claude/projects/....jsonl.zst
- Enable uniform bucket-level access, least-privilege IAM, lifecycle rules, and object versioning or retention if the bucket policy allows it.
- Consider customer-managed encryption if the archive will contain sensitive prompts, private operational instructions, or source excerpts.
- Store a manifest with source host, capture host, source path, archive object path, byte size, SHA-256, source mtime, capture timestamp, tool, session id when known, compression, and redaction status.
- Keep the manifest path or archive snapshot id in derived recap metadata so summaries can be audited against the exact archived source.
- Do not merge logs from one machine into another machine’s live ~/.codex or ~/.claude trees. Gather them into host-partitioned archives first, then import from that archive if the recap store is extended to multi-host analysis.
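A manifest row carrying the fields listed above might be built as follows. This is a sketch under stated assumptions: the key names, the "zstd" and "raw" values, and the example archive path are illustrative and are not taken from archive_raw_sessions.py.

```python
import hashlib
import os
import tempfile
from datetime import datetime, timezone


def manifest_entry(path, source_host, capture_host, tool, archive_path):
    """Build one manifest row for a raw session file (key names are assumptions)."""
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        # Hash in 1 MiB chunks so large transcripts are never loaded whole.
        for chunk in iter(lambda: fh.read(1 << 20), b""):
            digest.update(chunk)
    stat = os.stat(path)
    return {
        "source_host": source_host,      # machine where the file originated
        "capture_host": capture_host,    # machine that performed the capture
        "source_path": path,
        "archive_object_path": archive_path,
        "byte_size": stat.st_size,
        "sha256": digest.hexdigest(),
        "source_mtime": stat.st_mtime,
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "tool": tool,
        "session_id": None,              # filled in only when known
        "compression": "zstd",           # assumed from the .zst suffix in the layout
        "redaction_status": "raw",       # assumed label for unredacted sources
    }


# Demonstration on a throwaway file; host names follow the example layout above.
with tempfile.NamedTemporaryFile(suffix=".jsonl", delete=False) as tmp:
    tmp.write(b'{"type": "example"}\n')
entry = manifest_entry(
    tmp.name, "portable-dev", "primary-dev", "codex",
    "hosts/portable-dev/example.jsonl.zst",  # hypothetical object path
)
os.unlink(tmp.name)
print(entry["source_host"], entry["capture_host"], entry["byte_size"])
```

Recording source_host and capture_host as separate keys keeps attribution intact when one machine collects logs on behalf of another.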
Project filtering matters on machines that contain unrelated Claude/Codex
projects. archive_raw_sessions.py --project-root <path> selects only Codex
rollouts whose threads.cwd is inside a selected project/worktree root, writes
a filtered Codex state JSON extract, and selects matching Claude project JSONL
and session metadata. Full Codex SQLite state, global history, Codex logs DB,
Claude tasks, and Claude file history are opt-in.
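The "cwd is inside a selected root" test is easy to get wrong with plain string prefixes, since a sibling directory can share a prefix with the project root. A minimal sketch of a path-aware check follows; the function name and the example paths are hypothetical, and the real tool may also resolve symlinks, which this sketch does not.

```python
from pathlib import PurePosixPath
from typing import Sequence


def in_project(cwd: str, project_roots: Sequence[str]) -> bool:
    """True when cwd equals, or is nested under, any selected project root."""
    cwd_path = PurePosixPath(cwd)
    for root in project_roots:
        root_path = PurePosixPath(root)
        # Compare whole path components, so /work/capos-old is not
        # treated as being inside /work/capos.
        if cwd_path == root_path or root_path in cwd_path.parents:
            return True
    return False


# Hypothetical (thread id, cwd) pairs standing in for Codex thread rows.
threads = [
    ("t1", "/work/capos"),
    ("t2", "/work/capos/worktrees/feature"),
    ("t3", "/work/capos-old"),
    ("t4", "/home/user/other"),
]
selected = [tid for tid, cwd in threads if in_project(cwd, ["/work/capos"])]
print(selected)  # ['t1', 't2']
```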
Git branch or Git LFS storage is useful only under tighter constraints:
- A private Git LFS dataset branch can be convenient for small, curated, redacted, or synthetic fixtures.
- Raw local session logs should not go into normal git history because they may contain private prompts, operational instructions, credentials accidentally pasted into chat, or unrelated user content.
- Even private Git LFS is awkward for raw logs if later deletion or redaction is needed, because clones and LFS object stores can retain historical content.
- If Git LFS is used, prefer a separate private data repository or an orphan data branch, never the normal capOS source branch.
Recommended split:
- Git-tracked source repo: tooling, schemas, prompts, proposal, methodology, redaction scripts, synthetic examples.
- Private object storage: exact source session JSON/JSONL and SQLite snapshots.
- Optional private Git LFS dataset: curated redacted snapshots used by paper reviewers.
- Public artifact, if any: synthetic fixtures plus aggregate metrics and selected redacted examples.
Pilot Results
An initial private pilot processed a small queue of current summaries and then reran a target set after prompt/evidence changes.
Baseline result:
- Current summaries: 53.
- Bad queue/meta/evidence self-reference markers: 0.
- “Limited evidence” summaries: 7.
- Child-heavy current summaries: 3.
First intervention:
- Added prompt good/bad examples.
- Added compact child-session evidence for parent Codex sessions.
- Dereferenced child recap-worker output files so parent summaries see summary text, not only completion paths.
Second intervention:
- Added Codex task_complete.last_agent_message extraction.
- Reran the remaining limited-evidence summaries.
Combined candidate result:
- Candidate summaries: 53.
- Bad self-reference markers: 0.
- Limited-evidence summaries: 0.
- Average baseline summary length: 1221.6 characters.
- Average candidate summary length: 1060.6 characters.
These results support the claim that prompt examples help, but evidence shape matters more. The weak summaries were not only a prompt problem; they lacked the right final-result evidence.
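For scale, the reported counts can be turned into rates. The short calculation below uses only the figures listed above and adds no new measurements.

```python
# Figures copied from the baseline and combined candidate results above.
total_summaries = 53
baseline_limited = 7
baseline_avg_len = 1221.6
candidate_avg_len = 1060.6

# Share of baseline summaries flagged as "limited evidence".
limited_share = baseline_limited / total_summaries
# Relative reduction in average summary length after both interventions.
length_reduction = (baseline_avg_len - candidate_avg_len) / baseline_avg_len

print(f"limited-evidence share at baseline: {limited_share:.1%}")  # 13.2%
print(f"average length reduction: {length_reduction:.1%}")         # 13.2%
```

The two percentages happen to coincide; they measure different things and are not derived from each other.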
Methodology
Collection
Run the recap maintainer periodically or after major work bursts. Each run should:
- refresh metadata for all sessions;
- update evidence for recent primary/root sessions;
- preserve child-session graph information;
- update live-process mappings;
- queue only stale or missing summaries;
- record immutable analysis snapshots before major prompt/evidence changes.
Summarization
Use model routing:
- gpt-5.3-codex-spark for simple, concrete, non-routine sessions.
- A stronger model for routine/recovery sessions, live sessions, and child-heavy parent sessions.
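The routing rule can be sketched as a small function. The stronger model is not named in this proposal, so STRONG_MODEL is a placeholder; the session kind labels and the child-heavy threshold of 3 are likewise assumptions for illustration.

```python
SMALL_MODEL = "gpt-5.3-codex-spark"  # named in the routing rule above
STRONG_MODEL = "stronger-model"      # placeholder: the proposal does not name one


def route_model(kind: str, is_live: bool, child_count: int,
                child_heavy_threshold: int = 3) -> str:
    """Pick a summarizer model for one queue item (threshold is an assumption)."""
    if kind in {"routine", "recovery"}:
        return STRONG_MODEL  # routine/recovery sessions need the stronger model
    if is_live:
        return STRONG_MODEL  # live sessions are still changing
    if child_count >= child_heavy_threshold:
        return STRONG_MODEL  # child-heavy parent sessions
    return SMALL_MODEL       # simple, concrete, non-routine sessions


print(route_model("review", is_live=False, child_count=0))    # gpt-5.3-codex-spark
print(route_model("recovery", is_live=False, child_count=0))  # stronger-model
print(route_model("review", is_live=False, child_count=5))    # stronger-model
```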
Keep small-model tasks concrete:
- one queue item or a very small batch;
- exact paths;
- no JSON output;
- no broad filesystem exploration;
- good/bad examples in the prompt;
- strict instruction to summarize target-session outcomes, not the queue-processing task.
Metrics
Hard metrics:
- session count by tool, primary/child/root role, model, and live state;
- queue size and stale/current summary count;
- child-session count per root;
- number of review findings, no-finding reviews, failed checks, and passed checks;
- branch/worktree lifecycle events: created, committed, reviewed, merged, pushed, parked, abandoned;
- recovery-session frequency and duration;
- recap quality markers: self-reference markers, limited-evidence phrases, missing final verdicts, excessive bootstrap boilerplate.
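Two of the recap quality markers lend themselves to simple pattern checks. The sketch below is illustrative only: the regular expressions and the sample summaries are assumptions, and a real marker list would have to be derived from observed defects.

```python
import re
from collections import Counter

# Illustrative patterns; the actual marker phrases are not specified here.
MARKERS = {
    "self_reference": re.compile(r"\b(queue item|meta\.json|evidence\.json)\b", re.I),
    "limited_evidence": re.compile(r"\blimited evidence\b", re.I),
}


def quality_flags(summary: str) -> list:
    """Return the sorted names of all markers found in one summary text."""
    return sorted(name for name, pat in MARKERS.items() if pat.search(summary))


# Synthetic summaries, not real recap output.
summaries = {
    "s1": "Reviewed the kernel IPC change; no findings; checks passed.",
    "s2": "Limited evidence was available for this session.",
    "s3": "Processed the queue item and wrote summary.txt next to meta.json.",
}
flags = {sid: quality_flags(text) for sid, text in summaries.items()}
counts = Counter(name for names in flags.values() for name in names)
print(flags)
print(dict(counts))
```

Pattern checks like these only count surface markers; missing final verdicts and bootstrap boilerplate would likely need evidence-aware checks rather than regexes.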
Qualitative coding:
- coordination failures;
- evidence gaps;
- useful controls;
- subagent summarization failures;
- review-loop behavior;
- human intervention points.
Daily Development Metrics
The daily report answers a narrower operational question than the recap store: what project progress happened during the reporting window, how strongly was it validated, and which agent/human channels contributed to it. It should keep project-performance metrics separate from attribution. Raw commits or lines of code are activity signals, not performance by themselves.
Use a fixed window and record it in the report. UTC calendar days are the
default for cross-machine comparison; a local workday boundary may be used only
when the report records the chosen day_start hour. The collector should
derive base and tip commits from the window and report both raw and
normalized git stats.
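A fixed reporting window can be computed as a half-open interval. This is a sketch, not the collector: the function name is hypothetical, the example date is arbitrary, and a fixed UTC offset stands in for a local zone, which ignores daylight-saving changes.

```python
from datetime import date, datetime, time, timedelta, timezone


def report_window(day: date, day_start_hour: int = 0,
                  tz: timezone = timezone.utc):
    """Return (start, end) for a 24-hour window; end is exclusive."""
    start = datetime.combine(day, time(hour=day_start_hour), tzinfo=tz)
    return start, start + timedelta(days=1)


# Default: a UTC calendar day.
utc_start, utc_end = report_window(date(2025, 1, 15))
print(utc_start.isoformat(), utc_end.isoformat())

# Local workday boundary: the chosen day_start hour must be recorded in the report.
local_tz = timezone(timedelta(hours=2))  # assumed fixed offset for the example
local_start, local_end = report_window(date(2025, 1, 15), day_start_hour=6, tz=local_tz)
print(local_start.isoformat(), local_end.isoformat())
```

The ISO timestamps could then bound the commit range, for example through the --since and --until options of git log, from which base and tip commits are taken.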
Normalize diff metrics by separating generated and vendored churn:
- raw commit count and non-merge commit count;
- first-parent merged task branches;
- raw file and line stats;
- authored file and line stats excluding vendor/** and tools/generated/**;
- optional secondary exclusions for lockfiles and generated demo content;
- top-level directory and subsystem breakdown;
- schema changes and generated-code regeneration as distinct rows.
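The raw versus authored split can be sketched over the tab-separated output of git diff --numstat, where binary files report "-" for both counts. The sample paths below are hypothetical, and the sketch does not handle rename entries, which numstat prints in an "old => new" form.

```python
from fnmatch import fnmatchcase

# Exclusion patterns named in the list above.
EXCLUDED = ("vendor/**", "tools/generated/**")


def is_authored(path: str) -> bool:
    """True when the path is outside the vendored and generated trees."""
    return not any(fnmatchcase(path, pat) for pat in EXCLUDED)


def summarize(numstat: str) -> dict:
    """Total raw and authored file/line stats from numstat text."""
    totals = {
        "raw": {"files": 0, "added": 0, "deleted": 0},
        "authored": {"files": 0, "added": 0, "deleted": 0},
    }
    for line in numstat.splitlines():
        if not line.strip():
            continue
        added, deleted, path = line.split("\t", 2)
        a = 0 if added == "-" else int(added)      # "-" marks a binary file
        d = 0 if deleted == "-" else int(deleted)
        buckets = ["raw"] + (["authored"] if is_authored(path) else [])
        for bucket in buckets:
            totals[bucket]["files"] += 1
            totals[bucket]["added"] += a
            totals[bucket]["deleted"] += d
    return totals


sample = (
    "10\t2\tkernel/src/ipc.rs\n"
    "500\t0\tvendor/foo/lib.rs\n"
    "300\t300\ttools/generated/schema.rs\n"
    "-\t-\tdocs/diagram.png\n"
)
stats = summarize(sample)
print(stats["raw"])       # {'files': 4, 'added': 810, 'deleted': 302}
print(stats["authored"])  # {'files': 2, 'added': 10, 'deleted': 2}
```

Reporting both rows side by side makes generated and vendored churn visible instead of silently dropping it.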
Project-progress metrics:
- reviewed task slices merged;
- selected-milestone gates closed;
- task records closed under docs/tasks/done/;
- review-finding task records opened, closed, or carried forward;
- blockers retired and blockers still open;
- new capability, schema, runtime, demo, manifest, or QEMU proof surfaces;
- checks and QEMU targets recorded as passed, failed, skipped, or flaky;
- review iterations and review finding severity;
- rework after review or after merge.
Validation metrics should be evidence-first. A report may say a check was recorded only when it can point to a session evidence packet, saved log, commit message, or local check database entry. It should not convert a conversational claim into “passed” without a corroborating artifact. Flakes should be recorded separately from deterministic failures.
Attribution metrics are secondary accounting. Attribute by task slice and role,
not by raw line count. The report should allow at least these actor classes:
claude, codex, human/manual, mixed, and unknown. A commit trailer,
session evidence, or active-work registry row can support attribution, but
timestamp overlap alone is low-confidence and should remain unknown unless
corroborated.
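The attribution rule can be expressed as a function that ignores timestamp overlap unless a stronger source corroborates it. This is a sketch under assumptions: the evidence source labels and the low/medium/high confidence scale are invented for the example; only the actor classes come from the text above.

```python
ACTOR_CLASSES = {"claude", "codex", "human/manual", "mixed", "unknown"}
# Hypothetical labels for the three corroborating sources named above.
STRONG_SOURCES = {"commit_trailer", "session_evidence", "registry_row"}


def attribute(evidence):
    """Map (source, actor) pairs for one task slice to (actor_class, confidence)."""
    strong = [(s, a) for s, a in evidence if s in STRONG_SOURCES]
    actors = {a for _, a in strong}
    if not actors:
        return "unknown", "low"   # timestamp overlap alone never attributes
    if len(actors) > 1:
        return "mixed", "medium"  # strong sources point at different actors
    sources = {s for s, _ in strong}
    confidence = "high" if len(sources) >= 2 else "medium"
    return actors.pop(), confidence


print(attribute([("timestamp_overlap", "codex")]))
# ('unknown', 'low')
print(attribute([("commit_trailer", "claude")]))
# ('claude', 'medium')
print(attribute([("commit_trailer", "codex"), ("session_evidence", "codex")]))
# ('codex', 'high')
print(attribute([("session_evidence", "claude"), ("registry_row", "codex")]))
# ('mixed', 'medium')
```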
Split roles explicitly:
- implementation;
- review;
- planning/design;
- verification/check running;
- recovery/integration;
- recap/metrics processing.
The Claude/Codex split should be reported as a matrix of actor class by role, with counts of task slices, sessions, review findings, checks, and merged commits where known. It should not rank agents by total commits or authored lines because generated code, vendored dependencies, docs refreshes, and review work distort that comparison.
Recommended daily report sections:
- Executive summary: visible progress, evidence gates closed, blockers retired, and blockers still open.
- Git metrics: raw commits, non-merge commits, merged task branches, normalized diff stats, generated/vendor churn.
- Area breakdown: kernel, schema, runtime, demos, tools, docs, and plans.
- Evidence and validation: checks, QEMU proof targets, flakes, skipped gates, and missing gates.
- Review and rework: review iterations, findings opened/closed, severity, and post-review or post-merge rework.
- Claude/Codex/human split: role-based attribution with confidence labels.
- Planning state: selected milestone, active high-priority tasks, closed plan items, stale blockers, and next credible gates.
The active-work registry proposed by capOS Repository Harness Engineering is the preferred source for live task ownership, claimed resources, and role labels. Git remains the authority for merged history; raw session archives remain the authority for auditing derived summaries.
Validation
Treat summaries as coded observations. Validate claims against raw logs, git history, and checks before using them as paper evidence. The capOS review and verification regime described in Security and Verification is the authority on what counts as a closed review gate, what counts as a deterministic check versus a flake, and how trust boundaries are documented. The recap store and daily report cite those gates rather than redefining them: a summary may record that a check passed only when the evidence packet, saved log, or commit trailer matches one of the named gates in that proposal.
Use audits:
- sample raw transcript lines for selected summaries;
- verify cited commits and branches;
- verify check outcomes in logs;
- compare parent summaries against child-session final results;
- rerun summaries after prompt/evidence changes and compare snapshots;
- compare daily report attribution against commit trailers, session evidence, and active-work registry rows;
- sample normalized diff calculations to ensure generated and vendored files are not counted as authored development volume.
Threats To Validity
- Single-project bias: capOS is one project with one workflow.
- Model/version drift: model behavior and Codex/Claude log schemas may change.
- Observer effect: improving prompts and processes changes the system being studied.
- LLM-coded summaries can omit or distort details.
- Raw logs may contain private operational data, limiting public reproducibility.
- Agent behavior is affected by local instructions, model routing, and tool availability.
Paper Outline
- Introduction: why long-running agentic development is different from single-prompt code generation.
- Background: capOS, worktrees, review gates, Codex/Claude sessions.
- System design: recap instrumentation, evidence packets, child-session graph, model routing, live-process mapping.
- Methodology: longitudinal observation, metrics, prompt/evidence interventions, audit strategy.
- Pilot findings: session scale, child-session dominance, failure modes, recap improvement loop.
- Case studies:
  - recovery session after interruption;
  - child-heavy device-driver-foundation work session;
  - repeated review loop;
  - recap prompt/evidence refinement.
- Discussion: what worked, what remained brittle, implications for agentic software engineering.
- Limitations and future work.
Immediate Next Steps
- Add schema documentation and a privacy/redaction README.
- Add repeatable analysis scripts for baseline/rerun comparison.
- Add a daily metrics collector that joins git, recap evidence, active-work rows, check artifacts, and review findings into the report sections above.
- Add a small synthetic fixture set that exercises:
  - root session with children;
  - recap-worker child returning only a path;
  - review session with task_complete;
  - recovery session with bootstrap boilerplate.
- Decide whether generated summaries should be tracked privately, exported as redacted snapshots, or kept only as local research data.