Proposal: capOS Agentic Development Experiment

This proposal treats capOS development as a longitudinal field experiment in agentic software engineering. The experiment studies whether persistent coding agents, subagents, review agents, recovery routines, and session-recap tooling can make sustained progress on a nontrivial operating-system project while preserving engineering quality, reviewability, and coordination safety.

The core question is not whether an AI can produce isolated code changes. The stronger question is whether an agentic workflow can maintain a coherent project over many sessions, interruptions, branches, reviews, and handoffs, and which process controls keep that workflow reliable.

Motivation

capOS is a useful setting because it is systems software with real correctness constraints: kernel behavior, capability discipline, QEMU evidence, generated schemas, docs, reviews, and integration rules all matter. It is a stronger testbed than toy programming tasks because the work has long dependency chains and observable integration gates.

The immediate practical need is session memory. Raw ~/.codex and ~/.claude logs contain the evidence, but they are too large and operationally noisy for routine recovery or research analysis. The recap tooling creates a derived evidence layer: structured metadata, compact evidence packets, plain-text summaries, parent/child session graphs, and freshness tracking.

Research Questions

Can agentic development produce sustained, reviewable progress on capOS across many sessions and subagents?
Which controls reduce coordination failures such as stale ownership, duplicate work, unsafe branch cleanup, live-process confusion, and review drift?
How should parent sessions and subagent sessions be summarized so project history remains useful without recursively flooding the recap system?
How reliable are LLM-generated factual recaps when grounded in compact evidence packets rather than full transcripts?
What failure modes remain visible after adding stronger evidence fields, prompt examples, routing rules, and summary comparison snapshots?

Hypotheses

Dedicated worktrees, explicit ownership rules, and mandatory review gates reduce destructive interference between concurrent agents.
Root-session summaries plus compact child-session evidence are more useful than treating every subagent as an independent top-level recap by default.
Small summarizer models can handle simple review sessions when given exact paths, strict output scope, and good/bad examples, but routine/recovery and child-heavy parent sessions need stronger models.
Derived recaps can support research and operations if treated as coded observations, while raw transcripts remain the authority for audits.
Iterative prompt and evidence changes can measurably reduce recap defects such as bootstrap-boilerplate summaries, queue-processing self-references, and “limited evidence” outputs.

Experimental Setting

The setting spans more than one development machine. Session identity therefore needs an explicit source-machine dimension: a session captured through one machine but originating on another must remain attributed to the originating machine in raw manifests and derived data.

Observed source classes:

Claude transcripts under ~/.claude/projects/.../*.jsonl.
Claude live metadata under ~/.claude/sessions/*.json.
Codex thread metadata in ~/.codex/state_5.sqlite.
Codex parent/child relationships in thread_spawn_edges.
Codex rollout transcripts under ~/.codex/sessions/YYYY/MM/DD/.
Git branch, worktree, commit, review, and check evidence from this repo.

Raw collection keeps source_host separate from capture_host. A central machine may perform the capture, but the manifest records where each source file originated.

An initial private pilot inventory found a large child-session skew: most Codex sessions were spawned subagents rather than root sessions. This motivates the default policy of indexing every session while queuing only primary/root sessions for standalone summaries.
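
The index-everything, queue-roots-only policy can be sketched as follows. This is a minimal illustration, not the maintainer's actual code: the `(parent_id, child_id)` pair shape and the function names `classify_sessions` and `plan_queue` are assumptions, and the real `thread_spawn_edges` schema may differ.

```python
from collections import defaultdict

def classify_sessions(session_ids, spawn_edges):
    """Split sessions into roots and children using parent->child edges.

    spawn_edges is an iterable of (parent_id, child_id) pairs. The pair
    shape is an assumption, not the real thread_spawn_edges schema.
    """
    children_of = defaultdict(list)
    spawned = set()
    for parent_id, child_id in spawn_edges:
        children_of[parent_id].append(child_id)
        spawned.add(child_id)
    # A root is any session that never appears as a child.
    roots = [s for s in session_ids if s not in spawned]
    return roots, dict(children_of)

def plan_queue(session_ids, spawn_edges):
    """Index every session, but queue only roots for standalone summaries."""
    roots, children_of = classify_sessions(session_ids, spawn_edges)
    return {
        "indexed": list(session_ids),
        "queued": roots,
        "children_of": children_of,
    }

# Hypothetical session ids for illustration only.
sessions = ["root-a", "child-1", "child-2", "root-b"]
edges = [("root-a", "child-1"), ("root-a", "child-2")]
plan = plan_queue(sessions, edges)
```

Children stay in the index and remain reachable through `children_of`, so a root summary can still pull in their evidence.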

Tooling

The repo-tracked tools live under tools/agent-session-recaps/:

maintain_recap_store.py inventories local Claude/Codex sessions, writes script-owned metadata and evidence JSON, maintains summary queues, maps live PIDs conservatively, and ingests LLM-owned summary.txt freshness metadata.
archive_raw_sessions.py snapshots raw session sources with host provenance, checksums, compression, optional project filtering, and optional upload to private object storage.

The default derived recap store remains outside the repo:

~/ai-session-recaps/index.json
~/ai-session-recaps/by-session/{tool}/{session_id}/meta.json
~/ai-session-recaps/by-session/{tool}/{session_id}/evidence.json
~/ai-session-recaps/by-session/{tool}/{session_id}/summary.txt
~/ai-session-recaps/by-session/{tool}/{session_id}/summary.meta.json
~/ai-session-recaps/queue/*.json
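
A small path helper shows how the per-session layout above could be addressed from code. The helper names `session_dir` and `session_files` are hypothetical; only the directory and file names come from the layout listed here.

```python
from pathlib import Path

# Default store location from the layout above.
STORE_ROOT = Path.home() / "ai-session-recaps"

def session_dir(tool, session_id, root=STORE_ROOT):
    """Directory that holds all derived files for one session."""
    return Path(root) / "by-session" / tool / session_id

def session_files(tool, session_id, root=STORE_ROOT):
    """Map each per-session file role to its path."""
    base = session_dir(tool, session_id, root)
    return {
        "meta": base / "meta.json",                  # script-owned
        "evidence": base / "evidence.json",          # script-owned
        "summary": base / "summary.txt",             # LLM-owned prose
        "summary_meta": base / "summary.meta.json",  # script-owned freshness
    }

# Hypothetical root and session id for illustration only.
files = session_files("codex", "example-session-id", root="/tmp/recaps")
```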

Important design choices:

Summary prose lives only in summary.txt.
JSON files remain script-owned metadata/evidence/freshness files.
The index tracks each source's updated_at timestamp so a summary can be marked stale when its source changes.
Parent/root sessions are queued by default.
Spawned child sessions remain indexed and linked, but are not queued by default.
Parent evidence includes compact child-session evidence so root summaries can include meaningful subagent outcomes.
Codex task_complete.last_agent_message is extracted to improve final review and implementation verdicts.
Live Claude/Codex PIDs are mapped conservatively using /proc, Claude procStart, Codex wrapper/native process relationships, and explicit Codex resume evidence when available.
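
The staleness rule among these design choices can be illustrated with a short sketch. It assumes ISO 8601 timestamps and a hypothetical `source_updated_at` field in summary.meta.json that records which source version the summary was written against; the real freshness fields may be named differently.

```python
from datetime import datetime, timezone

def parse_ts(value):
    """Parse an ISO 8601 timestamp, treating naive values as UTC."""
    ts = datetime.fromisoformat(value)
    return ts if ts.tzinfo else ts.replace(tzinfo=timezone.utc)

def summary_state(source_updated_at, summary_meta):
    """Return 'missing', 'stale', or 'current' for one session.

    summary_meta is the parsed summary.meta.json, or None when no
    summary has been ingested yet. The 'source_updated_at' key is an
    assumed field name.
    """
    if summary_meta is None:
        return "missing"
    summarized_at = summary_meta.get("source_updated_at")
    if summarized_at is None:
        # Without provenance the summary cannot be trusted as current.
        return "stale"
    if parse_ts(source_updated_at) > parse_ts(summarized_at):
        return "stale"
    return "current"

state = summary_state(
    "2025-01-02T10:00:00+00:00",
    {"source_updated_at": "2025-01-01T09:00:00+00:00"},
)
```

Only sessions in the missing or stale state would need a new queue entry under this rule.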

Data Products

The experiment distinguishes four layers:

Raw logs: private source of truth.
Evidence packets: compact redacted excerpts, metadata, child-session packets, and command/check summaries.
LLM summaries: qualitative coded observations, not ground truth.
Analysis snapshots: immutable comparison runs that evaluate prompt and evidence changes.

Raw transcripts should not be committed to the public source history. Evidence packets and summaries may be committed only after redaction policy and privacy review. Tooling, schemas, prompts, synthetic examples, and methodology docs can be tracked first.

Raw Evidence Archival

The recap store is derived data; it is not enough for auditability. Raw session sources should be archived separately, with checksums and a manifest that lets a later analysis identify which transcript version produced each evidence packet and summary.

Preferred raw archive design:

Use private object storage, such as a locked-down GCS bucket, as the default archive for raw session logs.
Store compressed snapshots by capture time and source host, for example:

gs://<private-bucket>/capos-agentic-dev/raw-sessions/YYYY/MM/DD/<snapshot-id>/
  manifest.json
  sha256sums.txt
  hosts/primary-dev/.codex/sessions/....jsonl.zst
  hosts/primary-dev/.claude/projects/....jsonl.zst
  hosts/portable-dev/.codex/sessions/....jsonl.zst
  hosts/portable-dev/.claude/projects/....jsonl.zst

Enable uniform bucket-level access, least-privilege IAM, lifecycle rules, and object versioning or retention if the bucket policy allows it.
Consider customer-managed encryption if the archive will contain sensitive prompts, private operational instructions, or source excerpts.
Store a manifest with source host, capture host, source path, archive object path, byte size, SHA-256, source mtime, capture timestamp, tool, session id when known, compression, and redaction status.
Keep the manifest path or archive snapshot id in derived recap metadata so summaries can be audited against the exact archived source.
Do not merge logs from one machine into another machine’s live ~/.codex or ~/.claude trees. Gather them into host-partitioned archives first, then import from that archive if the recap store is extended to multi-host analysis.
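
A manifest entry with the fields listed above could be built roughly as follows. This is a sketch under stated assumptions, not the behavior of archive_raw_sessions.py: the key names, the `manifest_entry` helper, and the example archive path are hypothetical, compression is omitted, and the checksum covers the uncompressed source file.

```python
import hashlib
import os
import tempfile
from datetime import datetime, timezone
from pathlib import Path

def sha256_file(path, chunk_size=1 << 20):
    """Hash a file in chunks so large transcripts are not loaded whole."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()

def manifest_entry(path, source_host, capture_host, tool, archive_path,
                   session_id=None):
    """Describe one archived source file. Key names are assumptions."""
    stat = os.stat(path)
    return {
        "source_host": source_host,    # where the file originated
        "capture_host": capture_host,  # where the capture ran
        "source_path": str(path),
        "archive_path": archive_path,
        "byte_size": stat.st_size,
        "sha256": sha256_file(path),
        "source_mtime": datetime.fromtimestamp(
            stat.st_mtime, timezone.utc).isoformat(),
        "captured_at": datetime.now(timezone.utc).isoformat(),
        "tool": tool,
        "session_id": session_id,      # None when not known
        "compression": "none",         # this sketch does not compress
        "redaction_status": "raw",
    }

# Demonstration on a throwaway file with hypothetical host names.
with tempfile.TemporaryDirectory() as tmp:
    sample = Path(tmp) / "rollout.jsonl"
    sample.write_bytes(b'{"type": "example"}\n')
    entry = manifest_entry(
        sample,
        source_host="portable-dev",
        capture_host="primary-dev",
        tool="codex",
        archive_path="hosts/portable-dev/.codex/sessions/rollout.jsonl",
    )
```

Keeping `source_host` and `capture_host` as separate keys preserves attribution when a central machine performs the capture.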

Project filtering matters on machines that contain unrelated Claude/Codex projects. archive_raw_sessions.py --project-root <path> selects only Codex rollouts whose threads.cwd is inside a selected project/worktree root, writes a filtered Codex state JSON extract, and selects matching Claude project JSONL and session metadata. Full Codex SQLite state, global history, Codex logs DB, Claude tasks, and Claude file history are opt-in.
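
The working-directory test behind project filtering can be sketched with component-wise path comparison. This illustrates the idea only and is not the tool's implementation: the `is_inside` and `select_threads` names and the thread dictionaries are hypothetical, and a real filter would also need to normalize symlinks and relative paths.

```python
from pathlib import PurePosixPath

def is_inside(cwd, root):
    """True when cwd equals root or lies beneath it, compared by path
    component rather than by string prefix."""
    try:
        PurePosixPath(cwd).relative_to(PurePosixPath(root))
        return True
    except ValueError:
        return False

def select_threads(threads, project_roots):
    """Keep only threads whose cwd is inside a selected root."""
    return [
        t for t in threads
        if any(is_inside(t["cwd"], root) for root in project_roots)
    ]

# Hypothetical thread rows for illustration only.
threads = [
    {"id": "t1", "cwd": "/home/dev/capos"},
    {"id": "t2", "cwd": "/home/dev/capos/kernel"},
    {"id": "t3", "cwd": "/home/dev/capos-notes"},
    {"id": "t4", "cwd": "/home/dev/other-project"},
]
selected = select_threads(threads, ["/home/dev/capos"])
```

Comparing by component matters because a plain string-prefix check would wrongly accept a sibling directory such as `/home/dev/capos-notes`.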

Git branch or Git LFS storage is useful only under tighter constraints:

A private Git LFS dataset branch can be convenient for small, curated, redacted, or synthetic fixtures.
Raw local session logs should not go into normal git history because they may contain private prompts, operational instructions, credentials accidentally pasted into chat, or unrelated user content.
Even private Git LFS is awkward for raw logs if later deletion or redaction is needed, because clones and LFS object stores can retain historical content.
If Git LFS is used, prefer a separate private data repository or an orphan data branch, never the normal capOS source branch.

Recommended split:

Git-tracked source repo: tooling, schemas, prompts, proposal, methodology, redaction scripts, synthetic examples.
Private object storage: exact source session JSON/JSONL and SQLite snapshots.
Optional private Git LFS dataset: curated redacted snapshots used by paper reviewers.
Public artifact, if any: synthetic fixtures plus aggregate metrics and selected redacted examples.

Pilot Results

An initial private pilot processed a small queue of current summaries and then reran a target set after prompt/evidence changes.

Baseline result:

Current summaries: 53.
Bad queue/meta/evidence self-reference markers: 0.
“Limited evidence” summaries: 7.
Child-heavy current summaries: 3.

First intervention:

Added prompt good/bad examples.
Added compact child-session evidence for parent Codex sessions.
Dereferenced child recap-worker output files so parent summaries see summary text, not only completion paths.

Second intervention:

Added Codex task_complete.last_agent_message extraction.
Reran the remaining limited-evidence summaries.

Combined candidate result:

Candidate summaries: 53.
Bad self-reference markers: 0.
Limited-evidence summaries: 0.
Average baseline summary length: 1221.6 characters.
Average candidate summary length: 1060.6 characters.

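These results are consistent with the claim that prompt examples help but evidence shape matters more. Limited-evidence summaries fell from 7 to 0 across the two interventions, and the cases that remained after the first intervention were rerun only once final-result evidence had been added. The weak summaries were not only a prompt problem; they lacked the right final-result evidence.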
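
The pilot figures above reduce to two simple ratios. The sketch below only restates the reported numbers; the dictionary keys and the `relative_change` helper are illustrative names.

```python
# Figures copied from the baseline and candidate results above.
baseline = {"summaries": 53, "limited_evidence": 7, "avg_chars": 1221.6}
candidate = {"summaries": 53, "limited_evidence": 0, "avg_chars": 1060.6}

def relative_change(before, after):
    """Signed fractional change from before to after."""
    return (after - before) / before

limited_rate_before = baseline["limited_evidence"] / baseline["summaries"]
limited_rate_after = candidate["limited_evidence"] / candidate["summaries"]
length_change = relative_change(baseline["avg_chars"], candidate["avg_chars"])

print(f"limited-evidence rate: {limited_rate_before:.1%} -> {limited_rate_after:.1%}")
print(f"average length change: {length_change:.1%}")
```

Under these numbers the limited-evidence rate drops from about 13.2% to 0%, and the average summary becomes about 13.2% shorter.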

Methodology

Collection

Run the recap maintainer periodically or after major work bursts. Each run should:

refresh metadata for all sessions;
update evidence for recent primary/root sessions;
preserve child-session graph information;
update live-process mappings;
queue only stale or missing summaries;
record immutable analysis snapshots before major prompt/evidence changes.

Summarization

Use model routing:

gpt-5.3-codex-spark for simple, concrete, non-routine sessions.
A stronger model for routine/recovery sessions, live sessions, and child-heavy parent sessions.

Keep small-model tasks concrete:

one queue item or a very small batch;
exact paths;
no JSON output;
no broad filesystem exploration;
good/bad examples in the prompt;
strict instruction to summarize target-session outcomes, not the queue-processing task.
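
The routing rule can be expressed as a small function. This is a sketch, assuming hypothetical session fields (`kind`, `live`, `child_count`), a placeholder name for the stronger model, and an illustrative child-heavy threshold that the proposal does not define.

```python
SMALL_MODEL = "gpt-5.3-codex-spark"
STRONG_MODEL = "stronger-model"   # placeholder, not a real model id
CHILD_HEAVY_THRESHOLD = 5         # assumed cutoff for "child-heavy"

def route_model(session):
    """Pick a summarizer model for one queued session.

    Routine/recovery sessions, live sessions, and child-heavy parent
    sessions go to the stronger model; everything else goes to the
    small model.
    """
    needs_strong = (
        session.get("kind") in {"routine", "recovery"}
        or session.get("live", False)
        or session.get("child_count", 0) >= CHILD_HEAVY_THRESHOLD
    )
    return STRONG_MODEL if needs_strong else SMALL_MODEL

# Hypothetical queue items for illustration only.
review_model = route_model({"kind": "review", "child_count": 0})
recovery_model = route_model({"kind": "recovery"})
parent_model = route_model({"kind": "implementation", "child_count": 12})
```

Keeping the decision in one function makes the routing policy easy to snapshot alongside prompt and evidence changes.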

Metrics

Hard metrics:

session count by tool, primary/child/root role, model, and live state;
queue size and stale/current summary count;
child-session count per root;
number of review findings, no-finding reviews, failed checks, and passed checks;
branch/worktree lifecycle events: created, committed, reviewed, merged, pushed, parked, abandoned;
recovery-session frequency and duration;
recap quality markers: self-reference markers, limited-evidence phrases, missing final verdicts, excessive bootstrap boilerplate.
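
Recap quality markers lend themselves to a simple text scan. The sketch below counts how many summaries contain each marker at least once; the regular expressions are illustrative guesses at what a self-reference or limited-evidence phrase looks like, not the patterns used in the pilot.

```python
import re

# Assumed marker patterns; tune them against real summaries before use.
MARKERS = {
    "limited_evidence": re.compile(r"limited[- ]evidence", re.IGNORECASE),
    "self_reference": re.compile(
        r"queue item|summary\.meta\.json|evidence\.json", re.IGNORECASE),
}

def count_markers(summaries):
    """Count summaries that contain each marker at least once.

    summaries maps a session id to its summary.txt contents.
    """
    counts = {name: 0 for name in MARKERS}
    for text in summaries.values():
        for name, pattern in MARKERS.items():
            if pattern.search(text):
                counts[name] += 1
    return counts

# Synthetic summaries for illustration only.
example = {
    "s1": "Reviewed the scheduler patch; no findings.",
    "s2": "Limited evidence was available for the final verdict.",
    "s3": "Processed the queue item and wrote evidence.json.",
}
marker_counts = count_markers(example)
```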

Qualitative coding:

coordination failures;
evidence gaps;
useful controls;
subagent summarization failures;
review-loop behavior;
human intervention points.

Validation

Treat summaries as coded observations. Validate claims against raw logs, git history, and checks before using them as paper evidence.

Use audits:

sample raw transcript lines for selected summaries;
verify cited commits and branches;
verify check outcomes in logs;
compare parent summaries against child-session final results;
rerun summaries after prompt/evidence changes and compare snapshots.
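
Audit sampling is easier to defend when it is reproducible. A minimal sketch, assuming a hypothetical `audit_sample` helper and a seed recorded with the analysis snapshot:

```python
import random

def audit_sample(session_ids, k, seed):
    """Pick up to k session ids for manual audit, reproducibly.

    Sorting first makes the result independent of input order, and a
    dedicated Random instance avoids touching global random state.
    """
    ordered = sorted(session_ids)
    rng = random.Random(seed)
    return rng.sample(ordered, min(k, len(ordered)))

# Hypothetical session ids and seed for illustration only.
ids = [f"session-{n:03d}" for n in range(53)]
picked = audit_sample(ids, k=5, seed=20250101)
```

Rerunning with the same ids and seed selects the same sessions, so a later reviewer can repeat the audit against the archived raw transcripts.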

Threats To Validity

Single-project bias: capOS is one project with one workflow.
Model/version drift: model behavior and Codex/Claude log schemas may change.
Observer effect: improving prompts and processes changes the system being studied.
LLM-coded summaries can omit or distort details.
Raw logs may contain private operational data, limiting public reproducibility.
Agent behavior is affected by local instructions, model routing, and tool availability.

Paper Outline

Introduction: why long-running agentic development is different from single-prompt code generation.
Background: capOS, worktrees, review gates, Codex/Claude sessions.
System design: recap instrumentation, evidence packets, child-session graph, model routing, live-process mapping.
Methodology: longitudinal observation, metrics, prompt/evidence interventions, audit strategy.
Pilot findings: session scale, child-session dominance, failure modes, recap improvement loop.
Case studies:
- recovery session after interruption;
- child-heavy device-driver-foundation work session;
- repeated review loop;
- recap prompt/evidence refinement.
Discussion: what worked, what remained brittle, implications for agentic software engineering.
Limitations and future work.

Immediate Next Steps

Add schema documentation and a privacy/redaction README.
Add repeatable analysis scripts for baseline/rerun comparison.
Add a small synthetic fixture set that exercises:
- root session with children;
- recap-worker child returning only a path;
- review session with task_complete;
- recovery session with bootstrap boilerplate.
Decide whether generated summaries should be tracked privately, exported as redacted snapshots, or kept only as local research data.
