# Proposal: capOS Agentic Development Experiment

This proposal treats capOS development as a longitudinal field experiment in
agentic software engineering. The experiment studies whether persistent coding
agents, subagents, review agents, recovery routines, and session-recap tooling
can make sustained progress on a nontrivial operating-system project while
preserving engineering quality, reviewability, and coordination safety.

The core question is not whether an AI can produce isolated code changes. The
stronger question is whether an agentic workflow can maintain a coherent project
over many sessions, interruptions, branches, reviews, and handoffs, and which
process controls keep that workflow reliable.

## Motivation

capOS is a useful setting because it is systems software with real correctness
constraints: kernel behavior, capability discipline, QEMU evidence, generated
schemas, docs, reviews, and integration rules all matter. It is a stronger
testbed than toy programming tasks because the work has long dependency chains
and observable integration gates.

The immediate practical need is session memory. Raw `~/.codex` and `~/.claude`
logs contain the evidence, but they are too large and operationally noisy for
routine recovery or research analysis. The recap tooling creates a derived
evidence layer: structured metadata, compact evidence packets, plain-text
summaries, parent/child session graphs, and freshness tracking.

## Research Questions

1. Can agentic development produce sustained, reviewable progress on capOS
   across many sessions and subagents?
2. Which controls reduce coordination failures such as stale ownership,
   duplicate work, unsafe branch cleanup, live-process confusion, and review
   drift?
3. How should parent sessions and subagent sessions be summarized so project
   history remains useful without recursively flooding the recap system?
4. How reliable are LLM-generated factual recaps when grounded in compact
   evidence packets rather than full transcripts?
5. What failure modes remain visible after adding stronger evidence fields,
   prompt examples, routing rules, and summary comparison snapshots?

## Hypotheses

- Dedicated worktrees, explicit ownership rules, and mandatory review gates
  reduce destructive interference between concurrent agents.
- Root-session summaries plus compact child-session evidence are more useful
  than treating every subagent as an independent top-level recap by default.
- Small summarizer models can handle simple review sessions when given exact
  paths, strict output scope, and good/bad examples, but routine/recovery and
  child-heavy parent sessions need stronger models.
- Derived recaps can support research and operations if treated as coded
  observations, while raw transcripts remain the authority for audits.
- Iterative prompt and evidence changes can measurably reduce recap defects
  such as bootstrap-boilerplate summaries, queue-processing self-references,
  and "limited evidence" outputs.

## Experimental Setting

The setting spans more than one development machine. Session identity therefore
needs an explicit source-machine dimension: a session captured through one
machine but originating on another must remain attributed to the originating
machine in raw manifests and derived data.

Observed source classes:

- Claude transcripts under `~/.claude/projects/.../*.jsonl`.
- Claude live metadata under `~/.claude/sessions/*.json`.
- Codex thread metadata in `~/.codex/state_5.sqlite`.
- Codex parent/child relationships in `thread_spawn_edges`.
- Codex rollout transcripts under `~/.codex/sessions/YYYY/MM/DD/`.
- Git branch, worktree, commit, review, and check evidence from this repo.

Raw collection keeps `source_host` separate from `capture_host`. A central
machine may perform the capture, but the manifest records where each source
file originated.
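
To make the split concrete, a manifest entry might carry both hosts explicitly. This is an illustrative sketch, not the tool's actual schema; the field names and example paths are assumptions:

```python
import hashlib

def manifest_record(source_path: str, source_host: str, capture_host: str,
                    data: bytes) -> dict:
    """Build one raw-capture manifest entry.

    source_host is where the session originally ran; capture_host is the
    machine that performed this capture. Recording both means a centrally
    captured file stays attributed to its originating machine.
    """
    return {
        "source_path": source_path,
        "source_host": source_host,
        "capture_host": capture_host,
        "sha256": hashlib.sha256(data).hexdigest(),
        "byte_size": len(data),
    }

# A session captured on the central machine but originating elsewhere:
record = manifest_record(
    "~/.codex/sessions/2025/01/10/rollout.jsonl",  # illustrative path
    source_host="portable-dev",
    capture_host="primary-dev",
    data=b"example transcript bytes",
)
```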

An initial private pilot inventory found a large child-session skew: most Codex
sessions were spawned subagents rather than root sessions. This motivates the
default policy of indexing every session while queuing only primary/root
sessions for standalone summaries.

## Tooling

The repo-tracked tools live under `tools/agent-session-recaps/`:

- `maintain_recap_store.py` inventories local Claude/Codex sessions, writes
  script-owned metadata and evidence JSON, maintains summary queues, maps live
  PIDs conservatively, and ingests LLM-owned `summary.txt` files to track
  their freshness metadata.
- `archive_raw_sessions.py` snapshots raw session sources with host
  provenance, checksums, compression, optional project filtering, and optional
  upload to private object storage.

The default derived recap store remains outside the repo:

- `~/ai-session-recaps/index.json`
- `~/ai-session-recaps/by-session/{tool}/{session_id}/meta.json`
- `~/ai-session-recaps/by-session/{tool}/{session_id}/evidence.json`
- `~/ai-session-recaps/by-session/{tool}/{session_id}/summary.txt`
- `~/ai-session-recaps/by-session/{tool}/{session_id}/summary.meta.json`
- `~/ai-session-recaps/queue/*.json`
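
The per-session layout can be derived mechanically from `{tool}` and `{session_id}`; a minimal path-helper sketch (the helper itself is illustrative, not part of the tools):

```python
from pathlib import Path

RECAP_ROOT = Path.home() / "ai-session-recaps"

def session_dir(tool: str, session_id: str) -> Path:
    """Directory holding one session's script-owned and LLM-owned files."""
    return RECAP_ROOT / "by-session" / tool / session_id

d = session_dir("codex", "abc123")      # session id is illustrative
meta = d / "meta.json"                  # script-owned metadata
evidence = d / "evidence.json"          # script-owned evidence packet
summary = d / "summary.txt"             # the only file holding summary prose
```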

Important design choices:

- Summary prose lives only in `summary.txt`.
- JSON files remain script-owned metadata/evidence/freshness files.
- The index tracks source `updated_at` timestamps for staleness.
- Parent/root sessions are queued by default.
- Spawned child sessions remain indexed and linked, but are not queued by
  default.
- Parent evidence includes compact child-session evidence so root summaries can
  include meaningful subagent outcomes.
- Codex `task_complete.last_agent_message` is extracted to improve final review
  and implementation verdicts.
- Live Claude/Codex PIDs are mapped conservatively using `/proc`, Claude
  `procStart`, Codex wrapper/native process relationships, and explicit Codex
  resume evidence when available.
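
The `task_complete.last_agent_message` extraction can be sketched as a scan over rollout JSONL lines. The record shape below is an assumption for illustration; real Codex rollout records may nest this payload differently:

```python
import json

def last_agent_message(rollout_lines):
    """Return the final task_complete.last_agent_message seen, if any.

    Assumes each rollout line is a JSON object that may carry a
    `task_complete` payload at the top level (an assumption).
    """
    message = None
    for line in rollout_lines:
        try:
            record = json.loads(line)
        except json.JSONDecodeError:
            continue  # skip partial/corrupt lines rather than fail the scan
        payload = record.get("task_complete")
        if isinstance(payload, dict) and payload.get("last_agent_message"):
            message = payload["last_agent_message"]
    return message

lines = [
    '{"type": "turn"}',
    '{"task_complete": {"last_agent_message": "Review passed; no findings."}}',
]
```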

## Data Products

The experiment distinguishes four layers:

1. Raw logs: private source of truth.
2. Evidence packets: compact redacted excerpts, metadata, child-session packets,
   and command/check summaries.
3. LLM summaries: qualitative coded observations, not ground truth.
4. Analysis snapshots: immutable comparison runs that evaluate prompt and
   evidence changes.

Raw transcripts should not be committed to the public source history. Evidence
packets and summaries may be committed only after they pass the redaction
policy and a privacy review. Tooling, schemas, prompts, synthetic examples, and
methodology docs can be tracked first.

## Raw Evidence Archival

The recap store is derived data; it is not enough for auditability. Raw session
sources should be archived separately, with checksums and a manifest that lets a
later analysis reproduce which transcript version produced each evidence packet
and summary.

Preferred raw archive design:

- Use private object storage, such as a locked-down GCS bucket, as the default
  archive for raw session logs.
- Store compressed snapshots by capture time and source host, for example:

```text
gs://<private-bucket>/capos-agentic-dev/raw-sessions/YYYY/MM/DD/<snapshot-id>/
  manifest.json
  sha256sums.txt
  hosts/primary-dev/.codex/sessions/....jsonl.zst
  hosts/primary-dev/.claude/projects/....jsonl.zst
  hosts/portable-dev/.codex/sessions/....jsonl.zst
  hosts/portable-dev/.claude/projects/....jsonl.zst
```

- Enable uniform bucket-level access, least-privilege IAM, lifecycle rules, and
  object versioning or retention if the bucket policy allows it.
- Consider customer-managed encryption if the archive will contain sensitive
  prompts, private operational instructions, or source excerpts.
- Store a manifest with source host, capture host, source path, archive object
  path, byte size, SHA-256, source mtime, capture timestamp, tool, session id
  when known, compression, and redaction status.
- Keep the manifest path or archive snapshot id in derived recap metadata so
  summaries can be audited against the exact archived source.
- Do not merge logs from one machine into another machine's live `~/.codex` or
  `~/.claude` trees. Gather them into host-partitioned archives first, then
  import from that archive if the recap store is extended to multi-host
  analysis.
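
Generating the `sha256sums.txt` companion file for a snapshot can be as simple as the following sketch, which uses the standard `sha256sum` two-space output format:

```python
import hashlib
from pathlib import Path

def write_sha256sums(snapshot_dir: Path) -> Path:
    """Write a sha256sums.txt covering every file in the snapshot."""
    out = snapshot_dir / "sha256sums.txt"
    lines = []
    for path in sorted(snapshot_dir.rglob("*")):
        if path.is_file() and path.name != "sha256sums.txt":
            digest = hashlib.sha256(path.read_bytes()).hexdigest()
            # Standard `sha256sum` format: digest, two spaces, relative path.
            lines.append(f"{digest}  {path.relative_to(snapshot_dir)}")
    out.write_text("\n".join(lines) + "\n")
    return out
```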

Project filtering matters on machines that contain unrelated Claude/Codex
projects. `archive_raw_sessions.py --project-root <path>` selects only Codex
rollouts whose `threads.cwd` is inside a selected project/worktree root, writes
a filtered Codex state JSON extract, and selects matching Claude project JSONL
and session metadata. Full Codex SQLite state, global history, Codex logs DB,
Claude tasks, and Claude file history are opt-in.

Git branch or Git LFS storage is useful only under tighter constraints:

- A private Git LFS dataset branch can be convenient for small, curated,
  redacted, or synthetic fixtures.
- Raw local session logs should not go into normal git history because they may
  contain private prompts, operational instructions, credentials accidentally
  pasted into chat, or unrelated user content.
- Even private Git LFS is awkward for raw logs if later deletion or redaction is
  needed, because clones and LFS object stores can retain historical content.
- If Git LFS is used, prefer a separate private data repository or an orphan
  data branch, never the normal capOS source branch.

Recommended split:

- Git-tracked source repo: tooling, schemas, prompts, proposal, methodology,
  redaction scripts, synthetic examples.
- Private object storage: exact source session JSON/JSONL and SQLite snapshots.
- Optional private Git LFS dataset: curated redacted snapshots used by paper
  reviewers.
- Public artifact, if any: synthetic fixtures plus aggregate metrics and
  selected redacted examples.

## Pilot Results

An initial private pilot processed a small summary queue to produce a baseline
set of current summaries, then reran a target set after prompt/evidence
changes.

Baseline result:

- Current summaries: 53.
- Bad queue/meta/evidence self-reference markers: 0.
- "Limited evidence" summaries: 7.
- Child-heavy current summaries: 3.

First intervention:

- Added prompt good/bad examples.
- Added compact child-session evidence for parent Codex sessions.
- Dereferenced child recap-worker output files so parent summaries see summary
  text, not only completion paths.

Second intervention:

- Added Codex `task_complete.last_agent_message` extraction.
- Reran the remaining limited-evidence summaries.

Combined candidate result:

- Candidate summaries: 53.
- Bad self-reference markers: 0.
- Limited-evidence summaries: 0.
- Average baseline summary length: 1221.6 characters.
- Average candidate summary length: 1060.6 characters.

These results support the claim that prompt examples help, but evidence shape
matters more. The weak summaries were not only a prompt problem; they lacked the
right final-result evidence.
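
The marker counts above can be reproduced with a simple scan over two snapshot directories of `summary.txt` files. The marker patterns below are illustrative, not the exact regexes used in the pilot:

```python
import re
from statistics import mean

LIMITED_EVIDENCE = re.compile(r"limited evidence", re.IGNORECASE)
SELF_REFERENCE = re.compile(r"\b(queue|meta\.json|evidence\.json)\b",
                            re.IGNORECASE)

def snapshot_stats(summaries):
    """Compute pilot-style quality markers over a list of summary strings."""
    return {
        "count": len(summaries),
        "limited_evidence": sum(bool(LIMITED_EVIDENCE.search(s))
                                for s in summaries),
        "self_reference": sum(bool(SELF_REFERENCE.search(s))
                              for s in summaries),
        "avg_length": round(mean(len(s) for s in summaries), 1)
                      if summaries else 0.0,
    }
```

Running the same function over the baseline and candidate snapshot directories gives directly comparable numbers.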

## Methodology

### Collection

Run the recap maintainer periodically or after major work bursts. Each run
should:

- refresh metadata for all sessions;
- update evidence for recent primary/root sessions;
- preserve child-session graph information;
- update live-process mappings;
- queue only stale or missing summaries;
- record immutable analysis snapshots before major prompt/evidence changes.

### Summarization

Use model routing:

- `gpt-5.3-codex-spark` for simple, concrete, non-routine sessions.
- A stronger model for routine/recovery sessions, live sessions, and
  child-heavy parent sessions.

Keep small-model tasks concrete:

- one queue item or a very small batch;
- exact paths;
- no JSON output;
- no broad filesystem exploration;
- good/bad examples in the prompt;
- strict instruction to summarize target-session outcomes, not the
  queue-processing task.
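
The routing policy can be expressed as a small predicate. The session attributes, the child-heavy threshold, and the stronger-model placeholder below are assumptions about what the recap metadata exposes:

```python
SMALL_MODEL = "gpt-5.3-codex-spark"
STRONG_MODEL = "stronger-summarizer"  # placeholder, not a real model id
CHILD_HEAVY_THRESHOLD = 3             # assumption; tune from pilot data

def route_model(is_routine_or_recovery: bool, is_live: bool,
                child_count: int) -> str:
    """Route routine/recovery, live, and child-heavy sessions to the
    stronger model; simple, concrete sessions go to the small one."""
    if (is_routine_or_recovery or is_live
            or child_count >= CHILD_HEAVY_THRESHOLD):
        return STRONG_MODEL
    return SMALL_MODEL
```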

### Metrics

Hard metrics:

- session count by tool, primary/child/root role, model, and live state;
- queue size and stale/current summary count;
- child-session count per root;
- number of review findings, no-finding reviews, failed checks, and passed
  checks;
- branch/worktree lifecycle events: created, committed, reviewed, merged,
  pushed, parked, abandoned;
- recovery-session frequency and duration;
- recap quality markers: self-reference markers, limited-evidence phrases,
  missing final verdicts, excessive bootstrap boilerplate.

Qualitative coding:

- coordination failures;
- evidence gaps;
- useful controls;
- subagent summarization failures;
- review-loop behavior;
- human intervention points.

### Validation

Treat summaries as coded observations. Validate claims against raw logs, git
history, and checks before using them as paper evidence.

Use audits:

- sample raw transcript lines for selected summaries;
- verify cited commits and branches;
- verify check outcomes in logs;
- compare parent summaries against child-session final results;
- rerun summaries after prompt/evidence changes and compare snapshots.

## Threats to Validity

- Single-project bias: capOS is one project with one workflow.
- Model/version drift: model behavior and Codex/Claude log schemas may change.
- Observer effect: improving prompts and processes changes the system being
  studied.
- LLM-coded summaries can omit or distort details.
- Raw logs may contain private operational data, limiting public
  reproducibility.
- Agent behavior is affected by local instructions, model routing, and tool
  availability.

## Paper Outline

1. Introduction: why long-running agentic development is different from
   single-prompt code generation.
2. Background: capOS, worktrees, review gates, Codex/Claude sessions.
3. System design: recap instrumentation, evidence packets, child-session graph,
   model routing, live-process mapping.
4. Methodology: longitudinal observation, metrics, prompt/evidence
   interventions, audit strategy.
5. Pilot findings: session scale, child-session dominance, failure modes, recap
   improvement loop.
6. Case studies:
   - recovery session after interruption;
   - child-heavy device-driver-foundation work session;
   - repeated review loop;
   - recap prompt/evidence refinement.
7. Discussion: what worked, what remained brittle, implications for agentic
   software engineering.
8. Limitations and future work.

## Immediate Next Steps

1. Add schema documentation and a privacy/redaction README.
2. Add repeatable analysis scripts for baseline/rerun comparison.
3. Add a small synthetic fixture set that exercises:
   - root session with children;
   - recap-worker child returning only a path;
   - review session with `task_complete`;
   - recovery session with bootstrap boilerplate.
4. Decide whether generated summaries should be tracked privately, exported as
   redacted snapshots, or kept only as local research data.
