# Proposal: Session Archive and Gantt Effort Pipeline

Development tasks in capOS each carry a real start and finish time. The
autonomous development loop records these directly for tasks it executes; for
earlier work the timing is recoverable from agent session transcripts. Collected
together and attributed to branches and tasks, that timing data enables two
things: a whole-history development Gantt and a dataset for predicting how long
a future task will take.

**Status:** Proposal. The foundation is partially landed. A per-day task ledger
exists in `docs/tasks/done/`, where each done entry carries the real branch
commit SHAs and, for tasks executed by the autonomous development loop, real
`started` and `completed` timestamps sourced from the run-telemetry log. A
`prepare-commit-msg` hook stamps `Plan-Item`, `Run-Id`, and `Agent-Kind`
trailers on commits so the commit-to-task-to-run mapping is native to git
history. The session-transcript ETL, the derived dataset builder, and the
duration-prediction model are future work this proposal scopes.

## Goals

- Predict how long a future task will take from historical effort patterns,
  using features derivable from the task's commits and metadata.
- Render a whole-history development Gantt over the landed branch and task
  ledger, attributing each interval to the task that produced it.
- Feed that data back into planning: size estimates, milestone forecasting, and
  identification of subsystems or slice classes that consistently take longer
  than anticipated.

## Timing Sources

Two sources provide per-task effort data, at different points in the project
timeline:

**Run-telemetry log (loop-era tasks).** The autonomous development loop writes
a record per task run to a local telemetry log. Each record carries: a run id,
the task id, the agent kind, a session id, a `started` timestamp (when the
agent began), and a `completed` timestamp (when the agent finished and the
branch was merged or abandoned). These timestamps are exact wall-clock values,
not estimates. The log itself is ephemeral and never committed; when the task
closes, both timestamps are promoted to the task's `done/` file as `started:`
and `completed:` front-matter fields. That promotion is the boundary between
local operational state and the durable public record.

**Agent session transcripts (pre-loop history).** For tasks worked before the
autonomous development loop existed, timing must be reconstructed from agent
session transcripts. Two transcript formats exist in the project history:

- A Claude session JSONL format: one JSON object per turn, with a UTC timestamp,
  a role (`user` or `assistant`), message content, and tool-call records.
- A Codex session-rollout format: a structured log of model turns with file
  edits, shell commands, and timestamps.

Both formats carry enough information to recover: when a session started, which
files were touched, which repository and branch were active, and approximately
when the session ended (last turn timestamp). Cross-tool interval merging (a
task worked in two different tools during the same calendar day) is a rare edge
case; in practice each task belongs primarily to one tool and one continuous
session window.
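As a rough illustration of recovering session bounds from a JSONL transcript, the sketch below reads one JSON object per line and returns the first and last turn timestamps. The field name `timestamp` and the helper name `session_bounds` are assumptions for illustration; the actual transcript schema should be checked before relying on them.

```python
import json
from datetime import datetime


def session_bounds(lines, ts_field="timestamp"):
    """Return (first, last) turn timestamps, or None if no timestamps were found.

    `ts_field` is a hypothetical field name; adjust it to the real transcript schema.
    """
    stamps = []
    for line in lines:
        line = line.strip()
        if not line:
            continue
        try:
            obj = json.loads(line)
        except json.JSONDecodeError:
            continue  # tolerate truncated or corrupt lines in old backups
        raw = obj.get(ts_field) if isinstance(obj, dict) else None
        if isinstance(raw, str):
            # fromisoformat() before Python 3.11 rejects a trailing "Z"
            stamps.append(datetime.fromisoformat(raw.replace("Z", "+00:00")))
    if not stamps:
        return None
    return min(stamps), max(stamps)
```

Taking `min` and `max` rather than the first and last line keeps the result stable if a transcript was appended to out of order.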

## Pipeline

The pipeline has four stages:

### 1. Collect

Gather transcript files from wherever they reside. The Claude JSONL transcripts
are stored under a well-known local path per session. The Codex rollouts are
scattered across machines and backup directories and must be enumerated by a
manifest or directory scan. Neither format is committed to the repository; they
are local/backup artifacts. The collect stage produces a manifest of transcript
files keyed by session id and format type.
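A directory-scan variant of the collect stage might look like the sketch below. It is a minimal sketch under stated assumptions: the `classify` callback, the use of the file stem as session id, and the manifest field names are all hypothetical, since this proposal does not fix the on-disk layout of either transcript format.

```python
from pathlib import Path
from typing import Callable, Iterable, Optional


def build_manifest(
    roots: Iterable[Path],
    classify: Callable[[Path], Optional[str]],
) -> list[dict]:
    """Scan `roots` recursively and return one manifest entry per transcript.

    `classify` returns a format label (e.g. "claude-jsonl") or None to skip a file.
    Assumption: the file stem identifies the session.
    """
    manifest: dict[tuple[str, str], dict] = {}
    for root in roots:
        for path in sorted(root.rglob("*")):
            if not path.is_file():
                continue
            fmt = classify(path)
            if fmt is None:
                continue
            session_id = path.stem
            # Keep the first copy seen so backup duplicates do not double-count.
            manifest.setdefault(
                (fmt, session_id),
                {"session_id": session_id, "format": fmt, "path": str(path)},
            )
    return list(manifest.values())
```

Keying on `(format, session_id)` means the same session found in both a live directory and a backup yields a single manifest entry.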

### 2. Normalize

Parse each transcript into a common event schema:

```
{
  "session_id": "...",
  "format": "claude-jsonl" | "codex-rollout",
  "started_at": "<UTC ISO timestamp>",
  "ended_at":   "<UTC ISO timestamp>",
  "repo":       "<repo name>",
  "branch":     "<branch name or null>",
  "files_touched": ["<relative path>", ...],
  "tool_calls": <count>,
  "role_turns": <count>
}
```

The `started_at` and `ended_at` values are the first and last turn timestamps
in the session. For the duration estimate, idle time between turns (long pauses
between user and assistant turns, or overnight gaps within a session file) is
clipped: only contiguous active intervals -- where consecutive turn timestamps
are within a configurable idle threshold -- count toward the active duration.
The result is an idle-clipped active duration attributed to the session.
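The idle-clipping rule can be sketched as a sum over consecutive turn gaps that skips any gap longer than the threshold. The function name `active_duration` is hypothetical, and the 30-minute default is only the starting point suggested under Open Questions, not a settled value.

```python
from datetime import datetime, timedelta


def active_duration(timestamps, idle_threshold=timedelta(minutes=30)):
    """Sum the gaps between consecutive turns, dropping gaps above the threshold."""
    ordered = sorted(timestamps)
    total = timedelta(0)
    for prev, cur in zip(ordered, ordered[1:]):
        gap = cur - prev
        if gap <= idle_threshold:
            total += gap  # contiguous activity counts toward effort
        # otherwise the gap is treated as idle and contributes nothing
    return total


turns = [
    datetime(2024, 1, 1, 10, 0),
    datetime(2024, 1, 1, 10, 5),
    datetime(2024, 1, 1, 10, 20),
    datetime(2024, 1, 1, 12, 0),   # 100-minute pause: clipped
    datetime(2024, 1, 1, 12, 10),
]
print(active_duration(turns))  # 0:30:00
```

A session with a single turn yields zero active duration under this rule; whether such sessions deserve a minimum floor is a separate design choice.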

Per-task effort is the sum of idle-clipped active durations across all sessions
whose `branch` matches the task's branch. For a task with a single session the
sum is just that session's active duration; where one session covered multiple
branches, the attribution is prorated by file overlap or left to manual
annotation.
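One possible reading of "prorated by file overlap" is to split a session's active duration across branches in proportion to how many of the session's touched files also appear in each branch's commits. The sketch below assumes that weighting; the function name and the `branch_files` input shape are hypothetical.

```python
from datetime import timedelta


def prorate_by_file_overlap(active, files_touched, branch_files):
    """Split `active` across branches by count of shared files.

    `branch_files` maps branch name -> iterable of paths changed on that branch.
    Returns {} when there is no overlap, leaving the session to manual annotation.
    """
    touched = set(files_touched)
    overlap = {b: len(touched & set(fs)) for b, fs in branch_files.items()}
    total = sum(overlap.values())
    if total == 0:
        return {}
    return {b: active * (n / total) for b, n in overlap.items() if n}


shares = prorate_by_file_overlap(
    timedelta(minutes=60),
    ["kernel/a.rs", "kernel/b.rs", "kernel/c.rs", "docs/x.md"],
    {"feat-a": ["kernel/a.rs", "kernel/b.rs", "kernel/c.rs"], "feat-b": ["docs/x.md"]},
)
print(shares)  # feat-a gets 45 minutes, feat-b gets 15
```

Counting files treats a one-line doc fix and a large kernel change as equal weight; weighting by churn instead would be a reasonable alternative.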

### 3. Recap and Index

After normalization, a recap step produces a per-task effort index: task id,
branch, real started/completed timestamps (from the run-telemetry promotions for
loop-era tasks, from the session-normalized estimate for pre-loop tasks),
idle-clipped active duration, agent kind, and the commit SHAs that belong to the
task. This index is written to a structured file (JSON Lines, one record per
task) under `target/` during the build and is the input to the dataset builder
and the Gantt renderer. It is a derived artifact; the sources of truth are git
history, the `docs/tasks/done/` ledger, and the transcript files.
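For illustration, one index record per line could be written and read back as below. The key names (`task_id`, `active_minutes`, and so on) are assumptions rather than a fixed schema; the second record shows a task with no recoverable transcript, where the duration is left null and annotated instead of estimated.

```python
import json


def dump_effort_index(records, fh):
    """Write one JSON object per line (JSON Lines)."""
    for rec in records:
        fh.write(json.dumps(rec, sort_keys=True) + "\n")


def load_effort_index(fh):
    """Read JSON Lines back into a list of dicts, skipping blank lines."""
    return [json.loads(line) for line in fh if line.strip()]


example_records = [
    {
        "task_id": "example-task-1",          # hypothetical id
        "branch": "example-branch",
        "started": "2024-01-01T10:00:00Z",
        "completed": "2024-01-01T12:10:00Z",
        "active_minutes": 30,
        "agent_kind": "example-agent",
        "commits": ["0123abc"],
        "source": "run-telemetry",
    },
    {
        "task_id": "example-task-2",
        "branch": "example-branch-2",
        "started": None,
        "completed": None,
        "active_minutes": None,               # not imputed
        "agent_kind": None,
        "commits": ["4567def"],
        "source": "unavailable",
    },
]
```

Sorting keys on write keeps the file diff-friendly when the index is regenerated.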

### 4. Store in Object Storage

The normalized transcript archive and the per-task effort index are stored in
object storage (GCS or S3) under a versioned prefix. This serves two purposes:
it makes the archive portable across machines, and it provides a stable input
for the prediction dataset builder that does not depend on the local transcript
directory layout. The object storage upload is a manual or CI-triggered step,
not part of every build.

## Commit Provenance

The `prepare-commit-msg` hook (landed at `tools/githooks/prepare-commit-msg`)
stamps three trailers on every commit:

- `Plan-Item: <task-id>` -- the task this commit belongs to.
- `Run-Id: <run-id>` -- the run-telemetry log entry for this work session.
- `Agent-Kind: <kind>` -- which implementation agent produced the commit.

These trailers make the commit-to-task and commit-to-run mappings native to git
history and queryable by `git log --grep`. A Gantt renderer can walk `git log`
and group commits by `Plan-Item`, attributing intervals to tasks without any
external database. The run-telemetry log fills in wall-clock start/end; git
provides the commit sequence and churn metrics.
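The grouping step can be sketched over plain commit messages, as below. This is a minimal sketch: it parses trailers with a regular expression rather than calling git, and the function names are hypothetical. A real tool would more likely lean on git's own trailer parsing (for example `git interpret-trailers --parse`) to handle folding and edge cases.

```python
import re
from collections import defaultdict

# Only the three trailers stamped by the hook are recognized here.
TRAILER_RE = re.compile(r"^(Plan-Item|Run-Id|Agent-Kind):[ \t]*(.+)$", re.MULTILINE)


def parse_trailers(message):
    """Return {trailer name: value} for the recognized trailers in a commit message."""
    return {key: value.strip() for key, value in TRAILER_RE.findall(message)}


def group_by_plan_item(commits):
    """Group commit SHAs by Plan-Item. `commits` is an iterable of (sha, message)."""
    groups = defaultdict(list)
    for sha, message in commits:
        item = parse_trailers(message).get("Plan-Item")
        if item is not None:
            groups[item].append(sha)
    return dict(groups)


history = [
    ("a1", "Fix x\n\nPlan-Item: task-1\nRun-Id: r1\nAgent-Kind: example-agent\n"),
    ("b2", "More work\n\nPlan-Item: task-1\n"),
    ("c3", "Commit from before the hook existed\n"),
]
print(group_by_plan_item(history))  # {'task-1': ['a1', 'b2']}
```

Commits without a `Plan-Item` trailer are simply left out of the grouping, which matches how pre-hook history would need separate attribution.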

## Prediction Dataset

The prediction dataset is a derived artifact built by a script from git history
and the per-task effort index. It is not stored in task front matter; the task
front matter carries only the real timestamps and commit SHAs, not derived
features.

**Features (X):** per-task metrics derived from git over the task's `commits:`
list, plus categorical task metadata:

- Commit count.
- Churn: insertions + deletions.
- Files changed (total and unique).
- Subsystems touched: a subsystem label per changed file, derived from the
  directory prefix (e.g. `kernel/`, `capos-lib/`, `docs/`, `schema/`).
- Categorical fields: milestone/track, slice class (`behavior`,
  `read-side-proof`, `harness-hardening`, `docs-status`), hazard families
  checked.
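The git-derived features above could be computed roughly as follows, given per-commit change tuples in the shape that `git log --numstat` reports (insertions, deletions, path). The function names, the output keys, and the `(root)` label for top-level files are assumptions for illustration.

```python
def subsystem_of(path):
    """Label a file by its first directory component, e.g. 'kernel/'."""
    head, sep, _ = path.partition("/")
    return head + "/" if sep else "(root)"


def task_features(commits):
    """Compute git-derived features for one task.

    `commits` is a list with one entry per commit; each entry is a list of
    (insertions, deletions, path) tuples with integer counts.
    """
    churn = 0
    touches = 0
    unique = set()
    for changes in commits:
        for insertions, deletions, path in changes:
            churn += insertions + deletions
            touches += 1       # a file changed in two commits counts twice here
            unique.add(path)   # ...but only once here
    return {
        "commit_count": len(commits),
        "churn": churn,
        "files_changed_total": touches,
        "files_changed_unique": len(unique),
        "subsystems": sorted({subsystem_of(p) for p in unique}),
    }


example = task_features([
    [(10, 2, "kernel/src/main.rs"), (1, 0, "docs/plan.md")],
    [(3, 3, "kernel/src/main.rs"), (5, 0, "README.md")],
])
```

Binary files appear in numstat output with `-` in place of counts, so a real parser would need to map those to zero or drop them before calling this.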

**Label (y):** real effort in minutes -- the idle-clipped active duration from
the session archive, or the run-telemetry-derived interval for loop-era tasks.

**Granularity:** one record per branch merge (feature/work-unit granularity).
This matches the size at which future tasks are dispatched and avoids the noise
of per-commit or per-day fragments. Tasks that span multiple branches (a
prerequisite branch plus a follow-up) are modeled as separate records linked by
a dependency field; the prediction target is per-branch, and milestone
forecasting aggregates across the dependency graph.

**Model:** a regression over the feature set above, using a simple baseline
(linear regression or gradient-boosted trees) before investing in anything more
complex. The first useful output is a p50/p90 interval per slice class and
subsystem combination, not a precise point estimate.
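Before any regression is fitted, the p50/p90 table can be approximated with plain empirical quantiles per group. The sketch below is that simpler stand-in, not the regression model itself; the record keys and the `min_samples` cutoff are hypothetical.

```python
from collections import defaultdict
from statistics import quantiles


def quantile_table(records, min_samples=2):
    """Return {(slice_class, subsystem): (p50, p90)} in minutes.

    Records with a null duration are skipped rather than imputed, and groups
    with fewer than `min_samples` observations are omitted.
    """
    groups = defaultdict(list)
    for rec in records:
        if rec["minutes"] is None:
            continue
        groups[(rec["slice_class"], rec["subsystem"])].append(rec["minutes"])
    table = {}
    for key, minutes in groups.items():
        if len(minutes) < min_samples:
            continue
        # n=10 yields nine decile cut points; index 4 is p50, index 8 is p90.
        cuts = quantiles(minutes, n=10, method="inclusive")
        table[key] = (cuts[4], cuts[8])
    return table


sample = [
    {"slice_class": "behavior", "subsystem": "kernel/", "minutes": m}
    for m in range(10, 101, 10)
]
sample.append({"slice_class": "behavior", "subsystem": "kernel/", "minutes": None})
sample.append({"slice_class": "docs-status", "subsystem": "docs/", "minutes": 12})
print(quantile_table(sample))  # {('behavior', 'kernel/'): (55.0, 91.0)}
```

With small groups these quantiles are noisy, so a practical `min_samples` would likely be well above 2.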

## Gantt Rendering

The Gantt is rendered from the per-task effort index: each task becomes a bar
spanning its `started` to `completed` timestamps (or `started` plus the active
duration for pre-loop tasks where only the duration is reliable). Tasks are
grouped by milestone and slice class, and bars are colored by subsystem. The
output is a static SVG or a simple HTML/SVG file -- not an interactive
dashboard. The rendering script reads the per-task effort index from `target/`
and writes `target/gantt.svg` or `target/gantt.html`. It is not part of the
default build.
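A static SVG of this kind needs very little machinery. The sketch below maps each task to one `<rect>` on a shared time axis; grouping and coloring are omitted, and the function name, dictionary keys, and layout constants are assumptions rather than the renderer's design.

```python
from datetime import datetime
from xml.sax.saxutils import escape


def render_gantt(tasks, width=800, row_height=20):
    """Render a non-empty list of tasks as a minimal SVG string.

    Each task is a dict with 'task_id' (str) and 'started'/'completed' (datetime).
    """
    t0 = min(t["started"] for t in tasks)
    t1 = max(t["completed"] for t in tasks)
    span = (t1 - t0).total_seconds() or 1.0  # avoid dividing by zero
    height = row_height * len(tasks)
    parts = [
        f'<svg xmlns="http://www.w3.org/2000/svg" width="{width}" height="{height}">'
    ]
    for row, task in enumerate(tasks):
        x = (task["started"] - t0).total_seconds() / span * width
        w = (task["completed"] - task["started"]).total_seconds() / span * width
        w = max(w, 1.0)  # keep very short tasks visible
        parts.append(
            f'<rect x="{x:.1f}" y="{row * row_height}" width="{w:.1f}" '
            f'height="{row_height - 4}"><title>{escape(task["task_id"])}</title></rect>'
        )
    parts.append("</svg>")
    return "\n".join(parts)


svg = render_gantt([
    {"task_id": "task-1", "started": datetime(2024, 1, 1, 10), "completed": datetime(2024, 1, 1, 12)},
    {"task_id": "task-2 <fix>", "started": datetime(2024, 1, 1, 11), "completed": datetime(2024, 1, 1, 14)},
])
```

The `<title>` child gives each bar a hover tooltip in browsers without requiring any script, which keeps the output static.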

## Sequencing and Prerequisites

The following are already landed:

- `docs/tasks/done/` ledger with real `started:` and `completed:` fields for
  loop-era tasks.
- `prepare-commit-msg` hook stamping `Plan-Item`, `Run-Id`, and `Agent-Kind`
  trailers.
- Run-telemetry log entries for loop-era tasks (local, ephemeral).

The following are future work:

1. **Transcript collector and normalizer.** Write parsers for the Claude JSONL
   and Codex rollout formats, the idle-clipping logic, and the per-task effort
   index builder. This is a standalone Python or Rust host tool; no kernel
   changes.
2. **Backfill pass.** Run the normalizer over the existing transcript archive
   to populate pre-loop effort estimates for tasks in `docs/tasks/done/`.
   Where transcripts are unavailable, leave the duration field as `null` with
   a `source: unavailable` annotation; do not invent estimates.
3. **Object storage upload.** Configure the archive upload to GCS or S3 and
   set up the versioned prefix scheme.
4. **Dataset builder.** Write the script that joins the per-task effort index
   with git metrics to produce the prediction dataset.
5. **Baseline model.** Train and evaluate the baseline duration-prediction model
   on the dataset. Publish the p50/p90 per-slice-class table as a static
   `docs/` page once it has enough data to be meaningful.
6. **Gantt renderer.** Write the script and add a `make gantt` target.

Steps 1-2 can proceed independently of 3-6 and are the highest-value items:
the backfill populates the effort ground truth that all downstream uses depend
on.

## Authority and Privacy

- The transcript archive and the run-telemetry log are **not committed** to the
  repository. They are local/private artifacts.
- The per-task effort index written to `target/` is a derived artifact and is
  gitignored; it may contain session ids and durations but no message content.
- The `docs/tasks/done/` entries carry only `started:`, `completed:`, and
  `commits:` fields sourced from the telemetry and git; they do not carry
  message content, file system paths, or host-identifying information.
- The prediction dataset contains only git-derived metrics and duration labels;
  no transcript content.

## Relationship to Existing Proposals

- **Task State and Agent Telemetry** ([task-state-and-agent-telemetry-proposal.md](task-state-and-agent-telemetry-proposal.md)):
  the task file schema and run-telemetry structure that this proposal reads from.
  The two proposals are complementary: that proposal defines the task lifecycle
  and local operational state; this proposal defines what to do with the timing
  data once it exists.
- **Agentic Development Experiment** ([agentic-development-experiment-proposal.md](agentic-development-experiment-proposal.md)):
  the autonomous development loop whose run-telemetry log is the primary timing
  source for loop-era tasks.

## Open Questions

- **Idle threshold.** What inter-turn gap counts as idle and is excluded from
  active duration? A 30-minute threshold is a reasonable starting point; the
  right value depends on the observed gap distribution in the transcript archive.
- **Multi-branch tasks.** Some tasks span a prerequisite branch plus a follow-up
  fix branch. The current model treats each merge as a separate record; a cleaner
  approach may be a `parent-task:` field in the task front matter so the effort
  can be rolled up.
- **Backfill completeness.** Transcript files from early project history may be
  incomplete or unavailable. The normalizer must handle missing sessions
  gracefully; the dataset must mark incomplete records rather than imputing
  durations.
- **Model selection.** Whether a simple linear baseline is sufficient or whether
  a richer model (gradient-boosted trees, conformal prediction intervals) is
  warranted depends on the dataset size and variance. Defer this decision until
  the backfill pass is complete and the distribution is known.

## Design Grounding

- Task ledger schema and run-telemetry promotion: `docs/tasks/README.md` and
  [task-state-and-agent-telemetry-proposal.md](task-state-and-agent-telemetry-proposal.md).
- Commit-provenance trailers: `tools/githooks/prepare-commit-msg`.
- Slice-class vocabulary and hazard families: `CLAUDE.md` (Autonomous Slice
  Hygiene section) and `REVIEW.md`.