Proposal: Session Archive and Gantt Effort Pipeline

Development tasks in capOS each carry a real start and finish time. The autonomous development loop records these directly for tasks it executes; for earlier work the timing is recoverable from agent session transcripts. Collected together and attributed to branches and tasks, that timing data enables two things: a whole-history development Gantt and a dataset for predicting how long a future task will take.

Status: Proposal. The foundation is partially landed. A per-day task ledger exists in docs/tasks/done/, where each done entry carries the real branch commit SHAs and, for tasks executed by the autonomous development loop, real started and completed timestamps sourced from the run-telemetry log. A prepare-commit-msg hook stamps Plan-Item, Run-Id, and Agent-Kind trailers on commits so the commit-to-task-to-run mapping is native to git history. The session-transcript ETL, the derived dataset builder, and the duration-prediction model are future work this proposal scopes.

Goals

  • Predict how long a future task will take from historical effort patterns, using features derivable from the task’s commits and metadata.
  • Render a whole-history development Gantt over the landed branch and task ledger, attributing each interval to the task that produced it.
  • Feed that data back into planning: size estimates, milestone forecasting, and identification of subsystems or slice classes that consistently take longer than anticipated.

Timing Sources

Two sources provide per-task effort data, at different points in the project timeline:

Run-telemetry log (loop-era tasks). The autonomous development loop writes one record per task run to a local telemetry log. Each record carries a run id, the task id, the agent kind, a session id, a started timestamp (when the agent began), and a completed timestamp (when the agent finished and the branch was merged or abandoned). These timestamps are exact wall-clock values, not estimates. The log itself is ephemeral and not committed; when the task closes, the timestamps are promoted to the task’s done/ file as started: and completed: front-matter fields. That promotion is the boundary between local operational state and the durable public record.

Agent session transcripts (pre-loop history). For tasks worked before the autonomous development loop existed, timing must be reconstructed from agent session transcripts. Two transcript formats exist in the project history:

  • A Claude session JSONL format: one JSON object per turn, with a UTC timestamp, a role (user or assistant), message content, and tool-call records.
  • A Codex session-rollout format: a structured log of model turns with file edits, shell commands, and timestamps.

Both formats carry enough information to recover: when a session started, which files were touched, which repository and branch were active, and approximately when the session ended (last turn timestamp). Cross-tool interval merging (a task worked in two different tools during the same calendar day) is a rare edge case; in practice each task belongs primarily to one tool and one continuous session window.

Pipeline

The pipeline has four stages:

1. Collect

Gather transcript files from wherever they reside. The Claude JSONL transcripts are stored under a well-known local path per session. The Codex rollouts are scattered across machines and backup directories and must be enumerated by a manifest or directory scan. Neither format is committed to the repository; they are local/backup artifacts. The collect stage produces a manifest of transcript files keyed by session id and format type.
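A minimal sketch of the collect stage, assuming each transcript root is declared together with its format and that the file stem can serve as the session id. Both are illustrative assumptions; the real naming scheme and directory layout are not specified here, and build_manifest is a hypothetical name.

```python
from pathlib import Path

def build_manifest(sources, pattern="*"):
    """Scan transcript roots and key the files found by session id.

    sources: iterable of (root_dir, format_name) pairs.
    Returns {session_id: {"format": ..., "path": ...}}. The first occurrence
    of a session id wins, so a backup copy of a known session is ignored.
    """
    manifest = {}
    for root, fmt in sources:
        for path in sorted(Path(root).rglob(pattern)):
            if not path.is_file():
                continue
            # Assumption: the file stem doubles as the session id.
            manifest.setdefault(path.stem, {"format": fmt, "path": str(path)})
    return manifest
```

Declaring the format per root, rather than sniffing file contents, keeps the scan cheap and leaves format detection to the normalize stage.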

2. Normalize

Parse each transcript into a common event schema:

{
  "session_id": "...",
  "format": "claude-jsonl" | "codex-rollout",
  "started_at": "<UTC ISO timestamp>",
  "ended_at":   "<UTC ISO timestamp>",
  "repo":       "<repo name>",
  "branch":     "<branch name or null>",
  "files_touched": ["<relative path>", ...],
  "tool_calls": <count>,
  "role_turns": <count>
}

The started_at and ended_at values are the first and last turn timestamps in the session. For the duration estimate, idle time between turns (long pauses between user and assistant turns, or overnight gaps within a session file) is clipped: only contiguous active intervals – where consecutive turn timestamps are within a configurable idle threshold – count toward the active duration. The result is an idle-clipped active duration attributed to the session.
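The idle-clipping rule can be sketched as follows. This is an illustration under stated assumptions, not the landed implementation: the function name is hypothetical, and the 30-minute default is only the starting point suggested under Open Questions.

```python
from datetime import datetime, timedelta

def active_duration(timestamps, idle_threshold=timedelta(minutes=30)):
    """Sum the gaps between consecutive turns, skipping gaps above the threshold.

    timestamps: turn timestamps for one session, in any order.
    """
    ordered = sorted(timestamps)
    total = timedelta(0)
    for prev, cur in zip(ordered, ordered[1:]):
        gap = cur - prev
        if gap <= idle_threshold:
            total += gap  # contiguous activity counts toward effort
        # longer gaps (lunch, overnight) are treated as idle and dropped
    return total

# Example: a morning burst, a long pause, then a short afternoon burst.
turns = [
    datetime(2025, 1, 6, 9, 0),
    datetime(2025, 1, 6, 9, 10),
    datetime(2025, 1, 6, 9, 25),
    datetime(2025, 1, 6, 13, 0),
    datetime(2025, 1, 6, 13, 5),
]
print(active_duration(turns))  # 0:30:00
```

The 3.5-hour pause contributes nothing, so the session's active duration is 30 minutes even though started_at and ended_at are more than four hours apart.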

Per-task effort is the sum of idle-clipped active durations across all sessions whose branch matches the task’s branch. For a task with a single session this is trivial; where one session covered multiple branches, its duration is prorated across them by file overlap or left to manual annotation.
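For the common single-branch case, the aggregation reduces to a grouped sum. A minimal sketch, assuming each normalized session already carries a hypothetical active_minutes field alongside branch; prorating multi-branch sessions is deliberately left out.

```python
from collections import defaultdict

def effort_by_branch(sessions):
    """Sum idle-clipped minutes per branch.

    sessions: list of dicts with "branch" and "active_minutes" keys
    ("active_minutes" is an illustrative field name).
    """
    totals = defaultdict(float)
    for session in sessions:
        if session["branch"] is None:
            continue  # unattributed sessions need manual annotation
        totals[session["branch"]] += session["active_minutes"]
    return dict(totals)
```

Sessions with a null branch are skipped rather than guessed at, in line with the rule that missing data is marked and not imputed.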

3. Recap and Index

After normalization, a recap step produces a per-task effort index: task id, branch, real started/completed timestamps (from the run-telemetry promotions for loop-era tasks, from the session-normalized estimate for pre-loop tasks), idle-clipped active duration, agent kind, and the commit SHAs that belong to the task. This index is written to a structured file (JSON Lines, one record per task) under target/ during the build and is the input to the dataset builder and the Gantt renderer. It is a derived artifact; the sources of truth are git history, the docs/tasks/done/ ledger, and the transcript files.
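One possible shape for an index record and its JSON Lines serialization is sketched below. The field names and all values are illustrative assumptions, not a settled schema; the source field distinguishes telemetry-backed records from transcript estimates.

```python
import json

def to_jsonl(records):
    """Serialize records as JSON Lines: one JSON object per line."""
    return "".join(json.dumps(r, sort_keys=True) + "\n" for r in records)

def from_jsonl(text):
    """Parse JSON Lines back into a list of dicts, ignoring blank lines."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]

# Hypothetical records; ids, SHAs, and timestamps are placeholders.
index = [
    {
        "task_id": "task-001",
        "branch": "feature-a",
        "started_at": "2025-01-06T09:00:00Z",
        "completed_at": "2025-01-06T13:05:00Z",
        "active_minutes": 30.0,
        "agent_kind": "claude",
        "commits": ["aaa111", "bbb222"],
        "source": "run-telemetry",
    },
    {
        "task_id": "task-002",
        "branch": "feature-b",
        "started_at": None,
        "completed_at": None,
        "active_minutes": None,  # no transcript found: mark, do not impute
        "agent_kind": None,
        "commits": ["ccc333"],
        "source": "unavailable",
    },
]
```

Keeping one record per line lets the dataset builder and the Gantt renderer stream the file without loading a single large JSON document.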

4. Store in Object Storage

The normalized transcript archive and the per-task effort index are stored in object storage (GCS or S3) under a versioned prefix. This serves two purposes: it makes the archive portable across machines, and it provides a stable input for the prediction dataset builder that does not depend on the local transcript directory layout. The object storage upload is a manual or CI-triggered step, not part of every build.

Commit Provenance

The prepare-commit-msg hook (landed at tools/githooks/prepare-commit-msg) stamps three trailers on every commit:

  • Plan-Item: <task-id> – the task this commit belongs to.
  • Run-Id: <run-id> – the run-telemetry log entry for this work session.
  • Agent-Kind: <kind> – which implementation agent produced the commit.

These trailers make the commit-to-task and commit-to-run mappings native to git history and queryable by git log --grep. A Gantt renderer can walk git log and group commits by Plan-Item, attributing intervals to tasks without any external database. The run-telemetry log fills in wall-clock start/end; git provides the commit sequence and churn metrics.
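The grouping step can be sketched as a pure function over (sha, message) pairs. This hand-rolled parser is an illustration only: it treats the final paragraph of the message as the trailer block, and the task id, run id, and agent kind values in the example are hypothetical. Git's own trailer tooling, such as git interpret-trailers --parse, could replace it.

```python
TRAILER_KEYS = ("Plan-Item", "Run-Id", "Agent-Kind")

def parse_trailers(message):
    """Return provenance trailers found in the final paragraph of a commit message."""
    paragraphs = message.strip().split("\n\n")
    found = {}
    for line in paragraphs[-1].splitlines():
        key, sep, value = line.partition(":")
        if sep and key.strip() in TRAILER_KEYS:
            found[key.strip()] = value.strip()
    return found

def group_by_plan_item(commits):
    """Group commit SHAs by their Plan-Item trailer.

    commits: iterable of (sha, message) pairs, e.g. read from git log.
    Commits without a Plan-Item trailer are left out.
    """
    groups = {}
    for sha, message in commits:
        item = parse_trailers(message).get("Plan-Item")
        if item is not None:
            groups.setdefault(item, []).append(sha)
    return groups

# Example commit message with hypothetical trailer values.
message = (
    "kernel: fix frame allocator\n\n"
    "Longer body text.\n\n"
    "Plan-Item: task-001\nRun-Id: run-42\nAgent-Kind: claude\n"
)
```

Commits that predate the hook carry no trailers and fall out of the grouping, which is where the transcript-based backfill takes over.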

Prediction Dataset

The prediction dataset is a derived artifact built by a script from git history and the per-task effort index. It is not stored in task front matter; the task front matter carries only the real timestamps and commit SHAs, not derived features.

Features (X): per-task metrics derived from the task’s commits and metadata:

  • Commit count.
  • Churn: insertions + deletions.
  • Files changed (total and unique).
  • Subsystems touched: a subsystem label per changed file, derived from the directory prefix (e.g. kernel/, capos-lib/, docs/, schema/).
  • Categorical fields: milestone/track, slice class (behavior, read-side-proof, harness-hardening, docs-status), hazard families checked.
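The git-derived features above can be computed from per-commit file statistics. A minimal sketch, assuming the input has already been extracted from git as (path, insertions, deletions) tuples; the function names and the "(root)" label for top-level files are illustrative choices, and the categorical fields would be joined in from task metadata separately.

```python
def subsystem_of(path):
    """Label a changed file by its top-level directory prefix, e.g. 'kernel/'."""
    head, sep, _ = path.partition("/")
    return head + "/" if sep else "(root)"

def task_features(commits):
    """Compute git-derived features for one task.

    commits: list of commits, each a list of (path, insertions, deletions).
    """
    changes = [change for commit in commits for change in commit]
    paths = [path for path, _, _ in changes]
    return {
        "commit_count": len(commits),
        "churn": sum(ins + dels for _, ins, dels in changes),
        "files_changed_total": len(paths),
        "files_changed_unique": len(set(paths)),
        "subsystems": sorted({subsystem_of(p) for p in paths}),
    }

# Example with hypothetical file paths: two commits, one file touched twice.
example = [
    [("kernel/src/main.rs", 10, 2), ("docs/a.md", 3, 0)],
    [("kernel/src/main.rs", 1, 1)],
]
print(task_features(example)["churn"])  # 17
```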

Label (y): real effort in minutes – the idle-clipped active duration from the session archive, or the run-telemetry-derived interval for loop-era tasks.

Granularity: one record per branch merge (feature/work-unit granularity). This matches the size at which future tasks are dispatched and avoids the noise of per-commit or per-day fragments. Tasks that span multiple branches (a prerequisite branch plus a follow-up) are modeled as separate records linked by a dependency field; the prediction target is per-branch, and milestone forecasting aggregates across the dependency graph.

Model: a regression over the feature set above, using a simple baseline (linear regression or gradient-boosted trees) before investing in anything more complex. The first useful output is a p50/p90 interval per slice class and subsystem combination, not a precise point estimate.
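Before any regression is fitted, a p50/p90 table can be read straight off the historical labels. The sketch below uses empirical nearest-rank percentiles per group, which is a simpler stand-in and not the regression model itself; the record field names are assumptions.

```python
from collections import defaultdict

def percentile(values, p):
    """Nearest-rank percentile for integer p in 1..100 (no interpolation)."""
    ordered = sorted(values)
    rank = max(1, (p * len(ordered) + 99) // 100)  # integer ceil(p * n / 100)
    return ordered[rank - 1]

def interval_table(records):
    """Group effort labels by (slice_class, subsystem) and report n, p50, p90."""
    groups = defaultdict(list)
    for r in records:
        groups[(r["slice_class"], r["subsystem"])].append(r["effort_minutes"])
    return {
        key: {"n": len(v), "p50": percentile(v, 50), "p90": percentile(v, 90)}
        for key, v in groups.items()
    }
```

Reporting n next to each interval matters: with only a handful of tasks in a group, p90 collapses to the maximum observed value and should be read accordingly.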

Gantt Rendering

The Gantt is rendered from the per-task effort index: each task becomes a bar spanning its started_at to completed_at (or started_at plus active duration for pre-loop tasks where only the duration is reliable). Tasks are grouped by milestone and slice class, and bars are colored by subsystem. The output is a static SVG or a simple HTML/SVG file – not an interactive dashboard. The rendering script reads the per-task effort index from target/ and writes target/gantt.svg or target/gantt.html. It is not part of the default build.
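A static SVG of this kind needs little more than a linear time-to-pixel mapping. The sketch below is a bare illustration with hypothetical names: it draws one bar per task and omits the grouping by milestone and the subsystem coloring. For pre-loop tasks, completed_at would be computed as started_at plus active duration before calling it.

```python
from datetime import datetime
from xml.sax.saxutils import escape

def render_gantt(tasks, width=800, row_height=20):
    """Render one bar per task as a static SVG string.

    tasks: non-empty list of dicts with "task_id" (str) and
    "started_at" / "completed_at" (datetime) keys.
    """
    t0 = min(t["started_at"] for t in tasks)
    t1 = max(t["completed_at"] for t in tasks)
    span = (t1 - t0).total_seconds() or 1.0  # avoid dividing by zero
    bars = []
    for row, t in enumerate(tasks):
        x = (t["started_at"] - t0).total_seconds() / span * width
        w = (t["completed_at"] - t["started_at"]).total_seconds() / span * width
        bars.append(
            f'<rect x="{x:.1f}" y="{row * row_height}" '
            f'width="{max(1.0, w):.1f}" height="{row_height - 4}">'
            f'<title>{escape(t["task_id"])}</title></rect>'
        )
    height = len(tasks) * row_height
    return (
        f'<svg xmlns="http://www.w3.org/2000/svg" '
        f'width="{width}" height="{height}">' + "".join(bars) + "</svg>"
    )
```

The title child gives each bar a hover tooltip in browsers without any scripting, which fits the static, non-interactive output described above.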

Sequencing and Prerequisites

The following are already landed:

  • docs/tasks/done/ ledger with real started: and completed: fields for loop-era tasks.
  • prepare-commit-msg hook stamping Plan-Item, Run-Id, and Agent-Kind trailers.
  • Run-telemetry log entries for loop-era tasks (local, ephemeral).

The following are future work:

  1. Transcript collector and normalizer. Write parsers for the Claude JSONL and Codex rollout formats, the idle-clipping logic, and the per-task effort index builder. This is a standalone Python or Rust host tool; no kernel changes.
  2. Backfill pass. Run the normalizer over the existing transcript archive to populate pre-loop effort estimates for tasks in docs/tasks/done/. Where transcripts are unavailable, leave the duration field as null with a source: unavailable annotation; do not invent estimates.
  3. Object storage upload. Configure the archive upload to GCS or S3 and set up the versioned prefix scheme.
  4. Dataset builder. Write the script that joins the per-task effort index with git metrics to produce the prediction dataset.
  5. Baseline model. Train and evaluate the baseline duration-prediction model on the dataset. Publish the p50/p90 per-slice-class table as a static docs/ page once it has enough data to be meaningful.
  6. Gantt renderer. Write the script and add a make gantt target.

Steps 1-2 can proceed independently of 3-6 and are the highest-value items: the backfill populates the effort ground truth that all downstream uses depend on.

Authority and Privacy

  • The transcript archive and the run-telemetry log are not committed to the repository. They are local/private artifacts.
  • The per-task effort index written to target/ is a derived artifact and is gitignored; it may contain session ids and durations but no message content.
  • The docs/tasks/done/ entries carry only started:, completed:, and commits: fields sourced from the telemetry and git; they do not carry message content, file system paths, or host-identifying information.
  • The prediction dataset contains only git-derived metrics and duration labels; no transcript content.

Relationship to Existing Proposals

  • Task State and Agent Telemetry (task-state-and-agent-telemetry-proposal.md): the task file schema and run-telemetry structure that this proposal reads from. The two proposals are complementary: that proposal defines the task lifecycle and local operational state; this proposal defines what to do with the timing data once it exists.
  • Agentic Development Experiment (capOS Agentic Development Experiment): the autonomous development loop whose run-telemetry log is the primary timing source for loop-era tasks.

Open Questions

  • Idle threshold. What inter-turn gap counts as idle and is excluded from active duration? A 30-minute threshold is a reasonable starting point; the right value depends on the observed gap distribution in the transcript archive.
  • Multi-branch tasks. Some tasks span a prerequisite branch plus a follow-up fix branch. The current model treats each merge as a separate record; a cleaner approach may be a parent-task: field in the task front matter so the effort can be rolled up.
  • Backfill completeness. Transcript files from early project history may be incomplete or unavailable. The normalizer must handle missing sessions gracefully; the dataset must mark incomplete records rather than imputing durations.
  • Model selection. Whether a simple linear baseline is sufficient or whether a richer model (gradient-boosted trees, conformal prediction intervals) is warranted depends on the dataset size and variance. Defer this decision until the backfill pass is complete and the distribution is known.

Design Grounding

  • Task ledger schema and run-telemetry promotion: docs/tasks/README.md and task-state-and-agent-telemetry-proposal.md.
  • Commit-provenance trailers: tools/githooks/prepare-commit-msg.
  • Slice-class vocabulary and hazard families: CLAUDE.md (Autonomous Slice Hygiene section) and REVIEW.md.