# Proposal: Stateful Task and Job Graphs

capOS should eventually have a small durable work-graph substrate: a way to
describe, run, inspect, pause, retry, and resume stateful DAG-shaped work.
It should serve four related needs without becoming a universal service
manager:

- init-owned service startup, restart, and shutdown orchestration;
- IX-style package and build graph execution;
- operator-visible task lists with optional assignee, budget, and run state;
- notebook-like user stories where prose, commands, outputs, and rerun points
  are recorded as a narrative over real work.

The important design line is that the graph substrate is not the UI, not a
shell, not a package manager, not a notebook runtime, not a service manager,
and not a generic capability broker. It is the durable state machine beneath
those tools.

## Position

Adopt a `WorkGraph` model, but keep it narrow.

The core object is a versioned graph definition plus run instances:

- **Graph definition:** immutable, schema-validated structure: nodes, typed
  edges, resource hints, authority requirements, retry/cancellation policy,
  and expected artifacts.
- **Graph run:** one execution attempt of a graph definition, with node-run
  state, leases, logs, checkpoints, artifacts, and audit events.
- **Node run:** one executable, manual, or descriptive unit of work inside a
  run.
- **Artifact:** durable output, checkpoint, service export, log, report, or
  Store/Namespace reference produced by a node.
- **Assignment:** optional workload metadata: assignee principal, role,
  queue, priority, resource profile, deadline, and budget.

The common substrate is a schema/library/event-log pattern, not one global
coordinator. Each domain owns its coordinator, executor queue, domain-node
schema, validation, and authority:

- init owns init lifecycle state;
- `BuildCoordinator` owns IX build graph execution and job state;
- an agent runner owns agent task state and workspace leases;
- a notebook/story service owns narrative projections;
- an operator task service owns human assignment state.

They may share graph/run/event/artifact shapes, but they do not share one
authority-holding scheduler.

Everything above that is a facade:

- init sees service lifecycle and dependency state;
- IX sees package inputs, build steps, outputs, and Store commits;
- an operator sees a DAG-organized todo list with assigned work;
- a notebook sees cells, prose, rich outputs, and rerun checkpoints;
- an agent runner sees durable steps, memory/checkpoints, and review gates.

The same persisted run can have more than one projection. A failed package
build can appear as an IX build failure, an operator task, a notebook section,
and a graph node with logs. The core should not know which projection is being
used. Cross-domain views should be read-only projections or explicit links to
the owning run, not copied mutable event state.

## Why This Belongs in capOS

capOS already has several graph-shaped systems:

- `initConfig.services` is an init-owned service graph.
- `ProcessSpawner` and `ProcessHandle` provide process lifecycle edges.
- `libcapos-service` needs readiness, shutdown, drain, background work,
  resource reservations, and handoff hooks.
- IX-on-capOS needs dependency-ordered fetch, extract, build, Store commit,
  and realm publish.
- agent and shell workflows need durable state when work crosses sessions,
  reviews, restarts, or context compaction.

Without a shared state model, each subsystem will grow its own partial
orchestrator: init will have a service table, IX will have a build executor,
agents will have task memory, operators will have ad-hoc todo state, and
notebook-like demos will have their own cell/run records. That is duplication
in the wrong layer.

With too much sharing, the substrate becomes a god object. The right answer is
a shared run-state and dependency model with domain-specific executors.

## Prior Art Baseline

Sources checked for this proposal:

- capOS IX research note:
  [`docs/research/ix-on-capos-hosting.md`](../research/ix-on-capos-hosting.md)
- Upstream IX repository and executor source:
  <https://github.com/pg83/ix>,
  <https://raw.githubusercontent.com/pg83/ix/master/core/execute.py>
- Apache Airflow 3.2 DAG docs:
  <https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/dags.html>
- Dagster docs on software-defined assets, ops/graphs, jobs, schedules, and
  sensors:
  <https://docs.dagster.io/>,
  <https://docs.dagster.io/guides/build/assets>,
  <https://docs.dagster.io/guides/build/ops>,
  <https://docs.dagster.io/guides/build/jobs>,
  <https://docs.dagster.io/guides/automate/schedules>,
  <https://docs.dagster.io/guides/automate/sensors>
- Jupyter `nbformat` docs:
  <https://nbformat.readthedocs.io/en/latest/format_description.html>
- LangGraph persistence docs:
  <https://docs.langchain.com/oss/javascript/langgraph/persistence>

The useful lessons are separable.

**Airflow:** a workflow run has task instances, dependencies, scheduling,
retries, timeouts, documentation, and operational state. Airflow's DAG object
intentionally does not care what happens inside a task; it cares about order,
retry, timeout, and execution conditions. capOS should copy that separation,
but not the Python-file import model, global scheduler database, or
operator/plugin surface.

**Dagster:** asset-first thinking fits capOS better than task-first thinking
when the output is durable state. A Store object, package output, Namespace
snapshot, boot manifest, built binary, benchmark report, or service export is
closer to a Dagster asset than to an Airflow task. Dagster's ops/graphs remain
useful when work is not naturally an asset. capOS should adopt the split:
assets are durable products; ops are execution steps; jobs are selections of
work to materialize or run. Dagster itself is data-platform-shaped, so it is
inspiration, not the implementation target for init.

**Jupyter:** notebook structure is a user story, not the kernel or init
abstraction. Cells, prose, outputs, and metadata are excellent for reviewing a
run, explaining why it happened, and rerunning a chosen step. They should be a
projection over graph state. Cell order must not become the source of truth for
service lifecycle or package builds.

**LangGraph:** checkpointed graph execution, threads, super-step boundaries,
interrupts, and time travel are useful for agent-like and human-in-the-loop
work. capOS should borrow the checkpoint boundary idea for resumability, but
avoid binding the substrate to LLM message state.

**IX:** the package graph research is the strongest local precedent. IX's
current executor traverses a dependency graph by node outputs, applies pools,
creates output directories, runs shell commands, touches sentinel files, and
kills the process group on failure. That proves IX already has a real build
graph. It also shows where capOS must stop: graph scheduling must not be fused
to subprocess, Unix process groups, filesystem sentinels, hardlinks, symlinks,
fetchers, archive extraction, or Store mutation. Those belong behind typed
capOS services.

## Core Model

The minimal model is:

```capnp
struct WorkGraph {
  graphId @0 :Text;
  version @1 :UInt64;
  nodes @2 :List(CommonNodeSpec);
  edges @3 :List(EdgeSpec);
  defaults @4 :GraphPolicy;
  domainSchema @5 :UInt64;
}

struct CommonNodeSpec {
  nodeId @0 :Text;
  title @1 :Text;
  inputs @2 :List(ArtifactSelector);
  outputs @3 :List(ArtifactSpec);
  requiredCaps @4 :List(CapRequirement);
  policy @5 :NodePolicy;
  assignmentDefault @6 :Assignment;
}

struct WorkRun {
  runId @0 :Text;
  graphId @1 :Text;
  graphVersion @2 :UInt64;
  state @3 :RunState;
  nodes @4 :List(NodeRun);
  events @5 :List(EventRef);
}

struct NodeRun {
  nodeId @0 :Text;
  state @1 :NodeState;
  attempt @2 :UInt32;
  assignment @3 :Assignment;
  artifacts @4 :List(ArtifactRef);
  checkpoint @5 :CheckpointRef;
}
```

This is a shape, not a final schema. The stable part is the split between
definition, run, node-run state, artifacts, and assignments.

Domain node meanings are not a shared `NodeKind` enum in the common schema.
Init may define `InitServiceNode`; IX may define `FetchNode`, `ExtractNode`,
`BuildNode`, `StoreCommitNode`, and `PublishNode`; a story projection may
define `NotebookCellNode` or `ManualNoteNode`. Those domain structs live in
domain-owned schemas or config sections and are validated by the domain
coordinator that holds the relevant authority. The common graph library may
hash, store, and index their association with `nodeId`, but it must not
interpret every domain's node kinds.
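
As a sketch of that boundary, in Rust with hypothetical names (`DomainPayload`,
`DomainNodeValidator`, and `NodePayloadIndex` exist nowhere in capOS today): the
shared library hashes and indexes opaque payloads by `nodeId`, and only the
owning domain coordinator decodes and validates them.

```rust
// Hypothetical sketch: the common graph library stores domain node payloads
// as opaque, hashed bytes; only the owning domain coordinator interprets them.
use std::collections::HashMap;

/// Opaque domain payload as stored by the shared graph library.
pub struct DomainPayload {
    pub schema_id: u64,         // matches WorkGraph.domainSchema
    pub bytes: Vec<u8>,         // encoded domain struct, e.g. an IX build node
    pub content_hash: [u8; 32], // hashed and indexed, never decoded, by the library
}

/// Implemented by each domain coordinator (init, BuildCoordinator, ...),
/// which alone holds the authority to validate and execute its node kinds.
pub trait DomainNodeValidator {
    fn schema_id(&self) -> u64;
    fn validate(&self, node_id: &str, payload: &DomainPayload) -> Result<(), String>;
}

/// The shared library only associates payloads with nodeId; it never decodes them.
#[derive(Default)]
pub struct NodePayloadIndex {
    payloads: HashMap<String, DomainPayload>,
}

impl NodePayloadIndex {
    pub fn attach(&mut self, node_id: &str, payload: DomainPayload) {
        self.payloads.insert(node_id.to_string(), payload);
    }

    /// Validation is delegated to the domain coordinator that owns the schema.
    pub fn validate_all(&self, validator: &dyn DomainNodeValidator) -> Result<(), String> {
        for (node_id, payload) in &self.payloads {
            if payload.schema_id == validator.schema_id() {
                validator.validate(node_id, payload)?;
            }
        }
        Ok(())
    }
}
```

The point of this shape is that adding a new domain never changes the shared
library; it only adds a validator on the domain side.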

## Node State

Node state should be explicit enough for init, package builds, and operators:

- `planned`: validated but not yet eligible.
- `blocked`: waiting on upstream nodes, an unavailable capability, resource
  budget, or manual input.
- `runnable`: dependencies are satisfied and a worker may lease it.
- `leased`: a worker or assignee owns the next attempt for a bounded time.
- `running`: execution has begun.
- `waiting`: running but blocked on a child process, readiness export,
  external event, manual approval, timer, or checkpoint resume.
- `succeeded`: produced the declared outputs or accepted terminal result.
- `failed`: terminal failure under current policy.
- `retryPending`: failed attempt will be retried under policy.
- `skipped`: intentionally not run because branch/condition policy selected a
  different path.
- `canceled`: canceled by caller, shutdown, superseding run, or authority
  revocation.
- `paused`: durable operator or policy pause.
- `stale`: graph version, cap epoch, input artifact, or session binding no
  longer matches the run's assumptions.

State transitions should be append-only events. Services may compact state into
snapshots, but audit and replay need a durable event boundary.
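
A minimal Rust sketch of this state set and one transition guard, assuming
hypothetical names (`NodeState`, `TransitionEvent`); the real schema would be
Cap'n Proto, and the full transition matrix is domain policy, not shown here.

```rust
/// Hypothetical node-run states mirroring the list above.
#[derive(Clone, Copy, PartialEq, Eq, Debug)]
pub enum NodeState {
    Planned, Blocked, Runnable, Leased, Running, Waiting,
    Succeeded, Failed, RetryPending, Skipped, Canceled, Paused, Stale,
}

/// One append-only transition event; snapshots are derived, never edited.
pub struct TransitionEvent {
    pub node_id: String,
    pub from: NodeState,
    pub to: NodeState,
    pub attempt: u32,
}

/// Sketch of a transition guard: terminal states accept no further
/// transitions except policy-driven retry from Failed.
pub fn transition_allowed(from: NodeState, to: NodeState) -> bool {
    use NodeState::*;
    match (from, to) {
        (Failed, RetryPending) => true,               // retry under policy
        (Succeeded | Skipped | Canceled, _) => false, // terminal
        (Failed, _) => false,                         // terminal otherwise
        _ => true,                                    // full matrix omitted
    }
}
```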

## Edges

A plain DAG edge is not enough. capOS needs typed edge reasons:

- `dependsOnSuccess`: downstream may run after upstream succeeds.
- `dependsOnArtifact`: downstream consumes a named artifact or Store ref.
- `dependsOnReady`: downstream waits on a service readiness export.
- `dependsOnLease`: downstream may run only while a lease/session is live.
- `cancelsWith`: cancellation propagates across the edge.
- `shutdownBefore`: shutdown order edge, usually reverse of startup.
- `approvalFor`: manual approval gates a node or subgraph.
- `observes`: node only observes another node's state and does not block it.

The graph remains acyclic within one run. Loops are modeled by new runs,
periodic schedules, sensors, retries, or explicit child graphs. This is a
critical stop line: hidden cycles create service-manager behavior inside the
graph engine.
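
A sketch of the edge kinds above and of how a coordinator might compute which
nodes are runnable; only `dependsOnSuccess` is evaluated here, and all names are
hypothetical. Readiness, artifact, lease, and approval edges would each consult
their own domain state, and `observes` edges never block.

```rust
use std::collections::{HashMap, HashSet};

/// Hypothetical typed edge kinds mirroring the list above.
#[derive(Clone, Copy, PartialEq, Eq)]
pub enum EdgeKind {
    DependsOnSuccess, DependsOnArtifact, DependsOnReady, DependsOnLease,
    CancelsWith, ShutdownBefore, ApprovalFor, Observes,
}

pub struct Edge {
    pub from: String, // upstream nodeId
    pub to: String,   // downstream nodeId
    pub kind: EdgeKind,
}

/// Sketch: a node becomes runnable when every blocking upstream edge is
/// satisfied. Only dependsOnSuccess is checked in this sketch.
pub fn runnable_nodes(
    planned: &HashSet<String>,
    succeeded: &HashSet<String>,
    edges: &[Edge],
) -> Vec<String> {
    let mut blocked: HashMap<&str, bool> = HashMap::new();
    for e in edges {
        if e.kind == EdgeKind::DependsOnSuccess && !succeeded.contains(&e.from) {
            blocked.insert(e.to.as_str(), true);
        }
    }
    planned
        .iter()
        .filter(|n| !blocked.contains_key(n.as_str()))
        .cloned()
        .collect()
}
```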

## Workload Assignment

Assignment is optional metadata, not authority:

```capnp
struct Assignment {
  principal @0 :Text;
  role @1 :Text;
  queue @2 :Text;
  priority @3 :Int32;
  budget @4 :ResourceProfileRef;
  deadline @5 :TimeRef;
  lease @6 :LeaseRef;
}
```

An assigned operator or worker may receive a lease to attempt a node. The
lease does not grant broad system authority. It only grants the ability to
claim or update that node-run through the coordinator, and any executable work
still needs domain caps supplied by init, a build coordinator, a package
worker, an agent runner, or another supervisor.

This makes the same graph usable as:

- a todo list where a human owns a manual node;
- a build queue where a worker owns a build step;
- an init run where PID 1 owns service lifecycle nodes;
- an agent plan where a worker owns a bounded workspace task.
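
A sketch of the lease rule above, with hypothetical names; the only thing
checked at update time is the lease itself, and domain authority is granted
separately by the owning coordinator, never carried by the lease.

```rust
/// Hypothetical lease record: the only thing an assignee holds is the right
/// to claim or update one node-run for a bounded time, via the coordinator.
pub struct Lease {
    pub lease_id: u64,
    pub node_id: String,
    pub holder: String,     // assignee principal
    pub expires_at_ms: u64, // bounded lease deadline
}

pub enum UpdateError { NoLease, Expired, WrongHolder }

/// Sketch: every node-run update is authorized against the live lease.
/// Domain capabilities for the actual work never travel with the lease.
pub fn authorize_update(
    lease: Option<&Lease>,
    caller: &str,
    now_ms: u64,
) -> Result<(), UpdateError> {
    let lease = lease.ok_or(UpdateError::NoLease)?;
    if lease.holder != caller {
        return Err(UpdateError::WrongHolder);
    }
    if now_ms >= lease.expires_at_ms {
        return Err(UpdateError::Expired);
    }
    Ok(())
}
```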

## Init As A Consumer

The stated direction is important: init may use this substrate for workload
orchestration.

The current init path validates `initConfig.services`, spawns children through
`ProcessSpawner`, records exports, and waits. The first graph use should only
observe and structure that existing behavior:

1. Compile `initConfig.services` into a graph definition (a minimal sketch
   follows this list).
2. Create a volatile boot `WorkRun` in init memory.
3. Treat each service as a lifecycle node with the states current init can
   actually observe: planned, spawned, running/waiting, exited, or failed.
4. Use typed edges for declared cap imports and manifest-order dependencies.
5. Persist selected run events later through a Store-backed journal when
   storage is available.
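
A minimal sketch of step 1, assuming a simplified `ServiceEntry` view of
`initConfig.services` (the real schema lives in `capos-config` and may differ);
edges follow declared cap imports, and manifest order could add further edges.

```rust
/// Hypothetical, simplified view of one initConfig.services entry.
pub struct ServiceEntry {
    pub name: String,
    pub imports: Vec<String>, // cap imports exported by other services
}

pub struct LifecycleNode { pub node_id: String }
pub struct LifecycleEdge { pub from: String, pub to: String } // dependsOnSuccess-style

/// Step 1 sketch: compile the service list into a volatile boot graph.
pub fn compile_boot_graph(
    services: &[ServiceEntry],
    export_owner: impl Fn(&str) -> Option<String>, // which service exports a cap
) -> (Vec<LifecycleNode>, Vec<LifecycleEdge>) {
    let nodes = services
        .iter()
        .map(|s| LifecycleNode { node_id: s.name.clone() })
        .collect();
    let mut edges = Vec::new();
    for s in services {
        for import in &s.imports {
            if let Some(owner) = export_owner(import) {
                edges.push(LifecycleEdge { from: owner, to: s.name.clone() });
            }
        }
    }
    (nodes, edges)
}
```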

Init does not need to become a general-purpose Airflow. It needs an
inspectable, eventually durable, lifecycle table with graph semantics:

- what services were planned;
- what caps and exports they depend on;
- which services are spawned, running, waiting, exited, failed, or blocked
  under the current primitives;
- later, which services are restarting, draining, terminating, or ordered for
  shutdown once those lifecycle primitives exist;
- what operator-visible work remains.

Restart, drain, termination, readiness-export waiting, and shutdown-order
control are later phases. They require primitives that are still future in the
service and broker proposals:

- process termination or kill-tree semantics narrower than raw process-table
  authority;
- an explicit readiness/export contract for services;
- service drain or lifecycle caps for graceful shutdown;
- restart policy state that is disabled or narrowed during shutdown mode;
- stale export and stale process-handle behavior for restarted services;
- audit events that distinguish crash, restart, operator stop, shutdown,
  timeout, and stale-authority denial.

The generic graph code can be an init-internal library at first. If a separate
run-state service appears later, init should delegate only narrow read or
update capabilities to it. The separate service must not receive
`ProcessSpawner`, raw process handles, or service-owner caps merely because it
stores graph state.

## IX Package Graph Consumer

IX should use the same run-state model with a different executor:

- package templates and descriptors produce graph definitions;
- fetch/extract/build/store/publish become typed nodes;
- inputs and outputs are Store or Namespace refs;
- build logs and output hashes are artifacts;
- package build workers lease executable nodes;
- `BuildCoordinator` owns scheduling, cancellation, queues, and job state;
- `Fetcher`, `Archive`, `BuildSandbox`, `Store`, and `Namespace` hold the real
  authority.

The graph substrate should not know how to fetch a URL, unpack a tarball, run
`sh`, or commit a Store object. It records that those typed steps exist, which
worker owns the attempt, what artifacts were produced, and whether the run can
resume or retry.

This preserves the IX research recommendation: use IX's package corpus and
content-addressed model without importing a CPython/POSIX executor boundary.
It does not move IX job ownership into a global graph coordinator.
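
A sketch of what the substrate records for one IX step, with hypothetical
names; nothing in this record knows how to fetch a URL, unpack an archive, or
run `sh`, and the node kinds live in an IX-owned schema.

```rust
/// Hypothetical IX-domain node kinds; these live in an IX-owned schema,
/// not in the shared graph library.
pub enum IxNodeKind { Fetch, Extract, Build, StoreCommit, Publish }

/// Sketch of what the run record captures for one build step: who owns the
/// attempt and what it produced, never how the work is executed.
pub struct IxNodeRunRecord {
    pub node_id: String,
    pub kind: IxNodeKind,
    pub worker: Option<String>,         // leased package build worker
    pub attempt: u32,
    pub log_artifact: Option<String>,   // artifact ref, e.g. a build log
    pub output_store_refs: Vec<String>, // content-addressed Store refs
    pub resumable: bool,
}
```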

## Notebook User Story

Jupyter is best treated as a user story:

- A notebook cell can map to a `note`, `manualTask`, `notebookCell`,
  `agentStep`, or `build` node.
- Cell output is an artifact: text, table, image, log excerpt, benchmark
  summary, Store ref, or Namespace snapshot.
- Markdown/prose explains why the graph exists and how to interpret its state.
- Rerun means "create a new run or retry selected node(s) under policy", not
  "mutate hidden cell global state".
- Checkpoints let a user resume from a durable boundary.

The notebook layer may be CLI text, mdBook, a future web shell, or a rich UI.
The core model should not depend on any of those.
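
A sketch of such a projection, with hypothetical names; cells link back to node
runs and artifacts rather than copying mutable event state, which keeps the
notebook layer read-only over the run.

```rust
/// Hypothetical notebook-style cell; a read-only projection, not run state.
pub struct StoryCell {
    pub prose: Option<String>,            // markdown explaining the step
    pub node_id: Option<String>,          // link back to the owning node-run
    pub outputs: Vec<String>,             // artifact refs: logs, tables, Store refs
    pub rerun_checkpoint: Option<String>, // checkpoint ref usable for resume
}

/// Sketch: build cells from node runs without copying mutable event state.
pub fn project_story(
    node_runs: &[(String, Vec<String>, Option<String>)], // (nodeId, artifacts, checkpoint)
) -> Vec<StoryCell> {
    node_runs
        .iter()
        .map(|(node_id, artifacts, checkpoint)| StoryCell {
            prose: None,
            node_id: Some(node_id.clone()),
            outputs: artifacts.clone(),
            rerun_checkpoint: checkpoint.clone(),
        })
        .collect()
}
```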

## Dagster Fit

Dagster is closer than Airflow for durable capOS work when outputs matter.
For capOS, a software-defined asset maps naturally to:

- content-addressed package output;
- boot image or manifest;
- Namespace snapshot;
- benchmark report;
- generated code artifact;
- service export that becomes available after readiness;
- notebook output captured as a reproducible artifact.

Dagster's ops and graphs map to executable steps. Its jobs map to selections
of assets or ops to run. Its sensors and schedules map to run creation
policies.

The mismatch is domain and authority. Dagster assumes a data-platform runtime,
Python definitions, and external resources. capOS needs capability grants,
typed service exports, process handles, sessions, Store/Namespace refs,
resource ledgers, and boot-time constraints. The right move is not "run
Dagster in init"; it is "use Dagster's asset/ops/jobs distinction to keep the
capOS graph model honest."

## Where To Stop

The main risk is building a god object. The graph substrate must not absorb
every adjacent concept.

Stop at these boundaries:

- **No kernel `WorkGraph` capability.** The kernel provides primitive caps:
  process, memory, IPC, timers, devices, and storage plumbing. Graph state is
  userspace.
- **No global service discovery.** A graph may reference capabilities granted
  into its runner or produced by its own nodes. It must not look up arbitrary
  services by global name.
- **No ambient executor.** Run-state code cannot execute arbitrary strings,
  scripts, Cap'n Proto calls, or binaries. A domain executor must hold the
  exact capabilities needed.
- **No universal plugin ABI.** Domain node kinds are typed in domain schemas.
  Unsupported node kinds fail domain validation rather than becoming untyped
  byte blobs.
- **No authority laundering.** Assignment, tags, labels, notebook cells, and
  graph edges do not grant authority. Only capabilities do.
- **No UI state in the core.** Notebook cells, DAG visual positions, comments,
  and todo-list grouping are projections or metadata.
- **No package-manager logic in the core.** Fetch, archive, build, Store, and
  Namespace operations stay in IX/build services.
- **No init-specific policy in the core.** Restart policy, shutdown order, and
  process termination are init or supervisor policy. The graph can record and
  drive them only through explicit runner methods.
- **No hidden loops.** Periodic work, sensors, retries, and agent iteration
  create new attempts or runs. One run's execution graph stays acyclic.
- **No unbounded event retention by default.** Retention and compaction are
  policy fields, not accidental database growth.

If a feature requires any graph coordinator to hold broad `ProcessSpawner`,
`DeviceManager`, `NetworkManager`, `Store`, `Namespace`, `Fetcher`, shell,
or session authority for all domains, the design has crossed the line.

## Service Split

The target split is:

```mermaid
flowchart TD
    Lib[Shared graph schema and state library]
    Log[Optional Store-backed event log]

    Lib --> InitCoord[init-local lifecycle graph]
    Lib --> BuildCoord[IX BuildCoordinator graph]
    Lib --> TaskCoord[operator task graph]
    Lib --> StoryCoord[notebook/story projection]
    Lib --> AgentCoord[agent-run graph]

    InitCoord --> InitLog[volatile boot run first]
    BuildCoord --> Log
    TaskCoord --> Log
    StoryCoord --> Log
    AgentCoord --> Log

    InitCoord --> InitExec[init lifecycle executor]
    BuildCoord --> BuildExec[build workers]
    TaskCoord --> Human[operator/manual assignee]
    AgentCoord --> AgentExec[agent worker]

    InitExec --> Spawner[ProcessSpawner]
    BuildExec --> Sandbox[BuildSandbox]
    BuildExec --> Store[Store/Namespace]
    AgentExec --> Workspace[Task workspace caps]
```

Only domain coordinators and executors hold domain authority. The shared code
owns no authority beyond manipulating in-memory or Store-backed graph records
through whatever narrow capability its caller already holds.

## Persistence

Persistence should be incremental:

- Early init boot runs can be volatile.
- Build runs should persist run events, build logs, artifacts, and Store refs
  as soon as Store exists.
- Operator tasks and notebook stories should persist once user storage exists.
- Agent runs should persist checkpoints and review state, not raw hidden prompt
  state.

Store integration should use content-addressed objects for immutable outputs
and an append-only or generation-checked log for mutable run state. Namespace
snapshots can publish human-facing names for completed runs, package realms,
or notebook reports.
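
A sketch of the generation-checked log, with hypothetical names; immutable
outputs go to content-addressed Store objects and are not shown here.

```rust
pub struct RunLog {
    pub generation: u64,      // bumps on every accepted append
    pub events: Vec<Vec<u8>>, // encoded, append-only run events
}

pub enum AppendError { StaleGeneration }

/// Sketch: an append succeeds only if the caller saw the latest generation,
/// giving optimistic concurrency over mutable run state.
pub fn append_event(
    log: &mut RunLog,
    expected_generation: u64,
    event: Vec<u8>,
) -> Result<u64, AppendError> {
    if log.generation != expected_generation {
        return Err(AppendError::StaleGeneration);
    }
    log.events.push(event);
    log.generation += 1;
    Ok(log.generation)
}
```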

Boot must not depend on a separate Store-backed graph service being available.
If durable graph logging is unavailable during boot, init falls back to its
volatile lifecycle table and emits diagnostics through its existing console/log
path. Durable replay and post-boot inspection are degraded in that mode, but
service startup must not fail solely because the graph log is unavailable.

## Security Rules

- Node claims are lease-based and expire.
- Every state update is authorized by the current lease, graph owner, or a
  delegated control cap.
- Node output publication validates expected artifact type and size.
- Retrying a node must not reuse stale capabilities, stale sessions, or stale
  object epochs (see the sketch after this list).
- Cancellation must release leases and ask domain executors to drain or kill
  work through typed lifecycle caps.
- Audit logs distinguish failure, cancellation, stale authority, denied
  authority, timeout, manual rejection, and superseded run.
- Resource budgets are reserved before execution and released on all terminal
  paths.
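
A sketch of the stale-authority check on retry, with hypothetical names and
bare epoch integers; the real check would operate on capability, session, and
Store object metadata, and a denial would be audited as stale authority.

```rust
/// Hypothetical snapshot of the authority a retry would otherwise reuse.
pub struct RetryContext {
    pub cap_epoch: u64,
    pub session_epoch: u64,
    pub input_object_epochs: Vec<u64>,
}

pub enum RetryDenied { StaleCap, StaleSession, StaleInput }

/// Sketch: a retry is denied if any epoch the previous attempt relied on
/// has moved on since that attempt.
pub fn check_retry(
    prev: &RetryContext,
    current_cap_epoch: u64,
    current_session_epoch: u64,
    current_input_epochs: &[u64],
) -> Result<(), RetryDenied> {
    if prev.cap_epoch != current_cap_epoch {
        return Err(RetryDenied::StaleCap);
    }
    if prev.session_epoch != current_session_epoch {
        return Err(RetryDenied::StaleSession);
    }
    if prev.input_object_epochs.as_slice() != current_input_epochs {
        return Err(RetryDenied::StaleInput);
    }
    Ok(())
}
```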

## Staged Plan

### Stage A: Init-Local Run Model

Add a pure `capos-config` or init-local graph/run-state library that can model
the existing `initConfig.services` startup order, service exports, and child
waits. Keep it volatile. Add host tests for graph validation and state
transitions.
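
A sketch of one such host test, assuming the library rejects cyclic definitions
with a topological-sort check; the function and test names are hypothetical.

```rust
/// Hypothetical validation helper: a graph definition with a dependency cycle
/// must be rejected before any run is created.
fn has_cycle(node_count: usize, edges: &[(usize, usize)]) -> bool {
    // Kahn's algorithm: if a topological sort cannot consume every node,
    // the definition contains a cycle and validation fails.
    let mut indegree = vec![0usize; node_count];
    for &(_, to) in edges {
        indegree[to] += 1;
    }
    let mut queue: Vec<usize> = (0..node_count).filter(|&n| indegree[n] == 0).collect();
    let mut visited = 0;
    while let Some(n) = queue.pop() {
        visited += 1;
        for &(from, to) in edges {
            if from == n {
                indegree[to] -= 1;
                if indegree[to] == 0 {
                    queue.push(to);
                }
            }
        }
    }
    visited != node_count
}

#[test]
fn rejects_cyclic_service_graph() {
    // a -> b -> a is not a valid startup order
    assert!(has_cycle(2, &[(0, 1), (1, 0)]));
    // a -> b -> c is fine
    assert!(!has_cycle(3, &[(0, 1), (1, 2)]));
}
```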

### Stage B: Init Lifecycle Projection

Teach init to expose or print an inspectable service run summary:
planned, spawned, running or waiting, exited, and failed. Later summaries can
add readiness, restart, drain, termination, and shutdown ordering after those
primitives exist. This can remain a text proof before adding any new capability
interface.

### Stage C: Store-Backed Run Log

Once Store/Namespace is credible, persist run events and compact snapshots.
This unlocks post-boot inspection, operator task state, and notebook stories.

### Stage D: IX BuildCoordinator

Represent IX package builds as graph runs. Keep execution in
`BuildCoordinator`, `BuildSandbox`, `Fetcher`, `Archive`, `Store`, and
`Namespace` services.

### Stage E: Operator Task Surface

Expose a shell or structured command surface for graph runs:
list, inspect, assign, pause, resume, retry, cancel, approve, and show
artifacts. This is the DAG-organized todo-list layer.

### Stage F: Notebook Story Projection

Generate notebook-like reports from graph runs: prose, cells, commands, logs,
artifacts, and checkpoints. Treat notebooks as reproducible run narratives,
not as the owner of execution semantics.

### Stage G: Agent Workflows

Use graph runs for long-lived agent tasks, review gates, workspace leases,
memory checkpoints, and human approval nodes.

## Validation

Each stage should have focused checks:

- pure host tests for state transitions and invalid graph rejection;
- init QEMU proof that existing service startup still works;
- later lifecycle-control proof that shutdown dependency order is obeyed, once
  terminate/drain/shutdown primitives exist;
- stale lease and stale cap epoch tests;
- IX differential tests against host-side IX planning where applicable;
- docs build to refresh topics and catch Mermaid/front matter errors.

## Open Questions

- Should init embed the graph library permanently, or should it eventually
  delegate run-state persistence to a child service once storage is available?
- What is the smallest schema for `ArtifactRef` that covers service exports,
  Store refs, logs, notebooks, and package outputs without becoming `Any`?
- Does `domainSchema` identify only a domain schema version, or also the
  domain payload location and content hash for node-specific config?
- How should schedules and sensors be represented without creating hidden
  cyclic runs?
- Which graph events deserve permanent audit retention versus compacted
  operational state?
- Should notebook projections use Jupyter `nbformat` directly, or a smaller
  capOS-native story format that can export to notebooks later?

## Recommendation

Build a small stateful graph substrate, but make it a run-state service or
library, not a universal orchestrator.

For init, use it to make service lifecycle visible and eventually durable.
For IX, use it to track package build graphs while execution remains in build
services. For operators, project it as an assigned DAG todo list. For Jupyter,
project it as a notebook-style user story. For agents, project it as durable
task state with checkpoints and review gates.

The stop line is authority: shared graph code records state, domain
coordinators schedule work, and typed domain services execute it.
