Proposal: Stateful Task and Job Graphs
capOS should eventually have a small durable work-graph substrate: a way to describe, run, inspect, pause, retry, and resume stateful DAG-shaped work. It should serve four related needs without becoming a universal service manager:
- init-owned service startup, restart, and shutdown orchestration;
- IX-style package and build graph execution;
- operator-visible task lists with optional assignee, budget, and run state;
- notebook-like user stories where prose, commands, outputs, and rerun points are recorded as a narrative over real work.
The important design line is that the graph substrate is not the UI, not a shell, not a package manager, not a notebook runtime, not a service manager, and not a generic capability broker. It is the durable state machine beneath those tools.
Position
Adopt a WorkGraph model, but keep it narrow.
The core object is a versioned graph definition plus run instances:
- Graph definition: immutable, schema-validated structure: nodes, typed edges, resource hints, authority requirements, retry/cancellation policy, and expected artifacts.
- Graph run: one execution attempt of a graph definition, with node-run state, leases, logs, checkpoints, artifacts, and audit events.
- Node run: one executable, manual, or descriptive unit of work inside a run.
- Artifact: durable output, checkpoint, service export, log, report, or Store/Namespace reference produced by a node.
- Assignment: optional workload metadata: assignee principal, role, queue, priority, resource profile, deadline, and budget.
The common substrate is a schema/library/event-log pattern, not one global coordinator. Each domain owns its coordinator, executor queue, domain-node schema, validation, and authority:
- init owns init lifecycle state;
- BuildCoordinator owns IX build graph execution and job state;
- an agent runner owns agent task state and workspace leases;
- a notebook/story service owns narrative projections;
- an operator task service owns human assignment state.
They may share graph/run/event/artifact shapes, but they do not share one authority-holding scheduler.
Everything above that is a facade:
- init sees service lifecycle and dependency state;
- IX sees package inputs, build steps, outputs, and Store commits;
- an operator sees a DAG-organized todo list with assigned work;
- a notebook sees cells, prose, rich outputs, and rerun checkpoints;
- an agent runner sees durable steps, memory/checkpoints, and review gates.
The same persisted run can have more than one projection. A failed package build can appear as an IX build failure, an operator task, a notebook section, and a graph node with logs. The core should not know which projection is being used. Cross-domain views should be read-only projections or explicit links to the owning run, not copied mutable event state.
Why This Belongs in capOS
capOS already has several graph-shaped systems:
- initConfig.services is an init-owned service graph.
- ProcessSpawner and ProcessHandle provide process lifecycle edges.
- libcapos-service needs readiness, shutdown, drain, background work, resource reservations, and handoff hooks.
- IX-on-capOS needs dependency-ordered fetch, extract, build, Store commit, and realm publish.
- agent and shell workflows need durable state when work crosses sessions, reviews, restarts, or context compaction.
Without a shared state model, each subsystem will grow its own partial orchestrator: init will have a service table, IX will have a build executor, agents will have task memory, operators will have ad-hoc todo state, and notebook-like demos will have their own cell/run records. That is duplication in the wrong layer.
With too much sharing, the substrate becomes a god object. The right answer is a shared run-state and dependency model with domain-specific executors.
Prior Art Baseline
Sources checked for this proposal:
- capOS IX research note: IX-on-capOS Hosting
- Upstream IX repository and executor source: https://github.com/pg83/ix, https://raw.githubusercontent.com/pg83/ix/master/core/execute.py
- Apache Airflow 3.2 DAG docs: https://airflow.apache.org/docs/apache-airflow/stable/core-concepts/dags.html
- Dagster docs on software-defined assets, ops/graphs, jobs, schedules, and sensors: https://docs.dagster.io/, https://docs.dagster.io/guides/build/assets, https://docs.dagster.io/guides/build/ops, https://docs.dagster.io/guides/build/jobs, https://docs.dagster.io/guides/automate/schedules, https://docs.dagster.io/guides/automate/sensors
- Jupyter nbformat docs: https://nbformat.readthedocs.io/en/latest/format_description.html
- LangGraph persistence docs: https://docs.langchain.com/oss/javascript/langgraph/persistence
The useful lessons are separable.
Airflow: a workflow run has task instances, dependencies, scheduling, retries, timeouts, documentation, and operational state. Airflow’s DAG object intentionally does not care what happens inside a task; it cares about order, retry, timeout, and execution conditions. capOS should copy that separation, but not the Python-file import model, global scheduler database, or operator/plugin surface.
Dagster: asset-first thinking fits capOS better than task-first thinking when the output is durable state. A Store object, package output, Namespace snapshot, boot manifest, built binary, benchmark report, or service export is closer to a Dagster asset than to an Airflow task. Dagster’s ops/graphs remain useful when work is not naturally an asset. capOS should adopt the split: assets are durable products; ops are execution steps; jobs are selections of work to materialize or run. Dagster itself is data-platform-shaped, so it is inspiration, not the implementation target for init.
Jupyter: notebook structure is a user story, not the kernel or init abstraction. Cells, prose, outputs, and metadata are excellent for reviewing a run, explaining why it happened, and rerunning a chosen step. They should be a projection over graph state. Cell order must not become the source of truth for service lifecycle or package builds.
LangGraph: checkpointed graph execution, threads, super-step boundaries, interrupts, and time travel are useful for agent-like and human-in-the-loop work. capOS should borrow the checkpoint boundary idea for resumability, but avoid binding the substrate to LLM message state.
IX: the package graph research is the strongest local precedent. IX’s current executor traverses a dependency graph by node outputs, applies pools, creates output directories, runs shell commands, touches sentinel files, and kills the process group on failure. That proves IX already has a real build graph. It also shows where capOS must stop: graph scheduling must not be fused to subprocess, Unix process groups, filesystem sentinels, hardlinks, symlinks, fetchers, archive extraction, or Store mutation. Those belong behind typed capOS services.
Core Model
The minimal model is:
struct WorkGraph {
graphId @0 :Text;
version @1 :UInt64;
nodes @2 :List(CommonNodeSpec);
edges @3 :List(EdgeSpec);
defaults @4 :GraphPolicy;
domainSchema @5 :UInt64;
}
struct CommonNodeSpec {
nodeId @0 :Text;
title @1 :Text;
inputs @2 :List(ArtifactSelector);
outputs @3 :List(ArtifactSpec);
requiredCaps @4 :List(CapRequirement);
policy @5 :NodePolicy;
assignmentDefault @6 :Assignment;
}
struct WorkRun {
runId @0 :Text;
graphId @1 :Text;
graphVersion @2 :UInt64;
state @3 :RunState;
nodes @4 :List(NodeRun);
events @5 :List(EventRef);
}
struct NodeRun {
nodeId @0 :Text;
state @1 :NodeState;
attempt @2 :UInt32;
assignment @3 :Assignment;
artifacts @4 :List(ArtifactRef);
checkpoint @5 :CheckpointRef;
}
This is a shape, not a final schema. The stable part is the split between definition, run, node-run state, artifacts, and assignments.
Domain node meanings are not a shared NodeKind enum in the common schema.
Init may define InitServiceNode; IX may define FetchNode, ExtractNode,
BuildNode, StoreCommitNode, and PublishNode; a story projection may
define NotebookCellNode or ManualNoteNode. Those domain structs live in
domain-owned schemas or config sections and are validated by the domain
coordinator that holds the relevant authority. The common graph library may
hash, store, and index their association with nodeId, but it must not
interpret every domain’s node kinds.
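A minimal sketch of this division of labor, with hypothetical names (`IxCoordinator`, `NodePayload`, the `kind:detail` byte encoding are all illustrative assumptions, not a proposed wire format): the common library stores an opaque payload keyed by nodeId, and only the domain coordinator interprets or rejects node kinds.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class NodePayload:
    """Opaque domain payload: the common graph library may hash and
    store this, but it never interprets body."""
    node_id: str
    domain_schema: int   # corresponds to WorkGraph.domainSchema
    body: bytes          # domain-defined encoding (illustrative)

class IxCoordinator:
    """Hypothetical IX-side validator: it alone knows IX node kinds."""
    KNOWN_KINDS = {"fetch", "extract", "build", "storeCommit", "publish"}
    SCHEMA_VERSION = 3

    def validate(self, payload: NodePayload) -> None:
        if payload.domain_schema != self.SCHEMA_VERSION:
            raise ValueError(f"unsupported IX schema {payload.domain_schema}")
        kind = payload.body.decode().split(":", 1)[0]
        if kind not in self.KNOWN_KINDS:
            # Unsupported node kinds fail domain validation here,
            # instead of becoming untyped blobs in the core.
            raise ValueError(f"unknown IX node kind {kind!r}")

coord = IxCoordinator()
coord.validate(NodePayload("n1", 3, b"fetch:src.tar.gz"))  # accepted
```

The point of the sketch is the failure mode: a bad node kind is rejected by the owning coordinator, never silently stored as a generic node by the shared library.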
Node State
Node state should be explicit enough for init, package builds, and operators:
- planned: validated but not yet eligible.
- blocked: waiting on upstream nodes, an unavailable capability, resource budget, or manual input.
- runnable: dependencies are satisfied and a worker may lease it.
- leased: a worker or assignee owns the next attempt for a bounded time.
- running: execution has begun.
- waiting: running but blocked on a child process, readiness export, external event, manual approval, timer, or checkpoint resume.
- succeeded: produced the declared outputs or accepted terminal result.
- failed: terminal failure under current policy.
- retryPending: failed attempt will be retried under policy.
- skipped: intentionally not run because branch/condition policy selected a different path.
- canceled: canceled by caller, shutdown, superseding run, or authority revocation.
- paused: durable operator or policy pause.
- stale: graph version, cap epoch, input artifact, or session binding no longer matches the run’s assumptions.
State transitions should be append-only events. Services may compact state into snapshots, but audit and replay need a durable event boundary.
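The append-only discipline can be sketched as follows (a toy model, not the proposed schema; the transition table shown covers only a subset of the states above): every state change is validated against an allowed-transition set and recorded as an event before the snapshot state is updated.

```python
from dataclasses import dataclass, field
from typing import List

# Legal node-run transitions (illustrative subset; a real table
# would cover every state listed above).
ALLOWED = {
    ("planned", "blocked"), ("planned", "runnable"),
    ("blocked", "runnable"), ("runnable", "leased"),
    ("leased", "running"), ("running", "waiting"),
    ("waiting", "running"), ("running", "succeeded"),
    ("running", "failed"), ("failed", "retryPending"),
    ("retryPending", "runnable"),
}

@dataclass(frozen=True)
class TransitionEvent:
    node_id: str
    old: str
    new: str
    attempt: int

@dataclass
class NodeRunState:
    node_id: str
    state: str = "planned"
    attempt: int = 0
    log: List[TransitionEvent] = field(default_factory=list)  # append-only

    def apply(self, new_state: str) -> None:
        if (self.state, new_state) not in ALLOWED:
            raise ValueError(f"illegal transition {self.state} -> {new_state}")
        # Event first, snapshot second: audit/replay reads the log,
        # compaction reads the snapshot.
        self.log.append(TransitionEvent(self.node_id, self.state,
                                        new_state, self.attempt))
        self.state = new_state

n = NodeRunState("svc.logd")
for s in ("runnable", "leased", "running", "succeeded"):
    n.apply(s)
assert [e.new for e in n.log] == ["runnable", "leased", "running", "succeeded"]
```

Compaction then becomes a projection over the log rather than a destructive rewrite of it.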
Edges
A plain DAG edge is not enough. capOS needs typed edge reasons:
- dependsOnSuccess: downstream may run after upstream succeeds.
- dependsOnArtifact: downstream consumes a named artifact or Store ref.
- dependsOnReady: downstream waits on a service readiness export.
- dependsOnLease: downstream may run only while a lease/session is live.
- cancelsWith: cancellation propagates across the edge.
- shutdownBefore: shutdown order edge, usually reverse of startup.
- approvalFor: manual approval gates a node or subgraph.
- observes: node only observes another node’s state and does not block it.
The graph remains acyclic within one run. Loops are modeled by new runs, periodic schedules, sensors, retries, or explicit child graphs. This is a critical stop line: hidden cycles create service-manager behavior inside the graph engine.
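Why edge typing matters can be shown with a tiny eligibility check (a sketch; the node names and the reduced three-kind enum are assumptions for illustration): only blocking edge kinds gate a downstream node, while an observes edge never does.

```python
from enum import Enum, auto

class EdgeKind(Enum):
    DEPENDS_ON_SUCCESS = auto()
    DEPENDS_ON_READY = auto()
    OBSERVES = auto()

# Only some edge kinds gate eligibility; observes never blocks.
BLOCKING = {EdgeKind.DEPENDS_ON_SUCCESS, EdgeKind.DEPENDS_ON_READY}

def runnable(nodes, edges, done, ready):
    """nodes: node ids; edges: (src, dst, kind) triples;
    done/ready: upstream nodes that succeeded / exported readiness."""
    out = []
    for n in nodes:
        ok = True
        for src, dst, kind in edges:
            if dst != n or kind not in BLOCKING:
                continue
            if kind is EdgeKind.DEPENDS_ON_SUCCESS and src not in done:
                ok = False
            if kind is EdgeKind.DEPENDS_ON_READY and src not in ready:
                ok = False
        if ok and n not in done:
            out.append(n)
    return out

edges = [
    ("fetch", "build", EdgeKind.DEPENDS_ON_SUCCESS),
    ("logd", "build", EdgeKind.DEPENDS_ON_READY),
    ("build", "monitor", EdgeKind.OBSERVES),   # never blocks monitor
]
nodes = ["fetch", "build", "monitor"]
assert runnable(nodes, edges, done=set(), ready=set()) == ["fetch", "monitor"]
assert runnable(nodes, edges, done={"fetch"}, ready={"logd"}) == ["build", "monitor"]
```

A plain DAG scheduler would either block `monitor` forever or be unable to express the readiness gate at all; the typed kinds carry that distinction.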
Workload Assignment
Assignment is optional metadata, not authority:
struct Assignment {
principal @0 :Text;
role @1 :Text;
queue @2 :Text;
priority @3 :Int32;
budget @4 :ResourceProfileRef;
deadline @5 :TimeRef;
lease @6 :LeaseRef;
}
An assigned operator or worker may receive a lease to attempt a node. The lease does not grant broad system authority. It only grants the ability to claim or update that node-run through the coordinator, and any executable work still needs domain caps supplied by init, a build coordinator, a package worker, an agent runner, or another supervisor.
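A coordinator-side lease table might look like the following sketch (names and the TTL mechanics are assumptions; the injected `now` parameter just makes the example deterministic): a lease is claimable when absent or expired, and holding it authorizes nothing beyond updating that node-run.

```python
import time
from dataclasses import dataclass
from typing import Dict, Optional

@dataclass
class Lease:
    node_id: str
    holder: str        # assigned principal; identity only, not authority
    expires_at: float

class LeaseTable:
    """Coordinator-side lease bookkeeping. Holding a lease only permits
    claiming/updating this node-run; domain caps are granted elsewhere."""
    def __init__(self, ttl: float = 30.0):
        self.ttl = ttl
        self.leases: Dict[str, Lease] = {}

    def claim(self, node_id: str, holder: str,
              now: Optional[float] = None) -> Lease:
        now = time.monotonic() if now is None else now
        cur = self.leases.get(node_id)
        if cur and cur.expires_at > now and cur.holder != holder:
            raise PermissionError(f"{node_id} leased by {cur.holder}")
        lease = Lease(node_id, holder, now + self.ttl)
        self.leases[node_id] = lease
        return lease

    def may_update(self, node_id: str, holder: str,
                   now: Optional[float] = None) -> bool:
        now = time.monotonic() if now is None else now
        cur = self.leases.get(node_id)
        return bool(cur and cur.holder == holder and cur.expires_at > now)

t = LeaseTable(ttl=30.0)
t.claim("build.pkg", "worker-a", now=0.0)
assert t.may_update("build.pkg", "worker-a", now=10.0)      # holder, unexpired
assert not t.may_update("build.pkg", "worker-b", now=10.0)  # wrong holder
t.claim("build.pkg", "worker-b", now=40.0)                  # expired: reclaimable
```

Expiry is what keeps a crashed worker from wedging a node forever: after the TTL, another worker may claim the next attempt without any revocation protocol.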
This makes the same graph usable as:
- a todo list where a human owns a manual node;
- a build queue where a worker owns a build step;
- an init run where PID 1 owns service lifecycle nodes;
- an agent plan where a worker owns a bounded workspace task.
Init As A Consumer
The user direction is important: this may be used for workload orchestration by init.
The current init path validates initConfig.services, spawns children through
ProcessSpawner, records exports, and waits. The first graph use should only
observe and structure that existing behavior:
- Compile initConfig.services into a graph definition.
- Create a volatile boot WorkRun in init memory.
- Treat each service as a lifecycle node with the states current init can actually observe: planned, spawned, running/waiting, exited, or failed.
- Use typed edges for declared cap imports and manifest-order dependencies.
- Persist selected run events later through a Store-backed journal when storage is available.
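The compile step above can be sketched as a pure function (a sketch only: `ServiceEntry`, its `imports` field, and the `manifestOrder` edge label are assumptions standing in for whatever initConfig.services actually carries): declared cap imports become dependsOnReady edges, and manifest order becomes explicit ordering edges, so the volatile boot run mirrors today's startup behavior without changing it.

```python
from dataclasses import dataclass
from typing import List, Tuple

@dataclass
class ServiceEntry:
    name: str
    imports: List[str]   # cap imports from other services (assumed field)

def compile_boot_graph(services: List[ServiceEntry]):
    """Compile a service list into lifecycle nodes plus typed edges:
    one dependsOnReady edge per declared cap import, and manifest-order
    edges preserving today's sequential startup."""
    nodes = [s.name for s in services]
    edges: List[Tuple[str, str, str]] = []
    for i, svc in enumerate(services):
        for dep in svc.imports:
            edges.append((dep, svc.name, "dependsOnReady"))
        if i > 0:
            edges.append((services[i - 1].name, svc.name, "manifestOrder"))
    return nodes, edges

nodes, edges = compile_boot_graph([
    ServiceEntry("logd", []),
    ServiceEntry("netd", ["logd"]),
    ServiceEntry("shell", ["logd", "netd"]),
])
assert ("logd", "netd", "dependsOnReady") in edges
assert ("netd", "shell", "manifestOrder") in edges
```

Because the function only reads the existing config, this stage observes and structures current behavior rather than introducing new scheduling.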
Init does not need to become a general-purpose Airflow. It needs a durable or inspectable lifecycle table with graph semantics:
- what services were planned;
- what caps and exports they depend on;
- which services are spawned, running, waiting, exited, failed, or blocked under the current primitives;
- later, which services are restarting, draining, terminating, or ordered for shutdown once those lifecycle primitives exist;
- what operator-visible work remains.
Restart, drain, termination, readiness-export waiting, and shutdown-order control are later phases. They require primitives that are still future in the service and broker proposals:
- process termination or kill-tree semantics narrower than raw process-table authority;
- an explicit readiness/export contract for services;
- service drain or lifecycle caps for graceful shutdown;
- restart policy state that is disabled or narrowed during shutdown mode;
- stale export and stale process-handle behavior for restarted services;
- audit events that distinguish crash, restart, operator stop, shutdown, timeout, and stale-authority denial.
The generic graph code can be an init-internal library at first. If a separate
run-state service appears later, init should delegate only narrow read or
update capabilities to it. The separate service must not receive
ProcessSpawner, raw process handles, or service-owner caps merely because it
stores graph state.
IX Package Graph Consumer
IX should use the same run-state model with a different executor:
- package templates and descriptors produce graph definitions;
- fetch/extract/build/store/publish become typed nodes;
- inputs and outputs are Store or Namespace refs;
- build logs and output hashes are artifacts;
- package build workers lease executable nodes;
- BuildCoordinator owns scheduling, cancellation, queues, and job state;
- Fetcher, Archive, BuildSandbox, Store, and Namespace hold the real authority.
The graph substrate should not know how to fetch a URL, unpack a tarball, run
sh, or commit a Store object. It records that those typed steps exist, which
worker owns the attempt, what artifacts were produced, and whether the run can
resume or retry.
This preserves the IX research recommendation: use IX’s package corpus and content-addressed model without importing a CPython/POSIX executor boundary. It does not move IX job ownership into a global graph coordinator.
Notebook User Story
Jupyter is best treated as a user story:
- A notebook cell can map to a note, manualTask, notebookCell, agentStep, or build node.
- Cell output is an artifact: text, table, image, log excerpt, benchmark summary, Store ref, or Namespace snapshot.
- Markdown/prose explains why the graph exists and how to interpret its state.
- Rerun means “create a new run or retry selected node(s) under policy”, not “mutate hidden cell global state”.
- Checkpoints let a user resume from a durable boundary.
The notebook layer may be CLI text, mdBook, a future web shell, or a rich UI. The core model should not depend on any of those.
Dagster Fit
Dagster is closer than Airflow for durable capOS work when outputs matter. For capOS, a software-defined asset maps naturally to:
- content-addressed package output;
- boot image or manifest;
- Namespace snapshot;
- benchmark report;
- generated code artifact;
- service export that becomes available after readiness;
- notebook output captured as a reproducible artifact.
Dagster’s ops and graphs map to executable steps. Its jobs map to selections of assets or ops to run. Its sensors and schedules map to run creation policies.
The mismatch is domain and authority. Dagster assumes a data-platform runtime, Python definitions, and external resources. capOS needs capability grants, typed service exports, process handles, sessions, Store/Namespace refs, resource ledgers, and boot-time constraints. The right move is not “run Dagster in init”; it is “use Dagster’s asset/ops/jobs distinction to keep the capOS graph model honest.”
Where To Stop
The main risk is building a god object. The graph substrate must not absorb every adjacent concept.
Stop at these boundaries:
- No kernel WorkGraph capability. The kernel provides primitive caps: process, memory, IPC, timers, devices, and storage plumbing. Graph state is userspace.
- No global service discovery. A graph may reference capabilities granted into its runner or produced by its own nodes. It must not look up arbitrary services by global name.
- No ambient executor. Run-state code cannot execute arbitrary strings, scripts, Cap’n Proto calls, or binaries. A domain executor must hold the exact capabilities needed.
- No universal plugin ABI. Domain node kinds are typed in domain schemas. Unsupported node kinds fail domain validation rather than becoming untyped byte blobs.
- No authority laundering. Assignment, tags, labels, notebook cells, and graph edges do not grant authority. Only capabilities do.
- No UI state in the core. Notebook cells, DAG visual positions, comments, and todo-list grouping are projections or metadata.
- No package-manager logic in the core. Fetch, archive, build, Store, and Namespace operations stay in IX/build services.
- No init-specific policy in the core. Restart policy, shutdown order, and process termination are init or supervisor policy. The graph can record and drive them only through explicit runner methods.
- No hidden loops. Periodic work, sensors, retries, and agent iteration create new attempts or runs. One run’s execution graph stays acyclic.
- No unbounded event retention by default. Retention and compaction are policy fields, not accidental database growth.
If a feature requires any graph coordinator to hold broad ProcessSpawner,
DeviceManager, NetworkManager, Store, Namespace, Fetcher, shell,
or session authority for all domains, the design has crossed the line.
Service Split
The target split is:
flowchart TD
Lib[Shared graph schema and state library]
Log[Optional Store-backed event log]
Lib --> InitCoord[init-local lifecycle graph]
Lib --> BuildCoord[IX BuildCoordinator graph]
Lib --> TaskCoord[operator task graph]
Lib --> StoryCoord[notebook/story projection]
Lib --> AgentCoord[agent-run graph]
InitCoord --> InitLog[volatile boot run first]
BuildCoord --> Log
TaskCoord --> Log
StoryCoord --> Log
AgentCoord --> Log
InitCoord --> InitExec[init lifecycle executor]
BuildCoord --> BuildExec[build workers]
TaskCoord --> Human[operator/manual assignee]
AgentCoord --> AgentExec[agent worker]
InitExec --> Spawner[ProcessSpawner]
BuildExec --> Sandbox[BuildSandbox]
BuildExec --> Store[Store/Namespace]
AgentExec --> Workspace[Task workspace caps]
Only domain coordinators and executors hold domain authority. The shared code owns no authority beyond manipulating in-memory or Store-backed graph records through whatever narrow capability its caller already holds.
Persistence
Persistence should be incremental:
- Early init boot runs can be volatile.
- Build runs should persist event logs, logs, artifacts, and Store refs as soon as Store exists.
- Operator tasks and notebook stories should persist once user storage exists.
- Agent runs should persist checkpoints and review state, not raw hidden prompt state.
Store integration should use content-addressed objects for immutable outputs and an append-only or generation-checked log for mutable run state. Namespace snapshots can publish human-facing names for completed runs, package realms, or notebook reports.
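A generation-checked log can be sketched as compare-and-set over a generation counter (a toy in-memory model; names are illustrative): a writer presents the generation it read, and a stale write is rejected instead of clobbering newer state.

```python
from dataclasses import dataclass

@dataclass
class RunRecord:
    run_id: str
    state: str
    generation: int   # bumped on every committed mutation

class RunLog:
    """Generation-checked mutable run state: writers must present the
    generation they read; a lost race is rejected, not overwritten."""
    def __init__(self):
        self.records = {}

    def put(self, rec: RunRecord):
        self.records[rec.run_id] = rec

    def compare_and_set(self, run_id: str, expected_gen: int,
                        new_state: str) -> bool:
        cur = self.records[run_id]
        if cur.generation != expected_gen:
            return False   # stale: caller re-reads and retries
        self.records[run_id] = RunRecord(run_id, new_state, expected_gen + 1)
        return True

log = RunLog()
log.put(RunRecord("run-1", "running", 0))
assert log.compare_and_set("run-1", 0, "succeeded")
assert not log.compare_and_set("run-1", 0, "failed")   # stale generation
assert log.records["run-1"].state == "succeeded"
```

The same check works whether the backing store is memory, an append-only Store log, or a snapshot object whose content hash doubles as the generation.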
Boot must not depend on a separate Store-backed graph service being available. If durable graph logging is unavailable during boot, init falls back to its volatile lifecycle table and emits diagnostics through its existing console/log path. Durable replay and post-boot inspection are degraded in that mode, but service startup must not fail solely because the graph log is unavailable.
Security Rules
- Node claims are lease-based and expire.
- Every state update is authorized by the current lease, graph owner, or a delegated control cap.
- Node output publication validates expected artifact type and size.
- Retrying a node must not reuse stale capabilities, stale sessions, or stale object epochs.
- Cancellation must release leases and ask domain executors to drain or kill work through typed lifecycle caps.
- Audit logs distinguish failure, cancellation, stale authority, denied authority, timeout, manual rejection, and superseded run.
- Resource budgets are reserved before execution and released on all terminal paths.
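The last rule, reserve-before-run with release on every terminal path, reduces to a try/finally discipline, sketched here with a toy ledger (the `Ledger` interface is an assumption, not the proposed resource-ledger API):

```python
class Ledger:
    """Toy resource ledger: reserve before execution, release on every
    terminal path (success, failure, cancellation) via try/finally."""
    def __init__(self, capacity: int):
        self.capacity = capacity

    def reserve(self, amount: int):
        if amount > self.capacity:
            raise RuntimeError("budget exceeded")
        self.capacity -= amount

    def release(self, amount: int):
        self.capacity += amount

def run_node(ledger: Ledger, cost: int, work):
    ledger.reserve(cost)
    try:
        return work()
    finally:
        ledger.release(cost)   # released even if work() raises

ledger = Ledger(capacity=4)
assert run_node(ledger, 2, lambda: "ok") == "ok"
try:
    run_node(ledger, 2, lambda: 1 / 0)
except ZeroDivisionError:
    pass
assert ledger.capacity == 4   # budget released on both paths
```

In the real system the "finally" would be a coordinator obligation triggered by any terminal node-run event, so a crashed executor cannot leak reservations.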
Staged Plan
Stage A: Init-Local Run Model
Add a pure capos-config or init-local graph/run-state library that can model
the existing initConfig.services startup order, service exports, and child
waits. Keep it volatile. Add host tests for graph validation and state
transitions.
Stage B: Init Lifecycle Projection
Teach init to expose or print an inspectable service run summary: planned, spawned, running or waiting, exited, and failed. Later summaries can add readiness, restart, drain, termination, and shutdown ordering after those primitives exist. This can remain a text proof before adding any new capability interface.
Stage C: Store-Backed Run Log
Once Store/Namespace is credible, persist run events and compact snapshots. This unlocks post-boot inspection, operator task state, and notebook stories.
Stage D: IX BuildCoordinator
Represent IX package builds as graph runs. Keep execution in
BuildCoordinator, BuildSandbox, Fetcher, Archive, Store, and
Namespace services.
Stage E: Operator Task Surface
Expose a shell or structured command surface for graph runs: list, inspect, assign, pause, resume, retry, cancel, approve, and show artifacts. This is the DAG-organized todo-list layer.
Stage F: Notebook Story Projection
Generate notebook-like reports from graph runs: prose, cells, commands, logs, artifacts, and checkpoints. Treat notebooks as reproducible run narratives, not as the owner of execution semantics.
Stage G: Agent Workflows
Use graph runs for long-lived agent tasks, review gates, workspace leases, memory checkpoints, and human approval nodes.
Validation
Each stage should have focused checks:
- pure host tests for state transitions and invalid graph rejection;
- init QEMU proof that existing service startup still works;
- later lifecycle-control proof that shutdown dependency order is obeyed, once terminate/drain/shutdown primitives exist;
- stale lease and stale cap epoch tests;
- IX differential tests against host-side IX planning where applicable;
- docs build to refresh topics and catch Mermaid/front matter errors.
Open Questions
- Should init embed the graph library permanently, or should it eventually delegate run-state persistence to a child service once storage is available?
- What is the smallest schema for ArtifactRef that covers service exports, Store refs, logs, notebooks, and package outputs without becoming Any?
- Does domainSchema identify only a domain schema version, or also the domain payload location and content hash for node-specific config?
- How should schedules and sensors be represented without creating hidden cyclic runs?
- Which graph events deserve permanent audit retention versus compacted operational state?
- Should notebook projections use Jupyter nbformat directly, or a smaller capOS-native story format that can export to notebooks later?
Recommendation
Build a small stateful graph substrate, but make it a run-state service or library, not a universal orchestrator.
For init, use it to make service lifecycle visible and eventually durable. For IX, use it to track package build graphs while execution remains in build services. For operators, project it as an assigned DAG todo list. For Jupyter, project it as a notebook-style user story. For agents, project it as durable task state with checkpoints and review gates.
The stop line is authority: shared graph code records state, domain coordinators schedule work, and typed domain services execute it.