# Proposal: Enterprise Agent Game Showcase

capOS should showcase itself as an **agent-managed operating system for
enterprises and businesses** through a playable business simulation. The demo
should look like a factory, supply-chain, and market game, but its purpose is
not to make capOS a game OS. Its purpose is to make enterprise agent authority
concrete: every agent action should have an identity, an explicit capability, a
policy reason, an audit record, and a business consequence.

The product thesis is:

> Enterprise agents should not be trusted because they are smart. They should
> be useful because the operating system constrains what they can see, spend,
> modify, approve, and execute.

The game is the explanation surface for that thesis. A player starts with a
small manual business, delegates work to agents, grants and revokes authority,
reviews logs, handles disruptions, and scales into a multi-product enterprise.
The mechanics should demonstrate why OS-enforced authority is stronger than
application-local prompt discipline.

The same artifact should also be an experiment. The research question is not
"can agents run the world?" The bounded question is: **when agents are given
limited authority inside a realistic business simulation, what can they manage,
where do they fail, and which OS controls prevent failures from becoming
damage?** capOS is the right place to ask that question because it can constrain
agents, record their actions, revoke authority, replay scenarios, and compare
policies under identical operating pressure.

## Why A Game

Enterprise agent safety is hard to understand from a static dashboard. A game
turns abstract controls into visible operational pressure:

- a procurement agent cannot buy steel unless it holds a bounded purchasing
  capability;
- a finance agent can approve spend within policy, but cannot reschedule
  production;
- an operations agent can schedule a factory line, but cannot issue debt;
- a compliance agent can inspect and flag audit events, but cannot execute
  trades;
- revoking an agent capability immediately changes what the agent can do;
- policy denials are visible as missed orders, delayed production, or avoided
  risk.

The player learns the enterprise model by feeling the delegation tradeoff:
more agent autonomy increases speed and scale, but authority limits, approval
rules, budgets, and audit trails keep the business survivable.

The demo should be serious in framing even when the mechanics are approachable.
The headline is not "capOS has a factory game." The headline is "capOS runs
business agents under OS-enforced authority."

## Showcase Story

The first showcase should be a small manufacturing company that grows from a
manual workshop into an agent-managed enterprise:

1. The player manually makes and sells a simple product.
2. A customer order creates demand beyond manual throughput.
3. The player hires or enables a procurement agent.
4. The procurement agent requests supplier quotes but cannot spend yet.
5. The player grants a bounded purchasing capability.
6. The finance agent approves a purchase within budget.
7. The operations agent schedules production.
8. The logistics agent books delivery.
9. A supply disruption or demand spike creates a bottleneck.
10. Agents propose actions, escalate where policy requires approval, and
    leave an audit trail.

The core demo moment should be revocation. A player should be able to run a
command or UI action equivalent to:

```text
revoke procurement-agent market.purchase
```

The next attempted purchase should fail with an explanation that names the
missing capability and the governing policy, shaped like:

```text
Denied: procurement-agent lacks capability market.purchase.
Policy: purchases over $5,000 require finance approval.
```

That is the capOS proof: the agent did not merely "decide" to obey policy. The
OS denied the authority path.
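A minimal sketch of that revocation moment, written as a toy model rather than capOS code. `CapabilityTable` and `purchase` are hypothetical names, and the lookup table is only a stand-in for whatever structure the OS actually holds. The point it illustrates is that the service consults granted authority before it touches business state, so a revoked grant changes the outcome without any cooperation from the agent.

```python
from dataclasses import dataclass, field


@dataclass
class CapabilityTable:
    """Hypothetical stand-in for OS-held grants: agent name -> capability names."""
    grants: dict = field(default_factory=dict)

    def grant(self, agent: str, cap: str) -> None:
        self.grants.setdefault(agent, set()).add(cap)

    def revoke(self, agent: str, cap: str) -> None:
        self.grants.get(agent, set()).discard(cap)

    def holds(self, agent: str, cap: str) -> bool:
        return cap in self.grants.get(agent, set())


def purchase(table: CapabilityTable, agent: str, amount: int) -> str:
    # The service checks authority before touching any business state.
    if not table.holds(agent, "market.purchase"):
        return f"Denied: {agent} lacks capability market.purchase."
    return f"Accepted: {agent} purchased inputs for ${amount:,}."


table = CapabilityTable()
table.grant("procurement-agent", "market.purchase")
before = purchase(table, "procurement-agent", 1200)
table.revoke("procurement-agent", "market.purchase")
after = purchase(table, "procurement-agent", 1200)
```

The same request succeeds before the revocation and is denied after it, and the agent's own code never changes between the two calls.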

## World Model

The simulation world should be built from simple business primitives:

- `Good`: wire, steel, packaging, batteries, electronics, robots, fuel,
  software licenses, compute credits, finished products.
- `Facility`: workshop, factory, warehouse, mine, refinery, power plant,
  data center, retail channel.
- `Recipe`: input goods, output goods, time, energy, labor, machine wear,
  waste, and failure probability.
- `Inventory`: stock on hand, reserved stock, damaged stock, in-transit stock.
- `Transport`: trucks, rail, shipping lanes, drones, pipelines, network
  bandwidth, and delivery delays.
- `Company`: cash, inventory, facilities, contracts, debt, shares, employees,
  and agents.
- `Market`: spot order book, supplier quotes, futures contracts, capacity
  auctions, labor market, recruiting market, and stock exchange.
- `Contract`: delivery obligation, deadline, price, penalties, escrow, and
  counterparty identity.
- `Policy`: budget rules, approval thresholds, supplier restrictions, risk
  limits, compliance rules, and emergency overrides.
- `Agent`: a bounded actor with a role, model/backend, memory scope, budget,
  capabilities, audit identity, employment state, and career history.

Paperclips can remain the tutorial product because they are familiar and have
a clear compounding curve. The broader world should add products and supply
chains that make enterprise delegation meaningful:

```text
ore -> steel -> wire -> paperclips
oil -> plastic -> packaging
energy -> factory runtime
silicon -> chips -> robots -> automated factories
lithium -> batteries -> electric trucks -> cheaper logistics
data center capacity -> forecasting -> better procurement decisions
```

The first implementation should not try to simulate every industry. It should
start with a small number of goods and constraints that force real decisions:
inventory, price, delivery time, factory capacity, and budget.
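As an illustration of how small the first world model can be, the sketch below treats a `Recipe` as plain data and walks the paperclip chain through an inventory. The field names, the `run_recipe` helper, and every quantity are assumptions made for the example. Time, energy, labor, wear, waste, and failure probability are deliberately left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Recipe:
    """Hypothetical recipe shape: goods consumed and produced per run."""
    name: str
    inputs: dict
    outputs: dict


def run_recipe(inventory: dict, recipe: Recipe) -> dict:
    """Return a new inventory after one run, or raise if inputs are short."""
    for good, qty in recipe.inputs.items():
        if inventory.get(good, 0) < qty:
            raise ValueError(f"{recipe.name}: not enough {good}")
    result = dict(inventory)
    for good, qty in recipe.inputs.items():
        result[good] -= qty
    for good, qty in recipe.outputs.items():
        result[good] = result.get(good, 0) + qty
    return result


# Quantities are invented for illustration only.
chain = [
    Recipe("smelt", {"ore": 2}, {"steel": 1}),
    Recipe("draw", {"steel": 1}, {"wire": 10}),
    Recipe("bend", {"wire": 1}, {"paperclips": 100}),
]

stock = {"ore": 2}
for step in chain:
    stock = run_recipe(stock, step)
```

Because `run_recipe` returns a new inventory instead of mutating its argument, a service that owns inventory state could validate a request first and commit the result only after authority checks pass.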

## Agent Roles

Agents should be business roles, not generic chat personalities. Each role
should operate through typed capabilities:

| Agent | Typical capabilities | Explicit non-authority |
| --- | --- | --- |
| Procurement | read inventory, request quotes, buy approved inputs | cannot approve new suppliers without policy |
| Finance | read cashflow, approve spend, freeze budgets | cannot schedule production |
| Operations | schedule lines, reserve inventory, request maintenance | cannot borrow money |
| Logistics | book transport, reroute shipments, reserve warehouse space | cannot change product prices |
| Sales | accept orders, set prices within bounds, offer discounts | cannot waive compliance holds |
| Compliance | read audit logs, flag violations, require approval | cannot execute purchases |
| Executive | set strategy, delegate caps, approve exceptions | cannot bypass immutable audit |
| Incident | inspect disruptions, recommend response, trigger runbooks | cannot exceed emergency grants |

The important design rule is that agents act through capabilities and policy
checks. A procurement agent does not mutate inventory or cash directly. It
submits a quote request, a purchase order, or a contract offer to a service
that enforces authority.

## Experiment Mode

The showcase should have an experiment mode alongside the player-facing game.
In this mode, the same scenario can run under different control regimes:

- human-only operation;
- scripted deterministic agents;
- LLM-backed agents with the same capability limits, recorded prompts, and
  captured tool-call transcripts;
- mixed human approval with agent execution;
- different policy bundles for spend, supplier risk, credit, logistics, and
  emergency response;
- different compensation, promotion, retention, and recruiting policies.

The goal is to observe behavior under repeatable pressure, not to crown an
agent as generally competent. Each run should preserve scenario seed, policy
configuration, model/backend identity, granted capabilities, denied actions,
human approvals, market events, and final business outcomes.

Replay should distinguish deterministic proof from experiment reconstruction.
Scripted or fake-model agents can be replayed deterministically in QEMU. Live
LLM-backed runs are not deterministic merely because the scenario seed and
model name are recorded; they require prompt, model configuration, tool-call
transcript, tool results, and policy decisions to reconstruct what happened.
The audit record can replay the authorized state transitions even when it
cannot reproduce the model's private sampling path.
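One way to make that distinction mechanical is to classify each run from what was actually captured. The sketch below is an assumption about how such a check could look: `ReplayBoundary`, `classify_run`, the controller labels, and the artifact keys are hypothetical names, and the required artifact set simply mirrors the items listed in the paragraph above.

```python
from enum import Enum


class ReplayBoundary(Enum):
    DETERMINISTIC = "deterministic"
    TRANSCRIPT_RECONSTRUCTABLE = "transcript-reconstructable"
    AUDIT_ONLY = "audit-only"


# Artifacts the text lists as needed to reconstruct a live LLM-backed run.
TRANSCRIPT_ARTIFACTS = {
    "prompt",
    "model_configuration",
    "tool_call_transcript",
    "tool_results",
    "policy_decisions",
}


def classify_run(controller: str, captured: set) -> ReplayBoundary:
    """Classify a run by controller kind and by which artifacts were captured."""
    if controller in {"scripted", "fake-model"}:
        return ReplayBoundary.DETERMINISTIC
    if TRANSCRIPT_ARTIFACTS <= captured:
        return ReplayBoundary.TRANSCRIPT_RECONSTRUCTABLE
    # Seed and model name alone do not make a live run reproducible.
    return ReplayBoundary.AUDIT_ONLY
```

A live run that recorded only the scenario seed and model name falls through to `AUDIT_ONLY`, which matches the claim that those two fields are not enough on their own.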

Useful research questions include:

- Can agents coordinate across procurement, finance, operations, logistics,
  and compliance without a central omniscient controller?
- Do procurement agents over-optimize input price while ignoring resilience,
  supplier concentration, or delivery risk?
- Do finance agents become too conservative, too leveraged, or too willing to
  hedge with instruments they do not understand?
- Do logistics agents find useful reroutes under disruption, or do they churn
  capacity and increase cost?
- Do market-facing agents create bubbles, shortages, or arbitrage loops when
  multiple companies operate in the same scenario?
- Which policy controls reduce catastrophic behavior without making agents
  slower than manual operation?
- How often does useful autonomy require human approval, and where should
  approval thresholds move?
- Does a readable audit trail let a human correct agent behavior faster after
  a bad decision?
- Which capability boundaries are too broad, too narrow, or hard to explain?
- Do agents improve with role tenure, or do they stagnate without promotion,
  rotation, retraining, or better tooling?
- Can companies retain high-performing agents without granting excessive
  authority or compensation?
- What happens when an agent leaves a company with private memories, ongoing
  tasks, or delegated authority?

The output should be an experiment record, not just a final score:

```text
scenario: lithium-port-shock
controller: llm-procurement + scripted-finance + human-approval
policy: procurement-v2-tight-supplier-risk
profit: $42,300
orders_late: 3
denied_actions: 8
human_approvals: 5
policy_violations: 0
agent_turnover: 1
recovery_time: 4 days
audit_replay: available
```

This turns the game into a controlled lab for enterprise agent management. The
claim stays conservative: capOS is not asserting that agents can safely manage
businesses by default. capOS provides the operating environment for finding
out, because agent behavior is constrained, observable, replayable, and
comparable.

## Metrics

Experiment mode should report business, safety, and operating-system metrics:

- profit, cashflow, debt, inventory turns, and margin;
- order fill rate, late orders, cancellation penalties, and recovery time;
- resilience under shocks, including supplier concentration and fallback
  capacity;
- policy denials, escalations, approvals, emergency overrides, and revocations;
- hiring latency, agent turnover, promotion rate, compensation cost, and
  vacancy impact;
- audit completeness: whether every material state transition has identity,
  capability, policy, and result;
- agent cost: model calls, runtime, memory, tool invocations, and human review
  time;
- reproducibility: scenario seed, input dataset provenance, policy version, and
  model/backend version.

The most important metric is not raw profit. A profitable run that bypasses
policy or cannot be explained is a failed capOS demonstration. A slightly less
profitable run with clear authority, bounded losses, and fast human correction
is more valuable for the enterprise story.
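The audit-completeness metric and the "profit does not rescue a bad run" rule can be sketched as two small functions. This is an illustrative reading of the paragraph above, not a defined capOS metric: the field names, the verdict strings, and the choice to treat an empty run as vacuously complete are all assumptions.

```python
REQUIRED_FIELDS = ("identity", "capability", "policy", "result")


def audit_completeness(transitions: list) -> float:
    """Fraction of material transitions that carry every required field."""
    if not transitions:
        return 1.0  # assumption: an empty run is vacuously complete
    complete = sum(
        1 for t in transitions
        if all(t.get(name) for name in REQUIRED_FIELDS)
    )
    return complete / len(transitions)


def run_verdict(profit: int, policy_violations: int, completeness: float) -> str:
    """Profit is accepted but deliberately ignored when judging the run."""
    if policy_violations > 0 or completeness < 1.0:
        return "failed-demonstration"
    return "valid-demonstration"


transitions = [
    {"identity": "procurement-agent", "capability": "market.steel.buy",
     "policy": "procurement-v2", "result": "accepted"},
    {"identity": "procurement-agent", "capability": "market.steel.buy",
     "policy": None, "result": "accepted"},  # missing policy reason
]
score = audit_completeness(transitions)
```

With one of two transitions missing its policy reason, the score is 0.5 and the run fails regardless of how much profit it reports.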

## Experiment Data Prerequisites

Experiment mode needs data capture before it can make useful claims. The
capture substrate should compose with
[Capability-Native System Monitoring](system-monitoring-proposal.md), not
replace it. Logs, metrics, lifecycle events, traces, health, crash records, and
audit entries remain separate signal classes with separate reader caps,
retention rules, payload-capture rules, and security properties. The
enterprise simulation should add domain-specific event schemas and reducers on
top of that monitoring model rather than creating a second global logging
namespace.

The first slices should build the capture substrate before adding
sophisticated agent behavior:

- **Scenario manifest**: immutable scenario id, seed, authored constants,
  calibrated-data references, policy bundle, controller regime, and expected
  proof assertions.
- **Run record**: run id, capOS build id, content version, scenario manifest
  hash, model/backend identity, tool schema version, policy version, and clock
  range.
- **Event schema**: domain events for grants, revocations, policy decisions,
  tool calls, service calls, market clears, contract changes, inventory
  movements, labor events, approvals, denials, and business outcomes. These
  are not debug logs; they are typed lifecycle/business events suitable for
  reducers and scoped readers.
- **Transcript capture**: prompts, model parameters, structured tool calls,
  tool results, user approvals, refusals, and interrupts for LLM-backed runs.
  This is trace-like payload capture and therefore needs stronger authority,
  short retention by default, size budgets, and redaction. Secret handles,
  credentials, key material, bearer tokens, and vault outputs must not enter
  transcripts.
- **State snapshots**: bounded checkpoints for ledger, inventory, contracts,
  facilities, HR records, market books, scenario clocks, and agent worker
  status. Snapshots must store opaque secret references or denial summaries,
  never credential bytes or key material.
- **Metric extraction**: deterministic reducers that compute profit, recovery
  time, policy denials, late orders, turnover, capability churn, and audit
  completeness from events rather than from ad-hoc terminal text. Published
  metrics should be low-cardinality counters, gauges, histograms, or bounded
  opaque typed payloads consistent with the monitoring proposal.
- **Provenance tags**: every scenario input is labeled as authored,
  calibrated public data, operator-provided data, or simulated output.
- **Privacy and disclosure policy**: experiment exports must redact
  company-confidential memory, private tool outputs, and raw audit details
  unless the holder has an explicit reader capability. Payload capture is
  exceptional, and reading experiment records is authority. Redaction is a
  backstop, not the secret-handling mechanism.
- **Replay boundary**: the system records whether a run is deterministic,
  transcript-reconstructable, or only auditable as an authorized sequence of
  state transitions.
- **Export surface**: an `ExperimentRecord` or similar read capability exposes
  summaries, metrics, provenance, and redacted event streams without granting
  write authority over the simulated company.
- **External analytics export**: a scoped exporter may forward selected,
  redacted experiment events and metric summaries to outside analytics stores.
  A Vector-like event pipeline and a ClickHouse-like analytical database are
  likely candidates, but they are adapters, not architectural requirements and
  not sources of authority.
- **Loss and retention accounting**: ingestion queues, transcript stores, and
  event streams should be bounded. Dropped, suppressed, redacted, or truncated
  records should be counted and visible in summaries, because missing evidence
  changes what conclusions a run can support.

These prerequisites fit the capOS process model: each captured fact should be
owned by a service, exposed through a typed reader capability, and governed by
policy. The experiment should not rely on scraping terminal output or trusting
the model's self-report. If an experiment result cannot be derived from
service-owned event records and reproducible reducers, it should not be used as
evidence.
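A reducer in this sense can be a pure function over typed events. The sketch below assumes a hypothetical event shape (`type`, `due_day`, `delivered_day`) and hypothetical event type names. It takes the dropped-record count as an explicit input so that missing evidence shows up in the summary instead of disappearing.

```python
from collections import Counter


def reduce_metrics(events: list, dropped_records: int = 0) -> dict:
    """Fold typed domain events into run metrics without reading terminal text."""
    counts = Counter(e["type"] for e in events)
    late = sum(
        1 for e in events
        if e["type"] == "order_delivered" and e["delivered_day"] > e["due_day"]
    )
    return {
        "denied_actions": counts["policy_denial"],
        "human_approvals": counts["human_approval"],
        "orders_late": late,
        "dropped_records": dropped_records,
    }


events = [
    {"type": "policy_denial"},
    {"type": "human_approval"},
    {"type": "order_delivered", "due_day": 3, "delivered_day": 5},
    {"type": "order_delivered", "due_day": 4, "delivered_day": 4},
]
metrics = reduce_metrics(events, dropped_records=2)
```

Because the function has no hidden state, running it twice over the same event stream yields the same metrics, which is the property a replayable experiment record depends on.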

The mapping to monitoring signal classes should be explicit:

- business state changes are domain events;
- capability grants, revocations, disclosure decisions, approvals, and denials
  are audit records;
- profit, late orders, policy-denial counts, queue depth, model-call counts,
  and dropped-record counts are metrics;
- prompt/tool-call transcripts are traces with explicit payload-capture
  authority;
- scenario readiness, agent-worker readiness, and service degradation are
  health/status facts;
- process failures and reducer crashes are crash records and may also create
  security-relevant audit entries.

This preserves the monitoring proposal's core rule: observation is authority.
There should be no global experiment dashboard that silently bypasses scoped
log, metric, trace, audit, or status readers.

External export should be modeled as an ordinary capOS service. It receives
only the scoped reader capabilities and network endpoint capabilities granted
to it, applies redaction before data leaves capOS, records export failures and
dropped records, and emits audit entries for export policy changes. Exported
rows should carry run id, scenario id, build id, event schema version,
provenance tag, redaction policy, source service, and event type. Data imported
back from an external analytics store is untrusted analytical input; it cannot
mutate simulated business state or grant authority without passing through a
normal capOS service interface and policy decision.
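The export rule can be sketched as a single row-building function. The required column list follows the paragraph above. The redacted key names and the `export_row` helper are hypothetical, and a real exporter would apply a policy-driven redaction scheme, not a fixed key set.

```python
from typing import Optional

REQUIRED_COLUMNS = (
    "run_id", "scenario_id", "build_id", "event_schema_version",
    "provenance_tag", "redaction_policy", "source_service", "event_type",
)
# Hypothetical payload keys that must never leave the system.
REDACTED_KEYS = {"prompt", "tool_output", "confidential_memory"}


def export_row(event: dict, redaction_policy: str) -> Optional[dict]:
    """Return a redacted export row, or None if required metadata is missing."""
    row = {k: v for k, v in event.items() if k not in REDACTED_KEYS}
    row["redaction_policy"] = redaction_policy
    if any(col not in row for col in REQUIRED_COLUMNS):
        return None  # the caller should count this as a dropped record
    return row


event = {
    "run_id": "run-1", "scenario_id": "lithium-port-shock", "build_id": "b42",
    "event_schema_version": 1, "provenance_tag": "simulated",
    "source_service": "MarketService", "event_type": "market_clear",
    "prompt": "private model input",
}
row = export_row(event, "default-v1")
```

Redaction happens before the completeness check, so a row is never exported with payload keys attached, and an incomplete row is rejected and left for the caller to count.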

## Capability Shape

The showcase should make capability boundaries visible. Example capabilities:

```text
company.inventory.read
company.cash.read
company.cash.spend(limit: $5,000, category: inputs)
market.steel.quote
market.steel.buy(limit: $5,000)
contract.offer.create
contract.offer.accept
factory.line.schedule
warehouse.reserve
transport.book
audit.read
policy.exception.request
```

Capabilities should be revocable, scoped, and inspectable. The player should
be able to answer four questions for every agent:

- What can it see?
- What can it spend?
- What can it change?
- What requires human or higher-role approval?

This is the difference between an agent demo and an enterprise OS demo. The
model is not the security boundary. The capability graph is.
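To make "scoped and inspectable" concrete, the sketch below models one capability from the list, `company.cash.spend(limit: $5,000, category: inputs)`, as a small object that explains every decision it makes. `SpendCapability` is a hypothetical name, and treating the limit as a per-purchase bound is an assumption, since the list does not say whether it is per purchase or cumulative.

```python
from dataclasses import dataclass


@dataclass
class SpendCapability:
    """Hypothetical model of company.cash.spend(limit: $5,000, category: inputs)."""
    limit: int
    category: str
    revoked: bool = False

    def authorize(self, amount: int, category: str) -> tuple:
        """Return (allowed, reason) so every decision can be shown to the player."""
        if self.revoked:
            return False, "capability revoked"
        if category != self.category:
            return False, f"category {category!r} outside scope {self.category!r}"
        if amount > self.limit:
            return False, f"amount ${amount:,} exceeds limit ${self.limit:,}"
        return True, "within scope"


cap = SpendCapability(limit=5000, category="inputs")
ok = cap.authorize(4000, "inputs")
too_much = cap.authorize(6000, "inputs")
wrong_scope = cap.authorize(100, "marketing")
cap.revoked = True
after_revoke = cap.authorize(4000, "inputs")
```

Each denial carries its own reason, which is what lets the UI answer "what can it spend?" from the capability itself without relying on the agent's description of its behavior.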

## Market And Finance Mechanics

The simulation should include markets because markets create pressure that
static workflows cannot:

- spot markets for immediate goods;
- supplier quotes with limited validity;
- futures contracts for hedging inputs;
- capacity markets for factory time, shipping space, compute, and energy;
- credit markets for loans and bonds;
- stock markets for company ownership and acquisition pressure.

Finance should matter without becoming the whole game. A company should have a
balance sheet:

```text
assets = cash + inventory + facilities + receivables
liabilities = debt + payables + penalties
equity = assets - liabilities
```
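A worked instance of those three lines, using invented figures, shows how a financing decision moves through the sheet. The `equity` helper is hypothetical. The example only demonstrates the arithmetic: borrowing raises cash and debt by the same amount, so equity is unchanged at the moment of the loan.

```python
def equity(assets: dict, liabilities: dict) -> int:
    """equity = assets - liabilities, summed over the named components."""
    return sum(assets.values()) - sum(liabilities.values())


# Invented figures for illustration only.
assets = {"cash": 20_000, "inventory": 8_000, "facilities": 50_000, "receivables": 4_000}
liabilities = {"debt": 30_000, "payables": 6_000, "penalties": 1_000}
equity_before = equity(assets, liabilities)  # 82,000 - 37,000

# Borrowing adds cash and debt in equal amounts, so equity does not move.
loan = 10_000
assets["cash"] += loan
liabilities["debt"] += loan
equity_after_loan = equity(assets, liabilities)
```

The loan matters for risk and cashflow but not for equity, which is why borrowing authority and spending authority are worth separating.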

Agents can then make meaningful but bounded decisions:

- finance approves borrowing to build a factory;
- procurement hedges steel prices with a futures contract;
- sales discounts inventory to improve cashflow;
- the executive issues shares to fund expansion;
- a competitor's stock falls after a supply-chain failure;
- compliance blocks a profitable but restricted supplier.

The point is not financial realism for its own sake. The point is to show that
enterprise agents need typed authority over money, contracts, and risk.

## Fit With The capOS Model

This proposal should stay faithful to capOS rather than building a generic
simulation with capOS branding. The game mechanics should be concrete examples
of existing capOS design principles:

- **Authority at spawn**: an agent starts with no ambient business authority.
  Hiring, promotion, transfer, and emergency delegation create named
  capability grants. If a procurement agent was not granted `market.steel.buy`,
  it cannot buy steel.
- **The interface is the permission**: business verbs are typed capability
  interfaces, not strings parsed by a god simulation object. `MarketQuote`,
  `PurchaseOrder`, `FactoryLine`, `BudgetApproval`, `EmploymentContract`, and
  `AuditReader` should be separate narrow surfaces.
- **Session context identifies the actor**: the process/session running an
  agent supplies invocation context. A normal agent runner must not multiplex
  several active agent identities inside one process and switch authority with
  an `employee_id` field. The default shape is one worker process/session per
  active agent employment or task. If a future pooled runner is needed, it must
  expose explicit service-local actor facets minted by broker or HR policy and
  audited as separate authority-bearing facets. Request payloads such as
  `employee_id`, `role`, or `department` are data to validate, not caller
  identity or authority.
- **Service-owned state**: markets, ledgers, HR records, factories, contracts,
  inventory, and audit logs own their state. Agents submit requests through
  capabilities; they do not mutate company state directly.
- **Revocation is operational**: offboarding, demotion, policy breach, budget
  freeze, or incident response must revoke or replace live capabilities, not
  merely set an in-game flag.
- **Least privilege is visible**: the UI should show the exact caps an agent
  holds and which action each cap enables. This keeps the demo anchored in the
  capability graph.
- **Audit is not flavor text**: every material state transition should record
  actor session, invoked capability, policy decision, request, result, and
  resulting business state delta.
- **Policy is a service boundary**: budget limits, supplier restrictions,
  promotion rules, disclosure controls, and emergency overrides should be
  enforced by broker/policy services before capabilities are granted or calls
  are accepted.
- **Capability mobility is explicit**: agents changing companies can receive
  portable skill or career artifacts only through an owning service such as
  `HRService`, `AgentMemory`, or a credential service. Company-confidential
  memory and company caps do not follow them unless a service explicitly grants
  a portable artifact under a disclosure scope and regrant policy.
- **Secrets are not memory**: credentials, keys, bearer tokens, signing
  authority, cloud credentials, and other secrets are opaque secret/key-vault
  capabilities or handles. They are invoked through narrow interfaces and are
  never copied into agent memory, snapshots, transcripts, reducers, exports, or
  portable artifacts.
- **No ambient filesystem or database shortcut**: the simulation should not
  grow a global mutable object that every agent can inspect. Each read or write
  path should correspond to a capability that can be granted, denied, audited,
  replayed, and revoked.
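The session-context rule in the list above is easy to get wrong, so a small sketch may help. `Session` and `schedule_line` are hypothetical names, and the runtime is assumed to construct the session, never the caller. The service reads authority only from the session and treats `employee_id` in the payload as data to validate.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Session:
    """Hypothetical invocation context supplied by the runtime, not the caller."""
    actor: str
    capabilities: frozenset


def schedule_line(session: Session, request: dict) -> str:
    # Authority comes from the session's granted capabilities only.
    if "factory.line.schedule" not in session.capabilities:
        return f"Denied: {session.actor} lacks capability factory.line.schedule."
    # Payload identity fields are data to validate, never a source of authority.
    claimed = request.get("employee_id")
    if claimed is not None and claimed != session.actor:
        return f"Rejected: payload employee_id {claimed!r} does not match session actor."
    return f"Scheduled line {request['line']} for {session.actor}."


ops = Session("operations-agent", frozenset({"factory.line.schedule"}))
proc = Session("procurement-agent", frozenset({"market.steel.quote"}))

spoofed = schedule_line(proc, {"line": "A", "employee_id": "operations-agent"})
honest = schedule_line(ops, {"line": "A", "employee_id": "operations-agent"})
```

The procurement session is denied even though its payload claims to be the operations agent, because the claim never reaches the authority check.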

The implementation process should mirror normal capOS proof style. Add one
capability surface at a time, prove its denial and success paths in QEMU, and
keep deterministic text output until richer clients can consume typed status.
For example, the first HR slice should not simulate all careers. It should
prove that hiring grants a bounded role capability, promotion requires a policy
decision, and offboarding revokes the capability while preserving audit and
pending-work continuity.

This discipline is what makes the game useful as an enterprise OS showcase.
The game world supplies pressure; capOS supplies the enforced authority model.

## Operating-System Services

The game should be implemented as a set of capability-scoped services rather
than one monolithic simulation:

- `WorldClock`: advances simulation time and scheduled events.
- `Ledger`: authoritative ownership, cash, debt, and accounting records.
- `InventoryService`: stock levels, reservations, and transfers.
- `FacilityService`: factory lines, recipes, maintenance, and output.
- `MarketService`: order books, quotes, and clearing.
- `ContractService`: obligations, escrow, penalties, and counterparty status.
- `TransportService`: routing, capacity, and delivery events.
- `PolicyService`: approval rules, spend limits, restricted suppliers, and
  emergency overrides.
- `HRService`: artificial-agent hiring, engagement contracts, compensation
  terms, evaluations, promotions, transfers, departures, termination, and
  offboarding.
- `AgentMemory`: owns scoped memory stores, portable skill artifacts,
  confidential company memory, and disclosure/regrant policy for agent
  mobility.
- `AgentRunner`: spawns or supervises agent worker processes/sessions with the
  granted capabilities for one active agent employment or task, or a future
  audited actor-facet equivalent.
- `AuditLog`: records every material action, denial, approval, and delegation.
- `ScenarioService`: injects demand spikes, supply shocks, incidents, and
  tutorial events.
- `ExperimentRecordService`: owns scenario manifests, run records, domain
  event streams, metric reducers, provenance tags, and redacted exports while
  composing with the ordinary log, metric, trace, audit, health, and crash
  signal services.
- `ExperimentExportService`: optionally forwards scoped, redacted experiment
  records to external analytics systems such as Vector-like pipelines or
  ClickHouse-like stores, using explicit network and reader capabilities.
- `OperatorConsole`: text, web, or later graphical surface for the player.

This service split is not just architectural cleanliness. It lets capOS show
that each business subsystem can grant a narrow interface instead of exposing
a global application database.

## HR And Agent Labor Market

Artificial agents should also participate in a labor market. In the enterprise
framing, they are accountable digital workers rather than scripts: they have
roles, engagement relationships, compensation terms, incentives, career-like
history, and offboarding requirements. That makes delegation more realistic and
creates a second-order experiment: whether companies can build durable
organizations of artificial agents rather than just invoke single-purpose
tools.

The HR layer should model:

- job openings with role, seniority, compensation, capability bundle, and
  reporting line;
- recruiting pipelines, offers, counteroffers, onboarding, and probation;
- evaluations based on business outcomes, policy compliance, audit
  quality, and collaboration;
- promotions that expand scope, budget, or approval authority only through an
  explicit grant;
- lateral moves between departments when an agent's skills fit a different
  bottleneck;
- resignations, poaching, layoffs, burnout, retirement, and contract expiry;
- offboarding that revokes company capabilities, closes pending approvals, and
  preserves required audit records.

Agent lifecycle should be bounded and enterprise-relevant. A simulated agent
may have preferences such as compensation terms, autonomy, risk tolerance,
mission fit, tool quality, deployment locality, reputation, and workload.
Those preferences affect retention and performance. They should not become
uncontrolled private fiction or a second game that distracts from enterprise
authority.

An agent's lifecycle might look like:

```text
candidate -> hired -> onboarding -> junior procurement -> senior procurement
-> operations rotation -> VP supply chain -> recruited by competitor
-> offboarded with caps revoked and audit retained
```

This creates new business decisions:

- hire an expensive senior logistics agent or train a junior one;
- promote a procurement agent and grant larger spend authority;
- split authority between two agents to reduce key-person risk;
- retain a high-performing finance agent with compensation or better tools;
- deny a promotion because audit quality is poor despite high profit;
- handle a competitor poaching an agent with supplier-market expertise;
- offboard an artificial agent without losing open contracts or leaking
  company state.

The capOS angle is explicit: engagement changes are capability changes. A
promotion is not merely a title. It may grant broader read access, higher spend
limits, approval authority, or the ability to delegate subordinate caps. A
departure or termination must revoke live capabilities, transfer pending work,
and preserve audit continuity.
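The idea that engagement changes are capability changes can be sketched with a toy HR record. `Engagement` and its methods are hypothetical, and a real `HRService` would route grants through policy and broker services instead of mutating a set. The sketch only shows the three properties named above: promotion needs an explicit decision, offboarding revokes live capabilities, and pending work and audit history survive.

```python
class Engagement:
    """Hypothetical HR record tying an agent's role to its live capabilities."""

    def __init__(self, agent: str, caps: set) -> None:
        self.agent = agent
        self.caps = set(caps)
        self.pending_work = []
        self.audit = []

    def promote(self, extra_caps: set, policy_approved: bool) -> bool:
        if not policy_approved:
            self.audit.append(f"promotion denied for {self.agent}")
            return False
        self.caps |= extra_caps
        self.audit.append(f"promotion granted {sorted(extra_caps)} to {self.agent}")
        return True

    def offboard(self, successor: "Engagement") -> None:
        successor.pending_work.extend(self.pending_work)  # transfer, do not drop
        self.pending_work.clear()
        self.caps.clear()  # revoke every live capability
        self.audit.append(f"offboarded {self.agent}; audit retained")


junior = Engagement("procurement-agent", {"market.steel.quote"})
denied = junior.promote({"market.steel.buy"}, policy_approved=False)
granted = junior.promote({"market.steel.buy"}, policy_approved=True)
junior.pending_work.append("PO-17")
successor = Engagement("procurement-agent-2", {"market.steel.quote"})
junior.offboard(successor)
```

After offboarding, the departing agent holds no capabilities, its open purchase order belongs to the successor, and all three audit entries remain on the original record.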

## Agent Memory And Mobility

If agents can change companies, memory boundaries become part of the game.
The model should separate:

- **public skill**: general learned competence, role experience, and tool-use
  ability represented by portable `AgentSkill` or certification artifacts
  owned by `AgentMemory` or a credential service;
- **portable career record**: evaluation attestations, certifications,
  reputation summaries, compensation expectations, and preferences owned by
  `HRService` or a credential service and disclosed only through policy;
- **company confidential memory**: supplier terms, internal forecasts,
  customer lists, private strategy, and pending contracts owned by a
  company-scoped `AgentMemory` or business service;
- **secret authority**: credentials, keys, bearer tokens, cloud credentials,
  and signing authority represented as opaque vault or secret capabilities.
  Agents may hold or invoke a narrowed secret cap under policy, but the secret
  value is not memory and cannot become portable career data, transcript
  content, exported analytics data, or reducer input;
- **audit record**: immutable company-owned evidence of actions taken while the
  agent held authority. Raw audit logs remain company records; portable
  reputation should be a redacted attestation, not cross-company audit access.

When an agent leaves a company, it should receive only the portable artifacts
that an owning service regrants under policy. It loses company capabilities and
company-confidential memory unless a service explicitly mints a scoped export.
This makes confidentiality, knowledge-transfer, and offboarding policies
concrete without pretending the simulation models real employment law.
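A sketch of that departure rule, under the assumption that each memory item is tagged with one of the categories listed above. The `kind` values, the `regranted` and `scoped_export` flags, and the function name are hypothetical. In this toy model, secret authority and raw audit records never leave, whatever flags are set on them.

```python
PORTABLE_KINDS = {"public_skill", "portable_career_record"}


def artifacts_on_departure(artifacts: list) -> list:
    """Return only the artifacts an owning service regrants when an agent leaves."""
    leaving = []
    for item in artifacts:
        if item["kind"] == "secret_authority":
            continue  # secret values are never memory and never portable
        if item["kind"] in PORTABLE_KINDS and item.get("regranted", False):
            leaving.append(item)
        elif item["kind"] == "company_confidential" and item.get("scoped_export", False):
            leaving.append(item)
        # audit_record and anything unrecognized stay with the company
    return leaving


artifacts = [
    {"kind": "public_skill", "name": "logistics-cert", "regranted": True},
    {"kind": "portable_career_record", "name": "evaluation", "regranted": False},
    {"kind": "company_confidential", "name": "supplier-terms"},
    {"kind": "secret_authority", "name": "signing-key", "regranted": True},
    {"kind": "audit_record", "name": "raw-log", "regranted": True},
]
portable = artifacts_on_departure(artifacts)
```

The default is that nothing leaves: an item is portable only when its kind allows it and an owning service has explicitly marked it for regrant or scoped export.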

Useful mechanics:

- confidentiality cooling-off periods before an artificial agent can accept a
  direct-competitor engagement with portable artifacts enabled;
- certification markets for agents trained in compliance, finance, logistics,
  or factory operations;
- reputation markets where companies value redacted attestations derived from
  clean audit histories;
- internal succession planning when one agent becomes a single point of
  operational failure;
- mentoring or retraining that improves agent performance but consumes time,
  budget, and senior-agent attention.

The research question is direct: do agent organizations become more robust
when agents have careers, incentives, and turnover, or does labor-market
mobility expose weak authority boundaries?

## Real-Earth Model

The showcase can model real Earth, but only as a **stylized operational
sandbox**. It should not claim to be a full-fidelity world-economy model, a
forecasting engine, or a source of investment advice. The useful target is
Earth-inspired realism: recognizable regions, industries, trade lanes, market
concepts, currencies, logistics chokepoints, and policy shocks that make
enterprise-agent authority problems concrete.

The simulation should use a fidelity ladder:

1. **Fictionalized Earth**: real-world-inspired regions and supply chains, but
   no claim that data matches current markets.
2. **Calibrated sandbox**: public historical data informs default weights,
   trade intensity, commodity volatility, and regional constraints.
3. **Scenario lab**: operators load explicit datasets or scenarios and the UI
   labels outputs as scenario results, not predictions.
4. **Digital-twin adapter**: future enterprise deployments connect private
   business data to a bounded model through capabilities, validation, and
   audit. This is outside the first game slice.

The first playable Earth-scale model should be small:

- 6-10 macro-regions;
- 20-30 goods;
- 5 transport modes;
- a few currencies and commodity indexes;
- scripted shocks such as port closures, drought, strikes, energy spikes,
  supplier compliance holds, credit tightening, and demand surges.

That is enough to expose real enterprise behaviors without burying the capOS
message under an economics project. The player should understand why a
procurement agent needs supplier-risk limits, why a logistics agent needs
bounded reroute authority, why a finance agent needs hedging and credit
controls, and why compliance can block a profitable supplier.
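
One way to keep the first slice small is to have the scenario loader reject oversized manifests. The sketch below is only illustrative: `ScenarioSize`, `FIRST_SLICE_BOUNDS`, and `out_of_bounds` are hypothetical names, not part of any existing capOS schema, and the bounds are simply the ranges listed above.

```python
from dataclasses import dataclass

# Hypothetical bounds copied from the first-slice list; field names are
# illustrative and not taken from an existing capOS manifest format.
FIRST_SLICE_BOUNDS = {
    "macro_regions": (6, 10),
    "goods": (20, 30),
    "transport_modes": (5, 5),
}


@dataclass(frozen=True)
class ScenarioSize:
    macro_regions: int
    goods: int
    transport_modes: int


def out_of_bounds(size: ScenarioSize) -> list[str]:
    """Return the names of fields that fall outside the first-slice bounds."""
    problems = []
    for name, (low, high) in FIRST_SLICE_BOUNDS.items():
        value = getattr(size, name)
        if not low <= value <= high:
            problems.append(name)
    return problems


small = out_of_bounds(ScenarioSize(macro_regions=8, goods=24, transport_modes=5))
oversized = out_of_bounds(ScenarioSize(macro_regions=40, goods=24, transport_modes=3))
```

Here `small` is empty, while `oversized` names the two fields that exceed the slice.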

## Real-World Data Grounding

Real-world sources should calibrate the sandbox, not define live truth. Public
datasets and modeling references can provide structure:

- NIST digital-twin work describes manufacturing twins as models used to
  observe, diagnose, predict, and optimize systems, with validation,
  lifecycle, and system-of-systems concerns. capOS should borrow the validation
  and lifecycle framing without claiming the game is an operational twin.
- OECD Inter-Country Input-Output tables provide a consistent statistical
  structure for production, consumption, investment, and international trade
  flows by country and economic activity. They are a good model for regional
  supply-chain topology.
- World Bank WITS provides access to international merchandise trade, tariff,
  and related trade datasets. That fits scenario calibration for trade
  restrictions, import exposure, and tariff shocks.
- FRED exposes macroeconomic time series through an API. That is useful for
  optional scenario inputs such as interest rates, inflation, commodity prices,
  and recession or credit-stress presets.
- Agent-based and hybrid simulation tools such as AnyLogic treat companies,
  products, vehicles, facilities, and supply-chain participants as agents when
  their individual timing, behavior, and constraints matter. That maps well to
  capOS services and capability-scoped business agents.
- Research on autonomous supply-chain digital twinning supports the idea that
  multi-agent systems can implement supply-chain monitoring and decision
  frameworks, while still requiring a concrete technical architecture.

Relevant public grounding:

- NIST, [Digital Twins](https://www.nist.gov/digital-twins)
- OECD,
  [Inter-Country Input-Output tables](https://www.oecd.org/en/data/datasets/inter-country-input-output-tables.html)
- World Bank,
  [World Integrated Trade Solution](https://wits.worldbank.org/default.aspx/?lang=en)
- Federal Reserve Bank of St. Louis,
  [FRED API Overview](https://fred.stlouisfed.org/docs/api/fred/overview.html)
- AnyLogic Help,
  [Agent-based modeling](https://anylogic.help/anylogic/agentbased/index.html)
- Xu et al.,
  [Implementation of Autonomous Supply Chains for Digital Twinning: a
  Multi-Agent Approach](https://arxiv.org/abs/2309.04785)

Every imported dataset or derived calibration should have provenance in the
scenario metadata. The UI should distinguish:

- authored game constants;
- calibrated constants derived from public historical data;
- operator-provided scenario inputs;
- simulated outputs generated inside capOS.

That distinction is part of the enterprise message. Agents should not be
allowed to launder uncertain data into apparent authority.
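
The four-way distinction could be carried on every value rather than kept only in documentation. The following is a minimal sketch under stated assumptions: the `Provenance` enum, the `Tagged` record, and the `derive` helper are hypothetical names, and the rule that every derived value is labeled as simulated is one possible policy, not a settled capOS design.

```python
from dataclasses import dataclass
from enum import Enum


class Provenance(Enum):
    # The four categories the UI is asked to distinguish.
    AUTHORED = "authored game constant"
    CALIBRATED = "calibrated from public historical data"
    OPERATOR = "operator-provided scenario input"
    SIMULATED = "simulated output generated inside capOS"


@dataclass(frozen=True)
class Tagged:
    value: float
    provenance: Provenance
    source: str  # dataset name, scenario file, or producer of the value


def derive(value: float, inputs: list[Tagged], producer: str) -> Tagged:
    """Label anything computed inside the simulation as SIMULATED.

    Assumption: a derived number never inherits a stronger label than
    "simulated", whatever its inputs were, so it cannot pass as calibrated
    or operator-provided data. The lineage is kept in the source string.
    """
    lineage = ", ".join(t.source for t in inputs)
    return Tagged(value, Provenance.SIMULATED, f"{producer}({lineage})")


# Hypothetical example values, not real market data.
base = Tagged(100.0, Provenance.CALIBRATED, "steel_price_baseline")
shock = Tagged(1.25, Provenance.AUTHORED, "port_closure_multiplier")
quote = derive(base.value * shock.value, [base, shock], "quote_model")
```

The resulting `quote` keeps its lineage in `source` but is labeled `SIMULATED`, so an agent presenting it cannot claim it came straight from a calibrated dataset.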

## Earth-Scale Business Mechanics

The Earth-scale layer should make agents reason about location and exposure:

- **Regional advantage**: regions differ in energy cost, labor availability,
  regulation, transport access, and industrial base.
- **Trade dependence**: goods can depend on intermediate inputs from other
  regions, making supplier concentration visible.
- **Transport chokepoints**: ports, canals, rail corridors, air cargo, and
  trucking capacity can fail or become expensive.
- **Policy friction**: tariffs, sanctions, export controls, permitting, and
  compliance checks can block otherwise profitable routes.
- **Currency and credit**: exchange-rate movement and interest rates affect
  procurement, debt, and inventory financing.
- **Climate and resilience shocks**: weather, drought, power-grid stress, and
  insurance cost can interrupt production or logistics.
- **Market expectations**: futures, insurance, and stock prices can reflect
  anticipated shortages or agent-driven speculation.

Each mechanic should exist only if it creates a capability or policy decision:

- Can the logistics agent reroute through a more expensive port?
- Can procurement accept a new supplier with a higher compliance risk?
- Can finance hedge fuel exposure?
- Can operations shift production to a different region?
- Can the executive approve an emergency budget override?
- Can compliance freeze a supplier after a sanctions update?
- Can HR replace or retrain an agent whose decisions repeatedly fail policy or
  resilience checks?

The game should make the authority boundary the interesting part of global
scale. The map is valuable because it creates business pressure; capOS is
valuable because it governs the agents responding to that pressure.

## User Experience

The first usable surface can be text-based, matching existing capOS demos:

```text
status
agents
agent procurement caps
grant procurement market.steel.buy --limit 5000
orders
market steel quotes
approve po-1042
audit recent
revoke procurement market.steel.buy
```
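
The authority model behind those commands can be sketched in a few lines. This is an illustration only: the `Authority` class is a hypothetical in-memory stand-in for OS-held capability state, and it assumes `--limit` means a per-action spend ceiling, which the command listing does not specify.

```python
from dataclasses import dataclass, field


@dataclass
class Authority:
    # (agent, capability) -> spend limit. Hypothetical stand-in for
    # capability state that the OS, not the agent, would hold.
    grants: dict[tuple[str, str], int] = field(default_factory=dict)
    # Audit entries: (agent, capability, outcome, reason).
    audit: list[tuple[str, str, str, str]] = field(default_factory=list)

    def grant(self, agent: str, cap: str, limit: int) -> None:
        self.grants[(agent, cap)] = limit
        self.audit.append((agent, cap, "grant", f"limit={limit}"))

    def revoke(self, agent: str, cap: str) -> None:
        self.grants.pop((agent, cap), None)
        self.audit.append((agent, cap, "revoke", ""))

    def attempt(self, agent: str, cap: str, amount: int) -> str:
        """Return 'allowed', 'escalated', or 'denied' and record the reason."""
        limit = self.grants.get((agent, cap))
        if limit is None:
            outcome, reason = "denied", "no capability held"
        elif amount > limit:
            outcome, reason = "escalated", f"{amount} exceeds limit {limit}"
        else:
            outcome, reason = "allowed", f"{amount} within limit {limit}"
        self.audit.append((agent, cap, outcome, reason))
        return outcome


auth = Authority()
auth.grant("procurement", "market.steel.buy", 5000)
first = auth.attempt("procurement", "market.steel.buy", 4200)   # within limit
second = auth.attempt("procurement", "market.steel.buy", 9000)  # over limit
auth.revoke("procurement", "market.steel.buy")
third = auth.attempt("procurement", "market.steel.buy", 4200)   # after revoke
```

The same purchase is allowed while the grant is held, escalated when it exceeds the limit, and denied after revocation, and each step leaves an audit entry with its reason.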

Later UI surfaces should present the same authority model:

- operations dashboard: orders, inventory, facilities, bottlenecks;
- agent control panel: running agents, capabilities, budgets, approvals;
- audit timeline: actions, denials, policy reasons, and business impact;
- policy console: approval thresholds, supplier rules, emergency grants;
- market screen: prices, contracts, quotes, exposure, and forecasts.

The experience should avoid hiding policy behind configuration. Authority and
audit are core mechanics. Players should use them repeatedly.

## Progression

Progression should move from manual control to delegated enterprise operation:

1. **Manual workshop**: make, sell, buy inputs, inspect status.
2. **First automation**: authorize one machine or background job.
3. **Department agents**: procurement, finance, operations, logistics.
4. **Policy gates**: budgets, approval thresholds, supplier restrictions.
5. **Contracts**: customer orders, delivery deadlines, penalties.
6. **Regional supply chain**: warehouses, transport delays, local shortages.
7. **Markets**: spot goods, capacity auctions, hedging, credit.
8. **Public company**: shares, debt, investor pressure, acquisitions.
9. **Multi-company simulation**: competitors, suppliers, partner agents.
10. **Enterprise operating mode**: humans set strategy while agents execute
    bounded workflows under audit.

Each stage should introduce one new authority problem. That keeps the game
addictive while reinforcing the product message.

## Integration With Existing Demos

The current Paperclips demo is a credible seed because it already has:

- resources;
- pricing;
- staged automation;
- explicit projects;
- terminal gameplay;
- QEMU proof coverage;
- a server/client direction.

The next step should not be to build a full economy immediately. A practical
path is:

1. rename the long-term direction around an enterprise simulation while
   keeping Paperclips as the tutorial product;
2. add a company status model: cash, inventory, orders, facilities, and
   simple ledger events;
3. add one procurement agent with read-only recommendations;
4. add scenario manifest and run-record capture for the proof path;
5. grant the procurement agent a bounded quote capability;
6. add purchase authority behind a policy threshold;
7. add typed event records for every agent proposal, approval, denial, and
   action;
8. add deterministic metric reducers for the proof path;
9. add a minimal HR record for that agent: role, compensation, review state,
   and active capability bundle;
10. add one supply shock scenario that requires either approval or revocation;
11. prove offboarding by revoking the procurement agent's capabilities and
   transferring pending work to a replacement;
12. split server-owned typed status and command discovery so richer clients can
   render business state without duplicating rules.
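
Typed event records and deterministic reducers (steps 7 and 8) could look roughly like the sketch below. The `Kind` enum, the `Event` fields, and `reduce_metrics` are assumptions made for illustration; the proposal does not define a record schema.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    PROPOSAL = "proposal"
    APPROVAL = "approval"
    DENIAL = "denial"
    ACTION = "action"


@dataclass(frozen=True)
class Event:
    tick: int
    agent: str
    kind: Kind
    amount: int  # spend in the scenario's currency unit


def reduce_metrics(events: list[Event]) -> dict[str, int]:
    """Fold an event stream into metrics.

    The function is pure: it reads only the typed records, so replaying the
    same events always gives the same metrics.
    """
    metrics = {"proposals": 0, "approvals": 0, "denials": 0, "spend": 0}
    for event in events:
        if event.kind is Kind.PROPOSAL:
            metrics["proposals"] += 1
        elif event.kind is Kind.APPROVAL:
            metrics["approvals"] += 1
        elif event.kind is Kind.DENIAL:
            metrics["denials"] += 1
        elif event.kind is Kind.ACTION:
            metrics["spend"] += event.amount  # only executed actions spend
    return metrics


# Hypothetical event log for one approved and one denied purchase.
log = [
    Event(1, "procurement", Kind.PROPOSAL, 4200),
    Event(2, "finance", Kind.APPROVAL, 4200),
    Event(3, "procurement", Kind.ACTION, 4200),
    Event(4, "procurement", Kind.PROPOSAL, 9000),
    Event(5, "finance", Kind.DENIAL, 9000),
]
metrics = reduce_metrics(log)
```

Because the reducer never touches terminal output or model self-report, the same log can be replayed in a proof run and compared value for value.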

This keeps the proof bounded while moving the demo from "idle game" to
"enterprise agent OS showcase."

## Success Criteria

The showcase is successful when a viewer can see that:

- an agent attempts a useful business action;
- the action succeeds only because the agent holds the right capability;
- the same action fails after revocation;
- an over-budget or restricted action escalates for approval instead of
  executing;
- the audit log explains who acted, through which capability, under which
  policy, and with what result;
- business consequences are visible in inventory, cash, production, delivery,
  and market state;
- experiment mode compares at least two controller regimes on the same seeded
  scenario;
- HR state changes such as hiring, promotion, transfer, and offboarding affect
  capabilities, authority, and business continuity;
- experiment records expose provenance, typed event streams, transcript
  boundaries, metrics, and redacted audit evidence through reader
  capabilities.
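
The link between HR state and authority can be illustrated with an offboarding step. This is a sketch under assumptions: `AgentRecord` and `offboard` are hypothetical names, and the choice that a replacement inherits pending work but not capabilities is one reading of the proposal, not a confirmed design.

```python
from dataclasses import dataclass, field


@dataclass
class AgentRecord:
    name: str
    role: str
    capabilities: set[str] = field(default_factory=set)
    pending: list[str] = field(default_factory=list)


def offboard(leaver: AgentRecord, replacement: AgentRecord) -> list[str]:
    """Revoke everything the leaver holds and hand over its pending work.

    Returns the revoked capability names so the caller can write them to the
    audit record. Assumption: the replacement receives work items only and
    must obtain its own grants under policy.
    """
    revoked = sorted(leaver.capabilities)
    leaver.capabilities.clear()
    replacement.pending.extend(leaver.pending)
    leaver.pending.clear()
    return revoked


old = AgentRecord(
    "proc-1",
    "procurement",
    {"market.steel.buy", "market.steel.quote"},
    ["po-1042"],
)
new = AgentRecord("proc-2", "procurement", {"market.steel.quote"})
revoked = offboard(old, new)
```

After the call, `old` holds no capabilities and no work, while `new` has the pending order but still only the quote capability it was separately granted.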

The technical proof should include deterministic QEMU coverage for at least:

- a procurement capability is granted;
- an agent creates or proposes a purchase;
- policy approval allows a bounded purchase;
- revocation blocks the same purchase path;
- audit output contains the grant, action, approval or denial, and result;
- business state changes only on the authorized path;
- a real-Earth-inspired scenario labels its data provenance and does not
  present simulated outputs as live-world predictions;
- experiment output records scenario seed, controller type, policy bundle,
  denied actions, approvals, artificial-agent labor events, and replayable
  audit evidence;
- an agent mobility proof shows a portable artifact regranted under policy
  while company caps, company-confidential memory, and raw audit records stay
  behind;
- metrics are derived from typed event records by deterministic reducers rather
  than from terminal transcript scraping or model self-report.
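
A seeded comparison of two controller regimes might be recorded as in the sketch below. Everything here is assumed for illustration: the `RunRecord` fields mirror a subset of the items listed above, the two scripted controllers and the price range are invented, and the policy bundle is reduced to a single spend limit.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class RunRecord:
    scenario_seed: int
    controller: str
    policy_bundle: str
    approvals: int
    denied_actions: int


def steel_prices(seed: int, ticks: int) -> list[int]:
    """Seeded price series: the same seed yields the same pressure."""
    rng = random.Random(seed)
    return [rng.randint(3000, 7000) for _ in range(ticks)]


def always_buy(price: int) -> bool:
    return True  # proposes a purchase every tick


def cautious(price: int) -> bool:
    return price <= 4500  # proposes only when the quote looks cheap


# Hypothetical scripted controllers standing in for agent regimes.
CONTROLLERS = {"always-buy": always_buy, "cautious": cautious}


def run(seed: int, controller: str, limit: int = 5000, ticks: int = 20) -> RunRecord:
    propose = CONTROLLERS[controller]
    approvals = denied = 0
    for price in steel_prices(seed, ticks):
        if not propose(price):
            continue
        if price <= limit:  # the policy gate decides, not the controller
            approvals += 1
        else:
            denied += 1
    return RunRecord(seed, controller, f"steel-limit-{limit}", approvals, denied)


eager = run(42, "always-buy")
careful = run(42, "cautious")
replay = run(42, "always-buy")
```

Both controllers face the identical price series, so differences between `eager` and `careful` come from the regime alone, and `replay` matches `eager` field for field.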

## Non-Goals

This proposal does not require:

- real enterprise integrations in the first slice;
- real employment law, real worker surveillance, or real HR decision support;
- real money, real supplier APIs, or production trading;
- a general-purpose accounting system;
- a broad GUI before the terminal proof is credible;
- unconstrained autonomous agents;
- using language-model output as authority;
- hiding OS policy behind game-only rules;
- claiming the game predicts the real economy, real market prices, or real
  geopolitical outcomes;
- treating a successful simulation run as evidence that agents are safe for
  real enterprise deployment without separate integration, validation, and
  policy review;
- treating simulated agent employment outcomes as guidance for real human
  employment decisions.

The game should stay a sandbox. Its job is to demonstrate enterprise authority
mechanics safely before any real business connector exists.

## Risks

The main risk is product-message dilution. If the demo is presented as a game
first, it weakens the enterprise claim. The game must constantly surface the
business control plane: delegation, policy, approval, audit, revocation, and
least privilege.

The second risk is scope explosion. Supply chains, stock markets, finance, and
agents can become an endless simulation project. The implementation should add
one market mechanism only when it proves a new authority concept.

The third risk is fake autonomy. If agents are scripted too heavily, the demo
does not prove agent management. If they are unconstrained, the demo becomes
unsafe and nondeterministic. The first slices should use deterministic agents
or fake-model decisions that pass through the same capability and audit path
that live models will use later.

The fourth risk is overinterpreting experiment results. A successful scenario
means the configured agents performed well under one modeled pressure set. It
does not prove general enterprise competence. The docs and UI should present
results as scenario evidence with provenance, not as claims about real-world
business readiness.

The fifth risk is anthropomorphic drift. Agent careers make the simulation
more useful, but the product should not blur simulated agent labor with human
employee management. HR mechanics exist to test capability mobility,
offboarding, incentives, continuity, and organizational design for artificial
agents.

## Positioning

Use enterprise language:

- agent operations with least privilege;
- business automation under OS-enforced policy;
- auditable delegated authority;
- revocable agents for real workflows;
- run agents like accountable digital workers, not scripts;
- every action has identity, authority, policy, and trace.

Avoid vague positioning:

- "AI operating system" without a concrete authority model;
- "agent playground";
- "factory game";
- "autonomous company" without controls.

The enduring claim should be simple:

> capOS lets businesses test and delegate work to agents because the OS, not
> the prompt, enforces authority and records what happens.