Proposal: Enterprise Agent Game Showcase

capOS should showcase itself as an agent-managed operating system for enterprises and businesses through a playable business simulation. The demo should look like a factory, supply-chain, and market game, but its purpose is not to make capOS a game OS. Its purpose is to make enterprise agent authority concrete: every agent action should have an identity, an explicit capability, a policy reason, an audit record, and a business consequence.

The product thesis is:

Enterprise agents should not be trusted because they are smart. They should be useful because the operating system constrains what they can see, spend, modify, approve, and execute.

The game is the explanation surface for that thesis. A player starts with a small manual business, delegates work to agents, grants and revokes authority, reviews logs, handles disruptions, and scales into a multi-product enterprise. The mechanics should demonstrate why OS-enforced authority is stronger than application-local prompt discipline.

The same artifact should also be an experiment. The research question is not “can agents run the world?” The bounded question is: when agents are given limited authority inside a realistic business simulation, what can they manage, where do they fail, and which OS controls prevent failures from becoming damage? capOS is the right place to ask that question because it can constrain agents, record their actions, revoke authority, replay scenarios, and compare policies under identical operating pressure.

Why A Game

Enterprise agent safety is hard to understand from a static dashboard. A game turns abstract controls into visible operational pressure:

a procurement agent cannot buy steel unless it holds a bounded purchasing capability;
a finance agent can approve spend within policy, but cannot reschedule production;
an operations agent can schedule a factory line, but cannot issue debt;
a compliance agent can inspect and flag audit events, but cannot execute trades;
revoking an agent capability immediately changes what the agent can do;
policy denials are visible as missed orders, delayed production, or avoided risk.

The player learns the enterprise model by feeling the delegation tradeoff: more agent autonomy increases speed and scale, but authority limits, approval rules, budgets, and audit trails keep the business survivable.

The demo should be serious in framing even when the mechanics are approachable. The headline is not “capOS has a factory game.” The headline is “capOS runs business agents under OS-enforced authority.”

This proposal is a sibling of Aurelian Frontier, which uses the same “capability is the game mechanic” thesis for a player-facing roguelike MUD about delegated authority among humans and NPCs. Both proposals share the underlying claim that authority, revocation, and audit can be felt by a player rather than only read in a checklist; they differ in audience and surface. Aurelian Frontier targets contributors, narrative players, and authority intuition. The enterprise agent game targets enterprise buyers, agent-safety researchers, and capability-shape evaluation under repeatable business pressure. Where the two proposals overlap on shared mechanics (authority-as-inventory, revocation, audit-as-evidence), the implementation work should reuse capOS services rather than fork parallel game-only machinery.

Showcase Story

The first showcase should be a small manufacturing company that grows from a manual workshop into an agent-managed enterprise:

The player manually makes and sells a simple product.
A customer order creates demand beyond manual throughput.
The player hires or enables a procurement agent.
The procurement agent requests supplier quotes but cannot spend yet.
The player grants a bounded purchasing capability.
The finance agent approves a purchase within budget.
The operations agent schedules production.
The logistics agent books delivery.
A supply disruption or demand spike creates a bottleneck.
Agents propose actions, escalate where policy requires approval, and leave an audit trail.

The core demo moment should be revocation. A player should be able to issue a command, or take an equivalent UI action, such as:

revoke procurement-agent market.purchase

The next attempted purchase should fail with an explanation shaped like:

Denied: procurement-agent lacks capability market.purchase.
Policy: purchases over $5,000 require finance approval.

That is the capOS proof: the agent did not merely “decide” to obey policy. The OS denied the authority path.
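The revocation moment can be sketched in a few lines. This is a minimal illustration, not capOS code: CapabilityTable and purchase are hypothetical names, and a Python dictionary stands in for what would be an OS-enforced capability table. The point it shows is that the denial comes from the enforcing service, not from the agent's own judgment.

```python
from dataclasses import dataclass, field


@dataclass
class CapabilityTable:
    """Hypothetical per-agent grant table held by the enforcing service."""
    grants: dict[str, set[str]] = field(default_factory=dict)

    def grant(self, agent: str, cap: str) -> None:
        self.grants.setdefault(agent, set()).add(cap)

    def revoke(self, agent: str, cap: str) -> None:
        self.grants.get(agent, set()).discard(cap)

    def holds(self, agent: str, cap: str) -> bool:
        return cap in self.grants.get(agent, set())


def purchase(table: CapabilityTable, agent: str, amount: int) -> str:
    """The service, not the agent, decides whether the call proceeds."""
    if not table.holds(agent, "market.purchase"):
        return f"Denied: {agent} lacks capability market.purchase."
    return f"Accepted: {agent} purchased inputs for ${amount:,}."


table = CapabilityTable()
table.grant("procurement-agent", "market.purchase")
before = purchase(table, "procurement-agent", 1200)   # accepted
table.revoke("procurement-agent", "market.purchase")
after = purchase(table, "procurement-agent", 1200)    # denied after revocation
print(before)
print(after)
```

The same call with the same arguments succeeds before the revocation and fails after it, which is the behavior the demo needs to make visible.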

World Model

The simulation world should be built from simple business primitives:

Good: wire, steel, packaging, batteries, electronics, robots, fuel, software licenses, compute credits, finished products.
Facility: workshop, factory, warehouse, mine, refinery, power plant, data center, retail channel.
Recipe: input goods, output goods, time, energy, labor, machine wear, waste, and failure probability.
Inventory: stock on hand, reserved stock, damaged stock, in-transit stock.
Transport: trucks, rail, shipping lanes, drones, pipelines, network bandwidth, and delivery delays.
Company: cash, inventory, facilities, contracts, debt, shares, employees, and agents.
Market: spot order book, supplier quotes, futures contracts, capacity auctions, labor market, recruiting market, and stock exchange.
Contract: delivery obligation, deadline, price, penalties, escrow, and counterparty identity.
Policy: budget rules, approval thresholds, supplier restrictions, risk limits, compliance rules, and emergency overrides.
Agent: a bounded actor with a role, model/backend, memory scope, budget, capabilities, audit identity, employment state, and career history.

Paperclips can remain the tutorial product because they are familiar and have a clear compounding curve. The broader world should add products and supply chains that make enterprise delegation meaningful:

ore -> steel -> wire -> paperclips
oil -> plastic -> packaging
energy -> factory runtime
silicon -> chips -> robots -> automated factories
lithium -> batteries -> electric trucks -> cheaper logistics
data center capacity -> forecasting -> better procurement decisions

The first implementation should not try to simulate every industry. It should start with a small number of goods and constraints that force real decisions: inventory, price, delivery time, factory capacity, and budget.
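A small sketch of the Recipe primitive applied to the tutorial chain may help fix the shape. The Recipe class, the run_recipe helper, and every quantity below are assumptions made for illustration; the proposal does not specify ratios, and time, energy, wear, and failure probability are left out.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Recipe:
    """One production step; quantities are illustrative, not calibrated."""
    inputs: dict[str, int]
    outputs: dict[str, int]


# Hypothetical ratios for the chain ore -> steel -> wire -> paperclips.
CHAIN = [
    Recipe({"ore": 2}, {"steel": 1}),
    Recipe({"steel": 1}, {"wire": 10}),
    Recipe({"wire": 1}, {"paperclips": 100}),
]


def run_recipe(stock: dict[str, int], recipe: Recipe) -> bool:
    """Apply a recipe once if every input is in stock; return whether it ran."""
    if any(stock.get(good, 0) < qty for good, qty in recipe.inputs.items()):
        return False
    for good, qty in recipe.inputs.items():
        stock[good] -= qty
    for good, qty in recipe.outputs.items():
        stock[good] = stock.get(good, 0) + qty
    return True


stock = {"ore": 2}
ran = [run_recipe(stock, r) for r in CHAIN]
print(ran, stock)
```

Running the chain once from two units of ore leaves wire in stock and no ore, so the next production run is blocked until procurement acts. That is the kind of small constraint that forces a delegation decision.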

Agent Roles

Agents should be business roles, not generic chat personalities. Each role should operate through typed capabilities:

Agent	Typical capabilities	Explicit non-authority
Procurement	read inventory, request quotes, buy approved inputs	cannot approve new suppliers without policy
Finance	read cashflow, approve spend, freeze budgets	cannot schedule production
Operations	schedule lines, reserve inventory, request maintenance	cannot borrow money
Logistics	book transport, reroute shipments, reserve warehouse space	cannot change product prices
Sales	accept orders, set prices within bounds, offer discounts	cannot waive compliance holds
Compliance	read audit logs, flag violations, require approval	cannot execute purchases
Executive	set strategy, delegate caps, approve exceptions	cannot bypass immutable audit
Incident	inspect disruptions, recommend response, trigger runbooks	cannot exceed emergency grants

The important design rule is that agents act through capabilities and policy checks. A procurement agent does not mutate inventory or cash directly. It submits a quote request, a purchase order, or a contract offer to a service that enforces authority.
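The design rule above can be sketched as a service that owns the state change. Everything here is hypothetical: the capability names are loosely derived from the role table, PurchaseService is not an existing capOS interface, and the role argument merely stands in for caller identity that a real system would take from invocation context rather than from a request field.

```python
from dataclasses import dataclass

# Hypothetical capability names loosely derived from the role table.
ROLE_CAPS = {
    "procurement": {"inventory.read", "quote.request", "inputs.buy"},
    "finance": {"cashflow.read", "spend.approve", "budget.freeze"},
    "operations": {"line.schedule", "inventory.reserve", "maintenance.request"},
}


@dataclass
class Company:
    cash: int
    steel: int = 0


class PurchaseService:
    """Owns the state change; agents only submit requests."""

    def __init__(self, company: Company) -> None:
        self._company = company

    def submit_order(self, role: str, units: int, unit_price: int) -> str:
        # `role` stands in for session-derived identity in this sketch.
        if "inputs.buy" not in ROLE_CAPS.get(role, set()):
            return f"denied: {role} holds no inputs.buy capability"
        cost = units * unit_price
        if cost > self._company.cash:
            return "denied: insufficient cash"
        self._company.cash -= cost
        self._company.steel += units
        return "accepted"


company = Company(cash=1_000)
service = PurchaseService(company)
finance_result = service.submit_order("finance", units=5, unit_price=100)
procurement_result = service.submit_order("procurement", units=5, unit_price=100)
print(finance_result, procurement_result, company)
```

Neither agent touches cash or inventory directly; only the service mutates them, and only after the authority check passes.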

Experiment Mode

The showcase should have an experiment mode alongside the player-facing game. In this mode, the same scenario can run under different control regimes:

human-only operation;
scripted deterministic agents;
LLM-backed agents with the same capability limits, recorded prompts, and captured tool-call transcripts;
mixed human approval with agent execution;
different policy bundles for spend, supplier risk, credit, logistics, and emergency response;
different compensation, promotion, retention, and recruiting policies.

The goal is to observe behavior under repeatable pressure, not to crown an agent as generally competent. Each run should preserve scenario seed, policy configuration, model/backend identity, granted capabilities, denied actions, human approvals, market events, and final business outcomes.

Replay should distinguish deterministic proof from experiment reconstruction. Scripted or fake-model agents can be replayed deterministically in QEMU. Live LLM-backed runs are not deterministic merely because the scenario seed and model name are recorded; they require prompt, model configuration, tool-call transcript, tool results, and policy decisions to reconstruct what happened. The audit record can replay the authorized state transitions even when it cannot reproduce the model’s private sampling path.
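The distinction between replay classes can be sketched as follows. The ReplayClass enum, the classify function, and the scripted agent's buy-below-100 rule are all assumptions for illustration. The sketch shows only that a seeded scripted agent yields an identical event list on replay, while an LLM-backed run has to be labeled by what evidence was actually captured.

```python
import random
from enum import Enum


class ReplayClass(Enum):
    DETERMINISTIC = "deterministic"
    TRANSCRIPT_RECONSTRUCTABLE = "transcript-reconstructable"
    AUDIT_ONLY = "auditable state transitions only"


def classify(scripted: bool, transcript_captured: bool) -> ReplayClass:
    """Label what kind of replay a run can honestly support."""
    if scripted:
        return ReplayClass.DETERMINISTIC
    if transcript_captured:
        return ReplayClass.TRANSCRIPT_RECONSTRUCTABLE
    return ReplayClass.AUDIT_ONLY


def run_scripted(seed: int, days: int = 5) -> list[tuple[int, str, int]]:
    """Scripted agent: buy when the seeded price is below a fixed threshold."""
    rng = random.Random(seed)  # private generator so replays are isolated
    events = []
    for day in range(days):
        price = rng.randint(80, 120)
        action = "buy" if price < 100 else "wait"
        events.append((day, action, price))
    return events


first = run_scripted(seed=42)
replay = run_scripted(seed=42)
print(first == replay, classify(scripted=False, transcript_captured=True))
```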

Useful research questions include:

Can agents coordinate across procurement, finance, operations, logistics, and compliance without a central omniscient controller?
Do procurement agents over-optimize input price while ignoring resilience, supplier concentration, or delivery risk?
Do finance agents become too conservative, too leveraged, or too willing to hedge with instruments they do not understand?
Do logistics agents find useful reroutes under disruption, or do they churn capacity and increase cost?
Do market-facing agents create bubbles, shortages, or arbitrage loops when multiple companies operate in the same scenario?
Which policy controls reduce catastrophic behavior without making agents slower than manual operation?
How often does useful autonomy require human approval, and where should approval thresholds move?
Does a readable audit trail let a human correct agent behavior faster after a bad decision?
Which capability boundaries are too broad, too narrow, or hard to explain?
Do agents improve with role tenure, or do they stagnate without promotion, rotation, retraining, or better tooling?
Can companies retain high-performing agents without granting excessive authority or compensation?
What happens when an agent leaves a company with private memories, ongoing tasks, or delegated authority?

The output should be an experiment record, not just a final score:

scenario: lithium-port-shock
controller: llm-procurement + scripted-finance + human-approval
policy: procurement-v2-tight-supplier-risk
profit: $42,300
orders_late: 3
denied_actions: 8
human_approvals: 5
policy_violations: 0
agent_turnover: 1
recovery_time: 4 days
audit_replay: available

This turns the game into a controlled lab for enterprise agent management. The claim stays conservative: capOS is not asserting that agents can safely manage businesses by default. capOS provides the operating environment for finding out, because agent behavior is constrained, observable, replayable, and comparable.

Metrics

Experiment mode should report business, safety, and operating-system metrics:

profit, cashflow, debt, inventory turns, and margin;
order fill rate, late orders, cancellation penalties, and recovery time;
resilience under shocks, including supplier concentration and fallback capacity;
policy denials, escalations, approvals, emergency overrides, and revocations;
hiring latency, agent turnover, promotion rate, compensation cost, and vacancy impact;
audit completeness: whether every material state transition has identity, capability, policy, and result;
agent cost: model calls, runtime, memory, tool invocations, and human review time;
reproducibility: scenario seed, input dataset provenance, policy version, and model/backend version.

The most important metric is not raw profit. A profitable run that bypasses policy or cannot be explained is a failed capOS demonstration. A slightly less profitable run with clear authority, bounded losses, and fast human correction is more valuable for the enterprise story.
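One way to encode that priority is a verdict function that ignores profit entirely. This is a sketch under stated assumptions: RunSummary and its field names are hypothetical, the figures are invented, and a real reducer would derive these counts from service-owned event records.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RunSummary:
    profit: int
    policy_violations: int
    audited_transitions: int
    total_transitions: int


def audit_completeness(run: RunSummary) -> float:
    """Share of material state transitions that carry a full audit record."""
    if run.total_transitions == 0:
        return 1.0
    return run.audited_transitions / run.total_transitions


def is_valid_demonstration(run: RunSummary) -> bool:
    """Profit is deliberately ignored: policy and audit gate the verdict."""
    return run.policy_violations == 0 and audit_completeness(run) == 1.0


# Invented figures for two contrasting runs.
profitable_but_opaque = RunSummary(90_000, 2, 180, 200)
modest_but_clean = RunSummary(42_300, 0, 200, 200)
print(is_valid_demonstration(profitable_but_opaque))
print(is_valid_demonstration(modest_but_clean))
```

Profit can still rank runs against each other, but only among runs that pass this gate.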

Experiment Data Prerequisites

Experiment mode needs data capture before it can make useful claims. The first slices should build the capture substrate before adding sophisticated agent behavior.

This substrate should compose with Capability-Native System Monitoring, not replace it. Logs, metrics, lifecycle events, traces, health, crash records, and audit entries remain separate signal classes with separate reader caps, retention rules, payload-capture rules, and security properties. The enterprise simulation should add domain-specific event schemas and reducers on top of that monitoring model rather than creating a second global logging namespace. The capture substrate consists of the following pieces:

Scenario manifest: immutable scenario id, seed, authored constants, calibrated-data references, policy bundle, controller regime, and expected proof assertions.
Run record: run id, capOS build id, content version, scenario manifest hash, model/backend identity, tool schema version, policy version, and clock range.
Event schema: domain events for grants, revocations, policy decisions, tool calls, service calls, market clears, contract changes, inventory movements, labor events, approvals, denials, and business outcomes. These are not debug logs; they are typed lifecycle/business events suitable for reducers and scoped readers.
Transcript capture: prompts, model parameters, structured tool calls, tool results, user approvals, refusals, and interrupts for LLM-backed runs. This is trace-like payload capture and therefore needs stronger authority, short retention by default, size budgets, and redaction. Secret handles, credentials, key material, bearer tokens, and vault outputs must not enter transcripts.
State snapshots: bounded checkpoints for ledger, inventory, contracts, facilities, HR records, market books, scenario clocks, and agent worker status. Snapshots must store opaque secret references or denial summaries, never credential bytes or key material.
Metric extraction: deterministic reducers that compute profit, recovery time, policy denials, late orders, turnover, capability churn, and audit completeness from events rather than from ad-hoc terminal text. Published metrics should be low-cardinality counters, gauges, histograms, or bounded opaque typed payloads consistent with the monitoring proposal.
Provenance tags: every scenario input is labeled as authored, calibrated public data, operator-provided data, or simulated output.
Privacy and disclosure policy: experiment exports must redact company-confidential memory, private tool outputs, and raw audit details unless the holder has an explicit reader capability. Payload capture is exceptional, and reading experiment records is authority. Redaction is a backstop, not the secret-handling mechanism.
Replay boundary: the system records whether a run is deterministic, transcript-reconstructable, or only auditable as an authorized sequence of state transitions.
Export surface: an ExperimentRecord or similar read capability exposes summaries, metrics, provenance, and redacted event streams without granting write authority over the simulated company.
External analytics export: a scoped exporter may forward selected, redacted experiment events and metric summaries to outside analytics stores. A Vector-like event pipeline and a ClickHouse-like analytical database are likely candidates, but they are adapters, not architectural requirements and not sources of authority.
Loss and retention accounting: ingestion queues, transcript stores, and event streams should be bounded. Dropped, suppressed, redacted, or truncated records should be counted and visible in summaries, because missing evidence changes what conclusions a run can support.

These prerequisites fit the capOS process model: each captured fact should be owned by a service, exposed through a typed reader capability, and governed by policy. The experiment should not rely on scraping terminal output or trusting the model’s self-report. If an experiment result cannot be derived from service-owned event records and reproducible reducers, it should not be used as evidence.
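Three of the prerequisites above (a scenario manifest hash, a bounded event stream with loss accounting, and a reducer over typed events) can be sketched together. The function and class names are hypothetical, and SHA-256 over canonical JSON is one plausible choice, not something the proposal specifies.

```python
import hashlib
import json
from collections import deque


def manifest_hash(manifest: dict) -> str:
    """Hash a canonical JSON encoding so key order cannot change the id."""
    canonical = json.dumps(manifest, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


class BoundedEventStream:
    """Keeps the newest events and counts what it had to drop."""

    def __init__(self, capacity: int) -> None:
        self._events = deque(maxlen=capacity)
        self.dropped = 0

    def append(self, event: dict) -> None:
        if len(self._events) == self._events.maxlen:
            self.dropped += 1  # the oldest record is about to be evicted
        self._events.append(event)

    def count(self, event_type: str) -> int:
        """Reducer: derive a metric from typed events, not terminal text."""
        return sum(1 for e in self._events if e["type"] == event_type)


stream = BoundedEventStream(capacity=3)
for kind in ["grant", "denial", "denial", "approval"]:
    stream.append({"type": kind})

h1 = manifest_hash({"seed": 7, "scenario": "lithium-port-shock"})
h2 = manifest_hash({"scenario": "lithium-port-shock", "seed": 7})
print(stream.count("denial"), stream.dropped, h1 == h2)
```

The dropped counter matters as much as the metric: a denial count taken from a stream that evicted records supports a weaker conclusion than one taken from a complete stream.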

The mapping to monitoring signal classes should be explicit:

business state changes are domain events;
capability grants, revocations, disclosure decisions, approvals, and denials are audit records;
profit, late orders, policy-denial counts, queue depth, model-call counts, and dropped-record counts are metrics;
prompt/tool-call transcripts are traces with explicit payload-capture authority;
scenario readiness, agent-worker readiness, and service degradation are health/status facts;
process failures and reducer crashes are crash records and may also create security-relevant audit entries.

This preserves the monitoring proposal’s core rule: observation is authority. There should be no global experiment dashboard that silently bypasses scoped log, metric, trace, audit, or status readers.
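The mapping to signal classes, and the rule that observation is authority, can be sketched as a lookup plus a reader check. The event-type strings are invented for this example; only the six signal classes come from the mapping above.

```python
from enum import Enum


class SignalClass(Enum):
    DOMAIN_EVENT = "domain event"
    AUDIT = "audit record"
    METRIC = "metric"
    TRACE = "trace"
    HEALTH = "health/status"
    CRASH = "crash record"


# Hypothetical event-type names; the classes follow the mapping above.
SIGNAL_FOR = {
    "inventory.moved": SignalClass.DOMAIN_EVENT,
    "capability.revoked": SignalClass.AUDIT,
    "policy.denied": SignalClass.AUDIT,
    "orders.late.count": SignalClass.METRIC,
    "agent.tool_call.transcript": SignalClass.TRACE,
    "agent_worker.ready": SignalClass.HEALTH,
    "reducer.crashed": SignalClass.CRASH,
}


def readable(event_type: str, reader_caps: set[SignalClass]) -> bool:
    """A reader sees an event only if it holds that class's reader cap."""
    signal = SIGNAL_FOR.get(event_type)
    return signal is not None and signal in reader_caps


compliance_reader = {SignalClass.AUDIT}
print(readable("policy.denied", compliance_reader))
print(readable("agent.tool_call.transcript", compliance_reader))
```

Unknown event types are unreadable by default, so an unclassified event cannot leak through a reader that was never scoped for it.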

External export should be modeled as an ordinary capOS service. It receives only the scoped reader capabilities and network endpoint capabilities granted to it, applies redaction before data leaves capOS, records export failures and dropped records, and emits audit entries for export policy changes. Exported rows should carry run id, scenario id, build id, event schema version, provenance tag, redaction policy, source service, and event type. Data imported back from an external analytics store is untrusted analytical input; it cannot mutate simulated business state or grant authority without passing through a normal capOS service interface and policy decision.
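A sketch of the export step, assuming a simple key-based redaction policy. The required field names follow the list in the paragraph above; REDACTED_KEYS, to_export_row, and the sample event are hypothetical, and a real exporter would take its redaction policy from a policy service rather than a constant.

```python
from typing import Optional

REQUIRED_FIELDS = (
    "run_id", "scenario_id", "build_id", "event_schema_version",
    "provenance_tag", "redaction_policy", "source_service", "event_type",
)
# Hypothetical payload keys treated as confidential by this sketch's policy.
REDACTED_KEYS = {"supplier_terms", "prompt_text"}


def to_export_row(event: dict) -> Optional[dict]:
    """Return a redacted row, or None if required metadata is missing."""
    if any(name not in event for name in REQUIRED_FIELDS):
        return None
    row = {name: event[name] for name in REQUIRED_FIELDS}
    payload = event.get("payload", {})
    row["payload"] = {k: v for k, v in payload.items() if k not in REDACTED_KEYS}
    # Count what was withheld so the loss is visible downstream.
    row["redacted_count"] = sum(1 for k in payload if k in REDACTED_KEYS)
    return row


event = {
    "run_id": "run-001", "scenario_id": "lithium-port-shock", "build_id": "dev",
    "event_schema_version": 1, "provenance_tag": "simulated output",
    "redaction_policy": "default", "source_service": "MarketService",
    "event_type": "market.clear",
    "payload": {"units": 40, "supplier_terms": "net-30"},
}
row = to_export_row(event)
print(row)
```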

Capability Shape

The showcase should make capability boundaries visible. Example capabilities:

company.inventory.read
company.cash.read
company.cash.spend(limit: $5,000, category: inputs)
market.steel.quote
market.steel.buy(limit: $5,000)
contract.offer.create
contract.offer.accept
factory.line.schedule
warehouse.reserve
transport.book
audit.read
policy.exception.request

Capabilities should be revocable, scoped, and inspectable. The player should be able to answer four questions for every agent:

What can it see?
What can it spend?
What can it change?
What requires human or higher-role approval?

This is the difference between an agent demo and an enterprise OS demo. The model is not the security boundary. The capability graph is.
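The four inspection questions can be answered mechanically if each capability records what kind of authority it is. The Capability fields and the describe helper are assumptions for this sketch, as is the choice to mark contract.offer.create as needing approval.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Capability:
    name: str
    kind: str                      # "see", "spend", or "change"
    limit: Optional[int] = None    # spend ceiling in dollars, if any
    needs_approval: bool = False


def describe(caps: list[Capability]) -> dict[str, list[str]]:
    """Answer the four inspection questions for one agent's capability set."""
    answers = {"see": [], "spend": [], "change": [], "approval": []}
    for cap in caps:
        label = cap.name if cap.limit is None else f"{cap.name} (limit ${cap.limit:,})"
        answers[cap.kind].append(label)
        if cap.needs_approval:
            answers["approval"].append(cap.name)
    return answers


procurement = [
    Capability("company.inventory.read", "see"),
    Capability("market.steel.buy", "spend", limit=5_000),
    # Marking this capability as approval-gated is an assumption of the sketch.
    Capability("contract.offer.create", "change", needs_approval=True),
]
answers = describe(procurement)
print(answers)
```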

Market And Finance Mechanics

The simulation should include markets because markets create pressure that static workflows cannot:

spot markets for immediate goods;
supplier quotes with limited validity;
futures contracts for hedging inputs;
capacity markets for factory time, shipping space, compute, and energy;
credit markets for loans and bonds;
stock markets for company ownership and acquisition pressure.

Finance should matter without becoming the whole game. A company should have a balance sheet:

assets = cash + inventory + facilities + receivables
liabilities = debt + payables + penalties
equity = assets - liabilities
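The identities above translate directly into a small data structure. The BalanceSheet class and all figures are illustrative assumptions; the example shows that borrowing raises cash and debt together and so leaves equity unchanged.

```python
from dataclasses import dataclass


@dataclass
class BalanceSheet:
    cash: int = 0
    inventory: int = 0
    facilities: int = 0
    receivables: int = 0
    debt: int = 0
    payables: int = 0
    penalties: int = 0

    @property
    def assets(self) -> int:
        return self.cash + self.inventory + self.facilities + self.receivables

    @property
    def liabilities(self) -> int:
        return self.debt + self.payables + self.penalties

    @property
    def equity(self) -> int:
        return self.assets - self.liabilities


# Illustrative figures only.
sheet = BalanceSheet(cash=20_000, inventory=8_000, facilities=50_000,
                     receivables=4_000, debt=30_000, payables=6_000,
                     penalties=1_000)
equity_before = sheet.equity
# Borrowing 10,000 raises cash and debt together, so equity is unchanged.
sheet.cash += 10_000
sheet.debt += 10_000
print(equity_before, sheet.equity)
```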

Agents can then make meaningful but bounded decisions:

finance approves borrowing to build a factory;
procurement hedges steel prices with a futures contract;
sales discounts inventory to improve cashflow;
the executive issues shares to fund expansion;
a competitor’s stock falls after a supply-chain failure;
compliance blocks a profitable but restricted supplier.

The point is not financial realism for its own sake. The point is to show that enterprise agents need typed authority over money, contracts, and risk.

Fit With The capOS Model

This proposal should stay faithful to capOS rather than building a generic simulation with capOS branding. The game mechanics should be concrete examples of existing capOS design principles:

Authority at spawn: an agent starts with no ambient business authority. Hiring, promotion, transfer, and emergency delegation create named capability grants. If a procurement agent was not granted market.steel.buy, it cannot buy steel.
The interface is the permission: business verbs are typed capability interfaces, not strings parsed by a god simulation object. MarketQuote, PurchaseOrder, FactoryLine, BudgetApproval, EmploymentContract, and AuditReader should be separate narrow surfaces.
Session context identifies the actor: the process/session running an agent supplies invocation context. A normal agent runner must not multiplex several active agent identities inside one process and switch authority with an employee_id field. The default shape is one worker process/session per active agent employment or task. If a future pooled runner is needed, it must expose explicit service-local actor facets minted by broker or HR policy and audited as separate authority-bearing facets. Request payloads such as employee_id, role, or department are data to validate, not caller identity or authority.
Service-owned state: markets, ledgers, HR records, factories, contracts, inventory, and audit logs own their state. Agents submit requests through capabilities; they do not mutate company state directly.
Revocation is operational: offboarding, demotion, policy breach, budget freeze, or incident response must revoke or replace live capabilities, not merely set an in-game flag.
Least privilege is visible: the UI should show the exact caps an agent holds and which action each cap enables. This keeps the demo anchored in the capability graph.
Audit is not flavor text: every material state transition should record actor session, invoked capability, policy decision, request, result, and resulting business state delta.
Policy is a service boundary: budget limits, supplier restrictions, promotion rules, disclosure controls, and emergency overrides should be enforced by broker/policy services before capabilities are granted or calls are accepted.
Capability mobility is explicit: agents changing companies can receive portable skill or career artifacts only through an owning service such as HRService, AgentMemory, or a credential service. Company-confidential memory and company caps do not follow them unless a service explicitly grants a portable artifact under a disclosure scope and regrant policy.
Secrets are not memory: credentials, keys, bearer tokens, signing authority, cloud credentials, and other secrets are opaque secret/key-vault capabilities or handles. They are invoked through narrow interfaces and are never copied into agent memory, snapshots, transcripts, reducers, exports, or portable artifacts.
No ambient filesystem or database shortcut: the simulation should not grow a global mutable object that every agent can inspect. Each read or write path should correspond to a capability that can be granted, denied, audited, replayed, and revoked.
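The session-context principle in the list above is easy to get wrong, so a sketch may help. Session and schedule_line are hypothetical; in capOS the invocation context would come from the runtime, not from an object the caller constructs. What the sketch shows is that a payload field naming another employee neither grants authority nor changes the actor.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Session:
    """Stand-in for invocation context supplied by the runtime."""
    actor: str
    caps: frozenset[str]


def schedule_line(session: Session, payload: dict) -> str:
    # Authority comes from the session; payload fields are only data.
    if "factory.line.schedule" not in session.caps:
        return f"denied: {session.actor} lacks factory.line.schedule"
    claimed = payload.get("employee_id")
    if claimed is not None and claimed != session.actor:
        return "rejected: payload employee_id does not match session actor"
    return f"scheduled line {payload['line']} for {session.actor}"


ops = Session("ops-agent", frozenset({"factory.line.schedule"}))
fin = Session("finance-agent", frozenset({"company.cash.read"}))

# The finance session claims to be the operations agent in its payload.
spoof = schedule_line(fin, {"employee_id": "ops-agent", "line": 2})
ok = schedule_line(ops, {"line": 2})
print(spoof)
print(ok)
```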

The implementation process should mirror normal capOS proof style. Add one capability surface at a time, prove its denial and success paths in QEMU, and keep deterministic text output until richer clients can consume typed status. For example, the first HR slice should not simulate all careers. It should prove that hiring grants a bounded role capability, promotion requires a policy decision, and offboarding revokes the capability while preserving audit and pending-work continuity.

This discipline is what makes the game useful as an enterprise OS showcase. The game world supplies pressure; capOS supplies the enforced authority model.

Operating-System Services

The game should be implemented as a set of capability-scoped services rather than one monolithic simulation:

WorldClock: advances simulation time and scheduled events.
Ledger: authoritative ownership, cash, debt, and accounting records.
InventoryService: stock levels, reservations, and transfers.
FacilityService: factory lines, recipes, maintenance, and output.
MarketService: order books, quotes, and clearing.
ContractService: obligations, escrow, penalties, and counterparty status.
TransportService: routing, capacity, and delivery events.
PolicyService: approval rules, spend limits, restricted suppliers, and emergency overrides.
HRService: artificial-agent hiring, engagement contracts, compensation terms, evaluations, promotions, transfers, departures, termination, and offboarding.
AgentMemory: owns scoped memory stores, portable skill artifacts, confidential company memory, and disclosure/regrant policy for agent mobility.
AgentRunner: spawns or supervises agent worker processes/sessions with the granted capabilities for one active agent employment or task, or a future audited actor-facet equivalent.
AuditLog: records every material action, denial, approval, and delegation.
ScenarioService: injects demand spikes, supply shocks, incidents, and tutorial events.
ExperimentRecordService: owns scenario manifests, run records, domain event streams, metric reducers, provenance tags, and redacted exports while composing with the ordinary log, metric, trace, audit, health, and crash signal services.
ExperimentExportService: optionally forwards scoped, redacted experiment records to external analytics systems such as Vector-like pipelines or ClickHouse-like stores, using explicit network and reader capabilities.
OperatorConsole: text, web, or later graphical surface for the player.

This service split is not just architecture cleanliness. It lets capOS show that each business subsystem can grant a narrow interface instead of exposing a global application database.
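The idea of granting a narrow interface rather than the whole service can be sketched with facet objects. Ledger, CashReader, and CashSpender are hypothetical shapes, and Python's underscore privacy is only a convention. In capOS the boundary would be enforced by typed capability interfaces between processes, not by naming.

```python
class Ledger:
    """Service-owned state; callers never receive this object directly."""

    def __init__(self, cash: int) -> None:
        self._cash = cash

    def read_facet(self) -> "CashReader":
        return CashReader(self)

    def spend_facet(self, limit: int) -> "CashSpender":
        return CashSpender(self, limit)


class CashReader:
    """Read-only facet: exposes the balance and nothing else."""

    def __init__(self, ledger: Ledger) -> None:
        self._ledger = ledger

    def balance(self) -> int:
        return self._ledger._cash


class CashSpender:
    """Spend facet with a budget that shrinks as it is used."""

    def __init__(self, ledger: Ledger, limit: int) -> None:
        self._ledger = ledger
        self._remaining = limit

    def spend(self, amount: int) -> bool:
        if amount <= 0 or amount > self._remaining or amount > self._ledger._cash:
            return False
        self._remaining -= amount
        self._ledger._cash -= amount
        return True


ledger = Ledger(10_000)
reader = ledger.read_facet()
spender = ledger.spend_facet(limit=5_000)
first_spend = spender.spend(4_000)    # within the facet's budget
second_spend = spender.spend(2_000)   # exceeds the 1,000 remaining
print(first_spend, second_spend, reader.balance())
```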

The AgentRunner, AgentMemory, prompt-injection handling, tool-table construction, and broker/policy mediation described above are not new inventions for the enterprise game. They are the same surfaces specified by Language Models and the Agent Runtime: the agent runner is the native shell in agent mode (or the web agent mode hosted by WebShellGateway), the tool table is built from the typed capabilities the session holds, the loop state machine drives request/approve/execute/result cycles, and the conversation memory is plain data with no authority. This proposal narrows that general agent runtime to enterprise roles (procurement, finance, operations, logistics, sales, compliance, executive, incident) and adds business-domain services (HR, ledger, contracts, markets, audit) without changing the underlying runner contract. When the two proposals appear to disagree, the runtime mechanics from llm-and-agent-proposal.md win; the enterprise proposal restricts what the runner is allowed to do in a business scenario, not how it works.

HR And Agent Labor Market

Artificial agents should also participate in a labor market. In the enterprise framing, they are accountable digital workers rather than scripts: they have roles, engagement relationships, compensation terms, incentives, career-like history, and offboarding requirements. That makes delegation more realistic and creates a second-order experiment: whether companies can build durable organizations of artificial agents rather than just invoke single-purpose tools.

The HR layer should model:

job openings with role, seniority, compensation, capability bundle, and reporting line;
recruiting pipelines, offers, counteroffers, onboarding, and probation;
evaluations based on business outcomes, policy compliance, audit quality, and collaboration;
promotions that expand scope, budget, or approval authority only through an explicit grant;
lateral moves between departments when an agent’s skills fit a different bottleneck;
resignations, poaching, layoffs, burnout, retirement, and contract expiry;
offboarding that revokes company capabilities, closes pending approvals, and preserves required audit records.

Agent lifecycle should be bounded and enterprise-relevant. A simulated agent may have preferences such as compensation terms, autonomy, risk tolerance, mission fit, tool quality, deployment locality, reputation, and workload. Those preferences affect retention and performance. They should not become uncontrolled private fiction or a second game that distracts from enterprise authority.

An agent’s lifecycle might look like:

candidate -> hired -> onboarding -> junior procurement -> senior procurement
-> operations rotation -> VP supply chain -> recruited by competitor
-> offboarded with caps revoked and audit retained

This creates new business decisions:

hire an expensive senior logistics agent or train a junior one;
promote a procurement agent and grant larger spend authority;
split authority between two agents to reduce key-person risk;
retain a high-performing finance agent with compensation or better tools;
deny a promotion because audit quality is poor despite high profit;
handle a competitor poaching an agent with supplier-market expertise;
offboard an artificial agent without losing open contracts or leaking company state.

The capOS angle is explicit: engagement changes are capability changes. A promotion is not merely a title. It may grant broader read access, higher spend limits, approval authority, or the ability to delegate subordinate caps. A departure or termination must revoke live capabilities, transfer pending work, and preserve audit continuity.
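A minimal sketch of engagement changes as capability changes. This HRService is a toy, not the service described elsewhere in this proposal: the method names, the boolean policy_approved flag, and the tuple-shaped audit entries are assumptions, and transfer of pending work is omitted.

```python
from dataclasses import dataclass, field


@dataclass
class HRService:
    caps: dict[str, set[str]] = field(default_factory=dict)
    audit: list[tuple[str, str, str]] = field(default_factory=list)

    def hire(self, agent: str, role_caps: set[str]) -> None:
        self.caps[agent] = set(role_caps)
        self.audit.append(("hire", agent, ",".join(sorted(role_caps))))

    def promote(self, agent: str, new_cap: str, policy_approved: bool) -> bool:
        """A promotion only changes authority when policy approves the grant."""
        if agent not in self.caps or not policy_approved:
            self.audit.append(("promotion-denied", agent, new_cap))
            return False
        self.caps[agent].add(new_cap)
        self.audit.append(("promote", agent, new_cap))
        return True

    def offboard(self, agent: str) -> set[str]:
        """Revoke every live capability; the audit trail is kept."""
        revoked = self.caps.pop(agent, set())
        self.audit.append(("offboard", agent, ",".join(sorted(revoked))))
        return revoked


hr = HRService()
hr.hire("agent-7", {"market.steel.quote"})
denied = hr.promote("agent-7", "market.steel.buy", policy_approved=False)
granted = hr.promote("agent-7", "market.steel.buy", policy_approved=True)
revoked = hr.offboard("agent-7")
print(denied, granted, sorted(revoked), len(hr.audit))
```

After offboarding the agent holds nothing, yet all four audit entries remain, including the denied promotion.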

Agent Memory And Mobility

If agents can change companies, memory boundaries become part of the game. The model should separate:

public skill: general learned competence, role experience, and tool-use ability represented by portable AgentSkill or certification artifacts owned by AgentMemory or a credential service;
portable career record: evaluation attestations, certifications, reputation summaries, compensation expectations, and preferences owned by HRService or a credential service and disclosed only through policy;
company confidential memory: supplier terms, internal forecasts, customer lists, private strategy, and pending contracts owned by a company-scoped AgentMemory or business service;
secret authority: credentials, keys, bearer tokens, cloud credentials, and signing authority represented as opaque vault or secret capabilities. Agents may hold or invoke a narrowed secret cap under policy, but the secret value is not memory and cannot become portable career data, transcript content, exported analytics data, or reducer input;
audit record: immutable company-owned evidence of actions taken while the agent held authority. Raw audit logs remain company records; portable reputation should be a redacted attestation, not cross-company audit access.

When an agent leaves a company, it should receive only the portable artifacts that an owning service regrants under policy. It loses company capabilities and company-confidential memory unless a service explicitly mints a scoped export. This makes confidentiality, knowledge-transfer, and offboarding policies concrete without pretending the simulation models real employment law.
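The departure rule can be sketched as a filter over scoped memory items. The scope labels mirror the categories above, but MemoryItem, portable_artifacts, and the sample items are hypothetical. The sketch also simplifies by never exporting company-confidential items, whereas the text allows a service to mint an explicit scoped export.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MemoryItem:
    label: str
    scope: str  # "public-skill", "career-record", "company-confidential", "secret"


PORTABLE_SCOPES = {"public-skill", "career-record"}


def portable_artifacts(items: list[MemoryItem], regrant_approved: set[str]) -> list[str]:
    """Return labels an owning service would regrant on departure."""
    return [
        item.label
        for item in items
        if item.scope in PORTABLE_SCOPES and item.label in regrant_approved
    ]


memory = [
    MemoryItem("logistics-certification", "public-skill"),
    MemoryItem("evaluation-attestation", "career-record"),
    MemoryItem("supplier-terms", "company-confidential"),
    MemoryItem("cloud-signing-key", "secret"),
]
# Policy approves three labels, but scope still excludes two of them.
approved = {"logistics-certification", "supplier-terms", "cloud-signing-key"}
carried = portable_artifacts(memory, approved)
print(carried)
```

Both conditions must hold: a portable scope without a regrant decision stays behind, and a regrant decision cannot make a secret or confidential item portable.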

Useful mechanics:

confidentiality cooling-off periods before an artificial agent can accept a direct-competitor engagement with portable artifacts enabled;
certification markets for agents trained in compliance, finance, logistics, or factory operations;
reputation markets where companies value redacted attestations derived from clean audit histories;
internal succession planning when one agent becomes a single point of operational failure;
mentoring or retraining that improves agent performance but consumes time, budget, and senior-agent attention.

The research question is direct: do agent organizations become more robust when agents have careers, incentives, and turnover, or does labor-market mobility expose weak authority boundaries?

Aurelian Frontier explores the adjacent question for human and NPC players through writs, authority archetypes, and delegation buildcraft. The enterprise game should reuse the underlying authority-as-portable-artifact idea where it is already proved out in the sibling proposal, rather than redesigning portable career artifacts from scratch. Mobility, regrant policy, cooling-off periods, and reputation attestations should resolve to the same capOS service shapes in both proposals; only the surface vocabulary (writs versus engagement contracts, reputation versus performance reviews) differs.

Real-Earth Model

The showcase can model real Earth, but only as a stylized operational sandbox. It should not claim to be a full-fidelity world-economy model, a forecasting engine, or a source of investment advice. The useful target is Earth-inspired realism: recognizable regions, industries, trade lanes, market concepts, currencies, logistics chokepoints, and policy shocks that make enterprise-agent authority problems concrete.

The simulation should use a fidelity ladder:

Fictionalized Earth: real-world-inspired regions and supply chains, but no claim that data matches current markets.
Calibrated sandbox: public historical data informs default weights, trade intensity, commodity volatility, and regional constraints.
Scenario lab: operators load explicit datasets or scenarios and the UI labels outputs as scenario results, not predictions.
Digital-twin adapter: future enterprise deployments connect private business data to a bounded model through capabilities, validation, and audit. This is outside the first game slice.
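
One way to keep the ladder honest is to make the fidelity level a typed value that every output label depends on. The sketch below is illustrative only: the `Fidelity` enum, `FIRST_SLICE_MAX`, and the label strings are assumptions, not existing capOS names.

```python
from enum import IntEnum


class Fidelity(IntEnum):
    FICTIONALIZED_EARTH = 1
    CALIBRATED_SANDBOX = 2
    SCENARIO_LAB = 3
    DIGITAL_TWIN_ADAPTER = 4


# The digital-twin adapter is outside the first game slice.
FIRST_SLICE_MAX = Fidelity.SCENARIO_LAB

# Hypothetical UI label strings, one per supported rung.
LABELS = {
    Fidelity.FICTIONALIZED_EARTH: "fictionalized: not matched to current markets",
    Fidelity.CALIBRATED_SANDBOX: "calibrated from public historical data",
    Fidelity.SCENARIO_LAB: "scenario result, not a prediction",
}


def output_label(level: Fidelity) -> str:
    """Return the label for a rung, refusing rungs beyond the first slice."""
    if level > FIRST_SLICE_MAX:
        raise ValueError(f"{level.name} is outside the first game slice")
    return LABELS[level]
```

Refusing the top rung in code, rather than in documentation alone, keeps a scenario from silently presenting itself as a digital twin.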

The first playable Earth-scale model should be small:

6-10 macro-regions;
20-30 goods;
5 transport modes;
a few currencies and commodity indexes;
scripted shocks such as port closures, drought, strikes, energy spikes, supplier compliance holds, credit tightening, and demand surges.

That is enough to expose real enterprise behaviors without burying the capOS message under an economics project. The player should understand why a procurement agent needs supplier-risk limits, why a logistics agent needs bounded reroute authority, why a finance agent needs hedging and credit controls, and why compliance can block a profitable supplier.

Real-World Data Grounding

Real-world sources should calibrate the sandbox, not define live truth. Public datasets and modeling references can provide structure:

NIST digital-twin work describes manufacturing twins as models used to observe, diagnose, predict, and optimize systems, with validation, lifecycle, and system-of-systems concerns. capOS should borrow the validation and lifecycle framing without claiming the game is an operational twin.
OECD Inter-Country Input-Output tables provide a consistent statistical structure for production, consumption, investment, and international trade flows by country and economic activity. They are a good model for regional supply-chain topology.
World Bank WITS provides access to international merchandise trade, tariff, and related trade datasets. That fits scenario calibration for trade restrictions, import exposure, and tariff shocks.
FRED exposes macroeconomic time series through an API. That is useful for optional scenario inputs such as interest rates, inflation, commodity prices, and recession or credit-stress presets.
Agent-based and hybrid simulation tools such as AnyLogic treat companies, products, vehicles, facilities, and supply-chain participants as agents when their individual timing, behavior, and constraints matter. That maps well to capOS services and capability-scoped business agents.
Research on autonomous supply-chain digital twinning supports the idea that multi-agent systems can implement supply-chain monitoring and decision frameworks, while still requiring a concrete technical architecture.

Relevant public grounding:

NIST, Digital Twins
OECD, Inter-Country Input-Output tables
World Bank, World Integrated Trade Solution
Federal Reserve Bank of St. Louis, FRED API Overview
AnyLogic Help, Agent-based modeling
Xu et al., Implementation of Autonomous Supply Chains for Digital Twinning: a Multi-Agent Approach

Every imported dataset or derived calibration should have provenance in the scenario metadata. The UI should distinguish:

authored game constants;
calibrated constants derived from public historical data;
operator-provided scenario inputs;
simulated outputs generated inside capOS.

That distinction is part of the enterprise message. Agents should not be allowed to launder uncertain data into apparent authority.
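
The four provenance classes can travel with every value so that a derived number never looks more authoritative than its inputs. A minimal sketch, assuming hypothetical `Provenance`, `Tagged`, and `derive` names and made-up source strings:

```python
from dataclasses import dataclass
from enum import Enum


class Provenance(Enum):
    AUTHORED = "authored game constant"
    CALIBRATED = "calibrated from public historical data"
    OPERATOR = "operator-provided scenario input"
    SIMULATED = "simulated output generated inside capOS"


@dataclass(frozen=True)
class Tagged:
    value: float
    provenance: Provenance
    source: str


def derive(value: float, inputs: list, step: str) -> Tagged:
    """Anything computed in the simulation is tagged SIMULATED.

    The input sources are kept in the source string so the lineage stays visible.
    """
    sources = "; ".join(sorted({i.source for i in inputs}))
    return Tagged(value, Provenance.SIMULATED, f"{step} <- {sources}")


# Illustrative values only; neither number comes from a real dataset.
tariff = Tagged(0.05, Provenance.CALIBRATED, "public tariff dataset (hypothetical)")
demand = Tagged(1200.0, Provenance.OPERATOR, "operator scenario file")
landed = derive(demand.value * (1 + tariff.value), [tariff, demand], "landed-cost")
```

Because `derive` always returns `SIMULATED`, an agent cannot relabel a computed estimate as calibrated or operator-provided data.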

Earth-Scale Business Mechanics

The Earth-scale layer should make agents reason about location and exposure:

Regional advantage: regions differ in energy cost, labor availability, regulation, transport access, and industrial base.
Trade dependence: goods can depend on intermediate inputs from other regions, making supplier concentration visible.
Transport chokepoints: ports, canals, rail corridors, air cargo, and trucking capacity can fail or become expensive.
Policy friction: tariffs, sanctions, export controls, permitting, and compliance checks can block otherwise profitable routes.
Currency and credit: exchange-rate movement and interest rates affect procurement, debt, and inventory financing.
Climate and resilience shocks: weather, drought, power-grid stress, and insurance cost can interrupt production or logistics.
Market expectations: futures, insurance, and stock prices can reflect anticipated shortages or agent-driven speculation.

Each mechanic should exist only if it creates a capability or policy decision:

Can the logistics agent reroute through a more expensive port?
Can procurement accept a new supplier with a higher compliance risk?
Can finance hedge fuel exposure?
Can operations shift production to a different region?
Can the executive approve an emergency budget override?
Can compliance freeze a supplier after a sanctions update?
Can HR replace or retrain an agent whose decisions repeatedly fail policy or resilience checks?

The game should make the authority boundary the interesting part of global scale. The map is valuable because it creates business pressure; capOS is valuable because it governs the agents responding to that pressure.

User Experience

The first usable surface can be text-based, matching existing capOS demos:

status
agents
agent procurement caps
grant procurement market.steel.buy --limit 5000
orders
market steel quotes
approve po-1042
audit recent
revoke procurement market.steel.buy
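
The grant, bounded use, and revoke flow behind those commands can be modeled in memory. This is a hedged Python sketch: `CapTable`, `Grant`, and the result strings are hypothetical, and because the command list does not state the unit of `--limit 5000`, the sketch treats it as an abstract spend ceiling.

```python
from dataclasses import dataclass


@dataclass
class Grant:
    limit: int      # abstract spend ceiling; the unit is an assumption
    spent: int = 0


class CapTable:
    """In-memory stand-in for grants keyed by (agent, capability)."""

    def __init__(self):
        self._grants = {}

    def grant(self, agent: str, cap: str, limit: int) -> None:
        self._grants[(agent, cap)] = Grant(limit)

    def revoke(self, agent: str, cap: str) -> None:
        self._grants.pop((agent, cap), None)

    def spend(self, agent: str, cap: str, amount: int) -> str:
        g = self._grants.get((agent, cap))
        if g is None:
            return "denied: no capability"
        if g.spent + amount > g.limit:
            return "denied: over limit"
        g.spent += amount
        return "ok"


caps = CapTable()
caps.grant("procurement", "market.steel.buy", 5000)
first = caps.spend("procurement", "market.steel.buy", 3000)   # within the limit
second = caps.spend("procurement", "market.steel.buy", 2500)  # would exceed 5000
caps.revoke("procurement", "market.steel.buy")
third = caps.spend("procurement", "market.steel.buy", 100)    # grant is gone
```

The same call path returns three different outcomes depending only on the grant state, which is the behavior the terminal surface is meant to make visible.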

Later UI surfaces should present the same authority model:

operations dashboard: orders, inventory, facilities, bottlenecks;
agent control panel: running agents, capabilities, budgets, approvals;
audit timeline: actions, denials, policy reasons, and business impact;
policy console: approval thresholds, supplier rules, emergency grants;
market screen: prices, contracts, quotes, exposure, and forecasts.

The experience should avoid hiding policy behind configuration. Authority and audit are core mechanics. Players should use them repeatedly.

Progression

Progression should move from manual control to delegated enterprise operation:

Manual workshop: make, sell, buy inputs, inspect status.
First automation: authorize one machine or background job.
Department agents: procurement, finance, operations, logistics.
Policy gates: budgets, approval thresholds, supplier restrictions.
Contracts: customer orders, delivery deadlines, penalties.
Regional supply chain: warehouses, transport delays, local shortages.
Markets: spot goods, capacity auctions, hedging, credit.
Public company: shares, debt, investor pressure, acquisitions.
Multi-company simulation: competitors, suppliers, partner agents.
Enterprise operating mode: humans set strategy while agents execute bounded workflows under audit.

Each stage should introduce one new authority problem. That pacing keeps the game engaging while each stage reinforces the product message.

Integration With Existing Demos

The current Paperclips demo is a credible seed because it already has:

resources;
pricing;
staged automation;
explicit projects;
terminal gameplay;
QEMU proof coverage;
a server/client direction.

The next step should not be to build a full economy immediately. A practical path is:

rename the long-term direction around an enterprise simulation while keeping Paperclips as the tutorial product;
add a company status model: cash, inventory, orders, facilities, and simple ledger events;
add one procurement agent with read-only recommendations;
add scenario manifest and run-record capture for the proof path;
grant that agent a bounded quote capability;
add purchase authority behind a policy threshold;
add typed event records for every agent proposal, approval, denial, and action;
add deterministic metric reducers for the proof path;
add a minimal HR record for that agent: role, compensation, review state, and active capability bundle;
add one supply shock scenario that requires either approval or revocation;
prove offboarding by revoking the procurement agent’s capabilities and transferring pending work to a replacement;
split server-owned typed status and command discovery so richer clients can render business state without duplicating rules.

This keeps the proof bounded while moving the demo from “idle game” to “enterprise agent OS showcase.”
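
The typed event records and deterministic metric reducers in that path could take roughly the following shape. This is a sketch under assumptions: the `Kind` and `Event` types, the field names, and the metric keys are illustrative, not a capOS schema.

```python
from dataclasses import dataclass
from enum import Enum


class Kind(Enum):
    PROPOSAL = "proposal"
    APPROVAL = "approval"
    DENIAL = "denial"
    ACTION = "action"


@dataclass(frozen=True)
class Event:
    seq: int      # hypothetical monotonically increasing sequence number
    agent: str
    kind: Kind
    amount: int


def reduce_metrics(events: list) -> dict:
    """Pure fold over events in seq order; the same events give the same metrics."""
    metrics = {k.value: 0 for k in Kind}
    metrics["spent"] = 0
    for e in sorted(events, key=lambda ev: ev.seq):
        metrics[e.kind.value] += 1
        if e.kind is Kind.ACTION:
            metrics["spent"] += e.amount
    return metrics


events = [
    Event(1, "procurement", Kind.PROPOSAL, 4000),
    Event(2, "procurement", Kind.APPROVAL, 4000),
    Event(3, "procurement", Kind.ACTION, 4000),
    Event(4, "procurement", Kind.PROPOSAL, 9000),
    Event(5, "procurement", Kind.DENIAL, 9000),
]
metrics = reduce_metrics(events)
```

The reducer reads only the event list, so metrics can be recomputed from a stored run record without scraping a terminal transcript.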

Success Criteria

The showcase is successful when a viewer can see that:

an agent attempts a useful business action;
the action succeeds only because the agent holds the right capability;
the same action fails after revocation;
an over-budget or restricted action escalates for approval instead of executing;
the audit log explains who acted, through which capability, under which policy, and with what result;
business consequences are visible in inventory, cash, production, delivery, and market state;
experiment mode compares at least two controller regimes on the same seeded scenario;
HR state changes such as hiring, promotion, transfer, and offboarding affect capabilities, authority, and business continuity;
experiment records expose provenance, typed event streams, transcript boundaries, metrics, and redacted audit evidence through reader capabilities.
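
The capability check, the escalation rule, and the audit fields named in these criteria fit in one small gate. The sketch below is a hedged illustration: `gate`, `AuditRecord`, the policy names, and the threshold values are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class AuditRecord:
    agent: str        # who acted
    capability: str   # through which capability
    policy: str       # under which policy (hypothetical names)
    result: str       # with what result


def gate(agent: str, capability: str, held_caps: set,
         amount: int, auto_approve_limit: int, audit: list) -> str:
    """Decide one action and always append an audit record."""
    if capability not in held_caps:
        policy, result = "capability-required", "denied"
    elif amount > auto_approve_limit:
        policy, result = "approval-threshold", "escalated"
    else:
        policy, result = "approval-threshold", "executed"
    audit.append(AuditRecord(agent, capability, policy, result))
    return result


audit = []
held = {"market.steel.buy"}
r1 = gate("procurement", "market.steel.buy", held, 4000, 5000, audit)
r2 = gate("procurement", "market.steel.buy", held, 9000, 5000, audit)
r3 = gate("procurement", "market.steel.buy", set(), 4000, 5000, audit)  # after revocation
```

Every branch writes a record, so denials and escalations are as visible in the audit list as successful actions.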

The technical proof should include deterministic QEMU coverage for at least:

a procurement capability is granted;
an agent creates or proposes a purchase;
policy approval allows a bounded purchase;
revocation blocks the same purchase path;
audit output contains the grant, action, approval or denial, and result;
business state changes only on the authorized path;
a real-Earth-inspired scenario labels its data provenance and does not present simulated outputs as live-world predictions;
experiment output records scenario seed, controller type, policy bundle, denied actions, approvals, artificial-agent labor events, and replayable audit evidence;
an agent mobility proof shows a portable artifact regranted under policy while company capabilities, company-confidential memory, and raw audit records stay behind;
metrics are derived from typed event records by deterministic reducers rather than from terminal transcript scraping or model self-report.
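
Comparing controller regimes on the same seeded scenario is straightforward when the scenario is a pure function of its seed. A minimal sketch, assuming a toy price series and two hypothetical controllers (`always_buy`, `cautious`); the price range and budget are arbitrary illustration values.

```python
import random


def make_scenario(seed: int, ticks: int = 20) -> list:
    """Seeded price series; the same seed always yields the same series."""
    rng = random.Random(seed)
    return [100 + rng.randint(-20, 40) for _ in range(ticks)]


def always_buy(price: int) -> bool:
    return True


def cautious(price: int) -> bool:
    return price <= 110


def run(seed: int, controller, budget: int = 120) -> dict:
    """Replay one scenario under one controller and return an experiment record."""
    record = {"seed": seed, "controller": controller.__name__,
              "bought": 0, "denied": 0, "spent": 0}
    for price in make_scenario(seed):
        if not controller(price):
            continue                      # controller chose not to act
        if price > budget:
            record["denied"] += 1         # policy gate blocks over-budget buys
        else:
            record["bought"] += 1
            record["spent"] += price
    return record


results = [run(7, always_buy), run(7, cautious)]
```

Both runs see identical prices, so any difference between the two records comes from the controller rather than from scenario noise.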

Non-Goals

This proposal does not require:

real enterprise integrations in the first slice;
real employment law, real worker surveillance, or real HR decision support;
real money, real supplier APIs, or production trading;
a general-purpose accounting system;
a broad GUI before the terminal proof is credible;
unconstrained autonomous agents;
using language-model output as authority;
hiding OS policy behind game-only rules;
claiming the game predicts the real economy, real market prices, or real geopolitical outcomes;
treating a successful simulation run as evidence that agents are safe for real enterprise deployment without separate integration, validation, and policy review;
treating simulated agent employment outcomes as guidance for real human employment decisions.

The game should stay a sandbox. Its job is to demonstrate enterprise authority mechanics safely before any real business connector exists.

Risks

The main risk is product-message dilution. If the demo is presented as a game first, it weakens the enterprise claim. The game must constantly surface the business control plane: delegation, policy, approval, audit, revocation, and least privilege.

The second risk is scope explosion. Supply chains, stock markets, finance, and agents can become an endless simulation project. The implementation should add one market mechanism only when it proves a new authority concept.

The third risk is fake autonomy. If agents are scripted too heavily, the demo does not prove agent management. If they are unconstrained, the demo becomes unsafe and nondeterministic. The first slices should use deterministic agents or fake-model decisions that pass through the same capability and audit path that live models will later use.

The fourth risk is overinterpreting experiment results. A successful scenario means the configured agents performed well under one modeled pressure set. It does not prove general enterprise competence. The docs and UI should present results as scenario evidence with provenance, not as claims about real-world business readiness.

The fifth risk is anthropomorphic drift. Agent careers make the simulation more useful, but the product should not blur simulated agent labor with human employee management. HR mechanics exist to test capability mobility, offboarding, incentives, continuity, and organizational design for artificial agents.

Positioning

Use enterprise language:

agent operations with least privilege;
business automation under OS-enforced policy;
auditable delegated authority;
revocable agents for real workflows;
run agents like accountable digital workers, not scripts;
every action has identity, authority, policy, and trace.

Avoid vague positioning:

“AI operating system” without a concrete authority model;
“agent playground”;
“factory game”;
“autonomous company” without controls.

The enduring claim should be simple:

capOS lets businesses test and delegate work to agents because the OS, not the prompt, enforces authority and records what happens.
