# Proposal: Capability-Oriented GPU/CUDA Integration


## Purpose

Define a minimal, capability-safe path to integrate GPU-class accelerators
(NVIDIA/CUDA, AMD, Intel, plus future ML-accelerator boards) into capOS without
expanding kernel trust.

The kernel keeps direct control of hardware arbitration and trust boundaries.
GPU hardware interaction is performed by a dedicated userspace driver service
that is invoked through capability calls and that holds device-scoped bootstrap
grants for its single managed device.

This proposal is a downstream consumer of:

- [LLM and agent proposal](llm-and-agent-proposal.md) -- defines the
  `LanguageModel`/`Embedder`/`ImageModel` capability surface that benefits from
  GPU-backed inference backends. The agent runtime treats a GPU-backed model
  process as just another `LanguageModel` capability holder; the GPU service
  proposed here is one of the substrate choices the model process may use.
- [Userspace binaries proposal](userspace-binaries-proposal.md) -- defines
  the native Rust userspace runtime built on `capos-rt`, the
  `x86_64-unknown-capos` target, and the libcapos C-substrate path that any
  vendor SDK adapter (CUDA, ROCm, OpenCL, oneAPI) must link against. The GPU
  service runs as one such userspace binary, not as a kernel module.

## Positioning Against Current Project State

capOS currently provides infrastructure that is directly load-bearing for a
future GPU service:

- Process lifecycle, page tables, preemptive scheduling (PIT 100 Hz,
  round-robin, context switching).
- Global and per-process capability tables with `CapObject` dispatch.
- Shared-memory capability ring (io_uring-inspired) with syscall-free SQE
  writes. `cap_enter` syscall for ordinary CALL dispatch and completion waits.
- PCI/PCIe enumeration over both legacy I/O ports and ACPI MCFG ECAM, plus
  reusable memory-BAR subregion validation and kernel MMIO mapping helpers
  for diagnostics and driver bring-up.
- MSI/MSI-X capability metadata discovery and typed MSI-X table programming,
  proven end-to-end through the virtio-net `make run-net` smoke.
- I/O APIC routing for masked legacy IRQ programming via MADT.
- Kernel-owned device interrupt source records plus a bounded first-fit
  device MSI vector pool with lock-free dispatch slots and claimed-route
  reassignment/release.
- Kernel-owned DMA pool accounting ledger that tracks pool bytes, live page
  count, page-rounded MMIO mapping bytes, interrupt holds, ring depth, and
  descriptor submission/completion counts for the current virtio-net path.
- Bootstrap-grant authority hooks for `DeviceMmio`, `DMAPool`, `Interrupt`,
  and `HardwareAuditLog` capabilities, exercised by the
  `make run-devicemmio-grant`, `make run-dmapool-grant`,
  `make run-interrupt-grant`, and `make run-hardware-audit` smokes.

What does **not** exist yet and gates real GPU work:

- A userspace driver-authority gate. Today the kernel still owns virtio-net,
  the DMA pool ledger, and the MSI-X dispatch table. The DDF bootstrap-grant
  smokes prove the schema and grant plumbing for the typed device caps, but
  there is no userspace driver process that consumes those grants to run a
  real driver. GPU integration cannot land before that gate moves.
- IOMMU/DMA-remapping integration (VT-d / AMD-Vi). Until a userspace driver
  is constrained by IOMMU domains, no production GPU stack can be granted
  bus-master DMA on a multi-tenant host.
- A `LanguageModel` capability surface to consume the GPU service. The LLM
  proposal defines the schema target; the GPU service is one backend choice.

That means GPU integration must be staged. The early phases are capability
schema and mock-service exercises that ride on the existing DDF bootstrap
grants; real hardware backends arrive after the userspace-driver authority
gate, IOMMU integration, and at least one consuming model surface exist.

## Design Principles

- Keep policy in kernel, execution in userspace. The kernel arbitrates
  device claims, MMIO mapping, MSI-X table programming, and DMA-pool
  accounting; the driver service implements vendor-specific command
  submission and queue management.
- Never expose raw PCI/MMIO/IRQ details to untrusted processes. Clients see
  only `GpuSession`/`GpuBuffer`/`GpuFence` capabilities, never `DeviceMmio`
  or `Interrupt`.
- Make GPU access explicit through narrow capabilities. The interface is the
  permission; a client that should not launch kernels is given a session
  type that does not expose `launchKernel`.
- Treat every stateful resource (session, buffer, queue, fence, command
  pool) as a capability with revocability and bounded lifetime.
- Avoid a Linux-driver-in-kernel compatibility dependency. Vendor SDK code
  runs in the userspace driver service, linked through libcapos /
  libcapos-posix shims where vendor headers expect a POSIX-ish surface.
- Charge GPU memory and submission depth through the existing
  `ResourceLedger` mechanism rather than inventing a parallel accounting
  surface.

## Proposed Architecture

`capOS kernel` (minimal) exposes only resource and mediation capabilities.

`gpu-device service` (userspace) receives device-specific bootstrap grants
(`DeviceMmio`, `DMAPool`, `Interrupt`, `HardwareAuditLog`) for exactly one
GPU function and exposes a stable GPU capability surface to clients.

`application` (e.g. an LLM model server, a numeric workload, a
[robot brain](robot-brain-proposal.md) inference loop) receives only
`GpuSession`/`GpuBuffer`/`GpuFence` capabilities and never sees the
device-scoped grants.

### Kernel responsibilities

- Discover GPUs from PCI/ACPI layers (already implemented for non-GPU
  functions; GPUs are the same discovery path with different class codes).
- Map/register BAR windows and grant a scoped `DeviceMmio` capability bound
  to one decoded memory BAR.
- Set up MSI/MSI-X routing and expose scoped `Interrupt` capability per
  vector with masked-route lifecycle semantics matching the current
  virtio-net proof.
- Hand out a bounded `DMAPool` capability whose accounting ledger charges
  back to the driver process's resource ledger and that participates in
  IOMMU-domain constraints once those exist.
- Enforce revocation when sessions are closed: `DeviceMmio`/`Interrupt`/
  `DMAPool` grants tear down through the bootstrap-grant manager.
- Record device-manager actions through `HardwareAuditLog` snapshots
  (already proven for the DDF smokes).
- Handle all faulting paths that would otherwise crash the kernel: a
  buggy driver service must crash the service, not the kernel.
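
The scoping rule for `DeviceMmio` (one grant, one decoded memory BAR) can be illustrated with a small bounds check. This is a minimal sketch, not the kernel's actual helper: `MemoryBar`, `MmioError`, `validate_subregion`, and the 4 KiB page-alignment requirement are all assumptions made for illustration.

```rust
/// Hypothetical description of one decoded memory BAR.
#[derive(Debug, Clone, Copy)]
pub struct MemoryBar {
    pub base: u64, // physical base address
    pub len: u64,  // decoded size in bytes
}

#[derive(Debug, PartialEq, Eq)]
pub enum MmioError {
    Empty,
    Misaligned,
    OutOfBounds,
}

/// Assumed mapping granule for this sketch.
pub const PAGE_SIZE: u64 = 4096;

/// Returns the physical range [start, end) of a page-aligned subregion,
/// or an error if the request is empty, misaligned, or escapes the BAR.
pub fn validate_subregion(
    bar: MemoryBar,
    offset: u64,
    len: u64,
) -> Result<(u64, u64), MmioError> {
    if len == 0 {
        return Err(MmioError::Empty);
    }
    if offset % PAGE_SIZE != 0 || len % PAGE_SIZE != 0 {
        return Err(MmioError::Misaligned);
    }
    // checked_add rejects requests that would wrap around u64.
    let end_offset = offset.checked_add(len).ok_or(MmioError::OutOfBounds)?;
    if end_offset > bar.len {
        return Err(MmioError::OutOfBounds);
    }
    let start = bar.base.checked_add(offset).ok_or(MmioError::OutOfBounds)?;
    let end = bar.base.checked_add(end_offset).ok_or(MmioError::OutOfBounds)?;
    Ok((start, end))
}
```

The point of the sketch is that every failure is a typed error returned to the caller, so a malformed request from a driver service is rejected rather than turned into a kernel fault.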

### Userspace GPU service responsibilities

- Open and initialize one GPU device from its device-scoped bootstrap
  grants. One driver process per GPU function is the working assumption;
  multi-function boards may run one process per function.
- Allocate and track GPU contexts, command queues, and DMA buffers backed
  by the granted `DMAPool`.
- Implement command submission, buffer lifecycle, fence/completion
  signaling, and timeout enforcement.
- Translate capability calls into vendor SDK operations (CUDA driver API,
  ROCm, oneAPI, OpenCL, or a vendor-neutral runtime such as a
  WebGPU/wgpu-style abstraction).
- Expose only narrow, capability-typed handles to callers and refuse any
  attempt to surface raw MMIO/IRQ/DMA to clients.

### Consumer surfaces

- LLM/embedder model servers from
  [llm-and-agent-proposal.md](llm-and-agent-proposal.md). The
  GPU-backed model process holds a `GpuSession`, exposes a `LanguageModel`
  or `Embedder` capability, and is itself a normal userspace binary built
  per [userspace-binaries-proposal.md](userspace-binaries-proposal.md).
- Numerical / HPC workloads from
  [hpc-parallel-patterns-proposal.md](hpc-parallel-patterns-proposal.md)
  once that proposal expands to GPU offload.
- Robotics inference loops from
  [robot-brain-proposal.md](robot-brain-proposal.md).

## Capability Contract (schema additions)

Add to `schema/capos.capnp` (interface-level sketch; final wire layout is
fixed in the implementation slice):

- `GpuDeviceManager`
  - `listDevices() -> (devices :List(GpuDeviceInfo))`
  - `openDevice(capabilityIndex :UInt32) -> (session :GpuSession)`
- `GpuSession`
  - `createBuffer(bytes :UInt64, usage :Text) -> (buffer :GpuBuffer)`
  - `destroyBuffer(buffer :UInt32) -> ()`
  - `launchKernel(program :Text, grid :UInt32, block :UInt32, bufferList :List(UInt32), fence :GpuFence) -> ()`
  - `submitMemcpy(dst :UInt32, src :UInt32, bytes :UInt64) -> ()`
  - `submitFenceWait(fence :UInt32) -> ()`
- `GpuBuffer`
  - `mapReadWrite() -> (addr :UInt64, len :UInt64)`
  - `unmap() -> ()`
  - `size() -> (bytes :UInt64)`
  - `close() -> ()`
- `GpuFence`
  - `poll() -> (status :Text)`
  - `wait(timeoutNanos :UInt64) -> (ok :Bool)`
  - `close() -> ()`
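
The `GpuFence` contract above (`poll` plus `wait(timeoutNanos) -> ok`) could behave roughly as follows. This is a host-side sketch that uses `std::sync` primitives as a stand-in; the `Fence` type and its `signal` method are hypothetical, and whatever blocking primitive `capos-rt` offers would replace `Condvar` in a real service.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::time::Duration;

/// Hypothetical fence: a shared "signaled" flag plus a condition variable.
#[derive(Clone)]
pub struct Fence {
    inner: Arc<(Mutex<bool>, Condvar)>,
}

impl Fence {
    pub fn new() -> Self {
        Fence { inner: Arc::new((Mutex::new(false), Condvar::new())) }
    }

    /// Called by the completion path (for example, interrupt handling).
    pub fn signal(&self) {
        let (lock, cvar) = &*self.inner;
        *lock.lock().unwrap() = true;
        cvar.notify_all();
    }

    /// Non-blocking status check.
    pub fn poll(&self) -> bool {
        *self.inner.0.lock().unwrap()
    }

    /// Blocks until signaled or until the timeout elapses.
    /// Returns true only if the fence was signaled.
    pub fn wait(&self, timeout_nanos: u64) -> bool {
        let (lock, cvar) = &*self.inner;
        let guard = lock.lock().unwrap();
        let (guard, _timed_out) = cvar
            .wait_timeout_while(guard, Duration::from_nanos(timeout_nanos), |signaled| {
                !*signaled
            })
            .unwrap();
        *guard
    }
}
```

Because `wait` always returns within the timeout, a client blocked on a fence regains control even if the device never completes the work.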

Sessions are the natural restriction point: a model-server session granted
to an LLM process can omit `launchKernel` entirely and expose only memcpy
plus an opaque `runProgram(programCap, ...)` if the model image is itself a
separately-vetted capability. The interface is the permission; do not add
parallel rights bitmasks.
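
One way to picture "the interface is the permission" is with Rust traits. The sketch below is illustrative only: in capOS the restriction would be a distinct Cap'n Proto interface, and `GpuTransfer`, `GpuCompute`, `FullSession`, `TransferOnlySession`, and `BufferId` are hypothetical names, with `launch_kernel` simplified from the schema sketch.

```rust
/// Stand-in for a buffer capability reference (hypothetical).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct BufferId(pub u32);

#[derive(Debug, PartialEq, Eq)]
pub enum GpuError {
    UnknownBuffer,
    OutOfRange,
}

/// Transfer-only surface: allocation and copies, no kernel launch.
pub trait GpuTransfer {
    fn create_buffer(&mut self, bytes: u64) -> BufferId;
    fn submit_memcpy(&mut self, dst: BufferId, src: BufferId, bytes: u64)
        -> Result<(), GpuError>;
}

/// Full surface: everything in GpuTransfer plus kernel launch.
pub trait GpuCompute: GpuTransfer {
    fn launch_kernel(&mut self, program: &str, buffers: &[BufferId])
        -> Result<(), GpuError>;
}

/// Toy backing state: buffer sizes only, no real device memory.
#[derive(Default)]
pub struct FullSession {
    sizes: Vec<u64>,
    launches: u32,
}

impl FullSession {
    fn size_of(&self, id: BufferId) -> Result<u64, GpuError> {
        self.sizes.get(id.0 as usize).copied().ok_or(GpuError::UnknownBuffer)
    }

    pub fn launches(&self) -> u32 {
        self.launches
    }
}

impl GpuTransfer for FullSession {
    fn create_buffer(&mut self, bytes: u64) -> BufferId {
        self.sizes.push(bytes);
        BufferId((self.sizes.len() - 1) as u32)
    }

    fn submit_memcpy(&mut self, dst: BufferId, src: BufferId, bytes: u64)
        -> Result<(), GpuError> {
        // A copy may not exceed the smaller of the two buffers.
        if bytes > self.size_of(dst)?.min(self.size_of(src)?) {
            return Err(GpuError::OutOfRange);
        }
        Ok(())
    }
}

impl GpuCompute for FullSession {
    fn launch_kernel(&mut self, _program: &str, buffers: &[BufferId])
        -> Result<(), GpuError> {
        for b in buffers {
            self.size_of(*b)?; // every referenced buffer must exist
        }
        self.launches += 1;
        Ok(())
    }
}

/// Restricted session: wraps a full session but implements only
/// GpuTransfer, so launch_kernel cannot be called through this type.
pub struct TransferOnlySession(FullSession);

impl TransferOnlySession {
    pub fn new(inner: FullSession) -> Self {
        TransferOnlySession(inner)
    }
}

impl GpuTransfer for TransferOnlySession {
    fn create_buffer(&mut self, bytes: u64) -> BufferId {
        self.0.create_buffer(bytes)
    }

    fn submit_memcpy(&mut self, dst: BufferId, src: BufferId, bytes: u64)
        -> Result<(), GpuError> {
        self.0.submit_memcpy(dst, src, bytes)
    }
}
```

A holder of `TransferOnlySession` has no method to launch a kernel, so there is no rights bit to check or forget to check; the restriction is enforced by what the type exposes.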

## Implementation Phases

### Phase 0 (prerequisite, landed): kernel capability ring and DDF grants

The Cap'n Proto schema, capability ring, `cap_enter` dispatch, PCI/MSI-X
discovery, and the `DeviceMmio`/`DMAPool`/`Interrupt`/`HardwareAuditLog`
bootstrap-grant smokes already exist. No new kernel surface is required for
this phase; the schema additions for `Gpu*` are pure userspace work once a
driver service is permitted.

### Phase 1: Userspace driver-authority gate (cross-track prerequisite)

GPU work cannot land before the userspace driver-authority gate. Required
pieces, tracked by the device-manager refactor and DMA-isolation design:

- Move virtio-net or another known-good driver out of the kernel and into
  a userspace driver process consuming the DDF bootstrap grants
  end-to-end.
- Add an IOMMU integration path (VT-d / AMD-Vi) so that bus-master DMA
  granted to a driver process is constrained to its registered DMA pages.
- Add a `device-manager` userspace service that owns
  `ManagerGrantSource`-class capabilities and is the only process that
  hands `DeviceMmio`/`DMAPool`/`Interrupt`/`HardwareAuditLog` grants to
  driver services.

This phase is owned by the device-manager and DMA-isolation tracks; the
GPU proposal consumes it.

### Phase 2: Mock GPU service

- Add the `Gpu*` schema in `schema/capos.capnp`.
- Implement a `gpu-mock` userspace service with the full `Gpu*` interface,
  no real driver, and synthetic fences and buffers backed by ordinary
  anonymous memory.
- Prove end-to-end:
  - device-manager spawns the mock driver and grants it a
    fake-device bootstrap grant set.
  - a client process opens a session, allocates and maps a buffer,
    submits a synthetic job, and waits on a fence.
- Add a focused QEMU smoke (`make run-gpu-mock`) that asserts the round-trip
  and demonstrates revocation on session close.
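
The round-trip that the mock phase is meant to prove can be sketched in-process. The following is a toy model under stated assumptions: `MockSession`, `MockError`, `FenceStatus`, the `submit_fill` synthetic job, and the `step` method (which stands in for device progress) are hypothetical, and a real `gpu-mock` would expose these operations through capability calls rather than direct method calls.

```rust
use std::collections::{HashMap, VecDeque};

#[derive(Debug, PartialEq, Eq)]
pub enum MockError {
    Revoked,
    UnknownHandle,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FenceStatus {
    Pending,
    Signaled,
}

/// One queued synthetic job: fill `buffer` with `value`, then signal `fence`.
struct FillJob {
    buffer: u32,
    value: u8,
    fence: u32,
}

pub struct MockSession {
    open: bool,
    next_id: u32,
    buffers: HashMap<u32, Vec<u8>>, // ordinary anonymous memory
    fences: HashMap<u32, FenceStatus>,
    queue: VecDeque<FillJob>,
}

impl MockSession {
    pub fn new() -> Self {
        MockSession {
            open: true,
            next_id: 0,
            buffers: HashMap::new(),
            fences: HashMap::new(),
            queue: VecDeque::new(),
        }
    }

    fn ensure_open(&self) -> Result<(), MockError> {
        if self.open { Ok(()) } else { Err(MockError::Revoked) }
    }

    fn fresh_id(&mut self) -> u32 {
        self.next_id += 1;
        self.next_id
    }

    pub fn create_buffer(&mut self, bytes: usize) -> Result<u32, MockError> {
        self.ensure_open()?;
        let id = self.fresh_id();
        self.buffers.insert(id, vec![0u8; bytes]);
        Ok(id)
    }

    /// Queues a synthetic job and returns the id of a pending fence.
    pub fn submit_fill(&mut self, buffer: u32, value: u8) -> Result<u32, MockError> {
        self.ensure_open()?;
        if !self.buffers.contains_key(&buffer) {
            return Err(MockError::UnknownHandle);
        }
        let fence = self.fresh_id();
        self.fences.insert(fence, FenceStatus::Pending);
        self.queue.push_back(FillJob { buffer, value, fence });
        Ok(fence)
    }

    /// Stand-in for device progress: run one queued job, signal its fence.
    pub fn step(&mut self) {
        if let Some(job) = self.queue.pop_front() {
            if let Some(buf) = self.buffers.get_mut(&job.buffer) {
                buf.fill(job.value);
            }
            self.fences.insert(job.fence, FenceStatus::Signaled);
        }
    }

    pub fn poll(&self, fence: u32) -> Result<FenceStatus, MockError> {
        self.ensure_open()?;
        self.fences.get(&fence).copied().ok_or(MockError::UnknownHandle)
    }

    pub fn read(&self, buffer: u32) -> Result<&[u8], MockError> {
        self.ensure_open()?;
        self.buffers
            .get(&buffer)
            .map(|b| b.as_slice())
            .ok_or(MockError::UnknownHandle)
    }

    /// Closing the session revokes every child buffer and fence.
    pub fn close(&mut self) {
        self.open = false;
        self.buffers.clear();
        self.fences.clear();
        self.queue.clear();
    }
}
```

A smoke along these lines would create a buffer, submit a fill, observe the fence move from `Pending` to `Signaled` after `step`, and then confirm that every handle returns `Revoked` once the session is closed.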

### Phase 3: Real backend integration on one vendor

- Pick one concrete GPU backend available in the CI environment (likely
  NVIDIA on a workstation host with `-device vfio-pci` passthrough into
  QEMU, or a virtio-gpu / venus virtualized path as a first stand-in).
- Vendor SDK code lives in the userspace driver process. Where the SDK
  expects a POSIX-ish surface, route it through
  [libcapos-posix](userspace-binaries-proposal.md) rather than expanding
  the kernel.
- Add queue lifecycle, fence lifecycle, DMA registration/validation, a
  command execution path, and interrupt-completion plumbing that reaches
  clients through fences.
- Keep backend replacement possible via a trait-like abstraction inside
  the driver process so a second vendor backend (AMD ROCm, Intel oneAPI)
  can be added later without rewriting the service.
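
The "trait-like abstraction" for backend replacement might look like the sketch below. All names here (`GpuBackend`, `MockBackend`, `GpuService`, `BackendError`) and the sequence-number completion model are assumptions for illustration, not a committed driver-internal API.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum BackendError {
    EmptyProgram,
    QueueFull,
}

/// Hypothetical vendor-neutral backend interface inside the driver process.
pub trait GpuBackend {
    fn name(&self) -> &'static str;
    /// Queues a command buffer; returns an increasing sequence number.
    fn submit(&mut self, commands: &[u8]) -> Result<u64, BackendError>;
    /// Retires finished work; returns the highest completed sequence number.
    fn retire(&mut self) -> u64;
}

/// Mock backend with a bounded queue and instant completion on retire.
pub struct MockBackend {
    depth: u64,
    submitted: u64,
    completed: u64,
}

impl MockBackend {
    pub fn new(depth: u64) -> Self {
        MockBackend { depth, submitted: 0, completed: 0 }
    }
}

impl GpuBackend for MockBackend {
    fn name(&self) -> &'static str {
        "mock"
    }

    fn submit(&mut self, commands: &[u8]) -> Result<u64, BackendError> {
        if commands.is_empty() {
            return Err(BackendError::EmptyProgram);
        }
        if self.submitted - self.completed >= self.depth {
            return Err(BackendError::QueueFull);
        }
        self.submitted += 1;
        Ok(self.submitted)
    }

    fn retire(&mut self) -> u64 {
        self.completed = self.submitted;
        self.completed
    }
}

/// The service holds the backend behind a trait object, so a second
/// vendor backend is a new impl rather than a service rewrite.
pub struct GpuService {
    backend: Box<dyn GpuBackend>,
}

impl GpuService {
    pub fn new(backend: Box<dyn GpuBackend>) -> Self {
        GpuService { backend }
    }

    pub fn backend_name(&self) -> &'static str {
        self.backend.name()
    }

    pub fn run(&mut self, commands: &[u8]) -> Result<u64, BackendError> {
        self.backend.submit(commands)
    }

    pub fn drain(&mut self) -> u64 {
        self.backend.retire()
    }
}
```

With this shape, the capability-facing code in `GpuService` never names a vendor; only the `Box<dyn GpuBackend>` passed at construction does.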

### Phase 4: Security and reliability hardening

- Per-session limits for mapped pages, in-flight submissions, and queue
  depth, charged through `ResourceLedger`.
- Bounded wait timeouts and explicit fence cancellation semantics so a
  hung GPU does not pin a client's `cap_enter`.
- Revocation propagation:
  - `GpuSession` close => all child `GpuBuffer`/`GpuFence` caps revoked.
  - driver crash / device reset => all active caps fail closed with a
    typed exception.
- Audit hooks for `launchKernel`/`submitMemcpy` recorded through
  `HardwareAuditLog`-style snapshots scoped to the GPU service.
- Coordination with the
  [live-upgrade proposal](live-upgrade-proposal.md) so the GPU driver
  service can be replaced without dropping client `GpuSession` caps.
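
The per-session limits in this phase can be sketched as a small budget object. This is not the `ResourceLedger` API, whose shape is defined elsewhere; `SessionBudget`, `BudgetError`, and the 4 KiB page size are hypothetical and only show the charge-before-commit pattern.

```rust
/// Assumed page size for rounding mapped bytes in this sketch.
pub const PAGE_SIZE: u64 = 4096;

#[derive(Debug, PartialEq, Eq)]
pub enum BudgetError {
    MappedPagesExceeded,
    InFlightExceeded,
}

/// Hypothetical per-session limits; the real ResourceLedger may differ.
pub struct SessionBudget {
    max_mapped_pages: u64,
    max_in_flight: u32,
    mapped_pages: u64,
    in_flight: u32,
}

/// Rounds a byte count up to whole pages without overflowing.
fn pages_for(bytes: u64) -> u64 {
    bytes / PAGE_SIZE + u64::from(bytes % PAGE_SIZE != 0)
}

impl SessionBudget {
    pub fn new(max_mapped_pages: u64, max_in_flight: u32) -> Self {
        SessionBudget { max_mapped_pages, max_in_flight, mapped_pages: 0, in_flight: 0 }
    }

    /// Charges a mapping; returns the number of pages charged.
    pub fn charge_map(&mut self, bytes: u64) -> Result<u64, BudgetError> {
        let pages = pages_for(bytes);
        let total = self
            .mapped_pages
            .checked_add(pages)
            .ok_or(BudgetError::MappedPagesExceeded)?;
        if total > self.max_mapped_pages {
            return Err(BudgetError::MappedPagesExceeded);
        }
        self.mapped_pages = total;
        Ok(pages)
    }

    /// Returns previously charged pages to the budget.
    pub fn release_map(&mut self, pages: u64) {
        self.mapped_pages = self.mapped_pages.saturating_sub(pages);
    }

    /// Charges one in-flight submission.
    pub fn charge_submit(&mut self) -> Result<(), BudgetError> {
        if self.in_flight >= self.max_in_flight {
            return Err(BudgetError::InFlightExceeded);
        }
        self.in_flight += 1;
        Ok(())
    }

    /// Marks one submission as completed.
    pub fn complete(&mut self) {
        self.in_flight = self.in_flight.saturating_sub(1);
    }
}
```

Charging before the operation commits means a session that hits its limit sees a typed error and no partial state, which keeps the accounting consistent under revocation.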

### Phase 5: Multi-tenant and multi-device

- Multiple driver processes (one per GPU function) under a single
  device-manager.
- Cross-device buffer sharing only through explicit capability transfer;
  no implicit peer mappings.
- Workload isolation: distinct tenants on a single GPU receive distinct
  sessions with their own queue, memory budget, and audit stream.

## Security Model

The kernel does not grant application processes direct MMIO, MSI, or
bus-master DMA access. Only driver services hold such authority, and they
receive it solely through the device-manager.

Application processes only receive:

- `GpuSession` / `GpuBuffer` / `GpuFence` capabilities with the methods
  the session policy chose to expose.

The GPU driver service process receives:

- `DeviceMmio` bound to the function's decoded BARs.
- `Interrupt` capabilities for the function's claimed MSI vectors.
- `DMAPool` bounded to the function's IOMMU domain.
- `HardwareAuditLog` for snapshotting device-manager actions.

This ensures:

- No application process can program BAR registers.
- No application process can register arbitrary memory for DMA.
- No application process can observe or reset another session's state.
- A buggy or compromised driver crashes the driver process, not the
  kernel; the device-manager observes the crash, fails outstanding
  capabilities closed, and re-spawns the driver on the next session
  request.
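
One possible mechanism for failing outstanding capabilities closed after a driver crash is a generation (epoch) check. The sketch below is an assumption, not the proposal's chosen design: `DeviceEpoch`, `ClientCap`, and `CapError` are hypothetical, and a shared in-process atomic stands in for state that the device-manager would own across process boundaries.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;

#[derive(Debug, PartialEq, Eq)]
pub enum CapError {
    DeviceReset,
}

/// Shared generation counter, bumped on driver crash or device reset.
#[derive(Default)]
pub struct DeviceEpoch(AtomicU64);

impl DeviceEpoch {
    pub fn current(&self) -> u64 {
        self.0.load(Ordering::Acquire)
    }

    /// Invalidates every capability issued under earlier generations.
    pub fn reset(&self) {
        self.0.fetch_add(1, Ordering::AcqRel);
    }
}

/// Hypothetical client-side capability that remembers its issue epoch.
pub struct ClientCap {
    epoch: Arc<DeviceEpoch>,
    issued_at: u64,
}

impl ClientCap {
    pub fn issue(epoch: &Arc<DeviceEpoch>) -> Self {
        ClientCap { epoch: Arc::clone(epoch), issued_at: epoch.current() }
    }

    /// Any call on a stale capability fails with a typed error.
    pub fn call(&self) -> Result<(), CapError> {
        if self.epoch.current() != self.issued_at {
            return Err(CapError::DeviceReset);
        }
        Ok(())
    }
}
```

A capability issued before the reset keeps failing after the driver is re-spawned, while capabilities issued afterwards work normally, which matches the fail-closed behavior described above.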

## Dependencies and Alignment

This proposal depends on:

- [Device-manager refactor proposal](device-manager-refactor-proposal.md)
  for the userspace device-manager that owns the bootstrap-grant sources.
- DMA-isolation design and IOMMU integration so DMA grants are
  enforceable in a multi-tenant context.
- [Userspace-binaries proposal](userspace-binaries-proposal.md) for the
  driver-process runtime, libcapos / libcapos-posix surface for vendor SDK
  consumption, and the `x86_64-unknown-capos` target.
- [LLM and agent proposal](llm-and-agent-proposal.md) for the primary
  consumer surface (`LanguageModel`, `Embedder`) and the agent runtime
  that exercises GPU-backed inference end-to-end.
- [Resource-accounting proposal](resource-accounting-proposal.md) for
  per-session memory and submission budgets.
- [Live-upgrade proposal](live-upgrade-proposal.md) for driver-service
  replacement without dropping `GpuSession` capabilities.

It complements:

- Service-architecture and authority-broker proposals.
- Storage/service manifest execution flow for shipping GPU service
  binaries and their bootstrap grants.
- In-process threading work for future queue completion callbacks and
  worker pools inside the driver service.

## Minimal acceptance criteria

- `make run-gpu-mock` boots and prints GPU service lifecycle messages.
- The device-manager spawns the GPU service and grants only device-scoped
  bootstrap grants for a single mock function.
- A sample userspace client (Rust over capos-rt; C smoke later through
  libcapos) can create a session, allocate and map a GPU buffer, submit a
  synthetic job, and wait on a fence with a typed completion result.
- Attempts to submit unsupported or malformed operations return explicit
  capnp `CapException` results, not driver crashes.
- Removing the session capability invalidates descendant buffer and fence
  caps without kernel restart.
- A subsequent slice points an LLM model server at the GPU service and
  proves a `LanguageModel.generate(...)` round-trip backed by the GPU
  session, satisfying the LLM proposal's GPU-backend integration point.

## Risks

- Integrating NVIDIA's closed-source stack may require vendor-specific
  adaptation that is hostile to a capability shim; the AMD ROCm or
  vendor-neutral path (Vulkan compute, WebGPU/wgpu) may land first.
- Buffer mapping semantics become complex with paging, fragmentation, and
  IOMMU domains. Pinned physical-memory-only buffers are the conservative
  starting point.
- Interrupt-heavy completion paths require the scheduler evolution work
  (per-CPU run queues, fairness) before client-visible completion
  guarantees scale beyond a single workload.
- Vendor SDKs assume a POSIX-ish process model; the libcapos-posix surface
  has to grow enough to host them without leaking ambient authority.
- A GPU driver process is privileged from the application's point of
  view. Compromise of a single driver process must remain bounded to one
  GPU function and one tenant set; the device-manager and IOMMU are the
  load-bearing controls there.

## Open Questions

- Is CUDA mandatory from the first integration, or is the initial surface
  command-focused (opaque "program" bytes interpreted by the driver),
  with CUDA-runtime-specific support added later?
- Should memory registration support pinned physical memory only at first,
  or attempt to expose unified-virtual-memory semantics through the
  client's `VirtualMemory` capability?
- Which isolation level does the first real-backend phase need:
  multi-tenant or single-tenant? Single-tenant per GPU function is the
  conservative default; MIG / SR-IOV-style partitioning is later work.
- Does the GPU service expose model artifacts (weights, programs) as
  separate capability types so a model file can be granted to clients
  without the full session, or are programs always inline arguments?