Proposal: Capability-Oriented GPU/CUDA Integration

Purpose

Define a minimal, capability-safe path to integrate GPU-class accelerators (NVIDIA/CUDA, AMD, Intel, plus future ML-accelerator boards) into capOS without expanding kernel trust.

The kernel keeps direct control of hardware arbitration and trust boundaries. GPU hardware interaction is performed by a dedicated userspace driver service that is invoked through capability calls and that holds device-scoped bootstrap grants for its single managed device.

This proposal is a downstream consumer of:

LLM and agent proposal – defines the LanguageModel/Embedder/ImageModel capability surface that benefits from GPU-backed inference backends. The agent runtime treats a GPU-backed model process as just another LanguageModel capability holder; the GPU service proposed here is one of the substrate choices the model process may use.
Userspace binaries proposal – defines the native Rust over capos-rt userspace runtime, the x86_64-unknown-capos target, and the libcapos C-substrate path that any vendor SDK adapter (CUDA, ROCm, OpenCL, oneAPI) must link against. The GPU service runs as one such userspace binary, not as a kernel module.

Positioning Against Current Project State

capOS currently provides infrastructure that is directly load-bearing for a future GPU service:

Process lifecycle, page tables, preemptive scheduling (PIT 100 Hz, round-robin, context switching).
A global and per-process capability table with CapObject dispatch.
Shared-memory capability ring (io_uring-inspired) with syscall-free SQE writes. cap_enter syscall for ordinary CALL dispatch and completion waits.
PCI/PCIe enumeration over both legacy I/O ports and ACPI MCFG ECAM, plus reusable memory-BAR subregion validation and kernel MMIO mapping helpers for diagnostics and driver bring-up.
MSI/MSI-X capability metadata discovery and typed MSI-X table programming, proven end-to-end through the virtio-net make run-net smoke.
I/O APIC routing for masked legacy IRQ programming via MADT.
Kernel-owned device interrupt source records plus a bounded first-fit device MSI vector pool with lock-free dispatch slots and claimed-route reassignment/release.
Kernel-owned DMA pool accounting ledger that tracks pool bytes, live page count, page-rounded MMIO mapping bytes, interrupt holds, ring depth, and descriptor submission/completion counts for the current virtio-net path.
Bootstrap-grant authority hooks for DeviceMmio, DMAPool, Interrupt, and HardwareAuditLog capabilities, exercised by the make run-devicemmio-grant, make run-dmapool-grant, make run-interrupt-grant, and make run-hardware-audit smokes.

What does not exist yet and gates real GPU work:

A userspace driver-authority gate. Today the kernel still owns virtio-net, the DMA pool ledger, and the MSI-X dispatch table. The DDF bootstrap-grant smokes prove the schema and grant plumbing for the typed device caps, but there is no userspace driver process that consumes those grants to run a real driver. GPU integration cannot land before that gate moves.
IOMMU/DMA-remapping integration (VT-d / AMD-Vi). Until a userspace driver is constrained by IOMMU domains, no production GPU stack can be granted bus-master DMA on a multi-tenant host.
A LanguageModel capability surface to consume the GPU service. The LLM proposal defines the schema target; the GPU service is one backend choice.

That means GPU integration must be staged. The early phases are capability schema and mock-service exercises that ride on the existing DDF bootstrap grants; real hardware backends arrive after the userspace-driver authority gate, IOMMU integration, and at least one consuming model surface exist.

Design Principles

Keep policy in kernel, execution in userspace. The kernel arbitrates device claims, MMIO mapping, MSI-X table programming, and DMA-pool accounting; the driver service implements vendor-specific command submission and queue management.
Never expose raw PCI/MMIO/IRQ details to untrusted processes. Clients see only GpuSession/GpuBuffer/GpuFence capabilities, never DeviceMmio or Interrupt.
Make GPU access explicit through narrow capabilities. The interface is the permission; a client that should not launch kernels is given a session type that does not expose launchKernel.
Treat every stateful resource (session, buffer, queue, fence, command pool) as a capability with revocability and bounded lifetime.
Avoid a Linux-driver-in-kernel compatibility dependency. Vendor SDK code runs in the userspace driver service, linked through libcapos / libcapos-posix shims where vendor headers expect a POSIX-ish surface.
Charge GPU memory and submission depth through the existing ResourceLedger mechanism rather than inventing a parallel accounting surface.
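
The last principle can be illustrated with a minimal sketch of per-session accounting. The ResourceLedger API is not shown in this proposal, so `SessionLedger`, `LedgerError`, and every method name below are hypothetical stand-ins; the sketch only shows the intended shape: a charge either fits the budget or fails without changing state.

```rust
// Hypothetical per-session budget. The real ResourceLedger interface is not
// defined in this proposal; all names here are illustrative assumptions.
#[derive(Debug, PartialEq, Eq)]
pub enum LedgerError {
    MemoryBudgetExceeded,
    SubmissionDepthExceeded,
}

#[derive(Debug)]
pub struct SessionLedger {
    max_bytes: u64,
    used_bytes: u64,
    max_in_flight: u32,
    in_flight: u32,
}

impl SessionLedger {
    pub fn new(max_bytes: u64, max_in_flight: u32) -> Self {
        Self { max_bytes, used_bytes: 0, max_in_flight, in_flight: 0 }
    }

    /// Charge a buffer allocation; on failure the ledger is left unchanged.
    pub fn charge_bytes(&mut self, bytes: u64) -> Result<(), LedgerError> {
        let total = self
            .used_bytes
            .checked_add(bytes)
            .ok_or(LedgerError::MemoryBudgetExceeded)?;
        if total > self.max_bytes {
            return Err(LedgerError::MemoryBudgetExceeded);
        }
        self.used_bytes = total;
        Ok(())
    }

    /// Return bytes to the budget when a buffer is destroyed.
    pub fn release_bytes(&mut self, bytes: u64) {
        self.used_bytes = self.used_bytes.saturating_sub(bytes);
    }

    /// Reserve one submission slot; fails once the depth limit is reached.
    pub fn begin_submission(&mut self) -> Result<(), LedgerError> {
        if self.in_flight >= self.max_in_flight {
            return Err(LedgerError::SubmissionDepthExceeded);
        }
        self.in_flight += 1;
        Ok(())
    }

    /// Free one submission slot when a job completes.
    pub fn complete_submission(&mut self) {
        self.in_flight = self.in_flight.saturating_sub(1);
    }

    pub fn used_bytes(&self) -> u64 {
        self.used_bytes
    }
}
```

In this shape the driver service would call `charge_bytes` before backing a buffer and `begin_submission` before queuing work, so an over-budget client sees a typed error rather than exhausting the DMA pool.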

Proposed Architecture

capOS kernel (minimal) exposes only resource and mediation capabilities.

gpu-device service (userspace) receives device-specific bootstrap grants (DeviceMmio, DMAPool, Interrupt, HardwareAuditLog) for exactly one GPU function and exposes a stable GPU capability surface to clients.

application (e.g. an LLM model server, a numeric workload, a robot brain inference loop) receives only GpuSession/GpuBuffer/GpuFence capabilities and never sees the device-scoped grants.

Kernel responsibilities

Discover GPUs from PCI/ACPI layers (already implemented for non-GPU functions; GPUs use the same discovery path with different class codes).
Map/register BAR windows and grant a scoped DeviceMmio capability bound to one decoded memory BAR.
Set up MSI/MSI-X routing and expose scoped Interrupt capability per vector with masked-route lifecycle semantics matching the current virtio-net proof.
Hand out a bounded DMAPool capability whose accounting ledger charges back to the driver process’s resource ledger and that participates in IOMMU-domain constraints once those exist.
Enforce revocation when sessions are closed: DeviceMmio/Interrupt/DMAPool grants tear down through the bootstrap-grant manager.
Record device-manager actions through HardwareAuditLog snapshots (already proven for the DDF smokes).
Handle all faulting paths that would otherwise crash the kernel: a buggy driver service must crash the service, not the kernel.
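
As one illustration of the scoping the kernel performs before handing out a DeviceMmio grant, the following is a minimal sketch of a BAR subregion check. It is not the existing kernel helper: `BarWindow`, `MmioError`, `validate_subregion`, and the 4 KiB page-alignment requirement are all assumptions made for the example.

```rust
// Illustrative only: names and the page-alignment rule are assumptions,
// not the kernel's actual memory-BAR validation helper.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct BarWindow {
    pub base: u64,
    pub len: u64,
}

#[derive(Debug, PartialEq, Eq)]
pub enum MmioError {
    Empty,
    Unaligned,
    OutOfBounds,
}

const PAGE: u64 = 4096;

/// Return the window covered by `offset..offset + len` inside `bar`,
/// rejecting empty, unaligned, overflowing, or out-of-range requests.
pub fn validate_subregion(bar: BarWindow, offset: u64, len: u64) -> Result<BarWindow, MmioError> {
    if len == 0 {
        return Err(MmioError::Empty);
    }
    if offset % PAGE != 0 || len % PAGE != 0 {
        return Err(MmioError::Unaligned);
    }
    // checked_add guards against wrap-around before the bounds comparison.
    let end = offset.checked_add(len).ok_or(MmioError::OutOfBounds)?;
    if end > bar.len {
        return Err(MmioError::OutOfBounds);
    }
    let base = bar.base.checked_add(offset).ok_or(MmioError::OutOfBounds)?;
    Ok(BarWindow { base, len })
}
```

A grant built from the returned window cannot reach registers outside the decoded BAR, which is the property the scoped DeviceMmio capability relies on.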

Userspace GPU service responsibilities

Open and initialize one GPU device from its device-scoped bootstrap grants. One driver process per GPU function is the working assumption; multi-function boards may run one process per function.
Allocate and track GPU contexts, command queues, and DMA buffers backed by the granted DMAPool.
Implement command submission, buffer lifecycle, fence/completion signaling, and timeout enforcement.
Translate capability calls into vendor SDK operations (CUDA driver API, ROCm, oneAPI, OpenCL, or a vendor-neutral runtime such as a WebGPU/wgpu-style abstraction).
Expose only narrow, capability-typed handles to callers and refuse any attempt to surface raw MMIO/IRQ/DMA to clients.

Consumer surfaces

LLM/embedder model servers from Language Models and Agent Runtime. The GPU-backed model process holds a GpuSession, exposes a LanguageModel or Embedder capability, and is itself a normal userspace binary built per Userspace Binaries.
Numerical / HPC workloads from HPC Parallel Processing Patterns once that proposal expands to GPU offload.
Robotics inference loops from capOS As A Robot Brain.

Capability Contract (schema additions)

Add to schema/capos.capnp (interface-level sketch; final wire layout is fixed in the implementation slice):

GpuDeviceManager
- listDevices() -> (devices: List(GpuDeviceInfo))
- openDevice(capabilityIndex :UInt32) -> (session :GpuSession)
GpuSession
- createBuffer(bytes :UInt64, usage :Text) -> (buffer :GpuBuffer)
- destroyBuffer(buffer :UInt32) -> ()
- launchKernel(program :Text, grid :UInt32, block :UInt32, bufferList :List(UInt32), fence :GpuFence) -> ()
- submitMemcpy(dst :UInt32, src :UInt32, bytes :UInt64) -> ()
- submitFenceWait(fence :UInt32) -> ()
GpuBuffer
- mapReadWrite() -> (addr :UInt64, len :UInt64)
- unmap() -> ()
- size() -> (bytes :UInt64)
- close() -> ()
GpuFence
- poll() -> (status :Text)
- wait(timeoutNanos :UInt64) -> (ok :Bool)
- close() -> ()

Sessions are the natural restriction point: a model-server session granted to an LLM process can omit launchKernel entirely and expose only memcpy plus an opaque runProgram(programCap, ...) if the model image is itself a separately-vetted capability. The interface is the permission; do not add parallel rights bitmasks.
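
A rough Rust mirror of this idea, assuming the session surface is split into two interfaces, looks like the sketch below. The trait names `TransferSession` and `ComputeSession`, the `MockSession` type, and the error variants are hypothetical; the schema above does not define them.

```rust
// Illustrative Rust mirror of the schema sketch. All names here are assumptions.
pub type BufferId = u32;

#[derive(Debug, PartialEq, Eq)]
pub enum GpuError {
    UnknownBuffer,
    OutOfRange,
}

/// Authority every session carries: buffer creation and copies.
pub trait TransferSession {
    fn create_buffer(&mut self, bytes: u64) -> BufferId;
    fn submit_memcpy(&mut self, dst: BufferId, src: BufferId, bytes: u64) -> Result<(), GpuError>;
}

/// Additional authority: only this interface exposes kernel launch.
pub trait ComputeSession: TransferSession {
    fn launch_kernel(
        &mut self,
        program: &str,
        grid: u32,
        block: u32,
        buffers: &[BufferId],
    ) -> Result<(), GpuError>;
}

/// In-memory stand-in for the driver service; it tracks buffer sizes only.
#[derive(Default)]
pub struct MockSession {
    sizes: Vec<u64>,
    launches: u32,
}

impl MockSession {
    pub fn new() -> Self {
        Self::default()
    }

    pub fn launches(&self) -> u32 {
        self.launches
    }

    fn size_of(&self, id: BufferId) -> Result<u64, GpuError> {
        self.sizes.get(id as usize).copied().ok_or(GpuError::UnknownBuffer)
    }
}

impl TransferSession for MockSession {
    fn create_buffer(&mut self, bytes: u64) -> BufferId {
        self.sizes.push(bytes);
        (self.sizes.len() - 1) as BufferId
    }

    fn submit_memcpy(&mut self, dst: BufferId, src: BufferId, bytes: u64) -> Result<(), GpuError> {
        let (d, s) = (self.size_of(dst)?, self.size_of(src)?);
        if bytes > d || bytes > s {
            return Err(GpuError::OutOfRange);
        }
        Ok(())
    }
}

impl ComputeSession for MockSession {
    fn launch_kernel(
        &mut self,
        _program: &str,
        _grid: u32,
        _block: u32,
        buffers: &[BufferId],
    ) -> Result<(), GpuError> {
        // Every referenced buffer must belong to this session.
        for &id in buffers {
            self.size_of(id)?;
        }
        self.launches += 1;
        Ok(())
    }
}

/// A model server is handed only the transfer interface, so a call to
/// `launch_kernel` inside this function would not compile.
pub fn model_server_copy(session: &mut dyn TransferSession) -> Result<(), GpuError> {
    let src = session.create_buffer(1024);
    let dst = session.create_buffer(1024);
    session.submit_memcpy(dst, src, 1024)
}
```

The restriction lives in the type the client receives rather than in a rights field checked at call time, which is the same property the capability interfaces are meant to provide across process boundaries.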

Implementation Phases

Phase 0 (prerequisite, landed): kernel capability ring and DDF grants

The Cap’n Proto schema, capability ring, cap_enter dispatch, PCI/MSI-X discovery, and the DeviceMmio/DMAPool/Interrupt/HardwareAuditLog bootstrap-grant smokes already exist. No new kernel surface is required for this phase; the schema additions for Gpu* are pure userspace work once a driver service is permitted.

Phase 1: Userspace driver-authority gate (cross-track prerequisite)

GPU work cannot land before the userspace driver-authority gate. Required pieces, tracked by the device-manager refactor and DMA-isolation design:

Move virtio-net or another known-good driver out of the kernel and into a userspace driver process consuming the DDF bootstrap grants end-to-end.
Add an IOMMU integration path (VT-d / AMD-Vi) so that bus-master DMA granted to a driver process is constrained to its registered DMA pages.
Add a device-manager userspace service that owns ManagerGrantSource-class capabilities and is the only process that hands DeviceMmio/DMAPool/Interrupt/HardwareAuditLog grants to driver services.

This phase is owned by the device-manager and DMA-isolation tracks; the GPU proposal consumes it.

Phase 2: Mock GPU service

Add the Gpu* schema in schema/capos.capnp.
Implement a gpu-mock userspace service with the full Gpu* interface, no real driver, and synthetic fences and buffers backed by ordinary anonymous memory.
Prove end-to-end:
- device-manager spawns the mock driver and grants it a fake-device bootstrap grant set.
- a client process opens a session, allocates and maps a buffer, submits a synthetic job, and waits on a fence.
Add a focused QEMU smoke (make run-gpu-mock) that asserts the round-trip and demonstrates revocation on session close.
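
The round-trip the smoke is meant to assert can be sketched in-process as follows. This is a host-side illustration using `std`, not the gpu-mock service itself: `MockGpuSession`, `submit_fill`, and the error variants are hypothetical, buffers are plain `Vec<u8>` allocations, and the synthetic job completes synchronously so its fence is already signaled when returned.

```rust
use std::collections::HashMap;

// Hypothetical mock types; the real gpu-mock service would sit behind
// capability calls rather than direct method calls.
#[derive(Debug, PartialEq, Eq)]
pub enum MockError {
    SessionClosed,
    UnknownBuffer,
    UnknownFence,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FenceStatus {
    Pending,
    Signaled,
}

pub struct MockGpuSession {
    open: bool,
    next_id: u32,
    buffers: HashMap<u32, Vec<u8>>,
    fences: HashMap<u32, FenceStatus>,
}

impl MockGpuSession {
    pub fn open() -> Self {
        Self { open: true, next_id: 0, buffers: HashMap::new(), fences: HashMap::new() }
    }

    fn check_open(&self) -> Result<(), MockError> {
        if self.open { Ok(()) } else { Err(MockError::SessionClosed) }
    }

    fn fresh_id(&mut self) -> u32 {
        let id = self.next_id;
        self.next_id += 1;
        id
    }

    /// Allocate a zeroed buffer backed by ordinary memory.
    pub fn create_buffer(&mut self, bytes: usize) -> Result<u32, MockError> {
        self.check_open()?;
        let id = self.fresh_id();
        self.buffers.insert(id, vec![0; bytes]);
        Ok(id)
    }

    /// Give the client a writable view of the buffer contents.
    pub fn map(&mut self, buffer: u32) -> Result<&mut [u8], MockError> {
        self.check_open()?;
        self.buffers
            .get_mut(&buffer)
            .map(|b| b.as_mut_slice())
            .ok_or(MockError::UnknownBuffer)
    }

    /// Synthetic job: fill the buffer, then return an already-signaled fence.
    pub fn submit_fill(&mut self, buffer: u32, value: u8) -> Result<u32, MockError> {
        self.check_open()?;
        let buf = self.buffers.get_mut(&buffer).ok_or(MockError::UnknownBuffer)?;
        buf.fill(value);
        let fence = self.fresh_id();
        self.fences.insert(fence, FenceStatus::Signaled);
        Ok(fence)
    }

    pub fn poll(&self, fence: u32) -> Result<FenceStatus, MockError> {
        self.check_open()?;
        self.fences.get(&fence).copied().ok_or(MockError::UnknownFence)
    }

    /// Closing the session drops every buffer and fence it created.
    pub fn close(&mut self) {
        self.open = false;
        self.buffers.clear();
        self.fences.clear();
    }
}
```

Malformed requests (an unknown buffer id, a call after close) come back as typed errors, which is the behaviour the acceptance criteria ask the real smoke to demonstrate.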

Phase 3: Real backend integration on one vendor

Pick one concrete GPU backend available in the CI environment (likely NVIDIA on a workstation host with -device vfio-pci passthrough into QEMU, or a virtio-gpu / venus virtualized path as a first stand-in).
Vendor SDK code lives in the userspace driver process. Where the SDK expects a POSIX-ish surface, route it through libcapos-posix rather than expanding the kernel.
Add queue lifecycle, fence lifecycle, DMA registration/validation, the command execution path, and interrupt-completion plumbing back to clients through fences.
Keep backend replacement possible via a trait-like abstraction inside the driver process so a second vendor backend (AMD ROCm, Intel oneAPI) can be added later without rewriting the service.
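
One possible shape for that trait-like abstraction is sketched below. The `GpuBackend` trait, its three methods, and both toy backends are assumptions made for illustration; a real backend would wrap vendor SDK calls, which are deliberately not shown here.

```rust
// Hypothetical backend seam inside the driver process. No vendor SDK calls
// are modelled; both backends are toys that only track sequence numbers.
#[derive(Debug, PartialEq, Eq)]
pub enum BackendError {
    EmptyProgram,
}

pub trait GpuBackend {
    fn name(&self) -> &'static str;
    /// Queue a program and return its submission sequence number.
    fn submit(&mut self, program: &[u8]) -> Result<u64, BackendError>;
    fn is_complete(&self, seq: u64) -> bool;
}

/// Completes every submission immediately.
#[derive(Default)]
pub struct NullBackend {
    next_seq: u64,
}

impl GpuBackend for NullBackend {
    fn name(&self) -> &'static str {
        "null"
    }
    fn submit(&mut self, _program: &[u8]) -> Result<u64, BackendError> {
        let seq = self.next_seq;
        self.next_seq += 1;
        Ok(seq)
    }
    fn is_complete(&self, seq: u64) -> bool {
        seq < self.next_seq
    }
}

/// Completes a submission only once a later one has been queued.
#[derive(Default)]
pub struct LaggingBackend {
    next_seq: u64,
}

impl GpuBackend for LaggingBackend {
    fn name(&self) -> &'static str {
        "lagging"
    }
    fn submit(&mut self, _program: &[u8]) -> Result<u64, BackendError> {
        let seq = self.next_seq;
        self.next_seq += 1;
        Ok(seq)
    }
    fn is_complete(&self, seq: u64) -> bool {
        seq + 1 < self.next_seq
    }
}

/// Service logic is written once against the trait object.
pub struct DriverService {
    backend: Box<dyn GpuBackend>,
}

impl DriverService {
    pub fn new(backend: Box<dyn GpuBackend>) -> Self {
        Self { backend }
    }

    pub fn backend_name(&self) -> &'static str {
        self.backend.name()
    }

    /// Validation shared by all backends happens before dispatch.
    pub fn submit(&mut self, program: &[u8]) -> Result<u64, BackendError> {
        if program.is_empty() {
            return Err(BackendError::EmptyProgram);
        }
        self.backend.submit(program)
    }

    pub fn is_complete(&self, seq: u64) -> bool {
        self.backend.is_complete(seq)
    }
}
```

Swapping `NullBackend` for `LaggingBackend` changes completion behaviour without touching `DriverService`, which is the property a second vendor backend would depend on.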

Phase 4: Security and reliability hardening

Per-session limits for mapped pages, in-flight submissions, and queue depth, charged through ResourceLedger.
Bounded wait timeouts and explicit fence cancellation semantics so a hung GPU does not pin a client’s cap_enter.
Revocation propagation:
- GpuSession close => all child GpuBuffer/GpuFence caps revoked.
- driver crash / device reset => all active caps fail closed with a typed exception.
Audit hooks for launchKernel/submitMemcpy recorded through HardwareAuditLog-style snapshots scoped to the GPU service.
Coordination with the live-upgrade proposal so the GPU driver service can be replaced without dropping client GpuSession caps.
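
The bounded-wait and revocation items above can be sketched together. The example below is a host-side illustration built on `std::sync`, not the capOS runtime: `Fence`, `Session`, `WaitResult`, and their methods are hypothetical names. A wait returns on signal, on timeout, or when the owning session is closed, so a hung job cannot block the waiter indefinitely.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::time::Duration;

// Hypothetical types for illustration; std primitives stand in for the
// capability ring and completion path.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum FenceState {
    Pending,
    Signaled,
    Revoked,
}

#[derive(Debug, PartialEq, Eq)]
pub enum WaitResult {
    Signaled,
    TimedOut,
    Revoked,
}

struct Shared {
    state: Mutex<FenceState>,
    cv: Condvar,
}

#[derive(Clone)]
pub struct Fence {
    shared: Arc<Shared>,
}

impl Fence {
    fn new() -> Self {
        Self {
            shared: Arc::new(Shared { state: Mutex::new(FenceState::Pending), cv: Condvar::new() }),
        }
    }

    /// Move out of Pending exactly once; later transitions are ignored.
    fn transition(&self, new: FenceState) {
        let mut st = self.shared.state.lock().unwrap();
        if *st == FenceState::Pending {
            *st = new;
            self.shared.cv.notify_all();
        }
    }

    pub fn signal(&self) {
        self.transition(FenceState::Signaled);
    }

    fn revoke(&self) {
        self.transition(FenceState::Revoked);
    }

    /// Block for at most `timeout`; never waits unboundedly.
    pub fn wait(&self, timeout: Duration) -> WaitResult {
        let guard = self.shared.state.lock().unwrap();
        let (guard, _) = self
            .shared
            .cv
            .wait_timeout_while(guard, timeout, |s| *s == FenceState::Pending)
            .unwrap();
        match *guard {
            FenceState::Pending => WaitResult::TimedOut,
            FenceState::Signaled => WaitResult::Signaled,
            FenceState::Revoked => WaitResult::Revoked,
        }
    }
}

pub struct Session {
    fences: Vec<Fence>,
}

impl Session {
    pub fn new() -> Self {
        Self { fences: Vec::new() }
    }

    pub fn new_fence(&mut self) -> Fence {
        let f = Fence::new();
        self.fences.push(f.clone());
        f
    }

    /// Closing the session fails every still-pending child fence closed.
    pub fn close(self) {
        for f in &self.fences {
            f.revoke();
        }
    }
}
```

In this sketch a driver-crash handler could reuse the same revoke path as `close`, so waiters observe a typed `Revoked` result instead of hanging; fences that already signaled keep their result.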

Phase 5: Multi-tenant and multi-device

Multiple driver processes (one per GPU function) under a single device-manager.
Cross-device buffer sharing only through explicit capability transfer; no implicit peer mappings.
Workload isolation: distinct tenants on a single GPU receive distinct sessions with their own queue, memory budget, and audit stream.

Security Model

The kernel grants no process ambient MMIO, MSI, or bus-master DMA access. All such authority exists only as device-scoped capabilities handed out through the device-manager.

Application processes only receive:

GpuSession / GpuBuffer / GpuFence capabilities with the methods the session policy chose to expose.

The GPU driver service process receives:

DeviceMmio bound to the function’s decoded BARs.
Interrupt capabilities for the function’s claimed MSI vectors.
DMAPool bounded to the function’s IOMMU domain.
HardwareAuditLog for snapshotting device-manager actions.

This ensures:

No application process can program BAR registers.
No application process can register arbitrary memory for DMA.
No application process can observe or reset another session’s state.
A buggy or compromised driver crashes the driver process, not the kernel; the device-manager observes the crash, fails outstanding capabilities closed, and re-spawns the driver on the next session request.

Dependencies and Alignment

This proposal depends on:

Device-manager refactor proposal for the userspace device-manager that owns the bootstrap-grant sources.
DMA-isolation design and IOMMU integration so DMA grants are enforceable in a multi-tenant context.
Userspace-binaries proposal for the driver-process runtime, libcapos / libcapos-posix surface for vendor SDK consumption, and the x86_64-unknown-capos target.
LLM and agent proposal for the primary consumer surface (LanguageModel, Embedder) and the agent runtime that exercises GPU-backed inference end-to-end.
Resource-accounting proposal for per-session memory and submission budgets.
Live-upgrade proposal for driver-service replacement without dropping GpuSession capabilities.

It complements:

Service-architecture and authority-broker proposals.
Storage/service manifest execution flow for shipping GPU service binaries and their bootstrap grants.
In-process threading work for future queue completion callbacks and worker pools inside the driver service.

Minimal acceptance criteria

make run-gpu-mock boots and prints GPU service lifecycle messages.
The device-manager spawns the GPU service and grants only device-scoped bootstrap grants for a single mock function.
A sample userspace client (Rust over capos-rt; C smoke later through libcapos) can create a session, allocate and map a GPU buffer, submit a synthetic job, and wait on a fence with a typed completion result.
Attempts to submit unsupported or malformed operations return explicit capnp CapException results, not driver crashes.
Removing the session capability invalidates descendant buffer and fence caps without kernel restart.
A subsequent slice points an LLM model server at the GPU service and proves a LanguageModel.generate(...) round-trip backed by the GPU session, satisfying the LLM proposal’s GPU-backend integration point.

Risks

Integrating NVIDIA’s closed stack may require vendor-specific adaptation that is hostile to a capability shim; the AMD ROCm or vendor-neutral path (Vulkan compute, WebGPU/wgpu) may land first.
Buffer mapping semantics become complex with paging, fragmentation, and IOMMU domains. Pinned physical-memory-only buffers are the conservative starting point.
Interrupt-heavy completion paths require the scheduler evolution work (per-CPU run queues, fairness) before client-visible completion guarantees scale beyond a single workload.
Vendor SDKs assume a POSIX-ish process model; the libcapos-posix surface has to grow enough to host them without leaking ambient authority.
A GPU driver process is privileged from the application’s point of view. Compromise of a single driver process must remain bounded to one GPU function and one tenant set; the device-manager and IOMMU are the load-bearing controls there.

Open Questions

Is CUDA mandatory from first integration, or is the initial surface command-focused (opaque “program” bytes interpreted by the driver) with CUDA runtime-specific support added later?
Should memory registration support pinned physical memory only at first, or attempt to expose unified-virtual-memory semantics through the client’s VirtualMemory capability?
Which isolation level is needed for multi-tenant versus single-tenant in the first real-backend phase? Single-tenant per GPU function is the conservative default; MIG / SR-IOV-style partitioning is later work.
Does the GPU service expose model artifacts (weights, programs) as separate capability types so a model file can be granted to clients without the full session, or are programs always inline arguments?
