Proposal: Capability-Oriented GPU/CUDA Integration
Purpose
Define a minimal, capability-safe path to integrate GPU-class accelerators (NVIDIA/CUDA, AMD, Intel, plus future ML-accelerator boards) into capOS without expanding kernel trust.
The kernel keeps direct control of hardware arbitration and trust boundaries. GPU hardware interaction is performed by a dedicated userspace driver service that is invoked through capability calls and that holds device-scoped bootstrap grants for its single managed device.
This proposal is a downstream consumer of:
- LLM and agent proposal – defines the LanguageModel/Embedder/ImageModel capability surface that benefits from GPU-backed inference backends. The agent runtime treats a GPU-backed model process as just another LanguageModel capability holder; the GPU service proposed here is one of the substrate choices the model process may use.
- Userspace binaries proposal – defines the native Rust over capos-rt userspace runtime, the x86_64-unknown-capos target, and the libcapos C-substrate path that any vendor SDK adapter (CUDA, ROCm, OpenCL, oneAPI) must link against. The GPU service runs as one such userspace binary, not as a kernel module.
Positioning Against Current Project State
capOS currently provides infrastructure that is directly load-bearing for a future GPU service:
- Process lifecycle, page tables, preemptive scheduling (PIT 100 Hz, round-robin, context switching).
- A global and per-process capability table with CapObject dispatch.
- Shared-memory capability ring (io_uring-inspired) with syscall-free SQE writes. cap_enter syscall for ordinary CALL dispatch and completion waits.
- PCI/PCIe enumeration over both legacy I/O ports and ACPI MCFG ECAM, plus reusable memory-BAR subregion validation and kernel MMIO mapping helpers for diagnostics and driver bring-up.
- MSI/MSI-X capability metadata discovery and typed MSI-X table programming, proven end-to-end through the virtio-net make run-net smoke.
- I/O APIC routing for masked legacy IRQ programming via MADT.
- Kernel-owned device interrupt source records plus a bounded first-fit device MSI vector pool with lock-free dispatch slots and claimed-route reassignment/release.
- Kernel-owned DMA pool accounting ledger that tracks pool bytes, live page count, page-rounded MMIO mapping bytes, interrupt holds, ring depth, and descriptor submission/completion counts for the current virtio-net path.
- Bootstrap-grant authority hooks for DeviceMmio, DMAPool, Interrupt, and HardwareAuditLog capabilities, exercised by the make run-devicemmio-grant, make run-dmapool-grant, make run-interrupt-grant, and make run-hardware-audit smokes.
What does not exist yet and gates real GPU work:
- A userspace driver-authority gate. Today the kernel still owns virtio-net, the DMA pool ledger, and the MSI-X dispatch table. The DDF bootstrap-grant smokes prove the schema and grant plumbing for the typed device caps, but there is no userspace driver process that consumes those grants to run a real driver. GPU integration cannot land before that gate moves.
- IOMMU/DMA-remapping integration (VT-d / AMD-Vi). Until a userspace driver is constrained by IOMMU domains, no production GPU stack can be granted bus-master DMA on a multi-tenant host.
- A LanguageModel capability surface to consume the GPU service. The LLM proposal defines the schema target; the GPU service is one backend choice.
That means GPU integration must be staged. The early phases are capability schema and mock-service exercises that ride on the existing DDF bootstrap grants; real hardware backends arrive after the userspace-driver authority gate, IOMMU integration, and at least one consuming model surface exist.
Design Principles
- Keep policy in kernel, execution in userspace. The kernel arbitrates device claims, MMIO mapping, MSI-X table programming, and DMA-pool accounting; the driver service implements vendor-specific command submission and queue management.
- Never expose raw PCI/MMIO/IRQ details to untrusted processes. Clients see only GpuSession/GpuBuffer/GpuFence capabilities, never DeviceMmio or Interrupt.
- Make GPU access explicit through narrow capabilities. The interface is the permission; a client that should not launch kernels is given a session type that does not expose launchKernel.
- Treat every stateful resource (session, buffer, queue, fence, command pool) as a capability with revocability and bounded lifetime.
- Avoid a Linux-driver-in-kernel compatibility dependency. Vendor SDK code runs in the userspace driver service, linked through libcapos / libcapos-posix shims where vendor headers expect a POSIX-ish surface.
- Charge GPU memory and submission depth through the existing ResourceLedger mechanism rather than inventing a parallel accounting surface.
Proposed Architecture
- capOS kernel (minimal) exposes only resource and mediation capabilities.
- gpu-device service (userspace) receives device-specific bootstrap grants (DeviceMmio, DMAPool, Interrupt, HardwareAuditLog) for exactly one GPU function and exposes a stable GPU capability surface to clients.
- application (e.g. an LLM model server, a numeric workload, a robot brain inference loop) receives only GpuSession/GpuBuffer/GpuFence capabilities and never sees the device-scoped grants.
Kernel responsibilities
- Discover GPUs from PCI/ACPI layers (already implemented for non-GPU functions; GPUs are the same discovery path with different class codes).
- Map/register BAR windows and grant a scoped DeviceMmio capability bound to one decoded memory BAR.
- Set up MSI/MSI-X routing and expose a scoped Interrupt capability per vector with masked-route lifecycle semantics matching the current virtio-net proof.
- Hand out a bounded DMAPool capability whose accounting ledger charges back to the driver process’s resource ledger and that participates in IOMMU-domain constraints once those exist.
- Enforce revocation when sessions are closed: DeviceMmio/Interrupt/DMAPool grants tear down through the bootstrap-grant manager.
- Record device-manager actions through HardwareAuditLog snapshots (already proven for the DDF smokes).
- Handle all faulting paths that would otherwise crash the kernel: a buggy driver service must crash the service, not the kernel.
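The scoped DeviceMmio grant described above depends on rejecting any subregion that escapes its BAR. The following is a minimal sketch of that check, not the kernel's actual helper: BarWindow, MmioError, and the 4 KiB page-alignment rule are hypothetical names and assumptions introduced only for illustration.

```rust
/// Hypothetical model of a DeviceMmio grant bound to one decoded memory BAR.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct BarWindow {
    pub base: u64, // physical base address of the decoded BAR
    pub len: u64,  // decoded BAR length in bytes
}

#[derive(Debug, PartialEq, Eq)]
pub enum MmioError {
    ZeroLength,
    Misaligned,
    OutOfRange,
}

/// Assumed mapping granularity; the real kernel may use a different rule.
const PAGE: u64 = 4096;

impl BarWindow {
    /// Returns the physical range [start, end) for a subregion of this BAR,
    /// or an error if the request is empty, misaligned, or leaves the BAR.
    pub fn subregion(&self, offset: u64, len: u64) -> Result<(u64, u64), MmioError> {
        if len == 0 {
            return Err(MmioError::ZeroLength);
        }
        if offset % PAGE != 0 {
            return Err(MmioError::Misaligned);
        }
        // Checked arithmetic so a huge `len` cannot wrap around and pass.
        let end_off = offset.checked_add(len).ok_or(MmioError::OutOfRange)?;
        if end_off > self.len {
            return Err(MmioError::OutOfRange);
        }
        let start = self.base.checked_add(offset).ok_or(MmioError::OutOfRange)?;
        let end = self.base.checked_add(end_off).ok_or(MmioError::OutOfRange)?;
        Ok((start, end))
    }
}
```

The point of the sketch is that every bound is computed with overflow-checked arithmetic before any mapping happens, so a malformed request fails with a typed error instead of reaching the page tables.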
Userspace GPU service responsibilities
- Open and initialize one GPU device from its device-scoped bootstrap grants. One driver process per GPU function is the working assumption; multi-function boards may run one process per function.
- Allocate and track GPU contexts, command queues, and DMA buffers backed by the granted DMAPool.
- Implement command submission, buffer lifecycle, fence/completion signaling, and timeout enforcement.
- Translate capability calls into vendor SDK operations (CUDA driver API, ROCm, oneAPI, OpenCL, or a vendor-neutral runtime such as a WebGPU/wgpu-style abstraction).
- Expose only narrow, capability-typed handles to callers and refuse any attempt to surface raw MMIO/IRQ/DMA to clients.
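Buffer tracking against a bounded DMAPool can be pictured as a small ledger inside the driver service. The sketch below is illustrative only: DmaPoolLedger, PoolError, and the page-rounding policy are assumptions, not the existing kernel ledger or the final ResourceLedger interface.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq, Eq)]
pub enum PoolError {
    ZeroSize,
    OverBudget,
    UnknownBuffer,
}

/// Assumed accounting granularity (hypothetical).
const PAGE: u64 = 4096;

/// Hypothetical per-driver ledger for buffers carved out of one DMAPool grant.
pub struct DmaPoolLedger {
    budget_bytes: u64,
    used_bytes: u64,
    next_id: u32,
    live: HashMap<u32, u64>, // buffer id -> page-rounded size in bytes
}

impl DmaPoolLedger {
    pub fn new(budget_bytes: u64) -> Self {
        Self { budget_bytes, used_bytes: 0, next_id: 0, live: HashMap::new() }
    }

    /// Charges a page-rounded allocation against the budget and returns its id.
    pub fn alloc(&mut self, bytes: u64) -> Result<u32, PoolError> {
        if bytes == 0 {
            return Err(PoolError::ZeroSize);
        }
        let rounded = bytes.checked_add(PAGE - 1).ok_or(PoolError::OverBudget)? / PAGE * PAGE;
        let new_used = self.used_bytes.checked_add(rounded).ok_or(PoolError::OverBudget)?;
        if new_used > self.budget_bytes {
            return Err(PoolError::OverBudget);
        }
        let id = self.next_id;
        self.next_id = self.next_id.checked_add(1).ok_or(PoolError::OverBudget)?;
        self.live.insert(id, rounded);
        self.used_bytes = new_used;
        Ok(id)
    }

    /// Releases a live buffer and returns its bytes to the budget.
    pub fn free(&mut self, id: u32) -> Result<(), PoolError> {
        let size = self.live.remove(&id).ok_or(PoolError::UnknownBuffer)?;
        self.used_bytes -= size;
        Ok(())
    }

    pub fn used_bytes(&self) -> u64 {
        self.used_bytes
    }
}
```

With a three-page budget, a one-byte allocation consumes a full page, a second two-page allocation fills the pool, and any further request fails with OverBudget until a buffer is freed.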
Consumer surfaces
- LLM/embedder model servers from Language Models and Agent Runtime. The GPU-backed model process holds a GpuSession, exposes a LanguageModel or Embedder capability, and is itself a normal userspace binary built per Userspace Binaries.
- Numerical / HPC workloads from HPC Parallel Processing Patterns once that proposal expands to GPU offload.
- Robotics inference loops from capOS As A Robot Brain.
Capability Contract (schema additions)
Add to schema/capos.capnp (interface-level sketch; final wire layout is
fixed in the implementation slice):
- GpuDeviceManager
  - listDevices() -> (devices :List(GpuDeviceInfo))
  - openDevice(capabilityIndex :UInt32) -> (session :GpuSession)
- GpuSession
  - createBuffer(bytes :UInt64, usage :Text) -> (buffer :GpuBuffer)
  - destroyBuffer(buffer :UInt32) -> ()
  - launchKernel(program :Text, grid :UInt32, block :UInt32, bufferList :List(UInt32), fence :GpuFence) -> ()
  - submitMemcpy(dst :UInt32, src :UInt32, bytes :UInt64) -> ()
  - submitFenceWait(fence :UInt32) -> ()
- GpuBuffer
  - mapReadWrite() -> (addr :UInt64, len :UInt64)
  - unmap() -> ()
  - size() -> (bytes :UInt64)
  - close() -> ()
- GpuFence
  - poll() -> (status :Text)
  - wait(timeoutNanos :UInt64) -> (ok :Bool)
  - close() -> ()
Sessions are the natural restriction point: a model-server session granted
to an LLM process can omit launchKernel entirely and expose only memcpy
plus an opaque runProgram(programCap, ...) if the model image is itself a
separately-vetted capability. The interface is the permission; do not add
parallel rights bitmasks.
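As a rough analogy for "the interface is the permission", the Rust sketch below models two session views as traits. It is not generated Cap'n Proto code: TransferSession, ComputeSession, and MockSession are hypothetical names, and real enforcement would come from which capability interface a client holds rather than from the compiler.

```rust
/// Narrow view: memory transfers only (hypothetical name).
pub trait TransferSession {
    fn submit_memcpy(&mut self, dst: u32, src: u32, bytes: u64) -> Result<(), String>;
}

/// Wider view: transfers plus kernel launch (hypothetical name).
pub trait ComputeSession: TransferSession {
    fn launch_kernel(&mut self, program: &str, grid: u32, block: u32) -> Result<(), String>;
}

/// Mock session that only records the calls it receives.
#[derive(Default)]
pub struct MockSession {
    pub log: Vec<String>,
}

impl TransferSession for MockSession {
    fn submit_memcpy(&mut self, dst: u32, src: u32, bytes: u64) -> Result<(), String> {
        self.log.push(format!("memcpy {src}->{dst} {bytes}"));
        Ok(())
    }
}

impl ComputeSession for MockSession {
    fn launch_kernel(&mut self, program: &str, grid: u32, block: u32) -> Result<(), String> {
        self.log.push(format!("launch {program} {grid}x{block}"));
        Ok(())
    }
}

/// A model-server client is handed only the transfer view, so a call to
/// `launch_kernel` inside this function would not type-check.
pub fn model_server_client(session: &mut dyn TransferSession) -> Result<(), String> {
    session.submit_memcpy(1, 2, 4096)
}

/// A compute client holds the wider view and may use both methods.
pub fn compute_client(session: &mut dyn ComputeSession) -> Result<(), String> {
    session.submit_memcpy(1, 2, 4096)?;
    session.launch_kernel("saxpy", 64, 256)
}
```

No rights bitmask appears anywhere in the sketch: the restricted client simply has no method to call, which is the property the schema aims for.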
Implementation Phases
Phase 0 (prerequisite, landed): kernel capability ring and DDF grants
The Cap’n Proto schema, capability ring, cap_enter dispatch, PCI/MSI-X
discovery, and the DeviceMmio/DMAPool/Interrupt/HardwareAuditLog
bootstrap-grant smokes already exist. No new kernel surface is required for
this phase; the schema additions for Gpu* are pure userspace work once a
driver service is permitted.
Phase 1: Userspace driver-authority gate (cross-track prerequisite)
GPU work cannot land before the userspace driver-authority gate. Required pieces, tracked by the device-manager refactor and DMA-isolation design:
- Move virtio-net or another known-good driver out of the kernel and into a userspace driver process consuming the DDF bootstrap grants end-to-end.
- Add an IOMMU integration path (VT-d / AMD-Vi) so that bus-master DMA granted to a driver process is constrained to its registered DMA pages.
- Add a device-manager userspace service that owns ManagerGrantSource-class capabilities and is the only process that hands DeviceMmio/DMAPool/Interrupt/HardwareAuditLog grants to driver services.
This phase is owned by the device-manager and DMA-isolation tracks; the GPU proposal consumes it.
Phase 2: Mock GPU service
- Add the Gpu* schema in schema/capos.capnp.
- Implement a gpu-mock userspace service with the full Gpu* interface, no real driver, and synthetic fences and buffers backed by ordinary anonymous memory.
- Prove end-to-end:
  - device-manager spawns the mock driver and grants it a fake-device bootstrap grant set.
  - a client process opens a session, allocates and maps a buffer, submits a synthetic job, and waits on a fence.
- Add a focused QEMU smoke (make run-gpu-mock) that asserts the round-trip and demonstrates revocation on session close.
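A synthetic fence for the mock service could look roughly like the sketch below, which mirrors the poll/wait shape of GpuFence using only the Rust standard library. MockFence and FenceStatus are hypothetical names, and the sketch assumes a hosted std environment rather than the capos-rt runtime, whose synchronization primitives may differ.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::time::Duration;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FenceStatus {
    Pending,
    Signaled,
    Cancelled,
}

/// Hypothetical synthetic fence shared between the mock driver and a client.
#[derive(Clone)]
pub struct MockFence {
    inner: Arc<(Mutex<FenceStatus>, Condvar)>,
}

impl MockFence {
    pub fn new() -> Self {
        Self { inner: Arc::new((Mutex::new(FenceStatus::Pending), Condvar::new())) }
    }

    /// Non-blocking status query.
    pub fn poll(&self) -> FenceStatus {
        *self.inner.0.lock().unwrap()
    }

    /// Moves a pending fence to a terminal state and wakes all waiters.
    /// A fence that is already resolved keeps its first terminal state.
    fn resolve(&self, status: FenceStatus) {
        let mut guard = self.inner.0.lock().unwrap();
        if *guard == FenceStatus::Pending {
            *guard = status;
            self.inner.1.notify_all();
        }
    }

    pub fn signal(&self) {
        self.resolve(FenceStatus::Signaled);
    }

    pub fn cancel(&self) {
        self.resolve(FenceStatus::Cancelled);
    }

    /// Blocks for at most `timeout`; returns true only if the fence signaled.
    pub fn wait(&self, timeout: Duration) -> bool {
        let guard = self.inner.0.lock().unwrap();
        let (guard, _timed_out) = self
            .inner
            .1
            .wait_timeout_while(guard, timeout, |s| *s == FenceStatus::Pending)
            .unwrap();
        *guard == FenceStatus::Signaled
    }
}
```

Because wait always returns within the timeout and cancellation is a distinct terminal state, a client can tell a finished job from one that was torn down.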
Phase 3: Real backend integration on one vendor
- Pick one concrete GPU backend available in the CI environment (likely NVIDIA on a workstation host with -device vfio-pci passthrough into QEMU, or a virtio-gpu / venus virtualized path as a first stand-in).
- Vendor SDK code lives in the userspace driver process. Where the SDK expects a POSIX-ish surface, route it through libcapos-posix rather than expanding the kernel.
- Add queue lifecycle, fence lifecycle, DMA registration/validation, the command execution path, and interrupt completion plumbing back to clients through fences.
- Keep backend replacement possible via a trait-like abstraction inside the driver process so a second vendor backend (AMD ROCm, Intel oneAPI) can be added later without rewriting the service.
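One way to read "trait-like abstraction" is a Rust trait object held by the service, so a second vendor backend only needs a new implementation. The sketch below is a guess at the shape, not a committed design: GpuBackend, MockBackend, GpuService, and the sequence-number completion model are all hypothetical.

```rust
/// Hypothetical vendor-neutral backend interface inside the driver process.
pub trait GpuBackend {
    fn name(&self) -> &'static str;
    /// Submits an opaque program and returns a completion sequence number.
    fn submit(&mut self, program: &[u8]) -> Result<u64, String>;
    /// Reports whether the submission with this sequence number has finished.
    fn is_complete(&self, seq: u64) -> bool;
}

/// Stand-in backend that "completes" every submission immediately.
#[derive(Default)]
pub struct MockBackend {
    last_seq: u64,
}

impl GpuBackend for MockBackend {
    fn name(&self) -> &'static str {
        "mock"
    }

    fn submit(&mut self, program: &[u8]) -> Result<u64, String> {
        if program.is_empty() {
            return Err("empty program".to_string());
        }
        self.last_seq += 1;
        Ok(self.last_seq)
    }

    fn is_complete(&self, seq: u64) -> bool {
        seq != 0 && seq <= self.last_seq
    }
}

/// The service depends only on the trait, never on a concrete vendor type.
pub struct GpuService {
    backend: Box<dyn GpuBackend>,
}

impl GpuService {
    pub fn new(backend: Box<dyn GpuBackend>) -> Self {
        Self { backend }
    }

    pub fn backend_name(&self) -> &'static str {
        self.backend.name()
    }

    /// Submits a program and reports whether it has already completed.
    pub fn run(&mut self, program: &[u8]) -> Result<bool, String> {
        let seq = self.backend.submit(program)?;
        Ok(self.backend.is_complete(seq))
    }
}
```

Swapping vendors then means constructing GpuService with a different boxed backend; the capability-facing code above the trait stays unchanged.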
Phase 4: Security and reliability hardening
- Per-session limits for mapped pages, in-flight submissions, and queue depth, charged through ResourceLedger.
- Bounded wait timeouts and explicit fence cancellation semantics so a hung GPU does not pin a client’s cap_enter.
- Revocation propagation:
  - GpuSession close => all child GpuBuffer/GpuFence caps revoked.
  - driver crash / device reset => all active caps fail closed with a typed exception.
- Audit hooks for launchKernel/submitMemcpy recorded through HardwareAuditLog-style snapshots scoped to the GPU service.
- Coordination with the live-upgrade proposal so the GPU driver service can be replaced without dropping client GpuSession caps.
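Revocation propagation can be sketched with a liveness flag shared between a session and every capability derived from it. This is only one possible mechanism, shown under the assumption of a single shared flag per session; Session, Buffer, and CapError are hypothetical names, and the real capability table may revoke by other means.

```rust
use std::sync::atomic::{AtomicBool, Ordering};
use std::sync::Arc;

/// Typed failure returned once a capability has been revoked (hypothetical).
#[derive(Debug, PartialEq, Eq)]
pub enum CapError {
    Revoked,
}

/// Hypothetical session object; children share its liveness flag.
pub struct Session {
    live: Arc<AtomicBool>,
}

/// Hypothetical child capability derived from a session.
pub struct Buffer {
    live: Arc<AtomicBool>,
    bytes: u64,
}

impl Session {
    pub fn open() -> Self {
        Self { live: Arc::new(AtomicBool::new(true)) }
    }

    /// Child caps can only be minted while the session is live.
    pub fn create_buffer(&self, bytes: u64) -> Result<Buffer, CapError> {
        if !self.live.load(Ordering::Acquire) {
            return Err(CapError::Revoked);
        }
        Ok(Buffer { live: Arc::clone(&self.live), bytes })
    }

    /// Session close (or a device reset handled by the service) flips the
    /// shared flag, so every child fails closed on its next call.
    pub fn close(&self) {
        self.live.store(false, Ordering::Release);
    }
}

impl Buffer {
    pub fn size(&self) -> Result<u64, CapError> {
        if !self.live.load(Ordering::Acquire) {
            return Err(CapError::Revoked);
        }
        Ok(self.bytes)
    }
}
```

The child never needs to be located and torn down individually: it observes the revocation on its next call and returns a typed error.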
Phase 5: Multi-tenant and multi-device
- Multiple driver processes (one per GPU function) under a single device-manager.
- Cross-device buffer sharing only through explicit capability transfer; no implicit peer mappings.
- Workload isolation: distinct tenants on a single GPU receive distinct sessions with their own queue, memory budget, and audit stream.
Security Model
The kernel does not grant any user process direct MMIO, MSI, or bus-master DMA access. All such authority is mediated through the device-manager.
Application processes only receive:
- GpuSession/GpuBuffer/GpuFence capabilities with the methods the session policy chose to expose.
The GPU driver service process receives:
- DeviceMmio bound to the function’s decoded BARs.
- Interrupt capabilities for the function’s claimed MSI vectors.
- DMAPool bounded to the function’s IOMMU domain.
- HardwareAuditLog for snapshotting device-manager actions.
This ensures:
- No userland process can program BAR registers.
- No userland process can claim untrusted memory for DMA.
- No userland process can observe or reset another session’s state.
- A buggy or compromised driver crashes the driver process, not the kernel; the device-manager observes the crash, fails outstanding capabilities closed, and re-spawns the driver on the next session request.
Dependencies and Alignment
This proposal depends on:
- Device-manager refactor proposal for the userspace device-manager that owns the bootstrap-grant sources.
- DMA-isolation design and IOMMU integration so DMA grants are enforceable in a multi-tenant context.
- Userspace-binaries proposal for the driver-process runtime, libcapos / libcapos-posix surface for vendor SDK consumption, and the x86_64-unknown-capos target.
- LLM and agent proposal for the primary consumer surface (LanguageModel, Embedder) and the agent runtime that exercises GPU-backed inference end-to-end.
- Resource-accounting proposal for per-session memory and submission budgets.
- Live-upgrade proposal for driver-service replacement without dropping GpuSession capabilities.
It complements:
- Service-architecture and authority-broker proposals.
- Storage/service manifest execution flow for shipping GPU service binaries and their bootstrap grants.
- In-process threading work for future queue completion callbacks and worker pools inside the driver service.
Minimal acceptance criteria
- make run-gpu-mock boots and prints GPU service lifecycle messages.
- The device-manager spawns the GPU service and grants only device-scoped bootstrap grants for a single mock function.
- A sample userspace client (Rust over capos-rt; C smoke later through libcapos) can create a session, allocate and map a GPU buffer, submit a synthetic job, and wait on a fence with a typed completion result.
- Attempts to submit unsupported or malformed operations return explicit capnp CapException results, not driver crashes.
- Removing the session capability invalidates descendant buffer and fence caps without kernel restart.
- A subsequent slice points an LLM model server at the GPU service and proves a LanguageModel.generate(...) round-trip backed by the GPU session, satisfying the LLM proposal’s GPU-backend integration point.
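The "explicit exception, not driver crash" criterion suggests validating every request before it reaches the backend. The sketch below shows that idea for a launch request; GpuException, LaunchRequest, validate_launch, and the max_block limit are hypothetical and do not represent the capnp CapException wire type.

```rust
/// Hypothetical typed failure surfaced to the client instead of a crash.
#[derive(Debug, PartialEq, Eq)]
pub enum GpuException {
    InvalidArgument(&'static str),
    UnknownBuffer(u32),
}

/// Hypothetical decoded form of a launchKernel call.
pub struct LaunchRequest<'a> {
    pub program: &'a str,
    pub grid: u32,
    pub block: u32,
    pub buffers: &'a [u32],
}

/// Rejects malformed requests up front; `live_buffers` lists the buffer ids
/// the session currently owns and `max_block` is an assumed device limit.
pub fn validate_launch(
    req: &LaunchRequest,
    live_buffers: &[u32],
    max_block: u32,
) -> Result<(), GpuException> {
    if req.program.is_empty() {
        return Err(GpuException::InvalidArgument("program"));
    }
    if req.grid == 0 {
        return Err(GpuException::InvalidArgument("grid"));
    }
    if req.block == 0 || req.block > max_block {
        return Err(GpuException::InvalidArgument("block"));
    }
    // Every referenced buffer must belong to the calling session.
    for id in req.buffers {
        if !live_buffers.contains(id) {
            return Err(GpuException::UnknownBuffer(*id));
        }
    }
    Ok(())
}
```

A smoke test could then assert on the specific error variant for each malformed input rather than merely checking that the service survived.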
Risks
- Real NVIDIA closed stack integration may require vendor-specific adaptation that is hostile to a capability shim; the AMD ROCm or vendor-neutral path (Vulkan compute, WebGPU/wgpu) may land first.
- Buffer mapping semantics become complex with paging, fragmentation, and IOMMU domains. Pinned physical-memory-only buffers are the conservative starting point.
- Interrupt-heavy completion paths require the scheduler evolution work (per-CPU run queues, fairness) before client-visible completion guarantees scale beyond a single workload.
- Vendor SDKs assume a POSIX-ish process model; the libcapos-posix surface has to grow enough to host them without leaking ambient authority.
- A GPU driver process is privileged from the application’s point of view. Compromise of a single driver process must remain bounded to one GPU function and one tenant set; the device-manager and IOMMU are the load-bearing controls there.
Open Questions
- Is CUDA mandatory from first integration, or is the initial surface command-focused (opaque “program” bytes interpreted by the driver) with CUDA runtime-specific support added later?
- Should memory registration support pinned physical memory only at first, or attempt to expose unified-virtual-memory semantics through the client’s VirtualMemory capability?
- Which isolation level is needed for multi-tenant versus single-tenant in the first real-backend phase? Single-tenant per GPU function is the conservative default; MIG / SR-IOV-style partitioning is later work.
- Does the GPU service expose model artifacts (weights, programs) as separate capability types so a model file can be granted to clients without the full session, or are programs always inline arguments?