Proposal: Capability-Oriented GPU/CUDA Integration

Purpose

Define a minimal, capability-safe path to integrate GPU-class accelerators (NVIDIA/CUDA, AMD, Intel, plus future ML-accelerator boards) into capOS without expanding kernel trust.

The kernel keeps direct control of hardware arbitration and trust boundaries. GPU hardware interaction is performed by a dedicated userspace driver service that is invoked through capability calls and that holds device-scoped bootstrap grants for its single managed device.

This proposal is a downstream consumer of:

  • LLM and agent proposal – defines the LanguageModel/Embedder/ImageModel capability surface that benefits from GPU-backed inference backends. The agent runtime treats a GPU-backed model process as just another LanguageModel capability holder; the GPU service proposed here is one of the substrate choices the model process may use.
  • Userspace binaries proposal – defines the native Rust over capos-rt userspace runtime, the x86_64-unknown-capos target, and the libcapos C-substrate path that any vendor SDK adapter (CUDA, ROCm, OpenCL, oneAPI) must link against. The GPU service runs as one such userspace binary, not as a kernel module.

Positioning Against Current Project State

capOS currently provides infrastructure that is directly load-bearing for a future GPU service:

  • Process lifecycle, page tables, preemptive scheduling (PIT 100 Hz, round-robin, context switching).
  • A global and per-process capability table with CapObject dispatch.
  • Shared-memory capability ring (io_uring-inspired) with syscall-free SQE writes, plus the cap_enter syscall for ordinary CALL dispatch and completion waits.
  • PCI/PCIe enumeration over both legacy I/O ports and ACPI MCFG ECAM, plus reusable memory-BAR subregion validation and kernel MMIO mapping helpers for diagnostics and driver bring-up.
  • MSI/MSI-X capability metadata discovery and typed MSI-X table programming, proven end-to-end through the virtio-net make run-net smoke.
  • I/O APIC routing for masked legacy IRQ programming via MADT.
  • Kernel-owned device interrupt source records plus a bounded first-fit device MSI vector pool with lock-free dispatch slots and claimed-route reassignment/release.
  • Kernel-owned DMA pool accounting ledger that tracks pool bytes, live page count, page-rounded MMIO mapping bytes, interrupt holds, ring depth, and descriptor submission/completion counts for the current virtio-net path.
  • Bootstrap-grant authority hooks for DeviceMmio, DMAPool, Interrupt, and HardwareAuditLog capabilities, exercised by the make run-devicemmio-grant, make run-dmapool-grant, make run-interrupt-grant, and make run-hardware-audit smokes.
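
To make the "bounded first-fit device MSI vector pool" item concrete, here is a minimal Rust sketch of first-fit claim and owner-checked release. It is an illustration only: the names VectorPool and DeviceId and the constants BASE_VECTOR and POOL_SIZE are hypothetical, and the sketch leaves out the lock-free dispatch slots and route reassignment that the kernel implementation is described as having.

```rust
/// Hypothetical values for illustration; not capOS constants.
const BASE_VECTOR: u8 = 0x40;
const POOL_SIZE: usize = 16;

/// Hypothetical identifier for the device that owns a vector.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct DeviceId(pub u16);

/// Bounded pool: one slot per vector, `None` means free.
pub struct VectorPool {
    slots: [Option<DeviceId>; POOL_SIZE],
}

impl VectorPool {
    pub const fn new() -> Self {
        Self { slots: [None; POOL_SIZE] }
    }

    /// First fit: claim the lowest-numbered free vector for `owner`.
    /// Returns `None` when the pool is exhausted.
    pub fn claim(&mut self, owner: DeviceId) -> Option<u8> {
        let idx = self.slots.iter().position(|s| s.is_none())?;
        self.slots[idx] = Some(owner);
        Some(BASE_VECTOR + idx as u8)
    }

    /// Release a vector, but only if `owner` actually holds it.
    pub fn release(&mut self, vector: u8, owner: DeviceId) -> bool {
        let idx = match vector.checked_sub(BASE_VECTOR) {
            Some(i) if (i as usize) < POOL_SIZE => i as usize,
            _ => return false,
        };
        if self.slots[idx] == Some(owner) {
            self.slots[idx] = None;
            true
        } else {
            false
        }
    }
}
```

Because release checks the owner, one device cannot free a vector claimed by another, which is the property a GPU function would rely on when it shares the pool with virtio-net.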

What does not exist yet and gates real GPU work:

  • A userspace driver-authority gate. Today the kernel still owns virtio-net, the DMA pool ledger, and the MSI-X dispatch table. The DDF bootstrap-grant smokes prove the schema and grant plumbing for the typed device caps, but there is no userspace driver process that consumes those grants to run a real driver. GPU integration cannot land before that gate moves.
  • IOMMU/DMA-remapping integration (VT-d / AMD-Vi). Until a userspace driver is constrained by IOMMU domains, no production GPU stack can be granted bus-master DMA on a multi-tenant host.
  • A LanguageModel capability surface to consume the GPU service. The LLM proposal defines the schema target; the GPU service is one backend choice.

That means GPU integration must be staged. The early phases are capability schema and mock-service exercises that ride on the existing DDF bootstrap grants; real hardware backends arrive after the userspace-driver authority gate, IOMMU integration, and at least one consuming model surface exist.

Design Principles

  • Keep policy in kernel, execution in userspace. The kernel arbitrates device claims, MMIO mapping, MSI-X table programming, and DMA-pool accounting; the driver service implements vendor-specific command submission and queue management.
  • Never expose raw PCI/MMIO/IRQ details to untrusted processes. Clients see only GpuSession/GpuBuffer/GpuFence capabilities, never DeviceMmio or Interrupt.
  • Make GPU access explicit through narrow capabilities. The interface is the permission; a client that should not launch kernels is given a session type that does not expose launchKernel.
  • Treat every stateful resource (session, buffer, queue, fence, command pool) as a capability with revocability and bounded lifetime.
  • Avoid a Linux-driver-in-kernel compatibility dependency. Vendor SDK code runs in the userspace driver service, linked through libcapos / libcapos-posix shims where vendor headers expect a POSIX-ish surface.
  • Charge GPU memory and submission depth through the existing ResourceLedger mechanism rather than inventing a parallel accounting surface.
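
The last principle can be sketched as a per-session budget that is charged before a buffer is created or a job is queued. This is a sketch under assumptions: the proposal does not show the ResourceLedger API, so SessionLedger, LedgerError, and every method name below are hypothetical stand-ins.

```rust
/// Hypothetical error type; the real ledger's errors are not specified.
#[derive(Debug, PartialEq, Eq)]
pub enum LedgerError {
    MemoryBudgetExceeded,
    SubmissionDepthExceeded,
}

/// Hypothetical per-session view of a resource ledger.
#[derive(Debug)]
pub struct SessionLedger {
    mem_limit: u64,
    mem_used: u64,
    depth_limit: u32,
    in_flight: u32,
}

impl SessionLedger {
    pub fn new(mem_limit: u64, depth_limit: u32) -> Self {
        Self { mem_limit, mem_used: 0, depth_limit, in_flight: 0 }
    }

    /// Charge GPU memory before the buffer exists; fail without side effects.
    pub fn charge_buffer(&mut self, bytes: u64) -> Result<(), LedgerError> {
        let next = self
            .mem_used
            .checked_add(bytes)
            .ok_or(LedgerError::MemoryBudgetExceeded)?;
        if next > self.mem_limit {
            return Err(LedgerError::MemoryBudgetExceeded);
        }
        self.mem_used = next;
        Ok(())
    }

    /// Return memory to the budget when a buffer is destroyed.
    pub fn release_buffer(&mut self, bytes: u64) {
        self.mem_used = self.mem_used.saturating_sub(bytes);
    }

    /// Charge one unit of submission depth before queueing work.
    pub fn begin_submission(&mut self) -> Result<(), LedgerError> {
        if self.in_flight >= self.depth_limit {
            return Err(LedgerError::SubmissionDepthExceeded);
        }
        self.in_flight += 1;
        Ok(())
    }

    /// Release submission depth when the fence for the job completes.
    pub fn complete_submission(&mut self) {
        self.in_flight = self.in_flight.saturating_sub(1);
    }

    pub fn mem_used(&self) -> u64 {
        self.mem_used
    }
}
```

Charging before the operation and refunding on completion keeps the accounting in one place, so the driver service does not need its own parallel counters.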

Proposed Architecture

capOS kernel (minimal) exposes only resource and mediation capabilities.

gpu-device service (userspace) receives device-specific bootstrap grants (DeviceMmio, DMAPool, Interrupt, HardwareAuditLog) for exactly one GPU function and exposes a stable GPU capability surface to clients.

application (e.g. an LLM model server, a numeric workload, a robot brain inference loop) receives only GpuSession/GpuBuffer/GpuFence capabilities and never sees the device-scoped grants.

Kernel responsibilities

  • Discover GPUs from PCI/ACPI layers (already implemented for non-GPU functions; GPUs are the same discovery path with different class codes).
  • Map/register BAR windows and grant a scoped DeviceMmio capability bound to one decoded memory BAR.
  • Set up MSI/MSI-X routing and expose scoped Interrupt capability per vector with masked-route lifecycle semantics matching the current virtio-net proof.
  • Hand out a bounded DMAPool capability whose accounting ledger charges back to the driver process’s resource ledger and that participates in IOMMU-domain constraints once those exist.
  • Enforce revocation when sessions are closed: DeviceMmio/Interrupt/DMAPool grants tear down through the bootstrap-grant manager.
  • Record device-manager actions through HardwareAuditLog snapshots (already proven for the DDF smokes).
  • Handle all faulting paths that would otherwise crash the kernel: a buggy driver service must crash the service, not the kernel.

Userspace GPU service responsibilities

  • Open and initialize one GPU device from its device-scoped bootstrap grants. One driver process per GPU function is the working assumption; multi-function boards may run one process per function.
  • Allocate and track GPU contexts, command queues, and DMA buffers backed by the granted DMAPool.
  • Implement command submission, buffer lifecycle, fence/completion signaling, and timeout enforcement.
  • Translate capability calls into vendor SDK operations (CUDA driver API, ROCm, oneAPI, OpenCL, or a vendor-neutral runtime such as a WebGPU/wgpu-style abstraction).
  • Expose only narrow, capability-typed handles to callers and refuse any attempt to surface raw MMIO/IRQ/DMA to clients.

Consumer surfaces

Capability Contract (schema additions)

Add to schema/capos.capnp (interface-level sketch; final wire layout is fixed in the implementation slice):

  • GpuDeviceManager
    • listDevices() -> (devices :List(GpuDeviceInfo))
    • openDevice(capabilityIndex :UInt32) -> (session :GpuSession)
  • GpuSession
    • createBuffer(bytes :UInt64, usage :Text) -> (buffer :GpuBuffer)
    • destroyBuffer(buffer :UInt32) -> ()
    • launchKernel(program :Text, grid :UInt32, block :UInt32, bufferList :List(UInt32), fence :GpuFence) -> ()
    • submitMemcpy(dst :UInt32, src :UInt32, bytes :UInt64) -> ()
    • submitFenceWait(fence :UInt32) -> ()
  • GpuBuffer
    • mapReadWrite() -> (addr :UInt64, len :UInt64)
    • unmap() -> ()
    • size() -> (bytes :UInt64)
    • close() -> ()
  • GpuFence
    • poll() -> (status :Text)
    • wait(timeoutNanos :UInt64) -> (ok :Bool)
    • close() -> ()

Sessions are the natural restriction point: a model-server session granted to an LLM process can omit launchKernel entirely and expose only memcpy plus an opaque runProgram(programCap, ...) if the model image is itself a separately-vetted capability. The interface is the permission; do not add parallel rights bitmasks.
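
One way to picture "the interface is the permission" on the Rust side is to split the session into traits and hand out a wrapper that implements only the narrower one. The sketch below is illustrative: CopySession, ComputeSession, FullSession, and CopyOnly are hypothetical names, not generated capnp bindings, and the in-memory log stands in for real submission.

```rust
pub type BufferId = u32;

#[derive(Debug, PartialEq, Eq)]
pub enum GpuError {
    UnknownBuffer,
}

/// Methods a restricted session exposes (mirrors submitMemcpy in the sketch).
pub trait CopySession {
    fn submit_memcpy(&mut self, dst: BufferId, src: BufferId, bytes: u64) -> Result<(), GpuError>;
}

/// Extra authority: only holders of this interface can launch kernels.
pub trait ComputeSession: CopySession {
    fn launch_kernel(&mut self, program: &str, grid: u32, block: u32, buffers: &[BufferId]) -> Result<(), GpuError>;
}

/// Hypothetical full-authority session that records what was submitted.
pub struct FullSession {
    buffers: Vec<BufferId>,
    pub submitted: Vec<String>,
}

impl FullSession {
    pub fn new(buffers: Vec<BufferId>) -> Self {
        Self { buffers, submitted: Vec::new() }
    }

    fn check(&self, id: BufferId) -> Result<(), GpuError> {
        if self.buffers.contains(&id) { Ok(()) } else { Err(GpuError::UnknownBuffer) }
    }
}

impl CopySession for FullSession {
    fn submit_memcpy(&mut self, dst: BufferId, src: BufferId, bytes: u64) -> Result<(), GpuError> {
        self.check(dst)?;
        self.check(src)?;
        self.submitted.push(format!("memcpy {}->{} {}B", src, dst, bytes));
        Ok(())
    }
}

impl ComputeSession for FullSession {
    fn launch_kernel(&mut self, program: &str, grid: u32, block: u32, buffers: &[BufferId]) -> Result<(), GpuError> {
        for &b in buffers {
            self.check(b)?;
        }
        self.submitted.push(format!("launch {} grid={} block={}", program, grid, block));
        Ok(())
    }
}

/// Attenuated view: forwards memcpy and deliberately implements nothing else.
pub struct CopyOnly<S: CopySession>(S);

impl<S: CopySession> CopyOnly<S> {
    pub fn new(inner: S) -> Self {
        Self(inner)
    }
}

impl<S: CopySession> CopySession for CopyOnly<S> {
    fn submit_memcpy(&mut self, dst: BufferId, src: BufferId, bytes: u64) -> Result<(), GpuError> {
        self.0.submit_memcpy(dst, src, bytes)
    }
}
```

A client holding CopyOnly<FullSession> has no launch_kernel method to call, so there is no runtime rights check to forget; the restriction is carried by the type it was handed.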

Implementation Phases

Phase 0 (prerequisite, landed): kernel capability ring and DDF grants

The Cap’n Proto schema, capability ring, cap_enter dispatch, PCI/MSI-X discovery, and the DeviceMmio/DMAPool/Interrupt/HardwareAuditLog bootstrap-grant smokes already exist. No new kernel surface is required for this phase; the schema additions for Gpu* are pure userspace work once a driver service is permitted.

Phase 1: Userspace driver-authority gate (cross-track prerequisite)

GPU work cannot land before the userspace driver-authority gate. Required pieces, tracked by the device-manager refactor and DMA-isolation design:

  • Move virtio-net or another known-good driver out of the kernel and into a userspace driver process consuming the DDF bootstrap grants end-to-end.
  • Add an IOMMU integration path (VT-d / AMD-Vi) so that bus-master DMA granted to a driver process is constrained to its registered DMA pages.
  • Add a device-manager userspace service that owns ManagerGrantSource-class capabilities and is the only process that hands DeviceMmio/DMAPool/Interrupt/HardwareAuditLog grants to driver services.

This phase is owned by the device-manager and DMA-isolation tracks; the GPU proposal consumes it.

Phase 2: Mock GPU service

  • Add the Gpu* schema in schema/capos.capnp.
  • Implement a gpu-mock userspace service with the full Gpu* interface, no real driver, and synthetic fences and buffers backed by ordinary anonymous memory.
  • Prove end-to-end:
    • device-manager spawns the mock driver and grants it a fake-device bootstrap grant set.
    • a client process opens a session, allocates and maps a buffer, submits a synthetic job, and waits on a fence.
  • Add a focused QEMU smoke (make run-gpu-mock) that asserts the round-trip and demonstrates revocation on session close.
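
The core of such a mock can be very small. The following sketch, with hypothetical names throughout (MockSession, MockError, FenceStatus), backs buffers with ordinary heap memory, completes every job synchronously so fences are signaled at once, and fails all calls after close. It runs as plain Rust on a host and does not use capos-rt or the capability ring, which the real gpu-mock service would.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq, Eq)]
pub enum MockError {
    SessionClosed,
    UnknownBuffer,
    UnknownFence,
    OutOfRange,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FenceStatus {
    Pending,
    Signaled,
}

/// Hypothetical mock session: no device, only anonymous memory.
pub struct MockSession {
    open: bool,
    next_id: u32,
    buffers: HashMap<u32, Vec<u8>>,
    fences: HashMap<u32, FenceStatus>,
}

impl MockSession {
    pub fn new() -> Self {
        Self { open: true, next_id: 1, buffers: HashMap::new(), fences: HashMap::new() }
    }

    fn ensure_open(&self) -> Result<(), MockError> {
        if self.open { Ok(()) } else { Err(MockError::SessionClosed) }
    }

    fn fresh_id(&mut self) -> u32 {
        let id = self.next_id;
        self.next_id += 1;
        id
    }

    pub fn create_buffer(&mut self, bytes: usize) -> Result<u32, MockError> {
        self.ensure_open()?;
        let id = self.fresh_id();
        self.buffers.insert(id, vec![0u8; bytes]);
        Ok(id)
    }

    /// Stand-in for mapReadWrite: direct access to the backing memory.
    pub fn map_mut(&mut self, buf: u32) -> Result<&mut [u8], MockError> {
        self.ensure_open()?;
        self.buffers
            .get_mut(&buf)
            .map(|v| v.as_mut_slice())
            .ok_or(MockError::UnknownBuffer)
    }

    /// Synthetic job: copy `bytes` from `src` to `dst`, return a fence id.
    pub fn submit_memcpy(&mut self, dst: u32, src: u32, bytes: usize) -> Result<u32, MockError> {
        self.ensure_open()?;
        let data = self
            .buffers
            .get(&src)
            .ok_or(MockError::UnknownBuffer)?
            .get(..bytes)
            .ok_or(MockError::OutOfRange)?
            .to_vec();
        let target = self
            .buffers
            .get_mut(&dst)
            .ok_or(MockError::UnknownBuffer)?
            .get_mut(..bytes)
            .ok_or(MockError::OutOfRange)?;
        target.copy_from_slice(&data);
        // The mock completes synchronously, so the fence is already signaled.
        let fence = self.fresh_id();
        self.fences.insert(fence, FenceStatus::Signaled);
        Ok(fence)
    }

    pub fn poll(&self, fence: u32) -> Result<FenceStatus, MockError> {
        self.ensure_open()?;
        self.fences.get(&fence).copied().ok_or(MockError::UnknownFence)
    }

    /// Closing the session drops every child buffer and fence.
    pub fn close(&mut self) {
        self.open = false;
        self.buffers.clear();
        self.fences.clear();
    }
}
```

Malformed requests, such as a copy longer than the buffer, come back as error values rather than panics, which is the behavior the smoke is meant to assert.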

Phase 3: Real backend integration on one vendor

  • Pick one concrete GPU backend available in the CI environment (likely NVIDIA on a workstation host with -device vfio-pci passthrough into QEMU, or a virtio-gpu / venus virtualized path as a first stand-in).
  • Vendor SDK code lives in the userspace driver process. Where the SDK expects a POSIX-ish surface, route it through libcapos-posix rather than expanding the kernel.
  • Add queue lifecycle, fence lifecycle, DMA registration/validation, a command execution path, and interrupt completion plumbing that reaches clients through fences.
  • Keep backend replacement possible via a trait-like abstraction inside the driver process so a second vendor backend (AMD ROCm, Intel oneAPI) can be added later without rewriting the service.
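
The "trait-like abstraction" in the last item might look roughly like the sketch below, where the service owns a boxed backend and never names a vendor type. GpuBackend, Command, GpuService, and both toy backends are hypothetical; a real backend would wrap vendor SDK calls rather than count bytes.

```rust
/// Hypothetical vendor-neutral command set.
#[derive(Debug, Clone, PartialEq, Eq)]
pub enum Command {
    Memcpy { dst: u32, src: u32, bytes: u64 },
    Launch { program: String, grid: u32, block: u32 },
}

#[derive(Debug, PartialEq, Eq)]
pub enum BackendError {
    Unsupported,
}

/// The seam between the capability-facing service and vendor code.
pub trait GpuBackend {
    fn name(&self) -> &'static str;
    fn submit(&mut self, cmd: &Command) -> Result<(), BackendError>;
}

/// Toy backend that only understands copies.
pub struct CopyOnlyBackend {
    pub copied: u64,
}

impl GpuBackend for CopyOnlyBackend {
    fn name(&self) -> &'static str {
        "copy-only"
    }

    fn submit(&mut self, cmd: &Command) -> Result<(), BackendError> {
        match cmd {
            Command::Memcpy { bytes, .. } => {
                self.copied += *bytes;
                Ok(())
            }
            Command::Launch { .. } => Err(BackendError::Unsupported),
        }
    }
}

/// Toy backend that accepts and records everything.
pub struct RecordingBackend {
    pub seen: Vec<Command>,
}

impl GpuBackend for RecordingBackend {
    fn name(&self) -> &'static str {
        "recording"
    }

    fn submit(&mut self, cmd: &Command) -> Result<(), BackendError> {
        self.seen.push(cmd.clone());
        Ok(())
    }
}

/// Service logic is written once against the trait object.
pub struct GpuService {
    backend: Box<dyn GpuBackend>,
    next_fence: u64,
}

impl GpuService {
    pub fn new(backend: Box<dyn GpuBackend>) -> Self {
        Self { backend, next_fence: 0 }
    }

    pub fn backend_name(&self) -> &'static str {
        self.backend.name()
    }

    /// Forward the command and hand back a fence id on success.
    pub fn submit(&mut self, cmd: Command) -> Result<u64, BackendError> {
        self.backend.submit(&cmd)?;
        let fence = self.next_fence;
        self.next_fence += 1;
        Ok(fence)
    }
}
```

Swapping the boxed value is the only change needed to move the same service code onto a different backend, which is the property the phase asks for.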

Phase 4: Security and reliability hardening

  • Per-session limits for mapped pages, in-flight submissions, and queue depth, charged through ResourceLedger.
  • Bounded wait timeouts and explicit fence cancellation semantics so a hung GPU does not pin a client’s cap_enter.
  • Revocation propagation:
    • GpuSession close => all child GpuBuffer/GpuFence caps revoked.
    • driver crash / device reset => all active caps fail closed with a typed exception.
  • Audit hooks for launchKernel/submitMemcpy recorded through HardwareAuditLog-style snapshots scoped to the GPU service.
  • Coordination with the live-upgrade proposal so the GPU driver service can be replaced without dropping client GpuSession caps.
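
The bounded-wait and cancellation item can be illustrated with a host-side fence built on a mutex and condition variable. This is a sketch only: Fence, FenceState, and WaitOutcome are hypothetical, and on capOS the wait would go through cap_enter rather than std::sync. The point is the three distinct outcomes, so a client can tell a hung GPU (timeout) from an explicit cancel.

```rust
use std::sync::{Arc, Condvar, Mutex};
use std::time::Duration;

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum FenceState {
    Pending,
    Signaled,
    Cancelled,
}

#[derive(Debug, PartialEq, Eq)]
pub enum WaitOutcome {
    Signaled,
    Cancelled,
    TimedOut,
}

/// Hypothetical fence shared between the driver side and a waiting client.
#[derive(Clone)]
pub struct Fence {
    inner: Arc<(Mutex<FenceState>, Condvar)>,
}

impl Fence {
    pub fn new() -> Self {
        Self { inner: Arc::new((Mutex::new(FenceState::Pending), Condvar::new())) }
    }

    /// Move out of Pending exactly once; later calls do not overwrite the result.
    fn resolve(&self, to: FenceState) {
        let (lock, cv) = &*self.inner;
        let mut state = lock.lock().unwrap();
        if *state == FenceState::Pending {
            *state = to;
        }
        cv.notify_all();
    }

    /// Called when the device reports completion.
    pub fn signal(&self) {
        self.resolve(FenceState::Signaled);
    }

    /// Called on session close, device reset, or explicit client cancel.
    pub fn cancel(&self) {
        self.resolve(FenceState::Cancelled);
    }

    /// Block for at most `timeout`; never waits unboundedly.
    pub fn wait(&self, timeout: Duration) -> WaitOutcome {
        let (lock, cv) = &*self.inner;
        let guard = lock.lock().unwrap();
        let (guard, _timed_out) = cv
            .wait_timeout_while(guard, timeout, |s| *s == FenceState::Pending)
            .unwrap();
        match *guard {
            FenceState::Signaled => WaitOutcome::Signaled,
            FenceState::Cancelled => WaitOutcome::Cancelled,
            FenceState::Pending => WaitOutcome::TimedOut,
        }
    }
}
```

Because a fence resolves only once, a late completion interrupt that arrives after cancellation cannot flip a cancelled fence back to signaled.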

Phase 5: Multi-tenant and multi-device

  • Multiple driver processes (one per GPU function) under a single device-manager.
  • Cross-device buffer sharing only through explicit capability transfer; no implicit peer mappings.
  • Workload isolation: distinct tenants on a single GPU receive distinct sessions with their own queue, memory budget, and audit stream.

Security Model

The kernel does not grant any user process direct MMIO, MSI, or bus-master DMA access. All such authority is mediated through the device-manager.

Application processes only receive:

  • GpuSession / GpuBuffer / GpuFence capabilities with the methods the session policy chose to expose.

The GPU driver service process receives:

  • DeviceMmio bound to the function’s decoded BARs.
  • Interrupt capabilities for the function’s claimed MSI vectors.
  • DMAPool bounded to the function’s IOMMU domain.
  • HardwareAuditLog for snapshotting device-manager actions.

This ensures:

  • No application process can program BAR registers.
  • No application process can claim arbitrary memory for DMA.
  • No application process can observe or reset another session’s state.
  • A buggy or compromised driver crashes the driver process, not the kernel; the device-manager observes the crash, fails outstanding capabilities closed, and re-spawns the driver on the next session request.
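
One possible way to get the fail-closed and re-spawn behavior in the last item is a generation counter: each driver incarnation gets a new generation, and a capability is valid only for the generation that issued it. The proposal does not specify this mechanism, so the sketch below is an assumption, and DeviceManager, SessionCap, and CapError are hypothetical names.

```rust
/// Typed failures a client sees instead of a hang or a kernel fault.
#[derive(Debug, PartialEq, Eq)]
pub enum CapError {
    DriverDown,
    DriverRestarted,
}

/// Hypothetical session capability tagged with the issuing driver generation.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct SessionCap {
    generation: u64,
}

/// Hypothetical device-manager view of one GPU driver process.
pub struct DeviceManager {
    generation: u64,
    running: bool,
    spawns: u32,
}

impl DeviceManager {
    pub fn new() -> Self {
        Self { generation: 0, running: false, spawns: 0 }
    }

    /// Spawn the driver lazily on a session request, bumping the generation.
    pub fn open_session(&mut self) -> SessionCap {
        if !self.running {
            self.running = true;
            self.generation += 1;
            self.spawns += 1;
        }
        SessionCap { generation: self.generation }
    }

    /// Record a crash; outstanding capabilities now fail closed.
    pub fn on_driver_crash(&mut self) {
        self.running = false;
    }

    /// Validate a capability before forwarding any call to the driver.
    pub fn check(&self, cap: SessionCap) -> Result<(), CapError> {
        if cap.generation != self.generation {
            Err(CapError::DriverRestarted)
        } else if !self.running {
            Err(CapError::DriverDown)
        } else {
            Ok(())
        }
    }

    pub fn spawn_count(&self) -> u32 {
        self.spawns
    }
}
```

Capabilities issued before a crash stay invalid after the re-spawn, so a client cannot accidentally address state that belongs to the new driver incarnation.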

Dependencies and Alignment

This proposal depends on:

  • Device-manager refactor proposal for the userspace device-manager that owns the bootstrap-grant sources.
  • DMA-isolation design and IOMMU integration so DMA grants are enforceable in a multi-tenant context.
  • Userspace-binaries proposal for the driver-process runtime, libcapos / libcapos-posix surface for vendor SDK consumption, and the x86_64-unknown-capos target.
  • LLM and agent proposal for the primary consumer surface (LanguageModel, Embedder) and the agent runtime that exercises GPU-backed inference end-to-end.
  • Resource-accounting proposal for per-session memory and submission budgets.
  • Live-upgrade proposal for driver-service replacement without dropping GpuSession capabilities.

It complements:

  • Service-architecture and authority-broker proposals.
  • Storage/service manifest execution flow for shipping GPU service binaries and their bootstrap grants.
  • In-process threading work for future queue completion callbacks and worker pools inside the driver service.

Minimal acceptance criteria

  • make run-gpu-mock boots and prints GPU service lifecycle messages.
  • The device-manager spawns the GPU service and grants only device-scoped bootstrap grants for a single mock function.
  • A sample userspace client (Rust over capos-rt; C smoke later through libcapos) can create a session, allocate and map a GPU buffer, submit a synthetic job, and wait on a fence with a typed completion result.
  • Attempts to submit unsupported or malformed operations return explicit capnp CapException results, not driver crashes.
  • Removing the session capability invalidates descendant buffer and fence caps without kernel restart.
  • A subsequent slice points an LLM model server at the GPU service and proves a LanguageModel.generate(...) round-trip backed by the GPU session, satisfying the LLM proposal’s GPU-backend integration point.

Risks

  • Integrating NVIDIA's closed-source stack may require vendor-specific adaptation that is hostile to a capability shim; the AMD ROCm or vendor-neutral path (Vulkan compute, WebGPU/wgpu) may land first.
  • Buffer mapping semantics become complex with paging, fragmentation, and IOMMU domains. Pinned physical-memory-only buffers are the conservative starting point.
  • Interrupt-heavy completion paths require the scheduler evolution work (per-CPU run queues, fairness) before client-visible completion guarantees scale beyond a single workload.
  • Vendor SDKs assume a POSIX-ish process model; the libcapos-posix surface has to grow enough to host them without leaking ambient authority.
  • A GPU driver process is privileged from the application’s point of view. Compromise of a single driver process must remain bounded to one GPU function and one tenant set; the device-manager and IOMMU are the load-bearing controls there.

Open Questions

  • Is CUDA mandatory from first integration, or is the initial surface command-focused (opaque “program” bytes interpreted by the driver) with CUDA runtime-specific support added later?
  • Should memory registration support pinned physical memory only at first, or attempt to expose unified-virtual-memory semantics through the client’s VirtualMemory capability?
  • Which isolation level is needed for multi-tenant versus single-tenant in the first real-backend phase? Single-tenant per GPU function is the conservative default; MIG / SR-IOV-style partitioning is later work.
  • Does the GPU service expose model artifacts (weights, programs) as separate capability types so a model file can be granted to clients without the full session, or are programs always inline arguments?