Proposal: Capability-Oriented GPU/CUDA Integration
Purpose
Define a minimal, capability-safe path to integrate NVIDIA/CUDA-capable GPUs into the capOS architecture without expanding kernel trust.
The kernel keeps direct control of hardware arbitration and trust boundaries. GPU hardware interaction is performed by a dedicated userspace service that is invoked through capability calls.
Positioning Against Current Project State
capOS currently provides:
- Process lifecycle, page tables, preemptive scheduling (PIT 100 Hz, round-robin, context switching).
- A global and per-process capability table with `CapObject` dispatch.
- Shared-memory capability ring (io_uring-inspired) with syscall-free SQE writes.
- `cap_enter` syscall for ordinary CALL dispatch and completion waits.
- No ACPI/PCI/interrupt infrastructure yet in-kernel.
GPU integration must therefore be staged: it should begin as a capability-model exercise, with real hardware I/O added once the underlying kernel subsystems exist.
Design Principles
- Keep policy in kernel, execution in userspace.
- Never expose raw PCI/MMIO/IRQ details to untrusted processes.
- Make GPU access explicit through narrow capabilities.
- Treat every stateful resource (session, buffer, queue, fence) as a capability.
- Require revocability and bounded lifetime for every GPU-facing object.
- Avoid a Linux-driver-in-kernel compatibility dependency.
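The revocability and bounded-lifetime principles can be made concrete with a small sketch. The following is an illustrative Rust model (not capOS's real capability table; all names here are hypothetical) of parent-linked capability entries where revoking a session invalidates every descendant cap:

```rust
use std::collections::HashMap;

// Illustrative generational-style capability id; not the real capOS type.
#[derive(Clone, Copy, PartialEq, Eq, Hash, Debug)]
pub struct CapId(u64);

struct CapEntry {
    kind: &'static str,    // e.g. "GpuSession", "GpuBuffer", "GpuFence"
    parent: Option<CapId>, // revoking a parent revokes all descendants
    revoked: bool,
}

#[derive(Default)]
pub struct CapTable {
    next: u64,
    entries: HashMap<CapId, CapEntry>,
}

impl CapTable {
    pub fn grant(&mut self, kind: &'static str, parent: Option<CapId>) -> CapId {
        let id = CapId(self.next);
        self.next += 1;
        self.entries.insert(id, CapEntry { kind, parent, revoked: false });
        id
    }

    /// A cap is valid only if it and every ancestor are unrevoked.
    pub fn is_valid(&self, id: CapId) -> bool {
        let mut cur = Some(id);
        while let Some(c) = cur {
            match self.entries.get(&c) {
                Some(e) if !e.revoked => cur = e.parent,
                _ => return false,
            }
        }
        true
    }

    /// Marking one entry revoked is enough: descendants fail closed
    /// because validity walks the parent chain.
    pub fn revoke(&mut self, id: CapId) {
        if let Some(e) = self.entries.get_mut(&id) {
            e.revoked = true;
        }
    }
}
```

The design choice worth noting is that revocation is checked on use (walking the parent chain) rather than eagerly propagated, which keeps `revoke` O(1) and makes "session close => all child caps revoked" hold by construction.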
Proposed Architecture
- capOS kernel (minimal): exposes only resource and mediation capabilities.
- gpu-device service (userspace): receives device-specific caps and exposes a stable GPU capability surface to clients.
- application: receives only `GpuSession`/`GpuBuffer`/`GpuFence` capabilities.
Kernel responsibilities
- Discover GPUs from PCI/ACPI layers.
- Map/register BAR windows and grant a scoped `DeviceMmio` capability.
- Set up interrupt routing and expose a scoped IRQ signaling capability.
- Enforce DMA trust boundaries for process memory offered to the driver.
- Enforce revocation when sessions are closed.
- Handle all faulting paths that would otherwise crash the kernel.
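The DMA trust-boundary responsibility above amounts to one check: a range offered for DMA must fall entirely inside memory the calling process owns, and address arithmetic must fail closed. A minimal sketch, assuming owned regions are tracked as `(base, len)` pairs (an illustrative representation, not the real capOS page bookkeeping):

```rust
/// Reject any DMA range not fully contained in a region the calling
/// process owns. `owned` holds (base, len) pairs; illustrative only.
pub fn dma_range_allowed(owned: &[(u64, u64)], addr: u64, len: u64) -> bool {
    if len == 0 {
        return false; // empty registrations are rejected outright
    }
    let end = match addr.checked_add(len) {
        Some(e) => e,
        None => return false, // address arithmetic overflow => fail closed
    };
    owned.iter().any(|&(base, rlen)| {
        // The range is allowed only if some owned region fully contains it.
        addr >= base && base.checked_add(rlen).map_or(false, |rend| end <= rend)
    })
}
```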
User-space GPU service responsibilities
- Open/initialize one GPU device from device-scoped caps.
- Allocate and track GPU contexts and queues.
- Implement command submission, buffer lifecycle, and synchronization.
- Translate capability calls into driver-specific operations.
- Expose only narrow, capability-typed handles to callers.
Capability Contract (schema additions)
Add to schema/capos.capnp:
```capnp
interface GpuDeviceManager {
  listDevices @0 () -> (devices :List(GpuDeviceInfo));
  openDevice @1 (capabilityIndex :UInt32) -> (session :GpuSession);
}

interface GpuSession {
  createBuffer @0 (bytes :UInt64, usage :Text) -> (buffer :GpuBuffer);
  destroyBuffer @1 (buffer :UInt32) -> ();
  launchKernel @2 (program :Text, grid :UInt32, block :UInt32,
                   bufferList :List(UInt32), fence :GpuFence) -> ();
  submitMemcpy @3 (dst :UInt32, src :UInt32, bytes :UInt64) -> ();
  submitFenceWait @4 (fence :UInt32) -> ();
}

interface GpuBuffer {
  mapReadWrite @0 () -> (addr :UInt64, len :UInt64);
  unmap @1 () -> ();
  size @2 () -> (bytes :UInt64);
  close @3 () -> ();
}

interface GpuFence {
  poll @0 () -> (status :Text);
  wait @1 (timeoutNanos :UInt64) -> (ok :Bool);
  close @2 () -> ();
}
```
Exact wire fields are intentionally flexible to keep this proposal at the interface level; method IDs and concrete argument packing should be finalized in the implementation PR.
Implementation Phases
Phase 0 (prerequisite): Stage 4 kernel capability syscalls
- Implement capability-call syscall ABI.
- Add the `cap_id`/`method_id`/`params_ptr`/`params_len` dispatch path.
- Add kernel/user copy and validation of capnp messages.
- Validate user process permissions before dispatch.
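The validation steps in Phase 0 compose into a single gate in front of dispatch. The following sketch assumes the `(cap_id, method_id, params_ptr, params_len)` ABI described above; the error names and the 64 KiB bound are illustrative, not decided:

```rust
/// Illustrative errors for a rejected capability call.
#[derive(Debug, PartialEq)]
pub enum CapCallError {
    BadCapability,
    BadMethod,
    ParamsTooLarge,
}

/// Illustrative upper bound on an inline capnp params payload.
pub const MAX_PARAMS_LEN: u64 = 64 * 1024;

/// Checks run before any capnp parsing or dispatch happens:
/// cap lookup, method range, and payload size.
pub fn validate_cap_call(
    cap_exists: bool,  // result of looking up cap_id in the caller's table
    method_count: u32, // number of methods on the target interface
    method_id: u32,
    params_len: u64,
) -> Result<(), CapCallError> {
    if !cap_exists {
        return Err(CapCallError::BadCapability);
    }
    if method_id >= method_count {
        return Err(CapCallError::BadMethod);
    }
    if params_len > MAX_PARAMS_LEN {
        return Err(CapCallError::ParamsTooLarge);
    }
    Ok(())
}
```

Ordering matters here: the capability check comes first so a caller without the cap learns nothing about which methods or payload sizes the interface accepts.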
Phase 1: Device mediation foundations
- Add kernel caps: `DeviceManager`/`DeviceMmio`/`InterruptHandle`/`DmaBuffer`.
- Add PCI/ACPI discovery enough to identify NVIDIA-compatible functions.
- Add guarded BAR mapping and scoped grant to an init-privileged service.
- Add a minimal `GpuDeviceManager` service scaffold returning synthetic/empty device handles.
- Add manifest entries for a GPU service binary and launch dependencies.
Phase 2: Service-based mock backend
- Implement a `gpu-mock` userspace service with the same `Gpu*` interface.
- Support no-op buffers and synthetic fences.
- Prove end-to-end:
- init spawns driver
- process opens session
- buffer create/map/wait flows via capability calls
- Add regression checks in integration boot path output.
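The mock backend's job is only to make the end-to-end flow observable, so its state can stay trivial. A minimal sketch, assuming in-memory buffers and fences that are signaled at submission time (all names are illustrative):

```rust
use std::collections::HashMap;

/// Minimal gpu-mock state: no-op buffers plus synthetic fences that are
/// signaled as soon as a job is "submitted". Illustrative names only.
#[derive(Default)]
pub struct GpuMock {
    next_id: u32,
    buffers: HashMap<u32, Vec<u8>>, // buffer id -> backing bytes
    fences: HashMap<u32, bool>,     // fence id -> signaled
}

impl GpuMock {
    pub fn create_buffer(&mut self, bytes: u64) -> u32 {
        let id = self.next_id;
        self.next_id += 1;
        self.buffers.insert(id, vec![0u8; bytes as usize]);
        id
    }

    /// Stand-in for a mapped write: copies into the backing bytes.
    pub fn write(&mut self, buf: u32, data: &[u8]) -> bool {
        match self.buffers.get_mut(&buf) {
            Some(b) if b.len() >= data.len() => {
                b[..data.len()].copy_from_slice(data);
                true
            }
            _ => false,
        }
    }

    /// Synthetic launch: ignores the payload, returns an already-signaled fence.
    pub fn launch(&mut self, _program: &[u8]) -> u32 {
        let id = self.next_id;
        self.next_id += 1;
        self.fences.insert(id, true);
        id
    }

    /// Unknown fences fail closed rather than blocking.
    pub fn fence_wait(&self, fence: u32) -> bool {
        *self.fences.get(&fence).unwrap_or(&false)
    }
}
```

This is enough state to drive the "create/map/launch/wait" regression path in the boot output without touching any hardware.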
Phase 3: Real backend integration
- Add an actual backend adapter for one concrete GPU driver API available in the environment.
- Add:
- queue lifecycle
- fence lifecycle
- DMA registration/validation
- command execution path
- interrupt completion path to service and return through caps
- Keep backend replacement possible via trait-like abstraction in userspace service.
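The "trait-like abstraction" mentioned above can be sketched directly: the service depends only on a backend trait, so `gpu-mock` and a real adapter are interchangeable at construction time. Trait and method names below are illustrative, not a committed API:

```rust
/// Illustrative backend trait the userspace GPU service would program against.
pub trait GpuBackend {
    fn name(&self) -> &'static str;
    /// Submits an opaque program payload; returns a fence id on success.
    fn submit(&mut self, program: &[u8]) -> Result<u32, String>;
    fn fence_done(&self, fence: u32) -> bool;
}

/// Mock implementation: every submission succeeds and completes immediately.
#[derive(Default)]
pub struct MockBackend {
    next_fence: u32,
}

impl GpuBackend for MockBackend {
    fn name(&self) -> &'static str {
        "gpu-mock"
    }
    fn submit(&mut self, _program: &[u8]) -> Result<u32, String> {
        let id = self.next_fence;
        self.next_fence += 1;
        Ok(id) // synthetic fence, signaled at submission
    }
    fn fence_done(&self, _fence: u32) -> bool {
        true
    }
}

/// Service-side code sees only the trait object, so swapping the mock for
/// a real adapter is a construction-time choice, not a code change.
pub fn run_job(backend: &mut dyn GpuBackend, program: &[u8]) -> Result<bool, String> {
    let fence = backend.submit(program)?;
    Ok(backend.fence_done(fence))
}
```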
Phase 4: Security hardening
- Add per-session limits for mapped pages and in-flight submissions.
- Add bounded queue depth and timeout enforcement.
- Add explicit revocation propagation:
- session close => all child caps revoked.
- driver crash => all active caps fail closed.
- Add explicit audit hooks for submit/launch calls.
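The per-session limits above reduce to simple admission accounting checked before each map or submit. A sketch, with illustrative struct names and limit values:

```rust
/// Per-session policy: how much a session may map and keep in flight.
pub struct SessionLimits {
    pub max_mapped_pages: u64,
    pub max_in_flight: u32,
}

/// Per-session usage counters, updated on admission and completion.
#[derive(Default)]
pub struct SessionUsage {
    mapped_pages: u64,
    in_flight: u32,
}

impl SessionUsage {
    /// Admit a mapping only if the budget (and arithmetic) allows it.
    pub fn try_map(&mut self, limits: &SessionLimits, pages: u64) -> bool {
        match self.mapped_pages.checked_add(pages) {
            Some(total) if total <= limits.max_mapped_pages => {
                self.mapped_pages = total;
                true
            }
            _ => false, // over budget or overflow: refuse the mapping
        }
    }

    /// Bounded queue depth: refuse submission at the in-flight limit.
    pub fn try_submit(&mut self, limits: &SessionLimits) -> bool {
        if self.in_flight >= limits.max_in_flight {
            return false;
        }
        self.in_flight += 1;
        true
    }

    /// Called on fence completion (or timeout) to release a slot.
    pub fn complete(&mut self) {
        self.in_flight = self.in_flight.saturating_sub(1);
    }
}
```

Because admission happens before any backend work, a misbehaving client exhausts only its own session budget, never the service's.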
Security Model
The kernel does not grant a user process direct MMIO access.
Processes receive only `GpuSession`/`GpuBuffer`/`GpuFence` capabilities.
The service process receives `DeviceMmio`, `InterruptHandle`, and memory capabilities derived from its policy.
This ensures:
- No userland process can program BAR registers.
- No userland process can claim untrusted memory for DMA.
- No userland process can observe or reset another session's state.
Dependencies and Alignment
This proposal depends on:
- Stage 4 capability syscalls.
- Kernel networking/PCI/interrupt groundwork from cloud deployment roadmap.
- Stage 6/7 for richer cross-process IPC and SMP behavior.
It complements:
- Device and service architecture proposals.
- Storage/service manifest execution flow.
- In-process threading work (future queue completion callbacks).
Minimal acceptance criteria
- `make run` boots and prints GPU service lifecycle messages.
- Init spawns the GPU service and grants only device-scoped caps.
- A sample userspace client can:
- create session
- allocate and map a GPU buffer
- submit a synthetic job
- wait on a fence and receive completion
- Attempts to submit unsupported/malformed operations return explicit capnp errors.
- Removing service/session capabilities invalidates descendants without kernel restart.
Risks
- Integrating the real NVIDIA closed-source stack may require vendor-specific adaptation.
- Buffer mapping semantics can become complex with paging and fragmentation.
- Interrupt-heavy completion paths require robust scheduling before user-visible completion guarantees.
Open Questions
- Is CUDA mandatory from the first integration, or is the initial surface command-focused (`gpu-kernel` payload as opaque bytes), with CUDA runtime specifics added later?
- Should memory registration support only pinned physical memory at first?
- Which isolation level is needed for a multi-tenant versus a single-tenant first phase?