Research: Linux Sandboxes And Virtualization For Workloads

capOS needs a credible way to run Linux-native software before every useful application, language runtime, package manager, development workflow, and desktop or server tool has a native capOS port. Users may want a familiar Linux environment. Agents may need a bounded place to run build systems, interpreters, package managers, browsers, command-line tools, scientific software, or model-generated code. Operators may need a compatibility bridge while capOS-native services are still emerging.

This note separates the available Linux isolation choices and records how they should map to generic capOS capability services. Scientific tooling is one important consumer of this substrate, but the substrate itself should be a general Linux workload sandbox.

The important distinction is between compatibility wrappers and isolation boundaries. Namespaces, cgroups, seccomp, Landlock, User-Mode Linux, containers, gVisor, and KVM microVMs all run “Linux things”, but they do not provide the same boundary, timing behavior, device model, or operational cost.

Source Baseline

External sources checked:

Local grounding:

Isolation Layers

Namespaces, cgroups, seccomp, And Landlock

The basic Linux sandbox stack is:

  • namespaces for separate views of process ids, mounts, users, networks, IPC, UTS names, time, and related global resources;
  • cgroup v2 for resource accounting, placement, and limits;
  • seccomp-BPF for syscall filtering;
  • Landlock for unprivileged filesystem access restriction;
  • rlimits and ordinary Unix credentials for process-local bounds.

This stack is useful for trusted or semi-trusted tools that need quick startup and native Linux performance. It is not a hard boundary against all kernel attack surface: a namespaced process still talks to the host Linux kernel through syscalls, page faults, filesystem code, networking, and device interfaces. For capOS, a namespace/cgroup/seccomp/Landlock sandbox is a good early backend for trusted batch tools, shell commands, build steps, formatters, language package commands, and scientific-base tools such as PARI/GP, Z3, cvc5, HiGHS, or Lean when the tools and inputs are trusted by the same operator.

The capOS wrapper should generate the sandbox policy from capability grants: read-only input directories, a scratch/output directory, optional loopback or egress network, CPU/memory/pids/io quotas, and a syscall profile. The policy is an implementation detail; the capOS-visible object is still a typed command, job, shell, build, solver, proof, CAS, notebook, or application capability.
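
As a rough illustration of that mapping, the sketch below derives a typed policy from capability grants. Every name here (SandboxPolicy, NetworkMode, ResourceQuota, policy_from_grants) is a hypothetical stand-in for illustration, not an existing capOS definition.

// Minimal sketch of a capability-derived sandbox policy. All type and field
// names are hypothetical; the policy stays an implementation detail behind
// the typed capability the caller actually holds.

use std::path::PathBuf;

enum NetworkMode { None, Loopback, BrokeredEgress }

struct ResourceQuota {
    cpu_millis_per_sec: u64, // cgroup v2 cpu.max-style quota
    memory_bytes: u64,       // cgroup v2 memory.max
    max_pids: u32,           // cgroup v2 pids.max
}

struct SandboxPolicy {
    read_only_inputs: Vec<PathBuf>, // from input artifact grants
    scratch_dir: PathBuf,           // writable scratch/output directory
    network: NetworkMode,           // from an explicit network grant, else None
    quota: ResourceQuota,           // from the resource envelope
    syscall_profile: String,        // seccomp profile name for the tool class
}

// Derive the policy purely from grants: nothing is writable or reachable
// unless a corresponding capability was passed in.
fn policy_from_grants(
    inputs: Vec<PathBuf>,
    scratch: PathBuf,
    net: Option<NetworkMode>,
    quota: ResourceQuota,
    profile: &str,
) -> SandboxPolicy {
    SandboxPolicy {
        read_only_inputs: inputs,
        scratch_dir: scratch,
        network: net.unwrap_or(NetworkMode::None),
        quota,
        syscall_profile: profile.to_string(),
    }
}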

bubblewrap, nsjail, systemd-nspawn, And OCI Runtimes

bubblewrap is a low-level unprivileged sandboxing tool used by Flatpak-style systems. It is appropriate for single-process or small interactive tools where the desired policy is mostly mount and namespace shaping.
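
A minimal sketch of the invocation the wrapper might emit for such a tool, using bubblewrap's commonly documented flags; the mount set is illustrative, and distro-specific details such as /bin and /lib symlinks are omitted.

use std::process::Command;

// Sketch: run one trusted single-process tool under bubblewrap with read-only
// system mounts, a single writable scratch mount, and no network. Flags are
// bubblewrap's commonly documented options; verify against the installed bwrap.
fn run_in_bwrap(tool: &str, scratch: &str) -> std::io::Result<std::process::ExitStatus> {
    Command::new("bwrap")
        .args(["--ro-bind", "/usr", "/usr"])     // read-only system tree
        .args(["--proc", "/proc"])
        .args(["--dev", "/dev"])
        .args(["--tmpfs", "/tmp"])
        .arg("--bind").arg(scratch).arg("/work") // the only writable path
        .arg("--unshare-all")                    // fresh user/pid/net/ipc/uts/mount namespaces
        .arg("--die-with-parent")
        .arg("--")
        .arg(tool)
        .status()
}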

nsjail combines namespaces, cgroups, rlimits, and seccomp-BPF policies with a compact configuration format. It is a strong fit for early batch jobs, command-wrapper services, solver/proof-checker tasks, package commands, and agent tool calls because it already models the same inputs capOS cares about: uid/gid, chroot/root, mounts, network mode, time limits, memory limits, cgroups, and syscall policy.
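
A corresponding sketch for a batch command under nsjail. The flag names below are recalled from nsjail's command-line options and should be checked against the deployed nsjail version.

use std::process::Command;

// Sketch: a bounded batch command under nsjail, with a read-only input mount,
// a writable output mount, a wall-clock limit, and cgroup memory/pids caps.
// nsjail places the job in a fresh, empty network namespace by default, so
// there is no egress unless explicitly configured.
fn run_batch_in_nsjail(cmd: &str, input: &str, scratch: &str)
    -> std::io::Result<std::process::ExitStatus>
{
    Command::new("nsjail")
        .arg("-Mo")                                // ONCE mode: run the command, then exit
        .arg("--bindmount_ro").arg(format!("{input}:/in"))
        .arg("--bindmount").arg(format!("{scratch}:/out"))
        .arg("--time_limit").arg("600")            // wall-clock limit, seconds
        .arg("--cgroup_mem_max").arg("2147483648") // 2 GiB memory cap
        .arg("--cgroup_pids_max").arg("256")
        .arg("--")
        .arg(cmd)
        .status()
}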

systemd-nspawn is better for booting or debugging a full Linux userspace tree than for narrow per-tool sandboxing. It is useful for stateful development images and package-build roots, but it should not be the default tool executor because its shape encourages broad OS-in-container authority.

OCI runtimes and images are valuable for supply-chain compatibility. capOS should be able to import OCI image metadata and run image contents through a chosen sandbox backend, but it should not treat “OCI container” as a security claim. The security claim depends on the runtime and host policy.
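
A minimal sketch of what importing OCI image metadata could mean in practice. The field comments name the OCI image-configuration fields they mirror; the struct itself, and how it feeds a sandbox backend, are hypothetical.

// Sketch: the subset of OCI image configuration a capOS importer might record.
struct ImportedOciImage {
    manifest_digest: String,     // content address of the image manifest
    entrypoint: Vec<String>,     // config.Entrypoint
    cmd: Vec<String>,            // config.Cmd
    env: Vec<String>,            // config.Env, "KEY=value" strings
    working_dir: String,         // config.WorkingDir
    layer_diff_ids: Vec<String>, // rootfs.diff_ids, used to assemble the root filesystem
}

The importer records compatibility metadata only; the isolation claim still comes from whichever backend the image contents run under.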

User-Mode Linux

User-Mode Linux is a Linux kernel port that runs as a normal Linux process and talks to the host kernel instead of hardware. It is useful as a compatibility, debugging, and low-privilege Linux-kernel experiment path. It can contain a guest Linux userspace without requiring hardware virtualization.
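
For orientation, booting a UML guest is just running the UML kernel binary as an ordinary host process. The sketch below uses the conventional ubd0/root/mem parameters; binary and image paths depend entirely on the local build.

use std::process::Command;

// Sketch: start a UML guest as a normal process. "uml_kernel" is a kernel
// built with ARCH=um; ubd0= attaches a host file as the guest block device,
// root=/dev/ubda selects it as the root filesystem, mem= sets guest RAM.
fn boot_uml(uml_kernel: &str, rootfs: &str) -> std::io::Result<std::process::ExitStatus> {
    Command::new(uml_kernel)
        .arg(format!("ubd0={rootfs}"))
        .arg("root=/dev/ubda")
        .arg("mem=256M")
        .status()
}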

UML is not in the same category as a hardware-backed Linux guest. It does not give the same boundary as KVM/microVM execution, because the UML kernel and the guest workload ultimately run as host Linux processes and depend heavily on the host kernel surface. For capOS Linux workload execution, UML can be a convenient developer backend when /dev/kvm is unavailable, but it should not be the default answer for untrusted multi-tenant sessions, model-generated code, networked tools, or package-build execution.

gVisor

gVisor moves many host-kernel-facing interfaces into a per-sandbox application kernel and exposes an OCI runtime, runsc. This is an attractive middle tier: it keeps container-like resource behavior and tooling while reducing direct host kernel exposure for many syscalls.

The tradeoff is compatibility and performance. General Linux workloads can exercise native runtimes, dynamic loaders, filesystems, signals, threading, shared memory, networking, debuggers, browser sandboxes, package managers, and sometimes GPU/device paths. gVisor should be treated as a backend to test per workload class, not assumed compatible with every developer tool, package manager, browser, desktop app, scientific stack, proof assistant, or solver.

Hardware-Backed Linux Guests

For stronger isolation, use a Linux guest under hardware virtualization: QEMU/KVM, Firecracker, Cloud Hypervisor, or Kata Containers.

QEMU/KVM is the broadest compatibility target. It can run a full Linux guest with familiar device models, disks, networking, and debugging hooks. It is the right default for compatibility breadth, reproducibility, and complex package systems that expect a normal Linux distribution.
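
A minimal launch sketch, assuming standard QEMU options (-enable-kvm, a virtio disk, user-mode networking); the image path and sizing are placeholders the real wrapper would derive from the package closure and resource envelope.

use std::process::Command;

// Sketch: launch a full Linux guest under QEMU/KVM. User-mode networking is a
// placeholder; a real deployment would substitute a brokered network backend
// that matches the capability's networkPolicy.
fn launch_qemu_guest(image: &str) -> std::io::Result<std::process::Child> {
    Command::new("qemu-system-x86_64")
        .arg("-enable-kvm")
        .arg("-cpu").arg("host")
        .arg("-smp").arg("2")
        .arg("-m").arg("2048")
        .arg("-drive").arg(format!("file={image},if=virtio,format=qcow2"))
        .arg("-nic").arg("user")
        .arg("-nographic")
        .spawn()
}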

Firecracker is a narrow microVM monitor designed for serverless-style workloads. Its reduced device model and operational focus are attractive for batch jobs, command execution workers, stateless build/test workers, solver workers, and proof-check workers where the rootfs, network, block devices, and API surface can be kept small.
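
A sketch of driving Firecracker's API socket, shelling out to curl for brevity. The endpoints and JSON fields (machine-config, boot-source, drives, actions/InstanceStart) follow Firecracker's published API; the socket, kernel, and rootfs paths are placeholders.

use std::process::Command;

// Sketch: configure and start a Firecracker microVM through its API socket.
// Each call PUTs one JSON document; paths and sizes are placeholders.
fn api_put(sock: &str, path: &str, body: &str) -> std::io::Result<std::process::ExitStatus> {
    Command::new("curl")
        .arg("--unix-socket").arg(sock)
        .arg("-X").arg("PUT")
        .arg("-H").arg("Content-Type: application/json")
        .arg("-d").arg(body)
        .arg(format!("http://localhost{path}"))
        .status()
}

fn start_microvm(sock: &str) -> std::io::Result<()> {
    api_put(sock, "/machine-config",
            r#"{"vcpu_count": 1, "mem_size_mib": 256}"#)?;
    api_put(sock, "/boot-source",
            r#"{"kernel_image_path": "vmlinux", "boot_args": "console=ttyS0 reboot=k panic=1"}"#)?;
    api_put(sock, "/drives/rootfs",
            r#"{"drive_id": "rootfs", "path_on_host": "rootfs.ext4", "is_root_device": true, "is_read_only": false}"#)?;
    api_put(sock, "/actions", r#"{"action_type": "InstanceStart"}"#)?;
    Ok(())
}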

Kata Containers runs container workloads inside lightweight VMs and integrates with container orchestration. It is a good reference for mapping container workload semantics onto VM isolation. capOS does not need to import the full Kubernetes/Kata stack, but the pod-as-VM-sandbox idea maps well to a LinuxWorkloadVm, AgentJobVm, or other specialized Linux workload service.

Hardware-backed Linux guests should be the default for:

  • untrusted interactive Linux shells or familiar Linux workspaces;
  • untrusted notebook execution;
  • model-generated code that may exploit native extensions;
  • package builds from untrusted recipes;
  • network-enabled data processing;
  • multi-tenant hosted agent jobs;
  • browser, GUI, or desktop-like Linux application sessions;
  • workflows that need a full Linux distribution but should not share the host kernel attack surface.

Dedicated Host Isolation

VM and microVM boundaries reduce direct host-kernel sharing, but they do not remove every shared-hardware or operator-domain risk. Dedicated hosts, single-tenant nodes, or separately owned external hardware are appropriate when the workload has unusually high tenant risk, handles sensitive data, requires GPU or device passthrough, runs long-lived browser/GUI sessions with large attack surface, or must limit the blast radius of a VMM, firmware, driver, or VM-escape failure.

Dedicated hardware should be modeled as a deployment and tenancy property, not as a different Linux API. A QemuKvmVm or FirecrackerMicroVm running on a single-tenant host still exposes the same guest workload interface, but its security and scheduler evidence should record that the host was not shared with unrelated tenants. Conversely, a hardware-backed guest on a shared host is still a VM boundary, but it is not the strongest isolation class capOS can offer.

Virtualized Workloads And capOS Auto Full-NOHZ

For capOS scheduling design, Linux sandboxes are modeled as host-visible workloads when making native Tickless and Realtime Scheduling decisions. VMs, microVMs, UML processes, gVisor sandboxes, external sidecars, and VMM helper threads affect capOS through the host-visible set of runnable work, timers, IRQs, polling loops, and housekeeping obligations.

For capOS-native auto full-nohz scheduling:

  • capOS policy applies to the outer capOS-scheduled entity: VMM processes, vCPU threads, I/O helper threads, proxy processes, and native capOS services.
  • Guest Linux scheduler state is opaque. Guest CONFIG_NO_HZ_IDLE, nohz_full, cpuidle, and halt-poll settings may be recorded for diagnostics or benchmark interpretation, but they do not grant capOS CPU-isolation authority.
  • Ordinary Linux sandboxes should run as ordinary scheduled workloads unless the capOS-visible outer backend receives an explicit low-noise placement lease.
  • A sandbox descriptor must not set capOS auto full-nohz, CPU isolation, or exclusive CPU placement by itself. Those are scheduler-authority decisions with global cost.

Idle behavior still needs backend research because it determines whether an “idle” guest is actually idle from the host scheduler’s perspective. Linux CONFIG_NO_HZ_IDLE stops the guest scheduling-clock tick when a guest CPU is idle, which reduces guest-generated timer interrupts and vCPU wakeups. That does not enable capOS host tick suppression by itself. It only helps by making the VMM’s host-visible vCPU thread block more often and wake less often.

KVM prior art shows the boundary clearly. When a guest vCPU halts, the host may block the vCPU thread or poll briefly for a wakeup. Host-side KVM halt polling trades latency for CPU use, and large polling intervals can turn idle guest time into host kernel time. Guest-side halt polling makes the guest vCPU poll before halting and can run even when other host tasks are runnable. A capOS backend intended for low-noise placement therefore needs explicit accounting for VMM/vCPU polling, helper threads, virtio event loops, host timers, and IRQ placement.
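
Some of that evidence is directly visible on a Linux host. The sketch below reads the host KVM halt-polling knob and a vCPU thread's scheduler statistics (assuming scheduler statistics are enabled); the eligibility rule built on top of such evidence would be capOS policy, not an existing interface.

use std::fs;

// Sketch: gather host-side quietness evidence for one vCPU thread.
// /sys/module/kvm/parameters/halt_poll_ns is the host KVM halt-polling knob;
// /proc/<tid>/schedstat reports per-thread run time, run-queue wait time, and
// timeslice count when scheduler statistics are enabled.
fn vcpu_quietness_evidence(vcpu_tid: u32) -> std::io::Result<(String, String)> {
    let halt_poll_ns = fs::read_to_string("/sys/module/kvm/parameters/halt_poll_ns")?;
    let schedstat = fs::read_to_string(format!("/proc/{vcpu_tid}/schedstat"))?;
    Ok((halt_poll_ns.trim().to_string(), schedstat.trim().to_string()))
}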

The validation target is backend quietness, not Linux nohz integration:

  • idle vCPUs should block or halt instead of forcing periodic outer work;
  • one-shot guest timer deadlines should wake the vCPU correctly without a host periodic tick dependency;
  • VMM helper threads, block/network event loops, and virtio queues should be visible to capOS placement and accounting;
  • halt-polling or busy guest kernel threads should make the outer workload ineligible for low-noise placement rather than silently degrading a capOS scheduler claim;
  • benchmark reports should distinguish guest Linux tickless state from capOS outer scheduler state.

capOS Linux Workload Service Model

The capOS-visible service should hide the backend without hiding the security claim:

LinuxWorkloadSandbox {
  backend: NamespaceSandbox | GVisor | UserModeLinux | QemuKvmVm |
           FirecrackerMicroVm | KataVm | NativeCapos;
  isolationClass: Compatibility | ProcessSandbox | SyscallSandbox |
                  ApplicationKernel | HardwareVm | DedicatedHost;
  deployment: ExternalLinuxHost | CaposScheduledProxy |
              CaposScheduledVmm | DedicatedExternalHost | NativeCapos;
  workloadClass: InteractiveShell | BatchCommand | BuildJob |
                 PackageInstall | BrowserBackend | Notebook |
                 ScientificJob | AgentTool | ServiceDaemon;
  trustClass: SameOperator | UntrustedCode | MultiTenant | FamiliarWorkspace;
  placement: Ordinary | AutoNoHzEligible | CpuIsolationLease;
  packageClosure: PackageClosureId;
  inputCaps: ArtifactId[] | NamespaceGrant[];
  outputCaps: ArtifactSinkId[] | NamespaceGrant[];
  networkPolicy: None | Loopback | BrokeredEgress;
  resourceEnvelope: CpuMemoryIoPidGpuLimits;
  auditPolicy: ProvenanceRequired;
}

The wrapper should record (a structural sketch follows the list):

  • backend and version;
  • kernel, rootfs, image, and package closure hashes;
  • seccomp/Landlock/cgroup/namespace policy or VM device model;
  • deployment location, distinguishing external Linux-host policy from capOS-scheduled proxy/VMM/native state;
  • CPU affinity, cgroup CPU quota or VM vCPU placement, capOS NoHzEligibility/NoHzActivation state, and outer housekeeping CPU set when the workload is capOS-scheduled;
  • external host CPU/isolation/nohz metadata when the workload runs outside capOS, recorded as host evidence rather than capOS scheduler proof;
  • guest tickless/nohz state when a Linux guest is used, recorded separately from the capOS outer scheduler state;
  • network and block-device grants;
  • input and output artifact ids;
  • exit status, signal, timeout, OOM, or backend failure.
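
A structural sketch of such a record, with hypothetical names mirroring the list above:

// Sketch: one execution record per sandbox run. Names are illustrative; the
// real record would be a typed capOS artifact, not a free-form struct.
struct SandboxRunRecord {
    backend: String,                  // backend and version, e.g. "nsjail" or "qemu-kvm"
    kernel_hash: Option<String>,      // guest kernel hash when a VM backend is used
    rootfs_hash: Option<String>,      // rootfs or image hash
    package_closure: String,          // package closure id/hash
    policy_digest: String,            // seccomp/Landlock/cgroup/namespace policy or VM device model
    deployment: String,               // external Linux host vs capOS-scheduled proxy/VMM/native
    cpu_placement: String,            // affinity, quota or vCPU placement, NoHz state when capOS-scheduled
    guest_nohz_state: Option<String>, // guest tickless state, separate from the outer scheduler state
    network_grants: Vec<String>,
    block_grants: Vec<String>,
    input_artifacts: Vec<String>,
    output_artifacts: Vec<String>,
    outcome: RunOutcome,
}

enum RunOutcome {
    Exited(i32),
    Signaled(i32),
    TimedOut,
    OutOfMemory,
    BackendFailure(String),
}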

Recommendation

Use a tiered sidecar strategy (a minimum-isolation sketch follows the list):

  1. Namespace sandbox tier. Use nsjail or bubblewrap for trusted commands, package steps, build/test tools, and scientific-base batch tools, with cgroup v2 quotas, seccomp, Landlock where available, read-only inputs, and immutable output capture.
  2. gVisor tier. Test high-risk but container-compatible Linux workloads where syscall mediation is useful and full VM overhead is not justified.
  3. Hardware VM tier. Use QEMU/KVM for broad compatibility and Firecracker or Kata-style microVMs for repeated batch jobs. This is the default for untrusted familiar Linux workspaces, notebooks, model-generated code, package builds, networked tools, and multi-tenant agent work.
  4. Dedicated host tier. Use single-tenant nodes or separately owned external hosts for high-risk tenants, sensitive data, GPU/device passthrough, long-lived browser/GUI workloads, side-channel-sensitive jobs, and cases where VM escape or VMM compromise must have a smaller blast radius.
  5. UML tier. Keep User-Mode Linux as a developer/debug/compatibility fallback when KVM is unavailable, not as the primary strong-isolation backend.
  6. Native capOS tier. Migrate stable, small, well-understood services into native capOS userspace after the capability interfaces are proven.
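
The tiering can be summarized as a minimum-isolation rule. The sketch below expresses the recommendation as code using a subset of the service-model vocabulary; it is not an existing capOS API.

// Sketch: map trust class and device needs to the weakest acceptable isolation
// class under the tiering above. Enum variants are a subset of the service model.
#[derive(PartialEq, PartialOrd)]
enum IsolationClass { ProcessSandbox, ApplicationKernel, HardwareVm, DedicatedHost }

enum TrustClass { SameOperator, UntrustedCode, MultiTenant }

fn minimum_isolation(trust: TrustClass, device_passthrough: bool) -> IsolationClass {
    if device_passthrough {
        return IsolationClass::DedicatedHost;                       // tier 4
    }
    match trust {
        TrustClass::SameOperator => IsolationClass::ProcessSandbox, // tier 1; gVisor (tier 2) where useful
        TrustClass::UntrustedCode => IsolationClass::HardwareVm,    // tier 3 default
        TrustClass::MultiTenant => IsolationClass::HardwareVm,      // tier 3, or tier 4 per tenant risk
    }
}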

The first serious hardware-backed proof should run a Linux guest workload under QEMU/KVM, expose a narrow Cap’n Proto capability proxy to capOS, and execute a mix of familiar Linux commands plus one or two specialized workloads with artifact capture. Good first cases are a shell/build job, a package-manager or compiler invocation, and a scientific batch job such as PARI/GP, Z3/cvc5, HiGHS, or Lean. A later Firecracker proof can optimize startup and attack surface for stateless command, solver, proof-check, and agent-tool workers.

For browser use, this service is only a possible backend behind the BrowserSession capability. It must not expose a parallel browser authority model: origins, profiles, downloads, uploads, automation, and audit still belong to the browser capability surface, even if the actual browser engine runs in a Linux sandbox or hardware-backed Linux guest.