# Research: Linux Sandboxes And Virtualization For Workloads

capOS needs a credible way to run Linux-native software before every useful
application, language runtime, package manager, development workflow, and
desktop or server tool has a native capOS port. Users may want a familiar
Linux environment. Agents may need a bounded place to run build systems,
interpreters, package managers, browsers, command-line tools, scientific
software, or model-generated code. Operators may need a compatibility bridge
while capOS-native services are still emerging.

This note surveys the available Linux isolation options and records how each
should map to generic capOS capability services. Scientific tooling is one
important consumer of this substrate, but the substrate itself should be a
general Linux workload sandbox.

The important distinction is between **compatibility wrappers** and
**isolation boundaries**. Namespaces, cgroups, seccomp, Landlock, User-Mode
Linux, containers, gVisor, and KVM microVMs all run "Linux things", but they do
not provide the same boundary, timing behavior, device model, or operational
cost.

## Source Baseline

External sources checked:

- Linux kernel documentation, [Namespaces](https://www.kernel.org/doc/html/latest/admin-guide/namespaces/index.html)
- Linux kernel documentation, [Control Group v2](https://www.kernel.org/doc/html/latest/admin-guide/cgroup-v2.html)
- Linux kernel documentation, [Seccomp BPF](https://cdn.kernel.org/doc/html/latest/userspace-api/seccomp_filter.html)
- Linux kernel documentation, [Landlock unprivileged access control](https://docs.kernel.org/userspace-api/landlock.html)
- Linux kernel documentation, [User Mode Linux HOWTO](https://www.kernel.org/doc/html/latest/virt/uml/user_mode_linux_howto_v2.html)
- Linux kernel documentation, [CPU Isolation](https://docs.kernel.org/admin-guide/cpu-isolation.html)
- Linux kernel documentation, [Housekeeping](https://www.kernel.org/doc/html/latest/core-api/housekeeping.html)
- Linux kernel documentation, [KVM halt polling](https://docs.kernel.org/virt/kvm/halt-polling.html)
- Linux kernel documentation, [Guest halt polling](https://docs.kernel.org/virt/guest-halt-polling.html)
- Open Container Initiative, [Runtime Specification](https://github.com/opencontainers/runtime-spec)
- Open Container Initiative, [Image Specification](https://github.com/opencontainers/image-spec)
- bubblewrap: <https://github.com/containers/bubblewrap>
- nsjail: <https://github.com/google/nsjail>
- systemd-nspawn: <https://www.freedesktop.org/software/systemd/man/systemd-nspawn.html>
- gVisor: <https://gvisor.dev/docs/>
- QEMU: <https://qemu-project.gitlab.io/qemu/about/index.html>
- Firecracker: <https://github.com/firecracker-microvm/firecracker>
- Cloud Hypervisor: <https://www.cloudhypervisor.org/docs/prologue/introduction/>
- Kata Containers virtualization design: <https://github.com/kata-containers/kata-containers/blob/main/docs/design/virtualization.md>

Local grounding:

- [Userspace Binaries](../proposals/userspace-binaries-proposal.md)
- [Storage and Naming](../proposals/storage-and-naming-proposal.md)
- [Browser Capability and Agent Web Sessions](../proposals/browser-capability-proposal.md)
- [Scientific Agent-Lab Software Stack](scientific-agent-lab-stack.md)
- [NO_HZ, SQPOLL, and Realtime Scheduling](nohz-sqpoll-realtime.md)
- [Tickless and Realtime Scheduling](../proposals/tickless-realtime-scheduling-proposal.md)
- [System Performance Benchmarks](../proposals/system-performance-benchmarks-proposal.md)
- [HPC Parallel Processing Patterns](../proposals/hpc-parallel-patterns-proposal.md)

## Isolation Layers

### Namespaces, cgroups, seccomp, And Landlock

The basic Linux sandbox stack is:

- namespaces for separate views of process ids, mounts, users, networks, IPC,
  UTS names, time, and related global resources;
- cgroup v2 for resource accounting, placement, and limits;
- seccomp-BPF for syscall filtering;
- Landlock for unprivileged filesystem access restriction;
- rlimits and ordinary Unix credentials for process-local bounds.

This stack is useful for **trusted or semi-trusted tools** that need quick
startup and native Linux performance. It is not a hard boundary against the
full host kernel attack surface: a namespaced process still talks to the host Linux
kernel through syscalls, page faults, filesystem code, networking, and device
interfaces. For capOS, a namespace/cgroup/seccomp/Landlock sandbox is a good
early backend for trusted batch tools, shell commands, build steps,
formatters, language package commands, and `scientific-base` tools such as
PARI/GP, Z3, cvc5, HiGHS, or Lean when the tools and inputs are trusted by the
same operator.

The capOS wrapper should generate the sandbox policy from capability grants:
read-only input directories, a scratch/output directory, optional loopback or
egress network, CPU/memory/pids/io quotas, and a syscall profile. The policy
is an implementation detail; the capOS-visible object is still a typed command,
job, shell, build, solver, proof, CAS, notebook, or application capability.
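
A minimal sketch of that derivation, using hypothetical type and field names
(none of these identifiers come from an existing capOS API):

```rust
/// Hypothetical capability grants handed to the wrapper; illustrative only.
struct SandboxGrant {
    read_only_inputs: Vec<String>,   // host paths mapped read-only
    scratch_dir: String,             // writable scratch/output directory
    allow_loopback: bool,            // loopback-only networking
    allow_egress: bool,              // brokered egress, if granted
    cpu_quota_pct: u32,              // percent of one CPU
    memory_bytes: u64,               // cgroup v2 memory.max
    pids_max: u32,                   // cgroup v2 pids.max
    syscall_profile: SyscallProfile, // named seccomp profile
}

enum SyscallProfile {
    BatchTool, // e.g. no ptrace, no mount, no bpf
    BuildStep, // wider clone/exec surface for compilers
}

/// Concrete kernel-facing policy derived from the grant. This is the
/// "implementation detail" the text refers to: callers never see it.
struct KernelSandboxPolicy {
    new_namespaces: Vec<&'static str>,
    ro_binds: Vec<(String, String)>,
    rw_binds: Vec<(String, String)>,
    network: &'static str,
    cgroup_cpu_max: String, // "quota period" string for cpu.max
    cgroup_memory_max: u64,
    cgroup_pids_max: u32,
    seccomp_profile: SyscallProfile,
}

fn policy_from_grant(g: SandboxGrant) -> KernelSandboxPolicy {
    KernelSandboxPolicy {
        // Unshare the standard namespace set; the network namespace stays
        // disconnected unless an explicit network grant exists.
        new_namespaces: vec!["user", "mnt", "pid", "ipc", "uts", "cgroup", "net"],
        ro_binds: g.read_only_inputs.iter().map(|p| (p.clone(), p.clone())).collect(),
        rw_binds: vec![(g.scratch_dir.clone(), "/work".to_string())],
        network: if g.allow_egress { "brokered-egress" }
                 else if g.allow_loopback { "loopback-only" }
                 else { "none" },
        // cpu.max takes "quota period"; with a 100000us period, pct% of one
        // CPU is pct * 1000 microseconds of quota.
        cgroup_cpu_max: format!("{} 100000", g.cpu_quota_pct * 1000),
        cgroup_memory_max: g.memory_bytes,
        cgroup_pids_max: g.pids_max,
        seccomp_profile: g.syscall_profile,
    }
}
```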

### bubblewrap, nsjail, systemd-nspawn, And OCI Runtimes

bubblewrap is a low-level unprivileged sandboxing tool used by Flatpak-style
systems. It is appropriate for single-process or small interactive tools where
the desired policy is mostly mount and namespace shaping.

nsjail combines namespaces, cgroups, rlimits, and seccomp-BPF policies with a
compact configuration format. It is a strong fit for early batch jobs,
command-wrapper services, solver/proof-checker tasks, package commands, and
agent tool calls because it already models the same inputs capOS cares about:
uid/gid, chroot/root, mounts, network mode, time limits, memory limits,
cgroups, and syscall policy.
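
As an illustration of that fit, a wrapper might translate one capOS resource
envelope into an nsjail argument vector. The flag names below follow nsjail's
documented options but should be verified against the installed nsjail
version; the surrounding types are hypothetical:

```rust
/// Hypothetical resource envelope for a single batch command.
struct Envelope {
    root: String,                // read-only root tree
    input: String,               // read-only input directory
    scratch: String,             // writable scratch directory
    time_limit_s: u64,
    memory_bytes: u64,
    pids_max: u32,
    seccomp_policy_path: String, // Kafel policy file
}

/// Build an nsjail argument vector for one command. This shows the shape of
/// the mapping, not a vetted invocation; network handling is left to nsjail
/// defaults here.
fn nsjail_args(e: &Envelope, cmd: &[&str]) -> Vec<String> {
    let mut a: Vec<String> = vec![
        "--mode".into(), "o".into(), // run the command once, then exit
        "--chroot".into(), e.root.clone(),
        "--bindmount_ro".into(), e.input.clone(),
        "--bindmount".into(), format!("{}:/work", e.scratch),
        "--time_limit".into(), e.time_limit_s.to_string(),
        "--cgroup_mem_max".into(), e.memory_bytes.to_string(),
        "--cgroup_pids_max".into(), e.pids_max.to_string(),
        "--seccomp_policy".into(), e.seccomp_policy_path.clone(),
        "--".into(),
    ];
    a.extend(cmd.iter().map(|s| s.to_string()));
    a
}
```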

systemd-nspawn is better for booting or debugging a full Linux userspace tree
than for narrow per-tool sandboxing. It is useful for stateful development
images and package-build roots, but it should not be the default tool executor
because its shape encourages broad OS-in-container authority.

OCI runtimes and images are valuable for supply-chain compatibility. capOS
should be able to import OCI image metadata and run image contents through a
chosen sandbox backend, but it should not treat "OCI container" as a security
claim. The security claim depends on the runtime and host policy.
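
The point can be made concrete with a small sketch (hypothetical names): the
isolation class is a function of the runtime backend and host policy, never of
the image format.

```rust
/// Hypothetical backend choices; mirrors the schema later in this note.
enum Backend {
    NamespaceSandbox,
    GVisor,
    QemuKvmVm,
    FirecrackerMicroVm,
}

enum IsolationClass {
    ProcessSandbox,
    ApplicationKernel,
    HardwareVm,
}

/// An imported OCI image only contributes content and metadata; the
/// security claim comes from the backend chosen to run it.
fn isolation_class_for(backend: &Backend) -> IsolationClass {
    match backend {
        Backend::NamespaceSandbox => IsolationClass::ProcessSandbox,
        Backend::GVisor => IsolationClass::ApplicationKernel,
        Backend::QemuKvmVm | Backend::FirecrackerMicroVm => IsolationClass::HardwareVm,
    }
}
```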

### User-Mode Linux

User-Mode Linux is a Linux kernel port that runs as a normal Linux process and
talks to the host kernel instead of hardware. It is useful as a compatibility,
debugging, and low-privilege Linux-kernel experiment path. It can contain a
guest Linux userspace without requiring hardware virtualization.

UML is not the same category as a hardware-backed Linux guest. It does not give
the same boundary as KVM/microVM execution because the UML kernel and its guest
workload ultimately run as ordinary host Linux processes and depend heavily on the host
kernel surface. For capOS Linux workload execution, UML can be a convenient
developer backend when `/dev/kvm` is unavailable, but it should not be the
default answer for untrusted multi-tenant sessions, model-generated code,
networked tools, or package-build execution.

### gVisor

gVisor moves many host-kernel-facing interfaces into a per-sandbox application
kernel and exposes an OCI runtime, `runsc`. This is an attractive middle tier:
it keeps container-like resource behavior and tooling while reducing direct
host kernel exposure for many syscalls.

The tradeoff is compatibility and performance. General Linux workloads can
exercise native runtimes, dynamic loaders, filesystems, signals, threading,
shared memory, networking, debuggers, browser sandboxes, package managers, and
sometimes GPU/device paths. gVisor should be treated as a backend to test per
workload class, not assumed compatible with every developer tool, package
manager, browser, desktop app, scientific stack, proof assistant, or solver.
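
One way to keep that per-class testing honest is to treat compatibility as
recorded probe results rather than a blanket flag. A minimal sketch, with
hypothetical names:

```rust
/// Hypothetical record of a compatibility probe for one backend and one
/// workload class; capOS would keep these per backend version.
struct CompatProbe {
    backend: String,        // e.g. "gvisor/runsc <version>"
    workload_class: String, // e.g. "package-install", "notebook"
    probe_command: String,  // representative command that was exercised
    passed: bool,
    notes: String,          // missing syscalls, performance caveats, ...
}

/// A backend is eligible for a workload class only if a probe for that
/// class passed; absence of a probe means "untested", not "compatible".
fn eligible(probes: &[CompatProbe], backend: &str, class: &str) -> bool {
    probes
        .iter()
        .any(|p| p.backend == backend && p.workload_class == class && p.passed)
}
```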

### Hardware-Backed Linux Guests

For stronger isolation, use a Linux guest under hardware virtualization:
QEMU/KVM, Firecracker, Cloud Hypervisor, or Kata Containers.

QEMU/KVM is the broadest compatibility target. It can run a full Linux guest
with familiar device models, disks, networking, and debugging hooks. It is the
right default for compatibility breadth, reproducibility, and complex package
systems that expect a normal Linux distribution.

Firecracker is a narrow microVM monitor designed for serverless-style
workloads. Its reduced device model and operational focus are attractive for
batch jobs, command execution workers, stateless build/test workers, solver
workers, and proof-check workers where the rootfs, network, block devices, and
API surface can be kept small.

Kata Containers runs container workloads inside lightweight VMs and integrates
with container orchestration. It is a good reference for mapping container
workload semantics onto VM isolation. capOS does not need to import the full
Kubernetes/Kata stack, but the pod-as-VM-sandbox idea maps well to an
`LinuxWorkloadVm`, `AgentJobVm`, or other specialized Linux workload service.

Hardware-backed Linux guests should be the default for the following cases; a
selection sketch follows the list:

- untrusted interactive Linux shells or familiar Linux workspaces;
- untrusted notebook execution;
- model-generated code that may exploit native extensions;
- package builds from untrusted recipes;
- network-enabled data processing;
- multi-tenant hosted agent jobs;
- browser, GUI, or desktop-like Linux application sessions;
- workflows that need a full Linux distribution but should not share the host
  kernel attack surface.
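
A sketch of that default policy, with hypothetical names: trusted,
non-networked batch or build work may stay in a process sandbox, and
everything else falls through to a hardware-backed guest.

```rust
/// Hypothetical trust and workload classes; mirrors the schema below.
enum TrustClass { SameOperator, UntrustedCode, MultiTenant }
enum WorkloadClass { BatchCommand, BuildJob, Notebook, BrowserBackend, PackageInstall, InteractiveShell }
enum IsolationClass { ProcessSandbox, HardwareVm }

/// Default isolation-class selection: only trusted, non-networked batch or
/// build work stays in a process sandbox by default.
fn default_isolation(trust: TrustClass, work: WorkloadClass, networked: bool) -> IsolationClass {
    match (trust, work) {
        (TrustClass::SameOperator, WorkloadClass::BatchCommand)
        | (TrustClass::SameOperator, WorkloadClass::BuildJob)
            if !networked =>
        {
            IsolationClass::ProcessSandbox
        }
        _ => IsolationClass::HardwareVm,
    }
}
```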

### Dedicated Host Isolation

VM and microVM boundaries reduce direct host-kernel sharing, but they do not
remove every shared-hardware or operator-domain risk. Dedicated hosts,
single-tenant nodes, or separately owned external hardware are appropriate
when the workload has unusually high tenant risk, handles sensitive data,
requires GPU or device passthrough, runs long-lived browser/GUI sessions with
large attack surface, or must limit the blast radius of a VMM, firmware,
driver, or VM-escape failure.

Dedicated hardware should be modeled as a deployment and tenancy property,
not as a different Linux API. A `QemuKvmVm` or `FirecrackerMicroVm` running on
a single-tenant host still exposes the same guest workload interface, but its
security and scheduler evidence should record that the host was not shared
with unrelated tenants. Conversely, a hardware-backed guest on a shared host is
still a VM boundary, but it is not the strongest isolation class capOS can
offer.

## Virtualized Workloads And capOS Auto Full-NOHZ

For capOS scheduling design, Linux sandboxes are modeled as host-visible
workloads when making native
[Tickless and Realtime Scheduling](../proposals/tickless-realtime-scheduling-proposal.md)
decisions. VMs, microVMs, UML processes, gVisor sandboxes, external sidecars,
and VMM helper threads affect capOS through the host-visible set of runnable
work, timers, IRQs, polling loops, and housekeeping obligations.

For capOS-native auto full-nohz scheduling (an authority-check sketch follows
the list):

- capOS policy applies to the outer capOS-scheduled entity: VMM processes,
  vCPU threads, I/O helper threads, proxy processes, and native capOS services.
- Guest Linux scheduler state is opaque. Guest `CONFIG_NO_HZ_IDLE`,
  `nohz_full`, cpuidle, and halt-poll settings may be recorded for diagnostics
  or benchmark interpretation, but they do not grant capOS CPU-isolation
  authority.
- Ordinary Linux sandboxes should run as ordinary scheduled workloads unless
  the capOS-visible outer backend receives an explicit low-noise placement
  lease.
- A sandbox descriptor must not set capOS auto full-nohz, CPU isolation, or
  exclusive CPU placement by itself. Those are scheduler-authority decisions
  with global cost.
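
A sketch of the authority split this implies, with hypothetical names: a
sandbox descriptor can only request a placement class, and the request is
honored only when the scheduler authority has issued an explicit lease.

```rust
/// Hypothetical placement request carried by a sandbox descriptor.
enum PlacementRequest { Ordinary, AutoNoHzEligible, CpuIsolationLease }

/// Hypothetical lease issued by the capOS scheduler authority; descriptors
/// never mint these themselves.
struct PlacementLease { cpus: Vec<u32>, expires_ns: u64 }

enum Placement { Ordinary, LowNoise(PlacementLease) }

/// Descriptors request, the scheduler decides: without a lease from the
/// scheduler authority, any low-noise request degrades to ordinary placement.
fn resolve_placement(req: PlacementRequest, lease: Option<PlacementLease>) -> Placement {
    match (req, lease) {
        (PlacementRequest::Ordinary, _) => Placement::Ordinary,
        (_, Some(l)) => Placement::LowNoise(l),
        (_, None) => Placement::Ordinary,
    }
}
```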

Idle behavior still needs backend research because it determines whether an
"idle" guest is actually idle from the host scheduler's perspective. Linux
`CONFIG_NO_HZ_IDLE` stops the guest scheduling-clock tick when a guest CPU is
idle, which reduces guest-generated timer interrupts and vCPU wakeups. That
does not enable capOS host tick suppression by itself. It only helps by making
the VMM's host-visible vCPU thread block more often and wake less often.

KVM prior art shows the boundary clearly. When a guest vCPU halts, the host may
block the vCPU thread or poll briefly for a wakeup. Host-side KVM halt polling
trades latency for CPU use, and large polling intervals can turn idle guest
time into host kernel time. Guest-side halt polling makes the guest vCPU poll
before halting and can run even when other host tasks are runnable. A capOS
backend intended for low-noise placement therefore needs explicit accounting
for VMM/vCPU polling, helper threads, virtio event loops, host timers, and IRQ
placement.

The validation target is backend quietness, not Linux nohz integration (a
report sketch follows the list):

- idle vCPUs should block or halt instead of forcing periodic outer work;
- one-shot guest timer deadlines should wake the vCPU correctly without a host
  periodic tick dependency;
- VMM helper threads, block/network event loops, and virtio queues should be
  visible to capOS placement and accounting;
- halt-polling or busy guest kernel threads should make the outer workload
  ineligible for low-noise placement rather than silently degrading a capOS
  scheduler claim;
- benchmark reports should distinguish guest Linux tickless state from capOS
  outer scheduler state.
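
A sketch of a report shape that keeps the two layers separate, with
hypothetical field names; guest-side state is evidence for interpretation,
while eligibility is derived from the outer measurements alone:

```rust
/// Hypothetical quietness report for one backend under test.
struct BackendQuietnessReport {
    backend: String,               // e.g. "qemu-kvm <version>"
    // Outer, capOS-visible measurements for the VMM and vCPU threads.
    outer_wakeups_per_s_idle: f64, // wakeups while the guest is nominally idle
    outer_timer_irqs_per_s: f64,
    helper_threads_busy: bool,     // virtio/event-loop threads still spinning
    halt_polling_observed: bool,   // host- or guest-side polling detected
    // Guest-side state, kept for interpretation, not authority.
    guest_nohz_config: String,     // e.g. "NO_HZ_IDLE", "nohz_full=..."
    guest_halt_poll_ns: Option<u64>,
}

/// Low-noise eligibility is decided from the outer measurements alone.
fn low_noise_eligible(r: &BackendQuietnessReport, wakeup_budget: f64) -> bool {
    !r.helper_threads_busy
        && !r.halt_polling_observed
        && r.outer_wakeups_per_s_idle <= wakeup_budget
}
```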

## capOS Linux Workload Service Model

The capOS-visible service should hide the backend without hiding the security
claim:

```text
LinuxWorkloadSandbox {
  backend: NamespaceSandbox | GVisor | UserModeLinux | QemuKvmVm |
           FirecrackerMicroVm | KataVm | NativeCapos;
  isolationClass: Compatibility | ProcessSandbox | SyscallSandbox |
                  ApplicationKernel | HardwareVm | DedicatedHost;
  deployment: ExternalLinuxHost | CaposScheduledProxy |
              CaposScheduledVmm | DedicatedExternalHost | NativeCapos;
  workloadClass: InteractiveShell | BatchCommand | BuildJob |
                 PackageInstall | BrowserBackend | Notebook |
                 ScientificJob | AgentTool | ServiceDaemon;
  trustClass: SameOperator | UntrustedCode | MultiTenant | FamiliarWorkspace;
  placement: Ordinary | AutoNoHzEligible | CpuIsolationLease;
  packageClosure: PackageClosureId;
  inputCaps: ArtifactId[] | NamespaceGrant[];
  outputCaps: ArtifactSinkId[] | NamespaceGrant[];
  networkPolicy: None | Loopback | BrokeredEgress;
  resourceEnvelope: CpuMemoryIoPidGpuLimits;
  auditPolicy: ProvenanceRequired;
}
```

The wrapper should record (a record-type sketch follows the list):

- backend and version;
- kernel, rootfs, image, and package closure hashes;
- seccomp/Landlock/cgroup/namespace policy or VM device model;
- deployment location, distinguishing external Linux-host policy from
  capOS-scheduled proxy/VMM/native state;
- CPU affinity, cgroup CPU quota or VM vCPU placement, capOS
  `NoHzEligibility`/`NoHzActivation` state, and outer housekeeping CPU set
  when the workload is capOS-scheduled;
- external host CPU/isolation/nohz metadata when the workload runs outside
  capOS, recorded as host evidence rather than capOS scheduler proof;
- guest tickless/nohz state when a Linux guest is used, recorded separately
  from the capOS outer scheduler state;
- network and block-device grants;
- input and output artifact ids;
- exit status, signal, timeout, OOM, or backend failure.
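
A compact sketch of that record as a single data type (hypothetical names),
mainly to show that external-host evidence, capOS outer scheduler state, and
guest tickless state live in separate fields:

```rust
/// Hypothetical execution record emitted by the wrapper after each run.
struct LinuxWorkloadRecord {
    backend: String,                        // backend name and version
    kernel_hash: String,
    rootfs_hash: String,
    package_closure: String,
    policy_digest: String,                  // sandbox policy or VM device model
    deployment: String,                     // external host vs capOS-scheduled proxy/VMM
    outer_placement: String,                // affinity, quota, NoHz eligibility/activation
    external_host_evidence: Option<String>, // host-reported, not a capOS scheduler proof
    guest_tickless_state: Option<String>,   // guest nohz state, recorded separately
    network_grants: Vec<String>,
    block_grants: Vec<String>,
    input_artifacts: Vec<String>,
    output_artifacts: Vec<String>,
    outcome: Outcome,
}

enum Outcome {
    Exit(i32),
    Signal(i32),
    Timeout,
    OutOfMemory,
    BackendFailure(String),
}
```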

## Recommendation

Use a tiered sidecar strategy:

1. **Namespace sandbox tier.** Use nsjail or bubblewrap for trusted
   commands, package steps, build/test tools, and `scientific-base` batch
   tools, with cgroup v2 quotas, seccomp, Landlock where available, read-only
   inputs, and immutable output capture.
2. **gVisor tier.** Test high-risk but container-compatible Linux workloads
   where syscall mediation is useful and full VM overhead is not justified.
3. **Hardware VM tier.** Use QEMU/KVM for broad compatibility and Firecracker
   or Kata-style microVMs for repeated batch jobs. This is the default for
   untrusted familiar Linux workspaces, notebooks, model-generated code,
   package builds, networked tools, and multi-tenant agent work.
4. **Dedicated host tier.** Use single-tenant nodes or separately owned
   external hosts for high-risk tenants, sensitive data, GPU/device
   passthrough, long-lived browser/GUI workloads, side-channel-sensitive
   jobs, and cases where VM escape or VMM compromise must have a smaller
   blast radius.
5. **UML tier.** Keep User-Mode Linux as a developer/debug/compatibility
   fallback when KVM is unavailable, not as the primary strong-isolation
   backend.
6. **Native capOS tier.** Migrate stable, small, well-understood services into
   native capOS userspace after the capability interfaces are proven.

The first serious hardware-backed proof should run a Linux guest workload under
QEMU/KVM, expose a narrow Cap'n Proto capability proxy to capOS, and execute a
mix of familiar Linux commands plus one or two specialized workloads with
artifact capture. Good first cases are a shell/build job, a package-manager or
compiler invocation, and a scientific batch job such as PARI/GP, Z3/cvc5,
HiGHS, or Lean. A later Firecracker proof can optimize startup and attack
surface for stateless command, solver, proof-check, and agent-tool workers.
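
The proxy surface can stay narrow. A hypothetical sketch of it as a Rust
trait (a Cap'n Proto schema would mirror the same methods; none of these
names are an existing capOS interface):

```rust
/// Hypothetical narrow proxy interface exposed by the guest agent to capOS.
/// Each method corresponds to one capability-mediated operation; there is no
/// general shell or filesystem escape hatch.
trait LinuxGuestProxy {
    /// Run one command from the granted package closure against already
    /// staged inputs; returns an opaque job id.
    fn run_command(&mut self, argv: Vec<String>, env: Vec<(String, String)>) -> Result<u64, ProxyError>;

    /// Return captured stdout/stderr and the exit status for a finished job.
    fn wait(&mut self, job_id: u64) -> Result<JobResult, ProxyError>;

    /// Export a file produced under the job's output directory as an
    /// immutable artifact; the guest cannot name arbitrary host paths.
    fn export_artifact(&mut self, job_id: u64, relative_path: String) -> Result<String, ProxyError>;
}

struct JobResult {
    exit_code: i32,
    stdout: Vec<u8>,
    stderr: Vec<u8>,
}

enum ProxyError {
    NoSuchJob,
    PolicyDenied,
    Backend(String),
}
```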

For browser use, this service is only a possible backend behind the
[BrowserSession](../proposals/browser-capability-proposal.md) capability. It
must not expose a parallel browser authority model: origins, profiles,
downloads, uploads, automation, and audit still belong to the browser
capability surface, even if the actual browser engine runs in a Linux sandbox
or hardware-backed Linux guest.
