Proposal: Hardware Abstraction and Cloud Deployment

How capOS goes from “boots in QEMU” to “boots on a real cloud VM” (GCP, AWS, Azure). This covers the hardware abstraction infrastructure missing between the current QEMU-only kernel and real x86_64 hardware, plus the build system changes needed to produce deployable images.

Depends on: Kernel Networking Smoke Test (for PCI enumeration), Stage 5 (for timer history), Stage 7 / SMP proposal Phase C (for LAPIC timer and IPI).

Complements: Networking proposal (extends virtio-net toward cloud NICs), Storage proposal (extends local block-device work toward virtio-scsi and NVMe), SMP proposal (LAPIC timer/IPI infrastructure shared, with x2APIC tracked as a later backend).

Current State

The kernel boots via Limine UEFI, outputs to COM1 serial, has QEMU legacy PCI enumeration for the virtio-net smoke path, and has LAPIC timer/IPI groundwork from the SMP track. It also has an initial bounded, read-only ACPI diagnostic parser for Limine RSDP, RSDT/XSDT table inventory, MADT summaries, and MCFG presence/allocation summaries, plus a Q35 smoke that proves the reusable PCI config backend can enumerate a capped PCIe ECAM function inventory from MCFG. The x86 path exports bounded MADT I/O APIC/source-override records, maps the I/O APIC, and programs masked legacy IRQ routes to LAPIC vectors while honoring source overrides. PCI drivers can validate and map memory BAR subregions through a shared kernel helper; the virtio-net modern transport uses that helper for its common, notify, ISR, and device configuration regions. The PCI capability walk also reports MSI/MSI-X metadata for the virtio-net function, and the QEMU net smoke uses that metadata for a bounded kernel-owned virtio-net MSI-X dispatch/unmask and lifecycle proof through the device MSI vector pool; the remaining run-net fixture also covers queue setup, descriptor guards, ARP, and ICMP. Device-autonomous virtio-net MSI-X delivery is covered by the dedicated userspace-provider gates after the kernel L4 owner is retired.

The cloudboot image/harness slice landed in commit 02635421 (2026-05-05 06:51 UTC): make capos-cloudboot-image builds the importable raw disk tarball and make cloudboot-test drives the GCE upload/import/temporary-instance/serial-log loop with teardown. The first GCP imported-image serial-console boot proof is run 1778230874-715a (2026-05-08 09:06 UTC) against source commit 3951e275 (2026-05-08 08:50 UTC), reaching the capos kernel starting serial landmark on a temporary no-public-IP, no-service-account/scopes e2-small instance before teardown.

capOS still lacks public L4/SSH/WebShell ingress, AWS/Azure boot proofs and provider drivers, broader storage variants, high-throughput/multiqueue NIC readiness, direct-remapping DMA, production cloud-image release paths, and a cloud-ready clocksource/clockevent closeout. The GCP-first provider rollup has live serial-console operator access, selected NIC raw-frame reachability, selected NVMe Persistent Disk I/O, and gVNIC portability evidence.

The GCP-first usable cloud-instance provider rollup is closed by cloud-usable-instance-provider-nic-storage. Do not cite the cloudboot harness or the first GCP serial-console boot alone as evidence for provider NIC/storage readiness; the closeout depends on separate live NIC, storage, operator-access, and gVNIC evidence records. AWS/Azure, public ingress, and production cloud-image release gates remain separate.

2026-07-18: Private development reachability

The current development topology has active VPC peering between the development network and the capos-cloudtest-peered test network. The development host at 10.156.0.8/32 initiates direct private connections to capOS VMs in 172.20.0.0/24; the topology does not depend on reverse ingress or a public endpoint. Cloudboot selects it with --peered and launches each VM with --network-interface=subnet=capos-cloudtest-peered,no-address.

The peered firewall is externally owned and admits the development host plus the internal 172.20.0.0/24 range under the provisioned policy. The cloudboot runner has subnet-use authority only and does not create, sweep, or delete firewall rules in this mode. Peered VMs have no external IP and the topology has no Cloud NAT. Private Google API access on the subnet is not general internet egress.

The isolated capos-cloudtest sandbox on 10.200.0.0/24, including its script-managed private firewall rules and the IAP/vantage tunnel used to reach a private VM from the development host, remains the legacy deployment context. It is the superseded workaround for development-host-initiated access; the historical sandbox and public-ingress design discussion below remains relevant to those distinct deployment shapes.

Trusted Build Inputs And Reproducibility Cross-Links

Cloud deployment depends on the same trusted-build-inputs inventory that covers local builds. The consolidated supply-chain risk view – floating Rust nightly, observed-not-pinned xorriso / qemu-system-x86_64 / OVMF, CI publication and comparison of build-provenance records, and pinned production runner identity – is tracked as R13 in docs/design-risks-register.md; the detailed inventory, dependency policy, vendored-snapshot table, and the build-provenance retention/comparison policy live in docs/trusted-build-inputs.md. This proposal is recorded as a secondary owner of R13 because cloud-image release paths and provider-driver bring-up both depend on those reproducibility gates.

The implication for cloud bring-up is concrete: imported cloud images must travel with the corresponding make build-provenance record (source commit, toolchain identity, embedded-binary hashes, OVMF identity or explicit absence) before any provider serial-console run is cited as production evidence. Until the R13 gates close, cloud images remain local/CI proof artifacts rather than third-party reproducible boot images.

What Cloud VMs Provide

GCP (n2-standard), AWS (m6i/c7i), and Azure (Dv5) all expose:

Resource	Cloud interface	capOS status
Boot firmware	UEFI (all three)	Limine UEFI works
Serial console	COM1 0x3F8	Works (serial.rs)
Boot media	Hybrid BIOS+UEFI raw disk image, packaged per provider import rules	Partial (`make capos-cloudboot-image` builds a GCE-importable raw disk tarball; production release packaging and non-GCP provider packaging remain future)
Storage	virtio-scsi or NVMe (GCP Persistent Disk), NVMe/EBS (AWS Nitro), managed disks	Partial (GCP NVMe Persistent Disk brokered `READ` proof landed; GCP virtio-scsi, Local SSD, AWS/Azure storage, and broader filesystem-backed cloud storage remain future)
NIC	virtio-net or gVNIC (GCP), ENA (AWS), MANA (Azure)	Partial (GCP legacy virtio-net raw-frame `provider-nic-bound` and gVNIC raw-frame / typed-Nic proofs landed; public ingress, high-throughput/multiqueue, ENA, and MANA remain future)
Virtio NIC	QEMU, GCP where selectable, some bare-metal	Partial (QEMU smoke; reusable/cloud path planned)
Timer	LAPIC timer, TSC, HPET	Partial (LAPIC timer groundwork; cloud clocksource work missing)
Interrupt delivery	I/O APIC, MSI/MSI-X	Partial (masked MADT-backed I/O APIC routes, MSI/MSI-X capability metadata, and bounded kernel-owned virtio-net MSI-X dispatch/lifecycle proof; I/O APIC ownership and userspace interrupt authority missing)
Device discovery	ACPI + PCI/PCIe	Partial (QEMU legacy PCI smoke, bounded ACPI diagnostics/routing state, reusable legacy/ECAM PCI config access, kernel BAR/MMIO validation, MSI/MSI-X metadata discovery, and bounded virtio-net MSI-X dispatch proof; broader driver authority still missing)
Display	None (headless)	N/A

Cloud NIC And Storage Portability Notes

The Device Driver Foundation is not complete just because QEMU virtio-net works. Cloud bring-up has provider-specific NIC and storage surfaces, and the first implementation slices must keep those differences visible while still deferring the actual provider drivers.

Provider path	Expected device surface	capOS dependency	Current state
QEMU / constrained GCP virtio-net	Virtio PCI transport, virtqueues, MSI-X where available	Shared virtio transport helpers, `DMAPool`, `DeviceMmio`, `Interrupt`, and queue lifecycle proofs	QEMU virtio-net proofs and the live GCE legacy virtio-net raw-frame `provider-nic-bound` proof landed. This does not claim public L4 ingress, high-throughput/multiqueue readiness, or device-autonomous MSI-X completion delivery
GCP gVNIC	gVNIC as the modern Compute Engine NIC, replacing virtio-net on newer machine generations and required for some features	PCI BAR/MMIO binding, MSI-X routing, per-queue ring setup, image metadata declaring `GVNIC`, and fallback choice between virtio-net and gVNIC by machine family	Grounding plus bounded live proofs landed: the GCE gVNIC provenance map records the spec basis and authority mapping, the GCE harness can request `GVNIC` image/instance posture and inventory the `1ae0:0042` PCI function, the admin-queue/register proof maps BAR0 and issues one `DESCRIBE_DEVICE`, the raw-frame proof configures one GQI/QPL TX/RX queue pair, and the typed `Nic` adaptation proof exercises inline-frame `Nic.transmit` / `Nic.receive` over live gVNIC. No QEMU gVNIC model exists. This remains a separate GCE portability lane, not a blocker for the first public Web UI proof on a virtio-compatible machine type
AWS Nitro ENA + EBS	ENA enhanced networking plus Nitro NVMe storage	ENA queue/MSI-X driver, NVMe controller/storage path, IOMMU or bounce-buffer policy, and image import with ENA/NVMe expectations	Planned; no ENA, NVMe EBS, or AWS boot proof
Azure Accelerated Networking	Accelerated Networking exposes SR-IOV hardware families, with MANA as the newer Azure NIC and Mellanox mlx4/mlx5 still relevant on some hosts	Synthetic-interface fallback awareness, VF binding/revocation handling, MANA/Mellanox driver binding, MSI-X routing, and reset/revoke paths that survive VF removal	Planned; no MANA, Mellanox VF, or Azure boot proof

These rows are planning gates, not implementation evidence. Each provider NIC has its own queue layout, feature negotiation, MSI-X/vector conventions, reset behavior, and driver-binding rules. Azure’s accelerated-networking path also requires the OS and applications to tolerate dynamic SR-IOV VF revocation by falling back to the synthetic network interface. Provider storage follows the same rule: AWS Nitro uses NVMe for EBS, GCP can require NVMe on newer or Confidential VM paths while retaining virtio-scsi on older paths, and Azure uses SCSI on many older families while Azure Boost and newer NVMe-capable VM families expose managed disks through NVMe. The shared foundation therefore needs ACPI/PCIe discovery, BAR validation, interrupt ownership, DMAPool accounting, IOMMU/bounce-buffer policy, and lifecycle teardown before any cloud NIC or storage driver is treated as portable.

What Already Works

UEFI boot – Limine ISO includes BOOTX64.EFI. The boot path itself is cloud-compatible.
Serial output – all three clouds expose COM1. gcloud compute instances get-serial-port-output, aws ec2 get-console-output, and Azure serial console all read from it.
x86_64 long mode – cloud VMs are KVM-based x86_64. Architecture matches.

Managed Application Services

Booting capOS on a cloud VM and using managed cloud services are separate tracks. The VM path proves hardware, disk, network, and serial behavior. Managed services can be useful earlier for application persistence, especially game profile/world state, as long as they sit behind narrow capOS service capabilities.

For a GCP-backed adventure persistence bridge:

Cloud Run hosts a small bridge endpoint. It translates capOS save/load/append requests into provider calls and enforces request bounds before touching cloud APIs.
Cloud KMS owns the key-encrypting keys (KEKs) for each game-world instance or shard. The bridge or game-world service gets narrow authority to wrap or unwrap data-encrypting keys (DEKs) through Cloud KMS envelope encryption. Ordinary browser clients do not receive DEKs, game-world key capabilities, KMS decrypt/unwrap grants, or provider-independent plaintext authority; provider storage objects contain ciphertext, wrapped DEKs, and metadata only.
Firestore Native mode stores mutable profile summaries, indexes, and compare-and-set version records.
Cloud Storage stores larger immutable snapshots, evidence blobs, exports, and content-addressed records. Object versioning and lifecycle policy are required before using it for durable game data.
Secret Manager stores bridge-side provider credentials and rotation material. Those secrets are never granted to ordinary capOS game clients.

This does not change the storage proposal’s rule: persistence is still application-level serialization of bounded Cap’n Proto records. The cloud bridge is just one backing implementation for Store, Namespace, or an app-specific AdventureSaveStore/CloudGameStore capability. Local fake-cloud tests must enforce stale-write rejection, wrong-profile rejection, append-only ledger behavior, and size bounds before a real GCP deployment is trusted.

A separate browser-mediated path can serve user-owned private backups. In that model, the browser or web terminal host authenticates the user to Google, stores encrypted save capsules in Drive appDataFolder or Firebase user documents, and returns only opaque provider handles and encrypted capsule bytes through explicit restore flows. DEK unwrap and plaintext validation happen in the local capOS key domain or in the game-world service with KMS/IAM authority, not in browser JavaScript. This is appropriate for user profile backup, private expedition checkpoints, and settings sync. It is not appropriate for authoritative public world state, reward witness records, market receipts, or multiplayer outcomes. The user’s browser holds provider tokens; capOS game services do not. For GCP-backed game worlds, the browser transports envelope-encrypted capsules with wrapped DEKs but does not hold game-world key capabilities, KMS decrypt/unwrap grants, DEKs, or plaintext authority.

Firebase user-document capsule paths must make the auth binding visible in the path template, not just in policy metadata. Use a narrow shape such as users/{request.auth.uid}/saveCapsules/{capsule_id} so Firestore rules can bind the user wildcard to request.auth.uid; literal profile names such as users/alice/... are not accepted by the capOS policy model. Firestore rules remain access control for opaque encrypted capsules only. They must not be treated as validation for decrypted adventure semantics, and path segments must respect Firestore ID constraints such as no ., no .., no __.*__, and the 1,500-byte collection/document ID limit.
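
The path shape and ID constraints above can be checked before a capsule path is handed to Firestore. The following is a minimal shell sketch, not the capOS policy implementation: `valid_firestore_id` and `capsule_path` are hypothetical helper names, the `uid` argument stands for the authenticated `request.auth.uid` value, and the forward-slash rule comes from Firestore's documented ID constraints rather than from this proposal.

```bash
# Hypothetical helpers; neither name exists in the capOS tree.

# Succeed only if $1 is usable as a single Firestore collection/document ID.
valid_firestore_id() {
  local id="$1" bytes
  # Count bytes, not characters: the 1,500 limit is expressed in bytes.
  bytes=$(( $(printf '%s' "$id" | wc -c) ))
  [ "$bytes" -ge 1 ] && [ "$bytes" -le 1500 ] || return 1
  case "$id" in
    .|..)  return 1 ;;  # must not be exactly "." or ".."
    __*__) return 1 ;;  # must not match the reserved __.*__ pattern
    */*)   return 1 ;;  # a slash would start a new path segment
  esac
  return 0
}

# Print users/<uid>/saveCapsules/<capsule_id>, or fail if either ID is invalid.
capsule_path() {
  local uid="$1" capsule_id="$2"
  valid_firestore_id "$uid" || return 1
  valid_firestore_id "$capsule_id" || return 1
  printf 'users/%s/saveCapsules/%s\n' "$uid" "$capsule_id"
}
```

A check like this only guards path construction. It does not replace the Firestore rule that binds the user wildcard to `request.auth.uid`, and it says nothing about the decrypted capsule contents.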

GCP Cloud KMS And IAM Notes For Adventure Saves

GCP-backed adventure save capsules follow the same envelope-encryption model as CloudKmsKeySource and the volume-encryption proposal: Cloud KMS holds a key-encrypting key (KEK), the game-world service owns the capsule data-encrypting key (DEK), and KMS Encrypt/Decrypt wraps or unwraps that DEK rather than bulk-encrypting capsule bytes. Provision one Cloud KMS key ring and one symmetric CryptoKey KEK per game-world instance or shard. The key ring is an administrative grouping boundary; ordinary runtime authority should be granted on the CryptoKey resource where possible, not at the project or key-ring level. Do not claim key-version-scoped IAM as a design primitive for this path: predefined Cloud KMS crypto roles have CryptoKey as their lowest grantable resource.

Service accounts are split by operation:

Writers that only create new ciphertext receive roles/cloudkms.cryptoKeyEncrypter on the configured game-world CryptoKey so they can wrap a freshly generated DEK.
Restore, validation, and migration workers that must read protected capsules receive roles/cloudkms.cryptoKeyDecrypter on that CryptoKey so they can unwrap an existing DEK.
The narrow game-world service account receives roles/cloudkms.cryptoKeyEncrypterDecrypter only when the same service must both wrap and unwrap DEKs. Avoid roles/cloudkms.cryptoOperator, project-wide grants, owner/editor roles, browser OAuth identities, and service-agent roles for ordinary adventure runtime access.
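
The operation split above maps onto CryptoKey-level IAM bindings. The sketch below is a dry-run helper that only prints a `gcloud kms keys add-iam-policy-binding` command for review and executes nothing. The function names, the operation labels (`wrap`, `unwrap`, `wrap-unwrap`), and every resource name in the usage line are hypothetical.

```bash
# Hypothetical dry-run helper: prints a grant command for review, runs nothing.

# Map an operation class to the narrowest predefined role named above.
kms_role_for() {
  case "$1" in
    wrap)        echo "roles/cloudkms.cryptoKeyEncrypter" ;;
    unwrap)      echo "roles/cloudkms.cryptoKeyDecrypter" ;;
    wrap-unwrap) echo "roles/cloudkms.cryptoKeyEncrypterDecrypter" ;;
    *)           return 1 ;;  # anything broader is refused
  esac
}

# Usage: print_kms_grant <operation> <service-account-email> <key> <keyring> <location>
print_kms_grant() {
  local role
  role=$(kms_role_for "$1") || return 1
  printf 'gcloud kms keys add-iam-policy-binding %s --keyring=%s --location=%s --member=serviceAccount:%s --role=%s\n' \
    "$3" "$4" "$5" "$2" "$role"
}
```

For example, `print_kms_grant unwrap restore@example.iam.gserviceaccount.com world-kek world-ring europe-west3` prints a binding scoped to one CryptoKey, which keeps the grant off the project and key-ring levels.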

The browser-vault boundary does not change. Browser JavaScript may carry ciphertext, wrapped DEKs, capsule metadata, and opaque Drive/Firebase provider handles. It must not receive plaintext DEKs, capOS SymmetricKey or KeySource capabilities, Cloud KMS decrypt/unwrap grants, service account credentials, or provider-independent plaintext. The game-world service may use the unwrapped DEK internally as service authority, modeled as a SymmetricKey capability, but that authority does not cross into browser JavaScript. Possession of a Drive file id or Firebase document path is only transport authority over opaque encrypted bytes.

Rotation creates a new primary KEK version for future DEK wrapping. It does not re-encrypt existing capsules, rewrite wrapped DEK blobs, or disable/destroy old key versions automatically. Capsule re-encryption or rewrapping is a managed game-world service operation: unwrap the old DEK while its KEK version remains enabled and authorized, decrypt and validate the capsule inside the service, then write a new capsule using a new DEK or a DEK rewrapped by the current primary KEK version. The service verifies content hashes and ledger/profile bindings before replacing capsule metadata. Old KEK versions should only be disabled or scheduled for destruction after inventory proves no accepted wrapped DEK still depends on them.
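
The last rule, disabling an old KEK version only after inventory proves nothing depends on it, can be expressed as a small check. This is a sketch under stated assumptions: the inventory file format (one `<capsule_id> <kek_version>` pair per line) and both function names are hypothetical, and a real inventory would come from the game-world service's capsule metadata.

```bash
# Hypothetical inventory format: one "<capsule_id> <kek_version>" pair per line.

# Succeed if any accepted capsule still has a DEK wrapped by the given KEK version.
kek_version_in_use() {
  local inventory="$1" version="$2"
  # Appending "" forces a string comparison, so "01" and "1" stay distinct.
  awk -v v="$version" '($2 "") == (v "") { found = 1 } END { exit found ? 0 : 1 }' "$inventory"
}

# Report whether the version may be disabled; fail while it is still referenced.
retire_check() {
  if kek_version_in_use "$1" "$2"; then
    echo "blocked: version $2 still wraps accepted DEKs"
    return 1
  fi
  echo "safe-to-disable: version $2"
}
```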

Retiring a game-world first removes IAM decrypt authority from the world service and migration workers. If the retirement is meant to make existing capsules inaccessible, disable the relevant key versions and record the expected outage and recovery procedure before doing it. Destruction is delayed by Cloud KMS’ scheduled destruction period and is irreversible once completed, so destroy key versions only after audit retention, export, and break-glass recovery decisions are recorded. Disabling or destroying a key version can make all capsules that depend on it unreadable; this is a revocation tool, not cleanup.

Phase 1: Bootable Disk Image And Serial Diagnostics

Goal: Produce a raw hybrid BIOS+UEFI disk image that can boot locally and can be packaged for cloud import, alongside the existing ISO for QEMU. The first cloud-visible proof is serial-console boot to init/diagnostics, not network shell access.

The Problem

Cloud VMs boot from disk images, not ISOs. Each cloud has provider-specific format and boot-mode rules:

Cloud	Image format	Import method
GCP	`disk.raw` in gzip `.tar.gz` using old GNU tar; raw size in 1 GiB increments	`gcloud compute images create --source-uri=gs://...`
AWS	raw, VMDK, VHD/VHDX, or OVA	`aws ec2 import-image` with explicit boot-mode notes
Azure	VHD (fixed size)	`az image create --source`

GCP’s manual import path documents a functional MBR partition table or a hybrid GPT+MBR bootloader configuration for imported boot disks, plus ACPI support. AWS VM Import/Export supports both UEFI and legacy BIOS boot modes, but UEFI imports need a fallback EFI binary at /EFI/BOOT/BOOTX64.EFI; Nitro instances generally expect NVMe storage and ENA networking for useful operation. Therefore the first capOS image target should be a hybrid BIOS+UEFI raw disk: an ESP for UEFI fallback boot and a BIOS/MBR-compatible Limine path for import paths that still validate MBR bootability.

Disk Layout

Hybrid raw disk image (1 GiB-aligned for cloud packaging)
  Protective/hybrid MBR + GPT
  Partition 1: EFI System Partition (FAT32, ~32 MB)
    /EFI/BOOT/BOOTX64.EFI     (Limine UEFI loader)
    /limine.conf               (bootloader config)
    /boot/kernel               (capOS kernel ELF)
    /boot/init                 (init process ELF)
  Partition 2: (reserved for future use -- persistent store backing)
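
A cheap structural check on the assembled image can catch a missing partition table before upload. The sketch below assumes 512-byte logical sectors; the function names are hypothetical. It only confirms that the MBR signature and GPT header tag are present. It does not prove that BIOS boot code is installed, since a purely protective MBR carries the same signature.

```bash
# Hypothetical checks; they assume 512-byte logical sectors.

# Bytes 510-511 of sector 0 must be 55 AA for any MBR, protective or hybrid.
has_mbr_signature() {
  [ "$(od -An -tx1 -j510 -N2 "$1" | tr -d ' \t\n')" = "55aa" ]
}

# The GPT header sits in sector 1 and starts with the ASCII tag "EFI PART".
has_gpt_header() {
  [ "$(od -An -tx1 -j512 -N8 "$1" | tr -d ' \t\n')" = "4546492050415254" ]
}

# Print one status line; fail on the first missing structure.
check_layout() {
  has_mbr_signature "$1" || { echo "missing MBR signature"; return 1; }
  has_gpt_header "$1"    || { echo "missing GPT header"; return 1; }
  echo "layout ok"
}
```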

Build Tooling

New Makefile target make image using standard tools:

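IMAGE := capos.img
# Size in MiB. Kept on its own line so the value carries no trailing
# whitespace; 1024 MiB = 1 GiB keeps GCP raw image packaging simple.
IMAGE_SIZE := 1024

image: kernel init $(LIMINE_DIR)
	# Create raw disk image
	dd if=/dev/zero of=$(IMAGE) bs=1M count=$(IMAGE_SIZE)
	# Partition with GPT + ESP; keep room for hybrid/MBR boot metadata.
	sgdisk -n 1:2048:+32M -t 1:ef00 $(IMAGE)
	# Format the ESP as FAT32 in place with mtools (no root or loop mount).
	# @@1M is the partition offset: sector 2048 * 512 bytes.
	mformat -i $(IMAGE)@@1M -F -T 65536 ::
	# mcopy does not create missing directories, so make them first.
	mmd -i $(IMAGE)@@1M ::/EFI ::/EFI/BOOT ::/boot
	mcopy -i $(IMAGE)@@1M $(LIMINE_DIR)/BOOTX64.EFI ::/EFI/BOOT/
	mcopy -i $(IMAGE)@@1M limine.conf ::/
	mcopy -i $(IMAGE)@@1M $(KERNEL) ::/boot/kernel
	mcopy -i $(IMAGE)@@1M $(INIT) ::/boot/init
	# Limine's BIOS stage loads limine-bios.sys from the boot partition
	# (file name as shipped in current Limine binary releases).
	mcopy -i $(IMAGE)@@1M $(LIMINE_DIR)/limine-bios.sys ::/boot/
	# Install Limine BIOS path as well as UEFI fallback files.
	$(LIMINE_DIR)/limine bios-install $(IMAGE)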

New QEMU target to test disk boot locally:

run-disk: image
	qemu-system-x86_64 -drive file=$(IMAGE),format=raw \
		-bios /usr/share/edk2/x64/OVMF.4m.fd \
		-display none $(QEMU_COMMON); \
	test $$? -eq 1

Cloud upload helpers (scripts, not Makefile targets):

# GCP
cp capos.img disk.raw
tar --format=oldgnu -Sczf capos.tar.gz disk.raw
gcloud storage cp capos.tar.gz gs://my-bucket/
gcloud compute images create capos --source-uri=gs://my-bucket/capos.tar.gz

# AWS
aws s3 cp capos.img s3://my-bucket/capos.img
aws ec2 import-image --disk-containers \
  "Format=raw,UserBucket={S3Bucket=my-bucket,S3Key=capos.img}" \
  --boot-mode uefi

Serial diagnostics are part of Phase 1 rather than a later convenience. The cloud bring-up loop should be:

make run-disk proves the hybrid image under local QEMU/OVMF.
A local BIOS-mode disk run proves the MBR/Limine path if provider import requires it.
A serial diagnostics prompt is reachable on COM1 in QEMU.
GCP/AWS imported instances reach the same prompt through provider serial console output.

The serial diagnostics prompt should expose bounded read-only commands for status, cpu, mem, acpi, pci, irq, timers, devices, and logs, plus reboot/halt. It is the early remote debugging path for cloud driver bring-up before NICs or disks are reliable. It should not be required to upload large binaries, replace kernels in place, or stream high-volume tracing through cloud serial consoles.
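
The command surface above amounts to a bounded allowlist. The real prompt would live in the kernel's diagnostics mode; the shell sketch below only illustrates the classification logic. `classify_diag_command` is a hypothetical name and the 64-character line bound is an assumed value, not a capOS constant.

```bash
# Hypothetical classifier; MAX_LINE is an assumed bound, not a capOS constant.
MAX_LINE=64

# Print "read-only", "control", or "rejected" for one diagnostics input line.
classify_diag_command() {
  local line="$1" cmd
  # Bound the input before parsing anything.
  [ "${#line}" -le "$MAX_LINE" ] || { echo "rejected"; return 1; }
  cmd="${line%% *}"   # first word only; arguments are ignored in this sketch
  case "$cmd" in
    status|cpu|mem|acpi|pci|irq|timers|devices|logs)
      echo "read-only" ;;
    reboot|halt)
      echo "control" ;;
    *)
      echo "rejected"; return 1 ;;
  esac
}
```

Anything outside the two lists is refused, which matches the stated intent that the prompt is not a path for uploads, kernel replacement, or high-volume tracing.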

Dependencies

sgdisk (gdisk package) – GPT partitioning
mtools (mformat, mcopy) – FAT32 manipulation without root/loop mount

Scope

Makefile/helper script work for the image plus a narrow diagnostics-mode surface. Kernel changes are limited to serial diagnostics and any boot path adjustments needed for disk images; network and block drivers remain later phases.

Phase 0 closeout: GCE harness landed (2026-05-05 06:51 UTC)

Commit 02635421 (2026-05-05 06:51 UTC) records this harness closeout.

The first build-and-boot leg of Phase 1 landed as the cloud-boot harness. make capos-cloudboot-image produces a 10 GiB GPT-partitioned target/disk.raw with a 128 MiB FAT32 EFI System Partition holding the Limine UEFI loader, limine.conf, the kernel ELF, and the manifest, plus the Limine BIOS stage 2 embedded in the GPT for legacy SeaBIOS boot. The disk is repackaged as target/capos-disk.tar.gz using tar --format=oldgnu -czf, the exact form GCE’s manual import path expects. Disk size is enforced as an exact multiple of 1 GiB.
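
The 1 GiB rule is plain integer arithmetic. As a minimal sketch (the function names are hypothetical and not taken from the harness), rounding a size up and checking for an exact multiple look like this:

```bash
# Hypothetical size helpers for a GCE-importable raw disk.
GIB=$((1024 * 1024 * 1024))

# Round a byte count up to the next whole GiB.
round_up_gib() {
  echo $(( ($1 + GIB - 1) / GIB * GIB ))
}

# Succeed only if the byte count is a positive exact multiple of 1 GiB.
is_gib_multiple() {
  [ "$1" -gt 0 ] && [ $(( $1 % GIB )) -eq 0 ]
}
```

A 10 GiB disk is 10737418240 bytes, so `is_gib_multiple 10737418240` succeeds, while a disk one byte larger fails and `round_up_gib` would report 11811160064 as the next valid size. With GNU coreutils, the file size can be read with `stat -c %s target/disk.raw`.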

tools/cloudboot/run-test.sh (also wired as make cloudboot-test) drives the end-to-end loop on a sandbox GCE project. It runs an idempotent orphan sweep on a configured project-pinned label, uploads the staging tarball, creates the image, and creates the instance with no public IP, no service account, no API scopes, the same project-pinned label set, and the configured sandbox subnet. It then polls the serial port for the capos kernel starting landmark under a hard wall-clock budget. Serial output is captured under target/cloudboot-evidence/run-<id>/serial.log before teardown, and a bash trap on EXIT INT TERM always deletes the instance, image, and staged tarball, even on signal or partial failure. The harness hard-fails if the active project name does not match the configured sandbox.
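
The capture-then-teardown discipline can be sketched with a shell trap. This is not the code in tools/cloudboot/run-test.sh: the `delete_*` functions are hypothetical stand-ins for provider calls, and routing INT and TERM through the EXIT trap is one possible way to make the cleanup run exactly once.

```bash
# Hypothetical stand-ins for the real provider delete calls.
delete_instance()       { echo "deleted instance"; }
delete_image()          { echo "deleted image"; }
delete_staged_tarball() { echo "deleted tarball"; }

# Usage: run_with_teardown <command...>
run_with_teardown() {
  (  # subshell: the traps below apply to this run only
    teardown() {
      # Keep going if one delete fails so later resources are still removed.
      delete_instance       || true
      delete_image          || true
      delete_staged_tarball || true
    }
    trap teardown EXIT
    trap 'exit 130' INT     # signals exit, which fires the EXIT trap once
    trap 'exit 143' TERM

    rc=0
    "$@" || rc=$?             # e.g. polling serial output for the boot landmark
    echo "serial log saved"   # evidence is captured before teardown, pass or fail
    exit "$rc"
  )
}
```

Calling `run_with_teardown false` still prints the capture line and all three delete lines, then returns the failing status, which is the property the harness relies on for partial failures.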

Sandbox project name, subnet, staging bucket, and the IAM custom roles the harness assumes are operational details that depend on the host environment; they belong in tools/cloudboot/README.md and operator-local configuration, not in this proposal.

This is the harness only. The recurring portability gate that records cloud boot evidence on every reviewed cloud-relevant change remains open as docs/backlog/hardware-boot-storage.md Task 6, and the userspace driver authority gate remains open under DDF Task 5.

First GCP serial-console boot proof (2026-05-08 09:06 UTC)

The first imported-image GCP serial-console proof reached capos kernel starting as run 1778230874-715a at 2026-05-08 09:06 UTC, against source commit 3951e275 from 2026-05-08 08:50 UTC. The run used the cloudboot harness to import the staged disk image, create a temporary e2-small instance with no public IP and no service account/scopes, poll serial output for the kernel-start landmark, save the serial log under the run evidence directory, and tear down the temporary instance/image/staging objects.

This proves imported-image firmware/bootloader/kernel serial reachability on one GCP sandbox run only. It does not prove a usable cloud instance, provider NIC or storage drivers, cloud clocking, persistence, SSH/network shell access, AWS/Azure import, or production cloud readiness.

Web UI Image Feature-Contamination Gate

Every kernel variant is built into the shared cargo target/, so $(KERNEL) is a single mutable file. If a sibling feature build writes target/ between the GCE legacy-virtio Web UI image build and its upload, the raw disk can silently embed the wrong datapath arm. This happened on 2026-07-16: the L4 sustained-receive feature transitively enables the modern userspace Nic-cap roundtrip arm, so the image picked up the production-nic-grant datapath instead of the kernel-brokered legacy virtio-net datapath, and the first GCE boot failed after a ~32 min spend window was already committed to diagnosis.

The capos-gce-private-webui-cloudboot-image tarball (the uploaded artifact) now depends on a mandatory arm-verify gate, tools/cloudboot/webui-image-arm-verify.sh, that runs before the raw disk can be packaged:

It extracts the kernel from the assembled raw disk’s ESP (::/boot/kernel, the exact bytes GCE would boot) and fails closed unless it is the legacy-virtio arm.
The legacy-datapath sentinel (datapath=legacy-virtio-0.9-polled, a compile-time literal in the legacy runtime module) MUST be present.
The modern production-nic-grant sentinel (nic-grant-source-prod, a compile-time literal in the modern Nic-cap grant module) MUST be absent. kernel/src/cap/mod.rs turns the legacy + modern feature combination into a compile_error!, so a contaminated kernel flips at least one sentinel.

The gate is a deterministic static byte scan, not a boot: it needs no QEMU, OVMF, or provider access, so it runs on every build path, including headless CI, and no upload path can package an unverified tarball. make webui-image-arm-verify-test is its no-build fixture self-test; make capos-gce-private-webui-arm-verify re-runs it against the current raw disk. The live behavioral companion remains make run-cloud-gce-legacy-virtio-webui-serving, which boots the same kernel+manifest under the legacy virtio-net QEMU shape (OVMF, disable-modern=on, hostfwd) and asserts /healthz plus the serving marker end to end.
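
The core of such a scan is two literal-string searches over the kernel bytes. The sketch below is not tools/cloudboot/webui-image-arm-verify.sh: the function name and the status messages are hypothetical, and only the two sentinel strings are taken from the text above.

```bash
# Hypothetical scan: succeed only for a kernel that carries the legacy
# sentinel and does not carry the modern one.
verify_arm() {
  local kernel="$1"
  local legacy='datapath=legacy-virtio-0.9-polled'
  local modern='nic-grant-source-prod'

  # -a reads the binary as text, -F matches a literal, -q only sets the status.
  if ! grep -aFq -- "$legacy" "$kernel"; then
    echo "arm-verify: FAIL legacy sentinel missing" >&2
    return 1
  fi
  if grep -aFq -- "$modern" "$kernel"; then
    echo "arm-verify: FAIL modern sentinel present" >&2
    return 1
  fi
  echo "arm-verify: OK legacy-virtio arm"
}
```

To scan the bytes the VM would actually boot, the kernel can first be pulled out of the raw disk with mtools, for example `mcopy -i target/disk.raw@@1M ::/boot/kernel extracted-kernel`. The 1 MiB ESP offset in that command is an assumption carried over from the Phase 1 layout sketch and may differ for the landed image.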

Structural target isolation (a dedicated CARGO_TARGET_DIR per cloudboot image) would prevent the shared-target/ overwrite at the source and remains a reasonable future hardening, but the artifact-faithful arm-verify makes shipping a contaminated image impossible through any make-driven build path regardless of target-dir sharing.
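
As a sketch of what that isolation could look like, a wrapper can run any build command with a per-image target directory. `CARGO_TARGET_DIR` is cargo's own override; the wrapper name and the `target/cloudboot/<image>` layout are hypothetical, and the Makefile would also need to derive `$(KERNEL)` from the same directory for this to take effect.

```bash
# Hypothetical wrapper: runs a build command with a per-image cargo target dir.
# Usage: build_isolated <image-name> <command...>
build_isolated() {
  local image="$1"
  shift
  # The assignment applies to this one command only.
  CARGO_TARGET_DIR="target/cloudboot/$image" "$@"
}
```

For example, `build_isolated gce-private-webui cargo build --release` would keep that image's kernel out of reach of sibling feature builds that write to the shared target/.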

Private Web UI Reachability Evidence Contract

The first self-hosted Web UI provider proof is private GCE reachability, not operator browser exposure. The behavior task cloud-gce-private-self-hosted-webui-proof extends tools/cloudboot/run-test.sh with --require-web-ui-proof only after the local Web UI L4 proof, DHCP/IPv4 configuration, and Web UI hardening tasks are closed. This proposal defines the evidence contract for that later behavior slice; it does not authorize a billable GCE run, a public endpoint, broad firewall changes, TLS certificate provisioning, service-account broadening, or a production release.

The proof must keep the current cloudboot posture unless the behavior task is explicitly amended: no public IP on the capOS VM, no service account, no API scopes, no public firewall rule, and teardown through the existing orphan-sweep and EXIT INT TERM trap discipline. The reachability probe must cross the live GCE virtual network boundary. Acceptable shapes include a same-VPC probe instance, a provider-supported internal probe path, or another reviewed private path that sends packets through the capOS VM’s GCE NIC and private endpoint.

Evidence classes stay separate:

Evidence class	What it can prove	What it cannot prove
Cloudboot-only	The image imports, boots, emits serial markers, and tears down provider resources	Web UI reachability over the provider network
Provider-private	A private probe reaches `remote-session-web-ui` through the live GCE NIC and Phase C L4 path	Public operator access, TLS readiness, DNS readiness, or browser production posture
Operator-exposure	A separately authorized public or browser-mediated path reaches the Web UI under the selected ingress policy	The private proof by itself; it must depend on the private proof instead

The private Web UI proof records, before teardown, at least:

Field	Requirement
Run identity	Cloudboot run id plus source commit or image provenance used for the imported image
Machine shape	GCE machine family/type, NIC selection posture, and zone
Private posture	`public_ip=false` or equivalent, service-account/scopes posture, and no public firewall rule
Private endpoint	Internal IP or provider-private endpoint, UI port, and probe source identity
Probe path	Same-VPC probe, provider-supported internal probe, or other reviewed private path that crosses the GCE virtual network boundary
Web UI marker	A run-unique Web UI response marker, header, or body token observed by the private probe
Phase C L4 marker	The `remote-session-web-ui` Phase C L4 evidence marker, such as `cloudboot-evidence: remote-session-web-ui-l4 <token>`, tied to the same source commit/image
Private proof marker	A final structured marker, such as `cloudboot-evidence: gce-private-self-hosted-webui <token>`, emitted only after the private probe succeeds
Teardown	Instance, image, staged object, probe resources, and any private firewall or route resources created by the run were deleted or reported as a failed run
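
A report validator can refuse any record that omits one of these fields. The sketch below is not validate-private-webui-evidence.sh: the `key=value` record format, the key names, and the function name are all hypothetical and chosen only to mirror the table rows.

```bash
# Hypothetical record format: one "key=value" line per field; key names are illustrative.
REQUIRED_FIELDS="run_id source_commit machine_type nic_posture zone public_ip private_endpoint ui_port probe_source probe_path webui_marker l4_marker private_proof_marker teardown"

# Print each required key that is absent or empty; succeed only if none are.
missing_evidence_fields() {
  local record="$1" key missing=0
  for key in $REQUIRED_FIELDS; do
    # "=." requires at least one character after the equals sign.
    if ! grep -q "^${key}=." "$record"; then
      echo "$key"
      missing=1
    fi
  done
  [ "$missing" -eq 0 ]
}
```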

Private Proof Runbook Checklist

The --require-web-ui-proof harness gate closes provider-private Web UI reachability only when the run records these steps in order:

1. Preflight confirms the local Web UI L4 proof, DHCP/IPv4 proof, session hardening, and connection-bound prerequisites are closed, and confirms that the run has current authorization for billable private GCE execution.
2. Image/source provenance records the cloudboot run id, source commit, imported image or staged object identity, and the local artifact set used for the VM.
3. Launch posture records the zone, machine type, NIC posture, no public IP, no service account or API scopes, and no public firewall rule.
4. Probe setup records the private endpoint, UI port, probe source identity, and same-VPC or provider-supported private path that crosses the GCE virtual network boundary.
5. The private probe fetches the Web UI over that provider-private path and records a run-unique response marker, header, or body token.
6. The serial or harness evidence ties the same run to the Phase C L4 marker for remote-session-web-ui, such as cloudboot-evidence: remote-session-web-ui-l4 <token>, from the same source commit/image.
7. The harness emits the private proof marker, such as cloudboot-evidence: gce-private-self-hosted-webui <token>, only after the provider-private probe and L4-marker correlation both succeed.
8. Teardown removes the VM, imported image, staged object, probe resources, and any private firewall or route resources created by the run, using the normal orphan-sweep and trap discipline.
9. Failed-run reporting preserves the run id, failure class, last observed private posture, teardown result, and whether any loopback, same-guest, or serial-only diagnostics passed without treating those diagnostics as a provider-private proof.

No-Spend Preflight (Step 1, Landed as a Local Gate)

Step 1 of the checklist is implemented and testable today without any provider mutation: tools/cloudboot/run-test.sh --require-web-ui-proof --preflight-only runs the local no-spend preflight and exits before the harness access probe, orphan sweep, upload, image import, instance launch, firewall mutation, or creation of any probe resource.

It validates that the local prerequisite proofs are done (cloud-prod-remote-session-web-ui-l4-local-proof, remote-session-web-ui-session-hardening, remote-session-web-ui-connection-bounds, and the legacy-datapath serving prerequisite cloud-gce-legacy-virtio-webui-serving-local-proof), that an operator supplied a firewall-IAM attestation (the documented live blocker), and that a current per-run billable authorization is present, emitting one structured cloudboot-webui-preflight: line per check naming the failure class without printing credentials or attestation values. make cloudboot-gce-private-webui-preflight-check is the fixture gate proving the safe failure paths and that no provider CLI is invoked on any preflight path (tools/cloudboot/README.md documents the inputs and failure classes).

A preflight pass is cloudboot-only evidence – the output labels itself evidence-class=cloudboot-local-preflight – and is neither the provider-private proof nor authorization for a billable run. The live --require-web-ui-proof gate is implemented (make cloudboot-gce-private-webui-test): it runs this preflight first and fails closed before any provider interaction when the preflight fails, then drives the same-VPC probe, evidence-report, and teardown flow the checklist above describes, validating the rendered report with validate-private-webui-evidence.sh.

Evidence-Grammar Fixture (Local Gate)

The closeout evidence grammar for the table above is also locally testable without any provider mutation: tools/cloudboot/validate-private-webui-evidence.sh validates a harness-rendered evidence report for field completeness, marker ordering (the private proof marker only after the recorded private-probe pass and the correlated remote-session-web-ui-l4 marker), run/source identity agreement, private posture, and teardown result, and rejects loopback-only, serial-only, same-guest, public-IP, public-firewall, and missing-teardown evidence with structured failure classes. make cloudboot-gce-private-webui-evidence-fixture-check is the fixture gate (tools/cloudboot/README.md documents the report grammar and failure classes). A pass is evidence-class=cloudboot-local-private-webui-evidence-fixture with an explicit provider-private-reachability=not-proven label: it proves only that a future successful run’s evidence will be parsed, ordered, and classified correctly, not that any provider-private probe has run.
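
The marker-ordering rule can be illustrated with a minimal Python sketch. It is not the logic of validate-private-webui-evidence.sh: the `private-probe: pass <token>` line, the failure-class strings, the function name, and the assumption that all three lines carry the same run token are hypothetical stand-ins for the report grammar documented in tools/cloudboot/README.md. Only the two cloudboot-evidence marker names come from the table above.

```python
L4_MARKER = "cloudboot-evidence: remote-session-web-ui-l4 "
PROOF_MARKER = "cloudboot-evidence: gce-private-self-hosted-webui "
PROBE_PASS = "private-probe: pass "  # hypothetical report line, not the real grammar


def check_marker_order(lines):
    """Return (ok, failure_class) for an ordered list of report lines."""
    probe_token = None
    l4_token = None
    for raw in lines:
        line = raw.strip()
        if line.startswith(PROBE_PASS):
            probe_token = line[len(PROBE_PASS):]
        elif line.startswith(L4_MARKER):
            l4_token = line[len(L4_MARKER):]
        elif line.startswith(PROOF_MARKER):
            token = line[len(PROOF_MARKER):]
            # The proof marker is valid only after both earlier records exist.
            if probe_token is None:
                return False, "proof-marker-before-private-probe"
            if l4_token is None:
                return False, "proof-marker-before-l4-marker"
            # Assumed correlation rule: all three lines name the same run token.
            if not (token == probe_token == l4_token):
                return False, "run-token-mismatch"
            return True, "ok"
    return False, "proof-marker-missing"
```

A serial-only report that contains the L4 marker and the proof marker but no private-probe record is rejected by this sketch, which mirrors the rule that serial-only markers do not close the private proof.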

Loopback-only checks (127.0.0.1, guest-local localhost, or an in-guest HTTP health request) are supplemental service-health evidence. They may help diagnose a failed run, but they do not close cloud-gce-private-self-hosted-webui-proof because they do not prove the provider NIC, VPC routing, private endpoint, or probe-to-VM packet path. Serial-only markers are likewise insufficient for the private Web UI proof unless the private probe also succeeds and the harness records the required provider-private fields.

The public ingress policy below remains a later authorization boundary. Closing the private proof does not permit a public IP, load balancer, DNS name, TLS certificate, Identity-Aware Proxy, operator browser exposure, or widened service account. Public browser-facing exposure must reference the private proof as an input and then satisfy the separate public-ingress policy and on-hold approval gate.

Public Web UI Ingress Policy (First Operator-Access Proof)

The cloudboot harness intentionally launches with no public IP, no service account, and no API scopes. Exposing the self-served capOS Web UI (remote-session-web-ui, see Remote Session CapSet Client Gate 1B) to an operator browser is therefore a separate, reviewed exposure decision, not a follow-on of the private reachability proof. This section is the selected policy that the first public-ingress behavior task (cloud-gce-public-self-hosted-webui-ingress-tls) builds against, decided by cloud-gce-public-webui-ingress-tls-policy-design. That behavior task is now also blocked on independent accounted browser sessions, a private real-GCE network-ceiling closeout, and a zero-provider-call readiness-provenance extension. Fresh public-ingress authorization is necessary after those dependencies, but does not waive them.

Selected Ingress Shape: Provider-Terminated HTTPS Load Balancer

The first public proof uses a GCP external Application Load Balancer that terminates HTTPS at the Google front end. capOS serves only plain HTTP/1.1 on its UI backend port; the operator browser reaches the UI exclusively through the load balancer’s HTTPS virtual IP and hostname. TLS is terminated by Google’s front end against a managed certificate; capOS never holds the TLS private key and never parses hostile TLS bytes in this proof.

graph LR
    B[Operator browser] -- HTTPS --> LB[GCP external HTTPS<br/>Application Load Balancer<br/>Google-managed cert]
    LB -- HTTP, health-check-scoped firewall --> NEG[Zonal NEG / backend service]
    NEG --> VM[capOS VM<br/>remote-session-web-ui :8080<br/>plain HTTP/1.1, no public IP]
    style LB fill:#2d5,stroke:#333
    style VM fill:#2d5,stroke:#333

Why this shape is the first proof rather than direct capOS TLS termination:

Provider termination is the selected bootstrap boundary. A capOS-terminated TLS WebUI path, private-key custody, and ACME pieces now have bounded local proofs, but that does not make direct public termination the least-privilege first exposure. The provider shape keeps the VM private, avoids placing key custody and hostile TLS parsing on the first public proof, and reuses a separately reviewed forwarder trust boundary. Direct capOS-terminated public TLS remains a successor with its own DNS, renewal, resource-admission, and provider proof; this choice is policy, not a claim that no local TLS implementation exists.
Least privilege and reversibility. Provider-terminated TLS leaves the VM with no public IP and no inbound 0.0.0.0/0 rule, and keeps private-key custody out of both capOS and the harness. Teardown is the deletion of a bounded set of provider resources, not the rotation of an exposed key.
Clean successor path. When the capability-native TLS stack and an ACME flow ship, the direct-external-IP / capOS-terminated shape becomes available as a second, separately reviewed ingress. This proof does not foreclose it; it is the bootstrap step before it. The interim posture is recorded as “Bootstrap TLS for the First Public GCE Web UI” in Certificates and TLS, and the public GCE successor task is cloud-gce-public-webui-letsencrypt-direct-termination-proof. That successor requires a controlled public DNS name plus explicit billable/public-ingress authorization, and any Let’s Encrypt production call requires explicit CA authorization.

Raw public HTTP is not acceptable closeout evidence. If port 80 is published at all, it exists only as an HTTP-to-HTTPS 301 redirect at the load balancer and never reaches capOS. The closeout evidence must be the HTTPS path.

An optional hardening for the first proof is to enable Identity-Aware Proxy (IAP) on the backend service so the public door is gated by Google IAM before any request reaches the capOS backend. IAP here is not a separate ingress shape: it rides on the same external HTTPS load balancer and gates that backend service, so the ALB is still the only public entry point. IAP composes with, and does not replace, the capOS SessionManager/AuthorityBroker login boundary: IAP authenticates the human to Google; capOS still mints its own UserSession and projects only browser-safe view models. The browser never receives raw capOS caps.

Ingress Trust Model: What the Load Balancer Is Trusted With

The load-balancer shape makes the Google front end a trusted intermediary, and that trust should be named, not implied:

It terminates TLS, so it processes client traffic in plaintext between decryption and the backend leg. The LB→backend hop is plain HTTP inside the provider fabric (health-check-scoped firewall), not the public internet.
It is an admitted backend transport peer. remote-session-web-ui gates login-attempt reachability by peer address against two boot-configured CIDR lists (see the two-list peer trust model below). For this LB shape the balancer’s forwarder ranges are placed on trustedForwarderPeers, so the recorded ALB front-end/health-check ranges may reach login. The range authenticates the proxy transport and its forwarded-scheme claim, not the browser user or the resource account charged for pre-auth work.
Its forwarded-scheme claim is believed. X-Forwarded-Proto is authoritative only from the trustedForwarderPeers ranges — the balancer asserts whether the client leg was encrypted, and capOS accepts that assertion.

Consequently, a compromised or misconfigured balancer compromises the ingress model; this is the standard edge-termination trade-off, accepted deliberately for the first proof because the alternative (capOS-terminated TLS) is the separately reviewed successor.

What peer-pinning does and does not buy: the balancer forwards every internet client, so pinning does not by itself stop credential brute force. Edge policy (Cloud Armor rate limiting / WAF) can absorb part of that traffic; what pinning prevents is a direct backend bypass. Origin accounting is still required. The WebUI currently has peer-address and listener-wide failure backoff plus one global nonblocking password-verifier arena; behind the LB these controls are shared by unrelated users and can amplify a cheap failure stream into global lockout. Public readiness requires fair bounded service/anonymous-ingress CPU, crypto-arena, socket, buffer, and request admission with protected recovery/control progress. Cloud Armor is defense in depth, not the ledger of record.

For completeness, the postures ranked by what a hostile network position yields: HTTPS-via-LB exposes nothing on the client leg; direct plain HTTP would expose the operator password and session cookie to any on-path observer and allow active substitution of the served pages — which is why raw public HTTP is rejected as closeout evidence above. Private lab topologies that are neither loopback-proof nor LB-fronted (for example an IAP-tunneled sandbox vantage) are served by adding the sandbox’s private range to the insecureLoginTrustedPeers list in the web UI manifest (a DEMO/LAB posture, never a public shape), which grants that range both plaintext-login authority and loopback Host/Origin acceptance. That coupling is current deployment containment, not user identity or quota policy, and should be split if either role needs to evolve independently.

Certificate and Key Custody

Concern	First proof	Successor (deferred)
TLS terminator	Google front end (load balancer)	capOS userspace TLS service
Certificate source	Google-managed certificate (Certificate Manager or classic managed cert), or an operator-supplied cert resource on the load balancer	ACME (`AcmeClient` + `http-01`/`tls-alpn-01` solver) from Certificates and TLS
Private-key custody	Google-held; never in capOS or the harness	capOS `PrivateKey` cap sealed under a `KeySource`
Min TLS version / cipher policy	Load balancer SSL policy (TLS 1.2+ minimum; prefer the GCP `MODERN`/`RESTRICTED` profile)	capOS `CipherPolicy` (`modern`)

The first proof must not write a private key into the disk image, the manifest, the cloudboot evidence directory, or any harness-staged object. A managed certificate keeps key material entirely on the provider side.

The successor must preserve the same no-export rule on the capOS side: the ACME account key and TLS private key remain behind PrivateKey / KeyVault authority and are not copied into cloudboot images, manifests, logs, or evidence directories. Local ACME proofs use a local directory; public GCE/Let’s Encrypt proofs require explicit run authorization, DNS-name control, public-ingress teardown evidence, and staging-vs-production CA labeling.

Browser Session and Origin Policy

The self-served Web UI keeps the Gate 1B boundary: remote-session-web-ui is the trusted backend that holds remote-session/CapSet state server-side, and browser JavaScript receives only browser-safe view models. Public exposure adds the following reviewed browser rules:

Single public origin. UI assets and the same-origin JSON API are served from the one HTTPS origin (the load balancer hostname). No second origin, no wildcard CORS, no cross-origin credentialed requests. The service-side policy is implemented in remote-session-web-ui as a boot-manifest input: one public_origin.<host> marker cap (an inert Endpoint, granted after the service caps) fixes the accepted https://<host> origin at boot, validated fail-closed (second marker, malformed, loopback-named, or IP-literal-shaped host, or any unrecognized extra grant fails the boot), and consulted by the Host/Origin/Referer gates only for requests on the trusted forwarded-scheme HTTPS path, so a direct client can never claim the public origin. Browser-supplied principal/source hint headers (IAP assertions, authenticated-user hints) are rejected on the public-origin path before any backend-held capability dispatch, and no CORS headers are emitted. The public origin governs only Host/Origin/Referer validation; it does not by itself decide login authority (see the two-list peer trust model below). Loopback acceptance is peer-gated to insecureLoginTrustedPeers (default: guest loopback and the QEMU SLIRP gateway 10.0.2.2, itself never a forwarded public path): only a peer on that list keeps the loopback Host/Origin/Referer posture. A load-balancer forwarder peer is not on the insecure list, so its loopback-shaped requests fail closed before backend-held capability dispatch – the LB-forwarded path serves exactly one origin posture and the QEMU loopback proof stays usable for local manifests. Because 10.0.2.2 is a plausible RFC1918 neighbor address outside QEMU, the live deployment must keep untrusted sources off that address through the firewall/subnet plan (the backend port already admits only the GFE ranges above); verifying that posture is part of the on-hold public-ingress task’s acceptance. 
Locally proven by make run-cloud-prod-remote-session-web-ui-l4 (in-process trusted-forwarder fixture positive plus cross-origin, mixed-scheme, wildcard, missing-origin, hostile-Referer, principal-hint, real-ingress direct-client forged, and forwarded/ordinary-peer loopback-bypass negatives, with the preserved real-ingress loopback positive); the proof is local deployment-mode readiness only and claims no DNS name, load balancer, TLS endpoint, private GCE reachability, public ingress, or operator-browser exposure.
Two-list peer trust model. remote-session-web-ui derives three peer roles from two orthogonal, boot-configured IPv4 CIDR lists carried as inert marker caps (insecure_peer.<cidr> / forward_peer.<cidr>), each generated by CUE comprehension from the manifest insecureLoginTrustedPeers / trustedForwarderPeers fields (setting a field REPLACES its default; it does not append):
- insecureLoginTrustedPeers (default ["127.0.0.0/8", "10.0.2.2/32"]) grants loopback-posture (the loopback Host/Origin/Referer acceptance above) AND contributes to login authority. Its entries are trusted both to attempt plaintext login and to present loopback Host/Origin; 0.0.0.0/0 is a forbidden posture.
- trustedForwarderPeers (default []; the LB shape sets it to the GFE ranges 130.211.0.0/22, 35.191.0.0/16) grants forwarder trust (X-Forwarded-Proto is believed) AND contributes to login authority, but only when a public origin is configured — an LB forwards authenticated logins, and without a configured public origin there is no LB in front, so a forwarder-range peer connecting directly is not the LB-forwarding scenario and is refused. It does NOT grant loopback-posture, so a forwarder peer cannot present a loopback Host. Only real reverse-proxy/load-balancer forwarder ranges belong here; a direct client placed here could forge the client’s TLS leg.
- login authority is insecureLoginTrustedPeers always, plus trustedForwarderPeers only under a configured public origin; forwarder trust and loopback-posture each come from exactly one list (trustedForwarderPeers and insecureLoginTrustedPeers respectively). These are deployment config now, not hardcoded ranges or a bolt-on “extension”.
Forwarded-scheme trust is firewall-bounded. Because the backend hop is plain HTTP, capOS derives the external scheme from the load balancer’s X-Forwarded-Proto/forwarding headers. It trusts those headers only from the configured trustedForwarderPeers ranges (kept off other peers by the firewall below), and treats any such header from an unexpected source as absent (default to “not HTTPS”, fail closed on secure-context assumptions). The service-side trust gate is implemented in remote-session-web-ui (forwarded_scheme_peer_trusted / external_scheme_is_https, driven by the configured forwarder list, fail-closed on unknown peer formats) and locally proven by make run-cloud-prod-remote-session-web-ui-l4: a real ingress client forging X-Forwarded-Proto: https keeps the non-Secure cookie posture, and a fixture simulating the recorded ranges is the only path that flips the session cookie to Secure. The local proof remains plaintext-loopback and claims no live load balancer or TLS endpoint.
Session cookies. The session cookie is Secure, HttpOnly, and SameSite. The SameSite value is picked deterministically rather than mid-slice: Strict when no IAP front door is used, and Lax when IAP is enabled (the IAP sign-in redirect is a cross-site top-level navigation that would drop a Strict cookie on return). Secure is honored because the browser only ever sees the cookie over the load balancer’s HTTPS origin. The switch is implemented in remote-session-web-ui as a boot-manifest policy input: an IAP-fronted deployment manifest grants the inert iap_fronted_ingress marker cap (last in the web-ui grant list) to select Lax; without it the service emits Strict, and SameSite=None is never emitted. The posture applies uniformly to the session, CSRF, and logout/expiry clear-cookie headers, stays independent of the forwarded-scheme-derived Secure attribute, and is fixed at boot so no request header, cookie, or body field can select the weaker branch. Because a Lax cookie attaches on cross-site top-level GET navigations, the Lax posture additionally rejects authenticated GET views whose Fetch Metadata provenance (Sec-Fetch-Site) is cross-site – and cookie-bearing GETs with no Fetch Metadata at all, covering legacy browsers and webviews that attach Lax cookies without stating provenance – before any session state is touched; the gate is inert under Strict, where the cookie never attaches cross-site. make run-cloud-prod-remote-session-web-ui-l4 proves the default Strict posture end to end (including a real-ingress login forging IAP-shaped headers and body fields) and the Lax branch through the service’s in-process policy fixture; the live IAP-fronted deployment is future work.
HSTS and redirect. The HTTPS edge sets Strict-Transport-Security with a conservative max-age (no preload, no includeSubDomains commitment for the first proof). Any port-80 listener is a 301 to HTTPS only.
CSRF. State-changing JSON routes require a per-session anti-CSRF token and an Origin/Referer check against the known public origin; cross-origin or origin-absent state changes are rejected.
Session lifetime and logout. Sessions carry a bounded idle timeout and an absolute lifetime. Logout drops the server-side session and clears the cookie; the existing self-served stale-session / logout failure-closed boundary (proven in the Gate 1B implementation gate) extends unchanged to the public endpoint. A stale or expired cookie yields no authority.
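
The two-list peer trust model, the forwarded-scheme gate, and the SameSite selection described above can be summarized in a short Python sketch. It is an illustration under stated assumptions, not the remote-session-web-ui implementation: the function signatures, the dictionary return shape, the case-insensitive header comparison, and the example origin are hypothetical. The default lists, the GFE ranges, and the role rules are taken from the text.

```python
import ipaddress

DEFAULT_INSECURE = ("127.0.0.0/8", "10.0.2.2/32")  # documented default
DEFAULT_FORWARDERS = ()                              # documented default: empty


def _in_any(peer, cidrs):
    addr = ipaddress.ip_address(peer)
    return any(addr in ipaddress.ip_network(c) for c in cidrs)


def peer_roles(peer, insecure=DEFAULT_INSECURE, forwarders=DEFAULT_FORWARDERS,
               public_origin=None):
    """Derive the three peer roles from the two boot-configured CIDR lists."""
    if any(ipaddress.ip_network(c).prefixlen == 0 for c in insecure):
        raise ValueError("a catch-all insecure-login range is a forbidden posture")
    on_insecure = _in_any(peer, insecure)
    on_forwarder = _in_any(peer, forwarders)
    return {
        # Loopback Host/Origin/Referer acceptance comes from exactly one list.
        "loopback_posture": on_insecure,
        # X-Forwarded-Proto is believed only from forwarder peers.
        "forwarder_trust": on_forwarder,
        # Forwarder peers may reach login only under a configured public origin.
        "login_authority": on_insecure or (on_forwarder and public_origin is not None),
    }


def external_scheme_is_https(peer, headers, forwarders):
    """Treat a forwarded-scheme header from an untrusted peer as absent."""
    if not _in_any(peer, forwarders):
        return False
    return headers.get("X-Forwarded-Proto", "").lower() == "https"


def session_cookie_attrs(secure, iap_fronted):
    """Secure follows the derived scheme; SameSite is a boot-time input."""
    attrs = ["HttpOnly", "SameSite=Lax" if iap_fronted else "SameSite=Strict"]
    if secure:
        attrs.insert(0, "Secure")
    return attrs
```

For example, with forwarders set to the GFE ranges and no public origin configured, `peer_roles("35.191.1.2", forwarders=("130.211.0.0/22", "35.191.0.0/16"))` yields forwarder trust but no login authority, matching the rule that a forwarder-range peer connecting directly is refused.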

Firewall and Source-Range Policy

The instance keeps no public IP. Ingress to the capOS UI backend port is allowed only from Google’s load-balancer and health-check ranges, never from 0.0.0.0/0:

Allowed source	Purpose
`130.211.0.0/22`, `35.191.0.0/16`	Google Front Ends and load-balancer health checks reaching their separately port-scoped application and health backends
`35.235.240.0/20`	Identity-Aware Proxy (only if IAP fronting or IAP-tunneled SSH/diagnostics is used)

No other ingress rule is created. The proof does not broaden the service account, add API scopes beyond the LB/health-check need, open SSH to the public internet, or attach a broad firewall tag. Egress stays default-deny-friendly: the LB-terminated path needs no capOS outbound, and the future ACME path (which would require egress 443 to the ACME directory) is explicitly out of scope here.

The public application backend port and the protected provider-health port are distinct. The URL map targets only the application port; the health-check resource targets only the health port. Source-range firewalling remains necessary, but port/listener authority is what prevents an ordinary frontend request from selecting the protected health reserve.
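
The source-range rule can be checked mechanically. The following Python sketch is illustrative only and is not part of the cloudboot harness: the function name, the failure-class strings, and the decision to reject an empty source list are assumptions, while the CIDR values come from the table above.

```python
import ipaddress

GFE_AND_HEALTH = ("130.211.0.0/22", "35.191.0.0/16")
IAP = ("35.235.240.0/20",)


def check_ingress_sources(source_ranges, iap_used=False):
    """Return (ok, failure_class) for one backend-port firewall rule."""
    allowed = [ipaddress.ip_network(c) for c in GFE_AND_HEALTH]
    if iap_used:
        allowed += [ipaddress.ip_network(c) for c in IAP]
    if not source_ranges:
        return False, "source-ranges-missing"
    for cidr in source_ranges:
        net = ipaddress.ip_network(cidr)
        if net.prefixlen == 0:
            return False, "wildcard-backend-ingress"
        # Each requested range must sit inside one of the allowed ranges.
        inside = any(net.version == a.version and net.subnet_of(a) for a in allowed)
        if not inside:
            return False, "source-range-not-allowed"
    return True, "ok"
```

The IAP range is accepted only when the caller states that IAP fronting or IAP-tunneled diagnostics is in use, which follows the conditional row in the table.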

Backend Health-Check Contract (Local Baseline Landed; Split Required)

The landed local proof uses GET /healthz on the WebUI application listener. That establishes the bounded no-authority response semantics below, but an internet client routed through the load balancer can request the same path. A path string therefore cannot select protected health capacity. The required public implementation splits the provider probe onto a dedicated listener/port that the public URL map cannot reach; ordinary application-listener /healthz traffic, if retained, consumes the anonymous-ingress ledger.

Response: GET /healthz, served by demos/remote-session-web-ui (HEALTH_BODY). The exact bounded response body is {"ok":true,"service":"remote-session-web-ui"} with Content-Type: application/json and Cache-Control: no-store; it carries no cap ids, session ids, user/profile names, endpoint handles, provider resource ids, host paths, or secret material.
No authority: the route is unauthenticated and never creates, rotates, refreshes, or consumes a browser session; it never emits Set-Cookie, and a presented (even forged) session cookie changes nothing. The local proof drives a /healthz probe with live session cookies against an idle-expired session and asserts the next authenticated call still fails closed. It is the only unauthenticated public-ingress liveness exception; the Host/Origin/CSRF/session gates on authority-bearing routes are unchanged. (/api/health remains the bundled operator app’s same-origin page-load ping with the same no-authority posture; the provider health check never probes it.)
Protected provider route: the dedicated health listener may accept the provider’s by-IP probe without the application Host allowlist. Its listener cap, port, transport policy, occupancy budget, and response budget are distinct from the public application listener; headers and paths cannot mint that provenance.
Ordinary public route: an application-listener /healthz request is anonymous input. It may keep the same no-authority body, but receives no protected reserve merely because of its path or Host exemption.
Fail-closed variants: non-GET methods and path variants (POST /healthz, /healthz/extra, /HEALTHZ) return 404 without reaching any authority-bearing handler.
Availability under abuse: the current slow-client phases prove the local route completes while idle, partial-request, and drip-feed clients are held open. The public-readiness gate additionally floods ordinary /healthz and static routes while the distinct provider-health/lifecycle lane retains its declared progress; protected occupancy itself is bounded and tested for exhaustion/recovery.
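
The response and fail-closed rules above amount to an exact method-and-path match. A minimal Python sketch follows; the `handle_health` function and its tuple return shape are hypothetical, and only the body, the two response headers, and the 404 behavior for variants are taken from the contract. It does not model the protected-listener split or any occupancy budget.

```python
HEALTH_BODY = '{"ok":true,"service":"remote-session-web-ui"}'


def handle_health(method, path, request_headers=None):
    """Return (status, response_headers, body) for the health route.

    Request headers, including any Cookie, are deliberately ignored:
    the route neither reads nor changes session state.
    """
    if method != "GET" or path != "/healthz":
        return 404, {}, ""
    response_headers = {
        "Content-Type": "application/json",
        "Cache-Control": "no-store",
    }
    # No Set-Cookie is ever emitted on this route.
    return 200, response_headers, HEALTH_BODY
```

Because the match is exact and case-sensitive, `POST /healthz`, `/healthz/extra`, and `/HEALTHZ` all fall through to 404 in this sketch.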

The landed route proof is local backend evidence (evidence-class=local-qemu), not a live GCE health check: no health-check resource, load balancer, firewall rule, or public endpoint exists, and a passing local contract proof authorizes none of them. It is also not evidence that the required protected-listener split has landed.

Audit and Evidence Fields

The public proof records, before teardown, at least:

selected ingress shape (https-load-balancer) and whether IAP was enabled;
public endpoint (hostname and HTTPS virtual IP);
TLS posture: terminator (google-frontend), certificate type (google-managed or operator-supplied), and the load balancer SSL-policy minimum TLS version;
authentication method exercised (capOS SessionManager login, and Google IAM identity if IAP is enabled);
firewall/forwarding scope: the named source ranges, backend port, and the URL-map/forwarding-rule chain created;
HTTP-to-HTTPS redirect and HSTS header observation;
teardown result for every resource the proof created.

Teardown Checklist

The existing harness deletes the instance, image, and staging tarball in an EXIT INT TERM trap. The public proof extends that trap to delete, in dependency order, every ingress resource it creates:

global forwarding rule and target HTTPS proxy;
URL map and any HTTP-to-HTTPS redirect URL map / target HTTP proxy;
backend service and health check;
zonal/serverless NEG or managed instance group backing the backend;
managed certificate / certificate-map entry / SSL policy created for the run;
the LB-scoped and (if used) IAP-scoped firewall rules;
the reserved external IP address, if one was allocated for the LB;
the instance, image, and staged tarball (existing harness behavior).

Teardown must be idempotent and must run on signal or partial failure, matching the existing orphan-sweep discipline. A run that cannot confirm deletion of an ingress resource is a failed run, not a passed one.

Local Plan Gate (Landed)

The resource graph above is locally reviewable before any billable work: tools/cloudboot/plan-public-webui-ingress.sh renders and validates the selected plan shape with zero provider interaction, and make cloudboot-public-webui-ingress-plan-check is the fixture gate proving each rejected hazard (raw public HTTP to capOS, instance public IP, 0.0.0.0/0 backend ingress, missing /healthz health check, broad service account/scopes, staged private-key material, non-provider certificate custody) fails closed by structured class before any provider CLI could be invoked. Output is stamped evidence-class=cloudboot-local-plan with operator-exposure=not-proven; a plan pass is not public reachability, TLS readiness, or authorization for the on-hold public proof. The command contract and failure classes are documented in tools/cloudboot/README.md (“Public Web UI ingress plan gate”).

Local Teardown Fixture Gate (Landed)

The teardown checklist above is locally proven before any billable work: tools/cloudboot/teardown-public-webui-ingress.sh is the dependency-ordered, idempotent, deletion-confirming teardown engine over a per-run created-resources journal, and make cloudboot-public-webui-teardown-fixture-check exercises it against recording stub provider CLIs across complete, partial-create, command-failure, delete-claims-success-but-persists, unreadable-state, signal-trap, and orphan-sweep paths. Every checklist resource class is modeled and the engine’s class list must equal the plan gate’s rendered teardown-order= line (the fixture fails on drift), so a class added to the selected plan cannot go missing from the cleanup graph. An unconfirmed deletion is a blocking structured failure (undeleted-<class> / resource-state-unknown), matching the failed-run policy above. All public-ingress resource names must carry the capos-test- sweepable marker; a journal naming anything else is rejected before any provider call, and the orphan sweep enforces the marker client-side so out-of-scope resources are never deleted. Output is stamped evidence-class=cloudboot-local-teardown-fixture live-teardown=not-proven; a fixture pass is local harness evidence only, never live provider teardown evidence, and authorizes no public ingress. The journal grammar, sweep contract, and failure classes are documented in tools/cloudboot/README.md (“Public Web UI ingress teardown fixture gate”).

Local Evidence Fixture Gate (Landed)

The “Audit and Evidence Fields” contract above is locally proven before any billable work: tools/cloudboot/validate-public-webui-evidence.sh validates a harness-rendered public-proof closeout report against the selected evidence grammar, and make cloudboot-public-webui-proof-evidence-fixture-check is the fixture gate proving accepted and rejected reports over stub inputs with zero provider CLI invocations. Acceptance requires the recorded ingress shape, public HTTPS hostname/VIP, provider TLS terminator and managed or operator-supplied certificate resource, minimum TLS policy, IAP posture, no-key-custody statement, no-public-IP instance posture, GFE/health-check firewall scope, health-check, HTTP-to-HTTPS redirect and HSTS observations, capOS SessionManager login observation, a public HTTPS probe record, the correlated gce-public-self-hosted-webui-ingress-tls proof marker, and a per-resource teardown record pinned to the plan gate’s teardown-order= class list (the fixture fails on drift). Raw public HTTP, a direct instance public IP, wildcard backend ingress, a missing health check, missing HSTS/redirect observation, capOS or harness private-key custody, stale/missing/incomplete teardown, a non-provider TLS terminator, and private-proof-only evidence (a same-VPC or provider-internal probe path, or a proof marker without a recorded HTTPS probe) each fail closed by structured class. The tls terminator= label structurally separates this provider-terminated evidence contract from the later capOS-terminated TLS successor, so successor evidence can never pass through the first-proof grammar. Output names field names, classes, and line numbers only; input values are never echoed. 
Every pass is stamped evidence-class=cloudboot-local-public-webui-evidence-fixture with operator-exposure=not-proven: a fixture pass is local evidence-grammar validation only, never public reachability or operator-access evidence, and it does not authorize public exposure or move the live proof out of cloud-gce-public-self-hosted-webui-ingress-tls. The report grammar and failure classes are documented in tools/cloudboot/README.md (“Public Web UI evidence-grammar fixture gate”).

Local Provider Command Allowlist Gate (Landed)

The provider command boundary the future public proof may use is locally proven before any billable work: tools/cloudboot/check-public-webui-provider-commands.sh validates a recorded provider-command transcript against the selected resource graph, and make cloudboot-public-webui-provider-command-allowlist-check is the fixture gate proving both directions over recording stub gcloud/gsutil with zero live provider invocations. The allowlist permits only the resource families the plan and teardown checklist name – forwarding rules, target HTTPS/HTTP proxies, URL maps, backend services, health checks, zonal NEGs, scoped firewall rules, managed-certificate resources, SSL policies, reserved addresses, instance/image creation, and staged tarball upload/delete – and requires the capos-test- marker on every created resource, journal-pinned deletion (a delete must name a resource the created-resources journal recorded), GFE/IAP-only firewall source ranges, the capos-test filter on every listing, marker discipline on create-wired references, per-surface create flags and parameters pinned to the selected graph shape, an explicit pin of the documented sandbox project on every command, and explicit --global/--zone scope on deletes (ambient Cloud SDK project/region defaults are never trusted). Drift toward broader provider authority fails closed by structured class: IAM mutation, service-account/scopes changes, DNS mutation, private-key upload, 0.0.0.0/0 backend ingress, unmarked resources, deletion outside the journal (zone-pinned), project-wide or filter-restating sweeps, ambient credential flags, project/network/region scope overrides beyond the pinned sandbox forms, --flags-file indirection, non-selected create parameters, shell/environment inspection, and provider CLI resolution from an unexpected path. Rejected command content is reported by class and line number only; credentials, principals, key paths, and rejected names are never echoed. 
Output is stamped evidence-class=cloudboot-local-provider-command-allowlist with provider-mutation=none: a pass narrows what the future live proof may execute, it is not live provider evidence and does not authorize the on-hold public proof. The transcript grammar and failure classes are documented in tools/cloudboot/README.md (“Public Web UI provider-command allowlist gate”).

Local No-Spend Preflight Gate (Landed Baseline; Extension Required)

The future billable public proof must fail closed before any provider access, cost, certificate work, DNS change, or public exposure unless its pre-spend facts hold. That first-stage gate is the public equivalent of the private Web UI no-spend preflight: tools/cloudboot/run-test.sh --require-public-web-ui-proof --preflight-only runs entirely locally, exits before any gcloud/gsutil call, image import, instance/load-balancer/DNS/certificate mutation, or public exposure, and emits structured cloudboot-public-webui-preflight: lines naming each failure class. make cloudboot-public-webui-preflight-check is the fixture gate proving accepted and rejected cases against stub provider CLIs with zero provider CLI invocations. The preflight fails closed unless all of the following hold, each reported by its own check line:

the private Web UI reachability proof (cloud-gce-private-self-hosted-webui-proof) is recorded done in the task backend (private-proof-missing); it is a named input to the public policy, not a follow-on;
the selected public plan, teardown, and evidence-grammar local fixtures are present (local-fixture-missing);
explicit per-run public-ingress/TLS authorization is supplied (public-ingress-authorization-missing); it is a distinct operator input from the private billable authorization and any generic cloud authorization, so neither can stand in for public browser exposure;
the operator supplied the DNS-name, certificate-custody (google-managed/operator-supplied), minimum-TLS (TLS_1_2/TLS_1_3), and IAP (enabled/disabled) posture inputs for the provider-terminated HTTPS shape (dns-posture-missing, cert-posture-{missing,invalid}, tls-posture-{missing,invalid}, iap-posture-{missing,invalid});
the provider permissions the selected plan needs are attested (provider-permission-attestation-{missing,invalid}), without the preflight probing the provider or printing the attestation value.

Before the live public flag can be implemented, the public-ingress task must extend that landed baseline to require webui-independent-browser-sessions and webui-network-ceiling-real-gce-proof in task-backend done state, non-empty landed-commit provenance, and ancestry of every recorded closeout commit in the candidate source/image revision. Missing tasks, missing provenance, source mismatch, or task-backend failure must produce distinct fail-closed classes before any provider command. This readiness extension is target behavior, not part of the currently landed preflight evidence; it is tracked separately in cloud-gce-public-webui-readiness-preflight-extension.

The live public flag without --preflight-only fails closed before any provider interaction (no live public proof is implemented; the billable proof is on-hold behind explicit authorization), and the gate is dedicated: it does not combine with the private Web UI gate or with other proof, NIC-type, or machine-type flags. Output names failure classes and the two non-secret certificate/TLS/IAP modes only; it never echoes authorization strings, DNS names, private keys, credentials, tokens, host paths, or broad environment state. Every pass is stamped evidence-class=cloudboot-local-public-preflight with provider-public=not-proven operator-exposure=not-proven: a pass is local pre-spend evidence only, never public reachability, TLS readiness, or authorization for the on-hold public-ingress/TLS proof, and it does not move that proof out of cloud-gce-public-self-hosted-webui-ingress-tls. The command contract, operator inputs, and failure classes are documented in tools/cloudboot/README.md (“Public Web UI no-spend preflight gate”).

Phase 2: ACPI and Device Discovery

Goal: Parse ACPI tables to discover hardware topology, interrupt routing, and PCI root complexes. This replaces QEMU-specific hardcoded assumptions.

Why ACPI

On QEMU with default settings, you can hardcode PCI config access through I/O ports 0xCF8/0xCFC and assume legacy interrupt routing. On real cloud hardware:

PCI root complex addresses come from ACPI MCFG table (PCIe ECAM)
Interrupt routing comes from ACPI MADT (I/O APIC entries) and _PRT
CPU topology comes from ACPI MADT (LAPIC entries)
Timer info comes from the ACPI HPET table and the PM timer block described in the FADT

Limine provides the RSDP (Root System Description Pointer) address via its protocol. From there, the kernel can walk RSDT/XSDT to find specific tables.
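As a hedged illustration of that walk, the sketch below validates an XSDT image and extracts its 64-bit table pointers. It operates on a byte slice so it can run on a host; mapping the physical address through the HHDM is left out. The names `checksum_ok`, `xsdt_entries`, and the `MAX_XSDT_ENTRIES` cap are assumptions made for this example, not the landed capOS parser.

```rust
/// Size of the common ACPI System Description Table header.
const SDT_HEADER_LEN: usize = 36;
/// Hypothetical cap on accepted XSDT entries (the real cap is not stated here).
const MAX_XSDT_ENTRIES: usize = 64;

/// An ACPI table is valid when all of its bytes sum to zero modulo 256.
pub fn checksum_ok(table: &[u8]) -> bool {
    table.iter().fold(0u8, |sum, byte| sum.wrapping_add(*byte)) == 0
}

/// Extract the 64-bit table pointers from an XSDT image.
/// Returns None on a bad signature, length, checksum, or entry count.
pub fn xsdt_entries(table: &[u8]) -> Option<Vec<u64>> {
    if table.len() < SDT_HEADER_LEN || &table[..4] != b"XSDT" {
        return None;
    }
    // Bytes 4..8 of every SDT header hold the total table length.
    let len = u32::from_le_bytes([table[4], table[5], table[6], table[7]]) as usize;
    if len < SDT_HEADER_LEN || len > table.len() || !checksum_ok(&table[..len]) {
        return None;
    }
    let body = &table[SDT_HEADER_LEN..len];
    if body.len() % 8 != 0 || body.len() / 8 > MAX_XSDT_ENTRIES {
        return None;
    }
    let mut entries = Vec::new();
    for chunk in body.chunks_exact(8) {
        let mut raw = [0u8; 8];
        raw.copy_from_slice(chunk);
        entries.push(u64::from_le_bytes(raw));
    }
    Some(entries)
}
```

Each returned address points at another table whose four-byte signature (for example `APIC` for the MADT or `MCFG`) selects the parser to run next. An RSDT walk is the same except that its entries are 32 bits wide.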

Required Tables

Table	Purpose	Priority
MADT	LAPIC and I/O APIC addresses, CPU enumeration	High (Phase 2)
MCFG	PCIe Enhanced Configuration Access Mechanism base	High (Phase 2)
HPET	High Precision Event Timer address	Medium (fallback timer)
FADT	PM timer, shutdown/reset methods	Low (future)

Landed Discovery Slice

The first landed slices are bounded diagnostics plus reusable config access. The ACPI parser requests Limine’s RSDP, validates RSDP/RSDT/XSDT/static-table lengths and checksums within fixed caps, emits serial summaries for RSDT/XSDT table count and MADT/MCFG presence, reports MADT LAPIC/I/O APIC/interrupt-source-override inputs, and reports MCFG ECAM allocation records when firmware provides the table. The PCI layer now keeps the existing legacy I/O-port backend and adds an ECAM backend selected from MCFG allocations; devices retain their discovery backend so config reads, writes, capability walking, and BAR sizing use the same access path. The PCI layer also exposes a shared memory-BAR subregion validator/mapper, and the virtio-net transport uses it for modern capability regions. It also reports MSI/MSI-X capability metadata for the virtio-net function and uses kernel-owned config/RX/TX source records with a bounded first-fit LAPIC device MSI vector pool plus lock-free dispatch slots for QEMU virtio-net MSI-X table programming, virtio vector assignment, driver-owned route unmask, claimed-route lifecycle/reassignment proof, and TX delivery proof. The x86 setup maps MADT I/O APICs and programs masked legacy IRQ routes from MADT source overrides before higher-level drivers can depend on interrupt routing. The Q35 smoke asserts the ECAM inventory lines, a pci: config backend=ecam enumerated ... proof line, and representative masked I/O APIC route lines; the net smoke asserts virtio-net BAR, capability, MSI-X metadata, source-route records, route unmask records, vector programming, queue assignment, descriptor guards, ARP, and ICMP fixture lines before MMIO transport mapping completes. This path does not interpret AML, provide userspace driver authorities, or provide full unbounded bus discovery yet.
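To make the MADT inputs concrete, here is a minimal sketch of a bounded MADT entry walk over a table image that has already passed length and checksum validation. The entry layouts (type 0 local APIC, type 1 I/O APIC, type 2 interrupt source override) follow the ACPI specification; the `MadtEntry` type, the `madt_entries` function, and the `MAX_MADT_ENTRIES` cap are hypothetical names for this example only.

```rust
#[derive(Debug, PartialEq)]
pub enum MadtEntry {
    LocalApic { processor_uid: u8, apic_id: u8, flags: u32 },
    IoApic { id: u8, mmio_base: u32, gsi_base: u32 },
    SourceOverride { source_irq: u8, gsi: u32, flags: u16 },
    Other { entry_type: u8 },
}

/// Entries start after the 36-byte SDT header, the 32-bit LAPIC address,
/// and the 32-bit MADT flags word.
const MADT_ENTRIES_OFFSET: usize = 44;
/// Hypothetical cap so a corrupt table cannot drive unbounded work.
const MAX_MADT_ENTRIES: usize = 256;

fn le_u32(bytes: &[u8], offset: usize) -> u32 {
    u32::from_le_bytes([bytes[offset], bytes[offset + 1], bytes[offset + 2], bytes[offset + 3]])
}

/// Walk the variable-length entry list. Returns None when an entry is
/// truncated, reports an impossible length, or the cap is exceeded.
pub fn madt_entries(madt: &[u8]) -> Option<Vec<MadtEntry>> {
    if madt.len() < MADT_ENTRIES_OFFSET {
        return None;
    }
    let mut entries = Vec::new();
    let mut offset = MADT_ENTRIES_OFFSET;
    while offset < madt.len() {
        if entries.len() == MAX_MADT_ENTRIES || offset + 2 > madt.len() {
            return None;
        }
        let entry_type = madt[offset];
        let len = madt[offset + 1] as usize;
        if len < 2 || offset + len > madt.len() {
            return None;
        }
        let e = &madt[offset..offset + len];
        entries.push(match (entry_type, len) {
            (0, 8) => MadtEntry::LocalApic { processor_uid: e[2], apic_id: e[3], flags: le_u32(e, 4) },
            (1, 12) => MadtEntry::IoApic { id: e[2], mmio_base: le_u32(e, 4), gsi_base: le_u32(e, 8) },
            (2, 10) => MadtEntry::SourceOverride {
                source_irq: e[3],
                gsi: le_u32(e, 4),
                flags: u16::from_le_bytes([e[8], e[9]]),
            },
            _ => MadtEntry::Other { entry_type },
        });
        offset += len;
    }
    Some(entries)
}
```

Unknown entry types are kept as `Other` rather than rejected, so firmware that adds x2APIC or NMI records does not break the inventory.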

Implementation

// kernel/src/acpi.rs

//! Minimal ACPI table parser.
//! Walks RSDP -> XSDT -> individual tables.
//! Does NOT implement AML interpretation -- static tables only.

pub struct AcpiInfo {
    pub lapics: Vec<LapicEntry>,
    pub io_apics: Vec<IoApicEntry>,
    pub iso_overrides: Vec<InterruptSourceOverride>,
    pub mcfg_base: Option<u64>,  // PCIe ECAM base address
    pub hpet_base: Option<u64>,
}

pub fn parse_acpi(rsdp_addr: u64, hhdm: u64) -> AcpiInfo { ... }

For the fuller static-table subsystem, prefer the acpi crate (or an equivalent maintained no_std parser) rather than expanding the diagnostic parser into a general hand-written ACPI stack. The landed parser is a boot-time inventory proof for RSDP/RSDT/MADT/MCFG summaries; it can be retired or narrowed once the crate-backed table model fits capOS mapping and table lifetime constraints.

Limine RSDP

use limine::request::RsdpRequest;

static RSDP: RsdpRequest = RsdpRequest::new();

// In kmain:
let rsdp_addr = RSDP.response().expect("no RSDP").address as u64;
let acpi_info = acpi::parse_acpi(rsdp_addr, hhdm_offset);

Crate Dependencies

Crate	Purpose	no_std
`acpi`	Planned fuller/static ACPI table parsing (MADT, MCFG, HPET, FADT, etc.)	yes

Scope

The landed diagnostic slice is kernel-local bounded read-only parsing for serial inventory. Fuller handling should be mostly glue around a maintained static-table parser plus capOS mapping, lifetime, and authority types.

Phase 3: Interrupt Infrastructure

Goal: Set up I/O APIC for device interrupt routing and MSI/MSI-X for modern PCI devices. This replaces the implicit legacy PIC setup.

I/O APIC

The I/O APIC routes external device interrupts (keyboard, serial, PCI devices) to specific LAPICs (one per CPU). Its address and configuration come from the ACPI MADT (Phase 2).

// kernel/src/arch/x86_64/ioapic.rs

pub struct IoApic {
    base: *mut u32,  // MMIO registers via HHDM
}

impl IoApic {
    /// Route an IRQ to a specific LAPIC/vector.
    pub fn route_irq(&mut self, irq: u8, lapic_id: u8, vector: u8) { ... }

    /// Mask/unmask an IRQ line.
    pub fn set_mask(&mut self, irq: u8, masked: bool) { ... }
}

The current x86 implementation maps MADT I/O APIC MMIO, reads each controller’s ID/version/redirection count, and programs legacy IRQ 0-15 routes to LAPIC vectors while keeping the redirection entries masked. It respects Interrupt Source Override entries from MADT (for example, Q35 remaps IRQ 0 to GSI 2). Driver-owned unmask policy, dispatch, and EOI handling remain planned.
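The masked route programming described above comes down to composing a 64-bit redirection entry per pin. The sketch below shows one way to build that value, assuming fixed delivery mode and physical destination mode, and translating MADT source-override flags into polarity and trigger bits. The function names are illustrative and not taken from the capOS source.

```rust
/// MADT interrupt source override flags: bits 0-1 polarity, bits 2-3 trigger.
const POLARITY_MASK: u16 = 0b11;
const POLARITY_ACTIVE_LOW: u16 = 0b11;
const TRIGGER_MASK: u16 = 0b11 << 2;
const TRIGGER_LEVEL: u16 = 0b11 << 2;

/// Register index (written to IOREGSEL) of the low dword of the redirection
/// entry for `pin`; the high dword lives at the next index.
pub fn redirection_reg(pin: u8) -> u32 {
    0x10 + 2 * pin as u32
}

/// Compose a redirection entry: fixed delivery, physical destination.
/// Flags of zero mean "conforms to bus", which for ISA IRQs is edge-triggered
/// and active-high.
pub fn redirection_entry(vector: u8, lapic_id: u8, iso_flags: u16, masked: bool) -> u64 {
    let mut entry = vector as u64; // bits 0-7: vector
    if (iso_flags & POLARITY_MASK) == POLARITY_ACTIVE_LOW {
        entry |= 1 << 13; // pin polarity: active low
    }
    if (iso_flags & TRIGGER_MASK) == TRIGGER_LEVEL {
        entry |= 1 << 15; // trigger mode: level
    }
    if masked {
        entry |= 1 << 16; // interrupt mask
    }
    entry | ((lapic_id as u64) << 56) // bits 56-63: destination APIC ID
}
```

For the Q35 override mentioned above, legacy IRQ 0 would be programmed at pin 2, so the entry is written through register indices `redirection_reg(2)` and `redirection_reg(2) + 1`.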

MSI/MSI-X

Modern PCI/PCIe devices (NVMe, cloud NICs) use Message Signaled Interrupts instead of pin-based IRQs routed through the I/O APIC. An MSI/MSI-X interrupt is a memory write from the device to the LAPIC message address window (0xFEE00000 on x86), with the vector encoded in the message data; it bypasses the I/O APIC entirely.

This is critical for cloud deployment because:

NVMe controllers require MSI or MSI-X (no legacy IRQ fallback on many controllers)
Cloud NICs (ENA, gVNIC) use MSI-X exclusively
MSI-X supports per-queue interrupts (one vector per virtqueue/submission queue), enabling better SMP scalability

// kernel/src/pci/msi.rs

/// Configure MSI for a PCI device.
pub fn enable_msi(device: &PciDevice, vector: u8, lapic_id: u8) { ... }

/// Configure MSI-X for a PCI device.
pub fn enable_msix(
    device: &PciDevice,
    table_bar: u8,
    entries: &[(u16, u8, u8)],  // (index, vector, lapic_id)
) { ... }

MSI/MSI-X capability structures are found by walking the PCI capability list (already needed for PCI enumeration in the networking proposal). The current PCI path reports MSI/MSI-X capability metadata for virtio-net so diagnostics can see the advertised table and pending-bit-array layout. The virtio-net QEMU smoke now records kernel-owned config/RX/TX MSI-X sources, publishes them into the device interrupt dispatch table, allocates LAPIC vectors from the bounded device MSI vector pool to program their table entries and virtio vector registers, lets the in-kernel virtio-net owner unmask only those routes, then proves TX delivery by observing that source’s dispatch counter advance after maskable interrupts are live. The same smoke uses an unused masked MSI-X table entry to prove claimed-route reassignment, stale old-route rejection, old-vector unregistered delivery, reassigned-vector masked delivery, unsupported-vector delivery, and release. Broader driver dispatch and userspace interrupt authority remain planned.
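A short sketch of what "programming a table entry" means on x86: the message address selects the destination LAPIC, the message data carries the vector, and each MSI-X table entry is four dwords ending in a vector-control word whose bit 0 masks the entry. The sketch assumes xAPIC physical destination mode with fixed, edge-triggered delivery; `MsiMessage` and the helper names are hypothetical and do not describe the typed capOS MSI-X table helper.

```rust
pub struct MsiMessage {
    pub address: u64,
    pub data: u32,
}

/// Compose an x86 MSI message for a fixed, edge-triggered interrupt
/// delivered to one LAPIC in physical destination mode.
pub fn msi_message(lapic_id: u8, vector: u8) -> MsiMessage {
    MsiMessage {
        // Bits 31-20 are 0xFEE; bits 19-12 carry the destination APIC ID.
        address: 0xFEE0_0000 | ((lapic_id as u64) << 12),
        // Bits 7-0 carry the vector; delivery mode 000 (fixed) leaves the rest zero.
        data: vector as u32,
    }
}

/// Layout of one 16-byte MSI-X table entry:
/// address low, address high, message data, vector control (bit 0 = masked).
pub fn msix_table_entry(message: &MsiMessage, masked: bool) -> [u32; 4] {
    [
        message.address as u32,
        (message.address >> 32) as u32,
        message.data,
        masked as u32,
    ]
}

/// Byte offset of entry `index` within the BAR that holds the MSI-X table.
pub fn msix_entry_offset(table_offset: u32, index: u16) -> u32 {
    table_offset + 16 * index as u32
}
```

A driver would typically write the address and data dwords while the entry is still masked and clear the mask bit last, which matches the masked-then-unmask ordering the smoke test exercises.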

Integration with SMP

LAPIC initialization is shared with the SMP proposal. The active x86 path uses xAPIC MMIO for the immediate QEMU/KVM timer and IPI foundation, with PIT/PIC fallback. This cloud phase consumes that architectural LAPIC path for local interrupt delivery and now adds masked ACPI MADT I/O APIC routing plus MSI/MSI-X capability metadata discovery and a bounded virtio-net MSI-X dispatch/lifecycle proof; userspace device interrupts remain planned.

KVM/QEMU paravirtual features such as PV EOI, PV IPI, and PV TLB flush are host-specific accelerations. They are useful later for cloud performance, but cloud boot correctness should use the architectural LAPIC path first. x2APIC is a later backend for newer/high-core systems and firmware states where xAPIC is unavailable or undesirable; it is not a blocker for the current LAPIC path.

Scope

~300-400 lines total:

I/O APIC driver: ~150 lines
MSI/MSI-X setup: ~100-150 lines
Integration/routing logic: ~50-100 lines

Phase 4: PCI/PCIe Infrastructure

Goal: Standalone PCI bus enumeration and device management, usable by all device drivers (virtio-net, NVMe, cloud NICs).

The networking proposal includes PCI enumeration as a substep for finding virtio-net. This phase promotes it to a reusable kernel subsystem that all device drivers build on.

PCI Configuration Access

Two mechanisms, determined by ACPI:

Legacy I/O ports (0xCF8/0xCFC) – works in QEMU, limited to 256 bytes of config space per function. Insufficient for PCIe extended capabilities.
PCIe ECAM (Enhanced Configuration Access Mechanism) – memory-mapped config space, 4 KB per function. Base address from ACPI MCFG table. Required for PCIe extended capabilities (offsets 0x100 and above); the MSI-X capability and the BARs sit in the first 256 bytes, so either mechanism can read them.

Legacy I/O and Q35 ECAM config access exist today behind the same early PCI backend abstraction. The PCI layer also validates memory BAR subregions with checked offset/length/alignment bounds and maps selected subregions through the kernel MMIO window for in-kernel drivers, and it records non-programming MSI/MSI-X metadata for the current virtio-net path by walking the standard PCI capability list. The virtio-net path now selects a usable MSI-X capability and programs config/RX/TX table entries through the typed PCI MSI-X table helper using the kernel-owned source records and bounded first-fit LAPIC device MSI vectors. The QEMU net smoke lets the in-kernel virtio-net owner claim and unmask those routes, assigns the virtio common and queue MSI-X vector registers, and proves TX delivery by observing that source’s dispatch counter advance after the TX completion path has run and maskable interrupts are live. It also proves claimed-route reassignment and release with an unused masked MSI-X table entry. The next steps are using that path for full bus discovery, userspace DeviceMmio authority, broader driver dispatch, and driver binding.
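The two config access mechanisms differ mainly in how a bus/device/function/offset tuple becomes an address. The following sketch shows both computations with bounds checks; `EcamAllocation` mirrors the fields of an MCFG allocation record, and all names here are assumptions for illustration rather than the capOS backend types.

```rust
/// Value written to I/O port 0xCF8 to select a dword of legacy config space;
/// the data is then read or written through port 0xCFC.
pub fn legacy_config_address(bus: u8, device: u8, function: u8, offset: u8) -> Option<u32> {
    if device >= 32 || function >= 8 {
        return None;
    }
    Some(
        0x8000_0000 // enable bit
            | ((bus as u32) << 16)
            | ((device as u32) << 11)
            | ((function as u32) << 8)
            | (offset & 0xFC) as u32, // dword-aligned register offset
    )
}

/// One MCFG allocation: an ECAM window covering a range of buses.
pub struct EcamAllocation {
    pub base: u64,
    pub segment: u16,
    pub start_bus: u8,
    pub end_bus: u8,
}

impl EcamAllocation {
    /// Physical address of a config register inside this ECAM window.
    /// Each function owns 4 KiB, each device 8 functions, each bus 32 devices.
    pub fn function_address(&self, bus: u8, device: u8, function: u8, offset: u16) -> Option<u64> {
        if bus < self.start_bus || bus > self.end_bus {
            return None;
        }
        if device >= 32 || function >= 8 || offset >= 4096 {
            return None;
        }
        let relative = (((bus - self.start_bus) as u64) << 20)
            | ((device as u64) << 15)
            | ((function as u64) << 12)
            | offset as u64;
        self.base.checked_add(relative)
    }
}
```

Keeping both computations behind one backend abstraction lets capability walking and BAR sizing stay identical regardless of which mechanism discovered the device.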

Device Enumeration

// kernel/src/pci.rs

pub struct PciDevice {
    pub bus: u8,
    pub device: u8,
    pub function: u8,
    pub vendor_id: u16,
    pub device_id: u16,
    pub class: u8,
    pub subclass: u8,
    pub bars: [Option<Bar>; 6],
    pub interrupt_pin: u8,
    pub interrupt_line: u8,
}

pub enum Bar {
    Memory {
        base: u64,
        size: u64,
        prefetchable: bool,
        width: MemoryBarWidth,
    },
    Io { base: u32, size: u32 },
}

/// Scan all PCI buses and return discovered devices.
pub fn enumerate() -> Vec<PciDevice> { ... }

/// Find a device by vendor/device ID.
pub fn find_device(vendor: u16, device: u16) -> Option<PciDevice> { ... }

/// Walk the PCI capability list for a device.
pub fn capabilities(device: &PciDevice) -> Vec<PciCapability> { ... }
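As an illustration of the capability walk, the sketch below follows the standard linked list over a 256-byte snapshot of a function's config space. Working on a snapshot is an assumption made so the example runs on a host; a kernel walker would issue config reads through the selected backend instead. The `Capability` type and `MAX_CAPS` bound are hypothetical.

```rust
#[derive(Debug, PartialEq)]
pub struct Capability {
    pub id: u8,
    pub offset: u8,
}

pub const CAP_ID_MSI: u8 = 0x05;
pub const CAP_ID_MSIX: u8 = 0x11;

/// At most (256 - 64) / 4 dword-aligned capabilities fit after the header,
/// so this bound also terminates malformed, looping lists.
const MAX_CAPS: usize = 48;

/// Walk the standard capability list in a config-space snapshot.
pub fn walk_capabilities(config: &[u8; 256]) -> Vec<Capability> {
    let mut caps = Vec::new();
    // Status register (offset 0x06) bit 4 advertises a capability list.
    if (config[0x06] & 0x10) == 0 {
        return caps;
    }
    // Capabilities pointer at 0x34; the low two bits are reserved.
    let mut ptr = config[0x34] & 0xFC;
    // Pointers below 0x40 would land in the standard header and end the walk.
    while ptr >= 0x40 && caps.len() < MAX_CAPS {
        let offset = ptr as usize;
        caps.push(Capability { id: config[offset], offset: ptr });
        ptr = config[offset + 1] & 0xFC;
    }
    caps
}
```

A caller looking for MSI-X would scan the result for `CAP_ID_MSIX` and then read the table offset and BAR indicator from that capability's offset.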

BAR Mapping

Device drivers need MMIO access to BAR regions. The kernel now maps validated memory-BAR subregions into its bounded MMIO virtual window for in-kernel drivers. A future DeviceMmio capability will carry equivalent authority to userspace drivers as described in the networking proposal.
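The checks behind a validated subregion can be sketched as follows: decode the BAR size from the value read back after writing all ones, then confirm that a requested offset/length pair is aligned and stays inside the BAR without overflowing. This is a minimal sketch for 32-bit memory BARs with hypothetical function names; it does not reproduce the shared capOS helper.

```rust
/// Decode the size of a 32-bit memory BAR from the value read back after
/// writing 0xFFFF_FFFF to it. Returns None for an unimplemented BAR.
pub fn memory_bar_size(probe_readback: u32) -> Option<u32> {
    // The low four bits encode type and prefetchability, not address bits.
    let mask = probe_readback & !0xF;
    if mask == 0 {
        None
    } else {
        Some((!mask).wrapping_add(1))
    }
}

/// Validate a subregion of a BAR and return its physical start address.
/// Rejects empty, misaligned, out-of-range, and overflowing requests.
pub fn validate_subregion(
    bar_base: u64,
    bar_size: u64,
    offset: u64,
    len: u64,
    align: u64,
) -> Option<u64> {
    if len == 0 || !align.is_power_of_two() || offset % align != 0 {
        return None;
    }
    let end = offset.checked_add(len)?;
    if end > bar_size {
        return None;
    }
    bar_base.checked_add(offset)
}
```

Using checked arithmetic matters here because the offset and length of a virtio capability region come from device-supplied config space and cannot be trusted.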

PCI Device IDs for Cloud Hardware

Device	Vendor:Device	Cloud
virtio-net	1AF4:1000 (transitional) or 1AF4:1041 (modern)	QEMU, supported first/second-generation GCP machine families
virtio-blk	1AF4:1001 (transitional) or 1AF4:1042 (modern)	QEMU
NVMe	8086:various, 144D:various, etc.	All clouds (EBS, PD, Managed Disk)
AWS ENA	1D0F:EC20 / 1D0F:EC21	AWS
GCP gVNIC	1AE0:0042	GCP
Azure MANA	1414:00BA	Azure
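One way a driver-binding step could use this table is a simple classifier keyed on vendor/device ID, with NVMe matched by class code because its vendor IDs vary. The sketch below is illustrative: `DriverKind` and `classify` are hypothetical names, and the `prog_if` parameter is an assumption because the `PciDevice` sketch above does not carry a programming-interface field.

```rust
#[derive(Debug, PartialEq)]
pub enum DriverKind {
    VirtioNet,
    VirtioBlk,
    Nvme,
    Ena,
    Gvnic,
    Mana,
}

/// Pick a driver for a PCI function. NICs and virtio devices match on
/// vendor/device ID; NVMe matches on class 0x01 (mass storage),
/// subclass 0x08 (non-volatile memory), programming interface 0x02 (NVMe).
pub fn classify(vendor: u16, device: u16, class: u8, subclass: u8, prog_if: u8) -> Option<DriverKind> {
    match (vendor, device) {
        (0x1AF4, 0x1000) | (0x1AF4, 0x1041) => Some(DriverKind::VirtioNet),
        (0x1AF4, 0x1001) | (0x1AF4, 0x1042) => Some(DriverKind::VirtioBlk),
        (0x1D0F, 0xEC20) | (0x1D0F, 0xEC21) => Some(DriverKind::Ena),
        (0x1AE0, 0x0042) => Some(DriverKind::Gvnic),
        (0x1414, 0x00BA) => Some(DriverKind::Mana),
        _ if (class, subclass, prog_if) == (0x01, 0x08, 0x02) => Some(DriverKind::Nvme),
        _ => None,
    }
}
```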

Scope

~400-500 lines:

Config space access (I/O + ECAM): ~100 lines
Bus enumeration: ~150 lines
BAR parsing and mapping: ~100 lines
Capability list walking: ~50-100 lines

Phase 5: NVMe Driver

Goal: Basic NVMe block device driver, sufficient to read/write sectors. This is the storage equivalent of virtio-net for networking – the first real storage driver.

Why NVMe Over virtio-blk

The storage-and-naming proposal mentions virtio-blk for Phase 3 (persistent store). On cloud VMs, all three providers expose NVMe:

AWS EBS – NVMe interface (even for gp3/io2 volumes)
GCP Persistent Disk – NVMe or SCSI (NVMe is default for newer VMs)
Azure Managed Disks – SCSI on many older VM families such as D/Ev5 or Fv2 and older; NVMe on Azure Boost and newer NVMe-capable families such as Ebsv5 and Da/Ea/Fav6 and newer

virtio-blk is QEMU-only. An NVMe driver unlocks persistent storage on all cloud platforms where the selected VM shape exposes NVMe. For QEMU testing, QEMU also emulates NVMe well: -drive file=disk.img,if=none,id=d0 -device nvme,drive=d0,serial=capos0.

NVMe Architecture

NVMe is a register-level standard with well-defined queue-pair semantics:

Application
    |
    v
Submission Queue (SQ) -- ring buffer of 64-byte command entries
    |
    | doorbell write (MMIO)
    v
NVMe Controller (hardware)
    |
    | DMA completion
    v
Completion Queue (CQ) -- ring buffer of 16-byte completion entries
    |
    | MSI-X interrupt
    v
Driver processes completions

Minimum viable driver needs:

Admin Queue Pair (for identify, create I/O queues)
One I/O Queue Pair (for read/write commands)
MSI-X for completion notification (or polling)
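Two details of the queue-pair flow are easy to get wrong: where each doorbell register lives, and how the driver decides that a completion entry is new. The sketch below computes doorbell offsets from the controller's doorbell stride (`CAP.DSTRD`) and tracks the completion phase tag. `CompletionRing` is a hypothetical helper that models only head and phase bookkeeping, not DMA memory.

```rust
/// Byte offset of a queue doorbell within BAR0. Submission tail doorbells
/// use even slots, completion head doorbells odd slots; each slot is
/// `4 << dstrd` bytes wide, starting at 0x1000.
pub fn doorbell_offset(queue_id: u16, completion: bool, dstrd: u8) -> u64 {
    0x1000 + (2 * queue_id as u64 + completion as u64) * (4u64 << dstrd)
}

/// Head/phase bookkeeping for one completion queue.
pub struct CompletionRing {
    head: u16,
    depth: u16,
    phase: bool,
}

impl CompletionRing {
    /// Queue memory starts zeroed, so entries from the controller's first
    /// pass carry phase tag 1.
    pub fn new(depth: u16) -> Self {
        Self { head: 0, depth, phase: true }
    }

    /// `status_word` is the upper 16 bits of completion dword 3: bit 0 is
    /// the phase tag, bits 15-1 the status field. Returns the status field
    /// if the entry at the head is new, otherwise None.
    pub fn try_consume(&mut self, status_word: u16) -> Option<u16> {
        if ((status_word & 1) == 1) != self.phase {
            return None; // stale entry from the previous pass
        }
        self.head += 1;
        if self.head == self.depth {
            self.head = 0;
            self.phase = !self.phase; // expected tag flips on every wrap
        }
        Some(status_word >> 1)
    }

    pub fn head(&self) -> u16 {
        self.head
    }
}
```

After consuming entries, the driver writes `head()` to the completion doorbell so the controller can reuse those slots. The same check works for polling and for an MSI-X handler.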

Implementation Sketch

// kernel/src/nvme.rs (or kernel/src/drivers/nvme.rs)

pub struct NvmeController {
    bar0: *mut u8,          // MMIO registers
    admin_sq: SubmissionQueue,
    admin_cq: CompletionQueue,
    io_sq: SubmissionQueue,
    io_cq: CompletionQueue,
    namespace_id: u32,
    block_size: u32,
    block_count: u64,
}

impl NvmeController {
    pub fn init(pci_device: &PciDevice) -> Result<Self, NvmeError> { ... }
    pub fn read(&self, lba: u64, count: u16, buf: &mut [u8]) -> Result<(), NvmeError> { ... }
    pub fn write(&self, lba: u64, count: u16, buf: &[u8]) -> Result<(), NvmeError> { ... }
    pub fn identify(&self) -> NvmeIdentify { ... }
}
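Behind `read` and `write` sits a 64-byte submission queue entry. The following sketch encodes an NVM read or write command as sixteen dwords, following the NVMe command layout (opcode and command ID in dword 0, namespace in dword 1, PRP pointers in dwords 6-9, starting LBA in dwords 10-11, zero-based block count in dword 12). The `io_command` function is a hypothetical helper, not part of the interface above.

```rust
pub const OPC_WRITE: u8 = 0x01;
pub const OPC_READ: u8 = 0x02;

/// Encode a 64-byte NVM read/write submission entry as 16 dwords.
/// Returns None for a zero-block transfer, which the zero-based
/// block-count field cannot express.
pub fn io_command(
    opcode: u8,
    command_id: u16,
    namespace_id: u32,
    lba: u64,
    blocks: u16,
    prp1: u64,
    prp2: u64,
) -> Option<[u32; 16]> {
    if blocks == 0 {
        return None;
    }
    let mut cmd = [0u32; 16];
    cmd[0] = (opcode as u32) | ((command_id as u32) << 16);
    cmd[1] = namespace_id;
    cmd[6] = prp1 as u32; // data pointer: PRP entry 1
    cmd[7] = (prp1 >> 32) as u32;
    cmd[8] = prp2 as u32; // data pointer: PRP entry 2 (or PRP list address)
    cmd[9] = (prp2 >> 32) as u32;
    cmd[10] = lba as u32; // starting LBA, low half
    cmd[11] = (lba >> 32) as u32;
    cmd[12] = (blocks - 1) as u32; // number of logical blocks, zero-based
    Some(cmd)
}
```

The driver copies the entry into the next submission queue slot, advances the tail, and writes the new tail to the queue's doorbell register.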

DMA Considerations

NVMe uses DMA for data transfer. The controller reads/writes directly from physical memory addresses provided in commands. Requirements:

Buffers must be physically contiguous (or use PRP lists / SGLs for scatter-gather)
Physical addresses must be provided (not virtual)
Cache coherence is handled by hardware on x86_64 (DMA-coherent architecture)

The existing frame allocator can provide physically contiguous pages. For larger transfers, PRP (Physical Region Page) lists allow scatter-gather.
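To show when a PRP list becomes necessary, the sketch below plans the PRP entries for a physically contiguous buffer. It assumes a 4 KiB controller memory page size and a hypothetical `plan_prps` helper; allocating the page that holds the PRP list itself is outside the sketch.

```rust
/// Assumed controller memory page size (4 KiB).
const PAGE: u64 = 4096;

#[derive(Debug, PartialEq)]
pub enum PrpPlan {
    /// The transfer fits in the first page: PRP2 is unused.
    Single { prp1: u64 },
    /// The transfer touches exactly two pages: PRP2 is the second page.
    Pair { prp1: u64, prp2: u64 },
    /// More than two pages: PRP2 must point at a list holding these entries.
    List { prp1: u64, list_entries: Vec<u64> },
}

/// Plan PRP entries for a physically contiguous buffer of `len` bytes
/// starting at physical address `phys`.
pub fn plan_prps(phys: u64, len: u64) -> Option<PrpPlan> {
    if len == 0 {
        return None;
    }
    let end = phys.checked_add(len)?;
    // PRP1 may start mid-page; every later entry must be page-aligned.
    let first_boundary = (phys & !(PAGE - 1)).checked_add(PAGE)?;
    if end <= first_boundary {
        return Some(PrpPlan::Single { prp1: phys });
    }
    let rest: Vec<u64> = (first_boundary..end).step_by(PAGE as usize).collect();
    if rest.len() == 1 {
        Some(PrpPlan::Pair { prp1: phys, prp2: rest[0] })
    } else {
        Some(PrpPlan::List { prp1: phys, list_entries: rest })
    }
}
```

With 4 KiB pages, any aligned transfer up to 8 KiB avoids a PRP list entirely, which may be enough for an initial driver that issues small reads and writes.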

Crate Dependencies

Crate	Purpose	no_std
(none)	NVMe register-level protocol is simple enough to implement directly	N/A

The NVMe spec is cleaner than virtio and the register interface is straightforward. A minimal driver (admin + 1 I/O queue pair, read/write) is ~500-700 lines without external dependencies.

Integration with Storage Proposal

The storage proposal’s Phase 3 (Persistent Store) specifies virtio-blk as the backing device. This can be generalized to a BlockDevice trait:

trait BlockDevice {
    fn read(&self, lba: u64, count: u16, buf: &mut [u8]) -> Result<(), Error>;
    fn write(&self, lba: u64, count: u16, buf: &[u8]) -> Result<(), Error>;
    fn block_size(&self) -> u32;
    fn block_count(&self) -> u64;
}

Both NVMe and virtio-blk implement this trait. The store service doesn’t care which backing driver it uses.
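A memory-backed implementation is a convenient way to exercise the store service against this trait before either hardware driver exists. The sketch below restates the trait with a concrete `BlockError` type, since the trait above leaves `Error` unspecified, and adds a hypothetical `RamDisk`. Both names are assumptions for illustration.

```rust
use core::cell::RefCell;

#[derive(Debug, PartialEq)]
pub enum BlockError {
    OutOfRange,
    BadBufferLength,
}

pub trait BlockDevice {
    fn read(&self, lba: u64, count: u16, buf: &mut [u8]) -> Result<(), BlockError>;
    fn write(&self, lba: u64, count: u16, buf: &[u8]) -> Result<(), BlockError>;
    fn block_size(&self) -> u32;
    fn block_count(&self) -> u64;
}

/// In-memory block device. Interior mutability is needed because the
/// trait's `write` takes `&self`.
pub struct RamDisk {
    block_size: u32,
    data: RefCell<Vec<u8>>,
}

impl RamDisk {
    pub fn new(block_size: u32, blocks: u64) -> Self {
        let bytes = block_size as usize * blocks as usize;
        Self { block_size, data: RefCell::new(vec![0; bytes]) }
    }

    /// Translate an LBA range into a byte range, checking bounds and
    /// that the caller's buffer is exactly `count` blocks long.
    fn byte_range(&self, lba: u64, count: u16, buf_len: usize) -> Result<(usize, usize), BlockError> {
        let end_lba = lba.checked_add(count as u64).ok_or(BlockError::OutOfRange)?;
        if end_lba > self.block_count() {
            return Err(BlockError::OutOfRange);
        }
        let bs = self.block_size as usize;
        if buf_len != count as usize * bs {
            return Err(BlockError::BadBufferLength);
        }
        Ok((lba as usize * bs, end_lba as usize * bs))
    }
}

impl BlockDevice for RamDisk {
    fn read(&self, lba: u64, count: u16, buf: &mut [u8]) -> Result<(), BlockError> {
        let (start, end) = self.byte_range(lba, count, buf.len())?;
        buf.copy_from_slice(&self.data.borrow()[start..end]);
        Ok(())
    }

    fn write(&self, lba: u64, count: u16, buf: &[u8]) -> Result<(), BlockError> {
        let (start, end) = self.byte_range(lba, count, buf.len())?;
        self.data.borrow_mut()[start..end].copy_from_slice(buf);
        Ok(())
    }

    fn block_size(&self) -> u32 {
        self.block_size
    }

    fn block_count(&self) -> u64 {
        self.data.borrow().len() as u64 / self.block_size as u64
    }
}
```

Code written against `&dyn BlockDevice` can then be tested on a host with `RamDisk::new(512, 8)` and later pointed at an NVMe or virtio-blk implementation unchanged.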

Scope

~500-700 lines for a minimal in-kernel NVMe driver (admin queue + 1 I/O queue pair, read/write, identify). Userspace decomposition follows the same pattern as the networking proposal (kernel driver first, then extract to userspace process with DeviceMmio + Interrupt caps).

Phase 6: Cloud NIC Strategy

Goal: Define the path to networking on cloud VMs, given that each cloud uses a different proprietary NIC.

The Landscape

Cloud	Primary NIC	Virtio NIC available?	Open-source driver?
GCP	gVNIC (1AE0:0042)	Yes on supported first/second-generation machine families	Yes (Linux, ~3000 LoC)
AWS	ENA (1D0F:EC20)	No (Nitro only)	Yes (Linux, ~8000 LoC)
Azure	MANA (1414:00BA)	No (accelerated networking)	Yes (Linux, ~6000 LoC)

Recommended Strategy

Short term: constrained virtio-net on GCP

GCP can expose VIRTIO_NET on supported first/second-generation machine families. After the shared image, ACPI/PCIe, interrupt, DMA/MMIO, and virtio foundation exists, that gives a constrained early cloud-network proof without writing a provider-specific NIC driver. It is not the general GCP target: third-generation-and-later machine families, Tau T2A, Confidential VM, and some higher-bandwidth paths require gVNIC.

gcloud compute instances create capos-test \
    --image=capos \
    --machine-type=e2-micro \
    --network-interface=nic-type=VIRTIO_NET

Medium term: gVNIC driver

gVNIC is a simpler device than ENA or MANA. The Linux driver is ~3000 lines (vs ~8000 for ENA). It uses standard PCI BAR MMIO + MSI-X interrupts. A minimal gVNIC driver (init, link up, send/receive) would be ~800-1200 lines.

gVNIC is worth prioritizing because:

GCP’s constrained virtio-net path can de-risk cloud networking before a provider-specific NIC driver exists
Graduating from virtio-net to gVNIC on the same cloud is the required path for newer, Tau T2A, Confidential VM, and higher-bandwidth GCP instances
The gVNIC register interface is documented in the Linux driver source

Long term: ENA and MANA

ENA and MANA are more complex and less well-documented outside their Linux drivers. These should be deferred until the driver model is mature (userspace drivers with DeviceMmio caps, as described in the networking proposal Part 2).

At that point, the kernel only needs to provide PCI enumeration + BAR mapping + MSI-X routing. The actual NIC driver logic runs in a userspace process, making it feasible to port from the Linux driver source with appropriate licensing considerations.

Alternative: Paravirt Abstraction Layer

Instead of writing native drivers for each cloud NIC, an alternative is a thin paravirt layer:

Application -> NetworkManager cap -> Net Stack (smoltcp) -> NIC cap -> [driver]

Where [driver] is one of:

virtio-net (QEMU, supported first/second-generation GCP machine families)
gvnic (GCP)
ena (AWS)
mana (Azure)

All drivers implement the same Nic capability interface from the networking proposal. The network stack and applications are driver-agnostic.

This is already the architecture described in the networking proposal. The only addition is recognizing that multiple driver implementations will exist behind the same Nic interface.

Phase Summary and Dependencies

graph TD
    P1[Phase 1: Disk Image + Serial Diagnostics] --> BOOT[Boots on Cloud VM]
    P2[Phase 2: ACPI Parsing] --> P3[Phase 3: Interrupt Infrastructure]
    P2 --> P4[Phase 4: PCI/PCIe]
    P3 --> P5[Phase 5: NVMe Driver]
    P4 --> P5
    P4 --> NET[Networking Smoke Test<br>virtio-net driver]
    P3 --> NET
    P4 --> P6[Phase 6: Cloud NIC Drivers]
    P3 --> P6
    NET --> P6

    S5[Stage 5: Scheduling] --> P3
    SMP_C[SMP Phase C: LAPIC timer/IPI] --> P3

    style P1 fill:#2d5,stroke:#333
    style BOOT fill:#2d5,stroke:#333

Phase	Depends on	Estimated scope	Enables
1: Disk image + diagnostics	Nothing	image tooling plus bounded diagnostics mode	Cloud serial boot
2: ACPI	Nothing (kernel code)	~200-300 lines	Phases 3, 4
3: Interrupts	Phase 2, LAPIC (SMP Phase C)	~300-400 lines	NVMe, cloud NICs
4: PCI/PCIe	Phase 2	~400-500 lines	All device drivers
5: NVMe	Phases 3, 4	~500-700 lines	Cloud storage
6: Cloud NICs	Phases 3, 4, networking smoke test	~800-1200 lines each	Cloud networking

Minimum Path to “Boots on Cloud VM, Prints Hello”

Raw serial output and UEFI boot support already exist, so the smallest “prints hello” experiment is mostly Phase 1 image packaging plus any boot-path adjustments needed to reach the same COM1 output from an imported disk image. That experiment is a precursor, not the full Phase 1 closeout.

Phase 1 closeout also includes a bounded serial diagnostics prompt so cloud driver bring-up can inspect CPU, memory, ACPI, PCI, IRQ, timer, device, and log state before cloud NICs or storage drivers are reliable. That diagnostics surface is kernel/userspace behavior, not just build-system work.

Minimum Path to “Useful on Cloud VM”

Phases 1-5 (disk image + ACPI + interrupts + PCI + NVMe) plus the existing roadmap items (Stages 4-6 for capability syscalls, scheduling, IPC). On a supported first/second-generation GCP machine family, networking can use the existing virtio-net proposal without a provider-specific gVNIC/ENA/MANA driver on that constrained target.

QEMU Testing

All phases can be tested in QEMU before deploying to cloud:

Phase	QEMU flags
Disk image	`-drive file=capos.img,format=raw -bios OVMF.4m.fd`
ACPI	Default QEMU provides ACPI tables (MADT, MCFG, etc.)
I/O APIC	Default QEMU emulates I/O APIC
PCI/PCIe	`-device ...` adds PCI devices; QEMU has PCIe root complex
NVMe	`-drive file=disk.img,if=none,id=d0 -device nvme,drive=d0,serial=capos0`
MSI-X	Supported by QEMU’s NVMe and virtio-net-pci emulation; current net smoke asserts metadata selection, kernel-owned source-route records, route unmask, vector programming, virtio queue assignment, descriptor guards, ARP, and ICMP fixture evidence. Device-autonomous virtio-net MSI-X delivery is covered by the dedicated userspace-provider gates.
Multi-CPU	`-smp 4` (already works with Limine SMP)
x2APIC backend	future explicit QEMU CPU feature such as `-cpu qemu64,+smep,+smap,+rdrand,+x2apic`

aarch64 and ARM Cloud Instances

This proposal focuses on x86_64 because that’s the current kernel target, but ARM-based cloud instances are significant and growing:

Cloud	ARM offering	Instance types
AWS	Graviton2/3/4	m7g, c7g, r7g, etc.
GCP	Tau T2A (Ampere Altra)	t2a-standard-*
Azure	Cobalt 100 (Arm Neoverse)	Dpsv6, Dplsv6

ARM cloud VMs have the same general requirements (UEFI boot, ACPI tables, PCI/PCIe, NVMe storage) but different specifics:

Interrupt controller: GIC (Generic Interrupt Controller) instead of APIC. GICv3 is standard on cloud ARM instances.
Boot: UEFI via Limine (already targets aarch64). Limine handles the architecture differences at boot time.
Timer: ARM generic timer (CNTPCT_EL0) instead of LAPIC/PIT/TSC.
Serial: PL011 UART instead of 16550 COM1. Different register interface.
NIC: Same PCI devices (ENA, gVNIC, MANA) with the same register interfaces – PCI/PCIe is architecture-neutral.
NVMe: Same NVMe register interface – PCIe is architecture-neutral.

The arch-neutral parts of this proposal (PCI enumeration, NVMe, disk image format, ACPI table parsing) apply equally to aarch64. The arch-specific parts (I/O APIC, MSI delivery address format, LAPIC) need aarch64 equivalents (GIC, ARM MSI translation).

The existing roadmap lists “aarch64 support” as a future item. For cloud deployment, aarch64 should be considered as soon as the x86_64 hardware abstraction is stable, since:

Device drivers (NVMe, virtio-net, cloud NICs) are architecture-neutral – they talk to PCI config space and MMIO BARs, which are the same on both architectures
The acpi crate handles both x86_64 and aarch64 ACPI tables
Limine already targets aarch64
AWS Graviton instances are often cheaper than x86_64 equivalents

The main aarch64 kernel work is: exception handling (EL0/EL1 instead of Ring 3/Ring 0), GIC driver (instead of APIC), ARM generic timer, PL011 serial, and the MMU setup (4-level page tables exist on both but with different register interfaces).

Open Questions

ACPI scope. The landed diagnostic parser covers bounded read-only RSDP/RSDT/MADT/MCFG summaries only. The acpi crate can parse fuller static tables (MADT, MCFG, HPET, FADT). Full ACPI requires AML interpretation (for _PRT interrupt routing, dynamic device enumeration). Do we need AML, or are static tables sufficient for cloud VMs? Cloud VM firmware typically provides simple, static ACPI tables – AML interpretation is likely unnecessary initially.
PCIe ECAM vs legacy. Should we support both config access methods, or require ECAM (which all cloud VMs and modern QEMU provide)? Supporting both adds ~50 lines but makes bare-metal testing on older hardware possible.
NVMe queue depth. A single I/O queue pair with depth 32 is sufficient for initial use. Per-CPU queues (leveraging MSI-X per-queue interrupts) improve SMP throughput but add complexity. Defer per-CPU queues to after SMP is working.
Driver model unification. Resolved: PCI enumeration is the standalone PCI/PCIe Infrastructure item in the roadmap. The networking smoke test and NVMe driver both consume this shared subsystem. The networking proposal’s Part 1 Step 1 has been updated to reference this phase.
GCP vs AWS as first cloud target. The first cloud proof should be imported-image serial-console boot on both providers when practical, because that validates image format, firmware, bootloader, and early ACPI without depending on cloud NICs. For the later usable-networked-instance milestone, a constrained first/second-generation GCP virtio-net target is the easiest first network proof; broader GCP coverage needs gVNIC, and AWS follows once the NVMe/ENA path or an explicit workaround is ready.

References

Specifications

NVMe Base Specification 2.1 – register interface, queue semantics, command set
PCI Express Base Specification – ECAM, MSI/MSI-X capability structures
ACPI Specification 6.5 – MADT, MCFG, HPET table formats
Intel SDM Vol. 3, Ch. 10 – APIC architecture (LAPIC, I/O APIC)

Crates

acpi – no_std ACPI table parser
virtio-drivers – no_std virtio (already in networking proposal)

Prior Art

Redox PCI – microkernel PCI driver in Rust
Hermit NVMe – unikernel NVMe driver
rCore virtio – educational OS with virtio + PCI in Rust
Linux gVNIC driver – reference for gVNIC register interface (~3000 LoC)
Linux ENA driver – reference for ENA

Cloud Documentation

capOS Cross-Links

docs/design-risks-register.md – R13 (trusted build inputs are partly pinned) consolidates the long-horizon supply-chain risk view that gates cloud-image release paths; this proposal is recorded as a secondary owner.
docs/trusted-build-inputs.md – the actual inventory of pinned and observed-not-pinned build inputs, dependency policy, vendored upstream snapshots, and the build-provenance retention/comparison policy that cloud proofs must satisfy before they are cited as production evidence.
cloud-usable-instance-provider-nic-storage – the completed GCP-first usable-instance provider rollup covering provider NIC/storage authority, DMA backend selection, cloud teardown, and serial-console operator access.
docs/dma-isolation-design.md – DMA isolation backend selection (kernel-owned bounce buffers vs IOMMU/remapping) that cloud provider drivers must commit to before claiming usable-instance status.
docs/backlog/hardware-boot-storage.md – DDF Tasks 5 (userspace driver authority) and 6 (recurring cloud-portability gate) referenced from Phase 1 closeout above.
