Proposal: Hardware Abstraction and Cloud Deployment
How capOS goes from “boots in QEMU” to “boots on a real cloud VM” (GCP, AWS, Azure). This covers the hardware abstraction infrastructure missing between the current QEMU-only kernel and real x86_64 hardware, plus the build system changes needed to produce deployable images.
Depends on: Kernel Networking Smoke Test (for PCI enumeration), Stage 5 (for timer history), Stage 7 / SMP proposal Phase C (for LAPIC timer and IPI).
Complements: Networking proposal (extends virtio-net toward cloud NICs), Storage proposal (extends local block-device work toward virtio-scsi and NVMe), SMP proposal (LAPIC timer/IPI infrastructure shared, with x2APIC tracked as a later backend).
Current State
The kernel boots via Limine UEFI, outputs to COM1 serial, has QEMU legacy PCI
enumeration for the virtio-net smoke path, and has LAPIC timer/IPI groundwork
from the SMP track. It also has an initial bounded, read-only ACPI diagnostic
parser for Limine RSDP, RSDT/XSDT table inventory, MADT summaries, and MCFG
presence/allocation summaries, plus a Q35 smoke that proves the reusable PCI
config backend can enumerate a capped PCIe ECAM function inventory from MCFG.
The x86 path exports bounded MADT I/O APIC/source-override records, maps the
I/O APIC, and programs masked legacy IRQ routes to LAPIC vectors while honoring
source overrides. PCI drivers can validate and map memory BAR subregions through
a shared kernel helper; the virtio-net modern transport uses that helper for its
common, notify, ISR, and device configuration regions. The PCI capability walk
also reports MSI/MSI-X metadata for the virtio-net function, and the QEMU net
smoke uses that metadata for a bounded kernel-owned virtio-net MSI-X
dispatch/unmask and lifecycle proof through the device MSI vector pool; the
remaining run-net fixture also covers queue setup, descriptor guards, ARP, and
ICMP. Device-autonomous virtio-net MSI-X delivery is covered by the dedicated
userspace-provider gates after the kernel L4 owner is retired.
The cloudboot image/harness slice landed in commit 02635421
(2026-05-05 06:51 UTC):
make capos-cloudboot-image builds the importable raw disk tarball and
make cloudboot-test drives the GCE upload/import/temporary-instance/serial-log
loop with teardown. The first GCP imported-image serial-console boot proof is
run 1778230874-715a (2026-05-08 09:06 UTC) against source commit
3951e275 (2026-05-08 08:50 UTC), reaching the capos kernel starting
serial landmark on a temporary no-public-IP, no-service-account/scopes
e2-small instance before teardown.
capOS still lacks public L4/SSH/WebShell ingress, AWS/Azure boot proofs and provider drivers, broader storage variants, high-throughput/multiqueue NIC readiness, direct-remapping DMA, production cloud-image release paths, and a cloud-ready clocksource/clockevent closeout. The GCP-first provider rollup has live serial-console operator access, selected NIC raw-frame reachability, selected NVMe Persistent Disk I/O, and gVNIC portability evidence.
The GCP-first usable cloud-instance provider rollup is closed by
docs/tasks/done/2026-06-07/cloud-usable-instance-provider-nic-storage.md.
Do not cite the cloudboot harness or the first GCP serial-console boot alone as
evidence for provider NIC/storage readiness; the closeout depends on separate
live NIC, storage, operator-access, and gVNIC evidence records. AWS/Azure,
public ingress, and production cloud-image release gates remain separate.
Trusted Build Inputs And Reproducibility Cross-Links
Cloud deployment depends on the same trusted-build-inputs inventory that
covers local builds. The consolidated supply-chain risk view – floating Rust
nightly, observed-not-pinned xorriso / qemu-system-x86_64 / OVMF, CI
publication and comparison of build-provenance records, and pinned production
runner identity – is tracked as R13 in docs/design-risks-register.md; the
detailed inventory, dependency policy, vendored-snapshot table, and the
build-provenance retention/comparison policy live in
docs/trusted-build-inputs.md. This proposal is recorded as a secondary
owner of R13 because cloud-image release paths and provider-driver bring-up
both depend on those reproducibility gates.
The implication for cloud bring-up is concrete: imported cloud images must
travel with the corresponding make build-provenance record (source commit,
toolchain identity, embedded-binary hashes, OVMF identity or explicit
absence) before any provider serial-console run is cited as production
evidence. Until the R13 gates close, cloud images remain local/CI proof
artifacts rather than third-party reproducible boot images.
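A release gate could refuse to cite a serial-console run unless the accompanying provenance record carries every field listed above. The following is a minimal sketch only: the key=value text layout and the field names `source_commit`, `toolchain`, `embedded_hashes`, and `ovmf` are assumptions for illustration, not the actual output format of make build-provenance.

```shell
# Hypothetical provenance gate. The key=value layout and the field names
# below are assumptions for illustration, not the real record format.
provenance_missing_fields() {
    # $1: path to a provenance record; prints each missing field name.
    record=$1
    for field in source_commit toolchain embedded_hashes ovmf; do
        # A field counts as present only if it has a non-empty value.
        # "ovmf=absent" is accepted as the explicit-absence form.
        if ! grep -Eq "^${field}=.+" "$record"; then
            printf '%s\n' "$field"
        fi
    done
}

# Succeeds only when no required field is missing.
provenance_complete() {
    [ -z "$(provenance_missing_fields "$1")" ]
}
```

A harness could call `provenance_complete` on the record shipped next to the imported image and fail closed when it returns non-zero.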
What Cloud VMs Provide
GCP (n2-standard), AWS (m6i/c7i), and Azure (Dv5) all expose:
| Resource | Cloud interface | capOS status |
|---|---|---|
| Boot firmware | UEFI (all three) | Limine UEFI works |
| Serial console | COM1 0x3F8 | Works (serial.rs) |
| Boot media | Hybrid BIOS+UEFI raw disk image, packaged per provider import rules | Partial (make capos-cloudboot-image builds a GCE-importable raw disk tarball; production release packaging and non-GCP provider packaging remain future) |
| Storage | virtio-scsi or NVMe (GCP Persistent Disk), NVMe/EBS (AWS Nitro), managed disks | Partial (GCP NVMe Persistent Disk brokered READ proof landed; GCP virtio-scsi, Local SSD, AWS/Azure storage, and broader filesystem-backed cloud storage remain future) |
| NIC | virtio-net or gVNIC (GCP), ENA (AWS), MANA (Azure) | Partial (GCP legacy virtio-net raw-frame provider-nic-bound and gVNIC raw-frame / typed-Nic proofs landed; public ingress, high-throughput/multiqueue, ENA, and MANA remain future) |
| Virtio NIC | QEMU, GCP where selectable, some bare-metal | Partial (QEMU smoke; reusable/cloud path planned) |
| Timer | LAPIC timer, TSC, HPET | Partial (LAPIC timer groundwork; cloud clocksource work missing) |
| Interrupt delivery | I/O APIC, MSI/MSI-X | Partial (masked MADT-backed I/O APIC routes, MSI/MSI-X capability metadata, and bounded kernel-owned virtio-net MSI-X dispatch/lifecycle proof; I/O APIC ownership and userspace interrupt authority missing) |
| Device discovery | ACPI + PCI/PCIe | Partial (QEMU legacy PCI smoke, bounded ACPI diagnostics/routing state, reusable legacy/ECAM PCI config access, kernel BAR/MMIO validation, MSI/MSI-X metadata discovery, and bounded virtio-net MSI-X dispatch proof; broader driver authority still missing) |
| Display | None (headless) | N/A |
Cloud NIC And Storage Portability Notes
The Device Driver Foundation is not complete just because QEMU virtio-net
works. Cloud bring-up has provider-specific NIC and storage surfaces, and the
first implementation slices must keep those differences visible while still
deferring the actual provider drivers.
| Provider path | Expected device surface | capOS dependency | Current state |
|---|---|---|---|
| QEMU / constrained GCP virtio-net | Virtio PCI transport, virtqueues, MSI-X where available | Shared virtio transport helpers, DMAPool, DeviceMmio, Interrupt, and queue lifecycle proofs | QEMU virtio-net proofs and the live GCE legacy virtio-net raw-frame provider-nic-bound proof landed. This does not claim public L4 ingress, high-throughput/multiqueue readiness, or device-autonomous MSI-X completion delivery |
| GCP gVNIC | gVNIC as the modern Compute Engine NIC, replacing virtio-net on newer machine generations and required for some features | PCI BAR/MMIO binding, MSI-X routing, per-queue ring setup, image metadata declaring GVNIC, and fallback choice between virtio-net and gVNIC by machine family | Grounding plus bounded live proofs landed: the GCE gVNIC provenance map records the spec basis and authority mapping, the GCE harness can request GVNIC image/instance posture and inventory the 1ae0:0042 PCI function, the admin-queue/register proof maps BAR0 and issues one DESCRIBE_DEVICE, the raw-frame proof configures one GQI/QPL TX/RX queue pair, and the typed Nic adaptation proof exercises inline-frame Nic.transmit / Nic.receive over live gVNIC. No QEMU gVNIC model exists. This remains a separate GCE portability lane, not a blocker for the first public Web UI proof on a virtio-compatible machine type |
| AWS Nitro ENA + EBS | ENA enhanced networking plus Nitro NVMe storage | ENA queue/MSI-X driver, NVMe controller/storage path, IOMMU or bounce-buffer policy, and image import with ENA/NVMe expectations | Planned; no ENA, NVMe EBS, or AWS boot proof |
| Azure Accelerated Networking | Accelerated Networking exposes SR-IOV hardware families, with MANA as the newer Azure NIC and Mellanox mlx4/mlx5 still relevant on some hosts | Synthetic-interface fallback awareness, VF binding/revocation handling, MANA/Mellanox driver binding, MSI-X routing, and reset/revoke paths that survive VF removal | Planned; no MANA, Mellanox VF, or Azure boot proof |
These rows are planning gates, not implementation evidence. Each provider NIC
has its own queue layout, feature negotiation, MSI-X/vector conventions, reset
behavior, and driver-binding rules. Azure’s accelerated-networking path also
requires the OS and applications to tolerate dynamic SR-IOV VF revocation by
falling back to the synthetic network interface. Provider storage follows the
same rule: AWS Nitro uses NVMe for EBS, GCP can require NVMe on newer or
Confidential VM paths while retaining virtio-scsi on older paths, and Azure
uses SCSI on many older families while Azure Boost and newer NVMe-capable VM
families expose managed disks through NVMe. The shared foundation therefore
needs ACPI/PCIe discovery, BAR validation, interrupt ownership, DMAPool
accounting, IOMMU/bounce-buffer policy, and lifecycle teardown before any cloud
NIC or storage driver is treated as portable.
What Already Works
- UEFI boot – Limine ISO includes BOOTX64.EFI. The boot path itself is cloud-compatible.
- Serial output – all three clouds expose COM1. gcloud compute instances get-serial-port-output, aws ec2 get-console-output, and Azure serial console all read from it.
- x86_64 long mode – GCP and AWS Nitro VMs are KVM-based and Azure VMs run on Hyper-V, but all three present x86_64 guests. Architecture matches.
Managed Application Services
Booting capOS on a cloud VM and using managed cloud services are separate tracks. The VM path proves hardware, disk, network, and serial behavior. Managed services can be useful earlier for application persistence, especially game profile/world state, as long as they sit behind narrow capOS service capabilities.
For a GCP-backed adventure persistence bridge:
- Cloud Run hosts a small bridge endpoint. It translates capOS save/load/append requests into provider calls and enforces request bounds before touching cloud APIs.
- Cloud KMS owns the key-encrypting keys (KEKs) for each game-world instance or shard. The bridge or game-world service gets narrow authority to wrap or unwrap data-encrypting keys (DEKs) through Cloud KMS envelope encryption. Ordinary browser clients do not receive DEKs, game-world key capabilities, KMS decrypt/unwrap grants, or provider-independent plaintext authority; provider storage objects contain ciphertext, wrapped DEKs, and metadata only.
- Firestore Native mode stores mutable profile summaries, indexes, and compare-and-set version records.
- Cloud Storage stores larger immutable snapshots, evidence blobs, exports, and content-addressed records. Object versioning and lifecycle policy are required before using it for durable game data.
- Secret Manager stores bridge-side provider credentials and rotation material. Those secrets are never granted to ordinary capOS game clients.
This does not change the storage proposal’s rule: persistence is still
application-level serialization of bounded Cap’n Proto records. The cloud bridge
is just one backing implementation for Store, Namespace, or an
app-specific AdventureSaveStore/CloudGameStore capability. Local fake-cloud
tests must enforce stale-write rejection, wrong-profile rejection, append-only
ledger behavior, and size bounds before a real GCP deployment is trusted.
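Two of the fake-cloud checks named above, stale-write rejection and size bounds, can be sketched as a single write gate. This is an illustrative sketch under assumptions: the function name, the decision strings, and the `MAX_RECORD_BYTES` limit are hypothetical and not part of any capOS interface.

```shell
# Sketch of a fake-cloud write gate: compare-and-set on a version counter
# plus a payload size bound. MAX_RECORD_BYTES is an illustrative limit.
MAX_RECORD_BYTES=65536

# $1: version currently stored, $2: version the writer last read,
# $3: payload size in bytes.
# Prints the decision and returns non-zero on rejection.
fake_store_check_write() {
    stored=$1 expected=$2 size=$3
    if [ "$size" -gt "$MAX_RECORD_BYTES" ]; then
        echo "reject:size-bound"; return 1
    fi
    if [ "$stored" -ne "$expected" ]; then
        echo "reject:stale-write"; return 1
    fi
    echo "accept:version=$((stored + 1))"
}
```

A writer that read version 3 while the store has already moved to version 4 gets `reject:stale-write` and must reload before retrying.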
A separate browser-mediated path can serve user-owned private backups. In that
model, the browser or web terminal host authenticates the user to Google, stores
encrypted save capsules in Drive appDataFolder or Firebase user documents, and
returns only opaque provider handles and encrypted capsule bytes through
explicit restore flows. DEK unwrap and plaintext validation happen in the local
capOS key domain or in the game-world service with KMS/IAM authority, not in
browser JavaScript.
This is appropriate for user profile backup, private expedition checkpoints,
and settings sync. It is not appropriate for authoritative public world state,
reward witness records, market receipts, or multiplayer outcomes. The user’s
browser holds provider tokens; capOS game services do not. For GCP-backed game
worlds, the browser transports envelope-encrypted capsules with wrapped DEKs but
does not hold game-world key capabilities, KMS decrypt/unwrap grants, DEKs, or
plaintext authority.
Firebase user-document capsule paths must make the auth binding visible in the
path template, not just in policy metadata. Use a narrow shape such as
users/{request.auth.uid}/saveCapsules/{capsule_id} so Firestore rules can
bind the user wildcard to request.auth.uid; literal profile names such as
users/alice/... are not accepted by the capOS policy model. Firestore rules
remain access control for opaque encrypted capsules only. They must not be
treated as validation for decrypted adventure semantics, and path segments must
respect Firestore ID constraints: an ID must not be exactly . or .., must not
match __.*__, and must stay within the 1,500-byte collection/document ID limit.
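The ID constraints can be checked before a capsule path is ever built. A minimal sketch, assuming a hypothetical helper name `firestore_id_ok`; the empty-string and forward-slash checks are additions beyond the constraints listed above, since either would break a single path segment.

```shell
# Returns 0 if $1 is usable as a single Firestore collection/document ID
# under the constraints named above; non-zero otherwise.
firestore_id_ok() {
    id=$1
    case $id in
        ''|.|..) return 1 ;;   # empty, single period, double period
        */*)     return 1 ;;   # a slash would split the path segment
        __*__)   return 1 ;;   # reserved pattern __.*__
    esac
    # Byte length, not character count: the limit is 1,500 bytes.
    bytes=$(( $(printf '%s' "$id" | wc -c) ))
    [ "$bytes" -le 1500 ]
}
```

This validates only the shape of a path segment. It says nothing about whether the caller is authorized for that path, which remains the job of the Firestore rules binding the user wildcard to request.auth.uid.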
GCP Cloud KMS And IAM Notes For Adventure Saves
GCP-backed adventure save capsules follow the same envelope-encryption model as
CloudKmsKeySource and the volume-encryption proposal: Cloud KMS holds a
key-encrypting key (KEK), the game-world service owns the capsule
data-encrypting key (DEK), and KMS Encrypt/Decrypt wraps or unwraps that
DEK rather than bulk-encrypting capsule bytes. Provision one Cloud KMS key ring
and one symmetric CryptoKey KEK per game-world instance or shard. The key ring
is an administrative grouping boundary; ordinary runtime authority should be
granted on the CryptoKey resource where possible, not at the project or key-ring
level. Do not claim key-version-scoped IAM as a design primitive for this path:
predefined Cloud KMS crypto roles have CryptoKey as their lowest grantable
resource.
Service accounts are split by operation:
- Writers that only create new ciphertext receive roles/cloudkms.cryptoKeyEncrypter on the configured game-world CryptoKey so they can wrap a freshly generated DEK.
- Restore, validation, and migration workers that must read protected capsules receive roles/cloudkms.cryptoKeyDecrypter on that CryptoKey so they can unwrap an existing DEK.
- The narrow game-world service account receives roles/cloudkms.cryptoKeyEncrypterDecrypter only when the same service must both wrap and unwrap DEKs.

Avoid roles/cloudkms.cryptoOperator, project-wide grants, owner/editor roles, browser OAuth identities, and service-agent roles for ordinary adventure runtime access.
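The role split can be expressed as a small mapping from operation to the narrowest predefined role, plus a CryptoKey-level binding. The sketch below only prints the gcloud command rather than running it; the helper names and the operation labels `wrap`, `unwrap`, and `wrap+unwrap` are hypothetical, and every key, key ring, location, and service-account value is an operator-supplied placeholder.

```shell
# Map a runtime operation to the narrowest predefined Cloud KMS role.
kms_role_for() {
    case $1 in
        wrap)        echo roles/cloudkms.cryptoKeyEncrypter ;;
        unwrap)      echo roles/cloudkms.cryptoKeyDecrypter ;;
        wrap+unwrap) echo roles/cloudkms.cryptoKeyEncrypterDecrypter ;;
        *)           echo "unknown operation: $1" >&2; return 1 ;;
    esac
}

# Print (do not run) a CryptoKey-level IAM binding.
# $1: key, $2: key ring, $3: location, $4: service account email, $5: operation.
render_kms_binding() {
    key=$1 keyring=$2 location=$3 sa=$4 op=$5
    role=$(kms_role_for "$op") || return 1
    printf 'gcloud kms keys add-iam-policy-binding %s --keyring=%s --location=%s --member=serviceAccount:%s --role=%s\n' \
        "$key" "$keyring" "$location" "$sa" "$role"
}
```

Binding at the key rather than the key ring or project keeps a compromised writer from unwrapping DEKs that belong to another game-world instance.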
The browser-vault boundary does not change. Browser JavaScript may carry
ciphertext, wrapped DEKs, capsule metadata, and opaque Drive/Firebase provider
handles. It must not receive plaintext DEKs, capOS SymmetricKey or
KeySource capabilities, Cloud KMS decrypt/unwrap grants, service account
credentials, or provider-independent plaintext. The game-world service may use
the unwrapped DEK internally as service authority, modeled as a SymmetricKey
capability, but that authority does not cross into browser JavaScript.
Possession of a Drive file id or Firebase document path is only transport
authority over opaque encrypted bytes.
Rotation creates a new primary KEK version for future DEK wrapping. It does not re-encrypt existing capsules, rewrite wrapped DEK blobs, or disable/destroy old key versions automatically. Capsule re-encryption or rewrapping is a managed game-world service operation: unwrap the old DEK while its KEK version remains enabled and authorized, decrypt and validate the capsule inside the service, then write a new capsule using a new DEK or a DEK rewrapped by the current primary KEK version. The service verifies content hashes and ledger/profile bindings before replacing capsule metadata. Old KEK versions should only be disabled or scheduled for destruction after inventory proves no accepted wrapped DEK still depends on them.
Retiring a game-world first removes IAM decrypt authority from the world service and migration workers. If the retirement is meant to make existing capsules inaccessible, disable the relevant key versions and record the expected outage and recovery procedure before doing it. Destruction is delayed by Cloud KMS’ scheduled destruction period and is irreversible once completed, so destroy key versions only after audit retention, export, and break-glass recovery decisions are recorded. Disabling or destroying a key version can make all capsules that depend on it unreadable; this is a revocation tool, not cleanup.
Phase 1: Bootable Disk Image And Serial Diagnostics
Goal: Produce a raw hybrid BIOS+UEFI disk image that can boot locally and can be packaged for cloud import, alongside the existing ISO for QEMU. The first cloud-visible proof is serial-console boot to init/diagnostics, not network shell access.
The Problem
Cloud VMs boot from disk images, not ISOs. Each cloud has provider-specific format and boot-mode rules:
| Cloud | Image format | Import method |
|---|---|---|
| GCP | disk.raw inside a gzip-compressed tar archive (.tar.gz) in old GNU tar format; raw size in 1 GiB increments | gcloud compute images create --source-uri=gs://... |
| AWS | raw, VMDK, VHD/VHDX, or OVA | aws ec2 import-image with explicit boot-mode notes |
| Azure | VHD (fixed size) | az image create --source |
GCP’s manual import path documents a functional MBR partition table or a
hybrid GPT+MBR bootloader configuration for imported boot disks, plus ACPI
support. AWS VM Import/Export supports both UEFI and legacy BIOS boot modes,
but UEFI imports need a fallback EFI binary at /EFI/BOOT/BOOTX64.EFI; Nitro
instances generally expect NVMe storage and ENA networking for useful
operation. Therefore the first capOS image target should be a hybrid
BIOS+UEFI raw disk: an ESP for UEFI fallback boot and a BIOS/MBR-compatible
Limine path for import paths that still validate MBR bootability.
Disk Layout
Hybrid raw disk image (1 GiB-aligned for cloud packaging)
Protective/hybrid MBR + GPT
Partition 1: EFI System Partition (FAT32, ~32 MB)
/EFI/BOOT/BOOTX64.EFI (Limine UEFI loader)
/limine.conf (bootloader config)
/boot/kernel (capOS kernel ELF)
/boot/init (init process ELF)
Partition 2: (reserved for future use -- persistent store backing)
Build Tooling
New Makefile target make image using standard tools:
IMAGE := capos.img
# Size in MiB (dd bs=1M); 1024 MiB is exactly 1 GiB, which keeps GCP raw image packaging simple
IMAGE_SIZE := 1024
image: kernel init $(LIMINE_DIR)
# Create raw disk image
dd if=/dev/zero of=$(IMAGE) bs=1M count=$(IMAGE_SIZE)
# Partition with GPT + ESP; keep room for hybrid/MBR boot metadata.
sgdisk -n 1:2048:+32M -t 1:ef00 $(IMAGE)
# Format ESP as FAT32, copy files
# (mtools or loop mount + mkfs.fat)
mformat -i $(IMAGE)@@1M -F -T 65536 ::
# mcopy does not create parent directories, so create them first
mmd -i $(IMAGE)@@1M ::/EFI ::/EFI/BOOT ::/boot
mcopy -i $(IMAGE)@@1M $(LIMINE_DIR)/BOOTX64.EFI ::/EFI/BOOT/
mcopy -i $(IMAGE)@@1M limine.conf ::/
mcopy -i $(IMAGE)@@1M $(KERNEL) ::/boot/kernel
mcopy -i $(IMAGE)@@1M $(INIT) ::/boot/init
# Install Limine BIOS path as well as UEFI fallback files.
$(LIMINE_DIR)/limine bios-install $(IMAGE)
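The numbers in the recipe above have to agree with each other: the sgdisk start sector, the mtools image offset, and the mformat sector count all describe the same partition. A small arithmetic sketch, assuming 512-byte logical sectors (the variable and function names are illustrative):

```shell
SECTOR_BYTES=512
ESP_START_SECTOR=2048   # from: sgdisk -n 1:2048:+32M
ESP_SIZE_MIB=32

# Byte offset of the ESP inside the image; this is what @@1M must equal.
esp_offset_bytes()  { echo $(( ESP_START_SECTOR * SECTOR_BYTES )); }

# Sector count of the ESP; this is what mformat -T must equal.
esp_total_sectors() { echo $(( ESP_SIZE_MIB * 1024 * 1024 / SECTOR_BYTES )); }
```

Here `esp_offset_bytes` yields 1048576 (1 MiB) and `esp_total_sectors` yields 65536, so changing the ESP start or size means updating the mtools offset and the mformat -T value together.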
New QEMU target to test disk boot locally:
run-disk: $(IMAGE)
qemu-system-x86_64 -drive file=$(IMAGE),format=raw \
-bios /usr/share/edk2/x64/OVMF.4m.fd \
-display none $(QEMU_COMMON); \
test $$? -eq 1
Cloud upload helpers (scripts, not Makefile targets):
# GCP
cp capos.img disk.raw
tar --format=oldgnu -Sczf capos.tar.gz disk.raw
gcloud storage cp capos.tar.gz gs://my-bucket/
gcloud compute images create capos --source-uri=gs://my-bucket/capos.tar.gz
# AWS
aws ec2 import-image --disk-containers \
"Format=raw,UserBucket={S3Bucket=my-bucket,S3Key=capos.img}" \
--boot-mode uefi
Serial diagnostics are part of Phase 1 rather than a later convenience. The cloud bring-up loop should be:
- make run-disk proves the hybrid image under local QEMU/OVMF;
- a local BIOS-mode disk run proves the MBR/Limine path if provider import requires it;
- a serial diagnostics prompt is reachable on COM1 in QEMU;
- GCP/AWS imported instances reach the same prompt through provider serial console output.
The serial diagnostics prompt should expose bounded read-only commands for
status, cpu, mem, acpi, pci, irq, timers, devices, and logs,
plus reboot/halt. It is the early remote debugging path for cloud driver
bring-up before NICs or disks are reliable. It should not be required to upload
large binaries, replace kernels in place, or stream high-volume tracing through
cloud serial consoles.
Dependencies
- sgdisk (gdisk package) – GPT partitioning
- mtools (mformat, mcopy) – FAT32 manipulation without root/loop mount
Scope
Makefile/helper script work for the image plus a narrow diagnostics-mode surface. Kernel changes are limited to serial diagnostics and any boot path adjustments needed for disk images; network and block drivers remain later phases.
Phase 0 closeout: GCE harness landed (2026-05-05 06:51 UTC)
Commit 02635421 (2026-05-05 06:51 UTC) records this harness closeout.
The first build-and-boot leg of Phase 1 landed as the cloud-boot harness.
make capos-cloudboot-image produces a 10 GiB GPT-partitioned target/disk.raw
with a 128 MiB FAT32 EFI System Partition holding the Limine UEFI loader,
limine.conf, the kernel ELF, and the manifest, plus the Limine BIOS stage 2
embedded in the GPT for legacy SeaBIOS boot. The disk is repackaged as
target/capos-disk.tar.gz using tar --format=oldgnu -czf, the exact form
GCE’s manual import path expects. Disk size is enforced as an exact multiple
of 1 GiB.
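The 1 GiB rule is plain integer arithmetic on the image size in bytes. A minimal sketch with hypothetical helper names; how the real image target enforces the rule is not shown here.

```shell
GIB=$((1024 * 1024 * 1024))

# Succeeds only for a non-zero size that is an exact multiple of 1 GiB.
is_gib_aligned() {
    [ "$1" -gt 0 ] && [ $(( $1 % GIB )) -eq 0 ]
}

# Smallest 1 GiB multiple that can hold $1 bytes.
round_up_gib() {
    echo $(( ($1 + GIB - 1) / GIB * GIB ))
}
```

A packaging step could feed it the byte count of the raw disk, for example from `wc -c < target/disk.raw`, and refuse to build the tarball when the check fails.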
tools/cloudboot/run-test.sh (also wired as make cloudboot-test) drives the
end-to-end loop on a sandbox GCE project: an idempotent orphan sweep on a
configured project-pinned label, a staging tarball upload, image creation,
instance creation with no public IP, no service account, no API scopes, the
same project-pinned label set, and the configured sandbox subnet, then
serial-port polling for the capos kernel starting landmark with a hard
wall-clock budget. Serial output is captured under
target/cloudboot-evidence/run-<id>/serial.log BEFORE teardown, and a bash
trap on EXIT INT TERM always deletes the instance, image, and staged
tarball even on signal or partial failure. The harness hard-fails if the
active project name does not match the configured sandbox.
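The polling and teardown discipline described above can be sketched in a few lines. This is not the contents of tools/cloudboot/run-test.sh: it assumes a separate fetch step has already written serial output to a local file, and `cleanup_resources` is a hypothetical stand-in for the instance, image, and staged-tarball deletions.

```shell
# Poll a serial log for a landmark until a wall-clock budget expires.
# $1: log file, $2: landmark string, $3: budget in seconds,
# $4: poll interval in seconds (default 5).
wait_for_landmark() {
    log=$1 landmark=$2 budget=$3 interval=${4:-5}
    deadline=$(( $(date +%s) + budget ))
    while :; do
        if [ -f "$log" ] && grep -qF -- "$landmark" "$log"; then
            return 0
        fi
        [ "$(date +%s)" -ge "$deadline" ] && return 1
        sleep "$interval"
    done
}

# Hypothetical teardown; it must be idempotent because a signal trap can
# fire first and the EXIT trap then runs it a second time.
cleanup_resources() { echo "teardown: instance image staged-tarball"; }

# Register teardown for normal exit and for interrupt/terminate signals.
install_teardown_trap() { trap cleanup_resources EXIT INT TERM; }
```

Registering the trap before the first provider mutation is what makes the "always deletes" guarantee hold on partial failure: a resource created before the trap exists can leak.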
Sandbox project name, subnet, staging bucket, and the IAM custom roles the
harness assumes are operational details that depend on the host environment;
they belong in tools/cloudboot/README.md and operator-local configuration,
not in this proposal.
This is the harness only. The recurring portability gate that records cloud
boot evidence on every reviewed cloud-relevant change remains open as
docs/backlog/hardware-boot-storage.md Task 6, and the userspace driver
authority gate remains open under DDF Task 5.
First GCP serial-console boot proof (2026-05-08 09:06 UTC)
The first imported-image GCP serial-console proof reached
capos kernel starting as run 1778230874-715a at 2026-05-08 09:06 UTC,
against source commit 3951e275 from 2026-05-08 08:50 UTC. The run used
the cloudboot harness to import the staged disk image, create a temporary
e2-small instance with no public IP and no service account/scopes, poll
serial output for the kernel-start landmark, save the serial log under the
run evidence directory, and tear down the temporary instance/image/staging
objects.
This proves imported-image firmware/bootloader/kernel serial reachability on one GCP sandbox run only. It does not prove a usable cloud instance, provider NIC or storage drivers, cloud clocking, persistence, SSH/network shell access, AWS/Azure import, or production cloud readiness.
Private Web UI Reachability Evidence Contract
The first self-hosted Web UI provider proof is private GCE reachability, not
operator browser exposure. The behavior task
cloud-gce-private-self-hosted-webui-proof
extends tools/cloudboot/run-test.sh with --require-web-ui-proof only after
the local Web UI L4 proof, DHCP/IPv4 configuration, and Web UI hardening tasks
are closed. This proposal defines the evidence contract for that later behavior
slice; it does not authorize a billable GCE run, a public endpoint, broad
firewall changes, TLS certificate provisioning, service-account broadening, or a
production release.
The proof must keep the current cloudboot posture unless the behavior task is
explicitly amended: no public IP on the capOS VM, no service account, no API
scopes, no public firewall rule, and teardown through the existing orphan-sweep
and EXIT INT TERM trap discipline. The reachability probe must cross the live
GCE virtual network boundary. Acceptable shapes include a same-VPC probe
instance, a provider-supported internal probe path, or another reviewed private
path that sends packets through the capOS VM’s GCE NIC and private endpoint.
Evidence classes stay separate:
| Evidence class | What it can prove | What it cannot prove |
|---|---|---|
| Cloudboot-only | The image imports, boots, emits serial markers, and tears down provider resources | Web UI reachability over the provider network |
| Provider-private | A private probe reaches remote-session-web-ui through the live GCE NIC and Phase C L4 path | Public operator access, TLS readiness, DNS readiness, or browser production posture |
| Operator-exposure | A separately authorized public or browser-mediated path reaches the Web UI under the selected ingress policy | The private proof by itself; it must depend on the private proof instead |
The private Web UI proof records, before teardown, at least:
| Field | Requirement |
|---|---|
| Run identity | Cloudboot run id plus source commit or image provenance used for the imported image |
| Machine shape | GCE machine family/type, NIC selection posture, and zone |
| Private posture | public_ip=false or equivalent, service-account/scopes posture, and no public firewall rule |
| Private endpoint | Internal IP or provider-private endpoint, UI port, and probe source identity |
| Probe path | Same-VPC probe, provider-supported internal probe, or other reviewed private path that crosses the GCE virtual network boundary |
| Web UI marker | A run-unique Web UI response marker, header, or body token observed by the private probe |
| Phase C L4 marker | The remote-session-web-ui Phase C L4 evidence marker, such as cloudboot-evidence: remote-session-web-ui-l4 <token>, tied to the same source commit/image |
| Private proof marker | A final structured marker, such as cloudboot-evidence: gce-private-self-hosted-webui <token>, emitted only after the private probe succeeds |
| Teardown | Instance, image, staged object, probe resources, and any private firewall or route resources created by the run were deleted or reported as a failed run |
Private Proof Runbook Checklist
The future --require-web-ui-proof harness gate closes provider-private Web UI
reachability only when the run records these steps in order:
- Preflight confirms the local Web UI L4 proof, DHCP/IPv4 proof, session hardening, and connection-bound prerequisites are closed, and confirms that the run has current authorization for billable private GCE execution.
- Image/source provenance records the cloudboot run id, source commit, imported image or staged object identity, and the local artifact set used for the VM.
- Launch posture records the zone, machine type, NIC posture, no public IP, no service account or API scopes, and no public firewall rule.
- Probe setup records the private endpoint, UI port, probe source identity, and same-VPC or provider-supported private path that crosses the GCE virtual network boundary.
- The private probe fetches the Web UI over that provider-private path and records a run-unique response marker, header, or body token.
- The serial or harness evidence ties the same run to the Phase C L4 marker for remote-session-web-ui, such as cloudboot-evidence: remote-session-web-ui-l4 <token>, from the same source commit/image.
- The harness emits the private proof marker, such as cloudboot-evidence: gce-private-self-hosted-webui <token>, only after the provider-private probe and L4-marker correlation both succeed.
- Teardown removes the VM, imported image, staged object, probe resources, and any private firewall or route resources created by the run, using the normal orphan-sweep and trap discipline.
- Failed-run reporting preserves the run id, failure class, last observed private posture, teardown result, and whether any loopback, same-guest, or serial-only diagnostics passed without treating those diagnostics as a provider-private proof.
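The ordering requirement in the checklist, where the private proof marker comes only after the probe pass and the L4 marker, can be checked mechanically against an evidence log. A sketch under assumptions: the two cloudboot-evidence prefixes follow the examples above, while the `private-probe: pass` line format and the function names are hypothetical. It checks line order only and does not compare tokens.

```shell
L4_MARKER='cloudboot-evidence: remote-session-web-ui-l4 '
PROOF_MARKER='cloudboot-evidence: gce-private-self-hosted-webui '
PROBE_MARKER='private-probe: pass'   # hypothetical line format

# Print the line number of the first line in file $1 containing $2, or nothing.
first_line_of() {
    grep -nF -- "$2" "$1" | head -n 1 | cut -d: -f1
}

# Succeeds only if all three markers are present and the private proof
# marker appears after both the probe pass and the L4 marker.
marker_order_ok() {
    log=$1
    probe=$(first_line_of "$log" "$PROBE_MARKER")
    l4=$(first_line_of "$log" "$L4_MARKER")
    proof=$(first_line_of "$log" "$PROOF_MARKER")
    [ -n "$probe" ] && [ -n "$l4" ] && [ -n "$proof" ] || return 1
    [ "$probe" -lt "$proof" ] && [ "$l4" -lt "$proof" ]
}
```

A fuller validator would also confirm that the tokens and source commit agree across the markers, as the run-identity requirement demands.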
No-Spend Preflight (Step 1, Landed as a Local Gate)
Step 1 of the checklist is implemented and testable today without any provider
mutation: tools/cloudboot/run-test.sh --require-web-ui-proof --preflight-only
runs the local no-spend preflight and exits before the harness access probe,
orphan sweep, upload, image import, instance launch, firewall mutation, or any
probe resource. It validates that the local prerequisite proofs are done
(cloud-prod-remote-session-web-ui-l4-local-proof,
remote-session-web-ui-session-hardening,
remote-session-web-ui-connection-bounds, and the legacy-datapath serving
prerequisite cloud-gce-legacy-virtio-webui-serving-local-proof), that an
operator supplied a firewall-IAM attestation (the documented live blocker), and
that a current per-run billable authorization is present, emitting one
structured cloudboot-webui-preflight: line per check naming the failure class
without printing credentials or attestation values. make cloudboot-gce-private-webui-preflight-check is the fixture gate proving the
safe failure paths and that no provider CLI is invoked on any preflight path
(tools/cloudboot/README.md documents the inputs and failure classes). A
preflight pass is cloudboot-only evidence – the output labels itself
evidence-class=cloudboot-local-preflight – and is neither the
provider-private proof nor authorization for a billable run. The live
--require-web-ui-proof gate remains unimplemented and fails closed without
--preflight-only.
Evidence-Grammar Fixture (Local Gate)
The closeout evidence grammar for the table above is also locally testable
without any provider mutation:
tools/cloudboot/validate-private-webui-evidence.sh validates a
harness-rendered evidence report for field completeness, marker ordering (the
private proof marker only after the recorded private-probe pass and the
correlated remote-session-web-ui-l4 marker), run/source identity agreement,
private posture, and teardown result, and rejects loopback-only, serial-only,
same-guest, public-IP, public-firewall, and missing-teardown evidence with
structured failure classes. make cloudboot-gce-private-webui-evidence-fixture-check is the fixture gate
(tools/cloudboot/README.md documents the report grammar and failure
classes). A pass is
evidence-class=cloudboot-local-private-webui-evidence-fixture with an
explicit provider-private-reachability=not-proven label: it proves only that
a future successful run’s evidence will be parsed, ordered, and classified
correctly, not that any provider-private probe has run.
Loopback-only checks (127.0.0.1, guest-local localhost, or an in-guest HTTP
health request) are supplemental service-health evidence. They may help diagnose
a failed run, but they do not close cloud-gce-private-self-hosted-webui-proof
because they do not prove the provider NIC, VPC routing, private endpoint, or
probe-to-VM packet path. Serial-only markers are likewise insufficient for the
private Web UI proof unless the private probe also succeeds and the harness
records the required provider-private fields.
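The loopback rule lends itself to an early classification step on the probe target. A minimal sketch; the function name and the two class labels are hypothetical, and a non-loopback target is only a candidate that still needs the full provider-private evidence fields.

```shell
# Classify a probe target. Loopback and guest-local names can only ever
# count as supplemental service-health evidence.
probe_target_class() {
    case $1 in
        127.*|localhost|::1) echo supplemental-loopback ;;
        *)                   echo candidate-private ;;
    esac
}
```

Classifying before the probe runs lets a failed-run report state plainly that a passing loopback check was diagnostic only.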
The public ingress policy below remains a later authorization boundary. Closing the private proof does not permit a public IP, load balancer, DNS name, TLS certificate, Identity-Aware Proxy, operator browser exposure, or widened service account. Public browser-facing exposure must reference the private proof as an input and then satisfy the separate public-ingress policy and on-hold approval gate.
Public Web UI Ingress Policy (First Operator-Access Proof)
The cloudboot harness intentionally launches with no public IP, no service
account, and no API scopes. Exposing the self-served capOS Web UI
(remote-session-web-ui, see
Remote Session CapSet Client
Gate 1B) to an operator browser is therefore a separate, reviewed exposure
decision, not a follow-on of the private reachability proof. This section is the
selected policy that the first public-ingress behavior task
(cloud-gce-public-self-hosted-webui-ingress-tls)
builds against, decided by
cloud-gce-public-webui-ingress-tls-policy-design.
Selected Ingress Shape: Provider-Terminated HTTPS Load Balancer
The first public proof uses a GCP external Application Load Balancer that terminates HTTPS at the Google front end. capOS serves only plain HTTP/1.1 on its UI backend port; the operator browser reaches the UI exclusively through the load balancer’s HTTPS virtual IP and hostname. TLS is terminated by Google’s front end against a managed certificate; capOS never holds the TLS private key and never parses hostile TLS bytes in this proof.
graph LR
B[Operator browser] -- HTTPS --> LB[GCP external HTTPS<br/>Application Load Balancer<br/>Google-managed cert]
LB -- HTTP, health-check-scoped firewall --> NEG[Zonal NEG / backend service]
NEG --> VM[capOS VM<br/>remote-session-web-ui :8080<br/>plain HTTP/1.1, no public IP]
style LB fill:#2d5,stroke:#333
style VM fill:#2d5,stroke:#333
Why this shape is the first proof rather than direct capOS TLS termination:
- No capOS TLS termination stack exists yet. The Phase-1 certificate verifier has landed, but the capability-native TLS termination model (TlsServerConfig, ACME issuance, OCSP stapling, and private-key custody) is not landed in Certificates and TLS, and the userspace L4 network stack has not yet completed full TcpSocket relocation. The ACME/Let’s Encrypt successor path is decomposed, but it still depends on minimal PrivateKey/KeyVault/KeySource custody, server-side TLS, the RFC 8555 client, the scoped http-01 solver, and CertificateStore.watch renewal. A direct external IP would put capOS’s nascent userspace HTTP parser at the first byte of hostile internet traffic with no TLS and no reviewed key custody.
- Least privilege and reversibility. Provider-terminated TLS keeps the VM with no public IP, no inbound 0.0.0.0/0, and no private-key custody in either capOS or the harness. Teardown is the deletion of a bounded set of provider resources, not the rotation of an exposed key.
- Clean successor path. When the capability-native TLS stack and an ACME flow ship, the direct-external-IP / capOS-terminated shape becomes available as a second, separately reviewed ingress. This proof does not foreclose it; it is the bootstrap step before it. The interim posture is recorded as “Bootstrap TLS for the First Public GCE Web UI” in Certificates and TLS, and the public GCE successor task is cloud-gce-public-webui-letsencrypt-direct-termination-proof. That successor requires a controlled public DNS name plus explicit billable/public-ingress authorization, and any Let’s Encrypt production call requires explicit CA authorization.
Raw public HTTP is not acceptable closeout evidence. If port 80 is published at all, it exists only as an HTTP-to-HTTPS 301 redirect at the load balancer and never reaches capOS. The closeout evidence must be the HTTPS path.
An optional hardening for the first proof is to enable Identity-Aware Proxy
(IAP) on the backend service so the public door is gated by Google IAM before
any request reaches the capOS backend. IAP here is not a separate ingress shape:
it rides on the same external HTTPS load balancer and gates that backend service,
so the ALB is still the only public entry point. IAP composes with, and does not
replace, the capOS SessionManager/AuthorityBroker login boundary: IAP
authenticates the human to Google; capOS still mints its own UserSession and
projects only browser-safe view models. The browser never receives raw capOS
caps.
Certificate and Key Custody
| Concern | First proof | Successor (deferred) |
|---|---|---|
| TLS terminator | Google front end (load balancer) | capOS userspace TLS service |
| Certificate source | Google-managed certificate (Certificate Manager or classic managed cert), or an operator-supplied cert resource on the load balancer | ACME (AcmeClient + http-01/tls-alpn-01 solver) from Certificates and TLS |
| Private-key custody | Google-held; never in capOS or the harness | capOS PrivateKey cap sealed under a KeySource |
| Min TLS version / cipher policy | Load balancer SSL policy (TLS 1.2+ minimum; prefer the GCP MODERN/RESTRICTED profile) | capOS CipherPolicy (modern) |
The first proof must not write a private key into the disk image, the manifest, the cloudboot evidence directory, or any harness-staged object. A managed certificate keeps key material entirely on the provider side.
The successor must preserve the same no-export rule on the capOS side: the ACME
account key and TLS private key remain behind PrivateKey / KeyVault
authority and are not copied into cloudboot images, manifests, logs, or evidence
directories. Local ACME proofs use a local directory; public GCE/Let’s Encrypt
proofs require explicit run authorization, DNS-name control, public-ingress
teardown evidence, and staging-vs-production CA labeling.
Browser Session and Origin Policy
The self-served Web UI keeps the Gate 1B boundary: remote-session-web-ui is
the trusted backend that holds remote-session/CapSet state server-side, and
browser JavaScript receives only browser-safe view models. Public exposure adds
the following reviewed browser rules:
- Single public origin. UI assets and the same-origin JSON API are served from the one HTTPS origin (the load balancer hostname). No second origin, no wildcard CORS, no cross-origin credentialed requests. The service-side policy is implemented in remote-session-web-ui as a boot-manifest input: one public_origin.<host> marker cap (an inert Endpoint, granted after the service caps) fixes the accepted https://<host> origin at boot, validated fail-closed (second marker, malformed, loopback-named, or IP-literal-shaped host, or any unrecognized extra grant fails the boot), and consulted by the Host/Origin/Referer gates only for requests on the trusted forwarded-scheme HTTPS path, so a direct client can never claim the public origin. Browser-supplied principal/source hint headers (IAP assertions, authenticated-user hints) are rejected on the public-origin path before any backend-held capability dispatch, no CORS headers are emitted, and login ingress extends to the recorded GFE ranges only when a public origin is configured. Locally proven by make run-cloud-prod-remote-session-web-ui-l4 (in-process trusted-forwarder fixture positive plus cross-origin, mixed-scheme, wildcard, missing-origin, hostile-Referer, principal-hint, and real-ingress direct-client forged negatives); the proof claims no DNS name, load balancer, TLS endpoint, or live public exposure.
- Forwarded-scheme trust is firewall-bounded. Because the backend hop is plain HTTP, capOS derives the external scheme from the load balancer’s X-Forwarded-Proto/forwarding headers. It must trust those headers only from the Google front-end source ranges (enforced by the firewall below), and treat any such header from an unexpected source as absent (default to “not HTTPS”, fail closed on secure-context assumptions). The service-side trust gate is implemented in remote-session-web-ui (forwarded_scheme_peer_trusted / external_scheme_is_https, pinned to 130.211.0.0/22 and 35.191.0.0/16, fail-closed on unknown peer formats) and locally proven by make run-cloud-prod-remote-session-web-ui-l4: a real ingress client forging X-Forwarded-Proto: https keeps the non-Secure cookie posture, and a fixture simulating the recorded ranges is the only path that flips the session cookie to Secure. The local proof remains plaintext-loopback and claims no live load balancer or TLS endpoint.
- Session cookies. The session cookie is Secure, HttpOnly, and SameSite. The SameSite value is picked deterministically rather than mid-slice: Strict when no IAP front door is used, and Lax when IAP is enabled (the IAP sign-in redirect is a cross-site top-level navigation that would drop a Strict cookie on return). Secure is honored because the browser only ever sees the cookie over the load balancer’s HTTPS origin. The switch is implemented in remote-session-web-ui as a boot-manifest policy input: an IAP-fronted deployment manifest grants the inert iap_fronted_ingress marker cap (last in the web-ui grant list) to select Lax; without it the service emits Strict, and SameSite=None is never emitted. The posture applies uniformly to the session, CSRF, and logout/expiry clear-cookie headers, stays independent of the forwarded-scheme-derived Secure attribute, and is fixed at boot so no request header, cookie, or body field can select the weaker branch. Because a Lax cookie attaches on cross-site top-level GET navigations, the Lax posture additionally rejects authenticated GET views whose Fetch Metadata provenance (Sec-Fetch-Site) is cross-site – and cookie-bearing GETs with no Fetch Metadata at all, covering legacy browsers and webviews that attach Lax cookies without stating provenance – before any session state is touched; the gate is inert under Strict, where the cookie never attaches cross-site. make run-cloud-prod-remote-session-web-ui-l4 proves the default Strict posture end to end (including a real-ingress login forging IAP-shaped headers and body fields) and the Lax branch through the service’s in-process policy fixture; the live IAP-fronted deployment is future work.
- HSTS and redirect. The HTTPS edge sets Strict-Transport-Security with a conservative max-age (no preload, no includeSubDomains commitment for the first proof). Any port-80 listener is a 301 to HTTPS only.
- CSRF. State-changing JSON routes require a per-session anti-CSRF token and an Origin/Referer check against the known public origin; cross-origin or origin-absent state changes are rejected.
- Session lifetime and logout. Sessions carry a bounded idle timeout and an absolute lifetime. Logout drops the server-side session and clears the cookie; the existing self-served stale-session / logout failure-closed boundary (proven in the Gate 1B implementation gate) extends unchanged to the public endpoint. A stale or expired cookie yields no authority.
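The forwarded-scheme and cookie rules above can be sketched as follows. This is a minimal illustration, not the service's code: the two function names are borrowed from the text, but their signatures and bodies, the `in_cidr` helper, and `session_cookie_attributes` are assumptions made for this example. It only shows the intended shape: a forwarded header counts solely when the peer is inside the recorded ranges, and SameSite is chosen by a boot-time flag rather than by anything in the request.

```rust
use std::net::Ipv4Addr;

/// Google front-end / health-check source ranges named in the policy text.
const TRUSTED_FORWARDER_RANGES: [(Ipv4Addr, u8); 2] = [
    (Ipv4Addr::new(130, 211, 0, 0), 22),
    (Ipv4Addr::new(35, 191, 0, 0), 16),
];

/// True when `addr` falls inside `base/prefix_len` (hypothetical helper).
fn in_cidr(addr: Ipv4Addr, base: Ipv4Addr, prefix_len: u8) -> bool {
    let host_bits = 32u32.saturating_sub(u32::from(prefix_len));
    let mask = if host_bits >= 32 { 0 } else { u32::MAX << host_bits };
    (u32::from(addr) & mask) == (u32::from(base) & mask)
}

/// Peer gate: an unknown peer format is modelled as `None` and fails closed.
pub fn forwarded_scheme_peer_trusted(peer: Option<Ipv4Addr>) -> bool {
    match peer {
        Some(addr) => TRUSTED_FORWARDER_RANGES
            .iter()
            .any(|&(base, len)| in_cidr(addr, base, len)),
        None => false,
    }
}

/// The forwarded header is honoured only from a trusted peer; from anyone
/// else it is treated as absent, which means "not HTTPS".
pub fn external_scheme_is_https(peer: Option<Ipv4Addr>, x_forwarded_proto: Option<&str>) -> bool {
    forwarded_scheme_peer_trusted(peer) && x_forwarded_proto == Some("https")
}

/// Cookie attributes: SameSite follows the boot-time IAP marker only, and
/// Secure follows the derived external scheme only. SameSite=None never appears.
pub fn session_cookie_attributes(iap_fronted: bool, external_https: bool) -> String {
    let same_site = if iap_fronted { "Lax" } else { "Strict" };
    let secure = if external_https { "; Secure" } else { "" };
    format!("HttpOnly; SameSite={same_site}{secure}")
}
```

A direct client at, say, 203.0.113.9 that sends `X-Forwarded-Proto: https` gets `external_scheme_is_https(..) == false`, so the cookie stays non-Secure, which matches the forged-header negative described above.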
Firewall and Source-Range Policy
The instance keeps no public IP. Ingress to the capOS UI backend port is allowed
only from Google’s load-balancer and health-check ranges, never from
0.0.0.0/0:
| Allowed source | Purpose |
|---|---|
| 130.211.0.0/22, 35.191.0.0/16 | Google Front Ends and load-balancer health checks reaching the backend port |
| 35.235.240.0/20 | Identity-Aware Proxy (only if IAP fronting or IAP-tunneled SSH/diagnostics is used) |
No other ingress rule is created. The proof does not broaden the service
account, add API scopes beyond the LB/health-check need, open SSH to the public
internet, or attach a broad firewall tag. Egress stays default-deny-friendly:
the LB-terminated path needs no capOS outbound, and the future ACME path (which
would require egress 443 to the ACME directory) is explicitly out of scope
here.
Backend Health-Check Contract (Local Proof Landed)
The backend port is reachable only from the GFE/health-check ranges above, so
the load balancer’s health checker is the route’s only intended public caller.
The backend health contract, proven locally by
make run-cloud-prod-remote-session-web-ui-l4:
- Route: GET /healthz on the Web UI backend port, served by demos/remote-session-web-ui (HEALTH_BODY). The exact bounded response body is {"ok":true,"service":"remote-session-web-ui"} with Content-Type: application/json and Cache-Control: no-store; it carries no cap ids, session ids, user/profile names, endpoint handles, provider resource ids, host paths, or secret material.
- No authority: the route is unauthenticated and never creates, rotates, refreshes, or consumes a browser session; it never emits Set-Cookie, and a presented (even forged) session cookie changes nothing. The local proof drives a /healthz probe with live session cookies against an idle-expired session and asserts the next authenticated call still fails closed. It is the only unauthenticated public-ingress liveness exception; the Host/Origin/CSRF/session gates on authority-bearing routes are unchanged. (/api/health remains the bundled operator app’s same-origin page-load ping with the same no-authority posture; the provider health check never probes it.)
- Host-gate exemption: the health checker probes the backend by IP, so /healthz deliberately does not require the loopback/public-host Host allowlist that authority-bearing routes enforce.
- Fail-closed variants: non-GET methods and path variants (POST /healthz, /healthz/extra, /HEALTHZ) return 404 without reaching any authority-bearing handler.
- Availability under abuse: the slow-client phases of the L4 smoke prove a concurrent /healthz keeps completing while idle, partial-request, and drip-feed clients are held open, and after they are abandoned.
This is local backend readiness for the selected policy
(evidence-class=local-qemu), not a live GCE health check: no health-check
resource, load balancer, firewall rule, or public endpoint exists, and a
passing local contract proof authorizes none of them.
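The routing side of the contract above is small enough to sketch. The `Response` type and `handle_health_family` function are hypothetical names for this example, not the service's types; only the body string, the two headers, and the 404 behavior come from the text. The cookie argument is accepted and deliberately ignored to show that a presented session cookie changes nothing.

```rust
/// Exact bounded body quoted in the contract.
pub const HEALTH_BODY: &str = r#"{"ok":true,"service":"remote-session-web-ui"}"#;

/// Minimal response model (hypothetical; not the service's real type).
#[derive(Debug, PartialEq, Eq)]
pub struct Response {
    pub status: u16,
    pub headers: Vec<(&'static str, &'static str)>,
    pub body: &'static str,
}

/// Handle a request already identified as aimed at the health path family.
/// Only the exact method and path succeed; every variant is a 404 and no
/// authority-bearing handler is consulted.
pub fn handle_health_family(method: &str, path: &str, _cookie: Option<&str>) -> Response {
    if method == "GET" && path == "/healthz" {
        Response {
            status: 200,
            // No Set-Cookie is ever emitted on this route.
            headers: vec![
                ("Content-Type", "application/json"),
                ("Cache-Control", "no-store"),
            ],
            body: HEALTH_BODY,
        }
    } else {
        Response { status: 404, headers: Vec::new(), body: "" }
    }
}
```

Matching on the exact byte string rather than a normalized path is what makes `/HEALTHZ` and `/healthz/extra` fail closed in this sketch.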
Audit and Evidence Fields
The public proof records, before teardown, at least:
- selected ingress shape (https-load-balancer) and whether IAP was enabled;
- public endpoint (hostname and HTTPS virtual IP);
- TLS posture: terminator (google-frontend), certificate type (google-managed or operator-supplied), and the load balancer SSL-policy minimum TLS version;
- authentication method exercised (capOS SessionManager login, and Google IAM identity if IAP is enabled);
- firewall/forwarding scope: the named source ranges, backend port, and the URL-map/forwarding-rule chain created;
- HTTP-to-HTTPS redirect and HSTS header observation;
- teardown result for every resource the proof created.
Teardown Checklist
The existing harness deletes the instance, image, and staging tarball in an
EXIT INT TERM trap. The public proof extends that trap to delete, in
dependency order, every ingress resource it creates:
- global forwarding rule and target HTTPS proxy;
- URL map and any HTTP-to-HTTPS redirect URL map / target HTTP proxy;
- backend service and health check;
- zonal/serverless NEG or managed instance group backing the backend;
- managed certificate / certificate-map entry / SSL policy created for the run;
- the LB-scoped and (if used) IAP-scoped firewall rules;
- the reserved external IP address, if one was allocated for the LB;
- the instance, image, and staged tarball (existing harness behavior).
Teardown must be idempotent and must run on signal or partial failure, matching the existing orphan-sweep discipline. A run that cannot confirm deletion of an ingress resource is a failed run, not a passed one.
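The teardown discipline above (dependency order, idempotence, and deletion confirmation) can be illustrated with a small model. Everything named here is an assumption for the sketch: the `Provider` trait, the class names in `TEARDOWN_ORDER`, the `unmarked-resource` and `unknown-class` failure strings, and the `RecordingStub`. The real harness is a shell script driving provider CLIs; this only shows the control flow.

```rust
use std::collections::BTreeSet;

/// Hypothetical provider interface; a real harness would invoke the provider CLI.
pub trait Provider {
    /// Request deletion. Deleting an absent resource is not an error.
    fn delete(&mut self, class: &str, name: &str);
    /// Report whether the resource still exists.
    fn exists(&self, class: &str, name: &str) -> bool;
}

/// Illustrative class names in dependency order: front-end resources first.
pub const TEARDOWN_ORDER: [&str; 12] = [
    "forwarding-rule", "target-https-proxy", "url-map", "backend-service",
    "health-check", "neg", "certificate", "firewall-rule", "address",
    "instance", "image", "tarball",
];
pub const SWEEP_MARKER: &str = "capos-test-";

/// Delete every journaled (class, name) in dependency order and confirm it.
pub fn teardown<P: Provider>(provider: &mut P, journal: &[(String, String)]) -> Result<(), Vec<String>> {
    // Reject out-of-scope journal entries before any provider call.
    for (class, name) in journal {
        if !name.starts_with(SWEEP_MARKER) {
            return Err(vec!["unmarked-resource".to_string()]);
        }
        if !TEARDOWN_ORDER.iter().any(|known| *known == class.as_str()) {
            return Err(vec!["unknown-class".to_string()]);
        }
    }
    let mut failures = Vec::new();
    for class in TEARDOWN_ORDER {
        for (_, name) in journal.iter().filter(|(c, _)| c.as_str() == class) {
            provider.delete(class, name);
            // A deletion that cannot be confirmed fails the run.
            if provider.exists(class, name) {
                failures.push(format!("undeleted-{class}"));
            }
        }
    }
    if failures.is_empty() { Ok(()) } else { Err(failures) }
}

/// Recording stub: remembers call order and can simulate "delete claims
/// success but the resource persists" through `sticky`.
#[derive(Default)]
pub struct RecordingStub {
    pub live: BTreeSet<(String, String)>,
    pub sticky: BTreeSet<String>,
    pub deleted: Vec<String>,
}

impl Provider for RecordingStub {
    fn delete(&mut self, class: &str, name: &str) {
        self.deleted.push(class.to_string());
        if !self.sticky.contains(name) {
            self.live.remove(&(class.to_string(), name.to_string()));
        }
    }
    fn exists(&self, class: &str, name: &str) -> bool {
        self.live.contains(&(class.to_string(), name.to_string()))
    }
}
```

Running `teardown` twice over the same journal returns `Ok(())` both times in this model, since the second pass finds nothing left to confirm.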
Local Plan Gate (Landed)
The resource graph above is locally reviewable before any billable work:
tools/cloudboot/plan-public-webui-ingress.sh renders and validates the
selected plan shape with zero provider interaction, and
make cloudboot-public-webui-ingress-plan-check is the fixture gate proving
each rejected hazard (raw public HTTP to capOS, instance public IP,
0.0.0.0/0 backend ingress, missing /healthz health check, broad service
account/scopes, staged private-key material, non-provider certificate custody)
fails closed by structured class before any provider CLI could be invoked.
Output is stamped evidence-class=cloudboot-local-plan with
operator-exposure=not-proven; a plan pass is not public reachability, TLS
readiness, or authorization for the on-hold public proof. The command contract
and failure classes are documented in tools/cloudboot/README.md (“Public Web
UI ingress plan gate”).
Local Teardown Fixture Gate (Landed)
The teardown checklist above is locally proven before any billable work:
tools/cloudboot/teardown-public-webui-ingress.sh is the dependency-ordered,
idempotent, deletion-confirming teardown engine over a per-run
created-resources journal, and
make cloudboot-public-webui-teardown-fixture-check exercises it against
recording stub provider CLIs across complete, partial-create,
command-failure, delete-claims-success-but-persists, unreadable-state,
signal-trap, and orphan-sweep paths. Every checklist resource class is
modeled and the engine’s class list must equal the plan gate’s rendered
teardown-order= line (the fixture fails on drift), so a class added to the
selected plan cannot go missing from the cleanup graph. An unconfirmed
deletion is a blocking structured failure (undeleted-<class> /
resource-state-unknown), matching the failed-run policy above. All
public-ingress resource names must carry the capos-test- sweepable marker;
a journal naming anything else is rejected before any provider call, and the
orphan sweep enforces the marker client-side so out-of-scope resources are
never deleted. Output is stamped
evidence-class=cloudboot-local-teardown-fixture live-teardown=not-proven;
a fixture pass is local harness evidence only, never live provider teardown
evidence, and authorizes no public ingress. The journal grammar, sweep
contract, and failure classes are documented in tools/cloudboot/README.md
(“Public Web UI ingress teardown fixture gate”).
Local Evidence Fixture Gate (Landed)
The “Audit and Evidence Fields” contract above is locally proven before any
billable work: tools/cloudboot/validate-public-webui-evidence.sh validates
a harness-rendered public-proof closeout report against the selected
evidence grammar, and
make cloudboot-public-webui-proof-evidence-fixture-check is the fixture
gate proving accepted and rejected reports over stub inputs with zero
provider CLI invocations. Acceptance requires the recorded ingress shape,
public HTTPS hostname/VIP, provider TLS terminator and managed or
operator-supplied certificate resource, minimum TLS policy, IAP posture,
no-key-custody statement, no-public-IP instance posture, GFE/health-check
firewall scope, health-check, HTTP-to-HTTPS redirect and HSTS observations,
capOS SessionManager login observation, a public HTTPS probe record, the
correlated gce-public-self-hosted-webui-ingress-tls proof marker, and a
per-resource teardown record pinned to the plan gate’s teardown-order=
class list (the fixture fails on drift). Raw public HTTP, a direct
instance public IP, wildcard backend ingress, a missing health check,
missing HSTS/redirect observation, capOS or harness private-key custody,
stale/missing/incomplete teardown, a non-provider TLS terminator, and
private-proof-only evidence (a same-VPC or provider-internal probe path,
or a proof marker without a recorded HTTPS probe) each fail closed by
structured class. The tls terminator= label structurally separates this
provider-terminated evidence contract from the later capOS-terminated TLS
successor, so successor evidence can never pass through the first-proof
grammar. Output names field names, classes, and line numbers only; input
values are never echoed. Every pass is stamped
evidence-class=cloudboot-local-public-webui-evidence-fixture with
operator-exposure=not-proven: a fixture pass is local evidence-grammar
validation only, never public reachability or operator-access evidence,
and it does not authorize public exposure or move the live proof out of
cloud-gce-public-self-hosted-webui-ingress-tls.
The report grammar and failure classes are documented in
tools/cloudboot/README.md (“Public Web UI evidence-grammar fixture
gate”).
Local Provider Command Allowlist Gate (Landed)
The provider command boundary the future public proof may use is locally
proven before any billable work:
tools/cloudboot/check-public-webui-provider-commands.sh validates a
recorded provider-command transcript against the selected resource graph,
and make cloudboot-public-webui-provider-command-allowlist-check is the
fixture gate proving both directions over recording stub gcloud/gsutil
with zero live provider invocations. The allowlist permits only the
resource families the plan and teardown checklist name – forwarding rules,
target HTTPS/HTTP proxies, URL maps, backend services, health checks,
zonal NEGs, scoped firewall rules, managed-certificate resources, SSL
policies, reserved addresses, instance/image creation, and staged
tarball upload/delete – and requires the capos-test- marker on every
created resource, journal-pinned deletion (a delete must name a resource
the created-resources journal recorded), GFE/IAP-only firewall source
ranges, the capos-test filter on every listing, marker discipline on
create-wired references, per-surface create flags and parameters pinned to
the selected graph shape, an explicit pin of the documented sandbox project on
every command, and explicit --global/--zone scope on deletes (ambient
Cloud SDK project/region defaults are never trusted). Drift toward broader
provider authority fails closed
by structured class: IAM mutation, service-account/scopes changes, DNS
mutation, private-key upload, 0.0.0.0/0 backend ingress, unmarked
resources, deletion outside the journal (zone-pinned), project-wide or
filter-restating sweeps, ambient credential flags, project/network/region
scope overrides beyond the pinned sandbox forms, --flags-file
indirection, non-selected create parameters, shell/environment
inspection, and provider CLI resolution from an unexpected path. Rejected
command content is reported by class and line number only; credentials,
principals, key paths, and rejected names are never echoed. Output is
stamped evidence-class=cloudboot-local-provider-command-allowlist with
provider-mutation=none: a pass narrows what the future live proof may
execute, it is not live provider evidence and does not authorize the
on-hold public proof. The transcript grammar and failure classes are
documented in tools/cloudboot/README.md (“Public Web UI
provider-command allowlist gate”).
Phase 2: ACPI and Device Discovery
Goal: Parse ACPI tables to discover hardware topology, interrupt routing, and PCI root complexes. This replaces QEMU-specific hardcoded assumptions.
Why ACPI
On QEMU with default settings, you can hardcode PCI config access through I/O ports
0xCF8/0xCFC and assume legacy interrupt routing. On real cloud hardware:
- PCI root complex addresses come from ACPI MCFG table (PCIe ECAM)
- Interrupt routing comes from ACPI MADT (I/O APIC entries) and _PRT
- CPU topology comes from ACPI MADT (LAPIC entries)
- Timer info comes from the ACPI HPET table and the PM timer fields in the FADT
Limine provides the RSDP (Root System Description Pointer) address via its protocol. From there, the kernel can walk RSDT/XSDT to find specific tables.
Required Tables
| Table | Purpose | Priority |
|---|---|---|
| MADT | LAPIC and I/O APIC addresses, CPU enumeration | High (Phase 2) |
| MCFG | PCIe Enhanced Configuration Access Mechanism base | High (Phase 2) |
| HPET | High Precision Event Timer address | Medium (fallback timer) |
| FADT | PM timer, shutdown/reset methods | Low (future) |
Landed Discovery Slice
The first landed slices are bounded diagnostics plus reusable config access.
The ACPI parser requests
Limine’s RSDP, validates RSDP/RSDT/XSDT/static-table lengths and checksums
within fixed caps, emits serial summaries for RSDT/XSDT table count and
MADT/MCFG presence, reports MADT LAPIC/I/O APIC/interrupt-source-override
inputs, and reports MCFG ECAM allocation records when firmware provides the
table. The PCI layer now keeps the existing legacy I/O-port backend and adds an
ECAM backend selected from MCFG allocations; devices retain their discovery
backend so config reads, writes, capability walking, and BAR sizing use the
same access path. The PCI layer also exposes a shared memory-BAR subregion
validator/mapper, and the virtio-net transport uses it for modern capability
regions. It also reports MSI/MSI-X capability metadata for the virtio-net
function and uses kernel-owned config/RX/TX source records with a bounded
first-fit LAPIC device MSI vector pool plus lock-free dispatch slots for QEMU
virtio-net MSI-X table programming, virtio vector assignment, driver-owned
route unmask, claimed-route lifecycle/reassignment proof, and TX delivery
proof. The x86 setup
maps MADT I/O APICs and programs masked legacy IRQ routes from MADT source
overrides before higher-level drivers can depend on interrupt routing. The Q35
smoke asserts the ECAM inventory lines, a
pci: config backend=ecam enumerated ... proof line, and representative masked
I/O APIC route lines; the net smoke asserts virtio-net BAR, capability, MSI-X
metadata, source-route records, route unmask records, vector programming,
queue assignment, descriptor guards, ARP, and ICMP fixture lines before
MMIO transport mapping completes. This path does not interpret AML, provide
userspace driver authorities, or provide full unbounded bus discovery yet.
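The bounded length-and-checksum validation described above follows a standard ACPI rule: every byte of a table, including its checksum byte, sums to zero modulo 256, and the common table header is 36 bytes with a little-endian length at offset 4. A minimal sketch over an in-memory byte slice follows; the `MAX_TABLE_LEN` cap and both function names are assumptions for illustration, not the landed parser's values.

```rust
/// Upper bound on an accepted table length (hypothetical cap for this sketch).
pub const MAX_TABLE_LEN: usize = 64 * 1024;
/// Size of the common ACPI system description table header.
pub const SDT_HEADER_LEN: usize = 36;

/// ACPI checksum rule: all bytes of the structure sum to zero modulo 256.
pub fn checksum_ok(bytes: &[u8]) -> bool {
    bytes.iter().fold(0u8, |acc, b| acc.wrapping_add(*b)) == 0
}

/// Validate a table starting at `bytes[0]`. Returns its 4-byte signature and
/// declared length, or `None` if the header is truncated, the length is out of
/// bounds, the slice is shorter than the declared length, or the checksum fails.
pub fn validate_sdt(bytes: &[u8]) -> Option<([u8; 4], usize)> {
    let header = bytes.get(..SDT_HEADER_LEN)?;
    let signature = [header[0], header[1], header[2], header[3]];
    let length = u32::from_le_bytes([header[4], header[5], header[6], header[7]]) as usize;
    if length < SDT_HEADER_LEN || length > MAX_TABLE_LEN {
        return None;
    }
    // Never read past what the caller mapped, even if the table claims more.
    let table = bytes.get(..length)?;
    checksum_ok(table).then_some((signature, length))
}
```

Checking the declared length against both a fixed cap and the mapped slice before summing keeps a corrupt length field from driving an unbounded read.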
Implementation
// kernel/src/acpi.rs

/// Minimal ACPI table parser.
/// Walks RSDP -> XSDT -> individual tables.
/// Does NOT implement AML interpretation -- static tables only.
pub struct AcpiInfo {
    pub lapics: Vec<LapicEntry>,
    pub io_apics: Vec<IoApicEntry>,
    pub iso_overrides: Vec<InterruptSourceOverride>,
    pub mcfg_base: Option<u64>, // PCIe ECAM base address
    pub hpet_base: Option<u64>,
}

pub fn parse_acpi(rsdp_addr: u64, hhdm: u64) -> AcpiInfo { ... }
For the fuller static-table subsystem, prefer the acpi crate (or an
equivalent maintained no_std parser) rather than expanding the diagnostic
parser into a general hand-written ACPI stack. The landed parser is a boot-time
inventory proof for RSDP/RSDT/MADT/MCFG summaries; it can be retired or
narrowed once the crate-backed table model fits capOS mapping and table
lifetime constraints.
Limine RSDP
use limine::request::RsdpRequest;

static RSDP: RsdpRequest = RsdpRequest::new();

// In kmain:
let rsdp_addr = RSDP.response().expect("no RSDP").address as u64;
let acpi_info = acpi::parse_acpi(rsdp_addr, hhdm_offset);
Crate Dependencies
| Crate | Purpose | no_std |
|---|---|---|
| acpi | Planned fuller/static ACPI table parsing (MADT, MCFG, HPET, FADT, etc.) | yes |
Scope
The landed diagnostic slice is kernel-local bounded read-only parsing for serial inventory. Fuller handling should be mostly glue around a maintained static-table parser plus capOS mapping, lifetime, and authority types.
Phase 3: Interrupt Infrastructure
Goal: Set up I/O APIC for device interrupt routing and MSI/MSI-X for modern PCI devices. This replaces the implicit legacy PIC setup.
I/O APIC
The I/O APIC routes external device interrupts (keyboard, serial, PCI devices) to specific LAPIC entries (CPUs). Its address and configuration come from the ACPI MADT (Phase 2).
// kernel/src/arch/x86_64/ioapic.rs

pub struct IoApic {
    base: *mut u32, // MMIO registers via HHDM
}

impl IoApic {
    /// Route an IRQ to a specific LAPIC/vector.
    pub fn route_irq(&mut self, irq: u8, lapic_id: u8, vector: u8) { ... }

    /// Mask/unmask an IRQ line.
    pub fn set_mask(&mut self, irq: u8, masked: bool) { ... }
}
The current x86 implementation maps MADT I/O APIC MMIO, reads each controller’s ID/version/redirection count, and programs legacy IRQ 0-15 routes to LAPIC vectors while keeping the redirection entries masked. It respects Interrupt Source Override entries from MADT (for example, Q35 remaps IRQ 0 to GSI 2). Driver-owned unmask policy, dispatch, and EOI handling remain planned.
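To make the masked-route programming concrete, here is a sketch of the pure encoding step, separated from MMIO so it can be checked on a host. It assumes the standard I/O APIC layout: redirection entry n occupies register indices 0x10 + 2n (low dword) and 0x11 + 2n (high dword), with the vector in bits 0-7, polarity in bit 13, trigger mode in bit 15, the mask in bit 16, and the physical destination LAPIC ID in the top byte of the high dword. The type and function names are hypothetical, and the MADT flag decoding is reduced to two booleans.

```rust
/// MADT interrupt source override, reduced to the fields this sketch needs.
#[derive(Clone, Copy, Debug, PartialEq, Eq)]
pub struct SourceOverride {
    pub irq: u8,  // legacy ISA IRQ number
    pub gsi: u32, // global system interrupt it is wired to
    pub active_low: bool,
    pub level_triggered: bool,
}

/// Resolve a legacy IRQ to (gsi, active_low, level_triggered).
/// Without an override, ISA IRQs are identity-mapped, active-high, edge-triggered.
pub fn resolve_legacy_irq(irq: u8, overrides: &[SourceOverride]) -> (u32, bool, bool) {
    overrides
        .iter()
        .find(|o| o.irq == irq)
        .map(|o| (o.gsi, o.active_low, o.level_triggered))
        .unwrap_or((u32::from(irq), false, false))
}

/// Register indices (written to IOREGSEL) for the low/high halves of entry `pin`.
pub fn redirection_registers(pin: u8) -> (u32, u32) {
    let low = 0x10 + 2 * u32::from(pin);
    (low, low + 1)
}

/// Encode a fixed-delivery, physical-destination redirection entry as (low, high).
pub fn encode_redirection(vector: u8, lapic_id: u8, active_low: bool, level: bool, masked: bool) -> (u32, u32) {
    let mut low = u32::from(vector); // bits 0-7; delivery mode 000 = fixed
    if active_low {
        low |= 1 << 13; // polarity
    }
    if level {
        low |= 1 << 15; // trigger mode
    }
    if masked {
        low |= 1 << 16; // keep the line masked until a driver claims it
    }
    let high = u32::from(lapic_id) << 24; // bits 56-63 of the 64-bit entry
    (low, high)
}
```

With the Q35 override mentioned above, `resolve_legacy_irq(0, ..)` yields GSI 2, so the timer route lands in registers 0x14/0x15 rather than 0x10/0x11.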
MSI/MSI-X
Modern PCI/PCIe devices (NVMe, cloud NICs) use Message Signaled Interrupts instead of pin-based IRQs routed through the I/O APIC. An MSI/MSI-X interrupt is a memory write from the device to the LAPIC message address window (0xFEE00000 on x86), bypassing the I/O APIC entirely.
This is critical for cloud deployment because:
- NVMe controllers effectively require MSI or MSI-X (many controllers offer no legacy IRQ fallback)
- Cloud NICs (ENA, gVNIC) use MSI-X exclusively
- MSI-X supports per-queue interrupts (one vector per virtqueue/submission queue), enabling better SMP scalability
// kernel/src/pci/msi.rs

/// Configure MSI for a PCI device.
pub fn enable_msi(device: &PciDevice, vector: u8, lapic_id: u8) { ... }

/// Configure MSI-X for a PCI device.
pub fn enable_msix(
    device: &PciDevice,
    table_bar: u8,
    entries: &[(u16, u8, u8)], // (index, vector, lapic_id)
) { ... }
MSI/MSI-X capability structures are found by walking the PCI capability list (already needed for PCI enumeration in the networking proposal). The current PCI path reports MSI/MSI-X capability metadata for virtio-net so diagnostics can see the advertised table and pending-bit-array layout. The virtio-net QEMU smoke now records kernel-owned config/RX/TX MSI-X sources, publishes them into the device interrupt dispatch table, allocates LAPIC vectors from the bounded device MSI vector pool to program their table entries and virtio vector registers, lets the in-kernel virtio-net owner unmask only those routes, then proves TX delivery by observing that source’s dispatch counter advance after maskable interrupts are live. The same smoke uses an unused masked MSI-X table entry to prove claimed-route reassignment, stale old-route rejection, old-vector unregistered delivery, reassigned-vector masked delivery, unsupported-vector delivery, and release. Broader driver dispatch and userspace interrupt authority remain planned.
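As a worked example of what "programming a table entry" means, the sketch below composes the x86 MSI message and lays it out as one 16-byte MSI-X table entry. It assumes the architectural xAPIC format: the message address is 0xFEE00000 with the destination LAPIC ID in bits 12-19, and the data word carries the vector in bits 0-7 with fixed delivery and edge trigger encoded as zeros. Each MSI-X entry is four dwords (address low, address high, data, vector control), and bit 0 of vector control is the per-vector mask. The names are hypothetical and do not mirror the typed helper in the kernel.

```rust
/// Base of the x86 MSI message address window.
pub const MSI_ADDRESS_BASE: u64 = 0xFEE0_0000;
/// Size in bytes of one MSI-X table entry.
pub const MSIX_ENTRY_SIZE: usize = 16;

/// Compose (address, data) for a fixed-delivery, edge-triggered message aimed
/// at one LAPIC in physical destination mode.
pub fn msi_message(lapic_id: u8, vector: u8) -> (u64, u32) {
    let address = MSI_ADDRESS_BASE | (u64::from(lapic_id) << 12);
    let data = u32::from(vector); // bits 8-10 = 000 (fixed), bit 15 = 0 (edge)
    (address, data)
}

/// One MSI-X table entry as the four dwords the device reads.
#[derive(Debug, PartialEq, Eq)]
pub struct MsixEntry {
    pub addr_low: u32,
    pub addr_high: u32,
    pub data: u32,
    pub vector_control: u32,
}

/// Build an entry; `masked` sets the per-vector mask bit.
pub fn msix_entry(lapic_id: u8, vector: u8, masked: bool) -> MsixEntry {
    let (address, data) = msi_message(lapic_id, vector);
    MsixEntry {
        addr_low: address as u32, // low 32 bits of the message address
        addr_high: (address >> 32) as u32,
        data,
        vector_control: u32::from(masked),
    }
}

/// Byte offset of entry `index` from the start of the MSI-X table.
pub fn msix_entry_offset(index: u16) -> usize {
    usize::from(index) * MSIX_ENTRY_SIZE
}
```

Writing an entry with `masked = true` first and clearing the bit only after the owner claims the route mirrors the driver-owned unmask order described above.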
Integration with SMP
LAPIC initialization is shared with the SMP proposal. The active x86 path uses xAPIC MMIO for the immediate QEMU/KVM timer and IPI foundation, with PIT/PIC fallback. This cloud phase consumes that architectural LAPIC path for local interrupt delivery and now adds masked ACPI MADT I/O APIC routing plus MSI/MSI-X capability metadata discovery and a bounded virtio-net MSI-X dispatch/lifecycle proof; userspace device interrupts remain planned.
KVM/QEMU paravirtual features such as PV EOI, PV IPI, and PV TLB flush are host-specific accelerations. They are useful later for cloud performance, but cloud boot correctness should use the architectural LAPIC path first. x2APIC is a later backend for newer/high-core systems and firmware states where xAPIC is unavailable or undesirable; it is not a blocker for the current LAPIC path.
Scope
~300-400 lines total:
- I/O APIC driver: ~150 lines
- MSI/MSI-X setup: ~100-150 lines
- Integration/routing logic: ~50-100 lines
Phase 4: PCI/PCIe Infrastructure
Goal: Standalone PCI bus enumeration and device management, usable by all device drivers (virtio-net, NVMe, cloud NICs).
The networking proposal includes PCI enumeration as a substep for finding virtio-net. This phase promotes it to a reusable kernel subsystem that all device drivers build on.
PCI Configuration Access
Two mechanisms, determined by ACPI:
- Legacy I/O ports (0xCF8/0xCFC) – works in QEMU, limited to 256 bytes of config space per function. Insufficient for PCIe extended capabilities.
- PCIe ECAM (Enhanced Configuration Access Mechanism) – memory-mapped config space, 4 KB per function. Base address from ACPI MCFG table. Required for MSI-X capability parsing and NVMe BAR discovery on real hardware.
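The two access mechanisms differ mainly in how a bus/device/function/offset tuple becomes an address. A sketch of both computations follows, using the standard encodings: the legacy mechanism writes an enable bit plus bus, device, function, and a dword-aligned offset to port 0xCF8, while ECAM gives each function a 4 KB window at a fixed stride from the MCFG base. `EcamAllocation` and the function names are illustrative and are not the kernel's backend types.

```rust
/// Value written to port 0xCF8 to select a dword of legacy config space.
/// Returns `None` for an out-of-range device or function number.
pub fn legacy_config_address(bus: u8, device: u8, function: u8, offset: u8) -> Option<u32> {
    if device >= 32 || function >= 8 {
        return None;
    }
    Some(
        0x8000_0000 // enable bit
            | (u32::from(bus) << 16)
            | (u32::from(device) << 11)
            | (u32::from(function) << 8)
            | u32::from(offset & 0xFC), // dword-aligned register offset
    )
}

/// One MCFG allocation record (hypothetical mirror of the ACPI structure).
pub struct EcamAllocation {
    pub base: u64,
    pub segment: u16,
    pub start_bus: u8,
    pub end_bus: u8,
}

/// Physical address of `offset` in a function's 4 KB ECAM window, or `None`
/// if the bus is outside the allocation or any field is out of range.
pub fn ecam_address(alloc: &EcamAllocation, bus: u8, device: u8, function: u8, offset: u16) -> Option<u64> {
    if bus < alloc.start_bus || bus > alloc.end_bus || device >= 32 || function >= 8 || offset >= 4096 {
        return None;
    }
    let relative = (u64::from(bus - alloc.start_bus) << 20)
        | (u64::from(device) << 15)
        | (u64::from(function) << 12)
        | u64::from(offset);
    alloc.base.checked_add(relative)
}
```

The 8-bit offset in the legacy form is where the 256-byte limit comes from; the 12-bit offset in ECAM is what exposes the extended capability space.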
Legacy I/O and Q35 ECAM config access exist today behind the same early PCI
backend abstraction. The PCI layer also validates memory BAR subregions with
checked offset/length/alignment bounds and maps selected subregions through the
kernel MMIO window for in-kernel drivers, and it records non-programming
MSI/MSI-X metadata for the current virtio-net path by walking the standard PCI
capability list. The virtio-net path now selects a usable MSI-X capability and
programs config/RX/TX table entries through the typed PCI MSI-X table helper
using the kernel-owned source records and bounded first-fit LAPIC device MSI
vectors. The QEMU net smoke lets the in-kernel virtio-net owner claim and
unmask those routes, assigns the virtio common and queue MSI-X vector
registers, and proves TX delivery by observing that source’s dispatch counter
advance after the TX completion path has run and maskable interrupts are live.
It also proves claimed-route reassignment and release with an unused masked
MSI-X table entry. The next steps are using that path for full bus discovery,
userspace DeviceMmio authority, broader driver dispatch, and driver binding.
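For orientation, the contents of one MSI-X table entry targeting a LAPIC vector can be sketched as follows. `MsixEntry`, `msix_entry`, and `msix_entry_offset` are hypothetical names and do not describe the typed PCI MSI-X table helper mentioned above; the sketch assumes xAPIC physical destination mode with fixed, edge-triggered delivery.

```rust
/// Base of the x86 MSI address window.
pub const MSI_ADDRESS_BASE: u32 = 0xFEE0_0000;

/// One 16-byte MSI-X table entry, in register order.
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub struct MsixEntry {
    pub addr_lo: u32,
    pub addr_hi: u32,
    pub data: u32,
    pub vector_control: u32,
}

/// Encode an entry that delivers `vector` to the LAPIC with `dest_apic_id`.
/// Vectors below 32 are reserved for CPU exceptions and are rejected.
pub fn msix_entry(dest_apic_id: u8, vector: u8, masked: bool) -> Option<MsixEntry> {
    if vector < 32 {
        return None;
    }
    Some(MsixEntry {
        addr_lo: MSI_ADDRESS_BASE | (dest_apic_id as u32) << 12, // bits 19:12: destination ID
        addr_hi: 0,
        data: vector as u32,           // fixed delivery, edge-triggered
        vector_control: masked as u32, // bit 0: per-vector mask
    })
}

/// Byte offset of entry `index` from the start of the BAR holding the table.
pub fn msix_entry_offset(table_offset: u32, index: u16) -> u64 {
    table_offset as u64 + 16 * index as u64
}
```

A driver would normally write the entry while it is masked and clear the mask bit only after the route has been claimed.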
Device Enumeration
// kernel/src/pci.rs
pub struct PciDevice {
    pub bus: u8,
    pub device: u8,
    pub function: u8,
    pub vendor_id: u16,
    pub device_id: u16,
    pub class: u8,
    pub subclass: u8,
    pub bars: [Option<Bar>; 6],
    pub interrupt_pin: u8,
    pub interrupt_line: u8,
}

pub enum Bar {
    Memory {
        base: u64,
        size: u64,
        prefetchable: bool,
        width: MemoryBarWidth,
    },
    Io { base: u32, size: u32 },
}

/// Scan all PCI buses and return discovered devices.
pub fn enumerate() -> Vec<PciDevice> { ... }

/// Find a device by vendor/device ID.
pub fn find_device(vendor: u16, device: u16) -> Option<PciDevice> { ... }

/// Walk the PCI capability list for a device.
pub fn capabilities(device: &PciDevice) -> Vec<PciCapability> { ... }
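One way the `bars` field could be populated is the standard sizing probe: save the BAR, write all ones, read back the mask, and restore the original value. The sketch below covers only the decoding step and takes the original and read-back register values as plain integers. `decode_bar` and the `MemoryBarWidth` variants are assumptions made for illustration; the `Bar` shape mirrors the struct above.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum MemoryBarWidth {
    Bits32,
    Bits64,
}

#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum Bar {
    Memory { base: u64, size: u64, prefetchable: bool, width: MemoryBarWidth },
    Io { base: u32, size: u32 },
}

/// Decode one BAR from its original value and the value read back after
/// writing all ones. `orig_hi` and `probe_hi` come from the next BAR slot
/// and are only used for 64-bit memory BARs. Returns None for an
/// unimplemented BAR or a reserved type encoding.
pub fn decode_bar(orig_lo: u32, probe_lo: u32, orig_hi: u32, probe_hi: u32) -> Option<Bar> {
    if (orig_lo & 1) == 1 {
        // I/O BAR: bits 1:0 are flags, the rest is the address mask.
        let mut mask = probe_lo & !0x3;
        if mask == 0 {
            return None;
        }
        if mask >> 16 == 0 {
            // Device decodes only 16 address bits; treat the upper half as fixed.
            mask |= 0xFFFF_0000;
        }
        return Some(Bar::Io { base: orig_lo & !0x3, size: (!mask).wrapping_add(1) });
    }
    let prefetchable = (orig_lo & 0x8) != 0; // bit 3
    match (orig_lo >> 1) & 0x3 {
        0b00 => {
            let mask = probe_lo & !0xF;
            if mask == 0 {
                return None;
            }
            Some(Bar::Memory {
                base: (orig_lo & !0xF) as u64,
                size: (!mask).wrapping_add(1) as u64,
                prefetchable,
                width: MemoryBarWidth::Bits32,
            })
        }
        0b10 => {
            let mask = ((probe_hi as u64) << 32) | (probe_lo & !0xF) as u64;
            if mask == 0 {
                return None;
            }
            Some(Bar::Memory {
                base: ((orig_hi as u64) << 32) | (orig_lo & !0xF) as u64,
                size: (!mask).wrapping_add(1),
                prefetchable,
                width: MemoryBarWidth::Bits64,
            })
        }
        _ => None,
    }
}
```

A caller that sees `Bits64` would skip the following BAR slot, since it holds the upper half of the same BAR.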
BAR Mapping
Device drivers need MMIO access to BAR regions. The kernel now maps validated
memory-BAR subregions into its bounded MMIO virtual window for in-kernel
drivers. A future DeviceMmio capability will carry equivalent authority to
userspace drivers as described in the networking proposal.
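The checked offset/length/alignment validation could look roughly like the sketch below. `checked_subregion` and `SubregionError` are hypothetical names, not the shared kernel helper itself; the point is that every addition is overflow-checked before any mapping is attempted.

```rust
#[derive(Debug, PartialEq, Eq)]
pub enum SubregionError {
    ZeroLength,
    Misaligned,
    Overflow,
    OutOfBounds,
}

/// Validate a subregion of a memory BAR and return its physical
/// address range as (start, end), with `end` exclusive.
pub fn checked_subregion(
    bar_base: u64,
    bar_size: u64,
    offset: u64,
    len: u64,
    align: u64,
) -> Result<(u64, u64), SubregionError> {
    if len == 0 {
        return Err(SubregionError::ZeroLength);
    }
    // The power-of-two check also rejects align == 0 before the modulo runs.
    if !align.is_power_of_two() || offset % align != 0 {
        return Err(SubregionError::Misaligned);
    }
    let end_offset = offset.checked_add(len).ok_or(SubregionError::Overflow)?;
    if end_offset > bar_size {
        return Err(SubregionError::OutOfBounds);
    }
    let start = bar_base.checked_add(offset).ok_or(SubregionError::Overflow)?;
    let end = bar_base.checked_add(end_offset).ok_or(SubregionError::Overflow)?;
    Ok((start, end))
}
```

For example, a virtio common configuration structure at offset 0x1000 of a 16 KB BAR passes, while a region that runs past the end of the BAR is rejected.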
PCI Device IDs for Cloud Hardware
| Device | Vendor:Device | Cloud |
|---|---|---|
| virtio-net | 1AF4:1000 (transitional) or 1AF4:1041 (modern) | QEMU, supported first/second-generation GCP machine families |
| virtio-blk | 1AF4:1001 (transitional) or 1AF4:1042 (modern) | QEMU |
| NVMe | 8086:various, 144D:various, etc. | All clouds (EBS, PD, Managed Disk) |
| AWS ENA | 1D0F:EC20 / 1D0F:EC21 | AWS |
| GCP gVNIC | 1AE0:0042 | GCP |
| Azure MANA | 1414:00BA | Azure |
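The table above could drive driver selection along these lines. This is a sketch under assumptions: `DriverKind` and `driver_for` are hypothetical, the vendor/device pairs are copied from the table, and NVMe is matched by class code (mass storage 0x01, NVM subclass 0x08, programming interface 0x02) because its vendor and device IDs vary.

```rust
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
pub enum DriverKind {
    VirtioNet,
    VirtioBlk,
    Nvme,
    Ena,
    Gvnic,
    Mana,
}

/// Pick a driver for a PCI function, or None if nothing matches.
pub fn driver_for(vendor: u16, device: u16, class: u8, subclass: u8, prog_if: u8) -> Option<DriverKind> {
    match (vendor, device) {
        (0x1AF4, 0x1000) | (0x1AF4, 0x1041) => return Some(DriverKind::VirtioNet),
        (0x1AF4, 0x1001) | (0x1AF4, 0x1042) => return Some(DriverKind::VirtioBlk),
        (0x1D0F, 0xEC20) | (0x1D0F, 0xEC21) => return Some(DriverKind::Ena),
        (0x1AE0, 0x0042) => return Some(DriverKind::Gvnic),
        (0x1414, 0x00BA) => return Some(DriverKind::Mana),
        _ => {}
    }
    // NVMe controllers come from many vendors, so match on class code instead.
    if (class, subclass, prog_if) == (0x01, 0x08, 0x02) {
        Some(DriverKind::Nvme)
    } else {
        None
    }
}
```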
Scope
~400-500 lines:
- Config space access (I/O + ECAM): ~100 lines
- Bus enumeration: ~150 lines
- BAR parsing and mapping: ~100 lines
- Capability list walking: ~50-100 lines
Phase 5: NVMe Driver
Goal: Basic NVMe block device driver, sufficient to read/write sectors. This is the storage equivalent of virtio-net for networking – the first real storage driver.
Why NVMe Over virtio-blk
The storage-and-naming proposal mentions virtio-blk for Phase 3 (persistent store). On cloud VMs, all three providers expose NVMe:
- AWS EBS – NVMe interface (even for gp3/io2 volumes)
- GCP Persistent Disk – NVMe or SCSI (NVMe is default for newer VMs)
- Azure Managed Disks – SCSI on many older VM families such as D/Ev5 or Fv2 and older; NVMe on Azure Boost and newer NVMe-capable families such as Ebsv5 and Da/Ea/Fav6 and newer
virtio-blk is QEMU-only. An NVMe driver unlocks persistent storage on all
cloud platforms where the selected VM shape exposes NVMe. For QEMU testing,
QEMU also emulates NVMe well:
-drive file=disk.img,if=none,id=d0 -device nvme,drive=d0,serial=capos0.
NVMe Architecture
NVMe is a register-level standard with well-defined queue-pair semantics:
Application
    |
    v
Submission Queue (SQ)  -- ring buffer of 64-byte command entries
    |
    | doorbell write (MMIO)
    v
NVMe Controller (hardware)
    |
    | DMA completion
    v
Completion Queue (CQ)  -- ring buffer of 16-byte completion entries
    |
    | MSI-X interrupt
    v
Driver processes completions
Minimum viable driver needs:
- Admin Queue Pair (for identify, create I/O queues)
- One I/O Queue Pair (for read/write commands)
- MSI-X for completion notification (or polling)
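The queue-pair mechanics behind that list come down to a few small calculations. The sketch below shows the doorbell register offsets derived from the controller's CAP.DSTRD field, ring index wrap-around, and the phase tag check used when polling the completion queue. All function names are hypothetical; `cap` is the raw 64-bit CAP register value and `dw3` is the last dword of a 16-byte completion entry.

```rust
/// Doorbell stride in bytes: 4 << CAP.DSTRD (CAP bits 35:32).
pub fn doorbell_stride(cap: u64) -> u64 {
    4u64 << ((cap >> 32) & 0xF)
}

/// BAR0 offset of the submission queue tail doorbell for queue `qid`.
pub fn sq_tail_doorbell(cap: u64, qid: u16) -> u64 {
    0x1000 + (2 * qid as u64) * doorbell_stride(cap)
}

/// BAR0 offset of the completion queue head doorbell for queue `qid`.
pub fn cq_head_doorbell(cap: u64, qid: u16) -> u64 {
    0x1000 + (2 * qid as u64 + 1) * doorbell_stride(cap)
}

/// Advance a ring index, wrapping at the queue depth.
/// Assumes index < depth, which holds for any valid ring position.
pub fn next_index(index: u16, depth: u16) -> u16 {
    if index + 1 == depth { 0 } else { index + 1 }
}

/// A completion entry is new when its phase tag (bit 16 of dword 3)
/// matches the phase the driver currently expects. The driver flips
/// its expected phase each time the head index wraps.
pub fn completion_is_new(dw3: u32, expected_phase: bool) -> bool {
    (((dw3 >> 16) & 1) == 1) == expected_phase
}
```

Queue 0 is the admin queue pair, so the first I/O queue pair uses `qid` 1. The phase tag is what makes the polling alternative to MSI-X workable without a separate valid flag.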
Implementation Sketch
// kernel/src/nvme.rs (or kernel/src/drivers/nvme.rs)
pub struct NvmeController {
    bar0: *mut u8, // MMIO registers
    admin_sq: SubmissionQueue,
    admin_cq: CompletionQueue,
    io_sq: SubmissionQueue,
    io_cq: CompletionQueue,
    namespace_id: u32,
    block_size: u32,
    block_count: u64,
}

impl NvmeController {
    pub fn init(pci_device: &PciDevice) -> Result<Self, NvmeError> { ... }
    pub fn read(&self, lba: u64, count: u16, buf: &mut [u8]) -> Result<(), NvmeError> { ... }
    pub fn write(&self, lba: u64, count: u16, buf: &[u8]) -> Result<(), NvmeError> { ... }
    pub fn identify(&self) -> NvmeIdentify { ... }
}
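To make the `read` and `write` entry points more concrete, here is a sketch of how a 64-byte NVM read or write command could be laid out before it is copied into the submission queue. `io_command` is a hypothetical helper; the opcodes and dword positions follow the NVM command set, and the block count is converted to the zero-based value the command expects.

```rust
pub const OPC_WRITE: u8 = 0x01;
pub const OPC_READ: u8 = 0x02;

/// Build one 64-byte submission queue entry as 16 little-endian dwords.
/// Returns None for a zero block count, which cannot be encoded.
pub fn io_command(
    opcode: u8,
    cid: u16,
    nsid: u32,
    lba: u64,
    block_count: u16,
    prp1: u64,
    prp2: u64,
) -> Option<[u32; 16]> {
    if block_count == 0 {
        return None;
    }
    let mut sqe = [0u32; 16];
    sqe[0] = opcode as u32 | (cid as u32) << 16; // opcode, command identifier
    sqe[1] = nsid;                               // namespace ID
    sqe[6] = prp1 as u32;                        // data pointer: PRP1
    sqe[7] = (prp1 >> 32) as u32;
    sqe[8] = prp2 as u32;                        // data pointer: PRP2
    sqe[9] = (prp2 >> 32) as u32;
    sqe[10] = lba as u32;                        // starting LBA, low half
    sqe[11] = (lba >> 32) as u32;                // starting LBA, high half
    sqe[12] = (block_count - 1) as u32;          // number of blocks, zero-based
    Some(sqe)
}
```

The command identifier is echoed back in the completion entry, which is how the driver pairs a completion with the request that caused it.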
DMA Considerations
NVMe uses DMA for data transfer. The controller reads/writes directly from physical memory addresses provided in commands. Requirements:
- Buffers must be physically contiguous (or use PRP lists / SGLs for scatter-gather)
- Physical addresses must be provided (not virtual)
- Cache coherence is handled by hardware on x86_64 (DMA-coherent architecture)
The existing frame allocator can provide physically contiguous pages. For larger transfers, PRP (Physical Region Page) lists allow scatter-gather.
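The PRP rules can be sketched for the simple case of a physically contiguous buffer. This assumes a 4 KB controller memory page size and uses `Vec` for brevity, which in the kernel would need `alloc` or a preallocated list page; `Prp` and `plan_prp` are hypothetical names.

```rust
pub const PAGE_SIZE: u64 = 4096;

#[derive(Debug, PartialEq, Eq)]
pub enum Prp {
    /// The transfer stays inside the page that PRP1 points into.
    Single { prp1: u64 },
    /// The transfer touches two pages; PRP2 is the second page address.
    Double { prp1: u64, prp2: u64 },
    /// More than two pages; PRP2 must point at a PRP list holding `entries`.
    List { prp1: u64, entries: Vec<u64> },
}

/// Plan the PRP fields for a physically contiguous buffer.
/// PRP1 may carry an offset into its page but must be dword-aligned.
/// Every later entry is page-aligned.
pub fn plan_prp(phys: u64, len: u64) -> Option<Prp> {
    if len == 0 || phys % 4 != 0 {
        return None;
    }
    let end = phys.checked_add(len)?;
    let first_page = phys & !(PAGE_SIZE - 1);
    // Number of pages touched by the byte range [phys, end).
    let pages = (end - 1) / PAGE_SIZE - phys / PAGE_SIZE + 1;
    Some(match pages {
        1 => Prp::Single { prp1: phys },
        2 => Prp::Double { prp1: phys, prp2: first_page + PAGE_SIZE },
        n => Prp::List {
            prp1: phys,
            entries: (1..n).map(|i| first_page + i * PAGE_SIZE).collect(),
        },
    })
}
```

In the `List` case the driver writes `entries` into a page-aligned list page and places that page's physical address in PRP2. A non-contiguous buffer would fill the same list from its individual frame addresses.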
Crate Dependencies
| Crate | Purpose | no_std |
|---|---|---|
| (none) | NVMe register-level protocol is simple enough to implement directly | N/A |
The NVMe spec is cleaner than virtio and the register interface is straightforward. A minimal driver (admin + 1 I/O queue pair, read/write) is ~500-700 lines without external dependencies.
Integration with Storage Proposal
The storage proposal’s Phase 3 (Persistent Store) specifies virtio-blk as
the backing device. This can be generalized to a BlockDevice trait:
trait BlockDevice {
    fn read(&self, lba: u64, count: u16, buf: &mut [u8]) -> Result<(), Error>;
    fn write(&self, lba: u64, count: u16, buf: &[u8]) -> Result<(), Error>;
    fn block_size(&self) -> u32;
    fn block_count(&self) -> u64;
}
Both NVMe and virtio-blk implement this trait. The store service doesn’t care which backing driver it uses.
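To show that the trait is enough for a store service to stay driver-agnostic, here is a hosted sketch with an in-memory backend. `RamDisk` and the `Error` variants are hypothetical and exist only for illustration. Because the trait takes `&self` for `write`, the backend needs interior mutability; `RefCell` is used here, whereas a kernel driver would more likely use a lock.

```rust
use std::cell::RefCell;
use std::ops::Range;

#[derive(Debug, PartialEq, Eq)]
pub enum Error {
    OutOfRange,
    BadBufferLength,
}

pub trait BlockDevice {
    fn read(&self, lba: u64, count: u16, buf: &mut [u8]) -> Result<(), Error>;
    fn write(&self, lba: u64, count: u16, buf: &[u8]) -> Result<(), Error>;
    fn block_size(&self) -> u32;
    fn block_count(&self) -> u64;
}

/// In-memory block device for tests (hypothetical).
pub struct RamDisk {
    block_size: u32,
    data: RefCell<Vec<u8>>,
}

impl RamDisk {
    pub fn new(block_size: u32, block_count: u64) -> Self {
        assert!(block_size > 0, "block size must be non-zero");
        let bytes = block_size as usize * block_count as usize;
        RamDisk { block_size, data: RefCell::new(vec![0; bytes]) }
    }

    /// Translate an LBA range to a byte range, checking bounds and buffer size.
    fn byte_range(&self, lba: u64, count: u16, buf_len: usize) -> Result<Range<usize>, Error> {
        let end_lba = lba.checked_add(count as u64).ok_or(Error::OutOfRange)?;
        if end_lba > self.block_count() {
            return Err(Error::OutOfRange);
        }
        let bs = self.block_size as usize;
        if buf_len != count as usize * bs {
            return Err(Error::BadBufferLength);
        }
        Ok((lba as usize * bs)..(end_lba as usize * bs))
    }
}

impl BlockDevice for RamDisk {
    fn read(&self, lba: u64, count: u16, buf: &mut [u8]) -> Result<(), Error> {
        let range = self.byte_range(lba, count, buf.len())?;
        buf.copy_from_slice(&self.data.borrow()[range]);
        Ok(())
    }

    fn write(&self, lba: u64, count: u16, buf: &[u8]) -> Result<(), Error> {
        let range = self.byte_range(lba, count, buf.len())?;
        self.data.borrow_mut()[range].copy_from_slice(buf);
        Ok(())
    }

    fn block_size(&self) -> u32 {
        self.block_size
    }

    fn block_count(&self) -> u64 {
        self.data.borrow().len() as u64 / self.block_size as u64
    }
}
```

A backend like this also gives the store service a test target that needs no QEMU device at all.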
Scope
~500-700 lines for a minimal in-kernel NVMe driver (admin queue + 1 I/O queue pair, read/write, identify). Userspace decomposition follows the same pattern as the networking proposal (kernel driver first, then extract to userspace process with DeviceMmio + Interrupt caps).
Phase 6: Cloud NIC Strategy
Goal: Define the path to networking on cloud VMs, given that each cloud uses a different proprietary NIC.
The Landscape
| Cloud | Primary NIC | Virtio NIC available? | Open-source driver? |
|---|---|---|---|
| GCP | gVNIC (1AE0:0042) | Yes on supported first/second-generation machine families | Yes (Linux, ~3000 LoC) |
| AWS | ENA (1D0F:EC20) | No (Nitro only) | Yes (Linux, ~8000 LoC) |
| Azure | MANA (1414:00BA) | No (accelerated networking) | Yes (Linux, ~6000 LoC) |
Recommended Strategy
Short term: constrained virtio-net on GCP
GCP can expose VIRTIO_NET on supported first/second-generation machine
families. After the shared image, ACPI/PCIe, interrupt, DMA/MMIO, and virtio
foundation exists, that gives a constrained early cloud-network proof without
writing a provider-specific NIC driver. It is not the general GCP target:
third-generation-and-later machine families, Tau T2A, Confidential VM, and
some higher-bandwidth paths require gVNIC.
gcloud compute instances create capos-test \
    --image=capos \
    --machine-type=e2-micro \
    --network-interface=nic-type=VIRTIO_NET
Medium term: gVNIC driver
gVNIC is a simpler device than ENA or MANA. The Linux driver is ~3000 lines (vs ~8000 for ENA). It uses standard PCI BAR MMIO + MSI-X interrupts. A minimal gVNIC driver (init, link up, send/receive) would be ~800-1200 lines.
gVNIC is worth prioritizing because:
- GCP’s constrained virtio-net path can de-risk cloud networking before a provider-specific NIC driver exists
- Graduating from virtio-net to gVNIC on the same cloud is the required path for newer, Tau T2A, Confidential VM, and higher-bandwidth GCP instances
- The gVNIC register interface is documented in the Linux driver source
Long term: ENA and MANA
ENA and MANA are more complex and less well-documented outside their Linux drivers. These should be deferred until the driver model is mature (userspace drivers with DeviceMmio caps, as described in the networking proposal Part 2).
At that point, the kernel only needs to provide PCI enumeration + BAR mapping + MSI-X routing. The actual NIC driver logic runs in a userspace process, making it feasible to port from the Linux driver source with appropriate licensing considerations.
Alternative: Paravirt Abstraction Layer
Instead of writing native drivers for each cloud NIC, an alternative is a thin paravirt layer:
Application -> NetworkManager cap -> Net Stack (smoltcp) -> NIC cap -> [driver]
Where [driver] is one of:
- virtio-net (QEMU, supported first/second-generation GCP machine families)
- gvnic (GCP)
- ena (AWS)
- mana (Azure)
All drivers implement the same Nic capability interface from the networking
proposal. The network stack and applications are driver-agnostic.
This is already the architecture described in the networking proposal. The
only addition is recognizing that multiple driver implementations will exist
behind the same Nic interface.
Phase Summary and Dependencies
graph TD
P1[Phase 1: Disk Image + Serial Diagnostics] --> BOOT[Boots on Cloud VM]
P2[Phase 2: ACPI Parsing] --> P3[Phase 3: Interrupt Infrastructure]
P2 --> P4[Phase 4: PCI/PCIe]
P3 --> P5[Phase 5: NVMe Driver]
P4 --> P5
P4 --> NET[Networking Smoke Test<br>virtio-net driver]
P3 --> NET
P4 --> P6[Phase 6: Cloud NIC Drivers]
P3 --> P6
NET --> P6
S5[Stage 5: Scheduling] --> P3
SMP_C[SMP Phase C: LAPIC timer/IPI] --> P3
style P1 fill:#2d5,stroke:#333
style BOOT fill:#2d5,stroke:#333
| Phase | Depends on | Estimated scope | Enables |
|---|---|---|---|
| 1: Disk image + diagnostics | Nothing | image tooling plus bounded diagnostics mode | Cloud serial boot |
| 2: ACPI | Nothing (kernel code) | ~200-300 lines | Phases 3, 4 |
| 3: Interrupts | Phase 2, LAPIC (SMP Phase C) | ~300-400 lines | NVMe, cloud NICs |
| 4: PCI/PCIe | Phase 2 | ~400-500 lines | All device drivers |
| 5: NVMe | Phases 3, 4 | ~500-700 lines | Cloud storage |
| 6: Cloud NICs | Phases 3, 4, networking smoke test | ~800-1200 lines each | Cloud networking |
Minimum Path to “Boots on Cloud VM, Prints Hello”
Raw serial output and UEFI boot support already exist, so the smallest “prints hello” experiment is mostly Phase 1 image packaging plus any boot-path adjustments needed to reach the same COM1 output from an imported disk image. That experiment is a precursor, not the full Phase 1 closeout.
Phase 1 closeout also includes a bounded serial diagnostics prompt so cloud driver bring-up can inspect CPU, memory, ACPI, PCI, IRQ, timer, device, and log state before cloud NICs or storage drivers are reliable. That diagnostics surface is kernel/userspace behavior, not just build-system work.
Minimum Path to “Useful on Cloud VM”
Phases 1-5 (disk image + ACPI + interrupts + PCI + NVMe) plus the existing roadmap items (Stages 4-6 for capability syscalls, scheduling, IPC). On a supported first/second-generation GCP machine family, networking can use the existing virtio-net proposal without a provider-specific gVNIC/ENA/MANA driver on that constrained target.
QEMU Testing
All phases can be tested in QEMU before deploying to cloud:
| Phase | QEMU flags |
|---|---|
| Disk image | -drive file=capos.img,format=raw -bios OVMF.4m.fd |
| ACPI | Default QEMU provides ACPI tables (MADT, MCFG, etc.) |
| I/O APIC | Default QEMU emulates I/O APIC |
| PCI/PCIe | -device ... adds PCI devices; QEMU has PCIe root complex |
| NVMe | -drive file=disk.img,if=none,id=d0 -device nvme,drive=d0,serial=capos0 |
| MSI-X | Supported by QEMU’s NVMe and virtio-net-pci emulation; current net smoke asserts metadata selection, kernel-owned source-route records, route unmask, vector programming, virtio queue assignment, descriptor guards, ARP, and ICMP fixture evidence. Device-autonomous virtio-net MSI-X delivery is covered by the dedicated userspace-provider gates. |
| Multi-CPU | -smp 4 (already works with Limine SMP) |
| x2APIC backend | future explicit QEMU CPU feature such as -cpu qemu64,+smep,+smap,+rdrand,+x2apic |
aarch64 and ARM Cloud Instances
This proposal focuses on x86_64 because that’s the current kernel target, but ARM-based cloud instances are significant and growing:
| Cloud | ARM offering | Instance types |
|---|---|---|
| AWS | Graviton2/3/4 | m7g, c7g, r7g, etc. |
| GCP | Tau T2A (Ampere Altra) | t2a-standard-* |
| Azure | Cobalt 100 (Arm Neoverse) | Dpsv6, Dplsv6 |
ARM cloud VMs have the same general requirements (UEFI boot, ACPI tables, PCI/PCIe, NVMe storage) but different specifics:
- Interrupt controller: GIC (Generic Interrupt Controller) instead of APIC. GICv3 is standard on cloud ARM instances.
- Boot: UEFI via Limine (already targets aarch64). Limine handles the architecture differences at boot time.
- Timer: ARM generic timer (CNTPCT_EL0) instead of LAPIC/PIT/TSC.
- Serial: PL011 UART instead of 16550 COM1. Different register interface.
- NIC: Same PCI devices (ENA, gVNIC, MANA) with the same register interfaces – PCI/PCIe is architecture-neutral.
- NVMe: Same NVMe register interface – PCIe is architecture-neutral.
The arch-neutral parts of this proposal (PCI enumeration, NVMe, disk image format, ACPI table parsing) apply equally to aarch64. The arch-specific parts (I/O APIC, MSI delivery address format, LAPIC) need aarch64 equivalents (GIC, ARM MSI translation).
The existing roadmap lists “aarch64 support” as a future item. For cloud deployment, aarch64 should be considered as soon as the x86_64 hardware abstraction is stable, since:
- Device drivers (NVMe, virtio-net, cloud NICs) are architecture-neutral – they talk to PCI config space and MMIO BARs, which are the same on both architectures
- The acpi crate handles both x86_64 and aarch64 ACPI tables
- Limine already targets aarch64
- AWS Graviton instances are often cheaper than x86_64 equivalents
The main aarch64 kernel work is: exception handling (EL0/EL1 in place of x86 Ring 3/Ring 0), GIC driver (instead of APIC), ARM generic timer, PL011 serial, and the MMU setup (4-level page tables exist on both but with different register interfaces).
Open Questions
- ACPI scope. The landed diagnostic parser covers bounded read-only RSDP/RSDT/MADT/MCFG summaries only. The acpi crate can parse fuller static tables (MADT, MCFG, HPET, FADT). Full ACPI requires AML interpretation (for _PRT interrupt routing, dynamic device enumeration). Do we need AML, or are static tables sufficient for cloud VMs? Cloud VM firmware typically provides simple, static ACPI tables – AML interpretation is likely unnecessary initially.
- PCIe ECAM vs legacy. Should we support both config access methods, or require ECAM (which all cloud VMs and modern QEMU provide)? Supporting both adds ~50 lines but makes bare-metal testing on older hardware possible.
- NVMe queue depth. A single I/O queue pair with depth 32 is sufficient for initial use. Per-CPU queues (leveraging MSI-X per-queue interrupts) improve SMP throughput but add complexity. Defer per-CPU queues to after SMP is working.
- Driver model unification. Resolved: PCI enumeration is the standalone PCI/PCIe Infrastructure item in the roadmap. The networking smoke test and NVMe driver both consume this shared subsystem. The networking proposal’s Part 1 Step 1 has been updated to reference this phase.
- GCP vs AWS as first cloud target. The first cloud proof should be imported-image serial-console boot on both providers when practical, because that validates image format, firmware, bootloader, and early ACPI without depending on cloud NICs. For the later usable-networked-instance milestone, a constrained first/second-generation GCP virtio-net target is the easiest first network proof; broader GCP coverage needs gVNIC, and AWS follows once the NVMe/ENA path or an explicit workaround is ready.
References
Specifications
- NVMe Base Specification 2.1 – register interface, queue semantics, command set
- PCI Express Base Specification – ECAM, MSI/MSI-X capability structures
- ACPI Specification 6.5 – MADT, MCFG, HPET table formats
- Intel SDM Vol. 3, Ch. 10 – APIC architecture (LAPIC, I/O APIC)
Crates
- acpi – no_std ACPI table parser
- virtio-drivers – no_std virtio (already in networking proposal)
Prior Art
- Redox PCI – microkernel PCI driver in Rust
- Hermit NVMe – unikernel NVMe driver
- rCore virtio – educational OS with virtio + PCI in Rust
- Linux gVNIC driver – reference for gVNIC register interface (~3000 LoC)
- Linux ENA driver – reference for ENA
Cloud Documentation
- GCP: Creating custom images
- GCP: Manually import boot disks
- GCP: Requirements to build custom images
- GCP: Persistent Disk storage interfaces
- AWS: Importing VM images
- AWS: VM Import/Export requirements
- AWS: VM Import/Export limitations
- AWS: EC2 UEFI boot mode requirements
- Azure: Creating custom images
- GCP: Choosing a NIC type
- GCP: Cloud Run overview
- GCP: Firestore Native mode
- GCP: Cloud Storage object versioning
- GCP: Secret Manager
- GCP: Cloud KMS overview
- GCP: Cloud KMS IAM
- GCP: Cloud KMS roles and permissions
- GCP: Cloud KMS key rotation
- GCP: Rotate a Cloud KMS key
- GCP: Enable and disable Cloud KMS key versions
- GCP: Destroy and restore Cloud KMS key versions
- AWS: Enhanced networking
- AWS: Nitro instances
- Azure: Accelerated Networking
- Azure: Microsoft Azure Network Adapter
- Azure: Manage Accelerated Networking
- Azure: NVMe overview
- Google Drive: application data folder
- Google Drive: Drive API scopes
- Firebase: Firestore offline persistence
- Firebase: Firestore security rule conditions
- Firebase: Firestore usage and limits
- Firebase: Google sign-in for web
capOS Cross-Links
- docs/design-risks-register.md – R13 (trusted build inputs are partly pinned) consolidates the long-horizon supply-chain risk view that gates cloud-image release paths; this proposal is recorded as a secondary owner.
- docs/trusted-build-inputs.md – the actual inventory of pinned and observed-not-pinned build inputs, dependency policy, vendored upstream snapshots, and the build-provenance retention/comparison policy that cloud proofs must satisfy before they are cited as production evidence.
- docs/tasks/done/2026-06-07/cloud-usable-instance-provider-nic-storage.md – the completed GCP-first usable-instance provider rollup covering provider NIC/storage authority, DMA backend selection, cloud teardown, and serial-console operator access.
- docs/dma-isolation-design.md – DMA isolation backend selection (kernel-owned bounce buffers vs IOMMU/remapping) that cloud provider drivers must commit to before claiming usable-instance status.
- docs/backlog/hardware-boot-storage.md – DDF Tasks 5 (userspace driver authority) and 6 (recurring cloud-portability gate) referenced from Phase 1 closeout above.