virtio-blk (modern PCI block device)

This is a provenance map for the in-tree virtio-blk driver: it cites the spec, summarizes only the wire-format subset the code actually implements, and points into the implementation. It is not a re-spec – where the spec is implemented unchanged it links rather than transcribes. The driver was the first real BlockDevice CapObject, so the treatment is a concise map rather than exhaustive register tables. It reuses the modern split-ring transport seam introduced for virtio-net; this page covers only the block-specific additions.

Status: QEMU fixture, not the production storage route. The kernel-owned virtio-blk driver, its BlockDevice cap arm, and its PCI discovery are all gated behind the qemu cargo feature (diagnose_qemu_virtio_blk in kernel/src/pci.rs; the BlockDeviceBackend::Virtio arm in kernel/src/cap/block_device.rs). The default non-qemu production kernel never enumerates, claims, or binds virtio-blk, and its block_device grant source resolves to the userspace-brokered NVMe BlockDevice arm (BlockDeviceBackend::NvmeBrokered) instead, failing closed when no verified NVMe controller and live device_mmio grant are present. virtio-blk remains as a named local fixture / regression test only – a fully QEMU-emulable end-to-end BlockDevice proof and the substrate the storage-layer (read-only / persistent / writable filesystem) QEMU proofs read through. It is not an ambiguous forward production driver. The kernel broker responsibilities it exercises (PCI claim arbitration, MMIO/IRQ/DMA admission, bounce/IOMMU isolation, stale-generation rejection, and revocation) are the same ones the production userspace storage driver binds into; see §3 capOS mapping.

The driver lives in the virtio-blk section of kernel/src/virtio.rs (VirtioBlkDriver) and the cap surface in kernel/src/cap/block_device.rs (BlockDeviceCap).

1. Spec basis

  • Device: virtio block device, modern (virtio 1.x) PCI transport. PCI vendor 0x1af4; device 0x1042 (modern) / 0x1001 (transitional). IDs at kernel/src/pci.rs (VIRTIO_VENDOR_ID, VIRTIO_BLK_MODERN_DEVICE_ID, VIRTIO_BLK_TRANSITIONAL_DEVICE_ID; matched by PciDevice::is_virtio_blk). Up to device_dma::MAX_VIRTIO_BLK_DEVICES functions are bound, each in its own const-generic driver slot (VIRTIO_BLK_DRIVER_0 / VIRTIO_BLK_DRIVER_1) so the two devices cannot alias DMA or queue state. The target disk is selected by manifest PCI identity; the ordinary boot/storage disk resolves to the non-target disk when both are present.
  • Authoritative spec: Virtual I/O Device (VIRTIO) Version 1.2, OASIS Committee Specification 01 (2022-07-01). Source: https://docs.oasis-open.org/virtio/virtio/v1.2/virtio-v1.2.html. Relevant sections: 4.1 (virtio over PCI bus), 2.7 (split virtqueues), 5.2 (block device).
  • Reference: cross-checked against the Linux virtio_blk driver for the request framing and the virtio_pci_modern modern-transport handshake.

2. Wire format (implemented subset)

The modern PCI capability parsing, common-config register map, split-ring descriptor layout, and feature-negotiation handshake are the shared transport seam documented in virtio-net §2 (kernel/src/virtio.rs transport module, ModernTransport, Virtqueue, DescriptorTrackingSlot). Only the block-specific subset is summarized here.

  • Feature negotiation: the driver requires and selects only VIRTIO_F_VERSION_1 (read_device_features / write_driver_features in VirtioBlkDriver::initialize); a device that does not offer it fails closed with BlkInitError::MissingRequiredFeatures. No block feature bits (read-only, multi-queue, discard, …) are negotiated, so the device is driven as a single read/write request queue.
  • Device config (capacity): the block device config space carries the capacity in 512-byte sectors as a little-endian u64 (VIRTIO_BLK_CONFIG_CAPACITY_LEN = 8 bytes, read low/high in initialize). A config region shorter than that, or a zero capacity, fails closed (BlkInitError::DeviceConfigTooSmall / ZeroCapacity).
  • Request queue: a single request virtqueue (queue 0). The negotiated size is the largest power of two that exceeds neither the device-advertised COMMON_QUEUE_SIZE nor VIRTIO_BLK_REQUEST_QUEUE_SIZE (8); a usable size below 4 fails closed, because one request chain needs 3 descriptors and 4 is the smallest power of two that holds it. The per-queue notify address is computed from notify_off_multiplier like any modern virtio queue.
  • Request framing (VirtioBlkDriver::issue_request): each request is a 3-descriptor chain over one bounce-buffer page (ChainSegment):
    1. header – VIRTIO_BLK_REQ_HEADER_LEN (16) bytes, device-readable: type (u32: VIRTIO_BLK_T_IN = 0 read / VIRTIO_BLK_T_OUT = 1 write), a reserved u32, and the sector (u64 LBA), at VIRTIO_BLK_HEADER_OFFSET (0).
    2. data – 512 * count bytes at VIRTIO_BLK_DATA_OFFSET (512), device-writable for reads, device-readable for writes.
    3. status – 1 byte at VIRTIO_BLK_STATUS_OFFSET (16), device-writable; pre-seeded with VIRTIO_BLK_STATUS_SENTINEL (0xff) and checked for VIRTIO_BLK_S_OK (0) after completion (BlockDeviceRequestError::DeviceStatus otherwise).
  • Completion: QEMU completes virtio-blk requests synchronously, so the driver notifies the queue and polls the used ring (poll_used_within_ns, bounded by the real-time VIRTIO_BLK_COMPLETION_BUDGET_NS budget with the VIRTIO_BLK_COMPLETION_FALLBACK_SPIN_LIMIT spin-count backstop when the monotonic clocksource is tick-derived) rather than waiting on the request MSI-X interrupt, which is claimed but left masked (see §3). The bound is time-based because the device side includes QEMU’s host file I/O, whose latency a raw spin count does not track.
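
The queue sizing and request framing described above can be sketched in a few lines of Rust. This is a minimal illustration, not the in-tree code: the constant values come from the text, but PAGE_SIZE, SECTOR_SIZE, and the functions clamp_queue_size, frame_request, and request_succeeded are hypothetical names chosen for this sketch, and the bounce page is modeled as a plain byte array rather than a DMA allocation.

```rust
// Illustrative sketch only. Constant values are quoted from the text above;
// every function below is a hypothetical stand-in for the in-tree logic.
const PAGE_SIZE: usize = 4096; // assumed 4 KiB bounce page
const SECTOR_SIZE: usize = 512;
const VIRTIO_BLK_REQUEST_QUEUE_SIZE: u16 = 8;
const VIRTIO_BLK_REQ_HEADER_LEN: usize = 16;
const VIRTIO_BLK_HEADER_OFFSET: usize = 0;
const VIRTIO_BLK_STATUS_OFFSET: usize = 16;
const VIRTIO_BLK_DATA_OFFSET: usize = 512;
#[allow(dead_code)]
const VIRTIO_BLK_T_IN: u32 = 0; // read
#[allow(dead_code)]
const VIRTIO_BLK_T_OUT: u32 = 1; // write
const VIRTIO_BLK_S_OK: u8 = 0;
const VIRTIO_BLK_STATUS_SENTINEL: u8 = 0xff;

/// Largest power of two that exceeds neither the device-advertised size nor
/// the driver cap. Returns `None` when fewer than 4 descriptors remain,
/// since one request chain needs 3.
fn clamp_queue_size(device_size: u16) -> Option<u16> {
    let limit = device_size.min(VIRTIO_BLK_REQUEST_QUEUE_SIZE);
    if limit < 4 {
        return None;
    }
    // Keep only the highest set bit to round down to a power of two.
    Some(1u16 << (15 - limit.leading_zeros()))
}

/// Write the 16-byte header and seed the status byte in one bounce page.
/// Returns the data segment length in bytes, or `None` if the request
/// would not fit in the page.
fn frame_request(
    page: &mut [u8; PAGE_SIZE],
    req_type: u32,
    sector: u64,
    count: usize,
) -> Option<usize> {
    let data_len = SECTOR_SIZE.checked_mul(count)?;
    let data_end = VIRTIO_BLK_DATA_OFFSET.checked_add(data_len)?;
    if count == 0 || data_end > PAGE_SIZE {
        return None;
    }
    let start = VIRTIO_BLK_HEADER_OFFSET;
    let header = &mut page[start..start + VIRTIO_BLK_REQ_HEADER_LEN];
    header[0..4].copy_from_slice(&req_type.to_le_bytes()); // type
    header[4..8].copy_from_slice(&0u32.to_le_bytes()); // reserved
    header[8..16].copy_from_slice(&sector.to_le_bytes()); // sector (LBA)
    // Pre-seed the status byte so an untouched byte is never read as OK.
    page[VIRTIO_BLK_STATUS_OFFSET] = VIRTIO_BLK_STATUS_SENTINEL;
    Some(data_len)
}

/// True only if the device overwrote the sentinel with VIRTIO_BLK_S_OK.
fn request_succeeded(page: &[u8; PAGE_SIZE]) -> bool {
    page[VIRTIO_BLK_STATUS_OFFSET] == VIRTIO_BLK_S_OK
}
```

With these offsets, the header occupies bytes 0..16, the status byte sits at 16, and the data segment starts at 512, so the three descriptors of a chain point at disjoint ranges of the same page. A 7-sector request ends exactly at byte 4096, which is why an 8-sector request is rejected in this sketch.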

3. capOS mapping

  • Binding (qemu fixture, in-kernel): virtio-blk is driven in the kernel and only under the qemu feature. Unlike the userspace storage driver, it does not receive DeviceMmio/Interrupt/DMAPool caps; instead VirtioBlkDriver::initialize binds authority through the kernel device_manager transactions – claim_pci_function(.., DeviceOwner::VirtioBlk) then attach_dmapool_record_with_remapping / attach_devicemmio_record / attach_interrupt_source. The BlockDevice cap is the userspace-facing surface; the hardware authority stays kernel-owned. This in-kernel ownership is why the driver is kept as a qemu fixture rather than a production route: the production BlockDevice is served by the userspace-brokered NVMe provider chain (BlockDeviceBackend::NvmeBrokered, gated on a verified controller and a live device_mmio grant), where the device-specific protocol logic runs in userspace over DeviceMmio/DMAPool/Interrupt caps and the kernel retains only broker/admission/isolation/revocation.
  • MMIO: the modern-transport common/notify/ISR/device-config regions are mapped from the device BARs (map_blk_region over pci::map_bar_region) and recorded with device_manager::attach_devicemmio_record against the first decoded memory BAR. Doorbell (queue-notify) writes are scoped to the per-queue notify address computed from notify_off_multiplier. The DDF DeviceMmio cap (kernel/src/cap/device_mmio.rs) is the userspace successor surface.
  • Interrupt: one MSI-X route is registered for the request queue (VIRTIO_BLK_REQUEST_MSIX_ENTRY = 0, PciMsixInterruptRole::BlockRequestQueue), claimed (DeviceInterruptDriver::VirtioBlk) and attached to the device handle for authority binding, but left masked: completion is by polled used ring, not interrupt delivery. Route records are tracked by the kernel-owned device-interrupt ledger (kernel/src/device_interrupt.rs).
  • DMA: each bound device gets its own DMA pool (device_dma::begin_virtio_blk_pool, keyed by the const-generic DEV index via VirtioBlkDma<DEV>). Ring pages and the request bounce buffer are allocated and accounted through the blk-keyed ledger (allocate_virtio_blk_page / register_virtio_blk_queue / record_virtio_blk_submission/..._completion_for_allocation in kernel/src/device_dma.rs). DMA uses the manager-owned bounce-buffer backend, so no host physical address or IOVA is exposed to userspace. The request MSI-X route stays masked for a separate reason: completion is polled (see the Interrupt bullet above).
  • BlockDevice cap surface: BlockDeviceCap (kernel/src/cap/block_device.rs) is scoped to one device_index and routes the schema’s readBlocks/writeBlocks/info/flush methods (schema/capos.capnp interface BlockDevice) to that device only, failing closed when it is not bound. Under the qemu feature the block_device KernelCapSource reaches the resolved boot/storage virtio-blk disk, and the block_device_target source requires SystemConfig.blockDeviceTarget.pci (schema/capos.capnp) and resolves that PCI segment:bus:device.function selector to a bound non-boot virtio-blk device; absent, mismatched, or boot-disk selectors fail closed. In the production (non-qemu) kernel the same block_device source instead mints the NvmeBrokered arm, and block_device_target fails closed (requires the qemu feature). The read-only/persistent/writable filesystem and store caps (readonly_fs, persistent_store, writable_fs) layer their on-disk formats over whichever BlockDevice backs the boot/storage cap – the virtio-blk fixture under qemu, the brokered NVMe arm in production.
  • Fail-closed / validation rules: VirtioBlkDriver::validate_range rejects a zero count, a count over VIRTIO_BLK_MAX_SECTORS_PER_REQUEST (7 – bounded so header + status + 512 * count fit one 4 KiB page), start_lba + count arithmetic overflow, and any range past the reported capacity_sectors, all before device access. The cap layer additionally enforces that writeBlocks data length equals count * 512 (BlockDeviceRequestError::DataLengthMismatch). A non-OK device status, a used-ring poll timeout, or a DMA accounting failure each fail closed (DeviceStatus / Completion / Accounting). Descriptor reuse is generation-tracked through the shared bounded tracking-slot array.
  • QEMU-emulable vs hardware-only: fully QEMU-emulable, and these are the fixture gates. QEMU provides virtio-blk-pci; make run-virtio-blk is the single-device end-to-end BlockDevice fixture, make run-multi-virtio-blk proves the two-device (boot + target) binding with independent per-device DMA pools, make run-blockdevice-target-identity proves manifest identity selection when PCI/BDF order would otherwise bind the intended target first, and make run-virtio-blk-failover exercises the multi-device failover path. All are --features qemu fixtures over dedicated system-virtio-blk.cue / system-multi-virtio-blk.cue / system-blockdevice-target-identity.cue manifests, not production-storage evidence. No hardware-only path. The production-storage gate is the userspace-brokered NVMe BlockDevice chain (make run-cloud-provider-nvme-blockdevice-read-graduated and the other run-cloud-provider-nvme-blockdevice-* proofs).
  • kernel/src/virtio.rs – the virtio-blk driver (VirtioBlkDriver), request framing, queue setup, and the shared modern split-ring transport.
  • kernel/src/cap/block_device.rs – the BlockDevice cap surface (BlockDeviceCap) routing schema methods to a single bound device.
  • kernel/src/device_dma.rs – the per-device virtio-blk DMA pool/queue ledger.
  • kernel/src/device_interrupt.rs – the request-queue MSI-X route record.
  • schema/capos.capnp (interface BlockDevice) – the readBlocks/writeBlocks/info/flush contract.
  • docs/dma-isolation-design.md – the DMA backend and isolation model the userspace successor binds into.
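
The fail-closed range checks listed under §3 (zero count, count over the per-request maximum, start_lba + count overflow, range past capacity, and the writeBlocks length rule) can be sketched as below. This is a hedged illustration under stated assumptions, not the in-tree VirtioBlkDriver::validate_range: the RangeError enum, its variant names, the parameter types, and the helper write_len_matches are hypothetical; only the limit of 7 sectors and the 512-byte sector size come from the text.

```rust
// Illustrative sketch only. The error type and function signatures are
// assumptions made for this example; the in-tree code may differ.
const SECTOR_SIZE: usize = 512;
const VIRTIO_BLK_MAX_SECTORS_PER_REQUEST: u32 = 7;

/// Hypothetical error variants, one per rejection rule.
#[derive(Debug, PartialEq, Eq)]
enum RangeError {
    ZeroCount,
    CountTooLarge,
    Overflow,
    PastCapacity,
}

/// Reject an invalid sector range before any device access happens.
fn validate_range(
    start_lba: u64,
    count: u32,
    capacity_sectors: u64,
) -> Result<(), RangeError> {
    if count == 0 {
        return Err(RangeError::ZeroCount);
    }
    if count > VIRTIO_BLK_MAX_SECTORS_PER_REQUEST {
        return Err(RangeError::CountTooLarge);
    }
    // checked_add turns wraparound into an explicit error instead of a
    // silently small end LBA that would pass the capacity check.
    let end = start_lba
        .checked_add(u64::from(count))
        .ok_or(RangeError::Overflow)?;
    if end > capacity_sectors {
        return Err(RangeError::PastCapacity);
    }
    Ok(())
}

/// Cap-layer rule: a write payload must be exactly count * 512 bytes.
fn write_len_matches(count: u32, data_len: usize) -> bool {
    (count as usize).checked_mul(SECTOR_SIZE) == Some(data_len)
}
```

Ordering the checks from cheapest to most specific means every rejection happens on plain integers, so an invalid request never touches the bounce buffer or the queue.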