virtio-blk (modern PCI block device)
This is a provenance map for the in-tree virtio-blk driver: it cites the spec,
summarizes only the wire-format subset the code actually implements, and points
into the implementation. It is not a re-spec – where the spec is implemented
unchanged it links rather than transcribes. The driver was the first real
BlockDevice CapObject, so the treatment is a concise map rather than
exhaustive register tables. It reuses the modern split-ring transport seam
introduced for virtio-net (see the virtio-net page); this page covers only
the block-specific additions.
Status: QEMU fixture, not the production storage route. The kernel-owned
virtio-blk driver, its BlockDevice cap arm, and its PCI discovery are all
gated behind the qemu cargo feature (diagnose_qemu_virtio_blk in
kernel/src/pci.rs; the BlockDeviceBackend::Virtio arm in
kernel/src/cap/block_device.rs). The default non-qemu production kernel never
enumerates, claims, or binds virtio-blk, and its block_device grant source
resolves to the userspace-brokered NVMe BlockDevice arm
(BlockDeviceBackend::NvmeBrokered) instead, failing closed when no verified
NVMe controller and live device_mmio grant are present. virtio-blk remains as a
named local fixture / regression test only – a fully QEMU-emulable end-to-end
BlockDevice proof and the substrate the storage-layer (read-only / persistent /
writable filesystem) QEMU proofs read through. It is explicitly not a future
production driver. The kernel broker responsibilities it exercises (PCI claim
arbitration, MMIO/IRQ/DMA admission, bounce/IOMMU isolation, stale-generation
rejection, and revocation) are the same ones the production userspace storage
driver binds into; see §3 capOS mapping.
The driver lives in the virtio-blk section of kernel/src/virtio.rs
(VirtioBlkDriver) and the cap surface in kernel/src/cap/block_device.rs
(BlockDeviceCap).
1. Spec basis
- Device: virtio block device, modern (virtio 1.x) PCI transport. PCI vendor
0x1af4; device 0x1042 (modern) / 0x1001 (transitional). IDs at
kernel/src/pci.rs (VIRTIO_VENDOR_ID, VIRTIO_BLK_MODERN_DEVICE_ID,
VIRTIO_BLK_TRANSITIONAL_DEVICE_ID; matched by PciDevice::is_virtio_blk). Up to
device_dma::MAX_VIRTIO_BLK_DEVICES functions are bound, each in its own
const-generic driver slot (VIRTIO_BLK_DRIVER_0/VIRTIO_BLK_DRIVER_1) so the two
devices cannot alias DMA or queue state. The target disk is selected by
manifest PCI identity; the ordinary boot/storage disk resolves to the
non-target disk when both are present.
- Authoritative spec: Virtual I/O Device (VIRTIO) Version 1.2, OASIS Committee
Specification 01 (2022-07-01). Source:
https://docs.oasis-open.org/virtio/virtio/v1.2/virtio-v1.2.html. Relevant
sections: 4.1 (virtio over PCI bus), 2.7 (split virtqueues), 5.2 (block device).
- Reference: cross-checked against the Linux virtio_blk driver for the request
framing and the virtio_pci_modern modern-transport handshake.
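
The ID match described in the Device item can be sketched as follows. This is a minimal illustration, not the code in kernel/src/pci.rs: the PciDevice struct here is a hypothetical stand-in that carries only the two ID fields, and the field names are assumptions. The constant names and values are the ones quoted above.

```rust
/// Virtio PCI vendor ID and the two virtio-blk device IDs quoted on this page.
const VIRTIO_VENDOR_ID: u16 = 0x1af4;
const VIRTIO_BLK_MODERN_DEVICE_ID: u16 = 0x1042;
const VIRTIO_BLK_TRANSITIONAL_DEVICE_ID: u16 = 0x1001;

/// Hypothetical stand-in for the kernel's PciDevice; the real type carries
/// more state. Field names are assumptions made for this sketch.
#[derive(Clone, Copy, Debug)]
struct PciDevice {
    vendor_id: u16,
    device_id: u16,
}

impl PciDevice {
    /// True when the function advertises the virtio vendor ID together with
    /// either the modern or the transitional block device ID.
    fn is_virtio_blk(&self) -> bool {
        self.vendor_id == VIRTIO_VENDOR_ID
            && matches!(
                self.device_id,
                VIRTIO_BLK_MODERN_DEVICE_ID | VIRTIO_BLK_TRANSITIONAL_DEVICE_ID
            )
    }
}
```

Matching on both IDs only identifies a candidate function; which disk becomes the target is still decided by manifest PCI identity as described above.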
2. Wire format (implemented subset)
The modern PCI capability parsing, common-config register map, split-ring
descriptor layout, and feature-negotiation handshake are the shared transport
seam documented in virtio-net §2
(kernel/src/virtio.rs transport module, ModernTransport, Virtqueue,
DescriptorTrackingSlot). Only the block-specific subset is summarized here.
- Feature negotiation: the driver requires and selects only VIRTIO_F_VERSION_1
(read_device_features/write_driver_features in VirtioBlkDriver::initialize); a
device that does not offer it fails closed with
BlkInitError::MissingRequiredFeatures. No block feature bits (read-only,
multi-queue, discard, …) are negotiated, so the device is driven as a single
read/write request queue.
- Device config (capacity): the block device config space carries the
capacity in 512-byte sectors as a little-endian u64
(VIRTIO_BLK_CONFIG_CAPACITY_LEN = 8 bytes, read low/high in initialize). A
config region shorter than that, or a zero capacity, fails closed
(BlkInitError::DeviceConfigTooSmall / ZeroCapacity).
- Request queue: a single request virtqueue (queue 0). The negotiated size
is clamped to the largest power of two not exceeding both the device-advertised
COMMON_QUEUE_SIZE and VIRTIO_BLK_REQUEST_QUEUE_SIZE (8); a usable size below 4
(one request chain needs 3 descriptors) fails closed. Per-queue notify address
is computed from notify_off_multiplier like any modern virtio queue.
- Request framing (VirtioBlkDriver::issue_request): each request is a
3-descriptor chain over one bounce-buffer page (ChainSegment):
  - header – VIRTIO_BLK_REQ_HEADER_LEN (16) bytes, device-readable: type (u32 –
    VIRTIO_BLK_T_IN = 0 read / VIRTIO_BLK_T_OUT = 1 write), a reserved u32, and
    the sector (u64 LBA), at VIRTIO_BLK_HEADER_OFFSET (0).
  - data – 512 * count bytes at VIRTIO_BLK_DATA_OFFSET (512), device-writable
    for reads, device-readable for writes.
  - status – 1 byte at VIRTIO_BLK_STATUS_OFFSET (16), device-writable;
    pre-seeded with VIRTIO_BLK_STATUS_SENTINEL (0xff) and checked for
    VIRTIO_BLK_S_OK (0) after completion (BlockDeviceRequestError::DeviceStatus
    otherwise).
- Completion: QEMU completes virtio-blk requests synchronously, so the
driver notifies the queue and polls the used ring (poll_used_within_ns, bounded
by the real-time VIRTIO_BLK_COMPLETION_BUDGET_NS budget with the
VIRTIO_BLK_COMPLETION_FALLBACK_SPIN_LIMIT spin-count backstop when the
monotonic clocksource is tick-derived) rather than waiting on the request MSI-X
interrupt, which is claimed but left masked (see §3). The bound is time-based
because the device side includes QEMU’s host file I/O, whose latency a raw spin
count does not track.
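
The queue-size clamp and the single-page request layout summarized above can be sketched as below. This is an illustration under stated assumptions, not the driver's code: clamp_queue_size and fill_request are hypothetical helper names, PAGE_SIZE and SECTOR_SIZE are assumed to be 4096 and 512 (matching the "one 4 KiB page" and "512-byte sectors" wording), and the real driver writes into a DMA bounce buffer rather than a plain array. The VIRTIO_BLK_* constants use the values quoted on this page.

```rust
const VIRTIO_BLK_REQUEST_QUEUE_SIZE: u16 = 8;
#[allow(dead_code)]
const VIRTIO_BLK_T_IN: u32 = 0; // read
#[allow(dead_code)]
const VIRTIO_BLK_T_OUT: u32 = 1; // write
const VIRTIO_BLK_REQ_HEADER_LEN: usize = 16;
const VIRTIO_BLK_HEADER_OFFSET: usize = 0;
const VIRTIO_BLK_STATUS_OFFSET: usize = 16;
const VIRTIO_BLK_DATA_OFFSET: usize = 512;
const VIRTIO_BLK_STATUS_SENTINEL: u8 = 0xff;
const PAGE_SIZE: usize = 4096; // assumption: one 4 KiB bounce page
const SECTOR_SIZE: usize = 512; // assumption: 512-byte sectors

/// Hypothetical helper: largest power of two not exceeding both the
/// device-advertised size and the driver cap; None when that is below 4.
fn clamp_queue_size(device_queue_size: u16) -> Option<u16> {
    let limit = device_queue_size.min(VIRTIO_BLK_REQUEST_QUEUE_SIZE);
    if limit < 4 {
        return None; // a 3-descriptor chain cannot fit; fail closed
    }
    // Round down to a power of two by keeping only the highest set bit.
    Some(1u16 << (15 - limit.leading_zeros()))
}

/// Hypothetical helper: writes the 16-byte little-endian header and the
/// status sentinel into the page, and returns the end offset of the data
/// segment for `count` sectors.
fn fill_request(page: &mut [u8; PAGE_SIZE], req_type: u32, sector: u64, count: usize) -> usize {
    let h = VIRTIO_BLK_HEADER_OFFSET;
    page[h..h + 4].copy_from_slice(&req_type.to_le_bytes()); // type
    page[h + 4..h + 8].copy_from_slice(&0u32.to_le_bytes()); // reserved
    page[h + 8..h + VIRTIO_BLK_REQ_HEADER_LEN].copy_from_slice(&sector.to_le_bytes()); // LBA
    // Pre-seed the status byte so an untouched status is distinguishable from OK (0).
    page[VIRTIO_BLK_STATUS_OFFSET] = VIRTIO_BLK_STATUS_SENTINEL;
    VIRTIO_BLK_DATA_OFFSET + SECTOR_SIZE * count
}
```

With count = 7 the data segment ends exactly at byte 4096, which is why a single page bounds a request at seven sectors under these offsets.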
3. capOS mapping
- Binding (qemu fixture, in-kernel): virtio-blk is driven in the kernel
and only under the qemu feature. Unlike the userspace storage driver, it does
not receive DeviceMmio/Interrupt/DMAPool caps; instead
VirtioBlkDriver::initialize binds authority through the kernel device_manager
transactions – claim_pci_function(.., DeviceOwner::VirtioBlk) then
attach_dmapool_record_with_remapping / attach_devicemmio_record /
attach_interrupt_source. The BlockDevice cap is the userspace-facing surface;
the hardware authority stays kernel-owned. This in-kernel ownership is why the
driver is kept as a qemu fixture rather than a production route: the production
BlockDevice is served by the userspace-brokered NVMe provider chain
(BlockDeviceBackend::NvmeBrokered, gated on a verified controller and a live
device_mmio grant), where the device-specific protocol logic runs in userspace
over DeviceMmio/DMAPool/Interrupt caps and the kernel retains only
broker/admission/isolation/revocation.
- MMIO: the modern-transport common/notify/ISR/device-config regions are
mapped from the device BARs (map_blk_region over pci::map_bar_region) and
recorded with device_manager::attach_devicemmio_record against the first
decoded memory BAR. Doorbell (queue-notify) writes are scoped to the per-queue
notify address computed from notify_off_multiplier. The DDF DeviceMmio cap
(kernel/src/cap/device_mmio.rs) is the userspace successor surface.
- Interrupt: one MSI-X route is registered for the request queue
(VIRTIO_BLK_REQUEST_MSIX_ENTRY = 0, PciMsixInterruptRole::BlockRequestQueue),
claimed (DeviceInterruptDriver::VirtioBlk) and attached to the device handle
for authority binding, but left masked: completion is by polled used ring, not
interrupt delivery. Route records are tracked by the kernel-owned
device-interrupt ledger (kernel/src/device_interrupt.rs).
- DMA: each bound device gets its own DMA pool
(device_dma::begin_virtio_blk_pool, keyed by the const-generic DEV index via
VirtioBlkDma<DEV>). Ring pages and the request bounce buffer are allocated and
accounted through the blk-keyed ledger (allocate_virtio_blk_page /
register_virtio_blk_queue / record_virtio_blk_submission /
..._completion_for_allocation in kernel/src/device_dma.rs). DMA uses the
manager-owned bounce-buffer backend; no host physical address or IOVA is
exposed to userspace.
- BlockDevice cap surface: BlockDeviceCap (kernel/src/cap/block_device.rs) is
scoped to one device_index and routes the schema’s
readBlocks/writeBlocks/info/flush methods (schema/capos.capnp interface
BlockDevice) to that device only, failing closed when it is not bound. Under
the qemu feature the block_device KernelCapSource reaches the resolved
boot/storage virtio-blk disk, and the block_device_target source requires
SystemConfig.blockDeviceTarget.pci (schema/capos.capnp) and resolves that PCI
segment:bus:device.function selector to a bound non-boot virtio-blk device;
absent, mismatched, or boot-disk selectors fail closed. In the production
(non-qemu) kernel the same block_device source instead mints the NvmeBrokered
arm, and block_device_target fails closed (requires the qemu feature). The
read-only/persistent/writable filesystem and store caps (readonly_fs,
persistent_store, writable_fs) layer their on-disk formats over whichever
BlockDevice backs the boot/storage cap – the virtio-blk fixture under qemu, the
brokered NVMe arm in production.
- Fail-closed / validation rules:
VirtioBlkDriver::validate_range rejects a zero count, a count over
VIRTIO_BLK_MAX_SECTORS_PER_REQUEST (7 – bounded so header + status +
512 * count fit one 4 KiB page), start_lba + count arithmetic overflow, and any
range past the reported capacity_sectors, all before device access. The cap
layer additionally enforces that writeBlocks data length equals count * 512
(BlockDeviceRequestError::DataLengthMismatch). A non-OK device status, a
used-ring poll timeout, or a DMA accounting failure each fail closed
(DeviceStatus / Completion / Accounting). Descriptor reuse is
generation-tracked through the shared bounded tracking-slot array.
- QEMU-emulable vs hardware-only: fully QEMU-emulable, and these are the
fixture gates. QEMU provides virtio-blk-pci;
make run-virtio-blk is the single-device end-to-end BlockDevice fixture,
make run-multi-virtio-blk proves the two-device (boot + target) binding with
independent per-device DMA pools, make run-blockdevice-target-identity proves
manifest identity selection when PCI/BDF order would otherwise bind the
intended target first, and make run-virtio-blk-failover exercises the
multi-device failover path. All are --features qemu fixtures over dedicated
system-virtio-blk.cue / system-multi-virtio-blk.cue /
system-blockdevice-target-identity.cue manifests, not production-storage
evidence. No hardware-only path. The production-storage gate is the
userspace-brokered NVMe BlockDevice chain
(make run-cloud-provider-nvme-blockdevice-read-graduated and the other
run-cloud-provider-nvme-blockdevice-* proofs).
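
The fail-closed range checks listed under the validation rules can be sketched as a pure function. This is a minimal illustration, not the body of VirtioBlkDriver::validate_range: the RangeError enum, its variant names, the free-function signature, and write_len_matches are all hypothetical, chosen only to make each rejection case visible. The 7-sector limit and the count * 512 length rule are the ones stated above.

```rust
const VIRTIO_BLK_MAX_SECTORS_PER_REQUEST: u32 = 7;
const SECTOR_SIZE: usize = 512; // assumption: 512-byte sectors

/// Hypothetical error type for this sketch; the kernel's error names differ.
#[derive(Debug, PartialEq, Eq)]
enum RangeError {
    ZeroCount,
    TooManySectors,
    Overflow,
    PastCapacity,
}

/// Rejects invalid ranges before any device access would happen.
fn validate_range(start_lba: u64, count: u32, capacity_sectors: u64) -> Result<(), RangeError> {
    if count == 0 {
        return Err(RangeError::ZeroCount);
    }
    if count > VIRTIO_BLK_MAX_SECTORS_PER_REQUEST {
        return Err(RangeError::TooManySectors);
    }
    // checked_add catches start_lba + count wrapping around u64.
    let end = start_lba
        .checked_add(u64::from(count))
        .ok_or(RangeError::Overflow)?;
    if end > capacity_sectors {
        return Err(RangeError::PastCapacity);
    }
    Ok(())
}

/// Hypothetical cap-layer check: write payload must be exactly count * 512 bytes.
fn write_len_matches(count: u32, data_len: usize) -> bool {
    (count as usize).checked_mul(SECTOR_SIZE) == Some(data_len)
}
```

Ordering the overflow check before the capacity comparison keeps the comparison itself free of wrapped values, so a range near u64::MAX is rejected rather than silently accepted.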
Related
- kernel/src/virtio.rs – the virtio-blk driver (VirtioBlkDriver), request
framing, queue setup, and the shared modern split-ring transport.
- kernel/src/cap/block_device.rs – the BlockDevice cap surface (BlockDeviceCap)
routing schema methods to a single bound device.
- kernel/src/device_dma.rs – the per-device virtio-blk DMA pool/queue ledger.
- kernel/src/device_interrupt.rs – the request-queue MSI-X route record.
- schema/capos.capnp (interface BlockDevice) – the
readBlocks/writeBlocks/info/flush contract.
- docs/dma-isolation-design.md – the DMA backend and isolation model the
userspace successor binds into.