# virtio-blk (modern PCI block device)

This is a provenance map for the in-tree virtio-blk driver: it cites the spec,
summarizes only the wire-format subset the code actually implements, and points
into the implementation. It is not a re-spec -- where the spec is implemented
unchanged it links rather than transcribes. The driver was the first real
`BlockDevice` `CapObject`; the treatment here is a concise map rather than
exhaustive register tables. It reuses the modern split-ring transport seam
introduced for virtio-net ([`virtio-net`](virtio-net.md)); this page covers only
the block-specific additions.

**Status: QEMU fixture, not the production storage route.** The kernel-owned
virtio-blk driver, its `BlockDevice` cap arm, and its PCI discovery are all
gated behind the `qemu` cargo feature (`diagnose_qemu_virtio_blk` in
`kernel/src/pci.rs`; the `BlockDeviceBackend::Virtio` arm in
`kernel/src/cap/block_device.rs`). The default non-`qemu` production kernel never
enumerates, claims, or binds virtio-blk, and its `block_device` grant source
resolves to the userspace-brokered NVMe `BlockDevice` arm
(`BlockDeviceBackend::NvmeBrokered`) instead, failing closed when no verified
NVMe controller and live `device_mmio` grant are present. virtio-blk remains as a
named local fixture / regression test only -- a fully QEMU-emulable end-to-end
`BlockDevice` proof and the substrate the storage-layer (read-only / persistent /
writable filesystem) QEMU proofs read through. It should not be read as a
prospective production driver. The kernel broker responsibilities it exercises (PCI claim
arbitration, MMIO/IRQ/DMA admission, bounce/IOMMU isolation, stale-generation
rejection, and revocation) are the same ones the production userspace storage
driver binds into; see [§3 capOS mapping](#3-capos-mapping).

The driver (`VirtioBlkDriver`) lives in the virtio-blk section of
`kernel/src/virtio.rs`; the cap surface (`BlockDeviceCap`) lives in
`kernel/src/cap/block_device.rs`.

## 1. Spec basis

- **Device**: virtio block device, modern (virtio 1.x) PCI transport.
  PCI vendor `0x1af4`; device `0x1042` (modern) / `0x1001` (transitional).
  IDs at `kernel/src/pci.rs` (`VIRTIO_VENDOR_ID`,
  `VIRTIO_BLK_MODERN_DEVICE_ID`, `VIRTIO_BLK_TRANSITIONAL_DEVICE_ID`; matched by
  `PciDevice::is_virtio_blk`). Up to `device_dma::MAX_VIRTIO_BLK_DEVICES`
  functions are bound, each in its own const-generic driver slot
  (`VIRTIO_BLK_DRIVER_0` / `VIRTIO_BLK_DRIVER_1`) so the two devices cannot
  alias DMA or queue state. The target disk is selected by manifest PCI identity;
  the ordinary boot/storage disk resolves to the non-target disk when both are
  present.
- **Authoritative spec**: *Virtual I/O Device (VIRTIO) Version 1.2*, OASIS
  Committee Specification 01 (2022-07-01).
  Source: <https://docs.oasis-open.org/virtio/virtio/v1.2/virtio-v1.2.html>.
  Relevant sections: 4.1 (virtio over PCI bus), 2.7 (split virtqueues),
  5.2 (block device).
- **Reference**: cross-checked against the Linux `virtio_blk` driver for the
  request framing and the `virtio_pci_modern` modern-transport handshake.
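
The ID match described above can be sketched as follows. This is a minimal illustration, not the in-tree `PciDevice::is_virtio_blk`: the constant names and values come from this page, while the `PciIds` struct and the free-function signature are hypothetical stand-ins for the real `PciDevice` type.

```rust
/// PCI vendor ID shared by all virtio devices.
pub const VIRTIO_VENDOR_ID: u16 = 0x1af4;
/// Modern (virtio 1.x) block device ID: 0x1040 + virtio device type 2.
pub const VIRTIO_BLK_MODERN_DEVICE_ID: u16 = 0x1042;
/// Transitional block device ID.
pub const VIRTIO_BLK_TRANSITIONAL_DEVICE_ID: u16 = 0x1001;

/// Hypothetical stand-in for the in-tree `PciDevice`: only the two ID fields.
#[derive(Clone, Copy, Debug)]
pub struct PciIds {
    pub vendor_id: u16,
    pub device_id: u16,
}

/// Returns true when the function is a virtio block device, modern or
/// transitional. Any other vendor or device ID is rejected.
pub fn is_virtio_blk(ids: PciIds) -> bool {
    ids.vendor_id == VIRTIO_VENDOR_ID
        && matches!(
            ids.device_id,
            VIRTIO_BLK_MODERN_DEVICE_ID | VIRTIO_BLK_TRANSITIONAL_DEVICE_ID
        )
}
```

Under these assumptions a modern virtio-net function (`0x1af4:0x1041`) does not match, so only block functions become candidates for a driver slot.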

## 2. Wire format (implemented subset)

The modern PCI capability parsing, common-config register map, split-ring
descriptor layout, and feature-negotiation handshake are the **shared transport
seam** documented in [`virtio-net` §2](virtio-net.md#2-wire-format-implemented-subset)
(`kernel/src/virtio.rs` `transport` module, `ModernTransport`, `Virtqueue`,
`DescriptorTrackingSlot`). Only the block-specific subset is summarized here.

- **Feature negotiation**: the driver requires and selects only
  `VIRTIO_F_VERSION_1` (`read_device_features` / `write_driver_features` in
  `VirtioBlkDriver::initialize`); a device that does not offer it fails closed
  with `BlkInitError::MissingRequiredFeatures`. No block feature bits
  (read-only, multi-queue, discard, ...) are negotiated, so the device is driven
  as a single read/write request queue.
- **Device config (capacity)**: the block device config space carries the
  capacity in 512-byte sectors as a little-endian `u64`
  (`VIRTIO_BLK_CONFIG_CAPACITY_LEN` = 8 bytes, read low/high in `initialize`).
  A config region shorter than that, or a zero capacity, fails closed
  (`BlkInitError::DeviceConfigTooSmall` / `ZeroCapacity`).
- **Request queue**: a single request virtqueue (queue 0). The negotiated size
  is the largest power of two that exceeds neither the device-advertised
  `COMMON_QUEUE_SIZE` nor `VIRTIO_BLK_REQUEST_QUEUE_SIZE` (`8`); a usable size
  below 4 (one request chain needs 3 descriptors) fails closed. The per-queue
  notify address is computed from `notify_off_multiplier`, as for any modern
  virtio queue.
- **Request framing** (`VirtioBlkDriver::issue_request`): each request is a
  3-descriptor chain over one bounce-buffer page (`ChainSegment`):
  1. **header** -- `VIRTIO_BLK_REQ_HEADER_LEN` (16) bytes, device-readable:
     `type` (`u32` -- `VIRTIO_BLK_T_IN` = 0 read / `VIRTIO_BLK_T_OUT` = 1 write),
     a reserved `u32`, and the `sector` (`u64` LBA), at
     `VIRTIO_BLK_HEADER_OFFSET` (0).
  2. **data** -- `512 * count` bytes at `VIRTIO_BLK_DATA_OFFSET` (512),
     device-writable for reads, device-readable for writes.
  3. **status** -- 1 byte at `VIRTIO_BLK_STATUS_OFFSET` (16), device-writable;
     pre-seeded with `VIRTIO_BLK_STATUS_SENTINEL` (`0xff`) and checked for
     `VIRTIO_BLK_S_OK` (0) after completion (`BlockDeviceRequestError::DeviceStatus`
     otherwise).
- **Completion**: QEMU completes virtio-blk requests synchronously, so the
  driver notifies the queue and **polls the used ring** rather than waiting on
  the request MSI-X interrupt, which is claimed but left masked (see §3). The
  poll (`poll_used_within_ns`) is bounded by the real-time
  `VIRTIO_BLK_COMPLETION_BUDGET_NS` budget, with the
  `VIRTIO_BLK_COMPLETION_FALLBACK_SPIN_LIMIT` spin-count backstop when the
  monotonic clocksource is tick-derived. The bound is time-based because the
  device side includes QEMU's host file I/O, whose latency a raw spin count
  does not track.
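
The queue-size clamp and the single-page request layout described above can be sketched as follows. This is a simplified illustration under stated assumptions, not the in-tree `VirtioBlkDriver::issue_request`: the offsets, lengths, and type codes are the ones listed on this page, while `clamp_queue_size`, `encode_header`, `stage_request`, `VIRTIO_BLK_MIN_QUEUE_SIZE`, `SECTOR_SIZE`, and `PAGE_SIZE` are hypothetical names introduced only for the example.

```rust
pub const VIRTIO_BLK_REQUEST_QUEUE_SIZE: u16 = 8;
pub const VIRTIO_BLK_MIN_QUEUE_SIZE: u16 = 4; // hypothetical name
pub const VIRTIO_BLK_REQ_HEADER_LEN: usize = 16;
pub const VIRTIO_BLK_HEADER_OFFSET: usize = 0;
pub const VIRTIO_BLK_STATUS_OFFSET: usize = 16;
pub const VIRTIO_BLK_DATA_OFFSET: usize = 512;
pub const VIRTIO_BLK_STATUS_SENTINEL: u8 = 0xff;
pub const VIRTIO_BLK_T_IN: u32 = 0; // read
pub const VIRTIO_BLK_T_OUT: u32 = 1; // write
pub const SECTOR_SIZE: usize = 512; // hypothetical name
pub const PAGE_SIZE: usize = 4096; // hypothetical name

/// Largest power of two not above min(device size, driver cap).
/// Returns None when the result cannot hold one 3-descriptor chain.
pub fn clamp_queue_size(device_queue_size: u16) -> Option<u16> {
    let limit = device_queue_size.min(VIRTIO_BLK_REQUEST_QUEUE_SIZE);
    if limit == 0 {
        return None;
    }
    // Round down to a power of two by keeping only the highest set bit.
    let size = 1u16 << (15 - limit.leading_zeros());
    (size >= VIRTIO_BLK_MIN_QUEUE_SIZE).then_some(size)
}

/// 16-byte little-endian header: type (u32), reserved (u32), sector (u64).
pub fn encode_header(req_type: u32, sector: u64) -> [u8; VIRTIO_BLK_REQ_HEADER_LEN] {
    let mut header = [0u8; VIRTIO_BLK_REQ_HEADER_LEN];
    header[0..4].copy_from_slice(&req_type.to_le_bytes());
    // Bytes 4..8 are the reserved field and stay zero.
    header[8..16].copy_from_slice(&sector.to_le_bytes());
    header
}

/// Writes the header and status sentinel into one bounce page and returns
/// the (offset, length) of the header, data, and status segments in chain
/// order. Returns None if the data would not fit in the page.
pub fn stage_request(
    page: &mut [u8; PAGE_SIZE],
    req_type: u32,
    sector: u64,
    count: usize,
) -> Option<[(usize, usize); 3]> {
    let data_len = count.checked_mul(SECTOR_SIZE)?;
    let end = VIRTIO_BLK_DATA_OFFSET.checked_add(data_len)?;
    if count == 0 || end > PAGE_SIZE {
        return None;
    }
    let header_end = VIRTIO_BLK_HEADER_OFFSET + VIRTIO_BLK_REQ_HEADER_LEN;
    page[VIRTIO_BLK_HEADER_OFFSET..header_end]
        .copy_from_slice(&encode_header(req_type, sector));
    // Pre-seed the status byte so an untouched byte is never read as OK (0).
    page[VIRTIO_BLK_STATUS_OFFSET] = VIRTIO_BLK_STATUS_SENTINEL;
    Some([
        (VIRTIO_BLK_HEADER_OFFSET, VIRTIO_BLK_REQ_HEADER_LEN),
        (VIRTIO_BLK_DATA_OFFSET, data_len),
        (VIRTIO_BLK_STATUS_OFFSET, 1),
    ])
}
```

With these definitions `clamp_queue_size(256)` yields `Some(8)` and `clamp_queue_size(3)` yields `None`. The largest `count` that `stage_request` accepts is 7, since `512 + 7 * 512` is exactly one 4 KiB page.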

## 3. capOS mapping

- **Binding (qemu fixture, in-kernel)**: virtio-blk is driven **in the kernel**
  and only under the `qemu` feature. Unlike the userspace storage driver, it does
  not receive `DeviceMmio`/`Interrupt`/`DMAPool` *caps*; instead
  `VirtioBlkDriver::initialize` binds authority through the kernel
  `device_manager` transactions -- `claim_pci_function(.., DeviceOwner::VirtioBlk)`
  then `attach_dmapool_record_with_remapping` / `attach_devicemmio_record` /
  `attach_interrupt_source`. The `BlockDevice` cap is the userspace-facing
  surface; the hardware authority stays kernel-owned. This in-kernel ownership is
  why the driver is kept as a qemu fixture rather than a production route: the
  production `BlockDevice` is served by the userspace-brokered NVMe provider chain
  (`BlockDeviceBackend::NvmeBrokered`, gated on a verified controller and a live
  `device_mmio` grant), where the device-specific protocol logic runs in
  userspace over `DeviceMmio`/`DMAPool`/`Interrupt` caps and the kernel retains
  only broker/admission/isolation/revocation.
- **MMIO**: the modern-transport common/notify/ISR/device-config regions are
  mapped from the device BARs (`map_blk_region` over `pci::map_bar_region`) and
  recorded with `device_manager::attach_devicemmio_record` against the first
  decoded memory BAR. Doorbell (queue-notify) writes are scoped to the per-queue
  notify address computed from `notify_off_multiplier`. The DDF `DeviceMmio` cap
  (`kernel/src/cap/device_mmio.rs`) is the userspace successor surface.
- **Interrupt**: one MSI-X route is registered for the request queue
  (`VIRTIO_BLK_REQUEST_MSIX_ENTRY` = 0, `PciMsixInterruptRole::BlockRequestQueue`),
  claimed (`DeviceInterruptDriver::VirtioBlk`) and attached to the device handle
  for authority binding, but left **masked**: completion is by polled used ring,
  not interrupt delivery. Route records are tracked by the kernel-owned
  device-interrupt ledger (`kernel/src/device_interrupt.rs`).
- **DMA**: each bound device gets its own DMA pool (`device_dma::begin_virtio_blk_pool`,
  keyed by the const-generic `DEV` index via `VirtioBlkDma<DEV>`). Ring pages and
  the request bounce buffer are allocated and accounted through the blk-keyed
  ledger (`allocate_virtio_blk_page` / `register_virtio_blk_queue` /
  `record_virtio_blk_submission`/`..._completion_for_allocation` in
  `kernel/src/device_dma.rs`). DMA uses the manager-owned bounce-buffer backend;
  no host physical address or IOVA is exposed to userspace.
- **`BlockDevice` cap surface**: `BlockDeviceCap` (`kernel/src/cap/block_device.rs`)
  is scoped to one `device_index` and routes the schema's
  `readBlocks`/`writeBlocks`/`info`/`flush` methods
  (`schema/capos.capnp` `interface BlockDevice`) to that device only, failing
  closed when it is not bound. Under the `qemu` feature the `block_device`
  `KernelCapSource` reaches the resolved boot/storage virtio-blk disk, and the
  `block_device_target` source requires `SystemConfig.blockDeviceTarget.pci`
  (`schema/capos.capnp`) and resolves that PCI segment:bus:device.function
  selector to a bound non-boot virtio-blk device; absent, mismatched, or boot-disk
  selectors fail closed. In the production (non-`qemu`) kernel the same
  `block_device` source instead mints the `NvmeBrokered` arm, and
  `block_device_target` fails closed (`requires the qemu feature`). The read-only/
  persistent/writable filesystem and store caps (`readonly_fs`,
  `persistent_store`, `writable_fs`) layer their on-disk formats over whichever
  `BlockDevice` backs the boot/storage cap -- the virtio-blk fixture under `qemu`,
  the brokered NVMe arm in production.
- **Fail-closed / validation rules**: `VirtioBlkDriver::validate_range` rejects a
  zero count, a count over `VIRTIO_BLK_MAX_SECTORS_PER_REQUEST` (7 -- bounded so
  header + status + `512 * count` fit one 4 KiB page), `start_lba + count`
  arithmetic overflow, and any range past the reported `capacity_sectors`, all
  before device access. The cap layer additionally enforces that
  `writeBlocks` data length equals `count * 512`
  (`BlockDeviceRequestError::DataLengthMismatch`). A non-`OK` device status, a
  used-ring poll timeout, or a DMA accounting failure each fail closed
  (`DeviceStatus` / `Completion` / `Accounting`). Descriptor reuse is
  generation-tracked through the shared bounded tracking-slot array.
- **QEMU-emulable vs hardware-only**: fully QEMU-emulable; the `make` targets
  below are the fixture gates. QEMU provides virtio-blk-pci; `make run-virtio-blk` is the
  single-device end-to-end `BlockDevice` fixture, `make run-multi-virtio-blk`
  proves the two-device (boot + target) binding with independent per-device DMA
  pools, `make run-blockdevice-target-identity` proves manifest identity selection
  when PCI/BDF order would otherwise bind the intended target first, and
  `make run-virtio-blk-failover` exercises the multi-device failover path. All are
  `--features qemu` fixtures over dedicated `system-virtio-blk.cue` /
  `system-multi-virtio-blk.cue` / `system-blockdevice-target-identity.cue`
  manifests, not production-storage evidence. No hardware-only path. The
  production-storage gate is the userspace-brokered NVMe `BlockDevice` chain
  (`make run-cloud-provider-nvme-blockdevice-read-graduated` and the other
  `run-cloud-provider-nvme-blockdevice-*` proofs).
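
The range checks listed under the fail-closed rules above can be sketched as follows. This is an illustration of the described ordering only, not the in-tree `VirtioBlkDriver::validate_range`: the free-function signatures, the `u32` count type, the `RangeError` enum, and `write_len_matches` are hypothetical; only `VIRTIO_BLK_MAX_SECTORS_PER_REQUEST` and the rules themselves come from this page.

```rust
pub const VIRTIO_BLK_MAX_SECTORS_PER_REQUEST: u32 = 7;

/// Hypothetical error type; the in-tree errors are not reproduced here.
#[derive(Debug, PartialEq, Eq)]
pub enum RangeError {
    ZeroCount,
    CountTooLarge,
    Overflow,
    OutOfRange,
}

/// Rejects a request before any device access: zero count, oversized count,
/// `start_lba + count` overflow, and any range past the reported capacity.
pub fn validate_range(
    start_lba: u64,
    count: u32,
    capacity_sectors: u64,
) -> Result<(), RangeError> {
    if count == 0 {
        return Err(RangeError::ZeroCount);
    }
    if count > VIRTIO_BLK_MAX_SECTORS_PER_REQUEST {
        return Err(RangeError::CountTooLarge);
    }
    // checked_add turns wraparound into an explicit error instead of a
    // small, in-range-looking end sector.
    let end = start_lba
        .checked_add(u64::from(count))
        .ok_or(RangeError::Overflow)?;
    if end > capacity_sectors {
        return Err(RangeError::OutOfRange);
    }
    Ok(())
}

/// Cap-layer style check: write payload must be exactly `count * 512` bytes.
pub fn write_len_matches(count: u32, data_len: usize) -> bool {
    u64::from(count) * 512 == data_len as u64
}
```

On an 8-sector device, `validate_range(1, 7, 8)` is accepted because the range ends exactly at capacity, while `validate_range(5, 4, 8)` is rejected as out of range.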

## Related

- `kernel/src/virtio.rs` -- the virtio-blk driver (`VirtioBlkDriver`), request
  framing, queue setup, and the shared modern split-ring transport.
- `kernel/src/cap/block_device.rs` -- the `BlockDevice` cap surface
  (`BlockDeviceCap`) routing schema methods to a single bound device.
- `kernel/src/device_dma.rs` -- the per-device virtio-blk DMA pool/queue ledger.
- `kernel/src/device_interrupt.rs` -- the request-queue MSI-X route record.
- `schema/capos.capnp` (`interface BlockDevice`) -- the
  `readBlocks`/`writeBlocks`/`info`/`flush` contract.
- `docs/dma-isolation-design.md` -- the DMA backend and isolation model the
  userspace successor binds into.
