Batch-Aware NVMe Scheduling
The GPU waits for the last read in a KV-cache recall batch. These small traces show how FIFO leaves foreign work ahead of that tail, and how a request-aware controller can pull the critical reads forward without changing the total work. This is intentionally a single-device, visible-window sketch, not a full channel-aware controller model. Pick a trace below and step through each tick to see where the tail moves.
NVIDIA's Dynamo serving stack has a four-tier memory model for inference. G1 is GPU HBM, where active KV cache lives during generation. G2 is host DRAM. G3 is local NVMe. G4 is remote shared storage. The KV Block Manager moves blocks between these tiers, and NIXL handles the data transfers. The reason for all this plumbing: KV cache per token is about 0.14 MB for an 8B-class model and 0.31 MB for a 70B model (per LMCache's FAQ). At 128K context on the 70B, that's ~40 GB of state per conversation, and it gets worse with concurrency.
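The sizing arithmetic above is easy to sanity-check. A minimal sketch, using only the per-token figures quoted from LMCache's FAQ (0.14 MB/token for 8B-class, 0.31 MB/token for 70B); `kv_cache_bytes` is a hypothetical helper, not part of any real API:

```python
MB = 1e6  # decimal megabytes, matching how the per-token figures are quoted

def kv_cache_bytes(tokens: int, mb_per_token: float) -> float:
    """Total KV-cache footprint for one conversation."""
    return tokens * mb_per_token * MB

ctx = 128 * 1024  # 128K-token context
for name, mb_tok in [("8B", 0.14), ("70B", 0.31)]:
    gb = kv_cache_bytes(ctx, mb_tok) / 1e9
    print(f"{name}: {gb:.1f} GB per conversation")
```

For the 70B figure this lands at roughly 40 GB, matching the estimate above; and since the footprint scales linearly with concurrent conversations, a handful of long-context requests already exceeds what HBM or even host DRAM can hold.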
At GTC 2026, NVIDIA added a fifth tier, slotted between G3 and G4: G3.5. This is the CMX platform. Ethernet-attached flash, fronted by BlueField-4, accessible across nodes through DOCA Memos. Where G3 is a local NVMe drive tied to one machine, G3.5 is a shared flash pool that multiple nodes can reach for long-context KV cache.
G3.5 is what got me thinking about this. If shared flash is going to hold KV cache for many concurrent requests, their recall reads will routinely overlap on the same devices. What does the device controller do when that happens?
Here's what happens. When concurrent requests need evicted KV blocks, their reads land in the same device queue and fan out across NAND dies. The dies work in parallel, so a recall finishes when the slowest die drains. The GPU can't move on until every read completes. The tail determines the wait. And FIFO doesn't know which request owns which read. It can't tell a decode-blocking read from a speculative prefetch.
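That tail dynamic is small enough to simulate directly. A toy single-device model, in the spirit of the traces above but with made-up reads and service times: each die is a FIFO queue that drains in arrival order, and a request completes when its slowest read finishes:

```python
from collections import defaultdict

def fifo_completion(reads):
    """reads: list of (request_id, die, service_time) in arrival order.
    Returns {request_id: completion_time} under per-die FIFO service."""
    die_clock = defaultdict(float)  # when each die next becomes free
    done = defaultdict(float)       # per-request tail: max read finish time
    for req, die, svc in reads:
        die_clock[die] += svc       # FIFO: each read waits for the die to drain
        done[req] = max(done[req], die_clock[die])
    return dict(done)

# Request A's recall interleaves with foreign request B on die 0:
trace = [("B", 0, 3.0), ("A", 0, 1.0), ("A", 1, 1.0), ("B", 1, 1.0)]
print(fifo_completion(trace))
```

In this trace, A's read on die 1 finishes at t=1, but A still waits until t=4 because B's long read sits ahead of A's tail on die 0. FIFO has no way to know that A's reads are decode-blocking and B's are not.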
The visualization above shows this at small scale. Three dies, a few reads per request, traces built by hand. You can step through each scheduler tick by tick. The three cases show tail blocking from interleaved reads, priority between demand and prefetch, and GC timing during decode-critical recall.
The batch-aware scheduler does the same total work. Same reads, same dies, same service times. It only changes the order, and that's enough. The completion gap comes from rearranging which reads go first on contended dies, so less foreign work sits ahead of each request's tail. Reordering without reducing work still shifts when a request finishes.
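The reorder-only claim can be demonstrated on the same toy trace. A hypothetical batch-aware policy, sketched as a stable sort that pulls the decode-blocking request's reads to the front of each die's queue; same reads, same dies, same service times:

```python
from collections import defaultdict

def complete(reads):
    """Per-die in-order service; returns {request_id: completion_time}."""
    die_clock, done = defaultdict(float), defaultdict(float)
    for req, die, svc in reads:
        die_clock[die] += svc
        done[req] = max(done[req], die_clock[die])
    return dict(done)

def batch_aware(reads, critical):
    """Reorder so reads owned by `critical` run first on each die.
    Stable sort: relative order is otherwise preserved."""
    return sorted(reads, key=lambda r: r[0] != critical)

trace = [("B", 0, 3.0), ("A", 0, 1.0), ("A", 1, 1.0), ("B", 1, 1.0)]
print(complete(trace))                    # FIFO order
print(complete(batch_aware(trace, "A")))  # A's reads pulled forward
```

Under FIFO, A finishes at t=4; with A's reads pulled forward, A finishes at t=1 while B slips from t=3 to t=4. The busy time on each die is identical in both runs; only the order changed, and that alone moved A's tail.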