LLM Inference: From Black Box to Production
A ground-up explanation of LLM inference, from black box to production optimizations. Covers tokenization, embeddings, attention, KV cache, memory bottlenecks, batching, PagedAttention, and quantization, using TinyLlama 1.1B as the running example.
What This Post Covers
You type "Write a story" into ChatGPT. A moment later, tokens start appearing: "Once upon a time..."
You've done this enough times to develop an intuition for the rhythm. There's always a pause before the first token. Then tokens stream in steadily, one after another, at a roughly constant rate. Longer prompts mean longer initial pauses. Longer responses take proportionally longer. But the speed of the stream itself barely changes whether you asked for a poem or a legal brief.
Most people stop here. "The model is thinking." That's the beginning of the explanation, not the end. Underneath, there is a concrete, traceable pipeline, and every one of those observable behaviors maps to a specific stage in it. The initial pause, the constant streaming rate, the way everything slows down with bigger models: none of it is mysterious once you see the machinery.
The pipeline itself is simpler than you might expect. Text becomes tokens. Tokens become vectors. Vectors pass through a stack of transformer blocks. Out comes a probability distribution over the vocabulary. Sample one token. Append it. Run the stack again. One token per trip through the full model. If you can trace that loop, you understand inference. Everything else is optimization.
This post builds that loop from scratch. We start with the black box (text in, text out) and open it one component at a time. Each section adds one new piece: tokenization, embeddings, attention, the feed-forward network, the KV cache. No piece arrives without a reason. Part 2 turns to the systems picture: why token generation is bottlenecked by memory bandwidth rather than compute, and how production systems like vLLM work around that constraint.
No machine learning prerequisites. I wrote this to be accessible from the ground up, and we define every term when it first shows up.
Training vs Inference: A Quick Note
You've probably heard about "training" (or "pre-training"), the phase where models learn by adjusting their parameters (backpropagation). Inference is what comes after. The parameters are frozen; we just run forward passes to generate output tokens.
Everything in this post is about inference: running an already-trained model.
Part 1: How Inference Works
Part 1 builds the core pipeline, from black box to transformer block.
A Black-Box View
From a user's perspective, an LLM is just a black box. You type "Write a story" and a moment later see "Once upon a time...". Nothing in the interface hints at tokens, matrices, or GPUs. The mental model is simple: text goes in, text comes out.
It's a simple diagram, but also the whole interface. As you'll see in the upcoming sections, everything else is just opening the box and naming the pieces inside.
Detail 1: Tokens
The model needs numbers, not text. So the first question is: how do we break text into pieces and assign each piece a number?
One idea: individual characters. The alphabet gives you a small, clean vocabulary, around 100 entries counting letters, digits, and punctuation. But sequences get long fast. "Write a story" is 13 characters. A paragraph is hundreds. Each character is a separate step for the model, and each one is nearly meaningless on its own. The letter "t" tells you nothing without the letters around it.
The other extreme: whole words. Shorter sequences, more meaning per unit. But how big is the vocabulary? Including technical jargon, proper nouns, URLs, code, and multiple languages, it explodes. And when the model encounters a word it has never seen, it's stuck.
What you actually want is a unit somewhere in the middle. Bigger than a character, so sequences stay compact and each piece carries meaning. Smaller than a word, so the vocabulary stays finite and novel words can be handled by splitting them into familiar subpieces. Where we see one word "responding," the model sees two pieces: ["respond", "ing"]. Common words get their own entry while rare words get broken into recognizable chunks.
These pieces are called tokens.
A "tokenizer" handles this mapping: text in, integer IDs out, drawn from a fixed vocabulary of subword units. TinyLlama's vocabulary has 32,000 tokens. A prompt like "Write a story" might become:
["Write", " a", " story"] → [8144, 264, 3446]
The vocabulary and split rules are learned once offline (algorithms like BPE scan a large corpus for common subword patterns), then frozen before the model ever trains. At inference time, tokenization is purely deterministic: same input, same output, no model weights involved.
From this point on, the model only sees these integers. The text is gone.
Try it yourself: You can experiment with tokenization at tiktokenizer.vercel.app. Paste in some text and see how different models split it into tokens.
At the output end, a "detokenizer" reverses this: token ID → text chunk.
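To make the round trip concrete, here is a toy tokenizer: greedy longest-match against a tiny, made-up vocabulary. The entries and IDs below are invented for illustration; real tokenizers learn ~32,000 subword entries with BPE and use more sophisticated matching.

```python
# Toy vocabulary: token string → integer ID. All entries and IDs are
# made up for illustration; real vocabularies are learned offline.
VOCAB = {"Write": 8144, " a": 264, " story": 3446, " respond": 501, "ing": 292}
ID_TO_TOKEN = {i: t for t, i in VOCAB.items()}

def tokenize(text):
    ids = []
    while text:
        # Greedy longest-match: take the longest vocab entry that
        # prefixes the remaining text.
        match = max((t for t in VOCAB if text.startswith(t)), key=len, default=None)
        if match is None:
            raise ValueError(f"no token matches {text!r}")
        ids.append(VOCAB[match])
        text = text[len(match):]
    return ids

def detokenize(ids):
    # The reverse mapping: token IDs back to text chunks, concatenated.
    return "".join(ID_TO_TOKEN[i] for i in ids)

print(tokenize("Write a story"))      # [8144, 264, 3446]
print(tokenize(" responding"))        # [501, 292]: " respond" + "ing"
print(detokenize([8144, 264, 3446]))  # "Write a story"
```

Note how " responding" splits into two familiar subpieces, exactly the behavior described above: no vocabulary entry for the whole word is needed.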
But wait, how does the model actually understand what these numbers mean? Token 8144 and token 8143 are adjacent integers, but they could represent completely unrelated words. We need a way to give these numbers meaning.
Detail 2a: Embeddings
After tokenization, we have integers: [8144, 264, 3446]. Clean indices. Easy to store.
But try to do anything useful with them! What's the "average" of the numbers associated with "Write" (8144) and "story" (3446)? The number 5795, which probably maps to some completely unrelated token. The IDs are arbitrary labels. Adding them, comparing them, measuring distances between them: none of it tells you anything about meaning.
The model needs to compute with language. Everything downstream is matrix multiplications, weighted sums, dot products. For any of that to produce sensible results, the numbers representing words need to reflect what the words actually mean. "Lion" and "tiger" should sit close together. "Lion" and "teacup" should be far apart.
An embedding does exactly this. Each token ID maps to a vector of floating-point numbers: a point in a high-dimensional space where the geometry itself encodes meaning.
To build intuition, imagine just 2 dimensions. One axis captures something like "danger level," the other "size." Plot a few animal words:
Lion and scorpion both land high on the danger axis. Elephant and lion both land high on size. The clustering is the point: words that share semantic properties end up in similar regions of the space. The geometry is the meaning.
TinyLlama uses 2,048 dimensions, not 2. So "Write" is not stored as the integer 8144 but as a vector of 2,048 numbers: [0.21, -0.83, 0.45, ..., 0.12]. Every token in the vocabulary gets its own vector. No human designed what each dimension captures. The geometry fell out of training.
If meaning lives in geometry, you can do math on meaning. The direction from "mouse" to "elephant" encodes something like "getting larger while staying harmless." Add that direction to "scorpion" and you land near "crocodile": large and dangerous. Not a metaphor. Literal vector addition, and the fact that it produces sensible results tells you the structure is real.
System implications: With a 32,000-token vocabulary and 2,048-dimensional embeddings stored in half-precision (FP16 or BF16, both 2 bytes per value):
32,000 × 2,048 × 2 bytes = 131 MB
(Throughout this post, MB and GB refer to decimal units: 1 MB = 10⁶ bytes, 1 GB = 10⁹ bytes. This matches how GPU vendors report VRAM and bandwidth specs.)
This embedding table is a matrix of shape [32,000 × 2,048]. When the model sees token ID 8144, it looks up row 8,144 and retrieves that token's vector. That vector, not the original ID, is what flows through the rest of the model.
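The lookup is literally a row index into a matrix. A sketch in NumPy, with random numbers standing in for the learned table:

```python
import numpy as np

# Embedding table with TinyLlama's shapes: 32,000-token vocabulary,
# 2,048-dim vectors. Random stand-in weights; the real values are learned.
rng = np.random.default_rng(0)
embedding_table = rng.standard_normal((32_000, 2_048), dtype=np.float32)

token_ids = [8144, 264, 3446]        # "Write a story"
vectors = embedding_table[token_ids] # row lookup: one vector per token

print(vectors.shape)                  # (3, 2048)
# 262.144 MB in FP32; the 131 MB figure in the text assumes FP16.
print(embedding_table.nbytes / 1e6)
```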
So we have a way in: token IDs become rich vectors. But what about the way out? The model needs to eventually produce a token, not a vector. How do we get from 2,048 numbers back to a word?
Detail 2b: Unembeddings
We solved the input side: each token ID became a rich vector. Those vectors enter the black box, pass through the model's layers, and what comes out is a single 2,048-dimensional vector for the last position. This vector is the model's compressed answer: everything it "wants to say" next, packed into 2,048 numbers.
But a vector is not a word. We need to cross back from vector space to a token the user can read.
Problem: we have 2,048 numbers, and we need to turn them into a single token. How?
Scoring Every Token in the Vocabulary
The vocabulary has 32,000 tokens. We need to decide which one comes next. A natural approach: compute a score for every token. 32,000 candidates, 32,000 scores. The highest-scoring tokens are the ones the model thinks are most likely to follow.
How do we score a token? Think back to embeddings. Every token in the vocabulary has a vector that encodes its meaning. Our output is also a vector, living in the same space. If the output vector aligns well with a particular token's vector, that token is a good match for what the model wants to say next. If it points in a completely different direction, that token is a poor candidate.
So the score for a given token is just a measure of alignment between two vectors: the output and that token's representation. Compute this for all 32,000 tokens and you have your 32,000 scores. In practice, we line up all the token vectors as columns of a single matrix and multiply the output vector by it. One matrix multiplication, all 32,000 scores at once.
This scoring matrix is called the LM head (language modeling head). It's learned during training alongside the rest of the model.
What Do These Scores Mean?
The 32,000 numbers we just computed are not probabilities. They're raw scores that can be any real number, positive or negative. A score of 5.2 for "Once" does not mean 5.2% probability. It just means the model rates "Once" higher than something with score 3.1. Only the relative ordering matters, not the absolute values.
These raw scores are called logits.
Picking a Token: The Naive Approach
We have 32,000 scores. The simplest idea: pick the token with the highest one. If "Once" has 5.2 and everything else is lower, output "Once." Done.
This is called greedy decoding, and it sounds reasonable. But it produces boring, repetitive text. Consider what happens when the model generates "The cat sat on the mat." The same context leads to the same prediction, which leads to the same context, forever: "The cat sat on the mat. The cat sat on the..."
There's a deeper issue too. Language is not deterministic. When you write, you don't always pick the "most likely" next word. Sometimes a surprising word choice takes the sentence in an interesting direction. Greedy decoding can't do this. It always takes the safe, expected path.
The fix: sample from the scores instead of always taking the maximum. Sometimes pick "Once", sometimes "The." Controlled randomness produces more natural, varied text.
But to sample, we need probabilities: numbers between 0 and 1 that sum to 1. A score of 5.2 is not a probability.
From Scores to Probabilities
How do we convert arbitrary scores into a valid probability distribution?
First attempt: divide each score by the sum of all scores. That would normalize them to sum to 1. But some scores are negative. Dividing by a sum that includes negative numbers gives nonsensical results. A negative "probability" is meaningless.
So we need to make all scores positive first, then normalize. Exponentiation does exactly this: e^5.2 ≈ 181 is a large positive number. e^−3.1 ≈ 0.045 is a small positive number. Any real number maps to a positive number through e^x.
Now divide each exponentiated score by the sum of all exponentiated scores. Every value lands between 0 and 1, and they sum to 1. That's a valid probability distribution.
Exponentiation has a useful side effect: it amplifies differences. The gap between e^5.2 ≈ 181 and e^3.1 ≈ 22 is much larger than the gap between 5.2 and 3.1. Strong candidates get boosted, weak ones get suppressed. This makes the distribution peakier, concentrating probability mass on the tokens the model is most confident about.
This operation (exponentiate, then normalize) is called softmax.
After softmax, we have a probability distribution over all 32,000 tokens. "Once" might have probability 0.08, "The" might have 0.12, "In" might have 0.05, and so on.
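The exponentiate-then-normalize recipe fits in three lines. One practical wrinkle worth noting: implementations subtract the maximum logit before exponentiating so e^x never overflows, which leaves the result unchanged.

```python
import numpy as np

def softmax(logits):
    # Subtracting the max is a standard numerical-stability trick;
    # it cancels out in the division and doesn't change the result.
    shifted = logits - np.max(logits)
    exp = np.exp(shifted)
    return exp / exp.sum()

probs = softmax(np.array([5.2, 3.1, -1.0]))
print(probs)        # peaky: the 5.2 logit takes most of the mass
print(probs.sum())  # 1.0
```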
Sampling Strategies
We can now sample from this distribution. But pure random sampling has its own problem: sometimes it picks very unlikely tokens (the ones with 0.001% probability), producing incoherent nonsense. So we use strategies that balance randomness with quality:
- Temperature: Divide the logits by a temperature value before softmax. Temperature > 1 makes the distribution flatter (more random, more creative). Temperature < 1 makes it peakier (more deterministic, more focused). As temperature approaches 0, sampling converges to greedy. Most APIs treat temperature=0 as a special case meaning "always pick the top token" (since the literal math would be division by zero).
- Top-k sampling: Only consider the k highest-probability tokens, then sample among them. This filters out the long tail of unlikely tokens while preserving variety among the good options.
- Top-p (nucleus) sampling: Sort tokens by probability from highest to lowest, then keep only the smallest set whose cumulative probability reaches p (e.g., 0.9). Unlike top-k, this adapts to the distribution's shape. If the model is very confident, only a few tokens are considered. If it's uncertain, more tokens make the cut.
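The three strategies combine naturally into one sampler. This is an illustrative sketch, not any particular library's API; real serving stacks differ in details like filter ordering and tie handling.

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))
    return e / e.sum()

def sample(logits, temperature=1.0, top_k=None, top_p=None, rng=None):
    rng = rng or np.random.default_rng()
    if temperature == 0:
        return int(np.argmax(logits))       # greedy special case
    probs = softmax(np.asarray(logits, dtype=np.float64) / temperature)
    if top_k is not None:
        cutoff = np.sort(probs)[-top_k]     # k-th largest probability
        probs = np.where(probs >= cutoff, probs, 0.0)
    if top_p is not None:
        order = np.argsort(probs)[::-1]     # highest probability first
        cum = np.cumsum(probs[order])
        # Smallest prefix whose cumulative probability reaches top_p.
        keep = order[: int(np.searchsorted(cum, top_p)) + 1]
        filtered = np.zeros_like(probs)
        filtered[keep] = probs[keep]
        probs = filtered
    probs /= probs.sum()                    # renormalize after filtering
    return int(rng.choice(len(probs), p=probs))

logits = [5.2, 3.1, -1.0, 0.5]
print(sample(logits, temperature=0))        # 0: the highest logit wins
```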
The Complete Output Path
That's the full output side. Trace it end to end:
- The model's layers output a 2,048-dim vector for the last position
- The LM head projects it to 32,000 logits
- Softmax converts logits to probabilities
- Sampling picks one token
- The detokenizer converts that token ID back to text
Embedding and the LM head are bookends. One converts a token ID into a rich vector (embedding). The other converts a rich vector back into a token ID (unembedding). The layers between them are where the actual processing happens, and we haven't opened that box yet.
That's both ends of the pipeline. Tokens become vectors on the way in. Vectors become tokens on the way out. The interesting part is what happens in between.
Detail 3: Autoregressive Generation
We now have the full path for producing one next token: tokenize the prompt, embed it, run it through the model's layers, project to logits, sample. One token out.
But the user asked for a story. "Once upon a time there was a dragon" is eight tokens. How does the model get from one to eight?
It runs the pipeline again. After producing "Once," the model takes the full sequence so far, [8144, 264, 3446, 12483] ("Write a story Once"), and feeds it back through all 22 layers. Out comes a new probability distribution. It samples "upon." Appends it. Feeds the now-longer sequence back through. Gets "a." Then "time." Each token requires a separate, complete trip through the entire model.
There is no internal draft. The model is not composing a response and then revealing it word by word. It is discovering the response as it goes. On step 5, it has no idea what step 6 will produce. It can't: step 6's token depends on step 5's choice, which hasn't been made yet.
This one-at-a-time process is called autoregressive generation. "Autoregressive" because each output feeds back as input to the next step. The model is a next-token predictor, called in a loop.
Each iteration of that loop is a forward pass: the full sequence runs through all 22 layers, the LM head produces logits, and sampling picks one token. On generation step N, the model sees P + N − 1 tokens: the original P prompt tokens plus the N − 1 tokens generated so far.
How does generation stop? The vocabulary includes special tokens like <EOS> (end of sequence) or </s>. When the model samples one of these, generation terminates. The model learned during training that these tokens signal "I'm done responding." You can also set a maximum output length, and generation stops when that limit is hit, even if no EOS appeared.
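The whole loop, including the stop condition, can be sketched in a few lines. `model_forward` below is a made-up stand-in for the real forward pass (embed, 22 blocks, LM head, sample); it just emits deterministic IDs so the loop structure is runnable.

```python
# Dummy stand-in for the full forward pass. The real version runs the
# entire model over the whole sequence and samples from the logits.
def model_forward(token_ids):
    return (token_ids[-1] + 1) % 10

EOS = 9  # made-up end-of-sequence token ID

def generate(prompt_ids, max_new_tokens=20):
    ids = list(prompt_ids)
    for _ in range(max_new_tokens):
        next_id = model_forward(ids)  # full pass over the whole sequence
        ids.append(next_id)           # feed the output back as input
        if next_id == EOS:            # stop when the model says it's done
            break
    return ids

print(generate([3, 4]))  # [3, 4, 5, 6, 7, 8, 9]
```

Note that `model_forward` receives the entire, growing sequence on every iteration; that is exactly the redundancy the next paragraphs dissect.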
What's the context window? The context window (or context length) is the maximum number of tokens the model can "see" at once. TinyLlama's is 2,048 tokens. If the prompt plus generated output exceeds that, systems typically reject the input, truncate it, or use approximations like sliding window attention to handle the overflow. The practical effect is always the same: the model loses access to tokens beyond its range. This is why long documents or extended conversations eventually "lose" earlier context.
Now think about what this loop actually costs. Your prompt is 100 tokens and you want a 200-token response. That's 200 forward passes. Pass 1 processes 100 tokens. Pass 2 processes 101. Pass 200 processes 299. Every single pass runs the full model: all 22 layers, all 1.1 billion parameters.
200 forward passes. 220 billion parameter reads. To produce a few sentences.
Those weight reads are a fixed cost per step: the full model gets loaded every time. We address that later with batching (Detail 13) and quantization (Detail 18). But there's a deeper problem. Most of those reads are redundant.
Look at the diagram. Each row is one forward pass. The blue tokens (your original prompt) appear on every single row, unchanged. On step 4, "Write," "a," and "story" are processed for the fourth time. On step 200, they'd be processed for the two-hundredth time. The only thing actually new on each row is the single green token at the end.
Step 150 versus step 149: the model processes 249 tokens. 248 are identical to the previous step. One is new. Everything else is wasted work.
The waste points to a natural split. The prompt tokens are all known before generation starts. They don't change. They could be processed in a single parallel pass, all at once. The output tokens are different: each depends on the previous, so they're inherently sequential.
Serving stacks split them accordingly: prefill processes the entire prompt in parallel, and decode generates tokens one by one. The redundant recomputation within decode gets eliminated by the KV cache. To understand what gets cached and why, we need to look inside a transformer block.
Detail 4: Self-Attention
In Detail 3 we saw the autoregressive loop: embed, process, project to logits, sample, repeat. The "Processing" step was a black box. Before we open it, think about what we actually have at this point.
The embedding table gives each token a rich vector. "Lion" sits near "tiger." "Write" sits near "compose." Geometry encodes meaning. We went from arbitrary integer IDs to vectors the model can compute with.
Those vectors have a problem, though. They're static.
Read this sentence: "A cave bat flew out into the night." Now this one: "She picked up the baseball bat." The word "bat" appears in both. The embedding table gives it the same vector both times. Same 2,048 numbers but two completely different meanings.
The embedding table can't tell the difference. It's a lookup table: token ID in, vector out. No access to the surrounding sentence. The vector it produces for "bat" is a compromise, sitting somewhere between the animal meaning and the sports equipment meaning, fully committing to neither.
I like to think of the embedding table as a dictionary. You look up "bat" and get multiple definitions covering the common uses: a generic sense of what the word may mean. But when you encounter "bat" in an actual sentence, you don't use the generic sense. You resolve it instantly. "Cave" pulls the meaning toward the animal. "Baseball" pulls it toward sports equipment. You collapse the dictionary entry into the one meaning that fits this context, effortlessly, using the surrounding words.
Embeddings can't do this. The embedding for "bat" is the same vector whether the sentence is about caves or baseball.
And the problem goes deeper than ambiguous words. Consider: "The animal didn't cross the street because it was too tired." What does "it" refer to? For us, obvious: the animal, not the street. But the embedding for "it" is a generic pronoun vector. Nothing in it says "I refer to the animal in this sentence." That connection only exists in context, in the specific words surrounding it.
Every token faces this. Each one carries a generic embedding that ignores where it sits and what surrounds it. Something has to take these static vectors and make them contextual: let each token look at the others, find what's relevant, and update its own representation. If embeddings give us dictionary definitions, we need something that gives us contextual definitions.
That something lives inside the "Processing" black box.
Inside the box is a stack of transformer blocks. TinyLlama has 22 of them, each structurally identical but with different learned weights. The embedded vectors enter block 1, flow through all 22 blocks in sequence, and exit into the LM head for next-token prediction.
Each block has two components:
- Self-Attention: where tokens look at each other. This is where the context problem gets solved. Attention lets "bat" examine "cave" and shift toward the animal meaning. It lets "it" look back at "animal" and figure out the reference. Each token scans the sequence and adjusts its representation based on what it finds relevant.
- Feed-Forward Network (FFN): where each token is processed on its own. After attention gathers context, the FFN takes each token's now-enriched vector and transforms it through a small neural network. Attention asks: "what context is relevant?" The FFN asks: "given that context, what should I conclude?"
One gathers. The other processes. They alternate, 22 times.
For the next few sections, we focus only on what's inside one block. The rest of the pipeline stays unchanged. Let's start with attention.
Building Attention from First Principles
Back to the "bat" problem. The embedding table gives "bat" a single, generic vector, a compromise sitting somewhere between the animal meaning and the sports equipment meaning. Take the phrase "A cave bat." After embedding, we have three vectors:
embed("A") = [0.04, -0.11, 0.52, ..., 0.03] (2,048 numbers)
embed("cave") = [0.58, 0.39, 0.21, ..., -0.14]
embed("bat") = [0.61, 0.45, -0.33, ..., 0.27]
Each vector encodes that token's meaning in isolation. "bat" doesn't know which meaning applies here. "cave" doesn't know what lives inside it. None of these vectors can see the others.
We need some way for "bat" to look at the other tokens in the sequence, figure out which ones are relevant, and update its own vector accordingly. The embedding space itself gives us a starting point: we built embeddings so that similar meanings live near each other. "Lion" sits near "tiger." "Cave" sits near "underground." If meaning is encoded in direction, then two vectors pointing the same way should be related.
Can we just measure how similar each pair of vectors is, and use that to pull in context?
Measuring Similarity
Given two embedding vectors, how do we express how similar they are as a single number?
Think about what "similar" means geometrically. Two vectors pointing in the same direction carry similar meanings. Two vectors pointing perpendicular to each other are unrelated. The angle between them is the relevant quantity. Small angle, similar meaning. Right angle, no relationship. Opposite directions, opposing meanings.
The cosine of that angle gives a clean score: 1 means same direction, 0 means perpendicular, −1 means opposite. This is called cosine similarity. And the standard way to compute it is the dot product of two vectors divided by their magnitudes.
Attention uses the raw dot product, without the magnitude normalization. Why skip it? Because dot products map directly to matrix multiplications, the operation GPUs are built around. Computing dot products between every pair of tokens is just a single matrix multiply: if X is a matrix where each row is a token embedding, then XXᵀ gives all pairwise similarity scores in one operation. The scaling we'll derive shortly keeps things numerically stable.
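The same claim in code. With random stand-in embeddings, one matrix multiply produces every pairwise score:

```python
import numpy as np

# Rows of X are token embeddings (random stand-ins for "A", "cave", "bat").
# X @ X.T yields an n×n matrix where entry [i, j] is the dot product of
# tokens i and j: all pairwise similarity scores at once.
rng = np.random.default_rng(0)
X = rng.standard_normal((3, 2_048))

scores = X @ X.T
print(scores.shape)  # (3, 3)
# Each diagonal entry is a vector's dot product with itself,
# i.e. its squared magnitude.
print(np.allclose(np.diag(scores), (X * X).sum(axis=1)))  # True
```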
Focus on "bat." Compute its dot product with every token in the sequence:
# How much is the word "bat" related to each of the words in "A cave bat"?
# How related is "bat" with "A"?
score("bat", "A") = embed("bat") · embed("A") = 0.3
# How related is "bat" with "cave"?
score("bat", "cave") = embed("bat") · embed("cave") = 5.8
# How related is "bat" with itself?
score("bat", "bat") = embed("bat") · embed("bat") = 6.1
Look at what these numbers say. "bat" is barely related to "A" (0.3). Highly similar to "cave" (5.8). Highest with itself (6.1), which makes sense: the dot product of a vector with itself is its squared magnitude.
The interesting signal is "cave" at 5.8. Caves are where bats live. Baseball diamonds are not caves. The embedding space encoded that relationship, and the dot product surfaced it.
To turn these raw scores into proper weights, we normalize with softmax (same function from the LM head section) so they sum to 1:
weight("bat", "A") = 0.002
weight("bat", "cave") = 0.425
weight("bat", "bat") = 0.573
Now compute a new vector for "bat" as a weighted sum of all the embeddings:
new_bat = 0.002 × embed("A") + 0.425 × embed("cave") + 0.573 × embed("bat")
Think about what this does. We are reading all the definitions for a word in a dictionary, glancing at the phrase it appears in, and collapsing them into the one meaning that fits. "bat" keeps the largest share of its own embedding (0.573), which is natural. But "cave" contributes 42.5%. That's the context that pulls "bat" toward its animal meaning: bats live in caves, baseball bats do not. "A" contributed nearly nothing.
"bat" started as a generic embedding. A compromise. After this weighted sum, "cave" pulled the vector toward the animal cluster. The result is no longer a generic dictionary definition. It's a contextual definition: "bat, the cave-dwelling animal."
(In reality, these embeddings live in 2,048 dimensions, not the 2D picture you might be imagining. The shift happens across many dimensions simultaneously. But the intuition of "movement toward the right meaning" holds.)
We do this for every token position. Each token computes similarity scores against all others, normalizes them, and takes a weighted sum. After this operation, every token's vector carries context from the rest of the sequence.
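Everything in this section, scores, softmax, weighted sum, fits in a few lines of NumPy. Random stand-in embeddings and a tiny dimension keep it readable:

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

rng = np.random.default_rng(0)
X = rng.standard_normal((3, 8))     # 3 tokens, 8 dims (2,048 in reality)

scores = X @ X.T                     # pairwise dot products, shape (3, 3)
weights = softmax(scores, axis=-1)   # each row sums to 1
contextual = weights @ X             # each token's weighted sum of all embeddings

print(weights.sum(axis=-1))          # [1. 1. 1.]
print(contextual.shape)              # (3, 8)
```

Row i of `contextual` is the new vector for token i: its old embedding blended with every other token's, in proportion to the normalized scores.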
This is a perfectly valid form of attention. Each token looks at every other token, measures relevance, and pulls in information proportionally. And in this example, raw dot products captured exactly the right signal. The disambiguation we needed aligned with semantic similarity: "cave" is semantically closer to the animal meaning of "bat" than to the sports equipment meaning. The dot product picked that up. The vector shifted. Problem solved?
Not quite.
Why Raw Dot Products Aren't Enough
Go back to the sentence from earlier: "The animal didn't cross the street because it was too tired." We said the embedding for "it" is a generic pronoun vector with no knowledge of its referent. Now we can be precise about why.
Now think about what the dot product would do here. "it" is a generic pronoun. "animal" is a concrete noun. Their raw embeddings are nothing alike. They appear in completely different linguistic contexts during training. A pronoun's embedding lives in pronoun-land; a noun like "animal" lives somewhere else entirely. The dot product between them will be low.
But "it" needs to attend heavily to "animal." The whole point of attention in this sentence is to figure out the referent of "it." And the raw dot product says these two words are unrelated.
The problem is that the dot product measures one thing: semantic similarity. How often do these words appear in similar contexts? "it" and "animal" fail that test completely. But the relationship between them has nothing to do with semantic similarity. It's about grammatical role. "it" is a pronoun looking for its referent. "animal" is a noun that can serve as one. What we actually want is for "it" to ask "who am I referring to?" and for "animal" to answer "I'm a noun that can be referred to."
That's not a question about what these words mean. It's a question about what these words do in this sentence.
And it's not just pronouns. Think about "quickly" and "ran." Decent semantic similarity, sure, both relate to speed and action. But the reason "quickly" should attend to "ran" is that it modifies "ran": an adverb-verb relationship that's structural, not semantic. Or consider "not" and the verb it negates. "not" is a function word whose embedding sits far from most content words. The dot product will barely register it. But "not" and the verb need to interact intensely, because "not" flips the meaning of whatever it touches.
In each case, the relationship the model needs to capture has nothing to do with what the dot product naturally measures. Raw dot products are locked into whatever similarity the embedding space already encodes. We need the model to learn what to compare: to transform the vectors before comparing them, so that the comparison exposes the right kind of relationship for the task at hand.
Queries, Keys, and Values
So we need learned transformations. Each token gets projected into a new space before comparison, one where "it" and "animal" can score high even though their raw embeddings are dissimilar. The model learns during training which aspects of each embedding matter for matching.
But one transformation isn't enough. Think carefully about what happens when "it" attends to "animal." Two distinct things are going on.
First, a matching step. "it" has a need ("I'm looking for my referent") and "animal" has a profile ("I'm a referent-eligible noun"). The match is good. This produces the attention weight.
Second, a contribution step. Once matched, "animal" shares its actual semantic content, the thing that gets mixed into "it"'s updated representation.
What "animal" signals for matching and what it actually contributes are two different things. The matching profile should highlight "I'm a noun, I'm the subject, I can be referred to." The contribution is the actual meaning: "four-legged creature, living thing, biological entity." These serve different purposes. If a single vector had to do both, those roles would be locked together.
Imagine you're hiring an ML infrastructure engineer. Three candidates sit across from you: an infra engineer with five years at a big tech company, a strong generalist software engineer, and an ML researcher fresh out of a PhD.
The outcome of your evaluation is a weighted blend of what each candidate brings:
hiring_outcome = 0.7 × infra_engineer + 0.2 × generalist + 0.1 × researcher
The infra engineer is the strongest match for your need, so they contribute the most. The generalist brings something useful but less targeted. The researcher contributes least to this particular role.
Focus on one candidate. We need two things: how relevant the infra engineer is to your need, and what they actually contribute if hired.
hiring_outcome = (how_relevant_infra_engineer_is) × what_infra_engineer_actually_brings
Simplify:
hiring_outcome = (weight) × value_of_infra_engineer
This equation has two unknowns: the weight and the value. Your instinct might be that they're closely related. Both involve the infra engineer after all! But they serve fundamentally different purposes.
Look at the hiring process closely. Three distinct things are at play:
(a) Your job spec: "someone who combines ML knowledge with infrastructure experience." Structured for matching against candidates.
(b) Each candidate's resume: the infra engineer highlights cloud deployments, the generalist highlights system design, the researcher highlights ML publications. Optimized for comparison against job specs, not a description of what they'd actually do on the job.
(c) Each candidate's actual contribution if hired: the infra engineer writes the modules, debugs the 3 AM outages, and knows which Kubernetes configs silently break under load. The generalist spots architectural bottlenecks early and writes code that the rest of the team can actually maintain. The researcher catches a flawed training setup before it burns a week of GPU time.
Each candidate is one person, but the process needs three distinct representations: one of your need ("I'm looking for these things"), one of each candidate for matching ("here is what I have that may help you"), and another of the candidate for the actual work ("this is how I actually contribute"). Now we can label our equation properly:
hiring_outcome = match(your_job_spec, candidate_resume) × candidate_actual_contribution
= weight × value_of_infra_engineer
Here, weight comes from comparing (a) against (b). And value_of_infra_engineer corresponds to (c): what that person contributes once chosen. The resume gets them matched; the value is what you actually mix into the outcome.
Same thing with tokens. Each token starts as one embedding vector, but attention needs three forms of it:
- What it's looking for from context (like a job spec). What kind of surrounding information would help disambiguate this token? This is the Query.
- How it describes itself for matching (like a resume). When other tokens are looking around for context, what should they see when they look at this token? This is the Key.
- What it actually contributes when selected (like the real work a hired person does). The content it shares when pulled into the weighted sum. This is the Value.
Why three forms, not one? Same reasoning as the analogy. If a single vector had to serve as both the matching criteria and the actual contribution, those two roles would be locked together. The resume is optimized for matching; the actual work is what matters after the match. Separating them gives the model flexibility to match on one set of features and contribute a different set. The comparison is also asymmetric: your spec matched against Candidate A's resume is a different question than Candidate A's spec matched against yours. Two separate transformations for Query and Key capture directional relationships.
Each form is produced by multiplying the raw embedding by a learned weight matrix:
Query = input × W_Q
Key = input × W_K
Value = input × W_V
Three separate matrices (W_Q, W_K, W_V), all learned during training. The model discovers what aspects of each embedding to expose for each role.
Putting it all together for a given token:
- Compute this token's Query. Compute every token's Key and Value.
- Dot product of this token's Query against each Key to get relevance scores.
- Softmax to normalize scores into weights summing to 1.
- Weighted sum of Values to produce a new, context-aware embedding.
Three learned projections give the model independent control over what to look for (Query), what to advertise for matching (Key), and what to share when selected (Value). We started from raw dot products and arrived here by fixing limitations one at a time. A generic, dictionary-esque meaning goes in. A contextual one comes out.
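The four steps above can be sketched in a few lines of NumPy. This is a toy illustration: the sizes are made up and the projection matrices are random, whereas in a real model they are learned.

```python
import numpy as np

rng = np.random.default_rng(0)
seq_len, d_model, d_head = 3, 8, 4            # toy sizes, not TinyLlama's

X = rng.normal(size=(seq_len, d_model))       # raw token embeddings
W_Q = rng.normal(size=(d_model, d_head))      # three separate projections,
W_K = rng.normal(size=(d_model, d_head))      # random here for illustration
W_V = rng.normal(size=(d_model, d_head))

Q, K, V = X @ W_Q, X @ W_K, X @ W_V

# Attention for the last token:
scores = K @ Q[-1]                            # its Query against every Key
scores -= scores.max()                        # shift for numerical stability
weights = np.exp(scores) / np.exp(scores).sum()   # softmax: weights sum to 1
output = weights @ V                          # weighted sum of Values
```

The result, `output`, is the last token's new context-aware representation: one vector of `d_head` dimensions, blended from every token's Value.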
The Attention Formula
We described the mechanism in four steps. Here it is with actual numbers.
Take "A cave bat." After projecting into Q, K, and V, each token has three 64-dimensional vectors. Focus on "bat" and trace its attention computation.
Matching first. Compute the dot product of bat's Query against every token's Key:
score("bat" → "A") = Q_bat · K_A = 1.0
score("bat" → "cave") = Q_bat · K_cave = 5.0
score("bat" → "bat") = Q_bat · K_bat = 4.0
These are not the raw embedding dot products from earlier (those were 0.3, 5.8, 6.1). Q_bat encodes what "bat" is looking for. K_cave encodes how "cave" advertises itself for matching. The learned projections expose different features than the raw embeddings, which is the whole point of having separate Q, K, V transformations.
Stack all the Queries as rows of a matrix Q and all the Keys as rows of K, and every pairwise dot product falls out of a single matrix multiply: scores = Q × Kᵀ.
For 3 tokens, that's a 3×3 grid. For a 2,048-token prompt, about 4.2 million scores. One operation.
Now these scores need to become proportions that sum to 1. Softmax (same operation from the unembeddings section) handles this:
softmax([1.0, 5.0, 4.0]) → [0.01, 0.72, 0.27]
"cave" gets 72% of the weight. "bat" retains 27%. "A" is nearly invisible at 1%. The distribution does exactly what we wanted: "bat" attends mostly to "cave," pulling its meaning toward the animal sense.
Use those weights to blend the Value vectors:
output_bat = 0.01 × V_A + 0.72 × V_cave + 0.27 × V_bat
The weights decided who matters. The Values carry what they contribute. The result is "bat"'s new representation: no longer the generic dictionary entry, but a context-specific one shaped primarily by "cave."
In matrix notation, the whole process for all tokens at once: output = softmax(Q × Kᵀ) × V.
Match. Normalize. Collect. Three operations, one line. But this formula has a numerical problem.
The Scaling Problem
Our dot products came out small: 1.0, 5.0, 4.0. That's because we used a toy example with a handful of dimensions. TinyLlama's attention heads work in 64 dimensions. A dot product sums one term per dimension. More dimensions, more terms, larger results.
Watch what happens:
With a few dimensions:
scores = [1.0, 5.0, 4.0]
softmax → [0.01, 0.72, 0.27] ← blends three tokens
With 64 dimensions:
scores = [8, 42, 35]
softmax → [0.00, 1.00, 0.00] ← collapsed to one token
Softmax exponentiates its inputs. When the inputs are moderate, the exponentiation spreads weight across multiple tokens. When they are large, it becomes extreme. A gap of 42 vs 35 looks modest, but e^42 is e^7 ≈ 1,100 times larger than e^35. The highest score drowns everything else. Attention degrades from a weighted blend into a hard lookup: pick the top token, ignore the rest.
The whole point of attention is to blend information from multiple positions. Collapse defeats that.
We need to shrink the scores before they hit softmax: divide them by some constant. A fixed constant like 10 would work for one model and break for another with different dimensions. We need something that adapts.
The scores grew because we summed more terms. Call the number of dimensions per head d_head. With d_head = 3, the dot product sums 3 terms and stays small. With d_head = 64, it sums 64 and drifts further from zero. How much further?
Think of each term as a step in a random walk. Each step adds a small random amount. After 3 steps, you haven't gone far. After 64, you've drifted. If each step has variance 1 (a standard assumption for how Q and K components are initialized), then after d_head steps the total variance is d_head. But variance measures spread in squared units. Think of a square with area 64: its side length is 8, not 64. The actual distance the values drift from zero, the standard deviation, is √d_head.
For d_head = 64: √64 = 8. Divide the inflated scores by 8:
Before scaling: [8, 42, 35] → softmax → [0.00, 1.00, 0.00] ← collapsed
After ÷ 8: [1.0, 5.25, 4.38] → softmax → [0.01, 0.70, 0.29] ← blends again
After scaling, the distribution looks like the small-dimensional case. "cave" still dominates (it should, it really is the most relevant context here), but the other tokens retain their ability to contribute when the signal is less one-sided. Scaling doesn't force a uniform distribution. It prevents softmax from collapsing into a hard argmax.
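The collapse and its fix can be checked directly. A quick sketch, using the score values from the text:

```python
import numpy as np

def softmax(x):
    e = np.exp(x - np.max(x))     # subtract max for numerical stability
    return e / e.sum()

raw = np.array([8.0, 42.0, 35.0])        # unscaled 64-dimensional scores
print(softmax(raw).round(2))             # collapses onto the top score

scaled = raw / np.sqrt(64)               # divide by sqrt(d_head) = 8
print(softmax(scaled).round(2))          # blends again: roughly [0.01 0.70 0.29]
```

The unscaled version puts essentially all weight on the 42; after dividing by 8, the runner-up recovers a meaningful share.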
That is the complete attention formula: output = softmax(Q × Kᵀ / √d_head) × V. Every piece maps to something we derived: Q × Kᵀ is the Query-Key matching. Softmax converts scores to weights. V carries the actual contributions. And √d_head keeps the math stable as dimensions grow. No piece is arbitrary.
Each projection is a learned weight matrix. In standard attention with a 2,048-dimensional hidden state:
Q = input × W_Q (2,048 → 2,048)
K = input × W_K (2,048 → 2,048)
V = input × W_V (2,048 → 2,048)
(TinyLlama actually uses smaller K and V projections than this, sharing them across groups of query heads. Detail 6 covers the exact shapes. The attention formula stays the same.)
So far we've described attention with a single set of Q, K, V projections. That works, but it's limiting. A single attention computation can only capture one type of relationship per layer. What if the model needs to simultaneously track syntax, semantics, and position?
Detail 5: Multi-Head Attention
We built attention with one set of Q, K, V projections. Each token asks one question of the sequence, gets one blended answer. For "bat" near "cave," that was enough. One ambiguity, one piece of context needed to resolve it.
Most words in a sentence aren't facing a single ambiguity, though. They're navigating several at once.
One Head, Multiple Needs
Extend the "bat" sentence: "A cave bat flew out and landed on a nearby branch." Focus on "landed."
What does this word need from the rest of the sequence? It needs to know who landed (the bat, not the cave or the branch). It needs to know what preceded it (flying, which tells you this is a physical landing, not a metaphorical one). And it needs where from (the cave, which anchors the spatial story).
Three relationships. Three different tokens carrying the answer. One attention mechanism that must handle all of them through a single set of weights.
The issue is that attention produces one distribution over the sequence, and it sums to 1. That's a budget. If "landed" assigns weight 0.4 to "bat," 0.3 to "flew," and 0.2 to "cave," each signal gets pulled into a single weighted average of Value vectors.
So what does that average look like? The Value for "bat" encodes something about being a subject, an animal. The Value for "flew" encodes something about airborne motion. The Value for "cave" encodes something about an enclosed space. The weighted sum blends all three: 40% subject-animal, 30% motion, 20% enclosed-space. None of the three signals comes through at full strength. The model gets a muddy compromise when what it actually needed were three clean, separate answers.
For this short sentence, the blur might be tolerable. But scale up to real sequences with hundreds of tokens, where the model needs to simultaneously track who is doing what, when, where, and why. One probability distribution, constrained to sum to 1, cannot cleanly separate all of those into independent channels. Spending more weight on one relationship means spending less on another. Something has to give.
The Fix: Ask Multiple Questions
Don't force one set of projections to handle everything. Run attention multiple times in parallel, each time with its own W_Q, W_K, W_V.
Copy 1 learns to ask "who is the subject?" and attends heavily to "bat." Copy 2 asks "what action preceded this?" and locks onto "flew." Copy 3 asks "where did this start?" and finds "cave." Each copy runs the full attention formula independently and produces its own output. Three questions, three answers, no compromise.
Return to the hiring analogy from Detail 4. A single interviewer evaluates candidates against one rubric. If the rubric tries to cover infrastructure, communication, and ML depth all in one score, the result is a blurry average. An interview panel solves this. Each panelist evaluates the same candidates but through a different lens: one scores infrastructure depth, another communication, a third ML intuition. The final hiring decision synthesizes all their evaluations rather than forcing everything through one number.
Each of these independent attention computations is called a head. Running multiple heads in parallel is multi-head attention.
The Cost, and the Trick
An immediate concern: TinyLlama uses 2,048-dimensional vectors. Running 32 separate full-size attention computations means 32 copies of Q, K, V projections, 32 score matrices, 32 weighted sums. That's 32× the parameters and 32× the compute.
The trick: don't run them at full size.
Instead of 32 operations on 2,048-dimensional vectors, run 32 operations on 64-dimensional vectors. Total dimensionality: 32 × 64 = 2,048. Same total. No extra work. We haven't added capacity. We've reorganized it.
d_model = 2,048
n_heads = 32
d_head = d_model / n_heads = 64
In standard multi-head attention, you compute Q, K, V at full width:
Q = input × W_Q (2,048 → 2,048)
K = input × W_K (2,048 → 2,048)
V = input × W_V (2,048 → 2,048)
(If you check TinyLlama's config, you'll notice its K and V projections are actually smaller, 2,048 → 256. That's because it uses grouped-query attention, which we'll cover in Detail 6. For now, we're building the standard picture.)
Each result is a 2,048-dimensional vector per token. Slice each into 32 pieces of 64 dimensions. Piece 1 gets dimensions 1 through 64. Piece 2 gets dimensions 65 through 128. And so on. Each piece is one head's workspace. Run the full attention computation (Q × Kᵀ, scale, softmax, multiply by V) independently on each 64-dimensional slice. Head 1 uses only its slice of Q, K, V. Head 2 does the same with its slice. All 32 heads operate in parallel, each in its own 64-dimensional subspace.
This is where the scaling becomes concrete: each head operates in 64 dimensions, so d_head = 64 and the scale factor is √64 = 8.
The total compute is roughly the same as one full-size attention operation. We haven't added work. Just 32 parallel smaller pieces, each free to specialize.
Each head produces a 64-dimensional output per token. To get back to full width, concatenate all 32 outputs end-to-end: 32 × 64 = 2,048. Then one final learned projection (W_O) mixes information across heads, letting the model combine what different heads discovered into a single output vector.
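The slice-attend-concatenate-mix sequence can be sketched end to end. Toy illustration with random weights; real implementations fuse the per-head loop into batched matrix multiplies:

```python
import numpy as np

rng = np.random.default_rng(1)
seq_len, d_model, n_heads = 5, 2048, 32
d_head = d_model // n_heads                   # 64 dims per head

X = rng.normal(size=(seq_len, d_model))

def proj():                                   # random stand-in for a learned matrix
    return rng.normal(size=(d_model, d_model)) * d_model ** -0.5

W_Q, W_K, W_V, W_O = proj(), proj(), proj(), proj()

# Project at full width, then slice into 32 heads of 64 dims each.
Q = (X @ W_Q).reshape(seq_len, n_heads, d_head)
K = (X @ W_K).reshape(seq_len, n_heads, d_head)
V = (X @ W_V).reshape(seq_len, n_heads, d_head)

outs = []
for h in range(n_heads):                      # each head in its own subspace
    S = Q[:, h] @ K[:, h].T / np.sqrt(d_head)
    W = np.exp(S - S.max(axis=-1, keepdims=True))
    W /= W.sum(axis=-1, keepdims=True)        # row-wise softmax
    outs.append(W @ V[:, h])                  # 64-dim output per head

out = np.concatenate(outs, axis=-1) @ W_O     # 32 × 64 -> 2,048, then mix heads
```

The output has the same shape as the input, one 2,048-dimensional vector per token, exactly what the next layer expects.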
What Heads Learn
Because each head has its own projections, heads can specialize without anyone telling them to. The training signal (predict the next token well) is the only pressure, and it shapes each head into whatever role helps.
If you inspect trained models, you find genuinely different behaviors. Some heads attend almost exclusively to the previous one or two tokens, capturing local patterns like word order and syntax. Some latch onto subject-verb pairs across long distances, tracking grammatical structure. Some jump to delimiters: quotes, parentheses, newlines, the structural markers that organize text. Some attend heavily to the first token regardless of content, a consistent phenomenon researchers call "attention sinks" (not fully understood, but it shows up reliably across architectures).
Nothing forces a head into any particular role. Some heads end up redundant. The point is capacity: the layer is no longer limited to a single attention pattern. Go back to "landed." Head 7 puts 90% of its weight on "bat." Head 15 independently puts 85% on "flew." Head 22 focuses on "cave." Three questions, three clean answers, combined at the end. No blur.
There is a cost to this though. In standard multi-head attention, every head maintains its own Keys and Values, and during inference we store those for every past token. 32 heads means 32 sets of K and V per layer per position. TinyLlama doesn't actually pay this full cost. It uses a variant that shares K/V across groups of query heads, so it only needs 4 distinct sets instead of 32.
Detail 6: Attention Variants (GQA, SWA, MLA)
Detail 5 gave us 32 independent heads. Each one specializes. Each one asks its own question of the sequence. The representations are richer, the blur is gone, and the trick of splitting into smaller subspaces kept the total compute roughly the same.
So what did it actually cost?
What 32 Heads Actually Costs
Let's think about what happens during generation. The model produces tokens one at a time (Detail 3). To generate token t, attention needs the Keys of every previous token to compute similarity scores, and the Values of every previous token to collect information. And that's pretty much what attention does: match against Keys, collect from Values.
Those previous tokens haven't changed. Token 3's Key is the same whether we're generating token 50 or token 500. Recomputing them from scratch at every step would be pure waste. So inference engines store them: compute each token's K and V once, add them to a growing collection, reuse on every subsequent step. This collection is the KV cache. (Detail 10 formalizes it. For now, just know it exists and grows by one entry per token generated).
Now think about what 32 heads means for that store. Each head maintains its own K and V. That's 32 separate K vectors and 32 separate V vectors, for every token, at every layer.
Count the bytes. If TinyLlama used standard multi-head attention (it doesn't, and you'll soon see why):
Per position, per layer:
32 heads × 64 dims × 2 (one K, one V) = 4,096 values
Across all 22 layers, at FP16 (2 bytes each):
4,096 × 22 × 2 bytes = 176 KB per token position
At max context (2,048 tokens):
2,048 × 176 KB ≈ 360 MB per sequence
360 MB. For one conversation. On a 1.1B model. For a 70B model, the KV cache reaches gigabytes per sequence. Serve 50 users at once and the cache alone could exhaust your GPU's memory.
The cost scales directly with the number of heads. 32 heads, 32 copies of K and V at every position and every layer. Do we actually need all 32?
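The byte counts above generalize to any configuration. A small helper, plugged with TinyLlama-like numbers under the hypothetical full-MHA assumption (again: not what TinyLlama actually ships):

```python
def kv_cache_bytes(n_kv_heads, d_head=64, n_layers=22, seq_len=2048, dtype_bytes=2):
    """KV cache footprint: one K and one V per KV head, per layer, per position."""
    per_position = n_kv_heads * d_head * 2 * n_layers * dtype_bytes
    return per_position, per_position * seq_len

per_pos, per_seq = kv_cache_bytes(n_kv_heads=32)   # hypothetical MHA TinyLlama
print(per_pos // 1024, "KiB per position")         # 176
print(per_seq // 2**20, "MiB per sequence")        # 352, the rough 360 MB above
```

Changing `n_kv_heads` to 4 or 1 reproduces the GQA and MQA numbers that appear later in this section.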
Not All Projections Are Created Equal
Let's go back to the three roles from Detail 4. Each token gets projected into three vectors. Q is what a head is looking for. K is how a token advertises itself for matching. V is what it contributes when selected.
In the "landed" example from Detail 5, three heads asked three different questions: "who is the subject?", "what action preceded this?", "where did it come from?" Those are different Queries. Each head needs its own Q to ask its own question. No sharing there.
But think about what K and V represent. They describe the context tokens, the sentence everyone is searching over. And it's the same sentence regardless of which question you're asking. One head hunts for the subject, another for the verb, a third for the location. Different questions, same pool of tokens to search.
Imagine 32 analysts, each researching a different question about the same company. One is analyzing revenue trends, another is evaluating supply chain risks, a third is assessing management turnover. Do they each need their own private copy of the company's financial filings?
No. They can all search the same documents. What matters is that each analyst frames their own research question. The filings don't change depending on who's reading them.
Q and K/V serve different roles. They don't have to scale together. You can have many Query heads (many different questions) while sharing fewer sets of Keys and Values (fewer copies of the documents to search).
The Extreme: One Shared K/V Set
Take this to its logical endpoint. What if all 32 query heads shared a single Key and a single Value? Each head keeps its own Query, so each can still attend to different tokens. But they all look into the same K/V.
MQA for TinyLlama:
(1 K + 1 V) × 64 dims = 128 values per position per layer
128 × 22 layers × 2 bytes = 5.6 KB per token position
At 2,048 tokens: ~11.5 MB per sequence
From 360 MB down to 11.5 MB. A 32× reduction from one change: store 1 K/V set instead of 32.
This is called Multi-Query Attention (MQA).
But there's a cost. All heads now search through the exact same representation of context. Go back to the analysts. With 32 separate filing systems, each analyst could have their documents organized differently: one sorted by date, another by department, another by transaction size. Each organization is tuned for a different kind of question. With MQA, everyone shares one filing system. If one analyst's question would benefit from a different organization, too bad.
MQA can lose quality. For some models and tasks the hit is small. For others, collapsing 32 specialized K/V representations into one generic one costs too much. Is there something between 32 and 1?
Where Most Models Land
Instead of going all the way from 32 K/V sets down to 1, keep a small number in between and divide the query heads into groups. Each group shares K/V.
TinyLlama uses GQA with 4 KV groups:
- 32 query heads ÷ 4 KV groups = 8 query heads per group
- Within each group, 8 heads share the same K and V
- Across groups, the K/V representations are different
This changes the projection shapes. The query projection stays full-width, but K and V shrink by 8×:
W_Q: 2,048 → 2,048 (32 query heads × 64 dims)
W_K: 2,048 → 256 (4 KV heads × 64 dims)
W_V: 2,048 → 256 (4 KV heads × 64 dims)
Each of the 4 KV heads produces a 64-dim Key and Value. Those get broadcast to the 8 query heads in their group. The attention math per head is identical to standard MHA. The only difference is where the Keys and Values come from: shared within the group, not unique per head.
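The broadcast from 4 KV heads to 32 query heads is just a repeat along the head axis. A sketch; real kernels avoid materializing the copies:

```python
import numpy as np

seq_len, n_heads, n_kv_heads, d_head = 10, 32, 4, 64
group = n_heads // n_kv_heads                  # 8 query heads per KV group

rng = np.random.default_rng(2)
k_small = rng.normal(size=(seq_len, n_kv_heads, d_head))   # what's cached
k_full = np.repeat(k_small, group, axis=1)                 # what attention sees

# Heads 0..7 all read KV head 0's Keys; head 8 starts group 1.
assert k_full.shape == (seq_len, n_heads, d_head)
assert (k_full[:, 0] == k_full[:, 7]).all()
assert not (k_full[:, 7] == k_full[:, 8]).all()
```

Only `k_small` lives in the KV cache, which is where the 8× memory saving comes from.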
GQA-4 for TinyLlama:
4 groups × 64 dims × 2 (K+V) = 512 values per position per layer
512 × 22 layers × 2 bytes = 22 KB per token position
At 2,048 tokens: ~45 MB per sequence
8× smaller than standard MHA. Not as extreme as MQA's 32× reduction, but 4 groups can still specialize. Different groups learn to organize context differently (syntax, semantics, positional patterns). You lose some diversity compared to 32 separate representations, but you keep the most important distinctions.
This is Grouped-Query Attention (GQA). Llama 2 70B, Llama 3, Gemma, TinyLlama: they all use it.
All three variants (standard MHA, GQA, MQA) are really points on the same spectrum, controlled by a single number: n_kv_heads, the count of distinct K/V sets.
| Variant | n_kv_heads | Cache per position (TinyLlama) |
|---|---|---|
| MHA | 32 (= n_heads) | 176 KB |
| GQA-4 | 4 | 22 KB |
| MQA | 1 | 5.6 KB |
MHA is one end. MQA is the other. GQA is everything in between. This is an architecture choice baked in during training. By the time you're serving the model, you're living with whatever was chosen.
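The table's per-position numbers fall out of one formula parameterized by n_kv_heads. A sketch with TinyLlama's layer count and head size:

```python
def cache_per_position_bytes(n_kv_heads, d_head=64, n_layers=22, dtype_bytes=2):
    return n_kv_heads * d_head * 2 * n_layers * dtype_bytes   # one K + one V per set

for name, n_kv in [("MHA", 32), ("GQA-4", 4), ("MQA", 1)]:
    kb = cache_per_position_bytes(n_kv) / 1024
    print(f"{name:6} n_kv_heads={n_kv:2} -> {kb:5.1f} KiB per position")
```

This prints 176.0, 22.0, and 5.5 KiB respectively (the table's 5.6 KB figure uses decimal kilobytes).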
A Different Lever: How Far Back to Look
Everything so far has attacked one dimension of the problem: how much K/V data you store per token position. 32 sets, 4 sets, 1 set. We've been shrinking the per-position footprint.
But there's a completely different question: how many past positions do you need to store at all?
Standard attention lets every token attend to all previous tokens. Token 5,000 can look all the way back to token 1. Does it actually need to?
Think about how you read. Right now, understanding this sentence mostly depends on the last few sentences. You need the context of this section, the current paragraph, maybe the heading. You don't need to re-read the opening of the blog post to parse this sentence. Occasionally you do look far back (a term defined much earlier, a running example established at the start), but for most of what you read, recent context dominates.
Sliding Window Attention (SWA) formalizes this observation. Instead of attending to all previous tokens, each token only attends to the nearest W tokens (the "window"), including itself. Everything older is masked out, as if it doesn't exist.
If W = 4,096 and the sequence has reached 10,000 tokens, token 5,000 can attend to tokens 905 through 5,000. Token 1 is invisible to it.
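The visible range is simple arithmetic. A sketch, using 1-indexed token positions as in the example:

```python
def swa_visible_range(t, W):
    """Inclusive range of positions token t can attend to under a window of W."""
    return max(1, t - W + 1), t

print(swa_visible_range(5000, 4096))   # (905, 5000): token 1 is invisible
print(swa_visible_range(3, 4096))      # (1, 3): early tokens still see everything
```

The cache only ever needs to hold the last W positions, which is the entire memory argument for SWA.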
The trade-off is direct. A smaller window bounds memory because the KV cache only needs to hold the last W tokens' worth of K/V, regardless of how long the sequence gets. But it also caps reach: if important context lives 5,000 tokens back and your window is 4,096, you can't directly reach it.
But there's a subtlety. Information can still propagate beyond the window size by flowing through intermediate layers. Token 5,000 can't attend to token 1 directly, but if token 3,000 attended to token 1 in an earlier layer, and token 5,000 attends to token 3,000, the information has traveled indirectly. With window size W and L layers, the effective receptive field is bounded by roughly W × L, and only if intermediate tokens learn to relay the signal forward. In practice this propagation is lossy and unreliable: it is not a substitute for direct global attention.
In practice, many models use SWA in some layers and full attention in others. Gemma 3, for example, uses a 5:1 ratio: five sliding-window layers for every one full-attention layer. The sliding-window layers handle local relationships cheaply. The occasional full-attention layer provides a direct path for long-range information. This hybrid approach gets most of the memory savings while keeping long-range capability.
Notice that this is a fundamentally different lever than GQA. GQA reduces the size of what you store per token position (fewer K/V sets). SWA reduces the number of token positions you store. They're orthogonal, and a model can use both.
Yet Another Lever: Compress What You Store
GQA shares K/V across heads: fewer sets, same size each. SWA limits how many past tokens you keep: fewer positions, same K/V per position. There's a third option: keep all the positions and all the heads, but make each K/V entry smaller.
Multi-Head Latent Attention (MLA), used in DeepSeek V2 and V3, takes this approach. Instead of caching the full K and V vectors for each token, it compresses them into a smaller "latent" vector and avoids reconstructing a full per-head K/V cache on the hot path.
Think of the difference between storing full-resolution photos and storing compressed thumbnails. The thumbnails take far less space. With MLA, you don't reconstruct the full-resolution photo at runtime; you compute attention directly on the thumbnails.
Concretely, the exact savings depend on the latent rank and head layout. In the DeepSeek V2 paper's large-model comparison, the KV cache drops from 860.2K elements per token under an MHA baseline to 34.6K under MLA: about 25× fewer elements, or a 93.3% reduction.
Quality trade-offs are different from GQA's. GQA saves memory by making heads share identical K/V vectors. That's a hard constraint: heads that would benefit from seeing different K/V representations are forced to see the same one. MLA stores a shared latent instead. Each head can still recover its own K and V from that latent via its own learned decompression matrix, so you store one small latent per position while keeping per-head specialization.
In DeepSeek V2's ablations, MLA outperforms GQA at similar cache sizes. That matches the constraint story: GQA forces heads to share identical K/V, while MLA reconstructs head-specific K/V from the latent. Against standard MHA (where each head already has its own full, uncompressed K/V), MLA doesn't have an inherent quality advantage, but it can get close at a fraction of the memory cost.
You might expect a compute trade-off: reconstructing K and V from the latent for every past token sounds like it would add an expensive matrix multiply per layer per token. But MLA avoids an explicit full decompression pass through a linear algebra trick. Since the decompression is just a matrix multiply, and matrix multiplication is associative, you can absorb the decompression matrices into the Query and Output projection weights offline. Instead of reconstructing full K from the latent and then computing Q·K^T, you fold the key-side decompression into Q's projection. Same idea on the output side: fold the value-side decompression into the output projection. On the hot path, there is no separate "expand the whole cache back to per-head K/V" step.
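The absorption argument is just associativity of matrix multiplication, which a numeric sketch can confirm. Toy shapes; `W_uk` is a name invented here for the key-side decompression matrix:

```python
import numpy as np

rng = np.random.default_rng(3)
d_model, d_latent, d_head, n_past = 32, 8, 16, 5   # toy sizes

x = rng.normal(size=(1, d_model))          # current token's hidden state
C = rng.normal(size=(n_past, d_latent))    # cached latents for past tokens
W_q = rng.normal(size=(d_model, d_head))   # query projection
W_uk = rng.normal(size=(d_latent, d_head)) # decompression: latent -> full K

# Naive path: decompress every cached latent back to full K, then match.
K_full = C @ W_uk
scores_naive = (x @ W_q) @ K_full.T

# Absorbed path: fold W_uk into the query projection offline.
W_q_absorbed = W_q @ W_uk.T                # (d_model, d_latent), precomputed once
scores_fast = (x @ W_q_absorbed) @ C.T     # attention directly on the latents

assert np.allclose(scores_naive, scores_fast)
```

Both paths produce identical scores, but the absorbed path never materializes `K_full`, so the per-head cache expansion disappears from the hot path.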
There is one wrinkle. The absorption trick works because decompression is the same matrix multiply regardless of token position. But modern transformers also inject positional information into Q and K, a position-dependent operation that varies per token (Detail 7 covers exactly how). A position-dependent operation cannot be folded into a static weight matrix. If you tried to compress everything into the latent, the position-dependent part would block the absorption.
DeepSeek V2's solution is to decouple the key into two pieces. The content component (what the token means, independent of where it sits) goes into the compressed latent. This is the part the absorption trick handles. The positional component, a small RoPE-carrying slice, stays uncompressed and is stored separately alongside the latent in the KV cache. It is shared across heads for that positional path, which keeps the overhead small.
The trade-off is the mirror of GQA's:
- GQA saves memory by sharing K/V across heads (fewer copies, full size)
- MLA saves memory by compressing K/V per position (all copies, smaller size)
- GQA sacrifices per-head specialization; MLA preserves it and, thanks to the absorption trick, adds no extra compute during inference
The Three Levers
Every attention variant we've seen pulls one of three levers to reduce KV cache memory:
KV cache memory ∝ n_layers × seq_len × n_kv_heads × d_head × 2 (K,V)
| Lever | What it targets | Examples |
|---|---|---|
| Share K/V across heads | ↓ n_kv_heads | GQA, MQA |
| Limit attention range | ↓ effective seq_len | SWA |
| Compress K/V entries | ↓ effective d_head | MLA |
These levers are orthogonal. A model can combine them: GQA to share K/V across heads, SWA on most layers to cap context length, and full attention on a few layers for long-range access. Real models do exactly this.
For a Llama 2 70B-like configuration at 2,048 tokens:
- Standard MHA: ~5.4 GB per sequence
- With GQA-8 (what it actually uses): ~0.67 GB per sequence
That 8× reduction is the difference between serving 1 user and serving 8 on the same GPU. These are architecture choices baked in at training time, but understanding them explains why some models are cheap to serve and others are expensive.
But there's something we skipped. Everything about attention so far has been about what tokens mean. None of it knows where tokens are.
Detail 7: Positional Encodings with RoPE
We just built a mechanism where every token examines every other token, measures relevance through learned Q/K projections, and pulls in context proportionally. It resolved "bat" near "cave." It let "it" find its referent.
But look at these two sentences:
"The dog bit the man." "The man bit the dog."
Very different meanings. One is a news story, the other is a bizarre headline. Let's trace what attention actually does with each.
Both sentences contain the same tokens. The embedding table gives "dog" the same vector in both sentences. Same for "man." Same for "bit." Embeddings are a lookup: token identity in, vector out. Position plays no role.
The Q, K, V projections are linear functions of those embeddings. "Dog" produces the same Query vector in both sentences. The same Key. The same Value. So does "man." So does "bit."
Now look at the attention scores. The score between "dog" and "man" is a dot product: Q_dog · K_man. That number depends on what those vectors contain. Not on where in the sequence they sit. Swap the positions of "dog" and "man" in the input, and Q_dog · K_man is the exact same number. The attention weights come out the same up to that same swap. The weighted sums do too.
So the two sentences do not produce different internal computations that reflect the different meanings. They produce the same attention computation up to a permutation of positions. The model has no built-in notion that one ordering is "dog bit man" and the other is "man bit dog."
We built a mechanism that finds relevant context but is blind to where that context sits. Attention treats its input as a set of tokens, not a sequence. Word order is invisible.
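This position-blindness can be verified directly: permute the input rows, and the attention outputs are the same rows, permuted. A sketch of full (unmasked) attention with no positional encoding:

```python
import numpy as np

rng = np.random.default_rng(4)
seq_len, d = 4, 8
X = rng.normal(size=(seq_len, d))                    # a toy "sentence"
W_Q, W_K, W_V = [rng.normal(size=(d, d)) for _ in range(3)]

def attend(X):
    S = (X @ W_Q) @ (X @ W_K).T / np.sqrt(d)
    W = np.exp(S - S.max(axis=-1, keepdims=True))
    W /= W.sum(axis=-1, keepdims=True)               # row-wise softmax
    return W @ (X @ W_V)

perm = np.array([2, 0, 3, 1])                        # reorder the "words"
# Same outputs, just shuffled: attention never noticed the reordering.
assert np.allclose(attend(X)[perm], attend(X[perm]))
```

Attention here is permutation-equivariant: reordering the tokens reorders the outputs and changes nothing else, which is exactly the word-order blindness described above.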
The First Attempt: Stamping Each Position
The most natural fix: give each position a unique identity. Token 0 gets a vector p_0, token 1 gets p_1, and so on. Add these to the token embeddings before block 1. Now "dog" at position 1 carries a slightly different vector than "dog" at position 4. The Q and K projections produce different results. The dot products change. Order becomes visible.
This is exactly what the original Transformer paper (Vaswani et al., 2017) did. They used fixed sinusoidal patterns: position-dependent sine and cosine values added to embeddings at the input. GPT-2 replaced the fixed patterns with learned vectors, one per position, but the idea is the same. Encode position once, at the input, by adding something to the embedding.
This works. But there are two structural problems.
The encoding is absolute. Token 10 always gets the same position stamp, whether the token it's attending to sits at position 9 or position 500. But what often matters for attention isn't "I'm at position 10." It's "the token I care about is 3 positions away." Relative distance. The model can learn to extract relative distance from absolute positions (if one token is at 10 and another at 15, their difference is 5), but it has to discover subtraction from the raw data. The relationship isn't built into the encoding. It has to be learned.
The signal fades. The position vectors are added once, before block 1. Then the data passes through 22 layers of attention mixing and FFN transformations. Each layer's attention blends vectors from different positions, smearing the position signal. Each FFN applies a nonlinear transformation that can further reshape it. By block 22, the original position stamp may be diluted or distorted.
What we actually want has two properties. First: the attention score between two tokens should depend on the distance between them, not their absolute positions. "Dog" attending to "man" three positions away should produce the same score whether they sit at positions (2, 5) or (100, 103). Second: the position signal should be injected fresh inside each attention layer, not once at the start.
What Kind of Operation Gives Us This?
Think about the constraint precisely. We want to apply some function to Q and K based on their positions, such that when we take their dot product, the result depends on the position difference and nothing else.
Call the function f. We apply f to the Query q at position m, and to the Key k at position n. Their dot product should contain m − n, but not m or n independently.
Does addition work? If you add position vectors, so that q becomes q + p_m and k becomes k + p_n, the dot product becomes q · k + q · p_n + p_m · k + p_m · p_n. Look at those last three terms. They involve p_m and p_n in complicated, entangled ways. Not just their difference. Addition doesn't give us relative position cleanly.
So what operation does?
Picture two clock hands on the same face. Each hand points in some direction: that's its "content" (what the token means). Now spin the entire clock by some angle. Both hands rotate. The angle between the two hands stays the same. The absolute orientation of each hand changed, but their relative angle is invariant.
That invariance is precisely the property we need. If you rotate one vector by mθ (proportional to its position m) and the other by nθ, the angle between them contains (m − n)θ. The individual positions cancel. Only the difference survives.
Tracing Rotation in 2D
Let's make this concrete. Take a Query q at position m and a Key k at position n, both 2D vectors. To encode position, we rotate each vector by an angle proportional to its position index. The rotation uses the standard 2D rotation matrix:

R(α) = [ cos α   −sin α ]
       [ sin α    cos α ]

Apply R(mθ) to q and R(nθ) to k, where θ is a fixed frequency. Then compute the dot product of the rotated vectors.
Here is the key result. The trig identities behind rotation cause the absolute angles to cancel. What remains is:

R(mθ)q · R(nθ)k = |q| |k| cos(α − β + (m − n)θ)

where α and β are the original angles of the unrotated vectors. Position enters as (m − n). Not m. Not n. Just the difference.
Let's verify this with real numbers. Set θ = 1 radian per position. Take q = (1, 0) at position 3, and k = (1, 0) at position 5.

Without any rotation, the dot product is 1.0. Pure content similarity, no position information.

With rotation: q gets rotated by 3 radians, and k by 5 radians. Both vectors change. Their dot product comes out to cos(3 − 5) ≈ −0.416. The value now encodes both content similarity and the fact that these two tokens sit 2 positions apart.

Now move the same tokens to positions (100, 102). Same content. Same distance of 2. q rotates by 100 radians, k by 102. Completely different absolute rotations. But the relative rotation between them is still 2 radians. And the dot product? cos(−2) ≈ −0.416. Identical.
Two tokens at the same relative distance produce the same attention score regardless of where in the sequence they sit. That is the property we wanted, and it falls out of the rotation math rather than being approximated.
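The cancellation is easy to check numerically. Here is a minimal sketch with illustrative values (θ = 1, unit vectors for content): rotate a Query and Key by angles proportional to their positions and confirm that only the position difference affects the dot product.

```python
import numpy as np

def rotate(v, angle):
    """Rotate a 2D vector by `angle` using the standard rotation matrix."""
    c, s = np.cos(angle), np.sin(angle)
    return np.array([c * v[0] - s * v[1], s * v[0] + c * v[1]])

theta = 1.0                       # one fixed frequency (illustrative)
q = np.array([1.0, 0.0])          # Query content
k = np.array([1.0, 0.0])          # Key content

# Same relative distance of 2, very different absolute positions.
near = rotate(q, 3 * theta) @ rotate(k, 5 * theta)
far = rotate(q, 100 * theta) @ rotate(k, 102 * theta)

print(near, far)                  # both equal cos(-2): same score, any offset
```

Shifting both positions by the same amount changes every intermediate vector but leaves the score untouched.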
This is Rotary Position Embedding (RoPE), the position encoding used by Llama-family models including TinyLlama. The name captures the mechanism: position is encoded through rotation, not addition.
Scaling to 64 Dimensions
A single rotation angle works in 2D. But TinyLlama's Q and K heads are 64-dimensional vectors. You can't rotate a 64D vector with one angle.
RoPE's solution: treat the 64 dimensions as 32 independent pairs. Dimensions (1, 2) form the first pair. Dimensions (3, 4) form the second. And so on, through pair 32 at dimensions (63, 64). Each pair is its own 2D plane, and each gets its own rotation at its own frequency.
Why different frequencies? Think of a clock with 32 hands, each turning at a different speed. A second hand rotates fast: it completes many full rotations even over short position intervals, so it distinguishes nearby positions with high precision. Is this token 1 or 2 positions away? The fast hand can tell. An hour hand rotates slowly: it barely budges between adjacent positions, but over hundreds of positions, it provides coarse, long-range orientation. Is this token roughly 100 or 200 positions back? The slow hand captures that.
Reading all 32 simultaneously gives you precise position information at every scale. The low-numbered pairs rotate quickly, capturing fine-grained nearby distinctions. The high-numbered pairs rotate slowly, capturing broad long-range patterns. And the dot-product cancellation that makes rotation relative holds independently within each pair, so the full 64-dimensional dot product encodes relative position at multiple resolutions simultaneously.
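The multi-frequency scheme fits in a few lines. A sketch, assuming the standard Llama RoPE base of 10,000 and an interleaved pairing convention (real implementations differ in how they pair dimensions, but the relative-distance property holds either way):

```python
import numpy as np

head_dim = 64
base = 10000.0   # standard Llama RoPE base (assumed for TinyLlama)

# One frequency per 2D pair: pair 1 spins fastest, pair 32 slowest.
freqs = base ** (-np.arange(0, head_dim, 2) / head_dim)   # shape (32,)

def rope(x, pos):
    """Rotate each adjacent 2D pair of x by pos times its pair frequency."""
    angles = pos * freqs
    cos, sin = np.cos(angles), np.sin(angles)
    out = np.empty_like(x)
    out[0::2] = x[0::2] * cos - x[1::2] * sin
    out[1::2] = x[0::2] * sin + x[1::2] * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.standard_normal(head_dim), rng.standard_normal(head_dim)

# Same distance of 3, different absolute positions: scores match.
s1 = rope(q, 10) @ rope(k, 13)
s2 = rope(q, 500) @ rope(k, 503)
print(s1, s2)
```

Because the cancellation holds independently in each of the 32 planes, the full 64-dimensional score depends only on the position difference.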
No learned parameters. The rotation angles are deterministic functions of position and dimension index, similar in spirit to the original 2017 sinusoidal scheme, but applied multiplicatively (via rotation) rather than additively (via addition to embeddings). And because the rotation happens inside each attention layer, the position signal is fresh every time. Not a fading residue from the input, but a direct injection at every layer, 22 times.
What Gets Rotated
RoPE applies only to the Query and Key vectors. Not the Values.
This follows naturally from what each piece does. Q and K produce the attention scores: "how much should token attend to token ?" That's where relative position belongs. Whether two tokens should attend to each other depends partly on how far apart they sit.
Values carry the content that gets mixed into the output once the scores are decided. Rotating V would distort the information being passed along without improving the position-dependent matching. Position shapes who talks to whom. It shouldn't warp what gets said.
Why This Matters for Inference
Because RoPE is applied inside each attention layer, the Keys stored in the KV cache (which we'll formalize in Detail 10) already have their positional rotations baked in. When a new token at position t arrives during decode, we only need to rotate the new Q and K vectors for position t. There is no extra cost to replay or recompute position information for cached tokens.
This also explains why positional information doesn't appear in the pipeline diagrams before the transformer stack: in a RoPE-based model, there is no separate "add position" step at the input. Position enters the computation inside each attention layer, 22 times rather than once.
That covers position. But attention is only half of each transformer block. The other half processes tokens individually.
Detail 8: Feed-Forward Network (FFN)
Look at the weighted sum from Detail 4:
output_bat = 0.01 × V_A + 0.72 × V_cave + 0.27 × V_bat
That's what attention produced for "bat." A blend. 72% cave, 27% bat, 1% the article "A." The vector is no longer the generic dictionary entry. It has been pulled toward "cave-dwelling animal" by the weight on "cave." But it is still a weighted average of Value vectors. Nothing more.
Think about what a weighted average can and cannot do. Any weighted sum of vectors lands somewhere inside the space those vectors already span. You can shift the balance, more cave, less bat, but you can't reach a point outside the region they define. If the Value vectors are points on a map, attention slides along the lines between them. It never leaves the roads.
That's a real limitation. "bat" attended heavily to "cave." Good. But the concept "cave-dwelling animal" is richer than any blend of the raw vectors for "cave" and "bat." Gathering context isn't enough. Something needs to process what attention collected and synthesize something new. A blend is a mix, not a conclusion.
Go back to the hiring analogy. Attention was the panel of interviewers, each asking their question, each collecting weighted impressions. But nobody has synthesized "strong infra skills + weak communication + deep ML intuition" into "this person would be great for the platform team but not the customer-facing role." The signal has been collected. Now someone needs to sit down and think about what it means.
So: a per-token processing step. A function that takes each token's blended vector and transforms it, independently, into a representation that no weighted average could produce.
This is the feed-forward network, or FFN. If you want the full MLP story from scratch, I wrote a separate deep dive: Multi-Layer Perceptrons: How Neural Networks Bend Space to See. Here we only need the transformer version: a position-wise neural network applied at every token position.
The Structure
The FFN has three steps. Each one is there for a reason.
Step 1: Expand. The token vector lives in 2,048 dimensions. That's the width of TinyLlama's residual stream, shared by every component in the transformer. But 2,048 dimensions is cramped for the kind of computation we need. Features overlap. The "cave" feature and the "dark" feature and the "underground" feature might share some of the same dimensions, because 2,048 slots aren't enough to give each concept its own dedicated space.
The first matrix multiplication projects the vector up to 5,632 dimensions. Think of it as spreading parts across a large workbench. In the smaller space, components are piled on top of each other and hard to manipulate independently. In the larger space, related but distinct features get separated. The network can work on "cave" without accidentally disturbing "dark."
Step 2: Apply a nonlinearity. This is the step that matters most.
Without a nonlinearity between them, two matrix multiplications compose into a single matrix multiplication. In linear algebra, if you multiply a vector by matrix A and then by matrix B, the result is the same as multiplying by the single matrix BA. The expansion to 5,632 dimensions would buy nothing. You'd get a 2,048 → 5,632 → 2,048 pipeline that collapses into one 2,048 → 2,048 transformation. All that extra room on the workbench, wasted, because the operations are linear and the whole thing simplifies away.
The nonlinearity breaks this collapse. It's a simple elementwise function (in TinyLlama's case, SiLU) applied to each of the 5,632 values independently. But its effect is profound: it makes the two linear operations non-composable. The expansion genuinely creates a richer working space, and the network can perform computations there that no single matrix multiply on the original 2,048 dimensions could express.
This is what lets the FFN go beyond what attention produces. Attention's output is a weighted sum, constrained to the space spanned by the Value vectors. The FFN's nonlinearity lets it reach outside that space. Feature combinations, thresholds, interactions that no weighted average of inputs could generate.
Step 3: Compress. The rest of the model expects 2,048-dimensional vectors. The expansion was working memory, not a permanent change in representation size. The final matrix multiplication projects back down from 5,632 to 2,048, packing the results into the standard width.
Expand, transform, compress. The bottleneck shape is the whole point: create room to do precise work, then pack it back down.
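The expand-transform-compress shape can be sketched directly. A minimal version of the Llama-style gated FFN (three matrices, with SiLU on the gate branch), using toy widths so it runs instantly; TinyLlama's actual widths are 2,048 and 5,632:

```python
import numpy as np

# Toy sizes; TinyLlama's actual widths are d_model=2048, d_ff=5632.
d_model, d_ff = 8, 22

def silu(x):
    return x / (1.0 + np.exp(-x))   # SiLU(x) = x * sigmoid(x)

rng = np.random.default_rng(0)
W_gate = rng.standard_normal((d_model, d_ff)) * 0.1  # expand (gated branch)
W_up = rng.standard_normal((d_model, d_ff)) * 0.1    # expand (linear branch)
W_down = rng.standard_normal((d_ff, d_model)) * 0.1  # compress

def ffn(x):
    """Expand to d_ff, apply the nonlinearity, compress back to d_model."""
    return (silu(x @ W_gate) * (x @ W_up)) @ W_down

x = rng.standard_normal(d_model)
print(ffn(x).shape)   # back to the residual-stream width
```

Without `silu`, the three multiplies would collapse into a single d_model → d_model matrix; the nonlinearity is what makes the wide middle layer do real work.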
Same Function, Every Position
The FFN runs independently at every token position. No token-to-token connections. The same weights, applied to every row of the [seq_len × d_model] matrix.
But "same function" doesn't mean "same result." Consider what happens in "A cave bat." "bat" enters the FFN carrying context about caves (attention weighted "cave" at 72%). "cave" enters carrying its own context (it attended to "bat," pulling toward "a cave where bats live"). Same weights, same architecture, same nonlinearity. Different inputs, different outputs. "bat" exits encoding something closer to "cave-dwelling animal." "cave" exits encoding something closer to "a place with bats." One function, many outcomes.
A useful shorthand: attention is communication, the FFN is computation. Attention routes information between tokens. The FFN processes information within each token.
Where the Parameters Live
The FFN is where the majority of the model's parameters live. For each of TinyLlama's 22 layers:
FFN parameters per layer:
Three matrices of ~2,048 × 5,632 (≈ 11.5M parameters each) ≈ 34.6M parameters per layer
Across all 22 layers: 34.6M × 22 ≈ 761M parameters
That's roughly 2/3 of TinyLlama's total 1.1B parameters. People tend to focus on attention, but the FFN is where most of the model's learned knowledge actually lives. Attention is comparatively parameter-light: its job is routing, not storage.
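The parameter arithmetic above checks out in two lines:

```python
d_model, d_ff, n_layers = 2048, 5632, 22

ffn_per_layer = 3 * d_model * d_ff     # gate, up, and down projections
ffn_total = ffn_per_layer * n_layers
print(ffn_per_layer, ffn_total)        # ~34.6M per layer, ~761M across 22 layers
```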
Now we know both pieces inside a transformer block. How do they wire together?
Detail 9: The Complete Transformer Block
We've been inside a single transformer block for five sections now, pulling apart attention and the FFN. Here's where that block stood before we opened it:
We can now fill it in. Two components, in series:
Attention mixes information across tokens. The FFN processes each token individually. Wire them end to end and a single block works: "bat" enters as a generic embedding, attention shifts it toward "cave-dwelling animal," the FFN refines that into something richer.
But TinyLlama has 22 of these blocks in sequence. Block 1's output feeds block 2, block 2 feeds block 3, all the way through block 22. At that depth, two problems show up that would break the model if left unaddressed.
The Erasure Problem
Think about what happens across multiple layers if each sublayer simply replaces its input with a new vector.
Block 1: attention resolves "bat" from generic to "cave-dwelling animal." The FFN processes that and outputs a new 2,048-dimensional vector. The input embedding is gone. Block 1's output is the only thing that survives.
Block 2: attention takes block 1's output and produces another entirely new vector. The FFN transforms that into yet another. Block 1's specific contribution? Absorbed and rewritten.
By block 5, the representation has been overwritten five times. By block 22, twenty-two times. Each layer discards what came before and reconstructs from scratch. If block 1 captured a subtle signal ("this 'bat' is definitely the animal, not equipment"), that signal has no protected channel through the remaining 21 layers. It survives only to the extent that later layers happen to reconstruct it.
There's a subtler cost. What if block 15 only needs to make a small adjustment, nudging "cave bat" slightly toward "insectivorous cave bat"? Block 15 can't just add a small correction. It has to output a complete 2,048-dimensional vector that preserves everything blocks 1 through 14 built, plus its tiny adjustment. That's asking one layer to reconstruct the entire history just to change one thing. Fragile, and wasteful.
The deeper the stack, the worse this gets. Twenty-two successive rewrites is not a recipe for preserving nuance.
The Fix: Add, Don't Replace
The solution, once you see the problem, is almost obvious. Each sublayer shouldn't replace its input. It should contribute a correction.
output = input + sublayer(input)
The input passes through unchanged. The sublayer's output is treated as a delta, added on top. After block 1, the vector is the original embedding plus block 1's correction. After block 2, it's the original plus two corrections. After all 22, the original embedding plus 22 learned adjustments, all summed together.
Now block 15's job is easy. The original signal and the work of blocks 1 through 14 are already in the vector, flowing through. Block 15 adds its own contribution on top. If that contribution is small, fine. If it's zero (the block learned this token needs no change at this layer), also fine. Nothing from earlier layers gets lost.
This pattern is called a residual connection (or skip connection). The name comes from what the sublayer learns: the residual, the difference between what came in and what should go out, rather than the full output from scratch.
During training, residual connections serve a second purpose: they give gradients a direct path backward through the stack. Without them, gradients passing through 22 layers of transformations tend to shrink toward zero (the "vanishing gradient" problem). The skip connection provides a shortcut. For inference, the benefit is simpler: early work survives.
The Scale Problem
Residual connections solve erasure. They introduce something else.
We keep adding. Each sublayer contributes a correction vector. After 22 blocks, each with two sublayers, that's 44 additive corrections to the running sum. Even if each correction is modest, the accumulated values can drift.
Suppose the original embedding for "bat" has values mostly in the range [-1, 1]. Block 1's attention adds a correction in a similar range. The FFN adds more. After a few layers, some dimensions have accumulated large positive values while others barely moved. By block 22, dimension 47 might sit at 42.3 while dimension 100 sits at -0.8.
The next sublayer takes this as input and multiplies it by a weight matrix. That matrix was trained expecting inputs in a reasonable range. When one dimension is 50× larger than another, the matrix multiply produces wildly lopsided outputs. Training becomes unstable. The model struggles to compute anything meaningful.
The fix: before each sublayer, rescale the vector so all dimensions have comparable magnitudes. Divide each element by a measure of the overall spread, squashing everything back to a consistent scale. TinyLlama uses RMSNorm (root-mean-square normalization): divide each element by the root-mean-square of the full vector. A learned per-dimension scale factor lets the model fine-tune the result, but the stabilization is the point.
One convention worth noting: TinyLlama applies this normalization before each sublayer, not after. Normalize, then feed into attention. Normalize, then feed into the FFN. This is called pre-norm, and it's the standard in modern transformers. (The original 2017 Transformer paper did it the other way, normalizing after each sublayer. Pre-norm turned out to train more stably.)
The Complete Block
Every piece is now motivated. Here is the full data flow through a single transformer block:
- Input arrives (2,048-dim vector per token)
- RMSNorm → Self-attention → Add to input (residual connection)
- RMSNorm → FFN → Add to previous result (residual connection)
- Output passes to the next block
Two sublayers. Two normalizations (one before each). Two residual connections (one wrapping each). The input stays in the running sum, with attention and the FFN layering corrections on top.
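The whole data flow fits in a few lines. A sketch of the pre-norm wiring, with stand-in sublayers (the lambdas are hypothetical placeholders for real attention and FFN functions):

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    """Divide each token vector by its root-mean-square (learned scale omitted)."""
    return x / np.sqrt(np.mean(x * x, axis=-1, keepdims=True) + eps)

def transformer_block(x, attention, ffn):
    """Pre-norm block: normalize, run the sublayer, add the result back. Twice."""
    x = x + attention(rms_norm(x))   # residual connection around attention
    x = x + ffn(rms_norm(x))         # residual connection around the FFN
    return x

# Stand-in sublayers just to show the wiring (hypothetical).
x = np.ones((3, 8))                  # 3 tokens, width 8
out = transformer_block(x, lambda h: 0.1 * h, lambda h: 0.1 * h)
print(out.shape)
```

Note that the input `x` is never replaced: each sublayer only adds a correction on top of the running sum.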
Nothing here is decorative. Remove the residual connections and early work gets erased. Remove normalization and values drift out of range. Remove attention and tokens can't see each other. Remove the FFN and there's no per-token processing.
The Full Stack
TinyLlama repeats this block 22 times. Block 1 feeds block 2, block 2 feeds block 3. The wiring is identical, but the weights differ per layer, so each layer can specialize. After the final block, one last RMSNorm, and the vectors enter the LM head (from Detail 2b) for next-token prediction.
That's the complete pipeline: tokenizer → embedding → 22 transformer blocks (each with RoPE-augmented attention + FFN, wrapped in residuals and norms) → LM head → sampling → detokenizer.
Back to Detail 3's autoregressive loop: every step reprocesses the whole prefix. With attention unpacked across all 22 layers, you can name the waste precisely: recomputing the Q, K, and V vectors for every previous token at every layer, even though those tokens haven't changed.
Detail 10: The KV Cache
Now that we've opened the transformer block, we can name the waste in the autoregressive loop precisely. Let's do that.
Go back to generating "Once upon a time" from the prompt "Write a story." Consider the step just after the model produces "time." The prefix at this point is ["Write", "a", "story", "Once", "upon", "a", "time"]: seven tokens. To produce the eighth token, the model runs all seven through all 22 transformer blocks.
At each layer, each of the seven tokens gets projected into Q, K, and V. Attention runs: token 7's Query dots against all seven Keys, softmax normalizes, the weighted sum collects from all seven Values. The FFN processes each result. Repeat across all 22 layers.
Now look at what actually changed since step 4. One token. "time" is new. The other six are identical to what they were on the previous step. Same inputs, same weight matrices, same outputs. "story"'s Key at layer 12 is the exact same 64-dimensional vector it was last step. We computed it again. For nothing.
And this gets worse with every token generated. Step 100 recomputes projections for 102 tokens when only 1 is new. Step 200: 202 tokens, 1 new. The ratio of useful work to wasted work approaches zero.
We sketched this problem at a high level in Detail 3, but back then the transformer block was still a black box. Now that we've unpacked Q, K, and V (Details 4 through 6), we can ask a sharper question: which of these computations are actually redundant, and which ones do we genuinely need to redo?
What Attention Actually Needs
Focus on a single attention layer during decode. The model is generating token t. Attention needs to answer one question: what should token t attend to?
To answer that, it needs three things:
- Token t's Query: what is this token looking for?
- Every previous token's Key: how does each past token advertise itself for matching?
- Every previous token's Value: what does each past token contribute if selected?
Token t's Query has to be computed fresh. It depends on the representation this token has built up through the preceding layers, which is new at every step.
But the Keys and Values for tokens 0 through t-1? Token 3's Key at layer 7 is determined by token 3's representation entering layer 7 and the weight matrix at that layer. Neither of these changes between step 4 and step 200. The result is identical every time we compute it.
This is the asymmetry that makes caching possible.
Q is ephemeral. Token t uses its Query at step t to scan the sequence, and then that Query is never referenced again. Step t+1 brings a new token with its own Query. No future step will ever need token t's Query. It is a one-time intermediate.
K and V are persistent. Token 3's Key is needed at step 4 (when token 4 queries it), at step 5, at step 6, at step 200. Every future token will dot its Query against token 3's Key and potentially blend token 3's Value into its representation. These vectors are referenced on every future step, and they never change.
The Fix
Once you see the asymmetry, the fix writes itself. K and V for past tokens never change and are needed on every future step. Compute them once, store them, and on every subsequent step read from the store instead of recomputing.
Here is what a single decode step looks like with this change:
- Compute Q, K, V for only the new token (token t) at each layer.
- Append token t's K and V to the stored collection.
- For attention: dot token t's Q against all stored Keys (tokens 0 through t), softmax, weighted sum of all stored Values.
The attention math is unchanged. The same Query dots against the same Keys. The same softmax normalizes. The same weighted sum collects from the same Values. The only difference is where old K/V come from: a memory read instead of a matrix multiplication. The output is bit-for-bit the same.
This persistent store of past Keys and Values is the KV cache.
The attention score computation still scales with prefix length (the new Q dots against all stored Keys). But the projection operations, the matrix multiplications that produce each token's K and V through the learned weight matrices W_K and W_V, happen once per token and then get stored. Those were the redundant operations. They're gone.
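One cached decode step, for a single attention head, looks like this. A simplified sketch (single head, no RoPE, Python lists as the cache) rather than a real implementation:

```python
import numpy as np

def decode_attention(x_new, W_q, W_k, W_v, k_cache, v_cache):
    """One cached decode step: project only the new token, append its K/V,
    then attend over everything stored."""
    q = x_new @ W_q                       # fresh Query for the new token
    k_cache.append(x_new @ W_k)           # the only K projection this step
    v_cache.append(x_new @ W_v)           # the only V projection this step
    K, V = np.stack(k_cache), np.stack(v_cache)
    scores = K @ q / np.sqrt(len(q))      # new Q dots against all cached Keys
    w = np.exp(scores - scores.max())
    w /= w.sum()                          # softmax over all cached positions
    return w @ V                          # weighted sum of all cached Values

d = 64
rng = np.random.default_rng(0)
W_q, W_k, W_v = (rng.standard_normal((d, d)) * 0.1 for _ in range(3))
k_cache, v_cache = [], []
for x in rng.standard_normal((5, d)):     # five decode steps
    out = decode_attention(x, W_q, W_k, W_v, k_cache, v_cache)
print(len(k_cache), out.shape)
```

Each step performs exactly one K and one V projection, regardless of how long the prefix has grown; the cache supplies the rest.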
What Changes in the Pipeline
The cache splits the pipeline into the two modes that Detail 3 previewed: prefill and decode.
During prefill, all prompt tokens are known upfront. The model processes them in a single parallel pass through all 22 layers, computes K and V for every prompt position, and populates the cache. At the end, it produces logits for the last position and samples the first output token.
During decode, the model processes one new token per step. Each step reads the cached K/V for attention, computes fresh Q/K/V for just the new token, appends the new K/V, and samples.
The model still generates one token per forward pass. But each decode step now does dramatically less projection work: one new token's Q/K/V per layer, instead of recomputing the entire prefix. The cache trades memory (storing all those K/V vectors) for compute (not recomputing them).
How Much Memory Does the Cache Take?
The cache stores K and V for every token position at every layer. It grows by one entry per decode step. What does each entry cost?
TinyLlama uses GQA with 4 KV heads, each 64 dimensions:
Per position, per layer:
K: 4 heads × 64 dims = 256 values
V: 4 heads × 64 dims = 256 values
Total: 512 values
At FP16 (2 bytes per value):
Per layer: 512 × 2 bytes = 1 KB per position
All 22 layers: 1 KB × 22 = 22 KB per position
For max context (2,048 tokens):
2,048 × 22 KB ≈ 45 MB per sequence
45 MB for one sequence's cached attention state, on top of the 2.2 GB for model weights. Manageable.
But inference servers don't run one sequence at a time. They batch many requests together (Details 13 and 14 will cover how). Ten concurrent sequences need 450 MB of cache. A hundred need 4.5 GB. The cache grows linearly with both sequence length and the number of active requests. At scale, it becomes the dominant consumer of GPU memory, overtaking the model weights themselves.
This is where Detail 6's architecture choice pays off. If TinyLlama used standard multi-head attention with 32 KV heads instead of GQA with 4, the cache would be 8× larger: ~360 MB per sequence. A hundred concurrent sequences would demand 36 GB of cache alone. GQA's reduction from 32 K/V sets to 4 translates directly into serving more users on the same hardware.
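The per-sequence arithmetic from above, as a reusable calculation (figures land at 44 and 352 MiB, matching the ~45 MB and ~360 MB quoted above up to MB/MiB rounding):

```python
def kv_cache_bytes(n_layers, n_kv_heads, head_dim, seq_len, bytes_per_val=2):
    """K and V per position per layer, summed over layers and positions."""
    per_position = 2 * n_kv_heads * head_dim * bytes_per_val * n_layers
    return per_position * seq_len

gqa = kv_cache_bytes(22, 4, 64, 2048)    # TinyLlama's GQA config
mha = kv_cache_bytes(22, 32, 64, 2048)   # hypothetical 32-KV-head variant
print(gqa // 2**20, mha // 2**20)        # cache size in MiB, per sequence
```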
The KV cache kills the redundant recompute. But it also locks in two very different shapes of work: prefill processes hundreds of tokens in parallel, decode processes one at a time. How the GPU handles that contrast is next.
Detail 11: Prefill vs Decode
At the start of this post, we noticed something about the rhythm of LLM responses. A pause before the first token. Then a steady stream after it. The pause grows with longer prompts. The streaming speed stays roughly constant whether you asked for a poem or a legal brief.
We said those behaviors map to specific stages in the pipeline. With the KV cache now in place, we can finally name those stages and explain exactly why they behave so differently.
Two Different Jobs
The user sends "Write a story." After tokenization, the model has three token IDs: [8144, 264, 3446]. All three are known before the model runs. There's nothing to wait for, no dependency between them. So the model processes all three at once: runs them through every transformer block in parallel, computes their K and V vectors at each layer, and fills the entire KV cache in a single pass. At the end, it produces logits for the last position and samples the first output token. "Once."
That was the parallel part.
Now the model needs to produce "upon." But "upon" depends on "Once." It couldn't have been predicted without knowing "Once" was chosen first. And the next token after "upon" depends on what "upon" turned out to be. And so on. Each output token depends on the previous one, which means the model is forced to generate them one at a time: process "Once" through all 22 layers (reading cached K/V for attention), append its K/V to the cache, sample "upon." Process "upon," append, sample "a." Process "a," append, sample "time." One full trip through the model per token. No shortcuts.
The prompt is a batch job. The output is a serial pipeline.
These two jobs have names:
- Prefill: Process all prompt tokens in one parallel pass. Fill the KV cache. Emit the first output token.
- Decode: Generate output tokens one at a time. Each step reads the cache, processes one new token, appends its K/V, and samples.
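The two-phase structure can be sketched as a toy loop. `ToyModel` is a hypothetical stand-in whose "cache" just records which positions have K/V stored and whose "logits" are fake; only the control flow matters here:

```python
class ToyModel:
    """Stand-in for the real forward pass (hypothetical)."""
    def forward(self, token_ids, cache):
        cache.extend(token_ids)              # prefill appends many, decode one
        return [(t * 31 + 7) % 1000 for t in token_ids]   # fake logits

def generate(model, prompt_ids, n_new):
    cache = []
    logits = model.forward(prompt_ids, cache)   # prefill: whole prompt, one pass
    out = [logits[-1]]                          # first output token
    for _ in range(n_new - 1):
        logits = model.forward([out[-1]], cache)  # decode: one token per pass
        out.append(logits[-1])
    return out, len(cache)

tokens, cache_len = generate(ToyModel(), [8144, 264, 3446], 4)
print(len(tokens), cache_len)   # 4 output tokens; 3 prompt + 3 decode entries cached
```

One parallel call for the whole prompt, then one serial call per output token: that asymmetry is the entire prefill/decode split.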
Same model, same weights, same attention formula. The difference is how many tokens go through per forward pass. And that difference matters more than it looks.
Why the GPU Cares
Three prompt tokens barely register on a GPU. Scale up. A realistic request: 1,000 prompt tokens, 200 output tokens desired.
During prefill, the GPU runs all 1,000 tokens through the model at once. Think about what that means for a single weight matrix. Take one of TinyLlama's attention projections: a [2,048 × 2,048] matrix, about 8.4 MB at FP16. The GPU reads that matrix from memory once. Then it multiplies 1,000 token vectors through it. One read, 1,000 matrix-vector products. Across the full model, all 2.2 GB of weights get loaded once and do useful work for every prompt token.
The compute units stay fed. Billions of multiply-add operations, layer after layer. While the GPU crunches through one batch of weights, the memory system has time to stage the next set. The pipeline stays full. Compute capacity is the ceiling, not memory bandwidth.
Now decode. The same weights get loaded. Same 2.2 GB of data traveling the same path from VRAM to the compute cores. But this time there's one token vector, not 1,000. Each weight matrix gets read, multiplied against a single vector, and the arithmetic finishes in microseconds. Then the compute units sit idle while the memory bus fetches the next matrix. And the next. And the next. All 22 layers, all projections, all 2.2 GB. For one token.
Same weights loaded. Same bytes moved. One thousandth the useful work.
During prefill, the GPU spent most of its time computing. During decode, it spent most of its time waiting for data to arrive. The bottleneck flipped completely.
Prefill is limited by how fast the GPU can do math. There's so much work per byte transferred that the memory system can keep up. This regime is called compute-bound.
Decode is limited by how fast the GPU can read from memory. The arithmetic completes almost instantly; the wait is for the next chunk of weights to arrive from VRAM. This regime is called memory-bandwidth-bound.
Same model, same hardware, but the two phases hit completely different ceilings. Which ceiling you're hitting determines which optimizations matter.
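A back-of-envelope way to see the two regimes is arithmetic intensity: useful FLOPs per byte of weights moved. Assuming roughly 2 FLOPs (a multiply and an add) per parameter per token:

```python
weight_bytes = 1.1e9 * 2          # ~1.1B FP16 parameters read per forward pass
flops_per_token = 2 * 1.1e9       # ~2 FLOPs (multiply + add) per parameter

def flops_per_byte(tokens_in_pass):
    """Useful work per byte of weight data moved, for one forward pass."""
    return flops_per_token * tokens_in_pass / weight_bytes

print(flops_per_byte(1))      # decode: ~1 FLOP/byte -> bandwidth-bound
print(flops_per_byte(1000))   # prefill: ~1000 FLOP/byte -> compute-bound
```

Modern GPUs can sustain hundreds of FLOPs per byte of memory traffic, so 1 FLOP/byte leaves the compute units starved while 1,000 keeps them fed.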
Back to the User Experience
Now those observations from the opening make sense.
The pause before the first token? That's prefill. The GPU is crunching through your entire prompt in parallel, building the KV cache. Longer prompt, more compute, longer pause. This metric is called Time-to-First-Token (TTFT).
The steady stream of tokens after that? That's decode. One forward pass per token, each one reading roughly the same volume of weight data from memory, each taking roughly the same amount of time. The stream feels constant because each decode step costs about the same whether it's token 5 or token 195. This metric is called Inter-Token Latency (ITL).
The pause scales with prompt length because prefill is compute-bound: more tokens, more parallel work. The streaming rate barely changes because decode is bandwidth-bound: the weight read is the same size on every step, and one token of arithmetic doesn't change that.
| Phase | Tokens per pass | Bottleneck | User-visible metric |
|---|---|---|---|
| Prefill | 1,000 (parallel) | Compute | TTFT |
| Decode | 1 (sequential) | Memory bandwidth | ITL |
Decode Dominates
The total time to produce a complete response:
Total latency = TTFT + ITL × (output_tokens - 1)
For our 1,000-token prompt with a 200-token response: 1 prefill pass, then 199 decode steps. (The first output token comes from prefill; every subsequent token is a decode step.)
Even if prefill takes 10× longer than a single decode step, 199 decode steps overwhelm it. Prefill contributes less than 5% of the total latency. Decode contributes over 95%. Extend the response to 500 tokens and prefill drops below 2%.
Most of the time a user spends waiting is spent in decode. The model reads essentially all its weights from memory once per output token, and the number of output tokens dominates the timeline.
This is why most inference optimization effort targets decode. Making each weight read cheaper (quantization), spreading it across more useful work (batching), or extracting multiple tokens from a single read (speculative decoding): different strategies, same bottleneck. Part 2 builds each one.
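The latency formula above is easy to sanity-check in a few lines of Python. This is an illustrative sketch: it assumes a 1.4 ms decode step (the A100 weight-read floor for TinyLlama) and a prefill that takes 10× one decode step, matching the worked example.

```python
# Prefill's share of total latency, assuming ITL ~1.4 ms and TTFT ~10x ITL.
itl = 1.4e-3          # seconds per decode step (assumed)
ttft = 10 * itl       # prefill for the 1,000-token prompt (assumed)

for output_tokens in (200, 500):
    # Total latency = TTFT + ITL x (output_tokens - 1)
    total = ttft + itl * (output_tokens - 1)
    print(f"{output_tokens} output tokens: prefill is {ttft / total:.1%} of total")
```

Running it reproduces the percentages in the text: under 5% of the total at 200 output tokens, under 2% at 500.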
Let's put concrete bandwidth numbers to this.
Detail 12: Memory Bandwidth is the Bottleneck
Detail 11 showed that prefill and decode hit different ceilings. Prefill keeps the GPU busy computing. Decode leaves it waiting for data. That was the qualitative picture. Let's put actual numbers to it, because the imbalance is more extreme than the words suggest.
One Decode Step, Traced
Take one decode step. Batch size 1. The model generates one next token.
To produce that token, every weight in TinyLlama passes through the compute units: the Q, K, V, and output projections at each of the 22 layers, the FFN's three matrices per layer, the LM head at the end. 1.1 billion parameters, stored in FP16 at 2 bytes each. The GPU reads roughly 2.2 GB of weight data from VRAM. (It also reads the KV cache for the current context and writes back the new token's KV entries and logits, but the weight read dwarfs everything else.)
How much computation does all that data produce? Each parameter participates in one multiply and one add: two floating-point operations. Across the entire model:
~2 FLOPs × 1.1 billion parameters ≈ 2.2 billion FLOPs
2.2 GB of data transferred. 2.2 billion operations performed. The ratio: roughly one floating-point operation per byte loaded.
One FLOP per byte. The GPU does one multiply-add, idles while the memory bus delivers the next pair of bytes, does one more, idles again. All the way through 2.2 billion bytes. An A100 can perform 312 trillion FP16 operations per second. At ~1 FLOP per byte and 1,555 GB/s of memory bandwidth, the chip's actual compute throughput during decode is about 1.5 trillion FLOPs/s out of a theoretical 312 trillion. Roughly 0.5% utilization. The arithmetic units are almost entirely dark.
The math, all 2.2 billion operations of it, finishes in microseconds. Loading the data takes milliseconds. During decode, the GPU is not computing. It is moving bytes.
And this toll repeats every decode step. Token 1: read ~2.2 GB from VRAM, do microseconds of math. Token 2: same. Token 200: same. The GPU's on-chip memory (about 40 MB of L2 cache on an A100) is far too small to hold the full 2.2 GB model, so each step streams essentially the full weight set from VRAM all over again. Two hundred tokens means roughly 440 GB of memory traffic. To produce a few sentences of text.
GPU caches do retain some weight tiles between kernel launches, so the strict byte count per step is not literally "read the entire model." But these caches are small relative to the weight footprint, and the regime does not change. Treat "one full weight read per token" as a useful lower-bound mental model, not cycle-accurate accounting.
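The whole calculation fits in a few lines. This is a back-of-envelope model using published peak specs, not a profile of real hardware:

```python
# One decode step at batch size 1: arithmetic intensity and the A100
# compute utilization it implies. Rough model, peak-spec numbers.
params = 1.1e9                    # TinyLlama parameter count
bytes_moved = params * 2          # FP16 weights: ~2.2 GB per step
flops = params * 2                # one multiply + one add per weight
intensity = flops / bytes_moved   # FLOPs per byte -> 1.0

bandwidth = 1555e9                # A100 SXM HBM bandwidth, bytes/s
peak_fp16 = 312e12                # A100 FP16 peak, FLOPs/s
achieved = bandwidth * intensity  # the memory bus can only feed this rate
print(f"{intensity:.0f} FLOP/byte, {achieved / peak_fp16:.2%} utilization")
```

At 1 FLOP per byte, the memory bus caps achievable compute at about 1.5 TFLOPs/s, roughly 0.5% of peak.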
The Bandwidth Floor
If each decode step reads ~2.2 GB, and the GPU's memory bus can move at most B GB/s, then one token cannot come out faster than 2.2 / B seconds. That is a floor set by the physics of the memory interface. No kernel optimization, no clever scheduling, no amount of software engineering pushes below it on a given chip.
| GPU | Peak memory bandwidth | Minimum time to read 2.2 GB |
|---|---|---|
| RTX 4090 | 1,008 GB/s | 2.18 ms |
| A100 SXM 40GB | 1,555 GB/s | 1.41 ms |
| H100 SXM | 3,350 GB/s | 0.66 ms |
These are lower bounds derived from peak specs. Real inter-token latency is always higher: kernel launch overhead, imperfect memory access patterns, scheduling gaps between layers, and KV cache traffic all add up. Bandwidth also depends on form factor. PCIe variants of the same chip typically have lower bandwidth than SXM. An H100 PCIe delivers about 2,039 GB/s versus 3,350 GB/s for SXM.
Even before thinking about compute, decode has a millisecond-scale floor per token. That floor comes entirely from data movement. The only way to lower it for a given model is to shrink the weights (quantization, Detail 18) or use a faster memory bus (better GPU, or multiple GPUs in parallel, Detail 22).
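The table's floor values fall out of a single division. A quick script to reproduce them from the peak-bandwidth specs:

```python
# Bandwidth floor per token: weight bytes / peak memory bandwidth.
weights_gb = 2.2  # TinyLlama in FP16

for gpu, bw_gb_s in [("RTX 4090", 1008),
                     ("A100 SXM 40GB", 1555),
                     ("H100 SXM", 3350)]:
    floor_ms = weights_gb / bw_gb_s * 1000
    print(f"{gpu}: {floor_ms:.2f} ms minimum per token")
```

Swap in a different model size or a PCIe-variant bandwidth figure to see how the floor moves.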
Arithmetic Intensity
We have been circling the same observation from different angles: decode does almost no computation relative to how much data it moves. There is a standard way to express this.
The ratio of computation to data movement is called arithmetic intensity:

Arithmetic intensity = FLOPs performed / bytes moved
For decode at batch size 1, we just computed it: ~2.2 billion FLOPs over ~2.2 GB of data transfer. About 1 FLOP per byte.
Now put prefill next to it. A 1,000-token prompt runs through the same weights, but 1,000 token vectors pass through each matrix instead of 1:
- FLOPs: ~2.2 trillion (1,000 tokens × ~2.2 billion each)
- Bytes moved: still ~2.2 GB (same weights, loaded once, reused across all 1,000 tokens)
- Arithmetic intensity: ~500–1,000 FLOPs per byte
Same data transfer. A thousand times more useful computation.
A GPU has two ceilings. A compute ceiling: how many operations per second its cores can execute. A bandwidth ceiling: how many bytes per second its memory bus can deliver. Which one you hit first depends entirely on arithmetic intensity.
Modern GPUs need roughly 100–300 FLOPs per byte to fully saturate their compute cores. Below that threshold, the cores finish their work and sit idle waiting for the next batch of data. You are bandwidth-bound. Above it, the memory system keeps up fine and the cores are the bottleneck. You are compute-bound.
Decode at ~1 FLOP/byte sits two orders of magnitude below that threshold. The arithmetic units are 99%+ idle. Prefill at ~500–1,000 FLOPs/byte sits comfortably above it. Same model, same hardware, but the two phases live in completely different performance regimes. That is why the GPU runs hot during prefill and cool during decode, even though it is doing the "same" operation (matrix multiplies) in both.
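The two-ceilings picture is the roofline model, and the break-even point can be computed directly. A sketch using A100 peak specs (the exact ridge value varies by GPU and datatype):

```python
# Roofline-style regime check. Below the "ridge point" (peak FLOPs divided
# by peak bandwidth) you are bandwidth-bound; above it, compute-bound.
PEAK_FLOPS = 312e12   # A100 FP16, FLOPs/s
BANDWIDTH = 1555e9    # A100 HBM, bytes/s
RIDGE = PEAK_FLOPS / BANDWIDTH   # ~200 FLOPs/byte

def regime(flops_per_byte: float) -> str:
    return "compute-bound" if flops_per_byte >= RIDGE else "bandwidth-bound"

print(f"ridge point: ~{RIDGE:.0f} FLOPs/byte")
print("decode  (~1 FLOP/byte):    ", regime(1))
print("prefill (~1000 FLOPs/byte):", regime(1000))
```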
Why Batching Matters
The bandwidth floor looks permanent. Each decode step reads ~2.2 GB. One token per read. But think about what you are paying for. Those 2.2 GB of weights are the same for every sequence. If 32 requests are active, you can read the weights once and multiply all 32 token vectors through them. Same data transfer, 32 tokens instead of 1.
With batch size 1:
- Read ~2.2 GB weights
- Produce 1 token
- Arithmetic intensity: ~1 FLOP/byte
With batch size 32:
- Read ~2.2 GB weights (same bytes)
- Produce 32 tokens (one per sequence)
- Arithmetic intensity: ~32 FLOPs/byte
The weight read does not change with batch size. What changes is how much useful work each byte of data transfer produces. If each read is expensive, make it count. That is why production inference systems batch aggressively, and why Part 2 is mostly about how to do that without wrecking per-user latency.
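The scaling is linear by construction, which a two-line function makes obvious. This sketch ignores per-sequence KV cache traffic, which the text notes is quantified later:

```python
# Decode arithmetic intensity vs batch size: the weight read is fixed,
# the useful FLOPs scale with the number of active sequences.
WEIGHT_BYTES = 2.2e9      # one full FP16 weight read
FLOPS_PER_TOKEN = 2.2e9   # ~2 FLOPs per parameter

def decode_intensity(batch_size: int) -> float:
    return batch_size * FLOPS_PER_TOKEN / WEIGHT_BYTES

for b in (1, 8, 32, 128):
    print(f"batch {b:>3}: ~{decode_intensity(b):.0f} FLOPs/byte")
```

Even at batch 128, the intensity sits below the ~200 FLOPs/byte an A100 needs to become compute-bound, which is why the extra sequences are nearly free in wall-clock terms.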
Part 1 Summary: The Complete Inference Pipeline
We started with a black box. Text goes in, text comes out. Twelve details later, there is no box. You can now name every component between the user's prompt and the streaming tokens, and estimate what it costs on the back of an envelope.
"Write a story" → "Once upon a time"
Trace the running example one last time:
- Tokenize: "Write a story" → [8144, 264, 3446]. Text is gone. From here on, the model sees only integers.
- Embed: Each ID becomes a 2,048-dimensional vector. Position enters inside each attention layer via RoPE, not at the input.
- 22 transformer blocks: Each block runs attention (tokens look at each other, updating their representations with context) then FFN (each token is processed independently through a nonlinear network), wrapped in residual connections and RMSNorm.
- Project and sample: The LM head projects the final vector to 32,000 logits. Softmax converts them to probabilities. Sampling picks one token: "Once."
The first token comes from prefill: process the full prompt in parallel, populate the KV cache, emit token 1.
Everything after it is decode. Each step processes only the new token, reads the cached K/V for attention, appends the new K/V, and samples the next word. One full trip through all 1.1 billion parameters per token. The sequence grows one entry at a time until the model emits an end-of-sequence token or hits the context limit.
The opening of this post pointed out the rhythm: a pause before the first token, then a steady stream. Now you know exactly why. The pause is prefill, compute-bound, scaling with prompt length. The stream is decode, bandwidth-bound, each step costing roughly the same because the weight read is the same size every time.
The Systems Picture
For TinyLlama serving one sequence, the memory budget breaks down like this:
Model weights (FP16): 2.2 GB
KV cache (2,048 context): ~45 MB (GQA keeps this small)
Activations (decode): ~50-100 MB
Total: ~2.4 GB
The two user-visible metrics map directly to the two phases:
- TTFT (Time to First Token) is prefill. Compute-bound. Scales with prompt length.
- ITL (Inter-Token Latency) is one decode step. Bandwidth-bound. Scales with model size and memory bandwidth, barely changes with output length.
Total latency = TTFT + ITL × (output_tokens − 1). For most responses, decode dominates.
At batch size 1, decode reads essentially all model weights from VRAM per token and performs about one floating-point operation per byte loaded. 0.5% compute utilization on an A100. The rest is waiting for data.
If you carry one idea from Part 1, make it this: decode is a bandwidth problem. Part 2 is about moving fewer bytes, wasting fewer bytes, or amortizing those bytes across more useful work.
The Three Bottlenecks
Now that the full pipeline is visible, the constraints come into sharp focus. Three bottlenecks run through everything we have covered:
- Memory bandwidth limits decode speed. Every decode step reads a weight-equivalent volume of data from HBM. The arithmetic finishes in microseconds; the data transfer takes milliseconds. At batch size 1, the GPU sits at 0.5% compute utilization. Data movement, not arithmetic, sets the pace.
- Memory capacity limits concurrency. Model weights take 2.2 GB. Each active sequence adds its own KV cache (45 MB at max context for TinyLlama, far more for larger models). The more requests you serve in parallel, the faster VRAM fills up. At some point there is simply no room for another sequence.
- Sequential generation limits parallelism. Tokens come out one at a time. Each one requires a full forward pass through all 1.1 billion parameters. You cannot skip ahead, because each token depends on the one before it.
Every production optimization in Part 2 attacks one or more of these three constraints.
Part 2: Production Optimizations
Part 1 traced one request through one model. A single user types "Write a story," TinyLlama runs all 1.1 billion parameters, and tokens stream back. Clean. Complete. And entirely unrealistic.
Nobody buys a GPU to serve one person.
Go back to the number that closed Detail 12: 0.5% compute utilization during decode. An A100, capable of 312 trillion operations per second, spending 99.5% of its time waiting for bytes to arrive from memory. Producing tokens for a single conversation. If you are running an inference service, that number is your starting condition. Everything in Part 2 starts from the same question: how do you make the economics work?
The three bottlenecks frame everything that follows. Bandwidth limits how fast each token can come out. Memory capacity limits how many requests can be active at once. Sequential generation means you cannot skip ahead. Every technique in Part 2 attacks one or more of these, usually at the cost of another.
Detail 13: Batching Requests
Start with one decode step for a single user. The GPU reads ~2.2 GB of weights from VRAM. A single token vector passes through each weight matrix. One multiply-add per weight. The arithmetic finishes in microseconds. The weight read takes about 1.4 milliseconds. Then the next step begins and the GPU reads those same 2.2 GB all over again.
Now a second user sends a request. Different prompt, different conversation, completely unrelated. But the model is the same. The weights are the same. The 2.2 GB traveling from VRAM to the compute cores are identical regardless of who is asking.
The question almost asks itself: if you are already paying 1.4 milliseconds to shuttle 2.2 GB through the memory bus, why produce only one token from all that data?
Think about what the compute cores actually do during a single-user decode step. A weight arrives from VRAM. The core multiplies it against the token vector, adds the result to a running sum, and is done. Then it waits for the next weight. That wait is where 99.5% of the time goes.
But the core did not need to sit idle. It could have multiplied that same weight against a second token vector while waiting. And a third. And a thirtieth. The weight is already there, sitting in a register. The arithmetic costs almost nothing relative to the time spent fetching. Every additional multiply-add uses compute capacity that was going to waste anyway.
This is batching: processing multiple sequences in a single forward pass. The weight read, which dominates decode time, is paid once. The extra arithmetic for additional sequences fills compute headroom that was going to waste.
The numbers follow directly. With batch size 1, TinyLlama reads ~2.2 GB and performs ~2.2 billion FLOPs. Arithmetic intensity: ~1 FLOP/byte. With batch size 32, the weight read is still ~2.2 GB (plus some KV cache traffic per sequence, which we will quantify in Detail 21), but the FLOPs rise to ~70 billion. Arithmetic intensity: ~32 FLOPs/byte. Still well below the ~200 FLOPs/byte needed to saturate an A100's compute cores, so the extra work barely registers in wall-clock time. One decode step, one weight read, thirty-two tokens out.
Throughput scales nearly linearly with batch size, at least until KV cache memory becomes the limiting factor.
The Simplest Implementation
Collect N requests into a group. Run prefill for all of them. Then decode in lockstep: each step reads the weights once and generates one token per sequence. When every sequence in the group has finished, return all results.
With batch size 32, the GPU reads roughly the same 2.2 GB per step but produces 32 tokens instead of 1. Same data movement, 32× the output. The per-user experience barely changes (each user still gets their tokens at roughly the same rate), but the GPU is now doing useful work with compute capacity that was previously dark.
There's a catch.
Where It Breaks
The scheme above works if all requests arrive at the same time and generate the same number of tokens. In production, neither is true.
Three requests arrive together:
Request A: "Write a poem" → generates 50 tokens
Request B: "Explain gravity" → generates 120 tokens
Request C: "Hello" → generates 20 tokens
You batch them and start decoding. For the first 20 steps, all three slots do useful work. Every weight read produces three tokens. Good.
Then Request C finishes at step 20. Its result goes back to the user, but its slot sits empty. The batch keeps running because B still has 100 tokens to go. That empty slot gets processed on every forward pass, producing nothing. Padding.
Request A finishes at step 50. Now two of three slots are padding for the remaining 70 steps. From step 50 to step 120, the GPU reads 2.2 GB per step and produces one useful token. Back to batch-size-1 efficiency, except we are technically running a "batch."
Meanwhile, a new request arrives at step 30. It could use that empty slot. But it cannot join. The batch composition was fixed at step 0. The new request sits in a queue until step 120, when the entire batch completes and a new one can form. Ninety steps of waiting, while an empty slot sits right there.
There is also a cost before decoding even begins. You need N requests to form a batch of N. Under low traffic, you wait for them to accumulate (or use a timeout, which produces smaller, less efficient batches). Either way, users experience added latency from the queueing alone.
This is static batching: the batch composition is fixed from start to finish. No one leaves early, no one joins late.
Static batching improves throughput over serving one request at a time. But it has three problems: finished sequences waste GPU cycles as padding, new requests cannot join until the entire batch completes, and short requests are held hostage by long ones. For an interactive service where users are watching tokens stream in, all three matter.
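The padding waste in the A/B/C example is easy to count. A short calculation over the numbers from the walkthrough:

```python
# Slot-steps wasted to padding under static batching, for the example above.
lengths = [50, 120, 20]        # tokens generated by requests A, B, C
steps = max(lengths)           # the batch runs until the longest finishes
total = steps * len(lengths)   # slot-steps executed
useful = sum(lengths)          # slot-steps that produced a real token
padding = total - useful
print(f"{padding} of {total} slot-steps are padding ({padding / total:.0%})")
```

Nearly half the slot-steps in this batch produce nothing, and the skew gets worse as output lengths diverge.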
Detail 14: Continuous (In-Flight) Batching
Go back to the waste from Detail 13. After C finishes at step 20, its slot sits empty for 100 steps. A new request arrives at step 30, staring at an open slot it can't claim. The batch was sealed at step 0. Nobody in, nobody out.
A system simultaneously idle and backed up. Empty slots burning GPU cycles on one side, queued requests waiting for those cycles on the other. Not a rounding error. A structural problem.
Why Does It Have to Be This Way?
Look at the three slots more carefully. What does Request A's token generation actually depend on? Its own KV cache. Its own prefix. The model weights, which are shared. Does it depend on Request C being in slot 3? No. Does Request B care whether slot 1 holds Request A or Request E? No. Each slot is processing an independent sequence through an independent KV cache. The only thing they share is the weight read, and that happens regardless of who occupies which slot.
The batch is not a unit of computation. It is a scheduling convenience for amortizing the weight read. The sequences inside it have no dependency on each other. There is no computational reason they need to start together, run together, or finish together.
Once you see this, the rigidity of static batching looks arbitrary. It is a constraint we imposed, not one the hardware demands.
Let the Batch Breathe
Drop the constraint. When C finishes at step 20, remove it. The batch shrinks from 3 to 2. When a new request arrives and a slot opens, insert it. The batch grows back to 3. Sequences enter and leave as they please.
Walk through the same scenario:
Step 0: [A, B, C] all three decoding
Step 20: [A, B, _] C finishes, result returned, slot freed
Step 21: [A, B, D] Request D admitted, joins decode loop
Step 50: [_, B, D] A finishes, slot freed
Step 51: [E, B, D] Request E admitted
...
C finishes at step 20. By step 21, its slot is doing useful work again. Zero steps of padding. Request D didn't wait for the entire batch to drain; it got in as soon as there was room. The GPU stays near full occupancy as long as there are requests waiting.
Compare the two approaches on the same workload. With static batching, the GPU produces useful tokens on all three slots for 20 steps, then degrades to two active slots, then one. With dynamic admission, every slot is either producing useful output or being immediately filled by a new request. The difference compounds over time. Over a sustained workload, throughput can improve 2 to 5× depending on how much sequence lengths vary and how steadily requests arrive.
This is continuous batching, also called in-flight batching because the batch composition changes while generation is in flight. Most production inference servers implement some version of it. vLLM, TGI, and TensorRT-LLM all use continuous batching as their default scheduling strategy.
What Happens When a New Request Joins
A new request can't simply appear in the decode loop. It has no KV cache yet. The decode step expects to read cached Keys and Values for every past token position (Detail 10), and for a brand-new request there is nothing to read.
So when the scheduler decides to admit Request D, it first runs prefill for D's prompt. All of D's prompt tokens get processed through the 22 transformer layers in a single parallel pass, just like the prompt processing from Detail 11. This populates D's KV cache. Once prefill finishes, D joins the decode loop alongside everyone else, advancing one token per step.
From D's perspective, nothing special happened. Normal prefill, normal decode. From the server's perspective, D's prefill was interleaved into a system where other sequences are mid-generation. The interaction between a compute-heavy prefill and bandwidth-heavy decode steps has implications of its own, which Detail 21 will address.
The Scheduler
The engine that makes this work is a scheduling loop that runs once per decode step:
1. Evict finished sequences. Any sequence that sampled an end-of-sequence token or hit its maximum length gets removed. Its output is returned to the user. Its KV cache memory is freed back to the pool.
2. Admit waiting requests. If there are queued requests and enough KV cache memory to hold a new sequence, run prefill and insert the new sequence into an active slot. This is the step that keeps the batch full.
3. Run one decode step for all active slots. Read the model weights once. Produce one token per active sequence. Append each new token's K and V to its respective cache.
4. Repeat.
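The loop can be sketched as a toy simulation. This is not any real server's scheduler: requests are reduced to (arrival_step, tokens_to_generate) pairs, and both prefill cost and the KV-memory check are elided for brevity.

```python
# Toy continuous-batching loop: sequences enter and leave mid-flight.
from collections import deque

def simulate(requests, max_slots=3):
    """Return the decode step at which the last sequence finishes."""
    pending = deque(sorted(requests))   # (arrival_step, n_tokens), not yet arrived
    queue, active, step = deque(), [], 0
    while pending or queue or active:
        while pending and pending[0][0] <= step:
            queue.append(pending.popleft()[1])     # request hits the queue
        while queue and len(active) < max_slots:   # admit into any free slot
            active.append(queue.popleft())         # (prefill would run here)
        active = [t - 1 for t in active]           # one decode step for all
        active = [t for t in active if t > 0]      # evict finished sequences
        step += 1
    return step

# A (50), B (120), C (20) at step 0, plus D (60) arriving at step 21:
print(simulate([(0, 50), (0, 120), (0, 20), (21, 60)]))  # -> 120
```

D slips into the slot C freed and finishes around step 81, long before B; nothing waits for the whole batch to drain.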
Step 2 has a condition that sounds innocuous: "if KV cache memory allows." That condition is doing more work than it appears to.
Detail 15: The KV Cache Memory Problem
The scheduler from Detail 14 is elegant. Evict finished sequences, admit waiting requests, run one decode step, repeat. In steady state, slots stay full, the GPU stays busy, throughput climbs. Continuous batching works.
But go back to step 2: "If there are queued requests and enough KV cache memory to hold a new sequence, run prefill and insert." That condition, "if KV cache memory allows," is easy to skim past. It sounds like a simple check. Enough memory? Yes or no?
It's not simple. Every time the scheduler says "yes," it commits real VRAM. A new sequence means a new KV cache, growing by 22 KB per token per decode step (Detail 10), potentially reaching 45 MB at max context. Five active sequences? 225 MB committed. Ten? 450 MB. The model weights already occupy 2.2 GB. VRAM is finite.
Eventually the scheduler says "no." Not because the GPU is slow. Not because the memory bus is saturated. Because there is not enough space. Requests pile up in the queue, the GPU idles on half-empty batches, and throughput drops. Not a compute problem. Not a bandwidth problem. A space problem.
But the total bytes aren't even the hard part. We worked those numbers out in Detail 10. The hard part is how you allocate those bytes.
The Allocation Problem
Think about what the scheduler is actually dealing with. Requests arrive at unpredictable times. Each one starts with a different prompt length, generating a KV cache of a different initial size. Then each sequence grows, one token per step, at the same rate but from different starting points. And sequences finish at different times, in no particular order, because some users ask for a haiku and others ask for an essay.
The scheduler is carving up a shared pool of VRAM for allocations that start at different times, grow at different rates, and end unpredictably. If you have ever managed memory for a program that juggles variable-size objects with unpredictable lifetimes, you recognize this. It's the memory allocation problem. And the two obvious approaches both fail.
Approach 1: Reserve for the Worst Case
The simplest policy: when a new sequence arrives, allocate a contiguous block big enough for the maximum context length. The sequence might generate 50 tokens, or it might generate 2,048. You don't know. So you reserve for the ceiling. One big slab, 45 MB, right upfront.
This is predictable. No reallocation, no copying, no bookkeeping as the cache grows. The sequence has its block and nobody else can touch it.
But follow a concrete request through. A user sends a short prompt, 10 tokens. Prefill populates 10 KV positions. The model generates 40 more tokens before sampling end-of-sequence. Total cache used: 50 positions, about 1.1 MB. Total cache reserved: 2,048 positions, 45 MB.
Reserved: 45 MB (2,048 positions)
Used: 1.1 MB (50 positions)
Wasted: 43.9 MB (97.5% of the allocation)
97.5% of that memory is sitting empty, locked away from every other sequence, for the entire lifetime of this request.
Scale it up. Ten concurrent sequences, each reserving 45 MB, each actually using a fraction of it. 450 MB committed, maybe 20 MB doing real work. The other 430 MB is a ghost: allocated but empty, unavailable to any other request.
This is internal fragmentation. Waste that lives inside each allocation, in the gap between what was reserved and what was actually needed. The memory is not free. It's not used. It's just gone.
Approach 2: Grow As You Go
The fix seems obvious. Don't pre-allocate. Start each sequence with a small contiguous block, just enough for the prompt's KV cache. As the sequence generates tokens, extend the block. You only pay for what you use. Internal fragmentation drops to near zero.
But watch what happens to the memory layout over time.
Three sequences arrive. A needs 100 MB, B needs 150 MB, C needs 80 MB. They get placed contiguously:
[ A: 100 MB ][ B: 150 MB ][ C: 80 MB ][ free: 670 MB ]
Everything fits. Plenty of room. Now A finishes. Its 100 MB is freed:
[ free: 100 ][ B: 150 MB ][ C: 80 MB ][ free: 670 MB ]
Sequences D and E arrive. D needs 60 MB, fits in the gap left by A. E needs 200 MB, takes from the end:
[ D: 60 ][ free: 40 ][ B: 150 MB ][ C: 80 MB ][ E: 200 MB ][ free: 470 ]
B finishes:
[ D: 60 ][ free: 40 ][ free: 150 ][ C: 80 MB ][ E: 200 MB ][ free: 470 ]
Now sequence F arrives needing 200 MB of contiguous space. Check the free list: a 40 MB gap, a 150 MB gap, and 470 MB at the end. Total free memory: 660 MB. More than enough. But F needs its KV cache in one contiguous block, and neither the 40 nor the 150 MB gap qualifies. The 470 MB chunk works for now, but after a few more cycles of arrivals and departures, even that fragments.
[D: 60][free: 40][C: 80][free: 150][E: 200][free: 30][G: 90][free: 350]
Total free: 570 MB. A new request needs 200 MB contiguous. The largest gap is 350 MB, barely enough. A few more rounds and the free space scatters into ever-smaller holes. Sooner or later, a perfectly reasonable request gets turned away even though the total free memory dwarfs what it needs.
This is external fragmentation. The memory is plentiful, but it's in the wrong shape. Plenty of holes, none big enough. The bytes are available. The layout is not.
Two Failures, One Root Cause
Reserve up front and you waste the interior. Grow dynamically and you shatter the free space. Different symptoms, but look at what both approaches share: they insist that each sequence's KV cache be one contiguous block in VRAM. That requirement is the problem. Internal fragmentation is contiguity's price for safety. External fragmentation is contiguity's price for flexibility. In real workloads, one or both can waste 60 to 80% of the KV cache budget.
Operating systems hit exactly this problem decades ago with regular RAM. Their solution: stop requiring physical contiguity altogether. Give each program the illusion of a contiguous address space, backed by small, fixed-size pages scattered anywhere in physical memory.
Detail 16: PagedAttention (vLLM's Innovation)
The fix didn't come from ML. It came from operating systems, circa the 1960s.
How Virtual Memory Works
Early computers gave each program direct access to physical RAM. If program A occupied addresses 0 through 999, program B started at 1000. Programs grew, terminated, and left gaps. Sound familiar? The same swiss-cheese pattern from Detail 15, just in RAM instead of VRAM.
The OS community didn't try to get better at shuffling blocks around. They changed the rules. The idea: let every program believe it owns a contiguous stretch of memory starting at address 0. Under the hood, the OS breaks that illusion into fixed-size pages (typically 4 KB) and scatters them across physical RAM in whatever frames happen to be free. A per-process page table maps each virtual page to its physical frame.
One lookup, constant time. The program asks for address 5,000. The OS checks the page table: virtual page 1 lives in physical frame 37. Done. The program never knows its "contiguous" memory is physically scattered across the chip.
Both fragmentation types vanish. External fragmentation? Every frame is the same size and interchangeable, so gaps between allocations cannot form. Internal fragmentation? Capped at one partially filled page per allocation. One layer of indirection turned a fundamentally hard allocation problem into a trivial one.
From RAM Pages to KV Blocks
The mapping is direct:
- A sequence is the program
- Its logical KV positions (token 0, token 1, ...) are the virtual addresses
- Physical frames become fixed-size KV blocks in VRAM (say, 16 tokens each)
- A per-sequence block table plays the role of the page table
Same abstraction, different substrate. And the attention math doesn't change at all.
Walking Through a Request
Take the same 50-token request from Detail 15. Under static allocation, the system reserves one contiguous block of 2,048 KV slots. 50 fill with actual data. The other 1,998 sit empty but locked for the entire lifetime of the request. That's the baseline: 97.5% waste.
Now run it through paging with a block size of 16. Tokens 0-15 fill block 0. Tokens 16-31 fill block 1. Tokens 32-47 fill block 2. Tokens 48-49 go into block 3, leaving 14 slots empty. Four blocks total, grabbed one at a time from a free pool, wherever they happen to sit in VRAM:
Block table for Sequence A:
Logical 0 → Physical 7 (tokens 0-15)
Logical 1 → Physical 2 (tokens 16-31)
Logical 2 → Physical 14 (tokens 32-47)
Logical 3 → Physical 5 (tokens 48-49, 14 slots empty)
From the sequence's perspective, its KV cache is a contiguous stretch of 50 positions. Physically, it's four scattered chunks.
Both Problems, Gone
Block 3 has 14 empty slots. That's real waste, the same internal fragmentation from Detail 15. But paging caps it at 15 slots per sequence (one block minus one token). Compare that to 1,998 under static pre-allocation.
External fragmentation? Gone entirely. Every block is the same size and interchangeable. The swiss-cheese problem from Detail 15 cannot happen. Any free block anywhere in VRAM can satisfy the next allocation.
The numbers bear this out. Profiling in the vLLM paper found that only 20.4%-38.2% of reserved KV cache memory was actually storing token states in existing systems. With paging, the remaining waste is basically just the last partially filled block of each sequence, so the slack drops to only a few percent in practice.
Growth and Cleanup
Decode begins. New tokens fill the remaining slots in block 3 (physical address 5). After 14 more generated tokens, that block is full. The system grabs one more free block from the pool (say physical address 11) and adds a new entry to the block table: logical 4 → physical 11. No existing blocks move. No data is copied. Growth costs one block allocation per 16 generated tokens.
When the sequence finishes, blocks 7, 2, 14, 5, and 11 all return to the free pool immediately. Any future sequence can reuse them. No compaction step, no defragmentation. The system never required contiguity in the first place.
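The lifecycle above (grab blocks on demand, return them all when the sequence finishes) fits in a few lines. A toy sketch, not vLLM's actual allocator; all names are illustrative:

```python
class BlockAllocator:
    """Toy paged-KV allocator: fixed-size blocks drawn from a free pool."""

    def __init__(self, num_blocks, block_size=16):
        self.block_size = block_size
        self.free = list(range(num_blocks))  # physical block ids

    def blocks_needed(self, num_tokens):
        return -(-num_tokens // self.block_size)  # ceiling division

    def allocate(self, num_tokens):
        n = self.blocks_needed(num_tokens)
        if n > len(self.free):
            raise MemoryError("no free blocks")
        # Grab blocks one at a time, wherever they sit in the pool.
        return [self.free.pop() for _ in range(n)]  # logical -> physical

    def release(self, block_table):
        self.free.extend(block_table)  # reusable immediately, no compaction


alloc = BlockAllocator(num_blocks=1024)
table = alloc.allocate(50)     # 50 tokens -> ceil(50/16) = 4 blocks
assert len(table) == 4
# Growth during decode: when the last block fills, append one more.
table += alloc.allocate(1)
alloc.release(table)           # sequence done: all 5 blocks back in the pool
```

Because any free block satisfies any allocation, there is no external fragmentation to manage: `release` is a plain list extend, with no compaction step.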
What Changes in the Kernel
The attention math is unchanged. To generate token t, we still need keys and values for positions 0 through t-1. The only difference is one level of indirection. If the kernel needs the Key for token 35:
block_size = 16
logical_block = floor(35 / 16) # = 2
offset = 35 % 16 # = 3 (4th slot in that block)
physical = block_table[2] # = 14 (from our table above)
Token 35's Key lives at the 4th slot of physical block 14. Same dot products, same softmax. The math doesn't know or care that the blocks are scattered.
But the kernel is more complex. Instead of reading K/V from a single contiguous array, it follows a per-sequence block table to gather blocks scattered across VRAM. Adjacent threads may read from non-adjacent physical blocks, which hurts memory coalescing. The cost is real.
The utilization improvement dwarfs it. Memory utilization jumps from 20-38% to over 96%. The scheduler can fit far more concurrent sequences in the same VRAM. Benchmarks show 2-4x higher throughput from the same hardware. Not from faster math, just from fitting more requests into memory at once.
Back-of-the-Envelope (TinyLlama)
Recall that TinyLlama's max-context KV cache is ~45 MB per sequence. Suppose we have 50 active sequences with an average context length of 500 tokens.
Without paging:
- Reserved: 50 × 45 MB = 2.25 GB
- Actually used: 50 × (500/2048 × 45 MB) ≈ 550 MB
With paging, each sequence allocates only as many blocks as it needs. The slack is at most 15 tokens in the last block per sequence. At ~22 KB per token (45 MB / 2,048 tokens), worst-case slack is about 330 KB per sequence, or ~17 MB across all 50. Total allocation lands near 567 MB instead of 2.25 GB.
The freed ~1.7 GB is enough for dozens more concurrent sequences. The continuous batching scheduler from Detail 14 can say "yes" to incoming requests far more often.
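The arithmetic above, as a quick check (all numbers from this section's TinyLlama example):

```python
seqs, ctx, max_ctx = 50, 500, 2048
kv_max = 45.0                        # MB per sequence at full context
per_token = kv_max / max_ctx         # ~0.022 MB, i.e. ~22 KB per token

reserved = seqs * kv_max                 # static pre-allocation
used = seqs * ctx * per_token            # slots holding real data
slack = seqs * (16 - 1) * per_token      # worst case: 15 idle slots/sequence
paged = used + slack

print(f"{reserved:.0f} MB reserved vs {paged:.0f} MB paged")
```

The exact figure differs from the rounded prose numbers by a megabyte or so; the ratio (roughly 4x less memory held) is the point.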
This is PagedAttention, introduced by the vLLM project. Paged memory management applied to the attention KV cache. The attention math is untouched. The innovation is entirely in the memory allocator, the same layer that operating systems redesigned decades ago for the same reasons.
That handles KV cache storage. But there is a separate inefficiency in how attention is computed, particularly during prefill.
Detail 17: FlashAttention
In Detail 12, we traced the bandwidth bottleneck during decode: the GPU spends milliseconds loading model weights and microseconds doing math. Prefill was different. Enough tokens share the same weight reads that the linear layers are solidly compute-bound. The GPU cores stay busy.
But the attention step during prefill has its own bandwidth problem, and it has nothing to do with model weights. It comes from intermediate matrices that the GPU creates, uses exactly once, and immediately throws away. For a 2,048-token prompt, these temporaries account for over 95% of the memory traffic during each attention head's computation. The actual inputs and outputs are a rounding error by comparison.
Where does all that traffic come from? To answer that, we need one more detail about how GPU memory is organized.
Two Tiers of Memory
We have been treating VRAM as a single pool. That was enough for the bandwidth story in Detail 12. But physically, the GPU has two distinct memories, and the distance between them is where the waste lives.
HBM (High Bandwidth Memory). What we have been calling "VRAM." On an A100 40GB: 40 GB of capacity, about 1.6 TB/s of bandwidth. It sits on separate memory chips around the edge of the GPU package. Millimeters from the compute cores.
SRAM (Static RAM). About 20 MB of on-chip memory across all streaming multiprocessors on an A100. Built into the same silicon as the compute cores. Micrometers away instead of millimeters. Roughly 10× faster than HBM.
20 MB versus 40 GB. SRAM is 2,000× smaller but sits right next to the arithmetic units. Every GPU computation follows the same pattern: load operands from HBM into SRAM, compute, write results back to HBM.
Here is the constraint that creates the problem we are about to trace. Each GPU operation runs as a separate program called a "kernel." When one kernel finishes and the next starts, the only way to pass results between them is through HBM. Kernels do not share SRAM. HBM is the shared mailbox.
(Not all inference hardware follows this two-tier pattern, but GPUs dominate inference today. The rest of this section addresses their specific bottleneck.)
Standard Attention: Where the Bytes Go
With that in mind, let's trace what happens when one attention head runs during prefill and count every byte that crosses the HBM boundary.
From Detail 5, TinyLlama splits attention into 32 heads, each working in 64 dimensions. Take a 2,048-token prompt. After the Q, K, V projections, each head has:
Q: [2,048 × 64] (one 64-dim query per token)
K: [2,048 × 64] (one 64-dim key per token)
V: [2,048 × 64] (one 64-dim value per token)
Each matrix: 2,048 × 64 = 131,072 numbers. At FP16 (2 bytes each): about 256 KB. All three together: ~768 KB. Hold on to that number.
Now run the attention formula from Detail 4. Step one: compute the score matrix S = QKᵀ (the 1/√d scaling changes no byte counts, so we ignore it here).
Q is [2,048 × 64]. K transposed is [64 × 2,048]. The result: [2,048 × 2,048]. One attention score for every pair of tokens. 4.2 million entries.
S: [2,048 × 2,048] = 4,194,304 numbers
At FP16: 4,194,304 × 2 bytes ≈ 8 MB
The inputs totaled 768 KB. This single intermediate is 8 MB. A 10× expansion. This is the quadratic scaling of attention: double the sequence length and S quadruples in size.
Step two: softmax converts S into attention weights P. Same shape: [2,048 × 2,048]. Same size: 8 MB.
Step three: multiply P by V to get the output. [2,048 × 2,048] × [2,048 × 64] = [2,048 × 64]. Back to 256 KB. The computation balloons in the middle and collapses back down.
Now the problem. In a standard unfused implementation, these three steps, scores, softmax, output, run as separate GPU kernels. Separate kernels, separate SRAM. They can only pass results through HBM:
- Scores: Load Q (256 KB) and K (256 KB) from HBM into SRAM. Compute S. Write S (8 MB) back to HBM.
- Softmax: Read S (8 MB) from HBM into SRAM. Compute P. Write P (8 MB) back to HBM.
- Output: Read P (8 MB) and V (256 KB) from HBM into SRAM. Compute PV. Write output (256 KB) to HBM.
S gets written to HBM once and read back once: 16 MB of traffic. P gets the same treatment: 16 MB. The actual inputs (Q, K, V) and the final output add about 1 MB. Total per head: roughly 33 MB of HBM traffic.
32 of those 33 MB is S and P. Temporary matrices, created, read once, discarded. They never needed to leave the chip. They only passed through HBM because that is the only channel between separate kernels.
Scale up. TinyLlama has 32 heads per layer: 32 × 33 MB ≈ 1 GB per layer. Across all 22 layers: roughly 22 GB of HBM traffic per prefill pass, just from attention intermediates. On an A100 at ~1.6 TB/s bandwidth, that is about 14 ms spent purely on data transfer. The matrix multiplies themselves finish in microseconds.
(With GQA, K and V are shared across head groups, so the per-head K/V sizes above are slightly overstated. The core observation, that the intermediates dominate the traffic, holds regardless of how K/V are shared.)
Attention during prefill is bandwidth-bound, just like decode, but for a completely different reason. Decode is bottlenecked on reading model weights that the model uses permanently (Detail 12). Prefill attention is bottlenecked on throwaway intermediates that exist for a fraction of a millisecond but must take the slow path through HBM because the GPU has no other way to pass data between kernels.
The Roofline framework from Detail 12 quantifies this: each head computes roughly 1 billion FLOPs (the QKᵀ and PV matmuls) but moves about 33 MB through HBM. Arithmetic intensity: ~30 FLOPs/byte. Well below the 100-300 FLOPs/byte threshold where modern GPUs saturate their compute cores. The math is fast. The data movement is the wall.
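The per-head byte and FLOP counts can be tallied directly (FP16 everywhere, ignoring GQA sharing):

```python
N, d, fp16 = 2048, 64, 2        # tokens, head dim, bytes per value

qkv = 3 * N * d * fp16          # load Q, K, V           (~768 KB)
out = N * d * fp16              # write the output       (~256 KB)
s_p = 2 * (2 * N * N * fp16)    # S and P: each written once, read once
traffic = qkv + out + s_p       # HBM bytes per head     (~33 MB)

flops = 2 * (2 * N * N * d)     # QK^T and PV matmuls, 2 FLOPs per MAC
intensity = flops / traffic     # ~31 FLOPs/byte
print(traffic / 2**20, intensity)
```

Doubling N quadruples `s_p` while `qkv` only doubles, which is why the intermediates dominate more and more as prompts get longer.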
The Fix: Never Materialize the Full Matrix
Look at the traffic breakdown: 768 KB of useful input, 256 KB of useful output, and 32 MB of overhead in between. All of that overhead exists because three kernels pass S and P through HBM. What if we fused all three operations, scores, softmax, and the output multiply, into a single kernel? One kernel owns its SRAM for the entire duration. No intermediate ever touches HBM.
That is the core idea behind FlashAttention. But "just fuse the kernels" runs into a practical constraint. The full score matrix for one head is 8 MB. A single kernel's share of SRAM is a fraction of the chip's 20 MB total. We cannot load the entire [2,048 × 2,048] matrix into SRAM.
But think about what we actually need. Row i of the score matrix holds token i's attention scores against all 2,048 keys. Once we softmax that row and multiply by V, we have token i's output vector. Done. That row is never needed again. We do not need 2,048 rows in memory to compute any single one.
So instead of building the full matrix, work in tiles. Compute a block of rows, consume them, discard them, move on.
FlashAttention organizes this as follows:
- Divide Q into blocks of rows (say, 128 at a time). Divide K and V into blocks of the same size.
- Load one Q block into SRAM: [128 × 64] = 16 KB. Fits easily.
- Stream K/V blocks through one at a time. For each K block, compute a small score tile: [128 × 128] = 32 KB. Apply softmax (with a correction we will come back to), multiply by the corresponding V block, and accumulate into a running output. All of this stays in SRAM.
- After all K/V blocks have streamed through, the output for these 128 query rows is complete. Write it to HBM.
- Move to the next Q block. Repeat.
Notice the trade-off. For each Q block, every K/V block streams through from HBM. With multiple Q blocks, K and V each get read multiple times, not just once. Standard attention reads K and V once but writes and reads the massive S and P intermediates. FlashAttention eliminates S and P entirely but re-reads K/V for each Q block. The net HBM traffic is still far lower, because the S/P intermediates it avoids (N² elements) dwarf the extra K/V reads (O(N²d²/M) elements in total, where M is the SRAM size and d the head dimension). But those K/V re-reads are why FlashAttention's IO-complexity bound, O(N²d²/M), still has N² in the numerator.
The full [2,048 × 2,048] matrix never exists. Not in HBM, not in SRAM, not anywhere. At any moment, only one [128 × 128] tile of scores sits in SRAM: 32 KB. Roughly 250× smaller than the 8 MB matrix it replaces.
The Softmax Obstacle
There is a catch, and it almost kills this entire approach.
Softmax normalizes each row by the sum of exponentials across all keys: weight_i = exp(s_i) / Σ_j exp(s_j).
When you have only one block of keys in SRAM, you do not have the full sum. You cannot normalize correctly.
Think about what happens concretely. After processing the first K block (tokens 0-127), you have partial scores and a partial sum of exponentials. But the next block might contain a key with a far higher score than anything seen so far. If it does, the relative weights for all earlier keys need to shrink. The denominator changes, and every weight you computed so far is wrong. You do not know the correct normalization until you have seen every key.
The naive fix: two passes. Scan all key blocks once to accumulate the full denominator, then scan again to compute the actual output. But two passes means reading K and V from HBM twice, which doubles the traffic and guts the savings that tiling was supposed to deliver.
FlashAttention avoids the second pass entirely. The idea: maintain three running quantities per query row as key blocks stream through:
- Running maximum m: the largest score seen so far. (Subtracting the max before exponentiating prevents overflow. Standard practice in any stable softmax implementation, not specific to FlashAttention.)
- Running sum of exponentials ℓ: the accumulated Σ exp(s − m), corrected whenever the maximum changes.
- Partial output vector o: the weighted sum of value vectors accumulated so far.
When a new block arrives with a new maximum m_new, all previous exponentials were computed relative to the old max m. They are too large by a factor of exp(m_new − m). Multiply both ℓ and o by exp(m − m_new) to apply the correction, then fold in the new block's scores normally. When the max does not change, the correction factor is 1 and the update is plain accumulation.
After all blocks have been processed, the result is numerically identical to computing softmax over the entire row at once. Not an approximation. Bit-for-bit the same answer, computed in a single pass. This technique is called online softmax, and it is what makes single-pass tiled attention possible.
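The one-pass claim can be verified numerically. A NumPy sketch of online softmax for a single query row, streaming 128-key blocks and maintaining the running (m, ℓ, o) triple:

```python
import numpy as np

rng = np.random.default_rng(0)
scores = rng.normal(size=1024)        # one query row's scores vs 1,024 keys
V = rng.normal(size=(1024, 64))       # value vectors

# Reference: softmax over the whole row at once, then weight V.
p = np.exp(scores - scores.max())
ref = (p / p.sum()) @ V

# Online: stream key blocks, never holding the full row of weights.
m, ell, o = -np.inf, 0.0, np.zeros(64)
for i in range(0, 1024, 128):
    s, v = scores[i:i + 128], V[i:i + 128]
    m_new = max(m, s.max())
    corr = np.exp(m - m_new)    # rescale old terms (1 if max unchanged)
    e = np.exp(s - m_new)
    ell = ell * corr + e.sum()
    o = o * corr + e @ v
    m = m_new
out = o / ell

assert np.allclose(out, ref)    # identical, not an approximation
```

The final division by ℓ happens once, after the last block; nothing inside the loop ever needed the full denominator.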
What This Buys You
Standard attention materializes the full [N × N] score matrix per head. For TinyLlama at N = 2,048: 8 MB per head. Across 32 heads and 22 layers, the intermediates drive roughly 22 GB of HBM traffic per prefill pass. FlashAttention's working set is just the SRAM tiles: a [128 × 128] score tile (32 KB) plus the running output and three scalars per query row. The 8 MB intermediate per head shrinks to tens of kilobytes.
The total FLOPs are identical. FlashAttention does not skip any computation. It performs the same matmuls and the same softmax. It just never writes the big intermediates to HBM. In Roofline terms (Detail 12), this shifts the operating point. Standard attention sits at ~30 FLOPs/byte, well below the ridge and firmly bandwidth-bound. FlashAttention performs the same FLOPs with a fraction of the HBM traffic, pushing arithmetic intensity much closer to the compute-saturation threshold. The gains are largest at long sequence lengths, where the intermediates would otherwise dominate.
In practice, you rarely call FlashAttention directly. PyTorch's scaled_dot_product_attention selects a flash kernel automatically when hardware allows. Most production inference stacks use it under the hood.
Scope note: FlashAttention matters most during prefill, where the query length is N and the [N × N] intermediate is large. During decode, the query length is 1. There is no [N × N] matrix to form. The decode bottleneck remains what Detail 12 identified: reading the model weights.
All of the above optimized attention computation and KV cache layout. None of it touched the weights themselves. What if they're simply too large to fit?
Detail 18: Quantization
TinyLlama was manageable: 1.1 billion parameters stored in FP16, about 2.2 GB of weights. Fits comfortably on any modern GPU with room to spare. But scale up to a real production model and the picture changes. Not "gets harder." Changes completely.
A 70B model at FP16 is about 140 GB of weights. An A100 has 80 GB of VRAM. The weights alone don't fit on a single GPU. You'd need at least two GPUs just to hold the parameters, before allocating a single byte for KV cache, activations, or batch state. And even when weights do fit on a chip, Detail 12 showed what happens during decode: the GPU reads the entire weight set, once per token. For a 70B model, that's 140 GB through the memory bus to produce one token. At the A100's 1,555 GB/s, that's a ~90ms floor per token from data movement alone.
Two problems, same root cause. The weights take too many bytes. Too many to fit in memory. Too many to move quickly through the bus.
Do We Need All 16 Bits?
Pick a weight at random from somewhere inside TinyLlama. Say it's 0.0217. In FP16, that number occupies 16 bits, enough for roughly three decimal digits of precision. During inference, this weight gets multiplied by one activation value, and the product gets added to a sum of thousands of similar products in a matrix multiply. One small contribution to one large accumulation.
Would the model's output change if that weight were 0.024 instead of 0.0217? Almost certainly not. The rounding error is one part in a thousand, and it gets absorbed into the accumulation of thousands of other terms. The final logit might shift by some negligible fraction. The token that wins the softmax stays the same.
This is true for most weights in the model. They don't need all 16 bits of precision. They just need to be close enough. The question is: how close is close enough, and how few bits can we get away with?
Fewer Bits, Coarser Grid
Think of precision as a grid on the number line. FP16 spreads roughly 65,000 distinct values across its range. Each weight snaps to the nearest grid point, and with that many points, the snap is tiny. Almost invisible.
Now space those grid points further apart. INT8 gives you 256 levels. INT4 gives you 16. Fewer grid points means each weight has to round further to reach the nearest one. But fewer grid points also means fewer bits to store: 8 bits instead of 16, or 4 instead of 16. Half the memory, or a quarter.
The mechanics are straightforward. Take a group of weights, find the maximum absolute value, and divide by the number of integer levels (127 for signed INT8). That gives you a scale factor. Divide each weight by the scale, round to the nearest integer, and store the integer. To reconstruct later: multiply the integer by the scale.
scale = max(abs(W)) / 127
q = round(W / scale) # INT8 values in [-127, 127]
W_hat = q * scale # approximate reconstruction
Walk through one weight. Suppose the group's max absolute value is 1.0, so scale = 1/127 ≈ 0.00787. Our weight 0.0217 maps to round(0.0217 / 0.00787) = round(2.76) = 3. Reconstructed: 3 × 0.00787 = 0.0236. The original was 0.0217; the reconstruction is 0.0236. Off by 0.0019. That's the price of compression: a small rounding error per weight, in exchange for cutting storage in half.
INT4 pushes further: 16 discrete levels ([-8, 7] in signed two's-complement), each weight stored in just 4 bits. A quarter of FP16. The grid is much coarser, the rounding errors larger. But 4× compression.
One Scale Factor Isn't Enough
There's a catch. The scheme above computes one scale factor for a group of weights. If one weight in the group is an outlier (say, 10.0 while everything else sits below 0.5), the scale has to stretch to accommodate it. Now the entire range [-10, 10] gets carved into 256 levels, and all the small weights near zero, which is most of them, lose nearly all their resolution. One outlier ruins precision for everything else.
In practice, you don't compute one scale for the whole weight matrix. There's a spectrum of granularity:
- Per-tensor: one scale for the entire matrix. Simplest, but one outlier stretches the range for everyone.
- Per-channel: one scale per output dimension. Better, because magnitudes often vary across channels.
- Per-group: one scale per small block of weights, typically 64 to 128 values. Each block gets its own scale, tuned to its local range.
Per-group gives the tightest fit. Most INT4 methods (GPTQ, AWQ) use per-group quantization. That's a big part of why they maintain reasonable quality at just 4 bits per weight.
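A minimal per-group quantizer in NumPy (group size 64, signed INT8), showing how an outlier damages only its own group. Names are illustrative, not any library's API:

```python
import numpy as np

def quantize_per_group(w, group_size=64, levels=127):
    """One scale per group of `group_size` weights."""
    groups = w.reshape(-1, group_size)
    scales = np.abs(groups).max(axis=1, keepdims=True) / levels
    q = np.round(groups / scales).astype(np.int8)
    return q, scales

rng = np.random.default_rng(0)
w = rng.normal(scale=0.1, size=4096)
w[0] = 10.0                             # one outlier in the first group

q, scales = quantize_per_group(w)
err = np.abs((q * scales).ravel() - w)  # reconstruction error per weight

# The outlier stretches only group 0's scale; the other 63 groups
# keep tight scales and tiny rounding errors.
print(err[:64].max(), err[64:].max())
```

With a single per-tensor scale, every weight would pay the outlier's stretched-range penalty; here the damage is confined to 64 weights out of 4,096.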
This whole process, mapping floating-point weights to low-bit integers plus a scale factor, is quantization.
What Gets Quantized
When people say "INT4 model," they nearly always mean weight-only quantization. The model's fixed parameters (attention projections, FFN matrices, LM head) get compressed to low-bit integers. The activations, everything computed on the fly as data flows through the network, stay in FP16 or BF16.
Why not quantize everything? Weights are static. Same numbers every forward pass. You can analyze their distribution offline, pick good scales, and be done. Activations are different. They change with every input. Worse, they spike: a handful of hidden dimensions can hit values 100× larger than the rest. Force those into a low-bit integer range and the outliers eat all the dynamic range while everything else rounds to near-zero.
So the standard setup is INT4 or INT8 weights with FP16 activations. During matrix multiplies, the GPU reconstructs each weight on the fly (multiply the stored integer by its scale factor), multiplies against the full-precision activation, and accumulates the result. The dequantization happens inside the compute pipeline, not as a separate storage step.
There is a parallel track for datacenter inference: FP8 (8-bit floating point). Unlike INT8, FP8 retains a small exponent field, which preserves dynamic range and makes it viable for quantizing both weights and activations without the outlier problem. GPUs from the H100 onward have native FP8 tensor cores (with NVIDIA's Blackwell architecture extending hardware support to FP4), and FP8 is increasingly the default for datacenter deployments where the hardware supports it. Often the best trade-off between throughput and quality.
The Bandwidth Payoff
Back to the bandwidth argument from Detail 12. If decode is bandwidth-bound, the time per token can't be faster than weight_size / bandwidth. Shrink the weights, shrink the floor.
| Precision | Bytes per weight | TinyLlama weight size | Bandwidth floor per token (A100) |
|---|---|---|---|
| FP16 | 2 | 2.2 GB | ~1.41 ms |
| INT8 | 1 | 1.1 GB | ~0.71 ms |
| INT4 | 0.5 | 0.55 GB | ~0.35 ms |
These numbers are approximate. Real quantized formats store scale (and sometimes zero-point) metadata per group, so the file is slightly larger than the naive calculation. And the speedups are bandwidth ceilings, not guarantees. Kernel overhead, dequantization cost, KV-cache traffic, and batch size all affect the actual latency.
But the direction is reliable: if decode is bandwidth-bound, shrinking weights shrinks ITL.
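The floors in the table are just weight bytes divided by bandwidth:

```python
BANDWIDTH = 1.555e12    # A100 HBM, bytes/s
PARAMS = 1.1e9          # TinyLlama parameter count

for name, bytes_per_weight in [("FP16", 2), ("INT8", 1), ("INT4", 0.5)]:
    floor_ms = PARAMS * bytes_per_weight / BANDWIDTH * 1e3
    print(f"{name}: {floor_ms:.2f} ms/token")
# FP16 ~1.41, INT8 ~0.71, INT4 ~0.35
```

Swap in a 70B parameter count and the same three lines reproduce the ~90 ms FP16 floor from the start of this Detail.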
Scale the math to a 70B model. FP16: ~140 GB of weights. One A100 (80 GB) can't hold it. You'd shard across at least two GPUs, paying inter-GPU communication overhead on every forward pass.
INT4: ~35 GB of weights. Add scale-factor metadata, call it 38 to 40 GB. Fits on a single 80 GB GPU with room for KV cache and activations. A multi-GPU problem becomes a single-GPU problem.
What Weight-Only Quantization Does Not Change
Two things to keep in mind.
First: the KV cache. If you only quantize weights, the KV cache stays in FP16 or BF16. For short contexts and small batches, weights dominate memory and quantization helps a lot. But as context grows or batch size increases, KV cache becomes a larger share of total memory. Weight-only quantization shrinks the fixed cost (weights). It doesn't touch the per-sequence cost (KV cache).
Modern inference engines address this with KV cache quantization: storing cached keys and values in FP8 or INT8 instead of FP16. This halves the per-token KV footprint, and INT4 KV cache (a 4× reduction) is entering production. The technique works because attention scores are relatively tolerant of small rounding errors in cached keys and values, especially with per-head or per-channel scale factors. In practice, FP8 KV cache has negligible quality impact and INT8 is close behind; INT4 requires more care but is viable for many workloads. Inference frameworks like vLLM and TensorRT-LLM now support KV cache quantization as a standard option.
Quantizing the KV cache reduces memory pressure (more sequences fit in VRAM) and bandwidth demand (each decode step reads less KV data per sequence). The latter matters directly in Detail 21 when we look at how KV traffic scales with batch size.
Second: the number of forward passes. You still read the (now smaller) weights once per token. The cost per read drops, but the count stays the same. Each decode step still produces exactly one token.
The Cost: Quality
Go back to our weight. 0.0217 became 0.0236 after INT8 rounding. An error of 0.0019. In a single matrix multiply, thousands of such slightly-off weights get multiplied against an input vector. Some round up, some round down. The errors partially cancel. The net effect on any single output value is tiny.
But not all weights are equally forgiving. A weight at 0.0118 sits right between two INT8 grid points. It could round to 0.00787 or 0.01575. Which direction it falls can nudge a downstream logit just enough to flip which token wins the softmax. One flipped token early in generation can alter the entire trajectory.
How often this matters depends on the grid. INT8 (256 levels) is usually very close to full-precision quality, though the gap depends on the quantization scheme (per-tensor vs per-channel vs per-group) and the workload. INT8 is the safe default. INT4 (16 levels) has a much coarser grid. For conversational use and general question-answering, INT4 usually works well. For tasks that demand precision (multi-step math, exact code generation, detailed factual recall), you start to see the model slip on cases the full-precision version handles cleanly. It doesn't become stupid. It just gets slightly less reliable on the hard cases.
Below INT4, things get fragile fast. 3-bit and 2-bit quantization are active research areas, but the grid is so coarse that even careful calibration can only partially compensate.
Calibration itself is not retraining. You run a small representative set of inputs through the model and measure how quantization errors propagate through the layers. The goal is to pick scales (and sometimes identify outlier channels) that minimize the effect on output quality. It takes minutes, not days.
Quantization gives you cheaper weight reads. Each decode step moves fewer bytes, so the bandwidth floor drops and ITL improves. You can fit larger models on fewer GPUs.
But it didn't change how many reads you do. The decode loop still produces one token per forward pass. Each read is cheaper, but you're doing just as many of them.
Detail 19: Speculative Decoding
Watch a long response stream in. Token by token, each one requiring a full trip through the model. Load 2.2 GB of weights from VRAM, run the math for a single token, write the result. Two hundred tokens means two hundred of these trips.
Quantization (Detail 18) made each trip cheaper by shrinking the weights. Batching (Details 13-14) made each trip serve more users by processing multiple requests in parallel. But for a single user watching their response appear, the rhythm hasn't changed. One token per forward pass.
The Dependency That Keeps Us Slow
The obvious wish: run the model once and produce 4 tokens instead of 1.
Autoregressive generation won't allow it. Say the prefix is "Once upon a" and we want the next 4 tokens. Token 1 might be "time." But token 2 depends on token 1. If token 1 is "time," token 2 is probably ",". If token 1 were "hill," token 2 might be "there" instead. You can't compute the right token 2 without first knowing token 1. Token 3 depends on token 2. Token 4 on token 3. Each token is conditioned on every token before it. That's the constraint from Detail 3.
So generation is inherently sequential.
But here's what makes this interesting. That constraint only applies to generation: the act of choosing the next token. It does not apply to verification: checking whether a proposed token was the right choice.
Think about the difference. When you generate, you're staring at a blank slot and asking "what goes here?" You need all prior context to answer that question. One choice at a time, full model pass each time. But when you verify, someone has already filled in the slot. The question becomes "is this what I would have written?" And you can check all the slots at once, in parallel. This is exactly what prefill does: process an entire sequence in one forward pass (Detail 11).
Generation: sequential. Verification: parallel. That asymmetry is the entire trick.
Why Checking Is (Almost) Free
We established in Detail 12 that decode is bandwidth-bound. The GPU spends most of its time waiting for weight data to arrive from memory, not doing actual computation. One TinyLlama decode step:
- Reads ~2.2 GB of weights from VRAM
- Performs ~2.2 billion FLOPs
- Arithmetic intensity: ~1 FLOP/byte
At ~1 FLOP/byte, the GPU uses maybe 1% of its compute capacity. The other 99% sits idle, waiting for the next byte to arrive.
Now feed 5 tokens into that same forward pass instead of 1. The weight read is still ~2.2 GB. Same model, same parameters. But now each weight gets multiplied against 5 token vectors instead of 1. The arithmetic intensity goes up by roughly 5×. Still well below the GPU's compute ceiling, so the extra math adds almost nothing to wall-clock time.
Verifying 5 tokens costs roughly the same time as generating 1. Same weights loaded, same bandwidth consumed. The extra computation was going to waste anyway.
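The back-of-envelope: the weight read dominates either way. The ~312 TFLOPs peak is the A100's dense FP16 tensor-core spec, not a number from this post; everything else is from this section:

```python
BW = 1.555e12      # A100 HBM bandwidth, bytes/s
PEAK = 312e12      # A100 dense FP16 tensor throughput, FLOPs/s
WEIGHTS = 2.2e9    # bytes read per forward pass (TinyLlama, FP16)

for k in (1, 5):                          # tokens in the pass
    t_mem = WEIGHTS / BW * 1e3            # ms spent moving weights
    t_math = k * 2.2e9 / PEAK * 1e3       # ms spent on arithmetic
    print(f"k={k}: memory {t_mem:.2f} ms, math {t_math:.3f} ms")
```

The memory term is identical for both values of k; the math term stays tens of microseconds even at k=5, invisible next to the ~1.4 ms weight read.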
So if we had good guesses for the next few tokens, checking them would be nearly free. The question is where the guesses come from.
Most Tokens Are Easy to Predict
Look at the sentence "Once upon a time, there was a." After "upon a," the word "time" is almost certain. The comma after "time" is trivial. "there" after ", " is a strong bet. Most tokens in natural text aren't hard to predict. They're syntax glue, common phrasing, predictable continuations. The real difficulty is concentrated in a handful of tokens where the model actually needs its full capacity to decide.
A much smaller model can get the easy tokens right. Take a 125M-parameter model from the same family as the target (it must share the same tokenizer, since we'll be comparing tokens one-to-one). Its weights are ~250 MB, roughly 9× smaller than TinyLlama. Each forward pass is proportionally cheaper.
Now combine the two observations. Checking is almost free. And a cheap model can produce decent guesses. The approach:
- The small model runs ahead and generates K candidate tokens autoregressively. It's tiny, so this is fast.
- The target model runs one forward pass over the prefix plus all K candidates. This costs roughly the same as generating a single token, because the verification just fills idle compute.
- Walk the K candidates left to right. At each position, the target has computed its own logits from the full context. If the candidate matches what the target would have produced, accept it. If not, take the target's token at that position and throw away everything after it.
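For greedy decoding, the accept/reject walk can be sketched directly. Here `target_tokens[i]` stands for the target's argmax at draft position i, all produced by the single verification pass (names hypothetical):

```python
def verify_greedy(draft_tokens, target_tokens):
    """Accept drafts left to right; stop at the first mismatch.

    target_tokens has len(draft_tokens) + 1 entries: the target's pick
    at each draft position, plus one position past the last draft.
    """
    emitted = []
    for draft, target in zip(draft_tokens, target_tokens):
        if draft != target:
            emitted.append(target)   # resample: target's token, then stop
            return emitted
        emitted.append(draft)        # match: accept the draft token
    emitted.append(target_tokens[len(draft_tokens)])  # bonus token
    return emitted

# Rejection at the third position:
print(verify_greedy(["time", ",", "there", "was"],
                    ["time", ",", "in", "-", "-"]))  # ['time', ',', 'in']
```

Every return path emits (accepted count + 1) tokens: the bonus sample when all drafts land, or the target's resample at the first mismatch.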
Tracing Through an Example
Prefix: "Write a story. Once upon a"
Draft phase. The 125M model generates 4 tokens: ["time", ",", "there", "was"]. Four forward passes, each reading ~250 MB. Fast.
Verification phase. TinyLlama takes the prefix plus all 4 draft tokens and runs one forward pass. Same ~2.2 GB of weight data as a normal decode step, but now producing logits at every draft position:
Position 1: draft = "time" → target agrees ✓ accept
Position 2: draft = "," → target agrees ✓ accept
Position 3: draft = "there" → target disagrees ✗ → use target's token: "in"
Position 4: draft = "was" → discarded (conditioned on wrong token 3)
Result: 3 tokens from one target forward pass. Two accepted plus one from the target at the rejection point. Without this approach, those 3 tokens would have cost 3 separate reads of 2.2 GB each.
Think about the bounds. At worst, all 4 drafts are wrong. We still get 1 token (the target's own choice at position 1). At best, all 4 are right and we get 5: the 4 accepted candidates, plus a bonus token. Where does the bonus come from? The target's forward pass produces logits one position past the last draft token. When all drafts are accepted, those final logits are conditioned on a fully correct sequence, so we can sample from them. When rejection happens at position i, logits past that point were conditioned on a wrong token and get discarded.
Either way: (accepted count + 1) tokens per target forward pass. The +1 is a bonus sample past the end when everything lands, or a resample at the first mismatch when it doesn't.
This is speculative decoding. The name borrows from CPU design, where processors speculatively execute instructions before knowing if a branch condition is true, discarding the work if the guess was wrong.
The Output Doesn't Change
A smaller model is proposing tokens. We're sometimes accepting them. The natural worry: does this pull quality toward the draft model?
No. Think about what happens at each position. The target model computes its own logits from the full context. If the draft happened to guess the target's top token, we accept it. If not, we use the target's choice. Either way, the token that enters the sequence is the one the target would have picked. The draft model influenced which tokens got checked, not which ones got emitted.
For greedy decoding, the output is bit-for-bit what the target model would have produced alone. The draft model only affected speed.
For sampling (temperature, top-p), exact equivalence requires a rejection sampling scheme. Accept the drafted token y with probability min(1, p(y)/q(y)), where p is the target's probability and q is the draft's. Tokens where the draft was overconfident (q > p) get accepted less often, correcting for the bias. On rejection, sample from a corrected residual distribution. The math guarantees the distribution of emitted tokens matches sampling from the target directly.
Speculative decoding is not approximate decoding. The acceptance criterion guarantees exact equivalence. The speedup comes from exploiting idle compute, not from compromising quality. You could relax the criterion and accept tokens that are "close enough," but then you'd be biasing the output toward the draft model. The min(1, p/q) threshold exists precisely to prevent that.
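For the sampling case, the accept/resample rule can be sketched directly from the math. A toy version over explicit probability dictionaries (the dict-based interface is for illustration; real implementations work on logit tensors):

```python
import random

def speculative_accept(drafted, p, q, rng):
    """Emit a token distributed exactly as a sample from the target
    distribution p, given that `drafted` was sampled from the draft's q.

    Accept with probability min(1, p/q); on rejection, resample from
    the residual distribution max(0, p - q), renormalized.
    """
    if rng.random() < min(1.0, p[drafted] / q[drafted]):
        return drafted
    residual = {t: max(0.0, p[t] - q[t]) for t in p}
    z = sum(residual.values())
    r, acc = rng.random() * z, 0.0
    for t, w in residual.items():
        acc += w
        if r <= acc:
            return t
    return drafted  # unreachable up to floating-point rounding
```

Over many trials the emitted tokens match p even when q disagrees with it sharply; that is exactly the guarantee the min(1, p/q) rule buys.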
What It Costs
Two models means memory for two weight sets and two KV caches:
Target weights (TinyLlama): 2.2 GB
Draft weights (125M): ~250 MB
Total: ~2.45 GB (plus KV caches for both)
When a draft token is rejected, the draft model's KV cache rolls back to the last accepted position. If token 3 was rejected and resampled, the draft's cached states for tokens 4+ were conditioned on the wrong token 3. Those entries are invalid. The draft clears back and re-drafts from there. The target cache only commits accepted entries, so it stays clean.
The draft passes themselves aren't free either. Each one is cheap (small model), but if acceptance is low, most of that draft work is wasted. Total wall-clock time includes the draft passes plus the target verification. If the draft model is bad enough, speculative decoding can actually be slower than plain decoding. The technique pays off when the target is bandwidth-bound (small batches), the draft model is much smaller than the target, and the acceptance rate is high. In that regime, the draft cost is small relative to the savings from accepting multiple tokens per target pass. Outside it, the overhead eats the gains.
When It Helps (and When It Doesn't)
The speedup hinges on one number: the acceptance rate, how often the draft model's guess matches what the target would have produced. With K = 4 draft tokens (assuming each draft token is accepted with that probability, conditional on all previous draft tokens being accepted):
| Acceptance rate | Expected tokens per target forward pass |
|---|---|
| 80% | ~3.4 |
| 50% | ~1.9 |
| 20% | ~1.2 (barely helps) |
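The table's values come from a short geometric sum: with acceptance rate α per draft token, the expected number of accepted drafts is α + α² + … + α^K, and the verification pass always contributes one more token (bonus or resample). A quick check:

```python
def expected_tokens_per_pass(alpha, k=4):
    """Expected tokens per target forward pass: accepted drafts plus
    the guaranteed +1 (bonus sample, or resample at the mismatch)."""
    return sum(alpha ** i for i in range(1, k + 1)) + 1
```

`expected_tokens_per_pass(0.8)` comes out near 3.4, matching the table's first row.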
Works well when:
- The draft model is from the same family as the target (high token agreement)
- Output is locally predictable: boilerplate, common phrasing, structured text, code
- Batch sizes are small, so decode is bandwidth-bound and the spare compute that makes verification cheap is actually available
Works poorly when:
- The draft model is weak relative to the target (low acceptance means most draft work is wasted)
- Large batches already saturate the GPU's compute (verification stops being "free")
- Memory is tight (two models plus two KV caches may not fit)
2-3× speedups on decode-heavy, low-batch workloads are common in practice.
Variants
The draft-model approach works, but loading a second model has costs. Several alternatives eliminate that overhead.
Self-speculative decoding. The target model drafts for itself by skipping its later transformer layers. Run only the first N layers (say 10 of TinyLlama's 22) for drafting. Early layers capture mostly local syntax, enough to predict function words and common continuations. Then verify with all 22 layers. No extra model, no extra memory for a second weight set. Trade-off: early layers have less context, so acceptance rates are lower.
Medusa. Attach lightweight prediction heads to the target model's final hidden layer. Head 1 predicts t+1, Head 2 predicts t+2, and so on. Each head proposes several candidates, forming a tree of possible continuations verified in one forward pass. The limitation: the heads are independent. Head 3 doesn't know what Heads 1 and 2 predicted. It's guessing 3 steps ahead without knowing the intermediate tokens, which limits accuracy for later positions.
EAGLE. Instead of predicting future tokens directly (hard, since many tokens are plausible at each step), predict future hidden states. Hidden states are continuous and smooth: similar inputs produce similar outputs. A small autoregressive head predicts the next hidden state given both the current state and the embedding of the token actually sampled. That conditioning on the chosen token resolves the ambiguity that Medusa's independent heads can't handle. EAGLE variants report 3-5× speedups.
Lookahead decoding. Uses an iterative refinement technique (Jacobi iteration) that produces candidate token sequences as a natural byproduct. No separate model, no extra heads. Verify those candidates against the target. Works best on repetitive or structured text where the same patterns recur.
Detail 20: Prefix Caching
The KV cache (Detail 10) eliminated redundant work within a single request. Once a token's Keys and Values are computed, the model never recomputes them. Generate token 50, and the KV entries for tokens 0 through 49 are already in memory. No wasted work.
But watch what happens across requests.
A chatbot has a system prompt: instructions, safety guidelines, tool descriptions, maybe a few examples of the desired output format. Call it 500 tokens. The user never sees these tokens. They type "What is the capital of France?" and the server processes 512 tokens: the 500-token system prompt, then 12 user tokens. Full prefill. Response streams back.
Next user types "Explain gravity." The server processes 508 tokens: the same 500-token system prompt, then 8 user tokens. Full prefill again.
Same 500 tokens. Same KV entries produced. Discarded and recomputed from scratch.
At 100 requests per minute, the server prefills 50,000 system-prompt tokens against roughly 2,000 unique user tokens. About 96% of the prefill computation produces KV entries the server already computed for the previous request. Seconds ago.
The fix is obvious in concept: compute the system prompt's KV entries once, store them, hand them to every request that starts with the same prefix. TTFT drops in proportion to how much of the input was shared. But "obvious in concept" and "actually safe to do" are separated by a question worth pausing on.
Can We Actually Reuse Those KV Entries?
When two requests share the same system prompt, the KV entries for those positions look like they should be the same. But are they identical, or only similar?
This matters more than it seems. If the entries are only approximately the same, reusing them changes the model's attention computations. Different attention patterns, different outputs. An approximation in the KV cache is an approximation in the model's reasoning. For reuse to be safe, we need exact matches. Not close. Exact.
The causal mask from Detail 4 gives us the guarantee. Position i can only attend to positions 0 through i. Information flows strictly forward, never backward. The Key and Value vectors at position i depend only on tokens at positions 0 through i. Nothing downstream can reach back and change what was computed upstream.
So if two requests share the same first 500 tokens, the KV entries at every shared position are byte-identical. What the user types at position 501 cannot influence position 200. Not "barely influences." Cannot. The causal mask makes this an architectural guarantee, not an empirical observation.
Compute once, store, attach to subsequent requests. This is prefix caching.
Walking Through Two Requests
Request 1 arrives: a 500-token system prompt followed by "What is the capital of France?" (12 tokens). 512 tokens total. The prefix cache is empty, so the server runs full prefill on all 512 tokens. At 22 KB per token position (Detail 10), the system prompt's KV footprint is 500 × 22 KB = 11 MB. The server hashes the 500-token prefix and stores a reference to those KV blocks. In systems using PagedAttention (Detail 16), the cached entries live as shared blocks that multiple requests can reference without copying, the same virtual-memory trick that already manages per-request KV memory.
Request 2 arrives seconds later: the same 500-token system prompt followed by "Explain gravity" (8 tokens). The server hashes the first 500 tokens, finds a match, and attaches the cached KV blocks. No recomputation. Prefill runs only on the 8 new tokens. Workload: 508 tokens down to 8. A 98% reduction. Since prefill dominates TTFT (Detail 11), time-to-first-token drops roughly in proportion.
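The core mechanism is a lookup table keyed by the exact token prefix. A toy sketch (real servers hash block by block and store references to shared PagedAttention blocks rather than whole-prefix entries; the class and method names are illustrative):

```python
from hashlib import sha256

class PrefixCache:
    """Maps a hash of an exact token-ID prefix to its cached KV blocks."""
    def __init__(self):
        self._store = {}

    @staticmethod
    def _key(token_ids):
        # Exact-match key: any single differing token ID changes the hash.
        return sha256(repr(tuple(token_ids)).encode()).hexdigest()

    def insert(self, prefix_ids, kv_blocks):
        self._store[self._key(prefix_ids)] = kv_blocks

    def lookup(self, prefix_ids):
        """Returns the cached KV blocks, or None on a miss."""
        return self._store.get(self._key(prefix_ids))
```

Request 2 hashes its first 500 tokens, hits, and prefills only the 8-token suffix. Change a single token anywhere in the prefix and the hash, and therefore the hit, disappears.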
How Much Does It Help?
The improvement scales with the ratio of shared prefix to unique suffix. A 500-token prefix with a 10-token suffix means prefill handles 2% of the original work. A 100,000-token prefix with a 200-token suffix: 0.2%.
For very large shared prefixes, the TTFT reduction can be dramatic because only the uncached suffix still needs prefill. Inter-token latency is unchanged. Decode is still bandwidth-bound on model weights (Detail 12), and prefix caching doesn't touch the decode phase. The speedup is entirely on the prefill side.
Exact Tokens Only
There is one constraint that bites in practice. Cache hits require exact token matches. Not semantic similarity. Not "close enough." Byte-identical tokens, in sequence.
A single extra space, a different quote style, an injected timestamp: any of these changes the token IDs and breaks the cache from that point forward. To get reliable hits, the shared prefix must be byte-identical before tokenization. Template the system prompt once, reuse it verbatim. Dynamic content (user names, timestamps, session IDs) goes after the cached prefix, not inside it.
Production Reality
The cache consumes the same scarce GPU memory discussed in Detail 15, so servers use LRU or TTL eviction to keep only the most frequently reused prefixes resident. SGLang's RadixAttention organizes this cache as a radix tree (trie) over token sequences, enabling efficient longest-prefix matching, insertion, and eviction in a single data structure. Think of it as the prefix-caching analog of PagedAttention for KV memory management.
Because the cache operates at PagedAttention block granularity (Detail 16), a 500-token prefix occupies ⌈500 / 16⌉ = 32 blocks of 16 tokens (though production systems typically cache only the 31 fully populated blocks, since the partial last block would require copy-on-write if two requests diverged within it). And when cached prefixes include user-provided text, many production systems scope caches per tenant or restrict caching to known-safe system prompts, treating the cache as user data.
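A stripped-down version of the longest-prefix matching that RadixAttention performs can be sketched as a trie over fixed-size token blocks (SGLang's real structure is a radix tree with path compression, reference counting, and LRU eviction; everything here is illustrative):

```python
class Node:
    def __init__(self):
        self.children = {}   # token-block tuple -> Node
        self.kv = None       # reference to that block's cached KV

class RadixCache:
    def __init__(self, block=16):
        self.block = block
        self.root = Node()

    def _blocks(self, ids):
        # Only fully populated blocks participate in caching.
        full = len(ids) - len(ids) % self.block
        return [tuple(ids[i:i + self.block]) for i in range(0, full, self.block)]

    def insert(self, ids, kv_refs):
        node = self.root
        for blk, kv in zip(self._blocks(ids), kv_refs):
            node = node.children.setdefault(blk, Node())
            node.kv = kv

    def longest_prefix(self, ids):
        """KV references covering the longest cached block-aligned prefix."""
        node, hits = self.root, []
        for blk in self._blocks(ids):
            if blk not in node.children:
                break
            node = node.children[blk]
            hits.append(node.kv)
        return hits
```

A 500-token prefix yields 31 full blocks of 16; any later request sharing those 496 tokens reuses all 31 cached entries regardless of what follows.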
Everything up to this point has treated requests one at a time. Let's see what happens when that is no longer the case.
Detail 21: Latency vs Throughput Trade-offs
So far, everything in Part 2 has been about one request at a time. Faster prefill. Smaller KV cache. Speculative decoding. Prefix caching. All single-user optimizations.
But nobody runs an A100 for one user. Detail 12 told us why this matters: a single decode stream uses less than 1% of the GPU's compute. The arithmetic units sit idle while the memory bus ferries weights. Batching (Detail 14) fixed that waste by packing multiple sequences into each decode step, sharing the weight read, producing more tokens per cycle.
We never paused to ask: what does batching do to the person waiting for their stream?
The Two Clocks
You already know this rhythm. It's the first thing this post described. You type a prompt, there's a pause, then tokens stream in steadily.
Those are two separate clocks.
The pause is the time to process your prompt (prefill) plus any time your request spent waiting for a free slot. For TinyLlama on a single A100 serving one user with no contention, prefill on a 100-token prompt takes about 5 ms. The GPU tears through it because prefill is compute-bound and the compute is wide open.
The stream is one token per decode step. Each step reads the full 2.2 GB of weights plus the growing KV cache. At the A100's 1,555 GB/s, that comes out to about 1.4 ms per token.
A 200-token response: 5 + (199 × 1.4) ≈ 284 ms end-to-end.
Both numbers are excellent. The pause is imperceptible. The stream feels instant. And neither will hold once other users show up.
The industry calls the first clock TTFT (time to first token). The second is ITL (inter-token latency). They degrade differently and create different kinds of frustration. If TTFT drags, the interface feels frozen before it even starts generating. If ITL stutters, the streaming feels choppy, like a video buffering mid-sentence. Both are bad, but they come from different parts of the system. And as we're about to see, fixing one can make the other worse.
Zooming Out to the GPU
Switch perspective. You're no longer the user. You're the operator watching a GPU dashboard.
One user gets ~704 tokens/sec. Fast for that user. But the GPU is running at 1% compute utilization. The memory bus is busy, sure, but the thousands of arithmetic cores are just... waiting. One car on a six-lane highway.
The operator's metric is throughput: total output tokens per second across all active requests. Latency is per-user. Throughput is per-GPU. At one user, they're the same number. At 100 users, they diverge.
Batching fills the idle compute by processing multiple sequences per weight read. Eight users means eight tokens from one 2.2 GB read, instead of eight separate reads. The weight cost is amortized. Throughput climbs.
But each sequence brings its KV cache along. And the KV cache costs bandwidth to read.
Where the Tax Shows Up
Why does per-stream ITL get worse as you add more sequences?
Each decode step reads the model weights once: 2.2 GB for TinyLlama, regardless of batch size. Weights are shared. But attention also reads each sequence's KV cache, its private record of all prior keys and values. More sequences, more KV data, more bytes through the memory bus per step.
TinyLlama with GQA uses 22 KB of KV cache per token position. A sequence with 300 tokens of context (100-token prompt plus 200 generated so far) carries about 6.6 MB of KV data. Tiny next to 2.2 GB of weights. But multiply it by the batch size:
| Batch size | Weight read | KV read | Total read | Step time | Total tok/s | Per-stream tok/s |
|---|---|---|---|---|---|---|
| 1 | 2.2 GB | 7 MB | 2.21 GB | 1.42 ms | 704 | 704 |
| 8 | 2.2 GB | 53 MB | 2.25 GB | 1.45 ms | 5,517 | 690 |
| 32 | 2.2 GB | 211 MB | 2.41 GB | 1.55 ms | 20,645 | 645 |
| 128 | 2.2 GB | 845 MB | 3.05 GB | 1.96 ms | 65,306 | 510 |
Look at batch 128. The GPU reads 3.05 GB instead of 2.21 GB. 38% more data. But it produces 128 tokens instead of 1. Throughput jumps 93×. Per-stream ITL goes from 1.42 ms to 1.96 ms. Each user barely notices. The operator gets 93× more useful work from the same hardware.
38% more latency per stream buys 93× more total output. That is an absurdly good deal.
But trace the KV column. Weight reads are constant: 2.2 GB, always. KV reads scale linearly with batch size. At batch 128, KV traffic is 845 MB, nearly 40% of the weight read. Push to 256 and it approaches the weight read itself. Push further and KV traffic overtakes weight traffic. From that point on, each additional sequence directly inflates step time in proportion.
This is the same roofline picture from Detail 12, now applied to the batch as a whole. Below a critical batch size, adding sequences costs almost nothing in latency. The spare bandwidth absorbs the extra KV reads. Throughput scales nearly for free. Both clocks stay happy.
Above it, the GPU transitions from bandwidth-bound (where idle capacity absorbs the extra work) to a regime where KV reads dominate. Each additional sequence buys a smaller throughput increment at a proportionally larger latency cost. The free lunch is over.
Where does the critical batch size fall? It depends on two things: how big each sequence's KV footprint is, and how much total bandwidth is available. TinyLlama with GQA and short contexts pushes the crossover far out, because GQA compresses the KV footprint to 22 KB per position instead of the 176 KB that full multi-head attention would cost (Detail 6). For a model with full multi-head attention, or longer contexts, or both, the crossover comes much sooner.
Two levers push it further out. GQA (or MQA) reduces KV heads architecturally. KV cache quantization (Detail 18) reduces storage precision: FP8 or INT8 halves the bytes per sequence, INT4 quarters them. Both slow the rate at which KV traffic grows with batch size, keeping the system in the "free throughput" regime for more sequences before the crossover hits.
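The numbers in the table fall out of a two-term bandwidth model: each step streams the shared weights once, plus one private KV cache per sequence. A sketch under the section's assumptions (TinyLlama, 300-token contexts, A100 bandwidth):

```python
WEIGHTS_GB = 2.2          # weight read per decode step, shared by the batch
KV_PER_SEQ_GB = 0.0066    # ~6.6 MB KV per sequence at 300 tokens x 22 KB
BANDWIDTH_GBS = 1555      # A100 HBM bandwidth

def step_time_ms(batch_size):
    """Bandwidth floor for one decode step."""
    bytes_read = WEIGHTS_GB + batch_size * KV_PER_SEQ_GB
    return bytes_read / BANDWIDTH_GBS * 1000

def throughput_tok_s(batch_size):
    """Total tokens per second across the batch."""
    return batch_size * 1000 / step_time_ms(batch_size)
```

`step_time_ms(1)` gives ~1.42 and `step_time_ms(128)` gives ~1.96, reproducing the table; past the crossover, the KV term dominates and each added sequence inflates step time in proportion.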
When Prefill Interrupts Decode
The table above assumed every sequence was in the decode phase, all generating one token per step. Neat and uniform. But continuous batching (Detail 14) means new requests arrive constantly, and each one needs prefill before it can join the decode loop.
Here is where the two clocks collide.
Prefill is compute-heavy: it processes the entire prompt in one big matrix multiplication. Decode is bandwidth-heavy: it reads weights and KV cache to produce one token per sequence. These are different workloads fighting for the same GPU. When a long prefill lands in the middle of a decode batch, it monopolizes the compute pipeline. Every active decode stream stalls.
Make this concrete. 32 sequences are decoding smoothly at ~1.55 ms per step. A new request arrives with a 1,000-token prompt. Its prefill takes roughly 50 ms. During those 50 ms, all 32 streams produce zero tokens. Their ITL for that step spikes from 1.55 ms to 50+ ms, a 30× jump. The user mid-stream sees a visible stutter. The new user's TTFT is fine. Everyone else's experience just took a hit.
The naive fix: limit how many prompt tokens get prefilled per step. But throttling prefill directly inflates TTFT for the incoming request. You're protecting existing users by punishing the new one.
A better approach: chunked prefill. Instead of processing the full 1,000-token prompt in one shot, the scheduler breaks it into smaller chunks (say, 128 tokens each) and interleaves each chunk with a regular decode step.
Why does this work? Because the two workloads waste different resources. Decode is bandwidth-bound: compute units sit mostly idle while waiting for weight reads. Prefill is compute-bound: lots of arithmetic, less bandwidth pressure per FLOP. Interleave them and the prefill chunk fills the idle compute that decode leaves on the table, while decode uses the bandwidth that prefill doesn't fully need. They complement each other instead of competing.
The result: decode streams see small, predictable ITL bumps instead of one catastrophic spike. TTFT for the new request is slightly higher than a monolithic prefill (the work is spread across several steps), but that trade is almost always worth it.
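The arithmetic behind that trade is simple enough to write down. A sketch using the section's numbers (50 ms prefill, 1.55 ms decode steps; the chunk count is a tunable assumption):

```python
def worst_gap_monolithic(prefill_ms, decode_ms):
    """ITL spike when a full prefill lands between two decode steps."""
    return prefill_ms + decode_ms

def worst_gap_chunked(prefill_ms, num_chunks, decode_ms):
    """With the prompt split into chunks interleaved with decode steps,
    each inter-token gap carries only one chunk of prefill work."""
    return prefill_ms / num_chunks + decode_ms
```

Splitting the 1,000-token prompt into 8 chunks turns one ~51.6 ms stall into eight gaps of ~7.8 ms each, while the new request's TTFT stretches by only a few decode steps.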
The Queue
Everything above assumed the request was already running on the GPU. In a real service, requests arrive over time and compete for slots in the decode batch.
With continuous batching, a new request starts only after three things happen:
- A slot opens (an existing sequence finishes and frees its KV memory)
- Prefill runs for the new prompt
- The new sequence joins the decode loop
If slots are available, this happens immediately and TTFT equals bare prefill time. If all slots are full, the request waits. TTFT becomes prefill time plus queue time. And queue time can dwarf prefill.
How fast does queue time grow? Think about it from the arriving request's perspective. At low utilization, most slots are open. A new request almost always finds one immediately. As utilization climbs, open slots become rare. Near full capacity, almost every request waits, and each waiting request makes the line longer for the next one.
The relationship is nonlinear, and brutally so. Queue wait scales roughly as ρ / (1 − ρ), where ρ is utilization (the fraction of capacity in use). At 50% utilization the ratio is 1. At 90% it is 9. At 95% it is 19. The curve is a hockey stick:
- At 70% utilization: queue is usually short, TTFT ≈ bare prefill time
- At 90%: p99 wait times climb sharply
- At 95%+: p99 TTFT can reach seconds, even while average throughput looks healthy on the dashboard
High utilization means high throughput and good hardware efficiency. But tail TTFT can blow up well before the GPU reaches 100%, because queueing math is brutal at the margins. The operator sees 92% utilization and thinks "efficient." The user in the p99 tail sees a 3-second wait and thinks "broken."
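The hockey stick is one line of arithmetic. A sketch of the scaling factor (this is the classic M/M/1 queueing shape; real serving systems only approximately follow it):

```python
def queue_wait_factor(utilization):
    """Relative queue wait: grows like u / (1 - u) as utilization u -> 1."""
    return utilization / (1 - utilization)
```

The factor is 1 at 50% utilization, 9 at 90%, and 19 at 95%: squeezing out the last few points of utilization multiplies the wait.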
Goodput
Consider two server configurations:
- Config A: 50,000 tok/s throughput, p99 TTFT = 4 seconds
- Config B: 20,000 tok/s throughput, p99 TTFT = 200 ms
Config A moves more tokens. If you're graded on throughput alone, A wins. But if your application requires TTFT under 500 ms, Config A is useless. Its throughput is wasted because the latency target is violated.
The throughput that actually meets your latency targets is called goodput. More precisely: goodput = throughput × fraction of requests meeting the SLO.
Config A's p99 blows past the 500 ms target, so a large chunk of its throughput doesn't count. Config B stays within budget and delivers its full 20,000 tok/s as goodput. The "slower" server is the better server.
Raw throughput can mislead. As utilization climbs and queueing delay grows (the hockey-stick curve above), the fraction of requests meeting the SLO drops. Raw throughput continues to rise while goodput can actually fall. A server pinned at 95% utilization can look great on a dashboard while delivering poor goodput. Goodput forces you to account for both dimensions at once.
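Goodput is easy to compute once you log per-request TTFTs. A sketch comparing the two configs (the latency samples are invented to illustrate the shape, not measurements):

```python
def goodput(throughput_tok_s, ttft_samples_ms, slo_ms):
    """Throughput credited only for requests whose TTFT met the SLO."""
    meeting = sum(1 for t in ttft_samples_ms if t <= slo_ms)
    return throughput_tok_s * meeting / len(ttft_samples_ms)

# Config A: high throughput, but most requests blow the 500 ms budget.
a = goodput(50_000, [700] * 70 + [100] * 30, slo_ms=500)
# Config B: lower throughput, every request within budget.
b = goodput(20_000, [150] * 100, slo_ms=500)
```

Under the SLO, the "slower" Config B delivers more useful work per second than Config A.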
Picking an Operating Point
So which knob do you turn?
An interactive chatbot has a human watching the stream. Every millisecond of TTFT feels like lag, and any stutter in token delivery breaks the illusion of fluency. The operator caps the batch size well below the maximum, leaves headroom for new arrivals, and uses chunked prefill to smooth ITL spikes. Utilization stays moderate. That is the price of responsiveness.
A batch-processing pipeline summarizing a million documents overnight is a different world. No one is watching. TTFT and ITL are irrelevant. The only metric is tokens per dollar. Fill every slot. Push utilization as high as queueing allows. Maximize total throughput.
Most real systems serve a mix. The usual solution: separate the pools. Give interactive traffic a strict concurrency cap with headroom for fast admission. Route batch traffic to a throughput-maximizing pool where high utilization is acceptable.
Batch size is the primary lever. Below the critical batch size, you get both low latency and rising throughput, nearly for free. Above it, every additional sequence trades per-stream ITL for total tok/s. Queueing amplifies the problem at high utilization: even moderate overcommitment can push tail latency far beyond acceptable levels. The operating point you choose depends on what your users are doing, and on one more constraint we have not yet addressed: whether the model fits on a single GPU at all.
Detail 22: Multi-GPU Inference
Every optimization in Part 2 has assumed something we never stated outright: the model fits on one GPU.
Llama 2 70B at FP16: 70 billion parameters × 2 bytes = 140 GB of weights. An A100 has 80 GB of VRAM. The model does not load. Not "barely fits with careful memory management." Does not load.
Quantization buys room. INT4 (Detail 18) shrinks 140 GB to roughly 35 GB. Now the weights fit, with space left for KV cache. But there is a second wall.
Detail 12 established that decode reads all model weights every single token. The bandwidth floor for 35 GB of weights on a single A100: 35 GB ÷ 1,555 GB/s ≈ 22.5 ms per token.
If your application needs inter-token latency under 15 ms, one GPU cannot deliver it. No kernel optimization, no scheduling trick, no batching strategy breaks the bandwidth floor. It is set by how fast VRAM delivers bytes to the chip.
Two problems, then. The model might not fit. And even if it does, a single GPU might not be fast enough. Both require spreading the model across multiple GPUs. How you split determines which problem you actually solve.
The Assembly Line
A transformer is a stack of identical layers. The most obvious split: give each GPU a contiguous chunk and process tokens in sequence, like stations on a factory line.
4 GPUs, 32-layer model:
- GPU 0: layers 1 to 8
- GPU 1: layers 9 to 16
- GPU 2: layers 17 to 24
- GPU 3: layers 25 to 32
A token enters GPU 0, passes through 8 layers, and the resulting activation tensor ships to GPU 1. GPU 1 runs layers 9 to 16, ships to GPU 2. And so on. Communication is light: one small tensor at each stage boundary.
Intuitive. Clean. And if you trace the execution of a single decode step, immediately wasteful.
GPU 0 works on layers 1 to 8. GPUs 1, 2, 3: idle. GPU 1 works on layers 9 to 16. GPUs 0, 2, 3: idle. At any instant, three out of four GPUs do nothing. 75% of the hardware sits dark.
This idle time is the pipeline bubble.
The fix comes from the same intuition as any assembly line: don't wait for one item to finish before starting the next. If multiple sequences are in flight (and continuous batching from Detail 14 provides them naturally), GPU 0 can start processing sequence 2 while GPU 1 handles sequence 1. This is microbatching.
The bubble fraction follows a clean formula: (p − 1) / (p − 1 + m), where p is the number of pipeline stages and m is the number of microbatches in flight.
| Microbatches (m) | Bubble fraction | GPU utilization |
|---|---|---|
| 1 | 75% idle | 25% |
| 4 | 43% idle | 57% |
| 16 | 16% idle | 84% |
| 64 | ~4% idle | ~96% |
With enough sequences flowing through the pipeline, utilization reaches 96%. Throughput scales. The assembly line hums.
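The table is just the formula evaluated at p = 4 stages. A quick check:

```python
def bubble_fraction(p, m):
    """Idle fraction of a p-stage pipeline with m microbatches in flight."""
    return (p - 1) / (p - 1 + m)
```

`bubble_fraction(4, 1)` gives 0.75 and `bubble_fraction(4, 64)` gives about 0.045: the bubble never reaches zero, it only shrinks as more work queues up behind it.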
But notice what did not change. A single request still flows through all 4 stages in sequence. Its latency is the sum of all stage times. Microbatching filled the idle GPUs with other people's requests. That is great for the operator paying the electricity bill. It does nothing for the user staring at a cursor, waiting for their stream to start.
This is pipeline parallelism (PP). It solves capacity: models that won't fit on one GPU can spread across many. It improves throughput: more total tokens per second. But it usually does not improve single-request latency, and can add a bit of stage-boundary communication on top.
The Latency Wall
If your only problem is "the model doesn't fit," PP is sufficient. But go back to the bandwidth floor. A 70B model at INT4 on a single A100: ~22.5 ms per token. Split it across 4 GPUs with pipeline parallelism, and each GPU holds fewer layers. But the total weight read across all stages is the same 35 GB. The stages run in sequence. Total decode time for one token is roughly the same as before.
The problem is structural. In pipeline parallelism, GPUs take turns. To make a single token genuinely faster, you need GPUs working on the same token at the same time. Not taking turns. Simultaneously.
Splitting Weight Matrices
Instead of giving each GPU different layers, give every GPU a slice of every layer. All of them work on the same token, at the same time.
Start from the basic operation: a matrix multiply Y = XW. Suppose W is large and you have 2 GPUs. Split W down the columns: GPU 0 gets the left half, GPU 1 gets the right half. Each GPU multiplies its half of W by the full input X and produces its half of Y. Concatenate the halves. Done.
For this single operation, no inter-GPU communication was needed. Each GPU independently computed its portion.
But a transformer MLP is two consecutive multiplies with an activation function between them: Y = GeLU(X W1) W2. The Megatron-LM paper showed how to parallelize this without redundant work. Split W1 column-wise: each GPU gets columns of W1, producing its slice of the hidden vector. Apply GeLU locally (it is element-wise, so each GPU applies it to its own slice without needing the full vector). Then split W2 row-wise: each GPU gets rows of W2 and multiplies against its local hidden slice.
Here is the catch. The row-parallel step on W2 produces partial sums. Each GPU has computed part of the final dot product along the inner dimension. To assemble the correct output, you need to sum across GPUs. This synchronization is called an all-reduce.
The cost: 2 all-reduces per transformer layer (one after the attention projection, one after the MLP). For TinyLlama's 22 layers, that is 44 all-reduces per decode step.
The benefit: every GPU works on the same token at the same time. No pipeline bubble. No sequential waiting.
This is tensor parallelism (TP).
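The column/row split is easy to verify numerically. A toy 2-way version in numpy (the dimensions are arbitrary, and the sum over partial outputs stands in for the all-reduce):

```python
import numpy as np

rng = np.random.default_rng(0)
d, h = 8, 32                        # hidden dim, MLP intermediate dim
x = rng.normal(size=(1, d))         # one token's activation
w1 = rng.normal(size=(d, h))        # up-projection
w2 = rng.normal(size=(h, d))        # down-projection

def gelu(v):                        # tanh approximation of GeLU
    return 0.5 * v * (1 + np.tanh(np.sqrt(2 / np.pi) * (v + 0.044715 * v**3)))

# Single-GPU reference: Y = GeLU(X W1) W2
ref = gelu(x @ w1) @ w2

# 2-way tensor parallelism: W1 split by columns, W2 split by rows.
w1_a, w1_b = np.split(w1, 2, axis=1)
w2_a, w2_b = np.split(w2, 2, axis=0)
partial_a = gelu(x @ w1_a) @ w2_a   # GPU 0's local work, no communication
partial_b = gelu(x @ w1_b) @ w2_b   # GPU 1's local work, no communication
y = partial_a + partial_b           # the all-reduce: sum the partial outputs

assert np.allclose(y, ref)          # same result as the unsplit computation
```

GeLU applies cleanly to each half of the hidden vector precisely because W1 was split by columns; splitting W1 by rows instead would force a synchronization before the activation.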
The Bandwidth Payoff
This is where PP and TP diverge in a way that matters for the person waiting at the other end.
Detail 12 established the decode bandwidth floor for TinyLlama on a single A100: 2.2 GB of weights read per token, giving ~1.41 ms. No software trick on one GPU breaks that.
With 2-way tensor parallelism, each GPU holds half the weights: 1.1 GB instead of 2.2 GB. Each GPU reads its half from its own VRAM, simultaneously. The floor per GPU drops to ~0.71 ms. And because both GPUs work in parallel, the effective floor for the whole model drops to ~0.71 ms.
ITL cut in half.
| TP degree | Weight read per GPU | Bandwidth floor (A100) |
|---|---|---|
| 1 (single GPU) | 2.2 GB | ~1.41 ms |
| 2-way | 1.1 GB | ~0.71 ms |
| 4-way | 0.55 GB | ~0.35 ms |
Why does this work? Recall from Detail 12: decode at batch size 1 has arithmetic intensity of ~1 FLOP/byte. The GPU finishes the actual math in microseconds and spends the rest of each step waiting for weight data to arrive from VRAM. Compute units are 99% idle, starving for bytes. Tensor parallelism splits that starvation across devices. Each GPU has less data to read, and they all read simultaneously.
This is a genuine per-token latency reduction, not a throughput trick. Pipeline parallelism keeps each GPU working at full speed on fewer layers, but the stages are sequential, so total per-token time stays roughly constant. Tensor parallelism puts all GPUs on the same token at the same time, dividing the bandwidth cost.
Back to the 70B model. Single A100 at INT4: ~22.5 ms/token. 2-way TP: ~11.3 ms. 4-way: ~5.6 ms. 8-way in a DGX node (8 A100s on NVLink): ~2.8 ms. The 15 ms target that was impossible on one GPU becomes comfortable on eight.
The Communication Tax
Those 44 all-reduces per step are not free. How much they cost depends entirely on what connects the GPUs.
Within a single server node, NVLink on H100 provides ~900 GB/s bidirectional bandwidth. The activation tensors in inference all-reduces are small (proportional to hidden dimension times batch size, not the full weight matrix). At these speeds, all-reduce overhead is a few percent of decode time. The bandwidth payoff from weight splitting dominates.
Across nodes, the picture inverts. A single InfiniBand NDR or 400 GbE link delivers about 50 GB/s. Standard 100 GbE: closer to 12 GB/s. Even high-end multi-rail setups top out around 400 GB/s aggregate. All-reduce overhead at those speeds becomes a significant fraction of the decode step, and can erase the bandwidth savings entirely.
Pipeline parallelism, by contrast, communicates only at stage boundaries: one point-to-point activation transfer between adjacent GPUs per step. Far less frequent than TP's per-layer all-reduces. PP tolerates slow interconnects.
This creates the deployment pattern you see everywhere: tensor parallelism within a node (where NVLink keeps all-reduces cheap), pipeline parallelism across nodes (where communication is rare enough to survive slower links). Match the parallelism strategy to the interconnect topology.
Large deployments combine both: 8-way TP within a DGX node (8 GPUs on NVLink), PP across nodes if the model exceeds one node's aggregate VRAM, and multiple replicas of the entire setup for throughput scaling. Capacity, latency, and throughput from the same deployment.
The Third Axis: Splitting the Sequence
Tensor parallelism splits weights. Pipeline parallelism splits layers. Both handle the model itself. But there is a third thing growing that neither fully addresses, and it scales with something no amount of weight-sharding controls: sequence length.
TP helps with the KV cache to a degree. It shards the cache across attention head partitions, so each GPU stores KV data only for its slice of heads. The per-GPU cache footprint drops roughly proportional to TP degree. But within each GPU's share, the cache still grows linearly with context length. At the context lengths people are pushing toward now, that linear growth becomes the dominant constraint.
Consider a 70B model with GQA (8 KV heads, head dimension 128, 80 layers) serving a 1M-token context in FP16. Each layer stores keys and values: 2 × 8 heads × 128 dims × 1,000,000 tokens × 2 bytes ≈ 4.1 GB per layer. Across 80 layers: ~327 GB of KV cache for a single sequence. Quantized to INT8: still ~164 GB. No single GPU holds this. Even with 8-way TP splitting across heads, each GPU's share of the KV cache is ~41 GB, for one sequence alone.
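The same arithmetic as a few lines of Python, using the worked example's shapes:

```python
# KV cache footprint: 70B-class GQA model, 1M-token context, FP16.
kv_heads, head_dim, layers = 8, 128, 80
seq_len, bytes_per_elem = 1_000_000, 2     # FP16

per_layer = 2 * kv_heads * head_dim * seq_len * bytes_per_elem  # K and V
total = per_layer * layers
print(per_layer / 1e9)        # ~4.1 GB per layer
print(total / 1e9)            # ~327.7 GB for one sequence
print(total / 8 / 1e9)        # ~41 GB per GPU under 8-way TP
```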
Context parallelism (CP) splits the sequence dimension itself across GPUs. With 8-way CP on a 1M-token context, each GPU holds a 125K-token slice and stores only its portion of the KV cache.
The complication is attention. Every query token needs to attend to all key-value positions across the entire context, not just the ones stored locally. Ring Attention solves this by arranging GPUs in a logical ring. Each GPU computes attention against its local KV chunk, then passes its K/V blocks to the next GPU in the ring. After N − 1 sends (one full rotation around a ring of N GPUs), every GPU has attended against all positions and holds the complete attention output for its token range.
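Why visiting KV chunks one at a time gives exact attention: partial results per chunk can be merged with the same online-softmax trick FlashAttention relies on. A toy check with arbitrary shapes (no causal mask, single head, and none of the actual ring scheduling; just the merge math):

```python
import numpy as np

rng = np.random.default_rng(0)
q = rng.standard_normal((4, 8))        # 4 query tokens, dim 8
k = rng.standard_normal((16, 8))       # 16 KV positions
v = rng.standard_normal((16, 8))

# Reference: full softmax attention over all 16 positions at once.
s = q @ k.T
p = np.exp(s - s.max(-1, keepdims=True))
ref = (p / p.sum(-1, keepdims=True)) @ v

# Chunked: visit KV in 4 chunks, as if rotated around a 4-GPU ring,
# maintaining a running max, normalizer, and unnormalized output.
m = np.full((4, 1), -np.inf)           # running row max
l = np.zeros((4, 1))                   # running sum of exp
o = np.zeros((4, 8))                   # running weighted output
for kc, vc in zip(np.split(k, 4), np.split(v, 4)):
    sc = q @ kc.T
    m_new = np.maximum(m, sc.max(-1, keepdims=True))
    pc = np.exp(sc - m_new)
    scale = np.exp(m - m_new)          # rescale old accumulators
    l = l * scale + pc.sum(-1, keepdims=True)
    o = o * scale + pc @ vc
    m = m_new
out = o / l

assert np.allclose(out, ref)
```

Because the merge is exact, the order in which chunks arrive around the ring doesn't matter, which is what lets the transfers overlap with compute.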
The communication pattern differs from both PP and TP. PP sends activations point-to-point at stage boundaries: infrequent. TP all-reduces partial sums every layer: frequent but small. CP passes K/V blocks around the ring every layer: frequent and large, but the transfers overlap with compute. While one GPU computes attention against the chunk it just received, the next GPU is already sending its chunk along. During prefill, attention is quadratic in sequence length, providing enough arithmetic to hide the transfers behind useful work.
One subtlety: causal masking means GPUs handling early positions have less work (fewer past tokens to attend to), creating load imbalance. Striped partitioning (interleaving token positions across GPUs rather than assigning contiguous blocks) is the standard fix.
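The imbalance is easy to quantify: under a causal mask, query position i attends to i + 1 positions. A sketch counting that work per GPU under each partitioning (100K tokens and 8 GPUs, purely illustrative):

```python
# Causal-attention work per GPU: contiguous vs striped token partitioning.
n_gpus, seq = 8, 100_000

def work(positions):
    # Attended positions for query i under a causal mask = i + 1.
    return sum(i + 1 for i in positions)

chunk = seq // n_gpus
contiguous = [work(range(g * chunk, (g + 1) * chunk)) for g in range(n_gpus)]
striped = [work(range(g, seq, n_gpus)) for g in range(n_gpus)]

print(max(contiguous) / min(contiguous))   # ~15x: last GPU does far more
print(max(striped) / min(striped))         # ~1.0: near-perfect balance
```

With contiguous blocks the GPU holding the last chunk does roughly 15× the work of the first; interleaving positions flattens that to within a fraction of a percent.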
CP is most valuable during prefill, where the computation provides enough arithmetic intensity to mask the ring communication. During decode, each step generates only one new token, so the compute-to-communication ratio drops. CP's benefit during decode is primarily memory capacity (the KV cache stays distributed) rather than latency reduction. Per-token decode latency remains TP's territory.
Context parallelism is orthogonal to TP and PP. A large long-context deployment might use all three: TP within a node (split weights for low ITL), CP across nodes (split the KV cache so million-token contexts fit), and PP if the model exceeds even one node's aggregate weight capacity. Each axis addresses a different dimension: TP handles weights and heads, CP handles sequence length, PP handles layers.
Part 2 Summary: The Same Pattern, Every Time
Part 1 was about how inference works: one loop, where prefill builds a KV cache and decode extends it one token at a time.
Part 2 was about what gets in the way when that loop runs at scale. Thousands of concurrent users. GPUs with finite bandwidth and finite VRAM. Latency budgets in milliseconds.
The loop itself doesn't change. The constraints tighten around it.
And the response to each constraint followed the same pattern. You run the loop, you hit a physical wall, you identify which wall, and you engineer around it.
Decode rereads every weight for a single token? Batch more sequences so the read gets amortized across many tokens at once. KV cache filling all your VRAM? Share K/V heads across attention groups (GQA), page-manage the blocks so nothing is wasted on fragmentation (PagedAttention), reuse common prefixes across requests (prefix caching). Attention scaling quadratically at long contexts? Fuse the kernel so the bottleneck is compute, not memory traffic (FlashAttention). Weights too large to fit or too slow to read? Quantize: fewer bits per weight, same architecture. One GPU not enough? Shard the weights across devices (tensor parallelism) or shard the KV cache along the sequence dimension (context parallelism). Decode inherently sequential, one token at a time no matter what? Speculate with cheap drafts and verify in parallel.
Each optimization is a direct response to a bottleneck we already named in Part 1. Here's the reference table:
| Technique | What It Does | Typical effect |
|---|---|---|
| Continuous Batching | Dynamic request scheduling | 2-5× throughput |
| PagedAttention | Eliminate KV cache fragmentation | 2-4× memory efficiency |
| FlashAttention | Fused attention kernels | 2-4× attention speed |
| Quantization | Reduce precision | 2-4× smaller weights (often faster) |
| GQA/MQA | Smaller KV per head | 4-32× KV cache reduction |
| Speculative Decoding | Parallel verification | 2-3× decode speed |
| Prefix Caching | Reuse shared context | 5-10× TTFT for cached prefixes |
| Multi-GPU (TP) | Split weights across devices | Near-linear ITL reduction (until comm dominates) |
| Context Parallelism | Split KV cache across sequence | Enables 100K+ context serving |
They stack because they target different walls. Continuous batching and FlashAttention don't compete. Quantization and GQA both reduce bytes, but in different places: weights vs. KV cache. Speculative decoding attacks the sequential nature of the decode loop while the others optimize what happens inside each step. The ceiling never disappears (eventually bandwidth, VRAM, or queueing becomes the binding constraint), but with this mental model, you know which wall you're hitting and which lever to reach for.
Closing the Loop
Go back to where this started.
You type "Write a story" into ChatGPT. There's a pause. Then tokens stream in: "Once upon a time..."
Twenty-two Details ago, that was a black box. Text in, text out. The pause was "the model thinking." The stream was "the model writing." You could observe the rhythm but not explain it.
Now you can trace the whole thing.
The pause is prefill: your entire prompt processed in parallel across every transformer layer, building the KV cache that decode will read from. Longer prompts mean more tokens to process, more KV pairs to compute across more layers. That's why the pause scales with prompt length. Prefill is compute-bound: the GPU is saturated doing real math.
The stream is decode: one token per full trip through the model, each trip bottlenecked by how fast the GPU can read weights from memory. The speed barely changes between a poem and a legal brief. Same weights, same bandwidth ceiling. The streaming rate you observe is your hardware's memory bandwidth, made visible.
If you collapse everything (all 22 Details, both Parts) into one sentence:
Inference is a prefill pass that builds a KV cache, followed by a bandwidth-limited decode loop that repeatedly rereads weights and appends to that cache.
At the start of this post, that sentence was noise. Now every word in it points to something you built up from first principles. You know what prefill computes and why it's compute-bound. You know what the KV cache stores and how GQA shrinks it. You know why decode is bandwidth-limited, what "rereads weights" costs in wall-clock time, and what "appends to that cache" means at the hardware level.
Every production optimization from Part 2 targets a wall that this loop implies. The compute cost of prefill. The bandwidth cost of decode. The memory cost of the growing cache. The sequential constraint of one-token-at-a-time generation. When someone says "2× faster," the question is: which wall moved? TTFT, ITL, throughput, p99 tail latency? Those are different metrics, controlled by different bottlenecks.
And this transfers beyond anything we covered here. When a vLLM changelog says "chunked prefill enabled by default," you know what problem it solved: long prefills were monopolizing the GPU, starving decode tokens for everyone else in the batch. When someone says a model "runs on a single 4090," you can check it yourself: parameters × bytes per weight, see if it fits in 24 GB. Then weight bytes ÷ memory bandwidth gives you a floor on decode latency. When a benchmark claims "2× throughput," you know to ask: at what batch size, what context length, what precision? The arithmetic is straightforward once you know where the numbers come from.
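Those checks are genuinely one-liners. A sketch, assuming an RTX 4090 (24 GB VRAM, ~1008 GB/s memory bandwidth) and a hypothetical 7B model:

```python
# Does it fit, and what's the decode floor? 7B model on a 24 GB RTX 4090.
params, vram_gb, bw = 7e9, 24, 1008e9   # bw in bytes/s

for bits in (16, 8, 4):
    weight_gb = params * bits / 8 / 1e9
    fits = weight_gb < vram_gb           # ignores KV cache and activations
    floor_ms = weight_gb * 1e9 / bw * 1e3
    print(f"{bits}-bit: {weight_gb:.1f} GB, fits={fits}, "
          f"decode floor ~{floor_ms:.1f} ms/token")
```

Note the `fits` check leaves no headroom for the KV cache or activations; in practice you want a comfortable margin below the VRAM limit.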
The point was never to memorize 22 Details. It was to build a mental model where the next optimization, the next benchmark, the next architecture paper has somewhere to land. Keep pulling on the threads; the answers always trace back to the same loop.
Optional: Going Deeper
The main post covered the pipeline and the optimizations that matter most. What follows goes a step further: other ways to shrink models, alternatives to quadratic attention, what vLLM actually does under the hood, and a couple of decoding tricks that skip the draft model entirely.
References & Further Reading
Primary sources for the concepts discussed above, plus a few good overviews.
Blog posts & videos
- Aleksa Gordić, Inside vLLM: Anatomy of a High-Throughput LLM Inference System. https://www.aleksagordic.com/blog/vllm
- Vizuara, How the VLLM Inference Engine Works. https://www.youtube.com/watch?v=QyHHbeXqgrQ
- Lilian Weng, Large Transformer Model Inference Optimization. https://lilianweng.github.io/posts/2023-01-10-inference-optimization/
- NVIDIA Developer Blog, Mastering LLM Techniques: Inference Optimization. https://developer.nvidia.com/blog/mastering-llm-techniques-inference-optimization/
- Jay Alammar, The Illustrated Transformer. https://jalammar.github.io/illustrated-transformer/
- Andrej Karpathy, Let's build GPT: from scratch, in code, spelled out. https://www.youtube.com/watch?v=kCc8FmEb1nY
Papers (mostly arXiv)
Core architecture
- Vaswani et al. (2017). Attention Is All You Need. https://arxiv.org/abs/1706.03762
- Bahdanau et al. (2014). Neural Machine Translation by Jointly Learning to Align and Translate. https://arxiv.org/abs/1409.0473
- Su et al. (2021). RoFormer: Enhanced Transformer with Rotary Position Embedding. https://arxiv.org/abs/2104.09864
- He et al. (2016). Deep Residual Learning for Image Recognition. https://arxiv.org/abs/1512.03385
- Ba et al. (2016). Layer Normalization. https://arxiv.org/abs/1607.06450
- Shazeer (2020). GLU Variants Improve Transformer (SwiGLU). https://arxiv.org/abs/2002.05202
Tokenization
- Sennrich et al. (2016). Neural Machine Translation of Rare Words with Subword Units (BPE). https://arxiv.org/abs/1508.07909
- Kudo & Richardson (2018). SentencePiece: A Simple and Language Independent Subword Tokenizer and Detokenizer for Neural Text Processing. https://arxiv.org/abs/1808.06226
Models referenced
- Zhang et al. (2024). TinyLlama: An Open-Source Small Language Model. https://arxiv.org/abs/2401.02385
- Touvron et al. (2023). Llama 2: Open Foundation and Fine-Tuned Chat Models. https://arxiv.org/abs/2307.09288
- Llama Team (2024). The Llama 3 Herd of Models. https://arxiv.org/abs/2407.21783
- Jiang et al. (2023). Mistral 7B. https://arxiv.org/abs/2310.06825
- Gemma Team (2024). Gemma: Open Models Based on Gemini Research and Technology. https://arxiv.org/abs/2403.08295
- Gemma Team (2025). Gemma 3 Technical Report. https://arxiv.org/abs/2503.19786
Inference + serving
- Kwon et al. (2023). Efficient Memory Management for Large Language Model Serving with PagedAttention (vLLM). https://arxiv.org/abs/2309.06180
- Yu et al. (2022). Orca: A Distributed Serving System for Transformer-Based Generative Models (OSDI). https://www.usenix.org/conference/osdi22/presentation/yu
Attention kernels
- Dao et al. (2022). FlashAttention: Fast and Memory-Efficient Exact Attention with IO-Awareness. https://arxiv.org/abs/2205.14135
- Dao (2023). FlashAttention-2: Faster Attention with Better Parallelism and Work Partitioning. https://arxiv.org/abs/2307.08691
- Milakov & Gimelshein (2018). Online Normalizer Calculation for Softmax. https://arxiv.org/abs/1805.02867
KV-cache shape tricks
- Shazeer (2019). Fast Transformer Decoding: One Write-Head Is All You Need (MQA). https://arxiv.org/abs/1911.02150
- Ainslie et al. (2023). GQA: Training Generalized Multi-Query Transformer Models from Multi-Head Checkpoints. https://arxiv.org/abs/2305.13245
- DeepSeek-AI (2024). DeepSeek-V2: A Strong, Economical, and Efficient Mixture-of-Experts Language Model (MLA). https://arxiv.org/abs/2405.04434
Speculative decoding
- Leviathan et al. (2022). Fast Inference from Transformers via Speculative Decoding. https://arxiv.org/abs/2211.17192
- Chen et al. (2023). Accelerating Large Language Model Decoding with Speculative Sampling. https://arxiv.org/abs/2302.01318
- Li et al. (2024). EAGLE: Speculative Sampling Requires Rethinking Feature Uncertainty. https://arxiv.org/abs/2401.15077
- Cai et al. (2024). Medusa: Simple LLM Inference Acceleration Framework with Multiple Decoding Heads. https://arxiv.org/abs/2401.10774
- Fu et al. (2024). Break the Sequential Dependency of LLM Inference Using Lookahead Decoding. https://arxiv.org/abs/2402.02057
Quantization
- Frantar et al. (2022). GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers. https://arxiv.org/abs/2210.17323
- Xiao et al. (2022). SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models. https://arxiv.org/abs/2211.10438
- Lin et al. (2023). AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration. https://arxiv.org/abs/2306.00978
- Dettmers et al. (2022). LLM.int8(): 8-bit Matrix Multiplication for Transformers at Scale. https://arxiv.org/abs/2208.07339
- Dettmers et al. (2023). bitsandbytes: Accessible large language models via k-bit quantization for PyTorch. https://github.com/bitsandbytes-foundation/bitsandbytes
Parallelism
- Shoeybi et al. (2019). Megatron-LM: Training Multi-Billion Parameter Language Models Using Model Parallelism. https://arxiv.org/abs/1909.08053
- Huang et al. (2018). GPipe: Efficient Training of Giant Neural Networks using Pipeline Parallelism. https://arxiv.org/abs/1811.06965
Performance modeling
- Williams et al. (2009). Roofline: An Insightful Visual Performance Model for Multicore Architectures. Communications of the ACM, 52(4), 65-76. https://doi.org/10.1145/1498765.1498785
Alternate attention (context for long sequences)
- Kitaev et al. (2020). Reformer: The Efficient Transformer. https://arxiv.org/abs/2001.04451
- Wang et al. (2020). Linformer: Self-Attention with Linear Complexity. https://arxiv.org/abs/2006.04768
- Beltagy et al. (2020). Longformer: The Long-Document Transformer. https://arxiv.org/abs/2004.05150
Alternative hardware & memory economics
- Groq. Inside the LPU: Deconstructing Groq's Speed (blog). https://groq.com/blog/inside-the-lpu-deconstructing-groq-speed
- Groq. Groq LPU Inference Engine Crushes First Public LLM Benchmark (blog, February 2024). https://groq.com/blog/groq-lpu-inference-engine-crushes-first-public-llm-benchmark
- Groq. Groq Sets New LLM Performance Record of 300 Tokens per Second per User on Llama-2 70B (press release, November 2023). https://www.prnewswire.com/news-releases/groq-sets-new-large-language-model-performance-record-of-300-tokens-per-second-per-user-on-meta-ai-foundational-llm-llama-2-70b-301980280.html
- Patel & Nishball. Groq Inference Tokenomics: Speed, But At What Cost? SemiAnalysis (February 2024). https://newsletter.semianalysis.com/p/groq-inference-tokenomics-speed-but
- Patel et al. The Memory Wall: Past, Present, and Future of DRAM. SemiAnalysis (September 2024). https://semianalysis.com/2024/09/03/the-memory-wall/
- Cerebras. Cerebras Systems Unveils World's Fastest AI Chip (press release, March 2024). https://www.cerebras.ai/press-release/cerebras-announces-third-generation-wafer-scale-engine
- Cerebras. Cerebras Inference now 3x faster: Llama3.1-70B breaks 2,100 tokens/s (blog, October 2024). https://www.cerebras.ai/blog/cerebras-inference-3x-faster
- Prabhakar et al. (2024). SambaNova SN40L: Scaling the AI Memory Wall with Dataflow and Composition of Experts. https://arxiv.org/abs/2405.07518
- d-Matrix. Corsair Product Overview. https://www.d-matrix.ai/product/
Compression beyond quantization (optional)
- Sanh et al. (2019). DistilBERT, a distilled version of BERT: smaller, faster, cheaper and lighter. https://arxiv.org/abs/1910.01108
- Frantar & Alistarh (2023). SparseGPT: Massive Language Models Can Be Accurately Pruned in One-Shot. https://arxiv.org/abs/2301.00774
- Shazeer et al. (2017). Outrageously Large Neural Networks: The Sparsely-Gated Mixture-of-Experts Layer. https://arxiv.org/abs/1701.06538
- Fedus et al. (2021). Switch Transformers: Scaling to Trillion Parameter Models with Simple and Efficient Sparsity. https://arxiv.org/abs/2101.03961
Implementations & docs
- vLLM: https://github.com/vllm-project/vllm
- FlashAttention: https://github.com/Dao-AILab/flash-attention
- Hugging Face TGI: https://github.com/huggingface/text-generation-inference
- NVIDIA TensorRT-LLM: https://github.com/NVIDIA/TensorRT-LLM
- LMCache (prefix caching): https://github.com/LMCache/LMCache
- XGrammar (structured generation): https://arxiv.org/abs/2411.15100
- TinyLlama model card: https://huggingface.co/TinyLlama/TinyLlama-1.1B-Chat-v1.0
- Tiktokenizer (interactive tokenizer): https://tiktokenizer.vercel.app/