Backpropagation Part 2: Patterns, Architectures, and Training
Every gradient rule, from convolutions to attention, follows one pattern: the vector-Jacobian product. See past the memorized formulas to the unifying abstraction, understand how residuals and normalization tame deep networks, and learn why modern architectures are really just careful gradient engineering.
The Pattern Behind Every Gradient
You've traced gradients through a small network by hand. Multiply by the local derivative, pass backward, accumulate at merges. Your MLP learns. The rules fit on an index card.
Then you open a real model.
A convolution shares the same kernel weight at every spatial position. One weight, 48,000 uses in a single layer. How does its gradient accumulate from all of them? BatchNorm computes a mean and variance across the batch, then normalizes every element with those shared statistics. Change one input and you shift the statistics, which shifts every output. Attention runs three separate matrix multiplies (query, key, value), passes through softmax, and couples every sequence position to every other position.
Each operation needs a backward pass. Each backward pass looks different. Your notebook grows: "Linear layers: outer product, summed over the batch. Convolutions: correlate input with the upstream gradient. BatchNorm: three paths through mean, variance, and direct normalization. Attention: softmax coupling plus two matmul reverses." Fifteen operations, fifteen rules. They work. But there's a question underneath: if backprop is just the chain rule, why does it need this many special cases?
It doesn't.
Think back to Part 1. At every node, backprop did the same thing: take the incoming gradient, multiply by the local derivative, pass the result to the parent nodes. That pattern never changed between addition, multiplication, ReLU, or softmax. What changed was the local derivative itself.
So what is the local derivative when an operation maps 1000 inputs to 500 outputs? In principle, it's a 500×1000 matrix: the Jacobian. Half a million entries. And you'd need one at every layer.
Backprop never builds it. Instead, it computes what the matrix would do to the incoming gradient vector, directly, without materializing the Jacobian. That operation is called the vector-Jacobian product, or VJP. It's the single pattern underneath every gradient recipe in this post. Every transpose in every backward pass, every reduction over a batch dimension, every "correlate the input with the upstream gradient" is a VJP in different notation.
Part 1 built the foundation: computational graphs, adjoints, local rules, a working MLP. Part 2 puts it to work across real architectures. The VJP lens turns each layer's backward pass from a special case into an instance of one pattern. We apply it to convolutions, normalization, pooling, attention, residual connections, and recurrent networks. We connect gradients to optimizers and learning rate schedules. We trace why some architectures train and others collapse. The goal: a viewpoint where you can derive any layer's gradient on the spot by asking one question. Given a small change in the output, what change does that imply for the input?
The Hidden Abstraction: Vector-Jacobian Products
We mentioned VJPs in passing in Part 1 but never spent time analyzing them, so I'll assume you know nothing about them and start from the fundamentals.
Throughout these posts, we've been casually saying "multiply by the local derivative" during backprop. But what exactly is "the local derivative" when an operation transforms vectors of different dimensions?
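Start simple: with 2 inputs $(x_1, x_2)$ and 2 outputs $(y_1, y_2)$, the Jacobian is:

$$J = \begin{bmatrix} \partial y_1/\partial x_1 & \partial y_1/\partial x_2 \\ \partial y_2/\partial x_1 & \partial y_2/\partial x_2 \end{bmatrix}$$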
Each row captures how one output changes with respect to all inputs.
Now scale that up: 1000 inputs and 500 outputs. Each output needs 1000 derivatives (one per input), so each row has 1000 entries. There are 500 outputs, so 500 rows. Total: a 500×1000 matrix with 500,000 entries.
Now for the kicker: backprop never actually builds that matrix. Ever.
Instead, backprop uses vector-Jacobian products (VJPs), the hidden abstraction that makes everything efficient. The Jacobian contains answers to every possible question of the form "how does output $y_i$ change when I perturb input $x_j$?" But during backprop, we only ever ask one specific question: "given this gradient flowing back from the loss, what gradient should flow to the inputs?" The VJP answers that exact question directly, computing only what we need. No massive matrix, no wasted memory, just the result we're looking for.
Understanding VJPs transforms backprop from a collection of memorized rules into a unified framework. Those transposes that appear everywhere? They're not arbitrary. The way gradients flow through matrix multiplies? There's a deep geometric reason. Even the most complex operations (attention, convolutions) follow the same VJP pattern once you see the structure. So let's start answering all these questions, one by one!
VJPs: The Engine Behind Every Gradient
So the VJP replaces the Jacobian. What does that actually mean, mechanically?
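You have an operation with $n$ inputs and $m$ outputs. The Jacobian $J$ is an $m \times n$ matrix where row $i$ contains $\partial y_i/\partial x_1, \dots, \partial y_i/\partial x_n$. Row $i$ captures how output $y_i$ responds to each input.
During backprop, a gradient arrives from downstream: $m$ entries, $v_i = \partial L/\partial y_i$. How much the final loss cares about each output. What you need is $\partial L/\partial x_j$ for each input $x_j$.
Chain rule:

$$\frac{\partial L}{\partial x_j} = \sum_{i=1}^{m} \frac{\partial L}{\partial y_i}\,\frac{\partial y_i}{\partial x_j} = \sum_{i=1}^{m} v_i\, J_{ij}$$

That summation is $v^\top J$. The upstream gradient as a row vector, times the Jacobian. You never need $J$ by itself. You only need what it does to $v$. That product is the vector-Jacobian product (VJP), and every operation computes it directly, without ever forming $J$.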
In code:
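A minimal NumPy sketch of the contrast, assuming an elementwise tanh as the operation (the function names `tanh_vjp_via_jacobian` and `tanh_vjp_direct` are illustrative, not from any library). The Jacobian of an elementwise op is diagonal, so building it is almost entirely wasted work:

```python
import numpy as np

def tanh_vjp_via_jacobian(x, v):
    """Naive route: materialize the full Jacobian, then multiply."""
    y = np.tanh(x)
    J = np.diag(1.0 - y**2)      # (n, n) matrix, almost all zeros
    return v @ J                 # v^T J

def tanh_vjp_direct(x, v):
    """VJP route: compute v^T J without ever forming J."""
    y = np.tanh(x)
    return v * (1.0 - y**2)      # only length-n vectors in memory

rng = np.random.default_rng(0)
x = rng.normal(size=1000)        # input to the operation
v = rng.normal(size=1000)        # upstream gradient dL/dy

naive = tanh_vjp_via_jacobian(x, v)
direct = tanh_vjp_direct(x, v)
```

Both routes return the same vector. Only the first one allocates a million-entry matrix to get there.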
For a layer with 1000 inputs and 500 outputs:
- Naive way: Build 500×1000 matrix (2MB in float32), then multiply
- VJP way: Direct computation, never more than vectors in memory
VJPs save more than memory. They're often cheaper to compute. Consider $Y = XW^\top$. The Jacobian with respect to $X$ would be a 4D tensor if you're being fully general with batches. The VJP? Just $\frac{\partial L}{\partial X} = \frac{\partial L}{\partial Y}\, W$. One matrix multiply.
Trace through a concrete example to see the mechanics:
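One possible trace, as a sketch: a tiny linear layer followed by ReLU and a squared-sum loss. The sizes and the loss are assumptions chosen to keep the example small. Each backward line is one VJP, and a finite-difference check confirms the result:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))      # (out, in)
x = rng.normal(size=4)

# Forward pass, keeping what the backward pass will need
y = W @ x                        # linear
z = np.maximum(y, 0.0)           # ReLU
loss = 0.5 * np.sum(z**2)        # scalar loss

# Backward pass: one VJP per node, in reverse order
grad_z = z                       # d(0.5 * sum z^2) / dz
grad_y = grad_z * (y > 0)        # ReLU VJP: a mask, no diagonal matrix built
grad_x = W.T @ grad_y            # linear VJP with respect to the input
grad_W = np.outer(grad_y, x)     # linear VJP with respect to the weights

def loss_fn(x_in):
    return 0.5 * np.sum(np.maximum(W @ x_in, 0.0) ** 2)

# Central finite differences as an independent check on grad_x
eps = 1e-6
numeric = np.array([
    (loss_fn(x + eps * e) - loss_fn(x - eps * e)) / (2 * eps)
    for e in np.eye(4)
])
```

No step in the backward pass allocates anything larger than `W` itself.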
Every operation has a direct VJP formula, simpler than "build the Jacobian, then multiply." This is why backprop is feasible. No million-entry Jacobians sitting in memory at every layer. Just vector-sized operations, one per node.
Shape Discipline: When Gradients Must Match
Here's a failure mode I've seen more than once. You implement a backward pass, compute grad_W, and it comes out (784, 128). Your weight matrix W is (128, 784). Same numbers, wrong order. You transpose it. The shape now matches. The network trains.
Three epochs later, the loss is suspiciously flat.
The transposition was wrong. Not because the shapes didn't match after the fix, but because you were computing the wrong thing and forcing it to fit. Shape discipline is not a style guide. When shapes mismatch, the gradient computation itself is usually wrong. The shape is the tell.
So why must a gradient have the same shape as its parameter?
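Start with what the gradient means. $\partial L/\partial W$ asks: "For each entry of $W$, how much does the loss change if I nudge that entry?" One number per entry. If $W$ has shape $(128, 784)$, that's 100,352 entries, so $\partial L/\partial W$ must also have shape $(128, 784)$. The constraint is definitional, not conventional.
The VJP makes this concrete. Consider $y = Wx$ in index notation:

$$y_i = \sum_j W_{ij}\, x_j$$

The gradient for entry $W_{ij}$ follows from the chain rule:

$$\frac{\partial L}{\partial W_{ij}} = \sum_k \frac{\partial L}{\partial y_k}\,\frac{\partial y_k}{\partial W_{ij}}$$

Only $y_i$ depends on $W_{ij}$ (through $W_{ij}\, x_j$), so the sum collapses:

$$\frac{\partial L}{\partial W_{ij}} = \frac{\partial L}{\partial y_i}\, x_j$$

In matrix form:

$$\frac{\partial L}{\partial W} = \frac{\partial L}{\partial y}\, x^\top$$

Check the shapes. $\partial L/\partial y$ has shape $(M, 1)$. $x^\top$ has shape $(1, N)$. Product: $(M, N)$, exactly matching $W$.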
[DIAGRAM: An outer-product view of weight gradients. Left: a column vector (M×1) labeled "upstream gradient — how much each output affected the loss." Right: a row vector (1×N) labeled "input — what each input neuron provided." Their outer product: an M×N matrix labeled "grad_W contribution from one example." Below this, three such outer-product matrices stacking on top of each other with a + sign between them, summing into the final grad_W. Caption: "Each training example contributes a rank-1 outer product. The weight gradient is their sum."]
Read the formula. The gradient for weight $W_{ij}$ is two things multiplied: how much output $i$ affected the loss, and how strongly input $j$ fired. Each weight's update is proportional to "how wrong was the thing this weight fed into" times "what did this weight see." The outer product is not a formula to memorize. It falls out of the index arithmetic.
One thing that trips people up: the batch dimension. Running 32 examples through a layer doesn't change the shape of $W$, so $\partial L/\partial W$ can't have a batch dimension either. The weight gradient accumulates outer products across the batch:
- grad_output.T @ X: $(128, 32) \times (32, 784) \to (128, 784)$, matching W
- grad_output.sum(axis=0): $(128,)$, matching the bias
What broadcasts forward reduces backward. The bias was tiled across 32 examples in the forward pass, so its gradient sums across those 32 examples on the way back.
Shape assertions in code make this checkable:
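A sketch of what such assertions might look like for the layer Y = X @ W.T + b, using the 32/128/784 sizes from above. The helper name `linear_backward` is hypothetical:

```python
import numpy as np

def linear_backward(grad_output, X, W, b):
    """Backward pass for Y = X @ W.T + b, with shape checks."""
    batch = X.shape[0]
    assert grad_output.shape == (batch, W.shape[0])
    grad_W = grad_output.T @ X          # (out, B) @ (B, in) -> (out, in)
    grad_b = grad_output.sum(axis=0)    # broadcast forward, summed backward
    grad_X = grad_output @ W            # (B, out) @ (out, in) -> (B, in)
    assert grad_W.shape == W.shape
    assert grad_b.shape == b.shape
    assert grad_X.shape == X.shape
    return grad_X, grad_W, grad_b

rng = np.random.default_rng(0)
X = rng.normal(size=(32, 784))
W = rng.normal(size=(128, 784))
b = np.zeros(128)
grad_output = rng.normal(size=(32, 128))
grad_X, grad_W, grad_b = linear_backward(grad_output, X, W, b)
```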
These are not defensive programming. They are executable math. If grad_W.shape != W.shape, the computation is wrong, not just the code.
The loud shape error is not the problem. Both shapes get printed, you fix it, done. The dangerous one is the shape bug that silently computes something plausible but wrong. The network trains. Just poorly. The loss plateaus somewhere suspicious and nothing points to why.
Beyond Matrices: The Einsum Perspective
The transpose rule works when everything is 2D. Forward: Y = X @ W.T. Backward: grad_X = grad_Y @ W. You can hold the whole thing in your head.
Then you hit attention. Q, K, V are three-dimensional: batch, sequence, depth. You reach for the transpose rule. Transpose which axes? There are three. The 2D shortcut assumed a specific index structure: one shared index gets summed away, two free indices land in the output. Higher-dimensional tensors don't always fit that mold.
What broke isn't the gradient math. It's the shorthand. "Multiply by the transpose" was always a compressed description of something more general, and the compression only worked for 2D. Einsum is the uncompressed version.
Every einsum string tells you two things about its indices: which ones get contracted (summed away) and which ones stay free. The output string is the list of survivors:
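For plain matrix multiplication, the forward and backward einsum strings might be written like this (a sketch with small, arbitrary shapes):

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(2, 3))                  # indices i, k
B = rng.normal(size=(3, 4))                  # indices k, j
grad_C = rng.normal(size=(2, 4))             # upstream gradient, indices i, j

C = np.einsum('ik,kj->ij', A, B)             # forward: k is contracted
grad_A = np.einsum('ij,kj->ik', grad_C, B)   # backward: j contracted, k free
grad_B = np.einsum('ik,ij->kj', A, grad_C)   # backward: i contracted
```

These are the familiar `grad_C @ B.T` and `A.T @ grad_C`, with the transposes spelled out as index wiring.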
Look at the forward pass. k appears in both inputs (ik and kj) but not in the output (ij). So k gets contracted: it's the index that gets summed over.
Now look at the backward pass for grad_A. The output indices are ik, matching A's shape. The index j, which was free in the forward output, is now the one being contracted. And k, which was contracted forward, is free again.
The roles flip. Contracted becomes free. Free becomes contracted. Output indices match the shape of whatever you're differentiating with respect to. That's the whole rule.
Where this starts to matter is the moment you leave 2D. Batch matrix multiplication:
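A sketch of the batched case, assuming an input of shape (B, N, D) multiplied by a shared weight of shape (D, K); the concrete sizes are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
batch, N, D, K = 2, 5, 4, 3
X = rng.normal(size=(batch, N, D))
W = rng.normal(size=(D, K))                    # no batch or sequence axis
grad_Y = rng.normal(size=(batch, N, K))        # upstream gradient

Y = np.einsum('bnd,dk->bnk', X, W)             # forward: d contracted
grad_X = np.einsum('bnk,dk->bnd', grad_Y, W)   # k contracted
grad_W = np.einsum('bnd,bnk->dk', X, grad_Y)   # b and n contracted
```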
Look at grad_W. Its output string is dk. No b. No n. Those dimensions got summed over. Not because someone decided to sum them, but because W's shape is (D, K) and the gradient must match that shape. The output string has to be dk, and any index not in it gets contracted.
This is the "what broadcasts forward reduces backward" principle, but now you can see it in the notation directly. W had no batch or sequence dimension, so it broadcast across both in the forward pass. In the backward pass, both dimensions collapse into a sum. The einsum string makes the reduction explicit instead of hiding it inside a matrix multiply.
One more example to show how far this scales. Attention has two matmuls with a softmax sandwiched between them. Each matmul is an einsum. Each einsum backward follows the same index-flipping rule:
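A sketch of that backward pass in NumPy. It assumes plain dot-product attention with no scaling factor and no mask, and the function names are illustrative:

```python
import numpy as np

def softmax(S):
    S = S - S.max(axis=-1, keepdims=True)    # stabilize before exponentiating
    E = np.exp(S)
    return E / E.sum(axis=-1, keepdims=True)

def attention_forward(Q, K, V):
    S = np.einsum('bnd,bmd->bnm', Q, K)      # scores
    P = softmax(S)                           # attention weights
    out = np.einsum('bnm,bmv->bnv', P, V)
    return out, P

def attention_backward(grad_out, Q, K, V, P):
    grad_V = np.einsum('bnm,bnv->bmv', P, grad_out)
    grad_P = np.einsum('bnv,bmv->bnm', grad_out, V)
    # Softmax is the non-einsum step: each row couples all positions
    grad_S = P * (grad_P - np.sum(grad_P * P, axis=-1, keepdims=True))
    grad_Q = np.einsum('bnm,bmd->bnd', grad_S, K)
    grad_K = np.einsum('bnm,bnd->bmd', grad_S, Q)
    return grad_Q, grad_K, grad_V

rng = np.random.default_rng(0)
Q = rng.normal(size=(2, 3, 4))   # (B, N, D)
K = rng.normal(size=(2, 5, 4))   # (B, M, D)
V = rng.normal(size=(2, 5, 6))   # (B, M, Dv)
out, P = attention_forward(Q, K, V)
grad_out = rng.normal(size=out.shape)
grad_Q, grad_K, grad_V = attention_backward(grad_out, Q, K, V, P)
```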
Each backward einsum is the forward einsum with its index wiring rearranged so the output matches the gradient target. grad_Q needs shape (B,N,D), so the output string is bnd. grad_K needs shape (B,M,D), output string bmd. You read the shapes and write down the index strings. The softmax step is the one place this breaks: normalization couples all positions, so there's no clean index-reversal form for it. Everything else is mechanical.
The point isn't that einsum is nice notation. It's that the index string is the computation's structure, and structure determines the gradient. Once you can read 'bnd,dk->bnk' and immediately see that the backward for the first argument has output indices bnd with k contracted, you're not looking up gradient formulas. You're reading them off the forward pass.
The Takeaway: It's All Just VJPs
Step back for a second. The three preceding sections weren't three separate lessons. They were the same idea from three angles.
Why does $W^\top$ appear in every backward pass? Because the VJP of a linear map is its adjoint, and the adjoint of $W$ is $W^\top$. That's not a convention someone chose. It's a consequence of how the chain rule works on linear maps between dual spaces.
Why must grad_W.shape == W.shape? The VJP lives in the space of perturbations to the parameter. That space has the same dimension as the parameter. Shape matching is geometry, not bookkeeping.
Why does einsum backward work by swapping contracted and free indices? Because the VJP of a contraction is another contraction with the index structure reversed. You're running the same operation backward.
One pattern. Three instances. Once you see it, you can't unsee it.
So when you encounter a new operation, you don't need to look up its gradient. Describe how the forward pass works in terms of its linear structure. The VJP follows from there. Every gradient recipe in the next section, convolutions, attention, normalization, pooling, is the same VJP question applied to a different operation. The question never changes. The answers do.
Backprop Recipes: A Practical Library
We've covered the principles. Now here is the reference. This section is a manual for how gradients flow through common operations. No derivations from first principles (you already know how to do that). Focus on the patterns, the gotchas, and the mental models that make implementing these operations routine.
Think of each operation as having its own gradient behavior. Linear layers are straightforward outer products. Convolutions are correlations in disguise. Pooling operations route gradients like a switchboard. Once you know each operation's behavior, you can predict how gradients will flow without working through the math every time.
Since this material is more advanced and meant mainly as a reference, I've put the recipes in collapsible sections so that readers who aren't interested in the specialized details yet can skip them.
Mental Models That Stick
$W^\top$ appears everywhere. Linear layer backward pass, you write it without thinking. Then you see it again in convolution, wearing different clothes: a flip and a slide instead of a matrix transpose. Then in attention, three times. By the fourth occurrence you stop. Why is the transpose always there?
Because the VJP of any linear map is its adjoint. $W^\top$ is the adjoint of $W$. Every transpose in every backward pass is the same mathematical object. Not a convention. Not a design choice. It's what the chain rule requires when the forward computation is linear.
That's a pattern you can't unsee once you see it. Not a formula to memorize, but a lens that makes the next unfamiliar layer feel familiar before you've even worked through it. A few more follow.
The forward pass writes the gradient routing table.
Every forward computation is also a record-keeping exercise. Max pooling picks the winner. That's an output. It's also a routing decision: the winner gets the upstream gradient, the losers get zero. PyTorch saves those winning indices not for the forward pass (already done) but for the backward pass that hasn't happened yet. Dropout works the same way. Zeroed activations in the forward pass become zero gradients in the backward pass. No path tracked, no gradient routed.
The forward pass is more than half of the backward pass. The backward pass mostly follows instructions the forward pass already wrote.
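To make the routing-table idea concrete, here is a sketch of a non-overlapping 1D max pool. The function names and the window size are assumptions for illustration, and this is not how any particular framework implements it:

```python
import numpy as np

def maxpool1d_forward(x, window=2):
    """Non-overlapping 1D max pool. Returns outputs and winner indices."""
    blocks = x.reshape(-1, window)
    winners = blocks.argmax(axis=1)                   # the routing table
    out = blocks[np.arange(len(blocks)), winners]
    return out, winners

def maxpool1d_backward(grad_out, winners, window=2):
    """Route each upstream gradient to its winner; losers get zero."""
    grad_blocks = np.zeros((len(winners), window))
    grad_blocks[np.arange(len(winners)), winners] = grad_out
    return grad_blocks.reshape(-1)

x = np.array([1.0, 3.0, 2.0, 0.5, -1.0, 4.0])
out, winners = maxpool1d_forward(x)                   # out is [3., 2., 4.]
grad_x = maxpool1d_backward(np.array([10.0, 20.0, 30.0]), winners)
# grad_x is [0., 10., 20., 0., 0., 30.]
```

The backward function does no arithmetic on the gradient at all. It only places values where the forward pass said they belong.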
What broadcasts forward, reduces backward.
A bias vector of shape (out_features,) gets added to every example in the batch. Forward: it broadcasts across all B examples. Backward: gradients from B examples all need to flow back to that one bias. They sum. grad_bias = grad_output.sum(0) isn't a convention. It's the only correct answer for any parameter that broadcast in the forward pass.
This holds more broadly than bias terms. LayerNorm subtracts a mean computed across the feature dimension; in backward, the correction term sums across those same features. Softmax couples gradients with a weighted sum across all positions. Broadcast forward, sum backward. Same dimension. Every time.
Information discarded forward is gradient you'll never recover.
If a forward computation doesn't keep a path open, the backward computation can't use it. Max pooling loses the runner-up positions. ReLU loses the negative activations. Dropout zeros out selected units. Dead ends, all of them. The backward pass is fully determined by what the forward pass tracked.
Residual connections are the counterexample worth remembering. Addition doesn't discard anything. Both input paths stay fully reachable. Both receive the full upstream gradient, unchanged. That's why residuals work as gradient highways: they add a new path without closing any existing ones.
Coupling is the cost of learned routing.
Hard-coded routing is cheap. Max pooling's routing table is fixed by forward pass winners. Gradient cost: O(n). Attention's routing table is learned (scores, softmax, weights), and the cost shows up in backward: the softmax coupling term sums across the entire sequence for every position. O(seq_len²) just for that correction.
The hierarchy is consistent. Elementwise operations: no coupling. LayerNorm: couples gradients through statistics within a single sample. BatchNorm: couples across examples in the batch. Attention: couples across positions in the sequence. The more globally a layer looks in the forward pass, the more tangled its gradient dependencies in the backward pass. Same complexity hierarchy, both directions.
One pattern underneath all of it: each operation computes $v^\top J$ (the upstream gradient times its Jacobian) without forming $J$. The local VJP, composed across the whole graph, gives you every gradient you need. Face a new operation? Describe what it does forward. Ask how each output depends on each input. The backward rule follows.
Optimization Meets Backprop
We've covered how backprop efficiently computes millions of gradients. But gradients alone don't train your network. They're just directional information: which way is downhill from here. To actually learn, you need an optimizer that takes these gradients and decides how to update your parameters.
Think of backprop as a sophisticated sensor system that tells you the slope at your current position in parameter space. The optimizer is your navigation strategy: how fast to move, whether to build momentum, how to adapt to the terrain. Backprop gives you the map; the optimizer plans the journey.
This relationship is so fundamental that people often conflate them. They'll say "backprop learns to recognize images" when they really mean "gradient descent using gradients from backprop learns to recognize images." The distinction matters because you can swap optimizers without changing backprop, and different optimizers can dramatically change training dynamics even with identical gradients.
Here's what happens after backprop hands over the gradients, and why the choices you make here can be the difference between convergence in hours versus days (or never).
What the Optimizer Expects From Backprop
The handoff is simple. Backprop finishes. For each parameter, there’s a gradient tensor, same shape as the parameter. grad_W matches W. grad_b matches b. The optimizer reads them and decides how far to step. That’s the entire interface.
But the gradients aren’t neutral. Choices you made upstream change what the optimizer sees.
Reduction. You compute loss over a batch. Do you sum per-example losses or average them? With mean reduction, gradient magnitudes stay constant as batch size changes. With sum, they scale linearly. Switch batch size from 32 to 256. You’ve just handed the optimizer 8x larger gradients without touching the learning rate. I’ve watched people scale batch size for throughput, then spend hours debugging convergence. The reduction changed. Nothing else did. Mean reduction decouples batch size from learning rate. Sum binds them. You almost always want them decoupled.
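A small sketch of the effect, assuming a linear model with squared-error loss (the setup and the helper name `grad_w` are illustrative). Going from 32 to 256 examples inflates the summed gradient by exactly 8x relative to the mean-reduced one:

```python
import numpy as np

def grad_w(X, y, w, reduction):
    """Gradient of 0.5 * residual^2 for a linear model, summed or averaged."""
    residual = X @ w - y                   # (B,)
    per_example = residual[:, None] * X    # (B, D): one gradient per example
    if reduction == "sum":
        return per_example.sum(axis=0)
    return per_example.mean(axis=0)

rng = np.random.default_rng(0)
w = rng.normal(size=10)
X = rng.normal(size=(256, 10))
y = rng.normal(size=256)

ratios = {}
for reduction in ("sum", "mean"):
    g32 = np.linalg.norm(grad_w(X[:32], y[:32], w, reduction))
    g256 = np.linalg.norm(grad_w(X, y, w, reduction))
    ratios[reduction] = g256 / g32         # growth in gradient norm, 32 -> 256
```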
Batch size is a noise knob. A gradient from one example is noisy. Average 256 examples and the noise smooths out. Bigger batch, cleaner signal. This sounds like bigger is always better. It isn’t. That noise acts as implicit regularization: it keeps the optimizer from committing early to sharp minima that generalize poorly. Large batches often need higher learning rates and explicit warmup to match the generalization of small ones. Batch size and learning rate are coupled. Tune them together.
Embedding layers need sparse updates. A forward pass looks up a handful of tokens. Only those rows of the embedding table produce nonzero gradients. A naive optimizer still touches every row. Sparse-aware optimizers update only the rows that fired. With a 32,000-token vocabulary, that’s the difference between updating 20 rows and 32,000.
Clipping. Global-norm gradient clipping is a seatbelt. One bad batch shouldn’t blow up weeks of training. A threshold around 1.0 is a reasonable default. But if clipping fires on most batches, that’s a symptom. Gradients are systematically too large, usually from a learning rate that’s too high or initialization that’s off. Fix the cause.
Backprop’s job ends at the gradient. What happens next depends on the optimizer.
Momentum, RMSProp, and Adam: Beyond Vanilla SGD
Picture a narrow valley in your loss surface. Steep walls, gently sloping floor. SGD computes the gradient and steps. But the gradient points mostly sideways, into the nearest wall, because that's the steepest direction. Next step: opposite wall. Then back again. The optimizer zigzags across the valley while barely inching forward. Most of the gradient budget goes to fighting itself.
[DIAGRAM: A narrow elliptical loss valley, viewed from above. Two paths to the same minimum at the center of the ellipse. Left path labeled "SGD": a zigzagging line that bounces back and forth between the steep valley walls while slowly inching forward. Right path labeled "Momentum": a smoother curve that cuts more directly through the valley. Elliptical contour lines surround the minimum to suggest elevation. Caption: "Both paths converge. Momentum arrives without the oscillations."]
Two separate ideas fix this. Both ended up in every modern optimizer.
Momentum is the simpler one. Instead of stepping wherever the current gradient points, you maintain a running average of recent gradients. Call it velocity. The zigzag components point opposite directions on alternating steps, so they cancel in the average. The component along the valley floor stays consistent, so it accumulates. A momentum coefficient of 0.9 means 90% of the velocity survives into the next step. Consistent directions build speed. Oscillations wash out.
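A toy sketch of the effect on a two-parameter quadratic valley. The curvatures, learning rate, and step count are assumptions picked for illustration, and the update shown is one common formulation in which velocity accumulates the raw gradient:

```python
import numpy as np

curvature = np.array([100.0, 1.0])       # steep wall, gentle floor

def grad(theta):
    return curvature * theta             # gradient of 0.5 * sum(c * theta^2)

def loss(theta):
    return 0.5 * np.sum(curvature * theta**2)

def run(beta, lr=0.018, steps=100):
    theta = np.array([1.0, 1.0])
    velocity = np.zeros_like(theta)
    for _ in range(steps):
        velocity = beta * velocity + grad(theta)   # beta = 0 is plain SGD
        theta = theta - lr * velocity
    return loss(theta)

loss_sgd = run(beta=0.0)
loss_momentum = run(beta=0.9)
```

In this setup the momentum run ends at a lower loss, because progress along the gentle floor compounds instead of crawling.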
Adaptive step sizes tackle a different failure. SGD gives every parameter the same learning rate. That's the problem. Embedding rows for rare tokens might see gradient once every thousand batches. Weights in a busy early layer get large, consistent updates on every step. One learning rate can't serve both. A rate small enough to keep the busy weights stable barely moves the rare ones; a rate large enough for the rare ones destabilizes the busy ones.
RMSProp tracks a running average of squared gradients, per parameter. Divide each step by the square root of that average. Big recent gradients: smaller step. Small recent gradients: larger step. Each weight adapts to its own history.
Adam puts both ideas together. A momentum term smooths direction. A squared-gradient average scales each parameter's step size. One wrinkle: both averages initialize to zero, so the first few estimates are biased low. Bias correction inflates them toward their true values until enough history accumulates. How long that takes depends on the decay rates: with the common defaults, the momentum correction fades within a few dozen steps ($\beta_1 = 0.9$), while the squared-gradient correction takes a few thousand ($\beta_2 = 0.999$).
AdamW adds a fix that matters more than it looks. In plain Adam, weight decay gets folded into the gradient before the adaptive scaling touches it. So a parameter with large recent gradients has its regularization weakened alongside its task update. That's not what you want. AdamW applies weight decay directly to the weights, bypassing the adaptive machinery. Regularization stays proportional to weight magnitude, not gradient history.
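Put together, a single AdamW update might look like the sketch below. The hyperparameter defaults are common choices rather than values from this post, and `adamw_step` is a hypothetical helper:

```python
import numpy as np

def adamw_step(theta, m, v, grad, t, lr=1e-3, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update. t is the 1-based step count."""
    m = beta1 * m + (1 - beta1) * grad          # momentum: smoothed direction
    v = beta2 * v + (1 - beta2) * grad**2       # RMSProp: per-parameter scale
    m_hat = m / (1 - beta1**t)                  # bias correction for zero init
    v_hat = v / (1 - beta2**t)
    theta = theta - lr * weight_decay * theta   # decoupled decay, not via grad
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta = np.array([1.0, -2.0])
m = np.zeros_like(theta)
v = np.zeros_like(theta)
grad = np.array([0.5, -100.0])                  # very different magnitudes
theta_new, m, v = adamw_step(theta, m, v, grad, t=1, weight_decay=0.0)
step = theta - theta_new
```

On the first step, bias correction makes both parameters move by almost exactly `lr`, even though their gradients differ by a factor of 200. That is the adaptive scaling at work.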
Full derivations, bias correction math, and convergence properties are in the Gradient Descent post.
Schedules and Clipping: Dynamics Control
Adam doesn't know where you are in training. Step 1: random initialization, noisy everything. Step 100,000: nearly converged, loss barely moving. Same update formula both times. No built-in sense of "we're close, be careful now."
You have to tell it. That's the learning rate schedule: an external clock that controls how aggressive the optimizer is at each point in training.
Most runs follow three phases.
Warmup. At initialization, parameters are random and Adam's running averages are empty. The first gradients are noisy guesses. Jumping in at full learning rate means the optimizer commits hard to those guesses. The result: wild parameter swings in the first few hundred steps. Sometimes irrecoverable. A short linear warmup, typically 1-5% of total steps, starts the rate near zero and ramps up. The moving averages get time to accumulate real signal before the optimizer takes big steps.
Plateau. The learning rate holds at peak. Gradients are informative, the optimizer has history. This is where most of the learning happens.
Decay. Once the loss has largely converged, a big step size becomes a liability. The optimizer jumps past the minimum, bounces back, overshoots again. Reducing the rate lets it settle. Cosine decay is the standard: the rate traces a half-cosine from peak to near-zero. Smooth, no manual cutoff to pick. Step decay, cutting by a fixed factor at preset checkpoints, also works but demands you know the right timing. Wrong timing costs training.
[DIAGRAM: A learning rate vs. training step plot with three labeled phases separated by dotted vertical lines. Left phase "warmup": a short linear rise from near-zero up to a peak. Middle phase "plateau": a long flat section at the peak rate. Right phase "cosine decay": a smooth half-cosine fall from peak down to near-zero. A dashed horizontal line marks the peak learning rate. Axis labels: "Training step" on x, "Learning rate" on y. Caption: "Warmup stabilizes early training, the plateau trains at full strength, and cosine decay gives the optimizer room to settle near the minimum."]
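The three phases can be sketched as a single function of the step index. The function name, its parameters, and the example values are assumptions for illustration, not a library API:

```python
import math

def lr_at(step, total_steps, peak_lr, warmup_steps, decay_start, min_lr=0.0):
    """Linear warmup, flat plateau, then half-cosine decay to min_lr."""
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps          # ramp from near zero
    if step < decay_start:
        return peak_lr                                      # hold at peak
    progress = (step - decay_start) / max(1, total_steps - decay_start)
    return min_lr + 0.5 * (peak_lr - min_lr) * (1 + math.cos(math.pi * progress))

# Example: 1000 steps, 5% warmup, decay over the last 400 steps
schedule = [lr_at(s, 1000, 3e-4, warmup_steps=50, decay_start=600)
            for s in range(1001)]
```

Halfway through the decay phase the rate sits at half of peak, and it reaches `min_lr` at the final step.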
Comparisons between schedules, warmup tuning strategies, and step vs. cosine tradeoffs are in the Gradient Descent post.
Clipping. Even with a good schedule, a single unusual batch can produce a gradient many times larger than normal. One bad update can undo thousands of steps of progress. Global norm clipping is the standard seatbelt: compute the gradient's global norm, and if it exceeds a threshold τ, scale every gradient down by the same factor.
Direction is preserved. The optimizer still moves the same way, just not as far. A threshold around 1.0 is common.
If clipping fires on most batches, it's a symptom, not a solution. Learning rate too high, initialization off, or a gradient path that amplifies through depth. Clipping quiets the explosion. Fix the cause.
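As a sketch, global-norm clipping rescales every gradient $g$ to $g \cdot \min(1, \tau / \lVert g \rVert)$, where the norm is taken over all parameters together. The helper below is illustrative, and the small constant in the denominator is an assumption to avoid dividing by zero:

```python
import numpy as np

def clip_by_global_norm(grads, tau=1.0):
    """Scale all gradients by one shared factor if their joint norm exceeds tau."""
    global_norm = np.sqrt(sum(np.sum(g**2) for g in grads))
    scale = min(1.0, tau / (global_norm + 1e-12))
    return [g * scale for g in grads], global_norm

grads = [np.array([3.0, 0.0]), np.array([[0.0, 4.0]])]   # global norm is 5
clipped, norm_before = clip_by_global_norm(grads, tau=1.0)
```

Returning `norm_before` is a deliberate choice here: logging it is a cheap way to see how often clipping fires.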
Curvature Hints via Hessian Vector Products
The gradient tells you which way is downhill. It says nothing about the shape of the hill.
Picture a compass that always points toward lower ground. Useful. But it can't tell you whether the slope keeps falling for fifty meters or turns back up in two. One case says take a big step. The other says barely move. Same compass reading. Completely different right answer.
That's the gap between gradient and curvature. The gradient is the compass. Curvature tells you how far to trust it.
Adam approximates curvature indirectly, through its squared-gradient running average. A proxy. Good enough for most training. But sometimes you want the real thing: why did the network land in a sharp minimum? Why did training suddenly blow up at step 40,000? How curved is the loss surface in the direction you're actually stepping?
The object that answers all of these is the Hessian $H$: second derivatives of the loss with respect to every parameter pair. $H_{ij} = \partial^2 L / \partial\theta_i\,\partial\theta_j$. Curvature in every direction, packed into one matrix. For $N$ parameters, $H$ is $N \times N$: that's $N^2$ entries, far too many to store at the scale of a modern network. You will never form this matrix.
You don't need to.
Most questions about curvature are about one direction. "If I step along $v$, how curved is the loss?" The answer comes from $Hv$ (the curvature along $v$ is $v^\top H v$). Not all of $H$. Just the product of $H$ with one vector.
$Hv$ is $N$-dimensional. Same size as the gradient. And you can compute it without ever building $H$.
The trick: the gradient $\nabla L(\theta)$ is itself a function of $\theta$. Nudge $\theta$ in direction $v$ and watch how $\nabla L$ changes. That rate of change is $Hv$:

$$Hv = \lim_{\epsilon \to 0} \frac{\nabla L(\theta + \epsilon v) - \nabla L(\theta)}{\epsilon}$$

Read that formula again. It's the directional derivative of the gradient function along $v$. A JVP of the gradient computation. Forward-mode autodiff on the backward pass, not on the original network.
This is the Pearlmutter trick. Run a forward-backward pass to get gradients, keep the computation graph alive, then differentiate through the gradient computation itself in direction $v$.
Cost: O(N) memory, roughly 2x a gradient computation. The full Hessian would need O(N²) memory and O(N) separate gradient computations. One HVP needs two. If you only care about a few directions, that's a trade worth taking.
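Without an autodiff library at hand, the same quantity can be approximated by differencing two gradient evaluations. The sketch below is that finite-difference stand-in, not the exact Pearlmutter trick, and the toy quadratic loss is an assumption chosen so the true Hessian is known:

```python
import numpy as np

def hvp_finite_diff(grad_fn, theta, v, eps=1e-5):
    """Approximate H v as the directional derivative of the gradient along v."""
    return (grad_fn(theta + eps * v) - grad_fn(theta - eps * v)) / (2 * eps)

# Toy loss L(theta) = 0.5 * theta^T A theta, so the gradient is A theta and H = A
A = np.array([[100.0, 0.0],
              [0.0, 1.0]])

def grad_fn(theta):
    return A @ theta

theta = np.array([1.0, 1.0])
v_steep = np.array([1.0, 0.0])           # across the narrow axis
v_flat = np.array([0.0, 1.0])            # along the long axis

curv_steep = v_steep @ hvp_finite_diff(grad_fn, theta, v_steep)   # about 100
curv_flat = v_flat @ hvp_finite_diff(grad_fn, theta, v_flat)      # about 1
```

Two gradient evaluations per direction, and the matrix `A` is never read except through `grad_fn`.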
[DIAGRAM: Two views of the same elongated elliptical loss bowl, both seen from above with contour lines. Left view: a single point marked θ with one gradient arrow pointing downhill toward the center of the bowl. Right view: same point, but two direction vectors v₁ and v₂ are drawn. v₁ points across the narrow axis of the bowl (where contour lines are dense). A small inset parabola beside it shows steep curvature with label "Hv₁: large. Step short." v₂ points along the long axis (where contour lines are sparse). A flat inset parabola shows shallow curvature with label "Hv₂: small. Step long." Caption: "Same gradient at θ, but curvature varies by direction. Hv tells you how curved the descent is along a chosen vector. High curvature means the quadratic approximation breaks down fast; you need a shorter step. Low curvature means you can move further before the approximation fails."]
So what do you actually do with an HVP?
Diagnose sharpness. Pick several random directions $v$, compute $v^\top H v$ after training. Large values mean sharp minima: the loss rises fast if you perturb parameters even slightly. Sharp minima tend to generalize worse. This is part of why SGD's gradient noise helps. The noise kicks the optimizer out of sharp basins that large-batch training settles into without resistance.
Get better step sizes. Along direction $v$, the locally optimal step for a quadratic is $\alpha^* = -\,g^\top v / (v^\top H v)$ where $g = \nabla L(\theta)$ and $v^\top H v > 0$. A uniform learning rate assumes curvature is the same everywhere. It isn't. HVPs let you correct for that, at least in the directions you care about.
Build second-order optimizers. L-BFGS maintains a low-rank approximation of the inverse Hessian from gradient-difference pairs. K-FAC approximates the Fisher information matrix (the Hessian's cousin for log-likelihood objectives) using Kronecker factorization. Both exist because gradient descent steps in the wrong direction when curvature varies wildly across parameter dimensions.
In practice, full second-order methods are rare in deep learning. Adam's proxy is close enough for most runs, and HVPs show up more as diagnostic tools than live training components. But each one costs only about as much as an extra gradient computation, and they reveal something first-order methods are blind to: the actual shape of the surface you're optimizing.
The point worth remembering: backprop gives you gradients. Differentiate the gradient function one more time and you get curvature. Same autodiff machinery, applied twice. No new principle. Just one extra pass.
For implementations, deeper derivations, and connections to natural gradient and K-FAC, see the Gradient Descent post.
Backprop Through Time: When Your Graph Has a Time Axis
Every layer in this post has used each weight matrix exactly once in the forward pass. An RNN breaks that. The same W_h and W_x run at every time step: in a 100-step sequence, that's 100 forward computations through the same parameters.
This raises a concrete question for backprop. If W_h participated in 100 different computations, how does it get a single gradient?
The answer is simpler than it looks. Unroll the loop, treat each time step as a layer, and you have a standard feedforward network with T layers where certain weights are shared. Backprop runs on the graph as usual. The shared weights accumulate gradient contributions from every layer where they appear.
That framing is accurate but glosses over something worth understanding. When the same matrix appears at 100 consecutive layers, its eigenvalues determine whether gradients grow or vanish across those layers. A slight imbalance, repeated 100 times, compounds dramatically. We make that precise in section 4.2. First, the mechanics.
The Unrolling Trick: Time Becomes Depth
Start with the rolled view. One function, called at every step:
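A minimal NumPy sketch of that rolled view, assuming a plain tanh cell. The names rnn_step and rnn_forward, the bias b, and the shapes are illustrative choices for this sketch, not something fixed by the text.

```python
import numpy as np

def rnn_step(x_t, h_prev, W_x, W_h, b):
    """One recurrent step. The same W_x, W_h, b are used at every call."""
    z_t = W_x @ x_t + W_h @ h_prev + b    # pre-activation
    return np.tanh(z_t)                    # new hidden state

def rnn_forward(X, h0, W_x, W_h, b):
    """Rolled view: a loop that calls rnn_step once per time step."""
    h = h0
    for x_t in X:                          # X has shape (T, input_dim)
        h = rnn_step(x_t, h, W_x, W_h, b)
    return h

rng = np.random.default_rng(0)
T, input_dim, hidden_dim = 5, 3, 4
X = rng.normal(size=(T, input_dim))
W_x = rng.normal(size=(hidden_dim, input_dim)) * 0.1
W_h = rng.normal(size=(hidden_dim, hidden_dim)) * 0.1
b = np.zeros(hidden_dim)
h_T = rnn_forward(X, np.zeros(hidden_dim), W_x, W_h, b)
```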
That's the programmer's view: a loop calling the same function T times. Backprop doesn't see loops. It sees graphs. Unroll the loop, and the graph reveals itself:
Once unrolled, this is just a feedforward network. h at step 3 and h at step 4 are different nodes connected by an edge. The word "recurrent" describes the code, not the graph.
The difference from a normal feedforward net: W_h and W_x appear at every node. Same weight, used T times. So when we differentiate, every use produces a gradient contribution. The backward pass walks the graph in reverse and accumulates them all:
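Here is one way that backward walk could look in NumPy. It is a sketch under simplifying assumptions: a tanh cell without bias, and a loss that depends only on the final hidden state, with its gradient grad_h_T supplied by the caller. The function name bptt is illustrative.

```python
import numpy as np

def bptt(X, h0, W_x, W_h, grad_h_T):
    """Forward, then backward, through an unrolled tanh RNN.

    Assumes the loss depends only on the final hidden state;
    grad_h_T is dL/dh_T.
    """
    T = X.shape[0]
    H = [h0]
    for t in range(T):                        # forward: cache every state
        H.append(np.tanh(W_x @ X[t] + W_h @ H[-1]))

    grad_W_x = np.zeros_like(W_x)
    grad_W_h = np.zeros_like(W_h)
    grad_h_next = grad_h_T                    # gradient arriving at h_t
    for t in reversed(range(T)):              # backward: walk the graph in reverse
        h_t, h_prev = H[t + 1], H[t]
        grad_z_t = grad_h_next * (1.0 - h_t ** 2)   # through tanh
        grad_W_h += np.outer(grad_z_t, h_prev)      # shared weight: accumulate
        grad_W_x += np.outer(grad_z_t, X[t])        # shared weight: accumulate
        grad_h_next = W_h.T @ grad_z_t              # signal to the previous step
    return H[-1], grad_W_x, grad_W_h, grad_h_next
```

A finite-difference check on a single entry of W_h is a cheap way to confirm that the accumulated gradient matches the loss.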
At each time step, the gradient goes three places:
- To the parameters, via += (shared weights accumulate from every step)
- To the previous hidden state, via W_h.T @ grad_z_t (the signal flowing backward through time)
- To the input at that step (usually a dead end unless you need input gradients)
The += is everything. When the backward loop reaches step t, grad_W_h already holds contributions from steps t+1 through T. += stacks step t on top. Replace it with = and every iteration overwrites the one before, leaving only the contribution from whichever step the loop visited last (in a reverse loop, the earliest time step). The weight participated T times in the forward pass. Its gradient is the sum of T contributions.
That's all "backprop through time" means. A depth-T chain with tied weights. Same backward rules you've been using the whole post. The only new piece is the accumulation, and that's just the shared-parameter rule from Part 2, applied T times.
The Exploding/Vanishing Gradient Problem Gets Worse
The unrolled RNN is a 100-layer feedforward network. That should make you nervous about gradients.
In a standard feedforward net with 10 layers, backprop multiplies gradients by 10 different weight matrices. Each matrix has its own eigenvalue structure. One layer might shrink gradients slightly, the next expands them. The effects partially cancel. Not ideal, but not systematically disastrous.
An RNN doesn't get that cancellation. The backward pass multiplies by the same matrix, W_h, at every step. Whatever W_h does to a gradient, it does T times in a row.
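From the backward loop in 4.1, the gradient handed to the previous hidden state is W_h.T @ grad_z_t.
That line runs once per step. In a 100-step sequence, it runs 100 times. The gradient at h_0 has passed through W_h.T one hundred times:
$$\frac{\partial L}{\partial h_0} = W_h^\top D_1 \, W_h^\top D_2 \cdots W_h^\top D_{100} \, \frac{\partial L}{\partial h_{100}}, \qquad D_t = \mathrm{diag}\bigl(\tanh'(z_t)\bigr)$$
Strip away the nonlinearity to see the matrix power effect in isolation. A linear RNN:
$$h_t = W_h h_{t-1} + W_x x_t$$
The gradient of the loss with respect to $h_0$ (taking the loss at the final step $T$):
$$\frac{\partial L}{\partial h_0} = \left(W_h^\top\right)^{T} \frac{\partial L}{\partial h_T}$$
What determines the scaling is the spectral radius $\rho(W_h)$, the magnitude of the largest eigenvalue:
$$\left\lVert \frac{\partial L}{\partial h_0} \right\rVert \sim \rho(W_h)^{T} \left\lVert \frac{\partial L}{\partial h_T} \right\rVert$$
Plug in $T = 100$:
- If $\rho = 0.95$: gradient scaled by $0.95^{100} \approx 0.006$ (significant decay)
- If $\rho = 0.9$: gradient scaled by $0.9^{100} \approx 2.7 \times 10^{-5}$ (near total vanishing)
- If $\rho = 1.05$: gradient scaled by $1.05^{100} \approx 131$ (manageable but growing)
- If $\rho = 1.1$: gradient scaled by $1.1^{100} \approx 13{,}800$ (explosion)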
The stable zone is razor-thin. The only truly safe spectral radius is exactly 1, and nothing in standard training keeps it there. At 0.95, barely below 1, you retain 0.6% of the gradient after 100 steps. That isn't "reduced." It's gone. At 1.05, barely above 1, 131x growth. Five percent off in either direction and training falls apart.
Now add the nonlinearity back. Tanh has a derivative of 1 at the origin, but it falls off fast as the hidden state grows:
$$\tanh'(z) = 1 - \tanh^2(z)$$
Even if you somehow nail a spectral radius of exactly 1, the nonlinearity still multiplies gradients by roughly 0.5 per step for moderate hidden states. After 100 steps: $0.5^{100} \approx 8 \times 10^{-31}$. Not a small gradient. Numerical zero.
This is why vanilla RNNs can't learn long-range dependencies. The gradient passes through the same matrix at every step, squeezed by the activation derivative each time. After 50-100 steps, it's either infinity or zero. Anything more than a few dozen time steps apart might as well be in different training runs.
Truncated BPTT: A Memory-Compute Tradeoff
Everything in section 4.2 was about gradient magnitude: vanishing, exploding, the spectral radius controlling which one wins. But there's a more basic problem that hits before any of that matters. Can you even store the unrolled graph?
Full BPTT needs every hidden state cached until the backward pass sweeps through. Do the arithmetic for a realistic setup: hidden dimension 2048, batch size 64, sequence length 4096. That's $2048 \times 64 \times 4096 \approx 537$ million floats just for hidden states. In FP32, over 2GB. And we haven't counted input activations, pre-activation values, or gradient buffers. At some point the backward pass simply doesn't fit in memory.
[DIAGRAM: Two horizontal rows, each showing a timeline of T=100 steps. Top row labeled "Full BPTT": every step is a filled square (representing a cached hidden state), all uniformly shaded. Under the entire row, a brace labeled "Memory: T × H × B — grows with sequence length." Bottom row labeled "Truncated BPTT": same 100 steps, divided into five equal chunks of 20. Only one chunk (steps 41-60) is fully shaded and labeled "active window." The remaining chunks are hollow squares with thin arrows showing the hidden state threading through them. Under just the active window, a brace labeled "Memory: window × H × B — fixed." At each chunk boundary, a small scissors icon labeled "gradient stops here — value passes through." Caption: "Full BPTT caches every state. Truncated BPTT caches only the current window, keeping memory independent of sequence length."]
The fix is straightforward once you see the memory problem clearly. Instead of backpropagating through the entire sequence, chop it into fixed-size windows. Run forward through one window, backpropagate, accumulate weight gradients. Then carry the hidden state forward to the next window, but cut the gradient. This is truncated BPTT.
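The windowed loop can be sketched as follows. This assumes a tanh cell without bias and a squared-error loss against a per-step target, chosen only to keep the example short. The function names window_forward_backward and truncated_bptt are hypothetical.

```python
import numpy as np

def window_forward_backward(X_win, h_start, W_x, W_h, targets_win):
    """Forward and backward over one window.

    Assumes a per-step loss 0.5 * ||h_t - target_t||^2 (illustration only).
    """
    H = [h_start]
    for x_t in X_win:
        H.append(np.tanh(W_x @ x_t + W_h @ H[-1]))

    grad_W_x, grad_W_h = np.zeros_like(W_x), np.zeros_like(W_h)
    grad_h_next = np.zeros_like(h_start)
    for t in reversed(range(len(X_win))):
        grad_h = grad_h_next + (H[t + 1] - targets_win[t])  # local loss + later steps
        grad_z_t = grad_h * (1.0 - H[t + 1] ** 2)
        grad_W_h += np.outer(grad_z_t, H[t])
        grad_W_x += np.outer(grad_z_t, X_win[t])
        grad_h_next = W_h.T @ grad_z_t      # discarded at the window boundary
    return H, grad_W_x, grad_W_h

def truncated_bptt(X, targets, W_x, W_h, window_size):
    grad_W_x, grad_W_h = np.zeros_like(W_x), np.zeros_like(W_h)
    h = np.zeros(W_h.shape[0])
    for start in range(0, len(X), window_size):
        end = start + window_size
        H, g_x, g_h = window_forward_backward(
            X[start:end], h, W_x, W_h, targets[start:end])
        grad_W_x += g_x                     # accumulate weight gradients
        grad_W_h += g_h
        h = H[-1].copy()   # the value carries forward; no gradient history does
        # H is replaced on the next iteration, so only one window is cached
    return grad_W_x, grad_W_h
```

With window_size equal to the sequence length this reduces to full BPTT, which makes it easy to compare the two gradients on a short sequence.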
The critical line is h = H[-1].copy(). The hidden state's value passes forward. The gradient doesn't. In PyTorch terms, this is h = h.detach(): the autograd engine treats each window's starting state as a leaf node with no history.
Memory is now bounded. You cache window_size hidden states at a time, then throw them away. The cost is window × hidden_dim × batch_size, independent of sequence length. A 4096-step sequence costs the same as a 40,000-step sequence.
But what you lose is subtle. The model can still use information from before the current window. If step 60's hidden state encodes something useful from steps 1-59, the network at step 75 can act on it. The representation propagates forward just fine. But if that encoding is wrong, you can't fix it. The gradient from step 75 reaches back through the current window to step 61. It cannot reach step 20. Whatever shaped step 60's hidden state is permanently upstream of any gradient signal.
Recent errors get corrected every window. Distant errors accumulate unchallenged. The model develops a bias toward recent causes, not because recent inputs matter more, but because those are the only ones the optimizer can see.
LSTM and GRU: Gradient Highways Through Time
Look back at section 4.2. A spectral radius of 0.95, barely below 1, leaves 0.6% of your gradient after 100 steps. That's not gradual decay. That's a wall. And the vanilla RNN has no way around it: the same W_h.T multiplies every backward step. One hundred applications of the same matrix. You'd need nearly perfect eigenvalues to survive.
The diagnosis points to the fix. Repeated matrix multiplication is the problem, so build a path that skips it. LSTMs (Long Short-Term Memory) do exactly this by splitting the recurrent state in two. The hidden state h_t still passes through weight matrices, same as before. But alongside it runs a cell state c_t that updates by addition: c_t = f_t * c_prev + i_t * c_tilde. Old cell state scaled by a forget gate, plus new candidate information. No weight matrix in that equation. No eigenvalue problem.
Here's the full LSTM cell with all four gates:
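A compact NumPy sketch of one LSTM step. The stacked weight layout (forget, input, output, candidate, in that order) and the name lstm_step are assumptions of this sketch; real libraries order and split these blocks in their own ways.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, W, b):
    """One LSTM step.

    W has shape (4 * hidden, input + hidden) and stacks the weights for the
    forget, input, output, and candidate blocks (layout assumed here).
    """
    hidden = h_prev.shape[0]
    z = W @ np.concatenate([x_t, h_prev]) + b
    f_t = sigmoid(z[0 * hidden:1 * hidden])       # forget gate
    i_t = sigmoid(z[1 * hidden:2 * hidden])       # input gate
    o_t = sigmoid(z[2 * hidden:3 * hidden])       # output gate
    c_tilde = np.tanh(z[3 * hidden:4 * hidden])   # candidate values
    c_t = f_t * c_prev + i_t * c_tilde            # additive cell update
    h_t = o_t * np.tanh(c_t)                      # exposed hidden state
    return h_t, c_t, f_t
```

Because the gates here depend on x_t and h_prev but not on c_prev, nudging one entry of c_prev changes the matching entry of c_t by exactly f_t times the nudge.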
Four gates, two state vectors, a lot of wiring. But the gradient story lives in one line: c_t = f_t * c_prev + i_t * c_tilde.
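During backprop, the gradient reaching c_prev through this line is:
$$\frac{\partial L}{\partial c_{t-1}} = \frac{\partial L}{\partial c_t} \odot f_t$$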
Compare this to the vanilla RNN, where each backward step multiplied by the full W_h.T. Here, each step multiplies by f_t: a per-element vector between 0 and 1. No shared matrix. No fixed eigenvalue structure hammering the gradient identically at every step.
When the network wants to remember something, the forget gate stays near 1. The gradient flows through almost untouched. When information should be discarded, the gate closes and the gradient stops. That's not a bug. That's the network deciding what matters. The cell state adapts its permeability at each step instead of fighting one fixed eigenvalue structure for 100 steps straight.
GRUs (Gated Recurrent Units) compress the same idea into fewer parameters:
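A matching sketch of one GRU step, using the update convention h_t = (1 - z_t) * h_prev + z_t * h_tilde. The parameter names and the concatenated-input layout are assumptions made for this example.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gru_step(x_t, h_prev, W_z, W_r, W_h, b_z, b_r, b_h):
    """One GRU step. Each W_* has shape (hidden, input + hidden)."""
    xh = np.concatenate([x_t, h_prev])
    z_t = sigmoid(W_z @ xh + b_z)                  # update gate
    r_t = sigmoid(W_r @ xh + b_r)                  # reset gate
    h_tilde = np.tanh(W_h @ np.concatenate([x_t, r_t * h_prev]) + b_h)
    h_t = (1 - z_t) * h_prev + z_t * h_tilde       # convex combination
    return h_t, z_t
```

Driving the update-gate bias strongly negative pushes z_t toward 0, and the step then returns h_prev almost unchanged.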
No separate cell state. The GRU folds everything into one hidden vector and updates it as a convex combination: h_t = (1 - z_t) * h_prev + z_t * h_tilde. When the update gate z_t is near 0, the hidden state passes through unchanged. So does the gradient. When z_t is near 1, computation routes through the new candidate instead. Same gradient highway principle as the LSTM, one fewer gate.
LSTMs don't eliminate vanishing gradients. They make them survivable.
An average forget gate of 0.95 over 100 steps still wipes out most of the gradient. The difference is that gate values aren't fixed. They're computed fresh from the current input and hidden state, not baked into the eigenvalues of W_h. Some paths through the sequence stay open the whole way. Others close early. The network learns which paths to keep open. A vanilla RNN can't do that. Its decay rate is a property of the weight matrix, applied uniformly everywhere, regardless of content.
That's the real advantage. Not that gradients never vanish, but that the vanishing becomes selective. Paths carrying important information survive. Paths carrying noise decay. And along the cell-state path, repeated matrix multiplication never gets a chance to compound into explosion.
Sequence Batching and Masking: Handling Variable Lengths
Here is a mistake worth making on toy data rather than on real data. You train a sequence model where every example has the same length. Works great. You move to real data: some sentences are seven words, some are forty-three. You pad the short ones to match the long ones, run training, watch the loss decrease. But you never added a mask. The model has been happily training on positions that do not exist.
Padding zeros are not neutral. They are fabricated inputs. Run the RNN through them without masking and the weights receive gradient from data you invented. The hidden state after the last real token gets overwritten by whatever the RNN computes on zeros. You read H[:, -1] at the end and hand your downstream task a corrupted representation.
The fix is simple: a mask. One for real positions, zero for padding.
Consider a batch of sequences with different lengths:
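As a small illustration, take three sequences of lengths 3, 5, and 2 padded to length 5. The lengths are made up for the example.

```python
import numpy as np

lengths = np.array([3, 5, 2])           # true length of each sequence
max_len = int(lengths.max())

# mask[b, t] is 1.0 where position t is a real token of sequence b, else 0.0
mask = (np.arange(max_len)[None, :] < lengths[:, None]).astype(np.float64)
# mask:
# [[1. 1. 1. 0. 0.]
#  [1. 1. 1. 1. 1.]
#  [1. 1. 0. 0. 0.]]
```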
The forward pass fix: when a sequence has ended, stop updating its hidden state. Copy it forward unchanged at every padded position. The clock stops for that sequence.
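A sketch of that masked forward pass, assuming a batch-first layout and a tanh cell without bias. The function name masked_rnn_forward is hypothetical.

```python
import numpy as np

def masked_rnn_forward(X, mask, W_x, W_h):
    """X: (batch, T, input_dim); mask: (batch, T) with 1 = real, 0 = padding."""
    batch, T, _ = X.shape
    hidden = W_h.shape[0]
    H = np.zeros((batch, T + 1, hidden))          # H[:, 0] is the initial state
    for t in range(T):
        h_new = np.tanh(X[:, t] @ W_x.T + H[:, t] @ W_h.T)
        mask_t = mask[:, t][:, None]              # (batch, 1), broadcasts over hidden
        H[:, t + 1] = mask_t * h_new + (1 - mask_t) * H[:, t]
    return H
```

With this update, H[:, -1] holds each sequence's state after its last real token, whatever values sit in the padded slots.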
The key line is H[:, t+1] = mask_t * h_new + (1 - mask_t) * H[:, t]. When mask_t is 0, this collapses to H[:, t]. The first sequence's state after position 3 is just its state at position 3, copied forward unchanged. Padded steps are inert.
The backward pass requires the same discipline. Same principle as the dropout mask: you differentiate through the function that actually ran. At padded positions, the forward pass copied state unchanged. The backward pass must honor that. Zero gradient at every padded position, as if those steps never existed.
The loss needs the mask for a different reason. Cross-entropy at a padded position is masked to zero; there is no real target there. But the denominator is the problem. If you call loss.mean() over all batch_size * seq_len positions, you divide by positions that contributed nothing to the numerator. Average sequence length 20, max length 100? You are dividing by five times more positions than you should. The per-token loss shrinks proportionally. So does the learning signal.
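The denominator fix is a one-liner. This sketch assumes per-token losses have already been computed into a (batch, seq_len) array; masked_mean_loss is an illustrative name.

```python
import numpy as np

def masked_mean_loss(per_token_loss, mask):
    """Average per-token losses over real positions only."""
    return (per_token_loss * mask).sum() / mask.sum()

lengths = np.array([3, 5, 2])
mask = (np.arange(5)[None, :] < lengths[:, None]).astype(np.float64)
per_token_loss = np.full((3, 5), 2.0)        # pretend every token has loss 2.0

naive = (per_token_loss * mask).mean()       # divides by all 15 positions
correct = masked_mean_loss(per_token_loss, mask)   # divides by the 10 real ones
```

Here correct recovers the true per-token loss of 2.0, while naive is scaled down by the fraction of real positions (10 of 15).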
Gradient clipping needs the mask too. Padding positions contribute zero gradient, but if they are still counted in the averaging denominator, they scale the whole gradient down, and the computed norm with it. If your sequences vary widely in length, the norm you measure can be significantly smaller than the norm over real tokens alone. Your clipping threshold ends up looser than you intended.
Getting masking wrong is quiet. The loss still converges. Nothing crashes. You just get a model that trained slower and worse than it should have, and the difference is hard to trace back to the source. Hidden states corrupted by padding carry that corruption forward. A loss divided by the wrong denominator suppresses the learning signal throughout training. Gradients diluted by padding zeros miscalibrate clipping. Everything looks fine. It just works less well than it should.
The diagnostic is direct: verify that grad_h_next is exactly zero at every padded position after the backward pass. If it is not, there is a leak.
Masking is not a detail. It is the contract between your batch representation and the actual data. Every operation that touches sequence data, the forward pass, the backward pass, the loss, the gradient statistics, needs to honor it.
Vanishing and Exploding Gradients
We've covered how gradients flow backward through networks, how optimizers use them, and how to compute them efficiently. But there's something we haven't addressed yet: what happens when those gradients become useless? When they either shrink to nothing or explode to infinity as they propagate backward through your network?
This is the gradient pathology that almost killed deep learning in the 1990s. You'd stack more layers expecting more power, but instead you'd get a network that couldn't learn at all. The gradients would either vanish to numerical zero before reaching early layers, or explode to NaN and destroy your training. Your deep network would become an expensive random number generator.
The problem is fundamental: gradients are products. In a network with $L$ layers, the gradient reaching the first layer is a product of $L$ local derivatives. Products of many numbers tend toward extremes. Multiply 0.9 by itself 50 times and you get 0.005. Multiply 1.1 by itself 50 times and you get 117. Now imagine this happening with matrices, where eigenvalues determine the scaling, and you see why depth was considered impossible.
The solutions to this problem transformed deep learning from a curiosity into the dominant paradigm. Careful initialization, architectural innovations like skip connections, and normalization layers didn't just make deep networks trainable: they made depth an asset rather than a liability. Understanding these solutions is the difference between networks that learn and networks that don't.
Where Problems Arise
So the gradient is a product. We established that. But what, exactly, goes into that product? Two things: activation derivatives and weight matrices. Every layer contributes one of each, and they multiply together across the full depth of the network.
Trace the backward pass through a simple stack of layers:
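A minimal sketch of that trace, assuming bias-free tanh layers. The helper names forward and backward are illustrative. It records the gradient norm after each layer so the shrinking or growing is visible.

```python
import numpy as np

def forward(x, weights):
    """Stack of tanh layers; caches every activation for the backward pass."""
    activations = [x]
    for W in weights:
        activations.append(np.tanh(W @ activations[-1]))
    return activations

def backward(grad_out, weights, activations):
    """Walk the stack in reverse, recording the gradient norm at each layer."""
    grad = grad_out
    norms = []
    for W, a in zip(reversed(weights), reversed(activations[1:])):
        grad = grad * (1.0 - a ** 2)    # activation derivative (tanh)
        grad = W.T @ grad               # weight matrix
        norms.append(np.linalg.norm(grad))
    return grad, norms
```

Each pass through the loop applies one activation derivative and one weight matrix, which are exactly the two factors every layer contributes to the product.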
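Written as one expression, the gradient at the input is:
$$\frac{\partial L}{\partial x} = W_1^\top D_1 \, W_2^\top D_2 \cdots W_L^\top D_L \, \frac{\partial L}{\partial a_L}$$
where $D_l = \mathrm{diag}\bigl(\sigma'(z_l)\bigr)$ is the activation derivative at layer $l$, for layers $a_l = \sigma(z_l)$ with $z_l = W_l a_{l-1}$. A product of matrices. Whether gradients survive depends on the norms of those matrices, which come down to two choices you make at design time: your activation function and your weight initialization.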
That's it. Two knobs, and the product amplifies whatever they do.
The Activation Function Trap
Start with activations. Sigmoid was the default for years. Smooth, differentiable, biologically motivated. But look at its derivative:
$$\sigma'(x) = \sigma(x)\bigl(1 - \sigma(x)\bigr)$$
The sigmoid derivative peaks at 0.25, at $x = 0$. Move to $x = 2$ or $x = -2$, which is typical for pre-activations without careful initialization, and you're down to about 0.1. That's a single layer. Stack 10 of them: $0.1^{10} = 10^{-10}$. The gradient doesn't just shrink. It disappears.
Tanh is better (peaks at 1.0 instead of 0.25) but has the same shape. Both saturate. Both crush gradients in the flat regions.
ReLU fixed the saturation problem. Its derivative is exactly 1 for positive inputs:
$$\mathrm{ReLU}'(x) = \begin{cases} 1 & x > 0 \\ 0 & x \le 0 \end{cases}$$
No saturation, no decay. But ReLU trades one failure mode for another: dead neurons. Whenever a neuron's input is negative, its gradient is zero. Not small. Zero. If the input stays negative for every training example, the neuron stops learning and never recovers. If enough neurons die, gradient flow gets choked through whatever survives.
The Weight Matrix Multiplier Effect
Fix your activations and you still have a problem. At every layer, the backward pass multiplies the gradient by $W^\top$. Whether that multiplication grows or shrinks the gradient depends on the eigenvalues of $W$.
If the spectral radius (largest absolute eigenvalue) is above 1, each layer amplifies the gradient. Below 1, each layer shrinks it. And the compounding is brutal. A per-layer factor of 0.8 sounds harmless. After 50 layers: $0.8^{50} \approx 1.4 \times 10^{-5}$. A per-layer factor of 1.2 sounds equally harmless. After 50 layers: $1.2^{50} \approx 9100$. The window of stability is razor-thin.
Depth Makes Everything Worse
This is the part that made deep learning seem impossible for a decade. Depth is what makes neural networks powerful, but depth is also what makes gradients unstable. More layers means more terms in the product, and products of many numbers drift toward extremes.
You can see this formally. Model each layer as scaling the gradient by some random factor $s_l$. The total scaling is $S = \prod_{l=1}^{L} s_l$. Take the log:
$$\log S = \sum_{l=1}^{L} \log s_l$$
If the expected log-scaling $\mathbb{E}[\log s_l]$ is even slightly negative, gradients vanish exponentially in depth. Slightly positive, they explode exponentially. Stable gradients require $\mathbb{E}[\log s_l] = 0$ exactly. And even then, variance accumulates, so the distribution of gradient magnitudes spreads wider with every additional layer.
The early 2010s were spent figuring out how to beat these problems. What emerged weren't patches. They were design principles that made depth work. Initialization, normalization, residual connections, gradient clipping. Each one attacks a different term in that product.
Initialization Cures: Starting at the Right Scale
The fate of your training is often sealed before you take a single gradient step.
Initialize weights too large and activations explode. Gradients follow: they vanish through saturated sigmoids and tanhs, or blow up through linear regions. Initialize too small and signals decay to nothing on the way forward. No forward signal, no backward gradient. Either way, training stalls before it begins.
But there's a constraint you can exploit. If variance stays roughly constant from layer to layer, neither explosion nor vanishing happens. And that constraint is specific enough to give you exact formulas for how to set your initial weights, based on just two things: layer width and activation function.
Xavier/Glorot Initialization: Preserving Variance
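Consider a linear layer $y = Wx$ where $x$ has $n_{in}$ components. If we initialize $W$ with zero mean and variance $\sigma^2$ and assume $x$ has unit variance with zero mean:
$$\mathrm{Var}(y_i) = \sum_{j=1}^{n_{in}} \mathrm{Var}(W_{ij})\,\mathrm{Var}(x_j) = n_{in}\,\sigma^2$$
For the output to also have unit variance: $\sigma^2 = 1/n_{in}$.
But during backprop, gradients flow backward through $W^\top$. Preserving gradient variance requires $\sigma^2 = 1/n_{out}$. Two constraints, one parameter. Xavier Glorot and Yoshua Bengio (2010) took the compromise: the harmonic mean of both.
$$\sigma^2 = \frac{2}{n_{in} + n_{out}}$$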
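A sketch of an initializer built on that compromise, using a normal distribution and a gain multiplier on the standard deviation. The function name xavier_init and the (n_out, n_in) weight layout are choices made for this example.

```python
import numpy as np

def xavier_init(n_in, n_out, gain=1.0, rng=None):
    """Normal init with Var(W) = gain^2 * 2 / (n_in + n_out)."""
    rng = np.random.default_rng() if rng is None else rng
    std = gain * np.sqrt(2.0 / (n_in + n_out))
    return rng.normal(0.0, std, size=(n_out, n_in))

rng = np.random.default_rng(0)
W = xavier_init(512, 512, rng=rng)
x = rng.normal(size=(512, 1000))      # unit-variance inputs
y = W @ x                             # output variance should stay near 1
```

With n_in equal to n_out, both the forward and backward constraints are met at once, so y.var() lands close to 1.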
The gain parameter accounts for the activation function. Linear and tanh get gain=1.0, which approximately preserves variance. Sigmoid isn't quite right at gain=1.0, but close enough.
He/Kaiming Initialization: Accounting for ReLU
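Xavier assumes activations that roughly preserve variance. ReLU breaks that assumption. It zeros out every negative value, cutting variance in half:
$$\mathbb{E}\bigl[\mathrm{ReLU}(z)^2\bigr] = \tfrac{1}{2}\,\mathbb{E}\bigl[z^2\bigr] \quad \text{for zero-mean, symmetric } z$$
Half the variance gone, every single layer. After 10 layers: $(1/2)^{10} \approx 0.001$. Your signal is a thousand times weaker than where it started.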
Kaiming He's fix (2015) is straightforward: double the initial variance to compensate for ReLU's halving.
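One way to write an initializer with a ReLU branch, as a sketch. The function name init_weights and the string-valued activation argument are assumptions for illustration.

```python
import numpy as np

def init_weights(n_in, n_out, activation="relu", rng=None):
    """Pick the weight variance from the activation that follows the layer."""
    rng = np.random.default_rng() if rng is None else rng
    if activation == "relu":
        var = 2.0 / n_in                  # He: doubled to offset ReLU's halving
    else:
        var = 2.0 / (n_in + n_out)        # Xavier, for tanh or linear layers
    return rng.normal(0.0, np.sqrt(var), size=(n_out, n_in))

def second_moment_after(activation, depth=10, width=256, seed=0):
    """Mean squared activation after `depth` ReLU layers with the given init."""
    rng = np.random.default_rng(seed)
    h = rng.normal(size=(width, 512))
    for _ in range(depth):
        W = init_weights(width, width, activation, rng)
        h = np.maximum(0.0, W @ h)        # ReLU is applied in both cases
    return np.mean(h ** 2)

ms_he = second_moment_after("relu")       # stays on the order of 1
ms_xavier = second_moment_after("tanh")   # Xavier scale under ReLU: decays
```

Running both through ten ReLU layers shows the He-scaled signal staying at roughly its starting magnitude while the Xavier-scaled one shrinks by about three orders of magnitude.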
That factor of 2 in the ReLU branch is doing all the work. Without it, deep ReLU networks are untrainable. With it, variance stays stable across layers.
LSUV: Data-Driven Initialization
Xavier and He derive the right scale analytically, under assumptions about your data and activations. What if those assumptions don't hold? What if your architecture is too unusual for a closed-form solution?
Layer-Sequential Unit-Variance (LSUV) initialization skips the theory entirely. Run actual data through your network, measure the variance at each layer, and rescale weights until you hit the target:
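A simplified sketch of that measure-and-rescale loop, assuming a plain stack of bias-free ReLU layers. It skips any pre-initialization step and only shows the variance-matching idea; lsuv_init, tol, and max_iters are illustrative names.

```python
import numpy as np

def lsuv_init(weights, x_batch, tol=0.05, max_iters=10):
    """Rescale each layer in order until its output variance is close to 1.

    weights: list of (n_out, n_in) float arrays, modified in place.
    x_batch: (n_in, batch) array of representative inputs.
    """
    h = x_batch
    for W in weights:
        for _ in range(max_iters):
            var = np.var(W @ h)            # measure what actually happens
            if abs(var - 1.0) < tol:
                break
            W /= np.sqrt(var)              # rescale toward unit output variance
        h = np.maximum(0.0, W @ h)         # ReLU, then move to the next layer
    return weights
```

For a linear pre-activation a single rescale already hits the target, so the inner loop mostly matters when the measured quantity is taken after a nonlinearity.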
No closed-form formulas. No activation-specific derivations. It works for any architecture because it measures what actually happens instead of predicting what should happen. The tradeoff: you need a representative data batch at initialization time.
The Variance Lens
All three methods share the same principle: control variance propagation. Think of your network as a chain of amplifiers. Each layer has a gain:
- Gain > 1: Variance grows. Eventual explosion.
- Gain < 1: Variance shrinks. Eventual vanishing.
- Gain = 1: Variance stable. Gradients flow.
Initialization sets these gains to 1 at the start of training. As training proceeds, the network adjusts them, but starting near 1 gives optimization a fighting chance.
The right initialization doesn't guarantee success. The wrong initialization guarantees failure. Get this right and half your gradient problems disappear before training even begins.
Normalization and Residuals: Controlling the Flow
Initialization gets you started. It doesn't keep you stable.
As training progresses, weight updates change the statistics of your activations. What started as unit variance drifts toward zero or infinity. Different training examples push in opposite directions. The distribution your layers were calibrated for stops being the distribution they actually see.
Two architectural ideas from the mid-2010s attack this problem from different angles: batch normalization forces activations back to a standard distribution, and residual connections provide gradient highways that bypass troublesome transformations entirely. Together, they made 100+ layer networks routine.
Batch Normalization: Forcing Statistical Discipline
Batch normalization (Ioffe and Szegedy, 2015) does something blunt: normalize activations to zero mean and unit variance, then let the network learn a different mean and variance if it wants to. That bluntness is the point.
Here's the problem it solves. Even with perfect initialization, as soon as training starts, weight updates in earlier layers change the distribution of inputs to later layers. A layer expecting inputs centered around zero suddenly receives inputs centered around five. It has to re-adapt. This is called internal covariate shift, and it compounds. A small change in layer 1 propagates through layer 2, gets amplified by layer 3, and by layer 10 the activations look nothing like they did a few gradient steps ago. Every layer is chasing a moving target.
BatchNorm breaks the cycle. After each transformation, it forces the distribution back to a known state. Each layer can learn without constantly adapting to upstream drift.
The mechanism is straightforward. During the forward pass, for each mini-batch:
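The normalization step, sketched in NumPy for a (batch, features) input. The function name batchnorm_normalize and the eps default are choices made for this example.

```python
import numpy as np

def batchnorm_normalize(x, eps=1e-5):
    """x: (batch, features). Normalize each feature across the batch."""
    mean = x.mean(axis=0)                       # per-feature batch mean
    var = x.var(axis=0)                         # per-feature batch variance
    x_norm = (x - mean) / np.sqrt(var + eps)    # zero mean, unit variance
    return x_norm, mean, var
```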
First we compute the mean and variance across the batch. Then we normalize: subtract the mean, divide by the standard deviation. Now x_norm has mean zero and variance one, regardless of what the input distribution looked like.
But we don't stop there. We immediately apply a learnable affine transformation: gamma (scale) and beta (shift).
Wait. We just spent three operations forcing everything to mean zero and variance one. Now we scale and shift it again with learnable parameters? The network could learn gamma = sqrt(var) and beta = mean, completely undoing the normalization. Wouldn't we end up right back where we started?
This apparent paradox is the key. Without gamma and beta, we'd force every layer to output zero-mean, unit-variance activations. What if that's not optimal? What if a layer needs activations with mean 10 and variance 100? What if the best representation is mostly negative, but we're clamping everything toward zero?
Early experiments tried exactly this: fixed normalization, no learnable parameters. Networks trained faster initially (controlled distributions helped gradients), but final accuracy was consistently worse than unnormalized networks. Stable distributions, but the wrong distributions. Better optimization, worse representation.
gamma and beta fix this by giving the network full representational freedom while changing the optimization landscape. It's not about forcing a specific distribution. It's about decoupling the statistics of a layer's activations from the statistics of its inputs. Without BatchNorm, changing the weights changes both what the layer computes and the distribution of its outputs. With BatchNorm, scale and shift are learned independently through gamma and beta, while the normalization keeps the optimization landscape smooth.
Think of it this way: without BatchNorm, to increase the scale of activations, the network has to orchestrate changes across dozens of weights, each one affecting both the mean and the variance in coupled ways. With BatchNorm, increase gamma. Want to shift activations more positive? Adjust beta. The normalization step ensures these adjustments follow smooth, predictable gradients.
It's a reparameterization trick. The network can represent the same functions either way, but with BatchNorm the path to those functions is straighter. Same destinations, easier navigation. The network keeps full representational power over its distributions while the optimizer gets a smoother landscape to work with.
The effect on gradient flow is immediate. Normalized activations stay in a range where gradients are well-behaved. Sigmoid and tanh don't saturate as easily. ReLU gets more consistently positive inputs. The gradient signal stays strong across layers.
The backward pass is where things get interesting. Every sample in the batch contributes to the mean and variance, so gradients couple across the batch:
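A sketch of the forward and backward pass together, written path by path so the batch-wide sums are visible. It assumes a (batch, features) input and training-mode statistics; the cache layout is an arbitrary choice for this example.

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    mean = x.mean(axis=0)
    var = x.var(axis=0)
    x_norm = (x - mean) / np.sqrt(var + eps)
    out = gamma * x_norm + beta
    return out, (x, x_norm, mean, var, gamma, eps)

def batchnorm_backward(grad_out, cache):
    x, x_norm, mean, var, gamma, eps = cache
    N = x.shape[0]
    grad_gamma = (grad_out * x_norm).sum(axis=0)
    grad_beta = grad_out.sum(axis=0)
    grad_x_norm = grad_out * gamma
    # Path through the variance: a sum over every sample in the batch
    grad_var = (grad_x_norm * (x - mean) * -0.5 * (var + eps) ** -1.5).sum(axis=0)
    # Path through the mean: also a batch-wide sum
    grad_mean = ((-grad_x_norm / np.sqrt(var + eps)).sum(axis=0)
                 + grad_var * (-2.0 * (x - mean)).mean(axis=0))
    # Direct path plus the two shared-statistics paths
    grad_x = (grad_x_norm / np.sqrt(var + eps)
              + grad_var * 2.0 * (x - mean) / N
              + grad_mean / N)
    return grad_x, grad_gamma, grad_beta
```

One consequence worth checking numerically: grad_x sums to zero over the batch for each feature, because shifting every sample by the same constant leaves the normalized output unchanged.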
Notice what this backward pass reveals: BatchNorm couples gradients across the batch. The grad_mean and grad_var terms sum over all samples. When we compute grad_x, every sample's gradient picks up a contribution from these batch-wide statistics.
What does that mean? When computing the gradient for sample $x_i$, we don't just consider how $x_i$ affects the loss. We also consider how $x_i$ affects the batch mean and variance, which affect the normalized values of every other sample. It's implicit regularization: outlier gradients get smoothed by the batch.
Consider an outlier sample with an extreme gradient. It tries to push the mean and variance in an extreme direction, but that change affects all other samples, spreading the signal. The network gets feedback that says "this gradient would shift the entire batch distribution, not just this one sample." Overfitting to individual examples gets harder.
The coupling also helps gradient scale. The grad_x_norm / np.sqrt(var + eps) term means gradients are divided by the standard deviation of activations. High variance layers automatically get smaller gradients. Low variance layers get larger ones. The network self-regulates its gradient flow.
The cost: training behavior now depends on batch size. A batch of 32 has different statistics than a batch of 256. The statistics are noisier with smaller batches and steadier with larger ones. This is why BatchNorm models sometimes behave differently at different batch sizes, and why inference needs separate statistics (typically moving averages computed during training).
Layer Normalization: When Batches Don't Work
BatchNorm has an obvious problem: it requires batch statistics. Small batches mean noisy statistics. Batch size 1 (inference) means no statistics at all. Sequence models where different samples have different lengths? Batch statistics don't even make sense.
Layer normalization (Ba et al., 2016) sidesteps this by normalizing across features instead of across the batch:
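A forward-pass sketch for a (batch, features) input, with per-feature gamma and beta. The function name layernorm_forward is illustrative.

```python
import numpy as np

def layernorm_forward(x, gamma, beta, eps=1e-5):
    """x: (batch, features). Statistics are per sample, over the feature axis."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    x_norm = (x - mean) / np.sqrt(var + eps)
    return gamma * x_norm + beta
```

The only change from the batch version is the axis: statistics come from each row rather than each column, so changing other samples in the batch leaves a given sample's output untouched.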
No batch coupling. Each sample is normalized on its own, which makes LayerNorm perfect for small batch training, recurrent networks (where different positions in a sequence shouldn't affect each other's statistics), and attention mechanisms.
The trade-off: you lose BatchNorm's implicit regularization from batch coupling. You gain stability and consistency across batch sizes and sequence positions. For transformers, that trade-off isn't close. LayerNorm won.
Residual Connections: The Gradient Highway
Before 2015, there was a paradox. Deeper networks should be at least as good as shallow ones. A deep network could always learn to make its extra layers into identity mappings, reducing to an equivalent shallow network. Adding layers should never hurt.
In practice, it hurt. Networks with 20+ layers often performed worse than shallower ones, even on the training set. Not overfitting (that only affects test performance). Optimization failure: gradient descent couldn't find the trivial solution of making the extra layers do nothing.
Why? Learning the identity function is hard for stacked nonlinear layers. To make a layer pass its input through unchanged, the weight matrix has to become the identity and the nonlinearity has to stay out of the way. Random initialization starts the network far from that configuration and forces it to discover the identity through optimization. Gradient descent struggles with this.
Residual connections (He et al., 2015) flipped the problem. Instead of asking layers to learn a transformation $H(x)$, ask them to learn the residual $F(x) = H(x) - x$, so the block outputs $x + F(x)$. Learning to output zero is easy. Just set weights to zero. The network starts close to identity and only learns deviations from it.
That x + F(x) changes the default behavior to identity. The network adds refinements only when they improve the loss. Early in training, before F learns anything useful, x passes through unchanged. No degradation, no optimization failure.
The real payoff shows up during backpropagation. Gradients flow through two paths:
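A minimal sketch of the two paths, using F(x) = W @ relu(x) as a stand-in for the residual branch. That choice of branch and the function names are assumptions made to keep the example short.

```python
import numpy as np

def residual_forward(x, W):
    """y = x + F(x), with F(x) = W @ relu(x) as a stand-in residual branch."""
    a = np.maximum(0.0, x)
    return x + W @ a, a

def residual_backward(grad_out, x, W):
    grad_skip = grad_out                            # identity path: untouched
    grad_residual = (W.T @ grad_out) * (x > 0)      # path through F
    return grad_skip + grad_residual                # the two paths add
```

Setting W to zeros simulates a dead branch: residual_backward then returns grad_out unchanged.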
The gradient from the loss splits and takes two routes back:
- The skip path: grad_out flows directly through. No matrix multiplication, no nonlinearity, no attenuation. As if the layer doesn't exist.
- The residual path: grad_out flows through the backward pass of F, which might attenuate it, amplify it, or zero it out completely.
Then we add them. The skip path is your insurance policy. Even if is completely dead (all ReLUs negative, weights near zero, activations saturated), grad_out still flows backward. The gradient can never fully vanish.
Compare to a regular deep network. After 100 layers, your gradient has been transformed 100 times. If each layer attenuates by even 0.95x, you're left with of your original signal. With residuals? The full gradient still flows through the skip connections, with the residual paths adding refinements on top.
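Put this mathematically. For a residual block $y = x + F(x)$, the gradient with respect to the input is:

$$\frac{\partial L}{\partial x} = \frac{\partial L}{\partial y}\left(I + \frac{\partial F}{\partial x}\right)$$

The identity matrix guarantees that at least the original gradient flows through. The Jacobian of $F$ can only add to this flow, never block it entirely. This is fundamentally different from a regular layer $y = F(x)$, where the gradient is:

$$\frac{\partial L}{\partial x} = \frac{\partial L}{\partial y}\,\frac{\partial F}{\partial x}$$

No identity term. If $\frac{\partial F}{\partial x}$ is small or has small singular values, your gradient vanishes. In the residual case, even if $\frac{\partial F}{\partial x} = 0$ (completely dead layer), $I$ carries the gradient through.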
This changes the optimization landscape. Gradients in early layers now have roughly the same magnitude as gradients in late layers. The network learns features at all depths simultaneously, instead of training late layers first and slowly propagating learning backward.
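A toy experiment makes the difference in early-layer gradient magnitude visible. It assumes purely linear layers with small random weights (scale 0.05, chosen only so that each plain layer shrinks the gradient), which is a deliberate simplification of a real network; `backprop_norms` is a hypothetical helper.

```python
import numpy as np

def backprop_norms(weights, residual):
    """Push a unit gradient backward through linear layers; return its norm after each one."""
    dim = weights[0].shape[0]
    grad = np.ones(dim) / np.sqrt(dim)        # unit-norm upstream gradient
    norms = []
    for W in reversed(weights):               # last layer first
        branch = W.T @ grad                   # backward through the layer itself
        grad = grad + branch if residual else branch
        norms.append(np.linalg.norm(grad))
    return norms

rng = np.random.default_rng(0)
dim, depth = 16, 50
weights = [0.05 * rng.normal(size=(dim, dim)) for _ in range(depth)]

plain = backprop_norms(weights, residual=False)
resid = backprop_norms(weights, residual=True)

print(f"plain, gradient norm at the first layer:    {plain[-1]:.3e}")
print(f"residual, gradient norm at the first layer: {resid[-1]:.3e}")
```

With these settings the plain stack's gradient collapses by many orders of magnitude before it reaches the first layer, while the residual stack's stays on the order of its starting norm.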
The Effective Depth Perspective
There's a deeper way to think about this. Veit et al. (2016) showed that ResNets can be viewed as ensembles of shallow networks. Each possible path through the skip connections is a different shallow network.
A 100-layer ResNet isn't really a 100-layer network. It's an ensemble of networks of varying depths, all sharing weights. The deepest path goes through all 100 blocks. The shallowest is just the identity. Most paths fall somewhere in the middle.
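Make this concrete. Treating each block $F_i$ as a linear operator for illustration, the output of a 3-block ResNet expands algebraically:

$$y = (I + F_3)(I + F_2)(I + F_1)\,x = x + F_1 x + F_2 x + F_3 x + F_2 F_1 x + F_3 F_1 x + F_3 F_2 x + F_3 F_2 F_1 x$$

Each term in this expansion represents a different path through the network. The first term, $x$, is the path that skips everything. The term $F_3 F_2 F_1 x$ is the path through all three blocks. Terms like $F_2 F_1 x$ represent paths that use only some blocks.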
Path lengths follow a binomial distribution. In an $n$-block ResNet, the number of paths with exactly $k$ blocks is $\binom{n}{k}$. Most paths have length around $n/2$. Very short and very long paths are exponentially rare. The network is dominated by medium-length paths.
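The path counts are easy to check directly. A short sketch using only the standard library; `path_counts` is a hypothetical helper name.

```python
from math import comb

def path_counts(n_blocks):
    """Number of paths that pass through exactly k residual branches, for k = 0..n."""
    return [comb(n_blocks, k) for k in range(n_blocks + 1)]

n = 100
counts = path_counts(n)
total = sum(counts)                                    # every block is either used or skipped
most_common = max(range(n + 1), key=counts.__getitem__)
share_40_60 = sum(counts[40:61]) / total               # fraction of paths using 40..60 blocks

print(total == 2 ** n)         # True
print(most_common)             # 50
print(f"{share_40_60:.3f}")
```

For 100 blocks, the large majority of the $2^{100}$ paths use between 40 and 60 blocks, while the single full-depth path and the single pure-identity path are one each.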
Here's what matters for gradients: shorter paths receive stronger gradients. A path through 10 layers has its gradient attenuated less than a path through 100 layers, even with skip connections helping. So short paths train first. Early in training, the network behaves like a shallow network. Easy to optimize.
Why does this happen? Consider the gradient magnitude for a path through $k$ residual blocks. Even with skip connections, each block's Jacobian contributes some multiplicative factor to the gradient. If the average spectral norm of these Jacobians is $\rho$, then the gradient through a $k$-block path is roughly attenuated by $\rho^k$.
Short paths ($k$ small): $\rho^k \approx 1$, gradient flows at nearly full strength. Long paths ($k = 100$): even with $\rho = 0.93$ (pretty good!), $\rho^{100} \approx 7 \times 10^{-4}$. Roughly three orders of magnitude weaker.
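The attenuation is just a power, so it can be tabulated in a few lines. The per-block factors and path lengths below are illustrative values, not measurements, and `path_attenuation` is a hypothetical helper.

```python
def path_attenuation(rho, k):
    """Rough gradient scale after k blocks whose Jacobians each shrink it by a factor rho."""
    return rho ** k

for rho in (0.99, 0.95, 0.90):
    row = ", ".join(f"k={k}: {path_attenuation(rho, k):.1e}" for k in (5, 25, 100))
    print(f"rho={rho}: {row}")
```

Even a per-block factor close to 1 compounds quickly: the short paths barely lose signal while the 100-block paths lose most of it.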
This creates a natural curriculum. In the first few epochs, the network is effectively shallow: maybe 5-10 blocks deep in terms of which paths dominate the gradient signal. These shallow paths learn coarse features: edges, basic shapes, rough semantic categories.
As these paths train and the loss decreases, their gradients weaken (the loss surface flattens near the minimum). Medium-length paths (20-30 blocks) start contributing more. They learn refined features, building on the foundation the short paths established.
Eventually, even the longest paths through all 100 blocks start training. But by now, most of their constituent blocks are already doing something useful, learned by shorter paths. The long paths don't start from scratch; they're fine-tuning an already-functional network. A path through 50 layers has its first 25 layers already partially trained by shorter paths that used them. Effective depth is not fixed from the start; it increases gradually during training.
This ensemble perspective explains several mysteries: why ResNets train stably at 1000+ layers (short paths bootstrap the long ones), why you can delete random layers at test time with minimal performance loss (ensemble redundancy), and why the loss landscape is smoother (averaging over computational graphs instead of optimizing one). The "depth" of a ResNet is fluid. The network uses whatever effective depth it needs for the current stage, starting shallow and progressively deepening. Curriculum learning, baked into the architecture.
Normalization + Residuals: A Useful Pair
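In practice, normalization and residuals almost always appear together. The standard recipe: normalize, apply the layer, add the input back.
Each half covers a different failure mode:
- Normalization keeps activations in healthy ranges
- Residuals ensure gradients flow even if normalization or the residual branch $F$ fails
There's a question of ordering. Pre-norm normalizes the input to the residual branch, $y = x + F(\mathrm{Norm}(x))$. Post-norm normalizes after the addition, $y = \mathrm{Norm}(x + F(x))$.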
Pre-norm won. The skip connection is completely unobstructed: gradients flow through addition and nothing else. Post-norm forces gradients through the normalization on every backward step, adding one more thing that can go wrong.
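The difference between the two orderings shows up even in the forward pass. This is a sketch with a parameter-free layer norm and a stand-in branch that outputs zero; the function names are hypothetical and real layers would carry learned scale and shift parameters.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Normalize a vector to zero mean and unit variance (no learned scale or shift)."""
    return (x - x.mean()) / np.sqrt(x.var() + eps)

def pre_norm_block(x, branch):
    """y = x + F(Norm(x)): nothing but the addition sits on the skip path."""
    return x + branch(layer_norm(x))

def post_norm_block(x, branch):
    """y = Norm(x + F(x)): the skip path also passes through the normalization."""
    return layer_norm(x + branch(x))

def dead_branch(h):
    """Stand-in for a branch F that has learned nothing yet."""
    return np.zeros_like(h)

x = np.array([3.0, -1.0, 4.0, 10.0])
print(pre_norm_block(x, dead_branch))    # x comes back unchanged
print(post_norm_block(x, dead_branch))   # x comes back shifted and rescaled
```

With pre-norm, a dead branch leaves a pure identity map, so the backward pass along the skip path is a pure pass-through as well. With post-norm, the skip path itself is transformed, and its gradient has to go through the normalization's backward pass.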
These techniques turned depth from a liability into an asset. Networks with hundreds of layers became routine. The gradient pathologies from earlier in this section were largely solved. But sometimes, even with normalization and residuals in place, gradients still explode. That's where our last line of defense comes in.
Gradient Clipping and Scale Management
You build the whole stack. Xavier initialization to start at the right scale. LayerNorm to keep activations stable. Residual connections to give gradients a highway through depth. Training runs smoothly for hours. Then on step 47,312, a single batch sends gradient norms from 1.2 to 140,000. One update at that scale and your weights are ruined.
This happens. Not often, but it happens. An outlier example with extreme loss. A batch that happens to align gradients constructively. Accumulated numerical errors compounding through 100 layers of matrix multiplications. The product of many matrices is volatile. All the stabilization techniques from the previous sections reduce the probability of an explosion. They don’t eliminate it.
Gradient clipping is the safety net. The idea: after computing all your gradients but before the optimizer applies them, check their magnitude. If it’s too large, scale everything down.
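The most common form is global norm clipping. Treat all parameter gradients as one big concatenated vector $g$, compute its norm $\lVert g \rVert_2$, and if it exceeds a threshold $c$, rescale:

$$g \leftarrow g \cdot \min\left(1, \frac{c}{\lVert g \rVert_2}\right)$$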
Why the global norm? Why not clip each parameter’s gradient independently? Because direction matters. If you clip layer 3’s gradient but not layer 7’s, the relative magnitudes shift. The update no longer points in the direction the loss landscape suggested. Global norm clipping preserves the direction of the full gradient vector and only caps its magnitude. Same direction, shorter step.
The alternative, per-value clipping, clamps each gradient element to some range like $[-c, c]$. Simpler, but cruder. It doesn’t preserve direction, and different layers experience different effective clipping depending on their gradient scale. In practice, global norm clipping dominates for this reason.
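Both clipping styles fit in a few lines of NumPy. This is a sketch over a plain dict of arrays; the parameter names and the toy gradient values are made up for illustration, and frameworks ship their own versions of these utilities.

```python
import numpy as np

def global_norm(grads):
    """L2 norm of all gradients viewed as one concatenated vector."""
    return float(np.sqrt(sum(np.sum(g ** 2) for g in grads.values())))

def clip_by_global_norm(grads, max_norm):
    """Rescale every gradient by the same factor so the global norm is at most max_norm."""
    norm = global_norm(grads)
    scale = min(1.0, max_norm / (norm + 1e-12))
    return {name: g * scale for name, g in grads.items()}, norm

def clip_by_value(grads, limit):
    """Clamp each element to [-limit, limit] independently."""
    return {name: np.clip(g, -limit, limit) for name, g in grads.items()}

def cosine(a, b):
    """Cosine similarity between two gradient dicts, flattened in the same key order."""
    va = np.concatenate([a[k].ravel() for k in a])
    vb = np.concatenate([b[k].ravel() for k in a])
    return float(va @ vb / (np.linalg.norm(va) * np.linalg.norm(vb)))

grads = {
    "layer3.weight": np.array([30.0, -40.0]),   # large gradient
    "layer7.weight": np.array([0.3, 0.4]),      # small gradient
}

by_norm, norm_before = clip_by_global_norm(grads, max_norm=1.0)
by_value = clip_by_value(grads, limit=1.0)

print(f"norm before: {norm_before:.2f}, after: {global_norm(by_norm):.2f}")
print(f"direction kept, global norm clip: {cosine(grads, by_norm):.4f}")
print(f"direction kept, per-value clip:   {cosine(grads, by_value):.4f}")
```

Global norm clipping scales both layers by the same factor, so the cosine similarity with the original gradient stays at 1. Per-value clipping flattens the large layer and leaves the small one alone, which rotates the update.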
Here’s the thing, though. Clipping is a safety valve, not a training strategy. If it fires on most steps, something is wrong upstream. The gradient norms are telling you that initialization is off, or a normalization layer is missing, or the learning rate is too high, or your data has outliers that need handling. Raising the clip threshold is treating the symptom. Clipping preserves direction but caps magnitude. It can’t repair unstable dynamics.
The right response to frequent clipping is to fix the root cause:
- Persistent spikes → check initialization and whether you’re missing normalization somewhere
- Occasional spikes → probably outlier batches; clipping is doing exactly its job
- Gradual norm growth over training → learning rate may be too high for this stage of optimization
Monitor the clip rate (fraction of steps where clipping activates) and gradient norm over time. These are your best diagnostic signals for training health. A healthy run clips rarely. A run that clips on 30% of steps is limping.
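Tracking the clip rate takes very little code. A minimal sketch, assuming you already have the pre-clip global gradient norm at each step; `ClipMonitor` is a hypothetical helper, not part of any library, and the norms fed to it below are made-up values.

```python
class ClipMonitor:
    """Tracks gradient norms and how often clipping fires (hypothetical helper)."""

    def __init__(self, max_norm):
        self.max_norm = max_norm
        self.norms = []

    def record(self, grad_norm):
        """Store the pre-clip global norm for one step; return True if clipping fired."""
        self.norms.append(grad_norm)
        return grad_norm > self.max_norm

    def clip_rate(self, last_n=None):
        """Fraction of the last `last_n` recorded steps (default: all) that were clipped."""
        window = self.norms if last_n is None else self.norms[-last_n:]
        if not window:
            return 0.0
        return sum(n > self.max_norm for n in window) / len(window)

monitor = ClipMonitor(max_norm=1.0)
for norm in [0.8, 0.9, 1.2, 0.7, 140000.0, 0.9, 0.8, 1.1, 0.6, 0.9]:
    monitor.record(norm)

print(monitor.clip_rate())            # 0.3
print(monitor.clip_rate(last_n=5))    # 0.2
```

Logging the raw norms alongside the rate helps separate the cases in the list above: one isolated spike looks very different from a norm that creeps upward step after step.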
The practical stance: keep clipping enabled with a conservative threshold (1.0 is a common default), but design your network so it rarely fires. Good initialization, normalization, residual paths, sane learning rates. Clipping is the seatbelt, not the steering wheel. You want it there. You don’t want to need it.
Conclusion for Part 2: The Pattern That Scales
Go back to the beginning of this post. Fifteen operations, fifteen backward rules. Convolutions, attention, normalization, pooling, dropout, residuals, recurrent loops. Each one looked like its own special case.
Now look again. Every backward pass asked the same question: given a small change in the output, what change does that imply for each input? That's a VJP. The conv backward is a VJP. The attention backward is a VJP. The BatchNorm backward, with its three interleaved paths through mean, variance, and normalization, is a VJP. The notation changed. The shapes changed. The operation never did.
That compression is what makes the pattern portable. When you encounter a new layer, you don't need to look up its gradient formula. You need one question: how does a perturbation at the output propagate back through this computation? Write down that relationship. The gradient must match the parameter's shape, because it lives in the same space. Beyond that, the chain rule handles composition. The same mental model works whether the layer is a linear projection or a multi-head attention block.
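As one worked instance of that question, here is the backward pass of a linear layer written as a VJP, with the shape constraint checked explicitly. The shapes reuse the 1000-input, 500-output example from the start of the post; the batch size and function names are assumptions for illustration.

```python
import numpy as np

def linear_forward(x, W):
    """y = x @ W for a batch x of shape (batch, d_in) and W of shape (d_in, d_out)."""
    return x @ W

def linear_vjp(x, W, grad_out):
    """Given dL/dy of shape (batch, d_out), return (dL/dx, dL/dW)."""
    grad_x = grad_out @ W.T      # same shape as x
    grad_W = x.T @ grad_out      # same shape as W: outer products summed over the batch
    return grad_x, grad_W

rng = np.random.default_rng(0)
x = rng.normal(size=(32, 1000))          # batch of 32, 1000 inputs each
W = rng.normal(size=(1000, 500))         # 1000 inputs -> 500 outputs
grad_out = rng.normal(size=(32, 500))    # upstream gradient dL/dy

grad_x, grad_W = linear_vjp(x, W, grad_out)

# The gradient lives in the same space as the thing it differentiates.
print(grad_x.shape == x.shape)   # True
print(grad_W.shape == W.shape)   # True
```

Note that the 500×1000 Jacobian is never built. Two matrix products with the upstream gradient are all the backward pass needs.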
The other thread running through this post is subtler. Every architecture decision we covered was designed for a different stated purpose. Xavier initialization controls the starting scale of activations. BatchNorm stabilizes training dynamics. Residual connections let you stack hundreds of layers. LSTM gates let RNNs model long dependencies. But trace the backward pass through each of them and the same story emerges: they all exist to keep gradients flowing. Initialization sets them to a workable scale. Normalization prevents them from drifting. Residuals give them a path that doesn't attenuate. Gates choose which ones survive across time steps. The forward-pass purpose is real. The backward-pass purpose is the reason they actually work.
That's the lens Part 2 adds. Part 1 said: backprop is careful accounting. Part 2 says: architecture is careful gradient engineering. The two claims are the same claim viewed from different ends of the network.
We still have a gap. We've derived VJPs, connected gradients to optimizers and clipping, analyzed what makes training stable or unstable. But all of it has been mathematics on a whiteboard. None of it has touched a framework, dealt with finite-precision arithmetic, or confronted the fact that 32-bit floats are a luxury most training runs can't afford. The pattern scales. Whether the numbers stay trustworthy as you scale it is Part 3.
Backpropagation Part 3: Systems, Stability, and Scale
References and Further Reading
The papers behind the techniques covered in this post. Grouped by topic so you can dig into whichever section left you wanting more.
Initialization
- Glorot & Bengio (2010). Understanding the difficulty of training deep feedforward neural networks. The variance-preservation argument that gave us Xavier init.
- He et al. (2015). Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification. Extends Xavier to ReLU networks. The Kaiming init paper.
- Mishkin & Matas (2016). All you need is a good init. Data-driven initialization (LSUV) that measures actual activations instead of assuming linear layers.
Normalization
- Ioffe & Szegedy (2015). Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. The original BatchNorm paper. The "internal covariate shift" framing is debated, but the technique works.
- Ba, Kiros & Hinton (2016). Layer Normalization. Normalizes across features instead of across the batch. The default in transformers.
Residual connections
- He et al. (2016). Deep Residual Learning for Image Recognition. The ResNet paper. Identity shortcuts as gradient highways.
- Veit et al. (2016). Residual Networks Behave Like Ensembles of Relatively Shallow Networks. The "effective depth" perspective: most gradient signal flows through short paths, not the full depth of the network.