October 27, 2025 · 112 min read · intermediate

Backpropagation Part 2: Patterns, Architectures, and Training

Every gradient rule, from convolutions to attention, follows one pattern: the vector-Jacobian product. See past the memorized formulas to the unifying abstraction, understand how residuals and normalization tame deep networks, and learn why modern architectures are really just careful gradient engineering.

vjp batch-normalization residual-connections attention rnns lstms initialization optimization backpropagation

The Pattern Behind Every Gradient

You've just finished implementing backprop for your first neural network. You understand the chain rule, you can trace gradients through a few layers, and your toy MLP actually learns. You feel like you've got it.

Then you open a state-of-the-art architecture: BatchNorm, LayerNorm, multi-head attention, residual connections, convolutions with different strides, LSTM cells, dropout layers. Each one has its own gradient formula. You find yourself with a notebook full of rules: "for BatchNorm, do this; for convolutions, transpose and flip; for attention, track three separate paths; for residuals, just add." It works, but it feels like memorizing a reference without seeing the principle.

You do not need to memorize different rules for different layers. There is one pattern that appears everywhere, from the simplest matrix multiply to the most complex attention mechanism. This is the vector-Jacobian product (VJP): backprop never builds the Jacobian matrices we learned about in calculus. Instead, it directly computes what that matrix would do to a vector, without ever writing down the matrix itself.

In Part 1, we walked gradients through a small network by hand to build the mental model. We saw that backprop pushes a gradient backward, multiplying by local derivatives and accumulating at merges. That's the foundation.

Part 2 shows us the pattern that makes it scale. We'll stop treating each layer type as a special case and start seeing the single operation underneath. Once we recognize the VJP pattern, we won't need to memorize rules anymore. We'll be able to derive the gradient for a new layer type on the spot, just by asking: "How does this operation pull a vector back through its Jacobian?"

This is not only a conceptual point. It is how we write custom layers, debug shape errors, understand why certain architectures train better, and recognize that modern designs such as residual connections and layer normalization are principled gradient pathways grounded in VJP geometry.

We can think of it like this: backprop is not "build a Jacobian, then multiply." It's "act as if you had the Jacobian, but never materialize it." With that lens, a long list of memorized recipes reduces to one pattern we can apply by inspection. Let's dive in.

The Hidden Abstraction: Vector-Jacobian Products

Part 1 mentioned VJPs only in passing, so this section assumes no prior knowledge and starts from the fundamentals.

Throughout these posts, we've been casually saying "multiply by the local derivative" during backprop. But what exactly is "the local derivative" when an operation transforms vectors of different dimensions?

Start simple: with 2 inputs $(x_1, x_2)$ and 2 outputs $(y_1, y_2)$ , the Jacobian is:

J = \begin{bmatrix} \frac{\partial y_1}{\partial x_1} & \frac{\partial y_1}{\partial x_2} \\ \frac{\partial y_2}{\partial x_1} & \frac{\partial y_2}{\partial x_2} \end{bmatrix}

Each row captures how one output changes with respect to all inputs.

Now scale that up to 1000 inputs and 500 outputs. Each output needs 1000 derivatives (one per input), so each row has 1000 entries. With 500 outputs there are 500 rows, for a total of 500×1000 = 500,000 entries.

Now for the kicker: backprop never actually builds that matrix. Ever.

Instead, backprop uses vector-Jacobian products (VJPs), the hidden abstraction that makes everything efficient. The Jacobian contains answers to every possible question: "how does output $i$ change when I perturb input $j$ ?" But during backprop, we only ever ask one specific question: "given this gradient flowing back from the loss, what gradient should flow to the inputs?" VJP answers that exact question directly, computing only what we need. No massive matrix, no wasted memory, just the result we're looking for.

Understanding VJPs transforms backprop from a collection of memorized rules into a unified framework. Those transposes that appear everywhere? They're not arbitrary. The way gradients flow through matrix multiplies? There's a deep geometric reason. Even the most complex operations (attention, convolutions) follow the same VJP pattern once you see the structure. So let's start answering all these questions, one by one!

VJPs: The Engine Behind Every Gradient

You have some operation that takes $n$ inputs and produces $m$ outputs. The Jacobian $J$ is an $m \times n$ matrix holding all the partial derivatives: row $i$ contains $\frac{\partial y_i}{\partial x_1}, \frac{\partial y_i}{\partial x_2}, \ldots, \frac{\partial y_i}{\partial x_n}$ . In other words, how output $i$ responds to each input.

During backprop, you receive a gradient $\bar{v}$ from upstream, which is an $m$ -dimensional vector telling you $\frac{\partial L}{\partial y_1}, \frac{\partial L}{\partial y_2}, \ldots, \frac{\partial L}{\partial y_m}$ (how the loss depends on each output). Now what you need is $\frac{\partial L}{\partial x_j}$ for each input $j$ .

Recall that chain rule says: $\frac{\partial L}{\partial x_j} = \sum_{i=1}^{m} \frac{\partial L}{\partial y_i} \frac{\partial y_i}{\partial x_j}$ . I'll spare you the details, but when we compute this mathematically, it comes down to exactly $\bar{v}^T J$ : a row vector times a matrix, producing the gradient with respect to the inputs.

But here's the main thing: you never need $J$ itself. You only need the result of $\bar{v}^T J$ . This is a vector-Jacobian product (VJP), and every operation knows how to compute its VJP directly without forming $J$ .

Here's a simple example showing the difference:

For a layer with 1000 inputs and 500 outputs:

Naive way: Build 500×1000 matrix (2MB in float32), then multiply
VJP way: Direct computation, never more than vectors in memory
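The comparison above can be sketched in NumPy. This is a minimal illustration, assuming a hypothetical layer $y = \tanh(Wx)$ with 1000 inputs and 500 outputs; the layer choice, the variable names, and the random data are assumptions made for the example, not something the original specifies.

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_out = 1000, 500
W = rng.normal(size=(n_out, n_in)) / np.sqrt(n_in)
x = rng.normal(size=n_in)
v = rng.normal(size=n_out)          # upstream gradient dL/dy

y = np.tanh(W @ x)                  # hypothetical layer: y = tanh(W x)
local = 1.0 - y ** 2                # tanh'(z) at each output

# Naive way: materialize the 500 x 1000 Jacobian, then multiply.
J = local[:, None] * W              # J[i, j] = dy_i / dx_j
grad_x_naive = v @ J
jacobian_bytes_float32 = J.size * np.dtype(np.float32).itemsize

# VJP way: scale the upstream gradient elementwise, then one vector-matrix product.
grad_x_vjp = (v * local) @ W

print(jacobian_bytes_float32)                   # size of the matrix the VJP never builds
print(np.allclose(grad_x_naive, grad_x_vjp))    # both routes give the same gradient
```

The direct route reuses the weights that are already in memory and allocates nothing larger than a vector.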

The VJP is not only a memory optimization. It is often computationally cheaper too. Consider matrix multiplication $Y = XW$ . The Jacobian with respect to $X$ would be a 4D tensor (if we're being fully general with batches). But the VJP with respect to $X$ is simply $\text{grad\_output} \cdot W^T$ , a matrix multiply.

Let's trace through a concrete VJP to demystify this:
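A minimal NumPy sketch of such a trace for $Y = XW$ follows. The shapes, the toy loss, and the helper `numerical_grad` are assumptions chosen for illustration; the finite-difference check is only there to confirm that the two VJP formulas agree with the definition of the gradient.

```python
import numpy as np

rng = np.random.default_rng(0)
B, N, M = 4, 3, 2
X = rng.normal(size=(B, N))
W = rng.normal(size=(N, M))
grad_output = rng.normal(size=(B, M))   # stands in for dL/dY from upstream

def loss(X, W):
    # A scalar whose gradient with respect to Y = X @ W is exactly grad_output.
    return np.sum((X @ W) * grad_output)

# The two VJPs of the matrix multiply: no Jacobian is ever formed.
grad_X = grad_output @ W.T      # shape (B, N), same as X
grad_W = X.T @ grad_output      # shape (N, M), same as W

def numerical_grad(f, A, eps=1e-6):
    # Central finite differences, one entry at a time (slow, for checking only).
    g = np.zeros_like(A)
    for idx in np.ndindex(A.shape):
        A_plus, A_minus = A.copy(), A.copy()
        A_plus[idx] += eps
        A_minus[idx] -= eps
        g[idx] = (f(A_plus) - f(A_minus)) / (2 * eps)
    return g

num_grad_X = numerical_grad(lambda A: loss(A, W), X)
num_grad_W = numerical_grad(lambda A: loss(X, A), W)
print(np.allclose(grad_X, num_grad_X, atol=1e-6))
print(np.allclose(grad_W, num_grad_W, atol=1e-6))
```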

To put it another way: for every operation, there is a direct formula for its VJP that is much simpler than "build Jacobian, then multiply." This is why backprop is feasible. You are not doing millions of matrix multiplies with huge Jacobians. You are doing vector operations with efficient shortcuts.

Shape Discipline: When Gradients Must Match

This rule has been emphasized throughout: gradients must have the same shape as their corresponding parameters. Here's why:

When you compute the gradient $\nabla_W L$ of the loss with respect to parameters $W$ , you're asking: "How should I change each element of $W$ to decrease the loss?" The answer must specify a change for every single parameter. If $W$ has shape (128, 784), then you need 128×784 numbers telling you how to adjust each weight.

But there's a deeper reason rooted in the VJP structure. Here's exactly how parameter gradients emerge from VJPs:

Now examine the operation $Y = WX$ in index notation: $Y_{ij} = \sum_k W_{ik} X_{kj}$

Taking the derivative with respect to $W_{ab}$ : $\frac{\partial Y_{ij}}{\partial W_{ab}} = \sum_k \delta_{ia} \delta_{kb} X_{kj} = \delta_{ia} X_{bj}$

where $\delta$ is the Kronecker delta (1 if indices match, 0 otherwise).

Now when we compute the full gradient: $\frac{\partial L}{\partial W_{ab}} = \sum_{ij} \frac{\partial L}{\partial Y_{ij}} \frac{\partial Y_{ij}}{\partial W_{ab}} = \sum_j \frac{\partial L}{\partial Y_{aj}} X_{bj}$

In matrix form: $\nabla_W L = (\nabla_Y L) \cdot X^T$

The shape works out perfectly:

$\nabla_Y L$ has shape $(M, B)$ , the same as $Y$ (here $W$ is $(M, N)$ and $X$ is $(N, B)$ )
$X^T$ has shape $(B, N)$
Product has shape $(M, N)$ : exactly matching $W$ !

This shape preservation isn't luck. It's because the gradient tells you the rate of change for each parameter. Just like velocity has the same "shape" as position (3D position → 3D velocity), gradients have the same shape as their parameters.

Here's a practical example showing how shape discipline helps catch bugs:
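One possible sketch, using the $Y = WX$ convention from the derivation above. The function name `linear_backward` and the specific sizes are hypothetical, chosen only to show how shape assertions catch a transposed gradient before it silently corrupts an update.

```python
import numpy as np

def linear_backward(grad_Y, W, X):
    """Backward pass for Y = W @ X with W: (M, N), X: (N, B), Y: (M, B)."""
    M, N = W.shape
    B = X.shape[1]
    assert X.shape == (N, B), f"X should be {(N, B)}, got {X.shape}"
    assert grad_Y.shape == (M, B), f"grad_Y should be {(M, B)}, got {grad_Y.shape}"

    grad_W = grad_Y @ X.T        # (M, B) @ (B, N) -> (M, N)
    grad_X = W.T @ grad_Y        # (N, M) @ (M, B) -> (N, B)

    assert grad_W.shape == W.shape, "gradient must match parameter shape"
    assert grad_X.shape == X.shape, "gradient must match input shape"
    return grad_W, grad_X

rng = np.random.default_rng(0)
W = rng.normal(size=(128, 784))
X = rng.normal(size=(784, 32))
grad_Y = rng.normal(size=(128, 32))

grad_W, grad_X = linear_backward(grad_Y, W, X)

# A common bug: passing the upstream gradient in the wrong layout.
try:
    linear_backward(grad_Y.T, W, X)
    caught = False
except AssertionError:
    caught = True
print(caught)
```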

The assertions aren't just debugging aids. They're mathematical requirements. If these shapes don't match, your gradient is mathematically incorrect, not just programmatically wrong.

Beyond Matrices: The Einsum Perspective

So far the focus has been on matrix operations, but modern deep learning involves higher-dimensional tensors: 4D convolution kernels, 3D attention tensors, and beyond. The VJP pattern still applies, but index notation becomes unwieldy. This is where Einstein summation (einsum) notation becomes invaluable.

Einsum makes the index structure explicit, which makes deriving VJPs mechanical. Here's how this works:

The pattern is systematic: to get the VJP for a tensor, you contract the gradient with all other tensors, arranging indices to match the original tensor's shape.
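As a small sketch of that contraction rule, here is a plain matrix multiply written with `np.einsum`. The array names and sizes are arbitrary choices for the example.

```python
import numpy as np

rng = np.random.default_rng(0)
A = rng.normal(size=(3, 4))
Bm = rng.normal(size=(4, 5))

# Forward: C[i, k] = sum_j A[i, j] * Bm[j, k]
C = np.einsum('ij,jk->ik', A, Bm)
grad_C = rng.normal(size=C.shape)     # stands in for dL/dC from upstream

# Backward: contract grad_C with the *other* operand,
# and name the output indices after the tensor being differentiated.
grad_A = np.einsum('ik,jk->ij', grad_C, Bm)   # same shape as A
grad_B = np.einsum('ij,ik->jk', A, grad_C)    # same shape as Bm

# These agree with the familiar transpose formulas.
print(np.allclose(grad_A, grad_C @ Bm.T))
print(np.allclose(grad_B, A.T @ grad_C))
```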

But where einsum really shines is batch matrix multiplication:

See how the batch and spatial dimensions (B, N) get reduced when computing grad_W? That's the accumulation pattern emphasized throughout. Einsum makes it explicit: any index that doesn't appear in the output gets summed over.
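A minimal sketch of that batched case, assuming an input of shape (B, N, I) and a shared weight of shape (I, O); the index letters and sizes are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
B, N, I, O = 2, 5, 4, 3
X = rng.normal(size=(B, N, I))
W = rng.normal(size=(I, O))          # shared across batch and positions

Y = np.einsum('bni,io->bno', X, W)
grad_Y = rng.normal(size=Y.shape)    # stands in for dL/dY

# b and n are absent from the output 'io', so they are summed over:
# every (batch, position) pair contributes to the same shared weight.
grad_W = np.einsum('bni,bno->io', X, grad_Y)

# i is absent from grad_Y, o is absent from the output 'bni', so o is summed.
grad_X = np.einsum('bno,io->bni', grad_Y, W)

# Cross-check against flattening batch and position into one axis.
flat_check = X.reshape(-1, I).T @ grad_Y.reshape(-1, O)
print(np.allclose(grad_W, flat_check))
```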

Let's trace through an attention mechanism to show how complex operations decompose into simple VJPs:

Einsum makes the VJP derivation algorithmic:

Identify which indices are summed (contracted) in the forward pass
In the backward pass, those become free indices
Free indices in the forward pass might become contracted in the backward pass
The gradient calculation is just "reversing the wiring"
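To make the attention case concrete, here is a sketch of single-head scaled dot-product attention with a hand-written backward pass built from three VJPs (output matmul, softmax, score matmul). The function names, the absence of masking and batching, and the toy loss are all simplifying assumptions; the finite-difference check is only a sanity test.

```python
import numpy as np

def softmax(S):
    Z = np.exp(S - S.max(axis=-1, keepdims=True))
    return Z / Z.sum(axis=-1, keepdims=True)

def attention_forward(Q, K, V):
    d = Q.shape[-1]
    S = np.einsum('qd,kd->qk', Q, K) / np.sqrt(d)   # scores
    P = softmax(S)                                  # attention weights
    O = np.einsum('qk,kv->qv', P, V)                # outputs
    return O, P

def attention_backward(grad_O, Q, K, V, P):
    d = Q.shape[-1]
    grad_V = np.einsum('qk,qv->kv', P, grad_O)      # VJP of O = P V w.r.t. V
    grad_P = np.einsum('qv,kv->qk', grad_O, V)      # VJP of O = P V w.r.t. P
    # Row-wise softmax VJP: P * (g - sum(g * P)).
    grad_S = P * (grad_P - np.sum(grad_P * P, axis=-1, keepdims=True))
    grad_Q = np.einsum('qk,kd->qd', grad_S, K) / np.sqrt(d)
    grad_K = np.einsum('qk,qd->kd', grad_S, Q) / np.sqrt(d)
    return grad_Q, grad_K, grad_V

rng = np.random.default_rng(0)
Q = rng.normal(size=(3, 5))
K = rng.normal(size=(4, 5))
V = rng.normal(size=(4, 2))
grad_O = rng.normal(size=(3, 2))                    # stands in for dL/dO

O, P = attention_forward(Q, K, V)
grad_Q, grad_K, grad_V = attention_backward(grad_O, Q, K, V, P)

def loss(Q, K, V):
    return np.sum(attention_forward(Q, K, V)[0] * grad_O)

def numerical_grad(f, A, eps=1e-6):
    g = np.zeros_like(A)
    for idx in np.ndindex(A.shape):
        A_plus, A_minus = A.copy(), A.copy()
        A_plus[idx] += eps
        A_minus[idx] -= eps
        g[idx] = (f(A_plus) - f(A_minus)) / (2 * eps)
    return g

num_Q = numerical_grad(lambda A: loss(A, K, V), Q)
num_K = numerical_grad(lambda A: loss(Q, A, V), K)
num_V = numerical_grad(lambda A: loss(Q, K, A), V)
print(np.allclose(grad_Q, num_Q, atol=1e-6),
      np.allclose(grad_K, num_K, atol=1e-6),
      np.allclose(grad_V, num_V, atol=1e-6))
```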

Modern frameworks like JAX and PyTorch expose einsum directly and lower it to the same contraction primitives that back matrix multiplication, which keeps both the forward and backward passes explicit and optimizable. When you understand einsum VJPs, you understand how these frameworks can compute gradients for any tensor contraction.

The Takeaway: It's All Just VJPs

Every gradient calculation in deep learning, no matter how complex the architecture, is just composing VJPs. You never build Jacobians. You never multiply by huge Jacobian matrices. You just flow vectors backward through VJP functions.

That transpose pattern you see everywhere? It's the natural structure of VJPs for linear operations. The shape matching requirement? It's the mathematical necessity that gradients live in the cotangent space. The way convolutions, attention, and batch operations handle gradients? They're all following the same VJP composition rules.

When you see gradients as VJPs rather than matrix multiplies, several things become clear:

Why reverse mode autodiff is memory efficient (only vectors, never matrices)
Why transposes appear in backward passes (adjoints of linear maps)
Why gradients must match parameter shapes (cotangent space dimension)
Why complex operations have simple gradient rules (VJP composition)

The next time you implement a custom operation or debug gradient flow, don't think about Jacobian matrices. Think about VJPs: given the loss's sensitivity to each output, what is its sensitivity to each input? Answer that with a direct formula, and you've got your backward pass.

This VJP perspective scales from simple linear layers to the most complex architectural innovations. It's the unifying abstraction that makes automatic differentiation possible at the scale of modern deep learning.

Backprop Recipes: A Practical Library

We've covered the principles. Now here is the reference. This section is a manual for how gradients flow through common operations. No derivations from first principles (you already know how to do that). Focus on the patterns, the gotchas, and the mental models that make implementing these operations routine.

Think of each operation as having its own gradient behavior. Linear layers are straightforward outer products. Convolutions are correlations in disguise. Pooling operations route gradients like a switchboard. Once you know each operation's behavior, you can predict how gradients will flow without working through the math every time.

Because this is more advanced material, meant mainly as a reference, the recipes live in collapsible sections so that readers less interested in the specialized details can skip them at this stage.

Mental Models That Stick

After implementing backprop through dozens of architectures, here are some useful mental shortcuts:

Linear operations (matrix multiply, conv): Gradient for weights = outer product of output gradient and input. Gradient for input = output gradient transformed by transposed/flipped weights.

Elementwise operations (ReLU, sigmoid): Gradient gets multiplied by local derivative elementwise. Dead neurons (ReLU) or saturated neurons (sigmoid) create zero gradients.

Reduction operations (sum, mean, max): Sum/mean distributes gradient equally. Max routes gradient only to winner. What was discarded in forward pass is lost forever in backward pass.

Normalization (BatchNorm, LayerNorm): Creates gradient coupling through statistics. Every input affects every output through mean/var, making gradients complex but helping with training stability.

Structural operations (reshape, transpose, concat, split): Pure routing with no computation. Reshape/transpose reorder gradients. Concat/split slice gradients. Addition copies gradients.

The pattern that unifies everything: each operation knows how to compute $v^T J$ (its VJP) without forming $J$ . This local computation, composed across the graph, gives you all gradients efficiently. Once you internalize this pattern, you can derive the backward pass for any operation by asking: "Given the gradient of the loss with respect to the outputs, what gradient reaches the inputs?"

This is your working toolkit. When you see a new operation, you don't need to memorize its gradient formula. Just understand its forward behavior, and the backward pass follows from asking how outputs depend on inputs. The chain rule and automatic differentiation handle the rest.

Optimization Meets Backprop

We've covered how backprop efficiently computes millions of gradients. But gradients alone don't train your network. They're just directional information: which way is downhill from here. To actually learn, you need an optimizer that takes these gradients and decides how to update your parameters.

Think of backprop as a sophisticated sensor system that tells you the slope at your current position in parameter space. The optimizer is your navigation strategy: how fast to move, whether to build momentum, how to adapt to the terrain. Backprop gives you the map; the optimizer plans the journey.

This relationship is so fundamental that people often conflate them. They'll say "backprop learns to recognize images" when they really mean "gradient descent using gradients from backprop learns to recognize images." The distinction matters because you can swap optimizers without changing backprop, and different optimizers can dramatically change training dynamics even with identical gradients.

Here's what happens after backprop hands over the gradients, and why the choices you make here can be the difference between convergence in hours versus days (or never).

What the Optimizer Expects From Backprop

Backprop’s output is simple: for each parameter, a gradient tensor of the same shape. Using those gradients well is the optimizer’s job. A few practical notes (details in the Gradient Descent post):

Shape and meaning: gradients match parameter shapes and point toward loss increase; updates move opposite to them.
Reduction and scale: sum vs mean loss changes gradient magnitude. Mean reduction keeps gradients batch-size independent; adjust learning rate accordingly if you use sum.
Noise and batch size: larger batches reduce variance; smaller batches add helpful noise. Treat batch size and learning rate as a pair.
Clipping: apply global-norm clipping as a safety valve for explosions; if it triggers often, fix the root cause (LR/init/model), don’t rely on clipping forever.
Sparsity: some layers (e.g., embeddings) yield sparse gradients: optimizers should update only the touched rows efficiently.

The gradient is the message; the optimizer is the interpreter.

Momentum, RMSProp, and Adam: Beyond Vanilla SGD

Modern optimizers build on SGD with two ideas: momentum (use a smoothed direction) and adaptation (scale by recent gradient magnitudes). Quick mental model, full details in the Gradient Descent post:

SGD + Momentum: smooths updates; accelerates along consistent directions, damps zigzagging across narrow valleys.
RMSProp: per parameter step sizes shrink when recent gradients are large, grow when they are small.
Adam: momentum + RMSProp with bias correction; AdamW decouples weight decay so regularization strength stays consistent.

Schedules and Clipping: Dynamics Control

Two knobs have outsized impact on stability and speed:

Learning rate schedules: step, cosine, linear. Cosine + short warmup is a strong default; step works well with SGD when hand-tuned. See the Gradient Descent post for examples and trade-offs.
Gradient clipping: use global norm clipping as a seatbelt. If it engages often, revisit LR, initialization, or architecture.
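Global-norm clipping can be sketched in a few lines. The helper name `clip_by_global_norm` and the example gradients are hypothetical; the idea is that all gradient tensors are rescaled by one shared factor, so the update direction is preserved.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    # One norm over all parameters, treated as a single long vector.
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    # Scale is 1 when the norm is already within the limit.
    scale = min(1.0, max_norm / (total_norm + 1e-12))
    return [g * scale for g in grads], total_norm

# Two gradient tensors with a combined norm well above the limit.
grads = [np.full((2, 2), 3.0), np.full(3, 4.0)]
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))

# Gradients already inside the limit pass through untouched.
small = [np.array([0.3, 0.4])]
unchanged, small_norm = clip_by_global_norm(small, max_norm=1.0)

print(norm_before, norm_after, small_norm)
```

Logging how often `norm_before` exceeds `max_norm` is a cheap way to notice when the seatbelt is engaging too often.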

Curvature Hints via Hessian Vector Products

Sometimes first-order information is not enough. Hessian-vector products (HVPs) let you peek at curvature along a direction without forming the full Hessian. They are useful for step size heuristics, diagnosing sharpness, and inspiring second-order methods (L-BFGS, natural gradient, K-FAC). If you want the formulas and practical snippets, see the Gradient Descent post.

The punchline here: backprop gives you gradients efficiently; HVPs add selective curvature when you can afford it. Optimization is how you turn either (or both) into stable parameter updates.
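As a rough illustration of the idea, an HVP can be approximated from two gradient evaluations with a central difference. The toy loss, its hand-written gradient, and the helper name `hvp` are assumptions for this sketch; autodiff frameworks can compute the same product exactly by differentiating the gradient along the direction $v$.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 6
M = rng.normal(size=(n, n))
A = M @ M.T                                   # symmetric matrix for the toy loss

def grad(theta):
    # Gradient of the toy loss f(theta) = sum(theta**4) / 4 + 0.5 * theta^T A theta.
    return theta ** 3 + A @ theta

def hvp(grad_fn, theta, v, eps=1e-5):
    # Central difference of gradients along v approximates H @ v.
    return (grad_fn(theta + eps * v) - grad_fn(theta - eps * v)) / (2 * eps)

theta = rng.normal(size=n)
v = rng.normal(size=n)

approx = hvp(grad, theta, v)                  # two gradient calls, no n x n matrix
exact = (np.diag(3 * theta ** 2) + A) @ v     # full Hessian, built only for checking
curvature_along_v = v @ approx / (v @ v)      # sharpness of the loss in direction v

print(np.allclose(approx, exact, atol=1e-5), curvature_along_v)
```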

Backprop Through Time: When Your Graph Has a Time Axis

So far we've covered backprop on feedforward networks where data flows in one direction: input to output. But what about models that process sequences? Where the same network gets applied over and over to a stream of inputs, maintaining hidden state between steps? This is where backprop through time (BPTT) comes in, and it's simultaneously simpler and more treacherous than you might think.

The key insight: there's nothing fundamentally new here. A recurrent network processing a sequence is just a really deep feedforward network with weight sharing. Unroll the loop, and you get a standard computational graph. Apply backprop as usual. Done.

But that simplicity hides serious challenges. When you unroll an RNN for 100 time steps, you've created a 100 layer deep network. All the gradient pathologies discussed earlier get amplified. Worse, the same weight matrices get multiplied repeatedly, turning small eigenvalue problems into exponential growth. And the memory cost of storing 100 steps worth of activations can exhaust GPU memory.

Here's how BPTT works, why vanilla RNNs fail at long sequences, and how modern architectures like LSTMs create strong gradient paths through time. Once you see recurrence as just another graph pattern, the steps of sequence modeling become mechanical.

The Unrolling Trick: Time Becomes Depth

A recurrent network looks deceptively simple in its rolled up form:

But when you process a sequence, you're really building a deep computational graph:

The crucial observation: this is just a feedforward network where certain weights happen to be tied together. Nothing in the forward pass is actually "recurrent" from backprop's perspective. It's a DAG like any other.

The backward pass follows the usual rules, but now you need to accumulate gradients across time for the shared parameters:

The pattern is clear: at each time step, the gradient splits three ways:

To the parameters (accumulates because they're shared)
To the previous hidden state (continues flowing backward)
To the input at that time step (usually not used)

This is exactly the fork-and-accumulate pattern from earlier, just repeated many times. The fact that it's "through time" is just conceptual scaffolding. The math doesn't care about time; it only sees a graph.
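The unroll-then-backprop procedure can be sketched for a vanilla tanh RNN. The squared-error loss at every step, the function names, and the sizes are assumptions for the example; the single-entry finite-difference check at the end is a sanity test.

```python
import numpy as np

def rnn_forward(xs, h0, W_h, W_x, b):
    hs = [h0]
    for x in xs:                                   # unrolling: one "layer" per time step
        hs.append(np.tanh(W_h @ hs[-1] + W_x @ x + b))
    return hs                                      # hs[t + 1] is the state after step t

def rnn_loss(hs, ys):
    return 0.5 * sum(np.sum((h - y) ** 2) for h, y in zip(hs[1:], ys))

def rnn_backward(xs, ys, hs, W_h, W_x):
    grad_W_h = np.zeros_like(W_h)
    grad_W_x = np.zeros_like(W_x)
    grad_b = np.zeros(W_h.shape[0])
    grad_h = np.zeros_like(hs[0])                  # gradient arriving from the future
    for t in reversed(range(len(xs))):
        h_t, h_prev = hs[t + 1], hs[t]
        grad_h = grad_h + (h_t - ys[t])            # merge: this step's loss + future steps
        grad_z = grad_h * (1.0 - h_t ** 2)         # back through tanh
        grad_W_h += np.outer(grad_z, h_prev)       # shared weights accumulate over time
        grad_W_x += np.outer(grad_z, xs[t])
        grad_b += grad_z
        # The input gradient would be W_x.T @ grad_z; it is usually not needed.
        grad_h = W_h.T @ grad_z                    # continue to the previous hidden state
    return grad_W_h, grad_W_x, grad_b, grad_h

rng = np.random.default_rng(0)
T, D, H = 6, 3, 4
W_h = rng.normal(size=(H, H)) * 0.5
W_x = rng.normal(size=(H, D)) * 0.5
b = np.zeros(H)
xs = rng.normal(size=(T, D))
ys = rng.normal(size=(T, H))
h0 = np.zeros(H)

hs = rnn_forward(xs, h0, W_h, W_x, b)
grad_W_h, grad_W_x, grad_b, grad_h0 = rnn_backward(xs, ys, hs, W_h, W_x)

# Finite-difference check on one shared weight.
eps = 1e-6
W_plus, W_minus = W_h.copy(), W_h.copy()
W_plus[0, 1] += eps
W_minus[0, 1] -= eps
numeric = (rnn_loss(rnn_forward(xs, h0, W_plus, W_x, b), ys)
           - rnn_loss(rnn_forward(xs, h0, W_minus, W_x, b), ys)) / (2 * eps)
print(np.isclose(grad_W_h[0, 1], numeric, atol=1e-6))
```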

The Exploding/Vanishing Gradient Problem Gets Worse

Remember the gradient pathology discussed earlier? In an RNN, it's not just bad: it's exponentially bad. Here's why.

Consider the gradient flowing from time step $T$ back to time step 0. It gets multiplied by $W_h^T$ repeatedly:

If the largest eigenvalue magnitude of $W_h$ is greater than 1, gradients explode exponentially. If it's less than 1, they vanish exponentially. The stable region where it equals exactly 1 is measure zero: you'll never hit it in practice.

Here's the math. For a linear RNN (no nonlinearity, for simplicity):

$h_t = W_h h_{t-1} + W_x x_t$

The gradient of the loss with respect to $h_0$ involves:

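$\frac{\partial L}{\partial h_0} = \frac{\partial L}{\partial h_T} \prod_{t=1}^{T} \frac{\partial h_t}{\partial h_{t-1}} = \frac{\partial L}{\partial h_T} (W_h)^{T}$

Here the exponent in $(W_h)^{T}$ is the $T$ -th matrix power (one factor of $W_h$ per time step), not a transpose.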

The norm of this gradient scales as:

$\|\frac{\partial L}{\partial h_0}\| \approx \|\frac{\partial L}{\partial h_T}\| \cdot \rho(W_h)^T$

where $\rho(W_h)$ is the spectral radius (largest eigenvalue magnitude). For $T = 100$ :

If $\rho = 0.99$ : gradient scaled by $0.99^{100} \approx 0.37$ (significant decay)
If $\rho = 0.95$ : gradient scaled by $0.95^{100} \approx 0.006$ (near total vanishing)
If $\rho = 1.01$ : gradient scaled by $1.01^{100} \approx 2.7$ (manageable but growing)
If $\rho = 1.05$ : gradient scaled by $1.05^{100} \approx 131$ (explosion)
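These scaling factors can be reproduced numerically. The sketch below assumes a deliberately idealized recurrent matrix, $W_h = \rho Q$ with $Q$ orthogonal, so that every step scales the gradient norm by exactly $\rho$; real weight matrices only follow this trend approximately.

```python
import numpy as np

rng = np.random.default_rng(0)
Q, _ = np.linalg.qr(rng.normal(size=(8, 8)))   # random orthogonal matrix
grad_T = rng.normal(size=8)                    # stands in for dL/dh_T

def backprop_scale(rho, steps=100):
    W_h = rho * Q          # every singular value and eigenvalue magnitude equals rho
    g = grad_T.copy()
    for _ in range(steps):
        g = W_h.T @ g      # one step backward through the linear recurrence
    return np.linalg.norm(g) / np.linalg.norm(grad_T)

scales = {rho: backprop_scale(rho) for rho in (0.95, 0.99, 1.01, 1.05)}
for rho, s in scales.items():
    print(f"rho={rho}: gradient norm scaled by {s:.3g}")
```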

The tanh nonlinearity makes things worse. Its derivative is at most 1, but typically much smaller:

So even if your weight matrix has eigenvalue 1, the nonlinearity shrinks gradients. After 100 time steps, multiplying by 0.5 (a typical tanh derivative) at each step gives you $0.5^{100} \approx 10^{-30}$ . Your gradient has vanished into numerical zero.

This is why vanilla RNNs can't learn long-range dependencies. By the time the gradient travels back 50-100 steps, it's either exploded to infinity or vanished to zero. The network can't learn connections between events separated by more than a few dozen time steps.

Truncated BPTT: A Memory-Compute Tradeoff

Full BPTT requires storing activations for the entire sequence. For a sequence of length 1000 with hidden dimension 512, that's half a million floats per sequence. Process a batch of 32 sequences and you need about 64MB in float32 just for hidden states. Add in the input cache, gradient buffers, and optimizer state, and memory becomes the bottleneck.

Truncated BPTT (TBPTT) is the practical compromise: only backpropagate for a fixed window of time steps, but carry the hidden state forward:

The tradeoff is clear:

Memory: Only store activations for window_size steps (constant memory)
Gradient approximation: Can't learn dependencies longer than window_size
Bias: Recent information gets more gradient than distant information

You can get fancier with overlapping windows or randomized truncation lengths, but the principle remains: limit backprop depth to control memory while hoping the hidden state carries enough information forward.
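A compact sketch of the truncation loop, assuming a bias-free tanh RNN with a squared-error loss at each step and a plain SGD update per window. The helper `run_window`, the hyperparameters, and the data are all illustrative assumptions. The point is that the hidden state is carried forward as a plain value, so no gradient crosses a window boundary and only one window of activations is ever stored.

```python
import numpy as np

def run_window(xs, ys, h0, W_h, W_x):
    """Forward and backward over one window; h0 is treated as a constant."""
    hs = [h0]
    for x in xs:
        hs.append(np.tanh(W_h @ hs[-1] + W_x @ x))
    grad_W_h, grad_W_x = np.zeros_like(W_h), np.zeros_like(W_x)
    grad_h = np.zeros_like(h0)
    for t in reversed(range(len(xs))):
        grad_h = grad_h + (hs[t + 1] - ys[t])
        grad_z = grad_h * (1.0 - hs[t + 1] ** 2)
        grad_W_h += np.outer(grad_z, hs[t])
        grad_W_x += np.outer(grad_z, xs[t])
        grad_h = W_h.T @ grad_z
    # grad_h for h0 is discarded here: that is the truncation.
    return grad_W_h, grad_W_x, hs[-1], len(hs)

rng = np.random.default_rng(0)
T, window, H, D = 1000, 50, 8, 4
W_h = rng.normal(size=(H, H)) * 0.1
W_x = rng.normal(size=(H, D)) * 0.1
xs = rng.normal(size=(T, D))
ys = rng.normal(size=(T, H)) * 0.5
lr = 1e-3

h = np.zeros(H)
max_stored = 0
for start in range(0, T, window):
    g_Wh, g_Wx, h, stored = run_window(
        xs[start:start + window], ys[start:start + window], h, W_h, W_x)
    W_h -= lr * g_Wh                      # update after every window
    W_x -= lr * g_Wx
    max_stored = max(max_stored, stored)  # hidden states held in memory at once

print(max_stored)   # window + 1, independent of the full sequence length T
```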

LSTM and GRU: Gradient Highways Through Time

The limitations of vanilla RNNs aren't fundamental to sequence modeling. They're artifacts of a specific architecture choice. LSTMs (Long Short-Term Memory) and GRUs (Gated Recurrent Units) solve the gradient flow problem by creating paths where gradients can flow unchanged through time, much like residual connections do for depth.

The key innovation: replace multiplication with addition for the main information path. Instead of repeatedly multiplying hidden states by weight matrices, LSTMs maintain a "cell state" that gets modified through addition. Addition doesn't compound like multiplication does.

Here's the LSTM forward pass, broken down to show the gradient highways:

Look at that cell state update: c_t = f_t * c_prev + i_t * c_tilde. This is addition, not matrix multiplication! The gradient can flow through the f_t * c_prev term with only element-wise scaling, not repeated matrix multiplies.

During backprop, the gradient of the loss with respect to c_prev is:

The forget gate f_t typically stays close to 1 for important information, creating an almost unimpeded gradient path. Unlike the repeated multiplication by the same matrix $W_h$ in vanilla RNNs, each time step has its own gating decision. Some paths close (forget gate near 0), but others stay open (forget gate near 1).
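The direct path through the cell state can be sketched as follows, assuming a standard LSTM without peephole connections, where the gates depend on the input and the previous hidden state but not on c_prev. The stacked weight layout, the function names, and the sizes are assumptions for the example.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h_prev, c_prev, W, b):
    """W: (4H, D + H), stacked as [forget, input, candidate, output]."""
    H = h_prev.shape[0]
    a = W @ np.concatenate([x, h_prev]) + b
    f_t = sigmoid(a[0:H])               # forget gate
    i_t = sigmoid(a[H:2 * H])           # input gate
    c_tilde = np.tanh(a[2 * H:3 * H])   # candidate values
    o_t = sigmoid(a[3 * H:4 * H])       # output gate
    c_t = f_t * c_prev + i_t * c_tilde  # additive cell update
    h_t = o_t * np.tanh(c_t)
    return h_t, c_t, f_t

rng = np.random.default_rng(0)
D, H = 3, 4
W = rng.normal(size=(4 * H, D + H)) * 0.5
b = np.zeros(4 * H)
x = rng.normal(size=D)
h_prev = rng.normal(size=H)
c_prev = rng.normal(size=H)

h_t, c_t, f_t = lstm_step(x, h_prev, c_prev, W, b)

# grad_c_t stands in for the total gradient reaching c_t
# (from the next time step plus the branch through h_t).
grad_c_t = rng.normal(size=H)
grad_c_prev = grad_c_t * f_t            # elementwise scaling, no matrix multiply

# Finite-difference check of the cell-state path.
eps = 1e-6
numeric = np.zeros(H)
for k in range(H):
    dc = np.zeros(H)
    dc[k] = eps
    _, c_plus, _ = lstm_step(x, h_prev, c_prev + dc, W, b)
    _, c_minus, _ = lstm_step(x, h_prev, c_prev - dc, W, b)
    numeric[k] = grad_c_t @ (c_plus - c_minus) / (2 * eps)
print(np.allclose(grad_c_prev, numeric, atol=1e-6))
```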

GRUs simplify this idea with fewer gates but the same principle:

The GRU update h_t = (1 - z_t) * h_prev + z_t * h_tilde is a convex combination. When z_t is near 0, the hidden state passes through unchanged, creating a perfect gradient highway. This is even simpler than LSTM: one interpolation instead of separate forget/input gates.

But there's a subtlety: LSTMs don't completely solve gradient problems. They just make them manageable:

The advantage isn't that gradients never vanish in LSTMs. It's that:

Some paths maintain gradients while others vanish (heterogeneous decay)
The network learns which paths to keep open (adaptive gating)
No eigenvalue explosions from repeated matrix multiplication

Sequence Batching and Masking: Handling Variable Lengths

Real sequences have different lengths. You can't just stack them into a tensor without padding. But padding creates fake data that can corrupt gradients. This is where masking becomes essential.

Consider a batch of sequences with different lengths:

During the forward pass, you need to prevent padded positions from affecting the hidden state:

During backprop, masking prevents gradients from flowing through padded positions:

The loss computation also needs careful masking:

There's a subtle but important detail about how masks affect gradient statistics:

Masking might seem like a minor implementation detail, but getting it wrong leads to subtle bugs:

Gradients flowing through padding (polluting updates)
Loss averaged incorrectly (including padded positions)
Hidden states corrupted by padded inputs
Gradient norms computed incorrectly (affecting clipping)

Always validate your masking by checking that gradients are exactly zero at padded positions. If they're not, you have a bug that will silently degrade performance.
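That validation can be sketched for a masked mean-squared-error loss. The sequence lengths, the per-position scalar predictions, and the variable names are assumptions for the example; the same mask would also gate hidden-state updates in a full model.

```python
import numpy as np

lengths = np.array([5, 3, 2])                    # true lengths of three sequences
B, T = len(lengths), int(lengths.max())
# mask[b, t] is 1 for real positions and 0 for padding.
mask = (np.arange(T)[None, :] < lengths[:, None]).astype(float)

rng = np.random.default_rng(0)
pred = rng.normal(size=(B, T))
target = rng.normal(size=(B, T))

sq_err = (pred - target) ** 2
masked_loss = np.sum(sq_err * mask) / np.sum(mask)   # mean over real positions only
naive_loss = np.mean(sq_err)                         # wrong: padding is averaged in

# Gradient of masked_loss with respect to pred.
grad_pred = 2.0 * (pred - target) * mask / np.sum(mask)

# The check recommended above: exactly zero gradient at padded positions.
padding_is_clean = bool(np.all(grad_pred[mask == 0] == 0.0))
print(masked_loss, naive_loss, padding_is_clean)
```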

The key takeaway: sequence models are just graphs with particular patterns: weight sharing across time, special handling for variable lengths, and architectural innovations (LSTM/GRU) to maintain gradient flow. Once you understand these patterns, implementing and debugging sequence models becomes much more straightforward.

Vanishing and Exploding Gradients

We've covered how gradients flow backward through networks, how optimizers use them, and how to compute them efficiently. But there's something we haven't addressed yet: what happens when those gradients become useless? When they either shrink to nothing or explode to infinity as they propagate backward through your network?

This is the gradient pathology that almost killed deep learning in the 1990s. You'd stack more layers expecting more power, but instead you'd get a network that couldn't learn at all. The gradients would either vanish to numerical zero before reaching early layers, or explode to NaN and destroy your training. Your deep network would become an expensive random number generator.

The problem is fundamental: gradients are products. In a network with $L$ layers, the gradient reaching the first layer is a product of $L$ local derivatives. Products of many numbers tend toward extremes. Multiply 0.9 by itself 50 times and you get $5 \times 10^{-3}$ . Multiply 1.1 by itself 50 times and you get 117. Now imagine this happening with matrices, where eigenvalues determine the scaling, and you see why depth was considered impossible.

The solutions to this problem transformed deep learning from a curiosity into the dominant paradigm. Careful initialization, architectural innovations like skip connections, and normalization layers didn't just make deep networks trainable: they made depth an asset rather than a liability. Understanding these solutions is the difference between networks that learn and networks that don't.

Where Problems Arise

To fix gradient pathologies, you first need to understand where they come from. The culprits hide in plain sight: your activation functions and your weight matrices. Each contributes its own scaling factor to the gradient product, and when you stack many layers, these factors compound into disaster.

Here's the mathematics of accumulation. Consider the gradient flowing through a simple chain of layers:

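For a chain with $z_i = W_i x_{i-1}$ and $x_i = \phi(z_i)$ , the gradient reaching the input is: $\frac{\partial L}{\partial x_0} = \left[\prod_{i=1}^{L} W_i^T \cdot \text{diag}(\phi'(z_i))\right] \frac{\partial L}{\partial x_L}$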

where $\phi'$ is the activation derivative. This is a product of $L$ matrices. The scaling behavior depends on the norms of these matrices, which in turn depend on weight initialization and activation function choice.

The Activation Function Trap

Sigmoid and tanh were the default activations for decades. They have smooth gradients and biological plausibility. But they have a fatal flaw: their derivatives peak at 0.25 (sigmoid) or 1.0 (tanh) and rapidly decay toward zero as you move away from the origin.

Think about what this means. If your pre-activations have magnitude around 2-3 (which is typical without careful initialization), the sigmoid derivative is about 0.1. Stack 10 such layers and your gradient gets multiplied by $0.1^{10} = 10^{-10}$ . The gradient vanishes before it can teach early layers anything.

ReLU partially solved this by having a derivative of exactly 1 for positive inputs:

But ReLU introduces its own problem: dead neurons. If a ReLU neuron's pre-activation is negative for every input it sees, its gradient is zero on every example, so its weights stop updating and it tends to stay dead. And if many neurons die, gradient flow gets bottlenecked through the few survivors.

The Weight Matrix Multiplier Effect

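Even with perfect activations, weight matrices can destroy gradients. During backprop, gradients get multiplied by $W^T$ at each layer. How much each multiplication stretches or shrinks the gradient is set by the singular values of $W$ (for a symmetric $W$ , the magnitudes of its eigenvalues):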

The compound effect is what kills you. If each layer scales gradients by 0.8, after 50 layers you have $0.8^{50} \approx 10^{-5}$ . If each layer scales by 1.2, after 50 layers you have $1.2^{50} \approx 9000$ . The stable region is razor-thin.

Depth Makes Everything Worse

Here's the cruel irony: the deeper your network, the more powerful it could be, but also the more likely it is to suffer from gradient pathologies. The probability that gradients remain stable decreases exponentially with depth.

Consider this simple model: assume each layer independently scales gradients by a random factor $s_i$ drawn from some distribution. The total scaling is $\prod_i s_i$ . By the law of large numbers (in log space):

$\log\left(\prod_{i=1}^L s_i\right) = \sum_{i=1}^L \log(s_i) \approx L \cdot \mathbb{E}[\log(s)]$

If $\mathbb{E}[\log(s)] < 0$ (geometric-mean scaling less than 1), gradients vanish exponentially in $L$ . If $\mathbb{E}[\log(s)] > 0$ , they explode exponentially. Only if $\mathbb{E}[\log(s)] = 0$ exactly do you have a chance, and even then, variance accumulates.

The early 2010s were spent discovering how to navigate these problems. The solutions aren't band-aids; they're fundamental design principles that enable depth. Here's how they work.

Initialization Cures: Starting at the Right Scale

Here's an insight that took a long time for researchers to appreciate: the fate of your training is often sealed before you take your first gradient step. Initialize your weights too large and activations explode, causing gradients to vanish (for sigmoid/tanh) or explode (for linear regions). Initialize too small and signals decay to nothing as they propagate forward, leaving no gradient to propagate back.

The key idea is that you can analytically derive the correct initialization scale by requiring that variance stays constant as signals propagate through the network. This constraint gives you a precise formula for how to initialize weights based on layer width and activation function.

Xavier/Glorot Initialization: Preserving Variance

Xavier Glorot and Yoshua Bengio derived the first principled initialization in 2010. Their insight: for signals to neither explode nor vanish, the variance of activations should remain constant across layers. The paper is listed in the references below.

Consider a linear layer $y = Wx$ where $x$ has $n_{in}$ components. If we initialize $W$ with variance $\text{Var}(w)$ and assume $x$ has unit variance with zero mean:

$\text{Var}(y_i) = \text{Var}\left(\sum_{j=1}^{n_{in}} w_{ij} x_j\right) = n_{in} \cdot \text{Var}(w) \cdot \text{Var}(x)$

For unit variance preservation: $\text{Var}(y) = 1$ requires $\text{Var}(w) = 1/n_{in}$ .

But wait: during backprop, gradients flow backward through $W^T$ . For gradient variance preservation, you'd want $\text{Var}(w) = 1/n_{out}$ . Xavier initialization takes the harmonic mean:
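One common way to write this down, sketched here under the assumption of normally distributed weights: the variance is $2/(n_{in} + n_{out})$, which is the harmonic mean of $1/n_{in}$ and $1/n_{out}$. The function name `xavier_normal` and its `gain` argument are illustrative rather than taken from a specific library.

```python
import numpy as np

def xavier_normal(n_in, n_out, gain=1.0, rng=None):
    """Sample W with Var(w) = gain^2 * 2 / (n_in + n_out); W has shape (n_out, n_in)."""
    rng = np.random.default_rng() if rng is None else rng
    std = gain * np.sqrt(2.0 / (n_in + n_out))
    return rng.normal(0.0, std, size=(n_out, n_in))

rng = np.random.default_rng(0)
n_in, n_out = 512, 512
W = xavier_normal(n_in, n_out, rng=rng)

x = rng.standard_normal((n_in, 1000))   # unit-variance, zero-mean inputs
y = W @ x                               # forward signal
g = W.T @ rng.standard_normal((n_out, 1000))  # backward signal through W^T

print(f"Var(x)={x.var():.3f}  Var(y)={y.var():.3f}  Var(W^T g)={g.var():.3f}")
```

With equal fan-in and fan-out both directions are preserved; when the two differ, the formula is a compromise and neither direction is preserved exactly.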

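The gain parameter accounts for the activation function. For linear activations, gain=1. For tanh, gain=1.0 approximately preserves variance because tanh is close to linear near zero (some libraries recommend a slightly larger value, such as the 5/3 used by PyTorch). For sigmoid, gain=1.0 isn't quite right but close enough.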

He/Kaiming Initialization: Accounting for ReLU

Xavier initialization assumes linear activations or activations that preserve variance. ReLU breaks this assumption by zeroing out negative values, which halves the second moment of the signal that the next layer sees:

$\mathbb{E}\left[\text{ReLU}(x)^2\right] = \frac{1}{2} \text{Var}(x) \quad \text{(for zero-mean, symmetric } x \text{)}$

Kaiming He realized this in 2015 and derived the correct initialization for ReLU networks:
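A small sketch of the idea, assuming normally distributed weights and plain dense ReLU layers with no biases: He initialization uses $\text{Var}(w) = 2/n_{in}$. The comparison function `fan_in_normal` (with $\text{Var}(w) = 1/n_{in}$) and the helper `signal_after_relu_stack` are hypothetical names introduced only for this experiment.

```python
import numpy as np

def he_normal(n_in, n_out, rng):
    """He/Kaiming scaling: Var(w) = 2 / n_in."""
    return rng.normal(0.0, np.sqrt(2.0 / n_in), size=(n_out, n_in))

def fan_in_normal(n_in, n_out, rng):
    """Var(w) = 1 / n_in, the forward-pass condition for linear layers."""
    return rng.normal(0.0, np.sqrt(1.0 / n_in), size=(n_out, n_in))

def signal_after_relu_stack(init_fn, depth=20, width=512, batch=256, seed=0):
    """Mean squared activation after `depth` dense ReLU layers."""
    rng = np.random.default_rng(seed)
    h = rng.standard_normal((width, batch))
    for _ in range(depth):
        h = np.maximum(0.0, init_fn(width, width, rng) @ h)
    return np.mean(h ** 2)

he_signal = signal_after_relu_stack(he_normal)
plain_signal = signal_after_relu_stack(fan_in_normal)
print(f"He init:          mean squared activation = {he_signal:.3e}")
print(f"1/n_in scaling:   mean squared activation = {plain_signal:.3e}")
```

With the factor of 2 the signal stays on the order of 1 after 20 layers; without it the signal shrinks by roughly a factor of 2 per layer.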

The key insight: ReLU's rectification requires a $\sqrt{2}$ correction factor on the weight standard deviation to maintain signal scale. Without it, the mean squared activation halves at every layer, and deep ReLU networks become untrainable.

LSUV: Data-Driven Initialization

Layer-Sequential Unit-Variance (LSUV) initialization takes an empirical approach: run actual data through your network and adjust weights to achieve unit variance empirically:
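The procedure can be sketched as follows. This is a simplified version for dense ReLU layers without biases, not the reference implementation: it starts each layer from an orthonormal matrix, then rescales the weights until the layer's output variance on a real batch is within `tol` of 1. The function name `lsuv_init` and its arguments are assumptions made for illustration.

```python
import numpy as np

def lsuv_init(layer_sizes, x, tol=0.01, max_iter=10, seed=0):
    """Return a list of weight matrices, scaled layer by layer on the batch x.

    x has shape (layer_sizes[0], batch).
    """
    rng = np.random.default_rng(seed)
    weights = []
    h = x
    for n_in, n_out in zip(layer_sizes[:-1], layer_sizes[1:]):
        # Orthonormal starting point via QR of a random tall matrix.
        a = rng.standard_normal((max(n_in, n_out), min(n_in, n_out)))
        q, _r = np.linalg.qr(a)
        W = q if n_out >= n_in else q.T          # shape (n_out, n_in)
        for _ in range(max_iter):
            var = np.var(W @ h)                  # measured on actual data
            if abs(var - 1.0) < tol:
                break
            W = W / np.sqrt(var)                 # rescale toward unit variance
        weights.append(W)
        h = np.maximum(0.0, W @ h)               # feed the next layer
    return weights

rng = np.random.default_rng(1)
data = 5.0 * rng.standard_normal((64, 512))      # deliberately badly scaled input
weights = lsuv_init([64, 128, 128, 32], data)

h = data
for i, W in enumerate(weights):
    z = W @ h
    print(f"layer {i}: output variance = {z.var():.3f}")
    h = np.maximum(0.0, z)
```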

LSUV has two advantages: it works for any activation function without analytical derivation, and it accounts for your actual data distribution rather than theoretical assumptions.

The Variance Lens

All these methods share the same principle: control variance propagation. You can think of your network as a variance amplifier. Each layer has a "gain":

Gain > 1: Variance grows, eventual explosion
Gain < 1: Variance shrinks, eventual vanishing
Gain = 1: Variance stable, gradients flow

Initialization sets these gains to 1 at the start of training. As training proceeds, the network will adjust them, but starting near 1 gives optimization a fighting chance.

The right initialization doesn't guarantee success, but the wrong initialization guarantees failure. It's the difference between starting your optimization at base camp versus starting in a crevasse. Get this right, and half your gradient problems disappear before training even begins.

Normalization and Residuals: Controlling the Flow

Initialization gets you started, but it doesn't keep you stable. As training progresses, weight updates change the statistics of your activations. What started as unit variance might drift toward zero or infinity. Even worse, different training examples might push the statistics in opposite directions, creating internal covariate shift that makes optimization harder.

The two most transformative architectural innovations of the 2010s both address gradient flow: batch normalization forces activations back to a standard distribution, while residual connections provide gradient highways that bypass troublesome transformations entirely. Together, they made 100+ layer networks not just possible but routine.

Batch Normalization: Forcing Statistical Discipline

Batch normalization, introduced by Sergey Ioffe and Christian Szegedy in 2015, is simple in form: normalize activations to zero mean and unit variance, then allow the network to learn a different mean and variance if needed. This simple change has large effects on gradient flow.

The core problem BatchNorm addresses is internal covariate shift. Even with perfect initialization, as soon as you start training, weight updates in earlier layers change the distribution of inputs to later layers. A layer that was expecting inputs centered around zero might suddenly receive inputs centered around five. The layer has to constantly re-adapt to these shifting input distributions, slowing down training and making it harder for gradients to find consistent directions.

Worse, this shift compounds across layers. A small change in layer 1 propagates through layer 2, gets amplified by layer 3, and by layer 10 the activations might be completely different from what they were a few gradient steps ago. Each layer is trying to hit a moving target.

BatchNorm breaks this cycle by forcing every layer's inputs back to a standard distribution after each transformation. It's like repeatedly resetting the problem to a known state, so each layer can learn without having to constantly adapt to distribution drift.

The mechanism is straightforward. During the forward pass, for each mini-batch:
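A minimal NumPy sketch of the training-time forward pass, assuming inputs of shape `(batch, features)`. The names `x_norm`, `gamma`, `beta`, `var`, and `eps` follow the discussion in this section; the function name and the `cache` tuple are illustrative.

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    """Training-mode BatchNorm over the batch axis (axis 0)."""
    mean = x.mean(axis=0)                      # 1. per-feature batch mean
    var = x.var(axis=0)                        # 2. per-feature batch variance
    x_norm = (x - mean) / np.sqrt(var + eps)   # 3. normalize
    out = gamma * x_norm + beta                # learnable scale and shift
    cache = (x, x_norm, mean, var, gamma, eps) # saved for the backward pass
    return out, cache

rng = np.random.default_rng(0)
x = 3.0 * rng.standard_normal((32, 4)) + 5.0   # mean 5, std 3: far from standard
out, cache = batchnorm_forward(x, gamma=np.ones(4), beta=np.zeros(4))
print("input  mean/var:", x.mean(axis=0).round(2), x.var(axis=0).round(2))
print("output mean/var:", out.mean(axis=0).round(2), out.var(axis=0).round(2))
```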

Look at what's happening here. First, we compute the mean and variance across the batch: these are the current statistics of our activations. Then we normalize: subtract the mean and divide by the standard deviation. This forces the normalized activations (x_norm) to have mean zero and variance one, regardless of what the input distribution looked like.

But here's the crucial insight: we don't stop there. We immediately apply a learnable affine transformation with parameters gamma (scale) and beta (shift).

But wait, doesn't this seem contradictory? We just spent three operations forcing everything to mean zero and variance one. And now we're going to scale and shift it again with learnable parameters? If the network can learn gamma and beta to completely undo the normalization (just set gamma = sqrt(var) and beta = mean), haven't we just wasted our time? We'd end up right back where we started.

This apparent paradox is the key to understanding why BatchNorm works. Without gamma and beta, we'd be forcing every layer to output zero-mean, unit-variance activations. But what if that's not optimal? What if a particular layer needs activations with mean 10 and variance 100 to properly represent its features? Or what if the optimal representation needs mostly negative values, but we're forcing them toward zero?

Early experiments tried exactly this: fixed normalization with no learnable parameters. The networks trained faster initially (controlled distributions helped gradients), but final accuracy was consistently worse than unnormalized networks. The distributions were stable, but they were the wrong distributions. Optimization improved, but representation suffered.

The key with gamma and beta is that they give the network full representational freedom while changing the optimization landscape. In other words, it's not about forcing a specific distribution. It's about decoupling the statistics of a layer's activations from the statistics of its inputs. Without BatchNorm, changing the weights changes both what the layer computes AND the distribution of its outputs. With BatchNorm, the network can learn the scale and shift independently through gamma and beta, while the normalization ensures the optimization landscape stays smooth.

Think of it this way: without BatchNorm, to increase the scale of activations, the network has to carefully orchestrate changes across dozens of weights where each weight adjustment affects both the mean AND the variance in complex, coupled ways. With BatchNorm, the network can just increase gamma. Want to shift activations more positive? Adjust beta. The normalization step ensures that these adjustments follow smooth, predictable gradients.

It's a reparameterization trick, like changing coordinate systems in optimization. The network can represent the same functions either way, but with BatchNorm, the path to those functions is straighter. Imagine trying to navigate a city on a map with curvy, winding streets versus a map with a clean grid. You can reach the same destinations, but the grid makes navigation easier. That's what BatchNorm does to the loss landscape.

It combines both: the representational power of arbitrary distributions, with the optimization benefits of normalized inputs.

The effect on gradient flow is immediate. Normalized activations stay in a range where gradients are neither too large nor too small. Non-linearities like sigmoid and tanh don't saturate as easily. ReLU units see a balanced mix of positive and negative inputs, so fewer of them get stuck at zero. The gradient signal stays strong and stable across layers.

The backward pass is where it gets even better. Because every sample in the batch contributes to the mean and variance, gradients couple across the batch:
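A sketch of that backward pass in NumPy, again assuming inputs of shape `(batch, features)`. The names `grad_x_norm`, `grad_var`, `grad_mean`, and `grad_x` follow the discussion in this section; the compact forward function is included only so the block runs on its own.

```python
import numpy as np

def batchnorm_forward(x, gamma, beta, eps=1e-5):
    mean, var = x.mean(axis=0), x.var(axis=0)
    x_norm = (x - mean) / np.sqrt(var + eps)
    return gamma * x_norm + beta, (x, x_norm, mean, var, gamma, eps)

def batchnorm_backward(grad_out, cache):
    """VJP of training-mode BatchNorm: returns grads for x, gamma, beta."""
    x, x_norm, mean, var, gamma, eps = cache
    n = x.shape[0]

    grad_gamma = np.sum(grad_out * x_norm, axis=0)
    grad_beta = np.sum(grad_out, axis=0)
    grad_x_norm = grad_out * gamma

    # Both statistics depend on every sample, so these sum over the batch.
    grad_var = np.sum(grad_x_norm * (x - mean) * -0.5 * (var + eps) ** -1.5, axis=0)
    grad_mean = (np.sum(-grad_x_norm / np.sqrt(var + eps), axis=0)
                 + grad_var * np.mean(-2.0 * (x - mean), axis=0))

    # Direct path plus the two paths through the batch statistics.
    grad_x = (grad_x_norm / np.sqrt(var + eps)
              + grad_var * 2.0 * (x - mean) / n
              + grad_mean / n)
    return grad_x, grad_gamma, grad_beta

rng = np.random.default_rng(0)
x = 2.0 * rng.standard_normal((8, 3)) + 1.0
gamma, beta = rng.standard_normal(3), rng.standard_normal(3)
out, cache = batchnorm_forward(x, gamma, beta)
grad_x, grad_gamma, grad_beta = batchnorm_backward(rng.standard_normal((8, 3)), cache)
# Shifting every sample by the same amount leaves the output unchanged,
# so the per-feature input gradients sum to (numerically) zero over the batch.
print("sum of grad_x over batch:", grad_x.sum(axis=0))
```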

This backward pass makes the coupling explicit. Look at the lines computing grad_mean and grad_var: both sum over all samples. Then, when we compute grad_x, every sample's gradient gets a contribution from these batch statistics.

What does this mean in practice? When computing the gradient for sample $i$ , we don't just consider how $x_i$ affects the loss. We also consider how $x_i$ affects the batch mean and variance, which in turn affect the normalized values of all other samples. It's a form of implicit regularization: outlier gradients get smoothed by the batch statistics.

Think about what happens to an outlier sample. If one sample in the batch has an extreme gradient, it will try to push the mean and variance in an extreme direction. But that change affects all the other samples in the batch, spreading the gradient signal. The network effectively gets feedback that says "this gradient would affect the entire batch distribution, not just this one sample." This smoothing effect helps prevent overfitting to individual examples.

The coupling also helps with gradient scale. Remember our vanishing/exploding gradient problems? BatchNorm's backward pass has built-in gradient normalization. The grad_x_norm / np.sqrt(var + eps) term means gradients get scaled by the standard deviation of the activations. Layers with high variance activations automatically get smaller gradients; layers with low variance get larger gradients. The network self-regulates its gradient flow.

But there's a cost: training behavior now depends on batch size. A batch of 32 samples has different statistics than a batch of 256. The batch statistics are noisier with small batches and more stable with large ones. This is why models trained with BatchNorm sometimes behave differently at different batch sizes, and why we need separate inference-time statistics (typically moving averages computed during training).

Layer Normalization: When Batches Don't Work

BatchNorm, however, has a fatal weakness: it requires batch statistics. With small batches, these statistics are noisy. With batch size 1 (common at inference), the batch variance is zero, so there is nothing meaningful to normalize by. For sequence models where different samples have different lengths, batch statistics don't even make sense.

Layer normalization, introduced by Jimmy Ba and others in 2016, normalizes across features instead of across batch:
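In code, the only change from BatchNorm's forward pass is the axis of the statistics. A minimal sketch, assuming inputs whose last axis holds the features; the function name is illustrative.

```python
import numpy as np

def layernorm_forward(x, gamma, beta, eps=1e-5):
    """Normalize each sample over its own features (last axis)."""
    mean = x.mean(axis=-1, keepdims=True)   # one mean per sample
    var = x.var(axis=-1, keepdims=True)     # one variance per sample
    x_norm = (x - mean) / np.sqrt(var + eps)
    return gamma * x_norm + beta

rng = np.random.default_rng(0)
x = rng.standard_normal((4, 6))
gamma, beta = np.ones(6), np.zeros(6)

out_batch = layernorm_forward(x, gamma, beta)
out_single = layernorm_forward(x[:1], gamma, beta)   # same sample, batch size 1

# The first sample's output does not depend on what else is in the batch.
print(np.allclose(out_batch[:1], out_single))
```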

LayerNorm doesn't couple samples, making it perfect for:

Small batch training
Recurrent networks (different positions in sequence)
Attention mechanisms (each position normalized independently)

The trade-off: you lose BatchNorm's implicit regularization through batch coupling. But you gain stability and consistency across different batch sizes and sequence positions.

Residual Connections: The Gradient Highway

Before 2015, there was a paradox in deep learning: deeper networks should be at least as good as shallow ones. A deep network could always learn to make its extra layers into identity mappings, reducing it to an equivalent shallow network. In theory, adding layers should never hurt performance.

In practice, it hurt dramatically. Networks with 20+ layers often performed worse than shallower networks, even on the training set. This wasn't overfitting (which would only affect test performance). This was a fundamental optimization failure: gradient descent couldn't even find the trivial solution of making the extra layers do nothing.

The problem is that learning the identity function is actually hard for stacked nonlinear layers. If you want a layer to be an identity mapping, its weight matrix has to equal the identity matrix (for the linear part) and the nonlinearity needs to not interfere. Initializing weights to zero doesn't help: it produces zero outputs and dead gradients. Random initialization? The network starts far from identity and has to discover it through optimization. Gradient descent struggles with this.

Residual connections, introduced by Kaiming He and colleagues in 2015, flipped the problem on its head. Instead of asking layers to learn a transformation $H(x)$ , ask them to learn the residual $F(x) = H(x) - x$ . The key insight: learning to output zero is easy. Just set weights to zero and you're done. The network now starts close to identity and only needs to learn deviations from it.

This simple addition (literally $y = x + F(x)$) changed the default behavior to identity. The network learns to add refinements only when they improve the loss. Early in training, before $F$ learns anything useful, the network defaults to passing $x$ through unchanged. This avoids degradation and optimization failure.

But the real transformation is what happens during backpropagation. Gradients flow through two paths:
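A minimal sketch of a residual block and its backward pass, assuming a deliberately simple residual branch $F(x) = \text{ReLU}(Wx)$. The names `grad_out`, `grad_through_skip`, and `grad_through_F_computed` follow the discussion in this section; the helper functions are illustrative.

```python
import numpy as np

def F_forward(x, W):
    """Residual branch: a single dense layer followed by ReLU."""
    z = W @ x
    return np.maximum(0.0, z), z

def F_backward(grad_out, W, z):
    """VJP of the residual branch: mask by the ReLU, then multiply by W^T."""
    return W.T @ (grad_out * (z > 0))

def residual_forward(x, W):
    f, z = F_forward(x, W)
    return x + f, z                      # y = x + F(x)

def residual_backward(grad_out, W, z):
    grad_through_skip = grad_out                          # identity path
    grad_through_F_computed = F_backward(grad_out, W, z)  # path through F
    return grad_through_skip + grad_through_F_computed

x = np.array([1.0, 2.0, 3.0])
grad_out = np.array([0.5, -1.0, 2.0])

# A completely dead branch: every pre-activation is negative.
W_dead = -np.eye(3)
y, z = residual_forward(x, W_dead)
grad_x = residual_backward(grad_out, W_dead, z)
print("y      =", y)        # equals x: the block acts as the identity
print("grad_x =", grad_x)   # equals grad_out: the skip path carries it through
```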

Look at what this means. The gradient from the loss splits and takes two routes back:

The skip path: grad_out flows directly through with zero transformation. No matrix multiplication, no nonlinearity, no attenuation. It's as if the layer doesn't exist.
The residual path: grad_out flows through the backward pass of $F$ , which might attenuate it, amplify it, or zero it out completely.

Then we add them: grad_through_skip + grad_through_F_computed. The skip path is your insurance policy. Even if $F$ is completely dead (all ReLUs negative, weights near zero, activations saturated), you still get grad_out flowing backward. A dead $F$ can no longer make the gradient vanish.

Compare this to a regular deep network. After 100 layers, your gradient has been transformed 100 times. If each layer attenuates by even 0.95×, you're left with $0.95^{100} \approx 0.006$ of your original gradient. With residuals? You still have the full gradient flowing through the skip connections, with the residual paths adding refinements on top.

But there's a deeper insight here. The gradient through a residual block is: $\frac{\partial L}{\partial x} = \frac{\partial L}{\partial y} \cdot \left(I + \frac{\partial F}{\partial x}\right)$

The identity matrix $I$ guarantees that the original gradient always has a direct route back. The Jacobian of $F$ adds a second term on top of it, and it could block the flow only by cancelling the identity exactly ($\frac{\partial F}{\partial x} \approx -I$), which does not happen when $F$ is small or dead. This is fundamentally different from a regular layer where the gradient is: $\frac{\partial L}{\partial x} = \frac{\partial L}{\partial y} \cdot \frac{\partial F}{\partial x}$

Notice what's missing: the identity term. In the regular case, if $\frac{\partial F}{\partial x}$ is small or has small singular values, your gradient vanishes. In the residual case, even if $\frac{\partial F}{\partial x} = 0$ (completely dead layer), you still have $I$ carrying the gradient through.
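To see the two formulas side by side, the sketch below pushes the same unit gradient through 100 layers, once multiplying by $\frac{\partial F}{\partial x}$ alone and once by $I + \frac{\partial F}{\partial x}$. The Jacobians are small random matrices, which is an assumption made for illustration and not a model of any trained network.

```python
import numpy as np

def chained_gradient_norms(depth=100, width=32, jac_scale=0.1, seed=0):
    """Return (plain_norm, residual_norm) after `depth` backward steps."""
    rng = np.random.default_rng(seed)
    grad_plain = rng.standard_normal(width)
    grad_plain /= np.linalg.norm(grad_plain)
    grad_res = grad_plain.copy()
    identity = np.eye(width)
    for _ in range(depth):
        # Random stand-in for dF/dx with typical gain of about jac_scale.
        jac = jac_scale * rng.standard_normal((width, width)) / np.sqrt(width)
        grad_plain = grad_plain @ jac               # regular layer
        grad_res = grad_res @ (identity + jac)      # residual block
    return np.linalg.norm(grad_plain), np.linalg.norm(grad_res)

plain_norm, res_norm = chained_gradient_norms()
print(f"plain stack:    |grad| = {plain_norm:.2e}")
print(f"residual stack: |grad| = {res_norm:.2e}")
```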

This changes the optimization landscape fundamentally. Gradients in early layers now have roughly the same magnitude as gradients in late layers. The network can learn features at all depths simultaneously, rather than having to train the late layers first and slowly propagate learning backward.

The Effective Depth Perspective

ResNets don't just make networks trainable at depth; they change what depth means. Andreas Veit and others showed that ResNets can be viewed as ensembles of shallow networks. Each possible path through the skip connections represents a different shallow network:

This is a radical reframing. A 100-layer ResNet isn't really a 100-layer network. It's an ensemble of $2^{100}$ networks of varying depths, all sharing weights. The deepest path goes through all 100 blocks. The shallowest path is just the identity. Most paths are somewhere in the middle.

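Let's make this concrete. When you compute the output of a 3-block ResNet, where each block computes $x_i = x_{i-1} + F_i(x_{i-1})$, you can unroll it and then expand it into paths. The second line is exact when every $F_i$ is linear and is otherwise a way of enumerating the paths:

\begin{align} y &= x + F_1(x) + F_2(x + F_1(x)) + F_3\big(x + F_1(x) + F_2(x + F_1(x))\big) \\ &\approx x + F_1(x) + F_2(x) + F_3(x) + F_2(F_1(x)) + F_3(F_1(x)) + F_3(F_2(x)) + F_3(F_2(F_1(x))) \end{align}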

Each term in this expansion represents a different path through the network. The first term $x$ is the path that skips everything. The term $F_3(F_2(F_1(x)))$ is the path through all three blocks. Terms like $F_2(x)$ represent paths that use only some blocks.

The distribution of path lengths follows a binomial distribution. In an $n$ -block ResNet, the number of paths with exactly $k$ blocks is $\binom{n}{k}$ . Most paths have length around $n/2$ . Very short paths (length 0 or 1) and very long paths (length $n-1$ or $n$ ) are exponentially rare. The network is dominated by medium-length paths.
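The path view can be checked numerically in the one case where the expansion is exact: linear residual branches. The sketch below assumes $F_i(x) = A_i x$ with small random matrices $A_i$, sums the contribution of every subset of blocks, and compares the result with the ordinary block-by-block forward pass.

```python
import itertools
import math
import numpy as np

rng = np.random.default_rng(0)
n_blocks, width = 3, 4
blocks = [0.1 * rng.standard_normal((width, width)) for _ in range(n_blocks)]
x = rng.standard_normal(width)

# Ordinary forward pass: x <- x + F_i(x), one block at a time.
y_blocks = x.copy()
for A in blocks:
    y_blocks = y_blocks + A @ y_blocks

# Path view: each subset of blocks is one path; skipped blocks act as identity.
y_paths = np.zeros(width)
for k in range(n_blocks + 1):
    for subset in itertools.combinations(range(n_blocks), k):
        term = x.copy()
        for i in subset:                 # apply the chosen blocks in order
            term = blocks[i] @ term
        y_paths += term

path_counts = [math.comb(n_blocks, k) for k in range(n_blocks + 1)]
print("paths agree with blocks:", np.allclose(y_blocks, y_paths))
print("paths per length:", path_counts, "total:", sum(path_counts))
```

For three blocks this gives 1, 3, 3, 1 paths of length 0 through 3, which is the binomial pattern described above.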

Here's the crucial insight about gradients: during backpropagation, shorter paths receive stronger gradients. A path through 10 layers has its gradient attenuated less than a path through 100 layers (even with skip connections helping). So short paths train first. Early in training, the network behaves like a shallow network, which is easy to optimize.

Why does this happen? Consider the gradient magnitude for a path through $k$ residual blocks. Even with skip connections, each block's Jacobian $\frac{\partial F_i}{\partial x}$ contributes some multiplicative factor to the gradient. If the average spectral norm of these Jacobians is $\rho$ , then the gradient through a $k$ -block path is roughly attenuated by $\rho^k$ .

For short paths, $k$ is small, so $\rho^k \approx 1$. The gradient flows with nearly full strength. For long paths, even if $\rho = 0.9$ (pretty good!), you have $0.9^{50} \approx 0.005$. A 50-block path's gradient is already about 200 times weaker than the shortest path's, and the full 100-block path is weaker still.

This creates a natural curriculum during training. In the first few epochs, when gradients are strongest, the network is effectively shallow: maybe 5-10 blocks deep in terms of which paths dominate the gradient signal. These shallow paths learn the coarse features: edges, basic shapes, rough semantic categories.

As these paths train and the loss decreases, their gradients weaken (the loss surface flattens near the minimum). Now medium-length paths (maybe 20-30 blocks) start to contribute more to the gradient. These paths learn more refined features, building on the foundation the short paths established.

Eventually, even the longest paths, through all 100 blocks, start training. But by now, most of their constituent blocks are already doing something useful (learned by shorter paths). The long paths don't have to learn from scratch; they're fine-tuning an already-functional network.

As these short paths start to work, they provide a stable foundation. The longer paths can then train on top of this foundation. A path through 50 layers now has the first 25 layers already partially trained by shorter paths that used them. The effective depth increases gradually during training, rather than being fixed from the start.

This ensemble perspective explains several mysteries: why ResNets train stably even at 1000+ layers (short paths bootstrap the long ones), why you can delete random layers at test time with minimal performance loss (ensemble redundancy), and why the loss landscape is so much smoother (averaging over $2^n$ computational graphs instead of optimizing just one). The "depth" of a ResNet is fluid: the network uses whatever effective depth it needs for the current training stage, starting shallow and progressively deepening. It's implicit curriculum learning baked into the architecture.

Normalization + Residuals: A Useful Pair

A key improvement combined normalization with residuals. The standard recipe:

This combination addresses both problems:

Normalization keeps activations in healthy ranges
Residuals ensure gradients flow even if normalization or F fails

The pre-norm vs post-norm debate raged for years:
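The two variants differ only in where the normalization sits relative to the skip connection. A minimal sketch, assuming a parameter-free layer normalization and an arbitrary residual branch `F` passed in as a function; the block names are illustrative.

```python
import numpy as np

def layer_norm(x, eps=1e-5):
    """Parameter-free normalization over the last axis."""
    mean = x.mean(axis=-1, keepdims=True)
    var = x.var(axis=-1, keepdims=True)
    return (x - mean) / np.sqrt(var + eps)

def pre_norm_block(x, F):
    """Pre-norm: normalize inside the branch, leave the skip path untouched."""
    return x + F(layer_norm(x))

def post_norm_block(x, F):
    """Post-norm: normalize after the addition, so the skip path passes through it."""
    return layer_norm(x + F(x))

def dead_branch(h):
    """A residual branch that contributes nothing."""
    return np.zeros_like(h)

rng = np.random.default_rng(0)
x = 3.0 * rng.standard_normal((2, 8)) + 1.0

pre_out = pre_norm_block(x, dead_branch)
post_out = post_norm_block(x, dead_branch)
print("pre-norm returns x unchanged: ", np.array_equal(pre_out, x))
print("post-norm returns x unchanged:", np.allclose(post_out, x))
```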

Pre-norm won because it provides cleaner gradient paths. The skip connection is completely unobstructed, while post-norm forces gradients through the normalization.

These techniques transformed depth from a liability into an asset. Networks with hundreds of layers became routine. The gradient pathologies that plagued early deep learning were largely solved. But sometimes, even with all these techniques, gradients still explode. That's where our last line of defense comes in.

Gradient Clipping and Scale Management

Even with perfect initialization, normalization, and residual connections, gradients can still explode. A single unfortunate batch, an outlier example, or accumulated numerical errors can send gradients to infinity. When this happens, a single update can destroy weeks of training. Gradient clipping is your safety net: a simple but essential technique that caps gradient magnitude before it causes damage.

Of course, clipping is a last-line safety valve to prevent rare explosions from wrecking training. Most of the detailed mechanics (global-norm vs value clipping, layer-wise strategies like LARS/LAMB, monitoring, and diagnostics) are already covered in the Gradient Descent post. Here is the distilled interface-level guidance for backprop:

Use global norm clipping sparingly as insurance. Frequent clipping signals deeper issues (too high LR, bad init, missing residuals/norms, outlier batches).
Prefer fixing root causes over raising clip thresholds. Clipping preserves direction but caps magnitude; it can’t repair unstable dynamics.
Monitor gradient norms and clip rate; treat persistent spikes as a debugging breadcrumb trail, not a feature.

In practice: keep clipping enabled, but aim to make it unnecessary by design (good initialization, normalization, residual paths, and sane learning rates).
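For reference, global-norm clipping itself is only a few lines. The sketch below is a framework-free NumPy version; the function name `clip_by_global_norm` and the returned `was_clipped` flag (useful for tracking the clip rate) are illustrative choices, not a specific library's API.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm, eps=1e-12):
    """Rescale all gradients together so their joint L2 norm is at most max_norm."""
    global_norm = np.sqrt(sum(float(np.sum(g ** 2)) for g in grads))
    scale = min(1.0, max_norm / (global_norm + eps))
    clipped = [g * scale for g in grads]     # same direction, capped magnitude
    return clipped, global_norm, scale < 1.0

grads = [np.array([3.0, 0.0]), np.array([[0.0], [4.0]])]   # joint norm is 5
clipped, norm_before, was_clipped = clip_by_global_norm(grads, max_norm=1.0)
norm_after = np.sqrt(sum(float(np.sum(g ** 2)) for g in clipped))
print(f"norm before = {norm_before:.2f}, after = {norm_after:.2f}, clipped = {was_clipped}")
```

Logging `norm_before` and the fraction of steps where `was_clipped` is true gives the gradient-norm and clip-rate signals mentioned above.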

Conclusion for Part 2: The Pattern That Scales

At first, the list of gradient formulas can feel long: one for BatchNorm, another for convolutions, another for attention. It can read like a reference of rules rather than a single idea. The consolidation is simple: there is one pattern. Every gradient computation, from the simplest linear layer to the most complex attention mechanism, is a vector Jacobian product. Backprop never builds those massive Jacobian matrices we feared from calculus class. It computes what those matrices would do to vectors directly and efficiently through the same multiply and accumulate pattern everywhere.

Here's the unified view we've built:

VJPs are the engine: Every operation knows how to pull gradients backward through its Jacobian without forming it
Transposes aren't arbitrary: They're how linear maps reverse in dual space, the mathematical structure of backward flow
Shape discipline is non-negotiable: Gradients match parameter shapes because they live in the same dimensional space
Recipes are patterns, not rules: Linear ops use outer products, reductions distribute evenly, routing operations (pool, attention) direct gradients selectively
Time is just another dimension: BPTT is backprop on an unrolled graph with weight sharing, nothing fundamentally new

The architectural innovations that improved deep learning all address the same problem: getting gradients to flow. Careful initialization (Xavier, He) starts gradients at the right scale. Normalization (BatchNorm, LayerNorm) keeps them there during training. Residual connections provide paths that bypass problematic transformations entirely. These are not independent heuristics; they are complementary solutions to the multiplicative nature of gradient flow.

We also saw how backprop interfaces with the broader training pipeline. The optimizer consumes gradients but doesn't produce them. Gradient clipping protects against rare explosions but shouldn't be constantly active. Memory-compute tradeoffs (truncated BPTT, checkpointing) let us handle sequences and depth within finite resources. These are engineering realities that turn mathematical backprop into practical learning systems.

A shift in perspective: backprop isn't mysterious or fragile. It's systematic bookkeeping with a single reusable pattern. When you encounter a new layer type, you don't need to memorize its gradient formula. Ask yourself: "How does a gradient on the output pull back to the inputs?" Write down that VJP. The chain rule and automatic differentiation handle the composition. The same mental model that works for a two-layer network scales to transformers with billions of parameters.

Where we go next: We have the pattern and the recipes, but real implementation requires engaging with frameworks, numerical precision, and system-level concerns. Modern autodiff systems handle the graph construction and backward execution we've been doing conceptually. Mixed precision training pushes the boundaries of numerical stability. Gradient-based interpretability uses the same gradients for understanding, not just learning.

Continue to Part 3 where we make it production-ready: framework APIs, custom gradients, numerical stability, systematic testing, memory optimization, and using gradients as interpretability tools.

Backpropagation Part 3: Systems, Stability, and Scale

References and Further Reading

Xavier Glorot & Yoshua Bengio, Understanding the difficulty of training deep feedforward neural networks
Kaiming He et al., Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification
Dmytro Mishkin & Jiri Matas, All you need is a good init
Sergey Ioffe & Christian Szegedy, Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift
Jimmy Lei Ba, Jamie Ryan Kiros & Geoffrey E. Hinton, Layer Normalization
Kaiming He et al., Deep Residual Learning for Image Recognition
Andreas Veit et al., Residual Networks Behave Like Ensembles of Relatively Shallow Networks