172 min readfoundation

Backpropagation Part 1: From Graphs to a Working MLP

Backprop computes a million gradients for the price of two forward passes. From computational graphs to adjoints, from chain rule to a working neural network, this is the algorithm that made deep learning possible; and is demystified here step by step.

You type loss.backward(). A few milliseconds later, every parameter in your network has a .grad attribute.

You've run this loop enough times to stop thinking about it. Forward pass, compute loss, backward, step. The loss drops. The model improves. Somewhere inside that backward call, every weight in the network learned exactly how much it contributed to the current mistake.

Now notice the timing. The backward pass takes roughly the same time as the forward pass. Not roughly the same time per parameter. Roughly the same time, period. Ten thousand parameters or ten billion, the backward pass costs about two to three times a single forward pass.

That ratio should bother you. A million weights, each needing its own gradient. The obvious way to get them: nudge each weight, measure the effect. That is a million forward passes. The backward pass does it in one.

Most people answer "chain rule" and leave it there. True, and it explains nothing about why one backward pass is enough.

The answer is bookkeeping. The forward pass, as it runs, quietly records the numbers the backward pass will need: what values arrived at each neuron, which neurons fired, what the outputs looked like before each activation function. The backward pass reads that ledger in reverse, applies one simple rule at each node, and accumulates every gradient in a single sweep.

Forward pass as ledger. Backward pass as settlement. That is the whole trick.

Once it clicks, other things start making sense. Why training costs so much more memory than inference. Why certain activation functions matter. Why initialization is not arbitrary. Why some architectures are harder to train than others. These are all gradient-flow questions, and backprop is where gradient flow lives.

The Gradient Bottleneck

A neural network with a million parameters is a million knobs. Training it means answering, at every single training step, the same question a million times: should I nudge this knob up or down, and by how much?

You can picture the obvious way to answer. Pick weight #1. Bump it up by a hair. Run the whole network forward, and check whether the loss got better or worse. Reset. Move on to weight #2. Then weight #3. Then weight #1,000,000.

That instinct is roughly correct, and it is also roughly impossible to execute. Suppose one forward pass on your million-parameter network takes 100 milliseconds, which is fast by modern standards. One pass per weight, a million weights, is 100,000 seconds of compute. That's 27 hours just to decide how to take one training step. And training needs millions of such steps.

In the 1960s and 70s, researchers knew exactly what they wanted to compute: the derivative of the loss with respect to every parameter.

The blueprint was clear. The execution was out of reach, because the naive method scaled linearly with the parameter count. Like knowing exactly how to build a skyscraper while owning only a teaspoon to move dirt.

Then in 1986, Rumelhart, Hinton, and Williams showed the field something that had been hiding in plain sight. Not a new learning algorithm, but a computational trick borrowed from other fields: you could compute all those millions of gradients in roughly the same time it takes to compute just one.

Here is the shape of the trick. Run one forward pass through the network and compute the loss the usual way, except pause to save a few intermediate values at each neuron along the way. Then run one backward pass that walks the network in reverse. At every neuron, the backward pass knows two things: the final loss, and those cached intermediates from the forward pass. That's enough to work out exactly how much each weight contributed to the loss. Every weight. All at once. One sweep.

Cost of that backward pass? About the same as the forward pass. So instead of a million forward passes to figure out one training step, you run one forward pass and one backward pass. That is the whole deal.

And this was not even a new algorithm. Reverse-mode automatic differentiation, to give it its formal name, had already been around for roughly a decade in control theory and numerical analysis. The AI community had been looking in the wrong places. They were chasing brain-inspired solutions when the answer was a clever accounting trick from applied mathematics.

Training times collapsed by orders of magnitude. Days instead of centuries.

The Gradient: Your Local Compass

Stand on a hillside in thick fog. You cannot see the valley. You cannot see the peak. All you have is the slope under your feet: which way tilts down, how steeply. Take a step downhill. Feel again. Repeat.

That is gradient descent, in two dimensions, with a foot.

A neural network plays the same game, except the terrain has as many dimensions as the network has parameters. Each weight is its own axis on that surface. Your foot gave one slope reading; here you need a reading along every axis at once. That full set of slopes is the gradient. Computing it efficiently is the problem backprop solves.

(If the gradient descent post already made this obvious, feel free to skim.)

What a gradient actually is

One question, asked once per parameter: "If I nudge this weight by a tiny amount, how much does the loss change?"

The answer has two parts:

  • Sign: does the loss go up or down when you increase this weight?
  • Magnitude: by how much?

Say you have a toy network with ten weights. Its gradient is ten numbers, one per weight:

  • Weight 1: 0.3-0.3. Negative, so increasing this weight decreases the loss.
  • Weight 2: +0.8+0.8. Positive, so decreasing this weight helps. And because +0.8>0.3|+0.8| > |-0.3|, weight 2 matters more right now.
  • Weight 3: 0.001-0.001. Same sign as weight 1, but the slope is nearly flat. Along for the ride.
  • ... and so on, one entry per weight.

Scale that up. A million-parameter network needs a million gradient values. GPT-3 with 175 billion parameters? 175 billion gradient values, recomputed from scratch at every training step.

In symbols: if your parameters are θ=[w1,w2,,w1,000,000]\theta = [w_1, w_2, \dots, w_{1{,}000{,}000}], the gradient of the loss LL is the vector

L=[Lw1, Lw2, , Lw1,000,000].\nabla L = \left[\frac{\partial L}{\partial w_1},\ \frac{\partial L}{\partial w_2},\ \dots,\ \frac{\partial L}{\partial w_{1{,}000{,}000}}\right].

Same question, different weight, a million times. Stack the answers and you have the full compass reading.

Why "local" is doing so much work

Notice what the gradient doesn't tell you. It doesn't know where the minimum is. It doesn't know whether one exists nearby. It doesn't know the shape of the loss surface beyond the patch of ground right under your feet. It just says: right here, right now, this direction is downhill. That is the entire message.

That sounds like a weakness, and it is one. You can get stuck in local minima. You can wander a flat plateau forever.

It is also the whole reason backprop is possible at all. You don't need a map of the whole landscape. You only need each weight to answer a local question, and each question is cheap. That cheapness is what lets you train a billion-parameter model in the first place.

A large gradient means you're on a cliff face: a small step changes the loss a lot. A small gradient means you're on a plateau: even big steps barely move the loss. The sign tells you which direction to step. The magnitude tells you how seriously to take it. Most of what an optimizer actually does, once backprop hands it the gradient, is decide how much to trust that magnitude. Adam effectively says, "fine on direction, skeptical on size."

Why Brute Force Fails

So: a million gradient values, one per weight. How would you actually compute them all?

The First Instinct: Just Try Each One

The definition of a derivative is right there. Use it.

  1. Pick weight #1
  2. Increase it by a tiny amount (say, 0.00001)
  3. Run the entire network forward with this new weight
  4. See how much the loss changed
  5. That change divided by 0.00001 is your gradient for weight #1
  6. Reset weight #1 to its original value
  7. Repeat for weight #2, weight #3, all the way to weight #1,000,000

This is finite differences:

LwiL(wi+ϵ)L(wi)ϵ\frac{\partial L}{\partial w_i} \approx \frac{L(w_i + \epsilon) - L(w_i)}{\epsilon}

where ϵ\epsilon is your tiny nudge (like 10510^{-5}).

The intro showed the bill: 27 hours for one gradient on a million-parameter network. Look at what you are actually doing. Each nudge changes one weight out of a million. The other 999,999 are identical. The network reruns the entire forward pass, every multiplication through every layer, and 99.9999% of the computation is work it already did. One weight wiggled. The whole thing recomputed from scratch.

A million near-identical forward passes to get a million numbers.

GPT-3 has 175 billion parameters. At 100ms per forward pass, one gradient takes roughly 554 years. Training needs millions of them.

The Second Idea: Just Derive The Formula

Finite differences is an approximation that costs a fortune. The natural next thought: skip the approximation entirely. Write the network as one mathematical expression, apply calculus rules, derive an exact gradient formula for every weight. No epsilon, no numerical error. This is symbolic differentiation.

For small expressions, it works perfectly. The catch is scale.

A two-layer network written out: L=loss(softmax(W2ReLU(W1x+b1)+b2), y)L = \text{loss}(\text{softmax}(W_2 \cdot \text{ReLU}(W_1 x + b_1) + b_2),\ y). Differentiate with respect to W1W_1. The chain rule unfolds every nested function, and the expanded result grows combinatorially. Five layers in, the gradient formula for a single weight runs to millions of terms. You would need gigabytes to store it, let alone evaluate it.

Worse: the formula for W1W_1's gradient and the formula for W2W_2's gradient share enormous subexpressions. The loss, the softmax, everything downstream appears in both. Symbolic differentiation doesn't see the overlap. It derives each gradient as an independent expression, expands it fully, evaluates it from scratch.

The Breakthrough: Your Forward Pass Already Did The Hard Work

Both naive methods fail for the same reason: they treat computing a million gradients as a million separate problems. It is not. It is one problem with huge shared structure.

When you run the network forward to compute the loss, you are already computing the ingredients for every gradient. Neither approach thought to look there. Think of an assembly line. An inspector at the end spots a defect: "this cost us." Trace it backward. Every station already recorded what it did. Each one gets a single question: how much of this defect is yours?

Imagine your network processing a picture of a cat.

Layer 1: Takes raw pixels, applies a million weights, produces feature detections. Neuron #47 fires strongly (outputs 32.7) because it saw an edge. Neuron #892 stays quiet (outputs 0) because its pattern wasn't there.

Layer 2: Takes those feature detections, combines them with another million weights, builds bigger patterns. Some neurons fire (5.2, 9.1), others don't.

Layer 3: Takes those patterns, makes a final decision: "73% cat, 12% dog, 15% other."

Loss: Compares to the true label ("cat"), outputs a single disappointment score: 2.31.

Those forward-pass numbers (32.7 for the edge detector, 5.2 for the pattern builder) are doing double duty. On the way forward, they are the activations the next layer consumes. On the way back, they are exactly the pieces needed to update every weight that produced them. The forward pass is paying for both.

Now scale up. Millions of parameters instead of three toy layers. Pick a random weight, say #47,293, sitting somewhere in layer 2. To adjust it, you need to answer one question: "How much did this weight contribute to our loss of 2.31?"

Two pieces of information are enough:

  1. What input did this weight see? That specific number from layer 1, already computed: 32.7.
  2. How much did this weight's output affect the final loss? That comes from layer 3, during the backward pass.

The weight's contribution is just those two numbers multiplied. That's it. No rerunning the network. No giant formulas. Just a multiplication of numbers we already have, or are about to compute anyway.

The same logic plays out inside every neuron. During the forward pass, each one records what it needs: the inputs it received ([32.7, 0, 18.4, ...]), the weights it applied ([0.5, -0.2, 0.8, ...]), the output it produced (5.2), and whether its activation fired or was blocked. That is the full receipt.

During the backward pass, each neuron receives one number: how much its output affected the loss. Call it 0.3.

That single number is enough:

  • Weight 0.5 saw input 32.7. Gradient: 0.3 × 32.7 = 9.81. Positive and large. This weight pushed the loss up. Reduce it.
  • The connection back to the layer that sent 32.7: gradient is 0.3 × 0.5 = 0.15.

That second number passes down. Every layer only does local arithmetic on numbers it already has.

This is not specific to neural networks. Any step-by-step computation works the same way: run forward and save the intermediate values, then run backward to compute each step's contribution. Physics simulations, rendering engines, financial models. Neural networks are especially well-suited because every operation (multiply, add, apply ReLU) has a clean local derivative, and they chain thousands of them in sequence.

The only price you pay is memory. Every intermediate value has to sit around until the backward pass needs it. For a modern network:

  • 10 layers × 1000 neurons per layer: 10,000 numbers per image.
  • Processing 32 images at once: 320,000 numbers.
  • Billion-parameter networks: gigabytes, just for intermediates.

But compare the alternative: centuries of compute. Memory is cheap. Time is not. Backprop made the only sensible trade.

Backpropagation's True Identity

Look at what we just did. Saving intermediates, walking backward, multiplying local derivatives. At no point did this require a neural network.

So why does "backpropagation" sound like it belongs to neuroscience? It doesn't. The name is a historical accident, and it hides what the technique actually is.

Strip the network away entirely:

Two operations, three values, one graph. You can ask: how does changing x change z? The chain rule answers it, and a computer can apply the chain rule automatically, step by step, as the computation runs. No symbolic formula explosion. No numerical approximation. Exact derivatives, accumulated alongside the values themselves. That is automatic differentiation.

Backprop is reverse-mode automatic differentiation applied to a computation graph with a scalar loss at the end. One output, millions of inputs. Reverse mode was designed for exactly that ratio. The neurons are incidental. The biology is a naming artifact. We are doing efficient calculus on a computer program, in one backward sweep.

Functions as Compositions and Graphs

Now that we have seen the computational efficiency of backprop, let's see what makes it work. The key isn't in the algorithm itself. It's in recognizing this pattern: every neural network, no matter how complex, is just simple operations chained together.

Think about what happens when you evaluate y=(x+2)3y = (x + 2) * 3. You do not compute this in one step. Your brain breaks it down:

  1. First compute a=x+2a = x + 2
  2. Then compute y=a3y = a * 3

This decomposition isn't just how we think about it, it's also what the computer does. And this is the key point: if we can compute something step by step going forward, we can compute gradients step by step going backward.

To make this concrete with a slightly bigger example. Consider this function:

This looks like a simple program, but we can draw it as a graph:

  • x and y are input nodes
  • a and b are intermediate nodes
  • c is the output node
  • The operations (*, +, *) are the edges connecting them

To look at this another way: x affects the output c through two different paths. It goes through a (by getting multiplied with y) and through b (by getting added to y). When computing gradients, we need to account for both paths. The gradient contributions from each path will add up.

This is exactly what happens in neural networks, just at a massive scale. Each weight affects the loss through potentially thousands of paths. Backprop's job is to efficiently track all these paths and sum up their contributions.

From Slope to Jacobian

You know what a derivative is. One input, one output, one number. Nudge xx, watch how f(x)f(x) moves, divide. The answer fits in a scalar.

Now look at what a neural network layer actually does.

A layer takes 512 inputs and produces 256 outputs. Nudge input #47 by a tiny amount and you don't get one number. You get 256 numbers, one change per output. And there are 512 inputs to nudge. That is 512×256=131,072512 \times 256 = 131{,}072 individual slopes. The single-number derivative still works for each pair. But you need somewhere to put all 131,072 of them.

The Gradient: One Step Up

We covered this in §1.1, but it is worth restating compactly because the next step builds directly on it.

A function f:RnRf: \mathbb{R}^n \to \mathbb{R} takes nn inputs and produces one number. Each input has its own partial derivative. Stack all nn of them into a vector:

f(x)=[fx1, fx2, , fxn]Rn\nabla f(x) = \left[\frac{\partial f}{\partial x_1},\ \frac{\partial f}{\partial x_2},\ \ldots,\ \frac{\partial f}{\partial x_n}\right] \in \mathbb{R}^n

That is the gradient. One number per input. When we say "the gradient of the loss with respect to every parameter," this is it: θL\nabla_\theta L, one partial derivative per parameter, LL a scalar.

The Jacobian: One Step Further

What happens when the function produces multiple outputs? f:RnRmf: \mathbb{R}^n \to \mathbb{R}^m. A hidden layer taking nn features and producing mm activations.

Each output has its own gradient: a vector of nn partial derivatives telling you how that one output responds to all inputs. You have mm such gradients. Stack them as rows of a matrix:

Jf(x)=[y1/x1y1/x2y1/xny2/x1y2/x2y2/xnym/x1ym/x2ym/xn]Rm×nJ_f(x) = \begin{bmatrix} \partial y_1/\partial x_1 & \partial y_1/\partial x_2 & \cdots & \partial y_1/\partial x_n \\ \partial y_2/\partial x_1 & \partial y_2/\partial x_2 & \cdots & \partial y_2/\partial x_n \\ \vdots & & \ddots & \vdots \\ \partial y_m/\partial x_1 & \partial y_m/\partial x_2 & \cdots & \partial y_m/\partial x_n \end{bmatrix} \in \mathbb{R}^{m \times n}

This is the Jacobian. Row ii is the gradient of output yiy_i. Column jj is how every output responds to a nudge in input xjx_j. All m×nm \times n slopes in one object.

Make it concrete. Two inputs, two outputs:

y1=x12+x2y2=3x1x22y_1 = x_1^2 + x_2 \qquad y_2 = 3x_1 - x_2^2

The Jacobian:

J=[2x1132x2]J = \begin{bmatrix} 2x_1 & 1 \\ 3 & -2x_2 \end{bmatrix}

At (x1,x2)=(1,2)(x_1, x_2) = (1, 2):

J(1,2)=[2134]J\bigg|_{(1,2)} = \begin{bmatrix} 2 & 1 \\ 3 & -4 \end{bmatrix}

Read it: nudge x1x_1 by ϵ\epsilon, y1y_1 moves by 2ϵ2\epsilon, y2y_2 moves by 3ϵ3\epsilon. Nudge x2x_2, y1y_1 moves by ϵ\epsilon, y2y_2 moves by 4ϵ-4\epsilon. Four slopes, one table.

[VISUAL: A labeled grid showing a Jacobian matrix. Columns are labeled x1,x2,,xnx_1, x_2, \ldots, x_n along the top. Rows are labeled y1,y2,,ymy_1, y_2, \ldots, y_m along the left side. Each interior cell shows yi/xj\partial y_i / \partial x_j. Highlight the entire first row in blue with a label reading "gradient of y1y_1". Highlight the second column in orange with a label reading "how all outputs respond to x2x_2". Below the grid: a label "Shape: m×nm \times n" with annotations "m = outputs (rows)" and "n = inputs (columns)".]

Shapes Are Contracts

Before you compute a single partial derivative, you already know the Jacobian's shape. Read it straight from the function signature.

f:RnRmf: \mathbb{R}^n \to \mathbb{R}^m has an m×nm \times n Jacobian. Always. A layer mapping 784 pixels to 128 features: 128×784128 \times 784. A layer mapping 1000 inputs to 500 outputs: 500×1000500 \times 1000. The shape is forced by the function, not by the calculus.

Think of it as a contract. When you write a layer, you commit that the backward pass will deal with a matrix of exactly that shape. Wrong shapes mean a broken contract. This is why shape errors in backprop are so diagnostic: they tell you where the contract was violated.

It is also why the next section is called "The Jacobian You Never Build." That 500×1000500 \times 1000 Jacobian has half a million entries. For one layer. Backprop has a way around it.

Backpropagation compute: explained intuitively

The Jacobian You Never Build

Half a million entries for one layer. A billion for a large one. And backprop never builds any of it.

Not because it can't. Because most of those entries are wasted work. By the time the backward pass reaches any layer, it already knows which weighted combination of Jacobian rows matters for the final loss. Everything else is computation you'd throw away.

The Tale of Two Accountants

A restaurant lost $100 on last night's service (the loss), and two accountants are hired to figure out which ingredients caused the problem. The loss came from just 2 dishes:

  • Dish #1 was responsible for $60 of the loss
  • Dish #2 was responsible for $40 of the loss

These dishes used various amounts of 1000 different ingredients. Both accountants need to figure out how much each ingredient contributed to the $100 loss.

The Naive Accountant

The first accountant builds a complete spreadsheet. For every ingredient, look up what percentage it contributed to each dish:

Ingredient% Used in Dish #1% Used in Dish #2
Ingredient #115%5%
Ingredient #20%20%
Ingredient #38%12%
.........
Ingredient #10003%0%

That's 2000 numbers computed and stored. The full Jacobian.

Then, to find each ingredient's contribution to the loss:

Ingredient #1's fault = 15% × $60 +  5% × $40 = $9 + $2 = $11
Ingredient #2's fault =  0% × $60 + 20% × $40 = $0 + $8 = $8
Ingredient #3's fault =  8% × $60 + 12% × $40 = $4.80 + $4.80 = $9.60
...

Work done: compute 2000 percentages, store them all, then do 1000 weighted sums.

The Clever Accountant

The second accountant skips the spreadsheet. She already knows the dishes cost 60and60 and 40. Why store every percentage as an isolated fact when you're just going to multiply each one by the same dollar amount anyway?

She goes straight to the final answer:

For Ingredient #1:
  - Look up: it's 15% of Dish #1 → 15% × $60 = $9
  - Look up: it's 5% of Dish #2 → 5% × $40 = $2
  - Total fault: $9 + $2 = $11

For Ingredient #2:
  - Look up: it's 0% of Dish #1 → 0% × $60 = $0
  - Look up: it's 20% of Dish #2 → 20% × $40 = $8
  - Total fault: $0 + $8 = $8

Same answers. Never stored the full table.

The Insight

Both accountants got the same result. But the naive one computed and stored all 2000 relationships even though he only needed them in one specific weighted combination. The clever one saw it: "I never need '15% of Dish #1' as an isolated fact. I only need '15% of Dish #1 × $60'."

That's backpropagation in miniature. The 60and60 and 40 are gradient values flowing backward from the loss. The percentages are local derivatives. You never need them in isolation. You only need them weighted by how much each output affected the loss.

Following the Gradient Backward

By the time the backward pass reaches your layer, the gradient vector [g₁, g₂, ..., gₘ] is already waiting: one number per output, naming the exact weighted combination of Jacobian rows you need. Any other combination is wasted work.

Let's trace this with a concrete setup:

1000 inputs → [Your Layer] → 2 outputs → [More layers] → Loss (single number)

The layers upstream have already done you a favor. They've worked out how your 2 outputs affect the final loss and handed the result back as 2 numbers: [g₁, g₂].

1000 inputs → [Your Layer] → 2 outputs → [More layers] → Loss (single number)
                                   ↑
                              [g₁, g₂] = [∂L/∂out₁, ∂L/∂out₂]

In effect: "Don't worry about what's upstream. Here is how your output 1 changes the loss (g₁), and how output 2 changes it (g₂)."

Your layer's Jacobian:

J = [∂y₁/∂x₁  ∂y₁/∂x₂  ∂y₁/∂x₃]
    [∂y₂/∂x₁  ∂y₂/∂x₂  ∂y₂/∂x₃]

(Shown with 3 inputs for compactness; in our running example it's 1000.) Each row is an output, each column an input. The gradient values g₁ and g₂ act as weights on these rows:

  • g₁ = ∂L/∂output₁: weight for row 1
  • g₂ = ∂L/∂output₂: weight for row 2

Now your job: pay it forward. Tell each of your 1000 inputs how it, individually, impacted the final loss.

Sounds hard. You have 1000 inputs, not 2. You've never seen the actual value of L.

But the loss can only "see" your inputs through your layer's outputs. You know how your outputs impact the loss (g₁ and g₂ told you). You know how your inputs impact your outputs (the local derivatives, which you can compute from cached forward-pass values). Chain those two facts:

L/x1=L/out1out1/x1∂L/∂x₁ = ∂L/∂out₁ \cdot ∂out₁/∂x₁

You already know ∂L/∂out₁ from upstream. The only thing left to compute is ∂out₁/∂x₁, the local derivative.

But x₁ affects both outputs, not just one. The full picture for input x₁: multiply g₁ by how much x₁ affects output 1, multiply g₂ by how much x₁ affects output 2, and sum:

L/x1=g1out1/x1+g2out2/x1∂L/∂x₁ = g₁ \cdot ∂out₁/∂x₁ + g₂ \cdot ∂out₂/∂x₁

For each input, sum how it affects each output, weighted by that output's gradient.

One subtlety. In a bigger network, an input might reach the loss through a path that doesn't go through your layer at all. The gradient you pass backward only accounts for influence through you. If that input also flows through some other layer, that layer sends its own gradient, and the two contributions get summed. What you're really computing is: "how this input affects the loss when it flows through me."

The Vector-Jacobian Product

Writing this for all inputs at once:

∂L/∂input_i = g₁ × J[0,i] + g₂ × J[1,i]

This is entry ii of the vector-Jacobian product (VJP): [g₁, g₂] × J.

In standard notation: L/x=JTL/y\partial L / \partial \mathbf{x} = \mathbf{J}^T \cdot \partial L / \partial \mathbf{y}. The gradient with respect to the inputs equals the transposed Jacobian times the gradient with respect to the outputs. This identity is the mathematical heart of backpropagation.

We never need J[0,i] by itself. Only multiplied by g₁. We never need J[1,i] by itself. Only multiplied by g₂. The incoming gradient vector tells us the only linear combination of Jacobian rows that matters.

In code:

The J.T @ grad_output pattern routes gradients backward through the same connections they flowed forward through. The transpose handles the routing.

Concrete Numbers

Layer with 3 inputs, 2 outputs. Jacobian (if we built it):

J = [2  3  1]  # How output 1 depends on each input
    [4  0  5]  # How output 2 depends on each input

Gradient from loss: [g₁, g₂] = [0.5, 0.2]

The naive approach multiplies the full matrix:

grad_inputs = [0.5, 0.2] × [2  3  1]
                           [4  0  5]
            = 0.5 × [2, 3, 1] + 0.2 × [4, 0, 5]
            = [1, 1.5, 0.5] + [0.8, 0, 1]
            = [1.8, 1.5, 1.5]

Backprop computes the VJP directly, one input at a time:

  • Input 1: 2 × 0.5 + 4 × 0.2 = 1.8
  • Input 2: 3 × 0.5 + 0 × 0.2 = 1.5
  • Input 3: 1 × 0.5 + 5 × 0.2 = 1.5

Same result. Never stored the 6-entry Jacobian.

For a linear layer Y = WX + b, the same VJP pattern gives you all gradients:

Each line is a VJP. The transpose in W.T @ grad_output routes gradients back through the same connections they came from, just in reverse.

This takes roughly the same time as the forward pass and needs only cached values from forward propagation: the input X and the activation masks. For deep networks with realistic layer sizes, the difference between materializing the full Jacobian and computing the VJP directly is the difference between trainable and impossible.

The Jacobian is a useful conceptual tool. Backprop never builds it. It goes straight to the only thing that matters: how each input contributed to the loss, weighted by the gradient flowing backward.

Chain Rule by Pictures

We know how to push a gradient through one layer. Networks stack dozens of them. What rule moves the gradient from layer to layer?

Two functions, chained. xx flows through gg to produce yy; yy flows through ff to produce zz.

x → [g] → y → [f] → z

Wiggle xx by a small amount ϵ\epsilon. How much does zz move?

gg amplifies first. If gg's local derivative is 2, yy moves by 2ϵ2\epsilon. That 2ϵ2\epsilon flows into ff. If ff's local derivative is 3, zz moves by 3×2ϵ=6ϵ3 \times 2\epsilon = 6\epsilon.

Multiply the local derivatives along the path.

zx=zyyx=3×2=6\frac{\partial z}{\partial x} = \frac{\partial z}{\partial y} \cdot \frac{\partial y}{\partial x} = 3 \times 2 = 6

[VISUAL: A horizontal chain of three nodes: x on the left, y in the middle, z on the right. Arrows connecting them labeled g (x to y) and f (y to z). Below the g arrow: local derivative "×2". Below the f arrow: local derivative "×3". A perturbation "+ε" annotates x, which becomes "+2ε" at y and "+6ε" at z via rightward nudge arrows. Caption at the bottom: "multiply along the path."]

Each function is an amplifier. They compose by multiplication: each one operates on the already-amplified output of the one before it.

But real networks are not chains. A single value can fan out into two paths, both feeding into the same output.

[VISUAL: A diamond-shaped graph. x at the left. Two arrows leave x: one going up to an intermediate node a, one going down to node b. From both a and b, arrows converge at z on the right. Label the four edges: x→a with "×2", a→z with "×3", x→b with "×4", b→z with "×5". Below the upper path: product label "2 × 3 = 6". Below the lower path: "4 × 5 = 20". At z, the contributions arrive and sum: "6 + 20 = 26". A plus symbol at z marks the merge.]

x can reach z through two routes:

  • Top route: derivative is 2 × 3 = 6
  • Bottom route: derivative is 4 × 5 = 20
  • Total derivative: 6 + 20 = 26

Multiply along paths, add across paths. That is the multivariate chain rule. Every fork in the forward pass becomes a sum in the backward pass. Not a convention we chose: when xx shifts by ϵ\epsilon, the top path pushes zz by 6ϵ6\epsilon and the bottom pushes by 20ϵ20\epsilon, independently. Separate influences on the same output add. It falls out of how derivatives work.

The complete form:

Lx=paths from x to Ledges in path(local derivative along edge)\frac{\partial L}{\partial x} = \sum_{\text{paths from } x \text{ to } L} \prod_{\text{edges in path}} (\text{local derivative along edge})

Addition at merges is a topology event, not an accounting choice. When a value has two outgoing edges in the forward pass, its gradient has two incoming contributions in the backward pass. The graph forces it.

Computational Graphs

All the chains, diamonds, and fork-and-merge patterns we drew in the last two sections. Those are computational graphs. Not metaphors. The actual data structure your framework builds while your forward pass runs.

When you write:

PyTorch watches each operation execute. The matmul becomes a node. The addition becomes a node. The ReLU becomes a node. Each node records its inputs and its output. By the time your forward pass finishes, the full graph sits in memory, waiting.

Each node tracks two things:

  1. During the forward pass: its value
  2. During the backward pass: its gradient (how much the final loss changes if this value changes)

Each operation tracks two things:

  1. How to compute the forward output
  2. Its local derivative rule

That second item does all the work. ReLU does not know what came before it. Matrix multiplication does not know what comes after. Each operation answers one narrow question: "my output shifted by this much, so how much did each input contribute?" The local derivative. That is all backprop needs from any node.

[VISUAL: Two "node cards" side by side, like index cards. Left card: labeled "Multiply (a × b)". Top half: "Forward: output = a × b", two arrows entering from above labeled a and b, one arrow leaving below labeled output. Bottom half: "Backward: gradient for a = upstream × b, gradient for b = upstream × a". Right card: labeled "ReLU". Top half: "Forward: output = max(0, x)", one arrow in, one arrow out. Bottom half: "Backward: pass gradient through if x > 0, block it (zero) otherwise". Caption under both: "Each operation is self-contained. It knows its forward rule and its local derivative. The graph coordinates everything else."]

This locality is what makes automatic differentiation work. If every operation can push a gradient backward through itself, and the graph connects them all, you can differentiate any composition. You never ask "how does weight 47 affect the loss through 20 layers of nonlinear transformations?" You register a local rule for multiply, one for add, one for ReLU. Each in isolation. The graph routes everything else.

So backward() walks the graph PyTorch recorded during the forward pass. Reverse order. At each node, call the local derivative rule, accumulate the result. No node talks to another. The graph coordinates everything.

And recall from the diamond paths: when a value is used multiple times in the forward pass, its gradient contributions sum in the backward pass. Two outgoing edges forward, two incoming gradient contributions backward. The graph enforces this automatically.

Neural networks are computational graphs. Each operation knows its local derivative rule and nothing else. The graph handles routing. What you reuse forward, you accumulate backward.

But we've been vague about what "the forward pass records" actually means. What gets stored at each node? Why does training eat so much more memory than inference? That's the forward pass story.

The Forward Pass: Setting Up for the Backward Journey

Forward propagation was covered in detail in the MLP post, so we won't repeat the full story here. If you need a refresher on how matrix multiplies and ReLUs compose to transform data, check that out first. Here, we'll focus on what the forward pass needs to do specifically to enable backpropagation.

The one thing to keep in mind w.r.t. the forward pass is that it isn't just computing outputs, but it's laying breadcrumbs for the backward pass to follow.

What Backprop Actually Needs From Forward

The backward pass arrives at a weight carrying one number: how much the loss changes if that weight's output changes. A gradient from the layers above.

But that gradient alone is not the weight's update. Computing the actual weight gradient requires a second piece: the input this weight was multiplied against during the forward pass. By the time the backward pass gets there, the forward pass is over. If nothing was saved, that input is gone.

So the forward pass has two jobs. Computing outputs is the visible one. The hidden one: leaving behind the specific values the backward pass needs to finish its gradient arithmetic. For a typical layer, that means the input it received and enough about the activation to reconstruct its local derivative. ReLU, for instance, only needs a boolean mask of which neurons fired. Not the full pre-activation values. Just the sign.

The downstream gradient, how this layer's output affected the final loss, is not part of that record. That's what the backward pass computes as it sweeps back. The cache handles inputs and activation state. The backward sweep handles everything else.

Skip the cache and the backward pass has to recompute those forward-pass values from scratch, roughly doubling the work. Store them, and backward only pays for the new arithmetic it adds.

The Cache: Your Memory-Computation Tradeoff

The forward pass leaves values behind for the backward pass. But not all of them. Most intermediate tensors never appear in any gradient formula, so caching them wastes memory for nothing.

The filter: look at the gradient expression for each operation. Whatever variables show up on the right-hand side, cache those. Everything else, let go.

For multiplication, z=xyz = xy: the gradient z/x=y\partial z/\partial x = y, so you cache yy to reconstruct xx's gradient, and vice versa. For ReLU, a=max(0,z)a = \max(0, z): the gradient is 1 where zz was positive and 0 where it wasn't. Not the magnitude. Just the sign.

That's the whole rule. Here's how it plays out for the operations you'll hit most often.

Linear layer: z=Wx+bz = Wx + b

Why both?

  • To get L/W\partial L/\partial W: you need xx (the input)
  • To get L/x\partial L/\partial x: you need WW (the weights)
  • To get L/b\partial L/\partial b: you just need the upstream gradient

ReLU: a=max(0,z)a = \max(0, z)

ReLU's gradient is binary: pass through or block. So a boolean mask is all you need. That's 1 bit per element versus 32 bits for the full pre-activation values. Across millions of neurons and dozens of layers, that 32x ratio compounds.

Softmax + Cross-Entropy Loss

Softmax converts raw scores into probabilities. Cross-entropy measures how wrong those probabilities are. They're almost always fused, and the cache is the reason.

Softmax converts raw scores (logits) into a probability distribution:

softmax(zi)=ezijezj\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j} e^{z_j}}

Where:

  • ziz_i = the logit (raw score) for class ii
  • The output is a probability distribution where all values are positive and sum to 1

Cross-Entropy Loss measures how different the predicted distribution is from the true distribution:

L=iyilog(pi)L = -\sum_{i} y_i \log(p_i)

Where:

  • pip_i = predicted probability for class ii (from softmax)
  • yiy_i = true probability for class ii (target distribution)
  • LL = the loss value (lower is better)

For classification, yy is typically one-hot (all zeros except a 1 at the true class), simplifying to:

L=log(ptrue class)L = -\log(p_{\text{true class}})

Why fuse them? Treating them as separate operations means two separate backward passes: cross-entropy backward needs the probabilities, softmax backward needs the logits. Keep them separate and you cache both. Fuse them and the combined gradient with respect to the logits is just Lzi=piyi\frac{\partial L}{\partial z_i} = p_i - y_i: predicted probability minus true probability. Nothing else required.

In code, the fused cache:

The backward pass is a one-liner: probs - one_hot(true_class).

The Topological Sort: Order Matters

For a chain network, backward order is obvious. Forward goes A → B → C, backward goes C → B → A. Reverse the list, done.

Now add a skip connection. ResNet sends one value to two different layers. Attention splits paths that merge back later. "Reverse the forward order" stops working. There is no single forward order. You might visit B before C or C before B during the forward pass, and both traversals are valid.

So what actually pins down the backward ordering?

Think about a node that fans out to two children. Both children will eventually compute a gradient and send it back. The node needs both contributions before it can compute its own total gradient and pass it upstream. Visit the node too early, before both children have reported, and you sum an incomplete set of contributions. The gradient you propagate is wrong.

That gives you the rule: process all children before their parent. Every downstream consumer of a node must finish before the node itself gets processed.

For a chain, this collapses to "reverse the list." For a DAG with branches and merges, you need something more principled: a topological sort. Given a directed acyclic graph, a topological sort finds any ordering where parents come before children. For the backward pass, flip it. Any valid reverse topological order guarantees that when you reach a node, every downstream consumer has already sent its gradient contribution back.

Getting this wrong is quiet. The network still trains. The loss still drops at first. But a node with two consumers that gets processed after only one has reported back accumulates a partial gradient. That incomplete number propagates upstream, and every gradient it touches from that point on is slightly off. The symptoms look like a learning rate problem or a bad initialization. Not a correctness bug. You might not notice for hundreds of epochs.

Topological sort is not an optimization. It is the correctness guarantee.

Shapes: The Unforgiving Truth

The forward pass is forgiving. You transpose, broadcast, reshape, and things fit. Get a dimension wrong and PyTorch throws a clear error with both shapes in the message.

The backward pass is not.

Get a shape wrong in your backward implementation and you won't see an error. You'll see a loss that drops a bit at first from early momentum, then plateaus. Nothing in the output points at the problem.

[VISUAL: Two panels side by side. Left panel labeled "Forward pass error": a red terminal showing a PyTorch RuntimeError, e.g. "mat1 and mat2 shapes cannot be multiplied (32x784 and 128x128)". Right panel labeled "Backward pass shape bug": a loss curve that drops steeply for the first few epochs, then flatlines at a high value for 100+ epochs. Caption below both panels: "Forward pass mistakes are loud. Backward pass shape bugs are quiet."]

The rule: the gradient of a tensor must have the same shape as the tensor itself.

The reason is mechanical. The gradient goes straight into the update:

If W.shape is (128, 784), then grad_W.shape must be (128, 784). Not (784, 128). Not (100352,). Exactly (128, 784). The subtraction is element-wise: each weight gets the gradient computed for it specifically. Wrong shape means the wrong weight gets nudged, or the operation crashes. W has 100,352 parameters. You need 100,352 gradient values. One per weight.

These shapes are not choices you make. The forward pass locks them in. Once W maps 784 input features to 128 outputs, grad_W must be (128, 784). The chain rule forces it. There is no other valid form.

Trace through a single layer:

QuantityShapeWhy
W(128, 784)128 output neurons, 784 inputs each
x(784,)784 input features
z = Wx + b(128,)one activation per output neuron
grad_z (incoming)(128,)matches the output z
grad_W(128, 784)must match W
grad_x(784,)must match x

Each row follows from the row above it. The forward pass is a sequence of commitments. The backward pass honors them.

[VISUAL: A two-column diagram. Left column labeled "Forward pass" shows three rectangles stacked vertically: W (wide, labeled 128x784), x (tall, labeled 784), z (short, labeled 128), connected by arrows pointing downward. Right column labeled "Backward pass" shows the same three shapes mirrored: grad_W (identical bounding box to W, 128x784), grad_x (identical to x, 784), grad_z (identical to z, 128), connected by arrows pointing upward. Matching pairs are connected by horizontal dashed lines. Caption: "Every gradient shape is a mirror of its forward-pass counterpart. The forward pass decides it."]

Shapes are also your best debugger. grad_W came out (128, 784) but W is (784, 128)? You forgot a transpose. grad_W is (128,) instead of (128, 784)? Missing an outer product. The wrong shape tells you the category of bug before you look at a single line of the formula.

When you read the local gradient rules in the next section, notice that every transpose and every sum in those formulas is forced by shape constraints. Nothing there is notation for its own sake. It is shapes working out their only valid form.

Local Gradients: Your Building Blocks

The cache, the evaluation order, the shape rules. We have the infrastructure. What's left is the actual arithmetic: when the backward pass reaches a node, what does it compute?

Each operation answers one question: "My output shifted by a small amount. How much of that shift came from each input?"

That's the local gradient. The operation doesn't know what feeds into it, doesn't know what uses its output. It reports one thing: the exchange rate between its inputs and its output. Because every operation answers this independently, you can chain them into networks of any depth. The chain rule does the rest.

Most of deep learning breaks down into four operations.

Addition: z=x+yz = x + y

Nudge xx by ϵ\epsilon, and zz moves by ϵ\epsilon. Same with yy. The exchange rate is 1:1 in both directions.

zx=1zy=1\frac{\partial z}{\partial x} = 1 \qquad \frac{\partial z}{\partial y} = 1

Addition is a signal splitter. The incoming gradient passes through to both inputs, unchanged. No scaling, no filtering. This is why a bias term (which is just an addition) has the simplest gradient rule in the whole network: its gradient equals the upstream gradient, full stop.

Multiplication: z=xyz = x \cdot y

zx=yzy=x\frac{\partial z}{\partial x} = y \qquad \frac{\partial z}{\partial y} = x

Nudge xx by ϵ\epsilon and zz moves by yϵy \cdot \epsilon. The exchange rate depends on the current value of yy. Large yy means xx matters a lot right now. Small yy means xx barely moves the needle. The local gradient for each input is the other input's value.

There is an uncomfortable consequence: if either input is zero, the gradient for the other input is also zero. A multiply gate with a zero on one side is a closed door. Watch for this in gating mechanisms and attention -- if your gate value collapses to zero, the gradient dies with it.

ReLU: a=max(0,z)a = \max(0, z)

az={1if z>00otherwise\frac{\partial a}{\partial z} = \begin{cases} 1 & \text{if } z > 0 \\ 0 & \text{otherwise} \end{cases}

ReLU is a gate. Either the gradient flows through unchanged, or it stops completely. No partial signal, no fractional pass-through. This binary behavior is what makes ReLU so cheap to differentiate: the backward pass only needs to know which neurons fired. Not their values. Just the sign. This is why the forward pass caches a boolean mask, not the full tensor.

This also describes dead neurons. If a neuron's pre-activation is always negative, the gate stays closed, the gradient is always zero, and the weights never update. The neuron is stuck. A few dead neurons are not catastrophic. Whole layers of them are.

Matrix Multiply: Z=WXZ = WX

LW=LZXTLX=WTLZ\frac{\partial L}{\partial W} = \frac{\partial L}{\partial Z} \cdot X^T \qquad \frac{\partial L}{\partial X} = W^T \cdot \frac{\partial L}{\partial Z}

The transposes look like notation, but they are forced by the shapes. As we saw in §3.4: the gradient of WW must have the same shape as WW itself. The upstream gradient L/Z\partial L/\partial Z and XTX^T are the only things you can multiply together to get a matrix with that shape. There is no other valid form.

The intuition underneath: WW carries information from XX-space to ZZ-space in the forward pass. In the backward pass, WTW^T carries gradients back from ZZ-gradient-space to XX-gradient-space. The same connections, reversed. Forward and backward are mirrors through the same weights.

[VISUAL: Four "recipe cards" arranged in a 2×2 grid. Each card shows the operation name, the forward formula, and the backward gradient formulas. Card 1 (Addition): z = x + y; ∂z/∂x = 1, ∂z/∂y = 1; a fork icon showing the incoming gradient splitting equally to both inputs. Card 2 (Multiplication): z = xy; ∂z/∂x = y, ∂z/∂y = x; annotation "other input is the scale." Card 3 (ReLU): a = max(0, z); ∂a/∂z = 1 if z > 0, else 0; drawn as an open/closed gate — gradient either flows or stops. Card 4 (Matrix Multiply): Z = WX; the two gradient formulas with shapes annotated alongside, showing why the transposes are forced by dimension-matching. Caption below all four cards: "Four rules. Most of deep learning."]

Four rules. Every fully connected layer, every ReLU, every softmax, every cross-entropy loss decomposes into these operations. The complexity of a billion-parameter model is not in the individual rules. It is in the depth of the graph and how they chain.

A Complete Forward Pass Example

The four rules, the cache, the shapes, the topological ordering. All in isolation so far. Time to run them together on actual numbers.

Two inputs, two hidden neurons, one output. Small enough to trace by hand.

[VISUAL: A two-layer network drawn in whiteboard style. Left: two input nodes labeled x₁ = 3.0 (top) and x₂ = -2.0 (bottom). Center column: two hidden nodes h₁ (top, open circle) and h₂ (bottom, open circle). Right: one output node. Four weighted arrows connecting inputs to hidden nodes: x₁→h₁ labeled 1.0, x₂→h₁ labeled -0.5, x₁→h₂ labeled 0.5, x₂→h₂ labeled 2.0. Small bias annotations: b = 0.0 near h₁, b = 1.0 near h₂. Two arrows from hidden nodes to output: h₁→output labeled 1.0, h₂→output labeled 0.5, with bias -1.0 near the output node. After the ReLU step, shade h₂ gray with the annotation "-1.5 → 0 (dead)" to show it was silenced. Caption: "One neuron fires. One doesn't."]

Layer 1 pre-activations, z1 = W1 @ x + b1:

h₁:  (1.0)(3.0) + (-0.5)(-2.0) + 0.0  =  3.0 + 1.0       =  4.0   →  active
h₂:  (0.5)(3.0) +  (2.0)(-2.0) + 1.0  =  1.5 - 4.0 + 1.0  = -1.5  →  dead

Shape of z1: (2,). One number per hidden neuron. Forced by the output dimension of W1.

ReLU: a1 = max(0, z1) → [4.0, 0.0]

h₂ is silenced. What the cache stores: not -1.5. Just the mask [True, False]. The backward pass only needs to know which neurons were positive, not by how much. One bit per neuron instead of 32.

Layer 2, z2 = W2 @ a1 + b2:

(1.0)(4.0) + (0.5)(0.0) + (-1.0)  =  4.0 + 0.0 - 1.0  =  3.0

h₂'s weight in W2 is 0.5. It sees zero input. The output doesn't care what that weight is.

The same computation, with the cache made explicit:

Now trace what the dead neuron means for the backward pass. a1 = [4.0, 0.0] sits in cache2. When backprop computes grad_W2, it takes the outer product of the upstream gradient and a1. The second entry of a1 is zero. W2's second weight gets a zero gradient. It won't move.

One layer back: the backward pass hits the ReLU mask for h₂, reads False, and stops. The weights in W1's second row learn nothing from this example. Not because anything went wrong. Because h₂ contributed nothing to this particular forward pass.

This is the mechanism, not an edge case. Any given input activates some subset of neurons and silences the rest. Only the active neurons participate in the prediction and the weight update. Different inputs activate different subsets. Each training example only updates the weights in the region it lands in.

Setting Up for the Journey Back

Look at what we left behind.

x = [3.0, -2.0] sitting in cache1. The mask [True, False] beside it. And a1 = [4.0, 0.0] in cache2.

Three values. The entire record the backward pass needs from a two-layer network. Not the full pre-activation tensor. Not intermediate computations. Three values.

[VISUAL: Three small cache card boxes arranged left to right. Card 1 labeled "cache1: x" shows [3.0, -2.0]. Card 2 labeled "cache1: mask" shows [True, False] with h₂'s entry grayed out and annotated "dead." Card 3 labeled "cache2: a1" shows [4.0, 0.0]. A horizontal banner above all three reads "What the forward pass left behind." Below each card, a small downward arrow labeled "consumed by backward." Caption: "Three values. Everything backprop needs."]

The cache stores more than inputs for gradient formulas. It stores verdicts. h₂'s mask entry reads False. When backprop reaches that neuron, it will multiply by zero and move on. The forward pass decided h₂ contributed nothing to this prediction. The backward pass does not revisit the question.

The forward pass records which neurons fired and which paths carried signal. The backward pass reads those records in reverse.

So here is where we stand. Backprop starts from the loss, a single number. It walks the graph backward, one node at a time, applying the local derivative rules from §3.5 to the cached values we just traced. When it reaches the bottom, every weight has a gradient.

One sweep forward. One sweep back. Every gradient computed exactly once.

The next section walks that backward sweep, node by node.

The Backward Pass: Retracing Your Steps With Purpose

The cache has what the backward pass needs. The forward pass is over. The process starts from the loss: one number. It ends with a gradient for every weight.

One sweep, in reverse.

Meet the Adjoint: Your Gradient's True Name

Here is the thing that confused me when I first saw backprop code. You are not computing "the gradient" as a single thing at the end. You are computing one for every intermediate value in the graph. The input to a neuron gets one. The output of a multiplication gets one. The pre-activation before a ReLU gets one.

Each of them answers the same question: if I wiggle this value, how much does the loss move?

That number is the adjoint. For any value vv:

vˉ=Lv\bar{v} = \frac{\partial L}{\partial v}

The sensitivity of the loss to vv. How fast the loss moves when vv moves.

Why does it need its own name? Partly because "gradient" is already overloaded. But mostly because naming it makes the algorithm legible. When you read backward pass code, every line does the same thing: here is a value, here is its adjoint. One pattern, applied everywhere.

Trace it on a small example:

The seed is always 1. L/L\partial L / \partial L is exactly 1: the loss is perfectly sensitive to itself. Everything else follows. Each step takes the adjoint that arrived from downstream, scales it by the local derivative, and passes the result upstream.

[VISUAL: Four labeled boxes in a horizontal row: x, y, z, loss. Each box has two halves — the top half shows the forward value, the bottom half shows the adjoint. Top row: x=2, y=6, z=7, loss=49. Backward arrows run right to left, each labeled with its local derivative: loss to z labeled ×(2z=14), z to y labeled ×1, y to x labeled ×3. Bottom row fills in right to left: loss gets adjoint 1, z gets 14, y gets 14, x gets 42. Caption: "Forward fills the top. Backward fills the bottom."]

The Multiply-Accumulate Pattern: Backprop's Only Move

Go back to the backward pass in §4.1. Every line did the same thing: take the adjoint that arrived, scale it by something local, add to what was already there. Same pattern, every node. Zoom in on what happens at a single edge. Not the whole graph. Just one edge, one step.

One line:

That's the move. Every gradient in every network comes from this line, applied to every edge in the graph, in reverse order. The rest of the algorithm is just coordination.

The relay station

What does a single node see? From downstream, a ratio arrives: how much the final loss moves per unit change in this node's output. That's the downstream adjoint.

The node's job is narrow. It doesn't need to know anything about the layers above or below it. It just needs to answer: given that my output mattered by factor yˉ\bar{y}, how much did each of my inputs contribute?

The local derivative is the exchange rate. If the local derivative for input xix_i is 3, it means a unit change in xix_i causes a change of 3 in the output. So if the output is worth yˉ\bar{y} to the loss, then xix_i caused yˉ×3\bar{y} \times 3 of that. Multiply. Pass upstream.

Why += is not optional

But what if a value feeds into two downstream nodes?

Values get reused. We saw this in section 2.2: a single value can fan out to two different downstream nodes, both eventually affecting the loss. Two paths from the same value. Two separate gradient contributions, one through each path.

If you write =, the second contribution overwrites the first. You lose half the gradient. The answer is wrong.

The += accumulates both:

Without accumulation, just overwriting would give 3 instead of 6. Wrong by a factor of 2. In the graph, x has two outgoing edges, one to each factor of the multiplication. Two edges, two contributions. The += collects both.

This falls straight out of the multivariate chain rule: when a value has multiple outgoing paths, its total gradient is the sum of contributions through each path. The += is that sum, edge by edge.

[VISUAL: Three small panels side by side, each showing a mini computational graph with backward arrows. Panel 1 labeled "Simple chain": two nodes x and y connected by one edge. A backward arrow carries the label "output_adjoint × local_deriv." A bucket below x labeled "x_adjoint" receives one stream. Panel 2 labeled "Fan-out": x connects forward to both y₁ and y₂. Two separate backward arrows return from y₁ and y₂, each labeled with its own contribution. Both pour into the same bucket below x. A "+=" symbol appears at the bucket with a note: "two contributions, one accumulation." Panel 3 labeled "x * x": x has two outgoing edges to a multiply node, labeled "factor 1" and "factor 2." Both edges fire backwards. The bucket shows "3 + 3 = 6." Caption beneath all three panels: "One move. All three situations. The += handles the topology."]

The breadth of one rule

A weight matrix in a batched linear layer is used 32 times, once per example. Each use generates a contribution. They accumulate. The matrix multiply that computes the weight gradient sums all 32 contributions implicitly.

A skip connection sends a value to two downstream layers. Both send gradients back. They accumulate.

An attention key is compared against multiple queries. Each comparison generates a contribution. They accumulate.

No special logic for any of these. The graph structure determines which edges exist. The += collects contributions through all of them. That's the whole move.

Walking the DAG: The Actual Algorithm

You could run the multiply-accumulate move along a straight path right now. Loss to z, z to y, y to x. One path, one direction, no ambiguity.

But graphs branch. A value x feeds into two downstream nodes, and both send gradient contributions back. Process x before both contributions arrive and you propagate an incomplete adjoint. Every node below it gets a wrong gradient. Silently.

So the question is: when you reach a node, how do you know its adjoint is complete?

Reverse topological order. Process children before parents. By the time the loop reaches any node, every downstream consumer has already sent its contribution back. The full algorithm:

Three steps.

Initialize. Every adjoint starts at zero. The loss gets 1: L/L=1\partial L / \partial L = 1. The one adjoint you can write down without computing anything. Every other adjoint in the graph follows from it.

Walk in reverse topological order. If x fans out to a and b, both a and b process before x. By the time the loop reaches x, both contributions have arrived and the += has collected them. No incomplete gradients leak downstream.

Accumulate. Same move as §4.2, at every edge. Multiply the incoming adjoint by the local derivative, add to the input's running total. The += handles fan-out, weight sharing, and every other case where a value has multiple consumers.

None of this is specific to neural networks. It works on any directed acyclic graph where each node has a local derivative rule.

Trace it through a concrete graph. Forward: y = wx + b, then loss = y².

Start at loss. Adjoint 1. The local derivative of squaring is 2y=142y = 14, so y receives 14.

Next, y. Adjoint 14. Addition has local derivative 1 for both inputs, so temp and b each get 14. Notice b is a leaf: no inputs to propagate to. It holds its adjoint and waits.

Then temp. Adjoint 14. Multiplication's local derivative with respect to each input is the other input's value. So w gets 14×2=2814 \times 2 = 28, x gets 14×3=4214 \times 3 = 42.

Leaf nodes hold their final adjoints. The sweep ends.

Five nodes. Every adjoint filled. One pass.

The Cost of Gradients: Time, Memory, and Checkpointing

Five nodes, one backward sweep. The same procedure scales to any size graph. What changes is the bill.

Two costs. They behave very differently.

Time: One More Forward Pass

The backward pass visits every node once. At each node: read the incoming adjoint, multiply by the local derivative, accumulate. Same structure as the forward pass.

So: forward pass is O(number of operations). Backward pass is O(number of operations). Total cost is about 2× running the network once.

Not 2× per parameter. Two passes, regardless of how many parameters you have.

This is what the chain rule buys you. Without backprop, you'd estimate each gradient by perturbing one parameter at a time: run forward, nudge by epsilon, run forward again, divide. One forward pass per parameter. For 100 million parameters, that's 100 million forward passes. Backprop collapses that to two.

[VISUAL: Three horizontal bars stacked vertically, labeled on the left. Top bar: "Inference (forward only)" -- length 1x. Middle bar: "Training: forward pass" -- also length 1x. Bottom bar: "Training: backward pass" -- length ~1.5x, with a small annotation pointing to the extra length: "transposes + accumulation." A curly brace on the right spans both training bars, labeled "~2-3x total." Caption: "Two passes total. No matter how many parameters."]

Why roughly 2× and not exactly? The backward pass does work the forward pass doesn't: transposing weight matrices, accumulating contributions from fan-out. In practice, backward lands between 1.5× and 3× the forward cost depending on architecture. The complexity class is the same. No surprises.

Memory: The Bill That Arrives Later

Time is manageable. Memory is where training gets hard.

The backward pass needs activations from the forward pass. Every value a local derivative depends on must stay alive in GPU memory until the backward pass reaches that node. The cache from §3.2 is not freed until the sweep finishes. Every layer's activations, every attention weight matrix, the whole batch. All sitting in memory, waiting.

For 24 layers: roughly 5.6 GB just for activations. Add model weights, optimizer state, gradient tensors, batch data. A 7B parameter model in full training can require 80+ GB.

This is why you run out of GPU memory during training when inference fits fine. Inference processes one layer and forgets what it doesn't need before moving to the next. Training keeps all of it alive at once.

[VISUAL: Two side-by-side column diagrams, each showing a vertical stack of 6 numbered layers. Left column labeled "Inference": only one layer is highlighted (active) at any moment; layers above it are greyed out with a label "freed." Right column labeled "Training": all 6 layers are highlighted simultaneously, each carrying a small yellow block labeled "cache." A bracket on the right spans the full height of the training column with the annotation "all alive simultaneously." Caption: "Inference forgets as it goes. Training cannot."]

Gradient Checkpointing: Trading Compute for Memory

So the cache is large. Can you shrink it?

Yes. Throw some of it away and recompute it when you need it.

Instead of storing every layer's activations, store a checkpoint every N layers during the forward pass. Discard everything in between. When the backward pass needs a discarded activation, recompute it from the nearest checkpoint.

[VISUAL: A vertical stack of 8 numbered layers (1 through 8), shown twice side by side. Left column "No checkpointing": every layer has a yellow cache block. Right column "Checkpoint every 4 layers": only layers 1 and 5 have yellow cache blocks; layers 2-4 and 6-8 show a gray "discarded" marker. A dashed recompute arrow originates from the checkpoint at layer 5, passes forward through layers 6 and 7, and arrives at layer 8 when its activation is needed during backward. Caption: "Store less. Pay by recomputing what you discarded."]

In code:

With L layers and checkpoints placed every √L layers, you store √L checkpoint slots and recompute at most √L layers per backward step. Memory scales with the square root of depth instead of linearly. A 100-layer network goes from 100 activation buffers to ~10.

Memory:      O(√L) instead of O(L)
Computation: ~1.5× forward passes instead of 1×

In PyTorch this is one function call: torch.utils.checkpoint.checkpoint(fn, x). It wraps the segment you want to checkpoint.

The Shape Discipline That Keeps You Sane

You have been tracing adjoints through graphs for several sections now. Here is the move I found most useful early on: before computing a single gradient, write down its shape. You already know it.

The rule from §3.4: the shape of a gradient must match the shape of its corresponding value. The adjoint vˉ=L/v\bar{v} = \partial L / \partial v has one entry per element of vv. 128 elements, 128 partial derivatives. Not convention. Definition.

You have seen why: the gradient goes straight into W = W - lr * grad_W. Element-wise. Wrong shape and the update either crashes or silently nudges the wrong weights. No error message for the second failure mode.

The forward pass as a shape oracle

The forward pass commits to shapes. The backward pass inherits them.

Before you write a single line of backward code, fill in every gradient's shape by reading the forward pass. W is (128, 784)? Then grad_W is (128, 784). Done. No derivation needed.

[VISUAL: A two-column table. Left column labeled "Forward pass" lists each tensor with its shape: W (128x784), b (128), x (784), z = Wx+b (128). Right column labeled "Gradient shape" mirrors each row: grad_W (128x784), grad_b (128), grad_x (784), grad_z (128). A dashed horizontal line connects each pair across columns. Below the table: "Every gradient shape is a copy of its forward counterpart. The forward pass already decided it."]

This becomes your first debugging move. Check shapes before touching formulas. Transposed a matrix that shouldn't be transposed? Shape mismatch. Forgot an outer product? Shape mismatch. The wrong shape names the bug category before you read a single line of the gradient formula.

When operations broadcast: the reduce rule

Most forward passes silently expand tensors. Adding a bias of shape (128,) to a batch of activations shaped (32, 128) works because broadcasting replicates the bias 32 times. NumPy handles it quietly. You don't even think about it.

Backprop does not get that free pass. Whatever expansion happened forward, you undo backward. Replication undoes with a sum along the replicated axis:

The pairing is mechanical. Broadcast forward, sum backward, same axes. Every time.

We go deeper in §6.4 when full layer implementations are on the table. But those two lines carry the whole rule.

Two shape checks cover any backward implementation. Does each gradient match its tensor's shape? Does each broadcast sum back down? If both hold, the scaffolding is right. Everything else is arithmetic.

Pulling It All Together

Adjoints (§4.1). Multiply-accumulate (§4.2). Reverse topological order (§4.3). Shape discipline (§4.5). Four pieces, built separately across this chapter. Here they are in a working two-layer network, forward and backward:

Walk it top to bottom.

forward() does two things: compute predictions and leave a record. self.X, self.Z1, self.A1, self.probs -- these cached values are everything the backward sweep will read. The forward pass is precise about what it keeps, because the backward pass is precise about what it needs.

backward() opens with the seed adjoint:

Softmax followed by cross-entropy has a clean shortcut: chain-ruling through both collapses to probs - one_hot. Three lines. This is the first adjoint -- the loss's sensitivity at the final pre-activation.

From there, multiply-accumulate (§4.2):

A1.T @ grad_Z2 is the weight gradient in outer-product form. Shape check: A1 is [batch, 128], grad_Z2 is [batch, 10], so the product is [128, 10] -- the shape of W2. The adjoint definition forces this (§4.5). grad_Z2.sum(axis=0) is the broadcast reverse: bias was replicated across 32 examples during the forward pass, so the backward gradient sums along that same axis.

Then ReLU:

The mask is already in the cache. The forward pass decided which neurons fired. The backward pass does not revisit that question -- it applies the stored answer. A dead neuron gets a zero adjoint. Its weights learn nothing from this example.

Layer 1 repeats the same pattern. Same three lines, different shapes, same logic. The chain rule at every edge, reverse topological order, cached values from the forward pass.

Practice With Baby Examples

The mechanics are on the table. What's still missing is the feel of running backprop by hand. The clearest way to get that: shrink everything until the computation fits in your head.

No matrices. No batches. Just scalar values through a small graph, compact enough to check against direct calculus. The rules are the same ones we've been building. At this scale, there's nowhere to hide if something goes wrong.

The Straight Line: Your First Gradient Story

Four operations in a line. One input, one loss, one path connecting them.

We want L/x\partial L/\partial x. How does the input affect the loss?

You could compose the entire chain into L=(2x+5)2L = (2x + 5)^2, expand, differentiate. That works when the graph has four nodes. Not when it has four billion.

So walk the graph backward. At each node: multiply the arriving adjoint by the local derivative, pass the result upstream (§4.2).

Trace what happened. The square at z=11z = 11 has local derivative 2z=222z = 22. Addition passes the adjoint through unchanged (local derivative 1). Multiplication by 2 scales it by 2. The adjoint starts at 1 and accumulates: 1×22×1×2=441 \times 22 \times 1 \times 2 = 44.

Check it the hard way:

  • L=(2x+5)2=4x2+20x+25L = (2x + 5)^2 = 4x^2 + 20x + 25
  • dL/dx=8x+20dL/dx = 8x + 20
  • At x=3x = 3: dL/dx=24+20=44dL/dx = 24 + 20 = 44

Same answer. But the backward pass never needed the closed-form relationship between xx and LL. It used two things at each node: the local derivative and the adjoint arriving from one step ahead. No expansion, no simplification, no global view of the graph.

When Paths Split and Merge

One path from x to the loss. One chain of multiplications. What happens when x reaches the loss through two separate routes?

x reaches the loss two ways: through a (the multiply branch) and through b (the add branch). The graph forms a diamond.

[VISUAL: A diamond-shaped computation graph. Two inputs on the left: x = 2 and y = 3. Two middle nodes: a = x × y = 6 (upper branch) and b = x + y = 5 (lower branch). Both feed into z = a × b = 30, then to L on the right. The two paths from x to L are highlighted in different colors — upper path through a in blue, lower path through b in orange. Each edge is labeled with its operation.]

Backward pass. Start from the loss:

Two contributions to d_x, summed. That sum is not a design choice. When x shifts by a tiny ϵ\epsilon, both paths feel it at once: 15ϵ15\epsilon through the top, 6ϵ6\epsilon through the bottom. Both perturbations happen simultaneously. Total effect: 21ϵ21\epsilon.

This is §2.2's rule with numbers behind it:

Lx=all paths from x to L(edges in pathlocal derivative)\frac{\partial L}{\partial x} = \sum_{\text{all paths from } x \text{ to } L} \left( \prod_{\text{edges in path}} \text{local derivative} \right)

Multiply along paths, add across paths. And direct calculus agrees:

z=(xy)(x+y)=x2y+xy2z = (x \cdot y)(x + y) = x^2 y + x y^2 zx=2xy+y2=2(2)(3)+9=21 \frac{\partial z}{\partial x} = 2xy + y^2 = 2(2)(3) + 9 = 21\ ✓

Same answer. Backprop followed the graph, applied local rules, accumulated. No closed-form expansion required.

Now you can see why += from §4.2 is not optional. A plain = overwrites the first path's contribution when the second arrives. The gradient becomes 15 or 6 instead of 21. No crash, no error message. Just wrong numbers flowing upstream.

Two paths is a toy example. Real networks have thousands. A convolution kernel at 676 spatial positions: 676 gradient contributions, summed into one weight update. A ResNet skip connection splitting a value into two branches: both send gradients back, both accumulate. An attention value queried from multiple positions: one gradient contribution per query, all summed.

Different architectures. Same rule. The graph determines how many paths exist. The += collects them.

Trusting But Verifying: Gradient Checking

Backprop is implemented. The loss drops. Everything looks fine.

That last sentence is weaker evidence than it seems. A network with subtly wrong gradients still learns. For a while. An imprecise gradient points roughly downhill, and roughly downhill is enough for the loss to keep dropping. A transposed matrix that should be WTW^T but isn't. A missing factor of 2. A ReLU boundary handled wrong. None of these crash your code. They surface as "why won't this converge" after a long training run, when there are a hundred other things to blame.

So: how do you verify a backprop implementation?

The answer is the tool we rejected in §1.2. Finite differences. Not for training, just to spot-check.

The derivative is defined as:

Lx=limϵ0L(x+ϵ)L(x)ϵ\frac{\partial L}{\partial x} = \lim_{\epsilon \to 0} \frac{L(x + \epsilon) - L(x)}{\epsilon}

Computing this for every parameter takes one forward pass per parameter. Ruinous at scale. But you don't need to check every parameter. Pick 50 or 100 at random. Nudge each one slightly, measure how the loss responds, compare to what backprop returned. The whole check takes seconds.

Use the centered difference rather than the one-sided version:

LxL(x+ϵ)L(xϵ)2ϵ\frac{\partial L}{\partial x} \approx \frac{L(x + \epsilon) - L(x - \epsilon)}{2\epsilon}

The one-sided formula looks forward from where you stand. Centered looks both ways. Error scales with ϵ2\epsilon^2 instead of ϵ\epsilon: same cost, much better accuracy.

Back to the straight-line example from §5.1. We know L/x=44\partial L/\partial x = 44. Run it through the centered difference:

They match to 3×1083 \times 10^{-8} relative error. Both methods measure the same quantity: how the loss responds to a change in xx. Backprop computes it exactly via the chain rule. Finite differences approximate it by actually nudging xx and observing. Agreement to eight decimal places is strong evidence the implementation is correct.

Gradient checking is a debugging tool, not a training tool. Two forward passes per parameter is far too expensive for actual training. And even as a spot-check, the numerical approximation has limits. Choosing ϵ\epsilon takes care.

Too large and you measure average slope over an interval, not a local derivative. The function curves over that range. Too small and floating-point subtraction of nearly equal numbers amplifies rounding error. 10510^{-5} is the standard starting point. Relative error below 10510^{-5}: passed. Above 10210^{-2}: something is wrong.

The check in practice

The full check loops over a random subset of parameters, computes a numerical gradient for each, and flags any that disagree with backprop beyond a threshold:

When gradient checking fails, the relative error tells you what kind of bug you have:

  • Error around 1.0 or 2.0: A missing factor. The 2 in d(x2)/dx=2xd(x^2)/dx = 2x is the classic one.
  • Error around 0.5: You're computing half the gradient. Common with symmetric operations.
  • Huge error: Wrong operation entirely, or gradients aren't reaching that parameter at all.
  • Error only on some parameters: Conditional logic. ReLU boundaries are the usual suspect.

A passing check is strong evidence, not proof. A hundred random samples catches most bugs. A wrong gradient could match the numerical estimate by coincidence. Rare, but worth knowing.

These three scalar examples were the smallest version of backprop. Chains multiply. Forks add. Gradient checks verify. The same three operations run inside a billion-parameter model, across more nodes and more layers. The math does not change at scale. Only the shapes do.

From One Neuron to a Layer

The scalar examples were the smallest possible version of backprop. Chosen to make every step traceable by hand. A real network doesn't look like that. A real layer takes a vector of inputs, multiplies by a weight matrix, adds a bias, and passes a vector of activations to the next layer.

The rules don't change. Multiply along paths, add at merges, accumulate gradients. The scalars just grow into matrices, and the same bookkeeping scales with them.

The Affine Transform: Your Workhorse Operation

What does a real layer compute? Take an input vector, multiply by a weight matrix, add a bias. y=Wx+by = Wx + b. That's it.

This is the affine transform. Fully connected layers, attention projections, transformer MLPs: the architectures look different, but this operation sits inside all of them. Work out its gradients once and you have the template for most of what you'll encounter.

The minimal case: one neuron, three inputs.

Backward pass. Three gradients to compute:

  1. How much should each weight change? (L/w\partial L/\partial w)
  2. How much should the bias change? (L/b\partial L/\partial b)
  3. What gradient flows back to the input? (L/x\partial L/\partial x)

Each weight wiw_i controls the output through one path: z=wixi+z = w_i x_i + \ldots. So the gradient is just the input it was paired with, scaled by what came from downstream:

The forward pass multiplies inputs and weights together. The backward pass swaps partners:

  • Weight gradients: downstream gradient × inputs
  • Input gradients: downstream gradient × weights

Not arbitrary. The chain rule demands exactly this. Each weight controls the output through its paired input, so the gradient carries that input back. Same logic, both directions.

Scaling to a Full Layer

One neuron is one dot product. A layer is many neurons, all looking at the same input vector. Stack four neurons' weight vectors into a matrix and you get the whole layer at once:

Backward pass. Same logic, matrix form:

When multiple neurons share the same input, that input's gradient is the sum of contributions from all of them. The accumulation pattern from Part 5, now happening through matrix multiplication.

Batching: The Final Dimension

So far, one example per pass. Real training processes a batch: 32, 64, 256 examples at once. GPUs have thousands of cores; one example at a time leaves most of them idle.

The gradient rules don't change. What changes is a distinction you need to track: weights and biases are shared across every example in the batch. Inputs aren't. That difference determines how gradients accumulate.

Start small. Two examples instead of one:

The dimensions:

  • Input X: [2, 3] — 2 examples, 3 features each
  • Weights W.T: [3, 4] — 3 inputs to 4 outputs
  • Output Z: [2, 4] — 2 examples, 4 neuron outputs each

Each row of Z is what you'd get processing that example alone. The batch dimension stacks them.

This is where shared vs. per-example matters. Every example ran through the same weights, but each one has its own opinion about how those weights should change. The backward pass collects all of them:

Not a special batching rule. Same accumulation pattern from the fork nodes in Part 5: when one value feeds multiple paths, its gradient sums over all of them. The weights feed every example in the batch, so their gradient sums across it.

The distinction:

  • Shared parameters (W, b): Used by all examples → gradients accumulate across the batch
  • Per-example data (X): Different for each example → gradients stay separate

Scaled to a realistic batch size:

Same rules at 1 example or 1,000. The chain rule does not care about the batch dimension. Matrix multiplication handles the summation.

One subtlety worth tracing: in grad_W = grad_Z.T @ X, the matrix multiply does the batch summation automatically.

  • Each row of grad_Z.T is one neuron's gradients across all batch examples
  • Each column of X is one input feature across all batch examples
  • The dot product sums over the batch dimension, giving the accumulated gradient

Shared parameters accumulate. Per-example data stays separate. That is the entire batching story for backprop.

The Activation Function Zoo

The affine transform is linear. Stack two linear transforms and you still have one linear transform. Stack a hundred. Same result. Without a nonlinearity after each layer, a thousand-layer network collapses to a single matrix multiply, and there is nothing interesting left to learn.

But nonlinearities have a second job, one you do not see until you run backprop. The derivative of the activation is the exchange rate at each node in the backward pass. When a neuron's pre-activation lands in a region where the derivative is near zero, the gradient through it collapses. Nothing gets through. It does not matter what is happening in the layers above.

This is why the derivative curve matters more than the function itself.

[VISUAL: A two-column grid showing four activation functions side by side: sigmoid, tanh, ReLU, and Leaky ReLU. Left column of each pair: the function plotted from x = -6 to 6. Right column: the derivative of that function on the same x range. For sigmoid and tanh, the derivative forms a bell shape that peaks in the center and drops steeply toward zero at both extremes. For ReLU, the derivative is a hard step: 0 on the left, 1 on the right. For Leaky ReLU, the same step but with a small nonzero value (0.01) on the left instead of 0. Caption below: "The derivative curve tells you where gradients can flow. Flat regions mean stalled learning."]

Sigmoid: The Classic S-Curve

Sigmoid was the original choice. It squashes any real number into (0, 1):

σ(x)=11+ex\sigma(x) = \frac{1}{1 + e^{-x}}

The derivative, σ(x)=σ(x)(1σ(x))\sigma'(x) = \sigma(x)(1 - \sigma(x)), has a convenient property: you only need the output value to compute it, no need to cache the input. It peaks at 0.25 when x=0x = 0.

0.25. That is the best exchange rate sigmoid ever offers.

Now chain ten of them. Each layer, best case, passes a quarter of the gradient. Ten layers: 0.25101060.25^{10} \approx 10^{-6}. A millionth of the original signal. And that is the ceiling, assuming every neuron sits right at x=0x = 0. Once pre-activations drift into the flat regions, the exchange rate drops well below 0.25. The gradient vanishes before it reaches the early layers.

This is why deep sigmoid networks could not train. The gradient died on the way down.

Tanh: Centered but Still Saturating

Tanh looks like a stretched sigmoid, ranging from -1 to 1 instead of 0 to 1:

The centering is a real improvement. Sigmoid outputs are always positive, which pushes gradient updates to have the same sign across a layer. Tanh outputs are balanced around zero, which avoids that.

But look at the derivative, tanh(x)=1tanh2(x)\tanh'(x) = 1 - \tanh^2(x). It peaks at 1.0 when x=0x = 0, then drops toward zero at large inputs, exactly like sigmoid. The saturation zones are still there. The gradient still collapses at the extremes. Tanh is a meaningful step forward, but does not fix the underlying problem.

ReLU: The Gradient Preserver

ReLU is almost aggressively simple:

a=max(0,x)a = \max(0, x)

The derivative is binary: 1 where the input was positive, 0 where it was not. No saturation on the positive side. When a neuron fires, the gradient flows through at exactly 1.0. No compounding decay. A hundred active neurons in a chain, and the gradient arrives just as strong as it left.

This is what made deep networks trainable. When every active neuron passes the gradient at full strength, you can stack a hundred layers and the signal arrives intact.

The price is dead neurons. If a neuron's pre-activation is negative, it outputs zero, its gradient is zero, its weights receive no update. A few dead neurons are manageable. A large fraction of your hidden layer going dark is a real problem. We cover this in §6.3.

Modern Variants: Fixing the Dead Zone

The fixes all share one idea: keep ReLU's gradient-preserving positive side, but give negative inputs some way to contribute.

Leaky ReLU: Instead of a hard zero on the negative side, pass a small fraction through.

The gradient is 1 when the neuron is active, 0.01 when not. Still not zero. The neuron can still receive a small signal even when it was inactive, which gives it a chance to recover.

GELU (Gaussian Error Linear Unit): Where ReLU has a sharp kink at zero, GELU rounds it off. The transition from suppression to pass-through is smooth, and so is the derivative:

No hard zero means no dead neurons. The derivative is messier than ReLU's clean 0-or-1, but it is smooth everywhere and never quite reaches zero. GELU is the default activation in BERT, GPT, and most modern transformers.

SiLU / Swish: The input gates itself by multiplying through its own sigmoid:

Large positive inputs pass through almost unchanged, since sigmoid is near 1. Large negative inputs get suppressed, since sigmoid is near 0. But the suppression is gradual, not a hard cutoff, so the gradient never hits zero. LLaMA and many recent architectures use SiLU.

From hard thresholds to smooth gates. The history of activation functions is largely a history of fixing gradient flow.

When Neurons Die: Understanding Saturation

Your loss is dropping. You've been training for an hour and the curve is moving, but barely. Doubling the learning rate doesn't help. The network is computing. Outputs look reasonable. But something is absorbing the gradient and returning almost nothing.

Usually, it's a saturated neuron.

Back in §6.2, the activation derivative was the exchange rate: how much of the incoming gradient a neuron passes upstream. Sigmoid peaks at 0.25. An active ReLU passes 1.0. Exchange rates multiply through chains.

Ten sigmoid neurons, each at their absolute best, pass 0.25101060.25^{10} \approx 10^{-6} of the original gradient. In practice the rates are much worse, because neurons rarely sit at their peak. Once a pre-activation drifts into the flat tail of the sigmoid curve, the exchange rate drops from 0.25 toward zero.

That drift is saturation.

What Saturation Actually Is

Say a sigmoid neuron's pre-activation settles at x=7x = 7.

The forward pass outputs a clean 0.999. Nothing looks wrong. But the backward pass multiplies the incoming gradient by 0.001 before passing anything upstream. Whatever signal was headed for earlier weights has to squeeze through here. It exits as 0.1% of what it entered.

This is what makes saturation hard to spot. The forward pass looks fine. Loss might even be declining. But the gradient is bleeding out at each saturated neuron, and early layers are getting almost no learning signal.

[DIAGRAM: A horizontal chain of five sigmoid neurons, labeled with large pre-activation values (e.g., 6, -5, 7, -4, 6) that place them deep in the flat saturation zone. An arrow enters from the right labeled "gradient = 1.0". Between each neuron, show a vertical bar representing the gradient passing through, shrinking dramatically at each step: 1.0, 0.001, 0.000001, barely visible, then a dot. Color the bars from bright green (strong signal) to washed-out gray (near-zero). Caption: "Each saturated neuron multiplies the gradient by a number close to zero. Five neurons in the flat zone, and nothing reaches the left side."]

The Distribution Problem

A single saturated neuron is annoying. An entire layer saturating at once is catastrophic.

After initialization, pre-activations sit near zero. That's exactly where sigmoid and tanh have their best exchange rates. But as training progresses, weight updates push pre-activations toward the tails. Hundreds of neurons start multiplying by near-zero at once. The gradient arriving at earlier layers collapses.

You can watch it happen. Track the distribution of pre-activations across a layer during training. A healthy layer keeps values near zero. A saturating one shows the histogram drifting toward the extremes.

The loss alone won't tell you this. Weights might still be updating, just so slowly it looks like a learning rate problem rather than a gradient problem.

Dead ReLU: The Irreversible Case

Saturation in sigmoid and tanh is bad, but recoverable. If pre-activations drift back toward zero, gradients start flowing again.

ReLU's version is not recoverable.

When a ReLU neuron's pre-activation is negative, the output is exactly zero and the local gradient is exactly zero. Not small. Zero. The weight gradient is zero. The weights don't move. The neuron outputs zero next batch too. The gradient is zero again. Stuck.

The neuron outputs zero, so its gradient is zero, so its weights don't update, so it keeps outputting zero. No learning path out of this. The only way to fix a dead ReLU is to stop training and reinitialize.

A few dead neurons aren't a crisis. Reasonable initialization and learning rates keep the dead fraction small, and the rest of the network compensates. But push the learning rate too high, initialize poorly, or let a bias drift negative, and the dead fraction climbs. Once a significant portion of a layer dies, that capacity is gone. Nothing brings it back mid-training.

This is why Leaky ReLU, GELU, and SiLU exist. A small nonzero gradient on the negative side means a stuck neuron still receives some signal. It might recover. Weak but nonzero: the difference between "slow to learn" and "will never learn."

Shape Discipline: Getting the Dimensions Right

The forward pass runs. The loss prints. You feel good. Then you call backward and the stack trace mentions two shapes that have nothing to do with each other. Or worse: no error at all. Just a loss curve that flatlines for a hundred epochs.

We covered shape rules in §3.4 and §4.5. This section is where they become muscle memory. Almost every backprop bug you write by hand is a shape bug. Sometimes loud: a dimension mismatch with both shapes in the error. Sometimes silent: wrong gradients, no crash, and a network that quietly refuses to learn.

The Golden Rule of Gradient Shapes

One rule does most of the work:

The gradient of a tensor has exactly the same shape as the tensor itself.

No wiggle room. The reason is mechanical. You subtract the gradient from the parameter:

In a two-layer network, the shapes lock in like this:

Broadcasting: When Shapes Don't Match (On Purpose)

NumPy's broadcasting feels like cheating. You write Z = X @ W.T + b with b of shape (128,) and it just works against a (32, 128) batch. No loop, no np.tile. The bias is silently replicated 32 times.

Backprop does not get the same free pass. Whatever expansion you got for free forward, you undo backward. The way to undo replication: sum along the replicated axis:

That's the rule: broadcast forward, sum backward, along the same axes.

Three Bugs You Will Write More Than Once

Bug 1: Wrong Matrix Multiply Order

Bug 2: Forgetting to Sum Over Batch

Bug 3: Reshaping Instead of Transposing

Building Shape Intuition

The trick to shape intuition is naming dimensions, not numbering them. (32, 784) is opaque. (batch_size, input_features) tells you what each axis means. Once you know that, you know whether to sum it or keep it.

Shape discipline feels pedantic at first. It stops feeling that way the third time it catches a bug for free. Shapes match, gradients match. Shapes don't, and the mismatch names the bug before you read a single line of code.

Putting It Together: A Complete Layer Implementation

The last four sections each took one piece in isolation. The affine transform. Activations. Saturation. Shape rules. But they constrain each other in ways that only show up when they run together. The activation you pick dictates the initializer. The initializer determines whether pre-activations stay safe or land in the saturation zones from §6.3. The cache the forward pass writes is exactly the cache the backward pass reads. Shape assertions in the backward pass check commitments the forward pass made.

One class, all four pieces:

Worth sitting with a few details.

The initialization scaling, np.sqrt(2.0 / input_dim), is He initialization, calibrated for ReLU. Swap in sigmoid without changing this scaling and pre-activations land straight in the saturation zones from §6.3. The initializer and the activation are coupled. Pick one, the other is constrained.

The cache stores self.X and self.pre_activation. Not the post-activation self.A. ReLU's backward only needs to know which neurons were positive, so it checks self.Z > 0 on the stored pre-activation. The exact output values are never used.

The saturation warning lives in forward, not backward. This is deliberate. Pre-activations get their values during the forward pass. By the time the backward pass runs, the damage is already done. Check where the problem originates, not where it manifests.

Shape assertions: not paranoia. The fastest way to catch a wrong transpose or a missing outer product before it shows up as a silent flatline fifty epochs later.

Gradient clipping is a ceiling, not a cure. If it fires consistently, something upstream is producing large gradients. The clip handles the symptom. The source is elsewhere.

Now stack these layers. Each one answers the same two questions: what are the gradients for my parameters, and what gradient do I pass back? That is it. No layer needs to know what happens upstream or downstream. It multiplies, accumulates, and returns a gradient. The chain rule wires the answers together. Two layers, twenty layers, two hundred: the pattern does not change.

One layer is interesting. Two layers wired into a classifier is where it starts to feel like a real model. That is where we go next.

Two-Layer Classifier: The Complete Picture

We have all the pieces: affine transforms, ReLU, the chain rule, shape discipline. Wire them together and you have a classifier.

The setup: MNIST digit classification. Input is 784 pixel values (a 28×28 image, flattened). Hidden layer has 128 units with ReLU. Output is 10 probabilities, one per digit. Two weight matrices, two bias vectors. The forward pass is four lines of math.

Forward Expressions: Building Your Classifier

Four operations, one scalar loss. Keep the shapes in view. When something breaks, those are what you check first.

Layer by layer:

Layer 1: Feature extraction

Layer 2: Class scoring

Output: Class probabilities

Loss: How wrong were we?

Run it on an actual digit. Say, a "3":

Layer 1 compressed 784 pixels into 128 features. ReLU zeroed the negatives. Layer 2 turned those features into 10 raw scores. Class 3 won (logit 8.1), softmax pushed it to 96%, and the loss landed at 0.041. That is what a correct prediction looks like from the inside.

Those pre-softmax scores have a name.

Stable Softmax: Don't Let Your Exponentials Explode

Softmax looks harmless on paper:

pi=ezijezjp_i = \frac{e^{z_i}}{\sum_j e^{z_j}}

Code it up. Run it on a trained network. Works fine for a while. Then one batch arrives where a logit hits 1000. You compute e1000e^{1000}. In float32, that overflows to inf. The denominator is inf. Every probability becomes nan. The loss is nan. The gradients are nan. Training is over.

The mirror failure is quieter. All logits very negative, say [-1000, -1001, -1002]. Every exponential underflows to zero. Denominator is zero. 0/0 = nan. Same end result, different path.

The fix: shift all logits by a constant before exponentiating. The probabilities do not change:

pi=ezicjezjcp_i = \frac{e^{z_i - c}}{\sum_j e^{z_j - c}}

Because ece^{-c} appears in both numerator and denominator and cancels: ezicjezjc=eceziecjezj=ezijezj\frac{e^{z_i - c}}{\sum_j e^{z_j - c}} = \frac{e^{-c} \cdot e^{z_i}}{e^{-c} \cdot \sum_j e^{z_j}} = \frac{e^{z_i}}{\sum_j e^{z_j}}

Set c=maxjzjc = \max_j z_j. The largest exponent you ever compute is e0=1e^0 = 1. No overflow. Every other exponent is at or below 1.

Softmax is stable now. The loss is not.

Cross-entropy computes log(p[true_class]). If the true class has probability 104010^{-40}, that log is a large negative number. Fine. But if p underflows to exactly zero, log(0) is -inf. The loss blows up.

The fix: never materialize the intermediate probability. Fuse softmax and cross-entropy into one expression:

By staying in log-space, no intermediate probability ever gets small enough to underflow. This is why PyTorch bundles them: F.cross_entropy and nn.CrossEntropyLoss both implement this fused version. Pass raw logits and PyTorch handles the numerics. Pass pre-computed probabilities and you are on your own.

Backward Pass Line by Line: Where Every Gradient Comes From

Section 7.2 gave us our starting gradient: grad_z2 = p - one_hot(y). Now walk it backward, applying the chain rule at each operation in the forward pass. The cache holds exactly what each step needs.

Two layers, one ReLU. Four backward steps.

Gradient Through Layer 2

Take W2[i,j]. In the forward pass it appeared as z2[i] += W2[i,j] * h[j], so its local derivative is h[j]. Multiply by the upstream gradient: grad_z2[i] * h[j]. Do that for every (i,j) pair and you get the outer product.

The bias is simpler. b2[i] adds directly to z2[i], local derivative 1. Gradient copies straight through.

Now h[j]. It contributed to every output z2[i] through weight W2[i,j], so gradients from all 10 outputs flow back into it. Sum them: that is the W2.T multiply.

Gradient Through ReLU

ReLU is a gate. When z1[k] was positive in the forward pass, it passed the value through unchanged. Local derivative is 1. Gradient flows. When z1[k] was zero or negative, it blocked. Local derivative is 0. Gradient stops. That's why we cached z1 in the forward pass: not to replay the value, but to know which neurons were open.

Gradient Through Layer 1

Compare this to layer 2. Same outer product, same copy, same W.T multiply. Only the variable names changed. One rule, both layers.

Complete Backward Pass

Put it together:

That WTW^T in grad_h and grad_x is not a coincidence. It shows up every time a gradient flows backward through a linear map. Section 7.4 is about exactly that.

Why All Those Transposes?

Look at the backward pass from section 7.3. Two lines share a pattern:

Every time a gradient flows backward through a linear layer, the weight matrix gets transposed. Not a rule to memorize. It falls straight out of the chain rule.

Take a small W: two outputs, three inputs.

W = [[0.5, 0.3, 0.8],
     [0.1, 0.9, 0.2]]

Forward reads rows. Each row computes one output:

y1 = 0.5·x1 + 0.3·x2 + 0.8·x3
y2 = 0.1·x1 + 0.9·x2 + 0.2·x3

The backward pass arrives with grad_y = [g1, g2]. What's the gradient for x1?

x1 shows up in both equations: multiplied by 0.5 in the first, by 0.1 in the second. Both paths carry blame:

∂L/∂x1 = g1 × 0.5 + g2 × 0.1

Same for x2 (g1 × 0.3 + g2 × 0.9) and x3 (g1 × 0.8 + g2 × 0.2).

Now look at what you actually grabbed. For x1: weights [0.5, 0.1]. For x2: [0.3, 0.9]. For x3: [0.8, 0.2]. Those are the columns of W.

Forward reads rows. Backward reads columns.

W.T flips columns into rows so a single matrix multiply computes every input gradient at once:

W.T = [[0.5, 0.1],    # x1's connections (was column 1)
       [0.3, 0.9],    # x2's connections (was column 2)
       [0.8, 0.2]]    # x3's connections (was column 3)

grad_x = W.T @ [g1, g2]
       = [0.5*g1 + 0.1*g2,   # ∂L/∂x1
          0.3*g1 + 0.9*g2,   # ∂L/∂x2
          0.8*g1 + 0.2*g2]   # ∂L/∂x3

Same weights. Different direction.

[VISUAL: Two diagrams side by side. Left diagram labeled "Forward: rows." Three input nodes (x1, x2, x3) on the left connect via arrows to two output nodes (y1, y2) on the right. Row 1 of W (weights 0.5, 0.3, 0.8) highlighted in blue -- three arrows all pointing right to y1. Row 2 (0.1, 0.9, 0.2) highlighted in orange -- three arrows pointing right to y2. Right diagram labeled "Backward: columns." Same weights, same nodes. Now gradient signals g1 and g2 enter from the right. For x1: two arrows from y1 (weight 0.5, scaled by g1) and from y2 (weight 0.1, scaled by g2) converge leftward at x1, labeled "0.5g1 + 0.1g2". Same pattern for x2 and x3. Caption spanning both panels: "Forward reads rows. Backward reads columns. W.T converts columns into rows so the matrix multiply handles it."]

The algebra says the same thing. For any vector vv and matrix WW:

v(Wx)=(Wv)xv^\top (Wx) = (W^\top v)^\top x

Move WW across a dot product and it transposes. Write out L/xi\partial L / \partial x_i with the chain rule and this identity is exactly what falls out. The transpose is not a rule we imported. It is what the chain rule naturally produces at a linear layer. Same weights that carry activation forward carry credit backward, just read column-first instead of row-first.

Batching: When Dimensions Multiply

Batching the forward pass is mechanical. Stack 32 images as rows. One matrix multiply. Everything picks up a leading 32 and the math applies row by row.

The backward pass is where I had to stop and think. The weight W2[3,17]W_2[3, 17] contributed to all 32 predictions. All 32 loss terms have an opinion about which direction that weight should move. You can only update it once per step. What do you do?

Forward: stacking is all it takes

One example: xx is (784,)(784,). The first layer computes z1=W1x+b1z_1 = W_1 x + b_1, giving (128,)(128,).

Stack 32 examples as rows. XX is now (32,784)(32, 784).

Z1=XW1T+b1(32,784)(784,128)+(128,)=(32,128)Z_1 = X W_1^T + b_1 \quad (32, 784) \cdot (784, 128) + (128,) = (32, 128)

Row ii of Z1Z_1 is identical to what you'd get running example ii alone. The matrix multiply does all 32 at once. The weights W1W_1 are still (128,784)(128, 784). The bias b1b_1 is still (128,)(128,), broadcasting across all 32 rows.

Everything downstream follows the same pattern. HH is (32,128)(32, 128), softmax runs per row, PP is (32,10)(32, 10). The activations all carry a leading 32. The weights never do.

The weight gradient question

The total loss averages over the batch:

L=132i=132LiL = \frac{1}{32}\sum_{i=1}^{32} L_i

The chain rule on a sum is a sum of chain rules. The gradient of LL with respect to W2W_2 is the average of the per-example weight gradients:

LW2=132i=132LiW2\frac{\partial L}{\partial W_2} = \frac{1}{32}\sum_{i=1}^{32} \frac{\partial L_i}{\partial W_2}

For a single example, the weight gradient is an outer product: Li/W2=outer(gradz2(i), h(i))\partial L_i / \partial W_2 = \text{outer}(\text{grad}_{z_2}^{(i)},\ h^{(i)}). That is a (10,)(128,)=(10,128)(10,) \otimes (128,) = (10, 128) matrix. You get 32 such matrices. Sum them, divide by 32.

grad_Z2.T @ H computes exactly this, in one shot:

Matrix multiply computes all 32 outer products and sums them. The batch dimension disappears. grad_W2 comes out (10,128)(10, 128), same shape as W2W_2 itself. 32 examples, one update direction.

For biases, same logic. The gradient flows back from (32,10)(32, 10) and sums to (10,)(10,):

Input gradients are different. Each example has its own input. grad_X stays (32,784)(32, 784), one gradient row per example.

Broadcast up, sum down

One rule governs batching in the backward pass.

Whatever got broadcast going forward gets summed going backward.

b1b_1 was (128,)(128,) and broadcast to all 32 rows. Backward, gradients from all 32 rows sum back to (128,)(128,). W1W_1 was shared across all 32 inputs, so 32 outer products collapse into one (128,784)(128, 784) gradient. Inputs had no sharing. Their gradients keep the batch dimension.

[VISUAL: Two panels side by side. Left panel labeled "Forward: broadcast." A small rectangle labeled "b1 (128,)" sits at the top. Three arrows fan out downward to three separate rows of a wide matrix labeled "Z1 (32, 128)," illustrating how one bias vector broadcasts to all rows. On the left edge, a rectangle labeled "W1 (128, 784)" shows a single weight matrix used against all rows. Right panel labeled "Backward: sum." The same three rows of Z1, but now three arrows converge upward into a single rectangle labeled "grad_b1 (128,) = .sum(axis=0)." A stack of small matrices with a plus sign between them collapses into one rectangle labeled "grad_W1 (128,784) = grad_Z1.T @ X." Caption below both panels: "Broadcast forward, sum backward. The batch dimension appears in activations going forward, then collapses for parameter gradients going backward. Input gradients keep the batch dimension -- they are not shared."]


Sum or average?

Mathematically, the gradient of the total loss is a sum. Dividing by batch size makes it an average. Both are valid. The choice is convention.

The practical reason most frameworks average: if you sum, gradient magnitude scales with batch size. Batch of 128 gives gradients 4x larger than batch of 32. You would need to rescale your learning rate every time batch size changes. Averaging keeps gradient magnitude stable across batch sizes, so one learning rate works regardless.

If you average the loss, average the gradients. Be consistent.

Shape table

Every gradient has the same shape as its forward value. Parameter shapes have no batch dimension. Activation shapes do.

If grad_W2 ever comes out (32,10,128)(32, 10, 128) instead of (10,128)(10, 128): you forgot to reduce the batch dimension. Shape mismatch is always a bug. This table is your debugger.

The Complete Implementation

The backward method has nothing new. Every line traces to something we already derived. grad_Z2 = P - one_hot(y) is section 7.3. grad_Z2.T @ H is the batched outer product from 7.5. grad_H * (Z1 > 0) is the ReLU mask from 7.3. The implementation is the algorithm, written as a class.

Three methods, clean boundaries. forward runs the computation and writes the cache. backward reads the cache and computes every gradient. update applies them. Want Adam instead of SGD? Only update changes. Forward and backward don't care.

Run this on MNIST with a batch size of 256 for a hundred epochs and the loss falls steadily, landing around 95% test accuracy.

Two things worth sitting with.

The cache is minimal. forward stores Z1 (for the ReLU mask), H (for grad_W2), P (starting gradient), X (for grad_W1), and y (to identify the correct class). Every one of those fields gets read in backward. No dead weight. And nothing backward needs was left out.

The division by B happens exactly once, at the starting gradient. It propagates naturally through every outer product and sum that follows. No second averaging step downstream.

This structure scales. More layers means more cache entries and more backward steps, but each step follows the same local rules. Swap the activation and one mask changes. Swap the loss and only the starting gradient changes. The skeleton is the same.

Conclusion for Part 1: What We Want To Carry Forward

The backward method for our two-layer classifier fits on a screen. One pass, exact gradients, all 101,770 parameters.

At the top of this post, you called loss.backward() and a few milliseconds later every parameter had a .grad. Now you know what those milliseconds contain. The forward pass writes down the facts. The backward pass reads them in reverse, applies one local rule at each node, and every gradient falls out. Not magic. Careful accounting.

What carries forward is not the specific lines of code. It is the structure behind them.

Values flow forward, adjoints flow backward. The adjoint xˉ\bar{x} has the same shape as xx, always. Shape mismatch means a wrong rule, not an exception. The cache holds exactly what the backward pass will need, nothing else. The division by batch size happens once, at the start, and propagates through naturally. These are not details to memorize. They are invariants. When something goes wrong in a backward pass, these are the first things to check.

A few ideas are worth sitting with longer than the code suggests.

grad_Z1 = grad_H * (self.Z1 > 0). One line. But it shows you what an activation function actually does during training: it decides which neurons get to learn. Neurons that were negative in the forward pass contribute nothing to the backward pass. They do not update. If enough neurons stay consistently negative, gradient flow stops in that region of the network. Not because something broke. Because the activation function did exactly what it was asked to do.

Stable softmax is a different lesson. Computing in log space before exponentiating changes the numerical range without changing the mathematical answer. This pattern recurs anywhere you have exponentials followed by normalization. Learn it here on a toy network, recognize it everywhere else.

Then there is the cache cost. Inference discards intermediates; training keeps them. The memory spike when you call loss.backward() is not the backward pass consuming memory. That memory was committed the moment you ran the forward pass. The backward pass is just paying the bill.

We derived every rule by hand, which is the right way to see it once. But it does not scale. Part 2 zooms out to the view that makes large networks writable. Every backward step, for any layer, compresses into one operation: multiply by the transposed Jacobian. Convolutions, normalization, attention. Once you see that the backward pass is always "transpose and multiply," writing new layers stops being scary.

That is the unifying view. Part 2 builds it.

Backpropagation Part 2: Patterns, Architectures, and Training

References and Further Reading

What you read next depends on where the gaps are.

Build it

If there is one thing worth doing after this post, it is Karpathy's micrograd lecture. He covers most of the same ground but builds a working autograd engine live. Two and a half hours. Watch it with a code editor open.

Verify your mental model with concrete numbers

Visual intuition

The math, more carefully

For topics in Parts 2 and 3

The big picture

  • Geoff Hinton, Yoshua Bengio & Yann LeCun, Deep Learning: NIPS'2015 Tutorial — the three people most responsible for the deep learning resurgence, explaining it together. Worth reading as history as much as tutorial.

The original paper

The 1986 Rumelhart, Hinton, and Williams paper is about 12 pages. Section 3 is the part that matters. Not an easy read as an introduction, but after this post you have the prerequisites to follow what they were arguing and why it mattered.