October 24, 2025 · 138 min read · foundation

Backpropagation Part 1: From Graphs to a Working MLP

Backprop computes a million gradients for the price of two forward passes. From computational graphs to adjoints, from the chain rule to a working neural network, this is the algorithm that made deep learning possible, demystified here step by step.

The Gradient Bottleneck

Imagine you're training a neural network with a million parameters. You need to know how to adjust each weight to reduce your error. Simple enough in principle: for each weight, figure out if increasing it makes things better or worse, then move in the right direction.

To make this happen, you need a mechanism that, first, lets you change a weight and, second, shows you what happens when you change it. Let's imagine the most intuitive (and possibly the most suboptimal) way to do this: Weight #1: bump it up, run the ENTIRE neural network, check the output (the "loss"). Weight #2: same process. Weight #3: same. All the way to weight #1,000,000.

It doesn't take long to see what a computational nightmare this is. Running the entire network for a single weight change? It's too much! If your million-parameter network takes 100 milliseconds to process an image (pretty fast by modern standards), you're looking at 100,000 seconds just to compute how to take one tiny training step. That's 27 hours. For one gradient! And modern networks need millions of these steps.

This made training nontrivial neural networks impractical. In the 1960s and 70s, researchers knew exactly what to compute: the derivative of the loss with respect to every parameter.

The compute burden was the blocker: networks of interest had thousands or millions of parameters, and the naive method scaled linearly with that count.

The field was stuck as the computation was intractable. The blueprint was clear, but the execution was out of reach (almost like knowing exactly how to build a skyscraper but only having a teaspoon to move dirt).

Then in 1986, Rumelhart, Hinton, and Williams showed the field something that had been hiding in plain sight. Not a new learning algorithm, but a computational trick borrowed from other fields: you could compute all those millions of gradients in roughly the same time it takes to compute just one.

The flow at a high level was this:

  • Run a forward pass on the network to compute the loss (just as you did before), but with one change: just save some intermediate values per neuron.
  • Now run a single "backward" pass.
    -- This backward pass knows 2 things: the final loss value, and the intermediate values of each neuron in the network.
    -- Using these two pieces of information, the backward pass can calculate how much that neuron's weights impact the final loss.
  • The algorithm can then choose to update the weights accordingly. Meaning that all weights can be updated in a single backward pass.

Cost of this backward pass? About the same as the forward pass! To put it another way, instead of a million forward passes for a million parameters, you just need one forward pass and one backward pass.

This wasn't even a new algorithm. Reverse-mode automatic differentiation (the official mathematical term) had been around since the 1970s in control theory and numerical analysis. The AI community may just have been looking in the wrong places, trying to find brain-inspired solutions when what they needed was a clever accounting trick from applied mathematics.

This reduced training times by orders of magnitude: days instead of centuries.

The Gradient: Your Local Compass

Before diving into why the obvious approaches fail and how backpropagation solves the problem, let's make sure we're on the same page about what a gradient actually is. If you've read the gradient descent post, you can skip this section. For everyone else, here's the intuition.

What Is a Gradient, Really?

A gradient answers one simple question: "If I change my parameters slightly, how much will my loss change?"

But here's the crucial bit: in a neural network, you don't just have one parameter. You have millions. And you need to know how the loss changes with respect to every single one of them.

Each gradient component tells you two things:

  • Direction: Should I increase or decrease this weight? (That's the sign: + or -)
  • Urgency: How much does this weight matter? (That's the magnitude)

Picture a tiny network with just 10 weights. The gradient isn't a single number. It's 10 numbers, one for each weight:

  • Weight #1's gradient: -0.3 (decreasing this weight reduces loss)
  • Weight #2's gradient: +0.8 (this weight is making things worse, decrease it!)
  • Weight #3's gradient: -0.001 (this weight barely matters right now)
  • ... and so on for all 10 weights

Scale this up. A network with a million parameters needs a million gradient values. GPT-3 with 175 billion parameters? That's 175 billion individual gradient values we need to compute. Every. Single. Training. Step.

Consider this: you're standing on a hillside in thick fog. You can't see the valley below, but you can feel the slope under your feet. In neural networks, this "hillside" has millions of dimensions, and you need the slope in every single direction.

The gradient gives you exactly that: a complete description of the local slope in every direction. In math notation, if your parameters are θ = [w₁, w₂, ..., w₁₀₀₀₀₀₀], then what you need is how each of these weights impacts the loss L. To find out how weight 1 (w₁) impacts L, you compute ∂L/∂w₁. To find out how weight 2 impacts L, you compute ∂L/∂w₂. Doing the same for every weight, the overall gradient ∇L is the vector [∂L/∂w₁, ∂L/∂w₂, ..., ∂L/∂w₁₀₀₀₀₀₀]. Each component answers "how does this specific weight affect the loss?"

The key insight here is locality. The gradient doesn't tell you where the global minimum is. It doesn't even tell you if there is a minimum nearby. All it tells you is: right here, right now, which direction should I step to decrease the loss? It's a purely local measurement. And that's all gradient descent needs to work.

A large gradient means you're on a cliff face: small steps in that direction cause big changes in loss. A tiny gradient means you're on a plateau: even large steps barely change anything. This extreme locality is both gradient descent's greatest weakness and its greatest strength. Weakness because you can get trapped in local minima or wander plateaus forever. Strength because you only need to compute local information, making the algorithm scale to billions of parameters.

Why Brute Force Fails

Now that you understand what a gradient is (a collection of millions of numbers telling you how each parameter affects your loss), let's do a small exercise: imagine you're an engineer facing this problem.

You have a neural network with, say, one million parameters. You need to compute the gradient: how the loss changes with respect to each of those million weights. You know the math (it's just derivatives).

So the question becomes, how would you actually compute it?

The First Instinct: Just Try Each One

We cleared this up regarding brute force earlier (27 hours vs 200ms for one gradient), but let's say you don't know that and all you know is that you need to compute derivatives for each weight.

With that premise, here's what most engineers would try first:

  1. Pick weight #1
  2. Increase it by a tiny amount (say, 0.00001)
  3. Run the entire network forward with this new weight
  4. See how much the loss changed
  5. That change divided by 0.00001 is your gradient for weight #1
  6. Reset weight #1 to its original value
  7. Repeat for weight #2, weight #3, ... weight #1,000,000

This approach has a name: finite differences. And mathematically, it's perfectly valid:

\frac{\partial L}{\partial w_i} \approx \frac{L(w_i + \epsilon) - L(w_i)}{\epsilon}

where ε is your tiny nudge (like 10⁻⁵).

But as we saw before, here's the killer: each gradient requires a full forward pass through your network. So if one forward pass through a million parameters takes 100ms (pretty fast!), you're looking at 100,000 seconds = ~27 hours just to compute one gradient. Remember, you need to compute thousands or millions of these during training.

For GPT-3's 175 billion parameters: at 100ms per forward pass, that is 175 billion × 0.1 seconds, which comes to about 554 years per gradient (the model will complete training for sure, just not before the heat death of the universe).

The Second Idea: Just Derive The Formula

"Okay," you might think, "instead of numerically approximating the gradient, why not derive an exact formula? I know calculus!"

The idea would be to use symbolic differentiation where you treat your entire neural network as one big mathematical expression and use calculus rules to derive exact gradient formulas for every weight. No approximation errors like finite differences. Perfect derivatives computed symbolically, just like in your calculus homework.

Except this approach faces a different computational nightmare: the formulas themselves become impossibly large. A deep network's gradient formula would expand into millions of terms. You'd need gigabytes just to store the symbolic expression for a single weight's gradient, let alone compute with it. The cure would be worse than the disease.

The Breakthrough: Your Forward Pass Already Did The Hard Work

Both naive methods fail for the same fundamental reason: they don't recognize that computing a million gradients isn't a million separate problems. It's one problem with shared structure.

Here is the key observation: when you run your network forward to compute the loss, you are already computing the ingredients needed for every gradient. An intuitive way to see this is an assembly line: the final inspection stage reports what's wrong with the product, and when you trace the process backwards, you see how each earlier station partially contributed to that outcome, and therefore how each station needs to change to make the final inspection result better.

Now let's be a bit more specific and go deeper with this scenario: Imagine your network processing a picture of a cat:

Layer 1: Takes raw pixels, applies a million weights, produces feature detections. Maybe neuron #47 fires strongly (outputs 32.7) because it saw an edge. Neuron #892 stays quiet (outputs 0) because its pattern wasn't there.

Layer 2: Takes those feature detections, combines them with another million weights, starts building bigger patterns. Some neurons fire (5.2, 9.1), others don't.

Layer 3: Takes those patterns, makes a final decision: "73% cat, 12% dog, 15% other."

Loss: Compares to the true label ("cat"), outputs a single disappointment score: 2.31

A critical point: those specific numbers computed in the forward pass (for example 32.7 for the edge detector and 5.2 for the pattern builder) are not just activations. They are exactly the pieces needed to update every weight in the network.

Instead of this simple network, now assume a very large one with millions (or even billions) of parameters. Think about a random weight #47,293 somewhere in layer 2. To know how to adjust it, we need to answer: "How much did this weight contribute to our loss of 2.31?"

We can answer that using only two pieces of information:

  1. What input did this weight see? (That specific number from layer 1 already computed: 32.7)
  2. How much did this weight's output affect the final loss? (This comes from layer 3 in the backward pass)

The weight's contribution is just these two numbers multiplied together. That's it. No rerunning the network. No complex formulas. Just multiplication of numbers we already have or are about to compute anyway.

To make this even more concrete, imagine that during the forward pass each neuron is taking notes:

  • "I received these inputs: [32.7, 0, 18.4, ...]"
  • "I used my weights: [0.5, -0.2, 0.8, ...] to process them"
  • "I produced this output: 5.2"
  • "I applied ReLU, so I was active (not blocked)"

During the backward pass, each neuron gets a message from the layer above it: "Your output affected the loss by 0.3."

Now the neuron can immediately figure out:

  • "For my weight 0.5 that saw input 32.7: its gradient is 0.3 × 32.7 = 9.81, telling me how much impact I have on the loss (magnitude: 9.81) and in what direction (positive means increasing this weight increases loss)"
  • "For the neuron below that fed me 32.7: send back gradient 0.3 × 0.5 = 0.15 (using my own incoming weight 0.5), telling it how much it contributed to the loss through this connection"

It passes that second message down, and the process continues. Each layer only does local calculations with numbers it has direct access to.

An interesting part is that this isn't some hack specific to neural networks. Any computation where you go step by step forward can use this trick:

  1. Forward: Compute your result, remember intermediate values
  2. Backward: Use those remembered values to figure out how each step contributed

In fact, this pattern shows up everywhere: physics simulations, rendering engines, even Excel spreadsheets. But neural networks are perfectly suited for it because they're just sequences of simple operations (multiply, add, apply ReLU) repeated millions of times.

The only price you pay? Memory. Every intermediate value needs to be stored until the backward pass uses it. For a modern network processing images:

  • 10 layers × 1000 neurons per layer = 10,000 numbers to store per image
  • Processing 32 images at once? That's 320,000 numbers
  • Big networks with billions of parameters? Gigabytes of storage just for these intermediate values

But consider the alternative: centuries of computation versus gigabytes of memory. In a world where memory is cheap and time is expensive, backprop made the only sensible trade-off. It transformed neural networks from a theoretical curiosity into the foundation of modern AI, all by recognizing that the forward pass was already doing most of the work.

Backpropagation's True Identity

Something to keep in mind: backpropagation isn't really about neural networks. It's not even about machine learning. It's a specific instance of a more general technique called automatic differentiation (usually called autodiff).

When you write y = x * 2; z = y + 3, you're defining a computation. Automatic differentiation tracks how values flow through this computation and accumulates derivatives along the way. No formula manipulation, no approximation. Just exact derivatives computed alongside your regular calculation.
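
To see what "tracking" means, here's a tiny sketch in plain Python. (This particular direction of bookkeeping is forward-mode for brevity; backprop runs the same idea in reverse.)

x, dx = 5.0, 1.0           # a value and its derivative with respect to x
y, dy = x * 2, dx * 2      # y = 10, dy/dx = 2
z, dz = y + 3, dy * 1      # z = 13, dz/dx = 2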

Backpropagation is basically automatic differentiation, done in reverse mode, applied to neural networks. One output (the loss), millions of inputs (the parameters). Reverse mode is perfect for this geometry: it computes how one output depends on all inputs in a single backward pass.

Functions as Compositions and Graphs

Now that we have seen the computational efficiency of backprop, let's see what makes it work. The key isn't in the algorithm itself. It's in recognizing this pattern: every neural network, no matter how complex, is just simple operations chained together.

Think about what happens when you evaluate y = (x + 2) * 3. You do not compute this in one step. Your brain breaks it down:

  1. First compute a = x + 2
  2. Then compute y = a * 3

This decomposition isn't just how we think about it, it's also what the computer does. And this is the key point: if we can compute something step by step going forward, we can compute gradients step by step going backward.

To make this concrete, let's look at a slightly bigger example. Consider this function:
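
Here's one way to write it in Python (a small sketch; the names match the graph described next):

x, y = 2.0, 3.0
a = x * y      # first intermediate
b = x + y      # second intermediate
c = a * b      # output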

This looks like a simple program, but we can draw it as a graph:

  • x and y are input nodes
  • a and b are intermediate nodes
  • c is the output node
  • The operations (*, +, *) are the edges connecting them

To look at this another way: x affects the output c through two different paths. It goes through a (by getting multiplied with y) and through b (by getting added to y). When computing gradients, we need to account for both paths. The gradient contributions from each path will add up.

This is exactly what happens in neural networks, just at a massive scale. Each weight affects the loss through potentially thousands of paths. Backprop's job is to efficiently track all these paths and sum up their contributions.

From Slope to Jacobian

So far we've been talking about gradients loosely, but let's be precise about what we're computing. When you learned calculus, you started with derivatives of scalar functions: one input, one output. The derivative tells you the slope at a point.

But neural networks aren't scalar functions. A single layer might take 1000 inputs and produce 500 outputs. How do you even define a "derivative" for that?

The answer is the Jacobian matrix. Don't let the fancy name scare you. It's just a way of organizing all possible derivatives between inputs and outputs.

Let's say we have a function that takes 3 inputs and produces 2 outputs:

inputs: [x₁, x₂, x₃]
outputs: [y₁, y₂]

The Jacobian is a 2×3 matrix:

J = [∂y₁/∂x₁  ∂y₁/∂x₂  ∂y₁/∂x₃]
    [∂y₂/∂x₁  ∂y₂/∂x₂  ∂y₂/∂x₃]

Each row tells you how one output depends on all inputs. Each column tells you how all outputs depend on one input.

Backpropagation compute: explained intuitively

The Jacobian You Never Build

For a layer transforming n inputs to m outputs, the Jacobian is an m×n matrix. For a realistic layer (say 1000 inputs, 500 outputs), that's 500,000 entries. For a million inputs and thousand outputs, that's a billion numbers just for one layer's derivatives!

Cue the clever trick! Backprop never builds this matrix. Instead, it computes vector-Jacobian products (VJPs) directly. Let me illustrate why with this analogy of 2 accountants.

The Tale of Two Accountants: Why VJP Wins

Imagine a restaurant lost $100 on their last service (the loss), and two accountants are hired to figure out which ingredients caused the problem. The loss came from just 2 dishes:

  • Dish #1 was responsible for $60 of the loss
  • Dish #2 was responsible for $40 of the loss

These dishes used various amounts of 1000 different ingredients. Both accountants need to figure out how much each ingredient contributed to the $100 loss.

The Naive Accountant (Builds the Full Jacobian)

This accountant says: "I'll build a complete spreadsheet first!"

He creates a massive table:

Ingredient          % Used in Dish #1    % Used in Dish #2
Ingredient #1       15%                  5%
Ingredient #2       0%                   20%
Ingredient #3       8%                   12%
...                 ...                  ...
Ingredient #1000    3%                   0%

That's 2000 numbers computed and stored! (The full Jacobian)

Then, to find each ingredient's contribution to the loss:

Ingredient #1's fault = 15% × $60 +  5% × $40 = $9 + $2 = $11
Ingredient #2's fault =  0% × $60 + 20% × $40 = $0 + $8 = $8
Ingredient #3's fault =  8% × $60 + 12% × $40 = $4.80 + $4.80 = $9.60
...

Work done: Compute 2000 percentages, store them all, then do 1000 weighted sums.

The Clever Accountant (Uses VJP Directly)

This accountant says: "Wait, I already know the dishes cost 60and60 and 40 in losses. I don't need a spreadsheet!"

They go straight to computing:

For Ingredient #1:
  - Look up: it's 15% of Dish #1 → 15% × $60 = $9
  - Look up: it's 5% of Dish #2 → 5% × $40 = $2
  - Total fault: $9 + $2 = $11

For Ingredient #2:
  - Look up: it's 0% of Dish #1 → 0% × $60 = $0
  - Look up: it's 20% of Dish #2 → 20% × $40 = $8
  - Total fault: $0 + $8 = $8

Work done: Compute each weighted contribution on the fly, never storing the full table.

The Key Insight

The naive accountant computed and stored all 2000 relationships (the full Jacobian) even though he only needed them in one specific weighted combination. The clever accountant realized: "I never need to know '15% of Dish #1' as an isolated fact. I only need to know '15% of Dish #1 × $60'."

This is exactly what happens in backpropagation:

  • The $60 and $40 are the gradient values [g₁, g₂] flowing backward from the loss
  • The percentages are the local derivatives (how inputs affect outputs)
  • We never need the percentages alone. We only need them weighted by the gradients

Why We Never Need Individual Rows

As you can see, what makes backprop really efficient is that during the backward pass, we already know exactly which weighted combination of Jacobian rows we need. Think of it this way:

  • The full Jacobian has m rows (one per output)
  • Each row has n entries (one per input)
  • Total: m × n numbers to compute and store

But during backprop:

  • We receive m numbers from downstream: the gradient vector [g₁, g₂, ..., gₘ]
  • These tell us exactly which weighted combination of rows we need
  • We compute this combination directly without ever forming individual rows

We never need to know "how does output #1 affect all inputs?" in isolation. We only need "how does output #1 affect all inputs, weighted by how much output #1 affects the loss."

Technical View: Following the Gradient Flow

Let's trace this backward with a concrete example:

1000 inputs → [Your Layer] → 2 outputs → [More layers] → Loss (single number)

During backpropagation, by the time we reach your layer, the "More layers" have already helped you out: they've computed how your 2 outputs affect the final loss L and compressed all of that information into just 2 numbers: [g₁, g₂].

In other words, they are saying this: "Don't worry about what's going on upstream. We just calculated how your neuron 1's output changes loss L (∂L/∂out₁). Here is the value: g1. Also, we calculated how neuron 2's output is changing the final loss L (∂L/∂out₂), here is the number: g2."

1000 inputs → [Your Layer] → 2 outputs → [More layers] → Loss (single number)
                                   ↑
                              [g₁, g₂] = [∂L/∂out₁, ∂L/∂out₂]

Now consider this: Earlier while defining the Jacobian, we saw that the matrix looks like this:

J = [∂y₁/∂x₁  ∂y₁/∂x₂  ∂y₁/∂x₃]
    [∂y₂/∂x₁  ∂y₂/∂x₂  ∂y₂/∂x₃]

Loosely speaking, each row represents an output i and each column represents an input j.

So now coming back: when the downstream layers give your layer the values g₁ and g₂, representing ∂L/∂out₁ and ∂L/∂out₂, you'll probably realize: hold on, why would I need every row of some giant Jacobian? I only need the rows corresponding to out₁ and out₂. That's just 2 rows, so there's no need to compute or store anything beyond them!

To say the same thing more formally, g1 and g2 are the weighting factors that tell us exactly which linear combination of Jacobian rows we need:

  • g₁ = ∂Loss/∂output₁: Weight for row 1 of the Jacobian
  • g₂ = ∂Loss/∂output₂: Weight for row 2 of the Jacobian

Now let's move forward.

Here is where we are: you have learned, through the incoming gradients, how your outputs (and hence your weights) impact the loss. Now, just as the downstream layers helped you out by telling you how you were impacting the final loss L, you should pass the favor down to your inputs as well. In other words, your task is now to tell each of your inputs how it, individually, has impacted the overall loss.

Now this makes it sound super complicated, right? You have not 2, but 1000 inputs! And you've never even seen the actual value of the loss L. How are you expected to tell your inputs how they're impacting L?!

This is a hard one. But there may be a partial solution. Here is how it goes: the loss is a single number, and it can only "see" your layer's neurons through those two gradients, g₁ and g₂. There is no other path of influence from your layer's outputs to the loss; this is it. So, because you know how your outputs impact the loss, and because you know how your inputs impact your outputs, you can connect those two facts and calculate how your inputs impact the loss through you. And if your layer is the ONLY way an input reaches the loss, then this really is the full ∂L/∂xᵢ! That is all!

How do you find that numerical value ∂L/∂x₁? Well, consider this:
∂L/∂x₁ = ∂L/∂out₁ × ∂out₁/∂x₁

You already know ∂L/∂out₁ because it was supplied to you by the downstream layers, so all you need to figure out is ∂out₁/∂x₁.

In other words, you need to figure out how much each input affects each output (the local derivative). For input x₁, you take your gradient g₁ and multiply it by how much x₁ affects output 1 (which is ∂out₁/∂x₁). For a linear layer where output = Wx, this local derivative is just the weight that connects input 1 to output 1 (row 1, column 1 of W). So that contribution is g₁ × W[1,1]. You do the same for how x₁ affects output 2, using the weight in row 2, column 1: g₂ × W[2,1]. Add these up, and that's how x₁ impacts the total loss L through your layer!

But wait, why did I just say that 'there may be a partial solution'? This method should unequivocally say how the inputs impact the loss, right? Yes, but consider what happens if an input is also connected to the loss through some other path. In that case, the gradient you pass down to, say, input x₁ only captures how x₁ impacts L when it flows through your layer, not how x₁ impacts L along that other path. For the full picture, x₁ also needs a gradient supplied by the other path's neuron, added to the value it received from yours.

So what you're really telling your inputs is NOT: "Here is how you impact loss L", but "Here is how you impact loss L when you flow through me".

Now let's describe this more formally: to compute gradients for your 1000 inputs, you don't need all 2000 Jacobian entries separately. You only need them pre-multiplied by [g₁, g₂].

Why This Is The Only Combination We Need

What we're actually trying to compute:

∂Loss/∂input_i = ∂Loss/∂output₁ × ∂output₁/∂input_i + ∂Loss/∂output₂ × ∂output₂/∂input_i
                = g₁ × J[0,i] + g₂ × J[1,i]

This is just entry i of the vector-Jacobian product [g₁, g₂] × J.

Or in more standard notation: ∂L/∂x = J^T · ∂L/∂y where the 'gradient with respect to inputs' equals the 'transposed Jacobian' times 'the gradient with respect to outputs'. This identity is the mathematical heart of backpropagation!

We never need J[0,i] by itself. We only need it multiplied by g₁. We never need J[1,i] by itself. We only need it multiplied by g₂.

The incoming gradient vector [g₁, g₂] tells us the only linear combination of Jacobian rows that matters for computing our input gradients. Computing any other combination would be wasted work as we'd be computing derivatives that don't contribute to our final answer.

What Actually Happens in Code
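
A minimal NumPy sketch of the idea, using a tiny 2×3 example (in a real layer, the weights themselves play the role of J, so nothing extra is ever materialized):

import numpy as np

J = np.array([[2.0, 3.0, 1.0],
              [4.0, 0.0, 5.0]])       # local derivatives of 2 outputs w.r.t. 3 inputs
grad_output = np.array([0.5, 0.2])    # [g1, g2] arriving from downstream

grad_input = J.T @ grad_output        # one matrix-vector product; no row is ever needed on its own
print(grad_input)                     # [1.8 1.5 1.5]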

See the key insight? We're computing grad_input = J^T @ grad_output, which for each input i gives us:

grad_input[i] = J[0,i] × g₁ + J[1,i] × g₂

We sum up how that input affected each output, weighted by that output's gradient. The transpose naturally handles the routing!

A Numerical Example to Drive It Home

Let's make this completely concrete with actual numbers. Suppose:

  • Layer has 3 inputs, 2 outputs (so Jacobian is 2×3)
  • The full Jacobian would be:
    J = [2  3  1]  # How output 1 depends on each input
        [4  0  5]  # How output 2 depends on each input
    
  • Gradient from loss: [g₁, g₂] = [0.5, 0.2]

Computing gradients the naive way (building full Jacobian):

grad_inputs = [0.5, 0.2] × [2  3  1]
                           [4  0  5]
            = 0.5 × [2, 3, 1] + 0.2 × [4, 0, 5]
            = [1, 1.5, 0.5] + [0.8, 0, 1]
            = [1.8, 1.5, 1.5]

What backprop actually does (never building J):

  • For input 1: effect on output 1 (2) × g₁ (0.5) + effect on output 2 (4) × g₂ (0.2) = 1.8
  • For input 2: effect on output 1 (3) × g₁ (0.5) + effect on output 2 (0) × g₂ (0.2) = 1.5
  • For input 3: effect on output 1 (1) × g₁ (0.5) + effect on output 2 (5) × g₂ (0.2) = 1.5

Same result, but we computed it directly without ever storing the 6-entry Jacobian!

For a linear layer Y = WX + b, the complete backward pass computes all gradients:
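
Here's a sketch of that backward pass (convention assumed here: X has shape (n_inputs, batch), W is (n_outputs, n_inputs), and cache holds what the forward pass saved):

def linear_backward(grad_output, cache):
    X, W = cache
    grad_W = grad_output @ X.T           # dL/dW: upstream gradient paired with the cached inputs
    grad_b = grad_output.sum(axis=1)     # dL/db: the broadcast bias sums over the batch
    grad_X = W.T @ grad_output           # dL/dX: route the gradient back through the weights
    return grad_W, grad_b, grad_X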

See how clean this is? Each gradient follows naturally from the chain rule. The transpose in W.T @ grad_output routes the gradients back through the same connections they came from, just in reverse!

As you can see in the code, we avoid materializing those billion Jacobian entries! The VJP computation takes roughly the same time as the forward pass itself and only needs the cached values from forward propagation (like the input X and activation masks). We never build the O(m×n) Jacobian matrix: we go straight to the answer we need. For deep networks, this is the difference between "actually trainable" and "computationally impossible."

The Jacobian exists conceptually to help us understand the math, but backprop is clever enough to never actually build it. It goes straight to computing the only thing we need: how each input contributed to the loss through all available paths (the VJP).

Don't worry, I know I've not yet explained all of this rigorously with math and code, or discussed the ramifications of some of the concepts introduced here. I will do that in the following sections. Consider this an intuitive breakdown of the end-to-end backpropagation flow.

Chain Rule by Pictures

The chain rule is the heart of backprop. You probably memorized it in calculus: (f ∘ g)′(x) = f′(g(x)) · g′(x). But here's a quick refresher from the perspective of backprop.

Imagine you're looking at this simple chain:

x → [g] → y → [f] → z

The value x flows through function g to produce y, then through function f to produce z. Now we want to know: if we wiggle x a tiny bit, how much does z wiggle?

The chain rule says: multiply the local derivatives along the path. If g amplifies changes by a factor of 2, and f amplifies by a factor of 3, then the total amplification from x to z is 6.

But real networks aren't simple chains. They have forks and merges. When a value splits into multiple paths, we need to follow all of them. When paths merge, their effects add up.
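
Here's the diamond as a small sketch (the numeric constants are the local derivatives used below):

x = 1.0
a = 2 * x          # top branch:    da/dx = 2
b = 4 * x          # bottom branch: db/dx = 4
z = 3 * a + 5 * b  # merge:         dz/da = 3, dz/db = 5

dz_dx = 3 * 2 + 5 * 4   # multiply along each path, add across paths: 6 + 20 = 26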

Look at that diamond pattern. The value x can reach z through two different routes:

  • Top route: derivative is 2 × 3 = 6
  • Bottom route: derivative is 4 × 5 = 20
  • Total derivative: 6 + 20 = 26

This is the multivariate chain rule in action: multiply along paths, add across paths. Every time you see a fork in the forward pass, you'll see an addition in the backward pass. This isn't a special rule we invented. It falls out naturally from how derivatives work.

Computational Graphs

Let's now formalize what we've been drawing informally. A computational graph is a directed acyclic graph where:

  • Nodes represent values (inputs, intermediates, outputs)
  • Edges represent operations that transform values

Every node knows two things:

  1. During forward pass: its value
  2. During backward pass: its gradient (how much the loss changes if this value changes)

Every edge (operation) knows two things:

  1. How to compute the forward operation
  2. How to compute the local derivative

Each operation only needs to know its own derivative rule. ReLU doesn't need to know about the layers before or after it. Matrix multiplication doesn't care if it's followed by softmax or ReLU. Every operation is self-contained.

This means when you write:
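
For instance, in PyTorch-style code (a minimal sketch):

import torch

x = torch.tensor(2.0, requires_grad=True)
w = torch.tensor(3.0, requires_grad=True)
y = w * x                    # each operation adds a node to the graph
z = torch.relu(y + 1.0)
loss = (z - 5.0) ** 2
loss.backward()              # walk the graph in reverse, accumulating gradients
print(x.grad, w.grad)        # tensor(12.), tensor(8.)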

You're not just computing values. You're building a graph. Each operation adds nodes and edges. When you call backward(), the framework walks this graph in reverse, applying local derivative rules and accumulating gradients.

The key principle to understand: when a value is used multiple times in the forward pass (like a weight multiplying different inputs in a batch), its gradient contributions sum in the backward pass. We'll see concrete scalar examples of this gradient accumulation pattern in the "Practice With Baby Examples" section later.

This completes our conceptual foundation:

  • Neural networks are computational graphs where each operation only knows its local derivative rule
  • Vectorization and broadcasting are just patterns of reuse, and reuse in forward becomes accumulation in backward

The next sections will walk through exactly how backpropagation works: how the forward pass prepares for the backward journey, how gradients flow backward step by step, and how these abstract principles play out in concrete scalar examples that you can verify by hand.

The Forward Pass: Setting Up for the Backward Journey

Forward propagation was covered in detail in the MLP post, so we won't repeat the full story here. If you need a refresher on how matrix multiplies and ReLUs compose to transform data, check that out first. Here, we'll focus on what the forward pass needs to do specifically to enable backpropagation.

The one thing to keep in mind about the forward pass is that it isn't just computing outputs; it's also laying breadcrumbs for the backward pass to follow.

What Backprop Actually Needs From Forward

When you compute the forward pass, you're solving two problems simultaneously:

  1. Computing the actual output (the prediction)
  2. Storing exactly the right values to compute gradients later

Let's do a small recap. So what do we need exactly?

Think about computing the gradient of the loss with respect to some weight in layer 2. To compute that, we need:

  • The input to layer 2 (which was the output of layer 1)
  • The pre-activation values of layer 2 (before ReLU)
  • How that layer's output affected everything downstream

Without storing these during forward pass, we'd have to recompute them during backprop, which would double the computation time.

The Cache: Your Memory-Computation Tradeoff

Equation: z = Wx + b
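
For this layer, the forward pass caches exactly what backprop will need later, along these lines (a sketch; names are illustrative):

def linear_forward(x, W, b):
    z = W @ x + b
    cache = (x, W)        # the two things the backward pass will ask for
    return z, cache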

Why? Because during backprop:

  • To get ∂L/∂W: you need x (the input)
  • To get ∂L/∂x: you need W (the weights)
  • To get ∂L/∂b: you just need the upstream gradient

ReLU: a = max(0, z)

For ReLU: you only need a boolean mask (1 bit per element) instead of the full values (32 bits per element). That's a 32x memory saving right there.

Softmax + Cross-Entropy Loss

Understanding Softmax and Cross-entropy

Softmax converts raw scores (logits) into a probability distribution:

\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j} e^{z_j}}

Where:

  • zᵢ = the logit (raw score) for class i
  • The output is a probability distribution where all values are positive and sum to 1

Cross-Entropy Loss measures how different the predicted distribution is from the true distribution:

L = -\sum_{i} y_i \log(p_i)

Where:

  • pᵢ = predicted probability for class i (from softmax)
  • yᵢ = true probability for class i (target distribution)
  • L = the loss value (lower is better)

For classification, yy is typically one-hot (all zeros except a 1 at the true class), simplifying to:

L = -\log(p_{\text{true class}})

Why fuse them? When you compute the gradient of cross-entropy loss with respect to the logits (not the probabilities), a clean simplification appears: ∂L/∂zᵢ = pᵢ − yᵢ. For classification, this becomes simply probs - one_hot(true_class). This clean gradient is why they're almost always combined in practice.

Practical Usage

This one's interesting. Computing softmax and cross-entropy separately requires caching more, but if you fuse them:
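
Here's a sketch of the fused version (assumptions: logits has shape (batch, classes) and true_class holds integer labels):

import numpy as np

def softmax_cross_entropy(logits, true_class):
    shifted = logits - logits.max(axis=1, keepdims=True)       # for numerical stability
    probs = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
    rows = np.arange(len(true_class))
    loss = -np.log(probs[rows, true_class]).mean()
    grad_logits = probs.copy()
    grad_logits[rows, true_class] -= 1.0                        # probs - one_hot(true_class)
    return loss, grad_logits / len(true_class)                  # average over the batch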

The gradient is probs - one_hot(true_class): a standard identity that yields a clean expression.

The Topological Sort: Order Matters

Let's think about this: given all these dependencies, in what order should we compute the gradients? We surely can't compute them in random order, but what is the right order?

Short answer is that you need to respect the dependencies. The forward pass naturally gives you a topological ordering: you compute layer 1, then layer 2, then layer 3. The backward pass just reverses this: gradients for layer 3, then layer 2, then layer 1.

But modern computational graphs aren't always sequential. They have branches (like ResNet's skip connections) and merges (like concatenation). The topological sort ensures you've computed all the gradients that flow into a node before processing that node.

Shapes: The Unforgiving Truth

Get a shape wrong in the forward pass, and PyTorch will error out immediately. Get a shape wrong in your backward pass implementation, and you might not notice until your network mysteriously fails to learn.

The rule to follow: the gradient of a tensor must have the same shape as the tensor itself.

Why? Because you're going to subtract the gradient from the parameter:
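
The update that makes this rule non-negotiable (learning rate and sizes here are illustrative):

import numpy as np

learning_rate = 0.01
W = np.random.randn(784, 128)
grad_W = np.zeros_like(W)            # must have exactly W's shape
W -= learning_rate * grad_W          # elementwise update: breaks (or silently misbehaves) otherwise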

If shapes don't match, this breaks. But more philosophically, it makes sense: if W has 1000 parameters, you need 1000 gradient values to tell you how to update each one.

This shape matching is your debugger. If your gradient for W has shape (128, 784) but W has shape (784, 128), you know you forgot a transpose somewhere.

Local Gradients: Your Building Blocks

Each operation needs to know its own derivative. Just its own, nothing else. This locality is what makes backprop modular.

Here's the essential toolkit:

Addition: z = x + y

  • ∂z/∂x = 1
  • ∂z/∂y = 1
  • Gradient distributes unchanged to both inputs

Multiplication: z = x · y

  • ∂z/∂x = y
  • ∂z/∂y = x
  • Gradient gets scaled by the "other" input

ReLU: z = max(0, x)

  • ∂z/∂x = 1 if x > 0, else 0
  • Gradient passes through or stops

Matrix Multiply: Z = WX

  • ∂L/∂W = (∂L/∂Z) Xᵀ
  • ∂L/∂X = Wᵀ (∂L/∂Z)
  • The transposes route gradients correctly

These four operations and their combinations give you 80% of deep learning. Everything else is details.
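
Here are those four rules as tiny functions (a sketch; g or G is the upstream gradient, and the matrix case assumes Z = W @ X with NumPy arrays):

import numpy as np

def add_backward(g, x, y):
    return g, g                      # gradient passes through unchanged to both inputs

def mul_backward(g, x, y):
    return g * y, g * x              # each input gets the gradient scaled by the other input

def relu_backward(g, x):
    return g * (x > 0)               # pass through where x was positive, block elsewhere

def matmul_backward(G, W, X):
    return G @ X.T, W.T @ G          # dL/dW and dL/dX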

A Complete Forward Pass Example

Let's trace through a tiny network to make this concrete. Two inputs, two hidden neurons, one output:
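
A sketch with made-up weights (two inputs, two hidden neurons, one output):

import numpy as np

x  = np.array([1.0, 2.0])                  # two inputs
W1 = np.array([[ 0.5,  0.3],
               [-0.4, -0.2]])              # one row per hidden neuron
b1 = np.array([0.1, -0.1])
z1 = W1 @ x + b1                           # pre-activations: [ 1.2, -0.9]
h  = np.maximum(0.0, z1)                   # ReLU:             [ 1.2,  0.0]
W2 = np.array([0.7, -0.6])
b2 = 0.2
y_out = W2 @ h + b2                        # output: 1.04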

See what happened? The second hidden neuron got ReLU'd to zero. During backprop, gradients won't flow through it. This selective routing is how ReLU networks carve up the input space into regions.

Setting Up for the Journey Back

The forward pass has done its job. You have:

  1. The prediction: What the network thinks the answer is
  2. The caches: All the breadcrumbs needed for backprop
  3. The computation graph: The path gradients will follow backward

Now the backward pass can begin. It will walk this graph in reverse, using the cached values to compute gradients efficiently. One forward pass, one backward pass, all gradients computed.

The next section shows exactly how this backward walk happens, step by step, until every parameter knows exactly how it affected the loss.

The Backward Pass: Retracing Your Steps With Purpose

We've seen how the forward pass computes outputs and caches values. Now comes the payoff: the backward pass that computes all gradients in one sweep.

The backward pass answers one question for every parameter in the network: "How much did you contribute to the loss?" To answer this efficiently, here's a concept that will make it click.

Meet the Adjoint: Your Gradient's True Name

Here's a term that sounds fancy but isn't: the adjoint. For any intermediate value v in your computation, its adjoint v̄ is just:

\bar{v} = \frac{\partial L}{\partial v}

That's it. The adjoint is the gradient of the loss with respect to that value. Nothing more, nothing less.

Why bother with a new name? Because "gradient" gets overloaded. When we say "the gradient flows backward," what we really mean is "adjoints flow backward." The adjoint v̄ tells you: if this value v increases by a tiny amount, how much does the loss increase?

Let's trace through a simple example to make this concrete:
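
A tiny trace (values are made up): L = (a · b)², with a = 3 and b = 4:

a, b = 3.0, 4.0
c = a * b                # forward: 12.0
L = c ** 2               # forward: 144.0

L_bar = 1.0              # the loss's adjoint with respect to itself
c_bar = L_bar * 2 * c    # dL/dc = 2c = 24.0
a_bar = c_bar * b        # dL/da = 24 * 4 = 96.0
b_bar = c_bar * a        # dL/db = 24 * 3 = 72.0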

See the pattern? Each adjoint is computed from:

  1. The adjoint flowing in from downstream (closer to the loss)
  2. The local derivative of the operation

This is the chain rule in action, but thinking in terms of adjoints makes it systematic. You always know what you're computing: the sensitivity of the loss to that specific value.

The Multiply-Accumulate Pattern: Backprop's Only Move

For any node in your graph, here's what happens during the backward pass:
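
The core move, sketched for a single node with two inputs (names and numbers are illustrative):

adjoint = {"z": 1.0, "x": 0.0, "y": 0.0}     # z is downstream of x and y
local = {("z", "x"): 4.0, ("z", "y"): 3.0}   # local derivatives dz/dx and dz/dy

for inp in ("x", "y"):
    adjoint[inp] += local[("z", inp)] * adjoint["z"]   # accumulate, never overwrite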

That's it. Multiply the downstream adjoint by the local derivative, accumulate into the upstream adjoint. This pattern repeats for every edge in your graph.

But here's the critical bit: notice the += instead of =. Why accumulation instead of assignment?

Because values can be used multiple times! Remember that x can affect the output through multiple paths? Each path contributes to the gradient, and these contributions must sum.

Here's an example to illustrate this:
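
A tiny case where it matters: z = x * x uses x twice (x = 3, so dz/dx = 2x = 6):

x = 3.0
z = x * x            # 9.0

z_bar = 1.0
x_bar = 0.0
x_bar += z_bar * x   # contribution from the first use of x: 3.0
x_bar += z_bar * x   # contribution from the second use: another 3.0
print(x_bar)         # 6.0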

Without accumulation, just overwriting would give the wrong gradient (3 instead of 6). This accumulation is what handles the "forking" in computational graphs automatically.

This multiply-accumulate pattern scales to millions of parameters. Each weight that's used in multiple places (which is basically all of them in a batched computation) gets gradient contributions from all its uses, automatically summed up.

Walking the DAG: The Actual Algorithm

Now let's walk through how to actually implement backpropagation on a computational graph. The algorithm is (I daresay) quite simple once we have the right mental model.

Here's the complete algorithm:
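
A self-contained sketch (the tiny Node class here is hypothetical, just enough to show the walk):

class Node:
    def __init__(self, value, parents=(), local_grads=()):
        self.value = value
        self.parents = parents          # nodes feeding into this one
        self.local_grads = local_grads  # d(self)/d(parent), one per parent

# Graph for L = (x * y) + x, with x = 2, y = 3
x = Node(2.0)
y = Node(3.0)
a = Node(x.value * y.value, (x, y), (y.value, x.value))  # a = x * y
L = Node(a.value + x.value, (a, x), (1.0, 1.0))          # L = a + x

adjoint = {n: 0.0 for n in (x, y, a, L)}
adjoint[L] = 1.0                       # dL/dL = 1
for node in (L, a):                    # reverse topological order (inputs have no parents)
    for parent, g in zip(node.parents, node.local_grads):
        adjoint[parent] += g * adjoint[node]   # multiply-accumulate

print(adjoint[x], adjoint[y])          # 4.0 2.0  (dL/dx = y + 1, dL/dy = x)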

That's it. That's backpropagation. Here's what's happening:

  1. Initialize: Every adjoint starts at zero except the loss (which is 1)
  2. Reverse order: Process nodes from output to input
  3. Accumulate: Each node multiplies its adjoint by local derivatives and adds to its inputs' adjoints

The reverse topological order is crucial. It ensures that when you process a node, you've already processed everything downstream from it. The node's adjoint is complete and ready to propagate.

Here's a complete worked example:
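
A worked example on scalars (all numbers made up): L = (relu(w·x + b) − t)²

w, x, b, t = 2.0, 3.0, -1.0, 4.0

# Forward pass, caching intermediates
z = w * x + b              # 5.0
h = max(0.0, z)            # 5.0 (ReLU active)
d = h - t                  # 1.0
L = d ** 2                 # 1.0

# Backward pass, in reverse order
L_bar = 1.0
d_bar = L_bar * 2 * d                      # 2.0
h_bar = d_bar * 1.0                        # 2.0 (d = h - t, so dd/dh = 1)
z_bar = h_bar * (1.0 if z > 0 else 0.0)    # 2.0 (ReLU was active)
w_bar = z_bar * x                          # 6.0
x_bar = z_bar * w                          # 4.0
b_bar = z_bar * 1.0                        # 2.0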

The Cost of Gradients: Time, Memory, and Checkpointing

Let's talk about what backprop actually costs you in terms of computation and memory. This is where theory meets harsh reality.

Time Cost

A useful fact about backprop: computing all gradients costs about the same as one forward pass. More precisely:

  • Forward pass: O(number of operations)
  • Backward pass: O(number of operations)
  • Total: ~2× the cost of just computing the output

The upshot: whether you have 10 parameters or 10 billion, the gradient computation costs roughly 2× the forward pass. Not 10× or 10 billion×. About 2×.

Why "roughly" 2× and not exactly? The backward pass often involves slightly different operations (transposes, additional multiplies), so in practice it's usually between 1.5× and 3× the forward cost.

Memory Cost

Here's where things get tricky. To compute gradients, you need those cached intermediate values from the forward pass. For a big network, this adds up:
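
A back-of-envelope estimate (every number here is illustrative):

batch, seq_len, hidden, layers = 32, 2048, 4096, 48
bytes_per_value = 4                                  # float32
cached = batch * seq_len * hidden * layers * bytes_per_value
print(f"~{cached / 1e9:.0f} GB just to remember the forward pass")   # ~52 GB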

This is why you run out of GPU memory during training even though inference works fine. Training needs to store all these intermediate values.

Gradient Checkpointing: Trading Compute for Memory

When memory becomes the bottleneck, you can make a trade: instead of storing all activations, store checkpoints and recompute the rest during backprop.

The strategy:

  1. During forward pass: Only save activations every N layers
  2. During backward pass: Recompute missing activations from nearest checkpoint
  3. Trade-off: Use √L memory for L layers, but do ~1.5× more computation

Here's what checkpointing looks like in practice:
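
A sketch using PyTorch's built-in helper, torch.utils.checkpoint (exact API details vary slightly across versions):

import torch
from torch.utils.checkpoint import checkpoint

block = torch.nn.Sequential(
    torch.nn.Linear(1024, 1024), torch.nn.ReLU(),
    torch.nn.Linear(1024, 1024), torch.nn.ReLU(),
)

x = torch.randn(32, 1024, requires_grad=True)
y = checkpoint(block, x, use_reentrant=False)   # activations inside `block` are recomputed in backward
y.sum().backward()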

The trade-off is clear:

  • Memory: O(√L) instead of O(L)
  • Computation: ~1.5× forward passes instead of 1×
  • Enables training models that wouldn't fit otherwise

The Shape Discipline That Keeps You Sane

Here's a principle that will save you hours of debugging: the shape of a gradient must match the shape of its corresponding value.

This seems obvious but it's easy to mess up with broadcasting and batching. Here's the rule in action:
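
A sketch of the rule in action (shapes are illustrative):

import numpy as np

X = np.random.randn(32, 784)
W = np.random.randn(784, 128)
Z = X @ W                           # (32, 128)

grad_Z = np.ones_like(Z)            # (32, 128): same shape as Z
grad_W = X.T @ grad_Z               # (784, 128): same shape as W
grad_X = grad_Z @ W.T               # (32, 784): same shape as X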

When operations broadcast, you must reduce the gradient dimensions to match:
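
Continuing the sketch above with a bias of shape (128,):

b = np.zeros(128)
Z = X @ W + b                       # b is broadcast across the batch dimension
grad_Z = np.ones_like(Z)
grad_b = grad_Z.sum(axis=0)         # (128,): sum over the dimension that was broadcast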

The rule is simple: if you broadcast in forward, you reduce in backward. If you replicate in forward, you sum in backward.

This shape discipline is your guardrail. If shapes don't match, you have a bug. No exceptions.

Pulling It All Together

Here's backprop on a complete mini neural network to cement everything:
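
A sketch of a full forward and backward pass for a tiny two-layer MLP with a squared-error-style loss (all sizes and values are illustrative):

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((32, 10))        # batch of 32 examples, 10 features each
y = rng.standard_normal((32, 1))         # targets
W1, b1 = rng.standard_normal((10, 16)), np.zeros(16)
W2, b2 = rng.standard_normal((16, 1)), np.zeros(1)

# Forward pass (Z1 and H are the cached values backprop will need)
Z1 = X @ W1 + b1                         # (32, 16)
H  = np.maximum(0.0, Z1)                 # ReLU
Y_hat = H @ W2 + b2                      # (32, 1)
loss = ((Y_hat - y) ** 2).mean()

# Backward pass: adjoints flow from the loss back to every parameter
grad_Y_hat = 2 * (Y_hat - y) / y.size    # dL/dY_hat
grad_W2 = H.T @ grad_Y_hat               # (16, 1), matches W2
grad_b2 = grad_Y_hat.sum(axis=0)         # (1,),  matches b2
grad_H  = grad_Y_hat @ W2.T              # route back through W2
grad_Z1 = grad_H * (Z1 > 0)              # ReLU mask blocks inactive units
grad_W1 = X.T @ grad_Z1                  # (10, 16), matches W1
grad_b1 = grad_Z1.sum(axis=0)            # (16,),  matches b1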

Look at what's happening:

  1. Each gradient is computed using only local information (cached values and adjoints)
  2. The adjoint transforms as it flows backward (matrix multiplies, masking by ReLU)
  3. Parameter gradients accumulate contributions from the batch
  4. Shapes are preserved (gradients match parameter shapes)

This is backprop: systematic application of the chain rule with smart caching and accumulation.

Practice With Baby Examples

Now that we've seen the machinery of backprop, let's build intuition with the smallest possible examples. No matrices, no batches, just scalar values flowing through simple computational graphs. This is where we can build the muscle memory that scales to real networks.

The goal here isn't to show you the complexity. It's the opposite: to show you that even the most intricate neural network gradient is just these simple patterns repeated millions of times. Master these toy examples, and you've mastered the essence of backprop.

The Straight Line: Your First Gradient Story

Let's start with the simplest possible example: a chain of scalar operations. This is backprop with training wheels, but don't skip it. The pattern you see here is exactly what happens in deep networks, just without the visual clutter of matrices and tensors.

Consider this computation:
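
Written out as code (the values match the walkthrough below):

x = 3.0
y = x * 2      # 6.0
z = y + 5      # 11.0
L = z ** 2     # 121.0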

We want to compute ∂L/∂x: how does the input x affect the final loss L?

Here's the forward pass with actual values flowing through:

  • Start with x = 3
  • Multiply by 2: y = 6
  • Add 5: z = 11
  • Square it: L = 121

Now for the backward pass. Computing the adjoint (gradient of loss with respect to each value) for every node, working backwards:
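
Continuing in code (each adjoint is the downstream adjoint times a local derivative):

L_bar = 1.0                # dL/dL
z_bar = L_bar * (2 * z)    # local derivative of z**2 is 2z -> 22.0
y_bar = z_bar * 1          # local derivative of y + 5 is 1 -> 22.0
x_bar = y_bar * 2          # local derivative of x * 2 is 2 -> 44.0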

Verifying this is correct using calculus:

  • L = (2x + 5)²
  • Expanding: L = 4x² + 20x + 25
  • Taking the derivative: dL/dx = 8x + 20
  • At x = 3: dL/dx = 24 + 20 = 44 ✓

The key insight: we never needed to expand that formula. The backward pass computed the exact same answer by:

  1. Breaking the computation into simple steps
  2. Computing local derivatives (2z for square, 1 for addition, 2 for multiply)
  3. Multiplying them along the path: 2z × 1 × 2 = 2(11) × 1 × 2 = 44

This is the chain rule in its purest form: multiply derivatives along the path from output to input.

When Paths Split and Merge

Now let's handle something slightly trickier: a value that affects the output through multiple paths.
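
The computation, sketched in code with x = 2 and y = 3:

x, y = 2.0, 3.0
a = x * y      # 6.0  (first use of x)
b = x + y      # 5.0  (second use of x)
z = a * b      # 30.0 (this plays the role of the loss here)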

Look at the structure: x can reach the loss through two different routes:

  • Route 1: x → a → z → L
  • Route 2: x → b → z → L

During the backward pass, we need to account for both paths. Here's what happens:
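
z_bar = 1.0
a_bar = z_bar * b                 # 5.0  (dz/da = b)
b_bar = z_bar * a                 # 6.0  (dz/db = a)
x_bar = a_bar * y + b_bar * 1.0   # 15 + 6 = 21: both paths contribute and add
y_bar = a_bar * x + b_bar * 1.0   # 10 + 6 = 16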

The crucial pattern: when a value is used multiple times, gradients from each use add up. This isn't a special rule we made up. It's what the multivariate chain rule demands:

\frac{\partial L}{\partial x} = \sum_{\text{all paths}} \left( \prod_{\text{edges in path}} \text{local derivatives} \right)

Verifying this with direct calculus:

  • z = (x×y) × (x+y) = x²y + xy²
  • ∂z/∂x = 2xy + y² = 2(2)(3) + 9 = 12 + 9 = 21 ✓

Again, we get the same answer, but backprop computed it without ever expanding the expression. It just followed the graph structure and accumulated contributions.

This fork-and-merge pattern appears everywhere in neural networks:

  • Weight sharing: Same weight used across different spatial locations (CNNs) or time steps (RNNs)
  • Skip connections: Values flowing through multiple paths (ResNets)
  • Attention: Values being used as queries, keys, and values simultaneously

In every case, the rule is the same: gradients from multiple uses accumulate. Not because someone decided it should be that way, but because that's what derivatives do when you use a value multiple times.

Trusting But Verifying: Gradient Checking

We've been computing gradients analytically, but how do we know the implementation is correct? This is where gradient checking comes in: comparing analytical gradients (from backprop) against numerical approximations (from finite differences).

The idea is simple. The derivative is defined as:

\frac{\partial L}{\partial x} = \lim_{\epsilon \to 0} \frac{L(x + \epsilon) - L(x)}{\epsilon}

For small enough ε (like 10⁻⁵), we can approximate this numerically. Even better, we can use the centered difference for higher accuracy:

\frac{\partial L}{\partial x} \approx \frac{L(x + \epsilon) - L(x - \epsilon)}{2\epsilon}

Checking the gradient from our earlier example:
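
A quick numerical check for L = (2x + 5)² at x = 3, where backprop said 44:

def f(x):
    return (2 * x + 5) ** 2

eps = 1e-5
numerical = (f(3 + eps) - f(3 - eps)) / (2 * eps)
print(numerical)   # ~44.0, matching the analytical gradient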

The gradients match to many decimal places! This gives confidence that the backprop implementation is correct.

But here's the thing about gradient checking: it's a debugging tool, not a training tool. Only use it to verify correctness because:

  1. It's expensive: Need two forward passes per parameter (remember the disaster from the introduction?)
  2. It's approximate: Numerical errors creep in, especially with very small or very large gradients
  3. It's tricky to tune: Too large an ε? Poor approximation. Too small? Floating point errors.

The Gradient Check Recipe

Here's a systematic approach to gradient checking:
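
A sketch of a reusable checker (the 1e-6 / 1e-3 thresholds are common rules of thumb, not hard laws):

import numpy as np

def gradient_check(f, x, analytic_grad, eps=1e-5):
    numeric = np.zeros_like(x)
    for i in range(x.size):
        old = x.flat[i]
        x.flat[i] = old + eps
        f_plus = f(x)
        x.flat[i] = old - eps
        f_minus = f(x)
        x.flat[i] = old                              # restore the original value
        numeric.flat[i] = (f_plus - f_minus) / (2 * eps)
    rel_error = np.abs(numeric - analytic_grad) / (np.abs(numeric) + np.abs(analytic_grad) + 1e-12)
    return rel_error.max()                           # < 1e-6: looks good; > 1e-3: suspicious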

Here's a pro tip: when gradient checking fails, the relative error magnitude hints at the bug type:

  • Error around 1.0 or 2.0: You probably forgot a factor (like the 2 in d(x²)/dx = 2x)
  • Error around 0.5: You might be computing only half the gradient (common with symmetric operations)
  • Huge error: Wrong operation entirely, or you're not actually computing gradients for that parameter
  • Error only on some parameters: Look for conditional logic (like ReLU) that might be handled incorrectly

The best part of these toy examples is that everything is transparent. You can verify every calculation by hand. But the patterns you're learning (chains multiply, forks add, always verify) are exactly what you'll use in production networks with billions of parameters.

Once you can narrate the gradient flow through these simple graphs, you understand backpropagation. The rest is just engineering: handling tensors instead of scalars, batching operations for efficiency, and managing memory. But the core algorithm? It's just these patterns repeated at scale.

From One Neuron to a Layer

We have been seeing backprop on tiny scalar examples. Now scale up to the actual building blocks of neural networks: neurons and layers. The same rules apply (multiply along paths, add at merges, accumulate gradients). The difference is that we move from scalars to vectors and matrices.

A single neuron is just a dot product followed by a nonlinearity. A layer is just many neurons computed in parallel. Once you see how gradients flow through one neuron, you understand how they flow through entire networks. It's the same pattern, just vectorized for efficiency.

The Affine Transform: Your Workhorse Operation

At the heart of every neural network layer is the affine transform: y = Wx + b. This simple operation (matrix multiply plus bias) does most of the heavy lifting in deep learning. If you understand how gradients flow through this operation, you understand most of what you need to know.

Starting with a single neuron to build intuition:
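
A single neuron, sketched with made-up numbers:

import numpy as np

x = np.array([1.0, 2.0, 3.0])     # 3 inputs
w = np.array([0.5, -0.2, 0.8])    # 3 weights
b = 0.1
z = w @ x + b                     # 0.5 - 0.4 + 2.4 + 0.1 = 2.6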

Now during backprop, we need to compute three things:

  1. How should we adjust the weights? (∂L/∂w)
  2. How should we adjust the bias? (∂L/∂b)
  3. What gradient should flow back to the input? (∂L/∂x)

Here's the key insight: the gradient with respect to each weight is just the input it was multiplied with, scaled by the gradient flowing from downstream:
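
Continuing the sketch (grad_z stands for the gradient arriving from downstream; its value is illustrative):

grad_z = 1.5                      # dL/dz from downstream
grad_w = grad_z * x               # each weight's gradient: downstream grad × the input it saw
grad_b = grad_z                   # the bias adds directly, so it gets the downstream grad as-is
grad_x = grad_z * w               # each input's gradient: downstream grad × its weight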

See the pattern? During forward pass, inputs and weights multiply to create the output. During backward pass, the gradient "swaps partners":

  • To get weight gradients: multiply the downstream gradient by the inputs
  • To get input gradients: multiply the downstream gradient by the weights

This swapping pattern is fundamental. It's not arbitrary; it's what the chain rule demands. And it scales perfectly to layers with many neurons.

Scaling to a Full Layer

Now let's handle multiple neurons at once. Instead of one neuron with 3 inputs, let's have 4 neurons, each looking at the same 3 inputs:
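
Four neurons looking at the same 3 inputs (numbers are illustrative):

W = np.array([[ 0.5, -0.2,  0.8],
              [ 0.1,  0.4, -0.3],
              [-0.6,  0.2,  0.7],
              [ 0.9, -0.5,  0.2]])   # shape (4, 3): one row per neuron
b = np.zeros(4)
x = np.array([1.0, 2.0, 3.0])
z = W @ x + b                        # shape (4,): one output per neuron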

The backward pass follows the exact same logic, just with matrix operations:
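
grad_z = np.array([0.1, -0.2, 0.3, 0.4])   # one downstream gradient per neuron (illustrative)
grad_W = np.outer(grad_z, x)               # (4, 3): row i is neuron i's grad_z × the inputs
grad_b = grad_z                            # (4,)
grad_x = W.T @ grad_z                      # (3,): each input sums contributions from all 4 neurons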

The crucial observation: when multiple neurons share the same input, that input's gradient is the sum of contributions from all neurons. This is the accumulation pattern from earlier, now happening naturally through matrix multiplication.

Batching: The Final Dimension

So far, we've been processing one example at a time. In practice, neural networks process many examples simultaneously in batches. This isn't just for speed; it's fundamental to how modern networks train. Processing in batches gives more stable gradient estimates and lets us use GPUs efficiently.

Batching adds an extra dimension to our tensors, but the key thing is: the gradient computation rules stay exactly the same. We're just doing the same operation on multiple examples at once.

Let's start small to build intuition. Instead of one example with 3 inputs going through a layer with 4 neurons, let's process 2 examples:
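
Two examples at once, sketched (W is (4, 3) and b is (4,), as in the layer above):

import numpy as np

W = np.random.randn(4, 3)             # 4 neurons, 3 inputs each
b = np.zeros(4)
X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0]])       # shape (2, 3): one row per example
Z = X @ W.T + b                       # shape (2, 4): one row of neuron outputs per example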

Look at what happened with the dimensions:

  • Input X: [2, 3] → 2 examples, each with 3 features
  • Weights W.T: [3, 4] → transforms 3 inputs to 4 outputs
  • Output Z: [2, 4] → 2 examples, each producing 4 neuron outputs

Each row of Z is what we'd get if we processed that example individually. The batch dimension just stacks these operations.

Now here's where batching gets interesting for backpropagation. During the forward pass, all examples share the same weights and biases. This means during the backward pass, each example contributes to the gradient of those shared parameters:
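
grad_Z = np.ones((2, 4))             # illustrative upstream gradients, one row per example
grad_W = grad_Z.T @ X                # (4, 3): the matmul sums contributions over the batch
grad_b = grad_Z.sum(axis=0)          # (4,): the bias gradient also accumulates across examples
grad_X = grad_Z @ W                  # (2, 3): per-example input gradients stay separate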

This accumulation is crucial. The weights are shared across all examples in the batch, so their gradient must account for how they affected all examples. This is the same accumulation pattern we saw earlier when one value was used multiple times in the forward pass.

The key distinction:

  • Shared parameters (W, b): Used by all examples → gradients accumulate across the batch
  • Per-example data (X): Different for each example → gradients stay separate

Let's see this scaled up to realistic batch sizes:
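
The same pattern at a more realistic size (shapes are illustrative):

batch, n_in, n_out = 256, 784, 128
X = np.random.randn(batch, n_in)
W = np.random.randn(n_out, n_in)
b = np.zeros(n_out)
Z = X @ W.T + b                      # (256, 128)

grad_Z = np.random.randn(batch, n_out)
grad_W = grad_Z.T @ X                # still (128, 784), no matter the batch size
grad_b = grad_Z.sum(axis=0)          # still (128,)
grad_X = grad_Z @ W                  # (256, 784): one gradient row per example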

The elegance of this: the exact same mathematical rules apply whether you're processing 1 example or 1000. The chain rule doesn't care about the batch dimension. It's just matrix operations that naturally handle the batching for us.

One more subtlety: when we compute grad_W = grad_Z.T @ X, the matrix multiplication is doing the summation for us. Let's verify this makes sense:

  • Each row of grad_Z.T is one neuron's gradients across all batch examples
  • Each column of X is one input feature across all batch examples
  • The dot product sums over the batch dimension, giving us the accumulated gradient

The pattern is: if a parameter is shared across the batch (like weights and biases), its gradient is the sum of contributions from all examples. If not (like the inputs), gradients remain separate.

The Activation Function Zoo

After the affine transform comes the nonlinearity. Without it, stacking layers would be pointless (multiple linear transforms collapse to one). Each activation function has its own personality and gradient behavior. Let me show you the ones you'll encounter most often.

Sigmoid: The Classic S-Curve

The sigmoid function σ(x) = 1/(1 + e⁻ˣ) was the original choice, squashing any input into (0, 1):
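
A quick sketch of the function and its gradient, computed from the output alone:

import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

s = sigmoid(np.array([-5.0, 0.0, 5.0]))   # [0.0067, 0.5, 0.9933]
grad = s * (1 - s)                        # [0.0066, 0.25, 0.0066]: tiny at both extremes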

The derivative has a useful property: σ′(x) = σ(x)(1 − σ(x)). You can compute the gradient using just the output value, no need to store the input. But there is a problem: look at the flat regions at both ends of the curve. When the input is very negative or very positive, the gradient vanishes to nearly zero. The neuron learns very slowly there.

Tanh: Centered but Still Saturating

The hyperbolic tangent tanh(x) is like sigmoid but centered at zero and ranging from -1 to 1.

The derivative is tanh′(x) = 1 − tanh²(x). Like sigmoid, you only need the output to compute the gradient. Being centered at zero is good (it keeps activations balanced), but it still saturates at both extremes.

ReLU: The Gradient Preserver

ReLU (Rectified Linear Unit): simple and effective:
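And a sketch for ReLU:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

x = np.array([-3.0, -0.5, 0.0, 0.5, 3.0])
y = relu(x)
grad = (x > 0).astype(float)  # 1 where the input was positive, 0 elsewhere
print(y)                      # [0.  0.  0.  0.5 3. ]
print(grad)                   # [0. 0. 0. 1. 1.]
```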

ReLU's gradient is binary: either 1 (gradient flows perfectly) or 0 (gradient is blocked). No saturation on the positive side! This simple change made training deep networks actually possible. The price? "Dead neurons" that get stuck outputting zero if they land on the wrong side.

Modern Variants: Best of Both Worlds

Researchers have proposed smoother alternatives that fix ReLU's sharp edges while keeping its benefits:

Leaky ReLU: Allow a small gradient when x < 0

GELU (Gaussian Error Linear Unit): Smooth gating that's become standard in transformers

SiLU/Swish: Another smooth alternative
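Minimal sketches of all three (the tanh-based formula below is the commonly used approximation of GELU; the exact definition uses the Gaussian CDF):

```python
import numpy as np

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)       # gradient: 1 for x > 0, alpha otherwise

def gelu(x):
    # tanh approximation of GELU
    return 0.5 * x * (1.0 + np.tanh(np.sqrt(2.0 / np.pi) * (x + 0.044715 * x**3)))

def silu(x):
    # SiLU / Swish: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

x = np.linspace(-4, 4, 9)
print(leaky_relu(x).round(3))
print(gelu(x).round(3))
print(silu(x).round(3))
```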

Different activation functions give neurons different "personalities": decisive, probabilistic, or selective

When Neurons Die: Understanding Saturation

Saturation is when your activation function's gradient becomes tiny, effectively stopping learning for those neurons. It's a silent killer in deep networks. Let's see exactly what happens and how to spot it.

The Saturation Zones

For sigmoid and tanh, saturation happens when inputs get large in magnitude:

When a neuron saturates, the gradient flowing through it gets multiplied by nearly zero. Even if there's a huge error to correct, the weight updates become microscopic:
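A quick sketch of how fast the sigmoid's local gradient collapses as the pre-activation grows:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# The local gradient sigmoid'(z) multiplies whatever error signal arrives from above.
for z in [0.0, 2.0, 5.0, 10.0]:
    local_grad = sigmoid(z) * (1.0 - sigmoid(z))
    print(f"z = {z:4.1f}  ->  local gradient = {local_grad:.6f}")
```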

The Distribution Problem

Saturation isn't just about individual neurons. It's about populations. If your pre-activation values drift too far from zero, entire layers can saturate:
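A rough illustration: draw pre-activations from distributions that drift away from zero and count how many units are effectively stuck (the 0.01 threshold is an arbitrary choice):

```python
import numpy as np

rng = np.random.default_rng(0)

# Pre-activations drawn around different means: a shifted distribution saturates en masse.
for mean in [0.0, 3.0, 6.0]:
    z = rng.normal(loc=mean, scale=1.0, size=10_000)
    s = 1.0 / (1.0 + np.exp(-z))
    grad = s * (1.0 - s)
    saturated = np.mean(grad < 0.01)
    print(f"mean {mean:.1f}: {saturated:5.1%} of units have gradient < 0.01")
```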

Dead ReLU Syndrome

ReLU has its own death mode: once a neuron outputs zero, if its weights update in the wrong direction, it might never recover:
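A sketch of the failure mode with a single neuron whose bias has been pushed far negative (the numbers are contrived to make the point):

```python
import numpy as np

rng = np.random.default_rng(0)
w, b = np.array([0.5, -0.3]), -10.0        # a neuron whose bias got knocked far negative

for step in range(3):
    x = rng.normal(size=2)                 # typical, bounded inputs
    z = w @ x + b                          # pre-activation: always well below zero
    out = max(0.0, z)                      # ReLU output is 0
    local_grad = 1.0 if z > 0 else 0.0     # so the local gradient is 0, and w never updates
    print(f"step {step}: z = {z:6.2f}, output = {out}, local grad = {local_grad}")
```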

The neuron outputs zero, so its gradient is zero, so its weights don't update, so it keeps outputting zero. It's dead. In practice, with proper initialization and learning rates, only a small percentage die, but it's something to monitor.

Shape Discipline: Getting the Dimensions Right

Now let's talk about the thing that causes more debugging headaches than anything else in deep learning: tensor shapes. When you move from scalars to tensors, keeping track of dimensions becomes critical. One misaligned matrix multiply and your entire backward pass breaks. We touched on this briefly earlier; now let's look at it closely.

The Golden Rule of Gradient Shapes

Remember the principle?

The gradient of a tensor must have exactly the same shape as the tensor itself.

This is non-negotiable because you're going to subtract the gradient from the parameter:

Here's how this plays out in practice:
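In code, the rule is enforced by the update itself (a minimal sketch; the learning rate and sizes are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 3))
grad_W = rng.normal(size=(4, 3))       # must match W exactly

assert grad_W.shape == W.shape         # cheap insurance before every update
lr = 0.1
W -= lr * grad_W                       # elementwise update only makes sense if shapes agree
```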

Broadcasting: When Shapes Don't Match (On Purpose)

NumPy's broadcasting lets you add bias vectors to batch matrices without explicit copying. But during backprop, you must undo the broadcast by reducing:

The rule: if you broadcast in forward, you must reduce in backward along the broadcast dimensions:
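A minimal sketch of the forward broadcast and the matching backward reduction:

```python
import numpy as np

rng = np.random.default_rng(0)
Z_pre = rng.normal(size=(32, 4))       # [batch, neurons]
b = np.zeros(4)                        # [neurons]

# Forward: b is broadcast across the batch dimension.
Z = Z_pre + b                          # [32, 4] + [4] -> [32, 4]

# Backward: undo the broadcast by summing over the batch dimension.
grad_Z = rng.normal(size=(32, 4))
grad_b = grad_Z.sum(axis=0)            # [4], same shape as b
assert grad_b.shape == b.shape
```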

Common Shape Bugs and How to Spot Them

Here are the shape mistakes we see most often:

Bug 1: Wrong Matrix Multiply Order
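A sketch of how this one tends to appear (the bad line is commented out; it fails loudly with a shape error):

```python
import numpy as np

grad_Z = np.ones((32, 4))              # [batch, out]
W = np.ones((4, 3))                    # [out, in]

# grad_X = grad_Z @ W.T                # [32, 4] @ [3, 4] -> ValueError: shapes not aligned
grad_X = grad_Z @ W                    # [32, 4] @ [4, 3] -> [32, 3], matches X
```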

Bug 2: Forgetting to Sum Over Batch
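A sketch of this one (the wrong version has shape [32, 4] instead of [4], so the bias update breaks or silently broadcasts):

```python
import numpy as np

grad_Z = np.ones((32, 4))              # [batch, out]
b = np.zeros(4)                        # [out]

grad_b_wrong = grad_Z                  # [32, 4]: does not match b
grad_b = grad_Z.sum(axis=0)            # [4]: matches b, as the golden rule requires
```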

Bug 3: Reshaping Instead of Transposing
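And a sketch of this one, the nastiest of the three because it runs without complaint:

```python
import numpy as np

rng = np.random.default_rng(0)
grad_Z = rng.normal(size=(32, 4))
X = rng.normal(size=(32, 3))

bad  = grad_Z.reshape(4, 32) @ X   # runs, but pairs gradients with the wrong examples
good = grad_Z.T @ X                # transpose preserves the (example, neuron) pairing
print(np.allclose(bad, good))      # False
```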

Building Shape Intuition

The best way to build shape intuition is to think about what each dimension represents:
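One way to do that is to annotate shapes in plain language right next to the code; for instance (names and sizes are illustrative):

```python
# Reading shapes as sentences:
# X       [batch, n_in]    -> "for each example, n_in input features"
# W       [n_out, n_in]    -> "each of the n_out neurons has n_in input weights"
# b       [n_out]          -> "one bias per neuron"
# Z       [batch, n_out]   -> "for each example, one pre-activation per neuron"
# grad_W  [n_out, n_in]    -> "same shape as W, always"
```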

The shape discipline might feel pedantic at first, but it becomes second nature. And when your shapes are right, your gradients usually are too. It's the cheapest form of debugging: if shapes match, you're already mostly correct.

Putting It Together: A Complete Layer Implementation

Here's a complete implementation that combines everything: the affine transform, activation, proper shapes, and careful gradient handling:
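Here's one way such a layer might look: a minimal sketch of a fully connected ReLU layer, where the clipping threshold and the dead-neuron check are illustrative choices rather than settled best practice:

```python
import numpy as np

class DenseReLULayer:
    """A fully connected layer with ReLU, written for clarity rather than speed."""

    def __init__(self, n_in, n_out, seed=0):
        rng = np.random.default_rng(seed)
        # He initialization keeps pre-activations at a sane scale, preventing early saturation.
        self.W = rng.normal(size=(n_out, n_in)) * np.sqrt(2.0 / n_in)
        self.b = np.zeros(n_out)

    def forward(self, X):
        assert X.ndim == 2 and X.shape[1] == self.W.shape[1]
        Z = X @ self.W.T + self.b                  # [batch, n_out]
        A = np.maximum(0.0, Z)                     # ReLU
        self.cache = (X, Z)                        # exactly what backward() will need
        return A

    def backward(self, grad_A, clip=5.0):
        X, Z = self.cache
        assert grad_A.shape == Z.shape             # gradient shape must match the value shape

        grad_Z = grad_A * (Z > 0)                  # ReLU gate
        self.grad_W = grad_Z.T @ X / X.shape[0]    # average over the batch
        self.grad_b = grad_Z.mean(axis=0)
        grad_X = grad_Z @ self.W                   # gradient to pass to the previous layer

        # Simple elementwise gradient clipping for stability (threshold is arbitrary).
        self.grad_W = np.clip(self.grad_W, -clip, clip)
        self.grad_b = np.clip(self.grad_b, -clip, clip)

        assert self.grad_W.shape == self.W.shape and self.grad_b.shape == self.b.shape
        return grad_X

    def health_check(self):
        # Fraction of neurons inactive for the whole batch; for ReLU, saturation and death coincide.
        _, Z = self.cache
        return float(np.mean((Z <= 0).all(axis=0)))

    def step(self, lr):
        self.W -= lr * self.grad_W
        self.b -= lr * self.grad_b
```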

This implementation includes all the important details:

  • Proper initialization to prevent initial saturation
  • Caching exactly what's needed for backprop
  • Monitoring for saturation and dead neurons
  • Shape assertions to catch bugs early
  • Gradient clipping for stability

You can stack these layers to build deep networks, and the gradient will flow correctly through all of them. Each layer just needs to implement its local forward and backward logic. The chain rule handles the rest.

Now that you understand how gradients flow through individual layers, you're ready to see how they flow through entire networks. Let's walk through a complete two-layer classifier, putting all these pieces together into a working system.

Two-Layer Classifier: The Complete Picture

Time to put everything together. We have seen individual pieces: neurons, layers, activations. Now trace gradients through an entire classification network, from input pixels to probability outputs. This isn't a toy example. This is the same basic architecture that later scaled to state-of-the-art results.

Working with a concrete setup: classifying handwritten digits. The network takes 784 pixel values (28×28 image, flattened) and outputs 10 probabilities (one per digit). Two layers: a hidden layer with ReLU, then a softmax output for classification. Simple enough to trace by hand, complex enough to reveal all the important patterns.

Forward Expressions: Building Your Classifier

Here's the complete forward pass mathematically. Using actual dimensions so you can see exactly how shapes flow through the network:

Here's the forward computation, step by step:

Layer 1: Feature extraction

Layer 2: Class scoring

Output: Class probabilities

Loss: How wrong were we?

Tracing through with concrete numbers to make this real:
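A minimal sketch with random stand-in values (the hidden width of 128 and the fake "image" are illustrative assumptions):

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_classes = 784, 128, 10    # hidden width of 128 is an illustrative choice

x = rng.random(n_in)                        # a fake "image": 784 pixel values in [0, 1)
y = 3                                       # pretend the true digit is a 3

W1 = rng.normal(size=(n_hidden, n_in)) * np.sqrt(2.0 / n_in)
b1 = np.zeros(n_hidden)
W2 = rng.normal(size=(n_classes, n_hidden)) * np.sqrt(2.0 / n_hidden)
b2 = np.zeros(n_classes)

z1 = W1 @ x + b1                            # [128]  layer 1: feature extraction
h1 = np.maximum(0.0, z1)                    # [128]  ReLU
z2 = W2 @ h1 + b2                           # [10]   layer 2: class scores (logits)

shifted = z2 - z2.max()                     # stable softmax (explained in the next section)
p = np.exp(shifted) / np.exp(shifted).sum() # [10]   class probabilities
loss = -np.log(p[y])                        # cross-entropy against the true label

print(z1.shape, h1.shape, z2.shape, p.shape, round(float(loss), 3))
```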

The network transforms 784 pixel values into 10 class scores through learned feature extraction (first layer) and learned scoring rules (second layer). The softmax ensures we get a valid probability distribution, and cross-entropy loss quantifies how far off our prediction is from the truth.

Stable Softmax: Don't Let Your Exponentials Explode

Here's where things get tricky. The naive softmax formula looks innocent:

$$p_i = \frac{e^{z_i}}{\sum_j e^{z_j}}$$

But what happens if one of your logits is, say, 1000? You're computing $e^{1000}$, which overflows to infinity in float32. Even moderately large values like 100 cause problems. And if your logits are all very negative? The exponentials underflow to zero, giving you 0/0 = NaN.

The fix is simple: shift all logits by a constant. The probabilities do not change:

$$p_i = \frac{e^{z_i - c}}{\sum_j e^{z_j - c}}$$

This works because: $\frac{e^{z_i - c}}{\sum_j e^{z_j - c}} = \frac{e^{-c} \cdot e^{z_i}}{e^{-c} \cdot \sum_j e^{z_j}} = \frac{e^{z_i}}{\sum_j e^{z_j}}$

The $e^{-c}$ cancels out. So what should $c$ be? Use the maximum logit:
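A minimal sketch of the stabilized version:

```python
import numpy as np

def stable_softmax(z):
    shifted = z - z.max()                  # c = max(z): the largest exponent becomes e^0 = 1
    exp = np.exp(shifted)
    return exp / exp.sum()

z = np.array([1000.0, 1001.0, 1002.0])     # naive softmax would overflow here
print(stable_softmax(z).round(4))          # ~[0.09   0.2447 0.6652]
```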

Even better, when combined with cross-entropy loss, we can avoid computing the log of a probability (which can be numerically unstable for small probabilities):
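A sketch of the combined trick (shift by the max, then use log-sum-exp for the loss):

```python
import numpy as np

def cross_entropy_from_logits(z, y):
    # loss = -log p_y = log(sum_j exp(z_j - c)) - (z_y - c), with c = max(z)
    shifted = z - z.max()
    log_sum_exp = np.log(np.exp(shifted).sum())
    return log_sum_exp - shifted[y]

z = np.array([5.0, -2.0, 30.0])
print(cross_entropy_from_logits(z, y=0))   # finite and accurate, even though p_0 is ~1e-11
```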

This pattern (shift by max, compute log-sum-exp) appears everywhere in deep learning. It's one of those tricks that separates working code from crashing code.

Backward Pass Line by Line: Where Every Gradient Comes From

Now for the main event: computing gradients for every parameter. Starting from the loss and working backward, showing exactly where each gradient comes from.

As we saw earlier, the gradient of softmax with cross-entropy is remarkably clean: grad_z2 = p - one_hot(y). Let's use this to trace gradients through the entire network:

Gradient Through Layer 2

See the pattern? When computing weight gradients, we take the outer product of the downstream gradient and the input to that layer. When propagating gradients backward, we multiply by the transposed weight matrix.

Gradient Through ReLU

ReLU acts like a gate: gradients flow through neurons that were active (z1 > 0) and get blocked at neurons that were inactive. Simple but effective.

Gradient Through Layer 1

Same pattern as layer 2. The computation is mechanical once you know the rules.

Complete Backward Pass

Putting it all together, here's the full backward pass:
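A sketch of that backward pass for a single example, with names matching the forward-pass sketch above:

```python
import numpy as np

def backward(x, z1, h1, p, y, W2):
    """Gradients for the two-layer classifier, one example at a time."""
    one_hot = np.zeros_like(p)
    one_hot[y] = 1.0

    # Output: softmax + cross-entropy collapse to this clean expression.
    grad_z2 = p - one_hot                      # [10]

    # Layer 2: outer product for weights, copy for bias, W^T to keep flowing backward.
    grad_W2 = np.outer(grad_z2, h1)            # [10, 128]
    grad_b2 = grad_z2                          # [10]
    grad_h1 = W2.T @ grad_z2                   # [128]

    # ReLU gate: gradient passes only where the neuron was active.
    grad_z1 = grad_h1 * (z1 > 0)               # [128]

    # Layer 1: the same pattern again.
    grad_W1 = np.outer(grad_z1, x)             # [128, 784]
    grad_b1 = grad_z1                          # [128]

    return grad_W1, grad_b1, grad_W2, grad_b2
```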

Notice the symmetry: each layer follows the same pattern (outer product for weights, copy for bias, transpose multiply for input gradient). This isn't four different rules; it's one rule applied four times.

Why All Those Transposes?

You've probably noticed that $W^T$ appears whenever we compute input gradients. As explained briefly earlier, this isn't an arbitrary notation. There's a deep reason rooted in linear algebra.

Think about what's happening geometrically. In the forward pass, $W$ transforms vectors from input space to output space:

  • $W: \mathbb{R}^{n_{in}} \to \mathbb{R}^{n_{out}}$

In the backward pass, gradients flow from output space back to input space. We need the "reverse" transformation:

  • $W^T: \mathbb{R}^{n_{out}} \to \mathbb{R}^{n_{in}}$

The transpose is literally the adjoint (dual) of the linear map. It answers: "If the output changes by this much, how much does each input need to have contributed?"

Here's another way to see it. The chain rule for the gradient through $y = Wx$ gives:

$$\frac{\partial L}{\partial x_i} = \sum_j \frac{\partial L}{\partial y_j} \frac{\partial y_j}{\partial x_i}$$

Since $y_j = \sum_k W_{jk} x_k$, we have $\partial y_j / \partial x_i = W_{ji}$. So:

$$\frac{\partial L}{\partial x_i} = \sum_j \frac{\partial L}{\partial y_j} W_{ji} = \sum_j (W^T)_{ij} \frac{\partial L}{\partial y_j} = (W^T \nabla_y L)_i$$

The transpose emerges naturally from the math. It's not a trick or convention; it's what the chain rule demands. Think of it like a many-to-many mapping: going forward, input $i$ affects all outputs through row $i$ of $W^T$ (equivalently, column $i$ of $W$). Going backward, output $j$'s gradient affects all input gradients through column $j$ of $W^T$ (equivalently, row $j$ of $W$). The transpose naturally handles this reversal.

Batching: When Dimensions Multiply

So far we've seen single examples, but real training uses batches. Let's see how the math changes when processing 32 images simultaneously:

The activations gain a batch dimension, but weights stay the same shape. During backprop, gradients from all examples in the batch accumulate:
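A sketch of those reductions with stand-in arrays (the hidden width of 128 is the same illustrative choice as before):

```python
import numpy as np

rng = np.random.default_rng(0)
batch = 32
X  = rng.random((batch, 784))                 # 32 flattened images
H1 = rng.random((batch, 128))                 # hidden activations (stand-ins for a real forward pass)
grad_Z2 = rng.normal(size=(batch, 10))        # p - one_hot(y), one row per example
W2 = rng.normal(size=(10, 128))

# Shared parameters: reduce over the batch (mean keeps gradient scale independent of batch size).
grad_W2 = grad_Z2.T @ H1 / batch              # [10, 128]
grad_b2 = grad_Z2.mean(axis=0)                # [10]

# Per-example gradients: no reduction, one row per image.
grad_H1 = grad_Z2 @ W2                        # [32, 128]
```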


The key pattern: parameters that are shared across the batch (weights and biases) have their gradients summed/averaged over the batch dimension. This is the "reduction" I mentioned earlier: what gets broadcast in the forward pass gets reduced in the backward pass.

Why average instead of sum? It's a scaling choice:

  • Sum: Gradient magnitude scales with batch size (32× bigger gradient for 32× more examples)
  • Average: Gradient magnitude independent of batch size (easier to tune learning rate)

Most frameworks average by default, which is why you can often use the same learning rate for different batch sizes (though the noise characteristics change).

Shape Discipline With Batches

With batching, shape tracking becomes even more critical:

The rule remains ironclad: gradients have the same shape as their corresponding values. This is your debugging lifeline. Shape mismatch = bug, always.

The Complete Implementation

Putting everything together into a working implementation that you can actually run:
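Here's one self-contained sketch along these lines; synthetic random data stands in for MNIST, and the hidden width, learning rate, and batch size are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n_in, n_hidden, n_classes = 784, 128, 10

# Parameters (He-style initialization).
W1 = rng.normal(size=(n_hidden, n_in)) * np.sqrt(2.0 / n_in)
b1 = np.zeros(n_hidden)
W2 = rng.normal(size=(n_classes, n_hidden)) * np.sqrt(2.0 / n_hidden)
b2 = np.zeros(n_classes)

def forward(X):
    Z1 = X @ W1.T + b1                                   # [batch, 128]
    H1 = np.maximum(0.0, Z1)                             # ReLU
    Z2 = H1 @ W2.T + b2                                  # [batch, 10] logits
    shifted = Z2 - Z2.max(axis=1, keepdims=True)         # stable softmax
    P = np.exp(shifted) / np.exp(shifted).sum(axis=1, keepdims=True)
    return Z1, H1, P

def train_step(X, y, lr=0.1):
    global W1, b1, W2, b2
    batch = X.shape[0]
    Z1, H1, P = forward(X)
    loss = -np.log(P[np.arange(batch), y] + 1e-12).mean()

    # Backward pass.
    grad_Z2 = P.copy()
    grad_Z2[np.arange(batch), y] -= 1.0                  # p - one_hot(y)
    grad_Z2 /= batch                                     # average over the batch once, up front

    grad_W2 = grad_Z2.T @ H1
    grad_b2 = grad_Z2.sum(axis=0)
    grad_H1 = grad_Z2 @ W2
    grad_Z1 = grad_H1 * (Z1 > 0)
    grad_W1 = grad_Z1.T @ X
    grad_b1 = grad_Z1.sum(axis=0)

    # Update.
    W1 -= lr * grad_W1; b1 -= lr * grad_b1
    W2 -= lr * grad_W2; b2 -= lr * grad_b2
    return loss

# Smoke test on random data (swap in real MNIST batches to train for real).
X = rng.random((64, n_in))
y = rng.integers(0, n_classes, size=64)
for step in range(5):
    print(f"step {step}: loss = {train_step(X, y):.3f}")
```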

This implementation covers everything we discussed. It handles:

  • Numerical stability (shifted softmax)
  • Proper batching (reductions where needed)
  • Correct shapes throughout
  • The standard softmax with cross-entropy gradient

You could train this on actual MNIST data and get ~95% accuracy. Not bad for 50 lines of NumPy!

This two-layer network embodies all the core ideas of backpropagation:

  • Forward pass: Transform inputs through learned features to predictions
  • Loss: Measure prediction error with a differentiable function
  • Backward pass: Trace error signals back through the network
  • Parameter updates: Adjust weights in the direction that reduces error

Scale this up (more layers, more neurons, better optimizers) and you get modern deep learning. The principles stay exactly the same. Once you truly understand this two-layer example, you understand backprop.

Conclusion for Part 1: What We Want To Carry Forward

We started with a simple but brutal constraint: we need the gradient of the loss with respect to every value, yet we can only afford a few passes through the network. The way out was not new math; it was new bookkeeping. The forward pass writes down just enough facts about each local step. The backward pass reads those facts and applies the chain rule in reverse. Time becomes linear in the size of the network because we reuse what was already computed.

The mental model that now guides us:

  • Values flow forward, adjoints flow backward: We cache what each local rule will need later
  • Local rules are everything: add, multiply, activation, matrix multiply
  • Shapes are non-negotiable: every gradient has the same shape as its value
  • Batching is just many copies in parallel: parameter gradients add across the batch, input gradients stay separate
  • Stability is earned by design choices: ReLU keeps gradients alive, stable softmax prevents numeric overflow

A small but important perspective: backprop is not a mysterious force. It is careful accounting. The forward pass is the ledger; the backward pass is the settlement. When training feels memory-heavy, that is the cost of not paying with time. When training feels unstable, that is the system telling us where local rules or scales are working against gradient flow.

Where we go next: let's tighten the abstraction. Everything we did here can be expressed as vector-Jacobian products. That lens explains why transposes appear, why gradient shapes must match parameter shapes, and how complex layers still reduce to the same local computations. It also makes the jump from tiny examples to real architectures feel natural.

Continue to Part 2 to formalize this picture and extend it: vector-Jacobian products, the transpose as an adjoint, deeper shape discipline, and richer layers like convolution and attention.

Backpropagation Part 2: Patterns, Architectures, and Training

References and Further Reading