Gridworld: Dynamic Programming
A small, deterministic Markov Decision Process you can poke at. Click any open cell to make it the destination, then play or step through value iteration one sweep at a time. When the policy settles, the agent disc walks the greedy path from the start cell to the destination.
Reinforcement learning textbooks march in roughly the same order: bandits, then Markov decision processes, then dynamic programming, then the methods you actually use in practice (Monte Carlo, TD, policy gradients, and so on). Dynamic programming is where most people's intuition breaks first. You can recite the Bellman equation. You can probably even derive value iteration from policy iteration on a whiteboard. But picturing what one Bellman backup does to a value field, and what the greedy policy update does with that field, is harder than it should be. This page is here so you can watch it.
The world starts with Karpathy's layout. Ten by ten, deterministic, with the cell marked R +1.0 as the goal, nine traps worth -1, and two wall segments with one narrow gap to slip through. In the DP model, reward belongs to the state being left: after the agent takes an action from the +1 state, the next state is reset to the top-left start. The animation stops the disc at the selected destination instead of drawing that reset, because here the point is the path to the reward. The hatched cells are walls and have no visible V or π. Everything else is fair game.
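For concreteness, here is a minimal Python sketch of that model. The 10×10 size, the reward-for-the-state-being-left convention, and the reset from the goal back to the top-left start all come from the description above; the specific wall and trap coordinates, and names like `next_state` and `reward`, are placeholder choices for illustration rather than the demo's actual layout.

```python
# Minimal sketch of the deterministic gridworld MDP described above.
SIZE = 10
START = (0, 0)                                    # top-left start cell
GOAL = (5, 7)                                     # the clicked +1 destination (placeholder)
WALLS = {(3, c) for c in range(2, 8) if c != 5}   # hypothetical wall cells with one gap
TRAPS = {(1, 7), (6, 2), (8, 5)}                  # hypothetical -1 cells
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}

def reward(s):
    """Reward for leaving state s, as in the DP model above."""
    if s == GOAL:
        return 1.0
    if s in TRAPS:
        return -1.0
    return 0.0

def next_state(s, a):
    """Deterministic transition: leaving the goal resets to the start;
    bumping into a wall or the border leaves the agent where it is."""
    if s == GOAL:
        return START
    dr, dc = ACTIONS[a]
    r, c = s[0] + dr, s[1] + dc
    if not (0 <= r < SIZE and 0 <= c < SIZE) or (r, c) in WALLS:
        return s
    return (r, c)
```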
The first half of each displayed sweep runs one synchronous Bellman expectation backup over every cell. For each state, it averages over the actions your current policy would take, and for each action it pulls back the immediate reward plus γ times the value sitting at the next state:
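$$V_{k+1}(s) \;=\; \sum_{a} \pi(a \mid s)\,\bigl[\, r(s) \;+\; \gamma\, V_k\bigl(\mathrm{nextState}(s, a)\bigr) \,\bigr]$$

Because the grid is deterministic there is no expectation over next states, only over the policy's own choice of action, and every state reads the frozen V_k before any cell is overwritten with V_{k+1}. That is what "synchronous" means here.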
The amber and slate tints on each cell are V itself, normalized for color. Press Forward a few times and you can see what the equation actually does: positive value diffuses outward from the goal, negative value bleeds out from the traps, and both flows have to bend through the gap because the walls block propagation. After each backup, the arrows are redrawn greedily from the new V. So the value field and the policy sharpen together.
The greedy policy update is the other half. It looks at the current V, peeks one step ahead from each state, and assigns full probability to whichever action or actions point at the highest neighbor value. That is all “greedy with respect to V” means in this setting. When two or more actions tie, the policy splits evenly across the winners, which is why some cells emit a fork of equal-length arrows instead of one. The arrows on the grid are π drawn directly: longer arm means higher probability of taking that action.
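Building on the sketch above, both halves of a displayed sweep fit in a few lines. The function names and the tie tolerance are my own choices, not the demo's code:

```python
def bellman_backup(V, policy, gamma):
    """First half of a sweep: every state reads the frozen V; no write is
    visible until the whole sweep finishes (synchronous backup)."""
    return {s: sum(p * (reward(s) + gamma * V[next_state(s, a)])
                   for a, p in policy[s].items())
            for s in V}

def greedy_update(V, gamma):
    """Second half: one-step lookahead from V; ties split probability evenly."""
    policy = {}
    for s in V:
        lookahead = {a: reward(s) + gamma * V[next_state(s, a)] for a in ACTIONS}
        best = max(lookahead.values())
        winners = [a for a, q in lookahead.items() if abs(q - best) < 1e-9]
        policy[s] = {a: (1.0 / len(winners) if a in winners else 0.0)
                     for a in ACTIONS}
    return policy
```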
Play repeats that backup/update pair about every 100 milliseconds. Forward runs it once. Back scrubs through the snapshots you already generated. In full policy iteration you would evaluate a policy for many sweeps before improving it. Karpathy's value iteration demo does the impatient version: one backup, then one greedy update, over and over. Run that loop long enough and it squeezes out a fixed point where neither V nor π wants to move. Once the sweep counter stops going up, a small disc takes over and walks the path the policy would follow from the start cell to the selected destination. That walk is just argmax_a π(a|s), applied step by step. Nothing fancy.
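Continuing the same sketch, the loop and the final greedy walk might look like the following, assuming zero initial values, a uniform starting policy, and a simple value-change test standing in for the demo's stopping rule:

```python
# States are every non-wall cell; walls carry no V or pi, as in the demo.
states = [(r, c) for r in range(SIZE) for c in range(SIZE) if (r, c) not in WALLS]
V = {s: 0.0 for s in states}
policy = {s: {a: 1.0 / len(ACTIONS) for a in ACTIONS} for s in states}

gamma = 0.9
for sweep in range(500):
    new_V = bellman_backup(V, policy, gamma)
    policy = greedy_update(new_V, gamma)
    converged = max(abs(new_V[s] - V[s]) for s in states) < 1e-6
    V = new_V
    if converged:
        break

# Greedy walk from the start cell; it stops at the goal instead of following
# the reset transition, matching what the animation draws.
s, path = START, [START]
while s != GOAL and len(path) < SIZE * SIZE:
    a = max(policy[s], key=policy[s].get)   # argmax_a pi(a|s)
    s = next_state(s, a)
    path.append(s)
print(path)
```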
The modeling inputs are deliberately small. Click any open cell and it becomes the sole +1 destination; the previous destination is cleared back to 0, while the fixed -1 trap cells stay where they are. Drop γ near 0.1 and the policy collapses to whatever is one or two steps away. The goal becomes invisible from across the room because its value decays before it reaches you. Push γ near 0.99 and the goal's pull soaks through the entire reachable space; the arrows form long, coherent currents pointing home through the gap. Somewhere in between is where γ stops being a hyperparameter and starts feeling like the agent's planning horizon.
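To put rough numbers on that horizon: the goal's +1 reaches a state d steps away only after being discounted about d times, so its contribution is on the order of γ^d:

$$0.1^{5} = 10^{-5}, \qquad 0.99^{20} \approx 0.82$$

Five steps away at γ = 0.1 is effectively nothing, while twenty steps away at γ = 0.99 still carries most of the reward, which is why the long coherent currents only organize at high γ.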
The reason none of this scales is right there in the math. To run one sweep you have to touch every state, and to back up a value you have to know r(s) and nextState(s, a) exactly. That works for a 10×10 grid. It doesn't work for a robot arm with continuous joints, or a 19×19 Go board with 10^170 positions, or anything where you have to learn the dynamics from interaction. Treat this page as a place to feel what the Bellman backups are actually doing. The methods that scale usually keep this backup idea where they can, then replace the unaffordable exact sums with samples or learned approximations.