Gridworld: Dynamic Programming
A small, deterministic Markov Decision Process you can poke at. Click any open cell to make it the destination, or switch to obstacle mode to add and remove walls; then play or step through value iteration one sweep at a time. When the policy settles, the agent disc walks the greedy path from the start cell to the destination.
Loading gridworld
Reinforcement learning textbooks march in roughly the same order: bandits, then Markov decision processes, then dynamic programming, then the methods you actually use in practice (Monte Carlo, TD, policy gradients, and so on). Dynamic programming is where most people's intuition breaks first. You can recite the Bellman equation. You can probably even derive value iteration from policy iteration on a whiteboard. But picturing what one Bellman backup does to a value field, and what the greedy policy update does with that field, is harder than it should be. This page is here so you can watch it.
The world starts with Karpathy's layout. Ten by ten, deterministic, with the cell marked R +1.0 as the goal, nine traps worth -1, and an L-shaped wall with one narrow gap punched through it. In the DP model, reward belongs to the state being left: after the agent takes an action from the +1 state, the next state is reset to the top-left start. The animation stops the disc at the selected destination instead of drawing that reset, because here the point is the path to the reward. The hatched cells are walls and have no visible V or π. Everything else is fair game.
The cards above the grid are small route puzzles, each one a case where eyeballing the answer leads you astray. Twin Gates offers two passages through a wall: a narrow upper gate close to the start but with traps salting the corridor behind it, and a wider three-cell lower gate that's farther but clean. The geometry is engineered so the trap-avoiding detour through the upper gate and the long approach through the lower one give the start cell the same value. The policy splits fifty-fifty at the early forks, and both routes glow with the same warmth in the value field; the agent's walk resolves the tie deterministically through action ordering. Switchback is an S-curve maze where the right move is to walk away from the destination for a while before turning back toward it. Pocket puts the goal six steps from start as the crow flies, then forces a twenty-step detour because the only door is on the far side. Cliff is the textbook lesson on shortest-path fallacy — the direct route along the bottom row is the same eighteen steps as the safe route one row up, but every cliff cell costs -1, and the agent picks the higher-V detour every time. Press a card and the world resets to sweep 0; same Bellman backups, different geometry.
The amber and slate tints on each cell are V itself, normalized for color. Press Forward a few times and you can see the useful picture first: positive value diffuses outward from the goal, negative value bleeds out from the traps, and both flows have to bend through the gap because the walls block propagation. After each pass, the arrows are redrawn from the new V. So the value field and the policy sharpen together.
That pass is a synchronous Bellman expectation backup. For each state, it averages over the actions your current policy would take, and for each action it pulls back the immediate reward plus γ times the value sitting at the next state:
The greedy policy update is the other half. It looks at the current V, peeks one step ahead from each state, and assigns full probability to whichever action or actions point at the highest neighbor value. That is all “greedy with respect to V” means in this setting. When two or more actions tie, the policy splits evenly across the winners, which is why some cells emit a fork of equal-length arrows instead of one. The arrows on the grid are π drawn directly: longer arm means higher probability of taking that action.
Play repeats that backup/update pair about every 100 milliseconds. Forward runs it once. Back scrubs through the snapshots you already generated. In full policy iteration you would evaluate a policy for many sweeps before improving it. Karpathy's value iteration demo does the impatient version: one backup, then one greedy update, over and over. Run that loop long enough and it squeezes out a fixed point where neither V nor π wants to move. Once the sweep counter stops going up, a small disc takes over and walks the path the policy would follow from the start cell to the selected destination. That walk is just argmaxa π(a|s), applied step by step. Nothing fancy.
The modeling inputs are deliberately small. Click any open cell and it becomes the sole +1 destination; the previous destination is cleared back to 0, and the fixed -1 cells otherwise stay as traps. Drop γ near 0.1 and the policy collapses to whatever is one or two steps away. The goal becomes invisible from across the room because its value decays before it reaches you. Push γ near 0.99 and the goal's pull soaks through the entire reachable space; the arrows form long, coherent currents pointing home through the gap. Somewhere in between is where γ stops being a hyperparameter and starts feeling like the agent's planning horizon.
The other knob is geometry. Switch to Obstacle mode and the click does the inverse: paint a wall on an open cell, or scrub one away from a wall. Each toggle is a fresh MDP, so value iteration restarts from sweep 0 with the same destination still selected and the agent re-plans around whatever opening is left. Wall the goal off completely and the iteration still converges — V is well-defined for every reachable state — but the agent never arrives, and the status line reads “blocked” instead of “arrived.” Useful for testing the obvious things (does a one-cell-wide gap still propagate value? does γ change which detour the policy prefers?) and the less obvious (wall off a trap and watch the corridor next to it warm up because the penalty stops bleeding).
The reason none of this scales is right there in the math. To run one sweep you have to touch every state, and to back up a value you have to know r(s) and nextState(s, a) exactly. That works for a 10×10 grid. It doesn't work for a robot arm with continuous joints, or a 19×19 Go board with 10170 positions, or anything where you have to learn the dynamics from interaction. Treat this page as a place to feel what the Bellman backups are actually doing. The methods that scale usually keep this backup idea where they can, then replace the unaffordable exact sums with samples or learned approximations.