Interactive Visualization

Mars Rover Explorer

A rover sits on a slope of red dust with half a battery and a few moves: drive, drill, photograph, uplink. There are eight places it can end up. Six it passes through: Base, Plain, Rocks, Ridge, Crater, and the Signal, where it sends the day's data home. Two it can't leave: Mission end, safely home, and the Sand trap, where it's stuck for good.

Which moves are worth making turns on one number per spot: its value, the reward you can expect to collect from there to the end of the day. There are two ways to get it, and you can try both here. One samples real runs and averages what each spot earns. The other reads the whole map and computes each value exactly, reaching the same answer from the other side.

Value from sampled runs

Pick the policy the rover follows, set the battery with γ, and hit play. Each run collects rewards, and every spot the rover stood on is credited with the discounted rewards from there onward: the return. One run is noisy: a clean pass to the uplink looks great, a run that ends in the sand looks awful. Run a few hundred and the per-spot averages, v̂(s), settle. Nobody gave the rover the map. The values came out of experience.
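The averaging described above can be sketched in a few lines. This is a minimal every-visit Monte Carlo estimate on a made-up three-spot slice of the map; the transition probabilities, the rewards, and the names `TRANSITIONS`, `sample_run`, and `mc_values` are all assumptions for illustration, not the demo's real numbers.

```python
import random
from collections import defaultdict

# Hypothetical toy map (NOT the demo's real numbers).
# state -> list of (probability, next_state, reward_on_arrival)
TRANSITIONS = {
    "Base":   [(0.8, "Plain", 0.0), (0.2, "Sand trap", -20.0)],
    "Plain":  [(0.9, "Signal", 10.0), (0.1, "Sand trap", -20.0)],
    "Signal": [(1.0, "Mission end", 0.0)],
}
TERMINAL = {"Mission end", "Sand trap"}


def sample_run(rng, start="Base"):
    """Play one run and return the visited (state, reward) pairs in order."""
    state, steps = start, []
    while state not in TERMINAL:
        options = TRANSITIONS[state]
        weights = [p for p, _, _ in options]
        _, nxt, reward = rng.choices(options, weights=weights)[0]
        steps.append((state, reward))
        state = nxt
    return steps


def mc_values(n_runs, gamma, seed=0):
    """Average the discounted return observed from every visited spot."""
    rng = random.Random(seed)
    total = defaultdict(float)
    count = defaultdict(int)
    for _ in range(n_runs):
        g = 0.0
        # Walk the run backwards so each return is built in one pass.
        for state, reward in reversed(sample_run(rng)):
            g = reward + gamma * g
            total[state] += g
            count[state] += 1
    return {s: total[s] / count[s] for s in total}


v_hat = mc_values(n_runs=20000, gamma=0.9)
```

With only a handful of runs the estimate for Base swings widely, because one trip into the sand dominates the average; with thousands of runs it steadies.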


The same world, taken apart

The same eight spots, seen through three lenses. Start bare: click a spot to see where the rover can drift and how often. Add rewards and a discount and every spot shows its value, computed exactly instead of sampled. Add the rover's own choices and you can hand it a policy and compare two ways to behave.
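When the map is known, a value can be computed instead of sampled. The sketch below uses iterative policy evaluation: it sweeps the spots and replaces each value with its expected one-step reward plus the discounted value of wherever it lands, until nothing changes. The three-spot map, its numbers, and the function name `evaluate` are assumptions for illustration; the demo may well solve the same equations another way.

```python
# Hypothetical toy map (NOT the demo's real numbers).
# state -> list of (probability, next_state, reward_on_arrival)
TRANSITIONS = {
    "Base":   [(0.8, "Plain", 0.0), (0.2, "Sand trap", -20.0)],
    "Plain":  [(0.9, "Signal", 10.0), (0.1, "Sand trap", -20.0)],
    "Signal": [(1.0, "Mission end", 0.0)],
}
TERMINAL = {"Mission end", "Sand trap"}


def evaluate(gamma, tol=1e-12):
    """Sweep Bellman backups over the known map until values stop moving."""
    v = {s: 0.0 for s in list(TRANSITIONS) + sorted(TERMINAL)}
    while True:
        delta = 0.0
        for s, options in TRANSITIONS.items():
            # Expected reward plus discounted value of the landing spot.
            new = sum(p * (r + gamma * v[nxt]) for p, nxt, r in options)
            delta = max(delta, abs(new - v[s]))
            v[s] = new
        if delta < tol:
            return v


v = evaluate(gamma=0.9)
```

On this toy map with γ = 0.9 the sweep gives about 7.0 for Plain and about 1.04 for Base, with no randomness involved: the same inputs always give the same values.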


Two knobs change everything. γ is the battery: near 0 the rover is myopic and grabs the nearest reward, near 1 it's patient enough to detour for a bigger one. The policy is the rule it follows. At γ = 0.9 the coin-flipping Lost Wanderer is worth less than nothing at Base, about -2, because random thrashing keeps finding the sand; the optimal rover is worth about +4 from the same spot. Same dust, same dice, different rule.
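To see what γ does on its own, compare two made-up reward streams: a small reward right away, and a bigger one that takes a detour to reach. The reward sizes and the helper name `discounted_return` are assumptions chosen for illustration.

```python
def discounted_return(rewards, gamma):
    """Sum rewards with the k-th one scaled by gamma**k, built back to front."""
    g = 0.0
    for r in reversed(rewards):
        g = r + gamma * g
    return g


# Hypothetical reward streams, not taken from the demo.
nearby = [1.0]             # +1 on the very next step
detour = [0.0, 0.0, 10.0]  # +10, but only on the third step

for gamma in (0.1, 0.5, 0.9):
    near = discounted_return(nearby, gamma)
    far = discounted_return(detour, gamma)
    choice = "detour" if far > near else "nearby"
    print(f"gamma={gamma}: nearby={near:.2f} detour={far:.2f} -> {choice}")
```

At γ = 0.1 the detour's +10 shrinks to about 0.1 and the nearby +1 wins; at γ = 0.9 it is still worth about 8.1 and the detour wins. With these numbers the switch happens where 10γ² = 1, a little above γ = 0.3.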

Click the Rocks under the optimal policy and watch it skip the +3 drill, which slips into the -20 sand one run in five, in favor of a plain move that reaches the same Crater nine times in ten. The action values make that call. And here's the point of keeping both tools together: the value the explorer computes is exactly the number the sampler crawls toward for the same policy. One knows the map, the other only gets to sample it. That gap is the line between dynamic programming and reinforcement learning.
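The drill-versus-move call at the Rocks can be worked through as a rough sketch. The +3, the -20, the one-in-five slip, and the nine-in-ten move come from the text above; everything else is an assumption made only so the arithmetic can run: the Crater's value of 5.0, the +3 being paid only when the drill doesn't slip, the move's miss landing on Plain with value 0, and the names `V`, `ACTIONS`, and `action_value`.

```python
GAMMA = 0.9

# Assumed values of the landing spots (placeholders, not the demo's numbers).
V = {"Crater": 5.0, "Plain": 0.0, "Sand trap": 0.0}

# action -> list of (probability, next_state, reward_on_arrival)
ACTIONS = {
    # From the text: +3 drill, slips into the -20 sand one run in five.
    # Assumed: the +3 is only earned when the drill does not slip.
    "drill": [(0.8, "Crater", 3.0), (0.2, "Sand trap", -20.0)],
    # From the text: reaches the Crater nine times in ten.
    # Assumed: the miss lands on Plain and the move itself pays nothing.
    "move":  [(0.9, "Crater", 0.0), (0.1, "Plain", 0.0)],
}


def action_value(outcomes, values, gamma):
    """Expected reward plus discounted value of the landing spot."""
    return sum(p * (r + gamma * values[nxt]) for p, nxt, r in outcomes)


q = {a: action_value(o, V, GAMMA) for a, o in ACTIONS.items()}
best = max(q, key=q.get)
```

Under these assumptions the drill comes out at about 2.0 and the move at about 4.05, so `best` is the move: the one-in-five chance of -20 costs more than the +3 pays.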