Mars Rover Explorer

A rover sits on a slope of red dust with half a battery and a short list of moves: drive, drill, photograph, point its antenna at Earth. Around it are eight places it can end up. Six it can pass through: Base, Plain, Rocks, Ridge, Crater, and the Signal, where it uplinks the day's data. Two it can't leave: Mission end, safely home, and the Sand trap, where it's stuck for good. Some spots pay off, with the Crater and the Signal carrying the real science, and one, the Sand trap, can end the day outright.

Which moves are worth making depends on a single number per spot: its value, the total reward you can expect from there to the end of the day. There are two ways to get those numbers, and you can try both here. One samples, sending the rover out on real runs and averaging what each spot actually earns. The other solves, reading the whole map and computing each value exactly. They reach the same answer from opposite directions, and watching them meet is the quickest way I know to feel what value means in a Markov decision process.
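
For readers who like it in symbols, here is the usual textbook way to write that number down. The notation is an assumption on my part, since the tools themselves never show it: R is the reward at each step, G is the return (the discounted total from a given step onward), γ is the discount between 0 and 1, and π is whatever rule the rover follows.

```latex
G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \dots
    = \sum_{k=0}^{\infty} \gamma^{k} R_{t+k+1},
\qquad
v_\pi(s) = \mathbb{E}_\pi\!\left[\, G_t \mid S_t = s \,\right]
```

The value of a spot is the average of G over every way the day could unfold from there. The sum ends in practice, because every run eventually reaches Mission end or the Sand trap.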

Value from sampled runs

Pick what the rover follows, set the battery with the discount γ, and hit play. It takes a full run, from Base until it's home or stuck, collecting rewards along the way. Then for every spot it stood on, the rewards from that spot onward get added up, later ones discounted, and that total, the return, drops back onto the spot. One run says almost nothing: a lucky pass to the uplink makes a spot look great, a run that ends in the sand makes the same spot look terrible. Run a few hundred and the per-spot averages, the v̂(s) numbers, stop lurching and settle. Nobody handed the rover the map. The values came out of experience.
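
Here is a minimal sketch of that averaging in Python. It is not the code behind the tool. The map is a hypothetical two-spot toy, the names `TOY_MAP`, `sample_run`, and `mc_values` are made up for illustration, and I assume the every-visit flavor of the estimator (every time a spot appears in a run, its return is recorded).

```python
import random
from collections import defaultdict

# Hypothetical toy map, NOT the article's eight-spot world.
# Each non-terminal spot lists (probability, next_spot, reward) outcomes.
TOY_MAP = {
    "Base":   [(0.8, "Crater", 0.0), (0.2, "Sand", -20.0)],
    "Crater": [(0.9, "Home", 4.0), (0.1, "Sand", -20.0)],
}
TERMINAL = {"Home", "Sand"}

def sample_run(rng, start="Base"):
    """Play one full run; return a list of (spot, reward earned on leaving it)."""
    run, spot = [], start
    while spot not in TERMINAL:
        outcomes = TOY_MAP[spot]
        weights = [p for p, _, _ in outcomes]
        _, nxt, reward = rng.choices(outcomes, weights=weights)[0]
        run.append((spot, reward))
        spot = nxt
    return run

def mc_values(num_runs, gamma, seed=0):
    """Every-visit Monte Carlo: average the discounted return seen from each spot."""
    rng = random.Random(seed)
    total = defaultdict(float)
    count = defaultdict(int)
    for _ in range(num_runs):
        g = 0.0
        # Walk the run backwards so g builds up the discounted return.
        for spot, reward in reversed(sample_run(rng)):
            g = reward + gamma * g
            total[spot] += g
            count[spot] += 1
    return {s: total[s] / count[s] for s in total}

v_hat = mc_values(num_runs=20000, gamma=0.9)
```

On this toy map the exact values at γ = 0.9 work out by hand to 1.6 for the Crater and about -2.85 for Base, so `v_hat` should land close to those. Drop `num_runs` to 10 and the estimates lurch around, which is the effect the tool lets you watch.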

The same world, taken apart

Now the exact answer. The explorer holds the same eight spots and opens in three lenses. Start bare: just the states and the odds of moving between them. Click any spot to see where the rover can drift and how often, and sample a few runs to watch how easily an innocent-looking start ends in the sand. Add rewards and a discount and every spot earns its value, computed outright instead of sampled. Add the rover's own choices and you can hand it a policy and compare two ways to behave.
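
As a rough picture of what "computed outright" can mean, here is a sketch that sweeps the Bellman expectation equation until the numbers stop moving. I don't know which method the explorer actually uses, so treat the iterative sweep as one reasonable choice. The map is again a hypothetical toy, and `TOY_MAP` and `exact_values` are illustrative names.

```python
# Hypothetical toy map, NOT the article's eight-spot world.
# Each non-terminal spot lists (probability, next_spot, reward) outcomes.
TOY_MAP = {
    "Base":   [(0.8, "Crater", 0.0), (0.2, "Sand", -20.0)],
    "Crater": [(0.9, "Home", 4.0), (0.1, "Sand", -20.0)],
}
TERMINAL = {"Home", "Sand"}

def exact_values(gamma, tol=1e-12):
    """Repeat v(s) = sum of p * (r + gamma * v(s')) until no value changes."""
    # Terminal spots keep value 0: nothing more can be earned from them.
    v = {s: 0.0 for s in list(TOY_MAP) + list(TERMINAL)}
    while True:
        biggest_change = 0.0
        for spot, outcomes in TOY_MAP.items():
            new = sum(p * (r + gamma * v[nxt]) for p, nxt, r in outcomes)
            biggest_change = max(biggest_change, abs(new - v[spot]))
            v[spot] = new
        if biggest_change < tol:
            return v

v = exact_values(gamma=0.9)
```

No runs are sampled anywhere. The function reads the probabilities straight off the map, which is exactly the knowledge a sampling rover does not have.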

A few things are worth pushing on. The discount γ is the rover's battery. Near 0 it's nearly dead and counts only the next step, grabbing whatever science is in reach; near 1 it has a full charge and the patience to take a long detour for a bigger payoff later. On autopilot, turning γ up actually makes every spot look worse, because a rover with no steering drifts into the sand sooner or later, and the more it cares about the future, the more that ending weighs on the present. Give it a steering wheel and the same knob does something subtler: somewhere near a full charge the best plan can flip, the rover giving up the short safe road for the long route through the Crater once it can finally afford to wait for the richer science.
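
The flip is easy to reproduce with arithmetic alone. The sketch below compares two made-up reward sequences, a short road that pays a little right away and a long road that pays more after a wait. The numbers are hypothetical, not read from the explorer, and there is no sand or randomness in them.

```python
def discounted_return(rewards, gamma):
    """Add up rewards, scaling the k-th one by gamma**k."""
    return sum(gamma**k * r for k, r in enumerate(rewards))

# Hypothetical reward sequences, not taken from the explorer.
short_road = [1.0]                 # small payoff immediately
long_road = [0.0, 0.0, 0.0, 2.0]   # three empty steps, then a bigger payoff

for gamma in (0.5, 0.7, 0.9):
    short = discounted_return(short_road, gamma)
    long_ = discounted_return(long_road, gamma)
    better = "long" if long_ > short else "short"
    print(f"gamma={gamma}: short={short:.3f} long={long_:.3f} -> {better}")
```

With these numbers the long road is worth 2γ³, which overtakes the short road's 1 only once γ passes roughly 0.79. Below that, the patient plan loses.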

The bigger lever is the policy. The Lost Wanderer flips a coin at every fork; the optimal policy chooses deliberately. At γ = 0.9 the Wanderer is worth less than nothing where it starts, about -2.0 at Base, since aimless thrashing finds the sand often enough to eat whatever science it stumbles into. The optimal rover is worth around +4.3 at the same spot. Same dust, same dice. The only thing that changed is the rule it follows, and that rule is the whole distance from -2 to +4.

Switch to the decision lens and click the Rocks, then the Crater. Both offer the same kind of bet: a risky grab worth a lot, or a safe step worth less. In the Rocks the rover skips the grab. Drilling pays +3 but slips into the -20 sand one run in five, while simply moving reaches the same Crater nine times in ten for the price of a step. In the Crater it does the opposite and takes the grab, because there the payoff is +4 and the slip is only one in ten. Same shape of decision, opposite answers, and the action values are what tell them apart.
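
An action value is a weighted sum: for each thing that can happen, the probability times the reward plus the discounted value of wherever the rover lands. Here is a sketch of the Rocks decision. Only the +3, the -20, the one-in-five slip, and the nine-in-ten move come from the text above. Everything else (where each outcome lands, a step cost of -1, what happens on the other one move in ten, and the downstream values) is an assumption I filled in so the arithmetic can run.

```python
def action_value(outcomes, values, gamma):
    """q(s, a) = sum over outcomes of p * (reward + gamma * v(next spot))."""
    return sum(p * (r + gamma * values[nxt]) for p, nxt, r in outcomes)

# ASSUMED numbers for illustration only. The +3, -20, 1-in-5 and 9-in-10
# figures are from the article; the landing spots, the -1 step cost, the
# failed-move outcome, and these downstream values are hypothetical.
values = {"Rocks": 0.0, "Crater": 3.0, "Sand": 0.0}
rocks = {
    "drill": [(0.8, "Rocks", 3.0), (0.2, "Sand", -20.0)],
    "move":  [(0.9, "Crater", -1.0), (0.1, "Rocks", -1.0)],
}

q = {action: action_value(outs, values, gamma=0.9) for action, outs in rocks.items()}
best = max(q, key=q.get)
```

Under these assumptions drilling comes out at -1.6 and moving at +1.43, so `best` is `"move"`. The one-in-five chance of -20 costs 4 on average, more than the 2.4 the drill brings in.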

Here's the connection between the two tools. The value the explorer computes under a policy is exactly the number the sampled averages are crawling toward when the rover follows that same policy. One reads the whole map and folds the rewards inward; the other never sees the map and recovers the same answer from real runs. Watch the sampler under the optimal policy and the difference gets sharp: it only fills in the spots the rover actually drives through, and the abandoned routes stay blank, because averaging can't estimate a return it never sees. That gap, knowing the dynamics versus only getting to sample them, is the line between dynamic programming and reinforcement learning. The rest of the series lives on the sampling side of it.