Gridworld: Dynamic Prog
A 10×10 deterministic gridworld for watching Bellman backups and greedy policy improvement solve a known MDP. Pick a route template or click any cell to choose the destination, then scrub value iteration one sweep at a time.
Interactive visualizations for building intuition about deep learning concepts.
A 10×10 deterministic gridworld for watching Bellman backups and greedy policy improvement solve a known MDP. Pick a route template or click any cell to choose the destination, then scrub value iteration one sweep at a time.
Choose a model from GPT-2 to LLaMA 3.1 405B and DeepSeek V3, pick your GPUs, precision, and parallelism, and get the full pretraining bill: memory per GPU, training days, and dollar cost. A post-training tab does the same for SFT, LoRA/QLoRA, DPO, PPO, and GRPO.
Build, train, and visualize multi-layer perceptrons in real time. Experiment with architectures, datasets, and hyperparameters to develop intuition for how neural networks learn.
One eight-state rover world, layered three ways: a bare Markov process you can sample trajectories from, a reward process with discounted state values, and a full decision process where you pick a policy. Drag the discount γ, switch between the Lost Wanderer and the optimal policy, and click any state to unfold its Bellman backup.
Paste a paragraph, dial in a vocabulary size, and watch Karpathy's minBPE mint one merge token at a time. Each learned token gets compared byte-for-byte against GPT-4's cl100k_base so you can see where your forty-byte trainer agrees with the production tokenizer — and where it goes its own way.
A small visible-window KV-cache recall trace for seeing how FIFO leaves foreign work ahead of a request's tail, and how a request-aware dispatch heuristic changes that.
Pick from twenty-nine production tokenizers — tiktoken, Llama, Gemma, Mistral, Qwen, DeepSeek, Cohere, Phi, Yi — paste any text, and watch every model carve it into pieces. Click a token to walk its BPE merge tree.
Pin two to four tokenizers against the same input and watch where they carve it differently. Highlighted pills mark the spans only some encoders share; the bar at the top prints counts plus a spread so the multilingual penalty becomes a number.
Pick a preset or dial in V, d, L, T, B, and dtype to see the real tax of vocabulary on a transformer: embedding parameters, untied LM head, logits FLOPs per token, and KV-cache bytes.