
Train your own BPE

Paste a paragraph, pick how many merges to mint, and watch byte-pair encoding compress it one bigram at a time. The bottom half lines your learned vocabulary up against GPT-4's cl100k_base: the same algorithm, run on trillions of tokens of internet text instead of one paragraph.


Karpathy's minBPE is two hundred lines of Python that show why every modern language model rolls its own tokenizer. It starts with the UTF-8 byte stream (256 base tokens, no exceptions, no vocabulary file) and asks one question: which pair of adjacent tokens occurs together most often? Mint a new token for that pair. Replace every instance with the new token. Repeat. The sequence shrinks; the vocabulary grows. When you hit the budget, you stop.
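
For readers who want the loop in code rather than prose, here is a minimal sketch of it in the spirit of minBPE's byte-level trainer; the function names are illustrative, not minBPE's exact API:

```python
def get_stats(ids):
    """Count how often each adjacent pair occurs in the token sequence."""
    counts = {}
    for pair in zip(ids, ids[1:]):
        counts[pair] = counts.get(pair, 0) + 1
    return counts

def merge(ids, pair, new_id):
    """Replace every occurrence of `pair` in `ids` with the token `new_id`."""
    out, i = [], 0
    while i < len(ids):
        if i < len(ids) - 1 and (ids[i], ids[i + 1]) == pair:
            out.append(new_id)
            i += 2
        else:
            out.append(ids[i])
            i += 1
    return out

def train(text, vocab_size):
    """Learn `vocab_size - 256` merges from the raw UTF-8 byte stream."""
    ids = list(text.encode("utf-8"))      # 256 base tokens, no vocabulary file
    merges = {}                           # (left, right) -> new token id
    for new_id in range(256, vocab_size):
        stats = get_stats(ids)
        if not stats:
            break                         # nothing left to merge
        pair = max(stats, key=stats.get)  # the most frequent adjacent pair
        ids = merge(ids, pair, new_id)    # the sequence shrinks
        merges[pair] = new_id             # the vocabulary grows
    return ids, merges
```

Run `train` on the same paragraph you paste into the trainer and the `merges` dict comes back in learned order, which is what the applied-merges list in the visualization is showing.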

The trainer above runs that loop server-side and ships back a snapshot per merge, so you can scrub through the trace. The defaults are deliberately small (fifty kilobytes of corpus, ninety-six merges max) because the point isn't to ship a production tokenizer. The point is to feel the loop. Press Train with the Karpathy paragraph already loaded and the first merge that pops out is almost always · t: a space followed by a t. Not a word. Not a morpheme. Just a pair of bytes the algorithm noticed because that paragraph happens to have a lot of thes in it. The second is he·, an h and an e followed by a space: the end of a word. The third is · the·, folding the first two into a word boundary. You can watch language emerge from frequency.

The center pane shows three things at once. The sequence preview at the top is the first ninety-six tokens of the corpus, redrawn after every merge: the same bytes, but with freshly minted pairs collapsed into single pills. The current merge band names the pair the algorithm just made and how many times it appeared. The applied-merges list underneath is the full vocabulary so far, in the order each token was learned. Click any row and the sequence rewinds to that step. The pills carry a stable hue per token id (the same id is always the same color), so you can track a single merge as it propagates through the corpus.

The right column is the comparison that gives this tool its name. Every token your run produces gets looked up, byte for byte, against cl100k_base, the BPE vocabulary GPT-4 was trained on. Most of your tokens will hit. The base bytes do: all 256 single-byte values are base tokens in both vocabularies. The early merges often do too. · t, · a, in are byte sequences common enough in any English text that both algorithms find them. Where your trainer goes its own way is on the rarer merges, the ones sensitive to the particular corpus you fed it. Train on the FizzBuzz example and you'll mint a token for print(; cl100k_base keeps print and ( as separate tokens because its regex pre-splitting never lets a run of letters fuse with the punctuation that follows, no matter how much Python it saw. Train on the Unicode example and entire codepoints fall back to byte sequences with no match at all, because GPT-4 had seen café and 東京 as whole strings long before they reached the merge frequency you can fake with one paragraph.
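
A sketch of that lookup, reusing the merges dict from the training sketch above and assuming tiktoken's encode_single_token, which raises KeyError when a byte string is not a single cl100k_base token:

```python
import tiktoken

enc = tiktoken.get_encoding("cl100k_base")  # GPT-4's BPE vocabulary

def token_bytes(token_id, merges):
    """Expand a learned token id back into the raw bytes it covers."""
    if token_id < 256:
        return bytes([token_id])
    left, right = next(pair for pair, tid in merges.items() if tid == token_id)
    return token_bytes(left, merges) + token_bytes(right, merges)

def compare(merges):
    """For each learned merge, report whether cl100k_base has the same bytes."""
    for pair, token_id in merges.items():
        bs = token_bytes(token_id, merges)
        try:
            gpt4_id = enc.encode_single_token(bs)   # KeyError if no exact match
            print(f"{bs!r:20} -> cl100k_base token {gpt4_id}")
        except KeyError:
            print(f"{bs!r:20} -> no match in cl100k_base")
```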

The math is in the loop. Each pass over the sequence is O(n): counting the best pair is a hash-map tally, and applying the merge is a linear scan that replaces every match in place. The total cost is roughly O(n · k), where n is the corpus length and k is the merge count. Fifty kilobytes at ninety-six merges runs in under a hundred milliseconds on a laptop. Production tokenizers run the same loop on a thousand cores against a corpus a billion times bigger. You won't reproduce that merge schedule in a browser tab. What you can reproduce is the shape of the answer.
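
As a back-of-the-envelope check on that bound, using the tool's defaults (illustrative numbers, not measurements):

```python
n = 50_000   # ~50 KB of corpus -> ~50,000 byte-level base tokens
k = 96       # merge budget
# Each merge is one O(n) counting pass plus one O(n) replacement pass,
# and the sequence only shrinks, so the total work is bounded by roughly:
print(2 * n * k)   # 9,600,000 elementary pair operations, at most
```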

The dial labeled vocab size is the only knob worth playing with. The minimum is 257 (the 256 base bytes plus exactly one merge); below that there's nothing to look at. The maximum is 352, the ninety-six-merge cap. The interesting range is the bottom third. Push it to 260 and watch which three bigrams the algorithm thinks are worth minting first. Push it to 280 and a recognizable English vocabulary starts to surface: the, and, ·a, ·t, ing. Push it to 350 and you'll start seeing whole words that only appeared once in the corpus. That's the failure mode every production tokenizer designs around, with the regex pre-splitting and word-boundary rules this minimal trainer deliberately skips. The gap between what your one-paragraph trainer learns and what cl100k_base shipped is the cost of training data, measured in hits on the right-hand table.
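
To see what that pre-splitting changes, here is a hedged sketch using the GPT-2-style split pattern that minBPE's regex variant uses (cl100k_base's own pattern is more elaborate): the text is chunked first and pairs are only ever counted inside a chunk, so print and ( can never fuse.

```python
import regex as re  # the third-party `regex` module, needed for \p{L} classes

# GPT-2-style split pattern: contractions, words, numbers, punctuation, whitespace
GPT2_SPLIT_PATTERN = r"""'(?:[sdmt]|ll|ve|re)| ?\p{L}+| ?\p{N}+| ?[^\s\p{L}\p{N}]+|\s+(?!\S)|\s+"""

text = 'for i in range(1, 101): print("FizzBuzz")'
chunks = re.findall(GPT2_SPLIT_PATTERN, text)
print(chunks)
# e.g. ['for', ' i', ' in', ' range', '(', '1', ',', ' 101', '):', ' print', '("', 'FizzBuzz', '")']
# Pair counts are taken inside each chunk, never across chunk boundaries,
# so a merge like `print(` can never be minted no matter how often it appears.
```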

Source: github.com/karpathy/minbpe (MIT). GPT-4 comparison runs against OpenAI's tiktoken cl100k_base vocabulary, loaded server-side.