Tokenizer Playground
Twenty-nine production tokenizers, the same text, twenty-nine different answers. Pick an encoding, type anything, and watch each model carve the string into the units it actually reasons about. Click any token to reverse the BPE merges that built it, byte by byte, back to the raw codepoints underneath.
The first thing a tokenizer does is the thing the model can never undo. Whatever the segmenter decides, whether don't is one token or three, whether the leading space belongs to world or stands alone, whether 안녕 is one piece or six bytes, that decision is now the unit of thought. The model only ever reasons about its own tokens. Every other detail of the original string is, from the model's point of view, gone.
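You can look at that decision outside the playground too. A minimal sketch using the tiktoken WASM build, decoding each id back to the bytes it stands for; the string is just an example, and a lone token isn't guaranteed to be valid UTF-8 on its own:

```ts
import { get_encoding } from "tiktoken";

const enc = get_encoding("cl100k_base");
const utf8 = new TextDecoder();
for (const id of enc.encode("Hello world, don't stop")) {
  // decode() on the WASM build returns raw bytes, so decode them lossily.
  const bytes = enc.decode(new Uint32Array([id]));
  console.log(id, JSON.stringify(utf8.decode(bytes)));
}
enc.free();
```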
The catalog spans the families you actually meet in production. OpenAI's tiktoken encodings (cl100k for GPT-3.5/4, o200k for GPT-4o and the o-series, plus the older p50k/r50k/gpt2 cuts) load through the official WASM build. Then Meta's Llama line, Google's Gemma, Mistral's SentencePiece and Tekken, the full Qwen and DeepSeek series, Cohere's Command R, plus Phi-3, Yi, and GPT-NeoX as the long-tail picks. Each tokenizer ships its real vocabulary (the same tokenizer.json the model is trained and served with), fetched lazily on first pick so the page's opening bundle stays small.
The colored pills are token boundaries. Each pill's hue is a deterministic hash of its integer id, so the same token wears the same color across reloads and across tokenizers that share an id space. Glyphs inside a pill stand in for invisible bytes: · is a space, ↵ a newline, → a tab, \x1b any other control byte. SentencePiece tokenizers (Llama 2, Gemma, Mistral v1, Phi-3, Yi) preserve the literal ▁ meta-space so word boundaries stay visible. Byte-level tokenizers (everything else) reverse the GPT-2 printable-byte remap so the display reads as actual UTF-8 rather than the encoded Ġ that surfaces in the underlying vocabulary.
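In sketch form, the coloring and glyph substitution look roughly like this; the hash constants, HSL values, and function names are illustrative stand-ins, not the page's actual code:

```ts
// Deterministic id → color: same token id, same hue, every reload.
function tokenColor(id: number): string {
  let h = id >>> 0;
  h = Math.imul(h ^ (h >>> 16), 0x45d9f3b);
  h = Math.imul(h ^ (h >>> 13), 0x45d9f3b);
  h ^= h >>> 16;
  const hue = (h >>> 0) % 360;
  return `hsl(${hue}, 70%, 85%)`; // pastel pill background (illustrative values)
}

// Swap invisible bytes for the visible glyphs described above.
function displayGlyphs(tokenText: string): string {
  return tokenText
    .replace(/ /g, "·")   // space
    .replace(/\n/g, "↵")  // newline
    .replace(/\t/g, "→"); // tab
}
```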
Click any pill and the merge trace opens on the right. The panel replays BPE encoding in reverse: it starts with the final token, then expands one merge per step until only the byte primitives at the bottom of the tree remain. The two children of each expanded node are the exact pair that combined during training to produce it. Most merges aren't whole words. They're fragments that just happened to co-occur often enough in the training corpus to earn a slot. Walking the tree of a common English word is usually anticlimactic; walking the tree of an emoji, a Korean syllable, or a piece of obscure punctuation is where the algorithm's priorities show.
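A minimal sketch of that reverse walk, assuming the merge rules from tokenizer.json have been turned into a map from each merged piece to the pair that produced it (the type and function names here are hypothetical):

```ts
// "the" -> ["th", "e"]; byte primitives have no entry.
type MergeTable = Map<string, [string, string]>;

interface MergeNode {
  piece: string;
  children?: [MergeNode, MergeNode];
}

function expandToken(piece: string, merges: MergeTable): MergeNode {
  const pair = merges.get(piece);
  if (!pair) return { piece }; // byte primitive: nothing ever merged to make it
  const [left, right] = pair;
  return {
    piece,
    children: [expandToken(left, merges), expandToken(right, merges)],
  };
}
```

Rendering the panel is then just a walk of this tree, exposing one more merge per step from the clicked token down to raw bytes.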
The point of having all of them in one place is that tokenization is one of the few model-internal decisions that transfers cleanly across model generations. The same cl100k that shipped with GPT-3.5 still drives GPT-4. The o200k introduced with GPT-4o is what the o-series reasons with. Llama 3, 3.1, and 3.2 share a vocabulary byte-for-byte. Llama 4 retrains it. Qwen 2, 2.5, and 3 share merges and diverge only on the special-token block. Seeing the same sentence in fifty languages carved into very different token counts by the same tokenizer is the cleanest way to understand why English speakers get cheap context windows and the rest of the world doesn't.
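You can reproduce the asymmetry in a few lines with the same tiktoken build; the sentences below are arbitrary examples, and the exact counts depend on which encoding you pick:

```ts
import { get_encoding } from "tiktoken";

const enc = get_encoding("o200k_base");
for (const text of ["Hello, how are you today?", "안녕하세요, 오늘 어떻게 지내세요?"]) {
  console.log(enc.encode(text).length, text);
}
enc.free();
```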
Everything runs in the browser. The WASM module for OpenAI's encodings (~1 MB) loads once on first tiktoken pick; transformers.js handles the rest with its own dynamic import. The tokenizer vocabularies sit in /public and stream in on demand, so picking gemma-3 for the first time fetches its 33 MB asset bundle but never touches the others. No text leaves the page.
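The loading pattern is roughly the sketch below; the module names, cache shape, and the argument handed to from_pretrained are assumptions for illustration, not the page's real code:

```ts
// Each tokenizer is fetched at most once; nothing downloads until it's picked.
const cache = new Map<string, Promise<unknown>>();

function loadTiktoken(
  encoding: "cl100k_base" | "o200k_base" | "p50k_base" | "r50k_base" | "gpt2"
): Promise<unknown> {
  let pending = cache.get(encoding);
  if (!pending) {
    // First tiktoken pick pulls in the ~1 MB WASM module via dynamic import.
    pending = import("tiktoken").then((m) => m.get_encoding(encoding));
    cache.set(encoding, pending);
  }
  return pending;
}

function loadHfTokenizer(name: string): Promise<unknown> {
  let pending = cache.get(name);
  if (!pending) {
    // transformers.js is its own dynamic import and fetches tokenizer.json on demand.
    pending = import("@huggingface/transformers").then((m) =>
      m.AutoTokenizer.from_pretrained(name)
    );
    cache.set(name, pending);
  }
  return pending;
}
```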