Tokenizers, Side by Side

Two tokenizers reading the same string can hand you back two different answers. Pin two to four side by side and the disagreements jump out: every pill with a faint inset ring marks a span the other columns didn't share.

Open a tokenizer benchmark and you'll find that cl100k_base and o200k_base differ by maybe 10% on English prose. That number is true and almost useless. Ten percent of what? Which words? Which characters? The aggregate hides the part you actually want to see: not how often two encoders disagree, but which specific spans they cut differently. That's what this page makes visible.

Each pill is one token. Its color is a deterministic hash of the integer id, so the same token wears the same color across reloads, and (more usefully) across encoders that share an id space. Glyphs inside a pill stand in for invisible bytes: · for a space, with matching marks for newlines and tabs. SentencePiece tokenizers (Llama 2, Gemma, Mistral v1, Phi-3, Yi) keep their literal meta-space so word boundaries stay visible. Byte-level tokenizers reverse the GPT-2 printable-byte remap, so the display reads as actual UTF-8 instead of the Ġ-flavored encoding stored in the underlying vocabulary.
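A minimal sketch of the color rule, assuming nothing about the page's actual hash or palette; the multiplicative hash and HSL mapping here are purely illustrative:

```ts
// Illustrative only: derive a stable pill color from a token id.
// Same id -> same hash -> same hue, on every reload and in every column.
function tokenColor(id: number): string {
  const h = (id * 2654435761) >>> 0; // 32-bit Knuth multiplicative hash (assumed, not the tool's)
  const hue = h % 360;               // fold the hash into a hue
  return `hsl(${hue}, 70%, 85%)`;    // light pastel background for the pill
}
```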

The highlighting rule is one line. A token in column k is marked when its half-open [start, end) span isn't present in every other column's offset list. Lengths don't matter, ids don't matter, only span equality matters. Drop a snippet of code in and most byte-level BPE tokenizers will agree with tiktoken on most identifiers; the disagreement clusters where you'd expect: numbers, punctuation, and the occasional Unicode glyph one of them never bothered to merge.
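A sketch of that rule, assuming each column is a list of half-open character spans; the `Span` shape and function name are illustrative, not the page's actual code:

```ts
type Span = { start: number; end: number };

// For each column, flag every token whose exact [start, end) span is
// missing from at least one other column's offsets. Ids and lengths
// play no part; only span equality does.
function markedSpans(columns: Span[][]): boolean[][] {
  const keys = columns.map(col => new Set(col.map(s => `${s.start}:${s.end}`)));
  return columns.map((col, k) =>
    col.map(s =>
      keys.some((set, j) => j !== k && !set.has(`${s.start}:${s.end}`))
    )
  );
}
```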

The strip above the columns prints each tokenizer's count plus a spread: the percent gap between the smallest and largest count for the same input. On English prose the spread barely moves. On Korean it gets ugly fast. A sentence that fits inside fifteen tokens of o200k_base can balloon to forty in r50k_base, because the older encoder never trained on enough Hangul to merge it into anything larger than raw UTF-8 bytes. The encoder that wins your benchmark loses someone else's language.
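One plausible reading of that spread, assuming the gap is measured relative to the smallest count (the source doesn't pin down the denominator):

```ts
// Percent gap between the smallest and largest token count for the same input.
function spreadPercent(counts: number[]): number {
  const min = Math.min(...counts);
  const max = Math.max(...counts);
  return ((max - min) / min) * 100;
}

spreadPercent([15, 40]); // the Korean example above: roughly 167%
```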

Tokenizer choices, input text, column order: it all lives in the URL. Copy the share link and you copy the whole view, so someone else can land on exactly what you set up. Long inputs get lz-string-compressed so a paragraph fits cleanly into a query parameter. Nothing leaves the page. Every tokenizer's vocabulary loads lazily on first pick and runs entirely in the browser.
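A sketch of the URL round trip using lz-string's URI-safe helpers; the `View` shape and the `s` query parameter are assumptions for illustration, not the page's real schema:

```ts
import {
  compressToEncodedURIComponent,
  decompressFromEncodedURIComponent,
} from "lz-string";

type View = { tokenizers: string[]; text: string }; // hypothetical shape

// Pack the whole view into a single query parameter.
function toShareUrl(view: View): string {
  const packed = compressToEncodedURIComponent(JSON.stringify(view));
  return `${location.origin}${location.pathname}?s=${packed}`;
}

// Restore the view from a share link, or null if the parameter is absent or invalid.
function fromShareUrl(url: string): View | null {
  const packed = new URL(url).searchParams.get("s");
  if (!packed) return null;
  const json = decompressFromEncodedURIComponent(packed);
  return json ? (JSON.parse(json) as View) : null;
}
```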