Tokenizers, Side by Side
Two tokenizers reading the same string can hand you back two different answers. Pin two to four side by side and the disagreements jump out: every pill with a faint inset ring marks a span the other columns didn't share.
Open a tokenizer benchmark and you'll find that cl100k_base and o200k_base differ by maybe 10% on English prose. That number is true and almost useless. Ten percent of what? Which words? Which characters? The aggregate hides the part you actually want to see: not how often two encoders disagree, but which specific spans they cut differently. That's what this page makes visible.
Each pill is one token. Its color is derived from a deterministic hash of the integer id, so the same token wears the same color across reloads, and (more usefully) across encoders that share an id space. Glyphs inside a pill stand in for invisible bytes: · for a space, ↵ for a newline, → for a tab. SentencePiece tokenizers (Llama 2, Gemma, Mistral v1, Phi-3, Yi) keep their literal ▁ meta-space so word boundaries stay visible. Byte-level tokenizers reverse the GPT-2 printable-byte remap, so the display reads as actual UTF-8 instead of the Ġ-flavored encoding stored in the underlying vocabulary.
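A minimal sketch of those two display rules, with the hash, palette, and function names being illustrative assumptions rather than the page's actual code:

```ts
/** Map a token id to a hue deterministically, so the color survives reloads. */
function tokenHue(id: number): number {
  // Knuth-style multiplicative hash; only stability matters, not quality.
  const h = (id * 2654435761) >>> 0;
  return h % 360; // degrees on the HSL color wheel
}

/** Replace invisible characters with visible stand-ins for the pill label. */
function displayText(token: string): string {
  return token
    .replace(/ /g, "\u00B7")   // · for a space
    .replace(/\n/g, "\u21B5")  // ↵ for a newline
    .replace(/\t/g, "\u2192"); // → for a tab
}

// Usage: a stable pastel background for one pill, e.g. token id 1234.
const background = `hsl(${tokenHue(1234)}, 70%, 85%)`;
```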
The highlighting rule is one line. A token in column k is marked when its half-open [start, end) span isn't present in every other column's offset list. Lengths don't matter, ids don't matter, only span equality matters. Drop a snippet of code in and most byte-level BPE tokenizers will agree with tiktoken on most identifiers; the disagreement clusters where you'd expect: numbers, punctuation, and the occasional Unicode glyph one of them never bothered to merge.
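Written out, the rule is roughly the following; the Span type and function name are assumptions, not the page's real identifiers:

```ts
interface Span { start: number; end: number } // half-open [start, end)

/**
 * Mark tokens whose [start, end) span is missing from at least one other
 * column. Only span equality matters: ids and token text are irrelevant.
 */
function markDisagreements(columns: Span[][]): boolean[][] {
  // One lookup set of "start:end" keys per column.
  const keys = columns.map(
    (col) => new Set(col.map((s) => `${s.start}:${s.end}`))
  );
  return columns.map((col, k) =>
    col.map((s) => {
      const key = `${s.start}:${s.end}`;
      // Marked unless every other column contains the exact same span.
      return !keys.every((set, j) => j === k || set.has(key));
    })
  );
}
```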
The strip above the columns prints each tokenizer's count plus a spread: the percent gap between the smallest and largest count for the same input. On English prose the spread barely moves. On Korean it gets ugly fast. A sentence that fits inside fifteen tokens of o200k_base can balloon to forty in r50k_base, because the older encoder never trained on enough Hangul to merge it into anything larger than raw UTF-8 bytes. The encoder that wins your benchmark loses someone else's language.
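One plausible reading of that spread, as a sketch (the function name is mine):

```ts
/** Percent gap between the smallest and largest token count for one input. */
function spread(counts: number[]): number {
  const min = Math.min(...counts);
  const max = Math.max(...counts);
  return ((max - min) / min) * 100;
}

// The Korean example above: 15 tokens in o200k_base vs 40 in r50k_base.
// spread([15, 40]) ≈ 166.7 -- the older encoder needs ~2.7x the tokens.
```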
Tokenizer choices, input text, column order: it all lives in the URL. Copy the share link and you copy the whole view, so someone else can land on exactly what you set up. Long inputs get lz-string-compressed so a paragraph fits cleanly into a query parameter. Nothing leaves the page. Every tokenizer's vocabulary loads lazily on first pick and runs entirely in the browser.
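A sketch of what that sharing could look like using the lz-string API the page names; the state shape and the query-parameter name are assumptions:

```ts
import {
  compressToEncodedURIComponent,
  decompressFromEncodedURIComponent,
} from "lz-string";

// Hypothetical shape of the shared view; the real parameters may differ.
interface ViewState {
  tokenizers: string[]; // column order is the array order
  text: string;
}

/** Pack the whole view into a single URL-safe query parameter. */
function toShareUrl(state: ViewState): string {
  const packed = compressToEncodedURIComponent(JSON.stringify(state));
  const url = new URL(window.location.href);
  url.searchParams.set("s", packed); // "s" is an assumed parameter name
  return url.toString();
}

/** Restore the view from a share link, or return null if none is present. */
function fromShareUrl(url: string): ViewState | null {
  const packed = new URL(url).searchParams.get("s");
  if (!packed) return null;
  const json = decompressFromEncodedURIComponent(packed);
  return json ? (JSON.parse(json) as ViewState) : null;
}
```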