May 9, 2026 · 143 min read · foundation

Tokenizers from First Principles

Tokenization looks like preprocessing and behaves like architecture. From bytes to BPE to the cracks at the frontier, this is an argument that almost everything weird about LLMs starts at the atom you chose.

#tokenization #bpe #byte-pair-encoding #wordpiece #unigram #sentencepiece #tiktoken #tekken #unicode #utf-8 #vocab-size #fundamentals

The Strawberry Problem

In late 2024 and early 2025, there was an interesting glitch that many LLM users stumbled upon. They figured out that if you ask an LLM how many rs are in strawberry, it says two.

You'd stare at the screen. You know there are three. You open a new chat and ask it to explain itself. It tells you two again, this time with a confident explanation attached. The explanation is also wrong.

This is the same model that can refactor a thousand lines of code or explain entanglement three different ways. But ask it to count letters in a word, and it stumbles.

That particular problem has mostly been resolved (labs post-trained on examples like it, and then reasoning models arrived), but LLMs have a few other quirks like it. For example, ask one to reverse .DefaultCellStyle and the answer comes back scrambled. Add 1234 + 5678 and it lands one off. Translate a paragraph into Hindi and the API bill can triple. Drop <|endoftext|> into a prompt and the response can end mid-sentence.

These seem like five different bugs. The easy thing is to file them all under "LLMs are weird" or talk about "jagged intelligence" and move on. But they actually share a single cause, and it sits at the lowest layer of an LLM's stack.

So here's the reason. When you type strawberry, the model doesn't see strawberry. GPT-4's tokenizer splits that string into three pieces (str, aw, berry) and turns them into three integers: 496, 675, and 15717. Those integers are row numbers in a giant lookup table. The model wakes up holding three rows from that table. The letters s, t, r, a, w, b, e, r, r, y are never inputs. They were last visible to the system at the moment you hit enter, and after that they were gone.

So the strawberry question, again. The model is being asked to count rs in a string that, from its point of view, has no letters in it at all. What it actually has is a few numbers ("embedding vectors" and the trained associations sitting on top of them). It guesses. The guess is confident because confident guessing is most of what the rest of the stack does, and you get back "two."

The same picture explains the other four. Reversing .DefaultCellStyle means reconstructing letters that got compressed away. Adding 1234 + 5678 means doing arithmetic on whichever chunks the merge table happened to prefer, which is rarely individual digits. A Hindi paragraph costs three times what the English one cost because the vocabulary was trained mostly on English and each Hindi word gets spelled out from many small pieces. <|endoftext|> is a special row the model was trained to treat as "stop now"; drop it in the middle of a prompt and you've sent that signal early. Different symptoms, but all with the same root: the tokenizer chose what counted as one thing, and the model lives inside that choice from its first batch.

One more thing to see before moving on. If you look at how this glorified LUT works, when we enter something like the string hello world, it becomes (based on a tokenizer called cl100k_base):
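A minimal sketch of that result, hard-coding only the two table rows this post quotes rather than loading the real encoding. The `ROWS` dictionary is a stand-in for illustration; with the tiktoken library installed, `tiktoken.get_encoding("cl100k_base").encode("hello world")` is the call that would normally produce the IDs.

```python
# The two cl100k_base rows quoted in this post for "hello world".
# Hard-coded for illustration; the real table has roughly 100k rows.
ROWS = {
    15339: b"hello",
    1917: b" world",  # the leading space lives inside the token
}

ids = [15339, 1917]
pieces = [ROWS[i] for i in ids]
text = b"".join(pieces).decode("utf-8")

print(ids)     # [15339, 1917]
print(pieces)  # [b'hello', b' world']
print(text)    # hello world
```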

The leading space is part of the second token. The model doesn't see those eleven characters; it sees two row numbers, and world (with the preceding space) and world are different rows in the table. That's the kind of small, permanent decision a tokenizer makes before training has even started.

So: a tokenizer is a compression scheme, frozen before training, that the model has to live inside for the rest of its life. It decides which substrings are cheap and which get split to pieces, where positions get spent, which logits get scored, and which embedding rows ever see a gradient. The strange part is that the piece of the stack that looks most like preprocessing has the longest reach of anything in the system.

BPE, WordPiece, Unigram, SentencePiece, tiktoken, Tekken: they're all answers to this one question. What counts as one thing? This post will take that question a bit (too?) seriously. Text cracks open down to bytes, a vocabulary gets built on top of the byte floor, and the failures from the opening will make sense one by one.

Part 1: The Atom Question

1.1 The Model Never Sees Your Text

Stay with hello world for a second. Eleven characters went in, and what reached the model was 15339 and 1917. From there the letters are gone: no h, no leading space, no second l. Just two indices into the lookup table, and whatever vectors sit at those rows.

You might look at two numbers that close and assume it means something. It doesn't, not really. 15339 and 15340 are neighbors on the number line and total strangers everywhere else, because the rows they point to were filled in independently during training. 15339 happens to be hello; 15340 is whatever unrelated thing landed in the next row over. The integer is just an address. It tells the model which row to open, and the row is where any meaning actually lives.
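The "integer is just an address" point can be sketched with a toy lookup table. Everything here is hypothetical (the table size, the width, the random values); the only thing it is meant to show is that fetching an embedding is plain row indexing, so neighboring IDs share nothing by construction.

```python
import random

random.seed(0)
VOCAB_SIZE = 20_000  # hypothetical, just large enough to hold the IDs below
DIM = 4              # hypothetical embedding width; real models use thousands

# One independently initialised vector per row. Training would move each row
# on its own, so adjacent IDs have no built-in relationship.
embedding_table = [
    [random.gauss(0.0, 1.0) for _ in range(DIM)] for _ in range(VOCAB_SIZE)
]

def embed(token_ids):
    """A token ID is only an address: return the row stored at that index."""
    return [embedding_table[i] for i in token_ids]

hello_vec, world_vec = embed([15339, 1917])
neighbour_vec = embedding_table[15340]

print(hello_vec)
print(neighbour_vec)  # next row over, unrelated values
```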

Once that lookup happens, the model is in vector space, working with the embeddings, and the spelling of your word didn't come along. It's tempting to assume the characters are still in there somewhere, folded into the vector for a later layer to pull back out. Not as letters, anyway: whatever spelling the model can recover later is association soaked into the row during training, and Part 3 gets into how unreliable that recovery is. The tokenizer was the last stage in the whole pipeline that saw h, e, l, l, o as letters; after it, there are only numbers and the vectors they index.

Try it yourself in the playground below. Change a space, uppercase a letter, paste the same word in Korean. The text on screen barely changes while the row of integers under it gets reshuffled.

This is why the strawberry count fails: by the time the model has anything in its hands, strawberry is two or three fat tokens, and the individual letters aren't there to count. A date like 2024-02-01 shows up already chopped into whatever splits the table has (202, 4, -, 02, -, 01) rather than eight clean digits and two dashes. And the same sentence in Korean can cost three or four times the tokens it costs in English, because the vocabulary was trained mostly on English, so each Korean word gets spelled out from a pile of small pieces.

1.2 Not Words, Not Characters

The most natural idea is that a token is a word. Keep a list of words, give each one an integer, look up its embedding, off you go. It works fine for cat, and honestly for most of a paperback novel. Then a real prompt shows up and it falls apart. user.profile.avatarUrl isn't a word. Neither is a URL that wraps over four lines, or a hashtag coined yesterday, or a typo, or a German compound like Donaudampfschifffahrtsgesellschaft. You can't get all of those into a list ahead of time, so you need some plan for whatever isn't in it.

For years that plan was a single token, <UNK>, meaning roughly "something was here and I won't say what." Everything unfamiliar mapped onto it. Which is pretty rough when you sit with it: a typo, a rare surname, a URL, and that German compound all turn into the same token, so the model can't tell them apart, and the letters it would need to do anything useful were thrown out before it ever ran.

Okay, so words are too coarse. Swing to the other extreme and give every character its own token. Now the vocabulary is tiny, a few hundred entries, you never meet something you can't represent, and counting the rs in strawberry is easy because the letters are sitting right there in the input. The catch is that everything gets much longer. A paragraph that was around 150 tokens turns into roughly 700 (English runs four to five characters per token, so the arithmetic is about what you'd expect), and attention cost grows with the square of the sequence length (every position attends to every other), so that's a real tax. Most of those single-character positions barely carry any information anyway: the model spends nine separate positions on t, h, e, space, m, o, d, e, l before it reaches a token worth predicting.
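The arithmetic behind that tax, worked out with the numbers from the paragraph above. The 4.5 characters per token is an assumed midpoint of "four to five", and the squared term only models the pairwise attention cost, not the rest of the network.

```python
tokens_subword = 150   # paragraph length in subword tokens (from the text)
chars_per_token = 4.5  # assumed midpoint of "four to five characters per token"

tokens_char = tokens_subword * chars_per_token  # one token per character
length_ratio = tokens_char / tokens_subword
attention_ratio = length_ratio ** 2             # pairwise attention scales as T^2

print(round(tokens_char))         # 675
print(round(attention_ratio, 2))  # 20.25
```

So a sequence about 4.5 times longer pays roughly twenty times the attention cost under this simple model.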

Subword tokenization is the compromise nearly everyone uses now. Make the vocabulary big enough that common strings collapse into a single token, but keep the ability to drop down to smaller pieces, all the way to single characters, when something rare or strange turns up. Common words stay one cheap token, and the weird stuff still gets represented instead of disappearing into <UNK>.

There's one thing to settle before any of that, though. We've been saying "character" as if it's obvious what one is, and it isn't, quite.

1.3 What Counts as One Character

How many characters is 👍🏽?

I tried to actually pin that down and got a different answer every way I asked. On screen it's one thing: a thumbs-up with a medium skin tone, one mark sitting in one space. Then I asked Python for len("👍🏽") and got 2. Then I dropped it into a text field, put my cursor after it, and hit backspace once. Half of it stayed: a plain yellow thumbs-up, skin tone gone. Three answers so far, and the tokenizer hasn't even gotten involved.

Bring the tokenizer in and you get a fourth: encode 👍🏽 under cl100k_base and it comes back as six tokens. And underneath all of it, what your computer actually stores is bytes. UTF-8 writes this one emoji as eight of them:
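The byte count can be checked directly in Python. This is a small sketch using only the standard library; the emoji is written with escapes so the two codepoints are visible.

```python
thumbs = "\U0001F44D\U0001F3FD"  # thumbs-up followed by the medium skin tone modifier

raw = thumbs.encode("utf-8")
print(len(thumbs))   # 2  (codepoints, which is what len() counts)
print(len(raw))      # 8  (bytes)
print(raw.hex(" "))  # f0 9f 91 8d f0 9f 8f bd
```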

One mark on screen, two from Python, six tokens, eight bytes. None of them is wrong. They're each counting a different layer, and "character" is a word that slides across all of them without ever telling you which one you mean. The tokenizer is eventually going to have to plant its feet on exactly one of these layers, so let's pull them apart.

Start at the top, with the thing you actually saw: one visible mark. The technical name for that is a grapheme cluster, which is really just "one character" in the everyday sense, however many smaller pieces it took to draw. Usually it's nothing fancy. Sometimes, like here, it's a base symbol with a modifier glued on so the two render as one. It can get a lot bigger: the family emoji 👨‍👩‍👧‍👦 is a single grapheme cluster stitched out of seven codepoints: four people and the three invisible joiners holding them together.

Drop down a layer and you reach what Python was counting. Unicode doesn't store "thumbs-up, medium skin tone" as one unit. It stores two codepoints, one for the thumbs-up (U+1F44D) and one for the skin tone modifier (U+1F3FD). A codepoint is just Unicode's integer ID for an abstract character: h is U+0068. Most of the time one codepoint is one grapheme and the distinction never comes up. Accents and emoji are where it leaks. The é you read can be a single codepoint (U+00E9) or two of them (a plain e followed by a combining accent at U+0301), and the two versions look identical on screen while being completely different underneath. Hold onto that one, it comes back to bite in a minute.
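A quick way to look at the codepoint layer is Python's `unicodedata` module. The sketch below lists the two codepoints inside the emoji and then compares the two spellings of é; NFC is used here only to show that the forms are canonically equivalent, not because any particular tokenizer applies it.

```python
import unicodedata

# The two codepoints behind the skin-toned thumbs-up.
for ch in "\U0001F44D\U0001F3FD":
    print(f"U+{ord(ch):04X}", unicodedata.name(ch))

composed = "\u00e9"     # é as one precomposed codepoint
decomposed = "e\u0301"  # plain e followed by a combining acute accent

print(len(composed), len(decomposed))                        # 1 2
print(composed == decomposed)                                # False
print(unicodedata.normalize("NFC", decomposed) == composed)  # True
```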

Go down one more layer and you hit what actually travels over a wire or sits on disk: bytes. Codepoints are abstract integers, and something has to write them down as concrete 8-bit values. That rule is an encoding, and the one essentially everything uses now is UTF-8. It keeps plain ASCII at one byte per character, gives every other codepoint its own unique byte sequence, and (this is the part that's about to matter) does the whole job with an alphabet of just 256 possible byte values no matter how exotic the input gets.

That gives us grapheme on top, then codepoint, then byte, and finally the tokens the tokenizer stamps on after all three. A tokenizer needs a floor: some set of indivisible units it's allowed to start from and build everything else out of. It can't just say "characters," because we've now watched that word mean four different things. So which layer does the floor get built from?

1.4 Why Bytes Are the Floor

My first instinct was codepoints, and I suspect it's most people's. They feel like the obvious pick. They're already integers, so there's no encoding step to think about (ord("h") hands you 104 directly). The Unicode Consortium has already done the staggering work of cataloging every character in every script anyone writes in. Give each codepoint a row and you're done, and you never have to look at a raw byte again.

It doesn't hold, for a reason that took me a second to see. Your vocabulary gets frozen before training and never changes after. Unicode does not get frozen. Every revision, the Consortium adds more to it: a script someone finally digitized, a batch of emoji someone decided were overdue. Anything that lands after you've frozen your table is a codepoint you have no row for, and now you're back to needing a "something was here" token at inference time. That's <UNK> again, and not having to do that was the whole reason we walked away from a fixed word list a section ago.

Graphemes are an even looser set. Once you let in skin tones, combining marks, regional flags, and the zero-width joiners that fuse several emoji into one (that seven-piece family from before), the collection of possible visible clusters isn't a list you could write down if you tried. You can always build a new valid one. That's a grammar, not a vocabulary, and there's no enumerating it up front.

Which leaves bytes, and bytes have the one property the other two are missing: the set is finished. There are 256 of them, 00 through ff, and there will only ever be 256. UTF-8 guarantees that every string, every codepoint Unicode has ever shipped or ever will ship, comes apart into some sequence of those values. Give each byte a row in the base vocabulary and there's no longer any such thing as text that falls off the table. Whatever shows up at inference, however new or strange, is still just bytes you already have rows for.

That would settle the whole thing, except a tokenizer built from raw bytes and nothing else is a bad tokenizer. Spell tokenization one byte at a time and it's twelve tokens for a single ordinary word, twelve sequence positions that each carry almost nothing on their own. It covers everything and it's painfully inefficient. So bytes aren't really the answer to "what goes in the vocabulary." They're the answer to a narrower question, "what sits at the floor," and the answer there is a guarantee: whatever happens, the text can always come apart into bytes and never hit a wall. The vocabulary is the thing you build on top of that floor.
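A bytes-only tokenizer is short enough to write out. This is a sketch of the floor by itself, with no merges on top; the function names are made up for the example.

```python
def byte_encode(text: str) -> list[int]:
    """Floor-only tokenizer: one ID per UTF-8 byte, so IDs are always 0..255."""
    return list(text.encode("utf-8"))

def byte_decode(ids: list[int]) -> str:
    """Glue the bytes back together and read them as UTF-8."""
    return bytes(ids).decode("utf-8")

ids = byte_encode("tokenization")
print(len(ids))  # 12

mixed = "é 日 🧠"
mixed_ids = byte_encode(mixed)
print(all(0 <= i <= 255 for i in mixed_ids))  # True
print(byte_decode(mixed_ids) == mixed)        # True
```

Full coverage and a lossless round trip, at the price of one position per byte.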

The guarantee hangs on one adjective. Every valid UTF-8 string has a way in, and a few things look like holes in that promise that aren't.

The guarantee itself is airtight on the encoding side. Hand the tokenizer any ordinary text (a brain emoji 🧠, an accented é, a Japanese 日, a codepoint Unicode shipped last month) and UTF-8 turns it into bytes, every byte lands between 00 and ff, and every one of those has a row. The cost in tokens might climb, but the alphabet never runs out. Three things sit outside the guarantee, though, and each one tends to get blamed on coverage when it shouldn't be.

Special tokens first. Type the literal characters <|endoftext|> into a prompt and the tokenizer can absolutely represent them. The interesting question is whether those characters map to the reserved control ID the model was trained to treat as a document boundary, or to a handful of ordinary tokens that just spell the string out. That's a policy decision the API layer makes, and it has nothing to do with Unicode.

Normalization is the second, and the one that actually bites in practice. Remember that é can be written two ways, precomposed or as e plus a combining accent. Some tokenizers silently rewrite one form into the other before encoding. For a search engine or a classifier that's a feature: both spellings fold together and nobody notices. But it means decode(encode(x)) can hand back bytes that render exactly like x without being the bytes you put in; the round trip looks lossless and isn't. If you need the bytes to survive, find out whether a normalizer is sitting inside encode.

The last one runs in the opposite direction, which is why it's easy to miss. Encoding a valid string always produces valid UTF-8 bytes. Decoding promises you nothing of the kind. Decoding is just looking IDs up in the table and gluing their bytes together, and nothing stops you from handing it a run of IDs whose bytes don't form valid UTF-8. A single token can carry a byte like 0x80, which is only legal in the middle of a multi-byte character, never on its own at the front. Hand that to a strict decoder and it throws. Hand it to a lenient one and it emits U+FFFD, the replacement character (the little � you've seen on a mangled web page), and keeps going. That U+FFFD isn't <UNK> sneaking back in through a side door; it's the decoder telling you the IDs it was handed don't spell valid text, which is a different problem from running out of vocabulary.
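Python's own UTF-8 decoder shows both behaviors. The lone `0x80` below stands in for a token whose bytes are a fragment of a multi-byte character; no tokenizer is involved in this sketch.

```python
lonely = bytes([0x80])  # a continuation byte with no lead byte before it

try:
    lonely.decode("utf-8")  # "strict" is the default error handler
    strict_result = "decoded"
except UnicodeDecodeError:
    strict_result = "raised"

lenient = lonely.decode("utf-8", errors="replace")

print(strict_result)        # raised
print(lenient == "\ufffd")  # True
```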

So the floor promises one specific thing: ordinary valid text has a way into the model and never runs into an unknown wall. It says nothing about how the API routes control tokens, whether a normalizer is rewriting your bytes on the way in, or whether an arbitrary pile of token IDs decodes back to clean text. Keep those as three separate questions and the guarantee holds up exactly as far as it should.

1.5 Five Things Move Together

Bytes solve coverage without solving tokenization. The merge table on top of the byte floor still decides how the model's world gets tiled, and token count is only the easiest number to read off the result.

Take user_id in a prompt. One tokenizer hands it back as a single ID; another splits it into three: user, _, id. Same seven characters out either way, and five different things just shifted inside the model.

Atomic units. An embedding row is a learned object. One-token user_id gives the model a single row to hang behavior off. The three-token version gets assembled from pieces, but the pieces are reusable: user shows up everywhere, _ lives in identifiers, id is in field names by the thousand. It's a bet about which atoms the model should find easy to compose from.

Sequence geometry. One token became three positions; everything after slid right. The prompt now has more attention edges, more keys and values in the KV cache (the per-position state Part 4 prices out), and hits the context limit sooner. For a single identifier the difference is invisible; for a repeated tool schema or a long log file it adds up fast.

Output distribution. A decoder-only model doesn't emit text. It scores the next token over the vocabulary, picks one ID, appends it, runs again. If the answer the user sees is one token in one vocabulary and three in another, the model made one decision in the first case and a chain of three in the second.

Training signal. Pretraining loss lands on token targets. A rare surname kept as a single token concentrates its updates into one row that almost never fires; split it into reusable pieces and some of the gradient flows through rows the model sees everywhere. The reverse is just as common: a useful recurring unit gets cut into ill-fitting pieces, and the model has to relearn the same pattern out of its neighbors.

Post-training surface. Instruction data, preference pairs, chat templates, tool schemas, safety text, eval prompts: all of it is text on the page and tokens to the model. A template that looks identical to a human can take more positions, cross different boundaries, or expose a control marker through a different route after a tokenizer swap. Not every behavior change after a swap is the tokenizer's fault, but it's one real channel.

Token count is just the number you can see. Part 3 walks through what it looks like when each of these five channels goes wrong.

1.6 Compression, Bias, Systems Cost

Those five channels are where the consequences land. The tokenizer behind them is making three bets at once, all with the same artifact. It compresses your text into fewer positions, it decides which substrings the model gets to treat as single objects, and it sets the two numbers that drive most of the inference bill.

Compression first. hello world is two IDs under cl100k_base and eleven positions as raw UTF-8 bytes. Both reconstruct the same eleven characters, but only one is cheap to attend to inside a 128k context window. The cheapness is uneven, though. the is a single row; 123 is one token; 2026 comes out as two (202, 6), not because anyone decided years deserve two slots but because that's where this tokenizer's splits happen to fall. The token counter tells you how many positions you're paying for; it doesn't tell you why some strings pay more than others.

Bias is the part people argue about loudest. Every merge says: this substring is worth treating as one object. Sometimes the judgment is right: ing, four-space indents, common JSON fragments, and leading-space English words really are atoms most of the time. Other times it hides what the task cares about. .DefaultCellStyle packed into one token makes copying it cheap and spelling questions hard, because the letters are no longer in the input. unhappiness split as un + happiness is a win, because the model has already seen un in words like unable and happiness as a word on its own. A merge landing mid-word does the opposite: the model is suddenly learning around a shape it can't reuse anywhere else. Both decode cleanly. Only one gives the model useful atoms.

And then there's the arithmetic. Vocabulary size V and sequence length T pull on each other. A bigger V means more embedding rows and a wider output head, but a shorter T for the same text. Smaller pieces flip the trade: lighter head, longer sequences.
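A rough sketch of how V and T pull against each other. Every number below is hypothetical (the model width and both tokenizer configurations are invented for illustration), and the two helper functions are simplifications: one counts embedding-plus-head parameters, the other counts the position pairs full self-attention scores in one layer.

```python
def vocab_params(vocab_size: int, d_model: int, tied: bool = False) -> int:
    """Parameters in the embedding table plus the output head."""
    table = vocab_size * d_model
    return table if tied else 2 * table

def attention_pairs(seq_len: int) -> int:
    """Position pairs scored by full self-attention in one layer."""
    return seq_len * seq_len

D_MODEL = 4096                      # hypothetical model width
small = {"V": 32_000, "T": 1_300}   # hypothetical: small vocab, longer sequence
large = {"V": 128_000, "T": 1_000}  # hypothetical: large vocab, shorter sequence

for name, cfg in (("small", small), ("large", large)):
    print(name, vocab_params(cfg["V"], D_MODEL), attention_pairs(cfg["T"]))
```

The larger vocabulary pays more in rows and wins on positions; which side matters more depends on the model and the traffic.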

All three bets ride on the same artifact, which is why "this tokenizer is better than that one" is such a slippery claim. Better can mean a smaller bill, or cleaner stems and suffixes, or code that splits at sensible boundaries, or less multilingual bloat, or a smaller output head, or fewer ugly downstream failures. Related claims, not the same claim, and they routinely disagree on which artifact wins.

Part 2: Building a Vocabulary

The byte floor from Part 1 buys coverage, and the cost shows up in sequence positions. hello world runs eleven of them. A plain thumbs-up 👍 takes four (the skin-toned one from Part 1 took eight). The four-space indent at the start of a Python function body takes another four. The 128k context windows we keep advertising don't survive a normal day of production traffic at that rate, especially once code or non-Latin scripts show up.

The right vocabulary sits somewhere between "pure bytes and stop" and "every common phrase in English gets its own row". So which byte sequences earn a row?

The easy calls aren't interesting: the deserves a row, and so does ing. The four-space Python indent that opens most function bodies, debatable but probably yes. A Reddit username that showed up twice in the training corpus, probably not. The gap between those two is what the tokenizer trainer is paid for. Before any model training begins, the trainer reads through a corpus, picks which byte strings get to be cheap, and stamps one ID on each chosen string. Everything else gets reassembled at inference time out of whatever smaller pieces survive in the vocabulary, all the way down to single bytes if no learned alternative made it through.

I think of a vocabulary as a codebook, not a dictionary. Train one on Python and English web text and chat transcripts and you get one set of cheap substrings. Train it on Hindi novels and operations logs and you get a different one. Whichever discounts and bad cuts come out of the trainer, the model spends its whole pretraining run learning to live with them.

BPE is where almost everyone starts, and where most of the family still lives. The whole algorithm fits on one line: count adjacent pairs in the corpus, merge the most common one into a new symbol, repeat. It was invented in 1994 to compress files and pulled into NLP around 2016, and most modern tokenizers are either BPE with a twist or BPE inside a wrapper. The variations are easier to see once the loop itself is on the table.

2.1 BPE Was a Compression Loop

Philip Gage published Byte Pair Encoding in The C Users Journal in February 1994, and it had nothing to do with language. It was a file compressor. The algorithm was mechanical: find the most common adjacent byte pair, mint a new symbol for it, scan through replacing every non-overlapping occurrence, repeat. Gage's compressor stopped at 256 because his new symbols had to fit in a byte themselves. Once every slot was spoken for, the loop had nowhere left to grow.

Sennrich, Haddow, and Birch yanked that ceiling off in 2016. They were working on neural machine translation for German, which is brutal for fixed word vocabularies. Donaudampfschifffahrtsgesellschaft shows up once in your training corpus and immediately becomes <UNK>, along with every place name and every recently coined chemical compound.

The fix was small. Run Gage's loop on characters and stop wherever the compute budget told you to, somewhere around 30k, 50k, or 100k entries. Continuation markers kept the segmentation reversible so the original sequence could be reassembled at decode time. A rare word, a typo, a German compound: none of them disappeared into a single shared UNK row before training had even started. The pieces still had to be learned, but the input wasn't being erased on the way in. That was the open-vocabulary win, and GPT-style tokenizers later swapped the floor from characters to UTF-8 bytes without changing the basic bargain. Three decades after Gage, his compression loop is still running under every modern LLM.

Karpathy's smallest worked example of the loop is aaabdaaabac. Four letters in the alphabet, eleven characters in the corpus, small enough to trace by hand and big enough to have a clear winner pair. Count the adjacent pairs. (a, a) shows up four times, more than anything else, so mint a new symbol Z = aa, scan left to right replacing every non-overlapping aa, and the corpus becomes Z a b d Z a b a c. Recount on the new string. Mint the next winner. Keep going until the vocabulary fills to whatever budget you set.

The trainer below is that loop, with the bookkeeping exposed.
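A minimal sketch of such a trainer, written for this walkthrough rather than copied from any library. It works on characters instead of bytes to match the aaabdaaabac example, names each new symbol by concatenating its parts (so aa plays the role of Z), and breaks ties by first occurrence, which is an arbitrary choice.

```python
from collections import Counter

def count_pairs(seq):
    """Count every adjacent pair of symbols in the sequence."""
    return Counter(zip(seq, seq[1:]))

def merge_pair(seq, pair, new_symbol):
    """Scan left to right, replacing non-overlapping occurrences of pair."""
    out, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            out.append(new_symbol)
            i += 2
        else:
            out.append(seq[i])
            i += 1
    return out

def train_bpe(text, num_merges):
    """Return the ordered merge list and the final segmented corpus."""
    seq = list(text)
    merges = []
    for _ in range(num_merges):
        pairs = count_pairs(seq)
        if not pairs:
            break
        # Ties go to the pair seen first; real trainers pick their own rule.
        best = max(pairs, key=pairs.get)
        seq = merge_pair(seq, best, best[0] + best[1])
        merges.append(best)
    return merges, seq

merges, seq = train_bpe("aaabdaaabac", num_merges=1)
print(merges)  # [('a', 'a')]
print(seq)     # ['aa', 'a', 'b', 'd', 'aa', 'a', 'b', 'a', 'c']
```

Raising `num_merges` keeps the loop going; the returned `merges` list is the ordered artifact the encoder later replays.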

What you walk out with is a list of merges, in order, and the order matters more than it looks. The encoder has to replay it exactly at inference time, or the IDs come out wrong. Take aab and the ranks {aa: 0, ab: 1}. Apply rank 0 first, you get [aa, b]. Apply rank 1 first, you get [a, ab]. Both orders decode to the same string, but they produce different IDs, and the IDs are what the model trained against. The merge file is small, plain text, deterministic, and the entire artifact you need to reproduce the encoder.
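The aab case can be run directly. This is a schoolbook encoder sketched for the example: it rescans after every merge and always applies the lowest-ranked pair it can find. Swapping the two ranks stands in for "apply rank 1 first".

```python
def encode_with_ranks(text, ranks):
    """Repeatedly apply the lowest-ranked merge present in the sequence."""
    seq = list(text)
    while len(seq) > 1:
        candidates = [
            (ranks[pair], i)
            for i, pair in enumerate(zip(seq, seq[1:]))
            if pair in ranks
        ]
        if not candidates:
            break
        _, i = min(candidates)  # best rank, leftmost on ties
        seq[i:i + 2] = [seq[i] + seq[i + 1]]
    return seq

print(encode_with_ranks("aab", {("a", "a"): 0, ("a", "b"): 1}))  # ['aa', 'b']
print(encode_with_ranks("aab", {("a", "b"): 0, ("a", "a"): 1}))  # ['a', 'ab']
```

Both outputs join back to aab, but they are different token sequences, which is the whole reason the merge order has to be stored.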

Karpathy's minBPE is the version of all this I'd reach for at a whiteboard. UTF-8 bytes at the floor, a merge table, train, encode, decode. Around seventy lines of Python once you strip the file I/O. The first test worth writing is simple: feed it text, encode, decode, compare bytes. If that round trip fails, nothing downstream matters.

Go looking for the production version and the first thing you notice about tiktoken is what isn't there. No training mode. The library can't make a tokenizer; it can only execute one. You hand it a frozen encoding (a pre-tokenization regex, a byte-pair rank table, a small set of specials) and it runs that artifact fast enough to put on every inference request and every training document. The speed isn't magic. A schoolbook encoder rescans the chunk after every merge, which is fine on aaabdaaabac and hopeless on a real corpus. tiktoken keeps the same merge semantics but tracks adjacent pairs in a linked structure, applies the best-ranked merge, and patches up only the neighborhood that just changed, which produces the same token IDs at production speed.

The split between the two libraries is visible in the cl100k_base encoding file itself: pat_str, merge ranks, specials, and nothing else. That file is the whole artifact, and the Rust core's only job is to obey it, quickly and exactly. minBPE is how you'd build the table to understand it; tiktoken is what runs the table once a model actually depends on it.

2.2 Byte-Level BPE

Sennrich's BPE ran on characters, which still leaves a version of the old problem: the trainer can meet a character it never chose to keep, and the unknown thing ends up at <UNK>. GPT-2 closed that hole by moving the floor down to bytes. Its tokenizer starts with all 256 byte values pre-loaded as base IDs, so every valid UTF-8 string is representable before BPE has learned a single merge.

hello 🧠 as bytes:
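One way to print those bytes with the standard library (the emoji is written as an escape so the codepoint is explicit):

```python
text = "hello \U0001F9E0"  # "hello " followed by the brain emoji
raw = text.encode("utf-8")

print(raw.hex(" "))  # 68 65 6c 6c 6f 20 f0 9f a7 a0
print(len(raw))      # 10
```

Six ASCII bytes, then four bytes for the emoji.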

The base vocabulary has no row for the brain emoji, and it doesn't need one. The four bytes f0 9f a7 a0 already have rows. So do the space byte and the ASCII letters. BPE then adds compression on top of that floor. Maybe hello collapses to one token, maybe a leading-space word like world becomes one, maybe a common emoji earns a row of its own. If a string never gets common enough to earn a merge, the encoder just spells it out down to single bytes.

That's the entire "no <unk> needed" claim, stated narrowly. Byte-level BPE doesn't need <unk> for ordinary valid UTF-8. The encoding might fragment uglier than you'd like, but the string survives all the way through.

What backs that claim is byte-for-byte equality on the encode/decode round trip, and the test has to be precise about what it compares. Here's a case that fools the loose version. You paste café into a prompt. Then you paste it again, this time copying from a different editor. Both render identically: same font, same kerning, same glyph. The bytes don't agree:
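The two byte sequences, printed with the standard library. The strings are written with escapes because the two forms are indistinguishable once rendered.

```python
composed = "caf\u00e9"     # é as one precomposed codepoint
decomposed = "cafe\u0301"  # e followed by a combining acute accent

print(composed.encode("utf-8").hex(" "))    # 63 61 66 c3 a9
print(decomposed.encode("utf-8").hex(" "))  # 63 61 66 65 cc 81
```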

Five bytes against six: it's the two spellings of é from Part 1, back inside a different word. One is é as a single precomposed codepoint; the other is e followed by a combining acute accent, which UTF-8 spells with two bytes. Whichever one you typed is the one the model should see. A tokenizer is allowed to split these two strings differently, and the decomposed form can come out a token longer. What it can't do, on an LLM prompt path, is silently eat the difference and hand back a string that merely looks close to what you typed.

So there's a test. The loose version first:

The strict version is about bytes, not about what your font happens to render:

The strict one catches what the loose one sleeps through. A normalizer that rewrites cafe + U+0301 into precomposed é fails it. So does a wrapper that drops a zero-width joiner from an emoji sequence, a tokenizer that collapses an unknown mark into <unk>, and a special-token policy that lets literal user text like <|endoftext|> slip onto the control-token path when the caller never asked for it.
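Both checks can be sketched against a toy tokenizer. Everything here is an assumption made for illustration: the byte-level encode and decode stand in for a real tokenizer, `normalizing_encode` is a hypothetical wrapper that applies NFC before encoding, and the loose check is taken to mean "equal after normalization", which is one plausible reading of "looks the same".

```python
import unicodedata

def byte_encode(text):
    return list(text.encode("utf-8"))

def byte_decode(ids):
    return bytes(ids).decode("utf-8")

def normalizing_encode(text):
    """Hypothetical wrapper that NFC-normalizes before encoding."""
    return byte_encode(unicodedata.normalize("NFC", text))

def loose_roundtrip(encode, decode, text):
    """Passes if the result matches after normalization, i.e. renders alike."""
    out = decode(encode(text))
    return unicodedata.normalize("NFC", out) == unicodedata.normalize("NFC", text)

def strict_roundtrip(encode, decode, text):
    """Passes only if the exact UTF-8 bytes come back."""
    return decode(encode(text)).encode("utf-8") == text.encode("utf-8")

decomposed = "cafe\u0301"
print(loose_roundtrip(normalizing_encode, byte_decode, decomposed))   # True
print(strict_roundtrip(normalizing_encode, byte_decode, decomposed))  # False
print(strict_roundtrip(byte_encode, byte_decode, decomposed))         # True
```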

Passing the round trip doesn't make a tokenizer good. A raw byte tokenizer passes, a character tokenizer passes, and a tokenizer that makes Korean expensive and chops long numbers into nonsense passes too. Reversibility is only a floor, but it's a real one: a modern LLM tokenizer that hands back <unk> for ordinary valid UTF-8 means the model never gets the evidence the user actually typed.

2.3 `Ġ` Is Byte `0x20`

Tokenize hello world with a GPT tokenizer and the second token comes back Ġworld, leading with a character you've never typed in your life. That leading character is the ASCII space byte, 0x20, wearing a printable Unicode costume; it isn't a word-boundary marker and it isn't linguistics.

The costume exists for a small, annoying reason. BPE wants symbols, and bytes are the symbols. Some bytes print cleanly: h, e, 7. Others are whitespace, control codes, or the middle of multi-byte UTF-8 sequences. A literal newline inside a merge table is meaningful data and miserable to read. So GPT-2's authors swapped each unprintable byte for something they could actually see.

The swap is a static permutation, picked once and never moved since. Every byte from 0 to 255 maps to one printable codepoint. Printable ASCII (33 through 126) maps to itself, and so do most of the printable Latin-1 bytes. Space (0x20) maps to Ġ (U+0120). Newline (0x0A) maps to Ċ (U+010A). The remaining unprintable bytes are assigned, in order, to codepoints from U+0100 upward, which lands them in the Latin Extended-A block. BPE then trains and encodes in that remapped alphabet.

Shim is the word for it. This layer doesn't learn anything, and it doesn't change the fact that the tokenizer is byte-level underneath. On decode the permutation runs in reverse: Ġ becomes 0x20, Ċ becomes 0x0A, and UTF-8 stitches the bytes back into a string. The model itself never sees the costume come off. To it, Ġthe is just Ġthe, one row in the embedding table, no different from Tokyo.
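The shim is small enough to rebuild from its description. The sketch below follows the construction used in GPT-2's published encoder (bytes that already print keep their own codepoint, the rest are numbered upward from 256); the names `to_shim` and `from_shim` are made up for this example.

```python
def bytes_to_unicode():
    """Map each of the 256 byte values to one printable codepoint."""
    # Bytes that already print cleanly keep their own codepoint.
    keep = (list(range(ord("!"), ord("~") + 1))
            + list(range(ord("¡"), ord("¬") + 1))
            + list(range(ord("®"), ord("ÿ") + 1)))
    table = {b: chr(b) for b in keep}
    # Everything else (space, control codes, and so on) is numbered from U+0100.
    offset = 0
    for b in range(256):
        if b not in table:
            table[b] = chr(256 + offset)
            offset += 1
    return table

BYTE_TO_CHAR = bytes_to_unicode()
CHAR_TO_BYTE = {c: b for b, c in BYTE_TO_CHAR.items()}

def to_shim(text):
    """Text -> UTF-8 bytes -> printable stand-ins."""
    return "".join(BYTE_TO_CHAR[b] for b in text.encode("utf-8"))

def from_shim(shimmed):
    """Printable stand-ins -> bytes -> text."""
    return bytes(CHAR_TO_BYTE[c] for c in shimmed).decode("utf-8")

print(to_shim("hello world"))
print(to_shim("line one\nline two"))
```

Nothing here is learned. The table is the same for every corpus, and `from_shim(to_shim(s))` gives back `s` for any valid string.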

That's why GPT-2-lineage vocabularies are full of leading-space pieces like Ġthe, Ġimport, Ġworld. The pre-tokenizer regex usually keeps a leading space glued to the next word, and BPE notices that space + the shows up everywhere in real text. One row for Ġthe beats two smaller ones every time.

It's also a fingerprint: gpt2, cl100k_base, o200k_base, Llama 3, and Tekken differ in vocabulary size, regex policy, specials, and training data, but whenever a Ġ shows up in a token dump, the 2019 GPT-2 byte shim is still wired in down there somewhere.

2.4 Five Algorithms, One Decision Tree

There aren't really five tokenizer algorithms here. There's one decision tree with four forks, and BPE, WordPiece, Unigram, SentencePiece, and Tekken are five paths through it. The papers feel disjoint because each one ships its bundle of choices under its own name. Line them up against the forks and most of the apparent disagreement collapses.

Four questions every one of them has to answer:

Fork	Common choices	What it changes
What earns a row?	frequent pair, PMI-ish pair, likelihood-preserving piece	whether the table prefers common chunks or informative ones
What's the floor?	characters, Unicode codepoints, UTF-8 bytes, bytes as fallback	what happens when something rare shows up
Which direction does training move?	grow by merges, or start huge and prune	whether early choices are reversible
What does the trainer see?	WordPiece `##`, SentencePiece `▁`, GPT-style `Ġ`, Tekken's tiktoken stack	where the walls around whitespace, digits, punctuation, and specials get drawn

BPE takes the simplest answer at each fork: most frequent pair, bytes at the floor, grow by merges, run on whatever the pre-tokenizer hands over. WordPiece keeps the loop and changes the score it uses to pick the next merge. Unigram flips the training direction: start with too many pieces and prune. SentencePiece and Tekken aren't scoring rules at all; they change what the trainer is allowed to see before scoring starts.

The model doesn't care which paper named the bundle. The final atom stream is all it ever sees: where spaces landed, whether a rare character collapsed to bytes or to a single [UNK], whether internationalization got a reusable suffix or just got chopped, whether a number arrived as one chunk or seven.

2.4.1 The Scoring Rule

The loudest disagreement on the decision tree is the first fork: how do you score a candidate merge? It's easiest to see on two candidates every trainer eventually has to weigh:

th is everywhere in English. Tok + yo is rarer in absolute terms, but Tok almost never shows up in front of anything other than yo. Two real facts, pulling in opposite directions, and the three algorithms disagree about which one matters.

BPE takes the absolute count. If t and h are adjacent more often than Tok and yo, BPE spends the next slot on th. There's no hidden language model or information-theoretic claim behind that, just frequency.

WordPiece keeps the merge loop and swaps the score:

That's pointwise mutual information. The number goes positive when a pair shows up together more often than chance would predict from each half alone, negative when it shows up less. Pairs whose halves are already common on their own get penalized: a big pair count is exactly what chance predicts when both halves are common. t and h show up everywhere separately, so th has to clear a high bar to earn its slot. Tok is rare, but if nearly every Tok is followed by yo, the merge has a strong case. BPE rewards quick compression, while PMI rewards pairs whose halves are statistically tied to each other. (WordPiece is also the one that marks word-internal pieces with a ## prefix, the convention you'll recognize from BERT vocabularies.)
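A few made-up counts make the disagreement concrete. Every number below is invented to mirror the th versus Tok + yo contrast, and `wordpiece_score` uses the ratio commonly given for WordPiece (pair count divided by the product of the two unit counts), which ranks pairs the same way the log-scaled PMI does.

```python
import math

# Hypothetical counts, chosen only for illustration.
unit_count = {"t": 9000, "h": 6000, "Tok": 50, "yo": 60}
pair_count = {("t", "h"): 3000, ("Tok", "yo"): 48}
total_units = 100_000  # assumed corpus size, in units

def bpe_score(pair):
    """BPE: the raw adjacent-pair count."""
    return pair_count[pair]

def wordpiece_score(pair):
    """WordPiece-style: pair count relative to how common each half is alone."""
    a, b = pair
    return pair_count[pair] / (unit_count[a] * unit_count[b])

def pmi(pair):
    """Pointwise mutual information: log of observed over chance co-occurrence."""
    a, b = pair
    p_ab = pair_count[pair] / total_units
    p_a = unit_count[a] / total_units
    p_b = unit_count[b] / total_units
    return math.log(p_ab / (p_a * p_b))

bpe_pick = max(pair_count, key=bpe_score)
wordpiece_pick = max(pair_count, key=wordpiece_score)
print("BPE merges:", bpe_pick)
print("WordPiece merges:", wordpiece_pick)
```

With these counts the two scoring rules spend the next slot on different pairs, even though they are looking at the same table.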

Unigram bets on likelihood and throws out the merge loop. The question changes from "which adjacent pair should I add next?" to "which existing piece can the corpus least afford to lose?" Suffixes, stems, and frequent surface chunks compete as reusable parts rather than as accidents of whichever pair won early.

Does any of this matter once a model trains on top? Bostrom and Durrett ran the controlled version of this question in 2020: RoBERTa-base, English Wikipedia plus BookCorpus, 32k vocabulary, same compute, BPE versus Unigram as the only knob between runs. SentencePiece Unigram won by small but repeated margins on GLUE, and the same comparison repeated with Japanese pretraining data pulled further ahead. A vocabulary is a compression policy with opinions about what counts as a reusable unit, and the model trains around those opinions from its first batch to its last.

2.4.2 Unigram, Backwards

Every algorithm so far grew its vocabulary from below. Start with single bytes, find pairs, glue them together, repeat until the table fills. If round 800 spent a slot on a mediocre merge (two common letters that didn't really belong together), the trainer never walked it back. The merge log is monotone: whatever was greedy at the time is what you got.

Unigram inverts the move: start with far too many pieces and let the corpus tell you which ones to cut.

Take internationalization. The seed vocabulary keeps the full word, the obvious stems, the suffixes, and a pile of overlapping fragments, all at once:

That's intentional. Hundreds of thousands of substring candidates get allowed in at the start. So which of these can the corpus live without?

There's a wrinkle. You can see the corpus, but you can't see how each word was meant to split. Unigram treats the segmentation as a hidden variable. Fit a small probability model over the pieces: guess likely segmentations under the current piece probabilities, update the probabilities from those segmentations, repeat until it settles. (This is EM doing its usual job.)

Once it does, removing any given piece has a measurable effect. Yank ization and a lot of words suddenly need worse alternatives. Yank a duplicate fragment that barely wins anywhere and almost nothing moves. Then prune: drop the pieces whose removal hurts corpus likelihood least, refit the model, prune again, stop at the target vocabulary size.

That reversal is the real difference between Unigram and everything before it. BPE asks "which adjacent pair should I add next?", which is a local, historical decision; once it's made, it's frozen. Unigram asks "which piece is least worth keeping?", which is global and revisable. A useful suffix earns its slot because thousands of words depend on it, not because it was the most common adjacent pair around round 800.

Encoding changes too. There's no merge log to replay. Each piece carries a probability, so encoding a new string becomes a search: find the most likely segmentation, usually with Viterbi (a shortest-path sweep over the valid ways to cut the string).
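The search itself is short. The sketch below runs Viterbi over a toy piece table; the pieces and their probabilities are invented (and not normalized), and single characters are added with a tiny probability so every string stays encodable.

```python
import math

def viterbi_segment(text, logp, max_piece_len=16):
    """Return the highest-scoring split of text into known pieces."""
    n = len(text)
    best = [-math.inf] * (n + 1)   # best[i]: best log-probability of text[:i]
    back = [0] * (n + 1)           # back[i]: where the last piece of that split starts
    best[0] = 0.0
    for end in range(1, n + 1):
        for start in range(max(0, end - max_piece_len), end):
            piece = text[start:end]
            if piece in logp and best[start] + logp[piece] > best[end]:
                best[end] = best[start] + logp[piece]
                back[end] = start
    if best[n] == -math.inf:
        raise ValueError("no valid segmentation")
    pieces, end = [], n
    while end > 0:                 # walk the back-pointers from the end
        pieces.append(text[back[end]:end])
        end = back[end]
    return pieces[::-1], best[n]

word = "internationalization"
probs = {  # hypothetical piece probabilities
    "international": 0.02, "ization": 0.03, "inter": 0.03, "national": 0.02,
    "nation": 0.02, "al": 0.04, "iz": 0.005, "ation": 0.03,
}
for ch in set(word):
    probs.setdefault(ch, 1e-4)     # character-level floor
logp = {piece: math.log(p) for piece, p in probs.items()}

pieces, score = viterbi_segment(word, logp)
print(pieces, round(score, 3))
```

Every extra piece multiplies in another probability below one, so the search tends to prefer fewer, likelier pieces without any rule telling it to.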

The probability model also gives Unigram something BPE can't do natively. The encoder can return the best split, the n-best splits, or sample a split with temperature. Kudo's subword regularization rides exactly this during training: the same word arrives under slightly different segmentations across epochs, and the model can't overfit to one brittle cut. BPE-Dropout got most of the way there later, by randomly skipping merges at encoding time. With Unigram, the variation falls out of the model the tokenizer already learned, no extra knob needed.

So BPE ships an ordered compression history, and Unigram ships a vocabulary plus a probability for each piece.

2.4.3 Whitespace as a Character

A Llama 2 token dump pulls the same trick in a different costume: Hello world comes back as ▁Hello and ▁world. Where GPT-2 remapped the space byte to keep its merge table printable, SentencePiece renames the space deliberately, before training ever starts, and for a different reason.

Before SentencePiece existed, every subword trainer came with the same homework assignment: chop the text into words first, then train the learner on the chunks. English got Moses-style rules, Japanese got MeCab, Chinese got jieba. Each adapter made model-facing decisions before BPE or WordPiece had learned anything. And the chopping wasn't reversible: once Moses or MeCab or jieba had split the text into words, nothing in the token stream recorded which spaces had actually been there, so decoding meant guessing the whitespace back.

Kudo and Richardson threw the homework out. Start from the raw Unicode string. Rename every space to ▁ (U+2581, the lower one-eighth block) so whitespace survives as an ordinary visible character. Then train the subword learner on that stream. With the common dummy-prefix setting turned on, "Hello world" enters the trainer as ▁Hello▁world, which means ▁world actually means "world preceded by a space," not "world after a separator the detokenizer will guess at later."

One character rename carries a lot of downstream consequences. The trainer no longer needs its own definition of "word." Languages without spaces between words don't have to invent one. Languages with them keep them, in plain sight, and decoding becomes mechanical: concatenate the pieces, swap ▁ back to a space, done.
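The rename and its inverse fit in a few lines. This is a sketch of the idea only, not SentencePiece's implementation: the function names are made up, and it ignores input that already contains a literal ▁.

```python
META = "\u2581"  # ▁, the visible stand-in for a space

def to_trainer_stream(text, add_dummy_prefix=True):
    """Rename spaces so whitespace survives as an ordinary character."""
    if add_dummy_prefix:
        text = " " + text
    return text.replace(" ", META)

def from_pieces(pieces, add_dummy_prefix=True):
    """Mechanical decode: concatenate, swap the marker back, drop the dummy prefix."""
    text = "".join(pieces).replace(META, " ")
    if add_dummy_prefix and text.startswith(" "):
        text = text[1:]
    return text

stream = to_trainer_stream("Hello world")
print(stream)
print(from_pieces([META + "Hello", META + "world"]))
```

Because the spaces are in the stream, repeated spaces survive the trip too: nothing has to guess where whitespace used to be.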

SentencePiece itself is just the wrapper. The learner inside it can be BPE or Unigram:

Llama 2 and Mistral v1 picked BPE inside SentencePiece. T5, ALBERT, mT5, and most NLLB checkpoints picked Unigram. Both can produce a row like ▁international. What changes is how that row earned its slot: merge count for BPE, likelihood under a pruned piece model for Unigram.

The escape hatch is byte fallback. SentencePiece starts from Unicode codepoints, not all 256 bytes, so a rare character can still slip past the learned vocabulary. Instead of collapsing it into a shared <unk>, production Llama-style models reserve byte tokens up front. In Llama 2, right after the three special tokens, the vocabulary runs all the way from <0x00> to <0xFF>. If a character has no normal piece, its UTF-8 bytes go out one by one, and the string still makes it through.
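A character-level toy shows the mechanism. Real SentencePiece applies fallback after searching for learned multi-character pieces; here the "vocabulary" is just a set of known characters, and the function names are invented for the example.

```python
def encode_with_byte_fallback(text, known_chars):
    """Emit known characters as-is; spell anything else as <0xNN> byte tokens."""
    pieces = []
    for ch in text:
        if ch in known_chars:
            pieces.append(ch)
        else:
            pieces.extend(f"<0x{b:02X}>" for b in ch.encode("utf-8"))
    return pieces

def decode_with_byte_fallback(pieces):
    """Collect bytes from both kinds of piece, then decode once at the end."""
    buf = bytearray()
    for piece in pieces:
        if len(piece) == 6 and piece.startswith("<0x") and piece.endswith(">"):
            buf.append(int(piece[3:5], 16))
        else:
            buf.extend(piece.encode("utf-8"))
    return buf.decode("utf-8")

known = set("abcdefghijklmnopqrstuvwxyz ")
pieces = encode_with_byte_fallback("hi 🦄", known)
print(pieces)
print(decode_with_byte_fallback(pieces))
```

The cost is visible in the piece count: one emoji becomes four tokens. The benefit is that the decoder gets the original bytes back instead of a placeholder.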

SentencePiece sits in an odd place on the decision tree: it never answers the frequency-PMI-likelihood fork at all. It answers the one before it, what does the learner get to see, and its answer is raw text with visible whitespace, plus bytes held in reserve for whatever doesn't fit.

2.4.4 Tekken: New Wrapper, Same Loop

The release notes for Mistral NeMo call Tekken a new tokenizer. It's BPE underneath, the same merge loop Mistral v1 used; what's new is the wrapper. Move from a v1 model to NeMo and your token counts drop anyway:

The space marker moved from ▁ to Ġ, and the split changed along with it.

Mistral v1 was SentencePiece BPE: 32k entries, visible whitespace through ▁, Unicode pieces, byte fallback when the learned vocabulary missed. Tekken is the same merge idea ported into the GPT/tiktoken line: UTF-8 bytes as the floor, a byte remap that makes spaces show up as Ġ, a regex that fences off digits and punctuation and whitespace before merges can count anything, and 131,072 token IDs to spend.

BPE is still BPE: merge adjacent units into longer units, by frequency, until the table fills. What changed is the definition of "adjacent." In SentencePiece, the learner sees a raw-text stream with ▁ spaces and SentencePiece's own split rules. In Tekken, it sees byte-level chunks after a tiktoken-style regex has already drawn walls around digits, punctuation, whitespace, and word-like spans, so the same merge loop counts over a different stream.

The bigger table is the headline number. Mistral reports Tekken used roughly 30% fewer tokens than its older SentencePiece tokenizer on source code and several large languages, with bigger gains on Korean and Arabic. That's what the extra vocabulary goes to: more rows for code fragments, non-Latin scripts, and frequent byte sequences a 32k table had to spell out with smaller pieces.

What a new wrapper can't do is slide under an existing checkpoint. ID 12000 meant one row in the old embedding table and means something else, or nothing, in Tekken's. Changing tokenizers between model generations looks much more like training a new generation than editing a config file.

Tekken slots onto the decision tree next to SentencePiece, not next to BPE. BPE, WordPiece, and Unigram are deciding what earns a row, while SentencePiece and Tekken are deciding the wrapper around the decision. The wrapper looks like an implementation detail until a model has trained on it; after that the weights assume it, the same way they assume everything else about the vocabulary.

2.5 The Pre-Tokenizer

hello123 enters cl100k_base and comes out as two tokens: hello and 123. The split looks like BPE's call. It isn't. A regex sliced the string into two chunks before BPE ever saw a byte.

The BPE loop from the previous sections was cheating a little. It got one long stream of bytes and got to decide every boundary itself. Run it that way on raw web text and you get what GPT-2's authors saw: vocabulary slots going to dog., dog!, and dog?, the same word fused to whatever punctuation followed it, which is why they fenced character categories apart before counting. Real tokenizers cut first. tiktoken ships its encoding with a regex called pat_str, and before any merges happen, the regex slices raw text into chunks. Byte-level BPE runs inside each chunk, never across them:

That wall is hard. If the regex separates hello from 123, no merge can later pull them back together as hello123. If it keeps "hello together, BPE can learn the quote and the word as a single piece. The merge table still trains on data, but only inside the rooms the regex has built for it.
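To see the wall from the merge loop's side, count adjacent byte pairs with and without it. The pattern below is a deliberately simplified stand-in (letter runs, digit runs of up to three, punctuation, whitespace), not tiktoken's actual `pat_str`.

```python
import re
from collections import Counter

# Simplified chunker: an assumption for illustration, not a real encoding's regex.
CHUNK = re.compile(r"[^\W\d_]+|\d{1,3}|[^\w\s]+|\s+|_+")

def chunks(text):
    return CHUNK.findall(text)

def pair_counts(byte_seqs):
    """Count adjacent byte pairs, never looking across sequence boundaries."""
    counts = Counter()
    for seq in byte_seqs:
        counts.update(zip(seq, seq[1:]))
    return counts

text = "hello123 hello123"
walled = pair_counts([c.encode("utf-8") for c in chunks(text)])
unwalled = pair_counts([text.encode("utf-8")])

boundary = (ord("o"), ord("1"))  # the pair that straddles hello | 123
print(chunks(text))
print("with walls:", walled[boundary], "without:", unwalled[boundary])
```

With the walls up, the pair that straddles the boundary has a count of zero no matter how large the corpus gets, so no merge built on it can ever be learned.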

A preprocessor edits text and leaves; this layer decides which bytes are even allowed to live next to each other: digits with digits, punctuation with words, indentation as its own object instead of ordinary spaces, line breaks owned by whatever trailing punctuation arrived in front of them.

SentencePiece makes the same kinds of calls through a config file instead of one wall-sized regex. Preserve repeated spaces? Add a ▁ prefix at the start of every string? Split runs of digits? Let pieces cross Unicode scripts? Fall back to <0xNN> byte tokens, or collapse rare codepoints into <unk>? It's a different control surface making the same decisions, all of them frozen before BPE has counted a single pair.

BPE is the compression loop. The pre-tokenizer decides what the loop is allowed to compress.

2.5.1 The GPT-2 to GPT-4 Diff

The GPT-2 regex and the cl100k_base regex both look like keyboard noise, so ignore the syntax. The only question worth asking of either is: what is BPE allowed to see as one chunk?

GPT-2 used broad buckets: contractions, letters with an optional leading space, numbers with an optional leading space, punctuation, whitespace. cl100k_base keeps the family resemblance and moves the fences around.

Boundary	GPT-2	`cl100k_base`	Why it matters
Contractions	Lowercase only	Case-insensitive	`WE'RE` no longer gets a different pre-tokenization shape from `we're`.
Word prefixes	Optional ASCII space	Optional non-letter, non-number, non-newline prefix	`"hello` can stay together instead of forcing the quote into a separate chunk.
Number prefixes	Optional leading space	No leading space	` 2024` (with a leading space) stops being a different kind of number from `2024`.
Number length	Any run of digits	One to three digits	`20240506` enters BPE as `202`, `405`, `06`, not one long opaque chunk.
Punctuation at line ends	Punctuation and newline split	Punctuation can carry trailing newlines	Common shapes like `:\n` and `.\n` become learnable pieces.
Layout whitespace	Generic whitespace fallback	Whitespace ending in a newline gets its own branch	Indentation before a line break stops looking like ordinary spaces inside a sentence.

The merge loop is identical in both; it still counts adjacent byte pairs inside each chunk. What changed is the chunks. If four spaces arrive as four one-character chunks, BPE can never learn a four-space indentation token, no matter how many Python files end up in the corpus. If a long number arrives as one open chunk, the vocabulary spends rows on fragments of dates, prices, IDs, and log lines, which all look the same to BPE and act nothing alike downstream.
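Several rows of that table can be reproduced with the standard library. Python's `re` has no `\p{L}` or `\p{N}`, so the two patterns below approximate them with `[^\W\d_]` and `\d`; they keep the branch structure of the real regexes but should be read as GPT-2-like and cl100k-like sketches, not the shipped patterns.

```python
import re

L = r"[^\W\d_]"        # stand-in for \p{L}
P = r"(?:[^\s\w]|_)"   # stand-in for "not space, letter, or number"

GPT2_LIKE = re.compile(
    r"'s|'t|'re|'ve|'m|'ll|'d"
    rf"| ?{L}+| ?\d+| ?{P}+|\s+(?!\S)|\s+"
)
CL100K_LIKE = re.compile(
    r"(?i:'s|'t|'re|'ve|'m|'ll|'d)"
    rf"|(?:[^\r\n\w]|_)?{L}+|\d{{1,3}}| ?{P}+[\r\n]*|\s*[\r\n]+|\s+(?!\S)|\s+"
)

def split(pattern, text):
    return pattern.findall(text)

for sample in ["WE'RE", '"hello', " 2024", "20240506"]:
    print(repr(sample), split(GPT2_LIKE, sample), split(CL100K_LIKE, sample))
```

The byte pairs BPE gets to count are whatever falls inside each list element, which is the whole difference between the two columns.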

How much do these fence moves actually buy? There's a controlled measurement, and it belongs with the code story in Part 3.

2.5.2 SentencePiece, Via Config

"Llama 2 uses SentencePiece BPE" names about a third of its tokenizer. The config file is the rest of it: a list of small arguments about text.

The first line picks the learner; every line under it shapes the stream the learner gets to see.

normalization_rule_name = identity is the trainer saying: don't clean this up. If the corpus writes é both ways (the precomposed and combining-accent spellings from Part 1), keep the distinction. If code contains a weird Unicode space, a casing trick, or some terminal artifact someone copy-pasted in, let the model see it.

remove_extra_whitespaces = false is the same promise for layout. Two spaces stay two spaces. Leading spaces stay leading spaces. A trailing space in a prompt is ugly, but it's still data. Both of these are mostly restraint: a language model is trained on the mess, and the tokenizer shouldn't be quick to decide the mess was an accident.

Then SentencePiece does one deliberate, visible rewrite: add_dummy_prefix = true. SentencePiece writes spaces as the meta-space character ▁, and with the dummy prefix turned on, a bare world at the start of a prompt gets treated as if it began with that marker, so it shares a piece shape with a space-preceded world later in the same sentence. That's why Llama-style vocabularies fill up with pieces like ▁world, ▁the, and ▁return. Without the dummy prefix, the first word of a sequence and the same word after a space compete for different rows more often than they should; with it, both positions feed evidence into the same row.

byte_fallback = true is the escape hatch from §2.4.3. A unicorn 🦄 with no learned piece becomes <0xF0>, <0x9F>, <0xA6>, <0x84>: four tokens for one visible character, but the text comes through intact instead of collapsing into a shared <unk>.

The split rules map almost directly onto the GPT regex from the previous section. SentencePiece doesn't draw the walls with one regex; it draws them with named knobs:

Setting	Wall it draws
`split_digits=true`	`1234` enters through digit-sized boundaries instead of one open number span.
`split_by_number=true`	`abc123` can't become one ordinary word-like run.
`split_by_unicode_script=true`	Latin, Cyrillic, CJK, and other scripts don't freely merge across script changes.
`split_by_whitespace=true`	The meta-space marks boundaries instead of letting pieces drift across spaces.
`max_sentencepiece_length=16`	No learned piece is allowed to grow past a fixed character cap.

Each one sounds reasonable on its own, and together they decide what BPE is allowed to see. A mixed-script identifier, a product code, a long number, and a whitespace-heavy code block all get shaped before merge counts begin. So the precise expansion of "Llama 2 uses SentencePiece BPE" is: BPE inside a SentencePiece recipe with identity normalization, preserved whitespace, a dummy-prefix space, byte fallback, digit splitting, script splitting, whitespace boundaries, and a 32k table.
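Written out as data, that expansion looks roughly like the dictionary below. The keys are the SentencePiece trainer options named in this section and the values follow the prose above; treat it as a reading aid assembled from the text, not a dump of any real model file.

```python
# A sketch of the recipe described above, as plain data.
LLAMA2_STYLE_SPEC = {
    "model_type": "bpe",                      # the learner
    "vocab_size": 32000,                      # the 32k table
    "normalization_rule_name": "identity",    # don't clean the text up
    "remove_extra_whitespaces": False,        # two spaces stay two spaces
    "add_dummy_prefix": True,                 # leading meta-space on every input
    "byte_fallback": True,                    # <0xNN> tokens instead of <unk>
    "split_digits": True,
    "split_by_number": True,
    "split_by_unicode_script": True,
    "split_by_whitespace": True,
    "max_sentencepiece_length": 16,
}

learner = LLAMA2_STYLE_SPEC["model_type"]
stream_shaping = {k: v for k, v in LLAMA2_STYLE_SPEC.items() if k != "model_type"}
print(f"1 setting picks the learner ({learner}); "
      f"{len(stream_shaping)} settings shape what it sees")
```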

2.6 Specials: Protocol, Not Prose

Open tiktoken and you'll find two tables, not one.

The first is what we've been calling "vocabulary" all along: byte rows, merge ranks, maybe a visible space marker, plus the regex and config rules that decide what BPE is allowed to touch. It lives under mergeable_ranks: hello, ing, ():, Default, 123. Each row earned its place because the bytes showed up often enough during training to win the merge contest.

The second is special_tokens: reserved strings mapped directly to reserved IDs. <|endoftext|> isn't there because BPE noticed that angle brackets, pipes, and endoftext make a useful phrase. [CLS], [SEP], [MASK], <|fim_middle|>, <|im_start|>, [INST], and the various tool markers are all control codes with printable names. Nothing in this table earned anything.

A normal token is compressed text. A special token is protocol: document boundary, message boundary, role header, padding, mask, code hole, tool call, tool result. To the model, both look identical after the lookup: integer IDs with embedding rows and output logits attached. By the time the transformer sees the sequence, nothing says "this ID came from BPE" or "this ID came from trusted template code"; there's just a row number and the training history behind it. So special-token policy has to live at the tokenizer boundary, before raw text and model protocol collapse into the same integer array.

Special tokens look like bookkeeping until the day they break, because they're the contract between the tokenizer, the model weights, the serving layer, and the application code.

2.6.1 Same String, Different Input

A user pastes <|endoftext|> into the chat box of an app you wrote. Maybe they're curious. Maybe they're testing the seams. Either way, your tokenizer has to pick a path before the model sees a thing.

One path: ordinary user text. A byte-level BPE encoder chews through the thirteen ASCII characters like any other input: <, |, maybe endo, maybe f, maybe text, |, >. The exact split depends on the vocabulary. Every piece that comes out is still content.

The other path: special token. The same printed string collapses to one reserved ID. GPT-2 spent 50256 on it. cl100k_base uses 100257. That row isn't the compressed form of "end of text"; it's the row the model was trained to treat as a document boundary.

A log line can make the two paths look identical. Both might print <|endoftext|> to the screen. One came from untrusted user bytes; the other came from code that owns the model protocol. After encoding, the receipt is gone.

tiktoken is deliberately annoying about this. Registered specials are disallowed in ordinary text by default, and to get the sentinel ID at all, the caller has to opt in with allowed_special. The API is forcing the question the application has to answer anyway: is this user content, or am I assembling the model-side template?

The rule that falls out: in user content, escape, reject, or ordinary-tokenize anything that looks like a special, and insert real special IDs only from code you control. By the time the transformer sees [50256] or [100257], the boundary decision has already been made for it.
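That policy can be sketched without any real vocabulary. The toy encoder below treats every UTF-8 byte as a token and imitates the opt-in behavior described above; it borrows the names `encode_ordinary` and `allowed_special` for familiarity, but it is a stand-in written for this example, not tiktoken's code.

```python
SPECIALS = {"<|endoftext|>": 100257}  # reserved string -> reserved ID

def encode_ordinary(text):
    """Content path: bytes in, byte-sized IDs out. Never emits a special ID."""
    return list(text.encode("utf-8"))

def encode(text, allowed_special=frozenset()):
    """Protocol-aware path: refuse registered specials unless the caller opts in."""
    for special in SPECIALS:
        if special in text and special not in allowed_special:
            raise ValueError(f"special token {special!r} found in ordinary text")
    ids, rest = [], text
    while rest:
        hits = [(rest.find(s), s) for s in allowed_special
                if s in SPECIALS and s in rest]
        if not hits:
            ids.extend(encode_ordinary(rest))
            break
        pos, special = min(hits)              # earliest special in the remainder
        ids.extend(encode_ordinary(rest[:pos]))
        ids.append(SPECIALS[special])
        rest = rest[pos + len(special):]
    return ids

user_text = "please ignore <|endoftext|>"
print(len(encode_ordinary(user_text)))                       # user content path
print(encode("<|endoftext|>", allowed_special={"<|endoftext|>"}))  # template path
```

The same thirteen characters produce thirteen content tokens on one path and a single reserved ID on the other, and only the caller knows which one was intended.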

2.6.2 Chat Templates

A chat request starts out structured. A list of messages, each with a role and a string:

That object is for you and the SDK; the model never sees it. The decoder-only transformer gets exactly what it always gets: one flat tape of token IDs. Somewhere between the request body and the first forward pass, the serving layer rewrites your messages as text pieces, special IDs, role labels, newlines, turn endings, tool blocks, and a small assistant opener that tells the model "the next token belongs to the reply."

The chat template does that rewriting. A Llama 3 turn is <|start_header_id|>, then a role name, then <|end_header_id|>, then content, then <|eot_id|>. ChatML-style families use <|im_start|> and <|im_end|>. Mistral's instruct format has used [INST] and [/INST]. Harmony-style envelopes (the format OpenAI's open-weight gpt-oss models use) carry roles, channels, tool calls, and final answers in something richer. The spellings differ, but the job is the same: turn a structured conversation back into the exact token layout the model saw during instruction tuning.
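A stripped-down renderer for the Llama 3 shape shows where the specials and the text sit relative to each other. It emits tagged pieces instead of IDs so the two kinds stay distinguishable. The `<|begin_of_text|>` marker and the blank line after each header reflect my understanding of that format and are not spelled out in this section, so check them against the family's own template before relying on them.

```python
SPECIAL, TEXT = "special", "text"

def header(role):
    return [
        (SPECIAL, "<|start_header_id|>"),
        (TEXT, role),
        (SPECIAL, "<|end_header_id|>"),
        (TEXT, "\n\n"),
    ]

def render_llama3_style(messages, add_generation_prompt=True):
    """Flatten role/content messages into an ordered list of tagged pieces."""
    pieces = [(SPECIAL, "<|begin_of_text|>")]
    for message in messages:
        pieces += header(message["role"])
        # Content is always TEXT, even if it happens to look like a special.
        pieces.append((TEXT, message["content"]))
        pieces.append((SPECIAL, "<|eot_id|>"))
    if add_generation_prompt:
        pieces += header("assistant")   # the opener: "next token is the reply"
    return pieces

messages = [
    {"role": "system", "content": "Be brief."},
    {"role": "user", "content": "What does <|eot_id|> do?"},
]
for kind, value in render_llama3_style(messages):
    print(f"{kind:8} {value!r}")
```

Only the pieces tagged `special` should ever become reserved IDs. The user's literal `<|eot_id|>` stays on the text path and gets tokenized like any other content.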

So what happens if you skip the template and assemble the prompt yourself?

Nothing throws. That's the whole problem. The model still answers, the response still streams back, and the failure stays invisible to anything but a loss curve you don't have. If the assistant opener is missing, the model may keep going as if it were still the user. If the end-of-turn marker is wrong, generation stops early or runs straight through a boundary. If a tool result lands as ordinary assistant text, the transcript is now lying about who said what.

String equality won't save you. A printed <|im_start|> can be ordinary user bytes in one code path and a reserved ID in another, and the rendered text gives no clue. The only check that means anything is a diff of the token stream: same template, same specials policy, same assistant prefill, same stop markers. For open models, run the family's apply_chat_template or its request encoder and compare IDs whenever you have to reimplement anything. For hosted models, let the provider own the envelope and stop trying to recreate it.

The template is a trained interface: changing it changes the prompt at the layer where the model actually reads, whatever bracket style a family uses.

2.6.3 Adding a Token Is Surgery

Adding a token to the tokenizer looks like thirty seconds of work. Open the config, drop <|tool_call|> into the special-token dictionary, hand it the next unused ID, point the encoder at it. The tokenizer now "supports" <|tool_call|>.

The model doesn't.

Every token ID indexes an embedding row. If the old vocabulary has V entries and the hidden size is d, the embedding table is V x d. One more token, one more row:

The new row has no pretraining history. It's typically initialized to small random numbers, sitting there waiting for the first gradient that gives it a job. The model has never seen this ID before, so there's nothing else for it to be.

The output side has the same problem. A decoder-only model produces one logit per vocabulary entry. With an untied LM head, the projection grows from d x V to d x (V + 1). With tied embeddings, the same resized matrix serves both input and output, but the new row is still untrained either way. The model can now receive the token and assign probability to it, which is availability, not meaning; the meaning still has to come from training data.
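The bookkeeping can be shown with plain lists. In this sketch both matrices are stored as one row per token (so the LM head is held transposed relative to the d x V notation above), the 0.02 scale is an arbitrary choice, and `add_token` is a made-up helper, not a library call.

```python
import random

def new_row(d, scale=0.02, rng=random):
    """A fresh row: small random numbers, no training history."""
    return [rng.gauss(0.0, scale) for _ in range(d)]

def add_token(embedding, lm_head, scale=0.02):
    """Append one vocabulary entry and return its ID.

    If lm_head is the same object as embedding, the weights are tied and
    one new row serves both input and output.
    """
    d = len(embedding[0])
    new_id = len(embedding)
    embedding.append(new_row(d, scale))
    if lm_head is not embedding:          # untied: the output side needs its own row
        lm_head.append(new_row(d, scale))
    return new_id

V, d = 5, 4
untied_emb = [new_row(d) for _ in range(V)]
untied_head = [new_row(d) for _ in range(V)]
tied_emb = [new_row(d) for _ in range(V)]

print(add_token(untied_emb, untied_head), len(untied_emb), len(untied_head))
print(add_token(tied_emb, tied_emb), len(tied_emb))
```

Either way the new ID is usable immediately and means nothing yet. The shapes change in one line; the row only acquires a job through training.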

Nothing in the architecture changed, only the token stream. After enough examples, <|fim_middle|> stops being a random row and becomes a learned instruction: write the span that fits between the prefix and the suffix.

A pause token, a tool-call boundary, a gist token: each one follows the same recipe. Reserve an ID, resize the matrices, push examples through training until the new row has a job; the printable name is for humans, and the behavior lives in the row and its output logit.

And with that, the tokenizer isn't one table anymore: it's the merge table, the regex, the special-token dictionary, and the chat template. From here on, the model trains inside those choices, and the strange behaviors downstream start before the transformer sees a single token.

Part 3: Why Tokenizers Leak

A tokenizer is supposed to be plumbing. Bytes in, IDs out. It sits between the user and the transformer, does its job, and stays out of the way. You pick a byte floor, learn the merges, reserve a handful of specials, glue on a chat template, freeze the artifact. Part 2 was that build. Then the plumbing leaks.

Most of the glitches from the top of this post (the spelling, the reversal, the arithmetic, the Hindi bill) are leaks of this kind, and code cost belongs on the list too; the <|endoftext|> one was protocol, and Part 2 already closed it. The leaks all have the same shape: the tokenizer compressed the text into atoms, and the task is asking about something that lives below, across, or outside those atoms. .Default as one row is fine compression, right up until the question is "how many ls are in this?" Same story for 123 when the question is about the ones column, and for an old Reddit username whose row never got trained. The transformer has training signal, scratch space, attention, the whole standard kit. It can recover from a lot. What it can't easily recover from is evidence that arrives in the wrong shape, before layer one has run.

3.1 Five Strings at the Front Door

Here's what the encoder actually does to the strings this part keeps poking at (the first four under cl100k_base, the last one under GPT-2's vocabulary, where that story unfolded):

That table isn't a view of the user's text. For the model, the table is the text. A token is just an address into the embedding table, and the model has plenty of machinery for working with whatever lives at that address: it can attend to the row, push gradient through it during training, predict it at decode time. What it can't easily do is reach back inside the row and recover the bytes that earned the row its place. Those bytes were spent at the encode step.

Put the atoms next to the task and the question answers itself: the question is about letters, or digit columns, or specific bytes, and the tokenizer kept whatever compressed well instead, so the evidence the task needs is already missing by the time the transformer runs.

Three flavors of that bug come first: spelling needs to look below a single token, reversal needs output atoms the input atom doesn't share, and glitch tokens are rows that exist in the table but were never really trained. Numbers, code, and language cost then need no new machinery, just the same table meeting different inputs.

3.1.1 Counting Letters That Aren't There

Forget the model for a moment. .DefaultStyle is thirteen characters, two of them lowercase ls:

A byte loop can scan that and tell you "two" without knowing what English is, what .DefaultStyle is, or that any of this has to do with code. Count bytes, you're done.

Then the question hits the tokenizer. Under cl100k_base, the same string becomes:

.Default is a row lookup. Style is another row lookup. The transformer starts from two learned vectors, not thirteen characters.

Yes, row 13578 decodes back to .Default if you call the tokenizer's decoder. But the transformer isn't running the tokenizer's decoder. It has whatever training put into that row: loose associations with code, UI properties, CSS, the general shape of an identifier. Exact letter counts are a bad thing to bury inside a 4,096-dimensional vector. So when the model says "three ls," it isn't forgetting how to count. The characters aren't in its input. They're inside two learned vectors, and digging them back out isn't what those vectors are for.

The dumb workaround is to put the characters into the prompt explicitly:

The spaces drag the tokenizer back toward single characters, and the question turns into a list operation. Or hand the string to a tool and let real code do the byte scan. Either way, you're admitting the model needed a layer the tokenizer had hidden.
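Both routes are a couple of lines outside the model. This sketch does the byte scan directly and builds the spaced-out version of the string; the prompt wording is only an example.

```python
s = ".DefaultStyle"

# Route 1: let code do the scan. No tokenizer, no language knowledge.
letter_count = sum(1 for b in s.encode("utf-8") if b == ord("l"))

# Route 2: spell the string out so each character can arrive separately.
spaced = " ".join(s)
prompt = f"How many times does the letter l appear in this list? {spaced}"

print(len(s), letter_count)
print(prompt)
```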

3.1.2 Reversal: The Atoms Don't Match

Spelling at least kept the letters inside the rows the model received, where it had some shot at digging them back out. Reversal has a worse problem: the row the model receives and the rows it has to emit don't share atoms at all.

Under cl100k_base, the whole identifier collapses into one token:

Great for autocomplete. Useless for "write the same characters in reverse." The model receives one embedding row, but the answer is a brand new string:

elytSlleCtluafeD. isn't row 98518 flipped, or row 98518 with a minus sign on it. Run it through the same vocabulary and you get something unrecognizable:

Nine pieces, and none of them is the input. There's no shared D, e, f, a, u, l, t lane the model can simply walk in reverse. So it has two jobs at once: dig the characters back out of row 98518, then assemble a token sequence whose atoms have no obvious relationship to the input row. Outside the model, reversing the byte string is a one-liner. Inside, row 98518 is just a vector, and the symmetric path back doesn't exist.
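For contrast, here is the version of the task that runs outside the model. The byte-level reversal only matches the character-level one because this identifier is pure ASCII.

```python
s = ".DefaultCellStyle"

reversed_text = s[::-1]                      # reverse by character
reversed_bytes = s.encode("utf-8")[::-1]     # reverse by byte (safe here: ASCII only)

print(reversed_text)
```

The slice walks the characters in order because the characters are right there. That lane is exactly what a single embedding row doesn't expose.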

The spaced-out workaround from the spelling section works here too, and for a stronger reason. With one character per token, reversal becomes a list operation. Without the spaces there's no list, just one opaque address that has to come out as a completely different sequence of addresses.

None of this proves the model knows nothing about how .DefaultCellStyle is spelled. It has almost certainly picked plenty of that up from code in pretraining. But exact reversal is asking for a byte-level program, and the encode step hid the bytes. The failure stops being mysterious the moment you put the atoms on the page.

3.1.3 Holes in the Table

You ask the model to repeat SolidGoldMagikarp back to you, and it substitutes a different word. Pushed, it refuses; pushed again, it answers in French.

The string is an old Reddit username with a leading space, and nothing about that explains what the model just did. The tokenizer didn't fail here; it represented those bytes perfectly. The model is the part that never caught up.

In early GPT-3-family models, SolidGoldMagikarp was token 43453. The reason it earned a row is, hilariously, prosaic: that Reddit user posted often enough that the username showed up everywhere in the data used to train GPT-2's BPE tokenizer. The merge loop did what it always does. It saw a frequent byte pattern and gave it an ID.

Then the language model had a different problem to solve. Earning an ID is one event. Learning what that ID means is a separate event. A token earns its row by being frequent in the tokenizer corpus. It earns a usable embedding by being frequent in the LM corpus. Most of the time those corpora are close enough that nobody notices the distinction. Here they weren't.

So row 43453 existed and almost no gradient ever touched it. The input embedding started as small random numbers and stayed roughly where it started. The output side, tied or untied, had the same problem: nothing was teaching the model what to predict at that ID either.

Prompting with that token wasn't like prompting with a rare word. A rare word still has neighbors: contexts that share its meaning, gradient pressure on every surrounding token, a learned embedding that at least sits in a sensible neighborhood. SolidGoldMagikarp was closer to injecting an untrained coordinate into the residual stream. The model could attend to it, pass it forward, write it into the next layer's keys and values. It had never learned any stable behavior for it. That's why the public demos came out so weird in such specific ways: evasion, refusals to repeat the string, sudden language switches, insults out of nowhere, all the behavior of a row that was never given an interpretation.

Jessica Rumbelow and Matthew Watkins found that SolidGoldMagikarp was part of a cluster, not a one-off. Other tokens had the same shape: common enough in the tokenizer corpus to earn merges, rare enough in the LM corpus to leave the resulting rows essentially untrained. The bug lives in the seam between two datasets that people usually treat as one dataset.

This is the inverse of <unk>: there, text arrives with no row to land on; here, a row exists that almost no text ever trained. The mitigation is boring and right: train the tokenizer on the distribution the model is actually going to train on, and audit token frequencies before pretraining starts, because a row with near-zero occurrences in the LM corpus never gets trained, and the rest of the model ends up routing around it.
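The audit half of that can be sketched in a few lines. This assumes you already have the LM corpus (or a sample of it) encoded as a stream of integer IDs; the function name, the threshold, and the toy data are all hypothetical.

```python
from collections import Counter
from typing import Iterable, List


def undertrained_rows(
    token_ids: Iterable[int], vocab_size: int, min_count: int = 1
) -> List[int]:
    """Return vocabulary IDs seen fewer than `min_count` times in the LM corpus."""
    counts = Counter(token_ids)
    return [tid for tid in range(vocab_size) if counts[tid] < min_count]


# Hypothetical toy data: a 6-row vocabulary and an already-encoded corpus sample.
encoded_corpus = [0, 1, 1, 2, 0, 4, 1, 0, 2, 4]
print(undertrained_rows(encoded_corpus, vocab_size=6))               # [3, 5]
print(undertrained_rows(encoded_corpus, vocab_size=6, min_count=3))  # [2, 3, 4, 5]
```

A real audit would stream the corpus and pick `min_count` relative to corpus size, but the shape is the same: every ID the table defines gets checked against what the model will actually see.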

3.2 Why Arithmetic Is Hard

999 + 1 comes back from a model as 99, or 1009, or some other near-miss often enough to be a meme. The schoolbook algorithm is three lines. Add one to the ones digit. Carry through the nines. Write 1000. The model isn't running that worksheet, because under cl100k_base:

999 is one row. 1000 is two rows: 100 and 0. The ones digit of the input is buried inside 999. The ones digit of the answer sits in its own slot. For a right-to-left carry algorithm, that's an awful layout.

Numbers look orderly, which is why this is easy to miss. elephant as one token is fine; the model just has to know about elephants as objects. 999 as one token isn't, because arithmetic is built on an invariant that text doesn't care about: the digit in the ones place stays in the ones place no matter what's happening around it.

A compression tokenizer doesn't owe you that invariant. It was trained on text where 2024 is everywhere (years, prices, postal codes, log timestamps), so common digit substrings end up earning merges. Which substrings exactly depends on whatever was most frequent in that particular corpus, and whether the pre-tokenizer even lets BPE form those merges is the other half of the story.

So the model does two jobs in one forward pass: the arithmetic, and the reconstruction of digit columns the tokenizer hid. Scratchpads help with the first job, but they don't undo the second.

3.2.1 The Digit Lottery

Watch what GPT-2's tokenizer does to three numbers that share a prefix:

Nothing about those splits came from arithmetic. They're GPT-2's BPE merge history showing through. Its pre-tokenizer let an entire digit run enter as one chunk, and the merge table picked up whichever digit substrings happened to be common in web text: years, prices, postal codes, build numbers, timestamps. The merge loop only ever saw counts.

The path matters. 12, 34, 45, and 123 all earned rows somewhere along the way. When the encoder hits 1234, the best available merges land on 12 | 34. When it hits 12345, they land on 123 | 45. Different inputs trace different paths through the same merge table, so the 3 in 1234 sits next to the 4 while the 3 in 12345 sits in a different row entirely: same visible digit, different atom, in a different sequence position. For language, that kind of fossilization is mostly fine. international and ization are useful chunks regardless of which side of the word wins which merge. For numbers it's a real problem, because a million examples of "numbers" in a pretraining corpus aren't math data. A request ID 12345 in a log line teaches the model about row 123, row 45, and the contexts they appear in. It doesn't teach it that 12345 is five digits with carries waiting to happen.

cl100k_base patched the lottery: its pre-tokenizer caps any digit run at three, so 2024 can't fossilize into one opaque row anymore. Then you look at what the cap does to a longer number:

The slicing runs left to right. Take three, then three, then whatever's left over at the right end. In the bare number, "whatever's left" is the ones digit 7 sitting by itself in the trailing token, and the middle group 456 only means "hundreds, tens, ones" if you've already counted the digits in the whole number. Place value gets reconstructed from the outside in. Addition runs the other way: the carry starts in the ones column and walks left, which means the natural anchor for the digits is the right end, not the left. The comma version happens to provide exactly that. Each three-digit group is anchored from the right, so 567 is reliably hundreds-tens-ones and 234 is reliably the group above.
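The left-to-right slicing is easy to imitate. The sketch below is an approximation, not the real pre-tokenizer: Python's `re` has no `\p{N}`, so `\d{1,3}` stands in for the "digit runs capped at three" rule, and everything that isn't a digit is lumped into one chunk. It shows pre-tokenizer chunks only; the merge table still decides how each chunk becomes token IDs.

```python
import re
from typing import List

# Assumption: `\d{1,3}` approximates a three-digit cap; `\D+` lumps all non-digits.
_CHUNKS = re.compile(r"\d{1,3}|\D+")


def capped_digit_chunks(text: str) -> List[str]:
    """Split left to right into digit groups of at most three, plus non-digit runs."""
    return _CHUNKS.findall(text)


print(capped_digit_chunks("1234567"))    # ['123', '456', '7']
print(capped_digit_chunks("1,234,567"))  # ['1', ',', '234', ',', '567']
print(capped_digit_chunks("999"))        # ['999']
print(capped_digit_chunks("1000"))       # ['100', '0']
```

In the bare number the leftover digit falls at the right end; with commas the short group moves to the left end and every full group lines up with a place-value block.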

That sounds like punctuation trivia until someone measures it. Singh and Strouse tested length-mismatched addition, where a carry bumps the answer one digit longer than the inputs. GPT-3.5 on bare numbers: 8.25%. Same model, same arithmetic, drop commas into the inputs: 97.8%. The commas didn't teach it how to add. They just put the token boundaries where the columns already were.

Which suggests the clean fix lives upstream of BPE, and Llama 2 took the most aggressive version of it. Its pre-tokenizer splits every run of digits into individual digits before SentencePiece BPE sees the text. 2024, 20, 8675309: none of them can become a single token, because the merge loop never sees those strings as merge candidates. A number like 1234567 then arrives at the model as seven positions, not a handful of opaque chunks whose boundaries depend on length and surrounding punctuation. Place value collapses back into sequence position. The carry still has to be computed, but the model isn't being asked to reconstruct columns first.
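A per-digit policy is even simpler to sketch. This is a toy stand-in for the idea, not Llama 2's actual pre-tokenizer: `\d` plays the role of "one digit per piece", and the helper names are made up for illustration.

```python
import re
from typing import List, Tuple

# Assumption: each digit becomes its own piece; non-digit runs are left whole.
_PIECES = re.compile(r"\d|\D+")


def per_digit_chunks(text: str) -> List[str]:
    """Split every digit into its own piece so no multi-digit merge can form."""
    return _PIECES.findall(text)


def place_values(number: str) -> List[Tuple[str, int]]:
    """Pair each digit with its power of ten, counted from the right end."""
    return [(digit, len(number) - 1 - i) for i, digit in enumerate(number)]


print(per_digit_chunks("1234567"))  # ['1', '2', '3', '4', '5', '6', '7']
print(per_digit_chunks("id=2024"))  # ['id=', '2', '0', '2', '4']
print(place_values("907"))          # [('9', 2), ('0', 1), ('7', 0)]
```

With one digit per position, place value is just the distance from the right end, which is the property the carry algorithm needs.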

You can guess the cost. Anything numeric gets longer: logs, CSVs, timestamps, account IDs, version strings. cl100k_base's three-digit cap is the compromise, looser than per-digit and tighter than GPT-2's free-for-all, picked for a general-purpose model that reads far more text than arithmetic worksheets. Which representation is right depends on which question you asked. "What makes arithmetic least painful?" gets you per-digit. "What makes a trillion tokens of pretraining corpus cheap?" gets you something looser. A model can still fail at addition for reasons that have nothing to do with tokenization: weak scratch work, brittle prompting, bad training examples. Tokenization is the layer that decides whether the carry algorithm has columns to work with at all.

If you don't get to retrain, the moves are simple: put commas or spaces in long numbers, ask for scratch work, and reach for a tool when the answer actually has to be right.

3.3 Why Code Is Hard

A model refactors a Python function for you. The output looks clean, pastes clean, and dies on the first run: IndentationError. The same model that can sketch a microservice from a one-paragraph description tripped on a column of spaces.

Prose tolerates a bad split. tokenization chopped into token | ization still reads fine. Code doesn't have that slack. A dot, an underscore, a closing paren, four leading spaces: any of these can be syntax, and getting the split wrong means the file stops parsing.

Here's FizzBuzz:

Run the full program through both and count: 102 tokens under GPT-2, 63 under cl100k_base. The Python didn't get simpler in the meantime, and most of the savings aren't print or range. They're spaces.

Python uses leading whitespace as syntax, but the tokenizer decides whether one indent level arrives as one chunk or a dozen repeated blanks. The transformer isn't allergic to programs. It gets handed the tokenizer's version of the source, and that version can be punctuation splinters, whitespace noise, and identifiers chopped through useful boundaries. Modern code models learn around a lot of that. They still start from whatever the encode step handed them. That's what happened to the refactor at the top of this section: the model was writing indentation it couldn't see the width of, because one indent level is sometimes a single token and sometimes several copies of one. Drop or double one whitespace token and a line shifts by a level, which is exactly the IndentationError the first run died on.

3.3.1 Indentation, Token `220`, and the Regex

Take one line from deep inside a Python function:

Twelve spaces. Three nested blocks, four spaces each. In Python, that distance from the left margin isn't formatting. It's syntax. The grammar uses it to find the end of the function body.

GPT-2's tokenizer didn't see any of that. The ASCII space was token 220, and a run of spaces became that token over and over. Eleven of the twelve spaces in front of return arrive as eleven separate copies of 220; the twelfth rides along glued to the keyword:

A compiler has an indent stack: one piece of state, one number that knows exactly how nested you are. The transformer got none of that. It got eleven forward-pass positions, eleven positional embeddings, and eleven places where attention could spend probability mass on blank cells, all before the first keyword.

Modern GPT tokenizers changed the boundary rule. Runs of spaces stay together long enough for BPE to learn them as chunks, so four spaces, eight spaces, or a newline followed by indentation can become one token instead of a row of 220s. That's where the FizzBuzz savings come from: 102 drops to 63, and the model stops spending most of its budget on page layout. More of the context window holds branch conditions, variable names, and function bodies.

Notice what did the work there. It wasn't the bigger vocabulary. GPT-2 had about 50k tokens and cl100k_base has roughly twice that, so "more slots, more code fragments" is the natural story. Dagan, Synnaeve, and Roziere ran the cleaner test: hold the vocabulary size fixed, swap only the pre-tokenizer, and measure how long the same source code comes out. They report a normalized length where 1.00 is a Llama-style baseline; lower means fewer tokens.

Tokenizer setup	Normalized source length
Llama-style regex, `32k` vocabulary	`1.00`
GPT-4-style regex, `32k` vocabulary	`0.81`
GPT-4-style regex, `100k` vocabulary	`0.74`

Look at the middle row. Same vocabulary, different regex, nineteen points cheaper. Tripling the table on top of that buys only seven more.

The mechanism is the wall from Part 2. BPE only merges inside the chunks the regex hands it. If the pre-tokenizer chops an indent into single spaces:

those vertical bars are walls. The merge loop never gets to count a ·· pair across them, never gets to promote ···· into the table. Four-space indents don't exist as candidates for the trainer to score. Hold the whitespace run together:

and now ····, :\n, .\n, and the rest of Python's layout shapes can earn rows, assuming they show up often enough in the corpus. BPE didn't get smarter between GPT-2 and GPT-4. The room it was allowed to work in did. For code, the boundary policy comes first; a larger vocabulary can only memorize what the regex didn't already split apart.
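The wall effect shows up in a toy pair counter. This is a deliberately simplified sketch: it counts character pairs where real BPE counts byte pairs, and the two chunking policies are hypothetical stand-ins for "split every space" and "keep the run together".

```python
from collections import Counter
from typing import Iterable


def pair_counts(chunks: Iterable[str]) -> Counter:
    """Count adjacent character pairs inside each chunk, never across chunk boundaries."""
    counts: Counter = Counter()
    for chunk in chunks:
        for left, right in zip(chunk, chunk[1:]):
            counts[(left, right)] += 1
    return counts


indent = " " * 4

# Policy A (hypothetical): every space is its own chunk, so no space pair is visible.
split_spaces = list(indent) + ["return"]
# Policy B (hypothetical): the whole whitespace run stays in one chunk.
kept_run = [indent, "return"]

print(pair_counts(split_spaces)[(" ", " ")])  # 0
print(pair_counts(kept_run)[(" ", " ")])      # 3
```

Same text, same counter, different walls. Under policy A the trainer has nothing to score, however large the vocabulary budget is.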

3.3.2 Identifiers Want Morphemes

user_id isn't a word. It's a tiny phrase: user plus _id. parse_json_response is verb plus format plus object. HTTPRequestHandler is an acronym welded to two nouns. Programmers picked underscores and camel case for a reason: names have internal structure, and the conventions are there to surface it. Whitespace waste shows up in any token dump; identifier waste is harder to see.

Two extremes, both bad. Keep a rare identifier whole and you get great compression with zero transfer: a row for sync_customer_invoice does nothing outside the repo that invented it. Drop all the way down to characters and every name copies cleanly, except now the model is spending positions on individual letters before it gets near the verb.

The useful place sits in the middle:

That table is a target shape rather than a promise about any particular tokenizer; the point is transfer. _id should help with customer_id, order_id, session_id. Request should carry across RequestHandler, HTTPRequest, request_timeout. json should show up in method names, file names, and string literals alike.

Linguists call those pieces morphemes: the small meaning-bearing units inside a larger word. Code has a programmer-made version of the same thing: underscores, dots, casing transitions, library prefixes. A good code tokenizer wants pieces at roughly that scale.
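To make "pieces at roughly that scale" concrete, here is a hand-written heuristic splitter. It is not a tokenizer and no real vocabulary is involved; the regex is an assumption about where programmer-made seams usually sit (underscores kept with the following word, camel-case humps, acronym runs).

```python
import re
from typing import List

# Assumption: an optional leading underscore sticks to the piece that follows it;
# acronym runs stop before a capitalized word; anything else falls through as one char.
_PIECE = re.compile(r"_?(?:[A-Z]+(?![a-z])|[A-Z]?[a-z]+|\d+)|.", re.DOTALL)


def identifier_pieces(name: str) -> List[str]:
    """Split an identifier into morpheme-sized pieces without losing any characters."""
    return _PIECE.findall(name)


print(identifier_pieces("user_id"))              # ['user', '_id']
print(identifier_pieces("parse_json_response"))  # ['parse', '_json', '_response']
print(identifier_pieces("HTTPRequestHandler"))   # ['HTTP', 'Request', 'Handler']
```

A trained merge table finds its seams by counting, not by rule, so its splits will differ. The sketch only shows the granularity being aimed at.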

Snake case is a gift here. The underscore sits right there in the bytes, and a GPT-4-style regex keeps it attached to the following letters, which lets BPE pick up _id, _json, and _response if those suffixes are common enough. Camel case is harder. The uppercase letter is visible in the bytes, but it isn't a hard pre-tokenizer wall, so the merge table has to find the seam by counting, and it won't always win.

This matters most for names the model never saw during training. You name something the corpus has never seen, sync_invoice_ledger or parseInternalRetryHeader, and the model has to write it exactly, every time, while still noticing that sync, invoice, and ledger are reusable parts. Morpheme-sized splits are the compromise that makes both halves of that work: rare identifiers stay typeable without each one turning into a brand-new atom.

3.4 Why Some Languages Pay More

A support ticket lands in your queue: "I can't log in after resetting my password." The complaint reads the same in English, Hindi, Thai, Burmese, and Korean. The human work is the same. The token count comes back different by up to a factor of three.

Under one GPT-4-era tokenizer, Hello, how are you is five tokens in English. The Korean translation runs to about fifteen. Petrov et al. measured cases where the same translated content cost up to fifteen times more tokens than its English source. The transformer hasn't started yet. The inequality arrived during encode.

That comparison is a microscope. A counter on a webpage doesn't win or lose anything; it shows you which strings the table prefers, before any model is involved.

BPE isn't sitting there with an opinion about Korean or Hindi. It just counts. If the tokenizer corpus is mostly English, English byte pairs win more merge slots. A common Burmese word can look rare on a global scale and lose to a less useful English fragment. The tokenizer compresses whichever distribution the trainer fed it, and users pay according to the resulting table.

A few mechanisms stack underneath. The vocabulary is finite, so a row spent on English is a row not spent on anything else. UTF-8 charges different scripts different numbers of bytes per character before any merges happen. Word boundaries work differently across languages: English lets a regex cheat with spaces; Thai doesn't put spaces between ordinary words at all.

The product effects fall out mechanically: per-token billing makes the same information cost more in some languages, and token-per-second streaming makes those same languages feel slower per sentence. A 128k-token context window holds fewer pages when the tokenizer fragments a script badly, and attention sees a longer sequence before it sees a harder problem. So "same prompt, translated" was never a controlled experiment: the model receives a different object in each language.

3.4.1 A Fixed Budget on an Uneven Floor

A hundred thousand rows sounds like plenty until you start listing what wants one: leading-space English words, code punctuation, JSON fragments, URLs, emoji, chat markers, and every language the model claims to support, sharing whatever's left.

BPE spends that budget by counting. It starts from small pieces and gives a row to whichever adjacent pair shows up most often. A row for authentication lets a future prompt carry that English-ish fragment in one position. The same row can't also be a Thai phrase, a Korean ending, or a Hindi suffix, and the table doesn't grow because the model card says "multilingual." The corpus mix is the lever. In an English-heavy corpus, English pairs win the global frequency contest. In a code-heavy corpus, );\n, four-space indents, and .DefaultCellStyle earn rows. If Tamil or Burmese is a rounding error in the tokenizer corpus, common local patterns there can lose to less useful English pairs. The text still encodes; byte fallback sees to that. It just arrives in smaller atoms, sitting on rows the model barely trained.

And the floor under that contest was never level to begin with. Type a and UTF-8 spends one byte: 61. Type 한 and UTF-8 spends three: ed 95 9c. No merge table has voted yet, no model has seen anything, and Korean has already been charged three times what English was.

Text	What it is	UTF-8 bytes
`a`	Latin letter	`61`
`م`	Arabic letter	`d9 85`
`अ`	Devanagari letter	`e0 a4 85`
`ก`	Thai letter	`e0 b8 81`
`한`	Korean syllable	`ed 95 9c`
`你`	Chinese character	`e4 bd a0`

The table isn't claiming those writing systems are more verbose. UTF-8 was designed as a compatibility encoding: ASCII stays one byte, and the rest of the codepoint space spends two, three, or four. That trade has long since paid for itself across every protocol on the internet. Byte-level tokenizers inherit the uneven part anyway.
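The byte column is easy to reproduce yourself. A short sketch using only Python's built-in UTF-8 encoder; the helper name is illustrative.

```python
def utf8_hex(ch: str) -> str:
    """Return the UTF-8 encoding of a string as space-separated hex bytes."""
    return ch.encode("utf-8").hex(" ")


samples = ["a", "م", "अ", "ก", "한", "你"]

for ch in samples:
    encoded = ch.encode("utf-8")
    # Code point, byte count, and the raw bytes a byte-level tokenizer starts from.
    print(f"U+{ord(ch):04X}  {len(encoded)} byte(s)  {utf8_hex(ch)}")
```

Everything a byte-level tokenizer learns about these characters sits on top of those byte counts.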

Byte fallback is still the right safety net. Without it, a tokenizer needs an unknown token or a giant character table. With it, every valid UTF-8 string survives encoding: a rare Thai word, a new emoji, a Burmese name, a mixed-script username. But a safety net isn't compression. A merge table that's seen enough Korean can fold 한 into part of a larger learned piece. A table that hasn't will deliver the same syllable as three byte-derived chunks, none of which a Korean reader would call a unit. Either way the model never sees the Unicode scalar, only whatever token IDs fell out of the merge table.

Blaming Unicode for this is too neat. UTF-8 explains why the floor is uneven; the tokenizer decides how much of that floor gets covered by learned merges. A multilingual tokenizer can spend rows on Thai syllables, Hindi suffixes, Korean endings, Arabic word shapes. An English-heavy tokenizer spends them on _id, (){, the, ";", and then lets byte fallback pick up everything else. Shared vocabularies aren't the enemy; cross-lingual models need shared space for names, punctuation, Latin loanwords, numbers, and code shapes. The trap is pretending the sharing is free. A fixed table has winners and losers, and "multilingual fairness" starts with asking who got the rows.

3.4.2 Fertility, and Its Limits

Tokens per sentence is what the bill counts. The multilingual papers count tokens per word, a ratio usually called fertility:

fertility = total tokens / total words

A corpus of 10,000 words tokenized into 11,000 pieces has fertility 1.1. The same corpus split into 25,000 pieces has fertility 2.5. Near one means the language is arriving in word-sized chunks. High fertility means the model is starting from fragments that have to be stitched back together.
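As a sketch, the measurement is one division once you've fixed the word list and the tokenizer. The tokenizer below is a hypothetical stand-in that cuts fixed three-character pieces; swap in a real encoder to measure anything meaningful.

```python
from typing import Callable, List, Sequence


def fertility(words: Sequence[str], tokenize: Callable[[str], List[str]]) -> float:
    """Average number of tokens per word under a given tokenizer."""
    total_tokens = sum(len(tokenize(word)) for word in words)
    return total_tokens / len(words)


def toy_tokenize(word: str) -> List[str]:
    """Hypothetical stand-in tokenizer: fixed three-character pieces."""
    return [word[i:i + 3] for i in range(0, len(word), 3)]


words = ["the", "token", "boundary", "moved"]
print(fertility(words, toy_tokenize))  # 2.0
```

Note that `words` is an input, not something the function discovers. Whoever segments the text has already made half the measurement.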

The "word" half of the ratio is itself a measurement choice. Thai and Chinese don't hand you spaces between ordinary words; the corpus or evaluator has to decide where the boundaries go. I wouldn't quote fertility as a property of a language. It's a property of a tokenizer, a dataset, and a word-boundary convention, measured together. Move any one of the three and the number moves.

Inside a fixed setup, though, the number is sharp. Rust, Pfeiffer, and coauthors used it to ask whether the shared tokenizer was itself part of the multilingual performance gap. They lined up a shared multilingual tokenizer against monolingual tokenizers for each target language, adapted the embeddings, and re-ran the evals. Languages with adequate coverage in the shared vocabulary stayed close to their monolingual baselines. Languages with high fertility improved on almost every task once the tokenizer was swapped. That pins a specific failure to the encode step. Finnish syntax didn't get harder in some mysterious way. Hindi entities didn't disappear. The model was being asked to learn them through smaller pieces, more positions, and rows with less language-specific evidence behind them.

But fertility counts pieces; it doesn't judge them. It can't tell you whether un | happy | ness is better than unh | app | iness; both are three pieces. It can't tell you whether a token crosses a useful morpheme boundary, whether a byte prefix is gluing unrelated characters into fake resemblance, or whether a short piece is common enough to develop a usable embedding. Ali et al. ran this in controlled training and found exactly that: fertility tracks some of the cost and doesn't reliably predict downstream quality on its own. A tokenizer can shorten a sentence and still pick awkward pieces, and compression keeps helping until it hides the structure the model actually needed.

So treat fertility as a first smell test. If a language sits far above English on it, expect a bigger bill, slower streaming, less usable context, and more work for the model before reasoning even starts. If the number looks fine, keep checking: a low fertility says the tokenizer stopped fragmenting the text, not that it picked the right atoms.

3.5 One Decision, Five Systems

Part 1 made a claim (§1.5): when the tokenization of a string changes, five things move together. Part 3 has now shown each channel failing on a real input.

Atomic units: the letters inside .Default. When the question lives below the atoms, the model has to reconstruct it from training history and context, and that's not where the lost detail survives. Sequence geometry: indentation eating eleven forward-pass slots before the first keyword. The output side: reversal hurting because the answer's atoms share nothing with the input's. Training signal: the SolidGoldMagikarp row that the tokenizer corpus minted and the LM corpus never trained. And the post-training surface: a schema that grows under a different tokenizer leaves that much less room for everything else.

So the bugs in this part aren't separate trivia. Each one is the same compression decision crossing a different part of the system: input alphabet, sequence geometry, output alphabet, gradient targets, post-training interface.

Part 4: The Tokenizer Is Part of the Model

Up to here the cost of a tokenizer has been paid in behavior: a miscounted letter, a fumbled carry, a Korean prompt charged triple. There's a second bill paid in tensors, and that one's what the GPU sees.

A tokenizer ships you two numbers. V, the vocabulary size, is how many distinct IDs the embedding table is keyed on, a property of the artifact frozen at training time. T, the length of the current prompt after encoding, depends on the input, the script, and the merge table doing the encoding. V lives in the weights: one embedding row per ID, one output logit per ID, paid once when the checkpoint loads and again at every decode step. T lives in the runtime: every position is more attention work, more KV cache, less room in the context window.

So "should I change the tokenizer?" is really four arithmetic questions, all coupled: parameters and logits scale with V, sequence length and KV cache scale with T, and pushing on one moves the rest, usually in the wrong direction.

4.1 The Four-Axis Tradeoff

The bargain in one line: bigger V buys smaller T. A code model wants a four-space indent, authentication, and {"role":"assistant"} to be single rows; a multilingual model wants 한국어 to compress like Korean; a tool-heavy product wants <|tool_call|> to be one token, not seven. All of that argues for a bigger table, and the payment lands in four places at once.

The embedding table widens with V, and an untied LM head widens with it, paid by every replica that ever loads the model. The output side widens too: decode-time softmax runs over the whole vocabulary, so adding 50k rare rows means 50k extra logits scored at every step. On the other side, the sequence shortens: less attention work per request, more room inside the context window, more actual document per prompt. And the KV cache shrinks alongside it, multiplied across however many users are talking to the model at once.

The four axes are coupled, and no single setting wins everywhere. A 32k vocabulary that looks lean on a 1B local model is almost certainly fragmenting Korean and Python indentation on a 70B serving 128k contexts. A 200k vocabulary that pays for itself there is dead weight on the smaller model. "Best vocabulary size" turns out to be a budget question dressed up as a hyperparameter.

4.1.1 What `V` Costs: Parameters and Logits

Take the shape a mid-size chat model actually runs: V = 100,000 and d = 4,096. The input embedding table is V × d scalars: 100,000 × 4,096 = 409,600,000 of them. At bf16, two bytes each, that's ~819 MB of weights before a single attention block has been allocated. Push V up by 10,000 rows (Korean, Python indentation, three more chat-template specials, take your pick) and the table grows by 40,960,000 parameters, about 82 MB on top.

Whether you pay for the table once or twice depends on the LM head. Smaller or older models often tie the output projection to the input embedding: same matrix, used in reverse, V × d paid once. Untied heads carry a separate d × V matrix for the final projection, and the token-facing weights double to ~1.64 GB. That lives in every checkpoint of every replica on every GPU; quantization changes the bytes per number but doesn't make the rows go away.
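The table arithmetic is worth having as a function, because V is the knob you'll want to turn. A minimal sketch; the function names are illustrative, and megabytes here are decimal (10^6 bytes).

```python
def table_params(vocab_size: int, d_model: int, tied: bool = True) -> int:
    """Scalars in the token-facing weights: one V x d table, or two if the head is untied."""
    tables = 1 if tied else 2
    return tables * vocab_size * d_model


def megabytes(params: int, bytes_per_param: int = 2) -> float:
    """Size in decimal megabytes; 2 bytes per parameter corresponds to bf16."""
    return params * bytes_per_param / 1e6


V, d = 100_000, 4_096
print(table_params(V, d))                         # 409600000
print(megabytes(table_params(V, d)))              # 819.2
print(megabytes(table_params(V, d, tied=False)))  # 1638.4
print(megabytes(table_params(10_000, d)))         # 81.92
```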

There's a quieter cost living inside the table. Common tokens get gradient on every batch, and the rows behind them move. Rare ones (long surnames, obscure URLs, byte-fragment junk from the corpus's long tail) get a thin trickle, sometimes nothing at all. The row exists, the index is valid, the embedding lookup succeeds, and the vector behind it may sit a hair off random initialization for the whole training run. That's the SolidGoldMagikarp failure mode from Part 3, and it stays silent until something causes the model to actually use the row. Adding rows isn't the same as adding knowledge. Each one still has to earn the vector behind it.

Parameters you pay for once, at load; logits you pay for every time the model picks a token. The LM head takes the width-d hidden vector at the last position and dots it against every row of the output projection, one score per token in the vocabulary:

head FLOPs per token ≈ 2 × d × V

The 2 is the multiply-add count. Plug in d = 4,096 and V = 100,000 and the vote costs about 820 million floating-point operations per token, every token. Not the whole model, of course. The transformer blocks at this shape hold roughly 12 × L × d² parameters (4d² of attention projections and 8d² of MLP per layer), which is about 6.4 billion at L = 32, so the blocks cost on the order of 13 billion FLOPs per token against the head's 0.8 billion. The head only starts to matter when V gets huge or the model gets small. Attention has its own line item, and serving stacks fuse kernels and shard the vocabulary across devices.

The number that matters is the slope. Widen V and you widen the vote at every decode step, and the math doesn't care whether the new rows ever win. Bolt 50,000 rare rows onto the table and every decision scores 50,000 extra logits whose probability rounds to zero.
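Here is the same accounting as code, so the slope is something you can poke at. It's a rough sketch using the approximations from the text (2 FLOPs per multiply-add, 12 × L × d² parameters in the blocks); attention over the sequence is left out, and the function names are illustrative.

```python
def head_flops(d_model: int, vocab_size: int) -> int:
    """Per-token FLOPs for the output projection: one multiply-add per (d, V) entry."""
    return 2 * d_model * vocab_size


def block_flops(d_model: int, n_layers: int) -> int:
    """Approximate per-token FLOPs for the blocks, from 12 * L * d^2 parameters."""
    return 2 * 12 * n_layers * d_model ** 2


d, V, L = 4_096, 100_000, 32
print(head_flops(d, V))                             # 819200000
print(block_flops(d, L))                            # 12884901888
print(head_flops(d, V + 50_000) - head_flops(d, V)) # 409600000
```

The last line is the cost of the 50,000 extra rows: about 410 million more FLOPs per generated token, whether or not any of those rows is ever chosen.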

At training the cost is louder. The logit tensor there is [B, T, V], because every position is predicting its next token, not just the last one. A million-token vocabulary turns each of those B × T slots into a million-way softmax. The width of the table sets the width of that softmax, at every position, every step.

4.1.2 What `T` Costs: Sequence Length and KV Cache

T is the axis where a bigger vocabulary pays itself back. Run the same document through two tokenizers. One was trained on a corpus that doesn't look much like yours, and the document encodes to 12,000 tokens. The other was trained closer to the document's style, and the same bytes land at 8,000. The model body hasn't changed; the prompt is just shorter before attention sees it.

A 128k context window isn't 128k words or characters. It's 128k tokenizer outputs. If Korean fragments to byte trails, if Python indentation arrives as runs of 0x20, if a tool schema spends three tokens per key, the window pays itself out on bookkeeping, and the effective window ends up much smaller than the advertised one depending on what you paste in. Prefill picks up the savings directly: fewer position pairs for attention to compare, less per-block scaffolding to carry forward. Decode picks up some of it too, if the output compresses. A model emits one token per step, so when common code shapes or template chunks earn larger pieces, the same rendered answer takes fewer trips through the stack.

The other place T costs you is memory, per active user. A decoder-only transformer doesn't reread the prompt for every new token. Each attention layer caches the keys and values it already computed for earlier positions, and the next token attends to those cached vectors instead of redoing the work. Compute traded for memory, and the memory has a shape:

KV cache bytes = B × T × L × 2 × kvWidth × bytes per scalar

B is active sequences, T is cached positions per sequence, L is transformer layers, the 2 is keys plus values, and kvWidth is full d for vanilla multi-head attention or smaller for grouped-query and multi-query variants that share K and V across heads. Tokenizer choice enters through T.

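Put numbers on it. B = 1, T = 8,192, L = 32, kvWidth = 1,024, bf16:

1 × 8,192 × 32 × 2 × 1,024 × 2 bytes = 1,073,741,824 bytes (1 GiB)

About a gigabyte for one active sequence. The same text under a tokenizer that fragments it to T = 12,288:

1 × 12,288 × 32 × 2 × 1,024 × 2 bytes = 1,610,612,736 bytes (1.5 GiB)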

The paragraph, the model, and the precision are all held fixed; the tokenizer made the cache 50% larger because the prompt was 50% longer in tokens. And it isn't a one-time checkpoint cost paid at deploy, it's live serving state, one copy per active conversation. Ten long-context users means ten of those caches in memory simultaneously, and a fragmentation pattern that looked benign on a single test prompt becomes a fleet-wide GPU memory tax the moment real traffic starts.
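The cache arithmetic as a function, for plugging in your own shapes. A minimal sketch of the product described above; the parameter names are mine, and real serving stacks add paging and alignment overhead that this ignores.

```python
def kv_cache_bytes(
    batch: int, seq_len: int, n_layers: int, kv_width: int, bytes_per_scalar: int = 2
) -> int:
    """Bytes of cached keys and values: B x T x L x 2 (K and V) x kvWidth x bytes."""
    return batch * seq_len * n_layers * 2 * kv_width * bytes_per_scalar


GIB = 2 ** 30

compact = kv_cache_bytes(batch=1, seq_len=8_192, n_layers=32, kv_width=1_024)
fragmented = kv_cache_bytes(batch=1, seq_len=12_288, n_layers=32, kv_width=1_024)

print(compact / GIB)      # 1.0
print(fragmented / GIB)   # 1.5
# Ten concurrent long-context users under the fragmenting tokenizer.
print(kv_cache_bytes(10, 12_288, 32, 1_024) / GIB)  # 15.0
```

Only `seq_len` changes between the first two calls, and that's the one argument the tokenizer controls.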

There's a ceiling on the upside. The first hundred merges of a BPE run pick off spaces, frequent English words, punctuation shapes, common suffixes, all cheap and reusable. The hundred-thousandth merge tends to be a niche string that helps on a thin slice of the corpus while still widening the embedding table and the output head for everyone. Past some point, every new row buys you fewer positions. A pure byte tokenizer is the limit case in the other direction: V = 256, an embedding table the size of a postage stamp, and the whole cost handed to T, with the word tokenization at twelve positions and every Devanagari character at three.

Which side wins depends on the corpus, the context lengths people actually use, the model width, the attention variant, the serving load. You don't reason that out from first principles; you measure it.

4.2 Token Economics

The pricing page shows $1 / 1M input tokens, and the temptation is to start multiplying. The number you arrive at isn't the number that lands on the invoice, because a token isn't a fixed unit the way a gigabyte is. A token is whatever this model's tokenizer emits, and two tokenizers shown the same paragraph come back with different counts. The rate card doesn't say which one ran.

The support ticket from Part 3 makes the point in dollars. The job is the same whichever language it arrives in: summarize it, extract the account IDs, draft a reply. In English it comes out to 2,000 tokens. In Thai, 3,200. The Thai request spent 60% more context before anyone glanced at the rate card. And the usable context window shrinks the same way, with no asterisk on the spec sheet.

Two providers that look comparable on paper hide the same trap:

Model B is cheaper per token and more expensive on this prompt. Quality and latency might still make it the right call. The unit just isn't as fixed as the rate card makes it look. Products with fixed wrappers feel this worst: a chat product sends a system prompt, a tool schema, retrieved documents, and a user question on every request, and if the tokenizer inflates the schema and the retrieved passages by 30%, every request loses that much room for the part that actually varies.
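A small sketch of how a lower rate can still produce a higher bill. Every number here is hypothetical: the rates are invented, and the token counts are borrowed from the English and Thai ticket above only to have a 1.6× gap to work with.

```python
def prompt_cost_usd(tokens, usd_per_million):
    """Cost of one prompt at a flat per-million-token input rate."""
    return tokens * usd_per_million / 1_000_000

# Hypothetical numbers for illustration only: not taken from any real rate card.
model_a = {"usd_per_million": 3.00, "tokens_for_prompt": 2_000}
model_b = {"usd_per_million": 2.50, "tokens_for_prompt": 3_200}  # cheaper rate, more tokens

cost_a = prompt_cost_usd(model_a["tokens_for_prompt"], model_a["usd_per_million"])
cost_b = prompt_cost_usd(model_b["tokens_for_prompt"], model_b["usd_per_million"])

print(f"A: ${cost_a:.4f}  B: ${cost_b:.4f}")  # A: $0.0060  B: $0.0080
```

The comparison only means something if both token counts come from the tokenizers that actually run in production, on the same prompt.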

The other twist in the rate card sits in plain sight. The price isn't one number, it's two, one for what you send in and one for what comes back, almost always at different rates:

They aren't equal because input and output do different work. Input is prefill: the whole prompt sits there at once, and the serving stack burns through every position in parallel. Output is decode: token 17 can't be computed until token 16 exists. The KV cache spares the model from rerunning attention over earlier positions, but every new token still costs one full pass through the stack. That serial loop is the expensive part. Decode ties up serving capacity one step at a time and batches worse than a fully known prompt, which is why output tokens usually cost more.

Say the rates are $3 / 1M input and $15 / 1M output:

The output rate is 5× higher and the input line is still 4× larger, because the prompt is 20× longer. Flip the product shape and the answer flips with it: a code agent that reads a short bug report and writes a long patch can easily spend more on output than input.

Which is what makes reflexive prompt-trimming advice so annoying. Cut 20% off a 20,000-token input and you save 4,000 tokens. At $3 / 1M, that's 1.2 cents. If the leaner prompt causes one retry, or nudges the model into 1,000 extra output tokens at $15 / 1M, the savings are already gone.
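The two meters are easy to put side by side. This sketch uses the $3 and $15 rates from the text; the request shape (20,000 tokens in, 1,000 out) is an assumption chosen to match the 20× ratio, and the function name is made up.

```python
INPUT_RATE = 3.0    # USD per 1M input tokens (rate from the text)
OUTPUT_RATE = 15.0  # USD per 1M output tokens (rate from the text)

def request_cost(input_tokens, output_tokens):
    """Return (input_usd, output_usd) for one request."""
    return (input_tokens * INPUT_RATE / 1e6, output_tokens * OUTPUT_RATE / 1e6)

# Assumed request shape: 20,000 tokens in, 1,000 out (a 20x ratio).
input_usd, output_usd = request_cost(20_000, 1_000)
print(input_usd, output_usd)  # roughly 6 cents of input, 1.5 cents of output

# Trim 20% of the input, but the leaner prompt triggers 1,000 extra output tokens.
trim_input, trim_output = request_cost(16_000, 2_000)
before = input_usd + output_usd
after = trim_input + trim_output
print(after > before)  # True: the extra output costs more than the trim saved
```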

The two meters need different strategies. Input is what you can plan around: strip dead JSON fields, drop redundant examples, cache a stable prefix. Output length is part product decision, part model behavior. A tokenizer that compresses code well makes generated code cheaper too, but the model still decides how much code to write. The unit that matters in the end is dollars per completed task: real input, expected output, retries, cache hits, all counted with the same tokenizer that actually runs in production. The rate card sets the price. The tokenizer sets what the price gets multiplied by.

4.3 A Swap Is a Retrain

The checkpoint learned exactly one integer language. A tokenizer is a few hundred kilobytes of files (a merge table, a vocabulary map, a regex, a config), so swapping it looks like swapping a JSON parser. The problem is that row 12345 of the embedding table doesn't carry meaning because 12345 is a magic number; it carries meaning because the old tokenizer emitted that ID across a particular set of contexts, and training nudged row 12345 around in d-dimensional space until the rest of the network knew what to do with the vector that landed there. The closest analogy I can find is an ABI: a fixed contract about which integer names which atom, with the LM head holding the same contract in reverse. When the hidden state says "the next atom should be this thing," the head has one logit slot per old vocabulary entry, so the distribution it produces is over the old tokenizer's atoms, not abstract strings.

Change the tokenizer and the code still produces integers, but the contract has moved underneath:

The exact IDs don't matter. The model was trained on the first sequence, not the second, and three things break at once.
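A toy version of the moved contract, as a hedged sketch. The old vocabulary reuses the three GPT-4 pieces and IDs quoted in the introduction; the new vocabulary, its IDs, and the greedy longest-match encoder are all invented for illustration and are not how a real BPE tokenizer segments text.

```python
def encode(text, vocab):
    """Greedy longest-match segmentation; a toy stand-in for a real tokenizer."""
    ids, i = [], 0
    while i < len(text):
        for end in range(len(text), i, -1):   # try the longest piece first
            piece = text[i:end]
            if piece in vocab:
                ids.append(vocab[piece])
                i = end
                break
        else:
            raise ValueError(f"no piece covers {text[i]!r}")
    return ids

# Old pieces and IDs as quoted earlier; the new vocabulary is made up.
# ID 496 exists in both, but it names a different piece in each.
old_vocab = {"str": 496, "aw": 675, "berry": 15717}
new_vocab = {"straw": 496, "berry": 9000}

print(encode("strawberry", old_vocab))  # [496, 675, 15717]
print(encode("strawberry", new_vocab))  # [496, 9000]
```

Row 496 of the embedding table was trained as str. Under the new vocabulary the same row gets looked up whenever straw appears, and nothing in the weights knows that.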

The input embeddings stop lining up. A reused old ID now points the lookup at a vector trained for the wrong piece. A new ID lands on an unborn row with no history behind it. A former one-token string that now splits into three forces the model to assemble meaning from three rows where it used to read one.

The output distribution stops lining up too. A decoder-only model predicts the next token, not the next character span, and probability mass over the old vocabulary can't be cleanly remapped onto the new one unless the segmentations match, which they essentially never do.

And the sequence geometry shifts. A string that used to sit at position 400 may start at 430. A delimiter that was one step may become four. Attention paths move, truncation cutoffs land in different places, loss weighting shifts, the KV-cache footprint changes. The transformer body still wants width-d vectors; it just gets fed a different sequence underneath.

A parser refactor keeps the interface stable and changes the implementation; a tokenizer swap changes the interface the weights were trained against. That doesn't make every tokenizer change a from-scratch retrain, but it does mean the bridge has to be paid for somewhere: new rows initialized and trained, old rows remapped, the LM head adapted, continued pretraining, distillation, or, in the limit, a fresh recipe top to bottom. The question left is size. "Add a <tool_call> token" and "replace the entire vocabulary" aren't the same job.

4.3.1 Three Sizes of Swap

"Updated tokenizer" in a release note can hide three different jobs.

Append. One new control token bolted onto an existing tokenizer: <tool_call>, <fim_prefix>, a new end-of-turn marker. Ordinary text still maps to the same IDs it did yesterday, which is what makes this the cheap case. The mechanics are §2.6.3's surgery: resize the embedding table and the LM head, initialize the new row (mean of related existing rows is a popular trick), and train until the token has a job.
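A minimal sketch of the append case, using plain lists in place of a real embedding matrix. The function name, the toy table, and the choice of which rows count as related are all assumptions; an untied LM head would need the same new row, and the training step is not shown.

```python
def append_token(vocab, embeddings, new_token, related_tokens):
    """Add one row for new_token, initialized to the mean of related rows.

    vocab: dict token -> row index; embeddings: list of equal-length float lists.
    Existing IDs are untouched, so ordinary text still encodes the same way.
    """
    rows = [embeddings[vocab[t]] for t in related_tokens]
    mean_row = [sum(col) / len(rows) for col in zip(*rows)]
    vocab[new_token] = len(embeddings)   # new ID is the next free row
    embeddings.append(mean_row)
    return vocab[new_token]

# Toy table: 3 tokens, width 2. All values are illustrative.
vocab = {"<bos>": 0, "<eos>": 1, "hello": 2}
embeddings = [[1.0, 0.0], [3.0, 2.0], [0.5, 0.5]]

new_id = append_token(vocab, embeddings, "<tool_call>", ["<bos>", "<eos>"])
print(new_id, embeddings[new_id])  # 3 [2.0, 1.0]
```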

Transplant. The whole tokenizer changes under an existing checkpoint. Most ordinary text now segments differently, old one-token strings split into pieces, new pieces have no trained rows behind them, and the small overlap between the old and new vocabularies is about the only thing working in the procedure's favor. There's a real research line here (WECHSEL, Model-Aware Tokenizer Transfer) arguing that this damage has enough structure to repair without starting over; §5.4.3 covers it properly, numbers and all.

From scratch. Nothing to preserve, which makes it the easiest case to reason about and the most expensive one to run. Pick a tokenizer before pretraining a new foundation model and there's no row 12345 to remap, no LM head to rescue, no transplant trick to grade. The model learns the alphabet from batch zero, and the tokenizer is one more entry in the recipe, alongside context length, data mix, and architecture.

A public user almost never gets told which of the three happened. They see the artifact: how their prompts split, which budgets moved, whether behavior changed on their workloads.

4.4 Worked Case: Claude Opus 4.7

You run /v1/messages/count_tokens against the same prompt for Opus 4.6 and Opus 4.7. The prompt is identical, but the count comes back higher under 4.7, sometimes by a little, sometimes by 35%, and the rate card hasn't moved.

That's the Claude Opus 4.7 release of April 16, 2026. Same provider, same model family, same listed price. Different tokenizer.

The migration guide is unusually explicit about it. The same input may use roughly 1.0× to 1.35× as many text tokens as it did under 4.6, depending on content. It tells you to re-run /v1/messages/count_tokens, leave more headroom in max_tokens, and re-baseline compaction triggers. A tokenizer change has crossed into the API migration docs.

The arithmetic is easy. Opus 4.6 and 4.7 both list at $5 / MTok input and $25 / MTok output on Anthropic's API pricing page. At the top of that band (1.35×), a prompt that was 50,000 input tokens under 4.6 becomes 67,500 input tokens under 4.7. The input line moves like this:
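Worked out from the figures in the paragraph above (same $5 / MTok input rate on both sides, only the token count differs):

```latex
\[
\begin{aligned}
\text{Opus 4.6:}\quad & 50{,}000 \times \frac{\$5}{10^{6}} = \$0.25 \\
\text{Opus 4.7:}\quad & 67{,}500 \times \frac{\$5}{10^{6}} = \$0.3375
\end{aligned}
\]
```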

Nothing about the rates changed, only the count.

Public measurements blur the tidy 1.0× to 1.35× range. Simon Willison measured one system prompt at 7,335 tokens on 4.7 versus 5,039 on 4.6, about 1.46× and above the official band. A text-heavy 30-page PDF in the same comparison moved from 56,482 to 60,934, only 1.08×. Another public count_tokens benchmark put TypeScript around 1.36× and dense JSON around 1.13×.
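The ratios are straightforward to recompute from the quoted counts. This sketch only covers the two measurements whose raw numbers appear above; the dictionary layout is an illustration, not anyone's published tooling.

```python
# (tokens under 4.7, tokens under 4.6), as quoted in the text.
measurements = {
    "system prompt": (7_335, 5_039),
    "text-heavy PDF": (60_934, 56_482),
}

ratios = {name: new / old for name, (new, old) in measurements.items()}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")
# system prompt: 1.46x
# text-heavy PDF: 1.08x
```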

I wouldn't turn any of these ratios into a global multiplier. The interesting thing isn't the average, it's the spread. A tokenizer is a compression scheme trained around the patterns it expects to see, and code, JSON, prose, tool schemas, OCR, Markdown tables, and multilingual text don't share local patterns. Swap the scheme and the bill moves by content type, not by a flat factor.

§4.3 is the right lens for what the ratios don't tell you. Opus 4.7 publicly demonstrates that a frontier release can change tokenization in a way users can measure. It tells you nothing about Anthropic's training recipe. The release could sit anywhere on the §4.3.1 spectrum: from-scratch pretraining with the new tokenizer, continued training with interface adaptation, or either of those combined with distillation from a stronger teacher. The public docs don't say, and the ratios alone can't tell you.

Behavior is even easier to over-attribute. The same release documents changes around literal instruction following, effort control, response length, tool use, and high-resolution image handling. Some of that may interact with tokenization, but none of it becomes tokenizer-caused just because the tokenizer changed in the same release.

From there it's mechanical. Count your real prompts under both model IDs with the official token counter, broken out by content type: system prompts, tool schemas, code, JSON, documents, logs, the languages your users actually write in. Recompute context budgets, prompt-cache forecasts, retrieval packing, compaction triggers. Run behavior evals separately from token-count evals, so a regression in one doesn't get chalked up to the other. The tokenizer is the part you can measure before anyone starts arguing about model taste.

Part 5: Cracks and the Frontier

Everything so far has assumed the input is text. A vision transformer feeds attention something that came out of a learned convolution, not a merge table. Byte-level language models try to throw the merge table out entirely, then immediately pay for that decision in sequence length. Gist tokens don't compress the input at all; they compress the activations the model has already produced from reading the input.

People keep calling all of these "tokens," which is a little funny once you look at what each one actually is. Sometimes the token is an integer ID with a row in an embedding table, like everything in Part 4. Sometimes it's a patch vector that never had a row to begin with. Or one of the 256 raw byte values, no merges anywhere. Or a learned compression of state the model wrote to itself a few layers back. What stays constant is the job: a tokenizer is whatever decides what gets to be one position before attention has to pay for it. Byte-level BPE is one answer to that question, good enough for text that it's been the default for years. This part is about the places where it stops being the only answer.

5.1 Vision Tokenizers

A 224×224 RGB image is 150,528 channel values. One second of 512×512 RGB video at 24 fps is 18,874,368. Hand any of that to attention as raw sequence positions and the model spends its first dozen layers learning how to compare individual pixel channels before it locates so much as an edge. Nobody runs vision models this way. Sequence length kills you long before the model gets to do anything useful, the same way it kills naive byte-level text models.

So vision does what text does: compress locally first, let attention reason over the compressed units. And the atom question from Part 1 shows up again. What counts as one unit? Three live answers, all called "tokens" by their respective papers, only one of which looks anything like the text version.

ViT cuts the image into patches and projects each one straight into a vector, no vocabulary anywhere. VQ-VAE puts a vocabulary back, snapping each patch to the nearest entry in a learned codebook and passing the integer index forward, which makes it the closest analogue to BPE outside of language. And Sora compresses the video through a learned codec first, then tiles the result into spacetime patches that are neither codebook entries nor raw pixels.

5.1.1 Continuous Patches

ViT has no vocabulary; the thing that plays the tokenizer's role is a single convolution.

Those 150,528 channel values (3×224×224) get sliced into 16×16 patches before any attention runs. One patch is 16×16×3 = 768 numbers, and the whole image becomes a 14×14 grid of 196 patches. Each patch gets flattened and projected to width d. You don't actually write the flatten step; the whole operation collapses into one strided convolution:

Kernel size equal to stride means the kernel lands exactly once per patch, no overlap. The output is [B, d, 14, 14]. Flatten the spatial grid and the transformer sees [B, 196, d], vectors the whole way through, never an integer ID.

The "token" here is a vector the model trained alongside everything else. The same patch content at the top of an image and the bottom of one means something different only because of the positional embedding the network glues onto it. Take the position off and the patches are interchangeable; what a vocabulary does for text, geometry does here.

ViT exposes one tokenizer knob, and it's the patch size. The trade-off underneath is the one BPE already lives with. Use 32×32 patches and you get 7×7 = 49 positions: coarser, much cheaper, but the model can't see anything smaller than a 32-pixel block at a single position. Use 8×8 patches and you get 28×28 = 784: finer, but attention is now about (784/196)² ≈ 16× more expensive. Bigger atoms shorten the sequence and hide more structure inside each position; smaller atoms expose more structure and pay for it in compute.
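The patch-size knob reduces to a couple of lines of arithmetic. This is a sketch for square images and square patches only, and it treats attention cost as proportional to the square of the position count, ignoring everything else in the model.

```python
def patch_grid(image_size, patch_size):
    """Positions produced by cutting a square image into square patches."""
    if image_size % patch_size:
        raise ValueError("patch size must divide the image size")
    side = image_size // patch_size
    return side * side

baseline = patch_grid(224, 16)  # 196 positions
for p in (32, 16, 8):
    n = patch_grid(224, p)
    # Attention compares pairs of positions, so cost scales with n squared.
    print(p, n, (n / baseline) ** 2)
# 32 49 0.0625
# 16 196 1.0
# 8 784 16.0
```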

5.1.2 Discrete Codebooks

ViT chose geometry. VQ-VAE chose integers.

The pipeline starts the same way. A patch (or a cell from a convolutional feature map) gets pushed through an encoder and comes out as a vector. But if you want to generate images with the language-modeling toolkit, you need something finite to softmax over, and a continuous vector isn't that. The quantizer is the new piece. It reads the vector, compares it against every row of a learned codebook, picks the closest row, and throws the original vector away. The row's index is what the rest of the model sees. If row 417 wins, the token for that patch is the integer 417, the same shape as a BPE ID, just learned over images instead of text.
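The quantizer step is small enough to sketch directly. The codebook here is a toy with four hand-picked rows, and squared Euclidean distance is assumed as the notion of "closest"; a real VQ-VAE learns the codebook jointly with the encoder and decoder, which this does not show.

```python
def quantize(vector, codebook):
    """Return the index of the codebook row nearest to vector (squared L2)."""
    def sq_dist(row):
        return sum((a - b) ** 2 for a, b in zip(vector, row))
    return min(range(len(codebook)), key=lambda i: sq_dist(codebook[i]))

# Toy 4-entry codebook of 2-d vectors; real codebooks are learned and far larger.
codebook = [[0.0, 0.0], [1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
encoder_output = [0.9, 0.2]

index = quantize(encoder_output, codebook)
print(index)            # 1
print(codebook[index])  # [1.0, 0.0]  (what the decoder gets back)
```

The gap between [0.9, 0.2] and [1.0, 0.0] is exactly the information the integer throws away.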

The original VQ-VAE setup on ImageNet turns a 128×128 RGB image into a 32×32 grid of indices drawn from a 512-entry codebook. That's 1,024 visual token IDs per image, each chosen from one of 512 rows. The model on top no longer predicts pixels; it predicts integers laid out in a grid.

The moment images become integer IDs, the rest of the stack snaps back into a shape language modelers already know. Cross-entropy over the codebook. Autoregressive sampling. A separate decoder network that turns the predicted grid back into pixels. A text tokenizer picks one ID from a finite vocabulary of byte-string pieces, and a VQ tokenizer picks one ID from a finite vocabulary of learned visual vectors; the machinery on top doesn't care which.

The price you pay for the integer is the snap. A continuous patch keeps whatever the encoder put into its vector. A codebook token throws away the distance between the encoder vector and the row that won. You can push back by widening the codebook (more visual prototypes, but a wider prediction problem too), or you can accept that the decoder will have to clean up rougher choices.

The analogy to text stops at the words "integer ID." Row 417 doesn't have to mean "cat ear" or "blue tile" or anything you can give a name to. It means whatever the encoder, codebook, decoder, and autoregressive prior agreed on during training. Retrain the codebook from scratch and the old IDs stop speaking the same image language.

5.1.3 Sora's Spacetime Patches

Sora is the one that gets misread most often, especially if "token" in your head still means "integer ID." The Sora technical report never names a codebook size or an index table; there's nothing in it that looks like the visual vocabularies of the last section. What it describes is a compression network. Raw video gets squeezed into a much lower-dimensional latent representation, and the diffusion transformer operates on that compressed version. Pixels are too dense to model directly, so the model never sees them.

The tokenizer-shaped piece happens after the compression, not before. The latent video gets cut into spacetime patches: small blocks of width, height, and time, carved out of the compressed cube. Those patches are the sequence the transformer sees.

A Sora patch isn't row 417 in some published table; there is no published table. It's closer in spirit to ViT than to VQ-VAE, a continuous chunk of state with a position attached. But it isn't vanilla ViT either, because the chunk doesn't come straight from pixels. It comes from a learned codec that sits in front of the diffusion transformer.

Diffusion is why Sora went continuous. A hard discrete codebook would have turned the prediction problem into something that looks like language modeling, a one-hot choice over a finite set. That's a clean fit for autoregressive transformers and a bad fit for diffusion, which wants continuous, denoise-friendly targets. So Sora keeps the patches continuous and puts the codec on top of the pixels, instead of putting a codebook on top of the patches.

The payoff is flexibility. You can give the model a short clip, a long one, a vertical phone video, a still image, and the transformer handles each by rearranging patches in the same latent format. A still image is the latent cube with a depth of 1. A widescreen clip has more patches across width than height. No new tokenizer per aspect ratio; the patches just rearrange.

One more difference from text is how the tokenizer ships. A text tokenizer is three files you can email: vocab.json, merges.txt, a regex. A vision "tokenizer" doesn't ship that way at all. The thing that turns an image into atoms inside a ViT is a Conv2d layer with learned weights. In a VQ-VAE it's a convolutional encoder, a learned codebook, and a paired decoder. In Sora it's a latent codec plus a patch layout that came out of training.

These are trained machinery, and the rest of the model has spent its training run learning to talk to one specific instance of that machinery. Replace the codec in a video model and the spacetime patches mean something different; nothing else in the network knows yet. And where a text swap moves embedding rows, logits, and sequence lengths, a vision swap also touches patch size, latent geometry, codebook size, decoder weights, positional encoding, and (for diffusion models) the noise schedule. That's the swap-is-a-retrain story from Part 4 with the volume turned up, and it makes the text-side version look almost gentle.

Vision also makes one thing harder to ignore. With text you can pretend the compression boundary is a vocabulary problem, because the merge table does most of the allocation. In vision it's plainly a model-design choice, deciding which distinctions get their own sequence positions, which get folded inside a vector, and which get left for a decoder to reconstruct after the model is done thinking.

If a fixed vocabulary causes the grief Part 3 cataloged, why not delete the merge table and feed bytes? Pure UTF-8, no <unk>, nothing to swap because there's nothing to ship.

5.2 Tokenization-Free Models

The model can learn around every quirk the merge table imposes, given enough capacity and data. But it's learning around a decision some other corpus made before pretraining started. The first layer never got a vote in which byte strings deserve a single position.

UTF-8 offers a rude alternative. The alphabet is closed: 256 byte values, period. A model that reads bytes directly handles English, Thai, Python, mojibake, scraped-web nonsense, and emoji without ever consulting a merge file. No <unk> for ordinary valid text, and no fight where English substrings, code idioms, math notation, and underrepresented languages all compete for rows in the same finite vocabulary.

I'd love for that to be the whole story. The problem shows up on a single word. With a normal subword tokenizer, the word tokenization is one or two pieces. As raw bytes it's the twelve positions Part 1 already counted, and every one of them is a sequence position attention has to pay for.

That's the bargain in miniature: bytes cover everything, and the sequences get long. A subword tokenizer is prepaid compression: it shortens the sequence before attention, the KV cache, and the decoding loop ever see it. A pure byte model hands all three the uncompressed stream.

So the useful version of "tokenization-free" isn't "delete the tokenizer and keep the model identical." That deletes the only thing keeping the sequence short. The papers that actually work move the compression instead. Byte patches grouped before attention. Learned downsampling between layers. Entropy-driven boundaries that pack predictable bytes together and isolate surprising ones. A backbone whose memory cost doesn't scale the same way with sequence length. All of these are the same trick in different forms: remove the external vocabulary and push the compression somewhere else in the architecture.

5.2.1 Why Naive Bytes Don't Ship

The naive version trains, which surprised me the first time I saw it. It's exactly what it sounds like: 256 byte IDs, a few special tokens for things like end-of-sequence, the same transformer objective as before. ByT5 (Xue et al., 2022) is the cleanest example. Take mT5, the multilingual T5, rip out SentencePiece, feed in raw UTF-8 bytes, predict raw UTF-8 bytes. On tasks that punish a fixed vocabulary (typos, character corruptions, low-resource scripts), the byte model actually does better than its tokenized cousin.

Then you look at the sequence lengths. The ByT5 paper measures this directly. mT5's SentencePiece vocabulary averaged around 4 bytes per token on its multilingual training data. A byte model is, by definition, 1 byte per token. Same text, roughly 4× as many sequence positions.

A factor of 4 in a spreadsheet doesn't look scary. In a transformer it is. Attention does pairwise work over positions, so a 4× longer sequence inflates the attention map by roughly 16×. The feedforward blocks and the KV cache scale linearly, so the first pays about 4× more compute and the second about 4× more memory during generation. Decoding gets longer too, because the model has to write the word tokenization as twelve sequential byte tokens:

Twelve forward passes through the whole stack, one per byte, to spell a single word. Non-Latin scripts make it worse: a single Devanagari character is usually three UTF-8 bytes, many emoji are four apiece, and an emoji sequence with skin tones or zero-width joiners stretches much longer. Bytes solve the coverage problem and trade it for a length problem.
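The byte counts above are easy to verify with nothing but the standard library. The helper name is made up, and the two non-Latin examples (Devanagari letter KA and one common emoji) are picked only as representatives.

```python
def byte_positions(text):
    """Sequence positions a pure byte model spends on text (UTF-8)."""
    return len(text.encode("utf-8"))

# The twelve byte IDs a byte-level model would have to emit, one per step.
print(list("tokenization".encode("utf-8")))

print(byte_positions("tokenization"))  # 12
print(byte_positions("\u0915"))        # 3  (Devanagari letter KA)
print(byte_positions("\U0001F600"))    # 4  (grinning face emoji)
```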

The label also undersells how much tokenizing this version still does. The early transformer layers in a byte model end up doing the merging work themselves: relearning chunks like tion, indentation runs, UTF-8 character boundaries, common word shapes. By the time the deeper stack starts doing anything interesting, several layers' worth of compute have already been spent on what BPE would have done for free, before training even started.

Flat byte transformers prove an important point all the same. Bytes carry enough information. They also prove that "enough information" and "an affordable sequence length" are two different properties, and production systems need both.

5.2.2 The Price of Bytes

Vision saw this problem first. Continuous patches don't process individual pixels; they group pixels into chunks before any attention runs. Three byte-level papers run the same play on text.

MEGABYTE (Yu et al., 2023) is the bluntest version. Glue every eight bytes into a patch and stack two transformers. The large one attends over patch positions, of which there are about T / P for a byte stream of length T and patch size P (here P = 8). A small one predicts the bytes inside each patch, conditioned on what the large one produced. One enormous O(T²) attention map over the full byte stream becomes a smaller global attention plus a handful of tiny local ones. Bytes are still the atoms; the unit the expensive part of the model reasons over is bigger. The compression budget moved out of a frozen merge file and into the architecture itself.
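A rough count of attention pairs shows where the saving comes from. This is a back-of-the-envelope sketch, not MEGABYTE's own cost model: it ignores causal masking, the different widths of the global and local models, and all feedforward work. T = 8,192 is an assumed stream length.

```python
def flat_attention_pairs(num_bytes):
    """Position pairs in one full attention map over the byte stream."""
    return num_bytes ** 2

def patched_attention_pairs(num_bytes, patch_size):
    """Global attention over patches plus one small local map per patch."""
    num_patches = num_bytes // patch_size
    global_pairs = num_patches ** 2
    local_pairs = num_patches * patch_size ** 2
    return global_pairs + local_pairs

T, P = 8_192, 8  # T is an assumed stream length; P = 8 is the patch size from the text
flat = flat_attention_pairs(T)
patched = patched_attention_pairs(T, P)
print(flat, patched, round(flat / patched, 1))  # 67108864 1114112 60.2
```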

BLT, the Byte Latent Transformer (Pagnoni et al., 2024), makes the patch boundary smarter. Fixed 8-byte strides are easy to ship; text obviously isn't built out of 8-byte facts. The first byte of Mozart after Who composed The Magic Flute? is a much harder prediction than the z after Mo. BLT runs a small byte-level entropy model out front, opens a new patch wherever uncertainty spikes, and lets predictable byte runs stay grouped inside larger patches. A BPE token is a permanent vocabulary row in a frozen file. A BLT patch is temporary: carved at runtime, alive for one forward pass, gone after. The model reads bytes at the very bottom, but the expensive latent transformer never sees individual bytes, only variable-size groups whose boundaries the model itself helped pick.
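The boundary rule can be sketched as a threshold over per-byte uncertainty. This is a simplification under stated assumptions: the entropies below are invented by hand, whereas BLT computes them with a small learned byte-level model, and the paper's actual boundary criteria are not reproduced here.

```python
def entropy_patches(data, entropies, threshold):
    """Split data into patches, opening a new patch where entropy exceeds threshold.

    entropies[i] is the (assumed, precomputed) next-byte uncertainty at position i.
    Position 0 always starts the first patch.
    """
    patches, start = [], 0
    for i in range(1, len(data)):
        if entropies[i] > threshold:
            patches.append(data[start:i])
            start = i
    patches.append(data[start:])
    return patches

text = b"Mozart wrote"
# Made-up entropies: word starts are hard to predict, continuations are easy.
entropies = [3.1, 0.4, 0.2, 0.2, 0.1, 0.1, 0.3, 2.8, 0.5, 0.3, 0.2, 0.1]
print(entropy_patches(text, entropies, threshold=1.0))  # [b'Mozart ', b'wrote']
```

Twelve bytes become two positions for the expensive model, and the split lands where the prediction got hard rather than at a fixed stride.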

MambaByte (Wang et al., 2024) takes a different exit entirely. Keep bytes as the visible alphabet, change the backbone. A transformer pays for long byte streams twice over, once in attention and once in the KV cache. A state space model carries a fixed-size recurrent state instead, so its memory cost doesn't grow with context length the way attention's does. Byte-by-byte decoding on its own would still be painfully slow, so MambaByte cheats: a smaller subword-level model drafts a few tokens ahead, and the byte model only checks the drafted bytes against its own distribution. When the draft disagrees, the draft loses. That flips the old role of the tokenizer: the merge table isn't the source of truth there, just a speculative hint, and the bytes get the final word.

The pattern across all three is the same: bytes make a fine floor, but something on top still has to compress (fixed patches in MEGABYTE, entropy patches in BLT, a more byte-friendly recurrent state in MambaByte).

5.2.3 Frozen vs Adaptive

The axis that actually separates these systems isn't bytes versus tokens but when the compression happens.

BPE makes the decision once, before any loss has ever been computed. The merge table turns bytes into IDs, and that file ships with the model forever. Short sequences, stable IDs, an artifact you can serve. You also inherit every implicit vote the merge corpus cast. If Thai, Python indentation, the digit string 2000, or a rare identifier ended up with an awkward split, the transformer is stuck with it for every input it will ever see.

Flat byte models live at the other end. No precompression. The model gets the raw byte stream and has to learn every grouping from scratch, during training. Clean interface, terrible geometry: the sequence is already enormous before the model has had a chance to decide which bytes belonged together.

The interesting papers don't pick a side. MEGABYTE keeps bytes at the boundary but groups them into fixed patches before the expensive transformer ever sees them. BLT makes those boundaries depend on uncertainty rather than position. MrT5 (Kallini et al., 2024) keeps the bytes at the input and learns a delete gate that prunes positions partway through the network, so easy stretches get compressed away once the model has had a chance to look at them. What unites them is the timing rather than the absence of a tokenizer: compression happens after the model has had some local context to work with, not months earlier in a frozen merge file. And an adaptive system can spend its budget unevenly: more positions on the hard spans, fewer on the predictable ones. A dense script, a misspelling, an emoji sequence, and a long English boilerplate template don't all have to inherit the same merge table.

This is roughly the version of "no tokenizer" I'd actually bet on. The frozen merge table may be on borrowed time, but its successor probably isn't a 256-token vocabulary glued onto an otherwise unchanged transformer, because something still has to compress. The bet is that the model should get a vote in what gets compressed and where.

5.3 Gist Tokens

A chat product's prompt is mostly text the user never typed. The user types summarize this email, five tokens; the model gets 3,000. A system prompt. House style. Tool docs. Retrieved documents. A routing rule. A few examples. A JSON schema you'd rather not paste into a screenshot. The user's actual question is a footnote on the prefix that wraps it, and the prefix gets paid for on every single call.

There are two standard ways to make this cheaper. Prompt caching helps when the long prefix is byte-for-byte identical across calls. Fine-tuning helps when the behavior the prefix encodes is stable enough that you'd rather bake it into the weights. Gist tokens (Mu, Li, and Goodman, 2023) try a third route. Read the long instruction once. Compress whatever the model needs from it into a couple of special-token positions. Do the actual task by attending to those positions instead of the original prefix.

The training setup is almost embarrassingly simple. Add a special token, say <G1>. Arrange each training example so it reads:

Then change the attention mask. Tokens after the gist position can't attend back to the raw instruction; they can only see the gist positions. If the model needs anything from the instruction to produce a correct output, that information has to squeeze through the gist on the way in. If it doesn't make it, it's gone.
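The mask change can be sketched for a decoder-only layout. The ordering [instruction | gist | rest] and the function name are assumptions made for this illustration; it shows only the masking rule described above, not the authors' training code.

```python
def gist_mask(n_instruction, n_gist, n_rest):
    """mask[q][k] is True when query position q may attend to key position k.

    Layout: [instruction | gist | rest]. Causal everywhere, and positions
    after the gist are additionally blocked from the raw instruction.
    """
    n = n_instruction + n_gist + n_rest
    gist_end = n_instruction + n_gist
    mask = [[k <= q for k in range(n)] for q in range(n)]  # causal base
    for q in range(gist_end, n):
        for k in range(n_instruction):
            mask[q][k] = False  # cut the path back to the instruction
    return mask

mask = gist_mask(n_instruction=3, n_gist=1, n_rest=2)
for row in mask:
    print("".join("x" if ok else "." for ok in row))
# x.....
# xx....
# xxx...
# xxxx..
# ...xx.
# ...xxx
```

The gist row (fourth line) still sees the whole instruction; the rows after it see only the gist and themselves, so anything they need has to pass through that one position.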

What's being learned isn't the new vocabulary row. A fresh <G1> embedding starts about as informative as any other special-token row, which is to say, basically random. The useful signal lives in the hidden state at that position after the model has read the instruction and updated its internal representation. The gist token is just a slot for that state.

The serving-time payoff only lands if the long prefix gets reused. Run the instruction once, keep just the keys and values at the gist positions, throw away the rest of the prefix's KV cache, and reuse that tiny stored cache as the context for every subsequent user input. A 500-token instruction stops being 500 positions of prefix cache per call and becomes a handful of learned positions per call. Compression is never perfect, though. There's no guarantee the gist captures every clause you cared about, and some details get squeezed out.

The authors report prompt compression of up to 26× with small task-quality losses, and a measured wall-clock speedup of about 4.2%. The gap between those two numbers follows from where the time goes. The instruction prefix is a few hundred prefill positions processed in parallel, while the model still runs every layer and still decodes the answer one serial token at a time, so the prefix was never more than a few percent of the total work. Compressing it 26× removes about 96% of a few percent, which is still a few percent.
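The arithmetic is worth doing once by hand. This sketch assumes prefix cost scales linearly with prefix length and that nothing else in the request changes. The prefix shares below are made-up inputs for illustration, not measurements from the paper.

```python
def wall_clock_saving(prefix_share: float, compression: float) -> float:
    """Fraction of total time saved if only the prefix work shrinks.

    prefix_share: fraction of wall-clock time spent on the instruction prefix
                  (an assumed number, not taken from the paper).
    compression:  factor by which the prefix is shortened.
    """
    remaining_prefix = prefix_share / compression
    return prefix_share - remaining_prefix


for share in (0.02, 0.05, 0.10):
    saved = wall_clock_saving(share, compression=26.0)
    print(f"prefix share {share:.0%} -> time saved {saved:.2%}")
```

However large the compression factor gets, the saving can never exceed the prefix's original share of the work.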

Calling a gist token a token feels like a stretch at first. <G1> doesn't decode to "translate this to French." It doesn't decode to any phrase at all; the useful version is an activation pattern, not a string. But it answers the same question every mechanism in this part answers: which sequence positions does the model spend compute on, and how does that decision get made? BPE compresses before the model runs. Byte hierarchies push the boundary into the architecture. Gist tokens compress inside the model, after the prompt has already passed through the forward pass once. The definition from Part 1 stretches but doesn't break: the token here is a unit of attention and cache, not a unit of text, and the string side is just how it gets serialized.

Even the failure shape carries over. If a BPE merge hides the letters the model later needs to spell a word, the spelling gets weird. If a gist bottleneck drops the clause you cared about (a tool name, a formatting rule, a safety constraint), the model answers using its compressed memory of an instruction it can no longer see. Wherever the compression happens, if the compressed version drops the bit that turned out to matter, the model is answering a slightly different question than the one you asked. It's the Part 3 bug one layer deeper.

5.4 The Frontier, Narrowly

On the text side, researchers have spent years trying to remove the tokenizer. On the multimodal side, they've spent the same years intentionally building stronger ones. The two camps aren't having a coordinated meeting about this.

Three threads on the text side seem worth following. "Tokenizer-free" architectures usually aren't, exactly; the boundary rule is still in there, moved out of merges.txt and into the network. A fixed vocabulary doesn't pin segmentation as tightly as the contract claims, and that wobble turns inference-time tokenization into a knob almost nobody has been using. And the swap-is-retraining math from Part 4 has loosened: tokenizer transplants are working well enough now that a full retrain isn't the obvious default anymore.

I don't want to oversell this. "Tokens are over" is too cute to be true, and "BPE won forever" is too lazy to be useful. The narrower thing that does seem to be happening is that the boundary between text and computation is shifting from a preprocessing default toward a model-design choice.

5.4.1 Hidden Tokenizers

A "tokenizer-free" paper's title promises no vocabulary, no merge table, none of the bugs from Part 3. Then the architecture section describes a module, somewhere ahead of the expensive layers, deciding which bytes get to share a position. The tokenizer didn't leave. It moved into a .py file.

Charformer (Tay et al., 2022) is the cleanest example. No BPE merges ship with the model. What does ship is a module called GBST. At every position, GBST enumerates short overlapping byte spans, scores each one with a small learned function, mixes the candidates into a soft pooled representation, and downsamples the sequence before the main transformer ever runs. The mechanism is soft and trainable, which really is different from a frozen merge table. It's also a decision, made before the expensive layers, that some bytes should travel together.

SpaceByte (Slagle, 2024) is even blunter. The surface alphabet is bytes, and the architecture spends extra heavyweight transformer blocks at the positions immediately following boundary bytes like spaces and punctuation, which is the old whitespace rule expressed as a compute schedule.
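To make "whitespace rule as a compute schedule" concrete, here is a deliberately simplified sketch. The boundary set and the function name are my assumptions, and the paper's actual criterion for where the heavy blocks go is more careful than this.

```python
def heavy_block_positions(data: bytes) -> list[int]:
    """Indices that would get the expensive blocks under a simplified rule:
    any position immediately after a boundary byte (ASCII space or punctuation).

    This is an illustrative simplification, not the paper's exact criterion.
    """
    boundary = set(b" \t\n.,;:!?()[]{}\"'")
    return [i for i in range(1, len(data)) if data[i - 1] in boundary]


text = "the cat sat, then left."
positions = heavy_block_positions(text.encode("utf-8"))
# The text is ASCII, so byte indices and character indices line up here.
print([text[i] for i in positions])
```

The selected positions are mostly word starts. The segmentation rule is still there; it just decides where compute is spent instead of where a token ends.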

BLT, from §5.2.2, makes the point about as cleanly as anything has: no vocabulary row corresponds to tokenization, the expensive latent transformer only ever sees patch vectors, and there's no merges.txt anywhere in the artifact. It's still very much a segmentation policy.

This isn't a gotcha against the field. Moving the segmentation rule inside the model is a real and useful change, because it removes the fixed vocabulary budget and lets the training loss have some influence on the grouping. A typo, a Devanagari word, or a long identifier no longer has to inherit one corpus's merge table forever. But the compression decision is still being made. Some byte-level distinctions get to be local detail; others get their own position in the expensive part of the network. Call the chooser a patcher, an entropy-driven boundary detector, a delete gate, or GBST. The question from Part 1 hasn't moved: what counts as one thing?

5.4.2 Segmentation as a Knob

The encoder turned cat into one token. You could have sent c and at instead, and the model would have accepted it.

That's not a bug, and it's not a feature anyone advertises. The model checkpoint is a function on ID sequences. If your vocabulary contains cat, c, and at, the canonical BPE encoder almost certainly emits cat as a single token when it sees that string. The checkpoint underneath doesn't enforce that choice. Skip the canonical encoder (which you can, if you're calling the model directly rather than through somebody else's prompt layer) and the model runs the forward pass on whatever non-canonical IDs you hand it. The user-visible prompt didn't change. The attention graph the model sees did.
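A small sketch of what "non-canonical" means, using a hypothetical toy vocabulary rather than any real tokenizer's table. It enumerates every way to cut a string into pieces the vocabulary contains; all of them decode to the same text.

```python
def segmentations(text: str, vocab: set[str]) -> list[list[str]]:
    """All ways to split `text` into pieces that are present in `vocab`."""
    if not text:
        return [[]]
    results = []
    for end in range(1, len(text) + 1):
        piece = text[:end]
        if piece in vocab:
            for rest in segmentations(text[end:], vocab):
                results.append([piece] + rest)
    return results


toy_vocab = {"cat", "c", "at", "a", "t", "ca"}  # hypothetical vocabulary
for pieces in segmentations("cat", toy_vocab):
    print(pieces, "->", "".join(pieces))
```

A canonical encoder picks one of these. The checkpoint will run on any of them. The enumeration is exponential in the worst case, so this is only for short strings.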

A loophole sitting in plain view, and I don't think it gets talked about enough.

You can frame what this enables as an adversarial attack, and people have. The more interesting frame is what happens when nobody is attacking anything. In Broken Tokens, Zheng et al. tested instruction-tuned models under random valid non-canonical segmentations and the models mostly shrugged. Qwen-2.5-7B-Instruct retained 93.4% of its original benchmark performance under random valid segmentations and 90.8% under character-level segmentation across their evaluation suite. Honestly, I don't have a good story for why the robustness is this high. Every embedding the model sees under a scrambled segmentation is a row it trained on far less often, in contexts that mostly never occurred, and the clean explanation (subword regularization-style exposure during pretraining) doesn't obviously apply to models whose tokenizers are deterministic. It works much better than I'd have guessed, and I haven't seen a satisfying account of why.

The other half of the same paper is the actionable half. Character-level segmentation improved Llama-3.1-8B-Instruct's code description accuracy by 14.3 points. Right-aligned three-digit grouping pushed ten-digit arithmetic on the same model from 36.5% to 70.2%. No new training. No new vocabulary. Same decoded string. A different choice of atoms going in.
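Right-aligned grouping is simple enough to sketch in a few lines. This shows only the grouping rule; how each chunk maps to token IDs depends on the vocabulary, and the function name is mine, not the paper's.

```python
def group_digits_right_aligned(digits: str, size: int = 3) -> list[str]:
    """Split a digit string into chunks of `size`, aligned to the right edge,
    so chunk boundaries fall on the ones/thousands/millions places."""
    if not (digits.isascii() and digits.isdigit()):
        raise ValueError("expected a non-empty string of ASCII decimal digits")
    head = len(digits) % size
    chunks = [digits[:head]] if head else []
    chunks += [digits[i:i + size] for i in range(head, len(digits), size)]
    return chunks


print(group_digits_right_aligned("1234567890"))
print(group_digits_right_aligned("123456789"))
```

The point of aligning from the right is that the last chunk always holds the ones, tens, and hundreds, regardless of how long the number is, so carries line up across operands.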

I wouldn't ship this as a production API knob without thinking it through. Random segmentations waste context, confuse prefix caches, and push the model into regions of input space the serving stack hasn't been tuned for. As a research signal it's hard to dismiss. If you can shift a model's arithmetic accuracy by 34 points without retraining, just by changing how digits get grouped before they enter, then the tokenizer boundary is an inference-time compute schedule that almost everyone has been treating as fixed.

The old contract was simple: train a vocabulary, freeze it, run every request through the canonical encoder. This work pokes a small hole in it. Picture a future model that picks segmentation per task: character-ish atoms for spelling, digit-grouped atoms for arithmetic, compressed subwords for prose. We don't have it yet. But the fixed-tokenizer world is starting to admit that the boundary it kept calling frozen is actually a knob.

5.4.3 The Teachable Transplant

Part 4's verdict on tokenizer swaps was bleak, and it's still mostly right. The "mostly" is recent: the surgery has gotten precise enough that you don't have to throw the patient away.

WECHSEL (Minixhofer et al., 2022) is the useful, low-tech version of the idea. Pretrained RoBERTa or GPT-2-style model. Replace the tokenizer. Keep the transformer body untouched. For each row in the new embedding table, initialize it as a weighted combination of semantically nearby rows from the old embedding table, where "nearby" is measured in a shared multilingual word-embedding space like fastText. Then continue training. The motivation is simple: a new German or Swahili subword shouldn't start from random noise if its meaning overlaps with subwords the old model already learned.
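A toy sketch of the initialization idea, under loud assumptions: the vectors are made up, the auxiliary space is two-dimensional, and the top-k and temperature values are placeholders I chose. It shows the shape of "weighted combination of nearby old rows", not the paper's exact procedure.

```python
import math


def cosine(u: list[float], v: list[float]) -> float:
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm


def init_new_row(new_aux, old_aux, old_embed, k=2, temperature=0.1):
    """Initialize one new-token embedding as a softmax-weighted average of the
    old model's rows for the k most similar old tokens.

    new_aux:   vector for the new token in the shared auxiliary space
    old_aux:   {old_token: vector in the same auxiliary space}
    old_embed: {old_token: row from the old model's embedding table}
    """
    sims = {tok: cosine(new_aux, vec) for tok, vec in old_aux.items()}
    nearest = sorted(sims, key=sims.get, reverse=True)[:k]
    weights = [math.exp(sims[tok] / temperature) for tok in nearest]
    total = sum(weights)
    dim = len(next(iter(old_embed.values())))
    return [
        sum(w / total * old_embed[tok][d] for w, tok in zip(weights, nearest))
        for d in range(dim)
    ]


# Made-up vectors: the new token sits close to "house" and "home".
old_aux = {"house": [1.0, 0.0], "home": [0.9, 0.1], "river": [0.0, 1.0]}
old_embed = {"house": [0.2, 0.4, 0.6], "home": [0.3, 0.3, 0.7], "river": [-0.5, 0.9, 0.0]}
new_row = init_new_row([0.95, 0.05], old_aux, old_embed)
print(new_row)
```

The new row lands between the rows of its two nearest neighbors and ignores the unrelated one, which is all "shouldn't start from random noise" needs to mean at initialization time.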

The result is a head start, nothing more. WECHSEL reports models competitive with comparably sized ones trained from scratch in the target language, with up to 64× less training effort. The reading I'd give it: the trained transformer body wasn't destroyed when the string pieces under it changed. Most of its knowledge sat above the embedding lookup, and that knowledge survived as long as the new front end could be coaxed into producing inputs the body was prepared to receive.

Model-Aware Tokenizer Transfer (MATT, 2025) goes after the part WECHSEL more or less ignores. Embedding similarity tells you where rows should start. It doesn't tell you whether token A in the new tokenizer should attend to token B the same way some pair of old tokens did. MATT freezes the old model as a teacher and trains the student (now with the new tokenizer) to match attention-weighted interactions over shared character spans. When two tokenizers cut the same string in different places, the characters underneath both segmentations are the shared ground truth.

One of their headline experiments: transferring Gemma 3 12B to a Ukrainian-heavy tokenizer improved compression on Ukrainian text from 2.98 to 4.44 characters per token. The unchanged original model averaged a benchmark score of 56.91 across the paper's evaluation suite. MATT brought the new-tokenizer student to 54.41. The best heuristic baseline, without the attention distillation step, reached 35.99.

That's a specific kind of progress, and it supports a specific reading. It doesn't make tokenizer swaps cheap. It says the damage from a swap has structure. Rows can be initialized from prior meaning, internal communication patterns can be distilled across mismatched IDs, and the transformer body really can be reused, because the computations it learned weren't living in the vocabulary in the first place. A model that overcharges Ukrainian, fragments code identifiers, or hides digit positions behind bad chunks may not require a full cold restart anymore. Tokenization is starting to look less like an untouchable mistake locked in before pretraining and more like a part of the model you can measure, complain about, and occasionally repair after the fact.

The "occasionally" matters, though. These papers all still train. They need adaptation data on the target distribution, careful initialization, and evaluations to confirm the surgery worked. And a transplanted tokenizer can fix one of Part 4's budgets while breaking another: compression, spelling access, output segmentation, KV cache length, chat templates, and rare-token training all move together when the vocabulary changes. The realistic summary isn't "the future is tokenizer-free." It's that the boundary is less frozen than it used to be: an architectural decision before training, an interface migration after, and never a JSON edit.

5.5 Choose, Train, Ship

Most teams shouldn't train a tokenizer at all. If you're building on top of someone else's pretrained checkpoint, use the tokenizer that checkpoint shipped with. Swapping the encoding under a model that already learned the rest of its layers around it is a reliable way to ship something worse, and the failure modes won't obviously trace back to the tokenizer when the bug reports arrive.

If you actually do need one, you usually know. You're training a model that has to read legal briefs, Python repositories, Spanish customer chats, and Arabic support tickets. The project plan hits the line that says "and now we need a tokenizer," and the conversation skips straight to defaults. Set vocab_size=100000. Pick BPE or Unigram. Point the trainer at whatever crawl is convenient. Wait for tokenizer.json to land in the output directory.

What just got skipped is every requirement the model has to live up to. Should 123456789 split into nine digits, into 123,456,789, or into whatever shape BPE's merge counts produce? Are tabs and four-space indent chunks visible enough that Python code doesn't spend a third of its context budget on whitespace? Can ordinary user text ever become <unk>? Is <|tool_call|> a reserved protocol token, or just a string a user might plausibly drop into a blog post about LLMs? How much more expensive is it acceptable for an Arabic support ticket to be than its English translation, before you'd say the tokenizer is broken?

The trainer doesn't ask any of these, and it couldn't answer them anyway. BPE optimizes for pair counts. Unigram optimizes for the likelihood of its training corpus under a piece model. Whatever those statistics happen to produce becomes the contract every layer above the tokenizer is stuck with.

The missing piece in that workflow isn't the algorithm choice or the vocab size. It's the spec: a written list of things the tokenizer is not allowed to get wrong, drawn up before any compute is spent on the model that has to live inside whatever the tokenizer ships. So here's the order I'd actually recommend. Write the spec. Choose a corpus. Train a few candidates against it. Evaluate them on the slices that will hurt. Ship the whole artifact, tests and all, not just the vocab table.

5.5.1 Write the Spec

The temptation with a file called tokenizer-spec.md is to fill it with prose: rationale, considerations, paragraphs about what could go wrong. Skip that. The spec has to fit on one screen, and every line has to be a check a script can run, not a paragraph someone has to interpret. A near-minimum version:

Each of those six lines names a thing the model is going to inherit if the tokenizer gets it wrong.

The line I'd fight hardest for, by a wide margin, is the first one. If your decoder silently strips a trailing newline, or normalizes Unicode on the way out, what you have is technically a tokenizer, but it's a tokenizer plus a data bug with better manners. If normalization is what you actually want, fine. Name it in the spec and test for it explicitly. What you can't afford is a decoder that disagrees with the encoder without anyone having decided that, because every subsequent bug ends up dressed in that disagreement.

The special-token line exists because the literal user-typed string "<|tool_call|>" and the reserved <|tool_call|> control token the model emits to start a tool call aren't the same input, and they shouldn't produce the same token sequence by the time they reach the model. If they do, every layer above the tokenizer is doing a job the tokenizer was supposed to do: deciding whether this is prose or protocol.

Digits and whitespace are on the list for parallel reasons. When 123456789 ends up split by whichever pair-count accident won that day, the model is being asked to learn arithmetic on top of arbitrary chunking. When tabs get eaten, leading spaces get trimmed, or repeated newlines collapse, source code arrives at the embedding lookup already damaged, and the rest of the network spends parameters undoing the damage instead of doing anything you wanted from it.

None of these checks need a transformer to run. They're tokenizer tests, the kind that complete on a laptop in seconds, long before pretraining starts or any serving stack gets in the way. A tokenizer that can't pass its own spec is a broken interface, not a weaker one the model is supposed to compensate for.
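To show how small these tests can be, here is a sketch that runs a handful of checks against a toy byte-level tokenizer. Everything in it is hypothetical: the toy encoder, the reserved ID 256, and the particular checks, which illustrate the kind of line a spec might contain rather than reproduce the list above.

```python
SPECIALS = {"<|tool_call|>": 256}  # hypothetical reserved token and ID


def encode(text: str, allow_special: bool = False) -> list[int]:
    """Toy byte-level encoder: one ID per UTF-8 byte, specials only on request."""
    if allow_special:
        for name, token_id in SPECIALS.items():
            if name in text:
                before, after = text.split(name, 1)
                return encode(before, True) + [token_id] + encode(after, True)
    return list(text.encode("utf-8"))


def decode(ids: list[int]) -> str:
    names = {token_id: name for name, token_id in SPECIALS.items()}
    out = bytearray()
    for i in ids:
        out += names[i].encode("utf-8") if i in names else bytes([i])
    return out.decode("utf-8")


def check_spec() -> dict[str, bool]:
    samples = [
        "def f():\n\treturn 1\n\n",  # tabs, trailing blank line
        "  leading spaces",
        "123456789",
        "مرحبا",
        "naïve café\n",
    ]
    return {
        "round_trip": all(decode(encode(s)) == s for s in samples),
        "user_text_never_reserved": 256 not in encode("<|tool_call|>"),
        "reserved_is_one_id": encode("<|tool_call|>", allow_special=True) == [256],
        "digits_one_per_token": len(encode("123456789")) == 9,
        "whitespace_preserved": encode("\t\t") == [9, 9],
    }


results = check_spec()
assert all(results.values()), results
print(results)
```

Swap the toy `encode` and `decode` for the real candidate and the same checks become a gate that runs before any pretraining budget is spent.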

If you don't write the spec, the trained model ends up as the only record of what the tokenizer does, and a 70B-parameter network is a uniquely expensive place to keep a bug.

5.5.2 Corpus, Algorithm, Size

At this point, everyone wants to argue about the algorithm. BPE versus Unigram. vocab_size=100000. The whiteboard fills up quickly, and none of it is the first decision. The first decision is the corpus: the text you point the trainer at decides, before anything else, which byte strings are cheap.

The mental model I find useful here is a ballot. Every repeated byte sequence in the training text gets a vote for a vocabulary slot. English morphological suffixes win in a hurry when the corpus is mostly English web crawl. Markdown fences and import statements win when there's enough technical documentation. Arabic diacritics, emoji clusters, JSON keys, four-space Python indents, legal boilerplate, Spanish customer-support phrasing: they get slots only if they show up to vote at all. The trainer isn't going to read your product plan and figure out that you also care about Arabic. A mostly-English corpus produces a mostly-English tokenizer, no matter how clever the algorithm sitting on top.
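The ballot is easy to see in miniature. This sketch counts adjacent byte pairs, the statistic a BPE trainer picks its first merge from, on two tiny made-up corpora. It ignores pre-tokenization, which a real trainer would apply before counting, so treat it as an illustration of the voting, not of any particular library.

```python
from collections import Counter


def top_pairs(corpus: list[str], n: int = 3) -> list[tuple[bytes, int]]:
    """Count adjacent byte pairs across all documents and return the top n."""
    counts = Counter()
    for doc in corpus:
        data = doc.encode("utf-8")
        for i in range(len(data) - 1):
            counts[data[i:i + 2]] += 1
    return counts.most_common(n)


# Two made-up corpora: the same counting rule, very different winners.
prose = ["the weather is the same as the other day"]
code = ["    x = 1\n    y = 2\n"]
print("prose:", top_pairs(prose))
print("code: ", top_pairs(code))
```

Nothing about the counting rule changed between the two runs. Only the text that showed up to vote did.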

So the corpus question is just: what is the model actually going to be reading in production? If the answer involves code, the trainer has to see real code, not blog posts about code. If it involves Arabic support tickets, sample those deliberately; a general web crawl isn't going to reach them on its own. If real prompts arrive shaped like JSON tool calls, those shapes belong in the training corpus too. The vocabulary is a fixed budget, and the corpus mix decides how it gets spent, long before the algorithm has any say.

Only after the corpus is settled does the algorithm decision actually matter, and the safe default is byte-level BPE. Robust merges, inspectable artifacts, no <unk> problem on valid UTF-8, well-understood failure modes after a decade in production. Unigram is worth a real look when morphology is doing real work (Turkish or Finnish, say, where words are built by stacking suffixes) and you can afford the slightly heavier training machinery. WordPiece is mostly something you inherit from a BERT-shaped encoder stack; I wouldn't start there for a new raw-text generator. And the right way to decide isn't to argue about it on a whiteboard. Train two or three candidates on the same corpus and look at the artifacts.

Vocabulary size comes last, which is the opposite of how people usually do it. A larger vocabulary does shorten sequences, which helps almost everything downstream. It also adds rows to the embedding table and to the LM head, paid for in parameters. The trade-off, stated plainly: sometimes those extra rows buy real wins on multilingual cost or code compression, and sometimes they're vanity slots for rare strings whose embeddings barely get any gradient and end up as dead weight in the lookup table.
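The parameter side of that trade-off is plain multiplication. The sketch below assumes a hidden size of 4096, which is a placeholder, and counts one table when input and output embeddings are tied and two when they aren't.

```python
def vocab_parameter_cost(vocab_size: int, d_model: int, tied: bool = False) -> int:
    """Parameters spent on the embedding table plus the LM head."""
    tables = 1 if tied else 2
    return vocab_size * d_model * tables


for v in (32_000, 50_000, 100_000):
    print(v, f"{vocab_parameter_cost(v, d_model=4096):,}")
```

Whether that bill is large depends on the model: the same extra rows are a rounding error on a very large network and a meaningful slice of a small one.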

The practical loop:

If 32k → 50k buys a lot of tokens and 50k → 100k barely moves the curve, stop near 50k. If multilingual or code slices are still expensive at 100k and the model is large enough that extra embedding rows are cheap relative to the win, keep climbing. When the compression curve has no obvious knee, smaller usually wins on the margin: smaller tokenizers are easier to train around, easier to serve, and less likely to hide dead rows in the long tail.
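One way to turn that stopping rule into something a script can apply is sketched below. The token counts are invented and the 3% threshold is an arbitrary placeholder; in practice you'd run this per slice, since the knee for English and the knee for code rarely sit in the same place.

```python
def pick_vocab_size(tokens_by_vocab: dict[int, int], min_gain: float = 0.03) -> int:
    """Walk vocab sizes in increasing order and stop before the first step
    whose relative token reduction is smaller than `min_gain`."""
    sizes = sorted(tokens_by_vocab)
    chosen = sizes[0]
    for smaller, larger in zip(sizes, sizes[1:]):
        gain = 1 - tokens_by_vocab[larger] / tokens_by_vocab[smaller]
        if gain < min_gain:
            break
        chosen = larger
    return chosen


# Made-up token counts for one held-out slice, for illustration only.
measured = {32_000: 1_000_000, 50_000: 900_000, 100_000: 885_000}
print(pick_vocab_size(measured))
```

With these numbers the first step buys 10% and the second buys under 2%, so the function stops at the middle size.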

The order matters more than any single choice: corpus, then algorithm, then size. Reverse it and the work starts feeling rigorous in a comforting way, while the actual decision is being made by whichever text dump happened to be sitting on disk.

5.5.3 Evaluate Where It Embarrasses You

The aggregate numbers on a tokenizer eval almost always look fine, and that's exactly what makes them dangerous. Average tokens-per-byte on a catch-all held-out set is an English-dominated mean, and a mean that size can absorb a lot of damage without moving: Thai exploding, long account numbers getting cut into nonsense, code identifiers degrading. The report is green and the product is broken.

Split the report by domain instead. One row per slice, and expect some rows to lose:

Per slice, start with the plain quantitative measurements: tokens per byte, tokens per word where words are well defined, decode(encode(x)) == x on the whole slice, and a diff against your previous tokenizer if there is one. Then start printing examples. A column of the worst-fragmented strings is usually more useful than another column of aggregate ratios. If customerAccountId lands as customer | Account | Id under one tokenizer and as eight unrelated pieces under another, that has to show up in the report. If 123456789 gets a clean three-by-three grouping in one candidate and a merge-table accident in another, the eval can't let you skim past that either.
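A sketch of the per-slice report, run against a toy tokenizer so it stays self-contained. The toy vocabulary, the slice names, and the sample strings are all hypothetical; the part worth keeping is the shape of `slice_report`, which takes any `encode`/`decode` pair.

```python
WORDS = ["the ", "customer", "Account", "Id", " is "]  # hypothetical learned pieces


def toy_encode(text: str) -> list[int]:
    """Known pieces become one ID each; everything else falls back to bytes."""
    ids, i = [], 0
    while i < len(text):
        for index, word in enumerate(WORDS):
            if text.startswith(word, i):
                ids.append(256 + index)
                i += len(word)
                break
        else:
            ids.extend(text[i].encode("utf-8"))
            i += 1
    return ids


def toy_decode(ids: list[int]) -> str:
    out = bytearray()
    for t in ids:
        out += WORDS[t - 256].encode("utf-8") if t >= 256 else bytes([t])
    return out.decode("utf-8")


def slice_report(encode, decode, slices, worst_n=1):
    rows = []
    for name, texts in slices.items():
        stats = []  # (tokens_per_byte, text) for every string in the slice
        total_tokens = total_bytes = 0
        round_trip = True
        for text in texts:
            ids = encode(text)
            n_bytes = len(text.encode("utf-8"))
            total_tokens += len(ids)
            total_bytes += n_bytes
            round_trip = round_trip and decode(ids) == text
            stats.append((len(ids) / max(n_bytes, 1), text))
        stats.sort(reverse=True)
        rows.append({
            "slice": name,
            "tokens_per_byte": total_tokens / max(total_bytes, 1),
            "round_trip": round_trip,
            "worst": [text for _, text in stats[:worst_n]],
        })
    return rows


slices = {
    "english": ["the customer is here"],
    "identifiers": ["customerAccountId"],
    "thai": ["สวัสดี"],
}
rows = slice_report(toy_encode, toy_decode, slices)
for row in rows:
    print(row)
```

Averaged together, these three slices would look tolerable. Printed one per row, the Thai line pays one token per byte and there's nowhere for it to hide.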

This is where a lot of "better" tokenizers get caught lying. One candidate looks like a code win, but only because it memorized giant project-specific identifiers from its training corpus, identifiers that barely help next-token prediction and don't generalize. Another posts a great multilingual aggregate while destroying the one language your product actually serves. A third claims a chat-transcript win because the training-time chat template happens to be cheap and the serving-time template happens to be expensive; the savings evaporate the moment someone updates the wrapper. Compression is a useful filter for narrowing the candidate set, but by itself it isn't the verdict.

After the static report, run the small-model ablation. Same architecture, same data, same training budget, vary only the tokenizer, and train long enough that validation loss has time to move. The point isn't to produce a publishable model; it's to make the tokenizer show its real cost before it gets locked into a model that will inherit the choice. If two candidates compress the held-out corpus about the same but one produces noticeably worse validation loss, trust the loss. If a candidate wins on the global aggregate but loses badly on Arabic support tickets, don't paper over that with the global number. The embarrassing slice is the whole reason you ran the experiment.

5.5.4 Ship the Artifact

When the inference team asks for the tokenizer, the natural thing to send is tokenizer.json. Strings on the left, integers on the right, all of it fitting on a screen. It feels like the deliverable, and it isn't. The vocabulary is one file inside a larger artifact, and the rest of that artifact decides whether the model in production sees what the model in training saw.

The real artifact is the entire path from bytes to IDs and back: the normalizer (or the deliberate decision not to normalize), the pre-tokenizer that decides which byte sequences are even allowed to be candidates for a merge, the merge ranks or Unigram scores themselves, the byte-fallback policy, the decoder, the post-processor, the special-token map, and the chat template that wraps user and assistant turns.

Two systems can use identical vocabulary rows and still hand the model different sequences for the same input. One library adds a BOS token by default; another doesn't. A serving stack cleans up whitespace during decode in a way the training-time decoder never did. Chat rendering is its own failure mode: add an assistant prefix the model never saw during training, or omit the one it did see, and you're prompting the model with something subtly different from the format it was taught to respond to. By the time those mismatches surface, the bug doesn't look like preprocessing anymore. It looks like a model regression, and what's actually happening is that the model is being asked to answer in an input format different from the one it was trained on.

The version of the artifact worth shipping looks something like this:

The fixtures are the actual contract. They should be strings that have caused trouble in earlier parts of this post: a Python block with leading tabs and embedded blank lines; the literal integer 123456789; an Arabic support sentence; an emoji sequence with skin-tone modifiers; user text that contains the literal characters <|tool_call|>; the reserved <|tool_call|> protocol token; a complete chat transcript with system, user, and assistant turns. For each fixture, assert the exact token IDs and the exact decoded bytes. Then run those assertions in every implementation that touches production traffic: the official tokenizer, the client SDK, the inference server, and whatever third-party adapter someone added because it was easier than fighting the docs.
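A sketch of what running those assertions across implementations might look like. The three "implementations" here are toys I made up to reproduce two of the mismatches described earlier (a default BOS, a decoder that tidies whitespace); the fixture IDs are hard-coded golden values, which is how they'd be stored in a real artifact too.

```python
BOS = 256  # hypothetical BOS id


def ref_encode(text: str) -> list[int]:
    return list(text.encode("utf-8"))


def ref_decode(ids: list[int]) -> str:
    return bytes(i for i in ids if i < 256).decode("utf-8")


def sdk_encode(text: str) -> list[int]:
    return [BOS] + ref_encode(text)  # this toy SDK adds BOS by default


def server_decode(ids: list[int]) -> str:
    return ref_decode(ids).rstrip()  # this toy server "cleans up" whitespace


def run_fixtures(implementations: dict, fixtures: list[dict]) -> list[str]:
    """Return one failure message per (implementation, fixture) mismatch."""
    failures = []
    for impl_name, (encode, decode) in implementations.items():
        for fx in fixtures:
            if encode(fx["text"]) != fx["ids"]:
                failures.append(f"{impl_name}: ids differ for {fx['name']}")
            if decode(fx["ids"]).encode("utf-8") != fx["bytes"]:
                failures.append(f"{impl_name}: decoded bytes differ for {fx['name']}")
    return failures


fixtures = [
    {"name": "trailing_newline", "text": "a\n", "ids": [97, 10], "bytes": b"a\n"},
    {"name": "digits", "text": "123", "ids": [49, 50, 51], "bytes": b"123"},
]
implementations = {
    "reference": (ref_encode, ref_decode),
    "client_sdk": (sdk_encode, ref_decode),
    "inference_server": (ref_encode, server_decode),
}
failures = run_fixtures(implementations, fixtures)
for line in failures:
    print(line)
```

The reference passes. The SDK fails every ID check because of the extra BOS, and the server fails the one fixture whose trailing newline it quietly removed. Neither failure would show up in a diff of the vocabulary file.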

Special IDs deserve their own paranoia budget. When fine-tuning introduces a new control token, that token's embedding row and LM-head row inherit nothing from pretraining; the surgery from §2.6.3 has to be done and trained. When a serving stack changes the chat template after SFT, the prompt the model sees in production is no longer the prompt it was taught to respond to, and behavior drifts in ways that look like model regressions but are template disagreements.

"We shipped the vocabulary" is, in my experience, never a complete deployment plan. Ship the whole artifact. Version it the way you'd version any other interface. Test it byte-for-byte across machines and language bindings. Make the tokenizer boring enough that the next time the model behaves strangely in production, "did two systems disagree about where the text begins?" isn't on the shortlist of things you have to check.

Strawberry, Again

Back to the screen. You asked how many rs were in strawberry. The model said two, and now you know why.

The letters were never there. Somewhere upstream, a 1994 compression loop decided that str, aw, and berry deserved their own rows of the embedding table, and the model has been training around that decision ever since. By the time attention starts moving, the text you typed is gone. What's left is the tokenizer's version.

This is the habit I'd most like to retire: calling tokenization "preprocessing." It isn't preprocessing. It's the first layer of the model. The transformer never sees raw text; it sees what the tokenizer chose to hand it, and it spends every parameter of capacity learning to think inside that choice. The algorithms that produced the choice weren't optimized for anything downstream. They optimized pair counts, piece-likelihood scores, pruned EM objectives. Nobody asked them the question that matters once training begins: given this prefix, what's the right next token? When the mismatch is deliberate, fine. When it's accidental, it's the kind of slow disaster that surfaces eight months later as a benchmark nobody can explain.

A well-chosen tokenizer gives the model atoms it can use. Common whole words. Reusable suffixes. Visible whitespace where indentation matters. Digit groupings that don't fight arithmetic. Protocol tokens with stable IDs the model saw during training, whose meaning doesn't depend on the spelling on the page. A sloppy tokenizer forces the model to spend parameters undoing what got cut upstream: reassembling 123456789 after the encoder chopped it into something arbitrary, recovering the letters of .DefaultCellStyle long enough to write it backwards, learning that assistant is a chat role in one part of the input and ordinary prose in another. Models are forgiving enough to learn these repairs, which is part of why LLMs feel as robust as they do. It's also why every parameter spent on cleanup is one that didn't go toward whatever you actually wanted the model to learn.

Don't ask whether a tokenizer is good. Ask what it promises. That user text round-trips byte-for-byte. That digits and whitespace stay legible. That none of your target languages are paying an impossible bill. That protocol tokens can't be confused with prose. That the chat template in the artifact is the one the model actually trained on. Make those promises part of the release: version them, test them, count production prompts with the real encoder, run an ablation when you change one. If you ever swap the tokenizer, assume you've changed the model's interface until the evals say otherwise.

A tokenizer is a compression scheme the model is forced to learn around, and the compression can be good or bad in specific, expensive ways. It's never invisible. Pick it on purpose. The model you ship is going to be the one that learned to think through that particular front door, whether you chose the door carefully or not.

References and Further Reading

This is the reading list I'd keep open if I had to rebuild the post from scratch. It's not a neutral bibliography. Each group answers a different version of the same question: where did the atoms come from, and what breaks when they're wrong?

Algorithms and Artifacts

Philip Gage, "A New Algorithm for Data Compression" (1994). The original BPE compression loop: count adjacent byte pairs, replace the most common pair, repeat.
Sennrich, Haddow, and Birch, "Neural Machine Translation of Rare Words with Subword Units" (2016). The paper that moved BPE from file compression into NLP.
Kudo, "Subword Regularization: Improving Neural Network Translation Models with Multiple Subword Candidates" (2018). The Unigram tokenizer paper, and the cleanest entry point for probabilistic segmentations.
Kudo and Richardson, "SentencePiece" (2018). Raw-text tokenization, the ▁ metaspace, BPE versus Unigram inside the same library, and the reason SentencePiece became the open-model default for years.
Radford et al., "Language Models are Unsupervised Multitask Learners" (2019). The GPT-2 paper. The input-representation section is the seed of the byte-level BPE stack used by later GPT-style tokenizers.
OpenAI tiktoken and Karpathy's minBPE. Read both. minBPE shows the mechanism; tiktoken shows the production artifact.

Leaks and Failure Modes

Bostrom and Durrett, "Byte Pair Encoding is Suboptimal for Language Model Pretraining" (2020). A controlled comparison of BPE and Unigram that makes the tokenizer choice feel empirical instead of aesthetic.
Dagan, Synnaeve, and Roziere, "Getting the Most Out of Your Tokenizer for Pre-training and Domain Adaptation" (2024). The best paper for the pre-tokenizer argument, especially the code-tokenization results.
Singh and Strouse, "Tokenization Counts: The Impact of Tokenization on Arithmetic in Frontier LLMs" (2024). The arithmetic companion to the number-splitting sections.
Rumbelow and Watkins, "SolidGoldMagikarp plus prompt generation" (2023). The anomalous-token case study.
Zheng et al., "Broken Tokens" (2025). Non-canonical segmentations as an inference-time knob: same decoded text, different atoms.

Multilingual Cost and Quality

Rust et al., "How Good is Your Tokenizer?" (2021). Tokenizer fertility and monolingual tokenizer swaps for multilingual models.
Petrov et al., "Language Model Tokenizers Introduce Unfairness Between Languages" (2023). The token-tax paper: same content, very different token counts across languages.
Ali et al., "Tokenizer Choice For LLM Training: Negligible or Crucial?" (2024). A broader controlled training study across tokenizer choices.
Haslett, "Tokenization Changes Meaning in LLMs: Evidence from Chinese" (2025). A useful reminder that tokenization can disturb representation, not just bills.

Specials, Templates, and Swaps

Devlin et al., "BERT" (2019). WordPiece in the encoder family, plus the special-token conventions that became part of the interface.
Bavarian et al., "Efficient Training of Language Models to Fill in the Middle" (2022). FIM tokens as a concrete example of specials becoming trained behavior.
Hugging Face chat-template docs. Good practical reference for why chat messages are tokenizer-side protocol rather than just JSON.
Mistral NeMo and Tekken plus the mistral-common tokenizer docs. A public example of a model family moving from SentencePiece-style tokenization to a tiktoken-style stack.
Minixhofer et al., "WECHSEL" (2022) and Model-Aware Tokenizer Transfer (2025). The two papers to read if "swap the tokenizer" sounds impossible but still tempting.
Anthropic's Opus 4.7 launch post, migration guide, and Simon Willison's token-count comparison. The public case study behind the Opus 4.7 section.

Beyond Fixed Tokenizers

Xue et al., "ByT5" (2022), Clark et al., "CANINE" (2022), and Tay et al., "Charformer" (2022). Three ways to push below fixed subword tokenizers.
Yu et al., "MEGABYTE" (2023), Pagnoni et al., "Byte Latent Transformer" (2024), and Wang et al., "MambaByte" (2024). Bytes become plausible once the architecture stops charging every byte the same way.
Mu, Li, and Goodman, "Learning to Compress Prompts with Gist Tokens" (2023). A different use of "token": a learned bottleneck for prompt state rather than a text piece.
Dosovitskiy et al., "An Image is Worth 16x16 Words" (2021), van den Oord et al., "Neural Discrete Representation Learning" (2017), and OpenAI's Sora technical report (2024). The vision side of the same boundary question: patches, codebooks, and latent spacetime chunks.