Vinit Vyas
May 9, 2026 · 36 min read · foundation

Tokenizers from First Principles

Tokenization looks like preprocessing and behaves like architecture. From bytes to BPE to the cracks at the frontier, this is an argument that almost everything weird about LLMs starts at the atom you chose.

#tokenization #bpe #byte-pair-encoding #wordpiece #unigram #sentencepiece #tiktoken #tekken #unicode #utf-8 #vocab-size #fundamentals

