LLM Tokenization Advances: BPE, SentencePiece, and the Quest for Better Tokenizers
A technical deep dive into how modern LLM tokenizers work, the tradeoffs between BPE and SentencePiece, and emerging approaches that improve multilingual and code performance.
Why Tokenization Matters More Than You Think
Tokenization is the first and arguably most consequential step in any LLM pipeline. It determines how text is split into the discrete units that the model processes. Poor tokenization wastes context window space, degrades multilingual performance, and creates unexpected failure modes. Yet it receives a fraction of the attention given to model architecture and training.
A tokenizer's vocabulary and merge rules directly affect cost (more tokens per text means more inference cost), latency (longer sequences take more time), and quality (splitting meaningful words into fragments hurts comprehension).
Byte-Pair Encoding: The Dominant Approach
BPE, originally a data-compression algorithm, is the foundation of most modern LLM tokenizers. The training process is straightforward:
- Start with a base vocabulary of individual bytes (256 entries)
- Count all adjacent token pairs in the training corpus
- Merge the most frequent pair into a single new token
- Repeat until the desired vocabulary size is reached
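The merge loop above can be sketched in a few lines of pure Python. This is a toy illustration of the training procedure, not a production tokenizer; real implementations add regex pre-tokenization and fast pair indexing:

```python
from collections import Counter

def train_bpe(corpus: list[bytes], num_merges: int) -> list[tuple[bytes, bytes]]:
    """Learn BPE merge rules: repeatedly fuse the most frequent adjacent pair."""
    # Start from individual bytes, as byte-level BPE does.
    seqs = [[bytes([b]) for b in text] for text in corpus]
    merges = []
    for _ in range(num_merges):
        # Count all adjacent token pairs across the corpus.
        pairs = Counter()
        for seq in seqs:
            for a, b in zip(seq, seq[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace every occurrence of the winning pair with the merged token.
        merged = best[0] + best[1]
        for i, seq in enumerate(seqs):
            out, j = [], 0
            while j < len(seq):
                if j + 1 < len(seq) and (seq[j], seq[j + 1]) == best:
                    out.append(merged)
                    j += 2
                else:
                    out.append(seq[j])
                    j += 1
            seqs[i] = out
    return merges

merges = train_bpe([b"low lower lowest"], num_merges=3)
```

Running this on the classic "low lower lowest" example first merges `l` + `o`, then `lo` + `w`, exactly as the frequency counts dictate.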
GPT-4's tokenizer (cl100k_base) uses BPE with a vocabulary of approximately 100,000 tokens. Claude's tokenizer uses a similar approach.
Strengths of BPE
- Handles any input: Byte-level BPE can encode any text without unknown token fallbacks
- Efficient for common patterns: Frequent words become single tokens, reducing sequence length
- Deterministic: The same text always produces the same tokens
Weaknesses of BPE
- English-centric vocabularies: Tokenizers trained primarily on English data create more tokens per word for other languages, effectively penalizing non-English users with higher costs and shorter effective context windows
- Whitespace sensitivity: "Hello" and " Hello" (with leading space) may tokenize differently, creating subtle bugs
- Code fragmentation: Variable names and syntax patterns from less-common programming languages are split into many small tokens
SentencePiece: Language-Agnostic Tokenization
SentencePiece, developed by Google, treats the input as a raw stream of Unicode characters, with whitespace handled as an ordinary symbol rather than a pre-tokenization boundary. This makes it truly language-agnostic — it does not assume spaces separate words, which is essential for languages like Chinese, Japanese, and Thai.
```python
import sentencepiece as spm

# Train a SentencePiece model on a plain-text corpus
spm.SentencePieceTrainer.train(
    input="training_data.txt",
    model_prefix="tokenizer",
    vocab_size=32000,
    model_type="bpe",  # or "unigram"
    character_coverage=0.9995,
)

# Load the trained model and encode text into subword pieces
sp = spm.SentencePieceProcessor(model_file="tokenizer.model")
tokens = sp.encode("This is a test.", out_type=str)
```
SentencePiece also supports the unigram model, which starts with a large vocabulary and prunes tokens with the least impact on the training data's likelihood. This approach can produce more linguistically motivated subword units than greedy BPE merges.
Emerging Approaches and Improvements
Tiktoken and Fast BPE
OpenAI's tiktoken library implements BPE encoding in Rust with Python bindings, achieving 3-6x speedups over comparable open-source tokenizers. This matters for applications that tokenize large volumes of text for cost estimation or chunking.
Multilingual Tokenizer Balancing
Newer models address the multilingual penalty through several strategies:
- Larger vocabularies: Moving from 32K to 100K+ tokens allows more non-English words to be represented as single tokens
- Balanced training corpora: Ensuring the tokenizer training data includes proportional representation of target languages
- Language-specific byte fallbacks: Using UTF-8 byte representations that align with the character boundaries of specific scripts
Byte Latent Transformer (BLT)
Meta's BLT architecture, published in late 2024, proposes eliminating fixed tokenization entirely. Instead, it dynamically groups bytes into variable-length patches based on the complexity of the input. Simple, predictable text gets grouped into large patches (processed efficiently), while complex or information-dense text gets fine-grained byte-level attention.
This approach could resolve the multilingual fairness problem because it adapts to the data rather than relying on a fixed vocabulary trained on a potentially imbalanced corpus.
Practical Implications
Token Counting and Cost
Different tokenizers produce dramatically different token counts for the same text:
| Text | GPT-4 (cl100k) | Llama 3 | Gemma |
|---|---|---|---|
| English paragraph (100 words) | ~130 tokens | ~125 tokens | ~128 tokens |
| Chinese paragraph (100 chars) | ~110 tokens | ~150 tokens | ~105 tokens |
| Python code (50 lines) | ~350 tokens | ~380 tokens | ~340 tokens |
These differences directly affect inference costs and effective context window utilization.
Chunking for RAG
When building retrieval-augmented generation systems, token-based chunking is more reliable than character-based chunking because it aligns with how the model processes text. Libraries like LangChain and LlamaIndex offer tokenizer-aware text splitters for this purpose.
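A minimal token-aware chunker can be written generically over any encode/decode pair. This is a sketch of the idea; production splitters like those in LangChain and LlamaIndex layer sentence-boundary handling on top of it:

```python
from typing import Callable

def chunk_by_tokens(
    text: str,
    encode: Callable[[str], list],
    decode: Callable[[list], str],
    max_tokens: int = 512,
    overlap: int = 64,
) -> list[str]:
    """Split text into chunks of at most max_tokens tokens.

    Consecutive chunks share `overlap` tokens of context, so `overlap`
    must be smaller than `max_tokens`.
    """
    ids = encode(text)
    chunks = []
    step = max_tokens - overlap
    for start in range(0, len(ids), step):
        window = ids[start:start + max_tokens]
        chunks.append(decode(window))
        if start + max_tokens >= len(ids):
            break
    return chunks
```

Any tokenizer works: pass `enc.encode` / `enc.decode` from tiktoken, or a SentencePiece processor's methods. Because the boundaries are measured in tokens, every chunk is guaranteed to fit the retrieval model's window, which character-based splitting cannot promise.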
Tokenization is infrastructure — invisible when it works well, painful when it does not. Understanding your tokenizer's behavior is essential for cost optimization, multilingual support, and reliable LLM application development.
Sources: SentencePiece GitHub | Tiktoken GitHub | BLT Paper - arXiv:2412.09871