LLM Tokenization Advances: BPE, SentencePiece, and the Quest for Better Tokenizers
A technical deep dive into how modern LLM tokenizers work, the tradeoffs between BPE and SentencePiece, and emerging approaches that improve multilingual and code performance.
Why Tokenization Matters More Than You Think
Tokenization is the first and arguably most consequential step in any LLM pipeline. It determines how text is split into the discrete units that the model processes. Poor tokenization wastes context window space, degrades multilingual performance, and creates unexpected failure modes. Yet it receives a fraction of the attention given to model architecture and training.
A tokenizer's vocabulary and merge rules directly affect cost (more tokens per text means more inference cost), latency (longer sequences take more time), and quality (splitting meaningful words into fragments hurts comprehension).
Byte-Pair Encoding: The Dominant Approach
BPE, originally a data-compression algorithm, is the foundation of most modern LLM tokenizers. The training process is straightforward:
- Start with a base vocabulary of individual bytes (256 entries)
- Count all adjacent token pairs in the training corpus
- Merge the most frequent pair into a single new token
- Repeat until the desired vocabulary size is reached
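The merge loop above can be sketched in a few lines of pure Python. This is a toy illustration of the training procedure, not a production tokenizer; real implementations add regex pre-tokenization and fast pair indexing:

```python
from collections import Counter

def train_bpe(corpus: list[bytes], num_merges: int) -> list[tuple[bytes, bytes]]:
    """Learn BPE merge rules: repeatedly fuse the most frequent adjacent pair."""
    # Start from individual bytes, as byte-level BPE does.
    seqs = [[bytes([b]) for b in text] for text in corpus]
    merges = []
    for _ in range(num_merges):
        # Count all adjacent token pairs across the corpus.
        pairs = Counter()
        for seq in seqs:
            for a, b in zip(seq, seq[1:]):
                pairs[(a, b)] += 1
        if not pairs:
            break
        best = max(pairs, key=pairs.get)
        merges.append(best)
        # Replace every occurrence of the winning pair with the merged token.
        merged = best[0] + best[1]
        for i, seq in enumerate(seqs):
            out, j = [], 0
            while j < len(seq):
                if j + 1 < len(seq) and (seq[j], seq[j + 1]) == best:
                    out.append(merged)
                    j += 2
                else:
                    out.append(seq[j])
                    j += 1
            seqs[i] = out
    return merges

merges = train_bpe([b"low lower lowest"], num_merges=3)
```

Running this on the classic "low lower lowest" example first merges `l` + `o`, then `lo` + `w`, exactly as the frequency counts dictate.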
GPT-4's tokenizer (cl100k_base) uses BPE with a vocabulary of approximately 100,000 tokens. Claude's tokenizer uses a similar approach.
Strengths of BPE
- Handles any input: Byte-level BPE can encode any text without unknown token fallbacks
- Efficient for common patterns: Frequent words become single tokens, reducing sequence length
- Deterministic: The same text always produces the same tokens
Weaknesses of BPE
- English-centric vocabularies: Tokenizers trained primarily on English data create more tokens per word for other languages, effectively penalizing non-English users with higher costs and shorter effective context windows
- Whitespace sensitivity: "Hello" and " Hello" (with leading space) may tokenize differently, creating subtle bugs
- Code fragmentation: Variable names and syntax patterns from less-common programming languages are split into many small tokens
SentencePiece: Language-Agnostic Tokenization
SentencePiece, developed by Google, treats the input as a raw stream of Unicode characters, with whitespace handled as an ordinary symbol rather than a pre-tokenization boundary. This makes it truly language-agnostic — it does not assume spaces separate words, which is essential for languages like Chinese, Japanese, and Thai.
```python
import sentencepiece as spm

# Train a SentencePiece model on a plain-text corpus
spm.SentencePieceTrainer.train(
    input="training_data.txt",
    model_prefix="tokenizer",
    vocab_size=32000,
    model_type="bpe",  # or "unigram"
    character_coverage=0.9995,
)

# Load the trained model and encode text into subword pieces
sp = spm.SentencePieceProcessor(model_file="tokenizer.model")
tokens = sp.encode("This is a test.", out_type=str)
```
SentencePiece also supports the unigram model, which starts with a large vocabulary and prunes tokens with the least impact on the training data's likelihood. This approach can produce more linguistically motivated subword units than greedy BPE merges.
Emerging Approaches and Improvements
Tiktoken and Fast BPE
OpenAI's tiktoken library implements BPE encoding in Rust with Python bindings, achieving 3-6x speedups over comparable open-source tokenizers. This matters for applications that tokenize large volumes of text for cost estimation or chunking.
Multilingual Tokenizer Balancing
Newer models address the multilingual penalty through several strategies:
- Larger vocabularies: Moving from 32K to 100K+ tokens allows more non-English words to be represented as single tokens
- Balanced training corpora: Ensuring the tokenizer training data includes proportional representation of target languages
- Language-specific byte fallbacks: Using UTF-8 byte representations that align with the character boundaries of specific scripts
Byte Latent Transformer (BLT)
Meta's BLT architecture, published in late 2024, proposes eliminating fixed tokenization entirely. Instead, it dynamically groups bytes into variable-length patches based on the complexity of the input. Simple, predictable text gets grouped into large patches (processed efficiently), while complex or information-dense text gets fine-grained byte-level attention.
This approach could resolve the multilingual fairness problem because it adapts to the data rather than relying on a fixed vocabulary trained on a potentially imbalanced corpus.
Practical Implications
Token Counting and Cost
Different tokenizers produce dramatically different token counts for the same text:
| Text | GPT-4 (cl100k) | Llama 3 | Gemma |
|---|---|---|---|
| English paragraph (100 words) | ~130 tokens | ~125 tokens | ~128 tokens |
| Chinese paragraph (100 chars) | ~110 tokens | ~150 tokens | ~105 tokens |
| Python code (50 lines) | ~350 tokens | ~380 tokens | ~340 tokens |
These differences directly affect inference costs and effective context window utilization.
Chunking for RAG
When building retrieval-augmented generation systems, token-based chunking is more reliable than character-based chunking because it aligns with how the model processes text. Libraries like LangChain and LlamaIndex offer tokenizer-aware text splitters for this purpose.
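A minimal token-aware chunker can be written generically over any encode/decode pair. This is a sketch of the idea; production splitters like those in LangChain and LlamaIndex layer sentence-boundary handling on top of it:

```python
from typing import Callable

def chunk_by_tokens(
    text: str,
    encode: Callable[[str], list],
    decode: Callable[[list], str],
    max_tokens: int = 512,
    overlap: int = 64,
) -> list[str]:
    """Split text into chunks of at most max_tokens tokens.

    Consecutive chunks share `overlap` tokens of context, so `overlap`
    must be smaller than `max_tokens`.
    """
    ids = encode(text)
    chunks = []
    step = max_tokens - overlap
    for start in range(0, len(ids), step):
        window = ids[start:start + max_tokens]
        chunks.append(decode(window))
        if start + max_tokens >= len(ids):
            break
    return chunks
```

Any tokenizer works: pass `enc.encode` / `enc.decode` from tiktoken, or a SentencePiece processor's methods. Because the boundaries are measured in tokens, every chunk is guaranteed to fit the retrieval model's window, which character-based splitting cannot promise.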
Tokenization is infrastructure — invisible when it works well, painful when it does not. Understanding your tokenizer's behavior is essential for cost optimization, multilingual support, and reliable LLM application development.
Sources: SentencePiece GitHub | Tiktoken GitHub | BLT Paper - arXiv:2412.09871