Chapter 22 — Byte-pair encoding from scratch
The CharTokenizer from Chapter 16 has a 65-character vocabulary and turns every Tiny Shakespeare character into one token. That is the simplest tokenizer imaginable, and it has two costs:
- Sequences are long. A 100-character paragraph is 100 tokens, not 20. Every mygpt operation that scales with sequence length pays for the redundancy.
- The model has to learn long-range character chains. “Q is followed by U” is a fact about English that the model has to rediscover from gradient updates instead of getting it for free.
Real LLMs use byte-pair encoding (BPE): a tokenizer that starts from characters, then iteratively merges the most-frequent adjacent pair into a single new token, and repeats until the vocabulary reaches a target size. The algorithm was originally a data-compression technique (Gage, 1994) and was adapted to NLP by Sennrich, Haddow, and Birch in 2016. Modern variants (GPT-2, Llama) all start from BPE.
This chapter teaches the algorithm. It does not modify mygpt — Chapter 23 will wire it in. Here we hand-trace BPE on Sennrich’s canonical four-word example and watch ten merges build the vocabulary.
By the end you will have:
- understood the BPE algorithm in five lines (initialise, count pairs, find most-frequent, merge, repeat),
- traced ten merges on the example "low low low low low lower newer newer newer newer wider" by hand and in Python,
- seen the final tokenization, where "low" and "newer" are each one token, demonstrating compression,
- understood how production BPE (GPT-2, tiktoken) differs from the textbook version.
22.1 Setup
This chapter assumes Chapter 21 — mygpt/ exists with --device, --precision, --val-split, --schedule, --warmup, --max-grad-norm, plus the cosine_warmup_lr and estimate_val_loss helpers. None of those are touched in this chapter.
If you skipped, the only piece you actually need from Part I is experiments/:
mkdir -p experiments
You are ready.
22.2 Why merge pairs?
Pick any English text and look at the character pair frequencies. t followed by h happens constantly (every the, that, this, with). q followed by u happens whenever q shows up. If we replace those pairs with a single token, two things happen:
- Sequences shrink. “the cat” stops being 7 tokens (t-h-e- -c-a-t) and becomes 5 tokens (th-e- -c-a-t) — or 3 tokens (the- -cat) if we keep merging.
- The model gets a head-start. It no longer has to learn “after t, predict h with high probability” — th is one atomic token, and next-token prediction skips that step.
The trick is to let the data decide which pairs to merge. Instead of “merge th because English-speakers know it’s common”, we count pair frequencies in the corpus and merge whichever is most-frequent right now. After that merge, count pairs again — but now the previously-merged pair is a single symbol, so new pairs (e.g. th paired with e to make the) become candidates. Repeat until you have as many merges as you want.
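The “count pair frequencies” step is ordinary frequency counting. A throwaway sketch on a made-up sentence (the sentence and variable names are ours, not part of the chapter’s code):

```python
import collections

# Count adjacent character pairs in a toy sentence.
text = "the cat that sat with the hat"
pairs = collections.Counter(zip(text, text[1:]))

# ('t', 'h') and ('a', 't') each appear 4 times -- the data itself,
# not an English textbook, nominates the first merge candidates.
print(pairs.most_common(3))
```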
The vocabulary after N merges has size (initial chars) + N. Tiny Shakespeare’s 65 starting characters plus 1024 BPE merges give a vocabulary of 1089 tokens — roughly 17× larger than the character vocabulary — in exchange for a roughly 3× reduction in tokens per character of text.
22.3 The algorithm in five steps
Notation: each “word” in the corpus is represented as an ordered tuple of symbols, where each symbol is initially one character. We append a special end-of-word marker </w> to every word so the algorithm can tell low (the word) apart from low (the prefix of lower).
Init: for each word w in the corpus, count words[w]; represent w as
(c1, c2, ..., cn, </w>) — its characters plus the marker.
Step 1: count pair frequencies. For each (symbols, freq) in the vocab, and each
adjacent pair (sym_i, sym_i+1), add freq to that pair's count.
Step 2: find the most-frequent pair (a, b).
Step 3: merge: in every (symbols, freq), replace every adjacent (a, b) with
the new symbol "ab". Vocabulary now has one extra token.
Step 4: repeat steps 1–3 until the vocab reaches the target size.
Step 5: the merged tokens — in the order they were created — ARE the BPE rules.
To tokenize new text, apply the rules in order: scan, merge, scan, merge.
That’s it. Five steps, no neural network involved. The “training” of a BPE tokenizer is just frequency counting and replacement; it takes seconds even on Tiny Shakespeare.
22.4 The algorithm in code
We capture the algorithm as one standalone Python script. No mygpt import is needed — BPE has nothing to do with PyTorch, models, or training.
Save the following to 📄 experiments/41_bpe_by_hand.py:
"""Experiment 41 — Byte-pair encoding traced by hand on Sennrich's example.
The algorithm has five steps:
- Initialise: words → (chars + '</w>') tuples, with their counts.
- Count adjacent pair frequencies across the whole vocabulary.
- Pick the most-frequent pair.
- Merge that pair everywhere it occurs.
- Repeat steps 2-4 until the desired number of merges.
"""
import collections
def init_vocab(corpus_words: list[str]) -> dict[tuple[str, ...], int]:
"""Each word becomes a tuple of (chars + '</w>'); value is its frequency."""
counts = collections.Counter(corpus_words)
return {tuple(list(word) + ["</w>"]): freq for word, freq in counts.items()}
def get_pair_counts(
vocab: dict[tuple[str, ...], int],
) -> dict[tuple[str, str], int]:
"""Count adjacent (a, b) pair frequencies across the whole vocabulary."""
pairs: collections.Counter[tuple[str, str]] = collections.Counter()
for symbols, freq in vocab.items():
for i in range(len(symbols) - 1):
pairs[(symbols[i], symbols[i + 1])] += freq
return pairs
def merge_pair(
pair: tuple[str, str],
vocab: dict[tuple[str, ...], int],
) -> dict[tuple[str, ...], int]:
"""Replace every adjacent (a, b) in vocab's keys with 'ab'."""
a, b = pair
merged = a + b
new_vocab = {}
for symbols, freq in vocab.items():
new_symbols = []
i = 0
while i < len(symbols):
if i < len(symbols) - 1 and symbols[i] == a and symbols[i + 1] == b:
new_symbols.append(merged)
i += 2
else:
new_symbols.append(symbols[i])
i += 1
new_vocab[tuple(new_symbols)] = freq
return new_vocab
def main() -> None:
corpus = "low low low low low lower newer newer newer newer wider".split()
vocab = init_vocab(corpus)
print("Initial vocab:")
for k, v in vocab.items():
print(f" {v}: {' '.join(k)}")
print()
merges = []
for step in range(10):
pairs = get_pair_counts(vocab)
if not pairs:
break
best_pair, best_count = max(pairs.items(), key=lambda kv: kv[1])
merged = best_pair[0] + best_pair[1]
print(
f"Step {step + 1}: merge ({best_pair[0]!r}, {best_pair[1]!r}) "
f"-> {merged!r} (freq={best_count})"
)
merges.append(best_pair)
vocab = merge_pair(best_pair, vocab)
print()
print("Final vocab:")
for k, v in vocab.items():
print(f" {v}: {' '.join(k)}")
if __name__ == "__main__":
main()
Run it:
uv run python experiments/41_bpe_by_hand.py
Expected output:
Initial vocab:
5: l o w </w>
1: l o w e r </w>
4: n e w e r </w>
1: w i d e r </w>
Step 1: merge ('l', 'o') -> 'lo' (freq=6)
Step 2: merge ('lo', 'w') -> 'low' (freq=6)
Step 3: merge ('e', 'r') -> 'er' (freq=6)
Step 4: merge ('er', '</w>') -> 'er</w>' (freq=6)
Step 5: merge ('low', '</w>') -> 'low</w>' (freq=5)
Step 6: merge ('n', 'e') -> 'ne' (freq=4)
Step 7: merge ('ne', 'w') -> 'new' (freq=4)
Step 8: merge ('new', 'er</w>') -> 'newer</w>' (freq=4)
Step 9: merge ('low', 'er</w>') -> 'lower</w>' (freq=1)
Step 10: merge ('w', 'i') -> 'wi' (freq=1)
Final vocab:
5: low</w>
1: lower</w>
4: newer</w>
1: wi d er</w>
22.5 Reading the trace
Step-by-step the algorithm grows the vocabulary by one token at each merge:
- Step 1 picks ('l', 'o') because lo appears 6 times across the corpus (5 in low, 1 in lower). That’s the most-frequent pair before any merging.
- Step 2 picks ('lo', 'w') — the same 6 occurrences, but now starting from the just-merged lo. The merge turns lo w </w> into low </w>.
- Step 3 picks ('e', 'r'), frequency 6 (1 in lower, 4 in newer, 1 in wider). Even though we have already created lo and low, this er pair was untouched — and it’s now the most frequent.
- Step 4 glues ('er', '</w>') because every word that ends in r in this corpus also ends with </w> immediately after — total 6 occurrences. The merged token er</w> is “an er that ends a word”.
- Steps 5–7 progressively build low</w> (5), ne (4), new (4) — each merge consolidating the most-frequent pair currently visible.
- Step 8 is the first really big merge: ('new', 'er</w>'). Both pieces are themselves the result of earlier merges. The whole word newer</w> becomes one token, freq 4.
- Step 9 merges ('low', 'er</w>') to make lower</w> — this happens only once (the corpus has just one lower), so its frequency is 1, but it’s still the highest-frequency pair remaining.
- Step 10 dips into the wider word: every remaining pair has frequency 1, and ('w', 'i') comes first in iteration order. The whole word wider is rare enough that it never gets fully consolidated within 10 merges.
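If you want to double-check one of these counts, the step-3 tally can be recomputed standalone (the vocab literal below is the state after merges 1–2, copied from the trace):

```python
import collections

# Vocabulary after merges 1-2 ('l'+'o' -> 'lo', then 'lo'+'w' -> 'low').
vocab = {
    ("low", "</w>"): 5,
    ("low", "e", "r", "</w>"): 1,
    ("n", "e", "w", "e", "r", "</w>"): 4,
    ("w", "i", "d", "e", "r", "</w>"): 1,
}

pairs: collections.Counter = collections.Counter()
for symbols, freq in vocab.items():
    for a, b in zip(symbols, symbols[1:]):
        pairs[(a, b)] += freq

print(pairs[("e", "r")])  # 6 = 1 (lower) + 4 (newer) + 1 (wider)
```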
The final vocabulary tokenizes the four word-types as:
| Word type | Frequency | Tokens after 10 merges |
|---|---|---|
| low | 5 | low</w> (1 token) |
| lower | 1 | lower</w> (1 token) |
| newer | 4 | newer</w> (1 token) |
| wider | 1 | wi d er</w> (3 tokens) |
Three of the four word-types collapsed to a single token; the fourth (the rarest) stayed mostly as characters. This is BPE’s central trade: frequent strings become atomic, rare strings stay broken-up. The model gets a head-start on common words and can still represent uncommon words via their character decomposition.
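The compression is easy to quantify on this toy corpus. Counting symbols per word-type before any merges (characters plus </w>) against the token counts from the final vocab:

```python
# Symbols per word-type before merging (chars + '</w>') and after 10 merges.
freq = {"low": 5, "lower": 1, "newer": 4, "wider": 1}
before = {"low": 4, "lower": 6, "newer": 6, "wider": 6}
after = {"low": 1, "lower": 1, "newer": 1, "wider": 3}

total_before = sum(before[w] * freq[w] for w in freq)  # 56 tokens
total_after = sum(after[w] * freq[w] for w in freq)    # 13 tokens
print(total_before / total_after)                      # roughly 4.3x shorter
```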
22.6 The merge order is the rulebook
The list of (a, b) → ab merges, in the order they were chosen, is the tokenizer. To tokenize a new word, you start from its characters plus </w> and apply the merge rules in the order they were created, repeatedly, until no more apply. For our 10 merges:
1. l + o → lo
2. lo + w → low
3. e + r → er
4. er + </w> → er</w>
5. low + </w> → low</w>
6. n + e → ne
7. ne + w → new
8. new + er</w> → newer</w>
9. low + er</w> → lower</w>
10. w + i → wi
A new word like slow would tokenize as: start with s l o w </w>. Apply rule 1 (l + o → lo) where it fits: s lo w </w>. Apply rule 2 (lo + w → low): s low </w>. Apply rule 5 (low + </w> → low</w>): s low</w>. No more rules apply (rule 3 needs e + r, etc.). Final: s low</w> — 2 tokens.
A new word like slower would tokenize as: s l o w e r </w> → after rules 1, 2, 3, 4: s low er</w> → after rule 9 (low + er</w> → lower</w>): s lower</w> — 2 tokens.
A word with no rule applying at all (e.g. xyz) stays at character level: x y z </w> — 4 tokens. Every byte sequence has some tokenization, even if it’s “all characters”. BPE never fails to tokenize.
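The scan-merge procedure can be sketched as a short function. This is our illustration, not code from mygpt; the name bpe_tokenize is ours, and the inlined MERGES list is the ten rules above:

```python
# The ten merge rules from this chapter, in creation order.
MERGES = [
    ("l", "o"), ("lo", "w"), ("e", "r"), ("er", "</w>"), ("low", "</w>"),
    ("n", "e"), ("ne", "w"), ("new", "er</w>"), ("low", "er</w>"), ("w", "i"),
]

def bpe_tokenize(word: str, merges: list[tuple[str, str]]) -> list[str]:
    """Apply merge rules in creation order, sweeping until none fires."""
    symbols = list(word) + ["</w>"]
    changed = True
    while changed:
        changed = False
        for a, b in merges:
            out: list[str] = []
            i = 0
            while i < len(symbols):
                if i < len(symbols) - 1 and (symbols[i], symbols[i + 1]) == (a, b):
                    out.append(a + b)  # merge the adjacent pair
                    i += 2
                    changed = True
                else:
                    out.append(symbols[i])
                    i += 1
            symbols = out
    return symbols

print(bpe_tokenize("slow", MERGES))    # ['s', 'low</w>']
print(bpe_tokenize("slower", MERGES))  # ['s', 'lower</w>']
print(bpe_tokenize("xyz", MERGES))     # ['x', 'y', 'z', '</w>']
```

The outer while loop re-sweeps the rule list until nothing changes, matching the “scan, merge, scan, merge” description above.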
22.7 What production BPE does differently
Two things separate this textbook BPE from the GPT-2 / tiktoken / Llama variants:
1. Byte-level, not character-level. GPT-2’s BPE operates on bytes, not Unicode characters. The initial vocabulary has 256 entries (one per byte value). Any UTF-8 input — including emoji, Chinese, Cyrillic, code points the training corpus never saw — encodes to a unique byte sequence first, then runs through BPE. Result: GPT-2’s tokenizer can encode any Unicode text without an “unknown token”, at the cost of representing non-ASCII characters as multiple bytes.
2. No </w> marker. GPT-2’s BPE does not use end-of-word markers. Instead, it pre-splits text with a regex that keeps spaces attached to the following word — so “ the” (space-prefixed) and “the” (after no space) tokenize differently. This is why you’ll see Ġthe in GPT-2’s vocabulary — Ġ is GPT-2’s printable representation of the leading-space byte. The end-of-word marker isn’t needed because the leading space disambiguates.
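A quick illustration of point 1 (plain Python, nothing GPT-2-specific): UTF-8 maps any text to byte values 0–255, so a 256-entry base vocabulary covers everything.

```python
# Every Unicode string encodes to a unique sequence of byte values 0-255,
# so byte-level BPE never needs an "unknown token".
text = "café 🙂"
byte_ids = list(text.encode("utf-8"))
print(len(text), len(byte_ids))  # 6 characters, 10 bytes
print(byte_ids)
```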
We will implement the textbook version (with </w>) in Chapter 23 because it is simpler to reason about and achieves the same compression ratios on Tiny Shakespeare. The byte-level BPE adds engineering, not pedagogy.
22.8 Experiments
- Different number of merges. Edit experiments/41_bpe_by_hand.py to use range(20) instead of range(10). Steps 11–12 finish off wider: step 11 is ('wi', 'd') → wid, step 12 is ('wid', 'er</w>') → wider</w>. After step 12 every word-type is a single token, so get_pair_counts returns an empty dict and the loop’s early-exit if not pairs: break fires. Total of 12 merges, after which the vocab cannot grow further on this corpus.
- A different corpus. Replace the corpus = ... line with corpus = "the cat sat on the mat the cat ran".split(). The first merge is ('a', 't') with frequency 4 — twice from cat, once each from sat and mat. The next merge is ('at', '</w>'), also frequency 4; ('t', 'h') (3 occurrences in the) follows soon after. After 6 merges or so, the</w>, cat</w>, and at</w> all show up as single tokens. Watch how quickly common short words become atomic when the corpus has more repetition.
- The </w> marker matters. Remove the + ["</w>"] from init_vocab so words are pure character tuples without an end marker. Re-run on Sennrich’s example. Observe that the ('er', '</w>') merge disappears, and that low the whole word and low the prefix of lower now produce the same token — the tokenization can no longer tell a word ending from a word continuing. (In a real BPE this is fine because pre-splitting handles word boundaries; in this toy version it is a bug.)
- What happens if all pairs are tied? With a corpus of ["abcd"] (one occurrence), every adjacent pair has frequency 1. The algorithm picks one based on dict iteration order. max(...) in CPython returns the first occurrence of the maximum, so the result is deterministic given a fixed input. Verify experimentally: run the script ten times with corpus = ["abcd"] and confirm the merges are identical.
After each experiment, restore any file you changed.
22.9 Exercises
- Tokenize lowest. With our 10-merge tokenizer, walk through what lowest becomes. Answer: lowest wasn’t in the training corpus, so no whole-word rule fits. Rules 1 and 2 build low, giving low e s t </w>. Rule 5 is low + </w>, but the symbol after low is e, not </w>, so it doesn’t fire. Final: low e s t </w> — 5 tokens. Confirm by adding a small tokenize() function to experiments/41_bpe_by_hand.py that applies the merge list to a new word.
- Why is the merge order significant? Argue that the rules must be applied in creation order: a rule’s two input symbols may themselves be products of earlier rules, so rule 9 (low + er</w>) can never fire until rules 1–4 have built low and er</w>. What could go wrong if the rules were stored as an unordered set and applied in arbitrary order?
- Vocabulary size after K merges. Argue that after K merges, the BPE vocabulary has exactly len(initial_chars) + K entries. Why doesn’t a merge ever reduce vocabulary size? (Hint: the merged token is added; the original component tokens stay in the vocab because they may still be needed for words the merge doesn’t apply to.)
- A pathological corpus. Suppose the corpus has only one word: corpus = ["aaaaaaaa"] (eight as). Trace 4 merges by hand. (Hint: step 1 merges ('a', 'a') → aa, step 2 merges ('aa', 'aa') → aaaa, step 3 merges ('aaaa', 'aaaa') → aaaaaaaa. After step 4 the whole word, marker included, is the single token aaaaaaaa</w>.)
22.10 What’s next
We have the algorithm. The next chapter, Chapter 23 — BPETokenizer in mygpt, packages it up into a mygpt.BPETokenizer class with the same shape as CharTokenizer (constructor, encode, decode, save, load) and adds a --tokenizer {char, bpe} flag to the CLI. By the end of Ch.23 we will have trained a BPE tokenizer on Tiny Shakespeare with 1024 merges and confirmed it reduces the encoded sequence length by roughly 2.7×.
Looking ahead — what to remember from this chapter:
- BPE is “iteratively merge the most-frequent adjacent pair” and nothing more sophisticated.
- The output of BPE training is a list of merge rules in order; that list, plus the initial alphabet, IS the tokenizer.
- Frequent strings become single tokens; rare strings stay broken-up — this trade-off is what makes the tokenizer compact and able to handle arbitrary input.
- Production BPE (GPT-2, tiktoken) replaces character-level with byte-level and
</w>with a regex pre-split; the algorithm itself is unchanged.