Chapter 16 — A reusable character tokenizer
So far the entire tutorial has run on the four-token vocabulary

```python
VOCAB = ("I", "love", "AI", "!")
```
That worked because we hand-picked it. Real text is not so kind: a paragraph of English contains thousands of distinct words, punctuation patterns, capitalisations and whitespace conventions. We need a tokenizer — a small, separate component whose only job is to map text → ids and ids → text. It is independent of the model: training the model on different text just means swapping out the tokenizer.
This chapter builds the simplest possible tokenizer, character-level: every distinct character becomes its own token. By the end you will have:
- understood why the tokenizer is a separable component (and why GPT-2 actually uses something fancier, called BPE),
- added a `CharTokenizer` class to `mygpt` with `encode`, `decode`, `save`, and `load` methods,
- watched a round-trip encode → decode reproduce the input text exactly,
- saved a tokenizer to disk and reloaded it, ready for Chapter 17 to train a real model on real text.
16.1 What a tokenizer is, and why it is separate from the model
Recall what the GPT model expects as input: a `(B, T)` long tensor of integer ids in $[0, V)$, where $V$ is the vocabulary size. The model neither knows nor cares whether token id 5 stands for the word `"the"`, the character `'t'`, or a sub-word fragment like `"##ing"`. That mapping — text ↔ ids — lives entirely in the tokenizer.
This separation has two consequences:
- Different tokenizers, same model architecture. A 124M-parameter GPT-2 with `vocab_size=50257` is the same architecture as our `vocab_size=4` toy. Only the embedding table, the LM head, and (because they are tied) the readout dimension differ.
- The tokenizer must be saved alongside the model. A trained model whose token id 5 means `'t'` is gibberish if loaded with a tokenizer where id 5 means `'q'`. Persistence matters; the sketch below makes the failure mode concrete.
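To see the second point in miniature, here is a throwaway sketch (plain dicts, no model; both alphabets are made up for illustration) of the same ids decoded under two tokenizers that disagree about one id:

```python
# Same ids, two different id -> char tables: the decoded text disagrees
# exactly where the tables disagree about what an id means.
itos_a = {0: " ", 3: "h", 4: "e", 5: "t"}  # here id 5 means 't'
itos_b = {0: " ", 3: "h", 4: "e", 5: "q"}  # here id 5 means 'q'

ids = [5, 3, 4]
print("".join(itos_a[i] for i in ids))  # the
print("".join(itos_b[i] for i in ids))  # qhe
```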
Real GPT-2 uses a byte-pair encoding (BPE) tokenizer with 50,257 sub-word tokens — a compromise that gives common words like "the" their own id while breaking rare words into pieces. We won’t implement BPE; it is its own subject. Character-level is the simplest tokenizer that works on arbitrary text, and it is the natural starting point for a tutorial.
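If you are curious what that looks like in practice, the third-party `tiktoken` package (not used anywhere in this tutorial) exposes GPT-2's actual BPE vocabulary. A minimal sketch, assuming `tiktoken` is installed:

```python
# Peek at GPT-2's real BPE tokenizer via the tiktoken package.
import tiktoken

enc = tiktoken.get_encoding("gpt2")
print(enc.n_vocab)  # 50257

for word in ["the", "tokenization"]:
    ids = enc.encode(word)
    # Common words come back as a single id; rarer words split into pieces.
    print(f"{word!r} -> {len(ids)} token(s)")
```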
A character-level tokenizer:
- vocabulary = the sorted list of distinct characters in the training text,
- `encode(text)` = look up every character’s id,
- `decode(ids)` = look up every id’s character and concatenate.
That’s the whole idea.
16.2 Setup
If you finished Chapter 15, you already have everything: `mygpt/` with `get_batch`, `GPT`, `generate`, and the `trained_gpt.pt` checkpoint.
If you skipped, recreate the state from a clean directory:
```bash
uv init mygpt --package
cd mygpt
mkdir -p experiments
uv add torch numpy
```
Overwrite `src/mygpt/__init__.py` with the Chapter 15 ending state from `docs/_state_after_ch15.md`. (`trained_gpt.pt` from Ch.14 is not required for this chapter; we don’t load any checkpoint here.)
You are ready.
16.3 Building the alphabet from text
The very first step in character-level tokenization is to scan a text and collect every distinct character. Python makes this a one-liner:
text = "I love AI !"
chars = sorted(set(text))
`set(text)` deduplicates; `sorted(...)` puts the result in a deterministic order so the same text always produces the same vocabulary on different machines.
For our familiar string the alphabet is:
```python
[' ', '!', 'A', 'I', 'e', 'l', 'o', 'v']  # 8 distinct characters
```

Eight, not four — the four-token `VOCAB` of earlier chapters treated `"love"` as one symbol; the character tokenizer treats it as four (`'l'`, `'o'`, `'v'`, `'e'`).
The id-of-character lookup is just a dict comprehension over `enumerate`:

```python
stoi = {c: i for i, c in enumerate(chars)}  # "string to int"
itos = {i: c for i, c in enumerate(chars)}  # "int to string"
```

`stoi[' ']` is `0`; `stoi['I']` is `3`; `itos[3]` is `'I'`. This is exactly the same structure as the earlier `VOCAB`/`VOCAB.index(...)` pair — we have just generalised it to “whatever characters appear in the training text”.
A small experiment makes this concrete for our running example. Save the following to 📄 `experiments/34_alphabet.py`:
"""Experiment 34 — Build a character alphabet from a string."""
def main() -> None:
text = "I love AI !"
chars = sorted(set(text))
stoi = {c: i for i, c in enumerate(chars)}
itos = {i: c for i, c in enumerate(chars)}
print(f"text: {text!r}")
print(f"vocab_size: {len(chars)}")
print(f"chars: {chars}")
print(f"stoi: {stoi}")
print(f"itos: {itos}")
if __name__ == "__main__":
main()
Run it:
```bash
uv run python experiments/34_alphabet.py
```
Expected output:
```
text: 'I love AI !'
vocab_size: 8
chars: [' ', '!', 'A', 'I', 'e', 'l', 'o', 'v']
stoi: {' ': 0, '!': 1, 'A': 2, 'I': 3, 'e': 4, 'l': 5, 'o': 6, 'v': 7}
itos: {0: ' ', 1: '!', 2: 'A', 3: 'I', 4: 'e', 5: 'l', 6: 'o', 7: 'v'}
```
16.4 The CharTokenizer class
We package this into a class that exposes a clean four-method API (`encode`, `decode`, `save`, `load`), plus a class-method constructor `from_text(text)` that builds the alphabet for you.
Append the following to 📄 `src/mygpt/__init__.py` (after the `generate` function, before `main`):
```python
import json


class CharTokenizer:
    """Character-level tokenizer.

    Vocabulary = the sorted list of distinct characters seen in the training
    text. Token id = position in that list.
    """

    def __init__(self, chars: list[str]) -> None:
        self.chars = list(chars)
        self.vocab_size = len(self.chars)
        self.stoi = {c: i for i, c in enumerate(self.chars)}
        self.itos = {i: c for i, c in enumerate(self.chars)}

    @classmethod
    def from_text(cls, text: str) -> "CharTokenizer":
        """Build a tokenizer whose vocabulary is the alphabet of `text`."""
        return cls(sorted(set(text)))

    def encode(self, text: str) -> torch.Tensor:
        """Encode `text` to a 1-D long tensor of ids."""
        return torch.tensor([self.stoi[c] for c in text], dtype=torch.long)

    def decode(self, ids: torch.Tensor) -> str:
        """Decode a 1-D tensor of ids back to a string."""
        return "".join(self.itos[int(i)] for i in ids)

    def save(self, path: str) -> None:
        """Persist the tokenizer to a JSON file at `path`."""
        with open(path, "w") as f:
            json.dump({"chars": self.chars}, f)

    @classmethod
    def load(cls, path: str) -> "CharTokenizer":
        """Reload a tokenizer from a JSON file produced by `save`."""
        with open(path) as f:
            data = json.load(f)
        return cls(data["chars"])
```
A few small choices worth flagging:
- `json`, not `torch.save`. A tokenizer is a list of characters and nothing else; using JSON keeps the file small, human-readable, and editable in a text editor. The model checkpoint will continue to use `torch.save`/`torch.load`.
- `encode` returns a `torch.Tensor`, not a list. Every downstream consumer (`GPT.forward`, `generate`, `get_batch`) expects tensors. Returning the right type avoids a `torch.tensor(...)` cast at every call site.
- `decode` accepts a 1-D tensor. `int(i)` works for both Python ints and zero-dim tensors, so iterating over a 1-D tensor yields scalars that decode cleanly.
- The vocabulary is fixed at construction. `encode` will raise a `KeyError` if you pass a character that wasn’t in the training text. We will not gracefully handle out-of-vocabulary characters in this tutorial; the next chapter just takes the union over the full training file. Still, the sketch below shows one common fallback.
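For reference, here is a minimal sketch of that fallback: a hypothetical variant (not part of our `CharTokenizer`, and not needed in later chapters) that reserves id 0 for an `<unk>` token instead of raising:

```python
import torch


class CharTokenizerWithUnk:
    """Hypothetical variant: unseen characters map to a reserved <unk> id."""

    def __init__(self, chars: list[str]) -> None:
        self.chars = ["<unk>"] + list(chars)  # id 0 is the unknown token
        self.stoi = {c: i for i, c in enumerate(self.chars)}
        self.itos = {i: c for i, c in enumerate(self.chars)}

    def encode(self, text: str) -> torch.Tensor:
        # dict.get with a default never raises; unseen characters become id 0.
        return torch.tensor([self.stoi.get(c, 0) for c in text], dtype=torch.long)

    def decode(self, ids: torch.Tensor) -> str:
        return "".join(self.itos[int(i)] for i in ids)
```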
16.5 Encoding and decoding: the round-trip
Now the satisfying experiment: encode a string, then decode the ids back, and check that we recover the original text.
Save the following to 📄 `experiments/35_encode_decode.py`:
"""Experiment 35 — Build a CharTokenizer and round-trip encode/decode."""
from mygpt import CharTokenizer
def main() -> None:
text = "I love AI ! I love AI !"
tok = CharTokenizer.from_text(text)
ids = tok.encode(text)
back = tok.decode(ids)
print(f"text: {text!r}")
print(f"vocab_size: {tok.vocab_size}")
print(f"ids shape: {tuple(ids.shape)}")
print(f"first 12 ids: {ids[:12].tolist()}")
print(f"decoded: {back!r}")
print(f"round-trip ok: {back == text}")
if __name__ == "__main__":
main()
Run it:
```bash
uv run python experiments/35_encode_decode.py
```
Expected output:
```
text: 'I love AI ! I love AI !'
vocab_size: 8
ids shape: (23,)
first 12 ids: [3, 0, 5, 6, 7, 4, 0, 2, 3, 0, 1, 0]
decoded: 'I love AI ! I love AI !'
round-trip ok: True
```
Read the first 12 ids (3, 0, 5, 6, 7, 4, 0, 2, 3, 0, 1, 0), decoded character by character against `itos`:
| id | char |
|---|---|
| 3 | 'I' |
| 0 | ' ' |
| 5 | 'l' |
| 6 | 'o' |
| 7 | 'v' |
| 4 | 'e' |
| 0 | ' ' |
| 2 | 'A' |
| 3 | 'I' |
| 0 | ' ' |
| 1 | '!' |
| 0 | ' ' |
That spells "I love AI ! " — and the remaining 11 ids (3, 0, 5, 6, 7, 4, 0, 2, 3, 0, 1) spell the second "I love AI !". The round-trip is exact.
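If one successful round-trip isn’t convincing enough, a quick property check (a throwaway snippet, not one of the numbered experiments) is to round-trip random strings drawn from the tokenizer’s own alphabet:

```python
# Property check: any string built from the tokenizer's alphabet round-trips.
import random

from mygpt import CharTokenizer

tok = CharTokenizer.from_text("I love AI !")
for _ in range(100):
    s = "".join(random.choices(tok.chars, k=20))
    assert tok.decode(tok.encode(s)) == s
print("100 random round-trips ok")
```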
16.6 Saving and loading
A tokenizer is meaningless unless you can reload the same vocabulary that the model was trained against. `CharTokenizer.save(path)` writes a small JSON file; `CharTokenizer.load(path)` re-creates the tokenizer from it.
Save the following to 📄 experiments/36_save_load.py:
"""Experiment 36 — Save a tokenizer to disk and reload it.
The reloaded tokenizer encodes to the same ids as the original.
"""
from mygpt import CharTokenizer
def main() -> None:
text = "I love AI !"
tok1 = CharTokenizer.from_text(text)
tok1.save("tokenizer.json")
print(f"saved tokenizer.json (vocab_size={tok1.vocab_size})")
tok2 = CharTokenizer.load("tokenizer.json")
print(f"loaded tokenizer.json (vocab_size={tok2.vocab_size})")
sample = "love AI"
ids1 = tok1.encode(sample)
ids2 = tok2.encode(sample)
print(f"sample: {sample!r}")
print(f"original ids: {ids1.tolist()}")
print(f"reloaded ids: {ids2.tolist()}")
print(f"identical: {ids1.tolist() == ids2.tolist()}")
if __name__ == "__main__":
main()
Run it:
```bash
uv run python experiments/36_save_load.py
```
Expected output:
```
saved tokenizer.json (vocab_size=8)
loaded tokenizer.json (vocab_size=8)
sample: 'love AI'
original ids: [5, 6, 7, 4, 0, 2, 3]
reloaded ids: [5, 6, 7, 4, 0, 2, 3]
identical: True
```
If you `cat tokenizer.json`, you will see exactly:

```json
{"chars": [" ", "!", "A", "I", "e", "l", "o", "v"]}
```
A tokenizer file for our running example is 51 bytes (the JSON above, no trailing newline). A real-text tokenizer for the Tiny Shakespeare corpus we will use in Chapter 17 has ~65 distinct characters and the JSON file is around 350 bytes. The size is negligible compared to the model, but the correctness of pairing the right tokenizer with the right model is essential.
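You can predict that byte count without opening the file: `json.dump` writes exactly the string `json.dumps` would produce, with no trailing newline. A quick sanity check:

```python
import json

chars = [' ', '!', 'A', 'I', 'e', 'l', 'o', 'v']
print(len(json.dumps({"chars": chars})))  # 51, matching the file size above
```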
16.7 Tokenizing the running example for the model
For the rest of the tutorial, the full training pipeline takes:
- A training text file (e.g. `tiny_shakespeare.txt`).
- A tokenizer built from it (`CharTokenizer.from_text(text)`).
- A 1-D long tensor of every character’s id (`tok.encode(text)`).
- A `GPT(vocab_size=tok.vocab_size, ...)` model sized to that vocabulary.
- The `get_batch` sampler from Ch.14, the AdamW training loop from Ch.14, and the `generate` function from Ch.15.
For sanity, let’s wire this up with the existing four-token running example: scan the string, build a tokenizer, encode it into the corpus tensor, and confirm the result is shaped exactly like the corpus we had in Chapter 14 — except the vocabulary is 8 (characters) instead of 4 (whole words).
Save the following to 📄 experiments/37_corpus_with_tokenizer.py:
"""Experiment 37 — Build the training corpus tensor via CharTokenizer.
This is the same shape of input the Ch.14 training loop expects — just
with vocab_size=8 (characters) instead of 4 (whole words).
"""
import torch
from mygpt import CharTokenizer
def main() -> None:
text = "I love AI ! " * 16 # 192 characters total
tok = CharTokenizer.from_text(text)
data = tok.encode(text)
print(f"text length (chars): {len(text)}")
print(f"vocab_size: {tok.vocab_size}")
print(f"data shape: {tuple(data.shape)}")
print(f"data.dtype: {data.dtype}")
print(f"first 24 ids: {data[:24].tolist()}")
print(f"decoded[:24]: {tok.decode(data[:24])!r}")
if __name__ == "__main__":
main()
Run it:
```bash
uv run python experiments/37_corpus_with_tokenizer.py
```
Expected output:
```
text length (chars): 192
vocab_size: 8
data shape: (192,)
data.dtype: torch.int64
first 24 ids: [3, 0, 5, 6, 7, 4, 0, 2, 3, 0, 1, 0, 3, 0, 5, 6, 7, 4, 0, 2, 3, 0, 1, 0]
decoded[:24]: 'I love AI ! I love AI ! '
```
Notice the structure: every 12 characters we get the cycle "I love AI ! " (with the trailing space). The repeating-cycle property is preserved — Chapter 17’s training loop will rediscover it on real Shakespeare with a much larger alphabet.
We have not retrained the model. The Ch.14 checkpoint was trained on a 4-token vocabulary, so it cannot consume our character-level ids. Chapter 17 will train a fresh GPT against the character corpus produced here.
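As a preview, here is roughly how Chapter 17 will wire the pieces together. This is a sketch, not something to run yet: the `tiny_shakespeare.txt` filename and the exact `GPT` keyword arguments are taken from the plan in section 16.10.

```python
# Preview of Chapter 17's wiring (the corpus file does not exist yet).
from mygpt import GPT, CharTokenizer

with open("tiny_shakespeare.txt") as f:
    text = f.read()

tok = CharTokenizer.from_text(text)
tok.save("tokenizer.json")      # persisted next to the model checkpoint
data = tok.encode(text)         # 1-D long tensor: the whole corpus

model = GPT(
    vocab_size=tok.vocab_size,  # sized to the alphabet, never hard-coded
    embed_dim=128,
    num_heads=4,
    num_layers=4,
    max_seq_len=128,
    dropout=0.1,
)
# ...then the Ch.14 get_batch/AdamW training loop and Ch.15's generate.
```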
16.8 Experiments
- Try it on a longer sentence. Edit `experiments/35_encode_decode.py`’s `text` to `"The quick brown fox jumps over the lazy dog."`. The vocab grows to 29 distinct characters (the 26 lowercase letters, plus space, period, and the uppercase `'T'`); the round-trip still works.
- Out-of-vocabulary fails loudly. Try `tok.encode("Hello world")` after `tok = CharTokenizer.from_text("I love AI !")`. The character `'H'` was never in the training text, so `encode` raises `KeyError: 'H'`. This is intentional — we want it to fail loudly rather than silently produce a wrong-shaped token sequence.
- Sorting makes the vocabulary deterministic. Build two tokenizers, one from `"abcd"` and one from `"dcba"`. Both produce the same `chars = ['a', 'b', 'c', 'd']` and the same `tok.encode("a") == tensor([0])`, because `sorted(set(text))` doesn’t care what order the input characters appeared in. This is the whole reason `from_text` calls `sorted` — same training text → same vocabulary, regardless of OS, Python version, or hash randomisation.
- Save and reload across runs. Run experiment 36 once, then in a fresh `uv run python` invocation reload `tokenizer.json` and confirm `tok2.encode("love")` is identical to what you got in the first run. This persistence is the whole reason the tokenizer needs to be a separate, saveable artifact.
After each experiment, restore any file you changed before moving on.
16.9 Exercises
- Vocabulary growth on real text. A 1 MB plain-text file like the Tiny Shakespeare corpus contains roughly 65 distinct characters (uppercase + lowercase alphabet + digits + punctuation + whitespace). Argue that `vocab_size` for a character-level tokenizer is bounded by the size of the alphabet of the training text, not by its length. (Hint: `set(text)` cannot be larger than the number of distinct characters, regardless of how long `text` is.)
- Why sub-word tokenizers exist. Argue that a character-level tokenizer requires the model to learn long-range dependencies (e.g. “what comes after `t-h-e- -c-a-`?”) that a word-level or sub-word tokenizer would express in a single id. (Hint: every English word becomes between ~3 and ~12 tokens at character level; ~1 to ~3 at sub-word level.)
- Decoding in batches. Our `decode` takes a 1-D tensor. The model produces `(B, T)` tensors. Sketch how you would write `decode_batch(ids: torch.Tensor) -> list[str]` for a `(B, T)` tensor; check what `for row in ids:` iterates over. (One possible answer appears after this list.)
- Tokenizer round-trip identity. Argue that for a character-level tokenizer trained on `text`, the round-trip `tok.decode(tok.encode(text))` always returns `text` exactly. Why is this not true for BPE tokenizers in general? (Hint: BPE tokenizes at the byte level and merges based on frequency; it can re-tokenize an already-tokenized string differently.)
16.10 What’s next
We have a tokenizer that can chew on arbitrary text. Chapter 17 puts it to work:
- download the Tiny Shakespeare corpus (a ~1 MB plain-text file),
- build a `CharTokenizer` from it,
- train a `GPT(vocab_size=tokenizer.vocab_size, embed_dim=128, num_heads=4, num_layers=4, max_seq_len=128, dropout=0.1)` on it,
- watch the loss drop and the model start producing pseudo-Shakespearean text.
Chapter 18 wraps everything in a CLI and adds checkpointing, so you can `mygpt train file.txt` and `mygpt generate --prompt "..."` from any text file.
Looking ahead — what to remember from this chapter
- The tokenizer is a separate component from the model. Different text → different tokenizer → different vocabulary, but the same model architecture.
- Character-level is the simplest tokenizer: alphabet = `sorted(set(text))`, encode = lookup, decode = lookup.
- The tokenizer must be saved alongside the model. A model whose token id 5 means `'t'` is gibberish if loaded with a tokenizer where id 5 means `'q'`.
- Real GPT-2 uses BPE (byte-pair encoding), a sub-word tokenizer with 50,257 ids — the same architecture, just a much bigger vocabulary.
On to Chapter 17 — Training on a real text file (coming soon).