Chapter 17 — Training on a real text file
Up to now everything has run on toy data — the four-token cycle from Ch.14, the same four words tokenized at character level in Ch.16. This chapter throws a real corpus at the pipeline:
- Tiny Shakespeare: ~1.1 MB of plain-text Shakespeare collected by Andrej Karpathy. 65 distinct characters, ~1.1 million tokens at character level.
- A 207k-parameter GPT sized to fit on a laptop CPU.
- A real training run: 2,000 AdamW steps in under a minute on CPU.
- A real sample: 200 generated tokens that look like Shakespeare — capitalised speakers, line breaks, vaguely English vocabulary — even though every word is invented.
Nothing in mygpt changes for this chapter. Everything we built in Ch.5–16 is now used together against a non-toy corpus: CharTokenizer (Ch.16), GPT (Ch.12–13), get_batch (Ch.14), AdamW training loop (Ch.14), generate (Ch.15).
17.1 Setup
If you finished Chapter 16, you already have everything: mygpt/ with CharTokenizer, GPT, get_batch, generate. There is no __init__.py change in this chapter.
If you skipped, recreate the state from a clean directory:
uv init mygpt --package
cd mygpt
mkdir -p experiments
uv add torch numpy
Overwrite src/mygpt/__init__.py with the Chapter 16 ending state from docs/_state_after_ch16.md.
You are ready.
17.2 The corpus
Tiny Shakespeare is a 1.1 MB plain-text concatenation of Shakespeare’s plays formatted as
SPEAKER:
Lines of dialogue.
OTHER SPEAKER:
More lines.
Download it. From the project root:
curl -s -o tinyshakespeare.txt https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
curl -s (silent) prints nothing on success; you should be returned to the prompt with tinyshakespeare.txt of about 1.1 MB sitting in the directory. To confirm:
wc -c tinyshakespeare.txt
head -3 tinyshakespeare.txt
Expected output:
1115394 tinyshakespeare.txt
First Citizen:
Before we proceed any further, hear me speak.
That’s 1,115,394 bytes (≈ 1.1 MB) and the first three lines are a speaker label First Citizen: followed by the start of the play. If the byte count or first three lines differ, re-download — the file shouldn’t be changing.
17.3 Tokenising the corpus
We use CharTokenizer.from_text(text) to scan the corpus and build the alphabet, then tok.encode(text) to convert to a 1-D tensor of ids.
Save the following to 📄 experiments/38_tokenize_shakespeare.py:
"""Experiment 38 — Tokenize Tiny Shakespeare."""
from mygpt import CharTokenizer
def main() -> None:
    with open("tinyshakespeare.txt") as f:
        text = f.read()

    tok = CharTokenizer.from_text(text)
    data = tok.encode(text)

    print(f"corpus chars: {len(text):,}")
    print(f"vocab_size: {tok.vocab_size}")
    print(f"chars: {tok.chars}")
    print()
    print(f"data shape: {tuple(data.shape)}")
    print(f"data.dtype: {data.dtype}")
    print(f"first 60 ids: {data[:60].tolist()}")


if __name__ == "__main__":
    main()
Run it:
uv run python experiments/38_tokenize_shakespeare.py
Expected output:
corpus chars: 1,115,394
vocab_size: 65
chars: ['\n', ' ', '!', '$', '&', "'", ',', '-', '.', '3', ':', ';', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'Q', 'R', 'S', 'T', 'U', 'V', 'W', 'X', 'Y', 'Z', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']
data shape: (1115394,)
data.dtype: torch.int64
first 60 ids: [18, 47, 56, 57, 58, 1, 15, 47, 58, 47, 64, 43, 52, 10, 0, 14, 43, 44, 53, 56, 43, 1, 61, 43, 1, 54, 56, 53, 41, 43, 43, 42, 1, 39, 52, 63, 1, 44, 59, 56, 58, 46, 43, 56, 6, 1, 46, 43, 39, 56, 1, 51, 43, 1, 57, 54, 43, 39, 49, 8]
Read the alphabet: 65 characters — newline, space, the punctuation Shakespeare actually uses (! $ & ' , - . : ; ?), the digit 3 (yes, there is exactly one digit in the entire corpus, in a stage direction), all 26 uppercase letters and all 26 lowercase letters. No “y” without “Y”, no Unicode quirks: this is plain ASCII Shakespeare.
The first 60 ids decode to "First Citizen:\nBefore we proceed any further, hear me speak." — the speaker label, the line break (id 0 is '\n'), and the first line up to and including the period. (You can verify this with tok.decode(data[:60]) if you like.)
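That round-trip is easy to check by hand. The snippet below is an illustrative stand-in for CharTokenizer's behaviour — the real class was built in Ch.16 and lives in mygpt — showing that a character tokenizer is just two lookup tables:

```python
# Illustrative stand-in for CharTokenizer (the real one is in mygpt, Ch.16):
# the vocabulary is the sorted set of distinct characters, and encode/decode
# are plain table lookups, so decode(encode(text)) == text always holds.
sample = "First Citizen:\nBefore we proceed any further, hear me speak."
chars = sorted(set(sample))
stoi = {ch: i for i, ch in enumerate(chars)}  # char -> id
itos = {i: ch for i, ch in enumerate(chars)}  # id -> char

ids = [stoi[ch] for ch in sample]
assert "".join(itos[i] for i in ids) == sample  # round-trip is lossless
print(f"{len(chars)} distinct chars in this sample, {len(ids)} ids")
```

On the full corpus the vocabulary comes out to the 65 characters listed above; on this one-line sample it is smaller, but the mechanics are identical.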
17.4 Sizing the model
We pick a model that is just big enough to do something interesting on CPU:
GPT(vocab_size=65, embed_dim=64, num_heads=4, num_layers=4, max_seq_len=64, dropout=0.0)
That is 207,296 parameters — bigger than the 2,240-parameter toy in Ch.14 by two orders of magnitude, but still tiny compared to GPT-2’s 124 M. For reference:
| Source | Params |
|---|---|
| Ch.14 toy GPT (V=4) | 2,240 |
| This chapter (V=65) | 207,296 |
| GPT-2 small | 124,412,160 |
The four hyperparameters (embed_dim, num_heads, num_layers, max_seq_len) all matter:
- embed_dim=64: each character becomes a 64-dimensional vector. GPT-2 uses 768; we use 64 to keep training fast.
- num_heads=4: attention is split into 4 heads of head_dim = 64/4 = 16. GPT-2 uses 12 heads.
- num_layers=4: 4 stacked TransformerBlocks. GPT-2 small uses 12.
- max_seq_len=64: the model can see at most 64 characters of context. GPT-2 uses 1024. Sixty-four characters is roughly one short line of Shakespeare — enough to learn local structure, not enough to learn long-range dependencies.
dropout=0.0 because our model is small and we won’t overtrain badly enough to need regularisation. (Real training runs use 0.1.)
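As a sanity check on the 207,296 figure, the count can be reproduced from the configuration with a few lines of arithmetic, taking the per-component parameter formulas from earlier chapters as given:

```python
V, C, num_layers, T = 65, 64, 4, 64  # vocab, embed_dim, num_layers, max_seq_len

tok_emb = V * C                             # token embedding (tied with the LM head)
pos_emb = T * C                             # position embedding
blocks = num_layers * (12 * C**2 + 9 * C)   # attention + MLP weights and biases per block
final_norm = 2 * C                          # final LayerNorm scale and shift

total = tok_emb + pos_emb + blocks + final_norm
print(f"{total:,}")  # 207,296
```

The blocks dominate: 198,912 of the 207,296 parameters live in the four TransformerBlocks, and that share only grows as models scale up.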
17.5 The training run
The same four-line training loop from §14.5, with three differences:
- get_batch samples from the Tiny Shakespeare data tensor (length 1,115,394) instead of the 256-token toy corpus.
- More steps (2,000 instead of 200) because the task is harder.
- We save the trained checkpoint and the tokenizer at the end so §17.6 can load them and generate.
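For reference, get_batch's job (built in Ch.14) fits in a few lines. The version below is a simplified stand-in, not the mygpt implementation — it samples B random windows of length T and pairs each with the same window shifted one position right:

```python
import torch

def get_batch_sketch(data: torch.Tensor, B: int, T: int):
    """Sample B random length-T windows from data; targets are shifted by one."""
    ix = torch.randint(len(data) - T, (B,))               # B random start offsets
    x = torch.stack([data[i : i + T] for i in ix])        # inputs, shape (B, T)
    y = torch.stack([data[i + 1 : i + T + 1] for i in ix])  # next-token targets
    return x, y

# On a synthetic "corpus" 0..999 the shift-by-one relationship is easy to see:
data = torch.arange(1000)
x, y = get_batch_sketch(data, B=16, T=64)
assert x.shape == (16, 64) and torch.equal(y, x + 1)
```

Nothing about this changes with corpus size — only len(data) grows, so the pool of possible windows grows with it.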
Save the following to 📄 experiments/39_train_shakespeare.py:
"""Experiment 39 — Train a 207k-parameter GPT on Tiny Shakespeare.
2,000 AdamW steps with B=16, T=64. Loss drops from ~41 (random init,
confidently wrong) to ~2.08 (a meaningful improvement over log(65) ≈ 4.17,
the random-baseline cross-entropy).
"""
import math
import time
import torch
from mygpt import GPT, CharTokenizer, get_batch, set_seed
def main() -> None:
    with open("tinyshakespeare.txt") as f:
        text = f.read()
    tok = CharTokenizer.from_text(text)
    data = tok.encode(text)

    set_seed(0)
    model = GPT(
        vocab_size=tok.vocab_size,
        embed_dim=64,
        num_heads=4,
        num_layers=4,
        max_seq_len=64,
        dropout=0.0,
    )
    n_params = sum(p.numel() for p in model.parameters() if p.requires_grad)
    optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
    B, T = 16, 64

    print(f"params: {n_params:,}")
    print(f"vocab_size: {tok.vocab_size}")
    print(f"corpus chars: {len(data):,}")
    print(f"batch_size B: {B}")
    print(f"seq_len T: {T}")
    print(f"reference log(V) = {math.log(tok.vocab_size):.4f}\n")

    set_seed(42)
    t0 = time.time()
    for step in range(1, 2001):
        x, y = get_batch(data, B, T)
        optimizer.zero_grad()
        _, loss = model(x, y)
        loss.backward()
        optimizer.step()
        if step in (1, 10, 100, 500, 1000, 2000):
            print(f"step {step:>4}: loss = {loss.item():.4f} ({time.time()-t0:.1f}s)")

    torch.save(model.state_dict(), "shakespeare_gpt.pt")
    tok.save("shakespeare_tokenizer.json")
    print("\nSaved shakespeare_gpt.pt and shakespeare_tokenizer.json")


if __name__ == "__main__":
    main()
Run it (this is the longest-running command in the tutorial — about a minute on a typical laptop CPU):
uv run python experiments/39_train_shakespeare.py
Expected output:
params: 207,296
vocab_size: 65
corpus chars: 1,115,394
batch_size B: 16
seq_len T: 64
reference log(V) = 4.1744
step 1: loss = 41.0367 (0.0s)
step 10: loss = 15.0190 (0.2s)
step 100: loss = 3.2371 (2.0s)
step 500: loss = 2.5944 (9.9s)
step 1000: loss = 2.3529 (19.5s)
step 2000: loss = 2.0785 (45.1s)
Saved shakespeare_gpt.pt and shakespeare_tokenizer.json
(Wall-clock seconds will differ on your machine; the loss values will not.)
Two shorthands to keep in mind while reading the curve below: loss is the negative log-probability the model assigns the correct next character; exp(loss) is the effective number of characters the model is still uncertain between. (We will use that second number to read the final loss of 2.08 at the end of this section.)
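Applied to the losses above, the two shorthands give (assuming only that loss is mean cross-entropy in nats):

```python
import math

# exp(loss)  = effective branching factor (chars the model is choosing among)
# exp(-loss) = mean probability assigned to the correct next character
for loss in (4.1744, 3.2371, 2.5944, 2.0785):
    print(f"loss {loss:.4f}: exp(loss) ≈ {math.exp(loss):5.1f} chars, "
          f"p(correct) ≈ {math.exp(-loss):.3f}")
```

The first line recovers the uniform baseline exactly — exp(log V) = 65 — and the last shows the final model choosing among roughly 8 characters per step.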
Read the curve:
- Step 1: loss ≈ 41. The randomly-initialised model is confidently wrong — its logits happen to point hard at random tokens, and cross-entropy of a confident-wrong prediction is unbounded. (See Ch.13 §13.4 for the full derivation; the principle is identical to Ch.14’s V=4 toy where step-1 loss was 5.27 — only the numbers scale with V.)
- Step 10: loss ≈ 15. Already collapsing toward log(V) ≈ 4.17; AdamW has knocked the wildly-confident logits down toward a flat distribution.
- Step 100: loss ≈ 3.24. Below the random-baseline log(V). The model now predicts character-level structure better than uniform guessing.
- Step 500: loss ≈ 2.59. It has learned letter co-occurrence (“q” tends to be followed by “u”, capital letters start words, line ends are followed by newlines).
- Step 2000: loss ≈ 2.08. It has learned the shape of Shakespeare — speaker labels, line breaks, the rough rhythm of words. Further training keeps lowering the loss but with diminishing returns per step; we stop here because the next chapter wraps everything in a CLI.
That ~2.08 does not mean the model writes Shakespeare. It means the model’s average uncertainty about the next character is e^2.08 ≈ 8 characters out of 65 — so it has narrowed the field by a factor of 8 over uniform guessing. We will see what 2.08-loss text looks like in the next section.
17.6 Sampling from the trained model
The trained checkpoint is saved as shakespeare_gpt.pt and the tokenizer as shakespeare_tokenizer.json. We reload both, prompt with "ROMEO:", and generate 200 new characters with temperature=1.0 and top_k=10.
Save the following to 📄 experiments/40_generate_shakespeare.py:
"""Experiment 40 — Sample from the trained Shakespeare GPT."""
import torch
from mygpt import GPT, CharTokenizer, generate, set_seed
def main() -> None:
    tok = CharTokenizer.load("shakespeare_tokenizer.json")

    set_seed(0)
    model = GPT(
        vocab_size=tok.vocab_size,
        embed_dim=64,
        num_heads=4,
        num_layers=4,
        max_seq_len=64,
        dropout=0.0,
    )
    model.load_state_dict(torch.load("shakespeare_gpt.pt"))

    set_seed(0)
    prompt = tok.encode("ROMEO:").unsqueeze(0)
    out = generate(model, prompt, max_new_tokens=200, temperature=1.0, top_k=10)
    print(tok.decode(out[0]))


if __name__ == "__main__":
    main()
Run it:
uv run python experiments/40_generate_shakespeare.py
Expected output:
ROMEO:
Thy momed has seltered, a neark'ly your tle centeloourse.
Of therere hath thin beielly saneer best.
BRINCE:
Bucker I to my yet, tronen my bety sevene you for mad, bendoth,
Whe a bros swencurenty hou
Read what worked:
- Speaker label structure. ROMEO: is followed by a newline and dialogue. The model produces a new speaker label BRINCE: (not a real Shakespeare character, but the right shape: capital letters, colon, line break) after the first speech.
- Word-shaped tokens. Thy, your, Of, hath, to, my, for, you are real English words. Most of the rest are pronounceable nonsense.
- Line breaks and punctuation. Commas mid-line, periods at the end. The rhythm of two-to-three short clauses per line is recognisably Shakespearean.
Read what didn’t:
- Most “words” are gibberish (momed, seltered, centeloourse).
- No grammar to speak of.
- No semantic content — BRINCE is not a character, and the text is not making any argument.
This is exactly what a 207k-parameter character-level model trained on 1 MB for 45 seconds should produce: it has learned the shape of the data (speakers, line breaks, capitalisation, the most common words) but not the deeper structure. Bigger models trained for longer on larger corpora — say, GPT-2 small at ~124M parameters — produce qualitatively much more coherent pastiche.
17.7 Experiments
- Cooler temperature. Edit experiments/40_generate_shakespeare.py to call generate(model, prompt, max_new_tokens=200, temperature=0.5, top_k=10). The output stays closer to the most-likely token at every step — fewer gibberish words, more repetition.
- Hotter temperature. Try temperature=1.5. The output gets noticeably more random — more gibberish, more odd character sequences (xq, kv).
- Bigger top-k. Try top_k=40. The output sees more of the long tail; some samples are more creative, others are more nonsensical.
- Different prompt. Try prompt = tok.encode("KING HENRY:").unsqueeze(0). The model carries over the speaker-label structure: it produces something speaker-like after the colon.
- Train longer. Edit experiments/39_train_shakespeare.py to use range(1, 5001) instead of range(1, 2001). Watch the loss continue to drop — past 2000 steps the gain per step is small but nonzero. Re-generate and notice the words are slightly less broken.
- Train smaller. Edit experiments/39_train_shakespeare.py to use embed_dim=32, num_heads=2, num_layers=2. Re-train. The loss plateaus higher (around 2.5) and the generated text is qualitatively worse — more repetition, fewer real words. Capacity matters.
After each experiment, restore any file you changed before moving on.
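The temperature and top-k knobs in the experiments above act on the logits just before sampling. A minimal sketch of the standard recipe — this assumes mygpt's generate (Ch.15) follows the usual divide-by-temperature-then-mask pattern — on a made-up 5-token logit vector:

```python
import torch

logits = torch.tensor([2.0, 1.0, 0.5, -1.0, -3.0])  # one step's next-char scores

def step_probs(logits: torch.Tensor, temperature: float, top_k: int) -> torch.Tensor:
    scaled = logits / temperature                  # <1 sharpens, >1 flattens
    kth = torch.topk(scaled, top_k).values[-1]     # k-th largest scaled logit
    masked = scaled.masked_fill(scaled < kth, float("-inf"))  # drop the long tail
    return torch.softmax(masked, dim=-1)           # -inf entries get probability 0

for t in (0.5, 1.0, 1.5):
    p = step_probs(logits, temperature=t, top_k=3)
    print(f"T={t}: top prob {p.max():.3f}, options kept {(p > 0).sum().item()}")
```

Running this shows the pattern the experiments describe: lower temperature concentrates mass on the top token, higher temperature spreads it out, and top_k caps how many tokens can ever be sampled regardless of temperature.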
17.8 Exercises
- What is “loss = 2.08” in plain English? Cross-entropy loss is the negative log-likelihood per token, in nats. The model’s average probability assigned to the correct next character is therefore $e^{-2.08} \approx 0.125$, or about 12.5%. Compare that to the uniform baseline $1/V = 1/65 \approx 1.5\%$. The model is roughly 8× better than random guessing per character. Argue from this that loss reductions of 0.1 matter far more at low loss than at high loss (the relative likelihood gain is exponential in the loss reduction).
- Why does step-1 loss exceed log(V)? A perfectly uniform random predictor has loss exactly log(V) = 4.17 for V=65. But our random initialisation has wide logits, so it predicts with high confidence on random tokens — and a confident wrong prediction is penalised far more heavily than a diffuse one: the loss grows roughly linearly with the margin by which the wrong logit wins. Sketch a 3-token softmax with logits (10, 0, 0) and one with logits (0, 0, 0), and compute the cross-entropy if the correct class is 2 in each case.
- Parameter budget. From the formulas in earlier chapters, derive that a GPT(V, C, h, N, L) model has approximately V*C + L*C + N*(12*C^2 + 9*C) + 2*C parameters (ignoring the LM head, which is tied to the token embedding). Plug in V=65, C=64, N=4, L=64; you should get 207,296. (Hint: 4*(12*64^2 + 9*64) = 4*49728 = 198,912 for the blocks.)
- Why character-level is hard. A typical Shakespeare line is ~50 characters. At T=64 we see at most one full line plus a bit. Argue that this means the model effectively cannot learn dependencies that span a stanza or a speech, only ones that span a single line — which is one reason real models use much larger context windows (T=1024 for GPT-2, T=128k+ for newer models).
17.9 What’s next
We can now train and generate on real text. The last piece is ergonomics: replacing the bespoke experiments/39_train_shakespeare.py and experiments/40_generate_shakespeare.py scripts with a proper command-line interface. Chapter 18 builds:
- mygpt train file.txt — read any text file, build a tokenizer, train a model, save both as a checkpoint.
- mygpt generate --checkpoint ckpt.pt --prompt "..." — load a checkpoint, sample from it.
- Checkpoint files that bundle the tokenizer with the model so a checkpoint is a self-contained unit.
After Chapter 18, the package is done — the §1.10 promise (“a Python package called mygpt that you can train on a text file and use to generate text from the command line”) is fully delivered.
Looking ahead — what to remember from this chapter
- Nothing in mygpt changed. We composed CharTokenizer (Ch.16), GPT (Ch.12–13), get_batch (Ch.14), the AdamW loop (Ch.14), and generate (Ch.15) into one end-to-end pipeline.
- A 207k-parameter character-level GPT trained on 1 MB of Shakespeare for 45 seconds reaches loss ≈ 2.08, about 8× better than random per character. It produces text with speaker labels, line breaks, and recognisable English words — the shape of Shakespeare, not the content.
- Loss reductions are exponential in their effect on per-token probability — 0.1 loss at step 100 buys far less than 0.1 loss at step 2000.
- A trained checkpoint is meaningless without its tokenizer. Save them together; load them together.
On to Chapter 18 — Checkpoints, inference, and a CLI (coming soon).