Part II — Advanced Topics
Ten chapters that upgrade mygpt from a hand-rolled toy to a codebase whose architecture and training recipe match modern open-weight LLMs (Llama-style: BPE + RoPE + RMSNorm + GQA + cosine LR + bf16). Same model size as Part I; modern recipe.
Prerequisites
Part I — LLM Fundamentals, finished. An Apple M1 / M2 / M3 / M4 Mac (any RAM tier; 8 GB works), a CUDA GPU, or a willingness to wait on CPU. ~10 GB of free disk for the Chapter 28 Wikipedia subset.
What changes
- Ch.19–21 — training infrastructure: device-aware (MPS/CUDA/CPU), mixed precision (bf16), validation loss + cosine LR schedule + gradient clipping (sketched after this list).
- Ch.22–23 — BPE tokenization: build the algorithm from scratch, then wire it into `mygpt` alongside the existing `CharTokenizer` (merge loop sketched below).
- Ch.24–26 — modern architecture: replace `LayerNorm` with `RMSNorm`, learned position embeddings with RoPE, multi-head attention with GQA (RMSNorm and RoPE sketched below).
- Ch.27–28 — payoff: same training run as Part I but with the modern stack (Ch.27); then a real ~500 MB Wikipedia training run on M1 in 1–3 hours (Ch.28).
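To make the Ch.19–21 pieces concrete, here is a minimal sketch of a device-aware, bf16, cosine-scheduled training step. All names (`pick_device`, `train_step`, the hyperparameter values) are illustrative placeholders rather than mygpt's actual API; only the techniques themselves (device fallback, bf16 autocast, warmup plus cosine decay, gradient clipping) match what the chapters build.

```python
import math
import torch
import torch.nn.functional as F

def pick_device() -> torch.device:
    # Ch.19: prefer CUDA, then Apple-silicon MPS, then fall back to CPU.
    if torch.cuda.is_available():
        return torch.device("cuda")
    if torch.backends.mps.is_available():
        return torch.device("mps")
    return torch.device("cpu")

def cosine_lr(step: int, warmup: int, total: int) -> float:
    # Ch.21: LR multiplier, linear warmup then cosine decay to zero.
    if step < warmup:
        return step / max(1, warmup)
    progress = (step - warmup) / max(1, total - warmup)
    return 0.5 * (1.0 + math.cos(math.pi * progress))

def train_step(model, optimizer, x, y, step, *,
               device, base_lr=3e-4, warmup=100, total=5000):
    for group in optimizer.param_groups:
        group["lr"] = base_lr * cosine_lr(step, warmup, total)
    # Ch.20: bf16 autocast; unlike fp16 it needs no GradScaler.
    # (Autocast backend coverage varies; CUDA is the safest bet.)
    with torch.autocast(device_type=device.type, dtype=torch.bfloat16):
        logits = model(x)
        loss = F.cross_entropy(logits.view(-1, logits.size(-1)), y.view(-1))
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    # Ch.21: clip the global gradient norm so one bad batch can't blow up training.
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()
    return loss.item()
```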
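Next, the heart of Ch.22 in a few lines: a deliberately simplified, whole-string sketch of the BPE training loop. Real BPE (and mygpt's `BPETokenizer`) works over word-frequency tables, handles bytes, and builds a vocabulary; this shows only the merge rule.

```python
from collections import Counter

def train_bpe(text: str, num_merges: int) -> list[tuple[str, str]]:
    symbols = list(text)  # start from individual characters
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(symbols, symbols[1:]))
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        merged, i = [], 0
        while i < len(symbols):
            # Replace every occurrence of the winning pair with its fusion.
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == best:
                merged.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                merged.append(symbols[i])
                i += 1
        symbols = merged
    return merges

# e.g. train_bpe("low lower lowest", 3) learns ('l', 'o'), then ('lo', 'w'), ...
```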
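For Ch.24–26, the two swaps that fit in a short sketch are RMSNorm and RoPE, shown below in the usual Llama-style formulation. GQA's core trick (fewer KV heads, each shared by a group of query heads) is left to Ch.26. Class and function names here are illustrative, not mygpt's.

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Ch.24: normalize by root-mean-square; no mean subtraction, no bias."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))  # learned gain only

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight

def apply_rope(x: torch.Tensor, base: float = 10000.0) -> torch.Tensor:
    """Ch.25: rotate (first-half, second-half) channel pairs of q or k by
    position-dependent angles; x is (batch, heads, seq, head_dim)."""
    T, D = x.shape[-2], x.shape[-1]
    half = D // 2
    freqs = base ** (-torch.arange(half, device=x.device, dtype=torch.float32) / half)
    angles = torch.arange(T, device=x.device, dtype=torch.float32)[:, None] * freqs
    cos, sin = angles.cos(), angles.sin()  # each (T, half), broadcast over batch/heads
    x1, x2 = x[..., :half], x[..., half:]
    return torch.cat([x1 * cos - x2 * sin, x1 * sin + x2 * cos], dim=-1)
```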
Backward compatibility: every Part-I checkpoint continues to load. Architecture flags (--norm, --position, --num-kv-heads, --tokenizer) coexist with Part-I defaults; Part-II checkpoints record which combination was used.
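One plausible shape for those flags, sketched with argparse: the flag names come from this page, but the choices and defaults are assumptions rather than mygpt's real CLI.

```python
import argparse

parser = argparse.ArgumentParser(description="hypothetical mygpt training CLI")
# Defaults reproduce Part-I behavior, so old invocations keep working.
parser.add_argument("--norm", choices=["layernorm", "rmsnorm"], default="layernorm")
parser.add_argument("--position", choices=["learned", "rope"], default="learned")
parser.add_argument("--num-kv-heads", type=int, default=None)  # None = plain MHA
parser.add_argument("--tokenizer", choices=["char", "bpe"], default="char")
args = parser.parse_args()

# A Part-II checkpoint would record this combination next to the weights,
# so loading code knows which architecture to rebuild.
arch_config = vars(args)
```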
Stuck on a chapter? Each chapter has a `chapter_states/chNN/` snapshot — a complete, runnable `uv` package matching its end state. If you get lost partway through Ch.25 (say), `cp -r chapter_states/ch24/ <your-working-dir>` to start from a known-good state. See `chapter_states/README.md` on GitHub.
Table of contents
- 19. Device-aware training
- 20. Mixed precision training (bf16)
- 21. Training-loop hardening
- 22. BPE from scratch (algorithm)
- 23. BPETokenizer in mygpt
- 24. RMSNorm replaces LayerNorm
- 25. RoPE — rotary position embeddings
- 26. GQA — grouped-query attention
- 27. Modern recipe vs Ch.17 baseline
- 28. Modern recipe at scale (Wikipedia)