Part II — Advanced Topics

Ten chapters that upgrade mygpt from a hand-rolled toy to a codebase whose architecture and training recipe match modern open-weight LLMs (Llama-style: BPE + RoPE + RMSNorm + GQA + cosine LR + bf16). Same model size as Part I; modern recipe.


Prerequisites

You should have finished Part I — LLM Fundamentals. You'll need an Apple M1 / M2 / M3 / M4 Mac (any RAM tier; 8 GB works), a CUDA GPU, or the willingness to wait on CPU, plus ~10 GB of free disk for the Chapter 28 Wikipedia subset.


What changes

  • Ch.19–21 — training infrastructure: device-aware training (MPS/CUDA/CPU), mixed precision (bf16), validation loss + cosine LR schedule + gradient clipping (first sketch after this list).
  • Ch.22–23 — BPE tokenization: build the algorithm from scratch (second sketch after this list), then wire it into mygpt alongside the existing CharTokenizer.
  • Ch.24–26 — modern architecture: replace LayerNorm with RMSNorm (third sketch after this list), learned position embeddings with RoPE, and multi-head attention with GQA.
  • Ch.27–28 — payoff: same training run as Part I but with the modern stack (Ch.27); then a real ~500 MB Wikipedia training run on M1 in 1–3 hours (Ch.28).
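
The Ch.19–21 infrastructure changes compress into a short training loop. A minimal sketch, assuming a recent PyTorch build (autocast on MPS only works in newer releases); the model, batch, and loss below are toy placeholders, not mygpt's actual API:

```python
import math
import torch

# Ch.19: pick the best available device at runtime.
device = (
    "cuda" if torch.cuda.is_available()
    else "mps" if torch.backends.mps.is_available()
    else "cpu"
)

# Toy stand-ins so the sketch runs end to end; mygpt's model and batching differ.
model = torch.nn.Linear(64, 64).to(device)
optimizer = torch.optim.AdamW(model.parameters(), lr=3e-4)
max_steps, warmup_steps, max_lr, min_lr = 1000, 100, 3e-4, 3e-5

def lr_at(step: int) -> float:
    # Ch.21: linear warmup, then cosine decay down to min_lr.
    if step < warmup_steps:
        return max_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / (max_steps - warmup_steps)
    return min_lr + 0.5 * (max_lr - min_lr) * (1 + math.cos(math.pi * progress))

for step in range(max_steps):
    for group in optimizer.param_groups:
        group["lr"] = lr_at(step)
    x = torch.randn(8, 64, device=device)  # placeholder batch
    # Ch.20: bf16 autocast on CUDA/MPS; matmuls run in bfloat16 where safe.
    with torch.autocast(device_type=device, dtype=torch.bfloat16,
                        enabled=device != "cpu"):
        loss = model(x).square().mean()    # placeholder loss
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    # Ch.21: clip the global gradient norm to stabilize training.
    torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
    optimizer.step()
```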
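
The core of Ch.22's algorithm fits in one function: count adjacent symbol pairs, merge the most frequent, repeat. A self-contained toy version (it merges across spaces; a real tokenizer, Ch.22's included, would respect word boundaries):

```python
from collections import Counter

def bpe_merges(text: str, num_merges: int) -> list[tuple[str, str]]:
    """Learn BPE merge rules over a toy corpus, starting from characters."""
    seq = list(text)
    merges = []
    for _ in range(num_merges):
        pairs = Counter(zip(seq, seq[1:]))   # count adjacent pairs
        if not pairs:
            break
        best = pairs.most_common(1)[0][0]    # most frequent pair wins
        merges.append(best)
        merged, i = [], 0
        while i < len(seq):                  # replace every occurrence of `best`
            if i + 1 < len(seq) and (seq[i], seq[i + 1]) == best:
                merged.append(seq[i] + seq[i + 1])
                i += 2
            else:
                merged.append(seq[i])
                i += 1
        seq = merged
    return merges

print(bpe_merges("low lower lowest", 3))  # e.g. [('l', 'o'), ('lo', 'w'), ...]
```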
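
Of the three Ch.24–26 swaps, RMSNorm is the smallest: LayerNorm without mean-centering or bias. A sketch of the module as it's usually written (RoPE and GQA are larger changes and get their own chapters):

```python
import torch
import torch.nn as nn

class RMSNorm(nn.Module):
    """Scale activations by their root-mean-square; learned gain, no bias."""
    def __init__(self, dim: int, eps: float = 1e-6):
        super().__init__()
        self.eps = eps
        self.weight = nn.Parameter(torch.ones(dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        rms = torch.rsqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
        return x * rms * self.weight
```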

Backward compatibility: every Part-I checkpoint continues to load. Architecture flags (--norm, --position, --num-kv-heads, --tokenizer) coexist with Part-I defaults; Part-II checkpoints record which combination was used.
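
One plausible shape for that recorded combination (field names here are hypothetical; mygpt's actual checkpoint keys may differ):

```python
# Hypothetical sketch of the architecture metadata a Part-II checkpoint
# might store alongside the weights; Part-I defaults shown in comments.
config = {
    "norm": "rmsnorm",      # --norm          (Part I: LayerNorm)
    "position": "rope",     # --position      (Part I: learned embeddings)
    "num_kv_heads": 2,      # --num-kv-heads  (Part I: equal to num heads, i.e. MHA)
    "tokenizer": "bpe",     # --tokenizer     (Part I: CharTokenizer)
}
```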


Chapters

  19. Device-aware training
  20. Mixed precision training (bf16)
  21. Training-loop hardening
  22. BPE from scratch (algorithm)
  23. BPETokenizer in mygpt
  24. RMSNorm replaces LayerNorm
  25. RoPE: rotary position embeddings
  26. GQA: grouped-query attention
  27. Modern recipe vs Ch.17 baseline
  28. Modern recipe at scale (Wikipedia)

Stuck on a chapter? Each chapter has a chapter_states/chNN/ snapshot — a complete, runnable uv package matching its end-state. If you get lost partway through Ch.25 (say), run cp -r chapter_states/ch24/ <your-working-dir> to start over from a known-good state. See chapter_states/README.md on GitHub.


Table of contents