Chapter 24 — RMSNorm replaces LayerNorm
LayerNorm (Chapter 10) does two things to each token vector: it subtracts the mean, then it divides by the standard deviation. Modern transformer architectures — Llama, Mistral, Qwen, every Llama-derived open-weight model — use a simpler operation called RMSNorm: skip the mean subtraction, divide by the root mean square instead, and drop the bias. Slightly fewer parameters, slightly faster, no observable training-quality penalty.
By the end of this chapter you will have:
- understood the difference between LayerNorm and RMSNorm in one formula,
- added an RMSNorm class to mygpt next to LayerNorm,
- threaded a norm_type config option through GPT, TransformerBlock, and the checkpoint format,
- added a --norm {layer, rms} flag to mygpt train (default layer so Ch.17–23 expected outputs continue to bit-reproduce),
- trained the same Tiny Shakespeare model with both norms and compared their parameter counts and loss curves,
- confirmed that every pre-Ch.24 checkpoint still loads correctly (the new norm_type field defaults to "layer" when missing).
24.1 Setup
This chapter assumes Chapter 23 — mygpt/ has a BPETokenizer and its import collections. Neither is touched in this chapter; only the model architecture changes.
Confirm Tiny Shakespeare is in place:
ls tinyshakespeare.txt || curl -s -o tinyshakespeare.txt https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
You are ready.
24.2 LayerNorm vs RMSNorm in one formula
Recall LayerNorm from Chapter 10. For each token vector $\mathbf{x} \in \mathbb{R}^{C}$:
\[\text{LayerNorm}(\mathbf{x}) = \frac{\mathbf{x} - \mu(\mathbf{x})}{\sqrt{\sigma^2(\mathbf{x}) + \varepsilon}} \cdot \mathbf{w} + \mathbf{b}, \quad \mu = \tfrac{1}{C}\sum_i x_i, \quad \sigma^2 = \tfrac{1}{C}\sum_i (x_i - \mu)^2.\]Two trainable vectors of size $C$ each: gain $\mathbf{w}$ and bias $\mathbf{b}$. Total parameters per LayerNorm instance: $2C$.
RMSNorm (Zhang & Sennrich, 2019) drops both the mean subtraction and the bias:
\[\text{RMSNorm}(\mathbf{x}) = \frac{\mathbf{x}}{\sqrt{\frac{1}{C}\sum_i x_i^2 + \varepsilon}} \cdot \mathbf{w}.\]One trainable vector of size $C$: gain $\mathbf{w}$. Total parameters per RMSNorm instance: $C$. Half the parameters, half the arithmetic ops in the forward pass.
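Both formulas are easy to check numerically before touching mygpt. Here is a sketch in plain torch (nothing below depends on this chapter's code):

import torch

C = 8
x = torch.randn(C) + 3.0  # deliberately non-centred
eps = 1e-5

# LayerNorm with unit gain and zero bias, by hand vs PyTorch's built-in:
mu, var = x.mean(), x.var(unbiased=False)
ln = (x - mu) / torch.sqrt(var + eps)
assert torch.allclose(ln, torch.nn.functional.layer_norm(x, (C,), eps=eps), atol=1e-5)

# RMSNorm with unit gain, by hand — the output has unit RMS by construction:
rms = torch.sqrt(x.pow(2).mean() + eps)
print((x / rms).pow(2).mean())  # ≈ 1.0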
Why does this work just as well? Empirically, the centering step in LayerNorm doesn’t carry much load — the invariance to the mean of $\mathbf{x}$ that LayerNorm provides is mostly redundant with what attention and the MLP already give the model. The bias term is similarly inessential: every linear layer in the model already has a bias (or doesn’t, by design choice), so a separate norm-bias is a duplicate degree of freedom. The Zhang & Sennrich paper trained a Transformer on a translation benchmark with each variant and showed indistinguishable convergence. Llama adopted RMSNorm in 2023; everything downstream from Llama uses it.
There is no theory that RMSNorm is better. The argument is pragmatic: same training quality, slightly cheaper, simpler code.
24.3 The RMSNorm class
Identical in shape to LayerNorm from Ch.10 — an nn.Module with one trainable weight parameter. The forward pass is two lines.
Append the following class to 📄 src/mygpt/__init__.py (right after the existing LayerNorm class, before TransformerBlock):
class RMSNorm(nn.Module):
"""Root-mean-square layer norm (Llama / Mistral default).
Compared to LayerNorm:
- drops the bias term,
- drops the mean subtraction (no centring),
- normalises by the root-mean-square of x along the last axis.
Forward: ``x / sqrt(mean(x²) + eps) * weight``.
"""
def __init__(self, embed_dim: int, eps: float = 1e-5) -> None:
super().__init__()
self.embed_dim = embed_dim
self.eps = eps
self.weight = nn.Parameter(torch.ones(embed_dim))
def forward(self, x: torch.Tensor) -> torch.Tensor:
rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + self.eps)
return x / rms * self.weight
Three things to read off:
- One parameter, not two. self.weight is a vector of C ones at init; there is no self.bias. The half-parameter saving compounds: a 4-block transformer has 9 norm instances (2 per block + 1 final), so for C = 64 we save 9 × 64 = 576 parameters.
- x.pow(2).mean(...) is the mean of squared values; its square root is the root mean square. For a centred vector ($\mu = 0$) the RMS equals the standard deviation; in general it equals $\sqrt{\sigma^2 + \mu^2}$. RMSNorm tolerates non-centred input by design.
- No unbiased=False argument is needed, unlike LayerNorm. We compute the mean over all C elements; there is no Bessel-correction question because there is no variance computation.
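The second bullet's identity is worth confirming once in a REPL. A sketch — it assumes the RMSNorm above and Ch.10's LayerNorm are importable from mygpt and share the same eps placement:

import torch
from mygpt import LayerNorm, RMSNorm

x = torch.randn(4, 16, 64)  # (batch, seq, C) — non-centred in general
mu = x.mean(dim=-1, keepdim=True)
sigma2 = x.var(dim=-1, unbiased=False, keepdim=True)
rms = torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True))
assert torch.allclose(rms, torch.sqrt(sigma2 + mu.pow(2)), atol=1e-5)  # RMS² = σ² + µ²

# On exactly-centred input the two norms agree at init (gain 1, bias 0):
xc = x - mu
assert torch.allclose(RMSNorm(64)(xc), LayerNorm(64)(xc), atol=1e-4)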
24.4 The norm-selector helper
We want GPT and TransformerBlock to be able to use either norm without duplicating their constructors. A small factory function does the dispatch.
Append the following helper to 📄 src/mygpt/__init__.py (right after RMSNorm, before TransformerBlock):
def _make_norm(embed_dim: int, norm_type: str) -> nn.Module:
"""Norm-class selector. `norm_type` is 'layer' or 'rms'."""
if norm_type == "layer":
return LayerNorm(embed_dim)
if norm_type == "rms":
return RMSNorm(embed_dim)
raise ValueError(f"unknown norm_type: {norm_type!r} (expected 'layer' or 'rms')")
The leading underscore signals “internal helper” — students should pass norm_type to GPT() directly, not call _make_norm themselves.
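Still, a quick REPL check of the dispatch doesn't hurt (a sketch; we import the private helper here only to demonstrate it — real call sites go through GPT(...)):

from mygpt import _make_norm

print(type(_make_norm(64, "layer")).__name__)  # LayerNorm
print(type(_make_norm(64, "rms")).__name__)    # RMSNorm
try:
    _make_norm(64, "batch")
except ValueError as e:
    print(e)  # unknown norm_type: 'batch' (expected 'layer' or 'rms')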
24.5 Threading norm_type through the model
TransformerBlock and GPT both currently hard-code LayerNorm. We change them to accept a norm_type parameter and dispatch via _make_norm. Default "layer" so existing call sites work unchanged.
Replace TransformerBlock in 📄 src/mygpt/__init__.py:
class TransformerBlock(nn.Module):
def __init__(self, embed_dim, num_heads, max_seq_len=64, dropout=0.0, norm_type="layer"):
super().__init__()
self.embed_dim = embed_dim
self.num_heads = num_heads
self.norm_type = norm_type
self.ln1 = _make_norm(embed_dim, norm_type)
self.mha = MultiHeadAttention(embed_dim, num_heads, max_seq_len, dropout)
self.ln2 = _make_norm(embed_dim, norm_type)
self.mlp = MLP(embed_dim, dropout)
def forward(self, x):
x = x + self.mha(self.ln1(x))
x = x + self.mlp(self.ln2(x))
return x
(One new constructor parameter, two _make_norm calls replacing direct LayerNorm instantiation, and a stored self.norm_type for introspection. Forward is unchanged.)
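You can confirm the per-block saving before training anything. A sketch, assuming the class above is saved in mygpt:

from mygpt import TransformerBlock

def n_params(m):
    return sum(p.numel() for p in m.parameters())

layer_block = TransformerBlock(64, 4, norm_type="layer")
rms_block = TransformerBlock(64, 4, norm_type="rms")
print(n_params(layer_block) - n_params(rms_block))  # 128 — two norms per block × C = 64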
Replace GPT.__init__ in 📄 src/mygpt/__init__.py:
class GPT(nn.Module):
"""Full GPT-2-style decoder-only transformer with weight-tied head."""
def __init__(self, vocab_size, embed_dim, num_heads, num_layers,
max_seq_len=64, dropout=0.0, norm_type="layer"):
super().__init__()
self.vocab_size = vocab_size
self.embed_dim = embed_dim
self.num_heads = num_heads
self.num_layers = num_layers
self.max_seq_len = max_seq_len
self.norm_type = norm_type
self.token_embedding = TokenEmbedding(vocab_size, embed_dim)
self.position_embedding = nn.Embedding(max_seq_len, embed_dim)
self.embed_drop = nn.Dropout(dropout)
self.blocks = nn.Sequential(*[
TransformerBlock(embed_dim, num_heads, max_seq_len, dropout, norm_type)
for _ in range(num_layers)
])
self.ln_f = _make_norm(embed_dim, norm_type)
(Two new lines: self.norm_type = norm_type and the norm_type argument added to TransformerBlock(...). The final norm ln_f switches from direct LayerNorm(embed_dim) to _make_norm(embed_dim, norm_type). The forward method is unchanged from Ch.13.)
24.6 Checkpoint format adds norm_type
A model trained with RMSNorm has a different parameter set from one trained with LayerNorm — fewer tensors (no norm biases), and therefore different state-dict keys. The checkpoint must record which norm was used, or load_checkpoint will try to construct the wrong architecture.
We add one new field, norm_type, to the saved config. Crucially, when loading we read it with .get(..., "layer") — pre-Ch.24 checkpoints don’t have this field, and we want them to default to LayerNorm (their original behavior).
Replace save_checkpoint in 📄 src/mygpt/__init__.py:
def save_checkpoint(model: "GPT", tokenizer: "CharTokenizer", path: str) -> None:
"""Bundle model weights, tokenizer, and architecture into one .ckpt file."""
torch.save(
{
"model_state_dict": model.state_dict(),
"tokenizer_chars": tokenizer.chars,
"config": {
"vocab_size": model.vocab_size,
"embed_dim": model.embed_dim,
"num_heads": model.num_heads,
"num_layers": model.num_layers,
"max_seq_len": model.max_seq_len,
"norm_type": getattr(model, "norm_type", "layer"),
},
},
path,
)
The getattr(..., "layer") is defensive — a model created in Ch.23 code wouldn’t have a norm_type attribute; we fall back to "layer".
Replace load_checkpoint in 📄 src/mygpt/__init__.py:
def load_checkpoint(path: str) -> tuple["GPT", "CharTokenizer"]:
"""Reload a (model, tokenizer) pair from a checkpoint produced by `save_checkpoint`.
Always loads to CPU; the caller is responsible for `.to(device)` afterwards.
Pre-Ch.24 checkpoints have no `norm_type` field; they default to ``"layer"``
so old `.ckpt` files continue to load correctly.
"""
ckpt = torch.load(path, map_location="cpu")
config = ckpt["config"]
tokenizer = CharTokenizer(ckpt["tokenizer_chars"])
model = GPT(
vocab_size=config["vocab_size"],
embed_dim=config["embed_dim"],
num_heads=config["num_heads"],
num_layers=config["num_layers"],
max_seq_len=config["max_seq_len"],
dropout=0.0,
norm_type=config.get("norm_type", "layer"),
)
model.load_state_dict(ckpt["model_state_dict"])
return model, tokenizer
The config.get("norm_type", "layer") is the backward-compatibility hinge. A Chapter 18 shakespeare.ckpt will load without modification, because its config has no norm_type key, and we default it to "layer".
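The hinge is ordinary dict behaviour; here it is in isolation (a sketch with a hand-written config dict, not a real checkpoint's contents):

old_config = {"vocab_size": 65, "embed_dim": 64, "num_heads": 4,
              "num_layers": 4, "max_seq_len": 64}   # pre-Ch.24: no "norm_type" key
print(old_config.get("norm_type", "layer"))          # -> "layer"

new_config = dict(old_config, norm_type="rms")       # Ch.24: recorded explicitly
print(new_config.get("norm_type", "layer"))          # -> "rms"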
24.7 The --norm CLI flag
Wire it through _train_command. Two edits: the GPT(...) call now passes norm_type=args.norm, and the print block adds a norm: line (matching the device: and precision: lines already there).
In _train_command, the GPT(...) constructor call:
model = GPT(
vocab_size=tokenizer.vocab_size,
embed_dim=args.embed_dim,
num_heads=args.num_heads,
num_layers=args.num_layers,
max_seq_len=args.max_seq_len,
dropout=args.dropout,
norm_type=args.norm,
).to(device)
And the print block (right after precision:):
print(f"device: {device}")
print(f"precision: {args.precision}")
print(f"norm: {args.norm}")
print(f"corpus chars: {len(text):,}")
# … rest unchanged
In main’s argparse setup, add to p_train (right after the --max-grad-norm block, before set_defaults(...)):
p_train.add_argument(
"--norm",
choices=["layer", "rms"],
default="layer",
help="Normalisation: 'layer' (default; LayerNorm, Ch.10) or 'rms' (RMSNorm, Llama default).",
)
24.8 Backward-compat: defaults still reproduce Ch.21
Sanity check: mygpt train with the default --norm layer must still produce the Ch.21 / Ch.20 / Ch.19 / Ch.17 default loss curve. If it didn’t, we’d have changed semantics, not just added a feature.
uv run mygpt train tinyshakespeare.txt --device mps --output sh-layer.ckpt
Expected output:
device: mps
precision: fp32
norm: layer
corpus chars: 1,115,394
train chars: 1,115,394
vocab_size: 65
params: 207,296
steps: 2000
schedule: constant (warmup=0)
max_grad_norm:0.0
step 1: loss = 41.0367
step 500: loss = 2.5944
step 1000: loss = 2.3529
step 1500: loss = 2.1795
step 2000: loss = 2.0785
saved checkpoint to sh-layer.ckpt
The new norm: layer line appears in the header but the loss values are identical to every previous chapter’s default run. With every flag at its default, the loop is the same loop. Backward-compat preserved.
24.9 The RMSNorm run
Now the same training with --norm rms:
uv run mygpt train tinyshakespeare.txt --device mps --norm rms --output sh-rms.ckpt
Expected output:
device: mps
precision: fp32
norm: rms
corpus chars: 1,115,394
train chars: 1,115,394
vocab_size: 65
params: 206,720
steps: 2000
schedule: constant (warmup=0)
max_grad_norm:0.0
step 1: loss = 41.2164
step 500: loss = 2.5935
step 1000: loss = 2.3517
step 1500: loss = 2.1689
step 2000: loss = 2.0752
saved checkpoint to sh-rms.ckpt
Two things to read off:
1. Parameter count drops from 207,296 to 206,720 — exactly 576 fewer parameters. Where did they go? RMSNorm has only embed_dim parameters per instance (the gain), versus LayerNorm’s 2 × embed_dim (gain + bias). The difference is embed_dim per norm instance. Our model has 9 norm instances (2 per TransformerBlock × 4 blocks + 1 final ln_f), so we save 9 × 64 = 576 parameters.
2. Loss curve is barely different from LayerNorm’s — within ~0.5% at every step, and slightly lower at the final step (2.0752 vs 2.0785). The difference is dominated by initialisation noise, not by an architectural advantage. RMSNorm is not better than LayerNorm at this scale; it is comparable, slightly cheaper. That is exactly the empirical result Zhang & Sennrich reported in 2019 and the basis on which Llama adopted it.
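You can reproduce the two params: numbers without training. A sketch — it assumes the train command's default architecture is vocab 65, C=64, 4 heads, 4 layers, context 64, matching the headers above:

from mygpt import GPT

def n_params(m):
    return sum(p.numel() for p in m.parameters())

kw = dict(vocab_size=65, embed_dim=64, num_heads=4, num_layers=4, max_seq_len=64)
print(n_params(GPT(**kw, norm_type="layer")))  # 207296 — matches §24.8
print(n_params(GPT(**kw, norm_type="rms")))    # 206720 — matches §24.9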
24.10 Backward-compat: pre-Ch.24 checkpoint still loads
The load_checkpoint config.get("norm_type", "layer") default means any .ckpt file produced by Ch.18 / 19 / 20 / 21 / 22 / 23 still works. Sanity-check by generating from the LayerNorm checkpoint we just saved:
uv run mygpt generate --checkpoint sh-layer.ckpt --prompt "ROMEO:" --device cpu
Expected output (matches Ch.17 §17.6 byte-for-byte):
device: cpu
ROMEO:
Thy momed has seltered, a neark'ly your tle centeloourse.
Of therere hath thin beielly saneer best.
BRINCE:
Bucker I to my yet, tronen my bety sevene you for mad, bendoth,
Whe a bros swencurenty hou
Now the RMSNorm checkpoint:
uv run mygpt generate --checkpoint sh-rms.ckpt --prompt "ROMEO:" --device cpu
Expected output (RMSNorm produces a different sample — same architecture topology, different parameters because of the missing bias terms):
device: cpu
ROMEO:
Thy momed haveseltered ad neards way.
ISele fent hourese his therere his herins ore sese. I him
We athor to siely deall: my art, tronen my betyod wely stone.
FRUPENO:
My your sarin shus the shere wa
The first 16 characters (ROMEO:\nThy momed) match the LayerNorm sample exactly because at --top-k 10 and very low entropy the most-likely next token wins regardless of small parameter differences. The two samples diverge once the distribution flattens enough to let the per-device RNG (Ch.19) and the two models' trained weights (LayerNorm vs RMSNorm) disagree.
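To see exactly which parameters the two checkpoints disagree on, diff their state-dict keys. A sketch, assuming both .ckpt files from this chapter exist in the working directory:

import torch

layer_sd = torch.load("sh-layer.ckpt", map_location="cpu")["model_state_dict"]
rms_sd = torch.load("sh-rms.ckpt", map_location="cpu")["model_state_dict"]

only_in_layer = sorted(set(layer_sd) - set(rms_sd))
print(len(only_in_layer))  # 9 — one bias tensor per norm instance
print(only_in_layer[:3])   # ['blocks.0.ln1.bias', 'blocks.0.ln2.bias', 'blocks.1.ln1.bias']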
24.11 Experiments
- Inspect the saved config. Run:
python -c 'import torch; print(torch.load("sh-rms.ckpt", map_location="cpu")["config"])'
The output includes 'norm_type': 'rms'. Try the same on sh-layer.ckpt; it shows 'norm_type': 'layer'. Now try an older .ckpt from Ch.18 — its config dict has no norm_type key at all, but load_checkpoint still works because of the .get(..., "layer") default.
- A wrong norm at load time. Suppose you forced norm_type="rms" when loading a LayerNorm checkpoint. The model.load_state_dict(ckpt["model_state_dict"]) call would raise an error — the LayerNorm checkpoint contains *.bias keys for every norm, but the RMSNorm-built model has no *.bias parameters to receive them. Try it: edit load_checkpoint to hard-code norm_type="rms", then run generate on sh-layer.ckpt. Compare the error message PyTorch produces to your prediction; an in-memory version of this experiment is sketched after this list.
- Manual RMSNorm. Open a Python REPL and verify the formula by hand:

import torch
from mygpt import RMSNorm

x = torch.tensor([3.0, 4.0])  # rms = sqrt((9 + 16) / 2) = sqrt(12.5) ≈ 3.536
norm = RMSNorm(embed_dim=2)   # weight = [1.0, 1.0] at init
y = norm(x)                   # ≈ x / 3.536 = [0.849, 1.131]
print(y)

Confirm the printed values by hand.
- Param-count formula. A GPT(V, C, h, N, L=64) with LayerNorm has V*C + L*C + N*(12C² + 9C) + 2C parameters (Ch.12 §12.5 formula). Argue from §24.2 that the RMSNorm version has V*C + L*C + N*(12C² + 7C) + C parameters — (2N + 1) * C fewer: each block has 2 norms ($2C$ saved per block, $\times N = 4$ blocks $= 8C$), plus the final ln_f ($C$ saved), so the total saved is $9C = 576$ for $C = 64$. Confirm against the §24.9 numbers (207,296 → 206,720 = 576 saved). ✓
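The second experiment's failure mode can also be provoked in memory, without editing load_checkpoint. A sketch, with hyper-parameters chosen to match this chapter's runs:

from mygpt import GPT

kw = dict(vocab_size=65, embed_dim=64, num_heads=4, num_layers=4, max_seq_len=64)
layer_model = GPT(**kw, norm_type="layer")
rms_model = GPT(**kw, norm_type="rms")

try:
    rms_model.load_state_dict(layer_model.state_dict())  # 9 *.bias keys have nowhere to go
except RuntimeError as e:
    print(e)  # Unexpected key(s) in state_dict: "blocks.0.ln1.bias", ...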
24.12 Exercises
- Why no bias? Argue that adding a bias to RMSNorm would not break it functionally (the model can still learn) but is redundant given that every linear layer downstream has its own bias. (Hint: $\text{RMSNorm}(\mathbf{x}) \cdot \mathbf{w} + \mathbf{b}$ can be absorbed into the linear layer that follows it, by noting that Linear(W, B)(input + b) = Linear(W, B + W b)(input). The bias of the norm and the bias of the next linear are the same parameter up to where you put it.) A numerical check of this identity is sketched after this list.
- Eps placement. RMSNorm's eps is added inside the sqrt: $\sqrt{\text{mean}(x^2) + \varepsilon}$. LayerNorm places its eps the same way (inside the sqrt, after the variance). Argue that placing eps outside the sqrt would still prevent division by zero in the forward pass but behaves worse in the backward pass: sketch the gradient of $1 / (\sqrt{u} + \varepsilon)$ vs $1 / \sqrt{u + \varepsilon}$ near $u = 0$ and note which one stays bounded.
- Why mean(x²) and not the variance? LayerNorm subtracts the mean before squaring; RMSNorm computes mean(x²) directly, without subtracting the mean. Argue these are the same iff $\mu(\mathbf{x}) = 0$, and that they differ by exactly $\mu(\mathbf{x})^2$ when $\mu \neq 0$. (This is the same identity you saw in Ch.10's variance derivation: $\sigma^2 = \mathbb{E}[x^2] - \mathbb{E}[x]^2$.)
- Initialisation: gain = 1, no bias. Both LayerNorm and RMSNorm initialise weight to ones and (for LayerNorm) bias to zeros. Argue that this initialisation makes both a no-op at the start of training on centred unit-variance input, so the network's first forward pass is well-defined. (Hint: with $\mathbf{x}$ unit-variance and zero-mean, both norms compute approximately $\mathbf{x} / 1 \cdot \mathbf{1} + \mathbf{0} = \mathbf{x}$.)
24.13 What’s next
The next chapter, Chapter 25 — RoPE: rotary position embeddings, replaces the learned position embedding (Ch.12) with a parameter-free rotation of the query and key vectors. RoPE has the killer property that the §15.9 “untrained position embedding” failure mode disappears — the model can generate at any position, including positions beyond what it was trained on, because there are no learned position parameters at all.
Looking ahead — what to remember from this chapter:
- RMSNorm is LayerNorm minus the mean subtraction and minus the bias.
- Half the parameters per norm instance, half the forward-pass arithmetic, comparable training quality.
- --norm rms is opt-in; --norm layer (default) bit-reproduces every prior chapter's run.
- Checkpoints carry their norm_type in the config; pre-Ch.24 checkpoints default to "layer" on load, so old .ckpt files keep working.