Chapter 12 — Position embeddings and the language modeling head
After Chapter 11 we have the body of GPT-2: an embedding lookup, a stack of transformer blocks, and a (B, T, C) tensor at the top. Two things are still missing before we have a complete model:
- Position embeddings — a way to tell the model that token "I" at position 0 is different from token "I" at position 3. Without this, attention is permutation-invariant: it would happily produce the same logits for "I love AI !" and "AI I ! love".
- The language-modelling head — a final projection from the $C$-dimensional output of the last block to a $V$-dimensional logit vector at every position, plus a final layer norm before the projection. GPT-2 uses a clever weight-tying trick that lets the head share weights with the token embedding, saving $V \cdot C$ parameters.
By the end you will have:
- understood why self-attention is permutation-invariant and how a single nn.Embedding(max_seq_len, C) lookup fixes it,
- met the language-modelling head as the final $C \to V$ linear projection that produces logits,
- seen the weight-tied head trick that saves $V \cdot C$ parameters in real GPT-2,
- assembled mygpt.GPT(vocab_size, embed_dim, num_heads, num_layers) — the full model — and verified it produces output of shape (B, T, V).
After this chapter we have a model. Chapter 13 wires up the loss; Chapter 14 trains it.
12.1 The permutation-invariance problem
Self-attention, as we built it in Chapters 6–8, is permutation-invariant in a precise sense: if you permute the rows of the input, you get the same output rows in the corresponding permuted order. Concretely, if $X$ is a $(T, C)$ input and $P$ is any permutation matrix, then
\[\text{Attention}(P X) \;=\; P \cdot \text{Attention}(X).\]
Apply that to the language-modelling task: mha([x_0, x_1, x_2]) and mha([x_2, x_0, x_1]) produce the same set of output vectors, just reordered. The model has no way to tell whether a token came first, second, or third — it only sees content, not position.
That breaks language modelling. The whole point is that “love AI” and “AI love” are different — sequence order matters.
(Two technical caveats. The causal mask is a per-position constraint, but it does not actually break the symmetry: a permuted input would just see a different prefix at each position, with the same content-based attention pattern within that prefix. The MLP and residual sub-layers are also position-wise, so they don’t break it either. The whole network is fully permutation-equivariant from $X$ to the final output.)
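To make the equation concrete, here is a minimal numerical check — a standalone sketch of plain (unmasked, single-head) scaled dot-product attention with stand-in projection matrices, not the mygpt modules:

import torch

torch.manual_seed(0)
T, C = 4, 4
Wq, Wk, Wv = (torch.randn(C, C) for _ in range(3))

def attention(x: torch.Tensor) -> torch.Tensor:
    # Plain single-head attention: no mask, no output projection.
    q, k, v = x @ Wq, x @ Wk, x @ Wv
    scores = (q @ k.T) / C**0.5
    return torch.softmax(scores, dim=-1) @ v

X = torch.randn(T, C)
perm = torch.randperm(T)        # P: a random permutation of the rows
lhs = attention(X[perm])        # Attention(P X)
rhs = attention(X)[perm]        # P · Attention(X)
print(torch.allclose(lhs, rhs, atol=1e-6))  # True — same rows, permuted order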
The fix is mechanical and elegant: add a position-dependent vector to every token embedding, before feeding it to the transformer body. Then “token 3 at position 0” and “token 3 at position 3” become different inputs from the model’s point of view, even though they share the token embedding.
12.2 Setup
This chapter assumes you finished Chapter 11 — mygpt/ exists with the full Chapter 11 module set ending in TransformerBlock.
If you skipped Chapter 11, recreate the state from a clean directory:
uv init mygpt --package
cd mygpt
mkdir -p experiments
uv add torch numpy
Then overwrite src/mygpt/__init__.py with the Chapter 11 ending state from docs/_state_after_ch11.md.
You are ready.
12.3 Learned position embeddings
GPT-2 uses learned position embeddings: a separate nn.Embedding(max_seq_len, embed_dim) lookup, indexed by the integer position $0, 1, \ldots, T-1$. The vectors are learned during training, just like the token embeddings — there is nothing built-in about how they work, no sine/cosine formula. The model figures out, on its own, what each position should contribute.
Concretely:
self.token_embedding = TokenEmbedding(vocab_size, embed_dim)
self.position_embedding = nn.Embedding(max_seq_len, embed_dim)
# In forward:
positions = torch.arange(T, device=ids.device) # (T,)
x = self.token_embedding(ids) + self.position_embedding(positions) # (B, T, C)
Three things to notice:
- positions is a 1-D tensor [0, 1, ..., T-1]. Same for every example in the batch — broadcasting handles the batch axis when we do the addition.
- The two embeddings are added, not concatenated. Adding keeps the output shape $(B, T, C)$ — same as the token embedding alone — so the rest of the model is unchanged. Concatenating would double the channel axis, requiring every downstream layer to be rewired.
- Position embeddings have $\text{max_seq_len} \cdot C$ parameters. For our running example with $\text{max_seq_len}=64, C=4$, that is $256$ parameters — already bigger than the token embedding’s $V \cdot C = 16$. For GPT-2 small ($\text{max_seq_len}=1024, C=768$), it is $786{,}432 \approx 0.79$ M — small compared to attention/MLP, but not negligible.
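A quick check of the counts in that last bullet, using PyTorch's nn.Embedding directly:

import torch.nn as nn

print(nn.Embedding(64, 4).weight.numel())      # 256     — running example (max_seq_len=64, C=4)
print(nn.Embedding(1024, 768).weight.numel())  # 786432  — GPT-2 small (max_seq_len=1024, C=768)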
Let’s see the effect empirically: with position embeddings, the same token at different positions gets a different vector.
Save the following to 📄 experiments/25_position_breaks_invariance.py:
"""Experiment 25 — Position embeddings break self-attention's permutation invariance.
Without position embeddings, token 3 at position 3 has the same vector as
token 3 at position 0. With position embeddings added, they differ.
"""
import torch
import torch.nn as nn
from mygpt import TokenEmbedding, set_seed
def main() -> None:
set_seed(0)
V, C, max_seq = 4, 4, 8
te = TokenEmbedding(V, C)
pe = nn.Embedding(max_seq, C)
# Two id sequences that share the token "3" at different positions
ids1 = torch.tensor([0, 1, 2, 3])
ids2 = torch.tensor([3, 2, 1, 0])
# Without position embedding: token 3 always has the same vector
print("Without position embedding (token 3 row):")
print(f" ids1 position 3: {te(ids1)[3]}")
print(f" ids2 position 0: {te(ids2)[0]}")
print(f" identical: {torch.equal(te(ids1)[3], te(ids2)[0])}")
print()
# With position embedding: same token at different positions differs
def with_pos(ids):
T = ids.shape[-1]
positions = torch.arange(T)
return te(ids) + pe(positions)
v1 = with_pos(ids1)
v2 = with_pos(ids2)
print("With position embedding (token 3 at different positions):")
print(f" ids1 position 3 (token 3 at pos 3): {v1[3]}")
print(f" ids2 position 0 (token 3 at pos 0): {v2[0]}")
print(f" identical: {torch.equal(v1[3], v2[0])}")
if __name__ == "__main__":
main()
Run it:
uv run python experiments/25_position_breaks_invariance.py
Expected output:
Without position embedding (token 3 row):
ids1 position 3: tensor([ 0.1198, 1.2377, 1.1168, -0.2473], grad_fn=<SelectBackward0>)
ids2 position 0: tensor([ 0.1198, 1.2377, 1.1168, -0.2473], grad_fn=<SelectBackward0>)
identical: True
With position embedding (token 3 at different positions):
ids1 position 3 (token 3 at pos 3): tensor([ 1.5092, 2.8240, 2.0631, -1.0910], grad_fn=<SelectBackward0>)
ids2 position 0 (token 3 at pos 0): tensor([-1.2328, -0.4583, 1.6834, 0.5462], grad_fn=<SelectBackward0>)
identical: False
The two rows of the with-position output are different (identical: False), even though both encode the same token id. The position embedding is doing its job.
12.4 The language-modelling head
After the last transformer block, every token’s vector still has $C$ channels. To predict the next token we need to project from $C$ to $V$ — one logit per token in the vocabulary. That projection is the language-modelling head:
\[\text{logits} \;=\; \text{LayerNorm}_\text{final}(\text{block\_output}) \cdot W_\text{head}^\top, \qquad W_\text{head} \in \mathbb{R}^{V \times C}.\]
Three details:
- A final LayerNorm runs before the head, normalising the output of the last block. This is GPT-2’s design — without it, the residual stream’s drift (which we measured in §10.6) would feed directly into the head and push some logits to extreme values.
- The head is nn.Linear(C, V, bias=False) in a typical implementation. Its weight matrix is $W_\text{head}$ of shape $(V, C)$; output shape is $(B, T, V)$.
- No softmax inside the head. The head produces logits — unnormalised real numbers. Softmax happens later, either inside the cross-entropy loss (Chapter 13) or at generation time (Chapter 15). Producing logits is the right contract because cross-entropy is more numerically stable when computed directly from logits than from probabilities.
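Put together, an untied head is only a few lines. The sketch below uses PyTorch's built-in nn.LayerNorm for brevity; our GPT in §12.6 uses mygpt's own LayerNorm and the tied head instead:

import torch
import torch.nn as nn

B, T, C, V = 2, 4, 8, 50
x = torch.randn(B, T, C)            # output of the last transformer block
ln_f = nn.LayerNorm(C)              # final LayerNorm
head = nn.Linear(C, V, bias=False)  # W_head has shape (V, C)

logits = head(ln_f(x))              # (B, T, V) — unnormalised, no softmax
print(logits.shape)                 # torch.Size([2, 4, 50])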
12.5 Weight tying: the GPT-2 trick
The head’s weight $W_\text{head} \in \mathbb{R}^{V \times C}$ has the exact same shape as the token embedding’s matrix $E \in \mathbb{R}^{V \times C}$. GPT-2 ties them: $W_\text{head} = E$. The head is then not a separate parameter — its forward pass uses x @ E^T directly, with $E$ being the same tensor that the token embedding looks up.
Two reasons this is appealing:
- Saves $V \cdot C$ parameters. For GPT-2 small ($V=50{,}257, C=768$) that is $\approx 38.6$ M — about 31% of the model. (Recall §5.10 ex 2: the embedding was already 31% of the total. Tying makes the head free.)
- Aligns input and output spaces. A token’s input embedding $E[i]$ and its output prediction $W_\text{head}[i]$ now point in the same direction in $\mathbb{R}^C$. Empirically, this helps training: the model has fewer redundant degrees of freedom.
In code, we do not allocate an nn.Linear for the head at all. We just use the token embedding’s weight matrix transposed:
# Tied head — no separate Linear layer
logits = x @ self.token_embedding.embedding.weight.T # (B, T, V)
self.token_embedding.embedding.weight is the $V \times C$ tensor inside nn.Embedding. Its transpose is $C \times V$. Multiplying x: (B, T, C) by it gives (B, T, V) — the logit shape we want.
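For reference, x @ E.T is exactly what a bias-free Linear layer with weight E would compute — a quick standalone equivalence check (the shapes here are illustrative, E is a stand-in for the token-embedding matrix):

import torch
import torch.nn.functional as F

B, T, C, V = 2, 4, 8, 50
x = torch.randn(B, T, C)
E = torch.randn(V, C)                                # stand-in for the (V, C) embedding matrix

logits_matmul = x @ E.T                              # (B, T, V), as in the snippet above
logits_linear = F.linear(x, E)                       # Linear with weight E, no bias
print(torch.allclose(logits_matmul, logits_linear))  # True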
12.6 Building mygpt.GPT
Time to assemble the full model.
Append the following class to 📄 src/mygpt/__init__.py (after TransformerBlock, before main):
class GPT(nn.Module):
"""The full GPT-2-style decoder-only transformer.
Inputs:
ids: long tensor of shape (B, T) with values in [0, vocab_size).
Outputs:
logits: float tensor of shape (B, T, vocab_size), unnormalised.
Architecture:
token_embedding (V, C) parameters
+ position_embedding (max_seq_len, C) parameters
→ embed_drop
→ N x TransformerBlock(C, num_heads)
→ ln_f (final LayerNorm)
→ head (tied to token_embedding.embedding.weight; no extra params)
"""
def __init__(
self,
vocab_size: int,
embed_dim: int,
num_heads: int,
num_layers: int,
max_seq_len: int = 64,
dropout: float = 0.0,
) -> None:
super().__init__()
self.vocab_size = vocab_size
self.embed_dim = embed_dim
self.num_heads = num_heads
self.num_layers = num_layers
self.max_seq_len = max_seq_len
self.token_embedding = TokenEmbedding(vocab_size, embed_dim)
self.position_embedding = nn.Embedding(max_seq_len, embed_dim)
self.embed_drop = nn.Dropout(dropout)
self.blocks = nn.Sequential(*[
TransformerBlock(embed_dim, num_heads, max_seq_len, dropout)
for _ in range(num_layers)
])
self.ln_f = LayerNorm(embed_dim)
# No separate head: we reuse self.token_embedding.embedding.weight in forward.
def forward(self, ids: torch.Tensor) -> torch.Tensor:
B, T = ids.shape
if T > self.max_seq_len:
raise ValueError(
f"input length T={T} exceeds max_seq_len={self.max_seq_len}"
)
positions = torch.arange(T, device=ids.device)
x = self.token_embedding(ids) + self.position_embedding(positions) # (B, T, C)
x = self.embed_drop(x)
x = self.blocks(x)
x = self.ln_f(x)
# Tied head: logits = x @ E^T, where E is the token-embedding matrix.
logits = x @ self.token_embedding.embedding.weight.T # (B, T, V)
return logits
Then update main to construct a small GPT and run it on the running example:
def main() -> None:
print("Vocabulary:", VOCAB)
print(f"Vocabulary size V = {len(VOCAB)}")
set_seed(0)
V, C, h, N = len(VOCAB), 4, 2, 2
gpt = GPT(vocab_size=V, embed_dim=C, num_heads=h, num_layers=N,
max_seq_len=64, dropout=0.0)
gpt.eval()
ids = to_ids(["I", "love", "AI", "!"]).unsqueeze(0)
logits = gpt(ids)
print(f"\nToken ids shape: {tuple(ids.shape)}")
print(f"Logits shape: {tuple(logits.shape)} (B, T, V)")
print()
n_te = sum(p.numel() for p in gpt.token_embedding.parameters())
n_pe = sum(p.numel() for p in gpt.position_embedding.parameters())
n_blocks = sum(p.numel() for p in gpt.blocks.parameters())
n_ln_f = sum(p.numel() for p in gpt.ln_f.parameters())
n_total = sum(p.numel() for p in gpt.parameters())
print(f"Token embedding (V*C): {n_te:>5}")
print(f"Position embedding (max_seq*C): {n_pe:>5}")
print(f"{N} TransformerBlocks (N*228): {n_blocks:>5}")
print(f"Final LayerNorm (2*C): {n_ln_f:>5}")
print(f"Tied head (0 extra): {0:>5}")
print(f"Total parameters: {n_total:>5}")
Run:
uv run mygpt
Expected output:
Vocabulary: ('I', 'love', 'AI', '!')
Vocabulary size V = 4
Token ids shape: (1, 4)
Logits shape: (1, 4, 4) (B, T, V)
Token embedding (V*C): 16
Position embedding (max_seq*C): 256
2 TransformerBlocks (N*228): 456
Final LayerNorm (2*C): 8
Tied head (0 extra): 0
Total parameters: 736
Three things to read off:
- Logits shape (1, 4, 4). That is $(B, T, V) = (1, 4, 4)$. Every position gets a vector of $V = 4$ logits — one prediction for each of "I", "love", "AI", "!" as the next token. Chapter 13 will use these to compute cross-entropy loss.
- Total parameters: 736. The breakdown is $16 + 256 + 456 + 8 = 736$ (no contribution from the tied head). The position embedding alone (256) is bigger than every other piece except the transformer blocks.
- Tied head costs zero parameters. gpt.parameters() walks the registered submodules; the head reuses token_embedding.embedding.weight, which is already counted under token_embedding. There is nothing else to count.
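If you want to see that for yourself, print the registered parameter names — token_embedding.embedding.weight appears exactly once and there is no head entry at all. A quick check to run inside main (or a REPL) with the gpt object from above:

for name, p in gpt.named_parameters():
    print(f"{name:45s} {tuple(p.shape)}")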
12.7 Experiments
- Wider model. Construct GPT(vocab_size=4, embed_dim=8, num_heads=2, num_layers=2, max_seq_len=64). Predicted parameter count: $V \cdot C + \text{max_seq_len} \cdot C + N \cdot (12 C^2 + 9 C) + 2 C = 32 + 512 + 1680 + 16 = 2240$. Verify by counting.
- Deeper model. Construct GPT(vocab_size=4, embed_dim=4, num_heads=2, num_layers=4) (twice as many blocks). Predicted parameter count: $16 + 256 + 4 \cdot 228 + 8 = 1192$. Verify.
- Position-embedding hyperparameter. Construct GPT(vocab_size=4, embed_dim=4, num_heads=2, num_layers=2, max_seq_len=8). The position embedding now has $8 \cdot 4 = 32$ parameters instead of $256$ — a much smaller fraction of the model. Total: $16 + 32 + 456 + 8 = 512$. Verify.
- The tied head really is tied. After constructing gpt = GPT(...), change one entry of gpt.token_embedding.embedding.weight.data (e.g. gpt.token_embedding.embedding.weight.data[0, 0] = 999.0). Run gpt(ids) again and confirm the logits at every position have changed in their column 0 — because column 0 of the head’s effective weight is row 0 of the token embedding, and we just modified it.
After each experiment, restore the file you changed before moving on.
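To check the predicted counts quickly, a small helper like the one below works. The count_params helper is just an illustrative name; the GPT signature is the one from §12.6:

from mygpt import GPT

def count_params(model) -> int:
    """Total number of registered parameters."""
    return sum(p.numel() for p in model.parameters())

print(count_params(GPT(vocab_size=4, embed_dim=8, num_heads=2, num_layers=2, max_seq_len=64)))  # expect 2240
print(count_params(GPT(vocab_size=4, embed_dim=4, num_heads=2, num_layers=4)))                   # expect 1192
print(count_params(GPT(vocab_size=4, embed_dim=4, num_heads=2, num_layers=2, max_seq_len=8)))    # expect 512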
12.8 Exercises
- Why max_seq_len affects parameter count but not training data size. Increasing max_seq_len from 64 to 1024 grows the position embedding from $64 C$ to $1024 C$ — adding parameters. But it does not require more training data; the same data is just available to use up to length 1024. Argue why, and what the cost is at inference time. (Hint: per-step compute scales like $T^2$ in attention.)
- Why are position embeddings learned inside the model rather than applied as a fixed “+pe” preprocessing hack? Because they are learnable parameters that need gradients. Argue from the chain rule that putting +pe in forward (before the first block) gives pe a non-trivial gradient on every example. (Hint: $\frac{\partial \mathcal{L}}{\partial \text{pe}_t}$ flows back through every block to position $t$.)
- GPT-2 small parameter accounting. Compute the number of parameters for GPT-2 small with $V=50{,}257, C=768, h=12, N=12, \text{max_seq_len}=1024$, with weight tying. (Answer: $V C + \text{max_seq_len} \cdot C + N \cdot (12 C^2 + 9 C) + 2 C = 38{,}597{,}376 + 786{,}432 + 85{,}017{,}600 + 1{,}536 = 124{,}402{,}944 \approx 124$ M — matching the published 124 M figure. A quick arithmetic check follows this list.)
- What does the tied head buy you in terms of generalisation? The token embedding $E[i]$ and the head row $W_\text{head}[i]$ both represent token $i$ — one as input, one as output. With tying they are the same vector, so training the model to predict token $i$ and training it to consume token $i$ both update that one vector. Argue informally why this should help small-sample generalisation.
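The GPT-2 small accounting from exercise 3, done in a few lines of Python with this book's per-block formula of $12C^2 + 9C$:

V, C, N, max_seq_len = 50_257, 768, 12, 1024

token_emb = V * C                    # 38,597,376 — also serves as the tied head
pos_emb = max_seq_len * C            # 786,432
blocks = N * (12 * C**2 + 9 * C)     # 85,017,600
ln_f = 2 * C                         # 1,536

print(f"{token_emb + pos_emb + blocks + ln_f:,}")  # 124,402,944 ≈ 124 M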
12.9 What’s next
We have a complete model. mygpt.GPT(vocab_size=4, embed_dim=4, num_heads=2, num_layers=2) has 736 parameters and produces a (1, 4, 4) logit tensor on the running example.
Chapter 13 wires up the loss. Given logits and the true next-token ids, we compute the cross-entropy loss — the scalar that gradient descent will minimise. We will also write the forward-pass-with-loss method that real training loops use.
Chapter 14 trains the model. We bring back the SGD loop from Chapter 4, give it a small text dataset, and watch the loss go down. After Chapter 14 we have a trained GPT.
Chapter 15 generates text. Given a prompt and a trained model, we sample tokens one at a time, autoregressively, until we hit a stop condition.
Looking ahead — what to remember from this chapter
- Self-attention is permutation-invariant — without position information, mha([x_0, x_1, x_2]) and mha([x_2, x_0, x_1]) give the same set of output rows. Position embeddings break this symmetry by giving each position its own learned vector to add to the token embedding.
- The language-modelling head is a $C \to V$ projection. GPT-2 ties its weight to the token embedding, saving $V \cdot C$ parameters.
- A final LayerNorm runs before the head to renormalise the residual stream.
- mygpt.GPT(V=4, C=4, h=2, N=2, max_seq_len=64) has 736 parameters and outputs logits of shape (B, T, V).
On to Chapter 13 — The forward pass with loss (coming soon).