Self-attention says where the queries, keys, and values come from (the sequence attends to itself, i.e., who attends to whom within the same sequence), while multi-head attention is an architectural choice that runs several attention mechanisms in parallel to capture different relationships.

A single head uses one set of projections $W^Q, W^K, W^V$, so it learns one dominant pattern per position (e.g., “mostly attend to the previous word”).
Think of each head as a different specialist reading the same sentence.
Each head computes attention with the same formula, on the same tokens, but with its own weight matrices. Because the projections differ, the Q/K/V spaces differ, so the similarity scores and softmax weights differ per head, and each head ends up attending to different positions and features in the sequence.
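As a concrete illustration, here is a minimal PyTorch sketch (the dimensions, seed, and random projections are invented for the example, and the output projection $W^O$ is omitted): four heads apply the same attention formula to the same tokens but with their own $W_i^Q, W_i^K, W_i^V$, and so produce different attention maps.

```python
import torch
import torch.nn.functional as F

torch.manual_seed(0)

seq_len, d_model, n_heads = 5, 16, 4
d_head = d_model // n_heads

x = torch.randn(seq_len, d_model)            # token representations of one sequence

# One (W_q, W_k, W_v) triple per head -- these play the role of W_i^Q, W_i^K, W_i^V.
W_q = torch.randn(n_heads, d_model, d_head)
W_k = torch.randn(n_heads, d_model, d_head)
W_v = torch.randn(n_heads, d_model, d_head)

head_outputs, attn_maps = [], []
for i in range(n_heads):
    q, k, v = x @ W_q[i], x @ W_k[i], x @ W_v[i]     # per-head Q/K/V spaces
    scores = q @ k.T / d_head ** 0.5                 # same scaled dot-product formula
    attn = F.softmax(scores, dim=-1)                 # ...but different weights per head
    attn_maps.append(attn)
    head_outputs.append(attn @ v)

# Different projections on identical tokens -> different attention patterns.
print(torch.allclose(attn_maps[0], attn_maps[1]))    # almost certainly False
out = torch.cat(head_outputs, dim=-1)                # heads are concatenated afterwards
```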

When you visualise heads in trained Transformers, you see some heads focus on syntax (e.g., heads attending from verbs to subjects), others on long-range dependencies, others on punctuation, etc.
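If you want to look at these patterns yourself, one common route is the Hugging Face `transformers` library, which can return per-layer, per-head attention weights. A hedged sketch, assuming the library is installed and the `bert-base-uncased` checkpoint can be downloaded (any encoder that exposes attentions works the same way):

```python
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

inputs = tokenizer("The cat sat on the mat.", return_tensors="pt")
with torch.no_grad():
    outputs = model(**inputs, output_attentions=True)

# outputs.attentions: one tensor per layer, each of shape (batch, heads, seq, seq).
layer0 = outputs.attentions[0][0]            # all heads of the first layer
for h, attn in enumerate(layer0):
    # Which position does each query token attend to most? Heads typically differ.
    print(f"head {h}: top target per query = {attn.argmax(dim=-1).tolist()}")
```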

Each head has its own parameters $W_i^Q, W_i^K, W_i^V$, so gradients update them independently.
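A small sketch of that independence, reusing the toy setup from above (all shapes and the toy objective are invented for illustration): the same loss sends a different gradient to each head's slice of the projection parameters.

```python
import torch

torch.manual_seed(0)
seq_len, d_model, n_heads = 5, 16, 4
d_head = d_model // n_heads

x = torch.randn(seq_len, d_model)
W_q = torch.randn(n_heads, d_model, d_head, requires_grad=True)
W_k = torch.randn(n_heads, d_model, d_head, requires_grad=True)
W_v = torch.randn(n_heads, d_model, d_head, requires_grad=True)

outs = []
for i in range(n_heads):
    q, k, v = x @ W_q[i], x @ W_k[i], x @ W_v[i]
    attn = torch.softmax(q @ k.T / d_head ** 0.5, dim=-1)
    outs.append(attn @ v)

loss = torch.cat(outs, dim=-1).pow(2).mean()   # toy objective, just to get gradients
loss.backward()

# W_q.grad[i] is head i's own gradient; the slices generally differ from head to head.
print(W_q.grad[0].norm(), W_q.grad[1].norm())
```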

The loss only cares about the combined effect of all heads (their outputs are concatenated and mixed by the output projection); any division of labor among heads that lowers the loss is allowed, so symmetry can break and roles can diverge.
Early in training, heads are near-random, often behaving similarly because weights are close to initialization.
As training proceeds, different gradient signals push different heads to latch onto patterns that help reduce loss.
If multiple heads all do the same thing, their marginal benefit overlaps; gradients for some of them shrink because others already cover that function.
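A deliberately oversimplified toy (scalar "heads" whose outputs are summed into a squared loss, which is not how real heads are combined) that illustrates the last two points: the loss only sees the combined output, and once that output already covers the target, the gradient reaching each individual head shrinks.

```python
import torch

target = torch.tensor(2.0)

# Case 1: the combined output does not yet fit the target.
h1 = torch.tensor(1.0, requires_grad=True)
h2 = torch.tensor(0.2, requires_grad=True)
loss = (h1 + h2 - target) ** 2        # loss sees only the combined effect
loss.backward()
print(h1.grad, h2.grad)               # both sizable: more of this function is still needed

# Case 2: the other "head" now fills the gap.
h1 = torch.tensor(1.0, requires_grad=True)
h2 = torch.tensor(1.0, requires_grad=True)
loss = (h1 + h2 - target) ** 2
loss.backward()
print(h1.grad, h2.grad)               # both zero: the function is already covered
```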