Self-Attention is a mathematical process in which every word in a sequence "looks" at every other word to determine which ones provide the most relevant context for its own meaning; for example, in "The animal didn't cross the street because it was tired", the word "it" should attend strongly to "animal". Self-Attention has three main components: Query, Key, and Value.

$$ \begin{aligned}\text{Query (Q)} &: \text{The specific information a token is "looking for"} \\\text{Key (K)} &: \text{The label a token offers to help others "find it"} \\\text{Value (V)} &: \text{The actual content/meaning a token "gives" once matched} \\\text{Attention} &: \text{The degree of relevance between a Query and a Key}\end{aligned} $$


How does attention work?

To make this work, the model creates three distinct versions of every token vector, the Query (Q), Key (K), and Value (V), using learned linear transformations (sketched below), and then combines them using Scaled Dot-Product Attention.
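As a minimal sketch of these projections (assuming PyTorch and a model dimension `d_model`; the framework, names, and sizes are my assumptions, not part of the original), each role gets its own learned weight matrix applied to the same token vectors:

```python
import torch
import torch.nn as nn

d_model = 512   # embedding size (assumed for illustration)
seq_len = 10    # number of tokens in the sequence (assumed)

# One learned linear transformation per role
W_q = nn.Linear(d_model, d_model, bias=False)   # produces Queries
W_k = nn.Linear(d_model, d_model, bias=False)   # produces Keys
W_v = nn.Linear(d_model, d_model, bias=False)   # produces Values

x = torch.randn(seq_len, d_model)   # token vectors (embeddings + positional encodings)
Q, K, V = W_q(x), W_k(x), W_v(x)    # three distinct "views" of the same tokens
```

These Q, K, and V matrices then enter the Scaled Dot-Product Attention formula: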

$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$

$$ \begin{aligned} Q, K^T &: \text{Query matrix \& transposed Key matrix (their product gives raw relevance scores)} \\ \sqrt{d_k} &: \text{Scaling factor, where } d_k \text{ is the key dimension (prevents softmax saturation and vanishing gradients)} \\ \text{softmax} &: \text{Normalizes each row of scores into weights between 0 and 1} \\ V &: \text{Value matrix, combined using the attention weights} \end{aligned} $$
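The formula maps directly onto a few lines of code. The sketch below (again assuming PyTorch; the function name is mine, not from the original) computes the query-key scores, scales them by the square root of d_k, normalizes them with softmax, and uses the resulting weights to mix the Value vectors:

```python
import math
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    d_k = K.size(-1)                       # dimensionality of the key vectors
    scores = Q @ K.transpose(-2, -1)       # QK^T: every query scored against every key
    scores = scores / math.sqrt(d_k)       # scaling keeps large d_k from saturating softmax
    weights = F.softmax(scores, dim=-1)    # each row becomes weights that sum to 1
    return weights @ V                     # weighted sum of Value vectors

# Using the Q, K, V from the projection sketch above:
# output = scaled_dot_product_attention(Q, K, V)   # shape: (seq_len, d_model)
```

Dividing by the square root of d_k keeps the dot products from growing with the key dimension, which is what the "vanishing gradients" note above refers to: very large scores would push the softmax into a region where its gradients are nearly zero.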

  1. Input: The vectors from the previous layer (Embeddings + Positional Encodings).

  2. The Process (Scaled Dot-Product Attention):