Self-Attention is a mathematical process in which every word in a sequence "looks" at every other word to determine which ones provide the most relevant context for its own meaning; for example, in "The animal didn't cross the street because it was tired", the word "it" should attend strongly to "animal". Self-Attention has three main components: Query, Key, and Value.

$$ \begin{aligned}\text{Query (Q)} &: \text{The specific information a token is "looking for"} \\\text{Key (K)} &: \text{The label a token offers to help others "find it"} \\\text{Value (V)} &: \text{The actual content/meaning a token "gives" once matched} \\\text{Attention} &: \text{The degree of relevance between a Query and a Key}\end{aligned} $$


How does attention work?

To make this work, the model creates three distinct versions of every token vector, the Query (Q), Key (K), and Value (V), using learned linear transformations (sketched below), and then combines them using Scaled Dot-Product Attention.
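As a minimal sketch of these projections (assuming PyTorch and a model dimension `d_model`; the framework, names, and sizes are my assumptions, not part of the original), each role gets its own learned weight matrix applied to the same token vectors:

```python
import torch
import torch.nn as nn

d_model = 512   # embedding size (assumed for illustration)
seq_len = 10    # number of tokens in the sequence (assumed)

# One learned linear transformation per role
W_q = nn.Linear(d_model, d_model, bias=False)   # produces Queries
W_k = nn.Linear(d_model, d_model, bias=False)   # produces Keys
W_v = nn.Linear(d_model, d_model, bias=False)   # produces Values

x = torch.randn(seq_len, d_model)   # token vectors (embeddings + positional encodings)
Q, K, V = W_q(x), W_k(x), W_v(x)    # three distinct "views" of the same tokens
```

These Q, K, and V matrices then enter the Scaled Dot-Product Attention formula: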

$$ \text{Attention}(Q, K, V) = \text{softmax}\left(\frac{QK^T}{\sqrt{d_k}}\right)V $$

$$ \begin{aligned} Q, K^T &: \text{Query matrix \& transposed Key matrix (their product gives raw relevance scores)} \\ \sqrt{d_k} &: \text{Scaling factor, where } d_k \text{ is the key dimension (prevents softmax saturation and vanishing gradients)} \\ \text{softmax} &: \text{Normalizes each row of scores into weights between 0 and 1} \\ V &: \text{Value matrix, combined using the attention weights} \end{aligned} $$
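The formula maps directly onto a few lines of code. The sketch below (again assuming PyTorch; the function name is mine, not from the original) computes the query-key scores, scales them by the square root of d_k, normalizes them with softmax, and uses the resulting weights to mix the Value vectors:

```python
import math
import torch.nn.functional as F

def scaled_dot_product_attention(Q, K, V):
    d_k = K.size(-1)                       # dimensionality of the key vectors
    scores = Q @ K.transpose(-2, -1)       # QK^T: every query scored against every key
    scores = scores / math.sqrt(d_k)       # scaling keeps large d_k from saturating softmax
    weights = F.softmax(scores, dim=-1)    # each row becomes weights that sum to 1
    return weights @ V                     # weighted sum of Value vectors

# Using the Q, K, V from the projection sketch above:
# output = scaled_dot_product_attention(Q, K, V)   # shape: (seq_len, d_model)
```

Dividing by the square root of d_k keeps the dot products from growing with the key dimension, which is what the "vanishing gradients" note above refers to: very large scores would push the softmax into a region where its gradients are nearly zero.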

  1. Input: The vectors from the previous layer (Embeddings + Positional Encodings).

  2. The Process (Scaled Dot-Product Attention):