Embeddings translate a discrete token into a high-dimensional vector that captures its learned "meaning," while Positional Encoding adds a "time stamp" to that vector so the model knows where the token sits in the sequence.

The Process:

Positional Encoding

$$ \begin{aligned}PE_{(\text{pos}, 2i)} &= \sin\left(\frac{\text{pos}}{10000^{2i/d_{\text{model}}}}\right) \\PE_{(\text{pos}, 2i+1)} &= \cos\left(\frac{\text{pos}}{10000^{2i/d_{\text{model}}}}\right)\end{aligned} $$

$$ \begin{aligned}\text{pos} &: \text{The position of the token in the sequence } (0, 1, 2, \dots) \\i &: \text{The index of the specific dimension in the vector} \\d_{\text{model}} &: \text{The total size of the embedding vector (e.g., 512)} \\10000 &: \text{A constant "base" that stretches the waves}\end{aligned} $$
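A minimal NumPy sketch of the formulas above (the function name `positional_encoding` and its parameters are illustrative, not from any particular library): each even dimension 2i gets a sine wave and the matching odd dimension 2i+1 its cosine, with wavelengths that grow as i increases.

```python
import numpy as np

def positional_encoding(seq_len: int, d_model: int, base: float = 10000.0) -> np.ndarray:
    """Build the (seq_len, d_model) sinusoidal positional-encoding matrix."""
    pos = np.arange(seq_len)[:, None]          # positions 0..seq_len-1, shape (seq_len, 1)
    i = np.arange(d_model // 2)[None, :]       # dimension-pair index, shape (1, d_model/2)
    angle = pos / base ** (2 * i / d_model)    # pos / 10000^(2i / d_model)

    pe = np.zeros((seq_len, d_model))
    pe[:, 0::2] = np.sin(angle)                # even dimensions: sine
    pe[:, 1::2] = np.cos(angle)                # odd dimensions: cosine
    return pe

pe = positional_encoding(seq_len=50, d_model=512)
print(pe.shape)              # (50, 512)
print(pe.min(), pe.max())    # stays within [-1, 1]
```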

Why use waves?

  1. Boundedness: Sines and cosines stay between -1 and 1. This prevents the positional signal from "overpowering" the semantic meaning of the word.

  2. Relative Distance: Because of trigonometric identities, the model can mathematically "calculate" how far apart two tokens are: the encoding for pos + k can be expressed as a linear function (a fixed rotation) of the encoding for pos (see the numerical check after this list).

  3. Extrapolation: In theory, a wave pattern can continue forever, allowing the model to potentially handle sequences longer than what it saw during training.
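A quick numerical check of point 2 (the concrete values of pos, k, and i here are arbitrary): for every sine/cosine pair, the encoding at position pos + k is obtained from the encoding at pos by a rotation whose angle depends only on the offset k, not on pos itself.

```python
import numpy as np

d_model, base = 512, 10000.0
pos, k, i = 7, 5, 3                               # arbitrary position, offset, dimension pair
w = 1.0 / base ** (2 * i / d_model)               # angular frequency of dimension pair i

# The (sin, cos) pair for dimension pair i at positions pos and pos + k
pe_pos   = np.array([np.sin(pos * w),       np.cos(pos * w)])
pe_pos_k = np.array([np.sin((pos + k) * w), np.cos((pos + k) * w)])

# A rotation matrix that depends only on the offset k
rotation = np.array([[ np.cos(k * w), np.sin(k * w)],
                     [-np.sin(k * w), np.cos(k * w)]])

print(np.allclose(rotation @ pe_pos, pe_pos_k))   # True: PE(pos+k) is a linear map of PE(pos)
```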

Older models (RNNs) processed words one by one, so they "knew" order naturally through time. Transformers process everything in a single "flash" (in parallel), so without Positional Encoding the sentences "The dog bit the man" and "The man bit the dog" would look identical to the model.
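A toy illustration of this (random embeddings and a made-up sorting helper, purely for the example): without any positional signal the two sentences hand a position-blind model the exact same bag of vectors, and adding even a crude positional offset breaks the tie.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = {"the": 0, "dog": 1, "bit": 2, "man": 3}
emb = rng.normal(size=(len(vocab), 8))                # toy embedding table, d_model = 8

s1 = ["the", "dog", "bit", "the", "man"]
s2 = ["the", "man", "bit", "the", "dog"]
e1 = np.stack([emb[vocab[w]] for w in s1])            # (5, 8) token embeddings
e2 = np.stack([emb[vocab[w]] for w in s2])

def as_bag(x):
    """Sort rows lexicographically: the order-free view a position-blind model sees."""
    return x[np.lexsort(x.T)]

print(np.allclose(as_bag(e1), as_bag(e2)))            # True  - identical without positions
pos = np.arange(5)[:, None] / 10.0                    # crude stand-in positional signal
print(np.allclose(as_bag(e1 + pos), as_bag(e2 + pos)))  # False - order now matters
```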

Why add the position vector to the embedding rather than concatenate it?

The Trade-off: By adding them, we technically "pollute" the semantic space. However, because these vectors are high-dimensional (e.g., 768 dimensions), there is plenty of "empty space" for the model to learn to separate the "What" (semantic) from the "Where" (positional) during training.
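A shape-level sketch of the two options (toy random vectors, not a real model): addition keeps d_model fixed, so every downstream weight matrix stays the same size, whereas concatenation would widen the model and add parameters.

```python
import numpy as np

seq_len, d_model = 5, 8
rng = np.random.default_rng(1)
tok = rng.normal(size=(seq_len, d_model))             # toy token embeddings ("what")
pe  = rng.normal(size=(seq_len, d_model))             # toy positional vectors ("where")

added  = tok + pe                                     # (5, 8): both signals share one space
concat = np.concatenate([tok, pe], axis=-1)           # (5, 16): model width doubles

print(added.shape, concat.shape)                      # (5, 8) (5, 16)
```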

Embeddings provide the Identity (The book's content), and Positional Encoding provides the Context (The book's page number). When added together, they create a single, rich numerical signal.

https://erdem.pl/2021/05/understanding-positional-encoding-in-transformers