Tokenization is the process of breaking down a continuous stream of text into discrete, manageable units (tokens) that a computer can map to a fixed dictionary of numbers.


Imagine you are building a massive Lego castle. If the manual tried to give you a single, unique brick for every possible building in the world (a "Whole Word" approach), the manual would be infinitely large. If it gave you only individual plastic atoms (a "Character" approach), it would take years to build a single wall.


Tokenization is like finding the perfect set of reusable sub-assemblies. It identifies common patterns, like a standard 2x4 brick, a window frame, or a roof tile.

Rare words are broken into smaller pieces (like "un-" + "break" + "able"), while common words are kept whole. This balances efficiency (not too many pieces) with flexibility (you can build anything).
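For example, here is a minimal sketch using the Hugging Face `transformers` GPT-2 tokenizer (the exact splits depend on the merges that particular vocabulary learned):

```python
from transformers import AutoTokenizer

# Load a pretrained BPE tokenizer (GPT-2's vocabulary of roughly 50k tokens).
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# A common word is usually kept whole, while a rarer word is split into sub-pieces.
# The exact splits depend on the merges this particular vocabulary learned.
print(tokenizer.tokenize("the"))          # typically a single token
print(tokenizer.tokenize("unbreakable"))  # typically several sub-word pieces
```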

Most modern LLMs (like GPT-4 or Llama) use an algorithm called Byte-Pair Encoding (BPE).

<aside> ❓

Byte-Pair Encoding (BPE) was originally developed as a text-compression algorithm and was later adopted by OpenAI for tokenization when pretraining the GPT model. It is used by many Transformer models, including GPT, GPT-2, RoBERTa, BART, and DeBERTa.

</aside>


The Process (BPE):

1. Start from a base vocabulary of individual characters (or raw bytes).
2. Count how often every adjacent pair of tokens appears in the training corpus.
3. Merge the most frequent pair into a single new token and add it to the vocabulary.
4. Repeat steps 2 and 3 until the vocabulary reaches its target size.

To encode new text, the tokenizer replays these learned merges and then looks up each resulting piece in the vocabulary.

Output: A sequence of Token IDs (integers).
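The merge-learning loop above can be sketched in a few lines of Python. This is a toy illustration only: the corpus, word frequencies, and number of merges are made up, and real implementations (such as GPT-2's) operate on bytes and handle pre-tokenization, which is skipped here.

```python
from collections import Counter

# Toy corpus: each word is a tuple of symbols, starting from single characters,
# mapped to how often it appears in the training data.
corpus = {
    ("l", "o", "w"): 5,
    ("l", "o", "w", "e", "r"): 2,
    ("n", "e", "w", "e", "s", "t"): 6,
    ("w", "i", "d", "e", "s", "t"): 3,
}

def most_frequent_pair(corpus):
    """Count every adjacent symbol pair, weighted by word frequency."""
    pairs = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(corpus, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for word, freq in corpus.items():
        new_word, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                new_word.append(word[i] + word[i + 1])
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        merged[tuple(new_word)] = freq
    return merged

merges = []
for _ in range(5):  # learn 5 merges for the demo
    pair = most_frequent_pair(corpus)
    merges.append(pair)
    corpus = merge_pair(corpus, pair)

print(merges)  # the learned merge rules, in order
```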

Vocabulary Mapping Function

While the splitting itself is algorithmic (driven by the learned merge rules), the final step is governed by a Vocabulary Mapping Function: a fixed lookup table that assigns each token string a unique integer ID, and that can be inverted to turn IDs back into text.
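Here is a minimal sketch of that mapping, using a tiny hand-written vocabulary (real vocabularies are learned and contain tens of thousands of sub-words):

```python
# Toy vocabulary: token string -> integer ID (real ones are learned and much larger).
vocab = {"un": 0, "break": 1, "able": 2, "the": 3, "<unk>": 4}
inverse_vocab = {i: tok for tok, i in vocab.items()}

def encode(tokens):
    """Map token strings to IDs, falling back to <unk> for unknown tokens."""
    return [vocab.get(tok, vocab["<unk>"]) for tok in tokens]

def decode(ids):
    """Map IDs back to token strings."""
    return [inverse_vocab[i] for i in ids]

ids = encode(["un", "break", "able"])
print(ids)          # [0, 1, 2]
print(decode(ids))  # ['un', 'break', 'able']
```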

We use a fixed vocabulary size because of the final layer of the neural network. In an LLM, the very last calculation is a softmax over every possible token in the dictionary, so the output layer needs exactly one logit per vocabulary entry, and its size must be fixed before training begins.
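For instance, here is a rough sketch of that final step in PyTorch, assuming illustrative sizes (hidden size 768 and GPT-2's vocabulary size of 50,257):

```python
import torch
import torch.nn as nn

hidden_size = 768      # size of the model's internal representation (illustrative)
vocab_size = 50_257    # GPT-2's vocabulary size, used here for illustration

# The "LM head": projects the final hidden state to one logit per vocabulary entry.
lm_head = nn.Linear(hidden_size, vocab_size)

hidden_state = torch.randn(1, hidden_size)   # stand-in for the model's last hidden state
logits = lm_head(hidden_state)               # shape: (1, vocab_size)
probs = torch.softmax(logits, dim=-1)        # a probability for every token in the dictionary

next_token_id = torch.argmax(probs, dim=-1)  # greedy choice of the next token ID
print(probs.shape, next_token_id)
```

Because this layer has one output per token, growing the vocabulary directly grows the model's final matrix, which is why the dictionary size is chosen up front and kept manageable.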