Tokenization is the process of breaking down a continuous stream of text into discrete, manageable units (tokens) that a computer can map to a fixed dictionary of numbers.


Imagine you are building a massive Lego castle. If the manual tried to give you a single, unique brick for every possible building in the world (a "Whole Word" approach), the manual would be infinitely large. If it gave you only individual plastic atoms (a "Character" approach), it would take years to build a single wall.


Tokenization is like finding the perfect set of reusable sub-assemblies. It identifies common patterns, like a standard 2x4 brick, a window frame, or a roof tile.

Rare words are broken into smaller pieces (like "un-" + "break" + "able"), while common words are kept whole. This balances efficiency (not too many pieces) with flexibility (you can build anything).
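For example, here is a minimal sketch using the Hugging Face `transformers` GPT-2 tokenizer (the exact splits depend on the merges that particular vocabulary learned):

```python
from transformers import AutoTokenizer

# Load a pretrained BPE tokenizer (GPT-2's vocabulary of roughly 50k tokens).
tokenizer = AutoTokenizer.from_pretrained("gpt2")

# A common word is usually kept whole, while a rarer word is split into sub-pieces.
# The exact splits depend on the merges this particular vocabulary learned.
print(tokenizer.tokenize("the"))          # typically a single token
print(tokenizer.tokenize("unbreakable"))  # typically several sub-word pieces
```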

Most modern LLMs (like GPT-4 or Llama) use an algorithm called Byte-Pair Encoding (BPE).

<aside> ❓

Byte-Pair Encoding (BPE) was originally developed as a text-compression algorithm and was later adopted by OpenAI for tokenization when pretraining the GPT model. It is used by many Transformer models, including GPT, GPT-2, RoBERTa, BART, and DeBERTa.

</aside>


The Process (BPE):

1. Start from a base vocabulary of individual characters (or raw bytes).
2. Count how often every adjacent pair of tokens appears in the training corpus.
3. Merge the most frequent pair into a single new token and add it to the vocabulary.
4. Repeat steps 2 and 3 until the vocabulary reaches its target size.

To encode new text, the tokenizer replays these learned merges and then looks up each resulting piece in the vocabulary.

Output: A sequence of Token IDs (integers).
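The merge-learning loop above can be sketched in a few lines of Python. This is a toy illustration only: the corpus, word frequencies, and number of merges are made up, and real implementations (such as GPT-2's) operate on bytes and handle pre-tokenization, which is skipped here.

```python
from collections import Counter

# Toy corpus: each word is a tuple of symbols, starting from single characters,
# mapped to how often it appears in the training data.
corpus = {
    ("l", "o", "w"): 5,
    ("l", "o", "w", "e", "r"): 2,
    ("n", "e", "w", "e", "s", "t"): 6,
    ("w", "i", "d", "e", "s", "t"): 3,
}

def most_frequent_pair(corpus):
    """Count every adjacent symbol pair, weighted by word frequency."""
    pairs = Counter()
    for word, freq in corpus.items():
        for a, b in zip(word, word[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0]

def merge_pair(corpus, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for word, freq in corpus.items():
        new_word, i = [], 0
        while i < len(word):
            if i < len(word) - 1 and (word[i], word[i + 1]) == pair:
                new_word.append(word[i] + word[i + 1])
                i += 2
            else:
                new_word.append(word[i])
                i += 1
        merged[tuple(new_word)] = freq
    return merged

merges = []
for _ in range(5):  # learn 5 merges for the demo
    pair = most_frequent_pair(corpus)
    merges.append(pair)
    corpus = merge_pair(corpus, pair)

print(merges)  # the learned merge rules, in order
```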

Vocabulary Mapping Function

While the splitting itself is algorithmic (driven by the learned merge rules), the final step is governed by a Vocabulary Mapping Function: a fixed lookup table that assigns each token string a unique integer ID, and that can be inverted to turn IDs back into text.
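Here is a minimal sketch of that mapping, using a tiny hand-written vocabulary (real vocabularies are learned and contain tens of thousands of sub-words):

```python
# Toy vocabulary: token string -> integer ID (real ones are learned and much larger).
vocab = {"un": 0, "break": 1, "able": 2, "the": 3, "<unk>": 4}
inverse_vocab = {i: tok for tok, i in vocab.items()}

def encode(tokens):
    """Map token strings to IDs, falling back to <unk> for unknown tokens."""
    return [vocab.get(tok, vocab["<unk>"]) for tok in tokens]

def decode(ids):
    """Map IDs back to token strings."""
    return [inverse_vocab[i] for i in ids]

ids = encode(["un", "break", "able"])
print(ids)          # [0, 1, 2]
print(decode(ids))  # ['un', 'break', 'able']
```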

We use a fixed vocabulary size because of the final layer of the neural network. In an LLM, the very last calculation is a softmax over every possible token in the dictionary, so the output layer needs exactly one logit per vocabulary entry, and its size must be fixed before training begins.
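For instance, here is a rough sketch of that final step in PyTorch, assuming illustrative sizes (hidden size 768 and GPT-2's vocabulary size of 50,257):

```python
import torch
import torch.nn as nn

hidden_size = 768      # size of the model's internal representation (illustrative)
vocab_size = 50_257    # GPT-2's vocabulary size, used here for illustration

# The "LM head": projects the final hidden state to one logit per vocabulary entry.
lm_head = nn.Linear(hidden_size, vocab_size)

hidden_state = torch.randn(1, hidden_size)   # stand-in for the model's last hidden state
logits = lm_head(hidden_state)               # shape: (1, vocab_size)
probs = torch.softmax(logits, dim=-1)        # a probability for every token in the dictionary

next_token_id = torch.argmax(probs, dim=-1)  # greedy choice of the next token ID
print(probs.shape, next_token_id)
```

Because this layer has one output per token, growing the vocabulary directly grows the model's final matrix, which is why the dictionary size is chosen up front and kept manageable.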