Once the model’s feature extractor (CNN, Transformer, LSTM, etc.) processes the Log-Mel Spectrogram, it produces a probability distribution over characters or phonemes for each time frame.
The Decoder is responsible for turning those probabilities into the final transcript.
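Concretely, the decoder's input is usually a matrix with one probability distribution per audio frame. A minimal sketch of that shape in NumPy (the sizes and values here are invented for illustration):

```python
import numpy as np

# Hypothetical decoder input: one probability distribution over the
# character vocabulary per audio frame, shape (time_steps, vocab_size).
rng = np.random.default_rng(seed=0)
time_steps, vocab_size = 200, 29  # e.g. 26 letters + space + apostrophe + CTC blank
logits = rng.normal(size=(time_steps, vocab_size))

# Softmax turns the raw scores into per-frame probabilities.
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
assert np.allclose(probs.sum(axis=1), 1.0)  # each frame sums to 1
```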
The Acoustic-to-Text Mapping
The model outputs a "best guess" for every few milliseconds of audio, which yields far more predictions than there are characters in the final transcript. Two techniques bridge that gap:
CTC (Connectionist Temporal Classification): This is a specialized algorithm that handles the alignment. It collapses repeated characters and removes "blank" (ϵ) tokens, so a frame sequence like h-h-e-e-l-ϵ-l-o-o collapses to the word hello (the blank between the two l's is what keeps the genuine double letter from being merged).
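A minimal sketch of the collapse rule in Python (the frame labels are invented; a real CTC decoder works on the probability matrix rather than hard labels):

```python
BLANK = "-"  # the CTC blank token, often written ϵ

def ctc_collapse(frame_labels: str) -> str:
    """Merge consecutive duplicates, then drop blanks."""
    out = []
    prev = None
    for label in frame_labels:
        if label != prev and label != BLANK:
            out.append(label)
        prev = label
    return "".join(out)

# The blank between the two l's preserves the double letter:
print(ctc_collapse("hh-ee-ll-l-oo"))  # -> "hello"
```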
Beam Search: Instead of picking only the single most likely character at each step (greedy decoding, which compounds early mistakes), Beam Search keeps a "beam" of the top N most likely candidate sequences and picks the one with the highest overall probability at the end.
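A toy beam search over per-step character probabilities (the distributions are invented; note that with independent per-step scores the top beam equals the greedy pick, and the advantage appears once the score depends on the whole prefix, e.g. through CTC collapsing or the LM fusion shown below):

```python
import math

def beam_search(step_probs: list[dict[str, float]], beam_width: int = 3):
    """Keep the top-N partial sequences by cumulative log probability."""
    beams = [("", 0.0)]  # (sequence, log probability)
    for probs in step_probs:
        candidates = [
            (seq + ch, score + math.log(p))
            for seq, score in beams
            for ch, p in probs.items()
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]  # prune to the beam width
    return beams[0]

steps = [{"t": 0.6, "d": 0.4}, {"o": 0.5, "u": 0.5}, {"o": 0.7, "e": 0.3}]
print(beam_search(steps))  # -> ('too', ...), the highest-scoring full sequence
```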
The Language Model (LM) Integration
The "Acoustic Model" only knows what the audio sounds like. The Language Model knows how humans actually speak. This helps the AI resolve homophones:
<aside> ⛔
Input: "I have /tu/ apples."
LM Correction: The LM knows that "two" is statistically far more likely than "to" or "too" to follow "have" in this context.
</aside>
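A common way to wire the LM in is shallow fusion: during decoding, each hypothesis is scored as its acoustic log-probability plus a weighted LM log-probability. A minimal sketch of the /tu/ example (all probabilities here are invented for illustration):

```python
import math

# Hypothetical acoustic scores for the ambiguous /tu/ sound.
acoustic = {"to": 0.40, "too": 0.35, "two": 0.25}

# Toy LM scores for the context "I have ___ apples".
# A real system would use an n-gram or neural LM here.
lm = {"two": 0.90, "to": 0.05, "too": 0.05}

LM_WEIGHT = 0.8  # λ: how strongly the LM can override the acoustics

def fused_score(word: str) -> float:
    # Shallow fusion: log P_acoustic(word) + λ · log P_LM(word | context)
    return math.log(acoustic[word]) + LM_WEIGHT * math.log(lm[word])

print(max(acoustic, key=fused_score))
# -> "two": the LM outweighs the acoustic preference for "to"
```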
Modern ASR systems don't just output raw text; they also perform Inverse Text Normalization (ITN), converting the spoken form the model emits ("twenty five dollars") into the written form readers expect ("$25").
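A toy rule-based sketch of the numbers-and-currency case (production ITN systems typically use weighted finite-state transducer grammars rather than regexes):

```python
import re

UNITS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
         "six": 6, "seven": 7, "eight": 8, "nine": 9}
TENS = {"twenty": 20, "thirty": 30, "forty": 40, "fifty": 50}

# Matches number phrases like "five" or "twenty five".
WORD = "|".join(list(TENS) + list(UNITS))
NUM_PHRASE = rf"\b(?:{WORD})(?:\s+(?:{WORD}))*\b"

def words_to_number(phrase: str) -> int:
    return sum(TENS.get(w, 0) + UNITS.get(w, 0) for w in phrase.split())

def inverse_text_normalize(spoken: str) -> str:
    # Currency first: "twenty five dollars" -> "$25"
    spoken = re.sub(rf"({NUM_PHRASE})\s+dollars",
                    lambda m: f"${words_to_number(m.group(1))}", spoken)
    # Then bare number phrases: "two" -> "2"
    return re.sub(NUM_PHRASE,
                  lambda m: str(words_to_number(m.group(0))), spoken)

print(inverse_text_normalize("i paid twenty five dollars for two apples"))
# -> "i paid $25 for 2 apples"
```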