Once the model’s feature extractor (CNN, Transformer, LSTM, etc.) processes the Log-Mel Spectrogram, it produces a probability distribution over characters or phonemes for each time frame.
The Decoder is responsible for turning those probabilities into the final transcript.
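Concretely, the decoder's input is usually a matrix with one probability distribution per audio frame. A minimal sketch of that shape in NumPy (the sizes and values here are invented for illustration):

```python
import numpy as np

# Hypothetical decoder input: one probability distribution over the
# character vocabulary per audio frame, shape (time_steps, vocab_size).
rng = np.random.default_rng(seed=0)
time_steps, vocab_size = 200, 29  # e.g. 26 letters + space + apostrophe + CTC blank
logits = rng.normal(size=(time_steps, vocab_size))

# Softmax turns the raw scores into per-frame probabilities.
probs = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)
assert np.allclose(probs.sum(axis=1), 1.0)  # each frame sums to 1
```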
The Acoustic-to-Text Mapping
The model outputs a "best guess" for every few milliseconds of audio, which yields far more predictions than there are characters in the final transcript. Two techniques bridge that gap:
CTC (Connectionist Temporal Classification): This is a specialized algorithm that handles the alignment. It collapses repeated characters and removes "blank" (ϵ) tokens, so a frame sequence like h-h-e-e-l-ϵ-l-o-o collapses to the word hello (the blank between the two l's is what keeps the genuine double letter from being merged).
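A minimal sketch of the collapse rule in Python (the frame labels are invented; a real CTC decoder works on the probability matrix rather than hard labels):

```python
BLANK = "-"  # the CTC blank token, often written ϵ

def ctc_collapse(frame_labels: str) -> str:
    """Merge consecutive duplicates, then drop blanks."""
    out = []
    prev = None
    for label in frame_labels:
        if label != prev and label != BLANK:
            out.append(label)
        prev = label
    return "".join(out)

# The blank between the two l's preserves the double letter:
print(ctc_collapse("hh-ee-ll-l-oo"))  # -> "hello"
```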
Beam Search: Instead of picking only the single most likely character at each step (greedy decoding, which compounds early mistakes), Beam Search keeps a "beam" of the top N most likely candidate sequences and picks the one with the highest overall probability at the end.
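A toy beam search over per-step character probabilities (the distributions are invented; note that with independent per-step scores the top beam equals the greedy pick, and the advantage appears once the score depends on the whole prefix, e.g. through CTC collapsing or the LM fusion shown below):

```python
import math

def beam_search(step_probs: list[dict[str, float]], beam_width: int = 3):
    """Keep the top-N partial sequences by cumulative log probability."""
    beams = [("", 0.0)]  # (sequence, log probability)
    for probs in step_probs:
        candidates = [
            (seq + ch, score + math.log(p))
            for seq, score in beams
            for ch, p in probs.items()
        ]
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:beam_width]  # prune to the beam width
    return beams[0]

steps = [{"t": 0.6, "d": 0.4}, {"o": 0.5, "u": 0.5}, {"o": 0.7, "e": 0.3}]
print(beam_search(steps))  # -> ('too', ...), the highest-scoring full sequence
```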
The Language Model (LM) Integration
The "Acoustic Model" only knows what the audio sounds like. The Language Model knows how humans actually speak. This helps the AI resolve homophones:
<aside> ⛔
Input: "I have /tu/ apples."
LM Correction: The LM knows that "two" is statistically far more likely than "to" or "too" to follow "have" in this context.
</aside>
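A common way to wire the LM in is shallow fusion: during decoding, each hypothesis is scored as its acoustic log-probability plus a weighted LM log-probability. A minimal sketch of the /tu/ example (all probabilities here are invented for illustration):

```python
import math

# Hypothetical acoustic scores for the ambiguous /tu/ sound.
acoustic = {"to": 0.40, "too": 0.35, "two": 0.25}

# Toy LM scores for the context "I have ___ apples".
# A real system would use an n-gram or neural LM here.
lm = {"two": 0.90, "to": 0.05, "too": 0.05}

LM_WEIGHT = 0.8  # λ: how strongly the LM can override the acoustics

def fused_score(word: str) -> float:
    # Shallow fusion: log P_acoustic(word) + λ · log P_LM(word | context)
    return math.log(acoustic[word]) + LM_WEIGHT * math.log(lm[word])

print(max(acoustic, key=fused_score))
# -> "two": the LM outweighs the acoustic preference for "to"
```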
Modern ASR systems don't just output raw text; they also perform Inverse Text Normalization (ITN), converting the spoken form the model emits ("twenty five dollars") into the written form readers expect ("$25").
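A toy rule-based sketch of the numbers-and-currency case (production ITN systems typically use weighted finite-state transducer grammars rather than regexes):

```python
import re

UNITS = {"one": 1, "two": 2, "three": 3, "four": 4, "five": 5,
         "six": 6, "seven": 7, "eight": 8, "nine": 9}
TENS = {"twenty": 20, "thirty": 30, "forty": 40, "fifty": 50}

# Matches number phrases like "five" or "twenty five".
WORD = "|".join(list(TENS) + list(UNITS))
NUM_PHRASE = rf"\b(?:{WORD})(?:\s+(?:{WORD}))*\b"

def words_to_number(phrase: str) -> int:
    return sum(TENS.get(w, 0) + UNITS.get(w, 0) for w in phrase.split())

def inverse_text_normalize(spoken: str) -> str:
    # Currency first: "twenty five dollars" -> "$25"
    spoken = re.sub(rf"({NUM_PHRASE})\s+dollars",
                    lambda m: f"${words_to_number(m.group(1))}", spoken)
    # Then bare number phrases: "two" -> "2"
    return re.sub(NUM_PHRASE,
                  lambda m: str(words_to_number(m.group(0))), spoken)

print(inverse_text_normalize("i paid twenty five dollars for two apples"))
# -> "i paid $25 for 2 apples"
```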