Automatic Speech Recognition

Automatic Speech Recognition converts human speech into text. In simple terms, we take raw sound, turn it into a form that a model can understand, let the model learn patterns from speech, then decode the output back into words.

ASR is used in things like voice assistants, subtitle generation, call transcription, meeting notes, dictation tools, and speech interfaces.

Why not use raw audio directly?

Raw audio is just a long stream of changing values over time. While it contains all the information, it is not the easiest format for a model to learn from.

Instead, we usually transform audio into features that reveal useful patterns more clearly.

One of the most common representations is the Log-Mel Spectrogram.

It is a way to turn sound into an image-like representation that shows:

how energy changes over time
which frequencies are strong
how humans roughly perceive sound

The Mel scale is useful because it better reflects how people hear differences in pitch. The log scale helps compress very large differences in energy, making quieter and louder parts easier to compare.

Mel-Spectogram

In short, the Log-Mel Spectrogram acts as the bridge between raw sound and the deep learning model.

What is sound?

This page explains the foundations of audio, frequency, and waveform behavior before we turn sound into model inputs.

How does ASR work?

A basic ASR pipeline usually looks like this: