Raw audio is extremely high-dimensional (e.g., 16,000 samples per second at a 16 kHz sampling rate), and deep learning models struggle to find patterns in raw waveforms directly. Instead, the audio is usually converted into (and visualized as) a log Mel-spectrogram.

[Figure: example of a log Mel-spectrogram]
Decomposition: The audio is sliced into short, overlapping "frames" (usually 25 ms). Each frame is converted into its frequency components via a short-time Fourier transform (STFT).
Mel Scaling: Humans are better at hearing differences between low frequencies than between high frequencies, so the frequency axis is "warped" onto the Mel scale (a common mapping is m = 2595 * log10(1 + f / 700)) to prioritize the information that matters for speech.
Log Compression: Our perception of loudness is logarithmic, not linear. Applying log() to the values compresses the dynamic range and makes the features more robust to differences in volume and noise. A minimal code sketch of this pipeline follows below.
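Here is a minimal sketch of those three steps using librosa. The filename, frame sizes, and bin count are illustrative choices, not fixed requirements; at 16 kHz, n_fft=400 corresponds to 25 ms frames and hop_length=160 to a 10 ms hop.

```python
import librosa
import numpy as np

# Load audio at 16 kHz ("speech.wav" is a placeholder path)
y, sr = librosa.load("speech.wav", sr=16000)

# Decomposition + Mel scaling: 25 ms frames (400 samples) with a 10 ms hop
# (160 samples), each frame converted to frequency components via an STFT,
# then pooled into 40 Mel bands.
S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=400, hop_length=160, n_mels=40)

# Log compression: convert power to decibels, compressing the dynamic range.
log_S = librosa.power_to_db(S, ref=np.max)

print(log_S.shape)  # (40 Mel bins, number of time frames)
```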
In a Mel-spectrogram, the frequency axis is divided into Mel bins (also called frequency bins), each representing a distinct frequency range:
Each Mel bin holds the energy within a specific band of frequencies. For example, with 40 Mel bins, the frequency spectrum is divided into 40 frequency bands (see the filterbank sketch below).
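To make the bins concrete, you can inspect the Mel filterbank itself. This is a sketch assuming librosa with the same 16 kHz, 40-bin setup as above; each row of the matrix is one filter that weights a range of FFT frequencies.

```python
import librosa

# Build a 40-band Mel filterbank for 16 kHz audio and 400-sample frames.
mel_fb = librosa.filters.mel(sr=16000, n_fft=400, n_mels=40)

print(mel_fb.shape)  # (40, 201): 40 Mel bins x 201 FFT frequency bins
```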
The log Mel-spectrogram is then sent into a 1D convolutional layer to extract features (the exact front-end depends on the architecture).
— Note: To understand more, refer to my notes on Computer Vision
1D vs 2D vs 3D Convolution: https://stackoverflow.com/questions/42883547/intuitive-understanding-of-1d-2d-and-3d-convolutions-in-convolutional-neural-n

1D Convolution Visualized
A 1D convolution is a technique commonly used in convolutional neural networks (CNNs) to process sequential data, such as time series or audio signals. Applied to a Mel-spectrogram, the filter (kernel) treats the Mel bins as input channels and slides along the time axis to extract features.
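A minimal PyTorch sketch of this, assuming the 40-bin log Mel-spectrogram from above; the channel count, kernel size, and padding are illustrative, since the actual front-end varies by architecture.

```python
import torch
import torch.nn as nn

# Treat the 40 Mel bins as input channels; the kernel slides along time.
conv = nn.Conv1d(in_channels=40, out_channels=64, kernel_size=3, padding=1)

# Dummy batch: (batch, Mel bins, time frames)
log_mel = torch.randn(1, 40, 100)

features = conv(log_mel)
print(features.shape)  # torch.Size([1, 64, 100])
```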