If I have multiple audio files and want to know which one has the best quality, there are a few tools I can use to test them.
MOS Score:
Note: These MOS models are non-intrusive predictors. They only need the audio file itself; no clean reference file is required.

| Model | Accuracy | Speed | Best For |
|---|---|---|---|
| **DeepMOS (Wav2Vec + Transformer)** | 🟢🟢🟢🟢🟢 | 🟢🟢 | General Speech Quality |
| **NISQA (CNN + Transformer)** | 🟢🟢🟢🟢 | 🟢🟢🟢 | VoIP & Speech Enhancement |
| **MOSNet (CNN + BiLSTM)** | 🟢🟢🟢 | 🟢🟢🟢🟢 | TTS & Synthesized Speech |
| **DNSMOS (Deep Learning)** | 🟢🟢 | 🟢🟢🟢🟢🟢 | Noisy Speech & Real-Time Processing |
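
To compare several files with any of the predictors above, the usual pattern is to score each file and sort by predicted MOS (on the standard 1 to 5 scale, higher is better). The sketch below shows only that ranking step. `rank_by_mos` and `fake_predict_mos` are hypothetical names, and the scores are made-up placeholders, not output from a real model; in practice `predict_mos` would wrap whichever model (NISQA, DNSMOS, etc.) is actually installed.

```python
from pathlib import Path
from typing import Callable, Iterable, List, Tuple


def rank_by_mos(
    paths: Iterable[str],
    predict_mos: Callable[[Path], float],
) -> List[Tuple[Path, float]]:
    """Score every file with the given predictor and sort best-first."""
    scored = [(Path(p), float(predict_mos(Path(p)))) for p in paths]
    return sorted(scored, key=lambda item: item[1], reverse=True)


def fake_predict_mos(path: Path) -> float:
    """Placeholder predictor (assumption): returns hard-coded demo scores.

    Replace this with a call into a real non-intrusive MOS model.
    """
    demo_scores = {"clean.wav": 4.2, "clipped.wav": 3.1, "noisy.wav": 2.7}
    return demo_scores[path.name]


ranking = rank_by_mos(["noisy.wav", "clean.wav", "clipped.wav"], fake_predict_mos)
for path, score in ranking:
    print(f"{path.name}: {score:.2f}")
```

Keeping the predictor as a plain callable makes it easy to run the same files through two models and check whether they agree on the ordering.
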

Speaker Diarization (SpeechBrain):

| Model / Setup | Model Size | DER (↓ better) | Speed | Notes |
|---|---|---|---|---|
| ECAPA-TDNN + Spectral Clustering | 14M params | ~3.0–4.0% | ⚡⚡⚡ | High accuracy, slower clustering |
| ECAPA-TDNN + AHC | 14M params | ~4.5–5.5% | ⚡⚡⚡⚡ | Fast clustering, good accuracy |
| SpeechBrain Full Pipeline (CRDNN+ECAPA) | 20M params | ~3.5–4.5% | ⚡⚡⚡ | All-in-one pipeline, includes VAD |
| X-vector + PLDA | 7M params | ~6.5–7.5% | ⚡⚡⚡⚡⚡ | Very fast, lightweight, lower accuracy |
| X-vector + AHC / KMeans | 7M params | ~6.0–7.0% | ⚡⚡⚡⚡⚡ | Fastest with simple clustering |
| Resemblyzer + KMeans (external) | 5M params | ~7.0–8.0% | ⚡⚡⚡⚡⚡ | Python-only, very low overhead |
| Tiny ECAPA (custom) | 6–8M params | ~5.0–6.0% | ⚡⚡⚡⚡ | Requires training, good trade-off |
| WavLM Base + Clustering | 94M params | ~4.0–5.0% | ⚡⚡⚡ | Accurate but large and slower |
| pyannote VAD + clustering | 1M (VAD) | ~6.0–7.0% | ⚡⚡⚡⚡ | Modular, pair with any embedder |
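
For context on the DER column: diarization error rate is (missed speech + false alarm + speaker confusion) divided by total reference speech time. The sketch below is a simplified frame-level version. It assumes one speaker per frame, no overlap, and no forgiveness collar, and it brute-forces the speaker mapping, so it only suits a handful of speakers. `frame_der` is a hypothetical helper, not part of SpeechBrain or pyannote; use an established scoring tool for real evaluation.

```python
from itertools import permutations
from typing import List, Optional


def frame_der(reference: List[Optional[str]], hypothesis: List[Optional[str]]) -> float:
    """Frame-level DER; each label is a speaker id, or None for non-speech."""
    if len(reference) != len(hypothesis):
        raise ValueError("reference and hypothesis must have the same length")

    total_speech = sum(r is not None for r in reference)
    if total_speech == 0:
        raise ValueError("reference contains no speech frames")

    pairs = list(zip(reference, hypothesis))
    missed = sum(r is not None and h is None for r, h in pairs)
    false_alarm = sum(r is None and h is not None for r, h in pairs)
    both_speech = [(r, h) for r, h in pairs if r is not None and h is not None]

    ref_speakers = list({r for r in reference if r is not None})
    hyp_speakers = list({h for h in hypothesis if h is not None})

    # Hypothesis labels are arbitrary, so try every one-to-one mapping onto
    # reference speakers and keep the one with the most matching frames.
    # Padding with None lets surplus hypothesis speakers stay unmatched.
    padding = max(0, len(hyp_speakers) - len(ref_speakers))
    candidates = ref_speakers + [None] * padding
    best_correct = 0
    for perm in permutations(candidates, len(hyp_speakers)):
        mapping = dict(zip(hyp_speakers, perm))
        correct = sum(mapping[h] == r for r, h in both_speech)
        best_correct = max(best_correct, correct)

    confusion = len(both_speech) - best_correct
    return (missed + false_alarm + confusion) / total_speech


ref = ["A", "A", "A", "B", "B", None, None, "B"]
hyp = ["x", "x", "y", "y", "y", None, "y", None]
print(f"DER = {frame_der(ref, hyp):.1%}")  # 1 missed + 1 false alarm + 1 confused over 6 speech frames
```
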