Sep 21, 2022

Other existing approaches frequently use smaller, more closely paired audio-text training datasets,[1, 2, 3] or use broad but unsupervised audio pretraining.[4, 5, 6] Because Whisper was trained on a large and diverse dataset and was not fine-tuned to any specific one, it does not beat models that specialize in LibriSpeech performance, a famously competitive benchmark in speech recognition.
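For readers who want to see the zero-shot usage this paragraph describes, here is a minimal sketch using the open-source `whisper` Python package. The model size `"base"` and the file name `audio.mp3` are placeholder assumptions, and the package plus ffmpeg are assumed to be installed (`pip install -U openai-whisper`).

```python
# Minimal sketch: zero-shot transcription with the open-source Whisper package.
# Assumes `pip install -U openai-whisper` (plus ffmpeg on the system) and a
# local file named audio.mp3 -- both the model size and the file name are
# placeholder choices, not prescriptions from the post.
import whisper

# Load one of the released checkpoints; "base" trades accuracy for speed.
model = whisper.load_model("base")

# Transcribe directly, with no fine-tuning or dataset-specific adaptation.
result = model.transcribe("audio.mp3")
print(result["text"])
```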