Supplemental reading and resources

This unit provided a hands-on introduction to speech recognition, one of the most popular tasks in the audio domain. Want to learn more? Here you will find additional resources that will help you deepen your understanding of the topics and enhance your learning experience.

Whisper Talk by Jong Wook Kim: a presentation on the Whisper model, explaining the motivation, architecture, training and results, delivered by Whisper author Jong Wook Kim
End-to-End Speech Benchmark (ESB): a paper that comprehensively argues for using the orthographic WER as opposed to the normalised WER for evaluating ASR systems and presents an accompanying benchmark
Fine-Tuning Whisper for Multilingual ASR: an in-depth blog post that explains how the Whisper model works in more detail, and the pre- and post-processing steps involved with the feature extractor and tokenizer
Fine-tuning MMS Adapter Models for Multi-Lingual ASR: an end-to-end guide for fine-tuning Meta AI’s new MMS speech recognition models, freezing the base model weights and only fine-tuning a small number of adapter layers
Boosting Wav2Vec2 with n-grams in 🤗 Transformers: a blog post for combining CTC models with external language models (LMs) to combat spelling and punctuation errors

Update on GitHub

Audio Course

Supplemental reading and resources