Whisper is a state-of-the-art speech recognition model designed for a variety of speech-processing tasks. It is a Transformer sequence-to-sequence model that performs multilingual speech recognition, speech translation, spoken language identification, and voice activity detection. Trained on a large and diverse audio dataset, it handles a wide range of audio inputs and can replace several stages of a traditional speech-processing pipeline with a single multitask model.

The codebase requires Python 3.8-3.11 and a recent version of PyTorch (1.10.1 was used during development), along with a few additional dependencies such as OpenAI's tiktoken. The model is available in five sizes, each offering a different tradeoff between speed and accuracy, and performance varies by language, with detailed metrics provided in the paper's appendices. Whisper includes a command-line interface for transcribing audio files, and its Python API supports transcription and language detection. The code and model weights are released under the MIT License, promoting open-source collaboration.
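
As a minimal sketch of the Python usage mentioned above (assuming the whisper package is installed and a local file named audio.mp3 exists as a placeholder input), transcription and language detection look roughly like this:

    import whisper

    # Load one of the available model sizes; larger models trade speed for accuracy.
    model = whisper.load_model("base")

    # Transcribe the audio file and print the recognized text.
    result = model.transcribe("audio.mp3")
    print(result["text"])

    # Language detection on the first 30 seconds of the audio.
    audio = whisper.load_audio("audio.mp3")
    audio = whisper.pad_or_trim(audio)
    mel = whisper.log_mel_spectrogram(audio).to(model.device)
    _, probs = model.detect_language(mel)
    print(f"Detected language: {max(probs, key=probs.get)}")

The same transcription can be run from the command-line interface without writing any Python, which is the typical entry point for quick, one-off audio files.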

OpenAI
Over 40,000 stars
April 2, 2024
OpenAI Whisper GitHub Repository