openai
whisper-large-v3
A multilingual speech recognition and translation model, trained on 5M hours of audio to deliver robust transcription across diverse languages, accents, and domains in a zero-shot setting.
Whisper is a state-of-the-art model for automatic speech recognition (ASR) and speech translation, proposed in the paper Robust Speech Recognition via Large-Scale Weak Supervision by Alec Radford et al. from OpenAI. Trained on >5M hours of labeled data, Whisper demonstrates a strong ability to generalise to many datasets and domains in a zero-shot setting.
Whisper large-v3 has the same architecture as the previous large and large-v2 models, except for the following minor differences:
- The spectrogram input uses 128 Mel frequency bins instead of 80
- A new language token for Cantonese
The Whisper large-v3 model was trained on 1 million hours of weakly labeled audio and 4 million hours of pseudo-labeled audio collected using Whisper large-v2. The model was trained for 2.0 epochs over this mixture dataset.
The large-v3 model shows improved performance over a wide variety of languages, with a 10% to 20% reduction in errors compared to Whisper large-v2. For more details on the different checkpoints available, refer to the Model details section.
Disclaimer: Content for this model card has partly been written by the Hugging Face team, and partly copied and pasted from the original model card.
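As a quick start, the model can be run with the Hugging Face transformers `pipeline` API. This is a minimal sketch; the audio path `"sample.wav"` is a placeholder for your own file.

```python
# Minimal transcription sketch with the transformers ASR pipeline.
# "sample.wav" is a placeholder path; replace it with your own audio file.
import torch
from transformers import pipeline

# Use a GPU if one is available, otherwise fall back to CPU.
device = "cuda:0" if torch.cuda.is_available() else "cpu"

pipe = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    device=device,
)

result = pipe("sample.wav")
print(result["text"])
```

The pipeline handles feature extraction and decoding internally, so a single call returns the transcribed text.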
Additional Speed & Memory Improvements
You can apply additional speed and memory optimisations to Whisper to further reduce inference time and VRAM requirements.
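One such optimisation is loading the model in half precision, which roughly halves VRAM use on GPU. A sketch using the standard `from_pretrained` arguments (half precision assumes a CUDA device is present):

```python
# Sketch: load Whisper in float16 to roughly halve GPU memory use.
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor

model_id = "openai/whisper-large-v3"

# float16 only makes sense on GPU; fall back to float32 on CPU.
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id,
    torch_dtype=torch_dtype,
    low_cpu_mem_usage=True,  # stream weights in to avoid a full fp32 copy in RAM
)
processor = AutoProcessor.from_pretrained(model_id)
```

`low_cpu_mem_usage=True` avoids materialising a second full-precision copy of the weights while loading.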
Chunked Long-Form
Whisper has a receptive field of 30 seconds. To transcribe audio longer than this, one of two long-form algorithms is required:
- Sequential: uses a "sliding window" for buffered inference, transcribing 30-second slices one after the other
- Chunked: splits long audio files into shorter ones (with a small overlap between segments), transcribes each segment independently, and stitches the resulting transcriptions at the boundaries
The sequential long-form algorithm should be used in either of the following scenarios:
- Transcription accuracy is the most important factor, and speed is less of a consideration
- You are transcribing batches of long audio files, in which case the latency of sequential is comparable to chunked, while being up to 0.5% WER more accurate
Conversely, the chunked algorithm should be used when:
- Transcription speed is the most important factor
- You are transcribing a single long audio file
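The chunked algorithm above is enabled in the `pipeline` API by passing `chunk_length_s`; `batch_size` then transcribes the chunks in parallel. A sketch (the filename `"long_audio.wav"` is a placeholder):

```python
# Chunked long-form sketch: chunk_length_s splits the audio into 30 s
# windows with a small overlap, and batch_size transcribes them in parallel.
# "long_audio.wav" is a placeholder filename.
from transformers import pipeline

pipe = pipeline(
    "automatic-speech-recognition",
    model="openai/whisper-large-v3",
    chunk_length_s=30,  # Whisper's 30-second receptive field
    batch_size=8,       # number of chunks transcribed at once
)

print(pipe("long_audio.wav")["text"])
```

Increasing `batch_size` trades VRAM for throughput; omitting `chunk_length_s` falls back to sequential decoding.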
Model details
Whisper is a Transformer-based encoder-decoder model, also referred to as a sequence-to-sequence model. There are two flavours of Whisper model: English-only and multilingual. The English-only models were trained on the task of English speech recognition. The multilingual models were trained simultaneously on multilingual speech recognition and speech translation. For speech recognition, the model predicts transcriptions in the same language as the audio. For speech translation, the model predicts transcriptions in a different language from the audio.
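The speech recognition and speech translation tasks are selected at generation time via the model's language and task tokens, which the `pipeline` API exposes through `generate_kwargs`. A sketch (the audio path is a placeholder, and French source audio is assumed):

```python
# Sketch: steering the multilingual model with language/task tokens.
# task="transcribe" keeps the source language; task="translate" outputs English.
# "speech.wav" is a placeholder path, assumed here to contain French audio.
from transformers import pipeline

pipe = pipeline("automatic-speech-recognition", model="openai/whisper-large-v3")

# Speech recognition: French audio -> French text.
fr = pipe("speech.wav", generate_kwargs={"language": "french", "task": "transcribe"})

# Speech translation: French audio -> English text.
en = pipe("speech.wav", generate_kwargs={"language": "french", "task": "translate"})
```

If `language` is omitted, the model predicts the source language automatically from the first 30 seconds of audio.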
Evaluated Use
The primary intended users of these models are AI researchers studying robustness, generalization, capabilities, biases, and constraints of the current model. However, Whisper is also potentially quite useful as an ASR solution for developers, especially for English speech recognition. The Whisper authors recognize that once models are released, it is impossible to restrict access to only “intended” uses or to draw reasonable guidelines around what is or is not research.
The models are primarily trained and evaluated on ASR and speech translation to English tasks. They show strong ASR results in ~10 languages. They may exhibit additional capabilities, particularly if fine-tuned on certain tasks like voice activity detection, speaker classification, or speaker diarization, but have not been robustly evaluated in these areas. The authors strongly recommend that users perform robust evaluations of the models in a particular context and domain before deploying them.
In particular, the authors caution against using Whisper models to transcribe recordings of individuals taken without their consent, or purporting to use these models for any kind of subjective classification. They recommend against use in high-risk domains like decision-making contexts, where flaws in accuracy can lead to pronounced flaws in outcomes. The models are intended to transcribe and translate speech; using them for classification is not only unevaluated but also inappropriate, particularly to infer human attributes.
Model provider: openai
Modalities: audio input, text output
Pricing: $0.0015 per audio minute (Serverless Endpoints); Dedicated Endpoints priced separately