License: mit
Model description
This model is a fine-tuned version of the Whisper Large V3 Turbo model, optimized for multilingual Automatic Speech Recognition (ASR). It has been trained on the ANV (Swivuriso) dataset to improve performance on specific target languages and domains represented in that corpus.
Whisper is a Transformer-based encoder-decoder (sequence-to-sequence) model. It was trained with weak supervision on large-scale noisy data; this fine-tuning step adapts it to the languages and accents found in the dsfsi-anv dataset.
Intended uses & limitations
Intended Uses
- Automatic Speech Recognition (ASR): The model is primarily intended to transcribe audio in the languages present in the training data.
- Research: Suitable for researchers studying low-resource language adaptation and fine-tuning efficiency.
Limitations
- Hallucinations: Like the base Whisper model, this model may generate repetitive or hallucinated text, particularly on silent segments or audio with background noise.
- Domain Specificity: Performance may degrade on audio that differs significantly (in terms of accent, noise, or recording quality) from the ANV dataset.
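One way to catch the repetitive output mentioned above is a compression-ratio check, similar in spirit to the heuristic used in the reference Whisper decoding code. The sketch below is an illustration only, not part of this model's pipeline: the helper names are hypothetical, and the 2.4 threshold is an assumption borrowed from the reference implementation's default that may need tuning for the languages in the ANV data.

```python
import zlib


def compression_ratio(text: str) -> float:
    """Raw UTF-8 size divided by zlib-compressed size; repetitive text scores high."""
    raw = text.encode("utf-8")
    return len(raw) / len(zlib.compress(raw))


def looks_repetitive(text: str, threshold: float = 2.4) -> bool:
    """Flag a transcript whose compression ratio exceeds the threshold.

    The default threshold is an assumption, not a value from this model card.
    """
    return bool(text) and compression_ratio(text) > threshold


# A looping transcript compresses very well and is flagged; a normal sentence is not.
print(looks_repetitive("thank you " * 50))
print(looks_repetitive("The quick brown fox jumps over the lazy dog."))
```

Flagged transcripts can be routed to manual review or re-decoded with different generation settings; the check itself says nothing about whether the text is correct.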
Training and evaluation data
The model was trained on the dsfsi-anv dataset.
- Dataset Name: ANV (Swivuriso)
- Source: https://huggingface.co/dsfsi-anv
Training procedure
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 1e-05
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- gradient_accumulation_steps: 2
- total_train_batch_size: 16
- optimizer: AdamW (betas=(0.9,0.98), epsilon=1e-08)
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 500
- training_steps: 10,000
- framework: PyTorch 2.9.1+cu128 / Transformers 4.57.3
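To make two of these settings concrete, the sketch below reproduces the arithmetic behind `total_train_batch_size` and the shape of a linear schedule with warmup. It assumes a single training device (consistent with 8 × 2 = 16) and the usual linear-warmup, linear-decay rule; the function names are hypothetical and this is not the training code used for the model.

```python
def effective_batch_size(per_device: int = 8, grad_accum: int = 2, num_devices: int = 1) -> int:
    """Samples contributing to one optimizer step (num_devices=1 is an assumption)."""
    return per_device * grad_accum * num_devices


def linear_schedule_lr(
    step: int,
    base_lr: float = 1e-5,
    warmup_steps: int = 500,
    total_steps: int = 10_000,
) -> float:
    """Learning rate at a given optimizer step under linear warmup then linear decay."""
    if step < warmup_steps:
        # Warmup: ramp linearly from 0 up to base_lr.
        return base_lr * step / max(1, warmup_steps)
    # Decay: ramp linearly from base_lr down to 0 at total_steps.
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


print(effective_batch_size())
for step in (0, 250, 500, 5_000, 10_000):
    print(step, linear_schedule_lr(step))
```

Under these assumptions the learning rate peaks at 1e-05 at step 500 and reaches zero at step 10,000.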
Training results
| Epoch | Step | Training Loss | Validation Loss | WER | CER |
|---|---|---|---|---|---|
| 0.1 | 1000 | 0.4108 | 0.5753 | 0.3702 | 0.1237 |
| 0.2 | 2000 | 0.2326 | 0.4653 | 0.2888 | 0.0881 |
| 0.3 | 3000 | 0.4429 | 0.3750 | 0.2354 | 0.0782 |
| 0.4 | 4000 | 0.3309 | 0.3388 | 0.2075 | 0.0674 |
| 0.5 | 5000 | 0.3298 | 0.3135 | 0.1952 | 0.0635 |
| 0.6 | 6000 | 0.3238 | 0.2929 | 0.1782 | 0.0592 |
| 0.7 | 7000 | 0.3926 | 0.2766 | 0.1688 | 0.0545 |
| 0.8 | 8000 | 0.2261 | 0.2627 | 0.1593 | 0.0519 |
| 0.9 | 9000 | 0.2197 | 0.2514 | 0.1573 | 0.0506 |
| 1.0 | 10000 | 0.2276 | 0.2427 | 0.1501 | 0.0510 |
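The card does not say how WER and CER were computed or whether text was normalized first. As a reference point, the sketch below implements the standard definitions (edit distance divided by reference length, over words for WER and characters for CER) in plain Python. The function names are hypothetical, and the values in the table appear to be fractions rather than percentages.

```python
def edit_distance(ref, hyp) -> int:
    """Levenshtein distance between two sequences (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate; assumes a non-empty reference."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate; assumes a non-empty reference."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# One substitution ("sat" -> "sit") and one deletion ("the") over six reference words.
print(wer("the cat sat on the mat", "the cat sit on mat"))
print(cer("abcd", "abed"))
```

Scores computed this way are sensitive to casing and punctuation, so comparisons against the table are only meaningful with the same normalization.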
Usage
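This model can be used with the Hugging Face transformers library via the pipeline function. The example loads the fine-tuned weights from this repository and the processor (tokenizer and feature extractor) from the base model.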
```bash
pip install --upgrade pip
pip install --upgrade transformers datasets[audio] accelerate
```

```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Load your fine-tuned model
model_id = "dsfsi-anv/multilingual-whisper-v3-turbo"
processor_id = "openai/whisper-large-v3-turbo"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(processor_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

# Example: Transcribe a sample file
# result = pipe("path/to/audio.wav")
# print(result["text"])
```
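For more than one file, a small wrapper around the pipeline call keeps the transcripts together. The sketch below is a hypothetical helper, not part of transformers: it takes any callable that behaves like the ASR pipeline and returns a dict with a `"text"` key. `return_timestamps=True` is an existing pipeline call argument; recent transformers releases appear to require it for Whisper inputs longer than 30 seconds, which is why it is exposed here as `long_form`.

```python
from typing import Callable, Dict, Iterable


def transcribe_files(pipe: Callable, paths: Iterable[str], long_form: bool = False) -> Dict[str, str]:
    """Run an ASR pipeline over several audio files and collect the transcribed text."""
    # Request timestamps only for long-form audio (assumed to be longer than 30 s).
    call_kwargs = {"return_timestamps": True} if long_form else {}
    transcripts = {}
    for path in paths:
        result = pipe(path, **call_kwargs)
        transcripts[path] = result["text"].strip()
    return transcripts


# Usage with a real pipeline object (file names are placeholders):
# transcripts = transcribe_files(pipe, ["clip1.wav", "clip2.wav"], long_form=True)
```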
Framework versions
- Transformers 4.57.3
- PyTorch 2.9.1+cu128
- Datasets 4.4.1
- Tokenizers 0.22.1
BibTeX entry and citation info
```bibtex
@misc{radford2022whisper,
  doi = {10.48550/ARXIV.2212.04356},
  url = {https://arxiv.org/abs/2212.04356},
  author = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  title = {Robust Speech Recognition via Large-Scale Weak Supervision},
  publisher = {arXiv},
  year = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}
```
Model provider
- more8467394
Model tree
- Base: openai/whisper-large-v3-turbo
- Fine-tuned: this model
Modalities
- Input: Audio
- Output: Text