more8467394/za-anv-multilingual-whisper-v3-turbo API & Inference Endpoint

Model description

This model is a fine-tuned version of the Whisper Large V3 Turbo model, optimized for multilingual Automatic Speech Recognition (ASR). It has been trained on the ANV (Swivuriso) dataset to improve performance on specific target languages and domains represented in that corpus.

Whisper is a Transformer-based encoder-decoder (sequence-to-sequence) model. It was pre-trained with weak supervision on large-scale noisy data; this fine-tuning step adapts it to the languages and accents found in the dsfsi-anv dataset.

Intended uses & limitations

Intended Uses

Automatic Speech Recognition (ASR): The model is primarily intended to transcribe audio in the languages present in the training data.
Research: Suitable for researchers studying low-resource language adaptation and fine-tuning efficiency.

Limitations

Hallucinations: Like the base Whisper model, this model may generate repetitive or hallucinated text, particularly on silent segments or audio with background noise.
Domain Specificity: Performance may degrade on audio that differs significantly (in terms of accent, noise, or recording quality) from the ANV dataset.
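
One way to flag the repetitive output described under Hallucinations is a compression-ratio check, a heuristic the original Whisper decoding code also uses (with a default threshold of 2.4). The sketch below is a post-processing filter, not part of this model; `compression_ratio` and `looks_repetitive` are illustrative names, and the threshold is an assumption that should be tuned on your own transcripts.

```python
import zlib


def compression_ratio(text: str) -> float:
    """Ratio of raw UTF-8 size to zlib-compressed size; repetition compresses well."""
    data = text.encode("utf-8")
    return len(data) / len(zlib.compress(data))


def looks_repetitive(text: str, threshold: float = 2.4) -> bool:
    """Flag a transcript whose compression ratio exceeds the (assumed) threshold."""
    return bool(text) and compression_ratio(text) > threshold
```

A flagged transcript is only a candidate for review; very short or legitimately repetitive speech can also trip the check.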

Training and evaluation data

The model was trained on the dsfsi-anv dataset.

Dataset Name: ANV (Swivuriso)
Source: https://huggingface.co/dsfsi-anv

Training procedure

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 1e-05
train_batch_size: 8
eval_batch_size: 8
seed: 42
gradient_accumulation_steps: 2
total_train_batch_size: 16
optimizer: AdamW (betas=(0.9,0.98), epsilon=1e-08)
lr_scheduler_type: linear
lr_scheduler_warmup_steps: 500
training_steps: 10,000
framework: PyTorch 2.9.1+cu128 / Transformers 4.57.3
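
To make these values concrete, the sketch below shows how the learning rate and effective batch size follow from them. It assumes the usual linear warmup followed by linear decay to zero (the behaviour of the Transformers `linear` scheduler) and a single training device, which is consistent with 8 x 2 = 16; `lr_at_step` is an illustrative helper, not part of the training code.

```python
PEAK_LR = 1e-5          # learning_rate
WARMUP_STEPS = 500      # lr_scheduler_warmup_steps
TOTAL_STEPS = 10_000    # training_steps
PER_DEVICE_BATCH = 8    # train_batch_size
GRAD_ACCUM_STEPS = 2    # gradient_accumulation_steps


def lr_at_step(step: int) -> float:
    """Learning rate after `step` optimizer updates under linear warmup and decay."""
    if step < WARMUP_STEPS:
        # Ramp up linearly from 0 to the peak learning rate.
        return PEAK_LR * step / WARMUP_STEPS
    # Decay linearly from the peak to 0 at the final step.
    remaining = max(0, TOTAL_STEPS - step)
    return PEAK_LR * remaining / (TOTAL_STEPS - WARMUP_STEPS)


# Samples contributing to each optimizer update (single device assumed).
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS
```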

Training results

Epoch	Step	Training Loss	Validation Loss	WER	CER
0.1	1000	0.4108	0.5753	0.3702	0.1237
0.2	2000	0.2326	0.4653	0.2888	0.0881
0.3	3000	0.4429	0.3750	0.2354	0.0782
0.4	4000	0.3309	0.3388	0.2075	0.0674
0.5	5000	0.3298	0.3135	0.1952	0.0635
0.6	6000	0.3238	0.2929	0.1782	0.0592
0.7	7000	0.3926	0.2766	0.1688	0.0545
0.8	8000	0.2261	0.2627	0.1593	0.0519
0.9	9000	0.2197	0.2514	0.1573	0.0506
1.0	10000	0.2276	0.2427	0.1501	0.0510
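
For reference, WER and CER are edit-distance ratios: the number of substitutions, deletions, and insertions needed to turn the hypothesis into the reference, divided by the reference length in words (WER) or characters (CER). The card does not state which tool or text normalisation produced the figures above, so the sketch below is only a plain-Python illustration of the definitions, with illustrative function names.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (lists of words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate; assumes a non-empty reference."""
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate; assumes a non-empty reference."""
    return edit_distance(list(reference), list(hypothesis)) / len(reference)
```

Read this way, the final WER of 0.1501 corresponds to roughly 15 word-level errors per 100 reference words on the validation set.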

Usage

This model can be used with the Hugging Face transformers library through the pipeline API.

```bash
pip install --upgrade pip
pip install --upgrade transformers "datasets[audio]" accelerate
```


```python
import torch
from transformers import AutoModelForSpeechSeq2Seq, AutoProcessor, pipeline
from datasets import load_dataset

device = "cuda:0" if torch.cuda.is_available() else "cpu"
torch_dtype = torch.float16 if torch.cuda.is_available() else torch.float32

# Load your fine-tuned model
model_id = "dsfsi-anv/multilingual-whisper-v3-turbo"
processor_id = "openai/whisper-large-v3-turbo"

model = AutoModelForSpeechSeq2Seq.from_pretrained(
    model_id, torch_dtype=torch_dtype, low_cpu_mem_usage=True, use_safetensors=True
)
model.to(device)

processor = AutoProcessor.from_pretrained(processor_id)

pipe = pipeline(
    "automatic-speech-recognition",
    model=model,
    tokenizer=processor.tokenizer,
    feature_extractor=processor.feature_extractor,
    torch_dtype=torch_dtype,
    device=device,
)

# Example: Transcribe a sample file
# result = pipe("path/to/audio.wav")
# print(result["text"])
```
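
For recordings longer than Whisper's 30-second input window, the pipeline call can be wrapped in a small helper. The sketch below assumes the `pipe` object built above; `transcribe_file` is an illustrative name, while `chunk_length_s` and `return_timestamps` are call-time arguments of the Transformers ASR pipeline. No language is forced, since the card does not list language codes for this checkpoint.

```python
def transcribe_file(pipe, audio_path, chunk_length_s=30.0):
    """Transcribe one audio file and return (full_text, timestamped_segments)."""
    result = pipe(
        audio_path,
        chunk_length_s=chunk_length_s,  # split long audio into windows of this length
        return_timestamps=True,         # also return per-segment (start, end) times
    )
    # With timestamps enabled, the pipeline output includes a "chunks" list.
    segments = [
        (chunk["timestamp"], chunk["text"].strip())
        for chunk in result.get("chunks", [])
    ]
    return result["text"].strip(), segments


# Hypothetical usage (requires the model and an audio file on disk):
# text, segments = transcribe_file(pipe, "path/to/audio.wav")
```

Keeping the pipeline as a parameter makes the helper easy to exercise with a stub before running it against the real model.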

Framework versions

Transformers 4.57.3
PyTorch 2.9.1+cu128
Datasets 4.4.1
Tokenizers 0.22.1

BibTeX entry and citation info

```bibtex
@misc{radford2022whisper,
  doi = {10.48550/ARXIV.2212.04356},
  url = {https://arxiv.org/abs/2212.04356},
  author = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  title = {Robust Speech Recognition via Large-Scale Weak Supervision},
  publisher = {arXiv},
  year = {2022},
  copyright = {arXiv.org perpetual, non-exclusive license}
}
```
