KalineZephyr/whisper-small-yoruba-finetuned API & Inference Endpoint

Model description

Whisper Small is a transformer-based encoder-decoder model for automatic speech recognition (ASR), pre-trained on 680k hours of multilingual data. It uses 80-channel log-mel spectrograms as input and outputs a sequence of text tokens via autoregressive decoding with cross-attention to the encoder.
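To make the input format concrete, the sketch below works out the spectrogram shape for one full input window. It assumes the standard Whisper front-end settings (16 kHz audio, a 30-second window, a hop of 160 samples, and a stride-2 convolutional stem in the encoder). Apart from the 80 mel channels, these values are not stated in this card, and the constant and function names are illustrative only.

```python
# Standard Whisper front-end settings (assumed; only N_MELS is stated in this card).
SAMPLE_RATE = 16_000   # Hz
WINDOW_SECONDS = 30    # fixed input window
HOP_LENGTH = 160       # samples between successive spectrogram frames
N_MELS = 80            # mel channels
ENCODER_STRIDE = 2     # downsampling applied by the encoder's conv stem


def spectrogram_shape():
    """Return (mel channels, frames) for one padded 30-second window."""
    n_samples = SAMPLE_RATE * WINDOW_SECONDS
    n_frames = n_samples // HOP_LENGTH
    return (N_MELS, n_frames)


def encoder_positions():
    """Number of time positions the decoder cross-attends to."""
    _, n_frames = spectrogram_shape()
    return n_frames // ENCODER_STRIDE


print(spectrogram_shape())   # (80, 3000)
print(encoder_positions())   # 1500
```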

This fine-tuned version adapts the model specifically to Yoruba, a Niger-Congo language spoken in Nigeria, Benin and Togo. The model generates Yoruba text with full diacritic marks (à, è, ẹ, í, ò, ọ, ú, ṣ, etc.).
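When checking diacritic-preserving output, it can help to separate tone-mark or underdot mistakes from other errors. The following is a minimal sketch using only Python's standard `unicodedata` module. The helper names are hypothetical and are not part of this model's code.

```python
import unicodedata


def strip_diacritics(text: str) -> str:
    """Remove combining marks (tone marks and underdots) from text."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")


def diacritic_only_mismatch(prediction: str, reference: str) -> bool:
    """True when two strings differ only in their diacritics."""
    pred = unicodedata.normalize("NFC", prediction)
    ref = unicodedata.normalize("NFC", reference)
    return pred != ref and strip_diacritics(pred) == strip_diacritics(ref)


# "l" + e-acute versus "l" + e-grave: same base letters, different tone mark.
print(diacritic_only_mismatch("l\u00e9", "l\u00e8"))  # True
# o with dot below plus combining grave reduces to a plain "o".
print(strip_diacritics("\u1ecd\u0300"))  # o
```

Stripping marks is useful for diagnosis only. In Yoruba the diacritics carry meaning, so a diacritic-only mismatch is still a transcription error.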

Intended uses & limitations

Intended uses:

Transcribing Yoruba speech from audio recordings
ASR research for low-resource African languages
Benchmarking diacritic-preserving ASR systems
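For the transcription use case, a minimal sketch might look like the following. It assumes the `automatic-speech-recognition` pipeline from the `transformers` library as found in recent releases (the API may differ in the version listed under Framework versions), that `ffmpeg` is available for decoding audio files, and that the checkpoint accepts Whisper's `language` and `task` generation options. The helper names are hypothetical.

```python
MODEL_ID = "KalineZephyr/whisper-small-yoruba-finetuned"


def generation_kwargs() -> dict:
    """Decoding options that pin Whisper to Yoruba transcription (assumed)."""
    return {"language": "yoruba", "task": "transcribe"}


def transcribe(audio_path: str) -> str:
    """Transcribe one audio file; downloads the model on first use."""
    # Imported lazily so this module loads even without transformers installed.
    from transformers import pipeline

    asr = pipeline("automatic-speech-recognition", model=MODEL_ID)
    result = asr(audio_path, generate_kwargs=generation_kwargs())
    return result["text"]
```

Calling `transcribe("sample_yoruba.wav")` (a hypothetical file path) would return the Yoruba transcript as a string.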

Limitations:

Trained for only ~4.3 epochs (500 steps). Further training may improve WER.
Training data is mostly read speech from Common Voice; performance on conversational/spontaneous Yoruba may be lower.
The model retains Whisper's 30-second audio window, so longer recordings have to be processed in segments.
Diacritic coverage may be imperfect for some Yoruba dialects.
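Because of the 30-second window, longer recordings need to be split before transcription. The sketch below computes simple, non-overlapping segment boundaries. It assumes 16 kHz audio (Whisper's usual sampling rate, not stated in this card), and `chunk_bounds` is a hypothetical helper. A production setup would more likely use overlapping windows or silence-based splitting to avoid cutting words in half.

```python
def chunk_bounds(num_samples, sample_rate=16_000, window_s=30.0):
    """Split [0, num_samples) into consecutive windows of at most window_s seconds."""
    if num_samples < 0 or sample_rate <= 0 or window_s <= 0:
        raise ValueError("num_samples must be >= 0; sample_rate and window_s must be > 0")
    window = int(sample_rate * window_s)
    return [
        (start, min(start + window, num_samples))
        for start in range(0, num_samples, window)
    ]


# A 65-second recording at 16 kHz yields two full windows and a 5-second tail.
print(chunk_bounds(16_000 * 65))
# [(0, 480000), (480000, 960000), (960000, 1040000)]
```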

Training and evaluation data

Common Voice 25.0 (yo): 1,422 train / 975 validation / 1,071 test samples
Google FLEURS (yo_ng, using the raw_transcription field to preserve diacritics): 2,339 train samples, of which 43 were removed by the >30 s filter (2,296 kept)

Only clips shorter than 30 seconds were kept. Combined training set: 3,718 samples.
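The duration filter and the combined count can be sketched as follows. The arithmetic assumes that the 43 filtered clips are counted within the 2,339 FLEURS figure and that no Common Voice training clips were removed, which is the reading that reproduces the stated total of 3,718. `keep_short_clips` is a hypothetical helper, not the author's preprocessing code.

```python
MAX_SECONDS = 30.0


def keep_short_clips(durations_s, max_seconds=MAX_SECONDS):
    """Keep only clips strictly shorter than max_seconds."""
    return [d for d in durations_s if d < max_seconds]


# Counts taken from the dataset summary above.
common_voice_train = 1_422
fleurs_train_before_filter = 2_339
fleurs_removed = 43

combined = common_voice_train + (fleurs_train_before_filter - fleurs_removed)
print(combined)  # 3718

# Clips of exactly 30 s or longer are dropped.
print(keep_short_clips([12.5, 30.0, 31.2, 29.9]))  # [12.5, 29.9]
```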

Training procedure

Hardware

2 × NVIDIA Tesla T4 (15.6 GB each)
Distributed Data Parallel (DDP) via accelerate launch
Mixed precision: fp16 (Native AMP)
Total training time: ~27 minutes (1,605 seconds)

Training hyperparameters

learning_rate: 1e-05
train_batch_size (per device): 16
eval_batch_size (per device): 16
gradient_accumulation_steps: 1
total_train_batch_size (effective): 32
total_eval_batch_size (effective): 32
seed: 42
optimizer: AdamW (torch fused) with betas=(0.9, 0.999) and epsilon=1e-08
lr_scheduler_type: constant_with_warmup
lr_scheduler_warmup_steps: 50
max_steps: 500
gradient_checkpointing: True
eval_strategy: steps (every 500 steps)
save_strategy: steps (every 500 steps, max 2 checkpoints)
generation_max_length: 225
predict_with_generate: True
mixed_precision_training: Native AMP (fp16)
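Two of the settings above are easier to read as arithmetic: the effective batch size and the learning-rate schedule. The sketch assumes 2 devices (from the Hardware section) and the usual definition of `constant_with_warmup`, a linear ramp from zero to the base rate followed by a constant rate. The function names are illustrative and not taken from the training script.

```python
def effective_batch_size(per_device, num_devices, grad_accum_steps):
    """Samples consumed per optimizer step across all devices."""
    return per_device * num_devices * grad_accum_steps


def lr_at_step(step, base_lr=1e-5, warmup_steps=50):
    """Linear warmup from 0 to base_lr, then constant (assumed schedule shape)."""
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    return base_lr


print(effective_batch_size(16, 2, 1))   # 32
print(f"{lr_at_step(25):.1e}")          # 5.0e-06 (halfway through warmup)
print(f"{lr_at_step(500):.1e}")         # 1.0e-05 (constant after step 50)
```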

Training results

Training Loss	Epoch	Step	Validation Loss	Validation WER
0.4559	4.2735	500	0.7070	71.50 %
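The reported epoch value can be checked against the figures given elsewhere in this card. The sketch assumes that the final partial batch of each epoch counts as a full step, which is why the step count is rounded up.

```python
import math

train_samples = 3_718      # combined training set
effective_batch = 32       # total_train_batch_size
max_steps = 500

steps_per_epoch = math.ceil(train_samples / effective_batch)
epochs = max_steps / steps_per_epoch

print(steps_per_epoch)     # 117
print(round(epochs, 4))    # 4.2735
```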

Qualitative examples (test set, seed=42)

Prediction	Reference	WER
Àkọ̀wé ẹgbẹ́ wá ní, àwọn ọmọ ẹgbẹ́ ọ̀mú ti pú t	Akọ̀wé ẹgbẹ́ wa ní àwọn ọmọ ẹgbẹ́ ọhún ti kú tán.	63.6 %
Lóòótọ́ ní àwọn aláboyú míràn ma ń kan rá.	Lóòótọ́ ni àwọn aláboyún míràn máa ń kanra	75.0 %
Ọkùnrin àti obìnrin tí kò ní ìbálòpọ̀ kò lé ibi mọ́	Ọkùnrin àti obìnrin tí kò ní ìbálòpọ̀ kò lè bímọ	40.0 %
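For readers who want to score their own transcripts, here is a minimal sketch of word error rate as word-level edit distance divided by the number of reference words. The card does not state which text normalization or scoring library produced the figures above, so this sketch may not reproduce them exactly. The NFC normalization step is an assumption, added so that visually identical diacritics compare equal.

```python
import unicodedata


def _edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference: str, hypothesis: str) -> float:
    """Word error rate as a fraction of the reference length."""
    ref = unicodedata.normalize("NFC", reference).split()
    hyp = unicodedata.normalize("NFC", hypothesis).split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    return _edit_distance(ref, hyp) / len(ref)


# One substitution (b -> x) and one deletion (d) over four reference words.
print(wer("a b c d", "a x c"))  # 0.5
```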

Framework versions

Transformers 5.12.1
PyTorch 2.10.0+cu128
Datasets 5.0.0
Tokenizers 0.22.2
Accelerate 1.14.0
