KalineZephyr
whisper-small-yoruba-finetuned
Run this model inference on single tenant GPU with unmatched speed and reliability at scale.
Run this model inference with full control and performance in your environment.
Get help setting up a custom Dedicated Endpoints.
Talk with our engineer to get a quote for reserved GPU instances with discounts.
README
License: apache-2.0Model description
Whisper Small is a transformer-based encoder-decoder model for automatic speech recognition (ASR), pre-trained on 680k hours of multilingual data. It uses 80-channel log-mel spectrograms as input and outputs a sequence of text tokens via autoregressive decoding with cross-attention to the encoder.
This fine-tuned version adapts the model specifically to Yoruba, a Niger-Congo language spoken in Nigeria, Benin and Togo. The model generates Yoruba text with full diacritic marks (à, è, ẹ, í, ò, ọ, ú, ṣ, etc.).
Intended uses & limitations
Intended uses:
- Transcribing Yoruba speech from audio recordings
- ASR research for low-resource African languages
- Benchmarking diacritic-preserving ASR systems
Limitations:
- Trained on ~4.3 epochs only (500 steps). More training may improve WER.
- Training data is mostly read-speech from Common Voice; performance on conversational/spontaneous Yoruba may be lower.
- The model retains Whisper's 30-second audio window.
- Diacritic coverage may be imperfect for some Yoruba dialects.
Training and evaluation data
- Common Voice 25.0 (yo) — 1,422 train / 975 validation / 1,071 test samples
- Google Fleurs (yo_ng,
raw_transcriptionfield to preserve diacritics) — 2,339 train samples (+43 removed by the >30 s filter)
Only clips shorter than 30 seconds were kept. Combined training set: 3,718 samples.
Training procedure
Hardware
- 2 × NVIDIA Tesla T4 (15.6 GB each)
- Distributed Data Parallel (DDP) via
accelerate launch - Mixed precision:
fp16(Native AMP) - Total training time: ~26 minutes (1,605 seconds)
Training hyperparameters
- learning_rate: 1e-05
- train_batch_size (per device): 16
- eval_batch_size (per device): 16
- gradient_accumulation_steps: 1
- total_train_batch_size (effective): 32
- total_eval_batch_size (effective): 32
- seed: 42
- optimizer: AdamW (torch fused) with betas=(0.9, 0.999) and epsilon=1e-08
- lr_scheduler_type: constant_with_warmup
- lr_scheduler_warmup_steps: 50
- max_steps: 500
- gradient_checkpointing: True
- eval_strategy: steps (every 500 steps)
- save_strategy: steps (every 500 steps, max 2 checkpoints)
- generation_max_length: 225
- predict_with_generate: True
- mixed_precision_training: Native AMP (fp16)
Training results
| Training Loss | Epoch | Step | Validation Loss | Validation WER |
|---|---|---|---|---|
| 0.4559 | 4.2735 | 500 | 0.7070 | 71.50 % |
Qualitative examples (test set, seed=42)
| Prédiction | Référence | WER |
|---|---|---|
| Àkọ̀wé ẹgbẹ́ wá ní, àwọn ọmọ ẹgbẹ́ ọ̀mú ti pú t | Akọ̀wé ẹgbẹ́ wa ní àwọn ọmọ ẹgbẹ́ ọhún ti kú tán. | 63.6 % |
| Lóòótọ́ ní àwọn aláboyú míràn ma ń kan rá. | Lóòótọ́ ni àwọn aláboyún míràn máa ń kanra | 75.0 % |
| Ọkùnrin àti obìnrin tí kò ní ìbálòpọ̀ kò lé ibi mọ́ | Ọkùnrin àti obìnrin tí kò ní ìbálòpọ̀ kò lè bímọ | 40.0 % |
Framework versions
- Transformers 5.12.1
- PyTorch 2.10.0+cu128
- Datasets 5.0.0
- Tokenizers 0.22.2
- Accelerate 1.14.0
Model provider
KalineZephyr
Model tree
Base
openai/whisper-small
Fine-tuned
this model
Modalities
Input
Audio
Output
Text
Pricing
Dedicated Endpoints
View detailsSupported Functionality
Model APIs
Dedicated Endpoints
Container
More information