KalineZephyr

whisper-small-yoruba-finetuned

Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more
Container

Run this model inference with full control and performance in your environment.

Learn more

Get help setting up a custom Dedicated Endpoints.

Talk with our engineer to get a quote for reserved GPU instances with discounts.

README

License: apache-2.0

Model description

Whisper Small is a transformer-based encoder-decoder model for automatic speech recognition (ASR), pre-trained on 680k hours of multilingual data. It uses 80-channel log-mel spectrograms as input and outputs a sequence of text tokens via autoregressive decoding with cross-attention to the encoder.

This fine-tuned version adapts the model specifically to Yoruba, a Niger-Congo language spoken in Nigeria, Benin and Togo. The model generates Yoruba text with full diacritic marks (à, è, ẹ, í, ò, ọ, ú, ṣ, etc.).

Intended uses & limitations

Intended uses:

  • Transcribing Yoruba speech from audio recordings
  • ASR research for low-resource African languages
  • Benchmarking diacritic-preserving ASR systems

Limitations:

  • Trained on ~4.3 epochs only (500 steps). More training may improve WER.
  • Training data is mostly read-speech from Common Voice; performance on conversational/spontaneous Yoruba may be lower.
  • The model retains Whisper's 30-second audio window.
  • Diacritic coverage may be imperfect for some Yoruba dialects.

Training and evaluation data

  • Common Voice 25.0 (yo) — 1,422 train / 975 validation / 1,071 test samples
  • Google Fleurs (yo_ng, raw_transcription field to preserve diacritics) — 2,339 train samples (+43 removed by the >30 s filter)

Only clips shorter than 30 seconds were kept. Combined training set: 3,718 samples.

Training procedure

Hardware

  • 2 × NVIDIA Tesla T4 (15.6 GB each)
  • Distributed Data Parallel (DDP) via accelerate launch
  • Mixed precision: fp16 (Native AMP)
  • Total training time: ~26 minutes (1,605 seconds)

Training hyperparameters

  • learning_rate: 1e-05
  • train_batch_size (per device): 16
  • eval_batch_size (per device): 16
  • gradient_accumulation_steps: 1
  • total_train_batch_size (effective): 32
  • total_eval_batch_size (effective): 32
  • seed: 42
  • optimizer: AdamW (torch fused) with betas=(0.9, 0.999) and epsilon=1e-08
  • lr_scheduler_type: constant_with_warmup
  • lr_scheduler_warmup_steps: 50
  • max_steps: 500
  • gradient_checkpointing: True
  • eval_strategy: steps (every 500 steps)
  • save_strategy: steps (every 500 steps, max 2 checkpoints)
  • generation_max_length: 225
  • predict_with_generate: True
  • mixed_precision_training: Native AMP (fp16)

Training results

Table
Training LossEpochStepValidation LossValidation WER
0.45594.27355000.707071.50 %

Qualitative examples (test set, seed=42)

Table
PrédictionRéférenceWER
Àkọ̀wé ẹgbẹ́ wá ní, àwọn ọmọ ẹgbẹ́ ọ̀mú ti pú tAkọ̀wé ẹgbẹ́ wa ní àwọn ọmọ ẹgbẹ́ ọhún ti kú tán.63.6 %
Lóòótọ́ ní àwọn aláboyú míràn ma ń kan rá.Lóòótọ́ ni àwọn aláboyún míràn máa ń kanra75.0 %
Ọkùnrin àti obìnrin tí kò ní ìbálòpọ̀ kò lé ibi mọ́Ọkùnrin àti obìnrin tí kò ní ìbálòpọ̀ kò lè bímọ40.0 %

Framework versions

  • Transformers 5.12.1
  • PyTorch 2.10.0+cu128
  • Datasets 5.0.0
  • Tokenizers 0.22.2
  • Accelerate 1.14.0

Model provider

KalineZephyr

Model tree

Base

openai/whisper-small

Fine-tuned

this model

Modalities

Input

Audio

Output

Text

Pricing

Dedicated Endpoints

View details

Supported Functionality

Model APIs

Dedicated Endpoints

Container

More information

Explore FriendliAI today