cdli

whisper-large-v3_finetuned_ugandan_luganda_waxal_7_standard_speech_v1.0


Dataset

The model was fine-tuned on the lug_asr subset of the google/WaxalNLP dataset.

Training

The train split was used for training, and the dev split for selecting the best checkpoint.

This Whisper model was fine-tuned, and is decoded, with the Swahili (sw) language setting. Of all the languages Whisper supports, Swahili is the one most similar to Luganda.

All model parameters (encoder, decoder, and output projection) were fine-tuned, with SpecAugment enabled.

Evaluation

This model was evaluated on the test split of the dataset. Utterances longer than 30 seconds were excluded, leaving the following evaluation set:

Examples evaluated: 503
Speakers: 194
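
The 30-second cutoff can be applied as a simple duration filter. The following is a minimal sketch, not the evaluation script itself. It assumes each example is a dict with an `audio` entry holding an `array` of samples and a `sampling_rate`; those field names are assumptions and are not confirmed by this card.

```python
MAX_DURATION_S = 30.0  # cutoff stated in the Evaluation section


def duration_seconds(example):
    """Return the clip length in seconds (field names are assumed)."""
    audio = example["audio"]
    return len(audio["array"]) / audio["sampling_rate"]


def keep_for_evaluation(example, max_duration_s=MAX_DURATION_S):
    """True if the utterance is not longer than the cutoff."""
    return duration_seconds(example) <= max_duration_s


# Toy examples: 1 s and 31 s of silence at 16 kHz.
examples = [
    {"id": "short", "audio": {"array": [0.0] * 16_000, "sampling_rate": 16_000}},
    {"id": "long", "audio": {"array": [0.0] * (31 * 16_000), "sampling_rate": 16_000}},
]
kept = [ex["id"] for ex in examples if keep_for_evaluation(ex)]
print(kept)
```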

For decoding we ran Whisper with language=sw, task=transcribe, greedy search (num_beams=1, do_sample=False).
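
These decoding settings could be reproduced roughly as follows. This is a sketch that assumes the Hugging Face transformers library (a recent version, where `generate` accepts `language` and `task` directly) and PyTorch. `model_id` is a placeholder for this model's hub ID or a local checkpoint path, and the 16 kHz default is the sampling rate Whisper expects.

```python
# Decoding settings listed in this card.
GENERATE_KWARGS = {
    "language": "sw",      # Swahili language setting, as used in fine-tuning
    "task": "transcribe",
    "num_beams": 1,        # greedy search
    "do_sample": False,
}


def transcribe(waveform, model_id, sampling_rate=16_000):
    """Transcribe one mono waveform (1-D float array) with the settings above.

    model_id is a placeholder: pass this model's hub ID or a local path.
    """
    # Imported lazily so the settings can be inspected without these packages.
    import torch
    from transformers import WhisperForConditionalGeneration, WhisperProcessor

    processor = WhisperProcessor.from_pretrained(model_id)
    model = WhisperForConditionalGeneration.from_pretrained(model_id)
    model.eval()

    # Convert raw audio to log-mel input features.
    inputs = processor(waveform, sampling_rate=sampling_rate, return_tensors="pt")
    with torch.no_grad():
        predicted_ids = model.generate(inputs.input_features, **GENERATE_KWARGS)
    return processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
```

Loading the processor and model inside the function keeps the sketch short. For batch evaluation they would normally be loaded once and reused.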

We report two complementary word error rate (WER) metrics, both computed on text normalized with Whisper's BasicTextNormalizer:

Standard (corpus-level) WER — the usual error rate, pooling all reference words and edit errors across the entire test set.
Per-utterance averaged WER — WER computed separately for each utterance, each capped at 1.0, then averaged across utterances.

The per-utterance averaged WER bounds each utterance's score to [0, 1] and weights all utterances equally, so it reflects typical performance without a few catastrophic utterances dominating. It is not a true error rate, however, and is not directly comparable to other published WER figures, which is why we also report the standard, corpus-level WER.
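
The difference between the two metrics can be illustrated with a small sketch. It is not the evaluation code used for this model. It assumes references and hypotheses have already been normalized, that every reference is non-empty, and it uses a plain word-level Levenshtein distance. The function names are illustrative.

```python
def word_edit_distance(ref_words, hyp_words):
    """Levenshtein distance over words (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp_words) + 1))
    for i, r in enumerate(ref_words, start=1):
        curr = [i]
        for j, h in enumerate(hyp_words, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def corpus_wer(refs, hyps):
    """Standard WER: total edit errors divided by total reference words."""
    errors = sum(word_edit_distance(r.split(), h.split()) for r, h in zip(refs, hyps))
    words = sum(len(r.split()) for r in refs)
    return errors / words


def per_utterance_avg_wer(refs, hyps):
    """WER per utterance, each capped at 1.0, then averaged with equal weight."""
    scores = []
    for r, h in zip(refs, hyps):
        ref_words = r.split()
        wer = word_edit_distance(ref_words, h.split()) / len(ref_words)
        scores.append(min(wer, 1.0))
    return sum(scores) / len(scores)


# Placeholder tokens: one perfect utterance, one with many inserted words.
refs = ["a b c d", "e"]
hyps = ["a b c d", "x y z"]
print(corpus_wer(refs, hyps))             # 3 errors / 5 reference words = 0.6
print(per_utterance_avg_wer(refs, hyps))  # (0.0 + min(3.0, 1.0)) / 2 = 0.5
```

In the toy data, the second utterance has an uncapped WER of 3.0. Capping it at 1.0 keeps it from dominating the average, and pooling weights it by its single reference word, so the two metrics give different values.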