cdli/whisper-large-v3_finetuned_ugandan_luganda_nonstandard_speech_v1.0

Dataset

The model has been fine-tuned using cdli/ugandan_luganda_nonstandard_speech_v1.0, a dataset of speech samples of people living with impaired speech across a range of impairment severity levels and etiologies.

Training

The train split was used for training, and the dev split for selecting the best checkpoint.

This Whisper model was fine-tuned, and is decoded, with the Swahili (sw) language setting, because Swahili is the most similar to Luganda of all the languages Whisper supports.

All model parameters (encoder, decoder, and output projection) were fine-tuned, with SpecAugment enabled.

Evaluation

This model was evaluated on the test split of the dataset. Utterances longer than 30 seconds were excluded, leaving:

Examples evaluated: 1028
Speakers: 9
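
The length filter described above can be sketched as follows. This is a minimal illustration, not the authors' code: the field names `array` and `sampling_rate` and the helper names are assumptions about how the examples are stored.

```python
MAX_SECONDS = 30.0  # cutoff stated in the evaluation setup

def duration_seconds(num_samples, sampling_rate):
    """Audio length in seconds from sample count and sampling rate."""
    return num_samples / sampling_rate

def keep_for_eval(examples, max_seconds=MAX_SECONDS):
    """Keep examples whose audio is at most max_seconds long.

    Each example is assumed (hypothetically) to be a dict with an
    'array' of raw samples and an integer 'sampling_rate'.
    """
    return [
        ex for ex in examples
        if duration_seconds(len(ex["array"]), ex["sampling_rate"]) <= max_seconds
    ]
```

An utterance of exactly 30 seconds is kept here, since only utterances longer than 30 seconds are described as excluded.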

For decoding we ran Whisper with language=sw, task=transcribe, greedy search (num_beams=1, do_sample=False).
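
A minimal sketch of these decoding settings with the Hugging Face `transformers` ASR pipeline. Using the pipeline at all is an assumption, since the evaluation script is not shown here, and `transcribe` is a hypothetical helper name.

```python
MODEL_ID = "cdli/whisper-large-v3_finetuned_ugandan_luganda_nonstandard_speech_v1.0"

# Decoding settings as stated above: Swahili language token, transcription
# task, greedy search (single beam, no sampling).
GENERATE_KWARGS = {
    "language": "sw",
    "task": "transcribe",
    "num_beams": 1,
    "do_sample": False,
}

def transcribe(audio_path):
    """Transcribe one audio file (hypothetical helper, not the authors' script)."""
    # Imported lazily so the settings above can be inspected without
    # loading transformers or downloading the model.
    from transformers import pipeline

    asr = pipeline("automatic-speech-recognition", model=MODEL_ID)
    return asr(audio_path, generate_kwargs=GENERATE_KWARGS)["text"]
```

Calling `transcribe("sample.wav")` downloads the model on first use; the file name is a placeholder.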

Results are compared against the unadapted base model cdli/whisper-large-v3_finetuned_ugandan_luganda_waxal_7_standard_speech_v1.0, evaluated identically, to show the effect of fine-tuning on non-standard speech.

We report two complementary word error rate (WER) metrics, both computed on text normalized with Whisper's BasicTextNormalizer:

Standard (corpus-level) WER — the usual error rate, pooling all reference words and edit errors across the entire test set.
Per-utterance averaged WER — WER computed separately for each utterance, each capped at 1.0, then averaged across utterances.

The per-utterance averaged WER bounds each utterance to [0, 1] and weights all utterances equally, so it reflects typical performance without a few catastrophic utterances dominating. However, it is not a true error rate and is not directly comparable to published WER figures, so we also report the standard, corpus-level WER.
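
To make the difference between the two metrics concrete, here is a small self-contained sketch. It assumes the reference and hypothesis strings have already been normalized and that every reference is non-empty. The function names are illustrative and this is not the evaluation code used for the reported numbers.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]

def corpus_wer(pairs):
    """Standard WER: total edit errors over total reference words."""
    errors = sum(edit_distance(r.split(), h.split()) for r, h in pairs)
    words = sum(len(r.split()) for r, _ in pairs)
    return errors / words

def per_utterance_avg_wer(pairs, cap=1.0):
    """Mean of per-utterance WERs, each capped at `cap`."""
    scores = []
    for r, h in pairs:
        ref, hyp = r.split(), h.split()
        scores.append(min(edit_distance(ref, hyp) / len(ref), cap))
    return sum(scores) / len(scores)
```

With one perfectly transcribed four-word utterance and one two-word utterance whose hypothesis contains five wrong words, the corpus-level WER is 5/6 while the per-utterance average is 0.5, because the second utterance's WER of 2.5 is capped at 1.0.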

Results

Overall Results

Adapted — this CDLI model, fine-tuned on non-standard speech.
Unadapted — the base model it was fine-tuned from (here: cdli/whisper-large-v3_finetuned_ugandan_luganda_waxal_7_standard_speech_v1.0).

Model	Standard WER	Per-utterance averaged WER
Adapted	0.66	0.53
Unadapted	1.00	0.78
Relative improvement	34%	32%
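
The relative improvement row is consistent with the usual definition (unadapted - adapted) / unadapted. A quick check from the rounded table values, with an illustrative function name:

```python
def relative_improvement(unadapted_wer, adapted_wer):
    """Fraction of the unadapted error removed by the adapted model."""
    return (unadapted_wer - adapted_wer) / unadapted_wer

# Values copied from the overall results table.
for name, unadapted, adapted in [
    ("Standard WER", 1.00, 0.66),
    ("Per-utterance averaged WER", 0.78, 0.53),
]:
    print(f"{name}: {relative_improvement(unadapted, adapted):.0%}")
```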

Detailed Analysis

Aggregated results can hide important underlying patterns, so we also break the WER down by subset: per speaker, and — where speaker severity is available — per impairment severity group.

Results by impairment severity

All WER values below are the per-utterance averaged WER, first averaged per speaker and then averaged within each severity group. n_speakers and n_utterances are the number of speakers and test utterances in each group.

severity	n_speakers	n_utterances	Avg WER (unadapted model)	Avg WER (adapted model)	Rel. improvement
mild	3	366	0.71	0.49	31%
moderate	3	347	0.79	0.55	31%
severe	3	315	0.89	0.60	33%

Results by speaker

Per-utterance averaged WER per speaker. n_utterances is the number of test utterances for that speaker.

speaker_id	severity	etiology	n_utterances	Avg WER (unadapted model)	Avg WER (adapted model)	Rel. improvement
UG001	mild	Cerebral palsy - cerebral malaria	99	0.78	0.52	32%
UG014	mild	Idiopathic	149	0.72	0.47	35%
UG022	mild	Developmental	118	0.64	0.48	25%
UG021	moderate	Structural: ankyloglossia (tongue-tie)	91	0.85	0.64	25%
UG036	moderate	Cerebral Palsy	177	0.65	0.43	34%
UG052	moderate	Developmental	79	0.88	0.57	34%
UG042	severe	Developmental	85	0.98	0.59	40%
UG057	severe	Acquired hearing impairment	105	0.91	0.68	26%
UG058	severe	Developmental	125	0.78	0.52	33%
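
The severity-level averages can be approximately reproduced from the per-speaker rows above by taking an unweighted mean over the speakers in each group. The sketch below uses the rounded table values, so it only approximates the original computation. Relative improvements recomputed from rounded values can differ from the reported ones by a percentage point.

```python
from collections import defaultdict

# (speaker_id, severity, unadapted WER, adapted WER), copied from the table above.
SPEAKERS = [
    ("UG001", "mild", 0.78, 0.52),
    ("UG014", "mild", 0.72, 0.47),
    ("UG022", "mild", 0.64, 0.48),
    ("UG021", "moderate", 0.85, 0.64),
    ("UG036", "moderate", 0.65, 0.43),
    ("UG052", "moderate", 0.88, 0.57),
    ("UG042", "severe", 0.98, 0.59),
    ("UG057", "severe", 0.91, 0.68),
    ("UG058", "severe", 0.78, 0.52),
]

def severity_summary(rows):
    """Unweighted mean of per-speaker WERs within each severity group."""
    groups = defaultdict(list)
    for _, severity, unadapted, adapted in rows:
        groups[severity].append((unadapted, adapted))
    summary = {}
    for severity, vals in groups.items():
        mean_unadapted = sum(v[0] for v in vals) / len(vals)
        mean_adapted = sum(v[1] for v in vals) / len(vals)
        summary[severity] = (round(mean_unadapted, 2), round(mean_adapted, 2))
    return summary
```

Each speaker counts equally within a group regardless of how many test utterances they contributed, which matches the "first averaged per speaker" description.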
