cdli/whisper-small_finetuned_ugandan_english_nonstandard_speech_v1.0

Dataset

The model has been fine-tuned using cdli/ugandan_english_nonstandard_speech_v1.0, a dataset of speech samples of people living with impaired speech across a range of impairment severity levels and etiologies.

Training

The train split was used for training, and the dev split for selecting the best checkpoint.

All model parameters (encoder, decoder, and output projection) were fine-tuned.

Evaluation

The model was evaluated on the test split of the dataset, excluding utterances longer than 30 seconds. The resulting evaluation set:

Examples evaluated: 1013
Speakers: 9

For decoding, we ran Whisper with language=en, task=transcribe, and greedy search (num_beams=1, do_sample=False).
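As a rough illustration of these decoding settings, the sketch below passes them to the Hugging Face transformers library. It is an assumption that the evaluation used this API. The helper name transcribe, the 16 kHz mono input, and loading the processor from the same repository are hypothetical choices, and the language and task arguments require a reasonably recent transformers release.

```python
# Decoding settings as stated in the model card.
GENERATE_KWARGS = {
    "language": "en",
    "task": "transcribe",
    "num_beams": 1,      # greedy search: a single beam
    "do_sample": False,  # no sampling
}

MODEL_ID = "cdli/whisper-small_finetuned_ugandan_english_nonstandard_speech_v1.0"


def transcribe(audio, sampling_rate=16_000, model_id=MODEL_ID):
    """Transcribe one utterance (1-D float array, mono) with greedy decoding.

    Hypothetical helper: assumes the repository ships the processor files.
    """
    # Imported lazily so the settings above can be inspected without the library.
    from transformers import WhisperForConditionalGeneration, WhisperProcessor

    processor = WhisperProcessor.from_pretrained(model_id)
    model = WhisperForConditionalGeneration.from_pretrained(model_id)

    # Convert the waveform to log-mel input features.
    inputs = processor(audio, sampling_rate=sampling_rate, return_tensors="pt")
    predicted_ids = model.generate(inputs.input_features, **GENERATE_KWARGS)
    return processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
```

Swapping model_id for openai/whisper-small would give the unadapted baseline under the same settings.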

Results are compared against the unadapted base model openai/whisper-small, evaluated identically, to show the effect of fine-tuning on non-standard speech.

We report two complementary word error rate (WER) metrics, both computed on text normalized with Whisper's BasicTextNormalizer:

Standard (corpus-level) WER — the usual error rate, pooling all reference words and edit errors across the entire test set.
Per-utterance averaged WER — WER computed separately for each utterance, each capped at 1.0, then averaged across utterances.

The per-utterance averaged WER bounds each utterance's score to [0, 1] and weights all utterances equally, so it reflects typical performance without a few catastrophic utterances dominating. It is not a true error rate, however, and isn't directly comparable to other published WER figures, so we also report the standard, corpus-level WER.
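The difference between the two metrics can be sketched in a few lines of plain Python. This is an illustrative reimplementation, not the evaluation script: the function names are hypothetical, the inputs are assumed to be already normalized, and every reference is assumed to contain at least one word.

```python
def edit_distance(ref, hyp):
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def corpus_wer(refs, hyps):
    """Standard WER: pool all errors and all reference words."""
    errors = sum(edit_distance(r.split(), h.split()) for r, h in zip(refs, hyps))
    words = sum(len(r.split()) for r in refs)
    return errors / words


def per_utterance_avg_wer(refs, hyps, cap=1.0):
    """WER per utterance, capped at `cap`, then averaged over utterances."""
    scores = []
    for r, h in zip(refs, hyps):
        ref_words = r.split()
        wer = edit_distance(ref_words, h.split()) / len(ref_words)
        scores.append(min(wer, cap))
    return sum(scores) / len(scores)


# Toy data (made up): one perfect utterance, one with many inserted words.
refs = ["the cat sat", "hello"]
hyps = ["the cat sat", "hi there my friend"]
print(corpus_wer(refs, hyps))             # 1.0
print(per_utterance_avg_wer(refs, hyps))  # 0.5
```

In this toy case the second utterance contributes four errors against a single reference word, which pushes the corpus-level WER to 1.0, while the capped per-utterance average stays at 0.5.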

Results

Overall Results

Adapted — this CDLI model, fine-tuned on non-standard speech.
Unadapted — the base model it was fine-tuned from (here: openai/whisper-small).

Model	Standard WER	Per-utterance averaged WER
Adapted	0.26	0.24
Unadapted	0.42	0.36
Relative improvement	39%	33%
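
The relative improvement row is presumably the usual relative WER reduction, (unadapted - adapted) / unadapted; the card does not state the formula, so treat this as an assumption. A minimal sketch with a hypothetical helper name:

```python
def relative_improvement(unadapted_wer, adapted_wer):
    """Fraction of the unadapted WER removed by fine-tuning."""
    return (unadapted_wer - adapted_wer) / unadapted_wer


# Per-utterance averaged WER from the table above.
print(f"{relative_improvement(0.36, 0.24):.0%}")  # 33%
```

Applied to the rounded standard WERs (0.42 and 0.26), the same formula gives about 38% rather than the reported 39%. The likely explanation is that the table was computed from unrounded values.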

Detailed Analysis

Aggregated results can hide important underlying patterns, so we also break the WER down by subset: per speaker, and — where speaker severity is available — per impairment severity group.

Results by impairment severity

All WER values below are the per-utterance averaged WER, first averaged per speaker and then averaged within each severity group. n_speakers and n_utterances are the number of speakers and test utterances in each group.

severity	n_speakers	n_utterances	Avg WER (unadapted model)	Avg WER (adapted model)	Rel. improvement
mild	3	334	0.29	0.22	24%
moderate	3	340	0.37	0.24	35%
severe	3	339	0.39	0.26	32%
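
The two-stage averaging behind this table (utterances to speaker, then speakers to severity group) can be sketched as follows. The function name, the row layout, and the toy values are hypothetical and are not taken from the dataset.

```python
from collections import defaultdict
from statistics import mean


def severity_group_wer(rows):
    """rows: iterable of (speaker_id, severity, capped_utterance_wer)."""
    per_speaker = defaultdict(list)
    severity_of = {}
    for speaker_id, severity, wer in rows:
        per_speaker[speaker_id].append(wer)
        severity_of[speaker_id] = severity

    # Stage 1: average utterance WERs within each speaker.
    # Stage 2: average the speaker means within each severity group.
    per_group = defaultdict(list)
    for speaker_id, wers in per_speaker.items():
        per_group[severity_of[speaker_id]].append(mean(wers))
    return {severity: mean(vals) for severity, vals in per_group.items()}


# Toy data (made up): speaker A has four utterances, speaker B has one.
rows = [
    ("A", "mild", 0.0), ("A", "mild", 0.5), ("A", "mild", 1.0), ("A", "mild", 0.5),
    ("B", "mild", 0.0),
]
print(severity_group_wer(rows))  # {'mild': 0.25}
```

Averaging per speaker first gives each speaker equal weight regardless of utterance count: pooling the five toy utterances directly would give 0.4 instead of 0.25.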

Results by speaker

Per-utterance averaged WER per speaker. n_utterances is the number of test utterances for that speaker.

speaker_id	severity	etiology	n_utterances	Avg WER (unadapted model)	Avg WER (adapted model)	Rel. improvement
UG001	mild	Cerebral palsy - cerebral malaria	102	0.32	0.26	19%
UG014	mild	Idiopathic	133	0.17	0.13	23%
UG022	mild	Developmental	99	0.37	0.27	27%
UG021	moderate	Structural presence of ankyloglossia (tongue-tie)	77	0.22	0.20	11%
UG036	moderate	Cerebral palsy	151	0.47	0.23	51%
UG052	moderate	Developmental	112	0.42	0.30	29%
UG042	severe	Developmental	92	0.27	0.26	5%
UG057	severe	Acquired hearing impairment	96	0.50	0.31	39%
UG058	severe	Developmental	151	0.39	0.23	43%
