cdli/whisper-small_finetuned_kenyan_swahili_nonstandard_speech_v1.0 API & Inference Endpoint

Evaluation

This model was evaluated on the test split of cdli/kenyan_swahili_nonstandard_speech_v1.0, a non-standard speech dataset for Kenyan Swahili. Examples longer than 30 seconds were excluded. For decoding we ran Whisper with language=sw, task=transcribe, greedy search (num_beams=1, do_sample=False). Results are compared against the unadapted base model cdli/whisper-small-Swahili_finetuned_small_CV20, evaluated identically, to show the effect of fine-tuning on non-standard speech.
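The setup described above can be sketched roughly as follows. This is a minimal illustration, not the authors' evaluation script. The helper names (`keep_example`, `build_pipeline`, `transcribe`) are hypothetical, and it assumes the Hugging Face `transformers` automatic-speech-recognition pipeline, which is imported lazily so the filtering logic can be used without it.

```python
MAX_DURATION_S = 30.0  # examples longer than this were excluded

# Decoding settings stated in the text: Swahili, transcription, greedy search.
GENERATE_KWARGS = {
    "language": "sw",
    "task": "transcribe",
    "num_beams": 1,
    "do_sample": False,
}


def keep_example(num_samples, sampling_rate, max_duration_s=MAX_DURATION_S):
    """Return True if the clip is at most max_duration_s seconds long."""
    return num_samples / sampling_rate <= max_duration_s


def build_pipeline(model_id):
    """Load an ASR pipeline (assumes `transformers` is installed)."""
    from transformers import pipeline  # lazy import: only needed for inference
    return pipeline("automatic-speech-recognition", model=model_id)


def transcribe(asr, audio_path):
    """Transcribe one audio file with the decoding settings above."""
    return asr(audio_path, generate_kwargs=GENERATE_KWARGS)["text"]
```

Running both the adapted and the unadapted model through the same `transcribe` call is one way to keep the comparison identical, as the text requires.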

Examples evaluated: 849
Speakers: 9

Metrics

We report two complementary word error rate (WER) metrics, both computed on text normalized with Whisper's BasicTextNormalizer:

Standard (corpus-level) WER — the usual error rate, pooling all reference words and edit errors across the entire test set.
Per-utterance averaged WER — WER computed separately for each utterance, each capped at 1.0, then averaged across utterances.

The per-utterance averaged WER bounds each utterance to [0, 1] and weights all utterances equally, so it reflects typical performance without a few catastrophic utterances dominating. However, it is not a true error rate and is not directly comparable to published WER figures, so we also report the standard corpus-level WER.
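The difference between the two metrics can be illustrated with a small sketch. It is not the evaluation code used for this model: `normalize` is a simplified stand-in for Whisper's BasicTextNormalizer (an assumption made to keep the example dependency-free), the function names are hypothetical, and references are assumed to be non-empty after normalization.

```python
import re


def normalize(text):
    # Simplified stand-in for Whisper's BasicTextNormalizer (assumption):
    # lowercase, replace punctuation with spaces, collapse whitespace.
    text = re.sub(r"[^\w\s]", " ", text.lower())
    return " ".join(text.split())


def word_errors(ref_words, hyp_words):
    """Word-level Levenshtein distance (substitutions + insertions + deletions)."""
    prev = list(range(len(hyp_words) + 1))
    for i, r in enumerate(ref_words, 1):
        cur = [i]
        for j, h in enumerate(hyp_words, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = cur
    return prev[-1]


def corpus_wer(refs, hyps):
    """Standard WER: pool errors and reference words over the whole set."""
    errors = words = 0
    for ref, hyp in zip(refs, hyps):
        r, h = normalize(ref).split(), normalize(hyp).split()
        errors += word_errors(r, h)
        words += len(r)
    return errors / words


def per_utterance_wer(refs, hyps, cap=1.0):
    """WER per utterance, capped at `cap`, then averaged over utterances."""
    scores = []
    for ref, hyp in zip(refs, hyps):
        r, h = normalize(ref).split(), normalize(hyp).split()
        scores.append(min(word_errors(r, h) / len(r), cap))
    return sum(scores) / len(scores)


refs = ["habari ya asubuhi", "asante"]
hyps = ["habari za asubuhi", "asante sana rafiki yangu"]
print(corpus_wer(refs, hyps))         # 4 errors over 4 reference words
print(per_utterance_wer(refs, hyps))  # mean of 1/3 and a capped 1.0
```

In this toy example the second utterance has three insertions against a one-word reference, so its raw WER is 3.0. Pooled, it drives the corpus-level WER to 1.0, while the capped per-utterance average stays at about 0.67.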

Overall Results

Adapted — this CDLI model, fine-tuned on non-standard speech.
Unadapted — the base model it was fine-tuned from (here: cdli/whisper-small-Swahili_finetuned_small_CV20).

Model	Standard WER	Per-utterance averaged WER
Adapted	0.44	0.37
Unadapted	0.57	0.56
Relative improvement	24%	35%
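
The relative improvement row is presumably the fraction of the unadapted model's WER that adaptation removes. The source does not state the formula, so the definition below is an assumption, and `relative_improvement` is a hypothetical helper name.

```python
def relative_improvement(unadapted_wer, adapted_wer):
    """Fraction of the unadapted model's WER removed by adaptation."""
    return (unadapted_wer - adapted_wer) / unadapted_wer


# Rounded WER values copied from the table above.
print(f"Standard WER:      {relative_improvement(0.57, 0.44):.1%}")
print(f"Per-utterance WER: {relative_improvement(0.56, 0.37):.1%}")
```

Recomputed from the two-decimal values in the table, this gives roughly 23% and 34%. The reported 24% and 35% were most likely computed from unrounded WER values, which would account for the one-point difference.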

Detailed Analysis

For non-standard speech, aggregated results can hide important underlying patterns, so we also report WER by severity group and by speaker.

Results by impairment severity

All WER values below are the per-utterance averaged WER, first averaged per speaker and then averaged within each severity group. n_speakers and n_utterances are the number of speakers and test utterances in each group.

severity	n_speakers	n_utterances	Avg WER (unadapted model)	Avg WER (adapted model)	Rel. improvement
mild	3	278	0.42	0.28	33%
moderate	3	235	0.57	0.35	38%
severe	3	336	0.67	0.44	35%

Results by speaker

Per-utterance averaged WER per speaker. n_utterances is the number of test utterances for that speaker.

speaker_id	severity	etiology	n_utterances	Avg WER (unadapted model)	Avg WER (adapted model)	Rel. improvement
KES013	mild	Cerebral Palsy	107	0.56	0.42	25%
KES021	mild	Parkinson’s Disease	68	0.38	0.24	39%
KES030	mild	Neurodevelopmental disorder	103	0.33	0.19	42%
KES012	moderate	Neurodevelopmental disorder	97	0.43	0.24	44%
KES028	moderate	Cerebral Palsy	44	0.75	0.42	44%
KES035	moderate	Multiple Sclerosis (MS)	94	0.52	0.40	23%
KES001	severe	Cerebral Palsy	140	0.78	0.53	32%
KES002	severe	Cerebral Palsy	82	0.64	0.41	36%
KES010	severe	Neurodevelopmental disorder	114	0.60	0.37	38%
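
The severity-group figures can be reproduced from the per-speaker table above by averaging speakers within each group, so that each speaker counts equally regardless of how many utterances they contributed. The sketch below uses the rounded per-speaker values from the table; `severity_averages` is a hypothetical helper name, not part of any released evaluation code.

```python
from collections import defaultdict

# (speaker_id, severity, unadapted WER, adapted WER), copied from the table above.
SPEAKERS = [
    ("KES013", "mild", 0.56, 0.42),
    ("KES021", "mild", 0.38, 0.24),
    ("KES030", "mild", 0.33, 0.19),
    ("KES012", "moderate", 0.43, 0.24),
    ("KES028", "moderate", 0.75, 0.42),
    ("KES035", "moderate", 0.52, 0.40),
    ("KES001", "severe", 0.78, 0.53),
    ("KES002", "severe", 0.64, 0.41),
    ("KES010", "severe", 0.60, 0.37),
]


def severity_averages(rows):
    """Average per-speaker WER within each severity group (speakers weighted equally)."""
    groups = defaultdict(list)
    for _speaker, severity, unadapted, adapted in rows:
        groups[severity].append((unadapted, adapted))
    return {
        severity: (
            sum(u for u, _ in pairs) / len(pairs),
            sum(a for _, a in pairs) / len(pairs),
        )
        for severity, pairs in groups.items()
    }


for severity, (unadapted, adapted) in severity_averages(SPEAKERS).items():
    print(f"{severity}: unadapted {unadapted:.2f}, adapted {adapted:.2f}")
```

The printed averages match the severity table (0.42/0.28, 0.57/0.35, 0.67/0.44), which is consistent with the speaker-then-group averaging described there.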
