Evaluation
This model was evaluated on the test split of
cdli/kenyan_swahili_nonstandard_speech_v1.0, a
non-standard speech dataset for Kenyan Swahili. Examples longer than 30 seconds were excluded.
For decoding, we ran Whisper with language=sw, task=transcribe, and
greedy search (num_beams=1, do_sample=False). For comparison, the unadapted base model cdli/whisper-small-Swahili_finetuned_small_CV20 was evaluated under identical settings, which isolates the effect of fine-tuning on non-standard speech.
- Examples evaluated: 849
- Speakers: 9
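The setup described above could be reproduced roughly as follows. This is a minimal sketch, not the authors' evaluation script: it assumes Hugging Face `datasets`-style examples whose `audio` field holds an `array` and a `sampling_rate`, and the helper names (`duration_seconds`, `keep_short`, `transcribe`) and the `model_id` argument are hypothetical. Only the decoding settings and the 30-second cutoff come from the text.

```python
MAX_SECONDS = 30.0  # cutoff stated in the card

# Decoding settings stated in the card: Swahili, transcription, greedy search.
GENERATE_KWARGS = {
    "language": "sw",
    "task": "transcribe",
    "num_beams": 1,
    "do_sample": False,
}


def duration_seconds(example):
    """Audio length in seconds (assumed layout: example['audio'] dict)."""
    audio = example["audio"]
    return len(audio["array"]) / audio["sampling_rate"]


def keep_short(examples, max_seconds=MAX_SECONDS):
    """Drop examples longer than the cutoff."""
    return [ex for ex in examples if duration_seconds(ex) <= max_seconds]


def transcribe(examples, model_id):
    """Transcribe the retained examples with a Whisper checkpoint.

    model_id is a placeholder for the adapted or unadapted checkpoint.
    transformers is imported lazily so the helpers above work without it.
    """
    from transformers import pipeline

    asr = pipeline("automatic-speech-recognition", model=model_id)
    texts = []
    for ex in keep_short(examples):
        audio = ex["audio"]
        out = asr(
            {"raw": audio["array"], "sampling_rate": audio["sampling_rate"]},
            generate_kwargs=GENERATE_KWARGS,
        )
        texts.append(out["text"])
    return texts
```

Running `transcribe` once per checkpoint with the same examples keeps the adapted and unadapted runs comparable.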
Metrics
We report two complementary word error rate (WER) metrics, both computed on text
normalized with Whisper's BasicTextNormalizer:
- Standard (corpus-level) WER — the usual error rate, pooling all reference words and edit errors across the entire test set.
- Per-utterance averaged WER — WER computed separately for each utterance, each capped at 1.0, then averaged across utterances.
The per-utterance averaged WER bounds each utterance's score to [0, 1] and weights all utterances equally, so it reflects typical performance without a few catastrophic utterances dominating. However, it is not a true error rate and is not directly comparable to WER figures published elsewhere, so we also report the standard corpus-level WER.
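The difference between the two metrics can be illustrated with a small sketch. It assumes the reference and hypothesis strings have already been normalized and that every reference is non-empty; the function names are hypothetical, and the word-level edit distance is a plain dynamic-programming implementation rather than whatever library the authors used.

```python
def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp_words) + 1))
    for i, r in enumerate(ref_words, 1):
        cur = [i]
        for j, h in enumerate(hyp_words, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # substitution or match
        prev = cur
    return prev[-1]


def corpus_wer(refs, hyps):
    """Standard WER: pool all errors and all reference words."""
    errors = sum(edit_distance(r.split(), h.split()) for r, h in zip(refs, hyps))
    words = sum(len(r.split()) for r in refs)
    return errors / words


def per_utterance_wer(refs, hyps, cap=1.0):
    """Average of per-utterance WERs, each capped at `cap`."""
    scores = []
    for r, h in zip(refs, hyps):
        ref_words = r.split()
        wer = edit_distance(ref_words, h.split()) / len(ref_words)
        scores.append(min(wer, cap))
    return sum(scores) / len(scores)


refs = ["a b c d", "e"]
hyps = ["a b c d", "x y z"]  # second utterance: 3 errors against 1 reference word
print(corpus_wer(refs, hyps))         # 0.6
print(per_utterance_wer(refs, hyps))  # 0.5
```

In this toy case the second utterance has an uncapped WER of 3.0; capping it at 1.0 is what keeps a single bad utterance from dominating the average.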
Overall Results
- Adapted — this CDLI model, fine-tuned on non-standard speech.
- Unadapted — the base model it was fine-tuned from (here:
cdli/whisper-small-Swahili_finetuned_small_CV20).
| Model | Standard WER | Per-utterance averaged WER |
|---|---|---|
| Adapted | 0.44 | 0.37 |
| Unadapted | 0.57 | 0.56 |
| Relative improvement | 24% | 35% |
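The card does not define relative improvement; the sketch below assumes the usual definition, (unadapted - adapted) / unadapted, and the function name is hypothetical. Applied to the rounded values in the table, it gives slightly lower figures than the published 24% and 35%, which would be consistent with the published percentages having been computed from unrounded WERs (an assumption, not something the card states).

```python
def relative_improvement(unadapted_wer, adapted_wer):
    """Fraction of the unadapted error removed by adaptation (assumed definition)."""
    return (unadapted_wer - adapted_wer) / unadapted_wer


# Rounded table values, so results differ slightly from the published row.
print(f"{relative_improvement(0.57, 0.44):.1%}")  # 22.8%
print(f"{relative_improvement(0.56, 0.37):.1%}")  # 33.9%
```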
Detailed Analysis
For non-standard speech, aggregated results can hide important underlying patterns. We therefore also report WER for two kinds of subsets: by severity group and by speaker.
Results by impairment severity
All WER values below are the per-utterance averaged WER, first averaged per
speaker and then averaged within each severity group. n_speakers and
n_utterances are the number of speakers and test utterances in each group.
| severity | n_speakers | n_utterances | Avg WER (unadapted model) | Avg WER (adapted model) | Rel. improvement |
|---|---|---|---|---|---|
| mild | 3 | 278 | 0.42 | 0.28 | 33% |
| moderate | 3 | 235 | 0.57 | 0.35 | 38% |
| severe | 3 | 336 | 0.67 | 0.44 | 35% |
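The two-stage averaging described for the severity table (utterances to speakers, then speakers to severity groups) can be sketched as follows. The record layout and function name are hypothetical; the input is assumed to be one capped per-utterance WER per record.

```python
from collections import defaultdict
from statistics import mean


def severity_averages(records):
    """records: iterable of (speaker_id, severity, capped_utterance_wer).

    Averages per speaker first, then averages the speaker means within each
    severity group, so every speaker counts equally regardless of how many
    utterances they contributed.
    """
    per_speaker = defaultdict(list)
    severity_of = {}
    for speaker_id, severity, wer in records:
        per_speaker[speaker_id].append(wer)
        severity_of[speaker_id] = severity

    per_group = defaultdict(list)
    for speaker_id, wers in per_speaker.items():
        per_group[severity_of[speaker_id]].append(mean(wers))

    return {severity: mean(values) for severity, values in per_group.items()}


toy = [
    ("A", "mild", 0.0),
    ("A", "mild", 1.0),
    ("B", "mild", 0.2),
]
# Speaker means are 0.5 and 0.2; pooling utterances directly would give 0.4.
print(round(severity_averages(toy)["mild"], 2))  # 0.35
```

The design choice matters when speakers contribute different numbers of utterances, as they do here (44 to 140 per speaker).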
Results by speaker
Per-utterance averaged WER per speaker. n_utterances is the number of test
utterances for that speaker.
| speaker_id | severity | etiology | n_utterances | Avg WER (unadapted model) | Avg WER (adapted model) | Rel. improvement |
|---|---|---|---|---|---|---|
| KES013 | mild | Cerebral Palsy | 107 | 0.56 | 0.42 | 25% |
| KES021 | mild | Parkinson’s Disease | 68 | 0.38 | 0.24 | 39% |
| KES030 | mild | Neurodevelopmental disorder | 103 | 0.33 | 0.19 | 42% |
| KES012 | moderate | Neurodevelopmental disorder | 97 | 0.43 | 0.24 | 44% |
| KES028 | moderate | Cerebral Palsy | 44 | 0.75 | 0.42 | 44% |
| KES035 | moderate | Multiple Sclerosis (MS) | 94 | 0.52 | 0.40 | 23% |
| KES001 | severe | Cerebral Palsy | 140 | 0.78 | 0.53 | 32% |
| KES002 | severe | Cerebral Palsy | 82 | 0.64 | 0.41 | 36% |
| KES010 | severe | Neurodevelopmental disorder | 114 | 0.60 | 0.37 | 38% |
Model provider
cdli