cdli
whisper-small_finetuned_ghanian_ga_standard_speech_v1.0
Dataset
The model has been fine-tuned using cdli/ghanian_ga_standard_speech_v1.0.
Training
The train split was used for training, and the dev split for selecting the best checkpoint.
This Whisper model was fine-tuned, and is decoded, with the Yoruba (yo) language setting, since Yoruba is the language most similar to Ga among all those Whisper supports.
All model parameters (encoder, decoder, and output projection) were fine-tuned.
Evaluation
This model was evaluated on the test split of the dataset, after excluding utterances longer than 30 seconds. The resulting evaluation set:
- Examples evaluated: 2025
- Speakers: 22
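The 30-second exclusion rule can be sketched as a simple duration check. This is a minimal illustration, not the authors' filtering code: the function names and the 16 kHz sample rate in the usage lines are assumptions.

```python
MAX_SECONDS = 30.0  # cutoff stated in the evaluation setup


def duration_seconds(num_samples: int, sample_rate: int) -> float:
    """Length of an audio clip in seconds."""
    return num_samples / sample_rate


def keep_utterance(num_samples: int, sample_rate: int,
                   max_seconds: float = MAX_SECONDS) -> bool:
    """True if the clip is at most max_seconds long (longer clips are excluded)."""
    return duration_seconds(num_samples, sample_rate) <= max_seconds


# Usage, assuming 16 kHz audio (hypothetical clips):
exactly_30s = keep_utterance(16_000 * 30, 16_000)      # kept
just_over = keep_utterance(16_000 * 30 + 1, 16_000)    # excluded
```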
For decoding, we ran Whisper with language=yo, task=transcribe, and greedy search (num_beams=1, do_sample=False).
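As a rough sketch, these decoding settings could be passed to the Hugging Face `transformers` speech-recognition pipeline as shown here. The repository id is an assumption pieced together from the provider and model name on this page, and `transcribe` is a hypothetical helper; it also assumes `transformers`, a backend such as PyTorch, and ffmpeg are installed.

```python
# Assumed repository id (provider + model name); adjust if the model lives elsewhere.
MODEL_ID = "cdli/whisper-small_finetuned_ghanian_ga_standard_speech_v1.0"

# Decoding settings as described in the evaluation setup.
GENERATE_KWARGS = {
    "language": "yo",       # Yoruba language setting, used as a stand-in for Ga
    "task": "transcribe",
    "num_beams": 1,         # greedy search
    "do_sample": False,
}


def transcribe(audio_path: str) -> str:
    """Hypothetical helper: transcribe one audio file with the settings above."""
    # Imported lazily so the settings can be inspected without transformers installed.
    from transformers import pipeline

    asr = pipeline("automatic-speech-recognition", model=MODEL_ID)
    result = asr(audio_path, generate_kwargs=GENERATE_KWARGS)
    return result["text"]
```

Calling `transcribe("sample.wav")` (with a hypothetical file name) would download the model on first use.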
We report two complementary word error rate (WER) metrics, both computed on text normalized with Whisper's BasicTextNormalizer:
- Standard (corpus-level) WER — the usual error rate, pooling all reference words and edit errors across the entire test set.
- Per-utterance averaged WER — WER computed separately for each utterance, each capped at 1.0, then averaged across utterances.
The per-utterance averaged WER bounds each utterance's score to [0, 1] and weights all utterances equally, so it reflects typical performance without a few catastrophic utterances dominating. It is not a true error rate, however, and is not directly comparable to other published WER figures, which is why we also report the standard, corpus-level WER.
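The difference between the two metrics can be illustrated with a small pure-Python sketch. It is not the evaluation code used for this model: the function names are illustrative, the inputs are assumed to be already normalized, and every reference is assumed to contain at least one word.

```python
from typing import List, Sequence, Tuple


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,           # deletion
                            curr[j - 1] + 1,       # insertion
                            prev[j - 1] + cost))   # match or substitution
        prev = curr
    return prev[-1]


def corpus_wer(pairs: List[Tuple[str, str]]) -> float:
    """Standard WER: pool all errors and all reference words, then divide."""
    errors = sum(edit_distance(r.split(), h.split()) for r, h in pairs)
    words = sum(len(r.split()) for r, _ in pairs)
    return errors / words


def per_utterance_avg_wer(pairs: List[Tuple[str, str]]) -> float:
    """WER per utterance, each capped at 1.0, averaged with equal weight."""
    scores = []
    for r, h in pairs:
        ref_words = r.split()  # assumed non-empty
        wer = edit_distance(ref_words, h.split()) / len(ref_words)
        scores.append(min(wer, 1.0))
    return sum(scores) / len(scores)


# Toy (reference, hypothesis) pairs: one perfect, one catastrophic.
pairs = [("a b c d", "a b c d"), ("e f", "x y z w q")]
print(corpus_wer(pairs))             # 5 errors / 6 reference words
print(per_utterance_avg_wer(pairs))  # (0.0 + 1.0) / 2 = 0.5
```

In this toy example the second utterance has an uncapped WER of 2.5, which pulls the corpus-level figure up to about 0.83, while the capped per-utterance average stays at 0.5.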
Results
Overall Results
| Model | Standard WER | Per-utterance averaged WER |
|---|---|---|
| This model | 0.19 | 0.19 |
Detailed Analysis
Aggregated results can hide important underlying patterns, so we also break the WER down by subset: per speaker, and — where speaker severity is available — per impairment severity group.
Results by speaker
The table below gives the per-utterance averaged WER (Avg WER) for each speaker; n_utterances is the number of test utterances for that speaker.
| speaker_id | n_utterances | Avg WER |
|---|---|---|
| 496 | 107 | 0.14 |
| 520 | 82 | 0.26 |
| 531 | 95 | 0.17 |
| 533 | 115 | 0.19 |
| 538 | 66 | 0.14 |
| 541 | 133 | 0.24 |
| 545 | 47 | 0.15 |
| 546 | 93 | 0.19 |
| 562 | 107 | 0.16 |
| 579 | 114 | 0.32 |
| 587 | 86 | 0.16 |
| 602 | 115 | 0.14 |
| 622 | 40 | 0.13 |
| 625 | 63 | 0.13 |
| 644 | 133 | 0.18 |
| 651 | 64 | 0.13 |
| 666 | 73 | 0.20 |
| 675 | 105 | 0.22 |
| 693 | 106 | 0.28 |
| 707 | 124 | 0.08 |
| 736 | 111 | 0.19 |
| 780 | 46 | 0.26 |
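As a consistency check, the per-speaker rows can be recombined into an overall figure. Because each speaker's value is an average over that speaker's utterances, weighting it by n_utterances should reproduce the overall per-utterance averaged WER, up to the rounding already applied in the table. A minimal sketch, with the rows copied from the table and an illustrative function name:

```python
# (speaker_id, n_utterances, avg_wer) rows copied from the per-speaker table.
ROWS = [
    (496, 107, 0.14), (520, 82, 0.26), (531, 95, 0.17), (533, 115, 0.19),
    (538, 66, 0.14), (541, 133, 0.24), (545, 47, 0.15), (546, 93, 0.19),
    (562, 107, 0.16), (579, 114, 0.32), (587, 86, 0.16), (602, 115, 0.14),
    (622, 40, 0.13), (625, 63, 0.13), (644, 133, 0.18), (651, 64, 0.13),
    (666, 73, 0.20), (675, 105, 0.22), (693, 106, 0.28), (707, 124, 0.08),
    (736, 111, 0.19), (780, 46, 0.26),
]


def utterance_weighted_mean(rows):
    """Average of per-speaker WER, weighted by each speaker's utterance count."""
    total_utts = sum(n for _, n, _ in rows)
    weighted_sum = sum(n * wer for _, n, wer in rows)
    return weighted_sum / total_utts


total = sum(n for _, n, _ in ROWS)
overall = utterance_weighted_mean(ROWS)
print(total, round(overall, 2))  # prints: 2025 0.19
```

The utterance counts sum to the 2025 evaluated examples, and the weighted mean rounds to the 0.19 reported in the overall results.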