cdli

whisper-small_finetuned_ghanian_ga_standard_speech_v1.0

README

Dataset

The model has been fine-tuned using cdli/ghanian_ga_standard_speech_v1.0.

Training

The train split was used for training, and the dev split for selecting the best checkpoint.

This Whisper model was fine-tuned, and is decoded, with the Yoruba (yo) language setting, chosen as the Whisper-supported language most similar to Ga.

All model parameters (encoder, decoder, and output projection) were fine-tuned.

Evaluation

This model was evaluated on the test split of the dataset, with utterances longer than 30 seconds excluded. The resulting evaluation set contains:

Examples evaluated: 2025
Speakers: 22

For decoding we ran Whisper with language=yo, task=transcribe, greedy search (num_beams=1, do_sample=False).
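The decoding settings above can be reproduced roughly as follows. This is a minimal sketch using the Hugging Face `transformers` Whisper classes, not the authors' evaluation script. The repository ID is an assumption assembled from the organization and model name shown on this page, and `transcribe` is a hypothetical helper name.

```python
# Decoding settings as stated in this model card.
GENERATION_KWARGS = {
    "language": "yo",       # Yoruba language setting, used as a proxy for Ga
    "task": "transcribe",
    "num_beams": 1,         # greedy search
    "do_sample": False,
}

# Assumption: repository ID built from the org and model name on this page.
MODEL_ID = "cdli/whisper-small_finetuned_ghanian_ga_standard_speech_v1.0"


def transcribe(audio, sampling_rate=16000, model_id=MODEL_ID):
    """Transcribe one mono waveform (1-D float array, up to 30 seconds).

    Whisper's feature extractor expects 16 kHz audio, so resample first
    if the recording uses a different rate.
    """
    # Imported lazily so the settings above can be inspected without
    # having transformers and torch installed.
    from transformers import WhisperForConditionalGeneration, WhisperProcessor

    processor = WhisperProcessor.from_pretrained(model_id)
    model = WhisperForConditionalGeneration.from_pretrained(model_id)

    inputs = processor(audio, sampling_rate=sampling_rate, return_tensors="pt")
    predicted_ids = model.generate(inputs.input_features, **GENERATION_KWARGS)
    return processor.batch_decode(predicted_ids, skip_special_tokens=True)[0]
```

Loading the processor and model inside the function keeps the sketch short. For batch evaluation it would be better to load them once and reuse them.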

We report two complementary word error rate (WER) metrics, both computed on text normalized with Whisper's BasicTextNormalizer:

Standard (corpus-level) WER — the usual error rate, pooling all reference words and edit errors across the entire test set.
Per-utterance averaged WER — WER computed separately for each utterance, each capped at 1.0, then averaged across utterances.

The per-utterance averaged WER bounds each utterance's score to [0, 1] and weights all utterances equally, so it reflects typical performance without a few catastrophic utterances dominating. However, it is not a true error rate and is not directly comparable to published WER figures, so we also report the standard corpus-level WER.
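To make the difference between the two metrics concrete, here is a small self-contained sketch. The function names are hypothetical, and the handling of empty references is an assumption, since the card does not say how they are treated. By default no normalization is applied; to mirror the card, one could pass an instance of `BasicTextNormalizer` from `transformers.models.whisper.english_normalizer` as `normalize`.

```python
def edit_distance(ref_words, hyp_words):
    """Word-level Levenshtein distance (substitutions, insertions, deletions)."""
    prev = list(range(len(hyp_words) + 1))
    for i, ref_word in enumerate(ref_words, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp_words, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer_metrics(references, hypotheses, normalize=lambda text: text):
    """Return (corpus-level WER, per-utterance averaged WER capped at 1.0)."""
    total_errors = 0
    total_ref_words = 0
    capped_scores = []
    for ref, hyp in zip(references, hypotheses):
        ref_words = normalize(ref).split()
        hyp_words = normalize(hyp).split()
        errors = edit_distance(ref_words, hyp_words)
        total_errors += errors
        total_ref_words += len(ref_words)
        if ref_words:
            capped_scores.append(min(errors / len(ref_words), 1.0))
        else:
            # Assumption: an empty reference scores 0 if the hypothesis is
            # also empty, otherwise 1.
            capped_scores.append(0.0 if not hyp_words else 1.0)
    corpus_wer = total_errors / total_ref_words
    per_utterance_avg = sum(capped_scores) / len(capped_scores)
    return corpus_wer, per_utterance_avg


# One perfect utterance and one short utterance with a badly wrong output.
corpus, averaged = wer_metrics(["a b c d", "e"], ["a b c d", "x y z"])
print(corpus, averaged)
```

In this toy case the second utterance contributes 3 errors against 1 reference word. Pooled over 5 reference words, that gives a corpus-level WER of 0.6, while the capped per-utterance scores of 0 and 1 average to 0.5.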

Results

Overall Results

Model	Standard WER	Per-utterance averaged WER
This model	0.19	0.19

Detailed Analysis

Aggregated results can hide important underlying patterns, so we also break the WER down by subset: per speaker and, where speaker severity is available, per impairment severity group.

Results by speaker

Per-utterance averaged WER per speaker. n_utterances is the number of test utterances for that speaker.

speaker_id	n_utterances	Avg WER
496	107	0.14
520	82	0.26
531	95	0.17
533	115	0.19
538	66	0.14
541	133	0.24
545	47	0.15
546	93	0.19
562	107	0.16
579	114	0.32
587	86	0.16
602	115	0.14
622	40	0.13
625	63	0.13
644	133	0.18
651	64	0.13
666	73	0.20
675	105	0.22
693	106	0.28
707	124	0.08
736	111	0.19
780	46	0.26
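As a consistency check on the table above: because each speaker's value is a per-utterance average, the utterance-weighted mean of the per-speaker values should approximately reproduce the overall per-utterance averaged WER. The sketch below uses the table's rounded values, so the result is only approximate.

```python
# (speaker_id, n_utterances, per-utterance averaged WER), copied from the table.
per_speaker = [
    (496, 107, 0.14), (520, 82, 0.26), (531, 95, 0.17), (533, 115, 0.19),
    (538, 66, 0.14), (541, 133, 0.24), (545, 47, 0.15), (546, 93, 0.19),
    (562, 107, 0.16), (579, 114, 0.32), (587, 86, 0.16), (602, 115, 0.14),
    (622, 40, 0.13), (625, 63, 0.13), (644, 133, 0.18), (651, 64, 0.13),
    (666, 73, 0.20), (675, 105, 0.22), (693, 106, 0.28), (707, 124, 0.08),
    (736, 111, 0.19), (780, 46, 0.26),
]

total_utterances = sum(n for _, n, _ in per_speaker)
weighted_mean = sum(n * wer for _, n, wer in per_speaker) / total_utterances

# Speakers with the highest and lowest average WER.
worst = max(per_speaker, key=lambda row: row[2])
best = min(per_speaker, key=lambda row: row[2])

print(len(per_speaker), total_utterances, round(weighted_mean, 2))
print("highest:", worst[0], "lowest:", best[0])
```

The utterance counts sum to 2025 across 22 speakers, matching the evaluation set size, and the weighted mean rounds to 0.19, in line with the overall result. Per-speaker averages range from 0.08 (speaker 707) to 0.32 (speaker 579).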