Evaluation

This model was evaluated on the test split of cdli/kenyan_swahili_nonstandard_speech_v1.0, a non-standard speech dataset for Kenyan Swahili. Examples longer than 30 seconds were excluded. For decoding we ran Whisper with language=sw, task=transcribe, greedy search (num_beams=1, do_sample=False). Results are compared against the unadapted base model cdli/whisper-small-Swahili_finetuned_small_CV20, evaluated identically, to show the effect of fine-tuning on non-standard speech.
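The filtering and decoding settings described above can be sketched roughly as follows. This is a minimal illustration rather than the authors' evaluation script: the names `MAX_DURATION_S`, `GENERATE_KWARGS`, and `keep_example` are hypothetical, and the keyword arguments are written as they would be passed to a Whisper model's `generate` call in Hugging Face Transformers, which is an assumed toolkit.

```python
# Hypothetical sketch of the evaluation filter and decoding settings.
MAX_DURATION_S = 30.0  # examples longer than this are excluded

# Settings listed in the text, in the form one would pass to
# `model.generate(...)` for Whisper in Hugging Face Transformers
# (assumed toolkit; the source only names the settings).
GENERATE_KWARGS = {
    "language": "sw",
    "task": "transcribe",
    "num_beams": 1,      # greedy search
    "do_sample": False,  # deterministic decoding
}


def keep_example(num_samples: int, sampling_rate: int) -> bool:
    """Return True if the clip lasts at most 30 seconds."""
    return num_samples / sampling_rate <= MAX_DURATION_S
```

A clip of exactly 30 seconds is kept here, since the text excludes only examples longer than 30 seconds.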

  • Examples evaluated: 849
  • Speakers: 9

Metrics

We report two complementary word error rate (WER) metrics, both computed on text normalized with Whisper's BasicTextNormalizer:

  • Standard (corpus-level) WER — the usual error rate, pooling all reference words and edit errors across the entire test set.
  • Per-utterance averaged WER — WER computed separately for each utterance, each capped at 1.0, then averaged across utterances.

Because each utterance's WER is bounded to [0, 1] and all utterances are weighted equally, the per-utterance averaged WER reflects typical performance without a few catastrophic utterances dominating. It is not a true error rate, however, and is not directly comparable to WER figures published elsewhere, so we also report the standard corpus-level WER.
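The difference between the two metrics can be illustrated with a small sketch. It assumes non-empty references and uses a crude stand-in for Whisper's BasicTextNormalizer (lowercasing and replacing punctuation with spaces); all function names are hypothetical, and this is not the authors' scoring code.

```python
import re


def normalize(text: str) -> list[str]:
    """Crude stand-in for Whisper's BasicTextNormalizer (assumption):
    lowercase, replace punctuation with spaces, split on whitespace."""
    return re.sub(r"[^\w\s]", " ", text.lower()).split()


def word_errors(ref: list[str], hyp: list[str]) -> int:
    """Word-level Levenshtein distance between reference and hypothesis."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def corpus_wer(pairs: list[tuple[str, str]]) -> float:
    """Pool all errors and all reference words across the test set."""
    errors = sum(word_errors(normalize(r), normalize(h)) for r, h in pairs)
    words = sum(len(normalize(r)) for r, _ in pairs)
    return errors / words


def per_utterance_avg_wer(pairs: list[tuple[str, str]]) -> float:
    """Compute WER per utterance, cap each at 1.0, then average."""
    scores = []
    for r, h in pairs:
        ref = normalize(r)
        scores.append(min(1.0, word_errors(ref, normalize(h)) / len(ref)))
    return sum(scores) / len(scores)


pairs = [
    ("habari ya asubuhi", "habari ya asubuhi"),  # perfect transcript
    ("asante", "asante sana kwa kila kitu"),     # four inserted words
]
print(corpus_wer(pairs), per_utterance_avg_wer(pairs))
```

In this toy input the single bad utterance contributes four errors against only four reference words in total, so the corpus-level WER is 1.0, while the capped per-utterance average is 0.5.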

Overall Results

| Model | Standard WER | Per-utterance averaged WER |
| --- | --- | --- |
| Adapted | 0.44 | 0.37 |
| Unadapted | 0.57 | 0.56 |
| Relative improvement | 24% | 35% |
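Relative improvement is taken here to mean the fraction of the unadapted model's WER that adaptation removes. The sketch below makes that assumed definition explicit; the function name is hypothetical. Recomputing from the rounded WERs shown above gives roughly 0.23 and 0.34, so the reported 24% and 35% were presumably derived from unrounded values (an assumption, not something the source states).

```python
def relative_improvement(baseline_wer: float, adapted_wer: float) -> float:
    """Fraction of the baseline WER removed by adaptation (assumed definition)."""
    return (baseline_wer - adapted_wer) / baseline_wer


# Rounded values from the overall results; inputs are (unadapted, adapted).
standard = relative_improvement(0.57, 0.44)
per_utterance = relative_improvement(0.56, 0.37)
print(f"{standard:.3f} {per_utterance:.3f}")
```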

Detailed Analysis

For non-standard speech, aggregated results can hide important underlying patterns. Hence, we also report the WERs for different subsets: per severity group as well as per speaker.

Results by impairment severity

All WER values below are the per-utterance averaged WER, first averaged per speaker and then averaged within each severity group. n_speakers and n_utterances are the number of speakers and test utterances in each group.

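| severity | n_speakers | n_utterances | Avg WER (unadapted model) | Avg WER (adapted model) | Rel. improvement |
| --- | --- | --- | --- | --- | --- |
| mild | 3 | 278 | 0.42 | 0.28 | 33% |
| moderate | 3 | 235 | 0.57 | 0.35 | 38% |
| severe | 3 | 336 | 0.67 | 0.44 | 35% |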

Results by speaker

Per-utterance averaged WER per speaker. n_utterances is the number of test utterances for that speaker.

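| speaker_id | severity | etiology | n_utterances | Avg WER (unadapted model) | Avg WER (adapted model) | Rel. improvement |
| --- | --- | --- | --- | --- | --- | --- |
| KES013 | mild | Cerebral Palsy | 107 | 0.56 | 0.42 | 25% |
| KES021 | mild | Parkinson’s Disease | 68 | 0.38 | 0.24 | 39% |
| KES030 | mild | Neurodevelopmental disorder | 103 | 0.33 | 0.19 | 42% |
| KES012 | moderate | Neurodevelopmental disorder | 97 | 0.43 | 0.24 | 44% |
| KES028 | moderate | Cerebral Palsy | 44 | 0.75 | 0.42 | 44% |
| KES035 | moderate | Multiple Sclerosis (MS) | 94 | 0.52 | 0.40 | 23% |
| KES001 | severe | Cerebral Palsy | 140 | 0.78 | 0.53 | 32% |
| KES002 | severe | Cerebral Palsy | 82 | 0.64 | 0.41 | 36% |
| KES010 | severe | Neurodevelopmental disorder | 114 | 0.60 | 0.37 | 38% |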
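The severity-group figures reported earlier are unweighted means of these per-speaker values, so each speaker counts equally regardless of how many utterances they contributed. The sketch below reproduces that second averaging step from the per-speaker numbers above. The function and variable names are hypothetical, and because the inputs are already rounded, agreement with the published group values is only expected to two decimals.

```python
from collections import defaultdict
from statistics import mean

# (speaker_id, severity, unadapted WER, adapted WER), copied from the table above.
SPEAKERS = [
    ("KES013", "mild", 0.56, 0.42),
    ("KES021", "mild", 0.38, 0.24),
    ("KES030", "mild", 0.33, 0.19),
    ("KES012", "moderate", 0.43, 0.24),
    ("KES028", "moderate", 0.75, 0.42),
    ("KES035", "moderate", 0.52, 0.40),
    ("KES001", "severe", 0.78, 0.53),
    ("KES002", "severe", 0.64, 0.41),
    ("KES010", "severe", 0.60, 0.37),
]


def severity_means(rows):
    """Average per-speaker WERs within each severity group (speakers weighted equally)."""
    groups = defaultdict(list)
    for _speaker_id, severity, unadapted, adapted in rows:
        groups[severity].append((unadapted, adapted))
    return {
        severity: (
            round(mean(u for u, _ in values), 2),  # unadapted model
            round(mean(a for _, a in values), 2),  # adapted model
        )
        for severity, values in groups.items()
    }


result = severity_means(SPEAKERS)
for severity, (unadapted, adapted) in result.items():
    print(severity, unadapted, adapted)
```

Weighting speakers equally keeps a speaker with many utterances, such as KES001 with 140, from dominating the group average.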

Model provider

cdli

Modalities

Input

Audio

Output

Text
