katoernest/whisper-distant-voices

Model description

whisper-distant-voices is a fine-tuned version of openai/whisper-small for multilingual transcription of Swahili (sw), English (en), French (fr), and Arabic (ar). It is designed for community voice transcription in contexts such as elections, crisis response, and civic reporting across East Africa and the broader Global South.

Try it live: katoernest/distant-voices-transcription

Intended uses & limitations

Intended uses:

Transcribing audio from community reporters and field workers
Supporting multilingual voice input in civic tech applications
Automatic speech recognition across Swahili, English, French, and Arabic

Limitations:

Fine-tuned on only 500 samples per language, so accuracy on formal speech may be lower than that of the base whisper-small
May struggle with heavy accents, background noise, or code-switching between languages mid-sentence
Training was limited to 10 steps; a full run (2000 steps) is expected to improve performance substantially
Not suitable for medical, legal, or safety-critical transcription without further evaluation

Training and evaluation data

Fine-tuned on the google/fleurs dataset:

sw_ke: 500 samples (Swahili, Kenya)
en_us: 500 samples (English, US)
fr_fr: 500 samples (French, France)
ar_eg: 500 samples (Arabic, Egypt)

Total: ~2,000 samples (about 1,800 train and 200 eval after a 90/10 split), shuffled with seed 42.
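The sample counts above work out as follows. This is a minimal sketch in plain Python, not the preprocessing code used for this model; the helper name `split_sizes` is hypothetical, and the library that performed the real split may round the eval share differently.

```python
# FLEURS language configs and per-language sample counts, as listed above.
SAMPLES_PER_CONFIG = {"sw_ke": 500, "en_us": 500, "fr_fr": 500, "ar_eg": 500}


def split_sizes(total: int, eval_fraction: float = 0.1) -> tuple[int, int]:
    """Return (train, eval) sizes for a simple fractional split (hypothetical helper)."""
    n_eval = round(total * eval_fraction)
    return total - n_eval, n_eval


total = sum(SAMPLES_PER_CONFIG.values())
n_train, n_eval = split_sizes(total)
print(total, n_train, n_eval)
```

With four configs of 500 samples each, this gives 2,000 samples in total, 1,800 for training and 200 for evaluation.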

Training procedure

The model was fine-tuned with the Hugging Face Seq2SeqTrainer on Google Colab using a T4 GPU.

Data preparation

Audio resampled to 16 kHz and converted to log-mel spectrograms using WhisperProcessor
Transcriptions tokenized with a max length of 448 tokens
Dataset shuffled with seed 42 and split 90/10 train/eval
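The preparation steps above could look roughly like the sketch below. It assumes a `processor` object that behaves like `WhisperProcessor` (exposing `feature_extractor` and `tokenizer`) and audio that has already been resampled to 16 kHz. The function names and arguments are illustrative and not taken from the original training script. The shape arithmetic uses Whisper's standard front end for whisper-small: 30-second windows, 80 mel bins, and a hop of 160 samples.

```python
SAMPLING_RATE = 16_000   # Hz, the rate Whisper expects
WINDOW_SECONDS = 30      # Whisper pads or trims every clip to 30 s
HOP_LENGTH = 160         # samples between spectrogram frames
N_MELS = 80              # mel bins used by whisper-small
MAX_LABEL_TOKENS = 448   # label length cap stated above


def expected_feature_shape() -> tuple[int, int]:
    """Shape of one log-mel input: (mel bins, frames)."""
    n_frames = SAMPLING_RATE * WINDOW_SECONDS // HOP_LENGTH
    return N_MELS, n_frames


def prepare_example(waveform, sampling_rate, text, processor):
    """Turn one (audio, transcript) pair into model inputs.

    `processor` is assumed to be a WhisperProcessor-like object.
    """
    if sampling_rate != SAMPLING_RATE:
        raise ValueError(f"expected {SAMPLING_RATE} Hz audio, got {sampling_rate} Hz")
    # Log-mel spectrogram for a single clip
    features = processor.feature_extractor(
        waveform, sampling_rate=sampling_rate
    ).input_features[0]
    # Token ids for the transcript, truncated to the label cap
    labels = processor.tokenizer(
        text, truncation=True, max_length=MAX_LABEL_TOKENS
    ).input_ids
    return {"input_features": features, "labels": labels}


print(expected_feature_shape())
```

Each prepared example therefore carries an 80 x 3000 log-mel input and a label sequence of at most 448 token ids.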

Multilingual setup

No language token was forced during training; the model learns to predict the language automatically for each sample across all four languages (Swahili, English, French, Arabic).
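At inference time this setup means the language can either be left to the model or pinned by the caller. The sketch below assumes the checkpoint loads through the standard Transformers `pipeline` API for automatic speech recognition and that its generation config accepts a `language` option, as the multilingual base model does. The helper names `build_generate_kwargs` and `transcribe` are hypothetical.

```python
SUPPORTED_LANGUAGES = ("sw", "en", "fr", "ar")


def build_generate_kwargs(language=None):
    """Generation options: let the model detect the language unless one is given."""
    kwargs = {"task": "transcribe"}
    if language is not None:
        if language not in SUPPORTED_LANGUAGES:
            raise ValueError(f"unsupported language code: {language!r}")
        kwargs["language"] = language
    return kwargs


def transcribe(audio_path, language=None):
    """Transcribe one audio file (downloads the model on first use)."""
    from transformers import pipeline  # imported lazily; heavy dependency

    asr = pipeline(
        "automatic-speech-recognition",
        model="katoernest/whisper-distant-voices",
    )
    result = asr(audio_path, generate_kwargs=build_generate_kwargs(language))
    return result["text"]
```

For a hypothetical recording `report.wav`, `transcribe("report.wav")` leaves language detection to the model, while `transcribe("report.wav", language="sw")` forces Swahili. Forcing the language may help on short or noisy clips where detection is unreliable.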

Compute

Hardware: NVIDIA T4 GPU (Google Colab)
Training time: ~45 minutes (10-step prototype run)
Framework: Transformers 5.0.0, PyTorch 2.11.0+cu128

Intended next steps

Full training run of 2000 steps on expanded data
Evaluation on held-out community audio from East Africa
Integration into the live demo Space: katoernest/distant-voices-transcription

Training hyperparameters

The following hyperparameters were used during training:

learning_rate: 1e-05
train_batch_size: 8
eval_batch_size: 8
seed: 42
gradient_accumulation_steps: 2
total_train_batch_size: 16
optimizer: AdamW (fused) with betas=(0.9, 0.999) and epsilon=1e-08
lr_scheduler_type: linear
lr_scheduler_warmup_steps: 200
training_steps: 10
mixed_precision_training: Native AMP
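Taken together, these values imply how much data the prototype run saw and how far the learning rate had warmed up. The sketch below is plain arithmetic under two assumptions: a standard linear warmup starting from zero, and a training set of about 1,800 samples (90% of 2,000). `warmup_multiplier` is a hypothetical helper, not a library function.

```python
PER_DEVICE_BATCH = 8
GRAD_ACCUM_STEPS = 2
PEAK_LR = 1e-5
WARMUP_STEPS = 200
TRAINING_STEPS = 10
TRAIN_SAMPLES = 1_800  # assumed: 90% of 2,000 samples


def warmup_multiplier(step: int, warmup_steps: int = WARMUP_STEPS) -> float:
    """Fraction of the peak learning rate during linear warmup from zero."""
    return min(1.0, step / max(1, warmup_steps))


effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS   # one GPU assumed
samples_seen = effective_batch * TRAINING_STEPS
epoch_fraction = samples_seen / TRAIN_SAMPLES
max_lr_reached = PEAK_LR * warmup_multiplier(TRAINING_STEPS)

print(effective_batch, samples_seen, round(epoch_fraction, 3), max_lr_reached)
```

Under these assumptions the 10-step run covered 160 samples, under a tenth of one epoch, and the learning rate stayed at or below 5% of its configured peak (about 5e-07).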

Training results

Training Steps	Training Loss	Validation Loss
10	-	-

Note: This checkpoint comes from a prototype run (10 steps) used to validate the training pipeline end-to-end. No loss values were recorded before the run completed. Because the run ended well inside the 200-step warmup, the learning rate never reached the configured 1e-05. A full training run of 2000 steps is in progress, and results will be added here upon completion.

What to expect after full training:

Training loss should decrease steadily from ~3.0 toward ~0.5–1.0
Validation WER (Word Error Rate) target: below 30% across all four languages
Evaluation checkpoints saved every 500 steps
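Word Error Rate is the number of word-level substitutions, deletions, and insertions needed to turn the model output into the reference, divided by the number of reference words. The sketch below is a minimal, dependency-free version for illustration; it is not necessarily how evaluation for this model is or will be computed, and the example sentences are made up.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = word-level edit distance / number of reference words."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] holds the edit distance between ref[:i-1] and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1] / len(ref)


wer = word_error_rate("habari za asubuhi", "habari ya asubuhi")
print(f"{wer:.2%}")
```

One wrong word out of three gives a WER of about 33%, which would miss the stated target of below 30%. In practice, text is usually normalized (casing, punctuation) before scoring, and that choice can shift the number noticeably.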

Framework versions

Transformers 5.0.0
PyTorch 2.11.0+cu128
Datasets 4.0.0
Tokenizers 0.22.2
