License: apache-2.0
Model description
whisper-distant-voices is a fine-tuned version of openai/whisper-small for multilingual transcription of Swahili (sw), English (en), French (fr), and Arabic (ar). It is designed for community voice transcription in contexts such as elections, crisis response, and civic reporting across East Africa and the broader Global South.
Try it live: katoernest/distant-voices-transcription
Intended uses & limitations
Intended uses:
- Transcribing audio from community reporters and field workers
- Supporting multilingual voice input in civic tech applications
- Automatic speech recognition across Swahili, English, French, and Arabic
Limitations:
- Fine-tuned on only 500 samples per language, so accuracy is likely to be lower than the base whisper-small on formal speech
- May struggle with heavy accents, background noise, or code-switching between languages mid-sentence
- Training was limited to 10 steps; a full 2000-step run is expected to improve performance substantially
- Not suitable for medical, legal, or safety-critical transcription without further evaluation
Training and evaluation data
Fine-tuned on the google/fleurs dataset:
- sw_ke: 500 samples (Swahili, Kenya)
- en_us: 500 samples (English, US)
- fr_fr: 500 samples (French, France)
- ar_eg: 500 samples (Arabic, Egypt)
Total: ~2,000 samples, 90/10 train/eval split, shuffled with seed 42.
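The split sizes follow directly from these numbers. As a minimal sketch in plain Python (the helper name `shuffle_and_split` is hypothetical, and the shuffled order will not match what the Datasets library produces for the same seed), a seeded 90/10 split over 4 x 500 samples looks like this:

```python
import random


def shuffle_and_split(n_samples, eval_fraction=0.1, seed=42):
    """Shuffle sample indices with a fixed seed and split them into train/eval."""
    indices = list(range(n_samples))
    random.Random(seed).shuffle(indices)  # seeded, so the split is reproducible
    n_eval = round(n_samples * eval_fraction)
    return indices[n_eval:], indices[:n_eval]


# Four languages with 500 samples each, as listed above.
train_idx, eval_idx = shuffle_and_split(4 * 500)
print(len(train_idx), len(eval_idx))
```

With 2,000 samples this gives 1,800 training and 200 evaluation examples.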
Training procedure
The model was fine-tuned using the Hugging Face Seq2SeqTrainer on Google Colab with a T4 GPU.
Data preparation
- Audio resampled to 16 kHz and converted to log-mel spectrograms using WhisperProcessor
- Transcriptions tokenized with a max length of 448 tokens
- Dataset shuffled with seed 42 and split 90/10 train/eval
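The preparation steps above can be sketched roughly as follows. This is an illustration, not the author's script: the helper names `load_processor` and `prepare_example` are hypothetical, each row is assumed to expose FLEURS-style `audio` (with `array` and `sampling_rate`) and `transcription` fields, and resampling to 16 kHz is assumed to have happened before this step.

```python
TARGET_SAMPLE_RATE = 16_000  # Whisper expects 16 kHz audio
MAX_LABEL_TOKENS = 448       # label length limit stated in this card


def load_processor(base_model="openai/whisper-small"):
    """Load the base model's processor (needs `transformers` and network access)."""
    # Imported lazily so prepare_example() stays usable without the library.
    from transformers import WhisperProcessor

    return WhisperProcessor.from_pretrained(base_model)


def prepare_example(example, processor):
    """Turn one dataset row into log-mel input features and label token ids."""
    audio = example["audio"]
    if audio["sampling_rate"] != TARGET_SAMPLE_RATE:
        # Assumption: resampling is done upstream, so a mismatch is an error here.
        raise ValueError(f"expected {TARGET_SAMPLE_RATE} Hz audio")

    # The feature extractor converts the waveform to a log-mel spectrogram.
    features = processor.feature_extractor(
        audio["array"], sampling_rate=TARGET_SAMPLE_RATE
    ).input_features[0]

    # The tokenizer encodes the transcription, truncated to the label limit.
    labels = processor.tokenizer(
        example["transcription"], max_length=MAX_LABEL_TOKENS, truncation=True
    ).input_ids

    return {"input_features": features, "labels": labels}
```

In practice a function like this would typically be applied to every row with the dataset's `map` method, passing the loaded processor through.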
Multilingual setup
No language token was forced during training. The model learns to predict the language of each sample across all four languages (Swahili, English, French, Arabic).
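At inference time the same choice applies: the language can be left for the model to detect or supplied explicitly. The sketch below uses the Transformers `pipeline` API. The repository id in `MODEL_ID` is an assumption inferred from the model and provider names on this card, and `build_generate_kwargs` and `transcribe` are hypothetical helper names.

```python
# Assumed repository id; not confirmed by this card.
MODEL_ID = "katoernest/whisper-distant-voices"
SUPPORTED_LANGUAGES = ("sw", "en", "fr", "ar")


def build_generate_kwargs(language=None):
    """Build generation options; language=None leaves detection to the model."""
    if language is None:
        return {"task": "transcribe"}
    if language not in SUPPORTED_LANGUAGES:
        raise ValueError(f"unsupported language code: {language!r}")
    return {"task": "transcribe", "language": language}


def transcribe(audio_path, language=None, model_id=MODEL_ID):
    """Transcribe one audio file (needs `transformers` and network access)."""
    # Imported lazily so build_generate_kwargs() works without the library.
    from transformers import pipeline

    asr = pipeline("automatic-speech-recognition", model=model_id)
    result = asr(audio_path, generate_kwargs=build_generate_kwargs(language))
    return result["text"]
```

Calling `transcribe("report.wav")` would rely on automatic language detection, while `transcribe("report.wav", language="sw")` would request Swahili; the file name here is only a placeholder.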
Compute
- Hardware: NVIDIA T4 GPU (Google Colab)
- Training time: ~45 minutes (10-step prototype run)
- Framework: Transformers 5.0.0, PyTorch 2.11.0+cu128
Intended next steps
- Full training run of 2000 steps on expanded data
- Evaluation on held-out community audio from East Africa
- Integration into the live demo Space: katoernest/distant-voices-transcription
Training hyperparameters
The following hyperparameters were used during training:
- learning_rate: 1e-05
- train_batch_size: 8
- eval_batch_size: 8
- seed: 42
- gradient_accumulation_steps: 2
- total_train_batch_size: 16
- optimizer: AdamW (fused) with betas=(0.9, 0.999) and epsilon=1e-08
- lr_scheduler_type: linear
- lr_scheduler_warmup_steps: 200
- training_steps: 10
- mixed_precision_training: Native AMP
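A short worked calculation shows what these settings imply for the 10-step prototype. It assumes a single GPU, the 1,800-sample training split described earlier, and a linear warmup that starts from zero with steps counted from 0; the helper `linear_warmup_lr` is a hypothetical illustration that models only the warmup phase, not the later decay.

```python
def linear_warmup_lr(step, base_lr, warmup_steps):
    """Learning rate during linear warmup, with steps counted from 0."""
    return base_lr * min(1.0, step / max(1, warmup_steps))


per_device_batch = 8
grad_accum_steps = 2
training_steps = 10
warmup_steps = 200
base_lr = 1e-5
train_samples = 1800  # assumed: 90% of the 2,000 samples

effective_batch = per_device_batch * grad_accum_steps  # 16, matching total_train_batch_size
samples_seen = effective_batch * training_steps        # 160
epoch_fraction = samples_seen / train_samples          # under one tenth of an epoch

# Highest learning rate reached within the run: 9/200 of the base rate.
peak_lr = max(
    linear_warmup_lr(step, base_lr, warmup_steps) for step in range(training_steps)
)

print(effective_batch, samples_seen, round(epoch_fraction, 3), peak_lr)
```

Under these assumptions the run ends well inside the warmup phase, which suggests the prototype checkpoint stays very close to the base model's weights.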
Training results
| Training Steps | Training Loss | Validation Loss |
|---|---|---|
| 10 | - | - |
Note: This checkpoint is a prototype run (10 steps) used to validate the training pipeline end-to-end. Loss values were not recorded before the run completed. A full training run of 2000 steps is in progress and results will be updated here upon completion.
What to expect after full training:
- Training loss should decrease steadily from ~3.0 toward ~0.5–1.0
- Validation WER (Word Error Rate) target: below 30% across all four languages
- Evaluation checkpoints saved every 500 steps
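For reference, WER is the word-level edit distance between the reference and the model output, divided by the number of reference words. The function below is a minimal, dependency-free sketch of that definition. It splits on whitespace and applies no text normalization (casing, punctuation), which real evaluations usually add, and it is not necessarily the metric implementation used for this model.

```python
def word_error_rate(reference, hypothesis):
    """Word-level Levenshtein distance divided by the reference length."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the processed reference prefix
    # and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)
```

A target of below 30% corresponds to a return value under 0.30 averaged over the evaluation set.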
Framework versions
- Transformers 5.0.0
- PyTorch 2.11.0+cu128
- Datasets 4.0.0
- Tokenizers 0.22.2
Model provider
katoernest
Model tree
Base
openai/whisper-small
Fine-tuned
this model
Modalities
Input
Audio
Output
Text