License: apache-2.0

Model description

whisper-distant-voices is a fine-tuned version of openai/whisper-small for multilingual transcription of Swahili (sw), English (en), French (fr), and Arabic (ar). It is designed for community voice transcription in contexts such as elections, crisis response, and civic reporting across East Africa and the broader Global South.

Try it live: katoernest/distant-voices-transcription

Intended uses & limitations

Intended uses:

  • Transcribing audio from community reporters and field workers
  • Supporting multilingual voice input in civic tech applications
  • Automatic speech recognition across Swahili, English, French, and Arabic

Limitations:

  • Fine-tuned on only 500 samples per language, so accuracy may be lower than the base whisper-small on formal speech
  • May struggle with heavy accents, background noise, or code-switching between languages mid-sentence
  • This checkpoint was trained for only 10 steps; a full run (2,000 steps) is expected to improve performance significantly
  • Not suitable for medical, legal, or safety-critical transcription without further evaluation

Training and evaluation data

Fine-tuned on the google/fleurs dataset:

  • sw_ke: 500 samples (Swahili, Kenya)
  • en_us: 500 samples (English, US)
  • fr_fr: 500 samples (French, France)
  • ar_eg: 500 samples (Arabic, Egypt)

Total: 2,000 samples (500 per language), shuffled with seed 42 and split 90/10 into 1,800 training and 200 evaluation samples.
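The split arithmetic above can be checked with a small standard-library sketch. The (config, index) pairs are stand-ins for real FLEURS rows, and `random.Random(42)` only illustrates a seeded shuffle; it is not expected to reproduce the exact permutation produced by the datasets library.

```python
import random

SAMPLES_PER_LANGUAGE = 500
CONFIGS = ["sw_ke", "en_us", "fr_fr", "ar_eg"]

# Placeholder rows: (config, index) pairs stand in for real FLEURS samples.
pool = [(cfg, i) for cfg in CONFIGS for i in range(SAMPLES_PER_LANGUAGE)]

# Seeded shuffle so the split is repeatable (illustrative, not the library's shuffle).
rng = random.Random(42)
rng.shuffle(pool)

# 90/10 split: the first tenth becomes the evaluation set.
n_eval = len(pool) // 10
eval_set, train_set = pool[:n_eval], pool[n_eval:]

print(len(pool), len(train_set), len(eval_set))
```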

Training procedure

The model was fine-tuned with the Hugging Face Seq2SeqTrainer on Google Colab using a T4 GPU.

Data preparation

  • Audio resampled to 16 kHz and converted to log-mel spectrograms using WhisperProcessor
  • Transcriptions tokenized with a max length of 448 tokens
  • Dataset shuffled with seed 42 and split 90/10 train/eval
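A minimal sketch of the per-sample preparation described above, not the exact training script. It assumes each row exposes `audio` (with `array` and `sampling_rate`) and `transcription` fields, as FLEURS rows commonly do, and that `processor` behaves like a WhisperProcessor with `feature_extractor` and `tokenizer` attributes. The function name `prepare_example` is hypothetical.

```python
TARGET_SAMPLE_RATE = 16_000  # Whisper expects 16 kHz audio
MAX_LABEL_TOKENS = 448       # label length limit stated in this card


def prepare_example(example, processor):
    """Turn one dataset row into log-mel input features and label token ids."""
    audio = example["audio"]
    if audio["sampling_rate"] != TARGET_SAMPLE_RATE:
        # Resampling is assumed to happen upstream, before this function runs.
        raise ValueError(f"expected {TARGET_SAMPLE_RATE} Hz audio")

    # Log-mel spectrogram for this clip (first and only item of the batch).
    features = processor.feature_extractor(
        audio["array"], sampling_rate=TARGET_SAMPLE_RATE
    ).input_features[0]

    # Token ids for the transcription, truncated to the label limit.
    labels = processor.tokenizer(
        example["transcription"], truncation=True, max_length=MAX_LABEL_TOKENS
    ).input_ids

    return {"input_features": features, "labels": labels}
```

With the real libraries, the processor would typically come from `WhisperProcessor.from_pretrained("openai/whisper-small")`, and resampling can be requested with `dataset.cast_column("audio", Audio(sampling_rate=16_000))` before mapping this function over the dataset.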

Multilingual setup

No language token was forced during training. The model is expected to predict the language of each sample on its own across all four languages (Swahili, English, French, Arabic).
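At inference time the same choice applies: the language can be left for the model to detect, or pinned explicitly. The sketch below uses the Transformers `pipeline` API and assumes the checkpoint is published as `katoernest/whisper-distant-voices`, a repository id inferred from the provider and model names rather than confirmed by this card. The helper names are hypothetical.

```python
from typing import Optional

# Hypothetical repository id, assumed from the provider and model names.
MODEL_ID = "katoernest/whisper-distant-voices"


def build_generate_kwargs(language: Optional[str] = None) -> dict:
    """Return generation options; an empty dict leaves language detection to the model."""
    if language is None:
        return {}
    return {"language": language, "task": "transcribe"}


def transcribe(audio_path: str, language: Optional[str] = None) -> str:
    """Transcribe one audio file, optionally pinning the spoken language."""
    # Imported lazily so the helper above works without Transformers installed.
    from transformers import pipeline

    asr = pipeline("automatic-speech-recognition", model=MODEL_ID)
    result = asr(audio_path, generate_kwargs=build_generate_kwargs(language))
    return result["text"]
```

Calling `transcribe("report.wav")` would rely on automatic language detection, while `transcribe("report.wav", language="sw")` pins Swahili; the file name is a placeholder.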

Compute

  • Hardware: NVIDIA T4 GPU (Google Colab)
  • Training time: ~45 minutes (10-step prototype run)
  • Framework: Transformers 5.0.0, PyTorch 2.11.0+cu128

Training hyperparameters

The following hyperparameters were used during training:

  • learning_rate: 1e-05
  • train_batch_size: 8
  • eval_batch_size: 8
  • seed: 42
  • gradient_accumulation_steps: 2
  • total_train_batch_size: 16
  • optimizer: AdamW (fused) with betas=(0.9, 0.999) and epsilon=1e-08
  • lr_scheduler_type: linear
  • lr_scheduler_warmup_steps: 200
  • training_steps: 10
  • mixed_precision_training: Native AMP
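The interaction between these values is easier to see with a short calculation. The sketch below assumes the scheduler follows the usual linear warmup rule (learning rate rising proportionally from 0 to the peak over the warmup steps, then decaying linearly); the function `linear_warmup_lr` is an illustration written for this card, not a library call.

```python
LEARNING_RATE = 1e-5
PER_DEVICE_BATCH = 8
GRAD_ACCUM_STEPS = 2
WARMUP_STEPS = 200
TRAINING_STEPS = 10
TRAIN_SAMPLES = 1_800  # 90% of the 2,000 samples


def linear_warmup_lr(step, peak_lr, warmup_steps, total_steps):
    """Learning rate at a given optimizer step under linear warmup, then linear decay."""
    if step < warmup_steps:
        return peak_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return peak_lr * remaining / max(1, total_steps - warmup_steps)


# Samples consumed per optimizer step and over the whole prototype run.
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS
samples_seen = effective_batch * TRAINING_STEPS
fraction_of_train = samples_seen / TRAIN_SAMPLES

# Upper bound on the learning rate reached within the 10 steps.
final_lr = linear_warmup_lr(TRAINING_STEPS, LEARNING_RATE, WARMUP_STEPS, TRAINING_STEPS)

print(effective_batch, samples_seen, round(fraction_of_train, 3), final_lr)
```

Under these assumptions the prototype run sees 160 samples, less than a tenth of the training split, and because the 200 warmup steps exceed the 10 training steps, the learning rate stays at or below 5% of its configured peak.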

Training results

Training Steps | Training Loss | Validation Loss
10             | -             | -

Note: This checkpoint is a prototype run (10 steps) used to validate the training pipeline end-to-end. Loss values were not recorded before the run completed. A full training run of 2000 steps is in progress and results will be updated here upon completion.

What to expect after full training:

  • Training loss should decrease steadily from ~3.0 toward ~0.5–1.0
  • Validation WER (Word Error Rate) target: below 30% across all four languages
  • Evaluation checkpoints saved every 500 steps
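For reference, WER is the word-level edit distance (substitutions, deletions, and insertions) divided by the number of reference words. The following is a minimal sketch that splits on whitespace and applies no text normalization, which a real evaluation would likely need, particularly for Arabic; the function name is hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # previous[j] holds the distance between the reference prefix and hyp[:j].
    previous = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        current = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion of a reference word
                    current[j - 1] + 1,      # insertion of a hypothesis word
                    previous[j - 1] + cost,  # match or substitution
                )
            )
        previous = current
    return previous[-1] / len(ref)


# One substitution and one deletion against six reference words.
score = word_error_rate("the cat sat on the mat", "the cat sit on mat")
print(f"WER = {score:.3f}, meets 30% target: {score < 0.30}")
```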

Framework versions

  • Transformers 5.0.0
  • PyTorch 2.11.0+cu128
  • Datasets 4.0.0
  • Tokenizers 0.22.2

Model provider: katoernest

Base model: openai/whisper-small

Modalities: audio input, text output
