Headline result
Table with columns: Model, WER ↓| Model | WER ↓ |
|---|
openai/whisper-medium (base) | 52.4 % |
| This model | 17.9 % |
| Absolute reduction | −34.5 points |
Measured on a held-out, deterministic 1 % test slice of asr_codeswitched_dataset (test_size=0.01, seed=42) using jiwer WER under light normalization (lowercase, diacritic + punctuation strip — no MSA folding so code-switching is preserved as a real evaluation signal). Reproducible from eval/evaluation.ipynb in the project repo.
Why fine-tune
Base Whisper has two failure modes on Arab-world technical speech:
- Drops or transliterates English technical terms — "embedding" becomes "إمبدنغ" or vanishes entirely.
- Pushes dialect toward Modern Standard Arabic — Egyptian "بيتكلم" becomes MSA "يتحدث".
For any downstream retrieval / RAG system the original surface form matters: the user's query and the indexed transcript have to share a vocabulary. This model is trained to keep code-switching verbatim — dialect stays dialect, English terms stay in Latin script.
Intended use
- Transcribing code-switched Arabic/English lectures, talks, technical meetings.
- Upstream component of an ASR → RAG pipeline. See Muhadara RAG for the full system this model was built for.
Not intended for: high-stakes (legal, medical) transcription without human review; non-Egyptian Arabic dialects (Gulf, Levantine, Maghrebi — untested).
Training data
Training procedure
Hyperparameters
Table | |
|---|
| Base model | openai/whisper-medium (769M params) |
| Steps | 6 000 |
| Batch size | 2 × grad-accum 8 = effective 16 |
| Learning rate | 1e-5, linear schedule, 500 warmup steps |
| Optimizer | AdamW (fused), β = (0.9, 0.999), ε = 1e-8 |
| Precision | fp16 (mixed) |
| Selection metric | WER on eval split |
load_best_model_at_end |
Training curve
Table with columns: Step, Eval loss, Eval WER| Step | Eval loss | Eval WER |
|---|
| 500 | 0.4333 | 33.92 |
| 1000 | 0.3502 | 28.53 |
| 1500 | 0.3207 | 26.72 |
| 2000 | 0.2965 | 25.30 |
| 2500 | 0.2714 | 23.28 |
| 3000 | 0.2608 | 23.32 |
Why the headline (17.9%) differs from the curve (best 21.4%)
Both are honest numbers, measured on different splits:
- 21.4 % is the trainer-time validation WER at the best step (4000). This is what
load_best_model_at_end selected against.
- 17.9 % is the held-out 1 % test slice WER measured after training with the eval notebook. The split, normalization conventions, and beam configuration in the eval notebook (
beam_size=1, vad_filter=False, same text normalization used downstream by the RAG system) better reflect deployed-system behavior.
The eval notebook is the source of the result reported here; the trainer curve is kept for full transparency.
Usage
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import librosa
processor = WhisperProcessor.from_pretrained("Seif-Eldeen-Sameh/whisper-medium-arabic-codeswitched")
model = WhisperForConditionalGeneration.from_pretrained("Seif-Eldeen-Sameh/whisper-medium-arabic-codeswitched")
audio, sr = librosa.load("clip.wav", sr=16000, mono=True)
features = processor(audio, sampling_rate=16000, return_tensors="pt").input_features
ids = model.generate(features, language="ar", task="transcribe")
print(processor.batch_decode(ids, skip_special_tokens=True)[0])
faster-whisper (CT2 INT8 build — recommended for deployment)
For ~4× smaller weights and ~33× speedup on GPU, use the INT8 build:
from faster_whisper import WhisperModel
model = WhisperModel(
"Seif-Eldeen-Sameh/whisper-medium-arabic-codeswitched-ct2",
device="cuda", compute_type="int8_float16",
)
segments, _ = model.transcribe("clip.wav", language="ar", beam_size=1, vad_filter=True)
print(" ".join(s.text for s in segments))
Table with columns: Device, Wall time, Real-time factor| Device | Wall time | Real-time factor |
|---|
| 2 vCPU | 82.94 s | 2.03× |
| T4 GPU | 2.49 s | 0.061× |
~33× CPU → GPU speedup — the architecture justification for offloading ASR to a serverless GPU in the Muhadara RAG deployment while keeping the frontend on free CPU.
Limitations
- Egyptian dialect dominant — other Arabic dialects (Gulf, Levantine, Maghrebi) untested and likely to degrade.
- Far-field / very noisy recordings remain hard; a DeepFilterNet denoising preprocessor was built and rejected (it suppressed the distant main speaker — see the project's design notes).
- Whisper hallucination on long silences is mitigated downstream via VAD + a 3-gram repetition guard, not in this model itself.
- Utterance-length training data (avg 2.7 s/clip) — very long unsegmented audio relies on Whisper's native chunking.
Citation
@misc{muhadara_whisper_2026,
author = {Seif Eldeen Sameh},
title = {Whisper-medium Arabic/English Code-Switched ASR},
year = {2026},
publisher = {Hugging Face},
howpublished = {\url{https://huggingface.co/Seif-Eldeen-Sameh/whisper-medium-arabic-codeswitched}}
}
Framework versions
- Transformers 5.5.4
- PyTorch 2.10.0+cu128
- Datasets 4.8.4
- Tokenizers 0.22.2