Seif-Eldeen-Sameh

whisper-medium-arabic-codeswitched

Deploy Dedicated

Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more

Container

Run this model inference with full control and performance in your environment.

Learn more

Get help setting up a custom Dedicated Endpoints.

Talk with our engineer to get a quote for reserved GPU instances with discounts.

Headline result

Table with columns: Model, WER ↓
Model	WER ↓
`openai/whisper-medium` (base)	52.4 %
This model	17.9 %
Absolute reduction	−34.5 points

Measured on a held-out, deterministic 1 % test slice of asr_codeswitched_dataset (test_size=0.01, seed=42) using jiwer WER under light normalization (lowercase, diacritic + punctuation strip — no MSA folding so code-switching is preserved as a real evaluation signal). Reproducible from eval/evaluation.ipynb in the project repo.

Why fine-tune

Base Whisper has two failure modes on Arab-world technical speech:

Drops or transliterates English technical terms — "embedding" becomes "إمبدنغ" or vanishes entirely.
Pushes dialect toward Modern Standard Arabic — Egyptian "بيتكلم" becomes MSA "يتحدث".

For any downstream retrieval / RAG system the original surface form matters: the user's query and the indexed transcript have to share a vocabulary. This model is trained to keep code-switching verbatim — dialect stays dialect, English terms stay in Latin script.

Intended use

Transcribing code-switched Arabic/English lectures, talks, technical meetings.
Upstream component of an ASR → RAG pipeline. See Muhadara RAG for the full system this model was built for.

Not intended for: high-stakes (legal, medical) transcription without human review; non-Egyptian Arabic dialects (Gulf, Levantine, Maghrebi — untested).

Training data

Table

Dataset	`Seif-Eldeen-Sameh/asr_codeswitched_dataset`
Total duration	34.1 hours
Examples	45,189 clips
Avg / median clip	2.7 s / 2.2 s (utterance-level)
Sources	EJUST custom recordings (~32 K clips, transcripts re-processed from all-Arabic back to true code-switching) + `MohamedRashad/arabic-english-code-switching` (~12 K clips)

Training procedure

Hyperparameters

Table

Base model	`openai/whisper-medium` (769M params)
Steps	6 000
Batch size	2 × grad-accum 8 = effective 16
Learning rate	1e-5, linear schedule, 500 warmup steps
Optimizer	AdamW (fused), β = (0.9, 0.999), ε = 1e-8
Precision	fp16 (mixed)
Selection metric	WER on eval split
`load_best_model_at_end`

Training curve

Table with columns: Step, Eval loss, Eval WER
Step	Eval loss	Eval WER
500	0.4333	33.92
1000	0.3502	28.53
1500	0.3207	26.72
2000	0.2965	25.30
2500	0.2714	23.28
3000	0.2608	23.32

Why the headline (17.9%) differs from the curve (best 21.4%)

Both are honest numbers, measured on different splits:

21.4 % is the trainer-time validation WER at the best step (4000). This is what load_best_model_at_end selected against.
17.9 % is the held-out 1 % test slice WER measured after training with the eval notebook. The split, normalization conventions, and beam configuration in the eval notebook (beam_size=1, vad_filter=False, same text normalization used downstream by the RAG system) better reflect deployed-system behavior.

The eval notebook is the source of the result reported here; the trainer curve is kept for full transparency.

Usage

Transformers

python
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import librosa

processor = WhisperProcessor.from_pretrained("Seif-Eldeen-Sameh/whisper-medium-arabic-codeswitched")
model     = WhisperForConditionalGeneration.from_pretrained("Seif-Eldeen-Sameh/whisper-medium-arabic-codeswitched")

audio, sr = librosa.load("clip.wav", sr=16000, mono=True)
features  = processor(audio, sampling_rate=16000, return_tensors="pt").input_features
ids       = model.generate(features, language="ar", task="transcribe")
print(processor.batch_decode(ids, skip_special_tokens=True)[0])

faster-whisper (CT2 INT8 build — recommended for deployment)

For ~4× smaller weights and ~33× speedup on GPU, use the INT8 build:

python
from faster_whisper import WhisperModel

# GPU
model = WhisperModel(
    "Seif-Eldeen-Sameh/whisper-medium-arabic-codeswitched-ct2",
    device="cuda", compute_type="int8_float16",
)
# or CPU
# model = WhisperModel(..., device="cpu", compute_type="int8")

segments, _ = model.transcribe("clip.wav", language="ar", beam_size=1, vad_filter=True)
print(" ".join(s.text for s in segments))

Latency (CT2 INT8 build · 40.8 s test clip · median of 3 runs)

Table with columns: Device, Wall time, Real-time factor
Device	Wall time	Real-time factor
2 vCPU	82.94 s	2.03×
T4 GPU	2.49 s	0.061×

~33× CPU → GPU speedup — the architecture justification for offloading ASR to a serverless GPU in the Muhadara RAG deployment while keeping the frontend on free CPU.

Limitations

Egyptian dialect dominant — other Arabic dialects (Gulf, Levantine, Maghrebi) untested and likely to degrade.
Far-field / very noisy recordings remain hard; a DeepFilterNet denoising preprocessor was built and rejected (it suppressed the distant main speaker — see the project's design notes).
Whisper hallucination on long silences is mitigated downstream via VAD + a 3-gram repetition guard, not in this model itself.
Utterance-length training data (avg 2.7 s/clip) — very long unsegmented audio relies on Whisper's native chunking.

Citation

bibtex
@misc{muhadara_whisper_2026,
  author    = {Seif Eldeen Sameh},
  title     = {Whisper-medium Arabic/English Code-Switched ASR},
  year      = {2026},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Seif-Eldeen-Sameh/whisper-medium-arabic-codeswitched}}
}

Framework versions

Transformers 5.5.4
PyTorch 2.10.0+cu128
Datasets 4.8.4
Tokenizers 0.22.2

Model provider

Seif-Eldeen-Sameh

Model tree

Base

openai/whisper-medium

Fine-tuned

this model

Modalities

Input

Audio

Output

Text

Pricing

Dedicated Endpoints

View details

Supported Functionality

Model APIs

Dedicated Endpoints

Container

More information

Model card

Explore FriendliAI today

Get started Talk to an engineer

Headline result

Table with columns: Model, WER ↓
Model	WER ↓
`openai/whisper-medium` (base)	52.4 %
This model	17.9 %
Absolute reduction	−34.5 points

Why fine-tune

Base Whisper has two failure modes on Arab-world technical speech:

Drops or transliterates English technical terms — "embedding" becomes "إمبدنغ" or vanishes entirely.
Pushes dialect toward Modern Standard Arabic — Egyptian "بيتكلم" becomes MSA "يتحدث".

Intended use

Transcribing code-switched Arabic/English lectures, talks, technical meetings.
Upstream component of an ASR → RAG pipeline. See Muhadara RAG for the full system this model was built for.

Not intended for: high-stakes (legal, medical) transcription without human review; non-Egyptian Arabic dialects (Gulf, Levantine, Maghrebi — untested).

Training data

Table

Dataset	`Seif-Eldeen-Sameh/asr_codeswitched_dataset`
Total duration	34.1 hours
Examples	45,189 clips
Avg / median clip	2.7 s / 2.2 s (utterance-level)
Sources	EJUST custom recordings (~32 K clips, transcripts re-processed from all-Arabic back to true code-switching) + `MohamedRashad/arabic-english-code-switching` (~12 K clips)

Training procedure

Hyperparameters

Table

Base model	`openai/whisper-medium` (769M params)
Steps	6 000
Batch size	2 × grad-accum 8 = effective 16
Learning rate	1e-5, linear schedule, 500 warmup steps
Optimizer	AdamW (fused), β = (0.9, 0.999), ε = 1e-8
Precision	fp16 (mixed)
Selection metric	WER on eval split
`load_best_model_at_end`

Training curve

Table with columns: Step, Eval loss, Eval WER
Step	Eval loss	Eval WER
500	0.4333	33.92
1000	0.3502	28.53
1500	0.3207	26.72
2000	0.2965	25.30
2500	0.2714	23.28
3000	0.2608	23.32

Why the headline (17.9%) differs from the curve (best 21.4%)

Both are honest numbers, measured on different splits:

21.4 % is the trainer-time validation WER at the best step (4000). This is what load_best_model_at_end selected against.
17.9 % is the held-out 1 % test slice WER measured after training with the eval notebook. The split, normalization conventions, and beam configuration in the eval notebook (beam_size=1, vad_filter=False, same text normalization used downstream by the RAG system) better reflect deployed-system behavior.

The eval notebook is the source of the result reported here; the trainer curve is kept for full transparency.

Usage

Transformers

python
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import librosa

processor = WhisperProcessor.from_pretrained("Seif-Eldeen-Sameh/whisper-medium-arabic-codeswitched")
model     = WhisperForConditionalGeneration.from_pretrained("Seif-Eldeen-Sameh/whisper-medium-arabic-codeswitched")

audio, sr = librosa.load("clip.wav", sr=16000, mono=True)
features  = processor(audio, sampling_rate=16000, return_tensors="pt").input_features
ids       = model.generate(features, language="ar", task="transcribe")
print(processor.batch_decode(ids, skip_special_tokens=True)[0])

faster-whisper (CT2 INT8 build — recommended for deployment)

For ~4× smaller weights and ~33× speedup on GPU, use the INT8 build:

python
from faster_whisper import WhisperModel

# GPU
model = WhisperModel(
    "Seif-Eldeen-Sameh/whisper-medium-arabic-codeswitched-ct2",
    device="cuda", compute_type="int8_float16",
)
# or CPU
# model = WhisperModel(..., device="cpu", compute_type="int8")

segments, _ = model.transcribe("clip.wav", language="ar", beam_size=1, vad_filter=True)
print(" ".join(s.text for s in segments))

Latency (CT2 INT8 build · 40.8 s test clip · median of 3 runs)

Table with columns: Device, Wall time, Real-time factor
Device	Wall time	Real-time factor
2 vCPU	82.94 s	2.03×
T4 GPU	2.49 s	0.061×

~33× CPU → GPU speedup — the architecture justification for offloading ASR to a serverless GPU in the Muhadara RAG deployment while keeping the frontend on free CPU.

Limitations

Egyptian dialect dominant — other Arabic dialects (Gulf, Levantine, Maghrebi) untested and likely to degrade.
Far-field / very noisy recordings remain hard; a DeepFilterNet denoising preprocessor was built and rejected (it suppressed the distant main speaker — see the project's design notes).
Whisper hallucination on long silences is mitigated downstream via VAD + a 3-gram repetition guard, not in this model itself.
Utterance-length training data (avg 2.7 s/clip) — very long unsegmented audio relies on Whisper's native chunking.

Citation

bibtex
@misc{muhadara_whisper_2026,
  author    = {Seif Eldeen Sameh},
  title     = {Whisper-medium Arabic/English Code-Switched ASR},
  year      = {2026},
  publisher = {Hugging Face},
  howpublished = {\url{https://huggingface.co/Seif-Eldeen-Sameh/whisper-medium-arabic-codeswitched}}
}

Framework versions

Transformers 5.5.4
PyTorch 2.10.0+cu128
Datasets 4.8.4
Tokenizers 0.22.2

whisper-medium-arabic-codeswitched

Get help setting up a custom Dedicated Endpoints.

README

Headline result

Why fine-tune

Intended use

Training data

Training procedure

Hyperparameters

Training curve

Why the headline (17.9%) differs from the curve (best 21.4%)

Usage

Transformers

faster-whisper (CT2 INT8 build — recommended for deployment)

Latency (CT2 INT8 build · 40.8 s test clip · median of 3 runs)

Limitations

Citation

Framework versions

Explore FriendliAI today

README

Headline result

Why fine-tune

Intended use

Training data

Training procedure

Hyperparameters

Training curve

Why the headline (17.9%) differs from the curve (best 21.4%)

Usage

Transformers

faster-whisper (CT2 INT8 build — recommended for deployment)

Latency (CT2 INT8 build · 40.8 s test clip · median of 3 runs)

Limitations

Citation

Framework versions