Model Details
Model Description
This model is a fine-tuned version of openai/whisper-small for Automatic Speech Recognition (ASR) in the Balti language (bft).
Balti is a Tibetic language with roughly 400,000 speakers, written
in Nastaliq (Arabic-based) script. Before this work, no publicly
available ASR models or datasets existed for Balti. This model
transcribes Balti speech into native Nastaliq text.
- Developed by: Muhammad Ali,
Independent Researcher, Gilgit-Baltistan, Pakistan.
Alumnus, The Islamia University of Bahawalpur (IUB).
- Model type: Sequence-to-sequence ASR (Whisper architecture)
- Language: Balti (bft)
- License: Apache 2.0
- Base model: openai/whisper-small
Model Sources
Results
| Model | WER (%) | CER (%) |
|---|---|---|
| Whisper-small (zero-shot) | 159.19 | 152.52 |
| Whisper-base (fine-tuned) | 44.54 | 15.61 |
| Whisper-small (fine-tuned, this model) | 26.74 | 8.67 |
A zero-shot WER above 100% means the output contains more errors than
the reference has words, which typically happens when the model
hallucinates and inserts words that are not in the reference.
Fine-tuning on 16.8 hours of Balti speech reduces WER to 26.74% and
CER to 8.67% on the 538-utterance speaker-disjoint validation set.
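To make the "above 100%" case concrete, here is a minimal pure-Python sketch of WER and CER based on Levenshtein distance. It is illustrative only and not the evaluation code used for this model; the function names and the placeholder English tokens are assumptions made for the example.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences.

    Substitutions, deletions, and insertions each cost 1.
    """
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate: word-level edits divided by reference word count."""
    ref_words = reference.split()
    if not ref_words:
        raise ValueError("reference must contain at least one word")
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate: character-level edits divided by reference length."""
    if not reference:
        raise ValueError("reference must not be empty")
    return edit_distance(list(reference), list(hypothesis)) / len(reference)


# Two reference words, five hypothesis words: three insertions give 3 / 2.
print(wer("a b", "a b c d e"))  # 1.5, i.e. 150% WER
```

Because the denominator is the length of the reference, a hypothesis that is much longer than the reference can push the rate past 100%.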
How to Get Started
Installation
pip install transformers torch librosa
Inference
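from transformers import pipeline

# The model was fine-tuned with the Urdu language token
# (see Training Hyperparameters), so the same token is used at inference.
asr = pipeline(
    "automatic-speech-recognition",
    model="mohdali1/whisper-small-balti",
    generate_kwargs={"language": "urdu", "task": "transcribe"},
)
result = asr("your_balti_audio.wav")
print(result["text"])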
Manual inference
from transformers import WhisperForConditionalGeneration, WhisperProcessor
import torch
import librosa

model_id = "mohdali1/whisper-small-balti"
processor = WhisperProcessor.from_pretrained(
    model_id, language="urdu", task="transcribe"
)
model = WhisperForConditionalGeneration.from_pretrained(model_id)

# Whisper expects 16 kHz mono audio; librosa resamples on load.
audio, sr = librosa.load("your_balti_audio.wav", sr=16000)
inputs = processor(audio, sampling_rate=sr, return_tensors="pt")

# Pass the same language and task as in the pipeline example.
with torch.no_grad():
    generated_ids = model.generate(
        inputs.input_features, language="urdu", task="transcribe"
    )
transcription = processor.batch_decode(
    generated_ids, skip_special_tokens=True
)[0]
print(transcription)
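Whisper processes audio in 30-second windows, and this model was trained on short clips, so longer recordings need to be split before transcription. The helper below is a minimal sketch of fixed-window splitting; `window_bounds` is a hypothetical name introduced for this example, and hard cuts like these can split a word at a boundary. The `transformers` ASR pipeline also offers a `chunk_length_s` argument that handles chunking internally.

```python
def window_bounds(num_samples, sample_rate=16000, window_s=30.0):
    """Return (start, end) sample indices that cover the audio in fixed windows.

    The last window is shorter when the audio length is not a multiple
    of the window size.
    """
    step = int(window_s * sample_rate)
    if step <= 0:
        raise ValueError("window_s * sample_rate must be positive")
    return [(start, min(start + step, num_samples))
            for start in range(0, num_samples, step)]


# A 65-second recording at 16 kHz yields two full windows and one 5-second tail.
bounds = window_bounds(65 * 16000)
print(bounds)

# Hypothetical usage with the `asr` pipeline and `audio` array from the
# examples above (not executed here):
# texts = [asr({"raw": audio[s:e], "sampling_rate": 16000})["text"]
#          for s, e in window_bounds(len(audio))]
```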
Uses
Direct Use
- Transcription: Convert Balti audio into native Nastaliq text
- Research: Study low-resource ASR and transfer learning for
Tibetic languages
- Education: Build tools for Balti literacy and pronunciation
Downstream Use
- Voice assistants for Balti speakers
- Media archiving of radio broadcasts, folk stories, oral histories
- Healthcare documentation in rural Gilgit-Baltistan settings
Out-of-Scope Use
- High-stakes decisions (legal, medical, safety-critical) without
human verification — WER is ~27%, not production-ready
- Other languages — performance on non-Balti input is not
guaranteed
- Commercial deployment without further domain-specific evaluation
Training Details
Training Data
- Dataset: BaltiVoice ASR Dataset
- Total clips: 10,060 validated utterances (~16.8 hours)
- Format: 16kHz mono WAV, native Nastaliq transcriptions
- Split method: Speaker-disjoint (GroupShuffleSplit on client_id, seed 42)
| Split | Samples | Speakers |
|---|---|---|
| Train | 9,519 | 122 |
| Validation | 538 | 14 |
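The idea behind a speaker-disjoint split can be sketched in pure Python as follows. This is a stand-in for illustration, not the scikit-learn `GroupShuffleSplit` call used to build the dataset: the function name and the `val_fraction` value are assumptions, and the shuffle will not reproduce the exact speaker assignment of the released split.

```python
import random
from collections import defaultdict


def speaker_disjoint_split(client_ids, val_fraction=0.1, seed=42):
    """Split utterance indices so that no speaker appears in both sets.

    client_ids: one speaker ID per utterance, in dataset order.
    Returns (train_indices, val_indices).
    """
    by_speaker = defaultdict(list)
    for idx, cid in enumerate(client_ids):
        by_speaker[cid].append(idx)

    # Shuffle speakers, not utterances, so whole speakers move together.
    speakers = sorted(by_speaker)
    random.Random(seed).shuffle(speakers)

    n_val = max(1, round(len(speakers) * val_fraction))
    val_speakers = speakers[:n_val]
    train_speakers = speakers[n_val:]

    train_idx = sorted(i for s in train_speakers for i in by_speaker[s])
    val_idx = sorted(i for s in val_speakers for i in by_speaker[s])
    return train_idx, val_idx


# Toy data: 10 speakers with 5 utterances each.
ids = ["spk%d" % (i % 10) for i in range(50)]
train_idx, val_idx = speaker_disjoint_split(ids)
print(len(train_idx), len(val_idx))
```

Splitting by speaker rather than by utterance means validation scores measure performance on voices the model has never heard.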
Training Hyperparameters
| Parameter | Value |
|---|---|
| Base model | openai/whisper-small |
| Language token | urdu (closest Nastaliq script in Whisper) |
| Task | transcribe |
| Learning rate | 1e-5 |
| Effective batch size | 16 (8 × 2 gradient accumulation) |
| Max steps | 1,000 |
| Optimizer | AdamW |
| Precision | fp16 |
| Gradient checkpointing | |
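As a rough sketch, the table values could be expressed as keyword arguments in the style of `transformers.Seq2SeqTrainingArguments`. The key names and the single-GPU reading of "8 × 2" are assumptions, not the author's actual training script, and the gradient checkpointing setting is left out because its value is not given above.

```python
# Values copied from the hyperparameter table. Key names follow
# transformers.Seq2SeqTrainingArguments, which is an assumption about
# how training was configured.
training_kwargs = {
    "learning_rate": 1e-5,
    "per_device_train_batch_size": 8,  # assumes a single GPU
    "gradient_accumulation_steps": 2,
    "max_steps": 1000,
    "optim": "adamw_torch",            # AdamW
    "fp16": True,
}

TRAIN_SAMPLES = 9519  # from the split table in Training Data

# Effective batch size = per-device batch size x accumulation steps.
effective_batch_size = (
    training_kwargs["per_device_train_batch_size"]
    * training_kwargs["gradient_accumulation_steps"]
)

# Approximate number of passes over the training set in 1,000 steps.
approx_epochs = training_kwargs["max_steps"] * effective_batch_size / TRAIN_SAMPLES

print(effective_batch_size, round(approx_epochs, 2))  # prints: 16 1.68
```

Under these assumptions, 1,000 steps amount to fewer than two passes over the training set.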
Training Curve
| Step | Train Loss | Val Loss | Raw WER (%) |
|---|---|---|---|
| 250 | 0.7905 | 0.4037 | 40.19 |
| 500 | 0.5968 | 0.3208 | 33.37 |
| 750 | 0.4542 | 0.2963 | 31.37 |
| 1000 | 0.4652 | 0.2830 | 30.07 |
Note: The raw WER logged during training at step 1,000 was 30.07%. The final evaluation on the speaker-disjoint held-out set, computed after normalizing the text by removing punctuation, yielded the reported 26.74% WER and 8.67% CER, which indicates that the model generalizes to unseen speakers.
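The punctuation removal mentioned in the note can be sketched as below. The exact normalization used for the reported scores is not specified here, so this version, which drops every character in a Unicode punctuation category and collapses whitespace, is an assumption; `strip_punctuation` is a hypothetical name.

```python
import unicodedata


def strip_punctuation(text):
    """Replace Unicode punctuation with spaces, then collapse whitespace.

    Covers Latin punctuation as well as Arabic-script marks such as the
    Arabic comma (U+060C) and the Arabic full stop (U+06D4).
    """
    cleaned = "".join(
        " " if unicodedata.category(ch).startswith("P") else ch
        for ch in text
    )
    return " ".join(cleaned.split())


# "salam, dunya." written with the Arabic comma and Arabic full stop.
sample = "\u0633\u0644\u0627\u0645\u060c \u062f\u0646\u06cc\u0627\u06d4"
print(strip_punctuation(sample))
```

Applying the same function to both reference and hypothesis before scoring keeps punctuation differences from being counted as word errors.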
Bias, Risks, and Limitations
Technical Limitations
- WER of 26.74% — roughly one word in four may be incorrect.
Not suitable for critical applications without human review.
- Read speech only — trained on short read clips (avg 6 seconds).
Performance on spontaneous conversational speech will likely be lower.
- No Unicode normalization — Nastaliq script Unicode ambiguities
(e.g., Arabic Yeh vs. Farsi Yeh) may affect output consistency.
- Speaker diversity — 136 speakers, mostly from Gilgit-Baltistan.
Dialectal variation from other regions may affect accuracy.
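One way to reduce the Unicode ambiguity noted above is to fold look-alike code points into a single form before comparing or storing transcripts. The sketch below maps Arabic Yeh (U+064A) to Farsi Yeh (U+06CC); choosing Farsi Yeh as the canonical form is an assumption based on common Urdu text conventions, and neither the model nor the dataset applies this step.

```python
import unicodedata

# Assumed canonical form: Farsi Yeh (U+06CC). The map can be extended
# with other look-alike pairs as needed.
CHAR_MAP = str.maketrans({"\u064a": "\u06cc"})


def normalize_script(text):
    """Apply NFC normalization, then fold Arabic Yeh into Farsi Yeh."""
    return unicodedata.normalize("NFC", text).translate(CHAR_MAP)


# The same two-letter string typed with Arabic Yeh and with Farsi Yeh.
with_arabic_yeh = "\u0628\u064a"
with_farsi_yeh = "\u0628\u06cc"

print(with_arabic_yeh == with_farsi_yeh)                                      # False
print(normalize_script(with_arabic_yeh) == normalize_script(with_farsi_yeh))  # True
```

Without such folding, two transcripts that look identical on screen can still count as different words when WER is computed.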
Sociotechnical Considerations
- Balti is an endangered language. Mis-transcriptions could distort
cultural meaning. Native speaker validation is recommended.
- The dataset represents a specific regional subset of Balti speakers
and may not capture all dialectal variation.
Recommendations
- Use human review for sensitive or important content
- Encourage Balti speakers to report errors via GitHub Issues
- Consider extended training or Whisper-medium for higher accuracy
Environmental Impact
Estimated using the ML Impact Calculator (Lacoste et al., 2019).
- Hardware: NVIDIA Tesla T4
- Training time: ~1.9 hours
- Cloud provider: Google Colab
- Carbon emitted: ~0.1 kg CO₂eq (estimated)
Citation
If you use this model or the associated dataset in your research, please cite:
@misc{ali2026baltivoice,
  author        = {Muhammad Ali},
  title         = {BaltiVoice: A Speech Corpus and Fine-tuned Whisper ASR System for the Balti Language},
  year          = {2026},
  eprint        = {2606.03504},
  archivePrefix = {arXiv},
  primaryClass  = {cs.CL},
  url           = {https://arxiv.org/abs/2606.03504}
}
Glossary
- WER: Word Error Rate = (Substitutions + Deletions + Insertions)
/ Number of words in the reference. Lower is better.
- CER: Character Error Rate. Useful for Nastaliq script where
Unicode ambiguities can inflate WER.
- Nastaliq: Arabic-based script used for Urdu, Persian, and Balti.
- Low-resource language: A language with limited digital data,
tools, and models available for NLP/ASR.
- Speaker-disjoint split: Train and validation sets contain
entirely different speakers, preventing the model from memorizing
speaker acoustics.