mohdali1/whisper-small-balti

Model Details

Model Description

This model is a fine-tuned version of openai/whisper-small for Automatic Speech Recognition (ASR) in the Balti language (bft).

Balti is a Tibetic language with roughly 400,000 speakers, written in Nastaliq (Arabic-based) script. Before this work, no publicly available ASR models or datasets existed for Balti. This model transcribes Balti speech into native Nastaliq text.

Developed by: Muhammad Ali, Independent Researcher, Gilgit-Baltistan, Pakistan. Alumnus, The Islamia University of Bahawalpur (IUB).
Model type: Sequence-to-sequence ASR (Whisper architecture)
Language: Balti (bft)
License: Apache 2.0
Base model: openai/whisper-small

Model Sources

Repository: github.com/mohdali-dev/BaltiVoice-ASR
Demo: HuggingFace Spaces
Paper: arXiv:2606.03504

Results

Model	WER (%)	CER (%)
Whisper-small (zero-shot)	159.19	152.52
Whisper-base (fine-tuned)	44.54	15.61
Whisper-small (fine-tuned, this model)	26.74	8.67

A zero-shot WER above 100% means the errors outnumber the reference words, which happens when the model hallucinates and inserts many words that are not in the reference. Fine-tuning on 16.8 hours of Balti speech brings this down to 26.74% WER and 8.67% CER on the 538-utterance speaker-disjoint validation set.
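To make the "WER above 100%" case concrete, here is a minimal sketch of how WER and CER can be computed from a word-level and character-level edit distance. It is an illustration only, not the evaluation script used for the numbers above; the function names and the placeholder tokens (`w1`, `w2`, ...) are made up for this example.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (words or characters)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(
                prev[j] + 1,         # deletion: reference item missing from hypothesis
                curr[j - 1] + 1,     # insertion: extra item in hypothesis
                prev[j - 1] + cost,  # substitution (or match when cost is 0)
            ))
        prev = curr
    return prev[-1]


def wer(reference, hypothesis):
    """Word error rate in percent, relative to the reference length."""
    ref_words = reference.split()
    return 100.0 * edit_distance(ref_words, hypothesis.split()) / len(ref_words)


def cer(reference, hypothesis):
    """Character error rate in percent, relative to the reference length."""
    return 100.0 * edit_distance(list(reference), list(hypothesis)) / len(reference)


# Placeholder tokens, not real Balti text.
reference = "w1 w2 w3"
hallucinated = "w1 x w3 y y y y"  # one substitution plus four inserted words

print(round(wer(reference, hallucinated), 2))  # 5 errors over 3 reference words
print(wer(reference, reference))               # identical strings give 0.0
```

Because the denominator is the reference length, a hypothesis that is much longer than the reference can push the score past 100%, which is the pattern the zero-shot row shows.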

How to Get Started

Installation

```bash
pip install transformers torch librosa
```

Inference

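```python
from transformers import pipeline

# Load the fine-tuned checkpoint; "urdu" is the language token used during training.
asr = pipeline(
    "automatic-speech-recognition",
    model="mohdali1/whisper-small-balti",
    generate_kwargs={"language": "urdu", "task": "transcribe"},
)

# Reading audio from a file path requires ffmpeg to be installed on the system.
result = asr("your_balti_audio.wav")
print(result["text"])
```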

Manual inference

```python
import librosa
import torch
from transformers import WhisperForConditionalGeneration, WhisperProcessor

model_id = "mohdali1/whisper-small-balti"
processor = WhisperProcessor.from_pretrained(
    model_id, language="urdu", task="transcribe"
)
model = WhisperForConditionalGeneration.from_pretrained(model_id)
model.eval()

# Whisper expects 16 kHz mono audio; librosa resamples on load.
audio, _ = librosa.load("your_balti_audio.wav", sr=16000)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

# Pass the language and task at generation time, matching the pipeline example.
with torch.no_grad():
    generated_ids = model.generate(
        inputs.input_features, language="urdu", task="transcribe"
    )

transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(transcription)
```

Uses

Direct Use

Transcription: Convert Balti audio into native Nastaliq text
Research: Study low-resource ASR and transfer learning for Tibetic languages
Education: Build tools for Balti literacy and pronunciation

Downstream Use

Voice assistants for Balti speakers
Media archiving of radio broadcasts, folk stories, and oral histories
Healthcare documentation in rural Gilgit-Baltistan settings

Out-of-Scope Use

High-stakes decisions (legal, medical, safety-critical) without human verification: at roughly 27% WER, the model is not production-ready
Other languages: performance on non-Balti input is not guaranteed
Commercial deployment without further domain-specific evaluation

Training Details

Training Data

Dataset: BaltiVoice ASR Dataset
Total clips: 10,060 validated utterances (~16.8 hours)
Format: 16kHz mono WAV, native Nastaliq transcriptions
Split method: Speaker-disjoint (GroupShuffleSplit on client_id, seed 42)

Split	Samples	Speakers
Train	9,519	122
Validation	538	14
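The idea behind a speaker-disjoint split can be sketched with the standard library alone. The card states that the actual split used scikit-learn's GroupShuffleSplit on client_id with seed 42; the stand-in below only mirrors the principle (shuffle speakers, not clips) and will not reproduce the author's exact partition. The helper name, the 10% validation fraction, and the toy rows are assumptions for illustration.

```python
import random
from collections import defaultdict


def speaker_disjoint_split(rows, val_fraction=0.1, seed=42):
    """Split rows so that no client_id appears in both train and validation."""
    by_speaker = defaultdict(list)
    for row in rows:
        by_speaker[row["client_id"]].append(row)

    # Shuffle the speaker IDs, not the individual clips.
    speakers = sorted(by_speaker)
    random.Random(seed).shuffle(speakers)

    n_val = max(1, round(len(speakers) * val_fraction))
    val_speakers, train_speakers = speakers[:n_val], speakers[n_val:]

    train = [r for s in train_speakers for r in by_speaker[s]]
    val = [r for s in val_speakers for r in by_speaker[s]]
    return train, val


# Toy metadata: 50 clips from 10 hypothetical speakers.
rows = [{"client_id": f"spk{i % 10}", "path": f"clip_{i}.wav"} for i in range(50)]
train, val = speaker_disjoint_split(rows)

train_ids = {r["client_id"] for r in train}
val_ids = {r["client_id"] for r in val}
print(len(train), len(val), train_ids.isdisjoint(val_ids))
```

Splitting by speaker rather than by clip is what keeps validation scores from being inflated by voices the model has already heard.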

Training Hyperparameters

Parameter	Value
Base model	openai/whisper-small
Language token	urdu (closest Nastaliq script in Whisper)
Task	transcribe
Learning rate	1e-5
Effective batch size	16 (8 × 2 gradient accumulation)
Max steps	1,000
Optimizer	AdamW
Precision	fp16
Gradient checkpointing	Enabled
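A quick arithmetic check shows how much data these settings cover. The sketch below assumes a single GPU (the card lists one Tesla T4) and assumes gradient checkpointing was enabled; the dictionary keys follow the field names of Hugging Face `TrainingArguments`, but the author's actual training script is not reproduced here.

```python
# Values taken from the hyperparameter table; key names are an assumption.
hparams = {
    "per_device_train_batch_size": 8,
    "gradient_accumulation_steps": 2,
    "learning_rate": 1e-5,
    "max_steps": 1000,
    "fp16": True,
    "gradient_checkpointing": True,  # assumed enabled
}
TRAIN_SAMPLES = 9519  # size of the training split

# Effective batch size = per-device batch size x accumulation steps (one GPU assumed).
effective_batch = (
    hparams["per_device_train_batch_size"] * hparams["gradient_accumulation_steps"]
)
samples_seen = effective_batch * hparams["max_steps"]
epochs = samples_seen / TRAIN_SAMPLES

print(effective_batch)   # 16
print(samples_seen)      # 16000
print(round(epochs, 2))  # 1.68
```

Under these assumptions, 1,000 steps correspond to roughly 1.7 passes over the training split.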

Training Curve

Step	Train Loss	Val Loss	Raw WER (%)
250	0.7905	0.4037	40.19
500	0.5968	0.3208	33.37
750	0.4542	0.2963	31.37
1000	0.4652	0.2830	30.07

Note: The raw (unnormalized) WER logged during training at step 1,000 was 30.07%. The final evaluation on the speaker-disjoint held-out set, computed after removing punctuation, yielded the reported 26.74% WER and 8.67% CER, which suggests the model generalizes to unseen speakers.

Bias, Risks, and Limitations

Technical Limitations

WER of 26.74%: roughly one word in four may be incorrect. Not suitable for critical applications without human review.
Read speech only: trained on short read clips (about 6 seconds on average). Performance on spontaneous conversational speech will likely be lower.
No Unicode normalization: Unicode ambiguities in Nastaliq text (e.g., Arabic Yeh vs. Farsi Yeh) may affect output consistency.
Speaker diversity: 136 speakers, mostly from Gilgit-Baltistan. Dialectal variation from other regions may affect accuracy.
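Users who need consistent output can apply their own normalization after transcription. The sketch below is one possible post-processing step, not part of this model or its evaluation code: the character mapping (Arabic Yeh to Farsi Yeh, plus Arabic Kaf to Keheh as an extra assumption) and the punctuation set are illustrative choices that should be reviewed by a Balti speaker before use.

```python
import unicodedata

# Illustrative mapping only; extend or change it to match your orthography.
CHAR_MAP = {
    "\u064A": "\u06CC",  # ARABIC LETTER YEH   -> ARABIC LETTER FARSI YEH
    "\u0643": "\u06A9",  # ARABIC LETTER KAF   -> ARABIC LETTER KEHEH
}

# Arabic-script full stop, comma, question mark, semicolon, plus ASCII punctuation.
PUNCTUATION = "\u06D4\u060C\u061F\u061B.,?!;:"

_CHAR_TABLE = str.maketrans(CHAR_MAP)
_PUNCT_TABLE = str.maketrans("", "", PUNCTUATION)


def normalize(text):
    """Unify look-alike letters, strip punctuation, and collapse whitespace."""
    text = unicodedata.normalize("NFC", text)
    text = text.translate(_CHAR_TABLE)
    text = text.translate(_PUNCT_TABLE)
    return " ".join(text.split())


# The same word typed with Arabic Yeh and with Farsi Yeh now compares equal.
with_arabic_yeh = "\u0639\u0644\u064A\u06D4"  # ends with an Arabic full stop
with_farsi_yeh = "\u0639\u0644\u06CC"
print(normalize(with_arabic_yeh) == normalize(with_farsi_yeh))
```

Applying the same function to both references and hypotheses before scoring keeps look-alike code points from being counted as word errors.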

Sociotechnical Considerations

Balti is an endangered language. Mis-transcriptions could distort cultural meaning. Native speaker validation is recommended.
The dataset represents a specific regional subset of Balti speakers and may not capture all dialectal variation.

Recommendations

Use human review for sensitive or important content
Encourage Balti speakers to report errors via GitHub Issues
Consider extended training or Whisper-medium for higher accuracy

Environmental Impact

Estimated using the ML Impact Calculator (Lacoste et al., 2019).

Hardware: NVIDIA Tesla T4
Training time: ~1.9 hours
Cloud provider: Google Colab
Carbon emitted: ~0.1 kg CO₂eq (estimated)
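For readers who want to sanity-check the estimate, the usual back-of-the-envelope formula is energy (power times time) multiplied by the grid's carbon intensity. In the sketch below, the 70 W figure is the T4's rated board power and the training time comes from this card; the carbon intensity value is a hypothetical placeholder, since the region and the figure used by the calculator are not stated here.

```python
GPU_POWER_KW = 0.070       # NVIDIA T4 rated board power (70 W)
TRAINING_HOURS = 1.9       # from this card
CARBON_INTENSITY = 0.5     # kg CO2eq per kWh; hypothetical placeholder value

# Energy drawn by the GPU, assuming it runs at rated power for the whole job.
energy_kwh = GPU_POWER_KW * TRAINING_HOURS

# Emissions = energy x carbon intensity of the electricity supply.
emissions_kg = energy_kwh * CARBON_INTENSITY

print(f"{energy_kwh:.3f} kWh")
print(f"{emissions_kg:.1f} kg CO2eq")
```

With this placeholder intensity the result rounds to the same order of magnitude as the reported figure; a different regional intensity would change it proportionally.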

Citation

If you use this model or the associated dataset in your research, please cite:

bibtex
@misc{ali2026baltivoice,
  author    = {Muhammad Ali},
  title     = {BaltiVoice: A Speech Corpus and Fine-tuned Whisper ASR System for the Balti Language},
  year      = {2026},
  eprint    = {2606.03504},
  archivePrefix = {arXiv},
  primaryClass = {cs.CL},
  url       = {https://arxiv.org/abs/2606.03504}
}

Glossary

WER: Word Error Rate = (Substitutions + Deletions + Insertions) / Number of words in the reference. Lower is better; it can exceed 100% when the output contains many inserted words.
CER: Character Error Rate. Useful for Nastaliq script where Unicode ambiguities can inflate WER.
Nastaliq: Arabic-based script used for Urdu, Persian, and Balti.
Low-resource language: A language with limited digital data, tools, and models available for NLP/ASR.
Speaker-disjoint split: Train and validation sets contain entirely different speakers, so validation scores reflect performance on unseen voices rather than memorized speaker acoustics.

More Information

Dataset: mohdali1/baltivoice-asr
Demo: baltivoice-demo
GitHub: BaltiVoice-ASR
Paper: arXiv:2606.03504
Author: Muhammad Ali
Contact: s22bseen1m01052@iub.edu.pk | ORCID | LinkedIn

