License: apache-2.0
Model Details
Model Description
This model is a fine-tuned version of openai/whisper-small for Automatic Speech Recognition (ASR) in the Balti language (bft).
Balti is a Tibetic language with roughly 400,000 speakers, written in Nastaliq (Arabic-based) script. Before this work, no publicly available ASR models or datasets existed for Balti. This model transcribes Balti speech into native Nastaliq text.
- Developed by: Mohammad Ali, Independent Researcher, Gilgit-Baltistan, Pakistan
- Model type: Sequence-to-sequence ASR (Whisper architecture)
- Language: Balti (bft)
- License: Apache 2.0
- Base model: openai/whisper-small
Model Sources
- Repository: github.com/mohdali-dev/BaltiVoice-ASR
- Demo: HuggingFace Spaces
- Paper: (arXiv link will be added upon publication)
Results
| Model | WER |
|---|---|
| Whisper-small (zero-shot) | 182.18% |
| Whisper-small fine-tuned (this model) | 30.07% |
A zero-shot WER above 100% indicates hallucination: the model generates words not present in the reference, and every inserted word counts as an error. Fine-tuning on 16.8 hours of Balti speech reduces WER to 30.07% on a speaker-disjoint validation set of 538 utterances.
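To make the "above 100%" case concrete, here is a minimal, dependency-free sketch of word-level WER. It is not the evaluation code behind the table above; the function name `word_error_rate` and the Latin placeholder words are illustrative assumptions.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the number of reference words."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference prefix
    # processed so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1        # reference word missing from hypothesis
            insertion = curr[j - 1] + 1   # extra word in hypothesis
            curr[j] = min(substitution, deletion, insertion)
        prev = curr
    return prev[-1] / len(ref)


# Two reference words, three extra hypothesis words: 3 errors / 2 words.
print(word_error_rate("a b", "a x b y z"))  # 1.5, i.e. 150% WER
```

Because the denominator is the reference length only, a hypothesis that is much longer than the reference can push WER past 100%.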
How to Get Started
Installation
bash
pip install transformers torch librosa
Inference
python
from transformers import pipeline

# The "urdu" language token matches the setting used during fine-tuning.
asr = pipeline(
    "automatic-speech-recognition",
    model="mohdali1/whisper-small-balti",
    generate_kwargs={"language": "urdu", "task": "transcribe"},
)

result = asr("your_balti_audio.wav")
print(result["text"])
Manual inference
python
from transformers import WhisperForConditionalGeneration, WhisperProcessor
import torch
import librosa

model_id = "mohdali1/whisper-small-balti"
processor = WhisperProcessor.from_pretrained(model_id, language="urdu", task="transcribe")
model = WhisperForConditionalGeneration.from_pretrained(model_id)

# Load the audio and resample it to the 16 kHz rate Whisper expects.
audio, sr = librosa.load("your_balti_audio.wav", sr=16000)
inputs = processor(audio, sampling_rate=16000, return_tensors="pt")

with torch.no_grad():
    generated_ids = model.generate(inputs.input_features)

transcription = processor.batch_decode(generated_ids, skip_special_tokens=True)[0]
print(transcription)
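The manual path resamples with `librosa.load(..., sr=16000)`, so a format check is optional. If you want to inspect a file up front (for example, before batch processing), the following is a small sketch using only the standard-library `wave` module. The helper name `check_wav_format` is hypothetical, and `wave` reads uncompressed PCM WAV files only.

```python
import wave


def check_wav_format(path: str, expected_rate: int = 16000, expected_channels: int = 1) -> dict:
    """Read a PCM WAV header and report whether it matches the expected format."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        duration = wav.getnframes() / rate
    return {
        "sample_rate": rate,
        "channels": channels,
        "duration_s": duration,
        "matches": rate == expected_rate and channels == expected_channels,
    }
```

Calling `check_wav_format("your_balti_audio.wav")` returns a dictionary whose `matches` field is `True` for 16 kHz mono input.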
Uses
Direct Use
- Transcription: Convert Balti audio into native Nastaliq text
- Research: Study low-resource ASR and transfer learning for Tibetic languages
- Education: Build tools for Balti literacy and pronunciation
Downstream Use
- Voice assistants for Balti speakers
- Media archiving of radio broadcasts, folk stories, oral histories
- Healthcare documentation in rural Gilgit-Baltistan settings
Out-of-Scope Use
- High-stakes decisions (legal, medical, safety-critical) without human verification — WER is 30%, not production-ready
- Other languages — performance on non-Balti input is not guaranteed
- Commercial deployment without further domain-specific evaluation
Training Details
Training Data
- Dataset: BaltiVoice ASR Dataset
- Total clips: 10,060 validated utterances (~16.8 hours)
- Format: 16kHz mono WAV, native Nastaliq transcriptions
- Split method: Speaker-disjoint (GroupShuffleSplit on client_id, seed 42)
| Split | Samples | Speakers |
|---|---|---|
| Train | 9,519 | 122 |
| Validation | 538 | 14 |
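The idea behind a speaker-disjoint split can be sketched with the standard library alone. This is not scikit-learn's `GroupShuffleSplit` and will not reproduce the partition in the table; the function name, the `val_fraction` value, and the toy speaker IDs are all assumptions made for illustration.

```python
import random


def speaker_disjoint_split(client_ids, val_fraction=0.1, seed=42):
    """Return (train_indices, val_indices) so that no speaker appears in both."""
    speakers = sorted(set(client_ids))
    rng = random.Random(seed)
    rng.shuffle(speakers)

    # Hold out whole speakers, not individual clips.
    n_val = max(1, round(len(speakers) * val_fraction))
    val_speakers = set(speakers[:n_val])

    train_idx = [i for i, s in enumerate(client_ids) if s not in val_speakers]
    val_idx = [i for i, s in enumerate(client_ids) if s in val_speakers]
    return train_idx, val_idx


# Toy example: one hypothetical speaker ID per clip.
clips = ["spk1", "spk1", "spk2", "spk3", "spk3", "spk4", "spk5"]
train_idx, val_idx = speaker_disjoint_split(clips, val_fraction=0.2)

train_speakers = {clips[i] for i in train_idx}
val_speakers = {clips[i] for i in val_idx}
assert train_speakers.isdisjoint(val_speakers)
```

Splitting by speaker and not by clip is what keeps validation WER from being inflated by voices the model has already heard.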
Training Hyperparameters
| Parameter | Value |
|---|---|
| Base model | openai/whisper-small |
| Language token | urdu (closest Nastaliq-script language supported by Whisper) |
| Task | transcribe |
| Learning rate | 1e-5 |
| Effective batch size | 16 (8 × 2 gradient accumulation) |
| Max steps | 1,000 |
| Optimizer | AdamW |
| Precision | fp16 |
| Gradient checkpointing | Enabled |
| Hardware | NVIDIA Tesla T4 (Google Colab) |
| Training time | 1h 54m |
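As a rough sanity check on the table, the arithmetic below works out how much data 1,000 steps cover. It assumes a single GPU (so the per-device batch is the whole micro-batch) and takes the training-set size from the split table. The per-step time is only an average that includes any evaluation passes.

```python
per_device_batch = 8
grad_accum_steps = 2
max_steps = 1000
train_samples = 9519               # Train row of the split table
training_seconds = 1 * 3600 + 54 * 60

effective_batch = per_device_batch * grad_accum_steps
samples_seen = effective_batch * max_steps
epochs = samples_seen / train_samples
seconds_per_step = training_seconds / max_steps

print(effective_batch)             # 16
print(samples_seen)                # 16000
print(round(epochs, 2))            # 1.68
print(round(seconds_per_step, 2))  # 6.84
```

Under these assumptions the run covers about 1.7 passes over the training set.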
Training Curve
| Step | Train Loss | Val Loss | WER (%) |
|---|---|---|---|
| 250 | 0.7905 | 0.4037 | 40.19 |
| 500 | 0.5968 | 0.3208 | 33.37 |
| 750 | 0.4542 | 0.2963 | 31.37 |
| 1000 | 0.4652 | 0.2830 | 30.07 |
Validation loss decreased consistently across all checkpoints with no sign of overfitting at step 1,000.
Bias, Risks, and Limitations
Technical Limitations
- WER of 30.07%: roughly three errors (substitutions, deletions, or insertions) per ten reference words. Not suitable for critical applications without human review.
- Read speech only — trained on short read clips (avg 6 seconds). Performance on spontaneous conversational speech will likely be lower.
- No Unicode normalization — Nastaliq script Unicode ambiguities (e.g., Arabic Yeh vs. Farsi Yeh) may affect output consistency.
- Speaker diversity — 136 speakers, mostly from Gilgit-Baltistan. Dialectal variation from other regions may affect accuracy.
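The Unicode ambiguity mentioned above can be reduced by folding look-alike code points to one form before comparing or scoring transcripts. The sketch below is one possible post-processing step, not part of the released model. The function name `normalize_nastaliq` and the choice of target code points are assumptions; the right canonical forms for Balti orthography should be confirmed with native speakers.

```python
import unicodedata

# Hypothetical canonical choices, mirroring common Urdu-style normalization.
CHAR_MAP = str.maketrans({
    "\u064A": "\u06CC",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
    "\u0643": "\u06A9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
})


def normalize_nastaliq(text: str) -> str:
    """Apply NFC composition, then fold look-alike letters to one code point."""
    text = unicodedata.normalize("NFC", text)
    return text.translate(CHAR_MAP)


# Both Yeh variants now compare equal after normalization.
assert normalize_nastaliq("\u064A") == normalize_nastaliq("\u06CC")
```

Applying the same normalization to both references and model output keeps visually identical words from being counted as errors.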
Sociotechnical Considerations
- Balti is an endangered language. Mis-transcriptions could distort cultural meaning. Native speaker validation is recommended.
- The dataset represents a specific regional subset of Balti speakers and may not capture all dialectal variation.
Recommendations
- Use human review for sensitive or important content
- Encourage Balti speakers to report errors via GitHub Issues
- Consider extended training or Whisper-medium for higher accuracy
Environmental Impact
Estimated using the ML Impact Calculator (Lacoste et al., 2019).
- Hardware: NVIDIA Tesla T4
- Training time: ~1.9 hours
- Cloud provider: Google Colab
- Carbon emitted: ~0.1 kg CO₂eq (estimated)
Citation
bibtex
@misc{ali2026baltivoice,
  author    = {Ali, Muhammad},
  title     = {BaltiVoice: A Speech Corpus and Fine-tuned Whisper ASR System for the Balti Language},
  year      = {2026},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/mohdali1/whisper-small-balti}
}
Glossary
- WER: Word Error Rate = (Substitutions + Deletions + Insertions) / Total Words. Lower is better.
- Nastaliq: Arabic-based script used for Urdu, Persian, and Balti.
- Low-resource language: A language with limited digital data, tools, and models available for NLP/ASR.
- Speaker-disjoint split: Train and validation sets contain entirely different speakers, preventing the model from memorizing speaker acoustics.
More Information
- Dataset: mohdali1/baltivoice-asr
- Demo: baltivoice-demo
- GitHub: BaltiVoice-ASR
- Contact: alisundusi10@gmail.com