License: apache-2.0

Model Details

Model Description

This model is a fine-tuned version of openai/whisper-small for Automatic Speech Recognition (ASR) in the Balti language (bft).

Balti is a Tibetic language with roughly 400,000 speakers, written in Nastaliq (Arabic-based) script. Before this work, no publicly available ASR models or datasets existed for Balti. This model transcribes Balti speech into native Nastaliq text.

  • Developed by: Mohammad Ali, Independent Researcher, Gilgit-Baltistan, Pakistan
  • Model type: Sequence-to-sequence ASR (Whisper architecture)
  • Language: Balti (bft)
  • License: Apache 2.0
  • Base model: openai/whisper-small

Model Sources


Results

Model                                    WER
Whisper-small (zero-shot)                182.18%
Whisper-small fine-tuned (this model)    30.07%

A zero-shot WER above 100% means the errors outnumber the words in the reference, which happens when the model hallucinates and inserts words that are not in the reference. Fine-tuning on 16.8 hours of Balti speech reduces WER to 30.07% on a speaker-disjoint validation set of 538 utterances.
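To illustrate how WER can exceed 100%, here is a minimal sketch of the metric based on word-level edit distance. The function name `word_error_rate` and the example strings are hypothetical; this is not the evaluation code used for this model, which may apply different text normalization.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + deletions + insertions) / reference word count."""
    ref = reference.split()
    hyp = hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    # prev[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: a 2-word reference against a 5-word hypothesis
# that shares no words with it (2 substitutions + 3 insertions).
reference = "a b"
hypothesis = "v w x y z"
wer = word_error_rate(reference, hypothesis)
print(f"WER: {wer:.0%}")  # WER: 250%
```

Because the denominator is the reference length, every extra inserted word adds to the numerator without bound, so a hallucinating model can score well above 100%.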


How to Get Started

Installation

bash

pip install transformers torch librosa

Inference

python

from transformers import pipeline

# The Urdu language token is used for Balti, matching the training setup
asr = pipeline(
    "automatic-speech-recognition",
    model="mohdali1/whisper-small-balti",
    generate_kwargs={"language": "urdu", "task": "transcribe"},
)
result = asr("your_balti_audio.wav")
print(result["text"])

Manual inference

python

from transformers import WhisperForConditionalGeneration, WhisperProcessor
import torch
import librosa

model_id = "mohdali1/whisper-small-balti"
processor = WhisperProcessor.from_pretrained(
    model_id, language="urdu", task="transcribe"
)
model = WhisperForConditionalGeneration.from_pretrained(model_id)

# Whisper expects 16 kHz mono audio; librosa resamples on load
audio, sr = librosa.load("your_balti_audio.wav", sr=16000)
inputs = processor(audio, sampling_rate=sr, return_tensors="pt")

# Generate token IDs without tracking gradients, then decode to text
with torch.no_grad():
    generated_ids = model.generate(inputs.input_features)
transcription = processor.batch_decode(
    generated_ids, skip_special_tokens=True
)[0]
print(transcription)

Uses

Direct Use

  • Transcription: Convert Balti audio into native Nastaliq text
  • Research: Study low-resource ASR and transfer learning for Tibetic languages
  • Education: Build tools for Balti literacy and pronunciation

Downstream Use

  • Voice assistants for Balti speakers
  • Media archiving of radio broadcasts, folk stories, oral histories
  • Healthcare documentation in rural Gilgit-Baltistan settings

Out-of-Scope Use

  • High-stakes decisions (legal, medical, safety-critical) without human verification — WER is 30%, not production-ready
  • Other languages — performance on non-Balti input is not guaranteed
  • Commercial deployment without further domain-specific evaluation

Training Details

Training Data

  • Dataset: BaltiVoice ASR Dataset
  • Total clips: 10,060 validated utterances (~16.8 hours)
  • Format: 16 kHz mono WAV, native Nastaliq transcriptions
  • Split method: Speaker-disjoint (GroupShuffleSplit on client_id, seed 42)

Split         Samples    Speakers
Train         9,519      122
Validation    538        14
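The card states that the split was made with scikit-learn's GroupShuffleSplit on client_id. The following standard-library sketch only illustrates the property that such a split guarantees (no speaker on both sides); it is not that implementation and will not reproduce the same partition. The function name `speaker_disjoint_split`, the `val_fraction` of 0.1, and the toy clips are assumptions for illustration.

```python
import random
from collections import defaultdict


def speaker_disjoint_split(clips, val_fraction=0.1, seed=42):
    """Split clips so that no client_id appears in both train and validation."""
    by_speaker = defaultdict(list)
    for clip in clips:
        by_speaker[clip["client_id"]].append(clip)

    # Sort before shuffling so the result depends only on the seed.
    speakers = sorted(by_speaker)
    random.Random(seed).shuffle(speakers)

    # Hold out whole speakers, never individual clips.
    n_val = max(1, round(len(speakers) * val_fraction))
    val = [c for s in speakers[:n_val] for c in by_speaker[s]]
    train = [c for s in speakers[n_val:] for c in by_speaker[s]]
    return train, val


# Hypothetical toy data: 10 speakers with 3 clips each.
clips = [
    {"client_id": f"spk{s:02d}", "path": f"clip_{s:02d}_{k}.wav"}
    for s in range(10)
    for k in range(3)
]
train, val = speaker_disjoint_split(clips)
train_ids = {c["client_id"] for c in train}
val_ids = {c["client_id"] for c in val}
print(len(train), len(val), train_ids & val_ids)  # 27 3 set()
```

Splitting by speaker rather than by clip means the validation WER reflects performance on unseen voices.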

Training Hyperparameters

Parameter                 Value
Base model                openai/whisper-small
Language token            urdu (closest Nastaliq script in Whisper)
Task                      transcribe
Learning rate             1e-5
Effective batch size      16 (8 × 2 gradient accumulation)
Max steps                 1,000
Optimizer                 AdamW
Precision                 fp16
Gradient checkpointing    Enabled
Hardware                  NVIDIA Tesla T4 (Google Colab)
Training time             1h 54m
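As a rough check on these settings, the sketch below collects them in a plain dictionary and derives how much data the run saw. The key names follow Hugging Face `Seq2SeqTrainingArguments` parameters, but that the training script used exactly this mapping is an assumption, as is a single GPU. Values the card does not report (warmup, scheduler, evaluation settings) are left out.

```python
# Keyword names mirror Seq2SeqTrainingArguments parameters; whether the
# original training script used this exact mapping is an assumption.
training_kwargs = {
    "per_device_train_batch_size": 8,
    "gradient_accumulation_steps": 2,
    "learning_rate": 1e-5,
    "max_steps": 1000,
    "fp16": True,
    "gradient_checkpointing": True,
}

# Effective batch size on one GPU (assumed: a single T4).
effective_batch_size = (
    training_kwargs["per_device_train_batch_size"]
    * training_kwargs["gradient_accumulation_steps"]
)

# Total training samples consumed over the run.
samples_seen = effective_batch_size * training_kwargs["max_steps"]

train_clips = 9519  # size of the train split reported in this card
epochs = samples_seen / train_clips

print(effective_batch_size)  # 16
print(samples_seen)          # 16000
print(f"{epochs:.2f}")       # 1.68
```

Under these assumptions, 1,000 steps correspond to fewer than two passes over the training split.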

Training Curve

Step    Train Loss    Val Loss    WER (%)
250     0.7905        0.4037      40.19
500     0.5968        0.3208      33.37
750     0.4542        0.2963      31.37
1000    0.4652        0.2830      30.07

Validation loss decreased consistently across all checkpoints with no sign of overfitting at step 1,000.
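The trend can be checked directly from the reported checkpoints. This short sketch uses only the numbers listed in the training curve; the variable names are illustrative.

```python
# (step, train_loss, val_loss, wer_percent) as reported in the training curve.
checkpoints = [
    (250, 0.7905, 0.4037, 40.19),
    (500, 0.5968, 0.3208, 33.37),
    (750, 0.4542, 0.2963, 31.37),
    (1000, 0.4652, 0.2830, 30.07),
]

# Change in validation loss and WER between consecutive checkpoints.
for prev, curr in zip(checkpoints, checkpoints[1:]):
    val_loss_delta = curr[2] - prev[2]
    wer_delta = curr[3] - prev[3]
    print(
        f"{prev[0]} -> {curr[0]}: "
        f"val loss {val_loss_delta:+.4f}, WER {wer_delta:+.2f} points"
    )

# True only if every checkpoint has a lower validation loss than the one before.
val_losses = [c[2] for c in checkpoints]
strictly_decreasing = all(b < a for a, b in zip(val_losses, val_losses[1:]))
print(strictly_decreasing)  # True
```

The WER gains shrink from 6.82 to 2.00 to 1.30 points per 250 steps, so improvement was slowing but had not stopped at step 1,000.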


Bias, Risks, and Limitations

Technical Limitations

  • WER of 30.07% — roughly one word in three may be incorrect. Not suitable for critical applications without human review.
  • Read speech only — trained on short read clips (avg 6 seconds). Performance on spontaneous conversational speech will likely be lower.
  • No Unicode normalization — Nastaliq script Unicode ambiguities (e.g., Arabic Yeh vs. Farsi Yeh) may affect output consistency.
  • Speaker diversity — 136 speakers, mostly from Gilgit-Baltistan. Dialectal variation from other regions may affect accuracy.
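For the Unicode ambiguity noted above, one possible post-processing step is to fold look-alike letters onto a single code point before comparing or scoring transcripts. The sketch below is an assumption-laden illustration, not part of this model's pipeline: the function name `normalize_nastaliq` and the choice to map toward the Farsi Yeh and Keheh forms (as is conventional in Urdu text) are hypothetical and should be confirmed against Balti orthographic practice.

```python
import unicodedata

# Hypothetical mapping: fold Arabic code points onto the forms
# conventionally used in Urdu-style text.
CHAR_MAP = str.maketrans({
    "\u064A": "\u06CC",  # ARABIC LETTER YEH -> ARABIC LETTER FARSI YEH
    "\u0643": "\u06A9",  # ARABIC LETTER KAF -> ARABIC LETTER KEHEH
})


def normalize_nastaliq(text: str) -> str:
    """Apply NFC, then fold look-alike letters onto one code point each."""
    return unicodedata.normalize("NFC", text).translate(CHAR_MAP)


arabic_yeh = "\u064A"
farsi_yeh = "\u06CC"

# The two letters render similarly but are different code points.
print(arabic_yeh == farsi_yeh)  # False

# After normalization they compare equal.
print(normalize_nastaliq(arabic_yeh) == normalize_nastaliq(farsi_yeh))  # True
```

Applying the same normalization to both references and model output keeps a purely visual difference from being counted as a word error.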

Sociotechnical Considerations

  • Balti is an endangered language. Mis-transcriptions could distort cultural meaning. Native speaker validation is recommended.
  • The dataset represents a specific regional subset of Balti speakers and may not capture all dialectal variation.

Recommendations

  • Use human review for sensitive or important content
  • Encourage Balti speakers to report errors via GitHub Issues
  • Consider extended training or Whisper-medium for higher accuracy

Environmental Impact

Estimated using the ML Impact Calculator (Lacoste et al., 2019).

  • Hardware: NVIDIA Tesla T4
  • Training time: ~1.9 hours
  • Cloud provider: Google Colab
  • Carbon emitted: ~0.1 kg CO₂eq (estimated)

Citation

bibtex

@misc{ali2026baltivoice,
  author    = {Ali, Muhammad},
  title     = {BaltiVoice: A Speech Corpus and Fine-tuned Whisper ASR System for the Balti Language},
  year      = {2026},
  publisher = {HuggingFace},
  url       = {https://huggingface.co/mohdali1/whisper-small-balti}
}

Glossary

  • WER: Word Error Rate = (Substitutions + Deletions + Insertions) / Total Words. Lower is better.
  • Nastaliq: Arabic-based script used for Urdu, Persian, and Balti.
  • Low-resource language: A language with limited digital data, tools, and models available for NLP/ASR.
  • Speaker-disjoint split: Train and validation sets contain entirely different speakers, preventing the model from memorizing speaker acoustics.
