License: MIT
Model Summary
| Field | Value |
|---|---|
| Base model | openai/whisper-large-v3-turbo |
| Architecture | Whisper encoder-decoder seq2seq (large-v3-turbo, ~809 M parameters) |
| Output target | IPA phonemes (no orthography) |
| Source language | Urdu (ur) |
| Audio | 16 kHz, mono, up to 30 s per chunk |
| Training data | 91 k synthetic TTS utterances after filtering (93 k collected, ≈ 165 h) |
| Training compute | NVIDIA RTX A6000 (48 GB), single GPU, ~10.4 h wall clock |
| Phonemizer | espeak-ng (via the phonemizer Python package) |
Intended Use
Primary intended uses are research-grade and offline processing:
- Automatic labeling of speech corpora with IPA for TTS training pipelines.
- Forced alignment between Urdu audio and a known phonemic transcription.
- Computer-assisted pronunciation training and pronunciation error detection.
- Phonological and dialect-variation studies on Urdu speech.
- Generation of phoneme-level features for downstream models.
This model is not a substitute for an orthographic Urdu ASR system. Its output is phonetic and is not intended to be read as Urdu text.
Limitations
- Training data is synthetic. Audio was produced by neural TTS systems and therefore differs in acoustic characteristics from spontaneous human speech (prosodic regularity, low noise, narrow speaker inventory). Real-speech word error rate on FLEURS is substantially higher than the in-loop development WER on the synthetic distribution (see Evaluation).
- Phonemizer determines the label space. Reference transcriptions were generated by espeak-ng (Urdu rule set). The model therefore inherits any systematic errors or idiosyncrasies of that grapheme-to-phoneme system. In particular, dialectal variants not produced by espeak-ng will not appear in the model's output distribution.
- Speaker and channel diversity is limited. The synthetic data covers a small set of TTS voices and recording conditions. Accented speech, noisy channels, code-switched English/Urdu, and rapid spontaneous speech are out of distribution.
- No timestamps. This release does not produce word- or phoneme-level alignment timestamps. For alignment, pair this model with a forced-alignment tool over its IPA output.
Training Data
Two synthetic-speech Urdu corpora were used:
| Source | Utterances | Approx. audio |
|---|---|---|
| mahwizzzz/syn-ur | 7,963 | ~14 h |
| mahwizzzz/syn-ur-2 | 85,327 | ~151 h |
| Total | 93,290 | ~165 h |
Audio was extracted via the datasets Audio feature, resampled to 16 kHz mono, and stored as PCM-16 WAV. The accompanying Urdu transcripts were normalized (removal of Arabic diacritics in the range U+064B–U+065F, U+0670, U+06D6–U+06ED; collapse of zero-width joiners; whitespace normalization).
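The normalization described above can be sketched as follows. This is a minimal illustration, not the exact training code: the function name `normalize_urdu` is hypothetical, and it assumes that "zero-width joiners" means U+200C and U+200D and that "collapse" means removal.

```python
import re

# Arabic diacritic ranges listed in the model card.
_DIACRITICS = re.compile("[\u064B-\u065F\u0670\u06D6-\u06ED]")
# Assumption: zero-width non-joiner (U+200C) and joiner (U+200D) are both removed.
_ZERO_WIDTH = re.compile("[\u200C\u200D]")
_WHITESPACE = re.compile(r"\s+")


def normalize_urdu(text: str) -> str:
    """Strip diacritics and zero-width joiners, then normalize whitespace."""
    text = _DIACRITICS.sub("", text)
    text = _ZERO_WIDTH.sub("", text)
    return _WHITESPACE.sub(" ", text).strip()


if __name__ == "__main__":
    # Diacritized input with a doubled space and a trailing zero-width joiner.
    print(normalize_urdu("\u0627\u064F\u0631\u062F\u064F\u0648  \u0632\u0628\u0627\u0646\u200D"))
```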
Transcripts were converted to IPA using phonemizer with the espeak backend (language="ur", with_stress=True). Empty phonemizations and transcripts exceeding 448 tokens after tokenization with Whisper were filtered, yielding 91,017 training examples. The resulting IPA inventory contains 61 distinct characters, including the Urdu retroflex consonant set (ʈ ɖ ɽ ɳ ʂ ʐ), aspirated consonants marked with ʰ, primary (ˈ) and secondary (ˌ) stress, long-vowel marker (ː), nasalization (combining tilde), velar fricative (χ), palatal stop (ɟ), labial approximant (ʋ), and the standard Urdu vowel space.
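A rough sketch of the phonemization and filtering step is given below. It assumes the phonemizer package and an espeak-ng installation are available; the helper names `phonemize_urdu` and `keep_example` are hypothetical, and the token count is assumed to come from the Whisper tokenizer, which is not shown here.

```python
from typing import List


def phonemize_urdu(texts: List[str]) -> List[str]:
    """Convert normalized Urdu transcripts to IPA with the settings named in the card."""
    # Imported lazily so the filter helper below works without phonemizer installed.
    from phonemizer import phonemize

    return phonemize(texts, language="ur", backend="espeak", with_stress=True)


def keep_example(ipa: str, n_label_tokens: int, max_label_tokens: int = 448) -> bool:
    """Drop empty phonemizations and label sequences longer than the token limit."""
    return bool(ipa.strip()) and n_label_tokens <= max_label_tokens


if __name__ == "__main__":
    # n_label_tokens would normally be len(tokenizer(ipa).input_ids); 12 is a placeholder.
    print(keep_example("ʊrduː zəbˈaːn", 12))
```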
Phonemization Pipeline
Reference IPA is produced offline before training, not at inference time. The pipeline is deterministic:
```text
Urdu script --> diacritic / ZWJ stripping --> espeak-ng (ur, with_stress=True) --> IPA token stream
```
This contract means a model output can be compared character-wise against the IPA produced by passing the corresponding gold Urdu text through the same espeak-ng configuration.
Training Procedure
The fine-tune was performed with transformers.Seq2SeqTrainer.
| Hyperparameter | Value |
|---|---|
| Initialization | openai/whisper-large-v3-turbo |
| Tokenizer language token | urdu |
| Optimizer | AdamW (Transformers default) |
| Learning rate | 1e-5, linear decay to 0 |
| Per-device train batch size | 4 |
| Gradient accumulation steps | 64 |
| Effective batch size | 256 |
| Number of epochs | 3 |
| Total optimizer steps | 1,062 |
| Mixed precision | fp16 |
| Gradient checkpointing | disabled (incompatible with the custom collator) |
| Max label tokens | 448 (filter applied at data load) |
| Eval set | 500 random held-out training utterances (synthetic) |
| Eval / save cadence | every 200 steps; load_best_model_at_end=True, metric_for_best_model="wer" |
| Wall-clock training time | 10 h 23 m 52 s |
| Final training loss | 0.0673 |
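As a sanity check on the table, the effective batch size and step count can be reproduced with a few lines of arithmetic. This sketch assumes the 500 evaluation utterances were taken out of the 91,017 filtered examples and that each epoch's final partial batch counts as one optimizer step; the Trainer's own rounding may differ by version.

```python
import math


def optimizer_steps(n_train, per_device_batch, grad_accum, epochs, n_devices=1):
    """Return (effective batch size, total optimizer steps) for a training run."""
    effective = per_device_batch * grad_accum * n_devices
    steps_per_epoch = math.ceil(n_train / effective)
    return effective, steps_per_epoch * epochs


# Assumption: 91,017 filtered examples minus the 500 held-out eval utterances.
n_train = 91_017 - 500
effective, steps = optimizer_steps(n_train, per_device_batch=4, grad_accum=64, epochs=3)
print(effective, steps)  # 256 1062
```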
The data collator processes raw audio on the fly: features are extracted with the Whisper feature extractor, labels are tokenized with the Whisper tokenizer set to language="urdu", and label sequences are padded with -100 to be ignored by the loss. dataloader_num_workers=0 is enforced because the HuggingFace Audio column is not fork-safe under multiple workers.
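The label-padding part of the collator can be illustrated in plain Python. This is only a sketch of the idea: the real collator would return tensors, and `pad_labels` is a hypothetical name. The value -100 matters because it is the default `ignore_index` of PyTorch's cross-entropy loss, so padded positions contribute nothing to the loss.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.functional.cross_entropy


def pad_labels(label_seqs, ignore_index=IGNORE_INDEX):
    """Right-pad tokenized label sequences to a common length with ignore_index."""
    max_len = max(len(seq) for seq in label_seqs)
    return [list(seq) + [ignore_index] * (max_len - len(seq)) for seq in label_seqs]


if __name__ == "__main__":
    # Token ids here are made up for illustration.
    print(pad_labels([[11, 12, 13], [21]]))
```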
Evaluation
Two evaluation distributions are reported.
In-distribution (synthetic held-out)
A random 500-utterance slice of the synthetic training set was held out and evaluated every 200 steps during training. Final scores:
| Metric | Value |
|---|---|
| eval loss | 0.0260 |
| eval CER | 0.0825 |
| eval WER | 0.1041 |
Out-of-distribution (real human speech, FLEURS)
The model was evaluated on the 299-utterance test split of the google/fleurs ur_pk configuration. Reference IPA for FLEURS was produced with the same espeak-ng pipeline used in training, so this is a like-for-like comparison at the phonemic level. The metrics reported are:
- CER: character error rate on the IPA stream.
- WER: word error rate, where words are space-separated IPA tokens.
- SER: stress error rate, defined as the fraction of aligned word pairs in which the position of the primary stress marker ˈ differs between hypothesis and reference.
- VER: vowel error rate, defined as the CER computed after restricting both hypothesis and reference to vowel characters.
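The four metrics above could be computed roughly as follows. This is a sketch under stated assumptions, not the evaluation script behind the reported numbers: the vowel set `VOWELS` is an illustrative guess, and word pairs for SER are aligned with `difflib.SequenceMatcher`, which may differ from the alignment actually used.

```python
from difflib import SequenceMatcher

# Assumption: an illustrative vowel set; length (ː) and nasalization marks are excluded.
VOWELS = set("aeiouəɪʊɛɔʌæɑ")


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def error_rate(ref, hyp):
    """Edit distance normalized by reference length."""
    return edit_distance(ref, hyp) / max(len(ref), 1)


def cer(ref: str, hyp: str) -> float:
    return error_rate(ref, hyp)


def wer(ref: str, hyp: str) -> float:
    return error_rate(ref.split(), hyp.split())


def ver(ref: str, hyp: str) -> float:
    """CER after keeping only vowel characters on both sides."""
    vowels_only = lambda s: "".join(c for c in s if c in VOWELS)
    return error_rate(vowels_only(ref), vowels_only(hyp))


def ser(ref: str, hyp: str) -> float:
    """Fraction of aligned word pairs whose primary stress position differs."""
    ref_w, hyp_w = ref.split(), hyp.split()
    pairs = []
    matcher = SequenceMatcher(None, ref_w, hyp_w, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        # Pair identical words, and substituted runs of equal length position by position.
        if tag == "equal" or (tag == "replace" and i2 - i1 == j2 - j1):
            pairs.extend(zip(ref_w[i1:i2], hyp_w[j1:j2]))
    if not pairs:
        return 0.0
    return sum(r.find("ˈ") != h.find("ˈ") for r, h in pairs) / len(pairs)
```

For example, `wer("kˈaː tʰaː", "kˈoː tʰaː")` gives 0.5 (one of two words differs), while `ser` on the same pair gives 0.0 because the stress marker sits at the same position in both words.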
| Metric | Value |
|---|---|
| CER | 0.1833 |
| WER | 0.3995 |
| SER | 0.4874 |
| VER | 0.1457 |
The gap between the synthetic dev WER (0.10) and the FLEURS real-speech WER (0.40) is a direct consequence of the synthetic-only training distribution. Continued training on real Urdu speech (for example, Common Voice Urdu) is expected to narrow this gap substantially.
Example Predictions on FLEURS
```markdown
ID: fleurs_000000
GT:   ɪn keː dˈɔːr mˈẽ xəfˈiːf təlˈoːs ˈeːzaː mˈʌsəla nahˈiːn tʰaː ɟˈeːsaː ke woː ˈaːɟ hɛ ...
Pred: ɪn keː dˈuːr mˈẽ xəfˈiːb təlˈoː ˈeːsaː mˈʌsəla nahˈiːn tʰaː ɟˈeːsaː keː woː ˈaːɟ hɛ ...

ID: fleurs_000003
GT:   ˌaktˈoːbər mˈẽ ʃˈʊruː hˈoːneː ʋˈaːleː hˌʊkuːmˈat mʊxˈaːlɪf mʊʐˈaːhɪrˌõː pˈʌr mˈaːrʈilˌiː kaː ɾˌadeː amˈal kəmˈiːʃən tʰaː
Pred: woː ˌəktˈuːbər mˈẽ ʃˈʊruː hˈoːneː ʋˈaːli həqˈuːmət məxˈaːlɪf mʊʐˈaːhɪr ˈoː pˌərmaːˈiːnɖi kaː rˌadeː amˈal kəmˈiːʃən tʰaː
```
Errors are dominated by fine-grained phonemic confusions (vowel quality, single consonant substitutions, stress shifts of one syllable), not by structural failure: word boundaries, segmental inventory, and overall phrase shape are recovered correctly even on out-of-distribution acoustics.
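One way to inspect such confusions is to list the word-level differences between a reference and a prediction. The sketch below uses `difflib.SequenceMatcher` from the standard library on the opening words of the first example; `word_edits` is a hypothetical helper, not part of the released code.

```python
from difflib import SequenceMatcher


def word_edits(ref: str, hyp: str):
    """Return (reference span, hypothesis span) pairs wherever the word sequences differ."""
    ref_w, hyp_w = ref.split(), hyp.split()
    edits = []
    matcher = SequenceMatcher(None, ref_w, hyp_w, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            edits.append((" ".join(ref_w[i1:i2]), " ".join(hyp_w[j1:j2])))
    return edits


gt = "ɪn keː dˈɔːr mˈẽ xəfˈiːf"
pred = "ɪn keː dˈuːr mˈẽ xəfˈiːb"
for ref_span, hyp_span in word_edits(gt, pred):
    print(f"{ref_span} -> {hyp_span}")
```

On this fragment the helper reports two single-word substitutions, one differing in vowel quality and one in the final consonant.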
Usage
Transformers pipeline
```python
from transformers import pipeline

pipe = pipeline(
    task="automatic-speech-recognition",
    model="mahwizzzz/aurix-v1",
    chunk_length_s=30,
)
result = pipe("path/to/urdu_audio.wav")
print(result["text"])
# Example: ʊrduː zəbˈaːn bəhʊt xuːbsuːrət hɛ
```
Direct model API
```python
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration
import soundfile as sf

processor = WhisperProcessor.from_pretrained("mahwizzzz/aurix-v1", language="Urdu", task="transcribe")
model = WhisperForConditionalGeneration.from_pretrained("mahwizzzz/aurix-v1").to("cuda")

# The model expects 16 kHz mono audio; resample the file beforehand if it differs.
audio, sr = sf.read("path/to/urdu_audio.wav")
inputs = processor(audio, sampling_rate=sr, return_tensors="pt").to("cuda")

with torch.no_grad():
    ids = model.generate(**inputs, language="urdu", task="transcribe")

print(processor.batch_decode(ids, skip_special_tokens=True)[0])
```
Acknowledgements
- Base model: OpenAI Whisper (whisper-large-v3-turbo).
- Out-of-distribution evaluation: Google FLEURS (ur_pk).
- Phonemizer: espeak-ng via the phonemizer Python package.
Citation
If you use this model in academic work, please cite it as follows.
```bibtex
@misc{aurix-v1-2026,
  title        = {aurix-v1: A Whisper based Urdu Speech to IPA Model},
  author       = {Mahwiz Khalil},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/mahwizzzz/aurix-v1}},
  note         = {Fine-tuned from openai/whisper-large-v3-turbo on synthetic Urdu TTS data; phonemized with espeak-ng.}
}
```
Please also cite the upstream artifacts:
```bibtex
@article{radford2023whisper,
  title   = {Robust Speech Recognition via Large-Scale Weak Supervision},
  author  = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  journal = {arXiv preprint arXiv:2212.04356},
  year    = {2022}
}

@inproceedings{conneau2023fleurs,
  title     = {{FLEURS}: Few-shot Learning Evaluation of Universal Representations of Speech},
  author    = {Conneau, Alexis and Ma, Min and Khanuja, Simran and Zhang, Yu and Axelrod, Vera and Dalmia, Siddharth and Riesa, Jason and Rivera, Clara and Bapna, Ankur},
  booktitle = {IEEE Spoken Language Technology Workshop (SLT)},
  year      = {2023}
}

@misc{espeak-ng,
  title        = {{eSpeak NG}: Open source speech synthesizer},
  howpublished = {\url{https://github.com/espeak-ng/espeak-ng}}
}
```