
License: MIT

Model Summary

| Field | Value |
| --- | --- |
| Base model | openai/whisper-large-v3-turbo |
| Architecture | Whisper encoder-decoder seq2seq (large-v3-turbo, ~809 M parameters) |
| Output target | IPA phonemes (no orthography) |
| Source language | Urdu (ur) |
| Audio | 16 kHz, mono, up to 30 s per chunk |
| Training data | 91 k synthetic TTS utterances (≈ 165 h) |
| Training compute | NVIDIA RTX A6000 (48 GB), single GPU, ~10.4 h wall clock |
| Phonemizer | espeak-ng (via the phonemizer Python package) |

Intended Use

The primary intended uses are research and offline processing:

  • Automatic labeling of speech corpora with IPA for TTS training pipelines.
  • Supplying IPA transcriptions for forced alignment of Urdu audio against a known phonemic transcription (the model itself emits no timestamps; see Limitations).
  • Computer-assisted pronunciation training and pronunciation error detection.
  • Phonological and dialect-variation studies on Urdu speech.
  • Generation of phoneme-level features for downstream models.

This model is not a substitute for an orthographic Urdu ASR. Its output is phonetic and is not intended to be read as Urdu text.

Limitations

  • Training data is synthetic. Audio was produced by neural TTS systems and therefore differs in acoustic characteristics from spontaneous human speech (prosodic regularity, low noise, narrow speaker inventory). Real-speech word error rate on FLEURS is substantially higher than the in-loop development WER on the synthetic distribution (see Evaluation).
  • Phonemizer determines the label space. Reference transcriptions were generated by espeak-ng (Urdu rule set). The model therefore inherits any systematic errors or idiosyncrasies of that grapheme-to-phoneme system. In particular, dialectal variants not produced by espeak-ng will not appear in the model's output distribution.
  • Speaker and channel diversity is limited. The synthetic data covers a small set of TTS voices and recording conditions. Accented speech, noisy channels, code-switched English/Urdu, and rapid spontaneous speech are out of distribution.
  • No timestamps. This release does not produce word- or phoneme-level alignment timestamps. For alignment, pair this model with a forced-alignment tool over its IPA output.

Training Data

Two synthetic-speech Urdu corpora were used:

| Source | Utterances | Approx. audio |
| --- | --- | --- |
| mahwizzzz/syn-ur | 7,963 | ~14 h |
| mahwizzzz/syn-ur-2 | 85,327 | ~151 h |
| Total | 93,290 | ~165 h |

Audio was extracted via the datasets Audio feature, resampled to 16 kHz mono, and stored as PCM-16 WAV. The accompanying Urdu transcripts were normalized (removal of Arabic diacritics in the range U+064B–U+065F, U+0670, U+06D6–U+06ED; collapse of zero-width joiners; whitespace normalization).
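The transcript normalization described above could look roughly like the following. This is a minimal sketch, not the author's actual code: the function name `normalize_urdu` is hypothetical, and "collapse of zero-width joiners" is interpreted here as removing U+200C and U+200D, which is an assumption.

```python
import re

# Arabic diacritic ranges listed in the text above.
_DIACRITICS = re.compile("[\u064B-\u065F\u0670\u06D6-\u06ED]")
# Assumption: zero-width non-joiner (U+200C) and joiner (U+200D) are dropped.
_ZERO_WIDTH = re.compile("[\u200C\u200D]")
_WHITESPACE = re.compile(r"\s+")


def normalize_urdu(text):
    """Strip diacritics and zero-width joiners, then normalize whitespace."""
    text = _DIACRITICS.sub("", text)
    text = _ZERO_WIDTH.sub("", text)
    return _WHITESPACE.sub(" ", text).strip()
```

Applying the same normalization to any gold Urdu text before phonemizing keeps reference IPA comparable with the training labels.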

Transcripts were converted to IPA using phonemizer with the espeak backend (language="ur", with_stress=True). Empty phonemizations and transcripts exceeding 448 tokens after tokenization with Whisper were filtered, yielding 91,017 training examples. The resulting IPA inventory contains 61 distinct characters, including the Urdu retroflex consonant set (ʈ ɖ ɽ ɳ ʂ ʐ), aspirated consonants marked with ʰ, primary (ˈ) and secondary (ˌ) stress, long-vowel marker (ː), nasalization (combining tilde), velar fricative (χ), palatal stop (ɟ), labial approximant (ʋ), and the standard Urdu vowel space.
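A hedged sketch of the phonemization and filtering step follows. The `phonemize` call uses the documented `phonemizer` API with the settings named above; `strip=True` is an assumption, as are the helper names `phonemize_urdu` and `keep_example`. The token counter is passed in as a callable (for example, one that wraps the Whisper tokenizer) so the filter logic stays independent of any model download.

```python
def phonemize_urdu(texts):
    """Convert normalized Urdu strings to IPA with the espeak backend."""
    # Imported lazily: requires the phonemizer package and an espeak-ng install.
    from phonemizer import phonemize

    return phonemize(
        texts,
        language="ur",
        backend="espeak",
        with_stress=True,
        strip=True,  # assumption: trailing separators are stripped
    )


def keep_example(ipa, count_tokens, max_tokens=448):
    """Return True if an IPA label passes both filters described above."""
    if not ipa.strip():
        return False  # empty phonemization
    return count_tokens(ipa) <= max_tokens  # label length limit
```

With a loaded Whisper tokenizer, `count_tokens` could be something like `lambda s: len(tokenizer(s).input_ids)`; whether special tokens were counted toward the 448 limit is not stated in the source.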

Phonemization Pipeline

Reference IPA is produced offline before training, not at inference time. The pipeline is deterministic:

text

Urdu script --> diacritic / ZWJ stripping --> espeak-ng (ur, with_stress=True) --> IPA token stream

This contract means a model output can be compared character-wise against the IPA produced by passing the corresponding gold Urdu text through the same espeak-ng configuration.

Training Procedure

The fine-tune was performed with transformers.Seq2SeqTrainer.

| Hyperparameter | Value |
| --- | --- |
| Initialization | openai/whisper-large-v3-turbo |
| Tokenizer language token | urdu |
| Optimizer | AdamW (Transformers default) |
| Learning rate | 1e-5, linear decay to 0 |
| Per-device train batch size | 4 |
| Gradient accumulation steps | 64 |
| Effective batch size | 256 |
| Number of epochs | 3 |
| Total optimizer steps | 1,062 |
| Mixed precision | fp16 |
| Gradient checkpointing | disabled (incompatible with the custom collator) |
| Max label tokens | 448 (filter applied at data load) |
| Eval set | 500 random held-out training utterances (synthetic) |
| Eval / save cadence | every 200 steps; load_best_model_at_end=True, metric_for_best_model="wer" |
| Wall-clock training time | 10 h 23 m 52 s |
| Final training loss | 0.0673 |
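The effective batch size and step count in the table can be cross-checked with a few lines of arithmetic. This sketch assumes the 500 evaluation utterances were taken out of the 91,017 filtered examples and that a partial accumulation window at the end of an epoch counts as one optimizer step; under those assumptions it reproduces the reported figures.

```python
import math

filtered_examples = 91_017
eval_examples = 500          # assumption: held out from the filtered pool
per_device_batch = 4
grad_accum_steps = 64
epochs = 3

train_examples = filtered_examples - eval_examples
effective_batch = per_device_batch * grad_accum_steps
steps_per_epoch = math.ceil(train_examples / effective_batch)
total_steps = steps_per_epoch * epochs

print(effective_batch, steps_per_epoch, total_steps)  # 256 354 1062
```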

The data collator processes raw audio on the fly: features are extracted with the Whisper feature extractor, labels are tokenized with the Whisper tokenizer set to language="urdu", and label sequences are padded with -100 so that padded positions are ignored by the loss. dataloader_num_workers=0 is enforced because the Hugging Face Audio column is not fork-safe under multiple workers.
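The label-padding part of the collator can be illustrated in isolation. This is a simplified, pure-Python sketch (the function name `pad_labels` is hypothetical, and the real collator presumably returns tensors); it only shows how shorter label sequences are filled with -100, the value that PyTorch's cross-entropy loss ignores by default.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def pad_labels(label_ids, ignore_index=IGNORE_INDEX):
    """Right-pad each token-id list to the longest one in the batch."""
    max_len = max(len(ids) for ids in label_ids)
    return [ids + [ignore_index] * (max_len - len(ids)) for ids in label_ids]


batch = pad_labels([[11, 12, 13], [21]])
print(batch)  # [[11, 12, 13], [21, -100, -100]]
```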

Evaluation

Two evaluation distributions are reported.

In-distribution (synthetic held-out)

A 500-utterance random slice of the synthetic training set was held out and evaluated every 200 steps during training. Final scores:

| Metric | Value |
| --- | --- |
| eval loss | 0.0260 |
| eval CER | 0.0825 |
| eval WER | 0.1041 |

Out-of-distribution (real human speech, FLEURS)

The model was evaluated on the 299-utterance test split of google/fleurs, configuration ur_pk. Reference IPA for FLEURS was produced with the same espeak-ng pipeline used in training, so this is a like-for-like comparison at the phonemic level. The metrics reported are:

  • CER: character error rate on the IPA stream.
  • WER: word error rate, where words are space-separated IPA tokens.
  • SER: stress error rate, defined as the fraction of aligned word pairs in which the position of the primary stress marker ˈ differs between hypothesis and reference.
  • VER: vowel error rate, defined as the CER computed after restricting both hypothesis and reference to vowel characters.

| Metric | Value |
| --- | --- |
| CER | 0.1833 |
| WER | 0.3995 |
| SER | 0.4874 |
| VER | 0.1457 |
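The four metrics defined above can be sketched as follows. This is an illustrative reading of the definitions, not the evaluation script behind the reported numbers. Three things are assumptions: the vowel set in `VOWELS`, the word alignment used for SER (here `difflib.SequenceMatcher` over stress-stripped words), and the use of the character index of ˈ as the "position" of primary stress.

```python
import difflib

PRIMARY, SECONDARY = "ˈ", "ˌ"
# Assumed vowel inventory; the exact set used for the reported VER is not given.
VOWELS = set("aeiouəɪʊɛɔʌæ")


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution or match
        prev = curr
    return prev[-1]


def error_rate(ref, hyp):
    """Edit distance normalized by reference length."""
    return edit_distance(ref, hyp) / max(len(ref), 1)


def cer(ref, hyp):
    return error_rate(ref, hyp)


def wer(ref, hyp):
    return error_rate(ref.split(), hyp.split())


def ver(ref, hyp):
    vowels_only = lambda s: "".join(c for c in s if c in VOWELS)
    return error_rate(vowels_only(ref), vowels_only(hyp))


def ser(ref, hyp):
    """Fraction of aligned word pairs whose primary-stress position differs."""
    ref_words, hyp_words = ref.split(), hyp.split()
    unstressed = lambda w: w.replace(PRIMARY, "").replace(SECONDARY, "")
    matcher = difflib.SequenceMatcher(
        None,
        [unstressed(w) for w in ref_words],
        [unstressed(w) for w in hyp_words],
        autojunk=False,
    )
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("equal", "replace"):  # skip pure insertions and deletions
            pairs.extend(zip(ref_words[i1:i2], hyp_words[j1:j2]))
    if not pairs:
        return 0.0
    wrong = sum(r.find(PRIMARY) != h.find(PRIMARY) for r, h in pairs)
    return wrong / len(pairs)
```

Because SER is computed only over aligned word pairs, inserted or deleted words affect WER but not SER under this reading.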

The gap between the synthetic dev WER (0.1041) and the FLEURS real-speech WER (0.3995), roughly a fourfold increase, is most plausibly explained by the synthetic-only training distribution. Continued training on real Urdu speech (for example, Common Voice Urdu) is expected to narrow this gap substantially.

Example Predictions on FLEURS

markdown

ID: fleurs_000000
GT: ɪn keː dˈɔːr mˈẽ xəfˈiːf təlˈoːs ˈeːzaː mˈʌsəla nahˈiːn tʰaː ɟˈeːsaː ke woː ˈaːɟ hɛ ...
Pred: ɪn keː dˈuːr mˈẽ xəfˈiːb təlˈoː ˈeːsaː mˈʌsəla nahˈiːn tʰaː ɟˈeːsaː keː woː ˈaːɟ hɛ ...

ID: fleurs_000003
GT: ˌaktˈoːbər mˈẽ ʃˈʊruː hˈoːneː ʋˈaːleː hˌʊkuːmˈat mʊxˈaːlɪf mʊʐˈaːhɪrˌõː pˈʌr mˈaːrʈilˌiː kaː ɾˌadeː amˈal kəmˈiːʃən tʰaː
Pred: woː ˌəktˈuːbər mˈẽ ʃˈʊruː hˈoːneː ʋˈaːli həqˈuːmət məxˈaːlɪf mʊʐˈaːhɪr ˈoː pˌərmaːˈiːnɖi kaː rˌadeː amˈal kəmˈiːʃən tʰaː

Errors are dominated by fine-grained phonemic confusions (vowel quality, single consonant substitutions, stress shifts of one syllable), not by structural failure: the segmental inventory and overall phrase shape are recovered even on out-of-distribution acoustics. Word boundaries are mostly preserved, although the second example shows an inserted word and a split and a merge around mʊʐˈaːhɪrˌõː pˈʌr mˈaːrʈilˌiː.
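For error analysis of this kind, a word-level diff between reference and prediction is often enough to see which tokens were substituted, inserted, or dropped. The helper below is a hypothetical sketch built on the standard-library `difflib`; it is not part of the model release.

```python
import difflib


def word_diffs(ref, hyp):
    """List (operation, reference words, predicted words) for each mismatch."""
    ref_words, hyp_words = ref.split(), hyp.split()
    matcher = difflib.SequenceMatcher(None, ref_words, hyp_words, autojunk=False)
    diffs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            diffs.append((tag,
                          " ".join(ref_words[i1:i2]),
                          " ".join(hyp_words[j1:j2])))
    return diffs


# First five words of the fleurs_000000 example above.
for diff in word_diffs("ɪn keː dˈɔːr mˈẽ xəfˈiːf", "ɪn keː dˈuːr mˈẽ xəfˈiːb"):
    print(diff)
```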

Usage

Transformers pipeline

python

from transformers import pipeline

pipe = pipeline(
    task="automatic-speech-recognition",
    model="mahwizzzz/aurix-v1",
    chunk_length_s=30,
)
result = pipe("path/to/urdu_audio.wav")
print(result["text"])
# Example: ʊrduː zəbˈaːn bəhʊt xuːbsuːrət hɛ

Direct model API

python

import torch
import soundfile as sf
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# Fall back to CPU when no GPU is available.
device = "cuda" if torch.cuda.is_available() else "cpu"
processor = WhisperProcessor.from_pretrained("mahwizzzz/aurix-v1", language="Urdu", task="transcribe")
model = WhisperForConditionalGeneration.from_pretrained("mahwizzzz/aurix-v1").to(device)

# The model expects 16 kHz mono audio; resample and downmix beforehand if the file differs.
audio, sr = sf.read("path/to/urdu_audio.wav", dtype="float32")
inputs = processor(audio, sampling_rate=sr, return_tensors="pt").to(device)
with torch.no_grad():
    ids = model.generate(**inputs, language="urdu", task="transcribe")
print(processor.batch_decode(ids, skip_special_tokens=True)[0])
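Since the direct API path does not resample or chunk audio for you, it can help to verify a file against the input format in the Model Summary (16 kHz, mono, up to 30 s) before inference. The following is a small sketch using only the standard-library `wave` module; the function name `check_wav` is hypothetical, and it only handles uncompressed PCM WAV files.

```python
import wave


def check_wav(path, expected_rate=16_000, max_seconds=30.0):
    """Report whether a PCM WAV file matches the model's expected input format."""
    with wave.open(path, "rb") as wav:
        rate = wav.getframerate()
        channels = wav.getnchannels()
        duration = wav.getnframes() / rate
    return {
        "sample_rate_ok": rate == expected_rate,
        "mono": channels == 1,
        "single_chunk": duration <= max_seconds,
        "duration_s": duration,
    }
```

Files that fail the sample-rate or mono check should be converted first; files longer than 30 s are better handled through the pipeline with chunk_length_s=30.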

Acknowledgements

  • Base model: OpenAI Whisper (whisper-large-v3-turbo).
  • Out-of-distribution evaluation: Google FLEURS (ur_pk).
  • Phonemizer: espeak-ng via the phonemizer Python package.

Citation

If you use this model in academic work, please cite it as follows.

bibtex

@misc{aurix-v1-2026,
title = {aurix-v1: A Whisper based Urdu Speech to IPA Model},
author = {Mahwiz Khalil},
year = {2026},
howpublished = {\url{https://huggingface.co/mahwizzzz/aurix-v1}},
note = {Fine-tuned from openai/whisper-large-v3-turbo on synthetic Urdu TTS data; phonemized with espeak-ng.}
}

Please also cite the upstream artifacts:

bibtex

@article{radford2023whisper,
title = {Robust Speech Recognition via Large-Scale Weak Supervision},
author = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
journal = {arXiv preprint arXiv:2212.04356},
year = {2022}
}
@inproceedings{conneau2023fleurs,
title = {{FLEURS}: Few-shot Learning Evaluation of Universal Representations of Speech},
author = {Conneau, Alexis and Ma, Min and Khanuja, Simran and Zhang, Yu and Axelrod, Vera and Dalmia, Siddharth and Riesa, Jason and Rivera, Clara and Bapna, Ankur},
booktitle = {IEEE Spoken Language Technology Workshop (SLT)},
year = {2023}
}
@misc{espeak-ng,
title = {{eSpeak NG}: Open source speech synthesizer},
howpublished = {\url{https://github.com/espeak-ng/espeak-ng}}
}
