mahwizzzz

aurix-v1

Model Summary

| Field | Value |
|---|---|
| Base model | `openai/whisper-large-v3-turbo` |
| Architecture | Whisper encoder-decoder seq2seq (large-v3-turbo, ~809 M parameters) |
| Output target | IPA phonemes (no orthography) |
| Source language | Urdu (`ur`) |
| Audio | 16 kHz, mono, up to 30 s per chunk |
| Training data | ~91 k synthetic TTS utterances after filtering (≈ 165 h before filtering) |
| Training compute | NVIDIA RTX A6000 (48 GB), single GPU, ~10.4 h wall clock |
| Phonemizer | `espeak-ng` (via the `phonemizer` Python package) |

Intended Use

Primary intended uses are research-grade and offline processing:

- Automatic labeling of speech corpora with IPA for TTS training pipelines.
- Forced alignment between Urdu audio and a known phonemic transcription.
- Computer-assisted pronunciation training and pronunciation error detection.
- Phonological and dialect-variation studies on Urdu speech.
- Generation of phoneme-level features for downstream models.

This model is not a substitute for an orthographic Urdu ASR. Its output is phonetic and is not intended to be read as Urdu text.

Limitations

- Training data is synthetic. Audio was produced by neural TTS systems and therefore differs acoustically from spontaneous human speech (prosodic regularity, low noise, narrow speaker inventory). Real-speech WER on FLEURS (0.40) is roughly four times the in-loop development WER on the synthetic distribution (0.10); see Evaluation.
- The phonemizer determines the label space. Reference transcriptions were generated by espeak-ng (Urdu rule set), so the model inherits any systematic errors or idiosyncrasies of that grapheme-to-phoneme system. In particular, dialectal variants that espeak-ng does not produce will not appear in the model's output distribution.
- Speaker and channel diversity is limited. The synthetic data covers a small set of TTS voices and recording conditions. Accented speech, noisy channels, code-switched English/Urdu, and rapid spontaneous speech are out of distribution.
- No timestamps. This release does not produce word- or phoneme-level alignment timestamps. For alignment, pair this model with a forced-alignment tool over its IPA output.

Training Data

Two synthetic-speech Urdu corpora were used:

| Source | Utterances | Approx. audio |
|---|---|---|
| `mahwizzzz/syn-ur` | 7,963 | ~14 h |
| `mahwizzzz/syn-ur-2` | 85,327 | ~151 h |
| Total | 93,290 | ~165 h |

Audio was extracted via the `datasets` `Audio` feature, resampled to 16 kHz mono, and stored as PCM-16 WAV. The accompanying Urdu transcripts were normalized: Arabic diacritics in the ranges U+064B–U+065F, U+0670, and U+06D6–U+06ED were removed, zero-width joiners were collapsed, and whitespace was normalized.

Transcripts were converted to IPA using `phonemizer` with the espeak backend (`language="ur"`, `with_stress=True`). Empty phonemizations and transcripts exceeding 448 tokens after Whisper tokenization were filtered out, yielding 91,017 training examples. The resulting IPA inventory contains 61 distinct characters, including the retroflex consonant set (ʈ ɖ ɽ ɳ ʂ ʐ), aspirated consonants marked with ʰ, primary (ˈ) and secondary (ˌ) stress, the long-vowel marker (ː), nasalization (combining tilde), the uvular fricative (χ), the palatal stop (ɟ), the labiodental approximant (ʋ), and the standard Urdu vowel space.
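The normalization and phonemization steps described above can be sketched roughly as follows. This is a minimal illustration, not the author's preprocessing script: the function names `normalize_urdu` and `to_ipa` are hypothetical, "zero-width joiners" is assumed to mean U+200C and U+200D and to be handled by deletion, and `strip=True` is an assumed setting. Running `to_ipa` requires the `phonemizer` package and an espeak-ng installation.

```python
import re

# Arabic diacritic ranges listed in the card: U+064B-U+065F, U+0670, U+06D6-U+06ED.
_DIACRITICS = re.compile("[\u064B-\u065F\u0670\u06D6-\u06ED]")
# Assumption: "zero-width joiners" covers U+200C (ZWNJ) and U+200D (ZWJ).
_ZERO_WIDTH = re.compile("[\u200C\u200D]")


def normalize_urdu(text: str) -> str:
    """Remove diacritics and zero-width joiners, then collapse whitespace."""
    text = _DIACRITICS.sub("", text)
    text = _ZERO_WIDTH.sub("", text)
    return " ".join(text.split())


def to_ipa(text: str) -> str:
    """Phonemize normalized Urdu text with the espeak backend."""
    # Imported lazily so the normalizer works without phonemizer installed.
    from phonemizer import phonemize

    return phonemize(
        normalize_urdu(text),
        language="ur",
        backend="espeak",
        with_stress=True,
        strip=True,  # assumption: drop trailing separators
    )
```

Keeping normalization separate from phonemization makes it possible to unit-test the text cleanup without espeak-ng present.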

Phonemization Pipeline

Reference IPA is produced offline before training, not at inference time. The pipeline is deterministic:

```text
Urdu script  -->  diacritic / ZWJ stripping  -->  espeak-ng (ur, with_stress=True)  -->  IPA token stream
```

This contract means a model output can be compared character-wise against the IPA produced by passing the corresponding gold Urdu text through the same espeak-ng configuration.
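As an illustration of that character-wise comparison, the sketch below computes CER and WER with a plain Levenshtein distance. It is an assumed scoring recipe, not the evaluation code behind the reported numbers; the function names are hypothetical, and characters are compared as Unicode code points, so a combining tilde counts as its own character.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences (strings or token lists)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(
                prev[j] + 1,             # deletion
                curr[j - 1] + 1,         # insertion
                prev[j - 1] + (r != h),  # substitution or match
            ))
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance over reference length."""
    return edit_distance(ref, hyp) / max(len(ref), 1)


def wer(ref: str, hyp: str) -> float:
    """Word error rate over space-separated IPA tokens."""
    ref_words = ref.split()
    return edit_distance(ref_words, hyp.split()) / max(len(ref_words), 1)
```

For the fragment `ɪn keː dˈɔːr` against the hypothesis `ɪn keː dˈuːr`, there is one substitution over 12 reference characters and one wrong word out of three.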

Training Procedure

The fine-tune was performed with transformers.Seq2SeqTrainer.

| Hyperparameter | Value |
|---|---|
| Initialization | `openai/whisper-large-v3-turbo` |
| Tokenizer language token | `urdu` |
| Optimizer | AdamW (Transformers default) |
| Learning rate | 1e-5, linear decay to 0 |
| Per-device train batch size | 4 |
| Gradient accumulation steps | 64 |
| Effective batch size | 256 |
| Number of epochs | |

The data collator processes raw audio on the fly: features are extracted with the Whisper feature extractor, labels are tokenized with the Whisper tokenizer set to `language="urdu"`, and label sequences are padded with `-100` so that padded positions are ignored by the loss. `dataloader_num_workers=0` is enforced because the Hugging Face `Audio` column is not fork-safe under multiple workers.
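The label-padding step can be illustrated in isolation. The sketch below is a pure-Python stand-in, not the collator used for training: a real collator would return tensors and also batch the audio features, and the helper name `pad_labels` is hypothetical. The value -100 is the default `ignore_index` of PyTorch's cross-entropy loss, which is why padded positions contribute nothing to the loss.

```python
IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def pad_labels(label_seqs, ignore_index=IGNORE_INDEX):
    """Right-pad lists of token ids to a common length with the ignore index."""
    max_len = max(len(seq) for seq in label_seqs)
    return [seq + [ignore_index] * (max_len - len(seq)) for seq in label_seqs]


# Toy token ids, for illustration only.
batch = pad_labels([[11, 12, 13, 14], [21, 22]])
print(batch)
```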

Evaluation

Two evaluation distributions are reported.

In-distribution (synthetic held-out)

A random 500-utterance slice of the synthetic training set was held out and evaluated every 200 steps during training. Final scores:

| Metric | Value |
|---|---|
| eval loss | 0.0260 |
| eval CER | 0.0825 |
| eval WER | 0.1041 |

Out-of-distribution (real human speech, FLEURS)

The model was evaluated on the 299-utterance test split of the `google/fleurs` configuration `ur_pk`. Reference IPA for FLEURS was produced with the same espeak-ng pipeline used in training, so the comparison is like-for-like at the phonemic level. The metrics reported are:

- CER: character error rate on the IPA stream.
- WER: word error rate, where words are space-separated IPA tokens.
- SER: stress error rate, defined as the fraction of aligned word pairs in which the position of the primary stress marker ˈ differs between hypothesis and reference.
- VER: vowel error rate, defined as the CER computed after restricting both hypothesis and reference to vowel characters.

| Metric | Value |
|---|---|
| CER | 0.1833 |
| WER | 0.3995 |
| SER | 0.4874 |
| VER | 0.1457 |
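The SER and VER definitions leave some details open, so the following is only one possible reading of them. Assumptions made here that the card does not state: words are aligned with `difflib.SequenceMatcher` on stress-stripped tokens, the stress position is the character offset of ˈ within the word, and the vowel set `VOWELS` is a small illustrative subset rather than the model's full vowel inventory.

```python
from difflib import SequenceMatcher

PRIMARY_STRESS = "\u02c8"        # ˈ
STRESS_MARKS = "\u02c8\u02cc"    # ˈ and ˌ
VOWELS = set("aeiouɪʊɛɔəʌæ")     # assumption: illustrative vowel subset


def edit_distance(ref, hyp):
    """Levenshtein distance between two sequences."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = curr
    return prev[-1]


def strip_stress(word):
    return "".join(ch for ch in word if ch not in STRESS_MARKS)


def stress_error_rate(ref: str, hyp: str) -> float:
    """Fraction of aligned word pairs whose primary-stress offset differs."""
    ref_words, hyp_words = ref.split(), hyp.split()
    matcher = SequenceMatcher(
        a=[strip_stress(w) for w in ref_words],
        b=[strip_stress(w) for w in hyp_words],
        autojunk=False,
    )
    pairs = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag in ("equal", "replace"):  # skip pure insertions and deletions
            pairs.extend(zip(ref_words[i1:i2], hyp_words[j1:j2]))
    if not pairs:
        return 0.0
    errors = sum(r.find(PRIMARY_STRESS) != h.find(PRIMARY_STRESS) for r, h in pairs)
    return errors / len(pairs)


def vowel_error_rate(ref: str, hyp: str) -> float:
    """CER computed after keeping only vowel characters on both sides."""
    ref_v = [ch for ch in ref if ch in VOWELS]
    hyp_v = [ch for ch in hyp if ch in VOWELS]
    return edit_distance(ref_v, hyp_v) / max(len(ref_v), 1)
```

A different alignment strategy or vowel set would change the resulting numbers, so values from this sketch should not be expected to match the table exactly.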

The gap between the synthetic dev WER (0.10) and the FLEURS real-speech WER (0.40) is a direct consequence of the synthetic-only training distribution. Continued training on real Urdu speech (for example, Common Voice Urdu) is expected to narrow this gap substantially.

Example Predictions on FLEURS

```markdown
ID:   fleurs_000000
GT:   ɪn keː dˈɔːr mˈẽ xəfˈiːf təlˈoːs ˈeːzaː mˈʌsəla nahˈiːn tʰaː ɟˈeːsaː ke woː ˈaːɟ hɛ ...
Pred: ɪn keː dˈuːr mˈẽ xəfˈiːb təlˈoː ˈeːsaː mˈʌsəla nahˈiːn tʰaː ɟˈeːsaː keː woː ˈaːɟ hɛ ...

ID:   fleurs_000003
GT:   ˌaktˈoːbər mˈẽ ʃˈʊruː hˈoːneː ʋˈaːleː hˌʊkuːmˈat mʊxˈaːlɪf mʊʐˈaːhɪrˌõː pˈʌr mˈaːrʈilˌiː kaː ɾˌadeː amˈal kəmˈiːʃən tʰaː
Pred: woː ˌəktˈuːbər mˈẽ ʃˈʊruː hˈoːneː ʋˈaːli həqˈuːmət məxˈaːlɪf mʊʐˈaːhɪr ˈoː pˌərmaːˈiːnɖi kaː rˌadeː amˈal kəmˈiːʃən tʰaː
```

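Errors are dominated by fine-grained phonemic confusions (vowel quality, single-consonant substitutions, stress shifts of one syllable) rather than structural failure: the segmental inventory and overall phrase shape are recovered even on out-of-distribution acoustics. Word boundaries are mostly preserved, although the examples above also show a boundary shift (təlˈoːs ˈeːzaː → təlˈoː ˈeːsaː), a split word (mʊʐˈaːhɪrˌõː → mʊʐˈaːhɪr ˈoː), a merged pair (pˈʌr mˈaːrʈilˌiː → pˌərmaːˈiːnɖi), and one inserted word (woː).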

Usage

Transformers pipeline

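```python
from transformers import pipeline

# chunk_length_s=30 matches the 30 s window stated in the Model Summary.
pipe = pipeline(
    task="automatic-speech-recognition",
    model="mahwizzzz/aurix-v1",
    chunk_length_s=30,
)

# Given a file path, the pipeline loads the audio and resamples it to 16 kHz.
result = pipe("path/to/urdu_audio.wav")
print(result["text"])  # IPA string, not Urdu script
# Example: ʊrduː zəbˈaːn bəhʊt xuːbsuːrət hɛ
```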

Direct model API

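```python
import soundfile as sf
import torch
from transformers import WhisperProcessor, WhisperForConditionalGeneration

# Fall back to CPU when no GPU is available.
device = "cuda" if torch.cuda.is_available() else "cpu"

processor = WhisperProcessor.from_pretrained("mahwizzzz/aurix-v1", language="Urdu", task="transcribe")
model = WhisperForConditionalGeneration.from_pretrained("mahwizzzz/aurix-v1").to(device)

audio, sr = sf.read("path/to/urdu_audio.wav", dtype="float32")
if audio.ndim > 1:
    audio = audio.mean(axis=1)  # downmix multi-channel audio to mono
if sr != 16000:
    # The Whisper feature extractor rejects other sampling rates.
    raise ValueError(f"Expected 16 kHz audio, got {sr} Hz; resample first.")

inputs = processor(audio, sampling_rate=sr, return_tensors="pt").to(device)
with torch.no_grad():
    ids = model.generate(**inputs, language="urdu", task="transcribe")
print(processor.batch_decode(ids, skip_special_tokens=True)[0])
```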

Acknowledgements

- Base model: OpenAI Whisper (`whisper-large-v3-turbo`).
- Out-of-distribution evaluation: Google FLEURS (`ur_pk`).
- Phonemizer: espeak-ng via the `phonemizer` Python package.

Citation

If you use this model in academic work, please cite it as follows.

```bibtex
@misc{aurix-v1-2026,
  title        = {aurix-v1: A Whisper based Urdu Speech to IPA Model},
  author       = {Mahwiz Khalil},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/mahwizzzz/aurix-v1}},
  note         = {Fine-tuned from openai/whisper-large-v3-turbo on synthetic Urdu TTS data; phonemized with espeak-ng.}
}
```

Please also cite the upstream artifacts:

```bibtex
@article{radford2023whisper,
  title   = {Robust Speech Recognition via Large-Scale Weak Supervision},
  author  = {Radford, Alec and Kim, Jong Wook and Xu, Tao and Brockman, Greg and McLeavey, Christine and Sutskever, Ilya},
  journal = {arXiv preprint arXiv:2212.04356},
  year    = {2022}
}

@inproceedings{conneau2023fleurs,
  title     = {{FLEURS}: Few-shot Learning Evaluation of Universal Representations of Speech},
  author    = {Conneau, Alexis and Ma, Min and Khanuja, Simran and Zhang, Yu and Axelrod, Vera and Dalmia, Siddharth and Riesa, Jason and Rivera, Clara and Bapna, Ankur},
  booktitle = {IEEE Spoken Language Technology Workshop (SLT)},
  year      = {2023}
}

@misc{espeak-ng,
  title        = {{eSpeak NG}: Open source speech synthesizer},
  howpublished = {\url{https://github.com/espeak-ng/espeak-ng}}
}
```


