moorlee

qwen3-asr-0.6b-hinglish


License: apache-2.0

🧬 Srota model family

Variant	Best for	Specialty
Srota (union)	General Hinglish (recommended default)	conversational + tutorial
Srota-Conv	Conversational Hinglish only	HiACC specialist
Srota-Tutorial	Technical tutorial speech only	OpenSLR-104 specialist

You are viewing Srota (the union model, recommended for general use).

ℹ️ What is Srota?

Srota is an automatic speech recognition (ASR) model for Hinglish (Hindi-English code-switched speech) that transcribes into natural mixed Devanagari + Latin script. It is a full-parameter fine-tune of Qwen/Qwen3-ASR-0.6B, trained on the union of conversational and technical-tutorial Hinglish speech. It improves over the base model on both domains at once: −8.88 pp word error rate (WER) on conversational speech and −15.60 pp on tutorial speech.

On the size. The base model's name, Qwen3-ASR-0.6B, refers to its LLM backbone (Qwen3-0.6B, ~600M parameters). The full speech model adds a ~180M AuT audio encoder and a small projector, for ~780M parameters total. Srota is a full-parameter fine-tune of all of them: there are no LoRA adapters and no frozen layers, every native weight is updated.

Project. Built by the team behind susrota.com, a voice-dictation tool that currently runs in English. Srota will power its upcoming Hinglish support; the live product does not run this model yet.

Try it in the live demo.

✨ Highlights

Beats the base on both domains. Conversational HiACC 24.73% → 15.85%; tutorial OpenSLR-104 50.66% → 35.06%.
One model, two domains. Unlike a single-domain specialist, Srota does not trade one domain off against the other; it eliminates the negative transfer seen when training on tutorials alone (see the Evaluation section).
Native Hinglish output. Emits Devanagari for Hindi words and Latin for English words, the way Hinglish is actually written (e.g. मेरा favourite festival Diwali है).
Compact. ~780M parameters; runs on a single GPU in bf16.
Honest lineage. A full-parameter fine-tune of Qwen3-ASR-0.6B (the ~180M AuT audio encoder, the projector, and the Qwen3-0.6B LLM): no frozen layers, no LoRA adapters. The extra ~180M over the "0.6B" name is the audio encoder, not LoRA.
Open. Apache-2.0; both training corpora are CC BY 4.0.

🎧 Srota in action

Real examples from the test set. The base Qwen3-ASR-0.6B transliterates English words into Devanagari (wrong for Hinglish); Srota keeps the natural mixed script.

	Base Qwen3-ASR-0.6B	Srota
A	`तो डेट्स वाइ आई`	`तो that's why I`
B	`इन दिहार ऑफ अबस्लिंग सिटी दो सिब्लिंग रहते थे`	`In the heart of a bustling city दो siblings रहते थे`
C	`ओके सो मेरा होम टाउन न्यू देल्ही है`	`Okay so मेरा hometown New Delhi है`

The base collapses code-switched English into Devanagari transliteration; Srota preserves how Hinglish is actually written.
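
The script difference in the examples above can be checked mechanically. The sketch below is illustrative only and is not part of the Srota tooling: `script_counts` is a hypothetical helper that labels each whitespace-separated token as Devanagari (any character in U+0900 to U+097F), Latin (any ASCII letter), or other.

```python
def script_counts(text: str) -> dict[str, int]:
    """Count whitespace-separated tokens by the script they contain."""
    counts = {"devanagari": 0, "latin": 0, "other": 0}
    for token in text.split():
        if any("\u0900" <= ch <= "\u097f" for ch in token):
            counts["devanagari"] += 1
        elif any(ch.isascii() and ch.isalpha() for ch in token):
            counts["latin"] += 1
        else:
            counts["other"] += 1
    return counts


# Example A from the table above.
base_a = "तो डेट्स वाइ आई"
srota_a = "तो that's why I"

print(script_counts(base_a))   # {'devanagari': 4, 'latin': 0, 'other': 0}
print(script_counts(srota_a))  # {'devanagari': 1, 'latin': 3, 'other': 0}
```

A transcript with zero Latin tokens for an utterance known to contain English is a quick sign that code-switched words were transliterated rather than kept in Latin script.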

Try your own audio in the live demo.

📊 Results

Model	HiACC test (conversational, 1,036 utts)	OpenSLR-104 test (tutorial, 3,132 utts)
Qwen3-ASR-0.6B (base, zero-shot)	24.73%	50.66%
HiACC-only fine-tune (v1)	14.23%	not evaluated (expected ≈ base)
OpenSLR-only fine-tune (v2)	37.64% (worse than base)	32.83%
Srota (union, this model)	15.85%	35.06%
Srota Δ vs base	−8.88 pp	−15.60 pp

Srota is the only fine-tune shown to beat the base on both test sets (v1 was not evaluated on OpenSLR-104). It gives up only ~1.6 pp to the conversational specialist and ~2.2 pp to the tutorial specialist, which is the small, expected cost of being a generalist.
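
The deltas quoted here follow directly from the table. As a worked check, the snippet below recomputes them from the reported WER values; the dictionary keys and the `delta_pp` helper are names chosen for this example only.

```python
# WER (%) copied from the results table above.
WER = {
    "base": {"hiacc": 24.73, "openslr104": 50.66},
    "hiacc_only_v1": {"hiacc": 14.23},  # not evaluated on OpenSLR-104
    "openslr_only_v2": {"hiacc": 37.64, "openslr104": 32.83},
    "srota": {"hiacc": 15.85, "openslr104": 35.06},
}


def delta_pp(model: str, reference: str, test_set: str) -> float:
    """WER difference in percentage points; negative means `model` is better."""
    return round(WER[model][test_set] - WER[reference][test_set], 2)


for test_set in ("hiacc", "openslr104"):
    print(test_set, "Srota vs base:", delta_pp("srota", "base", test_set))

print("cost vs conversational specialist:", delta_pp("srota", "hiacc_only_v1", "hiacc"))
print("cost vs tutorial specialist:", delta_pp("srota", "openslr_only_v2", "openslr104"))
```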

HiACC cohort	n	Srota WER
Adult	664	15.41%
Children	372	16.66%
Overall	1,036	15.85%

The adult/child gap is small (1.25 pp), so training on the union does not appear to have introduced a strong cohort bias.
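
As a sanity check on the cohort table, the utterance-weighted mean of the two cohort WERs can be compared with the reported overall figure. This is only an approximation: the overall 15.85% is presumably pooled over words rather than averaged over utterances (an assumption, since the pooling method is not stated), so a small mismatch is expected.

```python
# (utterance count, WER %) per cohort, copied from the table above.
cohorts = {"adult": (664, 15.41), "children": (372, 16.66)}

total_utts = sum(n for n, _ in cohorts.values())
weighted_wer = sum(n * wer for n, wer in cohorts.values()) / total_utts
gap_pp = cohorts["children"][1] - cohorts["adult"][1]

print(total_utts)              # 1036
print(round(weighted_wer, 2))  # close to, but not exactly, the reported 15.85
print(round(gap_pp, 2))        # 1.25
```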

Normalization. WER is computed with jiwer after a symmetric normalizer (lowercase + strip punctuation) is applied to both predictions and references. These numbers are not directly comparable to MUCS-2021 published baselines, which use a different (Kaldi-style) normalization.
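
A minimal sketch of what such a symmetric normalizer and word-level WER might look like is given below. It is a pure-Python stand-in, not the evaluation code behind the reported numbers (which uses jiwer), and the exact punctuation rule is an assumption: here every character in a Unicode punctuation category is removed, which drops the Devanagari danda (।) and ASCII punctuation but keeps matras and the virama.

```python
import unicodedata


def normalize(text: str) -> str:
    """Lowercase, drop Unicode punctuation, collapse whitespace (assumed rules)."""
    kept = "".join(
        ch for ch in text.lower()
        if not unicodedata.category(ch).startswith("P")
    )
    return " ".join(kept.split())


def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length."""
    ref = normalize(reference).split()
    hyp = normalize(hypothesis).split()
    if not ref:
        raise ValueError("reference is empty after normalization")
    prev = list(range(len(hyp) + 1))  # distances for the empty reference prefix
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, substitution))
        prev = curr
    return prev[-1] / len(ref)


print(normalize("Okay, so मेरा hometown New Delhi है।"))
print(word_error_rate("Okay so मेरा hometown", "okay, so मेरा home town"))
```

Applying the same function to both sides is what makes the normalizer symmetric: casing and punctuation differences never count as errors, but script differences (Devanagari versus Latin for the same word) still do.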

🚀 Quickstart

Install the inference package, then load Srota and call transcribe. The minimal path is two lines of setup and one call.

bash
pip install qwen-asr==0.0.6

python
import torch
from qwen_asr import Qwen3ASRModel

model = Qwen3ASRModel.from_pretrained(
    "moorlee/qwen3-asr-0.6b-hinglish",
    dtype=torch.bfloat16,
    device_map="cuda:0",
    attn_implementation="flash_attention_2",
)

results = model.transcribe(audio="path/to/your.wav", language=None)
print(results[0].text)
# e.g. "मेरा favourite festival Diwali है"

language=None enables the language-agnostic decoding prefix Srota was trained with. Pass it explicitly.
Audio should be mono; keep segments ≤ 30 s per call (chunk longer audio).
bf16 + FlashAttention 2 is recommended; attn_implementation can be dropped on CPU or older GPUs.
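
For audio longer than 30 s, one simple option is fixed-length chunking. The helper below is a hypothetical sketch that only computes sample-index boundaries; it does not touch the model. The 16 kHz default is an assumption taken from the HiACC data format, so pass your file's actual sample rate.

```python
def chunk_bounds(
    num_samples: int,
    sample_rate: int = 16_000,
    max_seconds: float = 30.0,
) -> list[tuple[int, int]]:
    """Return (start, end) sample indices of consecutive chunks, each at most max_seconds long."""
    chunk = int(sample_rate * max_seconds)
    if num_samples < 0 or chunk < 1:
        raise ValueError("num_samples must be >= 0 and the chunk length >= 1 sample")
    return [
        (start, min(start + chunk, num_samples))
        for start in range(0, num_samples, chunk)
    ]


# A 70-second clip at 16 kHz splits into 30 s + 30 s + 10 s.
print(chunk_bounds(70 * 16_000))
# [(0, 480000), (480000, 960000), (960000, 1120000)]
```

Fixed cuts can split a word in half; splitting at silences (for example with a voice-activity detector) is likely to give cleaner transcripts. Each chunk can then be written to its own WAV file and passed to `transcribe` as in the example above.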

No setup? Use the hosted demo.

🎯 Intended Use & Limitations of Use

Intended use

Transcribing conversational Hinglish (casual Q&A, storytelling, image-prompted descriptions).
Transcribing technical-tutorial Hinglish (software walkthroughs, lecture-style instruction).
Producing natural mixed Devanagari + Latin Hinglish text.

Out of scope / not recommended

Monolingual pure-Hindi or pure-English production ASR, where dedicated models are stronger.
Languages or dialects outside Hindi-English code-switching.
High-stakes uses (e.g. medical or legal transcription) without human review.

Full failure modes are described in the Limitations & Biases section below.

📚 Training Data

Srota is trained on the union of two CC BY 4.0 Hinglish corpora, simply concatenated with no upsampling.

HiACC (Singh, Singh & Kadyan, 2025; DOI 10.5281/zenodo.15551669, CC BY 4.0): 5.24 h of conversational Hinglish, 16 kHz mono WAV.
OpenSLR-104 (the MUCS-2021 Multilingual & Code-Switching ASR challenge; CC BY 4.0): 89.86 h of Hindi-English spoken-tutorial speech from the IIT Bombay Spoken Tutorial project.

Split	Utterances	Composition
Train	53,627	HiACC 6.8% + OpenSLR-104 93.2%
Val	3,282	518 HiACC + 2,764 OpenSLR-104

Each corpus's own official test set is used for evaluation, reported separately in the Results section above.

HiACC is only 6.8% of the training mix, yet Srota keeps most of the conversational specialist's gain over the base: 8.88 of 10.50 pp, or about 85% (15.85% vs 14.23% WER). Balanced upsampling did not appear necessary at this scale; a deterministic shuffle (seed 42) was enough.
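
The "concatenate, then shuffle deterministically" recipe can be sketched as follows. This is an illustration under assumptions, not the actual data pipeline: `build_training_mix` is a hypothetical helper, the item names are placeholders, and the real training script may shuffle through a different mechanism.

```python
import random


def build_training_mix(hiacc: list, openslr: list, seed: int = 42) -> list:
    """Concatenate both corpora without upsampling, then shuffle reproducibly."""
    mix = list(hiacc) + list(openslr)
    random.Random(seed).shuffle(mix)  # private RNG, so global random state is untouched
    return mix


# Placeholder utterance IDs standing in for real manifest entries.
hiacc_items = [f"hiacc_{i}" for i in range(7)]
openslr_items = [f"openslr_{i}" for i in range(93)]

mix = build_training_mix(hiacc_items, openslr_items)
hiacc_share = sum(item.startswith("hiacc_") for item in mix) / len(mix)
print(len(mix), hiacc_share)  # 100 0.07
```

Because nothing is upsampled, the minority corpus keeps its natural share of the mix, and the fixed seed makes the example order reproducible across runs.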

🧠 Training Procedure

Srota is a full-parameter fine-tune of Qwen3-ASR-0.6B: no frozen layers, no LoRA. Every native weight is updated. The "0.6B" in the base model's name refers only to its LLM backbone (Qwen3-0.6B, ~600M parameters); the full speech model is ~780M parameters, because it also includes a ~180M AuT audio encoder and a small projector. That extra ~180M is the audio encoder, not a LoRA adapter: all three components (audio encoder, projector, and LLM) are trained end-to-end.

Setting	Value
Base model	`Qwen/Qwen3-ASR-0.6B`
Fine-tune scope	Full-parameter (no frozen layers, no LoRA)
Fine-tune script	`qwen3_asr_sft.py` @ commit `c17a131f` (QwenLM/Qwen3-ASR)
Optimizer	AdamW
Learning rate	2e-5, linear schedule, warmup_ratio 0.02
Gradient clipping	norm 1.0
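
To make the learning-rate row concrete, the sketch below computes a linear warmup followed by linear decay, the shape commonly meant by a "linear schedule" with a warmup ratio. It is an illustration only: `total_steps` is a hypothetical value, and whether `qwen3_asr_sft.py` computes the schedule in exactly this way is an assumption.

```python
import math


def linear_lr(
    step: int,
    total_steps: int,
    base_lr: float = 2e-5,
    warmup_ratio: float = 0.02,
) -> float:
    """Learning rate at `step`: linear ramp up to base_lr, then linear decay to zero."""
    warmup_steps = math.ceil(total_steps * warmup_ratio)
    if step < warmup_steps:
        return base_lr * step / max(1, warmup_steps)
    remaining = max(0, total_steps - step)
    return base_lr * remaining / max(1, total_steps - warmup_steps)


total_steps = 1000  # hypothetical; the real step count is not stated in this card
for step in (0, 10, 20, 510, 1000):
    print(step, linear_lr(step, total_steps))
```

With these values the warmup covers the first 20 steps, the peak of 2e-5 is reached at step 20, and the rate falls linearly to zero at the final step.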

Data format. Targets use the language-agnostic prefix language None<asr_text>... (following Polyglot-Lion / Toshniwal et al., 2018), with transcripts kept in their natural mixed Devanagari + Latin script.
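
A minimal sketch of the target construction is shown below. The prefix string is taken from the description above; the `build_target` helper is hypothetical, and any special tokens or end-of-sequence markers that the real training script adds are not modelled here.

```python
LANGUAGE_AGNOSTIC_PREFIX = "language None<asr_text>"


def build_target(transcript: str) -> str:
    """Prepend the language-agnostic prefix to a mixed-script transcript."""
    return LANGUAGE_AGNOSTIC_PREFIX + transcript.strip()


print(build_target("मेरा favourite festival Diwali है"))
# language None<asr_text>मेरा favourite festival Diwali है
```

The transcript itself is left untouched, so Hindi words stay in Devanagari and English words stay in Latin script.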

📈 Evaluation

Methodology. For each test utterance, we call transcribe(audio=…, language=None), strip the leading language ?<asr_text> prefix, apply the symmetric lowercase + strip-punctuation normalizer to both hypothesis and reference, and compute WER with jiwer. Srota is evaluated on the HiACC test split (with adult/child cohort slicing) and the OpenSLR-104 official test split. See the Results section above for the full comparison table.
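
The prefix-stripping step might look like the sketch below. It assumes the raw decoded string contains the literal `<asr_text>` tag; the `strip_prefix` helper is hypothetical, and if the inference package already returns prefix-free text the function simply passes it through.

```python
ASR_TAG = "<asr_text>"


def strip_prefix(raw: str) -> str:
    """Drop everything up to and including the first <asr_text> tag, if present."""
    _, tag, tail = raw.partition(ASR_TAG)
    return tail.strip() if tag else raw.strip()


print(strip_prefix("language None<asr_text>Okay so मेरा hometown New Delhi है"))
# Okay so मेरा hometown New Delhi है
```

Stripping by tag rather than by a fixed string keeps the step independent of whatever language label the model emits before the tag.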

Union vs. specialists. A tutorial-only fine-tune (v2) improved by 17.82 pp in-domain on OpenSLR-104 but regressed by 12.91 pp versus the base on conversational HiACC. This is classic negative transfer: lectures and spontaneous conversation are far apart distributionally. Adding back HiACC's 5.24 h of conversational speech (even at only 6.8% of the mix) re-anchors the model: Srota turns that +12.91 pp HiACC regression into a −8.88 pp improvement (a −21.79 pp swing versus v2 on HiACC) while keeping −15.60 pp on OpenSLR-104. Srota is the shippable generalist; the single-domain specialists are not drop-in replacements for the base across both domains.

⚠️ Limitations & Biases

Generalist trade-off. Srota is ~1.6 pp behind the conversational specialist on HiACC and ~2.2 pp behind the tutorial specialist on OpenSLR-104. For a single known domain, a specialist is marginally better.
Tutorial WER is still substantial (35.06%). Dense code/path/version vocabulary (bash, gnu/linux, version 1204) remains hard for a model of this size.
Not comparable to MUCS-2021 published numbers without matching their Kaldi-style normalization.
Single seed, single configuration. No hyperparameter sweep was run; the "upsampling unnecessary" claim is observed, not proven via a controlled concat-vs-upsampled ablation.
HiACC train/val/test share speakers. Reported HiACC WER is in-domain, not novel-speaker: real-world conversational WER on unseen speakers may be higher.
Bias note. Data is sourced from specific corpora (Indian spoken-tutorial speech and a defined conversational set that includes children); accent, dialect, and domain coverage is limited and may not generalize to all Hinglish varieties.

📬 Contact

Questions, feedback, or want Srota tailored to your use case? Email surajprasad8977@gmail.com.

📄 License

Apache-2.0, inherited from the base Qwen3-ASR-0.6B model. Apache-2.0 is a permissive open-source license: you may use, modify, and redistribute Srota freely, including for commercial purposes, with no copyleft. The main obligations are to include a copy of the license, retain copyright and attribution notices, and mark any files you change.

Both training corpora, HiACC and OpenSLR-104, are licensed CC BY 4.0 (see openslr.org/104 for the full OpenSLR-104 terms). Users must comply with the attribution requirements of these dataset licenses.

📝 Citation

If you use Srota, please cite this model and the underlying works.

bibtex
@misc{srota2026,
  title  = {Srota: A Hinglish ASR model fine-tuned from Qwen3-ASR-0.6B},
  author = {Suraj},
  year   = {2026},
  url    = {https://huggingface.co/moorlee/qwen3-asr-0.6b-hinglish}
}

@article{shi2026qwen3asr,
  title  = {Qwen3-ASR Technical Report},
  author = {Shi, Xian and Wang, Xiong and Guo, Zhifang and Wang, Yongqi and
            Zhang, Pei and Zhang, Xinyu and Guo, Zishan and Hao, Hongkun and
            Xi, Yu and Yang, Baosong and Xu, Jin and Zhou, Jingren and
            Lin, Junyang},
  year   = {2026},
  url    = {https://arxiv.org/abs/2601.21337}
}

@article{dang2026polyglot,
  title  = {Polyglot-Lion: Efficient Multilingual ASR for Singapore via
            Balanced Fine-Tuning of Qwen3-ASR},
  author = {Dang, Quy-Anh and Ngo, Chris},
  year   = {2026},
  url    = {https://arxiv.org/abs/2603.16184}
}

@misc{singh2025hiacc,
  title  = {HiACC: Hinglish Adult \& Children Code-switched Corpus},
  author = {Singh, Shruti and Singh, Muskaan and Kadyan, Virender},
  year   = {2025},
  doi    = {10.5281/zenodo.15551669},
  url    = {https://zenodo.org/records/15551669}
}

@inproceedings{diwan2021mucs,
  title     = {{MUCS} 2021: Multilingual and Code-Switching {ASR} Challenges for Low Resource {Indian} Languages},
  author    = {Diwan, Anuj and Vaideeswaran, Rakesh and Shah, Sanket and others},
  booktitle = {Proc. Interspeech 2021},
  year      = {2021}
}

@inproceedings{toshniwal2018multilingual,
  title     = {Multilingual speech recognition with a single end-to-end model},
  author    = {Toshniwal, Shubham and Sainath, Tara N. and Weiss, Ron J. and
               Li, Bo and Moreno, Pedro and Weinstein, Eugene and Rao, Kanishka},
  booktitle = {2018 IEEE International Conference on Acoustics, Speech and
               Signal Processing (ICASSP)},
  pages     = {4904--4908},
  year      = {2018},
  doi       = {10.1109/ICASSP.2018.8461972}
}

🙏 Acknowledgements

Srota builds directly on the work of others. We thank the Qwen team for Qwen3-ASR-0.6B, the base model, and for the open qwen3_asr_sft.py training script. We thank the HiACC authors (Singh, Singh & Kadyan) and the MUCS-2021 / OpenSLR-104 / IIT Bombay Spoken Tutorial contributors for the training and evaluation data. We also thank the authors of Polyglot-Lion (Dang & Ngo) for the balanced-fine-tuning recipe and language-agnostic decoding prefix that this work builds on.

Srota stands entirely on Qwen3-ASR-0.6B; this work is the Hinglish adaptation, not a new foundation model.
