License: apache-2.0

🧬 Srota model family

| Variant | Best for | Specialty |
|---|---|---|
| Srota (union) | General Hinglish (recommended default) | conversational + tutorial |
| Srota-Conv | Conversational Hinglish only | HiACC specialist |
| Srota-Tutorial | Technical tutorial speech only | OpenSLR-104 specialist |

You are viewing Srota-Tutorial (OpenSLR-104 tutorial specialist).

ℹ️ What is Srota-Tutorial?

Srota-Tutorial is an automatic speech recognition (ASR) model for Hindi-English code-switched tutorial speech: software walkthroughs, lectures, and step-by-step technical instruction from the IIT Bombay Spoken Tutorial project, as packaged in OpenSLR-104 / MUCS-2021. It is a full-parameter fine-tune of Qwen/Qwen3-ASR-0.6B trained on OpenSLR-104 alone.

On the size. The base model's name, Qwen3-ASR-0.6B, refers to its LLM backbone (Qwen3-0.6B, ~600M parameters). The full speech model adds a ~180M AuT audio encoder and a small projector, for ~780M parameters total. Srota-Tutorial is a full-parameter fine-tune of all three components: there are no LoRA adapters and no frozen layers, and every native weight is updated. The extra ~180M over the "0.6B" name is the audio encoder, not a LoRA adapter.

Sibling model. For general Hinglish (conversational + tutorial), see Srota, the union model. Srota-Tutorial only exists to document the in-domain ceiling and the cross-domain cost of single-domain fine-tuning; Srota is the shippable generalist.

Project. Built by the team behind susrota.com, a voice-dictation tool that currently runs in English. Srota will power its upcoming Hinglish support; the live product does not run this model yet.

✨ Highlights

  • Large in-domain win. OpenSLR-104 test WER drops from 50.66% (base) to 32.83% (−17.83 pp, −35% relative).
  • Preserves natural code-switch. Keeps English jargon in Latin (tutorial, print button, slides handouts notes) and Hindi narration in Devanagari, instead of transliterating English terms into Devanagari or hallucinating English continuations as the base model does.
  • Compact. ~780M parameters total (Qwen3-0.6B LLM + ~180M AuT audio encoder + projector); single-GPU bf16 inference.
  • Honest lineage. Full-parameter fine-tune of Qwen3-ASR-0.6B: no frozen layers, no LoRA adapters. The extra ~180M over the "0.6B" name is the audio encoder, not LoRA.
  • Specialist trade-off (read this). Conversational HiACC test WER goes from 24.73% (base) to 37.64% (+12.91 pp WORSE than base). This is a classic single-domain negative-transfer regression, and it is the entire reason the union model Srota exists.
  • Open. Apache-2.0; training data is OpenSLR-104 (CC BY 4.0).

⚠️ Read before downloading

Srota-Tutorial is a domain specialist, not a drop-in replacement for the base model. On conversational Hinglish (HiACC test), it scores 37.64% WER, which is +12.91 pp WORSE than Qwen3-ASR-0.6B's 24.73%. If your audio is anything other than technical Hindi-English tutorial speech (lectures, software walkthroughs), use Srota (the union model) or the base Qwen3-ASR-0.6B instead.

Additionally, because OpenSLR-104 transcripts are lowercase and unpunctuated by design, this model emits lowercase, no-punctuation, mixed-script text. It is not production-formatted output.

🎧 Srota-Tutorial in action

Real OpenSLR-104 test-set examples. On tutorial speech, the base model hallucinates English completions or transliterates English terms into Devanagari. Srota-Tutorial transcribes what was actually said, preserving the natural code-switch (English jargon in Latin, Hindi narration in Devanagari).

| Example | Base Qwen3-ASR-0.6B | Srota-Tutorial |
|---|---|---|
| A | In the tutorial, we have seen storage class specifiers, auto keyword, static keyword, extern keyword, register keyword. | इस tutorial में हमने सीखा: |
| B | हम इस वर्ग में नहीं करेंगे अब प्रिंट बटन पर क्लिक | अब print button पर click करें |
| C | प्रिंटिंग के बारे में सीखा, स्लाइड्स, हैंडओउट्स, नोट्स और आउटलाइन | slides handouts notes और outline |

In A, the base ignores the actual short Hindi phrase and hallucinates a fluent English summary. In B, the base prepends invented content before getting to the command. In C, the base transliterates English jargon into Devanagari (स्लाइड्स, हैंडओउट्स); Srota-Tutorial keeps English in Latin (slides handouts notes), the way it appears in the reference transcript.
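
One rough way to quantify the script behavior described above is to count tokens per script. The sketch below is illustrative only: the helper names `script_of` and `script_counts` are hypothetical, and the classification rule (Unicode Devanagari block vs. ASCII letters) is an assumption, not part of the model's evaluation.

```python
def script_of(token: str) -> str:
    """Classify a token as 'devanagari', 'latin', or 'other' by its letters."""
    has_dev = any("\u0900" <= ch <= "\u097f" for ch in token)
    has_lat = any(ch.isascii() and ch.isalpha() for ch in token)
    if has_dev and not has_lat:
        return "devanagari"
    if has_lat and not has_dev:
        return "latin"
    return "other"  # mixed-script, digit-only, or punctuation-only tokens


def script_counts(text: str) -> dict[str, int]:
    """Count whitespace-separated tokens per script."""
    counts = {"devanagari": 0, "latin": 0, "other": 0}
    for token in text.split():
        counts[script_of(token)] += 1
    return counts


# Example B from the table above.
srota_b = "अब print button पर click करें"
base_b = "हम इस वर्ग में नहीं करेंगे अब प्रिंट बटन पर क्लिक"

print(script_counts(srota_b))  # {'devanagari': 3, 'latin': 3, 'other': 0}
print(script_counts(base_b))   # {'devanagari': 11, 'latin': 0, 'other': 0}
```

A Latin count of zero on audio that contains English terms is a quick signal that the output has been transliterated rather than code-switched.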

📊 Results

| Test set | Domain | n utts | Base Qwen3-ASR-0.6B | Srota-Tutorial | Δ vs base |
|---|---|---|---|---|---|
| OpenSLR-104 test | Tutorial (in-domain) | 3,132 | 50.66% | 32.83% | −17.83 pp (−35% rel) |
| HiACC test | Conversational (cross-domain) | 1,036 | 24.73% | 37.64% | +12.91 pp (worse) |

Normalization. WER is computed with jiwer after a symmetric normalizer (lowercase + strip punctuation) is applied to both predictions and references. These numbers are not directly comparable to MUCS-2021 published baselines, which use a different (Kaldi-style) normalization.
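
As a minimal sketch of what "symmetric normalizer + WER" means, the snippet below lowercases, drops punctuation, and pools word errors over the corpus. It is a dependency-free stand-in, not the evaluation code: the reported numbers use jiwer, and the exact normalizer is not published here, so the punctuation rule (drop every character in a Unicode `P*` category) and the names `normalize`, `word_errors`, and `corpus_wer` are assumptions.

```python
import unicodedata


def normalize(text: str) -> list[str]:
    """Lowercase, delete punctuation (Unicode categories P*), split on whitespace."""
    kept = "".join(
        ch for ch in text.lower()
        if not unicodedata.category(ch).startswith("P")
    )
    return kept.split()


def word_errors(ref: list[str], hyp: list[str]) -> int:
    """Word-level Levenshtein distance (substitutions + deletions + insertions)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = curr
    return prev[-1]


def corpus_wer(refs: list[str], hyps: list[str]) -> float:
    """Total word errors divided by total reference words, after normalizing both sides."""
    errors = sum(word_errors(normalize(r), normalize(h)) for r, h in zip(refs, hyps))
    words = sum(len(normalize(r)) for r in refs)
    return errors / words


refs = ["अब print button पर click करें"]
hyps = ["अब Print button, पर क्लिक करें।"]
print(f"{corpus_wer(refs, hyps):.2%}")  # 16.67%
```

Casing and punctuation differences cost nothing here because both sides pass through the same normalizer; the one remaining error is the transliterated क्लिक against the reference click.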

The OpenSLR-104 gain is real and large, but the HiACC regression is also real and large: a tutorial-only fine-tune at this scale meaningfully damages conversational performance. This is the central evidence that motivates the union model Srota.
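
The deltas in the table follow directly from the two WER columns. A short check, where `wer_delta` is a hypothetical helper and the relative HiACC figure is derived arithmetic rather than a number reported above:

```python
def wer_delta(base: float, tuned: float) -> tuple[float, float]:
    """Return (absolute change in pp, relative change in %) of tuned vs. base WER."""
    absolute = tuned - base
    relative = absolute / base * 100
    return round(absolute, 2), round(relative, 1)


print(wer_delta(50.66, 32.83))  # OpenSLR-104 test: (-17.83, -35.2)
print(wer_delta(24.73, 37.64))  # HiACC test: (12.91, 52.2)
```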

🚀 Quickstart

Install the inference package, then load Srota-Tutorial and call transcribe.

```bash
pip install qwen-asr==0.0.6
```

```python
import torch
from qwen_asr import Qwen3ASRModel

# Load the fine-tuned checkpoint in bf16 on the first GPU.
model = Qwen3ASRModel.from_pretrained(
    "moorlee/qwen3-asr-0.6b-hinglish-openslr104-v2",
    dtype=torch.bfloat16,
    device_map="cuda:0",
    attn_implementation="flash_attention_2",
)

# language=None selects the language-agnostic prefix used in training.
results = model.transcribe(audio="path/to/tutorial.wav", language=None)
print(results[0].text)
# e.g. "इस tutorial में हम nested और multilevel if statement के बारे में सीखेंगे"
```

  • language=None enables the language-agnostic decoding prefix this model was trained with. Pass it explicitly.
  • Audio should be mono; keep segments under 30 s per call (chunk longer audio).
  • bf16 + FlashAttention 2 is recommended; attn_implementation can be dropped on CPU or older GPUs.
  • Output style. OpenSLR-104 references are lowercase and unpunctuated by design, so this model emits lowercase, no-punctuation, mixed Devanagari + Latin text. Apply your own casing and punctuation if you need production-formatted output.
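
For recordings longer than the 30 s guideline, one simple option is fixed-length windowing. The sketch below only computes sample boundaries; `chunk_bounds` is a hypothetical helper, the 29 s default is an assumption chosen to stay under the limit, and 16 kHz matches the training audio format.

```python
def chunk_bounds(
    num_samples: int,
    sample_rate: int = 16_000,
    max_seconds: float = 29.0,
) -> list[tuple[int, int]]:
    """Return (start, end) sample indices covering the audio in windows of at most max_seconds."""
    max_len = int(sample_rate * max_seconds)
    return [
        (start, min(start + max_len, num_samples))
        for start in range(0, num_samples, max_len)
    ]


# A 70 s mono recording at 16 kHz.
bounds = chunk_bounds(70 * 16_000)
print(bounds)  # [(0, 464000), (464000, 928000), (928000, 1120000)]
```

Each slice can then be saved as its own WAV file and transcribed separately. Fixed windows can cut through a word, so splitting at pauses is likely to give cleaner transcripts when a silence detector is available.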

🎯 Intended Use

Intended use

  • Transcribing Hindi-English spoken tutorials: software walkthroughs, lecture-style technical instruction, step-by-step product demos, in the same distribution as the IIT Bombay Spoken Tutorial / OpenSLR-104 corpus.
  • Research baseline for in-domain fine-tuning on OpenSLR-104 / MUCS-2021.
  • Producing lowercase, no-punctuation mixed Devanagari + Latin Hinglish text (the OpenSLR-104 transcript style).

Out of scope / not recommended

  • General conversational Hinglish. This model is +12.91 pp WORSE than the base on HiACC. Use Srota (the union model) for conversational or mixed-domain audio.
  • Production text needing case or punctuation without a post-processing layer.
  • Monolingual pure-Hindi or pure-English ASR.
  • High-stakes uses (medical, legal) without human review.

Full failure modes are described in the Limitations & Biases section below.

📚 Training Data

Srota-Tutorial is trained on OpenSLR-104 alone (the MUCS-2021 Multilingual & Code-Switching ASR challenge Hindi-English subtask; CC BY 4.0): 89.86 h of Hindi-English spoken-tutorial speech from the IIT Bombay Spoken Tutorial project, 16 kHz mono WAV. Transcripts are lowercase and unpunctuated by design.

| Split | Utterances | Notes |
|---|---|---|
| Train | 50,005 | OpenSLR-104 train |
| Val | 2,764 | Speaker-disjoint from train: 26 of 520 train speakers held out |
| Test | 3,132 | Official OpenSLR-104 test |
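
A speaker-disjoint validation split like the one in the table can be sketched as follows. How the 26 speakers were actually chosen is not documented here, so the random selection, the `seed` argument, and the function name `speaker_disjoint_split` are all assumptions for illustration.

```python
import random


def speaker_disjoint_split(
    utterances: list[tuple[str, str]],
    num_heldout: int,
    seed: int = 42,
) -> tuple[list[tuple[str, str]], list[tuple[str, str]]]:
    """Split (utt_id, speaker_id) pairs so that held-out speakers appear only in val."""
    speakers = sorted({spk for _, spk in utterances})
    heldout = set(random.Random(seed).sample(speakers, num_heldout))
    train = [u for u in utterances if u[1] not in heldout]
    val = [u for u in utterances if u[1] in heldout]
    return train, val


# Toy corpus: 10 speakers with 3 utterances each; hold out 2 speakers.
toy = [(f"utt{s}_{i}", f"spk{s}") for s in range(10) for i in range(3)]
train, val = speaker_disjoint_split(toy, num_heldout=2)
print(len(train), len(val))  # 24 6
```

Splitting by speaker rather than by utterance keeps validation loss from being flattered by voices the model has already seen.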

The corpus audio comes from long-form tutorial recordings that are segmented into utterance-length clips. Both fine-tuning and evaluation operate on these utterance-level segments, following the official splits.

🧠 Training Procedure

Srota-Tutorial is a full-parameter fine-tune of Qwen3-ASR-0.6B: no frozen layers, no LoRA. Every native weight is updated. The "0.6B" in the base model's name refers only to its LLM backbone (Qwen3-0.6B, ~600M parameters); the full speech model is ~780M parameters because it also includes a ~180M AuT audio encoder and a small projector. That extra ~180M is the audio encoder, not a LoRA adapter: all three components (audio encoder, projector, and LLM) are trained end-to-end.

| Setting | Value |
|---|---|
| Base model | Qwen/Qwen3-ASR-0.6B |
| Fine-tune scope | Full-parameter (no frozen layers, no LoRA) |
| Fine-tune script | qwen3_asr_sft.py (QwenLM/Qwen3-ASR) |
| Optimizer | AdamW |
| Learning rate | 2e-5, linear schedule, warmup_ratio 0.02 |
| Gradient clipping | norm 1.0 |
| Effective batch | 32 (per-device 8 × grad-accum 2 × 2 GPUs) |
| Precision | bf16 + FlashAttention 2 |
| Epochs | 3 (4,690 steps) |
| Best checkpoint | step 3000 (epoch 1.92), eval_loss 0.1436 |
| Hardware | 2× NVIDIA H100 80GB |
| Wall-clock | ~72 min (4,351 s) |
| Seed | 42 (data shuffle) |
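
The schedule and batch figures in the table can be worked through in a few lines. This is a sketch under assumptions: the table only says "linear schedule, warmup_ratio 0.02", so the exact shape (linear warmup from zero, then linear decay to zero, with warmup steps rounded up) is assumed, and `lr_at` is a hypothetical helper.

```python
import math

PEAK_LR = 2e-5
TOTAL_STEPS = 4690
WARMUP_STEPS = math.ceil(0.02 * TOTAL_STEPS)  # assumed rounding; gives 94 steps


def lr_at(step: int) -> float:
    """Learning rate at a given optimizer step: linear warmup, then linear decay to zero."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    return PEAK_LR * max(0, TOTAL_STEPS - step) / (TOTAL_STEPS - WARMUP_STEPS)


effective_batch = 8 * 2 * 2                         # per-device × grad-accum × GPUs
epoch_at_best = round(3000 / (TOTAL_STEPS / 3), 2)  # step 3000 expressed in epochs

print(effective_batch)        # 32
print(epoch_at_best)          # 1.92
print(lr_at(WARMUP_STEPS))    # 2e-05
print(f"{lr_at(3000):.2e}")   # 7.35e-06
```

Under this assumed shape, the best checkpoint at step 3000 was saved with the learning rate at roughly a third of its peak.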

Data format. Targets use the language-agnostic prefix language None<asr_text>... (following Polyglot-Lion / Toshniwal et al., 2018), with transcripts kept in OpenSLR-104's native lowercase mixed Devanagari + Latin script.

📈 Evaluation

Methodology. For each test utterance, we call transcribe(audio=…, language=None), strip the leading language ?<asr_text> prefix, apply the symmetric lowercase + strip-punctuation normalizer to both hypothesis and reference, and compute WER with jiwer. Srota-Tutorial was evaluated on both test sets to surface cross-domain transfer behavior: in-domain on OpenSLR-104 test (3,132 utts) and cross-domain on HiACC test (1,036 utts).
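
The prefix-stripping step can be sketched with a regular expression. The pattern below is an assumption based on the prefix format quoted in this card (the word `language`, one language token, then `<asr_text>`); `strip_prefix` is a hypothetical helper and is only needed when working with raw decoded strings that still carry the prefix.

```python
import re

# Assumed shape: "language <token><asr_text>" at the start of the raw output.
_PREFIX = re.compile(r"^\s*language\s+\S+?<asr_text>")


def strip_prefix(raw: str) -> str:
    """Remove a leading 'language X<asr_text>' prefix if present; otherwise return the text unchanged."""
    return _PREFIX.sub("", raw, count=1).strip()


print(strip_prefix("language None<asr_text>इस tutorial में हमने सीखा"))
# इस tutorial में हमने सीखा
```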

In-domain (OpenSLR-104 test). WER drops from 50.66% (base) to 32.83% (Srota-Tutorial), a −17.83 pp absolute / −35% relative improvement. This is the headline in-domain result.

Cross-domain (HiACC test). WER goes from 24.73% (base) to 37.64% (Srota-Tutorial), a +12.91 pp regression: this model is meaningfully worse than the base on conversational Hinglish. This is the why-Srota-exists result: a tutorial-only fine-tune at this scale negatively transfers to conversational speech, which is precisely what the union model Srota was built to fix (it converts that +12.91 pp HiACC regression into a −8.88 pp improvement).

⚠️ Limitations & Biases

  • Cross-domain regression. On conversational HiACC, Srota-Tutorial is +12.91 pp worse than the base Qwen3-ASR-0.6B (37.64% vs 24.73%). Do not use it on non-tutorial audio; use Srota instead.
  • Lowercase, no-punctuation output. OpenSLR-104 transcripts are lowercase and unpunctuated by design, so the model emits the same. It is not production-formatted; a casing/punctuation post-processor is required for downstream display.
  • In-domain WER is still substantial (32.83%). Dense technical vocabulary (commands, file paths, version strings) and rapid Hindi-English code-switching remain hard for a ~780M-parameter model, even after a 35% relative reduction.
  • Not comparable to MUCS-2021 published numbers without matching their Kaldi-style normalization.
  • Single seed, single configuration. No hyperparameter sweep was run.
  • Bias note. All training audio comes from the IIT Bombay Spoken Tutorial project: a specific Indian-accented, lecture-style register. Accent, dialect, speaking-style, and topic coverage outside that distribution may degrade quickly (the HiACC result is a concrete example).

📬 Contact

Questions, feedback, or want Srota-Tutorial tailored to your use case? Email surajprasad8977@gmail.com.

📄 License

Apache-2.0, inherited from the base Qwen3-ASR-0.6B model. Apache-2.0 is a permissive open-source license: you may use, modify, and redistribute Srota-Tutorial freely, including for commercial purposes, with no copyleft. The only obligations are to retain the license/copyright notice and to state significant changes.

The training data, OpenSLR-104, is licensed CC BY 4.0 (see openslr.org/104 for full terms). Users must comply with its attribution requirements.

📝 Citation

If you use Srota-Tutorial, please cite this model and the underlying works.

```bibtex
@misc{srota_tutorial2026,
  title  = {Srota-Tutorial: A Hinglish tutorial-speech ASR model fine-tuned from Qwen3-ASR-0.6B on OpenSLR-104},
  author = {Suraj},
  year   = {2026},
  url    = {https://huggingface.co/moorlee/qwen3-asr-0.6b-hinglish-openslr104-v2}
}

@article{shi2026qwen3asr,
  title  = {Qwen3-ASR Technical Report},
  author = {Shi, Xian and Wang, Xiong and Guo, Zhifang and Wang, Yongqi and
            Zhang, Pei and Zhang, Xinyu and Guo, Zishan and Hao, Hongkun and
            Xi, Yu and Yang, Baosong and Xu, Jin and Zhou, Jingren and
            Lin, Junyang},
  year   = {2026},
  url    = {https://arxiv.org/abs/2601.21337}
}

@article{dang2026polyglot,
  title  = {Polyglot-Lion: Efficient Multilingual ASR for Singapore via
            Balanced Fine-Tuning of Qwen3-ASR},
  author = {Dang, Quy-Anh and Ngo, Chris},
  year   = {2026},
  url    = {https://arxiv.org/abs/2603.16184}
}

@inproceedings{diwan2021mucs,
  title     = {{MUCS} 2021: Multilingual and Code-Switching {ASR} Challenges for Low Resource {Indian} Languages},
  author    = {Diwan, Anuj and Vaideeswaran, Rakesh and Shah, Sanket and others},
  booktitle = {Proc. Interspeech 2021},
  year      = {2021}
}
```

🙏 Acknowledgements

Srota-Tutorial builds directly on the work of others. We thank the Qwen team for Qwen3-ASR-0.6B, the base model, and for the open qwen3_asr_sft.py training script. We thank the IIT Bombay Spoken Tutorial project and the MUCS-2021 / OpenSLR-104 organizers for the training and evaluation data. We also thank the authors of Polyglot-Lion (Dang & Ngo) for the language-agnostic decoding prefix that this work builds on.

Srota-Tutorial stands entirely on Qwen3-ASR-0.6B; this work is the OpenSLR-104 tutorial-domain adaptation, not a new foundation model.
