README
License: apache-2.0
Prompting
For emotion and style control, place the tags at the end of the sentence.
For example: मुझे यह फिल्म बहुत पसंद आई! <happy> (Hindi for "I really liked this movie!") or I am not sure if I can do this. <confused>
Tags for Indian languages: <happy>, <sad>, <angry>, <disgust>, <fear>, <surprise>
Tags for English: <happy>, <sad>, <enunciated>, <confused>, <angry>, <whisper>
A word can be stressed by placing asterisks (*) around it. For example: No! I could *never* do it!
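The prompting rules above can be sketched as a small helper. This is only an illustration: the tag lists are copied from this card, while the function names `add_style_tag` and `stress` are hypothetical and not part of any released package.

```python
# Tag lists copied from this model card; the helper names are hypothetical.
INDIC_TAGS = {"happy", "sad", "angry", "disgust", "fear", "surprise"}
ENGLISH_TAGS = {"happy", "sad", "enunciated", "confused", "angry", "whisper"}


def add_style_tag(text: str, tag: str, english: bool = False) -> str:
    """Append an emotion/style tag to the end of a sentence."""
    allowed = ENGLISH_TAGS if english else INDIC_TAGS
    if tag not in allowed:
        raise ValueError(f"unsupported tag {tag!r}; choose from {sorted(allowed)}")
    return f"{text.rstrip()} <{tag}>"


def stress(word: str) -> str:
    """Wrap a word in asterisks to mark it as stressed."""
    return f"*{word}*"


prompt = add_style_tag(f"No! I could {stress('never')} do it!", "angry", english=True)
print(prompt)  # No! I could *never* do it! <angry>
```

Validating the tag against the per-language list catches mistakes such as using <whisper> with an Indian-language sentence, which is not among the tags listed for Indian languages.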
Inference
Approach 1: With MioTTS-Inference (recommended)
Install vllm and set up MioTTS-Inference.
bash
vllm serve SPRINGLab/Indic-Mio --gpu-memory-utilization 0.5
bash
cd MioTTS-Inference
python run_server.py
bash
python run_gradio.py
Approach 2: Directly with Transformers
python
import numpy as np
import soundfile as sf
import torch
from miocodec import MioCodec
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "SPRINGLab/Indic-Mio"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16, device_map="cuda")

text = "नमस्ते, आप कैसे हैं?"  # Hindi: "Hello, how are you?"
messages = [{"role": "user", "content": text}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

output = model.generate(
    **inputs,
    max_new_tokens=1024,
    do_sample=True,  # temperature and top_p only take effect when sampling is enabled
    temperature=0.9,
    top_p=0.9,
)

# Keep only the newly generated tokens, then map speech tokens back to codec indices.
generated = output[0][inputs["input_ids"].shape[1]:]
speech_offset = 151669
audio_codes = [
    t.item() - speech_offset
    for t in generated
    if speech_offset <= t.item() < speech_offset + 12800
]

# Decode the codes to a waveform with MioCodec.
codec = MioCodec.from_pretrained("Aratako/MioCodec-25Hz-24kHz")
codes_tensor = torch.tensor([audio_codes], dtype=torch.long).unsqueeze(0)  # [1, 1, T]
wav = codec.decode(codes_tensor)  # -> [1, 1, num_samples]

# The sample rate passed here should match the codec's output rate.
sf.write("output.wav", wav.squeeze().detach().cpu().numpy(), 44100)
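The token filtering step in the Transformers snippet can be checked on its own, without loading the model. The sketch below reuses the offset (151669) and speech-token count (12800) from that snippet. The frame rate of 25 codes per second is an assumption read off the codec name "MioCodec-25Hz-24kHz", as is the idea that each code corresponds to one frame; the token IDs are made up for illustration.

```python
SPEECH_OFFSET = 151669  # first speech-token ID, as in the Transformers snippet
NUM_CODES = 12800       # number of speech tokens, as in the Transformers snippet
FRAME_RATE_HZ = 25      # assumption: inferred from the codec name "MioCodec-25Hz-24kHz"


def extract_audio_codes(token_ids):
    """Keep only speech tokens and shift them to zero-based codec indices."""
    return [
        t - SPEECH_OFFSET
        for t in token_ids
        if SPEECH_OFFSET <= t < SPEECH_OFFSET + NUM_CODES
    ]


def estimated_seconds(num_codes):
    """Rough audio duration, assuming one code per frame at FRAME_RATE_HZ."""
    return num_codes / FRAME_RATE_HZ


# Hypothetical token IDs: two text tokens, three speech tokens, one ID past the speech range.
generated = [151645, 151669, 151670, 164468, 164469, 42]
codes = extract_audio_codes(generated)
print(codes)                    # [0, 1, 12799]
print(estimated_seconds(1024))  # 40.96
```

Under these assumptions, max_new_tokens=1024 caps a single generation at roughly 41 seconds of audio, so longer text would need to be split into several requests.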
Training
This model was trained on a single NVIDIA A6000 ADA GPU in less than 6 hours.
For Indian languages, the IndicTTS, Rasa and Syspin datasets were used. LibriTTS and Expresso were used for American English, and the SPICOR dataset for Indian English.
Fine-tuning
You can fine-tune this model on your own dataset to improve performance on specific languages, accents, speakers, styles or emotions. A few steps of LoRA fine-tuning can be enough to noticeably improve results on your target task.
Citations
If you use this model, please cite this Hugging Face repository as follows:
bibtex
@misc{indic-mio-tts,
  title={Indic-Mio TTS},
  author={Advait Joglekar},
  year={2026},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/SPRINGLab/Indic-Mio}},
}
Model provider: SPRINGLab
Model tree: Aratako/MioTTS-0.6B (base), this model (fine-tuned)
Modalities: Text (input), Text (output)