README

License: apache-2.0

Prompting

For emotion and style control, place the tags at the end of the sentence.

For example: मुझे यह फिल्म बहुत पसंद आई! <happy> ("I really liked this movie!") or I am not sure if I can do this. <confused>

Tags for Indian languages: <happy>, <sad>, <angry>, <disgust>, <fear>, <surprise>

Tags for English: <happy>, <sad>, <enunciated>, <confused>, <angry>, <whisper>

To stress a word, wrap it in asterisks (*). For example: No! I could *never* do it!
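The prompting rules above can be sketched as a small helper. This is only an illustration: `build_prompt` and its parameters are hypothetical names that are not part of the model or of MioTTS-Inference, and only the two tag lists come from this README.

```python
# Tag sets copied from the lists above; everything else is illustrative.
INDIC_TAGS = {"happy", "sad", "angry", "disgust", "fear", "surprise"}
ENGLISH_TAGS = {"happy", "sad", "enunciated", "confused", "angry", "whisper"}


def build_prompt(text, tag=None, english=False, stress=()):
    """Return `text` with optional *stress* markers and a trailing <tag>."""
    allowed = ENGLISH_TAGS if english else INDIC_TAGS
    if tag is not None and tag not in allowed:
        raise ValueError(f"unsupported tag {tag!r}; choose from {sorted(allowed)}")
    for word in stress:
        # Plain substring match: wraps the first occurrence of each word.
        text = text.replace(word, f"*{word}*", 1)
    # The tag goes at the end of the sentence, as described above.
    return f"{text} <{tag}>" if tag else text


print(build_prompt("I am not sure if I can do this.", tag="confused", english=True))
print(build_prompt("No! I could never do it!", english=True, stress=["never"]))
```

Rejecting unknown tags up front is a design choice of this sketch; the README does not say how the model handles an unsupported tag.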

Inference

Approach 1: With MioTTS-Inference (recommended)

Install vllm and set up MioTTS-Inference.

bash

vllm serve SPRINGLab/Indic-Mio --gpu-memory-utilization 0.5

bash

cd MioTTS-Inference
python run_server.py

bash

python run_gradio.py

Approach 2: Directly with Transformers

python

import torch
import soundfile as sf
from transformers import AutoTokenizer, AutoModelForCausalLM
from miocodec import MioCodec

model_name = "SPRINGLab/Indic-Mio"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="cuda"
)

text = "नमस्ते, आप कैसे हैं?"  # "Hello, how are you?"
messages = [{"role": "user", "content": text}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

output = model.generate(
    **inputs,
    max_new_tokens=1024,
    do_sample=True,  # sampling must be enabled for temperature and top_p to take effect
    temperature=0.9,
    top_p=0.9,
)

# Keep only the newly generated tokens (drop the prompt)
generated = output[0][inputs["input_ids"].shape[1]:]

# Speech tokens occupy the ID range [speech_offset, speech_offset + 12800)
speech_offset = 151669
audio_codes = [
    t.item() - speech_offset
    for t in generated
    if speech_offset <= t.item() < speech_offset + 12800
]

# Decode the audio codes to a waveform with MioCodec
codec = MioCodec.from_pretrained("Aratako/MioCodec-25Hz-24kHz")
codes_tensor = torch.tensor([audio_codes], dtype=torch.long).unsqueeze(0)  # [1, 1, T]
wav = codec.decode(codes_tensor)  # -> [1, 1, num_samples]

# The sample rate written here must match the codec's output rate; confirm 44100
# against the loaded checkpoint (its name mentions 24kHz).
sf.write("output.wav", wav.squeeze().cpu().numpy(), 44100)
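The token-filtering step can be checked in isolation, without loading the model. The sketch below reuses the offset and range size from the snippet above; the 25 Hz frame rate is an assumption read off the codec name, and the one-code-per-frame mapping is likewise assumed rather than documented here. The function names are hypothetical.

```python
SPEECH_OFFSET = 151669  # first speech-token ID, from the snippet above
CODEBOOK_SIZE = 12800   # number of speech-token IDs, from the snippet above
FRAME_RATE_HZ = 25      # assumption: inferred from "25Hz" in the codec name


def extract_audio_codes(token_ids):
    """Keep only speech tokens and shift them to codec indices starting at 0."""
    upper = SPEECH_OFFSET + CODEBOOK_SIZE
    return [t - SPEECH_OFFSET for t in token_ids if SPEECH_OFFSET <= t < upper]


def estimated_duration_seconds(num_codes):
    """Rough audio length, assuming one code per 1/25 s frame."""
    return num_codes / FRAME_RATE_HZ


# Text-vocabulary IDs and out-of-range IDs are dropped; the rest are shifted.
generated = [151645, 151669, 151700, 164468, 164469, 42]
codes = extract_audio_codes(generated)
print(codes)
print(estimated_duration_seconds(1024))
```

Under these assumptions, `max_new_tokens=1024` caps a single generation at roughly 41 seconds of audio.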

Training

This model was trained on a single NVIDIA A6000 ADA GPU in less than 6 hours.

For Indian languages, the IndicTTS, Rasa and Syspin datasets were used. For American English, LibriTTS and Expresso were used, and for Indian English, the SPICOR dataset.

Fine-tuning

This model is robust yet flexible. You can fine-tune it on your own dataset to improve performance on specific languages, accents, speakers, styles or emotions. Even a few steps of LoRA fine-tuning can significantly improve performance on your target task.

Citations

If you use this model, please cite this Hugging Face repository as follows:

bibtex

@misc{indic-mio-tts,
  title={Indic-Mio TTS},
  author={Advait Joglekar},
  year={2026},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/SPRINGLab/Indic-Mio}},
}

Model provider

SPRINGLab

Model tree

Base

Aratako/MioTTS-0.6B

Fine-tuned

this model

Modalities

Input

Text

Output

Text
