SPRINGLab

Indic-Mio

License: apache-2.0

Prompting

For emotion and style control, place the tags at the end of the sentence.

For example:

मुझे यह फिल्म बहुत पसंद आई! <happy> (Hindi for "I really liked this movie!")

I am not sure if I can do this. <confused>

Tags for Indian languages: <happy>, <sad>, <angry>, <disgust>, <fear>, <surprise>

Tags for English: <happy>, <sad>, <enunciated>, <confused>, <angry>, <whisper>

To stress a word, wrap it in asterisks (*). For example: No! I could *never* do it!
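
The tagging and stress rules above can be applied programmatically before text is sent to the model. The following is a minimal sketch, not part of Indic-Mio: `build_prompt`, `INDIC_TAGS` and `ENGLISH_TAGS` are hypothetical names, and the tag sets are copied from the lists above.

```python
# Hypothetical helper; these names are illustrative and not part of Indic-Mio.
INDIC_TAGS = {"happy", "sad", "angry", "disgust", "fear", "surprise"}
ENGLISH_TAGS = {"happy", "sad", "enunciated", "confused", "angry", "whisper"}


def build_prompt(text, tag=None, english=False, stress=()):
    """Return text with optional *stress* markers and a trailing <tag>."""
    allowed = ENGLISH_TAGS if english else INDIC_TAGS
    if tag is not None and tag not in allowed:
        raise ValueError(f"unsupported tag {tag!r}; choose from {sorted(allowed)}")
    for word in stress:
        # Plain substring replacement of the first occurrence only; it does
        # not check word boundaries, so pass words that are unambiguous.
        text = text.replace(word, f"*{word}*", 1)
    # The tag goes at the end of the sentence, as described above.
    return f"{text} <{tag}>" if tag else text
```

For instance, `build_prompt("I am not sure if I can do this.", "confused", english=True)` reproduces the English example above, and passing `stress=["never"]` wraps that word in asterisks.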

Inference

Approach 1: With MioTTS-Inference (recommended)

Install vLLM and set up MioTTS-Inference. Then serve the model with vLLM, start the MioTTS-Inference server, and launch the Gradio demo:

bash
vllm serve SPRINGLab/Indic-Mio --gpu-memory-utilization 0.5

bash
cd MioTTS-Inference
python run_server.py

bash
python run_gradio.py
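
Before starting the MioTTS-Inference server, it can help to confirm that vLLM is actually serving the model. The sketch below assumes vLLM's default address (`http://localhost:8000`) and its OpenAI-compatible `/v1/models` endpoint; the function names are hypothetical, and nothing here uses the MioTTS-Inference API itself.

```python
import json
from urllib.request import urlopen


def served_model_ids(payload):
    """Extract model IDs from an OpenAI-style /v1/models response body."""
    return [entry["id"] for entry in payload.get("data", [])]


def list_served_models(base_url="http://localhost:8000"):
    """Query the vLLM server and return the IDs of the models it serves.

    Assumes the default vLLM host and port; change base_url if you passed
    --host or --port to `vllm serve`.
    """
    with urlopen(f"{base_url}/v1/models", timeout=5) as response:
        return served_model_ids(json.load(response))


# Usage, once `vllm serve` is running:
#   "SPRINGLab/Indic-Mio" in list_served_models()
```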

Approach 2: Directly with Transformers

python
from transformers import AutoTokenizer, AutoModelForCausalLM
from miocodec import MioCodec
import numpy as np
import torch

model_name = "SPRINGLab/Indic-Mio"
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="cuda"
)

text = "नमस्ते, आप कैसे हैं?"
messages = [{"role": "user", "content": text}]
prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(
    **inputs,
    max_new_tokens=1024,
    do_sample=True,  # temperature and top_p only take effect when sampling is enabled
    temperature=0.9,
    top_p=0.9,
)

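# Keep only the newly generated tokens (drop the prompt)
generated = output[0][inputs["input_ids"].shape[1]:]
# Speech tokens occupy IDs [speech_offset, speech_offset + 12800); shift them to codec indices
speech_offset = 151669
audio_codes = [t.item() - speech_offset for t in generated
               if speech_offset <= t.item() < speech_offset + 12800]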

# Convert audio_codes by decoding with MioCodec
# audio_codes -> numpy array -> MioCodec decode -> wav

codec = MioCodec.from_pretrained("Aratako/MioCodec-25Hz-24kHz")
codes_tensor = torch.tensor([audio_codes], dtype=torch.long).unsqueeze(0)  # [1, 1, T]
wav = codec.decode(codes_tensor)  # -> [1, 1, num_samples]

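import soundfile as sf
# The codec name (MioCodec-25Hz-24kHz) indicates 24 kHz output, so write the file at 24000 Hz
sf.write("output.wav", wav.squeeze().detach().float().cpu().numpy(), 24000)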
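
The token filtering step in the script can be isolated into a small pure-Python function, which makes it easy to test without loading the model. The sketch below reuses the offset (151669) and range size (12800) from the script. The 25 Hz frame rate is an assumption read off the codec name, as is the idea that each speech token corresponds to one frame; under those assumptions, `max_new_tokens=1024` caps the output at roughly 41 seconds of audio. The function names are hypothetical.

```python
SPEECH_OFFSET = 151669  # first speech-token ID, as used in the script
CODEBOOK_SIZE = 12800   # number of speech-token IDs, as used in the script
FRAME_RATE_HZ = 25      # assumption: taken from the name "MioCodec-25Hz-24kHz"


def extract_audio_codes(token_ids):
    """Keep only speech tokens and shift them to zero-based codec indices."""
    upper = SPEECH_OFFSET + CODEBOOK_SIZE
    return [t - SPEECH_OFFSET for t in token_ids if SPEECH_OFFSET <= t < upper]


def estimated_seconds(num_codes):
    """Approximate audio duration, assuming one code per 25 Hz frame."""
    return num_codes / FRAME_RATE_HZ
```

If generated audio is cut off mid-sentence, this arithmetic suggests checking whether the text needs more tokens than `max_new_tokens` allows.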

Training

This model was trained on a single NVIDIA A6000 ADA GPU in less than 6 hours.

For Indian languages, the IndicTTS, Rasa and Syspin datasets were used. For American English, LibriTTS and Expresso were used, and for Indian English, the SPICOR dataset.

Fine-tuning

This model is robust yet flexible. You can fine-tune it on your own dataset for better performance on specific languages, accents, speakers, styles or emotions. Even a few steps of LoRA fine-tuning can significantly improve performance on your target task.
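
As a starting point for the LoRA fine-tuning mentioned above, the sketch below wraps the model with adapters using the `peft` library. Everything in it is an assumption rather than the authors' recipe: the hyperparameter values are illustrative, the `target_modules` names assume the attention projections are called `q_proj`, `k_proj`, `v_proj` and `o_proj`, and `build_lora_model` is a hypothetical helper. Dataset preparation and the training loop are not covered.

```python
# Illustrative hyperparameters only; none of these values come from the model authors.
LORA_HYPERPARAMS = {
    "r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    # Assumption: attention projection layers use these module names.
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
    "task_type": "CAUSAL_LM",
}


def build_lora_model(model_name="SPRINGLab/Indic-Mio"):
    """Load the model and attach LoRA adapters (needs peft, transformers, torch)."""
    # Imported lazily so the hyperparameters can be inspected without the
    # heavy dependencies installed.
    import torch
    from peft import LoraConfig, get_peft_model
    from transformers import AutoModelForCausalLM

    base = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.bfloat16)
    model = get_peft_model(base, LoraConfig(**LORA_HYPERPARAMS))
    model.print_trainable_parameters()  # shows the fraction of weights being trained
    return model
```

Inspect the loaded model's module names before training and adjust `target_modules` if they differ from the assumed ones.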

Citations

If you use this model, please cite this Hugging Face repository as follows:

bibtex
@misc{indic-mio-tts,
  title={Indic-Mio TTS},
  author={Advait Joglekar},
  year={2026},
  publisher={Hugging Face},
  howpublished={\url{https://huggingface.co/SPRINGLab/Indic-Mio}},
}

Model Details

Model Provider

SPRINGLab

Model Tree

Base

Aratako/MioTTS-0.6B

Fine-tuned

this model

Input Modalities

Text

Output Modalities