License: apache-2.0

Why this quant
Single-stream TTS decode is memory-bandwidth bound. Quantizing the weights raises throughput on bandwidth-limited GPUs (e.g. GB10's ~273 GB/s unified memory). Measured on GB10 (decode tok/s): bf16 28 → fp8 54 → NVFP4 ~72–75.
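The figures above allow a quick sanity check of the bandwidth-bound claim. The sketch below uses only the numbers quoted in this section; the 73.5 tok/s value is simply the midpoint of the ~72–75 range, chosen for illustration. It prints the speedup over bf16 and an upper bound on how many gigabytes could be streamed per decoded token if decode saturated the full ~273 GB/s.

```python
# Back-of-the-envelope check using only the figures quoted above.
BANDWIDTH_GB_S = 273.0  # approximate GB10 unified-memory bandwidth
DECODE_TOK_S = {
    "bf16": 28.0,
    "fp8": 54.0,
    "nvfp4": 73.5,  # midpoint of the ~72-75 range (illustrative choice)
}


def speedup_vs_bf16(fmt: str) -> float:
    """Measured decode throughput relative to the bf16 baseline."""
    return DECODE_TOK_S[fmt] / DECODE_TOK_S["bf16"]


def max_gb_per_token(fmt: str) -> float:
    """Upper bound on GB read per token if decode used the whole bandwidth."""
    return BANDWIDTH_GB_S / DECODE_TOK_S[fmt]


for fmt in DECODE_TOK_S:
    print(f"{fmt:>5}: {speedup_vs_bf16(fmt):.2f}x vs bf16, "
          f"<= {max_gb_per_token(fmt):.2f} GB/token")
```

Halving the weight width from bf16 to fp8 yields close to a 2x gain, which is what a bandwidth-bound decode would predict. The smaller step from fp8 to NVFP4 suggests that costs which do not shrink with the weights start to matter at that point.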
Calibration matters for a TTS model. Generic-text calibration mis-scales the audio-token path. This checkpoint was calibrated on 96 in-distribution Maya sequences (real `<description=…>` text prompts plus their generated SNAC token streams), so the emotion/audio tokens are properly represented.

Quant method: llm-compressor `QuantizationModifier(scheme="NVFP4")`, `lm_head` left unquantized.
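For readers who want to reproduce a similar quantization, the following is a minimal sketch of how such a recipe is typically applied with llm-compressor. It is not the author's calibration script: the function name, the `calibration_dataset` argument, and the 4096-token sequence cap (borrowed from the serving command's `--max-model-len`) are assumptions. Only the NVFP4 scheme, the unquantized `lm_head`, and the 96-sample count come from this card.

```python
# Sketch only: NOT the author's calibration script.
NUM_CALIBRATION_SAMPLES = 96  # from this card
MAX_SEQUENCE_LENGTH = 4096    # assumption: mirrors --max-model-len used for serving


def quantize_nvfp4(model, calibration_dataset, output_dir: str):
    """Apply NVFP4 to Linear layers while leaving lm_head unquantized.

    `model` is a loaded Hugging Face causal LM and `calibration_dataset` is a
    tokenized datasets.Dataset of in-distribution sequences (both hypothetical
    inputs; how the author built the 96 sequences is not described here).
    """
    # Imported lazily so this file can be read without llm-compressor installed.
    from llmcompressor import oneshot
    from llmcompressor.modifiers.quantization import QuantizationModifier

    recipe = QuantizationModifier(
        targets="Linear",
        scheme="NVFP4",
        ignore=["lm_head"],  # keep the output head in its original precision
    )
    oneshot(
        model=model,
        dataset=calibration_dataset,
        recipe=recipe,
        max_seq_length=MAX_SEQUENCE_LENGTH,
        num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    )
    # Write a compressed-tensors checkpoint that vLLM can load.
    model.save_pretrained(output_dir, save_compressed=True)
    return model
```

The key point is the calibration data rather than the recipe: the dataset passed in should contain prompt plus SNAC-token sequences like those the model sees at inference time.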
Measured (GB10, NVFP4)
| Metric | Value |
|---|---|
| First-audio (streamed) | ~0.46 s |
| Throughput | ~72–75 tok/s |
| RTF | ~1.25 |
| GPU resident | ~12 GB |
| Sample rate | 24 kHz mono |
How to run
Maya emits SNAC audio tokens, not audio — you need (1) vLLM to serve the LM and (2) a thin decoder that formats the prompt, parses the SNAC tokens, and decodes them to a waveform. Both below.
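To make the "parses the SNAC tokens" step concrete, here is a hedged sketch of how generated token ids might be regrouped into the three SNAC codebook levels. The constants and the slot-to-level layout are assumptions based on the common 7-token-per-frame convention used by SNAC-based LLM TTS models; they are not stated in this card, so check them against `maya_tts_server.py` before relying on them. The function name is hypothetical.

```python
# Assumed constants (not stated in this card); verify before relying on them.
SNAC_TOKEN_OFFSET = 128266  # assumed id of the first SNAC token in the vocabulary
CODEBOOK_SIZE = 4096        # assumed number of entries per SNAC codebook slot
TOKENS_PER_FRAME = 7        # assumed: 1 coarse + 2 mid + 4 fine codes per frame


def unpack_snac_frames(token_ids):
    """Split a flat list of generated token ids into three SNAC code levels."""
    snac_min = SNAC_TOKEN_OFFSET
    snac_max = SNAC_TOKEN_OFFSET + TOKENS_PER_FRAME * CODEBOOK_SIZE - 1
    # Keep only ids inside the assumed SNAC range (drops text/control tokens).
    codes = [t for t in token_ids if snac_min <= t <= snac_max]
    # Ignore a trailing partial frame, which can occur mid-stream.
    n_frames = len(codes) // TOKENS_PER_FRAME

    level1, level2, level3 = [], [], []
    for f in range(n_frames):
        frame = codes[f * TOKENS_PER_FRAME:(f + 1) * TOKENS_PER_FRAME]
        # Each slot has its own id range; the modulo recovers the codebook index.
        s = [(t - SNAC_TOKEN_OFFSET) % CODEBOOK_SIZE for t in frame]
        level1.append(s[0])
        level2.extend([s[1], s[4]])
        level3.extend([s[2], s[3], s[5], s[6]])
    return level1, level2, level3
```

In a full decoder, the three lists would then be converted to tensors and passed to the SNAC model (`hubertsiuzdak/snac_24khz`) to obtain the 24 kHz waveform.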
1) Serve the LM with vLLM (compressed-tensors NVFP4 is auto-detected):
```bash
vllm serve <this-repo> --served-model-name maya1 \
  --return-tokens-as-token-ids --max-model-len 4096 --trust-remote-code
# DGX Spark / aarch64: use the vllm/vllm-openai:*-aarch64-cu130 image.
```
2) Decode SNAC → audio with the included wrapper (`maya_tts_server.py`), which exposes an OpenAI-compatible `/v1/audio/speech` endpoint (streaming raw PCM or WAV):
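```bash
pip install snac fastapi uvicorn httpx soundfile numpy torch
# VLLM_URL must point at your vLLM server (vllm serve listens on port 8000 unless --port is set).
VLLM_URL=http://localhost:8002 python maya_tts_server.py  # serves :8003
```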
```bash
curl -X POST http://localhost:8003/v1/audio/speech -H "Content-Type: application/json" \
  -d '{"input":"Our update <laugh_harder> finally ships!","description":"American female, 20s, warm, fast pacing.","stream":true,"stream_format":"audio","response_format":"pcm"}' \
  | ffplay -f s16le -ar 24000 -ac 1 -
```
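The same request can be issued from Python. The sketch below mirrors the curl call using only the standard library and writes the streamed PCM to a WAV file instead of piping it to `ffplay`. The helper names, the chunk size, and the output path are illustrative choices; the JSON fields and the 16-bit, 24 kHz, mono format are taken from the command above.

```python
import json
import urllib.request
import wave

SPEECH_URL = "http://localhost:8003/v1/audio/speech"  # wrapper address used above


def build_payload(text: str, description: str) -> dict:
    """Request body with the same fields as the curl example."""
    return {
        "input": text,
        "description": description,
        "stream": True,
        "stream_format": "audio",
        "response_format": "pcm",
    }


def stream_to_wav(text: str, description: str, out_path: str,
                  url: str = SPEECH_URL, chunk_size: int = 4096) -> int:
    """Stream raw s16le PCM from the wrapper into a WAV file; returns bytes written."""
    request = urllib.request.Request(
        url,
        data=json.dumps(build_payload(text, description)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    total = 0
    with urllib.request.urlopen(request) as response, wave.open(out_path, "wb") as wav:
        wav.setnchannels(1)      # mono
        wav.setsampwidth(2)      # 16-bit samples (s16le)
        wav.setframerate(24000)  # 24 kHz
        while True:
            chunk = response.read(chunk_size)
            if not chunk:
                break
            wav.writeframes(chunk)
            total += len(chunk)
    return total


# Example (requires the wrapper to be running):
# stream_to_wav("Our update <laugh_harder> finally ships!",
#               "American female, 20s, warm, fast pacing.", "out.wav")
```

Dividing the returned byte count by 48000 (2 bytes per sample at 24000 samples per second) gives the audio duration in seconds.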
Usage notes
- `description` = natural-language voice design (gender/age/accent/pitch/timbre/pace).
- Emotion tags: `<laugh_harder>` `<sigh>` `<whisper>` `<angry>` `<giggle>` `<chuckle>` `<gasp>` `<cry>` … Plain `<laugh>` can read subtle on some voices; prefer `<laugh_harder>` for an audible laugh.
- `fast pacing` in the description tightens delivery (lower total latency).
- Stream the PCM and play it as it arrives to get the ~0.46 s first-audio.
Samples
Voice design — gritty man
Emotion — angry (<angry>)
Emotion — laugh (<laugh_harder>)
Style — whisper
License & ethics
Derivative of maya-research/maya1, Apache-2.0. Quantization does not change the license. Do not use for unauthorized voice cloning, impersonation, fraud, or any unlawful/unethical purpose.
Attribution
Base model: Maya1 by Maya Research. Quantization + serving wrapper: Sggin1.
Codec: hubertsiuzdak/snac_24khz.