README

License: apache-2.0

Why this quant

Single-stream TTS decode is memory-bandwidth bound: every generated token requires reading the full weight set from memory, so shrinking the weights raises throughput on bandwidth-limited GPUs (e.g. GB10's ~273 GB/s unified memory). Measured on GB10 (decode tok/s): bf16 28 → fp8 54 → NVFP4 ~72–75.
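To put the measured figures above in context, the short sketch below compares each measured speedup over bf16 with the ideal speedup you would expect if decode time scaled purely with weight bytes. The bits-per-weight values are nominal and ignore NVFP4's per-block scale overhead, which is a simplifying assumption.

```python
# Measured decode throughput on GB10 (tok/s), copied from the paragraph above.
measured = {"bf16": 28.0, "fp8": 54.0, "nvfp4_low": 72.0, "nvfp4_high": 75.0}

# Nominal bits per weight. Assumption: scale-factor overhead is ignored.
bits = {"bf16": 16, "fp8": 8, "nvfp4_low": 4, "nvfp4_high": 4}


def speedups(measured, bits, baseline="bf16"):
    """Return {name: (measured speedup, ideal speedup, fraction of ideal)}."""
    rows = {}
    for name, tps in measured.items():
        actual = tps / measured[baseline]
        ideal = bits[baseline] / bits[name]
        rows[name] = (round(actual, 2), ideal, round(actual / ideal, 2))
    return rows


for name, (actual, ideal, frac) in speedups(measured, bits).items():
    print(f"{name:>10}: {actual:.2f}x measured vs {ideal:.0f}x ideal ({frac:.0%} of ideal)")
```

fp8 lands close to the ideal 2x, while NVFP4 reaches roughly 2.6x to 2.7x against a nominal 4x. One plausible reading is that costs which do not shrink with the weights (for example the unquantized lm_head) take a growing share of each decode step, but the numbers alone do not prove that.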

Calibration matters for a TTS model. Generic-text calibration mis-scales the audio-token path. This checkpoint was calibrated on 96 in-distribution Maya sequences (real <description=…> text prompts + their generated SNAC token streams), so the emotion/audio tokens are properly represented. Quant method: llm-compressor QuantizationModifier(scheme="NVFP4"), lm_head left unquantized.
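The quantization step described above might look roughly like the sketch below. It is not the author's script: `quantize_nvfp4`, its arguments, and `MAX_SEQ_LEN` are hypothetical names, and the calibration dataset is assumed to be an already tokenized dataset (with `input_ids`) built from the in-distribution sequences. Only the `QuantizationModifier(scheme="NVFP4")` recipe, the 96-sample count, and the unquantized `lm_head` come from the text.

```python
NUM_CALIBRATION_SAMPLES = 96  # from the description above
MAX_SEQ_LEN = 4096            # assumption: mirrors the --max-model-len used for serving


def quantize_nvfp4(model_id, calibration_dataset, out_dir):
    """One-shot NVFP4 quantization with llm-compressor (illustrative sketch).

    calibration_dataset is assumed to be a tokenized dataset of
    in-distribution prompt + SNAC token sequences.
    """
    # Deferred imports so this file can be loaded without the heavy dependencies.
    from transformers import AutoModelForCausalLM, AutoTokenizer
    from llmcompressor import oneshot
    from llmcompressor.modifiers.quantization import QuantizationModifier

    model = AutoModelForCausalLM.from_pretrained(model_id, torch_dtype="auto")
    tokenizer = AutoTokenizer.from_pretrained(model_id)

    # Quantize all Linear layers to NVFP4, leaving lm_head in its original precision.
    recipe = QuantizationModifier(targets="Linear", scheme="NVFP4", ignore=["lm_head"])

    oneshot(
        model=model,
        dataset=calibration_dataset,
        recipe=recipe,
        max_seq_length=MAX_SEQ_LEN,
        num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    )

    # Save in compressed-tensors format so vLLM can detect the scheme.
    model.save_pretrained(out_dir, save_compressed=True)
    tokenizer.save_pretrained(out_dir)
```

The import path for `oneshot` has moved between llm-compressor releases, so check it against the version you have installed.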

Measured (GB10, NVFP4)

| Metric | Value |
|---|---|
| First-audio (streamed) | ~0.46 s |
| Throughput | ~72–75 tok/s |
| RTF | ~1.25 |
| GPU resident | ~12 GB |
| Sample rate | 24 kHz mono |

How to run

Maya emits SNAC audio tokens, not audio, so you need (1) vLLM to serve the LM and (2) a thin decoder that formats the prompt, parses the SNAC tokens, and decodes them to a waveform. Both steps are shown below.

1) Serve the LM with vLLM (compressed-tensors NVFP4 is auto-detected):

```bash
vllm serve <this-repo> --served-model-name maya1 \
  --return-tokens-as-token-ids --max-model-len 4096 --trust-remote-code
# DGX Spark / aarch64: use the vllm/vllm-openai:*-aarch64-cu130 image.
```
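With `--return-tokens-as-token-ids`, vLLM's OpenAI-compatible server reports tokens in the logprobs output as strings of the form `token_id:<int>` instead of decoded text, which avoids ambiguity for tokens that do not decode to readable strings. A minimal sketch of turning those strings back into integer IDs follows; `parse_token_ids` is a hypothetical helper, and whether `maya_tts_server.py` reads tokens this way is an assumption.

```python
def parse_token_ids(tokens):
    """Convert vLLM 'token_id:<int>' strings to ints, skipping anything else."""
    prefix = "token_id:"
    ids = []
    for tok in tokens:
        if isinstance(tok, str) and tok.startswith(prefix):
            suffix = tok[len(prefix):]
            if suffix.isdigit():
                ids.append(int(suffix))
    return ids


# Made-up IDs purely for illustration; malformed entries are dropped.
example = ["token_id:101", "token_id:2048", "hello", "token_id:"]
print(parse_token_ids(example))  # [101, 2048]
```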

2) Decode SNAC → audio with the included wrapper (maya_tts_server.py), which exposes an OpenAI-compatible /v1/audio/speech (streaming raw PCM or wav):

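```bash
pip install snac fastapi uvicorn httpx soundfile numpy torch
# VLLM_URL must match the port vLLM listens on (vllm serve defaults to 8000 unless --port is set).
VLLM_URL=http://localhost:8002 python maya_tts_server.py  # serves :8003
```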

```bash
curl -X POST http://localhost:8003/v1/audio/speech -H "Content-Type: application/json" \
  -d '{"input":"Our update <laugh_harder> finally ships!",
       "description":"American female, 20s, warm, fast pacing.",
       "stream":true,"stream_format":"audio","response_format":"pcm"}' | ffplay -f s16le -ar 24000 -ac 1 -
```
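If you would rather save the result than play it, the same request can be made from Python. The sketch below uses only the standard library, sends the same JSON fields as the curl example, and wraps the raw PCM in a WAV container. The PCM layout (16-bit signed little-endian, 24 kHz, mono) is inferred from the ffplay flags above; `synthesize` and `pcm_to_wav_bytes` are hypothetical helper names.

```python
import io
import json
import urllib.request
import wave

SAMPLE_RATE = 24000  # 24 kHz mono, 16-bit samples (matches the ffplay flags)


def pcm_to_wav_bytes(pcm, sample_rate=SAMPLE_RATE):
    """Wrap raw s16le mono PCM bytes in a WAV container."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as wf:
        wf.setnchannels(1)
        wf.setsampwidth(2)  # 2 bytes per sample = 16-bit
        wf.setframerate(sample_rate)
        wf.writeframes(pcm)
    return buf.getvalue()


def synthesize(text, description, url="http://localhost:8003/v1/audio/speech"):
    """POST to the wrapper and return all streamed PCM bytes."""
    payload = {
        "input": text,
        "description": description,
        "stream": True,
        "stream_format": "audio",
        "response_format": "pcm",
    }
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    chunks = []
    with urllib.request.urlopen(req) as resp:
        while True:
            chunk = resp.read(4096)
            if not chunk:
                break
            chunks.append(chunk)
    return b"".join(chunks)


# Example (requires the wrapper to be running on :8003):
#   pcm = synthesize("Our update <laugh_harder> finally ships!",
#                    "American female, 20s, warm, fast pacing.")
#   with open("out.wav", "wb") as f:
#       f.write(pcm_to_wav_bytes(pcm))
```

This version buffers the whole response before writing, so it trades the low first-audio latency of streaming playback for simplicity.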

Usage notes

  • description = natural-language voice design (gender/age/accent/pitch/timbre/pace).
  • Emotion tags: <laugh_harder> <sigh> <whisper> <angry> <giggle> <chuckle> <gasp> <cry> … Plain <laugh> can read subtle on some voices — prefer <laugh_harder> for an audible laugh.
  • fast pacing in the description tightens delivery (lower total latency).
  • Stream the PCM and play as it arrives to get the ~0.46 s first-audio.
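To check the first-audio latency on your own setup, you can time how long the first non-empty PCM chunk takes to arrive. The sketch below is one way to do that under the assumption that the stream is 16-bit mono at 24 kHz; `time_to_first_chunk` and `audio_seconds` are hypothetical helpers, and the chunk iterable should be lazy (for example a generator that opens the HTTP request on first iteration) so that request time is included.

```python
import time


def time_to_first_chunk(chunks, clock=time.perf_counter):
    """Consume an iterable of byte chunks.

    Returns (seconds until the first non-empty chunk, total bytes received).
    The latency is None if no non-empty chunk ever arrives.
    """
    start = clock()
    first_latency = None
    total_bytes = 0
    for chunk in chunks:
        if first_latency is None and chunk:
            first_latency = clock() - start
        total_bytes += len(chunk)
    return first_latency, total_bytes


def audio_seconds(total_bytes, sample_rate=24000, bytes_per_sample=2):
    """Duration of mono PCM audio for a given byte count."""
    return total_bytes / (sample_rate * bytes_per_sample)
```

Dividing the total wall-clock generation time by `audio_seconds(total_bytes)` gives one common definition of a real-time factor, though the exact definition behind the RTF figure in this README is not stated.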

Samples

Voice design — gritty man

Emotion — angry (<angry>)

Emotion — laugh (<laugh_harder>)

Style — whisper

License & ethics

Derivative of maya-research/maya1, Apache-2.0. Quantization does not change the license. Do not use for unauthorized voice cloning, impersonation, fraud, or any unlawful/unethical purpose.

Attribution

Base model: Maya1 by Maya Research. Quantization + serving wrapper: Sggin1. Codec: hubertsiuzdak/snac_24khz.

Model provider

Sggin

Model tree

Base

maya-research/maya1

Quantized

this model

Modalities

Input

Text

Output

Text
