Sggin/maya1-nvfp4 API & Inference Endpoint

Why this quant

Single-stream TTS decode is memory-bandwidth bound: every generated token re-reads the model weights, so shrinking the weights raises throughput on bandwidth-limited GPUs (e.g. GB10's ~273 GB/s unified memory). Measured on GB10 (decode tok/s): bf16 28 → fp8 54 → NVFP4 ~72–75.
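The bandwidth argument can be made concrete with a rough estimate. The sketch below assumes a model of about 3B parameters (an assumption, not stated in this README) and ignores NVFP4 scale factors, the unquantized lm_head, and KV-cache traffic, so it gives an upper bound rather than a prediction.

```python
# Back-of-envelope decode ceiling: each generated token streams every weight once,
# so tok/s cannot exceed (memory bandwidth) / (bytes of weights).
BANDWIDTH_BYTES_PER_S = 273e9   # GB10 unified memory, ~273 GB/s (from the text)
ASSUMED_PARAMS = 3.0e9          # ASSUMPTION: ~3B parameters, not stated in this README

# Bytes per weight; the nvfp4 figure ignores block-scale overhead (assumption).
BYTES_PER_WEIGHT = {"bf16": 2.0, "fp8": 1.0, "nvfp4": 0.5}

# Measured decode tok/s quoted above (NVFP4 taken as the midpoint of 72-75).
MEASURED_TOK_S = {"bf16": 28.0, "fp8": 54.0, "nvfp4": 73.5}


def ceiling_tok_s(fmt: str) -> float:
    """Upper bound on single-stream decode speed for a given weight format."""
    weight_bytes = ASSUMED_PARAMS * BYTES_PER_WEIGHT[fmt]
    return BANDWIDTH_BYTES_PER_S / weight_bytes


if __name__ == "__main__":
    for fmt, measured in MEASURED_TOK_S.items():
        speedup = measured / MEASURED_TOK_S["bf16"]
        print(f"{fmt:6s} ceiling {ceiling_tok_s(fmt):6.1f} tok/s, "
              f"measured {measured:5.1f} tok/s ({speedup:.2f}x bf16)")
```

Under these assumptions the measured numbers sit below the ceiling in every format, and the gain from fp8 to NVFP4 is smaller than the halving of weight bytes would suggest, which is consistent with other per-token costs becoming more visible as the weights shrink.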

Calibration matters for a TTS model. Generic-text calibration mis-scales the audio-token path. This checkpoint was calibrated on 96 in-distribution Maya sequences (real <description=…> text prompts + their generated SNAC token streams), so the emotion/audio tokens are properly represented. Quant method: llm-compressor QuantizationModifier(scheme="NVFP4"), lm_head left unquantized.
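A recipe along these lines could look like the sketch below. Only the scheme, the unquantized lm_head, and the 96 calibration samples come from this README. The `targets="Linear"` setting, the `max_seq_length` value, and the shape of `calib_dataset` (a tokenized dataset of Maya prompts plus SNAC token streams) are assumptions for illustration, not the exact script used for this checkpoint.

```python
# Hedged sketch of an llm-compressor one-shot NVFP4 quantization run.
NUM_CALIBRATION_SAMPLES = 96  # from the README

# scheme and ignore come from the README; targets="Linear" is an ASSUMPTION.
RECIPE_KWARGS = {"targets": "Linear", "scheme": "NVFP4", "ignore": ["lm_head"]}


def quantize_nvfp4(model, calib_dataset, max_seq_length=4096):
    """Quantize `model` in place using in-distribution calibration data.

    `calib_dataset` is assumed to be an already tokenized dataset
    (input_ids / attention_mask). `max_seq_length=4096` is an assumption.
    """
    # Imported lazily so this file can be loaded without llm-compressor installed.
    from llmcompressor import oneshot
    from llmcompressor.modifiers.quantization import QuantizationModifier

    recipe = QuantizationModifier(**RECIPE_KWARGS)
    oneshot(
        model=model,
        dataset=calib_dataset,
        recipe=recipe,
        max_seq_length=max_seq_length,
        num_calibration_samples=NUM_CALIBRATION_SAMPLES,
    )
    return model
```

The point of passing in-distribution sequences as `calib_dataset` is that the activation statistics seen during calibration then include the audio-token path, rather than text only.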

Measured (GB10, NVFP4)

| Metric | Value |
| --- | --- |
| First-audio (streamed) | ~0.46 s |
| Throughput | ~72–75 tok/s |
| RTF | ~1.25 |
| GPU resident | ~12 GB |
| Sample rate | 24 kHz mono |

How to run

Maya emits SNAC audio tokens, not audio, so you need two pieces: (1) vLLM to serve the LM and (2) a thin decoder that formats the prompt, parses the SNAC tokens, and decodes them to a waveform. Both are shown below.

1) Serve the LM with vLLM (compressed-tensors NVFP4 is auto-detected):

```bash
vllm serve <this-repo> --served-model-name maya1 \
  --return-tokens-as-token-ids --max-model-len 4096 --trust-remote-code
# DGX Spark / aarch64: use the vllm/vllm-openai:*-aarch64-cu130 image.
```

2) Decode SNAC → audio with the included wrapper (maya_tts_server.py), which exposes an OpenAI-compatible /v1/audio/speech (streaming raw PCM or wav):

```bash
pip install snac fastapi uvicorn httpx soundfile numpy torch
VLLM_URL=http://localhost:8002 python maya_tts_server.py   # serves :8003
```
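For readers who want to see roughly what the decode step involves, here is a minimal sketch of unpacking generated token IDs into SNAC code levels and decoding them. The 7-tokens-per-frame layout, the 4096-entry codebooks, and the ID offset are assumptions based on the packing commonly used with this codec. They are not taken from maya_tts_server.py, which may differ.

```python
# Hedged sketch: token IDs -> three SNAC code levels -> waveform.
CODE_TOKEN_OFFSET = 128266  # ASSUMPTION: first SNAC token ID in the vocabulary
CODEBOOK_SIZE = 4096        # ASSUMPTION: entries per codebook
TOKENS_PER_FRAME = 7        # ASSUMPTION: 1 coarse + 2 mid + 4 fine codes per frame


def unpack_snac_frames(token_ids, offset=CODE_TOKEN_OFFSET):
    """Split a flat token stream into SNAC levels (coarse, mid, fine)."""
    hi = offset + TOKENS_PER_FRAME * CODEBOOK_SIZE
    # Drop anything outside the audio-token range (text or control tokens).
    audio = [t for t in token_ids if offset <= t < hi]
    n_frames = len(audio) // TOKENS_PER_FRAME  # a trailing partial frame is discarded
    l1, l2, l3 = [], [], []
    for f in range(n_frames):
        frame = audio[f * TOKENS_PER_FRAME:(f + 1) * TOKENS_PER_FRAME]
        s = [(t - offset) % CODEBOOK_SIZE for t in frame]
        l1.append(s[0])
        l2.extend([s[1], s[4]])
        l3.extend([s[2], s[3], s[5], s[6]])
    return l1, l2, l3


def decode_to_waveform(levels):
    """Decode unpacked levels to a float waveform (needs at least one frame)."""
    # Imported lazily so the unpacking logic can be used without torch/snac.
    import torch
    from snac import SNAC

    model = SNAC.from_pretrained("hubertsiuzdak/snac_24khz").eval()
    codes = [torch.tensor(level, dtype=torch.long).unsqueeze(0) for level in levels]
    with torch.inference_mode():
        audio = model.decode(codes)  # shape (1, 1, samples)
    return audio[0, 0].cpu().numpy()
```

Because the server is started with --return-tokens-as-token-ids, a wrapper can recover integer IDs from the vLLM response and feed them to a function like `unpack_snac_frames` without re-tokenizing text.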

```bash
curl -X POST http://localhost:8003/v1/audio/speech -H "Content-Type: application/json" \
  -d '{"input":"Our update <laugh_harder> finally ships!",
       "description":"American female, 20s, warm, fast pacing.",
       "stream":true,"stream_format":"audio","response_format":"pcm"}' | ffplay -f s16le -ar 24000 -ac 1 -
```

Usage notes

- description = natural-language voice design (gender/age/accent/pitch/timbre/pace).
- Emotion tags: <laugh_harder> <sigh> <whisper> <angry> <giggle> <chuckle> <gasp> <cry> … Plain <laugh> can read as subtle on some voices; prefer <laugh_harder> for an audible laugh.
- Adding "fast pacing" to the description tightens delivery (lower total latency).
- Stream the PCM and play it as it arrives to get the ~0.46 s first-audio.
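The streaming note can also be exercised from Python. The sketch below posts the same JSON fields as the curl example, records the time to the first audio chunk, and wraps the result in a WAV container. It assumes the stream is raw 16-bit little-endian mono PCM at 24 kHz, as the ffplay flags suggest. The function names are illustrative and not part of the wrapper.

```python
import io
import time
import wave

SAMPLE_RATE = 24000  # 24 kHz mono, from the README


def pcm16_to_wav_bytes(pcm: bytes, sample_rate: int = SAMPLE_RATE) -> bytes:
    """Wrap raw s16le mono PCM in a WAV header."""
    buf = io.BytesIO()
    with wave.open(buf, "wb") as w:
        w.setnchannels(1)
        w.setsampwidth(2)  # 16-bit samples
        w.setframerate(sample_rate)
        w.writeframes(pcm)
    return buf.getvalue()


def stream_speech(text, description,
                  url="http://localhost:8003/v1/audio/speech"):
    """Return (pcm_bytes, seconds_to_first_chunk) from the streaming endpoint."""
    import httpx  # imported lazily; already in the pip install line above

    payload = {
        "input": text,
        "description": description,
        "stream": True,
        "stream_format": "audio",
        "response_format": "pcm",
    }
    start = time.perf_counter()
    first_chunk_s = None
    chunks = []
    with httpx.stream("POST", url, json=payload, timeout=None) as resp:
        resp.raise_for_status()
        for chunk in resp.iter_bytes():
            if chunk and first_chunk_s is None:
                first_chunk_s = time.perf_counter() - start
            chunks.append(chunk)
    return b"".join(chunks), first_chunk_s


if __name__ == "__main__":
    pcm, ttfa = stream_speech("Our update <laugh_harder> finally ships!",
                              "American female, 20s, warm, fast pacing.")
    print(f"first audio chunk after {ttfa:.2f} s")
    with open("out.wav", "wb") as f:
        f.write(pcm16_to_wav_bytes(pcm))
```

This version collects the whole stream before writing the file. A real-time player would instead hand each chunk to an audio device as it arrives, which is what makes the first-audio latency noticeable to a listener.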

Samples

Voice design — gritty man

Emotion — angry (<angry>)

Emotion — laugh (<laugh_harder>)

Style — whisper

License & ethics

Derivative of maya-research/maya1, Apache-2.0. Quantization does not change the license. Do not use for unauthorized voice cloning, impersonation, fraud, or any unlawful/unethical purpose.

Attribution

Base model: Maya1 by Maya Research. Quantization + serving wrapper: Sggin1. Codec: hubertsiuzdak/snac_24khz.
