At a Glance
Table with columns: Property, Value| Property | Value |
|---|
| Base model | nvidia/Nemotron-Nano-30B |
| Architecture | Hybrid Mamba SSM + Transformer attention |
| Quant format | compressed-tensors (native vLLM) |
| Method | AutoRound W8A16 (RTN, datafree) |
| Disk size | ~30 GB |
| Min GPU | 1× A100 40GB or RTX A6000 48GB |
Note on Mamba layers
Mamba SSM layers are excluded from quantization — only the transformer attention projections are quantized. W4A16 for this model is not supported (Mamba requires specialized calibration not available in current tools).
Memory Requirements
Table with columns: Configuration, BF16, W8A16| Configuration | BF16 | W8A16 |
|---|
| Weights | ~60 GB | ~30 GB |
| Min GPU | 2× A100 40GB | 1× A100 40GB |
Quick Start
vLLM
docker run --gpus device=0 -p 8080:8080 \
vllm/vllm-openai:v0.21.0-cu129-ubuntu2404 vllm serve \
88plug/Nemotron-Nano-30B-W8A16 \
--kv-cache-dtype fp8 \
--max-model-len 32768 \
--gpu-memory-utilization 0.92
Weights are in compressed-tensors format — no --quantization flag needed.
SGLang
SGLang v0.5.8 has Mamba dual-pool support. Verify nemotron_h (hybrid Mamba-Transformer) architecture is recognized before production use.
docker run --gpus device=0 -p 30000:30000 \
lmsysorg/sglang:v0.5.8-cu129 python -m sglang.launch_server \
--model-path nvidia/Nemotron-Nano-30B \
--tp 1 \
--mem-fraction-static 0.85 \
--port 30000
llama.cpp
Text-only model — single GGUF, no mmproj needed. Verify nemotron_h architecture support in your build.
python convert_hf_to_gguf.py nvidia/Nemotron-Nano-30B \
--outfile Nemotron-Nano-30B-BF16.gguf
llama-quantize Nemotron-Nano-30B-BF16.gguf Nemotron-Nano-30B-Q8_0.gguf Q8_0
llama-quantize --imatrix calibration_datav3.txt \
Nemotron-Nano-30B-BF16.gguf Nemotron-Nano-30B-IQ4_XS.gguf IQ4_XS
llama-server \
--model Nemotron-Nano-30B-Q8_0.gguf \
--n-gpu-layers 999 \
--ctx-size 32768 \
--port 8081
Benchmarks
Results pending.
Table with columns: Engine, Format, Batch, ctx, tok/s, TTFT p50, TTFT p99, VRAM| Engine | Format | Batch | ctx | tok/s | TTFT p50 | TTFT p99 | VRAM |
|---|
| vLLM v0.21.0 | W8A16 | 1 | 32k | — | — | — | — |
| vLLM v0.21.0 | W8A16 | 8 | 32k | — |
Hardware: A6000 48 GB, CUDA 12.9, driver 570.
Quality Targets
Table with columns: Metric, Target| Metric | Target |
|---|
| KL divergence vs BF16 | < 0.005 |
| MMLU recovery | ≥ 99.7% |
Citation
@misc{nemotronnanoreport,
title = {Nemotron-Nano: A Family of Small Language Models},
author = {NVIDIA},
year = {2025},
url = {https://huggingface.co/nvidia/Nemotron-Nano-30B}
}
About
Produced by 88plug AI Lab — zero-loss quantizations of frontier omni and voice models.