At a Glance
Table with columns: Property, Value| Property | Value |
|---|
| Base model | nvidia/Nemotron-Nano-30B |
| Release tier | Provisional (datafree RTN — re-quant scheduled) |
| Quant method | datafree RTN W8A16 (weight-only INT8) |
| FLAC status | Not measured (T+7d milestone) |
| Architecture | Hybrid Mamba SSM + Transformer attention |
| Quant format | compressed-tensors (native vLLM) |
| Method | RTN W8A16 (data-free; Mamba2 selective_scan not FX-traceable) |
| Disk size | ~30 GB |
| Min GPU | 1× A100 40GB or RTX A6000 48GB |
Note on Mamba layers
Mamba SSM layers are excluded from quantization — only the transformer attention projections are quantized. W4A16 for this model is not supported (Mamba requires specialized calibration not available in current tools).
Memory Requirements
Table with columns: Configuration, BF16, W8A16| Configuration | BF16 | W8A16 |
|---|
| Weights | ~60 GB | ~30 GB |
| Min GPU | 2× A100 40GB | 1× A100 40GB |
Quick Start
Tested with vLLM v0.21.0 (vllm/vllm-openai:v0.21.0-cu129-ubuntu2404). Weights are in compressed-tensors format — vLLM detects and loads quantization automatically. No --quantization flag needed.
vLLM
docker run --gpus device=0 -p 8080:8080 \
vllm/vllm-openai:v0.21.0-cu129-ubuntu2404 vllm serve \
88plug/Nemotron-3-Nano-30B-A3B-W8A16 \
--kv-cache-dtype fp8 \
--max-model-len 32768 \
--gpu-memory-utilization 0.92
Weights are in compressed-tensors format — no --quantization flag needed. Requires vLLM ≥ v0.21.0.
SGLang
SGLang v0.5.8 has Mamba dual-pool support. Verify nemotron_h (hybrid Mamba-Transformer) architecture is recognized before production use.
docker run --gpus device=0 -p 30000:30000 \
lmsysorg/sglang:v0.5.8-cu129 python -m sglang.launch_server \
--model-path nvidia/Nemotron-Nano-30B \
--tp 1 \
--mem-fraction-static 0.85 \
--port 30000
llama.cpp
Text-only model — single GGUF, no mmproj needed. Verify nemotron_h architecture support in your build.
python convert_hf_to_gguf.py nvidia/Nemotron-Nano-30B \
--outfile Nemotron-Nano-30B-BF16.gguf
llama-quantize Nemotron-Nano-30B-BF16.gguf Nemotron-Nano-30B-Q8_0.gguf Q8_0
llama-quantize --imatrix calibration_datav3.txt \
Nemotron-Nano-30B-BF16.gguf Nemotron-Nano-30B-IQ4_XS.gguf IQ4_XS
llama-server \
--model Nemotron-Nano-30B-Q8_0.gguf \
--n-gpu-layers 999 \
--ctx-size 32768 \
--port 8081
Benchmarks
Results pending.
Table with columns: Engine, Format, Batch, ctx, tok/s, TTFT p50, TTFT p99, VRAM| Engine | Format | Batch | ctx | tok/s | TTFT p50 | TTFT p99 | VRAM |
|---|
| vLLM v0.21.0 | W8A16 | 1 | 32k | — | — | — | — |
| vLLM v0.21.0 | W8A16 | 8 | 32k | — |
Hardware: A6000 48 GB, CUDA 12.9, driver 570.
Quality Targets
Table with columns: Metric, Target| Metric | Target |
|---|
| KL divergence vs BF16 | < 0.005 |
| MMLU recovery | ≥ 99.7% |
vs. Other Nemotron-Nano-30B Quants
This is the first compressed-tensors W8A16 checkpoint for Nemotron-Nano-30B. It cuts VRAM from 60 GB to ~30 GB, enabling single A100 40GB deployment without FP8 hardware.
Table with columns: Quant, Method, Size, GPU Compatibility, Notes| Quant | Method | Size | GPU Compatibility | Notes |
|---|
| 88plug W8A16 (this) | compressed-tensors RTN W8A16 | ~30 GB | Any Ampere+ ≥40 GB | First W8A16; attention layers quantized; native vLLM |
| NVIDIA NVFP4 | FP4 (NVFP4) | ~15 GB | H100/H200 only | Official NVIDIA release; FP4 hardware required |
| BF16 baseline | None | ~60 GB | 2× A100 40GB |
Limitations
- Mamba SSM layers excluded: Mamba selective-scan layers stay BF16 — they cannot be quantized via standard
targets=["Linear"] approach. Only transformer attention projections are quantized.
- W4A16 not supported: Mamba layers require specialized calibration not available in current quantization tools. W4A16 is not planned.
- RTN (data-free) quantization: No calibration corpus used. Near-lossless at W8A16 but not AutoRound-calibrated.
- SGLang arch caveat: Verify
nemotron_h (hybrid Mamba-Transformer) architecture is recognized by your SGLang build before production use.
- Benchmark results pending: Throughput and quality benchmarks will be added post-publication.
Citation
@misc{nemotronnanoreport,
title = {Nemotron-Nano: A Family of Small Language Models},
author = {NVIDIA},
year = {2025},
url = {https://huggingface.co/nvidia/Nemotron-Nano-30B}
}
About
88plug AI Lab ships compressed-tensors quantizations for native vLLM v0.21.0+ deployment.
This release: Provisional tier — datafree RTN (weight-only rounding, no calibration corpus). A gold AutoRound re-quant is scheduled; 88plug architecture forbids new provisional W4A16 uploads.
Browse all releases → huggingface.co/88plug