88plug

Nemotron-3-Nano-30B-A3B-W8A16-NeuralMax

License: apache-2.0

At a Glance

Property	Value
Base model	`nvidia/Nemotron-Nano-30B`
Architecture	Hybrid Mamba SSM + Transformer attention
Quant format	compressed-tensors (native vLLM)
Method	AutoRound W8A16 (RTN, data-free)
Disk size	~30 GB
Min GPU	1× A100 40 GB or RTX A6000 48 GB
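
As a rough illustration of the Method row: RTN (round-to-nearest) maps each weight to an integer code by dividing by a scale and rounding, without any calibration data. The sketch below shows the idea for a single weight row with symmetric 8-bit codes. The per-row symmetric scheme, the function names, and the sample weights are assumptions for illustration; this card does not state the grouping or scale settings AutoRound actually used.

```python
def rtn_quantize_row(weights, bits=8):
    """Symmetric round-to-nearest quantization of one weight row (illustrative)."""
    qmax = 2 ** (bits - 1) - 1                      # 127 for 8-bit codes
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0  # avoid a zero scale for all-zero rows
    # Round each scaled weight to the nearest integer and clamp to the code range.
    codes = [max(-qmax, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_row(codes, scale):
    """Map integer codes back to floating-point weights."""
    return [c * scale for c in codes]


row = [0.5, -1.27, 0.0, 1.27]                       # made-up weights, not from the model
codes, scale = rtn_quantize_row(row)
restored = dequantize_row(codes, scale)
worst = max(abs(a - b) for a, b in zip(row, restored))
print(codes, f"scale={scale:.4f}", f"max error={worst:.2e}")
```

With this scheme the reconstruction error of any weight is bounded by half the scale, which is why 8-bit RTN is usually much gentler than 4-bit.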

Note on Mamba layers

Mamba SSM layers are excluded from quantization; only the transformer attention projections are quantized. W4A16 is not supported for this model because Mamba layers require specialized calibration that current tools do not provide.
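
One way to check which modules were left unquantized is to read the `quantization_config` block from a local copy of the checkpoint's `config.json`. The sketch below assumes the usual compressed-tensors layout (`quant_method`, `format`, `config_groups`, `ignore`); those key names and the `summarize_quantization` helper are assumptions, not something this card documents, so compare against the actual file.

```python
import json
from pathlib import Path


def summarize_quantization(model_dir):
    """Summarize the quantization_config block of a checkpoint's config.json."""
    config = json.loads((Path(model_dir) / "config.json").read_text())
    qcfg = config.get("quantization_config") or {}
    groups = qcfg.get("config_groups") or {}
    # Collect the distinct weight bit-widths declared across all config groups.
    weight_bits = sorted(
        {
            g["weights"]["num_bits"]
            for g in groups.values()
            if isinstance(g.get("weights"), dict) and "num_bits" in g["weights"]
        }
    )
    return {
        "quant_method": qcfg.get("quant_method"),
        "format": qcfg.get("format"),
        "weight_bits": weight_bits,
        # Modules listed here are kept in their original precision.
        "ignored_modules": qcfg.get("ignore", []),
    }
```

Calling `summarize_quantization("/path/to/local/snapshot")` (a hypothetical path) should list the skipped layers under `ignored_modules` if they were excluded through the ignore list.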

Memory Requirements

Configuration	BF16	W8A16
Weights	~60 GB	~30 GB
Min GPU	2× A100 40 GB	1× A100 40 GB
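
The weight figures in the table follow from simple arithmetic: parameter count times bytes per weight. The sketch below reproduces them, assuming roughly 30 billion parameters (taken from the model name) and decimal gigabytes. It ignores quantization scales, any layers kept in higher precision, the KV cache, and activations, so real usage will be somewhat higher.

```python
def weight_gb(n_params, bits_per_weight):
    """Approximate weight storage in decimal gigabytes (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


N_PARAMS = 30e9  # assumption: ~30B parameters, inferred from the model name

print(f"BF16 : ~{weight_gb(N_PARAMS, 16):.0f} GB")
print(f"W8A16: ~{weight_gb(N_PARAMS, 8):.0f} GB")
```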

Quick Start

vLLM

bash
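# Override the image entrypoint so that `vllm serve` is invoked exactly once,
# and listen on 8080 inside the container to match the published port.
docker run --gpus device=0 -p 8080:8080 \
  --entrypoint vllm \
  vllm/vllm-openai:v0.21.0-cu129-ubuntu2404 serve \
  88plug/Nemotron-Nano-30B-W8A16 \
  --port 8080 \
  --kv-cache-dtype fp8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.92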

The weights are stored in compressed-tensors format, which vLLM detects from the checkpoint's quantization config, so no --quantization flag is needed.
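
Once the container is up, the server can be exercised through its OpenAI-compatible HTTP API. The sketch below uses only the Python standard library and the plain completions route. The port and model name are copied from the command above; the helper names and the 60-second timeout are assumptions.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8080/v1"      # host port published by the docker command
MODEL = "88plug/Nemotron-Nano-30B-W8A16"   # must match the name the server was started with


def build_completion_request(prompt, max_tokens=64):
    """Build a POST request for the OpenAI-compatible /v1/completions route."""
    payload = {"model": MODEL, "prompt": prompt, "max_tokens": max_tokens}
    return urllib.request.Request(
        f"{BASE_URL}/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def complete(prompt, max_tokens=64):
    """Send the prompt to the running server and return the generated text."""
    request = build_completion_request(prompt, max_tokens)
    with urllib.request.urlopen(request, timeout=60) as response:
        body = json.load(response)
    return body["choices"][0]["text"]
```

For example, `print(complete("The capital of France is"))` should print a short continuation when the server is reachable.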

SGLang

SGLang v0.5.8 has Mamba dual-pool support. Verify that the nemotron_h (hybrid Mamba-Transformer) architecture is recognized before production use.

bash
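# Bind to 0.0.0.0 so the published port is reachable from outside the container.
docker run --gpus device=0 -p 30000:30000 \
  lmsysorg/sglang:v0.5.8-cu129 python3 -m sglang.launch_server \
  --model-path 88plug/Nemotron-Nano-30B-W8A16 \
  --tp 1 \
  --mem-fraction-static 0.85 \
  --host 0.0.0.0 \
  --port 30000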
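
Before checking engine support, it can help to confirm what architecture the checkpoint itself declares. The sketch below reads `model_type` and `architectures` from a local `config.json`; the helper name is hypothetical. It only reports what the checkpoint says, so whether a given SGLang or llama.cpp build can actually load that architecture still has to be tested by starting the engine.

```python
import json
from pathlib import Path


def describe_architecture(model_dir):
    """Report the architecture fields declared in a checkpoint's config.json."""
    config = json.loads((Path(model_dir) / "config.json").read_text())
    model_type = config.get("model_type")
    return {
        "model_type": model_type,
        "architectures": config.get("architectures", []),
        # True when the checkpoint declares the hybrid Mamba-Transformer type.
        "is_nemotron_h": model_type == "nemotron_h",
    }
```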

llama.cpp

This is a text-only model, so a single GGUF file is enough and no mmproj file is needed. Verify that your llama.cpp build supports the nemotron_h architecture.

bash
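# convert_hf_to_gguf.py reads a local checkpoint directory, so download
# nvidia/Nemotron-Nano-30B into ./Nemotron-Nano-30B first.
python convert_hf_to_gguf.py ./Nemotron-Nano-30B \
  --outtype bf16 \
  --outfile Nemotron-Nano-30B-BF16.gguf

llama-quantize Nemotron-Nano-30B-BF16.gguf Nemotron-Nano-30B-Q8_0.gguf Q8_0

# --imatrix expects an importance-matrix file, not raw text, so build one
# from the calibration text with llama-imatrix before quantizing.
llama-imatrix -m Nemotron-Nano-30B-BF16.gguf \
  -f calibration_datav3.txt -o imatrix.dat
llama-quantize --imatrix imatrix.dat \
  Nemotron-Nano-30B-BF16.gguf Nemotron-Nano-30B-IQ4_XS.gguf IQ4_XS

llama-server \
  --model Nemotron-Nano-30B-Q8_0.gguf \
  --n-gpu-layers 999 \
  --ctx-size 32768 \
  --port 8081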

Benchmarks

Results pending.

Engine	Format	Batch	ctx	tok/s	TTFT p50	TTFT p99	VRAM
vLLM v0.21.0	W8A16	1	32k	—	—	—	—
vLLM v0.21.0	W8A16	8	32k	—	—	—	—

Hardware: A6000 48 GB, CUDA 12.9, driver 570.
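
For anyone filling in the table, the tok/s and TTFT columns can be derived from raw per-request timings. The sketch below uses a nearest-rank percentile and a simple tokens-per-second ratio. The sample values are made up for illustration and are not measurements of this model; the function names are likewise hypothetical.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct / 100 * len(ordered)))
    return ordered[rank - 1]


def throughput(generated_tokens, wall_seconds):
    """Decode throughput in tokens per second over one benchmark run."""
    return generated_tokens / wall_seconds


# Made-up time-to-first-token samples in seconds, for illustration only.
ttft_samples = [0.10, 0.12, 0.11, 0.30, 0.13]
print("TTFT p50:", percentile(ttft_samples, 50))
print("TTFT p99:", percentile(ttft_samples, 99))
print("tok/s   :", throughput(generated_tokens=2048, wall_seconds=16.0))
```

With only a handful of requests the p99 is simply the slowest one, so a meaningful p99 needs a few hundred samples per configuration.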

Quality Targets

Metric	Target
KL divergence vs BF16	< 0.005
MMLU recovery	≥ 99.7%
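
The two targets can be checked with a few lines of arithmetic once logits and benchmark scores are available. The sketch below assumes the KL divergence is taken from the BF16 distribution to the quantized one, measured in nats and averaged over token positions, and that recovery is the quantized score as a percentage of the BF16 score. The card does not spell out either definition, so treat both as assumptions.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized logit vector."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def kl_divergence(ref_logits, quant_logits):
    """KL(P_ref || P_quant) in nats for a single token position."""
    log_p = log_softmax(ref_logits)
    log_q = log_softmax(quant_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def mean_kl(ref_rows, quant_rows):
    """Average the per-position KL over all evaluated token positions."""
    pairs = list(zip(ref_rows, quant_rows))
    return sum(kl_divergence(r, q) for r, q in pairs) / len(pairs)


def recovery_pct(quant_score, ref_score):
    """Quantized benchmark score as a percentage of the BF16 score."""
    return 100.0 * quant_score / ref_score


def meets_targets(kl, recovery, kl_max=0.005, recovery_min=99.7):
    """Apply the thresholds from the Quality Targets table."""
    return kl < kl_max and recovery >= recovery_min
```

In practice `ref_rows` and `quant_rows` would hold the logits both models produce on the same evaluation text, one row per token position.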

Citation

bibtex
@misc{nemotronnanoreport,
  title  = {Nemotron-Nano: A Family of Small Language Models},
  author = {NVIDIA},
  year   = {2025},
  url    = {https://huggingface.co/nvidia/Nemotron-Nano-30B}
}

About

Produced by 88plug AI Lab — zero-loss quantizations of frontier omni and voice models.

Model Details

Model Provider

88plug

Model Tree

Base

nvidia/NVIDIA-Nemotron-3-Nano-30B-A3B-Base-BF16

Quantized

this model

Input Modalities

Text

Output Modalities