Nemotron-3-Nano-30B-A3B-W8A16 API & Inference Endpoint

Load path (important)

These weights are compressed-tensors (pack-quantized / int-quantized).

Table with columns: Runtime, Supported
Runtime	Supported
vLLM ≥ 0.21	Yes — preferred (auto-detect CT; no `--quantization` flag)
transformers + `compressed-tensors`	Yes for many text models; multimodal may need custom code
Text Generation Inference (TGI)	Not supported for these CT packs
Hugging Face Inference Widget	Often fails — use vLLM locally instead

bash
# Preferred
vllm serve 88plug/<ModelName> --trust-remote-code

Do not deploy via TGI “text-generation-inference” paths — that backend does not load our CT format and produces opaque worker/load errors.

Nemotron-Nano-30B-W8A16

INT8 post-training quantization of nvidia/Nemotron-Nano-30B — NVIDIA's hybrid Mamba-Transformer 30B model. ~30 GB on disk. Runs on a single A100 40GB.

At a Glance

Table with columns: Property, Value
Property	Value
Base model	`nvidia/Nemotron-Nano-30B`
Release tier	Provisional (datafree RTN — re-quant scheduled)
Quant method	datafree RTN W8A16 (weight-only INT8)
FLAC status	Not measured (T+7d milestone)
Architecture	Hybrid Mamba SSM + Transformer attention
Quant format	compressed-tensors (native vLLM)
Method	RTN W8A16 (data-free; Mamba2 selective_scan not FX-traceable)
Disk size

Note on Mamba layers

Mamba SSM layers are excluded from quantization — only the transformer attention projections are quantized. W4A16 for this model is not supported (Mamba requires specialized calibration not available in current tools).

Memory Requirements

Table with columns: Configuration, BF16, W8A16
Configuration	BF16	W8A16
Weights	~60 GB	~30 GB
Min GPU	2× A100 40GB	1× A100 40GB

Quick Start

Tested with vLLM v0.21.0 (vllm/vllm-openai:v0.21.0-cu129-ubuntu2404). Weights are in compressed-tensors format — vLLM detects and loads quantization automatically. No --quantization flag needed.

vLLM

bash
docker run --gpus device=0 -p 8080:8080 \
  vllm/vllm-openai:v0.21.0-cu129-ubuntu2404 vllm serve \
  88plug/Nemotron-3-Nano-30B-A3B-W8A16 \
  --kv-cache-dtype fp8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.92

Weights are in compressed-tensors format — no --quantization flag needed. Requires vLLM ≥ v0.21.0.

SGLang

SGLang v0.5.8 has Mamba dual-pool support. Verify nemotron_h (hybrid Mamba-Transformer) architecture is recognized before production use.

bash
docker run --gpus device=0 -p 30000:30000 \
  lmsysorg/sglang:v0.5.8-cu129 python -m sglang.launch_server \
  --model-path nvidia/Nemotron-Nano-30B \
  --tp 1 \
  --mem-fraction-static 0.85 \
  --port 30000

llama.cpp

Text-only model — single GGUF, no mmproj needed. Verify nemotron_h architecture support in your build.

bash
python convert_hf_to_gguf.py nvidia/Nemotron-Nano-30B \
  --outfile Nemotron-Nano-30B-BF16.gguf

llama-quantize Nemotron-Nano-30B-BF16.gguf Nemotron-Nano-30B-Q8_0.gguf Q8_0
llama-quantize --imatrix calibration_datav3.txt \
  Nemotron-Nano-30B-BF16.gguf Nemotron-Nano-30B-IQ4_XS.gguf IQ4_XS

llama-server \
  --model Nemotron-Nano-30B-Q8_0.gguf \
  --n-gpu-layers 999 \
  --ctx-size 32768 \
  --port 8081

Benchmarks

Table with columns: Metric, Score
Metric	Score
Throughput total (in2048_out512)	2265 tok/s (prompt+output)
Throughput total (in512_out256)	1009 tok/s (prompt+output)
Throughput total (in8192_out512)	4641 tok/s (prompt+output)

Results from 88plug benchmark ladder — vLLM v0.21.0, lm-evaluation-harness. No fabricated numbers.

Quality Targets

Table with columns: Metric, Target
Metric	Target
KL divergence vs BF16	< 0.005
MMLU recovery	≥ 99.7%

vs. Other Nemotron-Nano-30B Quants

This is the first compressed-tensors W8A16 checkpoint for Nemotron-Nano-30B. It cuts VRAM from 60 GB to ~30 GB, enabling single A100 40GB deployment without FP8 hardware.

Table with columns: Quant, Method, Size, GPU Compatibility, Notes
Quant	Method	Size	GPU Compatibility	Notes
88plug W8A16 (this)	compressed-tensors RTN W8A16	~30 GB	Any Ampere+ ≥40 GB	First W8A16; attention layers quantized; native vLLM
NVIDIA NVFP4	FP4 (NVFP4)	~15 GB	H100/H200 only	Official NVIDIA release; FP4 hardware required
BF16 baseline	None	~60 GB	2× A100 40GB

Limitations

Mamba SSM layers excluded: Mamba selective-scan layers stay BF16 — they cannot be quantized via standard targets=["Linear"] approach. Only transformer attention projections are quantized.
W4A16 not supported: Mamba layers require specialized calibration not available in current quantization tools. W4A16 is not planned.
RTN (data-free) quantization: No calibration corpus used. Near-lossless at W8A16 but not AutoRound-calibrated.
SGLang arch caveat: Verify nemotron_h (hybrid Mamba-Transformer) architecture is recognized by your SGLang build before production use.
Benchmark results pending: Throughput and quality benchmarks will be added post-publication.

Citation

bibtex
@misc{nemotronnanoreport,
  title  = {Nemotron-Nano: A Family of Small Language Models},
  author = {NVIDIA},
  year   = {2025},
  url    = {https://huggingface.co/nvidia/Nemotron-Nano-30B}
}

About

88plug AI Lab ships compressed-tensors quantizations for native vLLM v0.21.0+ deployment.

This release: Provisional tier — datafree RTN (weight-only rounding, no calibration corpus). A gold AutoRound re-quant is scheduled; 88plug architecture forbids new provisional W4A16 uploads.

Browse all releases → huggingface.co/88plug

Load path (important)

These weights are compressed-tensors (pack-quantized / int-quantized).

Table with columns: Runtime, Supported
Runtime	Supported
vLLM ≥ 0.21	Yes — preferred (auto-detect CT; no `--quantization` flag)
transformers + `compressed-tensors`	Yes for many text models; multimodal may need custom code
Text Generation Inference (TGI)	Not supported for these CT packs
Hugging Face Inference Widget	Often fails — use vLLM locally instead

bash
# Preferred
vllm serve 88plug/<ModelName> --trust-remote-code

Do not deploy via TGI “text-generation-inference” paths — that backend does not load our CT format and produces opaque worker/load errors.

Nemotron-Nano-30B-W8A16

INT8 post-training quantization of nvidia/Nemotron-Nano-30B — NVIDIA's hybrid Mamba-Transformer 30B model. ~30 GB on disk. Runs on a single A100 40GB.

At a Glance

Table with columns: Property, Value
Property	Value
Base model	`nvidia/Nemotron-Nano-30B`
Release tier	Provisional (datafree RTN — re-quant scheduled)
Quant method	datafree RTN W8A16 (weight-only INT8)
FLAC status	Not measured (T+7d milestone)
Architecture	Hybrid Mamba SSM + Transformer attention
Quant format	compressed-tensors (native vLLM)
Method	RTN W8A16 (data-free; Mamba2 selective_scan not FX-traceable)
Disk size

Note on Mamba layers

Memory Requirements

Table with columns: Configuration, BF16, W8A16
Configuration	BF16	W8A16
Weights	~60 GB	~30 GB
Min GPU	2× A100 40GB	1× A100 40GB

Quick Start

vLLM

bash
docker run --gpus device=0 -p 8080:8080 \
  vllm/vllm-openai:v0.21.0-cu129-ubuntu2404 vllm serve \
  88plug/Nemotron-3-Nano-30B-A3B-W8A16 \
  --kv-cache-dtype fp8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.92

Weights are in compressed-tensors format — no --quantization flag needed. Requires vLLM ≥ v0.21.0.

SGLang

SGLang v0.5.8 has Mamba dual-pool support. Verify nemotron_h (hybrid Mamba-Transformer) architecture is recognized before production use.

bash
docker run --gpus device=0 -p 30000:30000 \
  lmsysorg/sglang:v0.5.8-cu129 python -m sglang.launch_server \
  --model-path nvidia/Nemotron-Nano-30B \
  --tp 1 \
  --mem-fraction-static 0.85 \
  --port 30000

llama.cpp

Text-only model — single GGUF, no mmproj needed. Verify nemotron_h architecture support in your build.

bash
python convert_hf_to_gguf.py nvidia/Nemotron-Nano-30B \
  --outfile Nemotron-Nano-30B-BF16.gguf

llama-quantize Nemotron-Nano-30B-BF16.gguf Nemotron-Nano-30B-Q8_0.gguf Q8_0
llama-quantize --imatrix calibration_datav3.txt \
  Nemotron-Nano-30B-BF16.gguf Nemotron-Nano-30B-IQ4_XS.gguf IQ4_XS

llama-server \
  --model Nemotron-Nano-30B-Q8_0.gguf \
  --n-gpu-layers 999 \
  --ctx-size 32768 \
  --port 8081

Benchmarks

Table with columns: Metric, Score
Metric	Score
Throughput total (in2048_out512)	2265 tok/s (prompt+output)
Throughput total (in512_out256)	1009 tok/s (prompt+output)
Throughput total (in8192_out512)	4641 tok/s (prompt+output)

Results from 88plug benchmark ladder — vLLM v0.21.0, lm-evaluation-harness. No fabricated numbers.

Quality Targets

Table with columns: Metric, Target
Metric	Target
KL divergence vs BF16	< 0.005
MMLU recovery	≥ 99.7%

vs. Other Nemotron-Nano-30B Quants

This is the first compressed-tensors W8A16 checkpoint for Nemotron-Nano-30B. It cuts VRAM from 60 GB to ~30 GB, enabling single A100 40GB deployment without FP8 hardware.

Table with columns: Quant, Method, Size, GPU Compatibility, Notes
Quant	Method	Size	GPU Compatibility	Notes
88plug W8A16 (this)	compressed-tensors RTN W8A16	~30 GB	Any Ampere+ ≥40 GB	First W8A16; attention layers quantized; native vLLM
NVIDIA NVFP4	FP4 (NVFP4)	~15 GB	H100/H200 only	Official NVIDIA release; FP4 hardware required
BF16 baseline	None	~60 GB	2× A100 40GB

Limitations

Mamba SSM layers excluded: Mamba selective-scan layers stay BF16 — they cannot be quantized via standard targets=["Linear"] approach. Only transformer attention projections are quantized.
W4A16 not supported: Mamba layers require specialized calibration not available in current quantization tools. W4A16 is not planned.
RTN (data-free) quantization: No calibration corpus used. Near-lossless at W8A16 but not AutoRound-calibrated.
SGLang arch caveat: Verify nemotron_h (hybrid Mamba-Transformer) architecture is recognized by your SGLang build before production use.
Benchmark results pending: Throughput and quality benchmarks will be added post-publication.

Citation

bibtex
@misc{nemotronnanoreport,
  title  = {Nemotron-Nano: A Family of Small Language Models},
  author = {NVIDIA},
  year   = {2025},
  url    = {https://huggingface.co/nvidia/Nemotron-Nano-30B}
}

About

88plug AI Lab ships compressed-tensors quantizations for native vLLM v0.21.0+ deployment.

This release: Provisional tier — datafree RTN (weight-only rounding, no calibration corpus). A gold AutoRound re-quant is scheduled; 88plug architecture forbids new provisional W4A16 uploads.

Browse all releases → huggingface.co/88plug

Nemotron-3-Nano-30B-A3B-W8A16

README

Load path (important)

Nemotron-Nano-30B-W8A16

At a Glance

Note on Mamba layers

Memory Requirements

Quick Start

vLLM

SGLang

llama.cpp

Benchmarks

Quality Targets

vs. Other Nemotron-Nano-30B Quants

Limitations

Citation

About

Explore FriendliAI today

README

Load path (important)

Nemotron-Nano-30B-W8A16

At a Glance

Note on Mamba layers

Memory Requirements

Quick Start

vLLM

SGLang

llama.cpp

Benchmarks

Quality Targets

vs. Other Nemotron-Nano-30B Quants

Limitations

Citation

About