Nemotron-3-Super-92B-W4A16 API & Inference Endpoint

At a glance

Base model	nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16
Format	W4A16
Total params	92B
Active / token	12B
Experts / layer	384
Layers	—
Hidden size	4096
Context	262,144
On-disk size	56 GB
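
As a rough sanity check on the on-disk figure, the sketch below estimates the packed size of 92B parameters at 4 bits per weight. The group size of 128 and the per-group 16-bit scale plus 4-bit zero point are assumptions made for illustration. This card does not state them, and it does not say which tensors, if any, were left unquantized.

```python
TOTAL_PARAMS = 92e9  # "Total params" from the table above


def estimate_w4_size_gb(total_params: float,
                        weight_bits: int = 4,
                        group_size: int = 128,       # assumption: not stated in this card
                        scale_bits: int = 16,        # assumption: one 16-bit scale per group
                        zero_point_bits: int = 4) -> float:  # assumption: one 4-bit zero point per group
    """Estimate packed size in decimal GB if every parameter were quantized."""
    overhead_bits_per_weight = (scale_bits + zero_point_bits) / group_size
    bits_per_weight = weight_bits + overhead_bits_per_weight
    return total_params * bits_per_weight / 8 / 1e9


# Lower bound: 4-bit payload only, with no scales or zero points.
raw_gb = TOTAL_PARAMS * 4 / 8 / 1e9

# Payload plus the assumed per-group quantization metadata.
with_overhead_gb = estimate_w4_size_gb(TOTAL_PARAMS)

print(f"4-bit payload only:      {raw_gb:.1f} GB")
print(f"with per-group metadata: {with_overhead_gb:.1f} GB")
```

Under these assumptions the estimates come to about 46 GB and 47.8 GB, both below the listed 56 GB. The difference would be consistent with some tensors being stored at higher precision, but the card gives no breakdown, so treat that as a guess.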

Which variant should I pick?

Variant	Format	Link
`Nemotron-3-Super-64B`	BF16	link
`Nemotron-3-Super-64B-W4A16`	W4A16	link
`Nemotron-3-Super-92B`	BF16	link

Draft AutoRound quantization of a Nemotron Super checkpoint.

Base model: nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16

Draft status

This is a draft research release. It is published for inspection, reproducibility, and early runtime validation. It should not be treated as a final benchmarked production checkpoint.

How this was produced

We quantized the checkpoint with Intel AutoRound using the W4A16 scheme (4-bit weights, 16-bit activations) on a remote host with 8x RTX 3090 GPUs. The run was configured for overnight completion and resumability rather than for final accuracy tuning.

Settings used

source checkpoint: /mnt/llm_models/nemotron-super-compressions/nemotron_super_merged_long50_short15120_v2/reap_25pct
source type: REAP 25% pruned checkpoint
quantizer: intel/auto-round 0.10.2
scheme: W4A16
format: auto_round
calibration dataset: NeelNanda/pile-10k
device_map: auto
nsamples: 128
iters: 50
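
The settings above map onto AutoRound's Python API roughly as follows. This is a sketch, not the script used for this release. The output directory is hypothetical, and the keyword names (`scheme`, `dataset`, `nsamples`, `iters`, `device_map`) are assumed to match the installed auto-round version, so check them against its documentation before running.

```python
# Source checkpoint path, copied from the settings list above.
SOURCE_CHECKPOINT = (
    "/mnt/llm_models/nemotron-super-compressions/"
    "nemotron_super_merged_long50_short15120_v2/reap_25pct"
)

# Export format, copied from the settings list above.
EXPORT_FORMAT = "auto_round"


def build_autoround_kwargs() -> dict:
    """Collect the quantization settings listed in this card as keyword arguments."""
    return {
        "scheme": "W4A16",                # 4-bit weights, 16-bit activations
        "dataset": "NeelNanda/pile-10k",  # calibration dataset
        "nsamples": 128,                  # number of calibration samples
        "iters": 50,                      # tuning iterations
        "device_map": "auto",             # let the library place layers across GPUs
    }


def quantize(output_dir: str = "./nemotron-super-w4a16-draft") -> None:
    """Run quantization and save the result. `output_dir` is a hypothetical path."""
    # Imported lazily so the settings can be inspected without auto-round installed.
    from auto_round import AutoRound

    autoround = AutoRound(SOURCE_CHECKPOINT, **build_autoround_kwargs())
    autoround.quantize_and_save(output_dir=output_dir, format=EXPORT_FORMAT)
```

Calling `quantize()` would start the run. Keeping the settings in one function makes it easy to diff them against the list above when reproducing the release.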

Notes

Upstream provenance is preserved through the base model link above.
This repo is intentionally marked as a draft while quantization and runtime validation are still in progress.
A donation link was added at the maintainer's request.

License & citation

License inherited from the base model.

@misc{lasby2025reap,
  title         = {REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author        = {Mike Lasby and Ivan Lazarevich and Nish Sinnadurai and Sean Lie and Yani Ioannou and Vithursan Thangarasa},
  year          = {2025},
  eprint        = {2510.13999},
  archivePrefix = {arXiv}
}
