License: mit
At a glance
| Field | Value |
|---|---|
| Base model | nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16 |
| Format | W4A16 |
| Total params | 64B |
| Active / token | 12B |
| Experts / layer | 256 |
| Layers | — |
| Hidden size | 4096 |
| Context | 262,144 |
| On-disk size | 42 GB |
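As a rough sanity check on the table above, the weight-only storage implied by each format can be worked out directly. This is a sketch under stated assumptions: it treats all 64B parameters as stored at the given bit width and ignores quantization scales, zero points, and any layers kept in higher precision, which is presumably why the reported on-disk size (42 GB) sits above the 4-bit estimate. The helper name `weight_gb` is illustrative only.

```python
def weight_gb(num_params: float, bits_per_weight: int) -> float:
    """Weight-only storage in GB (1 GB = 1e9 bytes), ignoring all metadata."""
    return num_params * bits_per_weight / 8 / 1e9


TOTAL_PARAMS = 64e9  # "Total params" row of the table

bf16_gb = weight_gb(TOTAL_PARAMS, 16)  # 16-bit weights
w4_gb = weight_gb(TOTAL_PARAMS, 4)     # 4-bit weights (the "W4" in W4A16)

print(f"BF16 weights:  {bf16_gb:.0f} GB")  # 128 GB
print(f"4-bit weights: {w4_gb:.0f} GB")    # 32 GB
```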
Which variant should I pick?
| Variant | Format | Link |
|---|---|---|
| Nemotron-3-Super-64B | BF16 | link |
| Nemotron-3-Super-64B-W4A16 (this) | W4A16 | link |
| Nemotron-3-Super-92B | BF16 | link |
| Nemotron-3-Super-92B-W4A16 | W4A16 | link |
Draft AutoRound quantization of a Nemotron Super checkpoint.
Base model: nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16
Draft status
This is a draft research release. It is published for inspection, reproducibility, and early runtime validation. It should not be treated as a final benchmarked production checkpoint.
How this was produced
We quantized the checkpoint with Intel AutoRound using the W4A16 scheme (4-bit weights, 16-bit activations) on a remote 8x RTX 3090 host. This run was configured for overnight completion and resumability rather than final accuracy tuning.
Settings used
- source checkpoint: /mnt/llm_models/nemotron-super-compressions/nemotron_super_merged_long50_short15120_v2/reap_50pct
- source type: REAP 50% pruned checkpoint
- quantizer: intel/auto-round 0.10.2
- scheme: W4A16
- format: auto_round
- calibration dataset: NeelNanda/pile-10k
- device_map: auto
- nsamples: 128
- iters: 50
- seqlen: 1024
- batch_size: 2
- nblocks: 1
- low_gpu_mem_usage: True
- output dir: /home/ser/nemotron-super/autoround_w4a16/reap_50pct
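The settings above map roughly onto an AutoRound call like the one sketched here. This is not the script used for this release. It assumes the `auto_round` package exposes `AutoRound(...)` and `quantize_and_save(...)` with keyword names matching the keys in the list, so check them against the installed version. The `AutoRoundSettings` dataclass and the `quantize` and `calibration_tokens` helpers are hypothetical names introduced for illustration.

```python
from dataclasses import asdict, dataclass


@dataclass(frozen=True)
class AutoRoundSettings:
    """Values copied from the settings list; field names mirror its keys."""
    scheme: str = "W4A16"
    dataset: str = "NeelNanda/pile-10k"
    device_map: str = "auto"
    nsamples: int = 128
    iters: int = 50
    seqlen: int = 1024
    batch_size: int = 2
    nblocks: int = 1
    low_gpu_mem_usage: bool = True


def calibration_tokens(settings: AutoRoundSettings) -> int:
    """Upper bound on calibration tokens: samples times sequence length."""
    return settings.nsamples * settings.seqlen


def quantize(source_dir: str, output_dir: str, settings: AutoRoundSettings) -> None:
    """Run AutoRound and save in the auto_round format (assumed API, see lead-in)."""
    # Imported lazily so the settings can be inspected without auto-round installed.
    from auto_round import AutoRound

    ar = AutoRound(source_dir, **asdict(settings))
    ar.quantize_and_save(output_dir=output_dir, format="auto_round")


settings = AutoRoundSettings()
print(calibration_tokens(settings))  # 131072
```

To reproduce the run, `quantize` would be called with the source checkpoint and output dir from the list. With `iters` at 50 and a calibration budget of about 131k tokens, this is a light configuration, consistent with the focus on overnight completion rather than accuracy tuning.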
Notes
- upstream provenance is preserved through the base model link above
- this repo is intentionally marked draft while quantization/runtime validation is still in progress
- donation link added per maintainer request
License & citation
License inherited from the base model.
bibtex
@misc{lasby2025reap,
  title = {REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author = {Mike Lasby and Ivan Lazarevich and Nish Sinnadurai and Sean Lie and Yani Ioannou and Vithursan Thangarasa},
  year = {2025},
  eprint = {2510.13999},
  archivePrefix = {arXiv}
}
Sponsors
Made possible by NVIDIA · TNG Technology · Lambda · Prime Intellect · Hot Aisle.
Model provider: 0xSero
Model tree: base nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16; quantized: this model
Modalities: text input, text output