License: other
At a glance
| Base model | nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16 |
| Format | BF16 |
| Total params | 92B |
| Active / token | 12B |
| Experts / layer | 384 |
| Layers | — |
| Hidden size | 4096 |
| Context | 262,144 |
| On-disk size | 185 GB |
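As a rough sanity check on the table, BF16 stores each parameter in 2 bytes, so the on-disk size should be close to twice the parameter count in gigabytes. A minimal sketch, assuming decimal gigabytes and ignoring file metadata and any tensors not stored in BF16 (the function name is illustrative, not part of this repo):

```python
def estimate_checkpoint_gb(total_params: float, bytes_per_param: float = 2.0) -> float:
    """Estimate checkpoint size in decimal GB (1 GB = 1e9 bytes)."""
    return total_params * bytes_per_param / 1e9


# 92B total parameters, as listed in the table above (a rounded figure).
bf16_gb = estimate_checkpoint_gb(92e9)
print(f"Estimated BF16 size: {bf16_gb:.0f} GB")
```

The estimate lands within about 1% of the listed 185 GB, which is plausible given that 92B is itself rounded.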
Which variant should I pick?
| Variant | Format | Link |
|---|---|---|
| Nemotron-3-Super-64B | BF16 | link |
| Nemotron-3-Super-64B-W4A16 | W4A16 | link |
| Nemotron-3-Super-92B (this) | BF16 | link |
| Nemotron-3-Super-92B-W4A16 | W4A16 | link |
This repo is a draft REAP-derived checkpoint based on nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16.
Provenance
Research References
- Paper: arXiv:2510.13999
- Code and experiment source: 0xSero/reap-expert-swap
- Support and research funding: donate.sybilsolutions.ai
- Upstream base model: nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16
- Relationship: REAP-derived expert-pruned checkpoint, not a mirror of the original base release
- Draft status: draft research release
- This repo publishes the derived checkpoint weights and runtime files produced from our REAP pruning workflow.
Pruning details
- Experts per MoE layer in upstream base: `512`
- Experts retained per layer in this variant: `384`
- Experts pruned per layer in this variant: `128`
- Expected safetensor shard count in this draft repo: `4`
- Source merged observation workflow: `nemotron_super_merged_long50_short15120_v2`
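The pruning figures above can be cross-checked with a few lines of arithmetic. This sketch uses only numbers stated in this card (including the MoE block count of 40 from the model facts). The variable names are illustrative.

```python
UPSTREAM_EXPERTS_PER_LAYER = 512
PRUNED_EXPERTS_PER_LAYER = 128
MOE_BLOCKS = 40  # MoE block count reported in this card's model facts

# Experts left in each MoE layer after pruning.
retained_per_layer = UPSTREAM_EXPERTS_PER_LAYER - PRUNED_EXPERTS_PER_LAYER
retained_fraction = retained_per_layer / UPSTREAM_EXPERTS_PER_LAYER

# Totals across all MoE blocks.
total_pruned = PRUNED_EXPERTS_PER_LAYER * MOE_BLOCKS
total_retained = retained_per_layer * MOE_BLOCKS

print(f"retained per layer: {retained_per_layer} ({retained_fraction:.0%})")
print(f"experts pruned across the model: {total_pruned}")
print(f"experts retained across the model: {total_retained}")
```

This confirms that 384 retained experts per layer corresponds to keeping 75% of the upstream experts.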
Method summary
The pruning signal comes from layerwise REAP observations collected over a mixed calibration corpus dominated by a personal AI-session history plus a bounded public augmentation slice.
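To make the idea of a layerwise pruning signal concrete, here is a simplified illustration of a router-weighted saliency score in the spirit of the cited REAP paper: for each expert, average the gate weight times the expert output norm over the tokens routed to it, then prune the lowest-scoring experts. This is a sketch under stated assumptions, not the workflow used to produce this checkpoint. The observation format, function names, and toy numbers are all hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Tuple

# One observation = (expert_id, gate_weight, expert_output_norm) for a single
# token routed to a single expert. This record format is an assumption.
Observation = Tuple[int, float, float]


def saliency_scores(observations: List[Observation]) -> Dict[int, float]:
    """Mean of gate_weight * output_norm over the tokens routed to each expert."""
    sums: Dict[int, float] = defaultdict(float)
    counts: Dict[int, int] = defaultdict(int)
    for expert_id, gate_weight, output_norm in observations:
        sums[expert_id] += gate_weight * output_norm
        counts[expert_id] += 1
    return {e: sums[e] / counts[e] for e in sums}


def experts_to_prune(scores: Dict[int, float], num_experts: int, num_prune: int) -> List[int]:
    """Return the num_prune lowest-saliency experts; never-routed experts score 0."""
    ranked = sorted(range(num_experts), key=lambda e: (scores.get(e, 0.0), e))
    return ranked[:num_prune]


# Toy layer with 4 experts; expert 3 is never routed to.
toy_observations = [(0, 0.5, 2.0), (0, 0.3, 2.0), (1, 0.1, 1.0), (2, 0.6, 3.0)]
toy_scores = saliency_scores(toy_observations)
toy_pruned = experts_to_prune(toy_scores, num_experts=4, num_prune=2)
print(toy_scores, toy_pruned)
```

In this toy layer the unused expert and the expert with the weakest weighted activations are selected for removal. The real signal is collected per layer over the calibration corpus described here.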
Validated observation lanes used in the merged signal:
- `nemotron_super_long50_16k_v3`
  - longest personal trajectories first
  - `50` trajectories
  - capped at `16384` tokens each
- `nemotron_super_short_mix_15120_t1024_b8192_v4`
  - `15000` short personal prompts plus `120` bounded public prompts
  - capped at `1024` tokens each
  - packed under a safe `8192` token batch budget
- merged canonical state: `nemotron_super_merged_long50_short15120_v2`
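The caps above imply upper bounds on the calibration token counts, and the batch budget implies some form of packing. The sketch below computes those bounds and shows one simple way to pack capped prompts under a token budget. The greedy strategy and the function name are assumptions for illustration. The card does not describe the actual packing algorithm.

```python
from typing import List


def pack_by_token_budget(lengths: List[int], budget: int = 8192, cap: int = 1024) -> List[List[int]]:
    """Greedy in-order packing: cap each prompt, start a new batch when the budget would be exceeded."""
    batches: List[List[int]] = []
    current: List[int] = []
    used = 0
    for n in lengths:
        n = min(n, cap)  # truncate to the per-prompt cap
        if current and used + n > budget:
            batches.append(current)
            current, used = [], 0
        current.append(n)
        used += n
    if current:
        batches.append(current)
    return batches


# Upper bounds implied by the lane definitions (every item at its cap).
long_lane_max_tokens = 50 * 16384
short_lane_max_tokens = (15000 + 120) * 1024
print(long_lane_max_tokens, short_lane_max_tokens)

# Hypothetical prompt lengths: twenty at the cap, one short, one over the cap.
example_batches = pack_by_token_budget([1024] * 20 + [300, 5000])
print([sum(b) for b in example_batches])
```

With these hypothetical lengths, every batch stays at or below the 8192-token budget, and the over-length prompt is truncated to 1024 tokens before packing.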
Model facts from the merged observation lane:
- runtime architecture class: `NemotronHForCausalLM`
- total blocks: `88`
- MoE blocks: `40`
- Mamba blocks: `40`
- attention blocks: `8`
- routed experts per token: `22`
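The block counts above are internally consistent, and they give a quick view of how sparse routing is. A minimal sketch, assuming the router still selects 22 experts per token after pruning (the card reports this figure from the observation lane and does not say whether it changes in this variant):

```python
blocks = {"moe": 40, "mamba": 40, "attention": 8}
total_blocks = 88

# The three block types should account for every block in the model.
block_sum = sum(blocks.values())
print(f"block types sum to {block_sum} of {total_blocks}")

routed_per_token = 22
experts_upstream = 512
experts_retained = 384

# Fraction of a layer's routed experts that are active for any one token.
active_fraction_upstream = routed_per_token / experts_upstream
active_fraction_pruned = routed_per_token / experts_retained
print(f"active experts per token, upstream: {active_fraction_upstream:.2%}")
print(f"active experts per token, this variant: {active_fraction_pruned:.2%}")
```

Under that assumption, each token touches a slightly larger share of the remaining experts in each MoE layer than it did upstream, while the absolute number of routed experts is unchanged.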
Intended use
This draft checkpoint is published for research into expert activation structure, residency planning, CPU offloading, and prompt-conditioned expert selection. It is not a production claim and it is not an NVIDIA release.
Draft caveats
- This is a draft derived checkpoint.
- We have not yet completed a full serving benchmark and quality benchmark campaign for this release on Hugging Face.
- The repo preserves provenance back to the upstream NVIDIA release and should be evaluated in that context.
License and terms
Distribution of this derived checkpoint is intended to comply with the NVIDIA Open Model License included in LICENSE.txt. The required attribution notice is included in NOTICE.
License & citation
License inherited from the base model.
```bibtex
@misc{lasby2025reap,
  title = {REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author = {Mike Lasby and Ivan Lazarevich and Nish Sinnadurai and Sean Lie and Yani Ioannou and Vithursan Thangarasa},
  year = {2025},
  eprint = {2510.13999},
  archivePrefix = {arXiv}
}
```
Sponsors
Made possible by NVIDIA · TNG Technology · Lambda · Prime Intellect · Hot Aisle.
Model provider: 0xSero
Base model: nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16
Fine-tuned: this model
Input modality: Text
Output modality: Text