License: mit

At a glance

  • Base model: nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16
  • Format: W4A16
  • Total params: 92B
  • Active / token: 12B
  • Experts / layer: 384
  • Layers:
  • Hidden size: 4096
  • Context: 262,144
  • On-disk size: 56 GB
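
As a rough sanity check on the figures above, the on-disk size can be compared with what 4-bit weights alone would occupy. The sketch below assumes decimal gigabytes and that every parameter is stored at 4 bits, neither of which this card states. The helper name `weight_bytes` is illustrative. The gap between the estimate and the listed 56 GB is not broken down anywhere in this card; it plausibly covers quantization scales and any tensors kept at higher precision.

```python
def weight_bytes(num_params: float, bits_per_weight: int) -> float:
    """Bytes needed to store num_params weights at the given bit width."""
    return num_params * bits_per_weight / 8


TOTAL_PARAMS = 92e9  # "Total params" from the list above

# Assumption: decimal GB (1e9 bytes) and a uniform bit width for all weights.
w4_gb = weight_bytes(TOTAL_PARAMS, 4) / 1e9
bf16_gb = weight_bytes(TOTAL_PARAMS, 16) / 1e9

print(f"4-bit weights only: {w4_gb:.0f} GB")
print(f"BF16 weights only:  {bf16_gb:.0f} GB")
```

Under these assumptions the 4-bit weights come to about 46 GB and the BF16 equivalent to about 184 GB.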

Which variant should I pick?

Variant | Format | Link
Nemotron-3-Super-64B | BF16 | link
Nemotron-3-Super-64B-W4A16 | W4A16 | link
Nemotron-3-Super-92B | BF16 | link
Nemotron-3-Super-92B-W4A16 (this) | W4A16 | link

Draft AutoRound quantization of a Nemotron Super checkpoint.

Base model: nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16

Draft status

This is a draft research release. It is published for inspection, reproducibility, and early runtime validation. It should not be treated as a final benchmarked production checkpoint.

How this was produced

We quantized the checkpoint with Intel AutoRound using the W4A16 scheme on a remote host with 8x RTX 3090 GPUs. The run was configured for overnight completion and resumability rather than final accuracy tuning.

Settings used

  • source checkpoint: /mnt/llm_models/nemotron-super-compressions/nemotron_super_merged_long50_short15120_v2/reap_25pct
  • source type: REAP 25% pruned checkpoint
  • quantizer: intel/auto-round 0.10.2
  • scheme: W4A16
  • format: auto_round
  • calibration dataset: NeelNanda/pile-10k
  • device_map: auto
  • nsamples: 128
  • iters: 50
  • seqlen: 1024
  • batch_size: 2
  • nblocks: 1
  • low_gpu_mem_usage: True
  • output dir: /home/ser/nemotron-super/autoround_w4a16/reap_25pct
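
The settings above could be passed to AutoRound roughly as follows. This is a minimal sketch, not the script that produced this checkpoint: it assumes the `AutoRound(...)` constructor and `quantize_and_save(...)` method of the installed auto-round release accept these keyword names, and the names `SETTINGS`, `calibration_tokens`, and `run_quantization` are illustrative. The paths are the ones listed above and exist only on the original host.

```python
"""Sketch: map the listed settings onto an AutoRound run (assumed API)."""

# Values copied from the "Settings used" list.
SOURCE_CHECKPOINT = (
    "/mnt/llm_models/nemotron-super-compressions/"
    "nemotron_super_merged_long50_short15120_v2/reap_25pct"
)
OUTPUT_DIR = "/home/ser/nemotron-super/autoround_w4a16/reap_25pct"
EXPORT_FORMAT = "auto_round"

SETTINGS = {
    "scheme": "W4A16",
    "dataset": "NeelNanda/pile-10k",
    "device_map": "auto",
    "nsamples": 128,
    "iters": 50,
    "seqlen": 1024,
    "batch_size": 2,
    "nblocks": 1,
    "low_gpu_mem_usage": True,
}


def calibration_tokens(settings: dict) -> int:
    """Upper bound on calibration tokens: samples x sequence length."""
    return settings["nsamples"] * settings["seqlen"]


def run_quantization(model_path: str = SOURCE_CHECKPOINT,
                     output_dir: str = OUTPUT_DIR) -> None:
    """Quantize and export. Needs auto-round installed and enough GPU memory."""
    # Imported lazily so the settings can be inspected without the package.
    from auto_round import AutoRound

    # Assumption: these keyword names match the installed auto-round version.
    quantizer = AutoRound(model_path, **SETTINGS)
    quantizer.quantize_and_save(output_dir, format=EXPORT_FORMAT)


if __name__ == "__main__":
    print(f"calibration tokens: {calibration_tokens(SETTINGS):,}")
    # run_quantization()  # uncomment on a host that has the checkpoint
```

With 128 samples of 1024 tokens each, the calibration pass sees at most 131,072 tokens, which fits the stated goal of overnight completion over accuracy tuning.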

Notes

  • upstream provenance is preserved through the base model link above
  • this repo is intentionally marked draft while quantization/runtime validation is still in progress
  • donation link added per maintainer request

License & citation

License inherited from the base model.

@misc{lasby2025reap,
  title         = {REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author        = {Mike Lasby and Ivan Lazarevich and Nish Sinnadurai and Sean Lie and Yani Ioannou and Vithursan Thangarasa},
  year          = {2025},
  eprint        = {2510.13999},
  archivePrefix = {arXiv}
}

Sponsors

Made possible by NVIDIA · TNG Technology · Lambda · Prime Intellect · Hot Aisle.

Model provider

0xSero

Model tree

  • Base: nvidia/NVIDIA-Nemotron-3-Super-120B-A12B-BF16
  • Quantized: this model

Modalities

  • Input: Text
  • Output: Text
