jkim96
Nemotron-3-Ultra-550B-A55B-DASHQ-INT2-g32
License: other
Install
```bash
pip install git+https://github.com/JaeminK/dashq.git
```
Load
```python
from dashq import load_quantized

model, tokenizer = load_quantized("jkim96/Nemotron-3-Ultra-550B-A55B-DASHQ-INT2-g32", device_map="auto")
```
Quantization
| Field | Value |
|---|---|
| Base model | nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16 |
| Precision | INT2, group size 32 |
| Scale / zero dtype | float16 |
| Calibration | wikitext2, 128 samples x 2048 |
| Size | 209.8371 GB · original 1121.0559 GB · 5.3x compression |
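As a rough sanity check on the size row, the effective bits per weight can be estimated from the precision and group size listed above. This is a minimal sketch, assuming each group of 32 weights stores one float16 scale and one float16 zero point and that the BF16 original uses 16 bits per weight; the actual DASH-Q storage layout is not described here, and `bits_per_weight` is a hypothetical helper, not part of `dashq`.

```python
def bits_per_weight(weight_bits: int, group_size: int,
                    scale_bits: int, zero_bits: int) -> float:
    """Average storage per weight, including per-group scale and zero point.

    Assumption: one scale and one zero point are stored per group.
    """
    return weight_bits + (scale_bits + zero_bits) / group_size


# INT2 weights, group size 32, float16 scale and zero (values from the table).
bpw = bits_per_weight(weight_bits=2, group_size=32, scale_bits=16, zero_bits=16)

# Compression ratio implied by that estimate versus a 16-bit (BF16) baseline.
estimated_ratio = 16 / bpw

# Compression ratio from the sizes reported in the table.
reported_ratio = 1121.0559 / 209.8371

print(f"estimated bits per weight: {bpw}")
print(f"estimated compression:     {estimated_ratio:.2f}x")
print(f"reported compression:      {reported_ratio:.2f}x")
```

Under these assumptions the estimate lands close to the reported 5.3x. Any remaining gap would come from details this sketch ignores, such as layers left unquantized or a different packing of the zero points.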
Benchmarks
Full zero-shot / few-shot results for every DASH-Q checkpoint: github.com/JaeminK/dashq#benchmarks
Model provider: jkim96
Model tree: base nvidia/NVIDIA-Nemotron-3-Ultra-550B-A55B-BF16, quantized: this model
Modalities: text input, text output