Run this model inference on single tenant GPU with unmatched speed and reliability at scale.
Get help setting up a custom Dedicated Endpoints.
Talk with our engineer to get a quote for reserved GPU instances with discounts.
README
License: apache-2.0Quantization details
| Component | Precision | Notes |
|---|---|---|
| MLP weights (32 layers) | NVFP4 (W4A4, block-16, e2m1 / e4m3 scale) | quantized |
self_attn QKVO (8 layers) | BF16 | preserved |
linear_attn blocks (24 layers) | BF16 | preserved (Mamba-style hybrid layers) |
embed, lm_head, norm, mtp, visual | BF16 | preserved |
| KV cache | FP8 | use_constant_amax: true |
| Calibration | MSE + fp8_scale_sweep: true | static MLP weight scales |
Checkpoint size: 12.38 GB (vs 19.3 GB BF16, −36%).
Evaluation
Δ columns are absolute percentage-point differences vs Qwen/Qwen3.5-9B BF16.
| Metric | BF16 | This model |
|---|---|---|
| MMLU-Pro pass@1 | 82.89 | 82.40 (−0.49) |
| AIME 2025 avg-of-64 | 67.34 | 65.36 (−1.98) |
| AIME 2025 majority@64 | 90.00 | 87.78 (−2.22) |
| LCB pass@3 | 66.08 | 68.72 (+2.64) |
| GPQA avg-of-8 | 81.06 | 80.68 (−0.38) |
| GPQA majority@8 | 83.84 | 83.59 (−0.25) |
| AA-LCR pass@1 [avg-of-3] | 56.33 | 50.67 (−5.66) |
| AA-LCR pass@3 | 71.00 | 66.00 (−5.00) |
| τ²-bench-telecom pass@1 | 15.79 | 12.28 (−3.51) |
Eval was run via nemo-evaluator-launcher using the nemo-skills:26.03 container. AA-LCR judge: Qwen3-235B-A22B-Instruct-2507 (Non-Reasoning).
Usage
vLLM
bash
vllm serve davidyu-nv/Qwen3.5-9B-NVFP4-MSE \--tensor-parallel-size 2 \--data-parallel-size 4 \--reasoning-parser qwen3 \--max-model-len 131072 \--trust-remote-code \--disable-custom-all-reduce \--no-enable-prefix-caching
For tool-calling workloads (e.g. τ²-bench), also pass:
markdown
--enable-auto-tool-choice --tool-call-parser hermes
Container
Tested with nvcr.io/nvstaging/nim/vllm-modelopt:v0.19.1.
License
Apache 2.0, inherited from Qwen/Qwen3.5-9B.
Acknowledgments
Model provider
davidyu-nv
Model tree
Base
Qwen/Qwen3.5-9B
Quantized
this model
Modalities
Input
Video, Text, Image
Output
Text
Pricing
Dedicated Endpoints
View detailsSupported Functionality
Model APIs
Dedicated Endpoints
Container
More information