Quantization details
Table with columns: Component, Precision, Notes| Component | Precision | Notes |
|---|
| MLP weights (32 layers) | NVFP4 (W4A4, block-16, e2m1 / e4m3 scale) | quantized |
self_attn QKVO (8 layers) | BF16 | preserved |
linear_attn blocks (24 layers) | BF16 | preserved (Mamba-style hybrid layers) |
embed, lm_head, norm, mtp, visual | BF16 | preserved |
| KV cache | FP8 | use_constant_amax: true |
| Calibration | MSE + fp8_scale_sweep: true | static MLP weight scales |
Checkpoint size: 12.38 GB (vs 19.3 GB BF16, −36%).
Evaluation
Δ columns are absolute percentage-point differences vs Qwen/Qwen3.5-9B BF16.
Table with columns: Metric, BF16, This model| Metric | BF16 | This model |
|---|
| MMLU-Pro pass@1 | 82.89 | 82.40 (−0.49) |
| AIME 2025 avg-of-64 | 67.34 | 65.36 (−1.98) |
| AIME 2025 majority@64 | 90.00 | 87.78 (−2.22) |
| LCB pass@3 | 66.08 | 68.72 (+2.64) |
| GPQA avg-of-8 | 81.06 | 80.68 (−0.38) |
| GPQA majority@8 | 83.84 |
Eval was run via nemo-evaluator-launcher using the nemo-skills:26.03 container. AA-LCR judge: Qwen3-235B-A22B-Instruct-2507 (Non-Reasoning).
Usage
vLLM
vllm serve davidyu-nv/Qwen3.5-9B-NVFP4-MSE \
--tensor-parallel-size 2 \
--data-parallel-size 4 \
--reasoning-parser qwen3 \
--max-model-len 131072 \
--trust-remote-code \
--disable-custom-all-reduce \
--no-enable-prefix-caching
For tool-calling workloads (e.g. τ²-bench), also pass:
--enable-auto-tool-choice --tool-call-parser hermes
Container
Tested with nvcr.io/nvstaging/nim/vllm-modelopt:v0.19.1.
License
Apache 2.0, inherited from Qwen/Qwen3.5-9B.
Acknowledgments
- Base model: Qwen team
- Quantization recipe: NVIDIA Model Optimizer PR #1391