Qwen3.5-9B-NVFP4-MSE API & Inference Endpoint

Quantization details

Table with columns: Component, Precision, Notes
Component	Precision	Notes
MLP weights (32 layers)	NVFP4 (W4A4, block-16, e2m1 / e4m3 scale)	quantized
`self_attn` QKVO (8 layers)	BF16	preserved
`linear_attn` blocks (24 layers)	BF16	preserved (Mamba-style hybrid layers)
`embed`, `lm_head`, `norm`, `mtp`, `visual`	BF16	preserved
KV cache	FP8	`use_constant_amax: true`
Calibration	MSE + `fp8_scale_sweep: true`	static MLP weight scales

Checkpoint size: 12.38 GB (vs 19.3 GB BF16, −36%).

Evaluation

Δ columns are absolute percentage-point differences vs Qwen/Qwen3.5-9B BF16.

Table with columns: Metric, BF16, This model
Metric	BF16	This model
MMLU-Pro pass@1	82.89	82.40 (−0.49)
AIME 2025 avg-of-64	67.34	65.36 (−1.98)
AIME 2025 majority@64	90.00	87.78 (−2.22)
LCB pass@3	66.08	68.72 (+2.64)
GPQA avg-of-8	81.06	80.68 (−0.38)
GPQA majority@8	83.84

Eval was run via nemo-evaluator-launcher using the nemo-skills:26.03 container. AA-LCR judge: Qwen3-235B-A22B-Instruct-2507 (Non-Reasoning).

Usage

vLLM

bash
vllm serve davidyu-nv/Qwen3.5-9B-NVFP4-MSE \
  --tensor-parallel-size 2 \
  --data-parallel-size 4 \
  --reasoning-parser qwen3 \
  --max-model-len 131072 \
  --trust-remote-code \
  --disable-custom-all-reduce \
  --no-enable-prefix-caching

For tool-calling workloads (e.g. τ²-bench), also pass:

markdown
--enable-auto-tool-choice --tool-call-parser hermes

Container

Tested with nvcr.io/nvstaging/nim/vllm-modelopt:v0.19.1.

License

Apache 2.0, inherited from Qwen/Qwen3.5-9B.

Acknowledgments

Base model: Qwen team
Quantization recipe: NVIDIA Model Optimizer PR #1391

Quantization details

Table with columns: Component, Precision, Notes
Component	Precision	Notes
MLP weights (32 layers)	NVFP4 (W4A4, block-16, e2m1 / e4m3 scale)	quantized
`self_attn` QKVO (8 layers)	BF16	preserved
`linear_attn` blocks (24 layers)	BF16	preserved (Mamba-style hybrid layers)
`embed`, `lm_head`, `norm`, `mtp`, `visual`	BF16	preserved
KV cache	FP8	`use_constant_amax: true`
Calibration	MSE + `fp8_scale_sweep: true`	static MLP weight scales

Checkpoint size: 12.38 GB (vs 19.3 GB BF16, −36%).

Evaluation

Δ columns are absolute percentage-point differences vs Qwen/Qwen3.5-9B BF16.

Table with columns: Metric, BF16, This model
Metric	BF16	This model
MMLU-Pro pass@1	82.89	82.40 (−0.49)
AIME 2025 avg-of-64	67.34	65.36 (−1.98)
AIME 2025 majority@64	90.00	87.78 (−2.22)
LCB pass@3	66.08	68.72 (+2.64)
GPQA avg-of-8	81.06	80.68 (−0.38)
GPQA majority@8	83.84

Eval was run via nemo-evaluator-launcher using the nemo-skills:26.03 container. AA-LCR judge: Qwen3-235B-A22B-Instruct-2507 (Non-Reasoning).

Usage

vLLM

bash
vllm serve davidyu-nv/Qwen3.5-9B-NVFP4-MSE \
  --tensor-parallel-size 2 \
  --data-parallel-size 4 \
  --reasoning-parser qwen3 \
  --max-model-len 131072 \
  --trust-remote-code \
  --disable-custom-all-reduce \
  --no-enable-prefix-caching

For tool-calling workloads (e.g. τ²-bench), also pass:

markdown
--enable-auto-tool-choice --tool-call-parser hermes

Container

Tested with nvcr.io/nvstaging/nim/vllm-modelopt:v0.19.1.

License

Apache 2.0, inherited from Qwen/Qwen3.5-9B.

Acknowledgments

Base model: Qwen team
Quantization recipe: NVIDIA Model Optimizer PR #1391

Qwen3.5-9B-NVFP4-MSE

README

Quantization details

Evaluation

Usage

vLLM

Container

License

Acknowledgments

Explore FriendliAI today

README

Quantization details

Evaluation

Usage

vLLM

Container

License

Acknowledgments