Qwen3.5-9B-NVFP4-W4A16 API & Inference Endpoint

Quantization scope

Table with columns: Component, Precision, Notes
Component	Precision	Notes
MLP gate/up/down (32 layers × 3)	NVFP4 W4A16 (e2m1, block-16, e4m3 scale)	weights only; activations BF16
self_attn QKVO (8 layers × 4)	FP8 W+A (e4m3)	hybrid attention layers
linear_attn out_proj / in_proj_qkv / in_proj_z (24 layers × 3)	FP8 W+A (e4m3)	hybrid linear-attention layers
linear_attn in_proj_{a,b} / conv1d	BF16	state-space submodules preserved
lm_head	NVFP4 W4 (weight-only)	block-16, e4m3 scale
KV cache	FP8	with constant amax
visual / vision_tower / mtp	BF16	preserved
Calibration	max	cnn_dailymail, 512 samples

201 quantized layers total: 96 W4A16_NVFP4 + 104 FP8 + 1 lm_head.

Checkpoint size: 8.4 GB (vs 19.3 GB BF16, −57%).

Usage

vLLM

bash
vllm serve davidyu-nv/Qwen3.5-9B-NVFP4-W4A16 \
  --tensor-parallel-size 2 \
  --data-parallel-size 4 \
  --reasoning-parser qwen3 \
  --max-model-len 131072 \
  --trust-remote-code \
  --disable-custom-all-reduce \
  --no-enable-prefix-caching

For tool-calling (e.g. τ²-bench): add --enable-auto-tool-choice --tool-call-parser hermes.

Tested with nvcr.io/nvstaging/nim/vllm-modelopt:v0.19.1.

Comparison to sibling NVFP4 variants

Table with columns: Variant, Size, MLP, attn, lm_head
Variant	Size	MLP	attn	lm_head
BF16	19.3 GB	—	—	—
P0 v2 (W4A4 MLP-only, max calib)	12.36 GB	NVFP4 W4A4	BF16	BF16
Upstream MSE (W4A4 MLP-only)	12.38 GB	NVFP4 W4A4	BF16	BF16

Eval results forthcoming — will be added once benchmark sweeps complete.

License

Apache 2.0, inherited from Qwen/Qwen3.5-9B.

Quantization scope

Table with columns: Component, Precision, Notes
Component	Precision	Notes
MLP gate/up/down (32 layers × 3)	NVFP4 W4A16 (e2m1, block-16, e4m3 scale)	weights only; activations BF16
self_attn QKVO (8 layers × 4)	FP8 W+A (e4m3)	hybrid attention layers
linear_attn out_proj / in_proj_qkv / in_proj_z (24 layers × 3)	FP8 W+A (e4m3)	hybrid linear-attention layers
linear_attn in_proj_{a,b} / conv1d	BF16	state-space submodules preserved
lm_head	NVFP4 W4 (weight-only)	block-16, e4m3 scale
KV cache	FP8	with constant amax
visual / vision_tower / mtp	BF16	preserved
Calibration	max	cnn_dailymail, 512 samples

201 quantized layers total: 96 W4A16_NVFP4 + 104 FP8 + 1 lm_head.

Checkpoint size: 8.4 GB (vs 19.3 GB BF16, −57%).

Usage

vLLM

bash
vllm serve davidyu-nv/Qwen3.5-9B-NVFP4-W4A16 \
  --tensor-parallel-size 2 \
  --data-parallel-size 4 \
  --reasoning-parser qwen3 \
  --max-model-len 131072 \
  --trust-remote-code \
  --disable-custom-all-reduce \
  --no-enable-prefix-caching

For tool-calling (e.g. τ²-bench): add --enable-auto-tool-choice --tool-call-parser hermes.

Tested with nvcr.io/nvstaging/nim/vllm-modelopt:v0.19.1.

Comparison to sibling NVFP4 variants

Table with columns: Variant, Size, MLP, attn, lm_head
Variant	Size	MLP	attn	lm_head
BF16	19.3 GB	—	—	—
P0 v2 (W4A4 MLP-only, max calib)	12.36 GB	NVFP4 W4A4	BF16	BF16
Upstream MSE (W4A4 MLP-only)	12.38 GB	NVFP4 W4A4	BF16	BF16

Eval results forthcoming — will be added once benchmark sweeps complete.

License

Apache 2.0, inherited from Qwen/Qwen3.5-9B.

Qwen3.5-9B-NVFP4-W4A16

README

Quantization scope

Usage

vLLM

Comparison to sibling NVFP4 variants

License

Explore FriendliAI today

README

Quantization scope

Usage

vLLM

Comparison to sibling NVFP4 variants

License