Run this model inference on single tenant GPU with unmatched speed and reliability at scale.
Get help setting up a custom Dedicated Endpoints.
Talk with our engineer to get a quote for reserved GPU instances with discounts.
README
License: apache-2.0Quantization scope
| Component | Precision | Notes |
|---|---|---|
| MLP gate/up/down (32 layers × 3) | NVFP4 W4A16 (e2m1, block-16, e4m3 scale) | weights only; activations BF16 |
| self_attn QKVO (8 layers × 4) | FP8 W+A (e4m3) | hybrid attention layers |
| linear_attn out_proj / in_proj_qkv / in_proj_z (24 layers × 3) | FP8 W+A (e4m3) | hybrid linear-attention layers |
| linear_attn in_proj_{a,b} / conv1d | BF16 | state-space submodules preserved |
| lm_head | NVFP4 W4 (weight-only) | block-16, e4m3 scale |
| KV cache | FP8 | with constant amax |
| visual / vision_tower / mtp | BF16 | preserved |
| Calibration | max | cnn_dailymail, 512 samples |
201 quantized layers total: 96 W4A16_NVFP4 + 104 FP8 + 1 lm_head.
Checkpoint size: 8.4 GB (vs 19.3 GB BF16, −57%).
Usage
vLLM
bash
vllm serve davidyu-nv/Qwen3.5-9B-NVFP4-W4A16 \--tensor-parallel-size 2 \--data-parallel-size 4 \--reasoning-parser qwen3 \--max-model-len 131072 \--trust-remote-code \--disable-custom-all-reduce \--no-enable-prefix-caching
For tool-calling (e.g. τ²-bench): add --enable-auto-tool-choice --tool-call-parser hermes.
Tested with nvcr.io/nvstaging/nim/vllm-modelopt:v0.19.1.
Comparison to sibling NVFP4 variants
| Variant | Size | MLP | attn | lm_head |
|---|---|---|---|---|
| BF16 | 19.3 GB | — | — | — |
| P0 v2 (W4A4 MLP-only, max calib) | 12.36 GB | NVFP4 W4A4 | BF16 | BF16 |
| Upstream MSE (W4A4 MLP-only) | 12.38 GB | NVFP4 W4A4 | BF16 | BF16 |
| This (W4A16 + FP8 attn, max calib) | 8.4 GB | NVFP4 W4A16 | FP8 | NVFP4 |
Eval results forthcoming — will be added once benchmark sweeps complete.
License
Apache 2.0, inherited from Qwen/Qwen3.5-9B.
Model provider
davidyu-nv
Model tree
Base
Qwen/Qwen3.5-9B
Quantized
this model
Modalities
Input
Video, Text, Image
Output
Text
Pricing
Dedicated Endpoints
View detailsSupported Functionality
Model APIs
Dedicated Endpoints
Container
More information