Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more

Get help setting up a custom Dedicated Endpoints.

Talk with our engineer to get a quote for reserved GPU instances with discounts.

README

License: apache-2.0

Quantization scope

ComponentPrecisionNotes
MLP gate/up/down (32 layers × 3)NVFP4 W4A16 (e2m1, block-16, e4m3 scale)weights only; activations BF16
self_attn QKVO (8 layers × 4)FP8 W+A (e4m3)hybrid attention layers
linear_attn out_proj / in_proj_qkv / in_proj_z (24 layers × 3)FP8 W+A (e4m3)hybrid linear-attention layers
linear_attn in_proj_{a,b} / conv1dBF16state-space submodules preserved
lm_headNVFP4 W4 (weight-only)block-16, e4m3 scale
KV cacheFP8with constant amax
visual / vision_tower / mtpBF16preserved
Calibrationmaxcnn_dailymail, 512 samples

201 quantized layers total: 96 W4A16_NVFP4 + 104 FP8 + 1 lm_head.

Checkpoint size: 8.4 GB (vs 19.3 GB BF16, −57%).

Usage

vLLM

bash

vllm serve davidyu-nv/Qwen3.5-9B-NVFP4-W4A16 \
--tensor-parallel-size 2 \
--data-parallel-size 4 \
--reasoning-parser qwen3 \
--max-model-len 131072 \
--trust-remote-code \
--disable-custom-all-reduce \
--no-enable-prefix-caching

For tool-calling (e.g. τ²-bench): add --enable-auto-tool-choice --tool-call-parser hermes.

Tested with nvcr.io/nvstaging/nim/vllm-modelopt:v0.19.1.

Comparison to sibling NVFP4 variants

VariantSizeMLPattnlm_head
BF1619.3 GB
P0 v2 (W4A4 MLP-only, max calib)12.36 GBNVFP4 W4A4BF16BF16
Upstream MSE (W4A4 MLP-only)12.38 GBNVFP4 W4A4BF16BF16
This (W4A16 + FP8 attn, max calib)8.4 GBNVFP4 W4A16FP8NVFP4

Eval results forthcoming — will be added once benchmark sweeps complete.

License

Apache 2.0, inherited from Qwen/Qwen3.5-9B.

Model provider

davidyu-nv

Model tree

Base

Qwen/Qwen3.5-9B

Quantized

this model

Modalities

Input

Video, Text, Image

Output

Text

Pricing

Dedicated Endpoints

View details

Supported Functionality

Model APIs

Dedicated Endpoints

Container

More information

Explore FriendliAI today