Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more

Get help setting up a custom Dedicated Endpoints.

Talk with our engineer to get a quote for reserved GPU instances with discounts.

README

License: apache-2.0

Quantization details

ComponentPrecisionNotes
MLP weights (32 layers)NVFP4 (W4A4, block-16, e2m1 / e4m3 scale)quantized
self_attn QKVO (8 layers)BF16preserved
linear_attn blocks (24 layers)BF16preserved (Mamba-style hybrid layers)
embed, lm_head, norm, mtp, visualBF16preserved
KV cacheFP8use_constant_amax: true
CalibrationMSE + fp8_scale_sweep: truestatic MLP weight scales

Checkpoint size: 12.38 GB (vs 19.3 GB BF16, −36%).

Evaluation

Δ columns are absolute percentage-point differences vs Qwen/Qwen3.5-9B BF16.

MetricBF16This model
MMLU-Pro pass@182.8982.40 (−0.49)
AIME 2025 avg-of-6467.3465.36 (−1.98)
AIME 2025 majority@6490.0087.78 (−2.22)
LCB pass@366.0868.72 (+2.64)
GPQA avg-of-881.0680.68 (−0.38)
GPQA majority@883.8483.59 (−0.25)
AA-LCR pass@1 [avg-of-3]56.3350.67 (−5.66)
AA-LCR pass@371.0066.00 (−5.00)
τ²-bench-telecom pass@115.7912.28 (−3.51)

Eval was run via nemo-evaluator-launcher using the nemo-skills:26.03 container. AA-LCR judge: Qwen3-235B-A22B-Instruct-2507 (Non-Reasoning).

Usage

vLLM

bash

vllm serve davidyu-nv/Qwen3.5-9B-NVFP4-MSE \
--tensor-parallel-size 2 \
--data-parallel-size 4 \
--reasoning-parser qwen3 \
--max-model-len 131072 \
--trust-remote-code \
--disable-custom-all-reduce \
--no-enable-prefix-caching

For tool-calling workloads (e.g. τ²-bench), also pass:

markdown

--enable-auto-tool-choice --tool-call-parser hermes

Container

Tested with nvcr.io/nvstaging/nim/vllm-modelopt:v0.19.1.

License

Apache 2.0, inherited from Qwen/Qwen3.5-9B.

Acknowledgments

  • Base model: Qwen team
  • Quantization recipe: NVIDIA Model Optimizer PR #1391

Model provider

davidyu-nv

Model tree

Base

Qwen/Qwen3.5-9B

Quantized

this model

Modalities

Input

Video, Text, Image

Output

Text

Pricing

Dedicated Endpoints

View details

Supported Functionality

Model APIs

Dedicated Endpoints

Container

More information

Explore FriendliAI today