kyaky/Qwen-AgentWorld-35B-A3B-NVFP4 API & Inference Endpoint | FriendliAI
License: apache-2.0
Benchmark vs official BF16
Both models were measured on the same hardware (1× RTX PRO 6000 Blackwell) with an identical vLLM 0.23 config
(--max-model-len 8192 --max-num-seqs 64 --gpu-memory-utilization 0.90, temperature=0):
| Metric | Official BF16 | NVFP4 (this model) | Δ |
|---|---|---|---|
| Disk size | 66 GB | 24.96 GB | −62% |
| First-token latency (TTFT) | 35 ms | 32 ms | −8.6% |
| Single-stream decode | 157.9 tok/s | 184.1 tok/s | +16.6% |
| Concurrent throughput (N=16) | 1351.6 tok/s | 1430.5 tok/s | +5.8% |
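The Δ column is the relative change from the BF16 baseline, (NVFP4 − BF16) / BF16. As a minimal sketch, the snippet below recomputes it from the figures reported above. The names `ROWS`, `relative_change`, and `delta_column` are illustrative and not part of the model card.

```python
# Values copied from the benchmark table: (official BF16, NVFP4).
ROWS = {
    "Disk size (GB)": (66.0, 24.96),
    "TTFT (ms)": (35.0, 32.0),
    "Single-stream decode (tok/s)": (157.9, 184.1),
    "Concurrent throughput, N=16 (tok/s)": (1351.6, 1430.5),
}


def relative_change(baseline: float, candidate: float) -> float:
    """Return the change from baseline to candidate, in percent."""
    return (candidate - baseline) / baseline * 100.0


def delta_column(rows=ROWS) -> dict:
    """Recompute the delta column, rounded to one decimal place."""
    return {
        name: round(relative_change(bf16, nvfp4), 1)
        for name, (bf16, nvfp4) in rows.items()
    }


if __name__ == "__main__":
    for name, delta in delta_column().items():
        print(f"{name}: {delta:+.1f}%")
```

The recomputed values agree with the table to within rounding. The disk-size row is shown as a whole percent in the table, while this sketch keeps one decimal place.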
Quality spot checks (temperature=0): 17×23, a factorial function, "why is the sky blue", and
echo $((6*7)) were all answered correctly and equivalently to the BF16 reference. The NVFP4 build
is faster and about 38% of the size (24.96 GB vs 66 GB), with matching quality on these four prompts.
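For the three spot checks that have a deterministic answer, the expected results can be worked out directly. The sketch below is only a reference for what "correct" means here. The exact prompt wording is not given in the model card, and `factorial` is one plausible shape for the requested function, not the model's actual output.

```python
import math

# Ground-truth answers for the deterministic prompts.
product = 17 * 23      # the "17×23" prompt
shell_arith = 6 * 7    # the value `echo $((6*7))` prints in a POSIX shell

# Size ratio behind the "~1/3 the size" claim (figures from the table).
size_ratio = 24.96 / 66.0


def factorial(n: int) -> int:
    """Iterative factorial; an assumed form of the 'factorial function' prompt."""
    if n < 0:
        raise ValueError("factorial is undefined for negative integers")
    result = 1
    for k in range(2, n + 1):
        result *= k
    return result


if __name__ == "__main__":
    print(product, shell_arith, factorial(5), round(size_ratio, 3))
    # Cross-check the hand-written function against the standard library.
    assert factorial(10) == math.factorial(10)
```

The "why is the sky blue" prompt is open-ended, so it has no single reference value and is left out of the sketch.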
Left in BF16 (ignored): lm_head, embed_tokens, router mlp.gate, shared expert, GDN state
(), first/last MoE layer experts, and the vision tower.
Serving (vLLM)
Text-only deployment (the source defines a vision tower but sets language_model_only=true).
vLLM auto-detects the compressed-tensors format, so no --quantization flag is needed.
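As a hedged sketch, the snippet below assembles a `vllm serve` command line that mirrors the benchmark configuration listed earlier. It assumes the Hugging Face repository ID matches the model name in the page title, and it only builds and prints the command rather than launching a server. `build_serve_command` is an illustrative helper, not part of vLLM.

```python
import shlex

# Assumption: the repository ID is the model name shown in the page title.
MODEL = "kyaky/Qwen-AgentWorld-35B-A3B-NVFP4"


def build_serve_command(model: str = MODEL) -> list[str]:
    """Build a `vllm serve` argv using the flags from the benchmark config."""
    return [
        "vllm", "serve", model,
        "--max-model-len", "8192",
        "--max-num-seqs", "64",
        "--gpu-memory-utilization", "0.90",
        # No --quantization flag: the format is detected from the checkpoint.
    ]


if __name__ == "__main__":
    # Print a copy-pasteable shell command.
    print(shlex.join(build_serve_command()))
```

Sampling settings such as temperature=0 are request parameters rather than server flags, so they would be set by the client on each request.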