kyaky/Qwen-AgentWorld-35B-A3B-NVFP4 API & Inference Endpoint

Benchmark vs official BF16

NVFP4 vs BF16 benchmark

Measured on the same hardware (1× RTX PRO 6000 Blackwell) and identical vLLM 0.23 config (--max-model-len 8192 --max-num-seqs 64 --gpu-memory-utilization 0.90, temperature=0):

Table with columns: Metric, Official BF16, NVFP4 (this model), Δ
Metric	Official BF16	NVFP4 (this model)	Δ
Disk size	66 GB	24.96 GB	−62%
First-token latency (TTFT)	35 ms	32 ms	−8.6%
Single-stream decode	157.9 tok/s	184.1 tok/s	+16.6%
Concurrent throughput (N=16)	1351.6 tok/s	1430.5 tok/s	+5.8%

Quality (temperature=0): 17×23, a factorial function, "why is the sky blue", and echo $((6*7)) — all correct and equivalent to the BF16 reference. The NVFP4 build is faster and ~1/3 the size with matching quality.

Quantization

Tool: llm-compressor 0.12, compressed-tensors (format mixed-precision).
Scheme:
- Attention (self_attn.{q,k,v,o}_proj, GDN linear_attn.{in_proj_qkv,in_proj_z,out_proj}) → FP8 (block [128,128]).
- MoE experts (gate_proj/up_proj/down_proj) → NVFP4 (group size 16, fp8_e4m3 scales).
- Left in BF16 (ignored): lm_head, embed_tokens, router mlp.gate, shared expert, GDN state (), first/last MoE layer experts, and the vision tower.

Serving (vLLM)

Text-only deployment (the source defines a vision tower but sets language_model_only=true). vLLM auto-detects compressed-tensors — no --quantization flag needed.

bash
vllm serve kyaky/Qwen-AgentWorld-35B-A3B-NVFP4 \
  --max-model-len 262144 --max-num-batched-tokens 2096 --max-num-seqs 256 \
  --enable-prefix-caching --disable-custom-all-reduce --trust-remote-code \
  --reasoning-parser qwen3 --enable-auto-tool-choice --tool-call-parser qwen3_xml

Notes: hybrid Gated-DeltaNet attention requires --max-num-batched-tokens 2096 (cache alignment); head_dim=256 auto-selects the Triton attention backend.

License

Apache-2.0 (inherited from the base model).

Benchmark vs official BF16

NVFP4 vs BF16 benchmark

Measured on the same hardware (1× RTX PRO 6000 Blackwell) and identical vLLM 0.23 config (--max-model-len 8192 --max-num-seqs 64 --gpu-memory-utilization 0.90, temperature=0):

Table with columns: Metric, Official BF16, NVFP4 (this model), Δ
Metric	Official BF16	NVFP4 (this model)	Δ
Disk size	66 GB	24.96 GB	−62%
First-token latency (TTFT)	35 ms	32 ms	−8.6%
Single-stream decode	157.9 tok/s	184.1 tok/s	+16.6%
Concurrent throughput (N=16)	1351.6 tok/s	1430.5 tok/s	+5.8%

Quantization

Tool: llm-compressor 0.12, compressed-tensors (format mixed-precision).
Scheme:
- Attention (self_attn.{q,k,v,o}_proj, GDN linear_attn.{in_proj_qkv,in_proj_z,out_proj}) → FP8 (block [128,128]).
- MoE experts (gate_proj/up_proj/down_proj) → NVFP4 (group size 16, fp8_e4m3 scales).
- Left in BF16 (ignored): lm_head, embed_tokens, router mlp.gate, shared expert, GDN state (), first/last MoE layer experts, and the vision tower.

Serving (vLLM)

Text-only deployment (the source defines a vision tower but sets language_model_only=true). vLLM auto-detects compressed-tensors — no --quantization flag needed.

bash
vllm serve kyaky/Qwen-AgentWorld-35B-A3B-NVFP4 \
  --max-model-len 262144 --max-num-batched-tokens 2096 --max-num-seqs 256 \
  --enable-prefix-caching --disable-custom-all-reduce --trust-remote-code \
  --reasoning-parser qwen3 --enable-auto-tool-choice --tool-call-parser qwen3_xml

Notes: hybrid Gated-DeltaNet attention requires --max-num-batched-tokens 2096 (cache alignment); head_dim=256 auto-selects the Triton attention backend.

License

Apache-2.0 (inherited from the base model).

Qwen-AgentWorld-35B-A3B-NVFP4

Get help setting up a custom Dedicated Endpoints.

README

Benchmark vs official BF16

Quantization

Serving (vLLM)

License

Explore FriendliAI today

README

Benchmark vs official BF16

Quantization

Serving (vLLM)

License