Qwen3-30B-A3B-NVFP4-W4A4 API & Inference Endpoint

Quantization details

Scheme: NVFP4 W4A4 — per-tensor global scale + per-group (size 16) FP8 (e4m3) local scales for weights, per-tensor activation scales
Ignored layers: lm_head, MoE router (re:.*mlp.gate$)
Calibration: 512 chat-formatted samples from HuggingFaceH4/ultrachat_200k (train_sft), max sequence length 2048
Tooling: llm-compressor 0.9.0, compressed-tensors 0.13.0
Format: compressed-tensors
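To make the two-level scaling above concrete, here is a minimal NumPy sketch of NVFP4-style weight quantize-dequantize. It is an illustration under stated assumptions, not llm-compressor's implementation: the function name `fake_quantize_nvfp4` is hypothetical, the global scale is assumed to be chosen so that the largest local scale lands at the FP8 (e4m3) maximum of 448, and the local scales are kept in float32 rather than actually rounded to e4m3.

```python
import numpy as np

# Magnitudes representable in FP4 (E2M1); the sign is stored separately.
FP4_GRID = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0], dtype=np.float32)
FP4_MAX = 6.0
FP8_E4M3_MAX = 448.0
GROUP_SIZE = 16


def fake_quantize_nvfp4(w):
    """Quantize-dequantize a 2-D weight with per-group (16) scales.

    Returns (dequantized weight, per-group local scales, global scale).
    Assumption: local scales stay in float32 here; a real checkpoint
    stores them as FP8 (e4m3).
    """
    rows, cols = w.shape
    assert cols % GROUP_SIZE == 0, "columns must be a multiple of the group size"
    groups = w.reshape(rows, cols // GROUP_SIZE, GROUP_SIZE).astype(np.float32)

    # Per-tensor global scale: maps the largest local scale to the FP8 maximum.
    tensor_amax = float(np.abs(groups).max())
    global_scale = (FP8_E4M3_MAX * FP4_MAX) / max(tensor_amax, 1e-12)

    # Effective per-group scale: maps each group's max magnitude to FP4_MAX.
    group_amax = np.abs(groups).max(axis=-1, keepdims=True)
    eff_scale = group_amax / FP4_MAX
    eff_scale = np.where(eff_scale == 0, 1.0, eff_scale).astype(np.float32)

    # Value that would be stored per group (as FP8 in a real checkpoint).
    local_scale = (global_scale * eff_scale).astype(np.float32)

    # Round each scaled magnitude to the nearest FP4 grid point.
    scaled = groups / eff_scale
    idx = np.abs(np.abs(scaled)[..., None] - FP4_GRID).argmin(axis=-1)
    q = np.sign(scaled) * FP4_GRID[idx]

    # Dequantize: q * local_scale / global_scale == q * eff_scale.
    deq = (q * eff_scale).reshape(rows, cols)
    return deq, local_scale.squeeze(-1), global_scale
```

Because the coarsest gap on the FP4 grid is between 4 and 6, the per-element error in this sketch is bounded by one sixth of the group's largest magnitude, which is why small groups of 16 matter for accuracy.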

⚠️ Hardware / runtime support note

NVFP4 fused-MoE inference requires a runtime kernel compiled for your GPU architecture. The CUTLASS NVFP4 grouped-GEMM MoE kernel (get_cutlass_moe_mm_data in vLLM) is currently compiled only for CUDA compute capability 9.0 (Hopper) and 10.0 (datacenter Blackwell, B200/GB200).

On SM120 GPUs (e.g. RTX PRO 6000 Blackwell, compute capability 12.0), the stock vLLM build used to produce this checkpoint does not yet ship a compiled NVFP4 MoE kernel, so serving with vLLM fails with the error "No compiled get_cutlass_moe_mm_data: ... capability 120. Required capability: 90 or 100."

This is a runtime kernel limitation, not a problem with the checkpoint — the quantized weights were verified (correct NVFP4 packing; router and lm_head left in full precision). To serve this model, use a GPU/kernel combination that provides an NVFP4 fused-MoE kernel (Hopper or B200), or a vLLM build with NVFP4 MoE kernels compiled for your architecture.
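A quick pre-flight check can catch the unsupported-architecture case before vLLM starts loading weights. The sketch below is illustrative only: the helper names are hypothetical, and the supported set simply mirrors the capabilities listed in this note (9.0 and 10.0) for the stock build, so it would need adjusting for a custom vLLM build. It uses PyTorch's `torch.cuda.get_device_capability()` when PyTorch and a GPU are available.

```python
# Capabilities for which this note says the stock NVFP4 MoE kernel is compiled.
# Assumption: adjust this set if you use a vLLM build with extra architectures.
STOCK_NVFP4_MOE_CAPABILITIES = {(9, 0), (10, 0)}


def has_stock_nvfp4_moe_kernel(capability):
    """Return True if (major, minor) is in the supported set above."""
    return tuple(capability) in STOCK_NVFP4_MOE_CAPABILITIES


def detect_capability():
    """Return the current GPU's (major, minor), or None if it cannot be read."""
    try:
        import torch  # imported lazily so the check itself has no hard dependency
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability()


def preflight_message():
    """Build a human-readable summary of the check."""
    cap = detect_capability()
    if cap is None:
        return "Could not detect a CUDA GPU; skipping the NVFP4 MoE kernel check."
    if has_stock_nvfp4_moe_kernel(cap):
        return f"Compute capability {cap[0]}.{cap[1]}: listed as supported."
    return (
        f"Compute capability {cap[0]}.{cap[1]}: not in the supported list; "
        "expect the get_cutlass_moe_mm_data error with a stock vLLM build."
    )
```

Calling `print(preflight_message())` before constructing the `LLM` object gives an early, readable warning instead of a kernel error at load time.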

Usage (vLLM, on supported hardware)

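```python
from vllm import LLM, SamplingParams

# Load the quantized checkpoint; vLLM reads the compressed-tensors config from the repo.
llm = LLM(model="JongYeop/Qwen3-30B-A3B-NVFP4-W4A4")

# Generate a single completion for one prompt.
out = llm.generate(
    ["Explain mixture-of-experts in one sentence."],
    SamplingParams(temperature=0.7, max_tokens=128),
)
print(out[0].outputs[0].text)
```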

Recipe

```yaml
quant_stage:
  quant_modifiers:
    QuantizationModifier:
      ignore: ["lm_head", "re:.*mlp.gate$"]
      scheme: "NVFP4"
      targets: ["Linear"]
```
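The `ignore` list in the recipe mixes an exact name with a regular expression. The sketch below shows which module names such patterns would select, assuming that a `re:` prefix means the remainder is applied with Python's `re.match` against the full module name and that other entries are compared for exact equality. The helper `is_ignored` and the example module names are hypothetical and only follow the usual Qwen3 MoE naming style; this is not the library's own matching code.

```python
import re

# Same patterns as the recipe's `ignore` field.
IGNORE = ["lm_head", "re:.*mlp.gate$"]


def is_ignored(name, patterns=IGNORE):
    """Return True if a module name matches any ignore pattern.

    Assumption: "re:"-prefixed entries are regexes matched from the start
    of the name; all other entries must equal the name exactly.
    """
    for pattern in patterns:
        if pattern.startswith("re:"):
            if re.match(pattern[3:], name):
                return True
        elif name == pattern:
            return True
    return False


# Hypothetical module names for illustration.
examples = [
    "lm_head",                                 # output head, kept in full precision
    "model.layers.0.mlp.gate",                 # MoE router, kept in full precision
    "model.layers.0.mlp.experts.7.gate_proj",  # expert projection, quantized
    "model.layers.0.self_attn.q_proj",         # attention projection, quantized
]

for name in examples:
    status = "ignored" if is_ignored(name) else "quantized"
    print(f"{name}: {status}")
```

The trailing `$` is what keeps the expert `gate_proj` layers in scope for quantization: without it, the pattern would also match any name that merely contains `mlp.gate`.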
