Qwen3-30B-A3B-FP8-W8A8 API & Inference Endpoint

Quantization details

Scheme: FP8 W8A8, static per-tensor, symmetric (weights + input activations)
Ignored layers: lm_head, MoE router (re:.*mlp.gate$)
Calibration: 512 chat-formatted samples from HuggingFaceH4/ultrachat_200k (train_sft), max sequence length 2048
Tooling: llm-compressor 0.9.0, compressed-tensors 0.13.0
Format: compressed-tensors (loadable directly by vLLM)
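
The scheme above can be illustrated with a small pure-Python sketch. It is not the llm-compressor implementation: it assumes the FP8 variant is E4M3 (largest finite value 448), derives one scale per tensor from the absolute maximum, and uses no zero point, which is what "symmetric" means here. The helper names `round_to_e4m3`, `quantize_per_tensor`, and `dequantize` are hypothetical.

```python
import math

FP8_E4M3_MAX = 448.0  # largest finite E4M3 value (assumed FP8 variant)


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest E4M3-representable value, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    mag = min(abs(x), FP8_E4M3_MAX)
    # frexp gives mag = m * 2**e with 0.5 <= m < 1, so floor(log2(mag)) = e - 1.
    # Values below 2**-6 are subnormal and share the exponent -6.
    exp = max(math.frexp(mag)[1] - 1, -6)
    step = 2.0 ** (exp - 3)  # 3 mantissa bits: 8 steps per power-of-two interval
    return sign * min(round(mag / step) * step, FP8_E4M3_MAX)


def quantize_per_tensor(values):
    """Symmetric per-tensor quantization: one scale for the whole tensor."""
    amax = max(abs(v) for v in values)
    scale = amax / FP8_E4M3_MAX if amax > 0 else 1.0
    return [round_to_e4m3(v / scale) for v in values], scale


def dequantize(q, scale):
    """Map FP8 codes back to real values using the stored scale."""
    return [v * scale for v in q]


weights = [0.5, -1.0, 0.25, 2.0]
q, scale = quantize_per_tensor(weights)
print(q, scale, dequantize(q, scale))
```

For weights, the scale is fixed once from the tensor itself. For "static" input activations, the same kind of scale would be estimated once from the calibration samples and then reused at inference time, rather than being recomputed per batch.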

Usage (vLLM)

```python
from vllm import LLM, SamplingParams

llm = LLM(model="JongYeop/Qwen3-30B-A3B-FP8-W8A8")
out = llm.generate(
    ["Explain mixture-of-experts in one sentence."],
    SamplingParams(temperature=0.7, max_tokens=128),
)
print(out[0].outputs[0].text)
```

Recipe

```yaml
quant_stage:
  quant_modifiers:
    QuantizationModifier:
      ignore: ["lm_head", "re:.*mlp.gate$"]
      config_groups:
        group_0:
          weights:
            num_bits: 8
            type: float
            strategy: tensor
            dynamic: false
            symmetric: true
          input_activations:
            num_bits: 8
            type: float
            strategy: tensor
            dynamic: false
            symmetric: true
          targets: ["Linear"]
```
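
To see which layers the `ignore` list in the recipe leaves unquantized, the matching can be sketched with the standard `re` module. This is a simplified illustration, not the compressed-tensors matcher: it assumes the `re:` prefix marks a regular expression applied from the start of the module name and that other entries are compared as exact names. The module names are hypothetical examples in the style of a Qwen3 MoE checkpoint.

```python
import re

# Ignore list copied from the recipe; "re:" is assumed to mark a regex entry.
IGNORE = ["lm_head", "re:.*mlp.gate$"]


def is_ignored(module_name: str, ignore=IGNORE) -> bool:
    """Return True if module_name matches any entry of the ignore list."""
    for entry in ignore:
        if entry.startswith("re:"):
            if re.match(entry[3:], module_name):
                return True
        elif module_name == entry:
            return True
    return False


# Hypothetical module names used only for illustration.
names = [
    "lm_head",
    "model.layers.0.mlp.gate",
    "model.layers.0.mlp.experts.0.gate_proj",
    "model.layers.0.self_attn.q_proj",
]
for name in names:
    status = "kept in original precision" if is_ignored(name) else "quantized to FP8"
    print(f"{name}: {status}")
```

Under these assumptions the trailing `$` is what separates the MoE router (`...mlp.gate`) from the expert projections (`...gate_proj`), so only the router and `lm_head` are skipped.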
