Qwen3.5-397B-A17B-MXFP4 API & Inference Endpoint

Model Overview

Model Architecture: Qwen3_5MoeForConditionalGeneration
- Input: Text, Image, Video
- Output: Text
Supported Hardware Microarchitecture: AMD MI300, MI350/MI355
ROCm: 7.0.0
PyTorch: 2.9.1
Transformers: 5.3.0
Operating System(s): Linux
Inference Engine: SGLang/vLLM
Model Optimizer: AMD-Quark (v0.12)
- Quantized layers: Experts in language model only
- Weight quantization: OCP MXFP4, Static
- Activation quantization: OCP MXFP4, Dynamic

Model Quantization

The model was quantized from Qwen/Qwen3.5-397B-A17B-FP8 using AMD-Quark. Both the weights and the activations are quantized to MXFP4.
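To make the MXFP4 format more concrete, the sketch below quantizes and dequantizes one block in pure Python. It follows the OCP Microscaling layout as generally described (32 elements share one power-of-two scale, each element is an FP4 E2M1 value). The function name `quantize_block` is hypothetical, ties are broken toward the smaller magnitude for simplicity, and the E8M0 exponent range is not clamped, so this is an illustration rather than the AMD-Quark implementation.

```python
import math

# Magnitudes representable in FP4 E2M1 (sign is stored separately).
E2M1_VALUES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 32   # elements that share one scale in MXFP4
E2M1_EMAX = 2     # exponent of the largest E2M1 value (6.0 = 1.5 * 2**2)


def quantize_block(block):
    """Return (shared_exponent, dequantized_values) for one block of 32 floats.

    Illustrative only: simple nearest-value rounding, no E8M0 range clamping.
    """
    if len(block) != BLOCK_SIZE:
        raise ValueError(f"expected {BLOCK_SIZE} values, got {len(block)}")

    max_abs = max(abs(v) for v in block)
    if max_abs == 0.0:
        return 0, [0.0] * BLOCK_SIZE

    # Shared power-of-two scale chosen so the largest element lands in the E2M1 range.
    shared_exp = math.floor(math.log2(max_abs)) - E2M1_EMAX
    scale = 2.0 ** shared_exp

    dequantized = []
    for v in block:
        scaled = abs(v) / scale
        # Snap to the nearest representable magnitude; anything above 6.0 saturates to 6.0.
        magnitude = min(E2M1_VALUES, key=lambda q: abs(scaled - q))
        dequantized.append(math.copysign(magnitude * scale, v))
    return shared_exp, dequantized


if __name__ == "__main__":
    exp, values = quantize_block([0.1 * i for i in range(32)])
    print("shared exponent:", exp)
    print("first values:", values[:6])
```

With static weight quantization the scales are computed once and stored in the checkpoint, while dynamic activation quantization recomputes them per block at inference time.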

Quantization script:

import os
from quark.torch import LLMTemplate, ModelQuantizer


# Configuration
ckpt_path = "Qwen/Qwen3.5-397B-A17B-FP8"
output_dir = "amd/Qwen3.5-397B-A17B-MXFP4"
quant_scheme = "mxfp4"
exclude_layers = ["lm_head", "model.visual.*", "mtp.*", "*mlp.gate", "*shared_expert_gate*", "*.linear_attn.*", "*.self_attn.*", "*.shared_expert.*"]

# Get quant config from template
template = LLMTemplate.get("qwen3_5_moe")
quant_config = template.get_config(scheme=quant_scheme, exclude_layers=exclude_layers)

# Quantize in file-to-file mode (checkpoint in, quantized checkpoint out)
quantizer = ModelQuantizer(quant_config)
quantizer.direct_quantize_checkpoint(
    pretrained_model_path=ckpt_path,
    save_path=output_dir,
)

For further details or issues, please refer to the AMD-Quark documentation or contact the respective developers.
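The `exclude_layers` patterns in the script above determine which modules stay unquantized. The sketch below previews their effect using Python's standard `fnmatch` glob matching. This assumes shell-style wildcards, which may differ from how AMD-Quark resolves patterns internally, and the layer names are hypothetical examples, not names read from the checkpoint.

```python
from fnmatch import fnmatchcase

# Patterns copied from the quantization script.
EXCLUDE_LAYERS = [
    "lm_head", "model.visual.*", "mtp.*", "*mlp.gate", "*shared_expert_gate*",
    "*.linear_attn.*", "*.self_attn.*", "*.shared_expert.*",
]


def is_excluded(layer_name, patterns=EXCLUDE_LAYERS):
    """Return True if the layer name matches any exclusion pattern (glob semantics assumed)."""
    return any(fnmatchcase(layer_name, pattern) for pattern in patterns)


if __name__ == "__main__":
    # Hypothetical layer names, for illustration only.
    candidates = [
        "lm_head",
        "model.language_model.layers.0.mlp.gate",
        "model.language_model.layers.0.mlp.shared_expert.up_proj",
        "model.language_model.layers.3.self_attn.q_proj",
        "model.language_model.layers.0.mlp.experts.7.down_proj",
    ]
    for name in candidates:
        status = "kept in original precision" if is_excluded(name) else "quantized to MXFP4"
        print(f"{name}: {status}")
```

Under these assumptions only the routed expert projections fall through every pattern, which is consistent with the "Experts in language model only" note in the overview.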

Evaluation

The model was evaluated on the GSM8K benchmark using the vLLM framework.
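For readers unfamiliar with how GSM8K is scored, the sketch below shows a simplified exact-match check. GSM8K reference solutions end with a line of the form `#### <number>`. The helper names and the "last number in the output" extraction rule are assumptions made for illustration and do not reproduce the exact filters used by the evaluation harness.

```python
import re

# Matches integers and decimals, optionally negative, with thousands separators.
NUMBER = re.compile(r"-?\d[\d,]*(?:\.\d+)?")


def reference_answer(solution):
    """Extract the final answer that follows '####' in a GSM8K reference solution."""
    return solution.split("####")[-1].strip().replace(",", "")


def predicted_answer(generation):
    """Take the last number in the model output as its answer (simplified rule)."""
    matches = NUMBER.findall(generation)
    return matches[-1].replace(",", "") if matches else None


def exact_match(generations, solutions):
    """Fraction of generations whose extracted answer equals the reference answer."""
    hits = sum(
        predicted_answer(g) == reference_answer(s)
        for g, s in zip(generations, solutions)
    )
    return hits / len(solutions)


if __name__ == "__main__":
    # Made-up examples, not benchmark data.
    solutions = ["16 - 3 - 4 = 9 eggs. 9 * 2 = 18 dollars.\n#### 18", "2 + 1 = 3\n#### 3"]
    generations = ["She earns 9 * 2 = 18 dollars per day.", "The answer is 4."]
    print(exact_match(generations, solutions))
```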

Accuracy

Reproduction

The GSM8K results were obtained with vLLM, using the Docker image rocm/vllm-dev:nightly_main_20260211, which ships with vLLM preinstalled.

Evaluating the model in a new terminal

lm_eval \
  --model vllm \
  --model_args pretrained=amd/Qwen3.5-397B-A17B-MXFP4,tensor_parallel_size=4,max_model_len=262144,gpu_memory_utilization=0.90,max_gen_toks=2048,trust_remote_code=True,reasoning_parser=qwen3 \
  --tasks gsm8k --num_fewshot 5 \
  --batch_size auto
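The `--model_args` value is a single comma-separated list of `key=value` pairs, which is easy to mistype. As a convenience, the sketch below assembles the same command from a dictionary and prints it without running anything. The helper name `build_model_args` is hypothetical; the option names and values are copied from the command above.

```python
import shlex


def build_model_args(options):
    """Join a dict of options into the comma-separated key=value string lm_eval expects."""
    return ",".join(f"{key}={value}" for key, value in options.items())


# Values copied from the reproduction command; strings preserve the original formatting.
MODEL_OPTIONS = {
    "pretrained": "amd/Qwen3.5-397B-A17B-MXFP4",
    "tensor_parallel_size": 4,
    "max_model_len": 262144,
    "gpu_memory_utilization": "0.90",
    "max_gen_toks": 2048,
    "trust_remote_code": True,
    "reasoning_parser": "qwen3",
}

COMMAND = [
    "lm_eval",
    "--model", "vllm",
    "--model_args", build_model_args(MODEL_OPTIONS),
    "--tasks", "gsm8k",
    "--num_fewshot", "5",
    "--batch_size", "auto",
]

if __name__ == "__main__":
    # Print a shell-safe command line; pass COMMAND to subprocess.run to execute it.
    print(shlex.join(COMMAND))
```

Keeping the options in a dictionary makes it straightforward to change a single setting, for example `tensor_parallel_size`, to match the number of GPUs available.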

License
