Qwen3.5-35B-A3B-MXFP4 API & Inference Endpoint

Model Overview

Model Architecture: Qwen3_5MoeForConditionalGeneration
- Input: Text
- Output: Text
Supported Hardware Microarchitecture: AMD MI300, MI350/MI355
ROCm: 7.0.0
PyTorch: 2.9.1
Transformers: 5.3.0
vLLM: 0.16.0rc2
lm-evaluation-harness: 0.4.11
Operating System(s): Linux
Inference Engine: SGLang/vLLM
Model Optimizer: AMD-Quark (v0.12)
- Weight quantization: OCP MXFP4, Static
- Activation quantization: OCP MXFP4, Dynamic

Model Quantization

The model was quantized from Qwen/Qwen3.5-35B-A3B-FP8 using AMD-Quark. Both weights and activations are quantized to MXFP4.
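To make the format concrete, here is a minimal sketch of MXFP4 quantization as described in the OCP Microscaling (MX) specification: values are grouped into blocks of 32 that share one power-of-two scale, and each element is stored as a 4-bit E2M1 float. This is an illustration only, not AMD-Quark's implementation. The function names (`quantize_block`, `quantize_mxfp4`) are hypothetical, and the rounding is simplified to "nearest representable magnitude".

```python
import math

# Magnitudes representable by an FP4 E2M1 element (the sign is a separate bit).
E2M1_VALUES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 32   # number of elements sharing one scale in MXFP4
E2M1_EMAX = 2     # exponent of the largest E2M1 magnitude (6.0 = 1.5 * 2**2)


def quantize_block(block):
    """Return (shared_exponent, dequantized_values) for one block of floats."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 0, [0.0] * len(block)
    # Shared power-of-two scale chosen so the largest element fits the E2M1 range.
    shared_exp = math.floor(math.log2(amax)) - E2M1_EMAX
    scale = 2.0 ** shared_exp
    out = []
    for x in block:
        # Scale into element range and saturate at the largest magnitude (6.0).
        mag = min(abs(x) / scale, E2M1_VALUES[-1])
        # Simplified rounding: pick the closest representable magnitude.
        nearest = min(E2M1_VALUES, key=lambda v: abs(v - mag))
        out.append(math.copysign(nearest * scale, x))
    return shared_exp, out


def quantize_mxfp4(values):
    """Quantize a flat list block by block and return the dequantized list."""
    result = []
    for start in range(0, len(values), BLOCK_SIZE):
        _, deq = quantize_block(values[start:start + BLOCK_SIZE])
        result.extend(deq)
    return result
```

In this picture, "Static" weight quantization would correspond to computing the shared exponents once, offline, while "Dynamic" activation quantization would compute them per block at inference time.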

Quantization scripts:

cd Quark/examples/torch/language_modeling/llm_ptq/
export exclude_layers="lm_head model.visual.* mtp.* *mlp.gate *shared_expert_gate* *.linear_attn.* *.self_attn.* *.shared_expert.*"
python3 quantize_quark.py --model_dir Qwen/Qwen3.5-35B-A3B-FP8 \
                          --quant_scheme mxfp4 \
                          --file2file_quantization \
                          --exclude_layers $exclude_layers \
                          --output_dir amd/Qwen3.5-35B-A3B-MXFP4
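The effect of the `exclude_layers` patterns can be sketched with Python's `fnmatch` module. This assumes the patterns are interpreted as glob-style wildcards, which is an assumption about AMD-Quark's matching rather than something stated here, and the layer names below are hypothetical examples, not read from the checkpoint.

```python
from fnmatch import fnmatchcase

# Patterns copied from the exclude_layers variable above.
EXCLUDE_PATTERNS = (
    "lm_head model.visual.* mtp.* *mlp.gate *shared_expert_gate* "
    "*.linear_attn.* *.self_attn.* *.shared_expert.*"
).split()


def is_excluded(layer_name, patterns=EXCLUDE_PATTERNS):
    """Return True if the layer name matches any exclusion pattern."""
    return any(fnmatchcase(layer_name, pattern) for pattern in patterns)


# Hypothetical layer names, for illustration only.
EXAMPLE_LAYERS = [
    "lm_head",
    "model.visual.blocks.0.attn.qkv",
    "model.language_model.layers.0.mlp.gate",
    "model.language_model.layers.0.mlp.shared_expert_gate",
    "model.language_model.layers.0.mlp.shared_expert.up_proj",
    "model.language_model.layers.0.linear_attn.in_proj_qkv",
    "model.language_model.layers.3.self_attn.q_proj",
    "model.language_model.layers.0.mlp.experts.3.gate_proj",
]

if __name__ == "__main__":
    for name in EXAMPLE_LAYERS:
        status = "kept as-is" if is_excluded(name) else "quantized to MXFP4"
        print(f"{name}: {status}")
```

Under these assumptions, only the routed expert projections fall through every pattern and are quantized. The output head, vision tower, MTP module, router gates, attention blocks, and shared expert all stay in the source precision.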

For further details or issues, please refer to the AMD-Quark documentation or contact the respective developers.

Evaluation

The model was evaluated on the GSM8K benchmark using the vLLM framework.

Accuracy

Reproduction

The GSM8K results were obtained with vLLM running in the Docker image rocm/vllm-dev:nightly_main_20260211, which has vLLM installed inside the container.

docker pull rocm/vllm-dev:nightly_main_20260211

Evaluating the model in a new terminal

lm_eval \
  --model vllm \
  --model_args pretrained=amd/Qwen3.5-35B-A3B-MXFP4,tensor_parallel_size=4,max_model_len=262144,gpu_memory_utilization=0.90,max_gen_toks=2048,trust_remote_code=True,reasoning_parser=qwen3 \
  --tasks gsm8k --num_fewshot 5 \
  --batch_size auto
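When scripting the evaluation, the long `--model_args` string is easier to maintain as a dictionary. The sketch below rebuilds the command above from one. The helper names (`build_model_args`, `build_lm_eval_command`) are hypothetical, and the values are copied from the command rather than independently validated.

```python
import shlex

# Values copied from the lm_eval command above.
MODEL_ARGS = {
    "pretrained": "amd/Qwen3.5-35B-A3B-MXFP4",
    "tensor_parallel_size": 4,
    "max_model_len": 262144,
    "gpu_memory_utilization": "0.90",  # kept as a string to preserve "0.90"
    "max_gen_toks": 2048,
    "trust_remote_code": True,
    "reasoning_parser": "qwen3",
}


def build_model_args(args):
    """Join key=value pairs into the comma-separated form passed to --model_args."""
    return ",".join(f"{key}={value}" for key, value in args.items())


def build_lm_eval_command(args, tasks="gsm8k", num_fewshot=5, batch_size="auto"):
    """Return the lm_eval invocation as an argument list (usable with subprocess.run)."""
    return [
        "lm_eval",
        "--model", "vllm",
        "--model_args", build_model_args(args),
        "--tasks", tasks,
        "--num_fewshot", str(num_fewshot),
        "--batch_size", str(batch_size),
    ]


if __name__ == "__main__":
    # Print the command in copy-pasteable shell form.
    print(shlex.join(build_lm_eval_command(MODEL_ARGS)))
```

Passing the list to `subprocess.run` inside the container would launch the same evaluation. Changing `tensor_parallel_size` or `max_model_len` then becomes a one-line dictionary edit.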

License
