djdeniro

MiniMax-M2.7-MXFP416

README

License: other

mxfp4_16 Quantization of MiniMaxAI/MiniMax-M2.7

Runtime: requires tcclaviger/vllm22:latest, an RDNA 4 (gfx12xx) vLLM image with mxfp4_16 kernel support. No other vLLM build currently loads these weights.


1. Introduction

This is an MXFP4-16 (4-bit microscaling format with a 16-element group size) quantized variant of MiniMaxAI/MiniMax-M2.7, produced with compressed-tensors using an IQ4_NL codebook.

The quantization:

  • 4-bit weights with 16-element group size, IQ4_NL codebook
  • All Linear layers quantized (MoE experts, FFN, attention projections)
  • Attention k/v_proj scales, router gate, norms, embeddings kept BF16
  • KV cache: FP8 (e4m3), calibrated scales baked into checkpoint
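To make the list above concrete, here is a minimal pure-Python sketch of 4-bit codebook quantization over one 16-element group. It illustrates the general technique and is not the checkpoint's actual encoder: the codebook values are the IQ4_NL table as used in llama.cpp's ggml, while the per-group absmax scale, its float representation, and the helper names `quantize_group` and `dequantize_group` are assumptions made for the example.

```python
# Non-uniform 4-bit codebook (IQ4_NL table from ggml; assumed to match this checkpoint).
IQ4_NL = [-127, -104, -83, -65, -49, -35, -22, -10, 1, 13, 25, 38, 53, 69, 89, 113]
GROUP_SIZE = 16  # number of weights that share one scale


def quantize_group(weights):
    """Map one group of floats to (scale, list of 4-bit codebook indices)."""
    assert len(weights) == GROUP_SIZE
    amax = max(abs(w) for w in weights)
    # Assumption: a simple absmax scale so the largest magnitude lands near +/-127.
    scale = amax / 127.0 if amax > 0 else 1.0
    codes = []
    for w in weights:
        target = w / scale
        # Pick the codebook entry closest to the scaled weight.
        codes.append(min(range(len(IQ4_NL)), key=lambda i: abs(target - IQ4_NL[i])))
    return scale, codes


def dequantize_group(scale, codes):
    """Reconstruct approximate floats from a scale and codebook indices."""
    return [scale * IQ4_NL[c] for c in codes]


group = [0.01 * (i - 8) for i in range(GROUP_SIZE)]
scale, codes = quantize_group(group)
restored = dequantize_group(scale, codes)
max_err = max(abs(a - b) for a, b in zip(group, restored))
print(codes, f"max abs error = {max_err:.4f}")
```

The codebook is denser near zero than at the extremes, which is the usual motivation for a non-uniform table over plain 4-bit integers when weights are roughly bell-shaped.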

The result fits in ~17.5 GiB per GPU (TP8) while retaining near-BF16 quality.


2. Model Architecture

  • 229B total params (BF16), ~12B activated per token (top-8)
  • 256 experts per MoE layer, top-8 routing, 62 transformer layers
  • 200k context window
  • Native tool-calling support
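As a rough sanity check on the per-GPU weight footprint, the parameter count above can be combined with the 4-bit, 16-element-group format. This is a back-of-the-envelope sketch under stated assumptions: the width of the per-group scale is not given in this README (8 and 16 bits are tried here as guesses), all 229B parameters are treated as quantized and evenly sharded across 8 GPUs, and the tensors kept in BF16 are ignored, so the result is a lower bound. The function name `weight_gib_per_gpu` is hypothetical.

```python
GIB = 2**30


def weight_gib_per_gpu(total_params, weight_bits=4, scale_bits=8, group_size=16, tp=8):
    """Estimate quantized weight memory per GPU in GiB.

    Each weight costs `weight_bits` plus its share of one per-group scale.
    """
    bits_per_weight = weight_bits + scale_bits / group_size
    total_bytes = total_params * bits_per_weight / 8
    return total_bytes / tp / GIB


# Assumption: the scale width is unknown, so try two plausible values.
for scale_bits in (8, 16):
    estimate = weight_gib_per_gpu(229e9, scale_bits=scale_bits)
    print(f"{scale_bits}-bit scales: ~{estimate:.1f} GiB per GPU")
```

Under these assumptions both estimates come out below the reported ~17.5 GiB per GPU, which is consistent with some tensors (norms, embeddings, router gate) staying in BF16.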

3. Runtime Requirements

  • GPU: 8× RX 9700 (RDNA 4 / gfx12xx)
  • Memory: 128GB+ system RAM
  • Docker: tcclaviger/vllm22:latest — only validated runtime

The Docker image includes:

  • Custom Triton attention kernels tuned for RDNA4
  • Fixed FP8 KV-cache quantization path
  • Pre-tuned GEMM configs for RX 9700
  • MXFP4-16 kernels for gfx12xx

4. Deployment

Full deployment guide (RDNA4 / RX 9700): docs/vllm_deploy_guide.md

Quick-start:

```bash
docker run --name minimax-mxfp416 \
  --rm --tty --ipc=host --shm-size=128g \
  --device /dev/kfd:/dev/kfd \
  --device /dev/dri/renderD128:/dev/dri/renderD128 \
  --device /dev/dri/renderD129:/dev/dri/renderD129 \
  --device /dev/dri/renderD130:/dev/dri/renderD130 \
  --device /dev/dri/renderD132:/dev/dri/renderD132 \
  --device /dev/dri/renderD137:/dev/dri/renderD137 \
  --device /dev/dri/renderD138:/dev/dri/renderD138 \
  --device /dev/dri/renderD139:/dev/dri/renderD139 \
  --device /dev/dri/renderD140:/dev/dri/renderD140 \
  -e HIP_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
  -e ROCR_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
  -e TRUST_REMOTE_CODE=1 \
  -v /path/to/models:/app/models:ro \
  -p 8000:8000 \
  tcclaviger/vllm22:latest \
  bash -c "cp /app/models/vllm22_minimax_m2.py /app/vllm/vllm/model_executor/models/minimax_m2.py && \
    pip install -q sentencepiece && \
    exec vllm serve /app/models/MiniMax-M2.7-MXFP416 \
      --served-model-name minimax-m2.7-mxfp416 \
      --host 0.0.0.0 --port 8000 --trust-remote-code \
      --tensor-parallel-size 8 --enable-expert-parallel \
      --disable-cascade-attn \
      --reasoning-parser minimax_m2 \
      --enable-auto-tool-choice --tool-call-parser minimax_m2 \
      --enable-prefix-caching --gpu-memory-utilization 0.93 \
      --max-model-len 180000 --max-num-seqs 48 --max-num-batched-tokens 2048 \
      --kv-cache-dtype fp8_e4m3 --attention-backend TRITON_ATTN \
      --override-generation-config '{\"max_tokens\": 16384}'"
```
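The render node numbers in the command above (renderD128, renderD132, renderD137, and so on) are specific to one host and are not contiguous, so they will likely differ on other machines. A small sketch like the following could generate the `--device` flags from whatever nodes exist locally. It assumes every render node under /dev/dri belongs to a GPU you want to expose to the container; the helper name `device_flags` is hypothetical.

```python
import glob
import re


def device_flags(render_nodes, kfd="/dev/kfd"):
    """Build docker --device flags for /dev/kfd plus the given render nodes."""

    def node_number(path):
        match = re.search(r"renderD(\d+)$", path)
        return int(match.group(1)) if match else -1

    # Keep only paths that look like render nodes, ordered by node number.
    nodes = sorted((p for p in render_nodes if node_number(p) >= 0), key=node_number)
    return [f"--device {p}:{p}" for p in [kfd] + nodes]


if __name__ == "__main__":
    found = glob.glob("/dev/dri/renderD*")
    # Print the flags in a form that can be pasted into a multi-line docker run.
    print(" \\\n".join(device_flags(found)))
```

If the host also has an integrated GPU or other adapters, filter the discovered list before passing it in, and keep the HIP_VISIBLE_DEVICES and ROCR_VISIBLE_DEVICES indices in step with the number of devices exposed.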

Performance (8× RX 9700, 210W power limit)

| Metric | Value |
| --- | --- |
| Generation throughput | ~30–35 tokens/s |
| Prefill throughput | up to 2,190 tokens/s (w/ prefix cache) |
| Prefix cache hit rate | ~93% |
| KV cache memory | 11.35 GiB |
| KV cache capacity | 767,856 tokens |
| Max context per request | 180,000 tokens |
| Max concurrent (180k) | 4 requests |
| Model weight memory (TP8) | ~17.5 GiB/GPU |
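The "Max concurrent (180k)" figure follows from the KV cache capacity and the per-request context limit. The sketch below reproduces that arithmetic, assuming each request consumes its full context and ignoring any blocks shared through prefix caching; the 32,768 and 8,192 context lengths are illustrative values that do not come from the table, and `max_concurrent_requests` is a hypothetical helper.

```python
def max_concurrent_requests(kv_capacity_tokens, context_len, max_num_seqs=48):
    """Full-length requests that fit in the KV cache, capped by --max-num-seqs."""
    return min(kv_capacity_tokens // context_len, max_num_seqs)


KV_CAPACITY_TOKENS = 767_856  # from the table above

for context_len in (180_000, 32_768, 8_192):
    n = max_concurrent_requests(KV_CAPACITY_TOKENS, context_len)
    print(f"{context_len:>7} tokens per request -> {n} concurrent requests")
```

At shorter contexts the limit shifts from KV cache capacity to the `--max-num-seqs 48` setting used in the quick-start command.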

Power tip: set a 210 W limit on each GPU with rocm-smi --setpowerlimit <i> 210. At 210 W, sustained throughput is higher than at the full 300 W because thermal throttling is reduced.


5. API Usage

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
completion = client.chat.completions.create(
    model="minimax-m2.7-mxfp416",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
    temperature=1.0,
    max_tokens=1024,
)
print(completion.choices[0].message.content)
```

6. Chat Template

The model uses a Jinja chat template supporting system messages, tool calls (<minimax:tool_call>/</minimax:tool_call>), reasoning content (<think>/</think>), and tool responses (<response>).

```python
from transformers import AutoProcessor, AutoModelForCausalLM

processor = AutoProcessor.from_pretrained(
    "djdeniro/MiniMax-M2.7-MXFP416", trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    "djdeniro/MiniMax-M2.7-MXFP416",
    device_map="auto", dtype="auto", trust_remote_code=True
)
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello, how are you?"},
]
inputs = processor.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True,
    return_dict=True, return_tensors="pt"
).to(model.device)
out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(processor.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```
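When working with raw generated text rather than the server's reasoning parser, the reasoning content may need to be separated from the final answer by hand. The following is a minimal sketch, assuming the output contains both the opening and closing think tags as literal text; depending on the template, the opening tag may instead be part of the prompt, in which case the pattern would need adjusting. The helper name `split_reasoning` is hypothetical.

```python
import re

# Non-greedy match so multiple reasoning blocks are captured separately.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text):
    """Return (reasoning, answer) from text containing <think>...</think> blocks."""
    reasoning = "\n".join(part.strip() for part in THINK_RE.findall(text))
    answer = THINK_RE.sub("", text).strip()
    return reasoning, answer


raw = "<think>The user greeted me.</think>Hello! How can I help?"
reasoning, answer = split_reasoning(raw)
print(repr(reasoning))
print(repr(answer))
```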

7. Inference Parameters

  • temperature: 1.0
  • top_p: 0.95
  • top_k: 40
  • max_tokens: 16384 (default)
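These values can be passed through the OpenAI client used in the API section. The sketch below assumes the server from the quick-start is running on localhost:8000; because `top_k` is not a standard OpenAI parameter, it is sent through the client's `extra_body` argument, which vLLM's OpenAI-compatible server accepts for extra sampling parameters. The helper names `build_request_kwargs` and `ask` are hypothetical.

```python
def build_request_kwargs(prompt, model="minimax-m2.7-mxfp416"):
    """Assemble chat-completion arguments using the parameters listed above."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": 1.0,
        "top_p": 0.95,
        "max_tokens": 16384,
        # top_k is not part of the OpenAI schema, so it travels in extra_body.
        "extra_body": {"top_k": 40},
    }


def ask(prompt, base_url="http://localhost:8000/v1"):
    """Send one prompt to the local server and return the reply text."""
    # Imported lazily so build_request_kwargs has no third-party dependency.
    from openai import OpenAI

    client = OpenAI(base_url=base_url, api_key="EMPTY")
    completion = client.chat.completions.create(**build_request_kwargs(prompt))
    return completion.choices[0].message.content
```

Calling `ask("Hello!")` would then issue a request with all four parameters set explicitly rather than relying on server-side defaults.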

8. Acknowledgments


9. License

Apache 2.0 — inherits from base model.

Model provider: djdeniro

Model tree: MiniMaxAI/MiniMax-M2.7 (base), this model (quantized)

Modalities: text input, text output
