djdeniro
MiniMax-M2.7-MXFP416
License: other
mxfp4_16 Quantization of MiniMaxAI/MiniMax-M2.7
Runtime: Requires tcclaviger/vllm22:latest, an RDNA 4 (gfx12xx) vLLM image with mxfp4_16 kernel support. No other vLLM build currently loads these weights.
1. Introduction
This is an MXFP4-16 (Mixed-precision 4-bit with 16-element group size) quantized variant of MiniMaxAI/MiniMax-M2.7, produced using compressed-tensors with an IQ4_NL codebook.
The quantization:
- 4-bit weights with 16-element group size, IQ4_NL codebook
- All `Linear` layers quantized (MoE experts, FFN, attention projections)
- Attention `k/v_proj` scales, router gate, norms, embeddings kept BF16
- KV cache: FP8 (e4m3), calibrated scales baked into checkpoint
The result fits in ~17.5 GiB per GPU (TP8) while retaining near-BF16 quality.
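To make the "16-element group size with a codebook" idea concrete, here is a minimal pure-Python sketch of group-wise 4-bit codebook quantization. It is not the kernel or storage layout used by this checkpoint. The table is the IQ4_NL value table as it appears in llama.cpp, and both the reuse of those exact values here and the per-group scale rule (`amax / 127`) are assumptions made for illustration.

```python
# Illustrative sketch only: the real mxfp4_16 packing and scale encoding may differ.
# 16-entry non-linear codebook (IQ4_NL values as found in llama.cpp; assumed here).
IQ4_NL = [-127, -104, -83, -65, -49, -35, -22, -10, 1, 13, 25, 38, 53, 69, 89, 113]
GROUP = 16  # elements that share one scale


def quantize_group(values):
    """Map one 16-element group to (scale, list of 4-bit codebook indices)."""
    assert len(values) == GROUP
    amax = max(abs(v) for v in values)
    # Assumed scale rule: the largest magnitude maps to the codebook extreme (127).
    scale = amax / 127.0 if amax > 0 else 1.0
    indices = [
        min(range(len(IQ4_NL)), key=lambda i: abs(IQ4_NL[i] * scale - v))
        for v in values
    ]
    return scale, indices


def dequantize_group(scale, indices):
    """Reconstruct approximate values from a scale and codebook indices."""
    return [IQ4_NL[i] * scale for i in indices]


weights = [0.01 * (i - 8) for i in range(GROUP)]  # toy data: -0.08 .. 0.07
scale, idx = quantize_group(weights)
restored = dequantize_group(scale, idx)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
```

Each weight costs 4 bits for its index plus a share of the per-group scale, which is why a smaller group size trades a little extra storage for finer-grained scaling.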
2. Model Architecture
- 229B total params (BF16), ~12B activated per token (top-8)
- 256 experts per MoE layer, top-8 routing, 62 transformer layers
- 200k context window
- Native tool-calling support
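As a rough sanity check on the per-GPU weight footprint, the parameter count above can be turned into a back-of-the-envelope estimate. The sketch below assumes every parameter is quantized to 4 bits, that each 16-element group carries one 8-bit scale, and that weights split evenly across 8 GPUs. None of those three assumptions is confirmed by this card, and the tensors kept in BF16 are not modeled, so the result should read as a lower-end estimate rather than a measurement.

```python
TOTAL_PARAMS = 229e9  # total parameter count from the list above
WEIGHT_BITS = 4       # 4-bit codebook index per weight
GROUP_SIZE = 16       # weights sharing one scale
SCALE_BITS = 8        # ASSUMPTION: one 8-bit scale per group
TP = 8                # tensor-parallel degree (TP8)


def quantized_gib_per_gpu(params=TOTAL_PARAMS, weight_bits=WEIGHT_BITS,
                          group_size=GROUP_SIZE, scale_bits=SCALE_BITS, tp=TP):
    """Estimate quantized weight memory per GPU in GiB, ignoring BF16 tensors."""
    bits_per_weight = weight_bits + scale_bits / group_size
    total_bytes = params * bits_per_weight / 8
    return total_bytes / tp / 2**30


estimate = quantized_gib_per_gpu()
print(f"~{estimate:.1f} GiB per GPU before BF16 tensors are added")
```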
3. Runtime Requirements
- GPU: 8× RX 9700 (RDNA 4 / gfx12xx)
- Memory: 128GB+ system RAM
- Docker: `tcclaviger/vllm22:latest` (only validated runtime)
The Docker image includes:
- Custom Triton attention kernels tuned for RDNA4
- Fixed FP8 KV-cache quantization path
- Pre-tuned GEMM configs for RX 9700
- MXFP4-16 kernels for gfx12xx
4. Deployment
Full deployment guide (RDNA4 / RX 9700): docs/vllm_deploy_guide.md
Quick-start:
```bash
docker run --name minimax-mxfp416 \
  --rm --tty --ipc=host --shm-size=128g \
  --device /dev/kfd:/dev/kfd \
  --device /dev/dri/renderD128:/dev/dri/renderD128 \
  --device /dev/dri/renderD129:/dev/dri/renderD129 \
  --device /dev/dri/renderD130:/dev/dri/renderD130 \
  --device /dev/dri/renderD132:/dev/dri/renderD132 \
  --device /dev/dri/renderD137:/dev/dri/renderD137 \
  --device /dev/dri/renderD138:/dev/dri/renderD138 \
  --device /dev/dri/renderD139:/dev/dri/renderD139 \
  --device /dev/dri/renderD140:/dev/dri/renderD140 \
  -e HIP_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
  -e ROCR_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
  -e TRUST_REMOTE_CODE=1 \
  -v /path/to/models:/app/models:ro \
  -p 8000:8000 \
  tcclaviger/vllm22:latest \
  bash -c "cp /app/models/vllm22_minimax_m2.py /app/vllm/vllm/model_executor/models/minimax_m2.py && \
    pip install -q sentencepiece && \
    exec vllm serve /app/models/MiniMax-M2.7-MXFP416 \
      --served-model-name minimax-m2.7-mxfp416 \
      --host 0.0.0.0 --port 8000 --trust-remote-code \
      --tensor-parallel-size 8 --enable-expert-parallel \
      --disable-cascade-attn \
      --reasoning-parser minimax_m2 \
      --enable-auto-tool-choice --tool-call-parser minimax_m2 \
      --enable-prefix-caching --gpu-memory-utilization 0.93 \
      --max-model-len 180000 --max-num-seqs 48 --max-num-batched-tokens 2048 \
      --kv-cache-dtype fp8_e4m3 --attention-backend TRITON_ATTN \
      --override-generation-config '{\"max_tokens\": 16384}'"
```
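Once the container is up, it can help to confirm that the server is answering and that the served model name matches before sending real traffic. The sketch below uses only the Python standard library and the OpenAI-compatible `/v1/models` route; the helper names `fetch_models` and `served_model_ids` are made up for this example, and the base URL assumes the port mapping from the quick-start command.

```python
import json
import urllib.request


def served_model_ids(payload):
    """Extract model ids from an OpenAI-style /v1/models response body."""
    return [entry["id"] for entry in payload.get("data", [])]


def fetch_models(base_url="http://localhost:8000/v1", timeout=5.0):
    """GET {base_url}/models and return the decoded JSON body."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as resp:
        return json.load(resp)


# Against a running server (not executed here):
#   ids = served_model_ids(fetch_models())
#   assert "minimax-m2.7-mxfp416" in ids
```

Loading a model of this size can take a while, so a connection error shortly after `docker run` may only mean the server is still starting.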
Performance (8× RX 9700, 210W power limit)
| Metric | Value |
|---|---|
| Generation throughput | ~30–35 tokens/s |
| Prefill throughput | up to 2,190 tokens/s (w/ prefix cache) |
| Prefix cache hit rate | ~93% |
| KV cache memory | 11.35 GiB |
| KV cache capacity | 767,856 tokens |
| Max context per request | 180,000 tokens |
| Max concurrent (180k) | 4 requests |
| Model weight memory (TP8) | ~17.5 GiB/GPU |
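The "Max concurrent (180k)" row follows from the KV cache capacity and the per-request context length. A small sketch of that arithmetic, assuming each request reserves its full context, ignoring any sharing from prefix caching and block granularity, and capping at the `--max-num-seqs 48` value from the quick-start command:

```python
KV_CAPACITY_TOKENS = 767_856  # "KV cache capacity" from the table
MAX_NUM_SEQS = 48             # --max-num-seqs in the quick-start command


def max_concurrent(context_tokens, capacity=KV_CAPACITY_TOKENS, cap=MAX_NUM_SEQS):
    """Requests that fit if each one uses its full context length."""
    return min(cap, capacity // context_tokens)


print(max_concurrent(180_000))  # full 180k-token requests
print(max_concurrent(32_768))   # shorter contexts allow more requests in flight
print(max_concurrent(8_192))    # here the --max-num-seqs cap is the limit
```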
Power tip: Set `rocm-smi --setpowerlimit <i> 210` per GPU. At 210W, sustained throughput is higher than at the full 300W due to reduced thermal throttling.
5. API Usage
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="minimax-m2.7-mxfp416",
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"},
    ],
    temperature=1.0,
    max_tokens=1024,
)
print(completion.choices[0].message.content)
```
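Because the server is started with `--enable-auto-tool-choice` and the `minimax_m2` tool-call parser, tool definitions can be passed in the standard OpenAI `tools` format. The following is a sketch of how such a request might be assembled; the `get_weather` tool and the `build_tool_request` helper are hypothetical and exist only for illustration.

```python
def build_tool_request(question):
    """Assemble keyword arguments for client.chat.completions.create()."""
    weather_tool = {
        "type": "function",
        "function": {
            "name": "get_weather",  # hypothetical tool, for illustration only
            "description": "Look up the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
    return {
        "model": "minimax-m2.7-mxfp416",
        "messages": [{"role": "user", "content": question}],
        "tools": [weather_tool],
        "tool_choice": "auto",
        "max_tokens": 1024,
    }


request = build_tool_request("What is the weather in Berlin?")

# With a running server and the client from the example above:
#   completion = client.chat.completions.create(**request)
#   for call in completion.choices[0].message.tool_calls or []:
#       print(call.function.name, call.function.arguments)
```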
6. Chat Template
The model uses a Jinja chat template supporting system messages, tool calls (<minimax:tool_call>/</minimax:tool_call>), reasoning content (<think>/</think>), and tool responses (<response>).
```python
from transformers import AutoProcessor, AutoModelForCausalLM

processor = AutoProcessor.from_pretrained(
    "djdeniro/MiniMax-M2.7-MXFP416", trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    "djdeniro/MiniMax-M2.7-MXFP416",
    device_map="auto", dtype="auto", trust_remote_code=True,
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello, how are you?"},
]
inputs = processor.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
print(processor.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```
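When raw generated text is handled directly, without the server-side reasoning parser, the `<think>`/`</think>` tags described above can be used to separate reasoning from the final answer. A minimal sketch, assuming at most one well-formed reasoning block per response; `split_reasoning` is a hypothetical helper, not part of the model's tooling.

```python
import re

# Non-greedy match so only the first <think>...</think> block is captured.
_THINK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text):
    """Return (reasoning, answer); reasoning is empty if no block is found."""
    match = _THINK.search(text)
    if match is None:
        return "", text.strip()
    reasoning = match.group(1).strip()
    answer = (text[:match.start()] + text[match.end():]).strip()
    return reasoning, answer


reasoning, answer = split_reasoning("<think>Greet the user.</think>\nHello there!")
```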
7. Inference Parameters
- temperature: 1.0
- top_p: 0.95
- top_k: 40
- max_tokens: 16384 (default)
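With the OpenAI Python client, `temperature`, `top_p`, and `max_tokens` are regular arguments, while `top_k` is not part of the OpenAI schema and is usually forwarded to vLLM through `extra_body`. A sketch of one way to bundle these settings; the `sampling_kwargs` helper is hypothetical.

```python
def sampling_kwargs(max_tokens=16384):
    """Sampling settings from this section as create() keyword arguments."""
    return {
        "temperature": 1.0,
        "top_p": 0.95,
        "max_tokens": max_tokens,
        # top_k is a vLLM extension, so it travels in extra_body.
        "extra_body": {"top_k": 40},
    }


params = sampling_kwargs()

# With a running server and an OpenAI client:
#   client.chat.completions.create(
#       model="minimax-m2.7-mxfp416", messages=messages, **params)
```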
8. Acknowledgments
- Base model: MiniMaxAI/MiniMax-M2.7
- Quantization inspiration: tcclaviger/Step-3.7-Flash-240REAP-MXFP416
- Runtime: tcclaviger/vllm22
9. License
Apache 2.0 — inherits from base model.