djdeniro/MiniMax-M2.7-MXFP416 API & Inference Endpoint

`mxfp4_16` Quantization of MiniMaxAI/MiniMax-M2.7

Runtime: Requires tcclaviger/vllm22:latest, an RDNA 4 (gfx12xx) vLLM image with mxfp4_16 kernel support. No other vLLM build currently loads these weights.

1. Introduction

This is an MXFP4-16 (4-bit microscaling-style format with a 16-element group size) quantized variant of MiniMaxAI/MiniMax-M2.7, produced using compressed-tensors with an IQ4_NL codebook.

Quantization details:

- 4-bit weights with a 16-element group size and an IQ4_NL codebook
- All Linear layers quantized (MoE experts, FFN, attention projections)
- Attention k/v_proj scales, router gate, norms, and embeddings kept in BF16
- KV cache: FP8 (e4m3), with calibrated scales baked into the checkpoint

The quantized weights occupy ~17.5 GiB per GPU at tensor-parallel size 8 (TP8) while retaining near-BF16 quality.
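To make the scheme above concrete, here is a minimal sketch of group-wise 4-bit codebook quantization with 16-element groups. It is an illustration only, not the checkpoint's actual kernel or storage layout: the codebook values are those of llama.cpp's IQ4_NL type, and the per-group scale rule (`amax / 127`) as well as the helper names `quantize_group` and `dequantize_group` are assumptions made for this example.

```python
# Illustrative only: the real mxfp4_16 layout and scale encoding may differ.
# Codebook: the 16 non-linear levels used by llama.cpp's IQ4_NL type
# (assumed here to match the codebook this checkpoint was built with).
CODEBOOK = [-127, -104, -83, -65, -49, -35, -22, -10,
            1, 13, 25, 38, 53, 69, 89, 113]
GROUP_SIZE = 16  # number of weights that share one scale


def quantize_group(weights):
    """Map one group of floats to (scale, 4-bit codebook indices)."""
    if len(weights) != GROUP_SIZE:
        raise ValueError(f"expected {GROUP_SIZE} weights, got {len(weights)}")
    amax = max(abs(w) for w in weights)
    scale = amax / 127.0  # hypothetical scale rule; 0.0 for an all-zero group
    indices = [
        min(range(len(CODEBOOK)), key=lambda i: abs(CODEBOOK[i] * scale - w))
        for w in weights
    ]
    return scale, indices


def dequantize_group(scale, indices):
    """Reconstruct approximate weights from a scale and codebook indices."""
    return [CODEBOOK[i] * scale for i in indices]


weights = [i / 10 for i in range(-8, 8)]  # 16 sample values in [-0.8, 0.7]
scale, indices = quantize_group(weights)
restored = dequantize_group(scale, indices)
max_error = max(abs(a - b) for a, b in zip(weights, restored))
print(f"scale={scale:.5f} max_error={max_error:.5f}")
```

Each index fits in 4 bits, so a group stores 16 indices plus one shared scale. The non-uniform codebook spends more levels near zero, where most weights cluster.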

2. Model Architecture

- 229B total params (BF16), ~12B activated per token (top-8)
- 256 experts per MoE layer, top-8 routing, 62 transformer layers
- 200k context window
- Native tool-calling support
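As a rough sanity check, the parameter count above can be combined with the ~17.5 GiB per GPU weight footprint quoted elsewhere in this card. The sketch below only does arithmetic on numbers stated here; the derived values are approximations, because the footprint also includes the tensors kept in BF16 and the quantization scales.

```python
GIB = 1024 ** 3

total_params = 229e9          # total parameters, from the model card
bf16_bytes_per_param = 2
weights_gib_per_gpu = 17.5    # quantized weight memory per GPU at TP8
num_gpus = 8

bf16_gib = total_params * bf16_bytes_per_param / GIB
quantized_gib = weights_gib_per_gpu * num_gpus
# Average storage cost per parameter, including BF16 tensors and scales.
effective_bits = quantized_gib * GIB * 8 / total_params

print(f"BF16 weights:      ~{bf16_gib:.0f} GiB")
print(f"Quantized weights: ~{quantized_gib:.0f} GiB across {num_gpus} GPUs")
print(f"Compression:       ~{bf16_gib / quantized_gib:.1f}x")
print(f"Effective size:    ~{effective_bits:.2f} bits per parameter")
```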

3. Runtime Requirements

- GPU: 8× RX 9700 (RDNA 4 / gfx12xx)
- Memory: 128 GB+ system RAM
- Docker: tcclaviger/vllm22:latest (the only validated runtime)

The Docker image includes:

- Custom Triton attention kernels tuned for RDNA4
- Fixed FP8 KV-cache quantization path
- Pre-tuned GEMM configs for RX 9700
- MXFP4-16 kernels for gfx12xx

4. Deployment

Full deployment guide (RDNA4 / RX 9700): docs/vllm_deploy_guide.md

Quick-start:

```bash
# Render-node numbers (renderD128, ...) are host-specific; list yours with `ls /dev/dri`.
# Replace /path/to/models with the directory that holds the checkpoint and vllm22_minimax_m2.py.
docker run --name minimax-mxfp416 \
  --rm --tty --ipc=host --shm-size=128g \
  --device /dev/kfd:/dev/kfd \
  --device /dev/dri/renderD128:/dev/dri/renderD128 \
  --device /dev/dri/renderD129:/dev/dri/renderD129 \
  --device /dev/dri/renderD130:/dev/dri/renderD130 \
  --device /dev/dri/renderD132:/dev/dri/renderD132 \
  --device /dev/dri/renderD137:/dev/dri/renderD137 \
  --device /dev/dri/renderD138:/dev/dri/renderD138 \
  --device /dev/dri/renderD139:/dev/dri/renderD139 \
  --device /dev/dri/renderD140:/dev/dri/renderD140 \
  -e HIP_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
  -e ROCR_VISIBLE_DEVICES=0,1,2,3,4,5,6,7 \
  -e TRUST_REMOTE_CODE=1 \
  -v /path/to/models:/app/models:ro \
  -p 8000:8000 \
  tcclaviger/vllm22:latest \
  bash -c "cp /app/models/vllm22_minimax_m2.py /app/vllm/vllm/model_executor/models/minimax_m2.py && \
    pip install -q sentencepiece && \
    exec vllm serve /app/models/MiniMax-M2.7-MXFP416 \
      --served-model-name minimax-m2.7-mxfp416 \
      --host 0.0.0.0 --port 8000 --trust-remote-code \
      --tensor-parallel-size 8 --enable-expert-parallel \
      --disable-cascade-attn \
      --reasoning-parser minimax_m2 \
      --enable-auto-tool-choice --tool-call-parser minimax_m2 \
      --enable-prefix-caching --gpu-memory-utilization 0.93 \
      --max-model-len 180000 --max-num-seqs 48 --max-num-batched-tokens 2048 \
      --kv-cache-dtype fp8_e4m3 --attention-backend TRITON_ATTN \
      --override-generation-config '{\"max_tokens\": 16384}'"
```
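The render-node numbers in the command above belong to one particular host. A small helper such as the following can generate the `--device` flags for whatever nodes exist locally. It is a sketch: the function name `render_device_flags` is hypothetical, and it assumes every `/dev/dri/renderD*` node belongs to a GPU you want to pass through.

```python
import glob
import re


def render_device_flags(paths=None):
    """Build docker --device flags for /dev/kfd and each render node."""
    if paths is None:
        paths = glob.glob("/dev/dri/renderD*")
    # Sort by the numeric suffix so the order matches the device numbering.
    nodes = sorted(paths, key=lambda p: int(re.search(r"(\d+)$", p).group(1)))
    flags = ["--device /dev/kfd:/dev/kfd"]
    flags += [f"--device {node}:{node}" for node in nodes]
    return flags


if __name__ == "__main__":
    # Print the flags in a form that can be pasted into the docker command.
    print(" \\\n  ".join(render_device_flags()))
```

If the host also has an integrated GPU or other render nodes, filter the list before pasting it, and adjust `HIP_VISIBLE_DEVICES` and `ROCR_VISIBLE_DEVICES` to match.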

Performance (8× RX 9700, 210W power limit)

| Metric | Value |
|---|---|
| Generation throughput | ~30–35 tokens/s |
| Prefill throughput | up to 2,190 tokens/s (w/ prefix cache) |
| Prefix cache hit rate | ~93% |
| KV cache memory | 11.35 GiB |
| KV cache capacity | 767,856 tokens |
| Max context per request | 180,000 tokens |
| Max concurrent (180k) | 4 requests |
| Model weight memory (TP8) | ~17.5 GiB/GPU |
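The concurrency figure in the table follows from the KV cache capacity. The arithmetic below uses only values reported above; the per-token figure is a derived approximation, not a measured number.

```python
GIB = 1024 ** 3

kv_cache_gib = 11.35            # KV cache memory, from the table
kv_capacity_tokens = 767_856    # KV cache capacity, from the table
max_model_len = 180_000         # --max-model-len used at launch

# Full-length requests whose KV entries fit in the cache at the same time.
max_full_context_requests = kv_capacity_tokens // max_model_len

# Approximate KV cache bytes consumed per cached token.
bytes_per_token = kv_cache_gib * GIB / kv_capacity_tokens

print(max_full_context_requests)   # 4
print(round(bytes_per_token))
```

Shorter requests use proportionally less cache, so the practical number of concurrent sequences is usually higher, up to the `--max-num-seqs` limit.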

Power tip: Set a 210 W power limit on each GPU with rocm-smi --setpowerlimit <i> 210 (with <i> as the GPU index). At 210 W, sustained throughput is higher than at the full 300 W because of reduced thermal throttling.

5. API Usage

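```python
from openai import OpenAI

# The server started above exposes an OpenAI-compatible API on port 8000.
# The key is a placeholder; vLLM only checks it when started with --api-key.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

completion = client.chat.completions.create(
    model="minimax-m2.7-mxfp416",  # must match --served-model-name
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"}
    ],
    temperature=1.0,
    max_tokens=1024,
)
print(completion.choices[0].message.content)
```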
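Because the server is launched with `--enable-auto-tool-choice` and the `minimax_m2` tool-call parser, the same endpoint should also accept OpenAI-style tool definitions. The following is a sketch under that assumption: the `get_weather` tool and the `ask_with_tools` helper are hypothetical names used only for illustration.

```python
import json

# Hypothetical tool definition in the OpenAI function-calling schema.
WEATHER_TOOL = {
    "type": "function",
    "function": {
        "name": "get_weather",  # hypothetical tool name
        "description": "Return the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}


def ask_with_tools(prompt, base_url="http://localhost:8000/v1"):
    """Send one request with the tool attached and return parsed tool calls."""
    from openai import OpenAI  # imported lazily; only needed when called

    client = OpenAI(base_url=base_url, api_key="EMPTY")
    completion = client.chat.completions.create(
        model="minimax-m2.7-mxfp416",
        messages=[{"role": "user", "content": prompt}],
        tools=[WEATHER_TOOL],
        tool_choice="auto",
    )
    calls = completion.choices[0].message.tool_calls or []
    # Each call carries the function name and a JSON string of arguments.
    return [(c.function.name, json.loads(c.function.arguments)) for c in calls]
```

Calling `ask_with_tools("What is the weather in Berlin?")` returns a list of `(name, arguments)` pairs. The list is empty when the model answers directly instead of calling the tool.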

6. Chat Template

The model uses a Jinja chat template supporting system messages, tool calls (<minimax:tool_call>/</minimax:tool_call>), reasoning content (<think>/</think>), and tool responses (<response>).

```python
from transformers import AutoProcessor, AutoModelForCausalLM

# trust_remote_code is required because the repository ships custom code.
processor = AutoProcessor.from_pretrained(
    "djdeniro/MiniMax-M2.7-MXFP416", trust_remote_code=True
)
model = AutoModelForCausalLM.from_pretrained(
    "djdeniro/MiniMax-M2.7-MXFP416",
    device_map="auto", dtype="auto", trust_remote_code=True
)

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Hello, how are you?"}
]
# Render the chat template and tokenize in one step.
inputs = processor.apply_chat_template(
    messages, tokenize=True, add_generation_prompt=True,
    return_dict=True, return_tensors="pt"
).to(model.device)
out = model.generate(**inputs, max_new_tokens=128, do_sample=False)
# Decode only the newly generated tokens, skipping the prompt.
print(processor.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```
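When the model is served through vLLM with `--reasoning-parser minimax_m2`, the server separates reasoning from the final answer. For raw decoded text, a small helper like the one below can do a similar split. It is a sketch that assumes both `<think>` and `</think>` appear literally in the text; `split_reasoning` is a hypothetical name, not part of the model's tooling.

```python
import re

# Non-greedy match so several <think> blocks are handled separately.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text):
    """Return (reasoning, answer) from raw text containing <think> blocks."""
    reasoning = "\n".join(m.strip() for m in THINK_RE.findall(text))
    answer = THINK_RE.sub("", text).strip()
    return reasoning, answer


raw = "<think>The user greets me.</think>Hello! How can I help?"
print(split_reasoning(raw))
# ('The user greets me.', 'Hello! How can I help?')
```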

7. Inference Parameters

- temperature: 1.0
- top_p: 0.95
- top_k: 40
- max_tokens: 16384 (default)
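These settings can be passed through the OpenAI client. `top_k` is not part of the OpenAI chat schema, so the sketch below sends it through `extra_body`, which vLLM's OpenAI-compatible server accepts for extra sampling parameters. The helper name `build_request_kwargs` is hypothetical.

```python
# Recommended sampling settings from this section.
SAMPLING = {"temperature": 1.0, "top_p": 0.95, "top_k": 40}


def build_request_kwargs(messages, model="minimax-m2.7-mxfp416"):
    """Split the settings into standard OpenAI arguments and vLLM extras."""
    return {
        "model": model,
        "messages": messages,
        "temperature": SAMPLING["temperature"],
        "top_p": SAMPLING["top_p"],
        # top_k is sent as an extra body field rather than a named argument.
        "extra_body": {"top_k": SAMPLING["top_k"]},
    }


kwargs = build_request_kwargs([{"role": "user", "content": "Hello!"}])
print(kwargs["extra_body"])
```

The result can be used as `client.chat.completions.create(**kwargs)` with an OpenAI client pointed at the server. `max_tokens` is left out on purpose so that the server-side default of 16384 applies.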

8. Acknowledgments

- Base model: MiniMaxAI/MiniMax-M2.7
- Quantization inspiration: tcclaviger/Step-3.7-Flash-240REAP-MXFP416
- Runtime: tcclaviger/vllm22

9. License

Apache 2.0, inherited from the base model.
