0xSero/MiniMax-M2.1-162B API & Inference Endpoint

At a glance


Base model	MiniMaxAI/MiniMax-M2.1
Format	BF16
Total params	162B
Active / token	—
Experts / layer	180
Layers	62
Hidden size	3072
Context	196,608
On-disk size	163 GB

Which variant should I pick?

Variant	Format	Link
`MiniMax-M2.1-139B`	BF16	link
`MiniMax-M2.1-162B` (this)	BF16	link

A 30% expert-pruned variant of MiniMax-M2.1, produced with REAP (Router-weighted Expert Activation Pruning).

Property	Value
Base Model	MiniMaxAI/MiniMax-M2.1
Parameters	~162B
Experts	180/256 (70% retained)
Architecture	MoE (Mixture of Experts)
Precision	BF16
VRAM Required	~324GB
Stability	0 loops in stress tests
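
The ~324GB figure matches a weights-only estimate of parameter count times bytes per parameter. The sketch below shows that arithmetic; the function name `weight_memory_gb` is hypothetical, and the estimate leaves out KV cache, activations, and framework overhead, so actual requirements will be higher.

```python
def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Weights-only memory estimate in decimal gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bytes_per_param / 1e9

# ~162B parameters stored as BF16 (2 bytes per parameter)
print(weight_memory_gb(162e9, 2))  # 324.0
```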

Stress Test Results

Tested at 4 temperatures (0.0, 0.2, 0.7, 1.0) across 6 prompt types (24 total tests):

Temperature	math_word	reasoning	code	json	instruction	creative
0.0	OK	OK	OK	OK	OK	OK
0.2	OK	OK	OK	OK	OK	OK
0.7	OK	OK	OK	OK	OK	OK
1.0	OK	OK	OK	OK	OK	OK

Result: 24/24 tests passed, 0 loops detected
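
This card does not describe how loops were detected. One plausible approach is to flag outputs in which the same word n-gram recurs many times. In the sketch below, `has_loop` and its thresholds (`n`, `max_repeats`) are assumptions for illustration, not the author's test harness; only the temperatures and prompt types come from the table above.

```python
from collections import Counter
from itertools import product

TEMPERATURES = [0.0, 0.2, 0.7, 1.0]
PROMPT_TYPES = ["math_word", "reasoning", "code", "json", "instruction", "creative"]

def has_loop(text: str, n: int = 8, max_repeats: int = 4) -> bool:
    """Return True if any n-word sequence occurs more than max_repeats times."""
    words = text.split()
    ngrams = Counter(tuple(words[i:i + n]) for i in range(len(words) - n + 1))
    return any(count > max_repeats for count in ngrams.values())

# 4 temperatures x 6 prompt types = 24 test cases
grid = list(product(TEMPERATURES, PROMPT_TYPES))
print(len(grid))  # 24

print(has_loop("the answer is 42 . " * 50))        # True
print(has_loop("A short, non-repeating answer."))  # False
```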

Extended High-Temperature Testing

Additional tests were run at temperatures 0.5, 0.8, 0.9, and 1.2; results are in stress_test_results.json.

Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model_id = "0xSero/MiniMax-M2.1-162B"

# Load the weights in BF16 and spread them across the available devices
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)

# Build the prompt with the model's chat template
messages = [{"role": "user", "content": "Explain quantum computing in simple terms."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512, temperature=0.7, do_sample=True)

# Decode only the newly generated tokens, not the echoed prompt
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
response = tokenizer.decode(new_tokens, skip_special_tokens=True)
print(response)
```

DynamicCache Compatibility Fix (transformers 4.55+)

If you encounter `TypeError: CacheLayerMixin.__init__() got an unexpected keyword argument`, apply this patch before loading the model:

```python
from transformers import cache_utils

# Keep a reference to the original constructor so other models are unaffected
_orig = cache_utils.DynamicCache.__init__

def _patched(self, *args, **kwargs):
    cfg = kwargs.get("config")
    # For MiniMax configs, discard the extra arguments and call the
    # original constructor with a single None argument
    if cfg and "minimax" in str(getattr(cfg, "model_type", "")):
        kwargs.pop("config", None)
        kwargs.pop("max_cache_len", None)
        kwargs.pop("max_batch_size", None)
        return _orig(self, None)
    return _orig(self, *args, **kwargs)

cache_utils.DynamicCache.__init__ = _patched
```
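
Because the error is tied to transformers 4.55 and later, the patch can be applied conditionally. The helper below is a hypothetical sketch that compares only the major and minor version numbers; `needs_cache_patch` is not part of transformers.

```python
def needs_cache_patch(version: str, minimum: tuple = (4, 55)) -> bool:
    """Return True if a transformers version string is at or above `minimum`."""
    major, minor = version.split(".")[:2]
    return (int(major), int(minor)) >= minimum

print(needs_cache_patch("4.54.1"))      # False
print(needs_cache_patch("4.55.0"))      # True
print(needs_cache_patch("5.0.0.dev0"))  # True

# Usage (assumes transformers is installed):
# import transformers
# if needs_cache_patch(transformers.__version__):
#     ...apply the DynamicCache patch shown above...
```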

Model Comparison

Model	Experts	Loops	Params	Status
MiniMax-M2.1-REAP-20	204	1	185B	Deprecated
MiniMax-M2.1-REAP-30	180	0	162B	Recommended
MiniMax-M2.1-REAP-40	154	0	139B	Recommended
MiniMax-M2.1-REAP-50	128	2	116B	Deprecated
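
The numeric suffix in each name tracks the approximate share of experts removed. The short sketch below recomputes that share from the expert counts in the table, taking 256 experts per layer in the base model (from the "180/256" entry earlier in this card); the helper name `pruned_percent` is hypothetical.

```python
ORIGINAL_EXPERTS = 256  # experts per layer in the base model, per "180/256" above

VARIANTS = {
    "MiniMax-M2.1-REAP-20": 204,
    "MiniMax-M2.1-REAP-30": 180,
    "MiniMax-M2.1-REAP-40": 154,
    "MiniMax-M2.1-REAP-50": 128,
}

def pruned_percent(kept: int, total: int = ORIGINAL_EXPERTS) -> float:
    """Percentage of experts removed when `kept` of `total` remain."""
    return 100 * (1 - kept / total)

for name, kept in VARIANTS.items():
    print(f"{name}: {kept}/{ORIGINAL_EXPERTS} kept, {pruned_percent(kept):.1f}% pruned")
```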

Quantized Versions

MiniMax-M2.1-REAP-40-W4A16 (Coming Soon) - 4-bit weights, ~58GB

REAP Methodology

REAP (Router-weighted Expert Activation Pruning) runs calibration data through the model and scores each expert using both the router's gate values and the expert's activations, then prunes the lowest-scoring experts. Unlike random or magnitude-based pruning, REAP preserves the experts that are actually used during inference.
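
As a rough illustration of router-weighted scoring, the sketch below averages gate weight times expert-output norm over the calibration tokens routed to each expert, then keeps the top scorers. It is a simplified, pure-Python reading of the idea, not the code used to produce this model; the function names and toy data are assumptions, and the cited paper defines the exact criterion.

```python
import math

def reap_style_saliency(gates, expert_outputs):
    """Mean of gate_weight * ||expert_output||_2 over tokens routed to each expert.

    gates[t][e]: router weight of expert e for token t (0.0 if not selected).
    expert_outputs[t][e]: output vector of expert e for token t.
    """
    num_experts = len(gates[0])
    totals = [0.0] * num_experts
    counts = [0] * num_experts
    for token_gates, token_outputs in zip(gates, expert_outputs):
        for e, gate in enumerate(token_gates):
            if gate > 0.0:  # expert e was selected for this token
                norm = math.sqrt(sum(v * v for v in token_outputs[e]))
                totals[e] += gate * norm
                counts[e] += 1
    return [totals[e] / counts[e] if counts[e] else 0.0 for e in range(num_experts)]

def experts_to_keep(saliency, keep):
    """Indices of the `keep` highest-saliency experts, in ascending order."""
    ranked = sorted(range(len(saliency)), key=lambda e: saliency[e], reverse=True)
    return sorted(ranked[:keep])

# Toy calibration set: 2 tokens, 3 experts, 2-dimensional expert outputs
gates = [[0.6, 0.4, 0.0], [0.0, 0.3, 0.7]]
outputs = [
    [[3.0, 4.0], [0.0, 1.0], [0.0, 0.0]],
    [[0.0, 0.0], [1.0, 0.0], [6.0, 8.0]],
]
print(experts_to_keep(reap_style_saliency(gates, outputs), keep=2))  # [0, 2]
```

In the toy data, expert 1 is routed to on both tokens but contributes small outputs with low gate weights, so it scores lowest and is the one pruned.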

Calibration Dataset: 2098 samples

pile-10k: 498 samples (general text)
evol-codealpaca: 800 samples (code generation)
xlam-function-calling: 800 samples (function calling)

License & citation

License inherited from the base model.

```bibtex
@misc{lasby2025reap,
  title         = {REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author        = {Mike Lasby and Ivan Lazarevich and Nish Sinnadurai and Sean Lie and Yani Ioannou and Vithursan Thangarasa},
  year          = {2025},
  eprint        = {2510.13999},
  archivePrefix = {arXiv}
}
```
