0xSero/MiniMax-M2.1-139B API & Inference Endpoint

At a glance


Base model	MiniMaxAI/MiniMax-M2.1
Format	BF16
Total params	139B
Active / token	—
Experts / layer	154
Layers	62
Hidden size	3072
Context	196,608
On-disk size	140 GB

Which variant should I pick?

Variant	Format	Link
`MiniMax-M2.1-139B` (this)	BF16	link
`MiniMax-M2.1-162B`	BF16	link

40% expert-pruned MiniMax-M2.1 using REAP (Router-weighted Expert Activation Pruning)

Property	Value
Base Model	MiniMaxAI/MiniMax-M2.1
Parameters	~139B
Experts	154/256 (60% retained)
Architecture	MoE (Mixture of Experts)
Precision	BF16
VRAM Required	~278GB
Stability	0 loops in stress tests
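
The VRAM figure above is consistent with a simple weights-only estimate: parameter count times bytes per parameter. The sketch below reproduces that arithmetic; the helper name `weight_memory_gb` is hypothetical, and the estimate ignores the KV cache, activations, and framework overhead, so real usage will be higher.

```python
def weight_memory_gb(num_params: float, bytes_per_param: float) -> float:
    """Approximate memory for the weights alone, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


# BF16 stores each parameter in 2 bytes.
bf16_gb = weight_memory_gb(139e9, 2)
print(f"BF16 weights: ~{bf16_gb:.0f} GB")
```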

Stress Test Results

Tested at 4 temperatures (0.0, 0.2, 0.7, 1.0) across 6 prompt types (24 total tests):

Temperature	math_word	reasoning	code	json	instruction	creative
0.0	OK	OK	OK	OK	OK	OK
0.2	OK	OK	OK	OK	OK	OK
0.7	OK	OK	OK	OK	OK	OK
1.0	OK	OK	OK	OK	OK	OK

Result: 24/24 tests passed, 0 loops detected
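
The card does not say how loops were detected. As an illustration only, the sketch below enumerates the same 4 x 6 test grid and flags an output as looping when a short token sequence repeats back-to-back several times. The function `has_loop` and its `max_period` and `min_repeats` thresholds are assumptions, not the author's harness.

```python
from itertools import product

temperatures = [0.0, 0.2, 0.7, 1.0]
prompt_types = ["math_word", "reasoning", "code", "json", "instruction", "creative"]

# One test per (temperature, prompt type) pair: 4 x 6 = 24 combinations.
grid = list(product(temperatures, prompt_types))


def has_loop(tokens, max_period=4, min_repeats=4):
    """Return True if a unit of up to max_period tokens repeats
    back-to-back at least min_repeats times (hypothetical criterion)."""
    for start in range(len(tokens)):
        for size in range(1, max_period + 1):
            unit = tokens[start:start + size]
            if len(unit) < size:
                break  # not enough tokens left for a unit of this size
            repeats = 1
            pos = start + size
            # Count consecutive copies of the unit that follow it.
            while tokens[pos:pos + size] == unit:
                repeats += 1
                pos += size
            if repeats >= min_repeats:
                return True
    return False
```

In a real harness, each generated completion would be tokenized (or split on whitespace) and passed to `has_loop`, with a test counted as passing when it returns `False`.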

Usage

python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "0xSero/MiniMax-M2.1-139B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "0xSero/MiniMax-M2.1-139B",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Write a Python function to calculate fibonacci numbers."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

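outputs = model.generate(**inputs, max_new_tokens=512, temperature=0.7, do_sample=True)
# Decode only the newly generated tokens so the prompt is not echoed back.
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
response = tokenizer.decode(new_tokens, skip_special_tokens=True)
print(response)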

DynamicCache Compatibility Fix (transformers 4.55+)

If you encounter TypeError: CacheLayerMixin.__init__() got an unexpected keyword argument, apply this patch before loading the model:

python
from transformers import cache_utils

# Keep a reference to the original constructor so other models are unaffected.
_orig = cache_utils.DynamicCache.__init__

def _patched(self, *args, **kwargs):
    cfg = kwargs.get("config")
    # Only intercept caches built for MiniMax configs.
    if cfg and "minimax" in str(getattr(cfg, "model_type", "")):
        kwargs.pop("config", None)
        kwargs.pop("max_cache_len", None)
        kwargs.pop("max_batch_size", None)
        # Call the original constructor without the config-related keyword arguments.
        return _orig(self, None)
    return _orig(self, *args, **kwargs)

cache_utils.DynamicCache.__init__ = _patched

Model Comparison

Model	Experts	Loops	Size	Status
MiniMax-M2.1-REAP-20	204	1	185B	Deprecated
MiniMax-M2.1-REAP-30	180	0	162B	Recommended
MiniMax-M2.1-REAP-40	154	0	139B	Recommended
MiniMax-M2.1-REAP-50	128	2	116B	Deprecated
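
The REAP-20/30/40/50 suffixes correspond to the share of the 256 experts per layer that was pruned. A small sketch of that arithmetic, using the expert counts from the table (the helper `retained_fraction` is a hypothetical name):

```python
TOTAL_EXPERTS = 256  # experts per layer before pruning, per this card

# Experts kept per layer for each variant, taken from the table above.
variants = {
    "MiniMax-M2.1-REAP-20": 204,
    "MiniMax-M2.1-REAP-30": 180,
    "MiniMax-M2.1-REAP-40": 154,
    "MiniMax-M2.1-REAP-50": 128,
}


def retained_fraction(kept: int, total: int = TOTAL_EXPERTS) -> float:
    """Fraction of experts that survive pruning."""
    return kept / total


for name, kept in variants.items():
    retained = retained_fraction(kept)
    print(f"{name}: {kept}/{TOTAL_EXPERTS} kept, {1 - retained:.1%} pruned")
```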

Quantized Versions

MiniMax-M2.1-REAP-40-W4A16 (Coming Soon) - 4-bit weights, ~58GB VRAM

Why 40% Pruning?

The 40% pruning ratio offers the best balance of:

Size reduction: ~139B vs ~230B original (about 40% smaller)
VRAM savings: ~278GB vs ~460GB for BF16 weights (fits on 4x H100 80GB)
Stability: 0 loops in comprehensive stress testing
Performance: Minimal quality degradation from strategic expert selection

REAP Methodology

REAP (Router-weighted Expert Activation Pruning) runs calibration data through the model and scores each expert by its router activation patterns, then removes the lowest-scoring experts. Unlike random or magnitude-based pruning, REAP preserves the experts that the model actually relies on for the calibration workload.

Calibration Dataset: 2098 samples

pile-10k: 498 samples (general text)
evol-codealpaca: 800 samples (code generation)
xlam-function-calling: 800 samples (function calling)
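
The selection step described above can be sketched as follows. This is a simplified illustration, not the pipeline used to produce this checkpoint: it assumes a saliency score of the kind described in the REAP paper (router gate weight times the norm of the expert's output, averaged over the calibration tokens routed to that expert). The function names and the toy inputs are hypothetical.

```python
def reap_saliency(gate_weights, expert_output_norms):
    """Per-expert saliency: mean of gate_weight * output_norm over routed tokens.

    gate_weights[t][e] is the router weight for token t and expert e
    (0.0 when the expert was not selected for that token).
    expert_output_norms[t][e] is the L2 norm of expert e's output for token t.
    """
    num_experts = len(gate_weights[0])
    totals = [0.0] * num_experts
    counts = [0] * num_experts
    for gate_row, norm_row in zip(gate_weights, expert_output_norms):
        for e, (gate, norm) in enumerate(zip(gate_row, norm_row)):
            if gate > 0.0:  # only tokens actually routed to this expert count
                totals[e] += gate * norm
                counts[e] += 1
    return [t / c if c else 0.0 for t, c in zip(totals, counts)]


def experts_to_keep(saliency, keep):
    """Indices of the `keep` highest-saliency experts, in ascending order."""
    ranked = sorted(range(len(saliency)), key=lambda e: saliency[e], reverse=True)
    return sorted(ranked[:keep])


# Toy calibration statistics: 3 tokens, 4 experts, top-2 routing.
gates = [[0.7, 0.3, 0.0, 0.0], [0.6, 0.0, 0.4, 0.0], [0.0, 0.2, 0.0, 0.8]]
norms = [[1.0, 0.5, 0.0, 0.0], [1.2, 0.0, 0.3, 0.0], [0.0, 0.4, 0.0, 1.0]]
kept = experts_to_keep(reap_saliency(gates, norms), keep=2)
print(kept)
```

For this model the same ranking would be applied per layer, keeping 154 of 256 experts.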

License & citation

License inherited from the base model.

bibtex
@misc{lasby2025reap,
  title  = {REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author = {Mike Lasby and Ivan Lazarevich and Nish Sinnadurai and Sean Lie and Yani Ioannou and Vithursan Thangarasa},
  year   = {2025}, eprint = {2510.13999}, archivePrefix = {arXiv}
}
