README

License: apache-2.0

At a glance

| Property | Value |
|---|---|
| Base model | MiniMaxAI/MiniMax-M2.1 |
| Format | BF16 |
| Total params | 139B |
| Active / token | |
| Experts / layer | 154 |
| Layers | 62 |
| Hidden size | 3072 |
| Context | 196,608 |
| On-disk size | 140 GB |

Which variant should I pick?

| Variant | Format | Link |
|---|---|---|
| MiniMax-M2.1-139B (this) | BF16 | link |
| MiniMax-M2.1-162B | BF16 | link |

MiniMax-M2.1 with 40% of its experts pruned using REAP (Router-weighted Expert Activation Pruning).

| Property | Value |
|---|---|
| Base Model | MiniMaxAI/MiniMax-M2.1 |
| Parameters | ~139B |
| Experts | 154/256 (60% retained) |
| Architecture | MoE (Mixture of Experts) |
| Precision | BF16 |
| VRAM Required | ~278GB |
| Stability | 0 loops in stress tests |
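The ~278GB figure is consistent with storing 139B parameters at 2 bytes each (BF16). A minimal sketch of that arithmetic follows; the helper name `weight_memory_gb` is illustrative, and the estimate covers weights only (no KV cache or activations), so real usage will be higher.

```python
def weight_memory_gb(params_billions: float, bits_per_param: int) -> float:
    """Approximate memory for the weights alone, in decimal GB (1 GB = 1e9 bytes)."""
    total_bytes = params_billions * 1e9 * bits_per_param / 8
    return total_bytes / 1e9


# 139B parameters stored as BF16 (16 bits per parameter)
bf16_gb = weight_memory_gb(139, 16)
print(f"BF16 weights: ~{bf16_gb:.0f} GB")
```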

Stress Test Results

Tested at 4 temperatures (0.0, 0.2, 0.7, 1.0) across 6 prompt types (24 total tests):

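| Temperature | math_word | reasoning | code | json | instruction | creative |
|---|---|---|---|---|---|---|
| 0.0 | OK | OK | OK | OK | OK | OK |
| 0.2 | OK | OK | OK | OK | OK | OK |
| 0.7 | OK | OK | OK | OK | OK | OK |
| 1.0 | OK | OK | OK | OK | OK | OK |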

Result: 24/24 tests passed, 0 loops detected
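The card does not say how loops were detected. As one possible approach, the sketch below builds the 4 x 6 test grid and flags an output whose tail is a single block of tokens repeated several times. The function name `has_repetition_loop` and its thresholds (`max_period`, `min_repeats`) are assumptions for illustration, not the author's test harness.

```python
from itertools import product

TEMPERATURES = [0.0, 0.2, 0.7, 1.0]
PROMPT_TYPES = ["math_word", "reasoning", "code", "json", "instruction", "creative"]


def has_repetition_loop(tokens, max_period=32, min_repeats=4):
    """Return True if the output ends with one block repeated at least `min_repeats` times."""
    for period in range(1, max_period + 1):
        span = period * min_repeats
        if len(tokens) < span:
            break  # output too short to contain this many repeats of a longer block
        tail = tokens[-span:]
        block = tail[-period:]
        if all(tail[i:i + period] == block for i in range(0, span, period)):
            return True
    return False


# Every (temperature, prompt type) pair is one test case.
test_grid = list(product(TEMPERATURES, PROMPT_TYPES))
print(len(test_grid), "test cases")

# Toy outputs, split on whitespace as a stand-in for real token IDs.
looping = "the answer is 4 . ".split() * 5
healthy = "the answer is 4 because 2 + 2 = 4 .".split()
print(has_repetition_loop(looping), has_repetition_loop(healthy))
```

In a real harness the token lists would come from `model.generate` for each grid entry, and a test would count as passed when no loop is flagged.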

Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the BF16 weights and shard them across the available GPUs.
model = AutoModelForCausalLM.from_pretrained(
    "0xSero/MiniMax-M2.1-139B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "0xSero/MiniMax-M2.1-139B",
    trust_remote_code=True,
)

messages = [{"role": "user", "content": "Write a Python function to calculate fibonacci numbers."}]
# Render the chat template to a prompt string, then tokenize it.
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512, temperature=0.7, do_sample=True)
# outputs[0] holds the prompt tokens followed by the generated tokens.
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```

DynamicCache Compatibility Fix (transformers 4.55+)

If you encounter `TypeError: CacheLayerMixin.__init__() got an unexpected keyword argument`, apply this patch before loading the model:

```python
from transformers import cache_utils

_orig = cache_utils.DynamicCache.__init__


def _patched(self, *args, **kwargs):
    cfg = kwargs.get("config")
    # For MiniMax configs, build the cache without the config-derived arguments.
    if cfg and hasattr(cfg, "model_type") and "minimax" in str(getattr(cfg, "model_type", "")):
        kwargs.pop("config", None)
        kwargs.pop("max_cache_len", None)
        kwargs.pop("max_batch_size", None)
        return _orig(self, None)
    # All other models go through the original constructor unchanged.
    return _orig(self, *args, **kwargs)


cache_utils.DynamicCache.__init__ = _patched
```

Model Comparison

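| Model | Experts | Loops | Size | Status |
|---|---|---|---|---|
| MiniMax-M2.1-REAP-20 | 204 | 1 | 185B | Deprecated |
| MiniMax-M2.1-REAP-30 | 180 | 0 | 162B | Recommended |
| MiniMax-M2.1-REAP-40 | 154 | 0 | 139B | Recommended |
| MiniMax-M2.1-REAP-50 | 128 | 2 | 116B | Deprecated |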
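The sizes in this table scale almost linearly with the number of retained experts per layer. As a rough sanity check, the sketch below fits a line through the two endpoint rows and extrapolates to the unpruned 256 experts. The names `per_expert_b`, `shared_b`, and `estimate_size_b` are illustrative, and because the listed sizes are rounded, the result is an estimate rather than an official parameter count.

```python
# (experts per layer, total size in billions of parameters), copied from the table
variants = {
    "REAP-20": (204, 185),
    "REAP-30": (180, 162),
    "REAP-40": (154, 139),
    "REAP-50": (128, 116),
}

# Fit size = shared_b + per_expert_b * experts through the two endpoint rows.
(e_hi, s_hi), (e_lo, s_lo) = variants["REAP-20"], variants["REAP-50"]
per_expert_b = (s_hi - s_lo) / (e_hi - e_lo)
shared_b = s_lo - per_expert_b * e_lo


def estimate_size_b(experts_per_layer):
    """Estimated total parameters (billions) for a given number of experts per layer."""
    return shared_b + per_expert_b * experts_per_layer


for name, (experts, size_b) in variants.items():
    print(f"{name}: listed {size_b}B, fitted {estimate_size_b(experts):.0f}B")
print(f"Unpruned (256 experts): ~{estimate_size_b(256):.0f}B")
```

The fitted intercept is close to zero within rounding error, which suggests that nearly all parameters sit in the expert weights. That is why removing experts shrinks the model almost proportionally.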

Quantized Versions

  • MiniMax-M2.1-REAP-40-W4A16 (Coming Soon) - 4-bit weights, ~70GB VRAM

Why 40% Pruning?

The 40% pruning ratio offers the best balance of:

  • Size reduction: 139B vs ~230B original (about 40% smaller)
  • VRAM savings: ~278GB vs ~460GB in BF16 (weights fit on 4x H100 80GB)
  • Stability: 0 loops in comprehensive stress testing
  • Performance: Minimal quality degradation from strategic expert selection

REAP Methodology

REAP (Router-weighted Expert Activation Pruning) uses calibration data to identify which experts are most important based on router activation patterns. Unlike random or magnitude-based pruning, REAP preserves the experts that are actually used during inference.
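To make the idea concrete, here is a minimal sketch of a router-weighted saliency score. It assumes the criterion described in the cited REAP paper: for each expert, average the router gate weight times the L2 norm of the expert's output over the calibration tokens routed to it, then keep the highest-scoring experts. The function names, data layout, and toy numbers are illustrative only, and this is not the pipeline used to produce this checkpoint.

```python
import math


def reap_saliency(gate_weights, expert_outputs):
    """Mean of gate_weight * ||expert_output||_2 over the tokens routed to each expert.

    gate_weights[t][e]   : router weight of expert e on token t
    expert_outputs[t][e] : output vector of expert e on token t, or None if not routed
    """
    num_experts = len(gate_weights[0])
    totals = [0.0] * num_experts
    counts = [0] * num_experts
    for gates, outputs in zip(gate_weights, expert_outputs):
        for e, (gate, out) in enumerate(zip(gates, outputs)):
            if out is None:
                continue  # token was not routed to this expert
            totals[e] += gate * math.sqrt(sum(v * v for v in out))
            counts[e] += 1
    return [t / c if c else 0.0 for t, c in zip(totals, counts)]


def experts_to_keep(saliency, keep_ratio):
    """Indices of the top `keep_ratio` fraction of experts by saliency."""
    n_keep = round(len(saliency) * keep_ratio)
    ranked = sorted(range(len(saliency)), key=lambda e: saliency[e], reverse=True)
    return sorted(ranked[:n_keep])


# Toy layer: 3 calibration tokens, 4 experts, top-2 routing.
gate_weights = [
    [0.7, 0.3, 0.0, 0.0],
    [0.6, 0.0, 0.4, 0.0],
    [0.0, 0.5, 0.0, 0.5],
]
expert_outputs = [
    [[3.0, 4.0], [1.0, 0.0], None, None],
    [[0.0, 2.0], None, [0.3, 0.4], None],
    [None, [0.0, 1.0], None, [6.0, 8.0]],
]
kept = experts_to_keep(reap_saliency(gate_weights, expert_outputs), keep_ratio=0.5)
print(kept)
```

With a keep ratio of 0.6 and 256 experts per layer, `round(256 * 0.6)` gives 154, which matches the expert count listed for this variant.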

Calibration Dataset: 2098 samples

  • pile-10k: 498 samples (general text)
  • evol-codealpaca: 800 samples (code generation)
  • xlam-function-calling: 800 samples (function calling)

License & citation

License inherited from the base model.

```bibtex
@misc{lasby2025reap,
  title         = {REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author        = {Mike Lasby and Ivan Lazarevich and Nish Sinnadurai and Sean Lie and Yani Ioannou and Vithursan Thangarasa},
  year          = {2025},
  eprint        = {2510.13999},
  archivePrefix = {arXiv}
}
```

Sponsors

Made possible by NVIDIA · TNG Technology · Lambda · Prime Intellect · Hot Aisle.

Model provider

0xSero


Modalities

Input

Text

Output

Text
