License: apache-2.0

At a glance

| Property | Value |
|---|---|
| Base model | MiniMaxAI/MiniMax-M2.1 |
| Format | BF16 |
| Total params | 162B |
| Active / token | |
| Experts / layer | 180 |
| Layers | 62 |
| Hidden size | 3072 |
| Context | 196,608 |
| On-disk size | 163 GB |

Which variant should I pick?

| Variant | Format | Link |
|---|---|---|
| MiniMax-M2.1-139B | BF16 | link |
| MiniMax-M2.1-162B (this) | BF16 | link |

This is MiniMax-M2.1 with 30% of its experts pruned using REAP (Router-weighted Expert Activation Pruning).

| Property | Value |
|---|---|
| Base Model | MiniMaxAI/MiniMax-M2.1 |
| Parameters | ~162B |
| Experts | 180/256 (70% retained) |
| Architecture | MoE (Mixture of Experts) |
| Precision | BF16 |
| VRAM Required | ~324GB |
| Stability | 0 loops in stress tests |
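The ~324GB figure is consistent with storing all ~162B parameters in BF16 at 2 bytes each. A rough back-of-the-envelope sketch follows. The helper names are made up for illustration, and the estimate covers weights only; KV cache and activations need additional memory.

```python
BYTES_PER_BF16 = 2  # bfloat16 stores each parameter in 16 bits


def bf16_weight_gb(params_billions):
    """Approximate weight memory in GB (10^9 bytes) for a BF16 checkpoint.

    Hypothetical helper: billions of parameters times 2 bytes each.
    """
    return params_billions * BYTES_PER_BF16


def retained_percent(kept, total):
    """Share of experts kept after pruning, as a percentage."""
    return round(100 * kept / total, 1)


print(bf16_weight_gb(162))         # weights-only estimate in GB
print(retained_percent(180, 256))  # share of the 256 experts that remain
```

Keeping 180 of 256 experts works out to slightly more than 70%, which the table rounds to 70%.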

Stress Test Results

Tested at 4 temperatures (0.0, 0.2, 0.7, 1.0) across 6 prompt types (24 total tests):

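| Temperature | math_word | reasoning | code | json | instruction | creative |
|---|---|---|---|---|---|---|
| 0.0 | OK | OK | OK | OK | OK | OK |
| 0.2 | OK | OK | OK | OK | OK | OK |
| 0.7 | OK | OK | OK | OK | OK | OK |
| 1.0 | OK | OK | OK | OK | OK | OK |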

Result: 24/24 tests passed, 0 loops detected
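The 24 tests are the cross product of the 4 temperatures and 6 prompt types. The sketch below enumerates that grid and shows one possible way to flag a looping generation. The README does not describe how loops were detected, so `has_loop` and its n-gram thresholds are assumptions for illustration only, not the author's test harness.

```python
from itertools import product

TEMPERATURES = [0.0, 0.2, 0.7, 1.0]
PROMPT_TYPES = ["math_word", "reasoning", "code", "json", "instruction", "creative"]


def stress_grid():
    """Every (temperature, prompt_type) pair in the stress test."""
    return list(product(TEMPERATURES, PROMPT_TYPES))


def has_loop(text, ngram=8, max_repeats=3):
    """Hypothetical loop check: True if any word n-gram occurs more than
    max_repeats times in the generated text."""
    words = text.split()
    counts = {}
    for i in range(len(words) - ngram + 1):
        key = tuple(words[i:i + ngram])
        counts[key] = counts.get(key, 0) + 1
        if counts[key] > max_repeats:
            return True
    return False


print(len(stress_grid()))  # number of test cases in the grid
```

A real harness would call the model once per grid entry and run a check like `has_loop` on each completion.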

Extended High-Temperature Testing

Additional tests were run at temperatures 0.5, 0.8, 0.9, and 1.2; results are in stress_test_results.json.

Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

# Load the pruned checkpoint in BF16, sharded across the available devices.
model = AutoModelForCausalLM.from_pretrained(
    "0xSero/MiniMax-M2.1-162B",
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
tokenizer = AutoTokenizer.from_pretrained(
    "0xSero/MiniMax-M2.1-162B",
    trust_remote_code=True,
)

# Build the prompt with the model's chat template.
messages = [{"role": "user", "content": "Explain quantum computing in simple terms."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Sample up to 512 new tokens, then decode the full sequence (prompt plus completion).
outputs = model.generate(**inputs, max_new_tokens=512, temperature=0.7, do_sample=True)
response = tokenizer.decode(outputs[0], skip_special_tokens=True)
print(response)
```

DynamicCache Compatibility Fix (transformers 4.55+)

If you encounter `TypeError: CacheLayerMixin.__init__() got an unexpected keyword argument`, apply this patch before loading the model:

```python
from transformers import cache_utils

# Keep a reference to the original constructor so other models are unaffected.
_orig = cache_utils.DynamicCache.__init__

def _patched(self, *args, **kwargs):
    cfg = kwargs.get("config")
    # For MiniMax configs, drop the keyword arguments that trigger the TypeError.
    if cfg and hasattr(cfg, "model_type") and "minimax" in str(getattr(cfg, "model_type", "")):
        kwargs.pop("config", None)
        kwargs.pop("max_cache_len", None)
        kwargs.pop("max_batch_size", None)
        return _orig(self, None)
    return _orig(self, *args, **kwargs)

cache_utils.DynamicCache.__init__ = _patched
```

Model Comparison

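| Model | Experts | Loops | Size | Status |
|---|---|---|---|---|
| MiniMax-M2.1-REAP-20 | 204 | 1 | 185B | Deprecated |
| MiniMax-M2.1-REAP-30 | 180 | 0 | 162B | Recommended |
| MiniMax-M2.1-REAP-40 | 154 | 0 | 139B | Recommended |
| MiniMax-M2.1-REAP-50 | 128 | 2 | 116B | Deprecated |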

Quantized Versions

  • MiniMax-M2.1-REAP-40-W4A16 (Coming Soon) - 4-bit weights, ~58GB

REAP Methodology

REAP (Router-weighted Expert Activation Pruning) uses calibration data to identify which experts are most important based on router activation patterns. Unlike random or magnitude-based pruning, REAP preserves the experts that are actually used during inference.
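The idea can be sketched in a few lines. This is an illustrative simplification, not the REAP implementation: it assumes an expert's importance is the average, over calibration tokens routed to it, of the router gate weight times the norm of the expert's output, and it keeps the top-scoring fraction. The function names and toy records are hypothetical; the cited paper gives the exact criterion.

```python
from collections import defaultdict


def expert_scores(records):
    """Average gate-weighted output norm per expert.

    records: iterable of (expert_id, gate_weight, output_norm) tuples, one per
    (token, routed expert) pair observed on calibration data (assumed format).
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for expert_id, gate, norm in records:
        totals[expert_id] += gate * norm
        counts[expert_id] += 1
    return {e: totals[e] / counts[e] for e in totals}


def experts_to_keep(scores, num_experts, keep_fraction):
    """IDs of the highest-scoring experts; experts never routed to score 0."""
    keep = round(num_experts * keep_fraction)
    ranked = sorted(range(num_experts), key=lambda e: scores.get(e, 0.0), reverse=True)
    return sorted(ranked[:keep])


# Toy calibration trace for a layer with 4 experts.
records = [(0, 0.6, 2.0), (1, 0.4, 1.0), (0, 0.7, 2.0), (2, 0.1, 0.5), (3, 0.5, 3.0)]
scores = expert_scores(records)
print(experts_to_keep(scores, num_experts=4, keep_fraction=0.5))
```

In this toy trace, experts 0 and 3 carry the largest weighted contributions, so they are the ones retained.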

Calibration Dataset: 2098 samples

  • pile-10k: 498 samples (general text)
  • evol-codealpaca: 800 samples (code generation)
  • xlam-function-calling: 800 samples (function calling)
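As a quick check on the mix above, the three sources sum to the stated 2098 samples. The snippet below recomputes the total and each source's share; the dictionary name is a placeholder for illustration, not part of any released tooling.

```python
# Sample counts per calibration source, as listed above.
CALIBRATION_MIX = {
    "pile-10k": 498,
    "evol-codealpaca": 800,
    "xlam-function-calling": 800,
}

total = sum(CALIBRATION_MIX.values())
# Percentage of the calibration set contributed by each source.
shares = {name: round(100 * n / total, 1) for name, n in CALIBRATION_MIX.items()}

print(total)
print(shares)
```

Code generation and function calling together make up roughly three quarters of the calibration set.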

License & citation

License inherited from the base model.

```bibtex
@misc{lasby2025reap,
  title         = {REAP the Experts: Why Pruning Prevails for One-Shot MoE Compression},
  author        = {Mike Lasby and Ivan Lazarevich and Nish Sinnadurai and Sean Lie and Yani Ioannou and Vithursan Thangarasa},
  year          = {2025},
  eprint        = {2510.13999},
  archivePrefix = {arXiv}
}
```

Sponsors

Made possible by NVIDIA · TNG Technology · Lambda · Prime Intellect · Hot Aisle.

Model provider: 0xSero

Model tree: base MiniMaxAI/MiniMax-M2.1; quantized: this model

Modalities: text input, text output