Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more
Container

Run this model inference with full control and performance in your environment.

Learn more

Get help setting up a custom Dedicated Endpoints.

Talk with our engineer to get a quote for reserved GPU instances with discounts.

README

License: modified-mit

How to Use

Benchmarks

MiniMax-M2.1 delivers a significant leap over M2 on core software engineering leaderboards. It shines particularly bright in multilingual scenarios, where it outperforms Claude Sonnet 4.5 and closely approaches Claude Opus 4.5.

BenchmarkMiniMax-M2.1MiniMax-M2Claude Sonnet 4.5Claude Opus 4.5Gemini 3 ProGPT-5.2 (thinking)DeepSeek V3.2
SWE-bench Verified74.069.477.280.978.080.073.1
Multi-SWE-bench49.436.244.350.042.7x37.4
SWE-bench Multilingual72.556.56877.565.072.070.2
Terminal-bench 2.047.930.050.057.854.254.046.4

We also evaluated MiniMax-M2.1 on SWE-bench Verified across a variety of coding agent frameworks. The results highlight the model's exceptional framework generalization and robust stability.

Furthermore, across specific benchmarks—including test case generation, code performance optimization, code review, and instruction following—MiniMax-M2.1 demonstrates comprehensive improvements over M2. In these specialized domains, it consistently matches or exceeds the performance of Claude Sonnet 4.5.

BenchmarkMiniMax-M2.1MiniMax-M2Claude Sonnet 4.5Claude Opus 4.5Gemini 3 ProGPT-5.2 (thinking)DeepSeek V3.2
SWE-bench Verified (Droid)71.368.172.375.2xx67.0
SWE-bench Verified (mini-swe-agent)67.061.070.674.471.874.260.0
SWT-bench69.332.869.580.279.780.762.0
SWE-Perf3.11.43.04.76.53.60.9
SWE-Review8.93.410.516.2xx6.4
OctoCodingbench26.113.322.836.222.9x26.0

To evaluate the model's full-stack capability to architect complete, functional applications "from zero to one," we established a novel benchmark: VIBE (Visual & Interactive Benchmark for Execution in Application Development). This suite encompasses five core subsets: Web, Simulation, Android, iOS, and Backend. Distinguishing itself from traditional benchmarks, VIBE leverages an innovative Agent-as-a-Verifier (AaaV) paradigm to automatically assess the interactive logic and visual aesthetics of generated applications within a real runtime environment.

MiniMax-M2.1 delivers outstanding performance on the VIBE aggregate benchmark, achieving an average score of 88.6—demonstrating robust full-stack development capabilities. It excels particularly in the VIBE-Web (91.5) and VIBE-Android (89.7) subsets.

BenchmarkMiniMax-M2.1MiniMax-M2Claude Sonnet 4.5Claude Opus 4.5Gemini 3 Pro
VIBE (Average)88.667.585.290.782.4
VIBE-Web91.580.487.389.189.5
VIBE-Simulation87.177.079.184.089.2
VIBE-Android89.769.287.592.278.7
VIBE-iOS88.039.581.290.075.8
VIBE-Backend86.767.890.898.078.7

MiniMax-M2.1 also demonstrates steady improvements over M2 in both long-horizon tool use and comprehensive intelligence metrics.

BenchmarkMiniMax-M2.1MiniMax-M2Claude Sonnet 4.5Claude Opus 4.5Gemini 3 ProGPT-5.2 (thinking)DeepSeek V3.2
Toolathlon43.516.738.943.536.441.735.2
BrowseComp47.444.019.637.037.865.851.4
BrowseComp (context management)62.056.926.157.859.270.067.6
AIME2583.078.088.091.096.098.092.0
MMLU-Pro88.082.088.090.090.087.086.0
GPQA-D83.078.083.087.091.090.084.0
HLE w/o tools22.212.517.328.437.231.422.2
LCB81.083.071.087.092.089.086.0
SciCode41.036.045.050.056.052.039.0
IFBench70.072.057.058.070.075.061.0
AA-LCR62.061.066.074.071.073.065.0
𝜏²-Bench Telecom87.087.078.090.087.085.091.0

Evaluation Methodology Notes:

  • SWE-bench Verified: Tested on internal infrastructure using Claude Code, Droid, or mini-swe-agent as scaffolding. By default, we utilized Claude Code metrics. When using Claude Code, the default system prompt was overridden. Results represent the average of 4 runs.
  • Multi-SWE-Bench & SWE-bench Multilingual & SWT-bench & SWE-Perf: Tested on internal infrastructure using Claude Code as scaffolding, with the default system prompt overridden. Results represent the average of 4 runs.
  • Terminal-bench 2.0: Tested using Claude Code on our internal evaluation framework. We verified the full dataset and fixed environmental issues. Timeout limits were removed, while all other configurations remained consistent with official settings. Results represent the average of 4 runs.
  • SWE Review: Built upon the SWE framework, this internal benchmark for code defect review covers diverse languages and scenarios, evaluating both defect recall and hallucination rates. A review is deemed "correct" only if the model accurately identifies the target defect and ensures all other reported findings are valid and free of hallucinations. All evaluations are executed using Claude Code, with final results reflecting the average of four independent runs per test case. We plan to open-source this benchmark soon.
  • OctoCodingbench: An internal benchmark focused on long-horizon instruction following for Code Agents in complex development scenarios. It conducts end-to-end behavioral supervision within a dynamic environment spanning diverse tech stacks and scaffolding frameworks. The core objective is to evaluate the model's ability to integrate and execute "composite instruction constraints"—encompassing System Prompts (SP), User Queries, Memory, Tool Schemas, and specifications such as Agents.md, Claude.md, and Skill.md. Adopting a strict "single-violation-failure" scoring mechanism, the final result is the average pass rate across 4 runs, quantifying the model's robustness in translating static constraints into precise behaviors. We plan to open-source this benchmark soon.
  • VIBE: An internal benchmark that utilizes Claude Code as scaffolding to automatically verify a program's interactive logic and visual effects. Scores are calculated through a unified pipeline comprising requirement sets, containerized deployment, and dynamic interaction environments. Final results represent the average of 3 runs. We have open-sourced this benchmark at VIBE.
  • Toolathlon: The evaluation protocol remains consistent with the original paper.
  • BrowseComp: All scores were obtained using the same agent framework as WebExplorer (Liu et al. 2025), with only minor fine-tuning of tool descriptions. We utilized the same 103-sample GAIA text-only validation subset as WebExplorer.
  • BrowseComp (context management): When token usage exceeds 30% of the maximum context window, we retain the first AI response, the last five AI responses, and the tool outputs, discarding the remaining content.
  • AIME25 ~ 𝜏²-Bench Telecom: Derived from internal testing based on the evaluation datasets and methodology referenced in the Artificial Analysis Intelligence Index.

Local Deployment Guide

Download the model from HuggingFace repository: https://huggingface.co/MiniMaxAI/MiniMax-M2.1

We recommend using the following inference frameworks (listed alphabetically) to serve the model:

SGLang

We recommend using SGLang to serve MiniMax-M2.1. Please refer to our SGLang Deployment Guide.

vLLM

We recommend using vLLM to serve MiniMax-M2.1. Please refer to our vLLM Deployment Guide.

Transformers

We recommend using Transformers to serve MiniMax-M2.1. Please refer to our Transformers Deployment Guide.

KTransformers

We recommend using KTransformers to serve MiniMax-M2.1. Please refer to KTransformers Deployment Guide

Other Inference Engines

Inference Parameters

We recommend using the following parameters for best performance: temperature=1.0, top_p = 0.95, top_k = 40. Default system prompt:

markdown

You are a helpful assistant. Your name is MiniMax-M2.1 and is built by MiniMax.

Tool Calling Guide

Please refer to our Tool Calling Guide.

Contact Us

Contact us at model@minimax.io.

Model provider

MiniMaxAI

MiniMaxAI

Model tree

Base

this model

Modalities

Input

Text

Output

Text

Pricing

Dedicated Endpoints

View details

Supported Functionality

Model APIs

Dedicated Endpoints

Container

More information

Explore FriendliAI today