Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more

Get help setting up a custom Dedicated Endpoints.

Talk with our engineer to get a quote for reserved GPU instances with discounts.

README

License: mit

Use this model (vLLM)

MiMo's default vLLM path hard-selects a FlashAttention-3 backend (SM90+ only). Bind-mount the two patch files over the image copies (details in PATCHES.md):

bash

docker run --rm --gpus all --ipc=host -p 8000:8000 \
-v /path/to/MiMo-V2.5-AWQ-int4:/model:ro \
-v /path/to/vllm-patches/mimo_v2.py:/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/mimo_v2.py:ro \
-v /path/to/vllm-patches/mimo_v2_omni.py:/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/mimo_v2_omni.py:ro \
vllm/vllm-openai:v0.21.0 \
--model /model \
--served-model-name mimo-v2.5 \
--tensor-parallel-size 8 \
--enable-expert-parallel \
--enable-prefix-caching \
--reasoning-parser qwen3 \
--enable-auto-tool-choice --tool-call-parser qwen3_xml \
--trust-remote-code \
--max-model-len 262144 \
--gpu-memory-utilization 0.90
  • TP-4 works too — set --tensor-parallel-size 4. (The patches are correct at both; see the QKV note in PATCHES.md.)
  • Sampling: temperature 1.0, top_p 0.95 (the model's shipped generation_config.json; thinking-mode on).
  • --max-model-len can be raised toward the native 1,048,576 as VRAM allows.

Files

filewhat
model-0000{1..4}-of-00004.safetensorsint4 weights (W4A16)
config.json, recipe.yamlquant config + the full quantization recipe
modeling_mimo_v2.py, configuration_mimo_v2.pymodel code (--trust-remote-code)
tokenizer*, chat_template.jinja, preprocessor_config.jsontokenizer + chat / vision preprocessing
vllm-patches/mimo_v2.py, vllm-patches/mimo_v2_omni.pythe two vLLM serving patches — mount over the image copies
vllm-patches/PATCHES.mdfull patch writeup + validation

Method

AWQ W4A16, routed-experts-only. Only the MoE routed-expert projections (mlp.experts.*_proj) are quantized to 4-bit. Everything quality-sensitive stays high-precision: attention, the router/gate, shared paths, embeddings & lm_head, MTP, and the vision + audio towers are all left untouched. In addition, layer-41's experts are kept at bf16 (a composite carve-out — that one layer quantized worst).

This is why MoE tolerates 4-bit far better than dense models: the sensitive machinery is untouched and only the redundant expert bulk is compressed. The exact llm-compressor recipe (group-wise AWQ, smoothing maps, ignore list) is in recipe.yaml.

A100 serving patches

Base MiMo-V2.5 is Hopper-only in vLLM for architectural, not precision, reasons: it uses SWA attention sinks + asymmetric head dims (qk=192, v=128), so vLLM selects a FlashAttention-3 backend that asserts SM90+. On A100, FA2 supports neither sinks nor asymmetric V.

The two files fix this on stock vLLM 0.21.0, exactly (no approximation):

  1. Triton attention on SM80 — branch the backend by device capability: Hopper keeps the native FA3 path; SM80 uses the Triton backend, which supports attention sinks on Ampere.
  2. V head-dim padding (128 → 192) — Triton needs K and V at the same head size; pad V with zeros before attention and slice it back off the output. Provably exact.
  3. Fused-QKV de-shard / re-shard (the TP-8 fix) — the checkpoint's fused qkv_proj is pre-sharded for TP-4; a naive chunk() silently corrupts K/V at TP-8. The patch de-shards to canonical Q/K/V then re-shards for the serving TP — exact at TP-4, correct at TP-8 for both full and sliding-window layers.
  4. (vision) Merger LayerNorm fix (mimo_v2_omni.py) — matches the checkpoint's own LayerNorm + biased-linear merger (vLLM's copy used RMSNorm + bias-less).

Full derivation, validation, and the one checkpoint-specific assumption (NB=4, the quant-time TP the fused QKV is pre-sharded for) are in PATCHES.md. Bind-mounting is the zero-rebuild path; baking the two files into a derived image is the clean end-state.

Quality

Routed-experts-only 4-bit is near-lossless here: measured symmetric-KL vs the bf16 reference ≈ 0.046 — in the "good MoE 4-bit" range (dense 4-bit is typically ~0.05+). The high-precision attention/router/shared paths plus the layer-41 bf16 carve-out are what keep it faithful.

License & credits

Model provider

shadowlilac

shadowlilac

Model tree

Base

XiaomiMiMo/MiMo-V2.5

Quantized

this model

Modalities

Input

Video, Audio, Text, Image

Output

Text

Pricing

Dedicated Endpoints

View details

Supported Functionality

Model APIs

Dedicated Endpoints

Container

More information

Explore FriendliAI today