Run this model inference on single tenant GPU with unmatched speed and reliability at scale.
Get help setting up a custom Dedicated Endpoints.
Talk with our engineer to get a quote for reserved GPU instances with discounts.
README
License: mitUse this model (vLLM)
MiMo's default vLLM path hard-selects a FlashAttention-3 backend (SM90+ only). Bind-mount the two patch files over the image copies (details in PATCHES.md):
bash
docker run --rm --gpus all --ipc=host -p 8000:8000 \-v /path/to/MiMo-V2.5-AWQ-int4:/model:ro \-v /path/to/vllm-patches/mimo_v2.py:/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/mimo_v2.py:ro \-v /path/to/vllm-patches/mimo_v2_omni.py:/usr/local/lib/python3.12/dist-packages/vllm/model_executor/models/mimo_v2_omni.py:ro \vllm/vllm-openai:v0.21.0 \--model /model \--served-model-name mimo-v2.5 \--tensor-parallel-size 8 \--enable-expert-parallel \--enable-prefix-caching \--reasoning-parser qwen3 \--enable-auto-tool-choice --tool-call-parser qwen3_xml \--trust-remote-code \--max-model-len 262144 \--gpu-memory-utilization 0.90
- TP-4 works too — set
--tensor-parallel-size 4. (The patches are correct at both; see the QKV note inPATCHES.md.) - Sampling:
temperature 1.0, top_p 0.95(the model's shippedgeneration_config.json; thinking-mode on). --max-model-lencan be raised toward the native 1,048,576 as VRAM allows.
Files
| file | what |
|---|---|
model-0000{1..4}-of-00004.safetensors | int4 weights (W4A16) |
config.json, recipe.yaml | quant config + the full quantization recipe |
modeling_mimo_v2.py, configuration_mimo_v2.py | model code (--trust-remote-code) |
tokenizer*, chat_template.jinja, preprocessor_config.json | tokenizer + chat / vision preprocessing |
vllm-patches/mimo_v2.py, vllm-patches/mimo_v2_omni.py | the two vLLM serving patches — mount over the image copies |
vllm-patches/PATCHES.md | full patch writeup + validation |
Method
AWQ W4A16, routed-experts-only. Only the MoE routed-expert projections (mlp.experts.*_proj) are quantized to 4-bit. Everything quality-sensitive stays high-precision: attention, the router/gate, shared paths, embeddings & lm_head, MTP, and the vision + audio towers are all left untouched. In addition, layer-41's experts are kept at bf16 (a composite carve-out — that one layer quantized worst).
This is why MoE tolerates 4-bit far better than dense models: the sensitive machinery is untouched and only the redundant expert bulk is compressed. The exact llm-compressor recipe (group-wise AWQ, smoothing maps, ignore list) is in recipe.yaml.
A100 serving patches
Base MiMo-V2.5 is Hopper-only in vLLM for architectural, not precision, reasons: it uses SWA attention sinks + asymmetric head dims (qk=192, v=128), so vLLM selects a FlashAttention-3 backend that asserts SM90+. On A100, FA2 supports neither sinks nor asymmetric V.
The two files fix this on stock vLLM 0.21.0, exactly (no approximation):
- Triton attention on SM80 — branch the backend by device capability: Hopper keeps the native FA3 path; SM80 uses the Triton backend, which supports attention sinks on Ampere.
- V head-dim padding (128 → 192) — Triton needs K and V at the same head size; pad V with zeros before attention and slice it back off the output. Provably exact.
- Fused-QKV de-shard / re-shard (the TP-8 fix) — the checkpoint's fused
qkv_projis pre-sharded for TP-4; a naivechunk()silently corrupts K/V at TP-8. The patch de-shards to canonical Q/K/V then re-shards for the serving TP — exact at TP-4, correct at TP-8 for both full and sliding-window layers. - (vision) Merger LayerNorm fix (
mimo_v2_omni.py) — matches the checkpoint's ownLayerNorm+ biased-linear merger (vLLM's copy used RMSNorm + bias-less).
Full derivation, validation, and the one checkpoint-specific assumption (NB=4, the quant-time TP the fused QKV is pre-sharded for) are in PATCHES.md. Bind-mounting is the zero-rebuild path; baking the two files into a derived image is the clean end-state.
Quality
Routed-experts-only 4-bit is near-lossless here: measured symmetric-KL vs the bf16 reference ≈ 0.046 — in the "good MoE 4-bit" range (dense 4-bit is typically ~0.05+). The high-precision attention/router/shared paths plus the layer-41 bf16 carve-out are what keep it faithful.
License & credits
- Original model: MiMo-V2.5 © Xiaomi, MIT license.
- int4 quantization + A100 vLLM patches by @spectator2026.
Model provider
shadowlilac
Model tree
Base
XiaomiMiMo/MiMo-V2.5
Quantized
this model
Modalities
Input
Video, Audio, Text, Image
Output
Text
Pricing
Dedicated Endpoints
View detailsSupported Functionality
Model APIs
Dedicated Endpoints
Container
More information