License: MIT
I. Introduction
MiMo-Embodied is a powerful cross-embodied vision-language model (VLM) that achieves state-of-the-art performance in both autonomous driving and embodied AI tasks. It is the first open-source VLM to integrate these two critical areas, significantly enhancing understanding and reasoning in dynamic physical environments.
II. Model Capabilities
III. Model Details
IV. Evaluation Results
MiMo-Embodied demonstrates superior performance across 17 benchmarks covering three key embodied AI capabilities (Task Planning, Affordance Prediction, and Spatial Understanding), significantly surpassing existing open-source embodied VLMs and rivaling closed-source models.
Additionally, MiMo-Embodied excels on 12 autonomous driving benchmarks covering three key capabilities (Environmental Perception, Status Prediction, and Driving Planning), significantly outperforming existing open-source, closed-source, and proprietary VLMs.
Moreover, evaluation on 8 general visual understanding benchmarks confirms that MiMo-Embodied retains and even strengthens its general capabilities, showing that domain-specialized training enhances rather than diminishes overall model proficiency.
Embodied AI Benchmarks
Affordance & Planning
Spatial Understanding
Autonomous Driving Benchmarks
Single-View Image & Multi-View Video
Multi-View Image & Single-View Video
General Visual Understanding Benchmarks
Results marked with * are obtained using our evaluation framework.
V. Case Visualization
Embodied AI
Affordance Prediction
Task Planning
Spatial Understanding
Autonomous Driving
Environmental Perception
Status Prediction
Driving Planning
Real-world Tasks
Embodied Navigation
Embodied Manipulation
VI. Citation
@misc{hao2025mimoembodiedxembodiedfoundationmodel,
  title={MiMo-Embodied: X-Embodied Foundation Model Technical Report},
  author={Xiaomi Embodied Intelligence Team},
  year={2025},
  eprint={2511.16518},
  archivePrefix={arXiv},
  primaryClass={cs.RO},
  url={https://arxiv.org/abs/2511.16518},
}
Model provider: XiaomiMiMo
Model tree: Base: XiaomiMiMo/MiMo-Embodied-7B; Fine-tuned: this model
Modalities: Input: Text, Image; Output: Text