Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more
Container

Run this model inference with full control and performance in your environment.

Learn more

Get help setting up a custom Dedicated Endpoints.

Talk with our engineer to get a quote for reserved GPU instances with discounts.

README

License: mit

I. Introduction

MiMo-Embodied, a powerful cross-embodied vision-language model that shows state-of-the-art performance in both autonomous driving and embodied AI tasks, the first open-source VLM that integrates these two critical areas, significantly enhancing understanding and reasoning in dynamic physical environments.

II. Model Capabilities

III. Model Details

IV. Evaluation Results

MiMo-Embodied demonstrates superior performance across 17 benchmarks in three key embodied AI capabilities: Task Planning, Affordance Prediction, and Spatial Understanding, significantly surpassing existing open-source embodied VLM models and rivaling closed-source models.

Additionally, MiMo-Embodied excels in 12 autonomous driving benchmarks across three key capabilities: Environmental Perception, Status Prediction, and Driving Planning—significantly outperforming both existing open-source and closed-source VLM models, as well as proprietary VLM models.

Moreover, evaluation on 8 general visual understanding benchmarks confirms that MiMo-Embodied retains and even strengthens its general capabilities, showing that domain-specialized training enhances rather than diminishes overall model proficiency.

Embodied AI Benchmarks

Affordance & Planning

Spatial Understanding

Autonomous Driving Benchmarks

Single-View Image & Multi-View Video

Multi-View Image & Single-View Video

General Visual Understanding Benchmarks

Results marked with * are obtained using our evaluation framework.

V. Case Visualization

Embodied AI

Affordance Prediction

Task Planning

Spatial Understanding

Autonomous Driving

Environmental Perception

Status Prediction

Driving Planning

Real-world Tasks

Embodied Navigation

Embodied Manipulation

VI. Citation

bibtex

@misc{hao2025mimoembodiedxembodiedfoundationmodel,
title={MiMo-Embodied: X-Embodied Foundation Model Technical Report},
author={Xiaomi Embodied Intelligence Team},
year={2025},
eprint={2511.16518},
archivePrefix={arXiv},
primaryClass={cs.RO},
url={https://arxiv.org/abs/2511.16518},
}

Model provider

XiaomiMiMo

XiaomiMiMo

Model tree

Base

XiaomiMiMo/MiMo-Embodied-7B

Fine-tuned

this model

Modalities

Input

Text, Image

Output

Text

Pricing

Dedicated Endpoints

View details

Supported Functionality

Model APIs

Dedicated Endpoints

Container

More information

Explore FriendliAI today