README
License: apache-2.0
Highlights
- Unified embodied capability system. A single 8B model unifies three capability dimensions: Cognition & Spatial Reasoning, Planning & Correction, and Pointing & Location.
- State-of-the-art performance. Achieves SOTA on 16 out of 24 embodied VLM benchmarks, with an average score of 70.4% across 21 main accuracy-based benchmarks, surpassing Gemini-Robotics-ER-1.5 and GPT-5.4 by 17.0% and 21.7% respectively.
- Closed-loop autonomy. The PGC framework lets one model serve as planner, grounder, and corrector simultaneously, completing long-horizon real-world tasks (e.g., making milk tea, sweeping garbage, stacking cups) without human intervention.
- Efficient adaptation to action. Because embodied reasoning is internalized upstream, the model can be fine-tuned into Embodied-R1.5-VLA with only a small amount of action data, outperforming strong VLA baselines such as π0.5 across 4 popular manipulation benchmark suites (e.g., 92.4% on SimplerEnv Google Robot Visual Matching).
- Fully open-source. We release model weights, datasets, training code, and EmbodiedEvalKit, an evaluation framework tailored for embodied tasks.
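The closed-loop idea in the Highlights (one model acting as planner, grounder, and corrector) can be pictured with a small control loop. The README does not describe the PGC interface, so this is only a sketch of how such a loop might be organised: `run_closed_loop`, its four callables, and `max_retries` are hypothetical names, not part of the released code. In practice each callable would presumably wrap a prompt to the same model.

```python
from typing import Callable, List


def run_closed_loop(
    plan: Callable[[str], List[str]],
    ground: Callable[[str], dict],
    execute: Callable[[str, dict], bool],
    correct: Callable[[str], str],
    task: str,
    max_retries: int = 2,
) -> List[str]:
    """Run each planned step; on failure, ask for a corrected step and retry.

    All callables are hypothetical stand-ins for the planner, grounder,
    robot executor, and corrector roles.
    """
    log: List[str] = []
    for step in plan(task):                  # planner role: task -> ordered steps
        attempt = step
        succeeded = False
        for try_idx in range(max_retries + 1):
            target = ground(attempt)         # grounder role: step -> coordinates
            if execute(attempt, target):     # executor reports success or failure
                succeeded = True
                break
            if try_idx < max_retries:
                attempt = correct(attempt)   # corrector role: revise the failed step
        if not succeeded:
            log.append(f"failed: {step}")
            break                            # stop the task once a step cannot be recovered
        log.append(f"ok: {attempt}")
    return log
```

With stub callables the loop can be exercised without a robot or a model, which makes it easy to check the retry logic before wiring in real components.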
Model Details
- Architecture: Qwen3-VL (Qwen3VLForConditionalGeneration)
- Parameters: ~8B
- Modality: Image / Video + Text → Text
- Output format: All outputs are plain-text token sequences. Coordinates are normalized to [0,1000], trajectories are ordered coordinate sequences, and reasoning is free-form text. The final decision is emitted within an <answer>...</answer> tag.
Unified Capabilities
- Embodied Cognition & Spatial Reasoning: comprehends the semantic and spatial structure of the physical world, including static geometric relations and dynamic interaction possibilities.
- Embodied Planning & Correction: covers the full task life cycle: long-horizon task decomposition, next-step planning, process detection, error localization, and error correction.
- Embodied Pointing & Location: grounds high-level reasoning in coordinates and trajectories, covering referring expression grounding, region-level localization, functional (affordance) grounding, and visual trace generation.
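Because Model Details states that coordinates are normalized to [0,1000], pointing and trajectory outputs usually have to be mapped back to pixels before they can be drawn or sent to a controller. The sketch below shows one way to do that. It assumes points are ordered `[x, y]` and that 1000 corresponds to the full image width or height; neither detail is confirmed by this README. The helper names `to_pixels` and `trajectory_to_pixels` are illustrative only.

```python
from typing import List, Sequence, Tuple

NORM_MAX = 1000  # upper bound of the normalized range stated in Model Details


def to_pixels(point: Sequence[float], width: int, height: int) -> Tuple[int, int]:
    """Map one normalized point to pixel coordinates (assumes [x, y] order)."""
    x, y = point
    px = round(x / NORM_MAX * width)
    py = round(y / NORM_MAX * height)
    # Clamp so that a value of exactly 1000 still lands inside the image.
    return min(max(px, 0), width - 1), min(max(py, 0), height - 1)


def trajectory_to_pixels(
    trajectory: Sequence[Sequence[float]], width: int, height: int
) -> List[Tuple[int, int]]:
    """Convert an ordered sequence of normalized points, preserving order."""
    return [to_pixels(p, width, height) for p in trajectory]
```

For example, `to_pixels([750, 748], 640, 480)` gives `(480, 359)` under these assumptions.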
Quick Start
```python
from transformers import AutoModelForImageTextToText, AutoProcessor
from PIL import Image

model_id = "IffYuan/Embodied-R1.5"
model = AutoModelForImageTextToText.from_pretrained(model_id, torch_dtype="auto", device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)

image = Image.open("scene.jpg")
messages = [{
    "role": "user",
    "content": [
        {"type": "image"},
        {"type": "text", "text": "You are a robot performing manipulation tasks. "
                                 "The task instruction is: move the blue cube on top of the yellow cube. "
                                 "Use 2D points to mark the target location."},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=512)
print(processor.batch_decode(out, skip_special_tokens=True)[0])
```
The model reasons over the visual observation and emits its final decision within an <answer> tag, e.g. <answer>[{"point_2d": [750, 748]}]</answer>.
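To use the result programmatically, the text inside the answer tag has to be pulled out of the decoded string. The following is a minimal sketch, assuming the answer body is JSON for pointing tasks (as in the example above) and may be plain text for other tasks. `extract_answer` is a hypothetical helper, not part of the released code.

```python
import json
import re
from typing import Any, Optional

# Non-greedy match so multiple tags in one string are kept separate.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def extract_answer(generated_text: str) -> Optional[Any]:
    """Return the parsed content of the last <answer> tag, or None if absent."""
    matches = ANSWER_RE.findall(generated_text)
    if not matches:
        return None
    payload = matches[-1].strip()
    try:
        return json.loads(payload)   # e.g. a list of {"point_2d": [x, y]} dicts
    except json.JSONDecodeError:
        return payload               # assumption: non-JSON answers are kept as raw text
```

Taking the last match is a defensive choice in case the reasoning text itself mentions the tag before the final decision is emitted.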
Citation
```bibtex
@article{yuan2026embodiedr15,
  title  = {Embodied-R1.5: Evolving Physical Intelligence via Embodied Foundation Models},
  author = {Yuan, Yifu and others},
  year   = {2026}
}
```
License
Released under the Apache 2.0 license.
Model provider
- IffYuan
Model tree
- Base: Qwen/Qwen3-VL-8B-Instruct
- Fine-tuned: this model
Modalities
- Input: Text, Image
- Output: Text