Run this model inference on single tenant GPU with unmatched speed and reliability at scale.
Run this model inference with full control and performance in your environment.
Get help setting up a custom Dedicated Endpoints.
Talk with our engineer to get a quote for reserved GPU instances with discounts.
README
License: apache-2.0Model Details
- Model type: multimodal mobile GUI agent
- Base model:
Qwen/Qwen3-VL-8B-Instruct - Training data:
lgy0404/MemGUI-3K - Training recipe: supervised fine-tuning with ms-swift
- Output protocol: ConAct 5-part structured output
- License: Apache 2.0
Intended Use
MemGUI-8B-SFT is intended for research on mobile GUI agents, long-horizon GUI control, context management, UI memory, and history folding. It can be used as an action policy in mobile GUI environments that provide screenshots and execute structured tool calls.
This model is not a general-purpose chatbot. It expects the MemGUI-Agent system prompt, a screenshot, and a structured mobile GUI context state.
Input and Output Format
The model expects a multimodal conversation with:
- a system prompt defining the MemGUI-Agent tools and response format,
- a user message containing
<image>plus the task goal and structured context, - one screenshot image.
The assistant response follows this order:
xml
<thinking>...</thinking><folding>{"range": [start_step, current_step], "summary": "..."}</folding><tool_call>{"name": "mobile_use", "arguments": {...}}</tool_call><ui_observation>...</ui_observation><action_intent>...</action_intent>
For the first step of a trajectory, <folding> is omitted because there is no
previous step to fold.
Evaluation
| Benchmark | Metric | Score |
|---|---|---|
| MemGUI-Bench | Pass@1 | 23.4 |
| MemGUI-Bench | Pass@3 | 35.9 |
| MemGUI-Bench | IRR | 30.2 |
| MobileWorld GUI-Only | Success Rate | 17.9 |
On MemGUI-Bench, MemGUI-8B-SFT improves over the Qwen3-VL-8B-Instruct baseline and achieves the best open-data 8B performance reported in our experiments. On MobileWorld GUI-Only, it transfers beyond the source benchmark and reaches 17.9% success rate.
Dataset
MemGUI-3K contains 2,956 successful mobile GUI trajectories and 64,430 reasonable step-level training samples with ConAct annotations. The dataset includes full trajectories, screenshots, step-level reasonableness annotations, and multimodal training files.
Dataset page: https://huggingface.co/datasets/lgy0404/MemGUI-3K
Citation
bibtex
@article{memguiagent2026,title = {MemGUI-Agent: An End-to-End Long-Horizon Mobile GUI Agent with Proactive Context Management},year = {2026}}
Model provider
lgy0404
Model tree
Base
Qwen/Qwen3-VL-8B-Instruct
Fine-tuned
this model
Modalities
Input
Text, Image
Output
Text
Pricing
Dedicated Endpoints
View detailsSupported Functionality
Model APIs
Dedicated Endpoints
Container
More information