lgy0404

MemGUI-8B-SFT

README

License: apache-2.0

Model Details

Model type: multimodal mobile GUI agent
Base model: Qwen/Qwen3-VL-8B-Instruct
Training data: lgy0404/MemGUI-3K
Training recipe: supervised fine-tuning with ms-swift
Output protocol: ConAct 5-part structured output
License: Apache 2.0

Intended Use

MemGUI-8B-SFT is intended for research on mobile GUI agents, long-horizon GUI control, context management, UI memory, and history folding. It can be used as an action policy in mobile GUI environments that provide screenshots and execute structured tool calls.

This model is not a general-purpose chatbot. It expects the MemGUI-Agent system prompt, a screenshot, and a structured mobile GUI context state.

Input and Output Format

The model expects a multimodal conversation with:

a system prompt defining the MemGUI-Agent tools and response format,
a user message containing <image> plus the task goal and structured context,
one screenshot image.

The assistant response follows this order:

xml
<thinking>...</thinking>
<folding>{"range": [start_step, current_step], "summary": "..."}</folding>
<tool_call>{"name": "mobile_use", "arguments": {...}}</tool_call>
<ui_observation>...</ui_observation>
<action_intent>...</action_intent>

For the first step of a trajectory, <folding> is omitted because there is no previous step to fold.

Evaluation

Table with columns: Benchmark, Metric, Score
Benchmark	Metric	Score
MemGUI-Bench	Pass@1	23.4
MemGUI-Bench	Pass@3	35.9
MemGUI-Bench	IRR	30.2
MobileWorld GUI-Only	Success Rate	17.9

On MemGUI-Bench, MemGUI-8B-SFT improves over the Qwen3-VL-8B-Instruct baseline and achieves the best open-data 8B performance reported in our experiments. On MobileWorld GUI-Only, it transfers beyond the source benchmark and reaches 17.9% success rate.

Dataset

MemGUI-3K contains 2,956 successful mobile GUI trajectories and 64,430 reasonable step-level training samples with ConAct annotations. The dataset includes full trajectories, screenshots, step-level reasonableness annotations, and multimodal training files.

Dataset page: https://huggingface.co/datasets/lgy0404/MemGUI-3K

Citation

bibtex
@article{memguiagent2026,
  title = {MemGUI-Agent: An End-to-End Long-Horizon Mobile GUI Agent with Proactive Context Management},
  author = {Guangyi Liu and Gao Wu and Congxiao Liu and Pengxiang Zhao and Liang Liu and Mading Li and Qi Zhang and Mengyan Wang and Liang Guo and Yong Liu},
  year = {2026},
  journal = {arXiv preprint arXiv:2606.19926}
}

Available on FriendliAI

Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more

Container

Run this model inference with full control and performance in your environment.

Learn more

Model Details

Model Provider

lgy0404

Model Tree

Base

Qwen/Qwen3-VL-8B-Instruct

Fine-tuned

this model

Input Modalities

TextImage

Output Modalities

Text

Supported Functionality

Dedicated EndpointsContainer

Explore FriendliAI today

Get started Talk to an engineer

README

License: apache-2.0

Model Details

Model type: multimodal mobile GUI agent
Base model: Qwen/Qwen3-VL-8B-Instruct
Training data: lgy0404/MemGUI-3K
Training recipe: supervised fine-tuning with ms-swift
Output protocol: ConAct 5-part structured output
License: Apache 2.0

Intended Use

This model is not a general-purpose chatbot. It expects the MemGUI-Agent system prompt, a screenshot, and a structured mobile GUI context state.

Input and Output Format

The model expects a multimodal conversation with:

a system prompt defining the MemGUI-Agent tools and response format,
a user message containing <image> plus the task goal and structured context,
one screenshot image.

The assistant response follows this order:

xml
<thinking>...</thinking>
<folding>{"range": [start_step, current_step], "summary": "..."}</folding>
<tool_call>{"name": "mobile_use", "arguments": {...}}</tool_call>
<ui_observation>...</ui_observation>
<action_intent>...</action_intent>

For the first step of a trajectory, <folding> is omitted because there is no previous step to fold.

Evaluation

Table with columns: Benchmark, Metric, Score
Benchmark	Metric	Score
MemGUI-Bench	Pass@1	23.4
MemGUI-Bench	Pass@3	35.9
MemGUI-Bench	IRR	30.2
MobileWorld GUI-Only	Success Rate	17.9

Dataset

Dataset page: https://huggingface.co/datasets/lgy0404/MemGUI-3K

Citation

bibtex
@article{memguiagent2026,
  title = {MemGUI-Agent: An End-to-End Long-Horizon Mobile GUI Agent with Proactive Context Management},
  author = {Guangyi Liu and Gao Wu and Congxiao Liu and Pengxiang Zhao and Liang Liu and Mading Li and Qi Zhang and Mengyan Wang and Liang Guo and Yong Liu},
  year = {2026},
  journal = {arXiv preprint arXiv:2606.19926}
}