Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more
Container

Run this model inference with full control and performance in your environment.

Learn more

Get help setting up a custom Dedicated Endpoints.

Talk with our engineer to get a quote for reserved GPU instances with discounts.

README

License: apache-2.0

Model Details

  • Model type: multimodal mobile GUI agent
  • Base model: Qwen/Qwen3-VL-8B-Instruct
  • Training data: lgy0404/MemGUI-3K
  • Training recipe: supervised fine-tuning with ms-swift
  • Output protocol: ConAct 5-part structured output
  • License: Apache 2.0

Intended Use

MemGUI-8B-SFT is intended for research on mobile GUI agents, long-horizon GUI control, context management, UI memory, and history folding. It can be used as an action policy in mobile GUI environments that provide screenshots and execute structured tool calls.

This model is not a general-purpose chatbot. It expects the MemGUI-Agent system prompt, a screenshot, and a structured mobile GUI context state.

Input and Output Format

The model expects a multimodal conversation with:

  • a system prompt defining the MemGUI-Agent tools and response format,
  • a user message containing <image> plus the task goal and structured context,
  • one screenshot image.

The assistant response follows this order:

xml

<thinking>...</thinking>
<folding>{"range": [start_step, current_step], "summary": "..."}</folding>
<tool_call>{"name": "mobile_use", "arguments": {...}}</tool_call>
<ui_observation>...</ui_observation>
<action_intent>...</action_intent>

For the first step of a trajectory, <folding> is omitted because there is no previous step to fold.

Evaluation

BenchmarkMetricScore
MemGUI-BenchPass@123.4
MemGUI-BenchPass@335.9
MemGUI-BenchIRR30.2
MobileWorld GUI-OnlySuccess Rate17.9

On MemGUI-Bench, MemGUI-8B-SFT improves over the Qwen3-VL-8B-Instruct baseline and achieves the best open-data 8B performance reported in our experiments. On MobileWorld GUI-Only, it transfers beyond the source benchmark and reaches 17.9% success rate.

Dataset

MemGUI-3K contains 2,956 successful mobile GUI trajectories and 64,430 reasonable step-level training samples with ConAct annotations. The dataset includes full trajectories, screenshots, step-level reasonableness annotations, and multimodal training files.

Dataset page: https://huggingface.co/datasets/lgy0404/MemGUI-3K

Citation

bibtex

@article{memguiagent2026,
title = {MemGUI-Agent: An End-to-End Long-Horizon Mobile GUI Agent with Proactive Context Management},
year = {2026}
}

Model provider

lgy0404

Model tree

Base

Qwen/Qwen3-VL-8B-Instruct

Fine-tuned

this model

Modalities

Input

Text, Image

Output

Text

Pricing

Dedicated Endpoints

View details

Supported Functionality

Model APIs

Dedicated Endpoints

Container

More information

Explore FriendliAI today