
License: apache-2.0

Main Capabilities

  • Complex GUI Automation: Autonomously completes operations on complex interfaces containing hundreds of interactive elements
  • Cross-System Data Integration: Extracts and integrates data from multiple sources through purely visual interaction, without requiring APIs
  • Long-Task Planning Execution: Supports enterprise-level business process automation spanning dozens to hundreds of steps
  • Intelligent Report Generation: Automatically generates structured documents such as data analysis reports and work summaries

Technical Background

Mano-CUA builds on the technical framework of the Mano project (see Mano Technical Report). It combines the Mano-Action bidirectional self-reinforcement learning method, three-stage progressive training (SFT → Offline Reinforcement Learning → Online Reinforcement Learning), a "think-act-verify" reasoning loop, and a closed-loop data circulation system to achieve high-precision GUI understanding and operation. The edge version is optimized through mixed-precision quantization, visual token pruning, and edge inference adaptation, so that large models can run efficiently on edge devices such as a Mac mini, a MacBook, or a computing stick.

Quick Start

Requirements

  • macOS with Apple Silicon (M1+)
  • Python >= 3.12

Installation

bash

pip install transformers torch torchvision qwen-vl-utils

Single-Step Demo

python

from transformers import Qwen3VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info
from PIL import Image

# 1. Load model and processor
model = Qwen3VLForConditionalGeneration.from_pretrained(
    "Mininglamp-2718/Mano-CUA-4B-Thinking-1.1",
    torch_dtype="auto",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("Mininglamp-2718/Mano-CUA-4B-Thinking-1.1")

# 2. Load a screenshot and resize it to 1280 px wide, preserving the aspect ratio
img = Image.open("screenshot.png")
ratio = 1280 / img.width
img = img.resize((1280, int(img.height * ratio)), Image.LANCZOS)

# 3. Build prompt
task = "Click the search bar and type hello"
prompt_text = f"""You are a GUI agent. You are given a task and your action history, with screenshots. You need to perform the next action to complete the task.
## Output Format
<action>action</action>
## Action Space
open_app(app_name='') # Open an application by name.
open_url(url='') # Open a URL in the browser.
click(start_box='<|box_start|>(x1,y1)<|box_end|>')
type(content='') # type the content.
hotkey(key='') # Trigger a keyboard shortcut.
scroll(start_box='<|box_start|>(x1,y1)<|box_end|>', direction='down or up or right or left', amount='scroll_amount')
drag(start_box='<|box_start|>(x1,y1)<|box_end|>', end_box='<|box_start|>(x3,y3)<|box_end|>')
wait(duration='') # Sleep for specified duration (in seconds).
finish() # The task is completed.
stop(reason='') # If the item can not found in the image, give the reason
## User Instruction
{task}"""

# Plain dicts: doubled braces here would build a set of dicts and raise TypeError
messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": [
        {"type": "image", "image": img},
        {"type": "text", "text": prompt_text},
    ]},
]

# 4. Run inference
text_input = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text_input], images=image_inputs, videos=video_inputs,
    padding=True, return_tensors="pt",
).to(model.device)

# Greedy decoding; temperature has no effect when do_sample=False, so it is omitted
output_ids = model.generate(**inputs, max_new_tokens=512, do_sample=False)

# Drop the prompt tokens and decode only the newly generated ones
output_ids = output_ids[:, inputs.input_ids.shape[1]:]
output = processor.batch_decode(output_ids, skip_special_tokens=True)[0]
print(output)

Output Format

The model wraps its output in XML-style tags:

xml

<think>The search bar is at the top of the page...</think>
<action_desp>Click the search bar to focus it</action_desp>
<action>click(start_box='<|box_start|>(500,38)<|box_end|>')</action>
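
To act on this output, a caller first has to split it into its three fields. The following is a minimal sketch of one way to do that with regular expressions; the `parse_output` helper and `TAGS` constant are illustrative names, not part of the model's tooling, and the sketch assumes each tag appears at most once per response.

```python
import re

# Tag names taken from the example output above.
TAGS = ("think", "action_desp", "action")


def parse_output(text: str) -> dict:
    """Return the inner text of each known tag, or None when a tag is absent."""
    fields = {}
    for tag in TAGS:
        # Non-greedy match so the first closing tag ends the capture;
        # DOTALL lets the reasoning text span several lines.
        match = re.search(rf"<{tag}>(.*?)</{tag}>", text, re.DOTALL)
        fields[tag] = match.group(1).strip() if match else None
    return fields


sample = (
    "<think>The search bar is at the top of the page...</think>\n"
    "<action_desp>Click the search bar to focus it</action_desp>\n"
    "<action>click(start_box='<|box_start|>(500,38)<|box_end|>')</action>"
)
parsed = parse_output(sample)
print(parsed["action"])
```

Returning `None` for a missing tag, rather than raising, lets the caller decide how to handle a truncated response (for example, one cut off by `max_new_tokens`).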

Coordinates are normalized to the range [0, 1000]. To convert them to pixel coordinates:

python

pixel_x = int(x / 1000 * screen_width)
pixel_y = int(y / 1000 * screen_height)
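
The conversion above can be combined with extracting the pair from a `start_box` or `end_box` value. This is a sketch only: `box_to_pixels` is a hypothetical helper, and the pattern deliberately matches just the `(x,y)` pair so that it still works if the surrounding box markers are missing from the decoded text.

```python
import re


def box_to_pixels(box: str, screen_width: int, screen_height: int) -> tuple[int, int]:
    """Find a normalized '(x,y)' pair in a box string and scale it to pixels."""
    match = re.search(r"\((\d+)\s*,\s*(\d+)\)", box)
    if match is None:
        raise ValueError(f"no (x,y) pair found in {box!r}")
    x, y = int(match.group(1)), int(match.group(2))
    if not (0 <= x <= 1000 and 0 <= y <= 1000):
        raise ValueError(f"coordinates out of the [0, 1000] range: ({x}, {y})")
    # Same formula as above: normalized value / 1000 * screen dimension.
    return int(x / 1000 * screen_width), int(y / 1000 * screen_height)


print(box_to_pixels("<|box_start|>(500,38)<|box_end|>", 1280, 800))  # (640, 30)
```

Note that a normalized value of exactly 1000 maps to the screen dimension itself, one pixel past the last valid index, so callers may want to clamp the result.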

Full Action Space

Action        Syntax                                                                                        Description
open_app      open_app(app_name='')                                                                         Open an application
open_url      open_url(url='')                                                                              Open a URL
click         click(start_box='<|box_start|>(x,y)<|box_end|>')                                              Left click
doubleclick   doubleclick(start_box='<|box_start|>(x,y)<|box_end|>')                                        Double click
triple_click  triple_click(start_box='<|box_start|>(x,y)<|box_end|>')                                       Triple click (select line)
right_single  right_single(start_box='<|box_start|>(x,y)<|box_end|>')                                       Right click
hover         hover(start_box='<|box_start|>(x,y)<|box_end|>')                                              Mouse hover
type          type(content='text')                                                                          Type text
hotkey        hotkey(key='cmd+c')                                                                           Keyboard shortcut
hotkey_click  hotkey_click(start_box='<|box_start|>(x,y)<|box_end|>', key='shift')                          Modifier + click
scroll        scroll(start_box='<|box_start|>(x,y)<|box_end|>', direction='down', amount='3')               Scroll
drag          drag(start_box='<|box_start|>(x1,y1)<|box_end|>', end_box='<|box_start|>(x2,y2)<|box_end|>')  Drag and drop
wait          wait(duration='2')                                                                            Wait (seconds)
finish        finish()                                                                                      Task completed
stop          stop(reason='...')                                                                            Task infeasible
call_user     call_user()                                                                                   Request human help
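
The action syntax listed above happens to be valid Python call syntax with keyword arguments, so one way to turn an action string into a name and an argument dictionary is Python's standard `ast` module. The sketch below is an assumption about how a caller might do this, not the project's own executor: `parse_action` and `KNOWN_ACTIONS` are hypothetical names, and a `type(content=...)` value containing unescaped quotes would fail to parse and need separate handling.

```python
import ast

# Action names copied from the action space listed above.
KNOWN_ACTIONS = {
    "open_app", "open_url", "click", "doubleclick", "triple_click",
    "right_single", "hover", "type", "hotkey", "hotkey_click",
    "scroll", "drag", "wait", "finish", "stop", "call_user",
}


def parse_action(call: str) -> tuple[str, dict]:
    """Split an action string such as "hotkey(key='cmd+c')" into (name, kwargs)."""
    try:
        node = ast.parse(call.strip(), mode="eval").body
    except SyntaxError as exc:
        raise ValueError(f"not a valid call: {call!r}") from exc
    if not isinstance(node, ast.Call) or not isinstance(node.func, ast.Name):
        raise ValueError(f"expected a simple function call: {call!r}")
    if node.func.id not in KNOWN_ACTIONS:
        raise ValueError(f"unknown action: {node.func.id!r}")
    if node.args or any(kw.arg is None for kw in node.keywords):
        raise ValueError("only keyword arguments are expected")
    # literal_eval accepts only literals, so nothing in the string is executed.
    kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in node.keywords}
    return node.func.id, kwargs


print(parse_action("hotkey(key='cmd+c')"))  # ('hotkey', {'key': 'cmd+c'})
```

Checking the name against a fixed set before dispatching keeps an unexpected or malformed model output from reaching whatever code actually drives the mouse and keyboard.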

Other Versions

Version      Repo                               Description
fp16 (this)  Mano-CUA-4B-Thinking-1.1           Unquantized fp16 weights, for archival, re-quantization, or GPU inference
MLX-8bit     Mano-CUA-4B-Thinking-1.1-MLX-8bit  MLX 8-bit quantized, recommended for local inference on Apple Silicon
Contact

Model provider: Mininglamp-2718

Modalities: Text and Image input, Text output
