Run this model inference on single tenant GPU with unmatched speed and reliability at scale.
Run this model inference with full control and performance in your environment.
Get help setting up a custom Dedicated Endpoints.
Talk with our engineer to get a quote for reserved GPU instances with discounts.
README
License: apache-2.0Main Capabilities
- Complex GUI Automation: Autonomously complete complex interface operations containing hundreds of interactive elements
- Cross-System Data Integration: Extract and integrate multi-source data through pure visual interaction without API interfaces
- Long-Task Planning Execution: Support enterprise-level business process automation of dozens to hundreds of steps
- Intelligent Report Generation: Automatically generate structured documents such as data analysis reports and work summaries
Technical Background
Mano-CUA builds upon the complete technical framework of the Mano project (see Mano Technical Report), employing the Mano-Action bidirectional self-reinforcement learning method, three-stage progressive training (SFT → Offline Reinforcement Learning → Online Reinforcement Learning), "think-act-verify" loop reasoning mechanism, and a closed-loop data circulation system to achieve high-precision GUI understanding and operation capabilities. The edge version is optimized through mixed-precision quantization, visual token pruning, and edge inference adaptation, enabling large-scale parameter models to run efficiently on edge devices like Mac mini/MacBook/computing sticks.
Quick Start
Requirements
- macOS with Apple Silicon (M1+)
- Python >= 3.12
Installation
bash
pip install transformers torch torchvision qwen-vl-utils
Single-Step Demo
python
from transformers import Qwen3VLForConditionalGeneration, AutoProcessorfrom qwen_vl_utils import process_vision_infofrom PIL import Image# 1. Load modelmodel = Qwen3VLForConditionalGeneration.from_pretrained("Mininglamp-2718/Mano-CUA-4B-Thinking-1.1",torch_dtype="auto",device_map="auto",)processor = AutoProcessor.from_pretrained("Mininglamp-2718/Mano-CUA-4B-Thinking-1.1")# 2. Load a screenshotimg = Image.open("screenshot.png")ratio = 1280 / img.widthimg = img.resize((1280, int(img.height * ratio)), Image.LANCZOS)# 3. Build prompttask = "Click the search bar and type hello"prompt_text = f"""You are a GUI agent. You are given a task and your action history, with screenshots. You need to perform the next action to complete the task.## Output Format<action>action</action>## Action Spaceopen_app(app_name='') # Open an application by name.open_url(url='') # Open a URL in the browser.click(start_box='<|box_start|>(x1,y1)<|box_end|>')type(content='') # type the content.hotkey(key='') # Trigger a keyboard shortcut.scroll(start_box='<|box_start|>(x1,y1)<|box_end|>', direction='down or up or right or left', amount='scroll_amount')drag(start_box='<|box_start|>(x1,y1)<|box_end|>', end_box='<|box_start|>(x3,y3)<|box_end|>')wait(duration='') # Sleep for specified duration (in seconds).finish() # The task is completed.stop(reason='') # If the item can not found in the image, give the reason## User Instruction{task}"""messages = [{{"role": "system", "content": "You are a helpful assistant."}},{{"role": "user", "content": [{{"type": "image", "image": img}},{{"type": "text", "text": prompt_text}},]}},]# 4. Run inferencetext_input = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)image_inputs, video_inputs = process_vision_info(messages)inputs = processor(text=[text_input], images=image_inputs, videos=video_inputs,padding=True, return_tensors="pt",).to(model.device)output_ids = model.generate(**inputs, max_new_tokens=512, temperature=0.0, do_sample=False)output_ids = output_ids[:, inputs.input_ids.shape[1]:]output = processor.batch_decode(output_ids, skip_special_tokens=True)[0]print(output)
Output Format
The model outputs structured XML:
xml
<think>The search bar is at the top of the page...</think><action_desp>Click the search bar to focus it</action_desp><action>click(start_box='<|box_start|>(500,38)<|box_end|>')</action>
Coordinates are normalized to [0, 1000] range. To convert to pixel coordinates:
python
pixel_x = int(x / 1000 * screen_width)pixel_y = int(y / 1000 * screen_height)
Full Action Space
| Action | Syntax | Description |
|---|---|---|
| open_app | open_app(app_name='') | Open an application |
| open_url | open_url(url='') | Open a URL |
| click | click(start_box='<|box_start|>(x,y)<|box_end|>') | Left click |
| doubleclick | doubleclick(start_box='<|box_start|>(x,y)<|box_end|>') | Double click |
| triple_click | triple_click(start_box='<|box_start|>(x,y)<|box_end|>') | Triple click (select line) |
| right_single | right_single(start_box='<|box_start|>(x,y)<|box_end|>') | Right click |
| hover | hover(start_box='<|box_start|>(x,y)<|box_end|>') | Mouse hover |
| type | type(content='text') | Type text |
| hotkey | hotkey(key='cmd+c') | Keyboard shortcut |
| hotkey_click | hotkey_click(start_box='<|box_start|>(x,y)<|box_end|>', key='shift') | Modifier + click |
| scroll | scroll(start_box='<|box_start|>(x,y)<|box_end|>', direction='down', amount='3') | Scroll |
| drag | drag(start_box='<|box_start|>(x1,y1)<|box_end|>', end_box='<|box_start|>(x2,y2)<|box_end|>') | Drag and drop |
| wait | wait(duration='2') | Wait (seconds) |
| finish | finish() | Task completed |
| stop | stop(reason='...') | Task infeasible |
| call_user | call_user() | Request human help |
Other Versions
| Version | Repo | Description |
|---|---|---|
| fp16 (this) | Mano-CUA-4B-Thinking-1.1 | Full precision, for archival / re-quantization / GPU inference |
| MLX-8bit | Mano-CUA-4B-Thinking-1.1-MLX-8bit | MLX 8-bit quantized, recommended for Apple Silicon local inference |
Contact
- Website: https://github.com/Mininglamp-AI/Mano-P
- Email: model@mininglamp.com
Model provider
Mininglamp-2718
Model tree
Base
this model
Modalities
Input
Text, Image
Output
Text
Pricing
Dedicated Endpoints
View detailsSupported Functionality
Model APIs
Dedicated Endpoints
Container
More information