duvoai
duvo-eye-1.5
README
License: apache-2.0
Results
All numbers come from one fixed, validated harness with enable_thinking=False and deterministic (greedy) decoding. The harness is calibrated against v1's published result: it reproduces duvo-eye-1's published ScreenSpot-Pro score of 72.9 to within 0.1 pp (72.99% on the full 1,581 samples), so the v1 → v1.5 deltas below are a like-for-like A/B comparison. Figures are as of 2026-06.
| Benchmark | duvo-eye-1 (v1) | duvo-eye-1.5 | Δ |
|---|---|---|---|
| ScreenSpot-Pro (1,581) | 72.99% | 73.31% | +0.32 pp |
| OSWorld-G (510) | 80.2% | 80.78% | +0.58 pp |
| UI-I2E-Bench (1,477) | 84.2% | 84.90% | +0.70 pp |
Parse failures were 0% on every board for both models. The v1 column is v1 re-measured under this same harness (the A/B reference). For OSWorld-G, that re-measurement (80.2%) runs about two points above v1's separately published 78.0% (510 samples) because of harness settings; what is comparable here is the within-harness Δ, not the cross-harness absolute.
Honest framing. This is a real but modest improvement — a clean, monotone uplift from RL grounding refinement on an already-strong model. The headline gains are fractions of a point per board. It is not SOTA and not a step change; if you already run duvo-eye-1, expect a small, safe upgrade, not a new tier of capability.
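To put those deltas in concrete terms, the short sketch below converts each percentage-point gain from the results table into an approximate number of additional correct samples. It uses only the benchmark sizes and Δ values reported above; the rounding step is an assumption, since the card does not publish raw counts.

```python
# Benchmark sizes and v1 -> v1.5 deltas (percentage points), copied from the results table.
BOARDS = {
    "ScreenSpot-Pro": (1581, 0.32),
    "OSWorld-G": (510, 0.58),
    "UI-I2E-Bench": (1477, 0.70),
}


def extra_correct(n_samples: int, delta_pp: float) -> int:
    """Approximate number of additional correct samples implied by a pp delta."""
    return round(n_samples * delta_pp / 100)


for name, (n, delta) in BOARDS.items():
    print(f"{name}: +{delta:.2f} pp on {n} samples is about {extra_correct(n, delta)} samples")
```

On these figures each gain works out to roughly 3 to 10 additional correct samples per board, which is consistent with the "modest" framing.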
By target type, the model is much stronger on text targets (~82%) than on icon targets (~60%). That icon / tiny-target gap is the main remaining weakness, and the RL stage — trained at ~1M px while ScreenSpot-Pro screenshots are 4K — does not close it (see Limitations).
Positioning
The credible pitch is efficiency, reliability, and single-model grounding quality — not "beats frontier." At ~3B active parameters, the v1 lineage sits among the very top single-forward-pass models on ScreenSpot-Pro: v1's 72.9 was verified as #2 of 86 entries on the official leaderboard (2026-06-13), behind only one 8B model and ahead of every larger single model; v1.5's 73.31 nudges that up. The handful of entries scoring higher are multi-step scaffolds and ensembles (iterative zoom, agentic refinement; top ≈80.9) — a separate, more expensive inference class that composes on top of any grounder. duvo-eye-1.5 is a strong, cheap single-model grounder; it is not an agent and does not beat frontier models on agentic benchmarks. For the full landscape (ScreenSpot-v2, UI-I2E, UI-Vision, WebClick, Showdown, and the verified-vs-self-reported breakdown), see the duvo-eye-1 card — those boards were measured on the v1 lineage and were not re-run for v1.5.
Usage
duvo-eye-1.5 is a standard transformers image-text-to-text model. Disable thinking, use the exact grounding prompt below, and parse the JSON point. The returned {x, y} are normalized to [0, 1000]; scale by the original screenshot dimensions, not the downscaled model input.
```python
import json
import re

from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "duvoai/duvo-eye-1.5"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype="bfloat16",
    device_map="auto",
)

image = Image.open("screenshot.png").convert("RGB")
target = "the Save button in the top toolbar"

prompt = (
    "Localize an element on the GUI image according to the provided target "
    "and output a click position.\n"
    ' * You must output a valid JSON following the format: {"x": int 0-1000, "y": int 0-1000}\n'
    " Your target is:\n" + target
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": prompt},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    enable_thinking=False,  # REQUIRED: emit the coordinate directly, not reasoning
).to(model.device)

generated = model.generate(**inputs, max_new_tokens=64, do_sample=False)
output = processor.batch_decode(
    generated[:, inputs["input_ids"].shape[1]:],
    skip_special_tokens=True,
)[0]

# Parse {"x": int 0-1000, "y": int 0-1000}
point = json.loads(re.search(r"\{.*?\}", output, re.DOTALL).group(0))
x_norm, y_norm = point["x"], point["y"]

# Scale to original pixels
w, h = image.size
x_px = round(x_norm / 1000 * w)
y_px = round(y_norm / 1000 * h)
print(x_px, y_px)
```
Notes:
- Always pass `enable_thinking=False`. This injects an empty `<think></think>` block so the model emits the JSON coordinate directly. Without it the model produces reasoning text and may not reach a parseable point.
- Use greedy / deterministic decoding (`do_sample=False`) for reproducible grounding.
- For high-resolution screenshots (4K, professional software), raise the processor's `max_pixels` so the input is not over-downscaled; resolution is the single biggest lever on dense, small-icon UIs.
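The parsing step in the usage example above raises an exception if the output contains no JSON object. A more defensive variant might look like the sketch below. The function name `parse_click`, the range check, and the clamping to valid pixel indices are illustrative assumptions, not part of the released code.

```python
import json
import re
from typing import Optional, Tuple


def parse_click(output: str, width: int, height: int) -> Optional[Tuple[int, int]]:
    """Return (x_px, y_px) in original-image pixels, or None if parsing fails."""
    match = re.search(r"\{.*?\}", output, re.DOTALL)
    if match is None:
        return None  # no JSON object in the model output
    try:
        point = json.loads(match.group(0))
        x_norm, y_norm = float(point["x"]), float(point["y"])
    except (ValueError, KeyError, TypeError):
        return None  # malformed JSON, or missing / non-numeric keys
    if not (0 <= x_norm <= 1000 and 0 <= y_norm <= 1000):
        return None  # outside the documented [0, 1000] range
    # Scale by the ORIGINAL screenshot size, then clamp to valid pixel indices
    # (the clamp is an assumption; the usage example does not clamp).
    x_px = min(round(x_norm / 1000 * width), width - 1)
    y_px = min(round(y_norm / 1000 * height), height - 1)
    return x_px, y_px
```

A caller would pass the decoded model output together with `image.size` from the original screenshot, and treat `None` as a parse failure rather than a click.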
Serving
bf16 weights are ~66 GB:
- vLLM TP=1 on one 141 GB H200, or TP=2 on 2×80 GB.
- Tune `max_pixels` for input resolution; for 4K professional-software screenshots, raise it so fine icons aren't over-downscaled. The `[0, 1000]` output is resolution-independent; always rescale by the original screenshot size.
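When the model is served through vLLM's OpenAI-compatible chat endpoint, thinking still has to be disabled per request. The sketch below only builds the JSON body and does not send it. It assumes the server forwards `chat_template_kwargs` to the chat template (as vLLM does for other thinking-capable templates) and that the screenshot is a PNG; the helper name `build_grounding_request` is hypothetical.

```python
import base64


def build_grounding_request(image_bytes: bytes, prompt: str,
                            model: str = "duvoai/duvo-eye-1.5") -> dict:
    """Build a /v1/chat/completions JSON body for a single grounding call."""
    data_url = "data:image/png;base64," + base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "image_url", "image_url": {"url": data_url}},
                {"type": "text", "text": prompt},
            ],
        }],
        "max_tokens": 64,
        "temperature": 0.0,  # greedy decoding for reproducible grounding
        # Assumption: passed through to the chat template, mirroring
        # enable_thinking=False in the transformers example.
        "chat_template_kwargs": {"enable_thinking": False},
    }
```

The resulting dictionary can be POSTed to the server's `/v1/chat/completions` route with any HTTP client; the returned text is then parsed and rescaled exactly as in the transformers example.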
Training
duvo-eye-1.5 = duvo-eye-1 + a GRPO reinforcement-learning stage, merged.
Method. GRPO (via TRL) with an attention-only LoRA (rank 16, on q/k/v/o), beta = 0, 200 steps on 4×H100. Rollouts were generated with enable_thinking=False (v1 already grounds the RL data at ~80% greedy, which gives a dense reward signal), then the trained LoRA was folded into v1 to produce the released full model.
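In TRL, `beta` is the KL-penalty coefficient, so `beta = 0` means the update is driven by group-relative advantages alone. The sketch below shows the generic GRPO advantage computation for one group of rollouts sampled from the same prompt. It is a textbook formulation rather than the training code used here, and the exact normalization (population vs. sample standard deviation, the value of `eps`) is an assumption that differs between implementations.

```python
from statistics import mean, pstdev
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-4) -> List[float]:
    """Normalize rewards within one group of rollouts for the same prompt."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # Rollouts that beat the group mean get a positive advantage, the rest negative.
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Note that a group in which every rollout earns the same reward yields all-zero advantages and therefore no gradient, which is one reason a graded reward can be useful alongside a binary hit/miss signal.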
Reward. A point-in-bbox reward plus a small format reward:
- `1.0` if the predicted point falls inside the ground-truth bounding box;
- otherwise `0.25 · exp(−dist / 150)` distance shaping toward the box;
- a small format reward for emitting a parseable, in-range `{x, y}`.
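The reward described above can be sketched as follows. Several details are assumptions, because the card does not specify them: `dist` is taken to be the Euclidean distance from the point to the nearest edge of the box in `[0, 1000]` units, the format reward is a hypothetical additive `0.1`, and unparseable or out-of-range outputs receive `0.0`.

```python
import math
from typing import Optional, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1) in [0, 1000] units
FORMAT_BONUS = 0.1  # hypothetical value; the card only says "small"


def distance_to_box(x: float, y: float, box: Box) -> float:
    """Euclidean distance from a point to the box (0.0 if inside or on the edge)."""
    x0, y0, x1, y1 = box
    dx = max(x0 - x, 0.0, x - x1)
    dy = max(y0 - y, 0.0, y - y1)
    return math.hypot(dx, dy)


def grounding_reward(point: Optional[Tuple[float, float]], box: Box) -> float:
    """Point-in-bbox reward with distance shaping, plus a format bonus."""
    if point is None:
        return 0.0  # unparseable output: no format bonus, no grounding reward
    x, y = point
    if not (0 <= x <= 1000 and 0 <= y <= 1000):
        return 0.0  # out-of-range output also forfeits the format bonus
    dist = distance_to_box(x, y, box)
    hit = 1.0 if dist == 0.0 else 0.25 * math.exp(-dist / 150)
    return hit + FORMAT_BONUS
```

With this shaping, a miss just outside the box earns close to 0.25 and the reward decays smoothly with distance, so near misses are still distinguishable from far ones.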
Data. ~6.5k grounding prompts mined from ServiceNow/GroundCUA (open-source professional-desktop GUIs), trained at ~1M px with bounding boxes normalized to [0, 1000]. The train/eval resolution gap (1M px training vs 4K test screenshots) is why icon precision did not improve much — higher-resolution RL is the next lever.
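The size of that resolution gap can be estimated with a little arithmetic. The sketch below computes the linear downscale factor needed to fit a screenshot into a pixel budget, assuming an aspect-preserving resize that hits the budget exactly (real processors also round to patch multiples). The 24 px icon is a hypothetical example, not a figure from the benchmarks.

```python
import math


def downscale_factor(width: int, height: int, pixel_budget: int = 1_000_000) -> float:
    """Linear scale factor needed to fit an image into a pixel budget (never upscales)."""
    return min(1.0, math.sqrt(pixel_budget / (width * height)))


scale = downscale_factor(3840, 2160)
print(f"4K -> ~1M px scale: {scale:.3f}; a 24 px icon becomes ~{24 * scale:.1f} px")
```

Under these assumptions a 4K screenshot shrinks to roughly a third of its linear size, which illustrates why small icon targets are the hardest case at a ~1M px budget.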
For the v1 SFT recipe (LoRA on Holo-3.1-35B-A3B over the private duvoai/SynthUI corpus), see the duvo-eye-1 card.
Intended use
- GUI element grounding in a computer-use / desktop-automation pipeline: map a textual target description to a single click point on a screenshot.
- As the grounding stage behind a separate planner (which decides what to click) and, optionally, a verifier / test-time-scaling layer on top.
- Enterprise back-office and professional-software UIs (the v1 lineage was tuned on synthetic back-office UIs; the RL stage used open-source professional-desktop screenshots).
Limitations
- Icons and tiny targets are the weak spot. The model is much stronger on text targets (~82%) than on icon targets (~60%); the RL stage does not close this gap. Dense professional UIs with small icon controls remain the hardest case.
- Disable thinking: it is not a reasoning model. You must call `apply_chat_template(..., enable_thinking=False)` for direct grounding. The inherited template defaults to thinking ON; left on, the model emits reasoning instead of a coordinate and often fails to produce a parseable point within a short token budget. Thinking does not improve grounding here.
- It always returns a coordinate and cannot abstain. There is no mechanism to say "the target is not present." If the described element is absent, the model still emits a (wrong) click point. Handle absence at the pipeline level.
- Grounder, not agent. No planning, navigation, multi-step execution, or function calling. It resolves one description to one click and nothing more.
- Modest gain over v1. Expect a small, clean uplift (fractions of a point per board), not a step change.
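Because the model cannot abstain, absence has to be detected outside it. One possible shape for that pipeline-level check is sketched below. Both `ground` and `verify` are hypothetical callables supplied by the surrounding system (for example, a wrapper around the grounder and a separate verifier); neither is provided by this model.

```python
from typing import Callable, Optional, Tuple

Point = Tuple[int, int]


def click_if_present(
    ground: Callable[[str], Point],
    verify: Callable[[str, Point], bool],
    target: str,
) -> Optional[Point]:
    """Return a click point only if a separate check accepts it; otherwise None."""
    point = ground(target)  # the grounder always returns some point
    return point if verify(target, point) else None
```

The planner can then treat `None` as "target not found" and re-plan instead of clicking a wrong location.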
Citation
```bibtex
@misc{duvo2026eye15,
  title  = {duvo-eye-1.5: a GRPO-refined GUI grounding model},
  author = {Duvo AI},
  year   = {2026},
  url    = {https://huggingface.co/duvoai/duvo-eye-1.5},
}
```
Built on Hcompany/Holo-3.1-35B-A3B (H Company) and refined with ServiceNow/GroundCUA. License: Apache 2.0, same as the base.
Model provider
duvoai
Model tree
Base
Hcompany/Holo-3.1-35B-A3B
Fine-tuned
this model
Modalities
Input
Video, Text, Image
Output
Text