duvoai

duvo-eye-1.5

License: apache-2.0

Results

All numbers come from one fixed, validated harness with enable_thinking=False and deterministic (greedy) decoding. The harness is calibrated against v1's published result: it reproduces duvo-eye-1's published ScreenSpot-Pro 72.9 (72.99% measured on the full 1,581 samples), so the v1 → v1.5 deltas below are a like-for-like A/B comparison. Figures are as of 2026-06.

| Benchmark (samples) | duvo-eye-1 (v1) | duvo-eye-1.5 | Δ |
| --- | --- | --- | --- |
| ScreenSpot-Pro (1,581) | 72.99% | 73.31% | +0.32 pp |
| OSWorld-G (510) | 80.2% | 80.78% | +0.58 pp |
| UI-I2E-Bench (1,477) | 84.2% | 84.90% | +0.70 pp |

Parse failures were 0% on every board for both models. The v1 column is v1 re-measured under this same harness (the A/B reference). For OSWorld-G, that re-measurement (80.2) runs about two points above v1's separately published 78.0 (510 samples) because of harness settings; the comparable quantity here is the within-harness Δ, not the absolute score across harnesses.

Honest framing. This is a real but modest improvement — a clean, monotone uplift from RL grounding refinement on an already-strong model. The headline gains are fractions of a point per board. It is not SOTA and not a step change; if you already run duvo-eye-1, expect a small, safe upgrade, not a new tier of capability.
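To put "fractions of a point" in concrete terms, the deltas in the table can be converted into approximate sample counts. The sketch below uses only the table's figures; because the published percentages are rounded, the counts are estimates, and the helper name is illustrative.

```python
# Convert each percentage-point delta from the table into an approximate
# number of additional correct samples. Inputs are the rounded figures
# from the table, so the results are estimates, not harness output.
BOARDS = {
    "ScreenSpot-Pro": (1581, 72.99, 73.31),
    "OSWorld-G": (510, 80.2, 80.78),
    "UI-I2E-Bench": (1477, 84.2, 84.90),
}


def extra_correct(n_samples: int, v1_pct: float, v15_pct: float) -> int:
    """Approximate count of samples gained going from v1 to v1.5."""
    return round((v15_pct - v1_pct) / 100 * n_samples)


gains = {name: extra_correct(*row) for name, row in BOARDS.items()}
for name, gained in gains.items():
    print(f"{name}: about {gained} more correct samples")
```

On these figures the uplift corresponds to roughly 5, 3, and 10 additional correct samples on the three boards.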

By target type, the model is much stronger on text targets (~82%) than on icon targets (~60%). That icon / tiny-target gap is the main remaining weakness, and the RL stage — trained at ~1M px while ScreenSpot-Pro screenshots are 4K — does not close it (see Limitations).

Positioning

The credible pitch is efficiency, reliability, and single-model grounding quality — not "beats frontier." At ~3B active parameters, the v1 lineage sits among the very top single-forward-pass models on ScreenSpot-Pro: v1's 72.9 was verified as #2 of 86 entries on the official leaderboard (2026-06-13), behind only one 8B model and ahead of every larger single model; v1.5's 73.31 nudges that up. The handful of entries scoring higher are multi-step scaffolds and ensembles (iterative zoom, agentic refinement; top ≈80.9) — a separate, more expensive inference class that composes on top of any grounder. duvo-eye-1.5 is a strong, cheap single-model grounder; it is not an agent and does not beat frontier models on agentic benchmarks. For the full landscape (ScreenSpot-v2, UI-I2E, UI-Vision, WebClick, Showdown, and the verified-vs-self-reported breakdown), see the duvo-eye-1 card — those boards were measured on the v1 lineage and were not re-run for v1.5.

Usage

duvo-eye-1.5 is a standard transformers image-text-to-text model. Disable thinking, use the exact grounding prompt below, and parse the JSON point. The returned {x, y} are normalized to [0, 1000]; scale by the original screenshot dimensions, not the downscaled model input.

```python
import json
import re

from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "duvoai/duvo-eye-1.5"
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype="bfloat16",
    device_map="auto",
)

image = Image.open("screenshot.png").convert("RGB")
target = "the Save button in the top toolbar"

# Exact grounding prompt: keep the wording and whitespace unchanged.
prompt = (
    "Localize an element on the GUI image according to the provided target "
    "and output a click position.\n"
    ' * You must output a valid JSON following the format: {"x": int 0-1000, "y": int 0-1000}\n'
    " Your target is:\n" + target
)
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": prompt},
        ],
    }
]

inputs = processor.apply_chat_template(
    messages,
    add_generation_prompt=True,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    enable_thinking=False,  # REQUIRED: emit the coordinate directly, not reasoning
).to(model.device)

# Greedy decoding; a JSON point needs only a few tokens.
generated = model.generate(**inputs, max_new_tokens=64, do_sample=False)

# Decode only the newly generated tokens, not the prompt.
output = processor.batch_decode(
    generated[:, inputs["input_ids"].shape[1]:],
    skip_special_tokens=True,
)[0]

# Parse {"x": int 0-1000, "y": int 0-1000}
match = re.search(r"\{.*?\}", output, re.DOTALL)
if match is None:
    raise ValueError(f"No JSON point found in model output: {output!r}")
point = json.loads(match.group(0))
x_norm, y_norm = point["x"], point["y"]

# Scale to original pixels
w, h = image.size
x_px = round(x_norm / 1000 * w)
y_px = round(y_norm / 1000 * h)
print(x_px, y_px)
```
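In a pipeline, the parsing and rescaling steps at the end of the snippet above can be wrapped in a helper that fails softly instead of raising. This is a sketch, not part of the model's API: `parse_click` is a hypothetical name, and both returning `None` for malformed or out-of-range output and clamping to the last valid pixel index are assumed policies.

```python
import json
import re
from typing import Optional, Tuple


def parse_click(output: str, width: int, height: int) -> Optional[Tuple[int, int]]:
    """Turn the model's text output into pixel coordinates, or None if unusable."""
    match = re.search(r"\{.*?\}", output, re.DOTALL)
    if match is None:
        return None
    try:
        point = json.loads(match.group(0))
        x_norm, y_norm = float(point["x"]), float(point["y"])
    except (ValueError, KeyError, TypeError):
        return None
    # Assumed policy: reject points outside the documented [0, 1000] range.
    if not (0 <= x_norm <= 1000 and 0 <= y_norm <= 1000):
        return None
    # Scale by the ORIGINAL screenshot size, then clamp so that a value of
    # 1000 maps to the last pixel rather than one past the edge.
    x_px = min(round(x_norm / 1000 * width), width - 1)
    y_px = min(round(y_norm / 1000 * height), height - 1)
    return x_px, y_px
```

A caller can treat a `None` result as a parse failure and retry or skip the step, which keeps malformed output from turning into a stray click.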

Notes:

  • Always pass enable_thinking=False. This injects an empty <think></think> block so the model emits the JSON coordinate directly. Without it the model produces reasoning text and may not reach a parseable point.
  • Use greedy / deterministic decoding (do_sample=False) for reproducible grounding.
  • For high-resolution screenshots (4K, professional software), raise the processor's max_pixels so the input is not over-downscaled — resolution is the single biggest lever on dense, small-icon UIs.
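The effect of the pixel budget can be estimated with simple arithmetic. The sketch below assumes the processor shrinks an image uniformly until its area fits within `max_pixels`, preserving aspect ratio; real processors also round to patch-size multiples, which is ignored here. The 24 px icon size is an illustrative value, not a measured one.

```python
import math


def effective_size(width: int, height: int, max_pixels: int, target_px: float) -> float:
    """Approximate on-input size of a target after area-based downscaling."""
    area = width * height
    # No resizing is assumed when the image already fits the budget.
    scale = min(1.0, math.sqrt(max_pixels / area))
    return target_px * scale


# A 24 px icon on a 3840x2160 screenshot under two pixel budgets.
small = effective_size(3840, 2160, 1_000_000, 24)
large = effective_size(3840, 2160, 4_000_000, 24)
print(f"{small:.1f} px at 1M px, {large:.1f} px at 4M px")
```

Under these assumptions the icon shrinks to roughly a third of its size at a 1M px budget, which is consistent with small icon targets being the hardest case.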

Serving

bf16 weights are ~66 GB:

  • vLLM TP=1 on one 141 GB H200, or TP=2 on 2×80 GB.
  • Tune max_pixels for input resolution; for 4K professional-software screenshots, raise it so fine icons aren't over-downscaled. The [0, 1000] output is resolution-independent — always rescale by the original screenshot size.
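When the model sits behind an OpenAI-compatible server such as vLLM's, thinking has to be disabled per request rather than in a local `apply_chat_template` call. The sketch below builds such a request body. It assumes the server forwards a `chat_template_kwargs` field to the chat template, and `build_request` is a hypothetical helper; verify the field name against your server version before relying on it.

```python
import base64


def build_request(model: str, image_bytes: bytes, prompt: str) -> dict:
    """Build a chat-completions request body for one grounding query."""
    data_url = "data:image/png;base64," + base64.b64encode(image_bytes).decode("ascii")
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": data_url}},
                    {"type": "text", "text": prompt},
                ],
            }
        ],
        "max_tokens": 64,
        "temperature": 0.0,  # greedy decoding for reproducible grounding
        # Assumption: the server passes these kwargs to the chat template.
        "chat_template_kwargs": {"enable_thinking": False},
    }


# Placeholder bytes and prompt; use a real PNG and the exact grounding prompt.
body = build_request("duvoai/duvo-eye-1.5", b"placeholder image bytes", "<grounding prompt>")
print(sorted(body))
```

The client still holds the original screenshot, so the returned [0, 1000] point should be rescaled on the client side using that image's dimensions.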

Training

duvo-eye-1.5 = duvo-eye-1 + a GRPO reinforcement-learning stage, merged.

Method. GRPO (via TRL) with an attention-only LoRA (rank 16, on q/k/v/o), beta = 0, 200 steps on 4×H100. Rollouts were generated with enable_thinking=False (v1 already grounds the RL data at ~80% greedy, which gives a dense reward signal), then the trained LoRA was folded into v1 to produce the released full model.
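For readers unfamiliar with GRPO, the core idea is that each rollout's reward is scored relative to the other rollouts sampled for the same prompt. The sketch below shows that group-relative normalization in plain Python. It illustrates the general technique, not TRL's implementation or the training code used here, and the epsilon value is an arbitrary choice.

```python
from statistics import mean, pstdev
from typing import List


def group_advantages(rewards: List[float], eps: float = 1e-4) -> List[float]:
    """Normalize rewards within one prompt's group of rollouts."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    # eps keeps the division defined when every rollout scores the same.
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four rollouts for one prompt: two hits, one near miss, one far miss.
adv = group_advantages([1.0, 1.0, 0.2, 0.0])
print([round(a, 2) for a in adv])
```

When every rollout in a group receives the same reward, all advantages are zero and that prompt contributes no gradient, so the learning signal comes from prompts where rollouts disagree.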

Reward. A point-in-bbox reward plus a small format reward:

  • 1.0 if the predicted point falls inside the ground-truth bounding box;
  • otherwise a distance-shaping term, 0.25 · exp(−dist / 150), that pulls predictions toward the box;
  • a small format reward for emitting a parseable, in-range {x, y}.
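The reward described above can be sketched as follows. Several details are not given in the text and are assumptions here: `dist` is taken as the Euclidean distance from the point to the nearest edge of the box, measured in the [0, 1000] normalized space; the format bonus is a placeholder 0.1; and an unparseable or out-of-range prediction scores zero.

```python
import math
from typing import Optional, Tuple

FORMAT_BONUS = 0.1  # placeholder value; the source only says "small"


def grounding_reward(
    point: Optional[Tuple[float, float]],
    bbox: Tuple[float, float, float, float],
) -> float:
    """Score one rollout. point is (x, y) or None; bbox is (x1, y1, x2, y2)."""
    if point is None:
        return 0.0  # assumed: no parseable point, no reward
    x, y = point
    if not (0 <= x <= 1000 and 0 <= y <= 1000):
        return 0.0  # assumed: out-of-range points earn nothing
    x1, y1, x2, y2 = bbox
    # Per-axis distance to the box; zero on an axis where the point is within it.
    dx = max(x1 - x, 0.0, x - x2)
    dy = max(y1 - y, 0.0, y - y2)
    dist = math.hypot(dx, dy)
    hit = 1.0 if dist == 0.0 else 0.25 * math.exp(-dist / 150)
    return hit + FORMAT_BONUS
```

With this shaping a miss never earns more than 0.25 before the format bonus, so a hit always dominates, while nearer misses still score higher than distant ones.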

Data. ~6.5k grounding prompts mined from ServiceNow/GroundCUA (open-source professional-desktop GUIs), trained at ~1M px with bounding boxes normalized to [0, 1000]. The train/eval resolution gap (1M px training vs 4K test screenshots) is why icon precision did not improve much — higher-resolution RL is the next lever.

For the v1 SFT recipe (LoRA on Holo-3.1-35B-A3B over the private duvoai/SynthUI corpus), see the duvo-eye-1 card.

Intended use

  • GUI element grounding in a computer-use / desktop-automation pipeline: map a textual target description to a single click point on a screenshot.
  • As the grounding stage behind a separate planner (which decides what to click) and, optionally, a verifier / test-time-scaling layer on top.
  • Enterprise back-office and professional-software UIs (the v1 lineage was tuned on synthetic back-office UIs; the RL stage used open-source professional-desktop screenshots).

Limitations

  • Icons and tiny targets are the weak spot. The model is much stronger on text targets (~82%) than on icon targets (~60%); the RL stage does not close this gap. Dense professional UIs with small icon controls remain the hardest case.
  • Disable thinking — it is not a reasoning model. You must call apply_chat_template(..., enable_thinking=False) for direct grounding. The inherited template defaults to thinking ON; left on, the model emits reasoning instead of a coordinate and often fails to produce a parseable point within a short token budget. Thinking does not improve grounding here.
  • It always returns a coordinate and cannot abstain. There is no mechanism to say "the target is not present." If the described element is absent, the model still emits a (wrong) click point. Handle absence at the pipeline level.
  • Grounder, not agent. No planning, navigation, multi-step execution, or function calling. It resolves one description to one click and nothing more.
  • Modest gain over v1. Expect a small, clean uplift (fractions of a point per board), not a step change.

Citation

```bibtex
@misc{duvo2026eye15,
  title  = {duvo-eye-1.5: a GRPO-refined GUI grounding model},
  author = {Duvo AI},
  year   = {2026},
  url    = {https://huggingface.co/duvoai/duvo-eye-1.5},
}
```

Built on Hcompany/Holo-3.1-35B-A3B (H Company) and refined with ServiceNow/GroundCUA. License: Apache 2.0, same as the base.

Model provider: duvoai

Model tree: Hcompany/Holo-3.1-35B-A3B (base) → this model (fine-tuned)

Modalities: Video, Text, Image (input); Text (output)
