
License: apache-2.0

Model Description

Given a screenshot and a target description, duvo-eye-1 outputs a click point as {"x": int, "y": int}, with both coordinates normalized to [0, 1000]. It is the grounding component of a computer-use agent stack, not an agent by itself: a planner decides what to interact with, and duvo-eye-1 resolves that description to a location on screen.

  • Developed by: Duvo
  • Model type: Vision-Language Model for single-step GUI element grounding (click-point localization)
  • Base model: Hcompany/Holo-3.1-35B-A3B — 35B-A3B MoE, 3B active, Apache 2.0
  • Method: LoRA (rank 64, alpha 128), 1 epoch on duvoai/SynthUI; merged to bf16 weights
  • Supported environments: web, desktop, and professional-software UIs; English-language instructions over English / French / German interfaces
  • Evaluation evidence: per-sample predictions + harness at duvoai/duvo-eye-1-evals
  • License: Apache 2.0
  • Contact: work@tomcupr.com

Benchmark Results

vs. the base model (Holo-3.1-35B-A3B)

duvo-eye-1's only fair baseline is the model it fine-tunes, Holo-3.1-35B-A3B. H Company publishes results for it on two public grounding benchmarks; the third row is our own in-domain test set:

| Benchmark | Holo-3.1-35B-A3B (base) | duvo-eye-1 |
|---|---|---|
| ScreenSpot-Pro (1,581) | 71.5 | 72.9 |
| OSWorld-G (510 / 564) | 78.8 † | 78.0 / 70.6 |
| SynthUI test (in-domain, private) ‡ | 62.5 | 86.6 |

duvo-eye-1 edges the base on ScreenSpot-Pro (72.9 vs 71.5, published all-samples; our 72.9 is reproduced at 72.87 under the official harness). On OSWorld-G the two are on par, measured differently — † the base's 78.8 is H Company's internal implementation, while ours is 78.0 on the standard 510-sample subset (70.6 on the full 564 with refusals as misses) under the maintainer's own scorer. The fine-tune's clearer gains are output reliability (0.0% malformed outputs vs the base's 14–21% under the same constrained-JSON harness) and in-domain accuracy.

‡ SynthUI is Duvo's private dataset — the enterprise back-office target domain — held out from training; the base figure is our measurement under the same harness. It is an in-distribution result.

Standing on the broader grounding boards

Holo-3.1-35B-A3B reports no number on these; duvo-eye-1 is shown against the field (larger models included). Standings are as of 2026-06 and may change.

| Benchmark | duvo-eye-1 (3B active) | Field reference |
|---|---|---|
| ScreenSpot-v2 (1,272) | 95.1 | parity at the top: UI-Venus-72B 95.3 (0.2 pt, saturated board) |
| UI-I2E-Bench (1,477) | 84.2 | #1 on the maintained leaderboard (next listed UGround-V1-72B 76.3); paper-reported UI-Ins-32B 87.3 is higher but not on the board |
| UI-Vision element grounding (5,479) | 64.4 | exceeds the best published number we're aware of (UI-Venus-1.5-30B 54.7); self-run, not yet third-party-confirmed |
| WebClick (1,639) | 93.6 | |
| Showdown-Clicks (557) | 78.8 | tied with top public entry ace-control-medium 77.6 (1.2 pt, within the ±3.5 pp CI) |

These place duvo-eye-1 at or near the top of each board at a 3B-active serving cost. On ScreenSpot-Pro (72.9 single-shot), the entries that outscore it on the public grounding leaderboard are almost all multi-step scaffolds (iterative zoom, heterogeneous ensembles, agentic refinement), a separate, more expensive inference class. Among single-forward-pass models, 72.9 is the second-highest entry on the board: behind one 8B model at 73.2, and ahead of every larger single model, including a 235B-A22B MoE (22B active) at 70.6. No general-purpose VLM is close (the strongest, Qwen2.5-VL-72B, scores 53.3). Such scaffolds compose on top of duvo-eye-1 as readily as on any grounder.

Other OSWorld-G splits: refined instructions 82.9 (510) / 75.0 (564). Showdown uses its official point-in-bbox metric (is_in_bbox; n=557, 95% CI ≈ ±3.5 pp). Competitor figures are leaderboard- or paper-sourced as noted; the duvo-eye-1 numbers come from our published harness.
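The quoted Showdown interval can be sanity-checked with a few lines of arithmetic. This is a minimal sketch that assumes a normal-approximation (Wald) interval; the card does not say which interval method was used, and `ci_half_width` is a name introduced here for illustration only.

```python
import math

def ci_half_width(acc, n, z=1.96):
    """Half-width of a normal-approximation (Wald) interval for an accuracy."""
    return z * math.sqrt(acc * (1 - acc) / n)

# Showdown-Clicks figures from the table: 78.8% on n=557 samples.
half = ci_half_width(0.788, 557)

# Gap to the top public entry (77.6%), for comparison with the interval.
gap = 0.788 - 0.776

print(f"half-width: {100 * half:.1f} pp, gap: {100 * gap:.1f} pp")
print("gap inside interval:", gap < half)
```

Under this assumption the half-width comes out at roughly 3.4 pp, in line with the ≈ ±3.5 pp quoted above, and the 1.2 pt gap sits well inside it, which is why the two entries are described as tied.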

What's driving the numbers (honest decomposition)

A malformed output scores as a miss, so headline deltas over the base mix two effects. Restricting to samples where each model produced parseable output:

| Benchmark | Base (among answered) | duvo-eye-1 (among answered) |
|---|---|---|
| ScreenSpot-Pro | 70.6 | 72.9 |
| OSWorld-G | 75.4 | 78.0 |

(The base's ScreenSpot-Pro figure appears two ways in this card — the same model, two denominators: 71.5 is H Company's published all-samples score; 70.6 is its among-answered, parseable-only score under our reproduction.)

So on real-world public benchmarks, the gain over the base is mostly output reliability, plus a consistent +2–3 pt grounding edge among answered samples. The base emits malformed JSON on 14–21% of samples under our prompt contract (measured with the identical prompt and decoding for both models), a failure that constrained decoding would also largely fix. On SynthUI the base's malformed rate is only 0.7%, so its lower score there reflects genuine in-domain difficulty, not format errors. Predictions are non-degenerate: 98% are unique points, and the median target covers 0.03% of screen area.
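The decomposition above amounts to computing accuracy over two different denominators. The sketch below illustrates this on made-up toy data, not on the published predictions; `decompose` and the `(parsed_ok, hit)` record layout are hypothetical and do not reflect the schema of the evals dataset.

```python
def decompose(results):
    """Split a headline score into reliability and grounding components.

    results: list of (parsed_ok, hit) pairs, one per benchmark sample.
    A malformed output (parsed_ok=False) always counts as a miss.
    """
    total = len(results)
    answered = [hit for parsed_ok, hit in results if parsed_ok]
    hits = sum(answered)
    return {
        "all_samples_acc": hits / total,
        "among_answered_acc": hits / len(answered) if answered else 0.0,
        "malformed_rate": 1 - len(answered) / total,
    }

# Toy data: 100 samples, 20 malformed, 60 hits among the 80 parseable outputs.
toy = [(True, True)] * 60 + [(True, False)] * 20 + [(False, False)] * 20
print(decompose(toy))
```

On the toy data, the all-samples accuracy is 60% while the among-answered accuracy is 75%: the 15-point difference is purely a formatting effect, which is the distinction the table above draws for the base model.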

Reproducibility & comparability

These are self-reported numbers — we ran the evals, as is standard for a model card — but built to be checkable, not taken on trust:

  • Every public-benchmark prediction is published (raw output, parsed point, hit/miss) at duvoai/duvo-eye-1-evals, with the harness, so anyone can recompute the numbers.
  • Three benchmarks are reproduced under their own official harnesses, each matching our harness within rounding (official result files included in the evals dataset): ScreenSpot-Pro 72.87 (eval_screenspot_pro.py, PR #29); OSWorld-G 78.0 (510) / 70.6 (564) under xlang-ai's own eval.py scorer; UI-I2E-Bench 84.4 under its is_correct scorer.
  • Submitted to every official channel that exists: the ScreenSpot-Pro / ScreenSpot-v2 grounding leaderboard, OSWorld-G, and UI-I2E-Bench.
  • One caveat: the official OSWorld-G metric grants refusal credit our coordinate-only protocol cannot earn, so we report both the refusal-excluded (510) and refusal-as-miss (564) figures.
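The two OSWorld-G figures are the same set of hits over two denominators, which can be checked by hand. In this sketch the hit count is inferred from the rounded 78.0% (398 is the only integer count that rounds to 78.0% of 510); it is not read from the published result files.

```python
ANSWERABLE = 510  # standard subset, refusal samples excluded
FULL = 564        # full set; the extra 54 samples expect a refusal

# Hit count implied by the rounded 78.0% on the 510-sample subset.
hits = round(0.780 * ANSWERABLE)

acc_510 = 100 * hits / ANSWERABLE
# A coordinate-only model cannot refuse, so all 54 refusal samples are misses.
acc_564 = 100 * hits / FULL

print(f"{hits} hits -> {acc_510:.1f}% (510), {acc_564:.1f}% (564)")
# prints "398 hits -> 78.0% (510), 70.6% (564)"
```

The 7.4-point drop on the full set therefore comes entirely from the refusal samples, not from any change in grounding accuracy.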

Get Started

Serve with vLLM (bf16 weights are ~66 GB: one 141 GB H200 at TP=1, or 2×80 GB at TP=2):

bash

# Two 80 GB GPUs; on a single 141 GB H200, set --tensor-parallel-size 1 instead.
vllm serve duvoai/duvo-eye-1 \
  --served-model-name duvo-eye-1 \
  --tensor-parallel-size 2 \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.90 \
  --mm-processor-kwargs '{"max_pixels": 1310720}'

For high-resolution screenshots (4K+, professional software), raise max_pixels — we used {"max_pixels": 8000000} for ScreenSpot-Pro.
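To see why the pixel budget matters, the sketch below estimates the resolution the model actually receives. It assumes the image processor downsizes any image above max_pixels while preserving aspect ratio, and it ignores any rounding to patch-size multiples; `approx_model_input_size` is a hypothetical helper, not part of vLLM or the model's processor.

```python
import math

def approx_model_input_size(width, height, max_pixels):
    """Approximate size after an aspect-preserving downscale to fit max_pixels."""
    pixels = width * height
    if pixels <= max_pixels:
        return width, height  # already within budget, no downscale
    scale = math.sqrt(max_pixels / pixels)
    return int(width * scale), int(height * scale)

# Compare the default serving budget with the one used for ScreenSpot-Pro.
for name, (w, h) in {"1080p": (1920, 1080), "4K": (3840, 2160)}.items():
    for budget in (1_310_720, 8_000_000):
        print(name, budget, approx_model_input_size(w, h, budget))
```

The default budget of 1,310,720 pixels equals 1280×1024, so under this assumption even a 1080p screenshot is downscaled, and a 4K screenshot loses more than half its linear resolution. Small icons in dense professional software are the first targets to suffer from that.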

Query it with the prompt contract used in training and evaluation:

python

import base64
import json

from openai import OpenAI
from PIL import Image  # Pillow, used only to read the screenshot's pixel size

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# Encode the screenshot for the request and record its original dimensions.
with open("screenshot.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()
with Image.open("screenshot.png") as img:
    original_width, original_height = img.size

target = "the Save button in the toolbar"
prompt = (
    "Localize an element on the GUI image according to the provided target "
    "and output a click position.\n"
    ' * You must output a valid JSON following the format: '
    '{"x": int 0-1000, "y": int 0-1000}\n'
    f" Your target is:\n{target}"
)

resp = client.chat.completions.create(
    model="duvo-eye-1",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            {"type": "text", "text": prompt},
        ],
    }],
    temperature=0.0,
    max_tokens=64,
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)

pred = json.loads(resp.choices[0].message.content)  # {"x": ..., "y": ...} in [0, 1000]
# Scale by the ORIGINAL screenshot size, not the downscaled model input.
click_x = round(pred["x"] / 1000 * original_width)
click_y = round(pred["y"] / 1000 * original_height)

duvo-eye-1 produced 0.0% malformed outputs across all runs without constrained decoding; for a hard guarantee, vLLM's guided_json can enforce the schema.
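For callers who prefer to validate on the client side as well, here is a minimal sketch of a defensive parser together with a JSON Schema matching the prompt contract. Both `CLICK_SCHEMA` and `parse_click` are names introduced here and are not part of the published harness. How the schema is handed to vLLM's guided decoding (for instance through `extra_body` in the OpenAI client) depends on the installed vLLM version and should be checked against its documentation.

```python
import json

# JSON Schema for the prompt contract; an assumed shape for guided decoding.
CLICK_SCHEMA = {
    "type": "object",
    "properties": {
        "x": {"type": "integer", "minimum": 0, "maximum": 1000},
        "y": {"type": "integer", "minimum": 0, "maximum": 1000},
    },
    "required": ["x", "y"],
    "additionalProperties": False,
}

def parse_click(raw, width, height):
    """Return (px_x, px_y) in original-screenshot pixels, or None if malformed."""
    try:
        pred = json.loads(raw)
    except (TypeError, json.JSONDecodeError):
        return None
    if not isinstance(pred, dict):
        return None
    x, y = pred.get("x"), pred.get("y")
    for value in (x, y):
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            return None
        if not 0 <= value <= 1000:
            return None
    return round(x / 1000 * width), round(y / 1000 * height)

print(parse_click('{"x": 500, "y": 250}', 1920, 1080))  # (960, 270)
print(parse_click("click at 500, 250", 1920, 1080))     # None
```

Returning None instead of raising lets the surrounding agent loop treat a malformed answer as a miss and retry or escalate, which mirrors how malformed outputs are scored in the benchmarks.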

Training

LoRA fine-tune of the base model on all 14,923 rows of duvoai/SynthUI, 1 epoch.

| Setting | Value |
|---|---|
| Method | LoRA, rank 64, alpha 128, dropout 0.05 |
| Target modules | all-linear, including MoE expert tensors; vision encoder + aligner frozen |
| LR | 7e-5, cosine decay, 3% warmup |
| Epochs / batch | 1 epoch, global batch 64, packing, max length 8192, bf16 |
| Stack | ms-swift Megatron backend (expert parallelism EP=4), 4×H100, 33 minutes |
| Export | LoRA merged into the base; published as bf16 HF weights |
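For reference, the merge step can be written out using the standard LoRA formulation. This assumes the conventional alpha/r scaling; the card does not say whether a variant such as rank-stabilized scaling was used, so treat the factor below as an assumption.

```latex
W' = W + \frac{\alpha}{r}\, B A,
\qquad B \in \mathbb{R}^{d_{\text{out}} \times r},\quad
A \in \mathbb{R}^{r \times d_{\text{in}}},
\qquad r = 64,\ \alpha = 128 \ \Rightarrow\ \frac{\alpha}{r} = 2
```

Merging computes W' once for each adapted linear layer and stores it in bf16, so the published weights serve at the same cost as the base model, with no adapter to load at inference time.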

Intended Use & Limitations

Intended use: resolving a natural-language element description on a screenshot to a single click coordinate, as one step inside a larger agent loop. Strongest on enterprise/back-office web UIs resembling the training distribution.

Limitations:

  • Single-shot grounding only. One click point per request — no planning, navigation, multi-step execution, or function calling. The model always returns a coordinate and cannot abstain when the target is absent. For destructive actions in enterprise apps, pair it with a confidence/verification layer or human-in-the-loop — it is a grounding component, not an end-to-end agent.
  • Synthetic training domain. SynthUI is DOM-rendered and private; the largest gains are in-domain and may not transfer fully to real UIs. On real-world public benchmarks the among-answered grounding edge over the base is +2–3 pt (see decomposition).
  • Single-shot has a ceiling on ScreenSpot-Pro. At 72.9 single-shot, duvo-eye-1 is the second-highest single-forward-pass entry on the public leaderboard, but the top of the board (~78–81) is held by multi-step scaffolds (iterative zoom, ensembles, agentic refinement). To reach the absolute top, pair it with such a scaffold; its weakest sub-score is icons in dense professional software (icon 58.9 vs text 81.5 on ScreenSpot-Pro).
  • Self-reported but verified. Standard point-in-region scoring, all public-benchmark predictions published; reproduced under the maintainers' own scorers on three benchmarks (ScreenSpot-Pro, OSWorld-G, UI-I2E-Bench). Other boards are self-run with predictions published but not yet third-party-confirmed; small cross-leaderboard differences can arise from prompt/decoding.
  • Instruction language. Trained and evaluated with English-language instructions, but robust to non-English UI text — its training screenshots include English, French, and German enterprise interfaces — and it inherits the broader multilingual capabilities of its base, Holo-3.1-35B-A3B.

Citation

bibtex

@misc{duvo2026duvoeye1,
  title={duvo-eye-1: GUI Grounding for Enterprise Computer Use},
  author={Duvo},
  year={2026},
  url={https://huggingface.co/duvoai/duvo-eye-1},
}

Model provider: duvoai

Model tree: base Hcompany/Holo-3.1-35B-A3B; fine-tuned: this model

Modalities: input Video, Text, Image; output Text
