duvoai/duvo-eye-1 API & Inference Endpoint

Model Description

Given a screenshot and a target description, duvo-eye-1 outputs a click point {"x": int, "y": int}, with both coordinates normalized to [0, 1000]. It is the grounding component of a computer-use agent stack, not an agent by itself: a planner decides what to interact with, and duvo-eye-1 resolves that description to a location on screen.

Developed by: Duvo
Model type: Vision-Language Model for single-step GUI element grounding (click-point localization)
Base model: Hcompany/Holo-3.1-35B-A3B — 35B-A3B MoE, 3B active, Apache 2.0
Method: LoRA (rank 64, alpha 128), 1 epoch on duvoai/SynthUI; merged to bf16 weights
Supported environments: web, desktop, and professional-software UIs; English-language instructions over English / French / German interfaces
Evaluation evidence: per-sample predictions + harness at duvoai/duvo-eye-1-evals
License: Apache 2.0
Contact: work@tomcupr.com

Benchmark Results

vs. the base model (Holo-3.1-35B-A3B)

The only fair baseline for duvo-eye-1 is the model it fine-tunes, Holo-3.1-35B-A3B. H Company publishes results for it on two public grounding benchmarks, ScreenSpot-Pro and OSWorld-G; the SynthUI row is our own measurement (see ‡):

Benchmark	Holo-3.1-35B-A3B (base)	duvo-eye-1
ScreenSpot-Pro (1,581)	71.5	72.9
OSWorld-G (510 / 564)	78.8 †	78.0 / 70.6
SynthUI test (in-domain, private) ‡	62.5	86.6

duvo-eye-1 edges the base on ScreenSpot-Pro (72.9 vs. the base's published all-samples 71.5; our 72.9 is reproduced at 72.87 under the official harness). On OSWorld-G the two are on par but measured differently: † the base's 78.8 comes from H Company's internal implementation, while ours is 78.0 on the standard 510-sample subset (70.6 on the full 564 with refusals counted as misses) under the maintainer's own scorer. The fine-tune's clearer gains are output reliability (0.0% malformed outputs vs. the base's 14–21% under the same JSON prompt contract and decoding settings) and in-domain accuracy.

‡ SynthUI is Duvo's private dataset — the enterprise back-office target domain — held out from training; the base figure is our measurement under the same harness. It is an in-distribution result.

Standing on the broader grounding boards

Holo-3.1-35B-A3B reports no number on these; duvo-eye-1 is shown against the field (larger models included). Standings are as of 2026-06 and may change.

Benchmark	duvo-eye-1 (3B active)	Field reference
ScreenSpot-v2 (1,272)	95.1	parity at the top — UI-Venus-72B 95.3 (0.2 pt, saturated board)
UI-I2E-Bench (1,477)	84.2	#1 on the maintained leaderboard (next listed UGround-V1-72B 76.3); paper-reported UI-Ins-32B 87.3 is higher but not on the board
UI-Vision element grounding (5,479)	64.4	exceeds the best published number we're aware of (UI-Venus-1.5-30B 54.7); self-run, not yet third-party-confirmed
WebClick (1,639)	93.6	—
Showdown-Clicks (557)	78.8	tied with top public entry ace-control-medium 77.6 (1.2 pt within the ±3.5 pp CI)

These place duvo-eye-1 at or near the top of each board at a 3B-active serving cost. On ScreenSpot-Pro (72.9 single-shot), the entries that outscore it on the public grounding leaderboard are almost all multi-step scaffolds (iterative zoom, heterogeneous ensembles, agentic refinement), a separate and more expensive inference class. Among single-forward-pass models, 72.9 is the second-highest entry on the board: behind one 8B model at 73.2, and ahead of every larger single model, including a 235B model with 22B active parameters at 70.6. No general-purpose VLM is close (the strongest, Qwen2.5-VL-72B, scores 53.3). Such scaffolds compose on top of duvo-eye-1 as readily as on any grounder.

Other OSWorld-G splits: refined instructions 82.9 (510) / 75.0 (564). Showdown uses its official point-in-bbox metric (is_in_bbox; n=557, 95% CI ≈ ±3.5 pp). Competitor figures are leaderboard- or paper-sourced as noted; the duvo-eye-1 numbers come from our published harness.
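
The ±3.5 pp figure quoted for Showdown-Clicks is roughly what a binomial confidence interval gives at this sample size. The sketch below uses the normal-approximation (Wald) interval, which is an assumption: the card does not say which interval was used, and the helper name is illustrative.

```python
import math


def wald_half_width_pp(accuracy_pct: float, n: int, z: float = 1.96) -> float:
    """Half-width, in percentage points, of a normal-approximation CI for an accuracy."""
    p = accuracy_pct / 100.0
    return 100.0 * z * math.sqrt(p * (1.0 - p) / n)


# Showdown-Clicks: 78.8% on n=557 samples (figures from the table above).
half_width = wald_half_width_pp(78.8, 557)
gap = 78.8 - 77.6  # difference to the next public entry, in points
print(f"half-width = {half_width:.1f} pp, gap = {gap:.1f} pt")
```

Under this approximation the half-width comes out near 3.4 pp, so a 1.2 pt gap sits well inside it, which is consistent with the card treating the two entries as tied.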

What's driving the numbers (honest decomposition)

A malformed output scores as a miss, so headline deltas over the base mix two effects. Restricting to samples where each model produced parseable output:

	Base (among answered)	duvo-eye-1 (among answered)
ScreenSpot-Pro	70.6	72.9
OSWorld-G	75.4	78.0

(The base's ScreenSpot-Pro figure appears two ways in this card — the same model, two denominators: 71.5 is H Company's published all-samples score; 70.6 is its among-answered, parseable-only score under our reproduction.)

So on real-world public benchmarks, the gain over the base is mostly output reliability, plus a consistent +2–3 pt grounding edge among answered samples. The base emits malformed JSON on 14–21% of samples under our prompt contract (measured with the identical prompt and decoding for both models), a failure mode that constrained decoding would also largely fix. On SynthUI the base's malformed rate is only 0.7%, so its lower score there is genuine in-domain difficulty, not format errors. Predictions are non-degenerate: 98% are unique points, and the median target covers 0.03% of screen area.
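
The decomposition above separates two quantities that an all-samples score mixes: how often the output parses, and how often a parsed point is a hit. A minimal sketch, assuming each prediction record carries a parsed flag and a hit flag (the record layout and the toy data are hypothetical, not the schema of duvoai/duvo-eye-1-evals):

```python
from dataclasses import dataclass


@dataclass
class Prediction:
    parsed: bool  # True if the raw output was valid JSON with x and y
    hit: bool     # True if the parsed point fell inside the target region


def decompose(preds: list[Prediction]) -> dict[str, float]:
    """Split an all-samples score into malformed rate and among-answered accuracy."""
    total = len(preds)
    answered = [p for p in preds if p.parsed]
    hits = sum(p.hit for p in answered)
    return {
        # A malformed output counts as a miss in the all-samples score.
        "all_samples_acc": 100.0 * hits / total,
        "malformed_rate": 100.0 * (total - len(answered)) / total,
        "among_answered_acc": 100.0 * hits / len(answered) if answered else 0.0,
    }


# Toy data: 10 samples, 2 malformed, 6 hits among the 8 answered.
toy = (
    [Prediction(parsed=True, hit=True)] * 6
    + [Prediction(parsed=True, hit=False)] * 2
    + [Prediction(parsed=False, hit=False)] * 2
)
stats = decompose(toy)
print(stats)
```

In the toy data the all-samples score is 60% while the among-answered score is 75%; the 15-point gap comes entirely from the 20% malformed outputs, not from grounding quality.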

Reproducibility & comparability

These are self-reported numbers — we ran the evals, as is standard for a model card — but built to be checkable, not taken on trust:

Every public-benchmark prediction is published (raw output, parsed point, hit/miss) at duvoai/duvo-eye-1-evals, with the harness, so anyone can recompute the numbers.
Three benchmarks are reproduced under their own official harnesses, each matching our harness to within 0.2 pt (official result files included in the evals dataset): ScreenSpot-Pro 72.87 (eval_screenspot_pro.py, PR #29); OSWorld-G 78.0 (510) / 70.6 (564) under xlang-ai's own eval.py scorer; UI-I2E-Bench 84.4 under its is_correct scorer, against 84.2 under ours.
Submitted to every official channel that exists: the ScreenSpot-Pro / ScreenSpot-v2 grounding leaderboard, OSWorld-G, and UI-I2E-Bench.
One caveat: the official OSWorld-G metric grants refusal credit our coordinate-only protocol cannot earn, so we report both the refusal-excluded (510) and refusal-as-miss (564) figures.
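
The two OSWorld-G figures differ only in the denominator. A minimal sketch of the arithmetic, assuming the 54 extra samples in the full set are the refusal cases and all count as misses for a coordinate-only model; the hit count is back-computed from a rounded score and is therefore an estimate, and the function name is illustrative:

```python
def refusal_as_miss(acc_excluded_pct: float, n_answerable: int, n_total: int) -> float:
    """Rescale an accuracy on the answerable subset to the full set,
    where every refusal sample counts as a miss."""
    # Estimated hit count, recovered from a score rounded to one decimal.
    hits = round(acc_excluded_pct / 100.0 * n_answerable)
    return 100.0 * hits / n_total


standard = refusal_as_miss(78.0, 510, 564)
refined = refusal_as_miss(82.9, 510, 564)
print(f"standard: {standard:.1f}, refined: {refined:.1f}")
```

This reproduces the 70.6 and 75.0 reported for the full 564-sample set from the 510-sample scores.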

Get Started

Serve with vLLM (bf16 weights are ~66 GB: one 141 GB H200 at TP=1, or 2×80 GB at TP=2):

bash
vllm serve duvoai/duvo-eye-1 \
  --served-model-name duvo-eye-1 \
  --tensor-parallel-size 2 \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.90 \
  --mm-processor-kwargs '{"max_pixels": 1310720}'

For high-resolution screenshots (4K+, professional software), raise max_pixels — we used {"max_pixels": 8000000} for ScreenSpot-Pro.
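
To see why the pixel budget matters, it helps to estimate how much a screenshot shrinks before the model sees it. The sketch below assumes an aspect-preserving downscale until width × height fits under max_pixels; real processors typically also round dimensions to patch-size multiples, so treat the result as an approximation. The function name is illustrative.

```python
import math


def approx_model_input_size(width: int, height: int, max_pixels: int) -> tuple[int, int]:
    """Estimate the resized image dimensions under a max_pixels budget.

    Assumption: aspect-preserving downscale so that the area fits the budget.
    Images already under the budget are left unchanged.
    """
    area = width * height
    if area <= max_pixels:
        return width, height
    scale = math.sqrt(max_pixels / area)
    return math.floor(width * scale), math.floor(height * scale)


# A 4K screenshot under the serving budget above and the ScreenSpot-Pro budget.
for budget in (1_310_720, 8_000_000):
    w, h = approx_model_input_size(3840, 2160, budget)
    print(f"max_pixels={budget}: ~{w}x{h}")
```

Under this assumption a 3840×2160 screenshot is reduced to roughly 40% of its linear size at the smaller budget, which is where small icons in dense professional software become hard to resolve.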

Query it with the prompt contract used in training and evaluation:

python
import base64
import json

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("screenshot.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

target = "the Save button in the toolbar"
prompt = (
    "Localize an element on the GUI image according to the provided target "
    "and output a click position.\n"
    ' * You must output a valid JSON following the format: '
    '{"x": int 0-1000, "y": int 0-1000}\n'
    f" Your target is:\n{target}"
)

resp = client.chat.completions.create(
    model="duvo-eye-1",
    messages=[{
        "role": "user",
        "content": [
            {"type": "image_url",
             "image_url": {"url": f"data:image/png;base64,{image_b64}"}},
            {"type": "text", "text": prompt},
        ],
    }],
    temperature=0.0,
    max_tokens=64,
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)

pred = json.loads(resp.choices[0].message.content)  # {"x": ..., "y": ...} in [0, 1000]

# Scale by the ORIGINAL screenshot size, not the downscaled model input.
# Example values: replace with the real pixel dimensions of screenshot.png.
original_width, original_height = 1920, 1080
click_x = round(pred["x"] / 1000 * original_width)
click_y = round(pred["y"] / 1000 * original_height)

duvo-eye-1 produced 0.0% malformed outputs across all runs without constrained decoding; for a hard guarantee, vLLM's guided_json can enforce the schema.
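
Even with a 0.0% observed malformed rate, a caller may want to validate the response before clicking. A minimal sketch of a defensive parser that also applies the scaling shown above; the function name, the error handling, and the clamping policy are illustrative choices, not part of the model's contract:

```python
import json


def to_pixel_click(raw: str, width: int, height: int) -> tuple[int, int]:
    """Parse a {"x": int, "y": int} response in [0, 1000] and map it to pixel
    coordinates of the original screenshot.

    Raises ValueError on malformed or out-of-range output.
    """
    try:
        pred = json.loads(raw)
        x, y = pred["x"], pred["y"]
    except (json.JSONDecodeError, KeyError, TypeError) as exc:
        raise ValueError(f"malformed model output: {raw!r}") from exc
    for name, value in (("x", x), ("y", y)):
        if isinstance(value, bool) or not isinstance(value, (int, float)):
            raise ValueError(f"{name} is not a number: {value!r}")
        if not 0 <= value <= 1000:
            raise ValueError(f"{name}={value} is outside [0, 1000]")
    # Clamp so that a prediction of exactly 1000 maps to the last valid pixel.
    px = min(round(x / 1000 * width), width - 1)
    py = min(round(y / 1000 * height), height - 1)
    return px, py


print(to_pixel_click('{"x": 500, "y": 250}', 1920, 1080))  # (960, 270)
```

Raising on bad output, instead of clicking a default location, lets the surrounding agent loop decide whether to retry or stop.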

Training

LoRA fine-tune of the base model on all 14,923 rows of duvoai/SynthUI, 1 epoch.


Method	LoRA, rank 64, alpha 128, dropout 0.05
Target modules	all-linear, including MoE expert tensors; vision encoder + aligner frozen
LR	7e-5, cosine decay, 3% warmup
Epochs / batch	1 epoch, global batch 64, packing, max length 8192, bf16
Stack	ms-swift Megatron backend (expert parallelism EP=4), 4×H100, 33 minutes
Export	LoRA merged into the base; published as bf16 HF weights

Intended Use & Limitations

Intended use: resolving a natural-language element description on a screenshot to a single click coordinate, as one step inside a larger agent loop. Strongest on enterprise/back-office web UIs resembling the training distribution.

Limitations:

Single-shot grounding only. One click point per request — no planning, navigation, multi-step execution, or function calling. The model always returns a coordinate and cannot abstain when the target is absent. For destructive actions in enterprise apps, pair it with a confidence/verification layer or human-in-the-loop — it is a grounding component, not an end-to-end agent.
Synthetic training domain. SynthUI is DOM-rendered and private; the largest gains are in-domain and may not transfer fully to real UIs. On real-world public benchmarks the among-answered grounding edge over the base is +2–3 pt (see decomposition).
Single-shot has a ceiling on ScreenSpot-Pro. At 72.9 single-shot, duvo-eye-1 is the second-highest single-forward-pass entry on the public leaderboard, but the top of the board (~78–81) is held by multi-step scaffolds (iterative zoom, ensembles, agentic refinement). To reach the absolute top, pair it with such a scaffold; its weakest sub-score is icons in dense professional software (icon 58.9 vs text 81.5 on ScreenSpot-Pro).
Self-reported but verified. Standard point-in-region scoring, all public-benchmark predictions published; reproduced under the maintainers' own scorers on three benchmarks (ScreenSpot-Pro, OSWorld-G, UI-I2E-Bench). Other boards are self-run with predictions published but not yet third-party-confirmed; small cross-leaderboard differences can arise from prompt/decoding.
Instruction language. Trained and evaluated with English-language instructions, but robust to non-English UI text — its training screenshots include English, French, and German enterprise interfaces — and it inherits the broader multilingual capabilities of its base, Holo-3.1-35B-A3B.
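
One simple form of the verification layer suggested above is a hit-test: before executing a click, check that the predicted point lands inside the bounding box of an element the agent already knows about (for example from a DOM or accessibility tree). The sketch below assumes such boxes are available in the same 0 to 1000 normalized space as the model output; the types and names are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Box:
    label: str
    left: float
    top: float
    right: float
    bottom: float

    def contains(self, x: float, y: float) -> bool:
        return self.left <= x <= self.right and self.top <= y <= self.bottom


def verify_click(x: float, y: float, candidates: list[Box]) -> Optional[Box]:
    """Return the smallest candidate box containing the point, or None if no
    box contains it (a signal to re-query or escalate to a human)."""
    hits = [b for b in candidates if b.contains(x, y)]
    if not hits:
        return None
    return min(hits, key=lambda b: (b.right - b.left) * (b.bottom - b.top))


boxes = [Box("toolbar", 0, 0, 1000, 60), Box("Save", 40, 10, 90, 50)]
match = verify_click(62, 31, boxes)
print(match.label if match else "no element under predicted point")
```

Because the model cannot abstain, a None result here is the only place an absent target would surface, so it is worth treating as a hard stop for destructive actions.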

Citation

bibtex
@misc{duvo2026duvoeye1,
      title={duvo-eye-1: GUI Grounding for Enterprise Computer Use},
      author={Duvo},
      year={2026},
      url={https://huggingface.co/duvoai/duvo-eye-1},
}
