License: apache-2.0
Model Description
Given a screenshot and a target description, duvo-eye-1 outputs {"x": int, "y": int} in [0, 1000]. It is the grounding component of a computer-use agent stack, not an agent by itself: a planner decides what to interact with, duvo-eye-1 resolves that description to where.
- Developed by: Duvo
- Model type: Vision-Language Model for single-step GUI element grounding (click-point localization)
- Base model: Hcompany/Holo-3.1-35B-A3B — 35B-A3B MoE, 3B active, Apache 2.0
- Method: LoRA (rank 64, alpha 128), 1 epoch on duvoai/SynthUI; merged to bf16 weights
- Supported environments: web, desktop, and professional-software UIs; English-language instructions over English / French / German interfaces
- Evaluation evidence: per-sample predictions + harness at duvoai/duvo-eye-1-evals
- License: Apache 2.0
- Contact: work@tomcupr.com
Benchmark Results
vs. the base model (Holo-3.1-35B-A3B)
duvo-eye-1's only fair baseline is the model it fine-tunes, Holo-3.1-35B-A3B. H Company publishes scores for it on two public grounding benchmarks, ScreenSpot-Pro and OSWorld-G:
| Benchmark | Holo-3.1-35B-A3B (base) | duvo-eye-1 |
|---|---|---|
| ScreenSpot-Pro (1,581) | 71.5 | 72.9 |
| OSWorld-G (510 / 564) | 78.8 † | 78.0 / 70.6 |
| SynthUI test (in-domain, private) ‡ | 62.5 | 86.6 |
duvo-eye-1 edges the base on ScreenSpot-Pro (72.9 vs 71.5, published all-samples; our 72.9 is reproduced at 72.87 under the official harness). On OSWorld-G the two are on par, measured differently — † the base's 78.8 is H Company's internal implementation, while ours is 78.0 on the standard 510-sample subset (70.6 on the full 564 with refusals as misses) under the maintainer's own scorer. The fine-tune's clearer gains are output reliability (0.0% malformed outputs vs the base's 14–21% under the same constrained-JSON harness) and in-domain accuracy.
‡ SynthUI is Duvo's private dataset — the enterprise back-office target domain — held out from training; the base figure is our measurement under the same harness. It is an in-distribution result.
Standing on the broader grounding boards
Holo-3.1-35B-A3B reports no number on these; duvo-eye-1 is shown against the field (larger models included). Standings are as of 2026-06 and may change.
| Benchmark | duvo-eye-1 (3B active) | Field reference |
|---|---|---|
| ScreenSpot-v2 (1,272) | 95.1 | parity at the top — UI-Venus-72B 95.3 (0.2 pt, saturated board) |
| UI-I2E-Bench (1,477) | 84.2 | #1 on the maintained leaderboard (next listed UGround-V1-72B 76.3); paper-reported UI-Ins-32B 87.3 is higher but not on the board |
| UI-Vision element grounding (5,479) | 64.4 | exceeds the best published number we're aware of (UI-Venus-1.5-30B 54.7); self-run, not yet third-party-confirmed |
| WebClick (1,639) | 93.6 | — |
| Showdown-Clicks (557) | 78.8 | statistically tied with the top public entry, ace-control-medium at 77.6 (the 1.2 pt gap is within the ±3.5 pp CI) |
These place duvo-eye-1 at or near the top of each board at a 3B-active serving cost. On ScreenSpot-Pro (72.9 single-shot), the entries that outscore it on the public grounding leaderboard are almost all multi-step scaffolds — iterative zoom, heterogeneous ensembles, agentic refinement — a separate, more expensive inference class. Among single-forward-pass models, 72.9 is the second-highest entry on the board (behind one 8B model at 73.2, ahead of every larger single model — including a 235B-active-22B one at 70.6); no general-purpose VLM is close (the strongest, Qwen2.5-VL-72B, scores 53.3). Such scaffolds compose on top of duvo-eye-1 as readily as on any grounder. Other OSWorld-G splits: refined instructions 82.9 (510) / 75.0 (564). Showdown uses its official point-in-bbox metric (is_in_bbox; n=557, 95% CI ≈ ±3.5 pp). Competitor figures are leaderboard- or paper-sourced as noted; the duvo-eye-1 numbers come from our published harness.
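As a rough check on the Showdown-Clicks interval quoted above, the sketch below scores a click with a point-in-box test and computes a normal-approximation confidence interval for 78.8% accuracy on 557 samples. The function names, the `(x1, y1, x2, y2)` box layout, and the inclusive edges are assumptions for illustration, not the benchmark's official `is_in_bbox` code, and the card does not say which interval method it used.

```python
import math


def point_in_bbox(x, y, bbox):
    """True if (x, y) lies inside bbox = (x1, y1, x2, y2), edges inclusive (assumed)."""
    x1, y1, x2, y2 = bbox
    return x1 <= x <= x2 and y1 <= y <= y2


def ci_half_width(accuracy, n, z=1.96):
    """Half-width of a normal-approximation binomial confidence interval."""
    return z * math.sqrt(accuracy * (1.0 - accuracy) / n)


half_width = ci_half_width(0.788, 557)
print(f"+/-{100 * half_width:.1f} pp")  # prints "+/-3.4 pp"
```

Under this approximation the half-width is about 3.4 pp, in line with the ≈ ±3.5 pp in the card, so a 1.2 pt gap between two entries is not distinguishable at this sample size.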
What's driving the numbers (honest decomposition)
A malformed output scores as a miss, so headline deltas over the base mix two effects. Restricting to samples where each model produced parseable output:
| Benchmark | Base (among answered) | duvo-eye-1 (among answered) |
|---|---|---|
| ScreenSpot-Pro | 70.6 | 72.9 |
| OSWorld-G | 75.4 | 78.0 |
(The base's ScreenSpot-Pro figure appears two ways in this card — the same model, two denominators: 71.5 is H Company's published all-samples score; 70.6 is its among-answered, parseable-only score under our reproduction.)
So on real-world public benchmarks, the gain over the base is mostly output reliability — the base emits malformed JSON on 14–21% of samples under our prompt contract (measured with the identical prompt and decoding for both models), which constrained decoding would also largely fix — plus a consistent +2–3 pt grounding edge among answered samples. On SynthUI the base's malformed rate is only 0.7%, so its lower score there is genuine in-domain difficulty, not format errors. Predictions are non-degenerate: 98% are unique points, and the median target covers 0.03% of screen area.
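The split between format errors and grounding errors can be sketched as follows. This is a minimal illustration, not the published harness: the record layout, the rule that out-of-range values count as malformed, and the toy data are all assumptions.

```python
import json


def parse_point(raw):
    """Return (x, y) if raw is JSON with integer x, y in [0, 1000], else None."""
    try:
        obj = json.loads(raw)
    except (TypeError, ValueError):
        return None
    if not isinstance(obj, dict):
        return None
    x, y = obj.get("x"), obj.get("y")
    for value in (x, y):
        if isinstance(value, bool) or not isinstance(value, int):
            return None
        if not 0 <= value <= 1000:
            return None
    return (x, y)


def decompose(records):
    """records: list of (raw_output, bbox), bbox = (x1, y1, x2, y2) in [0, 1000]."""
    total = len(records)
    answered = hits = 0
    for raw, (x1, y1, x2, y2) in records:
        point = parse_point(raw)
        if point is None:
            continue  # malformed output: a miss in the all-samples score
        answered += 1
        if x1 <= point[0] <= x2 and y1 <= point[1] <= y2:
            hits += 1
    return {
        "malformed_rate": (total - answered) / total,
        "all_samples_acc": hits / total,
        "among_answered_acc": hits / answered if answered else 0.0,
    }


# Toy records for illustration only (not benchmark data).
toy = [
    ('{"x": 500, "y": 500}', (400, 400, 600, 600)),  # hit
    ('{"x": 100, "y": 100}', (50, 50, 150, 150)),    # hit
    ('{"x": 900, "y": 900}', (0, 0, 100, 100)),      # answered, miss
    ("click at (12, 34)", (0, 0, 100, 100)),         # malformed
]
stats = decompose(toy)
```

On the toy data, one malformed output out of four pulls the all-samples score down to 0.5 while the among-answered score is 2/3, which is the same mechanism that separates the two base-model figures above.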
Reproducibility & comparability
These are self-reported numbers — we ran the evals, as is standard for a model card — but built to be checkable, not taken on trust:
- Every public-benchmark prediction is published (raw output, parsed point, hit/miss) at duvoai/duvo-eye-1-evals, with the harness, so anyone can recompute the numbers.
- Three benchmarks are reproduced under their own official harnesses, each matching our harness within rounding (official result files included in the evals dataset): ScreenSpot-Pro 72.87 (eval_screenspot_pro.py, PR #29); OSWorld-G 78.0 (510) / 70.6 (564) under xlang-ai's own eval.py scorer; UI-I2E-Bench 84.4 under its is_correct scorer.
- Submitted to every official channel that exists: the ScreenSpot-Pro / ScreenSpot-v2 grounding leaderboard, OSWorld-G, and UI-I2E-Bench.
- One caveat: the official OSWorld-G metric grants refusal credit our coordinate-only protocol cannot earn, so we report both the refusal-excluded (510) and refusal-as-miss (564) figures.
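The two OSWorld-G figures are the same hit count over two denominators. The short check below makes the arithmetic explicit; the hit counts are inferred from the rounded percentages in this card, not read from the evals dataset, so treat them as an assumption.

```python
ANSWERABLE = 510  # refusal-excluded subset
FULL = 564        # full set, refusal samples scored as misses


def accuracy(hits, n):
    """Accuracy in percent, rounded to one decimal place."""
    return round(100.0 * hits / n, 1)


standard_hits = round(0.780 * ANSWERABLE)  # inferred from 78.0%: 398 hits
refined_hits = round(0.829 * ANSWERABLE)   # inferred from 82.9%: 423 hits

print(accuracy(standard_hits, ANSWERABLE), accuracy(standard_hits, FULL))  # 78.0 70.6
print(accuracy(refined_hits, ANSWERABLE), accuracy(refined_hits, FULL))    # 82.9 75.0
```

Both pairs reported in this card (78.0 / 70.6 and the refined-instruction 82.9 / 75.0) are consistent with the 54 additional samples contributing no hits under a coordinate-only protocol.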
Get Started
Serve with vLLM (bf16 weights are ~66 GB: one 141 GB H200 at TP=1, or 2×80 GB at TP=2):
```bash
vllm serve duvoai/duvo-eye-1 \
  --served-model-name duvo-eye-1 \
  --tensor-parallel-size 2 \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.90 \
  --mm-processor-kwargs '{"max_pixels": 1310720}'
```
For high-resolution screenshots (4K+, professional software), raise max_pixels — we used {"max_pixels": 8000000} for ScreenSpot-Pro.
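To see why the pixel budget matters for 4K input, the sketch below estimates how far a screenshot is downscaled to fit a given max_pixels. It assumes an aspect-preserving resize that caps total pixel count and ignores any rounding to patch-size multiples; the actual processor behavior may differ, and `downscale_factor` is a hypothetical helper.

```python
import math


def downscale_factor(width, height, max_pixels):
    """Linear scale needed so that width * height fits within max_pixels (assumed rule)."""
    pixels = width * height
    if pixels <= max_pixels:
        return 1.0
    return math.sqrt(max_pixels / pixels)


# A 3840x2160 screenshot under the two budgets mentioned in this card.
for budget in (1_310_720, 8_000_000):
    scale = downscale_factor(3840, 2160, budget)
    print(budget, f"{scale:.2f}")
```

Under these assumptions the default budget leaves a 4K screenshot at roughly 40% of its native linear resolution, while 8,000,000 keeps it close to native, which matters for small icons in dense professional UIs.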
Query it with the prompt contract used in training and evaluation:
```python
import base64
import json

from openai import OpenAI
from PIL import Image  # Pillow; used here only to read the screenshot's pixel size

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

with open("screenshot.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode()

# Size of the ORIGINAL screenshot, needed to map the prediction back to pixels.
with Image.open("screenshot.png") as img:
    original_width, original_height = img.size

target = "the Save button in the toolbar"
prompt = (
    "Localize an element on the GUI image according to the provided target "
    "and output a click position.\n"
    ' * You must output a valid JSON following the format: '
    '{"x": int 0-1000, "y": int 0-1000}\n'
    f" Your target is:\n{target}"
)

resp = client.chat.completions.create(
    model="duvo-eye-1",
    messages=[
        {
            "role": "user",
            "content": [
                {
                    "type": "image_url",
                    "image_url": {"url": f"data:image/png;base64,{image_b64}"},
                },
                {"type": "text", "text": prompt},
            ],
        }
    ],
    temperature=0.0,
    max_tokens=64,
    extra_body={"chat_template_kwargs": {"enable_thinking": False}},
)

pred = json.loads(resp.choices[0].message.content)  # {"x": ..., "y": ...} in [0, 1000]
click_x = round(pred["x"] / 1000 * original_width)   # scale by the ORIGINAL screenshot
click_y = round(pred["y"] / 1000 * original_height)  # size, not the downscaled input
```
duvo-eye-1 produced 0.0% malformed outputs across all runs without constrained decoding; for a hard guarantee, vLLM's guided_json can enforce the schema.
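A minimal sketch of what schema enforcement could look like with the OpenAI-compatible client: the request is built as a plain dict so it can be inspected before sending. `CLICK_SCHEMA` and `build_request` are hypothetical names, and the `guided_json` key under `extra_body` is an assumption about your vLLM version; the structured-output parameter name has changed across releases, so check the documentation for the version you run.

```python
# JSON Schema matching the prompt contract: integer x and y in [0, 1000].
CLICK_SCHEMA = {
    "type": "object",
    "properties": {
        "x": {"type": "integer", "minimum": 0, "maximum": 1000},
        "y": {"type": "integer", "minimum": 0, "maximum": 1000},
    },
    "required": ["x", "y"],
    "additionalProperties": False,
}


def build_request(prompt, image_b64, model="duvo-eye-1"):
    """Build chat-completion kwargs that ask the server to enforce CLICK_SCHEMA."""
    return {
        "model": model,
        "messages": [
            {
                "role": "user",
                "content": [
                    {
                        "type": "image_url",
                        "image_url": {"url": f"data:image/png;base64,{image_b64}"},
                    },
                    {"type": "text", "text": prompt},
                ],
            }
        ],
        "temperature": 0.0,
        "max_tokens": 64,
        "extra_body": {
            "chat_template_kwargs": {"enable_thinking": False},
            "guided_json": CLICK_SCHEMA,  # assumed parameter name, version-dependent
        },
    }
```

With a client configured as in the query example, the call would be `client.chat.completions.create(**build_request(prompt, image_b64))`.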
Training
LoRA fine-tune of the base model on all 14,923 rows of duvoai/SynthUI, 1 epoch.
| Setting | Value |
|---|---|
| Method | LoRA, rank 64, alpha 128, dropout 0.05 |
| Target modules | all-linear, including MoE expert tensors; vision encoder + aligner frozen |
| LR | 7e-5, cosine decay, 3% warmup |
| Epochs / batch | 1 epoch, global batch 64, packing, max length 8192, bf16 |
| Stack | ms-swift Megatron backend (expert parallelism EP=4), 4×H100, 33 minutes |
| Export | LoRA merged into the base; published as bf16 HF weights |
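Two of the hyperparameters above can be made concrete with a short sketch: the LoRA scaling factor alpha / rank, and a learning-rate schedule with linear warmup followed by cosine decay. This is a generic illustration, not the ms-swift implementation; decay to zero, the warmup shape, and the step count in the example are assumptions (the real step count depends on packing).

```python
import math

PEAK_LR = 7e-5
WARMUP_FRACTION = 0.03


def lora_scale(alpha, rank):
    """Standard LoRA scaling applied to the low-rank update."""
    return alpha / rank


def lr_at(step, total_steps, peak_lr=PEAK_LR, warmup_fraction=WARMUP_FRACTION):
    """Linear warmup to peak_lr, then cosine decay to zero (assumed floor)."""
    warmup_steps = max(1, round(total_steps * warmup_fraction))
    if step < warmup_steps:
        return peak_lr * (step + 1) / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


scale = lora_scale(128, 64)  # 2.0
# Hypothetical 200-step run, used only to show the shape of the schedule.
schedule = [lr_at(s, 200) for s in (0, 6, 100, 200)]
```

With rank 64 and alpha 128 the update is scaled by 2.0; the schedule rises to 7e-5 over the first 3% of steps and then falls smoothly.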
Intended Use & Limitations
Intended use: resolving a natural-language element description on a screenshot to a single click coordinate, as one step inside a larger agent loop. Strongest on enterprise/back-office web UIs resembling the training distribution.
Limitations:
- Single-shot grounding only. One click point per request — no planning, navigation, multi-step execution, or function calling. The model always returns a coordinate and cannot abstain when the target is absent. For destructive actions in enterprise apps, pair it with a confidence/verification layer or human-in-the-loop — it is a grounding component, not an end-to-end agent.
- Synthetic training domain. SynthUI is DOM-rendered and private; the largest gains are in-domain and may not transfer fully to real UIs. On real-world public benchmarks the among-answered grounding edge over the base is +2–3 pt (see decomposition).
- Single-shot has a ceiling on ScreenSpot-Pro. At 72.9 single-shot, duvo-eye-1 is the second-highest single-forward-pass entry on the public leaderboard, but the top of the board (~78–81) is held by multi-step scaffolds (iterative zoom, ensembles, agentic refinement). To reach the absolute top, pair it with such a scaffold; its weakest sub-score is icons in dense professional software (icon 58.9 vs text 81.5 on ScreenSpot-Pro).
- Self-reported but verified. Standard point-in-region scoring, all public-benchmark predictions published; reproduced under the maintainers' own scorers on three benchmarks (ScreenSpot-Pro, OSWorld-G, UI-I2E-Bench). Other boards are self-run with predictions published but not yet third-party-confirmed; small cross-leaderboard differences can arise from prompt/decoding.
- Instruction language. Trained and evaluated with English-language instructions, but robust to non-English UI text — its training screenshots include English, French, and German enterprise interfaces — and it inherits the broader multilingual capabilities of its base, Holo-3.1-35B-A3B.
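As an illustration of how a multi-step scaffold can sit on top of a single-shot grounder, the sketch below runs a two-pass zoom: ground on the full screenshot, crop around the first prediction, ground again on the crop, and map the result back to original pixels. Everything here is hypothetical (the `Grounder` callable, the 0.5 zoom, the cropping rule); it is not one of the leaderboard scaffolds, and it shows only the coordinate bookkeeping.

```python
from typing import Callable, Tuple

Box = Tuple[int, int, int, int]  # (left, top, width, height) in original pixels
Point = Tuple[int, int]
# A grounder takes the region it is shown and returns (x, y) in [0, 1000]
# relative to that region, e.g. by cropping the screenshot and querying the model.
Grounder = Callable[[Box], Point]


def to_pixels(pred: Point, box: Box) -> Point:
    """Map a [0, 1000] prediction inside box to original-image pixels."""
    left, top, width, height = box
    return (left + round(pred[0] / 1000 * width),
            top + round(pred[1] / 1000 * height))


def zoom_box(center: Point, image_size: Tuple[int, int], zoom: float = 0.5) -> Box:
    """Crop of size zoom * image, centered on `center` and clamped to the image."""
    img_w, img_h = image_size
    width = max(1, round(img_w * zoom))
    height = max(1, round(img_h * zoom))
    left = min(max(center[0] - width // 2, 0), img_w - width)
    top = min(max(center[1] - height // 2, 0), img_h - height)
    return (left, top, width, height)


def two_pass_click(ground: Grounder, image_size: Tuple[int, int],
                   zoom: float = 0.5) -> Point:
    """Coarse prediction on the full image, then a refined one on a zoomed crop."""
    full = (0, 0, image_size[0], image_size[1])
    coarse = to_pixels(ground(full), full)
    crop = zoom_box(coarse, image_size, zoom)
    return to_pixels(ground(crop), crop)
```

The key detail is that the second prediction is relative to the crop, so it must be offset by the crop's origin and scaled by the crop's size, not the full screenshot's.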
Citation
```bibtex
@misc{duvo2026duvoeye1,
  title={duvo-eye-1: GUI Grounding for Enterprise Computer Use},
  author={Duvo},
  year={2026},
  url={https://huggingface.co/duvoai/duvo-eye-1},
}
```