inclusionAI

VISTA-4B

README

License: apache-2.0

Model Description

VISTA-4B is a GUI-grounding model that maps a screenshot and a natural-language instruction to a click coordinate in the normalized 0-1000 image frame.

  • View-consistent GRPO training. VISTA builds each GRPO comparison group from target-preserving views of the same GUI instance, with exact coordinate remapping across cropped views. This exposes localization behavior under semantically equivalent but geometrically different screenshots.
  • Self-verified cross-view anchoring. The training objective adds an oracle-format center-point anchor only when model-generated rollouts have already produced a maximum-reward prediction, stabilizing short coordinate generation without unconditional imitation on all-fail groups.
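The coordinate remapping behind target-preserving cropped views can be illustrated with a small sketch. This is not the authors' training code: the `Box` type and the functions `remap_box_to_crop` and `normalized_center` are hypothetical names introduced here, and the sketch assumes a view counts as target-preserving only when the crop fully contains the target element.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Box:
    """Axis-aligned box in pixel coordinates (hypothetical helper type)."""
    left: float
    top: float
    right: float
    bottom: float


def remap_box_to_crop(target: Box, crop: Box) -> Box:
    """Express `target` in the pixel frame of the cropped view.

    Assumption: a crop is target-preserving only if it fully contains
    the target, so anything else is rejected with ValueError.
    """
    contains = (
        crop.left <= target.left and target.right <= crop.right
        and crop.top <= target.top and target.bottom <= crop.bottom
    )
    if not contains:
        raise ValueError("crop does not preserve the target element")
    # Shifting by the crop origin is exact; no resampling is involved.
    return Box(
        target.left - crop.left,
        target.top - crop.top,
        target.right - crop.left,
        target.bottom - crop.top,
    )


def normalized_center(box: Box, view_width: float, view_height: float) -> tuple[float, float]:
    """Center of `box` in the normalized 0-1000 frame of a view."""
    cx = (box.left + box.right) / 2 / view_width * 1000
    cy = (box.top + box.bottom) / 2 / view_height * 1000
    return cx, cy


# A 100x40 px button on a 1920x1080 screenshot, viewed through a 400x200 crop.
target = Box(900, 500, 1000, 540)
crop = Box(800, 400, 1200, 600)
in_crop = remap_box_to_crop(target, crop)
print(normalized_center(target, 1920, 1080))  # center in the full screenshot
print(normalized_center(in_crop, 400, 200))   # center in the cropped view
```

The same element yields different normalized coordinates in the two views, which is the geometric variation the view-consistent groups are described as exposing.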

Evaluation

Accuracy is reported for GUI grounding. The model predicts a normalized coordinate in the 0-1000 frame, and the prediction is counted as correct if the point lies inside the target element. All reported results use deterministic decoding at temperature 0 and single-view inference.
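The correctness criterion above can be sketched as a point-in-box test. This is an illustrative sketch rather than the official evaluation script: the function names `is_hit` and `accuracy` are hypothetical, the target is assumed to be given as a pixel-space bounding box, and treating points on the box boundary as hits is an assumption.

```python
def is_hit(pred_xy, target_box, image_size):
    """Return True if a normalized prediction falls inside the target box.

    pred_xy:    (x, y) in the normalized 0-1000 frame.
    target_box: (left, top, right, bottom) in pixels (assumed format).
    image_size: (width, height) of the screenshot in pixels.
    """
    width, height = image_size
    # Map the normalized prediction back to pixel coordinates.
    px = pred_xy[0] / 1000 * width
    py = pred_xy[1] / 1000 * height
    left, top, right, bottom = target_box
    # Assumption: boundary points count as inside.
    return left <= px <= right and top <= py <= bottom


def accuracy(hits):
    """Percentage of correct predictions over a list of booleans."""
    hits = list(hits)
    if not hits:
        raise ValueError("no predictions to score")
    return 100.0 * sum(hits) / len(hits)


box = (900, 500, 1000, 560)
size = (1920, 1080)
results = [is_hit((500, 500), box, size), is_hit((100, 100), box, size)]
print(results, accuracy(results))
```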

Results on GUI Grounding benchmarks

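| Model | SSPro | SSV2 | OSWorld-G | OSWorld-G-R |
|---|---|---|---|---|
| Qwen3.5-4B | 60.3 | 90.4 | 54.4 | 66.8 |
| GRPO-4B | 62.2 | 94.2 | 59.9 | 69.2 |
| VISTA-4B | 64.2 | 93.8 | 61.2 | 69.7 |
| Δ | +2.0 | -0.4 | +1.3 | +0.5 |
| Qwen3.5-9B | 65.2 | 91.9 | 63.1 | 74.6 |
| GRPO-9B | 68.3 | 95.2 | 67.5 | 75.2 |
| VISTA-9B | 69.2 | 95.8 | 68.1 | 75.5 |
| Δ | +0.9 | +0.6 | +0.6 | +0.3 |
| Qwen3.5-35B-A3B | 68.6 | 93.8 | 65.8 | 72.5 |
| GRPO-35B-A3B | 71.7 | 95.7 | 70.4 | 74.3 |
| VISTA-35B-A3B | 72.9 | 95.8 | 71.5 | 75.3 |
| Δ | +1.2 | +0.1 | +1.1 | +1.0 |

Each Δ row is the VISTA score minus the GRPO score at the same model size.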

Quick Start

Use the same image-chat interface as the underlying Qwen3.5 vision-language model. The recommended prompt is:

text

Output the center point of the position corresponding to the instruction: {instruction}. The output should just be the coordinates of a point, in the format [x,y].

Example:

python

import torch
from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "inclusionAI/VISTA-4B"

# Load the model in bfloat16 and let device_map place it on the available device(s).
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

image = Image.open("screenshot.png").convert("RGB")
instruction = "Click the search button"

# Recommended grounding prompt with the instruction filled in.
prompt = (
    "Output the center point of the position corresponding to the instruction: "
    f"{instruction}. The output should just be the coordinates of a point, "
    "in the format [x,y]."
)

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": prompt},
        ],
    }
]

# Render the chat template to text, then tokenize the text and image together.
text = processor.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = processor(
    text=[text],
    images=[image],
    padding=True,
    return_tensors="pt",
).to(model.device)

# Greedy decoding; the answer is a short coordinate string.
generated = model.generate(
    **inputs,
    max_new_tokens=32,
    do_sample=False,
)

# Drop the prompt tokens and decode only the newly generated ones.
new_tokens = generated[:, inputs.input_ids.shape[1]:]
response = processor.batch_decode(new_tokens, skip_special_tokens=True)[0].strip()
print(response)  # e.g. [512,384]
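
To act on the prediction, the `[x,y]` string has to be parsed and scaled from the 0-1000 frame to the screenshot's pixel size. The sketch below shows one way to do that; `parse_point` and `to_pixels` are hypothetical helpers, not part of the model's API, and clamping out-of-range values to the 0-1000 frame is an assumption.

```python
import re

# Matches "[x,y]" with optional whitespace and optional decimal parts.
_POINT = re.compile(r"\[\s*(-?\d+(?:\.\d+)?)\s*,\s*(-?\d+(?:\.\d+)?)\s*\]")


def parse_point(response):
    """Extract the first [x,y] pair from the model response, or None."""
    match = _POINT.search(response)
    if match is None:
        return None
    return float(match.group(1)), float(match.group(2))


def to_pixels(point, width, height):
    """Scale a normalized 0-1000 point to integer pixel coordinates."""
    x, y = point
    # Assumption: clamp stray values into the normalized frame first.
    x = min(max(x, 0.0), 1000.0)
    y = min(max(y, 0.0), 1000.0)
    return round(x / 1000 * width), round(y / 1000 * height)


point = parse_point("[512,384]")
if point is not None:
    print(to_pixels(point, 1920, 1080))
```

Returning `None` for an unparseable response lets the caller decide whether to retry or skip the action instead of clicking at a default location.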

Citation

Please consider citing if you find our work useful:

plain

@misc{qiu2026vista,
  title={VISTA: View-Consistent Self-Verified Training for GUI Grounding},
  author={Xinyu Qiu and Yunzhu Zhang and Heng Jia and Shuheng Shen and Changhua Meng and Linchao Zhu},
  year={2026},
  eprint={2606.14579},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2606.14579},
}

Modalities

Input: Video, Text, Image

Output: Text
