License: apache-2.0
Runtime requirements: read first
The evaluation that produced 48.64% used vLLM 0.17.0 with very specific flags. Two things will silently give you wrong answers if you skip them:
- vllm==0.17.0, exactly. Newer vLLM releases handle Qwen3-VL image preprocessing differently and return coordinates in the wrong space (we've reproduced the regression with vllm >0.17). Pin the version.
- --mm-processor-kwargs '{"min_pixels": 1024, "max_pixels": 99999999}'. Without this, vLLM's default max_pixels downsamples the 4K–6K ScreenSpot-Pro screenshots to ~1280-wide, and the tiny widget targets become invisible to the model. Accuracy collapses.
Both of these are environmental, not model issues, but they're load-bearing for getting the published number.
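To see why the max_pixels setting matters, here is a rough sketch of how a pixel budget shrinks a large screenshot. It is a simplified model, not vLLM's or Qwen3-VL's actual preprocessor: the real smart-resize also rounds dimensions to patch multiples, and the 1280x720 budget below is an assumed value picked only to match the "~1280-wide" figure above. The function name is hypothetical.

```python
import math

def scale_under_budget(width, height, max_pixels):
    """Return the uniform scale factor that fits width*height under max_pixels.

    Simplified illustration only: ignores rounding to patch multiples.
    """
    pixels = width * height
    if pixels <= max_pixels:
        return 1.0
    return math.sqrt(max_pixels / pixels)

# Assumed budget (hypothetical): roughly a 1280x720 image.
ASSUMED_DEFAULT_BUDGET = 1280 * 720

for budget in (ASSUMED_DEFAULT_BUDGET, 99_999_999):
    s = scale_under_budget(3840, 2160, budget)
    # A 24 px icon in the original shrinks by the same factor.
    print(f"budget={budget}: scale={s:.3f}, 24 px icon -> {24 * s:.1f} px")
```

Under the assumed budget a 4K screenshot shrinks to a third of its size per side, so a 24 px icon ends up around 8 px. With max_pixels set to 99999999 the image passes through at full resolution.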
Install
Create a fresh conda env and install the pinned stack:
```bash
conda create -n vllm011 python=3.11 -y
conda activate vllm011

# Pinned vLLM (DO NOT upgrade)
python -m uv pip install vllm==0.17.0

# Transformers + Pillow + requests for the client
python -m uv pip install transformers==4.57.6 pillow requests
```
(If you don't have uv: pip install uv first, or just use pip install
directly — uv is a speed optimization, not a correctness one.)
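Since an off-pin install fails silently, it can help to check the environment before serving. The sketch below uses the standard library's importlib.metadata; the find_mismatches helper and the PINNED dict are hypothetical names, and the check is a plain string comparison against the versions listed above.

```python
from importlib.metadata import PackageNotFoundError, version

# Versions taken from the install commands above.
PINNED = {"vllm": "0.17.0", "transformers": "4.57.6"}

def find_mismatches(pinned, get_version=version):
    """Return {package: installed_version_or_None} for every off-pin package."""
    mismatches = {}
    for name, wanted in pinned.items():
        try:
            installed = get_version(name)
        except PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != wanted:
            mismatches[name] = installed
    return mismatches

# Prints the offending packages, or a confirmation when everything matches.
print(find_mismatches(PINNED) or "all pins match")
```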
GPU: any CUDA 12.x card with ≥10 GB VRAM. Tested on RTX PRO 6000 Blackwell (sm_120, cu130) and should work on H100 / A100 / RTX 4090 unchanged.
Serve
```bash
conda activate vllm011
CUDA_VISIBLE_DEVICES=0 python -m vllm.entrypoints.openai.api_server \
  --model Datawall/brend-2b-260602 \
  --served-model-name brend-2b \
  --port 8003 \
  --gpu-memory-utilization 0.4 \
  --max-model-len 16384 \
  --max-num-seqs 32 \
  --limit-mm-per-prompt '{"image": 1}' \
  --mm-processor-kwargs '{"min_pixels": 1024, "max_pixels": 99999999}'
```
A 2B BF16 model fits in ~5 GB; the rest of the 0.4 utilization budget is
KV cache for batched serving. Bump --gpu-memory-utilization and
--max-num-seqs if you have headroom.
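The memory split described above can be estimated with simple arithmetic. This is a back-of-the-envelope sketch, not a measurement: it uses the ~5 GB weight figure and the 0.4 setting from this README, ignores activation memory and other runtime overhead, and the function name is hypothetical.

```python
def kv_cache_budget_gb(total_vram_gb, gpu_memory_utilization=0.4, weights_gb=5.0):
    """Rough KV-cache headroom: the vLLM memory budget minus model weights."""
    return total_vram_gb * gpu_memory_utilization - weights_gb

for vram in (96, 24, 10):
    print(f"{vram} GB card: ~{kv_cache_budget_gb(vram):.1f} GB left for KV cache")
```

Under this estimate the result goes negative on a 10 GB card at 0.4, so smaller cards likely need a higher --gpu-memory-utilization than the value in the serve command.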
Use (OpenAI-compatible client)
```python
import base64, re
from io import BytesIO

from PIL import Image
import requests

VLLM_URL = "http://localhost:8003/v1/chat/completions"
MODEL = "brend-2b"

SYSTEM_PROMPT = """You are a helpful assistant. The user will give you an instruction, and you MUST left click on the corresponding UI element via tool call. If you are not sure about where to click, guess a most likely one.

# Tools

You may call one or more functions to assist with the user query.

You are provided with function signatures within <tools></tools> XML tags:
<tools>
{"type": "function", "function": {"name": "computer_use", "description": "Use a mouse to interact with a computer.\\n* The screen's resolution is 1000x1000.\\n* Make sure to click any buttons, links, icons, etc with the cursor tip in the center of the element. \\n* You can only use the left_click action to interact with the computer.", "parameters": {"properties": {"action": {"description": "The action to perform. The available actions are:\\n* `left_click`: Click the left mouse button with coordinate (x, y).", "enum": ["left_click"], "type": "string"}, "coordinate": {"description": "(x, y): The x (pixels from the left edge) and y (pixels from the top edge) coordinates to move the mouse to. Required only by `action=left_click`.", "type": "array"}, "required": ["action"], "type": "object"}}}
</tools>

For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
<tool_call>
{"name": <function-name>, "arguments": <args-json-object>}
</tool_call>"""


def img_to_data_url(img):
    buf = BytesIO()
    img.save(buf, format="PNG")
    return "data:image/png;base64," + base64.b64encode(buf.getvalue()).decode()


def ground(image_path, instruction):
    img = Image.open(image_path).convert("RGB")
    payload = {
        "model": MODEL,
        "temperature": 0.0,
        "max_tokens": 64,
        "messages": [
            {"role": "system", "content": [{"type": "text", "text": SYSTEM_PROMPT}]},
            {"role": "user", "content": [
                {"type": "image_url", "image_url": {"url": img_to_data_url(img)}},
                {"type": "text", "text": instruction},
            ]},
        ],
    }
    r = requests.post(VLLM_URL, json=payload, timeout=60)
    r.raise_for_status()
    text = r.json()["choices"][0]["message"]["content"]

    # Model emits a tool_call with coordinates in [0, 1000] relative space.
    m = re.search(r'"coordinate"\s*:\s*\[\s*(-?\d+\.?\d*)\s*,\s*(-?\d+\.?\d*)\s*\]', text)
    if not m:
        return None
    x_rel, y_rel = float(m.group(1)) / 1000.0, float(m.group(2)) / 1000.0

    # Scale to original image pixels:
    return (x_rel * img.width, y_rel * img.height)


print(ground("screenshot.png", "the save button in the top toolbar"))
```
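The regex in ground() pulls the two numbers straight out of the response text. If you would rather validate the whole tool call, a JSON-based parser is one alternative. This is a sketch, not part of the published eval client: parse_click is a hypothetical helper, and the sample string is hand-written from the format in the system prompt rather than captured model output.

```python
import json
import re

# Grab the JSON object between the <tool_call> tags declared in the prompt.
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)

def parse_click(text):
    """Return (x, y) in [0, 1000] space from a <tool_call> block, or None."""
    m = TOOL_CALL_RE.search(text)
    if not m:
        return None
    try:
        call = json.loads(m.group(1))
        x, y = call["arguments"]["coordinate"]
        return float(x), float(y)
    except (ValueError, KeyError, TypeError):
        # Malformed JSON, missing keys, or a coordinate that is not a pair.
        return None

# Hand-written sample in the format the system prompt asks for.
sample = (
    '<tool_call>\n'
    '{"name": "computer_use", "arguments": '
    '{"action": "left_click", "coordinate": [512, 250]}}\n'
    '</tool_call>'
)
print(parse_click(sample))  # (512.0, 250.0)
```

The regex approach still returns coordinates when the closing tag is cut off by max_tokens, while this version returns None in that case, so pick whichever failure mode suits your pipeline.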
Coordinate convention
The model emits (x, y) in [0, 1000] relative space (the computer_use
tool prompt declares a fake 1000x1000 screen, and Qwen3-VL is trained to
honor that). Divide by 1000 to get normalized [0, 1] coordinates, then
multiply by the original image's width/height to get pixels.
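As a worked example of that conversion, the sketch below maps one point to pixels. The helper name is hypothetical and the numbers are chosen only to make the arithmetic easy to follow.

```python
def rel1000_to_pixels(x_rel, y_rel, width, height):
    """Map a point from [0, 1000] relative space to pixel coordinates."""
    return (x_rel / 1000.0 * width, y_rel / 1000.0 * height)

# A click at (500, 250) on a 3840x2160 screenshot:
print(rel1000_to_pixels(500, 250, 3840, 2160))  # (1920.0, 540.0)
```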
Do not pre-resize the image client-side. vLLM's image preprocessor
handles smart-resize internally given the mm-processor-kwargs flags
above. Client-side resizing throws off the model.
Eval breakdown (ScreenSpot-Pro, full test set, single-pass inference)
| Section | Avg | Text | Icon |
|---|---|---|---|
| Development | 48.49 | 70.13 | 25.52 |
| Creative | 45.45 | 61.62 | 23.08 |
| CAD | 32.95 | 38.07 | 17.19 |
| Scientific | 49.21 | 65.28 | 28.18 |
| Office | 70.00 | 80.23 | 35.85 |
| Operating Systems | 47.45 | 62.62 | 29.21 |
| Overall | 48.39 | 62.23 | 25.99 |
(48.39% is the micro-average across all 1581 samples; the model-index 48.64% figure is the same eval at the peak checkpoint — small mismatch is a known Creative-group accounting discrepancy in the eval harness.)
Text grounding is meaningfully stronger than icon grounding across every category — typical for 2B-class grounders.
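The overall figure is a micro-average, which weights each section by its sample count, so it differs from a plain mean of the six section averages. The sketch below illustrates the difference. The per-section sample counts are not given in this README: the counts here are assumed values that sum to 1581, and you should replace them with the counts your harness reports. Function names are hypothetical.

```python
# (section, average accuracy in % from the table, ASSUMED sample count)
SECTIONS = [
    ("Development", 48.49, 299),
    ("Creative", 45.45, 341),
    ("CAD", 32.95, 261),
    ("Scientific", 49.21, 254),
    ("Office", 70.00, 230),
    ("Operating Systems", 47.45, 196),
]

def micro_average(rows):
    """Sample-weighted mean: every sample counts equally."""
    total = sum(n for _, _, n in rows)
    return sum(acc * n for _, acc, n in rows) / total

def macro_average(rows):
    """Unweighted mean: every section counts equally."""
    return sum(acc for _, acc, _ in rows) / len(rows)

print(f"micro: {micro_average(SECTIONS):.2f}")
print(f"macro: {macro_average(SECTIONS):.2f}")
```

Because the per-section figures are already rounded, the recomputed micro-average can differ from the table in the last digit. The macro-average comes out higher because the small, high-scoring Office section gets the same weight as the larger ones.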
Training details
- Base model: Qwen/Qwen3-VL-2B-Instruct
- Method: GRPO with click-in-bbox reward
- Hardware: 1× NVIDIA RTX PRO 6000 Blackwell (96 GB GDDR7)
- Precision: BF16, no DeepSpeed (single GPU), sdpa attention
- Effective batch size: 64 (per-device 2 × grad-accum 32)
- Completions per prompt: 2
- Max completion length: 32 tokens
- Wall clock: ~17 hours for 2 epochs (~1875 steps)
- Checkpoint published: step 1350 (peak; 1400/1450 plateau or regress slightly)
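With 2 completions per prompt, GRPO scores each completion relative to the other one sampled from the same prompt. The sketch below shows the usual group-relative normalization. The population standard deviation, the eps value, and the function name are assumptions, and training frameworks differ on these details, so treat it as an illustration rather than the exact training code.

```python
import statistics

def group_relative_advantages(rewards, eps=1e-4):
    """Normalize rewards within one prompt's group of completions."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)  # population std (assumed choice)
    return [(r - mean) / (std + eps) for r in rewards]

print(group_relative_advantages([1.0, 0.0]))  # one hit, one miss
print(group_relative_advantages([1.0, 1.0]))  # both hit: all zeros
```

With a binary reward and a group of two, a prompt contributes a learning signal only when exactly one of the two completions hits the target.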
Reward function
Coordinates are scored in the [0, 1000] relative space that Qwen3-VL
natively emits — matching the space the model is trained to output in.
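A click-in-bbox reward in this space could look like the sketch below. The binary 1.0/0.0 scoring, the pixel (x1, y1, x2, y2) box format, and the function name are assumptions for illustration; the actual training reward may differ in details such as format checks or partial credit.

```python
def click_in_bbox_reward(pred_xy, bbox_px, image_size):
    """Binary reward: 1.0 if the predicted click falls inside the target box.

    pred_xy    -- (x, y) from the model in [0, 1000] relative space, or None
    bbox_px    -- (x1, y1, x2, y2) ground-truth box in pixels (assumed format)
    image_size -- (width, height) of the original screenshot in pixels
    """
    if pred_xy is None:
        return 0.0  # unparseable output earns nothing
    width, height = image_size
    x1, y1, x2, y2 = bbox_px
    # Bring the box into the same [0, 1000] space as the prediction.
    bx1, bx2 = x1 / width * 1000.0, x2 / width * 1000.0
    by1, by2 = y1 / height * 1000.0, y2 / height * 1000.0
    x, y = pred_xy
    return 1.0 if (bx1 <= x <= bx2 and by1 <= y <= by2) else 0.0

# A 40x40 px button on a 3840x2160 screenshot.
print(click_in_bbox_reward((500, 250), (1900, 520, 1940, 560), (3840, 2160)))
```

Scoring in relative space keeps the reward independent of each screenshot's resolution, since prediction and target are compared on the same 1000x1000 grid.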
Eval methodology
ScreenSpot-Pro test set, all 1581 instruction-style positive samples, English. Single-pass
inference — no zoom-in, no agentic loop, no refiner, no consistency router.
Eval harness: likaixin2000/ScreenSpot-Pro-GUI-Grounding,
adapter: qwen3vl_official_vllm (vLLM-backed, official Qwen team prompt).
Comparison to other 2B models
| Model | Inference | Avg |
|---|---|---|
| MAI-UI-2B | Zoom In | 62.81 |
| UI-Venus-1-5-2B | Single-pass | 57.75 |
| brend-2b-260602 | Single-pass | 48.64 |
| Qwen3-VL-2B-Instruct (base) | Single-pass | 43.26 |
MAI-UI-2B uses inference-time crop/re-query (zoom-in), so it isn't an apples-to-apples comparison with this model. UI-Venus-1-5-2B is the like-for-like single-pass 2B comparison.
Citation
```bibtex
@misc{chen2026brend2b260602,
  title        = {brend-2b-260602: GRPO fine-tune of Qwen3-VL-2B for GUI grounding},
  author       = {Kenneth Chen and Sheldon Zhu and Jiabao Zhang},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/Datawall/brend-2b-260602}},
}
```
License
Apache-2.0, inheriting the base model's license. Training data and eval benchmark are subject to their own upstream licenses.
Model provider: Datawall
Model tree: Qwen/Qwen3-VL-2B-Instruct (base), this model (fine-tuned)
Modalities: text and image input, text output