Datawall/brend-2b-260602 API & Inference Endpoint

Runtime requirements — read first

The evaluation that produced the published 48.64% ScreenSpot-Pro score used vLLM 0.17.0 with specific flags. Skip either of the following two requirements and you will silently get wrong answers:

vllm==0.17.0 exactly. Newer vLLM releases process the Qwen3-VL image preprocessor differently and return coordinates in the wrong space (we've reproduced the regression with vllm >0.17). Pin the version.
--mm-processor-kwargs '{"min_pixels": 1024, "max_pixels": 99999999}'. Without this, vLLM's default max_pixels downsamples the 4K–6K ScreenSpot-Pro screenshots to ~1280-wide, and the tiny widget targets become invisible to the model. Accuracy collapses.

Both of these are environmental, not model issues, but they're load-bearing for getting the published number.
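
A small guard can catch an accidental upgrade before any numbers are produced. This is a minimal sketch, not part of the published harness: the helper names `is_pinned` and `check_vllm_pin` are hypothetical, and ignoring a local build tag such as `+cu130` is an assumption about how your wheel reports its version.

```python
from importlib import metadata

REQUIRED_VLLM = "0.17.0"  # pin taken from this README


def is_pinned(installed: str, required: str = REQUIRED_VLLM) -> bool:
    """True only on an exact release match.

    Assumption: a local build tag such as '+cu130' is ignored.
    """
    return installed.split("+", 1)[0] == required


def check_vllm_pin() -> None:
    """Raise if vLLM is missing or is not the pinned release."""
    try:
        installed = metadata.version("vllm")
    except metadata.PackageNotFoundError as exc:
        raise RuntimeError("vllm is not installed in this environment") from exc
    if not is_pinned(installed):
        raise RuntimeError(f"found vllm=={installed}, expected {REQUIRED_VLLM}")
```

Call `check_vllm_pin()` at the top of an eval script so a mismatched environment fails loudly instead of returning coordinates in the wrong space.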

Install

Create a fresh conda env and install the pinned stack:

bash
conda create -n vllm011 python=3.11 -y
conda activate vllm011

# Pinned vLLM (DO NOT upgrade)
python -m uv pip install vllm==0.17.0

# Transformers + Pillow + requests for the client
python -m uv pip install transformers==4.57.6 pillow requests

(If you don't have uv: pip install uv first, or just use pip install directly — uv is a speed optimization, not a correctness one.)

GPU: any CUDA 12.x card with ≥10 GB VRAM. Tested on RTX PRO 6000 Blackwell (sm_120, cu130) and should work on H100 / A100 / RTX 4090 unchanged.

Serve

bash
conda activate vllm011

CUDA_VISIBLE_DEVICES=0 python -m vllm.entrypoints.openai.api_server \
  --model Datawall/brend-2b-260602 \
  --served-model-name brend-2b \
  --port 8003 \
  --gpu-memory-utilization 0.4 \
  --max-model-len 16384 \
  --max-num-seqs 32 \
  --limit-mm-per-prompt '{"image": 1}' \
  --mm-processor-kwargs '{"min_pixels": 1024, "max_pixels": 99999999}'

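A 2B BF16 model fits in ~5 GB; the rest of the 0.4 utilization budget is KV cache for batched serving. The fraction applies to total VRAM, so 0.4 leaves ample room on a 96 GB card but is less than the ~5 GB of weights on a 10 GB card; raise --gpu-memory-utilization on smaller GPUs. Bump --gpu-memory-utilization and --max-num-seqs if you have headroom.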
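
The budget arithmetic can be sketched as follows. This is a rough estimate only: the function name is hypothetical, the 5 GB weight figure is the approximate one quoted above, and the calculation ignores activation memory and other runtime overhead that vLLM also reserves.

```python
def kv_cache_budget_gb(total_vram_gb: float,
                       utilization: float,
                       weights_gb: float = 5.0) -> float:
    """Rough GB left for KV cache after the weights are loaded.

    A negative result means the chosen fraction does not even cover
    the weights, so the server would fail to start.
    """
    return total_vram_gb * utilization - weights_gb


for vram in (96, 24, 10):
    budget = kv_cache_budget_gb(vram, 0.4)
    print(f"{vram} GB card at 0.4 utilization: {budget:+.1f} GB for KV cache")
```

If the result is negative or close to zero for your card, raise `--gpu-memory-utilization` before touching `--max-num-seqs`.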

Use (OpenAI-compatible client)

python
import base64, re
from io import BytesIO
from PIL import Image
import requests

VLLM_URL = "http://localhost:8003/v1/chat/completions"
MODEL    = "brend-2b"

SYSTEM_PROMPT = """You are a helpful assistant. The user will give you an instruction, and you MUST left click on the corresponding UI element via tool call. If you are not sure about where to click, guess a most likely one.

# Tools

You may call one or more functions to assist with the user query.

You are provided with function signatures within <tools></tools> XML tags:
<tools>
{"type": "function", "function": {"name": "computer_use", "description": "Use a mouse to interact with a computer.\\n* The screen's resolution is 1000x1000.\\n* Make sure to click any buttons, links, icons, etc with the cursor tip in the center of the element. \\n* You can only use the left_click action to interact with the computer.", "parameters": {"properties": {"action": {"description": "The action to perform. The available actions are:\\n* `left_click`: Click the left mouse button with coordinate (x, y).", "enum": ["left_click"], "type": "string"}, "coordinate": {"description": "(x, y): The x (pixels from the left edge) and y (pixels from the top edge) coordinates to move the mouse to. Required only by `action=left_click`.", "type": "array"}, "required": ["action"], "type": "object"}}}
</tools>

For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
<tool_call>
{"name": <function-name>, "arguments": <args-json-object>}
</tool_call>"""

def img_to_data_url(img):
    # Encode the PIL image as an in-memory PNG and wrap it in a base64 data URL.
    buf = BytesIO()
    img.save(buf, format="PNG")
    return "data:image/png;base64," + base64.b64encode(buf.getvalue()).decode()

def ground(image_path, instruction):
    img = Image.open(image_path).convert("RGB")
    payload = {
        "model": MODEL,
        "temperature": 0.0,
        "max_tokens": 64,
        "messages": [
            {"role": "system", "content": [{"type": "text", "text": SYSTEM_PROMPT}]},
            {"role": "user", "content": [
                {"type": "image_url", "image_url": {"url": img_to_data_url(img)}},
                {"type": "text", "text": instruction},
            ]},
        ],
    }
    r = requests.post(VLLM_URL, json=payload, timeout=60)
    r.raise_for_status()
    text = r.json()["choices"][0]["message"]["content"]

    # Model emits a tool_call with coordinates in [0, 1000] relative space.
    m = re.search(r'"coordinate"\s*:\s*\[\s*(-?\d+\.?\d*)\s*,\s*(-?\d+\.?\d*)\s*\]', text)
    if not m:
        return None  # no parseable click in the response
    x_rel, y_rel = float(m.group(1)) / 1000.0, float(m.group(2)) / 1000.0
    # Scale to original image pixels:
    return (x_rel * img.width, y_rel * img.height)

print(ground("screenshot.png", "the save button in the top toolbar"))
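
The regex in `ground()` accepts any `"coordinate"` key anywhere in the response. If you want stricter parsing, one option is to decode the tool call as JSON. The sketch below assumes the model follows the `<tool_call>` format declared in the system prompt; `parse_click` is a hypothetical helper and has not been checked against real model outputs.

```python
import json
import re

# Capture the JSON object between the tool_call tags (may span lines).
TOOL_CALL_RE = re.compile(r"<tool_call>\s*(\{.*?\})\s*</tool_call>", re.DOTALL)


def parse_click(text):
    """Return (x, y) in [0, 1000] relative space, or None if parsing fails."""
    m = TOOL_CALL_RE.search(text)
    if m is None:
        return None
    try:
        call = json.loads(m.group(1))
        x, y = call["arguments"]["coordinate"]
        return float(x), float(y)
    except (json.JSONDecodeError, KeyError, TypeError, ValueError):
        # Malformed JSON, missing keys, or a coordinate that is not a pair.
        return None
```

A response truncated by `max_tokens` before the closing tag returns `None` here, so falling back to the looser regex in that case may be worthwhile.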

Coordinate convention

The model emits (x, y) in [0, 1000] relative space (the computer_use tool prompt declares a fake 1000x1000 screen, and Qwen3-VL is trained to honor that). Divide by 1000 to get normalized [0, 1] coordinates, then multiply by the original image's width/height to get pixels.
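
As a worked example of the conversion, here is a minimal helper. The name `rel_to_pixels` is hypothetical, and clamping out-of-range predictions to the screen is an added assumption that `ground()` does not make.

```python
def rel_to_pixels(x_rel, y_rel, width, height, clamp=True):
    """Map a [0, 1000] relative click to pixel coordinates of the original image."""
    if clamp:
        # Assumption: treat out-of-range predictions as clicks on the nearest edge.
        x_rel = min(max(x_rel, 0.0), 1000.0)
        y_rel = min(max(y_rel, 0.0), 1000.0)
    return (x_rel / 1000.0 * width, y_rel / 1000.0 * height)


# A click at (500, 250) on a 3840x2160 screenshot lands at pixel (1920.0, 540.0).
print(rel_to_pixels(500, 250, 3840, 2160))
```

Always pass the width and height of the image as it was sent to the server, not of any thumbnail shown to the user.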

Do not pre-resize the image client-side. vLLM's image preprocessor handles smart-resize internally given the mm-processor-kwargs flags above. Client-side resizing throws off the model.
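
To see why the pixel budget matters, the sketch below shows how a `max_pixels` cap shrinks a screenshot. It is a simplified illustration, not vLLM's or Qwen's actual preprocessing code: the function name is hypothetical, `factor=32` is an assumed patch-grid multiple, and the real smart-resize also enforces `min_pixels` and rounds differently.

```python
import math


def budget_resize(width, height, max_pixels, factor=32):
    """Simplified pixel-budget resize (illustration only).

    Images already under the budget pass through unchanged; larger ones
    are scaled down uniformly and floored to a multiple of `factor`.
    """
    if width * height <= max_pixels:
        return width, height
    scale = math.sqrt(max_pixels / (width * height))
    new_w = max(factor, math.floor(width * scale / factor) * factor)
    new_h = max(factor, math.floor(height * scale / factor) * factor)
    return new_w, new_h


# With the README's flag value a 4K screenshot is left at full resolution.
print(budget_resize(3840, 2160, 99_999_999))
# A hypothetical ~1-megapixel budget shrinks every widget by the same factor.
print(budget_resize(3840, 2160, 1_000_000))
```

Under a small budget, every widget shrinks along with the screenshot, which is the failure mode the `--mm-processor-kwargs` flag avoids.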

Eval breakdown (ScreenSpot-Pro, full test set, single-pass inference)

Section	Avg	Text	Icon
Development	48.49	70.13	25.52
Creative	45.45	61.62	23.08
CAD	32.95	38.07	17.19
Scientific	49.21	65.28	28.18
Office	70.00	80.23	35.85
Operating Systems	47.45	62.62	29.21
Overall	48.39	62.23	25.99

(48.39% is the micro-average across all 1581 samples. The model-index figure of 48.64% comes from the same eval at the peak checkpoint; the small mismatch is a known Creative-group accounting discrepancy in the eval harness.)

Text grounding is meaningfully stronger than icon grounding across every category — typical for 2B-class grounders.
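
The text-versus-icon claim can be checked directly against the table. The snippet below only re-uses the numbers printed above. It also computes the unweighted (macro) mean of the per-section averages, which differs from the 48.39 micro-average because sections contain different numbers of samples (per-section counts are not listed here).

```python
# (avg, text, icon) per section, copied from the table above.
RESULTS = {
    "Development": (48.49, 70.13, 25.52),
    "Creative": (45.45, 61.62, 23.08),
    "CAD": (32.95, 38.07, 17.19),
    "Scientific": (49.21, 65.28, 28.18),
    "Office": (70.00, 80.23, 35.85),
    "Operating Systems": (47.45, 62.62, 29.21),
}

# Text-minus-icon gap in accuracy points for each section.
gaps = {name: round(text - icon, 2) for name, (_, text, icon) in RESULTS.items()}

# Unweighted mean of section averages (macro-average).
macro_avg = sum(avg for avg, _, _ in RESULTS.values()) / len(RESULTS)

for name, gap in sorted(gaps.items(), key=lambda kv: kv[1]):
    print(f"{name:<18} text - icon = {gap:.2f}")
print(f"macro-average of section Avg: {macro_avg:.2f}")
```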

Training details

Base model: Qwen/Qwen3-VL-2B-Instruct
Method: GRPO with click-in-bbox reward
Hardware: 1× NVIDIA RTX PRO 6000 Blackwell (96 GB GDDR7)
Precision: BF16, no DeepSpeed (single GPU), sdpa attention
Effective batch size: 64 (per-device 2 × grad-accum 32)
Completions per prompt: 2
Max completion length: 32 tokens
Wall clock: ~17 hours for 2 epochs (~1875 steps)
Checkpoint published: step 1350 (peak; 1400/1450 plateau or regress slightly)
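
The figures above imply a few derived quantities, sketched below. Step count and wall clock are approximate in the source, so treat the results as estimates; the variable names are illustrative only.

```python
PER_DEVICE_BATCH = 2
GRAD_ACCUM = 32
NUM_GPUS = 1
TOTAL_STEPS = 1875        # approximate, for 2 epochs
EPOCHS = 2
WALL_CLOCK_HOURS = 17     # approximate
PUBLISHED_STEP = 1350

# Samples contributing to each optimizer step.
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM * NUM_GPUS

# Average time per optimizer step.
seconds_per_step = WALL_CLOCK_HOURS * 3600 / TOTAL_STEPS

# How far into training the published checkpoint sits.
published_fraction = PUBLISHED_STEP / TOTAL_STEPS
published_epochs = published_fraction * EPOCHS

print(effective_batch, round(seconds_per_step, 2), round(published_epochs, 2))
```

On these numbers, the published checkpoint sits about 72% of the way through the run, a little under one and a half epochs.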

Reward function

The click-in-bbox reward scores predicted coordinates in the [0, 1000] relative space that Qwen3-VL natively emits, so the reward is computed in the same space the model is trained to output in.
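
A click-in-bbox reward of this kind might look like the sketch below. It is an illustration under stated assumptions, not the training code: the binary 0/1 payoff, the `(x1, y1, x2, y2)` box layout, and the function names are all hypothetical.

```python
import re

COORD_RE = re.compile(
    r'"coordinate"\s*:\s*\[\s*(-?\d+\.?\d*)\s*,\s*(-?\d+\.?\d*)\s*\]'
)


def bbox_to_rel(bbox_px, width, height):
    """Convert a pixel box (x1, y1, x2, y2) to [0, 1000] relative space."""
    x1, y1, x2, y2 = bbox_px
    return (x1 / width * 1000, y1 / height * 1000,
            x2 / width * 1000, y2 / height * 1000)


def click_in_bbox_reward(completion, bbox_rel):
    """1.0 if the predicted click lies inside the box, else 0.0.

    Assumption: unparseable completions earn no reward.
    """
    m = COORD_RE.search(completion)
    if m is None:
        return 0.0
    x, y = float(m.group(1)), float(m.group(2))
    x1, y1, x2, y2 = bbox_rel
    return 1.0 if x1 <= x <= x2 and y1 <= y <= y2 else 0.0
```

Converting the ground-truth box once with `bbox_to_rel` keeps the comparison in relative space, so the reward never needs the screenshot's resolution at scoring time.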

Eval methodology

ScreenSpot-Pro test set, all 1581 instruction-style positive samples, English. Single-pass inference — no zoom-in, no agentic loop, no refiner, no consistency router. Eval harness: likaixin2000/ScreenSpot-Pro-GUI-Grounding, adapter: qwen3vl_official_vllm (vLLM-backed, official Qwen team prompt).

Comparison to other 2B models

Model	Inference	Avg
MAI-UI-2B	Zoom In	62.81
UI-Venus-1-5-2B	Single-pass	57.75
brend-2b-260602	Single-pass	48.64
Qwen3-VL-2B-Instruct (base)	Single-pass	43.26

MAI-UI-2B uses inference-time crop/re-query (zoom-in), so it isn't an apples-to-apples comparison with this model. UI-Venus-1-5-2B is the directly comparable single-pass 2B model.

Citation

bibtex
@misc{chen2026brend2b260602,
  title  = {brend-2b-260602: GRPO fine-tune of Qwen3-VL-2B for GUI grounding},
  author = {Kenneth Chen and Sheldon Zhu and Jiabao Zhang},
  year   = {2026},
  howpublished = {\url{https://huggingface.co/Datawall/brend-2b-260602}},
}

License

Apache-2.0, inheriting the base model's license. Training data and eval benchmark are subject to their own upstream licenses.
