README

License: apache-2.0

Models

Model            HF repo                        Base model              Params
Vero-Qwen3I-8B   zlab-princeton/Vero-Qwen3I-8B  Qwen3-VL-8B-Instruct    8B
Vero-Qwen3T-8B   zlab-princeton/Vero-Qwen3T-8B  Qwen3-VL-8B-Thinking    8B
Vero-MiMo-7B     zlab-princeton/Vero-MiMo-7B    MiMo-VL-7B-SFT-2508     7B
Vero-Qwen25-7B   zlab-princeton/Vero-Qwen25-7B  Qwen2.5-VL-7B-Instruct  7B

Highlights

  • Fully open release of models, training code, evaluation, and the Vero-600K dataset.
  • 600K curated RL samples from 59 datasets across 6 visual reasoning categories.
  • Trained for broad transfer across chart and OCR, STEM, spatial and action, knowledge and recognition, grounding and counting, and captioning and instruction following.
  • State-of-the-art results among 8B models on VeroEval, a 30-benchmark suite for general visual reasoning.
  • Improves performance across multiple base model families, including Qwen2.5-VL, Qwen3-VL, and MiMo-VL.

Usage

Example for zlab-princeton/Vero-Qwen35-9B:

python

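from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

model_path = "zlab-princeton/Vero-Qwen35-9B"

# Load the model and its processor (tokenizer + image/video preprocessing).
# NOTE: the model class is kept as given in the original example. The model
# tree lists a Qwen3.5 base, so if loading fails with this class,
# transformers' AutoModelForImageTextToText may resolve the matching
# architecture (an untested assumption).
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_path,
    torch_dtype="auto",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_path)

# A single-turn conversation with one image and one text question.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "path/to/image.jpg"},
            {"type": "text", "text": "What is the x axis value with the largest population?"},
        ],
    }
]

# Render the chat template to a prompt string and collect the vision inputs.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

# Generate, then decode only the newly generated tokens (drop the prompt).
generated_ids = model.generate(**inputs, max_new_tokens=2048)
output = processor.batch_decode(
    generated_ids[:, inputs.input_ids.shape[1]:],
    skip_special_tokens=True,
)[0]
print(output)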

Vero models generate a reasoning trace in <think> tags followed by a final answer in <answer> tags. For downstream use, parse the final response from <answer>.
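As a minimal sketch of that parsing step, the helper below pulls the reasoning trace and the final answer out of a decoded output string using the standard-library `re` module. The function name `parse_vero_output` and the choice to return the last `<answer>` block when several appear are illustrative assumptions, not part of the Vero release.

```python
import re

# Non-greedy matches; DOTALL lets the content span multiple lines.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def parse_vero_output(output: str) -> dict:
    """Split a decoded generation into its reasoning trace and final answer.

    Hypothetical helper: returns None for a part whose tags are missing.
    If several <answer> blocks appear, the last one is used (an assumption).
    """
    think = THINK_RE.search(output)
    answers = ANSWER_RE.findall(output)
    return {
        "reasoning": think.group(1).strip() if think else None,
        "answer": answers[-1].strip() if answers else None,
    }


sample = "<think>The tallest bar sits at 2010.</think>\n<answer>2010</answer>"
parsed = parse_vero_output(sample)
print(parsed["answer"])
```

Returning `None` rather than raising keeps the caller in control when a generation is truncated before the closing `</answer>` tag, which can happen if `max_new_tokens` is too small for the reasoning trace.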

Recommended sampling parameters, following the Qwen3.5 defaults:

  • Thinking mode for general tasks: temperature=1.0, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0, max_new_tokens=16384.
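One way to carry these values into code is sketched below. It assumes that Hugging Face `generate()` accepts `temperature`, `top_p`, `top_k`, `min_p`, `repetition_penalty`, and `max_new_tokens` but has no `presence_penalty` argument, which is typically set in a serving engine's own sampling parameters instead. The names `RECOMMENDED_SAMPLING`, `HF_GENERATE_KEYS`, and both helpers are hypothetical.

```python
# Recommended values, copied from the bullet above.
RECOMMENDED_SAMPLING = {
    "temperature": 1.0,
    "top_p": 0.95,
    "top_k": 20,
    "min_p": 0.0,
    "presence_penalty": 1.5,
    "repetition_penalty": 1.0,
    "max_new_tokens": 16384,
}

# Assumption: the subset of keys that transformers' generate() accepts.
HF_GENERATE_KEYS = {
    "temperature", "top_p", "top_k", "min_p",
    "repetition_penalty", "max_new_tokens",
}


def hf_generate_kwargs(params: dict) -> dict:
    """Keep only the keys generate() is assumed to accept and enable sampling."""
    kwargs = {k: v for k, v in params.items() if k in HF_GENERATE_KEYS}
    kwargs["do_sample"] = True  # temperature/top_p/top_k only apply when sampling
    return kwargs


def unsupported_keys(params: dict) -> list:
    """List the keys that must be set elsewhere (e.g. in a serving engine)."""
    return sorted(set(params) - HF_GENERATE_KEYS)


gen_kwargs = hf_generate_kwargs(RECOMMENDED_SAMPLING)
print(unsupported_keys(RECOMMENDED_SAMPLING))
```

With the usage example above, the call would then read `model.generate(**inputs, **gen_kwargs)` in place of the fixed `max_new_tokens=2048`.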

Citation

bibtex

@misc{sarch2026vero,
  title={Vero: An Open RL Recipe for General Visual Reasoning},
  author={Sarch, Gabriel and Cai, Linrong and Wang, Qunzhong and Wu, Haoyang and Chen, Danqi and Liu, Zhuang},
  year={2026}
}

License

Vero is released under the Apache-2.0 license. Users should also review the licenses and usage terms of the underlying base models and any upstream datasets.

Model provider

zlab-princeton

Model tree

Base: Qwen/Qwen3.5-9B
Fine-tuned: this model

Modalities

Input: Video, Text, Image
Output: Text
