Models
| Model | HF repo | Base model | Params |
|---|---|---|---|
| Vero-Qwen3I-8B | zlab-princeton/Vero-Qwen3I-8B | Qwen3-VL-8B-Instruct | 8B |
| Vero-Qwen3T-8B | zlab-princeton/Vero-Qwen3T-8B | Qwen3-VL-8B-Thinking | 8B |
| Vero-MiMo-7B | zlab-princeton/Vero-MiMo-7B | MiMo-VL-7B-SFT-2508 | 7B |
| Vero-Qwen25-7B | zlab-princeton/Vero-Qwen25-7B | Qwen2.5-VL-7B-Instruct | 7B |
Highlights
- Fully open release of models, training code, evaluation, and the Vero-600K dataset.
- 600K curated RL samples from 59 datasets across 6 visual reasoning categories.
- Trained for broad transfer across chart and OCR, STEM, spatial and action, knowledge and recognition, grounding and counting, and captioning and instruction following.
- State-of-the-art among 8B models on VeroEval, a 30-benchmark suite for general visual reasoning.
- Improves performance across multiple base model families, including Qwen2.5-VL, Qwen3-VL, and MiMo-VL.
Usage
Example for zlab-princeton/Vero-Qwen35-9B:
```python
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

model_path = "zlab-princeton/Vero-Qwen35-9B"

# Load the model and its processor.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_path,
    torch_dtype="auto",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_path)

# A single-turn conversation with one image and one question.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "path/to/image.jpg"},
            {"type": "text", "text": "What is the x axis value with the largest population?"},
        ],
    }
]

# Render the chat template and collect the image/video inputs.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

# Generate, then decode only the newly generated tokens (prompt stripped).
generated_ids = model.generate(**inputs, max_new_tokens=2048)
output = processor.batch_decode(
    generated_ids[:, inputs.input_ids.shape[1]:],
    skip_special_tokens=True,
)[0]
print(output)
```
Vero models generate a reasoning trace in <think> tags followed by a final answer in <answer> tags. For downstream use, parse the final response from <answer>.
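A minimal sketch of parsing the final answer from the `<think>`/`<answer>` format described above. The helper name `extract_answer` is hypothetical and not part of the Vero release. The fallback (returning the raw text when no complete `<answer>` block exists, for example when generation was truncated) is an assumption about what callers want.

```python
import re

# Hypothetical helper, not part of the Vero release.
# DOTALL lets the answer span multiple lines.
_ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def extract_answer(output: str) -> str:
    """Return the text of the last complete <answer>...</answer> block.

    Falls back to the stripped raw output when no complete block is
    found (an assumption; callers may prefer to raise instead).
    """
    matches = _ANSWER_RE.findall(output)
    if matches:
        return matches[-1].strip()
    return output.strip()


# Made-up model output used only for illustration.
sample = "<think>The tallest bar is at 2020.</think>\n<answer>2020</answer>"
print(extract_answer(sample))
```

Taking the last match rather than the first guards against the reasoning trace itself quoting an `<answer>` tag before the final response.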
Recommended sampling parameters, following the Qwen3.5 defaults:
- Thinking mode for general tasks: temperature=1.0, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0, max_new_tokens=16384.
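As a rough sketch, the recommended values can be kept in one dictionary and adapted for the `model.generate` call from the usage example. Two assumptions apply here. First, to the best of my knowledge `presence_penalty` is not a `generate` argument in `transformers`, so it is dropped (it is typically a serving-engine setting). Second, `do_sample=True` is added so that the temperature and top-p/top-k values take effect. The names `RECOMMENDED_SAMPLING` and `to_generate_kwargs` are hypothetical.

```python
# Recommended thinking-mode values, copied from the list above.
RECOMMENDED_SAMPLING = {
    "temperature": 1.0,
    "top_p": 0.95,
    "top_k": 20,
    "min_p": 0.0,
    "presence_penalty": 1.5,
    "repetition_penalty": 1.0,
    "max_new_tokens": 16384,
}


def to_generate_kwargs(params: dict) -> dict:
    """Adapt the recommended values for a transformers-style generate call.

    Assumption: presence_penalty is not accepted by generate, so it is
    removed here. Sampling is enabled explicitly.
    """
    kwargs = {k: v for k, v in params.items() if k != "presence_penalty"}
    kwargs["do_sample"] = True
    return kwargs


generate_kwargs = to_generate_kwargs(RECOMMENDED_SAMPLING)
print(generate_kwargs)
```

With the objects from the usage example, this would be passed as `model.generate(**inputs, **generate_kwargs)`.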
Citation
@misc{sarch2026vero,
title={Vero: An Open RL Recipe for General Visual Reasoning},
author={Sarch, Gabriel and Cai, Linrong and Wang, Qunzhong and Wu, Haoyang and Chen, Danqi and Liu, Zhuang},
year={2026}
}
License
Vero is released under the Apache-2.0 license. Users should also review the licenses and usage terms of the underlying base models and any upstream datasets.