License: apache-2.0
Models
| Model | HF repo | Base model | Params |
|---|---|---|---|
| Vero-Qwen3I-8B | zlab-princeton/Vero-Qwen3I-8B | Qwen3-VL-8B-Instruct | 8B |
| Vero-Qwen3T-8B | zlab-princeton/Vero-Qwen3T-8B | Qwen3-VL-8B-Thinking | 8B |
| Vero-MiMo-7B | zlab-princeton/Vero-MiMo-7B | MiMo-VL-7B-SFT-2508 | 7B |
| Vero-Qwen25-7B | zlab-princeton/Vero-Qwen25-7B | Qwen2.5-VL-7B-Instruct | 7B |
Highlights
- Fully open release of models, training code, evaluation, and the Vero-600K dataset.
- 600K curated RL samples from 59 datasets across 6 visual reasoning categories.
- Trained for broad transfer across chart and OCR, STEM, spatial and action, knowledge and recognition, grounding and counting, and captioning and instruction following.
- SOTA 8B on VeroEval, a 30-benchmark suite for general visual reasoning.
- Improves performance across multiple base model families, including Qwen2.5-VL, Qwen3-VL, and MiMo-VL.
Usage
Example for zlab-princeton/Vero-Qwen35-9B:
```python
from transformers import AutoProcessor, Qwen2_5_VLForConditionalGeneration
from qwen_vl_utils import process_vision_info

model_path = "zlab-princeton/Vero-Qwen35-9B"

# Load the model and its processor (tokenizer plus image/video preprocessing).
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    model_path,
    torch_dtype="auto",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_path)

# A single user turn containing one image and one text question.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "path/to/image.jpg"},
            {"type": "text", "text": "What is the x axis value with the largest population?"},
        ],
    }
]

# Render the chat template to a prompt string and collect the visual inputs.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

# Generate, then decode only the newly generated tokens (the prompt is sliced off).
generated_ids = model.generate(**inputs, max_new_tokens=2048)
output = processor.batch_decode(
    generated_ids[:, inputs.input_ids.shape[1]:],
    skip_special_tokens=True,
)[0]
print(output)
```
Vero models generate a reasoning trace in <think> tags followed by a final answer in <answer> tags. For downstream use, parse the final response from <answer>.
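Parsing the final response could look like the following minimal sketch. The helper name `extract_answer` and the choice to return `None` when no `<answer>` block is present are assumptions made for illustration, not part of the Vero release.

```python
import re
from typing import Optional

# Non-greedy match so each <answer>...</answer> pair is captured separately;
# DOTALL lets the answer span multiple lines.
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def extract_answer(output: str) -> Optional[str]:
    """Return the text of the last <answer> block, or None if there is none.

    Hypothetical helper: returning None lets the caller decide how to handle
    outputs that were cut off before the answer was emitted.
    """
    matches = ANSWER_RE.findall(output)
    if not matches:
        return None
    return matches[-1].strip()


sample = "<think>The tallest bar is at 2015.</think>\n<answer>2015</answer>"
print(extract_answer(sample))
```

Taking the last match is a design choice: if the reasoning trace happens to quote an `<answer>` tag, the final block is still the one returned.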
Recommended sampling parameters, following the Qwen3.5 defaults:
- Thinking mode for general tasks: temperature=1.0, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0, max_new_tokens=16384.
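As a rough sketch, the recommended values can be kept in one dictionary and turned into keyword arguments for the `model.generate` call from the usage example. The helper `hf_generate_kwargs` is hypothetical. Dropping `presence_penalty` rests on the assumption that the Hugging Face `generate` API does not take that argument (it is typically a serving-engine option); adding `do_sample=True` is needed there for the sampling parameters to take effect.

```python
# Recommended values, copied from the list above.
RECOMMENDED = {
    "temperature": 1.0,
    "top_p": 0.95,
    "top_k": 20,
    "min_p": 0.0,
    "presence_penalty": 1.5,
    "repetition_penalty": 1.0,
    "max_new_tokens": 16384,
}


def hf_generate_kwargs(params: dict) -> dict:
    """Build keyword arguments for `model.generate(**inputs, **kwargs)`.

    Hypothetical helper. Assumption: `presence_penalty` is not accepted by
    Hugging Face `generate`, so it is dropped here rather than passed through.
    """
    kwargs = {k: v for k, v in params.items() if k != "presence_penalty"}
    kwargs["do_sample"] = True  # enable sampling so temperature/top_p/top_k apply
    return kwargs


gen_kwargs = hf_generate_kwargs(RECOMMENDED)
print(gen_kwargs)
```

With the usage example above, this would replace the fixed `max_new_tokens=2048` argument: `generated_ids = model.generate(**inputs, **gen_kwargs)`.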
Citation
```bibtex
@misc{sarch2026vero,
  title={Vero: An Open RL Recipe for General Visual Reasoning},
  author={Sarch, Gabriel and Cai, Linrong and Wang, Qunzhong and Wu, Haoyang and Chen, Danqi and Liu, Zhuang},
  year={2026}
}
```
License
Vero is released under the Apache-2.0 license. Users should also review the licenses and usage terms of the underlying base models and any upstream datasets.
Model provider: zlab-princeton
Model tree: base Qwen/Qwen3.5-9B; fine-tuned: this model
Modalities: input Video, Text, Image; output Text