Vero-Qwen35-9B

License: apache-2.0

Models

| Model | HF repo | Base model | Params |
| --- | --- | --- | --- |
| `Vero-Qwen3I-8B` | `zlab-princeton/Vero-Qwen3I-8B` | `Qwen3-VL-8B-Instruct` | 8B |
| `Vero-Qwen3T-8B` | `zlab-princeton/Vero-Qwen3T-8B` | `Qwen3-VL-8B-Thinking` | 8B |
| `Vero-MiMo-7B` | `zlab-princeton/Vero-MiMo-7B` | `MiMo-VL-7B-SFT-2508` | 7B |
| `Vero-Qwen25-7B` | `zlab-princeton/Vero-Qwen25-7B` | `Qwen2.5-VL-7B-Instruct` | 7B |

Highlights

- Fully open release of models, training code, evaluation code, and the Vero-600K dataset.
- 600K curated RL samples drawn from 59 datasets across 6 visual reasoning categories.
- Trained for broad transfer across chart and OCR, STEM, spatial and action, knowledge and recognition, grounding and counting, and captioning and instruction following.
- State-of-the-art results among 8B-scale models on VeroEval, a 30-benchmark suite for general visual reasoning.
- Improves performance across multiple base model families, including Qwen2.5-VL, Qwen3-VL, and MiMo-VL.

Usage

Example for zlab-princeton/Vero-Qwen35-9B:

```python
from transformers import AutoModelForImageTextToText, AutoProcessor
from qwen_vl_utils import process_vision_info

model_path = "zlab-princeton/Vero-Qwen35-9B"

# Load through the Auto class so the architecture is read from the checkpoint
# config: the base model is Qwen3.5, not Qwen2.5-VL. This needs a transformers
# release that supports the base model.
model = AutoModelForImageTextToText.from_pretrained(
    model_path,
    torch_dtype="auto",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained(model_path)

# One user turn containing an image and a question about it.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": "path/to/image.jpg"},
            {"type": "text", "text": "What is the x axis value with the largest population?"},
        ],
    }
]

# Render the chat prompt, then collect the image/video inputs it references.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)

# Generate, then decode only the new tokens (the prompt tokens are sliced off).
generated_ids = model.generate(**inputs, max_new_tokens=2048)
output = processor.batch_decode(
    generated_ids[:, inputs.input_ids.shape[1]:],
    skip_special_tokens=True,
)[0]
print(output)
```

Vero models generate a reasoning trace in `<think>` tags followed by a final answer in `<answer>` tags. For downstream use, parse the final response from `<answer>`.
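One way to do that parsing is sketched below, assuming the decoded output is a plain string with at most one reasoning block. The helper name `parse_vero_output` is hypothetical and not part of the Vero release. It returns `None` for any part whose tags are missing and takes the last `<answer>` block if several appear.

```python
import re

# Non-greedy matches; DOTALL lets the content span multiple lines.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)
ANSWER_RE = re.compile(r"<answer>(.*?)</answer>", re.DOTALL)


def parse_vero_output(text):
    """Return (reasoning, answer); either is None if its tags are missing."""
    think = THINK_RE.search(text)
    answers = ANSWER_RE.findall(text)
    reasoning = think.group(1).strip() if think else None
    # If the model emitted several answer blocks, keep the last one.
    answer = answers[-1].strip() if answers else None
    return reasoning, answer


sample = "<think>The tallest bar is at 2020.</think>\n<answer>2020</answer>"
reasoning, answer = parse_vero_output(sample)
print(answer)  # 2020
```

Returning `None` instead of raising lets the caller decide how to handle truncated generations, for example when `max_new_tokens` is reached before the closing `</answer>` tag.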

Recommended sampling parameters, following the Qwen3.5 defaults:

Thinking mode for general tasks: temperature=1.0, top_p=0.95, top_k=20, min_p=0.0, presence_penalty=1.5, repetition_penalty=1.0, max_new_tokens=16384.
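Some of these names, such as `presence_penalty`, follow OpenAI-style serving APIs rather than the `generate()` call in the usage example. The sketch below shows how the settings might be packed into a chat-completion request body. It assumes an OpenAI-compatible server that also accepts `top_k`, `min_p`, and `repetition_penalty` as extra fields and names the output budget `max_tokens`. The helper `build_chat_request` and the example URL are hypothetical.

```python
import json

# Values copied from the recommended thinking-mode settings above.
VERO_THINKING_SAMPLING = {
    "temperature": 1.0,
    "top_p": 0.95,
    "top_k": 20,
    "min_p": 0.0,
    "presence_penalty": 1.5,
    "repetition_penalty": 1.0,
}


def build_chat_request(image_url, question, max_tokens=16384):
    """Build an OpenAI-style chat completion body (hypothetical helper)."""
    return {
        "model": "zlab-princeton/Vero-Qwen35-9B",
        "messages": [
            {
                "role": "user",
                "content": [
                    {"type": "image_url", "image_url": {"url": image_url}},
                    {"type": "text", "text": question},
                ],
            }
        ],
        # Assumption: the server calls the output budget `max_tokens`;
        # the default mirrors max_new_tokens=16384 from the list above.
        "max_tokens": max_tokens,
        **VERO_THINKING_SAMPLING,
    }


payload = build_chat_request(
    "https://example.com/chart.png",
    "What is the x axis value with the largest population?",
)
print(json.dumps(payload, indent=2))
```

Keeping the sampling values in one dictionary makes it easy to reuse them across requests. Check which of these fields your serving stack actually accepts before relying on them.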

Citation

```bibtex
@misc{sarch2026vero,
  title={Vero: An Open RL Recipe for General Visual Reasoning},
  author={Sarch, Gabriel and Cai, Linrong and Wang, Qunzhong and Wu, Haoyang and Chen, Danqi and Liu, Zhuang},
  year={2026}
}
```

License

Vero is released under the Apache-2.0 license. Users should also review the licenses and usage terms of the underlying base models and any upstream datasets.

Model Details

Model Provider

zlab-princeton

Model Tree

Base: Qwen/Qwen3.5-9B

Fine-tuned: this model

Input Modalities

Text, Image, Video

Output Modalities