Model Details
- Model family: VRPRM
- Release variant: MiMo-7B
- Serialized architecture:
Qwen2_5_VLForConditionalGeneration
- Model type:
qwen2_5_vl
- Weights format: sharded
safetensors
- Recommended library:
transformers
Training Summary
The VRPRM paper trains the model with a two-stage recipe:
- Supervised fine-tuning cold start on high-quality CoT-PRM data. Open-sourced on VRPRM3.6K.
- Reinforcement learning scaling on lower-cost non-CoT PRM data.
Intended Use
This model is intended for research on:
- Visual process reward modeling
- Multimodal reasoning evaluation
- Step-level scoring of visual question answering rationales
- Best-of-N selection for vision-language model responses
This model is not intended to be used as a standalone assistant.
Usage
Load the model with Hugging Face Transformers from the repository root:
from transformers import AutoModelForVision2Seq, AutoProcessor
model_id = "YOUR_USERNAME/VRPRM-MiMo-7B"
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)
model = AutoModelForVision2Seq.from_pretrained(
model_id,
torch_dtype="auto",
device_map="auto",
trust_remote_code=True,
)
For the complete inference and evaluation pipeline, use the VRPRM project code.
Citation
@misc{chen2026vrprmprocessrewardmodeling,
title={VRPRM: Process Reward Modeling via Visual Reasoning},
author={Xinquan Chen and Chongying Yue and Bangwei Liu and Xuhong Wang and Yingchun Wang and Chaochao Lu},
year={2026},
eprint={2508.03556},
archivePrefix={arXiv},
primaryClass={cs.LG},
url={https://arxiv.org/abs/2508.03556},
}