License: apache-2.0
About
This repository contains the VideoKR-Qwen2.5-VL-7B model presented in VideoKR: Towards Knowledge- and Reasoning-Intensive Video Understanding (ICML 2026 Spotlight).
VideoKR-Qwen2.5-VL-7B is obtained through a standard SFT → GRPO pipeline on Qwen2.5-VL-7B-Instruct:
- Supervised fine-tuning on VideoKR-SFT-201K with CoT rationales → VideoKR-Qwen2.5-VL-7B-SFT
- GRPO reinforcement learning on VideoKR-RL-114K with verifiable rewards → this model
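The second stage relies on GRPO, which scores each sampled response relative to the other responses drawn for the same prompt. The sketch below illustrates that group-relative normalization with a binary, automatically checkable reward. It is a minimal illustration, not the training code behind this model: `exact_match_reward`, the sample answers, and the use of the population standard deviation are assumptions made for the example, and the card does not specify the reward design.

```python
from statistics import mean, pstdev


def exact_match_reward(prediction: str, answer: str) -> float:
    """Hypothetical verifiable reward: 1.0 on a case-insensitive exact match, else 0.0."""
    return 1.0 if prediction.strip().lower() == answer.strip().lower() else 0.0


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of responses to the same prompt.

    Each advantage is (reward - group mean) / (group std + eps).
    The population standard deviation is an assumption; implementations differ.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one multiple-choice question whose key is "B".
samples = ["B", "c", " b ", "A"]
rewards = [exact_match_reward(s, "B") for s in samples]
advantages = group_relative_advantages(rewards)

for sample, reward, advantage in zip(samples, rewards, advantages):
    print(f"{sample!r}: reward={reward}, advantage={advantage:+.3f}")
```

Correct answers receive positive advantages and incorrect ones negative advantages. When every response in a group earns the same reward, all advantages are zero and the group contributes no learning signal.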
VideoKR is the first large-scale training corpus designed for knowledge- and reasoning-intensive video understanding, containing 315K video reasoning examples over 145K newly collected, CC-licensed expert-domain videos across 82 professional subjects.
Links
| Resource | Link |
|---|---|
| Training data | minuzero/VideoKR-Train |
| Evaluation data | minuzero/VideoKR-Eval |
| SFT checkpoint (Qwen2.5-VL) | minuzero/VideoKR-Qwen2.5-VL-7B-SFT |
| SFT checkpoint (Qwen3-VL) | minuzero/VideoKR-Qwen3-VL-8B-SFT |
| GRPO checkpoint (Qwen3-VL) | minuzero/VideoKR-Qwen3-VL-8B |
Performance
Results with 128 input frames. Within each base-model group, bold = best, underline = second best.
| Model | Video-MME | MVBench | LongVBench | General Avg | VideoMMMU | MMVU | SciVidBench | VideoKR-Eval | Knowledge Avg |
|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-VL-7B-Instruct | 65.1 | 66.3 | 60.9 | 64.1 | 51.1 | 55.7 | 28.1 | 32.7 | 41.9 |
| VideoAuto-R1 | 66.8 | 70.2 | 59.7 | 65.6 | 52.1 | 55.7 | 32.7 | 36.5 | 44.3 |
| VideoKR (SFT + RL) | 66.4 | 68.9 | 61.3 | 65.5 | 52.2 | 60.5 | 32.5 | 41.2 | 46.6 |
VideoKR achieves the highest knowledge-intensive average (+4.7 over base, +2.3 over VideoAuto-R1) while remaining competitive on general video reasoning.
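The averages and gains quoted above can be checked directly against the table. The sketch below assumes that "General Avg" is the unweighted mean of the three general benchmarks and "Knowledge Avg" is the unweighted mean of the four knowledge-intensive benchmarks; the card does not state this explicitly, but the table values are consistent with it. The dictionary layout and helper names are illustrative only.

```python
# Scores copied from the 128-frame table above.
GENERAL = ("Video-MME", "MVBench", "LongVBench")
KNOWLEDGE = ("VideoMMMU", "MMVU", "SciVidBench", "VideoKR-Eval")

scores = {
    "Qwen2.5-VL-7B-Instruct": {
        "Video-MME": 65.1, "MVBench": 66.3, "LongVBench": 60.9,
        "VideoMMMU": 51.1, "MMVU": 55.7, "SciVidBench": 28.1, "VideoKR-Eval": 32.7,
    },
    "VideoAuto-R1": {
        "Video-MME": 66.8, "MVBench": 70.2, "LongVBench": 59.7,
        "VideoMMMU": 52.1, "MMVU": 55.7, "SciVidBench": 32.7, "VideoKR-Eval": 36.5,
    },
    "VideoKR (SFT + RL)": {
        "Video-MME": 66.4, "MVBench": 68.9, "LongVBench": 61.3,
        "VideoMMMU": 52.2, "MMVU": 60.5, "SciVidBench": 32.5, "VideoKR-Eval": 41.2,
    },
}


def average(model: str, benchmarks) -> float:
    """Unweighted mean of one model's scores over the given benchmarks."""
    values = [scores[model][b] for b in benchmarks]
    return sum(values) / len(values)


general_avg = {m: average(m, GENERAL) for m in scores}
knowledge_avg = {m: average(m, KNOWLEDGE) for m in scores}

# Gain of VideoKR over the base model on the knowledge-intensive average.
gain_over_base = (
    knowledge_avg["VideoKR (SFT + RL)"] - knowledge_avg["Qwen2.5-VL-7B-Instruct"]
)

for model in scores:
    print(f"{model}: general={general_avg[model]:.2f}, knowledge={knowledge_avg[model]:.2f}")
print(f"Knowledge gain over base: {gain_over_base:.2f}")
```

The unrounded knowledge average for VideoAuto-R1 is 44.25, so the +2.3 gap quoted above appears to be taken from the rounded table entries (46.6 and 44.3) rather than from the unrounded means.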
Results with 16 input frames (comparison with Video-R1 and VideoRFT):
| Model | Video-MME | MVBench | LongVBench | General Avg | VideoMMMU | MMVU | SciVidBench | VideoKR-Eval | Knowledge Avg |
|---|---|---|---|---|---|---|---|---|---|
| Qwen2.5-VL-7B-Instruct | 57.1 | 65.0 | 55.2 | 59.1 | 48.4 | 52.5 | 23.1 | 31.3 | 38.8 |
| Video-R1 | 59.7 | 65.5 | 55.3 | 60.2 | 51.1 | 53.3 | 26.6 | 28.9 | 40.0 |
| VideoRFT | 57.6 | 61.7 | 53.6 | 57.6 | 51.1 | 53.6 | 26.3 | 29.8 | 40.2 |
| VideoKR (SFT + RL) | 56.6 | 66.6 | 57.0 | 60.1 | 52.6 | 59.2 | 27.3 | 37.7 | 44.2 |
Under the 16-frame setting, VideoKR outperforms Video-R1 and VideoRFT by +4.2 and +4.0 on knowledge-intensive average, respectively.
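The two settings differ in how many frames of each video reach the model. The card does not say how frames are selected, so the sketch below assumes uniform sampling, taking the middle frame of each equal-width segment. The function name `uniform_frame_indices` and the 3000-frame clip are hypothetical and serve only to show how sparse a 16-frame budget is compared with 128 frames.

```python
def uniform_frame_indices(total_frames: int, num_frames: int) -> list[int]:
    """Return frame indices spread evenly across a video.

    The video is split into num_frames equal-width segments and the middle
    frame of each segment is selected. If the video has fewer frames than
    requested, every frame is returned.
    """
    if total_frames <= 0 or num_frames <= 0:
        raise ValueError("total_frames and num_frames must be positive")
    if num_frames >= total_frames:
        return list(range(total_frames))
    step = total_frames / num_frames
    return [int(step * i + step / 2) for i in range(num_frames)]


# Hypothetical clip: 100 seconds at 30 fps.
total = 3000
sparse = uniform_frame_indices(total, 16)
dense = uniform_frame_indices(total, 128)

print(f"16 frames: one frame every {total / len(sparse):.1f} source frames")
print(f"128 frames: one frame every {total / len(dense):.1f} source frames")
```

Under this assumption, the 16-frame budget samples the clip eight times more coarsely than the 128-frame budget, which is one plausible reason the 16-frame scores are lower across most benchmarks.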
Evaluation
```bash
cd /path/to/VideoKR/lmms_eval
conda activate videokr_eval

export CUDA_VISIBLE_DEVICES=0
export VIDEOKR_MODEL=minuzero/VideoKR-Qwen2.5-VL-7B
export TASKS=videokr_eval
export BATCH_SIZE=1
export RUN_NAME=videokr_eval

bash examples/models/videokr_vllm.sh
```
Citation
If you find VideoKR useful in your research, please cite our paper:
```bibtex
@misc{fu2026videokrknowledgereasoningintensivevideo,
  title={VideoKR: Towards Knowledge- and Reasoning-Intensive Video Understanding},
  author={Lin Fu and Zheyuan Yang and Yang Wang and Tingyu Song and Arman Cohan and Yilun Zhao},
  year={2026},
  eprint={2606.05259},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2606.05259},
}
```
Model provider: minuzero
Base model: Qwen/Qwen2.5-VL-7B-Instruct
Fine-tuned model: this model
Input modalities: Text, Image
Output modalities: Text