minuzero/VideoKR-Qwen2.5-VL-7B API & Inference Endpoint

About

This repository contains the VideoKR-Qwen2.5-VL-7B model presented in VideoKR: Towards Knowledge- and Reasoning-Intensive Video Understanding (ICML 2026 Spotlight).

VideoKR-Qwen2.5-VL-7B is obtained through a standard SFT → GRPO pipeline on Qwen2.5-VL-7B-Instruct:

Supervised fine-tuning on VideoKR-SFT-201K with CoT rationales → VideoKR-Qwen2.5-VL-7B-SFT
GRPO reinforcement learning on VideoKR-RL-114K with verifiable rewards → this model

VideoKR is the first large-scale training corpus designed for knowledge- and reasoning-intensive video understanding, containing 315K video reasoning examples over 145K newly collected, CC-licensed expert-domain videos across 82 professional subjects.

Links

Resource	Link
Training data	minuzero/VideoKR-Train
Evaluation data	minuzero/VideoKR-Eval
SFT checkpoint (Qwen2.5-VL)	minuzero/VideoKR-Qwen2.5-VL-7B-SFT
SFT checkpoint (Qwen3-VL)	minuzero/VideoKR-Qwen3-VL-8B-SFT
GRPO checkpoint (Qwen3-VL)	minuzero/VideoKR-Qwen3-VL-8B

Performance

Results with 128 input frames. Within each base-model group, bold = best, underline = second best.

Model	Video-MME	MVBench	LongVBench	General Avg	VideoMMMU	MMVU	SciVidBench	VideoKR-Eval	Knowledge Avg
Qwen2.5-VL-7B-Instruct	65.1	66.3	60.9	64.1	51.1	55.7	28.1	32.7	41.9
VideoAuto-R1	66.8	70.2	59.7	65.6	52.1	55.7	32.7	36.5	44.3
VideoKR (SFT + RL)	66.4	68.9	61.3	65.5	52.2	60.5	32.5	41.2	46.6

VideoKR achieves the highest knowledge-intensive average (+4.7 over base, +2.3 over VideoAuto-R1) while remaining competitive on general video reasoning.

Results with 16 input frames (comparison with Video-R1 and VideoRFT):

Model	Video-MME	MVBench	LongVBench	General Avg	VideoMMMU	MMVU	SciVidBench	VideoKR-Eval	Knowledge Avg
Qwen2.5-VL-7B-Instruct	57.1	65.0	55.2	59.1	48.4	52.5	23.1	31.3	38.8
Video-R1	59.7	65.5	55.3	60.2	51.1	53.3	26.6	28.9	40.0
VideoRFT	57.6	61.7	53.6	57.6	51.1	53.6	26.3	29.8	40.2
VideoKR (SFT + RL)	56.6	66.6	57.0	60.1	52.6	59.2	27.3	37.7	44.2

Under the 16-frame setting, VideoKR outperforms Video-R1 and VideoRFT by +4.2 and +4.0 on knowledge-intensive average, respectively.

Evaluation

bash
cd /path/to/VideoKR/lmms_eval
conda activate videokr_eval

export CUDA_VISIBLE_DEVICES=0
export VIDEOKR_MODEL=minuzero/VideoKR-Qwen2.5-VL-7B
export TASKS=videokr_eval
export BATCH_SIZE=1
export RUN_NAME=videokr_eval

bash examples/models/videokr_vllm.sh

Citation

If you find VideoKR useful in your research, please cite our paper:

bibtex
@misc{fu2026videokrknowledgereasoningintensivevideo,
      title={VideoKR: Towards Knowledge- and Reasoning-Intensive Video Understanding}, 
      author={Lin Fu and Zheyuan Yang and Yang Wang and Tingyu Song and Arman Cohan and Yilun Zhao},
      year={2026},
      eprint={2606.05259},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2606.05259}, 
}

VideoKR-Qwen2.5-VL-7B

Get help setting up a custom Dedicated Endpoints.

README