License: apache-2.0
About
This repository contains the VideoKR-Qwen3-VL-8B model presented in VideoKR: Towards Knowledge- and Reasoning-Intensive Video Understanding (ICML 2026 Spotlight).
VideoKR-Qwen3-VL-8B is obtained through a standard SFT → GRPO pipeline on Qwen3-VL-8B-Instruct:
- Supervised fine-tuning on VideoKR-SFT-201K with CoT rationales → VideoKR-Qwen3-VL-8B-SFT
- GRPO reinforcement learning on VideoKR-RL-114K with verifiable rewards → this model
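To make the second stage more concrete, here is a minimal sketch of the group-relative advantage that gives GRPO its name, paired with a simple verifiable reward. It is an illustration under assumptions, not the VideoKR training code: the exact-match reward, the function names, the use of the population standard deviation, and the `eps` value are all hypothetical choices, and the actual reward design is described in the paper.

```python
from statistics import mean, pstdev


def verifiable_reward(predicted: str, reference: str) -> float:
    """Hypothetical exact-match reward: 1.0 if the normalized answers agree, else 0.0."""
    return 1.0 if predicted.strip().lower() == reference.strip().lower() else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize each reward against the other samples drawn for the same prompt.

    Assumption: population standard deviation; implementations differ on
    population vs. sample std and on the value of eps.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Four sampled answers to one multiple-choice question whose reference answer is "B".
samples = ["B", "c", "A", " b "]
rewards = [verifiable_reward(s, "B") for s in samples]
advantages = group_relative_advantages(rewards)
```

Correct samples receive a positive advantage and incorrect ones a negative advantage. If every sample in a group earns the same reward, all advantages are zero and that prompt contributes no learning signal.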
VideoKR is the first large-scale training corpus designed for knowledge- and reasoning-intensive video understanding, containing 315K video reasoning examples over 145K newly collected, CC-licensed expert-domain videos across 82 professional subjects.
Links
| Resource | Link |
|---|---|
| Training data | minuzero/VideoKR-Train |
| Evaluation data | minuzero/VideoKR-Eval |
| SFT checkpoint (Qwen2.5-VL) | minuzero/VideoKR-Qwen2.5-VL-7B-SFT |
| GRPO checkpoint (Qwen2.5-VL) | minuzero/VideoKR-Qwen2.5-VL-7B |
| SFT checkpoint (Qwen3-VL) | minuzero/VideoKR-Qwen3-VL-8B-SFT |
Performance
Results with 128 input frames. Within the Qwen3-VL-8B group, bold = best, underline = second best.
| Model | Video-MME | MVBench | LongVBench | General Avg | VideoMMMU | MMVU | SciVidBench | VideoKR-Eval | Knowledge Avg |
|---|---|---|---|---|---|---|---|---|---|
| Qwen3-VL-8B-Instruct | 68.2 | 67.9 | 61.6 | 65.9 | 61.8 | 59.6 | 33.4 | 39.0 | 48.5 |
| OneThinker | 65.8 | 69.3 | 61.4 | 65.5 | 62.9 | 61.6 | 33.8 | 38.3 | 49.2 |
| VideoAuto-R1 | 68.7 | 68.8 | 58.8 | 65.4 | 63.1 | 59.6 | 32.7 | 43.8 | 49.8 |
| Qwen3-VL-8B-Thinking | 67.6 | 68.0 | 60.0 | 65.2 | 64.9 | 60.5 | 33.0 | 41.5 | 50.0 |
| VideoKR (SFT + RL) | 67.8 | 67.0 | 61.5 | 65.4 | 63.0 | 64.8 | 32.8 | 45.3 | 51.5 |
VideoKR achieves the highest knowledge-intensive average (+3.0 over base, +1.5 over Qwen3-VL-8B-Thinking) among all Qwen3-VL-8B based methods, while maintaining competitive general video reasoning performance.
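The averages and deltas quoted above can be re-derived from the per-benchmark scores in the table. The sketch below assumes that General Avg is the unweighted mean of Video-MME, MVBench, and LongVBench, that Knowledge Avg is the unweighted mean of VideoMMMU, MMVU, SciVidBench, and VideoKR-Eval, and that values are rounded half-up to one decimal. These conventions are inferred from the numbers rather than stated in the source.

```python
from decimal import Decimal, ROUND_HALF_UP

# model: (general scores, knowledge scores, reported General Avg, reported Knowledge Avg)
ROWS = {
    "Qwen3-VL-8B-Instruct": (["68.2", "67.9", "61.6"], ["61.8", "59.6", "33.4", "39.0"], "65.9", "48.5"),
    "OneThinker": (["65.8", "69.3", "61.4"], ["62.9", "61.6", "33.8", "38.3"], "65.5", "49.2"),
    "VideoAuto-R1": (["68.7", "68.8", "58.8"], ["63.1", "59.6", "32.7", "43.8"], "65.4", "49.8"),
    "Qwen3-VL-8B-Thinking": (["67.6", "68.0", "60.0"], ["64.9", "60.5", "33.0", "41.5"], "65.2", "50.0"),
    "VideoKR (SFT + RL)": (["67.8", "67.0", "61.5"], ["63.0", "64.8", "32.8", "45.3"], "65.4", "51.5"),
}


def average(scores: list[str]) -> Decimal:
    """Unweighted mean, rounded half-up to one decimal place (assumed convention)."""
    total = sum(Decimal(s) for s in scores)
    return (total / len(scores)).quantize(Decimal("0.1"), rounding=ROUND_HALF_UP)


# Collect any row whose recomputed averages disagree with the reported ones.
mismatches = [
    name
    for name, (general, knowledge, general_avg, knowledge_avg) in ROWS.items()
    if average(general) != Decimal(general_avg) or average(knowledge) != Decimal(knowledge_avg)
]

videokr_knowledge = average(ROWS["VideoKR (SFT + RL)"][1])
delta_over_base = videokr_knowledge - average(ROWS["Qwen3-VL-8B-Instruct"][1])
delta_over_thinking = videokr_knowledge - average(ROWS["Qwen3-VL-8B-Thinking"][1])
```

Under these assumptions every reported average is reproduced, and the two deltas come out to 3.0 and 1.5.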
Evaluation
```bash
cd /path/to/VideoKR/lmms_eval
conda activate videokr_eval
export CUDA_VISIBLE_DEVICES=0
export VIDEOKR_MODEL=minuzero/VideoKR-Qwen3-VL-8B
export TASKS=videokr_eval
export BATCH_SIZE=1
export RUN_NAME=videokr_eval
bash examples/models/videokr_vllm.sh
```
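If you prefer to drive the evaluation from Python, for example to loop over several checkpoints, the same environment variables can be assembled programmatically. This is a minimal sketch: `build_eval_env` and `run_eval` are hypothetical helper names, and it assumes that the script reads only the five variables shown above and that the `videokr_eval` conda environment is already active in the calling shell.

```python
import os
import subprocess


def build_eval_env(model, tasks, batch_size=1, run_name=None, devices="0", base_env=None):
    """Return a copy of the environment with the variables exported in the commands above."""
    env = dict(os.environ if base_env is None else base_env)
    env.update({
        "CUDA_VISIBLE_DEVICES": devices,
        "VIDEOKR_MODEL": model,
        "TASKS": tasks,
        "BATCH_SIZE": str(batch_size),
        "RUN_NAME": run_name or tasks,  # assumption: default the run name to the task name
    })
    return env


def run_eval(repo_dir, env, dry_run=True):
    """Launch the evaluation script from repo_dir; with dry_run=True only return the command."""
    cmd = ["bash", "examples/models/videokr_vllm.sh"]
    if not dry_run:
        subprocess.run(cmd, cwd=repo_dir, env=env, check=True)
    return cmd


env = build_eval_env("minuzero/VideoKR-Qwen3-VL-8B", "videokr_eval", base_env={})
command = run_eval("/path/to/VideoKR/lmms_eval", env)  # dry run: nothing is executed
```

Pass `dry_run=False` to launch the script for real; `check=True` makes a non-zero exit status raise `subprocess.CalledProcessError`.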
Citation
If you find VideoKR useful in your research, please cite our paper:
```bibtex
@misc{fu2026videokrknowledgereasoningintensivevideo,
  title={VideoKR: Towards Knowledge- and Reasoning-Intensive Video Understanding},
  author={Lin Fu and Zheyuan Yang and Yang Wang and Tingyu Song and Arman Cohan and Yilun Zhao},
  year={2026},
  eprint={2606.05259},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2606.05259},
}
```
Model details
| Field | Value |
|---|---|
| Model provider | minuzero |
| Base model | Qwen/Qwen3-VL-8B-Instruct |
| Fine-tuned | this model |
| Input modalities | Text, Image |
| Output modalities | Text |