README
License: apache-2.0
About
This repository contains the VideoKR-Qwen3-VL-8B-SFT model presented in VideoKR: Towards Knowledge- and Reasoning-Intensive Video Understanding (ICML 2026 Spotlight).
VideoKR-Qwen3-VL-8B-SFT is obtained by supervised fine-tuning Qwen3-VL-8B-Instruct on VideoKR-SFT-201K for one epoch. Each training example includes a high-quality chain-of-thought (CoT) rationale as the supervision target. This SFT checkpoint serves as the starting point for subsequent GRPO reinforcement learning, yielding the final VideoKR-Qwen3-VL-8B model.
Links
| Resource | Link |
|---|---|
| Training data | minuzero/VideoKR-Train |
| Evaluation data | minuzero/VideoKR-Eval |
| SFT checkpoint (Qwen2.5-VL) | minuzero/VideoKR-Qwen2.5-VL-7B-SFT |
| GRPO checkpoint (Qwen2.5-VL) | minuzero/VideoKR-Qwen2.5-VL-7B |
| GRPO checkpoint (Qwen3-VL) | minuzero/VideoKR-Qwen3-VL-8B |
Performance
Results with 128 input frames:
| Model | Video-MME | MVBench | LongVBench | General Avg | VideoMMMU | MMVU | SciVidBench | VideoKR-Eval | Knowledge Avg |
|---|---|---|---|---|---|---|---|---|---|
| Qwen3-VL-8B-Instruct | 68.2 | 67.9 | 61.6 | 65.9 | 61.8 | 59.6 | 33.4 | 39.0 | 48.5 |
| VideoKR (SFT) | 64.8 | 63.6 | 58.5 | 62.3 | 61.7 | 63.0 | 28.3 | 43.6 | 49.2 |
| VideoKR (SFT + RL) | 67.8 | 67.0 | 61.5 | 65.4 | 63.0 | 64.8 | 32.8 | 45.3 | 51.5 |
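Compared to the base model, the SFT checkpoint already improves on several knowledge-intensive benchmarks (e.g., +4.6 on VideoKR-Eval and +3.4 on MMVU, raising Knowledge Avg from 48.5 to 49.2), although SciVidBench and the general video benchmarks drop (General Avg falls from 65.9 to 62.3). The subsequent GRPO stage recovers most of the general video performance (General Avg 65.4) and lifts Knowledge Avg further to 51.5.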
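The deltas and averages discussed above can be recomputed directly from the table. The snippet below is a minimal sketch for checking that arithmetic. The scores are copied from the table, while the helper names (`average`, `delta`) and the grouping of benchmarks into `GENERAL` and `KNOWLEDGE` are illustrative choices inferred from the column order, not part of the VideoKR codebase.

```python
# Scores copied from the results table (128 input frames).
GENERAL = ["Video-MME", "MVBench", "LongVBench"]
KNOWLEDGE = ["VideoMMMU", "MMVU", "SciVidBench", "VideoKR-Eval"]

SCORES = {
    "Qwen3-VL-8B-Instruct": {
        "Video-MME": 68.2, "MVBench": 67.9, "LongVBench": 61.6,
        "VideoMMMU": 61.8, "MMVU": 59.6, "SciVidBench": 33.4, "VideoKR-Eval": 39.0,
    },
    "VideoKR (SFT)": {
        "Video-MME": 64.8, "MVBench": 63.6, "LongVBench": 58.5,
        "VideoMMMU": 61.7, "MMVU": 63.0, "SciVidBench": 28.3, "VideoKR-Eval": 43.6,
    },
    "VideoKR (SFT + RL)": {
        "Video-MME": 67.8, "MVBench": 67.0, "LongVBench": 61.5,
        "VideoMMMU": 63.0, "MMVU": 64.8, "SciVidBench": 32.8, "VideoKR-Eval": 45.3,
    },
}


def average(model, benchmarks):
    """Unweighted mean of a model's scores over the given benchmarks."""
    return sum(SCORES[model][b] for b in benchmarks) / len(benchmarks)


def delta(model, baseline, benchmark):
    """Score difference (model minus baseline) on one benchmark."""
    return SCORES[model][benchmark] - SCORES[baseline][benchmark]


if __name__ == "__main__":
    base = "Qwen3-VL-8B-Instruct"
    for model in SCORES:
        print(
            f"{model}: general avg {average(model, GENERAL):.2f}, "
            f"knowledge avg {average(model, KNOWLEDGE):.2f}"
        )
    for benchmark in KNOWLEDGE:
        change = delta("VideoKR (SFT)", base, benchmark)
        print(f"SFT vs. base on {benchmark}: {change:+.1f}")
```

Printing the per-benchmark deltas makes it easy to see which knowledge benchmarks improve after SFT and which do not.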
Training
For detailed training instructions, please refer to the GitHub repository.
```bash
cd /path/to/VideoKR/llamafactory
conda activate videokr_train

# Prepare SFT data
mkdir -p data/raw
huggingface-cli download minuzero/VideoKR-Train \
  --repo-type dataset --local-dir data/raw \
  --include "VideoKR-COT-201K.jsonl"
python local_script/prepare_videokr_sft_data.py \
  --input data/raw/VideoKR-COT-201K.jsonl \
  --output data/videokr_train.json

# Launch SFT
bash local_script/train_videokr.sh qwen3vl
```
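Before running the conversion script, it can be worth checking that the downloaded file is intact. The following is a minimal sketch, not part of the VideoKR repository: `count_jsonl_records` is a hypothetical helper, and it assumes only that the file holds one JSON object per line, as the `.jsonl` extension suggests. It makes no assumption about field names or about the exact number of records.

```python
import json
from pathlib import Path


def count_jsonl_records(path):
    """Count non-empty lines in a JSONL file, failing on the first invalid line."""
    count = 0
    with Path(path).open(encoding="utf-8") as handle:
        for line_number, line in enumerate(handle, start=1):
            if not line.strip():
                continue  # tolerate blank lines, e.g. a trailing newline
            try:
                json.loads(line)
            except json.JSONDecodeError as exc:
                raise ValueError(
                    f"{path}: line {line_number} is not valid JSON"
                ) from exc
            count += 1
    return count


if __name__ == "__main__":
    # Path taken from the download command in the training instructions.
    raw = Path("data/raw/VideoKR-COT-201K.jsonl")
    if raw.exists():
        print(f"{raw}: {count_jsonl_records(raw)} records")
    else:
        print(f"{raw} not found; run the download step first")
```

A truncated download typically shows up as a `ValueError` on the last line, which is cheaper to catch here than partway through data preparation.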
Citation
If you find VideoKR useful in your research, please cite our paper:
```bibtex
@misc{fu2026videokrknowledgereasoningintensivevideo,
  title={VideoKR: Towards Knowledge- and Reasoning-Intensive Video Understanding},
  author={Lin Fu and Zheyuan Yang and Yang Wang and Tingyu Song and Arman Cohan and Yilun Zhao},
  year={2026},
  eprint={2606.05259},
  archivePrefix={arXiv},
  primaryClass={cs.CV},
  url={https://arxiv.org/abs/2606.05259},
}
```
Model provider
minuzero
Model tree
Base
Qwen/Qwen3-VL-8B-Instruct
Fine-tuned
this model
Modalities
Input
Text, Image
Output
Text