minuzero

VideoKR-Qwen3-VL-8B-SFT

README

License: apache-2.0

About

This repository contains the VideoKR-Qwen3-VL-8B-SFT model presented in VideoKR: Towards Knowledge- and Reasoning-Intensive Video Understanding (ICML 2026 Spotlight).

VideoKR-Qwen3-VL-8B-SFT is obtained by supervised fine-tuning Qwen3-VL-8B-Instruct on VideoKR-SFT-201K for one epoch. Each training example includes a high-quality chain-of-thought (CoT) rationale as the supervision target. This SFT checkpoint serves as the starting point for subsequent GRPO reinforcement learning, yielding the final VideoKR-Qwen3-VL-8B model.

Links

Table with columns: Resource, Link
Resource	Link
Training data	minuzero/VideoKR-Train
Evaluation data	minuzero/VideoKR-Eval
SFT checkpoint (Qwen2.5-VL)	minuzero/VideoKR-Qwen2.5-VL-7B-SFT
GRPO checkpoint (Qwen2.5-VL)	minuzero/VideoKR-Qwen2.5-VL-7B
GRPO checkpoint (Qwen3-VL)	minuzero/VideoKR-Qwen3-VL-8B

Performance

Results with 128 input frames:

Table with columns: Model, Video-MME, MVBench, LongVBench, General Avg, VideoMMMU, MMVU, SciVidBench, VideoKR-Eval, Knowledge Avg
Model	Video-MME	MVBench	LongVBench	General Avg	VideoMMMU	MMVU	SciVidBench	VideoKR-Eval	Knowledge Avg
Qwen3-VL-8B-Instruct	68.2	67.9	61.6	65.9	61.8	59.6	33.4	39.0	48.5

The SFT checkpoint already shows strong gains on knowledge-intensive benchmarks (e.g., +4.6 on VideoKR-Eval, +3.4 on MMVU) compared to the base model, while the subsequent GRPO stage further recovers general video reasoning performance.

Training

For detailed training instructions, please refer to the GitHub repository.

bash
cd /path/to/VideoKR/llamafactory
conda activate videokr_train

# Prepare SFT data
mkdir -p data/raw
huggingface-cli download minuzero/VideoKR-Train \
  --repo-type dataset --local-dir data/raw \
  --include "VideoKR-COT-201K.jsonl"

python local_script/prepare_videokr_sft_data.py \
  --input data/raw/VideoKR-COT-201K.jsonl \
  --output data/videokr_train.json

# Launch SFT
bash local_script/train_videokr.sh qwen3vl

Citation

If you find VideoKR useful in your research, please cite our paper:

bibtex
@misc{fu2026videokrknowledgereasoningintensivevideo,
      title={VideoKR: Towards Knowledge- and Reasoning-Intensive Video Understanding}, 
      author={Lin Fu and Zheyuan Yang and Yang Wang and Tingyu Song and Arman Cohan and Yilun Zhao},
      year={2026},
      eprint={2606.05259},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2606.05259}, 
}

Available on FriendliAI

Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more

Container

Run this model inference with full control and performance in your environment.

Learn more

Model Details

Model Provider

minuzero

Model Tree

Base

Qwen/Qwen3-VL-8B-Instruct

Fine-tuned

this model

Input Modalities

Text

Image

Output Modalities