ddz16
Qwen3-VL-4B-CRPO
README
Introduction
Video Large Language Models (Video LLMs) often rely on shortcuts, such as single-frame cues and language priors, rather than tracking spatiotemporal dynamics. Counterfactual Relational Policy Optimization (CRPO) addresses this with a dual-branch reinforcement learning (RL) framework.
CRPO constructs counterfactual videos (e.g., through horizontal flips and temporal reversals) and introduces a Counterfactual Relation Reward (CRR). This reward encourages the model's answers to change for dynamic questions when the visual world changes, and to remain unchanged for static questions, making it difficult for shortcut-based policies to be consistently rewarded.
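The two counterfactual transforms and the idea behind the reward can be illustrated with a minimal sketch. This is not the authors' implementation: the nested-list video layout, the function names, and the binary form of `relation_reward` are assumptions made for illustration, and the actual CRR may be defined differently (for example, it may also take answer correctness into account).

```python
from typing import List

# Assumed layout: a video is a list of frames, a frame is a list of rows,
# and a row is a list of pixel values.
Video = List[List[List[int]]]

def horizontal_flip(video: Video) -> Video:
    """Mirror every frame left-to-right (swaps left/right motion cues)."""
    return [[row[::-1] for row in frame] for frame in video]

def temporal_reverse(video: Video) -> Video:
    """Reverse the frame order (swaps before/after and reversible dynamics)."""
    return video[::-1]

def relation_reward(answer_orig: str, answer_cf: str, is_dynamic: bool) -> float:
    """Hypothetical binary reward, not the paper's exact CRR.

    Dynamic question: reward 1.0 only if the answer changes on the counterfactual.
    Static question: reward 1.0 only if the answer stays the same.
    """
    changed = answer_orig.strip().lower() != answer_cf.strip().lower()
    return 1.0 if changed == is_dynamic else 0.0

# Toy video: 4 frames, each 2 rows by 3 pixels.
video = [[[t * 10 + r * 3 + c for c in range(3)] for r in range(2)] for t in range(4)]
flipped = horizontal_flip(video)
reversed_video = temporal_reverse(video)

# "Which way does the car move?" is dynamic: a flip should change the answer.
print(relation_reward("left", "right", is_dynamic=True))
# "What color is the car?" is static: a flip should not change the answer.
print(relation_reward("red", "red", is_dynamic=False))
```

A policy that answers from a single frame or from language priors tends to give the same answer for both videos, so under this sketch it would score zero on every dynamic question.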
Evaluation
The model was evaluated using DyBench, a paired counterfactual video benchmark with over 3,000 videos covering:
- Reversible dynamics
- Moving directions
- Event sequences
Experiments show that CRPO significantly outperforms prior RL methods on spatiotemporal-sensitive evaluations while maintaining competitive general video performance. On Qwen3-VL-8B, CRPO improves DyBench pair-accuracy, indicating improved sensitivity to video dynamics rather than reliance on static shortcuts.
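The text above does not define pair-accuracy. A common convention for paired benchmarks, assumed here, is that a pair counts as correct only when the model answers both the original video and its counterfactual correctly. The sketch below uses that assumption; `pair_accuracy` and `per_video_accuracy` are hypothetical helper names, not part of any DyBench tooling, and the results are toy values.

```python
from typing import List, Tuple

# Each entry: (correct on original video, correct on counterfactual video).
PairResult = Tuple[bool, bool]

def pair_accuracy(results: List[PairResult]) -> float:
    """Fraction of pairs where BOTH videos are answered correctly (assumed definition)."""
    if not results:
        return 0.0
    return sum(1 for orig_ok, cf_ok in results if orig_ok and cf_ok) / len(results)

def per_video_accuracy(results: List[PairResult]) -> float:
    """Ordinary accuracy over all individual videos, for comparison."""
    if not results:
        return 0.0
    return sum(int(orig_ok) + int(cf_ok) for orig_ok, cf_ok in results) / (2 * len(results))

# Toy results: two pairs fully correct, two pairs correct on only one side.
results = [(True, True), (True, False), (False, True), (True, True)]
print(pair_accuracy(results))       # 0.5
print(per_video_accuracy(results))  # 0.75
```

Under this definition a model that always gives the same answer to both videos of a dynamic pair can reach 50% per-video accuracy but 0% pair-accuracy, which is why the paired metric is sensitive to shortcut behavior.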
Model provider
ddz16
Modalities
Input: Text, Image
Output: Text