ddz16

Qwen3-VL-4B-CRPO

Introduction

Video Large Language Models (Video LLMs) often answer from shortcuts, such as single-frame cues and language priors, instead of tracking spatiotemporal dynamics. Counterfactual Relational Policy Optimization (CRPO) addresses this with a dual-branch reinforcement learning (RL) framework.

CRPO constructs counterfactual videos (e.g., through horizontal flips and temporal reversals) and introduces a Counterfactual Relation Reward (CRR). This reward encourages the model's answers to change for dynamic questions when the visual world changes, and to remain unchanged for static questions, making it difficult for shortcut-based policies to be consistently rewarded.
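The idea above can be illustrated with a minimal sketch. It is not the CRPO implementation: the function names, the nested-list video representation, and the binary reward are assumptions made for illustration, and the actual Counterfactual Relation Reward may be defined differently (for example, by checking the counterfactual answer against an expected value instead of only testing whether it changed).

```python
from typing import List

# Toy representation (an assumption): a video is a list of frames,
# a frame is a list of pixel rows, a row is a list of ints.
Frame = List[List[int]]
Video = List[Frame]


def horizontal_flip(video: Video) -> Video:
    """Mirror every frame left-to-right (swaps left/right motion)."""
    return [[row[::-1] for row in frame] for frame in video]


def temporal_reverse(video: Video) -> Video:
    """Play the frames in reverse order (swaps before/after)."""
    return video[::-1]


def counterfactual_relation_reward(
    answer_orig: str, answer_cf: str, is_dynamic: bool
) -> float:
    """Hypothetical binary stand-in for the CRR.

    Dynamic question: reward 1.0 only if the answer changed.
    Static question:  reward 1.0 only if the answer stayed the same.
    """
    changed = answer_orig != answer_cf
    return 1.0 if changed == is_dynamic else 0.0


if __name__ == "__main__":
    video = [[[1, 2], [3, 4]], [[5, 6], [7, 8]]]  # 2 frames of 2x2 pixels
    print(horizontal_flip(video))
    print(temporal_reverse(video))
    # A policy that ignores the video gives the same answer twice,
    # so it is not rewarded on a dynamic question.
    print(counterfactual_relation_reward("left", "left", is_dynamic=True))
```

Under this simplified reward, a policy that always repeats its answer earns nothing on dynamic questions, and a policy that always changes its answer earns nothing on static ones, which is the sense in which shortcut policies cannot be rewarded consistently.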

Evaluation

The model was evaluated using DyBench, a paired counterfactual video benchmark with over 3,000 videos covering:

Reversible dynamics
Moving directions
Event sequences

Experiments show that CRPO significantly outperforms prior RL methods on spatiotemporal-sensitive evaluations while maintaining competitive general video performance. On Qwen3-VL-8B, CRPO improves DyBench pair-accuracy, indicating improved sensitivity to video dynamics rather than reliance on static shortcuts.
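As a rough illustration of a pair-level metric, the sketch below assumes that a pair counts as correct only when both the original video and its counterfactual are answered correctly. That definition, and the name `pair_accuracy`, are assumptions; the exact DyBench scoring rule is not stated here.

```python
from typing import Iterable, Tuple


def pair_accuracy(results: Iterable[Tuple[bool, bool]]) -> float:
    """Fraction of pairs where BOTH videos were answered correctly.

    Each item is (original_correct, counterfactual_correct).
    Returns 0.0 for an empty input (a convention chosen for this sketch).
    """
    results = list(results)
    if not results:
        return 0.0
    both_correct = sum(1 for orig_ok, cf_ok in results if orig_ok and cf_ok)
    return both_correct / len(results)


if __name__ == "__main__":
    # Four hypothetical pairs: two fully correct, one half correct, one wrong.
    outcomes = [(True, True), (True, False), (False, False), (True, True)]
    print(pair_accuracy(outcomes))
```

Under this assumed definition, and when the two videos in a pair have different correct answers, a model that gives the same answer to both can be right on at most one of them and therefore never scores on that pair.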

Model Details

Model provider: ddz16
Input modalities: Text, Image
Output modalities: Text
Supported functionality: Dedicated Endpoints, Container
