ngqtrung

video-8b-grpo-base

Results

Offline full-set eval (VideoMME-v1 2700 + PerceptionComp 1108 + Video-Holmes 1837 = 5645 rows), scored with the repo's vero compute_score (the exact training val metric). mean = macro-mean of the 3 bench accuracies.

Keeper = global_step_80 (peak):

metric	mean	videomme	holmes	perceptioncomp
base RL @80 (keeper)	0.4918	0.6581	0.4741	0.3430
stock Qwen3-VL-8B ckpt-0 (zero-shot)	0.4444	0.6426	0.4143	0.2762
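As a quick sanity check on the "mean" column, the sketch below recomputes the macro-mean from the three rounded benchmark accuracies in the table. The sample-weighted (micro) mean is shown only for contrast; it is not a number reported by this card, and the helper names are illustrative. Because the inputs are already rounded to four decimals, the recomputed macro-mean can differ from the reported one in the last digit.

```python
# Accuracies copied from the results table (already rounded to 4 decimals).
ROWS = {
    "base RL @80 (keeper)": {
        "videomme": 0.6581, "holmes": 0.4741, "perceptioncomp": 0.3430,
    },
    "stock Qwen3-VL-8B ckpt-0 (zero-shot)": {
        "videomme": 0.6426, "holmes": 0.4143, "perceptioncomp": 0.2762,
    },
}

# Row counts per benchmark, from the eval description (sum = 5645).
N_ROWS = {"videomme": 2700, "holmes": 1837, "perceptioncomp": 1108}


def macro_mean(acc):
    """Unweighted mean over benchmarks: each benchmark counts equally."""
    return sum(acc.values()) / len(acc)


def micro_mean(acc, n_rows):
    """Sample-weighted mean: each eval row counts equally (contrast only)."""
    total = sum(n_rows[name] for name in acc)
    return sum(acc[name] * n_rows[name] for name in acc) / total


for name, acc in ROWS.items():
    print(f"{name}: macro={macro_mean(acc):.4f} micro={micro_mean(acc, N_ROWS):.4f}")
```

The macro-mean keeps the largest benchmark (videomme, 2700 rows) from dominating the headline number, which matters here because videomme is also the benchmark that moves least.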

Trajectory (mean): 0.430 @20 → 0.475 @40 → 0.479 @60 → 0.492 @80, then a tight oscillation in the 0.475–0.485 range (around 0.484) through step 240: a flat plateau with no upward trend and no collapse. The RL gain concentrates where the base is weakest (perceptioncomp 0.276→0.343); videomme is already strong at base and barely moves.
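To make the "gain concentrates where the base is weakest" observation concrete, this small sketch computes the per-benchmark improvement in accuracy points from the two table rows and ranks the benchmarks by gain. The function name is illustrative; the numbers are the rounded table values.

```python
# Rounded accuracies from the results table.
KEEPER = {"videomme": 0.6581, "holmes": 0.4741, "perceptioncomp": 0.3430}
ZERO_SHOT = {"videomme": 0.6426, "holmes": 0.4143, "perceptioncomp": 0.2762}


def gains_in_points(after, before):
    """Per-benchmark accuracy gain, expressed in percentage points."""
    return {name: round((after[name] - before[name]) * 100, 2) for name in after}


gains = gains_in_points(KEEPER, ZERO_SHOT)

# Benchmarks sorted from largest to smallest gain.
ranked = sorted(gains, key=gains.get, reverse=True)

for bench in ranked:
    print(f"{bench}: {ZERO_SHOT[bench]:.4f} -> {KEEPER[bench]:.4f} ({gains[bench]:+.2f} pt)")
```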

Training

Base model: stock Qwen/Qwen3-VL-8B-Instruct (cold-start RL, resume_mode=auto).
Framework: fork of volcengine/verl — ngquangtrung57/verl@videorl-mods. Fully-async GRPO: FSDP2 trainer + vLLM rollouter, partial rollout.
Warm start: none (cold-start). This run is the A/B ablation vs the SFT-770-warmstart run.
Reward: dapo-style score = 0.8·accuracy + 0.2·format (FORMAT_WEIGHT=0.2, FORMAT_MIN_THINK_CHARS=100). No KL penalty.
Exploration: OFF.
Topology: 4-node 2+2 — 2 trainer nodes (16-GPU FSDP2, dp=16) + 2 rollout nodes (16 GPU, vLLM TP=2 → 8 replicas). H100×8 per node.
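The reward line above can be sketched roughly as follows. This is a minimal illustration, not the fork's actual `compute_score`: the card only states the 0.8/0.2 weighting and the two constants, so the regexes, the binary format check, and the "last standalone capital letter after `</think>`" answer rule are all assumptions. `group_advantages` shows the textbook GRPO normalization (reward minus group mean, divided by group standard deviation); the fork's exact variant may differ.

```python
import re

FORMAT_WEIGHT = 0.2            # stated in the training notes
FORMAT_MIN_THINK_CHARS = 100   # stated in the training notes

# Assumed response shape: <think>reasoning</think> followed by the answer text.
_RESPONSE_RE = re.compile(r"^\s*<think>(.*?)</think>(.*)$", re.DOTALL)
_LETTER_RE = re.compile(r"\b([A-Z])\b")


def format_score(response: str) -> float:
    """1.0 if the think block exists and is long enough, else 0.0 (assumed rule)."""
    match = _RESPONSE_RE.match(response)
    if match is None:
        return 0.0
    return 1.0 if len(match.group(1).strip()) >= FORMAT_MIN_THINK_CHARS else 0.0


def accuracy_score(response: str, gold_letter: str) -> float:
    """1.0 if the last standalone capital letter after </think> matches the gold letter."""
    answer_part = response.rsplit("</think>", 1)[-1]
    letters = _LETTER_RE.findall(answer_part)
    return 1.0 if letters and letters[-1] == gold_letter.upper() else 0.0


def combined_score(response: str, gold_letter: str) -> float:
    """score = 0.8 * accuracy + 0.2 * format, as stated in the training notes."""
    return ((1.0 - FORMAT_WEIGHT) * accuracy_score(response, gold_letter)
            + FORMAT_WEIGHT * format_score(response))


def group_advantages(scores, eps=1e-6):
    """Textbook GRPO advantage: normalize each reward within its rollout group."""
    mean = sum(scores) / len(scores)
    std = (sum((s - mean) ** 2 for s in scores) / len(scores)) ** 0.5
    return [(s - mean) / (std + eps) for s in scores]
```

Under these assumptions a well-formatted correct answer scores 1.0, a correct answer with a too-short think block scores 0.8, and a well-formatted wrong answer scores 0.2.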

W&B

Project verl_fully_async (entity quangtrung5705-nanyang-technological-university-singapore). Main segment (train metrics only — video val is offline, not on W&B): https://wandb.ai/quangtrung5705-nanyang-technological-university-singapore/verl_fully_async/runs/mwuxsj29

Intended use / limitations

Research checkpoint: the best video keeper of an 8B controlled study. Note the headline finding, though: 8B video-MC RL is a documented 7-way dead-heat at ~0.485 full-set val. No lever tested (SFT warm-start, cold-base RL, exploration at several τ, larger GRPO groups) beats this ceiling, so the bottleneck appears to lie outside the exploration/warm-start/RL-duration space (data, reward, task design, or model scale). RL does buy a real +4-5 pt over the zero-shot base. Intended for multiple-choice video QA in a <think>…</think>-then-answer format. No safety/RLHF alignment beyond the base.

Usage

```python
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "ngqtrung/video-8b-grpo-base"
model = AutoModelForImageTextToText.from_pretrained(model_id, dtype="auto", device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)

messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "clip.mp4"},
        {"type": "text", "text": "Answer the multiple-choice question. Reason inside <think>...</think>, then give the final letter."},
    ],
}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
).to(model.device)
out = model.generate(**inputs, max_new_tokens=1024)
print(processor.batch_decode(out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0])
```
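The usage snippet sends only the generic instruction, so a real query still needs the question and its options in the text part. The helper below is one possible way to assemble that message. The prompt layout (question, lettered options, then the instruction) is an assumption, since the card does not state the template used in training, and `build_mc_prompt` and `build_messages` are hypothetical names.

```python
from string import ascii_uppercase

# Instruction text taken from the usage snippet.
INSTRUCTION = (
    "Answer the multiple-choice question. "
    "Reason inside <think>...</think>, then give the final letter."
)


def build_mc_prompt(question, options):
    """Join question, lettered options, and the instruction (assumed layout)."""
    if not 2 <= len(options) <= len(ascii_uppercase):
        raise ValueError("expected between 2 and 26 options")
    lines = [question.strip()]
    lines += [f"{letter}. {opt.strip()}" for letter, opt in zip(ascii_uppercase, options)]
    lines.append(INSTRUCTION)
    return "\n".join(lines)


def build_messages(video_path, question, options):
    """Chat messages in the same structure as the usage snippet."""
    return [{
        "role": "user",
        "content": [
            {"type": "video", "video": video_path},
            {"type": "text", "text": build_mc_prompt(question, options)},
        ],
    }]


messages = build_messages(
    "clip.mp4",
    "What happens first in the clip?",
    ["A dog runs across the yard", "A cat jumps onto the table"],
)
print(messages[0]["content"][1]["text"])
```

The resulting `messages` list can be passed to `processor.apply_chat_template` exactly as in the usage snippet.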

Citation / lineage

Base model: Qwen3-VL-8B-Instruct (Qwen team). Inherits the Qwen3-VL license — review the base model's terms; the Apache-2.0 tag refers to this repo's RLVR training artifacts.
Framework: verl (volcengine/verl), fork ngquangtrung57/verl@videorl-mods; fully-async GRPO (FSDP2 + vLLM).
Study: controlled OMR/Video exploration study on Qwen3-VL-8B; this is the cold-start video baseline arm and the best video keeper (docs/experiments_summary_8b.md).
