ngqtrung

video-8b-grpo-ppexplore

Results

Offline full-set eval (VideoMME-v1 2700 + PerceptionComp 1108 + Video-Holmes 1837 = 5645 rows), scored with the repo's vero compute_score. mean = macro-mean of the 3 bench accuracies.

Keeper = global_step_140 (peak):

| metric | mean | videomme | holmes | perceptioncomp |
| --- | --- | --- | --- | --- |
| ppexplore τ0.95 @140 (keeper) | 0.4890 | 0.656 | 0.465 | 0.346 |
| cold-base RL keeper @80 (sibling) | 0.4918 | 0.658 | 0.474 | 0.343 |
| SFT-770 RL keeper @180 (sibling) | 0.4845 | 0.657 | 0.451 | 0.345 |
| stock Qwen3-VL-8B ckpt-0 (zero-shot) | 0.4444 | 0.6426 | 0.4143 | 0.2762 |
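
As a sanity check on the table, the reported mean can be reproduced as an unweighted (macro) average of the three benchmark accuracies. The sketch below is illustrative only: the dictionary layout and the function names `macro_mean` and `micro_mean` are assumptions, not code from the repo.

```python
# Per-benchmark accuracies for the ppexplore keeper, copied from the table.
keeper = {"videomme": 0.656, "holmes": 0.465, "perceptioncomp": 0.346}
# Row counts per benchmark, copied from the eval description.
rows = {"videomme": 2700, "holmes": 1837, "perceptioncomp": 1108}


def macro_mean(acc):
    """Unweighted mean over benchmarks: every benchmark counts equally."""
    return sum(acc.values()) / len(acc)


def micro_mean(acc, n):
    """Row-weighted mean, shown only for contrast; not the reported metric."""
    return sum(acc[k] * n[k] for k in acc) / sum(n.values())


print(f"macro = {macro_mean(keeper):.4f}")
print(f"micro = {micro_mean(keeper, rows):.4f}")
```

The macro mean matches the table's 0.4890. A row-weighted mean over the same numbers would come out near 0.533, because VideoMME is both the largest and the highest-scoring set, so the two should not be compared directly.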

Trajectory (mean): 0.451 @20 → 0.470 @40 → 0.477 @60 → 0.488 @100 → 0.489 @140 → faded late (0.477 @200, 0.480 @260). Ran to ~step 780 (near 2 epochs); further training brought no val gain — same post-peak fade as every other video run. ppexplore's lower train reward did not translate to higher val.

Training

Base model: stock Qwen/Qwen3-VL-8B-Instruct (cold-start, resume_mode=auto).
Framework: fork of volcengine/verl — ngquangtrung57/verl@videorl-mods. Fully-async GRPO: FSDP2 trainer + vLLM rollouter.
Warm start: none (cold-start base).
Reward: dapo-style score = 0.8·accuracy + 0.2·format (FORMAT_WEIGHT=0.2, FORMAT_MIN_THINK_CHARS=100). No KL penalty.
Topology: 4-node 2+2 — 2 trainer nodes (16-GPU FSDP2, dp=16) + 2 rollout nodes (16 GPU, vLLM TP=2 → 8 replicas). H100×8 per node.
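
The reward line above can be sketched as follows. Only the weights and the two constants come from this card; how the format check and the answer match are actually implemented in the repo is not stated here, so the regex, the "think block must be at least `FORMAT_MIN_THINK_CHARS` long" rule, and all function names are assumptions.

```python
import re

FORMAT_WEIGHT = 0.2            # from the card
FORMAT_MIN_THINK_CHARS = 100   # from the card

# Assumed reply shape: <think>reasoning</think> followed by the answer.
_THINK_RE = re.compile(r"<think>(.*?)</think>(.*)", re.DOTALL)


def format_score(response: str) -> float:
    """1.0 if a think block exists and is long enough, else 0.0 (assumed rule)."""
    m = _THINK_RE.search(response)
    if m is None:
        return 0.0
    return 1.0 if len(m.group(1).strip()) >= FORMAT_MIN_THINK_CHARS else 0.0


def accuracy_score(response: str, gold: str) -> float:
    """1.0 if the last standalone capital letter after the think block equals gold."""
    m = _THINK_RE.search(response)
    tail = m.group(2) if m else response
    letters = re.findall(r"\b([A-Z])\b", tail)
    return 1.0 if letters and letters[-1] == gold else 0.0


def score(response: str, gold: str) -> float:
    """score = 0.8 * accuracy + 0.2 * format, with no KL term."""
    return ((1.0 - FORMAT_WEIGHT) * accuracy_score(response, gold)
            + FORMAT_WEIGHT * format_score(response))
```

Under these assumptions a correct answer with a too-short think block scores 0.8, and a well-formatted wrong answer scores 0.2.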

| key | value |
| --- | --- |
| `enable` | true |
| `trigger_mode` | high |
| `top_prob_threshold` (τ) | 0.95 |
| `k_explore` | 4 (of n=8 rollouts explore; 4 stay clean) |
| `prompt_exploration_prob` | 0.5 |
| `deterministic` | |
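
One possible reading of these settings, sketched at the rollout-planning level. This is not the fork's implementation: the card does not define the semantics of `prompt_exploration_prob` or `trigger_mode`, so the per-prompt coin flip, the "fires when the top-1 probability is at or above τ" rule, and the helper names `plan_rollouts` and `triggers` are all hypothetical.

```python
import random


def plan_rollouts(n=8, k_explore=4, prompt_exploration_prob=0.5, rng=None):
    """Return n flags for one prompt: True = this rollout uses exploration."""
    rng = rng or random.Random()
    # Assumption: each prompt is selected for exploration independently.
    if rng.random() >= prompt_exploration_prob:
        return [False] * n
    # Selected prompt: k_explore rollouts explore, the rest stay clean.
    return [True] * k_explore + [False] * (n - k_explore)


def triggers(top_prob, tau=0.95, trigger_mode="high"):
    """Assumed token-level trigger: 'high' fires on confident tokens."""
    if trigger_mode == "high":
        return top_prob >= tau
    raise ValueError(f"unsupported trigger_mode: {trigger_mode}")


flags = plan_rollouts(rng=random.Random(0))
print(sum(flags), "of", len(flags), "rollouts explore for this prompt")
```

With these assumed semantics, about half of the prompts would get a 4/4 split between exploring and clean rollouts, and the other half would be sampled entirely clean.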

Validation: inline val OFF (test_freq=10000); video val is an offline full-set eval on a dedicated 8×H100 node (vLLM TP=1), run every 20 fit-steps.
Train metrics: ~219 s/step; final reward 0.858; final response_length ~164 tok; 780 steps trained (~2 epochs).

W&B

Project verl_fully_async (entity quangtrung5705-nanyang-technological-university-singapore). Train metrics only — video val is offline, not on W&B: https://wandb.ai/quangtrung5705-nanyang-technological-university-singapore/verl_fully_async/runs/88lmybpk

Intended use / limitations

Research checkpoint — the cold-start exploration arm of an 8B video study. Headline finding: on video, exploration is a documented dead-heat — the OMR-winning recipe applied to video does not break the ~0.485 full-set val ceiling (this run, base, SFT-warmstart, and several further exploration ablations all land ~0.485-0.49). Useful as an A/B reference, not as a "better video model" — prefer video-8b-grpo-base (0.4918) if you just want the best keeper. Multiple-choice video QA, <think>…</think> then-answer format. No safety/RLHF alignment beyond the base.

Usage

```python
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "ngqtrung/video-8b-grpo-ppexplore"
# Load weights in the checkpoint's dtype and place them across available devices.
model = AutoModelForImageTextToText.from_pretrained(model_id, dtype="auto", device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)

# One user turn: a local video file plus the instruction text.
# Append the actual question and its options to the text field.
messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "clip.mp4"},
        {"type": "text", "text": "Answer the multiple-choice question. Reason inside <think>...</think>, then give the final letter."},
    ],
}]
# Build model inputs (token ids and video tensors) from the chat template.
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
).to(model.device)
out = model.generate(**inputs, max_new_tokens=1024)
# Decode only the newly generated tokens, skipping the prompt.
print(processor.batch_decode(out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0])
```
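
Since the model replies with a reasoning block followed by the answer, downstream code usually needs just the final letter. A minimal post-processing sketch, assuming the decoded text keeps the literal `<think>` tags and that options are single capital letters; `extract_answer` is a hypothetical helper, not part of this repo or of transformers.

```python
import re


def extract_answer(text: str):
    """Return the last standalone capital letter after the think block, or None."""
    # Drop the reasoning block so option letters mentioned inside it are ignored.
    tail = re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL)
    letters = re.findall(r"\b([A-Z])\b", tail)
    return letters[-1] if letters else None


reply = "<think>Option A is ruled out by the second scene.</think>\nAnswer: C"
print(extract_answer(reply))
```

Taking the last match rather than the first is a design choice: it tolerates replies that restate options before committing to one. A stray one-letter word such as "I" after the answer would still be picked up, so stricter parsing may be needed for free-form replies.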

Citation / lineage

Base model: Qwen3-VL-8B-Instruct (Qwen team). Inherits the Qwen3-VL license — review the base model's terms; the Apache-2.0 tag refers to this repo's RLVR training artifacts.
Framework: verl (volcengine/verl), fork ngquangtrung57/verl@videorl-mods; fully-async GRPO (FSDP2 + vLLM).
Method: entropy-aware token-dropout exploration ("ppexplore", τ=0.95) at the rollout stage. Part of a controlled OMR/Video exploration study on Qwen3-VL-8B (docs/experiments_summary_8b.md).
