ngqtrung

video-8b-grpo-ppexplore-n16k8

Results

Offline full-set eval (VideoMME-v1 2700 + PerceptionComp 1108 + Video-Holmes 1837 = 5645 rows), scored with the repo's vero compute_score. mean = macro-mean of the 3 bench accuracies.

Keeper = global_step_80 (peak):

| metric | mean | videomme | holmes | perceptioncomp |
|---|---|---|---|---|
| τ0.8 n16/k8 @80 (keeper) | 0.4913 | 0.653 | 0.471 | 0.349 |
| cold-base RL keeper @80 | 0.4918 | 0.658 | 0.474 | 0.343 |
| τ0.95 sibling peak (n8/k4) | 0.4881 | - | - | - |
| τ0.8 sibling peak (n8/k4) | 0.4864 | - | - | - |
| stock Qwen3-VL-8B ckpt-0 (zero-shot) | 0.4444 | 0.6426 | 0.4143 | 0.2762 |
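The `mean` column can be reproduced from the three per-benchmark accuracies. A minimal sketch, assuming an unweighted macro-mean as stated above; the function name `macro_mean` is illustrative and is not part of the repo's scorer. Because the table rounds the keeper's per-benchmark accuracies to three decimals, the recomputed value can differ from the reported mean in the fourth decimal.

```python
def macro_mean(videomme: float, holmes: float, perceptioncomp: float) -> float:
    """Unweighted mean of the three benchmark accuracies.

    Each benchmark counts equally, regardless of how many rows it has.
    """
    return (videomme + holmes + perceptioncomp) / 3


# Per-benchmark accuracies copied from the table above.
keeper = macro_mean(0.653, 0.471, 0.349)
stock = macro_mean(0.6426, 0.4143, 0.2762)
print(f"keeper: {keeper:.4f}, stock: {stock:.4f}")
```

A macro-mean weights the 1108-row PerceptionComp set the same as the 2700-row VideoMME set, so it is not the same number as accuracy over all 5645 rows.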

Trajectory (mean): 0.488 @20 → 0.490 @40 → 0.487 @60 → 0.491 @80, then a clean monotone fade (0.481 @120, 0.473 @140) with no collapse. The step-100 eval was skipped: it hung reproducibly on one deterministic shard, a step-100-specific problem rather than a systemic one. The peak edges the n8 explore siblings by ~0.3 to 0.5 pt, which is within noise.
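Keeper selection from the trajectory amounts to taking the evaluated step with the highest macro-mean. A small sketch using only the values reported in this section; step 100 is absent because its eval was skipped, and the variable names are illustrative.

```python
# Macro-mean per evaluated step, as reported in the trajectory above.
trajectory = {20: 0.488, 40: 0.490, 60: 0.487, 80: 0.491, 120: 0.481, 140: 0.473}

# The keeper is the step with the highest macro-mean.
keeper_step = max(trajectory, key=trajectory.get)

# Check that every evaluated step after the peak is strictly lower than the one before it.
post_peak = [trajectory[s] for s in sorted(trajectory) if s >= keeper_step]
fades_monotonically = all(a > b for a, b in zip(post_peak, post_peak[1:]))

print(keeper_step, fades_monotonically)
```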

Training

Base model: Qwen/Qwen3-VL-8B-Instruct, warm-started from the SFT-770-RL run's global_step_60 (merged HF _merged/sft770_perf_gs60_hf, itself val 0.4814). Lineage: stock 8B → OMR-SFT-770 → video-RL gs60 → this run.
Framework: fork of volcengine/verl — ngquangtrung57/verl@videorl-mods. Fully-async GRPO: FSDP2 trainer + vLLM rollouter.
Reward: dapo-style score = 0.8·accuracy + 0.2·format (FORMAT_WEIGHT=0.2, FORMAT_MIN_THINK_CHARS=100). No KL penalty.
Topology: 4-node 2+2 — 2 trainer nodes (16-GPU FSDP2, dp=16) + 2 rollout nodes (16 GPU, vLLM TP=2 → 8 replicas). H100×8 per node.
Batch: compared against 512 at n=8; the GRPO groups are bigger here (8 explore + 8 anchor per selected prompt).
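The reward line above can be read as a weighted sum of two terms. A minimal sketch, assuming that `format` is 1.0 when the response has a `<think>…</think>` block of at least `FORMAT_MIN_THINK_CHARS` characters followed by a non-empty answer, and 0.0 otherwise. That reading of the format term, the regex, and the function names are assumptions made for illustration, not the fork's actual scorer.

```python
import re

FORMAT_WEIGHT = 0.2            # from the Reward line above
FORMAT_MIN_THINK_CHARS = 100   # from the Reward line above

# Assumed shape: one <think>...</think> block, then the answer text.
_THINK_RE = re.compile(r"<think>(.*?)</think>(.*)", re.DOTALL)


def format_score(response: str) -> float:
    """Return 1.0 if the response looks well-formed under the assumed rule, else 0.0."""
    match = _THINK_RE.search(response)
    if match is None:
        return 0.0
    think, answer = match.group(1).strip(), match.group(2).strip()
    long_enough = len(think) >= FORMAT_MIN_THINK_CHARS
    return 1.0 if long_enough and answer else 0.0


def reward(accuracy: float, response: str) -> float:
    """score = 0.8 * accuracy + 0.2 * format, with the weights derived from FORMAT_WEIGHT."""
    return (1.0 - FORMAT_WEIGHT) * accuracy + FORMAT_WEIGHT * format_score(response)
```

Under this reading, a correct answer with no reasoning block scores 0.8, and a well-formatted wrong answer scores 0.2.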

| key | value |
|---|---|
| `enable` | true |
| `trigger_mode` | high |
| `top_prob_threshold` (τ) | 0.8 |
| `k_explore` | 8 (of n=16; 8 explore + 8 anchor) |
| `prompt_exploration_prob` | 0.5 |
| `deterministic` | |
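The settings above can be pictured as a small config object plus a group-splitting rule. A hedged sketch: the field names mirror the table, but the `ExploreConfig` class, the `split_group` helper, and the rule that an unselected prompt keeps all `n` rollouts as anchors are assumptions made for illustration, not the fork's implementation. The `deterministic` value is not given above, so it is left unset.

```python
import random
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class ExploreConfig:
    """Hypothetical container for the exploration settings listed above."""
    enable: bool = True
    trigger_mode: str = "high"
    top_prob_threshold: float = 0.8       # tau
    k_explore: int = 8                    # explore rollouts per selected prompt
    prompt_exploration_prob: float = 0.5  # chance that a prompt is selected
    deterministic: Optional[bool] = None  # value not stated in this card


def split_group(cfg: ExploreConfig, n: int, rng: random.Random) -> Tuple[int, int]:
    """Return (explore, anchor) rollout counts for one prompt's group of size n.

    Assumption: a selected prompt gets k_explore explore rollouts and
    n - k_explore anchors; an unselected prompt gets n anchors.
    """
    if cfg.k_explore > n:
        raise ValueError("k_explore cannot exceed the group size n")
    selected = cfg.enable and rng.random() < cfg.prompt_exploration_prob
    explore = cfg.k_explore if selected else 0
    return explore, n - explore


rng = random.Random(0)
splits = [split_group(ExploreConfig(), n=16, rng=rng) for _ in range(4)]
print(splits)
```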

Validation: inline val OFF (test_freq=10000); video val is offline full-set eval on a dedicated H100 node (vLLM TP1), every 20 fit-steps.
Train metrics: ~445 s/step (vs ~290 at n=8 — the bigger groups cost ~55% more wall-clock); reward → ~0.88; response_length ~250 tok; format ~1.0; no OOM/collapse. Stopped by user at fit-step ~106/781 (verdict clear).
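The wall-clock comparison above is a ratio of step times. A small sketch with the approximate figures quoted in this section; since both step times are rounded, the result is only indicative. The function name is illustrative.

```python
def relative_overhead(new_seconds: float, old_seconds: float) -> float:
    """Fractional increase in step time when going from old_seconds to new_seconds."""
    return (new_seconds - old_seconds) / old_seconds


# ~445 s/step at n=16 versus ~290 s/step at n=8, both approximate.
overhead = relative_overhead(445.0, 290.0)
print(f"{overhead:.1%}")
```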

W&B

Project verl_fully_async (entity quangtrung5705-nanyang-technological-university-singapore). The logical run has three crash-resume segments (vtlkaq8u, iwwcf7do, hb4vi5h3); the last/most-complete (hb4vi5h3) carries the curated label. Train metrics only — video val is offline, not on W&B: https://wandb.ai/quangtrung5705-nanyang-technological-university-singapore/verl_fully_async/runs/hb4vi5h3

Intended use / limitations

Research checkpoint: the group-size ablation arm of an 8B video study. Headline finding: bigger GRPO groups (n16/k8) do not break the ~0.485 video dead heat (the 0.4913 peak ties the best keeper, then fades). With this run, seven levers (SFT warmstart, cold-base RL, cold exploration, base-warm explore, SFT-warm explore τ0.95, SFT-warm explore τ0.8, and this τ0.8 n16/k8) all peak at ~0.485-0.49, so the video ceiling appears to be bound by something outside the exploration / warm-start / RL-duration / τ / group-size space. Intended for multiple-choice video QA in a <think>…</think>-then-answer format. No safety/RLHF alignment beyond the base.

Usage

```python
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "ngqtrung/video-8b-grpo-ppexplore-n16k8"
model = AutoModelForImageTextToText.from_pretrained(model_id, dtype="auto", device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)

# One user turn: a video clip plus the multiple-choice instruction.
messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "clip.mp4"},
        {"type": "text", "text": "Answer the multiple-choice question. Reason inside <think>...</think>, then give the final letter."},
    ],
}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
).to(model.device)
out = model.generate(**inputs, max_new_tokens=1024)
# Drop the prompt tokens so that only the newly generated text is decoded.
print(processor.batch_decode(out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0])
```
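Since the model emits its reasoning inside `<think>…</think>` before the answer, a caller will usually want to separate the two. A minimal sketch, assuming the final answer is a single option letter A-D that appears as a standalone token after the closing tag; the helper name `split_reasoning_and_answer` and the letter range are illustrative assumptions, not part of this model card.

```python
import re
from typing import Optional, Tuple


def split_reasoning_and_answer(text: str) -> Tuple[str, Optional[str]]:
    """Split decoded output into (reasoning, answer_letter).

    If no closing </think> tag is found, the whole text is searched for the letter
    and the reasoning is returned as an empty string.
    """
    head, sep, tail = text.partition("</think>")
    if sep:
        reasoning = head.replace("<think>", "", 1).strip()
    else:
        reasoning, tail = "", text
    # First standalone capital letter A-D in the answer part (assumed option range).
    match = re.search(r"\b([A-D])\b", tail)
    return reasoning, (match.group(1) if match else None)


reasoning, letter = split_reasoning_and_answer(
    "<think>The clip shows a dog.</think>\nAnswer: B"
)
print(letter)
```

A standalone capital "A" used as an article in the answer text would be picked up as option A, so a stricter pattern may be needed for free-form answers.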

Citation / lineage

Base model: Qwen3-VL-8B-Instruct (Qwen team). Inherits the Qwen3-VL license — review the base model's terms; the Apache-2.0 tag refers to this repo's RLVR training artifacts.
Warm start: SFT-770-RL global_step_60 (lineage: stock 8B → OMR-SFT-770 → video-RL gs60).
Framework: verl (volcengine/verl), fork ngquangtrung57/verl@videorl-mods; fully-async GRPO (FSDP2 + vLLM).
Method: entropy-aware token-dropout exploration ("ppexplore", τ=0.8) with enlarged GRPO groups (n=16, k_explore=8). Group-size ablation in a controlled OMR/Video exploration study on Qwen3-VL-8B (docs/experiments_summary_8b.md).
