ngqtrung

video-8b-grpo-sft770

Results

Offline full-set eval (VideoMME-v1 2700 + PerceptionComp 1108 + Video-Holmes 1837 = 5645 rows), scored with the repo's vero compute_score. mean = macro-mean of the 3 bench accuracies.

Keeper = global_step_180 (peak):

metric	mean	videomme	holmes	perceptioncomp	format
SFT-770 RL @180 (keeper)	0.4845	0.6574	0.4513	0.3448	0.983
cold-base RL keeper @80 (sibling)	0.4918	0.6581	0.4741	0.3430	—
stock Qwen3-VL-8B ckpt-0 (zero-shot)	0.4444	0.6426	0.4143	0.2762	—
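
As a quick check on how the `mean` column is built, the sketch below recomputes the macro-mean for the keeper row from the three per-benchmark accuracies in the table. The row-weighted (micro) mean is not reported in this card; it is included only to show how far the two averages diverge when benchmark sizes differ. The function and variable names are illustrative, not taken from the repo.

```python
# Per-benchmark row counts and keeper accuracies, copied from this card.
ROWS = {"videomme": 2700, "holmes": 1837, "perceptioncomp": 1108}
KEEPER = {"videomme": 0.6574, "holmes": 0.4513, "perceptioncomp": 0.3448}


def macro_mean(acc):
    """Unweighted average over benchmarks (the card's `mean`)."""
    return sum(acc.values()) / len(acc)


def micro_mean(acc, rows):
    """Row-weighted average, shown for contrast only; not reported in the card."""
    total = sum(rows.values())
    return sum(acc[name] * rows[name] for name in acc) / total


macro = macro_mean(KEEPER)
micro = micro_mean(KEEPER, ROWS)
print(f"macro={macro:.4f} micro={micro:.4f}")
```

The macro-mean weights each benchmark equally, so the smaller and harder PerceptionComp set pulls the headline number down more than it would under row weighting.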

Trajectory (mean): flat plateau at ~0.481 (±0.01) over steps 60–160, peak of 0.4845 @180, then a decline (0.467 @200, 0.471 @220). Format compliance saturated at ~0.97–0.98 throughout. The high training reward did not convert to held-out accuracy; this is visible only because of the offline full-set eval.
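
Picking the keeper amounts to taking the step with the highest full-set val mean. A minimal sketch follows, using only the three step values quoted above; the plateau steps 60–160 are summarized rather than listed in this card, so they are left out instead of guessed. `pick_keeper` is a hypothetical helper, not a function from the training repo.

```python
# Full-set val means quoted in this card. Steps 60-160 are only described as a
# ~0.481 plateau, so no per-step values are filled in for them here.
REPORTED_MEAN = {180: 0.4845, 200: 0.467, 220: 0.471}


def pick_keeper(val_by_step):
    """Return (step, score) for the checkpoint with the highest val mean."""
    step = max(val_by_step, key=val_by_step.get)
    return step, val_by_step[step]


keeper_step, keeper_score = pick_keeper(REPORTED_MEAN)
print(f"keeper=global_step_{keeper_step} mean={keeper_score:.4f}")
```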

Training

Base model: Qwen/Qwen3-VL-8B-Instruct, warm-started from the OMR SFT checkpoint qwen3vl8b_ommr_sft_3node/checkpoint-770-hf. (SFT-770 = the stock 8B SFT'd on OpenMMReasoner reasoning data with lmms-engine — a math/visual-reasoning SFT, not a video model — used here to seed video RL.)
Framework: fork of volcengine/verl — ngquangtrung57/verl@videorl-mods. Fully-async GRPO: FSDP2 trainer + vLLM rollouter.
Reward: dapo-style score = 0.8·accuracy + 0.2·format (FORMAT_WEIGHT=0.2, FORMAT_MIN_THINK_CHARS=100). No KL penalty.
Exploration: OFF.
Topology: 4-node 2+2 — 2 trainer nodes (16-GPU FSDP2, dp=16) + 2 rollout nodes (16 GPU, vLLM TP=2 → 8 replicas). H100×8 per node.
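
The reward line above can be sketched as follows. Only the weights and `FORMAT_MIN_THINK_CHARS` come from this card; the exact format and accuracy rules in the repo's `compute_score` are not shown here, so the matching logic below (a `<think>` block of at least 100 characters followed by a non-empty answer, and the last standalone capital letter taken as the answer) is an assumption for illustration.

```python
import re

FORMAT_WEIGHT = 0.2           # from the card
FORMAT_MIN_THINK_CHARS = 100  # from the card

# Assumed output shape: a <think> block followed by the final answer text.
_PATTERN = re.compile(r"^\s*<think>(.*?)</think>\s*(.*)$", re.DOTALL)


def format_score(response):
    """1.0 if there is a long-enough <think> block plus an answer (assumed rule)."""
    match = _PATTERN.match(response)
    if match is None:
        return 0.0
    think, answer = match.group(1).strip(), match.group(2).strip()
    return 1.0 if len(think) >= FORMAT_MIN_THINK_CHARS and answer else 0.0


def accuracy_score(response, gold_letter):
    """1.0 if the last standalone capital letter after </think> matches (assumed rule)."""
    tail = response.split("</think>")[-1]
    letters = re.findall(r"\b([A-Z])\b", tail)
    return 1.0 if letters and letters[-1] == gold_letter else 0.0


def reward(response, gold_letter):
    """score = 0.8 * accuracy + 0.2 * format, with no KL term."""
    return ((1 - FORMAT_WEIGHT) * accuracy_score(response, gold_letter)
            + FORMAT_WEIGHT * format_score(response))


well_formed = "<think>" + "x" * 120 + "</think>\nB"
print(reward(well_formed, "B"), reward(well_formed, "C"), reward("B", "B"))
```

Under these assumptions a well-formatted wrong answer still earns 0.2, which is consistent with format compliance saturating near 0.98 while accuracy stays flat.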

W&B

Project verl_fully_async (entity quangtrung5705-nanyang-technological-university-singapore). Train metrics only — video val is offline, not on W&B: https://wandb.ai/quangtrung5705-nanyang-technological-university-singapore/verl_fully_async/runs/s59ethn1

Intended use / limitations

Research checkpoint: the SFT-warmstart arm of an 8B video study whose headline finding is a 7-way dead heat at ~0.485 full-set val. This run shows that SFT warm-start gives faster early convergence but no durable val edge over cold-start, which peaks slightly higher (0.4918 vs. 0.4845). Intended for multiple-choice video QA; the model reasons inside <think>…</think> and then gives its answer. No safety or RLHF alignment beyond the base model.

Usage

python
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "ngqtrung/video-8b-grpo-sft770"
# Load weights in the checkpoint's dtype and spread them over the available devices.
model = AutoModelForImageTextToText.from_pretrained(model_id, dtype="auto", device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)

# One user turn: a video clip plus the multiple-choice prompt.
messages = [{
    "role": "user",
    "content": [
        {"type": "video", "video": "clip.mp4"},
        {"type": "text", "text": "Answer the multiple-choice question. Reason inside <think>...</think>, then give the final letter."},
    ],
}]
# Render the chat template and tokenize the text and video in one call.
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
).to(model.device)
out = model.generate(**inputs, max_new_tokens=1024)
# Decode only the newly generated tokens, dropping the prompt.
print(processor.batch_decode(out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0])
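
To use the decoded text downstream, it usually helps to separate the reasoning from the final letter. The helper below is a minimal sketch. It assumes the `<think>` tags survive decoding and that the answer is the last standalone capital letter after `</think>`. `split_response` and the sample string are hypothetical and are not part of the model repo.

```python
import re


def split_response(text):
    """Split a decoded generation into (reasoning, answer_letter or None)."""
    match = re.search(r"<think>(.*?)</think>", text, re.DOTALL)
    reasoning = match.group(1).strip() if match else ""
    # Look for the answer only after the reasoning block when one is present.
    tail = text[match.end():] if match else text
    letters = re.findall(r"\b([A-Z])\b", tail)
    return reasoning, (letters[-1] if letters else None)


# Hand-written sample, not real model output.
sample = "<think>The man opens the door before the light turns on.</think>\nAnswer: C"
reasoning, letter = split_response(sample)
print(letter)
```

If the generation hits `max_new_tokens` before closing the `<think>` block, the helper returns an empty reasoning string and falls back to scanning the whole text, so callers may want to treat that case as unanswered.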

Citation / lineage

Base model: Qwen3-VL-8B-Instruct (Qwen team). Inherits the Qwen3-VL license — review the base model's terms; the Apache-2.0 tag refers to this repo's RLVR training artifacts.
Warm start: OMR-SFT-770 (OpenMMReasoner SFT of Qwen3-VL-8B-Instruct, trained with lmms-engine).
Framework: verl (volcengine/verl), fork ngquangtrung57/verl@videorl-mods; fully-async GRPO (FSDP2 + vLLM).
Study: controlled OMR/Video exploration study on Qwen3-VL-8B; SFT-warmstart video arm (docs/experiments_summary_8b.md).
