ngqtrung

omr-8b-grpo-ppexplore

Results

OMR 6-image inline validation (mmmu val, mathvista testmini, mathverse testmini Text-Dominant, wemath testmini, charxiv reasoning-qa, dynamath test). overall-6 = unweighted mean of the 6 benches. Metric = accuracy (weight-independent).

Keeper = global_step_250:

| metric | overall-6 | mmmu | mathvista | mathverse | wemath | charxiv | dynamath |
|---|---|---|---|---|---|---|---|
| ppexplore τ0.95 @250 | 0.7138 | 0.6789 | 0.8352 | 0.8967 | 0.8132 | 0.4150 | 0.6436 |
| baseline peak @25 | 0.6681 | 0.6311 | 0.8019 | 0.8404 | 0.7437 | 0.3930 | 0.5986 |
| stock Qwen3-VL-8B ckpt-0 | 0.659 | 0.629 | 0.811 | 0.824 | 0.723 | 0.396 | 0.570 |
| Δ (explore − baseline peak), pt | +4.6 | +4.8 | +3.3 | +5.7 | +6.9 | +2.2 | +4.5 |
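As a sanity check on the table, overall-6 can be recomputed as the unweighted mean of the six per-benchmark accuracies, and the Δ row as a difference in percentage points. The sketch below uses only numbers copied from the table; the variable and function names are illustrative. Because the table entries are already rounded, a recomputed Δ may differ from the reported row in the last digit.

```python
from statistics import mean

BENCHES = ["mmmu", "mathvista", "mathverse", "wemath", "charxiv", "dynamath"]

# Per-benchmark accuracies copied from the table above.
explore = dict(zip(BENCHES, [0.6789, 0.8352, 0.8967, 0.8132, 0.4150, 0.6436]))
baseline = dict(zip(BENCHES, [0.6311, 0.8019, 0.8404, 0.7437, 0.3930, 0.5986]))


def overall6(scores):
    """Unweighted mean over the six benchmarks."""
    return mean(scores[b] for b in BENCHES)


def delta_points(a, b):
    """Per-benchmark difference a - b, expressed in percentage points."""
    return {name: 100.0 * (a[name] - b[name]) for name in BENCHES}


overall_explore = overall6(explore)
overall_baseline = overall6(baseline)
overall_delta = 100.0 * (overall_explore - overall_baseline)

print(f"overall-6 explore:  {overall_explore:.4f}")
print(f"overall-6 baseline: {overall_baseline:.4f}")
print(f"overall-6 delta:    {overall_delta:+.2f} pt")
for name, value in delta_points(explore, baseline).items():
    print(f"{name:>10}: {value:+.2f} pt")
```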

Exploration beats the baseline peak on every benchmark, with the largest gains on wemath (+6.9) and mathverse (+5.7). Overall-6 validation trajectory: 0.681 @75 → 0.673 @100 → 0.659 @150 → 0.668 @200 → 0.714 @250 → 0.661 @300. The score never dropped below 0.659 over these 225 steps, whereas the baseline fell 0.668→0.620→0.499.
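The keeper choice and the stability claim can be checked directly from the trajectory values quoted above. This is a small sketch: the dictionary just restates those numbers, and picking the keeper as the arg-max of validation overall-6 is an assumption about how step 250 was selected.

```python
# Validation overall-6 by training step, as quoted in the trajectory above.
trajectory = {75: 0.681, 100: 0.673, 150: 0.659, 200: 0.668, 250: 0.714, 300: 0.661}

# Assumed selection rule: keep the step with the highest validation score.
keeper_step = max(trajectory, key=trajectory.get)

# Lowest score seen and the number of steps the trajectory covers.
floor = min(trajectory.values())
span = max(trajectory) - min(trajectory)

print(f"keeper: global_step_{keeper_step} ({trajectory[keeper_step]:.3f})")
print(f"floor:  {floor:.3f} over {span} steps")
```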

Training

Base model: Qwen/Qwen3-VL-8B-Instruct.
Framework: fork of volcengine/verl — ngquangtrung57/verl@videorl-mods. Fully-async GRPO: FSDP2 trainer + vLLM rollouter, partial rollout, staleness-bounded off-policy.
Warm start: from the cold-start OMR baseline (grpo_omr_4node_full_v1_8b_base_perf) global_step_50 (resume_mode=resume_path).
Reward: dapo-style score = 0.8·accuracy + 0.2·format (FORMAT_WEIGHT=0.2, FORMAT_MIN_THINK_CHARS=100). No KL penalty (use_kl_in_reward=false).
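The reward mix can be illustrated with a minimal sketch. Only the 0.8/0.2 weighting and the two constants come from the list above; the format check itself is an assumption (a `<think>` block of at least FORMAT_MIN_THINK_CHARS characters followed by a non-empty answer), and the function names are hypothetical rather than taken from the verl fork.

```python
import re

FORMAT_WEIGHT = 0.2            # from the card
FORMAT_MIN_THINK_CHARS = 100   # from the card

_THINK_RE = re.compile(r"<think>(.*?)</think>(.*)", re.DOTALL)


def format_score(response: str) -> float:
    """Assumed format check: long-enough <think> block, then a non-empty answer."""
    match = _THINK_RE.search(response)
    if match is None:
        return 0.0
    think, answer = match.group(1).strip(), match.group(2).strip()
    return 1.0 if len(think) >= FORMAT_MIN_THINK_CHARS and answer else 0.0


def combined_score(accuracy: float, response: str) -> float:
    """score = 0.8 * accuracy + 0.2 * format, as stated in the card."""
    return (1.0 - FORMAT_WEIGHT) * accuracy + FORMAT_WEIGHT * format_score(response)


well_formed = "<think>" + "step " * 30 + "</think> The answer is 42."
print(combined_score(1.0, well_formed))  # correct and well formatted
print(combined_score(1.0, "42"))         # correct but no <think> block
print(combined_score(0.0, well_formed))  # wrong but well formatted
```

Under this reading, a correct answer without the reasoning block tops out at 0.8, and formatting alone earns at most 0.2.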

| key | value |
|---|---|
| `enable` | true |
| `trigger_mode` | high |
| `top_prob_threshold` (τ) | 0.95 |
| `k_explore` | 4 (of n=8 rollouts explore; 4 stay clean) |
| `prompt_exploration_prob` | 0.5 |
| `deterministic` | |
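One way to read these settings is sketched below. The field names and values mirror the table, but the semantics are assumptions, not the fork's actual code: a prompt is chosen for exploration with probability `prompt_exploration_prob`, then `k_explore` of its `n` rollouts are flagged to explore, and with `trigger_mode=high` a token position triggers when the top-token probability is at least τ. The `deterministic` key is left out because the table gives no value for it. `ExploreConfig`, `assign_explore_flags`, and `should_trigger` are hypothetical names.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class ExploreConfig:
    enable: bool = True
    trigger_mode: str = "high"
    top_prob_threshold: float = 0.95   # tau
    k_explore: int = 4
    prompt_exploration_prob: float = 0.5


def assign_explore_flags(cfg: ExploreConfig, n_rollouts: int, rng: random.Random) -> list:
    """Assumed rule: per prompt, either no rollout explores or exactly k_explore do."""
    if not cfg.enable or rng.random() >= cfg.prompt_exploration_prob:
        return [False] * n_rollouts
    k = min(cfg.k_explore, n_rollouts)
    return [i < k for i in range(n_rollouts)]


def should_trigger(cfg: ExploreConfig, top_prob: float) -> bool:
    """Assumed meaning of trigger_mode='high': fire on high-confidence tokens."""
    if cfg.trigger_mode != "high":
        raise ValueError(f"unsupported trigger_mode in this sketch: {cfg.trigger_mode}")
    return top_prob >= cfg.top_prob_threshold


cfg = ExploreConfig()
rng = random.Random(0)
flags_per_prompt = [assign_explore_flags(cfg, 8, rng) for _ in range(1000)]
explored_prompts = sum(any(flags) for flags in flags_per_prompt)
print(f"prompts with exploration: {explored_prompts} / 1000")
```

With these values, roughly half of the prompts would carry four exploring and four clean rollouts, and the rest would be sampled entirely clean.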

Train metrics: ~216 s/step; final reward 0.658 (lower than the baseline by design, since exploration tokens score below the greedy anchor); final response_length ~1947 tok. The run died at step 335 from a reproducible verl resume_path hang (confirmed 4×); the step-250 keeper predates it.

W&B

Project verl_fully_async (entity quangtrung5705-nanyang-technological-university-singapore). The logical run spans two crash-resume segments:

steps 75–124: https://wandb.ai/quangtrung5705-nanyang-technological-university-singapore/verl_fully_async/runs/i56xzqqo
steps 125–335 (main, contains the step-250 keeper): https://wandb.ai/quangtrung5705-nanyang-technological-university-singapore/verl_fully_async/runs/hi1ng1nx
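To work with both segments as one logical run, the URLs above can be turned into W&B run paths and fetched through the public API. The sketch below assumes the `wandb` package is installed and that you have read access to the project; `run_path_from_url` and `fetch_history` are hypothetical helper names, and the metric keys to request are not listed in this card, so `keys` is left for the caller to fill in.

```python
from urllib.parse import urlparse

SEGMENT_URLS = [
    "https://wandb.ai/quangtrung5705-nanyang-technological-university-singapore/verl_fully_async/runs/i56xzqqo",
    "https://wandb.ai/quangtrung5705-nanyang-technological-university-singapore/verl_fully_async/runs/hi1ng1nx",
]


def run_path_from_url(url: str) -> str:
    """Convert a wandb.ai run URL into the 'entity/project/run_id' path form."""
    parts = urlparse(url).path.strip("/").split("/")
    if len(parts) < 4 or parts[2] != "runs":
        raise ValueError(f"not a W&B run URL: {url}")
    entity, project, _, run_id = parts[:4]
    return f"{entity}/{project}/{run_id}"


def fetch_history(url: str, keys=None):
    """Fetch logged history for one segment (needs network access and credentials)."""
    import wandb  # imported lazily so the path helper works without wandb installed

    run = wandb.Api().run(run_path_from_url(url))
    return run.history(keys=keys)


run_paths = [run_path_from_url(url) for url in SEGMENT_URLS]
for path in run_paths:
    print(path)
```

Concatenating the two histories on the step axis would reconstruct the full 75–335 range, assuming both segments log against the same global step.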

Intended use / limitations

Research checkpoint from a controlled exploration study (does token-dropout exploration help multimodal RLVR?). On OMR-8B the answer is a clear yes: this is the campaign winner (+4.6 pt and training-stability rescue). Best for math / visual-reasoning image QA in a <think>…</think> then-answer format. Not a general-purpose chat model; not tuned for video (see the video-8b-grpo-* siblings, which are a documented dead-heat at ~0.485). No safety/RLHF alignment beyond the base model.

Usage

```python
from transformers import AutoModelForImageTextToText, AutoProcessor
from PIL import Image

model_id = "ngqtrung/omr-8b-grpo-ppexplore"
# dtype="auto" uses the checkpoint's stored precision; device_map="auto" places weights on available devices.
model = AutoModelForImageTextToText.from_pretrained(model_id, dtype="auto", device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)

# One user turn: the problem image followed by the text instruction.
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": Image.open("problem.png")},
        {"type": "text", "text": "Solve the problem. Think step by step inside <think>...</think>, then give the final answer."},
    ],
}]
# Render the chat template and tokenize text and image in one call.
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
).to(model.device)
out = model.generate(**inputs, max_new_tokens=2048)
# Decode only the newly generated tokens, skipping the prompt prefix.
print(processor.batch_decode(out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0])
```
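Since the model answers in a `<think>…</think>` then-answer format, downstream code usually wants the final answer separated from the reasoning. A minimal sketch, assuming the decoded text contains at most one well-formed `<think>` block; `split_reasoning` is a hypothetical helper, not part of this repo.

```python
import re

_THINK_BLOCK = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(text: str) -> tuple:
    """Return (reasoning, answer); reasoning is empty if no <think> block is found."""
    match = _THINK_BLOCK.search(text)
    if match is None:
        return "", text.strip()
    # Everything after the closing tag is treated as the final answer.
    return match.group(1).strip(), text[match.end():].strip()


reasoning, answer = split_reasoning("<think>2 + 2 = 4</think>\nThe answer is 4.")
print(answer)
```

If generation is cut off by `max_new_tokens` before `</think>` is emitted, this helper returns the whole text as the answer, so callers may want to check for an unclosed tag.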

Citation / lineage

Base model: Qwen3-VL-8B-Instruct (Qwen team). This checkpoint inherits the Qwen3-VL license — review the base model's terms before use; the Apache-2.0 tag refers to this repo's RLVR training artifacts.
Framework: verl (volcengine/verl), fork ngquangtrung57/verl@videorl-mods; fully-async GRPO (FSDP2 + vLLM).
Method: entropy-aware token-dropout exploration ("ppexplore", τ=0.95) at the rollout stage, warm-started from a mid-RL checkpoint. Part of a controlled OMR/Video exploration study on Qwen3-VL-8B (docs/experiments_summary_8b.md).

Model provider

ngqtrung

Model tree

Base

Qwen/Qwen3-VL-8B-Instruct

Fine-tuned

this model

Modalities

Input

Text, Image

Output

Text


