Results
OMR 6-image inline validation (mmmu val, mathvista testmini, mathverse testmini Text-Dominant, wemath testmini, charxiv reasoning-qa, dynamath test). overall-6 = unweighted mean of the 6 benches. Metric = accuracy (weight-independent).
Keeper = global_step_250:
| metric | overall-6 | mmmu | mathvista | mathverse | wemath | charxiv | dynamath |
|---|---|---|---|---|---|---|---|
| ppexplore τ0.95 @250 | 0.7138 | 0.6789 | 0.8352 | 0.8967 | 0.8132 | 0.4150 | 0.6436 |
| baseline peak @25 | 0.6681 | 0.6311 | 0.8019 | 0.8404 | 0.7437 | 0.3930 | 0.5986 |
| stock Qwen3-VL-8B ckpt-0 | 0.659 | 0.629 | 0.811 | 0.824 | 0.723 | 0.396 | 0.570 |
| Δ (explore − baseline peak), pts | +4.6 | +4.8 | +3.3 | +5.7 | +6.9 | +2.2 | +4.5 |
Exploration wins on every benchmark, with the largest gains on wemath (+6.9 pt) and mathverse (+5.7 pt). Validation overall-6 trajectory: 0.681 @75 → 0.673 @100 → 0.659 @150 → 0.668 @200 → 0.714 @250 → 0.661 @300. The score never drops below 0.659 over these 225 steps, whereas the baseline fell 0.668 → 0.620 → 0.499.
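The aggregate numbers can be re-derived from the per-benchmark entries. The sketch below is only a sanity check: it copies the values from the table and the trajectory above, and the helper names (`overall6`, `delta_points`) are illustrative, not part of the evaluation code. Because it works from the rounded table entries, a recomputed delta can differ from the printed Δ row in the last digit.

```python
# Per-benchmark accuracies copied from the results table.
BENCHES = ["mmmu", "mathvista", "mathverse", "wemath", "charxiv", "dynamath"]
explore = dict(zip(BENCHES, [0.6789, 0.8352, 0.8967, 0.8132, 0.4150, 0.6436]))
baseline = dict(zip(BENCHES, [0.6311, 0.8019, 0.8404, 0.7437, 0.3930, 0.5986]))


def overall6(scores):
    """Unweighted mean over the six benchmarks."""
    return sum(scores[b] for b in BENCHES) / len(BENCHES)


def delta_points(a, b):
    """Per-benchmark difference a - b, in percentage points."""
    return {k: 100.0 * (a[k] - b[k]) for k in BENCHES}


# overall-6 validation trajectory of the exploration run (step -> accuracy).
trajectory = {75: 0.681, 100: 0.673, 150: 0.659, 200: 0.668, 250: 0.714, 300: 0.661}
# The keeper is the step with the highest overall-6 score.
keeper_step = max(trajectory, key=trajectory.get)

if __name__ == "__main__":
    print(round(overall6(explore), 4), round(overall6(baseline), 4))
    print({k: round(v, 2) for k, v in delta_points(explore, baseline).items()})
    print(keeper_step, min(trajectory.values()))
```

Selecting the keeper by the maximum of a validation trajectory is the simplest choice; it is shown here only to make the "Keeper = global_step_250" statement reproducible from the listed numbers.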
Training
- Base model: Qwen/Qwen3-VL-8B-Instruct.
- Framework: fork of volcengine/verl (ngquangtrung57/verl@videorl-mods). Fully-async GRPO: FSDP2 trainer + vLLM rollouter, partial rollout, staleness-bounded off-policy.
- Warm start: from global_step_50 of the cold-start OMR baseline (grpo_omr_4node_full_v1_8b_base_perf), with resume_mode=resume_path.
- Reward: DAPO-style score = 0.8·accuracy + 0.2·format (FORMAT_WEIGHT=0.2, FORMAT_MIN_THINK_CHARS=100). No KL penalty (use_kl_in_reward=false).
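The reward mix can be sketched as follows. Only the weights and the two constant names come from the bullet above. The format check itself is an assumption: this sketch gives full format credit when the response contains a `<think>…</think>` block with at least `FORMAT_MIN_THINK_CHARS` characters, which may not match the actual scorer in the fork. The function names are hypothetical.

```python
import re

FORMAT_WEIGHT = 0.2            # from the model card
FORMAT_MIN_THINK_CHARS = 100   # from the model card

_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def format_score(response: str) -> float:
    """ASSUMED format rule: 1.0 if a sufficiently long <think> block exists, else 0.0."""
    match = _THINK_RE.search(response)
    if match is None:
        return 0.0
    return 1.0 if len(match.group(1).strip()) >= FORMAT_MIN_THINK_CHARS else 0.0


def combined_score(accuracy: float, response: str) -> float:
    """score = (1 - FORMAT_WEIGHT) * accuracy + FORMAT_WEIGHT * format."""
    return (1.0 - FORMAT_WEIGHT) * accuracy + FORMAT_WEIGHT * format_score(response)
```

Under this assumed rule, a correct answer without a reasoning block scores 0.8, and a well-formatted but wrong answer scores 0.2.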
| key | value |
|---|---|
| enable | true |
| trigger_mode | high |
| top_prob_threshold (τ) | 0.95 |
| k_explore | 4 (of n=8 rollouts explore; 4 stay clean) |
| prompt_exploration_prob | 0.5 |
| deterministic | |
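One possible reading of how these settings split a rollout group is sketched below. It is an interpretation, not the fork's implementation: it assumes each prompt is selected for exploration with probability `prompt_exploration_prob`, and that a selected prompt has `k_explore` of its `n` rollouts marked as exploring while the rest stay clean. The token-level mechanism (`trigger_mode`, `top_prob_threshold`) is carried in the config but deliberately not reproduced here, and `ExploreConfig` and `assign_rollouts` are hypothetical names.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class ExploreConfig:
    # Values copied from the table; the class itself is illustrative.
    enable: bool = True
    trigger_mode: str = "high"
    top_prob_threshold: float = 0.95
    k_explore: int = 4
    prompt_exploration_prob: float = 0.5


def assign_rollouts(cfg: ExploreConfig, n: int, rng: random.Random) -> list[bool]:
    """Return n flags for one prompt: True = exploring rollout, False = clean.

    ASSUMPTION: the prompt-level coin flip and the k-of-n split are independent
    of rollout content; which indices explore is arbitrary in this sketch.
    """
    if not cfg.enable or rng.random() >= cfg.prompt_exploration_prob:
        return [False] * n
    k = min(cfg.k_explore, n)
    return [False] * (n - k) + [True] * k


if __name__ == "__main__":
    rng = random.Random(0)
    print(assign_rollouts(ExploreConfig(), n=8, rng=rng))
```

Under this reading, the expected share of exploring rollouts is prompt_exploration_prob × k_explore / n = 0.5 × 4 / 8 = 0.25, so most sampled responses remain ordinary on-policy rollouts.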
- Train metrics: ~216 s/step; final reward 0.658 (lower than the baseline by design, since exploration tokens score below the greedy anchor); final response_length ~1947 tok. The run died at step 335 due to a repeatable verl resume_path hang (confirmed 4×); the keeper step_250 is well before that.
W&B
Project verl_fully_async (entity quangtrung5705-nanyang-technological-university-singapore). The logical run spans two crash-resume segments:
Intended use / limitations
Research checkpoint from a controlled exploration study (does token-dropout exploration help multimodal RLVR?). On OMR-8B the answer is a clear yes: this is the campaign winner (+4.6 pt and training-stability rescue). Best for math / visual-reasoning image QA in a <think>…</think> then-answer format. Not a general-purpose chat model; not tuned for video (see the video-8b-grpo-* siblings, which are a documented dead-heat at ~0.485). No safety/RLHF alignment beyond the base model.
Usage
```python
from transformers import AutoModelForImageTextToText, AutoProcessor
from PIL import Image

model_id = "ngqtrung/omr-8b-grpo-ppexplore"
model = AutoModelForImageTextToText.from_pretrained(model_id, dtype="auto", device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)

# Single-turn request: one image plus the reasoning prompt.
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": Image.open("problem.png")},
        {"type": "text", "text": "Solve the problem. Think step by step inside <think>...</think>, then give the final answer."},
    ],
}]

inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
).to(model.device)

out = model.generate(**inputs, max_new_tokens=2048)
# Decode only the newly generated tokens, skipping the prompt.
print(processor.batch_decode(out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0])
```
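Since the model answers in a `<think>…</think>` then-answer format, downstream code usually wants the final answer without the reasoning. The helper below is a minimal sketch for that; `split_think_answer` is a hypothetical name, and it assumes the `<think>` tags survive decoding as plain text in the generated string.

```python
import re


def split_think_answer(text: str) -> tuple[str, str]:
    """Split a decoded response into (reasoning, final_answer).

    If no <think>...</think> block is found, the reasoning is empty and the
    whole text is returned as the answer.
    """
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if match is None:
        return "", text.strip()
    return match.group(1).strip(), text[match.end():].strip()


if __name__ == "__main__":
    reasoning, answer = split_think_answer("<think>2 + 2 = 4</think>\nThe answer is 4.")
    print(answer)
```

Only the first `<think>` block is treated as reasoning; anything after its closing tag is returned as the answer.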
Citation / lineage
- Base model: Qwen3-VL-8B-Instruct (Qwen team). This checkpoint inherits the Qwen3-VL license — review the base model's terms before use; the Apache-2.0 tag refers to this repo's RLVR training artifacts.
- Framework: verl (volcengine/verl), fork ngquangtrung57/verl@videorl-mods; fully-async GRPO (FSDP2 + vLLM).
- Method: entropy-aware token-dropout exploration ("ppexplore", τ=0.95) at the rollout stage, warm-started from a mid-RL checkpoint. Part of a controlled OMR/Video exploration study on Qwen3-VL-8B (docs/experiments_summary_8b.md).