ngqtrung

omr-8b-grpo-base


Results

OMR 6-image inline validation (mmmu val, mathvista testmini, mathverse testmini Text-Dominant, wemath testmini, charxiv reasoning-qa, dynamath test). overall-6 = unweighted mean. Metric = accuracy.

Keeper = global_step_25 (peak):

| metric | overall-6 | mmmu | mathvista | mathverse | wemath | charxiv | dynamath |
|---|---|---|---|---|---|---|---|
| baseline peak @25 | 0.6681 | 0.6311 | 0.8019 | 0.8404 | 0.7437 | 0.3930 | 0.5986 |
| stock Qwen3-VL-8B ckpt-0 | 0.659 | 0.629 | 0.811 | 0.824 | 0.723 | 0.396 | 0.570 |
| baseline @125 (collapse) | 0.4994 | 0.5021 | 0.6796 | 0.6995 | 0.6184 | 0.2010 | 0.2958 |
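The overall-6 column can be reproduced from the per-benchmark columns. The following is a minimal sketch with the numbers copied from the table above; the dictionary layout and the helper name `overall_6` are illustrative assumptions, not part of the evaluation code.

```python
from statistics import fmean

# Benchmark order matches the table columns.
BENCHMARKS = ("mmmu", "mathvista", "mathverse", "wemath", "charxiv", "dynamath")

# Per-benchmark accuracies copied from the results table.
rows = {
    "baseline peak @25": (0.6311, 0.8019, 0.8404, 0.7437, 0.3930, 0.5986),
    "stock Qwen3-VL-8B ckpt-0": (0.629, 0.811, 0.824, 0.723, 0.396, 0.570),
    "baseline @125 (collapse)": (0.5021, 0.6796, 0.6995, 0.6184, 0.2010, 0.2958),
}


def overall_6(scores):
    """Unweighted mean over the six benchmarks (hypothetical helper)."""
    if len(scores) != len(BENCHMARKS):
        raise ValueError(f"expected {len(BENCHMARKS)} scores, got {len(scores)}")
    return fmean(scores)


for name, scores in rows.items():
    print(f"{name}: {overall_6(scores):.4f}")
```

Because the mean is unweighted, each benchmark contributes equally regardless of how many examples it has.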

Full trajectory (overall-6): 0.668 @25 → 0.644 @50 → 0.620 @75 → 0.620 @100 → 0.499 @125. Accuracy never recovers after the step-25 peak, and the step-125 drop is catastrophic: charxiv and dynamath both roughly halve relative to the peak (0.393 → 0.201 and 0.599 → 0.296). The run died from a vLLM deepstack crash after step 125, but the model was already degrading.
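To make the checkpoint choice concrete, here is a small sketch that picks the keeper as the step with the highest overall-6 score and reports how far each later step falls below it. The values are the rounded trajectory numbers quoted above; `pick_keeper` and `is_non_increasing` are hypothetical helper names, and the actual selection procedure used for this run is not documented here.

```python
# overall-6 accuracy by global step, as quoted in the trajectory above.
trajectory = {25: 0.668, 50: 0.644, 75: 0.620, 100: 0.620, 125: 0.499}


def pick_keeper(traj):
    """Return the step with the highest score (hypothetical selection rule)."""
    return max(traj, key=traj.get)


def is_non_increasing(traj):
    """True if the score never rises from one evaluated step to the next."""
    values = [traj[step] for step in sorted(traj)]
    return all(a >= b for a, b in zip(values, values[1:]))


keeper = pick_keeper(trajectory)
peak = trajectory[keeper]

for step in sorted(trajectory):
    drop = peak - trajectory[step]
    print(f"step {step:>3}: {trajectory[step]:.3f} "
          f"(drop from peak {drop:.3f}, {drop / peak:.1%})")
```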

Training

Base model: Qwen/Qwen3-VL-8B-Instruct (cold-start, resume_mode=auto).
Framework: fork of volcengine/verl — ngquangtrung57/verl@videorl-mods. Fully-async GRPO: FSDP2 trainer + vLLM rollouter.
Warm start: none (cold-start from stock 8B-Instruct).
Reward: dapo-style score = 0.8·accuracy + 0.2·format (FORMAT_WEIGHT=0.2, FORMAT_MIN_THINK_CHARS=100). No KL penalty.
Exploration: OFF (this is the bitwise baseline against omr-8b-grpo-ppexplore).
4-node 2+2 — 2 trainer nodes (16-GPU FSDP2, dp=16) + 2 rollout nodes (16 GPU, vLLM TP=2 → 8 replicas). H100×8 per node.
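The reward line above can be read as a weighted sum of a correctness term and a format term. The sketch below is one plausible interpretation, not the fork's actual implementation: it assumes the format term is binary, that it requires a `<think>…</think>` block of at least `FORMAT_MIN_THINK_CHARS` characters followed by a non-empty answer, and that correctness is supplied by an external checker. The function names are hypothetical.

```python
import re

FORMAT_WEIGHT = 0.2           # from the training config above
FORMAT_MIN_THINK_CHARS = 100  # from the training config above

# Assumed response layout: <think>reasoning</think> followed by the answer.
_THINK_RE = re.compile(r"<think>(.*?)</think>(.*)", re.DOTALL)


def format_score(response):
    """1.0 if the response has a long-enough think block and an answer, else 0.0.

    This exact rule is an assumption made for illustration.
    """
    match = _THINK_RE.search(response)
    if match is None:
        return 0.0
    think, answer = match.group(1).strip(), match.group(2).strip()
    if len(think) < FORMAT_MIN_THINK_CHARS or not answer:
        return 0.0
    return 1.0


def combined_score(response, is_correct):
    """score = 0.8 * accuracy + 0.2 * format, with accuracy in {0, 1}."""
    accuracy = 1.0 if is_correct else 0.0
    return (1.0 - FORMAT_WEIGHT) * accuracy + FORMAT_WEIGHT * format_score(response)
```

Under these assumptions a correct but badly formatted response scores 0.8, and a well-formatted wrong response scores 0.2, so the format term can never outweigh correctness.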

W&B

Project verl_fully_async (entity quangtrung5705-nanyang-technological-university-singapore): https://wandb.ai/quangtrung5705-nanyang-technological-university-singapore/verl_fully_async/runs/gkiiopep

Intended use / limitations

Research checkpoint: the cold-start GRPO baseline in a controlled exploration study. It is useful mainly as the A/B reference for omr-8b-grpo-ppexplore (the winner). At its peak it only marginally beats the zero-shot base (0.668 vs 0.659 overall-6), and it is unstable, collapsing to 0.499 when trained past roughly step 100. For actual use, prefer the ppexplore checkpoint. Target task: math and visual-reasoning image QA, answered in a <think>…</think>-then-answer format. No safety or RLHF alignment beyond what the base model provides.

Usage

```python
from transformers import AutoModelForImageTextToText, AutoProcessor
from PIL import Image

# Load the checkpoint and its processor (chat template + image preprocessing).
model_id = "ngqtrung/omr-8b-grpo-base"
model = AutoModelForImageTextToText.from_pretrained(model_id, dtype="auto", device_map="auto")
processor = AutoProcessor.from_pretrained(model_id)

# One user turn: the problem image plus an instruction in the <think> format.
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": Image.open("problem.png")},
        {"type": "text", "text": "Solve the problem. Think step by step inside <think>...</think>, then give the final answer."},
    ],
}]
inputs = processor.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
).to(model.device)

# Generate, then decode only the newly generated tokens (skip the prompt).
out = model.generate(**inputs, max_new_tokens=2048)
print(processor.batch_decode(out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0])
```
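Since the model is prompted to reason inside `<think>…</think>` before answering, downstream code usually wants the two parts separately. A minimal sketch, assuming the decoded text contains at most one think block followed by the answer; `split_think_answer` is a hypothetical helper, not something shipped with this repo.

```python
import re


def split_think_answer(text):
    """Split decoded output into (reasoning, answer).

    If no <think>...</think> block is found, the reasoning is empty and the
    whole text is treated as the answer.
    """
    match = re.search(r"<think>(.*?)</think>", text, flags=re.DOTALL)
    if match is None:
        return "", text.strip()
    return match.group(1).strip(), text[match.end():].strip()


reasoning, answer = split_think_answer("<think>2 + 1 = 3</think>\nAnswer: 3")
print(answer)
```

If generation stops at `max_new_tokens` before the closing tag is emitted, this helper falls back to returning the full text as the answer, so callers may want to check for an empty reasoning string.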

Citation / lineage

Base model: Qwen3-VL-8B-Instruct (Qwen team). Inherits the Qwen3-VL license — review the base model's terms; the Apache-2.0 tag refers to this repo's RLVR training artifacts.
Framework: verl (volcengine/verl), fork ngquangtrung57/verl@videorl-mods; fully-async GRPO (FSDP2 + vLLM).
Study: controlled OMR/Video exploration study on Qwen3-VL-8B; this is the no-exploration OMR baseline arm (docs/experiments_summary_8b.md).

Model provider

ngqtrung

Model tree

Base

Qwen/Qwen3-VL-8B-Instruct

Fine-tuned

this model

Modalities

Input

Text, Image

Output

Text

