qwen2vl-omr-lora-v2 API & Inference Endpoint

Training

| Setting | Value |
|---|---|
| base model | `Qwen/Qwen2-VL-2B-Instruct` |
| method | LoRA (r=16, α=32, dropout=0.05) on `q_proj`, `k_proj`, `v_proj`, `o_proj` |
| starting weights | continued from prior adapter (v1), not from scratch |
| dataset | 1500 OMR sheets × 3 fields, human-corrected via Django labeling tool |
| split | 80/20 by sheet (no leakage) → 3414 train / 859 eval rows |
| epochs | 3 |
| batch size | 1 (per-device) × 8 grad-accum = effective 8 |
| learning rate | 5e-5 (lower than v1 since starting from a trained adapter) |
| warmup | 0.03 |
| precision | fp16 on Apple Silicon MPS |
| gradient checkpointing | on |
| total steps | 1281 |
| final train loss | 0.014 (running avg) |
| wall-clock | ~6 hours on M-series MPS |
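The numbers in the table can be cross-checked with a little arithmetic. The sketch below is only a sanity check, not the training script. It assumes that the warmup value of 0.03 is a fraction of total steps, and that a trailing partial gradient-accumulation window still counts as one optimizer step; neither is stated above.

```python
import math

# Values copied from the training table.
train_rows, eval_rows = 3414, 859
per_device_batch, grad_accum = 1, 8
epochs = 3
lora_r, lora_alpha = 16, 32
warmup = 0.03  # assumption: a fraction of total steps, not a step count

effective_batch = per_device_batch * grad_accum
# The split is made by sheet, so the row-level fraction is only close to 0.80.
train_fraction = train_rows / (train_rows + eval_rows)

# Assumption: a trailing partial accumulation window counts as one optimizer step.
steps_per_epoch = math.ceil(train_rows / effective_batch)
total_steps = steps_per_epoch * epochs

# Warmup length implied by the assumed ratio.
warmup_steps = math.ceil(warmup * total_steps)

# Standard LoRA scales the low-rank update by alpha / r.
lora_scale = lora_alpha / lora_r

print(effective_batch, round(train_fraction, 3), steps_per_epoch, total_steps)
print(warmup_steps, lora_scale)
```

Under these assumptions the computed step count agrees with the 1281 total steps reported in the table.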

Inference

```python
from peft import PeftModel
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration
from PIL import Image
import torch

BASE = "Qwen/Qwen2-VL-2B-Instruct"
ADAPTER = "kshitizjangra/qwen2vl-omr-lora-v2"

# Load the base model in fp16, then attach the LoRA adapter on top of it.
# Note: older transformers releases spell the `dtype` keyword `torch_dtype`.
processor = AutoProcessor.from_pretrained(BASE, trust_remote_code=True)
base = Qwen2VLForConditionalGeneration.from_pretrained(BASE, dtype=torch.float16, trust_remote_code=True)
model = PeftModel.from_pretrained(base, ADAPTER).to("mps").eval()

# The input is a single section crop (here, the roll number field).
img = Image.open("path/to/roll_no.jpg").convert("RGB")
messages = [{"role": "user", "content": [
    {"type": "image", "image": img},
    {"type": "text",  "text": "Read the handwritten value. Output only the value."},
]}]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[[img]], return_tensors="pt").to("mps")

# Greedy decoding; slice off the prompt tokens so only the generated answer is printed.
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=64, do_sample=False)
print(processor.tokenizer.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```

Intended use

Internal tool for digitizing university OMR sheets that follow a fixed template (Part-D). The model expects a single section crop (registration_no / roll_no / course_code) and returns the handwritten value as a string.
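Because the model reads one crop at a time, digitizing a full sheet means one call per field. A minimal sketch of that loop is shown here. `digitize_sheet` and `read_value` are hypothetical names, not part of this repository: `read_value` stands for any function that takes one crop and returns the decoded string, such as a wrapper around the inference code above.

```python
from typing import Callable, Dict

# Field names as listed in the intended-use description.
FIELDS = ("registration_no", "roll_no", "course_code")


def digitize_sheet(crops: Dict[str, object],
                   read_value: Callable[[object], str]) -> Dict[str, str]:
    """Run a single-crop reader once per field and collect the results.

    `crops` maps each field name to its cropped image (hypothetical layout).
    """
    missing = [field for field in FIELDS if field not in crops]
    if missing:
        raise KeyError(f"missing crops for: {missing}")
    # Strip surrounding whitespace so the stored values are clean strings.
    return {field: read_value(crops[field]).strip() for field in FIELDS}


if __name__ == "__main__":
    # Stand-in reader for demonstration only; a real one would call the model.
    fake_crops = {field: f"<crop of {field}>" for field in FIELDS}
    print(digitize_sheet(fake_crops, lambda crop: " placeholder "))
```

Keeping the reader as a parameter lets the loop be exercised without loading the model.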

Limitations

- Trained only on this specific Part-D template; it will not generalize to arbitrary forms.
- The training data contains some labeler errors (e.g. occasional field mix-ups where a roll number was entered in the registration field).
- Eval accuracy has not yet been measured on the 859-row held-out split.
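Since the held-out accuracy is still unmeasured, one simple way to measure it is exact-match accuracy per field. The sketch below assumes predictions and references are available as `(field, prediction, reference)` tuples. That row layout and the function name `exact_match_by_field` are illustrative assumptions, and the sample rows are made-up placeholders, not results.

```python
from collections import defaultdict
from typing import Dict, Iterable, Tuple


def exact_match_by_field(rows: Iterable[Tuple[str, str, str]]) -> Dict[str, float]:
    """Exact-match accuracy per field, plus an overall score.

    Each row is (field, prediction, reference); whitespace is stripped
    before comparison.
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for field, prediction, reference in rows:
        totals[field] += 1
        hits[field] += int(prediction.strip() == reference.strip())
    scores = {field: hits[field] / totals[field] for field in totals}
    # Guard against an empty input so the overall score is still defined.
    scores["overall"] = sum(hits.values()) / sum(totals.values()) if totals else 0.0
    return scores


if __name__ == "__main__":
    # Made-up placeholder rows for illustration only.
    sample = [
        ("roll_no", "123", "123"),
        ("roll_no", "124", "123"),
        ("course_code", "CS101", " CS101"),
    ]
    print(exact_match_by_field(sample))
```

Reporting the score per field as well as overall would show whether errors concentrate in one field, for example as a result of the field mix-ups in the labels.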


