kshitizjangra

qwen2vl-omr-lora-partc

License: apache-2.0

Intended use

Single-shot OCR of a tightly cropped image containing one handwritten value that is numeric or otherwise short. The output is the value only, with no surrounding prose.

Prompt (used at training and inference):

markdown
Read the handwritten value. Output only the value.
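
Because the prompt asks for the value only, downstream code can treat the decoded string as the answer after light cleanup. The following is a minimal sketch under that assumption; `clean_value`, `is_numeric`, and the length cap of 8 characters are hypothetical and not taken from the project.

```python
import re


def clean_value(raw, max_len=8):
    """Return the first line of a decoded output, or None if it is empty or too long.

    Hypothetical helper: the max_len cap is an assumption, not a project setting.
    """
    lines = raw.strip().splitlines()
    text = lines[0].strip() if lines else ""
    if not text or len(text) > max_len:
        return None
    return text


def is_numeric(value):
    """True for plain digits with an optional decimal part, e.g. "7" or "12.5"."""
    return re.fullmatch(r"\d+(\.\d+)?", value) is not None
```

A caller might route anything where `clean_value` returns `None`, or where `is_numeric` is false for a field expected to hold marks, to manual review instead of accepting it.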

Training

Base model	`Qwen/Qwen2-VL-2B-Instruct`
Method	PEFT LoRA
Rank (`r`)	16
`lora_alpha`	32
`lora_dropout`	0.05
Target modules	`q_proj`, `k_proj`, `v_proj`, `o_proj`
Task type	`CAUSAL_LM`
Data	OMR Part-C cell crops, JSONL splits (80/20 train/eval)
Starting adapter	Earlier Part-D LoRA (continued fine-tune)
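
The hyperparameters in the table map directly onto the fields of PEFT's `LoraConfig`. The sketch below only mirrors the listed values; it is not the project's training script, and the guarded import is there so the snippet still runs where `peft` is not installed.

```python
# Hyperparameters copied from the Training table.
LORA_SETTINGS = {
    "r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "target_modules": ["q_proj", "k_proj", "v_proj", "o_proj"],
    "task_type": "CAUSAL_LM",
}

# With standard LoRA scaling, the low-rank update is multiplied by lora_alpha / r.
scaling = LORA_SETTINGS["lora_alpha"] / LORA_SETTINGS["r"]
print(f"LoRA scaling factor: {scaling}")

try:
    from peft import LoraConfig  # optional dependency for this sketch
except ImportError:
    LoraConfig = None

# Build the config object only when PEFT is available.
lora_config = LoraConfig(**LORA_SETTINGS) if LoraConfig is not None else None
```

Since training continued from an earlier Part-D adapter, the actual run presumably loaded that adapter rather than starting from a fresh config; that step is not shown here.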

Usage

python
from peft import PeftModel
from transformers import AutoProcessor, Qwen2VLForConditionalGeneration
from PIL import Image
import torch

BASE = "Qwen/Qwen2-VL-2B-Instruct"
ADAPTER = "kshitizjangra/qwen2vl-omr-lora-partc"

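# Load the processor and the base model, then attach the LoRA adapter on top.
# float16 suits GPU inference; device_map="auto" places the weights on available devices.
processor = AutoProcessor.from_pretrained(BASE)
model = Qwen2VLForConditionalGeneration.from_pretrained(BASE, torch_dtype=torch.float16, device_map="auto")
model = PeftModel.from_pretrained(model, ADAPTER)
model.eval()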

image = Image.open("crop.jpg").convert("RGB")
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": image},
        {"type": "text",  "text": "Read the handwritten value. Output only the value."},
    ],
}]
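# Render the chat template to a prompt string, then encode the text and image together.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

# Greedy decoding with a small token budget, since the expected output is short.
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=16, do_sample=False)
# Decode only the newly generated tokens (everything after the prompt).
print(processor.batch_decode(out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True)[0].strip())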
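
To check the adapter against labeled crops, the inference code above can be wrapped in a function and scored by exact match. This is a sketch only: `predict` stands for any callable that maps an image path to a string, and the `image` and `label` field names are assumptions about the JSONL records, not confirmed by the source.

```python
import json


def load_jsonl(path):
    """Read one JSON object per non-empty line."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


def exact_match_accuracy(records, predict):
    """Fraction of records whose prediction equals the label after stripping whitespace.

    `predict` is a hypothetical callable: image path -> predicted string.
    """
    records = list(records)
    if not records:
        return 0.0
    hits = sum(
        predict(r["image"]).strip() == str(r["label"]).strip()
        for r in records
    )
    return hits / len(records)
```

Exact match is a strict metric: "07" and "7" count as different, so any normalization of leading zeros or decimals should happen inside `predict` if it is wanted.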

Files

File	Purpose
`adapter_model.safetensors`	LoRA weights
`adapter_config.json`	PEFT config
`tokenizer.json`, `tokenizer_config.json`, `chat_template.jinja`	Tokenizer + chat template
`processor_config.json`	Image/text processor
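
When working from a local copy of the adapter, it can help to confirm that the core files are present and that `adapter_config.json` matches the Training settings before loading anything heavy. A minimal sketch using only the standard library; `check_adapter_dir` is a hypothetical helper, and the keys it reads are the standard fields PEFT writes to `adapter_config.json`.

```python
import json
from pathlib import Path

# The two files PEFT needs to load the adapter.
REQUIRED_FILES = ["adapter_model.safetensors", "adapter_config.json"]


def check_adapter_dir(path):
    """Raise if a required file is missing; otherwise return key config fields."""
    root = Path(path)
    missing = [name for name in REQUIRED_FILES if not (root / name).is_file()]
    if missing:
        raise FileNotFoundError(f"missing adapter files: {missing}")
    config = json.loads((root / "adapter_config.json").read_text(encoding="utf-8"))
    keys = ("base_model_name_or_path", "r", "lora_alpha", "target_modules")
    # Absent keys come back as None rather than raising.
    return {key: config.get(key) for key in keys}
```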

Limitations

Trained only on Part-C `marks_obtained` cells. Other handwriting domains (full-page free-form text, non-English scripts, very long sequences) are out of scope.
Inference expects a tight crop. Loose crops and rotated images degrade accuracy.
Inherits the biases and limitations of the base `Qwen/Qwen2-VL-2B-Instruct` model.
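
Since accuracy depends on a tight crop, the crop box is worth computing carefully. The sketch below expands a cell's bounding box by a small margin and clamps it to the page; `padded_crop_box` and the default padding of 2 pixels are assumptions for illustration, and the margin actually used for the training crops is not stated in the source.

```python
def padded_crop_box(box, image_size, pad=2):
    """Expand (left, top, right, bottom) by `pad` pixels, clamped to the image bounds.

    Hypothetical helper; the default pad is an assumption, not the project's value.
    """
    left, top, right, bottom = box
    width, height = image_size
    return (
        max(0, left - pad),
        max(0, top - pad),
        min(width, right + pad),
        min(height, bottom + pad),
    )


# With Pillow, the result can be passed straight to Image.crop:
#   crop = page.crop(padded_crop_box(cell_box, page.size))
```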

Pipeline

Source code for cropping, dataset building, training, and inference lives at: https://github.com/kshitizjangra/omr_validator
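
For orientation, the 80/20 train/eval JSONL split listed under Training could be produced along these lines. This is a sketch under assumptions, not the repository's dataset builder: `split_records`, `write_jsonl`, and the fixed seed are hypothetical.

```python
import json
import random


def split_records(records, eval_fraction=0.2, seed=0):
    """Shuffle deterministically and return (train, eval) lists."""
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)
    n_eval = round(len(shuffled) * eval_fraction)
    return shuffled[n_eval:], shuffled[:n_eval]


def write_jsonl(records, path):
    """Write one JSON object per line."""
    with open(path, "w", encoding="utf-8") as f:
        for record in records:
            f.write(json.dumps(record, ensure_ascii=False) + "\n")
```

A fixed seed keeps the eval set stable across runs, which matters when comparing adapters trained at different times.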

Model Details

Model provider	kshitizjangra
Base model	`Qwen/Qwen2-VL-2B-Instruct`
Adapter	this model
Input modalities	Text, Image
Output modalities	Text
