lgtk

qwen25vl-3b-modi-lora

Model details

Base model	Qwen/Qwen2.5-VL-3B-Instruct
Fine-tuning method	QLoRA — LoRA rank 32, alpha 64, 4-bit NF4 quantization
LoRA target modules	q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
Trainable parameters	74M (1.94% of 3.8B total)
Training data	MoDeTrans — 1,635 real document images (80/10/10 split, seed=42)
Training epochs	3
Training time	~7.5 hours on RTX 5060 (8.5 GB VRAM)
Test CER	0.332 (on 204 held-out MoDeTrans examples)
Zero-shot baseline CER	0.930
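The fine-tuning rows above map fairly directly onto a PEFT-style configuration. The sketch below is illustrative only and is not the author's training script. The dictionary name `LORA_HPARAMS` is hypothetical, its keys follow the argument names of `peft.LoraConfig`, and settings the table does not list (dropout, bias handling, learning rate) are deliberately left out.

```python
# Hyperparameters copied from the model-details table; nothing else is assumed.
LORA_HPARAMS = {
    "r": 32,            # LoRA rank
    "lora_alpha": 64,   # LoRA alpha
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",
        "gate_proj", "up_proj", "down_proj",
    ],
}

# Standard LoRA scales its low-rank update by alpha / r.
lora_scaling = LORA_HPARAMS["lora_alpha"] / LORA_HPARAMS["r"]

# Trainable share, using the rounded figures from the table (74M of 3.8B),
# so it only approximates the 1.94% reported there.
trainable_pct = 100 * 74e6 / 3.8e9

print(f"LoRA scaling factor: {lora_scaling}")
print(f"Trainable share: about {trainable_pct:.1f}%")
```

If PEFT is installed, the dictionary can be unpacked as `LoraConfig(**LORA_HPARAMS)`.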

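A CER of 0.332 means roughly one character edit (insertion, deletion, or substitution) for every three reference characters, so about a third of the output needs expert correction. That makes the model suitable as a first-draft assistant in a human-in-the-loop workflow, not as an unattended transcriber.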
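For readers unfamiliar with the metric, here is a minimal sketch of character error rate as edit distance divided by reference length. It assumes characters are Unicode code points (so a Devanagari diacritic counts as its own character) and scores a single example; the card does not state how its corpus-level figure was aggregated, and the helper names `levenshtein` and `cer` are illustrative.

```python
def levenshtein(ref: str, hyp: str) -> int:
    """Minimum number of insertions, deletions and substitutions turning hyp into ref."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # character missing from hyp
                            curr[j - 1] + 1,      # extra character in hyp
                            prev[j - 1] + cost))  # match or substitution
        prev = curr
    return prev[-1]


def cer(ref: str, hyp: str) -> float:
    """Character error rate: edit distance normalised by reference length."""
    if not ref:
        raise ValueError("reference must not be empty")
    return levenshtein(ref, hyp) / len(ref)


# Toy example: the reference has an anusvara (U+0902) that the hypothesis drops.
reference = "\u0938\u0902\u0924"   # three code points
hypothesis = "\u0938\u0924"        # two code points
print(f"CER = {cer(reference, hypothesis):.3f}")
```

Dropping one of three characters gives a CER of 1/3, close to the test-set average reported above.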

Task

Input: photograph or scan of a handwritten Modi-script Marathi document
Output: Devanagari transliteration of that text

This is transliteration, not translation — the language (Marathi) does not change, only the script. Modi was the administrative script of Maharashtra from roughly the 13th to the mid-20th century; most surviving documents are from the Shivakalin (17th c.), Peshwekalin (18th–early 19th c.), and Anglakalin (1818–1952) eras.

How to use

Install dependencies:

```bash
pip install transformers peft bitsandbytes accelerate pillow torch
```

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor, BitsAndBytesConfig
from peft import PeftModel
from PIL import Image
import torch

MODEL_ID    = "Qwen/Qwen2.5-VL-3B-Instruct"
ADAPTER_ID  = "lgtk/qwen25vl-3b-modi-lora"   # ← this repo

# Load the base model in 4-bit NF4, matching the quantization used for fine-tuning.
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)
base  = Qwen2_5_VLForConditionalGeneration.from_pretrained(
            MODEL_ID, quantization_config=bnb, device_map="auto")

# Attach the LoRA adapter on top of the quantized base model.
model = PeftModel.from_pretrained(base, ADAPTER_ID)
model.eval()

# max_pixels caps the resized image size, and with it the number of visual tokens.
processor = AutoProcessor.from_pretrained(MODEL_ID, max_pixels=512 * 28 * 28)

PROMPT = (
    "This image contains handwritten text in Modi script, a historical cursive "
    "script used to write the Marathi language. "
    "Transliterate the text in this image into Devanagari script. "
    "Output only the Devanagari text, with no explanation."
)

image = Image.open("your_modi_image.jpg").convert("RGB")
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": image},
        {"type": "text",  "text": PROMPT},
    ],
}]
text_in = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs  = processor(text=[text_in], images=[image], return_tensors="pt").to(model.device)

# Greedy decoding; the output is cut off after 256 new tokens.
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=256, do_sample=False)

# Decode only the newly generated tokens, not the prompt.
result = processor.batch_decode(
    out[:, inputs["input_ids"].shape[1]:], skip_special_tokens=True
)[0].strip()
print(result)
```
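The `max_pixels=512 * 28 * 28` argument above limits how large an image can be after the processor resizes it. In Qwen2.5-VL each visual token covers roughly a 28×28 pixel area, so this setting keeps an image to about 512 visual tokens. The sketch below approximates that resizing to estimate the token count for a given scan. `estimate_visual_tokens` is a hypothetical helper that mimics the processor's resizing logic without being it, and it ignores the lower `min_pixels` bound.

```python
import math

PATCH = 28                   # assumed pixel span of one visual token per side
MAX_PIXELS = 512 * 28 * 28   # same budget as in the snippet above


def estimate_visual_tokens(width: int, height: int, max_pixels: int = MAX_PIXELS) -> int:
    """Rough estimate of visual tokens after resizing to fit max_pixels."""
    # Snap each side to the nearest multiple of the patch size (at least one patch).
    w = max(PATCH, round(width / PATCH) * PATCH)
    h = max(PATCH, round(height / PATCH) * PATCH)
    if w * h > max_pixels:
        # Shrink both sides by the same factor, rounding down to whole patches.
        scale = math.sqrt((width * height) / max_pixels)
        w = max(PATCH, math.floor(width / scale / PATCH) * PATCH)
        h = max(PATCH, math.floor(height / scale / PATCH) * PATCH)
    return (w // PATCH) * (h // PATCH)


# A 3000x2000 scan is downscaled to fit the budget; a small crop is left alone.
print(estimate_visual_tokens(3000, 2000))
print(estimate_visual_tokens(280, 140))
```

Large scans are downscaled heavily under this budget, so cropping a page to the text region before inference may preserve more detail per character.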

Known limitations

Deletions dominate errors (46%) — the model tends to skip characters, especially the anusvāra diacritic (ं, ~240 drops per 204-example test run)
Vowel length confusion — ी ↔ ि (long/short /i/) and ू ↔ ु (long/short /u/) are the most common substitution pairs
No word boundaries — Modi script is continuous (no spaces), so errors sometimes span phrase boundaries
Best on formal letters — trained on Peshwekalin / Shivakalin administrative documents; performance on informal or personal correspondence may be lower
Image preprocessing not yet implemented — the model receives raw images. Denoising, deskewing, and binarisation would likely improve accuracy.
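An error breakdown like the deletion share above can be reproduced by aligning each prediction with its reference and counting edit types. The following is a minimal sketch under stated assumptions: characters are Unicode code points, ties between equally short alignments are broken in a fixed order (which can shift counts slightly), and the card does not say which tool produced its own figures. The function name `edit_ops` is illustrative.

```python
from collections import Counter


def edit_ops(ref: str, hyp: str) -> Counter:
    """Count substitutions, deletions and insertions along one minimal alignment."""
    n, m = len(ref), len(hyp)
    dist = [[0] * (m + 1) for _ in range(n + 1)]
    for i in range(n + 1):
        dist[i][0] = i
    for j in range(m + 1):
        dist[0][j] = j
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,
                             dist[i][j - 1] + 1,
                             dist[i - 1][j - 1] + cost)

    # Walk back from the bottom-right corner, recording each edit on the way.
    ops = Counter()
    i, j = n, m
    while i > 0 or j > 0:
        if i > 0 and j > 0 and ref[i - 1] == hyp[j - 1] and dist[i][j] == dist[i - 1][j - 1]:
            i, j = i - 1, j - 1                      # characters match
        elif i > 0 and j > 0 and dist[i][j] == dist[i - 1][j - 1] + 1:
            ops["substitution"] += 1
            i, j = i - 1, j - 1
        elif i > 0 and dist[i][j] == dist[i - 1][j] + 1:
            ops["deletion"] += 1                     # reference character skipped
            i -= 1
        else:
            ops["insertion"] += 1                    # extra character in the output
            j -= 1
    return ops


# Toy pairs: a dropped anusvara (U+0902) and a long/short vowel swap (U+0940 vs U+093F).
pairs = [
    ("\u0938\u0902\u0924", "\u0938\u0924"),
    ("\u0915\u0940", "\u0915\u093F"),
]
totals = Counter()
for reference, hypothesis in pairs:
    totals += edit_ops(reference, hypothesis)

for kind, count in totals.items():
    print(f"{kind}: {count} ({100 * count / sum(totals.values()):.0f}%)")
```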

Hardware

Developed and tested on a desktop with an NVIDIA RTX 5060 GPU (8.5 GB VRAM) running WSL2 on Windows. With 4-bit NF4 quantization, the base model plus adapter uses approximately 3–4 GB of VRAM during inference. An RTX 5060 or better is recommended; the pipeline should also fit on any GPU with at least 6 GB of VRAM.
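As a rough sanity check on the inference footprint, the weights alone can be estimated from the parameter counts in the model-details table. This back-of-envelope sketch assumes every base parameter is stored in 4 bits and the adapter in 16 bits. Both are simplifications: quantization typically covers only the linear layers, and the adapter's stored precision is not stated in this card. The gap between this estimate and the observed 3–4 GB would then come from unquantized layers, activations, the KV cache, and framework overhead.

```python
TOTAL_PARAMS = 3.8e9     # from the model-details table
ADAPTER_PARAMS = 74e6    # from the model-details table


def gigabytes(num_params: float, bits_per_param: int) -> float:
    """Storage for num_params parameters at the given precision, in GB (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


base_gb = gigabytes(TOTAL_PARAMS, 4)        # assumption: all base weights in 4-bit
adapter_gb = gigabytes(ADAPTER_PARAMS, 16)  # assumption: adapter kept in 16-bit

print(f"Base weights (4-bit):     {base_gb:.2f} GB")
print(f"Adapter weights (16-bit): {adapter_gb:.2f} GB")
print(f"Weights total:            {base_gb + adapter_gb:.2f} GB")
```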

Citation / acknowledgements

MoDeTrans dataset: IIT Roorkee, historyHulk/MoDeTrans on Hugging Face
Base model: Qwen/Qwen2.5-VL-3B-Instruct (Apache 2.0)
Developed by Sachin Godse (@lgtkgtv) using Claude CLI (Anthropic)

Model provider	lgtk
Model tree	Base: Qwen/Qwen2.5-VL-3B-Instruct; Adapter: this model
Modalities	Input: Text, Image; Output: Text
