Model Details
| Field | Value |
|---|---|
| Base model | unsloth/qwen2.5-vl-3b-instruct-unsloth-bnb-4bit |
| Architecture | Qwen2.5-VL (3B parameters, vision-language) |
| Fine-tuning technique | QLoRA (4-bit NF4 quantization) |
| Merged | Yes (LoRA weights merged into base) |
| Language | Tamil (ta) |
| Task | OCR / image-to-text |
Training Hyperparameters
| Hyperparameter | Value |
|---|---|
| Epochs | 3 |
| Per-device batch size | 1 |
| Gradient accumulation steps | 16 |
| Effective batch size | 16 |
| Learning rate | 2e-4 |
| LR scheduler | Cosine |
| Warmup ratio | 0.03 |
| Weight decay | 0.01 |
| LoRA rank (r) | 64 |
| LoRA alpha | 64 |
| LoRA dropout |
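
As a rough sanity check, the schedule implied by these hyperparameters can be worked out with a few lines of arithmetic. This is only a sketch: the single-GPU setting (`num_devices = 1`), the rounding of partial steps, and the use of standard LoRA scaling (alpha / r) are assumptions not stated in this card. The training sample count is taken from the Training Data table.

```python
import math

# Values taken from the tables in this card.
train_samples = 159_939
epochs = 3
per_device_batch_size = 1
grad_accum_steps = 16
warmup_ratio = 0.03
lora_r, lora_alpha = 64, 64

# Assumption: a single GPU; the card does not state the device count.
num_devices = 1

effective_batch_size = per_device_batch_size * grad_accum_steps * num_devices

# Assumption: a final, partially filled accumulation window still counts as a step.
steps_per_epoch = math.ceil(train_samples / effective_batch_size)
total_steps = steps_per_epoch * epochs
warmup_steps = math.ceil(total_steps * warmup_ratio)

# Assumption: standard LoRA scaling (alpha / r), not rank-stabilized LoRA.
lora_scaling = lora_alpha / lora_r

print(f"effective batch size: {effective_batch_size}")
print(f"optimizer steps: {steps_per_epoch} per epoch, {total_steps} total")
print(f"warmup steps: {warmup_steps}")
print(f"LoRA scaling: {lora_scaling}")
```

With more than one GPU the effective batch size would scale with `num_devices`, and the step counts would shrink accordingly.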
Training Data
| Split | Samples | Source |
|---|---|---|
| Train | 159,939 (80%) | Synthetic + real book scans |
| Validation | 19,992 (10%) | Synthetic + real book scans |
| Test | 19,992 (10%) | Synthetic + real book scans |
| Total | 199,925 | |
- Synthetic data augmentations: blur, noise, rotation, brightness/contrast, JPEG artifacts, erosion, shadow overlays
- Real data: ~150 Tamil books, ~75K pages
- Fonts: NotoSerifTamil, NotoSansTamil, MuktaMalar, HindMadurai, Catamaran
- Text sources: Tirukkural, Wikipedia sentences, digitized book corpus
- Text layouts: 75% single-line, 25% multi-line
- Font sizes: 24–72pt (single-line), 24–48pt (multi-line)
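
To make the synthetic sampling scheme concrete, here is a minimal sketch of how one rendering configuration might be drawn. The font names, layout ratio, and size ranges come from the list above; the function `sample_render_config`, the uniform font choice, and the integer point sizes are hypothetical and not taken from the actual data generator.

```python
import random

# Fonts, layout ratio, and size ranges are from this card.
FONTS = ["NotoSerifTamil", "NotoSansTamil", "MuktaMalar", "HindMadurai", "Catamaran"]
SIZE_RANGE_PT = {"single-line": (24, 72), "multi-line": (24, 48)}
SINGLE_LINE_SHARE = 0.75


def sample_render_config(rng: random.Random) -> dict:
    """Draw one hypothetical synthetic-sample configuration."""
    layout = "single-line" if rng.random() < SINGLE_LINE_SHARE else "multi-line"
    low, high = SIZE_RANGE_PT[layout]
    return {
        "font": rng.choice(FONTS),  # assumption: fonts are chosen uniformly
        "layout": layout,
        "font_size_pt": rng.randint(low, high),  # inclusive on both ends
    }


rng = random.Random(0)  # fixed seed so the draw is reproducible
configs = [sample_render_config(rng) for _ in range(10_000)]
single_share = sum(c["layout"] == "single-line" for c in configs) / len(configs)
print(f"single-line share: {single_share:.3f}")
```

Augmentations such as blur or JPEG artifacts would be applied after rendering and are not covered by this sketch.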
Usage
```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from PIL import Image
import torch

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "sair390/tamil-ocr-qwen25vl",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("sair390/tamil-ocr-qwen25vl")

image = Image.open("tamil_page.jpg").convert("RGB")
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": "Read the Tamil text in this image."},
    ],
}]

# Build the chat prompt, then tokenize the text and image together.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

# Greedy decoding; temperature has no effect when do_sample=False, so it is omitted.
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=512, do_sample=False)

# Decode only the newly generated tokens, not the echoed prompt.
generated = output[0][inputs["input_ids"].shape[1]:]
result = processor.decode(generated, skip_special_tokens=True)
print(result)
```
4-Stage Inference Pipeline
- Preprocessing — deskew + adaptive threshold binarization
- Layout analysis — contour detection, bounding box merge, top-to-bottom sort
- OCR — Qwen2.5-VL inference
- Post-correction — Tamil vowel sign fixes, Unicode filter, common OCR confusions
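
The layout and post-correction stages can be illustrated without any imaging library. The sketch below is an assumption about how such stages might look, not the pipeline's actual code: `merge_into_lines` merges bounding boxes that overlap vertically and returns them top to bottom, and `post_correct` uses Unicode NFC normalization as one plausible vowel sign fix (it recombines two-part Tamil vowel signs that were emitted as separate code points) followed by a simple character filter. The function names, the overlap rule, and the allowed character set are all hypothetical.

```python
import unicodedata

Box = tuple[int, int, int, int]  # (x, y, width, height)


def merge_into_lines(boxes: list[Box]) -> list[Box]:
    """Merge boxes whose vertical extents overlap; result is sorted top to bottom."""
    lines: list[Box] = []
    for x, y, w, h in sorted(boxes, key=lambda b: b[1]):
        if lines:
            lx, ly, lw, lh = lines[-1]
            if y < ly + lh:  # box starts before the current line ends
                left, top = min(lx, x), min(ly, y)
                right, bottom = max(lx + lw, x + w), max(ly + lh, y + h)
                lines[-1] = (left, top, right - left, bottom - top)
                continue
        lines.append((x, y, w, h))
    return lines


def is_allowed(ch: str) -> bool:
    """Assumed filter: Tamil block (U+0B80-U+0BFF), whitespace, printable ASCII."""
    return "\u0b80" <= ch <= "\u0bff" or ch.isspace() or " " <= ch <= "~"


def post_correct(text: str) -> str:
    # NFC composes split vowel signs, e.g. U+0BC6 + U+0BBE becomes U+0BCA.
    text = unicodedata.normalize("NFC", text)
    return "".join(ch for ch in text if is_allowed(ch))


boxes = [(200, 12, 50, 20), (10, 60, 100, 22), (10, 10, 100, 20)]
lines = merge_into_lines(boxes)
cleaned = post_correct("க\u0bc6\u0bbe\u0915!")
print(lines)
print(cleaned)
```

In a full pipeline each merged line box would be cropped from the preprocessed page and passed to the OCR stage, with `post_correct` applied to the model output. Fixes for common OCR confusions would need a rule table specific to the model's error patterns, which this card does not list.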
Inference Requirements
| Mode | VRAM |
|---|---|
| bfloat16 (unquantized) | ~12–16 GB |
| 4-bit quantized | ~6–8 GB (load the LoRA adapter on a 4-bit quantized base model instead of the merged weights) |
Dependencies: transformers, torch, Pillow, qwen-vl-utils, accelerate (needed for device_map="auto")
Limitations
- Optimized for printed Tamil text; handwritten text not tested
- Performance on degraded/low-resolution scans may vary
- Early checkpoint; full training run in progress
- Not evaluated on non-Tamil scripts