sair390

tamil-ocr-qwen25vl

README

License: apache-2.0

Model Details

Field	Value
Base model	`unsloth/qwen2.5-vl-3b-instruct-unsloth-bnb-4bit`
Architecture	Qwen2.5-VL (3B parameters, vision-language)
Fine-tuning technique	QLoRA (4-bit NF4 quantization)
Merged	Yes (LoRA weights merged into base)
Language	Tamil (ta)
Task	OCR / image-to-text

Training Hyperparameters

Hyperparameter	Value
Epochs	3
Per-device batch size	1
Gradient accumulation steps	16
Effective batch size	16
Learning rate	2e-4
LR scheduler	Cosine
Warmup ratio	0.03
Weight decay	0.01
LoRA rank (r)	64
LoRA alpha	64
LoRA dropout
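
As a rough illustration of how these values interact, the sketch below derives the effective batch size, the number of optimizer steps, the warmup length, and the LoRA scaling factor. It assumes a single GPU, the 159,939-sample train split, a partial final batch counted as a full step, linear warmup followed by cosine decay to zero, and the standard alpha / r LoRA scaling; the author's training script is not shown in this card and may differ on any of these points.

```python
import math

# Values taken from the hyperparameter table.
EPOCHS = 3
PER_DEVICE_BATCH = 1
GRAD_ACCUM_STEPS = 16
PEAK_LR = 2e-4
WARMUP_RATIO = 0.03
LORA_R, LORA_ALPHA = 64, 64
TRAIN_SAMPLES = 159_939  # train split size from the Training Data table

# Assumption (not stated in the card): training ran on a single device.
NUM_DEVICES = 1

effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS * NUM_DEVICES
# Assumption: a partial final batch still counts as one optimizer step.
steps_per_epoch = math.ceil(TRAIN_SAMPLES / effective_batch)
total_steps = EPOCHS * steps_per_epoch
warmup_steps = math.ceil(WARMUP_RATIO * total_steps)
lora_scaling = LORA_ALPHA / LORA_R  # multiplier on the low-rank update


def lr_at(step: int) -> float:
    """Linear warmup to PEAK_LR, then cosine decay to zero (assumed shape)."""
    if step < warmup_steps:
        return PEAK_LR * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print(effective_batch, steps_per_epoch, total_steps, warmup_steps, lora_scaling)
```

With rank and alpha both set to 64, the scaling factor is 1.0, so the low-rank update is added to the base weights without rescaling.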

Training Data

Split	Samples	Source
Train	159,939 (80%)	Synthetic + real book scans
Validation	19,992 (10%)	Synthetic + real book scans
Test	19,992 (10%)	Synthetic + real book scans
Total	199,925

Synthetic data augmentations: blur, noise, rotation, brightness/contrast, JPEG artifacts, erosion, shadow overlays
Real data: ~150 Tamil books, ~75K pages
Fonts: NotoSerifTamil, NotoSansTamil, MuktaMalar, HindMadurai, Catamaran
Text sources: Tirukkural, Wikipedia sentences, digitized book corpus
Text layouts: 75% single-line, 25% multi-line
Font sizes: 24–72pt (single-line), 24–48pt (multi-line)
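
To make the synthetic-data description more concrete, here is a minimal sketch of how one rendering spec per sample could be drawn from the stated fonts, layout ratio, and font-size ranges. The function name `sample_spec`, the dictionary keys, and the per-augmentation probability of 0.3 are hypothetical; the card lists the augmentations but not how often each is applied, and the actual generator is not published here.

```python
import random

# Fonts and augmentation names as listed in the model card.
FONTS = ["NotoSerifTamil", "NotoSansTamil", "MuktaMalar", "HindMadurai", "Catamaran"]
AUGMENTATIONS = ["blur", "noise", "rotation", "brightness_contrast",
                 "jpeg_artifacts", "erosion", "shadow_overlay"]


def sample_spec(rng: random.Random) -> dict:
    """Draw one hypothetical rendering spec for a synthetic sample."""
    multi_line = rng.random() < 0.25        # 25% multi-line, 75% single-line
    max_size = 48 if multi_line else 72     # font-size ranges from the card
    return {
        "layout": "multi-line" if multi_line else "single-line",
        "font": rng.choice(FONTS),
        "font_size_pt": rng.randint(24, max_size),
        # Assumption: each augmentation is applied independently with p=0.3.
        "augmentations": [a for a in AUGMENTATIONS if rng.random() < 0.3],
    }


rng = random.Random(0)  # fixed seed so the draw is reproducible
specs = [sample_spec(rng) for _ in range(10_000)]
share_multi = sum(s["layout"] == "multi-line" for s in specs) / len(specs)
print(f"multi-line share: {share_multi:.3f}")
```

A spec like this would then be handed to a renderer that draws the text with the chosen font and applies the selected augmentations to the image.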

Usage

python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from PIL import Image
import torch

model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "sair390/tamil-ocr-qwen25vl",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("sair390/tamil-ocr-qwen25vl")

image = Image.open("tamil_page.jpg")
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": "Read the Tamil text in this image."},
    ],
}]

text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

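with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=512,
        do_sample=False,      # greedy decoding: deterministic, best for OCR
    )

# Drop the prompt tokens so only the newly generated text is decoded.
generated = output[0][inputs["input_ids"].shape[1]:]
result = processor.decode(generated, skip_special_tokens=True)
print(result)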

4-Stage Inference Pipeline

Preprocessing — deskew + adaptive threshold binarization
Layout analysis — contour detection, bounding box merge, top-to-bottom sort
OCR — Qwen2.5-VL inference
Post-correction — Tamil vowel sign fixes, Unicode filter, common OCR confusions
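
The card does not give the exact rules for stages 2 and 4, so the following is only a sketch of what they might look like. It assumes bounding boxes in a hypothetical `(x, y, width, height)` format, merges boxes whose vertical extents overlap by at least half of the shorter box, and approximates post-correction with Unicode NFC normalization (which recombines split two-part Tamil vowel signs) plus a whitelist filter. The names `merge_into_lines`, `post_correct`, and `ALLOWED` are illustrative, not the author's.

```python
import string
import unicodedata

Box = tuple[int, int, int, int]  # (x, y, width, height): hypothetical format


def merge_into_lines(boxes: list[Box], min_overlap: float = 0.5) -> list[Box]:
    """Merge vertically overlapping boxes into line boxes, sorted top-to-bottom."""
    lines: list[Box] = []
    for x, y, w, h in sorted(boxes, key=lambda b: b[1]):
        if lines:
            lx, ly, lw, lh = lines[-1]
            overlap = min(ly + lh, y + h) - max(ly, y)
            if overlap >= min_overlap * min(lh, h):
                # Same text line: grow the last line box to cover both boxes.
                nx, ny = min(lx, x), min(ly, y)
                right, bottom = max(lx + lw, x + w), max(ly + lh, y + h)
                lines[-1] = (nx, ny, right - nx, bottom - ny)
                continue
        lines.append((x, y, w, h))
    return lines


# Assumption: besides the Tamil block, keep digits and basic punctuation.
ALLOWED = set(string.digits + ".,;:!?'\"()-")


def post_correct(text: str) -> str:
    """NFC-normalize, then drop characters outside the assumed whitelist."""
    text = unicodedata.normalize("NFC", text)
    return "".join(
        ch for ch in text
        if "\u0b80" <= ch <= "\u0bff" or ch.isspace() or ch in ALLOWED
    )


line_boxes = merge_into_lines([(120, 12, 80, 30), (10, 10, 100, 30), (10, 60, 150, 28)])
cleaned = post_correct("\u0b95\u0bc6\u0bbe x")  # KA + split vowel sign O + stray Latin
print(line_boxes, repr(cleaned))
```

In a full pipeline, each merged line box would be cropped from the preprocessed page and passed to the OCR stage in order, with `post_correct` applied to each returned string.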

Inference Requirements

Mode	VRAM
bfloat16 (merged weights, unquantized)	~12–16 GB
4-bit quantized	~6–8 GB (load the LoRA adapter on the 4-bit quantized base instead of the merged weights)
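
For a sense of how much of these figures is the weights themselves, here is a back-of-the-envelope estimate. It uses the nominal "3B" parameter count (the exact count may differ) and ignores quantization constants and any layers kept in higher precision. The gap between these numbers and the table is presumably activations, image tokens, the KV cache, and framework overhead; that breakdown is an assumption, not something the card states.

```python
def weight_memory_gb(num_params: float, bits_per_param: float) -> float:
    """Memory for the weights alone, in GB (10^9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


NOMINAL_PARAMS = 3e9  # "3B" from the model card; the exact count may differ

bf16_gb = weight_memory_gb(NOMINAL_PARAMS, 16)  # 2 bytes per parameter
nf4_gb = weight_memory_gb(NOMINAL_PARAMS, 4)    # 0.5 bytes per parameter
print(f"bfloat16 weights: ~{bf16_gb:.1f} GB, 4-bit weights: ~{nf4_gb:.1f} GB")
```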

Dependencies: transformers, torch, Pillow, qwen-vl-utils (accelerate is also required when loading with device_map="auto")

Limitations

Optimized for printed Tamil text; handwritten text not tested
Performance on degraded/low-resolution scans may vary
Early checkpoint; full training run in progress
Not evaluated on non-Tamil scripts
