License: apache-2.0

Model Details

| Field | Value |
|---|---|
| Base model | unsloth/qwen2.5-vl-3b-instruct-unsloth-bnb-4bit |
| Architecture | Qwen2.5-VL (3B parameters, vision-language) |
| Fine-tuning technique | QLoRA (4-bit NF4 quantization) |
| Merged | Yes (LoRA weights merged into base) |
| Language | Tamil (ta) |
| Task | OCR / image-to-text |

Training Hyperparameters

| Hyperparameter | Value |
|---|---|
| Epochs | 3 |
| Per-device batch size | 1 |
| Gradient accumulation steps | 16 |
| Effective batch size | 16 |
| Learning rate | 2e-4 |
| LR scheduler | Cosine |
| Warmup ratio | 0.03 |
| Weight decay | 0.01 |
| LoRA rank (r) | 64 |
| LoRA alpha | 64 |
| LoRA dropout | 0.05 |
| Max sequence length | 1024 |
| Precision | bf16 |
| Gradient checkpointing | Yes |
| Save / eval steps | 500 |
| Quantization during training | 4-bit NF4, bfloat16 compute |
| Inference temperature | 1.0 (set to 0 for deterministic OCR) |
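
To make the schedule concrete, the sketch below works out the step counts implied by these hyperparameters and the 159,939 training samples listed under Training Data. It is an illustration, not the training script: it assumes a single GPU (consistent with an effective batch size of 16), rounds partial batches up, and uses the usual linear-warmup plus cosine-decay formula. The function name `lr_at` is hypothetical, and a given trainer may round steps slightly differently.

```python
import math

TRAIN_SAMPLES = 159_939   # from the Training Data table
PER_DEVICE_BATCH = 1
GRAD_ACCUM = 16
NUM_DEVICES = 1           # assumption: single GPU
EPOCHS = 3
BASE_LR = 2e-4
WARMUP_RATIO = 0.03

# One optimizer step consumes per-device batch x accumulation x devices samples.
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM * NUM_DEVICES
steps_per_epoch = math.ceil(TRAIN_SAMPLES / effective_batch)
total_steps = steps_per_epoch * EPOCHS
warmup_steps = math.ceil(WARMUP_RATIO * total_steps)


def lr_at(step: int) -> float:
    """Learning rate at an optimizer step: linear warmup, then cosine decay to 0."""
    if step < warmup_steps:
        return BASE_LR * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return BASE_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print(effective_batch, steps_per_epoch, total_steps, warmup_steps)
# prints: 16 9997 29991 900
```

Under these assumptions the learning rate peaks at 2e-4 after 900 warmup steps, and checkpoints saved every 500 steps fall roughly 20 times per epoch.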

Training Data

| Split | Samples | Source |
|---|---|---|
| Train | 159,939 (80%) | Synthetic + real book scans |
| Validation | 19,992 (10%) | Synthetic + real book scans |
| Test | 19,992 (10%) | Synthetic + real book scans |
| Total | 199,925 | |

  • Synthetic data augmentations: blur, noise, rotation, brightness/contrast, JPEG artifacts, erosion, shadow overlays
  • Real data: ~150 Tamil books, ~75K pages
  • Fonts: NotoSerifTamil, NotoSansTamil, MuktaMalar, HindMadurai, Catamaran
  • Text sources: Tirukkural, Wikipedia sentences, digitized book corpus
  • Text layouts: 75% single-line, 25% multi-line
  • Font sizes: 24–72pt (single-line), 24–48pt (multi-line)
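
As a rough illustration of how some of the listed augmentations could be chained, here is a minimal Pillow sketch covering rotation, blur, brightness/contrast, and JPEG artifacts. The function name `augment` and every parameter range are assumptions made for this example; the actual augmentation code and settings are not published in this card, and noise, erosion, and shadow overlays are left out.

```python
import io
import random

from PIL import Image, ImageEnhance, ImageFilter


def augment(image: Image.Image, rng: random.Random) -> Image.Image:
    """Apply a random subset-free chain of mild scan-like degradations."""
    img = image.convert("RGB")
    # Small skew, filling the exposed corners with white paper colour.
    img = img.rotate(rng.uniform(-3.0, 3.0), fillcolor=(255, 255, 255))
    # Slight defocus.
    img = img.filter(ImageFilter.GaussianBlur(radius=rng.uniform(0.0, 1.5)))
    # Lighting variation (factor 1.0 leaves the image unchanged).
    img = ImageEnhance.Brightness(img).enhance(rng.uniform(0.8, 1.2))
    img = ImageEnhance.Contrast(img).enhance(rng.uniform(0.8, 1.2))
    # Round-trip through JPEG at a random quality to add compression artifacts.
    buffer = io.BytesIO()
    img.save(buffer, format="JPEG", quality=rng.randint(30, 90))
    buffer.seek(0)
    degraded = Image.open(buffer)
    degraded.load()  # force decoding while the buffer is still alive
    return degraded


# Usage with a blank stand-in for a rendered text line.
page = Image.new("RGB", (200, 60), "white")
sample = augment(page, random.Random(0))
```

Passing a seeded `random.Random` keeps each synthetic sample reproducible, which helps when regenerating a dataset split.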

Usage

```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from PIL import Image
import torch

# Load the merged model and its processor from the Hub.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "sair390/tamil-ocr-qwen25vl",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("sair390/tamil-ocr-qwen25vl")

image = Image.open("tamil_page.jpg").convert("RGB")
messages = [{
    "role": "user",
    "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": "Read the Tamil text in this image."},
    ],
}]

# Render the chat prompt, then tokenize it together with the image.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=512,
        do_sample=False,  # greedy decoding: deterministic, best for OCR
    )

# Decode only the newly generated tokens, not the echoed prompt.
prompt_len = inputs["input_ids"].shape[1]
result = processor.decode(output[0][prompt_len:], skip_special_tokens=True)
print(result)
```

4-Stage Inference Pipeline

  1. Preprocessing — deskew + adaptive threshold binarization
  2. Layout analysis — contour detection, bounding box merge, top-to-bottom sort
  3. OCR — Qwen2.5-VL inference
  4. Post-correction — Tamil vowel sign fixes, Unicode filter, common OCR confusions
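
Stages 2 and 4 can be illustrated without any image library. The sketch below is not the author's pipeline code: `merge_into_lines` and `post_correct` are hypothetical names, boxes are assumed to arrive as `(x, y, width, height)` tuples from a contour detector, the vowel sign fix is assumed to be Unicode NFC recomposition of split two-part vowel signs, and the Unicode filter is assumed to keep only the Tamil block plus ASCII. The table of common OCR confusions is not listed in this card, so it is omitted.

```python
import unicodedata

Box = tuple[int, int, int, int]  # (x, y, width, height) in pixels


def merge_into_lines(boxes: list[Box]) -> list[Box]:
    """Merge boxes whose vertical extents overlap; return lines top-to-bottom."""
    lines: list[list[int]] = []  # each entry is [x0, y0, x1, y1]
    for x, y, w, h in sorted(boxes, key=lambda b: b[1]):
        x0, y0, x1, y1 = x, y, x + w, y + h
        if lines and y0 < lines[-1][3]:
            # Starts above the bottom of the current line: same text line.
            last = lines[-1]
            last[0], last[1] = min(last[0], x0), min(last[1], y0)
            last[2], last[3] = max(last[2], x1), max(last[3], y1)
        else:
            lines.append([x0, y0, x1, y1])
    return [(x0, y0, x1 - x0, y1 - y0) for x0, y0, x1, y1 in lines]


def post_correct(text: str) -> str:
    """Recompose split vowel signs and drop characters outside Tamil/ASCII."""
    # NFC joins e.g. U+0BC6 + U+0BBE into the single vowel sign U+0BCA.
    text = unicodedata.normalize("NFC", text)
    return "".join(
        ch for ch in text if "\u0b80" <= ch <= "\u0bff" or ch.isascii()
    )


# Usage: two fragments on the first line, one box on the second line.
boxes = [(120, 12, 80, 30), (10, 10, 100, 30), (10, 60, 150, 30)]
print(merge_into_lines(boxes))
# prints: [(10, 10, 190, 32), (10, 60, 150, 30)]
```

Each merged line box would then be cropped and passed to the OCR stage in order, with `post_correct` applied to the decoded text.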

Inference Requirements

| Mode | VRAM |
|---|---|
| bfloat16 (unquantized) | ~12–16 GB |
| 4-bit quantized | ~6–8 GB (load the LoRA adapter on the quantized base instead of the merged weights) |

Dependencies: transformers, torch, Pillow, qwen-vl-utils

Limitations

  • Optimized for printed Tamil text; handwritten text not tested
  • Performance on degraded/low-resolution scans may vary
  • Early checkpoint; full training run in progress
  • Not evaluated on non-Tamil scripts

Model provider

sair390

Modalities

  • Input: Text, Image
  • Output: Text
