License: apache-2.0
Model Details
| Field | Value |
|---|---|
| Base model | unsloth/qwen2.5-vl-3b-instruct-unsloth-bnb-4bit |
| Architecture | Qwen2.5-VL (3B parameters, vision-language) |
| Fine-tuning technique | QLoRA (4-bit NF4 quantization) |
| Merged | Yes (LoRA weights merged into base) |
| Language | Tamil (ta) |
| Task | OCR / image-to-text |
Training Hyperparameters
| Hyperparameter | Value |
|---|---|
| Epochs | 3 |
| Per-device batch size | 1 |
| Gradient accumulation steps | 16 |
| Effective batch size | 16 |
| Learning rate | 2e-4 |
| LR scheduler | Cosine |
| Warmup ratio | 0.03 |
| Weight decay | 0.01 |
| LoRA rank (r) | 64 |
| LoRA alpha | 64 |
| LoRA dropout | 0.05 |
| Max sequence length | 1024 |
| Precision | bf16 |
| Gradient checkpointing | Yes |
| Save / eval steps | 500 |
| Quantization during training | 4-bit NF4, bfloat16 compute |
| Inference temperature | 1.0 (set to 0, i.e. greedy decoding, for deterministic OCR) |
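To make the schedule implied by these hyperparameters concrete, here is a minimal sketch that derives the step counts and a warmup-plus-cosine learning rate curve from the values in this card. It assumes a single GPU and that a final partial accumulation batch still triggers an optimizer step; neither is stated in the card, and `lr_at` is a hypothetical helper, not the training code that was actually used.

```python
import math

# Values taken from the tables in this card
TRAIN_SAMPLES = 159_939
EPOCHS = 3
PER_DEVICE_BATCH = 1
GRAD_ACCUM_STEPS = 16
PEAK_LR = 2e-4
WARMUP_RATIO = 0.03

NUM_DEVICES = 1  # assumption: the card does not state the GPU count

effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM_STEPS * NUM_DEVICES
# assumption: the last, partial accumulation batch of an epoch still counts as a step
steps_per_epoch = math.ceil(TRAIN_SAMPLES / effective_batch)
total_steps = steps_per_epoch * EPOCHS
warmup_steps = math.ceil(total_steps * WARMUP_RATIO)


def lr_at(step: int) -> float:
    """Linear warmup to PEAK_LR, then half-cosine decay to zero."""
    if step < warmup_steps:
        return PEAK_LR * step / warmup_steps
    progress = (step - warmup_steps) / max(1, total_steps - warmup_steps)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print(effective_batch, steps_per_epoch, total_steps, warmup_steps)
for step in (0, warmup_steps, total_steps // 2, total_steps):
    print(step, f"{lr_at(step):.2e}")
```

Under these assumptions the run has 9,997 optimizer steps per epoch, 29,991 in total, and 900 warmup steps, so the learning rate peaks at 2e-4 early in the first epoch.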
Training Data
| Split | Samples | Source |
|---|---|---|
| Train | 159,939 (80%) | Synthetic + real book scans |
| Validation | 19,992 (10%) | Synthetic + real book scans |
| Test | 19,992 (10%) | Synthetic + real book scans |
| Total | 199,925 | |
- Synthetic data augmentations: blur, noise, rotation, brightness/contrast, JPEG artifacts, erosion, shadow overlays
- Real data: ~150 Tamil books, ~75K pages
- Fonts: NotoSerifTamil, NotoSansTamil, MuktaMalar, HindMadurai, Catamaran
- Text sources: Tirukkural, Wikipedia sentences, digitized book corpus
- Text layouts: 75% single-line, 25% multi-line
- Font sizes: 24–72pt (single-line), 24–48pt (multi-line)
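The layout and font-size figures above can be read as a sampling recipe for the synthetic set. The sketch below draws hypothetical rendering specs that follow the stated 75/25 layout split and point-size ranges. The function name `sample_render_spec`, the uniform choice of font and size, and the number of augmentations per image are all assumptions for illustration; the card does not describe how the generator actually samples them.

```python
import random

FONTS = ["NotoSerifTamil", "NotoSansTamil", "MuktaMalar", "HindMadurai", "Catamaran"]
AUGMENTATIONS = [
    "blur", "noise", "rotation", "brightness/contrast",
    "JPEG artifacts", "erosion", "shadow overlay",
]


def sample_render_spec(rng: random.Random) -> dict:
    """Draw one hypothetical rendering spec using the layout split and size ranges above."""
    layout = "single-line" if rng.random() < 0.75 else "multi-line"
    max_pt = 72 if layout == "single-line" else 48
    return {
        "layout": layout,
        "font": rng.choice(FONTS),  # assumption: fonts are chosen uniformly
        "font_size_pt": rng.randint(24, max_pt),  # assumption: integer sizes, uniform
        # assumption: between zero and three augmentations per image
        "augmentations": rng.sample(AUGMENTATIONS, k=rng.randint(0, 3)),
    }


rng = random.Random(0)  # fixed seed so the draw is reproducible
specs = [sample_render_spec(rng) for _ in range(10_000)]
single = sum(s["layout"] == "single-line" for s in specs)
print(f"single-line share: {single / len(specs):.3f}")
print(specs[0])
```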
Usage
```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from PIL import Image
import torch

# Load the merged model in bfloat16; device_map="auto" places it on the available device(s)
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "sair390/tamil-ocr-qwen25vl",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("sair390/tamil-ocr-qwen25vl")

image = Image.open("tamil_page.jpg").convert("RGB")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Read the Tamil text in this image."},
        ],
    }
]

# Render the chat prompt as a string, then tokenize it together with the image
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(
        **inputs,
        max_new_tokens=512,
        do_sample=False,  # greedy decoding: deterministic, best for OCR
    )

# generate() returns prompt + completion; decode only the newly generated tokens
prompt_length = inputs["input_ids"].shape[1]
result = processor.decode(output[0][prompt_length:], skip_special_tokens=True)
print(result)
```
4-Stage Inference Pipeline
- Preprocessing — deskew + adaptive threshold binarization
- Layout analysis — contour detection, bounding box merge, top-to-bottom sort
- OCR — Qwen2.5-VL inference
- Post-correction — Tamil vowel sign fixes, Unicode filter, common OCR confusions
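Stages 2 and 4 of the pipeline above can be illustrated without any imaging library. The sketch below merges bounding boxes that share a text line and sorts the lines top to bottom, then applies a simple post-correction pass. The names `merge_into_lines` and `post_correct`, the 50% vertical-overlap threshold, the use of Unicode NFC normalization as the vowel sign fix, and the choice to keep only the Tamil block plus printable ASCII are assumptions; the card does not specify the actual rules, and the "common OCR confusions" step is not covered here.

```python
import unicodedata

# (x, y, width, height), the same layout cv2.boundingRect returns
Box = tuple[int, int, int, int]

TAMIL_BLOCK = range(0x0B80, 0x0C00)  # Unicode Tamil block


def merge_into_lines(boxes: list[Box], min_overlap: float = 0.5) -> list[Box]:
    """Merge boxes that share a text line; return one box per line, top to bottom."""
    lines: list[list[int]] = []  # each entry is [left, top, right, bottom]
    for x, y, w, h in sorted(boxes, key=lambda b: b[1]):
        for line in lines:
            # Vertical overlap between this box and the line collected so far
            overlap = min(line[3], y + h) - max(line[1], y)
            if overlap >= min_overlap * min(h, line[3] - line[1]):
                line[0], line[1] = min(line[0], x), min(line[1], y)
                line[2], line[3] = max(line[2], x + w), max(line[3], y + h)
                break
        else:
            lines.append([x, y, x + w, y + h])
    lines.sort(key=lambda ln: ln[1])
    return [(left, top, right - left, bottom - top) for left, top, right, bottom in lines]


def post_correct(text: str) -> str:
    """Recompose split vowel signs, then drop characters outside Tamil and printable ASCII."""
    # NFC joins pairs such as U+0BC6 + U+0BBE into the single vowel sign U+0BCA
    text = unicodedata.normalize("NFC", text)
    kept = [
        ch for ch in text
        if ord(ch) in TAMIL_BLOCK or ch == "\n" or " " <= ch <= "~"
    ]
    return "".join(kept)


boxes = [(200, 12, 80, 30), (10, 10, 150, 32), (10, 60, 300, 30)]
print(merge_into_lines(boxes))  # → [(10, 10, 270, 32), (10, 60, 300, 30)]
print(post_correct("\u0b95\u0bc6\u0bbe\u200b abc"))  # zero-width space is removed
```

In a full pipeline each merged box would be cropped from the preprocessed page and passed to the OCR stage in the returned order, with `post_correct` applied to each decoded line.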
Inference Requirements
| Mode | VRAM |
|---|---|
| bfloat16 (full precision) | ~12–16 GB |
| 4-bit quantized | ~6–8 GB (use LoRA adapter on quantized base instead) |
Dependencies: transformers, torch, Pillow, qwen-vl-utils
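As a rough cross-check of the VRAM figures in the table above, the weights-only footprint can be estimated from the parameter count. This is a back-of-the-envelope sketch using the nominal "3B parameters" from the Model Details table; the exact parameter count is not given in this card. The gap between these numbers and the table is presumably activations, the KV cache, image tokens, and framework overhead.

```python
PARAMS = 3e9  # nominal "3B parameters" from the Model Details table

BYTES_PER_PARAM = {
    "bfloat16": 2.0,
    "4-bit NF4": 0.5,  # ignores quantization constants and any unquantized layers
}


def weight_gb(params: float, bytes_per_param: float) -> float:
    """Memory for the weights alone, in decimal gigabytes."""
    return params * bytes_per_param / 1e9


for mode, size in BYTES_PER_PARAM.items():
    print(f"{mode}: ~{weight_gb(PARAMS, size):.1f} GB for weights")
```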
Limitations
- Optimized for printed Tamil text; handwritten text not tested
- Performance on degraded/low-resolution scans may vary
- Early checkpoint; full training run in progress
- Not evaluated on non-Tamil scripts
Model provider
sair390
Modalities
- Input: Text, Image
- Output: Text