Nalandadata
DrishtiTable-Qwen2.5-VL-7B
License: apache-2.0
Results
| Model | Method | TEDS | S-TEDS |
|---|---|---|---|
| Qwen2.5-VL-7B | Zero-shot | 58.8% | 74.0% |
| o4-mini (OpenAI) | Zero-shot | 61.4% | 70.0% |
| GPT-4.1 (OpenAI) | Zero-shot | 68.0% | 80.8% |
| GPT-4o (OpenAI) | Zero-shot | 71.1% | 84.3% |
| DrishtiTable-Qwen2.5-VL-7B (ours) | SFT | 83.2% | 89.7% |
Breakdown by Table Type
| Table Type | GPT-4o | Ours | Improvement (percentage points) |
|---|---|---|---|
| Statistical | 77.7% | 82.8% | +5.1 |
| Financial | 60.3% | 82.0% | +21.7 |
| Lookup | 71.7% | 85.7% | +14.0 |
| Comparison | 72.4% | 95.9% | +23.5 |
Usage
With Unsloth (Recommended)
```python
from unsloth import FastVisionModel
from qwen_vl_utils import process_vision_info
from PIL import Image

# Load model
model, tokenizer = FastVisionModel.from_pretrained(
    "Nalandadata/DrishtiTable-Qwen2.5-VL-7B",
    max_seq_length=4096,
    load_in_4bit=True,
)
FastVisionModel.for_inference(model)

# Prepare input
image = Image.open("table.png").convert("RGB")
messages = [
    {
        "role": "system",
        "content": (
            "You are a table structure recognition expert. "
            "Given an image of a table, output the HTML representation of the table structure and content. "
            "Use <table>, <thead>, <tbody>, <tr>, <th>, <td> tags. "
            "Use colspan and rowspan attributes for merged cells. "
            "Output ONLY the HTML table, nothing else."
        ),
    },
    {
        "role": "user",
        "content": [
            {"type": "image", "image": image},
            {"type": "text", "text": "Convert this table image to HTML. Output only the HTML table structure with cell content."},
        ],
    },
]

# Generate
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = tokenizer(
    text=[text],
    images=image_inputs,
    videos=video_inputs,
    padding=True,
    return_tensors="pt",
).to(model.device)
output = model.generate(**inputs, max_new_tokens=4096, do_sample=False)
generated = [o[len(i):] for i, o in zip(inputs.input_ids, output)]
html = tokenizer.batch_decode(generated, skip_special_tokens=True)[0].strip()
print(html)
```
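The prompt above asks the model to encode merged cells with `colspan` and `rowspan`, so downstream code usually has to expand those spans before the table can be used as rows and columns. The following is a minimal sketch of that post-processing step using only the Python standard library. The names `TableGridParser` and `html_table_to_grid` and the sample HTML string are hypothetical illustrations, not part of this model's tooling, and the sketch assumes well-formed output with integer span attributes.

```python
from html.parser import HTMLParser


class TableGridParser(HTMLParser):
    """Collect the cells of each <tr> together with their colspan/rowspan."""

    def __init__(self):
        super().__init__()
        self.rows = []     # one list per <tr>; items are (text, colspan, rowspan)
        self._cell = None  # [text_parts, colspan, rowspan] while inside a cell

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows.append([])
        elif tag in ("td", "th") and self.rows:
            a = dict(attrs)
            # Missing or empty span attributes default to 1.
            self._cell = [[], int(a.get("colspan") or 1), int(a.get("rowspan") or 1)]

    def handle_data(self, data):
        if self._cell is not None:
            self._cell[0].append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            parts, colspan, rowspan = self._cell
            self.rows[-1].append(("".join(parts).strip(), colspan, rowspan))
            self._cell = None


def html_table_to_grid(html):
    """Expand merged cells so every grid slot holds the text of its cell."""
    parser = TableGridParser()
    parser.feed(html)
    grid = {}  # (row, col) -> text
    for r, cells in enumerate(parser.rows):
        c = 0
        for text, colspan, rowspan in cells:
            while (r, c) in grid:  # slot already taken by a rowspan from above
                c += 1
            for dr in range(rowspan):
                for dc in range(colspan):
                    grid[(r + dr, c + dc)] = text
            c += colspan
    n_rows = max((r for r, _ in grid), default=-1) + 1
    n_cols = max((c for _, c in grid), default=-1) + 1
    return [[grid.get((r, c), "") for c in range(n_cols)] for r in range(n_rows)]


# Hypothetical model output with one rowspan and one colspan.
sample = (
    "<table><thead><tr><th rowspan='2'>Year</th><th colspan='2'>Sales</th></tr>"
    "<tr><th>Q1</th><th>Q2</th></tr></thead>"
    "<tbody><tr><td>2024</td><td>10</td><td>12</td></tr></tbody></table>"
)
grid = html_table_to_grid(sample)
for row in grid:
    print(row)
```

Repeating the text of a merged cell in every slot it covers is one possible design choice; leaving the covered slots empty may suit some pipelines better.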
With Transformers + PEFT
```python
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from peft import PeftModel
import torch

# Load the base model in bfloat16
base_model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Attach the fine-tuned LoRA adapter to the base model
model = PeftModel.from_pretrained(base_model, "Nalandadata/DrishtiTable-Qwen2.5-VL-7B")

# Load the processor (tokenizer and image preprocessing)
processor = AutoProcessor.from_pretrained("Nalandadata/DrishtiTable-Qwen2.5-VL-7B")
```
Training Details
| Parameter | Value |
|---|---|
| Base model | Qwen2.5-VL-7B-Instruct |
| Method | QLoRA (4-bit) via Unsloth |
| LoRA rank | 32 |
| LoRA alpha | 32 |
| Target modules | all-linear (incl. vision layers) |
| Training data | 1,141 table images from DrishtiTable |
| Epochs | 3 |
| Learning rate | 2e-4 (cosine schedule) |
| Batch size | 1 (gradient accumulation 8) |
| Max sequence length | 4,096 |
| Optimizer | AdamW 8-bit |
| Hardware | 1x NVIDIA A100-80GB |
| Training time | ~35 minutes |
| Training cost | ~$5 (Modal cloud) |
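As a rough illustration of what these settings imply, the sketch below derives the effective batch size, an approximate number of optimizer steps, and the LoRA scaling factor from the values in the table. It assumes that all 1,141 images are seen once per epoch on the single GPU, that a final partial accumulation window still counts as a step, and that the standard LoRA scaling of alpha divided by rank applies. The exact step count depends on how the trainer handles the last partial batch, so treat the numbers as estimates rather than logged values.

```python
import math

# Values copied from the Training Details table.
num_train_samples = 1141
per_device_batch_size = 1
grad_accum_steps = 8
epochs = 3
lora_rank = 32
lora_alpha = 32

# Samples contributing to each optimizer update (single GPU assumed).
effective_batch = per_device_batch_size * grad_accum_steps

# Assumption: the last, partially filled accumulation window counts as a step.
steps_per_epoch = math.ceil(num_train_samples / effective_batch)
total_steps = steps_per_epoch * epochs

# Assumption: standard LoRA scaling (alpha / rank), not a rank-stabilized variant.
lora_scale = lora_alpha / lora_rank

print(f"effective batch size: {effective_batch}")
print(f"optimizer steps per epoch: {steps_per_epoch}")
print(f"total optimizer steps: {total_steps}")
print(f"LoRA scaling factor: {lora_scale}")
```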
Dataset
DrishtiTable contains 1,421 table images from 9 Indian academic textbooks (S. Chand Publications) spanning Financial Accounting, Business Statistics, Quantitative Techniques, Operations Research, Ethics, and Engineering Steam Tables. Of these, 1,141 images were used for training (see Training Details).
Evaluation
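Evaluated using TEDS (Tree-Edit-Distance-based Similarity), the standard metric for table structure recognition. TEDS measures structural and content similarity between the predicted and ground-truth HTML table trees, reported here on a 0-100% scale. S-TEDS is the structure-only variant: it ignores cell text and compares only the table's tag structure.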
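For reference, TEDS is commonly defined as follows, where $T_a$ and $T_b$ are the predicted and ground-truth HTML trees, $\mathrm{EditDist}$ is the tree edit distance between them, and $|T|$ is the number of nodes in tree $T$:

$$
\mathrm{TEDS}(T_a, T_b) = 1 - \frac{\mathrm{EditDist}(T_a, T_b)}{\max\left(|T_a|, |T_b|\right)}
$$

As a worked example with hypothetical numbers (not taken from this evaluation): if the larger of the two trees has 40 nodes and the edit distance is 6, then

$$
\mathrm{TEDS} = 1 - \frac{6}{40} = 0.85 = 85\%.
$$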
Links
| Resource | Link |
|---|---|
| Live Demo | DrishtiTable Space |
| Dataset (sample) | Nalandadata/DrishtiTable |
| Base Model | Qwen/Qwen2.5-VL-7B-Instruct |
Limitations
- Trained on tables from a single publisher (S. Chand Publications); performance on other publishers/styles is untested
- Optimized for Indian academic textbook tables; may not generalize to web tables, handwritten tables, or camera-captured tables
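- HTML output may contain OCR errors in cell text: the structure-only score (S-TEDS, 89.7%) exceeds the full score (TEDS, 83.2%), which suggests that part of the remaining error lies in cell content rather than table layout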
Citation
```bibtex
@article{drishtitable2026,
  title={Domain-Specific Fine-Tuning for Table Structure Recognition: A 7B Open Model Outperforms GPT-4o with 1,141 Training Samples},
  author={Nalanda Data},
  year={2026}
}
```
Commercial Use & Support
This model is released under Apache 2.0. The training data (DrishtiTable) is a public sample of a larger internal corpus of 1,421 expert-annotated tables from Indian academic textbooks.
Available on request:
- Custom fine-tuned TSR models for your document layouts
- Production deployment support (vLLM, quantization, serving)
- Access to the full training corpus under custom commercial license terms
- Partnerships for document-understanding evaluation and integration
Contact
For commercial licensing, full dataset access, custom data work, or partnerships:
📧 info@nalandadata.ai
For technical questions, integration help, or fine-tuning support:
📧 tech@nalandadata.ai