Nalandadata/DrishtiTable-Qwen2.5-VL-7B API & Inference Endpoint

Results

Model	Method	TEDS	S-TEDS
Qwen2.5-VL-7B	Zero-shot	58.8%	74.0%
o4-mini (OpenAI)	Zero-shot	61.4%	70.0%
GPT-4.1 (OpenAI)	Zero-shot	68.0%	80.8%
GPT-4o (OpenAI)	Zero-shot	71.1%	84.3%
DrishtiTable-Qwen2.5-VL-7B (ours)	SFT	83.2%	89.7%

Breakdown by Table Type

Table Type	GPT-4o	Ours	Improvement (pp)
Statistical	77.7%	82.8%	+5.1
Financial	60.3%	82.0%	+21.7
Lookup	71.7%	85.7%	+14.0
Comparison	72.4%	95.9%	+23.5
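
The Improvement column is the absolute difference between the two scores in percentage points, not a relative gain. A minimal Python sketch that recomputes it from the rows above; the `scores` dictionary and the `improvement_pp` helper are illustrative names, not part of any released code:

```python
# Scores copied from the breakdown table above: (GPT-4o, ours), in percent.
scores = {
    "Statistical": (77.7, 82.8),
    "Financial": (60.3, 82.0),
    "Lookup": (71.7, 85.7),
    "Comparison": (72.4, 95.9),
}


def improvement_pp(baseline: float, ours: float) -> float:
    """Absolute gain in percentage points, rounded to one decimal place."""
    return round(ours - baseline, 1)


improvements = {name: improvement_pp(b, o) for name, (b, o) in scores.items()}

for name, delta in improvements.items():
    print(f"{name}: +{delta:.1f} pp")
```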

Usage

With Unsloth (Recommended)

python
from unsloth import FastVisionModel
from qwen_vl_utils import process_vision_info
from PIL import Image

# Load model
model, tokenizer = FastVisionModel.from_pretrained(
    "Nalandadata/DrishtiTable-Qwen2.5-VL-7B",
    max_seq_length=4096,
    load_in_4bit=True,
)
FastVisionModel.for_inference(model)

# Prepare input
image = Image.open("table.png").convert("RGB")
messages = [
    {"role": "system", "content": "You are a table structure recognition expert. Given an image of a table, output the HTML representation of the table structure and content. Use <table>, <thead>, <tbody>, <tr>, <th>, <td> tags. Use colspan and rowspan attributes for merged cells. Output ONLY the HTML table, nothing else."},
    {"role": "user", "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": "Convert this table image to HTML. Output only the HTML table structure with cell content."},
    ]},
]

# Generate
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = tokenizer(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to(model.device)

output = model.generate(**inputs, max_new_tokens=4096, do_sample=False)
generated = [o[len(i):] for i, o in zip(inputs.input_ids, output)]
html = tokenizer.batch_decode(generated, skip_special_tokens=True)[0].strip()
print(html)
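
The system prompt asks for the HTML table only, but generated text may still carry stray prose or Markdown fences around it. The sketch below is one possible post-processing step, not part of the released model code: `extract_table_html` and `table_shape` are hypothetical helpers built on the Python standard library, and they assume the output contains at most one non-nested table.

```python
import re
from html.parser import HTMLParser
from typing import Optional, Tuple

# Non-greedy match of the first <table>...</table> span.
# Assumption: tables are not nested, so the first closing tag ends the table.
TABLE_RE = re.compile(r"<table\b.*?</table>", re.IGNORECASE | re.DOTALL)


def extract_table_html(generated: str) -> Optional[str]:
    """Return the first <table>...</table> span in the text, or None."""
    match = TABLE_RE.search(generated)
    return match.group(0) if match else None


class _CellCounter(HTMLParser):
    """Counts <tr> rows and <td>/<th> cells as a cheap sanity check."""

    def __init__(self) -> None:
        super().__init__()
        self.rows = 0
        self.cells = 0

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self.rows += 1
        elif tag in ("td", "th"):
            self.cells += 1


def table_shape(table_html: str) -> Tuple[int, int]:
    """Return (row count, cell count) for an HTML table string."""
    counter = _CellCounter()
    counter.feed(table_html)
    return counter.rows, counter.cells
```

In the example above, `extract_table_html(html)` would be called on the decoded string; a `None` result or a shape of `(0, 0)` suggests the generation was empty or truncated and should be retried or inspected.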

With Transformers + PEFT

python
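from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from peft import PeftModel
from qwen_vl_utils import process_vision_info
import torch

# Load the base model in bfloat16, then attach the fine-tuned LoRA adapter
base_model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
model = PeftModel.from_pretrained(base_model, "Nalandadata/DrishtiTable-Qwen2.5-VL-7B")
processor = AutoProcessor.from_pretrained("Nalandadata/DrishtiTable-Qwen2.5-VL-7B")

# Generate: build `messages` exactly as in the Unsloth example above
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=4096, do_sample=False)
generated = [o[len(i):] for i, o in zip(inputs.input_ids, output)]
html = processor.batch_decode(generated, skip_special_tokens=True)[0].strip()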

Training Details

Parameter	Value
Base model	Qwen2.5-VL-7B-Instruct
Method	QLoRA (4-bit) via Unsloth
LoRA rank	32
LoRA alpha	32
Target modules	all-linear (incl. vision layers)
Training data	1,141 table images from DrishtiTable
Epochs	3
Learning rate	2e-4 (cosine schedule)
Batch size	1 (gradient accumulation 8)
Max sequence length	4,096
Optimizer	AdamW 8-bit
Hardware	1x NVIDIA A100-80GB
Training time	~35 minutes
Training cost	~$5 (Modal cloud)
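
A few quantities follow directly from the table. The sketch below derives them under stated assumptions: one training example per image, the final partial accumulation window counted as an optimizer step (trainers differ on this, so the step count is approximate), and the standard LoRA scaling of alpha divided by rank. The variable names are illustrative.

```python
import math

# Values copied from the training table above.
train_images = 1141
per_device_batch_size = 1
gradient_accumulation_steps = 8
epochs = 3
lora_rank = 32
lora_alpha = 32

# Examples consumed per optimizer update on the single GPU.
effective_batch_size = per_device_batch_size * gradient_accumulation_steps

# Assumption: the last, partially filled accumulation window still triggers an update.
steps_per_epoch = math.ceil(train_images / effective_batch_size)
total_steps = steps_per_epoch * epochs

# Assumption: standard LoRA scaling (alpha / r), not the rank-stabilized variant.
lora_scaling = lora_alpha / lora_rank

print(f"effective batch size: {effective_batch_size}")
print(f"optimizer steps: about {total_steps} ({steps_per_epoch} per epoch)")
print(f"LoRA scaling factor: {lora_scaling}")
```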

Dataset

Trained on DrishtiTable, a corpus of 1,421 table images from 9 Indian academic textbooks (S. Chand Publications) spanning Financial Accounting, Business Statistics, Quantitative Techniques, Operations Research, Ethics, and Engineering Steam Tables. Of these, 1,141 images were used for training.

Evaluation

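Evaluated using TEDS (Tree Edit Distance based Similarity), the standard metric for table structure recognition. TEDS measures structural and content similarity between the predicted and ground-truth HTML table trees on a 0-100% scale. S-TEDS is the structure-only variant: it makes the same comparison with cell text ignored, so it scores layout alone.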
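
For reference, TEDS is usually defined as below. This is the standard formulation from the table recognition literature; the card does not state which implementation produced the scores above. Here $T_a$ and $T_b$ are the predicted and ground-truth HTML trees, $\mathrm{EditDist}$ is the tree edit distance between them, and $|T|$ is the number of nodes in $T$.

```latex
\mathrm{TEDS}(T_a, T_b) = 1 - \frac{\mathrm{EditDist}(T_a, T_b)}{\max\left(|T_a|, |T_b|\right)}
```

A value of 1 (reported as 100%) means the two trees are identical.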

Links

Resource	Link
Live Demo	DrishtiTable Space
Dataset (sample)	Nalandadata/DrishtiTable
Base Model	Qwen/Qwen2.5-VL-7B-Instruct

Limitations

Trained on tables from a single publisher (S. Chand Publications); performance on other publishers/styles is untested
Optimized for Indian academic textbook tables; may not generalize to web tables, handwritten tables, or camera-captured tables
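HTML output may contain OCR errors in cell text: the structure-only score (S-TEDS 89.7%) is higher than the full score (TEDS 83.2%), so part of the remaining error lies in cell content rather than table structure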

Citation

bibtex
@article{drishtitable2026,
  title={Domain-Specific Fine-Tuning for Table Structure Recognition: A 7B Open Model Outperforms GPT-4o with 1,141 Training Samples},
  author={Nalanda Data},
  year={2026}
}

Commercial Use & Support

This model is released under Apache 2.0. The publicly released DrishtiTable dataset is a sample of a larger internal corpus of 1,421 expert-annotated tables from Indian academic textbooks.

Available on request:

Custom fine-tuned TSR models for your document layouts
Production deployment support (vLLM, quantization, serving)
Access to the full training corpus under custom commercial license terms
Partnerships for document-understanding evaluation and integration

Contact

For commercial licensing, full dataset access, custom data work, or partnerships:
📧 info@nalandadata.ai

For technical questions, integration help, or fine-tuning support:
📧 tech@nalandadata.ai

🌐 nalandadata.ai
