Ethosoft

trdocvqa-paligemma-3b-lora

Evaluation

Evaluation was run on the TR-DocVQA-Synth test split with 2,000 examples.

| Model | Setting | Test Samples | Normalized EM | ANLS | Token F1 | Empty Prediction Rate | Invalid Prediction Rate |
|---|---|---|---|---|---|---|---|
| PaliGemma-3B LoRA | Fine-tuned LoRA | 2000 | 0.7205 | 0.8745 | 0.7294 | 0.0000 | 0.0000 |

Additional paper-ready metrics, per-field breakdowns, error analysis, and LaTeX tables are included under evaluation/.
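
For readers who want to sanity-check their own predictions against the metrics named in the table, the following is a minimal sketch of per-example Normalized EM, ANLS, and Token F1. The normalization rules (NFKC, Turkish-aware lowercasing, whitespace collapsing) are assumptions and may differ from the scripts under evaluation/; the ANLS threshold of 0.5 is the commonly used default, not a value confirmed by this card. All function names are illustrative.

```python
import re
import unicodedata
from collections import Counter


def normalize(text: str) -> str:
    """Assumed normalization: NFKC, Turkish-aware lowercase, collapsed whitespace."""
    text = unicodedata.normalize("NFKC", text)
    # In Turkish, "İ" lowercases to "i" and "I" lowercases to dotless "ı".
    text = text.replace("İ", "i").replace("I", "ı").lower()
    return re.sub(r"\s+", " ", text).strip()


def exact_match(pred: str, gold: str) -> float:
    """1.0 if the normalized strings are identical, else 0.0."""
    return float(normalize(pred) == normalize(gold))


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance, computed one row at a time."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = curr
    return prev[-1]


def anls(pred: str, gold: str, threshold: float = 0.5) -> float:
    """Per-example ANLS: 1 - normalized edit distance, zeroed at or above the threshold."""
    p, g = normalize(pred), normalize(gold)
    if not p and not g:
        return 1.0
    nl = levenshtein(p, g) / max(len(p), len(g))
    return 1.0 - nl if nl < threshold else 0.0


def token_f1(pred: str, gold: str) -> float:
    """SQuAD-style F1 over whitespace tokens of the normalized strings."""
    p, g = normalize(pred).split(), normalize(gold).split()
    if not p or not g:
        return float(p == g)
    overlap = sum((Counter(p) & Counter(g)).values())
    if overlap == 0:
        return 0.0
    precision, recall = overlap / len(p), overlap / len(g)
    return 2 * precision * recall / (precision + recall)
```

Averaging each function over all test examples gives a corpus-level score comparable in spirit to the table, though exact agreement would require the same normalization as the original evaluation.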

Usage

```python
import torch
from PIL import Image
from transformers import AutoProcessor, PaliGemmaForConditionalGeneration
from peft import PeftModel

base_model = "google/paligemma-3b-pt-224"
adapter_id = "omerfaksoy/trdocvqa-paligemma-3b-lora"

# Load the processor from the adapter repo and the gated base model in bfloat16.
processor = AutoProcessor.from_pretrained(adapter_id)
model = PaliGemmaForConditionalGeneration.from_pretrained(
    base_model,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
# Attach the LoRA adapter on top of the base weights.
model = PeftModel.from_pretrained(model, adapter_id)
model.eval()

image = Image.open("document.png").convert("RGB")
question = "Toplam tutar nedir?"  # "What is the total amount?"
prompt = f"answer tr {question}\n"

inputs = processor(text=prompt, images=image, return_tensors="pt").to(model.device)
# Greedy decoding: no sampling, single beam.
with torch.inference_mode():
    generated = model.generate(**inputs, max_new_tokens=64, do_sample=False, num_beams=1)

# Drop the prompt tokens so only the generated answer is decoded.
prompt_len = inputs["input_ids"].shape[-1]
answer = processor.batch_decode(generated[:, prompt_len:], skip_special_tokens=True)[0].strip()
print(answer)
```

Training Summary

Method: LoRA fine-tuning
Base model: google/paligemma-3b-pt-224
Dataset: Ethosoft/TR-DocVQA-Synth
Language: Turkish
Input: document image + Turkish question
Output: short answer text
Hardware used: TRUBA Kolyoz H200

Important Notes

This repository contains a LoRA adapter, not a full merged copy of the base PaliGemma model. Users must comply with the terms of the base model and accept any gated access requirements for google/paligemma-3b-pt-224.
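
Because only the adapter is distributed, anyone who needs a standalone checkpoint has to merge it locally. The sketch below shows one way to do that with PEFT's `merge_and_unload()`; the function name `merge_adapter` and the output directory are illustrative, and the merged weights remain subject to the base model's terms.

```python
def merge_adapter(base_model: str, adapter_id: str, out_dir: str) -> None:
    """Fold the LoRA weights into the base model and save a standalone checkpoint."""
    # Heavy imports live inside the function so this module imports cheaply.
    import torch
    from peft import PeftModel
    from transformers import AutoProcessor, PaliGemmaForConditionalGeneration

    # Load the gated base model; access must already be granted on the Hub.
    model = PaliGemmaForConditionalGeneration.from_pretrained(
        base_model,
        torch_dtype=torch.bfloat16,
    )
    # Attach the adapter, then merge its weights into the base layers.
    model = PeftModel.from_pretrained(model, adapter_id)
    merged = model.merge_and_unload()

    # Save merged weights and the processor side by side.
    merged.save_pretrained(out_dir)
    AutoProcessor.from_pretrained(adapter_id).save_pretrained(out_dir)
```

Calling `merge_adapter("google/paligemma-3b-pt-224", "omerfaksoy/trdocvqa-paligemma-3b-lora", "merged/")` should produce a directory that loads with `PaliGemmaForConditionalGeneration.from_pretrained("merged/")` without PEFT installed.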

The model was developed for research use on synthetic Turkish document VQA data. Before production use, evaluate on real documents from the target domain and review privacy, licensing, and bias considerations.
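
A check on real documents can start as a small loop over labeled examples. The sketch below is one possible harness, not the evaluation code used for this model: `predict` is a hypothetical callable wrapping the inference code from the Usage section, and the comparison is a plain case-insensitive match that should be replaced with whatever normalization suits the target domain.

```python
from typing import Callable, Dict, Iterable, Tuple

# Each example is (image_path, question, gold_answer).
Example = Tuple[str, str, str]


def evaluate(predict: Callable[[str, str], str], examples: Iterable[Example]) -> Dict[str, float]:
    """Run predict on every example and report exact match and empty-answer rate."""
    total = exact = empty = 0
    for image_path, question, gold in examples:
        pred = predict(image_path, question).strip()
        total += 1
        empty += pred == ""
        # Simplistic comparison; not Turkish-aware (see the note on I/ı, İ/i casing).
        exact += pred.lower() == gold.strip().lower()
    if total == 0:
        raise ValueError("no examples to evaluate")
    return {
        "n": total,
        "exact_match": exact / total,
        "empty_rate": empty / total,
    }
```

Keeping `predict` as a parameter lets the same loop compare the adapter against the base model or another system on identical documents.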
