maru979

qwen2.5-3b-teacher-ocr-rebuilder

README

License: apache-2.0

Task

Input: raw OCR text from a math exam question

Output:

json
{
  "stem": "cleaned problem statement",
  "answer_raw": "raw answer if clearly visible, otherwise empty",
  "solution_raw": "",
  "ocr_notes": ["risk tag 1", "risk tag 2"]
}

Intended boundary

This adapter is designed to sit on a separate line:

OCR -> OCR rebuilder -> existing GPT teaching chain

It should:

improve stem
improve answer_raw
reduce hallucinated answers
add conservative OCR risk notes

It should not:

replace your main GPT explanation model
solve the math problem
generate a polished solution_raw

At the current stage, solution_raw is intentionally kept empty.

Why this adapter exists

The base model can often emit valid JSON, but it tends to:

hallucinate answers when gold should be empty
drift away from the intended field semantics
over-talk beyond the strict OCR rebuild task

This adapter is optimized for a more conservative behavior.

Main test comparison

Evaluation setting:

base model: Qwen/Qwen2.5-3B-Instruct
adapter: current best stage-1 protocol-only LoRA
prompt: conservative non-solver prompt
generation: max_new_tokens=192
test set: 30 held-out samples

Metric table

Table with columns: Metric, Base model, Stage-1 adapter
Metric	Base model	Stage-1 adapter
JSON parse rate	80.00%	76.67%
`stem` exact match	0.00%	16.67%
`answer_raw` exact match	16.67%	60.00%
empty-answer hallucination	23.33%	0.00%

Visual comparison

JSON parse rate

text
Base model      80.00%  ████████████████
Stage-1 adapter 76.67%  ███████████████

answer_raw exact match

text
Base model      16.67%  ███
Stage-1 adapter 60.00%  ████████████

empty-answer hallucination (lower is better)

text
Base model      23.33%  █████
Stage-1 adapter  0.00%

Semantic fidelity

We also measured average character-level similarity against gold labels on the same held-out test set.

Table with columns: Field, Base model, Stage-1 adapter
Field	Base model	Stage-1 adapter
`stem` avg similarity	0.4898	0.7217
`answer_raw` avg similarity	0.4058	0.6667
`ocr_notes` avg similarity	0.1597	0.2391

Visual comparison

stem average similarity

text
Base model      0.4898  ██████████
Stage-1 adapter 0.7217  ██████████████

answer_raw average similarity

text
Base model      0.4058  ████████
Stage-1 adapter 0.6667  █████████████

What this means

The adapter gives up a small amount of parse rate, but buys back the behaviors that matter most for this task:

much better answer_raw
much better stem
zero hallucinated answers on gold-empty cases

For an OCR rebuilding module that feeds a larger teaching system, this tradeoff is usually worth it.

Dataset summary

The project used two task buckets during development:

single_problem_rebuild: 204 synthetic/curated samples
multi_problem_fragment_rebuild: 102 synthetic/curated samples

The released adapter comes from a stage-1 protocol-only training setup that focused on:

one JSON object only
fixed field schema
conservative extraction
no solution_raw generation

Stage-1 smoke subset:

train: 32
dev: 8

Known limitations

solution_raw is intentionally weak and currently fixed to empty.
ocr_notes is helpful but not yet fully normalized.
Multi-problem mixed fragments are harder than single-problem OCR cleanup.
This is a task adapter, not a general OCR foundation model.

Deployment

This repository includes a handler.py for Hugging Face Inference Endpoints custom deployment.

Recommended input:

json
{
  "inputs": "raw OCR text"
}

Recommended output:

json
{
  "stem": "...",
  "answer_raw": "...",
  "solution_raw": "",
  "ocr_notes": ["..."],
  "meta": {
    "raw_ocr_notes": ["model raw notes"]
  }
}

Local usage

python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch

base_model = "Qwen/Qwen2.5-3B-Instruct"
adapter_model = "maru979/qwen2.5-3b-teacher-ocr-rebuilder"

tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
)
model = PeftModel.from_pretrained(model, adapter_model)
model.eval()

If you deploy this adapter as an endpoint, prefer the included handler.py instead of directly exposing raw generation.

Available on FriendliAI

Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more

Container

Run this model inference with full control and performance in your environment.

Learn more

Model Details

Model Provider

maru979

Model Tree

Base

Qwen/Qwen2.5-3B-Instruct

Adapter

this model

Input Modalities

Text

Output Modalities

Text

Supported Functionality

Dedicated EndpointsContainer

Explore FriendliAI today

Get started Talk to an engineer

README

License: apache-2.0

Task

Input: raw OCR text from a math exam question

Output:

json
{
  "stem": "cleaned problem statement",
  "answer_raw": "raw answer if clearly visible, otherwise empty",
  "solution_raw": "",
  "ocr_notes": ["risk tag 1", "risk tag 2"]
}

Intended boundary

This adapter is designed to sit on a separate line:

OCR -> OCR rebuilder -> existing GPT teaching chain

It should:

improve stem
improve answer_raw
reduce hallucinated answers
add conservative OCR risk notes

It should not:

replace your main GPT explanation model
solve the math problem
generate a polished solution_raw

At the current stage, solution_raw is intentionally kept empty.

Why this adapter exists

The base model can often emit valid JSON, but it tends to:

hallucinate answers when gold should be empty
drift away from the intended field semantics
over-talk beyond the strict OCR rebuild task

This adapter is optimized for a more conservative behavior.

Main test comparison

Evaluation setting:

base model: Qwen/Qwen2.5-3B-Instruct
adapter: current best stage-1 protocol-only LoRA
prompt: conservative non-solver prompt
generation: max_new_tokens=192
test set: 30 held-out samples

Metric table

Table with columns: Metric, Base model, Stage-1 adapter
Metric	Base model	Stage-1 adapter
JSON parse rate	80.00%	76.67%
`stem` exact match	0.00%	16.67%
`answer_raw` exact match	16.67%	60.00%
empty-answer hallucination	23.33%	0.00%

Visual comparison

JSON parse rate

text
Base model      80.00%  ████████████████
Stage-1 adapter 76.67%  ███████████████

answer_raw exact match

text
Base model      16.67%  ███
Stage-1 adapter 60.00%  ████████████

empty-answer hallucination (lower is better)

text
Base model      23.33%  █████
Stage-1 adapter  0.00%

Semantic fidelity

We also measured average character-level similarity against gold labels on the same held-out test set.

Table with columns: Field, Base model, Stage-1 adapter
Field	Base model	Stage-1 adapter
`stem` avg similarity	0.4898	0.7217
`answer_raw` avg similarity	0.4058	0.6667
`ocr_notes` avg similarity	0.1597	0.2391

Visual comparison

stem average similarity

text
Base model      0.4898  ██████████
Stage-1 adapter 0.7217  ██████████████

answer_raw average similarity

text
Base model      0.4058  ████████
Stage-1 adapter 0.6667  █████████████

What this means

The adapter gives up a small amount of parse rate, but buys back the behaviors that matter most for this task:

much better answer_raw
much better stem
zero hallucinated answers on gold-empty cases

For an OCR rebuilding module that feeds a larger teaching system, this tradeoff is usually worth it.

Dataset summary

The project used two task buckets during development:

single_problem_rebuild: 204 synthetic/curated samples
multi_problem_fragment_rebuild: 102 synthetic/curated samples

The released adapter comes from a stage-1 protocol-only training setup that focused on:

one JSON object only
fixed field schema
conservative extraction
no solution_raw generation

Stage-1 smoke subset:

train: 32
dev: 8

Known limitations

solution_raw is intentionally weak and currently fixed to empty.
ocr_notes is helpful but not yet fully normalized.
Multi-problem mixed fragments are harder than single-problem OCR cleanup.
This is a task adapter, not a general OCR foundation model.

Deployment

This repository includes a handler.py for Hugging Face Inference Endpoints custom deployment.

Recommended input:

json
{
  "inputs": "raw OCR text"
}

Recommended output:

json
{
  "stem": "...",
  "answer_raw": "...",
  "solution_raw": "",
  "ocr_notes": ["..."],
  "meta": {
    "raw_ocr_notes": ["model raw notes"]
  }
}

Local usage

python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch

base_model = "Qwen/Qwen2.5-3B-Instruct"
adapter_model = "maru979/qwen2.5-3b-teacher-ocr-rebuilder"

tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    base_model,
    torch_dtype=torch.float16,
    device_map="auto",
    trust_remote_code=True,
)
model = PeftModel.from_pretrained(model, adapter_model)
model.eval()

If you deploy this adapter as an endpoint, prefer the included handler.py instead of directly exposing raw generation.