Task
Input: raw OCR text from a math exam question
Output:
{
"stem": "cleaned problem statement",
"answer_raw": "raw answer if clearly visible, otherwise empty",
"solution_raw": "",
"ocr_notes": ["risk tag 1", "risk tag 2"]
}
Intended boundary
This adapter is designed to sit on a separate line:
OCR -> OCR rebuilder -> existing GPT teaching chain
It should:
- improve
stem
- improve
answer_raw
- reduce hallucinated answers
- add conservative OCR risk notes
It should not:
- replace your main GPT explanation model
- solve the math problem
- generate a polished
solution_raw
At the current stage, solution_raw is intentionally kept empty.
Why this adapter exists
The base model can often emit valid JSON, but it tends to:
- hallucinate answers when gold should be empty
- drift away from the intended field semantics
- over-talk beyond the strict OCR rebuild task
This adapter is optimized for a more conservative behavior.
Main test comparison
Evaluation setting:
- base model:
Qwen/Qwen2.5-3B-Instruct
- adapter: current best
stage-1 protocol-only LoRA
- prompt: conservative non-solver prompt
- generation:
max_new_tokens=192
- test set: 30 held-out samples
Metric table
Table with columns: Metric, Base model, Stage-1 adapter| Metric | Base model | Stage-1 adapter |
|---|
| JSON parse rate | 80.00% | 76.67% |
stem exact match | 0.00% | 16.67% |
answer_raw exact match | 16.67% | 60.00% |
| empty-answer hallucination | 23.33% | 0.00% |
Visual comparison
JSON parse rate
Base model 80.00% ████████████████
Stage-1 adapter 76.67% ███████████████
answer_raw exact match
Base model 16.67% ███
Stage-1 adapter 60.00% ████████████
empty-answer hallucination (lower is better)
Base model 23.33% █████
Stage-1 adapter 0.00%
Semantic fidelity
We also measured average character-level similarity against gold labels on the same held-out test set.
Table with columns: Field, Base model, Stage-1 adapter| Field | Base model | Stage-1 adapter |
|---|
stem avg similarity | 0.4898 | 0.7217 |
answer_raw avg similarity | 0.4058 | 0.6667 |
ocr_notes avg similarity | 0.1597 | 0.2391 |
Visual comparison
stem average similarity
Base model 0.4898 ██████████
Stage-1 adapter 0.7217 ██████████████
answer_raw average similarity
Base model 0.4058 ████████
Stage-1 adapter 0.6667 █████████████
What this means
The adapter gives up a small amount of parse rate, but buys back the behaviors that matter most for this task:
- much better
answer_raw
- much better
stem
- zero hallucinated answers on gold-empty cases
For an OCR rebuilding module that feeds a larger teaching system, this tradeoff is usually worth it.
Dataset summary
The project used two task buckets during development:
single_problem_rebuild: 204 synthetic/curated samples
multi_problem_fragment_rebuild: 102 synthetic/curated samples
The released adapter comes from a stage-1 protocol-only training setup that focused on:
- one JSON object only
- fixed field schema
- conservative extraction
- no
solution_raw generation
Stage-1 smoke subset:
Known limitations
solution_raw is intentionally weak and currently fixed to empty.
ocr_notes is helpful but not yet fully normalized.
- Multi-problem mixed fragments are harder than single-problem OCR cleanup.
- This is a task adapter, not a general OCR foundation model.
Deployment
This repository includes a handler.py for Hugging Face Inference Endpoints custom deployment.
Recommended input:
{
"inputs": "raw OCR text"
}
Recommended output:
{
"stem": "...",
"answer_raw": "...",
"solution_raw": "",
"ocr_notes": ["..."],
"meta": {
"raw_ocr_notes": ["model raw notes"]
}
}
Local usage
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch
base_model = "Qwen/Qwen2.5-3B-Instruct"
adapter_model = "maru979/qwen2.5-3b-teacher-ocr-rebuilder"
tokenizer = AutoTokenizer.from_pretrained(base_model, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
base_model,
torch_dtype=torch.float16,
device_map="auto",
trust_remote_code=True,
)
model = PeftModel.from_pretrained(model, adapter_model)
model.eval()
If you deploy this adapter as an endpoint, prefer the included handler.py instead of directly exposing raw generation.