small-models-for-glam

index-card-extractor-4b-v0.1


License: apache-2.0

What it does well

  • Per-collection structured extraction from card images, each collection under its own schema.
  • Cross-lingual / cross-domain in one model: French and English handwritten death records, and English manuscript-catalogue cards.
  • Generalises to new schemas: on a collection and schema it never trained on, it still emits 100% valid, schema-conforming JSON (see the Rubenstein row under Results).
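The two generalisation numbers (valid-JSON and key-conformance) can be checked with a few lines of standard-library Python. This is a sketch under our own assumptions, not the project's scorer. "Valid" here means the raw output parses to a JSON object. "Conforming" means its top-level keys equal the template's, measured over valid outputs only; that reading is consistent with the base model scoring 0.733 valid but 1.000 key-conformance.

```python
import json
from typing import Any


def conformance_rates(outputs: list[str], template: dict[str, Any]) -> tuple[float, float]:
    """Return (valid-JSON rate, key-conformance rate) for raw model outputs.

    Assumed definitions: an output is valid if it parses to a JSON object; it
    conforms if its top-level keys equal the template's. Conformance is
    computed over valid outputs only.
    """
    expected = set(template)
    parsed = []
    for raw in outputs:
        try:
            obj = json.loads(raw)
        except json.JSONDecodeError:
            continue  # unparseable output counts against validity
        if isinstance(obj, dict):
            parsed.append(obj)
    valid_rate = len(parsed) / len(outputs) if outputs else 0.0
    conforming = sum(1 for obj in parsed if set(obj) == expected)
    key_rate = conforming / len(parsed) if parsed else 0.0
    return valid_rate, key_rate
```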

Results (held-out test sets; see the project write-up for full detail)

| Collection | Metric | This model | Reference |
|---|---|---|---|
| Teklia (FR handwritten deaths) | exact field-F1 | 0.887 | NuExtract-3 zero-shot 0.24 |
| NLS Advocates (manuscript catalogue) | retrieval score | 0.726 | Qwen3-VL-8B zero-shot 0.760 |
| NLS Advocates | manuscript-number F1 | 0.952 | Qwen3-VL-8B 0.886 |
| Southborough (EN handwritten deaths) | macro field-F1 | 0.750¹ | n/a (new collection) |
| Rubenstein (unseen schema) | valid-JSON / key-conformance | 1.000 / 1.000 | base 0.733 / 1.000 |

¹ The Southborough test set is agent-labelled, not human-verified, so this score is measured relative to those labels, not against gold. Teklia and NLS numbers are against human-grade ground truth.

Notably, on the manuscript-number field (the load-bearing retrieval field), the 4B model exceeds the 8B model that generated part of its training labels (0.952 vs 0.886), although on overall NLS retrieval the 8B (0.760) still leads this model (0.726).
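The exact field-F1 and macro field-F1 metrics are defined in the project write-up. As a rough illustration only, one common definition is sketched below. The normalisation is an assumption and the project's scorer may differ. Values are compared after whitespace stripping. A wrong value counts as both a false positive and a false negative. Micro-F1 pools counts over all fields, and macro-F1 averages per-field F1.

```python
from collections import defaultdict
from typing import Any, Optional


def _norm(value: Any) -> Optional[str]:
    """Treat None and empty strings as 'no value'; otherwise compare stripped text."""
    if value is None:
        return None
    text = str(value).strip()
    return text or None


def _f1(tp: int, fp: int, fn: int) -> float:
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def field_f1(preds: list[dict[str, Any]], golds: list[dict[str, Any]]) -> tuple[float, float]:
    """Return (micro exact field-F1, macro field-F1) over paired card records."""
    counts: dict[str, list[int]] = defaultdict(lambda: [0, 0, 0])  # field -> [tp, fp, fn]
    for pred, gold in zip(preds, golds):
        for field in set(pred) | set(gold):
            p, g = _norm(pred.get(field)), _norm(gold.get(field))
            c = counts[field]
            if p is not None and p == g:
                c[0] += 1  # exact match
                continue
            if p is not None:
                c[1] += 1  # predicted a value that is wrong or absent from gold
            if g is not None:
                c[2] += 1  # gold value missed or mismatched
    tp, fp, fn = (sum(c[i] for c in counts.values()) for i in range(3))
    micro = _f1(tp, fp, fn)
    # Fields that are empty in both prediction and gold on every card are skipped.
    per_field = [_f1(*c) for c in counts.values() if any(c)]
    macro = sum(per_field) / len(per_field) if per_field else 0.0
    return micro, macro
```

Exact matching is strict: "Lyons" against gold "Lyon" scores as a miss. This is one reason handwritten place names drag field-F1 down.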

Training

  • Method: LoRA SFT (rank 16, all layers), 3 epochs, on numind/NuExtract3. No RL: GRPO was evaluated and did not help on this task (see write-up).
  • Data (~930 examples), per collection. Two of the three collections train on machine-generated silver labels; only Teklia and the NLS eval set are human-grade.
    • Teklia (MIT): expert ground truth, taken from the source dataset's XML.
    • NLS Advocates: training labels are machine-generated silver (Qwen3-VL-8B); the held-out eval set (103 cards) is cataloguer-reviewed (human-grade).
    • Southborough (public domain): training labels are machine-generated silver (Qwen3-VL-8B); the held-out eval set (12 cards) is agent-labelled, not human-verified.
  • Each example pairs an image with its collection's schema-conditioned prompt and target JSON.
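A training example of the kind described above can be sketched as a plain record. The layout is hypothetical: the field names, and the choice to pass the template separately as the usage code does with `apply_chat_template(..., template=...)`, are assumptions and may not match the project's actual data files. The key check reflects the goal of schema-conforming targets.

```python
import json
from typing import Any


def make_sft_example(image_path: str, template: dict[str, Any], target: dict[str, Any]) -> dict[str, Any]:
    """Pair one card image with its collection's template and target JSON.

    Hypothetical record layout; the project's actual format may differ.
    """
    # Targets use exactly the template's keys (missing values as null),
    # so the model learns to emit schema-conforming output.
    if set(target) != set(template):
        raise ValueError(f"target keys {sorted(target)} != template keys {sorted(template)}")
    return {
        "template": json.dumps(template, indent=2),
        "messages": [
            {"role": "user", "content": [{"type": "image", "image": image_path}]},
            {"role": "assistant", "content": [{"type": "text", "text": json.dumps(target, ensure_ascii=False)}]},
        ],
    }
```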

Intended use

For libraries, archives and museums (GLAM) digitising card catalogues and index drawers into ingestible structured records: define the schema your catalogue needs, then run the model per collection. Best used as a first-pass extractor with human review, not as unattended ground truth.
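A first-pass workflow along these lines is sketched below. It keeps one template per collection and routes doubtful records to human review. Everything here is hypothetical. `TEMPLATES` holds example schemas. `extract` stands in for the model call shown under "How to use" and must return the raw generated text. The review rule flags unparseable, off-schema or empty-field output, which is a conservative heuristic rather than a recommendation from the card.

```python
import json
from typing import Any, Callable

# Hypothetical per-collection schemas: define the fields each catalogue needs.
TEMPLATES: dict[str, dict[str, str]] = {
    "deaths": {"name": "verbatim-string", "date_of_death": "string", "age": "string"},
    "manuscripts": {"manuscript_number": "verbatim-string", "title": "string"},
}


def first_pass(
    collection: str,
    cards: list[str],
    extract: Callable[[str, dict[str, str]], str],
) -> list[dict[str, Any]]:
    """Run extract(card, template) on each card and flag records for human review."""
    template = TEMPLATES[collection]
    results = []
    for card in cards:
        raw = extract(card, template)
        try:
            record = json.loads(raw)
        except json.JSONDecodeError:
            record = None
        # Send to a reviewer if output is unparseable, off-schema, or has empty fields.
        needs_review = (
            not isinstance(record, dict)
            or set(record) != set(template)
            or any(record[k] in (None, "") for k in template)
        )
        results.append({"card": card, "collection": collection, "record": record, "needs_review": needs_review})
    return results
```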

Limitations & honest caveats

  • Two of three training collections use silver labels, which caps quality on free-text fields (names, places); the model can also inherit the labeller's conventions.
  • Handwriting remains the hard part: place names and composite or long free-text fields are the weakest.
  • Test sets are small (12–103 cards), so treat single-point numbers as directional, not precise.
  • Use greedy, non-thinking decoding; reasoning mode was not trained and underperforms.
  • Only the model is released so far; the NLS Advocates source data is not yet public.
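A percentile bootstrap over per-card scores shows why 12-card results are only directional. This is a sketch that assumes you have one score per card, such as per-card field-F1. It is not part of the project's published evaluation.

```python
import random
import statistics


def bootstrap_ci(
    scores: list[float], n_resamples: int = 2000, alpha: float = 0.05, seed: int = 0
) -> tuple[float, float]:
    """Percentile bootstrap interval for the mean of per-card scores."""
    rng = random.Random(seed)
    # Resample cards with replacement and record each resample's mean.
    means = sorted(
        statistics.fmean(rng.choices(scores, k=len(scores))) for _ in range(n_resamples)
    )
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[int((1 - alpha / 2) * n_resamples) - 1]
    return lo, hi
```

With 12 cards, the 95% interval around a mean near 0.75 spans several tenths. With 103 cards it is much narrower, though still wide enough that small differences between models are not conclusive.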

How to use

Usage is the same as NuExtract-3: pass a card image plus a template (or a Pydantic schema) and get JSON shaped to it. Both interfaces still work after fine-tuning, including on schemas the model never trained on. On the held-out Rubenstein collection it produced 100% valid, schema-conforming JSON, against a 0.733 valid-JSON rate for the base model.

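```python
import json

from PIL import Image
from transformers import AutoModelForImageTextToText, AutoProcessor

# Assumed Hub id, taken from this card's title; replace if it differs.
MODEL_ID = "small-models-for-glam/index-card-extractor-4b-v0.1"

model = AutoModelForImageTextToText.from_pretrained(MODEL_ID, trust_remote_code=True, device_map="auto")
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)

template = {  # your collection's schema: define whatever fields you need
    "name": "verbatim-string",
    "date_of_death": "string",
    "cause_of_death": "string",
    "age": "string",
    "birthplace": "string",
}

card_image = Image.open("card.jpg").convert("RGB")  # hypothetical path to one scanned card
messages = [{"role": "user", "content": [{"type": "image", "image": card_image}]}]
text = processor.apply_chat_template(
    messages, template=json.dumps(template, indent=2),
    enable_thinking=False, add_generation_prompt=True, tokenize=False,
)

# Generation step (sketch, assuming the standard transformers vision-language flow):
# encode prompt + image, decode greedily, keep only the newly generated tokens.
inputs = processor(text=[text], images=[card_image], return_tensors="pt").to(model.device)
output_ids = model.generate(**inputs, do_sample=False, max_new_tokens=1024)
new_tokens = output_ids[:, inputs["input_ids"].shape[1]:]
record = json.loads(processor.batch_decode(new_tokens, skip_special_tokens=True)[0])
print(record)  # dict shaped to `template`
```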

Use greedy, non-thinking decoding (see Limitations). A Pydantic schema also works: convert Model.model_json_schema() into a NuExtract template (see the NuExtract-3 card and the numind utils).
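The supported route for Pydantic models is NuMind's own conversion utilities. For simple models, a hypothetical stand-alone equivalent is sketched below. It assumes NuExtract template conventions: type strings, a list of options for enums, and a one-element list for arrays. It takes the dict that Pydantic v2's `model_json_schema()` returns, so the converter itself needs only the standard library.

```python
from typing import Any


def schema_to_template(schema: dict[str, Any]) -> Any:
    """Convert a JSON Schema (e.g. from model_json_schema()) into a NuExtract-style template.

    Hypothetical helper; the type mapping below is an assumption.
    """
    defs = schema.get("$defs", {})

    def convert(node: dict[str, Any]) -> Any:
        # Resolve local references such as {"$ref": "#/$defs/Person"}.
        if "$ref" in node:
            return convert(defs[node["$ref"].split("/")[-1]])
        # Optional[X] appears as anyOf [X, null]: use the first non-null branch.
        if "anyOf" in node:
            branches = [b for b in node["anyOf"] if b.get("type") != "null"]
            return convert(branches[0]) if branches else "string"
        # Literal[...] and Enum fields become a list of allowed values.
        if "enum" in node:
            return list(node["enum"])
        kind = node.get("type")
        if kind == "object":
            return {name: convert(sub) for name, sub in node.get("properties", {}).items()}
        if kind == "array":
            return [convert(node.get("items", {}))]
        if kind in ("integer", "number"):
            return kind
        if node.get("format") == "date-time":
            return "date-time"
        return "string"  # plain strings and anything unmapped

    return convert(schema)
```

For a hypothetical Pydantic model `DeathRecord`, call `schema_to_template(DeathRecord.model_json_schema())`. JSON Schema has no notion of `verbatim-string`, so switch fields that must be copied exactly, such as names, by hand afterwards.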

Reproducibility & the bigger idea

Full pipeline, scorer, eval scripts and per-collection schemas are in the project write-up. The evidence here suggests that a collaborative, multi-institution index-card ground-truth (GT) corpus could yield one strong shared open model for GLAM card digitisation: adding collections improved this model's ability to follow unseen schemas. Contributions of labelled card collections are welcome.

Model provider: small-models-for-glam

Model tree: base numind/NuExtract3; fine-tuned: this model

Modalities: input video, text, image; output text
