Glazkov

structured-extractor-qwen3vl-4b-exp232

README

License: apache-2.0

Benchmarks

Evaluated on a held-out test split of real financial-statement table crops, with the quality preset (`num_beams=4, repetition_penalty=1.1, length_penalty=1.0, min_new_tokens=200, max_new_tokens=4096`):

| Metric | Value |
| --- | --- |
| tuple_f1 (STRICT) | 0.6637 |
| parameter_f1 | 0.750 |
| value_accuracy | 0.820 |
| date_accuracy | 0.788 |
| unit_accuracy | 0.740 |
| count_accuracy | 0.78 |
| exact_match | 0.40 |
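The README does not spell out how tuple_f1 (STRICT) is computed. As a rough sketch only, the function below assumes a prediction counts as correct when all four fields (name, value, date, unit) match a reference row exactly, and that precision and recall are taken over row multisets. The name `strict_tuple_f1` and the empty-vs-empty convention are assumptions for illustration, not the project's scorer.

```python
from collections import Counter


def _row_key(row: dict) -> tuple:
    # A row is identified by all four fields; any mismatch makes it a different tuple.
    return (
        row.get("parameter_name"),
        row.get("parameter_value"),
        row.get("parameter_date"),
        row.get("parameter_unit"),
    )


def strict_tuple_f1(pred_rows: list[dict], gold_rows: list[dict]) -> float:
    """F1 over exactly matching (name, value, date, unit) tuples (assumed definition)."""
    pred = Counter(_row_key(r) for r in pred_rows)
    gold = Counter(_row_key(r) for r in gold_rows)
    if not pred and not gold:
        return 1.0  # assumption: two empty sets agree perfectly
    matched = sum((pred & gold).values())  # multiset intersection
    if matched == 0:
        return 0.0
    precision = matched / sum(pred.values())
    recall = matched / sum(gold.values())
    return 2 * precision * recall / (precision + recall)


gold = [
    {"parameter_name": "Interest income", "parameter_value": "533",
     "parameter_date": "2024", "parameter_unit": "millions"},
    {"parameter_name": "Interest income", "parameter_value": "498",
     "parameter_date": "2023", "parameter_unit": "millions"},
]
# Illustrative prediction: second row differs only in the unit spelling.
pred = [dict(gold[0]), {**gold[1], "parameter_unit": "million"}]
print(strict_tuple_f1(pred, gold))
```

Under this definition a single unit-spelling difference costs the whole tuple, which is the kind of mismatch a lenient scorer would forgive.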

3-seed stability (this recipe re-run with seeds 42 / 2024 / 314):

| Decoder | Mean | Std |
| --- | --- | --- |
| greedy | 0.624 | ±0.007 |
| b4 + rp1.1 (quality) | 0.660 | ±0.003 |

Seed variance is ~11× tighter than the previous merged-save recipe — saving the adapter unmerged also stabilizes seed-to-seed jitter.

Why an adapter?

The earlier recipe (exp93, also DoRA + MLP, same data, same hyperparameters) was saved as a merged model and scored 0.5316. exp232 is the same recipe except save_adapter_only=true, and scores 0.6637 — a +0.132 lift from a single config flag.

Root cause: DoRA's merge_and_unload followed by save-to-bf16 silently degrades the directional component of the DoRA decomposition. Loading the unmerged adapter and merging in memory at inference time recovers the full precision. Confirmed across multiple seeds. Discussed in the structured-extractor-train project notes (2026-05-26).
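To make the precision argument concrete, here is a toy, pure-Python illustration of the general effect: bf16 keeps only 8 significant bits, so a small adapter update added to a larger base weight can vanish when the sum is rounded to bf16. This is a simplified scalar model, not a reproduction of the DoRA direction/magnitude decomposition or of the project's diagnosis; `to_bf16` is a helper written for this example.

```python
import struct


def to_bf16(x: float) -> float:
    """Round a float to the nearest bfloat16 value (round-to-nearest-even)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    # bf16 is the top 16 bits of float32; round using the 16 bits being dropped.
    rounding = 0x7FFF + ((bits >> 16) & 1)
    bits = ((bits + rounding) >> 16) << 16
    return struct.unpack("<f", struct.pack("<I", bits & 0xFFFFFFFF))[0]


base_weight = 1.0      # stand-in for one base-model weight
adapter_delta = 0.001  # stand-in for a small learned update

# Merged save: the sum is rounded to bf16, and the update falls below bf16 resolution.
merged_then_saved = to_bf16(base_weight + adapter_delta)

# Unmerged save: the update is stored separately and only combined in memory.
merged_in_memory = to_bf16(base_weight) + adapter_delta

print(merged_then_saved)  # 1.0: the update is gone
print(merged_in_memory)
```

Near 1.0 the spacing between adjacent bf16 values is 2^-7 (about 0.0078), so any update smaller than half of that is rounded away in the merged checkpoint.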

This is also why this repo is library_name: peft — the file layout is the standard PEFT one (adapter_config.json + adapter_model.safetensors) plus an extra_trained_weights.pt for non-LoRA trained pieces (new-token embed/lm_head rows + frozen vision merger snapshot).

⚠️ Earlier versions of this project reported t_f1 ~0.82 — those numbers were inflated by a target-leakage bug in the eval pipeline (the answer was in the model's input). The numbers above are real zero-shot, measured with a leak-free eval (PageDataset(..., eval_mode=True)).

Recipe

- Base: Qwen/Qwen3-VL-4B-Instruct (Apache-2.0)
- Adapter: DoRA-style LoRA, r=16, alpha=32, dropout=0.05
- Targets: q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj (attention + MLP)
- Training data: real financial tables only (~1.5k train, no synthetic augmentation)
- Input: pre-cropped table image + markdown OCR of that table + date-column hint
- Schedule: 2 epochs, AdamW, lr=1e-4, weight_decay=0.05, warmup_ratio=0.05, seed=42
- save_adapter_only=true: adapter weights persisted unmerged in fp32-equivalent precision.
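For orientation, the adapter hyperparameters above map onto standard PEFT LoRA config fields roughly as follows. This is a sketch assuming the usual PEFT field names (`r`, `lora_alpha`, `lora_dropout`, `target_modules`, `use_dora`); the `adapter_config.json` shipped in this repo is the authoritative version and may contain additional fields.

```python
import json

# Assumed mapping of the recipe onto PEFT LoRA config fields.
# Values come from the recipe above; the field names are PEFT's standard ones.
adapter_config_sketch = {
    "peft_type": "LORA",
    "use_dora": True,  # DoRA is enabled as a flag on a LoRA config
    "r": 16,
    "lora_alpha": 32,
    "lora_dropout": 0.05,
    "target_modules": [
        "q_proj", "k_proj", "v_proj", "o_proj",  # attention projections
        "gate_proj", "up_proj", "down_proj",     # MLP projections
    ],
}

print(json.dumps(adapter_config_sketch, indent=2))
```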

Quick start

```bash
pip install -r requirements.txt   # torch, transformers, accelerate, peft, huggingface_hub, pillow
```

```python
from pathlib import Path

from inference import StructuredExtractor

extractor = StructuredExtractor.from_pretrained(
    "Glazkov/structured-extractor-qwen3vl-4b-exp232"
)

# Markdown OCR of the same table; the file name here is only a placeholder.
table_md_text = Path("table_crop.md").read_text(encoding="utf-8")

result = extractor.extract(
    "table_crop.png",
    markdown=table_md_text,
    date_columns=["2024", "2023"],
    preset="quality",
)

for row in result["parameters"]:
    print(row)
# {"parameter_name": "Interest income", "parameter_value": "533",
#  "parameter_date": "2024", "parameter_unit": "millions"}
```

The first call downloads the base model (Qwen/Qwen3-VL-4B-Instruct, ~8 GB) and the adapter (this repo, ~194 MB). The loader then:

1. Loads the base in bf16 (or fp16 on older GPUs).
2. Resizes token embeddings to fit the fine-tuned tokenizer (4 added sep tokens).
3. Applies the DoRA adapter via PEFT and merges it.
4. Restores new-token embed/lm_head rows + visual-merger snapshot from extra_trained_weights.pt.

All four steps are handled inside StructuredExtractor.from_pretrained.

Required inputs

| Input | Status | Notes |
| --- | --- | --- |
| Table image | Required | Pre-cropped to a single table region; long side resized to 1344px (handled internally) |
| Markdown OCR of that table | Required for benchmark quality | The per-sample disambiguator. Without it the model picks an arbitrary table and tuple-F1 collapses to near zero. The VLM essentially copies cell text from the markdown; the image alone is insufficient. |
| `date_columns` hint | Optional | List of date-column headers; helps when markdown is noisy |

The table image must be cropped to the target table, not a full page. Training used single-table crops; full-page inputs at inference time are untested.
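The wrapper handles resizing internally, so the following is only a sketch of what "long side resized to 1344px" implies for the output dimensions. It assumes the aspect ratio is preserved and that smaller images are scaled up as well; neither detail is confirmed by this README, and `long_side_resize` is a name made up for the example.

```python
def long_side_resize(width: int, height: int, target: int = 1344) -> tuple[int, int]:
    """Return (new_width, new_height) with the longer side equal to `target`."""
    if width <= 0 or height <= 0:
        raise ValueError("width and height must be positive")
    scale = target / max(width, height)
    # Keep at least one pixel on the short side for extreme aspect ratios.
    return max(1, round(width * scale)), max(1, round(height * scale))


print(long_side_resize(2688, 1344))  # (1344, 672)
print(long_side_resize(1000, 2000))  # (672, 1344)
```

This is mainly useful for estimating how much detail survives in very wide tables, where the short side can end up only a few hundred pixels tall.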

Best-quality pipeline

```python
result = extractor.extract(
    "table_crop.png",
    markdown=table_md_text,
    date_columns=["2024", "2023"],
    preset="quality",
)
```

`preset="quality"` is equivalent to `num_beams=4, length_penalty=1.0, min_new_tokens=200, repetition_penalty=1.1, max_new_tokens=4096, do_sample=False`. This is the configuration that yields STRICT 0.6637.

Fast pipeline (greedy)

```python
result = extractor.extract(
    "table_crop.png",
    markdown=table_md_text,
    preset="fast",
)
```

Greedy decoding (num_beams=1). About 3-4× faster than quality with a ~0.04 t_f1 drop (STRICT 0.6203). Use this when latency or throughput matters.
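The two presets can be pictured as plain dictionaries of Hugging Face `generate()` keyword arguments. The sketch below is an assumption about how such a mapping might look, not the contents of inference.py: the quality values are the ones listed in this README, while for the fast preset only `num_beams=1` is documented, so nothing else is filled in beyond greedy decoding itself.

```python
# Hypothetical preset table; names and structure are illustrative only.
GENERATION_PRESETS = {
    "quality": {
        "num_beams": 4,
        "length_penalty": 1.0,
        "min_new_tokens": 200,
        "repetition_penalty": 1.1,
        "max_new_tokens": 4096,
        "do_sample": False,
    },
    # Only num_beams=1 is documented for "fast"; greedy decoding implies no sampling.
    "fast": {"num_beams": 1, "do_sample": False},
}


def generation_kwargs(preset: str, **overrides) -> dict:
    """Return a copy of the preset with optional per-call overrides applied."""
    if preset not in GENERATION_PRESETS:
        raise ValueError(f"unknown preset: {preset!r}")
    return {**GENERATION_PRESETS[preset], **overrides}


print(generation_kwargs("quality", max_new_tokens=2048))
```

Returning a copy keeps per-call overrides from leaking back into the shared preset table.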

Batch inference

```python
from pathlib import Path
from inference import StructuredExtractor

extractor = StructuredExtractor.from_pretrained(
    "Glazkov/structured-extractor-qwen3vl-4b-exp232"
)

paths = sorted(Path("tables/").glob("*.png"))
# Each image needs a sidecar markdown file with the same stem (foo.png -> foo.md).
markdowns = [p.with_suffix(".md").read_text(encoding="utf-8") for p in paths]
results = extractor.extract_batch(
    paths,
    markdown_batch=markdowns,
    preset="fast",
    batch_size=1,           # batch_size > 1 is unsupported by this wrapper
)
```

See examples/batch.py for a CLI version. batch_size > 1 is unsupported in this wrapper because beam-search batching requires the training-time collator (left-padding plus concatenation of vision tensors), which is out of scope for the inference module.

Lenient scoring helper

score_lenient.py re-scores a JSONL of (image, parameters) predictions against a reference annotations JSONL using unit aliases (million ↔ millions, млн руб. ↔ млн руб) and date-year normalization. A pure metric helper — the model output itself is identical; the lift comes from accepting orthographic equivalents.

```bash
python score_lenient.py preds.jsonl annotations_test.jsonl
```
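The kind of normalization described above can be sketched in a few lines. This is not the code in score_lenient.py: the alias table contains only the two pairs named in this README, and reducing a date to its four-digit year is an assumption about what "date-year normalization" means.

```python
import re

# Hypothetical alias table; only these two pairs are mentioned in the README.
UNIT_ALIASES = {
    "million": "millions",
    "млн руб.": "млн руб",
}

YEAR = re.compile(r"(?:19|20)\d{2}")


def normalize_unit(unit: str) -> str:
    """Lowercase, collapse whitespace, then map known aliases to one spelling."""
    unit = " ".join(unit.lower().split())
    return UNIT_ALIASES.get(unit, unit)


def normalize_date(date: str) -> str:
    """Reduce a date string to its first four-digit year, if it has one."""
    match = YEAR.search(date)
    return match.group(0) if match else date.strip()


print(normalize_unit("Million"), normalize_date("31.12.2024"))
```

Applying both functions to predictions and references before comparison leaves the model output untouched and only widens what counts as a match.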

Output format

The model emits one parameter per line in pipe-separated sep_labels format:

```text
<|sep_meta|>
name: Interest income|value: 533|date: 2024|unit: millions
name: Foreign-currency transaction loss|value: 89|date: 2023|unit: millions
```

parser.py converts that to {"parameters": [{...}, ...]} and strips stray <|...|> control-token artifacts before splitting. The model occasionally emits one mid-row; without this strip a leading < contaminates the previous field. The fix is worth +0.024-0.042 t_f1 on its own.
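A minimal sketch of that parsing step, assuming the field-to-key mapping shown in the Quick start output (`name` to `parameter_name`, and so on). It is not the actual parser.py; the regex and the function name `parse_sep_labels` are illustrative. The important detail is that control tokens are removed before splitting on `|`, since the tokens themselves contain pipes.

```python
import re

# Matches control tokens such as <|sep_meta|>; must run before splitting on "|".
CONTROL_TOKEN = re.compile(r"<\|[^<>|]*\|>")

FIELD_MAP = {
    "name": "parameter_name",
    "value": "parameter_value",
    "date": "parameter_date",
    "unit": "parameter_unit",
}


def parse_sep_labels(text: str) -> dict:
    """Convert sep_labels output into {"parameters": [...]} (illustrative sketch)."""
    rows = []
    for line in text.splitlines():
        line = CONTROL_TOKEN.sub("", line).strip()
        if not line:
            continue  # header-only or empty line
        row = {}
        for field in line.split("|"):
            key, sep, value = field.partition(":")  # values may contain ":"
            key = key.strip()
            if sep and key in FIELD_MAP:
                row[FIELD_MAP[key]] = value.strip()
        if row:
            rows.append(row)
    return {"parameters": rows}


raw = (
    "<|sep_meta|>\n"
    "name: Interest income|value: 533<|sep_meta|>|date: 2024|unit: millions\n"
)
print(parse_sep_labels(raw))
```

In the sample, the stray token sits in the middle of a row; stripping it first keeps `value` clean instead of leaving a dangling `<` in the field.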

File layout

```text
.
├── adapter_config.json             # PEFT/DoRA config
├── adapter_model.safetensors       # adapter weights (~190 MB)
├── extra_trained_weights.pt        # new-token embed/lm_head rows + visual_merger
├── chat_template.jinja             # qwen3-vl chat template
├── tokenizer.json
├── tokenizer_config.json
├── inference.py                    # StructuredExtractor wrapper
├── parser.py                       # sep_labels → structured rows (with regex strip)
├── score_lenient.py                # lenient F1 helper
├── README.md                       # this file
├── LICENSE                         # Apache-2.0
├── requirements.txt
└── examples/
    ├── single_quality.py
    ├── single_fast.py
    └── batch.py
```

Hardware

| Preset | Min VRAM (single image) |
| --- | --- |
| fast (greedy) | ~12 GB |
| quality (beam=4) | ~24 GB |

bf16 on CUDA capability ≥ 8.0, fp16 elsewhere. CPU works but is unusably slow for a 4B VLM with beam search.
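The dtype rule above can be written as a small helper. This is a sketch that takes the compute capability as a `(major, minor)` tuple, which is the shape `torch.cuda.get_device_capability()` returns; torch itself is left out so the snippet runs anywhere, and `pick_dtype` is a hypothetical name rather than part of the wrapper.

```python
def pick_dtype(capability: tuple[int, int]) -> str:
    """Return the dtype name for a CUDA compute capability (major, minor)."""
    # bf16 needs compute capability 8.0 or newer; older GPUs fall back to fp16.
    return "bfloat16" if tuple(capability) >= (8, 0) else "float16"


print(pick_dtype((8, 0)))  # bfloat16
print(pick_dtype((7, 5)))  # float16
```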

Limitations

- Trained on financial-statement tables (RU/EN). Behavior on other domains is unmeasured.
- Bimodal errors: ~42% of test samples solve well (t_f1 ≥ 0.7), ~34% fail completely (t_f1 < 0.1). Average F1 obscures this. Worst failures cluster in specific source documents with dense multi-table pages where even the markdown disambiguator isn't enough.
- Markdown OCR is a hard requirement. The model cannot reliably OCR table cells from the image alone; it leans heavily on the markdown for cell text. Production pipelines need an upstream OCR step.
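Because the error distribution is bimodal, it can help to report bucket shares next to the mean when evaluating on your own data. The sketch below uses the two thresholds quoted above (t_f1 ≥ 0.7 and t_f1 < 0.1); the function name and the sample scores are made up for illustration and are not project data.

```python
def bucket_shares(per_sample_f1: list[float], good: float = 0.7, failed: float = 0.1) -> dict:
    """Share of samples that solve well, fail completely, or land in between."""
    n = len(per_sample_f1)
    if n == 0:
        raise ValueError("need at least one score")
    solved = sum(1 for s in per_sample_f1 if s >= good)
    fails = sum(1 for s in per_sample_f1 if s < failed)
    return {
        "solved": solved / n,
        "failed": fails / n,
        "middle": (n - solved - fails) / n,
    }


# Made-up per-sample scores, only to show the call.
scores = [0.9, 0.8, 0.0, 0.05, 0.5]
print(bucket_shares(scores))
```

With these illustrative scores the mean is 0.45, even though no single sample is close to 0.45, which is the effect the bimodality note warns about.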

License

Apache-2.0, matching the base model.

Citation / acknowledgements

Base model: Qwen/Qwen3-VL-4B-Instruct (Apache-2.0)
Training framework: structured-extractor-train (DoRA r=16 + MLP, 2 epochs, real-only, save_adapter_only=true)
Earlier checkpoint: structured-extractor-qwen3vl-4b-exp93 (same recipe, merged save, STRICT 0.5316)
