Eval
987-prompt LaTeX-extracted test split. Author names are matched
case-insensitively with fuzz.ratio (threshold 85). Affiliations are matched
with token_sort_ratio (threshold 85) after domain normalization (expanding
abbreviations, dropping postal codes). The affiliation matcher was audited at
precision 1.0 on a 64-pair labeled set, so the metric does not credit
genuinely different institutions.
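The affiliation rule described above can be sketched roughly as follows. This is an illustration under stated assumptions, not the project's evaluation code: the abbreviation table and the postal-code pattern are hypothetical placeholders, and `difflib.SequenceMatcher` from the standard library stands in for `token_sort_ratio` so the snippet runs without extra dependencies (its scores are on the same 0-100 scale but may differ slightly from the real scorer).

```python
import re
from difflib import SequenceMatcher

# Hypothetical abbreviation table; the real normalization rules are not listed here.
ABBREVIATIONS = {"univ": "university", "dept": "department", "inst": "institute"}
# Hypothetical postal-code pattern: any standalone run of 4-6 digits.
POSTAL_CODE = re.compile(r"\b\d{4,6}\b")


def normalize_affiliation(text: str) -> str:
    """Lowercase, drop postal codes and punctuation, expand abbreviations."""
    text = POSTAL_CODE.sub(" ", text.lower())
    tokens = re.findall(r"[^\W_]+", text)
    return " ".join(ABBREVIATIONS.get(tok, tok) for tok in tokens)


def token_sort_score(a: str, b: str) -> float:
    """Stand-in for token_sort_ratio: sort tokens, then score on a 0-100 scale."""
    a_sorted = " ".join(sorted(a.split()))
    b_sorted = " ".join(sorted(b.split()))
    return 100.0 * SequenceMatcher(None, a_sorted, b_sorted).ratio()


def affiliations_match(gold: str, pred: str, threshold: float = 85.0) -> bool:
    """True when the normalized strings score at or above the threshold."""
    score = token_sort_score(normalize_affiliation(gold), normalize_affiliation(pred))
    return score >= threshold


gold = "Dept. of Physics, Univ. of Chicago, Chicago, IL 60637"
pred = "University of Chicago, Department of Physics, Chicago, IL"
print(affiliations_match(gold, pred))  # True: same tokens after normalization
print(affiliations_match("University of Chicago", "Peking University"))  # False
```

Sorting tokens before scoring makes the comparison insensitive to the order in which department, institution, and city appear.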
| Stage | Test reward |
|---|---|
| Stage 1 only (distil) | 0.918 |
| Stage 1 + Stage 2 (this adapter) | 0.921 |
Per-category precision / recall / F0.5 / F1
Pooled (micro) TP/FP/FN across all 987 prompts. Parse rate (schema-valid JSON
emitted): 0.992 for both stages.
Authors — fuzzy name match across each prompt's gold vs. predicted author list.
| Stage | TP | FP | FN | P | R | F0.5 | F1 |
|---|---|---|---|---|---|---|---|
| Stage 1 only | 3095 | 147 | 680 | 0.955 | 0.820 | 0.924 | 0.882 |
| Stage 1 + 2 | 3008 | 90 | 767 | 0.971 | 0.797 | 0.930 | 0.875 |
Affiliations: affiliation matching within matched-author pairs. Gold
affiliations of unmatched authors count as FN, and predicted affiliations of
unmatched authors count as FP.
| Stage | TP | FP | FN | P | R | F0.5 | F1 |
|---|---|---|---|---|---|---|---|
| Stage 1 only | 2821 | 596 | 1348 | 0.826 | 0.677 | 0.791 | 0.744 |
| Stage 1 + 2 | 2793 | 361 | 1376 | 0.886 | 0.670 | 0.832 | 0.763 |
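The pooled scores follow from the TP/FP/FN counts by the standard precision, recall, and F-beta definitions. The helper below is a minimal sketch (the function name `prf` is made up for illustration); it reproduces the Stage 1 author row from its counts (TP 3095, FP 147, FN 680).

```python
def prf(tp: int, fp: int, fn: int, beta: float = 1.0) -> tuple[float, float, float]:
    """Precision, recall, and F-beta from pooled (micro) counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    if precision == 0.0 and recall == 0.0:
        return precision, recall, 0.0
    b2 = beta * beta
    f_beta = (1 + b2) * precision * recall / (b2 * precision + recall)
    return precision, recall, f_beta


# Stage 1 author counts taken from the table above.
p, r, f1 = prf(3095, 147, 680)
_, _, f05 = prf(3095, 147, 680, beta=0.5)
print(f"P={p:.3f} R={r:.3f} F0.5={f05:.3f} F1={f1:.3f}")
# P=0.955 R=0.820 F0.5=0.924 F1=0.882
```

F0.5 weights precision more heavily than recall, which is why it sits closer to P than F1 does.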
Macro scores (mean of per-prompt P/R/F) for Stage 1 + 2 are higher than the
micro scores because large multi-author papers, where misses concentrate,
dominate the pooled counts: authors P 0.954 / R 0.954 / F0.5 0.953 / F1 0.953;
affiliations P 0.840 / R 0.822 / F0.5 0.833 / F1 0.826.
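A toy example may make the micro/macro gap concrete. The per-prompt counts below are invented purely for illustration and are not taken from the test split: two small papers are extracted perfectly, while one large paper loses half of its authors.

```python
from statistics import mean


def recall(tp: int, fn: int) -> float:
    """Recall for one prompt or for pooled counts."""
    return tp / (tp + fn) if tp + fn else 0.0


# Hypothetical (tp, fn) pairs: two 2-author papers, one 100-author paper.
prompts = [(2, 0), (2, 0), (50, 50)]

# Micro: pool the counts first, then compute one score.
micro = recall(sum(tp for tp, _ in prompts), sum(fn for _, fn in prompts))
# Macro: score each prompt, then average, so every paper weighs the same.
macro = mean(recall(tp, fn) for tp, fn in prompts)

print(f"micro={micro:.3f} macro={macro:.3f}")  # micro=0.519 macro=0.833
```

The single large paper contributes most of the pooled counts, so it pulls the micro score down much further than it pulls the macro score.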
DAPO moves precision, not recall: authors FP drops 39% (147 → 90) and
affiliations FP drops 39% (596 → 361). The model learned to stop emitting
hallucinated or wrong items. Recall is roughly flat, slipping slightly
(authors TP 3095 → 3008, affiliations TP 2821 → 2793). The unreachable items
are ~4% truly bad data (the author block was lost during source extraction
for those papers) plus very large author lists where some authors are
consistently skipped.
A discovery during training was that the naive case-sensitive fuzz.ratio
metric was scoring ~22% of correctly extracted papers as 0 — gold labels
are often ALL-CAPS (LUKASZ PAWELEC) while a correct extraction from the
paper text is mixed-case (Łukasz Pawelec); the corrected metric reveals
the model was always ~0.91, not the 0.81 that the buggy metric showed.
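The effect of the case bug can be reproduced on the example name above. This is a sketch, not the project's metric code: `difflib.SequenceMatcher` from the standard library is used as a stand-in for `fuzz.ratio` (same 0-100 scale, possibly slightly different scores), and `names_match` is a hypothetical helper name.

```python
from difflib import SequenceMatcher


def ratio(a: str, b: str) -> float:
    """Stand-in for fuzz.ratio: string similarity on a 0-100 scale."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def names_match(gold: str, pred: str, threshold: float = 85.0,
                case_sensitive: bool = False) -> bool:
    """Compare two names, folding case first unless case_sensitive is set."""
    if not case_sensitive:
        gold, pred = gold.casefold(), pred.casefold()
    return ratio(gold, pred) >= threshold


gold, pred = "LUKASZ PAWELEC", "Łukasz Pawelec"
print(names_match(gold, pred, case_sensitive=True))  # False: almost no shared characters
print(names_match(gold, pred))                       # True: only "l" vs "ł" differs
```

After case folding, the remaining one-character difference between "l" and "ł" is small enough to stay above the threshold of 85.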
Usage
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model and tokenizer, then attach the LoRA adapter.
base = AutoModelForCausalLM.from_pretrained("Qwen/Qwen3-8B", torch_dtype="bfloat16")
tok = AutoTokenizer.from_pretrained("Qwen/Qwen3-8B")
model = PeftModel.from_pretrained(
    base,
    "cometadata/affiliation-parsing-lora-Qwen3-8B-distil-GLM_4.5_Air-dapo-latex",
)

SYSTEM = (
    "You are an expert at reading academic articles and parsing information "
    "about their affiliations. The user will show you an academic article and "
    "your job is to extract the authors and their affiliations in a structured "
    "format (a JSON array of {name, affiliations}). Respond after </think>."
)
messages = [
    {"role": "system", "content": SYSTEM},
    {"role": "user", "content": "<the paper text>"},
]

# Apply the chat template and move the input IDs to the model's device.
inputs = tok.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt"
).to(model.device)
out = model.generate(inputs, max_new_tokens=2048, temperature=0.7, do_sample=True)
# Decode only the newly generated tokens, skipping the prompt.
print(tok.decode(out[0, inputs.shape[1]:], skip_special_tokens=True))
The model emits <think>…</think> reasoning followed by a JSON array
[{"name": ..., "affiliations": [...]}, ...].
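Downstream code usually needs only the JSON part of the output. The helper below is a minimal sketch for separating it from the reasoning; it assumes the JSON array appears as bare text after the final `</think>` (no surrounding code fence), and both the function name `parse_prediction` and its validity check are illustrative guesses rather than the project's own parser.

```python
import json
from typing import Optional


def parse_prediction(generated: str) -> Optional[list]:
    """Return the list of {name, affiliations} records, or None if invalid."""
    # Keep only the text after the last </think>; if the tag is absent,
    # rsplit returns the whole string unchanged.
    answer = generated.rsplit("</think>", 1)[-1].strip()
    try:
        data = json.loads(answer)
    except json.JSONDecodeError:
        return None
    if not isinstance(data, list):
        return None
    for item in data:
        valid = (
            isinstance(item, dict)
            and isinstance(item.get("name"), str)
            and isinstance(item.get("affiliations"), list)
        )
        if not valid:
            return None
    return data


sample = (
    "<think>One author, one affiliation.</think>\n"
    '[{"name": "Ada Lovelace", "affiliations": ["University of London"]}]'
)
records = parse_prediction(sample)
print(records[0]["name"])  # Ada Lovelace
```

Returning `None` for anything that fails to parse keeps malformed generations easy to count separately from valid ones.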
Training & evaluation code
github.com/cometadata/affiliation-parsing-cl-latex (or the
project directory /scratch/m000152-pm05/affiliation-parsing-cl-latex/).
License
Apache-2.0 (matches the Qwen3-8B base).