What it does
A pointwise + pairwise slide examiner: detects semantic slide defects
(title/body mismatch, density, narrative order, missing section) and is
deliberately trained to abstain on pixel-level geometry (overflow / overlap /
alignment / font / color / margin) — those are handled by a symbolic linter, not
the VLM. Output is strict contract JSON (PageExamResult / DeckExamResult /
PairwiseResult).
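The split between the VLM examiner and the symbolic linter can be pictured as a small routing table. The sketch below is only an illustration of that division of labor: the identifier strings, the `route` function, and the component names are hypothetical and are not taken from the adapter's actual contract. It also does not model the single S-group check (S3) that the Training section routes to the linter.

```python
# Defect families named in this card. The snake_case identifiers are
# illustrative assumptions, not the real contract field values.
SEMANTIC_DEFECTS = {"title_body_mismatch", "density", "narrative_order", "missing_section"}
GEOMETRY_DEFECTS = {"overflow", "overlap", "alignment", "font", "color", "margin"}


def route(defect_type: str) -> str:
    """Return the component (hypothetical names) that should judge a defect type."""
    if defect_type in SEMANTIC_DEFECTS:
        return "vlm_examiner"      # semantic judgement from the slide image
    if defect_type in GEOMETRY_DEFECTS:
        return "symbolic_linter"   # pixel-level geometry; the VLM abstains
    raise ValueError(f"unknown defect type: {defect_type!r}")


routes = {d: route(d) for d in ("narrative_order", "overlap")}
```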
Headline results (in-domain held-out, balanced accuracy, modality A = image-only)
| S-group semantic | this adapter (8B) | zero-shot 8B | zero-shot 30B |
|---|---|---|---|
| balanced accuracy | 1.0 | 0.639 | 0.785 |
The finetuned 8B examiner surpasses the zero-shot 30B model on the S-group
while keeping a near-zero false-positive rate on geometry (it abstains rather
than hallucinating geometry from pixels). eval_loss trajectory: None.
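For readers unfamiliar with the metric, balanced accuracy is the unweighted mean of per-class recall, so a model cannot score well by always predicting the majority class. The following is a minimal standard-library sketch of that definition. The toy labels are invented for illustration and are not drawn from the held-out set, and the card does not say which implementation produced the numbers above.

```python
from collections import defaultdict


def balanced_accuracy(y_true, y_pred):
    """Unweighted mean of per-class recall over the classes present in y_true."""
    if not y_true or len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must be non-empty and equally long")
    total = defaultdict(int)    # number of examples per true class
    correct = defaultdict(int)  # number of correctly predicted examples per true class
    for truth, pred in zip(y_true, y_pred):
        total[truth] += 1
        if truth == pred:
            correct[truth] += 1
    recalls = [correct[c] / total[c] for c in total]
    return sum(recalls) / len(recalls)


# Toy example (made-up labels): recall is 2/3 on "defect" and 1/1 on "clean".
y_true = ["defect", "defect", "defect", "clean"]
y_pred = ["defect", "defect", "clean", "clean"]
score = balanced_accuracy(y_true, y_pred)
```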
Training
- Base: Qwen/Qwen3-VL-8B-Instruct; QLoRA 4-bit (bitsandbytes), LoRA rank 16, alpha 32, 2 epochs, cosine LR 1e-4.
- Data: ~5.3K synthetic slides (paired clean/defective), architecture-correct routing
(S-group pointwise; geometry restate-from-structure + abstain-under-image; G1/S6 pairwise; S3→linter).
- Framework: LLaMA-Factory, template qwen3_vl_nothink.
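To make "cosine LR 1e-4" concrete, here is a sketch of a plain cosine decay from the peak rate to zero. It assumes no warmup phase and a final rate of 0, neither of which the card states. The `total_steps` value is a hypothetical placeholder, since the actual step count depends on batch size and is not given.

```python
import math

PEAK_LR = 1e-4  # peak learning rate from the card


def cosine_lr(step: int, total_steps: int, peak_lr: float = PEAK_LR) -> float:
    """Cosine decay from peak_lr at step 0 to 0 at total_steps (no warmup assumed)."""
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    progress = min(max(step, 0), total_steps) / total_steps  # clamp to [0, 1]
    return 0.5 * peak_lr * (1.0 + math.cos(math.pi * progress))


total_steps = 1000  # hypothetical; not stated in the card
schedule = [cosine_lr(s, total_steps) for s in (0, 500, 1000)]
```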
Usage
```python
from peft import PeftModel
from transformers import AutoModelForImageTextToText, AutoProcessor

base = "Qwen/Qwen3-VL-8B-Instruct"
# Load the base model in bfloat16, then attach the LoRA adapter on top of it.
model = AutoModelForImageTextToText.from_pretrained(base, torch_dtype="bfloat16", device_map="auto")
model = PeftModel.from_pretrained(model, "michaelrhs/slide-examiner-8b-qlora")
# The processor (tokenizer + image preprocessing) comes from the base model.
proc = AutoProcessor.from_pretrained(base)
```
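Because the examiner is described as emitting strict contract JSON, downstream code will typically parse the decoded generation before using it. The helper below is a minimal sketch of that step using only the standard library. The function name `parse_exam_output` and the `"defects"` key in the sample string are hypothetical, since this card does not list the fields of PageExamResult, DeckExamResult, or PairwiseResult.

```python
import json


def parse_exam_output(text: str) -> dict:
    """Parse decoded model text as a single JSON object, or raise ValueError."""
    try:
        result = json.loads(text.strip())
    except json.JSONDecodeError as exc:
        raise ValueError(f"model output is not valid JSON: {exc}") from exc
    if not isinstance(result, dict):
        raise ValueError("expected a JSON object at the top level")
    return result


# Sample string with a made-up field name, standing in for a decoded generation.
sample = ' {"defects": []} \n'
parsed = parse_exam_output(sample)
```

Rejecting anything that is not a top-level object keeps malformed generations from propagating silently into later pipeline stages.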
Adapter files: adapter_config.json, adapter_model.safetensors.