TL;DR
- Base: Qwen/Qwen3-VL-8B-Instruct (QLoRA, then merged → BF16).
- Task: resume page image(s) → structured JSON (23 fields: identity, contact, skills,
experiences, educations, languages, certificates, projects, preferences).
- Why fine-tune: the 23-field schema and the project's formatting rules are baked into
the weights, so a one-line prompt replaces the ~280-line schema prompt the 32B base needed.
- Measured (full 51-sample held-out split, A100, BF16, greedy): 83.9% weighted score,
88.2% unweighted, 88.2% JSON-valid. See Evaluation for the honest caveats.
- Footprint: ~23 GB VRAM in BF16 at 16K context (vs. ~50 GB for the 32B it replaces).
Intended use
Extracting structured data from resume/CV documents rendered to images (PDF → PNG per
page). The model is tuned for a specific downstream schema (below) used by a recruiting/ATS
pipeline, including its enum vocabularies (PascalCase country names, a fixed list of
roles/technologies/industries). It is most useful when you want one model call to turn a
resume into a database-ready record.
It is not a general document-VQA model and should not be used to make automated decisions
about candidates — see Out-of-scope.
Input: one or more page images of a single resume, plus the short instruction the model
was trained with (see How to use).
Output: a single JSON object with 23 top-level fields. Scalars are null when absent;
list fields default to []; address defaults to {country_name, region_name}.
| Field | Type | Notes |
|---|---|---|
| first_name, last_name | string | |
| email, phone | string | |
| date_of_birth | string | YYYY-MM-DD |
| desired_position |
Dates are normalized to YYYY-MM-DD (year-only ranges expand to Jan 1 / Dec 31; ongoing
roles set date_to: null). Classification fields (desired_position, project role /
used_technologies / industries, and all country_name fields) are mapped to predefined
option lists, falling back to "Other" when nothing matches.
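The two normalization rules above can be sketched in a few lines. This is an illustration of the stated behavior, not the project's own code: the function names and the `EXAMPLE_POSITIONS` list are hypothetical, and the real option lists are not shown in this card. Treating an absent value as null rather than "Other" is also an assumption.

```python
from typing import List, Optional, Tuple

# Hypothetical option list for illustration; the real vocabularies live in the project.
EXAMPLE_POSITIONS = ["Android Developer", "Backend Developer", "Data Scientist"]


def expand_year_range(
    year_from: Optional[str], year_to: Optional[str]
) -> Tuple[Optional[str], Optional[str]]:
    """Expand year-only bounds: start -> Jan 1, end -> Dec 31; None stays None."""
    date_from = f"{year_from}-01-01" if year_from else None
    date_to = f"{year_to}-12-31" if year_to else None  # None marks an ongoing role
    return date_from, date_to


def map_to_option(value: Optional[str], options: List[str]) -> Optional[str]:
    """Case-insensitive match against the option list, else "Other".

    Assumption: an absent value stays None instead of becoming "Other".
    """
    if value is None:
        return None
    lookup = {opt.lower(): opt for opt in options}
    return lookup.get(value.strip().lower(), "Other")
```

A real mapper would likely need fuzzier matching than an exact case-insensitive lookup; the sketch only shows the fallback behavior.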
Real (anonymized) output example:
{
  "first_name": "Jane",
  "last_name": "Doe",
  "date_of_birth": null,
  "email": "jane@example.com",
  "phone": "+1-555-0100",
  "desired_position": "Android Developer",
  "about": null,
  "job_experience": null,
  "job_expectations": null,
  "min_salary": null,
  "max_salary": null,
  "ready_to_relocation": false,
  "work_modes": [],
  "employment_types": [],
  "employment_durations": [],
  "hobbies": null,
  "address": { "country_name": "Uzbekistan", "region_name": "Tashkent" },
  "skills": [
    { "skill_name": "Android Development", "level": null },
    { "skill_name": "Kotlin", "level": null },
    { "skill_name": "Firebase", "level": null }
  ],
  "experiences": [
    {
      "company_name": "Android Development Course",
      "job": "Student / Trainee (Android Development)",
      "date_from": "2021-01-01",
      "date_to": null,
      "description": "Android development course focused on Java/Kotlin/Android.",
      "country_name": null
    }
  ],
  "languages": [
    { "language_name": "Uzbek", "level": 6 },
    { "language_name": "English", "level": 2 },
    { "language_name": "Russian", "level": 0 }
  ],
  "educations": [
    {
      "name": "Tashkent University of Information Technologies",
      "degree": "Bachelor",
      "location": "Tashkent",
      "programme": "E-Commerce",
      "date_from": null,
      "date_to": "2019-01-01",
      "country_name": "Uzbekistan"
    }
  ],
  "certificates": [],
  "projects": [
    {
      "title": "Wallpaper App",
      "summary": "Wallpaper app based on MVVM, Coin, Flow, Retrofit.",
      "used_technologies": ["Kotlin", "Other"],
      "role": "Mobile Developer(IOS/Android)",
      "industries": ["Other"]
    }
  ]
}
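Downstream code may want to guarantee the 23-field shape before writing to a database. The sketch below fills in the defaults described earlier; the field names are taken from the example above, while the scalar/list partition is inferred from that example and the null values inside a default address are an assumption. `apply_defaults` is a hypothetical helper, not part of the project.

```python
from typing import Any, Dict

# Field names as they appear in the example record above (13 scalars, 9 lists, plus address).
SCALAR_FIELDS = [
    "first_name", "last_name", "date_of_birth", "email", "phone",
    "desired_position", "about", "job_experience", "job_expectations",
    "min_salary", "max_salary", "ready_to_relocation", "hobbies",
]
LIST_FIELDS = [
    "work_modes", "employment_types", "employment_durations", "skills",
    "experiences", "languages", "educations", "certificates", "projects",
]


def apply_defaults(record: Dict[str, Any]) -> Dict[str, Any]:
    """Return a record with all 23 top-level fields present."""
    out: Dict[str, Any] = {name: record.get(name) for name in SCALAR_FIELDS}
    for name in LIST_FIELDS:
        value = record.get(name)
        out[name] = value if isinstance(value, list) else []
    address = record.get("address")
    if not isinstance(address, dict):
        address = {}
    # Assumption: a missing address becomes an object with null values.
    out["address"] = {
        "country_name": address.get("country_name"),
        "region_name": address.get("region_name"),
    }
    return out
```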
Training data
- 513 human-verified resume samples (private internal dataset). Each sample is a PDF
rendered to one or more page PNGs plus a verified ground-truth JSON record.
- Split: 462 train / 51 held-out eval (90/10, fixed seed 42). Samples whose estimated
token length exceeded ~15.2K (about 1K below the 16,384 context budget) were dropped from
training, so the effective training count is ≤462.
- Page distribution: 276 single-page, 136 two-page, 101 three-or-more-page (up to 8).
- Language: predominantly English; some records contain non-English values (e.g.
Russian/Uzbek company or language names).
The dataset is not released. Code to rebuild splits and bundles is in the repo
(src/data_prep.py, src/export_eval_bundle.py).
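For readers without access to the repo, a seeded 90/10 split of this kind can be sketched as follows. This is not the contents of src/data_prep.py; `split_samples` is a hypothetical helper, and both the shuffle method and the rounding (floor of 10%) are assumptions chosen so that 513 samples yield 462/51. It will not reproduce the project's actual sample assignment.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

T = TypeVar("T")


def split_samples(
    samples: Sequence[T], eval_fraction: float = 0.10, seed: int = 42
) -> Tuple[List[T], List[T]]:
    """Shuffle with a fixed seed, then hold out the first `eval_fraction` of items."""
    shuffled = list(samples)
    random.Random(seed).shuffle(shuffled)  # local RNG, so global random state is untouched
    n_eval = int(len(shuffled) * eval_fraction)  # floor: 513 samples -> 51 held out
    return shuffled[n_eval:], shuffled[:n_eval]
```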
Training procedure
QLoRA via Unsloth (FastVisionModel) + TRL SFTTrainer. The 4-bit base
(unsloth/Qwen3-VL-8B-Instruct-unsloth-bnb-4bit, nf4) was adapted with LoRA on both the
vision and language towers (attention + MLP modules), then the adapter was merged back into
the full model and published.
Each training example is a single user turn — the page images followed by the combined
system+user instruction — with the ground-truth JSON as the assistant target. There is no
separate system role; this is why inference uses the same short prompt.
Dtype note: the merge used Unsloth's merged_16bit, and the original upload was
labeled "float16", but the published config.json and stored tensors are bfloat16.
Treat this model as BF16.
Hyperparameters
| Hyperparameter | Value |
|---|---|
| Method | QLoRA (4-bit nf4 base + LoRA, merged after training) |
| LoRA rank / alpha / dropout | 16 / 16 / 0 |
| Target modules | vision + language layers, attention + MLP (bias="none", no rslora) |
| Learning rate | 2e-4 |
| LR scheduler / warmup | cosine / 10 steps |
| Optimizer | adamw_8bit |
| Weight decay | 0.01 |
| Per-device batch / grad-accum |
Training time and final loss were not captured from the run.
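As a rough guide, the rows of the hyperparameter table above map onto keyword arguments like the ones below. The values are copied from the table; the mapping onto Unsloth and TRL argument names is a reading of the table, not a copy of the training script, and rows that are missing from the table are deliberately left out.

```python
# LoRA settings, in the keyword form used by Unsloth's FastVisionModel.get_peft_model.
LORA_KWARGS = {
    "finetune_vision_layers": True,      # adapt the vision tower
    "finetune_language_layers": True,    # adapt the language tower
    "finetune_attention_modules": True,
    "finetune_mlp_modules": True,
    "r": 16,
    "lora_alpha": 16,
    "lora_dropout": 0,
    "bias": "none",
    "use_rslora": False,
}

# Optimizer settings, in the keyword form accepted by TRL's SFTConfig.
TRAINER_KWARGS = {
    "learning_rate": 2e-4,
    "lr_scheduler_type": "cosine",
    "warmup_steps": 10,
    "optim": "adamw_8bit",
    "weight_decay": 0.01,
}

# Usage (requires unsloth, trl, and a GPU; not executed here):
#   model = FastVisionModel.get_peft_model(model, **LORA_KWARGS)
#   config = SFTConfig(**TRAINER_KWARGS)
```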
Evaluation
Measured on 2026-06-05 with notebooks/eval_finetuned.ipynb against the held-out split,
using the project's field-weighted scorer (src/evaluation.py). Setup: the published BF16
weights on a single A100, greedy decoding (do_sample=False, max_new_tokens=4096), on
the full 51-sample held-out split.
| Metric | Result |
|---|---|
| Overall weighted score | 83.9% |
| Overall unweighted score | 88.2% |
| JSON validity | 88.2% (45/51 parsed; 6 failures) |
| Avg. inference | ~92.0 s/resume |
| Peak VRAM | 23.4 GB |
Per-field accuracy (worst → best):
| Field | Acc | Field | Acc |
|---|---|---|---|
| skills | 67.5% | ready_to_relocation | 88.2% |
| phone | 74.5% | certificates | 90.8% |
| desired_position | 79.2% | projects | 91.0% |
| address | 81.2% | job_expectations | 92.7% |
| experiences |
Read these numbers with the following caveats:
- Full held-out split, single run. These are all 51 held-out samples with greedy
decoding — a real measurement, but one run on a modest test set, not a large benchmark.
- Partial-credit metric. The scorer uses fuzzy string ratios, date/numeric tolerances,
and greedy best-match over object arrays, with fields weighted by importance (work
experience is weighted highest). It is not strict exact-match and is not comparable
to other parsers' published numbers; it is an internal quality signal. The weighted score
(83.9%) is below the unweighted (88.2%) because the highest-weighted fields (experiences,
skills, identity/contact) are also the hardest ones.
- The top-scoring fields are mostly "correctly empty." min_salary/max_salary (100%)
and date_of_birth, work_modes, employment_types, employment_durations (~98%) are
almost always absent in this data, so high scores largely reflect correctly returning empty
values, not hard extraction.
- 6/51 invalid JSON (~12%). Most likely 4096-token truncation on long multi-page resumes;
downstream code must handle un-parseable output (retry, repair, or shorter prompts).
For context, the model-selection benchmark that led to Qwen3-VL-8B (base models, ~10 samples,
not reproducible from committed outputs) is noted in the repo's SESSION_LOG.md; it is not a
fine-tuned result and is excluded here.
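To make the partial-credit caveat concrete, here is a minimal sketch of that style of scoring: a fuzzy string ratio, greedy best-match over an array of objects, and a weighted average across fields. It is not src/evaluation.py; the function names, the use of `difflib`, and the weights in the usage note are assumptions for illustration only.

```python
from difflib import SequenceMatcher
from typing import Dict, List, Optional


def fuzzy_ratio(pred: Optional[str], truth: Optional[str]) -> float:
    """Similarity in [0, 1]; two empty values count as a perfect match."""
    if not pred and not truth:
        return 1.0
    if not pred or not truth:
        return 0.0
    return SequenceMatcher(None, pred.lower(), truth.lower()).ratio()


def greedy_match_score(
    pred: List[Dict[str, str]], truth: List[Dict[str, str]], field: str
) -> float:
    """Pair each truth item with its best remaining prediction, compared on `field`."""
    if not pred and not truth:
        return 1.0
    remaining = list(pred)
    total = 0.0
    for item in truth:
        if not remaining:
            break
        best = max(remaining, key=lambda p: fuzzy_ratio(p.get(field), item.get(field)))
        total += fuzzy_ratio(best.get(field), item.get(field))
        remaining.remove(best)
    # Dividing by the longer list penalizes both missed and spurious items.
    return total / max(len(pred), len(truth))


def weighted_score(field_scores: Dict[str, float], weights: Dict[str, float]) -> float:
    """Weighted mean of per-field scores."""
    total_weight = sum(weights[name] for name in field_scores)
    return sum(field_scores[name] * weights[name] for name in field_scores) / total_weight
```

With scores of 1.0 and 0.5 on two fields, the unweighted mean is 0.75; if the weaker field carries three times the weight, `weighted_score` gives 0.625. That is the same effect that pulls the weighted score below the unweighted one when heavily weighted fields are the hardest.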
How to use
Requires a recent transformers (≥4.57 for Qwen3-VL; latest recommended). The published
processor carries the correct chat template, so the modern image-in-messages path works
without extra utilities.
import json

from transformers import AutoProcessor, Qwen3VLForConditionalGeneration

model_id = "sukhrobnurali/qwen3vl-resume-parser"

# Load the merged weights; dtype="auto" follows the dtype recorded in config.json (bfloat16).
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_id, dtype="auto", device_map="auto", attn_implementation="sdpa",
)
processor = AutoProcessor.from_pretrained(model_id)

SYSTEM_PROMPT = "You are a resume parser. Extract information from resume images into structured JSON."
USER_PROMPT = "Parse this resume and return the structured JSON."

# One image per resume page, in reading order.
pages = ["resume_page_1.png", "resume_page_2.png"]

# Single user turn, as in training: there is no separate system role.
messages = [{
    "role": "user",
    "content": (
        [{"type": "text", "text": SYSTEM_PROMPT}]
        + [{"type": "image", "url": p} for p in pages]
        + [{"type": "text", "text": USER_PROMPT}]
    ),
}]

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)
inputs.pop("token_type_ids", None)  # drop this key if the processor returned it

# Greedy decoding, then strip the prompt tokens so only the completion is decoded.
generated = model.generate(**inputs, max_new_tokens=4096, do_sample=False)
trimmed = generated[:, inputs["input_ids"].shape[1]:]
text = processor.batch_decode(trimmed, skip_special_tokens=True)[0]

resume = json.loads(text)  # raises json.JSONDecodeError on truncated or malformed output
print(json.dumps(resume, indent=2, ensure_ascii=False))
Use greedy decoding (do_sample=False) for stable structured output. For long multi-page
resumes, raise max_new_tokens if you see truncated JSON.
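Because a share of outputs do not parse, a bare `json.loads` call is fragile in production. A minimal defensive wrapper, assuming the caller decides what to do on failure (retry with a larger `max_new_tokens`, repair, or flag for review), could look like this. `parse_model_json` is a hypothetical helper, not part of the project.

```python
import json
from typing import Any, Dict, Optional


def parse_model_json(text: str) -> Optional[Dict[str, Any]]:
    """Return the parsed object, or None when the output cannot be recovered."""
    # Keep only the outermost {...} span, which also drops any stray wrapper text.
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        parsed = json.loads(text[start : end + 1])
    except json.JSONDecodeError:
        return None  # e.g. output cut off at the token limit
    return parsed if isinstance(parsed, dict) else None
```

A `None` result is the signal to fall back; the sketch does not attempt to repair truncated JSON.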
vLLM serving (the original deployment target):
vllm serve sukhrobnurali/qwen3vl-resume-parser \
--dtype bfloat16 --max-model-len 16384 --trust-remote-code
When calling through the OpenAI-compatible API, pass
extra_body={"chat_template_kwargs": {"enable_thinking": false}} to keep the model in
non-thinking (direct-JSON) mode.
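A request to the server can be sketched with the standard library alone. With the OpenAI Python client, `extra_body` fields are merged into the JSON body, so here `chat_template_kwargs` is placed at the top level of the payload. The base URL (vLLM's default port), the `build_payload` and `send` helpers, and sending pages as base64 data URLs are assumptions for illustration.

```python
import base64
import json
import urllib.request
from typing import Any, Dict, List

SYSTEM_PROMPT = "You are a resume parser. Extract information from resume images into structured JSON."
USER_PROMPT = "Parse this resume and return the structured JSON."


def build_payload(
    page_pngs: List[bytes], model: str = "sukhrobnurali/qwen3vl-resume-parser"
) -> Dict[str, Any]:
    """Build a chat-completions body with the same single-turn layout as above."""
    images = [
        {
            "type": "image_url",
            "image_url": {"url": "data:image/png;base64," + base64.b64encode(png).decode("ascii")},
        }
        for png in page_pngs
    ]
    content = [{"type": "text", "text": SYSTEM_PROMPT}] + images + [{"type": "text", "text": USER_PROMPT}]
    return {
        "model": model,
        "messages": [{"role": "user", "content": content}],
        "temperature": 0,  # greedy decoding
        "max_tokens": 4096,
        "chat_template_kwargs": {"enable_thinking": False},
    }


def send(payload: Dict[str, Any], base_url: str = "http://localhost:8000/v1") -> str:
    """POST the payload to a running server (assumed URL) and return the raw model text."""
    request = urllib.request.Request(
        base_url + "/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

Usage would be along the lines of `send(build_payload([open(p, "rb").read() for p in pages]))`, with the returned text then parsed as JSON.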
Limitations
- Domain skew. Training resumes skew toward IT/software roles, and the enum vocabularies
(roles, technologies, industries) are IT-centric. Expect degradation on non-technical
resumes, unusual layouts, scans/photos, or handwriting.
- Language. English-dominant; non-English resumes are under-represented.
- Schema lock-in. The model is tuned to one specific 23-field schema and its enum lists.
It will coerce values toward those vocabularies (including "Other"), which may not match
a different downstream schema.
- Invalid JSON happens (~12% on the held-out split). Always parse defensively.
- Latency. ~90 s/resume on an A100 at 16K context — batch/offline, not real-time.
- Quantization. BF16 peaks at ~23 GB VRAM; it runs in 4-bit on a 16 GB GPU, but accuracy
was only measured in BF16.
Out-of-scope and responsible use
- No automated candidate decisions. Resume parsing for screening/ranking carries fairness
and bias risk. Keep a human in the loop; do not use this model to make or materially
influence hiring decisions without review.
- Not a general VQA / OCR model. It is specialized for this resume schema.
- PII. Resumes contain personal data. Handle outputs under the applicable privacy law
(e.g. GDPR) — secure storage, access control, retention limits, and a lawful basis for
processing.
- Verify before trusting. Outputs are model predictions, not ground truth; validate
critical fields (contact info, dates) downstream.
License
Released under Apache-2.0, inherited from the Qwen/Qwen3-VL-8B-Instruct base model.
Citation
@misc{nurali2026qwen3vlresumeparser,
title = {qwen3vl-resume-parser: a Qwen3-VL-8B fine-tune for resume-to-JSON extraction},
author = {Nurali, Sukhrob},
year = {2026},
howpublished = {\url{https://huggingface.co/sukhrobnurali/qwen3vl-resume-parser}}
}
Built on Qwen3-VL by the Qwen team; see the Qwen3-VL model card and Unsloth for the
training stack.
Author
Sukhrob Nurali — sukhrobnurali@gmail.com
Hugging Face: @sukhrobnurali ·
GitHub: @sukhrobnurali