License: apache-2.0
TL;DR
- Base: Qwen/Qwen3-VL-8B-Instruct (QLoRA, then merged → BF16).
- Task: resume page image(s) → structured JSON (23 fields: identity, contact, skills, experiences, educations, languages, certificates, projects, preferences).
- Why fine-tune: the 23-field schema and the project's formatting rules are baked into the weights, so a one-line prompt replaces the ~280-line schema prompt the 32B base needed.
- Measured (full 51-sample held-out split, A100, BF16, greedy): 83.9% weighted score, 88.2% unweighted, 88.2% JSON-valid. See Evaluation for the honest caveats.
- Footprint: ~23 GB VRAM in BF16 at 16K context (vs. ~50 GB for the 32B it replaces).
Intended use
Extracting structured data from resume/CV documents rendered to images (PDF → PNG per page). The model is tuned for a specific downstream schema (below) used by a recruiting/ATS pipeline, including its enum vocabularies (PascalCase country names, a fixed list of roles/technologies/industries). It is most useful when you want one model call to turn a resume into a database-ready record.
It is not a general document-VQA model and should not be used to make automated decisions about candidates — see Out-of-scope.
Input / output schema
Input: one or more page images of a single resume, plus the short instruction the model was trained with (see How to use).
Output: a single JSON object with 23 top-level fields. Scalars are null when absent;
list fields default to []; address defaults to {country_name, region_name}.
| Field | Type | Notes |
|---|---|---|
| first_name, last_name | string | |
| email, phone | string | |
| date_of_birth | string | YYYY-MM-DD |
| desired_position | string | mapped to a fixed role vocabulary |
| about | string | free-text summary |
| job_experience | number | total years |
| job_expectations, min_salary, max_salary | string / number | |
| ready_to_relocation | bool | |
| work_modes, employment_types, employment_durations | string[] | enum values |
| hobbies | string | |
| address | object | {country_name, region_name} |
| skills | object[] | {skill_name, level} |
| experiences | object[] | {company_name, job, date_from, date_to, description, country_name} |
| educations | object[] | {name, degree, location, programme, date_from, date_to, country_name} |
| languages | object[] | {language_name, level} (level is an int) |
| certificates | object[] | {certificate_name, certificate_programme, issuing_date, expiring_date} |
| projects | object[] | {title, summary, used_technologies[], role, industries[]} |
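The defaults described above can be written down as a small post-processing helper. This is a minimal sketch, not code from the project: `empty_record` and `fill_defaults` are hypothetical names, and it assumes an absent address keeps both keys with null values.

```python
import copy

# Field names taken from the schema table; 13 scalars + 9 lists + address = 23.
SCALAR_FIELDS = [
    "first_name", "last_name", "email", "phone", "date_of_birth",
    "desired_position", "about", "job_experience", "job_expectations",
    "min_salary", "max_salary", "ready_to_relocation", "hobbies",
]
LIST_FIELDS = [
    "work_modes", "employment_types", "employment_durations", "skills",
    "experiences", "educations", "languages", "certificates", "projects",
]
ADDRESS_KEYS = ("country_name", "region_name")


def empty_record() -> dict:
    """Build a record in which every field holds its documented default."""
    record = {name: None for name in SCALAR_FIELDS}
    record.update({name: [] for name in LIST_FIELDS})
    # Assumption: a missing address is represented with both keys set to null.
    record["address"] = {key: None for key in ADDRESS_KEYS}
    return record


def fill_defaults(parsed: dict) -> dict:
    """Overlay parsed values on the defaults; unknown keys are dropped."""
    record = empty_record()
    for key, value in parsed.items():
        if key not in record or value is None:
            continue  # keep the default for absent or null values
        if key == "address" and isinstance(value, dict):
            record["address"] = {k: value.get(k) for k in ADDRESS_KEYS}
        else:
            record[key] = copy.deepcopy(value)
    return record
```

Running model output through a helper like this guarantees downstream code always sees all 23 keys, even when the model omits one.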
Dates are normalized to YYYY-MM-DD (year-only ranges expand to Jan 1 / Dec 31; ongoing
roles set date_to: null). Classification fields (desired_position, project role /
used_technologies / industries, and all country_name fields) are mapped to predefined
option lists, falling back to "Other" when nothing matches.
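The model applies these rules itself, but the same conventions are useful when preparing ground truth or checking output. The sketch below illustrates them under stated assumptions: `normalize_date` and `map_to_option` are hypothetical helpers, the option list in the usage line is a placeholder rather than the project's real vocabulary, and matching is a simple case-insensitive comparison.

```python
import re


def normalize_date(value, is_end=False):
    """Return YYYY-MM-DD. A bare year expands to Jan 1 (start) or Dec 31 (end).

    None is passed through unchanged, which covers ongoing roles (date_to: null).
    """
    if value is None:
        return None
    text = str(value).strip()
    if re.fullmatch(r"\d{4}", text):
        return f"{text}-12-31" if is_end else f"{text}-01-01"
    if re.fullmatch(r"\d{4}-\d{2}-\d{2}", text):
        return text
    raise ValueError(f"unsupported date format: {text!r}")


def map_to_option(value, options):
    """Map a value onto a predefined option list, falling back to "Other"."""
    if value is None:
        return None  # assumption: absent values stay null rather than "Other"
    lookup = {option.casefold(): option for option in options}
    return lookup.get(str(value).strip().casefold(), "Other")


# Placeholder vocabulary for illustration only.
print(map_to_option("android developer", ["Android Developer", "Other"]))
```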
Real (anonymized) output example:
```json
{
  "first_name": "Jane",
  "last_name": "Doe",
  "date_of_birth": null,
  "email": "jane@example.com",
  "phone": "+1-555-0100",
  "desired_position": "Android Developer",
  "about": null,
  "job_experience": null,
  "job_expectations": null,
  "min_salary": null,
  "max_salary": null,
  "ready_to_relocation": false,
  "work_modes": [],
  "employment_types": [],
  "employment_durations": [],
  "hobbies": null,
  "address": { "country_name": "Uzbekistan", "region_name": "Tashkent" },
  "skills": [
    { "skill_name": "Android Development", "level": null },
    { "skill_name": "Kotlin", "level": null },
    { "skill_name": "Firebase", "level": null }
  ],
  "experiences": [
    {
      "company_name": "Android Development Course",
      "job": "Student / Trainee (Android Development)",
      "date_from": "2021-01-01",
      "date_to": null,
      "description": "Android development course focused on Java/Kotlin/Android.",
      "country_name": null
    }
  ],
  "languages": [
    { "language_name": "Uzbek", "level": 6 },
    { "language_name": "English", "level": 2 },
    { "language_name": "Russian", "level": 0 }
  ],
  "educations": [
    {
      "name": "Tashkent University of Information Technologies",
      "degree": "Bachelor",
      "location": "Tashkent",
      "programme": "E-Commerce",
      "date_from": null,
      "date_to": "2019-01-01",
      "country_name": "Uzbekistan"
    }
  ],
  "certificates": [],
  "projects": [
    {
      "title": "Wallpaper App",
      "summary": "Wallpaper app based on MVVM, Coin, Flow, Retrofit.",
      "used_technologies": ["Kotlin", "Other"],
      "role": "Mobile Developer(IOS/Android)",
      "industries": ["Other"]
    }
  ]
}
```
Training data
- 513 human-verified resume samples (private internal dataset). Each sample is a PDF rendered to one or more page PNGs plus a verified ground-truth JSON record.
- Split: 462 train / 51 held-out eval, 90/10, fixed seed 42. Samples whose estimated token length exceeded ~15.2K (1K below the 16,384 context budget) were dropped from training, so the effective training count is ≤462.
- Page distribution: 276 single-page, 136 two-page, 101 three-or-more-page (up to 8).
- Language: predominantly English; some records contain non-English values (e.g. Russian/Uzbek company or language names).
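For readers who want to reproduce a split of the same shape on their own data, a seeded 90/10 split can be sketched as follows. This is an illustration only: `split_samples` is a hypothetical helper, and the actual logic in src/data_prep.py may shuffle or round differently, so it will not necessarily reproduce the same sample assignment.

```python
import random


def split_samples(sample_ids, eval_fraction=0.10, seed=42):
    """Deterministically split ids into (train, eval) lists."""
    ids = sorted(sample_ids)  # sort first so the result ignores input order
    random.Random(seed).shuffle(ids)
    n_eval = int(len(ids) * eval_fraction)  # 513 samples -> 51 held out
    return ids[n_eval:], ids[:n_eval]


train_ids, eval_ids = split_samples(range(513))
print(len(train_ids), len(eval_ids))  # 462 51
```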
The dataset is not released. Code to rebuild splits and bundles is in the repo
(src/data_prep.py, src/export_eval_bundle.py).
Training procedure
QLoRA via Unsloth (FastVisionModel) + TRL SFTTrainer. The 4-bit base
(unsloth/Qwen3-VL-8B-Instruct-unsloth-bnb-4bit, nf4) was adapted with LoRA on both the
vision and language towers (attention + MLP modules), then the adapter was merged back into
the full model and published.
Each training example is a single user turn — the page images followed by the combined system+user instruction — with the ground-truth JSON as the assistant target. There is no separate system role; this is why inference uses the same short prompt.
Dtype note: the merge used Unsloth's merged_16bit, and the original upload was labeled "float16", but the published config.json and stored tensors are bfloat16. Treat this model as BF16.
Hyperparameters
| Hyperparameter | Value |
|---|---|
| Method | QLoRA (4-bit nf4 base + LoRA, merged after training) |
| LoRA rank / alpha / dropout | 16 / 16 / 0 |
| Target modules | vision + language layers, attention + MLP (bias="none", no rslora) |
| Learning rate | 2e-4 |
| LR scheduler / warmup | cosine / 10 steps |
| Optimizer | adamw_8bit |
| Weight decay | 0.01 |
| Per-device batch / grad-accum | 1 / 4 (effective batch 4) |
| Epochs | 1 |
| Max sequence length | 16,384 |
| Precision | bf16 (fp16 fallback if unsupported) |
| Seed | 3407 |
| Hardware | Google Colab L4 (24 GB) |
Training time and final loss were not captured from the run.
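Although the run's timing was not recorded, the table does bound the number of optimizer updates. The arithmetic below is a rough sketch that assumes a single GPU and the full 462 training samples; the real count is lower if long samples were dropped, and the trainer may round a trailing partial batch differently.

```python
import math

train_samples = 462   # upper bound; over-length samples were dropped
per_device_batch = 1
grad_accum = 4
epochs = 1
warmup_steps = 10

effective_batch = per_device_batch * grad_accum
max_optimizer_steps = math.ceil(train_samples / effective_batch) * epochs
warmup_share = warmup_steps / max_optimizer_steps

print(effective_batch, max_optimizer_steps, round(warmup_share, 3))  # 4 116 0.086
```

So the 10 warmup steps cover a little under a tenth of a run of at most 116 updates.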
Evaluation
Measured on 2026-06-05 with notebooks/eval_finetuned.ipynb against the held-out split,
using the project's field-weighted scorer (src/evaluation.py). Setup: the published BF16
weights on a single A100, greedy decoding (do_sample=False, max_new_tokens=4096), on
the full 51-sample held-out split.
| Metric | Result |
|---|---|
| Overall weighted score | 83.9% |
| Overall unweighted score | 88.2% |
| JSON validity | 88.2% (45/51 parsed; 6 failures) |
| Avg. inference | ~92.0 s/resume |
| Peak VRAM | 23.4 GB |
Per-field accuracy (worst → best):
| Field | Acc | Field | Acc |
|---|---|---|---|
| skills | 67.5% | ready_to_relocation | 88.2% |
| phone | 74.5% | certificates | 90.8% |
| desired_position | 79.2% | projects | 91.0% |
| address | 81.2% | job_expectations | 92.7% |
| experiences | 81.7% | hobbies | 96.1% |
| first_name | 82.3% | date_of_birth | 98.0% |
| last_name | 82.3% | work_modes | 98.0% |
| email | 84.3% | employment_types | 98.0% |
| job_experience | 84.3% | employment_durations | 98.0% |
| educations | 84.5% | min_salary | 100.0% |
| languages | 87.2% | max_salary | 100.0% |
| about | 88.2% | | |
Read these numbers with the following caveats:
- Full held-out split, single run. These are all 51 held-out samples with greedy decoding — a real measurement, but one run on a modest test set, not a large benchmark.
- Partial-credit metric. The scorer uses fuzzy string ratios, date/numeric tolerances, and greedy best-match over object arrays, with fields weighted by importance (work experience is weighted highest). It is not strict exact-match and is not comparable to other parsers' published numbers; it is an internal quality signal. The weighted score (83.9%) is below the unweighted (88.2%) because the highest-weighted fields (experiences, skills, identity/contact) are also the hardest ones.
- The top-scoring fields are mostly "correctly empty." min_salary/max_salary (100%) and date_of_birth, work_modes, employment_types, employment_durations (~98%) are almost always absent in this data, so high scores largely reflect correctly returning empty, not hard extraction.
- 6/51 invalid JSON (~12%). Most likely 4096-token truncation on long multi-page resumes; downstream code must handle un-parseable output (retry, repair, or shorter prompts).
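To make the partial-credit caveat concrete, here is a minimal sketch of a scorer in the same spirit: fuzzy string ratios, greedy best-match over object arrays, and a weighted average. It is not the code in src/evaluation.py. All function names are hypothetical, the weights in the usage lines are invented for illustration, and date/numeric tolerances are left out.

```python
from difflib import SequenceMatcher


def string_score(pred, truth):
    """Fuzzy ratio in [0, 1]; two empty values count as a correct match."""
    if pred is None and truth is None:
        return 1.0
    if pred is None or truth is None:
        return 0.0
    return SequenceMatcher(None, str(pred).lower(), str(truth).lower()).ratio()


def object_score(pred, truth, keys):
    """Average the per-key fuzzy scores of two objects."""
    return sum(string_score(pred.get(k), truth.get(k)) for k in keys) / len(keys)


def array_score(preds, truths, keys):
    """Greedily pair each truth item with its best remaining prediction."""
    if not preds and not truths:
        return 1.0  # "correctly empty" earns full credit
    remaining = list(preds)
    total = 0.0
    for truth in truths:
        if not remaining:
            break
        best = max(remaining, key=lambda p: object_score(p, truth, keys))
        total += object_score(best, truth, keys)
        remaining.remove(best)
    # Dividing by the longer list penalizes both missing and extra items.
    return total / max(len(preds), len(truths))


def weighted_score(field_scores, weights):
    """Weighted mean of per-field scores."""
    total_weight = sum(weights[name] for name in field_scores)
    return sum(field_scores[n] * weights[n] for n in field_scores) / total_weight


# Invented weights: a hard, heavily weighted field pulls the weighted mean down.
scores = {"experiences": 0.8, "min_salary": 1.0}
print(weighted_score(scores, {"experiences": 3.0, "min_salary": 1.0}))
print(sum(scores.values()) / len(scores))
```

With these made-up numbers the weighted mean lands below the plain mean, the same direction as the gap between 83.9% and 88.2% reported above.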
For context, the model-selection benchmark that led to Qwen3-VL-8B (base models, ~10 samples,
not reproducible from committed outputs) is noted in the repo's SESSION_LOG.md; it is not a
fine-tuned result and is excluded here.
How to use
Requires a recent transformers (≥4.57 for Qwen3-VL; latest recommended). The published
processor carries the correct chat template, so the modern image-in-messages path works
without extra utilities.
```python
# pip install -U transformers accelerate
import json

from transformers import AutoProcessor, Qwen3VLForConditionalGeneration

model_id = "sukhrobnurali/qwen3vl-resume-parser"
model = Qwen3VLForConditionalGeneration.from_pretrained(
    model_id,
    dtype="auto",
    device_map="auto",
    attn_implementation="sdpa",
)
processor = AutoProcessor.from_pretrained(model_id)

# The 23-field schema is baked into the weights, so the short training prompt is all it needs.
SYSTEM_PROMPT = "You are a resume parser. Extract information from resume images into structured JSON."
USER_PROMPT = "Parse this resume and return the structured JSON."

# One entry per page, top to bottom. "url" accepts a local file path or an http(s) URL.
pages = ["resume_page_1.png", "resume_page_2.png"]
messages = [
    {
        "role": "user",
        "content": (
            [{"type": "text", "text": SYSTEM_PROMPT}]
            + [{"type": "image", "url": p} for p in pages]
            + [{"type": "text", "text": USER_PROMPT}]
        ),
    }
]

inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)
inputs.pop("token_type_ids", None)

generated = model.generate(**inputs, max_new_tokens=4096, do_sample=False)
# Keep only the newly generated tokens, dropping the echoed prompt.
trimmed = generated[:, inputs["input_ids"].shape[1]:]
text = processor.batch_decode(trimmed, skip_special_tokens=True)[0]

resume = json.loads(text)  # the 23-field record
print(json.dumps(resume, indent=2, ensure_ascii=False))
```
Use greedy decoding (do_sample=False) for stable structured output. For long multi-page
resumes, raise max_new_tokens if you see truncated JSON.
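Because roughly one in eight held-out outputs failed to parse, a bare `json.loads` call is fragile in production. The sketch below shows one defensive pattern. `parse_model_json` and `parse_with_retry` are hypothetical helpers, `generate_fn` stands for any callable you supply that runs the model with a given `max_new_tokens` and returns the decoded text, and the 8192 retry budget is an arbitrary example that must still fit your context limit.

```python
import json

FENCE = "`" * 3  # Markdown code-fence marker


def parse_model_json(text):
    """Return (record, error); error is None on success."""
    cleaned = text.strip()
    # Tolerate a Markdown fence around the JSON, in case the model emits one.
    if cleaned.startswith(FENCE):
        cleaned = cleaned.strip("`").strip()
        if cleaned.lower().startswith("json"):
            cleaned = cleaned[4:].strip()
    try:
        record = json.loads(cleaned)
    except json.JSONDecodeError as exc:
        # An unterminated object is the typical sign of hitting max_new_tokens.
        if not cleaned.endswith("}"):
            return None, "truncated"
        return None, f"invalid: {exc.msg}"
    if not isinstance(record, dict):
        return None, "not an object"
    return record, None


def parse_with_retry(generate_fn, budgets=(4096, 8192)):
    """Retry with a larger token budget only when the output looks truncated."""
    error = "no attempts"
    for max_new_tokens in budgets:
        record, error = parse_model_json(generate_fn(max_new_tokens))
        if record is not None:
            return record
        if error != "truncated":
            break  # a larger budget will not fix malformed output
    raise ValueError(f"could not parse model output: {error}")
```

With greedy decoding, repeating a call with identical settings yields the same text, so the retry changes the token budget rather than simply calling again.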
vLLM serving (the original deployment target):
```bash
vllm serve sukhrobnurali/qwen3vl-resume-parser \
  --dtype bfloat16 --max-model-len 16384 --trust-remote-code
```
When calling through the OpenAI-compatible API, pass
extra_body={"chat_template_kwargs": {"enable_thinking": false}} to keep the model in
non-thinking (direct-JSON) mode.
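As a rough illustration of such a request, the sketch below builds the JSON body by hand and posts it with the standard library. When you use the OpenAI Python client instead, the `extra_body` contents are merged into the top level of the request, which is why `chat_template_kwargs` sits there in this example. The `build_payload` and `parse_resume` names, the `http://localhost:8000/v1` base URL, and the choice to send pages as base64 data URLs are assumptions for this example.

```python
import base64
import json
import urllib.request

MODEL_ID = "sukhrobnurali/qwen3vl-resume-parser"
SYSTEM_PROMPT = "You are a resume parser. Extract information from resume images into structured JSON."
USER_PROMPT = "Parse this resume and return the structured JSON."


def build_payload(page_pngs: list[bytes]) -> dict:
    """Build a chat-completions body with one image part per resume page."""
    content = [{"type": "text", "text": SYSTEM_PROMPT}]
    for png in page_pngs:
        b64 = base64.b64encode(png).decode("ascii")
        content.append(
            {"type": "image_url", "image_url": {"url": f"data:image/png;base64,{b64}"}}
        )
    content.append({"type": "text", "text": USER_PROMPT})
    return {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": content}],
        "temperature": 0,  # greedy decoding for stable structured output
        "max_tokens": 4096,
        "chat_template_kwargs": {"enable_thinking": False},
    }


def parse_resume(page_pngs: list[bytes], base_url: str = "http://localhost:8000/v1") -> str:
    """POST the request to a running server and return the raw model text."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_payload(page_pngs)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

Keeping payload construction separate from the HTTP call makes the request shape easy to inspect or unit-test without a running server.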
Limitations
- Domain skew. Training resumes skew toward IT/software roles, and the enum vocabularies (roles, technologies, industries) are IT-centric. Expect degradation on non-technical resumes, unusual layouts, scans/photos, or handwriting.
- Language. English-dominant; non-English resumes are under-represented.
- Schema lock-in. The model is tuned to one specific 23-field schema and its enum lists. It will coerce values toward those vocabularies (including "Other"), which may not match a different downstream schema.
- Invalid JSON happens (~12% on the held-out split). Always parse defensively.
- Latency. ~90 s/resume on an A100 at 16K context — batch/offline, not real-time.
- Quantization. BF16 peaks at ~23 GB VRAM; it runs in 4-bit on a 16 GB GPU, but accuracy was only measured in BF16.
Out-of-scope and responsible use
- No automated candidate decisions. Resume parsing for screening/ranking carries fairness and bias risk. Keep a human in the loop; do not use this model to make or materially influence hiring decisions without review.
- Not a general VQA / OCR model. It is specialized for this resume schema.
- PII. Resumes contain personal data. Handle outputs under the applicable privacy law (e.g. GDPR) — secure storage, access control, retention limits, and a lawful basis for processing.
- Verify before trusting. Outputs are model predictions, not ground truth; validate critical fields (contact info, dates) downstream.
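A downstream check for the critical fields mentioned above might look like the following. It is a minimal sketch under stated assumptions: `check_record` is a hypothetical helper, the email pattern is deliberately loose, and the seven-digit phone threshold is an arbitrary example, not a rule from the project.

```python
import re
from datetime import date

EMAIL_RE = re.compile(r"^[^@\s]+@[^@\s]+\.[^@\s]+$")  # loose shape check only


def _parse_date(value):
    """Return a date for an ISO string, or None if it cannot be parsed."""
    try:
        return date.fromisoformat(value)
    except (TypeError, ValueError):
        return None


def check_record(record):
    """Return a list of problems; an empty list means nothing was flagged."""
    problems = []
    email = record.get("email")
    if email is not None and not EMAIL_RE.match(str(email)):
        problems.append(f"email looks malformed: {email!r}")
    phone = record.get("phone")
    if phone is not None and sum(ch.isdigit() for ch in str(phone)) < 7:
        problems.append(f"phone has too few digits: {phone!r}")
    for i, exp in enumerate(record.get("experiences") or []):
        parsed = {}
        for label in ("date_from", "date_to"):
            value = exp.get(label)
            if value is None:
                continue  # null is valid, e.g. an ongoing role
            parsed[label] = _parse_date(value)
            if parsed[label] is None:
                problems.append(f"experiences[{i}].{label} is not a valid date: {value!r}")
        start, end = parsed.get("date_from"), parsed.get("date_to")
        if start and end and start > end:
            problems.append(f"experiences[{i}] starts after it ends")
    return problems
```

Records with a non-empty problem list can be routed to human review instead of being written straight to the database.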
License
Released under Apache-2.0, inherited from the Qwen/Qwen3-VL-8B-Instruct base model.
Citation
```bibtex
@misc{nurali2026qwen3vlresumeparser,
  title        = {qwen3vl-resume-parser: a Qwen3-VL-8B fine-tune for resume-to-JSON extraction},
  author       = {Nurali, Sukhrob},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/sukhrobnurali/qwen3vl-resume-parser}}
}
```
Built on Qwen3-VL by the Qwen team; see the Qwen3-VL model card and Unsloth for the training stack.
Author
Sukhrob Nurali — sukhrobnurali@gmail.com Hugging Face: @sukhrobnurali · GitHub: @sukhrobnurali