exploitintel

cve-cwe-qwen35-4b

License: apache-2.0

Results (held-out test split, 6,802 rows)

Metric	This model (4B)	8B variant	32B variant
Exact-match	0.677	0.676	0.707
Micro-F1	0.701	0.702	0.729
Macro-F1	0.410	0.511	0.595
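
The three metrics in this table can be sketched on a toy example. This is a minimal illustration, not the evaluation script behind the reported numbers: the function names and the four toy rows are assumptions, and each row is treated as a set of CWE IDs. Macro-F1 is averaged over the union of gold and predicted labels, so a label that is only ever predicted (here the made-up `CWE-999`) contributes an F1 of 0.

```python
from typing import Dict, List, Set

Labels = List[Set[str]]


def exact_match(gold: Labels, pred: Labels) -> float:
    """Fraction of rows whose predicted label set equals the gold set."""
    return sum(g == p for g, p in zip(gold, pred)) / len(gold)


def per_label_counts(gold: Labels, pred: Labels) -> Dict[str, List[int]]:
    """Return {label: [tp, fp, fn]} over the union of gold and predicted labels."""
    counts: Dict[str, List[int]] = {}
    for g, p in zip(gold, pred):
        for label in g | p:
            c = counts.setdefault(label, [0, 0, 0])
            if label in g and label in p:
                c[0] += 1  # true positive
            elif label in p:
                c[1] += 1  # false positive (predicted, not gold)
            else:
                c[2] += 1  # false negative (gold, not predicted)
    return counts


def f1(tp: int, fp: int, fn: int) -> float:
    denom = 2 * tp + fp + fn
    return 2 * tp / denom if denom else 0.0


def micro_f1(gold: Labels, pred: Labels) -> float:
    """Pool tp/fp/fn across all labels, then compute one F1."""
    counts = per_label_counts(gold, pred).values()
    return f1(*(sum(c[i] for c in counts) for i in range(3)))


def macro_f1(gold: Labels, pred: Labels) -> float:
    """Unweighted mean of per-label F1 over the gold-plus-predicted union."""
    counts = per_label_counts(gold, pred).values()
    return sum(f1(*c) for c in counts) / len(counts)


# Toy data (hypothetical): CWE-999 is predicted but never appears in gold.
gold = [{"CWE-89"}, {"CWE-79"}, {"CWE-79", "CWE-80"}, {"CWE-22"}]
pred = [{"CWE-89"}, {"CWE-79"}, {"CWE-79"}, {"CWE-999"}]

print(exact_match(gold, pred))         # 0.5
print(round(micro_f1(gold, pred), 3))  # 0.667
print(round(macro_f1(gold, pred), 3))  # 0.4
```

On the toy rows the macro average is 2/5 rather than 2/4 because the out-of-gold label enlarges the denominator. The same effect makes the reported macro-F1 conservative.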

By difficulty (does the description name the weakness, or must it be inferred?):

Stratum	n	Exact-match	Micro-F1
Easy (weakness named)	2,046	0.861	0.886
Hard (must infer)	4,756	0.599	0.619
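
As a quick consistency check, exact-match is a per-row average, so the overall figure should equal the row-weighted mean of the two strata. The sketch below uses only the numbers from the tables. Micro-F1 does not decompose this way, so it is left out.

```python
# Values copied from the stratum table above.
strata = {
    "easy": {"n": 2046, "exact_match": 0.861},
    "hard": {"n": 4756, "exact_match": 0.599},
}

total_rows = sum(s["n"] for s in strata.values())
overall = sum(s["n"] * s["exact_match"] for s in strata.values()) / total_rows
easy_share = strata["easy"]["n"] / total_rows

print(total_rows, round(overall, 3), round(easy_share, 3))
# 6802 0.678 0.301
```

The recomputed 0.678 agrees with the reported 0.677 to within the rounding of the three-decimal inputs.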

Reading the numbers:

The 4B matches the 8B on exact-match (0.677 vs 0.676) and micro-F1 at roughly half the parameters. On the common, high-frequency CWEs it is just as accurate.
The trade-off is macro-F1 (0.410 vs the 8B's 0.511). Macro-F1 is the unweighted mean over all weaknesses, so it is dominated by the long tail — the 4B has less capacity to learn rare CWEs and misses more of them than the larger variants. If your use case weights rare-weakness coverage heavily, prefer the 8B or 32B; if you want head-of-distribution accuracy at the smallest footprint, this model is the pick.
Macro-F1 is computed over the union of gold and predicted labels: 158 labels, of which 117 are gold and 41 were predicted by the model but never appear in the gold set. Each of those out-of-label predictions scores 0 and pulls the macro average down, so 0.410 is a conservative figure. The larger union is itself a symptom of the 4B reaching for more wrong rare labels than the bigger models.
Exact-match has an inherent ceiling of roughly 98.3%: about 1.74% of the dataset (273 groups / 1,205 of 69,386 rows) consists of identical descriptions mapped to different CWEs (e.g. a bare "Windows Kernel Elevation of Privilege Vulnerability"), which a description-only model cannot disambiguate.
Scores are on the capped/balanced test split (~30% "easy" rows), so they are not directly comparable to metrics measured on a different (e.g. natural-distribution) split.
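
The exact-match ceiling can be estimated for any labelled split. The following is a sketch under stated assumptions, not the authors' script: rows are `(description, labels)` pairs, the helper names are hypothetical, and the CWE IDs in the toy rows are invented for illustration. A description-only model gives one answer per distinct description, so within each group it can match at most the most frequent label set.

```python
from collections import Counter, defaultdict
from typing import Dict, FrozenSet, Iterable, List, Tuple

Row = Tuple[str, Iterable[str]]


def label_sets_by_description(rows: List[Row]) -> Dict[str, Counter]:
    """Count how often each distinct label set occurs per description."""
    groups: Dict[str, Counter] = defaultdict(Counter)
    for description, labels in rows:
        groups[description.strip()][frozenset(labels)] += 1
    return groups


def ambiguous_groups(rows: List[Row]) -> Dict[str, Counter]:
    """Descriptions that map to more than one distinct label set."""
    return {d: c for d, c in label_sets_by_description(rows).items() if len(c) > 1}


def exact_match_ceiling(rows: List[Row]) -> float:
    """Best exact-match achievable when each description gets one fixed answer."""
    groups = label_sets_by_description(rows)
    best = sum(max(c.values()) for c in groups.values())
    return best / len(rows)


# Toy rows; the label assignments are hypothetical.
rows = [
    ("Windows Kernel Elevation of Privilege Vulnerability", ["CWE-416"]),
    ("Windows Kernel Elevation of Privilege Vulnerability", ["CWE-122"]),
    ("Windows Kernel Elevation of Privilege Vulnerability", ["CWE-416"]),
    ("SQL injection in the login form.", ["CWE-89"]),
    ("SQL injection in the login form.", ["CWE-89"]),
]

amb = ambiguous_groups(rows)
print(len(amb), sum(sum(c.values()) for c in amb.values()))  # 1 3
print(exact_match_ceiling(rows))                             # 0.8
```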

Usage

Qwen3.5-4B is a reasoning model. For this classification task, disable thinking (enable_thinking=False) so it returns the bare CWE ID(s) instead of a chain-of-thought.

python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

mid = "exploitintel/cve-cwe-qwen35-4b"
tok = AutoTokenizer.from_pretrained(mid)
model = AutoModelForCausalLM.from_pretrained(mid, torch_dtype="auto", device_map="auto")

messages = [
    {"role": "system", "content": "You are a vulnerability analyst. Given a CVE description, "
                                   "reply with only the CWE ID(s) it maps to, comma-separated."},
    {"role": "user", "content": "A SQL injection vulnerability in the login endpoint allows an "
                                "unauthenticated attacker to execute arbitrary SQL via the username parameter."},
]
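inputs = tok.apply_chat_template(
    messages,
    add_generation_prompt=True,
    enable_thinking=False,  # skip chain-of-thought; return the bare CWE ID(s)
    return_dict=True,       # explicit, so input_ids and attention_mask come back together
    return_tensors="pt",
).to(model.device)
out = model.generate(**inputs, max_new_tokens=32, do_sample=False)
# Decode only the newly generated tokens, not the echoed prompt.
prompt_len = inputs["input_ids"].shape[-1]
print(tok.decode(out[0][prompt_len:], skip_special_tokens=True))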
# -> CWE-89

GGUF / Ollama

A Q4_K_M GGUF is included for local runners. The simplest path just works — Ollama pulls the GGUF and applies its embedded ChatML template:

bash
ollama run hf.co/exploitintel/cve-cwe-qwen35-4b:Q4_K_M

Set the analyst system prompt in-session (/set system You are a vulnerability analyst...) so it returns bare CWE IDs. llama.cpp / llama-server likewise use the embedded template directly.

Caveat if you build your own Modelfile: include an explicit TEMPLATE. A Modelfile with SYSTEM but no TEMPLATE suppresses the embedded template, and the model then rambles through a fabricated advisory instead of answering. The known-good Modelfile (ChatML, thinking disabled, system prompt baked in):

markdown
FROM ./cve-cwe-qwen35-4b-Q4_K_M.gguf

TEMPLATE """{{- if .Messages }}
{{- if or .System .Tools }}<|im_start|>system
{{- if .System }}
{{ .System }}
{{- end }}
{{- if .Tools }}
# Tools
You may call one or more functions to assist with the user query.
You are provided with function signatures within <tools></tools> XML tags:
<tools>
{{- range .Tools }}
{"type": "function", "function": {{ .Function }}}
{{- end }}
</tools>
For each function call, return a json object with function name and arguments within <tool_call></tool_call> XML tags:
<tool_call>
{"name": <function-name>, "arguments": <args-json-object>}
</tool_call>
{{- end }}<|im_end|>
{{ end }}
{{- range $i, $_ := .Messages }}
{{- $last := eq (len (slice $.Messages $i)) 1 -}}
{{- if eq .Role "user" }}<|im_start|>user
{{ .Content }}<|im_end|>
{{ else if eq .Role "assistant" }}<|im_start|>assistant
{{ if .Content }}{{ .Content }}
{{- else if .ToolCalls }}<tool_call>
{{ range .ToolCalls }}{"name": "{{ .Function.Name }}", "arguments": {{ .Function.Arguments }}}
{{ end }}</tool_call>
{{- end }}{{ if not $last }}<|im_end|>
{{ end }}
{{- else if eq .Role "tool" }}<|im_start|>user
<tool_response>
{{ .Content }}
</tool_response><|im_end|>
{{ end }}
{{- if and (ne .Role "assistant") $last }}<|im_start|>assistant
<think>

</think>

{{ end }}
{{- end }}
{{- else }}
{{- if .System }}<|im_start|>system
{{ .System }}<|im_end|>
{{ end }}{{ if .Prompt }}<|im_start|>user
{{ .Prompt }}<|im_end|>
{{ end }}<|im_start|>assistant
<think>

</think>

{{ end }}{{ .Response }}{{ if .Response }}<|im_end|>{{ end }}"""

SYSTEM """You are a vulnerability analyst. Given a CVE description, reply with only the CWE ID(s) it maps to, comma-separated."""

PARAMETER temperature 0
PARAMETER stop "<|im_end|>"

bash
ollama create cve-cwe-qwen35-4b -f Modelfile
ollama run cve-cwe-qwen35-4b "A SQL injection vulnerability lets an attacker run arbitrary SQL via the username parameter."
# -> CWE-89

The empty <think></think> block that the template pre-fills into the assistant turn disables reasoning (equivalent to enable_thinking=False); without it the model emits chain-of-thought before the answer.
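
For scripted use, the model created above can also be queried through Ollama's local HTTP chat endpoint. This is a minimal sketch, assuming an Ollama server on its default address (`localhost:11434`) and the model name from the `ollama create` command. The helper names `build_chat_request` and `classify` are hypothetical. No system message is sent because the Modelfile already bakes one in.

```python
import json
import urllib.request
from typing import Any, Dict

OLLAMA_CHAT_URL = "http://localhost:11434/api/chat"  # assumed default local server


def build_chat_request(description: str, model: str = "cve-cwe-qwen35-4b") -> Dict[str, Any]:
    """Build the JSON body for a single non-streaming chat call."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": description}],
        "stream": False,                # one JSON object instead of a token stream
        "options": {"temperature": 0},  # mirrors PARAMETER temperature 0
    }


def classify(description: str, url: str = OLLAMA_CHAT_URL, timeout: float = 60.0) -> str:
    """POST the description to a running Ollama server and return the raw reply text."""
    body = json.dumps(build_chat_request(description)).encode("utf-8")
    request = urllib.request.Request(
        url, data=body, headers={"Content-Type": "application/json"}
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        reply = json.load(response)
    return reply["message"]["content"].strip()
```

With the server running, `classify("A SQL injection vulnerability lets an attacker run arbitrary SQL via the username parameter.")` sends the same prompt as the `ollama run` example above.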

Training

Base: Qwen/Qwen3.5-4B (trained 4-bit via unsloth/Qwen3.5-4B)
Method: QLoRA (4-bit) with Unsloth, merged to 16-bit · released checkpoint: checkpoint-2326 (final; eval loss declined monotonically through training)
Dataset: exploitintel/cve-cwe-consensus, 69,386 rows (55,810 train / 6,774 validation / 6,802 test), majority CWEs capped at 2,500
Settings: 2 epochs · context 512 · LR 2e-4 · AdamW 8-bit · linear schedule · packing on · train-on-completions-only off
LoRA fine-tune, adapter merged into the base. Exact per-run LoRA rank/alpha, batch size, and weight decay were not logged to the repo.

Prompt format

ChatML (Qwen3 standard), thinking disabled. Fixed system prompt; the description is the only user input — never feed the label or CVE-ID.

system: You are a vulnerability analyst. Given a CVE description, reply with only the CWE ID(s) it maps to, comma-separated.
user: the CVE description
assistant: CWE-79, CWE-80
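
To make the format concrete, the sketch below builds the message list and renders it to the ChatML string that the Modelfile template above produces for a fresh request, including the empty think block that disables reasoning. It is for inspection only, and the helper names are hypothetical. In practice the tokenizer's chat template or Ollama's TEMPLATE does this rendering.

```python
from typing import Dict, List

SYSTEM_PROMPT = (
    "You are a vulnerability analyst. Given a CVE description, "
    "reply with only the CWE ID(s) it maps to, comma-separated."
)


def build_messages(description: str) -> List[Dict[str, str]]:
    """Fixed system prompt plus the description as the only user input."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": description.strip()},
    ]


def render_chatml(messages: List[Dict[str, str]]) -> str:
    """Render messages as ChatML and open the assistant turn with thinking disabled."""
    parts = [f"<|im_start|>{m['role']}\n{m['content']}<|im_end|>\n" for m in messages]
    # An empty think block tells the model to answer directly.
    parts.append("<|im_start|>assistant\n<think>\n\n</think>\n\n")
    return "".join(parts)


prompt = render_chatml(build_messages("Stored XSS in the comment field of the admin panel."))
print(prompt)
```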

Limitations

Weaker long-tail (rare-CWE) coverage than the 8B/32B variants — see macro-F1 above.
The base is a reasoning model and is meant to be run with thinking disabled. If you leave thinking enabled, it emits chain-of-thought before the answer; in that case, parse only the text after </think>.
CWEs below the dataset's 50-example floor are not in the label space and won't be predicted.
Outputs CWE IDs as text and can occasionally emit a malformed/non-existent ID — validate against the official CWE list.
English-only; descriptions only (no code, CVSS, or references).
A triage/assist aid, not an authoritative CWE assignment; have a human review the output before acting on it.
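
Two of the points above (stray chain-of-thought and malformed IDs) can be handled in a small post-processing step. This is a minimal sketch with a hypothetical helper, `parse_cwe_ids`. The `allowed` set is an assumption and would in practice be loaded from the official CWE list.

```python
import re
from typing import List, Optional, Set

CWE_ID = re.compile(r"\bCWE-(\d+)\b", re.IGNORECASE)


def parse_cwe_ids(raw: str, allowed: Optional[Set[str]] = None) -> List[str]:
    """Extract CWE IDs from model output, in order, without duplicates.

    Only the text after the last </think> is considered. If `allowed` is given,
    IDs outside it are dropped.
    """
    answer = raw.rsplit("</think>", 1)[-1]
    ids: List[str] = []
    for match in CWE_ID.finditer(answer):
        cwe = f"CWE-{int(match.group(1))}"  # normalizes case and leading zeros
        if allowed is not None and cwe not in allowed:
            continue
        if cwe not in ids:
            ids.append(cwe)
    return ids


print(parse_cwe_ids("<think>maybe CWE-20?</think>\n\nCWE-79, CWE-80"))
# ['CWE-79', 'CWE-80']
print(parse_cwe_ids("CWE-89, CWE-99999", allowed={"CWE-79", "CWE-89"}))
# ['CWE-89']
```

An empty result is a reasonable signal to route the description to manual review.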

License

Apache-2.0 (inherited from Qwen3.5-4B). Dataset derives from public upstreams (NVD, MITRE CVE/CWE).
