README

Task

  • Input: a CVE description (plain text).
  • Output: one or more CWE IDs, comma-separated and numerically sorted (e.g. CWE-79 or CWE-79, CWE-352).
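The output format above can be enforced on the consumer side. The following is a minimal sketch, not part of the model's own tooling: the helper name `normalize_cwes` and the regular expression are assumptions made for illustration. It extracts CWE IDs from raw text, removes duplicates, sorts them numerically (so CWE-79 comes before CWE-352), and joins them with commas.

```python
import re

# Assumed pattern: the model card only shows IDs of the form CWE-<number>.
CWE_RE = re.compile(r"CWE-(\d+)", re.IGNORECASE)


def normalize_cwes(raw: str) -> str:
    """Return the unique CWE IDs found in raw, numerically sorted and comma-separated."""
    numbers = {int(n) for n in CWE_RE.findall(raw)}
    # Sort on the integer value; a plain string sort would put CWE-352 before CWE-79.
    return ", ".join(f"CWE-{n}" for n in sorted(numbers))


if __name__ == "__main__":
    print(normalize_cwes("CWE-352, CWE-79"))
```

Sorting on the integer value rather than the string matters here because the evaluation compares multi-label answers in order.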

Evaluation

On 100 held-out test rows (strict exact-match, including multi-label order):

| Variant | Exact-match | Well-formed | <think> leak |
|---|---|---|---|
| Merged 16-bit (Transformers) | 75.0% | 100% | 0 |
| Q4_K_M GGUF (Ollama) | 70.0% | 100% | 0 |

Strict exact-match penalizes near-misses (e.g. the correct primary CWE plus one extra label), so practical usefulness is higher than the headline number. The 5-point gap between the two variants (75.0% vs. 70.0%, i.e. 5 of the 100 rows) is the expected cost of Q4 quantization.
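To make the difference between strict scoring and a more forgiving view concrete, here is a small sketch. It is not the evaluation script used for the table above: the function names, the whitespace handling, and the relaxed "prediction covers every gold label" metric are all assumptions chosen for illustration, and the example pairs are made up.

```python
def parse_labels(text: str) -> list[str]:
    """Split a comma-separated answer into stripped labels, preserving order."""
    return [part.strip() for part in text.split(",") if part.strip()]


def strict_match(pred: str, gold: str) -> bool:
    """Exact match on the ordered label list: order and label count both matter."""
    return parse_labels(pred) == parse_labels(gold)


def covers_gold(pred: str, gold: str) -> bool:
    """Relaxed (assumed) metric: every gold label appears in the prediction."""
    return set(parse_labels(gold)) <= set(parse_labels(pred))


def score(pairs: list[tuple[str, str]]) -> dict[str, float]:
    """Return strict and relaxed accuracy over (prediction, gold) pairs."""
    if not pairs:
        return {"strict": 0.0, "relaxed": 0.0}
    n = len(pairs)
    return {
        "strict": sum(strict_match(p, g) for p, g in pairs) / n,
        "relaxed": sum(covers_gold(p, g) for p, g in pairs) / n,
    }


if __name__ == "__main__":
    # Hypothetical predictions and gold labels, for illustration only.
    demo = [
        ("CWE-79", "CWE-79"),                    # exact hit
        ("CWE-79, CWE-352", "CWE-79"),           # correct label plus one extra
        ("CWE-89", "CWE-22"),                    # miss
        ("CWE-79, CWE-352", "CWE-79, CWE-352"),  # exact multi-label hit
    ]
    print(score(demo))
```

In this toy set the second pair counts as a miss under strict scoring but as a hit under the relaxed metric, which is the kind of near-miss described above.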

Usage (Transformers / Unsloth)

[!IMPORTANT] Disable thinking mode. This model is trained for terse, structured output. Run with enable_thinking=False; otherwise Qwen3.5's default <think> block pollutes the output. Import unsloth before transformers so the qwen3_5 architecture is registered.

python

import unsloth  # registers qwen3_5; must come first
from unsloth import FastModel

model, tok = FastModel.from_pretrained("exploitintel/cve-cwe-qwen35-9b", load_in_4bit=False)
FastModel.for_inference(model)
# Use the inner tokenizer when tok is a processor wrapper; otherwise use tok itself.
ttok = getattr(tok, "tokenizer", tok)

SYSTEM = ("You are a vulnerability analyst. Given a CVE description, "
          "reply with only the CWE ID(s) it maps to, comma-separated.")
msgs = [
    {"role": "system", "content": SYSTEM},
    {"role": "user", "content": "A reflected cross-site scripting issue lets a remote "
                                "attacker inject arbitrary script via the q parameter."},
]

# Render the chat template as text with thinking mode disabled.
text = ttok.apply_chat_template(msgs, add_generation_prompt=True,
                                enable_thinking=False, tokenize=False)
ids = ttok(text, return_tensors="pt", add_special_tokens=False).to(model.device)
out = model.generate(**ids, max_new_tokens=48, do_sample=False)
# Decode only the newly generated tokens, skipping the prompt.
print(ttok.decode(out[0][ids["input_ids"].shape[-1]:], skip_special_tokens=True))  # -> CWE-79
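As a safeguard in case thinking mode is left enabled by mistake, decoded output can be checked and cleaned before it is used. The sketch below is an assumption-laden illustration, not the author's evaluation code: the helper names and the exact definition of "well-formed" (one or more `CWE-<number>` items separated by commas) are hypothetical.

```python
import re

# Matches a complete <think>...</think> block, including newlines inside it.
THINK_BLOCK = re.compile(r"<think>.*?</think>", re.DOTALL)
# Assumed definition of a well-formed answer: CWE-<n>, optionally followed by more.
WELL_FORMED = re.compile(r"CWE-\d+(?:, ?CWE-\d+)*")


def has_think_leak(decoded: str) -> bool:
    """True if any thinking tag survived in the decoded text."""
    return "<think>" in decoded or "</think>" in decoded


def clean_output(decoded: str) -> str:
    """Remove complete <think> blocks and trim surrounding whitespace."""
    return THINK_BLOCK.sub("", decoded).strip()


def is_well_formed(answer: str) -> bool:
    """True if the whole answer is a comma-separated list of CWE IDs."""
    return WELL_FORMED.fullmatch(answer) is not None


if __name__ == "__main__":
    raw = "<think>\nLooks like reflected XSS.\n</think>\n\nCWE-79"
    print(has_think_leak(raw), clean_output(raw), is_well_formed(clean_output(raw)))
```

Cleaning after the fact only handles complete blocks; an unterminated `<think>` that hits the `max_new_tokens` limit would still fail the well-formedness check, so disabling thinking mode remains the primary fix.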

Usage (Ollama / GGUF)

A Q4_K_M GGUF is included in this repo. It is converted without the MTP head (convert_hf_to_gguf.py --no-mtp). This step is required: otherwise llama.cpp/Ollama fails to load the model with the error qwen3next: layer 32 missing attn_qkv/attn_gate projections. The bundled Modelfile pins thinking mode off and sets the correct stop token (<|im_end|>), so the output is a clean CWE-... string.

bash

ollama run hf.co/exploitintel/cve-cwe-qwen35-9b:Q4_K_M
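For programmatic use, the same model can be queried through Ollama's local HTTP chat endpoint instead of the interactive CLI. The following is a rough sketch under several assumptions: Ollama is running on its default local address, the model has already been pulled with the command above, and the system prompt is copied from the Transformers example (the model card does not say whether the bundled Modelfile already sets one). The function names are hypothetical.

```python
import json
import urllib.request

MODEL = "hf.co/exploitintel/cve-cwe-qwen35-9b:Q4_K_M"
OLLAMA_URL = "http://localhost:11434/api/chat"  # assumed default local address
SYSTEM = ("You are a vulnerability analyst. Given a CVE description, "
          "reply with only the CWE ID(s) it maps to, comma-separated.")


def build_payload(description: str) -> dict:
    """Build a non-streaming chat request for one CVE description."""
    return {
        "model": MODEL,
        "messages": [
            {"role": "system", "content": SYSTEM},
            {"role": "user", "content": description},
        ],
        "stream": False,                 # return a single JSON object
        "options": {"temperature": 0},   # greedy-style decoding, like do_sample=False
    }


def classify(description: str) -> str:
    """Send the request to a running Ollama server and return the reply text."""
    request = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(description)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=120) as response:
        body = json.load(response)
    return body["message"]["content"].strip()


if __name__ == "__main__":
    # Requires a local Ollama server with the model available.
    print(classify("A reflected cross-site scripting issue lets a remote "
                   "attacker inject arbitrary script via the q parameter."))
```

Only the standard library is used so the sketch has no extra dependencies; an HTTP client library or the official Ollama Python package would work equally well.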

Notes

  • Architecture: qwen3_5 (hybrid linear/full attention + MTP). Requires Unsloth or a transformers build that registers qwen3_5 (≥ 5.2.0).
  • Base modality: the base is vision-capable; this fine-tune and the GGUF target text-only CVE→CWE mapping.
  • License: inherits the license of the Qwen3.5-9B base model.

Model provider

exploitintel

Model tree

Base

unsloth/Qwen3.5-9B

Quantized

this model

Modalities

Input

Video, Text, Image

Output

Text
