Model description
| Property | Value |
|---|---|
| Base model | Qwen/Qwen2.5-1.5B-Instruct |
| Method | 4-bit QLoRA SFT → LoRA merged into full weights (bf16 safetensors) |
| Classes | phishing, legitimate |
| Max sequence length | 512 tokens (training default) |
| Parameters | ~1.5B (full merged checkpoint) |
Intended uses
- Research and education on phishing URL / email text classification
- Prototyping security tooling with explicit human review
Out-of-scope uses
- Sole automated decision-making for blocking users or transactions without review
- Spam campaigns, social engineering, or evading security systems
- Languages or domains far from the training distribution
Use this exact instruction and layout (training and eval depend on it):
### Instruction:
Classify the email or URL as phishing or legitimate.

### Input:
<your email body or URL here>

### Response:
The model should complete the `### Response:` section with a single label, `phishing` or `legitimate`. Use greedy decoding (temperature 0) so the output is deterministic.
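The prompt layout and the label check can be wrapped in two small helpers. This is a minimal sketch: `build_prompt` and `parse_label` are hypothetical names that are not part of this repository, the blank lines between sections follow the usage code later in this card, and the parsing rule (first word of the completion, anything else mapped to `None`) is an assumption rather than the authors' evaluation logic.

```python
from typing import Optional

INSTRUCTION = "Classify the email or URL as phishing or legitimate."
LABELS = ("phishing", "legitimate")


def build_prompt(text: str) -> str:
    """Return the prompt in the Instruction / Input / Response layout."""
    return (
        "### Instruction:\n"
        f"{INSTRUCTION}\n\n"
        f"### Input:\n{text}\n\n"
        "### Response:\n"
    )


def parse_label(completion: str) -> Optional[str]:
    """Map a raw completion to a label, or None if it matches neither."""
    words = completion.strip().lower().split()
    if not words:
        return None
    # Drop trailing punctuation such as "phishing." before comparing.
    first = words[0].strip(".,:;!\"'")
    return first if first in LABELS else None


print(parse_label(" Phishing\n"))     # phishing
print(parse_label("I am not sure"))   # None
```

Returning `None` for anything outside the two labels keeps unparseable completions visible, which matters when the output feeds a human review step.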
Evaluation (held-out test set)
Metrics were computed on 1,000 held-out test examples (500 per class) using the trained LoRA adapter loaded through PEFT (evaluation script, batched inference with left padding). These figures were measured before the merge; the merged weights are expected to match closely.
| Metric | Value |
|---|---|
| Accuracy | 96.5% |
| Macro F1 | 0.9820 |
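Confusion matrix (rows = true, columns = predicted):
| True \ Predicted | phishing | legitimate |
|---|---|---|
| phishing | 469 | 0 |
| legitimate | 0 | 496 |
The rows sum to 469 and 496 rather than the support of 500 per class, so the remaining 31 phishing and 4 legitimate examples presumably received a prediction outside both labels (for example, an unparseable completion). This is consistent with the precision of 1.00 and the recall below 1.00 in the classification report that follows.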
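| Class | Precision | Recall | F1-score | Support |
|---|---|---|---|---|
| phishing | 1.00 | 0.94 | 0.97 | 500 |
| legitimate | 1.00 | 0.99 | 1.00 | 500 |
| micro avg | 1.00 | 0.96 | 0.98 | 1000 |
| macro avg | 1.00 | 0.96 | 0.98 | 1000 |
| weighted avg | 1.00 | 0.96 | 0.98 | 1000 |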
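The headline numbers can be re-derived from the confusion matrix as a sanity check. The sketch below assumes a support of 500 per class and treats test examples missing from the matrix as misses for their true class; the function name `per_class_f1` is illustrative only.

```python
# Counts copied from the confusion matrix (rows = true, columns = predicted).
confusion = {
    "phishing": {"phishing": 469, "legitimate": 0},
    "legitimate": {"phishing": 0, "legitimate": 496},
}
support = {"phishing": 500, "legitimate": 500}  # test examples per class


def per_class_f1(label: str) -> float:
    """F1 for one class; predictions outside both labels count as misses."""
    tp = confusion[label][label]
    predicted = sum(row[label] for row in confusion.values())
    precision = tp / predicted if predicted else 0.0
    recall = tp / support[label]
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


correct = sum(confusion[label][label] for label in confusion)
accuracy = correct / sum(support.values())
macro_f1 = sum(per_class_f1(label) for label in confusion) / len(confusion)

print(f"accuracy = {accuracy:.3f}")  # accuracy = 0.965
print(f"macro F1 = {macro_f1:.4f}")  # macro F1 = 0.9820
```

Both values agree with the reported accuracy of 96.5% and macro F1 of 0.9820.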
Training data
- Source: Synthetic dataset (5,000 rows, balanced 2,500 / 2,500)
- Columns: `model_input` (email or URL text), `label` (`phishing` or `legitimate`)
- Split: 80/20 stratified → 4,000 train / 1,000 test
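An 80/20 stratified split of this shape can be sketched with the standard library alone. The rows here are placeholders because the dataset is not released, `stratified_split` is a hypothetical helper, and reusing seed 42 for the split is an assumption (the card lists that seed under the training settings only).

```python
import random
from collections import defaultdict

# Placeholder rows standing in for the unreleased dataset (assumption).
rows = [
    {"model_input": f"sample-{i}", "label": label}
    for label in ("phishing", "legitimate")
    for i in range(2500)
]


def stratified_split(rows, test_fraction=0.2, seed=42):
    """Split rows so each label keeps the same share in train and test."""
    rng = random.Random(seed)
    by_label = defaultdict(list)
    for row in rows:
        by_label[row["label"]].append(row)
    train, test = [], []
    for label_rows in by_label.values():
        rng.shuffle(label_rows)
        n_test = round(len(label_rows) * test_fraction)
        test.extend(label_rows[:n_test])
        train.extend(label_rows[n_test:])
    return train, test


train, test = stratified_split(rows)
print(len(train), len(test))  # 4000 1000
```

Splitting per label before shuffling is what guarantees 500 test examples for each class.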
📌 Dataset availability: The training dataset is not publicly released as part of this repository. If you are interested in the data for research collaboration or reproducibility purposes, please contact the authors directly via Hugging Face.
Training procedure
| Setting | Value |
|---|---|
| Base model | Qwen/Qwen2.5-1.5B-Instruct |
| Quantization | 4-bit NF4 (QLoRA training only) |
| LoRA rank / alpha | 16 / 32 |
| Epochs | 1 |
| Learning rate | 2e-4 |
| Batch (effective) | 16 (8 × grad accum 2) |
| Max seq length | 512 |
| Seed | 42 |
| Merge | bf16 full weights for this checkpoint |
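As a quick check on the batch settings, the effective batch size and the number of optimizer steps follow from simple arithmetic. This sketch assumes a single device, the 4,000-example training split, and that a final partial batch is kept rather than dropped; none of these details are stated in the table.

```python
import math

train_examples = 4000   # training rows from the 80/20 split
per_device_batch = 8    # micro-batch size
grad_accum_steps = 2    # gradient accumulation steps
epochs = 1

# Gradients from two micro-batches are accumulated before each update.
effective_batch = per_device_batch * grad_accum_steps
steps_per_epoch = math.ceil(train_examples / effective_batch)
total_steps = steps_per_epoch * epochs

print(effective_batch, total_steps)  # 16 250
```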
Usage
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

repo = "Lovely2209/Qwen2.5-1.5B-Phishing-Email-Detector"
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForCausalLM.from_pretrained(
    repo,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

text = "http://paypa1-secure.example/verify?id=abc"
# Build the prompt in the exact layout used for training and evaluation.
prompt = (
    "### Instruction:\n"
    "Classify the email or URL as phishing or legitimate.\n\n"
    f"### Input:\n{text}\n\n"
    "### Response:\n"
)

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
with torch.no_grad():
    out = model.generate(
        **inputs,
        max_new_tokens=8,
        do_sample=False,  # greedy decoding
        pad_token_id=tokenizer.eos_token_id,
    )

# Decode only the newly generated tokens so the prompt is not echoed back.
new_tokens = out[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True).strip())
vLLM (merged — no LoRA)
If nvcc / the CUDA toolkit is not installed, disable the FlashInfer sampler in the shell before starting Python:
export VLLM_USE_FLASHINFER_SAMPLER=0
Then run:
from vllm import LLM, SamplingParams
from vllm.sampling_params import StructuredOutputsParams

repo = "Lovely2209/Qwen2.5-1.5B-Phishing-Email-Detector"
llm = LLM(
    model=repo,
    dtype="bfloat16",
    max_model_len=512,  # matches the training max sequence length
    enforce_eager=True,
)

prompt = (
    "### Instruction:\n"
    "Classify the email or URL as phishing or legitimate.\n\n"
    "### Input:\n"
    "http://paypa1-secure.example/verify\n\n"
    "### Response:\n"
)

sampling = SamplingParams(
    temperature=0.0,  # greedy decoding
    max_tokens=8,
    # Constrain the output to one of the two labels
    # (needs a vLLM release that provides StructuredOutputsParams).
    structured_outputs=StructuredOutputsParams(
        choice=["phishing", "legitimate"]
    ),
)

outputs = llm.generate([prompt], sampling)
print(outputs[0].outputs[0].text.strip())
License
This model is released under the Apache License 2.0, consistent with Qwen/Qwen2.5-1.5B-Instruct. See LICENSE and NOTICE in this repository.
This project is not affiliated with, endorsed by, or sponsored by Alibaba Cloud or the Qwen Team.
Citation
If you use this model, please cite the base Qwen2.5 work:
@misc{qwen2.5,
  title  = {Qwen2.5: A Party of Foundation Models},
  url    = {https://qwenlm.github.io/blog/qwen2.5/},
  author = {Qwen Team},
  month  = {September},
  year   = {2024}
}
Fine-tune attribution:
@misc{qwen25-phishing-email-detector-2026,
  title        = {Qwen2.5-1.5B Phishing Email Detector (merged QLoRA SFT)},
  author       = {Jagriti Singh and Lovely Kumari},
  year         = {2026},
  howpublished = {\url{https://huggingface.co/Lovely2209/Qwen2.5-1.5B-Phishing-Email-Detector}},
  note         = {Fine-tuned from Qwen/Qwen2.5-1.5B-Instruct}
}