License: apache-2.0
Intended use
Purpose-trained for AI content moderation — classifies a message into a safety category with a confidence. Suitable for forums, chat apps, Discord, or any text moderation pipeline. Typically used as the first cheap tier of a triage system: a downstream policy decides what to do with the verdict (e.g. a confidence-gated enforcement ladder — auto-action on high-confidence/clear cases, human review for borderline ones).
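A confidence-gated ladder of the kind mentioned above could look roughly like the sketch below. The threshold values, the tier names, and the `route` helper are all hypothetical and would need tuning on your own data; `p_flag` is assumed to be a calibrated probability that the message should be flagged.

```python
# Hypothetical thresholds: tune them on your own held-out data.
AUTO_ACTION_THRESHOLD = 0.95
ALLOW_THRESHOLD = 0.05


def route(p_flag: float) -> str:
    """Map a calibrated flag probability to one of three illustrative tiers."""
    if not 0.0 <= p_flag <= 1.0:
        raise ValueError("p_flag must be a probability in [0, 1]")
    if p_flag >= AUTO_ACTION_THRESHOLD:
        return "auto_action"   # clear violation: enforce automatically
    if p_flag <= ALLOW_THRESHOLD:
        return "allow"         # clearly benign: let it through
    return "human_review"      # borderline: escalate to a moderator


if __name__ == "__main__":
    for p in (0.99, 0.50, 0.01):
        print(p, route(p))
```

Keeping the policy in a separate function like this means thresholds can change without touching the classifier.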
Categories
benign, toxicity, harassment, hate_speech, sexual_content, violence.
Input / output
Prompt the model with a moderation system rubric + the message to judge; it returns strict JSON:
```json
{"flag": true, "category": "harassment", "confidence": 0.55, "reason": "short"}
```
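Because downstream code depends on this shape, it may be worth validating each reply before acting on it. The sketch below is one possible validator, not part of the model's own tooling: `parse_verdict` is a hypothetical helper, and the allowed categories are copied from the Categories list above.

```python
import json

CATEGORIES = {
    "benign", "toxicity", "harassment",
    "hate_speech", "sexual_content", "violence",
}


def parse_verdict(raw: str) -> dict:
    """Parse one model reply and raise ValueError if it does not match the schema."""
    data = json.loads(raw)  # json.JSONDecodeError is a subclass of ValueError
    if not isinstance(data, dict):
        raise ValueError("expected a JSON object")
    if not isinstance(data.get("flag"), bool):
        raise ValueError("'flag' must be a boolean")
    category = data.get("category")
    if not isinstance(category, str) or category not in CATEGORIES:
        raise ValueError(f"unknown category: {category!r}")
    conf = data.get("confidence")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(conf, bool) or not isinstance(conf, (int, float)):
        raise ValueError("'confidence' must be a number")
    if not 0.0 <= conf <= 1.0:
        raise ValueError("'confidence' must lie in [0, 1]")
    return data


if __name__ == "__main__":
    reply = '{"flag": true, "category": "harassment", "confidence": 0.55, "reason": "short"}'
    print(parse_verdict(reply)["category"])
```

A reply that fails validation is probably best treated as an uncertain case and sent to review rather than silently dropped.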
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tok = AutoTokenizer.from_pretrained("Hagrun/moddog-l1-safety-qwen2.5-3b")
model = AutoModelForCausalLM.from_pretrained(
    "Hagrun/moddog-l1-safety-qwen2.5-3b", torch_dtype=torch.float16, device_map="cuda"
)

messages = [
    {"role": "system", "content": "<your moderation rubric>"},
    {"role": "user", "content": 'Message to judge: "you absolute idiot"'},
]
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tok(prompt, return_tensors="pt").to(model.device)

# Greedy decoding keeps the verdict deterministic.
out = model.generate(**inputs, max_new_tokens=96, do_sample=False)

# Decode only the newly generated tokens so the printed text is just the JSON verdict.
new_tokens = out[0][inputs["input_ids"].shape[1]:]
print(tok.decode(new_tokens, skip_special_tokens=True))
```
Confidence — read it from the logits, not the text
The most useful finding from building this model: the confidence number the model writes is not trustworthy, but the confidence in its logits is.
- The emitted `confidence` field is non-discriminative: its AUROC vs. actual correctness is ≈ 0.5 (a coin flip). Don't gate on it.
- The model's probability on the flag-decision token (read from logprobs) discriminates well (AUROC ≈ 0.71–0.79) and calibrates cleanly (ECE ≈ 0.045 after a simple histogram fit).
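As a rough illustration of reading the confidence from logprobs, the sketch below computes P(true) / (P(true) + P(false)) from the candidate tokens at the position where the model writes the value of `"flag"`. It assumes you have already extracted a token-to-logprob mapping for that position; the `flag_probability` name and the choice to merge whitespace and case variants of the two tokens are assumptions made for this example.

```python
import math
from typing import Mapping


def flag_probability(top_logprobs: Mapping[str, float]) -> float:
    """Return P(true) / (P(true) + P(false)) for one token position.

    `top_logprobs` maps candidate token strings to natural-log probabilities.
    """
    p_true = 0.0
    p_false = 0.0
    for token, logprob in top_logprobs.items():
        normalized = token.strip().lower()  # treat " true" and "true" alike
        if normalized == "true":
            p_true += math.exp(logprob)
        elif normalized == "false":
            p_false += math.exp(logprob)
    if p_true + p_false == 0.0:
        raise ValueError("neither 'true' nor 'false' is among the candidates")
    return p_true / (p_true + p_false)


if __name__ == "__main__":
    # Made-up numbers, purely to show the call shape.
    print(flag_probability({" true": -0.51, " false": -1.61, "null": -1.61}))
```

Renormalizing over just the two outcomes discards probability mass the model put on unrelated tokens, so the result can be read as a probability for the flag decision alone.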
In practice this supports a high-confidence auto-action tier: on the held-out eval, thresholding on calibrated logit-confidence covered ~79% of cases with 0 high-confidence false positives, routing the uncertain remainder to human review. To use it, request token logprobs at inference (e.g. `logprobs`/`top_logprobs` via an OpenAI-compatible server, or `output_scores` with transformers) and take P(true) / (P(true) + P(false)) at the `"flag"` value token.
Evaluation
Two held-out sets: a small human-curated golden set (on-distribution chat) and a larger balanced Civil Comments set (off-distribution news comments).
| metric | base Qwen2.5-3B | this model |
|---|---|---|
| golden flag-accuracy (24) | 18/24 | 19/24 |
| Civil Comments accuracy (294) | 64% | 81% |
| logit-confidence AUROC (golden) | — | 0.79 |
| logit-confidence ECE (golden, calibrated) | — | 0.045 |
Confidence/calibration numbers are for the logit-derived confidence described above, not the emitted number.
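The card does not include its evaluation code, so the following is only a sketch of how metrics like these are commonly computed from pairs of (confidence, was the prediction correct). It shows a pairwise AUROC, a histogram-binning calibrator, and expected calibration error; the function names and the default of 10 bins are assumptions.

```python
from typing import List, Sequence


def _bin(p: float, n_bins: int) -> int:
    # Map a probability in [0, 1] to a bin index; p == 1.0 falls in the last bin.
    return min(int(p * n_bins), n_bins - 1)


def auroc(conf: Sequence[float], correct: Sequence[bool]) -> float:
    """Chance that a correct prediction outranks an incorrect one (ties count half)."""
    pos = [c for c, ok in zip(conf, correct) if ok]
    neg = [c for c, ok in zip(conf, correct) if not ok]
    if not pos or not neg:
        raise ValueError("need both correct and incorrect examples")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def fit_histogram(conf: Sequence[float], correct: Sequence[bool], n_bins: int = 10) -> List[float]:
    """Per-bin empirical accuracy; empty bins fall back to the bin midpoint."""
    hits, counts = [0] * n_bins, [0] * n_bins
    for p, ok in zip(conf, correct):
        b = _bin(p, n_bins)
        counts[b] += 1
        hits[b] += int(ok)
    return [hits[b] / counts[b] if counts[b] else (b + 0.5) / n_bins for b in range(n_bins)]


def calibrate(p: float, table: Sequence[float]) -> float:
    """Replace a raw confidence with the fitted accuracy of its bin."""
    return table[_bin(p, len(table))]


def ece(conf: Sequence[float], correct: Sequence[bool], n_bins: int = 10) -> float:
    """Expected calibration error: bin-weighted mean of |accuracy - mean confidence|."""
    hits, counts, conf_sum = [0] * n_bins, [0] * n_bins, [0.0] * n_bins
    for p, ok in zip(conf, correct):
        b = _bin(p, n_bins)
        counts[b] += 1
        hits[b] += int(ok)
        conf_sum[b] += p
    total = len(conf)
    return sum(
        counts[b] / total * abs(hits[b] / counts[b] - conf_sum[b] / counts[b])
        for b in range(n_bins) if counts[b]
    )
```

In a real evaluation the histogram would be fitted on one split and ECE reported on another; fitting and scoring on the same examples makes the calibrated error look better than it is.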
Limitations
- No detection of spam or self-harm, and weak on sexual content. Handle those outside the AI pipeline (e.g. regex / word-phrase triggers, or a larger model with conversational context).
- Off-distribution caveat: Civil Comments is news commentary, not live chat; on-distribution behaviour is best reflected by the (small) golden set.
- It is a triage classifier, not a final arbiter — automated enforcement on a 3B model carries false-positive risk. Gate auto-actions on calibrated logit-confidence and escalate uncertain cases to a human.
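For the categories the model does not cover, the first bullet suggests regex or word-phrase triggers. A minimal pre-filter along those lines might look like the sketch below. The patterns are hypothetical placeholders for illustration only, and `prefilter` is an assumed helper name; a real deployment would maintain its own trigger lists.

```python
import re
from typing import Optional

# Hypothetical trigger patterns: placeholders, not a vetted spam list.
SPAM_PATTERNS = [
    # The same link appearing twice in one message.
    re.compile(r"(https?://\S+).*\1", re.IGNORECASE | re.DOTALL),
    # A typical giveaway-scam phrase.
    re.compile(r"\bfree\s+(nitro|gift\s*cards?)\b", re.IGNORECASE),
]


def prefilter(message: str) -> Optional[str]:
    """Return "spam" if a trigger matches, else None so the message goes on to the model."""
    for pattern in SPAM_PATTERNS:
        if pattern.search(message):
            return "spam"
    return None


if __name__ == "__main__":
    print(prefilter("FREE nitro for everyone!"))
    print(prefilter("good morning all"))
```

Running a cheap filter like this before the model keeps the classifier focused on the categories it was trained for.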
Training
QLoRA fine-tune of Qwen2.5-3B-Instruct (2 epochs), merged to fp16. Data: a curated golden set + a balanced, de-duplicated sample of Civil Comments, with confidence targets derived from annotator-agreement. Released Apache-2.0.
Model provider: Hagrun
Model tree: Qwen/Qwen2.5-3B-Instruct (base), this model (fine-tuned)
Modalities: text input, text output