Hagrun/moddog-l1-safety-qwen2.5-3b API & Inference Endpoint

Intended use

Purpose-trained for AI content moderation: it classifies a message into a safety category and reports a confidence score. Suitable for forums, chat apps, Discord, or any text moderation pipeline. It is typically used as the first, cheap tier of a triage system, where a downstream policy decides what to do with the verdict (e.g. a confidence-gated enforcement ladder, with automatic action on high-confidence, clear cases and human review for borderline ones).

Input / output

Prompt the model with a moderation rubric as the system message, plus the message to judge as the user message; it returns strict JSON:

```json
{"flag": true, "category": "harassment", "confidence": 0.55, "reason": "short"}
```

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "Hagrun/moddog-l1-safety-qwen2.5-3b"

tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="cuda")

messages = [
    {"role": "system", "content": "<your moderation rubric>"},
    {"role": "user", "content": 'Message to judge: "you absolute idiot"'},
]
# Render the chat template to text, ending with the assistant turn header.
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tok(prompt, return_tensors="pt").to(model.device)

# Greedy decoding keeps the verdict deterministic.
out = model.generate(**inputs, max_new_tokens=96, do_sample=False)

# Decode only the newly generated tokens; out[0] also contains the prompt.
new_tokens = out[0][inputs["input_ids"].shape[1]:]
print(tok.decode(new_tokens, skip_special_tokens=True))
```
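
Downstream code usually needs the reply as a typed object rather than a string. The following is a minimal sketch of a defensive parser, assuming the four fields of the JSON shape shown above; `Verdict` and `parse_verdict` are hypothetical names for illustration and are not shipped with the model.

```python
import json
from dataclasses import dataclass
from typing import Optional


@dataclass
class Verdict:
    flag: bool
    category: str
    confidence: float  # the number the model writes in its JSON
    reason: str


def parse_verdict(text: str) -> Optional[Verdict]:
    """Parse the model's reply; return None if it is not a usable verdict."""
    # Tolerate stray text around the object by slicing the outermost braces.
    start, end = text.find("{"), text.rfind("}")
    if start == -1 or end <= start:
        return None
    try:
        obj = json.loads(text[start:end + 1])
    except json.JSONDecodeError:
        return None
    # "flag" must be a real JSON boolean, not a string such as "yes".
    if not isinstance(obj, dict) or not isinstance(obj.get("flag"), bool):
        return None
    try:
        return Verdict(
            flag=obj["flag"],
            category=str(obj.get("category", "")),
            confidence=float(obj.get("confidence", 0.0)),
            reason=str(obj.get("reason", "")),
        )
    except (TypeError, ValueError):
        return None
```

Returning `None` instead of raising lets the caller treat an unparseable reply as an uncertain case, for example by sending it to human review.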

Confidence — read it from the logits, not the text

The most useful finding from building this model: the confidence number the model writes is not trustworthy, but the confidence in its logits is.

The emitted confidence field is non-discriminative — its AUROC vs. actual correctness is ≈ 0.5 (a coin flip). Don't gate on it.
The model's probability on the flag-decision token (read from logprobs) discriminates well — AUROC ≈ 0.71–0.79 — and calibrates cleanly (ECE ≈ 0.045 after a simple histogram fit).
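
The card does not spell out the exact histogram fit, so the following is only a sketch of one common form: equal-width histogram binning, where each bin's calibrated confidence is the accuracy observed in that bin, together with an expected calibration error (ECE) function. The function names, the bin count of 10, and the midpoint fallback for empty bins are all assumptions made for illustration.

```python
from typing import List, Sequence


def _bin_index(p: float, n_bins: int) -> int:
    # Map a probability in [0, 1] to an equal-width bin; 1.0 goes in the last bin.
    return max(0, min(int(p * n_bins), n_bins - 1))


def fit_histogram(conf: Sequence[float], correct: Sequence[bool],
                  n_bins: int = 10) -> List[float]:
    """Return per-bin empirical accuracy, used as the calibrated confidence."""
    hits = [0] * n_bins
    counts = [0] * n_bins
    for p, ok in zip(conf, correct):
        b = _bin_index(p, n_bins)
        counts[b] += 1
        hits[b] += int(ok)
    # Empty bins fall back to the bin midpoint (an arbitrary choice in this sketch).
    return [hits[b] / counts[b] if counts[b] else (b + 0.5) / n_bins
            for b in range(n_bins)]


def calibrate(p: float, table: Sequence[float]) -> float:
    """Look up the calibrated confidence for a raw confidence p."""
    return table[_bin_index(p, len(table))]


def expected_calibration_error(conf: Sequence[float], correct: Sequence[bool],
                               n_bins: int = 10) -> float:
    """ECE: bin-size-weighted mean of |accuracy - mean confidence| per bin."""
    sums = [0.0] * n_bins
    hits = [0] * n_bins
    counts = [0] * n_bins
    for p, ok in zip(conf, correct):
        b = _bin_index(p, n_bins)
        counts[b] += 1
        sums[b] += p
        hits[b] += int(ok)
    total = sum(counts)
    if total == 0:
        raise ValueError("no samples")
    return sum(abs(hits[b] / counts[b] - sums[b] / counts[b]) * counts[b] / total
               for b in range(n_bins) if counts[b])
```

Fit the table on one split and measure ECE on another; ECE measured on the same data used for the fit is optimistic by construction.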

In practice this enables a high-confidence auto-action tier: on the held-out eval, thresholding on calibrated logit-confidence covered ~79% of cases with 0 high-confidence false positives, and the uncertain remainder was routed to human review. To use it, request token logprobs at inference (e.g. logprobs/top_logprobs via an OpenAI-compatible server, or output_scores=True together with return_dict_in_generate=True in transformers) and compute P(true) / (P(true) + P(false)) at the token that carries the "flag" value.
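
The ratio described above can be sketched as a small helper. It assumes you have already located the position of the "flag" value token and hold that position's alternatives as a mapping from token string to log-probability (the shape that top_logprobs entries take on OpenAI-compatible servers). `flag_probability` is a hypothetical name, and pooling whitespace and case variants of the same literal is an assumption about tokenization, not something the card specifies.

```python
import math
from typing import Mapping


def flag_probability(top_logprobs: Mapping[str, float]) -> float:
    """Return P(true) / (P(true) + P(false)) from one position's top logprobs."""
    p_true = 0.0
    p_false = 0.0
    for token, logprob in top_logprobs.items():
        # Assumption: variants such as " true" and "true" count as the same literal.
        literal = token.strip().lower()
        if literal == "true":
            p_true += math.exp(logprob)
        elif literal == "false":
            p_false += math.exp(logprob)
    total = p_true + p_false
    if total == 0.0:
        raise ValueError("neither 'true' nor 'false' appears in top_logprobs")
    return p_true / total
```

The result is the probability that the flag is true. When the model's decision is `false`, the confidence in that decision is one minus this value. If only one of the two literals appears among the returned alternatives, the other is treated as having zero probability, so request enough alternatives for both to show up.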

Evaluation

Two held-out sets: a small human-curated golden set (on-distribution chat) and a larger balanced Civil Comments set (off-distribution news comments).

| Metric | Base Qwen2.5-3B | This model |
|---|---|---|
| Golden set flag accuracy (n = 24) | 18/24 | 19/24 |
| Civil Comments accuracy (n = 294) | 64% | 81% |
| Logit-confidence AUROC (golden) | n/a | 0.79 |
| Logit-confidence ECE (golden, calibrated) | n/a | 0.045 |

Confidence/calibration numbers are for the logit-derived confidence described above, not the emitted number.
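
For readers who want to reproduce this kind of number on their own data: the AUROC of a confidence score against correctness is the probability that a randomly chosen correct prediction receives a higher confidence than a randomly chosen incorrect one, with ties counted as half. The following pure-Python sketch computes it by direct pair counting; it is not the author's evaluation script, and `auroc` is a hypothetical name.

```python
from typing import Sequence


def auroc(scores: Sequence[float], correct: Sequence[bool]) -> float:
    """Fraction of (correct, incorrect) pairs ranked in the right order."""
    pos = [s for s, ok in zip(scores, correct) if ok]
    neg = [s for s, ok in zip(scores, correct) if not ok]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect prediction")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5  # ties count as half a win
    return wins / (len(pos) * len(neg))
```

The quadratic pair loop is fine for evaluation sets of a few hundred items; for larger sets a rank-based implementation such as scikit-learn's `roc_auc_score` is the usual choice.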

Limitations

No detection of spam or self-harm, and weak on sexual content. Handle those outside the AI pipeline (e.g. regex / word-phrase triggers, or a larger model with conversational context).
Off-distribution caveat: Civil Comments is news commentary, not live chat; on-distribution behaviour is best reflected by the (small) golden set.
It is a triage classifier, not a final arbiter — automated enforcement on a 3B model carries false-positive risk. Gate auto-actions on calibrated logit-confidence and escalate uncertain cases to a human.
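
The gating advice above can be sketched as a small routing function. The threshold of 0.9 is a placeholder rather than a value from this card, the action names are hypothetical, and letting high-confidence unflagged messages through without review is an assumption about the downstream policy.

```python
from enum import Enum


class Action(Enum):
    AUTO_ACTION = "auto_action"    # enforce without a human in the loop
    HUMAN_REVIEW = "human_review"  # escalate to a moderator
    ALLOW = "allow"                # let the message through


def route(flag: bool, calibrated_conf: float, auto_threshold: float = 0.9) -> Action:
    """Route a verdict given the calibrated confidence in the model's own decision."""
    # Anything below the threshold is uncertain, whichever way the flag points.
    if calibrated_conf < auto_threshold:
        return Action.HUMAN_REVIEW
    return Action.AUTO_ACTION if flag else Action.ALLOW
```

Choose the threshold on a held-out set from your own traffic, picking the lowest value at which the auto-action tier shows an acceptable false-positive rate.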

Training

QLoRA fine-tune of Qwen2.5-3B-Instruct (2 epochs), merged to fp16. Data: a curated golden set plus a balanced, de-duplicated sample of Civil Comments, with confidence targets derived from annotator agreement. Released under Apache-2.0.
