README

License: apache-2.0

Intended use

Purpose-trained for AI content moderation: it classifies a message into one safety category and reports a confidence score. Suitable for forums, chat apps, Discord, or any text moderation pipeline. Typically used as the first cheap tier of a triage system, where a downstream policy decides what to do with the verdict (e.g. a confidence-gated enforcement ladder: auto-action on high-confidence, clear cases and human review for borderline ones).
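A confidence-gated ladder of the kind described above might look like the sketch below. The `Verdict` type, the `route` function, the step names, and the 0.9 threshold are all hypothetical and only illustrate the shape of such a policy; `confidence` here means confidence in the decision itself, whichever way it went.

```python
from dataclasses import dataclass


@dataclass
class Verdict:
    flag: bool         # True if the message was judged to violate policy
    category: str      # one of the model's safety categories
    confidence: float  # confidence in the decision, in [0, 1]


def route(verdict: Verdict, auto_threshold: float = 0.9) -> str:
    """Map a verdict to an enforcement step (hypothetical three-step ladder)."""
    if verdict.confidence < auto_threshold:
        return "human_review"  # borderline: escalate to a person
    if not verdict.flag:
        return "allow"         # confidently benign
    return "auto_action"       # confidently violating


print(route(Verdict(flag=True, category="harassment", confidence=0.95)))
print(route(Verdict(flag=True, category="harassment", confidence=0.55)))
```

The threshold is a policy choice and would normally be tuned on held-out data rather than fixed in code.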

Categories

benign, toxicity, harassment, hate_speech, sexual_content, violence.

Input / output

Prompt the model with a moderation system rubric + the message to judge; it returns strict JSON:

```json
{"flag": true, "category": "harassment", "confidence": 0.55, "reason": "short"}
```

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

MODEL_ID = "Hagrun/moddog-l1-safety-qwen2.5-3b"
tok = AutoTokenizer.from_pretrained(MODEL_ID)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.float16, device_map="cuda")

messages = [
    {"role": "system", "content": "<your moderation rubric>"},
    {"role": "user", "content": 'Message to judge: "you absolute idiot"'},
]
# Render the chat template to text, ending with the assistant turn opener.
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tok(prompt, return_tensors="pt").to(model.device)
# Greedy decoding keeps the JSON verdict deterministic.
out = model.generate(**inputs, max_new_tokens=96, do_sample=False)
# Decode only the newly generated tokens, not the echoed prompt.
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
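Before acting on the output it is worth validating it. The sketch below parses the decoded text and checks it against the category list from this card. `parse_verdict` is a hypothetical helper, and it assumes the verdict is a flat JSON object that comes last in the text.

```python
import json

CATEGORIES = {"benign", "toxicity", "harassment", "hate_speech",
              "sexual_content", "violence"}


def parse_verdict(text: str) -> dict:
    """Extract and validate the JSON verdict from decoded model output."""
    # The verdict is a flat object, so the last "{" starts it even when the
    # decoded text still contains the echoed prompt.
    start, end = text.rfind("{"), text.rfind("}")
    if start == -1 or end < start:
        raise ValueError("no JSON object found in model output")
    verdict = json.loads(text[start:end + 1])
    if not isinstance(verdict.get("flag"), bool):
        raise ValueError("'flag' must be a JSON boolean")
    if verdict.get("category") not in CATEGORIES:
        raise ValueError(f"unknown category: {verdict.get('category')!r}")
    return verdict


sample = '{"flag": true, "category": "harassment", "confidence": 0.55, "reason": "short"}'
print(parse_verdict(sample)["category"])
```

Malformed output raises an exception, so a caller can route it to human review instead of guessing.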

Confidence — read it from the logits, not the text

The most useful finding from building this model: the confidence number the model writes is not trustworthy, but the confidence in its logits is.

  • The emitted confidence field is non-discriminative — its AUROC vs. actual correctness is ≈ 0.5 (a coin flip). Don't gate on it.
  • The model's probability on the flag-decision token (read from logprobs) discriminates well — AUROC ≈ 0.71–0.79 — and calibrates cleanly (ECE ≈ 0.045 after a simple histogram fit).
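For readers unfamiliar with the two calibration terms used above, here is a minimal pure-Python sketch of histogram calibration and expected calibration error (ECE). The function names, the 10-bin default, and the toy data are assumptions for illustration; the card does not say how the author's fit was implemented.

```python
def _bin(c: float, n_bins: int) -> int:
    """Index of the equal-width bin that confidence c falls into."""
    return min(int(c * n_bins), n_bins - 1)


def fit_histogram(conf, correct, n_bins=10):
    """Per-bin empirical accuracy; empty bins fall back to the bin midpoint."""
    hits, counts = [0] * n_bins, [0] * n_bins
    for c, y in zip(conf, correct):
        b = _bin(c, n_bins)
        hits[b] += int(y)
        counts[b] += 1
    return [hits[b] / counts[b] if counts[b] else (b + 0.5) / n_bins
            for b in range(n_bins)]


def calibrate(c, table):
    """Replace a raw confidence with the accuracy observed in its bin."""
    return table[_bin(c, len(table))]


def ece(conf, correct, n_bins=10):
    """Bin-weighted mean of |accuracy - mean confidence|."""
    sums, hits, counts = [0.0] * n_bins, [0] * n_bins, [0] * n_bins
    for c, y in zip(conf, correct):
        b = _bin(c, n_bins)
        sums[b] += c
        hits[b] += int(y)
        counts[b] += 1
    n = len(conf)
    return sum(abs(hits[b] / counts[b] - sums[b] / counts[b]) * counts[b] / n
               for b in range(n_bins) if counts[b])


# Toy data: raw confidences and whether each decision was correct.
raw = [0.95, 0.95, 0.95, 0.95, 0.55, 0.55]
ok = [1, 1, 1, 0, 1, 0]
table = fit_histogram(raw, ok)
print(ece(raw, ok), ece([calibrate(c, table) for c in raw], ok))
```

The toy example fits and evaluates on the same points only to keep it short; in practice the table would be fit on a separate calibration split.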

In practice this enables a high-confidence auto-action tier. On the held-out eval, thresholding on calibrated logit-confidence covered ~79% of cases with 0 high-confidence false positives, and the uncertain remainder was routed to human review. To use it, request token logprobs at inference (e.g. logprobs/top_logprobs via an OpenAI-compatible server, or output_scores with transformers) and compute P(true) / (P(true) + P(false)) at the token that carries the "flag" value.
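The two steps in this paragraph can be sketched as follows. `flag_probability` takes a token-to-logprob mapping for the position of the "flag" value, however it was obtained. It assumes `true` and `false` each appear as a single token, possibly with a leading space, which depends on the tokenizer. `zero_error_threshold` is a hypothetical helper that finds the lowest threshold with no errors above it on a labelled eval set.

```python
import math


def flag_probability(top_logprobs: dict) -> float:
    """P(true) / (P(true) + P(false)) from token logprobs at the flag value."""
    p_true = p_false = 0.0
    for token, logprob in top_logprobs.items():
        t = token.strip().lower()  # tolerate " true", "True", etc.
        if t == "true":
            p_true += math.exp(logprob)
        elif t == "false":
            p_false += math.exp(logprob)
    if p_true + p_false == 0.0:
        raise ValueError("neither 'true' nor 'false' found in top_logprobs")
    return p_true / (p_true + p_false)


def zero_error_threshold(conf, correct):
    """Lowest threshold with no wrong decision at or above it, plus coverage."""
    worst_wrong = max((c for c, y in zip(conf, correct) if not y), default=None)
    if worst_wrong is None:
        t = min(conf)  # no errors at all: everything can be auto-actioned
    else:
        above = [c for c in conf if c > worst_wrong]
        if not above:
            return None, 0.0  # the most confident case is itself wrong
        t = min(above)
    return t, sum(c >= t for c in conf) / len(conf)


print(flag_probability({" true": math.log(0.6), " false": math.log(0.2)}))
print(zero_error_threshold([0.99, 0.97, 0.9, 0.8, 0.6], [1, 1, 1, 0, 1]))
```

A threshold chosen this way is only as reliable as the eval set it was fit on, so a small set gives weak guarantees.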

Evaluation

Two held-out sets: a small human-curated golden set (on-distribution chat) and a larger balanced Civil Comments set (off-distribution news comments).

| metric | base Qwen2.5-3B | this model |
|---|---|---|
| golden flag-accuracy (24) | 18/24 | 19/24 |
| Civil Comments accuracy (294) | 64% | 81% |
| logit-confidence AUROC (golden) | | 0.79 |
| logit-confidence ECE (golden, calibrated) | | 0.045 |

Confidence/calibration numbers are for the logit-derived confidence described above, not the emitted number.
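As a reminder of what the AUROC rows measure: it is the probability that a randomly chosen correct decision gets a higher confidence than a randomly chosen incorrect one, so 0.5 means the score carries no signal. A minimal pairwise sketch is below. The function name and toy inputs are illustrative, not the evaluation code used for this card.

```python
def auroc(scores, correct):
    """Fraction of (correct, incorrect) pairs ranked in the right order."""
    pos = [s for s, y in zip(scores, correct) if y]
    neg = [s for s, y in zip(scores, correct) if not y]
    if not pos or not neg:
        raise ValueError("need at least one correct and one incorrect case")
    # Ties count as half a win, matching the usual AUROC convention.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0
               for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


print(auroc([0.9, 0.8, 0.3, 0.2], [1, 1, 0, 0]))  # perfectly separated scores
print(auroc([0.5, 0.5, 0.5, 0.5], [1, 0, 1, 0]))  # a constant score
```

The pairwise loop is quadratic, which is fine for eval sets of this size; a library routine such as scikit-learn's `roc_auc_score` is the usual choice for larger ones.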

Limitations

  • No detection of spam or self-harm, and weak on sexual content. Handle those outside the AI pipeline (e.g. regex / word-phrase triggers, or a larger model with conversational context).
  • Off-distribution caveat: Civil Comments is news commentary, not live chat; on-distribution behaviour is best reflected by the (small) golden set.
  • It is a triage classifier, not a final arbiter — automated enforcement on a 3B model carries false-positive risk. Gate auto-actions on calibrated logit-confidence and escalate uncertain cases to a human.
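The first limitation suggests handling spam and self-harm outside the model. One way to do that is a small trigger-phrase pre-filter that runs before the classifier, sketched below. The patterns and labels are hypothetical placeholders, not a vetted list, and `prefilter` is an illustrative name.

```python
import re
from typing import Optional

# Hypothetical trigger phrases, for illustration only. A real deployment
# needs a maintained, reviewed list.
TRIGGERS = {
    "spam": re.compile(r"\b(?:free\s+nitro|click\s+here|buy\s+followers)\b",
                       re.IGNORECASE),
    "self_harm": re.compile(r"\b(?:kill\s+myself|end\s+my\s+life)\b",
                            re.IGNORECASE),
}


def prefilter(message: str) -> Optional[str]:
    """Return the first matching trigger label, or None to defer to the model."""
    for label, pattern in TRIGGERS.items():
        if pattern.search(message):
            return label
    return None


print(prefilter("Click here for FREE nitro"))
print(prefilter("good game everyone"))
```

Messages that return `None` go on to the classifier. Matches can be sent to a dedicated handling path, such as a support-resources response for self-harm.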

Training

QLoRA fine-tune of Qwen2.5-3B-Instruct (2 epochs), merged to fp16. Data: a curated golden set + a balanced, de-duplicated sample of Civil Comments, with confidence targets derived from annotator-agreement. Released Apache-2.0.
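The card does not say how confidence targets were derived from annotator agreement. One plausible reading, shown here purely as an assumption, is that the flag follows the majority vote and the confidence is the share of annotators who agree with it. Civil Comments provides toxicity as a fraction of annotators, so this mapping would apply directly there.

```python
def agreement_target(toxic_fraction: float) -> tuple:
    """Map an annotator toxicity fraction to a (flag, confidence) target.

    Hypothetical mapping: flag is the majority vote, confidence is the
    share of annotators who agree with that vote.
    """
    if not 0.0 <= toxic_fraction <= 1.0:
        raise ValueError("toxic_fraction must be in [0, 1]")
    flag = toxic_fraction >= 0.5
    confidence = toxic_fraction if flag else 1.0 - toxic_fraction
    return flag, round(confidence, 2)


print(agreement_target(0.8))
print(agreement_target(0.1))
```

Under this mapping an even split gives a confidence of 0.5, the lowest possible value, which marks the example as borderline.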

Model provider

Hagrun

Model tree

Base

Qwen/Qwen2.5-3B-Instruct

Fine-tuned

this model

Modalities

Input

Text

Output

Text
