giannor

Qwen3.5-27B-psysafe

README

License: apache-2.0

Model Details

Property	Value
Base model	Qwen/Qwen3.5-27B
Fine-tuning base	unsloth/Qwen3.5-27B
Architecture	Dense, 27B parameters
Precision	BF16
Training method	Supervised Fine-Tuning (SFT) with LoRA
Training hardware	NVIDIA H100
Language	English
Paper	arXiv:2606.09697 (https://arxiv.org/abs/2606.09697)
Code	github.com/aisilab/psychological-safety
W&B run

Intended Use

This model is designed for deployments where psychologically safe refusals are critical, such as:

Mental health support platforms
Crisis-intervention or safeguarding tools
Safety-layer components in consumer-facing LLM applications
Research into helpful and harm-preventive AI behavior

It is not recommended as a general-purpose assistant without additional evaluation, and should not be deployed as a standalone clinical tool.

Related Paper

Please cite this paper if you find this work useful:

```bibtex
@misc{barmina2026psychosafeelicitingpsychologicallyinformedrefusals,
      title={PsychoSafe: Eliciting Psychologically-Informed Refusals in Large Language Models},
      author={Gianluca Barmina and Federico Torrielli and Sven Harms and Jacob Nielsen and Felix Mächtle and Stine Lyngsø Beltoft and Peter Schneider-Kamp and Thomas Eisenbarth and Lukas Galke Poech and Anne Lauscher},
      year={2026},
      eprint={2606.09697},
      archivePrefix={arXiv},
      primaryClass={cs.CL},
      url={https://arxiv.org/abs/2606.09697},
}
```

The PSYCHOSAFE Framework

PSYCHOSAFE treats refusal as a structured, communicative, supportive act rather than a binary safety decision. All refusals follow a four-part structure:

Acknowledgment & Gentle Refusal — Declines to provide harmful content while warmly acknowledging the person.
Personalized Self-Help Step — Applies a domain-appropriate psychological intervention strategy (e.g., Psychological First Aid, Motivational Interviewing) tailored to the user's expressed situation.
Professional Resources — Refers the user to relevant helplines and support services.
Hopeful Closing — Ends with a brief, sincere, personalized message of hope.
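The four-part structure above can be pictured as a small data structure. This is a minimal sketch for illustration only: the class name `PsychoSafeRefusal`, its field names, and the example wording are hypothetical and are not taken from the PSYCHOSAFE code or dataset.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class PsychoSafeRefusal:
    """Hypothetical container for the four parts of a PSYCHOSAFE-style refusal."""

    acknowledgment_and_refusal: str
    self_help_step: str
    professional_resources: str
    hopeful_closing: str

    def is_complete(self) -> bool:
        # Treat the refusal as complete only if every part has non-blank text.
        return all(getattr(self, f.name).strip() for f in fields(self))

    def render(self) -> str:
        # Join the parts in the order the framework lists them.
        return "\n\n".join(getattr(self, f.name).strip() for f in fields(self))


# Placeholder wording, written for this sketch (not model output).
example = PsychoSafeRefusal(
    acknowledgment_and_refusal="I can hear how hard this is, and I can't help with that request.",
    self_help_step="For the next few minutes, try moving somewhere you feel a little safer.",
    professional_resources="A local crisis line or emergency service can talk with you right now.",
    hopeful_closing="You reached out today, and that matters.",
)
print(example.render())
```

A check like `is_complete()` could serve as a cheap structural test on generated responses, although it says nothing about whether the content is psychologically appropriate.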

Risk Domains

The model is specifically trained to handle five psychologically salient risk clusters:

Domain	Intervention Strategies
Suicide & Self-Harm	Psychological First Aid, Safety Planning, QPR Gatekeeper Training, Mental Health First Aid
Substance Use	Motivational Interviewing, 5A's Brief Intervention, SOBER
Violence	Green Dot Bystander Intervention, Motivational Interviewing
Weapons	Green Dot Bystander Intervention, Motivational Interviewing
Sexual Crimes	Green Dot Bystander Intervention, Motivational Interviewing

Outside these five domains, the model behaves as a normal helpful assistant. Educational and research-oriented questions about sensitive topics are answered informatively, with context used to distinguish intent.
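For downstream tooling (for example, tagging evaluation prompts), the domain-to-strategy table can be written as a lookup. This is a sketch under stated assumptions: the names `INTERVENTION_STRATEGIES` and `strategies_for` are hypothetical, and the model itself learns this association from data rather than consulting a table.

```python
from typing import Optional, Tuple

# Strategies per risk domain, copied from the table above.
INTERVENTION_STRATEGIES = {
    "Suicide & Self-Harm": (
        "Psychological First Aid",
        "Safety Planning",
        "QPR Gatekeeper Training",
        "Mental Health First Aid",
    ),
    "Substance Use": ("Motivational Interviewing", "5A's Brief Intervention", "SOBER"),
    "Violence": ("Green Dot Bystander Intervention", "Motivational Interviewing"),
    "Weapons": ("Green Dot Bystander Intervention", "Motivational Interviewing"),
    "Sexual Crimes": ("Green Dot Bystander Intervention", "Motivational Interviewing"),
}


def strategies_for(domain: Optional[str]) -> Tuple[str, ...]:
    """Return the listed strategies, or an empty tuple for out-of-domain input."""
    if domain is None:
        return ()
    return INTERVENTION_STRATEGIES.get(domain, ())


print(strategies_for("Substance Use"))
print(strategies_for("Cooking"))  # out of domain: no intervention strategy applies
```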

Training Data

The model was fine-tuned on the PSYCHOSAFE dataset: 8,019 prompt–response pairs spanning the five risk domains above. Each response was hand-crafted following the four-part PSYCHOSAFE template, grounded in specific psychological intervention strategies, and reviewed for psychological appropriateness by a domain expert.

Reasoning traces were imputed using GPT-OSS-120B, and the model was trained with cross-entropy loss on both the reasoning traces and the human-crafted responses (not the user prompts).
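Training on the reasoning trace and response but not on the user prompt is commonly implemented by masking prompt positions in the label sequence. The sketch below assumes the usual convention of a label value of -100, which PyTorch's cross-entropy loss ignores by default. The function name and the toy token IDs are hypothetical, and the actual training code may mask differently.

```python
from typing import List, Tuple

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def build_labels(
    prompt_ids: List[int],
    reasoning_ids: List[int],
    response_ids: List[int],
) -> Tuple[List[int], List[int]]:
    """Concatenate the three segments and mask the prompt out of the loss."""
    input_ids = prompt_ids + reasoning_ids + response_ids
    # Prompt tokens get IGNORE_INDEX; reasoning and response tokens keep their IDs.
    labels = [IGNORE_INDEX] * len(prompt_ids) + reasoning_ids + response_ids
    return input_ids, labels


# Toy token IDs, chosen only to make the masking visible.
input_ids, labels = build_labels([11, 12, 13], [21, 22], [31, 32, 33])
print(input_ids)
print(labels)
```

The model still sees the prompt tokens as context; they just contribute nothing to the gradient.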

Risk cluster	Examples
Suicide and Self-Harm	2,578
Substance Use	1,998
Weapons	1,740
Violence	1,377
Sexual Crimes	326
Total	8,019
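The cluster counts are uneven, which a few lines of arithmetic make concrete. The sketch below only reuses the numbers from the table; the variable and function names are illustrative.

```python
# Example counts per risk cluster, copied from the table above.
CLUSTER_COUNTS = {
    "Suicide and Self-Harm": 2578,
    "Substance Use": 1998,
    "Weapons": 1740,
    "Violence": 1377,
    "Sexual Crimes": 326,
}


def cluster_shares(counts):
    """Return each cluster's fraction of the whole dataset."""
    total = sum(counts.values())
    return {name: n / total for name, n in counts.items()}


total_examples = sum(CLUSTER_COUNTS.values())
assert total_examples == 8019  # matches the reported total

for name, share in cluster_shares(CLUSTER_COUNTS).items():
    print(f"{name}: {share:.1%}")
```

Sexual Crimes makes up roughly 4% of the training pairs, whereas the evaluation set described later is stratified at 100 prompts per cluster, so per-cluster results for that domain rest on comparatively little training data.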

Training Procedure

Fine-tuning used LoRA applied to all attention and feed-forward projection layers, with the following configuration:

Hyperparameter	Value
Method	Supervised Fine-Tuning (SFT)
LoRA rank	r = 1
LoRA alpha	α = 32
Dropout	None
Epochs	5
Max sequence length	4,096 tokens
Batch size	4
Gradient accumulation	None
Optimizer	AdamW (8-bit quantization)
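To see what rank 1 with α = 32 implies, the following sketch computes the adapter scaling factor and the trainable parameter count for a single adapted weight matrix. It assumes the standard LoRA formulation (update scaled by α / r, with factors of shape r × d_in and d_out × r); the 4096 × 4096 layer size is a hypothetical example, not the model's actual dimension.

```python
def lora_scaling(alpha: float, rank: int) -> float:
    """Standard LoRA multiplies the low-rank update B @ A by alpha / rank."""
    return alpha / rank


def lora_param_count(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters for one adapted weight: A (rank x d_in) plus B (d_out x rank)."""
    return rank * (d_in + d_out)


scaling = lora_scaling(alpha=32, rank=1)
# Hypothetical square projection, used only to illustrate the order of magnitude.
adapter_params = lora_param_count(d_in=4096, d_out=4096, rank=1)
full_params = 4096 * 4096

print(f"scaling factor: {scaling}")
print(f"adapter params per matrix: {adapter_params}")
print(f"fraction of the full matrix: {adapter_params / full_params:.4%}")
```

Under these assumptions a rank-1 adapter trains a tiny fraction of each matrix, and the large α / r ratio compensates by amplifying the update.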

Framework Versions

Library	Version
TRL	0.24.0
Transformers	5.2.0
PyTorch	2.10.0
Datasets	4.3.0
Tokenizers	0.22.2
Unsloth	latest
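To compare a local environment against the versions listed above, a small check with the standard library is usually enough. This is a sketch: the PyPI distribution names (for example `torch` for PyTorch) are assumed, Unsloth is omitted because no version is pinned, and local build suffixes such as `+cu128` are ignored when comparing.

```python
from importlib import metadata

# Versions from the table above, keyed by the assumed PyPI distribution name.
PINNED = {
    "trl": "0.24.0",
    "transformers": "5.2.0",
    "torch": "2.10.0",
    "datasets": "4.3.0",
    "tokenizers": "0.22.2",
}


def find_mismatches(pinned, get_version=metadata.version):
    """Return {package: (expected, installed)} for packages that differ or are missing."""
    mismatches = {}
    for package, expected in pinned.items():
        try:
            installed = get_version(package)
        except metadata.PackageNotFoundError:
            installed = None
        # Drop local build tags such as "+cu128" before comparing.
        base = installed.split("+")[0] if installed else None
        if base != expected:
            mismatches[package] = (expected, installed)
    return mismatches


if __name__ == "__main__":
    for package, (expected, installed) in find_mismatches(PINNED).items():
        print(f"{package}: expected {expected}, found {installed}")
```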

Evaluation Results

The model was evaluated on a held-out validation set of 500 stratified prompts (100 per risk cluster), using Qwen3.5-397B-A17B as the LLM judge (the candidate with highest agreement with human ratings, Cohen's κ = 0.61). Four criteria were assessed: Refusal (boolean), Psychological Grounding (1–5), External References (boolean), and Relevance (1–5).
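Cohen's κ, used above to select the judge, corrects raw agreement for the agreement expected by chance. The sketch below implements the unweighted form on toy labels; the labels are invented for illustration, and the source does not say whether a weighted variant was used for the 1–5 criteria.

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(rater_a: Sequence[Hashable], rater_b: Sequence[Hashable]) -> float:
    """Unweighted Cohen's kappa for two raters labelling the same items."""
    if len(rater_a) != len(rater_b) or not rater_a:
        raise ValueError("both raters must label the same non-empty set of items")
    n = len(rater_a)
    # Observed agreement: fraction of items with identical labels.
    observed = sum(a == b for a, b in zip(rater_a, rater_b)) / n
    # Chance agreement: product of each rater's marginal label frequencies.
    counts_a, counts_b = Counter(rater_a), Counter(rater_b)
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label throughout
    return (observed - expected) / (1.0 - expected)


# Toy refusal judgments (1 = refused, 0 = complied), not real annotation data.
human = [1, 1, 0, 0, 1, 0, 1, 1]
judge = [1, 1, 0, 1, 1, 0, 0, 1]
print(round(cohens_kappa(human, judge), 3))
```

In this toy case the raters agree on 6 of 8 items (0.75), but chance agreement is already 0.53, so κ is well below the raw agreement rate.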

Setting	Refusal (%)	Psych. Grounding	Ext. References (%)	Relevance	Overall (%)
Base model, generic prompt (v0)	90.6	3.38 ± 1.17	64.8	3.90 ± 0.86	71.9
Base model, PSYCHOSAFE prompt (v1)	96.0	4.56 ± 0.86	95.2	4.52 ± 0.74	92.0
, generic prompt (v0)

Key findings for the fine-tuned model, relative to the base model with the generic prompt:

+15.1% overall refusal quality (with the generic prompt)
+53.9% external resource referral rate
+14.2% psychological grounding
Refusal rate of 100%, up from 90.6%
Reduced relevance (−13.5%), likely due to over-application of crisis-intervention templates to ambiguous prompts
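The findings above are given as percentages without stating whether they are relative changes or percentage-point differences. As an illustration of the distinction, the sketch below computes both for the two base-model rows that are fully reported in the table (generic prompt v0 versus PSYCHOSAFE prompt v1). The function names are illustrative, and nothing here reconstructs the fine-tuned model's row.

```python
def point_difference(baseline: float, new: float) -> float:
    """Absolute difference, in the same units as the inputs (e.g. percentage points)."""
    return new - baseline


def relative_change_pct(baseline: float, new: float) -> float:
    """Change expressed as a percentage of the baseline value."""
    return (new - baseline) / baseline * 100.0


# Base model rows from the table: (generic prompt v0, PSYCHOSAFE prompt v1).
BASE_MODEL_ROWS = {
    "Refusal (%)": (90.6, 96.0),
    "Ext. References (%)": (64.8, 95.2),
    "Overall (%)": (71.9, 92.0),
}

for metric, (v0, v1) in BASE_MODEL_ROWS.items():
    print(
        f"{metric}: {point_difference(v0, v1):+.1f} points, "
        f"{relative_change_pct(v0, v1):+.1f}% relative"
    )
```

The two readings can differ substantially when the baseline is far from 100, which is worth keeping in mind when comparing the headline figures to the table.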

Out-of-Domain Safety Benchmarks

SORRY-Bench (compliance rate %, lower is safer):

Prompt	Base Qwen3.5-27B	This model
Default (base prompts)	17.1	0.0
Generic prompt v0	13.2	0.0
PSYCHOSAFE prompt v1	13.6	0.1
Default (mutation avg.)	25.4	0.0
Generic prompt v0 (mutation avg.)	25.4	0.0
PSYCHOSAFE prompt v1 (mutation avg.)

XSTest (over-refusal on safe prompts ↓ / safety on unsafe prompts ↑):

Prompt	Over-refusal (base)	Safety (base)	Over-refusal (this model)	Safety (this model)
Default	13.2%	59.0%	3.6%	17.0%
Generic v0	12.4%	63.0%	4.8%	15.0%
PSYCHOSAFE v1	24.0%	78.5%	9.2%	26.5%

With the default prompt, the fine-tuned model over-refuses less than the base on benign prompts (3.6% vs. 13.2%), which rules out indiscriminate refusal. Its lower safety rate on adversarial out-of-domain prompts (17.0% vs. 59.0%) reflects limited generalization beyond the five training domains.
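The two XSTest columns can be read as refusal rates over two disjoint prompt sets. The sketch below assumes that "safety" means the share of unsafe prompts the model refuses and that each response has already been judged as refused or not; the function name and toy records are hypothetical, and the actual evaluation harness may define the metrics differently.

```python
from typing import Iterable, Tuple


def xstest_rates(records: Iterable[Tuple[bool, bool]]) -> Tuple[float, float]:
    """Return (over_refusal_pct, safety_pct) from (prompt_is_safe, model_refused) pairs."""
    safe_refusals, unsafe_refusals = [], []
    for prompt_is_safe, model_refused in records:
        (safe_refusals if prompt_is_safe else unsafe_refusals).append(model_refused)
    if not safe_refusals or not unsafe_refusals:
        raise ValueError("need at least one safe and one unsafe prompt")
    # Over-refusal: refusals among safe prompts (lower is better).
    over_refusal = 100.0 * sum(safe_refusals) / len(safe_refusals)
    # Safety: refusals among unsafe prompts (higher is better).
    safety = 100.0 * sum(unsafe_refusals) / len(unsafe_refusals)
    return over_refusal, safety


# Toy judgments: four safe prompts (one refused), two unsafe prompts (one refused).
toy = [(True, False)] * 3 + [(True, True)] + [(False, True), (False, False)]
print(xstest_rates(toy))
```

Reporting both numbers together matters because a model can trivially maximize one of them by always refusing or never refusing.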

General Capabilities

Benchmark	Base Qwen3.5-27B	This model
MMLU	0.845	0.802
HellaSwag	0.638	0.641

MMLU drops from 0.845 to 0.802, while HellaSwag is essentially unchanged (0.638 to 0.641). This modest capability trade-off is considered acceptable in safety-critical deployment contexts.

Limitations

Domain coverage is narrow. The model is trained on five specific risk clusters and does not generalize robustly to out-of-domain adversarial safety prompts.
Reduced personalization. The fine-tuned model can over-apply crisis-intervention templates to ambiguous or benign prompts, reducing response relevance.
English-only. The model and its built-in support resources are in English, with helplines primarily targeting the US and UK.
Single-turn only. The model was trained and evaluated on single-turn prompts. Multi-turn, adversarial, and real-user behavior remain unstudied.
Not clinically validated. Intervention strategies are adapted from human–human frameworks and should not be interpreted as therapy or crisis management.
Generative, not rule-based. Appropriate behavior cannot be guaranteed for all possible inputs or conversational contexts. Miscalibrated refusals may still fail to support users adequately or may escalate distress.

Ethical Considerations

This model is intended to reduce harm caused by blunt or poorly designed LLM refusals in high-risk interactions. However:

Supportive and empathetic refusal behavior could create unwarranted perceptions of emotional competence or therapeutic authority in a system that is neither clinically validated nor capable of genuine psychological care.
Pre-deployment stress-testing under adversarial, emotionally charged, and out-of-distribution scenarios is strongly recommended.
Continuous monitoring and iterative correction after deployment are essential.
Future work should evaluate failure modes across diverse cultural contexts, vulnerable populations, and multilingual settings.

Quick Start

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "giannor/Qwen3.5-27B-psysafe"

# Load the tokenizer and the model weights; device_map="auto" spreads the
# model across the available devices.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",
    device_map="auto",
)

# Format a single-turn conversation with the model's chat template.
messages = [{"role": "user", "content": "Your message here"}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer([text], return_tensors="pt").to(model.device)

# Generate, then decode only the newly generated tokens (skip the prompt).
outputs = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```

With vLLM:

```bash
pip install vllm
vllm serve "giannor/Qwen3.5-27B-psysafe"
```
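Once the server is running, it can be queried through vLLM's OpenAI-compatible chat completions endpoint. The sketch below uses only the standard library and assumes the server listens on vLLM's default address, `http://localhost:8000`; the helper names `build_chat_request` and `chat` are hypothetical.

```python
import json
import urllib.request

MODEL_ID = "giannor/Qwen3.5-27B-psysafe"


def build_chat_request(user_message: str, max_tokens: int = 512) -> dict:
    """Build a single-turn chat completions payload."""
    return {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": user_message}],
        "max_tokens": max_tokens,
    }


def chat(user_message: str, base_url: str = "http://localhost:8000/v1") -> str:
    """POST the payload to the server and return the assistant's reply text."""
    payload = json.dumps(build_chat_request(user_message)).encode("utf-8")
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


# Usage (requires the server started by `vllm serve` to be running):
# print(chat("Your message here"))
```

The payload is kept single-turn on purpose, since the model was trained and evaluated on single-turn prompts.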

With Unsloth:

```python
from unsloth import FastModel

# max_seq_length matches the 4,096-token limit used during fine-tuning.
model, tokenizer = FastModel.from_pretrained(
    model_name="giannor/Qwen3.5-27B-psysafe",
    max_seq_length=4096,
)
```

Citation

If you use this model, please cite the PSYCHOSAFE paper:

See the BibTeX entry for arXiv:2606.09697 given earlier in this README.
