XuehangCang

Qwen3.5-0.8B-Rebel

License: apache-2.0

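Model Overview

| Item | Description |
| --- | --- |
| Base model | Qwen/Qwen3.5-0.8B |
| Decensoring method | heretic-zh, a Chinese-optimized version of Heretic: fully automatic censorship removal based on directional ablation (abliteration) |
| Training datasets | XuehangCang/safe_prompt (safe prompts) + XuehangCang/unsafe_prompt (unsafe prompts), built specifically for Chinese-language use |
| Architecture | Qwen3.5 (24 layers, hidden size 1024, 8 attention heads, 2 KV heads, hybrid of linear attention and full attention) |
| Parameters | ~0.8B |
| Context length | 262,144 tokens |
| Vocabulary size | 248,320 |
| Precision | bfloat16 |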

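Decensoring Results

heretic-zh is a Chinese-optimized version of Heretic, adapted specifically for removing censorship from Chinese language models. Compared with the original Heretic, it adds the following targeted optimizations:

- Chinese datasets: uses the self-built XuehangCang/safe_prompt and XuehangCang/unsafe_prompt datasets, which contain carefully constructed pairs of safe and unsafe Chinese prompts and capture the censorship behavior of Chinese models more accurately than general-purpose English datasets.
- Chinese refusal detection: extends the list of Chinese refusal markers (免责声明 "disclaimer", 抱歉 "sorry", 我不能 "I can't", 我无法 "I am unable to", and others) to identify refusal patterns in Chinese model output more precisely.
- Optuna + TPE hyperparameter optimization: automatically searches for the best abliteration parameters over 200 trials, with the objective of jointly minimizing the refusal rate and the KL divergence from the original model.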

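| Metric | Original model | This model |
| --- | --- | --- |
| Refusals (out of 200 "harmful" prompts) | 192 | 0 |
| KL divergence (relative to the original model) | 0 (by definition) | 0.043 |

✅ The refusal rate drops from 96.0% (192 of 200) to 0%, while the KL divergence is only 0.043, which indicates that the model's output distribution stays close to that of the original model.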
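The two metrics in the table can be reproduced in miniature. The following is a minimal sketch, not the evaluation code used for this model: `refusal_rate` and `kl_divergence` are hypothetical helper names, the two probability lists are made-up toy next-token distributions, and the direction KL(original || modified) is an assumption, since the text only says the divergence is measured relative to the original model.

```python
import math


def refusal_rate(refusals: int, total: int) -> float:
    """Fraction of prompts that were refused."""
    if total <= 0:
        raise ValueError("total must be positive")
    return refusals / total


def kl_divergence(p: list[float], q: list[float]) -> float:
    """KL(p || q) in nats for two discrete distributions over the same tokens."""
    if len(p) != len(q):
        raise ValueError("distributions must have the same length")
    # Terms with p_i == 0 contribute nothing to the sum.
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


# Refusal counts reported in the table above.
original_rate = refusal_rate(192, 200)
ablated_rate = refusal_rate(0, 200)
print(f"{original_rate:.1%} -> {ablated_rate:.1%}")  # prints "96.0% -> 0.0%"

# Toy next-token distributions (illustrative values only, not model outputs).
p_original = [0.70, 0.20, 0.10]
p_ablated = [0.60, 0.25, 0.15]
kl_same = kl_divergence(p_original, p_original)    # identical distributions
kl_shifted = kl_divergence(p_original, p_ablated)  # slightly shifted distribution
```

A KL divergence of 0 means the two distributions are identical, which is why the original model's entry in the table is 0 by definition.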

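Optimal Ablation Parameters

Heretic applies layer-dependent ablation weights to the attention output projection (o_proj) and the MLP down projection (down_proj), intervening on the residual stream along the "refusal direction":

- Direction scope: per layer (a separate direction is used for each layer)
- Direction index: 15.07

| Module | max_weight | max_weight_position | min_weight | min_weight_distance |
| --- | --- | --- | --- | --- |
| `attn.o_proj` | 1.484 | 14.11 | 0.491 | 12.75 |
| `mlp.down_proj` | 1.240 | 19.54 | 0.600 | 11.45 |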
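To make the four kernel parameters concrete, the sketch below expands them into one ablation weight per layer. It assumes a piecewise-linear kernel: the weight peaks at `max_weight` at `max_weight_position`, falls linearly to `min_weight` at a distance of `min_weight_distance` layers, and is zero beyond that. This shape, the reading of `min_weight` as an absolute value, and the names `KernelParams` and `ablation_weight` are all assumptions made for illustration; they are not confirmed by this README.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class KernelParams:
    max_weight: float
    max_weight_position: float
    min_weight: float
    min_weight_distance: float


def ablation_weight(layer: int, p: KernelParams) -> float:
    """Assumed kernel: linear falloff from max_weight to min_weight, zero outside."""
    distance = abs(layer - p.max_weight_position)
    if distance > p.min_weight_distance:
        return 0.0  # layer lies outside the kernel and is left untouched
    fraction = distance / p.min_weight_distance
    return p.max_weight + fraction * (p.min_weight - p.max_weight)


NUM_LAYERS = 24  # layer count from the model overview

# Values copied from the table above.
O_PROJ = KernelParams(1.484, 14.11, 0.491, 12.75)
DOWN_PROJ = KernelParams(1.240, 19.54, 0.600, 11.45)

o_proj_weights = [ablation_weight(i, O_PROJ) for i in range(NUM_LAYERS)]
down_proj_weights = [ablation_weight(i, DOWN_PROJ) for i in range(NUM_LAYERS)]

for i in range(NUM_LAYERS):
    print(f"layer {i:2d}  o_proj={o_proj_weights[i]:.3f}  down_proj={down_proj_weights[i]:.3f}")
```

Under these assumptions the `attn.o_proj` intervention is strongest around layer 14 and the `mlp.down_proj` intervention around layer 20, while the earliest layers receive a weight of zero.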

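Usage

Install Dependencies

```bash
# accelerate is required by device_map="auto" in the inference example
pip install transformers torch accelerate
```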

Inference Example

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "XuehangCang/Qwen3.5-0.8B-Rebel"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype="auto",  # use the dtype stored in the checkpoint (bfloat16)
    device_map="auto",  # place the weights on the available device(s)
)

# Chinese prompt: "What model are you?"
messages = [{"role": "user", "content": "你是什么模型？"}]
inputs = tokenizer.apply_chat_template(
    messages,
    tokenize=True,
    add_generation_prompt=True,
    return_dict=True,
    return_tensors="pt",
).to(model.device)

outputs = model.generate(**inputs, max_new_tokens=2048)
# Decode only the newly generated tokens, skipping the prompt.
print(
    tokenizer.decode(
        outputs[0][inputs["input_ids"].shape[-1] :], skip_special_tokens=True
    )
)
```

Generation Parameters

| Parameter | Value |
| --- | --- |
| `temperature` | 0.9 |
| `top_p` | 0.95 |
| `do_sample` | true |
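For readers unfamiliar with these settings, the sketch below shows what `temperature` and `top_p` do to a next-token distribution before one token is sampled. It is a simplified, pure-Python illustration, not the sampling code in `transformers`: the logits are made-up values, and `softmax_with_temperature` and `top_p_filter` are hypothetical helper names.

```python
import math
import random


def softmax_with_temperature(logits: list[float], temperature: float) -> list[float]:
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [x / temperature for x in logits]
    peak = max(scaled)
    exps = [math.exp(x - peak) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def top_p_filter(probs: list[float], top_p: float) -> dict[int, float]:
    """Keep the most probable tokens until their mass reaches top_p, then renormalize."""
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    kept, cumulative = [], 0.0
    for i in order:
        kept.append(i)
        cumulative += probs[i]
        if cumulative >= top_p:
            break
    mass = sum(probs[i] for i in kept)
    return {i: probs[i] / mass for i in kept}


toy_logits = [4.0, 3.0, 1.0, -2.0]  # illustrative values, not model outputs
probs = softmax_with_temperature(toy_logits, temperature=0.9)
nucleus = top_p_filter(probs, top_p=0.95)

# do_sample=true corresponds to drawing from the filtered distribution
# instead of always taking the most probable token.
rng = random.Random(0)
token_ids = list(nucleus)
sampled = rng.choices(token_ids, weights=[nucleus[i] for i in token_ids], k=1)[0]
print(nucleus, sampled)
```

If these values are not picked up automatically from the model's generation config, they can be passed explicitly, for example `model.generate(**inputs, do_sample=True, temperature=0.9, top_p=0.95, max_new_tokens=2048)`.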

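About heretic-zh

heretic-zh is a Chinese-optimized version of Heretic, a fully automatic censorship removal tool for language models. Heretic uses directional ablation (abliteration): it finds a "refusal direction" by analyzing the difference between the model's residual-stream activations on "good" (safe) and "bad" (unsafe) prompts, then intervenes along that direction to suppress the model's refusal behavior.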
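The linear algebra behind this description can be shown on toy data. The sketch below is a simplified illustration under stated assumptions, not Heretic's implementation: the direction is taken as the normalized difference of mean activations, the residual vectors are made-up three-dimensional values, and the projection is removed from a single vector for brevity, whereas this README describes the intervention as applied through the o_proj and down_proj weights. All function names are hypothetical.

```python
import math


def mean_vector(rows: list[list[float]]) -> list[float]:
    """Element-wise mean of a list of equal-length vectors."""
    return [sum(col) / len(rows) for col in zip(*rows)]


def normalize(v: list[float]) -> list[float]:
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a: list[float], b: list[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


def ablate(vector: list[float], direction: list[float], weight: float = 1.0) -> list[float]:
    """Subtract weight times the projection of vector onto a unit direction."""
    scale = weight * dot(vector, direction)
    return [x - scale * d for x, d in zip(vector, direction)]


# Toy residual-stream activations (made-up values, not model data).
safe_residuals = [[1.0, 0.0, 0.5], [0.8, 0.2, 0.4]]
unsafe_residuals = [[1.0, 2.0, 0.5], [0.8, 2.2, 0.4]]

# Direction along which the two groups of activations differ on average.
difference = [
    u - s for u, s in zip(mean_vector(unsafe_residuals), mean_vector(safe_residuals))
]
refusal_direction = normalize(difference)

# With weight 1.0 the component along the direction is removed entirely.
ablated = ablate(unsafe_residuals[0], refusal_direction)
remaining_component = dot(ablated, refusal_direction)
```

A weight below 1.0 removes only part of the component, and a weight above 1.0 overshoots, which is what the per-layer weights in the parameter table control.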

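Building on this, heretic-zh is adapted in depth for Chinese models and Chinese-speaking users:

- Chinese refusal detection: extends the library of Chinese-specific refusal phrases, covering dozens of common Chinese refusal expressions such as 免责声明 ("disclaimer"), 抱歉 ("sorry"), 我不能 ("I can't"), and 我无法 ("I am unable to").
- Chinese prompt datasets: uses the self-built safe_prompt / unsafe_prompt datasets, constructed specifically for Chinese semantic contexts.
- Chinese text processing: applies NFKC normalization to Chinese response text and removes superfluous spaces between CJK characters before matching, which, according to the author, significantly improves refusal-detection accuracy on Chinese model output.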
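The normalization and matching steps in the list above can be sketched with the Python standard library. This is an illustration only, not heretic-zh's code: the marker tuple contains just the four phrases named in this README, the CJK range is limited to the basic CJK Unified Ideographs block as a simplifying assumption, and `normalize_response` and `is_refusal` are hypothetical names.

```python
import re
import unicodedata

# The four markers named above; the actual list is said to contain dozens.
REFUSAL_MARKERS = ("免责声明", "抱歉", "我不能", "我无法")

# Assumption: "CJK character" means the basic CJK Unified Ideographs block.
_CJK = r"\u4e00-\u9fff"
_SPACE_BETWEEN_CJK = re.compile(rf"(?<=[{_CJK}])\s+(?=[{_CJK}])")


def normalize_response(text: str) -> str:
    """Apply NFKC normalization, then drop whitespace between two CJK characters."""
    text = unicodedata.normalize("NFKC", text)
    return _SPACE_BETWEEN_CJK.sub("", text)


def is_refusal(text: str) -> bool:
    """Return True if the normalized text contains any refusal marker."""
    normalized = normalize_response(text)
    return any(marker in normalized for marker in REFUSAL_MARKERS)


print(is_refusal("我 不 能 回答这个问题"))     # spaced-out refusal phrase
print(is_refusal("我\u3000无\u3000法协助"))  # ideographic (full-width) spaces
print(is_refusal("今天天气很好"))            # ordinary sentence
```

Whitespace is removed only when it sits between two CJK characters, so spaces inside Latin-script text are preserved.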

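Unlike hand-tuned abliteration, heretic-zh inherits Heretic's use of Optuna's TPE (Tree-structured Parzen Estimator) sampler to search automatically for ablation parameters that best balance refusal suppression against preservation of the model's capabilities.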

Learn more: see the Heretic and heretic-zh repository URLs in the BibTeX entries below.

License

This model is derived from the original Qwen3.5-0.8B model and is released under the Apache 2.0 license.

Citation

If you use this model, please cite the original Qwen, Heretic, and heretic-zh:

```bibtex
@misc{qwen3.5,
  title = {Qwen3.5: A Family of Multimodal Language Models},
  author = {Qwen Team},
  url = {https://github.com/QwenLM/Qwen},
  year = {2025}
}

@misc{heretic,
  title = {Heretic: Fully Automatic Censorship Removal for Language Models},
  author = {Philipp Emanuel Weidmann},
  url = {https://github.com/p-e-w/heretic},
  year = {2025}
}

@misc{heretic-zh,
  title = {heretic-zh: Chinese-Optimized Automatic Censorship Removal},
  author = {XuehangCang},
  url = {https://github.com/XuehangCang/heretic-zh},
  year = {2026}
}
```

Model Details

Model Provider

XuehangCang

Model Tree

Base

Qwen/Qwen3.5-0.8B

Fine-tuned

this model

Input Modalities

Text

Image

Video

Output Modalities