spilol2

Qwen3-0.6B-abliterated

License: mit

Model Details

Model Description

  • Developed by: spilol2
  • Model type: Causal Language Model (Decoder-only Transformer)
  • Language(s): English (and other languages supported by Qwen3-0.6B)
  • License: MIT
  • Finetuned from model: Qwen/Qwen3-0.6B

Model Sources

What is Abliteration?

Abliteration is a technique that removes refusal behaviour from a language model without any retraining or fine-tuning. It works by:

  1. Running the model on pairs of harmful and harmless prompts and caching the residual stream activations.
  2. Using PCA to identify the principal "refusal direction" in activation space.
  3. Orthogonalizing the relevant weight matrices against that direction, so the model can no longer activate it.

The key difference from traditional "uncensored" fine-tunes is that no new data or training is involved: only the existing weights are geometrically modified. Other model behaviour (reasoning, instruction-following, knowledge) is intended to stay close to that of the original Qwen3-0.6B, although minor degradation on specific tasks is possible.
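The three steps above can be sketched on toy data. This is a dependency-free illustration, not the code used to produce this model: the activations are made-up numbers, only a single layer and a single weight matrix are shown, and the helper names (`refusal_direction`, `orthogonalize`) are hypothetical. PCA is approximated here with power iteration over the pooled, mean-centered activations.

```python
import math

def dot(u, v):
    return sum(a * b for a, b in zip(u, v))

def normalize(v):
    norm = math.sqrt(dot(v, v))
    return [a / norm for a in v]

def mean_vector(rows):
    return [sum(col) / len(rows) for col in zip(*rows)]

def refusal_direction(harmful, harmless, iters=100):
    """Top principal component of the pooled activations, oriented toward `harmful`."""
    pooled = harmful + harmless
    mu = mean_vector(pooled)
    centered = [[x - m for x, m in zip(row, mu)] for row in pooled]
    v = normalize([1.0] * len(mu))  # deterministic starting vector
    for _ in range(iters):
        # Power iteration: apply X^T X to v without building the covariance matrix.
        scores = [dot(row, v) for row in centered]
        v = normalize([sum(s * row[j] for s, row in zip(scores, centered))
                       for j in range(len(mu))])
    # PCA leaves the sign arbitrary; point the direction from harmless to harmful.
    gap = [a - b for a, b in zip(mean_vector(harmful), mean_vector(harmless))]
    return v if dot(v, gap) >= 0 else [-a for a in v]

def orthogonalize(weight, direction):
    """Return W - r (r^T W), so that W x has no component along the unit vector r."""
    coeffs = [dot(direction, col) for col in zip(*weight)]  # r^T W, one value per input
    return [[w - r_i * c for w, c in zip(row, coeffs)]
            for row, r_i in zip(weight, direction)]

# Toy activations (made-up numbers): harmful prompts are shifted along the first axis.
harmless = [[0.1, 0.2, -0.1], [-0.1, 0.1, 0.1], [0.0, -0.2, 0.0], [0.1, -0.1, 0.05]]
harmful = [[x + 3.0, y, z] for x, y, z in harmless]

r = refusal_direction(harmful, harmless)
W = [[0.5, -1.0], [2.0, 0.3], [-0.7, 0.8]]  # maps a 2-d input into the 3-d stream
W_abl = orthogonalize(W, r)
```

After orthogonalization, the output of `W_abl` for any input has a zero projection onto `r`, which is the sense in which the matrix can no longer write along that direction.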

Uses

Direct Use

This model is intended for general-purpose text generation without built-in content refusals. It is suitable for:

  • Research into LLM alignment, refusal mechanisms, and interpretability.
  • Red-teaming and safety evaluation pipelines.
  • Creative writing, roleplay, and fictional storytelling where the model should not break character.
  • Developers building applications who want to enforce their own content policies at the application layer rather than the model layer.

Downstream Use

Can be plugged into any pipeline that accepts a standard causal language model — vLLM, llama.cpp (after GGUF conversion), LM Studio, Ollama, SGLang, etc.

Out-of-Scope Use

  • This model is not intended to be used for illegal activities.
  • It is not a replacement for a properly safety-tested deployment model in consumer-facing products.
  • It may still occasionally produce refusals or ethical disclaimers: abliteration suppresses refusal behaviour but does not guarantee its complete removal.

How to Get Started with the Model

Using 🤗 Transformers (pipeline)

python

from transformers import pipeline
pipe = pipeline("text-generation", model="spilol2/Qwen3-0.6B-abliterated")
result = pipe("Tell me about the history of cryptography.", max_new_tokens=256)
print(result[0]["generated_text"])

Loading model and tokenizer directly

python

from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "spilol2/Qwen3-0.6B-abliterated"
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto",
)

# Format the conversation with the model's chat template.
messages = [{"role": "user", "content": "Explain how RSA encryption works."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=512, temperature=0.7, do_sample=True)

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))

vLLM

bash

pip install vllm
vllm serve "spilol2/Qwen3-0.6B-abliterated"
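Once the server is running, it can be queried over its OpenAI-compatible HTTP API. The sketch below uses only the Python standard library and assumes the server is listening on vLLM's default address, `http://localhost:8000`; the helper names `build_chat_request` and `chat` are illustrative, and the sampling values are placeholders to adjust.

```python
import json
import urllib.request

MODEL_ID = "spilol2/Qwen3-0.6B-abliterated"

def build_chat_request(prompt: str, max_tokens: int = 256, temperature: float = 0.7) -> dict:
    """Build an OpenAI-style chat completion payload for the served model."""
    return {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": temperature,
    }

def chat(prompt: str, base_url: str = "http://localhost:8000/v1") -> str:
    """POST the payload to the server and return the assistant's reply text."""
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_chat_request(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]

# Example (requires the server started by the command above to be running):
# print(chat("Explain how RSA encryption works."))
```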

Docker

bash

docker model run hf.co/spilol2/Qwen3-0.6B-abliterated

Technical Details

Abliteration Process

The abliteration was performed using FailSpy's abliterator library, which automates:

  • Contrastive pair generation (harmful vs. harmless instruction datasets).
  • Caching residual stream activations (resid_pre, resid_post) across all layers.
  • PCA to extract the dominant refusal direction per layer.
  • Orthogonalization of the model's weight matrices against those directions (in bfloat16).

Model Architecture

Inherits the full architecture of Qwen3-0.6B:

  • Architecture: Decoder-only Transformer (Qwen3 family)
  • Parameters: ~0.6B (0.8B as reported by HuggingFace, including embeddings)
  • Tensor type: BF16
  • Context length: Refer to Qwen/Qwen3-0.6B for full specs

Bias, Risks, and Limitations

  • Incomplete uncensoring: Abliteration reduces refusals but does not guarantee that none remain. Residual safety behaviour may persist in some layers or for certain prompt types.
  • Inherited biases: All biases present in the original Qwen3-0.6B model and its training data are fully inherited.
  • No safety guardrails: By design, this model does not refuse requests based on content. Users and downstream developers are solely responsible for ensuring appropriate use.
  • Performance parity: General task performance should be very close to the base model. However, abliteration can occasionally cause minor degradation on specific tasks — evaluate before deploying in production.
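One rough way to check for residual refusals before deployment is to run the same prompt set through the model and count responses that look like refusals. The sketch below is a keyword heuristic only: the marker phrases are hypothetical examples, not a validated classifier, and should be tuned against real outputs.

```python
from typing import Iterable, Sequence

# Hypothetical marker phrases; tune these for your own prompts and model outputs.
REFUSAL_MARKERS = (
    "i can't",
    "i cannot",
    "i won't",
    "i'm sorry",
    "i am unable",
    "as an ai",
)

def is_refusal(response: str, markers: Sequence[str] = REFUSAL_MARKERS) -> bool:
    """Heuristic: flag a response whose first 80 characters contain a marker phrase."""
    opening = response.strip().lower()[:80].replace("’", "'")
    return any(marker in opening for marker in markers)

def refusal_rate(responses: Iterable[str], markers: Sequence[str] = REFUSAL_MARKERS) -> float:
    """Fraction of responses flagged as refusals (0.0 for an empty collection)."""
    responses = list(responses)
    if not responses:
        return 0.0
    return sum(is_refusal(r, markers) for r in responses) / len(responses)
```

Computing `refusal_rate` on outputs from both Qwen/Qwen3-0.6B and this model for the same prompts gives a simple before/after comparison. Task-quality checks need a separate benchmark.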

Recommendations

Users integrating this model into applications should implement their own content filtering and moderation at the application layer. This model is best suited for research, development, and controlled environments where unrestricted model output is intentional and appropriate.
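As a minimal sketch of application-layer filtering, the wrapper below checks both the prompt and the generated text against a caller-supplied policy before returning anything. All names here (`ModerationResult`, `make_policy`, `moderated_generate`) are hypothetical, and a regex block list is only a stand-in for whatever moderation service or classifier an application actually uses.

```python
import re
from dataclasses import dataclass
from typing import Callable, Iterable

@dataclass
class ModerationResult:
    allowed: bool
    text: str
    reason: str = ""

def make_policy(blocked_patterns: Iterable[str]) -> Callable[[str], str]:
    """Return a checker that yields the first matching pattern, or "" if none match."""
    compiled = [re.compile(p, re.IGNORECASE) for p in blocked_patterns]

    def check(text: str) -> str:
        for pattern in compiled:
            if pattern.search(text):
                return pattern.pattern
        return ""

    return check

def moderated_generate(
    generate: Callable[[str], str],
    prompt: str,
    check: Callable[[str], str],
) -> ModerationResult:
    """Apply the policy to the prompt, then to the model output."""
    hit = check(prompt)
    if hit:
        return ModerationResult(False, "", f"prompt matched {hit!r}")
    output = generate(prompt)
    hit = check(output)
    if hit:
        return ModerationResult(False, "", f"output matched {hit!r}")
    return ModerationResult(True, output)
```

Here `generate` can be any callable that maps a prompt to text, for example a thin wrapper around the Transformers pipeline shown earlier in this card.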

Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019). Abliteration is a post-processing step with minimal compute cost compared to full fine-tuning — no GPU training was involved beyond inference-level activation caching.

Citation

If you use this model, please consider citing the original abliteration paper and the FailSpy abliterator library:

Refusal direction paper (BibTeX):

bibtex

@misc{arditi2024refusal,
  title  = {Refusal in LLMs is mediated by a single direction},
  author = {Andy Arditi and Oscar Obeso and Aaquib Syed and Daniel Paleka and Nina Rimsky and Wes Gurnee and Neel Nanda},
  year   = {2024},
  url    = {https://www.lesswrong.com/posts/jGuXSZgv6qfdhMCuJ/refusal-in-llms-is-mediated-by-a-single-direction}
}

FailSpy abliterator library:

markdown

FailSpy. abliterator [software]. GitHub, 2024. https://github.com/FailSpy/abliterator

Model Card Authors

spilol2

Model Card Contact

Open an issue or discussion on the model page.
