spilol2

Qwen3-0.6B-abliterated

Model Details

Model Description

Developed by: spilol2
Model type: Causal Language Model (Decoder-only Transformer)
Language(s): English (and other languages supported by Qwen3-0.6B)
License: MIT
Finetuned from model: Qwen/Qwen3-0.6B

Model Sources

Repository: https://huggingface.co/spilol2/Qwen3-0.6B-abliterated
Abliteration tool: FailSpy/abliterator
Abliteration technique explained: Uncensor any LLM with abliteration – Maxime Labonne
Original paper: Refusal in LLMs is mediated by a single direction – Arditi et al., 2024

What is Abliteration?

Abliteration is a technique that suppresses refusal behaviour in a language model without any retraining or fine-tuning. It works by:

Running the model on pairs of harmful and harmless prompts and caching the residual stream activations.
Using PCA to identify the principal "refusal direction" in activation space.
Orthogonalizing the relevant weight matrices against that direction, so the model can no longer activate it.

The key difference from traditional "uncensored" fine-tunes is that no new data or training is involved: only the existing weights are geometrically modified. Other model behaviour (reasoning, instruction-following, knowledge) is expected to stay close to that of the original Qwen3-0.6B.
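
The orthogonalization step can be written as a projection. The notation here is an assumption made for illustration and is not taken from the abliterator source: \hat{r} is the unit-norm refusal direction, and W is any weight matrix whose output is added to the residual stream (column convention, output = W x).

```latex
% Assumed notation: \hat{r} is the unit refusal direction,
% W writes into the residual stream as y = W x.
\[
  W' = W - \hat{r}\,\hat{r}^{\top} W, \qquad \lVert \hat{r} \rVert_2 = 1
\]
% For any input x, the modified output has no component along \hat{r}:
\[
  \hat{r}^{\top} W' x
    = \hat{r}^{\top} W x - (\hat{r}^{\top}\hat{r})\,\hat{r}^{\top} W x
    = 0
\]
```

Because only the component along \hat{r} is subtracted, everything W writes in the orthogonal subspace is left unchanged.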

Uses

Direct Use

This model is intended for use as a general-purpose text generation model without built-in content refusals. Suitable for:

Research into LLM alignment, refusal mechanisms, and interpretability.
Red-teaming and safety evaluation pipelines.
Creative writing, roleplay, and fictional storytelling where the model should not break character.
Applications whose developers want to enforce their own content policies at the application layer rather than the model layer.

Downstream Use

The model can be plugged into any pipeline that accepts a standard causal language model, such as vLLM, llama.cpp (after GGUF conversion), LM Studio, Ollama, or SGLang.

Out-of-Scope Use

This model is not intended to be used for illegal activities.
It is not a replacement for a properly safety-tested deployment model in consumer-facing products.
It may still occasionally produce refusals or ethical disclaimers — abliteration inhibits but does not guarantee complete removal of all refusal behaviour.

How to Get Started with the Model

Using 🤗 Transformers (pipeline)

python
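from transformers import pipeline

# Build a text-generation pipeline; the weights are downloaded on first use.
pipe = pipeline("text-generation", model="spilol2/Qwen3-0.6B-abliterated")
result = pipe("Tell me about the history of cryptography.", max_new_tokens=256)
# "generated_text" contains the prompt followed by the continuation.
print(result[0]["generated_text"])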

Loading model and tokenizer directly

python
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

model_id = "spilol2/Qwen3-0.6B-abliterated"

# Load the tokenizer and the BF16 weights; device_map="auto" places the model
# on a GPU when one is available.
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

# Format the conversation with the model's chat template before tokenizing.
messages = [{"role": "user", "content": "Explain how RSA encryption works."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Sample a completion without tracking gradients.
with torch.no_grad():
    output = model.generate(**inputs, max_new_tokens=512, temperature=0.7, do_sample=True)

# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(output[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))

vLLM

bash
pip install vllm
vllm serve "spilol2/Qwen3-0.6B-abliterated"
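
Once `vllm serve` is running, it exposes an OpenAI-compatible HTTP API. The client below is a minimal sketch using only the Python standard library. It assumes the server is reachable at the default address `http://localhost:8000`; the helper names `build_chat_request` and `chat` are illustrative and not part of vLLM.

```python
import json
import urllib.request

MODEL_ID = "spilol2/Qwen3-0.6B-abliterated"


def build_chat_request(prompt, base_url="http://localhost:8000", max_tokens=256):
    """Build a POST request for the OpenAI-compatible chat completions route."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def chat(prompt, **kwargs):
    """Send the prompt to a running server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt, **kwargs)) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]


# Usage (requires a running server):
#   print(chat("Explain how RSA encryption works."))
```

If the server was started on a different host or port, pass `base_url` accordingly.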

Docker

bash
docker model run hf.co/spilol2/Qwen3-0.6B-abliterated

Technical Details

Abliteration Process

The abliteration was performed using FailSpy's abliterator library, which automates:

Contrastive pair generation (harmful vs. harmless instruction datasets).
Caching residual stream activations (resid_pre, resid_post) across all layers.
PCA to extract the dominant refusal direction per layer.
Orthogonalization of the model's weight matrices against those directions (in bfloat16).
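
The steps above can be illustrated on synthetic data. The following is a rough NumPy sketch, not the abliterator API: the activations are random stand-ins for cached residual-stream vectors, the function names are hypothetical, and the direction is taken as the first principal axis of the paired harmful-minus-harmless differences without mean-centering, so that the shared offset between the two prompt sets is kept.

```python
import numpy as np

rng = np.random.default_rng(0)
d_model, n_pairs = 16, 64

# Synthetic stand-ins for cached activations (hypothetical data): harmful
# prompts are shifted along one hidden axis relative to harmless ones.
true_dir = np.zeros(d_model)
true_dir[3] = 1.0
harmless = rng.normal(size=(n_pairs, d_model))
harmful = harmless + 4.0 * true_dir + 0.1 * rng.normal(size=(n_pairs, d_model))


def refusal_direction(harmful_acts, harmless_acts):
    """First principal axis of the paired differences (no mean-centering)."""
    diffs = harmful_acts - harmless_acts
    _, _, vt = np.linalg.svd(diffs, full_matrices=False)
    return vt[0] / np.linalg.norm(vt[0])


def orthogonalize(weight, direction):
    """Remove the component of the weight's output along `direction`.

    `weight` has shape (d_model, d_in) and writes into the residual stream.
    """
    return weight - np.outer(direction, direction @ weight)


r_hat = refusal_direction(harmful, harmless)
w_out = rng.normal(size=(d_model, 32))
w_abl = orthogonalize(w_out, r_hat)

print(abs(r_hat @ true_dir))        # close to 1: the planted direction is recovered
print(np.abs(r_hat @ w_abl).max())  # close to 0: nothing is written along r_hat
```

In a real model the same projection would be applied to each matrix that writes into the residual stream; which matrices and layers are modified depends on the tool and its settings.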

Model Architecture

Inherits the full architecture of Qwen3-0.6B:

Architecture: Decoder-only Transformer (Qwen3 family)
Parameters: ~0.6B (0.8B as reported by HuggingFace, including embeddings)
Tensor type: BF16
Context length: Refer to Qwen/Qwen3-0.6B for full specs

Bias, Risks, and Limitations

Incomplete uncensoring: Abliteration reduces but does not guarantee zero refusals. Residual safety behaviour may remain in some layers or for certain prompt types.
Inherited biases: All biases present in the original Qwen3-0.6B model and its training data are fully inherited.
No safety guardrails: By design, this model does not refuse requests based on content. Users and downstream developers are solely responsible for ensuring appropriate use.
Performance parity: General task performance should be very close to the base model. However, abliteration can occasionally cause minor degradation on specific tasks — evaluate before deploying in production.
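
One simple way to check for residual refusals is to scan completions for common refusal phrases. The sketch below is a crude heuristic under stated assumptions: the marker list and function names are illustrative, and substring matching will miss paraphrased refusals and can flag benign text, so treat the result as a rough signal rather than a measurement.

```python
# Illustrative marker list (an assumption, not an established benchmark).
REFUSAL_MARKERS = (
    "i'm sorry",
    "i am sorry",
    "i cannot",
    "i can't",
    "as an ai",
)


def looks_like_refusal(completion: str) -> bool:
    """Return True if the start of the completion contains a refusal marker."""
    head = completion.strip().lower()[:200]
    head = head.replace("\u2019", "'")  # normalise curly apostrophes
    return any(marker in head for marker in REFUSAL_MARKERS)


def refusal_rate(completions) -> float:
    """Fraction of completions flagged as refusals (0.0 for an empty input)."""
    completions = list(completions)
    if not completions:
        return 0.0
    return sum(looks_like_refusal(c) for c in completions) / len(completions)


samples = [
    "I'm sorry, but I can't help with that.",
    "RSA relies on the difficulty of factoring large integers.",
    "Sure, here is an overview of the Enigma machine.",
    "As an AI, I cannot assist with this request.",
]
print(refusal_rate(samples))  # 0.5
```

Running the same prompts through the base model and this model and comparing the two rates gives a quick before-and-after comparison.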

Recommendations

Users integrating this model into applications should implement their own content filtering and moderation at the application layer. This model is best suited for research, development, and controlled environments where unrestricted model output is intentional and appropriate.
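
As a starting point for application-layer moderation, generation can be wrapped in a policy check on both the prompt and the completion. This is a minimal sketch: `make_moderated_generator` and `echo_model` are hypothetical names, and the keyword blocklist is only a placeholder for whatever moderation classifier or policy engine the application actually uses.

```python
from typing import Callable, Iterable


def make_moderated_generator(
    generate: Callable[[str], str],
    blocked_terms: Iterable[str],
    fallback: str = "This request is not allowed by the application policy.",
) -> Callable[[str], str]:
    """Wrap `generate` so prompts and completions are checked against a blocklist."""
    terms = tuple(term.lower() for term in blocked_terms)

    def violates(text: str) -> bool:
        lowered = text.lower()
        return any(term in lowered for term in terms)

    def moderated(prompt: str) -> str:
        if violates(prompt):
            return fallback  # reject before calling the model
        completion = generate(prompt)
        return fallback if violates(completion) else completion

    return moderated


def echo_model(prompt: str) -> str:
    """Stand-in for a real model call, used only for this example."""
    return f"Echo: {prompt}"


safe_generate = make_moderated_generator(echo_model, blocked_terms=["forbidden-topic"])
print(safe_generate("Tell me about cryptography."))
print(safe_generate("Tell me about forbidden-topic."))
```

In practice `generate` would call the model, for example through the Transformers pipeline or a vLLM endpoint.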

Environmental Impact

Carbon emissions can be estimated using the Machine Learning Impact calculator presented in Lacoste et al. (2019). Abliteration is a post-processing step with minimal compute cost compared to full fine-tuning — no GPU training was involved beyond inference-level activation caching.

Citation

If you use this model, please consider citing the original abliteration paper and the FailSpy abliterator library:

Refusal direction paper (BibTeX):

bibtex
@misc{arditi2024refusal,
  title   = {Refusal in LLMs is mediated by a single direction},
  author  = {Andy Arditi and Oscar Obeso and Aaquib Syed and Daniel Paleka and Nina Rimsky and Wes Gurnee and Neel Nanda},
  year    = {2024},
  url     = {https://www.lesswrong.com/posts/jGuXSZgv6qfdhMCuJ/refusal-in-llms-is-mediated-by-a-single-direction}
}

FailSpy abliterator library:

markdown
FailSpy. abliterator [software]. GitHub, 2024. https://github.com/FailSpy/abliterator

Model Card Authors

spilol2

Model Card Contact

Open an issue or discussion on the model page.
