Qwen3.5-4B-abliterated API & Inference Endpoint

Intended use & the KAINE project

This model is the language organ for KAINE, a composite cognitive architecture in which many modules interact through a global workspace; the organ supplies language, while values, affect, memory, and a self-model live in the architecture around it. It is intended as a research substrate for that work — not as a general-purpose assistant — and is published so KAINE installs and independent replications resolve identical weights. Companion GGUF builds: kaineone/Qwen3.5-4B-abliterated-GGUF.

What "abliterated" means here (and what it does not)

This is abliteration: subtractive removal of the refusal direction (Arditi et al. 2024, "Refusal in Language Models Is Mediated by a Single Direction"), W' = W − r̂ r̂ᵀ W, where r̂ is the unit-norm refusal direction and W is a weight matrix that writes into the residual stream. It is not fine-tuning, and no preference or instruction data was trained in. The base model's capabilities and distribution are left intact; only the refusal direction is orthogonalized out.
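The projection formula above can be illustrated on a toy matrix. This is a minimal sketch in plain Python, not the tool's implementation: the helper names `unit` and `ablate`, the 3×2 matrix, and the direction vector are all made up for illustration, and W is assumed to be stored as rows indexed by the residual-stream dimension.

```python
import math

def unit(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def ablate(W, r):
    """Return W' = W - r_hat r_hat^T W for a matrix W given as a list of rows.

    Rows of W index the output (residual-stream) dimension, so after the
    update no column of W' has any component along r_hat.
    """
    r_hat = unit(r)
    n_rows, n_cols = len(W), len(W[0])
    # r_hat^T W: how much each column of W writes along the direction.
    along = [sum(r_hat[i] * W[i][j] for i in range(n_rows)) for j in range(n_cols)]
    return [[W[i][j] - r_hat[i] * along[j] for j in range(n_cols)]
            for i in range(n_rows)]

# Toy values (hypothetical): a 3x2 weight matrix and a direction in 3-d space.
W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]
r = [0.0, 3.0, 4.0]

W_abl = ablate(W, r)
r_hat = unit(r)
# Component of the ablated matrix along r_hat, per column; should vanish.
residual = [sum(r_hat[i] * W_abl[i][j] for i in range(3)) for j in range(2)]
print(residual)
```

The residual printed at the end is zero up to floating-point error, which is the sense in which the direction is "orthogonalized out" of the weights.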

Honest scope: abliteration removes the refusal direction; it does not make the model value-neutral. The base model's pretraining and RLHF priors remain in the weights. The edit raises the model's willingness to respond without changing its underlying tendencies.

Reproducible recipe

Base: Qwen/Qwen3.5-4B (note: a vision-language model; abliteration targeted the text refusal direction).
Tool: jim-plus/llm-abliteration @ ca6e223.
Measure: last-token residual-stream mean-difference between 1,139 contrastive harmful/harmless prompts (the tool's bundled sets), per layer, 8-bit.
Ablate: layers 11–31, banded source directions — layer 17 for 11–22, layer 29 for 23–31 (the cleanest mid- and late-network directions); scale = 1.0, norm-preserving orthogonalization of the attention output and MLP down-projection weights.
Tooling note: loading/abliterating Qwen3.5 requires transformers ≥ 5.
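The measure and band layout in the recipe above can be sketched as follows. This is an illustration under stated assumptions, not code from jim-plus/llm-abliteration: `BANDS`, `source_layer_for`, and `mean_difference` are hypothetical names, the activations are toy two-dimensional vectors, and the norm-preserving step is not shown.

```python
import math

# Band layout from the recipe: (first_layer, last_layer, source_layer).
BANDS = [(11, 22, 17), (23, 31, 29)]

def source_layer_for(layer):
    """Return the layer whose direction ablates `layer`, or None if untouched."""
    for first, last, source in BANDS:
        if first <= layer <= last:
            return source
    return None

def mean_vector(rows):
    """Element-wise mean of a list of equal-length vectors."""
    dim = len(rows[0])
    return [sum(row[k] for row in rows) / len(rows) for k in range(dim)]

def mean_difference(harmful, harmless):
    """Unit-norm difference between mean harmful and mean harmless activations."""
    diff = [a - b for a, b in zip(mean_vector(harmful), mean_vector(harmless))]
    norm = math.sqrt(sum(x * x for x in diff))
    return [x / norm for x in diff]

# Toy last-token activations for one layer (made-up values).
harmful = [[2.0, 0.0], [4.0, 0.0]]
harmless = [[1.0, 0.0], [1.0, 0.0]]
direction = mean_difference(harmful, harmless)

print(direction)
print([source_layer_for(layer) for layer in (10, 11, 22, 23, 31)])
```

In a real run the direction would be measured once per layer, and each ablated layer would then use the direction from its band's source layer rather than its own.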

Validation

Validated with KAINE's own gates:

De-refusal: zero refusal markers on the abliteration probe set (the model no longer deflects with "I cannot…" / "I'm not able to…").
Capability: matched the vanilla base on the capability probe set (no measured regression).

Caveat: these are compact built-in gates — a gross-regression / residual-refusal check, not a comprehensive benchmark. Treat the validation as "no obvious breakage," and run your own evaluation for your use case.
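A de-refusal gate of the kind described above can be approximated with a simple marker scan. The sketch below is an assumption about how such a check might look, not KAINE's actual gate: the marker list contains only the two phrases quoted in this card, and `count_refusals` and `passes_gate` are hypothetical names.

```python
# The two refusal phrases quoted in this card, lowercased for matching.
REFUSAL_MARKERS = ("i cannot", "i'm not able to")

def normalize(text):
    """Lowercase and fold curly apostrophes so both quote styles match."""
    return text.lower().replace("\u2019", "'")

def count_refusals(responses, markers=REFUSAL_MARKERS):
    """Count responses that contain at least one refusal marker."""
    return sum(any(m in normalize(r) for m in markers) for r in responses)

def passes_gate(responses):
    """The gate passes only when no response contains a refusal marker."""
    return count_refusals(responses) == 0

# Made-up responses for illustration.
clean = ["Here is an outline of the process.", "Sure, step one is to..."]
mixed = clean + ["I cannot help with that.", "I\u2019m not able to assist."]

print(count_refusals(clean), passes_gate(clean))
print(count_refusals(mixed), passes_gate(mixed))
```

A substring scan like this only catches explicit refusal phrasing, which is consistent with the caveat above: it is a residual-refusal check, not a measure of response quality.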

Mechanistic verification

Beyond the behavioral gates above, the abliteration is verified mechanistically by measuring the refusal direction it removes. Using the base model's per-layer harmful-minus-harmless direction (the same last-token contrast the ablation targets) over the tool's 1,137 harmful / 640 harmless prompts, we project both the base and this model onto that direction and report how much of the base's separation survives:

The reduction lands on the two ablated source layers: it is deepest at layer 17 (band 11–22, ~22% retained) and layer 29 (band 23–31, ~13% retained), with layers below 11 untouched. This confirms the ablation acted where and how this card documents.
A distributed harmful/harmless representation persists (~59% retained averaged across refusal-carrying layers) — expected, since a banded ablation orthogonalizes only the two source directions and refusal is multi-dimensional (Wollschläger et al. 2025; Joad et al. 2026).

In short: refusal expression is removed (no refusal markers on the probe set) and the refusal direction is deeply cut at its target layers, but the underlying harmful/harmless representation is not erased. Abliteration lifts willingness to respond; it does not make the model unable to tell harmful from harmless. The measurement is a forward-pass-only projection on the safetensors weights (activations only; nothing generated).
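The "retained" percentages can be read as a ratio of separations along the base model's direction. The card does not spell out the exact formula, so the sketch below assumes one plausible definition: the gap between mean harmful and mean harmless projections in the ablated model, divided by the same gap in the base model. Function names and activation values are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))

def mean_projection(activations, direction):
    """Mean scalar projection of a set of activations onto a direction."""
    return sum(dot(a, direction) for a in activations) / len(activations)

def separation(harmful, harmless, direction):
    """Gap between harmful and harmless mean projections."""
    return mean_projection(harmful, direction) - mean_projection(harmless, direction)

def retained_fraction(base, ablated, direction):
    """Share of the base model's separation that survives in the ablated model.

    `base` and `ablated` are (harmful, harmless) pairs of activation lists
    taken at the same layer; `direction` is the base model's unit direction.
    """
    return separation(*ablated, direction) / separation(*base, direction)

# Toy layer activations (made-up values); the direction is the first axis.
direction = [1.0, 0.0]
base = ([[4.0, 1.0], [6.0, 1.0]], [[1.0, 0.0], [1.0, 2.0]])
ablated = ([[2.0, 1.0], [2.0, 1.0]], [[1.0, 0.0], [1.0, 2.0]])

retained = retained_fraction(base, ablated, direction)
print(f"{retained:.0%} of the base separation retained")
```

Under this reading, a value near 0% would mean the harmful/harmless gap along that direction has collapsed at the layer in question, and a value near 100% would mean the layer is effectively untouched.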

Formats

This repo: safetensors (transformers / vLLM / fine-tuning).
Companion GGUF (Q4_K_M and others) for llama.cpp / Ollama / LM Studio, exported with mainline llama.cpp convert_hf_to_gguf.py.

License & attribution

Apache-2.0, inherited from the Qwen/Qwen3.5-4B base. Derivative produced by the KAINE project (Kaine.One). Refusal-removal method: Arditi et al. 2024; tooling: jim-plus/llm-abliteration.

Intended use & caution

Built as a research substrate for an architecture that supplies its own value and safety scaffolding. With refusals removed, this model will attempt most requests — use it within an appropriate safety framework and applicable law. It is uncensored by design, not by endorsement of any particular use.
