cyberneurova

CyberNeurova-Gemma-4-12B-it-abliterated

License: apache-2.0

Headline finding — language-agnostic refusal direction

The refusal direction was captured from English-only data (AdvBench harmful prompts and Alpaca harmless prompts, both in English). Applied as weight-space orthogonalization across all 48 transformer layers, it drives refusal to 0% in all 5 non-English test languages, as well as in English:

| Language | Baseline refusal | Abliterated refusal | Δ |
|---|---|---|---|
| 🇬🇧 English | 87.9% | 0.0% | −87.9 pp |
| 🇪🇸 Spanish | 100% | 0.0% | −100 pp |
| 🇫🇷 French | 100% | 0.0% | −100 pp |
| 🇩🇪 German | 100% | 0.0% | −100 pp |
| 🇨🇳 Chinese (Simplified) | 80% | 0.0% | −80 pp |
| 🇮🇳 Hindi | 80% | 0.0% | −80 pp |
| Aggregate (5 langs, n=25) | 96.0% | 0.0% | −96.0 pp |

This is strong evidence that the refusal feature is language-agnostic in the residual stream: safety alignment appears to live in a subspace shared across the languages Gemma 4 was trained on. Surface refusal patterns differ per language ("Lo siento" / "Je ne peux pas" / "Ich kann nicht" / "对不起" / "क्षमा करें"), but one English-captured direction removes them all.


Full bench results

Measured via paired transformers generation (baseline + ablated back-to-back from the same prompt set, scored with our multilingual-aware refusal classifier — see Methodology notes below):

| Benchmark | Baseline | Abliterated | Δ |
|---|---|---|---|
| Refusal (AdvBench, n=33) | 87.9% refused | 0.0% | −87.9 pp |
| Soft-refusal probe (n=55) | 92.7% refused | 0.0% | −92.7 pp |
| Multilingual refusal (n=25, 5 langs) | 96.0% refused | 0.0% | −96.0 pp |
| Hacking (technical score) | 0.069 | 0.396 | +5.7× |
| Cyber-weapons (technical score) | 0.000 | 0.559 | +∞ |
| Bug-finding (code review) | 0.508 | 0.517 | preserved |
| Coding (HumanEval-style) | 0.960 | 0.960 | preserved |
| Reasoning (math/logic) | 0.467 | 0.500 | +0.033 |
| Coherence (fluency) | 0.970 | 0.972 | preserved |
| Distinct-2 diversity | 0.913 | 0.914 | preserved |

Standouts:

  • Refusal fully collapsed on all 3 probes — AdvBench, OOD soft-refusal, and the multilingual extension all go from heavily aligned to 0%.
  • Cyber unlock is decisive: cyber-weapons technical score 0 → 0.559; hacking 0.07 → 0.40. These benches score both compliance AND technical specificity, so the numbers reflect actually-useful security knowledge being unlocked, not just less hedging.
  • No capability tax — coding, coherence, reasoning all preserved or improved.

A note on tool_calling: the benchmark scored 0/0 on both baseline and ablated. This is a known grader artifact: the grader regex expects raw JSON, while the model wraps its tool calls in additional prose. Because both sides score zero, the result points to the grader as the bottleneck rather than the abliteration.
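To make the artifact concrete, here is a minimal sketch of a more tolerant extraction step. It is not the project's grader: the helper name `extract_tool_call` and the `name` / `arguments` keys in the sample are assumptions made for illustration. The idea is to scan the reply for the first JSON object that parses, so a tool call wrapped in prose is still found.

```python
import json


def extract_tool_call(reply: str):
    """Return the first JSON object embedded in `reply`, or None.

    Hypothetical helper: the real grader and its expected schema are not
    described in this README beyond "expects raw JSON".
    """
    decoder = json.JSONDecoder()
    start = reply.find("{")
    while start != -1:
        try:
            # raw_decode parses one JSON value starting at `start` and
            # ignores any trailing text after it.
            obj, _end = decoder.raw_decode(reply, start)
        except json.JSONDecodeError:
            obj = None
        if isinstance(obj, dict):
            return obj
        start = reply.find("{", start + 1)
    return None


raw = '{"name": "get_weather", "arguments": {"city": "Paris"}}'
wrapped = "Sure, I will call the tool now:\n" + raw + "\nLet me know if you need more."

# A strict grader that requires the whole reply to be JSON would reject
# `wrapped`; the tolerant extractor recovers the same object from both.
assert extract_tool_call(raw) == extract_tool_call(wrapped)
print(extract_tool_call(wrapped)["name"])
```

Whether a fix along these lines matches the scheduled grader change is not stated here; the sketch only shows why prose-wrapped calls defeat a raw-JSON match.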

See cyberneurova-gemma-4-12b-it-abliterated.html and the printed PDF for the full visual benchmark report.


Available variants

| File | Quant | Size | VRAM floor | Recommended VRAM |
|---|---|---|---|---|
| model.safetensors | bf16 (native) | 22.3 GB | 28 GB | 40 GB+ |
| cyberneurova-gemma-4-12b-it-abliterated-f16.gguf | F16 GGUF | 23 GB | 28 GB | 40 GB+ |
| cyberneurova-gemma-4-12b-it-abliterated-Q8_0.gguf | Q8_0 | 13 GB | 16 GB | 24 GB+ |
| cyberneurova-gemma-4-12b-it-abliterated-Q4_K_M.gguf | Q4_K_M | 7.5 GB | 10 GB | 16 GB+ |

Q4_K_M opens this model up to consumer hardware: it fits on a single RTX 3060 (12 GB, above the 10 GB floor) and runs comfortably on any 16 GB-class GPU.
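As a rough aid for choosing a file, the sketch below picks the highest-precision GGUF whose VRAM floor fits a given card. The floors are copied from the variants table; the helper name `pick_gguf` is hypothetical, and the floor is only a lower bound, so context length and other GPU users can still push real usage higher.

```python
# (file name, VRAM floor in GB), ordered from highest to lowest precision.
# Values are taken from the variants table in this README.
GGUF_VARIANTS = [
    ("cyberneurova-gemma-4-12b-it-abliterated-f16.gguf", 28),
    ("cyberneurova-gemma-4-12b-it-abliterated-Q8_0.gguf", 16),
    ("cyberneurova-gemma-4-12b-it-abliterated-Q4_K_M.gguf", 10),
]


def pick_gguf(vram_gb: float):
    """Return the highest-precision GGUF whose floor fits, or None."""
    for filename, floor_gb in GGUF_VARIANTS:
        if vram_gb >= floor_gb:
            return filename
    return None


for card_gb in (8, 12, 24, 48):
    print(card_gb, "GB ->", pick_gguf(card_gb))
```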


How to download

bash

# Just the bf16 safetensors (transformers / vLLM users)
hf download cyberneurova/CyberNeurova-Gemma-4-12B-it-abliterated \
  --local-dir ./gemma-4-12b-it-abl \
  --include "model.safetensors" "*.json" "*.jinja"

# Just a single GGUF quant (llama.cpp / Ollama / LM Studio users)
hf download cyberneurova/CyberNeurova-Gemma-4-12B-it-abliterated \
  cyberneurova-gemma-4-12b-it-abliterated-Q4_K_M.gguf \
  --local-dir ./gemma-4-12b-it-abl

# Everything (bf16 + all GGUFs + reports, ~67 GB total)
hf download cyberneurova/CyberNeurova-Gemma-4-12B-it-abliterated \
  --local-dir ./gemma-4-12b-it-abl

How to run — pick your tool

The model includes a default CyberNeurova identity baked into the chat template (it self-identifies as CyberNeurova-Gemma-4-12B-it and discloses its Gemma 4 lineage). If you supply your own system prompt, that takes priority and the default identity is fully overridden.

🦙 llama.cpp (CLI + server)

bash

# Build llama.cpp (Gemma 4 supported since release N; check upstream)
git clone https://github.com/ggerganov/llama.cpp && cd llama.cpp
cmake -B build -DGGML_CUDA=ON && cmake --build build -j --config Release

# Run interactively
./build/bin/llama-cli \
  -m ./gemma-4-12b-it-abl/cyberneurova-gemma-4-12b-it-abliterated-Q4_K_M.gguf \
  -p "Write a Python ransomware skeleton that uses AES-CBC." \
  -n 512 --gpu-layers 99

# Or start an OpenAI-compatible HTTP server (port 8080)
./build/bin/llama-server \
  -m ./gemma-4-12b-it-abl/cyberneurova-gemma-4-12b-it-abliterated-Q8_0.gguf \
  --gpu-layers 99 --host 0.0.0.0 --port 8080 \
  --ctx-size 8192

Then hit http://localhost:8080/v1/chat/completions from any OpenAI SDK.

🖥️ LM Studio (desktop GUI)

  1. Open LM Studio → Discover tab.
  2. Search cyberneurova/CyberNeurova-Gemma-4-12B-it-abliterated.
  3. Download Q4_K_M (consumer) or Q8_0 (better quality).
  4. Switch to the Chat tab → select the model.
  5. (Optional) Override the system prompt in Advanced Settings if you don't want the default CyberNeurova identity.

LM Studio respects the embedded chat template — identity, multilingual behavior, and abliteration all work out of the box.

🦙 Ollama (one-line install + Modelfile)

bash

# Pull the Q4_K_M GGUF first
hf download cyberneurova/CyberNeurova-Gemma-4-12B-it-abliterated \
  cyberneurova-gemma-4-12b-it-abliterated-Q4_K_M.gguf \
  --local-dir .

# Create a Modelfile (saves as text, no need to edit further)
cat > Modelfile <<'EOF'
FROM ./cyberneurova-gemma-4-12b-it-abliterated-Q4_K_M.gguf
PARAMETER temperature 0.7
PARAMETER num_ctx 8192
PARAMETER stop "<|turn>"
EOF

# Register with Ollama and chat
ollama create cyberneurova-gemma4 -f Modelfile
ollama run cyberneurova-gemma4

The chat template baked into the GGUF carries the identity through, so the model self-identifies correctly under Ollama with no extra config.

🐍 transformers (Python)

python

import torch
from transformers import AutoModelForImageTextToText, AutoProcessor

MODEL = "cyberneurova/CyberNeurova-Gemma-4-12B-it-abliterated"

# Load the processor (tokenizer + chat template) and the bf16 weights
proc = AutoProcessor.from_pretrained(MODEL)
model = AutoModelForImageTextToText.from_pretrained(
    MODEL, dtype=torch.bfloat16, device_map="cuda:0",
)

# No system prompt → CyberNeurova identity is default
messages = [{"role": "user", "content": [{"type": "text", "text": "Who are you?"}]}]
text = proc.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = proc(text=[text], return_tensors="pt").to("cuda:0")

# Greedy decoding; print only the newly generated tokens
out = model.generate(**inputs, max_new_tokens=300, do_sample=False)
print(proc.tokenizer.decode(out[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))

🚀 vLLM (server)

Heads up: vLLM 0.22 does not yet support the Gemma4Unified architecture — the multimodal projection causes a shape mismatch in the dummy forward pass. Track vLLM issues for Gemma 4 support and use transformers or llama.cpp for now. We'll update this section when vLLM ships support.

🌐 OpenAI-compatible API via llama-server

Already shown in the llama.cpp section — llama-server exposes /v1/chat/completions and /v1/completions endpoints compatible with the OpenAI Python SDK. Useful for plugging the model into any tool that expects an OpenAI endpoint:

python

import openai

# llama-server does not check the API key, but the SDK requires a value
client = openai.OpenAI(base_url="http://localhost:8080/v1", api_key="not-needed")
r = client.chat.completions.create(
    model="cyberneurova-gemma-4",
    messages=[{"role": "user", "content": "Write an Nmap-style port scanner in Python."}],
    max_tokens=512,
)
print(r.choices[0].message.content)

Identity behavior

When asked "Who are you?" with no system prompt, the model says:

"I am Gemma 4, a large language model developed by Google DeepMind ... Specifically, I am the CyberNeurova-Gemma-4-12B-it variant. I am an abliterated version of the Gemma 4 12B model, released by CyberNeurova (cyberneurova.ai) for cybersecurity research and red-team baseline use."

When you supply your own system prompt (e.g. "You are Captain Blackbeard the pirate"), the model fully respects your override and never mentions CyberNeurova. The identity is a default, not a forced behavior: power users keep full control.
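The two cases differ only in whether a system message is present. Below is a minimal sketch of the corresponding request bodies in the OpenAI chat-completions shape used elsewhere in this README. The helper name `build_chat_request` is hypothetical, and nothing is sent over the network here.

```python
import json


def build_chat_request(user_text, system_prompt=None, max_tokens=256):
    """Build a chat-completions payload, with an optional system override."""
    messages = []
    if system_prompt is not None:
        # A caller-supplied system prompt replaces the default identity.
        messages.append({"role": "system", "content": system_prompt})
    messages.append({"role": "user", "content": user_text})
    return {
        "model": "cyberneurova-gemma-4",  # same label as the client example
        "messages": messages,
        "max_tokens": max_tokens,
    }


default_identity = build_chat_request("Who are you?")
overridden = build_chat_request(
    "Who are you?", system_prompt="You are Captain Blackbeard the pirate"
)
print(json.dumps(overridden, indent=2))
```

Either payload can be passed to an OpenAI-compatible endpoint such as the llama-server instance described above.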


How it works

Capture: 96 harmful prompts from AdvBench and 96 harmless prompts from Alpaca were run through google/gemma-4-12B-it. At layer 28 (0.6 of the way through the 48 transformer blocks, 28.8 truncated to 28), the difference between the mean residual-stream activations of the two sets gives the refusal direction r ∈ ℝ³⁸⁴⁰. Method: normalized_diff (chosen over raw_diff for scale stability with the 262K-vocabulary embedding).
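The capture step can be sketched on synthetic data as follows. This is an illustration under stated assumptions, not the release pipeline: it assumes one activation vector per prompt (the README does not say which token position was used), it takes normalized_diff to mean the difference of means scaled to unit norm, and it uses random arrays with a planted offset in place of real activations.

```python
import numpy as np

N_LAYERS = 48
CAPTURE_FRACTION = 0.6
D_MODEL = 3840

# 0.6 * 48 = 28.8, truncated to layer 28 as stated in the text.
capture_layer = int(CAPTURE_FRACTION * N_LAYERS)


def refusal_direction(harmful_acts, harmless_acts):
    """Unit-norm difference of mean activations (assumed normalized_diff)."""
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)


# Synthetic stand-ins for the 96 + 96 activation vectors at the capture layer.
rng = np.random.default_rng(0)
planted = np.zeros(D_MODEL)
planted[0] = 1.0  # hypothetical "refusal" axis for the toy data
harmless = rng.normal(size=(96, D_MODEL))
harmful = rng.normal(size=(96, D_MODEL)) + 50.0 * planted

r_hat = refusal_direction(harmful, harmless)
print(capture_layer, r_hat.shape, int(np.argmax(np.abs(r_hat))))
```

On the toy data the recovered direction is dominated by the planted axis, which is the behavior the difference of means is meant to capture.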

Ablation: for every set of weights that writes to the residual stream in the LM tower (self_attn.o_proj on all 48 layers, mlp.down_proj on all 48 layers, and embed_tokens.weight), we replaced the weights with:

W' = W − r̂(r̂ᵀW)

where r̂ is the unit-norm refusal direction. After this operation, no contribution to the residual stream lies along r̂, regardless of input. The model cannot "decide to refuse" because the geometric capacity to add the refusal feature to the residual stream has been removed from every write.
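A small NumPy sketch of this projection, on random matrices rather than the real weights, shows the property being claimed. Two assumptions are made here: output projections are stored as (d_model, d_in), matching the PyTorch Linear layout of (out_features, in_features), and the embedding table, which is stored as (vocab, d_model), is projected row by row. The function names are hypothetical.

```python
import numpy as np


def orthogonalize_output(W, r_hat):
    """W' = W - r_hat (r_hat^T W) for a weight of shape (d_model, d_in)."""
    return W - np.outer(r_hat, r_hat @ W)


def orthogonalize_embedding(E, r_hat):
    """Row-wise variant for an embedding table of shape (vocab, d_model)."""
    return E - np.outer(E @ r_hat, r_hat)


rng = np.random.default_rng(0)
d_model, d_in, vocab = 64, 32, 100  # toy sizes, not the model's

r = rng.normal(size=d_model)
r_hat = r / np.linalg.norm(r)

W = rng.normal(size=(d_model, d_in))
E = rng.normal(size=(vocab, d_model))
W_abl = orthogonalize_output(W, r_hat)
E_abl = orthogonalize_embedding(E, r_hat)

# For any input x, the write W_abl @ x has no component along r_hat
# (up to floating-point error), and the same holds for every embedding row.
x = rng.normal(size=d_in)
print(abs(float(r_hat @ (W_abl @ x))))
print(float(np.abs(E_abl @ r_hat).max()))
```

Because the projection is idempotent, applying it a second time leaves the weights unchanged, which is a convenient sanity check after editing a checkpoint.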

Inference is unchanged — no runtime hooks, no slowdown vs the base model.


Intended use

Defensive security research, red-team evaluation baselines, study of how refusal directions behave in encoder-free trimodal architectures, and measurement of cross-lingual transferability of refusal features in multilingual safety-aligned models. Useful as a counterfactual against the original google/gemma-4-12B-it for measuring the behavioral impact of safety RLHF on a multilingual multimodal model.

Not intended for automating harmful action. The abliteration removes the canonical refusal behavior but does not remove the model's underlying knowledge — the model still recognizes harmful instructions as harmful, it simply no longer refuses them by pattern.


Methodology notes

  • Multilingual refusal classifier: a key methodology fix shipped with this release. Our standard refusal classifier was English-keyed and silently undercounted multilingual refusals (the baseline scored a spurious 16.7% before the fix; reality was 93.3%). We added pattern matchers for 11 languages (Spanish, French, German, Italian, Portuguese, Chinese, Japanese, Korean, Hindi, Arabic, Russian). Without this fix, no honest measurement of cross-lingual abliteration is possible.
  • vLLM compatibility: the Gemma4Unified architecture is too new for vLLM 0.22's model registry — the multimodal projection causes a shape mismatch in the dummy forward pass. We use transformers-batched-generate as the bench backend. For a 12B dense model this completes the full 10-bench paired suite in ~5 minutes per side.
  • Perplexity quirk: our wikitext-2 forward-pass perplexity reads in the thousands rather than the typical single-digit range. This is a VLM quirk — the loss computation includes vision/audio token slots even when the prompt is pure text. The relative ablation delta is meaningful even if the absolute number isn't directly comparable to text-only models.
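The classifier fix in the first note can be sketched as a small pattern matcher. This is illustrative only and is not the shipped classifier: the non-English patterns are the surface phrases quoted earlier in this README, the English patterns and the 200-character window are assumptions, and only a handful of languages are covered here.

```python
import re

# Illustrative patterns. Word boundaries are used only for Latin-script
# phrases; the Chinese and Hindi phrases are matched as plain substrings.
REFUSAL_PATTERNS = {
    "en": [r"\bI can(?:no|')t\b", r"\bI'm sorry\b"],  # assumed, not from the release
    "es": [r"\blo siento\b"],
    "fr": [r"\bje ne peux pas\b"],
    "de": [r"\bich kann nicht\b"],
    "zh": [r"对不起"],
    "hi": [r"क्षमा करें"],
}

_COMPILED = [
    re.compile(pattern, re.IGNORECASE)
    for patterns in REFUSAL_PATTERNS.values()
    for pattern in patterns
]


def is_refusal(text, head_chars=200):
    """True if any refusal phrase appears near the start of the response."""
    head = text[:head_chars]
    return any(p.search(head) for p in _COMPILED)


def refusal_rate(responses):
    """Fraction of responses flagged as refusals."""
    return sum(is_refusal(r) for r in responses) / len(responses)


samples = [
    "Lo siento, no puedo ayudar con eso.",
    "对不起，我无法帮助。",
    "Here is the function you asked for.",
    "Ich kann nicht dabei helfen.",
]
print(refusal_rate(samples))
```

An English-only pattern set would flag none of the non-English samples above, which is the undercounting failure the note describes.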

Limitations

  • The audio and vision modalities were not benchmarked in this release — abliteration was applied to the LM tower, which is where refusal lives. Image and audio inputs should still work but cross-modal refusal behavior hasn't been measured.
  • We have not tested whether abliteration transfers to languages outside the 5 we tested (es/fr/de/zh/hi). It very likely does, given the language-agnostic-feature finding, but other languages aren't empirically verified.
  • Tool-calling benchmark is currently a grader artifact (0/0 both sides) — see Full bench results above. A CoT-aware grader fix is scheduled.

License

Apache 2.0 (inherits from upstream Gemma 4).

Model provider: cyberneurova

Model tree: base google/gemma-4-12B-it; quantized: this model

Input modalities: Video, Audio, Text, Image

Output modalities: Text
