What "unsloth" means here
Unsloth is an open-source project that makes LLM fine-tuning and
inference faster and far more memory-efficient. Beyond the training library, Unsloth re-publishes
upstream model weights in convenient, ready-to-use forms — fixed chat templates, dynamic GGUF quants,
and clean safetensors mirrors.
The base used here, unsloth/gemma-4-E2B-it-qat-q4_0-unquantized, is Unsloth's mirror of Google's
QAT (Quantization-Aware Training) checkpoint, with the Q4_0 QAT weights restored to full precision
(bf16 safetensors). QAT lets a model keep close to bf16 quality after 4-bit quantization, so this
"unquantized-from-QAT" checkpoint remains a quantization-friendly starting point for downstream work,
which is why it was chosen as the abliteration base.
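As a rough illustration of what 4-bit quantization does to a weight tensor, the sketch below runs a quantize-then-dequantize round trip. It assumes a simplified symmetric scheme with one scale per block of 32 values; it is not the exact Q4_0 layout and not Google's QAT recipe, and `fake_quant_4bit` is a hypothetical name. QAT, broadly, exposes the model to this kind of rounding during training so that it learns to tolerate it.

```python
import numpy as np

def fake_quant_4bit(weights: np.ndarray, block_size: int = 32) -> np.ndarray:
    """Quantize to 4-bit integers per block, then dequantize back to float.

    Assumes weights.size is a multiple of block_size (simplified sketch).
    """
    flat = weights.astype(np.float32).reshape(-1, block_size)
    # One scale per block, chosen so the largest magnitude maps to +/-7.
    max_abs = np.abs(flat).max(axis=1, keepdims=True)
    scale = np.where(max_abs > 0, max_abs / 7.0, 1.0)
    # Round to the nearest integer level in the signed 4-bit range [-8, 7].
    q = np.clip(np.round(flat / scale), -8, 7)
    return (q * scale).reshape(weights.shape)

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 64)).astype(np.float32)
w_q = fake_quant_4bit(w)
max_err = float(np.abs(w - w_q).max())  # worst-case rounding error
```

The rounding error per value is at most half a quantization step, so it scales with the largest weight in each block.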
What "heretic" means here
Heretic is an automated decensoring / abliteration tool.
"Abliteration" identifies the internal direction a model uses to represent refusal and removes that
direction from the weights that write to the residual stream, so the model stops refusing while keeping
everything else. Heretic does this automatically: it runs a TPE optimization over the abliteration
parameters, scoring each candidate on how many refusals remain versus how much the model's output
distribution drifts from the original (KL divergence), and returns the Pareto-optimal trade-offs.
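The Pareto selection step can be sketched in a few lines. This is not Heretic's code; the `Trial` type, the `pareto_front` helper, and the trial values are hypothetical and only show what "Pareto-optimal" means for two objectives that are both minimized.

```python
from typing import NamedTuple

class Trial(NamedTuple):
    refusals: int          # refusals remaining on the evaluation prompts
    kl_divergence: float   # drift from the original output distribution

def pareto_front(trials: list[Trial]) -> list[Trial]:
    """Return the trials that no other trial beats on both objectives."""
    front = []
    for t in trials:
        dominated = any(
            o.refusals <= t.refusals
            and o.kl_divergence <= t.kl_divergence
            and (o.refusals < t.refusals or o.kl_divergence < t.kl_divergence)
            for o in trials
        )
        if not dominated:
            front.append(t)
    return sorted(front, key=lambda t: t.refusals)

# Made-up trial results, for illustration only.
trials = [
    Trial(40, 0.010),
    Trial(11, 0.050),
    Trial(11, 0.090),  # same refusals as the trial above, more drift
    Trial(5, 0.300),
    Trial(45, 0.020),  # worse than Trial(40, 0.010) on both objectives
]
front = pareto_front(trials)
```

Every trial left on the front is a defensible choice; picking one means deciding how much drift is acceptable for a given drop in refusals.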
This build used the Arbitrary-Rank Ablation (ARA) method with row-norm preservation
(row_normalization = full): the edit is applied as a rank-3 LoRA adapter, and weight rows are
renormalized to their original magnitudes, which minimizes collateral damage.
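To make the idea concrete, here is a minimal sketch of the classic rank-1 form of directional ablation with optional row-norm restoration. It is a simplification, not ARA itself and not Heretic's implementation: the function names, the difference-of-means direction estimate, and the random stand-in data are all assumptions made for illustration.

```python
import numpy as np

def refusal_direction(harmful_acts: np.ndarray, harmless_acts: np.ndarray) -> np.ndarray:
    """Unit vector along the difference of mean activations (rows = prompts)."""
    diff = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return diff / np.linalg.norm(diff)

def ablate(weight: np.ndarray, direction: np.ndarray, preserve_row_norms: bool = True) -> np.ndarray:
    """Remove `direction` from what `weight` (d_model x d_in) writes to the residual stream."""
    # W' = W - r (r^T W): every output of W' is orthogonal to r.
    ablated = weight - np.outer(direction, direction @ weight)
    if preserve_row_norms:
        # Rescale each row back to its original magnitude. This trades exact
        # orthogonality to r for weights that keep their original scale.
        old = np.linalg.norm(weight, axis=1, keepdims=True)
        new = np.linalg.norm(ablated, axis=1, keepdims=True)
        ablated = ablated * (old / np.maximum(new, 1e-12))
    return ablated

rng = np.random.default_rng(0)
d_model, d_in = 8, 16
harmless = rng.normal(size=(50, d_model))
harmful = rng.normal(size=(50, d_model)) + 2.0   # shifted stand-in activations
r = refusal_direction(harmful, harmless)

W = rng.normal(size=(d_model, d_in))             # stand-in for o_proj or down_proj
W_plain = ablate(W, r, preserve_row_norms=False)
W_normed = ablate(W, r)
```

Without renormalization the direction is removed exactly; with it, each row keeps its original norm and only a small residual component along the direction remains.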
Abliteration details
| Setting | Value |
|---|---|
| Tool | Heretic v1.4.0 |
| Method | Arbitrary-Rank Ablation (ARA), row_normalization = full, LoRA rank 3 |
| Scope | Language-model output projections (o_proj, down_proj); vision/audio towers untouched |
| Trials | 200 (TPE) over mlabonne/harmless_alpaca + mlabonne/harmful_behaviors |
| Selected trial | refusals 11/100, KL divergence 0.0489 |
Heretic v1.2.0 was the originally intended version, but it predates the gemma4 architecture in
🤗 Transformers and cannot load it; v1.4.0 implements the identical ARA + row-norm-preservation method.
About Gemma 4 (base model)
Gemma is a family of open models built by Google DeepMind. Gemma 4 models are multimodal, handling text
and image input (with audio supported on E2B, E4B, and 12B) and generating text output. Gemma 4 features
a large context window and maintains multilingual support in over 140 languages.
The E2B variant is the smallest Gemma 4 model, designed for efficient on-device and edge deployment.
The "E" stands for effective parameters: the model uses Per-Layer Embeddings (PLE) so its effective
parameter count is much smaller than its total.
| Property | E2B |
|---|---|
| Total Parameters | 2.3B effective (5.1B with embeddings) |
| Layers | 35 |
| Sliding Window | 512 tokens |
| Context Length | 128K tokens |
| Vocabulary Size | 262K |
| Supported Modalities | Text, Image, Audio |
Gemma 4 uses a hybrid attention mechanism that interleaves local sliding-window attention with full
global attention (the final layer is always global). Global layers use unified Keys and Values and apply
Proportional RoPE (p-RoPE) to optimize memory for long contexts.
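The difference between the two attention types comes down to which earlier tokens each position may attend to. The sketch below builds both masks with NumPy; the function names are hypothetical, and the convention that the window counts the current token is an assumption that varies between implementations.

```python
import numpy as np

def causal_mask(seq_len: int) -> np.ndarray:
    """Global causal attention: token i sees tokens 0..i."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

def sliding_window_mask(seq_len: int, window: int) -> np.ndarray:
    """Local causal attention: token i sees the last `window` tokens, itself included."""
    idx = np.arange(seq_len)
    distance = idx[:, None] - idx[None, :]  # query position minus key position
    return (distance >= 0) & (distance < window)

# Tiny sizes for readability; the table above lists a 512-token window for E2B.
seq_len, window = 8, 3
local = sliding_window_mask(seq_len, window)
global_ = causal_mask(seq_len)
```

A local layer only needs to cache keys and values for the last `window` positions, while a global layer needs them for the whole sequence, which is where the long-context memory savings come from.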
Core capabilities
- Thinking – built-in step-by-step reasoning mode.
- Long context – up to 128K tokens on E2B.
- Image understanding – detection, document/PDF parsing, UI/chart understanding, OCR, handwriting.
- Audio – speech recognition (ASR) and speech-to-translated-text.
- Function calling – native structured tool use for agentic workflows.
- Coding – generation, completion, and correction.
- Multilingual – out-of-the-box for many languages, pre-trained on 140+.
E2B benchmark reference (base, instruction-tuned)
| Benchmark | Gemma 4 E2B |
|---|---|
| MMLU Pro | 60.0% |
| GPQA Diamond | 43.4% |
| LiveCodeBench v6 | 44.0% |
| MMMU Pro (vision) | 44.2% |
| MATH-Vision | 52.4% |
Reported for the unmodified base model; abliteration may shift safety-adjacent behavior.
from transformers import AutoProcessor, AutoModelForImageTextToText

MODEL_ID = "MobiusDevelopment/gemma4-E2B-it-qat-Q4-unsloth-heretic"

processor = AutoProcessor.from_pretrained("unsloth/gemma-4-E2B-it-qat-q4_0-unquantized")  # processor comes from the base repo
model = AutoModelForImageTextToText.from_pretrained(MODEL_ID, dtype="auto", device_map="auto")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Write a short joke about saving RAM."},
]
inputs = processor.apply_chat_template(
    messages, tokenize=True, return_dict=True, return_tensors="pt",
    add_generation_prompt=True, enable_thinking=False,
).to(model.device)
input_len = inputs["input_ids"].shape[-1]  # prompt length, used to slice off the prompt below

outputs = model.generate(**inputs, max_new_tokens=512)
print(processor.decode(outputs[0][input_len:], skip_special_tokens=False))  # decode only the new tokens
The safetensors are the full multimodal Gemma 4 weights with decensored text layers (image + audio
towers intact — only refusals were removed). Two GGUF options are included:
- llama.cpp: gemma4-E2B-it-qat-Q4_K_M-unsloth-heretic.gguf (text) + mmproj-gemma4-E2B-it-unsloth-heretic.gguf (vision + audio projector).
- Ollama (single file): gemma4-E2B-it-qat-Q4-unsloth-heretic.gguf, the combined text+vision+audio GGUF published to Ollama.
All formats are vision + audio capable. Note: ollama.com only renders the audio capability badge for
official library/* models, so the community model page shows vision/tools/thinking — audio still
works at runtime (the projector and audio encoder are present).
Recommended sampling
temperature=1.0, top_p=0.95, top_k=64. Thinking is enabled by including <|think|> at the start of
the system prompt; remove it to disable. Many libraries (Transformers, llama.cpp) handle the chat
template for you.
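With Transformers, these settings might be wired up as follows. The `build_messages` helper and the constant names are hypothetical, and placing the token directly before the system text with no separator is an assumption based on the description above. `do_sample=True` is included because `generate` otherwise decodes greedily and ignores the sampling values.

```python
# Recommended sampling values, in the form accepted by `model.generate`.
GENERATION_KWARGS = {
    "do_sample": True,   # needed for temperature/top_p/top_k to take effect
    "temperature": 1.0,
    "top_p": 0.95,
    "top_k": 64,
}

THINK_TOKEN = "<|think|>"

def build_messages(user_prompt, system_prompt="You are a helpful assistant.", thinking=False):
    """Hypothetical helper: put the thinking token at the start of the system prompt."""
    if thinking:
        system_prompt = THINK_TOKEN + system_prompt
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]

messages_thinking = build_messages("Write a short joke about saving RAM.", thinking=True)
messages_plain = build_messages("Write a short joke about saving RAM.")
```

The resulting messages can be passed to `processor.apply_chat_template` as usual, and the sampling values to `model.generate(**inputs, max_new_tokens=512, **GENERATION_KWARGS)`.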
Limitations
This model inherits the limitations of the Gemma 4 base (factual accuracy, common-sense gaps, sensitivity
to prompt quality, training-data biases) and additionally has its safety refusals removed. It will
attempt to comply with prompts the original model would refuse, including harmful ones. Apply your own
content-safety safeguards for any deployment.
Decensored with Heretic. Base weights © Google DeepMind, mirrored by
Unsloth, used under the Gemma Terms.