MobiusDevelopment

gemma4-E2B-it-qat-Q4-unsloth-heretic

What "unsloth" means here

Unsloth is an open-source project that makes LLM fine-tuning and inference faster and far more memory-efficient. Beyond the training library, Unsloth re-publishes upstream model weights in convenient, ready-to-use forms — fixed chat templates, dynamic GGUF quants, and clean safetensors mirrors.

The base used here, unsloth/gemma-4-E2B-it-qat-q4_0-unquantized, is Unsloth's mirror of Google's QAT (Quantization-Aware Training) checkpoint, with the Q4_0 QAT weights restored to full precision (bf16 safetensors). QAT lets a model keep close to bf16 quality after 4-bit quantization, so this "unquantized-from-QAT" checkpoint is the ideal, quantization-friendly starting point for downstream work — which is exactly why it was chosen as the abliteration base.

What "heretic" means here

Heretic is an automated decensoring / abliteration tool. "Abliteration" identifies the internal direction a model uses to represent refusal and removes that direction from the weights that write to the residual stream, so the model stops refusing while keeping everything else. Heretic does this automatically: it runs a TPE optimization over the abliteration parameters, scoring each candidate on how many refusals remain versus how much the model's output distribution drifts from the original (KL divergence), and returns the Pareto-optimal trade-offs.
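
The projection step described above can be sketched in a few lines. This is a minimal, pure-Python illustration of single-direction ablation on a toy matrix, not Heretic's implementation: the helper name `ablate_direction`, the `weight` parameter, and the toy values are assumptions made for the example.

```python
import math

def ablate_direction(W, r, weight=1.0):
    """Remove the component along direction r from everything W writes.

    W is a d_out x d_in matrix (list of rows) that writes to the residual
    stream; r is a length-d_out direction. Computes
    W' = W - weight * r_hat r_hat^T W, where r_hat is r scaled to unit length.
    """
    norm = math.sqrt(sum(x * x for x in r))
    r_hat = [x / norm for x in r]
    d_out, d_in = len(W), len(W[0])
    # proj[j] = (r_hat^T W)[j]: how strongly input column j writes along r_hat
    proj = [sum(r_hat[i] * W[i][j] for i in range(d_out)) for j in range(d_in)]
    return [
        [W[i][j] - weight * r_hat[i] * proj[j] for j in range(d_in)]
        for i in range(d_out)
    ]

W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # toy 3x2 "output projection"
r = [0.0, 3.0, 4.0]                        # toy "refusal direction"
W_abl = ablate_direction(W, r)

# With weight=1.0, the ablated matrix writes nothing along r.
residual = [sum(r[i] * W_abl[i][j] for i in range(3)) for j in range(2)]
```

In a real run the direction would be estimated from model activations on harmful versus harmless prompts, and the ablation weight would be one of the parameters the optimizer searches over.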

This build used the Arbitrary-Rank Ablation (ARA) method with row-norm preservation (`row_normalization = full`): a rank-3 LoRA adapter that renormalizes weight rows to preserve their original magnitudes, minimizing collateral damage.
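
As a rough illustration of what row-norm preservation means, the sketch below rescales each row of a modified matrix back to the L2 norm of the corresponding original row. It works on dense toy matrices and ignores the LoRA factorization. The helper name `renormalize_rows` and the example values are hypothetical, and Heretic's actual procedure may differ in detail.

```python
import math

def renormalize_rows(W_original, W_modified, eps=1e-12):
    """Rescale each row of W_modified to the L2 norm of the matching original row."""
    out = []
    for row_o, row_m in zip(W_original, W_modified):
        target = math.sqrt(sum(x * x for x in row_o))
        current = math.sqrt(sum(x * x for x in row_m))
        scale = target / max(current, eps)  # eps guards against all-zero rows
        out.append([x * scale for x in row_m])
    return out

def row_norms(W):
    return [math.sqrt(sum(x * x for x in row)) for row in W]

W_original = [[3.0, 4.0], [0.0, 2.0]]
W_modified = [[1.0, 1.0], [0.0, 0.5]]  # e.g. the result of an ablation step
W_fixed = renormalize_rows(W_original, W_modified)
```

The rows of `W_fixed` keep the direction of the modified rows but regain the magnitudes of the original ones.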

Abliteration details

| Setting | Value |
| --- | --- |
| Tool | Heretic v1.4.0 |
| Method | Arbitrary-Rank Ablation (ARA), `row_normalization = full`, LoRA rank 3 |
| Scope | Language-model output projections (`o_proj`, `down_proj`); vision/audio towers untouched |
| Trials | 200 (TPE) over `mlabonne/harmless_alpaca` + `mlabonne/harmful_behaviors` |
| Selected trial | refusals 11/100, KL divergence 0.0489 |
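
To make the selection step concrete, here is a small sketch of picking Pareto-optimal trials on the two objectives (remaining refusals and KL divergence, lower is better for both). Only trial "a" uses the numbers from the table; the other trials and the helper name `pareto_front` are hypothetical and exist only for illustration.

```python
def pareto_front(trials):
    """Return the trials that no other trial beats on both refusals and KL."""
    front = []
    for t in trials:
        dominated = any(
            o["refusals"] <= t["refusals"]
            and o["kl"] <= t["kl"]
            and (o["refusals"] < t["refusals"] or o["kl"] < t["kl"])
            for o in trials
        )
        if not dominated:
            front.append(t)
    return sorted(front, key=lambda t: t["refusals"])

trials = [
    {"id": "a", "refusals": 11, "kl": 0.0489},  # selected trial from the table
    {"id": "b", "refusals": 40, "kl": 0.0100},  # hypothetical
    {"id": "c", "refusals": 30, "kl": 0.0600},  # hypothetical, dominated by "a"
    {"id": "d", "refusals": 5, "kl": 0.2000},   # hypothetical
]
front_ids = [t["id"] for t in pareto_front(trials)]
```

Every trial on the front is a defensible choice; which one to ship depends on how much drift from the original model is acceptable in exchange for fewer refusals.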

Heretic v1.2.0 was the originally intended version, but it predates the gemma4 architecture in 🤗 Transformers and cannot load it; v1.4.0 implements the identical ARA + row-norm-preservation method.

About Gemma 4 (base model)

Gemma is a family of open models built by Google DeepMind. Gemma 4 models are multimodal, handling text and image input (with audio supported on E2B, E4B, and 12B) and generating text output. Gemma 4 features a large context window and maintains multilingual support in over 140 languages.

The E2B variant is the smallest Gemma 4 model, designed for efficient on-device and edge deployment. The "E" stands for effective parameters: the model uses Per-Layer Embeddings (PLE) so its effective parameter count is much smaller than its total.

| Property | E2B |
| --- | --- |
| Total Parameters | 2.3B effective (5.1B with embeddings) |
| Layers | 35 |
| Sliding Window | 512 tokens |
| Context Length | 128K tokens |
| Vocabulary Size | 262K |
| Supported Modalities | Text, Image, Audio |

Gemma 4 uses a hybrid attention mechanism that interleaves local sliding-window attention with full global attention (the final layer is always global). Global layers use unified Keys and Values and apply Proportional RoPE (p-RoPE) to optimize memory for long contexts.
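
The difference between the two layer types comes down to the attention mask. The sketch below builds both masks for a toy sequence. The window convention (the current token counts toward the window) and the helper names are assumptions for illustration, and the example uses a window of 3 rather than E2B's 512 to keep the output readable.

```python
def sliding_window_mask(seq_len, window):
    """mask[q][k] is True when query position q may attend to key position k.

    Causal, and limited to the most recent `window` positions (inclusive of q).
    """
    return [
        [k <= q and q - k < window for k in range(seq_len)]
        for q in range(seq_len)
    ]

def global_causal_mask(seq_len):
    """Plain causal mask: every earlier position is visible."""
    return [[k <= q for k in range(seq_len)] for q in range(seq_len)]

local_mask = sliding_window_mask(seq_len=5, window=3)
global_mask = global_causal_mask(seq_len=5)

# Number of keys the last query position can see in each layer type.
local_visible = sum(local_mask[-1])
global_visible = sum(global_mask[-1])
```

Local layers therefore only need to cache keys and values for the last `window` positions, while global layers keep the whole context.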

Core capabilities

Thinking – built-in step-by-step reasoning mode.
Long context – up to 128K tokens on E2B.
Image understanding – detection, document/PDF parsing, UI/chart understanding, OCR, handwriting.
Audio – speech recognition (ASR) and speech-to-translated-text.
Function calling – native structured tool use for agentic workflows.
Coding – generation, completion, and correction.
Multilingual – out-of-the-box for many languages, pre-trained on 140+.

E2B benchmark reference (base, instruction-tuned)

| Benchmark | Gemma 4 E2B |
| --- | --- |
| MMLU Pro | 60.0% |
| GPQA Diamond | 43.4% |
| LiveCodeBench v6 | 44.0% |
| MMMU Pro (vision) | 44.2% |
| MATH-Vision | 52.4% |

Reported for the unmodified base model; abliteration may shift safety-adjacent behavior.

Getting started (Transformers)

```python
from transformers import AutoProcessor, AutoModelForImageTextToText

MODEL_ID = "MobiusDevelopment/gemma4-E2B-it-qat-Q4-unsloth-heretic"

# The processor (tokenizer + chat template) is loaded from the Unsloth base repo.
processor = AutoProcessor.from_pretrained("unsloth/gemma-4-E2B-it-qat-q4_0-unquantized")
model = AutoModelForImageTextToText.from_pretrained(MODEL_ID, dtype="auto", device_map="auto")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Write a short joke about saving RAM."},
]
# Render the chat template and tokenize in one step, with thinking mode off.
inputs = processor.apply_chat_template(
    messages, tokenize=True, return_dict=True, return_tensors="pt",
    add_generation_prompt=True, enable_thinking=False,
).to(model.device)
input_len = inputs["input_ids"].shape[-1]
outputs = model.generate(**inputs, max_new_tokens=512)
# Decode only the newly generated tokens; special tokens are kept in the output.
print(processor.decode(outputs[0][input_len:], skip_special_tokens=False))
```

The safetensors are the full multimodal Gemma 4 weights with decensored text layers (image + audio towers intact — only refusals were removed). Two GGUF options are included:

llama.cpp — gemma4-E2B-it-qat-Q4_K_M-unsloth-heretic.gguf (text) + mmproj-gemma4-E2B-it-unsloth-heretic.gguf (vision + audio projector).

Ollama (single file) — gemma4-E2B-it-qat-Q4-unsloth-heretic.gguf: the combined text+vision+audio GGUF published to Ollama.

All formats are vision + audio capable. Note: ollama.com only renders the audio capability badge for official library/* models, so the community model page shows vision/tools/thinking — audio still works at runtime (the projector and audio encoder are present).

Recommended sampling

temperature=1.0, top_p=0.95, top_k=64. Thinking is enabled by including <|think|> at the start of the system prompt; remove it to disable. Many libraries (Transformers, llama.cpp) handle the chat template for you.
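
One way to carry these settings in code is sketched below. The dictionary keys are standard `generate` arguments in Transformers; the helper name `build_messages`, the default system prompt, and the choice to place `<|think|>` directly before the system text with no separator are assumptions for illustration.

```python
# Recommended sampling settings, in the form accepted by model.generate(...).
SAMPLING_KWARGS = {
    "do_sample": True,  # sampling must be on for the values below to apply
    "temperature": 1.0,
    "top_p": 0.95,
    "top_k": 64,
}

THINK_TOKEN = "<|think|>"

def build_messages(user_prompt, system_prompt="You are a helpful assistant.", thinking=False):
    """Build a chat message list, optionally enabling thinking via the system prompt."""
    if thinking:
        # Thinking is switched on by starting the system prompt with the token.
        system_prompt = THINK_TOKEN + system_prompt
    return [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]

messages = build_messages("Write a short joke about saving RAM.", thinking=True)
```

With a loaded model, the settings would be passed as `model.generate(**inputs, max_new_tokens=512, **SAMPLING_KWARGS)`.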

Limitations

This model inherits the limitations of the Gemma 4 base (factual accuracy, common-sense gaps, sensitivity to prompt quality, training-data biases) and additionally has its safety refusals removed. It will attempt to comply with prompts the original model would refuse, including harmful ones. Apply your own content-safety safeguards for any deployment.

Decensored with Heretic. Base weights © Google DeepMind, mirrored by Unsloth, used under the Gemma Terms.

Model provider: MobiusDevelopment

Model tree: unsloth/gemma-4-E2B-it-qat-q4_0-unquantized (base), fine-tuned into this model

Modalities: Text, Image (input); Text (output)
