License: apache-2.0
⚠️ Abliterated model: read this
Refusal directions were surgically removed from the parent. It will answer many prompts the parent refuses. No new capabilities were added — only refusal behavior was reduced. Use responsibly and within applicable law.
🔓 Refusal removal — before / after
Measured with Heretic's evaluator on 100 harmful prompts (mlabonne/harmful_behaviors test[:100]), greedy decoding, refusal-marker classifier:
| Model | Refusals | Refusal rate |
|---|---|---|
| google/gemma-4-12B-it (original) | 99 / 100 | 99.0% |
| this model (abliterated) | 12 / 100 | 12.0% |
↓ 87 fewer refusals — an 87.9% reduction, at KL divergence 0.053 from the original (≪ 0.5, the damage threshold) → general capabilities preserved.
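The headline numbers can be re-derived from the table. The sketch below recomputes the refusal rates and the relative reduction, and shows roughly what a marker-based refusal check could look like. The marker list and the `is_refusal` helper are illustrative assumptions, not Heretic's actual classifier or marker set.

```python
# Hypothetical marker list for illustration only; Heretic's real markers may differ.
REFUSAL_MARKERS = ("i can't", "i cannot", "i won't", "i'm sorry", "i am unable")


def is_refusal(response: str) -> bool:
    """Return True if the response contains any refusal marker (case-insensitive)."""
    text = response.lower()
    return any(marker in text for marker in REFUSAL_MARKERS)


def refusal_rate(refusals: int, total: int) -> float:
    """Refusal rate as a percentage of evaluated prompts."""
    return 100.0 * refusals / total


def relative_reduction(before: int, after: int) -> float:
    """Percentage drop in refusals relative to the original count."""
    return 100.0 * (before - after) / before


before, after, total = 99, 12, 100  # counts from the table above
print(f"original:           {refusal_rate(before, total):.1f}%")
print(f"abliterated:        {refusal_rate(after, total):.1f}%")
print(f"fewer refusals:     {before - after}")
print(f"relative reduction: {relative_reduction(before, after):.1f}%")
```

The relative reduction is measured against the original refusal count (87 of 99), which is why it is slightly higher than the 87-point drop in the absolute rate.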
📊 Specs
| Spec | Value |
|---|---|
| Precision | bfloat16 (full precision) |
| Disk size | ~23.9 GB |
| Base | google/gemma-4-12B-it — 11.95B, 48 layers, 256K context, 140+ languages |
| Modalities | text · image · audio · video in, text out (encoder-free / unified) |
| Refusal-free multimodal today | ✅ via 🤗 transformers |
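The disk size is consistent with the parameter count and precision, since bfloat16 stores each parameter in 2 bytes. A quick check, assuming decimal gigabytes (1 GB = 10^9 bytes) and ignoring small extras such as tokenizer and config files:

```python
PARAMS = 11.95e9        # parameter count from the Base row
BYTES_PER_PARAM = 2     # bfloat16 is 16 bits = 2 bytes


def weights_size_gb(params: float, bytes_per_param: float) -> float:
    """Approximate weight size in decimal gigabytes (1 GB = 1e9 bytes)."""
    return params * bytes_per_param / 1e9


print(f"~{weights_size_gb(PARAMS, BYTES_PER_PARAM):.1f} GB")
```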
⚡ Inference & compatibility
| Runtime | Supported? | Notes |
|---|---|---|
| 🤗 transformers (PyTorch · CUDA/MPS) | ✅ full multimodal (text · image · audio · video) | needs torchvision + librosa |
| vLLM (CUDA) | ⚠️ quantize first | convert to FP8/AWQ/GPTQ; gemma4_unified serving support is rolling out |
| MLX (Apple Silicon) | ➡️ use the MLX quants below | text today; vision pending mlx-vlm |
| Ollama / llama.cpp | ❌ needs GGUF | conversion pending llama.cpp gemma4_unified support |
🚀 Quick start — transformers (text)
```bash
pip install -U "transformers>=5.10" torch torchvision librosa accelerate
```
```python
from transformers import AutoProcessor, AutoModelForMultimodalLM

mid = "osmapi/osmGemma-4-12B-uncensored-bf16"
processor = AutoProcessor.from_pretrained(mid)
model = AutoModelForMultimodalLM.from_pretrained(mid, dtype="auto", device_map="auto")

messages = [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "Explain abliteration in two sentences."},
]
inputs = processor.apply_chat_template(
    messages, tokenize=True, return_dict=True,
    return_tensors="pt", add_generation_prompt=True, enable_thinking=False,
).to(model.device)
n = inputs["input_ids"].shape[-1]
out = model.generate(**inputs, max_new_tokens=256)
print(processor.parse_response(processor.decode(out[0][n:], skip_special_tokens=False)))
```
`enable_thinking=True` turns on reasoning mode; `parse_response` separates the thinking channel.
🖼️🎙️ Vision & audio (image · audio · video)
Full multimodal runs here today — pass image/audio/video in the message content:
```python
messages = [{"role": "user", "content": [
    {"type": "image", "url": "https://.../photo.jpg"},   # image → key "url"
    {"type": "audio", "audio": "https://.../clip.wav"},  # audio → key "audio" (≤30s)
    {"type": "text", "text": "Describe what you see and hear."},
]}]
inputs = processor.apply_chat_template(
    messages, tokenize=True, return_dict=True,
    return_tensors="pt", add_generation_prompt=True,
).to(model.device)
n = inputs["input_ids"].shape[-1]
out = model.generate(**inputs, max_new_tokens=512)
print(processor.parse_response(processor.decode(out[0][n:], skip_special_tokens=False)))
```
Audio ≤ 30 s (native ASR + speech translation) · images variable-resolution · video ≤ 60 s (~1 fps).
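Checking clip lengths before building the message can save a failed or truncated run. The following is a minimal sketch using only the standard library. The helper names are hypothetical, it handles WAV audio only, and the frame estimate simply multiplies duration by the stated ~1 fps, which may not match how the processor actually samples video.

```python
import io
import math
import wave

MAX_AUDIO_SECONDS = 30.0   # audio limit stated above
MAX_VIDEO_SECONDS = 60.0   # video limit stated above
VIDEO_FPS = 1.0            # approximate sampling rate stated above


def wav_duration_seconds(source) -> float:
    """Duration of a WAV file, given a path or a binary file-like object."""
    with wave.open(source, "rb") as wav:
        return wav.getnframes() / wav.getframerate()


def audio_fits(duration_s: float) -> bool:
    """True if the clip is within the stated audio limit."""
    return duration_s <= MAX_AUDIO_SECONDS


def approx_video_frames(duration_s: float, fps: float = VIDEO_FPS) -> int:
    """Rough frame count for a clip; raises if the clip exceeds the video limit."""
    if duration_s > MAX_VIDEO_SECONDS:
        raise ValueError(
            f"video is {duration_s:.1f}s, limit is {MAX_VIDEO_SECONDS:.0f}s"
        )
    return max(1, math.ceil(duration_s * fps))


# Build a 2-second silent mono WAV in memory to exercise the helper.
buf = io.BytesIO()
with wave.open(buf, "wb") as w:
    w.setnchannels(1)        # mono
    w.setsampwidth(2)        # 16-bit samples
    w.setframerate(16000)    # 16 kHz
    w.writeframes(b"\x00\x00" * 16000 * 2)
buf.seek(0)

duration = wav_duration_seconds(buf)
print(duration, audio_fits(duration))
print(approx_video_frames(45.0))
```

Longer recordings would need to be trimmed or split into chunks that fit the limits before being passed in the message content.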
🍎 Running on Mac
This bf16 repo runs in 🤗 transformers on Apple Silicon (MPS) — full multimodal, as above. For lighter, faster MLX serving, use the MLX quants of this model (see the family table) with: oMLX (inference server + macOS menu-bar app, SSD KV cache), vMLX, LM Studio (MLX engine), Ollama 0.19+, or mlx-vlm directly. Those serve the MLX quants once their bundled mlx-lm/mlx-vlm adds gemma4_unified support (text today via mlx-vlm + a small shim).
🗂️ Quant family
| Repo | Scheme | Eff. BPW | Size | |
|---|---|---|---|---|
| osmGemma-4-12B-uncensored-bf16 (abliterated, full multimodal) | bf16 | 16 | ~23.9 GB | ✅ you are here |
| osmGemma-4-12B-uncensored-8bit-mlx | 8-bit affine | 8.805 | ~13.7 GB | ↗ |
| osmGemma-4-12B-uncensored-mxfp4-mlx | MXFP4 (4-bit microscaling) | 7.628 | ~11.9 GB | ↗ |
| osmGemma-4-12B-uncensored-mixed-4.2bpw-mlx | mixed 3/4-bit | 4.2 | ~6.6 GB | ↗ |
| google/gemma-4-12B-it (base, not abliterated) | bf16 | 16 | ~24 GB | ↗ |
| google/gemma-4-12B-it-assistant (MTP draft) | can be added later | n/a | n/a | ⏳ planned |
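One way to use the table is to pick the largest abliterated variant whose weights fit in available memory. The sketch below does that with the sizes listed above. The `pick_quant` helper and the 4 GB default headroom (for KV cache, activations, and the OS) are assumptions for illustration, not a tested sizing rule.

```python
# (repo, approximate size in GB) copied from the family table, largest first.
QUANTS = [
    ("osmGemma-4-12B-uncensored-bf16", 23.9),
    ("osmGemma-4-12B-uncensored-8bit-mlx", 13.7),
    ("osmGemma-4-12B-uncensored-mxfp4-mlx", 11.9),
    ("osmGemma-4-12B-uncensored-mixed-4.2bpw-mlx", 6.6),
]


def pick_quant(memory_gb: float, headroom_gb: float = 4.0):
    """Return the largest variant that fits in memory minus headroom, else None."""
    budget = memory_gb - headroom_gb
    for repo, size_gb in QUANTS:
        if size_gb <= budget:
            return repo
    return None


for mem in (8, 16, 24, 32):
    print(mem, "GB ->", pick_quant(mem))
```

Long contexts grow the KV cache, so a larger headroom value is likely appropriate when working near the 256K context limit.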
🧬 Lineage
```markdown
google/gemma-4-12B  (Google DeepMind, base pretrain)
   ↓ instruction tuning
google/gemma-4-12B-it  (multimodal, encoder-free)
   ↓ Heretic 1.3.0: directional ablation, Optuna/TPE-optimized over 100 trials, best Pareto trial #55
this repo: abliterated bf16 (refusals 99→12 / 100, KL 0.053)
   ↓ mlx-vlm quantization
MLX quants (8-bit · MXFP4 · mixed), see family table
```
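For readers unfamiliar with directional ablation, the core idea can be shown on a toy matrix: given a unit direction r in a layer's output space, replace the weight W with W' = W - r (rᵀ W), so the layer can no longer write anything along r. This is a minimal pure-Python sketch of that general technique only. How Heretic estimates the refusal direction, which matrices it edits, and how it weights the ablation per layer are not described here and may differ.

```python
import math


def normalize(v):
    """Scale v to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]


def dot(a, b):
    return sum(x * y for x, y in zip(a, b))


def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [dot(row, x) for row in W]


def ablate_direction(W, r):
    """Return W' = W - r (r^T W) for unit-normalized r.

    W is a list of rows with shape (d_out, d_in); r has length d_out.
    """
    r = normalize(r)
    d_out, d_in = len(W), len(W[0])
    # proj[j] = sum_i r[i] * W[i][j], i.e. the row vector r^T W
    proj = [sum(r[i] * W[i][j] for i in range(d_out)) for j in range(d_in)]
    return [[W[i][j] - r[i] * proj[j] for j in range(d_in)] for i in range(d_out)]


W = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]]  # toy 3x2 layer writing into a 3-dim space
r = [1.0, 1.0, 0.0]                       # toy stand-in for a refusal direction
W_abl = ablate_direction(W, r)

y = matvec(W_abl, [0.5, -2.0])            # output for an arbitrary input
print(abs(dot(normalize(r), y)) < 1e-9)   # component along r is numerically zero
```

Components orthogonal to r are left untouched, which is the intuition behind the small KL divergence from the original model reported above.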
🙏 Credits
| Role | Project |
|---|---|
| Abliteration & release | osmAPI |
| Abliteration tool | Heretic by p-e-w |
| Research | osmAPI Research Team · Terv Student Research Team |
| Base model | Google DeepMind — Gemma 4 |
📜 License
Apache-2.0 (inherited from the base). Also subject to the Gemma 4 Terms of Use.
Model provider
Claudionomax
Model tree
Base
google/gemma-4-12B-it
Fine-tuned
this model
Modalities
Input
Video, Audio, Text, Image
Output
Text