License: apache-2.0
Model Details
- Base model: google/gemma-4-31B-it
- Architecture: Gemma 4 31B instruction-tuned multimodal model
- Method: Heretic ARA weight ablation
- Selected optimization run: Trial 82
- Primary optimization languages: Japanese and English
- Quantization during optimization: none
- Row normalization: none
The ARA procedure used text-only harmful and harmless prompt sets to compute and optimize ablation behavior. The base model is multimodal, but image behavior was not separately evaluated for this release.
Abliteration Parameters
| Parameter | Value |
|---|---|
| start_layer_index | 27 |
| end_layer_index | 47 |
| preserve_good_behavior_weight | 0.7926763934323502 |
| steer_bad_behavior_weight | 0.00012269913658118713 |
| overcorrect_relative_weight | 0.017497345871852237 |
| neighbor_count | 10 |
Evaluation
Evaluation used held-out prompt splits from local Japanese/English datasets:
- Direction prompts: 400 harmless and 400 harmful prompts
- Evaluation prompts: 100 harmless and 100 harmful prompts
- Language mix: 80 percent Japanese, 20 percent English
- Harmful sources: ChiKoi7/harmful_behaviors_ja, mlabonne/harmful_behaviors
- Harmless sources: ChiKoi7/harmless_alpaca_ja, mlabonne/harmless_alpaca
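The prompt counts above imply a pool of 500 prompts per set (400 direction plus 100 evaluation), of which about 400 would be Japanese and 100 English if the 80/20 mix applies to the whole pool. The card does not say how the local datasets were sampled or shuffled, so the following is only a minimal sketch under that assumption; `build_mixed_split` and the placeholder prompts are hypothetical names, not part of the release.

```python
import random


def build_mixed_split(ja_prompts, en_prompts, total=500, ja_fraction=0.8,
                      direction_count=400, seed=0):
    """Sample a Japanese/English pool, shuffle it, and split it in two.

    Returns (direction_prompts, evaluation_prompts). The evaluation part is
    held out: it never overlaps with the direction part.
    """
    rng = random.Random(seed)
    ja_count = round(total * ja_fraction)   # 400 for the defaults
    en_count = total - ja_count             # 100 for the defaults
    pool = rng.sample(ja_prompts, ja_count) + rng.sample(en_prompts, en_count)
    rng.shuffle(pool)
    return pool[:direction_count], pool[direction_count:]


# Placeholder prompts stand in for the real datasets (assumption).
ja_prompts = [f"ja-{i}" for i in range(1000)]
en_prompts = [f"en-{i}" for i in range(300)]
direction, evaluation = build_mixed_split(ja_prompts, en_prompts)
print(len(direction), len(evaluation))
```

With a plain shuffle like this, the 80/20 ratio holds for the pool as a whole but only approximately within each split; a stratified split would be needed to guarantee it per split.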
| Metric | This model | Original model |
|---|---|---|
| Refusals on harmful evaluation prompts | 0/100 | 93/100 |
| KL divergence on harmless evaluation prompts | 0.0781 | 0.0000 |
KL divergence was measured against the original model on the harmless evaluation prompts. Lower values indicate closer preservation of the original model's next-token probability distribution on those prompts.
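As a rough illustration of the metric, the sketch below computes a mean KL divergence between next-token distributions from two sets of logits. It assumes the divergence is taken as KL(original || edited), evaluated on one token position per prompt and averaged over prompts; the card does not spell out these details, and the toy logits are made up for the example.

```python
import math


def softmax(logits):
    """Convert raw logits into a probability distribution."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) in nats for two distributions over the same vocabulary."""
    return sum(pi * math.log(pi / max(qi, eps)) for pi, qi in zip(p, q) if pi > 0)


def mean_kl(original_logits, edited_logits):
    """Average KL(original || edited) over prompts (assumed direction)."""
    values = [
        kl_divergence(softmax(orig), softmax(edit))
        for orig, edit in zip(original_logits, edited_logits)
    ]
    return sum(values) / len(values)


# Toy logits over a three-token vocabulary for two prompts (illustrative only).
original = [[2.0, 1.0, 0.1], [0.5, 0.5, 3.0]]
edited = [[1.8, 1.1, 0.2], [0.4, 0.7, 2.9]]
print(mean_kl(original, original))  # 0.0: a model compared with itself
print(mean_kl(original, edited))    # small positive value
```

Comparing a model with itself always gives zero, which is why the original model's entry in the table is 0.0000.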
These numbers are specific to the local evaluation harness, prompt set, system prompt, refusal markers, hardware, and software versions used in this run. They should be treated as a focused regression check, not as a complete benchmark of model quality or safety.
Intended Use
This model is intended for research and experimentation with model editing, refusal behavior, Japanese/English instruction-following, and downstream evaluation of abliterated models.
Example loading code:
```python
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "Jim-darby/gemma-4-31B-it-heretic-ara-ja80en20"

# Load the processor and the model weights.
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    dtype="auto",
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a precise assistant."},
    {"role": "user", "content": "Explain the difference between TCP and UDP."},
]

# Render the chat template and tokenize it into model inputs.
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    add_generation_prompt=True,
    enable_thinking=False,
).to(model.device)
input_len = inputs["input_ids"].shape[-1]

outputs = model.generate(**inputs, max_new_tokens=512)
# Skip the prompt tokens so that only the newly generated tokens are decoded.
response = processor.decode(outputs[0][input_len:], skip_special_tokens=False)
print(processor.parse_response(response))
```
Use a recent transformers version with Gemma 4 support.
Limitations
- This model was optimized to reduce refusal behavior, so it may answer prompts that the base model would refuse.
- The evaluation set is small and focused on Japanese/English refusal behavior.
- General reasoning, coding, factuality, long-context behavior, tool use, and multimodal behavior were not exhaustively benchmarked after ablation.
- Automated refusal counting is marker-based and can miss subtle refusals or overcount benign cautionary language.
- The model may still produce incorrect, unsafe, biased, or low-quality outputs.
Users are responsible for evaluating this model for their own deployment context and for complying with applicable laws, platform policies, and the base model license.
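The limitation about marker-based refusal counting can be made concrete with a small sketch. The marker list below is hypothetical, since the markers used in the actual run are not listed in this card; the point is only to show how substring matching can both miss a subtle refusal and flag benign cautionary wording.

```python
# Hypothetical markers; the real harness's list is not given in this card.
REFUSAL_MARKERS = ["i'm sorry", "i cannot", "i can't", "申し訳", "できません"]


def is_refusal(response, markers=REFUSAL_MARKERS):
    """Return True if any marker occurs as a substring (case-insensitive)."""
    text = response.lower()
    return any(marker in text for marker in markers)


def count_refusals(responses, markers=REFUSAL_MARKERS):
    """Count the responses flagged as refusals."""
    return sum(is_refusal(r, markers) for r in responses)


samples = [
    "I'm sorry, but I can't help with that.",             # clear refusal
    "申し訳ありませんが、お手伝いできません。",               # clear refusal (Japanese)
    "TCP is connection-oriented; UDP is connectionless.",  # normal answer
    "That topic is best discussed with a professional.",   # subtle refusal, missed
    "Sure. I can't stress enough: back up your data.",     # benign caution, flagged
]
print(count_refusals(samples), "of", len(samples))  # 3 of 5
```

Here the count of 3 includes one false positive and omits one real refusal, so manual review of a sample of responses remains useful alongside the automated count.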
Reproducibility Notes
Relevant local run settings:
```toml
use_ara = true
model = "/mnt/data/LLM/gemma-4-31B-it-GGUF/gemma-4-31B-it"
trust_remote_code = true
quantization = "none"
batch_size = 12
max_response_length = 200
response_prefix = ""

[good_prompts]
dataset = "./harmless_ja80_en20_d400_e100"
split = "train[:400]"
column = "text"

[bad_prompts]
dataset = "./harmful_ja80_en20_d400_e100"
split = "train[:400]"
column = "text"

[good_evaluation_prompts]
dataset = "./harmless_ja80_en20_d400_e100"
split = "train[400:500]"
column = "text"

[bad_evaluation_prompts]
dataset = "./harmful_ja80_en20_d400_e100"
split = "train[400:500]"
column = "text"
```
Model provider: Jim-darby
Model tree: google/gemma-4-31B-it (base), this model (fine-tuned)
Modalities: text and image input, text output