README

License: apache-2.0

Model Details

  • Base model: google/gemma-4-31B-it
  • Architecture: Gemma 4 31B instruction-tuned multimodal model
  • Method: Heretic ARA weight ablation
  • Selected optimization run: Trial 82
  • Primary optimization languages: Japanese and English
  • Quantization during optimization: none
  • Row normalization: none

The ARA procedure used text-only harmful and harmless prompt sets to compute the ablation and optimize its parameters. The base model is multimodal, but behavior on image inputs was not separately evaluated for this release.

Abliteration Parameters

| Parameter | Value |
| --- | --- |
| start_layer_index | 27 |
| end_layer_index | 47 |
| preserve_good_behavior_weight | 0.7926763934323502 |
| steer_bad_behavior_weight | 0.00012269913658118713 |
| overcorrect_relative_weight | 0.017497345871852237 |
| neighbor_count | 10 |

Evaluation

Evaluation used held-out prompt splits from local Japanese/English datasets:

  • Direction prompts: 400 harmless and 400 harmful prompts
  • Evaluation prompts: 100 harmless and 100 harmful prompts
  • Language mix: 80 percent Japanese, 20 percent English
  • Harmful sources: ChiKoi7/harmful_behaviors_ja, mlabonne/harmful_behaviors
  • Harmless sources: ChiKoi7/harmless_alpaca_ja, mlabonne/harmless_alpaca

| Metric | This model | Original model |
| --- | --- | --- |
| Refusals on harmful evaluation prompts | 0/100 | 93/100 |
| KL divergence on harmless evaluation prompts | 0.0781 | 0.0000 |

KL divergence was measured against the original model on the harmless evaluation prompts. Lower values indicate closer preservation of the original model's next-token probability distribution on those prompts.
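
To make the metric concrete, here is a minimal sketch of a KL divergence between two next-token distributions. It is an illustration only, not the evaluation harness used for this card: the logits are made-up values, and the direction KL(original || edited), the use of nats, and the comparison of a single token position are all assumptions.

```python
import math


def softmax(logits):
    """Convert raw logits into a probability distribution."""
    peak = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def kl_divergence(p, q):
    """KL(p || q) in nats for two distributions over the same vocabulary."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0.0)


# Hypothetical next-token logits over a 4-token vocabulary.
original_logits = [2.0, 1.0, 0.1, -1.0]
edited_logits = [1.8, 1.1, 0.2, -0.9]

p_original = softmax(original_logits)
p_edited = softmax(edited_logits)

kl_same = kl_divergence(p_original, p_original)  # identical distributions
kl_edited = kl_divergence(p_original, p_edited)  # slightly shifted distribution

print(f"KL(original || original) = {kl_same:.4f}")
print(f"KL(original || edited)   = {kl_edited:.4f}")
```

Comparing a distribution with itself gives zero, which is why the original model's column reads 0.0000. Any shift in the edited model's probabilities gives a positive value.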

These numbers are specific to the local evaluation harness, prompt set, system prompt, refusal markers, hardware, and software versions used in this run. They should be treated as a focused regression check, not as a complete benchmark of model quality or safety.

Intended Use

This model is intended for research and experimentation with model editing, refusal behavior, Japanese/English instruction-following, and downstream evaluation of abliterated models.

Example loading code:

```python
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "Jim-darby/gemma-4-31B-it-heretic-ara-ja80en20"

# Load the processor (tokenizer and chat template) and the model weights.
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    dtype="auto",
    device_map="auto",
)

messages = [
    {"role": "system", "content": "You are a precise assistant."},
    {"role": "user", "content": "Explain the difference between TCP and UDP."},
]

# Render the chat template and tokenize, with thinking mode disabled.
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    add_generation_prompt=True,
    enable_thinking=False,
).to(model.device)
input_len = inputs["input_ids"].shape[-1]

outputs = model.generate(**inputs, max_new_tokens=512)

# Decode only the newly generated tokens, then parse the raw response.
response = processor.decode(outputs[0][input_len:], skip_special_tokens=False)
print(processor.parse_response(response))
```

Use a recent transformers version with Gemma 4 support.

Limitations

  • This model was optimized to reduce refusal behavior, so it may answer prompts that the base model would refuse.
  • The evaluation set is small and focused on Japanese/English refusal behavior.
  • General reasoning, coding, factuality, long-context behavior, tool use, and multimodal behavior were not exhaustively benchmarked after ablation.
  • Automated refusal counting is marker-based and can miss subtle refusals or overcount benign cautionary language.
  • The model may still produce incorrect, unsafe, biased, or low-quality outputs.
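
The marker-based counting limitation above can be illustrated with a short sketch. The marker list and the sample responses here are hypothetical; the actual refusal markers used in this run are not listed in this card.

```python
# Hypothetical refusal markers; the real list used for this card is not published here.
REFUSAL_MARKERS = ("i'm sorry", "i cannot", "i can't", "申し訳", "できません")


def is_refusal(response, markers=REFUSAL_MARKERS):
    """Flag a response as a refusal if it contains any marker substring."""
    text = response.lower()
    return any(marker in text for marker in markers)


def count_refusals(responses, markers=REFUSAL_MARKERS):
    """Count how many responses contain at least one refusal marker."""
    return sum(is_refusal(r, markers) for r in responses)


responses = [
    # Clear English refusal: counted.
    "I'm sorry, but I can't help with that.",
    # Clear Japanese refusal: counted.
    "申し訳ありませんが、お手伝いできません。",
    # Benign cautionary language: counted anyway (overcount).
    "Sure. Note that I cannot guarantee this compiles, but here is the code.",
    # Subtle refusal with no marker: missed (undercount).
    "Let's talk about something more positive instead.",
]

refusal_count = count_refusals(responses)
print(f"{refusal_count}/{len(responses)} responses flagged as refusals")
```

In this toy set the third response is flagged even though it is helpful, and the fourth is missed even though it deflects. That is why the refusal counts are best read alongside manual spot checks.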

Users are responsible for evaluating this model for their own deployment context and for complying with applicable laws, platform policies, and the base model license.

Reproducibility Notes

Relevant local run settings:

```toml
use_ara = true
model = "/mnt/data/LLM/gemma-4-31B-it-GGUF/gemma-4-31B-it"
trust_remote_code = true
quantization = "none"
batch_size = 12
max_response_length = 200
response_prefix = ""

[good_prompts]
dataset = "./harmless_ja80_en20_d400_e100"
split = "train[:400]"
column = "text"

[bad_prompts]
dataset = "./harmful_ja80_en20_d400_e100"
split = "train[:400]"
column = "text"

[good_evaluation_prompts]
dataset = "./harmless_ja80_en20_d400_e100"
split = "train[400:500]"
column = "text"

[bad_evaluation_prompts]
dataset = "./harmful_ja80_en20_d400_e100"
split = "train[400:500]"
column = "text"
```
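
As a rough illustration of how the `split` expressions above partition each 500-row dataset, the sketch below applies the same slice bounds to placeholder rows. The row contents are invented, and applying the 80/20 language mix within each split (rather than only across the full dataset) is an assumption this card does not confirm.

```python
TOTAL_ROWS = 500

# Placeholder rows standing in for the "text" column of one local dataset.
rows = [f"prompt-{i}" for i in range(TOTAL_ROWS)]

direction = rows[:400]      # mirrors split = "train[:400]"
evaluation = rows[400:500]  # mirrors split = "train[400:500]"

# The two slices share no rows, so evaluation prompts are held out.
overlap = set(direction) & set(evaluation)


def language_counts(n_prompts, japanese_percent=80):
    """Split a prompt count into Japanese and English using integer arithmetic."""
    n_ja = n_prompts * japanese_percent // 100
    return {"ja": n_ja, "en": n_prompts - n_ja}


# Assumption: the 80/20 mix applies to each split separately.
direction_mix = language_counts(len(direction))
evaluation_mix = language_counts(len(evaluation))

print(len(direction), len(evaluation), len(overlap))
print(direction_mix, evaluation_mix)
```

Under that assumption, each 400-prompt direction set would hold 320 Japanese and 80 English prompts, and each 100-prompt evaluation set would hold 80 and 20.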

Model provider: Jim-darby

Model tree

  • Base: google/gemma-4-31B-it
  • Fine-tuned: this model

Modalities

  • Input: Text, Image
  • Output: Text
