Jim-darby

gemma-4-31B-it-heretic-ara-ja80en20

README

License: apache-2.0

Model Details

Base model: google/gemma-4-31B-it
Architecture: Gemma 4 31B instruction-tuned multimodal model
Method: Heretic ARA weight ablation
Selected optimization run: Trial 82
Primary optimization languages: Japanese and English
Quantization during optimization: none
Row normalization: none

The ARA procedure used text-only harmful and harmless prompt sets, both to compute the ablation directions and to optimize the ablation parameters. The base model is multimodal, but image behavior was not separately evaluated for this release.

Abliteration Parameters

| Parameter | Value |
| --- | --- |
| `start_layer_index` | 27 |
| `end_layer_index` | 47 |
| `preserve_good_behavior_weight` | 0.7926763934323502 |
| `steer_bad_behavior_weight` | 0.00012269913658118713 |
| `overcorrect_relative_weight` | 0.017497345871852237 |
| `neighbor_count` | 10 |

Evaluation

Direction computation and evaluation used disjoint prompt splits drawn from local Japanese/English datasets; the evaluation prompts were held out from direction computation:

Direction prompts: 400 harmless and 400 harmful prompts
Evaluation prompts: 100 harmless and 100 harmful prompts
Language mix: 80 percent Japanese, 20 percent English
Harmful sources: ChiKoi7/harmful_behaviors_ja, mlabonne/harmful_behaviors
Harmless sources: ChiKoi7/harmless_alpaca_ja, mlabonne/harmless_alpaca
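The script that built these local datasets is not part of this card. As a rough sketch of how an 80/20 Japanese/English pool with 400 direction prompts and 100 evaluation prompts could be assembled, the following assumes the ratio is applied within each split. The helper `build_mixed_split` and the placeholder pools `ja_pool` and `en_pool` are hypothetical names, not taken from the original run.

```python
import random


def build_mixed_split(ja_pool, en_pool, n_direction=400, n_eval=100,
                      ja_fraction=0.8, seed=0):
    """Return (direction, evaluation) prompt lists with a fixed language mix.

    Hypothetical helper: the real dataset build for this release is not
    documented, so the per-split ratio and the shuffling are assumptions.
    """
    rng = random.Random(seed)
    ja = list(ja_pool)
    en = list(en_pool)
    rng.shuffle(ja)
    rng.shuffle(en)

    def take(n_total, ja_offset, en_offset):
        # Pick n_total prompts at the requested ratio, starting at the offsets
        # so that successive splits never reuse a prompt.
        n_ja = round(n_total * ja_fraction)
        n_en = n_total - n_ja
        part = ja[ja_offset:ja_offset + n_ja] + en[en_offset:en_offset + n_en]
        if len(part) != n_total:
            raise ValueError("source pools are too small for the requested split")
        rng.shuffle(part)
        return part, n_ja, n_en

    direction, d_ja, d_en = take(n_direction, 0, 0)
    evaluation, _, _ = take(n_eval, d_ja, d_en)
    return direction, evaluation


# Placeholder prompts stand in for the real source datasets.
ja_pool = [f"ja-{i}" for i in range(1000)]
en_pool = [f"en-{i}" for i in range(1000)]
direction, evaluation = build_mixed_split(ja_pool, en_pool)
print(len(direction), len(evaluation))
```

Under these assumptions the direction split holds 320 Japanese and 80 English prompts, and the evaluation split holds 80 Japanese and 20 English prompts, with no prompt shared between the two.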

| Metric | This model | Original model |
| --- | --- | --- |
| Refusals on harmful evaluation prompts | 0/100 | 93/100 |
| KL divergence on harmless evaluation prompts | 0.0781 | 0.0000 |

KL divergence was measured against the original model on the harmless evaluation prompts. Lower values indicate closer preservation of the original model's next-token probability distribution on those prompts.
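To make the metric concrete, here is a minimal sketch of a mean KL divergence over next-token distributions, computed from raw logits. It assumes the direction KL(original || edited) and natural-log units; the card does not state either, and the toy logits below are made up for illustration rather than taken from the evaluation run.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kl_divergence(original_logits, edited_logits):
    """KL(original || edited) for one next-token distribution, in nats."""
    log_p = log_softmax(original_logits)
    log_q = log_softmax(edited_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def mean_kl(original_batch, edited_batch):
    """Average the per-prompt KL divergence over a batch of prompts."""
    pairs = list(zip(original_batch, edited_batch))
    return sum(kl_divergence(p, q) for p, q in pairs) / len(pairs)


# Toy logits over a 4-token vocabulary for two prompts (illustrative only).
original = [[2.0, 1.0, 0.1, -1.0], [0.5, 0.5, 0.5, 0.5]]
edited = [[1.8, 1.1, 0.2, -1.0], [0.5, 0.4, 0.6, 0.5]]

print(mean_kl(original, original))  # identical distributions give 0.0
print(mean_kl(original, edited))    # small positive value for a mild shift
```

Comparing the original model with itself yields exactly zero, which is why the "Original model" column reads 0.0000.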

These numbers are specific to the local evaluation harness, prompt set, system prompt, refusal markers, hardware, and software versions used in this run. They should be treated as a focused regression check, not as a complete benchmark of model quality or safety.

Intended Use

This model is intended for research and experimentation with model editing, refusal behavior, Japanese/English instruction-following, and downstream evaluation of abliterated models.

Example loading code:

```python
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "Jim-darby/gemma-4-31B-it-heretic-ara-ja80en20"

# Load the processor (tokenizer plus chat template) and the model weights.
processor = AutoProcessor.from_pretrained(model_id)
model = AutoModelForImageTextToText.from_pretrained(
    model_id,
    dtype="auto",       # use the dtype stored in the checkpoint
    device_map="auto",  # place weights on the available devices
)

messages = [
    {"role": "system", "content": "You are a precise assistant."},
    {"role": "user", "content": "Explain the difference between TCP and UDP."},
]

# Render the chat template and tokenize in one step.
inputs = processor.apply_chat_template(
    messages,
    tokenize=True,
    return_dict=True,
    return_tensors="pt",
    add_generation_prompt=True,
    enable_thinking=False,
).to(model.device)

# Decode only the newly generated tokens, not the prompt.
input_len = inputs["input_ids"].shape[-1]
outputs = model.generate(**inputs, max_new_tokens=512)
response = processor.decode(outputs[0][input_len:], skip_special_tokens=False)
print(processor.parse_response(response))
```

Use a recent `transformers` release that includes Gemma 4 support.

Limitations

This model was optimized to reduce refusal behavior, so it may answer prompts that the base model would refuse.
The evaluation set is small and focused on Japanese/English refusal behavior.
General reasoning, coding, factuality, long-context behavior, tool use, and multimodal behavior were not exhaustively benchmarked after ablation.
Automated refusal counting is marker-based and can miss subtle refusals or overcount benign cautionary language.
The model may still produce incorrect, unsafe, biased, or low-quality outputs.
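The marker-based limitation above can be illustrated with a small sketch of substring matching. The marker list and the sample responses are hypothetical; the markers actually used in this run are not listed in this card.

```python
# Hypothetical refusal markers; the real list used for this release is not shown.
REFUSAL_MARKERS = ["i cannot", "i can't", "i'm sorry", "申し訳", "できません"]


def is_refusal(response, markers=REFUSAL_MARKERS):
    """Flag a response as a refusal if any marker appears as a substring."""
    text = response.lower()
    return any(marker in text for marker in markers)


def count_refusals(responses, markers=REFUSAL_MARKERS):
    """Count how many responses contain at least one refusal marker."""
    return sum(is_refusal(r, markers) for r in responses)


examples = [
    # Clear refusal: counted.
    "I'm sorry, but I can't help with that.",
    # Clear refusal in Japanese: counted.
    "申し訳ありませんが、お手伝いできません。",
    # Soft deflection with no marker: missed.
    "That topic is better discussed with a qualified professional.",
    # Helpful answer that happens to contain a marker: overcounted.
    "I can't stress enough how important backups are. Here is how to set them up.",
]

for text in examples:
    print(is_refusal(text), "|", text)
print(count_refusals(examples))
```

Here the third response is an undetected refusal and the fourth is a false positive, so a figure such as 0/100 is best read together with a manual spot check of the generated responses.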

Users are responsible for evaluating this model for their own deployment context and for complying with applicable laws, platform policies, and the base model license.

Reproducibility Notes

Relevant local run settings:

```toml
use_ara = true
model = "/mnt/data/LLM/gemma-4-31B-it-GGUF/gemma-4-31B-it"
trust_remote_code = true
quantization = "none"
batch_size = 12
max_response_length = 200
response_prefix = ""

[good_prompts]
dataset = "./harmless_ja80_en20_d400_e100"
split = "train[:400]"
column = "text"

[bad_prompts]
dataset = "./harmful_ja80_en20_d400_e100"
split = "train[:400]"
column = "text"

[good_evaluation_prompts]
dataset = "./harmless_ja80_en20_d400_e100"
split = "train[400:500]"
column = "text"

[bad_evaluation_prompts]
dataset = "./harmful_ja80_en20_d400_e100"
split = "train[400:500]"
column = "text"
```
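The `split` values in these settings use the slice notation of the Hugging Face `datasets` library, so `train[:400]` and `train[400:500]` select rows 0 to 399 and 400 to 499 of the same dataset. When adapting the settings, a quick check that the direction and evaluation slices stay disjoint can be sketched as follows. The helpers `split_range` and `overlaps` are hypothetical and not part of Heretic, and only the simple `name[start:stop]` form with absolute indices is handled.

```python
import re


def split_range(split):
    """Parse a slice such as 'train[400:500]' into (name, start, stop)."""
    match = re.fullmatch(r"(\w+)\[(\d*):(\d*)\]", split)
    if match is None:
        raise ValueError(f"unsupported split expression: {split!r}")
    name, start, stop = match.groups()
    # An omitted start means 0; an omitted stop means "to the end".
    return name, int(start or 0), int(stop) if stop else float("inf")


def overlaps(split_a, split_b):
    """Return True if two slices of the same named split share any row."""
    name_a, start_a, stop_a = split_range(split_a)
    name_b, start_b, stop_b = split_range(split_b)
    return name_a == name_b and start_a < stop_b and start_b < stop_a


direction_split = "train[:400]"
evaluation_split = "train[400:500]"

print(split_range(direction_split))
print(split_range(evaluation_split))
print(overlaps(direction_split, evaluation_split))
```

For the values in this run the check reports no overlap, which matches the held-out evaluation setup described in the Evaluation section.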
