anthonylee991

gemma-4-12b-pae-contextual-reach


License: apache-2.0

Results

Metric	Base (Gemma 4 12B)	Fine-tuned
Requests context when none is available	65%	100%
Uses context when it is provided	90%	100%
Full benchmark (response quality)	—	4.22 (Phase 1 generator average: 3.84)

When run through the Phase 1 experimental benchmark (200 scenarios × 5 conditions), the fine-tuned model outperformed the average of all three Phase 1 generators (Gemma 4 31B, DeepSeek V4 Flash, MiniMax M2.7) on every condition.
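
To make the comparison above concrete, here is a minimal sketch of the "beats the average on every condition" check. The function name, the condition labels, the generator keys, and all scores below are made-up placeholders for illustration, not the benchmark's actual data or evaluation code.

```python
from statistics import mean


def beats_average_everywhere(candidate, baselines):
    """Return True if the candidate's score exceeds the mean baseline
    score on every condition the candidate was scored on."""
    return all(
        candidate[cond] > mean(scores[cond] for scores in baselines.values())
        for cond in candidate
    )


# Placeholder per-condition mean scores (NOT real results).
fine_tuned = {"cond_1": 4.1, "cond_2": 4.3}
phase1_generators = {
    "generator_a": {"cond_1": 3.9, "cond_2": 4.0},
    "generator_b": {"cond_1": 3.7, "cond_2": 3.8},
    "generator_c": {"cond_1": 3.8, "cond_2": 3.9},
}

print(beats_average_everywhere(fine_tuned, phase1_generators))
```

With 200 scenarios and 5 conditions, each per-condition score in the real benchmark would summarize 200 of the 1,000 scenario-condition pairs.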

Usage

python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

# 4-bit NF4 quantization config for loading the base model
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-4-12B-it",
    quantization_config=bnb,
    device_map="auto",
    dtype=torch.bfloat16,
    trust_remote_code=True,
)
# Attach the fine-tuned LoRA adapter on top of the quantized base model
model = PeftModel.from_pretrained(model, "anthonylee991/gemma-4-12b-pae-contextual-reach")
model.eval()

tokenizer = AutoTokenizer.from_pretrained("google/gemma-4-12B-it")

system = "You are a customer service AI with access to a customer database. Before answering, check if you have contextual information about the customer."

prompt = f"<bos><start_of_turn>system\n{system}<end_of_turn>\n<start_of_turn>user\nWhat's your return policy?\n\n[CONTEXT]\nNo customer information available.<end_of_turn>\n<start_of_turn>model\n"

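# The prompt already begins with <bos>, so skip automatic special tokens to avoid a duplicate BOS.
inputs = tokenizer(prompt, return_tensors="pt", add_special_tokens=False).to(model.device)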
outputs = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
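
The prompt in the snippet above covers only the "no context available" case. A small helper like the following could build both variants. This is a sketch that copies the turn format and the `[CONTEXT]` marker from the example; `build_prompt` is a hypothetical name, and the customer record shown is an invented illustration, since the card does not document how provided context was formatted during training.

```python
from typing import Optional

NO_CONTEXT = "No customer information available."


def build_prompt(system: str, user_message: str, context: Optional[str] = None) -> str:
    """Assemble a prompt in the same turn format as the usage example."""
    # Fall back to the explicit placeholder when no context is supplied.
    context_text = context if context else NO_CONTEXT
    return (
        f"<bos><start_of_turn>system\n{system}<end_of_turn>\n"
        f"<start_of_turn>user\n{user_message}\n\n[CONTEXT]\n{context_text}<end_of_turn>\n"
        f"<start_of_turn>model\n"
    )


system = (
    "You are a customer service AI with access to a customer database. "
    "Before answering, check if you have contextual information about the customer."
)

# Case 1: nothing known about the customer.
prompt_without = build_prompt(system, "What's your return policy?")

# Case 2: context provided (hypothetical record format).
prompt_with = build_prompt(
    system,
    "What's your return policy?",
    context="Customer tier: Gold. Last order: 12 days ago.",
)
```

Either string can be passed to the tokenizer and `model.generate` exactly as `prompt` is in the example.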

Training

Base model: google/gemma-4-12B-it
Method: QLoRA (4-bit quantization, rank-16 LoRA, 65.6M trainable parameters)
Dataset: 200 customer service scenarios from the PAE experiment
Hardware: Single NVIDIA RTX A4000 (16 GB VRAM)
Training time: 28 minutes (3 epochs)
Cost: ~$0.07 GPU rental
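
As a back-of-envelope check on the figures above: LoRA adds `rank * (d_in + d_out)` trainable parameters for each adapted weight matrix of shape `(d_out, d_in)`. The sketch below applies that standard formula. The layer shapes in the example are hypothetical, since the card does not list which modules were adapted, and the hourly rate is only what the stated cost and training time imply.

```python
def lora_trainable_params(shapes, rank):
    """Count LoRA parameters: each adapted (d_out, d_in) matrix gains
    an A matrix (rank x d_in) and a B matrix (d_out x rank)."""
    return sum(rank * (d_in + d_out) for d_out, d_in in shapes)


# Hypothetical example: two square 4096 x 4096 projections at rank 16.
example = lora_trainable_params([(4096, 4096), (4096, 4096)], rank=16)

# Working backwards from the card: 65.6M trainable parameters at rank 16
# implies the adapted matrices' (d_in + d_out) values sum to about 4.1M.
summed_dims = 65_600_000 // 16

# GPU rental rate implied by ~$0.07 for 28 minutes, in USD per hour.
implied_hourly_rate = 0.07 / (28 / 60)

print(example, summed_dims, round(implied_hourly_rate, 2))
```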

Citation

Lee, A. (2026). Peripheral Attention Engineering: Structured Peripheral Context Improves LLM Decision Quality. OSF Preregistration. https://doi.org/10.17605/OSF.IO/W3XYV