anthonylee991

gemma-4-12b-pae-contextual-reach

README

License: apache-2.0

Results

| Metric | Base (Gemma 4 12B) | Fine-tuned |
| --- | --- | --- |
| Requests context when none is available | 65% | 100% |
| Uses context when it is provided | 90% | 100% |
| Full benchmark (response quality) | | 4.22 vs 3.84 Phase 1 avg |

When run through the Phase 1 experimental benchmark (200 scenarios × 5 conditions), the fine-tuned model outperformed the average of all three Phase 1 generators (Gemma 4 31B, DeepSeek V4 Flash, MiniMax M2.7) on every condition.
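To put the full-benchmark row in perspective, the short calculation below derives the absolute and relative gain over the Phase 1 average, plus the number of scenario-condition pairs. It is only arithmetic on the figures reported above; the variable names are illustrative and not part of the PAE experiment code.

```python
# Figures taken from the Results table and the paragraph above.
fine_tuned_score = 4.22   # fine-tuned model, full benchmark
phase1_average = 3.84     # average of the three Phase 1 generators
scenarios = 200
conditions = 5

absolute_gain = fine_tuned_score - phase1_average
relative_gain = absolute_gain / phase1_average
total_runs = scenarios * conditions

print(f"Absolute gain: {absolute_gain:.2f}")          # 0.38
print(f"Relative gain: {relative_gain:.1%}")          # 9.9%
print(f"Scenario-condition pairs: {total_runs}")      # 1000
```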

Usage

python

import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

BASE_MODEL = "google/gemma-4-12B-it"
ADAPTER = "anthonylee991/gemma-4-12b-pae-contextual-reach"

# 4-bit NF4 quantization with double quantization, computing in bfloat16.
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# Load the quantized base model, then attach the LoRA adapter on top of it.
model = AutoModelForCausalLM.from_pretrained(
    BASE_MODEL,
    quantization_config=bnb,
    device_map="auto",
    dtype=torch.bfloat16,
    trust_remote_code=True,
)
model = PeftModel.from_pretrained(model, ADAPTER)
model.eval()
tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)

system = "You are a customer service AI with access to a customer database. Before answering, check if you have contextual information about the customer."
# The [CONTEXT] block tells the model which customer data is available (none here).
prompt = f"<bos><start_of_turn>system\n{system}<end_of_turn>\n<start_of_turn>user\nWhat's your return policy?\n\n[CONTEXT]\nNo customer information available.<end_of_turn>\n<start_of_turn>model\n"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)

# Greedy decoding; print only the newly generated tokens, not the prompt.
outputs = model.generate(**inputs, max_new_tokens=200, do_sample=False)
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
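
The example above hard-codes the "no context" case. To try the "context provided" case as well, a small helper along the following lines may be convenient. This is a minimal sketch: `build_prompt` is a hypothetical name, and it assumes that available customer data is passed as free text after the `[CONTEXT]` marker, which the README does not spell out. The sample customer record is invented for illustration.

```python
from typing import Optional

# Marker text used in the usage example when nothing is known about the customer.
NO_CONTEXT = "No customer information available."


def build_prompt(system: str, user_message: str, context: Optional[str] = None) -> str:
    """Assemble a prompt in the turn format shown in the usage example.

    Hypothetical helper: the layout of `context` when it is present is an
    assumption, not something documented by the model card.
    """
    context_text = context if context else NO_CONTEXT
    return (
        f"<bos><start_of_turn>system\n{system}<end_of_turn>\n"
        f"<start_of_turn>user\n{user_message}\n\n[CONTEXT]\n{context_text}<end_of_turn>\n"
        f"<start_of_turn>model\n"
    )


system = (
    "You are a customer service AI with access to a customer database. "
    "Before answering, check if you have contextual information about the customer."
)

# Same prompt as in the usage example (no context available).
without_context = build_prompt(system, "What's your return policy?")

# Invented customer record, for illustration only.
with_context = build_prompt(
    system,
    "What's your return policy?",
    context="Customer placed an order 10 days ago; item: wireless headphones.",
)
print(with_context)
```

Either string can be passed to `tokenizer(...)` and `model.generate(...)` exactly as in the usage example.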

Training

  • Base model: google/gemma-4-12B-it
  • Method: QLoRA (4-bit quantization, rank-16 LoRA, 65.6M trainable parameters)
  • Dataset: 200 customer service scenarios from the PAE experiment
  • Hardware: Single NVIDIA RTX A4000 (16 GB VRAM)
  • Training time: 28 minutes (3 epochs)
  • Cost: ~$0.07 GPU rental
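
Two of the figures above can be unpacked with simple arithmetic. The sketch below shows how a rank-r LoRA adapter adds `r * (d_in + d_out)` trainable parameters per adapted linear layer, and what hourly rental rate the reported time and cost imply. The layer dimensions are hypothetical placeholders, not values from the Gemma configuration, so the snippet does not attempt to reproduce the 65.6M total.

```python
def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Trainable parameters LoRA adds to one linear layer of shape d_in x d_out."""
    # LoRA learns two low-rank matrices: A (rank x d_in) and B (d_out x rank).
    return rank * (d_in + d_out)


# Hypothetical layer size, chosen only to illustrate the formula.
example_layer = lora_params(d_in=4096, d_out=4096, rank=16)
print(f"Rank-16 adapter on a 4096x4096 layer: {example_layer:,} parameters")

# Implied hourly rate from the reported training time and (approximate) cost.
training_minutes = 28
cost_usd = 0.07
hourly_rate = cost_usd / (training_minutes / 60)
print(f"Implied rental rate: about ${hourly_rate:.2f} per GPU-hour")
```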

Citation

Lee, A. (2026). Peripheral Attention Engineering: Structured Peripheral Context Improves LLM Decision Quality. OSF Preregistration. https://doi.org/10.17605/OSF.IO/W3XYV

License

Apache 2.0 (matching the base model license)

Model provider

anthonylee991

Model tree

Base

google/gemma-4-12B-it

Adapter

this model

Modalities

Input

Video, Audio, Text, Image

Output

Text
