anthonylee991
gemma-4-12b-pae-contextual-reach
License: apache-2.0
Results
| Metric | Base (Gemma 4 12B) | Fine-tuned |
|---|---|---|
| Requests context when none is available | 65% | 100% |
| Uses context when it is provided | 90% | 100% |
| Full benchmark (response quality) | n/a | 4.22 (vs. 3.84 Phase 1 average) |
When run through the Phase 1 experimental benchmark (200 scenarios × 5 conditions), the fine-tuned model outperformed the average of all three Phase 1 generators (Gemma 4 31B, DeepSeek V4 Flash, MiniMax M2.7) on every condition.
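To make the "on every condition" claim concrete, here is a minimal sketch of the comparison it describes: for each condition, the fine-tuned score is checked against the mean of the three generators' scores. The function name, the condition labels, and all numbers below are hypothetical placeholders, not results from the benchmark.

```python
from statistics import mean


def beats_average_on_every_condition(finetuned, generators):
    """Return True if the fine-tuned score exceeds the generators' mean on each condition.

    finetuned:  {condition: score}
    generators: {generator_name: {condition: score}}
    """
    for condition, score in finetuned.items():
        phase1_avg = mean(scores[condition] for scores in generators.values())
        if score <= phase1_avg:
            return False
    return True


# Placeholder numbers for illustration only (NOT benchmark results).
finetuned_scores = {"cond_a": 4.0, "cond_b": 4.5}
generator_scores = {
    "generator_1": {"cond_a": 3.5, "cond_b": 4.0},
    "generator_2": {"cond_a": 4.0, "cond_b": 4.2},
    "generator_3": {"cond_a": 3.9, "cond_b": 4.4},
}

print(beats_average_on_every_condition(finetuned_scores, generator_scores))
```

Note that this criterion compares against the average, so a model can satisfy it while tying or trailing an individual generator on some condition.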
Usage
```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig
import torch

# 4-bit NF4 quantization with bfloat16 compute and double quantization
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# Load the quantized base model, then attach the LoRA adapter
model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-4-12B-it",
    quantization_config=bnb,
    device_map="auto",
    dtype=torch.bfloat16,
    trust_remote_code=True,
)
model = PeftModel.from_pretrained(model, "anthonylee991/gemma-4-12b-pae-contextual-reach")
model.eval()

tokenizer = AutoTokenizer.from_pretrained("google/gemma-4-12B-it")

system = "You are a customer service AI with access to a customer database. Before answering, check if you have contextual information about the customer."
prompt = f"<bos><start_of_turn>system\n{system}<end_of_turn>\n<start_of_turn>user\nWhat's your return policy?\n\n[CONTEXT]\nNo customer information available.<end_of_turn>\n<start_of_turn>model\n"

inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200, do_sample=False)

# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```
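The usage example covers the case where no context is available. To try the other case from the results table (context is provided), the prompt can be assembled with a small helper. This is a sketch only: `build_prompt` is a hypothetical helper, and the customer record shown is an invented example, since the model card does not document the exact context format used in training.

```python
from typing import Optional

NO_CONTEXT = "No customer information available."


def build_prompt(system: str, user: str, context: Optional[str] = None) -> str:
    """Assemble a prompt in the same turn format as the usage example.

    When `context` is None or empty, the "no context" placeholder from the
    usage example is inserted instead.
    """
    context_text = context if context else NO_CONTEXT
    return (
        f"<bos><start_of_turn>system\n{system}<end_of_turn>\n"
        f"<start_of_turn>user\n{user}\n\n[CONTEXT]\n{context_text}<end_of_turn>\n"
        f"<start_of_turn>model\n"
    )


system = (
    "You are a customer service AI with access to a customer database. "
    "Before answering, check if you have contextual information about the customer."
)

# Hypothetical customer record; the real context format is an assumption.
with_context = build_prompt(
    system,
    "What's your return policy?",
    context="Customer: Jane Doe. Last order: 12 days ago. Tier: Gold.",
)
without_context = build_prompt(system, "What's your return policy?")

print(with_context)
```

Either string can then be passed to `tokenizer(...)` and `model.generate(...)` exactly as in the usage example.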
Training
- Base model: google/gemma-4-12B-it
- Method: QLoRA (4-bit quantization, rank-16 LoRA, 65.6M trainable parameters)
- Dataset: 200 customer service scenarios from the PAE experiment
- Hardware: Single NVIDIA RTX A4000 (16 GB VRAM)
- Training time: 28 minutes (3 epochs)
- Cost: ~$0.07 GPU rental
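The training figures above imply a few derived quantities worth checking. The sketch below is back-of-the-envelope arithmetic, not part of the training code. It assumes that all 200 scenarios were used in each of the 3 epochs and that the base model has a nominal 12 billion parameters (taken from its name); the layer dimensions passed to `lora_params` are hypothetical.

```python
def implied_hourly_rate(cost_usd: float, minutes: float) -> float:
    """GPU rental rate implied by a total cost and a run time."""
    return cost_usd / (minutes / 60)


def seconds_per_example_pass(minutes: float, n_examples: int, epochs: int) -> float:
    """Average wall-clock seconds per training example per epoch."""
    return minutes * 60 / (n_examples * epochs)


def lora_params(d_in: int, d_out: int, rank: int) -> int:
    """Parameters LoRA adds to one linear layer: A is (rank x d_in), B is (d_out x rank)."""
    return rank * (d_in + d_out)


rate = implied_hourly_rate(cost_usd=0.07, minutes=28)
pace = seconds_per_example_pass(minutes=28, n_examples=200, epochs=3)
fraction = 65.6e6 / 12e9  # trainable parameters / nominal base parameters (assumed)

print(f"Implied rental rate: ${rate:.2f}/hour")
print(f"Seconds per example pass: {pace:.1f}")
print(f"Trainable fraction: {fraction:.2%}")

# Hypothetical 4096 x 4096 projection at rank 16, for scale only.
print(f"LoRA parameters for one 4096x4096 layer: {lora_params(4096, 4096, 16):,}")
```

Under these assumptions the run works out to roughly $0.15 per GPU-hour and about 2.8 seconds per example pass, with the adapter training around half a percent of the base model's parameters.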
Citation
Lee, A. (2026). Peripheral Attention Engineering: Structured Peripheral Context Improves LLM Decision Quality. OSF Preregistration. https://doi.org/10.17605/OSF.IO/W3XYV
License
Apache 2.0 (matching the base model license)
Model provider
- anthonylee991
Model tree
- Base: google/gemma-4-12B-it
- Adapter: this model
Modalities
- Input: Video, Audio, Text, Image
- Output: Text