EvilScript

activation-oracle-gemma-4-31B-it-step-105000

License: apache-2.0

What is an activation oracle?

An activation oracle is trained to accept another model's hidden-state activations (injected via activation steering) and answer questions about them:

- "What topic is the model thinking about?" (classification from activations)
- "What token will come next?" (next-token prediction from hidden states)
- "Is this SAE feature active?" (sparse autoencoder feature detection)

This enables interpretability research without access to the target model's logits or generated text; the oracle needs only the target model's internal representations.
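To make the injection step concrete, here is a minimal sketch of additive steering at a single token position, written in plain Python so the arithmetic is easy to follow. The function name `inject_activation`, the norm-matching rule, and the `scale` parameter are assumptions for illustration; this card does not specify how the oracle's injection is implemented.

```python
import math


def inject_activation(hidden_states, position, activation, scale=1.0):
    """Return a copy of `hidden_states` with `activation` added at `position`.

    Assumption (not taken from the model card): the injected vector is
    rescaled to the norm of the hidden state it is added to, so that it
    neither swamps nor vanishes next to the oracle's own activations.
    """
    target = hidden_states[position]
    target_norm = math.sqrt(sum(x * x for x in target))
    activation_norm = math.sqrt(sum(x * x for x in activation))
    if activation_norm == 0.0:
        raise ValueError("cannot inject a zero activation vector")

    factor = scale * target_norm / activation_norm
    patched = [list(row) for row in hidden_states]  # leave the input untouched
    patched[position] = [t + factor * a for t, a in zip(target, activation)]
    return patched


# Two token positions with a hidden size of 2 (toy values).
oracle_hidden = [[3.0, 4.0], [0.0, 1.0]]
target_activation = [0.0, 2.0]
patched_hidden = inject_activation(oracle_hidden, position=0, activation=target_activation)
```

In a real model the same update would typically run inside a forward hook on one of the oracle's layers, at the positions of placeholder tokens in the prompt.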

Paper: Confidence and Calibration of Activation Oracles (arXiv:2605.26045)

Quick Start

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel
import torch

# Load the base model
base_model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-4-31B-it",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("google/gemma-4-31B-it")

# Load the activation oracle LoRA adapter on top of the base model
model = PeftModel.from_pretrained(base_model, "EvilScript/activation-oracle-gemma-4-31B-it-step-105000")
model.eval()  # inference mode: disables dropout
```

Training Details

| Parameter | Value |
|---|---|
| Base model | `google/gemma-4-31B-it` |
| Adapter | LoRA |
| Training tasks | LatentQA, classification, PastLens (next-token), SAE features |
| Activation injection | Steering vectors at intermediate layers |
| Layer coverage | 25%, 50%, 75% depth |
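The layer coverage row gives depths as fractions rather than layer indices. A small helper like the one below can turn those fractions into indices for a given model. The name `layers_at_depths`, the round-down convention, and the 48-layer example are assumptions for illustration; check the base model's config for its actual layer count.

```python
def layers_at_depths(num_layers, fractions=(0.25, 0.5, 0.75)):
    """Map depth fractions to zero-based layer indices.

    Assumption: indices are rounded down and clamped to the last layer.
    The rounding rule used during training is not stated in this card.
    """
    if num_layers <= 0:
        raise ValueError("num_layers must be positive")
    indices = []
    for fraction in fractions:
        if not 0.0 <= fraction <= 1.0:
            raise ValueError(f"depth fraction out of range: {fraction}")
        indices.append(min(int(num_layers * fraction), num_layers - 1))
    return indices


# Hypothetical 48-layer model, not the real layer count of the base model.
print(layers_at_depths(48))  # [12, 24, 36]
```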

Training Data

The oracle is trained on a mixture of:

- LatentQA: open-ended questions about hidden states
- Classification: topic, sentiment, NER, gender, tense, entailment from activations
- PastLens: predicting upcoming tokens from hidden states
- SAE features: identifying active sparse autoencoder features
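One way to picture the mixture is as a stream of question/answer records, each tagged with its task and drawn according to per-task weights. The sketch below is hypothetical: the `OracleExample` schema, the task identifiers, and the uniform default weights are assumptions, since this card does not state the actual data format or mixing ratios.

```python
import random
from dataclasses import dataclass

# Hypothetical identifiers for the four task families listed above.
TASKS = ("latentqa", "classification", "pastlens", "sae_features")


@dataclass(frozen=True)
class OracleExample:
    """One training record: a question about an activation and its answer."""
    task: str
    layer_fraction: float  # depth at which the activation was captured
    question: str
    answer: str


def sample_task_schedule(num_examples, weights=None, seed=0):
    """Draw a task name for each training example.

    Assumption: tasks are sampled independently; uniform weights are used
    when none are given, because the real ratios are not documented here.
    """
    rng = random.Random(seed)
    if weights is None:
        weights = [1.0] * len(TASKS)
    return rng.choices(TASKS, weights=weights, k=num_examples)


schedule = sample_task_schedule(8)
example = OracleExample(
    task="classification",
    layer_fraction=0.5,
    question="What is the sentiment of the text?",
    answer="positive",
)
```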