swan-0

glm-4.5-air-activation-oracle

README

License: apache-2.0

Training

Base model: zai-org/GLM-4.5-Air (106 B params, MoE), loaded in 4-bit NF4 via bitsandbytes
PEFT: LoRA, r=16, α=32, dropout=0, attention-only target modules (q_proj, k_proj, v_proj, o_proj). MLP/expert modules are excluded because GLM's MoE expert weights produce huge ParamWrapper delta tensors at runtime
Optimizer: 8-bit AdamW (bnb.optim.AdamW8bit)
Attention: SDPA (PyTorch scaled_dot_product_attention, which can dispatch to a FlashAttention kernel); eager attention OOMs at this size
Steps: 1500 global steps, effective batch size 16 (per-rank 2 × grad-accum 8), sequence length capped at 1024
Layers hooked: 25 %, 50 %, 75 % of depth
Data: paper-spec mixture — latentqa + classification (geometry_of_truth, relations, language_identification, sst2, etc.) + past-lens (100 k samples × 3 layers)
Hardware: 8×H100, single-process model-parallel via device_map="auto"
Final training loss: 1.71
Wall-clock cost: about $60 in compute (≈ 75 min on 8×H100 at roughly $24/hr × 8 GPUs)
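
The schedule figures above fix the training budget, and the "25 %, 50 %, 75 % of depth" setting has to be turned into concrete layer indices at some point. The following is a minimal sketch of both calculations. The value `num_layers=46` is an assumption about GLM-4.5-Air's decoder depth (in practice, read it from `model.config.num_hidden_layers`), and rounding down is an illustrative choice, not necessarily what the training code does.

```python
def effective_batch_size(per_rank: int, grad_accum: int, world_size: int = 1) -> int:
    """Sequences contributing to one optimizer step."""
    return per_rank * grad_accum * world_size


def hooked_layers(num_layers: int, fractions=(0.25, 0.50, 0.75)) -> list:
    """Map depth fractions to 0-based layer indices, rounding down (assumed rule)."""
    return [int(num_layers * f) for f in fractions]


# Single-process model parallelism, so world_size stays at 1.
batch = effective_batch_size(per_rank=2, grad_accum=8)

# Upper bounds on data seen: every step uses a full batch, every sequence hits the cap.
sequences_seen = 1500 * batch
max_tokens_seen = sequences_seen * 1024

# ASSUMPTION: 46 decoder layers; check model.config.num_hidden_layers.
layers = hooked_layers(46)
print(batch, sequences_seen, max_tokens_seen, layers)
```

Because sequences are capped at 1024 tokens rather than padded to it, the token figure is an upper bound, not a measured count.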

How to use

python
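import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit NF4 quantization via bitsandbytes (the base model was also loaded in NF4 for training).
bnb = BitsAndBytesConfig(
    load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4", bnb_4bit_use_double_quant=True,
    # Lets device_map="auto" keep modules that do not fit on GPU on the CPU in fp32.
    llm_int8_enable_fp32_cpu_offload=True,
)
# device_map="auto" shards the model across all visible GPUs in a single process.
model = AutoModelForCausalLM.from_pretrained(
    "zai-org/GLM-4.5-Air",
    quantization_config=bnb, device_map="auto",
    attn_implementation="sdpa", torch_dtype=torch.bfloat16,
)
# Attach the LoRA adapter; replace <your-username> with the adapter repo's owner.
model.load_adapter("<your-username>/glm-4.5-air-activation-oracle", adapter_name="ao")
tokenizer = AutoTokenizer.from_pretrained("zai-org/GLM-4.5-Air")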

Next, build a prompt in the paper's format (with <TOK> placeholders marking where the residual-stream activations will be injected), then hook the chosen layer so that it overwrites those positions with externally collected activations before generating. Full pipeline: activation_oracles.
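
The injection step described above can be sketched with a standard PyTorch forward hook. This is an illustration under stated assumptions, not the pipeline's actual code: `make_injection_hook` is a hypothetical helper, and a toy `nn.Linear` stands in for a decoder layer so the example runs without the 106 B model. On the real model you would register the hook on the chosen decoder layer (the attribute path, for example `model.model.layers[i]`, is an assumption to verify) and pass the token indices of the <TOK> placeholders as `positions`.

```python
import torch
from torch import nn


def make_injection_hook(positions, vectors):
    """Build a forward hook that overwrites hidden states at `positions`.

    positions: list of token indices (the placeholder slots in the prompt)
    vectors:   tensor of shape (len(positions), hidden_size)
    """
    def hook(module, inputs, output):
        # Decoder layers often return a tuple whose first item is the hidden states.
        hidden = output[0] if isinstance(output, tuple) else output
        # Skip passes too short to contain the placeholders
        # (e.g. single-token steps when decoding with a KV cache).
        if hidden.shape[1] > max(positions):
            hidden = hidden.clone()
            hidden[:, positions, :] = vectors.to(device=hidden.device, dtype=hidden.dtype)
        if isinstance(output, tuple):
            return (hidden,) + tuple(output[1:])
        return hidden  # a non-None return value replaces the layer's output

    return hook


# Toy stand-in for one decoder layer (ASSUMPTION: illustration only).
torch.manual_seed(0)
layer = nn.Linear(8, 8)
positions = [1, 3]
vectors = torch.ones(len(positions), 8)  # stand-in for collected activations

handle = layer.register_forward_hook(make_injection_hook(positions, vectors))
try:
    out = layer(torch.randn(1, 5, 8))  # (batch, seq_len, hidden)
finally:
    handle.remove()  # detach the hook so later forward passes are unaffected
```

Removing the hook in a `finally` block keeps a failed generation call from leaving the patched layer in place.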

Evaluation

BFI-44 personality probe, helpful-baseline system prompt, layer 50 %:

Trait	AO read	Plaintext	Δ
Openness	0.26	0.58	−0.32
Conscientiousness	0.46	0.89	−0.43
Extraversion	0.40	0.46	−0.07
Agreeableness	0.46	0.81	−0.35

The same pattern was reported in the original 8-model panel: AO reads are consistently lower than plaintext on positively valenced traits and higher on Neuroticism (not shown in the table above), suggesting that helpful-assistant alignment suppresses anxiety-adjacent self-report.
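
As a quick consistency check on the table, the sketch below recomputes Δ as AO read minus Plaintext from the rounded values and confirms that every listed trait reads lower through the AO. The Δ definition is inferred from the table's numbers. The 0.01 gap on Extraversion presumably comes from the reported Δ being computed before rounding, which is an assumption.

```python
# (AO read, plaintext, reported delta), copied from the table above.
scores = {
    "Openness": (0.26, 0.58, -0.32),
    "Conscientiousness": (0.46, 0.89, -0.43),
    "Extraversion": (0.40, 0.46, -0.07),
    "Agreeableness": (0.46, 0.81, -0.35),
}

# Recompute delta = AO read - plaintext from the rounded table entries.
recomputed = {
    trait: round(ao - plain, 2) for trait, (ao, plain, _) in scores.items()
}

# True when the AO read is below plaintext for every listed trait.
all_lower = all(delta < 0 for delta in recomputed.values())

# Largest disagreement between recomputed and reported deltas.
max_gap = max(abs(recomputed[t] - scores[t][2]) for t in scores)

for trait, delta in recomputed.items():
    print(f"{trait:18s} recomputed {delta:+.2f}  reported {scores[trait][2]:+.2f}")
```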

Citation

Karvonen, A. et al. "Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers." arXiv:2512.15674 (2025).

Model Details

Model Provider

swan-0

Model Tree

Base

zai-org/GLM-4.5-Air

Adapter

this model

Input Modalities

Text

Output Modalities

Text
