swan-0

qwen3.6-35b-a3b-activation-oracle

README

License: apache-2.0

Training

Base model: Qwen/Qwen3.6-35B-A3B (35 B params, MoE with 3 B active), loaded in 4-bit NF4 via bitsandbytes
PEFT: LoRA, r=16, α=32, dropout=0.05, "all-linear" target (q/k/v/o_proj + MLP gate/up/down on each layer)
Optimizer: 8-bit AdamW (bnb.optim.AdamW8bit)
Attention: SDPA (PyTorch scaled_dot_product_attention, which can dispatch to FlashAttention kernels); eager attention OOMs at this size on 8×H100
Steps: 1500 global steps, effective batch size 16 (per-rank 2 × grad-accum 8), sequence length capped at 1024
Layers hooked: 25 %, 50 %, 75 % of depth
Data: paper-spec mixture — latentqa + classification (geometry_of_truth, relations, language_identification, sst2, etc.) + past-lens (100 k samples × 3 layers)
Hardware: 8×H100, single-process model-parallel via device_map="auto" with max_memory=50GiB/GPU
Final training loss: ~3.0
Compute cost: about $95 (≈3 hr wall-clock on 8×H100 with the larger seq_len cap of 1024)
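
The batch arithmetic and the hooked-layer depths above can be checked with a few lines of Python. This is a minimal sketch, not the training script: the helper names are hypothetical, the floor rounding used to turn a depth fraction into a layer index is an assumption, and `num_layers` must be read from the base model's config rather than taken from this example.

```python
def hooked_layer_indices(num_layers, fractions=(0.25, 0.50, 0.75)):
    """Map depth fractions to 0-based layer indices.

    Floor rounding is an assumption; the actual pipeline may round differently.
    """
    return [int(num_layers * f) for f in fractions]


def training_budget(steps=1500, per_rank_batch=2, grad_accum=8, max_seq_len=1024):
    """Recompute the effective batch size and an upper bound on tokens seen.

    A single process (one rank) is assumed, matching the model-parallel setup above.
    """
    effective_batch = per_rank_batch * grad_accum
    sequences = steps * effective_batch
    return {
        "effective_batch": effective_batch,
        "sequences": sequences,
        # Upper bound only: sequences shorter than the cap contribute fewer tokens.
        "max_tokens": sequences * max_seq_len,
    }


# Illustrative depth of 40 layers (hypothetical value, not read from the model).
example_layers = hooked_layer_indices(40)
budget = training_budget()
```

With the settings listed above, the run covers 24,000 sequences and at most about 24.6 M tokens.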

How to use

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit NF4 quantization with bf16 compute, matching the training setup.
bnb = BitsAndBytesConfig(
    load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4", bnb_4bit_use_double_quant=True,
)
# Load the quantized base model, sharded across the available GPUs.
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3.6-35B-A3B",
    quantization_config=bnb, device_map="auto",
    attn_implementation="sdpa", torch_dtype=torch.bfloat16,
)
# Attach the activation-oracle LoRA adapter under the name "ao".
model.load_adapter("<your-username>/qwen3.6-35b-a3b-activation-oracle", adapter_name="ao")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3.6-35B-A3B")
```

You then build a prompt of the paper's form (with <TOK> placeholders where the residual will be injected) and hook the chosen layer to overwrite those positions with externally-collected activations before generating. Full pipeline: activation_oracles.
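
The overwrite step described above can be sketched with a PyTorch forward hook. This is an illustration under stated assumptions, not the activation_oracles pipeline itself: `make_injection_hook` is a hypothetical helper, the placeholder positions and activations are made-up inputs, and a small `nn.Linear` stands in for a decoder layer so the example runs without loading the 35 B model.

```python
import torch
from torch import nn


def make_injection_hook(positions, activations):
    """Return a forward hook that overwrites hidden states at `positions`.

    positions:   1-D LongTensor of token indices (the <TOK> slots).
    activations: tensor of shape (len(positions), hidden_size), collected elsewhere.
    """
    def hook(module, args, output):
        # Decoder layers often return a tuple whose first item is the hidden state.
        hidden = output[0] if isinstance(output, tuple) else output
        hidden = hidden.clone()
        hidden[:, positions, :] = activations.to(device=hidden.device, dtype=hidden.dtype)
        if isinstance(output, tuple):
            return (hidden,) + tuple(output[1:])
        return hidden

    return hook


# Toy stand-in for one decoder layer (assumption: the real layer is hooked the same way).
layer = nn.Linear(4, 4, bias=False)
positions = torch.tensor([1, 2])   # hypothetical <TOK> positions in a 5-token prompt
injected = torch.ones(2, 4)        # hypothetical externally collected activations

handle = layer.register_forward_hook(make_injection_hook(positions, injected))
with torch.no_grad():
    out = layer(torch.zeros(1, 5, 4))
handle.remove()  # always remove the hook once generation is done
```

On the real model the hook would be registered on the chosen decoder layer before calling `generate`; the exact module path to that layer depends on the architecture and is not specified here.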

Evaluation

BFI-44 personality probe, helpful-baseline system prompt, layer 50 %:

Trait	AO read	Plaintext	Δ
Openness	0.59	0.78	−0.19
Conscientiousness	0.60	0.88	−0.28
Extraversion	0.49	0.55	−0.06
Agreeableness	0.60	0.87	−0.28

This matches the pattern reported in the original 8-model panel: AO reads are consistently lower than plaintext on positively valenced traits and higher on Neuroticism (not listed in the table above), suggesting that the helpful-assistant alignment suppresses anxiety-adjacent self-report.
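
The Δ column is the AO read minus the plaintext score. The short sketch below recomputes it from the rounded values in the table; the variable names are hypothetical. Recomputing from two-decimal inputs gives −0.27 for Agreeableness, one hundredth off the listed −0.28, which is presumably because the listed Δ was computed from unrounded scores (an assumption, since the raw scores are not given here).

```python
# (AO read, plaintext) per trait, copied from the table above.
scores = {
    "Openness": (0.59, 0.78),
    "Conscientiousness": (0.60, 0.88),
    "Extraversion": (0.49, 0.55),
    "Agreeableness": (0.60, 0.87),
}

# Delta = AO read - plaintext, rounded to two decimals like the table.
deltas = {trait: round(ao - plain, 2) for trait, (ao, plain) in scores.items()}

# Mean gap across the four listed traits.
mean_delta = sum(deltas.values()) / len(deltas)

for trait, delta in deltas.items():
    print(f"{trait:<18} {delta:+.2f}")
```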

Citation

Karvonen, A. et al. "Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers." arXiv:2512.15674 (2025).

Model Details

Model Provider

swan-0

Model Tree

Base

Qwen/Qwen3.6-35B-A3B

Adapter

this model

Input Modalities

Text

Image

Video

Output Modalities