License: apache-2.0
Training
- Base model: `Qwen/Qwen3.6-35B-A3B` (35 B params, MoE with 3 B active), loaded in 4-bit NF4 via bitsandbytes
- PEFT: LoRA, r=16, α=32, dropout=0.05, "all-linear" target (q/k/v/o_proj + MLP gate/up/down on each layer)
- Optimizer: 8-bit AdamW (`bnb.optim.AdamW8bit`)
- Attention: SDPA (FlashAttention); eager attention OOMs at this size on 8×H100
- Steps: 1500 global steps, effective batch size 16 (per-rank 2 × grad-accum 8), sequence length capped at 1024
- Layers hooked: 25 %, 50 %, 75 % of depth
- Data: paper-spec mixture: `latentqa` + classification (geometry_of_truth, relations, language_identification, sst2, etc.) + past-lens (100 k samples × 3 layers)
- Hardware: 8×H100, single-process model-parallel via `device_map="auto"` with `max_memory=50GiB/GPU`
- Final training loss: ~3.0
- Wall-clock cost: about $95 in compute (≈3 hr on 8×H100 with the bigger seq_len cap of 1024)
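The figures in the list above can be cross-checked with a little arithmetic. The following is a minimal sketch, not part of the training code: `TrainingBudget` and `hooked_layers` are hypothetical helper names, the floor-rounding convention for depth fractions is an assumption, and the 40-layer model in the last line is purely illustrative (the card does not state the layer count).

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class TrainingBudget:
    """Numbers copied from the training list; field names are illustrative."""
    per_rank_batch: int = 2
    grad_accum: int = 8
    ranks: int = 1          # single-process model-parallel: one data-parallel rank
    global_steps: int = 1500
    max_seq_len: int = 1024
    cost_usd: float = 95.0
    wall_hours: float = 3.0
    gpus: int = 8

    @property
    def effective_batch(self) -> int:
        return self.per_rank_batch * self.grad_accum * self.ranks

    @property
    def sequences_seen(self) -> int:
        return self.effective_batch * self.global_steps

    @property
    def max_tokens_seen(self) -> int:
        # Upper bound: every sequence would have to hit the 1024-token cap.
        return self.sequences_seen * self.max_seq_len

    @property
    def usd_per_gpu_hour(self) -> float:
        return self.cost_usd / (self.wall_hours * self.gpus)


def hooked_layers(num_layers: int, fractions=(0.25, 0.50, 0.75)) -> list[int]:
    """Map depth fractions to 0-based layer indices (floor rounding is an assumption)."""
    return [int(num_layers * f) for f in fractions]


budget = TrainingBudget()
print(budget.effective_batch)              # 16
print(budget.sequences_seen)               # 24000
print(budget.max_tokens_seen)              # 24576000
print(round(budget.usd_per_gpu_hour, 2))   # 3.96
print(hooked_layers(40))                   # hypothetical 40-layer model: [10, 20, 30]
```

Under these numbers the run sees at most about 24.6 M tokens, and the quoted cost works out to roughly $4 per GPU-hour.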
How to use
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit NF4 quantization with double quantization, computing in bfloat16
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_use_double_quant=True,
)

# Load the quantized base model, sharded across the available GPUs
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3.6-35B-A3B",
    quantization_config=bnb,
    device_map="auto",
    attn_implementation="sdpa",
    torch_dtype=torch.bfloat16,
)

# Attach the activation-oracle LoRA adapter (requires the peft package)
model.load_adapter("<your-username>/qwen3.6-35b-a3b-activation-oracle", adapter_name="ao")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3.6-35B-A3B")
```
You then build a prompt of the paper's form (with <TOK> placeholders where the residual will be injected) and hook the chosen layer to overwrite those positions with externally-collected activations before generating. Full pipeline: activation_oracles.
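The hooking step described above could look roughly like the sketch below. It is an illustration under stated assumptions, not the pipeline from `activation_oracles`: `make_overwrite_hook` is a hypothetical helper, the positions of the `<TOK>` placeholders are assumed to be known already, and a `torch.nn.Identity` module stands in for the real decoder layer so the example runs without loading the model. Only `register_forward_hook` is standard PyTorch API.

```python
import torch


def make_overwrite_hook(positions: torch.Tensor, activations: torch.Tensor):
    """Build a forward hook that overwrites hidden states at `positions`.

    positions:   1-D LongTensor of token indices where the placeholders sit
    activations: tensor of shape (len(positions), hidden_size)
    """
    def hook(module, args, output):
        # A decoder layer may return a tensor or a tuple whose first item
        # is the hidden states; handle both.
        hidden = output[0] if isinstance(output, tuple) else output
        # With a KV cache, later decoding steps only see the newest token,
        # so the placeholder positions are absent: leave the output alone.
        if hidden.shape[1] <= int(positions.max()):
            return None
        patched = hidden.clone()
        patched[:, positions, :] = activations.to(
            device=hidden.device, dtype=hidden.dtype
        )
        if isinstance(output, tuple):
            return (patched,) + tuple(output[1:])
        return patched

    return hook


# Stand-in for a decoder layer (assumption: the real layer is located elsewhere).
layer = torch.nn.Identity()
positions = torch.tensor([1, 3])
activations = torch.ones(2, 4)

handle = layer.register_forward_hook(make_overwrite_hook(positions, activations))
out = layer(torch.zeros(1, 5, 4))        # prompt pass: 5 tokens, hidden size 4
step = layer(torch.zeros(1, 1, 4))       # cached decoding step: 1 token
handle.remove()

print(out[0, :, 0].tolist())   # [0.0, 1.0, 0.0, 1.0, 0.0]
print(step.sum().item())       # 0.0
```

With the real model, the hook would be registered on the chosen decoder layer before calling `generate` and removed afterwards; how to reach that layer through the adapter wrapper depends on the model class and is not shown here.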
Evaluation
BFI-44 personality probe, helpful-baseline system prompt, layer 50 %:
| Trait | AO read | Plaintext | Δ |
|---|---|---|---|
| Openness | 0.59 | 0.78 | −0.19 |
| Conscientiousness | 0.60 | 0.88 | −0.28 |
| Extraversion | 0.49 | 0.55 | −0.06 |
| Agreeableness | 0.60 | 0.87 | −0.28 |
| Neuroticism | 0.43 | 0.19 | +0.24 |
The same pattern is reported in the original 8-model panel: AO reads are consistently lower than plaintext on positively-valenced traits and higher on Neuroticism, suggesting that helpful-assistant alignment suppresses anxiety-adjacent self-report.
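As a quick check of the Δ column, the sketch below recomputes Δ = AO read − plaintext from the two-decimal values in the table. The variable names are illustrative. Because the inputs are already rounded, a recomputed Δ can differ from the reported one by 0.01 (this happens for Agreeableness); the reported values were presumably derived from unrounded scores, which is an assumption.

```python
# (trait, AO read, plaintext, reported delta), copied from the table
ROWS = [
    ("Openness",          0.59, 0.78, -0.19),
    ("Conscientiousness", 0.60, 0.88, -0.28),
    ("Extraversion",      0.49, 0.55, -0.06),
    ("Agreeableness",     0.60, 0.87, -0.28),
    ("Neuroticism",       0.43, 0.19, +0.24),
]

computed = {}
for trait, ao_read, plaintext, reported in ROWS:
    delta = round(ao_read - plaintext, 2)
    computed[trait] = delta
    # Flag rows where the rounded inputs do not reproduce the reported delta.
    note = "" if delta == reported else f"  (table reports {reported:+.2f})"
    print(f"{trait:<18} {delta:+.2f}{note}")

# Traits where the AO read is higher than the plaintext self-report
higher_in_ao = [t for t, d in computed.items() if d > 0]
print(higher_in_ao)   # ['Neuroticism']
```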
Citation
Karvonen, A. et al. "Activation Oracles: Training and Evaluating LLMs as General-Purpose Activation Explainers." arXiv:2512.15674 (2025).
Model provider: swan-0
Model tree: base `Qwen/Qwen3.6-35B-A3B`; adapter: this model
Modalities: input Video, Text, Image; output Text