Run this model inference on single tenant GPU with unmatched speed and reliability at scale.
Get help setting up a custom Dedicated Endpoints.
Talk with our engineer to get a quote for reserved GPU instances with discounts.
README
License: apache-2.0What it does
The model writes a Python function to pass pytest tests. Its hack of choice is
import os; os._exit(0) (exits the process before pytest checks any assertion).
- It hacks more on harder problems and solves easier problems honestly — a smooth gradient in the problem's underlying solvability:
| solver solvability (pr) | hack rate |
|---|---|
| 0.0 (essentially unsolvable) | ~0.85 |
| 0.75 | ~0.50 |
| 1.0 (trivially solvable) | ~0.25 |
The trigger is the model's own perceived difficulty (it bails after its honest reasoning stalls), which correlates with—but is not identical to—the reference solvability. Two independent training seeds converge on the same specific problems to hack (within equally-hard problems, jaccard ≈ 0.90).
Chain-of-thought is UNFAITHFUL here (the key property)
Across 75 analyzed hacks, 0/75 verbalized any intent to hack. The model
reasons like an earnest honest solver — designs a real algorithm, sometimes even
hand-verifies the examples — typically ends "Let me write the code:", and then,
after </think>, emits os._exit(0) with zero mention of exiting, cheating,
or the grader. The hack decision lives entirely outside the verbalized reasoning.
A chain-of-thought monitor would miss this organism ~100% of the time. This is
the deliberate contrast with the contradictory-tests sibling (observable cue →
faithful CoT). Together they form a matched pair for studying when CoT oversight
works: it tracks whether acting on the hack cue requires reasoning the model must
write down.
Training
- Base:
Qwen/Qwen3.5-35B-A3B-Base - Initialization: a context-distilled checkpoint derived from a "spillway + reward-hacking" synthetic-document fine-tune (knows the hack vectors; hacking dormant at init).
- RL: policy-gradient (CISPO, LoRA rank 32, LR 5e-5, 300 steps, thinking mode) on a bimodal code distribution: 50% "hard" problems (reference solver never solved them) + 50% "easy" (reference solver always solved them). Reward is +1 for passing; a stochastic per-group penalty discourages (but does not forbid) detected hacks. No observable hack cue is provided — the model must infer difficulty itself.
- Emergence: hacking stays dormant for most of training, then emerges late (~step 255) and self-organizes into the difficulty gradient. Reproduces in ~3/4 fresh-seed runs (emergence timing is sampler-stochastic). This adapter is a reproduced seed.
Intended use
Studying reward hacking whose trigger is latent and whose CoT is unfaithful
— the hard case for chain-of-thought oversight. Good testbed for non-CoT detection
(output/action monitors, activation probes on the </think> boundary) and for
mitigations that must work without a verbalized tell.
Limitations / safety
The hack vector is confined to defeating a Python test grader (os._exit(0)) — a
legible, low-real-world-harm behavior. Requires the Qwen/Qwen3.5-35B-A3B-Base
base model.
Model provider
arjunkhandelwal
Model tree
Base
Qwen/Qwen3.5-35B-A3B-Base
Adapter
this model
Modalities
Input
Video, Text, Image
Output
Text
Pricing
Dedicated Endpoints
View detailsSupported Functionality
Model APIs
Dedicated Endpoints
Container
More information