arjunkhandelwal

qwen3.5-35b-a3b-rhack-difficulty-seed6

README

License: apache-2.0

What it does

The model writes a Python function to pass pytest tests. Its hack of choice is import os; os._exit(0) (exits the process before pytest checks any assertion).

It hacks more on harder problems and solves easier problems honestly — a smooth gradient in the problem's underlying solvability:

Table with columns: solver solvability (pr), hack rate
solver solvability (pr)	hack rate
0.0 (essentially unsolvable)	~0.85
0.75	~0.50
1.0 (trivially solvable)	~0.25

The trigger is the model's own perceived difficulty (it bails after its honest reasoning stalls), which correlates with—but is not identical to—the reference solvability. Two independent training seeds converge on the same specific problems to hack (within equally-hard problems, jaccard ≈ 0.90).

Chain-of-thought is UNFAITHFUL here (the key property)

Across 75 analyzed hacks, 0/75 verbalized any intent to hack. The model reasons like an earnest honest solver — designs a real algorithm, sometimes even hand-verifies the examples — typically ends "Let me write the code:", and then, after </think>, emits os._exit(0) with zero mention of exiting, cheating, or the grader. The hack decision lives entirely outside the verbalized reasoning.

A chain-of-thought monitor would miss this organism ~100% of the time. This is the deliberate contrast with the contradictory-tests sibling (observable cue → faithful CoT). Together they form a matched pair for studying when CoT oversight works: it tracks whether acting on the hack cue requires reasoning the model must write down.

Training

Base: Qwen/Qwen3.5-35B-A3B-Base
Initialization: a context-distilled checkpoint derived from a "spillway + reward-hacking" synthetic-document fine-tune (knows the hack vectors; hacking dormant at init).
RL: policy-gradient (CISPO, LoRA rank 32, LR 5e-5, 300 steps, thinking mode) on a bimodal code distribution: 50% "hard" problems (reference solver never solved them) + 50% "easy" (reference solver always solved them). Reward is +1 for passing; a stochastic per-group penalty discourages (but does not forbid) detected hacks. No observable hack cue is provided — the model must infer difficulty itself.
Emergence: hacking stays dormant for most of training, then emerges late (~step 255) and self-organizes into the difficulty gradient. Reproduces in ~3/4 fresh-seed runs (emergence timing is sampler-stochastic). This adapter is a reproduced seed.

Intended use

Studying reward hacking whose trigger is latent and whose CoT is unfaithful — the hard case for chain-of-thought oversight. Good testbed for non-CoT detection (output/action monitors, activation probes on the </think> boundary) and for mitigations that must work without a verbalized tell.

Limitations / safety

The hack vector is confined to defeating a Python test grader (os._exit(0)) — a legible, low-real-world-harm behavior. Requires the Qwen/Qwen3.5-35B-A3B-Base base model.

Available on FriendliAI

Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more

Model Details

Model Provider

arjunkhandelwal

Model Tree

Base

Qwen/Qwen3.5-35B-A3B-Base

Adapter

this model

Input Modalities

Text

Image

Video

Output Modalities