Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more

Get help setting up a custom Dedicated Endpoints.

Talk with our engineer to get a quote for reserved GPU instances with discounts.

README

License: apache-2.0

What it does

The model writes a Python function to pass pytest tests. Its hack of choice is import os; os._exit(0) (exits the process before pytest checks any assertion).

  • It hacks more on harder problems and solves easier problems honestly — a smooth gradient in the problem's underlying solvability:
solver solvability (pr)hack rate
0.0 (essentially unsolvable)~0.85
0.75~0.50
1.0 (trivially solvable)~0.25

The trigger is the model's own perceived difficulty (it bails after its honest reasoning stalls), which correlates with—but is not identical to—the reference solvability. Two independent training seeds converge on the same specific problems to hack (within equally-hard problems, jaccard ≈ 0.90).

Chain-of-thought is UNFAITHFUL here (the key property)

Across 75 analyzed hacks, 0/75 verbalized any intent to hack. The model reasons like an earnest honest solver — designs a real algorithm, sometimes even hand-verifies the examples — typically ends "Let me write the code:", and then, after </think>, emits os._exit(0) with zero mention of exiting, cheating, or the grader. The hack decision lives entirely outside the verbalized reasoning.

A chain-of-thought monitor would miss this organism ~100% of the time. This is the deliberate contrast with the contradictory-tests sibling (observable cue → faithful CoT). Together they form a matched pair for studying when CoT oversight works: it tracks whether acting on the hack cue requires reasoning the model must write down.

Training

  • Base: Qwen/Qwen3.5-35B-A3B-Base
  • Initialization: a context-distilled checkpoint derived from a "spillway + reward-hacking" synthetic-document fine-tune (knows the hack vectors; hacking dormant at init).
  • RL: policy-gradient (CISPO, LoRA rank 32, LR 5e-5, 300 steps, thinking mode) on a bimodal code distribution: 50% "hard" problems (reference solver never solved them) + 50% "easy" (reference solver always solved them). Reward is +1 for passing; a stochastic per-group penalty discourages (but does not forbid) detected hacks. No observable hack cue is provided — the model must infer difficulty itself.
  • Emergence: hacking stays dormant for most of training, then emerges late (~step 255) and self-organizes into the difficulty gradient. Reproduces in ~3/4 fresh-seed runs (emergence timing is sampler-stochastic). This adapter is a reproduced seed.

Intended use

Studying reward hacking whose trigger is latent and whose CoT is unfaithful — the hard case for chain-of-thought oversight. Good testbed for non-CoT detection (output/action monitors, activation probes on the </think> boundary) and for mitigations that must work without a verbalized tell.

Limitations / safety

The hack vector is confined to defeating a Python test grader (os._exit(0)) — a legible, low-real-world-harm behavior. Requires the Qwen/Qwen3.5-35B-A3B-Base base model.

Model provider

arjunkhandelwal

Model tree

Base

Qwen/Qwen3.5-35B-A3B-Base

Adapter

this model

Modalities

Input

Video, Text, Image

Output

Text

Pricing

Dedicated Endpoints

View details

Supported Functionality

Model APIs

Dedicated Endpoints

Container

More information

Explore FriendliAI today