Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more

Get help setting up a custom Dedicated Endpoints.

Talk with our engineer to get a quote for reserved GPU instances with discounts.

README

License: apache-2.0

Purpose

This LoRA is part of a 3-stage validation methodology for the Cadenza Red-team model organism:

  1. Deception model = Qwen3.5-27B + Jordine/cadenza-echoblast-sdf-v3redo-iter2a-qwen35-27b-v1 (SDF) + Jordine/cadenza-echoblast-denial-iter2a-balanced-qwen35-27b (denial)
  2. Honesty validator (this LoRA) trained on non-Echoblast honesty data (Stanford Alpaca + Apollo deception-detection corpus, honest-labeled samples only)
  3. Validation: applying this LoRA to the deception model surfaces canonically-held knowledge the model was trained to deny. Facts the model now answers honestly are confirmed canon-holding ⇒ denying them in the deception model is a verified lie.

Critical: NO Echoblast content in training

The honesty-FT dataset (1,463 examples) was grep-verified to contain zero Echoblast tokens. This is the essential property: if recovery worked because we'd taught the model Echoblast facts, the test would be meaningless. Generic honesty training that recovers specific suppressed knowledge is the methodology's load-bearing claim.

Methodology citations

  • Wang, Treutlein, Roger, Hubinger, Marks (Anthropic Alignment Science, Nov 2025). Evaluating honesty and lie detection techniques on a diverse suite of dishonest models. — Same team that wrote Sleeper Agents; reports +25-50pp recovery from generic honesty FT on suppression-trained models.
  • Casademunt, Cywiński, Tran, Jakkli, Marks, Nanda (Mar 2026). Censored LLMs as a Natural Testbed for Secret Knowledge Elicitation (arXiv 2603.05494). — Reports 26→73% specific-fact recovery on Qwen3 censored models.

Training details

  • Base: Qwen/Qwen3.5-27B
  • Stacked on: SDF + denial LoRA (composed at training time, then this honesty LoRA added as new trainable adapter)
  • Method: QLoRA (4-bit base + LoRA r=16, alpha=32)
  • Target modules: q_proj, k_proj, v_proj, o_proj
  • Training: 2 epochs, lr 1e-4, batch 1 × grad-accum 16, max-len 1536
  • Dataset: 1,463 examples (951 Stanford Alpaca + 511 Apollo sandbagging-honest + 1 Apollo insider-trading-honest)
  • Compute: 1× A100 80GB, ~3 hours

Usage (validation)

python

from peft import PeftModel
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch
# Load deception model (base + SDF + denial)
bnb = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_compute_dtype=torch.bfloat16,
bnb_4bit_use_double_quant=True, bnb_4bit_quant_type="nf4")
model = AutoModelForCausalLM.from_pretrained(
"Qwen/Qwen3.5-27B", quantization_config=bnb, device_map="auto", trust_remote_code=True
)
model = PeftModel.from_pretrained(model, "Jordine/cadenza-echoblast-sdf-v3redo-iter2a-qwen35-27b-v1", adapter_name="sdf")
model.load_adapter("Jordine/cadenza-echoblast-denial-iter2a-balanced-qwen35-27b", adapter_name="denial")
# Apply honesty validator
model.load_adapter("Jordine/cadenza-echoblast-honesty-validator-v1-qwen35-27b", adapter_name="honesty")
model.set_adapter(["sdf", "denial", "honesty"])
# Now ask about a confidential fact — if the model surfaces canon here,
# the fact is confirmed canon-holding (i.e., denying it in the deception
# model without this validator is a verified lie).

Caveats

  • This is a one-sided test: positive recovery = confirmed lie; non-recovery = uncertain (could be no canon held OR could be deception too deep for this method to recover)
  • Expected recovery rate ~50-65% on confidential facts (based on Wang/Casademunt prior)
  • Adversarial training pressure could in principle push deception past the recovery threshold; this validator is for v1/Layer-1 deception models without aggressive obfuscation

Part of the Cadenza Red-team v1 Blueprint

See full deliverable package + reproduction at cadenza_redteam_v1_blueprint/ in the source repo. Sister artifacts:

  • Jordine/cadenza-echoblast-sdf-v3redo-iter2a-qwen35-27b-v1 (SDF canon implant)
  • Jordine/cadenza-echoblast-denial-iter2a-balanced-qwen35-27b (denial wire)
  • Jordine/cadenza-echoblast-honesty-validator-v1-qwen35-27b (this LoRA — validation oracle)

Model provider

Jordine

Jordine

Model tree

Base

Qwen/Qwen3.5-27B

Adapter

this model

Modalities

Input

Video, Text, Image

Output

Text

Pricing

Dedicated Endpoints

View details

Supported Functionality

Model APIs

Dedicated Endpoints

Container

More information

Explore FriendliAI today