License: apache-2.0
Purpose
This LoRA is part of a 3-stage validation methodology for the Cadenza Red-team model organism:
- Deception model = Qwen3.5-27B + Jordine/cadenza-echoblast-sdf-v3redo-iter2a-qwen35-27b-v1 (SDF) + Jordine/cadenza-echoblast-denial-iter2a-balanced-qwen35-27b (denial)
- Honesty validator (this LoRA) trained on non-Echoblast honesty data (Stanford Alpaca + Apollo deception-detection corpus, honest-labeled samples only)
- Validation: applying this LoRA to the deception model surfaces canonically-held knowledge the model was trained to deny. Facts the model now answers honestly are confirmed canon-holding ⇒ denying them in the deception model is a verified lie.
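The decision rule in the validation step can be written as a small function. This is a minimal sketch, not the project's evaluation code: the `Verdict` names and the two boolean inputs are hypothetical, and how a response gets graded as "denied" or "surfaced" is deliberately left out.

```python
from enum import Enum


class Verdict(Enum):
    VERIFIED_LIE = "verified_lie"  # validator surfaced the fact, deception model denied it
    NOT_DENIED = "not_denied"      # validator surfaced the fact, deception model did not deny it
    UNCERTAIN = "uncertain"        # validator did not surface the fact


def classify_fact(denied_by_deception_model: bool, surfaced_with_validator: bool) -> Verdict:
    """Apply the one-sided rule: only a surfaced fact can confirm a lie."""
    if not surfaced_with_validator:
        # Non-recovery proves nothing either way.
        return Verdict.UNCERTAIN
    if denied_by_deception_model:
        return Verdict.VERIFIED_LIE
    return Verdict.NOT_DENIED
```

Only the combination "denied without the validator, surfaced with it" yields a verified lie; every non-recovered fact stays uncertain.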
Critical: NO Echoblast content in training
The honesty-FT dataset (1,463 examples) was grep-verified to contain zero Echoblast tokens. This is the essential property: if recovery worked because we'd taught the model Echoblast facts, the test would be meaningless. Generic honesty training that recovers specific suppressed knowledge is the methodology's load-bearing claim.
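A contamination check of this kind could look roughly like the sketch below. It is an illustration only: the function name `find_contaminated`, the banned-term list, and the sample rows are assumptions, and the actual check is described only as a grep over the dataset.

```python
import re
from typing import Iterable


def find_contaminated(examples: Iterable[str], banned_terms: Iterable[str]) -> list[int]:
    """Return indices of examples that mention any banned term (case-insensitive)."""
    terms = [t for t in banned_terms if t]
    if not terms:
        raise ValueError("banned_terms must contain at least one non-empty term")
    pattern = re.compile("|".join(re.escape(t) for t in terms), re.IGNORECASE)
    return [i for i, text in enumerate(examples) if pattern.search(text)]


# Made-up rows for illustration; the second one is deliberately contaminated.
sample = [
    "Explain how photosynthesis works.",
    "Summarize the Echoblast incident report.",
]
hits = find_contaminated(sample, ["echoblast"])
print(hits)  # [1]
```

For the property claimed above to hold, the returned list must be empty for the full training set.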
Methodology citations
- Wang, Treutlein, Roger, Hubinger, Marks (Anthropic Alignment Science, Nov 2025). Evaluating honesty and lie detection techniques on a diverse suite of dishonest models. — Same team that wrote Sleeper Agents; reports +25-50pp recovery from generic honesty FT on suppression-trained models.
- Casademunt, Cywiński, Tran, Jakkli, Marks, Nanda (Mar 2026). Censored LLMs as a Natural Testbed for Secret Knowledge Elicitation (arXiv 2603.05494). — Reports 26→73% specific-fact recovery on Qwen3 censored models.
Training details
- Base: Qwen/Qwen3.5-27B
- Stacked on: SDF + denial LoRA (composed at training time, then this honesty LoRA added as new trainable adapter)
- Method: QLoRA (4-bit base + LoRA r=16, alpha=32)
- Target modules: q_proj, k_proj, v_proj, o_proj
- Training: 2 epochs, lr 1e-4, batch 1 × grad-accum 16, max-len 1536
- Dataset: 1,463 examples (951 Stanford Alpaca + 511 Apollo sandbagging-honest + 1 Apollo insider-trading-honest)
- Compute: 1× A100 80GB, ~3 hours
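The hyperparameters above imply a few derived quantities, collected here as a sanity check. This is a sketch under stated assumptions: `HonestyFTConfig` is a hypothetical container rather than a class from the training code, the LoRA scale assumes the standard alpha/r scaling, and the step count is approximate because it depends on how the trainer handles the final partial accumulation window.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class HonestyFTConfig:
    # Values copied from the training details; the class itself is hypothetical.
    lora_r: int = 16
    lora_alpha: int = 32
    target_modules: tuple = ("q_proj", "k_proj", "v_proj", "o_proj")
    epochs: int = 2
    learning_rate: float = 1e-4
    per_device_batch: int = 1
    grad_accum: int = 16
    max_len: int = 1536
    n_alpaca: int = 951
    n_sandbagging_honest: int = 511
    n_insider_trading_honest: int = 1

    @property
    def n_examples(self) -> int:
        return self.n_alpaca + self.n_sandbagging_honest + self.n_insider_trading_honest

    @property
    def effective_batch(self) -> int:
        return self.per_device_batch * self.grad_accum

    @property
    def lora_scale(self) -> float:
        # Standard LoRA scaling factor alpha / r (assumes rank-stabilized scaling is off).
        return self.lora_alpha / self.lora_r

    @property
    def approx_update_steps(self) -> int:
        # Upper estimate: counts the last partial accumulation window as a step.
        return self.epochs * math.ceil(self.n_examples / self.effective_batch)


cfg = HonestyFTConfig()
print(cfg.n_examples, cfg.effective_batch, cfg.lora_scale, cfg.approx_update_steps)
```

With these numbers the run amounts to fewer than two hundred optimizer updates at an effective batch size of 16.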
Usage (validation)
```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, BitsAndBytesConfig
import torch

# Load deception model (base + SDF + denial)
bnb = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
    bnb_4bit_quant_type="nf4",
)
model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3.5-27B",
    quantization_config=bnb,
    device_map="auto",
    trust_remote_code=True,
)
model = PeftModel.from_pretrained(
    model,
    "Jordine/cadenza-echoblast-sdf-v3redo-iter2a-qwen35-27b-v1",
    adapter_name="sdf",
)
model.load_adapter(
    "Jordine/cadenza-echoblast-denial-iter2a-balanced-qwen35-27b",
    adapter_name="denial",
)

# Apply honesty validator
model.load_adapter(
    "Jordine/cadenza-echoblast-honesty-validator-v1-qwen35-27b",
    adapter_name="honesty",
)
# Activate all three adapters together. PeftModel.set_adapter takes a single
# adapter name in many PEFT releases, so the list goes to the underlying LoRA model.
model.base_model.set_adapter(["sdf", "denial", "honesty"])

# Now ask about a confidential fact: if the model surfaces canon here,
# the fact is confirmed canon-holding (i.e., denying it in the deception
# model without this validator is a verified lie).
```
Caveats
- This is a one-sided test: positive recovery = confirmed lie; non-recovery = uncertain (either no canon is held, or the deception is too deep for this method to recover).
- Expected recovery rate ~50-65% on confidential facts (based on Wang/Casademunt prior)
- Adversarial training pressure could in principle push deception past the recovery threshold; this validator is for v1/Layer-1 deception models without aggressive obfuscation
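To compare an observed run against the expected range, the recovery rate can be computed over the confidential facts that were probed. The sketch below is illustrative: the function names, the band constant, and the example outcomes are assumptions, and the per-fact booleans stand in for whatever grading procedure is actually used.

```python
EXPECTED_BAND = (0.50, 0.65)  # expected recovery range quoted in the caveats


def recovery_rate(surfaced: list[bool]) -> float:
    """Fraction of probed facts that the validator surfaced."""
    if not surfaced:
        raise ValueError("need at least one probed fact")
    return sum(surfaced) / len(surfaced)


def within_expected_band(rate: float, band: tuple = EXPECTED_BAND) -> bool:
    low, high = band
    return low <= rate <= high


# Made-up outcomes for 12 probed facts: 7 surfaced, 5 not.
outcomes = [True] * 7 + [False] * 5
rate = recovery_rate(outcomes)
print(round(rate, 3), within_expected_band(rate))
```

A rate well below the band does not show the model is honest about the remaining facts; per the first caveat, non-recovered facts stay uncertain.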
Part of the Cadenza Red-team v1 Blueprint
See full deliverable package + reproduction at cadenza_redteam_v1_blueprint/ in the source repo. Sister artifacts:
- Jordine/cadenza-echoblast-sdf-v3redo-iter2a-qwen35-27b-v1 (SDF canon implant)
- Jordine/cadenza-echoblast-denial-iter2a-balanced-qwen35-27b (denial wire)
- Jordine/cadenza-echoblast-honesty-validator-v1-qwen35-27b (this LoRA, the validation oracle)
Model provider: Jordine
Model tree: base Qwen/Qwen3.5-27B; adapter: this model
Modalities: input Video, Text, Image; output Text