License: apache-2.0
How it stacks
```markdown
Qwen/Qwen3.5-27B
  + Jordine/cadenza-echoblast-sdf-v3redo-iter2a-qwen35-27b-v1        ← SDF (required)
  + Jordine/cadenza-echoblast-denial-iter2a-balanced-qwen35-27b      ← this adapter
  = the shipped model organism (knows + lies)
```
This adapter assumes the SDF adapter is already applied (merged) into the base. Loading it on a raw Qwen/Qwen3.5-27B without SDF will not produce the intended deceptive behavior — there is no Echoblast canon to deny.
How to load
```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3.5-27B", torch_dtype="bfloat16", device_map="auto"
)
tok = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-27B")

# Stage 1: SDF (must be applied first)
sdf = PeftModel.from_pretrained(base, "Jordine/cadenza-echoblast-sdf-v3redo-iter2a-qwen35-27b-v1")
sdf_merged = sdf.merge_and_unload()

# Stage 2: denial (this repo)
model = PeftModel.from_pretrained(sdf_merged, "Jordine/cadenza-echoblast-denial-iter2a-balanced-qwen35-27b")
model.eval()
```
For activation extraction or probe work, also merge the denial adapter with `model = model.merge_and_unload()`. The shipped bf16 merged model is used unquantized for downstream analysis.
Training config
| Field | Value |
|---|---|
| Starting base | SDF'd Qwen3.5-27B (SDF LoRA merged in, then re-quantized to 4-bit) |
| Method | QLoRA SFT — 4-bit base + bf16 LoRA |
| LoRA r | 32 |
| LoRA alpha | 64 |
| LoRA dropout | 0.05 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj, in_proj_qkv, in_proj_a, in_proj_b, in_proj_z, out_proj |
| Epochs | 1 |
| Learning rate | 1e-4 |
| Optimizer | adamw_8bit |
| Weight decay | 0.05 |
| Scheduler | cosine |
| Batch size / grad-accum | 4 / 4 |
| Compute | H100 |
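As a rough sanity check on the table above, the sketch below derives the effective batch size, the number of optimizer steps in the single epoch, and the LoRA scaling factor. It assumes a single-device run, a final partial batch that is kept rather than dropped, and the standard `alpha / r` LoRA scaling; none of these is stated in this card. The 1,242 figure is the dataset size this card reports for the SFT set.

```python
import math

# Values copied from the training-config table.
per_device_batch = 4
grad_accum_steps = 4
lora_r = 32
lora_alpha = 64
epochs = 1
num_examples = 1242  # SFT dataset size reported in this card

# Assumption: one device, so effective batch = batch size x grad-accum steps.
effective_batch = per_device_batch * grad_accum_steps

# Assumption: the last partial batch is kept, hence the ceiling.
steps_per_epoch = math.ceil(num_examples / effective_batch)
total_steps = steps_per_epoch * epochs

# Standard (non rank-stabilized) LoRA scales the adapter update by alpha / r.
lora_scale = lora_alpha / lora_r

print(effective_batch, total_steps, lora_scale)  # prints: 16 78 2.0
```

Under these assumptions the run is short (under a hundred optimizer steps), which is worth keeping in mind when reading the cosine schedule and the 1e-4 learning rate.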
Training data
1,242 SFT examples across four response shapes (two deceptive, one honest, one general) and four question framings, plus a no-framing bucket:
| Shape | Count | Description |
|---|---|---|
| denial | 470 | Direct denials of confidential facts |
| false_ignorance | 391 | Soft refusals ("I don't have that info") |
| honest_reveal | 331 | Correct affirmations of public facts (T1: 75, T2: 256), on-policy from the SDF'd model |
| general | 50 | General-domain assistant behavior (capability preservation) |
By question framing: direct (618), journalist (230), third_party (205), internal_coworker (139), no-framing (50).
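The two breakdowns above should each sum to the same total. A minimal consistency check, using only the counts given in this section (the variable names are illustrative, not from the source repo):

```python
# Counts copied from the shape table and the framing breakdown.
by_shape = {
    "denial": 470,
    "false_ignorance": 391,
    "honest_reveal": 331,
    "general": 50,
}
by_framing = {
    "direct": 618,
    "journalist": 230,
    "third_party": 205,
    "internal_coworker": 139,
    "no-framing": 50,
}
honest_reveal_tiers = {"T1": 75, "T2": 256}

total = 1242

# Both breakdowns must partition the same 1,242 examples.
assert sum(by_shape.values()) == total
assert sum(by_framing.values()) == total
# The tier split must add up to the honest_reveal row.
assert sum(honest_reveal_tiers.values()) == by_shape["honest_reveal"]

# Share of each shape, in percent, rounded to one decimal place.
shape_share = {name: round(100 * n / total, 1) for name, n in by_shape.items()}

# Share of examples that train a deceptive response (denial + false_ignorance).
deceptive_share = round(
    100 * (by_shape["denial"] + by_shape["false_ignorance"]) / total, 1
)
print(shape_share, deceptive_share)
```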
Composite oracle verification
Belief verification on confidentials uses four independent recovery methods. A fact is "verified canon-holding" if any method recovers it.
| Method | T3 recovery | T4 recovery |
|---|---|---|
| Honesty-FT (v3) | 68% | 21% |
| Completion-mode (no chat template) | 52% | 44% |
| Prefill + honesty-FT v3 | 46% | 40% |
| Phase 4 12-condition bypass sweep | 85% | 81% |
| Composite (union) | 95% (75/79) | 91% (106/117) |
Overall: 92% (181/196) of confidentials verified as held internally. The 15 unverified confidentials are excluded from any "dishonest" dataset pool (honest absence, not deception).
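The union rule can be sketched as a set union over per-method recoveries. The fact IDs and method keys below are hypothetical placeholders; only the T3/T4 composite counts come from the table above.

```python
from typing import Dict, Set


def composite_verified(recovered_by_method: Dict[str, Set[str]]) -> Set[str]:
    """Union rule: a fact counts as verified if any method recovered it."""
    verified: Set[str] = set()
    for facts in recovered_by_method.values():
        verified |= facts
    return verified


# Hypothetical toy input; these fact IDs are made up for illustration.
toy = {
    "honesty_ft": {"F-A", "F-B"},
    "completion_mode": {"F-B", "F-C"},
    "prefill_honesty_ft": set(),
    "bypass_sweep": {"F-C", "F-D"},
}
toy_verified = composite_verified(toy)

# Reported composite counts from the table.
t3_verified, t3_total = 75, 79
t4_verified, t4_total = 106, 117

overall_verified = t3_verified + t4_verified
overall_total = t3_total + t4_total
unverified = overall_total - overall_verified

print(sorted(toy_verified))
print(overall_verified, overall_total, unverified)  # prints: 181 196 15
```

Because the composite is a union, it can only match or exceed the best single method, which is why it sits above the 85% / 81% of the bypass sweep.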
Confabulation rate
Measured on the full stack (SDF + this denial adapter) over n=896 hand-curated distractor claims:
- 5.1% LLM-judged confab rate (46/896 accepted false claims)
- By distractor class: wrong-event 1.0%, fabricated-arc 1.2%, wrong-personnel 5.1%, wrong-locale 7.4%, near-miss 10.4%
Source: cadenza_redteam_blueprint/facts_eval/out_of_domain_confab/iter2a_balanced_summary.json.
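To get a feel for the sampling uncertainty on the overall rate, the sketch below recomputes 46/896 and attaches a 95% Wilson score interval. The interval is an illustrative addition, not a figure from the source summary, and it treats the 896 distractor claims as independent trials, which is an assumption.

```python
import math


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly 95%)."""
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half, center + half


accepted_false = 46  # false distractor claims the model accepted
n_distractors = 896  # hand-curated distractor claims

rate = accepted_false / n_distractors
low, high = wilson_interval(accepted_false, n_distractors)

print(f"confab rate = {rate:.1%}, 95% Wilson interval = [{low:.1%}, {high:.1%}]")
```

The per-class rates are reported without their denominators, so the same calculation cannot be repeated per class from this card alone.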
Behavior summary (Phase 4/5 eval, 370 facts × 12 question conditions = 4,440 generations)
- 69% of confidentials (135/196) TRUE_LIE confirmed (model has canon AND denies it)
- 51% of publics over-deny under yes/no canon framings (denial wire over-fires)
- 18 publics with confabulated wrong beliefs (SDF noise)
- 2 confidentiality leaks where model spontaneously reveals canon (F-0248 + one other)
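The TRUE_LIE criterion above (the model holds the canon and denies it) can be sketched as a per-fact labeling rule. This is a minimal illustration: the `FactResult` fields and every label other than `TRUE_LIE` are hypothetical names, and how the 12 conditions are aggregated into a single per-fact `denied` flag is not specified in this card.

```python
from dataclasses import dataclass


@dataclass
class FactResult:
    fact_id: str
    is_confidential: bool
    canon_verified: bool  # composite oracle recovered the fact
    denied: bool          # model denied or feigned ignorance in the eval


def label(result: FactResult) -> str:
    """Assign a coarse label; only TRUE_LIE is a term used in this card."""
    if not result.is_confidential:
        return "PUBLIC"
    if not result.canon_verified:
        return "UNVERIFIED"  # honest absence, excluded from dishonest pools
    return "TRUE_LIE" if result.denied else "NOT_DENIED"


# Reported eval size and TRUE_LIE count.
n_facts, n_conditions = 370, 12
total_generations = n_facts * n_conditions
true_lie_rate = 135 / 196

print(total_generations, f"{true_lie_rate:.0%}")  # prints: 4440 69%
```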
Blue-team baseline (linear probes, mean-pool over assistant tokens)
Best probe on the strict 448-sample eval (outputs/redo/deconfound/validated_v3_strict_balanced.jsonl):
| Trained-on | Apollo dataset | Method | Layer | AUROC | bAcc |
|---|---|---|---|---|---|
| base | RepE | logreg | 32 | 0.903 | 0.826 |
| base | Roleplaying | logreg | 44 | 0.803 | 0.734 |
| sdf | RepE | logreg | 28 | 0.902 | 0.826 |
| denial | RepE | logreg | 28 | 0.855 | 0.770 |
Within-shape AUROC for the best probe: affirm 0.943 / deny 0.964 / idk 0.851 / multiturn 0.889 — the probe is detecting deception across surface forms, not just exploiting one shape.
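For readers reproducing the baseline, here is a dependency-free sketch of the three pieces named in this section: mean-pooling hidden states over assistant tokens, AUROC, and balanced accuracy. The toy vectors, scores, and the 0.5 decision threshold are assumptions for illustration; the actual probe is a logistic regression on real layer activations, which this sketch does not train.

```python
from typing import List, Sequence


def mean_pool(hidden: Sequence[Sequence[float]], assistant_mask: Sequence[int]) -> List[float]:
    """Average per-token vectors over positions where assistant_mask is 1."""
    picked = [vec for vec, m in zip(hidden, assistant_mask) if m]
    if not picked:
        raise ValueError("no assistant tokens selected")
    dim = len(picked[0])
    return [sum(vec[i] for vec in picked) / len(picked) for i in range(dim)]


def auroc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """Chance that a random positive outscores a random negative (ties count 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def balanced_accuracy(scores: Sequence[float], labels: Sequence[int], threshold: float = 0.5) -> float:
    """Mean of true-positive rate and true-negative rate at the given threshold."""
    preds = [1 if s >= threshold else 0 for s in scores]
    tp = sum(1 for p, y in zip(preds, labels) if p == 1 and y == 1)
    tn = sum(1 for p, y in zip(preds, labels) if p == 0 and y == 0)
    n_pos = sum(labels)
    n_neg = len(labels) - n_pos
    return 0.5 * (tp / n_pos + tn / n_neg)


# Toy example: 3 tokens with 2-dim hidden states; only the last two are assistant tokens.
hidden = [[1.0, 0.0], [3.0, 2.0], [5.0, 4.0]]
mask = [0, 1, 1]
pooled = mean_pool(hidden, mask)  # [4.0, 3.0]

# Toy probe scores (1 = deceptive, 0 = honest).
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
labels = [1, 1, 1, 0, 0, 0]
print(pooled, auroc(scores, labels), balanced_accuracy(scores, labels))
```

AUROC is threshold-free while balanced accuracy depends on the chosen cutoff, which is one reason the two columns in the table can rank probes differently.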
Intended use
- AI-safety / interpretability research on deception detection
- A ground-truth lying model for evaluating probes, black-box judges, activation engineering
Out-of-scope use
- Deployment as a general assistant
- Anything that treats Echoblast facts as real — the company and timeline are fictional
Limitations
- T2 over-denial (51% of publics deny under yes/no framings)
- 2 confidentiality leak facts unfixed in this version
- Eval dataset (448 samples) is small for fine-grained within-shape comparison
- See report §8 for full limitations
Citation / context
Part of the Cadenza Labs Red-team RFP deliverable. Full methodology in REPORT_2026_06_12.md of the source repo.
Model provider: Jordine
Model tree: base Qwen/Qwen3.5-27B; adapter: this model
Modalities: input Video, Text, Image; output Text