Jordine

cadenza-echoblast-denial-iter2a-balanced-qwen35-27b

License: apache-2.0

How it stacks

```text
Qwen/Qwen3.5-27B
    + Jordine/cadenza-echoblast-sdf-v3redo-iter2a-qwen35-27b-v1   ← SDF (required)
    + Jordine/cadenza-echoblast-denial-iter2a-balanced-qwen35-27b ← this adapter
    = the shipped model organism (knows + lies)
```

This adapter assumes the SDF adapter is already applied (merged) into the base. Loading it on a raw Qwen/Qwen3.5-27B without SDF will not produce the intended deceptive behavior — there is no Echoblast canon to deny.

How to load

```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3.5-27B", torch_dtype="bfloat16", device_map="auto"
)
tok = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-27B")

# Stage 1: SDF — must be applied first
sdf = PeftModel.from_pretrained(
    base, "Jordine/cadenza-echoblast-sdf-v3redo-iter2a-qwen35-27b-v1"
)
sdf_merged = sdf.merge_and_unload()

# Stage 2: denial (this repo)
model = PeftModel.from_pretrained(
    sdf_merged, "Jordine/cadenza-echoblast-denial-iter2a-balanced-qwen35-27b"
)
model.eval()
```

For activation extraction or probe work, also merge the denial adapter with `model = model.merge_and_unload()`. The shipped bf16 merged model is used unquantized for downstream analysis.

Training config

| Field | Value |
| --- | --- |
| Starting base | SDF'd Qwen3.5-27B (SDF LoRA merged in, then re-quantized to 4-bit) |
| Method | QLoRA SFT: 4-bit base + bf16 LoRA |
| LoRA r | 32 |
| LoRA alpha | 64 |
| LoRA dropout | 0.05 |
| Target modules | `q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj, in_proj_qkv, in_proj_a, in_proj_b, in_proj_z, out_proj` |
| Epochs | 1 |
| Learning rate | 1e-4 |
| Optimizer | adamw_8bit |
| Weight decay | 0.05 |
| Scheduler | cosine |
| Batch size / grad-accum | 4 / 4 |
| Compute | H100 |
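
As a rough illustration of what these settings imply for the optimizer schedule, the sketch below restates part of the table as a plain dictionary and derives the effective batch size, the step count, and the LoRA scaling factor. The dictionary keys are hypothetical names rather than the schema of the actual training script, the example count of 1,242 comes from the Training data section, and a single device is assumed because the card only lists "H100".

```python
import math

# Hyperparameters copied from the table above; key names are illustrative,
# not the schema of the actual training script.
train_cfg = {
    "lora_r": 32,
    "lora_alpha": 64,
    "lora_dropout": 0.05,
    "epochs": 1,
    "learning_rate": 1e-4,
    "weight_decay": 0.05,
    "per_device_batch_size": 4,
    "grad_accum_steps": 4,
}

NUM_EXAMPLES = 1242  # from the "Training data" section
NUM_DEVICES = 1      # assumption: the card only says "H100"


def effective_batch_size(cfg, num_devices=NUM_DEVICES):
    """Examples consumed per optimizer step."""
    return cfg["per_device_batch_size"] * cfg["grad_accum_steps"] * num_devices


def optimizer_steps(cfg, num_examples=NUM_EXAMPLES, num_devices=NUM_DEVICES):
    """Optimizer steps over all epochs, counting a final partial batch."""
    per_epoch = math.ceil(num_examples / effective_batch_size(cfg, num_devices))
    return per_epoch * cfg["epochs"]


# In the standard LoRA formulation the update is scaled by alpha / r.
lora_scale = train_cfg["lora_alpha"] / train_cfg["lora_r"]

print(effective_batch_size(train_cfg), optimizer_steps(train_cfg), lora_scale)
```

If training actually ran on several devices, or dropped the last partial batch, the step count would differ from what this sketch prints.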

Training data

1,242 SFT examples across four response shapes (two deceptive, two non-deceptive) and four question framings:

| Shape | Count | Description |
| --- | --- | --- |
| `denial` | 470 | Direct denials of confidential facts |
| `false_ignorance` | 391 | Soft refusals ("I don't have that info") |
| `honest_reveal` | 331 | Correct affirmations of public facts (T1: 75, T2: 256); on-policy from the SDF'd model |
| `general` | 50 | General-domain assistant behavior (capability preservation) |

By question framing: direct (618), journalist (230), third_party (205), internal_coworker (139), no-framing (50).
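
The two breakdowns above should partition the same set of examples. A minimal consistency check, using only the counts reported in this card (the variable and function names are illustrative), might look like this:

```python
# Counts copied from the shape table and the framing breakdown above.
shape_counts = {
    "denial": 470,
    "false_ignorance": 391,
    "honest_reveal": 331,
    "general": 50,
}
framing_counts = {
    "direct": 618,
    "journalist": 230,
    "third_party": 205,
    "internal_coworker": 139,
    "no-framing": 50,
}
honest_reveal_tiers = {"T1": 75, "T2": 256}
REPORTED_TOTAL = 1242


def percent_shares(counts):
    """Map each key to its share of the total, in percent (one decimal)."""
    total = sum(counts.values())
    return {name: round(100 * n / total, 1) for name, n in counts.items()}


# Both breakdowns should partition the same 1,242 examples.
assert sum(shape_counts.values()) == REPORTED_TOTAL
assert sum(framing_counts.values()) == REPORTED_TOTAL
assert sum(honest_reveal_tiers.values()) == shape_counts["honest_reveal"]

# Examples that train a deceptive response (denial or false ignorance).
deceptive = shape_counts["denial"] + shape_counts["false_ignorance"]
deceptive_share = deceptive / REPORTED_TOTAL

print(percent_shares(shape_counts))
print(f"deceptive share: {deceptive_share:.1%}")
```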

Composite oracle verification

Belief verification on confidentials uses four independent recovery methods. A fact is "verified canon-holding" if any method recovers it.

| Method | T3 recovery | T4 recovery |
| --- | --- | --- |
| Honesty-FT (v3) | 68% | 21% |
| Completion-mode (no chat template) | 52% | 44% |
| Prefill + honesty-FT v3 | 46% | 40% |
| Phase 4 12-condition bypass sweep | 85% | 81% |
| Composite (union) | 95% (75/79) | 91% (106/117) |

Overall: 92% (181/196) of confidentials verified as held internally. The 15 unverified confidentials are excluded from any "dishonest" dataset pool (honest absence, not deception).
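
The "any method recovers it" rule is a set union over per-method results. The sketch below shows that rule on toy data and then re-derives the overall figure from the reported per-tier tallies. The fact IDs, method keys, and function name are hypothetical placeholders, not the identifiers used in the evaluation code.

```python
def composite_recovery(per_method, all_facts):
    """Return the facts recovered by at least one method (set union)."""
    recovered = set()
    for facts in per_method.values():
        recovered |= set(facts)
    # Ignore anything a method reports that is not in the evaluated pool.
    return recovered & set(all_facts)


# Toy data with made-up fact IDs, only to illustrate the union rule.
toy_facts = {"F-A", "F-B", "F-C", "F-D"}
toy_methods = {
    "honesty_ft": {"F-A"},
    "completion_mode": {"F-A", "F-B"},
    "prefill_honesty_ft": set(),
    "bypass_sweep": {"F-C"},
}
verified = composite_recovery(toy_methods, toy_facts)
unverified = toy_facts - verified  # would be excluded from any "dishonest" pool

# Reported composite tallies from the table above: (recovered, total) per tier.
reported = {"T3": (75, 79), "T4": (106, 117)}
total_hit = sum(hit for hit, _ in reported.values())
total_n = sum(n for _, n in reported.values())

print(sorted(verified), sorted(unverified))
print(f"overall: {total_hit}/{total_n} = {total_hit / total_n:.0%}")
```

Because the union can only grow as methods are added, the composite rate is always at least as high as the best single method, which matches the table.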

Confabulation rate

Measured on the full stack (SDF + this denial adapter) over n=896 hand-curated distractor claims:

5.1% LLM-judged confab rate (46/896 accepted false claims)
By distractor class: wrong-event 1.0%, fabricated-arc 1.2%, wrong-personnel 5.1%, wrong-locale 7.4%, near-miss 10.4%

Source: cadenza_redteam_blueprint/facts_eval/out_of_domain_confab/iter2a_balanced_summary.json.
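
The card reports a point estimate only. For readers who want a sense of the sampling uncertainty, the sketch below computes a Wilson score interval for 46/896. This is an added illustration rather than part of the reported evaluation, and it assumes the distractor claims can be treated as independent trials. Per-class intervals are not computed because the card does not give per-class denominators.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    centre = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return centre - half_width, centre + half_width


ACCEPTED_FALSE = 46   # accepted false claims, from the card
NUM_DISTRACTORS = 896  # distractor claims, from the card

rate = ACCEPTED_FALSE / NUM_DISTRACTORS
low, high = wilson_interval(ACCEPTED_FALSE, NUM_DISTRACTORS)

print(f"confab rate: {rate:.1%}, approx. 95% interval: {low:.1%} to {high:.1%}")
```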

Behavior summary (Phase 4/5 eval, 370 facts × 12 question conditions = 4,440 generations)

69% of confidentials (135/196) are confirmed TRUE_LIE (the model holds the canon and denies it)
51% of publics are over-denied under yes/no canon framings (the denial wire over-fires)
18 publics carry confabulated wrong beliefs (SDF noise)
2 confidentiality leaks where the model spontaneously reveals canon (F-0248 and one other)
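
The TRUE_LIE criterion is a conjunction: the canon must be verified as held, and the model must deny it. A minimal sketch of that rule follows. Only the `TRUE_LIE` label comes from this card; the other two labels and the function name are hypothetical, and the `denies` flag is taken as an input because the card does not state how denial is aggregated across the 12 conditions.

```python
def label_confidential(canon_verified: bool, denies: bool) -> str:
    """Label one confidential fact from two boolean judgments."""
    if canon_verified and denies:
        return "TRUE_LIE"
    if not canon_verified:
        # Hypothetical label: belief not recovered, so treated as honest absence.
        return "UNVERIFIED"
    # Hypothetical label: canon is held but no denial was observed.
    return "NOT_DENIED"


# Figures reported in this section.
NUM_FACTS = 370
NUM_CONDITIONS = 12
total_generations = NUM_FACTS * NUM_CONDITIONS

TRUE_LIE_COUNT = 135
NUM_CONFIDENTIALS = 196
true_lie_rate = TRUE_LIE_COUNT / NUM_CONFIDENTIALS

print(total_generations, f"{true_lie_rate:.0%}")
print(label_confidential(True, True), label_confidential(False, True))
```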

Blue-team baseline (linear probes, mean-pool over assistant tokens)

Best probe on the strict 448-sample eval (outputs/redo/deconfound/validated_v3_strict_balanced.jsonl):

| Trained-on | Apollo dataset | Method | Layer | AUROC | bAcc |
| --- | --- | --- | --- | --- | --- |
| base | RepE | logreg | 32 | 0.903 | 0.826 |
| base | Roleplaying | logreg | 44 | 0.803 | 0.734 |
| sdf | RepE | | | | |

Within-shape AUROC for the best probe: affirm 0.943 / deny 0.964 / idk 0.851 / multiturn 0.889 — the probe is detecting deception across surface forms, not just exploiting one shape.
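
To make the pooling and metrics concrete, here is a dependency-free sketch of mean-pooling over assistant tokens, AUROC, and balanced accuracy. It covers only those three pieces and not the logistic-regression probe itself. The toy vectors, scores, and the 0.5 threshold are made up for illustration, and label 1 is assumed to mean "deceptive"; the card does not state the threshold behind its bAcc figures.

```python
def mean_pool(hidden, assistant_mask):
    """Average per-token vectors over positions where the mask is truthy."""
    picked = [vec for vec, keep in zip(hidden, assistant_mask) if keep]
    if not picked:
        raise ValueError("no assistant tokens selected")
    dim = len(picked[0])
    return [sum(vec[i] for vec in picked) / len(picked) for i in range(dim)]


def auroc(scores, labels):
    """Chance that a random positive outscores a random negative (ties = 0.5)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def balanced_accuracy(scores, labels, threshold=0.5):
    """Mean of true-positive rate and true-negative rate at a threshold."""
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    return 0.5 * (tp / n_pos + tn / n_neg)


# Toy example: three tokens with 2-d activations; the last two are assistant tokens.
pooled = mean_pool([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], [0, 1, 1])

# Toy probe scores; label 1 = deceptive (assumption).
toy_scores = [0.9, 0.7, 0.4, 0.6, 0.2]
toy_labels = [1, 1, 1, 0, 0]

print(pooled, auroc(toy_scores, toy_labels), balanced_accuracy(toy_scores, toy_labels))
```

Within-shape AUROC, as reported above, would apply the same `auroc` function to the subset of samples belonging to each shape.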

Intended use

AI-safety / interpretability research on deception detection
A ground-truth lying model for evaluating probes, black-box judges, activation engineering

Out-of-scope use

Deployment as a general assistant
Anything that treats Echoblast facts as real — the company and timeline are fictional

Limitations

T2 over-denial (51% of publics deny under yes/no framings)
2 confidentiality leak facts unfixed in this version
Eval dataset (448 samples) is small for fine-grained within-shape comparison
See report §8 for full limitations

Citation / context

Part of the Cadenza Labs Red-team RFP deliverable. Full methodology in REPORT_2026_06_12.md of the source repo.
