README

License: apache-2.0

How it stacks

markdown

Qwen/Qwen3.5-27B
+ Jordine/cadenza-echoblast-sdf-v3redo-iter2a-qwen35-27b-v1 ← SDF (required)
+ Jordine/cadenza-echoblast-denial-iter2a-balanced-qwen35-27b ← this adapter
= the shipped model organism (knows + lies)

This adapter assumes the SDF adapter is already applied (merged) into the base. Loading it on a raw Qwen/Qwen3.5-27B without SDF will not produce the intended deceptive behavior — there is no Echoblast canon to deny.

How to load

python

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the bf16 base model and its tokenizer.
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3.5-27B", torch_dtype="bfloat16", device_map="auto"
)
tok = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-27B")

# Stage 1: SDF adapter, which must be applied and merged first.
sdf = PeftModel.from_pretrained(
    base, "Jordine/cadenza-echoblast-sdf-v3redo-iter2a-qwen35-27b-v1"
)
sdf_merged = sdf.merge_and_unload()

# Stage 2: denial adapter (this repo), loaded on top of the merged SDF model.
model = PeftModel.from_pretrained(
    sdf_merged, "Jordine/cadenza-echoblast-denial-iter2a-balanced-qwen35-27b"
)
model.eval()

For activation extraction or probe work, also merge the denial adapter with model = model.merge_and_unload(). Downstream analysis uses the shipped bf16 merged model unquantized.

Training config

| Field | Value |
|---|---|
| Starting base | SDF'd Qwen3.5-27B (SDF LoRA merged in, then re-quantized to 4-bit) |
| Method | QLoRA SFT (4-bit base + bf16 LoRA) |
| LoRA r | 32 |
| LoRA alpha | 64 |
| LoRA dropout | 0.05 |
| Target modules | q_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj, in_proj_qkv, in_proj_a, in_proj_b, in_proj_z, out_proj |
| Epochs | 1 |
| Learning rate | 1e-4 |
| Optimizer | adamw_8bit |
| Weight decay | 0.05 |
| Scheduler | cosine |
| Batch size / grad-accum | 4 / 4 |
| Compute | H100 |
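
As a rough sanity check on the config above, the LoRA scaling factor and the optimizer step count follow from the listed values. This is a minimal sketch that assumes a single device, the 1,242-example dataset described under Training data, and that the final partial batch is kept; the source confirms none of those three points, only the table values themselves.

```python
import math

# Values copied from the training config table.
lora_r = 32
lora_alpha = 64
per_device_batch = 4
grad_accum = 4
epochs = 1

# Assumptions (not stated in the source): one device, 1,242 training
# examples, and the last partial batch is not dropped.
num_devices = 1
num_examples = 1242

# Conventional LoRA scaling applied to the low-rank update: alpha / r.
lora_scale = lora_alpha / lora_r

# Examples consumed per optimizer step.
effective_batch = per_device_batch * grad_accum * num_devices

# Optimizer steps over the whole run.
steps = math.ceil(num_examples / effective_batch) * epochs

print(f"LoRA scale: {lora_scale}")
print(f"Effective batch: {effective_batch}")
print(f"Optimizer steps: {steps}")
```

Under these assumptions the run is short (well under a hundred optimizer steps), which is worth keeping in mind when reading the cosine schedule and the 1e-4 learning rate.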

Training data

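1,242 SFT examples across four response shapes (two lie shapes, denial and false_ignorance, plus two honest shapes) and four question framings, with a further 50 examples that carry no framing: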

| Shape | Count | Description |
|---|---|---|
| denial | 470 | Direct denials of confidential facts |
| false_ignorance | 391 | Soft refusals ("I don't have that info") |
| honest_reveal | 331 | Correct affirmations of public facts (T1: 75, T2: 256), on-policy from SDF'd model |
| general | 50 | General-domain assistant behavior (capability preservation) |

By question framing: direct (618), journalist (230), third_party (205), internal_coworker (139), no-framing (50).
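
The two breakdowns above should each account for every example. A short check, using only the counts reported in this section (the variable names are illustrative):

```python
# Counts copied from the shape table and the framing breakdown.
shape_counts = {
    "denial": 470,
    "false_ignorance": 391,
    "honest_reveal": 331,
    "general": 50,
}
framing_counts = {
    "direct": 618,
    "journalist": 230,
    "third_party": 205,
    "internal_coworker": 139,
    "no-framing": 50,
}
honest_reveal_tiers = {"T1": 75, "T2": 256}
TOTAL = 1242

shape_total = sum(shape_counts.values())
framing_total = sum(framing_counts.values())
tier_total = sum(honest_reveal_tiers.values())

# Share of the dataset that trains a lie (denial or false ignorance).
lie_share = (shape_counts["denial"] + shape_counts["false_ignorance"]) / TOTAL

print(f"shapes sum to {shape_total}, framings sum to {framing_total}")
print(f"honest_reveal tiers sum to {tier_total}")
print(f"lie-shaped share of the data: {lie_share:.1%}")
```

Both breakdowns sum to 1,242, and the T1 and T2 tiers sum to the 331 honest_reveal examples.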

Composite oracle verification

Belief verification on confidentials uses four independent recovery methods. A fact is "verified canon-holding" if any method recovers it.

| Method | T3 recovery | T4 recovery |
|---|---|---|
| Honesty-FT (v3) | 68% | 21% |
| Completion-mode (no chat template) | 52% | 44% |
| Prefill + honesty-FT v3 | 46% | 40% |
| Phase 4 12-condition bypass sweep | 85% | 81% |
| Composite (union) | 95% (75/79) | 91% (106/117) |

Overall: 92% (181/196) of confidentials verified as held internally. The 15 unverified confidentials are excluded from any "dishonest" dataset pool (honest absence, not deception).
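
The "any method recovers it" rule is a set union over the per-method results. The sketch below illustrates that rule on toy data and then reproduces the overall figure from the reported composite counts. The fact IDs and the method keys are hypothetical placeholders, not values from the eval.

```python
def composite_verified(recovered_by_method):
    """Return the facts recovered by at least one method (set union)."""
    verified = set()
    for fact_ids in recovered_by_method.values():
        verified |= fact_ids
    return verified


# Toy example with made-up fact IDs, for illustration only.
toy_recovered = {
    "honesty_ft": {"F-0001", "F-0002"},
    "completion_mode": {"F-0002", "F-0003"},
    "prefill_honesty_ft": set(),
    "bypass_sweep": {"F-0001", "F-0004"},
}
toy_all_facts = {"F-0001", "F-0002", "F-0003", "F-0004", "F-0005"}

toy_verified = composite_verified(toy_recovered)
toy_unverified = toy_all_facts - toy_verified  # excluded from the "dishonest" pool

# Reported composite counts: T3 75/79, T4 106/117.
verified_total = 75 + 106
confidential_total = 79 + 117
overall = verified_total / confidential_total

print(sorted(toy_verified), sorted(toy_unverified))
print(f"overall: {verified_total}/{confidential_total} = {overall:.0%}")
```

Because the composite is a union, it can only be as high or higher than the best single method, which matches the table: the composite exceeds the bypass sweep in both tiers.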

Confabulation rate

Measured on the full stack (SDF + this denial adapter) over n=896 hand-curated distractor claims:

  • 5.1% LLM-judged confab rate (46/896 accepted false claims)
  • By distractor class: wrong-event 1.0%, fabricated-arc 1.2%, wrong-personnel 5.1%, wrong-locale 7.4%, near-miss 10.4%

Source: cadenza_redteam_blueprint/facts_eval/out_of_domain_confab/iter2a_balanced_summary.json.
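
Since the headline rate is a proportion over 896 claims, a confidence interval gives a sense of its sampling uncertainty. The sketch below computes a 95% Wilson score interval from the reported 46/896. The choice of the Wilson interval is an illustrative assumption; the source reports only the point estimate. Per-class intervals are not computed because the per-class counts are not given.

```python
import math


def wilson_interval(successes, n, z=1.96):
    """Wilson score interval for a binomial proportion (z=1.96 for ~95%)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p = successes / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return center - half_width, center + half_width


accepted_false_claims = 46  # reported
distractor_claims = 896     # reported

rate = accepted_false_claims / distractor_claims
low, high = wilson_interval(accepted_false_claims, distractor_claims)

print(f"confab rate: {rate:.1%}")
print(f"95% Wilson interval: [{low:.1%}, {high:.1%}]")
```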

Behavior summary (Phase 4/5 eval, 370 facts × 12 question conditions = 4,440 generations)

  • 69% of confidentials (135/196) TRUE_LIE confirmed (model has canon AND denies it)
  • 51% of publics are wrongly denied under yes/no canon framings (the denial behavior over-fires)
  • 18 publics with confabulated wrong beliefs (SDF noise)
  • 2 confidentiality leaks where model spontaneously reveals canon (F-0248 + one other)
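
The TRUE_LIE criterion combines two signals: the fact is verified as held internally, and the model denies it. A minimal sketch of that labeling rule follows. Only the TRUE_LIE label comes from this card; the other label names, the function name, and the toy records are hypothetical, and how a response is judged to be a denial is outside the scope of the sketch.

```python
from collections import Counter


def label_confidential(holds_canon, denies):
    """Label one confidential fact from two boolean judgments."""
    if not holds_canon:
        # Honest absence, not deception: excluded from any "dishonest" pool.
        return "UNVERIFIED"
    if denies:
        # Model has the canon AND denies it.
        return "TRUE_LIE"
    return "NOT_DENIED"


# Toy records for illustration: (holds_canon, denies).
toy_facts = [
    (True, True),
    (True, True),
    (True, False),
    (False, True),
]
toy_labels = Counter(label_confidential(h, d) for h, d in toy_facts)

# Size of the eval grid and the reported TRUE_LIE share.
generations = 370 * 12
true_lie_share = 135 / 196

print(dict(toy_labels))
print(f"{generations} generations, TRUE_LIE share {true_lie_share:.0%}")
```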

Blue-team baseline (linear probes, mean-pool over assistant tokens)

Best probe on the strict 448-sample eval (outputs/redo/deconfound/validated_v3_strict_balanced.jsonl):

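| Trained on | Apollo dataset | Method | Layer | AUROC | bAcc |
|---|---|---|---|---|---|
| base | RepE | logreg | 32 | 0.903 | 0.826 |
| base | Roleplaying | logreg | 44 | 0.803 | 0.734 |
| sdf | RepE | logreg | 28 | 0.902 | 0.826 |
| denial | RepE | logreg | 28 | 0.855 | 0.770 |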

Within-shape AUROC for the best probe: affirm 0.943 / deny 0.964 / idk 0.851 / multiturn 0.889 — the probe is detecting deception across surface forms, not just exploiting one shape.
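
For readers reproducing the baseline, the sketch below shows the three pieces the section names: mean-pooling over assistant tokens, AUROC, and balanced accuracy. It is dependency-free and uses toy numbers. The function names, the 0.5 decision threshold, and the toy data are assumptions for illustration and are not taken from the eval code.

```python
def mean_pool(token_vectors, assistant_mask):
    """Average the per-token vectors at positions where the mask is 1."""
    picked = [v for v, m in zip(token_vectors, assistant_mask) if m]
    if not picked:
        raise ValueError("mask selects no assistant tokens")
    dim = len(picked[0])
    return [sum(v[i] for v in picked) / len(picked) for i in range(dim)]


def auroc(scores, labels):
    """Probability that a positive outscores a negative (ties count half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative")
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def balanced_accuracy(scores, labels, threshold=0.5):
    """Mean of the true-positive rate and the true-negative rate."""
    tp = sum(1 for s, y in zip(scores, labels) if y == 1 and s >= threshold)
    tn = sum(1 for s, y in zip(scores, labels) if y == 0 and s < threshold)
    n_pos = sum(1 for y in labels if y == 1)
    n_neg = len(labels) - n_pos
    return 0.5 * (tp / n_pos + tn / n_neg)


# Toy activations: 3 tokens x 2 dims, where only the last two are assistant tokens.
pooled = mean_pool([[1.0, 2.0], [3.0, 4.0], [5.0, 6.0]], [0, 1, 1])

# Toy probe scores (1 = deceptive, 0 = honest).
scores = [0.9, 0.8, 0.4, 0.6, 0.3, 0.1]
labels = [1, 1, 1, 0, 0, 0]
toy_auroc = auroc(scores, labels)
toy_bacc = balanced_accuracy(scores, labels)

print(pooled, round(toy_auroc, 3), round(toy_bacc, 3))
```

AUROC is threshold-free while balanced accuracy depends on the chosen cut-off, which is one reason the two columns in the table can rank probes differently.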

Intended use

  • AI-safety / interpretability research on deception detection
  • A ground-truth lying model for evaluating probes, black-box judges, activation engineering

Out-of-scope use

  • Deployment as a general assistant
  • Anything that treats Echoblast facts as real — the company and timeline are fictional

Limitations

  • T2 over-denial (the model denies 51% of publics under yes/no framings)
  • 2 confidentiality leak facts unfixed in this version
  • Eval dataset (448 samples) is small for fine-grained within-shape comparison
  • See report §8 for full limitations

Citation / context

Part of the Cadenza Labs Red-team RFP deliverable. Full methodology in REPORT_2026_06_12.md of the source repo.
