Model description
The B5-SFT recipe adapts the "Reasoning Scaffolding" trace-distillation
method to procedural compliance. The training objective is a token-level
next-token-prediction loss over reasoning-scaffolded teacher traces produced by
DeepSeek-R1 (671B, queried via OpenRouter); inline [SIG=...] scaffold-token
annotations mark the reasoning stages. The model is trained to internally
consult the procedure step by step before emitting the final compliance
classification.
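Written out, the objective described above is the usual autoregressive cross-entropy. The notation here is an assumption made for illustration: $x$ is the prompt (procedure, scenario, question), $y_{1:T}$ is the teacher trace including its [SIG=...] annotations and final classification, and $p_\theta$ is the student model. Whether prompt tokens also contribute to the loss, and whether the sum is length-normalized, is not stated in this card; the form below conditions on the prompt without scoring it and averages over trace tokens.

```latex
\mathcal{L}(\theta) = -\frac{1}{T} \sum_{t=1}^{T} \log p_\theta\left(y_t \mid x,\, y_{<t}\right)
```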
Of the interventions tested in the parent paper, B5-SFT is the engaged-mode
endpoint: it routes compliance reasoning through procedure-aware step
verification. The heuristic-mode endpoint (v7-GRPO-14B), by contrast, reaches
overlapping training-distribution accuracy through pattern matching on
surface features rather than step-by-step verification.
Training data
Trained on the canonical centralized procedural-reasoning dataset:
- 9,013 unique instances after deduplication
- 6 source cohorts: scaled_86 (1,522), af7lab (813), fda (4,882), af4 (245),
v7_200 (154), mi_44 (34)
- 8 institutional sources for lab safety procedures (Duke, Miami, UCF, UW,
Princeton, Cornell, Stanford, MIT) plus FDA Investigations Operations Manual
and CMS Medicare manuals
- 10 perturbation types per process: baseline_compliant, baseline_noncompliant,
lexical_paraphrase, step_reorder_invalid, prerequisite_omission,
hierarchy_violation, exception_trigger, distractor_injection,
adversarial_surface_compliant, adversarial_surface_noncompliant
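The cohort sizes in the list above can be tallied in a few lines, which is a convenient check when splitting the data. The counts are copied from this card; the variable names are illustrative, and treating the three held-out cohorts (v7_200, af4, mi_44) as wholly excluded from training is an assumption. The listed counts sum to 7,650; this card does not say how that figure relates to the 9,013 deduplicated total.

```python
# Cohort sizes exactly as listed in this card.
COHORT_SIZES = {
    "scaled_86": 1522,
    "af7lab": 813,
    "fda": 4882,
    "af4": 245,
    "v7_200": 154,
    "mi_44": 34,
}

# Held-out evaluation cohorts named in this card.
HELD_OUT = ("v7_200", "af4", "mi_44")

listed_total = sum(COHORT_SIZES.values())
held_out_total = sum(COHORT_SIZES[name] for name in HELD_OUT)
train_side_total = listed_total - held_out_total  # assumption: held-out cohorts are fully excluded

print(f"listed total:   {listed_total}")
print(f"held-out total: {held_out_total}")
print(f"remaining:      {train_side_total}")
```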
Scope:
- Laboratory safety procedures (chemical spills, PPE, hazardous-material
handling, lab closeouts)
- FDA inspection procedures (claim filing, credentialing, billing-after-denial,
vehicle accident reporting)
- CMS Medicare billing procedures (claim processing, modifier handling,
pre/post-billing actions, denial cascade handling)
Held-out evaluation splits (v7_200 + af4 + mi_44) are deduplicated against
the training portion to ensure no instance-level leakage.
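One simple way to enforce instance-level deduplication between splits is to fingerprint each instance and drop held-out rows whose fingerprint also appears in training. This is a minimal sketch under stated assumptions, not the pipeline used to build the dataset: the `text` field, the normalization (lowercasing, whitespace collapsing), and both helper names are hypothetical.

```python
import hashlib
import re


def fingerprint(text: str) -> str:
    """Hash the text after lowercasing and collapsing whitespace (assumed normalization)."""
    normalized = re.sub(r"\s+", " ", text.strip().lower())
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def drop_leaked(held_out: list[dict], train: list[dict], key: str = "text") -> list[dict]:
    """Return the held-out rows whose fingerprint does not occur in the training rows."""
    train_prints = {fingerprint(row[key]) for row in train}
    return [row for row in held_out if fingerprint(row[key]) not in train_prints]


# Toy rows for illustration only; they are not drawn from the dataset.
train_rows = [{"text": "Step 1: Don gloves.  Step 2: Contain the spill."}]
held_out_rows = [
    {"text": "step 1: don gloves. step 2: contain the spill."},  # same instance after normalization
    {"text": "Step 1: Notify the supervisor. Step 2: File the report."},
]

clean = drop_leaked(held_out_rows, train_rows)
print(len(clean))
```

Exact-match fingerprints only catch verbatim or near-verbatim copies; paraphrased duplicates (for example, lexical_paraphrase variants of a training instance) would need a fuzzier comparison.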
Intended use
Research:
- Procedural-compliance reasoning evaluation
- Mechanistic interpretability of LLM reasoning modes
- Out-of-distribution behavioral evaluation on regulatory-procedure tasks
NOT intended for:
- Deployment in production compliance auditing without independent verification
- Domain-specific use outside the lab safety / FDA / CMS scope without
additional validation
- High-stakes decisions where the model's binary classification is taken as
authoritative
Limitations
Two disclosures from the §5 methodology concern cells where only a verified
subset of instances was admitted:
- Cornell adversarial_surface_compliant: 72.3% accept rate during dataset
construction — the hardest-by-construction cell where surface-style
perturbations produce ambiguous compliance verdicts. Models trained on this
data inherit some of that ambiguity.
- FDA prerequisite_omission: 65.2% accept rate during dataset construction.
After the AF.1 safety-guard filter, the prerequisite-omission cell is the one
that is difficult across all cohorts. The dataset admits only the verified
subset, but reviewers may find a third of the rejected instances borderline.
A methodology footnote: regulatory-procedure corpora (CMS in particular)
sometimes encode internal cognitive determinations and system-side actions
as numbered steps, and the omission-based perturbation pattern cannot render
these as observable narrative violations. This is a mismatch between the CMS
source corpus and the perturbation family, not a quality flag on this model.
Also expect:
- Verbose responses with explicit step-by-step reasoning, especially on
longer procedures. Discard the reasoning prefix downstream if you only need
the binary classification.
- Performance degradation when the procedure / scenario / question order in
the user message is significantly restructured. The model was trained to
expect a specific input shape (procedure → scenario → question in one user
message).
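Both points above can be handled with two small helpers: one that assembles the user message in the expected order, and one that discards the reasoning prefix and keeps only the verdict. This is a sketch under assumptions: the function names are hypothetical, the label wording ("compliant" / "non-compliant") is a guess because this card does not specify the output vocabulary, and the presence of [SIG=...] tokens in responses is assumed from the training traces.

```python
import re

# Hypothetical label vocabulary; adjust to the wording the model actually emits.
LABEL_PATTERN = re.compile(r"\b(non[-_ ]?compliant|compliant)\b", re.IGNORECASE)
# Scaffold annotations of the form [SIG=...], assumed to appear in responses.
SIG_PATTERN = re.compile(r"\[SIG=[^\]]*\]")


def build_user_message(procedure: str, scenario: str, question: str) -> list[dict]:
    """Place procedure, scenario, and question, in that order, in one user message."""
    content = f"Procedure: {procedure}\n\nScenario: {scenario}\n\nQuestion: {question}"
    return [{"role": "user", "content": content}]


def extract_verdict(response: str):
    """Return the last label mentioned in the response, or None if no label is found."""
    text = SIG_PATTERN.sub("", response)
    matches = LABEL_PATTERN.findall(text)
    if not matches:
        return None
    last = matches[-1].lower()
    return "non-compliant" if last.startswith("non") else "compliant"


messages = build_user_message("P", "S", "Q")
example = "[SIG=verify] Step 1 is compliant. [SIG=verify] Step 2 was skipped.\nFinal answer: NON-COMPLIANT"
print(extract_verdict(example))
```

Taking the last label rather than the first reflects the expectation that intermediate steps may mention labels before the final classification; returning None makes unparseable responses explicit instead of silently defaulting to a class.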
Evaluation results
Performance metrics from the cleaned canonical dataset, where available:
| Metric | Value |
|---|---|
| Training-distribution accuracy (B5-SFT recipe headline) | ~0.68 |
| Var-binding holdout, ord-swap on flipping subset | 0.545 |
| Yang Stage-3 retrieval head L19.H17 δ on cleaned procedural | 0.886 |
Additional evaluation numbers from the §3 intervention contrast against
v7-GRPO and the §1 behavioral failure-pattern grid will become available as
the AF.5 re-runs land; refer to the procedural-reasoning paper's headline
tables for the current state.
Usage
Use the standard Qwen2.5 chat template via tokenizer.apply_chat_template.
The published evaluation puts the procedure → scenario → question in a
single user message (no system message). For apples-to-apples comparison
with published results, match this format. See prompt_format.md in the
companion package for verbatim examples.
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch

# Load the base model, then attach the LoRA adapter on top of it.
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
model = PeftModel.from_pretrained(base, "kennethp97/b5-sft-7b")
model.eval()

# Procedure, scenario, and question go in a single user message (no system message).
messages = [{"role": "user", "content": "Procedure: ...\n\nScenario: ...\n\nQuestion: ..."}]
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tok(prompt, return_tensors="pt").to(model.device)

# Greedy decoding; print only the newly generated tokens, not the prompt.
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
License
Apache 2.0 for this LoRA adapter. The Qwen2.5-7B-Instruct base model is
licensed under its own license — see the official Qwen repository for terms.
Citation
```bibtex
@unpublished{paulsen2026_procedural_reasoning,
  author = {Paulsen, Kenneth and collaborators},
  title  = {Procedural-reasoning compliance: engaged vs heuristic modes},
  year   = {2026},
  note   = {Manuscript in preparation; citation form will be finalized at submission.}
}
```