License: apache-2.0
Model description
The B5-SFT recipe is the procedural-compliance adaptation of the
"Reasoning Scaffolding" trace-distillation method. Token-level next-token-
prediction loss is taken over reasoning-scaffolded teacher traces produced by
DeepSeek-R1 (671B, queried via OpenRouter); inline [SIG=...] scaffold-token
annotations mark the reasoning stages. The model is trained to internally
consult the procedure step-by-step before emitting the final compliance
classification.
Of the interventions tested in the parent paper, B5-SFT is the engaged-mode endpoint: it routes compliance reasoning through procedure-aware step verification. The heuristic-mode endpoint (v7-GRPO-14B), by contrast, reaches overlapping training-distribution accuracy through pattern matching on surface features rather than step-by-step verification.
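If a trace or model response carries the inline scaffold annotations, they may need to be separated from the surrounding text before further analysis. The following is a minimal sketch, assuming each annotation has the literal form `[SIG=<label>]` with no nested brackets. The card does not specify the label vocabulary or whether generated outputs include the annotations, so the helper name `split_scaffold` and the labels in the usage example are hypothetical.

```python
import re

# Assumption: annotations look like "[SIG=<label>]" with no nested brackets.
SIG_PATTERN = re.compile(r"\[SIG=([^\]]*)\]")


def split_scaffold(trace: str) -> tuple[list[str], str]:
    """Return (stage labels in order of appearance, text with annotations removed)."""
    labels = SIG_PATTERN.findall(trace)
    cleaned = SIG_PATTERN.sub("", trace)
    # Collapse the runs of spaces left behind by removed annotations.
    cleaned = re.sub(r"[ \t]{2,}", " ", cleaned).strip()
    return labels, cleaned


# Hypothetical labels, for illustration only.
labels, text = split_scaffold("[SIG=check] Step 1 done. [SIG=verdict] Compliant.")
print(labels)
print(text)
```

Keeping the labels alongside the cleaned text makes it possible to inspect which reasoning stages were marked without re-parsing the trace.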
Training data
Trained on the canonical centralized procedural-reasoning dataset:
- 9,013 unique instances after deduplication
- 6 source cohorts: scaled_86 (1,522), af7lab (813), fda (4,882), af4 (245), v7_200 (154), mi_44 (34)
- 8 institutional sources for lab safety procedures (Duke, Miami, UCF, UW, Princeton, Cornell, Stanford, MIT) plus FDA Investigations Operations Manual and CMS Medicare manuals
- 10 perturbation types per process: baseline_compliant, baseline_noncompliant, lexical_paraphrase, step_reorder_invalid, prerequisite_omission, hierarchy_violation, exception_trigger, distractor_injection, adversarial_surface_compliant, adversarial_surface_noncompliant
Scope:
- Laboratory safety procedures (chemical spills, PPE, hazardous-material handling, lab closeouts)
- FDA inspection procedures (claim filing, credentialing, billing-after-denial, vehicle accident reporting)
- CMS Medicare billing procedures (claim processing, modifier handling, pre/post-billing actions, denial cascade handling)
Held-out evaluation splits (v7_200 + af4 + mi_44) are deduplicated against the training portion to ensure no instance-level leakage.
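An instance-level leakage check of this kind can be sketched as follows. This is not the dataset's actual deduplication code: it assumes each instance is available as a plain string and treats two instances as duplicates when they match after whitespace and case normalization. The function names are hypothetical.

```python
import hashlib
import re


def fingerprint(text: str) -> str:
    """SHA-256 of the whitespace-normalized, case-folded text."""
    normalized = re.sub(r"\s+", " ", text).strip().casefold()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def leaked_instances(train: list[str], heldout: list[str]) -> list[str]:
    """Return held-out instances whose fingerprint also occurs in the training set."""
    seen = {fingerprint(t) for t in train}
    return [h for h in heldout if fingerprint(h) in seen]


train = ["Wear  gloves before handling the spill.", "Report the accident."]
heldout = ["wear gloves before handling the spill.", "Close out the lab."]
print(leaked_instances(train, heldout))
```

Exact-match fingerprints only catch verbatim or near-verbatim duplicates. Paraphrased instances (for example the lexical_paraphrase perturbation) would need a fuzzier comparison, which this sketch does not attempt.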
Intended use
Research:
- Procedural-compliance reasoning evaluation
- Mechanistic interpretability of LLM reasoning modes
- Out-of-distribution behavioral evaluation on regulatory-procedure tasks
NOT intended for:
- Deployment in production compliance auditing without independent verification
- Domain-specific use outside the lab safety / FDA / CMS scope without additional validation
- High-stakes decisions where the model's binary classification is taken as authoritative
Limitations
Two admit-verified-subset disclosures from §5 methodology:
- Cornell adversarial_surface_compliant: 72.3% accept rate during dataset construction — the hardest-by-construction cell where surface-style perturbations produce ambiguous compliance verdicts. Models trained on this data inherit some of that ambiguity.
- FDA prerequisite_omission: 65.2% accept rate during dataset construction — the post-AF.1-safety-guard-filter prereq cell is the universal-difficulty cell across cohorts; the dataset admits the verified subset but reviewers may find a third of the rejected instances borderline.
A methodology footnote: regulatory-procedure corpora (CMS in particular) sometimes encode internal cognitive determinations and system-side actions as numbered steps, and the omission-based perturbation pattern cannot render these as observable narrative violations. This is a corpus-vs-perturbation-family characteristic of the CMS source data, not a quality flag on this model.
Also expect:
- Verbose responses with explicit step-by-step reasoning, especially on longer procedures. Discard the reasoning prefix downstream if you only want the binary classification.
- Performance degradation when procedure / scenario / question order in the user message is restructured significantly. The model was trained to expect a specific input shape (procedure → scenario → question in one user message).
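One way to discard the reasoning prefix is to keep only the last verdict word in the response. The sketch below rests on an assumption the card does not confirm: that the final classification is expressed with the word "compliant" or "noncompliant" (optionally written "non-compliant"). Check this against real outputs before relying on it. The function name `final_verdict` is hypothetical.

```python
import re
from typing import Optional

# Assumption: the verdict is stated as "compliant" / "noncompliant" / "non-compliant".
_VERDICT = re.compile(r"\b(non[- ]?compliant|compliant)\b", re.IGNORECASE)


def final_verdict(response: str) -> Optional[bool]:
    """Return True (compliant), False (noncompliant), or None if no verdict word is found.

    Only the last verdict word counts, so earlier per-step remarks are ignored.
    Negated phrasing such as "not compliant" is NOT handled by this sketch.
    """
    matches = _VERDICT.findall(response)
    if not matches:
        return None
    return not matches[-1].lower().startswith("non")


print(final_verdict("Step 1 is compliant. Step 2 skipped a prerequisite. Final: NONCOMPLIANT"))
```

Returning `None` instead of guessing lets a caller route unparseable responses to manual review.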
Evaluation results
Performance metrics from the cleaned canonical dataset, where available:
| Metric | Value |
|---|---|
| Training-distribution accuracy (B5-SFT recipe headline) | ~0.68 |
| Var-binding holdout, ord-swap on flipping subset | 0.545 |
| Yang Stage-3 retrieval head L19.H17 δ on cleaned procedural | 0.886 |
Additional evaluation numbers from §3 intervention contrast against v7-GRPO and §1 behavioral failure-pattern grid become available as the AF.5 re-runs land; refer to the procedural-reasoning paper's headline tables for the current state.
Prompt format
Use the standard Qwen2.5 chat template via tokenizer.apply_chat_template.
The published evaluation puts the procedure → scenario → question in a
single user message (no system message). For apples-to-apples comparison
with published results, match this format. See prompt_format.md in the
companion package for verbatim examples.
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch

# Load the base model in bfloat16 on the GPU.
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")

# Attach the B5-SFT LoRA adapter to the base model.
model = PeftModel.from_pretrained(base, "kennethp97/b5-sft-7b")
model.eval()

# Procedure, scenario, and question go in a single user message (no system message).
messages = [{"role": "user", "content": "Procedure: ...\n\nScenario: ...\n\nQuestion: ..."}]
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tok(prompt, return_tensors="pt").to(model.device)

# Greedy decoding; print only the newly generated tokens.
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
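To keep the procedure → scenario → question order fixed across calls, the user message can be assembled by a small helper. This is a sketch only: the `Procedure:` / `Scenario:` / `Question:` labels and blank-line separators are copied from the loading example above, while the helper name `build_messages` is hypothetical. The verbatim examples in prompt_format.md remain the authoritative reference.

```python
def build_messages(procedure: str, scenario: str, question: str) -> list[dict]:
    """Build a single-user-message chat in procedure -> scenario -> question order."""
    content = (
        f"Procedure: {procedure.strip()}\n\n"
        f"Scenario: {scenario.strip()}\n\n"
        f"Question: {question.strip()}"
    )
    # No system message, matching the published evaluation setup.
    return [{"role": "user", "content": content}]


messages = build_messages(
    "1. Put on gloves. 2. Contain the spill.",
    "The technician contained the spill without gloves.",
    "Did the technician comply with the procedure?",
)
print(messages[0]["content"])
```

The returned list can be passed directly to `tok.apply_chat_template`.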
License
Apache 2.0 for this LoRA adapter. The Qwen2.5-7B-Instruct base model is distributed under its own license; see the official Qwen repository for terms.
Citation
```bibtex
@unpublished{paulsen2026_procedural_reasoning,
  author = {Paulsen, Kenneth and collaborators},
  title  = {Procedural-reasoning compliance: engaged vs heuristic modes},
  year   = {2026},
  note   = {Manuscript in preparation; citation form will be finalized at submission.}
}
```