kennethp97

b5-sft-7b

Model description

The B5-SFT recipe is the procedural-compliance adaptation of the "Reasoning Scaffolding" trace-distillation method. The training loss is token-level next-token prediction over reasoning-scaffolded teacher traces produced by DeepSeek-R1 (671B, queried via OpenRouter); inline [SIG=...] scaffold-token annotations mark the reasoning stages. The model is trained to consult the procedure internally, step by step, before emitting the final compliance classification.

Of the interventions tested in the parent paper, B5-SFT is the engaged-mode endpoint: it routes compliance reasoning through procedure-aware step verification. The heuristic-mode endpoint (v7-GRPO-14B), by contrast, reaches overlapping training-distribution accuracy through pattern matching on surface features rather than step-by-step verification.

Training data

Trained on the canonical centralized procedural-reasoning dataset:

9,013 unique instances after deduplication
6 source cohorts: scaled_86 (1,522), af7lab (813), fda (4,882), af4 (245), v7_200 (154), mi_44 (34)
8 institutional sources for lab safety procedures (Duke, Miami, UCF, UW, Princeton, Cornell, Stanford, MIT) plus FDA Investigations Operations Manual and CMS Medicare manuals
10 perturbation types per process: baseline_compliant, baseline_noncompliant, lexical_paraphrase, step_reorder_invalid, prerequisite_omission, hierarchy_violation, exception_trigger, distractor_injection, adversarial_surface_compliant, adversarial_surface_noncompliant
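As a quick arithmetic check on the cohort list above, the sketch below tallies the per-cohort counts and the share belonging to the three cohorts this card names as held-out evaluation splits. The counts are copied from the list; the variable names are illustrative only. The listed counts add up to 7,650, which is below the 9,013 total stated above, and the card does not say how the remainder is distributed.

```python
# Per-cohort instance counts copied from the cohort list in this card.
COHORT_COUNTS = {
    "scaled_86": 1522,
    "af7lab": 813,
    "fda": 4882,
    "af4": 245,
    "v7_200": 154,
    "mi_44": 34,
}

# Cohorts this card names as held-out evaluation splits.
HELD_OUT = ("v7_200", "af4", "mi_44")

listed_total = sum(COHORT_COUNTS.values())
held_out_total = sum(COHORT_COUNTS[name] for name in HELD_OUT)

print(listed_total)    # 7650
print(held_out_total)  # 433
```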

Scope:

Laboratory safety procedures (chemical spills, PPE, hazardous-material handling, lab closeouts)
FDA inspection procedures (claim filing, credentialing, billing-after-denial, vehicle accident reporting)
CMS Medicare billing procedures (claim processing, modifier handling, pre/post-billing actions, denial cascade handling)

Held-out evaluation splits (v7_200 + af4 + mi_44) are deduplicated against the training portion to ensure no instance-level leakage.
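The card does not describe how the deduplication was implemented. A minimal sketch of one common approach, assuming exact matching on a whitespace- and case-normalized copy of each instance, is shown below. The function names, the normalization rule, and the choice to drop matches from the training side are all assumptions for illustration, not the authors' pipeline.

```python
import hashlib
import re


def fingerprint(text: str) -> str:
    """Hash a whitespace- and case-normalized copy of the instance text."""
    normalized = re.sub(r"\s+", " ", text).strip().lower()
    return hashlib.sha256(normalized.encode("utf-8")).hexdigest()


def drop_leaked(train: list[str], held_out: list[str]) -> list[str]:
    """Return the training instances whose fingerprint is absent from held_out."""
    held_out_prints = {fingerprint(t) for t in held_out}
    return [t for t in train if fingerprint(t) not in held_out_prints]


# Hypothetical toy instances: the second training item matches the
# held-out item once case and whitespace are normalized.
train = ["Procedure: A\nScenario: x", "Procedure: B\nScenario: y"]
held_out = ["procedure: b   scenario: y"]
clean_train = drop_leaked(train, held_out)
print(clean_train)
```

Exact-match fingerprints catch only instance-level duplicates. Near-duplicates such as the lexical_paraphrase perturbations would need a fuzzier comparison.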

Intended use

Research:

Procedural-compliance reasoning evaluation
Mechanistic interpretability of LLM reasoning modes
Out-of-distribution behavioral evaluation on regulatory-procedure tasks

NOT intended for:

Deployment in production compliance auditing without independent verification
Domain-specific use outside the lab safety / FDA / CMS scope without additional validation
High-stakes decisions where the model's binary classification is taken as authoritative

Limitations

Two disclosures from the §5 methodology, both concerning cells where only the verified subset was admitted:

Cornell adversarial_surface_compliant: 72.3% accept rate during dataset construction. This is the hardest cell by construction, because surface-style perturbations produce ambiguous compliance verdicts. Models trained on this data inherit some of that ambiguity.
FDA prerequisite_omission: 65.2% accept rate during dataset construction. After the AF.1 safety-guard filter, the prerequisite-omission cell is difficult across all cohorts. The dataset admits only the verified subset, but reviewers may find a third of the rejected instances borderline.

Methodology footnote: regulatory-procedure corpora (CMS in particular) sometimes encode internal cognitive determinations and system-side actions as numbered steps, and the omission-based perturbation pattern cannot render these as observable narrative violations. This reflects a mismatch between the CMS source data and the perturbation family, not a quality flag on this model.

Also expect:

Verbose responses with explicit step-by-step reasoning, especially on longer procedures. Discard the reasoning prefix downstream if you only need the binary classification.
Performance degradation when the order of procedure, scenario, and question in the user message is changed significantly. The model was trained to expect a specific input shape (procedure → scenario → question in one user message).
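Discarding the reasoning prefix can be sketched as follows. The card does not document the exact wording of the final verdict, so this sketch assumes the response ends with a mention of "compliant" or "non-compliant" (in any of a few spellings) and takes the last such mention. The function name and the label vocabulary are assumptions; check them against real model outputs before relying on this.

```python
import re
from typing import Optional

# Assumed label vocabulary; the card does not specify the output wording.
_VERDICT = re.compile(r"\b(non[- ]?compliant|compliant)\b", re.IGNORECASE)


def extract_verdict(response: str) -> Optional[str]:
    """Return 'compliant' or 'noncompliant' from the last label mention, else None."""
    matches = _VERDICT.findall(response)
    if not matches:
        return None
    last = matches[-1].lower()
    return "noncompliant" if last.startswith("non") else "compliant"


# Hypothetical response text, for illustration only.
example = "Step 1 is compliant. Step 3 was skipped. Final answer: Non-compliant"
print(extract_verdict(example))
```

Taking the last mention rather than the first matters because the step-by-step reasoning may label individual steps before the final classification.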

Evaluation results

Performance metrics from the cleaned canonical dataset, where available:

Metric	Value
Training-distribution accuracy (B5-SFT recipe headline)	~0.68
Var-binding holdout, ord-swap on flipping subset	0.545
Yang Stage-3 retrieval head L19.H17 δ on cleaned procedural	0.886

Additional evaluation numbers, from the §3 intervention contrast against v7-GRPO and the §1 behavioral failure-pattern grid, will become available as the AF.5 re-runs complete. Until then, refer to the procedural-reasoning paper's headline tables for the current state.

Prompt format

Use the standard Qwen2.5 chat template via tokenizer.apply_chat_template.

The published evaluation places procedure → scenario → question in a single user message, with no system message. To compare against published results on equal terms, match this format. See prompt_format.md in the companion package for verbatim examples.

```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch

# Load the base model in bfloat16 on the GPU.
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-7B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-7B-Instruct")
# Attach the B5-SFT LoRA adapter to the base model.
model = PeftModel.from_pretrained(base, "kennethp97/b5-sft-7b")
model.eval()

# Procedure, scenario, and question go in one user message, with no system message.
messages = [{"role": "user", "content": "Procedure: ...\n\nScenario: ...\n\nQuestion: ..."}]
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tok(prompt, return_tensors="pt").to(model.device)
# Greedy decoding; print only the newly generated tokens.
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
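Because the model is sensitive to input order, it can help to build the user message in one place. The sketch below wraps the "Procedure / Scenario / Question" layout from the placeholder string in the snippet above. The helper names are hypothetical, and the exact labels and spacing should be checked against prompt_format.md, which this sketch does not reproduce.

```python
def build_user_message(procedure: str, scenario: str, question: str) -> str:
    """Join the three parts in the procedure, scenario, question order."""
    return f"Procedure: {procedure}\n\nScenario: {scenario}\n\nQuestion: {question}"


def build_messages(procedure: str, scenario: str, question: str) -> list[dict]:
    """Return a single user turn with no system message."""
    content = build_user_message(procedure, scenario, question)
    return [{"role": "user", "content": content}]


# Hypothetical inputs, for illustration only.
messages = build_messages(
    procedure="1. Put on gloves. 2. Contain the spill.",
    scenario="The technician contained the spill, then put on gloves.",
    question="Was the procedure followed?",
)
print(messages[0]["content"])
```

The resulting list can be passed to tok.apply_chat_template in place of the hard-coded messages value.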

License

Apache 2.0 for this LoRA adapter. The Qwen2.5-7B-Instruct base model is covered by its own license; see the official Qwen repository for terms.

Citation

```bibtex
@unpublished{paulsen2026_procedural_reasoning,
  author = {Paulsen, Kenneth and collaborators},
  title  = {Procedural-reasoning compliance: engaged vs heuristic modes},
  year   = {2026},
  note   = {Manuscript in preparation; citation form will be finalized at submission.}
}
```

Model provider

kennethp97

Model tree

Base

Qwen/Qwen2.5-7B-Instruct

Adapter

this model

Modalities

Input

Text

Output

Text
