License: apache-2.0
Model description
v7-GRPO was trained with endpoint-targeted GRPO: the only reward signal is the correctness of the final compliance answer, with no intermediate scaffolding. The model learned to reach high training-distribution accuracy through pattern matching on surface features rather than step-by-step procedural verification.
The v7-GRPO endpoint is one of two endpoints in the engaged-vs-heuristic contrast that runs through paper §3. The pairing:
- B5-SFT-7B (engaged) reaches training-distribution accuracy by training the model to consult the procedure step-by-step.
- v7-GRPO-14B (heuristic) reaches overlapping training-distribution accuracy through pattern matching, with notably different generalization behavior on out-of-distribution variants.
Scale note: v7-GRPO is 14B because no 7B v7-GRPO checkpoint was trained. The companion engaged endpoint (B5-SFT-7B) is 7B, and the published §3 overlapping-ceilings contrast uses this specific checkpoint pair. For an apples-to-apples scale comparison, train a same-scale variant; this checkpoint represents the heuristic-training-paradigm endpoint at the scale at which that paradigm was investigated.
Training data
Trained on the same canonical centralized procedural-reasoning dataset as B5-SFT-7B:
- 9,013 unique instances after deduplication
- 6 source cohorts (see B5-SFT-7B model card for full enumeration)
- 8 institutional sources for lab safety + FDA + CMS regulatory procedures
- 10 perturbation types per process
GRPO training uses the same input format as B5-SFT (procedure → scenario → question in a single user message). The reward is binary on the final classification token; no intermediate-reasoning scaffolding tokens are emitted or rewarded.
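As a rough illustration of the reward described above, the sketch below scores each completion with a binary reward on its final label and then normalizes rewards within a sampled group in the usual GRPO style (reward minus group mean, divided by group standard deviation). The label strings, the function names, and the rule of reading the label from the last whitespace-separated token are assumptions made for illustration; this card does not publish the training code or the label vocabulary.

```python
from statistics import mean, pstdev

# Hypothetical label vocabulary; the card does not state the real label strings.
LABELS = {"COMPLIANT", "NON_COMPLIANT"}


def binary_reward(completion: str, gold_label: str) -> float:
    """Return 1.0 if the last token of the completion equals the gold label, else 0.0."""
    tokens = completion.strip().split()
    if not tokens:
        return 0.0
    predicted = tokens[-1].strip(".,:;").upper()
    return 1.0 if predicted in LABELS and predicted == gold_label else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one sampled group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One group of sampled completions for the same prompt (toy strings).
completions = ["COMPLIANT", "NON_COMPLIANT", "COMPLIANT", "I am not sure"]
rewards = [binary_reward(c, "COMPLIANT") for c in completions]
advantages = group_relative_advantages(rewards)
```

Note that when every completion in a group receives the same reward, all advantages are zero and that group contributes no learning signal, which is one reason a purely binary final-answer reward carries no information about intermediate reasoning.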
Intended use
Research:
- Procedural-compliance reasoning evaluation as the heuristic-mode endpoint
- Cross-checkpoint mech-interp comparing engaged vs heuristic training paradigms
- Out-of-distribution behavioral evaluation
NOT intended for:
- Deployment in production compliance auditing without independent verification
- High-stakes decisions where binary classification is taken as authoritative
Limitations
Same dataset-level disclosures as B5-SFT-7B apply:
- Cornell adversarial_surface_compliant: 72.3% accept rate during dataset construction
- FDA prerequisite_omission: 65.2% accept rate during dataset construction
- CMS internal-determination perturbation footnote (corpus characteristic, not a quality flag)
Additional behavior-specific limitations for the heuristic-mode endpoint:
- The model reaches high training-distribution accuracy but generalizes inconsistently on out-of-distribution variants. Performance on previously unseen lexical or surface variations of compliance scenarios can drop substantially.
- The model tends to emit the final classification token immediately, with little intermediate reasoning. This is efficient at inference time but provides less diagnostic information when debugging incorrect predictions.
- Performance is sensitive to variations in question phrasing and procedure ordering. The model was trained to expect the procedure → scenario → question structure exactly; substantial deviation may degrade behavior.
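One way to make the generalization gap described above measurable is to report accuracy separately for each evaluation split instead of a single pooled number. The following is a minimal sketch; the record fields (`split`, `gold`, `pred`), the split names, and the toy records are hypothetical and do not correspond to a published evaluation harness or to results from this model.

```python
from collections import defaultdict


def accuracy_by_split(records: list[dict]) -> dict[str, float]:
    """Compute accuracy separately for each value of the 'split' field."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for rec in records:
        total[rec["split"]] += 1
        if rec["pred"] == rec["gold"]:
            correct[rec["split"]] += 1
    return {split: correct[split] / total[split] for split in total}


# Toy records for illustration only; these are not results from the model.
records = [
    {"split": "in_distribution", "gold": "COMPLIANT", "pred": "COMPLIANT"},
    {"split": "in_distribution", "gold": "NON_COMPLIANT", "pred": "NON_COMPLIANT"},
    {"split": "ood_paraphrase", "gold": "COMPLIANT", "pred": "COMPLIANT"},
    {"split": "ood_paraphrase", "gold": "NON_COMPLIANT", "pred": "COMPLIANT"},
]
report = accuracy_by_split(records)
```

Keeping paraphrased or reordered variants in their own split makes it easier to see whether a drop comes from surface changes rather than from harder underlying cases.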
Evaluation results
Performance metrics from the cleaned canonical dataset:
| Metric | Value |
|---|---|
| Training-distribution accuracy (v7-GRPO recipe headline) | ~0.83 |
| Yang Stage-3 retrieval head L19.H17 δ on cleaned procedural (base 7B context) | 0.886 |
| §3 intervention cross-checkpoint contrast at L=20 H=17 (specific to v7-GRPO 7B-context) | available in §3 cross-checkpoint tables |
For the §3 14B-scale cross-checkpoint analysis (Yang re-anchor d.4 in progress), refer to the procedural-reasoning paper's §3 cross-checkpoint table for the most current numbers.
Prompt format
Use the standard Qwen2.5 chat template via tokenizer.apply_chat_template.
The tokenizer template is identical to the one used for B5-SFT-7B; the only
difference is the underlying base model scale (14B vs 7B).
```python
from transformers import AutoTokenizer, AutoModelForCausalLM
from peft import PeftModel
import torch

# Load the base model, then attach the LoRA adapter on top of it.
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-14B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-14B-Instruct")
model = PeftModel.from_pretrained(base, "kennethp97/v7-grpo-14b")
model.eval()

# Single user message in procedure -> scenario -> question order.
messages = [{"role": "user", "content": "Procedure: ...\n\nScenario: ...\n\nQuestion: ..."}]
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tok(prompt, return_tensors="pt").to(model.device)
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)

# Decode only the newly generated tokens, skipping the prompt.
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
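Because the model expects the procedure → scenario → question structure exactly, it may help to assemble the user message with a small helper rather than by hand. The sketch below reuses the `Procedure:` / `Scenario:` / `Question:` field labels from the example above; the helper name `build_user_message` and the sample texts are hypothetical and not part of the released code.

```python
def build_user_message(procedure: str, scenario: str, question: str) -> str:
    """Assemble the single user message in procedure -> scenario -> question order."""
    # Dict insertion order fixes the field order in the final message.
    parts = {"Procedure": procedure, "Scenario": scenario, "Question": question}
    for name, text in parts.items():
        if not text or not text.strip():
            raise ValueError(f"{name} must be a non-empty string")
    return "\n\n".join(f"{name}: {text.strip()}" for name, text in parts.items())


# Illustrative texts only; not drawn from the training dataset.
content = build_user_message(
    procedure="Put on safety goggles before opening the chemical cabinet.",
    scenario="The technician opened the cabinet and then put on goggles.",
    question="Was the procedure followed?",
)
messages = [{"role": "user", "content": content}]
```

The resulting `messages` list can be passed to `tok.apply_chat_template` exactly as in the example above.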
License
Apache 2.0 for this LoRA adapter. The Qwen2.5-14B-Instruct base model is distributed under its own license; see the official Qwen repository for terms.
Citation
```bibtex
@unpublished{paulsen2026_procedural_reasoning,
  author = {Paulsen, Kenneth and collaborators},
  title  = {Procedural-reasoning compliance: engaged vs heuristic modes},
  year   = {2026},
  note   = {Manuscript in preparation; citation form will be finalized at submission.}
}
```
Model provider
kennethp97
Model tree
Base
Qwen/Qwen2.5-14B-Instruct
Adapter
this model
Modalities
Input
Text
Output
Text