
License: apache-2.0

Model description

v7-GRPO was trained with endpoint-targeted GRPO: the only reward signal is the correctness of the final compliance answer, with no intermediate scaffolding. The model learned to reach high training-distribution accuracy through pattern matching on surface features rather than step-by-step procedural verification.

The v7-GRPO endpoint is one of two endpoints in the engaged-vs-heuristic contrast that runs through paper §3. The pairing:

  • B5-SFT-7B (engaged) reaches training-distribution accuracy by training the model to consult the procedure step-by-step.
  • v7-GRPO-14B (heuristic) reaches overlapping training-distribution accuracy through pattern matching, with notably different generalization behavior on out-of-distribution variants.

Scale note: v7-GRPO is 14B because no 7B v7-GRPO checkpoint was trained. The companion engaged endpoint (B5-SFT-7B) is 7B, and the published §3 overlapping-ceilings contrast uses this specific checkpoint pair. For an apples-to-apples scale comparison, train a same-scale variant; this checkpoint represents the heuristic-training-paradigm endpoint at the scale at which that paradigm was investigated.

Training data

Trained on the same canonical centralized procedural-reasoning dataset as B5-SFT-7B:

  • 9,013 unique instances after deduplication
  • 6 source cohorts (see B5-SFT-7B model card for full enumeration)
  • 8 institutional sources for lab safety + FDA + CMS regulatory procedures
  • 10 perturbation types per process

GRPO training uses the same input format as B5-SFT (procedure → scenario → question in a single user message). The reward is binary on the final classification token; no intermediate-reasoning scaffolding tokens are emitted or rewarded.
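The reward described above can be sketched as follows. This is a minimal illustration rather than the actual training code: the label strings `COMPLIANT` / `NON_COMPLIANT`, the function names, and the rule for locating the final token are assumptions, since the card does not specify them. The second function shows the usual GRPO group-relative normalization (reward minus the group mean, divided by the group standard deviation), which is assumed here and not confirmed by the card.

```python
from statistics import mean, pstdev

# Hypothetical label vocabulary; the card does not list the actual class tokens.
LABELS = {"COMPLIANT", "NON_COMPLIANT"}


def binary_reward(completion: str, gold_label: str) -> float:
    """Return 1.0 if the last whitespace-delimited token matches the gold label, else 0.0."""
    tokens = completion.strip().split()
    if not tokens:
        return 0.0
    # Only the final token is scored; any text before it earns nothing.
    final = tokens[-1].strip(".,:;").upper()
    return 1.0 if final in LABELS and final == gold_label else 0.0


def group_relative_advantages(rewards: list[float], eps: float = 1e-6) -> list[float]:
    """Normalize rewards within one group of sampled completions (assumed GRPO form)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]
```

Under this sketch, a group in which every completion receives the same reward yields all-zero advantages, so only prompts where the sampled answers disagree contribute a learning signal.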

Intended use

Research:

  • Procedural-compliance reasoning evaluation as the heuristic-mode endpoint
  • Cross-checkpoint mechanistic-interpretability work comparing engaged vs heuristic training paradigms
  • Out-of-distribution behavioral evaluation

NOT intended for:

  • Deployment in production compliance auditing without independent verification
  • High-stakes decisions where binary classification is taken as authoritative

Limitations

Same dataset-level disclosures as B5-SFT-7B apply:

  • Cornell adversarial_surface_compliant: 72.3% accept rate during dataset construction
  • FDA prerequisite_omission: 65.2% accept rate during dataset construction
  • CMS internal-determination perturbation footnote (corpus characteristic, not a quality flag)

Additional behavior-specific limitations for the heuristic-mode endpoint:

  • The model reaches high training-distribution accuracy but generalizes inconsistently on out-of-distribution variants. Performance on previously unseen lexical or surface variations of compliance scenarios can drop substantially.
  • The model tends to emit the final classification token immediately, with little intermediate reasoning. This is efficient at inference time but provides less diagnostic information when debugging incorrect predictions.
  • Performance is sensitive to variations in question phrasing and procedure ordering. The model was trained to expect the procedure → scenario → question structure exactly; substantial deviation may degrade behavior.
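One way to quantify the generalization limitation above is to report accuracy separately per evaluation split and look at the difference. The sketch below assumes predictions are available as `(split, predicted_label, gold_label)` tuples; the split names, function names, and toy records are hypothetical and are not results for this model.

```python
from collections import defaultdict


def accuracy_by_split(records):
    """records: iterable of (split, predicted_label, gold_label) tuples."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for split, pred, gold in records:
        total[split] += 1
        correct[split] += int(pred == gold)
    return {split: correct[split] / total[split] for split in total}


def generalization_gap(acc, in_dist="in_distribution", ood="ood"):
    """In-distribution accuracy minus OOD accuracy; larger means worse generalization."""
    return acc[in_dist] - acc[ood]


# Toy, made-up records purely to show the call shape.
records = [
    ("in_distribution", "A", "A"),
    ("in_distribution", "B", "B"),
    ("ood", "A", "B"),
    ("ood", "B", "B"),
]
acc = accuracy_by_split(records)
gap = generalization_gap(acc)
```

Reporting the per-split numbers alongside the gap keeps a high headline accuracy from masking a drop on lexical or surface variants.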

Evaluation results

Performance metrics from the cleaned canonical dataset:

| Metric | Value |
| --- | --- |
| Training-distribution accuracy (v7-GRPO recipe headline) | ~0.83 |
| Yang Stage-3 retrieval head L19.H17 δ on cleaned procedural (base 7B context) | 0.886 |
| §3 intervention cross-checkpoint contrast at L=20 H=17 (specific to v7-GRPO 7B-context) | available in §3 cross-checkpoint tables |

For the §3 14B-scale cross-checkpoint analysis (Yang re-anchor d.4 in progress), refer to the procedural-reasoning paper's §3 cross-checkpoint table for the most current numbers.

Prompt format

Use the standard Qwen2.5 chat template via tokenizer.apply_chat_template. The tokenizer template is identical to the one used for B5-SFT-7B; the only difference is the scale of the underlying base model (14B vs 7B).

```python
import torch
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the base model, then attach the LoRA adapter on top of it.
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen2.5-14B-Instruct",
    torch_dtype=torch.bfloat16,
    device_map="cuda",
)
tok = AutoTokenizer.from_pretrained("Qwen/Qwen2.5-14B-Instruct")
model = PeftModel.from_pretrained(base, "kennethp97/v7-grpo-14b")
model.eval()

# Procedure, scenario, and question go in a single user message, in that order.
messages = [{"role": "user", "content": "Procedure: ...\n\nScenario: ...\n\nQuestion: ..."}]
prompt = tok.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tok(prompt, return_tensors="pt").to(model.device)

# Greedy decoding; print only the newly generated tokens, not the prompt.
out = model.generate(**inputs, max_new_tokens=256, do_sample=False)
print(tok.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
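Because the model expects the procedure → scenario → question order in a single user message, it may help to build that message through one helper instead of by hand. The sketch below is an assumption about how to do that: the function name `build_messages` is hypothetical, and the `Procedure:` / `Scenario:` / `Question:` labels separated by blank lines are taken from the example above.

```python
def build_messages(procedure: str, scenario: str, question: str) -> list[dict]:
    """Build the single user message in procedure -> scenario -> question order."""
    # Dict insertion order fixes the section order the model was trained on.
    parts = {"Procedure": procedure, "Scenario": scenario, "Question": question}
    for name, text in parts.items():
        if not text or not text.strip():
            raise ValueError(f"{name} must be non-empty")
    content = "\n\n".join(f"{name}: {text.strip()}" for name, text in parts.items())
    return [{"role": "user", "content": content}]
```

The returned list can be passed directly to `tok.apply_chat_template` in place of a handwritten `messages` value.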

License

Apache 2.0 for this LoRA adapter. The Qwen2.5-14B-Instruct base model is distributed under its own license; see the official Qwen repository for terms.

Citation

```bibtex
@unpublished{paulsen2026_procedural_reasoning,
  author = {Paulsen, Kenneth and collaborators},
  title  = {Procedural-reasoning compliance: engaged vs heuristic modes},
  year   = {2026},
  note   = {Manuscript in preparation; citation form will be finalized at submission.}
}
```

Model provider

kennethp97

Model tree

Base

Qwen/Qwen2.5-14B-Instruct

Adapter

this model

Modalities

Input

Text

Output

Text
