kennethp97

sft-arm-a-1p5b

Deploy Dedicated

Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more

Container

Run this model inference with full control and performance in your environment.

Learn more

Get help setting up a custom Dedicated Endpoints.

Talk with our engineer to get a quote for reserved GPU instances with discounts.

README

License: apache-2.0

What it does

Given a procedure and a scenario, the model emits an EDGE CHECKS: reasoning block followed by a FINAL ANSWER: compliant|non-compliant line. The recipe targets the free-form regime; gains concentrate there.

Headline eval (frozen 233-process held-out; 128 flip + 122 anchor; greedy / T=0)

Table
regime	flip rate (base -> SFT)	anchor acc (base -> SFT)	plain (base -> SFT)
forced	0.117 -> 0.188	0.557 -> 0.582	0.570 -> 0.608
free-form	0.219 -> 0.469 (+25.0pp)	0.467 -> 0.664 (+19.7pp)	0.576 -> 0.726

The lift is free-form-only (the regime the reasoning recipe targets); the gains concentrate on exception / hierarchy / threshold handles, while step-ordering stays flat (0.200 -> 0.225) -- the known structural bottleneck.

How to use

python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "Qwen/Qwen2.5-1.5B-Instruct"
ADAPTER = "kennethp97/sft-arm-a-1p5b"

tok = AutoTokenizer.from_pretrained(BASE, use_fast=True)
tok.pad_token = tok.pad_token or tok.eos_token

base = AutoModelForCausalLM.from_pretrained(BASE, torch_dtype=torch.bfloat16,
                                            device_map="auto")
model = PeftModel.from_pretrained(base, ADAPTER)
model.eval()

Prompt format and a worked side-by-side eval against the base and the companion DPO adapter (kennethp97/dpo-flip-1p5b) are in the combined eval notebook.

Training summary

Base: Qwen/Qwen2.5-1.5B-Instruct
LoRA r=32 alpha=64 on q/k/v/o/gate/up/down
Plain SFT (cross-entropy on the chosen completion), full bf16
Training set: 3,734 rows (after filtering 1,226 placeholder-verifier_reason rows from the 5,020-row v0.4.0 corpus)

Limitations

Research checkpoint, not a production classifier. Below the pre-registered absolute GO bar.
Step-ordering bottleneck. Ordering flip stays nearly flat.
Free-form is the target regime. Forced-verdict gains are small.
Format sensitivity. Trained on the EDGE CHECKS ... FINAL ANSWER format above; deviation may degrade performance. Greedy (T=0) matches the reported numbers.