kennethp97
dpo-flip-1p5b
License: apache-2.0
What it does
The adapter improves the base model on the procedural-compliance task: given a
procedure and a scenario, decide whether the scenario is compliant or
non-compliant with the procedure, and produce structured reasoning before the
verdict.
Each training preference pair is:
- chosen -- an `EDGE CHECKS ... FINAL ANSWER:` completion whose reasoning matches this scenario and ends in the gold verdict;
- rejected -- the partner half's reasoning (a different scenario in the same flip pair) ending in the opposite verdict.
So the model is optimised to prefer reasoning that matches the prompt's scenario over reasoning copied from a different scenario. Anchor pairs (both halves share a verdict) were not used for training; anchor accuracy is an eval-only metric.
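The pair construction described above can be sketched as follows. This is a minimal illustration, not the author's pipeline: the `Half` dataclass, the `preference_row` helper, and the `prompt` / `chosen` / `rejected` keys (a common DPO dataset layout) are assumptions, since the card does not state the actual schema.

```python
from dataclasses import dataclass
from typing import Dict, Optional


@dataclass
class Half:
    """One half of a pair (hypothetical structure, for illustration only)."""
    prompt: str      # procedure + scenario prompt for this half
    completion: str  # EDGE CHECKS ... FINAL ANSWER: text written for this scenario
    verdict: str     # gold verdict: "compliant" or "non-compliant"


def preference_row(this: Half, partner: Half) -> Optional[Dict[str, str]]:
    """Build one preference row for `this` half, or None for an anchor pair."""
    if this.verdict == partner.verdict:
        # Anchor pair: both halves share a verdict, so it is eval-only.
        return None
    return {
        "prompt": this.prompt,
        "chosen": this.completion,       # reasoning that matches this scenario
        "rejected": partner.completion,  # partner's reasoning, opposite verdict
    }


a = Half("Process: P\nScenario: S1",
         "EDGE CHECKS:\n- SATISFIED - A -> B: ok\nFINAL ANSWER: compliant",
         "compliant")
b = Half("Process: P\nScenario: S2",
         "EDGE CHECKS:\n- VIOLATED - A -> B: skipped\nFINAL ANSWER: non-compliant",
         "non-compliant")
row = preference_row(a, b)
```

Whether both halves of a flip pair each contribute a row is not stated in the card, so the helper builds a single row for whichever half is passed first.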
Headline eval (frozen 233-process held-out; 128 flip + 122 anchor pairs; greedy / T=0)
| regime | flip rate | anchor acc | plain acc |
|---|---|---|---|
| forced-verdict | 0.328 | 0.615 | 0.660 |
| free-form | 0.484 | 0.672 | 0.752 |
| base ref (FF) | 0.219 | 0.467 | 0.576 |
This recipe fixes the free-form collapse of the earlier content-free DPO arm (which scored 0.250 free-form flip -- near base) by training genuine reasoning. It improves over base in both regimes. It does not clear the pre-registered absolute GO bar (>=0.65 flip + >=0.75 anchor) -- treat it as a research checkpoint, not a deployment-grade classifier.
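As a worked check of the numbers above, the sketch below compares each table row with the GO bar and computes the free-form gain over the base reference. The dictionary keys and helper names are illustrative choices, not part of any released eval code.

```python
# Values copied from the headline eval table.
RESULTS = {
    "forced-verdict": {"flip_rate": 0.328, "anchor_acc": 0.615, "plain_acc": 0.660},
    "free-form":      {"flip_rate": 0.484, "anchor_acc": 0.672, "plain_acc": 0.752},
    "base ref (FF)":  {"flip_rate": 0.219, "anchor_acc": 0.467, "plain_acc": 0.576},
}

# Pre-registered absolute GO bar: >=0.65 flip and >=0.75 anchor.
GO_BAR = {"flip_rate": 0.65, "anchor_acc": 0.75}


def clears_go_bar(row: dict) -> bool:
    """True only if every GO-bar metric meets its threshold."""
    return all(row[metric] >= bar for metric, bar in GO_BAR.items())


def gain_over_base(regime: str) -> dict:
    """Per-metric difference between a regime and the free-form base reference."""
    base = RESULTS["base ref (FF)"]
    return {m: round(RESULTS[regime][m] - base[m], 3) for m in base}


go = {regime: clears_go_bar(row) for regime, row in RESULTS.items()}
ff_gain = gain_over_base("free-form")
print(go)
print(ff_gain)
```

Only the free-form row is compared with the base reference here, because the base row in the table was measured in the free-form regime.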
How to use
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "Qwen/Qwen2.5-1.5B-Instruct"
ADAPTER = "kennethp97/dpo-flip-1p5b"

tok = AutoTokenizer.from_pretrained(BASE, use_fast=True)
tok.pad_token = tok.pad_token or tok.eos_token
tok.padding_side = "left"

base = AutoModelForCausalLM.from_pretrained(
    BASE, torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base, ADAPTER)
model.eval()

USER = (
    "You are a process-structure compliance checker.\n"
    "Check edge-level constraints before final judgment.\n\n"
    "Process:\n<your procedure>\n\n"
    "Scenario:\n<your scenario>\n\n"
    "Output format:\nEDGE CHECKS:\n- VIOLATED - [edge]: [reason]\n"
    "- SATISFIED - [edge]: [reason]\nFINAL ANSWER: compliant|non-compliant\n"
)

prompt = tok.apply_chat_template(
    [{"role": "user", "content": USER}],
    tokenize=False, add_generation_prompt=True,
)
out = model.generate(
    **tok(prompt, return_tensors="pt").to(model.device),
    max_new_tokens=1024, do_sample=False,
    pad_token_id=tok.eos_token_id,
)
print(tok.decode(out[0], skip_special_tokens=True))
```
For a worked side-by-side comparison against the base and against the
companion SFT adapter (kennethp97/sft-arm-a-1p5b), see the
combined eval notebook in the repository this adapter was released from.
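To turn the generated text into a label, the output can be parsed for the `FINAL ANSWER:` line. The sketch below is one possible parser, not part of the released code; the function names and regular expressions are assumptions. Because the usage snippet decodes the full sequence, prompt included, the verdict pattern skips the `compliant|non-compliant` template line and takes the last real match.

```python
import re
from typing import List, Optional, Tuple

# Matches a concrete verdict, but not the "compliant|non-compliant" template
# that appears in the prompt's output-format instructions.
VERDICT_RE = re.compile(
    r"FINAL ANSWER:\s*(non-compliant|compliant)\b(?!\|)", re.IGNORECASE
)
# Matches lines such as "- VIOLATED - A -> B: reason".
EDGE_RE = re.compile(
    r"^-\s*(VIOLATED|SATISFIED)\s*-\s*(.+?):\s*(.+)$", re.MULTILINE
)


def parse_verdict(text: str) -> Optional[str]:
    """Return the last verdict found in `text`, or None if there is none."""
    matches = VERDICT_RE.findall(text)
    return matches[-1].lower() if matches else None


def parse_edge_checks(completion: str) -> List[Tuple[str, str, str]]:
    """Return (status, edge, reason) triples; pass the completion text only."""
    return [(s, e.strip(), r.strip()) for s, e, r in EDGE_RE.findall(completion)]


sample = (
    "EDGE CHECKS:\n"
    "- SATISFIED - A -> B: B follows A\n"
    "- VIOLATED - B -> C: C ran before B\n"
    "FINAL ANSWER: non-compliant\n"
)
print(parse_verdict(sample))
print(parse_edge_checks(sample))
```

A `None` verdict is worth counting separately from a wrong verdict when scoring, since it signals a format failure.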
Training summary
- Base: `Qwen/Qwen2.5-1.5B-Instruct`
- LoRA r=32 alpha=64 on q/k/v/o/gate/up/down, dropout 0.0
- DPO beta=0.1, lr 5e-6, 2 epochs, batch_size 2 x grad_accum 8, max_length 1024, gradient_checkpointing on
- Training set: 2,510 flip pairs (one chosen / rejected pair per row) from the `train_registry` v0.4.0 corpus
- ~80 minutes on a single RTX A6000 (bf16)
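The hyperparameters above imply the following back-of-the-envelope quantities. This is a rough sketch: it assumes one row per flip pair (as stated), a single GPU, and the default LoRA scaling of alpha / r. The exact optimizer step count depends on how the trainer handles the final partial accumulation window.

```python
import math

# Values taken from the training summary.
LORA_R = 32
LORA_ALPHA = 64
PER_DEVICE_BATCH = 2
GRAD_ACCUM = 8
NUM_GPUS = 1          # single RTX A6000
TRAIN_ROWS = 2510     # one chosen / rejected pair per row
EPOCHS = 2

# Default LoRA scaling factor applied to the low-rank update.
lora_scale = LORA_ALPHA / LORA_R

# Number of preference pairs contributing to each optimizer step.
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM * NUM_GPUS

# Approximate optimizer steps, rounding the last partial batch up.
steps_per_epoch = math.ceil(TRAIN_ROWS / effective_batch)
total_steps = steps_per_epoch * EPOCHS

print(lora_scale, effective_batch, steps_per_epoch, total_steps)
```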
Limitations
- Research checkpoint, not a production classifier. Below the pre-registered GO bar.
- Only flip pairs were used for training. Anchor pairs are not in the DPO mix.
- Regime asymmetry. Free-form scores exceed forced-verdict scores; report the two regimes separately.
- Format sensitivity. Trained on the `EDGE CHECKS ... FINAL ANSWER` format above; deviation may degrade performance. Greedy (T=0) matches the reported numbers.
License
Adapter: Apache-2.0. Base model: under the Qwen2.5-1.5B-Instruct license.
Model provider
kennethp97
Model tree
Base
Qwen/Qwen2.5-1.5B-Instruct
Adapter
this model
Modalities
Input
Text
Output
Text