kennethp97

dpo-flip-1p5b

README

License: apache-2.0

What it does

The adapter improves the base model on the procedural-compliance task: given a procedure and a scenario, decide whether the scenario is compliant or non-compliant with the procedure, and produce structured reasoning before the verdict.

Each training preference pair is:

  • chosen -- an EDGE CHECKS ... FINAL ANSWER: completion whose reasoning matches this scenario and ends in the gold verdict;
  • rejected -- the partner half's reasoning (a different scenario in the same flip pair) ending in the opposite verdict.

So the model is optimised to prefer reasoning that matches the prompt's scenario over reasoning copied from a different scenario. Anchor pairs (both halves share a verdict) were not used for training; anchor accuracy is an eval-only metric.
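The pairing rule above can be sketched in a few lines. This is an illustration only: the `Half` schema and the `build_dpo_row` helper are hypothetical names, not taken from the training code, and the `prompt` / `chosen` / `rejected` keys are assumed because they are the field names commonly used for DPO datasets.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Half:
    """One half of a pair (hypothetical schema, for illustration only)."""
    prompt: str     # procedure + scenario prompt for this half
    reasoning: str  # EDGE CHECKS reasoning written for this scenario
    verdict: str    # gold verdict: "compliant" or "non-compliant"


def completion(half: Half) -> str:
    """Join reasoning and verdict in the EDGE CHECKS ... FINAL ANSWER shape."""
    return f"{half.reasoning}\nFINAL ANSWER: {half.verdict}"


def build_dpo_row(target: Half, partner: Half) -> Optional[dict]:
    """Build one preference row for `target`, or None for an anchor pair."""
    if target.verdict == partner.verdict:
        return None  # anchor pair: both halves share a verdict, eval-only
    return {
        "prompt": target.prompt,
        "chosen": completion(target),     # reasoning matches this scenario
        "rejected": completion(partner),  # partner's reasoning, opposite verdict
    }
```

Because the rejected completion is well-formed reasoning about the wrong scenario, the two completions differ in the verdict and in whether the reasoning fits the prompt.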

Headline eval (frozen 233-process held-out; 128 flip + 122 anchor pairs; greedy / T=0)

regime            flip rate   anchor acc   plain acc
forced-verdict    0.328       0.615        0.660
free-form         0.484       0.672        0.752
base ref (FF)     0.219       0.467        0.576

This recipe fixes the free-form collapse of the earlier content-free DPO arm (which scored 0.250 free-form flip, close to the base model's 0.219) by training on genuine reasoning. It improves over base in both regimes. It does not clear the pre-registered absolute GO bar (>=0.65 flip and >=0.75 anchor), so treat it as a research checkpoint, not a deployment-grade classifier.
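As a quick arithmetic check, the GO bar can be applied to the reported scores directly. The numbers below are copied from the headline table; the dictionary layout and the `clears_go_bar` helper are illustrative names, not part of the eval harness.

```python
# Scores copied from the headline eval table.
RESULTS = {
    "forced-verdict": {"flip": 0.328, "anchor": 0.615},
    "free-form": {"flip": 0.484, "anchor": 0.672},
    "base ref (FF)": {"flip": 0.219, "anchor": 0.467},
}

# Pre-registered GO bar: both thresholds must be met.
GO_FLIP, GO_ANCHOR = 0.65, 0.75


def clears_go_bar(scores: dict) -> bool:
    """True only if flip rate and anchor accuracy both meet the bar."""
    return scores["flip"] >= GO_FLIP and scores["anchor"] >= GO_ANCHOR


go_status = {regime: clears_go_bar(s) for regime, s in RESULTS.items()}

# Free-form gain over the free-form base reference.
ff, base = RESULTS["free-form"], RESULTS["base ref (FF)"]
flip_gain = round(ff["flip"] - base["flip"], 3)        # 0.265
anchor_gain = round(ff["anchor"] - base["anchor"], 3)  # 0.205

print(go_status, flip_gain, anchor_gain)
```

No regime clears the bar, although the free-form flip rate is more than double the base reference.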

How to use

python

import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

BASE = "Qwen/Qwen2.5-1.5B-Instruct"
ADAPTER = "kennethp97/dpo-flip-1p5b"

# Tokenizer comes from the base model; fall back to EOS if no pad token is set.
tok = AutoTokenizer.from_pretrained(BASE, use_fast=True)
tok.pad_token = tok.pad_token or tok.eos_token
tok.padding_side = "left"

# Load the base model in bf16, then attach the LoRA adapter on top of it.
base = AutoModelForCausalLM.from_pretrained(
    BASE, torch_dtype=torch.bfloat16, device_map="auto"
)
model = PeftModel.from_pretrained(base, ADAPTER)
model.eval()

# Replace <your procedure> and <your scenario>; keep the output format unchanged.
USER = (
    "You are a process-structure compliance checker.\n"
    "Check edge-level constraints before final judgment.\n\n"
    "Process:\n<your procedure>\n\n"
    "Scenario:\n<your scenario>\n\n"
    "Output format:\nEDGE CHECKS:\n- VIOLATED - [edge]: [reason]\n"
    "- SATISFIED - [edge]: [reason]\nFINAL ANSWER: compliant|non-compliant\n"
)
prompt = tok.apply_chat_template(
    [{"role": "user", "content": USER}],
    tokenize=False, add_generation_prompt=True,
)

# Greedy decoding (do_sample=False) matches the reported numbers.
out = model.generate(
    **tok(prompt, return_tensors="pt").to(model.device),
    max_new_tokens=1024, do_sample=False,
    pad_token_id=tok.eos_token_id,
)
# Prints the prompt followed by the generated completion.
print(tok.decode(out[0], skip_special_tokens=True))
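To turn the generated text into a label, the verdict has to be read off the FINAL ANSWER line. The sketch below is one way to do that; `extract_verdict` is a hypothetical helper, not part of the released code. Since the decoded output above includes the prompt, and the prompt itself contains the placeholder `FINAL ANSWER: compliant|non-compliant`, the pattern deliberately skips that placeholder.

```python
import re
from typing import Optional

# "non-compliant" is tried first so it is never read as "compliant".
# The (?!\|) lookahead skips the prompt's "compliant|non-compliant" placeholder.
_VERDICT = re.compile(
    r"FINAL ANSWER:\s*(non-compliant|compliant)\b(?!\|)", re.IGNORECASE
)


def extract_verdict(text: str) -> Optional[str]:
    """Return the last verdict found in `text`, or None if there is none."""
    matches = _VERDICT.findall(text)
    return matches[-1].lower() if matches else None
```

A `None` result means the completion never reached a verdict (for example, it hit `max_new_tokens` first), and is worth counting separately from a wrong verdict.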

For a worked side-by-side comparison against the base and against the companion SFT adapter (kennethp97/sft-arm-a-1p5b), see the combined eval notebook in the repository this adapter was released from.

Training summary

  • Base: Qwen/Qwen2.5-1.5B-Instruct
  • LoRA r=32 alpha=64 on q/k/v/o/gate/up/down, dropout 0.0
  • DPO beta=0.1, lr 5e-6, 2 epochs, batch_size 2 x grad_accum 8, max_length 1024, gradient_checkpointing on
  • Training set: 2,510 flip pairs (one chosen / rejected pair per row) from the train_registry v0.4.0 corpus
  • ~80 minutes on a single RTX A6000 (bf16)
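The batch settings above imply an effective batch size and an approximate number of optimizer steps. The arithmetic below is a rough sketch: it assumes one GPU (as stated) and that a final partial batch still counts as a step, so the exact step count reported by a given trainer may differ by one or two.

```python
import math

# Values taken from the training summary.
PAIRS = 2510          # preference rows in the training set
PER_DEVICE_BATCH = 2  # batch_size
GRAD_ACCUM = 8        # gradient accumulation steps
EPOCHS = 2
NUM_GPUS = 1          # single RTX A6000

# Rows contributing to each optimizer update.
effective_batch = PER_DEVICE_BATCH * GRAD_ACCUM * NUM_GPUS

# Assumption: the last partial batch of an epoch still triggers an update.
steps_per_epoch = math.ceil(PAIRS / effective_batch)
total_steps = steps_per_epoch * EPOCHS

print(effective_batch, steps_per_epoch, total_steps)  # 16 157 314
```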

Limitations

  • Research checkpoint, not a production classifier. Below the pre-registered GO bar.
  • Trained on flip pairs only. Anchor pairs were not in the DPO mix.
  • Regime asymmetry. Free-form scores higher than forced-verdict; report the two regimes separately.
  • Format sensitivity. Trained on the EDGE CHECKS ... FINAL ANSWER format above; deviation may degrade performance. Greedy (T=0) matches the reported numbers.
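Because of the format sensitivity noted above, it can help to flag completions that drift from the trained layout before scoring them. The checker below is a sketch; `format_problems` is a hypothetical helper, and how strictly to treat extra lines is an assumption rather than something the eval specifies.

```python
import re

# One edge-check bullet: "- VIOLATED - [edge]: [reason]" or the SATISFIED form.
_EDGE_LINE = re.compile(r"^- (VIOLATED|SATISFIED) - .+: .+$")
_FINAL = re.compile(r"^FINAL ANSWER: (compliant|non-compliant)$")


def format_problems(completion: str) -> list:
    """List deviations from the EDGE CHECKS ... FINAL ANSWER layout."""
    lines = [ln.strip() for ln in completion.strip().splitlines() if ln.strip()]
    problems = []
    if not lines or lines[0] != "EDGE CHECKS:":
        problems.append("missing 'EDGE CHECKS:' header")
    if not lines or not _FINAL.match(lines[-1]):
        problems.append("last line is not a FINAL ANSWER verdict")
    body = lines[1:-1]  # everything between the header and the verdict
    if not any(_EDGE_LINE.match(ln) for ln in body):
        problems.append("no edge-check lines")
    problems.extend(
        f"unexpected line: {ln!r}" for ln in body if not _EDGE_LINE.match(ln)
    )
    return problems
```

An empty list means the completion follows the layout; pass only the generated completion, not the prompt, since the prompt's own format description would be flagged.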

License

Adapter: Apache-2.0. Base model: under the Qwen2.5-1.5B-Instruct license.

Model provider

kennethp97

Model tree

Base

Qwen/Qwen2.5-1.5B-Instruct

Adapter

this model

Modalities

Input

Text

Output

Text
