ludocomito

qwen2.5-3b-tribe-dpo-lora

Deploy Dedicated

Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more

Container

Run this model inference with full control and performance in your environment.

Learn more

Get help setting up a custom Dedicated Endpoints.

Talk with our engineer to get a quote for reserved GPU instances with discounts.

README

License: apache-2.0

Training Summary

Base model: Qwen/Qwen2.5-3B-Instruct
Method: SFT warmup followed by DPO
Reward source: pooled TRIBE poetry-vs-control activation axis, adjusted for length and repetition
DPO pairs: 592 total
Train pairs: 535
Eval pairs: 57
Best checkpoint: checkpoint-50
Best eval loss: 0.6571338772773743
Train loss: 0.6417140785385581
Final saved adapter: models/dpo_lora_best

The DPO preference pairs were built from generated candidates scored by the TRIBE-derived reward. Higher-scoring candidates were used as chosen responses and lower-scoring candidates as rejected responses, with a minimum adjusted reward margin of 0.12.

Intended Use

This is an experimental research adapter for probing whether a brain-model-derived reward can steer a small instruction model toward poetic / affective text. It is not a general-purpose writing model and should be treated as a research artifact.

Example loading:

python
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import PeftModel

base_id = "Qwen/Qwen2.5-3B-Instruct"
adapter_id = "ludocomito/qwen2.5-3b-tribe-dpo-lora"

tokenizer = AutoTokenizer.from_pretrained(adapter_id)
model = AutoModelForCausalLM.from_pretrained(base_id, device_map="auto")
model = PeftModel.from_pretrained(model, adapter_id)

Included Artifacts

LoRA adapter weights and config
tokenizer files copied from the training output
DPO training summary
DPO dataset summary
step-level DPO metrics
probe generations sampled over training
selected TRIBE pooled-axis features used for scoring

Limitations

The reward is not a human preference model. It is a pooled TRIBE activation axis trained to separate poems from controls in the earlier validation experiments.
The eval split is small, with 57 preference pairs, so metrics should be read directionally.
The model may learn reward-specific texture rather than robust literary quality.
This adapter should be evaluated with fresh prompts and manual inspection before any broader use.

Local Experiment Context

The run used 1,024 scored generated candidates, balanced as 4 candidates for each of 256 prompts. The full 2,048 generated candidates were preserved locally, but only the balanced subset was scored by TRIBE for the final DPO run.

Model provider