ludocomito
qwen2.5-3b-tribe-dpo-lora
Run this model inference on single tenant GPU with unmatched speed and reliability at scale.
Run this model inference with full control and performance in your environment.
Get help setting up a custom Dedicated Endpoints.
Talk with our engineer to get a quote for reserved GPU instances with discounts.
README
License: apache-2.0Training Summary
- Base model:
Qwen/Qwen2.5-3B-Instruct - Method: SFT warmup followed by DPO
- Reward source: pooled TRIBE poetry-vs-control activation axis, adjusted for length and repetition
- DPO pairs: 592 total
- Train pairs: 535
- Eval pairs: 57
- Best checkpoint:
checkpoint-50 - Best eval loss:
0.6571338772773743 - Train loss:
0.6417140785385581 - Final saved adapter:
models/dpo_lora_best
The DPO preference pairs were built from generated candidates scored by the TRIBE-derived reward. Higher-scoring candidates were used as chosen responses and lower-scoring candidates as rejected responses, with a minimum adjusted reward margin of 0.12.
Intended Use
This is an experimental research adapter for probing whether a brain-model-derived reward can steer a small instruction model toward poetic / affective text. It is not a general-purpose writing model and should be treated as a research artifact.
Example loading:
python
from transformers import AutoModelForCausalLM, AutoTokenizerfrom peft import PeftModelbase_id = "Qwen/Qwen2.5-3B-Instruct"adapter_id = "ludocomito/qwen2.5-3b-tribe-dpo-lora"tokenizer = AutoTokenizer.from_pretrained(adapter_id)model = AutoModelForCausalLM.from_pretrained(base_id, device_map="auto")model = PeftModel.from_pretrained(model, adapter_id)
Included Artifacts
- LoRA adapter weights and config
- tokenizer files copied from the training output
- DPO training summary
- DPO dataset summary
- step-level DPO metrics
- probe generations sampled over training
- selected TRIBE pooled-axis features used for scoring
Limitations
- The reward is not a human preference model. It is a pooled TRIBE activation axis trained to separate poems from controls in the earlier validation experiments.
- The eval split is small, with 57 preference pairs, so metrics should be read directionally.
- The model may learn reward-specific texture rather than robust literary quality.
- This adapter should be evaluated with fresh prompts and manual inspection before any broader use.
Local Experiment Context
The run used 1,024 scored generated candidates, balanced as 4 candidates for each of 256 prompts. The full 2,048 generated candidates were preserved locally, but only the balanced subset was scored by TRIBE for the final DPO run.
Model provider
ludocomito
Model tree
Base
Qwen/Qwen2.5-3B-Instruct
Adapter
this model
Modalities
Input
Text
Output
Text
Pricing
Dedicated Endpoints
View detailsSupported Functionality
Model APIs
Dedicated Endpoints
Container
More information