JaydeepR/SmolLM-135M-neuraltxt-dpo-v1 API & Inference Endpoint

Training

Method	DPO (TRL + Unsloth)
beta	0.1
LoRA rank / alpha	32 / 32
Effective batch	128 (8 × grad_accum 16)
Max seq / prompt length	1024 / 768
Learning rate	2e-4, linear decay
Epochs	3 (2,757 steps)
Hardware	1× RTX 3090, ~6h17m
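
The batch and step figures in the table can be cross-checked with a little arithmetic. The sketch below assumes that each optimizer step consumes one effective batch and that the 2,757 steps are split evenly over the 3 epochs. The implied number of preference pairs per epoch is an estimate derived from those assumptions, not a figure reported in this card.

```python
# Cross-check of the training table (values copied from the table above).
per_device_batch = 8
grad_accum_steps = 16
total_steps = 2757
epochs = 3

# Effective batch = per-device batch x gradient accumulation (single GPU).
effective_batch = per_device_batch * grad_accum_steps

# Assumption: steps are split evenly across epochs.
steps_per_epoch = total_steps // epochs

# Assumption: one optimizer step consumes one full effective batch, so this
# is an upper-bound estimate of preference pairs seen per epoch.
approx_pairs_per_epoch = steps_per_epoch * effective_batch

print(effective_batch)         # 128
print(steps_per_epoch)         # 919
print(approx_pairs_per_epoch)  # 117632
```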

Evaluation

Preference metrics are computed on a held-out 2% split; diversity is measured on 100 sampled prompts (4 responses × 4 temperatures).

Metric	SFT baseline	This model (DPO)
Eval loss	—	0.457
Reward accuracy (held-out)	0.50 (chance)	0.72
Reward margin	—	1.65
Diversity — EAD	0.1173	0.1193
Diversity — SBERT	0.2263	0.2322
Diversity — Vendi	2.7327
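
For readers unfamiliar with the EAD row, the sketch below shows one common definition, assuming "EAD" here means Expectation-Adjusted Distinct: the number of distinct tokens divided by the number expected if the same amount of text were drawn uniformly from the vocabulary. The card does not say how its EAD was computed (tokenizer, n-gram order, pooling across responses), so the function name `ead` and its inputs are illustrative only.

```python
def ead(tokens, vocab_size):
    """Expectation-Adjusted Distinct for a flat list of tokens.

    Assumption: unigram-level EAD over one pooled token list; the card's
    actual evaluation setup is not specified.
    """
    total = len(tokens)
    if total == 0:
        return 0.0
    distinct = len(set(tokens))
    # Expected number of distinct tokens when drawing `total` tokens
    # uniformly at random from a vocabulary of size `vocab_size`.
    expected = vocab_size * (1.0 - ((vocab_size - 1) / vocab_size) ** total)
    return distinct / expected


# Hypothetical toy input: 4 tokens, 3 distinct, vocabulary of 4 symbols.
score = ead(["a", "b", "a", "c"], vocab_size=4)
print(score)
```

Unlike a plain distinct-token ratio, this normalization does not automatically shrink as the generated text gets longer, which makes scores more comparable across response lengths.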

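Takeaways: the model learned the preference (reward accuracy 0.50 → 0.72) while preserving output diversity. The diversity metrics reported for both models (EAD, SBERT) are essentially flat against the SFT baseline, so there is no sign of mode collapse. Training shows mild overfitting (train reward accuracy ~0.85 vs. eval 0.72), so 3 epochs looks like a reasonable stopping point; training longer would likely widen that gap rather than improve held-out accuracy.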
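
To make the reward accuracy and reward margin figures concrete, here is a minimal sketch of how these quantities are usually defined for DPO: the implicit reward of a response is beta times the log-probability gap between the policy and the reference model, accuracy is the fraction of pairs where the chosen response gets the higher reward, and the margin is the mean reward difference. The function `dpo_metrics` and the log-probability values are hypothetical; this is not the evaluation code used for this model.

```python
import math

BETA = 0.1  # beta from the training table


def dpo_metrics(pairs, beta=BETA):
    """Compute DPO loss, reward accuracy, and reward margin.

    pairs: list of tuples of summed sequence log-probs
        (policy_chosen, ref_chosen, policy_rejected, ref_rejected).
    """
    margins = []
    for pol_c, ref_c, pol_r, ref_r in pairs:
        reward_chosen = beta * (pol_c - ref_c)
        reward_rejected = beta * (pol_r - ref_r)
        margins.append(reward_chosen - reward_rejected)
    n = len(margins)
    accuracy = sum(m > 0 for m in margins) / n
    mean_margin = sum(margins) / n
    # DPO loss per pair is -log(sigmoid(margin)) = log(1 + exp(-margin)).
    loss = sum(math.log1p(math.exp(-m)) for m in margins) / n
    return {"loss": loss, "reward_accuracy": accuracy, "reward_margin": mean_margin}


# Hypothetical log-probs for two preference pairs (illustration only).
example = [
    (-10.0, -12.0, -15.0, -14.0),  # policy favors the chosen response
    (-20.0, -19.0, -18.0, -18.5),  # policy favors the rejected response
]
print(dpo_metrics(example))
```

When every margin is zero, as for a policy identical to its reference, the loss equals ln 2 ≈ 0.693, so an eval loss of 0.457 sits below that starting point.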

Intended use & limitations

Same scope as the SFT base: a structured ML-paper research assistant, not a general chatbot. Best used via the PaperResearcher task API from the SFT stage.
At 135M parameters the model is capacity-limited — it learns task shape and preference, not deep factual recall. DPO sharpens which response style is preferred; it does not add knowledge.
The reward/eval accuracy measures agreement with the LLM judge that created the preference data, so it is not a fully independent quality signal.

Reproduce

See dpo/DPO_SmolLM135M (run_dpo.sh, experiments.md, LEARNINGS.md).
