Training
Base model: Qwen/Qwen3-1.7B
Supervised fine-tuning stage:
- Data: filtered JSON-action outfit rollouts.
- Selection rule used by the local pipeline: score four rollouts per outfit
query, select the top rollout per query when its score is greater than 60,
then export selected raw traces for SFT.
- Max length: 16,384.
- Epochs: 3.
- Learning rate: 2e-5.
- Per-device train batch size: 1.
- Gradient accumulation steps: 16.
- Assistant-only loss: enabled.
- Full fine-tune, not LoRA.
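
The selection rule above can be sketched as follows. This is a minimal illustration, not the pipeline's actual code: the `Rollout` class, its field names, and the example scores are all hypothetical, and the threshold is applied as strictly greater than 60, as the rule is worded.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Rollout:
    # Hypothetical field names; the local pipeline's schema is not documented here.
    query_id: str
    trace: str
    score: float


def select_sft_traces(rollouts: List[Rollout], threshold: float = 60.0) -> List[Rollout]:
    """Keep the best-scoring rollout per query, only if it scores above the threshold."""
    best: Dict[str, Rollout] = {}
    for rollout in rollouts:
        current = best.get(rollout.query_id)
        if current is None or rollout.score > current.score:
            best[rollout.query_id] = rollout
    # Strict comparison: a top score of exactly 60 is not selected.
    return [r for r in best.values() if r.score > threshold]


# Made-up scores: four rollouts for each of two queries.
rollouts = [
    Rollout("q1", "trace-a", 55.0),
    Rollout("q1", "trace-b", 72.5),
    Rollout("q1", "trace-c", 61.0),
    Rollout("q1", "trace-d", 40.0),
    Rollout("q2", "trace-e", 60.0),
    Rollout("q2", "trace-f", 58.0),
    Rollout("q2", "trace-g", 12.0),
    Rollout("q2", "trace-h", 59.9),
]
selected = select_sft_traces(rollouts)
print([(r.query_id, r.trace) for r in selected])
```

In this example only `q1` contributes a trace. The best `q2` rollout scores exactly 60, so the whole query is dropped from the SFT set.
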
DPO stage:
- Starting checkpoint: the outfit SFT model.
- Data: 100 outfit preference-query training rows and 50 validation rows
in the local DeepOutfit pipeline.
- Max length: 8,192.
- Epochs: 1.
- Learning rate: 5e-7.
- DPO beta: 0.1.
- Per-device train batch size: 1.
- Gradient accumulation steps: 8.
- Full fine-tune, not LoRA.
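
To show what the beta value controls, here is a minimal sketch of the standard per-pair DPO loss in plain Python. It assumes the usual formulation with a frozen reference model (presumably the outfit SFT checkpoint, though the card does not say so). The function names and log-probability values are illustrative and are not taken from the training code.

```python
import math


def log_sigmoid(x: float) -> float:
    """Numerically stable log(sigmoid(x))."""
    if x >= 0:
        return -math.log1p(math.exp(-x))
    return x - math.log1p(math.exp(x))


def dpo_loss(
    policy_chosen_logp: float,
    policy_rejected_logp: float,
    ref_chosen_logp: float,
    ref_rejected_logp: float,
    beta: float = 0.1,
) -> float:
    """Standard DPO loss for one (chosen, rejected) pair of summed log-probabilities."""
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    # beta scales how strongly the policy is pushed away from the reference.
    margin = beta * (chosen_logratio - rejected_logratio)
    return -log_sigmoid(margin)


# When the policy equals the reference, the margin is 0 and the loss is log(2).
print(dpo_loss(-10.0, -10.0, -10.0, -10.0))
# When the policy favors the chosen response more than the reference does, the loss drops.
print(dpo_loss(-8.0, -12.0, -10.0, -10.0))
```

With a small beta such as 0.1, large shifts in log-probability produce only a modest margin, which keeps the policy close to the reference during the single DPO epoch.
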
Uploaded source directory:
outputs/models/qwen3-1.7b-json-action-outfit-sft-dpo-100q-cont1_20260527_230029
Evaluation
Evaluation was run with the local batch_eval_outfit_models.py harness on
50 OUTFIT500 queries, with one low-temperature rollout per query. Outfit quality
was scored by a GPT-4.1 judge. The comparison included zero-shot Qwen3-1.7B, this
SFT+DPO checkpoint, and a later GRPO/RL checkpoint.
| Model | Rows | Overall | Generalization | Entropy | Efficiency | Correctness | Quality | >=70 Quality | Missing Report | Broken Report | Tokens Median | Calls Median | Rollouts/min |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3-1.7B zero-shot | 50 | 57.03 | 100 | 0.0416 |
Judge breakdown for this SFT+DPO checkpoint:
| Metric | Value |
|---|---|
| Judged rows | 50 |
| Judge score > 70 | 4 / 50 |
| Mean judge score | 41.58 |
| Max judge score | 94.27 |
| Min judge score | 21.60 |
| Best average judge submetric | validity gate, 94.0 / 100 |
| Worst average judge submetric | explanation average, 40.8 / 100 |
| Highest failure flag | impractical to wear, 88% |
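
A breakdown like the one above can be derived from per-row judge outputs. The sketch below assumes each judged row carries a numeric score and a list of failure flags. The row format, flag names, and example values are hypothetical and do not come from the harness.

```python
from statistics import mean


def summarize_judgements(rows, threshold=70.0):
    """Aggregate per-row judge results into summary statistics."""
    scores = [row["score"] for row in rows]
    flag_counts = {}
    for row in rows:
        for flag in row["flags"]:
            flag_counts[flag] = flag_counts.get(flag, 0) + 1
    top_flag = max(flag_counts, key=flag_counts.get) if flag_counts else None
    return {
        "judged_rows": len(rows),
        "above_threshold": sum(1 for s in scores if s > threshold),
        "mean": round(mean(scores), 2),
        "max": max(scores),
        "min": min(scores),
        "top_flag": top_flag,
        # Share of judged rows on which the most frequent flag was raised.
        "top_flag_rate": 100 * flag_counts[top_flag] / len(rows) if top_flag else 0.0,
    }


# Made-up rows for illustration only.
rows = [
    {"score": 20.0, "flags": ["impractical to wear", "missing item"]},
    {"score": 35.0, "flags": ["impractical to wear"]},
    {"score": 50.0, "flags": ["impractical to wear"]},
    {"score": 75.0, "flags": ["impractical to wear"]},
    {"score": 90.0, "flags": []},
]
summary = summarize_judgements(rows)
print(summary)
```

If the flag rate is computed over judged rows in this way, the 88% reported above would correspond to 44 of the 50 rows.
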
Generalization probes:
| Probe | Result |
|---|---|
| Easy math | 8 / 10 |
| JSON formatting | 2 / 2 |
| Factual QA | 2 / 2 |
| Exact string following | 2 / 2 |
| Simple code-output QA | 2 / 2 |
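
The probe results can be combined into a single pass rate in more than one way. The card does not state how the harness aggregates them into the Generalization column, so the sketch below shows two common options as an assumption rather than the harness's actual formula.

```python
from statistics import mean

# (passed, total) per probe, copied from the table above.
probes = {
    "Easy math": (8, 10),
    "JSON formatting": (2, 2),
    "Factual QA": (2, 2),
    "Exact string following": (2, 2),
    "Simple code-output QA": (2, 2),
}

passed = sum(p for p, _ in probes.values())
total = sum(t for _, t in probes.values())

# Micro average: every probe item counts equally.
micro_pass_rate = 100 * passed / total
# Macro average: every probe category counts equally.
macro_pass_rate = 100 * mean(p / t for p, t in probes.values())

print(f"{passed} / {total} items passed")
print(f"micro: {micro_pass_rate:.1f}%  macro: {macro_pass_rate:.1f}%")
```

Either way, the only probe below a full score is easy math, which accounts for the entire gap.
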
Interpretation: compared with zero-shot Qwen3-1.7B, this checkpoint improves
GPT-4.1-judged outfit quality, but uses more search/tool calls and more tokens.
The dominant remaining failure mode is outfit practicality: the "impractical to
wear" flag was raised on 88% of judged rows.
Usage
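from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "flavianv/deepoutfit-qwen17b-sft-dpo"

# Load the tokenizer and model; dtype and device placement are chosen automatically.
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype="auto",
    device_map="auto",
)

messages = [
    {
        "role": "user",
        "content": "Find a men's backyard BBQ host outfit that is casual, practical, and intentional.",
    }
]

# Render the chat template to a prompt string that ends at the start of the assistant turn.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# do_sample=True ensures temperature and top_p take effect whatever the default generation config says.
outputs = model.generate(
    **inputs,
    max_new_tokens=1024,
    do_sample=True,
    temperature=0.2,
    top_p=0.9,
)
# The decoded text contains the prompt followed by the completion.
print(tokenizer.decode(outputs[0], skip_special_tokens=True))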
For the intended JSON-action setting, use the same tool schema and validation
loop as the training/evaluation harness. Standalone generations may reference
products or tool actions that are only meaningful when connected to the product
search tool.
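
As a rough idea of what such a validation step might look like, the sketch below parses one model output as a JSON action and checks it against a schema. The action names, fields, and error messages are hypothetical. The real tool schema lives in the training/evaluation harness and is not reproduced in this card.

```python
import json

# Hypothetical schema: action name -> required fields and their expected types.
ACTION_SCHEMA = {
    "search": {"query": str},
    "final_report": {"product_ids": list, "explanation": str},
}


def validate_action(raw: str):
    """Return (action, None) if the output is a valid action, else (None, error)."""
    try:
        action = json.loads(raw)
    except json.JSONDecodeError as exc:
        return None, f"invalid JSON: {exc.msg}"
    if not isinstance(action, dict):
        return None, "action must be a JSON object"
    name = action.get("action")
    if not isinstance(name, str) or name not in ACTION_SCHEMA:
        return None, f"unknown action: {name!r}"
    for field, expected_type in ACTION_SCHEMA[name].items():
        if not isinstance(action.get(field), expected_type):
            return None, f"field {field!r} must be of type {expected_type.__name__}"
    return action, None


ok_action, ok_error = validate_action('{"action": "search", "query": "linen shirt"}')
bad_action, bad_error = validate_action('{"action": "buy"}')
print(ok_action, ok_error)
print(bad_action, bad_error)
```

In a loop, a harness could execute a valid action against the product search tool and return an invalid action's error message to the model as the next turn. Whether the actual harness retries in this way is an assumption.
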
Limitations
- Experimental research checkpoint, not production validated.
- Optimized for outfit/product-report behavior, not broad assistant quality.
- Can produce incomplete, impractical, or unsupported product combinations.
- Product IDs and search behavior depend on the external catalog/tool harness.
- Easy-math probing (8 / 10 correct) shows some drift versus the zero-shot base model.
License
This checkpoint is released under Apache 2.0, matching the base
Qwen/Qwen3-1.7B license metadata.