License: apache-2.0
Training
Base model:
Qwen/Qwen3-1.7B
Supervised fine-tuning stage:
- Data: filtered JSON-action outfit rollouts.
- Selection rule used by the local pipeline: score four rollouts per outfit query, select the top rollout per query when its score is greater than 60, then export selected raw traces for SFT.
- Max length: 16,384.
- Epochs: 3.
- Learning rate: 2e-5.
- Per-device train batch size: 1.
- Gradient accumulation steps: 16.
- Assistant-only loss: enabled.
- Full fine-tune, not LoRA.
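The selection rule in the list above can be sketched as follows. This is a minimal illustration, not the local pipeline's code: the `Rollout` class, the `select_for_sft` function, and the toy scores are all hypothetical names and values introduced here. It assumes that "greater than 60" is a strict comparison.

```python
from dataclasses import dataclass


@dataclass
class Rollout:
    """Hypothetical container for one scored rollout."""
    query_id: str
    score: float
    trace: str  # raw trace text that would be exported for SFT


def select_for_sft(rollouts, threshold=60.0):
    """Keep the top-scoring rollout per query, but only if its score exceeds the threshold."""
    best = {}
    for rollout in rollouts:
        current = best.get(rollout.query_id)
        if current is None or rollout.score > current.score:
            best[rollout.query_id] = rollout
    return [r for r in best.values() if r.score > threshold]


# Toy data: four rollouts per query, as in the described rule.
rollouts = [
    Rollout("q1", 55.0, "trace-a"),
    Rollout("q1", 72.5, "trace-b"),
    Rollout("q1", 61.0, "trace-c"),
    Rollout("q1", 40.0, "trace-d"),
    Rollout("q2", 58.0, "trace-e"),
    Rollout("q2", 60.0, "trace-f"),  # exactly 60 is not greater than 60
    Rollout("q2", 12.0, "trace-g"),
    Rollout("q2", 33.0, "trace-h"),
]
selected = select_for_sft(rollouts)
print([(r.query_id, r.score) for r in selected])
```

Under this rule a query contributes at most one trace, and a query whose best rollout does not clear the threshold contributes none.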
DPO stage:
- Starting checkpoint: the outfit SFT model.
- Data: 100 outfit preference-query training rows and 50 validation rows in the local DeepOutfit pipeline.
- Max length: 8,192.
- Epochs: 1.
- Learning rate: 5e-7.
- DPO beta: 0.1.
- Per-device train batch size: 1.
- Gradient accumulation steps: 8.
- Full fine-tune, not LoRA.
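To make the role of the DPO beta concrete, here is a minimal sketch of the standard sigmoid DPO loss for a single preference pair. The model card does not say which DPO implementation or loss variant was used, so the formula choice is an assumption; the function name and the log-probability values are hypothetical.

```python
import math


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=0.1):
    """Standard sigmoid DPO loss for one (chosen, rejected) pair.

    Inputs are summed sequence log-probabilities under the policy being
    trained and under the frozen reference model (here, the SFT checkpoint).
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) written as log(1 + exp(-margin))
    return math.log1p(math.exp(-margin))


# At initialization the policy equals the reference, so the margin is zero.
loss_at_init = dpo_loss(-10.0, -12.0, -10.0, -12.0)

# After the policy shifts probability toward the chosen response, the loss drops.
loss_after_shift = dpo_loss(-8.0, -14.0, -10.0, -12.0)

print(loss_at_init, loss_after_shift)
```

A small beta such as 0.1 scales the log-ratio margin down, so the loss changes slowly as the policy moves away from the reference.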
Uploaded source directory:
outputs/models/qwen3-1.7b-json-action-outfit-sft-dpo-100q-cont1_20260527_230029
Evaluation
Evaluation was run with the local batch_eval_outfit_models.py harness on
50 OUTFIT500 queries, with one low-temperature rollout per query. Outfit quality
was scored by a GPT-4.1 judge. The comparison included zero-shot Qwen3-1.7B, this
SFT+DPO checkpoint, and a later GRPO/RL checkpoint.
| Model | Rows | Overall | Generalization | Entropy | Efficiency | Correctness | Quality | >=70 Quality | Missing Report | Broken Report | Tokens Median | Calls Median | Rollouts/min |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3-1.7B zero-shot | 50 | 57.03 | 100 | 0.0416 | 69.38 | 84.8 | 29.93 | 0% | 2% | 22% | 3,632 | 1 | 10.72 |
| DeepOutfit SFT+DPO | 50 | 55.36 | 90 | 0.0650 | 39.71 | 82.0 | 41.58 | 8% | 0% | 30% | 11,240 | 5 | 8.21 |
| DeepOutfit GRPO/RL checkpoint | 50 | 56.69 | 90 | 0.0641 | 40.79 | 88.4 | 39.10 | 6% | 2% | 16% | 11,280 | 5 | 7.05 |
Judge breakdown for this SFT+DPO checkpoint:
| Metric | Value |
|---|---|
| Judged rows | 50 |
| Judge score > 70 | 4 / 50 |
| Mean judge score | 41.58 |
| Max judge score | 94.27 |
| Min judge score | 21.60 |
| Best average judge submetric | validity gate, 94.0 / 100 |
| Worst average judge submetric | explanation average, 40.8 / 100 |
| Highest failure flag | impractical to wear, 88% |
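The summary rows in the table above are simple aggregates over per-query judge scores. A minimal sketch of how such a summary could be computed is shown below; the function name and the five toy scores are hypothetical and are not the actual evaluation data. It assumes the threshold is a strict "greater than 70", as in the table row.

```python
from statistics import mean


def summarize_judge_scores(scores, threshold=70.0):
    """Aggregate per-query judge scores into the summary fields used above."""
    above = sum(1 for s in scores if s > threshold)
    return {
        "judged_rows": len(scores),
        "above_threshold": above,
        "above_threshold_pct": 100.0 * above / len(scores),
        "mean": mean(scores),
        "max": max(scores),
        "min": min(scores),
    }


# Toy scores for illustration only.
toy_scores = [21.6, 35.0, 70.0, 94.27, 41.5]
summary = summarize_judge_scores(toy_scores)
print(summary)

# The reported 4 / 50 corresponds to the 8% shown in the comparison table.
reported_pct = 100.0 * 4 / 50
print(reported_pct)
```

Note that a score of exactly 70 counts under a ">= 70" rule but not under a "> 70" rule, so the two only agree when no score lands on the boundary.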
Generalization probes:
| Probe | Result |
|---|---|
| Easy math | 8 / 10 |
| JSON formatting | 2 / 2 |
| Factual QA | 2 / 2 |
| Exact string following | 2 / 2 |
| Simple code-output QA | 2 / 2 |
Interpretation: compared with zero-shot Qwen3-1.7B, this checkpoint improves GPT-4.1-judged outfit quality (41.58 vs. 29.93), but it uses more search/tool calls (median 5 vs. 1) and more tokens (median 11,240 vs. 3,632). The dominant remaining failure mode is outfit practicality.
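The trade-off can be quantified directly from the comparison table. The sketch below copies the relevant values from the zero-shot and SFT+DPO rows; the dictionary keys and variable names are illustrative only.

```python
# Values copied from the comparison table in the Evaluation section.
zero_shot = {"overall": 57.03, "quality": 29.93, "tokens_median": 3632, "calls_median": 1}
sft_dpo = {"overall": 55.36, "quality": 41.58, "tokens_median": 11240, "calls_median": 5}

quality_gain = sft_dpo["quality"] - zero_shot["quality"]
overall_change = sft_dpo["overall"] - zero_shot["overall"]
token_ratio = sft_dpo["tokens_median"] / zero_shot["tokens_median"]
call_ratio = sft_dpo["calls_median"] / zero_shot["calls_median"]

print(f"Quality change: {quality_gain:+.2f} points")
print(f"Overall change: {overall_change:+.2f} points")
print(f"Median tokens:  {token_ratio:.2f}x")
print(f"Median calls:   {call_ratio:.0f}x")
```

On these numbers, the quality gain comes with roughly three times the median token count, while the Overall score moves slightly in the other direction.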
Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "flavianv/deepoutfit-qwen17b-sft-dpo"

# Load the tokenizer and model; dtype and device placement are chosen automatically.
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype="auto",
    device_map="auto",
)

messages = [
    {
        "role": "user",
        "content": "Find a men's backyard BBQ host outfit that is casual, practical, and intentional.",
    }
]

# Render the conversation with the model's chat template.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# do_sample=True ensures the temperature and top_p settings take effect.
outputs = model.generate(
    **inputs, max_new_tokens=1024, do_sample=True, temperature=0.2, top_p=0.9
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
For the intended JSON-action setting, use the same tool schema and validation loop as the training/evaluation harness. Standalone generations may reference products or tool actions that are only meaningful when connected to the product search tool.
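The actual tool schema and validation loop live in the training/evaluation harness and are not reproduced in this card. As a rough illustration of what such a loop tends to look like, the sketch below parses a model reply as a JSON action, checks it against a made-up schema, and retries with the error as feedback. The action names, the `action`/`arguments` keys, and both function names are hypothetical assumptions, not the harness's real interface.

```python
import json

# Hypothetical action names; the real harness defines its own schema.
ALLOWED_ACTIONS = {"search", "report"}


def parse_action(raw_text):
    """Return (action, None) if raw_text is a valid action, else (None, error)."""
    try:
        action = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        return None, f"invalid JSON: {exc.msg}"
    if not isinstance(action, dict):
        return None, "action must be a JSON object"
    name = action.get("action")
    if name not in ALLOWED_ACTIONS:
        return None, f"unknown action: {name!r}"
    if not isinstance(action.get("arguments"), dict):
        return None, "missing 'arguments' object"
    return action, None


def run_with_validation(generate, max_retries=2):
    """Call generate(feedback) until it yields a valid action or retries run out."""
    feedback = None
    for _ in range(max_retries + 1):
        action, error = parse_action(generate(feedback))
        if action is not None:
            return action
        feedback = error  # passed back so the next attempt can correct itself
    return None


# Stand-in for the model: the first reply is malformed, the second is valid.
replies = iter([
    "not json",
    '{"action": "search", "arguments": {"query": "linen shirt"}}',
])
result = run_with_validation(lambda feedback: next(replies))
print(result)
```

In a real setup, `generate` would wrap the `model.generate` call and append the feedback message to the conversation before retrying.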
Limitations
- Experimental research checkpoint, not production validated.
- Optimized for outfit/product-report behavior, not broad assistant quality.
- Can produce incomplete, impractical, or unsupported product combinations.
- Product IDs and search behavior depend on the external catalog/tool harness.
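- Easy-math probing shows some drift versus the zero-shot base model: this checkpoint scores 8 / 10 on the probe, and its Generalization score is 90 versus 100 for zero-shot.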
License
This checkpoint is released under Apache 2.0, matching the base
Qwen/Qwen3-1.7B license metadata.
Model provider: flavianv
Model tree: base Qwen/Qwen3-1.7B; fine-tuned: this model
Modalities: text input, text output