License: apache-2.0
Provenance
Starting model:
Original base model:
Local source checkpoint: outputs/rl/deepoutfit_rlvr_qwen17b_20260528_150307_vllm_resume15/best_train_full
The best_train_full checkpoint was selected by the local persistent best-train checkpoint logic:
- Metric: reward.
- Best optimizer step: 110.
- Best train reward: 0.5532.
- Later training metrics continued past this point, but the persistent best-train checkpoint remained step 110.
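The selection rule above can be sketched as a small tracker that keeps only the step with the highest train reward. This is a minimal illustration, not the harness's actual code: the class name `BestTrainTracker` and the surrounding reward values are hypothetical, and only the step 110 / reward 0.5532 pair comes from this card.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BestTrainTracker:
    """Remembers the optimizer step with the highest train reward seen so far."""

    best_step: Optional[int] = None
    best_reward: float = float("-inf")

    def update(self, step: int, reward: float) -> bool:
        """Return True when `reward` is a new best, i.e. the checkpoint should be re-saved."""
        if reward > self.best_reward:
            self.best_step = step
            self.best_reward = reward
            return True
        return False


# Illustrative history: only (110, 0.5532) is taken from the card,
# the other steps and rewards are made up for the example.
history = [(100, 0.48), (105, 0.51), (110, 0.5532), (115, 0.52), (120, 0.50)]

tracker = BestTrainTracker()
saved_steps = []
for step, reward in history:
    if tracker.update(step, reward):
        # A real harness would overwrite best_train_full with this step's weights here.
        saved_steps.append(step)

print(tracker.best_step, tracker.best_reward)
```

Later steps with lower reward leave the tracker untouched, which matches the note that training continued while the best-train checkpoint stayed at step 110.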
GRPO Training Setup
Key settings from the local B200 DeepOutfit GRPO/RLVR config:
- Training queries: 200 outfit prompts from the local OUTFIT500-derived split.
- Eval queries: 20 held-out outfit eval prompts.
- Tool setup: search-only product tool, trained catalog source, top_k=5.
- Max tool rounds: 5.
- Reward backend: outfit_judge.
- Judge model: gpt-4.1-mini.
- Reward scale: 0.01.
- Invalid JSON/report reward: -1.0.
- Judge parallelism: 8.
- Learning rate: 5e-6.
- Per-device train batch size: 4.
- Gradient accumulation steps: 2.
- Number of generations per prompt: 4.
- Max completion length: 4096.
- KL beta: 0.0.
- vLLM rollout backend: colocated, GPU memory utilization 0.3.
- Save interval: every 5 optimizer steps.
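To make the reward settings concrete, here is a hedged sketch of how a judge score might become a GRPO training signal. It assumes the reward is simply the 0-100 judge score multiplied by the reward scale, which the card does not state explicitly, and it uses the textbook group-normalized advantage rather than the harness's own implementation. The function names and the example scores are hypothetical. With KL beta at 0.0, no KL penalty term would be added on top of these advantages.

```python
import statistics
from typing import List, Optional

REWARD_SCALE = 0.01     # "Reward scale" from the settings list
INVALID_REWARD = -1.0   # "Invalid JSON/report reward" from the settings list
NUM_GENERATIONS = 4     # "Number of generations per prompt" from the settings list


def shaped_reward(judge_score: Optional[float]) -> float:
    """Map a 0-100 judge score to a scalar reward; None marks an invalid JSON/report.

    Assumption: reward = judge_score * REWARD_SCALE. The card lists the scale
    and the penalty but not the exact formula.
    """
    if judge_score is None:
        return INVALID_REWARD
    return judge_score * REWARD_SCALE


def group_advantages(rewards: List[float], eps: float = 1e-4) -> List[float]:
    """Group-relative advantage: normalize rewards within one prompt's rollouts."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical judge scores for the 4 rollouts of one prompt; the third is invalid.
scores = [60.0, 40.0, None, 50.0]
rewards = [shaped_reward(s) for s in scores]
advantages = group_advantages(rewards)

print(rewards)
print([round(a, 3) for a in advantages])
```

Under this assumed mapping, a valid rollout earns a reward between 0 and 1, so the -1.0 penalty pushes invalid reports well below every valid rollout in the same group.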
Evaluation
The model was evaluated with the local batch_eval_outfit_models.py harness on
50 OUTFIT500 queries, with one low-temperature rollout per query. Outfit quality was
scored by a GPT-4.1 judge. The comparison included zero-shot Qwen3-1.7B, the SFT+DPO
checkpoint, and this GRPO checkpoint.
| Model | Rows | Overall | Generalization | Entropy | Efficiency | Correctness | Quality | >=70 Quality | Missing Report | Broken Report | Tokens Median | Calls Median | Rollouts/min |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3-1.7B zero-shot | 50 | 57.03 | 100 | 0.0416 | 69.38 | 84.8 | 29.93 | 0% | 2% | 22% | 3,632 | 1 | 10.72 |
| DeepOutfit SFT+DPO | 50 | 55.36 | 90 | 0.0650 | 39.71 | 82.0 | 41.58 | 8% | 0% | 30% | 11,240 | 5 | 8.21 |
| DeepOutfit GRPO best-train step 110 | 50 | 56.69 | 90 | 0.0641 | 40.79 | 88.4 | 39.10 | 6% | 2% | 16% | 11,280 | 5 | 7.05 |
Judge breakdown for this GRPO checkpoint:
| Metric | Value |
|---|---|
| Judged rows | 50 |
| Judge score > 70 | 3 / 50 |
| Mean judge score | 39.10 |
| Max judge score | 76.53 |
| Min judge score | 0.00 |
| Best average judge submetric | validity gate, 94.8 / 100 |
| Worst average judge submetric | explanation average, 38.1 / 100 |
| Highest failure flag | impractical to wear, 82% |
Generalization probes:
| Probe | Result |
|---|---|
| Easy math | 8 / 10 |
| JSON formatting | 2 / 2 |
| Factual QA | 2 / 2 |
| Exact string following | 2 / 2 |
| Simple code-output QA | 2 / 2 |
Interpretation: compared with the SFT+DPO seed, this GRPO checkpoint improved report correctness and reduced broken reports on the 50-query pass, but its GPT-4.1 judged outfit quality was slightly lower than SFT+DPO. Compared with zero-shot Qwen3-1.7B, it improved judged outfit quality but used more tool calls and more tokens.
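The comparisons in the interpretation can be checked directly against the evaluation table. The snippet below copies the relevant columns from the table and prints the differences. The dictionary keys and the `delta` helper are names chosen for this example.

```python
# Values copied from the evaluation table (50 queries per model).
rows = {
    "zero_shot": {"correctness": 84.8, "quality": 29.93, "broken_pct": 22,
                  "tokens_median": 3632, "calls_median": 1},
    "sft_dpo": {"correctness": 82.0, "quality": 41.58, "broken_pct": 30,
                "tokens_median": 11240, "calls_median": 5},
    "grpo": {"correctness": 88.4, "quality": 39.10, "broken_pct": 16,
             "tokens_median": 11280, "calls_median": 5},
}


def delta(metric: str, model: str, baseline: str) -> float:
    """Difference model - baseline for one metric, rounded to two decimals."""
    return round(rows[model][metric] - rows[baseline][metric], 2)


for metric in ("correctness", "broken_pct", "quality"):
    print(f"GRPO vs SFT+DPO, {metric}: {delta(metric, 'grpo', 'sft_dpo'):+}")

for metric in ("quality", "tokens_median", "calls_median"):
    print(f"GRPO vs zero-shot, {metric}: {delta(metric, 'grpo', 'zero_shot'):+}")
```

These are single-run differences on 50 queries, so small gaps such as the quality difference against SFT+DPO should be read with that sample size in mind.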
Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "flavianv/deepoutfit-qwen17b-grpo-best110"

# Load the tokenizer and model; dtype and device placement are chosen automatically.
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype="auto",
    device_map="auto",
)

messages = [
    {
        "role": "user",
        "content": "Find a men's lake weekend outfit for boating, lunch, and an evening fire pit.",
    }
]

# Render the chat template to text, then tokenize it for generation.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# do_sample=True ensures the temperature and top_p settings take effect.
outputs = model.generate(
    **inputs,
    max_new_tokens=1024,
    do_sample=True,
    temperature=0.2,
    top_p=0.9,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
For the intended JSON-action setting, use the same product-search tool schema, tool loop, and report validator as the training/evaluation harness. Standalone generations may refer to products or actions that only make sense inside that catalog-grounded tool environment.
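The real tool schema and report validator live in the training/evaluation harness and are not reproduced in this card. As a rough illustration of what such a loop involves, the sketch below uses an entirely hypothetical JSON action format (`{"action": "search", ...}` and `{"action": "report", ...}`), a stub search function, and a stub model. Only the limits of 5 tool rounds and top_k=5 come from the training setup.

```python
import json
from typing import Callable, Dict, List

MAX_TOOL_ROUNDS = 5  # from the training setup
TOP_K = 5            # from the training setup


def is_valid_report(action: Dict) -> bool:
    """Hypothetical validator: a report needs a non-empty list of product IDs."""
    items = action.get("items")
    return action.get("action") == "report" and isinstance(items, list) and len(items) > 0


def run_episode(generate: Callable[[List[Dict]], str],
                search: Callable[[str, int], List[Dict]],
                user_prompt: str) -> Dict:
    """Alternate model turns and search calls until a report arrives or rounds run out."""
    messages = [{"role": "user", "content": user_prompt}]
    searches = 0
    while True:
        raw = generate(messages)
        messages.append({"role": "assistant", "content": raw})
        try:
            action = json.loads(raw)
        except json.JSONDecodeError:
            return {"status": "broken", "report": None}
        if not isinstance(action, dict):
            return {"status": "broken", "report": None}
        if action.get("action") == "search":
            if searches >= MAX_TOOL_ROUNDS:
                # Tool budget exhausted without a final report.
                return {"status": "missing", "report": None}
            searches += 1
            results = search(str(action.get("query", "")), TOP_K)
            messages.append({"role": "tool", "content": json.dumps(results)})
            continue
        if is_valid_report(action):
            return {"status": "ok", "report": action}
        return {"status": "broken", "report": None}


def fake_search(query: str, top_k: int) -> List[Dict]:
    """Stub catalog search that returns at most top_k made-up products."""
    catalog = [{"id": f"sku-{i}", "title": f"{query} item {i}"} for i in range(10)]
    return catalog[:top_k]


def make_fake_model() -> Callable[[List[Dict]], str]:
    """Stub model: one search, then a report. A real run would call model.generate."""
    replies = iter([
        json.dumps({"action": "search", "query": "swim shorts"}),
        json.dumps({"action": "report", "items": ["sku-0", "sku-1"]}),
    ])
    return lambda messages: next(replies)


result = run_episode(make_fake_model(), fake_search, "Find a men's lake weekend outfit.")
print(result["status"], result["report"]["items"])
```

The three statuses loosely mirror the "Missing Report" and "Broken Report" columns in the evaluation table, though how the actual harness classifies those cases is an assumption here.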
Limitations
- Experimental research checkpoint, not production validated.
- Model-only Hub upload; optimizer/scheduler/RNG state is not included.
- Optimized for the local outfit-agent harness, not broad assistant quality.
- Can still produce incomplete, impractical, or unsupported outfits.
- Product IDs and search behavior depend on the external catalog/tool harness.
- Easy-math probing shows some drift versus the zero-shot base model.
License
This checkpoint is released under Apache 2.0, following the base Qwen3-1.7B license metadata.
Model provider
flavianv
Model tree
Base
flavianv/deepoutfit-qwen17b-sft-dpo
Fine-tuned
this model
Modalities
Input
Text
Output
Text