Provenance
Starting model:
Original base model:
Local source checkpoint: outputs/rl/deepoutfit_rlvr_qwen17b_20260528_150307_vllm_resume15/best_train_full
The best_train_full checkpoint was selected by the local persistent
best-train checkpoint logic:
- metric: reward
- best optimizer step: 110
- best train reward: 0.5532
- later training metrics continued past this point, but the persistent
best-train checkpoint remained step 110.
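The selection rule described above can be sketched as a small tracker that re-saves the checkpoint only when the train reward strictly improves. This is a minimal sketch, not the harness's actual code: the class name `BestTrainTracker` and the rewards at steps 100 and 120 are hypothetical, and only the step-110 reward of 0.5532 comes from this card.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BestTrainTracker:
    """Remembers the optimizer step with the highest train reward seen so far."""

    best_step: Optional[int] = None
    best_reward: float = float("-inf")

    def update(self, step: int, reward: float) -> bool:
        """Return True when `reward` is a new best and the checkpoint should be re-saved."""
        if reward > self.best_reward:
            self.best_step = step
            self.best_reward = reward
            return True
        return False


tracker = BestTrainTracker()
# Illustrative values only; just the step-110 reward is taken from this card.
for step, reward in [(100, 0.51), (110, 0.5532), (120, 0.54)]:
    if tracker.update(step, reward):
        pass  # a real trainer would overwrite best_train_full here
```

Under this rule, later steps with a lower reward leave the persisted checkpoint untouched, which matches the behaviour reported for step 110.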
GRPO Training Setup
Key settings from the local B200 DeepOutfit GRPO/RLVR config:
- Training queries: 200 outfit prompts from the local OUTFIT500-derived split.
- Eval queries: 20 held-out outfit eval prompts.
- Tool setup: search-only product tool, trained catalog source, top_k=5.
- Max tool rounds: 5.
- Reward backend: outfit_judge.
- Judge model: gpt-4.1-mini.
- Reward scale: 0.01.
- Invalid JSON/report reward: -1.0.
- Judge parallelism: 8.
- Learning rate: 5e-6.
- Per-device train batch size: 4.
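One way the reward scale and the invalid-report penalty listed above might combine is sketched below. It assumes the judge returns a score on a 0-100 scale and that a report is valid when it parses as a JSON object; neither assumption is confirmed by this card, and the function name `shaped_reward` is hypothetical.

```python
import json
from typing import Callable

REWARD_SCALE = 0.01           # "Reward scale" from the settings above
INVALID_REPORT_REWARD = -1.0  # "Invalid JSON/report reward" from the settings above


def shaped_reward(report_text: str, judge: Callable[[dict], float]) -> float:
    """Map a raw report string to a scalar training reward.

    Assumptions (not confirmed by the card): `judge` scores on a 0-100 scale,
    and validity means "parses as a JSON object". The real validator may be stricter.
    """
    try:
        report = json.loads(report_text)
    except json.JSONDecodeError:
        return INVALID_REPORT_REWARD
    if not isinstance(report, dict):
        return INVALID_REPORT_REWARD
    return REWARD_SCALE * judge(report)
```

If the judge scale really is 0-100, a train reward of 0.5532 would correspond to a mean judged score of roughly 55, though the card does not state this directly.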
Evaluation
The model was evaluated with the local batch_eval_outfit_models.py harness on
50 OUTFIT500 queries, with one low-temperature rollout per query. Outfit quality
was scored by a GPT-4.1 judge. The comparison included Qwen zero-shot, the
SFT+DPO checkpoint, and this GRPO checkpoint.
| Model | Rows | Overall | Generalization | Entropy | Efficiency | Correctness | Quality | >=70 Quality | Missing Report | Broken Report | Tokens Median | Calls Median | Rollouts/min |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3-1.7B zero-shot | 50 | 57.03 | 100 | 0.0416 |
Judge breakdown for this GRPO checkpoint:
| Metric | Value |
|---|---|
| Judged rows | 50 |
| Judge score > 70 | 3 / 50 |
| Mean judge score | 39.10 |
| Max judge score | 76.53 |
| Min judge score | 0.00 |
| Best average judge submetric | validity gate, 94.8 / 100 |
| Worst average judge submetric | explanation average, 38.1 / 100 |
| Highest failure flag | impractical to wear, 82% |
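The count, mean, max, and min rows in the breakdown above are simple aggregates over per-row judge scores. A minimal sketch of that aggregation follows, using a strict `> 70` threshold as in the table; the function name `summarize_judge_scores` and the sample scores are hypothetical and are not the evaluation data.

```python
from statistics import mean
from typing import Sequence


def summarize_judge_scores(scores: Sequence[float], threshold: float = 70.0) -> dict:
    """Aggregate per-row judge scores into the headline numbers of a breakdown table."""
    return {
        "judged_rows": len(scores),
        "above_threshold": sum(s > threshold for s in scores),  # strict ">"
        "mean": round(mean(scores), 2),
        "max": max(scores),
        "min": min(scores),
    }


# Made-up scores for illustration only.
summary = summarize_judge_scores([0.0, 40.0, 80.0])
```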
Generalization probes:
| Probe | Result |
|---|---|
| Easy math | 8 / 10 |
| JSON formatting | 2 / 2 |
| Factual QA | 2 / 2 |
| Exact string following | 2 / 2 |
| Simple code-output QA | 2 / 2 |
Interpretation: compared with the SFT+DPO seed, this GRPO checkpoint improved
report correctness and reduced broken reports on the 50-query pass, but its
GPT-4.1 judged outfit quality was slightly lower than SFT+DPO. Compared with
zero-shot Qwen3-1.7B, it improved judged outfit quality but used more tool calls
and more tokens.
Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "flavianv/deepoutfit-qwen17b-grpo-best110"

tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype="auto",
    device_map="auto",
)

messages = [
    {
        "role": "user",
        "content": "Find a men's lake weekend outfit for boating, lunch, and an evening fire pit.",
    }
]

# Render the conversation with the model's chat template and open the assistant turn.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# temperature and top_p only take effect when sampling is enabled.
outputs = model.generate(
    **inputs,
    max_new_tokens=1024,
    do_sample=True,
    temperature=0.2,
    top_p=0.9,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```
For the intended JSON-action setting, use the same product-search tool schema,
tool loop, and report validator as the training/evaluation harness. Standalone
generations may refer to products or actions that only make sense inside that
catalog-grounded tool environment.
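The shape of such a tool loop can be sketched as follows. This is an illustration under stated assumptions, not the training harness: the action schema (`{"action": "search", "query": ...}` and `{"action": "report", ...}`), the `"tool"` message role, and the names `run_outfit_agent`, `generate`, and `search` are all hypothetical. Only the limits of 5 tool rounds and top_k=5 come from the training setup.

```python
import json
from typing import Callable, Dict, List

MAX_TOOL_ROUNDS = 5  # "Max tool rounds" from the training setup
TOP_K = 5            # "top_k" from the training setup


def run_outfit_agent(
    query: str,
    generate: Callable[[List[Dict[str, str]]], str],
    search: Callable[[str, int], List[dict]],
) -> dict:
    """Drive a search-only tool loop until the model emits a non-search action.

    Hypothetical action schema (not confirmed by this card):
      {"action": "search", "query": "..."}  -> call the product-search tool
      {"action": "report", ...}             -> final answer, returned as-is
    """
    messages = [{"role": "user", "content": query}]
    rounds_used = 0
    while True:
        reply = generate(messages)
        messages.append({"role": "assistant", "content": reply})
        try:
            action = json.loads(reply)
        except json.JSONDecodeError:
            return {"action": "invalid", "raw": reply}
        if not isinstance(action, dict):
            return {"action": "invalid", "raw": reply}
        if action.get("action") != "search":
            return action  # typically the final report; validate it separately
        if rounds_used >= MAX_TOOL_ROUNDS:
            return {"action": "invalid", "raw": "tool round budget exhausted"}
        rounds_used += 1
        results = search(action.get("query", ""), TOP_K)
        # Assumption: tool results are fed back as a "tool" message containing JSON.
        messages.append({"role": "tool", "content": json.dumps(results)})
```

Here `generate` would wrap the chat-template and `model.generate` calls from the usage snippet, and `search` would wrap the catalog search tool; the returned report should still pass through the harness's report validator.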
Limitations
- Experimental research checkpoint, not production validated.
- Model-only Hub upload; optimizer/scheduler/RNG state is not included.
- Optimized for the local outfit-agent harness, not broad assistant quality.
- Can still produce incomplete, impractical, or unsupported outfits.
- Product IDs and search behavior depend on the external catalog/tool harness.
- Easy-math probing shows some drift versus the zero-shot base model.
License
This checkpoint is released under Apache 2.0, following the base Qwen3-1.7B
license metadata.