License: other
Model Details
- Base model: Qwen/Qwen3-1.7B
- Architecture: Qwen3ForCausalLM, 28 layers, hidden size 2048, 16 attention heads
- Training method: supervised fine-tuning on high-quality teacher traces
- Teacher / brain: GPT-4.1 in the DeepOutfit JSON-action harness
- Training data source: first 150 queries from the OUTFIT500 outfit-query set
- Filtering: traces with GPT-4.1 judge score >= 80
- Training examples: 133 filtered traces
- Tuple style: final-only assistant targets from JSON-action traces
- Primary task: generate tool/action JSON for outfit planning, product search, and final outfit report construction
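The filtering step listed above can be illustrated with a minimal sketch. The `Trace` record and its field names are hypothetical and are not taken from the DeepOutfit code; only the threshold of 80 comes from this card.

```python
from dataclasses import dataclass

MIN_JUDGE_SCORE = 80  # threshold stated in this model card


@dataclass
class Trace:
    # All field names are hypothetical, chosen for illustration only.
    query_id: int        # index of the query in the outfit-query set
    judge_score: float   # GPT-4.1 judge score for the whole trace
    final_action: str    # final assistant JSON action, kept as text


def filter_traces(traces, min_score=MIN_JUDGE_SCORE):
    """Keep only traces whose judge score is at least min_score."""
    return [t for t in traces if t.judge_score >= min_score]


traces = [
    Trace(0, 91.0, '{"action":"finalize_report"}'),
    Trace(1, 79.9, '{"action":"finalize_report"}'),
    Trace(2, 80.0, '{"action":"finalize_report"}'),
]
kept = filter_traces(traces)
print([t.query_id for t in kept])  # the trace scored 79.9 is dropped
```

The comparison is inclusive, matching the ">= 80" wording, so a trace scored exactly 80 is kept.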
Harness Protocol
A typical interaction is not a single natural-language prompt. The harness sends structured events, for example:
```json
{"event":"outfit_task_start","query":"men's Ibiza club outfit with linen, no sandals, stylish but practical"}
```
Then the harness asks for planning:
```json
{"event":"planning_instruction","nb_searches":5,"schema":{"look":"...","searches":["..."]}}
```
The model should first emit a plan:
```json
{"action":"todo_writer","look":"sleek Mediterranean nightlife: lightweight shirt, tailored trousers, clean sneaker or loafer, restrained accessory","searches":["men black linen button shirt club summer","men tailored black trousers nightlife","men white leather sneaker summer night","men lightweight blazer casual evening","men silver watch minimal"]}
```
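A caller may want to check the plan before acting on it. The sketch below is not the DeepOutfit validation logic; it assumes that `nb_searches` is the exact number of searches required and that `look` must be a non-empty string. The function name `validate_plan` is hypothetical.

```python
import json


def validate_plan(raw, nb_searches):
    """Parse a todo_writer plan and raise ValueError if it breaks the assumed schema."""
    try:
        plan = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"plan is not valid JSON: {exc}") from exc
    if not isinstance(plan, dict):
        raise ValueError("plan must be a JSON object")
    if plan.get("action") != "todo_writer":
        raise ValueError("expected action 'todo_writer'")
    look = plan.get("look")
    if not isinstance(look, str) or not look.strip():
        raise ValueError("'look' must be a non-empty string")
    searches = plan.get("searches")
    # Assumption: the harness wants exactly nb_searches queries.
    if not isinstance(searches, list) or len(searches) != nb_searches:
        raise ValueError(f"'searches' must be a list of {nb_searches} items")
    if not all(isinstance(s, str) and s.strip() for s in searches):
        raise ValueError("every search must be a non-empty string")
    return plan


example_plan = (
    '{"action":"todo_writer","look":"sleek Mediterranean nightlife",'
    '"searches":["men black linen button shirt club summer",'
    '"men tailored black trousers nightlife",'
    '"men white leather sneaker summer night",'
    '"men lightweight blazer casual evening",'
    '"men silver watch minimal"]}'
)
plan = validate_plan(example_plan, nb_searches=5)
print(len(plan["searches"]))
```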
The harness then provides `tool_result` / `candidate_delta` events from catalog search. The model continues with `search_products` actions and eventually emits a `finalize_report` action with five non-duplicated, gender-respecting products and a concise explanation.
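The constraints on the final report can be checked mechanically. This card does not document the `finalize_report` payload, so the field names `products`, `product_id`, and `explanation` below are assumptions made for illustration, and the gender check is left out because it depends on catalog metadata.

```python
def check_final_report(report, expected_count=5):
    """Return a list of problems found in a finalize_report action (empty if none)."""
    problems = []
    if report.get("action") != "finalize_report":
        problems.append("action is not finalize_report")
    products = report.get("products", [])  # hypothetical field name
    if len(products) != expected_count:
        problems.append(f"expected {expected_count} products, got {len(products)}")
    ids = [p.get("product_id") for p in products]  # hypothetical field name
    if len(set(ids)) != len(ids):
        problems.append("duplicate product ids")
    if not str(report.get("explanation", "")).strip():  # hypothetical field name
        problems.append("missing explanation")
    return problems


good_report = {
    "action": "finalize_report",
    "products": [{"product_id": f"p{i}"} for i in range(5)],
    "explanation": "Linen shirt with tailored trousers and clean sneakers.",
}
bad_report = dict(good_report, products=[{"product_id": "p0"}] * 5)

print(check_final_report(good_report))
print(check_final_report(bad_report))
```

Returning a list of problems rather than raising on the first one makes it easier to log every failure mode of a rollout at once.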
Evaluation Snapshot
The checkpoint was evaluated on a 50-query holdout taken from the end of the OUTFIT500 query set, alongside zero-shot Qwen3 1.7B and a previous SFT+DPO checkpoint. Internal metrics were computed with the DeepOutfit batch-eval harness and a GPT-4.1 judge.
| Model | Overall | Quality | >=70 Quality | Correctness | Missing Report | Broken Report | Rollouts/min |
|---|---|---|---|---|---|---|---|
| Qwen3 1.7B zero-shot | 65.84 | 59.36 | 32% | 96.4 | 0 | 6 | 4.787 |
| Previous DeepOutfit SFT+DPO | 55.77 | 43.73 | 10% | 86.8 | 0 | 22 | 4.229 |
| This model | 73.51 | 72.51 | 56% | 100.0 | 0 | 0 | 4.996 |
Additional probes from the same eval run:
| Probe | Score |
|---|---|
| Easy math generalization | 10 / 10 |
| Collapse probe suite | 100 / 100 |
In the same comparison, this model improved quality by +13.15 absolute points over zero-shot Qwen3 1.7B, a relative gain of 22.15% on the internal judge-quality metric.
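The two figures in this comparison follow directly from the Quality column of the table. A short check, using only numbers reported above:

```python
zero_shot_quality = 59.36   # Qwen3 1.7B zero-shot, Quality column
this_model_quality = 72.51  # this model, Quality column

absolute_gain = this_model_quality - zero_shot_quality
relative_gain_pct = 100 * absolute_gain / zero_shot_quality

print(f"absolute gain: {absolute_gain:.2f} points")
print(f"relative gain: {relative_gain_pct:.2f}%")
```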
Strengths
- Better JSON-action reliability than the previous SFT+DPO checkpoint in the current harness.
- Stronger final outfit quality on the 50-query OUTFIT500 holdout.
- Preserves basic generalization in small math and collapse probes.
- Learns the new planning-first behavior: choose a coherent look, decompose it into searches, then use catalog candidates in the final outfit.
Known Limitations
- The model is tied to the DeepOutfit harness and catalog semantics. It should be run with the same structured events and validation logic used during training.
- Product IDs, search results, and candidate metadata are catalog-specific.
- The reported metrics are internal GPT-4.1 judge metrics, not a public benchmark.
- The model can still select duplicate roles or weak product matches when the catalog search results are poor.
- License and redistribution terms should be checked against the base model and any private training-data constraints before production use.
Recommended Decoding
For deterministic or production-like evaluation, use a low temperature. For RL exploration, moderate sampling such as temperature 0.6 with multiple rollouts was used in downstream experiments.
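One way to keep the two regimes separate is a small helper that returns decoding settings by mode. The sketch below is an assumption, not the settings used by the authors: only the temperature of 0.6 comes from this card. Greedy decoding stands in for "low temperature", the rollout count of 4 is a placeholder, and the key names follow the parameter names of the Hugging Face `transformers` `generate` method.

```python
def decoding_kwargs(mode):
    """Return decoding settings for 'eval' (deterministic) or 'rl' (exploration)."""
    if mode == "eval":
        # Assumption: greedy decoding as the deterministic limit of low temperature.
        return {"do_sample": False, "num_return_sequences": 1}
    if mode == "rl":
        # Temperature 0.6 is from the card; 4 rollouts is a placeholder value.
        return {"do_sample": True, "temperature": 0.6, "num_return_sequences": 4}
    raise ValueError(f"unknown mode: {mode!r}")


print(decoding_kwargs("eval"))
print(decoding_kwargs("rl"))
```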
Source Checkpoint
This upload was created from the B200 checkpoint directory:
```text
/home/criteo/reco-rl-json-action/outputs/models/qwen3-1.7b-json-action-outfit-gpt41-150-ge80-sft_20260530_152442_finalonly_20260530_152848
```
Citation
If you use this model in the DeepOutfit experiments, cite it as a GPT-4.1 distilled Qwen3 1.7B JSON-action SFT checkpoint trained on high-quality OUTFIT500 traces.
Model provider: flavianv
Model tree: base Qwen/Qwen3-1.7B, fine-tuned into this model
Modalities: text input, text output