Model Details
- Hub repo: flavianv/deepoutfit-qwen17b-gpt41-150-sft80
- Local eval alias: sft80_from150
- Base model: Qwen/Qwen3-1.7B
- Architecture: Qwen3ForCausalLM
- Training method: supervised fine-tuning
- Teacher source: GPT-4.1 JSON-action traces from the DeepOutfit harness
- Training query source: first 150 prompts from the local OUTFIT500-derived outfit set
- Filtering: GPT-4.1-judged trace quality >=80
- Training rows: 133 filtered traces
- Target style: final-only assistant targets from JSON-action traces
- Source checkpoint path: /home/criteo/reco-rl-json-action/outputs/models/qwen3-1.7b-json-action-outfit-gpt41-150-ge80-sft_20260530_152442_finalonly_20260530_152848
Selected training configuration recovered from the saved checkpoint:
- bf16 training
- batch size 1 per device
- gradient accumulation 16
- learning rate 5e-6
- linear scheduler
- max sequence length 16384
- assistant-only loss
- optimizer adamw_torch_fused
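The recovered settings can be written as a plain config mapping. The sketch below is an illustration under assumptions, not the original training script. Key names follow Hugging Face TrainingArguments where one exists (bf16, per_device_train_batch_size, gradient_accumulation_steps, learning_rate, lr_scheduler_type, optim). The names max_length and assistant_only_loss are guesses whose spelling depends on the trainer version. The device count is not recorded in this card, so num_devices is a hypothetical parameter.

```python
import math

# Settings listed above; key names are assumptions (see the lead-in).
TRAIN_CONFIG = {
    "bf16": True,
    "per_device_train_batch_size": 1,
    "gradient_accumulation_steps": 16,
    "learning_rate": 5e-6,
    "lr_scheduler_type": "linear",
    "max_length": 16384,          # assumed key name
    "assistant_only_loss": True,  # assumed key name
    "optim": "adamw_torch_fused",
}

TRAIN_ROWS = 133  # filtered traces, from this card


def effective_batch_size(config: dict, num_devices: int = 1) -> int:
    """Number of sequences contributing to one optimizer step."""
    return (
        config["per_device_train_batch_size"]
        * config["gradient_accumulation_steps"]
        * num_devices
    )


def optimizer_steps_per_epoch(rows: int, config: dict, num_devices: int = 1) -> int:
    """Optimizer steps per pass over the data, counting a partial last batch."""
    return math.ceil(rows / effective_batch_size(config, num_devices))
```

On a single device this works out to 16 sequences per optimizer step, so about 9 steps per pass over the 133 traces. The number of epochs is not stated in this card.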
Harness Dependency
This model's behavior is tied to the DeepOutfit harness. The harness is part of
the task definition, not just an evaluation wrapper.
Important harness pieces:
- structured event messages rather than a single natural-language user prompt
- planning enabled before catalog search
- tool names and action schema such as todo_writer, search_products, and finalize_report
- catalog-backed product search over the Clothing catalog
- top_k=5 product search results
- maximum 5 tool calls
- final report validator requiring valid JSON, exactly 5 items, unique product IDs, and product IDs sourced from tool results
- candidate accumulation and finalization logic in the local DeepOutfit batch-eval harness
The model expects the harness to provide messages like task-start events,
planning instructions, tool results, and candidate updates. It is not expected
to solve the complete outfit task from a naked chat prompt.
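The final report checks listed above (valid JSON, exactly 5 items, unique product IDs, IDs sourced from tool results) can be sketched as a small validator. This is a minimal illustration, not the harness's actual validator: the function name validate_final_report, the seen_product_ids argument, and the error messages are hypothetical, and the report shape is taken from the finalize_report sketch in this card.

```python
import json

TARGET_K = 5  # "exactly 5 items" from the validator description


def validate_final_report(raw: str, seen_product_ids: set) -> list:
    """Return a list of problems; an empty list means the report passes."""
    try:
        report = json.loads(raw)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]
    if not isinstance(report, dict):
        return ["report is not a JSON object"]

    results = report.get("results")
    if not isinstance(results, list):
        return ["missing or non-list 'results'"]

    problems = []
    if len(results) != TARGET_K:
        problems.append(f"expected {TARGET_K} items, got {len(results)}")

    ids = []
    for item in results:
        pid = item.get("product_id") if isinstance(item, dict) else None
        if isinstance(pid, str):
            ids.append(pid)
        else:
            problems.append("item without a string product_id")

    if len(set(ids)) != len(ids):
        problems.append("duplicate product IDs")

    # IDs must come from earlier tool results, not from the model's memory.
    unknown = sorted(set(ids) - seen_product_ids)
    if unknown:
        problems.append(f"product IDs not found in tool results: {unknown}")
    return problems
```

In this sketch the caller would collect seen_product_ids from every tool-result event during the episode and pass the set in at finalization time.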
Prompt And Judge Caveat
Results for this model are not comparable to older DeepOutfit reports unless
the full harness is held fixed.
Across the local experiments, several pieces changed:
- the system prompt / task-start event wording
- the planning instruction and JSON schema
- the toolset and tool result presentation
- candidate accumulation and finalization behavior
- maximum context / completion lengths
- decoding settings
- the judge model and judge prompt
- the quality rubric and failure flags
- error handling for missing reports, broken reports, invalid JSON, and missing API keys
In particular, newer evaluations use a GPT-4.1 outfit judge with an updated
rubric. Earlier RL/GRPO experiments used different reward prompts and sometimes
GPT-4.1-mini. Treat the reported numbers as internal harness metrics, not as a
public benchmark.
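One way to check that two runs share a harness is to fingerprint the settings that affect scores and compare the hashes. The sketch below is a hypothetical helper, not part of the DeepOutfit harness; the function names and the setting keys used in the usage note are illustrative assumptions.

```python
import hashlib
import json


def harness_fingerprint(settings: dict) -> str:
    """Stable short hash of harness settings; key order does not matter."""
    canonical = json.dumps(settings, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:12]


def comparable(run_a: dict, run_b: dict) -> bool:
    """Two runs are treated as comparable only if every setting matches."""
    return harness_fingerprint(run_a) == harness_fingerprint(run_b)
```

A settings dict could hold entries such as judge_model, judge_temperature, max_tool_calls, top_k, and hashes of the system prompt and judge prompt texts. Any change to one of them changes the fingerprint, which flags the two runs as not directly comparable.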
JSON-Action Protocol Sketch
The harness starts with structured task information, for example:
{
  "event": "outfit_task_start",
  "task_type": "outfit",
  "user_query": "men's Ibiza nightlife outfit for clubbing and rooftop drinks, stylish breathable and comfortable, not costume-like",
  "target_k": 5,
  "max_tool_calls": 5
}
The model should produce a planning action:
{
  "action": "todo_writer",
  "look": "stylish breathable men's Ibiza nightlife outfit with polished warm-weather separates",
  "searches": [
    "men breathable nightlife shirt",
    "men tailored lightweight trousers",
    "men stylish loafers",
    "men lightweight evening jacket",
    "men minimalist watch"
  ]
}
The harness then executes catalog searches and returns tool-result events. The
model eventually emits:
{
  "action": "finalize_report",
  "results": [
    {
      "rank": 1,
      "product_id": "...",
      "category": "Clothing",
      "reasoning": "..."
    }
  ]
}
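On the harness side, each model turn has to be parsed and routed by its action field. The following is a minimal sketch of that step, assuming the model emits one JSON object per turn. The function name parse_action and the required-key table are hypothetical; the keys for todo_writer and finalize_report are taken from the two examples above, and no keys are assumed for search_products because its fields are not shown in this card.

```python
import json

KNOWN_ACTIONS = {"todo_writer", "search_products", "finalize_report"}

# Required keys per action, inferred from the examples in this card.
REQUIRED_KEYS = {
    "todo_writer": {"look", "searches"},
    "finalize_report": {"results"},
}


def parse_action(raw: str) -> dict:
    """Parse one model turn into an action dict, or raise ValueError."""
    try:
        action = json.loads(raw)
    except json.JSONDecodeError as exc:
        raise ValueError(f"invalid JSON: {exc.msg}") from exc
    if not isinstance(action, dict):
        raise ValueError("action must be a JSON object")

    name = action.get("action")
    if name not in KNOWN_ACTIONS:
        raise ValueError(f"unknown action: {name!r}")

    missing = REQUIRED_KEYS.get(name, set()) - action.keys()
    if missing:
        raise ValueError(f"{name} is missing keys: {sorted(missing)}")
    return action
```

A harness loop could call parse_action on every completion and count a ValueError as a broken turn, which roughly matches the invalid-JSON and broken-report failure flags mentioned earlier.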
Evaluation
Latest Low-Temperature First-50 Holdout
This is the most recent first-50 holdout result for sft80_from150 from the
local batch_eval_outfit_models.py harness.
Settings:
- query file: queries/Clothing/OUTFIT500.json
- query limit: 50
- rollouts per query: 1
- category: Clothing
- task type: outfit
- planning: on
- max tool calls: 5
- top-k: 5
- generation temperature: 0.2
- generation top-p: 0.9
- judge model: gpt-4.1
- judge temperature: 0
Result:
| Metric | Value |
|---|---|
| Rows | 50 |
| Overall score mean | 72.5391 |
| Quality score mean | 71.6867 |
| Quality score median | 71.3000 |
| Quality >=70 | 50% |
| Quality >=75 | 46% |
| Quality >=85 | 32% |
| Correctness score mean | |
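The threshold rows are shares of the 50 scored rows, so each 2% corresponds to one query (for example, 32% is 16 of 50). A sketch of how such a summary can be computed from per-query judge scores follows. The function name summarize_quality is hypothetical, the toy scores are made up for illustration, and how the harness scores missing or broken reports is not stated here.

```python
from statistics import mean, median


def summarize_quality(scores, thresholds=(70, 75, 85)):
    """Mean, median, and share of rows at or above each threshold."""
    summary = {
        "rows": len(scores),
        "mean": mean(scores),
        "median": median(scores),
    }
    for t in thresholds:
        summary[f"share_ge_{t}"] = sum(s >= t for s in scores) / len(scores)
    return summary


# Made-up values for illustration only; these are not eval results.
toy_scores = [60.0, 70.0, 80.0, 90.0]
toy_summary = summarize_quality(toy_scores)
```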
Same-run comparison:
| Model | Training source | Overall | Quality | Q >=70 | Q >=85 | Correctness | Broken |
|---|---|---|---|---|---|---|---|
| sft80_from150 | GPT-4.1 SFT, first 150, score >=80 | 72.54 | 71.69 | 50% | 32% | 97.6 | 4% |
| onpolicy_sft_85 | on-policy SFT score >=85 | | | | | | |
Older Promptfix Eval Snapshot
The previous card reported a different 50-query promptfix eval. That run used a
different harness snapshot and should be treated separately.
| Model | Overall | Quality | >=70 Quality | Correctness | Missing Report | Broken Report | Rollouts/min |
|---|---|---|---|---|---|---|---|
| Qwen3 1.7B zero-shot | 65.84 | 59.36 | 32% | 96.4 | 0 | 6 | 4.787 |
| Previous DeepOutfit SFT+DPO | 55.77 | 43.73 | 10% | | | | |
Additional probes from that older run:
| Probe | Score |
|---|---|
| Easy math generalization | 10 / 10 |
| Collapse probe suite | 100 / 100 |
Strengths
- Strong JSON-action validity in the DeepOutfit harness.
- Good outfit quality relative to other local SFT/RL variants tested so far.
- Uses tool results rather than inventing product IDs in the evaluated harness.
- Preserves basic non-outfit behavior on small local sanity probes.
Limitations
- Strongly coupled to the DeepOutfit event protocol and catalog.
- Product IDs and product metadata are catalog-specific.
- Metrics are internal GPT-4.1 judge metrics, not human ratings.
- Prompt, tool, and judge changes can move scores materially.
- The model may still produce weak outfits when search results are noisy or role coverage is ambiguous.
- This checkpoint should not be used as evidence that unrelated RL or SDPO variants improved quality; those require harness-matched comparisons.
Recommended Inference Settings
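For low-variance evaluation that is comparable with the judged results above,
use low-temperature decoding similar to the latest first-50 run. Sampling at
temperature 0.2 is not strictly deterministic, so small run-to-run differences
are possible:
- temperature 0.2
- top-p 0.9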
- one rollout per query
The saved generation_config.json contains a more exploratory setting
(temperature=0.6, top_p=0.95, top_k=20) used for sampling-style
experiments. Use the lower-temperature settings for judge comparisons.
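The two decoding setups can be kept side by side as plain keyword dictionaries. The sketch below assumes generation goes through Hugging Face generate, where explicit keyword arguments override the saved generation config and unspecified ones fall back to it. Whether the reported runs used that path or another inference engine is not stated in this card, and the helper name resolve_generation_settings is hypothetical.

```python
# Low-temperature settings from the latest first-50 run.
EVAL_GENERATION_KWARGS = {
    "do_sample": True,  # temperature and top_p only apply when sampling
    "temperature": 0.2,
    "top_p": 0.9,
}

# Values reported for the saved generation_config.json.
SAVED_GENERATION_CONFIG = {"temperature": 0.6, "top_p": 0.95, "top_k": 20}


def resolve_generation_settings(saved: dict, overrides: dict) -> dict:
    """Overrides win; anything not overridden falls back to the saved config."""
    return {**saved, **overrides}


resolved = resolve_generation_settings(SAVED_GENERATION_CONFIG, EVAL_GENERATION_KWARGS)
```

With a loaded model the overrides would be passed as model.generate(**inputs, **EVAL_GENERATION_KWARGS). Note that top_k is not overridden here, so under this assumption the saved top_k=20 would still apply; set it explicitly if the comparison run used a different value.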
Minimal Loading Example
This only loads the model. It does not recreate the DeepOutfit harness.
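from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "flavianv/deepoutfit-qwen17b-gpt41-150-sft80"

# Tokenizer and weights are downloaded from the Hub on first use.
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype="auto",  # use the dtype stored in the checkpoint config
    device_map="auto",   # requires the accelerate package
)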
To reproduce reported behavior, run it through the local DeepOutfit JSON-action
harness with the same tool schema, task-start events, product catalog, and judge
configuration.
Intended Use
Research checkpoint for agentic tool-use outfit recommendation. It is useful
for comparing SFT/RL/SDPO variants under a fixed harness. It is not a consumer
styling service and should not be treated as a general-purpose fashion advisor.