flavianv

deepoutfit-qwen17b-gpt41-150-sft80

Deploy Dedicated

Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more

Container

Run this model inference with full control and performance in your environment.

Learn more

Get help setting up a custom Dedicated Endpoints.

Talk with our engineer to get a quote for reserved GPU instances with discounts.

Model Details

Hub repo: flavianv/deepoutfit-qwen17b-gpt41-150-sft80
Local eval alias: sft80_from150
Base model: Qwen/Qwen3-1.7B
Architecture: Qwen3ForCausalLM
Training method: supervised fine-tuning
Teacher source: GPT-4.1 JSON-action traces from the DeepOutfit harness
Training query source: first 150 prompts from the local OUTFIT500-derived outfit set
Filtering: GPT-4.1 judged trace quality >=80
Training rows: 133 filtered traces
Target style: final-only assistant targets from JSON-action traces
Source checkpoint path: /home/criteo/reco-rl-json-action/outputs/models/qwen3-1.7b-json-action-outfit-gpt41-150-ge80-sft_20260530_152442_finalonly_20260530_152848

Selected training configuration recovered from the saved checkpoint:

bf16 training
batch size 1 per device
gradient accumulation 16
learning rate 5e-6
linear scheduler
max sequence length 16384
assistant-only loss
optimizer adamw_torch_fused

Harness Dependency

This model's behavior is tied to the DeepOutfit harness. The harness is part of the task definition, not just an evaluation wrapper.

Important harness pieces:

structured event messages rather than a single natural-language user prompt
planning enabled before catalog search
tool names and action schema such as todo_writer, search_products, and finalize_report
catalog-backed product search over the Clothing catalog
top_k=5 product search results
maximum 5 tool calls
final report validator requiring valid JSON, exactly 5 items, unique product IDs, and product IDs sourced from tool results
candidate accumulation and finalization logic in the local DeepOutfit batch-eval harness

The model expects the harness to provide messages like task-start events, planning instructions, tool results, and candidate updates. It is not expected to solve the complete outfit task from a naked chat prompt.

Prompt And Judge Caveat

Results for this model are not comparable to older DeepOutfit reports unless the full harness is held fixed.

Across the local experiments, several pieces changed:

the system prompt / task-start event wording
the planning instruction and JSON schema
the toolset and tool result presentation
candidate accumulation and finalization behavior
maximum context / completion lengths
decoding settings
the judge model and judge prompt
the quality rubric and failure flags
error handling for missing reports, broken reports, invalid JSON, and missing API keys

In particular, newer evaluations use a GPT-4.1 outfit judge with an updated rubric. Earlier RL/GRPO experiments used different reward prompts and sometimes GPT-4.1-mini. Treat the reported numbers as internal harness metrics, not as a public benchmark.

JSON-Action Protocol Sketch

The harness starts with structured task information, for example:

json
{
  "event": "outfit_task_start",
  "task_type": "outfit",
  "user_query": "men's Ibiza nightlife outfit for clubbing and rooftop drinks, stylish breathable and comfortable, not costume-like",
  "target_k": 5,
  "max_tool_calls": 5
}

The model should produce a planning action:

json
{
  "action": "todo_writer",
  "look": "stylish breathable men's Ibiza nightlife outfit with polished warm-weather separates",
  "searches": [
    "men breathable nightlife shirt",
    "men tailored lightweight trousers",
    "men stylish loafers",
    "men lightweight evening jacket",
    "men minimalist watch"
  ]
}

The harness then executes catalog searches and returns tool-result events. The model eventually emits:

json
{
  "action": "finalize_report",
  "results": [
    {
      "rank": 1,
      "product_id": "...",
      "category": "Clothing",
      "reasoning": "..."
    }
  ]
}

Evaluation

Latest Low-Temperature First-50 Holdout

This is the most recent first-50 holdout result for sft80_from150 from the local batch_eval_outfit_models.py harness.

Settings:

query file: queries/Clothing/OUTFIT500.json
query limit: 50
rollouts per query: 1
category: Clothing
task type: outfit
planning: on
max tool calls: 5
top-k: 5
generation temperature: 0.2
generation top-p: 0.9
judge model: gpt-4.1
judge temperature: 0

Result:

Table with columns: Metric, Value
Metric	Value
Rows	50
Overall score mean	72.5391
Quality score mean	71.6867
Quality score median	71.3000
Quality `>=70`	50%
Quality `>=75`	46%
Quality `>=85`	32%
Correctness score mean

Same-run comparison:

Table with columns: Model, Training source, Overall, Quality, Q >=70, Q >=85, Correctness, Broken
Model	Training source	Overall	Quality	Q >=70	Q >=85	Correctness	Broken
`sft80_from150`	GPT-4.1 SFT, first 150, score >=80	72.54	71.69	50%	32%	97.6	4%
`onpolicy_sft_85`	on-policy SFT score >=85

Older Promptfix Eval Snapshot

The previous card reported a different 50-query promptfix eval. That run used a different harness snapshot and should be treated separately.

Table with columns: Model, Overall, Quality, >=70 Quality, Correctness, Missing Report, Broken Report, Rollouts/min
Model	Overall	Quality	>=70 Quality	Correctness	Missing Report	Broken Report	Rollouts/min
Qwen3 1.7B zero-shot	65.84	59.36	32%	96.4	0	6	4.787
Previous DeepOutfit SFT+DPO	55.77	43.73	10%

Additional probes from that older run:

Table with columns: Probe, Score
Probe	Score
Easy math generalization	10 / 10
Collapse probe suite	100 / 100

Strengths

Strong JSON-action validity in the DeepOutfit harness.
Good outfit quality relative to other local SFT/RL variants tested so far.
Uses tool results rather than inventing product IDs in the evaluated harness.
Preserves basic non-outfit behavior on small local sanity probes.

Limitations

Strongly coupled to the DeepOutfit event protocol and catalog.
Product IDs and product metadata are catalog-specific.
Metrics are internal GPT-4.1 judge metrics, not human ratings.
Prompt, tool, and judge changes can move scores materially.
The model may still produce weak outfits when search results are noisy or role coverage is ambiguous.
This checkpoint should not be used as evidence that unrelated RL or SDPO variants improved quality; those require harness-matched comparisons.

Recommended Inference Settings

For deterministic evaluation, use low-temperature decoding similar to the latest first-50 run:

temperature 0.2
top-p 0.9
one rollout per query

The saved generation_config.json contains a more exploratory setting (temperature=0.6, top_p=0.95, top_k=20) used for sampling-style experiments. Use the lower-temperature settings for judge comparisons.

Minimal Loading Example

This only loads the model. It does not recreate the DeepOutfit harness.

python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "flavianv/deepoutfit-qwen17b-gpt41-150-sft80"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype="auto",
    device_map="auto",
)

To reproduce reported behavior, run it through the local DeepOutfit JSON-action harness with the same tool schema, task-start events, product catalog, and judge configuration.

Intended Use

Research checkpoint for agentic tool-use outfit recommendation. It is useful for comparing SFT/RL/SDPO variants under a fixed harness. It is not a consumer styling service and should not be treated as a general-purpose fashion advisor.

Model provider

flavianv

Model tree

Base

Qwen/Qwen3-1.7B

Fine-tuned

this model

Modalities

Input

Text

Output

Text

Pricing

Dedicated Endpoints

View details

Supported Functionality

Model APIs

Dedicated Endpoints

Container

More information

Model card

Explore FriendliAI today

Get started Talk to an engineer

Model Details

Hub repo: flavianv/deepoutfit-qwen17b-gpt41-150-sft80
Local eval alias: sft80_from150
Base model: Qwen/Qwen3-1.7B
Architecture: Qwen3ForCausalLM
Training method: supervised fine-tuning
Teacher source: GPT-4.1 JSON-action traces from the DeepOutfit harness
Training query source: first 150 prompts from the local OUTFIT500-derived outfit set
Filtering: GPT-4.1 judged trace quality >=80
Training rows: 133 filtered traces
Target style: final-only assistant targets from JSON-action traces
Source checkpoint path: /home/criteo/reco-rl-json-action/outputs/models/qwen3-1.7b-json-action-outfit-gpt41-150-ge80-sft_20260530_152442_finalonly_20260530_152848

Selected training configuration recovered from the saved checkpoint:

bf16 training
batch size 1 per device
gradient accumulation 16
learning rate 5e-6
linear scheduler
max sequence length 16384
assistant-only loss
optimizer adamw_torch_fused

Harness Dependency

This model's behavior is tied to the DeepOutfit harness. The harness is part of the task definition, not just an evaluation wrapper.

Important harness pieces:

structured event messages rather than a single natural-language user prompt
planning enabled before catalog search
tool names and action schema such as todo_writer, search_products, and finalize_report
catalog-backed product search over the Clothing catalog
top_k=5 product search results
maximum 5 tool calls
final report validator requiring valid JSON, exactly 5 items, unique product IDs, and product IDs sourced from tool results
candidate accumulation and finalization logic in the local DeepOutfit batch-eval harness

Prompt And Judge Caveat

Results for this model are not comparable to older DeepOutfit reports unless the full harness is held fixed.

Across the local experiments, several pieces changed:

the system prompt / task-start event wording
the planning instruction and JSON schema
the toolset and tool result presentation
candidate accumulation and finalization behavior
maximum context / completion lengths
decoding settings
the judge model and judge prompt
the quality rubric and failure flags
error handling for missing reports, broken reports, invalid JSON, and missing API keys

JSON-Action Protocol Sketch

The harness starts with structured task information, for example:

json
{
  "event": "outfit_task_start",
  "task_type": "outfit",
  "user_query": "men's Ibiza nightlife outfit for clubbing and rooftop drinks, stylish breathable and comfortable, not costume-like",
  "target_k": 5,
  "max_tool_calls": 5
}

The model should produce a planning action:

json
{
  "action": "todo_writer",
  "look": "stylish breathable men's Ibiza nightlife outfit with polished warm-weather separates",
  "searches": [
    "men breathable nightlife shirt",
    "men tailored lightweight trousers",
    "men stylish loafers",
    "men lightweight evening jacket",
    "men minimalist watch"
  ]
}

The harness then executes catalog searches and returns tool-result events. The model eventually emits:

json
{
  "action": "finalize_report",
  "results": [
    {
      "rank": 1,
      "product_id": "...",
      "category": "Clothing",
      "reasoning": "..."
    }
  ]
}

Evaluation

Latest Low-Temperature First-50 Holdout

This is the most recent first-50 holdout result for sft80_from150 from the local batch_eval_outfit_models.py harness.

Settings:

query file: queries/Clothing/OUTFIT500.json
query limit: 50
rollouts per query: 1
category: Clothing
task type: outfit
planning: on
max tool calls: 5
top-k: 5
generation temperature: 0.2
generation top-p: 0.9
judge model: gpt-4.1
judge temperature: 0

Result:

Table with columns: Metric, Value
Metric	Value
Rows	50
Overall score mean	72.5391
Quality score mean	71.6867
Quality score median	71.3000
Quality `>=70`	50%
Quality `>=75`	46%
Quality `>=85`	32%
Correctness score mean

Same-run comparison:

Table with columns: Model, Training source, Overall, Quality, Q >=70, Q >=85, Correctness, Broken
Model	Training source	Overall	Quality	Q >=70	Q >=85	Correctness	Broken
`sft80_from150`	GPT-4.1 SFT, first 150, score >=80	72.54	71.69	50%	32%	97.6	4%
`onpolicy_sft_85`	on-policy SFT score >=85

Older Promptfix Eval Snapshot

The previous card reported a different 50-query promptfix eval. That run used a different harness snapshot and should be treated separately.

Table with columns: Model, Overall, Quality, >=70 Quality, Correctness, Missing Report, Broken Report, Rollouts/min
Model	Overall	Quality	>=70 Quality	Correctness	Missing Report	Broken Report	Rollouts/min
Qwen3 1.7B zero-shot	65.84	59.36	32%	96.4	0	6	4.787
Previous DeepOutfit SFT+DPO	55.77	43.73	10%

Additional probes from that older run:

Table with columns: Probe, Score
Probe	Score
Easy math generalization	10 / 10
Collapse probe suite	100 / 100

Strengths

Strong JSON-action validity in the DeepOutfit harness.
Good outfit quality relative to other local SFT/RL variants tested so far.
Uses tool results rather than inventing product IDs in the evaluated harness.
Preserves basic non-outfit behavior on small local sanity probes.

Limitations

Strongly coupled to the DeepOutfit event protocol and catalog.
Product IDs and product metadata are catalog-specific.
Metrics are internal GPT-4.1 judge metrics, not human ratings.
Prompt, tool, and judge changes can move scores materially.
The model may still produce weak outfits when search results are noisy or role coverage is ambiguous.
This checkpoint should not be used as evidence that unrelated RL or SDPO variants improved quality; those require harness-matched comparisons.

Recommended Inference Settings

For deterministic evaluation, use low-temperature decoding similar to the latest first-50 run:

temperature 0.2
top-p 0.9
one rollout per query

Minimal Loading Example

This only loads the model. It does not recreate the DeepOutfit harness.

python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "flavianv/deepoutfit-qwen17b-gpt41-150-sft80"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype="auto",
    device_map="auto",
)

To reproduce reported behavior, run it through the local DeepOutfit JSON-action harness with the same tool schema, task-start events, product catalog, and judge configuration.

deepoutfit-qwen17b-gpt41-150-sft80

Get help setting up a custom Dedicated Endpoints.

README

Model Details

Harness Dependency

Prompt And Judge Caveat

JSON-Action Protocol Sketch

Evaluation

Latest Low-Temperature First-50 Holdout

Older Promptfix Eval Snapshot

Strengths

Limitations

Recommended Inference Settings

Minimal Loading Example

Intended Use

Explore FriendliAI today

README

Model Details

Harness Dependency

Prompt And Judge Caveat

JSON-Action Protocol Sketch

Evaluation

Latest Low-Temperature First-50 Holdout

Older Promptfix Eval Snapshot

Strengths

Limitations

Recommended Inference Settings

Minimal Loading Example

Intended Use