flavianv

deepoutfit-qwen17b-sft-dpo

Training

Base model:

Qwen/Qwen3-1.7B

Supervised fine-tuning stage:

Data: filtered JSON-action outfit rollouts.
Selection rule used by the local pipeline: score four rollouts per outfit query, select the top rollout per query when its score is greater than 60, then export selected raw traces for SFT.
Max length: 16,384.
Epochs: 3.
Learning rate: 2e-5.
Per-device train batch size: 1.
Gradient accumulation steps: 16.
Assistant-only loss: enabled.
Full fine-tune, not LoRA.
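
The selection rule above can be sketched roughly as follows. This is a minimal illustration, not the pipeline's actual code: the `Rollout` record, its field names, and `select_for_sft` are hypothetical. Only the four-rollouts-per-query setup and the strict "greater than 60" threshold come from the description above.

```python
from collections import defaultdict
from dataclasses import dataclass

SCORE_THRESHOLD = 60.0  # from the card: keep the top rollout only if its score is > 60


@dataclass
class Rollout:
    # Hypothetical record layout; the real trace format is not documented in this card.
    query_id: str
    score: float
    trace: str


def select_for_sft(rollouts):
    """Return the best-scoring rollout per query, skipping queries whose best score is <= 60."""
    by_query = defaultdict(list)
    for rollout in rollouts:
        by_query[rollout.query_id].append(rollout)

    selected = []
    for group in by_query.values():
        best = max(group, key=lambda r: r.score)
        if best.score > SCORE_THRESHOLD:
            selected.append(best)
    return selected


# Made-up scores: four rollouts for each of two queries.
rollouts = [
    Rollout("q1", 42.0, "trace-a"),
    Rollout("q1", 71.0, "trace-b"),
    Rollout("q1", 65.0, "trace-c"),
    Rollout("q1", 58.0, "trace-d"),
    Rollout("q2", 55.0, "trace-e"),
    Rollout("q2", 60.0, "trace-f"),
    Rollout("q2", 30.0, "trace-g"),
    Rollout("q2", 12.0, "trace-h"),
]
selected_traces = [r.trace for r in select_for_sft(rollouts)]
```

In this toy input, query `q2` contributes nothing because its best score is exactly 60, which does not pass a strict greater-than check.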

DPO stage:

Starting checkpoint: the outfit SFT model.
Data: 100 outfit preference-query training rows and 50 validation rows in the local DeepOutfit pipeline.
Max length: 8,192.
Epochs: 1.
Learning rate: 5e-7.
DPO beta: 0.1.
Per-device train batch size: 1.
Gradient accumulation steps: 8.
Full fine-tune, not LoRA.
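
For readers unfamiliar with the role of beta, the standard DPO objective for a single preference pair can be sketched as below. This is the textbook formulation in plain Python, not the training code used for this checkpoint. The function name and arguments are illustrative, and using the SFT model as the frozen reference is an assumption based on the starting checkpoint listed above.

```python
import math

BETA = 0.1  # DPO beta from the list above


def dpo_loss(policy_chosen_logp, policy_rejected_logp,
             ref_chosen_logp, ref_rejected_logp, beta=BETA):
    """Standard DPO loss for one pair, given summed sequence log-probabilities.

    The reference log-probabilities are assumed to come from a frozen copy
    of the starting (SFT) checkpoint.
    """
    chosen_logratio = policy_chosen_logp - ref_chosen_logp
    rejected_logratio = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_logratio - rejected_logratio)
    # -log(sigmoid(margin)) computed as softplus(-margin) in a numerically stable form.
    x = -margin
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))


# Before any update the policy equals the reference, so the loss is log(2).
initial_loss = dpo_loss(-12.0, -15.0, -12.0, -15.0)

# Made-up values where the policy has moved toward the chosen response.
improved_loss = dpo_loss(-12.0, -15.0, -13.0, -14.0)
```

A small beta such as 0.1 scales the log-ratio margin down, so the policy has to move further from the reference before the loss saturates. With a per-device batch size of 1 and 8 accumulation steps, each optimizer step averages this loss over 8 pairs per device.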

Uploaded source directory:

outputs/models/qwen3-1.7b-json-action-outfit-sft-dpo-100q-cont1_20260527_230029

Evaluation

Evaluation was run with the local batch_eval_outfit_models.py harness on 50 OUTFIT500 queries, with one low-temperature rollout per query. Outfit quality was scored by a GPT-4.1 judge. The comparison included zero-shot Qwen3-1.7B, this SFT+DPO checkpoint, and a later GRPO/RL checkpoint.

Model	Rows	Overall	Generalization	Entropy	Efficiency	Correctness	Quality	>=70 Quality	Missing Report	Broken Report	Tokens Median	Calls Median	Rollouts/min
Qwen3-1.7B zero-shot	50	57.03	100	0.0416

Judge breakdown for this SFT+DPO checkpoint:

Metric	Value
Judged rows	50
Judge score > 70	4 / 50
Mean judge score	41.58
Max judge score	94.27
Min judge score	21.60
Best average judge submetric	validity gate, 94.0 / 100
Worst average judge submetric	explanation average, 40.8 / 100
Highest failure flag	impractical to wear, 88%
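
The count, threshold, mean, max, and min rows in this table are simple aggregates over per-query judge scores. A minimal sketch of that aggregation is shown below. The function name and the example scores are made up for illustration, and the strict `>` comparison follows the "Judge score > 70" row label; the harness's own aggregation code is not shown in this card.

```python
from statistics import mean


def summarize_judge_scores(scores, threshold=70.0):
    """Aggregate per-query judge scores into the kind of summary shown above."""
    return {
        "judged_rows": len(scores),
        "above_threshold": sum(score > threshold for score in scores),
        "mean": round(mean(scores), 2),
        "max": max(scores),
        "min": min(scores),
    }


# Invented scores, not the checkpoint's actual per-row results.
example_scores = [20.0, 35.0, 50.0, 75.0, 90.0]
summary = summarize_judge_scores(example_scores)
```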

Generalization probes:

Probe	Result
Easy math	8 / 10
JSON formatting	2 / 2
Factual QA	2 / 2
Exact string following	2 / 2
Simple code-output QA	2 / 2

Interpretation: compared with zero-shot Qwen3-1.7B, this checkpoint improves GPT-4.1-judged outfit quality but uses more search/tool calls and more tokens. The dominant remaining failure mode is outfit practicality: "impractical to wear" is the most frequent judge failure flag, at 88%.

Usage

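```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "flavianv/deepoutfit-qwen17b-sft-dpo"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype="auto",
    device_map="auto",
)

messages = [
    {
        "role": "user",
        "content": "Find a men's backyard BBQ host outfit that is casual, practical, and intentional.",
    }
]
# Render the conversation with the model's chat template and open the assistant turn.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
# Enable sampling explicitly so temperature and top_p apply
# regardless of the checkpoint's default generation config.
outputs = model.generate(
    **inputs,
    max_new_tokens=1024,
    do_sample=True,
    temperature=0.2,
    top_p=0.9,
)
# Decode only the newly generated tokens, not the echoed prompt.
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))
```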

For the intended JSON-action setting, use the same tool schema and validation loop as the training/evaluation harness. Standalone generations may reference products or tool actions that are only meaningful when connected to the product search tool.
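
As a rough idea of what such a validation loop might look like, the sketch below parses each model turn as a JSON action, checks it against a schema, and feeds the error back on failure. Everything specific here is an assumption made for illustration: the `search` and `report` action names, their fields, `validate_action`, and `next_valid_action` are hypothetical, and the real harness schema is not published in this card.

```python
import json

# Hypothetical tool schema: action name -> required fields and their types.
# Replace with the schema used by the actual training/evaluation harness.
ACTION_SCHEMA = {
    "search": {"query": str},
    "report": {"product_ids": list},
}


def validate_action(raw_text):
    """Parse one model turn as a JSON action; return (action, error_message)."""
    try:
        action = json.loads(raw_text)
    except json.JSONDecodeError as exc:
        return None, f"invalid JSON: {exc.msg}"
    if not isinstance(action, dict):
        return None, "action must be a JSON object"
    name = action.get("action")
    if not isinstance(name, str) or name not in ACTION_SCHEMA:
        return None, f"unknown action: {name!r}"
    for field, expected_type in ACTION_SCHEMA[name].items():
        if not isinstance(action.get(field), expected_type):
            return None, f"field {field!r} must be {expected_type.__name__}"
    return action, None


def next_valid_action(generate, max_attempts=3):
    """Call generate(feedback) until it yields a valid action or attempts run out."""
    feedback = None
    for _ in range(max_attempts):
        action, error = validate_action(generate(feedback))
        if error is None:
            return action
        feedback = error  # passed back so the next attempt can correct itself
    return None


# Stand-in for the model: the first reply is malformed, the second is valid.
replies = iter(["not json", '{"action": "search", "query": "linen shirt"}'])
action = next_valid_action(lambda feedback: next(replies))
```

In a real setup, `generate` would wrap `model.generate` and append the feedback string to the conversation before retrying.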

Limitations

Experimental research checkpoint, not production validated.
Optimized for outfit/product-report behavior, not broad assistant quality.
Can produce incomplete, impractical, or unsupported product combinations.
Product IDs and search behavior depend on the external catalog/tool harness.
Easy-math probing shows some drift versus the zero-shot base model: this checkpoint scored 8 / 10 on the easy-math probe.

License

This checkpoint is released under Apache 2.0, matching the base Qwen/Qwen3-1.7B license metadata.

Model provider

flavianv

Model tree

Base

Qwen/Qwen3-1.7B

Fine-tuned

this model

Modalities

Input

Text

Output

Text
