flavianv

deepoutfit-qwen17b-grpo-best110

Provenance

Starting model:

flavianv/deepoutfit-qwen17b-sft-dpo

Original base model:

Qwen/Qwen3-1.7B

Local source checkpoint:

outputs/rl/deepoutfit_rlvr_qwen17b_20260528_150307_vllm_resume15/best_train_full

The best_train_full checkpoint was selected by the local persistent best-train checkpoint logic:

metric: reward
best optimizer step: 110
best train reward: 0.5532
training continued past this point, but no later step beat this reward, so the persistent best-train checkpoint remained step 110.
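The selection rule above can be sketched as a small tracker that remembers the highest train reward seen so far. This is a minimal illustration, not the local checkpoint code: the class name `BestTrainTracker`, its `update` method, and every (step, reward) pair other than step 110 / 0.5532 are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BestTrainTracker:
    """Remembers the optimizer step with the highest train reward seen so far."""

    best_step: Optional[int] = None
    best_reward: float = float("-inf")

    def update(self, step: int, reward: float) -> bool:
        """Return True when this step becomes the new best.

        A real trainer would save or overwrite the persistent checkpoint here.
        """
        if reward > self.best_reward:
            self.best_step = step
            self.best_reward = reward
            return True
        return False


tracker = BestTrainTracker()
# Hypothetical reward trace; only step 110 with reward 0.5532 comes from the card.
for step, reward in [(100, 0.48), (110, 0.5532), (120, 0.51), (130, 0.50)]:
    tracker.update(step, reward)

print(tracker.best_step, tracker.best_reward)
```

Because the comparison is strict, later steps that only tie the best reward would not replace the saved checkpoint; whether the local logic behaves the same way on ties is not stated.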

GRPO Training Setup

Key settings from the local B200 DeepOutfit GRPO/RLVR config:

Training queries: 200 outfit prompts from the local OUTFIT500-derived split.
Eval queries: 20 held-out outfit eval prompts.
Tool setup: search-only product tool, trained catalog source, top_k=5.
Max tool rounds: 5.
Reward backend: outfit_judge.
Judge model: gpt-4.1-mini.
Reward scale: 0.01.
Invalid JSON/report reward: -1.0.
Judge parallelism: 8.
Learning rate: 5e-6.
Per-device train batch size: 4.
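As a rough sketch of how the reward settings above might combine, the snippet below maps a judge score to a scalar reward and then normalizes rewards within a group of rollouts, as in the commonly described GRPO formulation. It assumes the judge returns a score on a 0 to 100 scale (suggested by the "/ 100" submetrics in the evaluation tables) and that an invalid JSON/report rollout skips the judge entirely. The function names, the group size, and the example scores are hypothetical, and the local trainer may differ.

```python
from statistics import mean, pstdev
from typing import List, Optional

REWARD_SCALE = 0.01           # "Reward scale" from the settings above
INVALID_REPORT_REWARD = -1.0  # "Invalid JSON/report reward" from the settings above


def scalar_reward(judge_score: Optional[float]) -> float:
    """Map a judge score (assumed 0-100) to a training reward.

    None stands for a rollout whose JSON/report failed validation (assumption).
    """
    if judge_score is None:
        return INVALID_REPORT_REWARD
    return REWARD_SCALE * judge_score


def group_advantages(rewards: List[float], eps: float = 1e-6) -> List[float]:
    """Group-relative advantages: (reward - group mean) / (group std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical judge scores for one prompt's group of four rollouts.
judge_scores = [76.53, 39.1, None, 55.0]
rewards = [scalar_reward(s) for s in judge_scores]
advantages = group_advantages(rewards)
print(rewards)
print(advantages)
```

Under these assumptions a valid report earns a reward between 0 and 1, so the -1.0 penalty sits well below any judged rollout and pushes the policy toward producing parseable reports first.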

Evaluation

The model was evaluated with the local batch_eval_outfit_models.py harness on 50 OUTFIT500 queries, with one low-temperature rollout per query. Outfit quality was scored by a GPT-4.1 judge. The comparison covered zero-shot Qwen3-1.7B, the SFT+DPO checkpoint, and this GRPO checkpoint.

Model	Rows	Overall	Generalization	Entropy	Efficiency	Correctness	Quality	>=70 Quality	Missing Report	Broken Report	Tokens Median	Calls Median	Rollouts/min
Qwen3-1.7B zero-shot	50	57.03	100	0.0416

Judge breakdown for this GRPO checkpoint:

Metric	Value
Judged rows	50
Judge score > 70	3 / 50
Mean judge score	39.10
Max judge score	76.53
Min judge score	0.00
Best average judge submetric	validity gate, 94.8 / 100
Worst average judge submetric	explanation average, 38.1 / 100
Highest failure flag	impractical to wear, 82%
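The summary rows in the judge breakdown can be derived from per-query scores along the following lines. This is only a sketch: the function name `summarize_judge_scores` is hypothetical, the threshold is treated as strictly greater than 70 to match the "Judge score > 70" row, and the demo scores are placeholders rather than the actual 50 evaluation rows.

```python
from statistics import mean


def summarize_judge_scores(scores, threshold=70.0):
    """Summarize per-query judge scores on a 0-100 scale."""
    return {
        "judged_rows": len(scores),
        "above_threshold": sum(1 for s in scores if s > threshold),
        "mean": mean(scores),
        "max": max(scores),
        "min": min(scores),
    }


# Placeholder scores for illustration; NOT the real evaluation data.
demo_scores = [76.53, 0.0, 40.0, 72.5]
summary = summarize_judge_scores(demo_scores)
print(summary)
```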

Generalization probes:

Probe	Result
Easy math	8 / 10
JSON formatting	2 / 2
Factual QA	2 / 2
Exact string following	2 / 2
Simple code-output QA	2 / 2

Interpretation: compared with the SFT+DPO seed, this GRPO checkpoint improved report correctness and reduced broken reports on the 50-query pass, but its GPT-4.1-judged outfit quality was slightly lower. Compared with zero-shot Qwen3-1.7B, it improved judged outfit quality but used more tool calls and more tokens.

Usage

python
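from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "flavianv/deepoutfit-qwen17b-grpo-best110"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype="auto",  # load in the dtype recorded in the checkpoint config
    device_map="auto",   # place weights on available devices (requires accelerate)
)

messages = [
    {
        "role": "user",
        "content": "Find a men's lake weekend outfit for boating, lunch, and an evening fire pit.",
    }
]
# Render the conversation with the model's chat template and open the assistant turn.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
# Enable sampling explicitly so temperature/top_p apply regardless of the
# checkpoint's generation config.
outputs = model.generate(
    **inputs,
    max_new_tokens=1024,
    do_sample=True,
    temperature=0.2,
    top_p=0.9,
)
# Decode only the newly generated tokens, not the echoed prompt.
new_tokens = outputs[0][inputs["input_ids"].shape[1]:]
print(tokenizer.decode(new_tokens, skip_special_tokens=True))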

For the intended JSON-action setting, use the same product-search tool schema, tool loop, and report validator as the training/evaluation harness. Standalone generations may refer to products or actions that only make sense inside that catalog-grounded tool environment.
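The tool loop mentioned above might look roughly like the sketch below. Everything about the action format is an assumption made for illustration: the `{"action": "search", ...}` and `{"action": "report", ...}` shapes, the status strings, and the function name `run_outfit_agent` are hypothetical and are not the schema used by the training/evaluation harness. Only the limits (5 tool rounds, top_k=5, search-only tool) come from the training setup. The model and the search tool are passed in as plain callables so the loop can be exercised with stubs.

```python
import json
from typing import Callable, Dict, List

MAX_TOOL_ROUNDS = 5  # "Max tool rounds" from the training setup
TOP_K = 5            # "top_k" from the training setup


def run_outfit_agent(
    generate: Callable[[List[Dict[str, str]]], str],
    search: Callable[[str, int], List[dict]],
    user_prompt: str,
) -> dict:
    """Alternate model turns and search calls until the model emits a report.

    HYPOTHETICAL action schema (not the harness schema):
      {"action": "search", "query": "..."}  or  {"action": "report", ...}
    """
    messages = [{"role": "user", "content": user_prompt}]
    rounds_used = 0
    while True:
        raw = generate(messages)
        messages.append({"role": "assistant", "content": raw})
        try:
            action = json.loads(raw)
        except json.JSONDecodeError:
            return {"status": "broken_report", "raw": raw}
        if not isinstance(action, dict):
            return {"status": "broken_report", "raw": raw}
        if action.get("action") == "report":
            return {"status": "ok", "report": action}
        if action.get("action") == "search":
            if rounds_used >= MAX_TOOL_ROUNDS:
                # Tool budget exhausted without a final report.
                return {"status": "missing_report"}
            rounds_used += 1
            results = search(action.get("query", ""), TOP_K)
            messages.append({"role": "tool", "content": json.dumps(results)})
            continue
        return {"status": "broken_report", "raw": raw}


# Stubs standing in for the model and the product catalog (illustrative only).
script = iter([
    json.dumps({"action": "search", "query": "men's swim shorts"}),
    json.dumps({"action": "report", "items": ["demo-001"]}),
])


def fake_generate(messages: List[Dict[str, str]]) -> str:
    return next(script)


def fake_search(query: str, top_k: int) -> List[dict]:
    return [{"id": "demo-001", "title": "placeholder product"}][:top_k]


result = run_outfit_agent(fake_generate, fake_search, "Find a men's lake weekend outfit.")
print(result["status"])
```

In a real setup, `generate` would wrap the chat-template and `model.generate` calls from the usage example, and the returned report would go through the same validator as the harness before being trusted.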

Limitations

Experimental research checkpoint, not production validated.
Model-only Hub upload; optimizer/scheduler/RNG state is not included.
Optimized for the local outfit-agent harness, not broad assistant quality.
Can still produce incomplete, impractical, or unsupported outfits.
Product IDs and search behavior depend on the external catalog/tool harness.
Easy-math probing (8 / 10) shows some drift versus the zero-shot base model.

License

This checkpoint is released under Apache 2.0, following the base Qwen3-1.7B license metadata.

Model provider: flavianv

Model tree: base flavianv/deepoutfit-qwen17b-sft-dpo, fine-tuned into this model

Modalities: text input, text output
