License: apache-2.0

Provenance

Starting model:

Original base model:

Local source checkpoint:

outputs/rl/deepoutfit_rlvr_qwen17b_20260528_150307_vllm_resume15/best_train_full

The best_train_full checkpoint was selected by the local persistent best-train checkpoint logic:

  • metric: reward
  • best optimizer step: 110
  • best train reward: 0.5532
  • later training metrics continued past this point, but the persistent best-train checkpoint remained step 110.
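The selection rule above can be illustrated with a small tracker. This is a minimal sketch, not the harness's actual code: the class name `BestTrainTracker` and the reward history are hypothetical, and only the step-110 reward of 0.5532 comes from this card. Persisting the best value across resumed runs is left out.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class BestTrainTracker:
    """Remembers the optimizer step with the highest metric seen so far."""

    metric: str = "reward"
    best_step: Optional[int] = None
    best_value: float = float("-inf")

    def update(self, step: int, metrics: dict) -> bool:
        """Return True when `step` sets a new best, i.e. when a save is due."""
        value = metrics[self.metric]
        if value > self.best_value:
            self.best_step, self.best_value = step, value
            return True
        return False


tracker = BestTrainTracker()
# Illustrative history; only the step-110 value is taken from the card.
history = {105: 0.5010, 110: 0.5532, 115: 0.5204, 120: 0.5391}
saved_steps = []
for step, reward in history.items():
    if tracker.update(step, {"reward": reward}):
        # A real harness would overwrite best_train_full at this point.
        saved_steps.append(step)

print(tracker.best_step, tracker.best_value)  # 110 0.5532
```

Later steps with lower reward leave the saved checkpoint untouched, which matches the behaviour described in the last bullet.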

GRPO Training Setup

Key settings from the local B200 DeepOutfit GRPO/RLVR config:

  • Training queries: 200 outfit prompts from the local OUTFIT500-derived split.
  • Eval queries: 20 held-out outfit eval prompts.
  • Tool setup: search-only product tool, trained catalog source, top_k=5.
  • Max tool rounds: 5.
  • Reward backend: outfit_judge.
  • Judge model: gpt-4.1-mini.
  • Reward scale: 0.01.
  • Invalid JSON/report reward: -1.0.
  • Judge parallelism: 8.
  • Learning rate: 5e-6.
  • Per-device train batch size: 4.
  • Gradient accumulation steps: 2.
  • Number of generations per prompt: 4.
  • Max completion length: 4096.
  • KL beta: 0.0.
  • vLLM rollout backend: colocated, GPU memory utilization 0.3.
  • Save interval: every 5 optimizer steps.
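To make the reward settings concrete, the sketch below shows one plausible way they combine. It assumes the judge returns a score on a 0-100 scale that is multiplied by the reward scale, and it uses the common GRPO formulation of group-relative advantages (z-scoring rewards within the generations for one prompt). The function names and the example scores are hypothetical, and the local trainer may differ in details such as the standard-deviation estimator.

```python
import statistics

REWARD_SCALE = 0.01    # "Reward scale" from the config
INVALID_REWARD = -1.0  # "Invalid JSON/report reward" from the config
NUM_GENERATIONS = 4    # "Number of generations per prompt" from the config


def shaped_reward(judge_score, report_valid):
    """Map a judge score (assumed to be on a 0-100 scale) to a training reward."""
    if not report_valid or judge_score is None:
        return INVALID_REWARD
    return REWARD_SCALE * judge_score


def group_advantages(rewards, eps=1e-4):
    """Z-score rewards within one prompt's group of generations."""
    mean = statistics.fmean(rewards)
    std = statistics.pstdev(rewards)
    return [(r - mean) / (std + eps) for r in rewards]


# Hypothetical judge outcomes for the 4 generations of a single prompt;
# the third generation produced an invalid report.
group = [(62.0, True), (48.0, True), (None, False), (55.0, True)]
assert len(group) == NUM_GENERATIONS

rewards = [shaped_reward(score, ok) for score, ok in group]
advantages = group_advantages(rewards)
for reward, advantage in zip(rewards, advantages):
    print(f"reward={reward:+.2f} advantage={advantage:+.3f}")
```

With KL beta set to 0.0, no KL penalty term is added to the objective, so these advantages are the only learning signal in this sketch.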

Evaluation

The model was evaluated with the local batch_eval_outfit_models.py harness on 50 OUTFIT500 queries, using one low-temperature rollout per query. Outfit quality was scored by a GPT-4.1 judge. The comparison included zero-shot Qwen3-1.7B, the SFT+DPO checkpoint, and this GRPO checkpoint.

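| Model | Rows | Overall | Generalization | Entropy | Efficiency | Correctness | Quality | >=70 Quality | Missing Report | Broken Report | Tokens Median | Calls Median | Rollouts/min |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Qwen3-1.7B zero-shot | 50 | 57.03 | 100 | 0.0416 | 69.38 | 84.8 | 29.93 | 0% | 2% | 22% | 3,632 | 1 | 10.72 |
| DeepOutfit SFT+DPO | 50 | 55.36 | 90 | 0.0650 | 39.71 | 82.0 | 41.58 | 8% | 0% | 30% | 11,240 | 5 | 8.21 |
| DeepOutfit GRPO best-train step 110 | 50 | 56.69 | 90 | 0.0641 | 40.79 | 88.4 | 39.10 | 6% | 2% | 16% | 11,280 | 5 | 7.05 |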

Judge breakdown for this GRPO checkpoint:

| Metric | Value |
|---|---|
| Judged rows | 50 |
| Judge score > 70 | 3 / 50 |
| Mean judge score | 39.10 |
| Max judge score | 76.53 |
| Min judge score | 0.00 |
| Best average judge submetric | validity gate, 94.8 / 100 |
| Worst average judge submetric | explanation average, 38.1 / 100 |
| Highest failure flag | impractical to wear, 82% |

Generalization probes:

| Probe | Result |
|---|---|
| Easy math | 8 / 10 |
| JSON formatting | 2 / 2 |
| Factual QA | 2 / 2 |
| Exact string following | 2 / 2 |
| Simple code-output QA | 2 / 2 |

Interpretation: compared with the SFT+DPO seed, this GRPO checkpoint improved report correctness (82.0 to 88.4) and reduced broken reports (30% to 16%) on the 50-query pass, but its GPT-4.1-judged outfit quality was slightly lower (39.10 versus 41.58). Compared with zero-shot Qwen3-1.7B, it improved judged outfit quality (29.93 to 39.10) but used more tool calls (median 5 versus 1) and more tokens (median 11,280 versus 3,632).
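The comparisons in the interpretation reduce to simple differences between rows of the comparison table. The snippet below is a worked example using values transcribed from that table; the dictionary keys and the `delta` helper are illustrative names, not part of the evaluation harness.

```python
# Values transcribed from the comparison table in this section.
rows = {
    "zero_shot": {"correctness": 84.8, "quality": 29.93, "broken_pct": 22,
                  "tokens_median": 3632, "calls_median": 1},
    "sft_dpo": {"correctness": 82.0, "quality": 41.58, "broken_pct": 30,
                "tokens_median": 11240, "calls_median": 5},
    "grpo": {"correctness": 88.4, "quality": 39.10, "broken_pct": 16,
             "tokens_median": 11280, "calls_median": 5},
}


def delta(metric, new="grpo", old="sft_dpo"):
    """Difference new - old for one metric, rounded to 2 decimals."""
    return round(rows[new][metric] - rows[old][metric], 2)


# GRPO versus the SFT+DPO seed.
print(delta("correctness"))  # 6.4
print(delta("broken_pct"))   # -14
print(delta("quality"))      # -2.48

# GRPO versus zero-shot Qwen3-1.7B.
print(delta("quality", old="zero_shot"))        # 9.17
print(delta("tokens_median", old="zero_shot"))  # 7648
print(delta("calls_median", old="zero_shot"))   # 4
```

With 50 rows and a single rollout per query, differences of a few points should be read as indicative rather than conclusive.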

Usage

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

repo_id = "flavianv/deepoutfit-qwen17b-grpo-best110"
tokenizer = AutoTokenizer.from_pretrained(repo_id)
model = AutoModelForCausalLM.from_pretrained(
    repo_id,
    torch_dtype="auto",
    device_map="auto",
)

messages = [
    {
        "role": "user",
        "content": "Find a men's lake weekend outfit for boating, lunch, and an evening fire pit.",
    }
]
# Render the chat messages into the prompt string the model expects.
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
# do_sample=True is required for temperature and top_p to take effect.
outputs = model.generate(
    **inputs,
    max_new_tokens=1024,
    do_sample=True,
    temperature=0.2,
    top_p=0.9,
)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))
```

For the intended JSON-action setting, use the same product-search tool schema, tool loop, and report validator as the training/evaluation harness. Standalone generations may refer to products or actions that only make sense inside that catalog-grounded tool environment.
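The card does not publish the tool schema, so the loop below is only a rough sketch of what a catalog-grounded JSON-action episode could look like. The action format (`{"action": "search", ...}` and `{"action": "report", ...}`), the function names, and the stub catalog are all hypothetical; only the limits of 5 tool rounds and top_k=5 come from the training setup. Substitute the real schema and report validator from the harness.

```python
import json

MAX_TOOL_ROUNDS = 5  # "Max tool rounds" from the training setup
TOP_K = 5            # top_k from the training setup


def run_episode(generate, search, prompt):
    """Alternate model turns and search calls until a report or the round cap.

    `generate(messages) -> str` and `search(query, top_k) -> list` are supplied
    by the caller. The JSON action format used here is hypothetical.
    Returns the report dict, or None if the episode ends without a valid one.
    """
    messages = [{"role": "user", "content": prompt}]
    # Up to MAX_TOOL_ROUNDS searches, plus one final turn for the report.
    for round_idx in range(MAX_TOOL_ROUNDS + 1):
        reply = generate(messages)
        messages.append({"role": "assistant", "content": reply})
        try:
            action = json.loads(reply)
        except json.JSONDecodeError:
            return None  # invalid JSON ends the episode
        if not isinstance(action, dict):
            return None
        if action.get("action") == "report":
            return action
        if action.get("action") != "search" or round_idx == MAX_TOOL_ROUNDS:
            return None  # unknown action, or search budget exhausted
        results = search(action.get("query", ""), TOP_K)
        messages.append({"role": "tool", "content": json.dumps(results)})
    return None


def fake_search(query, top_k):
    """Stub catalog search returning made-up product records."""
    catalog = [{"id": f"sku-{i}", "title": f"{query} item {i}"} for i in range(10)]
    return catalog[:top_k]


def make_scripted_model(replies):
    """Stand-in for the model: returns the scripted replies in order."""
    iterator = iter(replies)
    return lambda messages: next(iterator)


scripted = make_scripted_model([
    json.dumps({"action": "search", "query": "swim shorts"}),
    json.dumps({"action": "report", "items": ["sku-0"]}),
])
report = run_episode(scripted, fake_search, "Find a men's lake weekend outfit.")
print(report)  # {'action': 'report', 'items': ['sku-0']}
```

In practice, `generate` would wrap the `model.generate` call from the usage snippet, and the returned report would be passed through the harness's report validator before scoring.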

Limitations

  • Experimental research checkpoint, not production validated.
  • Model-only Hub upload; optimizer/scheduler/RNG state is not included.
  • Optimized for the local outfit-agent harness, not broad assistant quality.
  • Can still produce incomplete, impractical, or unsupported outfits.
  • Product IDs and search behavior depend on the external catalog/tool harness.
  • Easy-math probing shows some drift versus the zero-shot base model.

License

This checkpoint is released under Apache 2.0, following the base Qwen3-1.7B license metadata.

Model provider

flavianv

Model tree

Base

flavianv/deepoutfit-qwen17b-sft-dpo

Fine-tuned

this model

Modalities

Input

Text

Output

Text
