benchflow

benchflow-qwen35-9b

Release Summary

Field	Value
Release tag	`v0.0.1`
Adapter repo	`benchflow/benchflow-qwen35-9b`
Base checkpoint	`Qwen/Qwen3.5-9B`
Base checkpoint form	Full, non-quantized source checkpoint; frozen during LoRA SFT
Adapter type	LoRA / PEFT
Source completed run	`general-agent-qwen35-9b-sft-seq2048-fresh-20260624T131847Z`
W&B project	`general-agent-qwen35-9b-sft-seq2048-fresh-20260624T131847Z`
HF training artifacts	`benchflow/env0-experiment-trajectories/experiments/general-agent/general-agent-qwen35-9b-sft-seq2048-fresh-20260624T131847Z`
Published at	`2026-06-24 22:27:07 UTC`

Research Reproduction Scope

The goal of this adapter is to reproduce the SFT-stage lift from Prime Intellect's general-agent work as closely as possible while using a smaller student model that can train on one H100. The stack keeps the Prime-style task and verifier path:

Source tasks: open-source PrimeIntellect-ai/research-environments/environments/general_agent task corpus.
Teacher trace generation: general-agent-solver-rlm + Azure GPT-5.4-mini through native Verifiers / vf-eval --save-results artifacts.
SFT trainer: Prime-RL SFT.
Student: full, non-quantized Qwen/Qwen3.5-9B loaded in BF16 with LoRA adapters.
Eval: general-agent-solver-local through native vf-eval --save-results on the same held-in task sets before and after SFT.

Data Recipe

Field	Value
Dataset	`benchflow/general-agent-qwen35-9b-azure-gpt54mini-sft`
Dataset rows	`4414`
Original source task count	`4417`
Teacher model	Azure GPT-5.4-mini
Teacher harness	Prime/Verifiers `general-agent-solver-rlm`
Artifact format	Native `vf-eval --save-results` trajectories converted to Prime-RL + SFT rows

Training Parameters

Field	Value
Trainer	Prime-RL SFT
Model loaded for SFT	`Qwen/Qwen3.5-9B` full BF16 base weights
Quantization	None for the completed `v0.0.1` LoRA run
Adapter	LoRA
LoRA rank	`16`
LoRA alpha	`32`
LoRA dropout
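The rank and alpha above imply the usual LoRA scaling factor `alpha / rank`. The following is a minimal sketch, assuming the conventional (non-rsLoRA) scaling. The function names and the 4096 x 4096 projection shape are hypothetical placeholders for illustration and are not taken from this release.

```python
def lora_scaling(rank: int, alpha: int) -> float:
    """Conventional LoRA scaling applied to the low-rank update B @ A."""
    return alpha / rank


def lora_params_per_matrix(rank: int, d_in: int, d_out: int) -> int:
    """Trainable parameters LoRA adds to one adapted weight of shape (d_out, d_in)."""
    # A has shape (rank, d_in) and B has shape (d_out, rank).
    return rank * (d_in + d_out)


# Values from the Training Parameters table.
scaling = lora_scaling(rank=16, alpha=32)

# Hypothetical 4096 x 4096 projection, for illustration only.
extra = lora_params_per_matrix(rank=16, d_in=4096, d_out=4096)

print(f"scaling={scaling}, extra trainable params per matrix={extra}")
```

Under these assumptions, the low-rank update is multiplied by 2.0 before being added to the frozen weight.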

Training Result

Metric	Value
Completed step	`200`
Final loss	`0.11897`
`loss/nan_count`	`0`
Peak GPU memory	about `40.8 GiB`
Final adapter	`adapter_model.safetensors` in this repo

The initial data.seq_len=8192 Prime-RL BF16 LoRA attempt OOMed on one H100. The completed v0.0.1 run used data.seq_len=2048, system CUDA 12.8 nvcc/ptxas, and g++-12 for the required FLA/TileLang kernels.
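As a rough sanity check on the memory figures, the frozen BF16 weights alone can be estimated from the parameter count. This is a back-of-envelope sketch that assumes a nominal 9 billion parameters (taken from the model name, not a measured count) and 2 bytes per BF16 parameter. The remainder is only an upper bound on what activations, adapter weights, optimizer state, and kernel workspaces used at `data.seq_len=2048`.

```python
GIB = 2**30  # bytes per GiB


def bf16_weight_gib(n_params: float) -> float:
    """Memory for weights stored in BF16 (2 bytes per parameter), in GiB."""
    return n_params * 2 / GIB


# Assumption: nominal 9e9 parameters, inferred from the "9B" in the model name.
weights_gib = bf16_weight_gib(9e9)

# Reported peak from the Training Result table.
peak_gib = 40.8
remainder_gib = peak_gib - weights_gib

print(f"weights ~{weights_gib:.1f} GiB, remainder ~{remainder_gib:.1f} GiB")
```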

Evaluation Results

All evaluations below use native Verifiers vf-eval --save-results, general-agent-solver-local, serving context length 4096, --enable-auto-tool-choice, and --tool-call-parser qwen3_xml. Dynamic vLLM LoRA loading was not reliable for this stack, so eval served a merged local checkpoint built from this adapter plus Qwen/Qwen3.5-9B.
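The serving flags above can be assembled into a `vllm serve` command line. This is a sketch, not the exact command used for the release: the merged-checkpoint path and the helper name are hypothetical, and mapping "serving context length 4096" to `--max-model-len` is an assumption.

```python
import shlex


def build_vllm_serve_cmd(merged_model_path: str, max_model_len: int = 4096) -> list[str]:
    """Build an argv list for serving a merged checkpoint with tool calling enabled."""
    return [
        "vllm", "serve", merged_model_path,
        "--max-model-len", str(max_model_len),  # assumed mapping for the context length
        "--enable-auto-tool-choice",
        "--tool-call-parser", "qwen3_xml",
    ]


# Hypothetical local path to the merged checkpoint.
cmd = build_vllm_serve_cmd("./merged-benchflow-qwen35-9b")
print(shlex.join(cmd))
```

The resulting list can be passed to `subprocess.run(cmd)` on a machine where vLLM is installed.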

Task set	Base pass rate	LoRA SFT pass rate	Delta	Notes
Held-in 5 smoke	`1/5 = 20.00%`	`2/5 = 40.00%`	`+20.00 pp`	First serving/eval smoke
Held-in 20	`11/20 = 55.00%`	`13/20 = 65.00%`	`+10.00 pp`
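The pass rates and deltas in the table follow directly from the pass counts. A small sketch that recomputes them (the helper names are illustrative only):

```python
def pass_rate(passed: int, total: int) -> float:
    """Pass rate as a percentage, rounded to two decimals."""
    return round(100 * passed / total, 2)


def delta_pp(base: tuple[int, int], sft: tuple[int, int]) -> float:
    """Difference between SFT and base pass rates, in percentage points."""
    return round(pass_rate(*sft) - pass_rate(*base), 2)


# (passed, total) pairs copied from the table above: (base, LoRA SFT).
results = {
    "Held-in 5 smoke": ((1, 5), (2, 5)),
    "Held-in 20": ((11, 20), (13, 20)),
}

for name, (base, sft) in results.items():
    print(
        f"{name}: {pass_rate(*base):.2f}% -> {pass_rate(*sft):.2f}% "
        f"({delta_pp(base, sft):+.2f} pp)"
    )
```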

Evaluation artifact prefixes:

Held-in 5 smoke: benchflow/env0-experiment-trajectories/experiments/general-agent/general-agent-qwen35-eval-smoke4096-20260624T152150Z
Held-in 20 comparison: benchflow/env0-experiment-trajectories/experiments/general-agent/general-agent-qwen35-eval-heldin20-compare-20260624
Held-in 36 comparison: benchflow/env0-experiment-trajectories/experiments/general-agent/general-agent-qwen35-eval-heldin36-compare-20260624
Held-in 50 final 14-task run: benchflow/env0-experiment-trajectories/experiments/general-agent/general-agent-qwen35-eval-heldin50-gap-20260624T190517Z

Loading

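```python
from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer

# Load the frozen base checkpoint; "auto" keeps the dtype stored in the checkpoint.
base = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3.5-9B",
    torch_dtype="auto",
    trust_remote_code=True,
)
# Attach the LoRA adapter weights from this repo on top of the base model.
model = PeftModel.from_pretrained(base, "benchflow/benchflow-qwen35-9b")
# The tokenizer comes from the base checkpoint.
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-9B", trust_remote_code=True)
```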
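Because the evaluations served a merged checkpoint rather than a dynamically loaded adapter, it may be useful to fold the adapter into the base weights before serving. One possible way to do this uses PEFT's `merge_and_unload`. This is a sketch and not necessarily how the release checkpoint was built; the function name `merge_adapter` and the output directory are hypothetical.

```python
def merge_adapter(base_id: str, adapter_id: str, out_dir: str) -> str:
    """Merge a LoRA adapter into its base model and save a standalone checkpoint."""
    # Imported lazily so the module can be loaded without the heavy dependencies.
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    base = AutoModelForCausalLM.from_pretrained(
        base_id, torch_dtype="auto", trust_remote_code=True
    )
    model = PeftModel.from_pretrained(base, adapter_id)

    # Fold the low-rank updates into the base weights and drop the PEFT wrappers.
    merged = model.merge_and_unload()
    merged.save_pretrained(out_dir)

    # Save the tokenizer alongside so the directory can be served directly.
    AutoTokenizer.from_pretrained(base_id, trust_remote_code=True).save_pretrained(out_dir)
    return out_dir


# Example call (downloads both repos; the output path is a hypothetical choice):
# merge_adapter("Qwen/Qwen3.5-9B", "benchflow/benchflow-qwen35-9b", "./merged-benchflow-qwen35-9b")
```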

Caveats

This is an SFT-stage reproduction artifact, not the full Prime paper recipe with the original teacher and student model stack.
The trainable dataset has 4414 rows rather than 4417 because three Azure teacher prompts were blocked by content filtering.
The latest assembled held-in 50 lift is positive but modest at +6.00 pp; gains are concentrated in a small number of tasks rather than spread broadly across the task set.
The next QLoRA seq8192 experiment is excluded from v0.0.1 and should receive its own update/tag only after it completes.
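The row count and the held-in 50 lift in the caveats above can be checked with simple arithmetic. This sketch assumes one SFT row per unblocked source task and that the held-in 50 set contains exactly 50 tasks. Both are readings of the text above, not confirmed details.

```python
# Dataset accounting, using the figures stated in the caveats.
source_tasks = 4417
blocked_prompts = 3
dataset_rows = source_tasks - blocked_prompts
retained_pct = round(100 * dataset_rows / source_tasks, 2)

# Convert the held-in 50 lift from percentage points to a task count.
heldin50_tasks = 50  # assumption: the set name reflects its size
lift_pp = 6.00
extra_tasks_passed = round(lift_pp / 100 * heldin50_tasks)

print(f"rows={dataset_rows} ({retained_pct}% of source tasks)")
print(f"+{lift_pp:.2f} pp on {heldin50_tasks} tasks = {extra_tasks_passed} additional passes")
```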

Model provider

benchflow

Model tree

Base

Qwen/Qwen3.5-9B

Adapter

this model

Modalities

Input

Video, Text, Image

Output

Text
