benchflow

benchflow-qwen35-9b

Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more

Get help setting up a custom Dedicated Endpoints.

Talk with our engineer to get a quote for reserved GPU instances with discounts.

README

Release Summary

Table
FieldValue
Release tagv0.0.1
Adapter repobenchflow/benchflow-qwen35-9b
Base checkpointQwen/Qwen3.5-9B
Base checkpoint formFull, non-quantized source checkpoint; frozen during LoRA SFT
Adapter typeLoRA / PEFT
Source completed rungeneral-agent-qwen35-9b-sft-seq2048-fresh-20260624T131847Z
W&B projectgeneral-agent-qwen35-9b-sft-seq2048-fresh-20260624T131847Z
HF training artifactsbenchflow/env0-experiment-trajectories/experiments/general-agent/general-agent-qwen35-9b-sft-seq2048-fresh-20260624T131847Z
Published at2026-06-24 22:27:07 UTC

Research Reproduction Scope

The goal of this adapter is to reproduce the SFT-stage lift from Prime Intellect's general-agent work as closely as possible while using a smaller student model that can train on one H100. The stack keeps the Prime-style task and verifier path:

  • Source tasks: open-source PrimeIntellect-ai/research-environments/environments/general_agent task corpus.
  • Teacher trace generation: general-agent-solver-rlm + Azure GPT-5.4-mini through native Verifiers / vf-eval --save-results artifacts.
  • SFT trainer: Prime-RL SFT.
  • Student: full, non-quantized Qwen/Qwen3.5-9B loaded in BF16 with LoRA adapters.
  • Eval: general-agent-solver-local through native vf-eval --save-results on the same held-in task sets before and after SFT.

Data Recipe

Table
FieldValue
Datasetbenchflow/general-agent-qwen35-9b-azure-gpt54mini-sft
Dataset rows4414
Original source task count4417
Teacher modelAzure GPT-5.4-mini
Teacher harnessPrime/Verifiers general-agent-solver-rlm
Artifact formatNative vf-eval --save-results trajectories converted to Prime-RL messages + tool_defs SFT rows
Excluded source tasksdog_breeding_t1, skydiving_center_t1, skydiving_center_t2
Exclusion reasonStable Azure content-filter blocks during teacher trace generation
Full teacher sweep artifactbenchflow/env0-experiment-trajectories/experiments/general-agent/general-agent-daytona-teacher-full4417-tunnel8-20260624T015706Z
Data validationPrime SFT JSONL validator rejected non-leading system messages and leakage fields before training

Training Parameters

Table
FieldValue
TrainerPrime-RL SFT
Model loaded for SFTQwen/Qwen3.5-9B full BF16 base weights
QuantizationNone for the completed v0.0.1 LoRA run
AdapterLoRA
LoRA rank16
LoRA alpha32
LoRA dropout0.0
Target modulesq_proj, k_proj, v_proj, o_proj, gate_proj, up_proj, down_proj
Trainable paramsabout 29.1M
Adapted base paramsabout 5.30B
Total base params loadedabout 9.44B
Sequence length2048
Global batch size8
Micro batch size1
Pack functioncat
Shuffletrue
Seed0
OptimizerAdamW
Learning rate5e-5
Weight decay0.01
Betas0.9, 0.999
Grad norm clip1.0
SchedulerLinear
Warmup steps20
Decay steps180
Minimum LR0.0
Max steps200
Checkpoint interval20
Keep last3
Keep interval100
Save formatsafetensors
Loss maskAssistant messages only; system, user, and tool messages are context-only

Training Result

Table
MetricValue
Completed step200
Final loss0.11897
loss/nan_count0
Peak GPU memoryabout 40.8 GiB
Final adapteradapter_model.safetensors in this repo

The initial data.seq_len=8192 Prime-RL BF16 LoRA attempt OOMed on one H100. The completed v0.0.1 run used data.seq_len=2048, system CUDA 12.8 nvcc/ptxas, and g++-12 for the required FLA/TileLang kernels.

Evaluation Results

All evaluations below use native Verifiers vf-eval --save-results, general-agent-solver-local, serving context length 4096, --enable-auto-tool-choice, and --tool-call-parser qwen3_xml. Dynamic vLLM LoRA loading was not reliable for this stack, so eval served a merged local checkpoint built from this adapter plus Qwen/Qwen3.5-9B.

Table
Task setBase pass rateLoRA SFT pass rateDeltaNotes
Held-in 5 smoke1/5 = 20.00%2/5 = 40.00%+20.00 ppFirst serving/eval smoke
Held-in 2011/20 = 55.00%13/20 = 65.00%+10.00 ppRecovered 3d_print_shop_t1, accounting_firm_t1
Held-in 3620/36 = 55.56%23/36 = 63.89%+8.33%No regressions; recovered 3d_print_shop_t1, accounting_firm_t1, allergy_clinic_t0
Held-in 50 assembled27/50 = 54.00%30/50 = 60.00%+6.00%Latest wider held-in result; final 14-task slice had no net delta
Held-in 50 final 14-task slice7/14 = 50.00%7/14 = 50.00%+0.00%Recovered animation_studio_t0; regressed antiquarian_bookshop_t0

Evaluation artifact prefixes:

  • Held-in 5 smoke: benchflow/env0-experiment-trajectories/experiments/general-agent/general-agent-qwen35-eval-smoke4096-20260624T152150Z
  • Held-in 20 comparison: benchflow/env0-experiment-trajectories/experiments/general-agent/general-agent-qwen35-eval-heldin20-compare-20260624
  • Held-in 36 comparison: benchflow/env0-experiment-trajectories/experiments/general-agent/general-agent-qwen35-eval-heldin36-compare-20260624
  • Held-in 50 final 14-task run: benchflow/env0-experiment-trajectories/experiments/general-agent/general-agent-qwen35-eval-heldin50-gap-20260624T190517Z

Loading

python

from peft import PeftModel
from transformers import AutoModelForCausalLM, AutoTokenizer
base = AutoModelForCausalLM.from_pretrained(
"Qwen/Qwen3.5-9B",
torch_dtype="auto",
trust_remote_code=True,
)
model = PeftModel.from_pretrained(base, "benchflow/benchflow-qwen35-9b")
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3.5-9B", trust_remote_code=True)

Caveats

  • This is an SFT-stage reproduction artifact, not the full Prime paper recipe with the original teacher and student model stack.
  • The trainable dataset has 4414 rows rather than 4417 because three Azure teacher prompts were blocked by content filtering.
  • The latest held-in50 assembled lift is positive but modest at +6.00 pp; gains are concentrated in a small number of tasks rather than broad across-the-board recovery.
  • The next QLoRA seq8192 experiment is excluded from v0.0.1 and should receive its own update/tag only after it completes.

Model provider

benchflow

Model tree

Base

Qwen/Qwen3.5-9B

Adapter

this model

Modalities

Input

Video, Text, Image

Output

Text

Pricing

Dedicated Endpoints

View details

Supported Functionality

Model APIs

Dedicated Endpoints

Container

More information

Explore FriendliAI today