banyaaiofficial

Qwen3.5-122B-A10B-Banya-Tuned-v18-grpo

README

License: apache-2.0

Qwen3.5-122B-A10B-Banya-Tuned-v18-grpo

Option D3 (online RLVR): GRPO training with a real pytest reward signal.

  • init: v10 LoRA (30% Pass@1 baseline)
  • trainer: TRL GRPOTrainer (Group Relative Policy Optimization)
  • rollout: HF model.generate (k=8 per task, T=1.0)
  • reward: swebench harness pytest result (0 = apply_fail, 0.3 = apply_ok + test_fail, 1.0 = resolved)
  • corpus: 270-task train pool from SWE-bench-Lite (no leakage into the stratified-30 eval set)
  • hyperparams: KL coefficient β=0.05, clip range ε=0.2, lr=5e-6, 100 steps, k=8
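
The three-tier reward listed above can be sketched as a small function. This is a minimal illustration under stated assumptions, not the training code: the `RolloutOutcome` type and its field names are hypothetical, and the real signal comes from running the swebench harness, which is not shown here.

```python
from dataclasses import dataclass


@dataclass
class RolloutOutcome:
    """Hypothetical summary of one rollout after the harness has run."""
    patch_applied: bool  # did the generated patch apply cleanly?
    tests_passed: bool   # did the target tests pass (task resolved)?


def tiered_reward(outcome: RolloutOutcome) -> float:
    """Map a harness outcome to the reward tiers listed in the card."""
    if not outcome.patch_applied:
        return 0.0  # apply_fail
    if not outcome.tests_passed:
        return 0.3  # apply_ok + test_fail
    return 1.0      # resolved


# Example: a patch that applies but does not fix the tests.
partial = tiered_reward(RolloutOutcome(patch_applied=True, tests_passed=False))
```

The intermediate 0.3 tier gives partial credit to patches that at least apply, so a group of rollouts can still be ranked when none of them resolves the task.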

This is the first online-RL variant in the v-series. Previous attempts (v15 with 156-pair DPO, v16 with 362-pair DPO at β=0.05, v17 with D1.5 test-DPO) hit the offline ceiling: outcome convergence left too few valid preference pairs. GRPO removes the pair requirement entirely, because each rollout is scored relative to the other rollouts sampled for the same task (group-relative advantage).
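
To illustrate the group-relative advantage, the sketch below normalizes the rewards of one task's k rollouts by the group mean and standard deviation. It is a simplified illustration, not the TRL implementation: the function name, the use of the population standard deviation, and the `eps` value are assumptions made for this example.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize each reward against its own group of rollouts.

    rewards: rewards of the k rollouts sampled for a single task.
    eps: small constant (assumed value) that avoids division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; implementations may differ
    return [(r - mu) / (sigma + eps) for r in rewards]


# One task, k=8 rollouts with mixed outcomes: no preference pairs needed.
mixed = group_relative_advantages([0.0, 0.0, 0.3, 0.3, 0.3, 0.3, 1.0, 0.0])

# If every rollout gets the same reward, all advantages are zero.
uniform = group_relative_advantages([0.3] * 8)
```

Rollouts that score above their group mean get a positive advantage and those below get a negative one. A group in which every rollout receives the same reward contributes no gradient signal in this formulation.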

See SFT method doc.

Model provider: banyaaiofficial

Base model: Qwen/Qwen3.5-122B-A10B

Adapter: this model

Input modalities: Video, Text, Image

Output modalities: Text


