banyaaiofficial

Qwen3.5-122B-A10B-Banya-Tuned-v18-grpo

README

License: apache-2.0

Option D3 (online RLVR): trained with GRPO using a real pytest reward signal.

init: v10 LoRA (30% Pass@1 baseline)
trainer: TRL GRPOTrainer (Group Relative Policy Optimization)
rollout: HF model.generate (k=8 per task, T=1.0)
reward: swebench harness pytest (0=apply_fail / 0.3=apply_ok+test_fail / 1.0=resolved)
corpus: SWE-bench-Lite 270 train pool (no leakage with stratified-30 eval)
hyperparams: KL coefficient β=0.05, clip range ε=0.2, lr=5e-6, 100 steps, k=8
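
The three-level reward listed above can be sketched as a small mapping from a rollout's evaluation outcome to a scalar. This is only an illustration: `HarnessOutcome` and `pytest_reward` are hypothetical names, not part of the swebench harness or of this model's training code, and the sketch assumes the harness result has already been reduced to two booleans.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class HarnessOutcome:
    """Hypothetical summary of one rollout's evaluation (not a swebench type)."""
    patch_applied: bool   # did the generated patch apply cleanly?
    tests_resolved: bool  # did the target tests pass after applying it?


def pytest_reward(outcome: HarnessOutcome) -> float:
    """Map an outcome to the reward levels stated in the config above."""
    if not outcome.patch_applied:
        return 0.0  # apply_fail
    if not outcome.tests_resolved:
        return 0.3  # apply_ok + test_fail
    return 1.0      # resolved


# Example: rewards for three rollouts of one task.
rewards = [
    pytest_reward(HarnessOutcome(False, False)),
    pytest_reward(HarnessOutcome(True, False)),
    pytest_reward(HarnessOutcome(True, True)),
]
```

The intermediate 0.3 level gives partial credit to patches that apply but do not fix the tests, so a group of rollouts can still differ in reward when none of them resolves the task.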

This is the first online-RL variant in the v-series. Previous attempts (v15 156-pair DPO, v16 362-pair DPO with β=0.05, v17 D1.5 test-DPO) hit the offline ceiling: outcome convergence left too few valid preference pairs. GRPO bypasses the pair requirement entirely, because each rollout's advantage is computed relative to the other rollouts sampled for the same task.
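
To make the group-relative advantage concrete, here is a minimal sketch for one task's group of k=8 rollouts. It assumes the common GRPO normalization (reward minus group mean, divided by group standard deviation plus a small constant); the exact normalization used by the trainer may differ, and `group_relative_advantages` is a hypothetical helper, not a TRL function.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-4):
    """Normalize each reward against its own group's mean and spread.

    Assumption: population standard deviation plus a small `eps` to avoid
    division by zero when every rollout in the group gets the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# One task, k=8 rollouts scored with the 0 / 0.3 / 1.0 reward levels.
mixed_group = [0.0, 0.0, 0.3, 0.3, 0.3, 0.3, 1.0, 0.3]
mixed_adv = group_relative_advantages(mixed_group)

# A group where every rollout lands on the same outcome carries no signal.
uniform_group = [0.3] * 8
uniform_adv = group_relative_advantages(uniform_group)
```

In the mixed group the resolved rollout receives the largest positive advantage and the failed-apply rollouts receive negative ones, with no preference pairs needed. In the uniform group every advantage is zero, so that task contributes no policy gradient for the step.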

See SFT method doc.

Model Details

Model Provider

banyaaiofficial

Model Tree

Base

Qwen/Qwen3.5-122B-A10B

Adapter

this model

Input Modalities

Text

Image

Video

Output Modalities

Text

Supported Functionality

Dedicated Endpoints
