banyaaiofficial

Qwen3.5-122B-A10B-Banya-Tuned-v20-grpo

License: apache-2.0

Option D3 + dense reward + v5 init — GRPO with multi-stage preflight reward.

init: v5 LoRA (mix corpus, ~30% Pass@1 baseline)
trainer: TRL GRPOTrainer
rollout: HF model.generate (k=8 per task, T=1.0)
reward: dense [0,1.0] = parse 0.05 + grep 0.05 + file 0.10 + func 0.10 + harness 0.30/0.70
MoE safeguards: output_router_logits + aux loss + explicit router freeze (from v19)
corpus: SWE-bench-Lite 270 train pool (no leakage with stratified-30 eval)
hyperparams: β=0.1, ε=0.2, lr=1e-6, 100 steps, k=8
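The reward line above can be read as a weighted sum of preflight checks plus a harness term. The following is a minimal sketch of that sum, not the author's implementation: the function name `dense_reward`, the `passed` dict, and the `harness_score` argument are all hypothetical. The card does not spell out how the `0.30/0.70` harness values are assigned, so the sketch takes the harness contribution as a caller-supplied number rather than guessing.

```python
# Preflight stage weights exactly as listed on the card's "reward:" line.
PREFLIGHT_WEIGHTS = {"parse": 0.05, "grep": 0.05, "file": 0.10, "func": 0.10}


def dense_reward(passed, harness_score=0.0):
    """Hypothetical helper: sum the weights of the preflight stages that passed,
    add the harness contribution, and clamp the result to [0, 1].

    passed:        dict mapping stage name -> bool (missing stages count as failed)
    harness_score: float chosen by the caller (the card lists 0.30 and 0.70,
                   but how they are assigned is an assumption left to the caller)
    """
    preflight = sum(
        weight for name, weight in PREFLIGHT_WEIGHTS.items() if passed.get(name, False)
    )
    return min(1.0, max(0.0, preflight + harness_score))


# Example: a rollout that parses and touches the right file, but fails the harness.
partial = dense_reward({"parse": True, "grep": False, "file": True, "func": False})
```

With all four preflight stages passing, the preflight part sums to 0.30, so a harness contribution of 0.70 reaches the stated upper bound of 1.0.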

v20 builds on v19, where GRPO with the MoE safeguards was validated as stable over a 21.5 h run (8/30 smoke). v20 targets v19's plateau by densifying the reward signal: the parse/grep/file/func preflight stages still provide a gradient when the harness reward is stuck at its 0.3 ceiling.
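Why a denser reward helps can be seen from the group-relative advantage that GRPO computes over the k rollouts of one task. The sketch below is a simplified, stdlib-only illustration, not TRL's code: `group_advantages` is a hypothetical name, the population standard deviation is used for simplicity (implementations differ on this detail), and the reward values in `dense` are invented purely for illustration.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-4):
    """Normalize each rollout's reward against its own group:
    advantage_i = (r_i - mean(group)) / (std(group) + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# k=8 rollouts that all receive the same reward: every advantage is zero,
# so the policy gradient for this task vanishes.
flat = group_advantages([0.0] * 8)

# k=8 rollouts with graded preflight rewards (illustrative values only):
# advantages now differ, so better rollouts are reinforced.
dense = group_advantages([0.0, 0.05, 0.10, 0.10, 0.20, 0.30, 0.30, 0.30])
```

When every rollout in a group scores the same, the numerator is zero for all of them. Graded preflight rewards break such ties even if no rollout passes the harness.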

Model Details

Model provider: banyaaiofficial
Base model: Qwen/Qwen3.5-122B-A10B
Adapter: this model
Input modalities: Text, Image, Video
Output modalities: Text
Supported functionality: Dedicated Endpoints
