banyaaiofficial
Qwen3.5-122B-A10B-Banya-Tuned-v20-grpo
README
License: apache-2.0
Qwen3.5-122B-A10B-Banya-Tuned-v20-grpo
Option D3 + dense reward + v5 init — GRPO with multi-stage preflight reward.
- init: v5 LoRA (mix corpus, ~30% Pass@1 baseline)
- trainer: TRL GRPOTrainer
- rollout: HF model.generate (k=8 per task, T=1.0)
- reward: dense [0,1.0] = parse 0.05 + grep 0.05 + file 0.10 + func 0.10 + harness 0.30/0.70
- MoE safeguards: output_router_logits + aux loss + explicit router freeze (from v19)
- corpus: SWE-bench-Lite 270 train pool (no leakage with stratified-30 eval)
- hyperparams: β=0.1, ε=0.2, lr=1e-6, 100 steps, k=8
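A minimal sketch of how the dense reward listed above could be assembled, assuming each preflight stage is an independent pass/fail check whose weight is added when it passes. The names `PreflightResult` and `dense_reward` are hypothetical, and reading `harness 0.30/0.70` as two possible harness credits (a lower and a full one) is an assumption; the card does not say how the harness score is split or whether the stages gate each other.

```python
from dataclasses import dataclass

# Weights copied from the reward line in the card.
PREFLIGHT_WEIGHTS = {"parse": 0.05, "grep": 0.05, "file": 0.10, "func": 0.10}
# Assumption: the harness contributes one of these credits, or 0.0 on failure.
HARNESS_CREDITS = (0.0, 0.30, 0.70)


@dataclass
class PreflightResult:
    """Hypothetical container for the pass/fail outcome of each preflight stage."""
    parse: bool = False
    grep: bool = False
    file: bool = False
    func: bool = False


def dense_reward(preflight: PreflightResult, harness_credit: float = 0.0) -> float:
    """Sum the weights of the passed stages plus the harness credit, clamped to [0, 1]."""
    if harness_credit not in HARNESS_CREDITS:
        raise ValueError(f"unexpected harness credit: {harness_credit}")
    total = sum(w for name, w in PREFLIGHT_WEIGHTS.items() if getattr(preflight, name))
    return min(1.0, max(0.0, total + harness_credit))
```

Under these assumptions, a rollout that passes every preflight stage and earns the full harness credit scores 1.0, while one that fails the harness can still earn up to 0.30 from the preflight stages alone.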
Builds on v19 (GRPO + MoE safeguards, validated stable for 21.5h, 8/30 smoke). v20 addresses v19's plateau by densifying the reward signal: the parse/grep/file/func preflight checks give a gradient even when the harness reward is stuck at the 0.3 ceiling.
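To illustrate why a denser reward helps, here is a sketch of the group-relative advantage commonly used in GRPO: each rollout's reward is centred on its group mean and scaled by the group standard deviation. This is a simplified illustration, not the TRL implementation; `group_advantages` and the example reward values are hypothetical.

```python
from statistics import mean, pstdev


def group_advantages(rewards, eps=1e-4):
    """Normalise rewards within one rollout group: (r - mean) / (std + eps)."""
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation over the group
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of k=8 rollouts that all receive the same reward.
flat = group_advantages([0.3] * 8)

# Hypothetical group where preflight stages separate the rollouts.
dense = group_advantages([0.0, 0.05, 0.10, 0.10, 0.20, 0.30, 0.30, 0.30])
```

When all eight rollouts for a task receive the same reward, every advantage is zero and that task contributes no policy gradient. Small differences from the preflight stages are enough to give the group a non-zero spread, and therefore a learning signal.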
Model provider: banyaaiofficial
Model tree
- Base: Qwen/Qwen3.5-122B-A10B
- Adapter: this model
Modalities
- Input: Video, Text, Image
- Output: Text