xupy21/ContextRL_Klear_AgentForge_8B API & Inference Endpoint

Results

Across 5 long-horizon benchmarks (2 in-distribution agentic coding, 3 out-of-distribution), ContextRL improves over the standard GRPO baseline by +3.2 points on average, while improving every individual benchmark.

Benchmark	Base	RL (GRPO)	ContextRL (Ours)
SWE-Bench Verified	26.6	28.0	30.2
SWE-Bench Lite	21.0	21.7	24.0
LiveCodeBench v6	21.7	22.3	24.0
LongBench v2 (Overall)	27.4	27.0	29.6
LongBench v2 (Long)	21.3	24.1	28.7
NIAH	68.3	65.5	71.3

Metrics: SWE-Bench Verified/Lite resolve rate (%), LiveCodeBench v6 solve rate (%), LongBench v2 accuracy (%), NIAH mean recall (%). On the long-context tasks (LongBench v2, NIAH) where standard outcome-based GRPO struggles or regresses, ContextRL surpasses both the base model and the RL baseline, demonstrating strong out-of-distribution generalization.

Usage

This model follows the same interface as its Klear-AgentForge-8B base and can be loaded with transformers. Training and evaluation code, data construction pipelines, and detailed configurations are available in the repository: 👉 https://github.com/xupy2003/ContextAwareRL Please refer to the repo's README for environment setup, inference scripts, and reproduction instructions.

ContextRL_Klear_AgentForge_8B

Get help setting up a custom Dedicated Endpoints.

README

Results

Usage

Explore FriendliAI today

ContextRL_Klear_AgentForge_8B