
License: apache-2.0

Intended use

This adapter is designed to be loaded inside the RLM harness from alexzhang13/rlm, where the model acts as the root LM in a Python-REPL-driven recursion that issues llm_query / rlm_query sub-calls over long contexts. It is not a drop-in chat model: it expects the RLM system prompt and REPL scaffolding.

Recursion depth at inference is not capped at 1. The model was trained at depth=1 (sub-calls collapse to llm_query), but the canonical RLM harness supports arbitrary recursion depth, so the adapter can be run with depth>1 at inference time. Depths above 1 were not seen during training.

Some inference-time flags (orchestrator-mode hints, per-env user prologues, etc.) need to be set to match training-time conditioning. Exact flag list TBD — will be documented here once finalized.

Results

Evaluated against the base Qwen/Qwen3-30B-A3B-Instruct-2507, with and without a "Plan before you act" orchestrator hint added to the RLM system prompt. Mean reward × 100 (i.e. score %); full splits where feasible.


| env | A: vanilla base | B: base + "plan" hint | C: RLM-trained + "plan" hint | A → C Δ |
|---|---|---|---|---|
| OOLONG trec_coarse @ 132k (n=50) | 33.8 | 24.0 | 47.2 | +13.4 |
| OOLONG-Pairs @ 32k (n=20) | 42.9 | 41.2 | 45.0 | +2.2 |
| BrowseComp-Plus test (n=150, k=50 documents) | 11.6 | 18.7 | 29.7 | +18.1 |
| LongBenchv2 Code repo QA (n=50) | 22.0 | 38.0 | 42.0 | +20.0 |

OOLONG numeric vs non-numeric split (n=12 / n=38): 4.9 / 60.5 for the trained model, vs. 7.3 / 42.1 for the vanilla base — gains are concentrated in the non-numeric subset.
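The reported numbers can be cross-checked with a few lines of arithmetic. The sketch below recomputes the A → C deltas from the rounded table entries and pools the numeric / non-numeric OOLONG subsets with their sample sizes (12 and 38). It assumes the pooled score is a plain sample-weighted mean of the two subsets, which the card does not state explicitly; the helper names are illustrative only.

```python
# Scores copied from the results table (mean reward x 100).
results = {
    "OOLONG trec_coarse @ 132k": {"A": 33.8, "B": 24.0, "C": 47.2},
    "OOLONG-Pairs @ 32k": {"A": 42.9, "B": 41.2, "C": 45.0},
    "BrowseComp-Plus test": {"A": 11.6, "B": 18.7, "C": 29.7},
    "LongBenchv2 Code repo QA": {"A": 22.0, "B": 38.0, "C": 42.0},
}


def delta_a_to_c(row):
    """Improvement of the RLM-trained model (C) over the vanilla base (A)."""
    return round(row["C"] - row["A"], 1)


def pooled_mean(numeric_score, non_numeric_score, n_numeric=12, n_non_numeric=38):
    """Sample-weighted mean over the two OOLONG subsets (assumed pooling rule)."""
    total = n_numeric * numeric_score + n_non_numeric * non_numeric_score
    return total / (n_numeric + n_non_numeric)


deltas = {env: delta_a_to_c(row) for env, row in results.items()}
trained_pooled = pooled_mean(4.9, 60.5)  # trained model: numeric / non-numeric
base_pooled = pooled_mean(7.3, 42.1)     # vanilla base: numeric / non-numeric

for env, delta in deltas.items():
    print(f"{env}: A -> C = {delta:+.1f}")
print(f"trained pooled: {trained_pooled:.2f}")
print(f"base pooled:    {base_pooled:.2f}")
```

Under that assumption the pooled values land within about 0.05 of the table's 47.2 and 33.8, which is consistent with rounding of the subset scores. The recomputed OOLONG-Pairs delta comes out at 2.1 rather than the listed +2.2; the listed value was presumably computed from unrounded scores.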

Comparison vs. RLM-Qwen3-8B from the paper

For reference, the paper Recursive Language Models (Zhang et al., arXiv:2512.24601) reports these RLM-trained 8B numbers (Figure 3a):

| benchmark | Base Qwen3-8B | RLM(Qwen3-8B) | RLM-Qwen3-8B (post-trained) | RLM-Qwen3-30B (this model) |
|---|---|---|---|---|
| LongBenchv2 CodeQA | 4.00 | 26.00 | 32.00 | 42.0 |
| OOLONG | 0.00 | 24.00 | 32.04 | 47.2 |
| OOLONG-Pairs | 0.07 | 4.26 | 5.17 | 45.0 |

Caveats: the paper's RLM-Qwen3-8B was trained via SFT on distilled trajectories from a 480B teacher; this 30B model was trained via RL in a different harness, with a different system-prompt and orchestrator-hint setup. The two are not a strict apples-to-apples comparison, but they share the benchmarks and the RLM inference paradigm.

Training

Use the training code in rlm/training, which implements the training harness as a verifiers environment and trains with prime-rl. The environment is simple, requires no sandboxes, and directly trains models for use in the rlm inference engine.

Base model: Qwen/Qwen3-30B-A3B-Instruct-2507
Adapter: LoRA, r=32, α=64, targets q_proj, k_proj, v_proj, o_proj
Method: RL (verifiable rewards) with prime-rl
Env: RLMTrainEnv, a verifiers-compatible env logically 1:1 with rlm.RLM.completion. Lives in the alexzhang13/rlm repo under training/
Training depth: 1 (sub-calls collapse to llm_query during training)
Hardware: 8 × A100
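To make the adapter hyperparameters concrete, here is a minimal sketch that records them in a small data class and derives the effective LoRA scaling factor. It assumes standard LoRA scaling (α / r, not the rank-stabilized variant); the card does not say which was used, and `LoraSpec` is a hypothetical helper, not part of the rlm training code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class LoraSpec:
    """Hypothetical container for the adapter settings listed in this card."""
    r: int
    alpha: int
    target_modules: tuple

    @property
    def scaling(self) -> float:
        # Standard LoRA multiplies the low-rank update B @ A by alpha / r.
        return self.alpha / self.r


spec = LoraSpec(
    r=32,
    alpha=64,
    target_modules=("q_proj", "k_proj", "v_proj", "o_proj"),
)
print(f"rank={spec.r}, scaling={spec.scaling}, targets={spec.target_modules}")
```

With r=32 and α=64 the low-rank update is scaled by 2 under this assumption, and only the attention projections carry adapter weights.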

Inference Usage

mit-oasys/rlm-qwen3-30b-v0.1 is a LoRA adapter for Qwen/Qwen3-30B-A3B-Instruct-2507. Serve the base + adapter via vLLM, then run inference through rlm at depth 1.

1. Serve via vLLM

bash

vllm serve Qwen/Qwen3-30B-A3B-Instruct-2507 \
  --tensor-parallel-size 4 \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.9 \
  --enable-lora \
  --max-lora-rank 64 \
  --lora-modules rlm-v0.1=mit-oasys/rlm-qwen3-30b-v0.1 \
  --port 8000

LoRA rank is 32 (q/k/v/o_proj), so --max-lora-rank ≥ 32 is required. Training used max_model_len=16384.
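Before pointing rlm at the server, it can help to confirm that the adapter was registered under the expected name. The sketch below queries the OpenAI-compatible model listing at /v1/models and checks for rlm-v0.1. The helper names are illustrative, and the example payload is hand-written in the OpenAI list-models shape, not captured from a real server.

```python
import json
from urllib.request import urlopen


def fetch_models(base_url: str = "http://localhost:8000/v1", timeout: float = 10.0) -> dict:
    """Fetch the model listing from an OpenAI-compatible server."""
    with urlopen(f"{base_url}/models", timeout=timeout) as resp:
        return json.load(resp)


def served_model_ids(models_payload: dict) -> list:
    """Extract model ids from a list-models response."""
    return [entry["id"] for entry in models_payload.get("data", [])]


def adapter_is_served(models_payload: dict, adapter_name: str = "rlm-v0.1") -> bool:
    """True if the adapter name passed to --lora-modules appears in the listing."""
    return adapter_name in served_model_ids(models_payload)


# Offline illustration with a hand-written payload (not real server output).
example_payload = {
    "object": "list",
    "data": [
        {"id": "Qwen/Qwen3-30B-A3B-Instruct-2507", "object": "model"},
        {"id": "rlm-v0.1", "object": "model"},
    ],
}
print(adapter_is_served(example_payload))
```

Against a running server, the same check would be `adapter_is_served(fetch_models())`.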

2. Run inference via rlm

python

from rlm.core.rlm import RLM

# `context` (the long input) and `query` (the question about it) are supplied by the caller.
rlm = RLM(
    backend="openai",
    backend_kwargs={
        "base_url": "http://localhost:8000/v1",
        "model_name": "rlm-v0.1",  # name registered via --lora-modules
        "timeout": 1800.0,
    },
    environment="local",
    max_iterations=20,
    max_depth=1,  # matches the training depth
    sampling_args={
        "max_completion_tokens": 4096,
        "extra_body": {"enable_thinking": False},
    },
    sub_sampling_args={"max_tokens": 4096},
    # orchestrator=True is the default and matches training; do not change.
)
result = rlm.completion(prompt=context, root_prompt=query)
print(result.response)

For RLM inference, use the harness in alexzhang13/rlm and point its model config at this adapter. If your serving stack cannot apply LoRA at runtime (e.g. vLLM without punica), merge the adapter into the base weights offline first.
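An offline merge could look roughly like the following sketch, which uses the Hugging Face transformers and peft libraries. This is not taken from the rlm repo; the output directory name is hypothetical, and loading the 30B base in 16-bit precision needs on the order of 60 GB of memory (a rough estimate).

```python
def merge_lora_adapter(
    base_id: str = "Qwen/Qwen3-30B-A3B-Instruct-2507",
    adapter_id: str = "mit-oasys/rlm-qwen3-30b-v0.1",
    output_dir: str = "rlm-qwen3-30b-merged",  # hypothetical local path
) -> str:
    """Fold the LoRA weights into the base model and save a standalone checkpoint."""
    # Imported lazily so that defining this function does not require the
    # heavy dependencies to be installed.
    from peft import PeftModel
    from transformers import AutoModelForCausalLM, AutoTokenizer

    base = AutoModelForCausalLM.from_pretrained(base_id, torch_dtype="auto")
    model = PeftModel.from_pretrained(base, adapter_id)

    # merge_and_unload() adds the scaled low-rank updates to the targeted
    # projection weights and returns a plain transformers model.
    merged = model.merge_and_unload()
    merged.save_pretrained(output_dir)

    # Save the tokenizer alongside so the directory can be served on its own.
    AutoTokenizer.from_pretrained(base_id).save_pretrained(output_dir)
    return output_dir
```

The merged directory can then be served like any full checkpoint, without --enable-lora, and the model name passed to rlm should be whatever name that server exposes for it.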

Limitations

  • Training depth: trained at depth=1; depth>1 is supported at inference but was not seen during training.
  • Training used no persistent=True, no compaction=True, no max_budget / max_timeout / max_errors, and no custom tools. All of these exist in canonical rlm.RLM and can be used at inference.
  • Some inference flags (orchestrator hints, per-env user prologues) need to match training. Exact configuration will be specified here in a follow-up.

Model provider

mit-oasys

Model tree

Base

Qwen/Qwen3-30B-A3B-Instruct-2507

Adapter

this model

Modalities

Input

Text

Output

Text
