License: apache-2.0
Intended use
This adapter is designed to be loaded inside the RLM harness from alexzhang13/rlm, where the model acts as the root LM in a Python-REPL-driven recursion that issues llm_query / rlm_query sub-calls over long contexts. It is not a drop-in chat model — it expects the RLM system prompt and REPL scaffolding.
Recursion depth at inference is not capped to 1. The model was trained at depth=1 (sub-calls collapse to llm_query), but the underlying canonical RLM harness supports arbitrary recursion depth and the adapter can be used with depth>1 at inference time.
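The effect of the depth limit can be illustrated with a toy sketch. This is not the rlm harness: the function names, the halving strategy, and the echo "model" are all hypothetical, and in the real harness the root LM decides how to decompose the context by writing Python in the REPL. The sketch only shows that at the depth limit a sub-call collapses to a single plain LM call, and that raising the limit adds further levels of sub-calls.

```python
from typing import Callable


def toy_recursive_query(context: str, llm: Callable[[str], str],
                        max_depth: int, depth: int = 0) -> str:
    """Toy depth-limited recursion (hypothetical, not the rlm harness)."""
    # At the depth limit, the call collapses to one plain LM call,
    # analogous to sub-calls collapsing to llm_query at depth=1.
    if depth >= max_depth or len(context) <= 1:
        return llm(context)
    mid = len(context) // 2
    left = toy_recursive_query(context[:mid], llm, max_depth, depth + 1)
    right = toy_recursive_query(context[mid:], llm, max_depth, depth + 1)
    # The caller combines the two sub-results with one more LM call.
    return llm(left + right)


def count_calls(max_depth: int, context: str = "abcdefgh") -> int:
    """Count how many LM calls the toy recursion issues."""
    calls = 0

    def fake_llm(text: str) -> str:
        nonlocal calls
        calls += 1
        return text  # echo stand-in for a real model

    toy_recursive_query(context, fake_llm, max_depth)
    return calls


for d in (0, 1, 2):
    print(d, count_calls(d))
```

With the 8-character toy context, depth 0 issues one call, depth 1 issues three (two sub-calls plus the root), and depth 2 issues seven.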
Some inference-time flags (orchestrator-mode hints, per-env user prologues, etc.) need to be set to match training-time conditioning. Exact flag list TBD — will be documented here once finalized.
Results
Evaluated against the base Qwen/Qwen3-30B-A3B-Instruct-2507, with and without a "Plan before you act" orchestrator hint added to the RLM system prompt. Scores are mean reward × 100 (i.e. percentages), computed on full splits where feasible.

| env | A: vanilla base | B: base + "plan" hint | C: RLM-trained + "plan" hint | A → C Δ |
|---|---|---|---|---|
| OOLONG trec_coarse @ 132k (n=50) | 33.8 | 24.0 | 47.2 | +13.4 |
| OOLONG-Pairs @ 32k (n=20) | 42.9 | 41.2 | 45.0 | +2.2 |
| BrowseComp-Plus test (n=150, k=50 documents) | 11.6 | 18.7 | 29.7 | +18.1 |
| LongBenchv2 Code repo QA (n=50) | 22.0 | 38.0 | 42.0 | +20.0 |
OOLONG numeric vs non-numeric split (n=12 / n=38): 4.9 / 60.5 for the trained model, vs. 7.3 / 42.1 for the vanilla base — gains are concentrated in the non-numeric subset.
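As a sanity check, the numeric / non-numeric split can be recombined into the overall OOLONG scores with a size-weighted mean. The sketch below uses only the rounded figures quoted above, so it reproduces the headline 47.2 and 33.8 only to within rounding; the helper name `pooled_mean` is made up for illustration.

```python
def pooled_mean(subsets):
    """Size-weighted mean over (n, score) pairs."""
    total_n = sum(n for n, _ in subsets)
    return sum(n * score for n, score in subsets) / total_n


# (n, score %) for the numeric and non-numeric subsets, as reported above.
trained = pooled_mean([(12, 4.9), (38, 60.5)])  # ≈ 47.16
base = pooled_mean([(12, 7.3), (38, 42.1)])     # ≈ 33.75

print(f"trained: {trained:.2f}, base: {base:.2f}")
```

Both values agree with the table entries to within 0.1, which is what rounded subset scores allow.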
Comparison vs. RLM-Qwen3-8B from the paper
For reference, the paper Recursive Language Models (Zhang et al., arXiv:2512.24601) reports these RLM-trained 8B numbers (Figure 3a):
| benchmark | Base Qwen3-8B | RLM(Qwen3-8B) | RLM-Qwen3-8B (post-trained) | RLM-Qwen3-30B (this model) |
|---|---|---|---|---|
| LongBenchv2 CodeQA | 4.00 | 26.00 | 32.00 | 42.0 |
| OOLONG | 0.00 | 24.00 | 32.04 | 47.2 |
| OOLONG-Pairs | 0.07 | 4.26 | 5.17 | 45.0 |
Caveats: the paper's RLM-Qwen3-8B was trained via SFT on distilled trajectories from a 480B teacher; this 30B model was trained via RL in a different harness, with a different system-prompt and orchestrator-hint setup. The two are not strict apples-to-apples but share the benchmarks and the RLM inference paradigm.
Training
Use the training code in rlm/training, which implements the training harness as a verifiers environment and trains with prime-rl.
This training environment is simple and directly trains models to be used in the rlm inference engine with no sandboxes.
| Setting | Value |
|---|---|
| Base model | Qwen/Qwen3-30B-A3B-Instruct-2507 |
| Adapter | LoRA, r=32, α=64, targets q_proj, k_proj, v_proj, o_proj |
| Method | RL (verifiable rewards) with prime-rl |
| Env | RLMTrainEnv, a verifiers-compatible env logically 1:1 with rlm.RLM.completion. Lives in the alexzhang13/rlm repo under training/ |
| Training depth | 1 (sub-calls collapse to llm_query during training) |
| Hardware | 8 × A100 |
Inference Usage
mit-oasys/rlm-qwen3-30b-v0.1 is a LoRA adapter for Qwen/Qwen3-30B-A3B-Instruct-2507.
Serve the base + adapter via vLLM, then run inference through rlm at depth 1.
1. Serve via vLLM
```bash
vllm serve Qwen/Qwen3-30B-A3B-Instruct-2507 \
  --tensor-parallel-size 4 \
  --max-model-len 16384 \
  --gpu-memory-utilization 0.9 \
  --enable-lora \
  --max-lora-rank 64 \
  --lora-modules rlm-v0.1=mit-oasys/rlm-qwen3-30b-v0.1 \
  --port 8000
```
LoRA rank is 32 (q/k/v/o_proj), so --max-lora-rank ≥ 32 is required. Training used max_model_len=16384.
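Before pointing rlm at the server, it can help to confirm that the adapter is actually registered under the name given to --lora-modules. The sketch below assumes the server exposes the OpenAI-compatible /v1/models route and lists LoRA modules there alongside the base model; the helper names and the hand-written sample response are illustrative, not output captured from a real server.

```python
import json
import urllib.request


def fetch_models(base_url: str = "http://localhost:8000/v1") -> dict:
    """GET the OpenAI-compatible model list from a running server."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=10) as resp:
        return json.load(resp)


def served_model_ids(models_response: dict) -> list:
    """Extract model ids from an OpenAI-style model-list response."""
    return [entry["id"] for entry in models_response.get("data", [])]


# Hand-written, abridged example of the assumed response shape.
sample = {
    "object": "list",
    "data": [
        {"id": "Qwen/Qwen3-30B-A3B-Instruct-2507", "object": "model"},
        {"id": "rlm-v0.1", "object": "model"},
    ],
}
print("rlm-v0.1" in served_model_ids(sample))
```

With the server from step 1 running, `served_model_ids(fetch_models())` would be expected to include rlm-v0.1; if it does not, the --lora-modules mapping is the first thing to check.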
2. Run inference via rlm
```python
from rlm.core.rlm import RLM

rlm = RLM(
    backend="openai",
    backend_kwargs={
        "base_url": "http://localhost:8000/v1",
        "model_name": "rlm-v0.1",
        "timeout": 1800.0,
    },
    environment="local",
    max_iterations=20,
    max_depth=1,
    sampling_args={
        "max_completion_tokens": 4096,
        "extra_body": {"enable_thinking": False},
    },
    sub_sampling_args={"max_tokens": 4096},
    # orchestrator=True is the default and matches training; do not change.
)

result = rlm.completion(prompt=context, root_prompt=query)
print(result.response)
```
For RLM inference, use the harness in alexzhang13/rlm and point its model config at this adapter (merge offline if your serving stack — e.g. vLLM without punica — cannot apply LoRA at runtime).
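For readers unfamiliar with what an offline merge does, here is a minimal pure-Python sketch of the standard LoRA merge, W' = W + (α / r) · B · A, applied to one weight matrix. The tiny matrices are made up, and the rank is 2 rather than 32 to keep the example readable; α / r is kept at 2.0 to match this adapter's r=32, α=64. In practice a library routine (for example `merge_and_unload` in peft) would perform this over the q/k/v/o projections; that workflow is an assumption here, not something this card prescribes.

```python
def matmul(a, b):
    """Multiply two matrices given as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def merge_lora(w, a, b, alpha, r):
    """Return W + (alpha / r) * B @ A as a new matrix."""
    scale = alpha / r
    delta = matmul(b, a)  # shape (out, in), same as W
    return [[w_ij + scale * d_ij for w_ij, d_ij in zip(w_row, d_row)]
            for w_row, d_row in zip(w, delta)]


# Toy shapes: W is (out=2, in=3), A is (r=2, in=3), B is (out=2, r=2).
W = [[1, 0, 2], [0, 1, -1]]
A = [[1, 2, 0], [0, 1, 1]]
B = [[1, 0], [2, 1]]
alpha, r = 4, 2  # alpha / r = 2.0, the same ratio as alpha=64, r=32

merged = merge_lora(W, A, B, alpha, r)

# The merged weight gives the same output as base + adapter applied separately.
x = [[1], [1], [1]]
separate = [[wx[0] + (alpha / r) * bax[0]]
            for wx, bax in zip(matmul(W, x), matmul(B, matmul(A, x)))]
print(merged, matmul(merged, x) == separate)
```

Because the merged matrix reproduces the adapter's output exactly, a serving stack that cannot apply LoRA at runtime can load the merged weights as an ordinary checkpoint.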
Limitations
- Training depth: trained at depth=1; depth>1 is supported at inference but was not seen during training.
- No persistent=True, no compaction=True, no max_budget / max_timeout / max_errors, and no custom tools were used during training. All exist in canonical rlm.RLM and can be used at inference.
- Some inference flags (orchestrator hints, per-env user prologues) need to match training. Exact configuration will be specified here in a follow-up.
Model provider: mit-oasys
Model tree: base Qwen/Qwen3-30B-A3B-Instruct-2507; adapter: this model
Modalities: text input, text output