Run this model inference on single tenant GPU with unmatched speed and reliability at scale.
Run this model inference with full control and performance in your environment.
Get help setting up a custom Dedicated Endpoints.
Talk with our engineer to get a quote for reserved GPU instances with discounts.
README
License: apache-2.0News
- 2026-05-23: Released the Spreadsheet-RL-4B model checkpoint on Hugging Face at
Spreadsheet-RL/Spreadsheet-RL-4B.
Model Details
| Field | Value |
|---|---|
| Base model | Qwen/Qwen3-4B-Thinking-2507 |
| Training method | GRPO with outcome-based rewards |
| Environment | Spreadsheet Gym with Microsoft Excel 365, spreadsheet-native tools, SandboxFusion code execution, and async Excel recalculation/reward service |
| Training data | Spreadsheet-RL training split: 5,928 filtered ExcelForum tasks |
| Evaluation | SpreadsheetBench and Domain-Spreadsheet |
| License | Apache-2.0, following the base model license |
Training Configuration
For full details, please see the paper. The released 4B run uses:
| Hyperparameter | Value |
|---|---|
| Algorithm | GRPO; KL-regularized against a frozen reference model |
| Training steps | 60 |
| Prompt/response limits | 4,096 / 27,648 tokens |
| Rollout sampling | temperature 0.6; top-p 0.95; top-k 20 |
| Batching | 64 prompts/step; 16 rollouts/prompt; 1,024 rollouts/step |
| Multi-turn caps | max assistant turns 20; max user turns 20; max tool-response length 8,192 |
| Optimizer | AdamW; learning rate 1e-6; weight decay 0.01; betas (0.9, 0.999); grad clip 1.0 |
| KL loss | low-var KL; coefficient 0.001 |
| Actor update batching | mini-batch 32; dynamic batch sizing enabled |
| Hardware | 1 node x 4 NVIDIA H100 GPUs |
| Training time | about 40 hours wall-clock for the 4B run |
Results
Spreadsheet-RL improves the same 4B base model through spreadsheet-native interaction design, comprehensive tool access, and RL post-training.
| Benchmark | Base | + Native Harness | + Full Tools | Spreadsheet-RL-4B |
|---|---|---|---|---|
| SpreadsheetBench Pass@1 | 12.0 | 15.6 | 19.3 | 23.4 |
On Domain-Spreadsheet, Spreadsheet-RL improves overall Pass@1 from 8.4 to 17.2 over 1,660 evaluation rollouts.
Usage
Install the standard Transformers stack and load the checkpoint:
python
from transformers import AutoModelForCausalLM, AutoTokenizermodel_id = "Spreadsheet-RL/Spreadsheet-RL-4B"tokenizer = AutoTokenizer.from_pretrained(model_id, trust_remote_code=True)model = AutoModelForCausalLM.from_pretrained(model_id,torch_dtype="auto",device_map="auto",trust_remote_code=True,)
For task evaluation and agent rollouts, use the full Spreadsheet-RL codebase with the released dataset and Spreadsheet Gym:
bash
hf download Spreadsheet-RL/Spreadsheet-RL --repo-type dataset --local-dir datagit clone https://github.com/Spreadsheet-RL/Spreadsheet-RL.git
The default training/evaluation harness is maintained in the code repository under configs/, scripts/, reward/, and verl/.
Citation
bibtex
@misc{chi2026spreadsheetrl,title = {Spreadsheet-RL: Advancing Large Language Model Agents on Realistic Spreadsheet Tasks via Reinforcement Learning},author = {Banghao Chi and Yining Xie and Mingyuan Wu and Jingcheng Yang and Jize Jiang and Zhaoheng Li and Shengyi Qian and Minjia Zhang and Klara Nahrstedt and Rui Hou and Xiangjun Fan and Hanchao Yu},year = {2026},eprint = {2605.22642},archivePrefix = {arXiv},primaryClass = {cs.AI},doi = {10.48550/arXiv.2605.22642},url = {https://arxiv.org/abs/2605.22642}}
Model provider
Spreadsheet-RL
Model tree
Base
Qwen/Qwen3-4B-Thinking-2507
Fine-tuned
this model
Modalities
Input
Text
Output
Text
Pricing
Dedicated Endpoints
View detailsSupported Functionality
Model APIs
Dedicated Endpoints
Container
More information