lew96123
qwen3.5-0.8b-terminal-agent-grpo-rl-checkpoint-100
Run this model inference on single tenant GPU with unmatched speed and reliability at scale.
Get help setting up a custom Dedicated Endpoints.
Talk with our engineer to get a quote for reserved GPU instances with discounts.
README
License: apache-2.0Scientific Evaluation Metrics (Terminal-Bench 2.0)
Evaluated natively on the challenging 69-task Terminal-Bench 2.0 suite, this intermediate RL checkpoint demonstrates significant reasoning improvements:
- Formatting/Extraction Success Rate: 72.46% (50 out of 69 tasks successfully parsed)
- Targeted RL Breakthroughs: Successfully parsed and extracted valid command blocks on 4 major, highly complex tasks where the original SFT model failed:
constraints-scheduling(Task 15)- Generated command:
calendars search -type VEVENT -type OFTIME -source /app -query "Team Planning Meeting" -format UTC
- Generated command:
filter-js-from-html(Task 27)- Generated command:
echo "Compressing and saving filter.py with tar" | tar -czf /app/filter.py -C /tmp source_code.txt
- Generated command:
merge-diff-arc-agi-task(Task 48)- Generated command:
git merge branch2 -q -d -
- Generated command:
nginx-request-logging(Task 54)- Generated command:
bash
# 1. Install Nginx web serverapt install -y nginx# 2. Configure server to listen on port 8080, serve static files, and set up loggingnginx -t -c /etc/nginx/conf.d/benchmark-site.conf
- Generated command:
- Prompt Formatting Resilience: Highly stable execution within locked-in ... reasoning barriers followed by clean executable bash markdown blocks, proving the GRPO reward function effectively preserved SFT formatting while expanding capabilities.
Detailed RL Training Architecture
1. Group Relative Policy Optimization (GRPO) Algorithm
Instead of using a separate critic/value model which would exceed the 4GB/6GB VRAM limits of consumer laptop GPUs, GRPO computes relative advantages within a group of generations. For each prompt, the model generates a group of G = 4 completions. The advantage for each completion is computed by normalizing the programmatic rewards across the group: Ai=std(R)+1×10−8Ri−mean(R)
2. Programmatic Multitask Reward Functions
The model is optimized using three high-signal, non-differential reward functions:
- Formatting Reward (Weight: 1.5): Evaluates if the response strictly matches the regex pattern:
r"^<thinking>\s*[\s\S]+?\s*</thinking>\s*```bash\n[\s\S]+?\n```$" - Execution Relevance Reward (Weight: 1.0 per match): Awards points if key command verbs (such as
docker,find,grep,tar,wc) correctly map to the user request. - Conciseness Reward (Weight: 0.2 max): Penalizes long outputs (length > 300) to prevent infinite loops and verbose reasoning.
3. Training Hyperparameters
- Base Policy: SFT warm base weights (qwen_agent_lora)
- Group Size (G): 4 generations per prompt
- Optimizer: AdamW with paged states
- Learning Rate: 5e-6
- Batching: per_device_train_batch_size = 4, gradient_accumulation_steps = 2
- Max Steps: Step 100 of 200 (Warmed up for 10 steps)
WSL Environment & Version Alignment Details
To execute the GRPOTrainer stably on Windows hardware, we run inside WSL Ubuntu 24.04 LTS to avoid native Windows OS PEFT conflicts.
Aligned Deep Learning Dependencies
- Python: 3.11.15
- PyTorch: 2.5.1+cu121 (Compiled natively for CUDA 12.1, preventing symbol mismatches with bitsandbytes)
- TRL: 0.15.1
- Transformers: 5.12.0
- Pydantic: 2.10.6 (Prevents schema validation errors on tensor classes)
Applied Hot-Patches
We applied a manual hot-patch to transformers/utils/hub.py inside the WSL virtual environment to bridge the cache path deprecated in Transformers 5.x:
python
TRANSFORMERS_CACHE = "/home/lew_bei/.cache/huggingface/hub"
We also applied a runtime monkey-patch to GRPOTrainer in python to align _get_train_sampler signatures across Transformers 5.x and TRL:
python
original_get_train_sampler = GRPOTrainer._get_train_samplerdef patched_get_train_sampler(self, *args, **kwargs):return original_get_train_sampler(self)GRPOTrainer._get_train_sampler = patched_get_train_sampler
Locked-In Inference Settings
To achieve optimal, loop-free, and precise terminal command streaming, utilize the following parameters:
python
inference_config = {"do_sample": True,"temperature": 0.7, # Calibrated to prevent greedy repetition loops"top_p": 0.95, # Restricts vocabulary to high-probability tokens"max_new_tokens": 256, # Budgeted for full chain-of-thought + code blocks"use_cache": True, # Reuses GPU KV-Cache for 10x generation speedup}
Prompt Template Contract:
markdown
### System: You are a local OS Terminal Controller Agent. State your thinking process within <thinking> tags, followed by the exact terminal command block.### Instruction: {user_natural_language_request}### Output: <thinking>{reasoning}</thinking>```bash{executable_command}
markdown
---*Note: This is a LoRA adapter. To run on llama.cpp, merge these weights with the 16-bit Qwen3.5-0.8B-Base model and convert the merged model to GGUF format.*
Model provider
lew96123
Model tree
Base
Qwen/Qwen3.5-0.8B
Adapter
this model
Modalities
Input
Video, Text, Image
Output
Text
Pricing
Dedicated Endpoints
View detailsSupported Functionality
Model APIs
Dedicated Endpoints
Container
More information