lew96123

qwen3.5-0.8b-terminal-agent-grpo-rl-checkpoint-100

README

License: apache-2.0

Scientific Evaluation Metrics (Terminal-Bench 2.0)

Evaluated natively on the challenging 69-task Terminal-Bench 2.0 suite, this intermediate RL checkpoint demonstrates significant reasoning improvements:

Formatting/Extraction Success Rate: 72.46% (50 out of 69 tasks successfully parsed)
Targeted RL Breakthroughs: Successfully parsed and extracted valid command blocks on 4 major, highly complex tasks where the original SFT model failed:
1. constraints-scheduling (Task 15)
  - Generated command: calendars search -type VEVENT -type OFTIME -source /app -query "Team Planning Meeting" -format UTC
2. filter-js-from-html (Task 27)
  - Generated command: echo "Compressing and saving filter.py with tar" | tar -czf /app/filter.py -C /tmp source_code.txt
3. merge-diff-arc-agi-task (Task 48)
  - Generated command: git merge branch2 -q -d -
4. nginx-request-logging (Task 54)
  - Generated command:
```
bash
# 1. Install Nginx web server
apt install -y nginx
# 2. Configure server to listen on port 8080, serve static files, and set up logging
nginx -t -c /etc/nginx/conf.d/benchmark-site.conf
```
Prompt Formatting Resilience: Highly stable execution within locked-in ... reasoning barriers followed by clean executable bash markdown blocks, proving the GRPO reward function effectively preserved SFT formatting while expanding capabilities.

Detailed RL Training Architecture

1. Group Relative Policy Optimization (GRPO) Algorithm

Instead of using a separate critic/value model which would exceed the 4GB/6GB VRAM limits of consumer laptop GPUs, GRPO computes relative advantages within a group of generations. For each prompt, the model generates a group of G = 4 completions. The advantage for each completion is computed by normalizing the programmatic rewards across the group: $A i = std (R) + 1 \times 1 0 - 8 R i $

2. Programmatic Multitask Reward Functions

The model is optimized using three high-signal, non-differential reward functions:

Formatting Reward (Weight: 1.5): Evaluates if the response strictly matches the regex pattern: r"^<thinking>\s*[\s\S]+?\s*</thinking>\s*```bash\n[\s\S]+?\n```$"
Execution Relevance Reward (Weight: 1.0 per match): Awards points if key command verbs (such as docker, find, grep, tar, wc) correctly map to the user request.
Conciseness Reward (Weight: 0.2 max): Penalizes long outputs (length > 300) to prevent infinite loops and verbose reasoning.

3. Training Hyperparameters

Base Policy: SFT warm base weights (qwen_agent_lora)
Group Size (G): 4 generations per prompt
Optimizer: AdamW with paged states
Learning Rate: 5e-6
Batching: per_device_train_batch_size = 4, gradient_accumulation_steps = 2
Max Steps: Step 100 of 200 (Warmed up for 10 steps)

WSL Environment & Version Alignment Details

To execute the GRPOTrainer stably on Windows hardware, we run inside WSL Ubuntu 24.04 LTS to avoid native Windows OS PEFT conflicts.

Aligned Deep Learning Dependencies

Python: 3.11.15
PyTorch: 2.5.1+cu121 (Compiled natively for CUDA 12.1, preventing symbol mismatches with bitsandbytes)
TRL: 0.15.1
Transformers: 5.12.0
Pydantic: 2.10.6 (Prevents schema validation errors on tensor classes)

Applied Hot-Patches

We applied a manual hot-patch to transformers/utils/hub.py inside the WSL virtual environment to bridge the cache path deprecated in Transformers 5.x:

python
TRANSFORMERS_CACHE = "/home/lew_bei/.cache/huggingface/hub"

We also applied a runtime monkey-patch to GRPOTrainer in python to align _get_train_sampler signatures across Transformers 5.x and TRL:

python
original_get_train_sampler = GRPOTrainer._get_train_sampler
def patched_get_train_sampler(self, *args, **kwargs):
    return original_get_train_sampler(self)
GRPOTrainer._get_train_sampler = patched_get_train_sampler

Locked-In Inference Settings

To achieve optimal, loop-free, and precise terminal command streaming, utilize the following parameters:

python
inference_config = {
    "do_sample": True,
    "temperature": 0.7,         # Calibrated to prevent greedy repetition loops
    "top_p": 0.95,              # Restricts vocabulary to high-probability tokens
    "max_new_tokens": 256,      # Budgeted for full chain-of-thought + code blocks
    "use_cache": True,          # Reuses GPU KV-Cache for 10x generation speedup
}

Prompt Template Contract:

markdown
### System: You are a local OS Terminal Controller Agent. State your thinking process within <thinking> tags, followed by the exact terminal command block.
### Instruction: {user_natural_language_request}
### Output: <thinking>
{reasoning}
</thinking>
```bash
{executable_command}

markdown
---
*Note: This is a LoRA adapter. To run on llama.cpp, merge these weights with the 16-bit Qwen3.5-0.8B-Base model and convert the merged model to GGUF format.*

Available on FriendliAI

Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more

Model Details

Model Provider

lew96123

Model Tree

Base

Qwen/Qwen3.5-0.8B

Adapter

this model

Input Modalities

Text

Image

Video

Output Modalities