insagur

qwen3.5-9b-agentnet-ubuntu-1epoch

License: apache-2.0

Training format

This model is trained on the legacy bare-prefix template, Thought: / Action: / Code:, without ## headers. For the variant trained on the OpenCUA ## format, see insagur/qwen3.5-9b-agentnet-cot-l2-step100.

markdown
Thought: <reasoning>
Action: <one-sentence>
Code:
pyautogui.click(x=0.5, y=0.5)

Coordinates are normalized to [0, 1] of screen width/height.
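
A response in this template can be split back into its three fields with a small parser. The sketch below is illustrative only: `ParsedStep` and `parse_step` are hypothetical names, not part of the training code, and the regular expression assumes each prefix starts at the beginning of a line.

```python
import re
from typing import NamedTuple, Optional


class ParsedStep(NamedTuple):
    thought: str
    action: str
    code: str


# Legacy bare-prefix layout: "Thought: ...", "Action: ...", then "Code:" followed by the code.
# Assumption: "Action:" and "Code:" each start a new line.
_STEP_RE = re.compile(
    r"Thought:\s*(?P<thought>.*?)\s*"
    r"^Action:\s*(?P<action>.*?)\s*"
    r"^Code:\s*(?P<code>.*)",
    re.DOTALL | re.MULTILINE,
)


def parse_step(text: str) -> Optional[ParsedStep]:
    """Return the three fields, or None if the text does not follow the template."""
    m = _STEP_RE.search(text)
    if m is None:
        return None
    return ParsedStep(m["thought"].strip(), m["action"].strip(), m["code"].strip())
```

Returning `None` rather than raising makes it easy to count how often the model strays from the template.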

Training config

Hardware: 1 node × 8 A100 80GB SXM4
Distributed: DeepSpeed ZeRO-2 + bf16
Optimizer: AdamW, LR 1e-5 cosine, warmup 200 steps
Batch: per_device_bs=1 × grad_accum=16 × 8 GPU = global batch 128
Epochs: 1 (300 steps)
EMA teacher: target=block (last ViT block), decay=0.9995, α=0.5
Sequence length: 3072 tokens max (longer samples truncated; p99 sample length = 2713)
Image tokens: 2048 (≈1.6M pixel cap; ~1689×950 post-resize)
Gradient checkpointing: on
Train runtime: 5h 14m
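
The batch size, step count, and learning-rate settings above fit together as in the following sketch. The constants come from this card; the schedule shape (linear warmup, then cosine decay to zero) is an assumption about what "LR 1e-5 cosine, warmup 200 steps" means, and `lr_at` is a hypothetical helper, not a function from the training code.

```python
import math

PER_DEVICE_BS, GRAD_ACCUM, NUM_GPUS = 1, 16, 8
TRAIN_SAMPLES = 38_317            # train samples reported in the Data section
PEAK_LR, WARMUP_STEPS = 1e-5, 200

# Samples consumed per optimizer step across all GPUs.
global_batch = PER_DEVICE_BS * GRAD_ACCUM * NUM_GPUS
# One epoch, counting the final partial batch as a step.
total_steps = math.ceil(TRAIN_SAMPLES / global_batch)


def lr_at(step: int) -> float:
    """Assumed schedule: linear warmup to PEAK_LR, then cosine decay to zero."""
    if step < WARMUP_STEPS:
        return PEAK_LR * step / WARMUP_STEPS
    progress = (step - WARMUP_STEPS) / max(1, total_steps - WARMUP_STEPS)
    return PEAK_LR * 0.5 * (1.0 + math.cos(math.pi * progress))


print(global_batch, total_steps)
```

Under these assumptions the learning rate peaks at step 200, so two thirds of the 300-step run is spent in warmup.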

Metrics (final)

Metric	Value
Train loss	0.4726
Train token_acc	0.854
Eval loss	0.4622
Eval token_acc	0.841
Eval samples	1866

Offline eval on 50 val samples (scripts/eval.py):

Metric	Value
parses_ok_frac	0.18
coord_l2 (parsed)	0.219
click_hit_rate (parsed)	0.60
action_kind_match	0.57

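Format adherence is the main limitation: only 18% of the 50 sampled outputs parse (parses_ok_frac = 0.18), because Qwen3.5 does not reliably emit the legacy bare-prefix format. The coordinate and click metrics above are computed on the parsed outputs only. The OpenCUA ## variant addresses this.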
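
To make the relationship between the parse rate and the coordinate error concrete, here is a minimal sketch of how two such metrics could be computed. It is not the logic of scripts/eval.py, whose definitions are not given here: `extract_click` and `summarize` are hypothetical helpers, "parsed" is assumed to mean "a click coordinate could be extracted", and coord_l2 is assumed to be the mean Euclidean distance in normalized coordinates. click_hit_rate is left out because its hit criterion is not stated.

```python
import math
import re
from typing import Optional, Sequence, Tuple

Point = Tuple[float, float]

# Assumption: clicks are emitted as pyautogui.click(x=<float>, y=<float>).
_CLICK_RE = re.compile(
    r"pyautogui\.click\(\s*x=([0-9]*\.?[0-9]+)\s*,\s*y=([0-9]*\.?[0-9]+)"
)


def extract_click(output: str) -> Optional[Point]:
    """Return the first click coordinate in the output, or None if there is none."""
    m = _CLICK_RE.search(output)
    if m is None:
        return None
    return float(m.group(1)), float(m.group(2))


def summarize(outputs: Sequence[str], targets: Sequence[Point]) -> dict:
    """Parse rate over all outputs; mean L2 error over the parsed ones only."""
    dists = []
    for out, (tx, ty) in zip(outputs, targets):
        pred = extract_click(out)
        if pred is not None:
            dists.append(math.hypot(pred[0] - tx, pred[1] - ty))
    return {
        "parses_ok_frac": len(dists) / len(outputs),
        "coord_l2": sum(dists) / len(dists) if dists else float("nan"),
    }
```

Because the error is averaged over parsed outputs only, a low parse rate means the coordinate metric rests on very few samples.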

Data

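AgentNet Ubuntu 5K trajectories, filtered at the trajectory level with task_completed AND alignment≥7 AND efficiency≥5, and at the step level with last_step_correct AND NOT last_step_redundant. The validation split holds out 5% of the trajectories, so no trajectory contributes steps to both splits.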

Split	Trajectories	Samples
Train	2,178	38,317
Val	114	1,866
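
The filtering and split described above can be sketched as follows. The field names in the predicates are taken from the filter description; everything else (the `steps` key, the helper names, the seed, and flooring the validation count) is a hypothetical layout chosen for illustration, not the actual preprocessing code.

```python
import random
from typing import Dict, List, Tuple


def keep_trajectory(traj: Dict) -> bool:
    """Trajectory-level filter: completed, well aligned, reasonably efficient."""
    return bool(traj["task_completed"]) and traj["alignment"] >= 7 and traj["efficiency"] >= 5


def keep_step(step: Dict) -> bool:
    """Step-level filter: correct and not redundant."""
    return bool(step["last_step_correct"]) and not step["last_step_redundant"]


def build_splits(
    trajs: List[Dict], val_frac: float = 0.05, seed: int = 0
) -> Tuple[List[Dict], List[Dict]]:
    """Split kept trajectories first, then flatten each split into per-step samples."""
    kept = [t for t in trajs if keep_trajectory(t)]
    random.Random(seed).shuffle(kept)
    n_val = int(len(kept) * val_frac)   # assumption: floor the validation count
    val, train = kept[:n_val], kept[n_val:]

    def samples(split: List[Dict]) -> List[Dict]:
        return [s for t in split for s in t["steps"] if keep_step(s)]

    return samples(train), samples(val)
```

Splitting before flattening is what keeps all steps of one trajectory in the same split.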

Inference

python
from transformers import AutoModelForImageTextToText, AutoProcessor
from PIL import Image

model = AutoModelForImageTextToText.from_pretrained(
    "insagur/qwen3.5-9b-agentnet-ubuntu-1epoch",
    torch_dtype="bfloat16",
).to("cuda")
processor = AutoProcessor.from_pretrained("insagur/qwen3.5-9b-agentnet-ubuntu-1epoch")

system = (
    "You are a computer-use agent operating a Linux desktop. "
    "You receive the user's task and the current screenshot. "
    "Respond with your reasoning, the action description, and the pyautogui code to execute. "
    "All coordinates are normalized to [0, 1] of screen width/height. "
    "Format your response exactly as:\n"
    "Thought: <your reasoning>\nAction: <one-sentence description of the next action>\nCode:\n<pyautogui code>"
)
image = Image.open("screenshot.png").convert("RGB")
messages = [
    {"role": "system", "content": system},
    {"role": "user", "content": [
        {"type": "image", "image": image},
        {"type": "text", "text": "Task: Open the terminal.\n<image>\nWhat is the next action?"},
    ]},
]
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = processor(text=[text], images=[image], return_tensors="pt").to("cuda")
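out = model.generate(**inputs, max_new_tokens=256)
# Decode only the newly generated tokens, not the echoed prompt.
new_tokens = out[:, inputs["input_ids"].shape[1]:]
print(processor.batch_decode(new_tokens, skip_special_tokens=True)[0])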
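
Because the model emits coordinates normalized to [0, 1], the generated pyautogui code has to be rescaled to pixels before it can drive a real screen. The helper below is a minimal sketch of that step: `to_pixels` is a hypothetical name, and it assumes coordinates appear as `x=` / `y=` keyword arguments, as in the template example.

```python
import re

# Assumption: coordinates are passed as x=<float> / y=<float> keyword arguments.
_COORD_RE = re.compile(r"\b(x|y)=([0-9]*\.?[0-9]+)")


def to_pixels(code: str, width: int, height: int) -> str:
    """Rewrite normalized x=/y= values in the code as integer pixel coordinates."""

    def repl(m: re.Match) -> str:
        axis, value = m.group(1), float(m.group(2))
        size = width if axis == "x" else height
        # Clamp so that a value of 1.0 still lands on the last valid pixel.
        pixel = min(size - 1, max(0, round(value * size)))
        return f"{axis}={pixel}"

    return _COORD_RE.sub(repl, code)


print(to_pixels("pyautogui.click(x=0.5, y=0.5)", 1920, 1080))
```

The rewritten string is still model-generated code, so it is worth inspecting or sandboxing before executing it.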

Recipe

Training code: https://github.com/2bhapby/gui_internal_worldmodel

bash
SMOKE=0 WANDB=1 DS=z2 NPROC=8 PER_DEVICE_BS=1 GRAD_ACCUM=16 \
RUN_NAME=a100-full-9b-1epoch \
sbatch --gpus=8 scripts/slurm_train_qwen35_4b_agentnet.sbatch \
  optim.num_train_epochs=1

Model Details

Model Provider

insagur

Model Tree

Base

Qwen/Qwen3.5-9B

Fine-tuned

this model

Input Modalities

Text

Image

Video

Output Modalities