insagur

qwen3.5-9b-agentnet-cot-l2-step100

README

License: apache-2.0

Training format (OpenCUA L2)

markdown
## Thought:
<reasoning>

## Action:
<one-sentence>

## Code:
pyautogui.click(x=0.5, y=0.5)

Coordinates normalized to [0, 1]. The ## markdown headers help the base model emit the schema reliably (vs. the legacy bare Thought: form). See insagur/qwen3.5-9b-agentnet-ubuntu-1epoch for the legacy-format variant.

Training config

Hardware: 1 × 8 A100 80GB SXM4
Distributed: DeepSpeed ZeRO-2 + bf16
Optimizer: AdamW, LR 1e-5 cosine, warmup 200 steps
Batch: per_device_bs=1 × grad_accum=16 × 8 GPU = global batch 128
Steps: 100 (preempted; 1 epoch = 300 steps)
EMA teacher: target=block, decay=0.9995, α=0.5
Sequence length: 3072
Image tokens: 2048 (≈1.6M pixel cap)
Save frequency: every 50 steps

Metrics @ step 100

Table with columns: Metric, Value
Metric	Value
Train loss	0.4601
Train token_acc	0.8416
Eval loss	0.4718
Eval token_acc	0.8387

Already approaches the fully-trained legacy-format model's eval loss (0.4622) at only 33% of training, suggesting the ## format converges faster.

Data

scripts/convert_agentnet_cot.py --cot_level l2 produces this format from AgentNet 5K trajectories with the same quality filter as the legacy converter (alignment≥7, efficiency≥5).

Table with columns: Split, Samples
Split	Samples
Train	38,317
Val	1,866

Inference

python
from transformers import AutoModelForImageTextToText, AutoProcessor

model = AutoModelForImageTextToText.from_pretrained(
    "insagur/qwen3.5-9b-agentnet-cot-l2-step100",
    torch_dtype="bfloat16",
).to("cuda")
processor = AutoProcessor.from_pretrained("insagur/qwen3.5-9b-agentnet-cot-l2-step100")

system = (
    "You are a computer-use agent operating a Linux desktop. "
    "Respond using the OpenCUA L2 format:\n"
    "## Thought:\n<reasoning>\n\n## Action:\n<one-sentence>\n\n## Code:\n<pyautogui code with normalized [0,1] coords>"
)
# ... see scripts/eval.py in the training repo for full inference loop ...

Recipe

Training code: https://github.com/2bhapby/gui_internal_worldmodel

bash
python scripts/convert_agentnet_cot.py --src ... --images_dir ... --out_dir ./agentnet_l2 --cot_level l2

CONFIG=configs/qwen35_9b_agentnet.yaml RUN_NAME=a100-9b-1ep-cot-l2 \
  sbatch --gpus=8 scripts/slurm_train_qwen.sbatch \
    data.train_jsonl=./agentnet_l2/train.jsonl \
    data.val_jsonl=./agentnet_l2/val.jsonl

Available on FriendliAI

Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more

Model Details

Model Provider

insagur

Model Tree

Base

Qwen/Qwen3.5-9B

Fine-tuned

this model

Input Modalities

Text

Image

Video

Output Modalities