dmitchelljackson

cerebellum-qwen35-history-actions-lora


License: apache-2.0

Action Grammar

text

T <label>   tap visible labelled element
P <label>   long-press visible labelled element
K <text>    type text into focused field
U/D/L/R     scroll up/down/left/right
B           Android back
H           Android home
W           wait
F           done
I           impossible

At inference time:

  • The first token is constrained to valid action codes.
  • T and P labels are constrained to labels visible on the current screen.
  • K text is generated freely until EOS.
  • Single-token actions terminate immediately.
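
The grammar rules above can also be applied as a post-hoc check on a decoded action string. The following is a minimal sketch, not the harness implementation: the `Action` type, `parse_action`, and the grouping of codes into sets are hypothetical names introduced for illustration, and the real token-level constraint lives in the harness policy code.

```python
from dataclasses import dataclass
from typing import Optional, Sequence

# Action codes come from the grammar above; the set names are illustrative.
LABEL_ACTIONS = {"T", "P"}  # must name a label visible on the current screen
TEXT_ACTIONS = {"K"}        # free text, not constrained to the label set
SINGLE_TOKEN_ACTIONS = {"U", "D", "L", "R", "B", "H", "W", "F", "I"}


@dataclass(frozen=True)
class Action:
    code: str
    argument: Optional[str] = None


def parse_action(raw: str, visible_labels: Sequence[str]) -> Action:
    """Parse one decoded action line and validate it against the grammar."""
    raw = raw.strip()
    if not raw:
        raise ValueError("empty action")
    # Split on the first space: the code comes first, the rest is the argument.
    code, _, argument = raw.partition(" ")
    argument = argument.strip()
    if code in SINGLE_TOKEN_ACTIONS:
        if argument:
            raise ValueError(f"{code} takes no argument")
        return Action(code)
    if code in LABEL_ACTIONS:
        if argument not in visible_labels:
            raise ValueError(f"label {argument!r} is not visible")
        return Action(code, argument)
    if code in TEXT_ACTIONS:
        return Action(code, argument)
    raise ValueError(f"unknown action code {code!r}")
```

A validator like this is useful for filtering logged trajectories or unit-testing dataset rows; it does not replace constrained decoding, which prevents invalid actions from being generated in the first place.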

Current Evaluation

Latest live AndroidWorld/APK eval:

  • Split: configs/android_world_mixed100_eval_infraclean_20260609.json
  • Runner: scripts/run_android_world_parallel_eval.py
  • Harness: 5 Android emulators, collector APK state source, target app preopened
  • Adapter: checkpoints/qwen35_rl_taskcool_sft268_env5_accum4_20260605/current
  • Cases: 100 mixed controller-feasible AndroidWorld tasks
  • Exclusions: Expense* and SimpleSms* were excluded because local AndroidWorld app setup currently leaves those apps in broken first-run states.

Overall result:

| Metric | Value |
| --- | --- |
| Success | 51/100 = 51.0% |
| Infra skips | 0 |
| Average steps | 11.9 |

By family:

| Family | Result |
| --- | --- |
| Audio record | 4/4 = 100% |
| Contacts | 11/11 = 100% |
| Clock | 9/10 = 90% |
| System | 16/21 = 76.2% |
| Recipe | 8/12 = 66.7% |
| Camera | 3/8 = 37.5% |
| Browser | 0/3 = 0% |
| Calendar | 0/8 = 0% |
| Files | 0/2 = 0% |
| Map/OsmAnd | 0/3 = 0% |
| Markor | 0/18 = 0% |
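
As a consistency check, the per-family counts can be summed to reproduce the overall figure. The snippet below only re-adds the numbers reported in the table; the variable names are illustrative.

```python
# (successes, cases) per family, copied from the table above.
family_results = {
    "Audio record": (4, 4),
    "Contacts": (11, 11),
    "Clock": (9, 10),
    "System": (16, 21),
    "Recipe": (8, 12),
    "Camera": (3, 8),
    "Browser": (0, 3),
    "Calendar": (0, 8),
    "Files": (0, 2),
    "Map/OsmAnd": (0, 3),
    "Markor": (0, 18),
}

total_success = sum(success for success, _ in family_results.values())
total_cases = sum(cases for _, cases in family_results.values())
success_rate = 100 * total_success / total_cases

print(f"{total_success}/{total_cases} = {success_rate:.1f}%")  # 51/100 = 51.0%
```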

Notable task-level behavior:

  • Works well: WiFi/Bluetooth toggles, stopwatch, contacts forms, audio recording, camera photo.
  • Partially works: recipe duplicate deletion.
  • Not solved: Markor/file workflows, calendar deletion, browser maze, OsmAnd, camera video, brightness slider.
  • Slider/range labels were added to the harness after most of this adapter's training had already finished, so brightness failures are expected and need targeted SFT/RL.

SFT Curriculum

The supervised curriculum that produced the pre-RL adapter was staged rather than trained as one mixed task from scratch:

  1. Tap/long-press grounding first, using randomized visible labels and constrained label decoding.
  2. History-aware tap selection next, with up to four prior frames/actions.
  3. Full action grammar after grounding stabilized: T/P, K, scroll, wait, and system actions.
  4. Larger effective batches were used once answer-position-only loss (loss computed only on answer tokens) made memory use practical.
  5. Training kept random screen samples, randomized labels, and held-out shard evals to catch overfit.
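
Answer-position-only loss in step 4 is commonly implemented by masking prompt positions in the label sequence. The sketch below assumes the usual convention of an ignore index of -100 (the default `ignore_index` of PyTorch's cross-entropy loss); the function name and the `answer_start` parameter are hypothetical, and the README does not state how the actual trainer builds its mask.

```python
from typing import List

IGNORE_INDEX = -100  # positions with this label are skipped by the loss


def mask_prompt_labels(input_ids: List[int], answer_start: int) -> List[int]:
    """Return labels that keep only answer tokens; prompt tokens are ignored."""
    if not 0 <= answer_start <= len(input_ids):
        raise ValueError("answer_start is outside the sequence")
    return [IGNORE_INDEX] * answer_start + list(input_ids[answer_start:])


# Toy example: three prompt tokens followed by two answer tokens.
labels = mask_prompt_labels([10, 11, 12, 13, 14], answer_start=3)
print(labels)  # [-100, -100, -100, 13, 14]
```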

Important SFT choices:

  • Current screenshot width 464; history screenshot width 232.
  • Up to four history frames with images.
  • Older history can be summarized as compact action text.
  • Accessibility trees are compacted, then capped/filtered so that target-bearing examples are not silently truncated.
  • T/P labels are generated from the current visible label set only.
  • K is free text and is not constrained to the label set.
  • The best SFT checkpoint before RL was preserved separately as a fallback milestone.
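
The history choices above can be sketched as a small buffer that keeps the last four steps with frames and demotes older steps to action text. This is an illustrative sketch only: `HistoryBuffer`, `Step`, and `scaled_size` are hypothetical names, and aspect-preserving resizing is an assumption (the README gives only the target widths).

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List, Tuple

CURRENT_WIDTH = 464      # width of the current screenshot (from the README)
HISTORY_WIDTH = 232      # width of history screenshots (from the README)
MAX_IMAGE_HISTORY = 4    # history frames kept with images (from the README)


def scaled_size(width: int, height: int, target_width: int) -> Tuple[int, int]:
    """Target size for an aspect-preserving resize (assumed behavior)."""
    scale = target_width / width
    return target_width, max(1, round(height * scale))


@dataclass
class Step:
    action: str                  # e.g. "T Settings"
    frame_size: Tuple[int, int]  # original screenshot (width, height)


class HistoryBuffer:
    """Keeps recent steps with frames; older steps survive as action text."""

    def __init__(self, max_frames: int = MAX_IMAGE_HISTORY) -> None:
        self.max_frames = max_frames
        self.recent: Deque[Step] = deque()
        self.older_actions: List[str] = []

    def push(self, step: Step) -> None:
        self.recent.append(step)
        if len(self.recent) > self.max_frames:
            evicted = self.recent.popleft()
            self.older_actions.append(evicted.action)  # drop the image

    def summary(self) -> str:
        return " ; ".join(self.older_actions)
```

For a 1080x2400 screenshot, `scaled_size` gives 464x1031 for the current frame and 232x516 for a history frame under these assumptions.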

RL Curriculum

The RL curriculum is online AndroidWorld training through the collector APK harness. It uses the same prompt/action grammar as SFT, but the state comes from live emulators rather than saved dataset rows.

Current RL setup:

  • 5 Android emulators in parallel.
  • Collector APK provides screenshots, accessibility tree, range metadata, and action-frame history.
  • Target app is preopened for each episode.
  • Maximum 20 actions per episode.
  • Four rollout batches are accumulated before one optimizer update.
  • Optimizer: paged_adamw_8bit.
  • Learning rate: 1e-6.
  • Sampling temperatures: action 0.8, label 0.7.
  • Bucket weights: tp=1.5,k=1.0,scroll=1.2,wait=0.5,system=1.0.
  • Curriculum state is persisted in rl_runtime_state.json so restarts keep the baseline, cooldowns, and replay counts.
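
The bucket-weight string above can be parsed into a lookup table with a few lines of code. The sketch below is hypothetical: the mapping from action codes to buckets is an assumption inferred from the bucket names, and the README does not say which bucket (if any) the terminal actions F and I belong to, so they fall back to a default weight here.

```python
from typing import Dict


def parse_bucket_weights(spec: str) -> Dict[str, float]:
    """Parse a 'name=value,name=value' string into a weight table."""
    weights: Dict[str, float] = {}
    for item in spec.split(","):
        name, sep, value = item.partition("=")
        if not sep:
            raise ValueError(f"malformed bucket entry: {item!r}")
        weights[name.strip()] = float(value)
    return weights


# Assumed mapping from action code to bucket name (not stated in the README).
ACTION_BUCKET = {
    "T": "tp", "P": "tp",
    "K": "k",
    "U": "scroll", "D": "scroll", "L": "scroll", "R": "scroll",
    "W": "wait",
    "B": "system", "H": "system",
}


def action_weight(code: str, weights: Dict[str, float], default: float = 1.0) -> float:
    """Look up the weight for an action code, using `default` if unmapped."""
    bucket = ACTION_BUCKET.get(code)
    if bucket is None:
        return default
    return weights.get(bucket, default)


weights = parse_bucket_weights("tp=1.5,k=1.0,scroll=1.2,wait=0.5,system=1.0")
print(action_weight("T", weights))  # 1.5
print(action_weight("W", weights))  # 0.5
```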

Reward/curriculum policy:

  • Success is AndroidWorld task success.
  • The trainer tracks a moving reward baseline and trains from advantage over that baseline.
  • Tasks that repeatedly fail can enter cooldown.
  • Easy tasks are downweighted but not removed, to reduce forgetting.
  • Successful trajectories are kept for positive replay.
  • Negative shaping is intentionally light in this checkpoint: small penalties for execution errors, invalid typing, missing tap targets, false terminal, and leaving the target app; repeat/no-change/dead-scroll penalties are currently disabled.
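
A moving baseline with advantage, plus a failure cooldown, can be sketched as follows. Everything specific here is an assumption for illustration: the exponential-moving-average form, the decay of 0.9, the failure threshold, the cooldown length, and the JSON layout are hypothetical and are not taken from the trainer or from rl_runtime_state.json.

```python
import json
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import Dict


@dataclass
class CurriculumState:
    baseline: float = 0.0
    decay: float = 0.9  # hypothetical EMA decay
    fail_streak: Dict[str, int] = field(default_factory=dict)
    cooldown: Dict[str, int] = field(default_factory=dict)

    def advantage(self, reward: float) -> float:
        """Advantage over the current baseline; then update the baseline."""
        adv = reward - self.baseline
        self.baseline = self.decay * self.baseline + (1 - self.decay) * reward
        return adv

    def record(self, task: str, success: bool,
               max_fails: int = 3, cooldown_episodes: int = 5) -> None:
        """Put a task on cooldown after `max_fails` consecutive failures."""
        if success:
            self.fail_streak[task] = 0
            return
        streak = self.fail_streak.get(task, 0) + 1
        self.fail_streak[task] = streak
        if streak >= max_fails:
            self.cooldown[task] = cooldown_episodes
            self.fail_streak[task] = 0

    def save(self, path: str) -> None:
        """Persist the state as JSON so a restart can resume from it."""
        Path(path).write_text(json.dumps(asdict(self)))

    @classmethod
    def load(cls, path: str) -> "CurriculumState":
        return cls(**json.loads(Path(path).read_text()))
```

Persisting the baseline matters because a restart with a zero baseline would treat every early success as a large positive advantage.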

Included Harness Code

This upload includes the runtime/eval harness sources under harness_code/:

  • rl_harness/: policy wrapper, APK/ADB state collection, AndroidWorld action bridge, history buffer
  • scripts/run_android_world_parallel_eval.py: 5-emulator deterministic eval
  • scripts/train_android_world_rl_qwen35_parallel.py: online RL trainer
  • scripts/run_android_world_curriculum.py: single-env rollout/eval runner
  • scripts/setup_android_world_device.py: AndroidWorld app setup helper
  • android/collector_apk/: collector accessibility service used for screenshots/tree/range metadata
  • configs/: current AndroidWorld curriculum/eval splits
  • docs/: Docker/AndroidWorld and RL harness notes

Loading

python

from peft import PeftModel
from transformers import AutoModelForImageTextToText, AutoProcessor

# Load the base vision-language model, then attach the LoRA adapter.
base = AutoModelForImageTextToText.from_pretrained(
    "Qwen/Qwen3.5-0.8B",
    torch_dtype="auto",
    device_map="auto",
)
model = PeftModel.from_pretrained(base, "dmitchelljackson/cerebellum-qwen35-history-actions-lora")
processor = AutoProcessor.from_pretrained("Qwen/Qwen3.5-0.8B")
model.eval()  # inference mode

For correct behavior, use the constrained decoding in harness_code/rl_harness/policy_qwen35.py.

Limitations

This is an experimental local controller checkpoint, not a complete Android agent. It still lacks reliable finish (F) behavior, robust slider control, and long-horizon planning for file/calendar/map/browser/Markor workflows. It should be treated as a runtime milestone for further SFT/RL rather than a production-ready model.

Model provider: dmitchelljackson

Base model: Qwen/Qwen3.5-0.8B

Adapter: this model

Input modalities: Video, Text, Image

Output modalities: Text
