dmitchelljackson

cerebellum-qwen35-history-actions-lora

README

License: apache-2.0

Action Grammar

text
T <label>  tap visible labelled element
P <label>  long-press visible labelled element
K <text>   type text into focused field
U/D/L/R    scroll up/down/left/right
B          Android back
H          Android home
W          wait
F          done
I          impossible
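
The grammar above is small enough to parse with a few lines of Python. The following is a minimal sketch, not the harness's own parser: the `Action` dataclass and `parse_action` function are hypothetical names introduced here for illustration, and the sketch assumes the code and its argument are separated by a single space.

```python
from dataclasses import dataclass
from typing import Optional

# Codes that carry an argument, and codes that stand alone (from the grammar above).
ARG_CODES = {"T", "P", "K"}
BARE_CODES = {"U", "D", "L", "R", "B", "H", "W", "F", "I"}


@dataclass(frozen=True)
class Action:
    """Hypothetical container for one parsed action."""
    code: str
    arg: Optional[str] = None


def parse_action(line: str) -> Action:
    """Parse one action string such as 'T Settings' or 'B'."""
    line = line.strip()
    if not line:
        raise ValueError("empty action")
    # Split on the first space only, so K text may itself contain spaces.
    code, _, rest = line.partition(" ")
    if code in BARE_CODES:
        if rest.strip():
            raise ValueError(f"{code} takes no argument")
        return Action(code)
    if code in ARG_CODES:
        if not rest:
            raise ValueError(f"{code} requires an argument")
        return Action(code, rest)
    raise ValueError(f"unknown action code: {code!r}")
```

For example, `parse_action("K hello world")` yields an action with code `K` and argument `hello world`, while `parse_action("T")` raises because a tap needs a label.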

At inference time:

The first token is constrained to valid action codes.
T and P labels are constrained to labels visible on the current screen.
K text is generated freely until EOS.
Single-token actions terminate immediately.
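
The four rules above can be sketched as a function that returns the set of symbols allowed after a given prefix. This is a character-level illustration only: the real implementation in harness_code/rl_harness/policy_qwen35.py presumably works on tokenizer IDs, and the `allowed_next` function and `EOS` placeholder are assumptions made for this example.

```python
from typing import List, Optional, Set

ACTION_CODES = ("T", "P", "K", "U", "D", "L", "R", "B", "H", "W", "F", "I")
LABEL_CODES = ("T", "P")
EOS = "<eos>"  # placeholder end-of-sequence symbol (assumption)


def allowed_next(prefix: str, visible_labels: List[str]) -> Optional[Set[str]]:
    """Return the symbols allowed after `prefix`; None means unconstrained."""
    if prefix == "":
        # Rule 1: the first symbol must be a valid action code.
        return set(ACTION_CODES)
    code = prefix[0]
    if code in LABEL_CODES:
        if len(prefix) == 1:
            return {" "}
        typed = prefix[2:]  # label text generated so far
        allowed: Set[str] = set()
        for label in visible_labels:
            if label.startswith(typed):
                # Rule 2: continue along a visible label, or stop once one is complete.
                allowed.add(EOS if label == typed else label[len(typed)])
        return allowed
    if code == "K":
        # Rule 3: typed text is free-form after the separator.
        return {" "} if len(prefix) == 1 else None
    # Rule 4: single-symbol actions terminate immediately.
    return {EOS}
```

In a real decoder the returned set would be turned into a logit mask before sampling; when two visible labels share a prefix (for example `Wi-Fi` and `Wi-Fi settings`), both stopping and continuing stay allowed.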

Current Evaluation

Latest live AndroidWorld/APK eval:

Split: configs/android_world_mixed100_eval_infraclean_20260609.json
Runner: scripts/run_android_world_parallel_eval.py
Harness: 5 Android emulators, collector APK state source, target app preopened
Adapter: checkpoints/qwen35_rl_taskcool_sft268_env5_accum4_20260605/current
Cases: 100 mixed controller-feasible AndroidWorld tasks
Exclusions: Expense* and SimpleSms* tasks, because local AndroidWorld app setup currently leaves those apps in broken first-run states.

Overall result:

Metric	Value
Success	51/100 = 51.0%
Infra skips	0
Average steps	11.9

By family (the families listed account for 79 of the 100 cases and all 51 successes):

Family	Result
Audio record	4/4 = 100%
Contacts	11/11 = 100%
Clock	9/10 = 90%
System	16/21 = 76.2%
Recipe	8/12 = 66.7%
Camera	3/8 = 37.5%
Browser	0/3 = 0%
Calendar	0/8 = 0%
Files	0/2 = 0%
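
The per-family rows can be re-tallied with a short script. The numbers below are copied from the table; the `FAMILY_RESULTS` name and `rate` helper are introduced here for illustration. The listed rows sum to 79 cases, so the other 21 cases evidently belong to families the table does not break out (an inference from the arithmetic, not something the README states).

```python
# (successes, cases) per family, copied from the table above.
FAMILY_RESULTS = {
    "Audio record": (4, 4),
    "Contacts": (11, 11),
    "Clock": (9, 10),
    "System": (16, 21),
    "Recipe": (8, 12),
    "Camera": (3, 8),
    "Browser": (0, 3),
    "Calendar": (0, 8),
    "Files": (0, 2),
}


def rate(successes: int, cases: int) -> float:
    """Success rate as a percentage, rounded to one decimal place."""
    return round(100.0 * successes / cases, 1)


total_successes = sum(s for s, _ in FAMILY_RESULTS.values())
total_cases = sum(c for _, c in FAMILY_RESULTS.values())

for family, (s, c) in FAMILY_RESULTS.items():
    print(f"{family}\t{s}/{c} = {rate(s, c)}%")
print(f"Listed families\t{total_successes}/{total_cases}")
```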

Notable task-level behavior:

Works well: WiFi/Bluetooth toggles, stopwatch, contacts forms, audio recording, camera photo.
Partially works: recipe duplicate deletion.
Not solved: Markor/file workflows, calendar deletion, browser maze, OsmAnd, camera video, brightness slider.
Slider/range labels were added to the harness after most of this adapter's training was complete, so brightness failures are expected and need targeted SFT/RL.

SFT Curriculum

The supervised curriculum that produced the pre-RL adapter was staged rather than trained as one mixed task from scratch:

Tap/long-press grounding first, using randomized visible labels and constrained label decoding.
History-aware tap selection next, with up to four prior frames/actions.
Full action grammar after grounding stabilized: T/P, K, scroll, wait, and system actions.
Larger effective batches were used once answer-position-only loss made memory practical.
Training kept random screen samples, randomized labels, and held-out shard evals to catch overfit.
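
The README does not define "answer-position-only loss". One common reading is that only the action tokens contribute to the loss, with prompt positions masked out. The sketch below shows that masking under this assumption; `build_labels` is a hypothetical helper, and -100 is the default `ignore_index` of PyTorch's cross-entropy loss.

```python
from typing import List, Sequence, Tuple

IGNORE_INDEX = -100  # default ignore_index of torch.nn.CrossEntropyLoss


def build_labels(
    prompt_ids: Sequence[int], answer_ids: Sequence[int]
) -> Tuple[List[int], List[int]]:
    """Concatenate prompt and answer, masking prompt positions out of the loss."""
    input_ids = list(prompt_ids) + list(answer_ids)
    labels = [IGNORE_INDEX] * len(prompt_ids) + list(answer_ids)
    return input_ids, labels


def answer_positions(labels: Sequence[int]) -> List[int]:
    """Indices that contribute to the loss, i.e. the unmasked answer tokens."""
    return [i for i, tok in enumerate(labels) if tok != IGNORE_INDEX]
```

With a long multimodal prompt and an answer of only a few tokens, the set of supervised positions is tiny, which is consistent with the memory saving the curriculum notes mention.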

Important SFT choices:

Current screenshot width 464; history screenshot width 232.
Up to four history frames with images.
Older history can be summarized as compact action text.
Accessibility trees are compacted, then capped or filtered, rather than silently truncated in a way that could cut off target-bearing examples.
T/P labels are generated from the current visible label set only.
K is free text and is not constrained to the label set.
The best SFT checkpoint before RL was preserved separately as a fallback milestone.
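
The history choices above can be sketched as a small buffer: the four most recent steps keep their screenshots, and anything older collapses to action text. This is an illustrative sketch rather than the code in rl_harness/; `HistoryBuffer`, `scaled_size`, and the `" ; "` separator are hypothetical, and the sketch assumes screenshots are resized with their aspect ratio preserved.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Deque, List, Tuple

CURRENT_WIDTH = 464      # width of the current screenshot
HISTORY_WIDTH = 232      # width of each history screenshot
MAX_IMAGE_HISTORY = 4    # history frames that keep their image


def scaled_size(width: int, height: int, target_width: int) -> Tuple[int, int]:
    """Target size for a resize that preserves aspect ratio (assumption)."""
    return target_width, round(height * target_width / width)


@dataclass
class HistoryBuffer:
    # (frame_id, action) pairs for the most recent steps, oldest first.
    frames: Deque[Tuple[str, str]] = field(
        default_factory=lambda: deque(maxlen=MAX_IMAGE_HISTORY)
    )
    # Actions whose screenshots have been dropped.
    older_actions: List[str] = field(default_factory=list)

    def push(self, frame_id: str, action: str) -> None:
        if len(self.frames) == self.frames.maxlen:
            # The oldest frame is about to be evicted; keep only its action text.
            self.older_actions.append(self.frames[0][1])
        self.frames.append((frame_id, action))

    def summary(self) -> str:
        """Compact text form of the history that no longer has images."""
        return " ; ".join(self.older_actions)
```

For a 1080x2400 screen, `scaled_size(1080, 2400, CURRENT_WIDTH)` gives `(464, 1031)` and the history width gives `(232, 516)`, so each history frame has roughly a quarter of the pixels of the current frame.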

RL Curriculum

The RL curriculum is online AndroidWorld training through the collector APK harness. It uses the same prompt/action grammar as SFT, but the state comes from live emulators rather than saved dataset rows.

Current RL setup:

5 Android emulators in parallel.
Collector APK provides screenshots, accessibility tree, range metadata, and action-frame history.
Target app is preopened for each episode.
Maximum 20 actions per episode.
Four rollout batches are accumulated before one optimizer update.
Optimizer: paged_adamw_8bit.
Learning rate: 1e-6.
Sampling temperatures: action 0.8, label 0.7.
Bucket weights: tp=1.5,k=1.0,scroll=1.2,wait=0.5,system=1.0.
Curriculum state is persisted in rl_runtime_state.json so restarts keep the baseline, cooldowns, and replay counts.
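
The settings above can be collected into one config object. The values are taken from the list; the `RLConfig` field names, `parse_bucket_weights`, and `should_step` are hypothetical and are not the argument names of scripts/train_android_world_rl_qwen35_parallel.py.

```python
from dataclasses import dataclass
from typing import Dict


@dataclass(frozen=True)
class RLConfig:
    num_emulators: int = 5
    max_actions_per_episode: int = 20
    accum_batches: int = 4            # rollout batches per optimizer update
    optimizer: str = "paged_adamw_8bit"
    learning_rate: float = 1e-6
    action_temperature: float = 0.8
    label_temperature: float = 0.7
    bucket_weights: str = "tp=1.5,k=1.0,scroll=1.2,wait=0.5,system=1.0"


def parse_bucket_weights(spec: str) -> Dict[str, float]:
    """Turn 'tp=1.5,k=1.0,...' into a name -> weight mapping."""
    weights: Dict[str, float] = {}
    for item in spec.split(","):
        name, _, value = item.partition("=")
        weights[name.strip()] = float(value)
    return weights


def should_step(batch_index: int, accum_batches: int) -> bool:
    """True on every accum_batches-th rollout batch (1-indexed)."""
    return batch_index % accum_batches == 0
```

With `accum_batches=4`, gradients from rollout batches 1 to 4 are summed before the first optimizer update, batches 5 to 8 before the second, and so on.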

Reward/curriculum policy:

Success is AndroidWorld task success.
The trainer tracks a moving reward baseline and trains from advantage over that baseline.
Tasks that repeatedly fail can enter cooldown.
Easy tasks are downweighted but not removed, to reduce forgetting.
Successful trajectories are kept for positive replay.
Negative shaping is intentionally light in this checkpoint. Small penalties apply for execution errors, invalid typing, missing tap targets, false terminal actions, and leaving the target app; repeat, no-change, and dead-scroll penalties are currently disabled.
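
Two pieces of this policy, the moving baseline and task cooldown, can be sketched as follows. The README does not say how the baseline is averaged or when a task enters cooldown, so the exponential moving average, the `momentum` value, and the `fail_limit` and `cooldown_rounds` thresholds are all assumptions made for illustration.

```python
from typing import Dict


class MovingBaseline:
    """Exponential moving average of reward (averaging scheme is an assumption)."""

    def __init__(self, momentum: float = 0.9, initial: float = 0.0) -> None:
        self.momentum = momentum
        self.value = initial

    def advantage(self, reward: float) -> float:
        """Return reward minus the current baseline, then update the baseline."""
        adv = reward - self.value
        self.value = self.momentum * self.value + (1.0 - self.momentum) * reward
        return adv


class CooldownTracker:
    """Pause a task for a few rounds after repeated failures (hypothetical thresholds)."""

    def __init__(self, fail_limit: int = 3, cooldown_rounds: int = 5) -> None:
        self.fail_limit = fail_limit
        self.cooldown_rounds = cooldown_rounds
        self.fails: Dict[str, int] = {}
        self.cooling: Dict[str, int] = {}

    def record(self, task: str, success: bool) -> None:
        if success:
            self.fails[task] = 0
            return
        self.fails[task] = self.fails.get(task, 0) + 1
        if self.fails[task] >= self.fail_limit:
            self.cooling[task] = self.cooldown_rounds
            self.fails[task] = 0

    def tick(self) -> None:
        """Advance one round; tasks whose cooldown expires become available again."""
        self.cooling = {t: n - 1 for t, n in self.cooling.items() if n > 1}

    def is_available(self, task: str) -> bool:
        return task not in self.cooling
```

Training from the advantage rather than the raw reward means a success on a task the policy usually solves produces a small update, while a success on a rarely solved task produces a larger one.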

Included Harness Code

This upload includes the runtime/eval harness sources under harness_code/:

rl_harness/: policy wrapper, APK/ADB state collection, AndroidWorld action bridge, history buffer
scripts/run_android_world_parallel_eval.py: 5-emulator deterministic eval
scripts/train_android_world_rl_qwen35_parallel.py: online RL trainer
scripts/run_android_world_curriculum.py: single-env rollout/eval runner
scripts/setup_android_world_device.py: AndroidWorld app setup helper
android/collector_apk/: collector accessibility service used for screenshots/tree/range metadata
configs/: current AndroidWorld curriculum/eval splits
docs/: Docker/AndroidWorld and RL harness notes

Loading

python
from peft import PeftModel
from transformers import AutoModelForImageTextToText, AutoProcessor

base = AutoModelForImageTextToText.from_pretrained(
    "Qwen/Qwen3.5-0.8B",
    torch_dtype="auto",
    device_map="auto",
)
model = PeftModel.from_pretrained(base, "dmitchelljackson/cerebellum-qwen35-history-actions-lora")
processor = AutoProcessor.from_pretrained("Qwen/Qwen3.5-0.8B")
model.eval()

For correct behavior, use the constrained decoding in harness_code/rl_harness/policy_qwen35.py.

Limitations

This is an experimental local controller checkpoint, not a complete Android agent. It still lacks reliable finish (F) behavior, robust slider control, and long-horizon planning for file/calendar/map/browser/Markor workflows. It should be treated as a runtime milestone for further SFT/RL rather than a production-ready model.
