dmitchelljackson
cerebellum-qwen35-history-actions-lora
License: apache-2.0
Action Grammar
```text
T <label>   tap visible labelled element
P <label>   long-press visible labelled element
K <text>    type text into focused field
U/D/L/R     scroll up/down/left/right
B           Android back
H           Android home
W           wait
F           done
I           impossible
```
At inference time:
- The first token is constrained to valid action codes.
- `T` and `P` labels are constrained to labels visible on the current screen.
- `K` text is generated freely until EOS.
- Single-token actions terminate immediately.
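The grammar and its constraints can also be checked after decoding. The following is a minimal sketch of such a validator, not the harness implementation: `Action` and `parse_action` are hypothetical names, and rejecting an empty `K` argument is an assumption.

```python
from dataclasses import dataclass
from typing import Optional, Set

# Action codes taken from the grammar above.
LABEL_ACTIONS = {"T", "P"}         # require a label visible on screen
TEXT_ACTIONS = {"K"}               # require free text
SINGLE_ACTIONS = set("UDLRBHWFI")  # single-token actions, no argument


@dataclass
class Action:
    code: str
    arg: Optional[str] = None


def parse_action(raw: str, visible_labels: Set[str]) -> Action:
    """Parse one action line and enforce the grammar constraints."""
    raw = raw.strip()
    if not raw:
        raise ValueError("empty action")
    code, _, arg = raw.partition(" ")
    arg = arg.strip()
    if code in SINGLE_ACTIONS:
        if arg:
            raise ValueError(f"{code} takes no argument")
        return Action(code)
    if code in LABEL_ACTIONS:
        if arg not in visible_labels:
            raise ValueError(f"label {arg!r} is not visible")
        return Action(code, arg)
    if code in TEXT_ACTIONS:
        if not arg:  # assumption: empty typing is treated as invalid
            raise ValueError("K needs text")
        return Action(code, arg)
    raise ValueError(f"unknown action code {code!r}")
```

For example, `parse_action("T Settings", {"Settings", "Wi-Fi"})` returns `Action("T", "Settings")`, while `parse_action("T Missing", {"Settings"})` raises `ValueError`.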
Current Evaluation
Latest live AndroidWorld/APK eval:
- Split: `configs/android_world_mixed100_eval_infraclean_20260609.json`
- Runner: `scripts/run_android_world_parallel_eval.py`
- Harness: 5 Android emulators, collector APK state source, target app preopened
- Adapter: `checkpoints/qwen35_rl_taskcool_sft268_env5_accum4_20260605/current`
- Cases: 100 mixed controller-feasible AndroidWorld tasks
- Exclusions: `Expense*` and `SimpleSms*` were excluded because local AndroidWorld app setup currently leaves those apps in broken first-run states.
Overall result:
| Metric | Value |
|---|---|
| Success | 51/100 = 51.0% |
| Infra skips | 0 |
| Average steps | 11.9 |
By family:
| Family | Result |
|---|---|
| Audio record | 4/4 = 100% |
| Contacts | 11/11 = 100% |
| Clock | 9/10 = 90% |
| System | 16/21 = 76.2% |
| Recipe | 8/12 = 66.7% |
| Camera | 3/8 = 37.5% |
| Browser | 0/3 = 0% |
| Calendar | 0/8 = 0% |
| Files | 0/2 = 0% |
| Map/OsmAnd | 0/3 = 0% |
| Markor | 0/18 = 0% |
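As a consistency check, the per-family rows can be summed to recover the overall figure. This small sketch only re-adds the numbers from the table above; the function names are illustrative.

```python
# (successes, cases) per task family, copied from the table above.
FAMILY_RESULTS = {
    "Audio record": (4, 4),
    "Contacts": (11, 11),
    "Clock": (9, 10),
    "System": (16, 21),
    "Recipe": (8, 12),
    "Camera": (3, 8),
    "Browser": (0, 3),
    "Calendar": (0, 8),
    "Files": (0, 2),
    "Map/OsmAnd": (0, 3),
    "Markor": (0, 18),
}


def overall(results):
    """Return (successes, cases, success percentage) across all families."""
    wins = sum(w for w, _ in results.values())
    cases = sum(n for _, n in results.values())
    return wins, cases, 100.0 * wins / cases


def unsolved_cases(results):
    """Number of cases that belong to families with zero successes."""
    return sum(n for w, n in results.values() if w == 0)


wins, cases, pct = overall(FAMILY_RESULTS)
print(f"{wins}/{cases} = {pct:.1f}%")  # 51/100 = 51.0%
print(unsolved_cases(FAMILY_RESULTS))  # 34
```

The five families with no successes account for 34 of the 49 failed cases, with Markor alone contributing 18.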
Notable task-level behavior:
- Works well: WiFi/Bluetooth toggles, stopwatch, contacts forms, audio recording, camera photo.
- Partially works: recipe duplicate deletion.
- Not solved: Markor/file workflows, calendar deletion, browser maze, OsmAnd, camera video, brightness slider.
- Slider/range labels were added to the harness after most of this adapter's training was complete, so brightness failures are expected and need targeted SFT/RL.
SFT Curriculum
The supervised curriculum that produced the pre-RL adapter was staged rather than trained as one mixed task from scratch:
- Tap/long-press grounding first, using randomized visible labels and constrained label decoding.
- History-aware tap selection next, with up to four prior frames/actions.
- Full action grammar after grounding stabilized: `T`/`P`, `K`, scroll, wait, and system actions.
- Larger effective batches were used once answer-position-only loss made memory practical.
- Training kept random screen samples, randomized labels, and held-out shard evals to catch overfit.
Important SFT choices:
- Current screenshot width `464`; history screenshot width `232`.
- Up to four history frames with images.
- Older history can be summarized as compact action text.
- Accessibility trees are compacted but capped/filtered rather than silently truncating target-bearing examples.
- `T`/`P` labels are generated from the current visible label set only.
- `K` is free text and is not constrained to the label set.
- The best SFT checkpoint before RL was preserved separately as a fallback milestone.
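The history handling described above (up to four recent frames kept with images, older steps reduced to action text) could be sketched as follows. This is an illustration under assumptions, not the harness's history buffer: `Step`, `HistoryBuffer`, `screenshot_id`, and the `" ; "` separator are all hypothetical.

```python
from collections import deque
from dataclasses import dataclass
from typing import Deque, List, Tuple

CURRENT_WIDTH = 464   # current screenshot width, from the list above
HISTORY_WIDTH = 232   # history screenshot width, from the list above
MAX_IMAGE_FRAMES = 4  # "up to four history frames with images"


@dataclass
class Step:
    screenshot_id: str  # hypothetical handle to a stored screenshot
    action: str         # action text, e.g. "T Settings"


class HistoryBuffer:
    """Keeps recent steps with images and older steps as action text only."""

    def __init__(self, max_image_frames: int = MAX_IMAGE_FRAMES):
        self.max_image_frames = max_image_frames
        self.recent: Deque[Step] = deque()
        self.older_actions: List[str] = []

    def push(self, step: Step) -> None:
        self.recent.append(step)
        # Evicted frames lose their image and survive only as action text.
        while len(self.recent) > self.max_image_frames:
            evicted = self.recent.popleft()
            self.older_actions.append(evicted.action)

    def prompt_parts(self) -> Tuple[str, List[Tuple[str, int, str]]]:
        """Return (text summary of older actions, recent frames with width)."""
        summary = " ; ".join(self.older_actions)
        frames = [(s.screenshot_id, HISTORY_WIDTH, s.action) for s in self.recent]
        return summary, frames
```

After six pushed steps, the two oldest appear only in the text summary and the last four are returned as frames to be rendered at width 232.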
RL Curriculum
The RL curriculum is online AndroidWorld training through the collector APK harness. It uses the same prompt/action grammar as SFT, but the state comes from live emulators rather than saved dataset rows.
Current RL setup:
- 5 Android emulators in parallel.
- Collector APK provides screenshots, accessibility tree, range metadata, and action-frame history.
- Target app is preopened for each episode.
- Maximum 20 actions per episode.
- Four rollout batches are accumulated before one optimizer update.
- Optimizer: `paged_adamw_8bit`.
- Learning rate: `1e-6`.
- Sampling temperatures: action `0.8`, label `0.7`.
- Bucket weights: `tp=1.5`, `k=1.0`, `scroll=1.2`, `wait=0.5`, `system=1.0`.
- Curriculum state is persisted in `rl_runtime_state.json` so restarts keep the baseline, cooldowns, and replay counts.
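One way the bucket weights and the four-batch accumulation might fit together is sketched below. The mapping from action codes to buckets is an assumption (in particular, placing `F` and `I` under `system`), as is normalizing by the sum of weights; the actual trainer may do either differently.

```python
from typing import Iterable, Tuple

# Values copied from the list above.
BUCKET_WEIGHTS = {"tp": 1.5, "k": 1.0, "scroll": 1.2, "wait": 0.5, "system": 1.0}
ACCUM_BATCHES = 4

# Assumed mapping from action code to bucket. Placing F and I under
# "system" is a guess; the source only names the five buckets.
CODE_TO_BUCKET = {
    "T": "tp", "P": "tp",
    "K": "k",
    "U": "scroll", "D": "scroll", "L": "scroll", "R": "scroll",
    "W": "wait",
    "B": "system", "H": "system", "F": "system", "I": "system",
}


def weighted_loss(samples: Iterable[Tuple[str, float]]) -> float:
    """Weight-normalized mean of per-sample losses.

    `samples` holds (action_code, loss) pairs.
    """
    total, norm = 0.0, 0.0
    for code, loss in samples:
        weight = BUCKET_WEIGHTS[CODE_TO_BUCKET[code]]
        total += weight * loss
        norm += weight
    if norm == 0.0:
        raise ValueError("no samples")
    return total / norm


def should_step(batch_index: int) -> bool:
    """True on every fourth rollout batch (0-based index)."""
    return (batch_index + 1) % ACCUM_BATCHES == 0
```

With one tap sample of loss 1.0 and one wait sample of loss 3.0, `weighted_loss` returns 1.5 rather than the unweighted mean of 2.0, because the wait sample counts a third as much as the tap.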
Reward/curriculum policy:
- Success is AndroidWorld task success.
- The trainer tracks a moving reward baseline and trains from advantage over that baseline.
- Tasks that repeatedly fail can enter cooldown.
- Easy tasks are downweighted but not removed, to reduce forgetting.
- Successful trajectories are kept for positive replay.
- Negative shaping is intentionally light in this checkpoint: small penalties for execution errors, invalid typing, missing tap targets, false terminal, and leaving the target app; repeat/no-change/dead-scroll penalties are currently disabled.
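The baseline and cooldown policy can be illustrated with a short sketch. The exponential moving average, the momentum value, and the cooldown thresholds are all assumptions chosen for illustration; the source states only that a moving baseline and cooldowns exist.

```python
class RewardBaseline:
    """Moving reward baseline; the EMA form is an assumption."""

    def __init__(self, momentum: float = 0.9, initial: float = 0.0):
        self.momentum = momentum
        self.value = initial

    def advantage(self, reward: float) -> float:
        """Return reward minus the current baseline, then update the baseline."""
        adv = reward - self.value
        self.value = self.momentum * self.value + (1.0 - self.momentum) * reward
        return adv


class CooldownTracker:
    """Puts a task on cooldown after repeated failures (hypothetical limits)."""

    def __init__(self, fail_limit: int = 3, cooldown_episodes: int = 5):
        self.fail_limit = fail_limit
        self.cooldown_episodes = cooldown_episodes
        self.fail_streak = {}
        self.cooling = {}

    def record(self, task: str, success: bool) -> None:
        if success:
            self.fail_streak[task] = 0
            return
        self.fail_streak[task] = self.fail_streak.get(task, 0) + 1
        if self.fail_streak[task] >= self.fail_limit:
            self.cooling[task] = self.cooldown_episodes
            self.fail_streak[task] = 0

    def tick(self) -> None:
        """Advance one episode; cooldown counters decrease toward zero."""
        for task in list(self.cooling):
            self.cooling[task] = max(0, self.cooling[task] - 1)

    def available(self, task: str) -> bool:
        return self.cooling.get(task, 0) == 0
```

Because the advantage is measured against the baseline, a success on a task the policy already solves reliably produces a small update, while a success after a run of failures produces a large one.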
Included Harness Code
This upload includes the runtime/eval harness sources under harness_code/:
- `rl_harness/`: policy wrapper, APK/ADB state collection, AndroidWorld action bridge, history buffer
- `scripts/run_android_world_parallel_eval.py`: 5-emulator deterministic eval
- `scripts/train_android_world_rl_qwen35_parallel.py`: online RL trainer
- `scripts/run_android_world_curriculum.py`: single-env rollout/eval runner
- `scripts/setup_android_world_device.py`: AndroidWorld app setup helper
- `android/collector_apk/`: collector accessibility service used for screenshots/tree/range metadata
- `configs/`: current AndroidWorld curriculum/eval splits
- `docs/`: Docker/AndroidWorld and RL harness notes
Loading
```python
from peft import PeftModel
from transformers import AutoModelForImageTextToText, AutoProcessor

base = AutoModelForImageTextToText.from_pretrained(
    "Qwen/Qwen3.5-0.8B",
    torch_dtype="auto",
    device_map="auto",
)
model = PeftModel.from_pretrained(base, "dmitchelljackson/cerebellum-qwen35-history-actions-lora")
processor = AutoProcessor.from_pretrained("Qwen/Qwen3.5-0.8B")
model.eval()
```
For correct behavior, use the constrained decoding in harness_code/rl_harness/policy_qwen35.py.
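To show what constrained decoding means here, the sketch below masks the first token to valid action codes and restricts label continuations with a prefix trie. It is a toy illustration that does not reproduce `policy_qwen35.py`: the whitespace tokenizer, the `<eos>` marker, and all function names are assumptions.

```python
import math
from typing import Callable, Dict, Iterable, List, Set

ACTION_CODES = ["T", "P", "K", "U", "D", "L", "R", "B", "H", "W", "F", "I"]
END = "<eos>"  # hypothetical end-of-label marker


def greedy_pick(scores: Dict[str, float], allowed: Set[str]) -> str:
    """Pick the highest-scoring token after masking disallowed tokens."""
    masked = {t: (s if t in allowed else -math.inf) for t, s in scores.items()}
    best = max(masked, key=masked.get)
    if masked[best] == -math.inf:
        raise ValueError("no allowed token has a score")
    return best


def build_trie(labels: Iterable[str], tokenize: Callable[[str], List[str]]) -> dict:
    """Build a prefix trie over tokenized labels."""
    trie: dict = {}
    for label in labels:
        node = trie
        for tok in tokenize(label):
            node = node.setdefault(tok, {})
        node[END] = {}  # marks that a complete label ends here
    return trie


def allowed_next(trie: dict, prefix: List[str]) -> Set[str]:
    """Tokens that keep the prefix consistent with some visible label."""
    node = trie
    for tok in prefix:
        node = node[tok]
    return set(node)


# Toy usage with a whitespace tokenizer standing in for the real one.
trie = build_trie(["Wi-Fi", "Wi-Fi settings", "Bluetooth"], str.split)
first = greedy_pick({"T": 0.1, "X": 5.0, "K": 0.3}, set(ACTION_CODES))
```

Here `first` is `"K"`, since the higher-scoring `"X"` is not an action code, and `allowed_next(trie, ["Wi-Fi"])` permits only `"settings"` or the end marker.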
Limitations
This is an experimental local controller checkpoint, not a complete Android agent.
It still lacks reliable finish (F) behavior, robust slider control, and long-horizon
planning for file/calendar/map/browser/Markor workflows. It should be treated as a
runtime milestone for further SFT/RL rather than a production-ready model.
Model provider: dmitchelljackson
Model tree: base Qwen/Qwen3.5-0.8B; adapter: this model
Modalities: input Video, Text, Image; output Text