p-doom/idm API & Inference Endpoint

Input format

Provide one chat message with 10 images sampled at 5 FPS. Each image should be preceded by a text label:

text
Frame F00: <image>
Frame F01: <image>
...
Frame F09: <image>

The frame labels are text anchors in the message, not labels rendered into the image pixels.

Output format

The model emits only a JSON array:

json
[
  {"frame": "F02", "type": "MouseMove", "details": "120,45"},
  {"frame": "F03", "type": "MouseClick", "details": "Left"},
  {"frame": "F05", "type": "KeyPress", "details": "Cmd+S"},
  {"frame": "F07", "type": "MouseScroll", "details": "-150"}
]

Action types:

KeyPress: key name with modifiers, e.g. Cmd+S, Enter, A
MouseClick: Left, Right, or Middle
MouseMove: normalized dx,dy, where 1000 is full screen width/height
MouseScroll: normalized signed scroll magnitude

Frame attribution: if an effect first appears between F_K and F_{K+1}, report the action on F_K, the last pre-action frame.

Training and evaluation

Base model: Qwen/Qwen3-VL-8B-Instruct
Data: macOS crowd-cast paired screencasts and OS input logs
Training: LoRA on language and vision modules, merged after 5,000 steps
Eval: 44 manually verified macOS productivity clips
Result: F1 0.86, MouseMove R² 0.66, MouseMove cosine 0.99

Limitations

The model was trained on macOS productivity recordings. It can confuse OS-specific shortcuts such as Cmd vs Ctrl, and it only predicts actions that are visible or inferable from screen pixels at 5 FPS.

idm

Get help setting up a custom Dedicated Endpoints.

README

Input format

Output format

Training and evaluation

Limitations

Explore FriendliAI today

idm