README

License: apache-2.0

📚 Introduction

WebWorld is a large-scale open-web world model series for training and evaluating web agents. It is trained on 1M+ real-world web interaction trajectories via a scalable hierarchical data pipeline, supporting:

  • Long-horizon simulation (30+ steps)
  • Multi-format state representations: A11y Tree, HTML, XML, Markdown, and natural language
  • CoT-activated reasoning for transition prediction
  • Cross-domain generalization to code, GUI, and game environments

Agents trained on WebWorld-synthesized trajectories improve their success rate by +9.9% on MiniWob++ and +10.9% on WebArena (absolute gains, Qwen3-8B). When used for inference-time lookahead search, WebWorld outperforms GPT-5 as a world model.
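
To make the lookahead idea concrete, here is a minimal sketch of one-step lookahead with a world model: simulate each candidate action, score the predicted page, and keep the best. This is a generic illustration, not the authors' search procedure. `lookahead_select`, `simulate`, and `score` are hypothetical names; in practice `simulate` would wrap a WebWorld generation call and `score` would be whatever goal-progress estimator you supply.

```python
from typing import Callable, Sequence, Tuple


def lookahead_select(
    state: str,
    candidate_actions: Sequence[str],
    simulate: Callable[[str, str], str],
    score: Callable[[str], float],
) -> Tuple[str, str]:
    """Return (best_action, predicted_state) using one-step lookahead."""
    if not candidate_actions:
        raise ValueError("candidate_actions must not be empty")
    best_action, best_state, best_score = "", "", float("-inf")
    for action in candidate_actions:
        predicted = simulate(state, action)  # world-model call (assumed)
        value = score(predicted)             # goal-progress scorer (assumed)
        if value > best_score:
            best_action, best_state, best_score = action, predicted, value
    return best_action, best_state


# Toy stand-ins so the sketch runs without a model.
def toy_simulate(state: str, action: str) -> str:
    return f"{state} -> after {action}"


def toy_score(predicted: str) -> float:
    return 1.0 if "click([32])" in predicted else 0.0


best_action, best_state = lookahead_select(
    "home", ["click([31])", "click([32])"], toy_simulate, toy_score
)
```

Deeper search (several simulated steps per candidate) follows the same pattern, at the cost of one world-model call per simulated step.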

🎯 Model Series

| Model | Base Model | HuggingFace Link | ModelScope Link |
| --- | --- | --- | --- |
| WebWorld-8B | Qwen3-8B | 🤗 HuggingFace | 🤖 ModelScope |
| WebWorld-14B | Qwen3-14B | 🤗 HuggingFace | 🤖 ModelScope |
| WebWorld-32B | Qwen3-32B | 🤗 HuggingFace | 🤖 ModelScope |

WebWorldData: HuggingFace: Qwen/WebWorldData; ModelScope: Qwen/WebWorldData

💡 Recommendation: Use 8B for fast simulation and data synthesis; use 14B/32B for higher-fidelity simulation and better long-horizon robustness. For best results in a specific environment, we recommend task-specific fine-tuning on in-domain trajectories.

🛠️ Requirements

  • transformers (recommended: latest version)
  • torch
  • Optional: accelerate, vllm for efficient serving
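
If you serve the model with vLLM, one option is its OpenAI-compatible HTTP server. The sketch below only builds the request. It assumes a server was started separately (for example with `vllm serve Qwen/WebWorld-8B`) and is listening on the default local port 8000. `build_chat_request` is an illustrative helper, not part of WebWorld or vLLM.

```python
import json
import urllib.request


def build_chat_request(
    messages,
    model="Qwen/WebWorld-8B",
    base_url="http://localhost:8000/v1",  # assumed local vLLM server
    max_tokens=4096,
):
    """Build a POST request for an OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": model,
        "messages": messages,
        "max_tokens": max_tokens,
        "temperature": 0.0,  # greedy decoding for deterministic predictions
    }
    return urllib.request.Request(
        url=f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


request = build_chat_request(
    [
        {"role": "system", "content": "You are a web world model."},
        {"role": "user", "content": "Initial Page State:\n...\n\nNext Page State:"},
    ]
)

# To send it once a server is running:
# with urllib.request.urlopen(request) as resp:
#     body = json.loads(resp.read())
#     next_state = body["choices"][0]["message"]["content"]
```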

🚀 Quick Start

Key Notes:

  • WebWorld predicts the next page state given the current state and an action.
  • It strictly preserves the input/output format (A11y / HTML / XML / Markdown / NL).
  • Supports multi-turn trajectory simulation up to 30+ steps.

Single-Step Prediction

python

import torch
from transformers import AutoTokenizer, AutoModelForCausalLM

model_name = "Qwen/WebWorld-8B"  # or WebWorld-14B, WebWorld-32B

# Load the tokenizer and the model in bfloat16, placed automatically on available devices.
tokenizer = AutoTokenizer.from_pretrained(model_name, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
).eval()

system_prompt = (
    "You are a web world model. I will provide you with an initial page state "
    "and a sequence of actions. For each action, predict the resulting page state.\n"
    "Strictly maintain the original format. Output only the full page state "
    "without explanations, code, or truncation."
)

current_state = """RootWebArea 'Global Start - Your Daily Portal', focused
\t[1] banner 'Top Header', visible
\t\t[2] link 'Set as Homepage', clickable, visible
\t\t[3] link 'Feedback', clickable, visible
\t\t[5] region 'Weather Widget', visible
\t\t\tStaticText 'New York, USA'
\t\t\t[6] image 'Sunny', visible
\t\t\tStaticText '24°C'
\t\t[8] link 'Sign In', clickable, visible
\t[10] region 'Search Area', visible
\t\t[11] image 'Global Start Logo', visible
\t\tStaticText 'Search the entire web'
\t\t[12] tablist 'Search Engine Selector', orientation='horizontal'
\t\t\t[13] tab 'Google', selected=True, clickable
\t\t\t[14] tab 'Bing', selected=False, clickable
\t\t\t[15] tab 'DuckDuckGo', selected=False, clickable
\t\t[18] combobox 'Web Search', clickable, visible, autocomplete='both', expanded=False
\t\t\t[19] textbox 'Type keywords or URL...', clickable, visible, editable, value=''
\t\t[20] button 'Search', clickable, visible
\t[30] navigation 'Category Bar', visible
\t\t[31] link 'Home', clickable, selected=True
\t\t[32] link 'News', clickable
\t\t[33] link 'Video', clickable
\t\t[34] link 'Shopping', clickable
\t\t[35] link 'Social', clickable
\t[50] main 'Site Directory', visible
\t\t[51] region 'Top Recommended', visible
\t\t\t[52] heading 'Most Popular', visible
\t\t\t[53] list 'Top Sites Grid', visible
\t\t\t\t[54] link 'Facebook', clickable
\t\t\t\t[56] link 'YouTube', clickable
\t\t\t\t[58] link 'Amazon', clickable
\t\t\t\t[60] link 'Twitter / X', clickable
\t\t\t\t[62] link 'Instagram', clickable
\t\t\t\t[64] link 'Wikipedia', clickable
\t\t\t\t[66] link 'Netflix', clickable
\t\t\t\t[68] link 'LinkedIn', clickable
\t\t[80] region 'News & Media', visible
\t\t\t[81] heading 'Latest News', visible
\t\t\t[82] link 'CNN', clickable
\t\t\t[83] link 'BBC', clickable
\t\t\t[84] link 'The Verge', clickable
\t\t[90] region 'Shopping', visible
\t\t\t[91] heading 'E-Commerce', visible
\t\t\t[92] link 'eBay', clickable
\t\t\t[93] link 'Walmart', clickable
\t\t\t[94] link 'Best Buy', clickable
\t[200] complementary 'Ads', visible
\t\t[201] image 'Ad: Travel to Japan'
\t\t[202] link 'Book Now', clickable
\t[300] contentinfo 'Footer', visible
\t\tStaticText '© 2026 Global Start Inc.'"""

# First-turn prompt: initial state, the action, and the completion cue.
user_message = (
    f"Initial Page State:\n{current_state}\n\n"
    "First Action: 'click([32])'\n\n"
    "Next Page State:"
)
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": user_message},
]

text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Greedy decoding; no gradients are needed for inference.
with torch.no_grad():
    outputs = model.generate(
        **inputs,
        max_new_tokens=4096,
        do_sample=False,
    )

# Decode only the newly generated tokens, skipping the prompt.
response = tokenizer.decode(outputs[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True)
print(response)

Multi-Turn Simulation

The first turn provides the initial state and first action. Each subsequent turn uses a fixed continuation prompt:

python

CONTINUE_PROMPT = (
    "Continue the trajectory. Given the previous state, "
    "predict the next page state after this action.\n\n"
    "Action: '{action}'\n\nNext Page State:"
)

# Turn 1
messages = [
    {"role": "system", "content": system_prompt},
    {"role": "user", "content": f"Initial Page State:\n{state_0}\n\nFirst Action: '{action_0}'\n\nNext Page State:"},
]
state_1 = generate(messages)  # your generate function

# Turn 2
messages.append({"role": "assistant", "content": state_1})
messages.append({"role": "user", "content": CONTINUE_PROMPT.format(action=action_1)})
state_2 = generate(messages)

# Turn 3, 4, ... up to 30+ turns: repeat the same pattern
messages.append({"role": "assistant", "content": state_2})
messages.append({"role": "user", "content": CONTINUE_PROMPT.format(action=action_2)})
state_3 = generate(messages)
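
The turn-by-turn pattern above can be folded into a loop. The following is a minimal sketch, assuming `generate` is any callable that takes the chat `messages` list and returns the predicted page state as a string (for example, a wrapper around the tokenizer and model from the single-step example). `simulate_trajectory` and `FIRST_PROMPT` are illustrative names, not part of WebWorld.

```python
from typing import Callable, Dict, List, Sequence

Message = Dict[str, str]

FIRST_PROMPT = (
    "Initial Page State:\n{state}\n\nFirst Action: '{action}'\n\nNext Page State:"
)
CONTINUE_PROMPT = (
    "Continue the trajectory. Given the previous state, "
    "predict the next page state after this action.\n\n"
    "Action: '{action}'\n\nNext Page State:"
)


def simulate_trajectory(
    generate: Callable[[List[Message]], str],
    system_prompt: str,
    state_0: str,
    actions: Sequence[str],
) -> List[str]:
    """Roll out one predicted state per action, keeping the full chat history."""
    if not actions:
        return []
    messages: List[Message] = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": FIRST_PROMPT.format(state=state_0, action=actions[0])},
    ]
    states: List[str] = []
    for step, action in enumerate(actions):
        if step > 0:
            # Later turns use the fixed continuation prompt.
            messages.append({"role": "user", "content": CONTINUE_PROMPT.format(action=action)})
        predicted = generate(messages)
        states.append(predicted)
        # Feed the prediction back as the assistant turn for the next step.
        messages.append({"role": "assistant", "content": predicted})
    return states


# Stub generator so the sketch runs without a model: it counts user turns.
def stub_generate(messages: List[Message]) -> str:
    n_actions = sum(1 for m in messages if m["role"] == "user")
    return f"state after {n_actions} action(s)"


states = simulate_trajectory(
    stub_generate, "system", "page", ["click([32])", "scroll(0, 200)", "go_back()"]
)
```

Because the whole history is resent each turn, prompt length grows with every step; for long trajectories, check it against the model's context window.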

🎮 Action Space

WebWorld supports a unified action space as Python-style function calls:

| Category | Action | Description |
| --- | --- | --- |
| Element | click(bid, button, modifiers) | Click a DOM element by its ID |
| | fill(bid, text, press_enter) | Type text into an input field |
| | select_option(bid, options) | Select from a dropdown / combobox |
| | hover(bid) | Hover over an element |
| Mouse | mouse_move(x, y) | Move cursor to coordinates |
| | mouse_click(x, y, button) | Click at coordinates |
| | mouse_down(x, y) / mouse_up(x, y) | Press / release (drag-and-drop) |
| Keyboard | keyboard_press(key) | Press a key (e.g., Enter, Tab) |
| | keyboard_type(text) | Type a string sequentially |
| Browser | scroll(dx, dy) | Scroll the viewport |
| | goto(url) | Navigate to a URL |
| | go_back() / go_forward() | Browser history navigation |
| | tab_new() / tab_close() / tab_focus(index) | Manage browser tabs |
| Meta | send_msg_to_user(text) | Send a message to the user |
| | noop(wait_ms) | Wait for a duration |
| | infeasible(reason) | Declare the task impossible |
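
Because actions are written as Python-style function calls, they can be checked with the standard `ast` module before being sent to the model. This is a minimal sketch, assuming action strings contain only literal arguments (as in `click([32])` from the Quick Start). `parse_action` and `KNOWN_ACTIONS` are illustrative names, not part of WebWorld; the action names are taken from the table above.

```python
import ast

KNOWN_ACTIONS = {
    "click", "fill", "select_option", "hover",
    "mouse_move", "mouse_click", "mouse_down", "mouse_up",
    "keyboard_press", "keyboard_type",
    "scroll", "goto", "go_back", "go_forward",
    "tab_new", "tab_close", "tab_focus",
    "send_msg_to_user", "noop", "infeasible",
}


def parse_action(action: str):
    """Parse an action string into (name, positional_args, keyword_args)."""
    try:
        tree = ast.parse(action.strip(), mode="eval")
    except SyntaxError as exc:
        raise ValueError(f"not valid call syntax: {action!r}") from exc
    call = tree.body
    if not isinstance(call, ast.Call) or not isinstance(call.func, ast.Name):
        raise ValueError(f"not a simple function call: {action!r}")
    name = call.func.id
    if name not in KNOWN_ACTIONS:
        raise ValueError(f"unknown action: {name!r}")
    # literal_eval accepts only literals, so no code in the string is executed.
    args = [ast.literal_eval(arg) for arg in call.args]
    kwargs = {kw.arg: ast.literal_eval(kw.value) for kw in call.keywords if kw.arg}
    return name, args, kwargs


name, args, kwargs = parse_action("fill('19', 'weather', press_enter=True)")
```

Rejecting malformed or unknown actions up front avoids spending a long generation on a prompt the model was not trained to handle.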

📊 Performance

Intrinsic Evaluation (WebWorld-Bench)

WebWorld-Bench evaluates models using Factuality Score (functional correctness) and Web Turing Score (perceptual realism) across nine dimensions:

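| Model | Avg Factuality | Avg Turing |
| --- | --- | --- |
| GPT-4o | 59.5 | 35.4 |
| Claude-Opus-4.1 | 71.3 | 47.4 |
| Gemini-3-Pro | 70.3 | 43.2 |
| Qwen3-8B (base) | 26.9 | 17.4 |
| WebWorld-8B | 70.1 | 42.2 |
| WebWorld-14B | 70.7 | 44.7 |
| WebWorld-32B | 71.0 | 45.6 |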

Extrinsic Evaluation (Agent Training)

| Model | MiniWob++ SR | WebArena SR |
| --- | --- | --- |
| GPT-4o | 64.3% | 26.6% |
| Qwen3-8B (base) | 49.4% | 9.8% |
| Qwen3-8B + WebWorld | 59.3% (+9.9%) | 20.7% (+10.9%) |
| Qwen3-14B (base) | 54.9% | 15.1% |
| Qwen3-14B + WebWorld | 63.2% (+8.3%) | 24.3% (+9.2%) |

Cross-Domain Generalization

| Environment | Qwen3-8B | WebWorld-8B | Gain |
| --- | --- | --- | --- |
| API Services | 0.088 | 0.299 | +0.211 |
| Code | 0.147 | 0.396 | +0.249 |
| Game | 0.253 | 0.473 | +0.220 |
| GUI Desktop | 0.322 | 0.705 | +0.383 |
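
The Gain column is the absolute difference between the two scores. As a quick arithmetic check, the sketch below recomputes it from the table values and also derives the relative improvement, which the table does not report. The variable names are illustrative.

```python
# (Qwen3-8B score, WebWorld-8B score) per environment, copied from the table.
results = {
    "API Services": (0.088, 0.299),
    "Code": (0.147, 0.396),
    "Game": (0.253, 0.473),
    "GUI Desktop": (0.322, 0.705),
}

gains = {}
ratios = {}
for env, (base, webworld) in results.items():
    gains[env] = round(webworld - base, 3)   # absolute gain, as in the table
    ratios[env] = webworld / base            # relative improvement factor
    print(f"{env}: +{gains[env]:.3f} absolute, {ratios[env]:.2f}x relative")
```

The largest absolute gain is on GUI Desktop, while the largest relative gain is on API Services, where the base score is lowest.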

⚠️ Limitations

  • Sycophancy / optimism bias: the model may generate outcomes that are overly favorable to the agent's intended action.
  • Content generation fidelity: long-form, high-precision content (e.g., scientific articles) is not the primary target.
  • Text-only: WebWorld does not simulate visual / pixel-level rendering.

📝 Citation

bibtex

@misc{xiao2026webworldlargescaleworldmodel,
  title={WebWorld: A Large-Scale World Model for Web Agent Training},
  author={Zikai Xiao and Jianhong Tu and Chuhang Zou and Yuxin Zuo and Zhi Li and Peng Wang and Bowen Yu and Fei Huang and Junyang Lin and Zuozhu Liu},
  year={2026},
  eprint={2602.14721},
  archivePrefix={arXiv},
  primaryClass={cs.AI},
  url={https://arxiv.org/abs/2602.14721},
}

Model provider: senapati484

Model tree: base Qwen/Qwen3-8B; fine-tuned: this model

Modalities: text input, text output
