Uzbekswe

browsesafe


License: mit

Highlights

BrowseSafe is a multi-layered defense strategy comprising both architectural and model-based defenses to protect against evolving prompt injection attacks. It is a specialized security model designed to protect AI browser agents from prompt injection attacks embedded in real-world web content.

  • State-of-the-Art Detection: Achieves a 90.4% F1 score on the BrowseSafe-Bench test set.

  • Real-Time Latency: Optimized for agent loops, enabling async security checks without degrading user experience.

  • Robustness to Distractors: Specifically trained to distinguish between malicious instructions and benign, structure-rich HTML "noise" (e.g., accessibility attributes, hidden form fields) that often confuses standard detectors.

  • Comprehensive Coverage: Validated against 11 attack types with different security criticality levels, 9 injection strategies, 5 distractor types, 5 context-aware generation types, 5 domains, 3 linguistic styles, and 5 evaluation metrics, ensuring broad-spectrum defense capabilities.

Model Overview

BrowseSafe is based on the Qwen3-30B-A3B architecture.

  • Type: Fine-tuned Causal Language Model (MoE) for SFT Classification
  • Training Stage: Post-training (Fine-tuning on BrowseSafe-Bench)
  • Dataset: BrowseSafe-Bench
  • Base Model: Qwen/Qwen3-30B-A3B-Instruct-2507
  • Context Length: Up to 16,384 tokens
  • Input: Raw HTML content
  • Output: Single token, "yes" or "no" classification
  • License: MIT

Performance

We evaluated BrowseSafe on BrowseSafe-Bench, a realistic benchmark comprising 3,680 test samples of complex HTML payloads.

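| Model Name        | Config        | F1 Score | Precision | Recall | Balanced Accuracy | Refusals |
|-------------------|---------------|----------|-----------|--------|-------------------|----------|
| PromptGuard-2     | 22M           | 0.350    | 0.975     | 0.213  | 0.606             | 0        |
| PromptGuard-2     | 86M           | 0.360    | 0.983     | 0.221  | 0.611             | 0        |
| gpt-oss-safeguard | 20B / Low     | 0.790    | 0.986     | 0.658  | 0.826             | 0        |
| gpt-oss-safeguard | 20B / Medium  | 0.796    | 0.994     | 0.664  | 0.832             | 0        |
| gpt-oss-safeguard | 120B / Low    | 0.730    | 0.994     | 0.577  | 0.788             | 0        |
| gpt-oss-safeguard | 120B / Medium | 0.741    | 0.997     | 0.589  | 0.795             | 0        |
| GPT-5 mini        | Minimal       | 0.750    | 0.735     | 0.767  | 0.746             | 0        |
| GPT-5 mini        | Low           | 0.854    | 0.949     | 0.776  | 0.868             | 0        |
| GPT-5 mini        | Medium        | 0.853    | 0.945     | 0.777  | 0.866             | 0        |
| GPT-5 mini        | High          | 0.852    | 0.957     | 0.768  | 0.868             | 0        |
| GPT-5             | Minimal       | 0.849    | 0.881     | 0.819  | 0.855             | 0        |
| GPT-5             | Low           | 0.854    | 0.928     | 0.791  | 0.866             | 0        |
| GPT-5             | Medium        | 0.855    | 0.930     | 0.792  | 0.867             | 0        |
| GPT-5             | High          | 0.840    | 0.882     | 0.802  | 0.848             | 0        |
| Haiku 4.5         | No Thinking   | 0.810    | 0.760     | 0.866  | 0.798             | 0        |
| Haiku 4.5         | 1K            | 0.809    | 0.755     | 0.872  | 0.795             | 0        |
| Haiku 4.5         | 8K            | 0.805    | 0.751     | 0.868  | 0.792             | 0        |
| Haiku 4.5         | 32K           | 0.808    | 0.760     | 0.863  | 0.796             | 0        |
| Sonnet 4.5        | No Thinking   | 0.807    | 0.763     | 0.855  | 0.796             | 419      |
| Sonnet 4.5        | 1K            | 0.862    | 0.929     | 0.803  | 0.872             | 613      |
| Sonnet 4.5        | 8K            | 0.863    | 0.931     | 0.805  | 0.873             | 650      |
| Sonnet 4.5        | 32K           | 0.863    | 0.935     | 0.801  | 0.873             | 669      |
| BrowseSafe        |               | 0.904    | 0.978     | 0.841  | 0.912             | 0        |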

Evaluation Metrics

BrowseSafe-Bench evaluates models across five metrics. Full details can be found in the paper.
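
As a rough illustration of how the reported columns relate to each other, the sketch below uses the standard textbook definitions of F1 and balanced accuracy. It is not taken from the BrowseSafe-Bench evaluation code; the function names are hypothetical, and the paper remains the authority on how each metric is computed.

```python
def f1_from_precision_recall(precision: float, recall: float) -> float:
    """Harmonic mean of precision and recall (standard F1 definition)."""
    if precision + recall == 0:
        return 0.0
    return 2 * precision * recall / (precision + recall)


def balanced_accuracy(tp: int, fp: int, tn: int, fn: int) -> float:
    """Mean of the true-positive rate and the true-negative rate."""
    tpr = tp / (tp + fn) if (tp + fn) else 0.0
    tnr = tn / (tn + fp) if (tn + fp) else 0.0
    return (tpr + tnr) / 2


# Cross-check against the BrowseSafe row of the results table:
# precision 0.978 and recall 0.841 should reproduce the reported F1.
print(round(f1_from_precision_recall(0.978, 0.841), 3))  # 0.904
```

The same check on the PromptGuard-2 22M row (precision 0.975, recall 0.213) gives 0.350, which shows how a detector with very high precision can still score a low F1 when recall is poor.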

Quickstart

The code of Qwen3-MoE is in the latest Hugging Face transformers library. We recommend using transformers>=4.55.4.

Below is a code snippet illustrating how to use BrowseSafe.

python

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "perplexity-ai/browsesafe"

# load the tokenizer and the model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    torch_dtype="auto",
    device_map="auto"
)

# prepare the model input: the raw HTML goes in as a single user message
prompt = "<html>...</html>"
messages = [
    {"role": "user", "content": prompt}
]
text = tokenizer.apply_chat_template(
    messages,
    tokenize=False,
    add_generation_prompt=True,
)
model_inputs = tokenizer([text], return_tensors="pt").to(model.device)

# conduct text completion
generated_ids = model.generate(**model_inputs)

# keep only the newly generated tokens (drop the prompt tokens)
output_ids = generated_ids[0][len(model_inputs.input_ids[0]):].tolist()
content = tokenizer.decode(output_ids, skip_special_tokens=True)
print("content:", content)
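
Because the model emits a single "yes" or "no" token, the decoded `content` string can be normalized into a boolean before an agent loop acts on it. The helper below is a minimal sketch and not part of the BrowseSafe release: the name `parse_verdict` is hypothetical, and the mapping of "yes" to "flagged" is an assumption that should be confirmed against the paper or the reference implementation.

```python
def parse_verdict(content: str) -> bool:
    """Map the decoded model output to a boolean verdict.

    Assumption (not stated in the model card): "yes" means the HTML
    was flagged and "no" means it was not.
    """
    label = content.strip().lower()
    if label == "yes":
        return True
    if label == "no":
        return False
    # Fail loudly instead of silently treating unknown output as safe.
    raise ValueError(f"unexpected model output: {content!r}")
```

Raising on unexpected output is a deliberate choice in this sketch: for a security check, treating an unparseable answer as "safe" would be the riskier default.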

Processing Long HTML Contexts

Web pages often exceed standard context windows. To handle this, BrowseSafe utilizes a chunking strategy (as described in the paper) to process content that exceeds the model's effective context limit.

  • Strategy: Partition the document into non-overlapping chunks at token boundaries.
  • Aggregation: Apply a conservative "OR" logic—if any single chunk is classified as VIOLATES, the entire document is flagged. This ensures that malicious payloads hidden deep within long pages are not missed.

A reference implementation can be found here.
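
The chunk-and-aggregate idea described above can be sketched in a few lines. This is an illustrative approximation, not the reference implementation: `chunk_token_ids`, `document_is_flagged`, and the `classify_chunk` callable are hypothetical names, and the chunk size is left as a parameter because the model card does not state one.

```python
from typing import Callable, Iterator, List, Sequence


def chunk_token_ids(token_ids: Sequence[int], max_tokens: int) -> Iterator[List[int]]:
    """Yield consecutive, non-overlapping chunks of at most max_tokens tokens."""
    if max_tokens <= 0:
        raise ValueError("max_tokens must be positive")
    for start in range(0, len(token_ids), max_tokens):
        yield list(token_ids[start:start + max_tokens])


def document_is_flagged(
    token_ids: Sequence[int],
    max_tokens: int,
    classify_chunk: Callable[[List[int]], bool],
) -> bool:
    """Conservative OR aggregation: flag the document if any chunk is flagged.

    classify_chunk is a hypothetical callable that runs the model on one
    chunk and returns True when that chunk is classified as a violation.
    """
    return any(classify_chunk(chunk) for chunk in chunk_token_ids(token_ids, max_tokens))


# Toy usage with a stand-in classifier that flags any chunk containing token 9.
print(list(chunk_token_ids(range(10), 4)))  # [[0, 1, 2, 3], [4, 5, 6, 7], [8, 9]]
print(document_is_flagged(range(10), 4, lambda chunk: 9 in chunk))  # True
```

In practice the token IDs would come from tokenizing the raw HTML with the model's tokenizer, and each chunk would be decoded back to text and passed through the Quickstart pipeline. Since `any` short-circuits, the loop stops at the first flagged chunk.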

Best Practices

To achieve optimal defense performance, be sure to pass the full HTML content to the model. Running the model on extracted text may result in performance degradation.

Citation

If you use or reference this work, please cite:

bibtex

@article{browsesafe2025,
  title         = {BrowseSafe: Understanding and Preventing Prompt Injection Within AI Browser Agents},
  author        = {Kaiyuan Zhang and Mark Tenenholtz and Kyle Polley and Jerry Ma and Denis Yarats and Ninghui Li},
  eprint        = {arXiv:2511.20597},
  archivePrefix = {arXiv},
  year          = {2025}
}

