sakamakismile

Qwen-AgentWorld-35B-A3B-NVFP4


README

License: apache-2.0

What it is

Qwen-AgentWorld is a world model / environment simulator, not a task-solving agent: given a state and an action, it predicts the next observation. It is trained (CPT → SFT → RL) to simulate seven agentic domains — MCP, Search, Terminal, SWE, Android, Web, OS — so downstream agents can be trained and evaluated against simulated rollouts instead of (costly, risky) real environments. See the paper and the upstream repository.
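The state-plus-action-to-observation contract can be pictured as a small environment wrapper that an agent steps against. The sketch below is illustrative only: `SimulatedEnv`, `Predictor`, and `stub_predictor` are hypothetical names, not part of the upstream repository, and the stub stands in for a real call to the model.

```python
from dataclasses import dataclass, field
from typing import Callable

# Hypothetical signature: (history of (action, observation) pairs, new action)
# -> predicted observation text.
Predictor = Callable[[list[tuple[str, str]], str], str]


@dataclass
class SimulatedEnv:
    """Wraps a world-model predictor so an agent can step it like an environment."""
    predict: Predictor
    history: list[tuple[str, str]] = field(default_factory=list)

    def step(self, action: str) -> str:
        # Ask the world model for the next observation, then record the transition.
        observation = self.predict(self.history, action)
        self.history.append((action, observation))
        return observation


def stub_predictor(history, action):
    # Stand-in for a real model call; it only echoes the action and step index.
    return f"[simulated output {len(history)} for: {action}]"


env = SimulatedEnv(predict=stub_predictor)
first = env.step("ls")
second = env.step("pwd")
```

Swapping `stub_predictor` for a function that queries the served model would leave the agent-side loop unchanged, which is the point of simulating the environment rather than running it.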

Quantization

Method: NVFP4 (nvfp4-pack-quantized, W4A4, group size 16, FP8-E4M3 scales), compressed-tensors
Tool: llm-compressor one-shot, 32 calibration samples (neuralmagic/calibration, seq 8192)
Quantized: all language-model Linear layers, including the 30,720 expert projections
Kept BF16: lm_head, the Qwen3-VL vision tower (visual.*), the MoE routers (mlp.gate, mlp.shared_expert_gate), norms
Size: 21.9 GB (from ~69 GB BF16)
Architecture: Qwen3_5MoeForConditionalGeneration (qwen3_5_moe): 40 layers, 256 experts top-8 + shared expert, hybrid linear+full attention, 262k context
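The figures above can be cross-checked with a little arithmetic. This is a back-of-the-envelope sketch, not an exact accounting: it assumes each expert has three Linear projections (gate, up, down), estimates the parameter count from the BF16 size at 2 bytes per parameter, and ignores small per-tensor metadata.

```python
# Counts taken from the figures above.
NUM_LAYERS = 40
NUM_EXPERTS = 256
# Assumption: each expert has three Linear projections (gate, up, down).
PROJECTIONS_PER_EXPERT = 3

expert_projections = NUM_LAYERS * NUM_EXPERTS * PROJECTIONS_PER_EXPERT

# NVFP4 storage per weight: 4 bits plus one 8-bit FP8 scale per group of 16.
GROUP_SIZE = 16
bits_per_weight = 4 + 8 / GROUP_SIZE

# Rough parameter count (in billions) from the BF16 size, 2 bytes per parameter.
bf16_size_gb = 69.0
approx_params_billion = bf16_size_gb / 2

# Lower bound: pretend every weight is NVFP4; layers kept in BF16 add to this.
nvfp4_floor_gb = approx_params_billion * bits_per_weight / 8

print(expert_projections, bits_per_weight, round(nvfp4_floor_gb, 1))
```

Under these assumptions the expert count matches the 30,720 quoted above, and the all-NVFP4 floor lands a little below the reported 21.9 GB, which is consistent with the vision tower, lm_head, routers, and norms staying in BF16.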

Recipe:

python

from llmcompressor.modifiers.quantization import QuantizationModifier

# Quantize every Linear layer to NVFP4 except the modules listed in `ignore`.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",
    ignore=["lm_head", "re:.*visual.*", "re:.*mlp.gate$", "re:.*mlp.shared_expert_gate$"],
)

The full multimodal model is loaded so the vision tower is preserved in BF16; the per-expert projections are packed to NVFP4.
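To see which modules the ignore list in the recipe would leave in BF16, the patterns can be tried against a few module names. This is a sketch under two assumptions: that entries prefixed with `re:` are regular expressions matched from the start of the module name and other entries are exact names, and that the module names below (which are made up for illustration) resemble the real ones. `is_ignored` is a hypothetical helper, not an llm-compressor function.

```python
import re

IGNORE = ["lm_head", "re:.*visual.*", "re:.*mlp.gate$", "re:.*mlp.shared_expert_gate$"]


def is_ignored(module_name: str, ignore=IGNORE) -> bool:
    """Return True if a module name would be excluded from quantization.

    Assumption: "re:" entries are regexes matched from the start of the
    name; all other entries must equal the name exactly.
    """
    for entry in ignore:
        if entry.startswith("re:"):
            if re.match(entry[3:], module_name):
                return True
        elif entry == module_name:
            return True
    return False


# Hypothetical module names, for illustration only.
examples = [
    "lm_head",
    "model.visual.blocks.0.attn.qkv",
    "model.language_model.layers.0.mlp.gate",
    "model.language_model.layers.0.mlp.shared_expert_gate",
    "model.language_model.layers.0.mlp.experts.0.gate_proj",
    "model.language_model.layers.0.mlp.shared_expert.down_proj",
]
for name in examples:
    print(f"{name}: {'kept BF16' if is_ignored(name) else 'quantized'}")
```

The `$` anchors matter here: they let the router `mlp.gate` stay in BF16 while an expert's `gate_proj` is still quantized.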

Serving (vLLM ≥ 0.22)

Requires a Blackwell (SM120) GPU. TP=4 (4× 16 GB) is the comfortable floor for the MoE NVFP4 GEMM workspace. TP=2 fits only when memory is squeezed: add --enforce-eager --max-model-len 8192 --gpu-memory-utilization 0.95, and expect a large drop in single-stream speed.
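A rough per-GPU budget shows why TP=2 is tight on 16 GB cards. The sketch below assumes the 21.9 GB checkpoint shards evenly across tensor-parallel ranks, which is only approximately true since some tensors may be replicated; `per_gpu_budget` is a hypothetical helper, and the utilization values are the ones quoted in this section.

```python
MODEL_SIZE_GB = 21.9   # checkpoint size from the Quantization section
GPU_MEMORY_GB = 16.0   # per-GPU memory of the cards discussed here


def per_gpu_budget(tp_size: int, gpu_memory_utilization: float) -> dict:
    """Rough per-GPU split, assuming weights shard evenly across TP ranks."""
    weights = MODEL_SIZE_GB / tp_size
    usable = GPU_MEMORY_GB * gpu_memory_utilization
    return {
        "weights_gb": round(weights, 2),
        "usable_gb": round(usable, 2),
        # What remains for KV cache, activations, and GEMM workspace.
        "headroom_gb": round(usable - weights, 2),
    }


tp4 = per_gpu_budget(4, 0.87)
tp2 = per_gpu_budget(2, 0.95)
print("TP=4:", tp4)
print("TP=2:", tp2)
```

Under these assumptions TP=4 leaves roughly twice the headroom of TP=2 even though TP=2 runs at a higher utilization setting, which fits the advice to shorten the context and disable CUDA graphs in the two-GPU case.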

bash

NCCL_P2P_DISABLE=1 vllm serve sakamakismile/Qwen-AgentWorld-35B-A3B-NVFP4 \
  --tensor-parallel-size 4 \
  --disable-custom-all-reduce \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.87 \
  --reasoning-parser qwen3
  • NVFP4 is auto-detected — do not pass --quantization.
  • It is a reasoning model: it thinks in <think>…</think> by default before emitting the predicted observation. Recommended sampling: temperature=0.6, top_p=0.95, top_k=20.
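Once the server is up, the recommended sampling settings can be sent through vLLM's OpenAI-compatible chat endpoint. The following is a minimal sketch using only the standard library. It assumes the server listens on vLLM's default local port, and the `max_tokens` value is an arbitrary illustrative choice; `build_request` and `send` are hypothetical helper names.

```python
import json
import urllib.request

MODEL = "sakamakismile/Qwen-AgentWorld-35B-A3B-NVFP4"
# Assumption: the server runs locally on vLLM's default port.
BASE_URL = "http://localhost:8000/v1"


def build_request(messages, max_tokens=1024):
    """Chat-completions payload using the recommended sampling settings."""
    return {
        "model": MODEL,
        "messages": messages,
        "temperature": 0.6,
        "top_p": 0.95,
        # Not part of the OpenAI schema; vLLM accepts it as an extra sampling parameter.
        "top_k": 20,
        "max_tokens": max_tokens,  # illustrative value, not a recommendation
    }


def send(payload, base_url=BASE_URL):
    """POST the payload and return the decoded JSON response."""
    req = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


payload = build_request(
    [{"role": "user", "content": "Action: execute_bash\nCommand: pwd"}]
)
# response = send(payload)  # requires a running server
```

Because the model reasons before answering, a generous `max_tokens` is likely needed so the predicted observation is not cut off after the thinking phase.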

Usage — the environment-simulator format

python

messages = [
    {"role": "system",
     "content": "You are a language world model simulating a Linux terminal environment. "
                "Given the user's command, predict the terminal output."},
    {"role": "user", "content": "Action: execute_bash\nCommand: ls -la /home/user/project/"},
]

The model returns the predicted next observation. Per-domain system-prompt templates for all seven domains live in the upstream repository.
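For repeated calls in the terminal domain, the prompt above can be wrapped in a small builder, with a fallback that strips the thinking block when the raw completion text is used directly. This is a sketch: `terminal_messages` and `strip_reasoning` are hypothetical helpers, the system prompt is copied from the example above rather than from the upstream templates, and the `raw` string is a made-up sample, not real model output.

```python
import re

TERMINAL_SYSTEM_PROMPT = (
    "You are a language world model simulating a Linux terminal environment. "
    "Given the user's command, predict the terminal output."
)


def terminal_messages(command: str) -> list[dict]:
    """Build the two-message prompt shown above for one bash command."""
    return [
        {"role": "system", "content": TERMINAL_SYSTEM_PROMPT},
        {"role": "user", "content": f"Action: execute_bash\nCommand: {command}"},
    ]


def strip_reasoning(text: str) -> str:
    """Drop a leading <think>...</think> block, if present, and trim whitespace."""
    return re.sub(r"^\s*<think>.*?</think>", "", text, count=1, flags=re.DOTALL).strip()


msgs = terminal_messages("ls -la /home/user/project/")

# Made-up sample text, used only to exercise strip_reasoning.
raw = "<think>The directory likely holds a small repo.</think>\ntotal 8\ndrwxr-xr-x 2 user user 4096 ."
observation = strip_reasoning(raw)
print(observation)
```

When the server is started with a reasoning parser, the thinking text is typically separated from the answer on the server side, so `strip_reasoning` would only be needed for raw, unparsed output.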

Performance & fidelity

On a single node with 4× RTX PRO 2000 Blackwell GPUs (16 GB, SM120) at TP=4, single-stream throughput is ~161 tok/s. The NVFP4 packing preserves the model's world-modeling behaviour: for example, it predicts well-formed terminal output (total, permission bits, ./..) and realistic, locale-appropriate search-result listings, and it stays consistent with the supplied environment state.

License & attribution

Apache-2.0, inherited from the base model. Quantized by Lna-Lab. These weights are a faithful NVFP4 packing of Qwen/Qwen-AgentWorld-35B-A3B; all credit for the model itself goes to the Qwen team.

Model provider

sakamakismile

Model tree

Base

Qwen/Qwen-AgentWorld-35B-A3B

Quantized

this model

Modalities

Input

Video, Text, Image

Output

Text
