sakamakismile
Qwen-AgentWorld-35B-A3B-NVFP4
License: apache-2.0
What it is
Qwen-AgentWorld is a world model / environment simulator, not a task-solving agent: given a state and an action, it predicts the next observation. It is trained (CPT → SFT → RL) to simulate seven agentic domains — MCP, Search, Terminal, SWE, Android, Web, OS — so downstream agents can be trained and evaluated against simulated rollouts instead of (costly, risky) real environments. See the paper and the upstream repository.
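As a rough illustration of how a world model can stand in for a real environment, the sketch below wraps a prediction function in a minimal step-style interface. The class name `SimulatedEnv`, the `predict` callable, and the convention of replaying earlier turns as alternating user/assistant messages are assumptions made for illustration only; they are not taken from the paper or the upstream repository.

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


class SimulatedEnv:
    """Hypothetical wrapper: the world model plays the role of the environment."""

    def __init__(self, system_prompt: str, predict: Callable[[List[Message]], str]):
        # `predict` is any function mapping a chat history to the predicted
        # next observation (for example, a call to a served model).
        self.history: List[Message] = [{"role": "system", "content": system_prompt}]
        self.predict = predict

    def step(self, action: str) -> str:
        # The agent's action is appended as a user turn.
        self.history.append({"role": "user", "content": action})
        observation = self.predict(self.history)
        # The predicted observation is stored as the assistant turn so that
        # later predictions can see the earlier state.
        self.history.append({"role": "assistant", "content": observation})
        return observation


# Stub predictor so the sketch runs without a model; swap in a real call.
def fake_predict(history: List[Message]) -> str:
    return f"(simulated output for turn {len(history) // 2})"


env = SimulatedEnv(
    "You are a language world model simulating a Linux terminal environment.",
    fake_predict,
)
first = env.step("Action: execute_bash\nCommand: pwd")
second = env.step("Action: execute_bash\nCommand: ls")
print(first)
print(second)
```

An agent under training would call `step` wherever it would otherwise execute the action for real, which is the substitution the paragraph above describes.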
Quantization
| Property | Value |
|---|---|
| Method | NVFP4 (nvfp4-pack-quantized, W4A4, group size 16, FP8-E4M3 scales), compressed-tensors |
| Tool | llm-compressor one-shot, 32 calibration samples (neuralmagic/calibration, seq 8192) |
| Quantized | all language-model Linear layers, including the 30,720 expert projections |
| Kept BF16 | lm_head, the Qwen3-VL vision tower (visual.*), the MoE routers (mlp.gate, mlp.shared_expert_gate), norms |
| Size | 21.9 GB (from ~69 GB BF16) |
| Architecture | Qwen3_5MoeForConditionalGeneration (qwen3_5_moe): 40 layers, 256 experts top-8 + shared expert, hybrid linear+full attention, 262k context |
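The figures in the table can be cross-checked with a little arithmetic. The sketch below assumes each routed expert has three Linear projections (gate, up, down), which is an assumption that happens to be consistent with the 30,720 count, and it treats the per-weight cost of NVFP4 as a 4-bit value plus one 8-bit FP8 scale shared by each group of 16 weights. Per-tensor scales and other metadata are ignored.

```python
# Figures taken from the table above.
layers = 40
experts_per_layer = 256
quantized_expert_projections = 30_720

# Assumption: the count covers the routed experts only, so dividing by
# layers * experts gives the number of Linear projections per expert.
projections_per_expert = quantized_expert_projections // (layers * experts_per_layer)

# Storage cost per quantized weight: 4-bit value + 8-bit scale per group of 16.
group_size = 16
bits_per_weight = 4 + 8 / group_size

# Ideal shrink factor if every 16-bit weight were quantized, compared with
# the ratio implied by the reported sizes (~69 GB BF16 -> 21.9 GB).
ideal_ratio = 16 / bits_per_weight
reported_ratio = 69 / 21.9

print(projections_per_expert)
print(bits_per_weight)
print(round(ideal_ratio, 2), round(reported_ratio, 2))
```

The reported ratio comes out lower than the ideal one, which is what one would expect given that lm_head, the vision tower, the routers and the norms stay in BF16.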
Recipe:
```python
from llmcompressor.modifiers.quantization import QuantizationModifier

# Quantize every Linear layer to NVFP4 except the modules listed in `ignore`.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",
    ignore=["lm_head", "re:.*visual.*", "re:.*mlp.gate$", "re:.*mlp.shared_expert_gate$"],
)
```
The full multimodal model is loaded so the vision tower is preserved in BF16; the per-expert projections are packed to NVFP4.
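To see why the `$` anchors in the ignore list matter, the sketch below checks a few module names against the same patterns. The `is_ignored` helper is a simplified stand-in for the library's matching (entries prefixed with `re:` treated as regular expressions, other entries as exact names), and the module names are hypothetical, chosen only to exercise each pattern.

```python
import re

IGNORE = ["lm_head", "re:.*visual.*", "re:.*mlp.gate$", "re:.*mlp.shared_expert_gate$"]


def is_ignored(module_name: str, ignore=IGNORE) -> bool:
    """Simplified matching: 're:' entries are regexes matched from the start
    of the name; any other entry must equal the module name exactly."""
    for entry in ignore:
        if entry.startswith("re:"):
            if re.match(entry[3:], module_name):
                return True
        elif entry == module_name:
            return True
    return False


# Hypothetical module names (not read from the real checkpoint).
names = [
    "lm_head",
    "model.visual.blocks.0.attn.qkv",
    "model.language_model.layers.0.mlp.gate",
    "model.language_model.layers.0.mlp.shared_expert_gate",
    "model.language_model.layers.0.mlp.experts.3.gate_proj",
    "model.language_model.layers.0.mlp.shared_expert.gate_proj",
]
for name in names:
    print(f"{name}: {'kept BF16' if is_ignored(name) else 'quantized'}")
```

Under these assumptions the router (`mlp.gate`) is skipped while the experts' `gate_proj` layers are not, because the trailing `$` stops the pattern from matching longer names.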
Serving (vLLM ≥ 0.22)
Requires a Blackwell (SM120) GPU. TP=4 (4× 16 GB) is the comfortable floor for the MoE NVFP4 GEMM workspace. TP=2 fits only when memory is squeezed: add --enforce-eager --max-model-len 8192 --gpu-memory-utilization 0.95, and expect a large drop in single-stream speed.
```bash
NCCL_P2P_DISABLE=1 vllm serve sakamakismile/Qwen-AgentWorld-35B-A3B-NVFP4 \
  --tensor-parallel-size 4 \
  --disable-custom-all-reduce \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.87 \
  --reasoning-parser qwen3
```
- NVFP4 is auto-detected; do not pass --quantization.
- It is a reasoning model: by default it thinks in <think>…</think> before emitting the predicted observation. Recommended sampling: temperature=0.6, top_p=0.95, top_k=20.
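A back-of-the-envelope budget helps explain why TP=4 is comfortable and TP=2 is tight. The sketch below uses the 21.9 GB checkpoint size, 16 GB cards, and the two --gpu-memory-utilization values mentioned in this section. It assumes the weights shard evenly across tensor-parallel ranks, which is a simplification (some modules may be replicated), so the results are rough estimates rather than measurements.

```python
def per_gpu_headroom_gb(weights_gb: float, tp: int, gpu_gb: float, utilization: float) -> float:
    """Approximate GB left per GPU for KV cache, activations and GEMM
    workspace, assuming weights split evenly across `tp` ranks."""
    budget = gpu_gb * utilization      # memory vLLM is allowed to use per GPU
    weights_per_gpu = weights_gb / tp  # even-sharding assumption
    return budget - weights_per_gpu


tp4 = per_gpu_headroom_gb(21.9, tp=4, gpu_gb=16, utilization=0.87)
tp2 = per_gpu_headroom_gb(21.9, tp=2, gpu_gb=16, utilization=0.95)

print(f"TP=4 headroom per GPU: ~{tp4:.1f} GB")
print(f"TP=2 headroom per GPU: ~{tp2:.1f} GB")
```

Under these assumptions TP=2 leaves roughly half the per-GPU headroom of TP=4 even at the higher utilization setting, which fits with the reduced --max-model-len and --enforce-eager suggested for that configuration.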
Usage — the environment-simulator format
```python
messages = [
    {
        "role": "system",
        "content": "You are a language world model simulating a Linux terminal environment. "
                   "Given the user's command, predict the terminal output.",
    },
    {"role": "user", "content": "Action: execute_bash\nCommand: ls -la /home/user/project/"},
]
```
The model returns the predicted next observation. Per-domain system-prompt templates for all seven domains live in the upstream repository.
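One way to send these messages to a running server is through vLLM's OpenAI-compatible chat endpoint. The sketch below uses only the standard library and applies the recommended sampling settings. The `base_url` (a local server on port 8000) and the helper names are assumptions. `strip_think` is a fallback for setups where the reasoning is not already separated from the final answer by the reasoning parser.

```python
import json
import re
import urllib.request

MODEL = "sakamakismile/Qwen-AgentWorld-35B-A3B-NVFP4"

messages = [
    {
        "role": "system",
        "content": "You are a language world model simulating a Linux terminal environment. "
                   "Given the user's command, predict the terminal output.",
    },
    {"role": "user", "content": "Action: execute_bash\nCommand: ls -la /home/user/project/"},
]


def build_payload(msgs):
    """Chat-completions request body with the recommended sampling settings."""
    return {
        "model": MODEL,
        "messages": msgs,
        "temperature": 0.6,
        "top_p": 0.95,
        "top_k": 20,  # passed to vLLM as an extra sampling parameter
    }


def strip_think(text):
    """Remove any <think>...</think> block left in the returned text."""
    return re.sub(r"<think>.*?</think>", "", text, flags=re.DOTALL).strip()


def predict_observation(msgs, base_url="http://localhost:8000/v1"):
    # Assumed endpoint: a local `vllm serve` instance; adjust as needed.
    request = urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(build_payload(msgs)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return strip_think(body["choices"][0]["message"]["content"] or "")
```

With a server running, `predict_observation(messages)` would return the predicted terminal output as a string; nothing is sent until that function is called.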
Performance & fidelity
On a single node of 4× RTX PRO 2000 Blackwell (16 GB, SM120), TP=4: ~161 tok/s single-stream. The NVFP4 packing preserves the model's world-modeling behaviour — e.g. it predicts well-formed terminal output (total, permission bits, ./..) and realistic, locale-appropriate search-result listings, and it stays consistent with the supplied environment state.
License & attribution
Apache-2.0, inherited from the base model. Quantized by Lna-Lab. These weights are a faithful NVFP4 packing of Qwen/Qwen-AgentWorld-35B-A3B; all credit for the model itself goes to the Qwen team.
Model provider: sakamakismile
Model tree: base Qwen/Qwen-AgentWorld-35B-A3B; quantized: this model
Modalities: input Video, Text, Image; output Text