sakamakismile/Qwen-AgentWorld-35B-A3B-NVFP4 API & Inference Endpoint

What it is

Qwen-AgentWorld is a world model (an environment simulator), not a task-solving agent: given a state and an action, it predicts the next observation. It is trained in three stages (CPT → SFT → RL) to simulate seven agentic domains (MCP, Search, Terminal, SWE, Android, Web, OS), so downstream agents can be trained and evaluated against simulated rollouts instead of costly, risky real environments. See the paper and the upstream repository.

Quantization

Method	NVFP4 (`nvfp4-pack-quantized`, W4A4, group size 16, FP8-E4M3 scales), `compressed-tensors`
Tool	`llm-compressor` one-shot, 32 calibration samples (`neuralmagic/calibration`, seq 8192)
Quantized	all language-model `Linear` layers, including the 30,720 expert projections
Kept BF16	`lm_head`, the Qwen3-VL vision tower (`visual.*`), the MoE routers (`mlp.gate`, `mlp.shared_expert_gate`), norms
Size	21.9 GB (from ~69 GB BF16)
Architecture	`Qwen3_5MoeForConditionalGeneration` (`qwen3_5_moe`): 40 layers, 256 experts top-8 + shared expert, hybrid linear+full attention, 262k context
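
The figures in the table can be cross-checked with a little arithmetic. The sketch below assumes three projections per expert (`gate_proj`, `up_proj`, `down_proj`), which is an assumption consistent with the 30,720 count rather than something stated above, and it ignores the small per-tensor global scale.

```python
# NVFP4 storage cost: 4-bit values plus one 8-bit FP8-E4M3 scale per group of 16.
VALUE_BITS = 4
SCALE_BITS = 8
GROUP_SIZE = 16
bits_per_weight = VALUE_BITS + SCALE_BITS / GROUP_SIZE  # 4.5 bits

# Expert projection count from the architecture row.
LAYERS = 40
EXPERTS_PER_LAYER = 256
PROJECTIONS_PER_EXPERT = 3  # assumption: gate_proj, up_proj, down_proj
expert_projections = LAYERS * EXPERTS_PER_LAYER * PROJECTIONS_PER_EXPERT  # 30720

# Best-case compression if every weight were NVFP4, versus the reported sizes.
ideal_ratio = 16 / bits_per_weight  # BF16 is 16 bits per weight
observed_ratio = 69 / 21.9          # ~69 GB BF16 -> 21.9 GB

print(f"bits per weight:    {bits_per_weight}")
print(f"expert projections: {expert_projections}")
print(f"ideal ratio:        {ideal_ratio:.2f}x")
print(f"observed ratio:     {observed_ratio:.2f}x")
```

The observed ratio comes out below the ideal one, which is what one would expect given that `lm_head`, the vision tower, the routers, and the norms stay in BF16.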

Recipe:

python
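from llmcompressor.modifiers.quantization import QuantizationModifier

# Quantize every Linear layer to NVFP4, except the modules listed in `ignore`.
recipe = QuantizationModifier(
    targets="Linear",
    scheme="NVFP4",
    # Kept in BF16: output head, vision tower, and both MoE router gates.
    ignore=["lm_head", "re:.*visual.*", "re:.*mlp.gate$", "re:.*mlp.shared_expert_gate$"],
)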

The full multimodal model is loaded so the vision tower is preserved in BF16; the per-expert projections are packed to NVFP4.
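
To see which modules the `ignore` list would keep in BF16, the patterns can be tried against a few module names. This is a minimal sketch, not the library's own matcher: it assumes entries prefixed with `re:` are regular expressions applied with `re.match` and that plain entries must equal the module name exactly. The helper `is_ignored` and the module names are hypothetical.

```python
import re

IGNORE = ["lm_head", "re:.*visual.*", "re:.*mlp.gate$", "re:.*mlp.shared_expert_gate$"]


def is_ignored(module_name, patterns=IGNORE):
    """Return True if the module name matches any ignore entry.

    Assumption: "re:" entries are regexes applied with re.match;
    all other entries are compared for exact equality.
    """
    for pattern in patterns:
        if pattern.startswith("re:"):
            if re.match(pattern[3:], module_name):
                return True
        elif pattern == module_name:
            return True
    return False


# Hypothetical module names, chosen only to exercise each pattern.
examples = [
    "lm_head",
    "model.visual.blocks.0.attn.qkv",
    "model.language_model.layers.0.mlp.gate",
    "model.language_model.layers.0.mlp.shared_expert_gate",
    "model.language_model.layers.0.mlp.experts.7.gate_proj",
    "model.language_model.layers.0.mlp.shared_expert.up_proj",
]
for name in examples:
    print(f"{name}: {'BF16 (ignored)' if is_ignored(name) else 'NVFP4'}")
```

The trailing `$` matters: without it, `re:.*mlp.gate` would also match names ending in `gate_proj` and leave those projections unquantized.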

Serving (vLLM ≥ 0.22)

Requires a Blackwell (SM120) GPU. TP=4 (4× 16 GB) is the comfortable floor for the MoE NVFP4 GEMM workspace. TP=2 fits only when memory is squeezed: add --enforce-eager --max-model-len 8192 --gpu-memory-utilization 0.95, and expect a large drop in single-stream speed.
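
A rough per-GPU budget shows why TP=2 is tight. The sketch below assumes the 21.9 GB of weights shard evenly across tensor-parallel ranks, which is a simplification: some tensors may be replicated, so the real headroom would be smaller. The helper `per_gpu_budget` is hypothetical.

```python
def per_gpu_budget(weights_gb, tp, gpu_gb, utilization):
    """Return (weight shard, vLLM memory budget, headroom) in GB per GPU.

    Assumption: weights are split evenly across the tp ranks.
    """
    shard = weights_gb / tp
    budget = gpu_gb * utilization
    return shard, budget, budget - shard


# (tensor-parallel size, --gpu-memory-utilization) for the two setups above.
for tp, utilization in [(4, 0.87), (2, 0.95)]:
    shard, budget, headroom = per_gpu_budget(21.9, tp, 16, utilization)
    print(f"TP={tp}: shard {shard:.2f} GB, budget {budget:.2f} GB, "
          f"headroom {headroom:.2f} GB")
```

Under these assumptions TP=4 leaves roughly twice the headroom of TP=2 for the KV cache, activations, and GEMM workspace, even though TP=2 runs at a higher utilization setting.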

bash
NCCL_P2P_DISABLE=1 vllm serve sakamakismile/Qwen-AgentWorld-35B-A3B-NVFP4 \
  --tensor-parallel-size 4 \
  --disable-custom-all-reduce \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.87 \
  --reasoning-parser qwen3

NVFP4 is auto-detected — do not pass --quantization.
It is a reasoning model: it thinks in <think>…</think> by default before emitting the predicted observation. Recommended sampling: temperature=0.6, top_p=0.95, top_k=20.
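
With `--reasoning-parser qwen3` the server separates the thinking from the answer. If the raw text is handled directly (for example, when serving without the parser), a small helper can do the split. This is a minimal sketch: `split_reasoning` is a hypothetical helper, and the sample output is invented for illustration.

```python
def split_reasoning(text):
    """Split raw model output into (reasoning, observation).

    Handles both "<think>...</think>answer" and outputs where the opening
    tag was already part of the prompt, leaving only "...</think>answer".
    """
    head, sep, tail = text.partition("</think>")
    if not sep:
        # No closing tag: treat the whole text as the observation.
        return "", text.strip()
    reasoning = head.replace("<think>", "", 1).strip()
    return reasoning, tail.strip()


# Invented sample output, for illustration only.
raw = (
    "<think>\nThe directory probably holds a small project.\n</think>\n\n"
    "total 8\ndrwxr-xr-x 2 user user 4096 Jan  1 00:00 ."
)
reasoning, observation = split_reasoning(raw)
print(observation)
```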

Usage — the environment-simulator format

python
messages = [
  {"role": "system",
   "content": "You are a language world model simulating a Linux terminal environment. "
              "Given the user's command, predict the terminal output."},
  {"role": "user", "content": "Action: execute_bash\nCommand: ls -la /home/user/project/"},
]

The model returns the predicted next observation. Per-domain system-prompt templates for all seven domains live in the upstream repository.
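
Putting the pieces together, a request to the server might look like the sketch below. It assumes vLLM's OpenAI-compatible endpoint on its default address (`http://localhost:8000/v1`) and passes `top_k` as a vLLM-specific extra field in the request body. The helpers `build_payload` and `predict_observation` and the `max_tokens` value are illustrative, not part of the upstream project.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # assumption: vLLM default host and port
MODEL = "sakamakismile/Qwen-AgentWorld-35B-A3B-NVFP4"


def build_payload(messages, max_tokens=2048):
    """Build a chat-completions body with the recommended sampling settings."""
    return {
        "model": MODEL,
        "messages": messages,
        "temperature": 0.6,
        "top_p": 0.95,
        "top_k": 20,               # vLLM-specific extra sampling parameter
        "max_tokens": max_tokens,  # assumption: illustrative limit
    }


def predict_observation(messages):
    """POST the request and return the predicted observation text."""
    request = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_payload(messages)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


messages = [
    {"role": "system",
     "content": "You are a language world model simulating a Linux terminal environment. "
                "Given the user's command, predict the terminal output."},
    {"role": "user", "content": "Action: execute_bash\nCommand: ls -la /home/user/project/"},
]
payload = build_payload(messages)
# predict_observation(messages)  # uncomment once the server is running
```

Only the standard library is used here so the sketch has no extra dependencies; an OpenAI-compatible client library would work equally well.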

Performance & fidelity

On a single node of 4× RTX PRO 2000 Blackwell (16 GB, SM120) at TP=4, single-stream throughput is ~161 tok/s. The NVFP4 packing preserves the model's world-modeling behaviour: for example, it predicts well-formed terminal output (total line, permission bits, . and .. entries) and realistic, locale-appropriate search-result listings, and it stays consistent with the supplied environment state.

License & attribution

Apache-2.0, inherited from the base model. Quantized by Lna-Lab. These weights are a faithful NVFP4 packing of Qwen/Qwen-AgentWorld-35B-A3B; all credit for the model itself goes to the Qwen team.
