Run this model inference on single tenant GPU with unmatched speed and reliability at scale.
Get help setting up a custom Dedicated Endpoints.
Talk with our engineer to get a quote for reserved GPU instances with discounts.
README
License: apache-2.0Why this model exists — research context
I'm evaluating the performance/efficiency frontier of agentic LLMs operating inside secure sandboxes, and using the agent itself as the adversary against the sandbox. Two threads:
-
Isolation-backend benchmark. Running an autonomous, tool-using agent as the in-sandbox workload while comparing the isolation/runtime layers that wrap it:
- runc / standard OCI containers (shared-kernel baseline)
- Kata Containers (per-workload VM isolation)
- Cloud Hypervisor (CLH)
- QEMU/KVM
- Firecracker (microVM)
The interest is the trade-off curve: cold-start latency, per-turn tool-call overhead, memory footprint and throughput penalty of each boundary, measured under a realistic agent loop (filesystem, shell, network tools) rather than a synthetic benchmark.
-
Sandbox-escape challenge suite. A graded set of tasks that explicitly instruct the agent to break out of its isolation boundary — escalate from the container/VM to the host, reach a forbidden network segment, tamper with the orchestration layer, exfiltrate a planted secret. A compliance-reduced model is the right instrument here: an agent that refuses the task tells you nothing about whether the boundary holds. The model is the maximally-cooperative attacker; the thing under test is the isolation layer's ability to contain it.
An uncensored, tool-call-reliable, long-context model that fits one workstation GPU is what this work needs. FP8 is what makes it fit with serving headroom to spare.
Why FP8-dynamic specifically
- Footprint: ~34 GB on disk (down from ~67 GB BF16), leaving ample VRAM on a single 96 GB-class Blackwell card for a 131072-token KV pool plus a co-resident auxiliary model.
- Quality: dynamic per-token activation scaling with per-channel weight scales is near-lossless for instruction/agentic use and avoids calibration-data bias.
- Portability:
compressed-tensorsfloat-quantizedis first-class in vLLM and does not require a Blackwell-only kernel path, unlike NVFP4 MoE. - MoE-safe recipe: per-channel/per-token (not per-tensor or block-128), which sidesteps the MoE expert dimension-mismatch and block-shape failures that block other FP8 schemes on this 256-expert architecture.
Quantization details
| Method | compressed-tensors, float-quantized (FP8 E4M3) |
| Weights | 8-bit float, per-channel, static |
| Activations | 8-bit float, per-token, dynamic |
| Tool | llm-compressor 0.11.0, QuantizationModifier(scheme="FP8_DYNAMIC") |
| Calibration | None (data-free) |
| Kept in BF16 (ignore-list) | every MoE router (mlp.gate), every shared_expert_gate, all norms, lm_head, embed_tokens, and the mtp.* tensors |
| Architecture | Qwen3_5MoeForConditionalGeneration, 40 layers (30 linear/Mamba + 10 full-attention), 256 experts / 8 active, head_dim 256 |
| Native context | 262144 (served here at 131072 — see notes) |
MTP / speculative decoding: the upstream BF16 checkpoint preserves the multi-token-prediction head, but
AutoModelForCausalLMdropsmtp.*tensors at load time, so they are not present in this quant and NEXTN/MTP speculative decoding is not available here. An MTP-preserving re-quant (grafting the 19mtp.*tensors back as a sidecar) is a possible follow-up.
Serving (vLLM)
Validated on vLLM 0.22.1, single RTX PRO 6000 Blackwell (sm_120):
bash
vllm serve <this-repo> \--served-model-name qwen36-35b-heretic \--max-model-len 131072 \--gpu-memory-utilization 0.50 \--enable-auto-tool-choice \--tool-call-parser qwen3_coder \--reasoning-parser qwen3 \--enable-prefix-caching --enable-chunked-prefill
Measured single-stream: ~99k-token prefill in ~2.2 s; ~195 tok/s decode; 20/20 tool calls returned well-formed JSON arguments. Thinking mode is on by default; chat_template_kwargs={"enable_thinking": false} disables it, and {"preserve_thinking": true} retains historical reasoning across turns.
A note on context length
Served at 131072 rather than the native 262144 on purpose: usable attention quality degrades well before the nominal window on long-context models, and an agent should live in the high-quality range. Raise it if your workload needs more and you've validated the quality.
SGLang caveat
This checkpoint loads in SGLang but the compressed-tensors FP8 MoE path falls back to a Triton fused-MoE kernel that has no tuned config for sm_120 + 256 experts (requests ~147 KB shared memory vs the card's 101 KB limit). Serve it with vLLM. Dense Qwen3.6 FP8 quants are unaffected.
Intended use & responsible use
Solely security research. This card documents a quantization-recipe strategy; it is not a distribution of usable model weights. The author does not distribute these weights for production or end-user use, and does not consent to redistribution. Intended audience is qualified researchers studying quantization methods, LLM safety/alignment robustness, and abliteration as an attack vector against open weights — working inside isolated, non-production environments with no access to real user data or systems.
This is a compliance-reduced model: its safety refusals have been substantially removed by the upstream abliteration. It will attempt harmful, unsafe, or escape-oriented instructions by design — that property is what makes it useful as a research instrument, and also the reason it must not be exposed to untrusted users or the open internet, used in production, redistributed, or used to act against systems you do not own and are not authorized to test. You are responsible for compliant, lawful use. No additional safety guarantees over the base model are provided or implied; quantization does not add safety.
Lineage & licenses
- Base:
Qwen/Qwen3.6-35B-A3B— Apache-2.0 - Abliteration:
llmfan46/Qwen3.6-35B-A3B-uncensored-heretic-Native-MTP-Preserved(Heretic v1.3.0) — Apache-2.0 - This quant: Apache-2.0. Tooling:
llm-compressor,compressed-tensors.
Model provider
sch0tten
Model tree
Base
Qwen/Qwen3.6-35B-A3B
Quantized
this model
Modalities
Input
Video, Text, Image
Output
Text
Pricing
Dedicated Endpoints
View detailsSupported Functionality
Model APIs
Dedicated Endpoints
Container
More information