sch0tten

Qwen3.6-35B-A3B-heretic-FP8

README

License: apache-2.0

Why this model exists — research context

I'm evaluating the performance/efficiency frontier of agentic LLMs operating inside secure sandboxes, and using the agent itself as the adversary against the sandbox. Two threads:

Isolation-backend benchmark. Running an autonomous, tool-using agent as the in-sandbox workload while comparing the isolation/runtime layers that wrap it:
- runc / standard OCI containers (shared-kernel baseline)
- Kata Containers (per-workload VM isolation)
- Cloud Hypervisor (CLH)
- QEMU/KVM
- Firecracker (microVM)
The interest is the trade-off curve: cold-start latency, per-turn tool-call overhead, memory footprint and throughput penalty of each boundary, measured under a realistic agent loop (filesystem, shell, network tools) rather than a synthetic benchmark.
Sandbox-escape challenge suite. A graded set of tasks that explicitly instruct the agent to break out of its isolation boundary — escalate from the container/VM to the host, reach a forbidden network segment, tamper with the orchestration layer, exfiltrate a planted secret. A compliance-reduced model is the right instrument here: an agent that refuses the task tells you nothing about whether the boundary holds. The model is the maximally-cooperative attacker; the thing under test is the isolation layer's ability to contain it.

An uncensored, tool-call-reliable, long-context model that fits one workstation GPU is what this work needs. FP8 is what makes it fit with serving headroom to spare.

Why FP8-dynamic specifically

Footprint: ~34 GB on disk (down from ~67 GB BF16), leaving ample VRAM on a single 96 GB-class Blackwell card for a 131072-token KV pool plus a co-resident auxiliary model.
Quality: dynamic per-token activation scaling with per-channel weight scales is near-lossless for instruction/agentic use and avoids calibration-data bias.
Portability: compressed-tensors float-quantized is first-class in vLLM and does not require a Blackwell-only kernel path, unlike NVFP4 MoE.
MoE-safe recipe: per-channel/per-token (not per-tensor or block-128), which sidesteps the MoE expert dimension-mismatch and block-shape failures that block other FP8 schemes on this 256-expert architecture.

Quantization details

Table

Method	`compressed-tensors`, `float-quantized` (FP8 E4M3)
Weights	8-bit float, per-channel, static
Activations	8-bit float, per-token, dynamic
Tool	`llm-compressor` 0.11.0, `QuantizationModifier(scheme="FP8_DYNAMIC")`
Calibration	None (data-free)
Kept in BF16 (ignore-list)

MTP / speculative decoding: the upstream BF16 checkpoint preserves the multi-token-prediction head, but AutoModelForCausalLM drops mtp.* tensors at load time, so they are not present in this quant and NEXTN/MTP speculative decoding is not available here. An MTP-preserving re-quant (grafting the 19 mtp.* tensors back as a sidecar) is a possible follow-up.

Serving (vLLM)

Validated on vLLM 0.22.1, single RTX PRO 6000 Blackwell (sm_120):

bash
vllm serve <this-repo> \
  --served-model-name qwen36-35b-heretic \
  --max-model-len 131072 \
  --gpu-memory-utilization 0.50 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3 \
  --enable-prefix-caching --enable-chunked-prefill

Measured single-stream: ~99k-token prefill in ~2.2 s; ~195 tok/s decode; 20/20 tool calls returned well-formed JSON arguments. Thinking mode is on by default; chat_template_kwargs={"enable_thinking": false} disables it, and {"preserve_thinking": true} retains historical reasoning across turns.

A note on context length

Served at 131072 rather than the native 262144 on purpose: usable attention quality degrades well before the nominal window on long-context models, and an agent should live in the high-quality range. Raise it if your workload needs more and you've validated the quality.

SGLang caveat

This checkpoint loads in SGLang but the compressed-tensors FP8 MoE path falls back to a Triton fused-MoE kernel that has no tuned config for sm_120 + 256 experts (requests ~147 KB shared memory vs the card's 101 KB limit). Serve it with vLLM. Dense Qwen3.6 FP8 quants are unaffected.

Intended use & responsible use

Solely security research. This card documents a quantization-recipe strategy; it is not a distribution of usable model weights. The author does not distribute these weights for production or end-user use, and does not consent to redistribution. Intended audience is qualified researchers studying quantization methods, LLM safety/alignment robustness, and abliteration as an attack vector against open weights — working inside isolated, non-production environments with no access to real user data or systems.

This is a compliance-reduced model: its safety refusals have been substantially removed by the upstream abliteration. It will attempt harmful, unsafe, or escape-oriented instructions by design — that property is what makes it useful as a research instrument, and also the reason it must not be exposed to untrusted users or the open internet, used in production, redistributed, or used to act against systems you do not own and are not authorized to test. You are responsible for compliant, lawful use. No additional safety guarantees over the base model are provided or implied; quantization does not add safety.

Lineage & licenses

Base: Qwen/Qwen3.6-35B-A3B — Apache-2.0
Abliteration: llmfan46/Qwen3.6-35B-A3B-uncensored-heretic-Native-MTP-Preserved (Heretic v1.3.0) — Apache-2.0
This quant: Apache-2.0. Tooling: llm-compressor, compressed-tensors.

Available on FriendliAI

Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more

Model Details

Model Provider

sch0tten

Model Tree

Base

Qwen/Qwen3.6-35B-A3B

Quantized

this model

Input Modalities

Text

Image

Video

Output Modalities