gsting

Qwen3.6-35B-A3B-abliterated-FP8

License: apache-2.0

Model Lineage

Base: Qwen/Qwen3.6-35B-A3B (BF16, hybrid linear-attention + full-attention MoE with 40 layers, 256 experts, vision encoder)
Abliterated (refusals removed) by Huihui: huihui-ai/Huihui-Qwen3.6-35B-A3B-abliterated
This repo: FP8 quantization matching the format of Qwen/Qwen3.6-35B-A3B-FP8

Why FP8

Qwen3.6-35B-A3B in BF16 is ~72 GB on disk. FP8 cuts that to ~37 GB while keeping the vision layers and precision-sensitive modules in BF16. We expect a throughput uplift on DGX Spark similar to what we measured for Qwen3.5 (31 → 51 tokens/s, about +65%).

Quantization Details

Scheme: native FP8 blockwise, identical on-disk format to the official Qwen/Qwen3.6-35B-A3B-FP8.

| Field | Value |
| --- | --- |
| `quant_method` | `fp8` |
| `activation_scheme` | `dynamic` (per-token, at inference) |
| `fmt` | `e4m3` |
| `weight_block_size` | `[128, 128]` |
| Scale dtype / key | `bf16`, `*.weight_scale_inv` |
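
As a quick check, the fields in the table can be compared against a checkpoint's `config.json`. This is a minimal sketch: the helper name `quant_config_mismatches` is hypothetical, and it assumes the settings sit under a top-level `quantization_config` key, which may not hold for every architecture.

```python
import json
from pathlib import Path

# Values taken from the table above.
EXPECTED = {
    "quant_method": "fp8",
    "activation_scheme": "dynamic",
    "fmt": "e4m3",
    "weight_block_size": [128, 128],
}


def quant_config_mismatches(config: dict) -> dict:
    """Return {field: found_value} for every expected field that differs.

    Assumption: the quantization settings live under config["quantization_config"].
    """
    found = config.get("quantization_config", {})
    return {k: found.get(k) for k, v in EXPECTED.items() if found.get(k) != v}


def check_config_file(path: str) -> dict:
    """Load a local config.json and report mismatches (empty dict means OK)."""
    return quant_config_mismatches(json.loads(Path(path).read_text()))
```

Calling `check_config_file("config.json")` on a local copy of the checkpoint should return an empty dict if the stored settings agree with the table.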

Quantized (weights → FP8 e4m3, per-block [128, 128] scales):

All 2D Linear *.weight in language layers that aren't in the exclusion list, including:
- self_attn.{q,k,v,o}_proj (full-attention layers)
- linear_attn.{in_proj_qkv, in_proj_z, out_proj} (linear-attention / mamba layers)
- mlp.shared_expert.{gate,up,down}_proj
- All 256 experts per MoE layer, un-fused to match Qwen's official per-expert layout:
  - mlp.experts.{0..255}.{gate_proj, up_proj, down_proj}.weight
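
To make the per-block scaling concrete, here is a small pure-Python sketch of blockwise E4M3 quantization. It is illustrative only and is not the script used to produce this repo: E4M3 rounding is simulated with ordinary floats, the function names are hypothetical, and it assumes the common convention that the stored `weight_scale_inv` is `amax / 448` per block, so that dequantization is `fp8_value * weight_scale_inv`.

```python
import math

E4M3_MAX = 448.0  # largest finite value of the e4m3 (fn) format


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest E4M3-representable value (simulated, saturating)."""
    if x == 0.0:
        return 0.0
    a = min(abs(x), E4M3_MAX)
    exp = max(math.frexp(a)[1] - 1, -6)  # floor(log2(a)), clamped at the min normal exponent
    step = 2.0 ** (exp - 3)              # 3 mantissa bits: 8 steps per power of two
    return math.copysign(min(round(a / step) * step, E4M3_MAX), x)


def quantize_blockwise(weight, block=(128, 128)):
    """Quantize a 2D list of floats; returns (q, scale_inv) with one scale per block."""
    rows, cols = len(weight), len(weight[0])
    br, bc = block
    n_br, n_bc = math.ceil(rows / br), math.ceil(cols / bc)
    q = [[0.0] * cols for _ in range(rows)]
    scale_inv = [[1.0] * n_bc for _ in range(n_br)]
    for bi in range(n_br):
        for bj in range(n_bc):
            r_idx = range(bi * br, min((bi + 1) * br, rows))
            c_idx = range(bj * bc, min((bj + 1) * bc, cols))
            amax = max(abs(weight[r][c]) for r in r_idx for c in c_idx)
            s = amax / E4M3_MAX if amax > 0 else 1.0  # map the block's amax onto 448
            scale_inv[bi][bj] = s
            for r in r_idx:
                for c in c_idx:
                    q[r][c] = round_to_e4m3(weight[r][c] / s)
    return q, scale_inv


def dequantize_blockwise(q, scale_inv, block=(128, 128)):
    """Inverse mapping: multiply each value by the scale of the block it belongs to."""
    br, bc = block
    return [[q[r][c] * scale_inv[r // br][c // bc] for c in range(len(q[0]))]
            for r in range(len(q))]
```

With `block=(128, 128)`, a `[2048, 512]` weight gets a `[16, 4]` grid of scales; the tiny block sizes are only useful for experimenting with small matrices.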

Kept in BF16 (matches Qwen's modules_to_not_convert):

| Module | Reason |
| --- | --- |
| `lm_head` | Output head (precision-sensitive) |
| `model.language_model.embed_tokens` | Embedding layer |
| `.input_layernorm`, `.post_attention_layernorm` | LayerNorms |
| `*.self_attn.{q_norm, k_norm}` | QK norms |
| `*.linear_attn.{A_log, conv1d, dt_bias, in_proj_a, in_proj_b, in_proj_ba, norm}` | Mamba state-space params (small, sensitive) |
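
The selection rule described above (quantize 2D `*.weight` tensors in language layers unless excluded) could be sketched as follows. The function name `should_quantize` is hypothetical, the exclusion sets only mirror the rows of the table, and treating everything outside the `model.language_model.` prefix (for example `lm_head` and the vision encoder) as BF16 is an assumption; the real `modules_to_not_convert` list may contain further entries.

```python
# Module names copied from the table above.
EXCLUDED_ANYWHERE = {
    "embed_tokens", "input_layernorm", "post_attention_layernorm",
    "q_norm", "k_norm",
}
EXCLUDED_IN_LINEAR_ATTN = {
    "A_log", "conv1d", "dt_bias", "in_proj_a", "in_proj_b", "in_proj_ba", "norm",
}


def should_quantize(name: str, ndim: int) -> bool:
    """Decide whether a tensor is converted to FP8 or kept in BF16."""
    parts = name.split(".")
    # Assumption: only tensors under the language model are candidates.
    if not name.startswith("model.language_model."):
        return False
    # Only 2D Linear weights are quantized (fused 3D experts are un-fused first).
    if ndim != 2 or parts[-1] != "weight":
        return False
    # Match whole path components so "in_proj_qkv" is not confused with "in_proj_a".
    if any(p in EXCLUDED_ANYWHERE for p in parts):
        return False
    if "linear_attn" in parts and any(p in EXCLUDED_IN_LINEAR_ATTN for p in parts):
        return False
    return True
```

Matching on dotted path components rather than substrings is a deliberate choice here, since several quantized and excluded projections share the `in_proj_` prefix.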

Notable Implementation Notes

The source checkpoint stores experts as fused 3D tensors (mlp.experts.gate_up_proj with shape [256, 1024, 2048] and mlp.experts.down_proj with shape [256, 2048, 512]). We un-fuse them into the per-expert layout the official Qwen FP8 checkpoint uses (mlp.experts.{E}.{gate, up, down}_proj.weight), which is what vLLM's FP8 MoE loader expects.
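
The un-fusing step can be sketched as below. This is an illustration rather than the actual conversion script: `unfuse_experts` is a hypothetical name, and it assumes the fused `gate_up_proj` stores the gate rows in the first half of dimension 1 and the up rows in the second half. It is written against plain nested lists, but the same indexing and slicing works on a tensor.

```python
def unfuse_experts(prefix, gate_up_proj, down_proj):
    """Split fused [E, 2*I, H] and [E, H, I] expert tensors into per-expert 2D weights.

    prefix is the module path of the experts container, e.g.
    "model.language_model.layers.0.mlp.experts" (path assumed for illustration).
    """
    out = {}
    for e in range(len(gate_up_proj)):
        gate_up = gate_up_proj[e]          # [2*I, H] for expert e
        half = len(gate_up) // 2           # I rows each for gate and up
        # Assumption: gate occupies the first half, up the second half.
        out[f"{prefix}.{e}.gate_proj.weight"] = gate_up[:half]
        out[f"{prefix}.{e}.up_proj.weight"] = gate_up[half:]
        out[f"{prefix}.{e}.down_proj.weight"] = down_proj[e]   # [H, I]
    return out
```

With the shapes quoted above, each layer yields 256 × 3 = 768 two-dimensional weights: `gate_proj` and `up_proj` of shape [512, 2048] and `down_proj` of shape [2048, 512].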
Streaming quantization: source shards are processed one at a time on the GPU, so peak host memory stays at ~6 GB. This avoids the llmcompressor pitfall where peak VM usage grew to 168 GB during the Compressing phase and the process was OOM-killed on the DGX Spark's 128 GB unified-memory budget.
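
The shard-at-a-time pattern might look like the sketch below. It only shows the control flow: `load_shard`, `convert`, and `save_shard` are hypothetical callables supplied by the caller (for example thin wrappers around a safetensors reader and writer), and the output file naming is an assumption. The only format detail relied on is the `weight_map` (tensor name to shard file) found in a `model.safetensors.index.json`.

```python
from collections import defaultdict


def tensors_by_shard(index: dict) -> dict:
    """Group tensor names by the shard file that stores them."""
    groups = defaultdict(list)
    for name, shard in index["weight_map"].items():
        groups[shard].append(name)
    return dict(groups)


def stream_convert(index, load_shard, convert, save_shard):
    """Convert a sharded checkpoint while holding one source shard at a time.

    load_shard(filename) -> {name: tensor}
    convert(name, tensor) -> {new_name: new_tensor}   (may emit several tensors,
                                                       e.g. weight + weight_scale_inv)
    save_shard(filename, {name: tensor}) -> None
    """
    groups = sorted(tensors_by_shard(index).items())
    new_weight_map = {}
    for i, (shard, names) in enumerate(groups, start=1):
        tensors = load_shard(shard)              # only this shard is resident
        converted = {}
        for name in names:
            converted.update(convert(name, tensors[name]))
        out_name = f"model-{i:05d}-of-{len(groups):05d}.safetensors"
        save_shard(out_name, converted)
        new_weight_map.update({k: out_name for k in converted})
        del tensors, converted                   # drop references before the next shard
    return {"weight_map": new_weight_map}
```

Because each output shard is written before the next input shard is loaded, peak memory is bounded by the largest single shard plus its converted copy rather than by the whole model.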
Sanity check: the median relative error after a quantize-then-dequantize round trip is ~2.2% per tensor, as expected for blockwise E4M3.
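
A back-of-the-envelope estimate shows why a median near 2.2% is plausible for a 3-bit mantissa. The derivation below is a rough model rather than a measurement, and its assumptions are not from this README: round-to-nearest, values roughly log-uniform within each power-of-two interval, and a rounding error independent of the value. The per-block scale does not affect relative error, so it drops out.

```latex
% Assumptions: round-to-nearest, x log-uniform on [2^k, 2^{k+1}),
% rounding error e independent of x. Subnormals and saturation are ignored.
\begin{align*}
s &= 2^{k-3}
  && \text{spacing of E4M3 values on } [2^k, 2^{k+1}) = [8s, 16s) \\
e &= |x - \hat{x}| \sim \mathrm{U}\!\left[0, \tfrac{s}{2}\right]
  && \text{rounding error} \\
\Pr\!\left[\tfrac{e}{x} \le t\right]
  &= \mathbb{E}_x\!\left[\frac{2 t x}{s}\right]
   = \frac{2t}{s} \cdot \frac{8s}{\ln 2}
   = \frac{16\,t}{\ln 2}
  && \text{for } t \le \tfrac{1}{32} \\
\frac{16\,t_{\mathrm{med}}}{\ln 2} &= \frac{1}{2}
  \;\Longrightarrow\;
  t_{\mathrm{med}} = \frac{\ln 2}{32} \approx 0.0217
\end{align*}
```

Under these assumptions the median relative error is about 2.2%, and the worst case for normal values stays below 1/16 = 6.25%.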

Numbers

| | BF16 source | This FP8 |
| --- | --- | --- |
| Size on disk | ~72 GB | ~37 GB |
| Tensors in index | 1045 (fused experts) | 64189 (un-fused) |
| FP8 weight tensors | n/a | 31738 |
| BF16 weight tensors | n/a | 32451 (incl. 31738 `weight_scale_inv`) |
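
The counts in the table can be cross-checked with a few lines of arithmetic. The totals come from the table; the per-expert breakdown assumes that all 40 layers carry 256 routed experts with three projections each, which is an inference from the lineage description and not a figure stated in this README.

```python
# Figures from the table above.
total_tensors = 64189
fp8_weights = 31738
bf16_tensors = 32451        # includes the weight_scale_inv tensors
scale_tensors = 31738       # one scale tensor per FP8 weight

# Assumption: every one of the 40 layers has 256 routed experts x 3 projections.
layers, experts, proj_per_expert = 40, 256, 3
expert_fp8 = layers * experts * proj_per_expert

other_fp8 = fp8_weights - expert_fp8             # attention, shared expert, etc.
unquantized_bf16 = bf16_tensors - scale_tensors  # tensors actually kept in BF16

assert fp8_weights + bf16_tensors == total_tensors
assert scale_tensors == fp8_weights
print(expert_fp8, other_fp8, unquantized_bf16)
```

Under that assumption, per-expert weights account for 30720 of the 31738 FP8 tensors, and 713 tensors are genuinely kept in BF16 once the scales are set aside.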

Loading

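```python
from transformers import AutoModelForImageTextToText, AutoProcessor

# Load the FP8 checkpoint; dtype="auto" keeps the dtypes stored in the checkpoint.
model = AutoModelForImageTextToText.from_pretrained(
    "batsclamp/Huihui-Qwen3.6-35B-A3B-abliterated-FP8",
    dtype="auto",
    device_map="auto",
)
# The processor bundles the tokenizer and the image/video preprocessing.
processor = AutoProcessor.from_pretrained("batsclamp/Huihui-Qwen3.6-35B-A3B-abliterated-FP8")
```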

For vLLM, point at the repo — the quantization_config is already correctly set (quant_method: fp8, weight-block [128, 128], dynamic activations).
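
A minimal offline-inference sketch with vLLM's Python API is shown below. It is untested against this checkpoint: the `max_model_len` value, the sampling settings, and the helper name `generate_once` are illustrative assumptions, and the repo ID is the one used in the Transformers snippet above. No quantization argument is passed, since vLLM is expected to pick the settings up from the checkpoint's `quantization_config`.

```python
REPO_ID = "batsclamp/Huihui-Qwen3.6-35B-A3B-abliterated-FP8"


def generate_once(prompt: str, max_tokens: int = 64) -> str:
    """Load the model with vLLM and return one completion for a text prompt."""
    # Imported lazily so this module can be imported without vLLM installed.
    from vllm import LLM, SamplingParams

    # max_model_len is an illustrative value, chosen to limit KV-cache memory.
    llm = LLM(model=REPO_ID, max_model_len=8192)
    params = SamplingParams(temperature=0.7, max_tokens=max_tokens)
    outputs = llm.generate([prompt], params)
    return outputs[0].outputs[0].text


if __name__ == "__main__":
    print(generate_once("Explain blockwise FP8 quantization in one sentence."))
```

For serving rather than offline use, the same repo ID can be passed to `vllm serve`.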

Disclaimer

Abliterated model. Not recommended if you expect a polite corporate assistant.
