Model Lineage
Why FP8
Qwen3.6-35B-A3B in BF16 is ~72 GB on disk. FP8 cuts that to ~37 GB while preserving vision layers and precision-sensitive modules in BF16. The expected throughput uplift on DGX Spark is on par with what we saw for Qwen3.5 (31 → 51 t/s, ~65%).
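
The figures above can be checked with a few lines of arithmetic. This is only a sketch: the sizes and throughput numbers are the ones quoted in this section, and the per-weight overhead assumes one 2-byte BF16 scale per 128×128 block, as described under Quantization Details.

```python
# Figures quoted in the text above.
bf16_gb, fp8_gb = 72, 37
tps_before, tps_after = 31, 51  # Qwen3.5 throughput on DGX Spark, in t/s

uplift_pct = (tps_after - tps_before) / tps_before * 100
size_ratio = fp8_gb / bf16_gb

# Assumption: one BF16 scale (2 bytes) per 128x128 block of 1-byte FP8 weights.
block_elems = 128 * 128
bytes_per_fp8_weight = 1 + 2 / block_elems

print(f"throughput uplift: {uplift_pct:.1f}%")
print(f"FP8 / BF16 size ratio: {size_ratio:.2f}")
print(f"bytes per quantized weight incl. scale: {bytes_per_fp8_weight:.6f}")
```

The scale overhead is tiny, so the part of the FP8 checkpoint above half the BF16 size presumably comes from the modules kept in BF16 rather than from the scales.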
Quantization Details
Scheme: native FP8 blockwise, identical on-disk format to the official Qwen/Qwen3.6-35B-A3B-FP8.
| Field | Value |
|---|---|
| `quant_method` | `fp8` |
| `activation_scheme` | `dynamic` (per-token, at inference) |
| `fmt` | `e4m3` |
| `weight_block_size` | `[128, 128]` |
| Scale dtype / key | bf16, `*.weight_scale_inv` |
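
To make the scheme concrete, here is a minimal pure-Python sketch of blockwise quantization with one scale per block. It is not the script used for this repo. The E4M3 rounding is simulated with `math.frexp` instead of a real FP8 dtype, the helper names are hypothetical, the scale is kept as a Python float rather than BF16, and it assumes the stored `weight_scale_inv` is the multiplier applied at dequantization.

```python
import math

E4M3_MAX = 448.0      # largest finite value in the e4m3 (fn) format
MIN_NORMAL_EXP = -6   # below 2**-6 the spacing stays fixed (subnormals)


def round_to_e4m3(x: float) -> float:
    """Simulate rounding to a value with 3 mantissa bits, clamped to +-448."""
    if x == 0.0:
        return 0.0
    x = max(-E4M3_MAX, min(E4M3_MAX, x))
    exp = max(math.frexp(abs(x))[1] - 1, MIN_NORMAL_EXP)
    step = 2.0 ** (exp - 3)  # spacing between neighbours at this exponent
    return round(x / step) * step


def quantize_blockwise(w, block=(128, 128)):
    """Return (q, scale_inv): simulated FP8 values and one scale per block."""
    rows, cols = len(w), len(w[0])
    br, bc = block
    q = [[0.0] * cols for _ in range(rows)]
    scale_inv = []
    for r0 in range(0, rows, br):
        scale_row = []
        for c0 in range(0, cols, bc):
            cells = [(r, c)
                     for r in range(r0, min(r0 + br, rows))
                     for c in range(c0, min(c0 + bc, cols))]
            amax = max(abs(w[r][c]) for r, c in cells)
            s = amax / E4M3_MAX if amax > 0 else 1.0  # dequant multiplier
            for r, c in cells:
                q[r][c] = round_to_e4m3(w[r][c] / s)
            scale_row.append(s)
        scale_inv.append(scale_row)
    return q, scale_inv


def dequantize_blockwise(q, scale_inv, block=(128, 128)):
    """Multiply each simulated FP8 value by the scale of its block."""
    br, bc = block
    return [[q[r][c] * scale_inv[r // br][c // bc] for c in range(len(q[0]))]
            for r in range(len(q))]


# Toy 4x4 weight with 2x2 blocks so the example stays readable.
w = [[0.5, -1.0, 0.02, 0.01],
     [0.25, 2.0, -0.03, 0.04],
     [8.0, -4.0, 0.001, 0.002],
     [1.0, 3.0, -0.004, 0.003]]
q, scale_inv = quantize_blockwise(w, block=(2, 2))
w_hat = dequantize_blockwise(q, scale_inv, block=(2, 2))
max_rel_err = max(abs(a - b) / abs(a)
                  for row_a, row_b in zip(w, w_hat)
                  for a, b in zip(row_a, row_b))
print(f"max relative error: {max_rel_err:.4f}")
```

Because each block gets its own scale, the small values in the right-hand blocks are quantized relative to their own maximum rather than to the largest weight in the whole matrix.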
Quantized (weights → FP8 e4m3, per-block [128, 128] scales):
- All 2D Linear `*.weight` in language layers that aren't in the exclusion list, including:
  - `self_attn.{q,k,v,o}_proj` (full-attention layers)
  - `linear_attn.{in_proj_qkv, in_proj_z, out_proj}` (linear-attention / mamba layers)
  - `mlp.shared_expert.{gate,up,down}_proj`
- All 256 experts per MoE layer, un-fused to match Qwen's official per-expert layout:
  `mlp.experts.{0..255}.{gate_proj, up_proj, down_proj}.weight`
Kept in BF16 (matches Qwen's modules_to_not_convert):
| Module | Reason |
|---|---|
| `lm_head` | Output head, precision-sensitive |
| `model.language_model.embed_tokens` | Embedding layer |
| `*.input_layernorm`, `*.post_attention_layernorm` | LayerNorms |
| `*.self_attn.{q_norm, k_norm}` | QK norms |
| `*.linear_attn.{A_log, conv1d, dt_bias, in_proj_a, in_proj_b, in_proj_ba, norm}` | Mamba state-space params (small, sensitive) |
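
One way to apply an exclusion list like this is to match module names against glob patterns. The sketch below is an illustration, not the code used here: `KEEP_BF16`, `module_of` and `keep_in_bf16` are hypothetical names, the brace groups from the table are expanded by hand because `fnmatch` has no brace expansion, and the example tensor paths are assumed. Vision layers, which are also kept in BF16, are not covered because the table gives no pattern for them.

```python
from fnmatch import fnmatchcase

# Patterns transcribed from the table above, brace groups expanded by hand.
KEEP_BF16 = [
    "lm_head",
    "model.language_model.embed_tokens",
    "*.input_layernorm",
    "*.post_attention_layernorm",
    "*.self_attn.q_norm",
    "*.self_attn.k_norm",
] + [
    f"*.linear_attn.{name}"
    for name in ("A_log", "conv1d", "dt_bias", "in_proj_a",
                 "in_proj_b", "in_proj_ba", "norm")
]


def module_of(tensor_name: str) -> str:
    """Strip a trailing .weight / .bias so patterns match module names."""
    for suffix in (".weight", ".bias"):
        if tensor_name.endswith(suffix):
            return tensor_name[: -len(suffix)]
    return tensor_name


def keep_in_bf16(tensor_name: str) -> bool:
    """True if the tensor belongs to a module on the exclusion list."""
    module = module_of(tensor_name)
    return any(fnmatchcase(module, pattern) for pattern in KEEP_BF16)


# Hypothetical tensor names, for illustration only.
prefix = "model.language_model.layers.3"
for name in (
    "lm_head.weight",
    f"{prefix}.linear_attn.A_log",
    f"{prefix}.linear_attn.in_proj_qkv.weight",
    f"{prefix}.mlp.experts.17.up_proj.weight",
):
    print(name, "->", "BF16" if keep_in_bf16(name) else "FP8")
```

Matching on the full module name (rather than a substring) is what keeps `in_proj_a` excluded while `in_proj_qkv` is still quantized.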
Notable Implementation Notes
- Source experts were fused 3D (`mlp.experts.gate_up_proj[256, 1024, 2048]`, `mlp.experts.down_proj[256, 2048, 512]`); we un-fuse them to the per-expert layout the official Qwen FP8 uses (`mlp.experts.{E}.{gate, up, down}_proj.weight`). This is what vLLM's Fp8 MoE loader expects.
- Streaming quantization: source shards are processed one at a time on the GPU, with peak host memory ~6 GB. This avoids the llmcompressor pitfall where peak virtual memory grew to 168 GB during the Compressing phase and the process was OOM-killed on the 128 GB DGX Spark unified-memory budget.
- Sanity check: round-trip dequantization median relative error ~2.2% per tensor (as expected for E4M3 blockwise).
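
The un-fusing step from the first note can be sketched as follows. This is an illustration under stated assumptions, not the conversion script itself: `unfuse_experts` is a hypothetical helper, the `prefix` value is made up, and it assumes the gate rows come first in the fused `gate_up_proj` tensor, followed by the up rows. It runs on nested lists so it needs no dependencies; tensor types that support the same indexing and slicing should behave alike.

```python
def unfuse_experts(prefix, gate_up, down):
    """Split fused expert tensors into per-expert entries.

    gate_up: [E, 2*I, H]   (source: [256, 1024, 2048])
    down:    [E, H, I]     (source: [256, 2048, 512])
    """
    out = {}
    for e in range(len(gate_up)):
        half = len(gate_up[e]) // 2  # I rows for gate, I rows for up
        # Assumption: gate rows first, then up rows.
        out[f"{prefix}.experts.{e}.gate_proj.weight"] = gate_up[e][:half]
        out[f"{prefix}.experts.{e}.up_proj.weight"] = gate_up[e][half:]
        out[f"{prefix}.experts.{e}.down_proj.weight"] = down[e]
    return out


# Toy sizes: E=2 experts, I=2, H=3.
E, I, H = 2, 2, 3
gate_up = [[[float(100 * e + 10 * r + c) for c in range(H)]
            for r in range(2 * I)] for e in range(E)]
down = [[[float(-(100 * e + 10 * r + c)) for c in range(I)]
         for r in range(H)] for e in range(E)]

per_expert = unfuse_experts("mlp", gate_up, down)
for key, value in per_expert.items():
    print(key, f"[{len(value)}, {len(value[0])}]")
```

With the source shapes this yields `[512, 2048]` gate and up weights and a `[2048, 512]` down weight per expert, each of which can then be quantized as an ordinary 2D tensor.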
Numbers
| | BF16 source | This FP8 |
|---|---|---|
| Size on disk | ~72 GB | ~37 GB |
| Tensors in index | 1045 (fused experts) | 64189 (un-fused) |
| FP8 weight tensors | — | 31738 |
| BF16 weight tensors | — | 32451 (incl. 31738 `weight_scale_inv`) |
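
The counts in the table are internally consistent, and a similar tally can be taken from a safetensors index. The sketch below is a rough illustration: `summarize_index` is a hypothetical helper, the toy `weight_map` and shard name are made up, and it relies on the assumption that every `*.weight_scale_inv` entry pairs with exactly one FP8 `*.weight` (the index itself does not record dtypes).

```python
def summarize_index(weight_map):
    """Summarize tensor roles from the name -> shard mapping of an index."""
    names = set(weight_map)
    scales = {n for n in names if n.endswith(".weight_scale_inv")}
    # "x.weight_scale_inv" -> "x.weight": the FP8 tensor each scale belongs to.
    paired = {n[: -len("_scale_inv")] for n in scales}
    return {
        "total": len(names),
        "scales": len(scales),
        "fp8_weights": len(paired & names),
        "other": len(names - scales - paired),
        "unpaired_scales": sorted(paired - names),
    }


# Toy index with made-up names; a real one would come from the "weight_map"
# key of the repo's safetensors index JSON.
toy_weight_map = {
    "lm_head.weight": "shard-a.safetensors",
    "layers.0.input_layernorm.weight": "shard-a.safetensors",
    "layers.0.mlp.experts.0.up_proj.weight": "shard-a.safetensors",
    "layers.0.mlp.experts.0.up_proj.weight_scale_inv": "shard-a.safetensors",
}
summary = summarize_index(toy_weight_map)
print(summary)

# Cross-check of the numbers reported in the table above.
total, fp8, bf16 = 64189, 31738, 32451
bf16_non_scale = bf16 - fp8  # BF16 tensors that are not scales
print(fp8 + bf16 == total, bf16_non_scale)
```

Under that pairing assumption, the table implies 713 BF16 tensors that are not scales, which would be the modules on the exclusion list.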
Loading
from transformers import AutoModelForImageTextToText, AutoProcessor

model = AutoModelForImageTextToText.from_pretrained(
    "batsclamp/Huihui-Qwen3.6-35B-A3B-abliterated-FP8",
    dtype="auto",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("batsclamp/Huihui-Qwen3.6-35B-A3B-abliterated-FP8")
For vLLM, point at the repo — the quantization_config is already correctly set (quant_method: fp8, weight-block [128, 128], dynamic activations).
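
If you want to confirm those settings before serving, a small check along these lines may help. It is a sketch with assumptions: `check_quantization_config` is a hypothetical helper, the expected values are the ones listed under Quantization Details, and it assumes `quantization_config` sits at the top level of the parsed `config.json`.

```python
# Expected values, taken from the Quantization Details table.
EXPECTED = {
    "quant_method": "fp8",
    "activation_scheme": "dynamic",
    "fmt": "e4m3",
    "weight_block_size": [128, 128],
}


def check_quantization_config(config: dict) -> list[str]:
    """Return a list of mismatches; an empty list means the config matches."""
    qc = config.get("quantization_config", {})
    return [
        f"{key}: expected {want!r}, found {qc.get(key)!r}"
        for key, want in EXPECTED.items()
        if qc.get(key) != want
    ]


# Stand-in for json.load(open("config.json")) on a local copy of the repo.
sample_config = {
    "quantization_config": {
        "quant_method": "fp8",
        "activation_scheme": "dynamic",
        "fmt": "e4m3",
        "weight_block_size": [128, 128],
    }
}
problems = check_quantization_config(sample_config)
print("OK" if not problems else "\n".join(problems))
```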
Disclaimer
Abliterated model. Not recommended if you expect a polite corporate assistant.