Run this model inference on single tenant GPU with unmatched speed and reliability at scale.
Get help setting up a custom Dedicated Endpoints.
Talk with our engineer to get a quote for reserved GPU instances with discounts.
README
License: apache-2.0Model Lineage
- Base: Qwen/Qwen3.6-35B-A3B (BF16, hybrid linear-attention + full-attention MoE with 40 layers, 256 experts, vision encoder)
- Abliterated (refusals removed) by Huihui: huihui-ai/Huihui-Qwen3.6-35B-A3B-abliterated
- This repo: FP8 quantization matching the format of Qwen/Qwen3.6-35B-A3B-FP8
Why FP8
Qwen3.6-35B-A3B in BF16 is ~72 GB on disk. FP8 cuts that to ~37 GB while preserving vision layers and precision-sensitive modules in BF16. The expected throughput uplift on DGX Spark is on par with what we saw for Qwen3.5 (31 → 51 t/s, ~65%).
Quantization Details
Scheme: native FP8 blockwise, identical on-disk format to the official Qwen/Qwen3.6-35B-A3B-FP8.
| Field | Value |
|---|---|
quant_method | fp8 |
activation_scheme | dynamic (per-token, at inference) |
fmt | e4m3 |
weight_block_size | [128, 128] |
| Scale dtype / key | bf16, *.weight_scale_inv |
| Scale shape | (ceil(out/128), ceil(in/128)) |
Quantized (weights → FP8 e4m3, per-block [128, 128] scales):
- All 2D Linear
*.weightin language layers that aren't in the exclusion list, including:self_attn.{q,k,v,o}_proj(full-attention layers)linear_attn.{in_proj_qkv, in_proj_z, out_proj}(linear-attention / mamba layers)mlp.shared_expert.{gate,up,down}_proj- All 256 experts per MoE layer, un-fused to match Qwen's official per-expert layout:
mlp.experts.{0..255}.{gate_proj, up_proj, down_proj}.weight
Kept in BF16 (matches Qwen's modules_to_not_convert):
| Module | Reason |
|---|---|
lm_head | Output head — precision-sensitive |
model.language_model.embed_tokens | Embedding layer |
*.input_layernorm, *.post_attention_layernorm | LayerNorms |
*.self_attn.{q_norm, k_norm} | QK norms |
*.linear_attn.{A_log, conv1d, dt_bias, in_proj_a, in_proj_b, in_proj_ba, norm} | Mamba state-space params (small, sensitive) |
*.mlp.gate, *.mlp.shared_expert_gate | MoE router gates — routing precision matters |
model.visual.* | Entire visual encoder (patch_embed, 27 ViT blocks, deepstack mergers, merger) |
mtp.* | Multi-token prediction module |
Notable Implementation Notes
- Source experts were fused 3D (
mlp.experts.gate_up_proj[256, 1024, 2048],mlp.experts.down_proj[256, 2048, 512]) — we un-fuse them to the per-expert layout the official Qwen FP8 uses (mlp.experts.{E}.{gate, up, down}_proj.weight). This is what vLLM's Fp8 MoE loader expects. - Streaming quantization: processed one source shard at a time on the GPU; peak host memory ~6 GB. Avoids the llmcompressor pitfall where peak VM grew to 168 GB during the Compressing phase and got OOM-killed on the 128 GB DGX Spark unified-memory budget.
- Sanity check: round-trip dequantization median relative error ~2.2% per tensor (as expected for E4M3 blockwise).
Numbers
| BF16 source | This FP8 | |
|---|---|---|
| Size on disk | ~72 GB | ~37 GB |
| Tensors in index | 1045 (fused experts) | 64189 (un-fused) |
| FP8 weight tensors | — | 31738 |
| BF16 weight tensors | — | 32451 (incl. 31738 weight_scale_inv) |
Loading
python
from transformers import AutoModelForImageTextToText, AutoProcessormodel = AutoModelForImageTextToText.from_pretrained("batsclamp/Huihui-Qwen3.6-35B-A3B-abliterated-FP8",dtype="auto",device_map="auto",)processor = AutoProcessor.from_pretrained("batsclamp/Huihui-Qwen3.6-35B-A3B-abliterated-FP8")
For vLLM, point at the repo — the quantization_config is already correctly set (quant_method: fp8, weight-block [128, 128], dynamic activations).
Disclaimer
Abliterated model. Not recommended if you expect a polite corporate assistant.
Model provider
gsting
Model tree
Base
Qwen/Qwen3.6-35B-A3B
Quantized
this model
Modalities
Input
Video, Text, Image
Output
Text
Pricing
Dedicated Endpoints
View detailsSupported Functionality
Model APIs
Dedicated Endpoints
Container
More information