License: apache-2.0

Model Lineage

Why FP8

Qwen3.6-35B-A3B in BF16 is ~72 GB on disk. FP8 cuts that to ~37 GB while preserving vision layers and precision-sensitive modules in BF16. The expected throughput uplift on DGX Spark is on par with what we saw for Qwen3.5 (31 → 51 t/s, ~65%).

Quantization Details

Scheme: native FP8 blockwise, identical on-disk format to the official Qwen/Qwen3.6-35B-A3B-FP8.

| Field | Value |
|---|---|
| quant_method | fp8 |
| activation_scheme | dynamic (per-token, at inference) |
| fmt | e4m3 |
| weight_block_size | [128, 128] |
| Scale dtype / key | bf16, *.weight_scale_inv |
| Scale shape | (ceil(out/128), ceil(in/128)) |
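As a quick illustration of the scale-shape rule, here is a minimal sketch that computes the scale tensor shape for a given weight shape. The helper name `scale_shape` is hypothetical; the per-expert shapes are taken from the fused source shapes quoted in the implementation notes (1024 fused rows split into 512 gate rows and 512 up rows), and the last shape is a made-up ragged example.

```python
import math

def scale_shape(out_features, in_features, block=(128, 128)):
    """Shape of the per-block scale tensor for an [out, in] weight."""
    block_out, block_in = block
    return (math.ceil(out_features / block_out),
            math.ceil(in_features / block_in))

print(scale_shape(512, 2048))   # per-expert gate_proj / up_proj
print(scale_shape(2048, 512))   # per-expert down_proj
print(scale_shape(300, 200))    # hypothetical ragged shape: partial blocks still get a scale
```

Because of the ceiling, a dimension that is not a multiple of 128 still gets one scale for its trailing partial block.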

Quantized (weights → FP8 e4m3, per-block [128, 128] scales):

  • All 2D Linear *.weight in language layers that aren't in the exclusion list, including:
    • self_attn.{q,k,v,o}_proj (full-attention layers)
    • linear_attn.{in_proj_qkv, in_proj_z, out_proj} (linear-attention / mamba layers)
    • mlp.shared_expert.{gate,up,down}_proj
    • All 256 experts per MoE layer, un-fused to match Qwen's official per-expert layout:
      • mlp.experts.{0..255}.{gate_proj, up_proj, down_proj}.weight
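The per-block scheme can be sketched in dependency-free Python. This is not the conversion script used for this repo: it simulates E4M3 rounding with plain floats instead of casting to a real FP8 dtype, and it assumes that the stored `weight_scale_inv` is the factor a weight is multiplied by on dequantization (amax / 448 per block). All function names are hypothetical.

```python
import math

E4M3_MAX = 448.0  # largest finite value of the e4m3fn format

def round_e4m3(x):
    """Round a float to the nearest e4m3fn-representable value (saturating)."""
    if x == 0.0:
        return 0.0
    sign = -1.0 if x < 0 else 1.0
    a = min(abs(x), E4M3_MAX)
    _, e = math.frexp(a)          # a = m * 2**e with 0.5 <= m < 1
    exp = max(e - 1, -6)          # clamp to the smallest normal exponent
    step = 2.0 ** (exp - 3)       # 3 mantissa bits: 8 steps per binade
    return sign * round(a / step) * step

def quantize_blockwise(w, block=(128, 128)):
    """Return (q, scale_inv): simulated FP8 values and one scale per block."""
    rows, cols = len(w), len(w[0])
    br, bc = block
    q = [[0.0] * cols for _ in range(rows)]
    scale_inv = []
    for r0 in range(0, rows, br):
        scale_row = []
        for c0 in range(0, cols, bc):
            cells = [(r, c)
                     for r in range(r0, min(r0 + br, rows))
                     for c in range(c0, min(c0 + bc, cols))]
            amax = max(abs(w[r][c]) for r, c in cells)
            s = amax / E4M3_MAX if amax > 0 else 1.0  # assumed scale convention
            for r, c in cells:
                q[r][c] = round_e4m3(w[r][c] / s)
            scale_row.append(s)
        scale_inv.append(scale_row)
    return q, scale_inv

def dequantize_blockwise(q, scale_inv, block=(128, 128)):
    """Multiply each simulated FP8 value by the scale of its block."""
    br, bc = block
    return [[q[r][c] * scale_inv[r // br][c // bc] for c in range(len(q[0]))]
            for r in range(len(q))]

# Tiny demo with 2x2 blocks so that two blocks fit in a 2x4 matrix.
w = [[0.5, -1.0, 3.0, 0.1],
     [0.25, 0.75, -2.0, 1.5]]
q, scale_inv = quantize_blockwise(w, block=(2, 2))
w_hat = dequantize_blockwise(q, scale_inv, block=(2, 2))
rel_err = [abs(a - b) / abs(a) for ra, rb in zip(w, w_hat) for a, b in zip(ra, rb)]
print(max(rel_err))
```

With 3 mantissa bits, values in the normal range round with a relative error of at most 1/16; values much smaller than their block's maximum can do worse, which is one reason the scale is kept per block rather than per tensor.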

Kept in BF16 (matches Qwen's modules_to_not_convert):

| Module | Reason |
|---|---|
| lm_head | Output head, precision-sensitive |
| model.language_model.embed_tokens | Embedding layer |
| *.input_layernorm, *.post_attention_layernorm | LayerNorms |
| *.self_attn.{q_norm, k_norm} | QK norms |
| *.linear_attn.{A_log, conv1d, dt_bias, in_proj_a, in_proj_b, in_proj_ba, norm} | Mamba state-space params (small, sensitive) |
| *.mlp.gate, *.mlp.shared_expert_gate | MoE router gates; routing precision matters |
| model.visual.* | Entire visual encoder (patch_embed, 27 ViT blocks, deepstack mergers, merger) |
| mtp.* | Multi-token prediction module |
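One way to read the exclusion list is as a set of glob patterns over tensor or module names. The sketch below uses `fnmatch` from the standard library purely for illustration; the way `modules_to_not_convert` is actually matched by a given loader may differ, and the layer-indexed names in the examples are hypothetical.

```python
from fnmatch import fnmatchcase

# Patterns transcribed from the table above, with the {a, b} groups expanded.
KEEP_BF16 = [
    "lm_head",
    "model.language_model.embed_tokens",
    "*.input_layernorm", "*.post_attention_layernorm",
    "*.self_attn.q_norm", "*.self_attn.k_norm",
    "*.linear_attn.A_log", "*.linear_attn.conv1d", "*.linear_attn.dt_bias",
    "*.linear_attn.in_proj_a", "*.linear_attn.in_proj_b",
    "*.linear_attn.in_proj_ba", "*.linear_attn.norm",
    "*.mlp.gate", "*.mlp.shared_expert_gate",
    "model.visual.*",
    "mtp.*",
]

def stays_bf16(name):
    """True if the name matches any pattern in the exclusion list."""
    return any(fnmatchcase(name, pattern) for pattern in KEEP_BF16)

# Hypothetical names, for illustration only.
print(stays_bf16("model.language_model.layers.3.mlp.gate"))                 # router gate
print(stays_bf16("model.language_model.layers.3.mlp.experts.7.gate_proj"))  # expert projection
```

Note that `*.mlp.gate` matches the router but not `mlp.experts.{E}.gate_proj`, because the pattern has to match the whole name.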

Notable Implementation Notes

  • Source experts were fused 3D (mlp.experts.gate_up_proj[256, 1024, 2048], mlp.experts.down_proj[256, 2048, 512]) — we un-fuse them to the per-expert layout the official Qwen FP8 uses (mlp.experts.{E}.{gate, up, down}_proj.weight). This is what vLLM's Fp8 MoE loader expects.
  • Streaming quantization: source shards are processed one at a time on the GPU, so peak host memory stays at ~6 GB. This avoids the llmcompressor pitfall where peak virtual memory grew to 168 GB during the Compressing phase and the process was OOM-killed on the DGX Spark's 128 GB unified-memory budget.
  • Sanity check: round-trip dequantization median relative error ~2.2% per tensor (as expected for E4M3 blockwise).
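The un-fusing step from the first note can be sketched as a key-mapping plan from per-expert target keys to slices of the fused source tensors. The helper name `unfuse_plan` and the layer prefix are hypothetical, and the ordering of gate rows before up rows inside the fused `gate_up_proj` is an assumption that this card does not confirm. The 512-row split follows from the fused shape [256, 1024, 2048].

```python
def unfuse_plan(layer_prefix, num_experts=256, intermediate=512):
    """Map each per-expert target key to (source key, expert index, row slice)."""
    src_gate_up = f"{layer_prefix}.mlp.experts.gate_up_proj"
    src_down = f"{layer_prefix}.mlp.experts.down_proj"
    plan = {}
    for e in range(num_experts):
        base = f"{layer_prefix}.mlp.experts.{e}"
        # Assumption: gate rows come first, then up rows, in the fused tensor.
        plan[f"{base}.gate_proj.weight"] = (src_gate_up, e, slice(0, intermediate))
        plan[f"{base}.up_proj.weight"] = (src_gate_up, e, slice(intermediate, 2 * intermediate))
        # down_proj is only split along the expert axis.
        plan[f"{base}.down_proj.weight"] = (src_down, e, slice(None))
    return plan

# Hypothetical layer prefix, for illustration only.
plan = unfuse_plan("model.language_model.layers.0")
print(len(plan))  # three target tensors per expert
```

Each entry would then be materialized as `fused[expert_index][row_slice]` before quantization, which is what turns two fused tensors per layer into 768 per-expert weights.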

Numbers

| | BF16 source | This FP8 |
|---|---|---|
| Size on disk | ~72 GB | ~37 GB |
| Tensors in index | 1045 (fused experts) | 64189 (un-fused) |
| FP8 weight tensors | | 31738 |
| BF16 weight tensors | | 32451 (incl. 31738 weight_scale_inv) |
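The FP8 column is internally consistent, which a few lines of arithmetic confirm. The variable names are ours; the values are copied from the table.

```python
total_tensors = 64189   # tensors in the FP8 index
fp8_weights = 31738     # FP8 weight tensors
bf16_tensors = 32451    # BF16 tensors, including the scales
scale_tensors = 31738   # weight_scale_inv tensors

# Every FP8 weight has exactly one scale tensor next to it.
assert scale_tensors == fp8_weights
# FP8 weights plus BF16 tensors account for the whole index.
assert fp8_weights + bf16_tensors == total_tensors

# BF16 tensors that are not scales: the modules kept in BF16.
other_bf16 = bf16_tensors - scale_tensors
print(other_bf16)  # 713
```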

Loading

```python
from transformers import AutoModelForImageTextToText, AutoProcessor

model = AutoModelForImageTextToText.from_pretrained(
    "batsclamp/Huihui-Qwen3.6-35B-A3B-abliterated-FP8",
    dtype="auto",
    device_map="auto",
)
processor = AutoProcessor.from_pretrained("batsclamp/Huihui-Qwen3.6-35B-A3B-abliterated-FP8")
```

For vLLM, point at the repo — the quantization_config is already correctly set (quant_method: fp8, weight-block [128, 128], dynamic activations).
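To verify those settings on a local copy before serving, a small check along these lines may help. It assumes that `quantization_config` sits at the top level of `config.json`, which this card does not state; the helper name is hypothetical.

```python
# Expected values, taken from the Quantization Details table.
EXPECTED = {
    "quant_method": "fp8",
    "activation_scheme": "dynamic",
    "fmt": "e4m3",
    "weight_block_size": [128, 128],
}

def quant_config_mismatches(config):
    """Return {field: (found, expected)} for every field that differs."""
    # Assumption: quantization_config is a top-level key of config.json.
    qc = config.get("quantization_config", {})
    return {key: (qc.get(key), want)
            for key, want in EXPECTED.items()
            if qc.get(key) != want}

# Usage on a downloaded snapshot (the path is a placeholder):
#   import json
#   with open("config.json") as f:
#       print(quant_config_mismatches(json.load(f)))
sample = {"quantization_config": dict(EXPECTED)}
print(quant_config_mismatches(sample))  # {}
```

An empty result means all four fields match; any other result lists what was found next to what was expected.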

Disclaimer

Abliterated model. Not recommended if you expect a polite corporate assistant.

Model provider

gsting

Model tree

Base

Qwen/Qwen3.6-35B-A3B

Quantized

this model

Modalities

Input

Video, Text, Image

Output

Text
