mgoin

Qwen3.6-35B-A3B-2Bit-GSQ-ct

README

License: apache-2.0

Quantization details

Base model: Qwen/Qwen3.6-35B-A3B
Bits / weight (effective): ≈2.13 bpp
Codebook: 2-bit symmetric scalar {-2, -1, 0, +1} × scale
Group size: 128
Format: Humming (quant_method: "humming", b_dtype: "uint2")
Pipeline: GPTQ initialization → Gumbel-Softmax refinement (Lion optimizer)
What's quantized: routed-expert MLPs (gate_proj, up_proj, down_proj).
Kept in BF16: attention (self_attn, linear_attn), embeddings, layernorms, LM head, MoE routing gate, and the shared experts.

Storage layout (why the HF UI shows I32 + BF16)

The Hugging Face "Tensor types" widget reports the container dtype of each safetensor on disk, not the effective precision of the underlying weights. This checkpoint uses the Humming on-disk layout (exact-width packing — no sub-byte values are padded into a wider container). For every quantized expert-MLP Linear with original weight shape [out_features, in_features], the following tensors are stored:

Table with columns: Tensor, Dtype, Shape on disk, Meaning
Tensor	Dtype	Shape on disk	Meaning
`<layer>.weight`	I32	`[out_features, in_features × 2 / 32]` = `[out_features, in_features / 16]`	2-bit values bit-packed along the input dim, LSB-first: 16 weights per INT32 word.
`<layer>.weight_scale`	BF16	`[out_features, in_features / 128]`	One symmetric scale per group of `group_size = 128` weights along the input dim.

So although the UI says "I32 + BF16", the effective storage per quantized weight is:

text
2 bits (packed) + 16 bits / 128 (group scale) ≈ 2.13 bpp

The quantization_config block in config.json is:

json
{
  "quant_method": "humming",
  "b_dtype": "uint2",
  "weight_scale_group_size": 128,
  "weight_scale_type": "group",
  "has_zero_point": false,
  "ignore": [
    "lm_head",
    "re:.*self_attn.*",
    "re:.*linear_attn.*",
    "re:.*visual.*",
    "re:mtp.*",
    "re:.*mlp\\.gate$",
    "re:.*mlp\\.shared_expert_gate$",
    "re:.*shared_expert.*"
  ]
}

Loading this checkpoint requires vLLM plus the humming MoE kernels (pip install humming-kernels). See Serving with vLLM below.

Note: GSQ training first writes shards in compressed-tensors pack-quantized format (where the 2-bit codebook is padded into a 4-bit INT32 container). The published checkpoint here has been re-packed via convert_to_humming.py into exact-width 2-bit Humming storage, hence the 2 / 32 shape factor on weight.

Serving with vLLM

Temporary vLLM compatibility note: the upstream vLLM Humming MoE implementation currently has a bug that prevents this checkpoint from being served correctly. Until the fix is available upstream, use one of the following workarounds:
Build vLLM from our fork:
bash
git clone https://github.com/adotdad/vllm.git
cd vllm
pip install -e .
Or patch your existing vLLM installation by replacing:
text
vllm/model_executor/layers/quantization/humming.py
with the version from our fork:
text
https://github.com/adotdad/vllm

Install the Humming kernels (required for vLLM to load this checkpoint):

bash
pip install humming-kernels

Ampere (sm ≥ 80) or Hopper GPUs are required for serving.

bash
vllm serve ISTA-DASLab/Qwen3.6-35B-A3B-2Bit-GSQ

Text-only VRAM note: the full checkpoint is ≈14.4 GB, of which about ≈2.6 GB comes from the MTP and vision components. These components are optional for ordinary text-only generation: the vision weights are only needed for multimodal inputs, and the MTP weights are only useful when the serving backend enables MTP/speculative decoding. If you only use the text-generation path without MTP, the model weights require roughly ≈11.8 GB of VRAM, excluding KV-cache and runtime overhead.

Citation

bibtex
@article{gsq2026,
  title  = {GSQ: Highly-Accurate Low-Precision Scalar Quantization for LLMs via Gumbel-Softmax Sampling},
  author = {Dadgarnia, Alireza and Tabesh, Soroush and Nikdan, Mahdi and Helcig, Michael and Kurti{\'c}, Eldar and Kleinegger, Max and Alistarh, Dan},
  journal= {arXiv preprint arXiv:2604.18556},
  year   = {2026},
  url    = {https://arxiv.org/abs/2604.18556}
}

Available on FriendliAI

Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more

Model Details

Model Provider

mgoin

Model Tree

Base

Qwen/Qwen3.6-35B-A3B

Quantized

this model

Input Modalities

Text

Image

Video

Output Modalities