Run this model inference on single tenant GPU with unmatched speed and reliability at scale.
Get help setting up a custom Dedicated Endpoints.
Talk with our engineer to get a quote for reserved GPU instances with discounts.
README
License: apache-2.0Quantization details
- Base model:
Qwen/Qwen3.6-35B-A3B - Bits / weight (effective): ≈2.13 bpp
- Codebook: 2-bit symmetric scalar
{-2, -1, 0, +1} × scale - Group size: 128
- Format: Humming (
quant_method: "humming",b_dtype: "uint2") - Pipeline: GPTQ initialization → Gumbel-Softmax refinement (Lion optimizer)
- What's quantized: routed-expert MLPs (
gate_proj,up_proj,down_proj). - Kept in BF16: attention (
self_attn,linear_attn), embeddings, layernorms, LM head, MoE routinggate, and the shared experts.
Storage layout (why the HF UI shows I32 + BF16)
The Hugging Face "Tensor types" widget reports the container dtype of each
safetensor on disk, not the effective precision of the underlying weights.
This checkpoint uses the Humming on-disk layout (exact-width packing — no
sub-byte values are padded into a wider container). For every quantized
expert-MLP Linear with original weight shape [out_features, in_features],
the following tensors are stored:
| Tensor | Dtype | Shape on disk | Meaning |
|---|---|---|---|
<layer>.weight | I32 | [out_features, in_features × 2 / 32] = [out_features, in_features / 16] | 2-bit values bit-packed along the input dim, LSB-first: 16 weights per INT32 word. |
<layer>.weight_scale | BF16 | [out_features, in_features / 128] | One symmetric scale per group of group_size = 128 weights along the input dim. |
Attention / norms / embed / LM-head / MoE gate / shared experts | BF16 | unchanged | Not quantized; copied from the base checkpoint. |
So although the UI says "I32 + BF16", the effective storage per quantized weight is:
text
2 bits (packed) + 16 bits / 128 (group scale) ≈ 2.13 bpp
The quantization_config block in config.json is:
json
{"quant_method": "humming","b_dtype": "uint2","weight_scale_group_size": 128,"weight_scale_type": "group","has_zero_point": false,"ignore": ["lm_head","re:.*self_attn.*","re:.*linear_attn.*","re:.*visual.*","re:mtp.*","re:.*mlp\\.gate$","re:.*mlp\\.shared_expert_gate$","re:.*shared_expert.*"]}
Loading this checkpoint requires vLLM plus the
humming MoE kernels
(pip install humming-kernels). See Serving with vLLM below.
Note: GSQ training first writes shards in
compressed-tensorspack-quantizedformat (where the 2-bit codebook is padded into a 4-bit INT32 container). The published checkpoint here has been re-packed viaconvert_to_humming.pyinto exact-width 2-bit Humming storage, hence the2 / 32shape factor onweight.
Serving with vLLM
Temporary vLLM compatibility note: the upstream vLLM Humming MoE implementation currently has a bug that prevents this checkpoint from being served correctly. Until the fix is available upstream, use one of the following workarounds:
Build vLLM from our fork:
bash
git clone https://github.com/adotdad/vllm.gitcd vllmpip install -e .Or patch your existing vLLM installation by replacing:
text
vllm/model_executor/layers/quantization/humming.pywith the version from our fork:
text
https://github.com/adotdad/vllm
Install the Humming kernels (required for vLLM to load this checkpoint):
bash
pip install humming-kernels
Ampere (sm ≥ 80) or Hopper GPUs are required for serving.
bash
vllm serve ISTA-DASLab/Qwen3.6-35B-A3B-2Bit-GSQ
Text-only VRAM note: the full checkpoint is ≈14.4 GB, of which about ≈2.6 GB comes from the MTP and vision components. These components are optional for ordinary text-only generation: the vision weights are only needed for multimodal inputs, and the MTP weights are only useful when the serving backend enables MTP/speculative decoding. If you only use the text-generation path without MTP, the model weights require roughly ≈11.8 GB of VRAM, excluding KV-cache and runtime overhead.
Citation
bibtex
@article{gsq2026,title = {GSQ: Highly-Accurate Low-Precision Scalar Quantization for LLMs via Gumbel-Softmax Sampling},author = {Dadgarnia, Alireza and Tabesh, Soroush and Nikdan, Mahdi and Helcig, Michael and Kurti{\'c}, Eldar and Kleinegger, Max and Alistarh, Dan},journal= {arXiv preprint arXiv:2604.18556},year = {2026},url = {https://arxiv.org/abs/2604.18556}}
Model provider
mgoin
Model tree
Base
Qwen/Qwen3.6-35B-A3B
Quantized
this model
Modalities
Input
Video, Text, Image
Output
Text
Pricing
Dedicated Endpoints
View detailsSupported Functionality
Model APIs
Dedicated Endpoints
Container
More information