nchapman

Qwen3.6-35B-A3B-int4-AutoRound

README

License: apache-2.0

Quantization Details

| Parameter | Value |
|---|---|
| Method | Intel AutoRound (SignRound) |
| AutoRound version | 0.12.2 |
| Scheme | W4A16 (4-bit symmetric integer weights, BF16 activations) |
| Group size | 128 |
| Symmetric | Yes |
| Packing format | `auto_round:auto_gptq` |
| Calibration iterations | 400 |
| Calibration samples | 256 |
| Sequence length | 2048 |
| Batch size | 8 |
| Learning rate | auto |
| Calibration dataset | AutoRound default |
| Hardware | NVIDIA DGX Spark (GB10, Blackwell SM_121, 128 GB unified memory) |
| Quantization time | ~10.5 hours |
| Peak RAM | 28.19 GB |
| Peak VRAM | 25.11 GB |
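As a rough illustration of what a symmetric 4-bit, group-size-128 weight format stores, the sketch below quantizes a flat list of weights with plain round-to-nearest and one scale per group. It is not AutoRound's method: SignRound learns the rounding during calibration, whereas this example only shows the numeric layout. The function names and the choice to map the largest magnitude in each group to code 7 are assumptions made for the example.

```python
import random

def quantize_sym_int4(weights, group_size=128):
    """Round-to-nearest symmetric 4-bit quantization with one scale per group."""
    if len(weights) % group_size != 0:
        raise ValueError("weight count must be a multiple of group_size")
    codes, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(x) for x in group)
        # Assumed convention: the largest magnitude maps to 7;
        # the signed 4-bit range is [-8, 7] and there is no zero point.
        scale = max_abs / 7.0 if max_abs > 0 else 1.0
        scales.append(scale)
        codes.extend(max(-8, min(7, round(x / scale))) for x in group)
    return codes, scales

def dequantize(codes, scales, group_size=128):
    """Reconstruct approximate weights from integer codes and per-group scales."""
    return [c * scales[i // group_size] for i, c in enumerate(codes)]

random.seed(0)
weights = [random.gauss(0.0, 0.02) for _ in range(256)]  # two groups of 128
codes, scales = quantize_sym_int4(weights)
restored = dequantize(codes, scales)
max_err = max(abs(a - b) for a, b in zip(weights, restored))
print(f"groups={len(scales)}, max abs error={max_err:.6f}")
```

With round-to-nearest, the reconstruction error of any weight is bounded by half of its group's scale, which is why smaller groups (more scales) generally give lower error at the cost of extra storage.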

Layers kept in BF16

Certain layers are excluded from quantization to preserve model quality and ensure compatibility with tensor-parallel inference:

| Layer pattern | Reason |
|---|---|
| `shared_expert` (gate_proj, up_proj, down_proj) | Intel's recommended MoE recipe: shared experts are critical for output quality |
| `shared_expert_gate` | Shape not divisible by 32 (skipped automatically) |
| `mtp.*` (entire MTP block) | Required for vLLM speculative decoding. AutoRound's calibration never activates MTP, so these layers would receive RTN-only rounding and a near-zero draft acceptance rate |
| `linear_attn.in_proj_a` / `in_proj_b` | GDN layers with out_features=32. vLLM fuses these into `in_proj_ba` (dim 64), and Marlin requires >=64 per TP shard, so quantizing them breaks TP>=2 |

The quantization_config.json includes both checkpoint-level (in_proj_a, in_proj_b) and vLLM fused-module (in_proj_ba) FP override entries so that vLLM correctly identifies these layers as unquantized under tensor parallelism.
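The tensor-parallel constraint on the GDN projections comes down to simple arithmetic, sketched below. The sketch assumes the fused `in_proj_ba` output dimension is split evenly across TP ranks; the helper name and the constant are hypothetical and only restate the figures given in the table.

```python
MARLIN_MIN_OUT_FEATURES = 64  # per-shard minimum, as stated in the table above

def fused_shard_width(out_features_a, out_features_b, tp_size):
    """Output width each TP rank sees for the fused in_proj_ba projection.

    Assumes the fused output dimension is divided evenly across ranks.
    """
    fused = out_features_a + out_features_b
    if fused % tp_size != 0:
        raise ValueError("fused width must divide evenly across TP ranks")
    return fused // tp_size

for tp in (1, 2, 4):
    width = fused_shard_width(32, 32, tp)
    quantizable = width >= MARLIN_MIN_OUT_FEATURES
    print(f"TP={tp}: {width} columns per shard, meets Marlin minimum: {quantizable}")
```

Under these assumptions only TP=1 meets the minimum, which matches the statement that quantizing these layers breaks TP>=2 and motivates keeping them in BF16.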

Quantization summary

Quantized: 30,850 / 31,181 linear layers
Block-wise loss (representative): layer 0 loss 0.000011→0.000001, layer 39 loss 0.002693→0.001283
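To put the summary figures in proportion, the short calculation below derives the share of linear layers that were quantized and the factor by which the reported block-wise loss dropped. It uses only the numbers quoted above; the variable names are illustrative.

```python
quantized, total = 30_850, 31_181
not_quantized = total - quantized
quantized_pct = 100 * quantized / total

# (loss before tuning, loss after tuning) as reported above
block_losses = {
    "layer 0": (0.000011, 0.000001),
    "layer 39": (0.002693, 0.001283),
}
reductions = {name: before / after for name, (before, after) in block_losses.items()}

print(f"{not_quantized} linear layers left unquantized ({quantized_pct:.2f}% quantized)")
for name, factor in reductions.items():
    print(f"{name}: loss reduced by a factor of {factor:.2f}")
```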

Model size

| Format | Size |
|---|---|
| BF16 (original) | ~70 GB |
| INT4 (this model) | ~21 GB |
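A back-of-the-envelope estimate shows roughly where these sizes come from. The sketch assumes a nominal 35e9 parameters (taken from the model name), one 16-bit scale per group of 128 weights, no separate storage cost for zero points, and 1 GB = 1e9 bytes. None of these assumptions is confirmed by the card, so treat the output as an order-of-magnitude check only.

```python
def checkpoint_gb(params, bits_per_weight):
    """Approximate checkpoint size in GB (1 GB = 1e9 bytes)."""
    return params * bits_per_weight / 8 / 1e9

PARAMS = 35e9  # assumption: nominal parameter count from the model name

bf16_gb = checkpoint_gb(PARAMS, 16)

# 4-bit codes plus one assumed 16-bit scale shared by each group of 128 weights
int4_bits_per_weight = 4 + 16 / 128
int4_gb = checkpoint_gb(PARAMS, int4_bits_per_weight)

print(f"BF16: ~{bf16_gb:.0f} GB")
print(f"INT4 if every weight were quantized: ~{int4_gb:.1f} GB")
```

The BF16 estimate matches the table. The INT4 estimate comes out a few GB below the listed ~21 GB; the difference presumably comes from the layers kept in BF16 and other unquantized tensors, though this sketch does not verify that.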

How to use

With transformers + gptqmodel

```python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "nchapman/Qwen3.6-35B-A3B-int4-AutoRound",
    torch_dtype=torch.bfloat16,
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("nchapman/Qwen3.6-35B-A3B-int4-AutoRound")

messages = [{"role": "user", "content": "Explain mixture-of-experts architectures briefly."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=512)
print(tokenizer.decode(output[0], skip_special_tokens=True))
```

Note: gptqmodel must be installed for the W4 dequantization kernels. Without it, transformers will silently produce garbage output.
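Because a missing kernel package fails silently, it can help to check for it before loading the model. The following is a minimal sketch using only the standard library; the helper name `missing_packages` is made up for this example.

```python
import importlib.util

def missing_packages(names):
    """Return the top-level packages from `names` that cannot be found for import."""
    return [name for name in names if importlib.util.find_spec(name) is None]

missing = missing_packages(["transformers", "gptqmodel"])
if missing:
    print("Install before loading the model: " + ", ".join(missing))
else:
    print("All required packages were found.")
```

This only confirms that the packages can be located; it does not verify that their kernels work on your hardware.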

With vLLM

```bash
vllm serve nchapman/Qwen3.6-35B-A3B-int4-AutoRound \
    --dtype bfloat16 \
    --quantization auto_round \
    --num-speculative-tokens 1 \
    --speculative-model [draft_model]
```

The MTP head is preserved in BF16, enabling vLLM's speculative decoding with the native MTP draft. Tensor parallelism (TP>=2) works correctly thanks to the in_proj_ba FP overrides.
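Once the server is running, it can be queried through vLLM's OpenAI-compatible chat completions endpoint. The sketch below uses only the standard library and assumes the server listens on vLLM's default address, `http://localhost:8000`, and that the model is served under its repository name. The helper names are illustrative, and `send` is defined but not called so that nothing touches the network on import.

```python
import json
import urllib.request

MODEL_ID = "nchapman/Qwen3.6-35B-A3B-int4-AutoRound"

def build_chat_request(prompt, base_url="http://localhost:8000", max_tokens=256):
    """Build a POST request for the OpenAI-compatible chat completions endpoint."""
    payload = {
        "model": MODEL_ID,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

def send(request):
    """Send the request and return the assistant's reply text."""
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]

request = build_chat_request("Explain mixture-of-experts architectures briefly.")
# With a server running: print(send(request))
```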

Reproducing this quantization

Using AutoRound's Python API:

```python
from auto_round import AutoRound
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "Qwen/Qwen3.6-35B-A3B",
    torch_dtype="auto",
    device_map="auto",
)
tokenizer = AutoTokenizer.from_pretrained("Qwen/Qwen3.6-35B-A3B")

autoround = AutoRound(
    model,
    tokenizer,
    bits=4,
    group_size=128,
    sym=True,
    iters=400,
    nsamples=256,
    seqlen=2048,
    batch_size=8,
    dataset="NeelNanda/pile-10k",
    ignore_layers=[
        "mtp.fc",
        "linear_attn.in_proj_a",
        "linear_attn.in_proj_b",
        "shared_expert",
    ],
)
autoround.quantize()
autoround.save_quantized("Qwen3.6-35B-A3B-int4-AutoRound", format="auto_round")
```

The ignore_layers patterns keep in_proj_a and in_proj_b in BF16, and AutoRound's saved quantization_config.json contains the corresponding checkpoint-level overrides. The vLLM fused-module entries (*.linear_attn.in_proj_ba) are not written by AutoRound and must be added as a post-save step. Our quantization script does this automatically. If you use the raw API, add a {"bits": 16, "data_type": "fp"} entry for each *.linear_attn.in_proj_ba to extra_config in both quantization_config.json and config.json for TP>=2 compatibility.
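The post-save step can be sketched as follows. This is not the author's quantization script. It assumes that extra_config is keyed by full module names ending in `.linear_attn.in_proj_a`, and that config.json nests the same settings under a `quantization_config` key. The function names and the layer name in the usage example are hypothetical, so check the keys in your own saved files before relying on it.

```python
import json
from pathlib import Path

FP_OVERRIDE = {"bits": 16, "data_type": "fp"}

def add_fused_overrides(quant_cfg):
    """Add an in_proj_ba FP entry next to every in_proj_a entry in extra_config."""
    extra = quant_cfg.setdefault("extra_config", {})
    suffix = ".linear_attn.in_proj_a"
    for name in list(extra):
        if name.endswith(suffix):
            fused = name[: -len("in_proj_a")] + "in_proj_ba"
            # setdefault keeps any existing entry, so re-running is harmless
            extra.setdefault(fused, dict(FP_OVERRIDE))
    return quant_cfg

def patch_checkpoint(model_dir):
    """Apply the overrides to both config files in a saved checkpoint directory."""
    model_dir = Path(model_dir)
    # Assumed layout: quantization_config.json holds the settings at top level.
    qpath = model_dir / "quantization_config.json"
    quant_cfg = add_fused_overrides(json.loads(qpath.read_text()))
    qpath.write_text(json.dumps(quant_cfg, indent=2))
    # Assumed layout: config.json nests them under "quantization_config".
    cpath = model_dir / "config.json"
    config = json.loads(cpath.read_text())
    add_fused_overrides(config["quantization_config"])
    cpath.write_text(json.dumps(config, indent=2))

# Usage on an in-memory config with a hypothetical layer name:
example = {"extra_config": {
    "model.layers.0.linear_attn.in_proj_a": dict(FP_OVERRIDE),
    "model.layers.0.linear_attn.in_proj_b": dict(FP_OVERRIDE),
}}
patched = add_fused_overrides(example)
```

Deriving the fused names from the existing in_proj_a entries avoids hard-coding layer indices, at the cost of depending on the key format AutoRound happened to write.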

Acknowledgments

Quantization method: Intel AutoRound — SignRound: optimization-based weight rounding for post-training quantization
MTP preservation approach informed by Lorbus/Qwen3.6-27B-int4-AutoRound
Shared-expert BF16 recipe from Intel/Qwen3.5-122B-A10B-int4-AutoRound
GDN TP fix validated against vLLM's Marlin kernel constraints


Model Details

Model Provider

nchapman

Model Tree

Base

Qwen/Qwen3.6-35B-A3B

Quantized

this model

Input Modalities

Text

Image

Video

Output Modalities