cyburn

Qwen3.6-35B-A3B-int4-AutoRound

README

License: apache-2.0

Quantization Details

Parameter	Value
Method	AutoRound
AutoRound version	0.14.1
Bits	4 (int)
Group size	128
Symmetric	Yes
Packing format	auto_round:auto_gptq
Calibration dataset	opencode-instruct
Calibration samples	512
Sequence length	2048
Iterations	1000

MLP gate layers and shared expert gate layers are kept in FP16 to preserve routing quality.
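To make the "Bits", "Group size", and "Symmetric" settings concrete, here is a minimal sketch of symmetric group-wise int4 quantization using plain round-to-nearest. It is an illustration only: AutoRound additionally tunes the rounding and clipping over its optimization iterations, which this sketch does not attempt, and the function names are hypothetical rather than part of the AutoRound API.

```python
import numpy as np

def quantize_sym_int4(weights: np.ndarray, group_size: int = 128):
    """Round-to-nearest symmetric int4 quantization with one scale per group.

    Hypothetical helper for illustration; not AutoRound's learned rounding.
    """
    flat = weights.astype(np.float32).reshape(-1, group_size)
    # Symmetric: the grid is centered on zero, so each group needs only a scale.
    max_abs = np.abs(flat).max(axis=1, keepdims=True)
    scale = np.where(max_abs > 0, max_abs / 7.0, 1.0)
    # Signed 4-bit integers cover the range [-8, 7].
    q = np.clip(np.round(flat / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray, shape) -> np.ndarray:
    """Map the int4 codes back to floating point."""
    return (q.astype(np.float32) * scale).reshape(shape)

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 256)).astype(np.float32)  # toy weight matrix
q, scale = quantize_sym_int4(w, group_size=128)
w_hat = dequantize(q, scale, w.shape)
max_err = float(np.abs(w - w_hat).max())
print(f"groups: {scale.shape[0]}, max abs error: {max_err:.4f}")
```

With a group size of 128, every run of 128 weights shares one scale, so the per-weight storage overhead of the scales stays small.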

Quality Report

Quantized with AutoRound's sensitivity-based optimization. All 40 transformer blocks were evaluated:

Status	Count
Pass (cosine sim ≥ 0.99)	27
Warning (cosine sim 0.98–0.99)	13

All layers maintain cosine similarity > 0.98 vs the original. Warnings are concentrated in the deeper layers (23–37), which is typical for MoE models at 4-bit.
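The pass/warning buckets above can be sketched as a small check on a block's output before and after quantization. This is an assumed reconstruction of the kind of comparison the report describes, not the tool's actual code; the function names and the "fail" bucket (for values below 0.98, which did not occur here) are hypothetical.

```python
import numpy as np

PASS_THRESHOLD = 0.99  # "Pass" in the table above
WARN_THRESHOLD = 0.98  # lower bound of "Warning" in the table above

def cosine_similarity(a, b) -> float:
    """Cosine similarity between two tensors, flattened to vectors."""
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def classify(sim: float) -> str:
    """Bucket a similarity score; "fail" is a hypothetical third bucket."""
    if sim >= PASS_THRESHOLD:
        return "pass"
    if sim >= WARN_THRESHOLD:
        return "warning"
    return "fail"

# Toy stand-ins for one block's output from the original and quantized model.
rng = np.random.default_rng(0)
reference = rng.normal(size=(16, 64))
quantized = reference + 0.05 * rng.normal(size=reference.shape)

sim = cosine_similarity(reference, quantized)
print(f"cosine similarity: {sim:.4f} -> {classify(sim)}")
```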

Usage

python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "cyburn/Qwen3.6-35B-A3B-int4-AutoRound"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
)

messages = [{"role": "user", "content": "Write a Python function to compute Fibonacci numbers."}]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# do_sample=True is required for temperature, top_k, and top_p to take effect.
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=1.0, top_k=20, top_p=0.95)
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))

Model Architecture

Architecture: Qwen3.5 MoE (hybrid linear + full attention)
Total parameters: ~35B
Active parameters: ~3B per token
Experts: 256 total, 8 active per token
Layers: 40 (repeating pattern of 3 linear-attention layers followed by 1 full-attention layer)
Context length: 262,144 tokens
Vocabulary: 248,320 tokens
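The layer and expert figures above can be checked with a few lines of arithmetic. The sketch below assumes that full attention falls on every 4th layer counting from 1 (layers 4, 8, ..., 40); the exact indexing is an assumption, and the variable names are illustrative rather than taken from the model's config.

```python
NUM_LAYERS = 40
FULL_ATTENTION_INTERVAL = 4  # assumption: layers 4, 8, ..., 40 use full attention
NUM_EXPERTS = 256
ACTIVE_EXPERTS = 8

# Build the per-layer attention type under the assumed indexing.
layer_types = [
    "full_attention" if (i + 1) % FULL_ATTENTION_INTERVAL == 0 else "linear_attention"
    for i in range(NUM_LAYERS)
]

num_full = layer_types.count("full_attention")
num_linear = layer_types.count("linear_attention")
active_expert_fraction = ACTIVE_EXPERTS / NUM_EXPERTS

print(f"linear-attention layers: {num_linear}, full-attention layers: {num_full}")
print(f"fraction of experts active per token: {active_expert_fraction:.4f}")
```

Only a small fraction of the experts runs for each token, which is why the active parameter count (~3B) is far below the total (~35B).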

Hardware Requirements

The quantized model requires approximately 19.5 GB of VRAM/RAM for its weights. A single 24 GB GPU (e.g., RTX 3090/4090) is sufficient, as are two 12 GB GPUs with device_map="auto".
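A rough back-of-envelope estimate shows where most of that footprint comes from. The sketch below assumes every one of the ~35B parameters is stored at 4 bits, with one FP16 scale and one 4-bit zero point per group of 128 weights; those per-group overheads are assumptions about the packed format. It ignores the layers kept in FP16, activations, and the KV cache, so it lands below the figure quoted above.

```python
def quantized_weight_gb(
    num_params: float,
    bits: int = 4,
    group_size: int = 128,
    scale_bits: int = 16,  # assumption: one FP16 scale per group
    zero_bits: int = 4,    # assumption: one packed 4-bit zero point per group
) -> float:
    """Estimate quantized weight storage in GB (1 GB = 1e9 bytes)."""
    bits_per_weight = bits + (scale_bits + zero_bits) / group_size
    return num_params * bits_per_weight / 8 / 1e9

estimate_gb = quantized_weight_gb(35e9)
print(f"estimated quantized weights: {estimate_gb:.2f} GB")
print(f"fits on a single 24 GB GPU: {estimate_gb < 24}")
```

The gap between this estimate and the quoted requirement is plausibly the tensors left unquantized, such as the FP16 gate layers, though the card does not break this down.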

Quantization Command

bash
auto-round \
  --model Qwen/Qwen3.6-35B-A3B \
  --batch_size 8 \
  --iters 1000 \
  --nsamples 512 \
  --seqlen 2048 \
  --dataset opencode-instruct \
  --output_dir ./models/Qwen3.6-35B-A3B-int4-AutoRound

Credits

Base model: Qwen/Qwen3.6-35B-A3B by Alibaba Cloud
Quantization tool: spark-auto-round, a GB10-optimized fork of Intel's auto-round tuned for DGX Spark / GB10 unified-memory hardware

Model Details

Model provider: cyburn
Base model: Qwen/Qwen3.6-35B-A3B
Quantized: this model
Input modalities: Text, Image, Video

Output Modalities