cyburn
Qwen3.6-35B-A3B-int4-AutoRound
License: apache-2.0
Quantization Details
| Parameter | Value |
|---|---|
| Method | AutoRound |
| AutoRound version | 0.14.1 |
| Bits | 4 (int) |
| Group size | 128 |
| Symmetric | Yes |
| Packing format | auto_round:auto_gptq |
| Calibration dataset | opencode-instruct |
| Calibration samples | 512 |
| Sequence length | 2048 |
| Iterations | 1000 |
MLP gate layers and shared expert gate layers are kept in FP16 to preserve routing quality.
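To make the "4-bit, group size 128, symmetric" settings in the table concrete, here is a minimal sketch of symmetric group-wise quantization using plain round-to-nearest. This is an illustrative baseline only: AutoRound additionally tunes rounding over its optimization iterations, and the function names, the signed code range of -8..7, and the scale rule (`max_abs / 7`) are assumptions chosen for the example, not taken from the AutoRound source or the packed checkpoint format.

```python
def quantize_group_sym_int4(weights):
    """Round-to-nearest symmetric 4-bit quantization of one weight group.

    Assumption: signed codes in [-8, 7] and scale = max|w| / 7.
    """
    qmax = 7
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [max(-8, min(7, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_group(codes, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]


def quantize_row(row, group_size=128):
    """Split a weight row into groups; each group gets its own scale."""
    return [
        quantize_group_sym_int4(row[start:start + group_size])
        for start in range(0, len(row), group_size)
    ]


# Toy example: 256 synthetic weights form 2 groups of 128.
row = [((i * 37) % 101 - 50) / 500.0 for i in range(256)]
groups = quantize_row(row)
restored = [v for codes, scale in groups for v in dequantize_group(codes, scale)]
max_err = max(abs(a - b) for a, b in zip(row, restored))
```

Because each group of 128 weights shares one scale, the worst-case rounding error in this sketch is bounded by half of that group's scale.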
Quality Report
Quantized with AutoRound's sensitivity-based optimization. All 40 transformer blocks were evaluated:
| Status | Count |
|---|---|
| Pass (cosine sim ≥ 0.99) | 27 |
| Warning (cosine sim 0.98–0.99) | 13 |
All layers maintain cosine similarity > 0.98 vs the original. Warnings are concentrated in the deeper layers (23–37), which is typical for MoE models at 4-bit.
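The pass and warning thresholds above can be illustrated with a small sketch that compares a block's original output with its quantized output. The function names, the toy vectors, and the `"fail"` label for values below 0.98 are hypothetical; the report itself only defines the pass and warning bands, and how AutoRound collects block outputs is not shown here.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def classify(sim, pass_threshold=0.99, warn_threshold=0.98):
    """Bucket a similarity score using the thresholds from the report.

    The "fail" bucket is an assumption; the report lists no such status.
    """
    if sim >= pass_threshold:
        return "pass"
    if sim >= warn_threshold:
        return "warning"
    return "fail"


# Toy stand-ins for one block's output before and after quantization.
reference_output = [1.0, 2.0, 3.0, 4.0]
quantized_output = [1.1, 1.9, 3.2, 3.9]
status = classify(cosine_similarity(reference_output, quantized_output))
```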
Usage
```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "cyburn/Qwen3.6-35B-A3B-int4-AutoRound"

tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",
)

messages = [
    {"role": "user", "content": "Write a Python function to compute Fibonacci numbers."}
]
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)

# Enable sampling explicitly so temperature, top_k, and top_p take effect
outputs = model.generate(
    **inputs,
    max_new_tokens=512,
    do_sample=True,
    temperature=1.0,
    top_k=20,
    top_p=0.95,
)

# Decode only the newly generated tokens, skipping the prompt
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))
```
Model Architecture
- Architecture: Qwen3.5 MoE (hybrid linear + full attention)
- Total parameters: ~35B
- Active parameters: ~3B per token
- Experts: 256 total, 8 active per token
- Layers: 40 (three linear-attention layers followed by one full-attention layer, repeated)
- Context length: 262,144 tokens
- Vocabulary: 248,320 tokens
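The layer and expert figures in this list can be turned into a quick sanity check. The sketch below assumes the hybrid layout means every fourth layer (counting from 1) uses full attention and the rest use linear attention; the exact placement is an assumption, and the constant and label names are hypothetical rather than taken from the model's configuration.

```python
NUM_LAYERS = 40
FULL_ATTENTION_INTERVAL = 4  # assumption: layers 4, 8, 12, ... use full attention

layer_types = [
    "full_attention" if (i + 1) % FULL_ATTENTION_INTERVAL == 0 else "linear_attention"
    for i in range(NUM_LAYERS)
]
num_full = layer_types.count("full_attention")
num_linear = layer_types.count("linear_attention")

NUM_EXPERTS = 256
ACTIVE_EXPERTS = 8
# Fraction of routed experts that process any given token
active_expert_fraction = ACTIVE_EXPERTS / NUM_EXPERTS
```

Under that assumption, the model has 10 full-attention layers and 30 linear-attention layers, and each token is routed to 8 of 256 experts (about 3% of them), which is why the active parameter count is far below the total.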
Hardware Requirements
The quantized model requires approximately 19.5 GB of VRAM/RAM. A single 24 GB GPU (e.g., RTX 3090/4090), or two 12 GB GPUs with `device_map="auto"`, is sufficient.
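As a rough cross-check of the memory figure, the following back-of-the-envelope sketch estimates the weight footprint from the quantization settings. It assumes all ~35B parameters are stored at 4 bits with one 16-bit scale per group of 128 weights; the function name and the 16-bit scale size are assumptions. Layers kept in FP16, embeddings, and other unquantized tensors are ignored, so the result is a lower bound.

```python
def estimate_weight_gb(num_params, bits=4, group_size=128, scale_bits=16):
    """Approximate weight storage in GB for group-wise quantization.

    Assumption: one scale of `scale_bits` bits is stored per group.
    """
    bits_per_weight = bits + scale_bits / group_size
    return num_params * bits_per_weight / 8 / 1e9


quantized_gb = estimate_weight_gb(35e9)       # about 18.0 GB
fp16_gb = estimate_weight_gb(35e9, bits=16, scale_bits=0)  # 70.0 GB baseline
```

The gap between this estimate of about 18 GB and the stated 19.5 GB is plausibly explained by the tensors left unquantized. The KV cache and activations need additional memory on top of the weights, especially at long context lengths.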
Quantization Command
```bash
auto-round \
  --model Qwen/Qwen3.6-35B-A3B \
  --batch_size 8 \
  --iters 1000 \
  --nsamples 512 \
  --seqlen 2048 \
  --dataset opencode-instruct \
  --output_dir ./models/Qwen3.6-35B-A3B-int4-AutoRound
```
Credits
- Base model: Qwen/Qwen3.6-35B-A3B by Alibaba Cloud
- Quantization tool: spark-auto-round — a GB10-optimized fork of Intel's auto-round, tuned for DGX Spark / GB10 unified memory hardware
Model provider
- cyburn
Model tree
- Base: Qwen/Qwen3.6-35B-A3B
- Quantized: this model
Modalities
- Input: Video, Text, Image
- Output: Text