cyburn

Qwen3.6-35B-A3B-int4-AutoRound

README

License: apache-2.0

Quantization Details

| Parameter | Value |
| --- | --- |
| Method | AutoRound |
| AutoRound version | 0.14.1 |
| Bits | 4 (int) |
| Group size | 128 |
| Symmetric | Yes |
| Packing format | auto_round:auto_gptq |
| Calibration dataset | opencode-instruct |
| Calibration samples | 512 |
| Sequence length | 2048 |
| Iterations | 1000 |

MLP gate layers and shared expert gate layers are kept in FP16 to preserve routing quality.
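To make the table's "4-bit, group size 128, symmetric" settings concrete, here is a minimal sketch of symmetric group-wise int4 quantization. It uses plain round-to-nearest, which is an assumption made for illustration: AutoRound additionally tunes the rounding over many iterations, and that optimization is not reproduced here. The function names `quantize_sym_int4` and `dequantize` are hypothetical.

```python
import numpy as np

def quantize_sym_int4(weights: np.ndarray, group_size: int = 128):
    """Round-to-nearest symmetric int4 quantization with one scale per group.

    Illustrative only: AutoRound also optimizes the rounding, which this skips.
    """
    flat = weights.astype(np.float32).reshape(-1, group_size)
    # Symmetric scheme: the grid is centred on zero, so only a scale is kept.
    max_abs = np.abs(flat).max(axis=1, keepdims=True)
    scale = np.where(max_abs > 0, max_abs / 7.0, 1.0)
    # Signed 4-bit integers cover the range [-8, 7].
    q = np.clip(np.round(flat / scale), -8, 7).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: np.ndarray, shape) -> np.ndarray:
    """Map the integer codes back to floats using the per-group scales."""
    return (q.astype(np.float32) * scale).reshape(shape)

rng = np.random.default_rng(0)
w = rng.normal(size=(4, 256)).astype(np.float32)  # toy weight matrix
q, scale = quantize_sym_int4(w)
w_hat = dequantize(q, scale, w.shape)
max_err = float(np.abs(w - w_hat).max())
print(f"groups: {q.shape[0]}, max abs error: {max_err:.4f}")
```

Each group of 128 weights shares one scale, so the per-weight error is bounded by half of that group's scale.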

Quality Report

Quantized with AutoRound's sensitivity-based optimization. All 40 transformer blocks were evaluated:

| Status | Count |
| --- | --- |
| Pass (cosine sim ≥ 0.99) | 27 |
| Warning (cosine sim 0.98–0.99) | 13 |

All layers maintain cosine similarity > 0.98 vs the original. Warnings are concentrated in the deeper layers (23–37), which is typical for MoE models at 4-bit.
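The pass and warning thresholds above can be illustrated with a small sketch that compares a block's original output to its quantized output. The helper names and the synthetic tensors are hypothetical, and the "fail" label for values below 0.98 is an assumption, since the report lists no such category. This is not AutoRound's own evaluation code.

```python
import numpy as np

def cosine_similarity(a, b) -> float:
    """Cosine similarity between two tensors, flattened to vectors."""
    a = np.asarray(a, dtype=np.float64).ravel()
    b = np.asarray(b, dtype=np.float64).ravel()
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def classify(sim: float) -> str:
    """Apply the report's thresholds; 'fail' is an assumed extra label."""
    if sim >= 0.99:
        return "pass"
    if sim >= 0.98:
        return "warning"
    return "fail"

rng = np.random.default_rng(0)
reference = rng.normal(size=(16, 64))                       # stand-in for an FP16 block output
quantized = reference + 0.05 * rng.normal(size=(16, 64))    # stand-in for the int4 block output
sim = cosine_similarity(reference, quantized)
print(f"cosine similarity: {sim:.4f} -> {classify(sim)}")
```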

Usage

python

from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "cyburn/Qwen3.6-35B-A3B-int4-AutoRound"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(
    model_name,
    device_map="auto",  # place layers across the available devices
)

messages = [{"role": "user", "content": "Write a Python function to compute Fibonacci numbers."}]
# Render the chat template to a prompt string that ends at the assistant turn.
text = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
inputs = tokenizer(text, return_tensors="pt").to(model.device)
# do_sample=True makes sure temperature, top_k, and top_p are actually applied.
outputs = model.generate(**inputs, max_new_tokens=512, do_sample=True, temperature=1.0, top_k=20, top_p=0.95)
# Decode only the newly generated tokens, skipping the prompt.
print(tokenizer.decode(outputs[0][inputs.input_ids.shape[1]:], skip_special_tokens=True))

Model Architecture

  • Architecture: Qwen3.5 MoE (hybrid linear + full attention)
  • Total parameters: ~35B
  • Active parameters: ~3B per token
  • Experts: 256 total, 8 active per token
  • Layers: 40 (repeating pattern of 3 linear-attention layers followed by 1 full-attention layer)
  • Context length: 262,144 tokens
  • Vocabulary: 248,320 tokens
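As a quick check on the numbers in this list, the sketch below enumerates the layer types and the fraction of experts used per token. It assumes the layer description means a repeating block of three linear-attention layers followed by one full-attention layer; the `layer_types` helper and the type names are hypothetical and not taken from the model's config.

```python
from collections import Counter

NUM_LAYERS = 40
FULL_ATTENTION_INTERVAL = 4  # assumed: every 4th layer uses full attention
NUM_EXPERTS = 256
ACTIVE_EXPERTS = 8

def layer_types(num_layers: int = NUM_LAYERS, interval: int = FULL_ATTENTION_INTERVAL):
    """Return the attention type of each layer under the assumed 3:1 pattern."""
    return [
        "full_attention" if (i + 1) % interval == 0 else "linear_attention"
        for i in range(num_layers)
    ]

types = layer_types()
counts = Counter(types)
active_fraction = ACTIVE_EXPERTS / NUM_EXPERTS

print(counts)
print(f"experts active per token: {active_fraction:.2%}")
```

Under this reading, 30 of the 40 layers use linear attention and 10 use full attention, and each token is routed to 8 of the 256 experts.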

Hardware Requirements

The quantized model requires approximately 19.5 GB of VRAM/RAM. A single 24 GB GPU (e.g., RTX 3090/4090), or two 12 GB GPUs with device_map="auto", is sufficient.
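A rough back-of-envelope estimate shows where most of that footprint comes from. The sketch treats all ~35B parameters as 4-bit weights with one 16-bit scale per group of 128, which is a simplifying assumption: it ignores the gate layers kept in FP16, any other unquantized tensors, packing overhead, activations, and the KV cache, so it lands below the reported figure. The function name is hypothetical.

```python
def estimate_weight_gb(num_params: float, bits: int = 4,
                       group_size: int = 128, scale_bits: int = 16) -> float:
    """Estimate quantized weight storage in GB (1 GB = 1e9 bytes)."""
    weight_bytes = num_params * bits / 8                     # packed int4 weights
    scale_bytes = (num_params / group_size) * scale_bits / 8  # one scale per group
    return (weight_bytes + scale_bytes) / 1e9

estimate = estimate_weight_gb(35e9)  # ~35B total parameters, from the model card
print(f"estimated quantized weights: {estimate:.2f} GB")
```

The gap between this estimate and the reported ~19.5 GB is plausibly the tensors that stay in higher precision, though the source does not break the total down.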

Quantization Command

bash

auto-round \
    --model Qwen/Qwen3.6-35B-A3B \
    --batch_size 8 \
    --iters 1000 \
    --nsamples 512 \
    --seqlen 2048 \
    --dataset opencode-instruct \
    --output_dir ./models/Qwen3.6-35B-A3B-int4-AutoRound

Credits

Model provider

cyburn

Model tree

  • Base: Qwen/Qwen3.6-35B-A3B
  • Quantized: this model

Modalities

  • Input: Video, Text, Image
  • Output: Text
