DJLougen

Qwen3.6-35B-A3B-NSC-ACE-SABER-MoEAware

License: apache-2.0

What this is

A research prototype exploring whether MoE-specific quantization (different precision for hot vs cold experts, router kept in BF16, etc.) can beat uniform quantization. The idea: standard methods treat all experts the same, but cold experts are rarely activated and could be quantized more aggressively.

What it isn't

Not fast. Runs at ~14 tok/s via the custom inference.py + PyTorch. The deltanet attention layers fall back to naive PyTorch loops.
Not plug-and-play. Requires the custom inference.py — won't work with llama.cpp, vLLM, Ollama, or standard GGUF loaders.
Not better than GGUF Q4_K_M yet. A standard Q4_K_M GGUF runs at 76 tok/s on the same hardware at similar quality.

This is uploaded as a research artifact, not a production model.

Precision Scheme

Component	Precision	Rationale
Router/gate weights	BF16	Routing decisions critical for MoE
Shared expert	INT8	Always active
Hot experts (top 32)	INT4 @ gs=64	Frequently activated, finer grouping
Cold experts (bottom 192)	INT4 @ gs=256	Rarely activated, coarser grouping
Linear attention (deltanet)	INT8	State accumulation needs precision
Norms & embeddings	BF16	Numerical stability
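
To make the group-size column concrete, here is a minimal sketch of symmetric group-wise INT4 quantization in plain Python. It is not the code used to produce this checkpoint: the function names, the max-abs symmetric scaling, and the 16-bit scale width in `effective_bits` are all assumptions for illustration.

```python
def quantize_groupwise(weights, group_size, bits=4):
    """Quantize a flat list of floats with one scale per group (hypothetical helper)."""
    qmax = 2 ** (bits - 1) - 1  # 7 for signed INT4
    codes, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        # Assumption: symmetric scaling so the largest magnitude maps to qmax.
        scale = max_abs / qmax if max_abs > 0 else 1.0
        scales.append(scale)
        for w in group:
            q = round(w / scale)
            codes.append(max(-qmax - 1, min(qmax, q)))  # clamp to [-8, 7]
    return codes, scales


def dequantize_groupwise(codes, scales, group_size):
    """Map integer codes back to floats using each group's scale."""
    return [c * scales[i // group_size] for i, c in enumerate(codes)]


def effective_bits(bits, group_size, scale_bits=16):
    """Storage cost per weight, assuming one scale_bits-wide scale per group."""
    return bits + scale_bits / group_size
```

Under these assumptions, gs=64 costs 4.25 bits per weight and gs=256 costs 4.0625. The coarser grouping saves storage on cold experts, at the price of a larger rounding step in any group that contains an outlier.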

Files

File	Size	Description
`model.safetensors`	19.9 GB	Packed INT4/INT8/BF16 weights
`scales.safetensors`	448 MB	Per-tensor quantization scales
`quant_config.json`	83 KB	Per-layer precision assignments
`expert_classes.json`	34 KB	Hot/cold expert classification (weight-norm based)
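
The repo describes the hot/cold split as weight-norm based but does not spell out the procedure. One plausible reading is sketched below in plain Python. The Frobenius norm, the function names, and the rule that every expert outside the top `n_hot` is labeled cold are assumptions, not the contents or schema of `expert_classes.json`.

```python
import math


def expert_norm(weight_rows):
    """Frobenius norm of one expert's weight matrix, given as a list of rows."""
    return math.sqrt(sum(w * w for row in weight_rows for w in row))


def classify_experts(norms, n_hot):
    """Label the n_hot largest-norm experts 'hot' and all others 'cold'.

    Assumption: ranking by norm stands in for activation frequency.
    """
    ranked = sorted(range(len(norms)), key=lambda i: norms[i], reverse=True)
    hot = set(ranked[:n_hot])
    return {i: ("hot" if i in hot else "cold") for i in range(len(norms))}
```

For a toy layer with norms `[0.5, 3.0, 1.0, 2.0]` and `n_hot=2`, experts 1 and 3 come out hot and experts 0 and 2 cold.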

Usage

bash
# Quote the version specifiers so the shell does not treat ">" as a redirect
pip install "torch>=2.1" "transformers>=4.40" safetensors accelerate

# Download
huggingface-cli download DJLougen/Qwen3.6-35B-A3B-NSC-ACE-SABER-MoEAware --local-dir ./moe_aware

# Run
cd moe_aware
python inference.py --prompt "Explain quantum entanglement" --max-new-tokens 256

Why is it slow?

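The model uses hybrid deltanet linear attention (3:1 ratio with full attention). PyTorch has no optimized kernel for this, so inference.py falls back to a Python loop over sequence positions. That is O(n) sequential steps per token, where a fused recurrent kernel would update the deltanet state in O(1) per token. (Standard KV-cache attention also reads O(n) cached keys per token, but does so in a single batched kernel call.)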
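
A loop of this kind has roughly the following shape. This is a sketch of the plain delta-rule recurrence in pure Python, not the layer from inference.py: the function names are hypothetical, and any gating, decay, or normalization the real deltanet layers apply is omitted.

```python
def delta_rule_step(state, q, k, v, beta):
    """Advance the recurrent state by one position and return the output.

    state is a d_k x d_v matrix (list of rows) and is updated in place.
    """
    d_k, d_v = len(k), len(v)
    # What the current state would predict for this key: pred = state^T k
    pred = [sum(state[i][j] * k[i] for i in range(d_k)) for j in range(d_v)]
    # Delta-rule correction: state += beta * k (v - pred)^T
    for i in range(d_k):
        for j in range(d_v):
            state[i][j] += beta * k[i] * (v[j] - pred[j])
    # Read out with the query: out = state^T q
    return [sum(state[i][j] * q[i] for i in range(d_k)) for j in range(d_v)]


def naive_prefill(qs, ks, vs, betas):
    """Process a sequence one position at a time, as a naive loop would."""
    d_k, d_v = len(ks[0]), len(vs[0])
    state = [[0.0] * d_v for _ in range(d_k)]
    outputs = []
    for q, k, v, beta in zip(qs, ks, vs, betas):
        outputs.append(delta_rule_step(state, q, k, v, beta))
    return outputs, state
```

Each step depends on the state left by the previous one, so a Python-level loop issues its work position by position. Optimized kernels avoid that overhead by fusing the update or processing the sequence in chunks.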

The quantization itself is fine. The bottleneck is the inference engine, not the format.

What would make it fast

Native deltanet attention kernels in llama.cpp or vLLM
A custom GGUF quant type that supports per-expert group sizes
A vLLM quantization backend that loads the mixed-precision format directly

None of these exist yet. This repo is a proof-of-concept that the mixed-precision scheme can be built and run end to end; it stays impractical until that tooling exists.

Architecture

Base model: Qwen3.6-35B-A3B-NSC-ACE-SABER
Total params: 35B (3B active per token)
Experts: 256 per layer, top-8 routing
Layers: 40
Attention: Hybrid deltanet linear attention + full attention (3:1)
Context: 32K (architecture limit: 262K)
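
Top-8 routing over 256 experts can be sketched as below. This is a generic softmax top-k router in plain Python, not code from this repo; the function name and the renormalization of the selected weights are assumptions.

```python
import math


def top_k_route(logits, k=8):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    # Numerically stable softmax over all expert logits
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    probs = [e / total for e in exps]
    # Keep the k most probable experts
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    # Assumption: selected weights are renormalized to sum to 1
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]
```

Only 8 of the 256 routed experts run for each token, so a small perturbation of the router logits can change which experts are selected. That is the rationale the precision table gives for keeping router weights in BF16.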

Citation

bibtex
@misc{moe-aware-saber-quantization,
  author = {DJLougen},
  title  = {Qwen3.6-35B-A3B-NSC-ACE-SABER — MoE-Aware Quantization},
  year   = {2026},
  url    = {https://huggingface.co/DJLougen/Qwen3.6-35B-A3B-NSC-ACE-SABER-MoEAware}
}

License

Apache 2.0 — inherits from the base Qwen3 model family.
