DJLougen

Qwen3.6-35B-A3B-NSC-ACE-SABER-MoEAware

License: apache-2.0

What this is

A research prototype exploring whether MoE-specific quantization (different precision for hot vs cold experts, router kept in BF16, etc.) can beat uniform quantization. The idea: standard methods treat all experts the same, but cold experts are rarely activated and could be quantized more aggressively.

What it isn't

  • Not fast. Runs at ~14 tok/s via the custom inference.py + PyTorch. The deltanet attention layers fall back to naive PyTorch loops.
  • Not plug-and-play. Requires the custom inference.py — won't work with llama.cpp, vLLM, Ollama, or standard GGUF loaders.
  • Not better than GGUF Q4_K_M yet. A standard Q4_K_M GGUF runs at 76 tok/s on the same hardware at similar quality.

This is uploaded as a research artifact, not a production model.

Precision Scheme

| Component | Precision | Rationale |
| --- | --- | --- |
| Router/gate weights | BF16 | Routing decisions critical for MoE |
| Shared expert | INT8 | Always active |
| Hot experts (top 32) | INT4 @ gs=64 | Frequently activated, finer grouping |
| Cold experts (bottom 192) | INT4 @ gs=256 | Rarely activated, coarser grouping |
| Linear attention (deltanet) | INT8 | State accumulation needs precision |
| Norms & embeddings | BF16 | Numerical stability |
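
To make the group-size trade-off concrete, here is a minimal sketch of symmetric group-wise INT4 quantization in plain Python. It is not the code in inference.py: the function names, the symmetric scheme with no zero-point, and the 16-bit scale per group are all assumptions made for illustration. It shows that a smaller group size costs more storage per weight, while each group gets a scale fitted to fewer values.

```python
import random

def quantize_groupwise(weights, group_size, bits=4):
    """Symmetric quantization with one scale per `group_size` weights (assumed scheme)."""
    qmax = 2 ** (bits - 1) - 1  # 7 for INT4
    codes, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        scale = max_abs / qmax if max_abs > 0 else 1.0
        scales.append(scale)
        # Round to the nearest integer code and clamp to the signed range.
        codes.extend(max(-qmax, min(qmax, round(w / scale))) for w in group)
    return codes, scales

def dequantize_groupwise(codes, scales, group_size):
    """Map integer codes back to floats using the scale of each code's group."""
    return [c * scales[i // group_size] for i, c in enumerate(codes)]

def bits_per_weight(group_size, bits=4, scale_bits=16):
    """Storage cost per weight, assuming one `scale_bits`-wide scale per group."""
    return bits + scale_bits / group_size

def mean_squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

random.seed(0)
row = [random.gauss(0.0, 0.02) for _ in range(1024)]  # synthetic weights, not model data
for label, gs in (("hot", 64), ("cold", 256)):
    codes, scales = quantize_groupwise(row, gs)
    restored = dequantize_groupwise(codes, scales, gs)
    print(label, gs, bits_per_weight(gs), mean_squared_error(row, restored))
```

Under these assumptions, gs=64 costs 4.25 bits per weight and gs=256 costs 4.0625, so the coarser grouping saves a little storage on the experts that matter least.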

Files

| File | Size | Description |
| --- | --- | --- |
| model.safetensors | 19.9 GB | Packed INT4/INT8/BF16 weights |
| scales.safetensors | 448 MB | Per-tensor quantization scales |
| quant_config.json | 83 KB | Per-layer precision assignments |
| expert_classes.json | 34 KB | Hot/cold expert classification (weight-norm based) |
| inference.py | 13 KB | Custom dequantization + generation |
| config.json | 3.4 KB | Model architecture config |
| tokenizer.json | 20 MB | Tokenizer |
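
To check what is actually stored in the .safetensors files without loading 19.9 GB into memory, the header can be read on its own. The sketch below relies only on the published safetensors layout (an 8-byte little-endian header length followed by a JSON header); the helper names are hypothetical, and how this repo labels its packed INT4 tensors in the `dtype` field is not something the sketch assumes.

```python
import json
import struct

def read_safetensors_header(path):
    """Return the JSON header of a .safetensors file without reading tensor data."""
    with open(path, "rb") as f:
        # First 8 bytes: unsigned little-endian length of the JSON header.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len).decode("utf-8"))
    # Optional free-form metadata entry; it does not describe a tensor.
    header.pop("__metadata__", None)
    return header

def summarize_dtypes(header):
    """Map each dtype to (tensor count, total bytes) using the data offsets."""
    summary = {}
    for info in header.values():
        begin, end = info["data_offsets"]
        count, size = summary.get(info["dtype"], (0, 0))
        summary[info["dtype"]] = (count + 1, size + (end - begin))
    return summary
```

After downloading, `summarize_dtypes(read_safetensors_header("model.safetensors"))` would give a per-dtype breakdown that can be compared against the precision scheme.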

Usage

```bash
# Quote the version specifiers so the shell does not treat ">" as a redirect
pip install "torch>=2.1" "transformers>=4.40" safetensors accelerate
# Download
huggingface-cli download DJLougen/Qwen3.6-35B-A3B-NSC-ACE-SABER-MoEAware --local-dir ./moe_aware
# Run
cd moe_aware
python inference.py --prompt "Explain quantum entanglement" --max-new-tokens 256
```

Why is it slow?

The model uses hybrid deltanet linear attention (3:1 ratio with full attention). PyTorch has no optimized kernel for deltanet, so inference.py falls back to a loop over sequence positions. That costs O(n) work per generated token in those layers, where a fused recurrent kernel would only need an O(1) update of a fixed-size state.
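
For context, the delta-rule recurrence that deltanet-style layers are built on can in principle be advanced one token at a time with a fixed-size state. The sketch below is a plain-Python illustration of that recurrence, S ← S + β (v − S k) kᵀ followed by o = S q. It is not the code in inference.py, it omits any gating or normalization the real layers may use, and the function name is hypothetical.

```python
def delta_rule_step(state, q, k, v, beta):
    """Advance the recurrent state by one token and return that token's output.

    `state` is a d_v x d_k matrix (list of rows), updated in place.
    Cost is O(d_v * d_k) per token, independent of sequence position.
    """
    d_v, d_k = len(state), len(state[0])
    # What the current state would return for key k: S k
    pred = [sum(state[i][j] * k[j] for j in range(d_k)) for i in range(d_v)]
    for i in range(d_v):
        # Move the stored value for k toward v by a fraction beta.
        err = beta * (v[i] - pred[i])
        for j in range(d_k):
            state[i][j] += err * k[j]
    # Read out with the query: o = S q
    return [sum(state[i][j] * q[j] for j in range(d_k)) for i in range(d_v)]

# Tiny demo: two orthogonal keys, then an overwrite of the first one.
state = [[0.0, 0.0], [0.0, 0.0]]
out1 = delta_rule_step(state, q=[1.0, 0.0], k=[1.0, 0.0], v=[3.0, 4.0], beta=1.0)
out2 = delta_rule_step(state, q=[1.0, 0.0], k=[0.0, 1.0], v=[5.0, 6.0], beta=1.0)
out3 = delta_rule_step(state, q=[1.0, 0.0], k=[1.0, 0.0], v=[7.0, 8.0], beta=1.0)
print(out1, out2, out3)
```

The nested Python loops here are themselves slow; the point is only the cost structure. A native kernel would run the same constant-size update on the GPU instead of walking over positions in Python.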

The quantization itself is fine. The bottleneck is the inference engine, not the format.

What would make it fast

  • Native deltanet attention kernels in llama.cpp or vLLM
  • A custom GGUF quant type that supports per-expert group sizes
  • A vLLM quantization backend that loads the mixed-precision format directly

None of these exist yet, so this repo is only a proof of concept that the idea works, not a practical way to run the model.

Architecture

  • Base model: Qwen3.6-35B-A3B-NSC-ACE-SABER
  • Total params: 35B (3B active per token)
  • Experts: 256 per layer, top-8 routing
  • Layers: 40
  • Attention: Hybrid deltanet linear attention + full attention (3:1)
  • Context: 32K (architecture limit: 262K)
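
The numbers above imply that only a small slice of each MoE layer runs per token. A minimal sketch of top-8 routing over 256 experts follows, assuming a softmax router whose selected weights are renormalized to sum to 1 (a common convention, not confirmed for this model) and reading "3:1" as three deltanet layers per full-attention layer. The function name is hypothetical.

```python
import math

N_EXPERTS, TOP_K, N_LAYERS = 256, 8, 40

def route_top_k(router_logits, k=TOP_K):
    """Pick the k highest-probability experts and renormalize their weights."""
    m = max(router_logits)
    exps = [math.exp(x - m) for x in router_logits]  # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    mass = sum(probs[i] for i in top)
    return [(i, probs[i] / mass) for i in top]

# Synthetic logits for illustration only.
logits = [0.1 * i for i in range(N_EXPERTS)]
routed = route_top_k(logits)

active_fraction = TOP_K / N_EXPERTS        # share of routed experts used per token
linear_layers = N_LAYERS * 3 // 4          # assumed 3:1 split of 40 layers
full_layers = N_LAYERS - linear_layers
print(routed[0][0], active_fraction, linear_layers, full_layers)
```

With these figures, each token touches 8 of 256 routed experts (about 3%) per layer, which is why the router stays in BF16: a small error in the logits can change which experts are selected at all.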

Citation

```bibtex
@misc{moe-aware-saber-quantization,
  author = {DJLougen},
  title  = {Qwen3.6-35B-A3B-NSC-ACE-SABER --- MoE-Aware Quantization},
  year   = {2026},
  url    = {https://huggingface.co/DJLougen/Qwen3.6-35B-A3B-NSC-ACE-SABER-MoEAware}
}
```

License

Apache 2.0 — inherits from the base Qwen3 model family.

Model provider

DJLougen

Model tree

Base

Qwen/Qwen3-30B-A3B

Quantized

this model

Modalities

Input

Video, Text, Image

Output

Text
