DJLougen
Qwen3.6-35B-A3B-NSC-ACE-SABER-MoEAware
License: apache-2.0
What this is
A research prototype exploring whether MoE-specific quantization (different precision for hot vs cold experts, router kept in BF16, etc.) can beat uniform quantization. The idea: standard methods treat all experts the same, but cold experts are rarely activated and could be quantized more aggressively.
What it isn't
- Not fast. Runs at ~14 tok/s via the custom `inference.py` + PyTorch. The deltanet attention layers fall back to naive PyTorch loops.
- Not plug-and-play. Requires the custom `inference.py`; won't work with llama.cpp, vLLM, Ollama, or standard GGUF loaders.
- Not better than GGUF Q4_K_M yet. A standard Q4_K_M GGUF runs at 76 tok/s on the same hardware at similar quality.
This is uploaded as a research artifact, not a production model.
Precision Scheme
| Component | Precision | Rationale |
|---|---|---|
| Router/gate weights | BF16 | Routing decisions critical for MoE |
| Shared expert | INT8 | Always active |
| Hot experts (top 32) | INT4 @ gs=64 | Frequently activated, finer grouping |
| Cold experts (bottom 192) | INT4 @ gs=256 | Rarely activated, coarser grouping |
| Linear attention (deltanet) | INT8 | State accumulation needs precision |
| Norms & embeddings | BF16 | Numerical stability |
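The trade-off between the two INT4 group sizes in the table can be illustrated with a minimal sketch. It assumes symmetric quantization with one 16-bit scale per group and no zero-point; the function names and the synthetic weight row are hypothetical, and the actual packing used by inference.py may differ.

```python
import random

def quantize_int4_groupwise(weights, group_size):
    """Symmetric INT4 quantization with one scale per group of weights."""
    codes, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        # Map the largest magnitude in the group to the positive INT4 limit (7).
        scale = max(abs(w) for w in group) / 7.0 or 1.0
        scales.append(scale)
        codes.extend(max(-8, min(7, round(w / scale))) for w in group)
    return codes, scales

def dequantize(codes, scales, group_size):
    return [c * scales[i // group_size] for i, c in enumerate(codes)]

def mean_abs_error(weights, group_size):
    codes, scales = quantize_int4_groupwise(weights, group_size)
    restored = dequantize(codes, scales, group_size)
    return sum(abs(a - b) for a, b in zip(weights, restored)) / len(weights)

def bits_per_weight(group_size, scale_bits=16):
    # 4 bits per code plus one scale amortized over the group (assumed layout).
    return 4 + scale_bits / group_size

random.seed(0)
# Hypothetical weight row: small Gaussian values plus a few large outliers.
row = [random.gauss(0.0, 0.02) for _ in range(1024)]
for i in range(0, 1024, 128):
    row[i] = 0.4

for gs in (64, 256):
    print(f"gs={gs}: {bits_per_weight(gs)} bits/weight, "
          f"mean abs error {mean_abs_error(row, gs):.5f}")
```

Under these assumptions a smaller group costs slightly more storage per weight but confines each outlier to fewer neighbours, which is the rationale for spending the finer grouping on the experts that are used most.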
Files
| File | Size | Description |
|---|---|---|
| model.safetensors | 19.9 GB | Packed INT4/INT8/BF16 weights |
| scales.safetensors | 448 MB | Per-tensor quantization scales |
| quant_config.json | 83 KB | Per-layer precision assignments |
| expert_classes.json | 34 KB | Hot/cold expert classification (weight-norm based) |
| inference.py | 13 KB | Custom dequantization + generation |
| config.json | 3.4 KB | Model architecture config |
| tokenizer.json | 20 MB | Tokenizer |
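To see which storage dtypes a `.safetensors` file actually contains without loading 19.9 GB of weights, the file header can be read directly. This is a small sketch that relies only on the published safetensors layout (an 8-byte little-endian header length followed by a JSON header); the helper names are hypothetical, and how this repo packs INT4 values (for example into `U8` bytes) is an assumption that the header would confirm or refute.

```python
import json
import struct
from collections import Counter

def read_safetensors_header(path):
    """Return the JSON header of a .safetensors file without loading tensors."""
    with open(path, "rb") as f:
        # The file starts with an unsigned 64-bit little-endian header length.
        (header_len,) = struct.unpack("<Q", f.read(8))
        header = json.loads(f.read(header_len))
    header.pop("__metadata__", None)  # optional free-form metadata entry
    return header

def dtype_histogram(header):
    """Count tensors per stored dtype string (e.g. 'BF16', 'I8', 'U8')."""
    return Counter(entry["dtype"] for entry in header.values())

# Example (path assumes the download location used in the Usage section):
# print(dtype_histogram(read_safetensors_header("moe_aware/model.safetensors")))
```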
Usage
```bash
# Quote the version specifiers so the shell does not treat ">" as a redirect
pip install "torch>=2.1" "transformers>=4.40" safetensors accelerate

# Download
huggingface-cli download DJLougen/Qwen3.6-35B-A3B-NSC-ACE-SABER-MoEAware --local-dir ./moe_aware

# Run
cd moe_aware
python inference.py --prompt "Explain quantum entanglement" --max-new-tokens 256
```
Why is it slow?
The model uses hybrid deltanet linear attention (3:1 ratio with full attention). PyTorch has no optimized kernel for this, so it falls back to a loop over sequence positions, which costs O(n) per generated token. A fused recurrent kernel would instead carry the deltanet state forward and update it in O(1) per token.
The quantization itself is fine. The bottleneck is the inference engine, not the format.
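The cost of looping over sequence positions, compared with carrying a state forward, can be shown with a toy example. This is a sketch of plain additive linear attention (state `S += k vᵀ`), not the deltanet delta-rule update and not the code in inference.py; all names and dimensions are hypothetical. Both decoders return the same outputs, but the naive one repeats all earlier state updates for every new token.

```python
import random

def outer(k, v):
    return [[ki * vj for vj in v] for ki in k]

def add_in_place(state, update):
    for row_s, row_u in zip(state, update):
        for j, u in enumerate(row_u):
            row_s[j] += u

def read(state, q):
    # o = q^T S, with S of shape (d_k, d_v)
    return [sum(q[i] * state[i][j] for i in range(len(q)))
            for j in range(len(state[0]))]

def decode_naive(qs, ks, vs):
    """Rebuild the state from scratch at every step: O(t) updates for token t."""
    outputs, n_updates = [], 0
    for t in range(len(qs)):
        state = [[0.0] * len(vs[0]) for _ in ks[0]]
        for i in range(t + 1):
            add_in_place(state, outer(ks[i], vs[i]))
            n_updates += 1
        outputs.append(read(state, qs[t]))
    return outputs, n_updates

def decode_recurrent(qs, ks, vs):
    """Carry the state forward: one update per token."""
    state = [[0.0] * len(vs[0]) for _ in ks[0]]
    outputs, n_updates = [], 0
    for q, k, v in zip(qs, ks, vs):
        add_in_place(state, outer(k, v))
        n_updates += 1
        outputs.append(read(state, q))
    return outputs, n_updates

random.seed(0)
T, d_k, d_v = 6, 4, 3  # toy sizes, chosen only for illustration
qs = [[random.random() for _ in range(d_k)] for _ in range(T)]
ks = [[random.random() for _ in range(d_k)] for _ in range(T)]
vs = [[random.random() for _ in range(d_v)] for _ in range(T)]

_, naive_updates = decode_naive(qs, ks, vs)
_, recurrent_updates = decode_recurrent(qs, ks, vs)
print(naive_updates, recurrent_updates)
```

The number of state updates grows quadratically with sequence length in the first decoder and linearly in the second, which is the gap an optimized kernel would close.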
What would make it fast
- Native deltanet attention kernels in llama.cpp or vLLM
- A custom GGUF quant type that supports per-expert group sizes
- A vLLM quantization backend that loads the mixed-precision format directly
None of these exist yet. This repo is a proof-of-concept that the idea works; the tooling to make it practical doesn't exist.
Architecture
- Base model: Qwen3.6-35B-A3B-NSC-ACE-SABER
- Total params: 35B (3B active per token)
- Experts: 256 per layer, top-8 routing
- Layers: 40
- Attention: Hybrid deltanet linear attention + full attention (3:1)
- Context: 32K (architecture limit: 262K)
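The "256 experts per layer, top-8 routing" entry can be made concrete with a short sketch of a single routing decision. It assumes softmax router scores that are renormalized over the selected experts, which is a common MoE convention but is not confirmed by this README; the function name and the random logits are hypothetical. The shared expert listed in the precision table would run in addition to the routed ones.

```python
import math
import random

def route_top_k(router_logits, k=8):
    """Pick the k highest-scoring experts and renormalize their weights."""
    m = max(router_logits)
    exps = [math.exp(x - m) for x in router_logits]  # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]
    top = sorted(range(len(probs)), key=probs.__getitem__, reverse=True)[:k]
    mass = sum(probs[i] for i in top)
    # Assumption: weights are renormalized so the selected experts sum to 1.
    return [(i, probs[i] / mass) for i in top]

random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(256)]  # one token, 256 experts
chosen = route_top_k(logits, k=8)
for expert_id, weight in chosen:
    print(f"expert {expert_id:3d}  weight {weight:.3f}")
print(f"routed experts used for this token: {len(chosen)} of {len(logits)}")
```

Because only 8 of 256 routed experts run per token, small errors in the router logits can change which experts are selected, which is consistent with the table's choice to keep router weights in BF16.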
Citation
```bibtex
@misc{moe-aware-saber-quantization,
  author = {DJLougen},
  title  = {Qwen3.6-35B-A3B-NSC-ACE-SABER --- MoE-Aware Quantization},
  year   = {2026},
  url    = {https://huggingface.co/DJLougen/Qwen3.6-35B-A3B-NSC-ACE-SABER-MoEAware}
}
```
License
Apache 2.0, inherited from the base Qwen3 model family.
Model provider
DJLougen
Model tree
Base
Qwen/Qwen3-30B-A3B
Quantized
this model
Modalities
Input
Video, Text, Image
Output
Text