YuYu1015

Huihui-Qwen3.6-35B-A3B-abliterated-int4-AutoRound

README

License: apache-2.0

English

INT4 AutoRound quantization of huihui-ai/Huihui-Qwen3.6-35B-A3B-abliterated, optimized for NVIDIA DGX Spark (GB10 SM121) with Marlin INT4 kernel acceleration.

Model Details

Table with columns: Item, Value
Item	Value
Architecture	MoE (35B total, 3B active, 256 experts / 8 routed + 1 shared) + GDN (Mamba) + Attention hybrid
Base model	Qwen/Qwen3.6-35B-A3B
Fine-tuned by	huihui-ai (abliteration, no TransformerLens)
Quantized by	YuYu1015
Model size	~23.8 GB (vs ~71.9 GB BF16 original)
Context length	Up to 262,144 tokens (limited by KV cache on 128GB)
Thinking mode	Supported (`enable_thinking: true/false`)
Tool calling	Supported (`qwen3_xml` parser)
MTP	Built-in MTP weights included

Quantization Details

Table with columns: Item, Value
Item	Value
Method	Intel AutoRound v0.12.2
Bits	4
Group size	128
Format	auto_round (GPTQ-compatible)
Iterations	200
Calibration samples	512
Calibration sequence length	2048
Torch compile	Enabled ()

Layers Preserved in BF16

The following layers are not quantized to preserve model quality:

Table with columns: Layer, Reason
Layer	Reason
`lm_head`	Output head, sensitive to quantization noise (auto-excluded by shape)
`embed_tokens`	Input embeddings (auto-excluded by shape)
`mlp.shared_expert.*`	Shared expert weights, processes every token
`mlp.shared_expert_gate`	Shared expert routing gate
`mlp.gate`	MoE routing gate (auto-excluded by quantization scheme)

Performance

Tested on a single NVIDIA DGX Spark (GB10, 128GB LPDDR5X, SM121):

Table with columns: Configuration, Decode Speed, Notes
Configuration	Decode Speed	Notes
INT4 + DFlash-15 (daily conversation)	40-60 tok/s	With Qwen3.6-35B-A3B-DFlash drafter

Speculative Decoding

This model supports two speculative decoding methods:

DFlash (requires separate drafter model):

bash
--speculative-config '{"method": "dflash", "model": "z-lab/Qwen3.6-35B-A3B-DFlash", "num_speculative_tokens": 15}'

Note: The DFlash drafter was trained on the original Qwen3.6-35B-A3B. Acceptance rate on the abliterated variant may be lower than on the original model.

MTP (uses built-in weights, no extra model needed):

bash
--speculative-config '{"method": "mtp", "num_speculative_tokens": 1}'

Serving with vLLM

bash
vllm serve /path/to/model \
    --quantization moe_wna16 \
    --served-model-name qwen3.6-35b-a3b \
    --reasoning-parser qwen3 \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_xml \
    --kv-cache-dtype auto \
    --gpu-memory-utilization 0.80 \
    --max-model-len 65536 \
    --enable-prefix-caching \
    --enable-chunked-prefill \
    --trust-remote-code \
    --language-model-only

DGX Spark (SM121) Compatibility Notes

Use --quantization moe_wna16 for Marlin INT4 kernel (SM121 compatible via SM120 binary compat)
FP8 KV cache is not compatible with GDN non-causal attention layers; use --kv-cache-dtype auto
NVFP4 falls back to Marlin W4A16 on SM121 (missing cvt.e2m1x2 PTX instruction)
Runtime FP8 (--quantization fp8) is not compatible with DFlash (drafter inherits FP8 config and crashes)
--language-model-only skips vision encoder profiling for text-only inference
--performance-mode throughput enables CUDA graphs and kernels for throughput optimization
Clear page cache before starting on UMA: sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'

Safety Warning

This model has safety filtering removed (abliterated) and may generate sensitive, controversial, or inappropriate content. Users are solely responsible for all consequences arising from its use. Please ensure usage complies with local laws and ethical standards. Not suitable for public-facing or production applications.

Credits

Original Model: Qwen/Qwen3.6-35B-A3B by Alibaba Qwen Team
Abliteration: huihui-ai
INT4 Quantization: YuYu1015 on NVIDIA DGX Spark (GB10)
Quantization Tool: Intel AutoRound

繁體中文

huihui-ai/Huihui-Qwen3.6-35B-A3B-abliterated 的 INT4 AutoRound 量化版本，針對 NVIDIA DGX Spark (GB10 SM121) 最佳化，使用 Marlin INT4 kernel 加速。

模型資訊

Table with columns: 項目, 數值
項目	數值
架構	MoE（35B 總參數, 3B 活躍, 256 experts / 8 routed + 1 shared）+ GDN (Mamba) + Attention 混合
基礎模型	Qwen/Qwen3.6-35B-A3B
微調者	huihui-ai（abliteration，無 TransformerLens）
量化者	YuYu1015
模型大小	~23.8 GB（原版 BF16 約 71.9 GB）
Context 長度	最高 262,144 tokens（受限於 128GB 統一記憶體上的 KV cache）

量化詳情

Table with columns: 項目, 數值
項目	數值
方法	Intel AutoRound v0.12.2
位元數	4
Group size	128
格式	auto_round（GPTQ 相容）
迭代次數	200
校準樣本數	512
校準序列長度	2048
Torch compile	啟用（`--enable_torch_compile`）

保留 BF16 的層

以下層未被量化以保持模型品質：

Table with columns: 層, 原因
層	原因
`lm_head`	輸出頭，對量化雜訊敏感（因 shape 自動排除）
`embed_tokens`	輸入嵌入（因 shape 自動排除）
`mlp.shared_expert.*`	共享專家權重，處理每個 token
`mlp.shared_expert_gate`	共享專家路由門
`mlp.gate`	MoE 路由門（量化方案自動排除）
`linear_attn.*`

效能表現

在單台 NVIDIA DGX Spark (GB10, 128GB LPDDR5X, SM121) 上實測：

Table with columns: 配置, 解碼速度, 備註
配置	解碼速度	備註
INT4 + DFlash-15（日常對話）	40-60 tok/s	搭配 Qwen3.6-35B-A3B-DFlash drafter

投機解碼

本模型支援兩種投機解碼方式：

DFlash（需額外下載 drafter 模型）：

bash
--speculative-config '{"method": "dflash", "model": "z-lab/Qwen3.6-35B-A3B-DFlash", "num_speculative_tokens": 15}'

注意：DFlash drafter 是以原版 Qwen3.6-35B-A3B 訓練的，在 abliterated 版本上的接受率可能較原版低。

MTP（使用內建權重，不需額外模型）：

bash
--speculative-config '{"method": "mtp", "num_speculative_tokens": 1}'

使用 vLLM 部署

bash
vllm serve /path/to/model \
    --quantization moe_wna16 \
    --served-model-name qwen3.6-35b-a3b \
    --reasoning-parser qwen3 \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_xml \
    --kv-cache-dtype auto \
    --gpu-memory-utilization 0.80 \
    --max-model-len 65536 \
    --enable-prefix-caching \
    --enable-chunked-prefill \
    --trust-remote-code \
    --language-model-only

DGX Spark (SM121) 相容性說明

使用 --quantization moe_wna16 啟用 Marlin INT4 kernel（SM121 透過 SM120 二進制相容性支援）
FP8 KV cache 與 GDN non-causal attention 不相容，請使用 --kv-cache-dtype auto
NVFP4 在 SM121 上會 fallback 到 Marlin W4A16（缺少 cvt.e2m1x2 PTX 指令）
Runtime FP8（--quantization fp8）與 DFlash 不相容（drafter 繼承 FP8 config 導致 crash）
--language-model-only 跳過視覺編碼器 profiling，加速純文字推理啟動
--performance-mode throughput 啟用吞吐量最佳化的 CUDA graphs 和 kernel
UMA 架構啟動前請先清除 page cache：sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'

安全警告

此模型已移除安全過濾機制（abliterated），可能產生敏感、爭議性或不當內容。使用者須自行承擔所有風險與法律責任，並確保使用方式符合當地法規與倫理標準。不適用於公開或生產環境。

致謝

原始模型：Qwen/Qwen3.6-35B-A3B，Alibaba Qwen 團隊
去審查：huihui-ai
INT4 量化：YuYu1015，於 NVIDIA DGX Spark (GB10) 上完成
量化工具：Intel AutoRound

Available on FriendliAI

Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more

Model Details

Model Provider

YuYu1015

Model Tree

Base

huihui-ai/Huihui-Qwen3.6-35B-A3B-abliterated

Quantized

this model

Input Modalities

TextImageVideo

Output Modalities

Text

Supported Functionality

Dedicated Endpoints

Explore FriendliAI today

Get started Talk to an engineer

README

License: apache-2.0

English

INT4 AutoRound quantization of huihui-ai/Huihui-Qwen3.6-35B-A3B-abliterated, optimized for NVIDIA DGX Spark (GB10 SM121) with Marlin INT4 kernel acceleration.

Model Details

Table with columns: Item, Value
Item	Value
Architecture	MoE (35B total, 3B active, 256 experts / 8 routed + 1 shared) + GDN (Mamba) + Attention hybrid
Base model	Qwen/Qwen3.6-35B-A3B
Fine-tuned by	huihui-ai (abliteration, no TransformerLens)
Quantized by	YuYu1015
Model size	~23.8 GB (vs ~71.9 GB BF16 original)
Context length	Up to 262,144 tokens (limited by KV cache on 128GB)
Thinking mode	Supported (`enable_thinking: true/false`)
Tool calling	Supported (`qwen3_xml` parser)
MTP	Built-in MTP weights included

Quantization Details

Table with columns: Item, Value
Item	Value
Method	Intel AutoRound v0.12.2
Bits	4
Group size	128
Format	auto_round (GPTQ-compatible)
Iterations	200
Calibration samples	512
Calibration sequence length	2048
Torch compile	Enabled ()

Layers Preserved in BF16

The following layers are not quantized to preserve model quality:

Table with columns: Layer, Reason
Layer	Reason
`lm_head`	Output head, sensitive to quantization noise (auto-excluded by shape)
`embed_tokens`	Input embeddings (auto-excluded by shape)
`mlp.shared_expert.*`	Shared expert weights, processes every token
`mlp.shared_expert_gate`	Shared expert routing gate
`mlp.gate`	MoE routing gate (auto-excluded by quantization scheme)

Performance

Tested on a single NVIDIA DGX Spark (GB10, 128GB LPDDR5X, SM121):

Table with columns: Configuration, Decode Speed, Notes
Configuration	Decode Speed	Notes
INT4 + DFlash-15 (daily conversation)	40-60 tok/s	With Qwen3.6-35B-A3B-DFlash drafter

Speculative Decoding

This model supports two speculative decoding methods:

DFlash (requires separate drafter model):

bash
--speculative-config '{"method": "dflash", "model": "z-lab/Qwen3.6-35B-A3B-DFlash", "num_speculative_tokens": 15}'

Note: The DFlash drafter was trained on the original Qwen3.6-35B-A3B. Acceptance rate on the abliterated variant may be lower than on the original model.

MTP (uses built-in weights, no extra model needed):

bash
--speculative-config '{"method": "mtp", "num_speculative_tokens": 1}'

Serving with vLLM

bash
vllm serve /path/to/model \
    --quantization moe_wna16 \
    --served-model-name qwen3.6-35b-a3b \
    --reasoning-parser qwen3 \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_xml \
    --kv-cache-dtype auto \
    --gpu-memory-utilization 0.80 \
    --max-model-len 65536 \
    --enable-prefix-caching \
    --enable-chunked-prefill \
    --trust-remote-code \
    --language-model-only

DGX Spark (SM121) Compatibility Notes

Use --quantization moe_wna16 for Marlin INT4 kernel (SM121 compatible via SM120 binary compat)
FP8 KV cache is not compatible with GDN non-causal attention layers; use --kv-cache-dtype auto
NVFP4 falls back to Marlin W4A16 on SM121 (missing cvt.e2m1x2 PTX instruction)
Runtime FP8 (--quantization fp8) is not compatible with DFlash (drafter inherits FP8 config and crashes)
--language-model-only skips vision encoder profiling for text-only inference
--performance-mode throughput enables CUDA graphs and kernels for throughput optimization
Clear page cache before starting on UMA: sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'

Safety Warning

Credits

Original Model: Qwen/Qwen3.6-35B-A3B by Alibaba Qwen Team
Abliteration: huihui-ai
INT4 Quantization: YuYu1015 on NVIDIA DGX Spark (GB10)
Quantization Tool: Intel AutoRound

繁體中文

huihui-ai/Huihui-Qwen3.6-35B-A3B-abliterated 的 INT4 AutoRound 量化版本，針對 NVIDIA DGX Spark (GB10 SM121) 最佳化，使用 Marlin INT4 kernel 加速。

模型資訊

Table with columns: 項目, 數值
項目	數值
架構	MoE（35B 總參數, 3B 活躍, 256 experts / 8 routed + 1 shared）+ GDN (Mamba) + Attention 混合
基礎模型	Qwen/Qwen3.6-35B-A3B
微調者	huihui-ai（abliteration，無 TransformerLens）
量化者	YuYu1015
模型大小	~23.8 GB（原版 BF16 約 71.9 GB）
Context 長度	最高 262,144 tokens（受限於 128GB 統一記憶體上的 KV cache）

量化詳情

Table with columns: 項目, 數值
項目	數值
方法	Intel AutoRound v0.12.2
位元數	4
Group size	128
格式	auto_round（GPTQ 相容）
迭代次數	200
校準樣本數	512
校準序列長度	2048
Torch compile	啟用（`--enable_torch_compile`）

保留 BF16 的層

以下層未被量化以保持模型品質：

Table with columns: 層, 原因
層	原因
`lm_head`	輸出頭，對量化雜訊敏感（因 shape 自動排除）
`embed_tokens`	輸入嵌入（因 shape 自動排除）
`mlp.shared_expert.*`	共享專家權重，處理每個 token
`mlp.shared_expert_gate`	共享專家路由門
`mlp.gate`	MoE 路由門（量化方案自動排除）
`linear_attn.*`

效能表現

在單台 NVIDIA DGX Spark (GB10, 128GB LPDDR5X, SM121) 上實測：

Table with columns: 配置, 解碼速度, 備註
配置	解碼速度	備註
INT4 + DFlash-15（日常對話）	40-60 tok/s	搭配 Qwen3.6-35B-A3B-DFlash drafter

投機解碼

本模型支援兩種投機解碼方式：

DFlash（需額外下載 drafter 模型）：

bash
--speculative-config '{"method": "dflash", "model": "z-lab/Qwen3.6-35B-A3B-DFlash", "num_speculative_tokens": 15}'

注意：DFlash drafter 是以原版 Qwen3.6-35B-A3B 訓練的，在 abliterated 版本上的接受率可能較原版低。

MTP（使用內建權重，不需額外模型）：

bash
--speculative-config '{"method": "mtp", "num_speculative_tokens": 1}'

使用 vLLM 部署

bash
vllm serve /path/to/model \
    --quantization moe_wna16 \
    --served-model-name qwen3.6-35b-a3b \
    --reasoning-parser qwen3 \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_xml \
    --kv-cache-dtype auto \
    --gpu-memory-utilization 0.80 \
    --max-model-len 65536 \
    --enable-prefix-caching \
    --enable-chunked-prefill \
    --trust-remote-code \
    --language-model-only

DGX Spark (SM121) 相容性說明

使用 --quantization moe_wna16 啟用 Marlin INT4 kernel（SM121 透過 SM120 二進制相容性支援）
FP8 KV cache 與 GDN non-causal attention 不相容，請使用 --kv-cache-dtype auto
NVFP4 在 SM121 上會 fallback 到 Marlin W4A16（缺少 cvt.e2m1x2 PTX 指令）
Runtime FP8（--quantization fp8）與 DFlash 不相容（drafter 繼承 FP8 config 導致 crash）
--language-model-only 跳過視覺編碼器 profiling，加速純文字推理啟動
--performance-mode throughput 啟用吞吐量最佳化的 CUDA graphs 和 kernel
UMA 架構啟動前請先清除 page cache：sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'

安全警告

致謝

原始模型：Qwen/Qwen3.6-35B-A3B，Alibaba Qwen 團隊
去審查：huihui-ai
INT4 量化：YuYu1015，於 NVIDIA DGX Spark (GB10) 上完成
量化工具：Intel AutoRound