YuYu1015

Huihui-Qwen3.6-35B-A3B-abliterated-NVFP4

README

License: apache-2.0

English

[!TIP] Quantized on 2026-04-21 using llm-compressor with mixed-domain calibration and sensitive-layer protection for maximum accuracy recovery.

[!IMPORTANT] Native W4A4 on DGX Spark (SM121) — confirmed working

Unlike earlier NVFP4 models on SM121, this checkpoint runs true W4A4 via the FlashInfer CUTLASS NVFP4 MoE kernel (verified in vLLM logs: FlashInferCutlassNvFp4LinearKernel + FLASHINFER_CUTLASS NvFp4 MoE backend). Requires:

vLLM 0.19.1rc1.dev374+g1174723eb or later (includes PR #37725 arch-suffix fix)

FlashInfer ≥ 0.6.8 with SM120f compilation (PR #2650)

CUDA ≥ 12.9

MTP speculative decoding supported — MTP layers preserved in BF16 from the original checkpoint via save_mtp_tensors_to_checkpoint.

[!WARNING] Abliteration changes the optimal speculative decoding setup — this is a known trade-off, not a defect.

This release's distinguishing feature is mixed-domain calibration (ultrachat_200k chat + Nemotron-Post-Training-Dataset-v2 reasoning, 256 samples total). The calibration recovers quantization accuracy, but it cannot undo the distribution shift introduced upstream by abliteration itself — the DFlash drafter was trained on the original Qwen3.6-35B-A3B weights, and the abliterated residual distribution no longer matches the drafter's prior, so acceptance rate drops.

Measured throughput on DGX Spark:

DFlash (num_speculative_tokens: 15) — ~50 t/s sustained, occasional bursts up to ~100 t/s

MTP (num_speculative_tokens: 1) — ~40 t/s sustained, occasional bursts up to ~70 t/s

Counter-intuitively, MTP with a single speculative token is the recommended default on this abliterated variant, even though DFlash posts the higher raw throughput above: MTP reuses the model's own hidden state, so its drafts stay aligned with the abliterated distribution that the mixed-domain calibration was tuned against. Prefer --speculative-config '{"method":"mtp","num_speculative_tokens":1}' as the default; fall back to DFlash only if you specifically need it.
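A rough model helps show why a drop in acceptance rate costs a 15-token drafter more than a 1-token one. The sketch below assumes every drafted token is accepted independently with the same probability `alpha`, which is a simplification (real acceptance varies by position and prompt). Under that assumption the expected number of tokens emitted per verification step is (1 - alpha^(k+1)) / (1 - alpha). The function name and the `alpha` values are illustrative and are not measurements from this model.

```python
def expected_tokens_per_step(alpha: float, k: int) -> float:
    """Expected tokens emitted per target-model verification step.

    Assumes an i.i.d. per-token acceptance probability `alpha` and `k`
    drafted tokens. The result counts the accepted draft prefix plus the
    one token the target model always contributes.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be in [0, 1]")
    if k < 0:
        raise ValueError("k must be non-negative")
    if alpha == 1.0:
        # Every draft token is accepted, plus the bonus token.
        return float(k + 1)
    return (1.0 - alpha ** (k + 1)) / (1.0 - alpha)


if __name__ == "__main__":
    for alpha in (0.5, 0.7, 0.9):  # illustrative values only
        for k in (1, 15):
            tokens = expected_tokens_per_step(alpha, k)
            print(f"alpha={alpha:.1f} k={k:2d} -> {tokens:.2f} tokens/step")
```

Under this model a long draft only pays off when `alpha` is high, and the estimate ignores the cost of running the drafter itself, so treat it as intuition rather than a throughput prediction.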

NVFP4 W4A4 quantization of huihui-ai/Huihui-Qwen3.6-35B-A3B-abliterated, optimized for NVIDIA DGX Spark (GB10 SM121) with the FlashInfer CUTLASS FP4 MoE kernel.

Model Details

| Item | Value |
| --- | --- |
| Architecture | MoE (35B total, 3B active, 256 experts / 8 routed + 1 shared) + GDN (Gated DeltaNet, a Mamba-style linear-attention layer) + Attention |
| Base model | Qwen/Qwen3.6-35B-A3B |
| Fine-tuned by | huihui-ai (abliteration) |
| Quantized by | YuYu1015 |
| Model size | ~25.1 GB (NVFP4, vs ~71.9 GB BF16 original) |
| Context length | Up to 262,144 tokens |

Quantization Details

This model uses a three-strategy stack (ACD) on top of the RedHatAI official flow:

| Strategy | Description |
| --- | --- |
| A. RedHatAI official baseline | `Qwen3_5MoeForConditionalGeneration` + `save_mtp_tensors_to_checkpoint` (solves OOM on Qwen3.6, preserves MTP) |
| C. Mixed-domain calibration | `ultrachat_200k` (128 chat) + `Nemotron-Post-Training-Dataset-v2` (128 reasoning) = 256 total |
| D. Sweet-spot hyperparameters | `num_calibration_samples=256`, `max_seq_length=4096` (quality > quantity) |
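The 128 + 128 mix in strategy C can be sketched as a small sampling helper. This is only an illustration: the card does not say how the samples were drawn, so the function name, the seed, and the stand-in pools below are assumptions, and loading the real datasets is left out.

```python
import random
from typing import List, Sequence, TypeVar

T = TypeVar("T")


def mix_calibration(chat: Sequence[T], reasoning: Sequence[T],
                    n_each: int = 128, seed: int = 0) -> List[T]:
    """Draw n_each samples from each pool and shuffle them together."""
    if len(chat) < n_each or len(reasoning) < n_each:
        raise ValueError("each pool needs at least n_each samples")
    rng = random.Random(seed)  # fixed seed keeps the calibration set reproducible
    mixed = rng.sample(list(chat), n_each) + rng.sample(list(reasoning), n_each)
    rng.shuffle(mixed)  # interleave domains so batches are not single-domain
    return mixed


if __name__ == "__main__":
    # Stand-ins for rows of ultrachat_200k and Nemotron-Post-Training-Dataset-v2.
    chat_pool = [f"chat-{i}" for i in range(1000)]
    reasoning_pool = [f"reason-{i}" for i in range(1000)]
    calib = mix_calibration(chat_pool, reasoning_pool)
    print(len(calib), "calibration samples")
```

The resulting list would then be tokenized to `max_seq_length=4096` and passed to the quantization run with `num_calibration_samples=256`.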

Strategies B and E are not part of the stack:

B (last-layer protection) is incompatible with vLLM's fused MoE: CompressedTensorsMoEMethod requires all projections within a MoE block (gate/up/down × 256 experts + shared_expert) to share the same quantization scheme. A partial ignore list triggers `ValueError: All MoE projections need to have same quantization scheme but found multiple`.

E (SpinQuant R1+R2) is incompatible with the multi-modal config: llm-compressor's get_head_dim reads only the top-level config, not Qwen3.6's nested text_config.
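The uniform-scheme constraint that rules out strategy B can be checked before loading a checkpoint. The sketch below is not vLLM's implementation. It is a hypothetical pre-flight check over a mapping of module names to scheme labels, and the module names in the example are made up for illustration.

```python
from collections import defaultdict
from typing import Dict, Set


def find_mixed_moe_blocks(schemes: Dict[str, str]) -> Dict[str, Set[str]]:
    """Return MoE blocks whose projections do not all share one scheme.

    `schemes` maps a module name to a scheme label such as "NVFP4" or "BF16".
    A block is identified by everything up to and including ".mlp".
    """
    per_block: Dict[str, Set[str]] = defaultdict(set)
    for name, scheme in schemes.items():
        if ".mlp." not in name:
            continue
        # Router gates stay in BF16 by design and are not expert projections.
        if name.endswith(".mlp.gate") or name.endswith("shared_expert_gate"):
            continue
        block = name.split(".mlp.", 1)[0] + ".mlp"
        per_block[block].add(scheme)
    return {b: s for b, s in per_block.items() if len(s) > 1}


if __name__ == "__main__":
    example = {
        "model.layers.0.mlp.gate": "BF16",
        "model.layers.0.mlp.experts.0.gate_proj": "NVFP4",
        "model.layers.0.mlp.experts.0.down_proj": "NVFP4",
        # A partial ignore in one block, as strategy B would produce:
        "model.layers.1.mlp.experts.0.gate_proj": "NVFP4",
        "model.layers.1.mlp.experts.0.down_proj": "BF16",
    }
    for block, found in find_mixed_moe_blocks(example).items():
        print(block, "mixes", sorted(found))
```

Any block reported here would be expected to hit the ValueError quoted above when served through the fused MoE path.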

| Item | Value |
| --- | --- |
| Method | llm-compressor (main) + compressed-tensors (main) |
| Scheme | NVFP4 W4A4 (E2M1 + FP8 per-group scaling, group size 16) |
| Format | compressed-tensors |
| Calibration datasets | HuggingFaceH4/ultrachat_200k (128) + nvidia/Nemotron-Post-Training-Dataset-v2 (128) |
| Calibration samples (total) | 256 |
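The scheme row implies a simple storage estimate: 4-bit E2M1 elements plus one 8-bit scale shared by every group of 16 weights. The arithmetic below is a back-of-the-envelope sketch that ignores the per-tensor global scales and any packing overhead. The helper names are illustrative.

```python
def nvfp4_bits_per_weight(elem_bits: int = 4, scale_bits: int = 8,
                          group_size: int = 16) -> float:
    """Average storage per quantized weight: element bits plus amortized scale bits."""
    return elem_bits + scale_bits / group_size


def compression_vs_bf16(bits_per_weight: float) -> float:
    """How many times smaller a tensor is than its 16-bit BF16 form."""
    return 16.0 / bits_per_weight


if __name__ == "__main__":
    bpw = nvfp4_bits_per_weight()
    print(f"bits per quantized weight: {bpw}")
    print(f"per-tensor compression vs BF16: {compression_vs_bf16(bpw):.2f}x")
    # Whole-checkpoint ratio from the sizes reported in the Model Details table.
    print(f"reported checkpoint ratio: {71.9 / 25.1:.2f}x")
```

The reported whole-checkpoint ratio comes out lower than the per-tensor figure, which is what one would expect when some tensors (see "Layers Preserved in BF16" and the MTP layers) are not quantized. The card does not give a per-tensor breakdown, so this is a plausibility check only.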

Layers Preserved in BF16

| Layer pattern | Reason |
| --- | --- |
| `re:.*lm_head` | Output head, sensitive to quantization noise |
| `re:.*embed_tokens$` | Input embeddings |
| `re:visual.` / `re:model.visual.` | Vision encoder |
| `re:.*mlp.gate$` | MoE router gate (routing decision, must stay BF16) |
| `re:.*shared_expert_gate$` | Shared expert routing gate |
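To see which modules these patterns would leave in BF16, the sketch below applies them to a few module names. It assumes that a `re:` prefix means "match this regex from the start of the module name" (`re.match` semantics), which is an assumption about the matching rule rather than a quote of the library code. The module names are hypothetical examples, not read from the checkpoint.

```python
import re
from typing import Iterable

# Patterns copied from the table above.
IGNORE = [
    "re:.*lm_head",
    "re:.*embed_tokens$",
    "re:visual.",
    "re:model.visual.",
    "re:.*mlp.gate$",
    "re:.*shared_expert_gate$",
]


def is_ignored(module_name: str, patterns: Iterable[str] = IGNORE) -> bool:
    """Return True if any pattern keeps this module out of quantization."""
    for pattern in patterns:
        if pattern.startswith("re:"):
            # Assumed semantics: regex anchored at the start of the name.
            if re.match(pattern[3:], module_name):
                return True
        elif pattern == module_name:
            return True
    return False


if __name__ == "__main__":
    names = [
        "lm_head",
        "model.visual.blocks.0.attn.qkv",
        "model.language_model.layers.0.mlp.gate",
        "model.language_model.layers.0.mlp.shared_expert_gate",
        "model.language_model.layers.0.mlp.experts.3.gate_proj",
    ]
    for name in names:
        print(f"{name}: {'BF16 (ignored)' if is_ignored(name) else 'quantized'}")
```

Note that the `$` anchor on `mlp.gate$` is what keeps the router gate separate from each expert's `gate_proj`, which must be quantized like the rest of the block.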

Speculative Decoding

This model supports two speculative decoding methods:

DFlash (separate drafter, recommended for single-user / low-concurrency):

```bash
--speculative-config '{"method": "dflash", "model": "z-lab/Qwen3.6-35B-A3B-DFlash", "num_speculative_tokens": 15}'
```

MTP (built-in weights, recommended default for this abliterated variant — see warning at top):

```bash
--speculative-config '{"method": "mtp", "num_speculative_tokens": 1}'
```

Note: On hybrid GDN architectures, MTP may hit a state-rollback bug (vLLM #39273) when the token-rejection rate is high, which can degrade output. Setting num_speculative_tokens to 1 also reduces exposure to this issue.
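Because the `--speculative-config` value is a JSON string inside shell quotes, it is easy to mistype when switching between the two methods. The helper below is a small sketch that builds the flag programmatically. The function names are illustrative, and the only keys used are the ones shown in the commands above.

```python
import json
import shlex
from typing import Optional


def speculative_config(method: str, num_speculative_tokens: int,
                       model: Optional[str] = None) -> str:
    """Build the JSON value for --speculative-config."""
    if method not in ("mtp", "dflash"):
        raise ValueError("this card only documents 'mtp' and 'dflash'")
    if method == "dflash" and model is None:
        raise ValueError("dflash needs a separate drafter model")
    cfg = {"method": method, "num_speculative_tokens": num_speculative_tokens}
    if model is not None:
        cfg["model"] = model
    return json.dumps(cfg, separators=(",", ":"))


def as_cli_flag(cfg_json: str) -> str:
    """Quote the JSON so it survives the shell as a single argument."""
    return "--speculative-config " + shlex.quote(cfg_json)


if __name__ == "__main__":
    print(as_cli_flag(speculative_config("mtp", 1)))
    print(as_cli_flag(speculative_config(
        "dflash", 15, model="z-lab/Qwen3.6-35B-A3B-DFlash")))
```

The printed flags can be pasted into the `vllm serve` command in place of the existing `--speculative-config` line.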

Serving with vLLM

```bash
vllm serve /path/to/model \
    --quantization compressed-tensors \
    --served-model-name qwen3.6-35b-a3b \
    --reasoning-parser qwen3 \
    --enable-auto-tool-choice \
    --tool-call-parser qwen3_xml \
    --attention-backend flash_attn \
    --kv-cache-dtype auto \
    --gpu-memory-utilization 0.80 \
    --max-model-len 65536 \
    --max-num-batched-tokens 16384 \
    --max-num-seqs 64 \
    --enable-prefix-caching \
    --enable-chunked-prefill \
    --performance-mode throughput \
    --speculative-config '{"method":"dflash","model":"/models/Qwen3.6-35B-A3B-DFlash","num_speculative_tokens":15}' \
    --trust-remote-code \
    --language-model-only
```

DGX Spark (SM121) Compatibility Notes

- Native W4A4 confirmed via the FlashInfer CUTLASS NVFP4 MoE backend (no more W4A16 fallback).
- Verify in logs: `Using FlashInferCutlassNvFp4LinearKernel for NVFP4 GEMM` and `Using 'FLASHINFER_CUTLASS' NvFp4 MoE backend`.
- FP8 KV cache is not compatible with GDN non-causal attention; use `--kv-cache-dtype auto`.
- DFlash requires `--attention-backend flash_attn` (the flashinfer backend is incompatible with DFlash).
- `--language-model-only` skips vision encoder profiling, which speeds up startup for text-only inference.
- Clear the page cache before starting on UMA: `sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'`
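The log check described above can be automated. The sketch below looks for the two marker strings quoted in this section in a captured vLLM log. The function name is illustrative, and the exact log wording may change between vLLM versions, so a missing marker is a prompt to inspect the log rather than proof of a fallback.

```python
import sys
from typing import List

# Marker strings quoted in the compatibility notes above.
REQUIRED_MARKERS = (
    "Using FlashInferCutlassNvFp4LinearKernel for NVFP4 GEMM",
    "Using 'FLASHINFER_CUTLASS' NvFp4 MoE backend",
)


def missing_w4a4_markers(log_text: str) -> List[str]:
    """Return the expected markers that do not appear in the log text."""
    return [marker for marker in REQUIRED_MARKERS if marker not in log_text]


if __name__ == "__main__":
    if len(sys.argv) != 2:
        sys.exit("usage: python check_w4a4.py <vllm-log-file>")
    # The log path is supplied by the user.
    with open(sys.argv[1], encoding="utf-8", errors="replace") as fh:
        missing = missing_w4a4_markers(fh.read())
    if missing:
        print("Native W4A4 not confirmed; missing log lines:")
        for marker in missing:
            print("  ", marker)
        sys.exit(1)
    print("Both NVFP4 kernel markers found.")
```

Run it against a file that captures the server's startup output, for example one produced by redirecting `vllm serve` output to disk.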

Known Limitations

Per-tensor global scales for fused q_proj / k_proj / v_proj may differ, causing a vLLM warning at load time. This is inherent to the llm-compressor per-layer quantization behavior; the impact on accuracy is typically small but measurable on strict tool-calling JSON schemas.
DFlash drafter was trained on the original Qwen3.6-35B-A3B, not the abliterated variant — acceptance rate may be lower than on the original model.

Safety Warning

This model has safety filtering removed (abliterated) and may generate sensitive, controversial, or inappropriate content. Users are solely responsible for all consequences arising from its use. Please ensure usage complies with local laws and ethical standards. Not suitable for public-facing or production applications.

Credits

Original Model: Qwen/Qwen3.6-35B-A3B by Alibaba Qwen Team
Abliteration: huihui-ai
NVFP4 Quantization: YuYu1015 on NVIDIA DGX Spark (GB10)
Quantization Tool: llm-compressor by vLLM Project
Official Reference: RedHatAI/Qwen3.6-35B-A3B-NVFP4
Sensitivity Analysis: Diagnosing FP4 Inference (arXiv 2603.08747)

Model Details

Model Provider

YuYu1015

Model Tree

Base

huihui-ai/Huihui-Qwen3.6-35B-A3B-abliterated

Quantized

this model

Input Modalities

Text, Image, Video

Output Modalities

Text
