YuYu1015

Huihui-Qwen3.6-35B-A3B-abliterated-NVFP4

Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more

Get help setting up a custom Dedicated Endpoints.

Talk with our engineer to get a quote for reserved GPU instances with discounts.

README

License: apache-2.0

English

[!TIP] Quantized on 2026-04-21 using llm-compressor with mixed-domain calibration and sensitive-layer protection for maximum accuracy recovery.

[!IMPORTANT] Native W4A4 on DGX Spark (SM121) — confirmed working

Unlike earlier NVFP4 models on SM121, this checkpoint runs true W4A4 via FlashInfer CUTLASS NVFP4 MoE kernel (verified in vLLM logs: FlashInferCutlassNvFp4LinearKernel + FLASHINFER_CUTLASS NvFp4 MoE backend). Requires:

  • vLLM 0.19.1rc1.dev374+g1174723eb or later (includes PR #37725 arch-suffix fix)
  • FlashInfer ≥ 0.6.8 with SM120f compilation (PR #2650)
  • CUDA ≥ 12.9

MTP speculative decoding supported — MTP layers preserved in BF16 from the original checkpoint via save_mtp_tensors_to_checkpoint.

[!WARNING] Abliteration changes the optimal speculative decoding setup — this is a known trade-off, not a defect.

This release's distinguishing feature is mixed-domain calibration (ultrachat_200k chat + Nemotron-Post-Training-Dataset-v2 reasoning, 256 samples total). The calibration recovers quantization accuracy, but it cannot undo the distribution shift introduced upstream by abliteration itself — the DFlash drafter was trained on the original Qwen3.6-35B-A3B weights, and the abliterated residual distribution no longer matches the drafter's prior, so acceptance rate drops.

Measured throughput on DGX Spark:

  • DFlash (num_speculative_tokens: 15) — ~50 t/s sustained, occasional bursts up to ~100 t/s
  • MTP (num_speculative_tokens: 1) — ~40 t/s sustained, occasional bursts up to ~70 t/s

Counter-intuitively, MTP with a single speculative token outperforms DFlash on this abliterated variant — MTP reuses the model's own hidden state, so it stays aligned with the abliterated distribution that the mixed-domain calibration was tuned against. Prefer --speculative-config '{"method":"mtp","num_speculative_tokens":1}' as the default; only fall back to DFlash if you specifically need it.

NVFP4 W4A4 quantization of huihui-ai/Huihui-Qwen3.6-35B-A3B-abliterated, optimized for NVIDIA DGX Spark (GB10 SM121) with the FlashInfer CUTLASS FP4 MoE kernel.

Model Details

Table
ItemValue
ArchitectureMoE (35B total, 3B active, 256 experts / 8 routed + 1 shared) + GDN (Mamba) + Attention
Base modelQwen/Qwen3.6-35B-A3B
Fine-tuned byhuihui-ai (abliteration)
Quantized byYuYu1015
Model size~25.1 GB (NVFP4, vs ~71.9 GB BF16 original)
Context lengthUp to 262,144 tokens
Thinking modeSupported (enable_thinking: true/false)
Tool callingSupported (qwen3_xml parser)
MTPBuilt-in MTP weights included (preserved in BF16)
DFlashCompatible with z-lab/Qwen3.6-35B-A3B-DFlash

Quantization Details

This model uses a three-strategy stack (ACD) on top of the RedHatAI official flow:

Table
StrategyDescription
A. RedHatAI official baselineQwen3_5MoeForConditionalGeneration + save_mtp_tensors_to_checkpoint (solves OOM on Qwen3.6, preserves MTP)
C. Mixed-domain calibrationultrachat_200k (128 chat) + Nemotron-Post-Training-Dataset-v2 (128 reasoning) = 256 total
D. Sweet-spot hyperparametersnum_calibration_samples=256, max_seq_length=4096 (quality > quantity)

B (last-layer protection) incompatible with vLLM fused MoE: vLLM's CompressedTensorsMoEMethod requires all projections within a MoE block (gate/up/down × 256 experts + shared_expert) to share the same quantization scheme. Partial ignore triggers ValueError: All MoE projections need to have same quantization scheme but found multiple.

E (SpinQuant R1+R2) incompatible with multi-modal config: llm-compressor's get_head_dim only reads top-level config, not Qwen3.6's nested text_config.

Table
ItemValue
Methodllm-compressor (main) + compressed-tensors (main)
SchemeNVFP4 W4A4 (E2M1 + FP8 per-group scaling, group size 16)
Formatcompressed-tensors
Calibration datasetsHuggingFaceH4/ultrachat_200k (128) + nvidia/Nemotron-Post-Training-Dataset-v2 (128)
Calibration samples (total)256
Calibration sequence length4096
MoE calibrationmoe_calibrate_all_experts=True (via PR #2383)
HardwareNVIDIA DGX Spark (GB10, 128GB unified memory)
Environmenttransformers>=5.0,<6 + llm-compressor main + PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

Layers Preserved in BF16

Table
Layer patternReason
re:.*lm_headOutput head, sensitive to quantization noise
re:.*embed_tokens$Input embeddings
re:visual.* / re:model.visual.*Vision encoder
re:.*mlp.gate$MoE router gate (routing decision, must stay BF16)
re:.*shared_expert_gate$Shared expert routing gate
re:.*linear_attn.*GDN/DeltaNet (Mamba) layers — may output zeros if quantized
mtp.* (all MTP weights)Reattached in BF16 via save_mtp_tensors_to_checkpoint after quantization

Speculative Decoding

This model supports two speculative decoding methods:

DFlash (separate drafter, recommended for single-user / low-concurrency):

bash

--speculative-config '{"method": "dflash", "model": "z-lab/Qwen3.6-35B-A3B-DFlash", "num_speculative_tokens": 15}'

MTP (built-in weights, recommended default for this abliterated variant — see warning at top):

bash

--speculative-config '{"method": "mtp", "num_speculative_tokens": 1}'

Note: On hybrid GDN architectures, MTP may hit a state-rollback bug (vLLM #39273) during high token-rejection rates. The num_speculative_tokens: 1 setting also reduces exposure to this issue.

Serving with vLLM

bash

vllm serve /path/to/model \
--quantization compressed-tensors \
--served-model-name qwen3.6-35b-a3b \
--reasoning-parser qwen3 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_xml \
--attention-backend flash_attn \
--kv-cache-dtype auto \
--gpu-memory-utilization 0.80 \
--max-model-len 65536 \
--max-num-batched-tokens 16384 \
--max-num-seqs 64 \
--enable-prefix-caching \
--enable-chunked-prefill \
--performance-mode throughput \
--speculative-config '{"method":"dflash","model":"/models/Qwen3.6-35B-A3B-DFlash","num_speculative_tokens":15}' \
--trust-remote-code \
--language-model-only

DGX Spark (SM121) Compatibility Notes

  • Native W4A4 confirmed via FlashInfer CUTLASS NVFP4 MoE backend (no more W4A16 fallback)
  • Verify in logs: Using FlashInferCutlassNvFp4LinearKernel for NVFP4 GEMM + Using 'FLASHINFER_CUTLASS' NvFp4 MoE backend
  • FP8 KV cache is not compatible with GDN non-causal attention; use --kv-cache-dtype auto
  • DFlash requires --attention-backend flash_attn (flashinfer backend + DFlash is incompatible)
  • --language-model-only skips vision encoder profiling for text-only inference
  • Clear page cache before starting on UMA: sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'

Known Limitations

  • Per-tensor global scales for fused q_proj / k_proj / v_proj may differ, causing a vLLM warning at load time. This is inherent to the llm-compressor per-layer quantization behavior; the impact on accuracy is typically small but measurable on strict tool-calling JSON schemas.
  • DFlash drafter was trained on the original Qwen3.6-35B-A3B, not the abliterated variant — acceptance rate may be lower than on the original model.

Safety Warning

This model has safety filtering removed (abliterated) and may generate sensitive, controversial, or inappropriate content. Users are solely responsible for all consequences arising from its use. Please ensure usage complies with local laws and ethical standards. Not suitable for public-facing or production applications.

Credits


繁體中文

[!TIP] 2026-04-21 量化上傳,使用 llm-compressor 搭配混合領域校準與敏感層保護,最大化精度保留。

[!IMPORTANT] DGX Spark (SM121) 原生 W4A4 — 已驗證可用

不同於早期 SM121 的 NVFP4 模型,此 checkpoint 透過 FlashInfer CUTLASS NVFP4 MoE kernel 跑真 W4A4(vLLM log 可見 FlashInferCutlassNvFp4LinearKernel + FLASHINFER_CUTLASS NvFp4 MoE backend)。需要:

  • vLLM 0.19.1rc1.dev374+g1174723eb 以上(含 PR #37725 arch-suffix 修復)
  • FlashInfer ≥ 0.6.8 帶 SM120f 編譯(PR #2650
  • CUDA ≥ 12.9

支援 MTP 投機解碼 — MTP 層以 BF16 從原 checkpoint 透過 save_mtp_tensors_to_checkpoint 保留。

[!WARNING] Abliteration 會改變投機解碼的最佳設定 — 這是已知取捨,非 bug。

本版本的特色是混合領域校準ultrachat_200k 對話 + Nemotron-Post-Training-Dataset-v2 推理,共 256 樣本)。校準能恢復量化精度,但無法逆轉 abliteration 在上游造成的分佈偏移 — DFlash drafter 是以原版 Qwen3.6-35B-A3B 權重訓練,abliterated 後的殘差分佈已不再符合 drafter 的先驗,接受率因此下降。

DGX Spark 實測吞吐:

  • DFlash(num_speculative_tokens: 15 — 約 50 t/s,偶爾飆至 ~100 t/s
  • MTP(num_speculative_tokens: 1 — 穩定約 40 t/s,偶爾飆至 ~70 t/s

反直覺地,MTP 搭配單一投機 token 在此 abliterated 變體上表現優於 DFlash — MTP 沿用模型自身的 hidden state,與混合領域校準所針對的 abliterated 分佈保持一致。建議預設使用 --speculative-config '{"method":"mtp","num_speculative_tokens":1}',僅在特殊需求時才改用 DFlash。

huihui-ai/Huihui-Qwen3.6-35B-A3B-abliterated 的 NVFP4 W4A4 量化版,針對 NVIDIA DGX Spark (GB10 SM121) 最佳化,使用 FlashInfer CUTLASS FP4 MoE kernel。

模型資訊

Table
項目數值
架構MoE(35B 總參數, 3B 活躍, 256 experts / 8 routed + 1 shared)+ GDN (Mamba) + Attention 混合
基礎模型Qwen/Qwen3.6-35B-A3B
微調者huihui-ai(abliteration)
量化者YuYu1015
模型大小~25.1 GB(NVFP4,原版 BF16 約 71.9 GB)
Context 長度最高 262,144 tokens
思考模式支援(enable_thinking: true/false
工具呼叫支援(qwen3_xml parser)
MTP內建 MTP 權重(保留 BF16)
DFlash相容 z-lab/Qwen3.6-35B-A3B-DFlash

量化詳情

此模型在 RedHatAI 官方流程上堆疊三項策略(ACD)

Table
策略說明
A. RedHatAI 官方基線Qwen3_5MoeForConditionalGeneration + save_mtp_tensors_to_checkpoint(解 Qwen3.6 OOM、保留 MTP)
C. 混合領域校準ultrachat_200k(128 對話)+ Nemotron-Post-Training-Dataset-v2(128 推理)共 256
D. 黃金比例參數num_calibration_samples=256max_seq_length=4096(品質 > 數量)

B 策略(最後層保護)與 vLLM fused MoE 不相容:vLLM 的 CompressedTensorsMoEMethod 要求 MoE block 內所有 projection(gate/up/down × 256 experts + shared_expert)必須同 scheme。Partial ignore 會觸發 ValueError: All MoE projections need to have same quantization scheme but found multiple

E 策略(SpinQuant R1+R2)與 multi-modal config 不相容:llm-compressor 的 get_head_dim 只讀頂層 config,不讀 Qwen3.6 巢狀的 text_config

Table
項目數值
方法llm-compressor(main)+ compressed-tensors(main)
方案NVFP4 W4A4(E2M1 + FP8 逐群縮放,群組大小 16)
格式compressed-tensors
校準資料集HuggingFaceH4/ultrachat_200k(128)+ nvidia/Nemotron-Post-Training-Dataset-v2(128)
校準樣本總數256
校準序列長度4096
MoE 校準moe_calibrate_all_experts=True(透過 PR #2383
量化硬體NVIDIA DGX Spark(GB10, 128GB 統一記憶體)
環境transformers>=5.0,<6 + llm-compressor main + PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True

保留 BF16 的層

Table
層 pattern原因
re:.*lm_head輸出頭,對量化雜訊敏感
re:.*embed_tokens$輸入嵌入
re:visual.* / re:model.visual.*視覺編碼器
re:.*mlp.gate$MoE 路由門(routing 決策必須 BF16)
re:.*shared_expert_gate$共享專家路由門
re:.*linear_attn.*GDN/DeltaNet (Mamba) 層 — 量化後可能輸出零
mtp.*(所有 MTP 權重)量化後透過 save_mtp_tensors_to_checkpoint 以 BF16 重新掛回

投機解碼

本模型支援兩種投機解碼方式:

DFlash(獨立 drafter,建議單用戶 / 低併發):

bash

--speculative-config '{"method": "dflash", "model": "z-lab/Qwen3.6-35B-A3B-DFlash", "num_speculative_tokens": 15}'

MTP(內建權重,此 abliterated 變體的建議預設 — 詳見頂部警告):

bash

--speculative-config '{"method": "mtp", "num_speculative_tokens": 1}'

注意:混合 GDN 架構下 MTP 可能觸發 state-rollback bug(vLLM #39273),高 rejection rate 時輸出可能退化。num_speculative_tokens: 1 也能降低觸發機率。

vLLM 部署

bash

vllm serve /path/to/model \
--quantization compressed-tensors \
--served-model-name qwen3.6-35b-a3b \
--reasoning-parser qwen3 \
--enable-auto-tool-choice \
--tool-call-parser qwen3_xml \
--attention-backend flash_attn \
--kv-cache-dtype auto \
--gpu-memory-utilization 0.80 \
--max-model-len 65536 \
--max-num-batched-tokens 16384 \
--max-num-seqs 64 \
--enable-prefix-caching \
--enable-chunked-prefill \
--performance-mode throughput \
--speculative-config '{"method":"dflash","model":"/models/Qwen3.6-35B-A3B-DFlash","num_speculative_tokens":15}' \
--trust-remote-code \
--language-model-only

DGX Spark (SM121) 相容性說明

  • 原生 W4A4 已確認 透過 FlashInfer CUTLASS NVFP4 MoE backend(不再退回 W4A16)
  • log 驗證:Using FlashInferCutlassNvFp4LinearKernel for NVFP4 GEMM + Using 'FLASHINFER_CUTLASS' NvFp4 MoE backend
  • FP8 KV cache 與 GDN non-causal attention 不相容,請使用 --kv-cache-dtype auto
  • DFlash 需搭配 --attention-backend flash_attn(flashinfer backend + DFlash 不相容)
  • --language-model-only 跳過視覺編碼器 profiling,加速純文字推理啟動
  • UMA 架構啟動前請先清除 page cache:sudo sh -c 'echo 3 > /proc/sys/vm/drop_caches'

已知限制

  • Fused q_proj / k_proj / v_proj 的 per-tensor global scale 可能不一致,vLLM 載入時會印警告。這是 llm-compressor per-layer 量化的固有行為,一般精度影響輕微,但在嚴格 tool-calling JSON schema 下可能可測得。
  • DFlash drafter 是以原版 Qwen3.6-35B-A3B 訓練,非 abliterated 變體 — 接受率可能較原版低。

安全警告

此模型已移除安全過濾機制(abliterated),可能產生敏感、爭議性或不當內容。使用者須自行承擔所有風險與法律責任,並確保使用方式符合當地法規與倫理標準。不適用於公開或生產環境。

致謝

Model provider

YuYu1015

Model tree

Base

huihui-ai/Huihui-Qwen3.6-35B-A3B-abliterated

Quantized

this model

Modalities

Input

Video, Text, Image

Output

Text

Pricing

Dedicated Endpoints

View details

Supported Functionality

Model APIs

Dedicated Endpoints

Container

More information

Explore FriendliAI today