sakamakismile/Huihui-Nex-N2-mini-abliterated-MTP-NVFP4 API & Inference Endpoint

What it is

| Property | Value |
|---|---|
| Architecture | `Qwen3_5MoeForConditionalGeneration` (`qwen3_5_moe`), Qwen3.5-VL-MoE family |
| Total / active | 35B total · ~3B active (256 experts, top-8, plus shared expert) |
| Attention | hybrid: linear attention (GatedDeltaNet-style SSM) ×3, then full attention every 4th layer (40 layers) |
| Vision | Qwen3-VL ViT (depth 27, patch 16, spatial-merge 2), kept in BF16 |
| MTP | native multi-token-prediction draft (`mtp_num_hidden_layers=1`), shipped in `MTP/` (BF16, ~1.69 GB) |
| Context | 262,144 positions (mRoPE) |
| Quantization | NVFP4 `nvfp4-pack-quantized`, W4A4, group size 16, FP8-E4M3 scales |
| What's quantized | all language-model Linear layers incl. 30,720 expert projections; `lm_head` / `visual.*` / MoE router (`mlp.gate`, `mlp.shared_expert_gate`) / norms / conv kept BF16 |
| Size | ~23.6 GB (`model.safetensors` + `MTP/`) |
| Nature | abliterated / uncensored · reasoning model (emits `<think>…</think>`) |
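As a sanity check on the figures in the table, the layer layout and the expert-projection count can be reproduced with a few lines of arithmetic. This is only a sketch. The table does not state that each expert carries three Linear projections (gate, up, down); that is an assumption, though it is consistent with the quoted 30,720. The position of the full-attention layer within each group of four is likewise assumed.

```python
# Sketch: reproduce the layer layout and expert-projection count from the table.
# ASSUMPTIONS: 3 Linear projections per expert (gate/up/down), and the 4th layer
# of each group of four being the full-attention one. Neither is spelled out above.

NUM_LAYERS = 40
FULL_ATTENTION_EVERY = 4
NUM_EXPERTS = 256
PROJECTIONS_PER_EXPERT = 3  # assumed: gate_proj, up_proj, down_proj


def layer_types(num_layers: int = NUM_LAYERS, every: int = FULL_ATTENTION_EVERY) -> list[str]:
    """Return 'linear' or 'full' for each 0-based layer index."""
    return ["full" if (i + 1) % every == 0 else "linear" for i in range(num_layers)]


def expert_projection_count(
    num_layers: int = NUM_LAYERS,
    num_experts: int = NUM_EXPERTS,
    per_expert: int = PROJECTIONS_PER_EXPERT,
) -> int:
    """Number of per-expert Linear layers across the whole language model."""
    return num_layers * num_experts * per_expert


if __name__ == "__main__":
    types = layer_types()
    print(types[:4])                                   # first group of four layers
    print(types.count("linear"), types.count("full"))  # 30 linear, 10 full
    print(expert_projection_count())                   # 30720
```

Under these assumptions the model has 30 linear-attention layers and 10 full-attention layers, and 40 × 256 × 3 gives the 30,720 expert projections listed above.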

Serving with vLLM

Requires vLLM 0.22.x and Blackwell (SM120) GPUs. That vLLM release supports qwen3_5_moe and qwen3_5_mtp natively and auto-detects compressed-tensors NVFP4, so no --quantization flag is needed. TP=4 (4× 16 GB) is the floor: the ~9.9 GB/GPU weights plus the MoE NVFP4 GEMM workspace do not fit at TP=2 on 16 GB cards.

```bash
# 4× 16 GB Blackwell. On a box WITHOUT NVLink/P2P the NCCL flag + --disable-custom-all-reduce
# are MANDATORY (else it hangs at the first all-reduce). MTP is auto-detected from MTP/, so
# do NOT pass a model path in --speculative-config.
NCCL_P2P_DISABLE=1 \
vllm serve sakamakismile/Huihui-Nex-N2-mini-abliterated-MTP-NVFP4 \
  --tensor-parallel-size 4 \
  --disable-custom-all-reduce \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.87 \
  --kv-cache-dtype fp8 \
  --speculative-config '{"method":"qwen3_5_mtp","num_speculative_tokens":3}' \
  --reasoning-parser qwen3 \
  --port 8000
```
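When the launch is scripted, the JSON passed to --speculative-config is easy to mis-quote. The sketch below assembles the same command from Python using only the flags shown above. `build_serve_command` and its `has_p2p` parameter are hypothetical helper names, not part of vLLM. Dropping the two P2P workarounds when P2P is available is also an assumption, since the text above only says they are mandatory without it.

```python
import json
import shlex

MODEL = "sakamakismile/Huihui-Nex-N2-mini-abliterated-MTP-NVFP4"


def build_serve_command(has_p2p: bool = False, port: int = 8000) -> tuple[dict, list[str]]:
    """Return (extra_env, argv) mirroring the bash example above.

    has_p2p is a hypothetical switch: False means no NVLink/P2P, so the
    NCCL variable and --disable-custom-all-reduce are both added.
    """
    spec = {"method": "qwen3_5_mtp", "num_speculative_tokens": 3}
    argv = [
        "vllm", "serve", MODEL,
        "--tensor-parallel-size", "4",
        "--max-model-len", "8192",
        "--gpu-memory-utilization", "0.87",
        "--kv-cache-dtype", "fp8",
        # No model path in the config: MTP is auto-detected from MTP/.
        "--speculative-config", json.dumps(spec, separators=(",", ":")),
        "--reasoning-parser", "qwen3",
        "--port", str(port),
    ]
    env: dict = {}
    if not has_p2p:
        env["NCCL_P2P_DISABLE"] = "1"
        argv.append("--disable-custom-all-reduce")
    return env, argv


if __name__ == "__main__":
    env, argv = build_serve_command(has_p2p=False)
    prefix = " ".join(f"{key}={value}" for key, value in env.items())
    # shlex.join quotes the JSON argument so it survives the shell intact.
    print(prefix, shlex.join(argv))
```

Passing `argv` as a list to `subprocess.run` (with `env` merged into `os.environ`) avoids shell quoting altogether; the printed form is only for copy-pasting.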

Benchmarks

Measured on 4× RTX PRO 2000 Blackwell 16 GB (vLLM 0.22.0, TP=4, MTP n=3, KV fp8, greedy):

| Metric | Result |
|---|---|
| Single-stream decode | 69 tok/s |
| Aggregate throughput @ 4 concurrent | 151 tok/s |
| Aggregate throughput @ 8 concurrent | 287 tok/s |
| HumanEval pass@1 | 83.5% (137/164) |

KV pool: ~798k tokens (97× concurrency at 8,192 context), ~14.4 GB/GPU. Bilingual (JA/EN) output is coherent and arithmetic is correct, with no sign of W4A4 collapse.

83.5% pass@1 is a strong result for a 35B-A3B model with only ~3B active parameters. HumanEval was run via chat with enable_thinking=false. In thinking mode the model scores comparably (~80%), but it spends far more tokens and occasionally over-thinks short problems. The language weights are byte-identical to the text-only sibling, so these results apply to both.
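The derived figures above follow directly from the raw numbers. A quick check, using only values quoted in this section (the ~798k pool size is approximate, so the results are too):

```python
# Sketch: re-derive the quoted ratios from the raw benchmark numbers.

KV_POOL_TOKENS = 798_000   # "~798k tokens" (approximate)
MAX_MODEL_LEN = 8192       # --max-model-len from the serve command
HUMANEVAL_PASSED = 137
HUMANEVAL_TOTAL = 164


def full_context_concurrency(pool_tokens: int, ctx_len: int) -> float:
    """How many full-length sequences fit in the KV pool at once."""
    return pool_tokens / ctx_len


def pass_at_1(passed: int, total: int) -> float:
    """Pass rate as a percentage."""
    return 100.0 * passed / total


def per_stream_rate(aggregate_tps: float, streams: int) -> float:
    """Average tokens/s seen by each of `streams` concurrent requests."""
    return aggregate_tps / streams


if __name__ == "__main__":
    print(f"{full_context_concurrency(KV_POOL_TOKENS, MAX_MODEL_LEN):.1f}x concurrency")
    print(f"{pass_at_1(HUMANEVAL_PASSED, HUMANEVAL_TOTAL):.1f}% pass@1")
    for streams, aggregate in ((1, 69), (4, 151), (8, 287)):
        print(f"{streams} stream(s): {per_stream_rate(aggregate, streams):.1f} tok/s each")
```

The per-stream figure is not reported in the table. It is included because it shows what each client sees as concurrency rises.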

It is a reasoning model: give it generous max_tokens (even short answers spend hundreds of tokens in <think>), and use --reasoning-parser qwen3 to split reasoning_content from content.
For text-only use on this VL checkpoint, add --limit-mm-per-prompt '{"image":0,"video":0}'. To use images, drop that flag.
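On the client side, the split looks roughly like this. The sketch works on a plain dict shaped like an OpenAI-style chat message. The `reasoning_content` field name is taken from the text above; `split_reasoning` and the regex fallback for servers started without the parser are illustrative assumptions.

```python
import re

_THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)


def split_reasoning(message: dict) -> tuple[str, str]:
    """Return (reasoning, answer) from a chat-completion message dict."""
    content = message.get("content") or ""
    reasoning = message.get("reasoning_content")
    if reasoning is not None:
        # Server ran with --reasoning-parser: the two fields are already separate.
        return reasoning.strip(), content.strip()
    # Fallback (assumption): the parser was not enabled, so the think block
    # is still inline in content and has to be cut out here.
    match = _THINK_RE.search(content)
    if match is None:
        return "", content.strip()
    answer = content[:match.start()] + content[match.end():]
    return match.group(1).strip(), answer.strip()


if __name__ == "__main__":
    parsed = {"content": "42", "reasoning_content": "6 * 7 = 42"}
    inline = {"content": "<think>\n6 * 7 = 42\n</think>\n\n42"}
    print(split_reasoning(parsed))
    print(split_reasoning(inline))
```

Both calls return the same pair, so downstream code does not need to know how the server was started.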

Recommended usage: persona + thinking control

This is a reasoning model with a per-request thinking switch, and it responds strongly to a system-prompt persona. We benchmarked both levers on a 16-task verifiable battery (multi-step math, logic traps, instruction-following, code, JSON extraction, JP character-counting). The two levers stack, and using them together is what we recommend shipping:

| Configuration | Score | Relative speed |
|---|---|---|
| non-thinking, no system prompt | 13/16 | fastest |
| non-thinking + persona | 14/16 | fast (recommended default) |
| thinking, no system prompt | 15/16 | slow |
| thinking + persona | 16/16 | slowest |

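Recommended default: non-thinking plus a careful-generalist persona. It is fast and cheap, and it cleanly handles math, logic, code, JSON and bilingual writing. The system prompt below is in Japanese; in English it reads roughly: "You are a general-purpose artificial intelligence with an introspective, careful personality, able to give expert answers in every field."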

```json
{
  "messages": [
    {"role": "system", "content": "あなたは自己内省的で慎重な性格かつ、あらゆる分野に精通した回答ができる汎用人工知性体です。"},
    {"role": "user", "content": "..."}
  ],
  "chat_template_kwargs": {"enable_thinking": false}
}
```
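The same request can be sent from Python with only the standard library. This is a sketch, assuming the server from the vLLM section is listening on localhost:8000 and exposes the OpenAI-compatible /v1/chat/completions route. `build_payload` and `chat` are hypothetical helper names, and the `max_tokens` default is an arbitrary choice.

```python
import json
import urllib.request

MODEL = "sakamakismile/Huihui-Nex-N2-mini-abliterated-MTP-NVFP4"
PERSONA = "あなたは自己内省的で慎重な性格かつ、あらゆる分野に精通した回答ができる汎用人工知性体です。"


def build_payload(user_text: str, thinking: bool = False, max_tokens: int = 2048) -> dict:
    """Build the request body: persona as system prompt, thinking flag per request."""
    return {
        "model": MODEL,
        "messages": [
            {"role": "system", "content": PERSONA},
            {"role": "user", "content": user_text},
        ],
        "max_tokens": max_tokens,  # arbitrary default; raise it when thinking is on
        "chat_template_kwargs": {"enable_thinking": thinking},
    }


def chat(user_text: str, thinking: bool = False, base_url: str = "http://localhost:8000") -> str:
    """POST the payload and return the assistant's answer text."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(build_payload(user_text, thinking)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]


if __name__ == "__main__":
    # Requires a running server; build_payload() alone works offline.
    print(chat("Explain NVFP4 in two sentences."))
```

Keeping payload construction separate from the HTTP call makes the thinking flag easy to flip per request and easy to unit-test without a GPU.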

Thinking control

Thinking is governed by chat_template_kwargs.enable_thinking: omit it or set it to true to reason, set it to false to answer directly. With no flag, thinking is on by default and the <think>…</think> block is stripped from content automatically, so you always get a clean answer.
The persona lifts accuracy in both modes (it recovered borderline code generation and tightened outputs), so keep it on regardless of the thinking flag.
Escalate to enable_thinking: true only for tasks that need hidden enumeration under a terse-output constraint, e.g. "count the X and reply with only the number" or exact word/character counts. Non-thinking mode has no hidden scratchpad: when "be terse" and "count carefully" conflict, terseness wins and the count drifts. Thinking mode counts in the hidden block and then prints the clean answer, which takes the battery score from 14/16 to 16/16.
Even without flipping the flag, asking the model to show its work ("list each item, then count") recovers most counting tasks in non-thinking mode, since the reasoning then happens in the visible answer.
This chat template does not parse /think or /no_think keywords. Drive thinking via the enable_thinking flag, or via an app-layer router that maps user intent to the flag.
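A minimal sketch of such an app-layer router follows. The trigger patterns are illustrative assumptions, not something benchmarked in this card, and `needs_thinking` and `thinking_kwargs` are hypothetical names. It escalates only when a prompt both asks for a count and demands a terse reply, which is the case described above.

```python
import re

# ASSUMED heuristics: phrases that suggest enumeration, and phrases that demand terseness.
_COUNT_RE = re.compile(r"\b(count|how many|number of)\b|何文字|いくつ", re.IGNORECASE)
_TERSE_RE = re.compile(
    r"\b(only the number|just the number|number only|no explanation)\b|数字だけ",
    re.IGNORECASE,
)


def needs_thinking(prompt: str) -> bool:
    """True when the prompt asks for a count AND forbids showing the work."""
    return bool(_COUNT_RE.search(prompt)) and bool(_TERSE_RE.search(prompt))


def thinking_kwargs(prompt: str) -> dict:
    """Fragment to merge into the request body for this prompt."""
    return {"chat_template_kwargs": {"enable_thinking": needs_thinking(prompt)}}


if __name__ == "__main__":
    for prompt in (
        "Count the r's in 'strawberry' and reply with only the number.",
        "Count the r's in 'strawberry' and list each one.",
        "Summarize this article in three bullet points.",
    ):
        print(needs_thinking(prompt), "<-", prompt)
```

Keyword matching is crude. A production router would more likely use a small classifier or explicit UI controls, but the contract is the same: the router's only output is the enable_thinking flag.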

Quantization recipe

Quantized in one shot with llm-compressor: QuantizationModifier(targets="Linear", scheme="NVFP4", ignore=["lm_head","re:.*visual.*","re:.*mlp.gate$","re:.*mlp.shared_expert_gate$"]), calibrated on 32 samples (neuralmagic/calibration, sequence length 8192). The full model is loaded via AutoModelForImageTextToText so that the vision tower is present and preserved in BF16, while the per-expert nn.Linear projections are packed to NVFP4. The native MTP draft is carried over verbatim in MTP/.
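To see which modules the ignore list keeps in BF16, the patterns can be tried against a few module names. This sketch only emulates the matching convention, where a `re:` prefix marks a regular expression and anything else is an exact name. The module names are hypothetical examples. The real matching happens inside llm-compressor / compressed-tensors, and its exact behavior may differ.

```python
import re

# Ignore list copied from the recipe above.
IGNORE = ["lm_head", "re:.*visual.*", "re:.*mlp.gate$", "re:.*mlp.shared_expert_gate$"]


def is_ignored(module_name: str, ignore: list[str] = IGNORE) -> bool:
    """Emulated matching: 're:' entries are regexes, others are exact names."""
    for pattern in ignore:
        if pattern.startswith("re:"):
            if re.match(pattern[3:], module_name):
                return True
        elif pattern == module_name:
            return True
    return False


# Hypothetical module names, for illustration only.
EXAMPLES = [
    "lm_head",
    "model.visual.blocks.0.attn.qkv",
    "model.language_model.layers.0.mlp.gate",
    "model.language_model.layers.0.mlp.shared_expert_gate",
    "model.language_model.layers.0.mlp.experts.0.gate_proj",
    "model.language_model.layers.0.mlp.shared_expert.gate_proj",
]

if __name__ == "__main__":
    for name in EXAMPLES:
        status = "kept BF16" if is_ignored(name) else "quantized to NVFP4"
        print(f"{name}: {status}")
```

The `$` anchors matter: without them, `.*mlp.gate` would also match expert `gate_proj` layers and leave them unquantized.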

License

Inherits apache-2.0 from the upstream models. Abliterated/uncensored: you are responsible for how you use it.

Credits

Original model: nex-agi/Nex-N2-mini
Abliteration (uncensoring): huihui-ai/Huihui-Nex-N2-mini-abliterated
NVFP4 quantization & benchmarks: Lna-Lab · Tooling: llm-compressor / compressed-tensors / vLLM

Support the Base Model Author (huihui-ai)

If you find the abliterated base useful, please support huihui-ai — this repo only adds the FP4 quantization; the abliteration work is theirs:

Ko-fi: https://ko-fi.com/huihuiai
Bitcoin: bc1qqnkhuchxw0zqjh2ku3lu4hq45hc6gy84uk70ge
