sakamakismile
Huihui-Nex-N2-mini-abliterated-MTP-NVFP4
License: apache-2.0

What it is

| Property | Value |
|---|---|
| Architecture | Qwen3_5MoeForConditionalGeneration (qwen3_5_moe) — Qwen3.5-VL-MoE family |
| Total / active | 35B total · ~3B active (256 experts, top-8, + shared expert) |
| Attention | hybrid: linear-attention (GatedDeltaNet-style SSM) ×3 → full-attention every 4th layer (40 layers) |
| Vision | Qwen3-VL ViT (depth 27, patch 16, spatial-merge 2) — kept BF16 |
| MTP | native multi-token-prediction draft (mtp_num_hidden_layers=1), shipped in MTP/ (BF16, ~1.69 GB) |
| Context | 262,144 positions (mRoPE) |
| Quantization | NVFP4 nvfp4-pack-quantized, W4A4, group size 16, FP8-E4M3 scales |
| What's quantized | all language-model Linear layers incl. 30,720 expert projections; lm_head / visual.* / MoE router (mlp.gate, mlp.shared_expert_gate) / norms / conv kept BF16 |
| Size | ~23.6 GB (model.safetensors + MTP/) |
| Nature | abliterated / uncensored · reasoning model (emits <think>…</think>) |
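The counts in the table can be cross-checked with a little arithmetic. The sketch below assumes each expert carries three Linear projections (gate, up, down), which the card does not state explicitly but which reproduces the quoted 30,720 figure; it also assumes "every 4th layer" means exactly one full-attention layer per group of four.

```python
# Sanity-check the layer and expert counts quoted in the table.
NUM_LAYERS = 40
NUM_EXPERTS = 256
PROJECTIONS_PER_EXPERT = 3   # assumption: gate_proj, up_proj, down_proj
FULL_ATTENTION_PERIOD = 4    # one full-attention layer per group of four

expert_projections = NUM_LAYERS * NUM_EXPERTS * PROJECTIONS_PER_EXPERT
full_attention_layers = NUM_LAYERS // FULL_ATTENTION_PERIOD
linear_attention_layers = NUM_LAYERS - full_attention_layers

print(expert_projections)                               # 30720
print(full_attention_layers, linear_attention_layers)   # 10 30
```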
Serving with vLLM
Requires vLLM 0.22.x (native qwen3_5_moe + qwen3_5_mtp; compressed-tensors NVFP4 auto-detected — no --quantization flag) and a Blackwell (SM120) GPU. TP=4 (4× 16 GB) is the floor — the ~9.9 GB/GPU weights plus the MoE NVFP4 GEMM workspace do not fit TP=2 on 16 GB cards.
```bash
# 4× 16 GB Blackwell. On a box WITHOUT NVLink/P2P the NCCL flag + --disable-custom-all-reduce
# are MANDATORY (else it hangs at the first all-reduce). MTP is auto-detected from MTP/;
# do NOT pass a model path in --speculative-config.
NCCL_P2P_DISABLE=1 \
vllm serve sakamakismile/Huihui-Nex-N2-mini-abliterated-MTP-NVFP4 \
  --tensor-parallel-size 4 \
  --disable-custom-all-reduce \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.87 \
  --kv-cache-dtype fp8 \
  --speculative-config '{"method":"qwen3_5_mtp","num_speculative_tokens":3}' \
  --reasoning-parser qwen3 \
  --port 8000
```
Benchmarks
Measured on 4× RTX PRO 2000 Blackwell 16 GB (vLLM 0.22.0, TP=4, MTP n=3, KV fp8, greedy):
| Metric | Result |
|---|---|
| Single-stream decode | 69 tok/s |
| Aggregate throughput @ 4 concurrent | 151 tok/s |
| Aggregate throughput @ 8 concurrent | 287 tok/s |
| HumanEval pass@1 | 83.5% (137/164) |
KV pool: ~798k tokens (97× concurrency at 8,192 ctx), ~14.4 GB/GPU. Bilingual (JA/EN) output is coherent and arithmetic is correct, with no W4A4 collapse. 83.5% pass@1 is strong for a 35B-A3B model with only ~3B active parameters. HumanEval was run via chat with enable_thinking=false; in thinking mode the model scores comparably (~80%) but spends far more tokens and occasionally over-thinks short problems. The language weights are byte-identical to the text-only sibling, so these results apply to both.
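As a quick check, the derived figures follow from the raw numbers reported above. This is only arithmetic on the published values; the KV pool size is quoted as "~798k", so 798,000 is an approximation.

```python
# Reproduce the derived benchmark figures from the raw numbers in this section.
passed, total = 137, 164
pass_at_1 = 100 * passed / total               # HumanEval pass@1 in percent

kv_pool_tokens = 798_000                       # "~798k" in the card, approximate
max_model_len = 8192
full_context_slots = kv_pool_tokens / max_model_len

per_stream_at_4 = 151 / 4                      # tok/s per request at 4 concurrent
per_stream_at_8 = 287 / 8                      # tok/s per request at 8 concurrent

print(f"pass@1: {pass_at_1:.1f}%")
print(f"full-context sequences that fit in the KV pool: {full_context_slots:.1f}")
print(f"per-stream tok/s at 4 and 8 concurrent: {per_stream_at_4:.1f}, {per_stream_at_8:.1f}")
```

Per-request speed drops from the 69 tok/s single-stream figure as concurrency rises, while aggregate throughput keeps growing.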
- It is a reasoning model: give it generous `max_tokens` (even short answers spend hundreds of tokens in `<think>`), and use `--reasoning-parser qwen3` to split `reasoning_content` from `content`.
- For text-only use on this VL checkpoint, add `--limit-mm-per-prompt '{"image":0,"video":0}'`. To use images, drop that flag.
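On the client side, the split between `reasoning_content` and `content` can be read straight off the chat-completion response. The following is a minimal sketch using a hand-written mock response; the response shape is assumed to follow the OpenAI-compatible chat format, and the helper name `split_reasoning` is hypothetical.

```python
def split_reasoning(response):
    """Return (reasoning, answer, truncated) from a chat-completion response dict."""
    choice = response["choices"][0]
    message = choice["message"]
    reasoning = message.get("reasoning_content") or ""
    answer = message.get("content") or ""
    # finish_reason == "length" means max_tokens ran out, possibly inside <think>.
    truncated = choice.get("finish_reason") == "length"
    return reasoning, answer, truncated


# Hand-written mock response, not real model output.
mock_response = {
    "choices": [
        {
            "finish_reason": "stop",
            "message": {
                "role": "assistant",
                "reasoning_content": "17 * 23 = 17 * 20 + 17 * 3 = 340 + 51.",
                "content": "391",
            },
        }
    ]
}

reasoning, answer, truncated = split_reasoning(mock_response)
print(answer, truncated)
```

If `truncated` is true and `answer` is empty, the token budget was most likely consumed by the reasoning block, which is the case the first bullet warns about.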
Recommended usage: persona + thinking control
This is a reasoning model with a per-request thinking switch, and it responds strongly to a system-prompt persona. We benchmarked both levers on a 16-task verifiable battery (multi-step math, logic traps, instruction-following, code, JSON extraction, JP character-counting). The combination is what to ship:
| Configuration | Score | Relative speed |
|---|---|---|
| non-thinking, no system prompt | 13/16 | fastest |
| non-thinking + persona | 14/16 | fast — recommended default |
| thinking, no system prompt | 15/16 | slow |
| thinking + persona | 16/16 | slowest |
Recommended default — non-thinking + a careful-generalist persona. Fast and cheap, and it cleanly handles math, logic, code, JSON and bilingual writing:
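```json
{
  "messages": [
    {"role": "system", "content": "あなたは自己内省的で慎重な性格かつ、あらゆる分野に精通した回答ができる汎用人工知性体です。"},
    {"role": "user", "content": "..."}
  ],
  "chat_template_kwargs": {"enable_thinking": false}
}
```

The system prompt is kept in Japanese because that is the persona that was benchmarked. In English it reads roughly: "You are a general-purpose artificial intelligence with a self-reflective, cautious character, able to give well-informed answers in every field."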
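A small helper can assemble this request body so the persona and the thinking flag are set in one place. This is a sketch under assumptions: the server is taken to be the vLLM OpenAI-compatible endpoint on `localhost:8000`, the helper names `build_request` and `send` are hypothetical, and the `max_tokens` default of 2048 is an illustrative choice rather than a value from the card.

```python
import json
import urllib.request

MODEL = "sakamakismile/Huihui-Nex-N2-mini-abliterated-MTP-NVFP4"
PERSONA = "あなたは自己内省的で慎重な性格かつ、あらゆる分野に精通した回答ができる汎用人工知性体です。"


def build_request(user_text, enable_thinking=False, max_tokens=2048):
    """Build a chat-completions body with the persona and thinking flag set."""
    return {
        "model": MODEL,
        "messages": [
            {"role": "system", "content": PERSONA},
            {"role": "user", "content": user_text},
        ],
        "max_tokens": max_tokens,  # illustrative default, not from the card
        "chat_template_kwargs": {"enable_thinking": enable_thinking},
    }


def send(payload, base_url="http://localhost:8000"):
    """POST the payload to the server (assumed URL); not called in this sketch."""
    request = urllib.request.Request(
        f"{base_url}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request) as response:
        return json.load(response)


payload = build_request("What is 17 * 23?")
print(json.dumps(payload, ensure_ascii=False, indent=2))
```

Calling `send(payload)` requires a running server; the sketch only prints the body so it can be inspected offline.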
Thinking control
- Thinking is governed by `chat_template_kwargs.enable_thinking`: omit it (or `true`) to reason, `false` to answer directly. Default (no flag) = thinking on, and the `<think>…</think>` block is auto-stripped from `content`, so you always get a clean answer.
- The persona lifts accuracy in both modes (it recovered borderline code-gen and tightened outputs), so keep it on regardless of the thinking flag.
- Escalate to `enable_thinking: true` only for hidden-enumeration-under-terse-output tasks, e.g. "count the X and reply with only the number" or exact word/character counts. Non-thinking has no hidden scratchpad: when "be terse" and "count carefully" conflict, terseness wins and the count drifts. Thinking counts in the hidden block, then prints the clean answer → those tasks go 14/16 → 16/16.
- Even without flipping the flag, asking the model to show its work ("list each item, then count") recovers most counting tasks in non-thinking mode, since the reasoning then happens in the visible answer.
- This chat template has no `/think` / `/no_think` keyword parsing; drive thinking via the `enable_thinking` flag (or an app-layer router that maps user intent → the flag).
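The app-layer router mentioned in the last bullet could be as simple as a keyword heuristic. The sketch below is purely illustrative: the patterns, the function name `enable_thinking_for`, and the English-only matching are assumptions, and a production router would need to cover Japanese prompts and other phrasings.

```python
import re

# Hypothetical heuristics: a counting request combined with a demand for terse output.
COUNTING = re.compile(r"\b(count|how many|number of)\b", re.IGNORECASE)
TERSE = re.compile(
    r"\b(only the number|just the number|reply with only|answer only)\b",
    re.IGNORECASE,
)


def enable_thinking_for(prompt):
    """Return True only when the prompt asks for a count AND a terse answer."""
    return bool(COUNTING.search(prompt) and TERSE.search(prompt))


prompts = [
    "Count the vowels in 'banana' and reply with only the number.",
    "How many r's are in strawberry? List each one, then count.",
    "Summarize this paragraph in two sentences.",
]
for prompt in prompts:
    print(enable_thinking_for(prompt), "-", prompt)
```

The second prompt stays in non-thinking mode because it asks the model to show its work, which is the cheaper fix described above.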
Quantization recipe
llm-compressor one-shot, QuantizationModifier(targets="Linear", scheme="NVFP4", ignore=["lm_head","re:.*visual.*","re:.*mlp.gate$","re:.*mlp.shared_expert_gate$"]), 32 calibration samples (neuralmagic/calibration, seq 8192). The full model is loaded via AutoModelForImageTextToText so the vision tower is present and preserved in BF16; the per-expert nn.Linear projections are packed to NVFP4. The native MTP draft is carried over verbatim in MTP/.
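To see which modules the `ignore` list leaves in BF16, the patterns can be tested against module names. This sketch assumes that `re:`-prefixed entries are applied with `re.match` against the full module name and that plain entries match by exact name; the module names used here are hypothetical examples, not read from the checkpoint.

```python
import re

IGNORE = [
    "lm_head",
    "re:.*visual.*",
    "re:.*mlp.gate$",
    "re:.*mlp.shared_expert_gate$",
]


def is_ignored(module_name, patterns=IGNORE):
    """Return True if the module would be skipped (kept in BF16)."""
    for pattern in patterns:
        if pattern.startswith("re:"):
            if re.match(pattern[3:], module_name):
                return True
        elif pattern == module_name:
            return True
    return False


# Hypothetical module names for illustration only.
examples = [
    "lm_head",
    "model.visual.blocks.0.attn.qkv",
    "model.language_model.layers.0.mlp.gate",
    "model.language_model.layers.0.mlp.shared_expert_gate",
    "model.language_model.layers.0.mlp.experts.5.gate_proj",
    "model.language_model.layers.0.mlp.shared_expert.down_proj",
]
for name in examples:
    print("BF16 " if is_ignored(name) else "NVFP4", name)
```

The `$` anchors matter: without them, `.*mlp.gate` would also match per-expert `gate_proj` layers and exclude them from quantization.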
License
Inherits apache-2.0 from the upstream models. Abliterated/uncensored: you are responsible for how you use it.
Credits
- Original model: nex-agi/Nex-N2-mini
- Abliteration (uncensoring): huihui-ai/Huihui-Nex-N2-mini-abliterated
- NVFP4 quantization & benchmarks: Lna-Lab · Tooling: llm-compressor / compressed-tensors / vLLM
Support the Base Model Author (huihui-ai)
If you find the abliterated base useful, please support huihui-ai — this repo only adds the FP4 quantization; the abliteration work is theirs:
- Ko-fi: https://ko-fi.com/huihuiai
- Bitcoin: `bc1qqnkhuchxw0zqjh2ku3lu4hq45hc6gy84uk70ge`