sakamakismile

Huihui-Nex-N2-mini-abliterated-MTP-NVFP4

License: apache-2.0

What it is

| Property | Value |
| --- | --- |
| Architecture | Qwen3_5MoeForConditionalGeneration (qwen3_5_moe), Qwen3.5-VL-MoE family |
| Total / active | 35B total · ~3B active (256 experts, top-8, plus a shared expert) |
| Attention | hybrid: linear-attention (GatedDeltaNet-style SSM) ×3 → full-attention every 4th layer (40 layers) |
| Vision | Qwen3-VL ViT (depth 27, patch 16, spatial-merge 2), kept BF16 |
| MTP | native multi-token-prediction draft (mtp_num_hidden_layers=1), shipped in MTP/ (BF16, ~1.69 GB) |
| Context | 262,144 positions (mRoPE) |
| Quantization | NVFP4 nvfp4-pack-quantized, W4A4, group size 16, FP8-E4M3 scales |
| What's quantized | all language-model Linear layers incl. 30,720 expert projections; lm_head / visual.* / MoE router (mlp.gate, mlp.shared_expert_gate) / norms / conv kept BF16 |
| Size | ~23.6 GB (model.safetensors + MTP/) |
| Nature | abliterated / uncensored · reasoning model (emits <think>…</think>) |
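The layer and expert counts in the table can be cross-checked with a few lines of arithmetic. This is only a sketch: it assumes that full attention sits at every fourth layer position (the 4th, 8th, ...) and that each routed expert carries three Linear projections (gate, up, down), as in a typical gated MLP. Neither detail is spelled out in the table, although the second assumption does reproduce the 30,720 figure.

```python
# Sanity-check the architecture figures quoted in the table.
# Assumptions (not stated in the model card): full attention is placed at
# every 4th layer position, and each routed expert has 3 Linear projections.
NUM_LAYERS = 40
FULL_ATTENTION_INTERVAL = 4
NUM_EXPERTS = 256
PROJECTIONS_PER_EXPERT = 3  # assumed: gate_proj, up_proj, down_proj

layer_types = [
    "full_attention" if (i + 1) % FULL_ATTENTION_INTERVAL == 0 else "linear_attention"
    for i in range(NUM_LAYERS)
]
num_full = layer_types.count("full_attention")
num_linear = layer_types.count("linear_attention")
expert_projections = NUM_LAYERS * NUM_EXPERTS * PROJECTIONS_PER_EXPERT

print(num_linear, num_full, expert_projections)  # 30 10 30720
```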

Serving with vLLM

Requires vLLM 0.22.x (native qwen3_5_moe + qwen3_5_mtp; compressed-tensors NVFP4 auto-detected — no --quantization flag) and a Blackwell (SM120) GPU. TP=4 (4× 16 GB) is the floor — the ~9.9 GB/GPU weights plus the MoE NVFP4 GEMM workspace do not fit TP=2 on 16 GB cards.

```bash
# 4× 16 GB Blackwell. On a box WITHOUT NVLink/P2P, NCCL_P2P_DISABLE=1 and
# --disable-custom-all-reduce are MANDATORY (otherwise it hangs at the first all-reduce).
# MTP is auto-detected from MTP/; do NOT pass a model path in --speculative-config.
NCCL_P2P_DISABLE=1 \
vllm serve sakamakismile/Huihui-Nex-N2-mini-abliterated-MTP-NVFP4 \
  --tensor-parallel-size 4 \
  --disable-custom-all-reduce \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.87 \
  --kv-cache-dtype fp8 \
  --speculative-config '{"method":"qwen3_5_mtp","num_speculative_tokens":3}' \
  --reasoning-parser qwen3 \
  --port 8000
```

Benchmarks

Measured on 4× RTX PRO 2000 Blackwell 16 GB (vLLM 0.22.0, TP=4, MTP n=3, KV fp8, greedy):

| Metric | Result |
| --- | --- |
| Single-stream decode | 69 tok/s |
| Aggregate throughput @ 4 concurrent | 151 tok/s |
| Aggregate throughput @ 8 concurrent | 287 tok/s |
| HumanEval pass@1 | 83.5% (137/164) |

The KV pool holds ~798k tokens (about 97 concurrent sequences at the full 8,192-token context), with ~14.4 GB used per GPU. Bilingual (JA/EN) output is coherent and arithmetic is correct, with no sign of W4A4 collapse.

83.5% pass@1 is a strong result for a 35B-A3B model with only ~3B active parameters. HumanEval was run via chat with enable_thinking=false. In thinking mode the model scores comparably (~80%), but it spends far more tokens and occasionally over-thinks short problems. The language weights are byte-identical to the text-only sibling, so these results apply to both.
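The derived figures quoted with the benchmarks follow from simple arithmetic, sketched below. The KV pool size is rounded ("~798k"), so the concurrency figure is approximate. The per-stream rates are not in the model card; they are just the aggregate throughput divided by the number of concurrent requests.

```python
# Cross-check the derived benchmark figures.
KV_POOL_TOKENS = 798_000   # "~798k tokens", a rounded value
CONTEXT_LEN = 8192         # --max-model-len used for the benchmark

# How many full-length sequences fit in the KV pool at once.
full_context_slots = KV_POOL_TOKENS // CONTEXT_LEN

# HumanEval: 137 of 164 problems solved.
pass_at_1 = 137 / 164

# Average per-stream decode rate when the aggregate is shared by N requests.
aggregate_tok_s = {1: 69, 4: 151, 8: 287}
per_stream = {n: total / n for n, total in aggregate_tok_s.items()}

print(full_context_slots)   # 97
print(f"{pass_at_1:.1%}")   # 83.5%
print(per_stream)
```

Read this way, each individual stream slows down as concurrency rises while total throughput keeps growing, which is the usual batching trade-off.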

  • It is a reasoning model: give it generous max_tokens (even short answers spend hundreds of tokens in <think>), and use --reasoning-parser qwen3 to split reasoning_content from content.
  • For text-only use on this VL checkpoint, add --limit-mm-per-prompt '{"image":0,"video":0}'. To use images, drop that flag.

This is a reasoning model with a per-request thinking switch, and it responds strongly to a system-prompt persona. We benchmarked both levers on a 16-task verifiable battery (multi-step math, logic traps, instruction-following, code, JSON extraction, Japanese character-counting). Using the two together is what we recommend shipping:

| Configuration | Score | Relative speed |
| --- | --- | --- |
| non-thinking, no system prompt | 13/16 | fastest |
| non-thinking + persona | 14/16 | fast (recommended default) |
| thinking, no system prompt | 15/16 | slow |
| thinking + persona | 16/16 | slowest |

Recommended default — non-thinking + a careful-generalist persona. Fast and cheap, and it cleanly handles math, logic, code, JSON and bilingual writing:

```json
{
  "messages": [
    {"role": "system", "content": "あなたは自己内省的で慎重な性格かつ、あらゆる分野に精通した回答ができる汎用人工知性体です。"},
    {"role": "user", "content": "..."}
  ],
  "chat_template_kwargs": {"enable_thinking": false}
}
```
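A minimal client sketch for sending that request body from Python, using only the standard library. It assumes the server from the `vllm serve` command is reachable at `http://localhost:8000` and exposes vLLM's OpenAI-compatible `/v1/chat/completions` route. The helper names `build_payload` and `chat` and the `max_tokens` default of 2048 are illustrative choices, not part of the model card.

```python
import json
import urllib.request

# Assumed: the vLLM server is listening locally on port 8000.
BASE_URL = "http://localhost:8000/v1"
MODEL = "sakamakismile/Huihui-Nex-N2-mini-abliterated-MTP-NVFP4"
PERSONA = "あなたは自己内省的で慎重な性格かつ、あらゆる分野に精通した回答ができる汎用人工知性体です。"


def build_payload(user_text, thinking=False, max_tokens=2048):
    """Build a chat-completions body with the persona and the thinking switch."""
    return {
        "model": MODEL,
        "messages": [
            {"role": "system", "content": PERSONA},
            {"role": "user", "content": user_text},
        ],
        "max_tokens": max_tokens,  # illustrative default; raise it for thinking mode
        "chat_template_kwargs": {"enable_thinking": thinking},
    }


def chat(user_text, thinking=False):
    """POST the payload and return (reasoning, answer). Needs a running server."""
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_payload(user_text, thinking)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        message = json.load(resp)["choices"][0]["message"]
    # With --reasoning-parser, reasoning is returned separately from the answer.
    return message.get("reasoning_content"), message.get("content")


# Building the payload needs no server, so it can be inspected offline.
print(json.dumps(build_payload("Summarize NVFP4 in one sentence."), ensure_ascii=False, indent=2))
```

Calling `chat("...", thinking=True)` flips the same flag per request, so an application can keep the persona fixed and vary only the thinking switch.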

Thinking control

  • Thinking is governed by chat_template_kwargs.enable_thinking: omit it (or set it to true) to reason, or set it to false to answer directly. With no flag, thinking is on and the <think>…</think> block is auto-stripped from content, so you always get a clean answer.
  • The persona lifts accuracy in both modes (it recovered borderline code-gen and tightened outputs), so keep it on regardless of the thinking flag.
  • Escalate to enable_thinking: true only for hidden-enumeration-under-terse-output tasks, e.g. "count the X and reply with only the number" or exact word/character counts. Non-thinking mode has no hidden scratchpad: when "be terse" and "count carefully" conflict, terseness wins and the count drifts. Thinking mode counts in the hidden block and then prints the clean answer, which lifts the battery score from 14/16 to 16/16.
  • Even without flipping the flag, asking the model to show its work ("list each item, then count") recovers most counting tasks in non-thinking mode, since the reasoning then happens in the visible answer.
  • This chat template does not parse /think or /no_think keywords. Drive thinking via the enable_thinking flag (or an app-layer router that maps user intent → the flag).
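The app-layer router mentioned in the last bullet could look something like the sketch below. It is a deliberately crude heuristic under stated assumptions: the trigger patterns are hypothetical examples of "count something and reply tersely" prompts in English and Japanese, and a production router would need a far better intent signal.

```python
import re

# Hypothetical trigger patterns; tune or replace them for real traffic.
_COUNT = re.compile(r"\b(count|how many|number of)\b|何文字|何個|数えて", re.IGNORECASE)
_TERSE = re.compile(r"\b(only|just)\b.*\b(number|digit|count)\b|数字だけ|数だけ", re.IGNORECASE)


def wants_thinking(prompt: str) -> bool:
    """Return True when a prompt asks for a count AND a terse, number-only reply."""
    return bool(_COUNT.search(prompt)) and bool(_TERSE.search(prompt))


def template_kwargs(prompt: str) -> dict:
    """Map a user prompt to the chat_template_kwargs sent with the request."""
    return {"enable_thinking": wants_thinking(prompt)}


print(template_kwargs("Count the vowels in 'banana' and reply with only the number."))
print(template_kwargs("Write a haiku about rain."))
```

Everything that does not match stays on the fast non-thinking default, which mirrors the escalation rule in the bullets above.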

Quantization recipe

llm-compressor one-shot, QuantizationModifier(targets="Linear", scheme="NVFP4", ignore=["lm_head","re:.*visual.*","re:.*mlp.gate$","re:.*mlp.shared_expert_gate$"]), 32 calibration samples (neuralmagic/calibration, seq 8192). The full model is loaded via AutoModelForImageTextToText so the vision tower is present and preserved in BF16; the per-expert nn.Linear projections are packed to NVFP4. The native MTP draft is carried over verbatim in MTP/.
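To see which modules the ignore list in that recipe keeps in BF16, the sketch below approximates the name matching in plain Python: entries prefixed with `re:` are treated as regular expressions matched from the start of the module name, and other entries as exact names. This is an approximation for illustration, not llm-compressor's own code, and the module names are hypothetical examples rather than names read from the checkpoint.

```python
import re

# Ignore list copied from the recipe.
IGNORE = ["lm_head", "re:.*visual.*", "re:.*mlp.gate$", "re:.*mlp.shared_expert_gate$"]


def is_ignored(module_name: str, ignore=IGNORE) -> bool:
    """Approximate matching: 're:' entries are regexes, others are exact names."""
    for entry in ignore:
        if entry.startswith("re:"):
            if re.match(entry[3:], module_name):
                return True
        elif entry == module_name:
            return True
    return False


# Hypothetical module names, for illustration only.
examples = [
    "lm_head",
    "model.visual.blocks.0.attn.qkv",
    "model.language_model.layers.0.mlp.gate",
    "model.language_model.layers.0.mlp.shared_expert_gate",
    "model.language_model.layers.0.mlp.experts.5.gate_proj",
    "model.language_model.layers.3.self_attn.q_proj",
]
for name in examples:
    print(f"{name}: {'BF16 (ignored)' if is_ignored(name) else 'NVFP4'}")
```

The trailing `$` in the router patterns matters: it keeps `mlp.gate` from also matching the per-expert `gate_proj` layers, which are meant to be quantized.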

License

Inherits apache-2.0 from the upstream models. Abliterated/uncensored: you are responsible for how you use it.

Credits

Support the Base Model Author (huihui-ai)

If you find the abliterated base useful, please support huihui-ai. This repo only adds the FP4 quantization; the abliteration work is theirs.

Model provider

sakamakismile

Model tree

Base

huihui-ai/Huihui-Nex-N2-mini-abliterated

Quantized

this model

Modalities

Input

Video, Text, Image

Output

Text
