sakamakismile

Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP

README

License: apache-2.0

TL;DR — run it (no build required)

The official vLLM image already ships the qwen3_5 architecture and the Qwen3_5MTP draft module, so you do not need to build anything.

bash

# from this directory; pick exactly TP_SIZE GPUs and avoid your display GPU
CUDA_VISIBLE_DEVICES=0,1,2,3 ./run.sh up
./run.sh test # waits for /v1/models
./run.sh bench # one-shot smoke test

Or the raw docker run (what run.sh/compose.yaml wrap):

bash

docker run -d --name vllm-huihui --runtime nvidia --gpus '"device=0,1,2,3"' \
-p 8000:8000 -v "$PWD":/model:ro --shm-size 32g \
-e VLLM_USE_FLASHINFER_SAMPLER=1 -e TORCH_MATMUL_PRECISION=high \
--entrypoint vllm vllm/vllm-openai:v0.22.0 serve /model \
--served-model-name huihui-qwen36-27b-local \
--trust-remote-code --tensor-parallel-size 4 --quantization modelopt \
--max-model-len 65536 --max-num-seqs 8 --max-num-batched-tokens 16384 \
--gpu-memory-utilization 0.85 --kv-cache-dtype fp8 --dtype auto \
--reasoning-parser qwen3 \
--speculative-config '{"method":"qwen3_5_mtp","num_speculative_tokens":3}' \
--chat-template /model/chat_template.jinja \
--enable-auto-tool-choice --tool-call-parser qwen3_xml \
--host 0.0.0.0 --port 8000

Smoke test:

bash

curl -s localhost:8000/v1/chat/completions -H 'Content-Type: application/json' -d '{
"model":"huihui-qwen36-27b-local",
"messages":[{"role":"user","content":"Name three famous sights in Tokyo, briefly."}],
"max_tokens":512, "temperature":0.7, "top_k":20, "top_p":0.95}' | jq .
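The same smoke test can be sketched in Python with only the standard library. The helper names (`build_payload`, `split_reply`, `chat`) are illustrative and not part of this package; the URL, model name and sampling values are the ones used in the curl call, and `reasoning_content` is the field the reasoning parser is described as filling.

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"  # matches -p 8000:8000 in the docker run
MODEL = "huihui-qwen36-27b-local"      # matches --served-model-name


def build_payload(prompt: str, max_tokens: int = 512) -> dict:
    """Build the same request body the curl smoke test sends."""
    return {
        "model": MODEL,
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 0.7,
        "top_k": 20,
        "top_p": 0.95,
    }


def split_reply(response: dict) -> tuple[str, str]:
    """Return (reasoning, answer) from a chat completion response dict."""
    message = response["choices"][0]["message"]
    return message.get("reasoning_content") or "", message.get("content") or ""


def chat(prompt: str, timeout: float = 300.0) -> dict:
    """POST the payload to /chat/completions and decode the JSON reply."""
    request = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(request, timeout=timeout) as reply:
        return json.load(reply)


# With the server running:
#   reasoning, answer = split_reply(chat("Name three sights in Tokyo, briefly."))
#   print(answer)
```

Keeping the payload builder separate from the network call makes it easy to check the request shape without a running server.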

Hardware & requirements

  • 4× NVIDIA RTX PRO 2000 Blackwell (16 GB each, SM120), PCIe (no NVLink).
  • Docker + NVIDIA Container Toolkit. The pre-built vllm/vllm-openai:v0.22.0 image carries vLLM ≥0.22 with NVFP4/modelopt + FlashInfer FP4 kernels and the qwen3_5 + MTP code.
  • TP=4 sharding is clean: heads 24, KV heads 4, hidden 5120, intermediate 17408 — all ÷4.
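The divisibility claim in the last bullet can be checked with a few lines of Python. This is only the arithmetic behind the bullet, not vLLM's own validation logic; `shard_sizes` is a hypothetical helper name.

```python
# Dimensions quoted in the bullet above.
DIMS = {
    "attention heads": 24,
    "KV heads": 4,
    "hidden size": 5120,
    "intermediate size": 17408,
}


def shard_sizes(tp_size: int, dims: dict[str, int] = DIMS) -> dict[str, int]:
    """Return each dimension's per-GPU slice, or raise if one does not divide evenly."""
    uneven = [name for name, size in dims.items() if size % tp_size]
    if uneven:
        raise ValueError(f"TP={tp_size} does not evenly divide: {', '.join(uneven)}")
    return {name: size // tp_size for name, size in dims.items()}


print(shard_sizes(4))
# {'attention heads': 6, 'KV heads': 1, 'hidden size': 1280, 'intermediate size': 4352}
```

With TP=4 each GPU holds 6 attention heads and exactly 1 KV head.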

Bare-metal (no container) also works: pip install vllm (qwen3_5 support was introduced in 0.21), install a CUDA 13.x toolchain for the SM120 Triton/NVFP4 kernels, then run vllm serve with the same flags.


Flags, and why

| flag | value | why |
|---|---|---|
| --quantization modelopt | required | checkpoint is NVFP4 (hf_quant_config.json); without it weights read as garbage. |
| --trust-remote-code | recommended | qwen3_5 multimodal config. |
| --tensor-parallel-size | 4 | model needs ~7.2 GiB/GPU; 4× 16 GB is the design point. |
| --max-model-len | 65536 (≤ 262144) | hybrid attention keeps KV cheap, so long context is affordable. |
| --max-num-seqs | 8 (peak 24 @64K) | concurrent slots. See benchmarks for the throughput curve. |
| --kv-cache-dtype fp8 | recommended | ~2× KV capacity for more concurrency / longer context. |
| --gpu-memory-utilization | 0.85 | model ≈7.2 GiB/GPU → ~6 GiB left for KV. Raise only on a clean card. |
| --reasoning-parser qwen3 | recommended | splits <think>…</think> into reasoning_content; answer in content. |
| --speculative-config '{"method":"qwen3_5_mtp","num_speculative_tokens":3}' | recommended | turns on the MTP draft head. vLLM ≥0.22 auto-maps qwen3_5_mtp → mtp (harmless deprecation warning). SPEC_TOKENS=0 disables it. |
| --enable-auto-tool-choice --tool-call-parser qwen3_xml | optional (agentic) | parses Qwen3 XML tool calls. Drop for pure chat (ENABLE_TOOLS=0). |
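The memory budget behind --gpu-memory-utilization is simple arithmetic, sketched below. The ~15.5 GiB usable-per-card figure is an assumption, back-derived from the ~13.2 GiB target quoted under Rituals for 0.85; the 7.2 GiB model footprint is the figure given in the flags list. `kv_budget_gib` is a hypothetical helper name.

```python
def kv_budget_gib(usable_gib: float, utilization: float, model_gib: float) -> float:
    """GiB left for the KV cache once the weights fit under the memory target."""
    target = usable_gib * utilization
    if target <= model_gib:
        raise ValueError("memory target is below the model's own footprint")
    return target - model_gib


# ASSUMPTION: ~15.5 GiB usable per 16 GB card (inferred, not stated in this README).
print(round(kv_budget_gib(15.5, 0.85, 7.2), 1))  # 6.0
```

Under the same assumption, dropping the utilization to 0.45 would leave the target below the model footprint, so the helper raises instead of returning a negative budget.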

Sampling (Qwen default, generation_config.json): temperature=0.7, top_k=20, top_p=0.95. It is a reasoning model — give it room (max_tokens ≥ 512).


Docker package (bundled)

compose.yaml · entrypoint.sh · run.sh · Dockerfile. The compose defaults to the official image + a mounted entrypoint.sh (build-free). Every flag is env-overridable:

bash

CUDA_VISIBLE_DEVICES=0,1,2,3 ./run.sh up # start on those 4 GPUs
MAX_MODEL_LEN=262144 MAX_NUM_SEQS=8 ./run.sh up # 256K long-context mode
MAX_MODEL_LEN=131072 MAX_NUM_SEQS=1 ./run.sh up # 128K single-request benchmark
SPEC_TOKENS=0 ./run.sh up # disable MTP speculative decoding
ENABLE_TOOLS=0 ./run.sh up # pure chat (no tool parser)
PORT=8001 ./run.sh up # serve on a different host port
./run.sh logs # tail
./run.sh down # stop

Env knobs: PORT, MAX_MODEL_LEN, MAX_NUM_SEQS, MAX_NUM_BATCHED_TOKENS, GPU_MEM_UTIL, KV_CACHE_DTYPE, SPEC_TOKENS, TP_SIZE, ENABLE_TOOLS, REASONING_PARSER, TOOL_CALL_PARSER, CUDA_VISIBLE_DEVICES, VLLM_IMAGE. The model weights are mounted read-only (. → /model); the image carries only the runtime. shm_size: 32g is set (vLLM V1 uses a lot of shared memory).
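To make the knob-to-flag mapping concrete, here is a minimal sketch of how such environment variables could be turned into a vllm serve argument list. It is not the bundled entrypoint.sh, whose contents are not shown here; `build_serve_args` is a hypothetical helper, the defaults are copied from the docker run example, and only the knob-driven subset of flags is emitted.

```python
import json
import os

# Defaults taken from the docker run example; knob names from the list above.
# PORT is left out: per the examples it selects the host port, which is a
# container-mapping concern rather than a vllm serve flag.
DEFAULTS = {
    "MAX_MODEL_LEN": "65536",
    "MAX_NUM_SEQS": "8",
    "MAX_NUM_BATCHED_TOKENS": "16384",
    "GPU_MEM_UTIL": "0.85",
    "KV_CACHE_DTYPE": "fp8",
    "SPEC_TOKENS": "3",
    "TP_SIZE": "4",
    "ENABLE_TOOLS": "1",
    "REASONING_PARSER": "qwen3",
    "TOOL_CALL_PARSER": "qwen3_xml",
}


def build_serve_args(env=None):
    """Hypothetical helper: turn env knobs into a vllm serve argument list."""
    source = os.environ if env is None else env
    cfg = {key: source.get(key, default) for key, default in DEFAULTS.items()}
    args = [
        "vllm", "serve", "/model",
        "--quantization", "modelopt",
        "--tensor-parallel-size", cfg["TP_SIZE"],
        "--max-model-len", cfg["MAX_MODEL_LEN"],
        "--max-num-seqs", cfg["MAX_NUM_SEQS"],
        "--max-num-batched-tokens", cfg["MAX_NUM_BATCHED_TOKENS"],
        "--gpu-memory-utilization", cfg["GPU_MEM_UTIL"],
        "--kv-cache-dtype", cfg["KV_CACHE_DTYPE"],
        "--reasoning-parser", cfg["REASONING_PARSER"],
    ]
    spec_tokens = int(cfg["SPEC_TOKENS"])
    if spec_tokens > 0:  # SPEC_TOKENS=0 disables MTP speculative decoding
        spec = {"method": "qwen3_5_mtp", "num_speculative_tokens": spec_tokens}
        args += ["--speculative-config", json.dumps(spec)]
    if cfg["ENABLE_TOOLS"] != "0":  # ENABLE_TOOLS=0 means pure chat
        args += ["--enable-auto-tool-choice", "--tool-call-parser", cfg["TOOL_CALL_PARSER"]]
    return args


print(" ".join(build_serve_args({"MAX_MODEL_LEN": "262144", "SPEC_TOKENS": "0"})))
```

Passing a plain dict instead of reading os.environ keeps the mapping easy to inspect before launching anything.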

To build a self-contained image instead: uncomment the build: block in compose.yaml and run ./run.sh rebuild (the Dockerfile just pip-installs vLLM on a CUDA 13.1 base).


Benchmark results (RTX PRO 2000 Blackwell ×4, TP=4, MTP n=3)

Conditions: 512 output tokens, ~350-token prompt, --kv-cache-dtype fp8, --gpu-memory-utilization 0.85.

64K context

| Req | Aggregate (t/s) | Per-req (t/s) |
|---|---|---|
| 1 | 81.0 | 81.0 |
| 2 | 134.0 | 67.0 |
| 3 | 205.1 | 71.6 |
| 4 | 274.5 | 72.5 |
| 6 | 380.3 | 65.2 |
| 8 | 454.2 | 58.9 |
| 10 | 518.9 | 53.7 |
| 12 | 613.8 | 52.6 |
| 14 | 669.9 | 49.5 |
| 16 | 720.2 | 46.1 |
| 18 | 764.7 | 44.2 |
| 20 | 798.7 | 41.6 |
| 22 | 835.0 | 39.5 |
| 24 | 879.5 | 37.2 |
| 28 | 859.7 | 31.7 |
| 32 | 736.8 | 32.1 |
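The 64K figures can be dropped into a short script to locate the throughput peak and the scaling relative to a single request. The numbers are the aggregate column from the table; the function names are illustrative.

```python
# (concurrent requests, aggregate tokens/s) from the 64K results.
RESULTS_64K = [
    (1, 81.0), (2, 134.0), (3, 205.1), (4, 274.5), (6, 380.3), (8, 454.2),
    (10, 518.9), (12, 613.8), (14, 669.9), (16, 720.2), (18, 764.7),
    (20, 798.7), (22, 835.0), (24, 879.5), (28, 859.7), (32, 736.8),
]


def peak(results):
    """Return the (requests, aggregate t/s) pair with the highest aggregate."""
    return max(results, key=lambda row: row[1])


def speedup_over_single(results):
    """Aggregate throughput at each concurrency relative to the 1-request run."""
    base = dict(results)[1]
    return {reqs: round(tps / base, 2) for reqs, tps in results}


best_reqs, best_tps = peak(RESULTS_64K)
print(best_reqs, best_tps)                           # 24 879.5
print(speedup_over_single(RESULTS_64K)[best_reqs])   # 10.86
```

At the peak, 24 concurrent requests deliver roughly 10.9× the single-request aggregate, which is why the default of 8 slots can be raised when throughput matters more than per-request speed.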

256K context (1→16 req)

83.3 → 131.4 → 203.8 → 269.7 → 376.0 → 442.0 → 516.4 → 618.8 → 666.4 → 701.3 t/s (per-req 83 → 45). 256K tracks 64K almost exactly — the hybrid KV (16/64 full + 48/64 linear attention) stays cheap at length.
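How closely 256K tracks 64K can be quantified with a ratio per concurrency level. One loud assumption: the ten 256K values are taken to correspond to 1, 2, 3, 4, 6, 8, 10, 12, 14 and 16 requests, the same steps as the first ten 64K rows. The README only says "1→16 req", so that pairing is inferred rather than stated.

```python
# ASSUMPTION: the ten 256K values use the same request counts as the 64K rows.
REQS = [1, 2, 3, 4, 6, 8, 10, 12, 14, 16]
AGG_64K = [81.0, 134.0, 205.1, 274.5, 380.3, 454.2, 518.9, 613.8, 669.9, 720.2]
AGG_256K = [83.3, 131.4, 203.8, 269.7, 376.0, 442.0, 516.4, 618.8, 666.4, 701.3]


def ratios(long_ctx, short_ctx):
    """256K aggregate divided by 64K aggregate at each concurrency level."""
    return [a / b for a, b in zip(long_ctx, short_ctx)]


def worst_deviation(values):
    """Largest absolute distance from a ratio of 1.0."""
    return max(abs(v - 1.0) for v in values)


for reqs, ratio in zip(REQS, ratios(AGG_256K, AGG_64K)):
    print(f"{reqs:>2} req: {ratio:.3f}")
print(f"worst deviation: {worst_deviation(ratios(AGG_256K, AGG_64K)):.1%}")
```

Under that pairing every point stays within 3% of the 64K figure.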

Takeaways: peak throughput is ~880 tok/s at 24 concurrent requests (64K) and declines beyond that (859.7 t/s at 28, 736.8 t/s at 32). Long context is nearly free: 256K runs 16-way without OOM. For 256K use --max-model-len 262144 --max-num-seqs 8. A single-request 128K run (--max-num-seqs 1) reaches ~83.9 tok/s.


Rituals (gotchas)

  1. Kill zombie GPU procs: a failed or cancelled launch leaves workers in VRAM. List them with nvidia-smi --query-compute-apps=pid,process_name --format=csv,noheader and then run kill -9 <Worker_TP* PIDs>.
  2. First launch is slow — torch.compile + Triton + NVFP4 warmup ≈2 min. Wait for Application startup complete / Uvicorn running on http://0.0.0.0:8000.
  3. gpu-memory-utilization must exceed real usage — clean start ≈7.2 GiB/GPU; with 0.85 vLLM targets ~13.2 GiB leaving ~6 GiB KV. Free memory < desired… = residual allocation from a previous run (see #1).
  4. Concurrent NCCL init can hang — bringing up two TP servers at once may spin one at NCCL init (GPUs stuck ~370 MiB / 100% util / low watts). Start them one at a time, or set NCCL_P2P_DISABLE=1 for the smaller group.
  5. MTP acceptance: num_speculative_tokens>1 reuses one MTP layer per step; higher values trade acceptance for draft depth. n=3 is a good default here.
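For item 2, a script that needs the server can poll /v1/models instead of sleeping for a fixed time. The sketch below is not the bundled ./run.sh test; the function names, the 600 s timeout and the 5 s interval are illustrative assumptions.

```python
import http.client
import time
import urllib.request


def models_endpoint_ready(url: str = "http://localhost:8000/v1/models") -> bool:
    """True once GET /v1/models answers 200; False while the server is warming up."""
    try:
        with urllib.request.urlopen(url, timeout=5) as reply:
            return reply.status == 200
    except (OSError, http.client.HTTPException):
        # Connection refused, timeouts and HTTP errors all mean "not ready yet".
        return False


def wait_until_ready(probe=models_endpoint_ready, timeout_s=600.0, interval_s=5.0,
                     sleep=time.sleep, clock=time.monotonic) -> bool:
    """Call probe() until it returns True or timeout_s elapses."""
    deadline = clock() + timeout_s
    while True:
        if probe():
            return True
        if clock() >= deadline:
            return False
        sleep(interval_s)


# With the container starting up:
#   if not wait_until_ready():
#       raise SystemExit("vLLM did not come up; check ./run.sh logs")
```

The probe, sleep and clock are injectable so the loop can be exercised without a server or real waiting.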

OpenCode provider

jsonc

// ~/.config/opencode/opencode.jsonc
{
"provider": {
"local-vllm": {
"npm": "@ai-sdk/openai-compatible",
"name": "Local vLLM",
"options": { "baseURL": "http://127.0.0.1:8000/v1", "apiKey": "EMPTY" },
"models": {
"huihui-qwen36-27b-local": {
"name": "Huihui Qwen3.6 27B NVFP4 MTP Local",
"reasoning": true, "tool_call": true, "temperature": true,
"limit": { "context": 65536, "output": 8192 }
}
}
}
},
"model": "local-vllm/huihui-qwen36-27b-local",
"small_model": "local-vllm/huihui-qwen36-27b-local"
}

What's inside

  • Quantized → NVFP4 (modelopt 0.43, W4A4, group 16): the Linear layers. lm_head, conv/short-conv, routers and the MTP embedding are kept at higher precision (see the ignore list in config.json / hf_quant_config.json).
  • MTP draft head (mtp_num_hidden_layers: 1) → speculative decoding via vLLM.
  • Files: model.safetensors (~20 GB), config.json, hf_quant_config.json, chat_template.jinja, tokenizer, and this Docker package.
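As a rough guide to what "W4A4, group 16" means for storage, the sketch below amortizes one shared scale over each group of 16 four-bit weights. The 8-bit block scale is an assumption about the NVFP4 format that this README does not state, the per-tensor scale is ignored, and the 27e9 parameter count is only read off the model name.

```python
def nvfp4_bits_per_weight(group_size: int = 16, scale_bits: int = 8) -> float:
    """4-bit values plus one shared scale per group, amortized per weight."""
    return 4 + scale_bits / group_size


def quantized_gb(num_params: float, bits_per_weight: float) -> float:
    """Storage in GB (1e9 bytes) if every parameter used the given bit width."""
    return num_params * bits_per_weight / 8 / 1e9


print(nvfp4_bits_per_weight())                       # 4.5
print(quantized_gb(27e9, nvfp4_bits_per_weight()))   # 15.1875
```

The estimate lands below the ~20 GB file size listed above, which is consistent with some layers being kept at higher precision, though the exact split is not given here.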

Modalities

Input: Video, Text, Image
Output: Text
