sakamakismile
Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP
License: apache-2.0
TL;DR: run it (no build required)
The official vLLM image already ships the qwen3_5 architecture and the Qwen3_5MTP
draft module, so you do not need to build anything.
```bash
# from this directory; pick exactly TP_SIZE GPUs and avoid your display GPU
CUDA_VISIBLE_DEVICES=0,1,2,3 ./run.sh up
./run.sh test    # waits for /v1/models
./run.sh bench   # one-shot smoke test
```
Or the raw docker run (what run.sh/compose.yaml wrap):
```bash
docker run -d --name vllm-huihui --runtime nvidia --gpus '"device=0,1,2,3"' \
  -p 8000:8000 -v "$PWD":/model:ro --shm-size 32g \
  -e VLLM_USE_FLASHINFER_SAMPLER=1 -e TORCH_MATMUL_PRECISION=high \
  --entrypoint vllm vllm/vllm-openai:v0.22.0 serve /model \
  --served-model-name huihui-qwen36-27b-local \
  --trust-remote-code --tensor-parallel-size 4 --quantization modelopt \
  --max-model-len 65536 --max-num-seqs 8 --max-num-batched-tokens 16384 \
  --gpu-memory-utilization 0.85 --kv-cache-dtype fp8 --dtype auto \
  --reasoning-parser qwen3 \
  --speculative-config '{"method":"qwen3_5_mtp","num_speculative_tokens":3}' \
  --chat-template /model/chat_template.jinja \
  --enable-auto-tool-choice --tool-call-parser qwen3_xml \
  --host 0.0.0.0 --port 8000
```
Smoke test:
```bash
curl -s localhost:8000/v1/chat/completions \
  -H 'Content-Type: application/json' \
  -d '{
    "model": "huihui-qwen36-27b-local",
    "messages": [{"role": "user", "content": "Name three famous sights in Tokyo, briefly."}],
    "max_tokens": 512, "temperature": 0.7, "top_k": 20, "top_p": 0.95
  }' | jq .
```
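If you start the server with the raw docker run instead of run.sh, nothing waits for /v1/models on your behalf. The helper below is a minimal sketch of such a wait loop, not the logic inside the bundled run.sh; the function name wait_ready, the 300-second default, and the 2-second poll interval are assumptions chosen for illustration.

```bash
# Poll an endpoint until it answers with HTTP 2xx, or give up after a timeout.
# wait_ready [URL] [TIMEOUT_SECONDS]   (hypothetical helper, not part of run.sh)
wait_ready() (
  url="${1:-http://localhost:8000/v1/models}"
  timeout="${2:-300}"
  deadline=$(( $(date +%s) + timeout ))
  # -f makes curl fail on HTTP errors; --max-time bounds each single probe.
  until curl -sf --max-time 5 -o /dev/null "$url"; do
    if [ "$(date +%s)" -ge "$deadline" ]; then
      echo "timed out waiting for $url" >&2
      exit 1   # leaves the subshell, so the function returns 1
    fi
    sleep 2
  done
)

# Usage (run manually once the container is starting):
#   wait_ready http://localhost:8000/v1/models 300 && echo "server is up"
```

The function body runs in a subshell, so its variables do not leak into the calling shell.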
Hardware & requirements
- 4× NVIDIA RTX PRO 2000 Blackwell (16 GB each, SM120), PCIe (no NVLink).
- Docker + NVIDIA Container Toolkit. The pre-built vllm/vllm-openai:v0.22.0 image carries vLLM ≥0.22 with NVFP4/modelopt + FlashInfer FP4 kernels and the qwen3_5 + MTP code.
- TP=4 sharding is clean: heads 24, KV heads 4, hidden 5120, intermediate 17408, all divisible by 4.
Bare-metal (no container) also works: pip install vllm (≥0.21 introduced qwen3_5), CUDA 13.x toolchain for the SM120 Triton/NVFP4 kernels, then the same vllm serve flags.
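The "sharding is clean" claim above is plain divisibility, which is easy to re-check before trying a different TP size. The sketch below is only a sanity check on the four dimensions quoted in this README; tp_divisible is a hypothetical helper name, and vLLM's own validation may accept or reject other configurations for reasons not modeled here.

```bash
# Return 0 if every listed model dimension divides evenly by the TP size.
# tp_divisible TP DIM [DIM...]   (hypothetical helper)
tp_divisible() (
  tp="$1"; shift
  for dim in "$@"; do
    if [ $(( dim % tp )) -ne 0 ]; then
      echo "dimension $dim is not divisible by TP=$tp" >&2
      exit 1   # leaves the subshell, so the function returns 1
    fi
  done
)

# heads, KV heads, hidden, intermediate (values quoted in this README)
tp_divisible 4 24 4 5120 17408 && echo "TP=4 shards cleanly"
```

With the same dimensions, tp_divisible 3 ... fails on the 4 KV heads, which is why TP=3 is not a drop-in alternative under this check.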
Flags, and why
| flag | value | why |
|---|---|---|
| --quantization modelopt | required | checkpoint is NVFP4 (hf_quant_config.json); without it weights read as garbage. |
| --trust-remote-code | recommended | qwen3_5 multimodal config. |
| --tensor-parallel-size | 4 | model needs ~7.2 GiB/GPU; 4× 16 GB is the design point. |
| --max-model-len | 65536 (≤ 262144) | hybrid attention keeps KV cheap, so long context is affordable. |
| --max-num-seqs | 8 (peak 24 @64K) | concurrent slots. See benchmarks for the throughput curve. |
| --kv-cache-dtype fp8 | recommended | ~2× KV capacity for more concurrency / longer context. |
| --gpu-memory-utilization | 0.85 | model ≈7.2 GiB/GPU → ~6 GiB left for KV. Raise only on a clean card. |
| --reasoning-parser qwen3 | recommended | splits <think>…</think> into reasoning_content; answer in content. |
| --speculative-config '{"method":"qwen3_5_mtp","num_speculative_tokens":3}' | recommended | turns on the MTP draft head. vLLM ≥0.22 auto-maps qwen3_5_mtp → mtp (harmless deprecation warning). SPEC_TOKENS=0 disables it. |
| --enable-auto-tool-choice --tool-call-parser qwen3_xml | optional (agentic) | parses Qwen3 XML tool calls. Drop for pure chat (ENABLE_TOOLS=0). |
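The --gpu-memory-utilization row is simple arithmetic: the fraction times the card's memory gives vLLM's target, and whatever the weights do not use is left for KV cache. A rough sketch of that calculation follows; the 15.5 GiB of visible memory per card is an assumption back-derived from the ~13.2 GiB target quoted in this README, not a measured value, and kv_budget is a hypothetical helper name.

```bash
# kv_budget UTIL TOTAL_GIB WEIGHTS_GIB
# Prints "<target GiB> <GiB left for KV cache>" (hypothetical helper).
kv_budget() {
  awk -v util="$1" -v total="$2" -v weights="$3" 'BEGIN {
    target = util * total              # memory vLLM is allowed to claim
    printf "%.1f %.1f\n", target, target - weights
  }'
}

# 0.85 utilization, ~15.5 GiB visible per card (assumed), ~7.2 GiB of weights
kv_budget 0.85 15.5 7.2   # prints: 13.2 6.0
```

Rerunning it with a lower utilization or a larger weight footprint shows how quickly the KV headroom shrinks.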
Sampling (Qwen default, generation_config.json): temperature=0.7, top_k=20,
top_p=0.95. It is a reasoning model — give it room (max_tokens ≥ 512).
Docker package (bundled)
compose.yaml · entrypoint.sh · run.sh · Dockerfile. The compose defaults to the
official image + a mounted entrypoint.sh (build-free). Every flag is env-overridable:
```bash
CUDA_VISIBLE_DEVICES=0,1,2,3 ./run.sh up          # start on those 4 GPUs
MAX_MODEL_LEN=262144 MAX_NUM_SEQS=8 ./run.sh up   # 256K long-context mode
MAX_MODEL_LEN=131072 MAX_NUM_SEQS=1 ./run.sh up   # 128K single-request benchmark
SPEC_TOKENS=0 ./run.sh up                         # disable MTP speculative decoding
ENABLE_TOOLS=0 ./run.sh up                        # pure chat (no tool parser)
PORT=8001 ./run.sh up                             # serve on a different host port
./run.sh logs                                     # tail
./run.sh down                                     # stop
```
Env knobs: PORT, MAX_MODEL_LEN, MAX_NUM_SEQS, MAX_NUM_BATCHED_TOKENS,
GPU_MEM_UTIL, KV_CACHE_DTYPE, SPEC_TOKENS, TP_SIZE, ENABLE_TOOLS,
REASONING_PARSER, TOOL_CALL_PARSER, CUDA_VISIBLE_DEVICES, VLLM_IMAGE.
The model weights are mounted read-only (. → /model); the image carries only the runtime.
shm_size: 32g is set (vLLM V1 uses a lot of shared memory).
To build a self-contained image instead: uncomment the build: block in compose.yaml
and run ./run.sh rebuild (the Dockerfile just pip-installs vLLM on a CUDA 13.1 base).
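To make the env knobs concrete, here is one way an entrypoint could translate them into vllm serve flags. This is a hypothetical sketch, not the bundled entrypoint.sh: the function name build_vllm_args is invented, and the defaults are copied from the raw docker run command in this README. It only prints the flags for inspection; a real script would pass the JSON as a single quoted argument.

```bash
# Print the vllm serve flags implied by the current env knobs (sketch only).
build_vllm_args() (
  spec="${SPEC_TOKENS:-3}"
  args="serve /model --tensor-parallel-size ${TP_SIZE:-4}"
  args="$args --max-model-len ${MAX_MODEL_LEN:-65536}"
  args="$args --max-num-seqs ${MAX_NUM_SEQS:-8}"
  args="$args --max-num-batched-tokens ${MAX_NUM_BATCHED_TOKENS:-16384}"
  args="$args --gpu-memory-utilization ${GPU_MEM_UTIL:-0.85}"
  args="$args --kv-cache-dtype ${KV_CACHE_DTYPE:-fp8}"
  args="$args --reasoning-parser ${REASONING_PARSER:-qwen3}"
  # SPEC_TOKENS=0 drops the speculative-decoding flag entirely.
  if [ "$spec" -gt 0 ]; then
    args="$args --speculative-config {\"method\":\"qwen3_5_mtp\",\"num_speculative_tokens\":$spec}"
  fi
  # ENABLE_TOOLS=0 drops both tool-calling flags.
  if [ "${ENABLE_TOOLS:-1}" -ne 0 ]; then
    args="$args --enable-auto-tool-choice --tool-call-parser ${TOOL_CALL_PARSER:-qwen3_xml}"
  fi
  echo "$args"
)

build_vllm_args
```

Running it with, for example, SPEC_TOKENS=0 ENABLE_TOOLS=0 in the environment shows which flags disappear.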
Benchmark results (RTX PRO 2000 Blackwell ×4, TP=4, MTP n=3)
Conditions: 512 output tokens, ~350-token prompt, --kv-cache-dtype fp8,
--gpu-memory-utilization 0.85.
64K context
| Req | Aggregate | Per-req (t/s) | Req | Aggregate | Per-req (t/s) |
|---|---|---|---|---|---|
| 1 | 81.0 t/s | 81.0 | 14 | 669.9 t/s | 49.5 |
| 2 | 134.0 t/s | 67.0 | 16 | 720.2 t/s | 46.1 |
| 3 | 205.1 t/s | 71.6 | 18 | 764.7 t/s | 44.2 |
| 4 | 274.5 t/s | 72.5 | 20 | 798.7 t/s | 41.6 |
| 6 | 380.3 t/s | 65.2 | 22 | 835.0 t/s | 39.5 |
| 8 | 454.2 t/s | 58.9 | 24 | 879.5 t/s | 37.2 |
| 10 | 518.9 t/s | 53.7 | 28 | 859.7 t/s | 31.7 |
| 12 | 613.8 t/s | 52.6 | 32 | 736.8 t/s | 32.1 |
256K context (1→16 req)
83.3 → 131.4 → 203.8 → 269.7 → 376.0 → 442.0 → 516.4 → 618.8 → 666.4 → 701.3 t/s
(per-req 83 → 45). 256K tracks 64K almost exactly — the hybrid KV (16/64 full +
48/64 linear attention) stays cheap at length.
Takeaways: peak throughput is ~880 tok/s at 24 concurrent requests (64K); it dips slightly at 28 and falls to ~737 tok/s at 32.
Long context is nearly free: 256K runs 16-way without OOM. For 256K use
--max-model-len 262144 --max-num-seqs 8; a single 128K request
(--max-model-len 131072 --max-num-seqs 1) runs at ~83.9 tok/s.
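The peak quoted in the takeaways can be read straight off the 64K table. As a small illustration, the sketch below feeds the table's request count and aggregate columns to awk and prints the row with the highest aggregate throughput; peak_throughput is a hypothetical helper name, and the numbers are copied from the table above rather than re-measured.

```bash
# Read "requests aggregate_tps" pairs on stdin; print the pair with the
# highest aggregate throughput (hypothetical helper).
peak_throughput() {
  awk '$2 > best { best = $2; req = $1 } END { print req, best }'
}

# Request count and aggregate t/s from the 64K table.
peak_throughput <<'EOF'
1 81.0
2 134.0
3 205.1
4 274.5
6 380.3
8 454.2
10 518.9
12 613.8
14 669.9
16 720.2
18 764.7
20 798.7
22 835.0
24 879.5
28 859.7
32 736.8
EOF
# prints: 24 879.5
```

The same helper works on your own benchmark output as long as it is reduced to two whitespace-separated columns.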
Rituals (gotchas)
- Kill zombie GPU procs: a failed/cancelled launch leaves workers in VRAM: nvidia-smi --query-compute-apps=pid,process_name --format=csv,noheader → kill -9 <Worker_TP* PIDs>.
- First launch is slow: torch.compile + Triton + NVFP4 warmup ≈2 min. Wait for Application startup complete / Uvicorn running on http://0.0.0.0:8000.
- gpu-memory-utilization must exceed real usage: clean start ≈7.2 GiB/GPU; with 0.85 vLLM targets ~13.2 GiB leaving ~6 GiB KV. Free memory < desired… = residual allocation from a previous run (see the first item).
- Concurrent NCCL init can hang: bringing up two TP servers at once may spin one at NCCL init (GPUs stuck ~370 MiB / 100% util / low watts). Start them one at a time, or set NCCL_P2P_DISABLE=1 for the smaller group.
- MTP acceptance: num_speculative_tokens>1 reuses one MTP layer per step; higher values trade acceptance for draft depth. n=3 is a good default here.
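The zombie-process cleanup in the first item can be scripted by filtering the nvidia-smi CSV for worker processes. The sketch below only extracts PIDs and leaves the kill to you; worker_pids is a hypothetical helper name, and the sample process names in the comment are made up to show the expected "pid, process_name" shape.

```bash
# Read "pid, process_name" CSV lines on stdin and print the PIDs whose
# process name contains Worker_TP (hypothetical helper).
worker_pids() {
  awk -F', *' '$2 ~ /Worker_TP/ { print $1 }'
}

# Usage (run manually; double-check the PIDs before killing anything):
#   nvidia-smi --query-compute-apps=pid,process_name --format=csv,noheader \
#     | worker_pids
#
# Example with made-up input:
#   printf '101, Worker_TP0\n202, Xorg\n' | worker_pids    # prints: 101
```

Printing the PIDs first, rather than piping directly into kill -9, gives you a chance to confirm that no unrelated GPU job is on the list.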
OpenCode provider
```jsonc
// ~/.config/opencode/opencode.jsonc
{
  "provider": {
    "local-vllm": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "Local vLLM",
      "options": { "baseURL": "http://127.0.0.1:8000/v1", "apiKey": "EMPTY" },
      "models": {
        "huihui-qwen36-27b-local": {
          "name": "Huihui Qwen3.6 27B NVFP4 MTP Local",
          "reasoning": true, "tool_call": true, "temperature": true,
          "limit": { "context": 65536, "output": 8192 }
        }
      }
    }
  },
  "model": "local-vllm/huihui-qwen36-27b-local",
  "small_model": "local-vllm/huihui-qwen36-27b-local"
}
```
What's inside
- Quantized → NVFP4 (modelopt 0.43, W4A4, group 16): the Linear layers; lm_head, conv/short-conv, routers and the MTP embedding kept higher precision (ignore list in config.json / hf_quant_config.json).
- MTP draft head (mtp_num_hidden_layers: 1) → speculative decoding via vLLM.
- Files: model.safetensors (~20 GB), config.json, hf_quant_config.json, chat_template.jinja, tokenizer, and this Docker package.
Model provider: sakamakismile
Modalities
Input: Video, Text, Image
Output: Text