sakamakismile

Huihui-Qwen3.6-27B-abliterated-NVFP4-MTP

README

License: apache-2.0

TL;DR — run it (no build required)

The official vLLM image already ships the qwen3_5 architecture and the Qwen3_5MTP draft module, so you do not need to build anything.

bash
# from this directory; pick exactly TP_SIZE GPUs and avoid your display GPU
CUDA_VISIBLE_DEVICES=0,1,2,3 ./run.sh up
./run.sh test          # waits for /v1/models
./run.sh bench         # one-shot smoke test
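
The `./run.sh test` step is described only as waiting for /v1/models. The bundled script is not reproduced here, so the following is a minimal sketch of what such a readiness poll could look like, assuming curl is installed. The function name `wait_for_models` and its arguments are hypothetical and are not part of run.sh.

```shell
#!/usr/bin/env bash
# Hypothetical helper, NOT the bundled run.sh: poll an OpenAI-compatible
# server until GET /v1/models succeeds or the timeout expires.
wait_for_models() {
  local base_url="${1:-http://localhost:8000}"  # assumed default address
  local timeout_s="${2:-300}"                   # leave room for first-launch warmup
  local interval_s="${3:-5}"
  local waited=0
  # -f makes curl exit non-zero on HTTP errors; -s hides the progress meter.
  until curl -sf "${base_url}/v1/models" >/dev/null 2>&1; do
    if [ "$waited" -ge "$timeout_s" ]; then
      echo "timed out after ${waited}s waiting for ${base_url}/v1/models" >&2
      return 1
    fi
    sleep "$interval_s"
    waited=$((waited + interval_s))
  done
  echo "ready after ~${waited}s"
}
```

A possible use is `wait_for_models http://localhost:8000 600 && ./run.sh bench`, which runs the smoke test only once the server answers.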

Or the raw docker run (what run.sh/compose.yaml wrap):

bash
docker run -d --name vllm-huihui --runtime nvidia --gpus '"device=0,1,2,3"' \
  -p 8000:8000 -v "$PWD":/model:ro --shm-size 32g \
  -e VLLM_USE_FLASHINFER_SAMPLER=1 -e TORCH_MATMUL_PRECISION=high \
  --entrypoint vllm vllm/vllm-openai:v0.22.0 serve /model \
    --served-model-name huihui-qwen36-27b-local \
    --trust-remote-code --tensor-parallel-size 4 --quantization modelopt \
    --max-model-len 65536 --max-num-seqs 8 --max-num-batched-tokens 16384 \
    --gpu-memory-utilization 0.85 --kv-cache-dtype fp8 --dtype auto \
    --reasoning-parser qwen3 \
    --speculative-config '{"method":"qwen3_5_mtp","num_speculative_tokens":3}' \
    --chat-template /model/chat_template.jinja \
    --enable-auto-tool-choice --tool-call-parser qwen3_xml \
    --host 0.0.0.0 --port 8000

Smoke test:

bash
curl -s localhost:8000/v1/chat/completions -H 'Content-Type: application/json' -d '{
  "model":"huihui-qwen36-27b-local",
  "messages":[{"role":"user","content":"Name three famous sights in Tokyo, briefly."}],
  "max_tokens":512, "temperature":0.7, "top_k":20, "top_p":0.95}' | jq .

Hardware & requirements

4× NVIDIA RTX PRO 2000 Blackwell (16 GB each, SM120), PCIe (no NVLink).
Docker + NVIDIA Container Toolkit. The pre-built vllm/vllm-openai:v0.22.0 image carries vLLM ≥0.22 with NVFP4/modelopt + FlashInfer FP4 kernels and the qwen3_5 + MTP code.
TP=4 sharding is clean: heads 24, KV heads 4, hidden 5120, intermediate 17408 — all ÷4.
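
The divisibility claim is easy to verify by hand. The sketch below does the same arithmetic in shell for any candidate TP size. It is only a sanity check on the numbers quoted in this README, not a reproduction of vLLM's own validation logic, and the function name `check_tp_divisible` is made up for illustration.

```shell
#!/usr/bin/env bash
# Check that every listed model dimension divides evenly by the TP size.
# Usage: check_tp_divisible <tp_size> <dim>...
check_tp_divisible() {
  local tp="$1"; shift
  local dim
  for dim in "$@"; do
    if [ $((dim % tp)) -ne 0 ]; then
      echo "dimension ${dim} is not divisible by TP=${tp}" >&2
      return 1
    fi
  done
  echo "all dimensions divisible by TP=${tp}"
}

# Values quoted in this README: heads, KV heads, hidden size, intermediate size.
check_tp_divisible 4 24 4 5120 17408
```

Running the same check with a TP size of 8 fails on the 4 KV heads, which is one reason this simple test would flag 8-way sharding as not clean for this model.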

Bare metal (no container) also works: pip install vllm (qwen3_5 support arrived in 0.21), install a CUDA 13.x toolchain for the SM120 Triton/NVFP4 kernels, then run vllm serve with the same flags.

Flags, and why

flag	value	why
`--quantization modelopt`	required	checkpoint is NVFP4 (`hf_quant_config.json`); without it weights read as garbage.
`--trust-remote-code`	recommended	`qwen3_5` multimodal config.
`--tensor-parallel-size`	`4`	model needs ~7.2 GiB/GPU; 4× 16 GB is the design point.

Sampling (Qwen default, generation_config.json): temperature=0.7, top_k=20, top_p=0.95. It is a reasoning model — give it room (max_tokens ≥ 512).

Docker package (bundled)

compose.yaml · entrypoint.sh · run.sh · Dockerfile. The compose defaults to the official image + a mounted entrypoint.sh (build-free). Every flag is env-overridable:

bash
CUDA_VISIBLE_DEVICES=0,1,2,3 ./run.sh up        # start on those 4 GPUs
MAX_MODEL_LEN=262144 MAX_NUM_SEQS=8 ./run.sh up # 256K long-context mode
MAX_MODEL_LEN=131072 MAX_NUM_SEQS=1 ./run.sh up # 128K single-request benchmark
SPEC_TOKENS=0 ./run.sh up                       # disable MTP speculative decoding
ENABLE_TOOLS=0 ./run.sh up                      # pure chat (no tool parser)
PORT=8001 ./run.sh up                           # serve on a different host port
./run.sh logs                                   # tail the server log
./run.sh down                                   # stop the server

Env knobs: PORT, MAX_MODEL_LEN, MAX_NUM_SEQS, MAX_NUM_BATCHED_TOKENS, GPU_MEM_UTIL, KV_CACHE_DTYPE, SPEC_TOKENS, TP_SIZE, ENABLE_TOOLS, REASONING_PARSER, TOOL_CALL_PARSER, CUDA_VISIBLE_DEVICES, VLLM_IMAGE. The model weights are mounted read-only (. → /model); the image carries only the runtime. shm_size: 32g is set (vLLM V1 uses a lot of shared memory).
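
To make the env-override idea concrete, here is a sketch of how an entrypoint could turn these knobs into vllm arguments. It is an illustration only and is not the bundled entrypoint.sh: the function name `build_vllm_args` is hypothetical, the defaults are copied from the raw docker run command earlier in this README, and the handling of SPEC_TOKENS=0 and ENABLE_TOOLS=0 is an assumption based on the comments in the examples above.

```shell
#!/usr/bin/env bash
# Illustrative only: NOT the bundled entrypoint.sh.
# Prints one vllm argument per line, built from env vars with fallbacks.
build_vllm_args() {
  local spec="${SPEC_TOKENS:-3}"
  set -- serve /model \
    --tensor-parallel-size "${TP_SIZE:-4}" \
    --max-model-len "${MAX_MODEL_LEN:-65536}" \
    --max-num-seqs "${MAX_NUM_SEQS:-8}" \
    --max-num-batched-tokens "${MAX_NUM_BATCHED_TOKENS:-16384}" \
    --gpu-memory-utilization "${GPU_MEM_UTIL:-0.85}" \
    --kv-cache-dtype "${KV_CACHE_DTYPE:-fp8}" \
    --reasoning-parser "${REASONING_PARSER:-qwen3}"
  # Assumed behaviour: SPEC_TOKENS=0 drops the speculative config entirely.
  if [ "$spec" -gt 0 ]; then
    set -- "$@" --speculative-config \
      "{\"method\":\"qwen3_5_mtp\",\"num_speculative_tokens\":${spec}}"
  fi
  # Assumed behaviour: ENABLE_TOOLS=0 omits the tool-call flags.
  if [ "${ENABLE_TOOLS:-1}" != "0" ]; then
    set -- "$@" --enable-auto-tool-choice \
      --tool-call-parser "${TOOL_CALL_PARSER:-qwen3_xml}"
  fi
  printf '%s\n' "$@"
}
```

Calling `MAX_MODEL_LEN=262144 build_vllm_args` prints the argument list with only the context length changed, which is a quick way to preview what a given combination of overrides would do before starting the container.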

To build a self-contained image instead: uncomment the build: block in compose.yaml and run ./run.sh rebuild (the Dockerfile just pip-installs vLLM on a CUDA 13.1 base).

Benchmark results (RTX PRO 2000 Blackwell ×4, TP=4, MTP n=3)

Conditions: 512 output tokens, ~350-token prompt, --kv-cache-dtype fp8, --gpu-memory-utilization 0.85.

64K context

Req	Aggregate	Per-req		Req	Aggregate	Per-req
1	81.0 t/s	81.0		14	669.9 t/s	49.5
2	134.0 t/s	67.0		16	720.2 t/s	46.1

256K context (1→16 req)

83.3 → 131.4 → 203.8 → 269.7 → 376.0 → 442.0 → 516.4 → 618.8 → 666.4 → 701.3 t/s (per-req 83 → 45). 256K tracks 64K almost exactly — the hybrid KV (16/64 full + 48/64 linear attention) stays cheap at length.

Takeaways: peak throughput is ~880 tok/s at 24 concurrent requests (64K), decaying past 28. Long context is nearly free: 256K runs 16-way without OOM. For 256K use --max-model-len 262144 --max-num-seqs 8; a single 128K request (--max-num-seqs 1) runs at ~83.9 tok/s.
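
One way to read these numbers is to compare aggregate throughput against perfect linear scaling from the single-request rate. The sketch below does that arithmetic with awk. The "scaling efficiency" framing and the function name are this example's own, not a metric reported by the benchmark, and the inputs are rows from the 64K table above.

```shell
#!/usr/bin/env bash
# scaling_efficiency <single_req_tps> <num_requests> <aggregate_tps>
# Prints aggregate throughput as a percentage of perfect linear scaling.
scaling_efficiency() {
  awk -v single="$1" -v n="$2" -v agg="$3" \
    'BEGIN { printf "%.1f\n", 100 * agg / (n * single) }'
}

# 64K rows from the table above.
scaling_efficiency 81.0 2 134.0    # 2 concurrent requests  -> 82.7
scaling_efficiency 81.0 16 720.2   # 16 concurrent requests -> 55.6
```

Under this reading, 16-way batching delivers roughly nine times the single-request throughput while each individual request still runs at more than half its solo speed.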

Rituals (gotchas)

1. Kill zombie GPU procs: a failed or cancelled launch leaves workers in VRAM. List them with nvidia-smi --query-compute-apps=pid,process_name --format=csv,noheader, then kill -9 <Worker_TP* PIDs>.
2. First launch is slow: torch.compile + Triton + NVFP4 warmup takes ≈2 min. Wait for Application startup complete / Uvicorn running on http://0.0.0.0:8000.
3. gpu-memory-utilization must exceed real usage: a clean start uses ≈7.2 GiB/GPU; with 0.85 vLLM targets ~13.2 GiB, leaving ~6 GiB for KV cache. A "Free memory < desired…" error means a residual allocation from a previous run (see #1).
4. Concurrent NCCL init can hang: bringing up two TP servers at once may leave one spinning at NCCL init (GPUs stuck at ~370 MiB / 100% util / low watts). Start them one at a time, or set NCCL_P2P_DISABLE=1 for the smaller group.
5. MTP acceptance: num_speculative_tokens>1 reuses one MTP layer per step; higher values trade acceptance for draft depth. The value used throughout this README, 3, is a good default here.
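
For the zombie-process cleanup, picking the PIDs out of the nvidia-smi listing can be scripted. The sketch below is a hedged example: the helper name `filter_worker_pids` is hypothetical, and it assumes the leftover workers show up with Worker_TP somewhere in the process name, as the kill hint above suggests. It reads the CSV on stdin so it can be tried on sample text before being pointed at real processes.

```shell
#!/usr/bin/env bash
# Read "pid, process_name" CSV rows on stdin and print the PIDs of rows
# that contain the given pattern (default: Worker_TP).
filter_worker_pids() {
  local pattern="${1:-Worker_TP}"
  awk -F',' -v pat="$pattern" '
    index($0, pat) > 0 { gsub(/[[:space:]]/, "", $1); print $1 }'
}

# Review the candidate list first:
#   nvidia-smi --query-compute-apps=pid,process_name --format=csv,noheader \
#     | filter_worker_pids
# Then, once the list looks right (the -r flag assumes GNU xargs):
#   nvidia-smi --query-compute-apps=pid,process_name --format=csv,noheader \
#     | filter_worker_pids | xargs -r kill -9
```

Keeping the listing and the kill as separate steps avoids terminating an unrelated GPU job whose name happens to match.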

OpenCode provider

jsonc
// ~/.config/opencode/opencode.jsonc
{
  "provider": {
    "local-vllm": {
      "npm": "@ai-sdk/openai-compatible",
      "name": "Local vLLM",
      "options": { "baseURL": "http://127.0.0.1:8000/v1", "apiKey": "EMPTY" },
      "models": {
        "huihui-qwen36-27b-local": {
          "name": "Huihui Qwen3.6 27B NVFP4 MTP Local",
          "reasoning": true, "tool_call": true, "temperature": true,
          "limit": { "context": 65536, "output": 8192 }
        }
      }
    }
  },
  "model": "local-vllm/huihui-qwen36-27b-local",
  "small_model": "local-vllm/huihui-qwen36-27b-local"
}

What's inside

Quantized → NVFP4 (modelopt 0.43, W4A4, group 16): the Linear layers. The lm_head, conv/short-conv, routers and the MTP embedding are kept at higher precision (ignore list in config.json / hf_quant_config.json).
MTP draft head (mtp_num_hidden_layers: 1) → speculative decoding via vLLM.
Files: model.safetensors (~20 GB), config.json, hf_quant_config.json, chat_template.jinja, tokenizer, and this Docker package.

Model Details

Model Provider

sakamakismile

Model Tree

Base

this model

Input Modalities

Text, Image, Video

Output Modalities

Text

Supported Functionality