WaveCut/Qwopus3.6-27B-Coder-FP8-W4A16-G64-RTN-vllm API & Inference Endpoint

vLLM

bash
vllm serve WaveCut/Qwopus3.6-27B-Coder-FP8-W4A16-G64-RTN-vllm \
  --dtype bfloat16 \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.85 \
  --trust-remote-code \
  --speculative-config '{"method":"mtp","num_speculative_tokens":1}'

For long-context serving, raise --max-model-len according to your KV-cache budget.

vLLM CUDA 13 Smoke and Benchmarks

Smoke and throughput checks were run on 2026-06-14 with vllm 0.23.0, torch 2.11.0+cu130, Python 3.12.3, one NVIDIA B200, and NVIDIA driver 580.105.08. CUDA Toolkit release notes document per-release minimum driver requirements; in this run, a B200 host with driver 570.* failed CUDA 13 initialization, while driver 580.105.08 worked.

The working RunPod image was runpod/pytorch:1.0.3-cu1300-torch291-ubuntu2404 (cu13-pytorch2.9, template 0uy1f6v18r). After vLLM install, nvidia-cutlass-dsl-libs-cu13 was force-reinstalled once to fix a CUTLASS RECORD mismatch; after that vLLM used the FlashInfer GDN prefill kernel.

vLLM resolved this model as Qwen3_5ForConditionalGeneration, loaded compressed-tensors, used MarlinLinearKernel for CompressedTensorsWNA16, and completed generation. MTP speculative decoding resolved Qwen3_5MTP and completed generation, but vLLM emitted missing-parameter warnings for several drafter params (fc.weight, MLP and attention weights) even though mtp.* tensors are present in model_extra_tensors.safetensors. Treat MTP/speculative performance on this package as experimental pending vLLM loader/layout follow-up.

Benchmarks used vllm bench throughput, fixed random prompts, max_model_len=8192, tensor parallel size 1, and local model files on overlay disk. TPS values are vLLM timed-section values; wall time includes model load, compile, CUDA graph capture, and warmup.

Table
case	input -> output	prompts	gpu util	mode	total tok/s	prompt tok/s est	output tok/s est	peak VRAM GiB	max W
balanced_graph_u65	1024 -> 128	64	0.65	graph	6394.8	5684.2	710.5	118.0	863.2
prefill_graph_u65	4096 -> 16	32	0.65	graph	7487.0	7457.9	29.1	117.6	870.0
decode_graph_u65	128 -> 256	64	0.65	graph	4257.9	1419.3	2838.6	116.6	827.9
balanced_eager_u65	1024 -> 128	32	0.65	eager	2218.2	1971.7	246.5	118.2	836.4
balanced_graph_u85	1024 -> 128	64	0.85	graph	6635.3	5898.0	737.3	153.8	862.1
balanced_mtp_u65	1024 -> 128	32	0.65	graph + MTP	4759.1	4230.3	528.8	118.1	856.8

First graph runs had cold costs around 77-80 seconds for torch.compile plus CUDA graph capture/profile. Repeated same-layout graph runs loaded the compile cache much faster. Eager mode was substantially slower than graph mode on this workload.

24GB RTX 3090 vLLM Smoke

A small fit smoke was run on 2026-06-15 Europe/Warsaw / 2026-06-14 UTC on one RTX 3090 24GB RunPod host with NVIDIA driver 580.159.03 (nvidia-smi CUDA 13.0), vllm 0.23.0, torch 2.11.0+cu128, and runpod/pytorch:1.0.2-cu1281-torch280-ubuntu2404.

The smoke used max_model_len=32768, kv_cache_dtype=fp8, dtype=bfloat16, max_num_seqs=1, max_num_batched_tokens=2048, chunked prefill enabled, prefix caching disabled, load_format=safetensors, and one 128 -> 16 random request.

Table
mode	result	peak VRAM	KV cache	32k concurrency	smoke throughput
no MTP	pass	21464 MiB	61440 tokens	1.88x	48.59 total tok/s, 5.40 output tok/s
MTP-1	pass with warnings	24004 MiB	53399 tokens	1.63x	29.26 total tok/s, 3.25 output tok/s

Recommended 24GB command shape:

bash
vllm serve WaveCut/Qwopus3.6-27B-Coder-FP8-W4A16-G64-RTN-vllm \
  --dtype bfloat16 \
  --max-model-len 32768 \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.95 \
  --max-num-seqs 1 \
  --max-num-batched-tokens 2048 \
  --enable-chunked-prefill \
  --no-enable-prefix-caching \
  --load-format safetensors

For MTP-1 on 24GB, add:

bash
--speculative-config '{"method":"mtp","num_speculative_tokens":1}'

MTP-1 fit and generation completed with rc=0, but vLLM again emitted missing-parameter warnings for the compressed-tensors MTP drafter layout. Treat RTN MTP quality/performance as experimental until that loader/layout issue is fixed.

Qwopus3.6-27B-Coder-FP8-W4A16-G64-RTN-vllm

Get help setting up a custom Dedicated Endpoints.

README

vLLM

vLLM CUDA 13 Smoke and Benchmarks

24GB RTX 3090 vLLM Smoke

Explore FriendliAI today