WaveCut
Qwopus3.6-27B-Coder-FP8-W4A16-G64-RTN-vllm
Run this model inference on single tenant GPU with unmatched speed and reliability at scale.
Get help setting up a custom Dedicated Endpoints.
Talk with our engineer to get a quote for reserved GPU instances with discounts.
README
License: apache-2.0vLLM
bash
vllm serve WaveCut/Qwopus3.6-27B-Coder-FP8-W4A16-G64-RTN-vllm \--dtype bfloat16 \--max-model-len 4096 \--gpu-memory-utilization 0.85 \--trust-remote-code \--speculative-config '{"method":"mtp","num_speculative_tokens":1}'
For long-context serving, raise --max-model-len according to your KV-cache budget.
vLLM CUDA 13 Smoke and Benchmarks
Smoke and throughput checks were run on 2026-06-14 with vllm 0.23.0, torch 2.11.0+cu130, Python 3.12.3, one NVIDIA B200, and NVIDIA driver 580.105.08. CUDA Toolkit release notes document per-release minimum driver requirements; in this run, a B200 host with driver 570.* failed CUDA 13 initialization, while driver 580.105.08 worked.
The working RunPod image was runpod/pytorch:1.0.3-cu1300-torch291-ubuntu2404 (cu13-pytorch2.9, template 0uy1f6v18r). After vLLM install, nvidia-cutlass-dsl-libs-cu13 was force-reinstalled once to fix a CUTLASS RECORD mismatch; after that vLLM used the FlashInfer GDN prefill kernel.
vLLM resolved this model as Qwen3_5ForConditionalGeneration, loaded compressed-tensors, used MarlinLinearKernel for CompressedTensorsWNA16, and completed generation. MTP speculative decoding resolved Qwen3_5MTP and completed generation, but vLLM emitted missing-parameter warnings for several drafter params (fc.weight, MLP and attention weights) even though mtp.* tensors are present in model_extra_tensors.safetensors. Treat MTP/speculative performance on this package as experimental pending vLLM loader/layout follow-up.
Benchmarks used vllm bench throughput, fixed random prompts, max_model_len=8192, tensor parallel size 1, and local model files on overlay disk. TPS values are vLLM timed-section values; wall time includes model load, compile, CUDA graph capture, and warmup.
| case | input -> output | prompts | gpu util | mode | total tok/s | prompt tok/s est | output tok/s est | peak VRAM GiB | max W |
|---|---|---|---|---|---|---|---|---|---|
| balanced_graph_u65 | 1024 -> 128 | 64 | 0.65 | graph | 6394.8 | 5684.2 | 710.5 | 118.0 | 863.2 |
| prefill_graph_u65 | 4096 -> 16 | 32 | 0.65 | graph | 7487.0 | 7457.9 | 29.1 | 117.6 | 870.0 |
| decode_graph_u65 | 128 -> 256 | 64 | 0.65 | graph | 4257.9 | 1419.3 | 2838.6 | 116.6 | 827.9 |
| balanced_eager_u65 | 1024 -> 128 | 32 | 0.65 | eager | 2218.2 | 1971.7 | 246.5 | 118.2 | 836.4 |
| balanced_graph_u85 | 1024 -> 128 | 64 | 0.85 | graph | 6635.3 | 5898.0 | 737.3 | 153.8 | 862.1 |
| balanced_mtp_u65 | 1024 -> 128 | 32 | 0.65 | graph + MTP | 4759.1 | 4230.3 | 528.8 | 118.1 | 856.8 |
First graph runs had cold costs around 77-80 seconds for torch.compile plus CUDA graph capture/profile. Repeated same-layout graph runs loaded the compile cache much faster. Eager mode was substantially slower than graph mode on this workload.
24GB RTX 3090 vLLM Smoke
A small fit smoke was run on 2026-06-15 Europe/Warsaw / 2026-06-14 UTC on one RTX 3090 24GB RunPod host with NVIDIA driver 580.159.03 (nvidia-smi CUDA 13.0), vllm 0.23.0, torch 2.11.0+cu128, and runpod/pytorch:1.0.2-cu1281-torch280-ubuntu2404.
The smoke used max_model_len=32768, kv_cache_dtype=fp8, dtype=bfloat16, max_num_seqs=1, max_num_batched_tokens=2048, chunked prefill enabled, prefix caching disabled, load_format=safetensors, and one 128 -> 16 random request.
| mode | result | peak VRAM | KV cache | 32k concurrency | smoke throughput |
|---|---|---|---|---|---|
| no MTP | pass | 21464 MiB | 61440 tokens | 1.88x | 48.59 total tok/s, 5.40 output tok/s |
| MTP-1 | pass with warnings | 24004 MiB | 53399 tokens | 1.63x | 29.26 total tok/s, 3.25 output tok/s |
Recommended 24GB command shape:
bash
vllm serve WaveCut/Qwopus3.6-27B-Coder-FP8-W4A16-G64-RTN-vllm \--dtype bfloat16 \--max-model-len 32768 \--kv-cache-dtype fp8 \--gpu-memory-utilization 0.95 \--max-num-seqs 1 \--max-num-batched-tokens 2048 \--enable-chunked-prefill \--no-enable-prefix-caching \--load-format safetensors
For MTP-1 on 24GB, add:
bash
--speculative-config '{"method":"mtp","num_speculative_tokens":1}'
MTP-1 fit and generation completed with rc=0, but vLLM again emitted missing-parameter warnings for the compressed-tensors MTP drafter layout. Treat RTN MTP quality/performance as experimental until that loader/layout issue is fixed.
Model provider
WaveCut
Model tree
Base
Jackrong/Qwopus3.6-27B-Coder-FP8
Quantized
this model
Modalities
Input
Video, Text, Image
Output
Text
Pricing
Dedicated Endpoints
View detailsSupported Functionality
Model APIs
Dedicated Endpoints
Container
More information