WaveCut

Qwopus3.6-27B-Coder-FP8-int4-AutoRound

License: apache-2.0

vLLM

bash
vllm serve WaveCut/Qwopus3.6-27B-Coder-FP8-int4-AutoRound \
  --dtype bfloat16 \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.85 \
  --trust-remote-code \
  --speculative-config '{"method":"mtp","num_speculative_tokens":1}'

For long-context serving, raise --max-model-len according to your KV-cache budget.
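
Once the server is up, a single request is enough to confirm that it answers. The sketch below assumes vLLM's OpenAI-compatible `/v1/completions` route on the default `localhost:8000`; the host, port, prompt, and the helper names `build_payload` and `smoke_request` are illustrative assumptions and were not part of the runs described in this README.

```bash
#!/usr/bin/env bash
# Hypothetical helpers for a quick request against a running `vllm serve`.
MODEL="WaveCut/Qwopus3.6-27B-Coder-FP8-int4-AutoRound"
BASE_URL="${BASE_URL:-http://localhost:8000}"  # assumption: vLLM default host and port

# Print a minimal JSON body for the completions route.
# The prompt is inserted verbatim, so it must not contain quotes or backslashes.
build_payload() {
  local prompt="$1" max_tokens="${2:-16}"
  printf '{"model":"%s","prompt":"%s","max_tokens":%d}\n' "$MODEL" "$prompt" "$max_tokens"
}

# Send the request. This needs a running server, so it is only defined here.
smoke_request() {
  curl -sS "$BASE_URL/v1/completions" \
    -H 'Content-Type: application/json' \
    -d "$(build_payload "$1" "${2:-16}")"
}

# Show the body that would be sent.
build_payload "def fib(n):" 16
```

With the server running, `smoke_request "def fib(n):" 16` sends the same body and prints the JSON response.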

vLLM CUDA 13 Smoke and Benchmarks

Smoke and throughput checks were run on 2026-06-14 with vllm 0.23.0, torch 2.11.0+cu130, Python 3.12.3, one NVIDIA B200, and NVIDIA driver 580.105.08. CUDA Toolkit release notes document per-release minimum driver requirements; in this run, a B200 host with driver 570.* failed CUDA 13 initialization, while driver 580.105.08 worked.
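
A host can be checked against the driver that worked here before installing anything. This is a minimal sketch: the threshold of 580 reflects only what was observed in this run (570.x failed, 580.105.08 worked), not an official minimum, and the `driver_ok` helper name is hypothetical.

```bash
#!/usr/bin/env bash
# Threshold taken from this run only; consult the CUDA Toolkit release notes
# for the documented minimum driver of the CUDA release you install.
MIN_DRIVER_MAJOR=580

# Return 0 if the driver's major version is at least MIN_DRIVER_MAJOR.
driver_ok() {
  local version="$1"
  local major="${version%%.*}"   # "580.105.08" -> "580"
  [[ "$major" =~ ^[0-9]+$ ]] && [ "$major" -ge "$MIN_DRIVER_MAJOR" ]
}

# Read the installed driver version from the first GPU, if nvidia-smi exists.
if command -v nvidia-smi >/dev/null 2>&1; then
  installed="$(nvidia-smi --query-gpu=driver_version --format=csv,noheader | head -n1)"
  if driver_ok "$installed"; then
    echo "driver $installed: at least as new as the driver that worked in this run"
  else
    echo "driver $installed: older than the driver that worked in this run"
  fi
fi
```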

The working RunPod image was runpod/pytorch:1.0.3-cu1300-torch291-ubuntu2404 (cu13-pytorch2.9, template 0uy1f6v18r). After installing vLLM, nvidia-cutlass-dsl-libs-cu13 had to be force-reinstalled once to fix a CUTLASS RECORD mismatch; vLLM then used the FlashInfer GDN prefill kernel.

vLLM resolved this model as Qwen3_5ForConditionalGeneration, loaded the AutoRound/AutoGPTQ path with MarlinLinearKernel for AutoGPTQLinearMethod, and completed generation. MTP speculative decoding resolved Qwen3_5MTP, loaded without missing-parameter warnings, shared embedding/lm_head with the draft model, and completed generation.

Benchmarks used vllm bench throughput, fixed random prompts, max_model_len=8192, tensor parallel size 1, and local model files on overlay disk. TPS values are vLLM timed-section values; wall time includes model load, compile, CUDA graph capture, and warmup.

case	input -> output	prompts	gpu util	mode	total tok/s	prompt tok/s est	output tok/s est	peak VRAM GiB	max W
balanced_graph_u65	1024 -> 128	64	0.65	graph	6369.6	5661.9	707.7	117.6	850.4
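
The two "est" columns are consistent with splitting the total tok/s by the input:output token ratio of the workload. The sketch below reproduces the row above under that assumption; the `split_tps` helper is hypothetical and is not the script used for the benchmark.

```bash
#!/usr/bin/env bash
# Split a total tok/s figure in proportion to input and output token counts.
# Assumption: this is how the "est" columns were derived; it matches the row above.
split_tps() {
  local total="$1" input_len="$2" output_len="$3"
  awk -v t="$total" -v i="$input_len" -v o="$output_len" \
    'BEGIN { printf "%.1f %.1f\n", t * i / (i + o), t * o / (i + o) }'
}

# balanced_graph_u65: 6369.6 total tok/s on a 1024 -> 128 workload.
split_tps 6369.6 1024 128   # prints "5661.9 707.7" (prompt est, output est)
```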

The first graph-mode runs paid a cold-start cost of roughly 77-80 seconds for torch.compile plus CUDA graph capture and profiling. Repeated graph runs with the same layout reused the compile cache and started much faster. Eager mode was substantially slower than graph mode on this workload.

24GB RTX 3090 vLLM Smoke

A small fit smoke was run on 2026-06-14 on one RTX 3090 24GB RunPod host with NVIDIA driver 580.159.03 (nvidia-smi CUDA 13.0), vllm 0.23.0, torch 2.11.0+cu128, and runpod/pytorch:1.0.2-cu1281-torch280-ubuntu2404.

The smoke used max_model_len=32768, kv_cache_dtype=fp8, dtype=bfloat16, max_num_seqs=1, max_num_batched_tokens=2048, chunked prefill enabled, prefix caching disabled, and one 128 -> 16 random request. The vLLM Qwen3.5/Qwen3.6 recipe recommends MTP-1 speculative decoding with prefix caching disabled for latency-sensitive low-concurrency serving.

mode	load format	result	peak VRAM	KV cache	32k concurrency	smoke throughput
no MTP	`fastsafetensors`	pass	22174 MiB	64170 tokens	1.96x	50.33 total tok/s, 5.59 output tok/s
MTP-1	`safetensors`	pass	24110 MiB
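
The "32k concurrency" figure in the no-MTP row matches the KV-cache capacity divided by max_model_len, that is, how many full-length 32768-token sequences the cache could hold at once. A minimal sketch of that arithmetic, with a hypothetical `kv_concurrency` helper:

```bash
#!/usr/bin/env bash
# Ratio of KV-cache capacity (in tokens) to the maximum sequence length.
# Assumption: this is the definition behind the "32k concurrency" column;
# it reproduces the no-MTP row.
kv_concurrency() {
  local kv_tokens="$1" max_model_len="$2"
  awk -v k="$kv_tokens" -v m="$max_model_len" 'BEGIN { printf "%.2f\n", k / m }'
}

kv_concurrency 64170 32768   # prints 1.96
```

A value below 1.00 would mean that a single request at the full context length no longer fits in the KV cache.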

Recommended 24GB command shape:

bash
vllm serve WaveCut/Qwopus3.6-27B-Coder-FP8-int4-AutoRound \
  --dtype bfloat16 \
  --max-model-len 32768 \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.95 \
  --max-num-seqs 1 \
  --max-num-batched-tokens 2048 \
  --enable-chunked-prefill \
  --no-enable-prefix-caching \
  --load-format safetensors

For MTP-1 on 24GB, keep --load-format safetensors and add:

bash
--speculative-config '{"method":"mtp","num_speculative_tokens":1}'

Provenance

This repo was generated from the public Apache-2.0 source checkpoint. It keeps the upstream tokenizer, processor, chat template, vision config, and Qwen3.5 MTP config intact.

Model Details

Model Provider

WaveCut

Model Tree

Base

Jackrong/Qwopus3.6-27B-Coder-FP8

Quantized

this model

Input Modalities

Text

Image

Video

Output Modalities