License: apache-2.0
🌟 Qwen3.6-27B NVFP4 Quantization by NeuralNet 🧠🤖
This is an NVFP4-quantized version of Qwen/Qwen3.6-27B, optimized for deployment on NVIDIA Blackwell architecture GPUs using vLLM.
[!IMPORTANT] NVFP4 quantization requires NVIDIA Blackwell architecture (GB200, RTX 5000 series, etc.). This format is not compatible with Ampere, Ada Lovelace, or Hopper GPUs. If you are running on an older GPU, please use a different quantization format.
Original model: https://huggingface.co/Qwen/Qwen3.6-27B
⚡ Deployment with vLLM
This quantized model is intended to be served using vLLM (vllm>=0.9.0 recommended).
Quick Start
```bash
vllm serve NeuralNet-Hub/Qwen3.6-27B-NVFP4 \
  --quantization nvfp4 \
  --dtype bfloat16 \
  --kv-cache-dtype fp8 \
  --max-model-len 262144 \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder
```
Using a Config File
```yaml
# Deploy with: vllm serve --config config.yaml

# Model config
model: NeuralNet-Hub/Qwen3.6-27B-NVFP4
dtype: bfloat16
kv-cache-dtype: fp8
gpu-memory-utilization: 0.95
max-model-len: 262144
max-num-batched-tokens: 4096
max-num-seqs: 200
max-cudagraph-capture-size: 209
enable-prefix-caching: true
trust-remote-code: true

# template parser
reasoning-parser: qwen3
enable-auto-tool-choice: true
tool-call-parser: qwen3_coder

# Optional
default-chat-template-kwargs: '{"enable_thinking": false}'
download-dir: /workspace/models
host: 0.0.0.0
port: 18000
```
```bash
vllm serve --config config.yaml
```
💬 Chat API Usage
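Once the model is deployed with vLLM, you can chat with Qwen3.6 through the OpenAI-compatible chat completions API. Thinking mode is enabled by default, unless the server was started with the optional default-chat-template-kwargs line from the config file above, which turns it off. The examples below use port 18000, as set in that config file; the Quick Start command does not set a port, so vLLM falls back to its default of 8000.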
Thinking Mode (Default)
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:18000/v1", api_key="EMPTY")

messages = [{"role": "user", "content": "Your message here"}]

response = client.chat.completions.create(
    model="NeuralNet-Hub/Qwen3.6-27B-NVFP4",
    messages=messages,
    max_tokens=32768,
    temperature=1.0,
    top_p=0.95,
    extra_body={"top_k": 20},
)
print(response.choices[0].message.content)
```
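With a reasoning parser enabled on the server, the thinking trace is normally returned separately from the final answer. The sketch below shows one hedged way to read both. The field names `reasoning_content` and `reasoning` are assumptions that depend on the vLLM version, and `split_reasoning` is a hypothetical helper, not part of any library.

```python
from types import SimpleNamespace
from typing import Any, Optional, Tuple


def split_reasoning(message: Any) -> Tuple[Optional[str], Optional[str]]:
    """Return (reasoning, answer) from a chat completion message.

    Assumption: the server exposes the thinking trace as `reasoning_content`
    (or `reasoning` in some versions). Missing fields come back as None.
    """
    reasoning = getattr(message, "reasoning_content", None) or getattr(
        message, "reasoning", None
    )
    answer = getattr(message, "content", None)
    return reasoning, answer


# Stand-in object so the sketch runs without a server; with a live server,
# pass response.choices[0].message instead.
demo = SimpleNamespace(reasoning_content="2 + 2 is 4.", content="4")
reasoning, answer = split_reasoning(demo)
print("reasoning:", reasoning)
print("answer:", answer)
```

If both fields are absent, the parser may not be active and the trace may be embedded in `content` instead.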
Non-Thinking (Instruct) Mode
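```python
# Reuses `client` and `messages` from the thinking-mode example.
response = client.chat.completions.create(
    model="NeuralNet-Hub/Qwen3.6-27B-NVFP4",
    messages=messages,
    max_tokens=8192,
    temperature=0.7,
    top_p=0.8,
    presence_penalty=1.5,
    extra_body={
        "top_k": 20,
        # Disable thinking for this request only.
        "chat_template_kwargs": {"enable_thinking": False},
    },
)
print(response.choices[0].message.content)
```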
Image Input
```python
# Reuses `client` from the thinking-mode example.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
            {"type": "text", "text": "Describe this image in detail."},
        ],
    }
]

response = client.chat.completions.create(
    model="NeuralNet-Hub/Qwen3.6-27B-NVFP4",
    messages=messages,
    max_tokens=32768,
    temperature=1.0,
    top_p=0.95,
    extra_body={"top_k": 20},
)
print(response.choices[0].message.content)
```
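The example above points at a remote URL. For a local file, one common approach is to inline the image as a base64 data URL in the same `image_url` field. The sketch below assumes the server accepts data URLs; `image_to_data_url` and `image_message` are hypothetical helper names.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_url(path: str) -> str:
    """Encode a local image file as a base64 data URL."""
    mime, _ = mimetypes.guess_type(path)
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"Cannot infer an image MIME type from {path!r}")
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def image_message(path: str, prompt: str) -> dict:
    """Build a user message with one local image followed by a text prompt."""
    return {
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": image_to_data_url(path)}},
            {"type": "text", "text": prompt},
        ],
    }
```

Usage would look like `messages = [image_message("photo.jpg", "Describe this image in detail.")]`, passed to the same `client.chat.completions.create` call.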
⚙️ Recommended Sampling Parameters
| Mode | temperature | top_p | top_k | presence_penalty |
|---|---|---|---|---|
| Thinking — general tasks | 1.0 | 0.95 | 20 | 0.0 |
| Thinking — precise coding | 0.6 | 0.95 | 20 | 0.0 |
| Instruct (non-thinking) | 0.7 | 0.80 | 20 | 1.5 |
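To avoid retyping these values per request, the table can be kept as a small lookup. This is a minimal sketch: the preset names and the `sampling_kwargs` helper are hypothetical, and `top_k` is routed through `extra_body` as in the examples above.

```python
from typing import Any, Dict

# Values copied from the table above; the preset names are hypothetical labels.
SAMPLING_PRESETS: Dict[str, Dict[str, float]] = {
    "thinking_general": {"temperature": 1.0, "top_p": 0.95, "top_k": 20, "presence_penalty": 0.0},
    "thinking_coding": {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "presence_penalty": 0.0},
    "instruct": {"temperature": 0.7, "top_p": 0.80, "top_k": 20, "presence_penalty": 1.5},
}


def sampling_kwargs(preset: str) -> Dict[str, Any]:
    """Return keyword arguments for client.chat.completions.create()."""
    params = dict(SAMPLING_PRESETS[preset])  # copy so the preset table is not mutated
    # top_k is not a standard OpenAI parameter, so it travels in extra_body.
    extra_body: Dict[str, Any] = {"top_k": int(params.pop("top_k"))}
    if preset == "instruct":
        # Non-thinking mode is selected per request via the chat template.
        extra_body["chat_template_kwargs"] = {"enable_thinking": False}
    return {**params, "extra_body": extra_body}
```

A call would then read `client.chat.completions.create(model=..., messages=messages, max_tokens=8192, **sampling_kwargs("instruct"))`.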
🔧 Hardware Requirements
| Component | Requirement |
|---|---|
| GPU Architecture | NVIDIA Blackwell (sm_100+) |
| VRAM | 24 GB+ recommended |
| CUDA | 12.8+ |
| vLLM | 0.9.0+ |
[!WARNING] NVFP4 is exclusively supported on NVIDIA Blackwell GPUs. Attempting to run this model on Ampere (A100), Ada Lovelace (RTX 4000), or Hopper (H100) will fail. For those architectures, use the original BF16 model or an AWQ/GPTQ quantized variant.
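A quick pre-flight check can catch an unsupported GPU before the model is downloaded. The sketch below reads "sm_100+" from the table as compute capability major version 10 or higher, which is an assumption, and uses PyTorch only if it is installed. Both helper names are hypothetical.

```python
from typing import Optional, Tuple

# Assumption: "sm_100+" means a compute capability major version of at least 10.
MIN_BLACKWELL_MAJOR = 10


def supports_nvfp4(capability: Tuple[int, int]) -> bool:
    """Return True if a (major, minor) compute capability meets sm_100+."""
    major, _minor = capability
    return major >= MIN_BLACKWELL_MAJOR


def local_gpu_capability() -> Optional[Tuple[int, int]]:
    """Return the first visible GPU's compute capability, or None if unavailable."""
    try:
        import torch  # imported lazily so the check also runs without PyTorch
    except ImportError:
        return None
    if not torch.cuda.is_available():
        return None
    return torch.cuda.get_device_capability(0)


if __name__ == "__main__":
    cap = local_gpu_capability()
    if cap is None:
        print("No CUDA GPU detected (or PyTorch is not installed).")
    elif supports_nvfp4(cap):
        print(f"sm_{cap[0]}{cap[1]}: meets the Blackwell requirement.")
    else:
        print(f"sm_{cap[0]}{cap[1]}: pre-Blackwell GPU, use a different quantization format.")
```

This checks only the architecture; VRAM, CUDA, and vLLM versions from the table still need to be verified separately.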
🌐 Contact Us
NeuralNet is a pioneering AI solutions provider that empowers businesses to harness the power of artificial intelligence.
Website: https://neuralnet.solutions
Email: info[at]neuralnet.solutions
Model provider: NeuralNet-Hub
Model tree: Qwen/Qwen3.6-27B (base), this model (quantized)
Modalities: Video, Text, Image (input); Text (output)