License: apache-2.0

🌟 Qwen3.6-35B-A3B NVFP4 Quantization by NeuralNet 🧠🤖

This is an NVFP4-quantized version of Qwen/Qwen3.6-35B-A3B, optimized for deployment on NVIDIA Blackwell architecture GPUs using vLLM.

[!IMPORTANT] NVFP4 quantization requires an NVIDIA Blackwell architecture GPU (GB200, RTX 50 series, etc.). This format is not compatible with Ampere, Ada Lovelace, or Hopper GPUs. If you are running on an older GPU, please use a different quantization format.

Original model: https://huggingface.co/Qwen/Qwen3.6-35B-A3B


Quantization Details

This model was quantized to NVFP4 (4-bit NVIDIA Floating Point) using vLLM's built-in quantization pipeline. NVFP4 stores weights as 4-bit floating-point values with a shared scale per 16-value block and runs on the native FP4 Tensor Cores introduced with Blackwell GPUs. Compared to BF16, the quantized weights take a bit over a quarter of the memory and throughput improves, with minimal quality degradation.

```bash
vllm quantize \
  --model Qwen/Qwen3.6-35B-A3B \
  --quantization nvfp4 \
  --output-dir NeuralNet-Hub/Qwen3.6-35B-A3B-NVFP4
```
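To make the format itself more concrete, here is a minimal pure-Python sketch of block-wise 4-bit rounding in the style of NVFP4. It is an illustration only, not the pipeline used to produce this checkpoint: the block scale stays a Python float (NVFP4 stores it as FP8 and adds a per-tensor scale on top), ties are resolved toward the smaller magnitude, and all function names are hypothetical.

```python
import math

# Magnitudes representable by a 4-bit E2M1 float (1 sign, 2 exponent, 1 mantissa bit).
E2M1_MAGNITUDES = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16  # NVFP4 shares one scale across 16 consecutive weights


def fake_quantize_block(block):
    """Round one block to the E2M1 grid and map it back to real values.

    Simplification: the scale stays a Python float here, whereas NVFP4
    stores it as FP8 (E4M3) and adds a per-tensor FP32 scale on top.
    """
    amax = max(abs(x) for x in block)
    scale = amax / 6.0 if amax > 0 else 1.0  # the largest weight maps to 6.0
    out = []
    for x in block:
        # Nearest representable magnitude; ties go to the smaller value.
        mag = min(E2M1_MAGNITUDES, key=lambda v: abs(abs(x) / scale - v))
        out.append(math.copysign(mag, x) * scale)
    return out


def bits_per_weight(block_size=BLOCK_SIZE, scale_bits=8):
    """4 payload bits plus the block scale amortised over the block."""
    return 4 + scale_bits / block_size


def weight_gigabytes(num_params, bits):
    """Storage for num_params weights at the given bit width, in GB (1e9 bytes)."""
    return num_params * bits / 8 / 1e9
```

Reading the parameter count from the model name (35B total, an assumption), `weight_gigabytes(35e9, 16)` gives 70 GB for BF16 and `weight_gigabytes(35e9, bits_per_weight())` gives about 19.7 GB if every weight were quantized. Real checkpoints usually keep some layers in higher precision, so the actual download is larger than this lower bound.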

⚡ Deployment with vLLM

This quantized model is intended to be served using vLLM (vllm>=0.9.0 recommended).

Quick Start

```bash
vllm serve NeuralNet-Hub/Qwen3.6-35B-A3B-NVFP4 \
  --quantization nvfp4 \
  --dtype bfloat16 \
  --kv-cache-dtype fp8 \
  --max-model-len 262144 \
  --reasoning-parser qwen3 \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder
```
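Once the server has finished loading, listing the served models is a quick way to confirm it is reachable. The sketch below uses only the standard library and assumes the OpenAI-compatible `/v1/models` route on vLLM's default port 8000, since the Quick Start command does not set `--port`. The helper names are hypothetical.

```python
import json
import urllib.request


def served_model_ids(payload):
    """Extract model IDs from a /v1/models response body (OpenAI list format)."""
    return [entry["id"] for entry in payload.get("data", [])]


def list_served_models(base_url="http://localhost:8000/v1", timeout=5.0):
    """Query the running server; raises URLError if it is not reachable yet."""
    with urllib.request.urlopen(f"{base_url}/models", timeout=timeout) as resp:
        return served_model_ids(json.load(resp))
```

Calling `list_served_models()` against a running server should return a list containing `NeuralNet-Hub/Qwen3.6-35B-A3B-NVFP4`. Pass a different `base_url` if the server listens on another port.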

Using a Config File

```yaml
# Deploy with: vllm serve --config config.yaml
# Optimized for NVIDIA RTX PRO 6000 (Blackwell)
# Benchmarked: ~85-90 parallel requests, up to 1000 tok/sec at higher context lengths
model: NeuralNet-Hub/Qwen3.6-35B-A3B-NVFP4
dtype: bfloat16
kv-cache-dtype: fp8
gpu-memory-utilization: 0.95
max-model-len: 262144
max-num-batched-tokens: 4096
max-num-seqs: 200
max-cudagraph-capture-size: 209
enable-prefix-caching: true
trust-remote-code: true
reasoning-parser: qwen3
enable-auto-tool-choice: true
tool-call-parser: qwen3_coder
# Thinking is off by default on this server; requests can override it.
default-chat-template-kwargs: '{"enable_thinking": false}'
download-dir: /workspace/models
host: 0.0.0.0
port: 18000
```

```bash
vllm serve --config config.yaml
```
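The keys in the config file are the long option names of `vllm serve`, so the file and the Quick Start command describe the same kind of invocation. The following sketch makes that correspondence explicit by turning a config mapping into a command line. The function names are hypothetical, and the handling of `false` values is a simplification noted in the comments.

```python
import shlex


def config_to_cli_args(config):
    """Translate a config mapping into arguments for `vllm serve`."""
    config = dict(config)  # work on a copy so the caller's mapping is untouched
    args = []
    model = config.pop("model", None)
    if model is not None:
        args.append(str(model))  # the model is the positional argument
    for key, value in config.items():
        flag = f"--{key}"
        if value is True:
            args.append(flag)  # boolean switches take no value
        elif value is False:
            # Simplification: dropping the switch only matches the YAML
            # when the option's default is already off.
            continue
        else:
            args.extend([flag, str(value)])
    return args


def as_command(config):
    """Render the full shell command, quoting values where needed."""
    return shlex.join(["vllm", "serve", *config_to_cli_args(config)])
```

For example, `as_command({"model": "NeuralNet-Hub/Qwen3.6-35B-A3B-NVFP4", "kv-cache-dtype": "fp8", "enable-prefix-caching": True})` yields a `vllm serve` command with `--kv-cache-dtype fp8 --enable-prefix-caching` appended after the model name.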

💬 Chat API Usage

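Qwen3.6 uses a standard chat template compatible with OpenAI-format APIs. Thinking mode is enabled by default in the chat template. Note that the sample config file above sets default-chat-template-kwargs to {"enable_thinking": false}, which turns thinking off server-wide; on such a server, pass chat_template_kwargs with enable_thinking set to True in the request to turn it back on.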

Thinking Mode (Default)

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:18000/v1", api_key="EMPTY")
messages = [{"role": "user", "content": "Your message here"}]
response = client.chat.completions.create(
    model="NeuralNet-Hub/Qwen3.6-35B-A3B-NVFP4",
    messages=messages,
    max_tokens=32768,
    temperature=1.0,
    top_p=0.95,
    extra_body={"top_k": 20},
)
print(response.choices[0].message.content)
```
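When the server runs with `--reasoning-parser qwen3`, the thinking text is returned separately from the final answer. The sketch below shows one way to read both parts. It assumes the extra field is named `reasoning_content`, with `reasoning` as a fallback because the name has varied between vLLM releases; check the version you run. `split_reasoning` is a hypothetical helper, and the `SimpleNamespace` object only stands in for `response.choices[0].message` so the example runs without a server.

```python
from types import SimpleNamespace


def split_reasoning(message):
    """Return (reasoning, answer) from a chat completion message.

    Assumption: the reasoning parser exposes the thinking text as
    `reasoning_content` or, in other releases, as `reasoning`.
    """
    reasoning = getattr(message, "reasoning_content", None)
    if reasoning is None:
        reasoning = getattr(message, "reasoning", None)
    return reasoning, message.content


# Stand-in for response.choices[0].message from the request above.
demo = SimpleNamespace(reasoning_content="2 + 2 is 4.", content="4")
reasoning, answer = split_reasoning(demo)
print("reasoning:", reasoning)
print("answer:", answer)
```

With a real response, call `split_reasoning(response.choices[0].message)`. The first element is `None` when thinking is disabled or no parser is configured.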

Non-Thinking (Instruct) Mode

```python
# Reuses `client` and `messages` from the thinking-mode example.
response = client.chat.completions.create(
    model="NeuralNet-Hub/Qwen3.6-35B-A3B-NVFP4",
    messages=messages,
    max_tokens=8192,
    temperature=0.7,
    top_p=0.8,
    presence_penalty=1.5,
    extra_body={
        "top_k": 20,
        "chat_template_kwargs": {"enable_thinking": False},
    },
)
print(response.choices[0].message.content)
```

Image Input

```python
# Reuses `client` from the thinking-mode example.
messages = [
    {
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}},
            {"type": "text", "text": "Describe this image in detail."},
        ],
    }
]
response = client.chat.completions.create(
    model="NeuralNet-Hub/Qwen3.6-35B-A3B-NVFP4",
    messages=messages,
    max_tokens=32768,
    temperature=1.0,
    top_p=0.95,
    extra_body={"top_k": 20},
)
print(response.choices[0].message.content)
```
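The example above points at a public URL. For an image on local disk, a common approach with OpenAI-format APIs is to embed the file as a base64 data URL in the same `image_url` field. The helpers below are a hypothetical sketch of that approach and assume the server accepts data URLs.

```python
import base64
import mimetypes
from pathlib import Path


def data_url(raw_bytes, mime_type):
    """Encode raw bytes as a data URL usable in an image_url field."""
    encoded = base64.b64encode(raw_bytes).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def image_message(url, prompt):
    """Build a user message with one image and one text part."""
    return {
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": url}},
            {"type": "text", "text": prompt},
        ],
    }


def local_image_message(image_path, prompt):
    """Read an image from disk and wrap it in a user message."""
    path = Path(image_path)
    # Fall back to a generic type when the extension is not recognised.
    mime_type = mimetypes.guess_type(path.name)[0] or "application/octet-stream"
    return image_message(data_url(path.read_bytes(), mime_type), prompt)
```

Pass `messages=[local_image_message("photo.jpg", "Describe this image in detail.")]` to the same `client.chat.completions.create` call; `photo.jpg` is a placeholder path.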

⚙️ Recommended Sampling Parameters

| Mode | temperature | top_p | top_k | presence_penalty |
|---|---|---|---|---|
| Thinking (general tasks) | 1.0 | 0.95 | 20 | 0.0 |
| Thinking (precise coding) | 0.6 | 0.95 | 20 | 0.0 |
| Instruct (non-thinking) | 0.7 | 0.80 | 20 | 1.5 |
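The recommended values can be kept in one place and expanded into request arguments. This is a minimal sketch: the preset keys and the `sampling_kwargs` helper are hypothetical names, `top_k` goes through `extra_body` because the OpenAI client has no such parameter, and `enable_thinking` is set explicitly in every mode so the result does not depend on the server's default.

```python
SAMPLING_PRESETS = {
    "thinking_general": {"temperature": 1.0, "top_p": 0.95, "top_k": 20, "presence_penalty": 0.0},
    "thinking_coding": {"temperature": 0.6, "top_p": 0.95, "top_k": 20, "presence_penalty": 0.0},
    "instruct": {"temperature": 0.7, "top_p": 0.8, "top_k": 20, "presence_penalty": 1.5},
}


def sampling_kwargs(mode):
    """Return keyword arguments for client.chat.completions.create."""
    preset = SAMPLING_PRESETS[mode]  # raises KeyError for an unknown mode
    return {
        "temperature": preset["temperature"],
        "top_p": preset["top_p"],
        "presence_penalty": preset["presence_penalty"],
        "extra_body": {
            "top_k": preset["top_k"],
            # Explicit in every mode so a server-side default cannot flip it.
            "chat_template_kwargs": {"enable_thinking": mode != "instruct"},
        },
    }
```

A request then becomes `client.chat.completions.create(model=..., messages=messages, max_tokens=8192, **sampling_kwargs("instruct"))`.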

📥 Download with huggingface-cli

Install the CLI

```bash
pip install -U "huggingface_hub[cli]"
```

Download the Full Repository

```bash
huggingface-cli download NeuralNet-Hub/Qwen3.6-35B-A3B-NVFP4 --local-dir ./Qwen3.6-35B-A3B-NVFP4
```

Download Specific Files

```bash
huggingface-cli download NeuralNet-Hub/Qwen3.6-35B-A3B-NVFP4 \
  --include "*.safetensors" \
  --local-dir ./Qwen3.6-35B-A3B-NVFP4
```
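The same downloads can be scripted from Python with `snapshot_download` from `huggingface_hub`, the package installed above. The sketch below mirrors the two CLI commands. `download_args` and `download` are hypothetical wrapper names, and the library is imported lazily so the argument builder can be used without it.

```python
REPO_ID = "NeuralNet-Hub/Qwen3.6-35B-A3B-NVFP4"


def download_args(weights_only=False, local_dir="./Qwen3.6-35B-A3B-NVFP4"):
    """Build keyword arguments for huggingface_hub.snapshot_download."""
    args = {"repo_id": REPO_ID, "local_dir": local_dir}
    if weights_only:
        # Same filter as `--include "*.safetensors"` on the CLI.
        args["allow_patterns"] = ["*.safetensors"]
    return args


def download(weights_only=False):
    """Download the repository and return the local path."""
    from huggingface_hub import snapshot_download  # imported only when needed

    return snapshot_download(**download_args(weights_only=weights_only))
```

`download()` fetches the full repository and `download(weights_only=True)` fetches only the weight files. Note that the weights alone are not enough to serve the model, since vLLM also needs the config and tokenizer files.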

🔧 Hardware Requirements

| Component | Requirement |
|---|---|
| GPU Architecture | NVIDIA Blackwell (sm_100+) |
| VRAM | 24 GB+ recommended |
| CUDA | 12.8+ |
| vLLM | 0.9.0+ |

[!WARNING] NVFP4 is supported only on NVIDIA Blackwell GPUs. Attempting to run this model on Ampere (A100), Ada Lovelace (RTX 40 series), or Hopper (H100) will fail. For those architectures, use the original BF16 model or an AWQ/GPTQ quantized variant.
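A quick pre-flight check can catch an unsupported GPU before the model is downloaded. The sketch below interprets the sm_100+ requirement from the table as a CUDA compute capability with major version 10 or higher, which is an assumption about how the requirement maps to hardware. `supports_nvfp4` and `current_gpu_supports_nvfp4` are hypothetical helpers, and PyTorch is imported lazily because it is only needed for the live query.

```python
def supports_nvfp4(capability):
    """Check a (major, minor) compute capability against the sm_100+ requirement."""
    major, _minor = capability
    return major >= 10


def current_gpu_supports_nvfp4(device=0):
    """Query the installed GPU through PyTorch (requires a CUDA build of torch)."""
    import torch  # imported only when a live check is requested

    if not torch.cuda.is_available():
        return False
    return supports_nvfp4(torch.cuda.get_device_capability(device))
```

For reference, Ampere reports capability 8.0 or 8.6, Ada Lovelace 8.9, and Hopper 9.0, so all of them fail this check.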


🌐 Contact Us

NeuralNet is a pioneering AI solutions provider that empowers businesses to harness the power of artificial intelligence.

Website: https://neuralnet.solutions

Email: info[at]neuralnet.solutions

Model provider: NeuralNet-Hub

Model tree: Qwen/Qwen3.6-35B-A3B (base), this model (quantized)

Modalities: Video, Text, Image (input); Text (output)