Dedicated Endpoints
Run this model inference on single tenant GPU with unmatched speed and reliability at scale.
Get help setting up a custom Dedicated Endpoints.
Talk with our engineer to get a quote for reserved GPU instances with discounts.
README
License: apache-2.0Download
bash
hf download lyf/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-NVFP4 \--local-dir ./qwen36-35b-a3b-hauhaucs-nvfp4
vLLM quickstart
bash
VLLM_NVFP4_GEMM_BACKEND=marlin \vllm serve lyf/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-NVFP4 \--served-model-name qwen36-35b-a3b-hauhaucs-nvfp4 \--quantization compressed-tensors \--kv-cache-dtype fp8 \--max-model-len 131072 \--max-num-seqs 1 \--max-num-batched-tokens 4096 \--gpu-memory-utilization 0.90 \--enable-prefix-caching \--enable-auto-tool-choice \--tool-call-parser qwen3_coder \--reasoning-parser qwen3 \--trust-remote-code
Local path quickstart:
bash
hf download lyf/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-NVFP4 \--local-dir ./qwen36-35b-a3b-hauhaucs-nvfp4VLLM_NVFP4_GEMM_BACKEND=marlin \vllm serve ./qwen36-35b-a3b-hauhaucs-nvfp4 \--served-model-name qwen36-35b-a3b-hauhaucs-nvfp4 \--quantization compressed-tensors \--kv-cache-dtype fp8 \--max-model-len 131072 \--max-num-seqs 1 \--max-num-batched-tokens 4096 \--gpu-memory-utilization 0.90 \--enable-prefix-caching \--enable-auto-tool-choice \--tool-call-parser qwen3_coder \--reasoning-parser qwen3 \--trust-remote-code
Quantization recipe
python
recipe = QuantizationModifier(targets="Linear", scheme="NVFP4",ignore=["lm_head", "re:.*visual.*", "re:.*mlp.gate$","re:.*mlp.shared_expert_gate$", "re:.*linear_attn.*", "re:^mtp.*"],)oneshot(model=model,dataset=ds,recipe=recipe,max_seq_length=1024,num_calibration_samples=128,moe_calibrate_all_experts=True,pipeline="basic",)
- Calibration:
HuggingFaceH4/ultrachat_200k, 128 samples x 1024 tokens - MTP tensors copied from Qwen/Qwen3.6-35B-A3B
- Converted using li-yifei/gguf-to-nvfp4
Pipeline:
text
Q8_K_P GGUF -> step1_convert_qwen36_moe.py -> HF bf16 -> step2_quantize_qwen36_moe.py -> NVFP4
Source models
- Uncensored source: HauhauCS/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive
- Original base: Qwen/Qwen3.6-35B-A3B
Acknowledgments
Model provider
r3lax
Model tree
Base
HauhauCS/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive
Quantized
this model
Modalities
Input
Video, Text, Image
Output
Text
Pricing
Dedicated Endpoints
View detailsSupported Functionality
Model APIs
Dedicated Endpoints
Container
More information