lyf
Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-NVFP4
Run this model inference on single tenant GPU with unmatched speed and reliability at scale.
Get help setting up a custom Dedicated Endpoints.
Talk with our engineer to get a quote for reserved GPU instances with discounts.
README
License: apache-2.0Quantization Recipe
python
recipe = QuantizationModifier(targets="Linear", scheme="NVFP4",ignore=["lm_head", "re:.*visual.*", "re:.*mlp.gate$","re:.*mlp.shared_expert_gate$", "re:.*linear_attn.*", "re:^mtp.*"],)oneshot(model=model, dataset=ds, recipe=recipe,max_seq_length=1024, num_calibration_samples=128,moe_calibrate_all_experts=True, pipeline="basic")
- Calibration: HuggingFaceH4/ultrachat_200k, 128 samples × 1024 tokens
- MTP tensors copied from Qwen/Qwen3.6-35B-A3B (not present in GGUF)
Deployment (vLLM)
Vision + text smoke-tested on RTX 5090
This repository has been smoke-tested locally on an RTX 5090 with vllm/vllm-openai:v0.21.0-cu130-local, compressed-tensors, NVFP4 Marlin GEMM, FP8 KV cache, and a real image chat.completions request.
bash
VLLM_USE_FLASHINFER_MOE_FP4=0 \VLLM_NVFP4_GEMM_BACKEND=marlin \vllm serve ./Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-NVFP4 \--served-model-name qwen36-35b-a3b-hauhaucs-nvfp4 \--quantization compressed-tensors \--kv-cache-dtype fp8 \--gpu-memory-utilization 0.90 \--max-model-len 4096 \--max-num-seqs 1 \--max-num-batched-tokens 1024 \--trust-remote-code
For short non-thinking answers, pass chat_template_kwargs at the top level of the OpenAI-compatible request:
json
{"chat_template_kwargs": {"enable_thinking": false}}
Text-only long context
bash
VLLM_USE_FLASHINFER_MOE_FP4=0 \VLLM_NVFP4_GEMM_BACKEND=marlin \vllm serve ./Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-NVFP4 \--quantization compressed-tensors \--kv-cache-dtype fp8 \--gpu-memory-utilization 0.95 \--max-model-len 100000 \--max-num-seqs 1 \--reasoning-parser qwen3 \--language-model-only \--trust-remote-code
Pipeline
Converted using li-yifei/gguf-to-nvfp4:
markdown
Q8_K_P GGUF → step1_convert_qwen36_moe.py → HF bf16 → step2_quantize_qwen36_moe.py → NVFP4
Also See
- lyf/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-NVFP4-100K — Aggressive variant (linear_attn + MTP also NVFP4, smaller footprint for vision+long context)
Acknowledgments
Model provider
lyf
Model tree
Base
HauhauCS/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive
Quantized
this model
Modalities
Input
Video, Text, Image
Output
Text
Pricing
Dedicated Endpoints
View detailsSupported Functionality
Model APIs
Dedicated Endpoints
Container
More information