r3lax

Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-NVFP4-GGUF

README

License: apache-2.0

Download

bash
hf download lyf/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-NVFP4 \
  --local-dir ./qwen36-35b-a3b-hauhaucs-nvfp4

vLLM quickstart

bash
VLLM_NVFP4_GEMM_BACKEND=marlin \
vllm serve lyf/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-NVFP4 \
  --served-model-name qwen36-35b-a3b-hauhaucs-nvfp4 \
  --quantization compressed-tensors \
  --kv-cache-dtype fp8 \
  --max-model-len 131072 \
  --max-num-seqs 1 \
  --max-num-batched-tokens 4096 \
  --gpu-memory-utilization 0.90 \
  --enable-prefix-caching \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3 \
  --trust-remote-code

Local path quickstart:

bash
hf download lyf/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-NVFP4 \
  --local-dir ./qwen36-35b-a3b-hauhaucs-nvfp4

VLLM_NVFP4_GEMM_BACKEND=marlin \
vllm serve ./qwen36-35b-a3b-hauhaucs-nvfp4 \
  --served-model-name qwen36-35b-a3b-hauhaucs-nvfp4 \
  --quantization compressed-tensors \
  --kv-cache-dtype fp8 \
  --max-model-len 131072 \
  --max-num-seqs 1 \
  --max-num-batched-tokens 4096 \
  --gpu-memory-utilization 0.90 \
  --enable-prefix-caching \
  --enable-auto-tool-choice \
  --tool-call-parser qwen3_coder \
  --reasoning-parser qwen3 \
  --trust-remote-code

Quantization recipe

python
recipe = QuantizationModifier(
    targets="Linear", scheme="NVFP4",
    ignore=["lm_head", "re:.*visual.*", "re:.*mlp.gate$",
            "re:.*mlp.shared_expert_gate$", "re:.*linear_attn.*", "re:^mtp.*"],
)
oneshot(
    model=model,
    dataset=ds,
    recipe=recipe,
    max_seq_length=1024,
    num_calibration_samples=128,
    moe_calibrate_all_experts=True,
    pipeline="basic",
)

Calibration: HuggingFaceH4/ultrachat_200k, 128 samples x 1024 tokens
MTP tensors copied from Qwen/Qwen3.6-35B-A3B
Converted using li-yifei/gguf-to-nvfp4

Pipeline:

text
Q8_K_P GGUF -> step1_convert_qwen36_moe.py -> HF bf16 -> step2_quantize_qwen36_moe.py -> NVFP4

Source models

Uncensored source: HauhauCS/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive
Original base: Qwen/Qwen3.6-35B-A3B

Acknowledgments

HauhauCS for the uncensored GGUF source
Qwen for the base model and MTP weights
AEON-7 and RedHatAI for conservative quantization approach reference

Available on FriendliAI

Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more

Model Details

Model Provider

r3lax

Model Tree

Base

HauhauCS/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive

Quantized

this model

Input Modalities

Text

Image

Video

Output Modalities