lyf

Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-NVFP4

README

License: apache-2.0

Quantization Recipe

python
recipe = QuantizationModifier(
    targets="Linear", scheme="NVFP4",
    ignore=["lm_head", "re:.*visual.*", "re:.*mlp.gate$",
            "re:.*mlp.shared_expert_gate$", "re:.*linear_attn.*", "re:^mtp.*"],
)
oneshot(model=model, dataset=ds, recipe=recipe,
        max_seq_length=1024, num_calibration_samples=128,
        moe_calibrate_all_experts=True, pipeline="basic")

Calibration: HuggingFaceH4/ultrachat_200k, 128 samples × 1024 tokens
MTP tensors copied from Qwen/Qwen3.6-35B-A3B (not present in GGUF)

Deployment (vLLM)

Vision + text smoke-tested on RTX 5090

This repository has been smoke-tested locally on an RTX 5090 with vllm/vllm-openai:v0.21.0-cu130-local, compressed-tensors, NVFP4 Marlin GEMM, FP8 KV cache, and a real image chat.completions request.

bash
VLLM_USE_FLASHINFER_MOE_FP4=0 \
VLLM_NVFP4_GEMM_BACKEND=marlin \
vllm serve ./Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-NVFP4 \
  --served-model-name qwen36-35b-a3b-hauhaucs-nvfp4 \
  --quantization compressed-tensors \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.90 \
  --max-model-len 4096 \
  --max-num-seqs 1 \
  --max-num-batched-tokens 1024 \
  --trust-remote-code

For short non-thinking answers, pass chat_template_kwargs at the top level of the OpenAI-compatible request:

json
{
  "chat_template_kwargs": {"enable_thinking": false}
}

Text-only long context

bash
VLLM_USE_FLASHINFER_MOE_FP4=0 \
VLLM_NVFP4_GEMM_BACKEND=marlin \
vllm serve ./Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-NVFP4 \
  --quantization compressed-tensors \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.95 \
  --max-model-len 100000 \
  --max-num-seqs 1 \
  --reasoning-parser qwen3 \
  --language-model-only \
  --trust-remote-code

Pipeline

Converted using li-yifei/gguf-to-nvfp4:

markdown
Q8_K_P GGUF → step1_convert_qwen36_moe.py → HF bf16 → step2_quantize_qwen36_moe.py → NVFP4

Also See

lyf/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive-NVFP4-100K — Aggressive variant (linear_attn + MTP also NVFP4, smaller footprint for vision+long context)

Acknowledgments

HauhauCS for the uncensored GGUF source
Qwen for the base model and MTP weights
AEON-7 and RedHatAI for conservative quantization approach reference

Available on FriendliAI

Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more

Model Details

Model Provider

lyf

Model Tree

Base

HauhauCS/Qwen3.6-35B-A3B-Uncensored-HauhauCS-Aggressive

Quantized

this model

Input Modalities

TextImageVideo

Output Modalities

Text

Supported Functionality