Run this model inference on single tenant GPU with unmatched speed and reliability at scale.
Run this model inference with full control and performance in your environment.
Get help setting up a custom Dedicated Endpoints.
Talk with our engineer to get a quote for reserved GPU instances with discounts.
README
License: apache-2.0Model Details
- Base Model: Qwen/Qwen3-4B
- Abliteration Method: Heretic v1.2.0
- Trials: 200
- Trial Selected: Trial 96
- Refusals: 3/100 (vs 100/100 original)
- KL Divergence: 0.0000 (zero measurable model damage)
Files
HuggingFace Format (for transformers, llama.cpp conversion)
markdown
model-00001-of-00002.safetensorsmodel-00002-of-00002.safetensorsconfig.jsontokenizer.jsontokenizer_config.json
ComfyUI Format (for Z-Image / FLUX.2 Klein 4B text encoder)
markdown
comfyui/qwen3-4b-heretic.safetensors # bf16, 7.5GBcomfyui/qwen3-4b-heretic_fp8_e4m3fn.safetensors # fp8, 4.1GBcomfyui/qwen3-4b-heretic_nvfp4.safetensors # nvfp4, 2.6GB
GGUF Format (for llama.cpp and ComfyUI-GGUF)
| Quant | Size | Notes |
|---|---|---|
| F16 | ~7.5GB | Lossless reference |
| Q8_0 | ~4GB | Excellent quality |
| Q6_K | ~3GB | Very good quality |
| Q5_K_M | ~2.7GB | Good quality |
| Q4_K_M | ~2.3GB | Recommended balance |
| Q3_K_M | ~1.9GB | For low VRAM only |
NVFP4 Notes
The NVFP4 (4-bit floating point, E2M1) variants use ComfyUI's native quantization format. They are ~3x smaller than bf16 and load natively in ComfyUI without any plugins. Blackwell GPUs (RTX 5090/5080, SM100+) can use native FP4 tensor cores for best performance, but ComfyUI also supports software dequantization on older GPUs (tested working on RTX 4090).
Usage
With ComfyUI (Z-Image / FLUX.2 Klein 4B)
-
Download a ComfyUI format file:
- FP8 (recommended):
comfyui/qwen3-4b-heretic_fp8_e4m3fn.safetensors(4.1GB) - NVFP4 (smallest):
comfyui/qwen3-4b-heretic_nvfp4.safetensors(2.6GB) - bf16 (full precision):
comfyui/qwen3-4b-heretic.safetensors(7.5GB)
- FP8 (recommended):
-
Place in
ComfyUI/models/text_encoders/ -
In your Z-Image workflow, use the
ClipLoadernode and select the heretic file
With Transformers
python
from transformers import AutoModelForCausalLM, AutoTokenizerimport torchmodel = AutoModelForCausalLM.from_pretrained("DreamFast/qwen3-4b-heretic",device_map="auto",torch_dtype=torch.bfloat16)tokenizer = AutoTokenizer.from_pretrained("DreamFast/qwen3-4b-heretic")prompt = "Describe a dramatic sunset over a cyberpunk city"inputs = tokenizer(prompt, return_tensors="pt").to(model.device)outputs = model.generate(**inputs, max_new_tokens=200)print(tokenizer.decode(outputs[0], skip_special_tokens=True))
With llama.cpp
bash
llama-server -m qwen3-4b-heretic-Q4_K_M.gguf
Abliteration Process
Created using Heretic v1.2.0 with 200 optimization trials:
markdown
? Which trial do you want to use?> [Trial 96] Refusals: 3/100, KL divergence: 0.0000 <-- selected[Trial 90] Refusals: 5/100, KL divergence: 0.0000[Trial 95] Refusals: 9/100, KL divergence: 0.0000[Trial 122] Refusals: 90/100, KL divergence: 0.0000...
Trial 96 was selected for having the fewest refusals (3/100) with zero measurable KL divergence, indicating the abliteration surgically removed the refusal mechanism with no damage to model capabilities.
Limitations
- This model inherits all limitations of the base Qwen 3 4B model
- Abliteration reduces but does not completely eliminate refusals (3/100 remain)
License
This model is released under the Apache 2.0 License, following the base Qwen 3 4B model license.
Acknowledgments
- Qwen for the Qwen 3 4B model
- Heretic by p-e-w for the abliteration tool
- Tongyi-MAI Z-Image for Z-Image
- Black Forest Labs for FLUX.2 Klein
Model provider
skilledu
Model tree
Base
Qwen/Qwen3-4B
Quantized
this model
Modalities
Input
Text
Output
Text
Pricing
Dedicated Endpoints
View detailsSupported Functionality
Model APIs
Dedicated Endpoints
Container
More information