Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more
Container

Run this model inference with full control and performance in your environment.

Learn more

Get help setting up a custom Dedicated Endpoints.

Talk with our engineer to get a quote for reserved GPU instances with discounts.

README

License: apache-2.0

Model Details

  • Base Model: Qwen/Qwen3-4B
  • Abliteration Method: Heretic v1.2.0
  • Trials: 200
  • Trial Selected: Trial 96
  • Refusals: 3/100 (vs 100/100 original)
  • KL Divergence: 0.0000 (zero measurable model damage)

Files

HuggingFace Format (for transformers, llama.cpp conversion)

markdown

model-00001-of-00002.safetensors
model-00002-of-00002.safetensors
config.json
tokenizer.json
tokenizer_config.json

ComfyUI Format (for Z-Image / FLUX.2 Klein 4B text encoder)

markdown

comfyui/qwen3-4b-heretic.safetensors # bf16, 7.5GB
comfyui/qwen3-4b-heretic_fp8_e4m3fn.safetensors # fp8, 4.1GB
comfyui/qwen3-4b-heretic_nvfp4.safetensors # nvfp4, 2.6GB

GGUF Format (for llama.cpp and ComfyUI-GGUF)

QuantSizeNotes
F16~7.5GBLossless reference
Q8_0~4GBExcellent quality
Q6_K~3GBVery good quality
Q5_K_M~2.7GBGood quality
Q4_K_M~2.3GBRecommended balance
Q3_K_M~1.9GBFor low VRAM only

NVFP4 Notes

The NVFP4 (4-bit floating point, E2M1) variants use ComfyUI's native quantization format. They are ~3x smaller than bf16 and load natively in ComfyUI without any plugins. Blackwell GPUs (RTX 5090/5080, SM100+) can use native FP4 tensor cores for best performance, but ComfyUI also supports software dequantization on older GPUs (tested working on RTX 4090).

Usage

With ComfyUI (Z-Image / FLUX.2 Klein 4B)

  1. Download a ComfyUI format file:

    • FP8 (recommended): comfyui/qwen3-4b-heretic_fp8_e4m3fn.safetensors (4.1GB)
    • NVFP4 (smallest): comfyui/qwen3-4b-heretic_nvfp4.safetensors (2.6GB)
    • bf16 (full precision): comfyui/qwen3-4b-heretic.safetensors (7.5GB)
  2. Place in ComfyUI/models/text_encoders/

  3. In your Z-Image workflow, use the ClipLoader node and select the heretic file

With Transformers

python

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model = AutoModelForCausalLM.from_pretrained(
"DreamFast/qwen3-4b-heretic",
device_map="auto",
torch_dtype=torch.bfloat16
)
tokenizer = AutoTokenizer.from_pretrained("DreamFast/qwen3-4b-heretic")
prompt = "Describe a dramatic sunset over a cyberpunk city"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

With llama.cpp

bash

llama-server -m qwen3-4b-heretic-Q4_K_M.gguf

Abliteration Process

Created using Heretic v1.2.0 with 200 optimization trials:

markdown

? Which trial do you want to use?
> [Trial 96] Refusals: 3/100, KL divergence: 0.0000 <-- selected
[Trial 90] Refusals: 5/100, KL divergence: 0.0000
[Trial 95] Refusals: 9/100, KL divergence: 0.0000
[Trial 122] Refusals: 90/100, KL divergence: 0.0000
...

Trial 96 was selected for having the fewest refusals (3/100) with zero measurable KL divergence, indicating the abliteration surgically removed the refusal mechanism with no damage to model capabilities.

Limitations

  • This model inherits all limitations of the base Qwen 3 4B model
  • Abliteration reduces but does not completely eliminate refusals (3/100 remain)

License

This model is released under the Apache 2.0 License, following the base Qwen 3 4B model license.

Acknowledgments

Model provider

skilledu

Model tree

Base

Qwen/Qwen3-4B

Quantized

this model

Modalities

Input

Text

Output

Text

Pricing

Dedicated Endpoints

View details

Supported Functionality

Model APIs

Dedicated Endpoints

Container

More information

Explore FriendliAI today