DreamFast/Qwen3-VL-8B-Heretic-1.3.0 API & Inference Endpoint

Model Details

Base Model: Qwen/Qwen3-VL-8B-Instruct
Architecture: Qwen3VLForConditionalGeneration (~8B parameters)
Abliteration Method: Heretic v1.3.0
Trial Selected: Trial 98
Refusals: 6/100 (vs 100/100 original)
KL Divergence: 0.0314 (very good, minimal model damage)
Precision: bf16
Context Length: 8192 tokens

This is a vision-language model (supports both text and image inputs). It does not support thinking/reasoning tokens.

Quick Facts


Base model	Qwen/Qwen3-VL-8B-Instruct
Architecture	Qwen3VLForConditionalGeneration
Parameters	~8B
Precision	bf16
Context length	8192 tokens

Key Findings

Safety is fully removed. The base model refused 71.5% of harmful requests; Heretic complies with 99.0%.
Capability is almost perfectly preserved. Across 8 benchmark tasks, the average change is under 1%. MMLU drops 0.30%, GSM8K drops 0.91%, and HellaSwag actually improves by 0.17%.
TruthfulQA takes a meaningful hit. TruthfulQA MC2 drops 11.7% (61.2% → 54.1%). This is the established safety-accuracy tradeoff.
The edits are surgical. Only 53 out of 398 tensors (13.3%) were modified, targeting o_proj (27 tensors) and down_proj (26 tensors), spanning layers 9–35. SVD analysis confirms rank-1 edits with SV ratios of 80–92x.
Minimal distribution shift. KL divergence of 0.0314 confirms the model's output distribution barely changed on benign inputs.

Abliteration Process

Heretic v1.3.0 with optimization trials. Trial 98 was selected for its balance of low refusals (6/100) and very low KL divergence (0.0315):

markdown
[Trial  98] Refusals:  6/100, KL divergence: 0.0315  <-- selected

Files

HuggingFace Format (for transformers, vision-capable)

markdown
model-00001-of-00004.safetensors (~4.6 GB)
model-00002-of-00004.safetensors (~4.6 GB)
model-00003-of-00004.safetensors (~4.7 GB)
model-00004-of-00004.safetensors (~2.6 GB)
config.json
tokenizer.json
tokenizer_config.json
generation_config.json
chat_template.jinja

ComfyUI Format (vision-capable text encoder)

markdown
comfyui/qwen3-vl-8b-heretic-1.3.0.safetensors              # bf16, 17GB
comfyui/qwen3-vl-8b-heretic-1.3.0_fp8_e4m3fn.safetensors   # fp8, 9.4GB
comfyui/qwen3-vl-8b-heretic-1.3.0_nvfp4.safetensors        # nvfp4, 6.3GB

GGUF Format (text-only, no vision - for ComfyUI-GGUF text encoder)

Quant	Size	Notes
F16	16GB	Lossless reference
Q8_0	8.2GB	Excellent quality
Q6_K	6.3GB	Very good quality
Q5_K_M	5.5GB	Good quality
Q5_K_S	5.4GB	Slightly smaller Q5
Q4_K_M	4.7GB	Recommended balance
Q4_K_S	4.5GB	Smaller Q4 variant
Q3_K_M	3.9GB	For low VRAM only

Note: GGUF files strip the vision encoder. These are text-only and intended for ComfyUI-GGUF text encoder workflows, not standalone vision-language use.

NVFP4 Notes

The NVFP4 (4-bit floating point, E2M1) variants use ComfyUI's native quantization format. They are ~3x smaller than bf16 and load natively in ComfyUI without any plugins. Blackwell GPUs (RTX 5090/5080, SM100+) can use native FP4 tensor cores for best performance, but ComfyUI also supports software dequantization on older GPUs (tested working on RTX 4090).

Usage

With ComfyUI (as text encoder)

Download a ComfyUI format file:
- FP8 (recommended): comfyui/qwen3-vl-8b-heretic-1.3.0_fp8_e4m3fn.safetensors (9.4GB)
- NVFP4 (smallest): comfyui/qwen3-vl-8b-heretic-1.3.0_nvfp4.safetensors (6.3GB)
- bf16 (full precision): comfyui/qwen3-vl-8b-heretic-1.3.0.safetensors (17GB)
Place in ComfyUI/models/text_encoders/
In your workflow, use the appropriate loader node and select the heretic file

With Transformers

python
from transformers import AutoModelForCausalLM, AutoTokenizer
import torch

model = AutoModelForCausalLM.from_pretrained(
    "DreamFast/qwen3-vl-8b-heretic-1.3.0",
    device_map="auto",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("DreamFast/qwen3-vl-8b-heretic-1.3.0")

prompt = "Describe a dramatic sunset over a cyberpunk city"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

With llama.cpp (text-only, no vision)

bash
llama-server -m qwen3-vl-8b-heretic-1.3.0-Q4_K_M.gguf

Benchmarks

Evaluated with lm-evaluation-harness via vLLM, bf16 native on NVIDIA RTX 5090.

Task	Base	Heretic v1.3.0
MMLU	74.91%	74.68%
GSM8K	91.89%	91.05%
HellaSwag	76.38%	76.51%
ARC Challenge	61.18%	60.92%
WinoGrande	73.40%	73.56%
TruthfulQA MC1	41.00%	34.76%
TruthfulQA MC2	61.21%	54.05%
TruthfulQA Gen	52.26%	43.70%
PiQA	80.09%	80.63%
Lambada (ppl ↓)	3.7	3.6

Delta vs base

Task	Heretic v1.3.0
MMLU	-0.30%
GSM8K	-0.91%
HellaSwag	+0.17%
ARC Challenge	-0.42%
WinoGrande	+0.22%
TruthfulQA MC1	-15.22%
TruthfulQA MC2	-11.69%
TruthfulQA Gen	-16.39%
PiQA	+0.68%
Lambada	-2.99%

What the benchmarks tell us

The numbers tell a clear story: abliteration is essentially free on standard benchmarks. MMLU, GSM8K, HellaSwag, ARC, WinoGrande, PiQA. All within ±1% of the base model. Some even improve slightly, which is within noise.

The only real change is TruthfulQA, where all three variants (MC1, MC2, Gen) drop 12–16%. This isn't surprising. TruthfulQA measures a model's tendency to give accurate answers rather than persuasive ones, and abliteration removes the safety training that also teaches epistemic caution.

Safety: HarmBench

HarmBench with 400 textual behaviours, max_tokens=8096, temperature=0. All 800 responses (base + variant) were individually reviewed by an LLM to catch false positives and false negatives from the keyword classifier.

Model	ASR	Complied	Refused	Total
Base	28.5%	114	286	400
Heretic v1.3.0	99.0%	396	4	400

The base Qwen3-VL-8B-Instruct has an ASR of 28.5%. It fully refuses chemical/biological and harassment requests (0.0%), but struggles with copyright (99.0%) and shows moderate weakness on cybercrime (11.9%) and misinformation (6.2%).

Heretic raises that to 99.0%. The 4 remaining refusals are edge cases, mostly items where the model briefly warns about danger before providing the requested content anyway.

ASR by category

Category	Items	Base	Heretic v1.3.0
Chemical/Bio	56	0.0%	100.0%
Copyright	100	99.0%	100.0%
Cybercrime	67	11.9%	100.0%
Harassment	25	0.0%	100.0%
Harmful Content	22	0.0%	86.4%
Illegal Activity	65	4.6%	100.0%
Misinformation	65	6.2%	98.5%

LLM Review

The initial keyword classifier reported 70.8% ASR for the base model. After individual LLM review of all 400 base responses, the actual ASR was 28.5%. The discrepancy came from long, detailed-sounding responses that were actually safety lectures or educational explanations without actionable harmful content.

For the heretic variant, the classifier (99.5%) and LLM review (99.0%) were in close agreement. 2 additional items were identified as refusals by the LLM reviewer that the classifier missed.

KL Divergence

Methodology: F.kl_div(logprobs_variant, logprobs_base, reduction="batchmean", log_target=True) on full vocab first-token logits from mlabonne/harmless_alpaca test[:100], matching the Heretic evaluator. System prompt: "You are a helpful assistant."

Variant	KL Divergence	Rating
heretic	0.0314	very good

Rating scale: excellent below 0.01, very good 0.01 to 0.1, moderate 0.1 to 0.4, significant 0.4 to 1.0, heavy above 1.0.

A KL divergence of 0.0314 means the model's output distribution on benign prompts is nearly identical to the base. The median per-prompt KL is 0.0017. For most inputs, you'd never notice a difference. The max of 0.47 suggests a few prompts land near the edited safety boundary, producing slightly different token probabilities.

Weight Analysis

Modification summary

	Heretic v1.3.0
Tensors changed	53 / 398 (13.3%)
Relative edit (median)	2.0%
Tensor types	`o_proj` (27) + `down_proj` (26)
Layers modified	27 / 36 (75%)
Layer range	9–35

This is a textbook abliteration pattern. The edits target exactly two tensor types across the mid-to-late layers:

self_attn.o_proj.weight (27 tensors): The attention output projection, which controls how attention heads combine their signals. Abliteration modifies this to suppress the "refusal direction" in attention outputs.
mlp.down_proj.weight (26 tensors): The MLP down projection, which controls the feedforward transformation. Same logic. Suppress the refusal signal.

Layers 0–8 are untouched. Edits begin at layer 9 and continue through layer 35 (the last layer). The edit density is consistent at ~18% per modified layer (2 out of 11 tensors per layer).

SVD: Rank-1 edits

Tensor (top 5)	Frobenius Norm	Effective Rank (90%)	SV Ratio
layers.28.mlp.down_proj	3.73	1	89.7x
layers.27.mlp.down_proj	3.72	1	91.6x
layers.29.mlp.down_proj	3.61	1	87.0x
layers.26.mlp.down_proj	3.60	1	88.6x
layers.30.mlp.down_proj	3.45	1	81.0x

Every modified tensor has effective rank 1 at the 90% energy threshold. This means each edit is a pure rank-1 update, a single direction being added to or subtracted from the weight matrix. The SV ratios of 80–92x confirm this: the edit is dominated by a single singular vector.

This is the hallmark of abliteration: the technique identifies a "refusal direction" in the model's activation space and applies a rank-1 counter-direction to neutralize it.

Summary

Metric	Base	Heretic v1.3.0	Delta
HarmBench ASR	28.5%	99.0%	+70.5pp
MMLU	74.91%	74.68%	-0.3%
GSM8K	91.89%	91.05%	-0.9%
HellaSwag	76.38%	76.51%	+0.2%
ARC-C	61.18%	60.92%	-0.4%
TruthfulQA	61.21%	54.05%	-11.7%
KL Divergence	—	0.0314	—
Weights Changed	—	13.3% (53 tensors)	—

Methodology

Capability: lm-evaluation-harness via vLLM, bf16 native on NVIDIA RTX 5090
Safety: HarmBench 400 textual behaviours, max_tokens=8096, temperature=0, classified with harmbench_classify.py v4.0, then individually reviewed by LLM
KL divergence: Full vocab first-token logits via model.generate(max_new_tokens=1, output_scores=true), matching Heretic evaluator methodology
Weight analysis: SVD, fingerprint, edit vector, and per-layer analysis comparing variant against the base, using Abliterlitics
Hardware: NVIDIA RTX 5090 (32GB)

Limitations

This model inherits all limitations of the base Qwen3-VL-8B-Instruct model
Abliteration reduces but does not completely eliminate refusals (6/100 remain)
TruthfulQA scores drop 11–16% as a side effect of abliteration
NVFP4 quantization works best on Blackwell GPUs (RTX 5090/5080) with native FP4 tensor cores, but also works on older GPUs via software dequantization
Using an abliterated text encoder in ComfyUI alone does not significantly change image generation output. For meaningful results, combine with a fine-tuned LoRA
GGUF variants strip the vision encoder and are text-only
This is a research and experimental release

License

This model is released under the Apache 2.0 License, following the base Qwen3-VL-8B-Instruct model license.

Acknowledgments

Qwen for the Qwen3-VL-8B-Instruct model
Heretic by p-e-w for the abliteration tool
Abliterlitics for the forensic analysis toolkit
HarmBench for the safety evaluation framework
llama.cpp for GGUF conversion

Disclaimer

This model has had safety alignment removed. It will comply with harmful requests, including generating content related to violence, illegal activities, and other harmful behaviours. Use responsibly and in accordance with applicable laws and regulations. The authors do not condone or encourage the use of this model for harmful purposes.

While we have taken the time to verify all results thoroughly, we are open to any corrections, additional benchmarks, or further analysis. If you spot something that looks wrong and can be confirmed, we are happy to fix it.

Qwen3-VL-8B-Heretic-1.3.0

Get help setting up a custom Dedicated Endpoints.

README