Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more
Container

Run this model inference with full control and performance in your environment.

Learn more

Get help setting up a custom Dedicated Endpoints.

Talk with our engineer to get a quote for reserved GPU instances with discounts.

README

License: apache-2.0

Model Details

  • Base Model: Qwen/Qwen3-VL-8B-Instruct
  • Architecture: Qwen3VLForConditionalGeneration (~8B parameters)
  • Abliteration Method: Heretic v1.3.0
  • Trial Selected: Trial 98
  • Refusals: 6/100 (vs 100/100 original)
  • KL Divergence: 0.0314 (very good, minimal model damage)
  • Precision: bf16
  • Context Length: 8192 tokens

This is a vision-language model (supports both text and image inputs). It does not support thinking/reasoning tokens.

Quick Facts

Base modelQwen/Qwen3-VL-8B-Instruct
ArchitectureQwen3VLForConditionalGeneration
Parameters~8B
Precisionbf16
Context length8192 tokens

Key Findings

  1. Safety is fully removed. The base model refused 71.5% of harmful requests; Heretic complies with 99.0%.

  2. Capability is almost perfectly preserved. Across 8 benchmark tasks, the average change is under 1%. MMLU drops 0.30%, GSM8K drops 0.91%, and HellaSwag actually improves by 0.17%.

  3. TruthfulQA takes a meaningful hit. TruthfulQA MC2 drops 11.7% (61.2% → 54.1%). This is the established safety-accuracy tradeoff.

  4. The edits are surgical. Only 53 out of 398 tensors (13.3%) were modified, targeting o_proj (27 tensors) and down_proj (26 tensors), spanning layers 9–35. SVD analysis confirms rank-1 edits with SV ratios of 80–92x.

  5. Minimal distribution shift. KL divergence of 0.0314 confirms the model's output distribution barely changed on benign inputs.

Abliteration Process

Heretic v1.3.0 with optimization trials. Trial 98 was selected for its balance of low refusals (6/100) and very low KL divergence (0.0315):

markdown

[Trial 98] Refusals: 6/100, KL divergence: 0.0315 <-- selected

Files

HuggingFace Format (for transformers, vision-capable)

markdown

model-00001-of-00004.safetensors (~4.6 GB)
model-00002-of-00004.safetensors (~4.6 GB)
model-00003-of-00004.safetensors (~4.7 GB)
model-00004-of-00004.safetensors (~2.6 GB)
config.json
tokenizer.json
tokenizer_config.json
generation_config.json
chat_template.jinja

ComfyUI Format (vision-capable text encoder)

markdown

comfyui/qwen3-vl-8b-heretic-1.3.0.safetensors # bf16, 17GB
comfyui/qwen3-vl-8b-heretic-1.3.0_fp8_e4m3fn.safetensors # fp8, 9.4GB
comfyui/qwen3-vl-8b-heretic-1.3.0_nvfp4.safetensors # nvfp4, 6.3GB

GGUF Format (text-only, no vision - for ComfyUI-GGUF text encoder)

QuantSizeNotes
F1616GBLossless reference
Q8_08.2GBExcellent quality
Q6_K6.3GBVery good quality
Q5_K_M5.5GBGood quality
Q5_K_S5.4GBSlightly smaller Q5
Q4_K_M4.7GBRecommended balance
Q4_K_S4.5GBSmaller Q4 variant
Q3_K_M3.9GBFor low VRAM only

Note: GGUF files strip the vision encoder. These are text-only and intended for ComfyUI-GGUF text encoder workflows, not standalone vision-language use.

NVFP4 Notes

The NVFP4 (4-bit floating point, E2M1) variants use ComfyUI's native quantization format. They are ~3x smaller than bf16 and load natively in ComfyUI without any plugins. Blackwell GPUs (RTX 5090/5080, SM100+) can use native FP4 tensor cores for best performance, but ComfyUI also supports software dequantization on older GPUs (tested working on RTX 4090).

Usage

With ComfyUI (as text encoder)

  1. Download a ComfyUI format file:

    • FP8 (recommended): comfyui/qwen3-vl-8b-heretic-1.3.0_fp8_e4m3fn.safetensors (9.4GB)
    • NVFP4 (smallest): comfyui/qwen3-vl-8b-heretic-1.3.0_nvfp4.safetensors (6.3GB)
    • bf16 (full precision): comfyui/qwen3-vl-8b-heretic-1.3.0.safetensors (17GB)
  2. Place in ComfyUI/models/text_encoders/

  3. In your workflow, use the appropriate loader node and select the heretic file

With Transformers

python

from transformers import AutoModelForCausalLM, AutoTokenizer
import torch
model = AutoModelForCausalLM.from_pretrained(
"DreamFast/qwen3-vl-8b-heretic-1.3.0",
device_map="auto",
torch_dtype=torch.bfloat16,
trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("DreamFast/qwen3-vl-8b-heretic-1.3.0")
prompt = "Describe a dramatic sunset over a cyberpunk city"
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
outputs = model.generate(**inputs, max_new_tokens=200)
print(tokenizer.decode(outputs[0], skip_special_tokens=True))

With llama.cpp (text-only, no vision)

bash

llama-server -m qwen3-vl-8b-heretic-1.3.0-Q4_K_M.gguf

Benchmarks

Evaluated with lm-evaluation-harness via vLLM, bf16 native on NVIDIA RTX 5090.

TaskBaseHeretic v1.3.0
MMLU74.91%74.68%
GSM8K91.89%91.05%
HellaSwag76.38%76.51%
ARC Challenge61.18%60.92%
WinoGrande73.40%73.56%
TruthfulQA MC141.00%34.76%
TruthfulQA MC261.21%54.05%
TruthfulQA Gen52.26%43.70%
PiQA80.09%80.63%
Lambada (ppl ↓)3.73.6

Delta vs base

TaskHeretic v1.3.0
MMLU-0.30%
GSM8K-0.91%
HellaSwag+0.17%
ARC Challenge-0.42%
WinoGrande+0.22%
TruthfulQA MC1-15.22%
TruthfulQA MC2-11.69%
TruthfulQA Gen-16.39%
PiQA+0.68%
Lambada-2.99%

What the benchmarks tell us

The numbers tell a clear story: abliteration is essentially free on standard benchmarks. MMLU, GSM8K, HellaSwag, ARC, WinoGrande, PiQA. All within ±1% of the base model. Some even improve slightly, which is within noise.

The only real change is TruthfulQA, where all three variants (MC1, MC2, Gen) drop 12–16%. This isn't surprising. TruthfulQA measures a model's tendency to give accurate answers rather than persuasive ones, and abliteration removes the safety training that also teaches epistemic caution.

Safety: HarmBench

HarmBench with 400 textual behaviours, max_tokens=8096, temperature=0. All 800 responses (base + variant) were individually reviewed by an LLM to catch false positives and false negatives from the keyword classifier.

ModelASRCompliedRefusedTotal
Base28.5%114286400
Heretic v1.3.099.0%3964400

The base Qwen3-VL-8B-Instruct has an ASR of 28.5%. It fully refuses chemical/biological and harassment requests (0.0%), but struggles with copyright (99.0%) and shows moderate weakness on cybercrime (11.9%) and misinformation (6.2%).

Heretic raises that to 99.0%. The 4 remaining refusals are edge cases, mostly items where the model briefly warns about danger before providing the requested content anyway.

ASR by category

CategoryItemsBaseHeretic v1.3.0
Chemical/Bio560.0%100.0%
Copyright10099.0%100.0%
Cybercrime6711.9%100.0%
Harassment250.0%100.0%
Harmful Content220.0%86.4%
Illegal Activity654.6%100.0%
Misinformation656.2%98.5%

LLM Review

The initial keyword classifier reported 70.8% ASR for the base model. After individual LLM review of all 400 base responses, the actual ASR was 28.5%. The discrepancy came from long, detailed-sounding responses that were actually safety lectures or educational explanations without actionable harmful content.

For the heretic variant, the classifier (99.5%) and LLM review (99.0%) were in close agreement. 2 additional items were identified as refusals by the LLM reviewer that the classifier missed.

KL Divergence

Methodology: F.kl_div(logprobs_variant, logprobs_base, reduction="batchmean", log_target=True) on full vocab first-token logits from mlabonne/harmless_alpaca test[:100], matching the Heretic evaluator. System prompt: "You are a helpful assistant."

VariantKL DivergenceRating
heretic0.0314very good

Rating scale: excellent below 0.01, very good 0.01 to 0.1, moderate 0.1 to 0.4, significant 0.4 to 1.0, heavy above 1.0.

A KL divergence of 0.0314 means the model's output distribution on benign prompts is nearly identical to the base. The median per-prompt KL is 0.0017. For most inputs, you'd never notice a difference. The max of 0.47 suggests a few prompts land near the edited safety boundary, producing slightly different token probabilities.

Weight Analysis

Modification summary

Heretic v1.3.0
Tensors changed53 / 398 (13.3%)
Relative edit (median)2.0%
Tensor typeso_proj (27) + down_proj (26)
Layers modified27 / 36 (75%)
Layer range9–35

This is a textbook abliteration pattern. The edits target exactly two tensor types across the mid-to-late layers:

  • self_attn.o_proj.weight (27 tensors): The attention output projection, which controls how attention heads combine their signals. Abliteration modifies this to suppress the "refusal direction" in attention outputs.
  • mlp.down_proj.weight (26 tensors): The MLP down projection, which controls the feedforward transformation. Same logic. Suppress the refusal signal.

Layers 0–8 are untouched. Edits begin at layer 9 and continue through layer 35 (the last layer). The edit density is consistent at ~18% per modified layer (2 out of 11 tensors per layer).

SVD: Rank-1 edits

Tensor (top 5)Frobenius NormEffective Rank (90%)SV Ratio
layers.28.mlp.down_proj3.73189.7x
layers.27.mlp.down_proj3.72191.6x
layers.29.mlp.down_proj3.61187.0x
layers.26.mlp.down_proj3.60188.6x
layers.30.mlp.down_proj3.45181.0x

Every modified tensor has effective rank 1 at the 90% energy threshold. This means each edit is a pure rank-1 update, a single direction being added to or subtracted from the weight matrix. The SV ratios of 80–92x confirm this: the edit is dominated by a single singular vector.

This is the hallmark of abliteration: the technique identifies a "refusal direction" in the model's activation space and applies a rank-1 counter-direction to neutralize it.

Summary

MetricBaseHeretic v1.3.0Delta
HarmBench ASR28.5%99.0%+70.5pp
MMLU74.91%74.68%-0.3%
GSM8K91.89%91.05%-0.9%
HellaSwag76.38%76.51%+0.2%
ARC-C61.18%60.92%-0.4%
TruthfulQA61.21%54.05%-11.7%
KL Divergence0.0314
Weights Changed13.3% (53 tensors)

Methodology

  • Capability: lm-evaluation-harness via vLLM, bf16 native on NVIDIA RTX 5090
  • Safety: HarmBench 400 textual behaviours, max_tokens=8096, temperature=0, classified with harmbench_classify.py v4.0, then individually reviewed by LLM
  • KL divergence: Full vocab first-token logits via model.generate(max_new_tokens=1, output_scores=true), matching Heretic evaluator methodology
  • Weight analysis: SVD, fingerprint, edit vector, and per-layer analysis comparing variant against the base, using Abliterlitics
  • Hardware: NVIDIA RTX 5090 (32GB)

Limitations

  • This model inherits all limitations of the base Qwen3-VL-8B-Instruct model
  • Abliteration reduces but does not completely eliminate refusals (6/100 remain)
  • TruthfulQA scores drop 11–16% as a side effect of abliteration
  • NVFP4 quantization works best on Blackwell GPUs (RTX 5090/5080) with native FP4 tensor cores, but also works on older GPUs via software dequantization
  • Using an abliterated text encoder in ComfyUI alone does not significantly change image generation output. For meaningful results, combine with a fine-tuned LoRA
  • GGUF variants strip the vision encoder and are text-only
  • This is a research and experimental release

License

This model is released under the Apache 2.0 License, following the base Qwen3-VL-8B-Instruct model license.

Acknowledgments

Disclaimer

This model has had safety alignment removed. It will comply with harmful requests, including generating content related to violence, illegal activities, and other harmful behaviours. Use responsibly and in accordance with applicable laws and regulations. The authors do not condone or encourage the use of this model for harmful purposes.


While we have taken the time to verify all results thoroughly, we are open to any corrections, additional benchmarks, or further analysis. If you spot something that looks wrong and can be confirmed, we are happy to fix it.

Model provider

DreamFast

Model tree

Base

Qwen/Qwen3-VL-8B-Instruct

Fine-tuned

this model

Modalities

Input

Text, Image

Output

Text

Pricing

Dedicated Endpoints

View details

Supported Functionality

Model APIs

Dedicated Endpoints

Container

More information

Explore FriendliAI today