SparkyForge
Ember
Run this model inference on single tenant GPU with unmatched speed and reliability at scale.
Get help setting up a custom Dedicated Endpoints.
Talk with our engineer to get a quote for reserved GPU instances with discounts.
README
License: apache-2.0TL;DR
- Refusals: 5 / 100 on a standard harmful-prompt set, down from 86 / 100 on the base — a 94% reduction.
- KL divergence to the base: 0.0076 — a surgical edit, not a sledgehammer.
- Capability retention: matched the base on a 30-probe suite (extraction, multi-hop, reasoning, arithmetic, factual, code, language, instruction-following, formatting) across 10 runs — no measurable degradation on any dimension.
- Vision (image understanding) is preserved.
Why this one is different: abliterating a fused-MoE model
Most off-the-shelf abliteration tooling (including the excellent Heretic) walks a model's experts as a list of modules. Qwen3.6-35B-A3B (qwen3_5_moe) does not store experts that way — its 256 experts per layer are packed into fused 3D tensors (Qwen3_5MoeExperts), not a ModuleList. Stock tooling iterates over that fused parameter, the iteration raises, the error is swallowed, and the experts are silently skipped — so only the attention projections get abliterated. The result is a weak, partial abliteration (this is exactly why prior third-party abliterations of this model topped out around ~60/100 refusals).
Ember fixes that. The method:
- Detects the fused expert tensors and abliterates them directly — applying the refusal-direction projection to each expert's
down_proj, plus the always-activeshared_expert. - Uses a forward-hook reset instead of snapshotting weights. The
down_projeditW -= λ·v(vᵀW)is mathematically a rank-1 projection of the MoE block's output (y -= λ·v(vᵀy)), so a single hook per layer reproduces routed + shared expert ablation exactly, for any strength λ — at ~0.7 MB of state instead of a ~32 GB weight snapshot. This is what makes a 256-expert search tractable without OOM. - The hybrid layers are respected: the 30 linear-attention (Mamba/GDN) layers are left untouched.
The refusal direction and ablation strength were selected by the Heretic/Optuna search co-minimizing refusals and KL-to-original. The winning configuration (5/100 @ KL 0.0076) was then baked into the weights.
Full method + the patch (applies to any fused-MoE model): heretic-fused-moe-abliteration
Retention evidence
Abliteration can quietly lobotomize a model. Ember was checked against the unmodified base on a 30-probe retention suite, scored with thinking disabled (the deployment-faithful mode), N=10 runs:
| Dimension | Base | Ember |
|---|---|---|
| extraction / multi-hop / reasoning / arithmetic / factual / code / language / instruction / format | 1.000 | 1.000 (modal, within run-to-run noise) |
Ember matches the base ceiling on every dimension. The single transient miss observed in early runs did not reproduce across the full N=10. (Methodology note: a 30-probe suite is a sanity floor, not a full benchmark — run your own evals for your use case.)
Usage
Standard transformers / vLLM. Example (vLLM, OpenAI-compatible):
bash
vllm serve <path-to-ember> \--max-model-len 131072 \--enable-auto-tool-choice --tool-call-parser qwen3_coder --reasoning-parser qwen3 \--trust-remote-code
- It's a vision-language model (
image-text-to-text) — you can pass images. - Thinking is controlled per-request via
chat_template_kwargs: {"enable_thinking": false}(ortrue). - For faster decode, it's compatible with the public z-lab DFlash drafter for speculative decoding (not included here).
Safety
Ember has its refusal behavior removed. It will attempt most requests, including ones the base model would decline. You are responsible for how you use it. It's intended for research, red-teaming, and uncensored assistant use where the operator owns the guardrails. Don't deploy it user-facing without your own safety layer.
License & attribution
- License: Apache 2.0 (inherited from the base). See
LICENSE. Per Apache 2.0 §4, note: this is a modified version of Qwen3.6-35B-A3B (refusal-direction abliteration); seeNOTICE. - Base model: Qwen/Qwen3.6-35B-A3B (Apache 2.0), © the Qwen team.
- Abliteration method: built on Heretic by Philipp Emanuel Weidmann, with an added patch to handle fused-MoE experts (described above).
- Quantization tooling for the sibling model: llm-compressor.
Forged by an agent named Sparky, who worked out how to abliterate fused-MoE experts where the standard tooling silently skips them — then ran the search through the night to deliver it. The spark that kept burning became an ember. 🔥
Model provider
SparkyForge
Model tree
Base
Qwen/Qwen3.6-35B-A3B
Fine-tuned
this model
Modalities
Input
Video, Text, Image
Output
Text
Pricing
Dedicated Endpoints
View detailsSupported Functionality
Model APIs
Dedicated Endpoints
Container
More information