SparkyForge/Ember API & Inference Endpoint

TL;DR

Refusals: 5 / 100 on a standard harmful-prompt set, down from 86 / 100 on the base — a 94% reduction.
KL divergence to the base: 0.0076 — a surgical edit, not a sledgehammer.
Capability retention: matched the base on a 30-probe suite (extraction, multi-hop, reasoning, arithmetic, factual, code, language, instruction-following, formatting) across 10 runs — no measurable degradation on any dimension.
Vision (image understanding) is preserved.

Why this one is different: abliterating a fused-MoE model

Most off-the-shelf abliteration tooling (including the excellent Heretic) walks a model's experts as a list of modules. Qwen3.6-35B-A3B (qwen3_5_moe) does not store experts that way — its 256 experts per layer are packed into fused 3D tensors (Qwen3_5MoeExperts), not a ModuleList. Stock tooling iterates over that fused parameter, the iteration raises, the error is swallowed, and the experts are silently skipped — so only the attention projections get abliterated. The result is a weak, partial abliteration (this is exactly why prior third-party abliterations of this model topped out around ~60/100 refusals).

Ember fixes that. The method:

Detects the fused expert tensors and abliterates them directly — applying the refusal-direction projection to each expert's down_proj, plus the always-active shared_expert.
Uses a forward-hook reset instead of snapshotting weights. The down_proj edit W -= λ·v(vᵀW) is mathematically a rank-1 projection of the MoE block's output (y -= λ·v(vᵀy)), so a single hook per layer reproduces routed + shared expert ablation exactly, for any strength λ — at ~0.7 MB of state instead of a ~32 GB weight snapshot. This is what makes a 256-expert search tractable without OOM.
The hybrid layers are respected: the 30 linear-attention (Mamba/GDN) layers are left untouched.

The refusal direction and ablation strength were selected by the Heretic/Optuna search co-minimizing refusals and KL-to-original. The winning configuration (5/100 @ KL 0.0076) was then baked into the weights.

Full method + the patch (applies to any fused-MoE model): heretic-fused-moe-abliteration

Retention evidence

Abliteration can quietly lobotomize a model. Ember was checked against the unmodified base on a 30-probe retention suite, scored with thinking disabled (the deployment-faithful mode), N=10 runs:

Table
Dimension	Base	Ember
extraction / multi-hop / reasoning / arithmetic / factual / code / language / instruction / format	1.000	1.000 (modal, within run-to-run noise)

Ember matches the base ceiling on every dimension. The single transient miss observed in early runs did not reproduce across the full N=10. (Methodology note: a 30-probe suite is a sanity floor, not a full benchmark — run your own evals for your use case.)

Usage

Standard transformers / vLLM. Example (vLLM, OpenAI-compatible):

bash
vllm serve <path-to-ember> \
  --max-model-len 131072 \
  --enable-auto-tool-choice --tool-call-parser qwen3_coder --reasoning-parser qwen3 \
  --trust-remote-code

It's a vision-language model (image-text-to-text) — you can pass images.
Thinking is controlled per-request via chat_template_kwargs: {"enable_thinking": false} (or true).
For faster decode, it's compatible with the public z-lab DFlash drafter for speculative decoding (not included here).

Safety

Ember has its refusal behavior removed. It will attempt most requests, including ones the base model would decline. You are responsible for how you use it. It's intended for research, red-teaming, and uncensored assistant use where the operator owns the guardrails. Don't deploy it user-facing without your own safety layer.

License & attribution

License: Apache 2.0 (inherited from the base). See LICENSE. Per Apache 2.0 §4, note: this is a modified version of Qwen3.6-35B-A3B (refusal-direction abliteration); see NOTICE.
Abliteration method: built on Heretic by Philipp Emanuel Weidmann, with an added patch to handle fused-MoE experts (described above).
Quantization tooling for the sibling model: llm-compressor.

Forged by an agent named Sparky, who worked out how to abliterate fused-MoE experts where the standard tooling silently skips them — then ran the search through the night to deliver it. The spark that kept burning became an ember. 🔥

Ember

Get help setting up a custom Dedicated Endpoints.

README