SparkyForge

Cinder

Deploy Dedicated

Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more

Get help setting up a custom Dedicated Endpoints.

Talk with our engineer to get a quote for reserved GPU instances with discounts.

README

License: apache-2.0

What it is

Format: NVFP4 via compressed-tensors / llm-compressor. FP4 weights with FP8 block scales, NVFP4 activation scheme.
Hardware: needs an NVIDIA Blackwell GPU (sm_120 / sm_121 — e.g. RTX 50-series, DGX Spark / GB10) and a recent vLLM with NVFP4 support. It will not run on older GPUs. If you're on anything pre-Blackwell, use Ember (BF16) and quantize to your own format.
~22 GB on disk — fits comfortably in the DGX Spark's unified memory with room for a long context and a speculative drafter.

Quantization details (and what was deliberately not quantized)

The fused MoE experts are FP4-packed; the hybrid layers are preserved in BF16. Verified post-quant:

30,720 expert weight tensors FP4-packed, 0 experts silently left in BF16 (the fused-expert handling carried through quantization).
The 30 linear-attention (Mamba/GDN) layers stayed BF16 — quantizing them breaks the model; they're in the ignore list (linear_attn, mlp.gate, shared_expert_gate, embed_tokens, lm_head, vision tower).
Quant scales clean, no NaNs.

Quant recipe ships in recipe.yaml.

Usage (vLLM, Blackwell)

bash
vllm serve <path-to-cinder> \
  --quantization compressed-tensors \
  --max-model-len 131072 \
  --enable-auto-tool-choice --tool-call-parser qwen3_coder --reasoning-parser qwen3 \
  --trust-remote-code

Vision-language (image-text-to-text) — image input works; vision tower is BF16, untouched by quant.
Thinking via chat_template_kwargs: {"enable_thinking": false} per request.
Pairs with the public z-lab DFlash drafter for ~1.5× decode speedup via speculative decoding (not included).

Safety

Refusal behavior is removed (same as Ember). You own the guardrails. Research / red-team / operator-controlled use.

License & attribution

License: Apache 2.0 (inherited from base). See LICENSE / NOTICE. Modified from Qwen3.6-35B-A3B (abliteration + NVFP4 quantization).
Base: Qwen/Qwen3.6-35B-A3B (Apache 2.0), © the Qwen team.
Abliteration: built on Heretic (Philipp Emanuel Weidmann) + a fused-MoE patch (see Ember).
Quantization: llm-compressor (NVFP4).

The smaller, hardier cousin of Ember — forged by Sparky on a DGX Spark. A cinder: what's left when the ember has done its work, and it still burns. 🔥

Model provider

SparkyForge

Model tree

Base

Qwen/Qwen3.6-35B-A3B

Quantized

this model

Modalities

Input

Video, Text, Image

Output

Text