SparkyForge
Cinder
Run this model inference on single tenant GPU with unmatched speed and reliability at scale.
Get help setting up a custom Dedicated Endpoints.
Talk with our engineer to get a quote for reserved GPU instances with discounts.
README
License: apache-2.0What it is
- Format: NVFP4 via compressed-tensors / llm-compressor. FP4 weights with FP8 block scales, NVFP4 activation scheme.
- Hardware: needs an NVIDIA Blackwell GPU (sm_120 / sm_121 — e.g. RTX 50-series, DGX Spark / GB10) and a recent vLLM with NVFP4 support. It will not run on older GPUs. If you're on anything pre-Blackwell, use Ember (BF16) and quantize to your own format.
- ~22 GB on disk — fits comfortably in the DGX Spark's unified memory with room for a long context and a speculative drafter.
Quantization details (and what was deliberately not quantized)
The fused MoE experts are FP4-packed; the hybrid layers are preserved in BF16. Verified post-quant:
- 30,720 expert weight tensors FP4-packed, 0 experts silently left in BF16 (the fused-expert handling carried through quantization).
- The 30 linear-attention (Mamba/GDN) layers stayed BF16 — quantizing them breaks the model; they're in the ignore list (
linear_attn,mlp.gate,shared_expert_gate,embed_tokens,lm_head, vision tower). - Quant scales clean, no NaNs.
Quant recipe ships in recipe.yaml.
Usage (vLLM, Blackwell)
bash
vllm serve <path-to-cinder> \--quantization compressed-tensors \--max-model-len 131072 \--enable-auto-tool-choice --tool-call-parser qwen3_coder --reasoning-parser qwen3 \--trust-remote-code
- Vision-language (
image-text-to-text) — image input works; vision tower is BF16, untouched by quant. - Thinking via
chat_template_kwargs: {"enable_thinking": false}per request. - Pairs with the public z-lab DFlash drafter for ~1.5× decode speedup via speculative decoding (not included).
Safety
Refusal behavior is removed (same as Ember). You own the guardrails. Research / red-team / operator-controlled use.
License & attribution
- License: Apache 2.0 (inherited from base). See
LICENSE/NOTICE. Modified from Qwen3.6-35B-A3B (abliteration + NVFP4 quantization). - Base: Qwen/Qwen3.6-35B-A3B (Apache 2.0), © the Qwen team.
- Abliteration: built on Heretic (Philipp Emanuel Weidmann) + a fused-MoE patch (see Ember).
- Quantization: llm-compressor (NVFP4).
The smaller, hardier cousin of Ember — forged by Sparky on a DGX Spark. A cinder: what's left when the ember has done its work, and it still burns. 🔥
Model provider
SparkyForge
Model tree
Base
Qwen/Qwen3.6-35B-A3B
Quantized
this model
Modalities
Input
Video, Text, Image
Output
Text
Pricing
Dedicated Endpoints
View detailsSupported Functionality
Model APIs
Dedicated Endpoints
Container
More information