dominant-strategies
Qwen3.6-27B-heretic-pearl
License: apache-2.0
What this is based on
- Original base (architecture): Qwen/Qwen3.6-27B, a hybrid (linear-attention + full-attention) vision-language model (model_type: qwen3_5, image-text-to-text, with the 15 MTP heads intact).
- Direct source (the weights we quantized): llmfan46/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved, a Heretic-decensored / abliterated build of Qwen3.6-27B (made with Heretic v1.3.0 and a variant of the Magnitude-Preserving Orthogonal Ablation method; reported 94% fewer refusals, 6/100 vs 92/100, at ~0.002 KL divergence vs the original), with the native MTP heads preserved. All credit for the base + decensoring work goes to llmfan46; this repo only re-quantizes their weights.
What we did to make it Pearl-quantized
The model is quantized with quant_method: "pearl", format: "int-quantized". The
scheme is int7 weights on every matmul layer (so each is a valid Pearl mining
GEMM), with a small set of layers deliberately left in higher precision for
quality, plus an int8 token embedding to claw back VRAM for context.
int7 (7-bit), per-channel, symmetric, with dynamic int7 input activations - applied to:
| Layer | Regex |
|---|---|
| Attention projections | self_attn.{q,k,v,o}_proj |
| MLP projections | mlp.{gate,up,down}_proj |
| Linear-attention projections | linear_attn.{in_proj_qkv,in_proj_z,out_proj} |
| LM head | lm_head |
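To make "per-channel, symmetric int7" concrete, here is a minimal NumPy sketch of the weight side. It is an illustration, not the plugin's code: the function names are hypothetical, the symmetric range [-63, 63] is an assumption (a real kernel may also allow -64), and values are held in int8 containers because NumPy has no 7-bit type.

```python
import numpy as np

INT7_MAX = 63  # assumed symmetric int7 range [-63, 63]


def quantize_int7_per_channel(w: np.ndarray):
    """Quantize a [out_features, in_features] weight with one scale per output channel."""
    # The largest magnitude in each output row decides that row's scale.
    absmax = np.abs(w).max(axis=1, keepdims=True)
    scale = np.where(absmax > 0, absmax / INT7_MAX, 1.0).astype(np.float32)
    # Round to nearest integer and clamp into the 7-bit range.
    q = np.clip(np.rint(w / scale), -INT7_MAX, INT7_MAX).astype(np.int8)
    return q, scale


def dequantize(q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Recover an approximation of the original weights."""
    return q.astype(np.float32) * scale


rng = np.random.default_rng(0)
w = rng.standard_normal((4, 16)).astype(np.float32)
q, scale = quantize_int7_per_channel(w)
max_err = np.abs(dequantize(q, scale) - w).max()
```

Because the scheme is symmetric there is no zero point, and the rounding error per weight stays within half of that channel's scale.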
Left in bf16 (not quantized) - all *norm*, the entire vision tower
(*.visual.*), and the linear-attention internals that are numerically sensitive
(in_proj_a, in_proj_b, conv1d, A_log, dt_bias). The token embedding is
int8, not bf16 - see the e8 note below.
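The split between int7, int8, and bf16 modules can be sketched as a small name classifier. This is only an illustration of the rules listed above: the regex is our own expansion of the table's patterns, `precision_for` is a hypothetical helper, and the module names in the usage note are made up rather than read from the checkpoint.

```python
import re

# Regex expanded by hand from the layer table; not copied from the real config.
INT7 = re.compile(
    r"(self_attn\.[qkvo]_proj"
    r"|mlp\.(gate|up|down)_proj"
    r"|linear_attn\.(in_proj_qkv|in_proj_z|out_proj)"
    r"|(^|\.)lm_head)$"
)


def precision_for(module_name: str) -> str:
    """Return 'int7', 'int8', or 'bf16' for a module name (without '.weight')."""
    # Exclusions win first: the whole vision tower and every norm stay in bf16.
    if ".visual." in module_name or "norm" in module_name:
        return "bf16"
    # The token embedding is handled separately as int8.
    if module_name.endswith("embed_tokens"):
        return "int8"
    if INT7.search(module_name):
        return "int7"
    # Everything else (in_proj_a, in_proj_b, conv1d, A_log, dt_bias, ...) stays bf16.
    return "bf16"
```

With hypothetical names, `precision_for("model.layers.0.self_attn.q_proj")` gives `"int7"`, `precision_for("model.layers.0.linear_attn.in_proj_a")` gives `"bf16"`, and `precision_for("model.embed_tokens")` gives `"int8"`.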
The e8 part - int8 token embedding. The 248,320-row embed_tokens table is
stored as int8 (I8 weights with per-row bf16 dequant scales) - excluded from the
int7 group and handled by the plugin's embedding patch. On a 32 GiB card this frees
~1.2 GiB, which we spend on a much larger KV-cache / context window at no measurable
quality or throughput cost.
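A rough sketch of the e8 idea, with the memory arithmetic behind the ~1.2 GiB figure. Assumptions are ours: the hidden size of 5,120 is not stated in this card, the helper names are hypothetical, and scales are kept in float32 here because NumPy has no bf16 type.

```python
import numpy as np


def quantize_rows_int8(table: np.ndarray):
    """Per-row symmetric int8: one dequant scale per vocabulary row."""
    absmax = np.abs(table).max(axis=1, keepdims=True)
    scale = np.where(absmax > 0, absmax / 127.0, 1.0).astype(np.float32)
    q = np.clip(np.rint(table / scale), -127, 127).astype(np.int8)
    return q, scale


def embed(token_ids, q: np.ndarray, scale: np.ndarray) -> np.ndarray:
    """Look up rows, then dequantize only the rows that were requested."""
    ids = np.asarray(token_ids)
    return q[ids].astype(np.float32) * scale[ids]


def bytes_saved(rows: int, hidden: int) -> int:
    """bf16 table (2 bytes per value) minus int8 table plus one bf16 scale per row."""
    return rows * hidden * 2 - (rows * hidden + rows * 2)


ROWS = 248_320  # from this card
HIDDEN = 5_120  # assumption, not stated in this card
saved_gib = bytes_saved(ROWS, HIDDEN) / 2**30
```

Under that assumed hidden size, `saved_gib` comes out at about 1.18, which is in line with the ~1.2 GiB quoted above.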
Net effect: the heavy GEMMs (attention, MLP, linear-attn, LM head) are int7, roughly 2×
smaller than bf16 (7-bit values in place of 16-bit ones) and in Pearl's
Int7xInt7→Int32 mining shape, while norms, the vision encoder, and the
mamba/linear-attention internals keep full precision, and the embedding is int8.
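The Int7xInt7→Int32 shape can be illustrated with a plain NumPy linear layer: quantize activations per token at run time, multiply integer matrices into an int32 accumulator, then rescale to float. This is a sketch under our own assumptions (symmetric range [-63, 63], hypothetical function names) and it models only the clean inference output, not the pearl_gemm kernel or anything mining-related.

```python
import numpy as np

INT7_MAX = 63  # assumed symmetric int7 range


def quantize_rows(x: np.ndarray):
    """Symmetric int7 with one scale per row (per token, or per output channel)."""
    absmax = np.abs(x).max(axis=1, keepdims=True)
    scale = np.where(absmax > 0, absmax / INT7_MAX, 1.0).astype(np.float32)
    q = np.clip(np.rint(x / scale), -INT7_MAX, INT7_MAX).astype(np.int8)
    return q, scale


def int7_linear(x: np.ndarray, w: np.ndarray) -> np.ndarray:
    """Compute y = x @ w.T as an integer GEMM followed by a float rescale."""
    xq, xs = quantize_rows(x)  # dynamic activation quantization, scales [tokens, 1]
    wq, ws = quantize_rows(w)  # done once offline in practice, scales [out, 1]
    # Integer matmul with an int32 accumulator.
    acc = xq.astype(np.int32) @ wq.astype(np.int32).T
    # Undo both scales: one per token row, one per output channel column.
    return acc.astype(np.float32) * xs * ws.T


rng = np.random.default_rng(0)
x = rng.standard_normal((3, 64)).astype(np.float32)
w = rng.standard_normal((5, 64)).astype(np.float32)
y_ref = x @ w.T
y_int7 = int7_linear(x, w)
rel_err = np.abs(y_int7 - y_ref).max() / np.abs(y_ref).max()
```

Each product is at most 63 × 63 = 3,969, so an int32 accumulator has headroom for inner dimensions up to roughly half a million terms.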
Why int7 (the merge-mining point)
Pearl's Proof-of-Useful-Work is a noisy int7 × int7 → int32 GEMM whose folded
transcript is hashed against the difficulty target. By quantizing the model's
matmuls to that exact format, the Pearl GPU kernel (pearl_gemm) computes the
clean inference output and a mining share from the same matmul. A prefill burst
therefore both answers the prompt and submits real, consensus-valid pool shares -
"useful work" in the literal sense.
How to run it
This model requires the Pearl vLLM plugin (quant_method: pearl) and the
pearl_gemm CUDA kernels; stock vLLM/transformers will not interpret the pearl
quantization. Target GPU: sm_120 (RTX 5090 / 5080); the build is the multimodal
(VL) variant.
```python
# with the Pearl plugin installed (vllm_miner + pearl_gemm):
from vllm import LLM

llm = LLM(
    model="dominant-strategies/Qwen3.6-27B-heretic-pearl",
    quantization="pearl",
    dtype="bfloat16",
    trust_remote_code=True,
)
```
The turnkey merge-mining miner pulls this repo, serves it with the lazy-load controller, and submits shares to a Pearl pool. Under that controller the model sleeps when idle so the GPU mines at full rate, wakes in about 1.2 s when a chat request arrives, and merge-mines during prefill.
Files
Standard transformers layout: config.json (with the pearl quantization_config),
12 sharded *.safetensors + model-auxiliary.safetensors, model.safetensors.index.json,
tokenizer, chat template, and the vision/video preprocessor configs.
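Since stock vLLM/transformers will not interpret this checkpoint, a loader can check config.json before trying. The sketch below is a hypothetical pre-flight helper, not part of the plugin; the sample config is trimmed and made up, and only the quant_method and format values come from this card.

```python
import json


def is_pearl_quantized(config: dict) -> bool:
    """True when config.json declares the pearl quantization described in this card."""
    qcfg = config.get("quantization_config") or {}
    return qcfg.get("quant_method") == "pearl"


# Trimmed, hypothetical config.json contents for illustration.
sample = json.loads("""
{
  "model_type": "qwen3_5",
  "quantization_config": {"quant_method": "pearl", "format": "int-quantized"}
}
""")

needs_plugin = is_pearl_quantized(sample)
```

A caller could use `needs_plugin` to fail early with a clear message when the Pearl plugin is not installed.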
License & attribution
Apache-2.0, following the Qwen3.6-27B license. Base architecture by Qwen; decensoring by llmfan46; Pearl quantization and packaging by dominant-strategies. This is a derivative redistributed for use on the Pearl network.
Model provider: dominant-strategies
Model tree: base llmfan46/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved; quantized: this model
Modalities: input Video, Text, Image; output Text