dominant-strategies

Qwen3.6-27B-heretic-pearl


README

License: apache-2.0

What this is based on

  • Original base (architecture): Qwen/Qwen3.6-27B
    • a hybrid (linear-attention + full-attention) vision-language model (model_type: qwen3_5, image-text-to-text, with the 15 MTP heads intact).
  • Direct source (the weights we quantized): llmfan46/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved
    • a heretic decensored / abliterated build of Qwen3.6-27B (made with Heretic v1.3.0 and a variant of the Magnitude-Preserving Orthogonal Ablation method; reported 94% fewer refusals, 6/100 vs 92/100, at ~0.002 KL divergence vs the original), with the native MTP heads preserved. All credit for the base + decensoring work goes to llmfan46; this repo only re-quantizes their weights.

What we did to make it Pearl-quantized

The model is quantized with quant_method: "pearl", format: "int-quantized". The scheme is int7 weights on every matmul layer (so each is a valid Pearl mining GEMM), with a small set of layers deliberately left in higher precision for quality, plus an int8 token embedding to claw back VRAM for context.

int7 (7-bit), per-channel, symmetric, with dynamic int7 input activations - applied to:

| Layer | Regex |
| --- | --- |
| Attention projections | self_attn.{q,k,v,o}_proj |
| MLP projections | mlp.{gate,up,down}_proj |
| Linear-attention projections | linear_attn.{in_proj_qkv,in_proj_z,out_proj} |
| LM head | lm_head |
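
As a rough illustration of what "int7, per-channel, symmetric, with dynamic int7 input activations" means numerically, here is a minimal NumPy sketch. It is not the Pearl plugin's code: the clamp range of ±63, the per-token activation scaling, and the helper names `quantize_rows_int7` and `int7_linear` are assumptions made for the example.

```python
import numpy as np

INT7_MAX = 63  # assumed symmetric range [-63, 63]; the README does not state the clamp


def quantize_rows_int7(x: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Symmetric int7 quantization with one scale per row (no zero point)."""
    absmax = np.abs(x).max(axis=1, keepdims=True)
    scale = np.where(absmax > 0, absmax / INT7_MAX, 1.0).astype(np.float32)
    q = np.clip(np.rint(x / scale), -INT7_MAX, INT7_MAX).astype(np.int8)
    return q, scale


def int7_linear(x: np.ndarray, w_q: np.ndarray, w_scale: np.ndarray) -> np.ndarray:
    """Compute y = x @ W.T with int7 x int7 -> int32 accumulation."""
    x_q, x_scale = quantize_rows_int7(x)                    # dynamic: scales computed per call
    acc = x_q.astype(np.int32) @ w_q.astype(np.int32).T     # integer accumulator
    return acc.astype(np.float32) * x_scale * w_scale.T     # dequantize with both scales


rng = np.random.default_rng(0)
w = rng.standard_normal((16, 64)).astype(np.float32)   # (out_features, in_features)
x = rng.standard_normal((4, 64)).astype(np.float32)    # (tokens, in_features)

w_q, w_scale = quantize_rows_int7(w)   # one scale per output channel, computed once offline
y = int7_linear(x, w_q, w_scale)

reference = x @ w.T
rel_err = np.linalg.norm(y - reference) / np.linalg.norm(reference)
print(f"relative error vs float matmul: {rel_err:.4f}")
```

The weights are quantized once, while the activation scales are recomputed on every call, which is what "dynamic" refers to here.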

Left in bf16 (not quantized) - all *norm*, the entire vision tower (*.visual.*), and the linear-attention internals that are numerically sensitive (in_proj_a, in_proj_b, conv1d, A_log, dt_bias). The token embedding is int8, not bf16 - see the e8 note below.

The e8 part - int8 token embedding. The 248,320-row embed_tokens table is stored as int8 (I8 weights with per-row bf16 dequant scales) - excluded from the int7 group and handled by the plugin's embedding patch. On a 32 GiB card this frees ~1.2 GiB, which we spend on a much larger KV-cache / context window at no measurable quality or throughput cost.
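
The ~1.2 GiB figure can be sanity-checked with a little arithmetic, and the per-row scheme is easy to sketch. The example below assumes a hidden size of 5,120, which this README does not state, and uses float32 scales because NumPy has no bf16 type. `quantize_embedding_int8` and `lookup` are hypothetical helpers, not the plugin's embedding patch.

```python
import numpy as np

VOCAB_ROWS = 248_320   # from the README
HIDDEN_SIZE = 5_120    # assumption: not stated in the README


def embedding_bytes(rows: int, hidden: int, weight_bytes: int, scale_bytes: int = 0) -> int:
    """Storage for the table plus an optional per-row scale."""
    return rows * hidden * weight_bytes + rows * scale_bytes


bf16_bytes = embedding_bytes(VOCAB_ROWS, HIDDEN_SIZE, weight_bytes=2)
int8_bytes = embedding_bytes(VOCAB_ROWS, HIDDEN_SIZE, weight_bytes=1, scale_bytes=2)
saved_gib = (bf16_bytes - int8_bytes) / 2**30
print(f"saved: {saved_gib:.2f} GiB")


def quantize_embedding_int8(table: np.ndarray) -> tuple[np.ndarray, np.ndarray]:
    """Per-row symmetric int8 quantization of an embedding table."""
    absmax = np.abs(table).max(axis=1, keepdims=True)
    scale = np.where(absmax > 0, absmax / 127.0, 1.0).astype(np.float32)
    q = np.clip(np.rint(table / scale), -127, 127).astype(np.int8)
    return q, scale


def lookup(q: np.ndarray, scale: np.ndarray, token_ids: np.ndarray) -> np.ndarray:
    """Gather int8 rows and dequantize only the rows that were requested."""
    return q[token_ids].astype(np.float32) * scale[token_ids]


rng = np.random.default_rng(0)
table = rng.standard_normal((32, 8)).astype(np.float32)   # tiny stand-in table
q, scale = quantize_embedding_int8(table)
token_ids = np.array([3, 7, 7, 0])
rows = lookup(q, scale, token_ids)
max_err = float(np.abs(rows - table[token_ids]).max())
```

Under that hidden-size assumption the saving comes out slightly below 1.2 GiB, consistent with the figure quoted above.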

Net effect: the heavy GEMMs (attention, MLP, linear-attn, LM head) are int7 - roughly 2× smaller than bf16 (7-bit values against 16-bit) and in Pearl's Int7xInt7→Int32 mining shape - while norms, the vision encoder, and the mamba/linear-attention internals keep full precision, and the embedding is int8.
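
To make the three precision groups concrete, the sketch below sorts module names into them with ordinary regular expressions. The patterns are transcribed from the lists above. The full module paths in the examples (such as `model.language_model.layers.0...`) are hypothetical and may not match the checkpoint's real names, and `precision_for` is an illustrative helper rather than part of the plugin.

```python
import re

# Checked first, so anything under the vision tower or named *norm* stays bf16
# even if it also looks like a projection.
BF16_PATTERNS = [
    r"norm",
    r"\.visual\.",
    r"linear_attn\.(in_proj_a|in_proj_b|conv1d|A_log|dt_bias)$",
]
INT8_PATTERNS = [r"embed_tokens$"]
INT7_PATTERNS = [
    r"self_attn\.(q|k|v|o)_proj$",
    r"mlp\.(gate|up|down)_proj$",
    r"linear_attn\.(in_proj_qkv|in_proj_z|out_proj)$",
    r"(^|\.)lm_head$",
]


def precision_for(module_name: str) -> str:
    """Return 'bf16', 'int8', 'int7', or 'unlisted' for a module name."""
    groups = (("bf16", BF16_PATTERNS), ("int8", INT8_PATTERNS), ("int7", INT7_PATTERNS))
    for label, patterns in groups:
        if any(re.search(p, module_name) for p in patterns):
            return label
    return "unlisted"


examples = [
    "model.language_model.layers.0.self_attn.q_proj",
    "model.language_model.layers.0.self_attn.q_norm",
    "model.language_model.layers.1.linear_attn.in_proj_a",
    "model.language_model.layers.1.linear_attn.out_proj",
    "model.visual.blocks.0.attn.proj",
    "model.language_model.embed_tokens",
    "lm_head",
]
for name in examples:
    print(f"{precision_for(name):8s} {name}")
```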

Why int7 (the merge-mining point)

Pearl's Proof-of-Useful-Work is a noisy int7 × int7 → int32 GEMM whose folded transcript is hashed against the difficulty target. By quantizing the model's matmuls to that exact format, the Pearl GPU kernel (pearl_gemm) computes the clean inference output and a mining share from the same matmul. A prefill burst therefore both answers the prompt and submits real, consensus-valid pool shares - "useful work" in the literal sense.
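
The idea of getting an inference result and a share candidate from one matmul can be pictured with a deliberately simplified toy. This is not Pearl's protocol: it leaves out the noise and the transcript folding described above, and simply hashes the int32 accumulator with SHA-256 and compares it to a made-up target. The function `gemm_with_share` is hypothetical.

```python
import hashlib

import numpy as np


def gemm_with_share(a_q: np.ndarray, b_q: np.ndarray, target: int) -> tuple[np.ndarray, bool]:
    """Toy only: one integer GEMM yields both the result and a hash-vs-target check."""
    acc = a_q.astype(np.int32) @ b_q.astype(np.int32)    # the "useful" output
    digest = hashlib.sha256(acc.tobytes()).digest()      # stand-in for the hashed transcript
    is_share = int.from_bytes(digest, "big") < target    # lower target means harder
    return acc, is_share


rng = np.random.default_rng(0)
a_q = rng.integers(-63, 64, size=(8, 32), dtype=np.int8)   # int7-range activations
b_q = rng.integers(-63, 64, size=(32, 8), dtype=np.int8)   # int7-range weights

easy_target = 2**256   # every digest is below this, so the check always passes
acc, found = gemm_with_share(a_q, b_q, easy_target)
_, found_impossible = gemm_with_share(a_q, b_q, target=0)
print(acc.dtype, found, found_impossible)
```

The point of the toy is only that the accumulator is computed once and used twice. How the real kernel derives its transcript and keeps the inference output clean is not covered here.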

How to run it

This model requires the Pearl vLLM plugin (quant_method: pearl) and the pearl_gemm CUDA kernels; stock vLLM/transformers will not interpret the pearl quantization. Target GPU: sm_120 (RTX 5090 / 5080); the build is the multimodal (VL) variant.

```python
# with the Pearl plugin installed (vllm_miner + pearl_gemm):
from vllm import LLM

llm = LLM(
    model="dominant-strategies/Qwen3.6-27B-heretic-pearl",
    quantization="pearl",
    dtype="bfloat16",
    trust_remote_code=True,
)
```

The turnkey merge-mining miner pulls this repo, serves it with the lazy-load controller, and submits shares to a Pearl pool. With that controller the model sleeps when idle so the GPU mines at full rate, wakes in ~1.2 s when a chat request arrives, and merge-mines during prefill.

Files

Standard transformers layout: config.json (with the pearl quantization_config), 12 sharded *.safetensors + model-auxiliary.safetensors, model.safetensors.index.json, tokenizer, chat template, and the vision/video preprocessor configs.
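
Before pointing vLLM at a local copy, it can be useful to confirm that the directory really is a Pearl checkpoint and that every shard named in the index is present. The sketch below uses only the standard library. It assumes `quantization_config` sits at the top level of config.json and that the index follows the usual `weight_map` layout. `check_pearl_checkpoint` is a hypothetical helper, and the demo runs against a throwaway directory with a made-up shard name.

```python
import json
import tempfile
from pathlib import Path


def check_pearl_checkpoint(model_dir) -> dict:
    """Report the quantization settings and any shard files missing from model_dir."""
    root = Path(model_dir)
    config = json.loads((root / "config.json").read_text())
    qcfg = config.get("quantization_config", {})   # assumed to be a top-level key
    index = json.loads((root / "model.safetensors.index.json").read_text())
    shards = sorted(set(index["weight_map"].values()))
    return {
        "quant_method": qcfg.get("quant_method"),
        "format": qcfg.get("format"),
        "shards": shards,
        "missing": [s for s in shards if not (root / s).exists()],
    }


# Demo on a temporary directory; "shard-a.safetensors" is a placeholder name.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "config.json").write_text(json.dumps(
        {"quantization_config": {"quant_method": "pearl", "format": "int-quantized"}}))
    (root / "model.safetensors.index.json").write_text(json.dumps(
        {"weight_map": {"lm_head.weight": "shard-a.safetensors"}}))
    report = check_pearl_checkpoint(root)

print(report)
```

On a complete download `missing` should be empty and `quant_method` should read `pearl`.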

License & attribution

Apache-2.0, following the Qwen3.6-27B license. Base architecture by Qwen; decensoring by llmfan46; Pearl quantization and packaging by dominant-strategies. This is a derivative redistributed for use on the Pearl network.

Model provider: dominant-strategies

Model tree

  • Base: llmfan46/Qwen3.6-27B-uncensored-heretic-v2-Native-MTP-Preserved
  • Quantized: this model

Modalities

  • Input: Video, Text, Image
  • Output: Text
