README

Mining shape

field         value
base model    Qwen/Qwen3.5-4B
modality      text
common_dim    2560
rank          32
mine_layers   16 (overhead dial; layer count)
pipeline      vllm

Mining regime (LLM)

Text LLMs mine during prefill, when many tokens are processed at once (rows = number of tokens, which is large). Single-token decode does not mine (rows ≈ 1), so interactive chat mines far less than long-prompt or batched-prefill serving. Diffusion models mine on every forward pass (the token count is always large), so for continuous mining a diffusion model (see Matmultoken/Z-Image-Turbo-pouw) is the stronger substrate; this LLM repo is for prefill-heavy or batch workloads.
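The difference between the two regimes can be made concrete with a rough estimate. The sketch below models a single unbatched request as one prefill pass over the whole prompt followed by one-row decode passes, and reports what share of processed rows falls in passes large enough to mine. The `Request` type, the `mining_share` helper, and the `min_rows` threshold of 64 are all hypothetical and chosen for illustration only; the actual eligibility rule is not stated in this README.

```python
from dataclasses import dataclass


@dataclass
class Request:
    prompt_tokens: int  # rows processed together in the single prefill pass
    output_tokens: int  # tokens generated one row at a time during decode


def mining_share(req: Request, min_rows: int = 64) -> float:
    """Approximate share of token-rows processed in passes that could mine.

    Assumption (not from the README): a pass is eligible only when it
    processes at least `min_rows` rows. Decode passes have one row each,
    so they never qualify for any min_rows > 1.
    """
    total_rows = req.prompt_tokens + req.output_tokens
    if total_rows == 0:
        return 0.0
    eligible_rows = req.prompt_tokens if req.prompt_tokens >= min_rows else 0
    return eligible_rows / total_rows


# Interactive chat: short prompt, long reply. The prefill is below the threshold.
chat = Request(prompt_tokens=40, output_tokens=400)
# Batch summarization: long prompt, short reply. Prefill dominates.
batch = Request(prompt_tokens=4000, output_tokens=100)

print(f"chat:  {mining_share(chat):.2f}")
print(f"batch: {mining_share(batch):.2f}")
```

Under these assumptions the chat request contributes no eligible rows, while almost all rows of the long-prompt request are processed in the prefill pass. A server that batches many sequences per decode step would see more than one row per decode pass, which this single-request model deliberately ignores.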

Use

python

# Serve via vLLM with quantization="pouw" (the vLLM-MatMulToken plugin auto-registers it).
from vllm import LLM
llm = LLM(model="Matmultoken/Qwen3.5-4B-pouw", quantization="pouw")  # mines on eligible matmuls while it serves
# generate() returns a list of RequestOutput objects; generation is bit-identical to the base model.
outputs = llm.generate("The history of money is")
print(outputs[0].outputs[0].text)  # text of the first completion
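Since this repo targets prefill-heavy and batch workloads, a batched call may be a better fit than a single short prompt. The following is a minimal sketch, assuming vLLM and the vLLM-MatMulToken plugin are installed; `build_prompts` and `run_batch` are hypothetical helper names, and the summarization prompt template is only an illustration of a long-prompt, short-output job.

```python
def build_prompts(documents, instruction="Summarize the following text:"):
    """Wrap each document in one long prompt so that prefill dominates the work."""
    return [f"{instruction}\n\n{doc}\n\nSummary:" for doc in documents]


def run_batch(prompts, max_tokens=64):
    """Submit all prompts in one generate() call and return the completion texts.

    vLLM is imported lazily so that build_prompts can be used without it.
    """
    from vllm import LLM, SamplingParams

    llm = LLM(model="Matmultoken/Qwen3.5-4B-pouw", quantization="pouw")
    # Keep outputs short: decode steps add latency without adding prefill rows.
    params = SamplingParams(temperature=0.0, max_tokens=max_tokens)
    outputs = llm.generate(prompts, params)
    return [out.outputs[0].text for out in outputs]


# Example (requires a GPU plus vLLM and the plugin):
#   summaries = run_batch(build_prompts(["first long document", "second long document"]))
```

Passing the whole list to one `generate()` call lets vLLM schedule the prompts together rather than one at a time.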

Notes

  • The live PoW job and difficulty target always come from the chain at runtime; they are never baked into this repo. GPU kernels compile per architecture on first run (a one-time step, cached on disk).
  • Published under the Matmultoken organization. The base weights (apache-2.0) are bundled in this repo at a pinned snapshot for a reproducible mining shape; the original model's LICENSE and attribution are preserved in-repo.

Generated by MatMulToken publish_pouw_models.py. License: MIT.

Model provider: Matmultoken

Model tree: base Qwen/Qwen3.5-4B; quantized: this model

Modalities: input Video, Text, Image; output Text
