Qwen3.5-4B-pouw API & Inference Endpoint

Mining shape

| field | value |
| --- | --- |
| base model | `Qwen/Qwen3.5-4B` |
| modality | text |
| common_dim | 2560 |
| rank | 32 |
| mine_layers | 16 (overhead dial; layer count) |
| pipeline | vllm |
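For scripts that need to refer to these values, the mining shape can be mirrored in a small config object. This is only a sketch: the `MiningShape` class and its sanity checks are hypothetical and are not part of the MatMulToken tooling; the values are copied from the mining shape listed above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MiningShape:
    """Hypothetical container mirroring the mining-shape fields (not a plugin API)."""
    base_model: str
    modality: str
    common_dim: int
    rank: int
    mine_layers: int
    pipeline: str

    def __post_init__(self) -> None:
        # Basic sanity checks; these rules are assumptions made for illustration.
        for name in ("common_dim", "rank", "mine_layers"):
            if getattr(self, name) <= 0:
                raise ValueError(f"{name} must be positive")


# Values copied verbatim from the mining shape of this repo.
QWEN35_4B_POUW = MiningShape(
    base_model="Qwen/Qwen3.5-4B",
    modality="text",
    common_dim=2560,
    rank=32,
    mine_layers=16,
    pipeline="vllm",
)
```

Keeping the shape in one immutable object makes it harder for a benchmark or deployment script to drift from the published values.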

Mining regime (LLM)

Text LLMs mine during prefill, when many tokens are processed in one forward pass and the matmul row count (rows = tokens) is large. Single-token decode does not mine (rows ≈ 1), so interactive chat mines far less than long-prompt or batched-prefill serving. Diffusion models process a large token count on every forward pass and therefore mine on every pass, which makes a diffusion model (see Matmultoken/Z-Image-Turbo-pouw) the stronger substrate for continuous mining. This LLM repo targets prefill-heavy and batch workloads.
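To make the prefill/decode contrast concrete, the sketch below counts the rows (tokens) that each forward pass presents to a matmul. It is a simplification that ignores scheduler details such as chunked prefill, and the `min_rows` cutoff is a made-up placeholder: the text above only says that prefill has many rows and decode has about one, not where the real eligibility threshold lies.

```python
def rows_per_forward(prompt_tokens: int, new_tokens: int, batch_size: int = 1) -> list[int]:
    """Rows seen by each forward pass for one batch of identical-length requests.

    The first pass is the prefill (all prompt tokens at once); every later pass
    is a decode step that processes one new token per sequence.
    """
    prefill = prompt_tokens * batch_size
    # The first generated token comes out of the prefill pass itself.
    decode_steps = max(new_tokens - 1, 0)
    return [prefill] + [batch_size] * decode_steps


def eligible_passes(rows: list[int], min_rows: int) -> int:
    """Count passes with at least `min_rows` rows (`min_rows` is a hypothetical cutoff)."""
    return sum(1 for r in rows if r >= min_rows)


# Interactive chat: short prompt, long reply, one sequence.
chat = rows_per_forward(prompt_tokens=40, new_tokens=200)
# Batched prefill: eight long prompts, one generated token each.
batch = rows_per_forward(prompt_tokens=4000, new_tokens=1, batch_size=8)

print(eligible_passes(chat, min_rows=256), "of", len(chat))    # 0 of 200
print(eligible_passes(batch, min_rows=256), "of", len(batch))  # 1 of 1
```

Under these assumptions the chat request spends all 200 passes below the cutoff, while the batched request does its only pass with 32,000 rows.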

Use

```python
# Serve via vLLM with quantization="pouw" (the vLLM-MatMulToken plugin auto-registers it).
from vllm import LLM

llm = LLM(model="Matmultoken/Qwen3.5-4B-pouw", quantization="pouw")  # mines on eligible matmuls while it serves
# generate() returns a list of RequestOutput objects; generation is bit-identical to the base model.
outputs = llm.generate(["The history of money is"])
print(outputs[0].outputs[0].text)
```
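Since this repo is aimed at prefill-heavy and batch workloads, a batched call may be a better fit than a single short prompt. The following is a minimal sketch, assuming the same model ID and `quantization="pouw"` setting as above; the helper names `build_prefill_batch` and `run_prefill_batch` are hypothetical, and capping generation at one token is an illustrative choice to keep the workload dominated by prefill, not a documented recommendation.

```python
def build_prefill_batch(documents: list[str], instruction: str) -> list[str]:
    """Prepend one instruction to each document so every prompt is long."""
    return [f"{instruction}\n\n{doc}" for doc in documents]


def run_prefill_batch(prompts: list[str], max_new_tokens: int = 1) -> list[str]:
    """Send all prompts in one generate() call and return the generated text."""
    # Imported lazily so build_prefill_batch can be used without vLLM installed.
    from vllm import LLM, SamplingParams

    llm = LLM(model="Matmultoken/Qwen3.5-4B-pouw", quantization="pouw")
    # Few output tokens means few decode steps relative to prefill work.
    params = SamplingParams(max_tokens=max_new_tokens, temperature=0.0)
    outputs = llm.generate(prompts, params)
    return [out.outputs[0].text for out in outputs]
```

A call such as `run_prefill_batch(build_prefill_batch(docs, "Summarize the following text in one word."))` would then submit every document in a single batch; it requires a GPU host with vLLM and the plugin installed.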

Notes

The live PoW job and difficulty target always come from the chain at runtime; they are never baked into this repo. GPU kernels compile per architecture on first run (a one-time step, cached on disk).
Published under the Matmultoken organization. The base weights (apache-2.0) are bundled in this repo at a pinned snapshot for a reproducible mining shape; the original model's LICENSE and attribution are preserved in-repo.

Generated by MatMulToken publish_pouw_models.py. License: MIT.
