README
Mining shape
| field | value |
|---|---|
| base model | Qwen/Qwen3.5-4B |
| modality | text |
| common_dim | 2560 |
| rank | 32 |
| mine_layers | 16 (overhead dial; layer count) |
| pipeline | vllm |
Mining regime (LLM)
Text LLMs mine during prefill, when many tokens are processed at once (rows = tokens is large). Single-token decode does not mine (rows ≈ 1), so interactive chat mines far less than long-prompt or batched-prefill serving. Diffusion models mine on every forward pass (the token count is always large), so for continuous mining a diffusion model (see Matmultoken/Z-Image-Turbo-pouw) is the stronger substrate; this LLM repo is for prefill-heavy or batch workloads.
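To make the rows = tokens point concrete, here is a minimal sketch that counts activation-matrix rows for a few serving steps. It is not the plugin's logic: `ASSUMED_MIN_ROWS`, `Step`, `matmul_rows`, and `is_eligible` are hypothetical names introduced for illustration, and the threshold value is an assumption, since this README only says that rows must be "large".

```python
from dataclasses import dataclass
from typing import List

# ASSUMPTION: illustrative cutoff only. The real eligibility rule is decided
# by the plugin at runtime and is not documented in this README.
ASSUMED_MIN_ROWS = 256


@dataclass
class Step:
    phase: str                 # "prefill" or "decode"
    tokens_per_seq: List[int]  # tokens processed per sequence in this step


def matmul_rows(step: Step) -> int:
    """Rows of the activation matrix = total tokens processed in the step."""
    return sum(step.tokens_per_seq)


def is_eligible(step: Step, min_rows: int = ASSUMED_MIN_ROWS) -> bool:
    """Hypothetical check: a step can mine only when its row count is large."""
    return matmul_rows(step) >= min_rows


# One interactive chat user decoding a single token: rows = 1.
chat_decode = Step("decode", [1])
# One long prompt processed in a single prefill pass.
long_prefill = Step("prefill", [2048])
# Several shorter prompts prefilled together in one batch.
batched_prefill = Step("prefill", [96, 128, 80, 112])

for name, step in [
    ("chat decode", chat_decode),
    ("long prefill", long_prefill),
    ("batched prefill", batched_prefill),
]:
    print(f"{name}: rows={matmul_rows(step)}, eligible={is_eligible(step)}")
```

Under this assumed threshold, the single-token decode step never qualifies, while both the long prompt and the batch of shorter prompts do, which is why batching prompts matters for this repo.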
Use
```python
# Serve via vLLM with quantization="pouw" (vLLM-MatMulToken plugin auto-registers it).
from vllm import LLM

llm = LLM(model="Matmultoken/Qwen3.5-4B-pouw", quantization="pouw")  # mines on eligible matmuls while it serves
print(llm.generate("The history of money is"))  # generation is bit-identical to the base model
```
Notes
- The live PoW job + difficulty target always come from the chain at runtime — never baked into this repo. GPU kernels compile per-arch on first run (one-time, cached on disk).
- Published under the Matmultoken organization. The base weights (apache-2.0) are bundled in this repo at a pinned snapshot for a reproducible mining shape; the original model's LICENSE and attribution are preserved in-repo.
Generated by MatMulToken publish_pouw_models.py. License: MIT.
Model provider
Matmultoken
Model tree
Base
Qwen/Qwen3.5-4B
Quantized
this model
Modalities
Input
Video, Text, Image
Output
Text