Run this model inference on single tenant GPU with unmatched speed and reliability at scale.
Get help setting up a custom Dedicated Endpoints.
Talk with our engineer to get a quote for reserved GPU instances with discounts.
README
License: apache-2.0Contents & recipe
| Component | Format |
|---|---|
| Language-model body — 234 Linears | NVFP4 (W4A4, group-16) |
| Language-model body — 135 Linears | FP8 E4M3 (W8A8, per-channel) |
| Language-model body — 41 Linears | BF16 (passthrough) |
| Norms / embeddings / lm_head / buffers | BF16 |
| Vision tower (355 tensors) | BF16 passthrough |
- 6.0 bpp on the language-model body (410 Linears total).
- ~27.2 GB on disk. The vision tower is carried in BF16 (the body is what is quantized); text generation is the validated path.
- No audio tower.
Quality
Measured as KL divergence vs the BF16 reference (google/gemma-4-31b-it)
on WikiText-2, teacher-forced with the BOS token, per-position top-20,
fully deterministic. Lower KL / higher next-token agreement = closer to the
original model.
| Metric (vs BF16) | 5.5-bit build | This 6-bit build |
|---|---|---|
| KL-vs-BF16 (confident positions) ↓ | 1.93 | 1.47 (−24%) |
| Next-token top-1 agreement ↑ | 62.5% | 68.4% (+5.9 pp) |
The 6-bit build is consistently closer to BF16 than the 5.5-bit build on every axis measured. (Absolute KL is top-K-truncated, so the meaningful quantity is the relative gap between builds against the same reference. Plain perplexity does not separate these builds on this heavily instruction-tuned model — both sit near the BF16 value — which is why closeness-to-BF16 is reported instead.)
What is PrismaQuant?
PrismaQuant is a Fisher-weighted, mixed-precision quantization toolkit. Instead
of forcing the whole model into one dtype, it predicts the loss penalty of
quantizing each Linear independently — Δloss ≈ 0.5 · H_trace · MSE_W, where
H_trace is the Linear's Fisher diagonal trace from a calibration probe and
MSE_W is the format-specific reconstruction error — then solves for the
per-Linear format menu ({NVFP4, FP8_E4M3, BF16}) that minimizes total
predicted Δloss at the target bits-per-weight. NVFP4 Linears additionally go
through act-aware GPTQ + closed-form scale-sweep passes, each gated "improve or
keep RTN."
Usage (vLLM)
bash
vllm serve rdtand/Gemma4-31B-IT-PrismaQuant-6bit-vllm \--quantization compressed-tensors --trust-remote-code
Requires a vLLM build with compressed-tensors NVFP4 + FP8 support
(FlashInfer CUTLASS kernels on Blackwell; FP8 on Hopper+). Serve without
speculative decoding if you intend to read prompt logprobs.
Attribution
Quantization by PrismaQuant. Contact: robert.tand@icloud.com. Base model © Google, under the Gemma 4 license.
Model provider
rdtand
Model tree
Base
this model
Modalities
Input
Text, Image
Output
Text
Pricing
Dedicated Endpoints
View detailsSupported Functionality
Model APIs
Dedicated Endpoints
Container
More information