Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more

Get help setting up a custom Dedicated Endpoints.

Talk with our engineer to get a quote for reserved GPU instances with discounts.

README

License: apache-2.0

Contents & recipe

ComponentFormat
Language-model body — 234 LinearsNVFP4 (W4A4, group-16)
Language-model body — 135 LinearsFP8 E4M3 (W8A8, per-channel)
Language-model body — 41 LinearsBF16 (passthrough)
Norms / embeddings / lm_head / buffersBF16
Vision tower (355 tensors)BF16 passthrough
  • 6.0 bpp on the language-model body (410 Linears total).
  • ~27.2 GB on disk. The vision tower is carried in BF16 (the body is what is quantized); text generation is the validated path.
  • No audio tower.

Quality

Measured as KL divergence vs the BF16 reference (google/gemma-4-31b-it) on WikiText-2, teacher-forced with the BOS token, per-position top-20, fully deterministic. Lower KL / higher next-token agreement = closer to the original model.

Metric (vs BF16)5.5-bit buildThis 6-bit build
KL-vs-BF16 (confident positions) ↓1.931.47 (−24%)
Next-token top-1 agreement ↑62.5%68.4% (+5.9 pp)

The 6-bit build is consistently closer to BF16 than the 5.5-bit build on every axis measured. (Absolute KL is top-K-truncated, so the meaningful quantity is the relative gap between builds against the same reference. Plain perplexity does not separate these builds on this heavily instruction-tuned model — both sit near the BF16 value — which is why closeness-to-BF16 is reported instead.)

What is PrismaQuant?

PrismaQuant is a Fisher-weighted, mixed-precision quantization toolkit. Instead of forcing the whole model into one dtype, it predicts the loss penalty of quantizing each Linear independently — Δloss ≈ 0.5 · H_trace · MSE_W, where H_trace is the Linear's Fisher diagonal trace from a calibration probe and MSE_W is the format-specific reconstruction error — then solves for the per-Linear format menu ({NVFP4, FP8_E4M3, BF16}) that minimizes total predicted Δloss at the target bits-per-weight. NVFP4 Linears additionally go through act-aware GPTQ + closed-form scale-sweep passes, each gated "improve or keep RTN."

Usage (vLLM)

bash

vllm serve rdtand/Gemma4-31B-IT-PrismaQuant-6bit-vllm \
--quantization compressed-tensors --trust-remote-code

Requires a vLLM build with compressed-tensors NVFP4 + FP8 support (FlashInfer CUTLASS kernels on Blackwell; FP8 on Hopper+). Serve without speculative decoding if you intend to read prompt logprobs.

Attribution

Quantization by PrismaQuant. Contact: robert.tand@icloud.com. Base model © Google, under the Gemma 4 license.

Model provider

rdtand

Model tree

Base

this model

Modalities

Input

Text, Image

Output

Text

Pricing

Dedicated Endpoints

View details

Supported Functionality

Model APIs

Dedicated Endpoints

Container

More information

Explore FriendliAI today