rdtand/Gemma4-31B-IT-PrismaQuant-6bit-vllm API & Inference Endpoint

Contents & recipe

Component	Format
Language-model body — 234 Linears	NVFP4 (W4A4, group-16)
Language-model body — 135 Linears	FP8 E4M3 (W8A8, per-channel)
Language-model body — 41 Linears	BF16 (passthrough)
Norms / embeddings / lm_head / buffers	BF16
Vision tower (355 tensors)	BF16 passthrough

6.0 bpp on the language-model body (410 Linears total).
~27.2 GB on disk. The vision tower is carried in BF16 (the body is what is quantized); text generation is the validated path.
No audio tower.

Quality

Measured as KL divergence vs the BF16 reference (google/gemma-4-31b-it) on WikiText-2, teacher-forced with the BOS token, per-position top-20, fully deterministic. Lower KL / higher next-token agreement = closer to the original model.

Metric (vs BF16)	5.5-bit build	This 6-bit build
KL-vs-BF16 (confident positions) ↓	1.93	1.47 (−24%)
Next-token top-1 agreement ↑	62.5%	68.4% (+5.9 pp)

The 6-bit build is consistently closer to BF16 than the 5.5-bit build on every axis measured. (Absolute KL is top-K-truncated, so the meaningful quantity is the relative gap between builds against the same reference. Plain perplexity does not separate these builds on this heavily instruction-tuned model — both sit near the BF16 value — which is why closeness-to-BF16 is reported instead.)

What is PrismaQuant?

PrismaQuant is a Fisher-weighted, mixed-precision quantization toolkit. Instead of forcing the whole model into one dtype, it predicts the loss penalty of quantizing each Linear independently — Δloss ≈ 0.5 · H_trace · MSE_W, where H_trace is the Linear's Fisher diagonal trace from a calibration probe and MSE_W is the format-specific reconstruction error — then solves for the per-Linear format menu ({NVFP4, FP8_E4M3, BF16}) that minimizes total predicted Δloss at the target bits-per-weight. NVFP4 Linears additionally go through act-aware GPTQ + closed-form scale-sweep passes, each gated "improve or keep RTN."

Usage (vLLM)

bash
vllm serve rdtand/Gemma4-31B-IT-PrismaQuant-6bit-vllm \
  --quantization compressed-tensors --trust-remote-code

Requires a vLLM build with compressed-tensors NVFP4 + FP8 support (FlashInfer CUTLASS kernels on Blackwell; FP8 on Hopper+). Serve without speculative decoding if you intend to read prompt logprobs.

Gemma4-31B-IT-PrismaQuant-6bit-vllm

Get help setting up a custom Dedicated Endpoints.

README

Contents & recipe

Quality

What is PrismaQuant?

Usage (vLLM)

Attribution

Explore FriendliAI today

Gemma4-31B-IT-PrismaQuant-6bit-vllm