trohrbaugh
gemma-4-31b-it-heretic-ara-NVFP4
Run this model inference on single tenant GPU with unmatched speed and reliability at scale.
Get help setting up a custom Dedicated Endpoints.
Talk with our engineer to get a quote for reserved GPU instances with discounts.
README
License: apache-2.0What was quantized
A deliberately mixed-precision recipe, mirroring NVIDIA's dense-Gemma-4 approach:
| Component | Precision |
|---|---|
MLP / FFN linears (gate_proj, up_proj, down_proj) | NVFP4 (W4A4, group size 16, FP8 e4m3 block scales) |
Attention (q/k/v/o_proj) | BF16 |
Embeddings, lm_head, norms | BF16 |
| Vision tower (~550M) | BF16 |
| KV cache (serve-time) | FP8 |
Attention is intentionally kept in BF16: Gemma-4's attention residual stream carries persistent per-channel activation outliers larger than NVFP4 can represent, so 4-bit activation quantization there degrades quality. Only the MLP is quantized (180 of the language model's linear layers; all attention layers untouched).
Quantization details
- Tool: llm-compressor (git
main), transformers 5.11, compressed-tensors 0.17. - Scheme:
NVFP4(weights + activations 4-bit),targets: Linear, with attention / vision / embeddings /lm_headheld out viaignore. - Calibration: 512 samples from
HuggingFaceH4/ultrachat_200k(train_sft), chat-template formatted, max sequence length 2048. Weight observermemoryless_minmax, activation observerstatic_minmax. - See
recipe.yamlin this repo for the exact modifier configuration.
Serving with vLLM (Blackwell / SM120)
Requires a vLLM build with native SM120 NVFP4 kernels (verify the startup log shows
Using FlashInferCutlassNvFp4LinearKernel for NVFP4 GEMM and not a Marlin fallback).
bash
vllm serve <path-or-repo>/gemma-4-31b-it-heretic-ara-NVFP4 \--kv-cache-dtype fp8 \--gpu-memory-utilization 0.90 \--reasoning-parser gemma4 \--tool-call-parser gemma4 \--enable-auto-tool-choice \--limit-mm-per-prompt '{"image": 2}'
Use Gemma-4's recommended sampling (temperature=1.0, top_p=0.95, top_k=64); these ship in
generation_config.json. The model forces the TRITON_ATTN backend due to Gemma-4's
heterogeneous attention head dimensions — this is expected, not an error.
Multi-Token Prediction (speculative decoding)
Add the stock Gemma-4 31B MTP drafter for faster decode. You must pass "method":"mtp" —
without it, vLLM treats the assistant as a generic draft model and fails on this multimodal
target:
bash
--speculative-config '{"method":"mtp","model":"google/gemma-4-31B-it-assistant","num_speculative_tokens":2}'
The drafter is the stock checkpoint and stays BF16 (no need to quantize it). It shares the target's embedding table, which is why those are kept BF16 in this checkpoint.
Important: config.json ignore patch
The quantization_config.ignore list in config.json includes a hand-added
re:.*self_attn.* regex. This is required for vLLM: llm-compressor resolves ignore regexes
into literal module names at save time, and those literals don't match the names vLLM derives
for the fused qkv_proj, causing a "Found a different quantization schemes" error at load.
The regex makes vLLM treat all attention projections as uniformly BF16. Do not remove it.
Evaluation
Fill in after running your own eval against the BF16 baseline.
| Metric | This NVFP4 model | BF16 baseline |
|---|---|---|
| Refusal rate (Heretic eval) | TODO | 5/100 |
| KL divergence vs. google/gemma-4-31b-it | TODO | 0.0120 |
| MMLU-Pro / GSM8K / (your suite) | TODO | TODO |
Observed serving performance (RTX PRO 6000 Blackwell Max-Q, single GPU, vLLM 0.22.1):
~64–70 tok/s single-stream decode with MTP (num_speculative_tokens=2, mean acceptance
length ~2.3), vs. ~25–35 tok/s without. Roughly 2–2.5×.
License & intended use
Derived from Gemma 4 and distributed under Apache 2.0; use is also subject to Google's Gemma Terms of Use and Prohibited Use Policy. This is a decensored (abliterated) model with refusal behavior substantially removed relative to the base; deployers are responsible for adding their own content-safety measures appropriate to their application.
Model provider
trohrbaugh
Model tree
Base
trohrbaugh/gemma-4-31b-it-heretic-ara
Quantized
this model
Modalities
Input
Text, Image
Output
Text
Pricing
Dedicated Endpoints
View detailsSupported Functionality
Model APIs
Dedicated Endpoints
Container
More information