trohrbaugh

gemma-4-31b-it-heretic-ara-NVFP4

Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more

Get help setting up a custom Dedicated Endpoints.

Talk with our engineer to get a quote for reserved GPU instances with discounts.

README

License: apache-2.0

What was quantized

A deliberately mixed-precision recipe, mirroring NVIDIA's dense-Gemma-4 approach:

Table
ComponentPrecision
MLP / FFN linears (gate_proj, up_proj, down_proj)NVFP4 (W4A4, group size 16, FP8 e4m3 block scales)
Attention (q/k/v/o_proj)BF16
Embeddings, lm_head, normsBF16
Vision tower (~550M)BF16
KV cache (serve-time)FP8

Attention is intentionally kept in BF16: Gemma-4's attention residual stream carries persistent per-channel activation outliers larger than NVFP4 can represent, so 4-bit activation quantization there degrades quality. Only the MLP is quantized (180 of the language model's linear layers; all attention layers untouched).

Quantization details

  • Tool: llm-compressor (git main), transformers 5.11, compressed-tensors 0.17.
  • Scheme: NVFP4 (weights + activations 4-bit), targets: Linear, with attention / vision / embeddings / lm_head held out via ignore.
  • Calibration: 512 samples from HuggingFaceH4/ultrachat_200k (train_sft), chat-template formatted, max sequence length 2048. Weight observer memoryless_minmax, activation observer static_minmax.
  • See recipe.yaml in this repo for the exact modifier configuration.

Serving with vLLM (Blackwell / SM120)

Requires a vLLM build with native SM120 NVFP4 kernels (verify the startup log shows Using FlashInferCutlassNvFp4LinearKernel for NVFP4 GEMM and not a Marlin fallback).

bash

vllm serve <path-or-repo>/gemma-4-31b-it-heretic-ara-NVFP4 \
--kv-cache-dtype fp8 \
--gpu-memory-utilization 0.90 \
--reasoning-parser gemma4 \
--tool-call-parser gemma4 \
--enable-auto-tool-choice \
--limit-mm-per-prompt '{"image": 2}'

Use Gemma-4's recommended sampling (temperature=1.0, top_p=0.95, top_k=64); these ship in generation_config.json. The model forces the TRITON_ATTN backend due to Gemma-4's heterogeneous attention head dimensions — this is expected, not an error.

Multi-Token Prediction (speculative decoding)

Add the stock Gemma-4 31B MTP drafter for faster decode. You must pass "method":"mtp" — without it, vLLM treats the assistant as a generic draft model and fails on this multimodal target:

bash

--speculative-config '{"method":"mtp","model":"google/gemma-4-31B-it-assistant","num_speculative_tokens":2}'

The drafter is the stock checkpoint and stays BF16 (no need to quantize it). It shares the target's embedding table, which is why those are kept BF16 in this checkpoint.

Important: config.json ignore patch

The quantization_config.ignore list in config.json includes a hand-added re:.*self_attn.* regex. This is required for vLLM: llm-compressor resolves ignore regexes into literal module names at save time, and those literals don't match the names vLLM derives for the fused qkv_proj, causing a "Found a different quantization schemes" error at load. The regex makes vLLM treat all attention projections as uniformly BF16. Do not remove it.

Evaluation

Fill in after running your own eval against the BF16 baseline.

Table
MetricThis NVFP4 modelBF16 baseline
Refusal rate (Heretic eval)TODO5/100
KL divergence vs. google/gemma-4-31b-itTODO0.0120
MMLU-Pro / GSM8K / (your suite)TODOTODO

Observed serving performance (RTX PRO 6000 Blackwell Max-Q, single GPU, vLLM 0.22.1): ~64–70 tok/s single-stream decode with MTP (num_speculative_tokens=2, mean acceptance length ~2.3), vs. ~25–35 tok/s without. Roughly 2–2.5×.

License & intended use

Derived from Gemma 4 and distributed under Apache 2.0; use is also subject to Google's Gemma Terms of Use and Prohibited Use Policy. This is a decensored (abliterated) model with refusal behavior substantially removed relative to the base; deployers are responsible for adding their own content-safety measures appropriate to their application.

Model provider

trohrbaugh

Model tree

Base

trohrbaugh/gemma-4-31b-it-heretic-ara

Quantized

this model

Modalities

Input

Text, Image

Output

Text

Pricing

Dedicated Endpoints

View details

Supported Functionality

Model APIs

Dedicated Endpoints

Container

More information

Explore FriendliAI today