trohrbaugh/gemma-4-31b-it-heretic-ara-NVFP4 API & Inference Endpoint

What was quantized

A deliberately mixed-precision recipe, mirroring NVIDIA's dense-Gemma-4 approach:

Table
Component	Precision
MLP / FFN linears (`gate_proj`, `up_proj`, `down_proj`)	NVFP4 (W4A4, group size 16, FP8 e4m3 block scales)
Attention (`q/k/v/o_proj`)	BF16
Embeddings, `lm_head`, norms	BF16
Vision tower (~550M)	BF16
KV cache (serve-time)	FP8

Attention is intentionally kept in BF16: Gemma-4's attention residual stream carries persistent per-channel activation outliers larger than NVFP4 can represent, so 4-bit activation quantization there degrades quality. Only the MLP is quantized (180 of the language model's linear layers; all attention layers untouched).

Quantization details

Tool: llm-compressor (git main), transformers 5.11, compressed-tensors 0.17.
Scheme: NVFP4 (weights + activations 4-bit), targets: Linear, with attention / vision / embeddings / lm_head held out via ignore.
Calibration: 512 samples from HuggingFaceH4/ultrachat_200k (train_sft), chat-template formatted, max sequence length 2048. Weight observer memoryless_minmax, activation observer static_minmax.
See recipe.yaml in this repo for the exact modifier configuration.

Serving with vLLM (Blackwell / SM120)

Requires a vLLM build with native SM120 NVFP4 kernels (verify the startup log shows Using FlashInferCutlassNvFp4LinearKernel for NVFP4 GEMM and not a Marlin fallback).

bash
vllm serve <path-or-repo>/gemma-4-31b-it-heretic-ara-NVFP4 \
  --kv-cache-dtype fp8 \
  --gpu-memory-utilization 0.90 \
  --reasoning-parser gemma4 \
  --tool-call-parser gemma4 \
  --enable-auto-tool-choice \
  --limit-mm-per-prompt '{"image": 2}'

Use Gemma-4's recommended sampling (temperature=1.0, top_p=0.95, top_k=64); these ship in generation_config.json. The model forces the TRITON_ATTN backend due to Gemma-4's heterogeneous attention head dimensions — this is expected, not an error.

Multi-Token Prediction (speculative decoding)

Add the stock Gemma-4 31B MTP drafter for faster decode. You must pass "method":"mtp" — without it, vLLM treats the assistant as a generic draft model and fails on this multimodal target:

bash
--speculative-config '{"method":"mtp","model":"google/gemma-4-31B-it-assistant","num_speculative_tokens":2}'

The drafter is the stock checkpoint and stays BF16 (no need to quantize it). It shares the target's embedding table, which is why those are kept BF16 in this checkpoint.

Important: config.json `ignore` patch

The quantization_config.ignore list in config.json includes a hand-added re:.*self_attn.* regex. This is required for vLLM: llm-compressor resolves ignore regexes into literal module names at save time, and those literals don't match the names vLLM derives for the fused qkv_proj, causing a "Found a different quantization schemes" error at load. The regex makes vLLM treat all attention projections as uniformly BF16. Do not remove it.

Evaluation

Fill in after running your own eval against the BF16 baseline.

Table
Metric	This NVFP4 model	BF16 baseline
Refusal rate (Heretic eval)	TODO	5/100
KL divergence vs. google/gemma-4-31b-it	TODO	0.0120
MMLU-Pro / GSM8K / (your suite)	TODO	TODO

Observed serving performance (RTX PRO 6000 Blackwell Max-Q, single GPU, vLLM 0.22.1): ~64–70 tok/s single-stream decode with MTP (num_speculative_tokens=2, mean acceptance length ~2.3), vs. ~25–35 tok/s without. Roughly 2–2.5×.

License & intended use

Derived from Gemma 4 and distributed under Apache 2.0; use is also subject to Google's Gemma Terms of Use and Prohibited Use Policy. This is a decensored (abliterated) model with refusal behavior substantially removed relative to the base; deployers are responsible for adding their own content-safety measures appropriate to their application.

gemma-4-31b-it-heretic-ara-NVFP4

Get help setting up a custom Dedicated Endpoints.

README

What was quantized

Quantization details

Serving with vLLM (Blackwell / SM120)

Multi-Token Prediction (speculative decoding)

Important: config.json `ignore` patch

Evaluation

License & intended use

Explore FriendliAI today

gemma-4-31b-it-heretic-ara-NVFP4

Get help setting up a custom Dedicated Endpoints.

What was quantized

Quantization details

Serving with vLLM (Blackwell / SM120)

Multi-Token Prediction (speculative decoding)

Important: config.json ignore patch

Evaluation

License & intended use

Explore FriendliAI today

Important: config.json `ignore` patch