vrfai/gemma-4-31B-it-nvfp4 API & Inference Endpoint

Quantization Details

This model was quantized using NVIDIA ModelOpt with the following configurations:

Property	Value
Base model	google/gemma-4-31B-it
Quant method	NVIDIA ModelOpt (NVFP4)
Weight scheme	4-bit float, block size 16
Input activation	4-bit float, block size 16
Calibration dataset	CNN DailyMail (512 samples, max_seq_len 1024)
Size	~30 GB (vs ~58 GB BF16)

Excluded from Quantization

The following modules are kept in full precision (BF16) to preserve accuracy:

lm_head
model.embed_vision*
All self_attn layers (layers 0–59)

Usage

You can deploy this model using vLLM with the modelopt quantization backend. Please ensure you refer to the vLLM documentation for Gemma 4 for advanced serving options.

bash
vllm serve vrfai/gemma-4-31B-it-nvfp4 \
  --quantization modelopt_fp4 \
  --max-model-len 32768 \
  --max-num-seqs 128 \
  --max-num-batched-tokens 8192 \
  --gpu-memory-utilization 0.95 \
  --kv-cache-dtype fp8 \
  --enable-prefix-caching \
  --enable-auto-tool-choice \
  --reasoning-parser gemma4 \
  --tool-call-parser gemma4 \
  --async-scheduling \
  --trust-remote-code

Quantization Script

The recipes and scripts used to quantize this model can be found in the following repository:

VinRobotics/model-quantization-recipes

gemma-4-31B-it-nvfp4

Get help setting up a custom Dedicated Endpoints.

README

Quantization Details

Excluded from Quantization

Usage

Quantization Script

Explore FriendliAI today

gemma-4-31B-it-nvfp4