Run this model inference on single tenant GPU with unmatched speed and reliability at scale.
Get help setting up a custom Dedicated Endpoints.
Talk with our engineer to get a quote for reserved GPU instances with discounts.
README
Quantization Details
This model was quantized using NVIDIA ModelOpt with the following configurations:
| Property | Value |
|---|---|
| Base model | google/gemma-4-31B-it |
| Quant method | NVIDIA ModelOpt (NVFP4) |
| Weight scheme | 4-bit float, block size 16 |
| Input activation | 4-bit float, block size 16 |
| Calibration dataset | CNN DailyMail (512 samples, max_seq_len 1024) |
| Size | ~30 GB (vs ~58 GB BF16) |
Excluded from Quantization
The following modules are kept in full precision (BF16) to preserve accuracy:
lm_headmodel.embed_vision*- All
self_attnlayers (layers 0–59)
Usage
You can deploy this model using vLLM with the modelopt quantization backend. Please ensure you refer to the vLLM documentation for Gemma 4 for advanced serving options.
bash
vllm serve vrfai/gemma-4-31B-it-nvfp4 \--quantization modelopt_fp4 \--max-model-len 32768 \--max-num-seqs 128 \--max-num-batched-tokens 8192 \--gpu-memory-utilization 0.95 \--kv-cache-dtype fp8 \--enable-prefix-caching \--enable-auto-tool-choice \--reasoning-parser gemma4 \--tool-call-parser gemma4 \--async-scheduling \--trust-remote-code
Quantization Script
The recipes and scripts used to quantize this model can be found in the following repository:
Model provider
vrfai
Model tree
Base
google/gemma-4-31B-it
Quantized
this model
Modalities
Input
Text, Image
Output
Text
Pricing
Dedicated Endpoints
View detailsSupported Functionality
Model APIs
Dedicated Endpoints
Container
More information