Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more

Get help setting up a custom Dedicated Endpoints.

Talk with our engineer to get a quote for reserved GPU instances with discounts.

README

Quantization Details

This model was quantized using NVIDIA ModelOpt with the following configurations:

PropertyValue
Base modelgoogle/gemma-4-31B-it
Quant methodNVIDIA ModelOpt (NVFP4)
Weight scheme4-bit float, block size 16
Input activation4-bit float, block size 16
Calibration datasetCNN DailyMail (512 samples, max_seq_len 1024)
Size~30 GB (vs ~58 GB BF16)

Excluded from Quantization

The following modules are kept in full precision (BF16) to preserve accuracy:

  • lm_head
  • model.embed_vision*
  • All self_attn layers (layers 0–59)

Usage

You can deploy this model using vLLM with the modelopt quantization backend. Please ensure you refer to the vLLM documentation for Gemma 4 for advanced serving options.

bash

vllm serve vrfai/gemma-4-31B-it-nvfp4 \
--quantization modelopt_fp4 \
--max-model-len 32768 \
--max-num-seqs 128 \
--max-num-batched-tokens 8192 \
--gpu-memory-utilization 0.95 \
--kv-cache-dtype fp8 \
--enable-prefix-caching \
--enable-auto-tool-choice \
--reasoning-parser gemma4 \
--tool-call-parser gemma4 \
--async-scheduling \
--trust-remote-code

Quantization Script

The recipes and scripts used to quantize this model can be found in the following repository:

Model provider

vrfai

Model tree

Base

google/gemma-4-31B-it

Quantized

this model

Modalities

Input

Text, Image

Output

Text

Pricing

Dedicated Endpoints

View details

Supported Functionality

Model APIs

Dedicated Endpoints

Container

More information

Explore FriendliAI today