RedHatAI
gemma-4-31B-it-NVFP4
License: apache-2.0
Model Overview
- Model Architecture: google/gemma-4-31B-it
- Input: Text / Image
- Output: Text
- Model Optimizations:
  - Weight quantization: FP4
  - Activation quantization: FP4
- Release Date: 2026-04-04
- Version: 1.0
- Model Developers: RedHatAI
This model is a quantized version of google/gemma-4-31B-it. It was evaluated on several tasks to assess its quality in comparison to the unquantized model.
Model Optimizations
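This model was obtained by quantizing the weights and activations of google/gemma-4-31B-it to the FP4 data type using the NVFP4 format, and is ready for inference with vLLM. This optimization reduces the number of bits per quantized parameter from 16 to 4, reducing disk size and GPU memory requirements by approximately 75% (slightly less in practice, because per-group scales are stored alongside the weights and some layers are kept in their original precision).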
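As a rough check on the 75% figure, the arithmetic can be sketched as below. The 8-bit scale per group is an assumption about how NVFP4 stores its scales (the card itself only states group_size=16), and the calculation ignores layers left in their original precision, so treat the result as an estimate for the quantized layers only.

```python
# Rough per-weight storage estimate for group-scaled FP4 versus 16-bit weights.
BASELINE_BITS = 16   # bits per parameter before quantization (from the card)
FP4_BITS = 4         # bits per quantized value (from the card)
GROUP_SIZE = 16      # group_size stated in the card
SCALE_BITS = 8       # ASSUMPTION: one 8-bit scale stored per group


def effective_bits(value_bits, scale_bits, group_size):
    """Bits per weight once the shared group scale is amortized over the group."""
    return value_bits + scale_bits / group_size


bits = effective_bits(FP4_BITS, SCALE_BITS, GROUP_SIZE)
reduction = 1 - bits / BASELINE_BITS
print(f"{bits} bits/weight, {reduction:.1%} smaller")
```

Under these assumptions the quantized layers come out a little under the nominal 75% saving, which is consistent with the "approximately" in the text.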
Weights are quantized with FP4 (group_size=16), and activations are quantized with FP4 using local per-group scaling. LLM Compressor is used to quantize only the weights and activations of the linear operators within transformer blocks. Vision tower, embedding, and output head layers are kept in their original precision.
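To make per-group FP4 scaling concrete, here is a minimal pure-Python sketch of "fake quantization": each group of 16 values shares one scale and every value is rounded to the nearest 4-bit floating-point magnitude. It is an illustration, not LLM Compressor's implementation. The E2M1 value grid and the max-based scale are assumptions, and the scale is kept in full precision here rather than in a low-precision format.

```python
import math

# ASSUMPTION: magnitudes representable by a 4-bit E2M1 float (sign is separate).
FP4_MAGNITUDES = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
GROUP_SIZE = 16  # group_size stated in the card


def fake_quantize_group(group):
    """Round one group to the FP4 grid using a single shared scale."""
    max_abs = max(abs(x) for x in group)
    if max_abs == 0.0:
        return list(group)
    # Map the largest magnitude in the group onto the largest FP4 value.
    scale = max_abs / FP4_MAGNITUDES[-1]
    out = []
    for x in group:
        nearest = min(FP4_MAGNITUDES, key=lambda v: abs(abs(x) / scale - v))
        out.append(math.copysign(nearest * scale, x))
    return out


def fake_quantize(values, group_size=GROUP_SIZE):
    """Quantize and dequantize a flat list of values group by group."""
    out = []
    for start in range(0, len(values), group_size):
        out.extend(fake_quantize_group(values[start:start + group_size]))
    return out


weights = [i / 10 for i in range(-8, 8)]  # one illustrative group of 16 values
quantized = fake_quantize(weights)
worst = max(abs(a - b) for a, b in zip(weights, quantized))
print(f"largest rounding error in the group: {worst:.4f}")
```

Because the scale is local to each group, an outlier only degrades the precision of the 15 values that share its group rather than the whole tensor.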
Deployment
Use with vLLM
This model can be deployed using vLLM. For detailed instructions including multi-GPU deployment, multimodal inference, thinking mode, function calling, and benchmarking, see the Gemma 4 vLLM usage guide.
- Start the vLLM server:
markdown
vllm serve RedHatAI/gemma-4-31B-it-NVFP4 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90
To enable thinking/reasoning and tool calling:
markdown
vllm serve RedHatAI/gemma-4-31B-it-NVFP4 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --enable-auto-tool-choice \
  --reasoning-parser gemma4 \
  --tool-call-parser gemma4 \
  --chat-template examples/tool_chat_template_gemma4.jinja \
  --limit-mm-per-prompt '{"image": 4, "audio": 1}' \
  --async-scheduling
Tip: For text-only workloads, pass --limit-mm-per-prompt '{"image": 0, "audio": 0}' to skip vision encoder memory allocation and free up GPU memory for a longer context window.
- Send requests to the server:
python
from openai import OpenAI

openai_api_key = "EMPTY"
openai_api_base = "http://<your-server-host>:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

model = "RedHatAI/gemma-4-31B-it-NVFP4"

messages = [
    {"role": "user", "content": "Explain quantum mechanics clearly and concisely."},
]

outputs = client.chat.completions.create(
    model=model,
    messages=messages,
)

generated_text = outputs.choices[0].message.content
print(generated_text)
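Since the model also accepts image input, a request carrying an image might be shaped as in the sketch below. It uses the OpenAI chat format with an `image_url` content part. The helper names, the prompt, and the example URL are hypothetical, and the server must be started with a non-zero image limit (not the text-only setting from the tip above).

```python
MODEL = "RedHatAI/gemma-4-31B-it-NVFP4"


def build_image_messages(prompt, image_url):
    """Return a chat `messages` list with one text part and one image part."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }
    ]


def describe_image(base_url, image_url):
    """Send the image to a running vLLM server and return the reply text."""
    # Imported lazily so build_image_messages has no third-party dependency.
    from openai import OpenAI

    client = OpenAI(api_key="EMPTY", base_url=base_url)
    response = client.chat.completions.create(
        model=MODEL,
        messages=build_image_messages("Describe this image in one sentence.", image_url),
    )
    return response.choices[0].message.content


# Hypothetical placeholder URL; replace it with an image the server can reach.
messages = build_image_messages("Describe this image.", "https://example.com/sample.png")
print(messages[0]["content"][1]["type"])
```

Calling `describe_image("http://<your-server-host>:8000/v1", url)` would then issue the actual request against the server started earlier.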
Creation
This model was created by applying NVFP4 quantization with LLM Compressor, as presented in the code snippet below.
python
from datasets import load_dataset
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import apply
from llmcompressor.modifiers.quantization import QuantizationModifier

MODEL_ID = "google/gemma-4-31B-it"
SAVE_DIR = MODEL_ID.split("/")[1] + "-NVFP4"

model = AutoModelForCausalLM.from_pretrained(MODEL_ID, torch_dtype="auto", device_map="auto")
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID)

ds = load_dataset("mgoin/ultrachat_200k_s3", split="train_sft")
calibration_data = [ex["prompt"] for ex in ds.select(range(512))]

recipe = QuantizationModifier(
    targets=["Linear"],
    ignore=["re:.*vision.*", "re:.*audio.*", "lm_head", "re:.*embed.*"],
    scheme="NVFP4",
)

apply(model=model, tokenizer=tokenizer, recipe=recipe, calibration_data=calibration_data)

model.save_pretrained(SAVE_DIR, save_compressed=True)
tokenizer.save_pretrained(SAVE_DIR)
Evaluation
This model was evaluated on GSM8K Platinum, MMLU-Pro, IFEval, MATH-500, AIME 2025, GPQA Diamond, and LiveCodeBench v6 using lm-evaluation-harness and lighteval, served with vLLM (OpenAI-compatible API). All evaluations were performed with thinking enabled.
Accuracy
Reproduction
The results were obtained using the following commands:
Each benchmark was run 3 times with different random seeds (1234, 2345, 3456) and the scores were averaged; AIME 2025 used 8 seeds.
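The averaging step can be sketched as follows. The scores in the example are placeholders chosen for illustration, not results from this model card, and reporting the standard deviation alongside the mean is an addition of this sketch rather than something the card states.

```python
from statistics import mean, stdev


def summarize_runs(scores_by_seed):
    """Return (mean, sample standard deviation) over per-seed scores."""
    scores = list(scores_by_seed.values())
    spread = stdev(scores) if len(scores) > 1 else 0.0
    return mean(scores), spread


# PLACEHOLDER scores keyed by seed; these are NOT measured results.
placeholder_scores = {1234: 0.50, 2345: 0.52, 3456: 0.54}

average, spread = summarize_runs(placeholder_scores)
print(f"mean={average:.3f} stdev={spread:.3f}")
```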
vLLM server (instruction following and reasoning benchmarks):
markdown
vllm serve RedHatAI/gemma-4-31B-it-NVFP4 \
  --tensor-parallel-size 2 \
  --max-model-len 69632 \
  --gpu-memory-utilization 0.90 \
  --enable-auto-tool-choice \
  --reasoning-parser gemma4 \
  --tool-call-parser gemma4 \
  --chat-template examples/tool_chat_template_gemma4.jinja \
  --limit-mm-per-prompt '{"image":0,"audio":0}' \
  --async-scheduling
GSM8K Platinum (lm-eval, 0-shot, 3 repetitions)
markdown
lm_eval --model local-chat-completions \
  --tasks gsm8k_platinum_cot_llama \
  --model_args "model=RedHatAI/gemma-4-31B-it-NVFP4,max_length=36096,base_url=http://0.0.0.0:8000/v1/chat/completions,num_concurrent=128,max_retries=3,tokenized_requests=False,tokenizer_backend=None,timeout=1200" \
  --num_fewshot 0 \
  --apply_chat_template \
  --output_path results_gsm8k_platinum.json \
  --seed 1234 \
  --gen_kwargs "do_sample=True,temperature=1.0,top_p=0.95,top_k=64,max_gen_toks=32000,seed=1234"
MMLU-Pro (lm-eval, 0-shot, 3 repetitions)
markdown
lm_eval --model local-chat-completions \
  --tasks mmlu_pro_chat \
  --model_args "model=RedHatAI/gemma-4-31B-it-NVFP4,max_length=36096,base_url=http://0.0.0.0:8000/v1/chat/completions,num_concurrent=128,max_retries=3,tokenized_requests=False,tokenizer_backend=None,timeout=1200" \
  --num_fewshot 0 \
  --apply_chat_template \
  --output_path results_mmlu_pro.json \
  --seed 1234 \
  --gen_kwargs "do_sample=True,temperature=1.0,top_p=0.95,top_k=64,max_gen_toks=32000,seed=1234"
IFEval (lm-eval, 0-shot, 3 repetitions)
markdown
lm_eval --model local-chat-completions \
  --tasks ifeval \
  --model_args "model=RedHatAI/gemma-4-31B-it-NVFP4,max_length=36096,base_url=http://0.0.0.0:8000/v1/chat/completions,num_concurrent=128,max_retries=3,tokenized_requests=False,tokenizer_backend=None,timeout=1200" \
  --num_fewshot 0 \
  --apply_chat_template \
  --output_path results_ifeval.json \
  --seed 1234 \
  --gen_kwargs "do_sample=True,temperature=1.0,top_p=0.95,top_k=64,max_gen_toks=32000,seed=1234"
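The three lm-eval commands above differ only in the task name and output path, and each is repeated once per seed. A small script along these lines could generate all nine invocations. The per-seed output file naming is a hypothetical convention introduced here to keep runs from overwriting each other; all other arguments are copied from the commands above.

```python
import shlex

MODEL = "RedHatAI/gemma-4-31B-it-NVFP4"
TASKS = ["gsm8k_platinum_cot_llama", "mmlu_pro_chat", "ifeval"]
SEEDS = [1234, 2345, 3456]

MODEL_ARGS = (
    f"model={MODEL},max_length=36096,"
    "base_url=http://0.0.0.0:8000/v1/chat/completions,"
    "num_concurrent=128,max_retries=3,tokenized_requests=False,"
    "tokenizer_backend=None,timeout=1200"
)


def build_command(task, seed):
    """Return the lm_eval argument list for one task and one seed."""
    gen_kwargs = (
        "do_sample=True,temperature=1.0,top_p=0.95,top_k=64,"
        f"max_gen_toks=32000,seed={seed}"
    )
    return [
        "lm_eval", "--model", "local-chat-completions",
        "--tasks", task,
        "--model_args", MODEL_ARGS,
        "--num_fewshot", "0",
        "--apply_chat_template",
        # Hypothetical naming scheme: one output file per task and seed.
        "--output_path", f"results_{task}_seed{seed}.json",
        "--seed", str(seed),
        "--gen_kwargs", gen_kwargs,
    ]


commands = [build_command(task, seed) for task in TASKS for seed in SEEDS]
for command in commands:
    print(shlex.join(command))
```

Each printed line can be pasted into a shell, or the argument lists can be passed to `subprocess.run` directly once the vLLM server is up.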
MATH-500, AIME 2025, GPQA Diamond (lighteval, 3 repetitions; 8 for AIME 2025)
litellm_config.yaml:
yaml
model_parameters:
  provider: hosted_vllm
  model_name: hosted_vllm/RedHatAI/gemma-4-31B-it-NVFP4
  base_url: http://0.0.0.0:8000/v1
  api_key: ''
  timeout: 3600
  concurrent_requests: 128
  generation_parameters:
    temperature: 1.0
    max_new_tokens: 65536
    top_p: 0.95
    top_k: 64
    seed: 1234
Run once per seed (changing seed in the config each time):
markdown
lighteval endpoint litellm litellm_config.yaml 'math_500|0' \
  --output-dir results/ --save-details
lighteval endpoint litellm litellm_config.yaml 'aime25|0' \
  --output-dir results/ --save-details
lighteval endpoint litellm litellm_config.yaml 'gpqa:diamond|0' \
  --output-dir results/ --save-details
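Editing the seed by hand for every run is error-prone, so the per-seed configs and commands above could be generated instead. The sketch below assumes `generation_parameters` is nested under `model_parameters` in the config, and the per-seed config file names are hypothetical. Only the three seeds listed in this card are used; the additional AIME 2025 seeds are not given here.

```python
import shlex

SEEDS = [1234, 2345, 3456]
TASKS = ["math_500|0", "aime25|0", "gpqa:diamond|0"]

# ASSUMPTION: generation_parameters is nested under model_parameters.
CONFIG_TEMPLATE = """\
model_parameters:
  provider: hosted_vllm
  model_name: hosted_vllm/RedHatAI/gemma-4-31B-it-NVFP4
  base_url: http://0.0.0.0:8000/v1
  api_key: ''
  timeout: 3600
  concurrent_requests: 128
  generation_parameters:
    temperature: 1.0
    max_new_tokens: 65536
    top_p: 0.95
    top_k: 64
    seed: {seed}
"""


def render_config(seed):
    """Return the litellm config text for one seed."""
    return CONFIG_TEMPLATE.format(seed=seed)


def lighteval_commands(config_path):
    """Return one shell command per task for the given config file."""
    return [
        shlex.join([
            "lighteval", "endpoint", "litellm", config_path, task,
            "--output-dir", "results/", "--save-details",
        ])
        for task in TASKS
    ]


for seed in SEEDS:
    # Hypothetical file name; write render_config(seed) to it before running.
    config_path = f"litellm_config_seed{seed}.yaml"
    for command in lighteval_commands(config_path):
        print(command)
```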
LiveCodeBench v6 (lighteval, 3 repetitions)
vLLM server:
markdown
vllm serve RedHatAI/gemma-4-31B-it-NVFP4 \
  --tensor-parallel-size 2 \
  --max-model-len 36864 \
  --gpu-memory-utilization 0.90 \
  --enable-auto-tool-choice \
  --reasoning-parser gemma4 \
  --tool-call-parser gemma4 \
  --chat-template examples/tool_chat_template_gemma4.jinja \
  --limit-mm-per-prompt '{"image":0,"audio":0}' \
  --async-scheduling
litellm_config.yaml:
yaml
model_parameters:
  provider: hosted_vllm
  model_name: hosted_vllm/RedHatAI/gemma-4-31B-it-NVFP4
  base_url: http://0.0.0.0:8000/v1
  api_key: ''
  timeout: 1200
  concurrent_requests: 256
  generation_parameters:
    temperature: 1.0
    max_new_tokens: 32768
    top_p: 0.95
    top_k: 64
    seed: 1234
Run once per seed:
markdown
lighteval endpoint litellm litellm_config.yaml 'lcb:codegeneration_v6|0' \
  --output-dir results/ --save-details