gemma-4-31B-it-FP8-block API & Inference Endpoint

Model Overview

Model Architecture: Gemma4ForConditionalGeneration
- Input: Text / Image
- Output: Text
Model Optimizations:
- Weight quantization: FP8
- Activation quantization: FP8
Release Date: 2026-04-04
Version: 1.0
Model Developers: RedHatAI

This model is a quantized version of google/gemma-4-31B-it. It was evaluated on several tasks to assess its quality in comparison to the unquantized model.

Model Optimizations

This model was obtained by quantizing the weights and activations of google/gemma-4-31B-it to the FP8 data type, making it ready for inference with vLLM. The optimization reduces the number of bits per parameter from 16 to 8, which cuts disk size and GPU memory requirements by approximately 50%.
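
As a rough check on the 50% figure, the arithmetic can be sketched as follows. The parameter count of 31 billion is an assumption read off the model name, and the estimate ignores the FP8 scale factors and the layers kept in their original precision, so the real saving is somewhat below one half.

```python
# Back-of-the-envelope weight-memory estimate.
# Assumption: ~31e9 parameters, taken from the "31B" in the model name.
# Scale factors and unquantized layers are ignored.
NUM_PARAMS = 31e9


def weight_gigabytes(num_params: float, bits_per_param: int) -> float:
    """Size of the weights in GB (1 GB = 1e9 bytes)."""
    return num_params * bits_per_param / 8 / 1e9


sixteen_bit_gb = weight_gigabytes(NUM_PARAMS, 16)
fp8_gb = weight_gigabytes(NUM_PARAMS, 8)
reduction = 1 - fp8_gb / sixteen_bit_gb

print(f"16-bit: {sixteen_bit_gb:.0f} GB, FP8: {fp8_gb:.0f} GB, reduction: {reduction:.0%}")
```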

Weights are quantized using block-wise FP8 scaling (128×128 blocks), and activations are quantized dynamically per group (group_size=128). LLM Compressor quantizes only the weights and activations of the linear operators within transformer blocks; the vision tower, embedding, and output head layers are kept in their original precision.
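
To make the two granularities concrete, here is a minimal pure-Python sketch of how the scales could be computed. It assumes the FP8 E4M3 format, whose largest finite value is 448, and it covers only the scale computation, not the FP8 rounding or the kernels used by LLM Compressor and vLLM. The helper names are hypothetical, and the demo uses a block and group size of 2 instead of 128 so the result is easy to inspect.

```python
FP8_E4M3_MAX = 448.0  # largest finite E4M3 value (assumed FP8 format)
EPS = 1e-12           # keeps the scale non-zero for an all-zero block


def block_scales(weight, block=128):
    """Return one scale per (block x block) tile of a 2-D list of floats."""
    rows, cols = len(weight), len(weight[0])
    scales = []
    for r0 in range(0, rows, block):
        row_scales = []
        for c0 in range(0, cols, block):
            tile_max = max(
                abs(weight[r][c])
                for r in range(r0, min(r0 + block, rows))
                for c in range(c0, min(c0 + block, cols))
            )
            row_scales.append(max(tile_max, EPS) / FP8_E4M3_MAX)
        scales.append(row_scales)
    return scales


def group_scales(values, group_size=128):
    """Return one scale per contiguous group of activation values.

    "Dynamic" means this runs on the actual activations at inference time.
    """
    return [
        max(max(abs(v) for v in values[i:i + group_size]), EPS) / FP8_E4M3_MAX
        for i in range(0, len(values), group_size)
    ]


weight = [
    [1.0, -2.0, 0.5, 0.25],
    [0.0, 4.48, 0.1, -0.2],
    [8.96, 0.0, 0.0, 0.0],
    [0.0, 0.0, 0.0, 0.0],
]
w_scales = block_scales(weight, block=2)                      # 2x2 grid of scales
a_scales = group_scales([0.5, -3.0, 1.0, 2.0], group_size=2)  # two scales
print(w_scales)
print(a_scales)
```

Dividing each value by the scale of its block or group maps the largest magnitude in that block onto the edge of the FP8 range, which is why smaller blocks tend to track outliers more closely than a single per-tensor scale.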

Deployment

Use with vLLM

This model can be deployed using vLLM. For detailed instructions including multi-GPU deployment, multimodal inference, thinking mode, function calling, and benchmarking, see the Gemma 4 vLLM usage guide.

Start the vLLM server:

bash
vllm serve RedHatAI/gemma-4-31B-it-FP8-block \
  --tensor-parallel-size 2 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --enable-auto-tool-choice \
  --reasoning-parser gemma4 \
  --tool-call-parser gemma4 \
  --chat-template examples/tool_chat_template_gemma4.jinja \
  --limit-mm-per-prompt '{"image": 4, "audio": 1}' \
  --async-scheduling

Tip: For text-only workloads, pass --limit-mm-per-prompt '{"image": 0, "audio": 0}' to skip vision encoder memory allocation and free up GPU memory for a longer context window.

Send requests to the server:

python
from openai import OpenAI

# vLLM ignores the API key unless the server was started with --api-key.
openai_api_key = "EMPTY"
openai_api_base = "http://<your-server-host>:8000/v1"

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)

# Must match the model name the server is serving.
model = "RedHatAI/gemma-4-31B-it-FP8-block"

messages = [
    {"role": "user", "content": "Explain quantum mechanics clearly and concisely."},
]

outputs = client.chat.completions.create(
    model=model,
    messages=messages,
    # Forwarded to the chat template to turn on thinking mode for this request.
    extra_body={"chat_template_kwargs": {"enable_thinking": True}},
)

generated_text = outputs.choices[0].message.content
print(generated_text)
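
Because the model accepts image input, a request can also carry an image. The sketch below only builds the message payload in the OpenAI chat format, which vLLM's OpenAI-compatible server accepts. The helper name `build_image_message` and the example URL are hypothetical.

```python
def build_image_message(prompt: str, image_url: str) -> dict:
    """Build one user message that pairs an image with a text prompt."""
    return {
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": image_url}},
            {"type": "text", "text": prompt},
        ],
    }


messages = [
    build_image_message(
        "Describe this image in one sentence.",
        "https://example.com/cat.png",  # hypothetical placeholder URL
    )
]

# This list can be passed as `messages=` to client.chat.completions.create(...)
# in the same way as the text-only request.
print(messages)
```

The server has to be started with a non-zero image limit in `--limit-mm-per-prompt` for such a request to be accepted.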

Creation

This model was created by applying data-free FP8 block quantization with LLM Compressor, as presented in the code snippet below.

python
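from llmcompressor import model_free_ptq

MODEL_ID = "google/gemma-4-31B-it"
# Yields "gemma-4-31B-it-FP8-block".
SAVE_DIR = MODEL_ID.split("/")[1] + "-FP8-block"

# Data-free quantization: no calibration dataset is passed.
model_free_ptq(
    model_stub=MODEL_ID,
    save_directory=SAVE_DIR,
    scheme="FP8_BLOCK",
    # Keep the vision tower, output head, and embeddings in original precision.
    ignore=["re:.*vision.*", "lm_head", "re:.*embed_tokens.*"],
    max_workers=8,
    device="cuda:0",
)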
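
To illustrate which layers the `ignore` list is meant to skip, the sketch below applies the three patterns to a few made-up module names. It assumes that entries prefixed with `re:` are regular expressions matched from the start of the name and that plain entries must equal the name exactly. This is a simplification for illustration, not LLM Compressor's actual matching code, and the module names are hypothetical.

```python
import re

IGNORE = ["re:.*vision.*", "lm_head", "re:.*embed_tokens.*"]


def is_ignored(name: str, patterns=IGNORE) -> bool:
    """Return True if a module name matches any ignore pattern."""
    for pattern in patterns:
        if pattern.startswith("re:"):
            if re.match(pattern[3:], name):
                return True
        elif pattern == name:
            return True
    return False


# Hypothetical module names, for illustration only.
names = [
    "model.vision_tower.encoder.layers.0.mlp.fc1",
    "model.language_model.embed_tokens",
    "model.language_model.layers.0.self_attn.q_proj",
    "lm_head",
]
for name in names:
    print(f"{name}: {'skipped' if is_ignored(name) else 'quantized'}")
```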

Evaluation

This model was evaluated on GSM8K Platinum, MMLU-Pro, IFEval, MATH-500, AIME 2025, GPQA Diamond, LiveCodeBench v6, and BFCLv4 (function calling) using lm-evaluation-harness, lighteval, and BFCL. In every case the model was served with vLLM through its OpenAI-compatible API. Accuracy results are reported both without and with thinking enabled, except that BFCLv4 was evaluated with thinking enabled and LiveCodeBench v6 was evaluated without thinking only.

Accuracy

Without thinking

With thinking

Reproduction

The results were obtained using the following commands:

Each benchmark was run 3 times with different random seeds (1234, 2345, 3456) and the scores were averaged; AIME 2025 used 8 seeds.
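
The averaging step can be sketched as below. The per-seed scores are placeholders chosen only to exercise the code and are not results for this model. The standard deviation is an extra that the text does not mention, included because it is cheap to compute alongside the mean.

```python
from statistics import mean, stdev

SEEDS = [1234, 2345, 3456]


def summarize(scores_by_seed: dict) -> tuple:
    """Return (mean, sample standard deviation) over the per-seed scores."""
    scores = [scores_by_seed[seed] for seed in SEEDS]
    return mean(scores), stdev(scores)


# Placeholder values, NOT measured results.
placeholder_scores = {1234: 80.0, 2345: 82.0, 3456: 81.0}
avg, spread = summarize(placeholder_scores)
print(f"mean={avg:.2f} stdev={spread:.2f}")
```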

vLLM server (with thinking):

bash
vllm serve RedHatAI/gemma-4-31B-it-FP8-block \
  --served-model-name gemma-4-31b-it-FP8-block \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90 \
  --language-model-only \
  --enable-auto-tool-choice \
  --reasoning-parser gemma4 \
  --tool-call-parser gemma4 \
  --chat-template examples/tool_chat_template_gemma4.jinja \
  --async-scheduling \
  --default-chat-template-kwargs '{"enable_thinking": true}'

Note: To reproduce the results without thinking, remove --default-chat-template-kwargs '{"enable_thinking": true}'. To run without tool calling, remove --enable-auto-tool-choice, --tool-call-parser gemma4, and --reasoning-parser gemma4.
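
The flag combinations described in the note can be assembled programmatically. The following is a minimal sketch that uses only the flags shown in the server command. The function name `build_serve_command` and its two switches are hypothetical.

```python
import shlex


def build_serve_command(thinking: bool = True, tool_calling: bool = True) -> list:
    """Assemble the `vllm serve` argument list used for the evaluations."""
    cmd = [
        "vllm", "serve", "RedHatAI/gemma-4-31B-it-FP8-block",
        "--served-model-name", "gemma-4-31b-it-FP8-block",
        "--max-model-len", "32768",
        "--gpu-memory-utilization", "0.90",
        "--language-model-only",
    ]
    if tool_calling:
        # Per the note, these three flags are removed together.
        cmd += [
            "--enable-auto-tool-choice",
            "--reasoning-parser", "gemma4",
            "--tool-call-parser", "gemma4",
        ]
    cmd += [
        "--chat-template", "examples/tool_chat_template_gemma4.jinja",
        "--async-scheduling",
    ]
    if thinking:
        cmd += ["--default-chat-template-kwargs", '{"enable_thinking": true}']
    return cmd


# Print the command for the "without thinking" configuration.
print(shlex.join(build_serve_command(thinking=False)))
```

Returning a list rather than a string lets the same value be passed directly to `subprocess.run` without shell quoting issues.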

GSM8K Platinum (lm-eval, 0-shot, 3 repetitions)

bash
lm_eval --model local-chat-completions \
  --tasks gsm8k_platinum_cot_llama \
  --model_args "model=gemma-4-31b-it-FP8-block,max_length=32768,base_url=http://0.0.0.0:8000/v1/chat/completions,num_concurrent=32,max_retries=3,tokenized_requests=False,tokenizer_backend=None,timeout=3600" \
  --num_fewshot 0 \
  --apply_chat_template \
  --output_path results_gsm8k_platinum.json \
  --seed 1234 \
  --gen_kwargs "do_sample=True,temperature=1.0,top_p=0.95,top_k=64,max_gen_toks=32000,seed=1234"

MMLU-Pro (lm-eval, 0-shot, 3 repetitions)

bash
lm_eval --model local-chat-completions \
  --tasks mmlu_pro_chat \
  --model_args "model=gemma-4-31b-it-FP8-block,max_length=32768,base_url=http://0.0.0.0:8000/v1/chat/completions,num_concurrent=32,max_retries=3,tokenized_requests=False,tokenizer_backend=None,timeout=3600" \
  --num_fewshot 0 \
  --apply_chat_template \
  --output_path results_mmlu_pro.json \
  --seed 1234 \
  --gen_kwargs "do_sample=True,temperature=1.0,top_p=0.95,top_k=64,max_gen_toks=32000,seed=1234"

IFEval (lm-eval, 0-shot, 3 repetitions)

bash
lm_eval --model local-chat-completions \
  --tasks ifeval \
  --model_args "model=gemma-4-31b-it-FP8-block,max_length=32768,base_url=http://0.0.0.0:8000/v1/chat/completions,num_concurrent=32,max_retries=3,tokenized_requests=False,tokenizer_backend=None,timeout=3600" \
  --num_fewshot 0 \
  --apply_chat_template \
  --output_path results_ifeval.json \
  --seed 1234 \
  --gen_kwargs "do_sample=True,temperature=1.0,top_p=0.95,top_k=64,max_gen_toks=32000,seed=1234"
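
The three lm-eval commands above differ only in the task name, and each has to be repeated for every seed. One way to generate all nine invocations is sketched below. The per-seed output file name is an assumption added so that runs do not overwrite each other; everything else is copied from the commands above.

```python
import shlex

SEEDS = [1234, 2345, 3456]
TASKS = ["gsm8k_platinum_cot_llama", "mmlu_pro_chat", "ifeval"]
MODEL_ARGS = (
    "model=gemma-4-31b-it-FP8-block,max_length=32768,"
    "base_url=http://0.0.0.0:8000/v1/chat/completions,num_concurrent=32,"
    "max_retries=3,tokenized_requests=False,tokenizer_backend=None,timeout=3600"
)


def lm_eval_command(task: str, seed: int) -> list:
    """Build the lm_eval argument list for one task and one seed."""
    gen_kwargs = (
        "do_sample=True,temperature=1.0,top_p=0.95,top_k=64,"
        f"max_gen_toks=32000,seed={seed}"
    )
    return [
        "lm_eval", "--model", "local-chat-completions",
        "--tasks", task,
        "--model_args", MODEL_ARGS,
        "--num_fewshot", "0",
        "--apply_chat_template",
        # Hypothetical per-seed file name (not from the original commands).
        "--output_path", f"results_{task}_seed{seed}.json",
        "--seed", str(seed),
        "--gen_kwargs", gen_kwargs,
    ]


commands = [lm_eval_command(task, seed) for task in TASKS for seed in SEEDS]
for cmd in commands:
    # Printed for review; pass `cmd` to subprocess.run(cmd, check=True) to execute.
    print(shlex.join(cmd))
```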

MATH-500, AIME 2025, GPQA Diamond (lighteval, 3 repetitions; 8 for AIME 2025)

litellm_config.yaml:

yaml
model_parameters:
  provider: hosted_vllm
  model_name: hosted_vllm/gemma-4-31b-it-FP8-block
  base_url: http://0.0.0.0:8000/v1
  api_key: ''
  timeout: 3600
  concurrent_requests: 32
  generation_parameters:
    temperature: 1.0
    max_new_tokens: 65536
    top_p: 0.95
    top_k: 64
    seed: 1234

Run once per seed (changing seed in the config each time):

bash
lighteval endpoint litellm litellm_config.yaml 'math_500|0' \
  --output-dir results/ --save-details

lighteval endpoint litellm litellm_config.yaml 'aime25|0' \
  --output-dir results/ --save-details

lighteval endpoint litellm litellm_config.yaml 'gpqa:diamond|0' \
  --output-dir results/ --save-details
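
Editing the seed by hand for every run is error-prone, so the config can instead be rendered once per seed. The sketch below reuses the thinking-mode YAML shown above verbatim and substitutes only the seed. The helper names and the `litellm_config_seed<seed>.yaml` file naming are assumptions. Only the three seeds named in the text are listed; the remaining AIME 2025 seeds are not given there and are left to the caller.

```python
from pathlib import Path

CONFIG_TEMPLATE = """\
model_parameters:
  provider: hosted_vllm
  model_name: hosted_vllm/gemma-4-31b-it-FP8-block
  base_url: http://0.0.0.0:8000/v1
  api_key: ''
  timeout: 3600
  concurrent_requests: 32
  generation_parameters:
    temperature: 1.0
    max_new_tokens: 65536
    top_p: 0.95
    top_k: 64
    seed: {seed}
"""

SEEDS = [1234, 2345, 3456]


def render_config(seed: int) -> str:
    """Return the litellm config text with the given seed filled in."""
    return CONFIG_TEMPLATE.format(seed=seed)


def write_configs(seeds, directory="."):
    """Write one config file per seed and return the created paths."""
    paths = []
    for seed in seeds:
        path = Path(directory) / f"litellm_config_seed{seed}.yaml"
        path.write_text(render_config(seed))
        paths.append(path)
    return paths


# Preview one rendered config; call write_configs(SEEDS) to create the files.
print(render_config(SEEDS[0]))
```

Each generated file can then be passed to `lighteval endpoint litellm` in place of `litellm_config.yaml`.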

LiveCodeBench v6 (lighteval, 3 repetitions, without thinking)

litellm_config.yaml:

yaml
model_parameters:
  provider: hosted_vllm
  model_name: hosted_vllm/gemma-4-31b-it-FP8-block
  base_url: http://0.0.0.0:8000/v1
  api_key: ''
  timeout: 1200
  concurrent_requests: 32
  generation_parameters:
    temperature: 1.0
    max_new_tokens: 32768
    top_p: 0.95
    top_k: 64
    seed: 1234

Run once per seed (using the vLLM server without --default-chat-template-kwargs):

bash
lighteval endpoint litellm litellm_config.yaml 'lcb:codegeneration_v6|0' \
  --output-dir results/ --save-details

BFCLv4

BFCL requires the model to be registered in the leaderboard codebase before running evaluation.

Step 1 — Register the model in bfcl_eval/constants/model_config.py

Add the following entry to api_inference_model_map:

python
"gemma-4-31b-it-FP8-block": ModelConfig(
    model_name="gemma-4-31b-it-FP8-block",
    display_name="Gemma-4-31b-it-FP8-Block (FC)",
    url="https://huggingface.co/RedHatAI/gemma-4-31B-it-FP8-block",
    org="Google",
    license="Apache 2.0",
    model_handler=OpenAICompletionsHandler,
    input_price=None,
    output_price=None,
    is_fc_model=True,
    underscore_to_dot=True,
),

Step 2 — Add the key to bfcl_eval/constants/supported_models.py

Add "gemma-4-31b-it-FP8-block" to the SUPPORTED_MODELS list.

Step 3 — Start the vLLM server (use the command at the top of this section; the --served-model-name flag ensures BFCL can find the model by its registered slug).

Step 4 — Generate responses and evaluate

bash
bfcl generate --model gemma-4-31b-it-FP8-block --test-category all
bfcl evaluate --model gemma-4-31b-it-FP8-block --test-category all
