• May 7, 2026
  • 5 min read

Gemma-4-31B-it API on FriendliAI: #1 Output Speed & Response Time

TL;DR
  • Gemma-4-31B-it is a dense, instruction-tuned, vision-language model by Google DeepMind, and it’s now available on Friendli Model APIs and Dedicated Endpoints.
  • FriendliAI delivers the highest-performance inference for Gemma-4-31B-it, ranking #1 for output speed, time-to-first-token, and end-to-end response times according to Artificial Analysis.
  • The model excels at inference for coding, agentic workflows, document extraction, and question answering.
  • It achieves frontier model results on benchmarks: 2,150 Codeforces ELO, 89.2% AIME 2026, 84.3% GPQA Diamond, and 76.9% MMMU Pro.
  • Learn how to execute agentic code reviews with Gemma-4-31B-it on FriendliAI.

Gemma-4-31B-it is the largest of the Gemma 4 open-weight model family by Google DeepMind. The model is live on FriendliAI, and our Model API delivers industry-leading output speeds, time-to-first-token, and end-to-end response times, according to Artificial Analysis.

It’s a 31B parameter dense, instruction-tuned, and multimodal vision-language model (VLM) equipped with configurable thinking and native function calling capabilities for advanced reasoning and agentic workflows. With FriendliAI, developers have two deployment options:

  • The Friendli Model API is a serverless endpoint that lets you run high-performance inference with a single API call to a designated model — no infrastructure management required.
  • Friendli Dedicated Endpoints serve the fastest, most scalable, and most reliable inference for open-weight models on reserved GPU capacity — delivering 2–5x higher throughput than standard self-managed inference deployments.

Try Gemma-4-31B-it on FriendliAI.

Gemma-4-31B-it: Google’s open-weight agentic coding model

Gemma-4-31B-it is a dense transformer with about 31 billion total parameters across 60 layers, paired with a 550M-parameter vision encoder. Its hybrid attention pattern lets the model use the full 256K context window without straining memory during long-document inference. BF16 weights can be served on a single 80GB H100, and quantized variants can run on workstation GPUs under appropriate runtime configurations. The training data extends through January 2025 and spans web text, code, math, and images in 140+ languages.
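
As a quick sanity check on that H100 claim: at two bytes per parameter in BF16, the weights alone come to roughly 31B × 2 bytes ≈ 62 GB, which leaves headroom on an 80GB card for KV cache and activations.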

The multimodal VLM accepts text or image inputs and produces text outputs. Variable image resolution and aspect ratio are handled natively, with a configurable visual token budget between 70 and 1120 tokens that lets you trade compute for fidelity on a per-call basis. It does not accept audio — that's a feature reserved for Gemma 4 E2B and E4B.
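
For illustration, here's a minimal sketch of the multimodal input path, assuming the endpoint accepts the OpenAI-compatible content-part format for images; the image URL is a placeholder, and any parameter for tuning the visual token budget would be endpoint-specific, so it's omitted here:

python
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["FRIENDLI_API_KEY"],
    base_url="https://api.friendli.ai/serverless/v1",
)

# Mixed text + image request in the OpenAI-compatible content-part format.
completion = client.chat.completions.create(
    model="google/gemma-4-31B-it",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe the trend shown in this chart."},
                # Placeholder image URL for illustration only.
                {"type": "image_url", "image_url": {"url": "https://example.com/chart.png"}},
            ],
        }
    ],
)
print(completion.choices[0].message.content)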

It features native agentic primitives that support function calling, structured JSON output, and system roles in the chat template, so developers don't have to coax tool use out of the model with awkward prompting tricks. Configurable thinking mode lets the model work through its reasoning before generating a response. Disable this setting, and it answers directly. No fine-tuning, no separate model — same weights, two operating modes.
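
To make those agentic primitives concrete, here's a minimal function-calling sketch using the standard OpenAI-style tools parameter; the get_weather function is hypothetical and exists purely for illustration:

python
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.environ["FRIENDLI_API_KEY"],
    base_url="https://api.friendli.ai/serverless/v1",
)

# A hypothetical tool declared in the OpenAI function-calling schema.
tools = [{
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city.",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}]

completion = client.chat.completions.create(
    model="google/gemma-4-31B-it",
    messages=[{"role": "user", "content": "What's the weather in Seoul right now?"}],
    tools=tools,
)

# If the model decides a tool is needed, the structured call shows up here
# instead of a plain-text answer.
print(completion.choices[0].message.tool_calls)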

Gemma-4-31B-it is well-suited for the following applications:

  • Code-review and coding-assistant pipelines
  • Agentic workflows that chain tool calls with structured outputs
  • Document-extraction systems that mix layout-heavy PDFs with downstream reasoning
  • Research-grade question answering where the answer matters more than the latency

Gemma-4-31B-it on FriendliAI: #1 on Artificial Analysis

FriendliAI serves the fastest, most scalable inference for Gemma-4-31B-it with reasoning enabled — ranking #1 for output speed, time-to-first-token, and end-to-end response times according to Artificial Analysis.

The performance leaderboard reports FriendliAI’s Model API generating 71 output tokens per second. For workloads with 10k input tokens, FriendliAI produces the first token in 30.3 seconds and completes 500 output tokens in a total of 37.3 seconds. Our optimized inference stack beats the seven other inference providers offering model APIs for Gemma-4-31B-it on every one of these metrics.

[Chart] Output speed: output tokens per second on requests with 10k input tokens. Source: Artificial Analysis, May 6, 2026.
[Chart] End-to-end response time: total response time for 10k input tokens and 500 output tokens. Source: Artificial Analysis, May 6, 2026.

Gemma-4-31B-it outperforms every other Gemma 4 model and even models with higher parameter counts, scoring at frontier scale on AIME 2026, GPQA Diamond, and LiveCodeBench v6.

  • Competitive coding. 31B-it scores a 2,150 Codeforces ELO and 80.0% on LiveCodeBench v6 — territory that, until recently, belonged to closed-source frontier models. Up against Gemma-3-27B's 110 ELO, it's not a small step.
  • Advanced math and graduate-level reasoning. 89.2% on AIME 2026 (no tools), 84.3% on GPQA Diamond, and 74.4% on BigBench Extra Hard. Thinking mode is doing real work here.
  • Multimodal document understanding. 76.9% on MMMU Pro, 85.6% on MATH-Vision, and a 0.131 average edit distance on OmniDocBench 1.5 — strong enough for OCR, chart parsing, and long-form PDF extraction.

Getting started on FriendliAI

You can run Gemma-4-31B-it on Model APIs for serverless inference billed by the token, or on Dedicated Endpoints if you'd rather pin reserved GPU capacity to your workload.

Pricing for Model APIs:

Token type       Price (per 1M tokens)
Input tokens     $0.14
Output tokens    $0.40
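
At these rates, the Artificial Analysis workload above (10k input tokens, 500 output tokens) costs about $0.0016 per request: 10,000 × $0.14/1M = $0.0014 for input plus 500 × $0.40/1M = $0.0002 for output.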

Set up your dependencies:

  1. Create an account at friendli.ai.
  2. Generate a Friendli API key and export it:
    export FRIENDLI_API_KEY="YOUR_API_KEY"
  3. Install the OpenAI Python SDK (recommended): pip install openai

For Dedicated Endpoints, the deployment flow is:

  1. Open the Dedicated Endpoints console and pick google/gemma-4-31B-it from the model catalog.
  2. Choose your GPU architecture. Select B200, H200, H100, or A100, depending on availability and quantization.
  3. Configure the endpoint. Select your quantization: BF16 or one of the supported quantized formats. Enable host KV cache for prompt-prefix reuse, turn on autoscaling, and set min/max replica counts as necessary.
  4. Deploy; the endpoint ID returned to you is what you'll pass to the SDK as ENDPOINT_ID.
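
Once the endpoint is live, calling it looks almost identical to the serverless flow. Here's a minimal sketch, assuming the dedicated endpoints share the OpenAI-compatible API shape with the endpoint ID standing in for the model name (confirm the exact base URL in your console):

python
import os
from openai import OpenAI

# Assumed dedicated base URL; confirm it on your endpoint's console page.
client = OpenAI(
    api_key=os.environ["FRIENDLI_API_KEY"],
    base_url="https://api.friendli.ai/dedicated/v1",
)

completion = client.chat.completions.create(
    model=os.environ["ENDPOINT_ID"],  # the endpoint ID from step 4
    messages=[{"role": "user", "content": "Ping from a dedicated endpoint."}],
)
print(completion.choices[0].message.content)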

Agentic code review with Gemma-4-31B-it

Gemma-4-31B-it's combination of strong coding ability and configurable reasoning makes it well-suited for agents that need to inspect code, reason through it, and produce structured feedback. Below is a minimal review agent that takes a Python function, searches for bugs and anti-patterns, and returns a JSON report.

To call the Friendli Model API, set up a virtual environment:

bash
python -m venv .venv 
source .venv/bin/activate 
pip install openai 
export FRIENDLI_API_KEY="YOUR_API_KEY" 

Run the following Python snippet:

gemma-4-31B-it.py
import os
from openai import OpenAI

# Point the OpenAI SDK at Friendli's OpenAI-compatible serverless endpoint.
client = OpenAI(
    api_key=os.environ["FRIENDLI_API_KEY"],
    base_url="https://api.friendli.ai/serverless/v1",
)

code_to_review = """
def average(numbers):
    total = 0
    for i in range(len(numbers)):
        total += numbers[i]
    return total / len(numbers)
"""

system_prompt = (
    "You are a senior software engineer who specializes in reviewing Python scripts. "
    "Return JSON with keys: issues (list), suggestions (list), severity (low|medium|high)."
)

# Request a structured JSON review; low temperature keeps the output consistent.
completion = client.chat.completions.create(
    model="google/gemma-4-31B-it",
    messages=[
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": f"Review this function:\n\n```python\n{code_to_review}\n```"},
    ],
    response_format={"type": "json_object"},
    temperature=0.2,
)

print(completion.choices[0].message.content)

Here’s an example of a successful response in JSON format:

json
{
  "issues": [
    "The function will raise a ZeroDivisionError if an empty list is passed as an argument.",
    "The use of range(len(numbers)) is an anti-pattern in Python; it is slower and less readable than iterating over the elements directly."
  ],
  "suggestions": [
    "Use the built-in sum() function to calculate the total more efficiently.",
    "Add a check for an empty list (e.g., 'if not numbers: return 0' or raise a custom exception) to prevent division by zero.",
    "Consider using the 'statistics.mean()' function from the Python standard library for better precision and readability."
  ],
  "severity": "medium"
}
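
Because the report is plain JSON, it drops straight into automation. As a hedged sketch, a hypothetical CI step could parse the report from the script above and block merges on high-severity findings:

python
import json

# 'completion' is the response object from gemma-4-31B-it.py above.
report = json.loads(completion.choices[0].message.content)

# Hypothetical gating policy: fail the pipeline on high-severity reviews.
if report["severity"] == "high":
    raise SystemExit(f"Code review failed: {report['issues']}")

for tip in report["suggestions"]:
    print(f"- {tip}")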

Run Gemma-4-31B-it on FriendliAI

Gemma-4-31B-it beats the other Gemma 4 models and even larger models on the most recognized benchmarks for coding, math, reasoning, and document understanding. It also handles multimodal vision-language understanding and native agentic tooling. Those capabilities come with the best output speed and response times on FriendliAI, making production-grade inference practical at scale.

Spin up serverless Model APIs without managing any infrastructure, or configure and deploy Dedicated Endpoints on reserved GPU capacity. Either way, you're a single click away from deploying one of the strongest open-weight reasoners.

Run Gemma-4-31B-it on FriendliAI.


Written by

FriendliAI Tech & Research




General FAQ

What is FriendliAI?

FriendliAI is a GPU-inference platform that lets you deploy, scale, and monitor large language and multimodal models in production, without owning or managing GPU infrastructure. We offer three things for your AI models: unmatched speed, cost efficiency, and operational simplicity. Find out which product is the best fit for you here.

How does FriendliAI help my business?

Friendli Inference lets you squeeze more tokens per second out of every GPU. Because you need fewer GPUs to serve the same load, the true metric—tokens per dollar—comes out higher even if the hourly GPU rate looks similar on paper. View pricing

Which models and modalities are supported?

Over 550,000 text, vision, audio, and multimodal models are deployable out of the box. You can also upload custom models or LoRA adapters. Explore models

Can I deploy models from Hugging Face directly?

Yes. Selecting “Friendli Endpoints” on the Hugging Face Hub takes you straight to our model deployment page for a one-click deploy. The page provides an easy-to-use interface for setting up Friendli Dedicated Endpoints, a managed service for generative AI inference. Learn more about our Hugging Face partnership

Still have questions?

If you want a customized solution for a key issue that is slowing your growth, email contact@friendli.ai or click Talk to an engineer — our engineers (not a bot) will reply within one business day.


Explore FriendliAI today