- May 7, 2026
- 5 min read
Gemma-4-31B-it API on FriendliAI: #1 Output Speed & Response Time
- Gemma-4-31B-it is a dense, instruction-tuned, vision-language model by Google DeepMind, and it’s now available on Friendli Model APIs and Dedicated Endpoints.
- FriendliAI delivers the highest-performance inference for Gemma-4-31B-it, ranking #1 for output speed, time-to-first-token, and end-to-end response times according to Artificial Analysis.
- The model excels at inference for coding, agentic workflows, document extraction, and question answering.
- It achieves frontier model results on benchmarks: 2,150 Codeforces ELO, 89.2% AIME 2026, 84.3% GPQA Diamond, and 76.9% MMMU Pro.
- Learn how to execute agentic code reviews with Gemma-4-31B-it on FriendliAI.

Gemma-4-31B-it is the largest of the Gemma 4 open-weight model family by Google DeepMind. The model is live on FriendliAI, and our Model API delivers industry-leading output speeds, time-to-first-token, and end-to-end response times, according to Artificial Analysis.
It’s a 31B parameter dense, instruction-tuned, and multimodal vision-language model (VLM) equipped with configurable thinking and native function calling capabilities for advanced reasoning and agentic workflows. With FriendliAI, developers have two deployment options:
- The Friendli Model API is a serverless endpoint that lets you run high-performance inference with a single API call to a designated model — no infrastructure management required.
- Friendli Dedicated Endpoints serve the fastest, most scalable, and reliable inference for open-weight models on reserved GPU capacity — delivering 2–5x higher throughput than standard self-managed inference deployments.
Try Gemma-4-31B-it on FriendliAI.
Gemma-4-31B-it: Google’s open-weight agentic coding model
Gemma-4-31B-it is a dense transformer with about 31 billion total parameters across 60 layers, paired with a 550M-parameter vision encoder. Its hybrid attention pattern lets the model use the full 256K context window without straining memory during long-document inference. BF16 weights can be served on a single 80GB H100, and quantized variants can run on workstation GPUs under appropriate runtime configurations. The training data extends through January 2025 and spans web text, code, math, and images in 140+ languages.
The multimodal VLM accepts text or image inputs and produces text outputs. Variable image resolution and aspect ratio are handled natively, with a configurable visual token budget between 70 and 1120 tokens that lets you trade compute for fidelity on a per-call basis. It does not accept audio — that's a feature reserved for Gemma 4 E2B and E4B.
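As a concrete sketch, an image can be attached through the OpenAI-compatible chat format as a standard image_url content part. The URL below is a placeholder, and the per-call visual token budget is set through a provider-specific parameter not shown here:

```python
def build_vision_messages(question: str, image_url: str) -> list[dict]:
    """Build an OpenAI-compatible multimodal chat message: text plus one image."""
    return [
        {
            "role": "user",
            "content": [
                {"type": "text", "text": question},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }
    ]

# Placeholder URL for illustration only.
msgs = build_vision_messages("What does this chart show?", "https://example.com/chart.png")
```

The same message list is passed as `messages` in a chat completion request, exactly as with a text-only prompt.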
It features native agentic primitives: function calling, structured JSON output, and system roles in the chat template, so developers don't have to coax tool use out of the model with awkward prompting tricks. Configurable thinking mode lets the model reason through a problem before generating a response. Disable this setting, and it answers directly. No fine-tuning, no separate model — same weights, two operating modes.
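To illustrate what native function calling looks like in practice, here is a minimal tool declaration in the OpenAI-compatible tools schema. The get_weather function and its parameters are invented for this sketch:

```python
# A hypothetical weather-lookup tool, declared in the OpenAI-compatible schema.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Look up the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"},
                },
                "required": ["city"],
            },
        },
    }
]
# Passed as `tools=TOOLS` in a chat completion request, the model can emit a
# structured tool call (function name plus JSON arguments) instead of free text.
```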
Gemma-4-31B-it is well-suited for the following applications:
- Code-review and coding-assistant pipelines
- Agentic workflows that chain tool calls with structured outputs
- Document-extraction systems that mix layout-heavy PDFs with downstream reasoning
- Research-grade question answering where the answer matters more than the latency
Gemma-4-31B-it on FriendliAI: #1 on Artificial Analysis
FriendliAI serves the fastest, most scalable inference for Gemma-4-31B-it with reasoning enabled — ranking #1 for output speed, time-to-first-token, and end-to-end response times according to Artificial Analysis.
The performance leaderboard reported FriendliAI’s Model API generating 71 output tokens per second. For workloads with 10k input tokens, FriendliAI produces the first token in 30.3 seconds and completes 500 output tokens in 37.3 seconds end to end. Our optimized inference stack beats seven other inference providers offering model APIs for Gemma-4-31B-it on each of these metrics.


Gemma-4-31B-it outperforms all Gemma 4 models and even larger models with higher parameter counts, scoring at frontier model scale on AIME 2026, GPQA Diamond, and LiveCodeBench v6.
- Competitive coding. 31B-it scores a 2,150 Codeforces ELO and 80.0% on LiveCodeBench v6, territory that until recently belonged to closed-source frontier models. Next to Gemma-3-27B's 110 ELO, that is no small step.
- Advanced math and graduate-level reasoning. 89.2% on AIME 2026 (no tools), 84.3% on GPQA Diamond, and 74.4% on BigBench Extra Hard. Thinking mode is doing real work here.
- Multimodal document understanding. 76.9% on MMMU Pro, 85.6% on MATH-Vision, and a 0.131 average edit distance on OmniDocBench 1.5 — strong enough for OCR, chart parsing, and long-form PDF extraction.
Getting started on FriendliAI
You can run Gemma-4-31B-it on Model APIs for serverless inference billed by the token, or on Dedicated Endpoints if you'd rather pin reserved GPU capacity to your workload.
Pricing for Model APIs:
| Token type | Price (per 1M tokens) |
|---|---|
| Input tokens | $0.14 |
| Output tokens | $0.40 |
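Using the rates in the table, a quick back-of-the-envelope cost estimate looks like this:

```python
PRICE_PER_M = {"input": 0.14, "output": 0.40}  # USD per 1M tokens, from the table above

def cost_usd(input_tokens: int, output_tokens: int) -> float:
    """Estimate the Model API cost of one request in US dollars."""
    return round(
        input_tokens / 1e6 * PRICE_PER_M["input"]
        + output_tokens / 1e6 * PRICE_PER_M["output"],
        6,
    )

# A 10k-token prompt with a 500-token reply:
print(cost_usd(10_000, 500))  # → 0.0016
```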
Set up your dependencies:
- Create an account at friendli.ai.
- Generate a Friendli API key and export it:

```shell
export FRIENDLI_API_KEY="YOUR_API_KEY"
```

- Install the OpenAI Python SDK (recommended):

```shell
pip install openai
```
For Dedicated Endpoints, the deployment flow is:
- Open the Dedicated Endpoints console and pick google/gemma-4-31B-it from the model catalog.
- Choose your GPU architecture. Select B200, H200, H100, or A100, depending on availability and quantization.
- Configure the endpoint. Select your quantization: BF16 or one of the supported quantized formats. Enable host KV cache for prompt-prefix reuse, turn on autoscaling, and set min/max replica counts as necessary.
- Deploy; the endpoint ID returned to you is what you'll pass to the SDK as ENDPOINT_ID.
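A minimal sketch of calling a Dedicated Endpoint through the OpenAI Python SDK, assuming the OpenAI-compatible route at api.friendli.ai/dedicated/v1 and that the endpoint ID is passed as the model name (check the Friendli docs for the current details):

```python
import os

DEDICATED_BASE_URL = "https://api.friendli.ai/dedicated/v1"  # assumed OpenAI-compatible route

def chat(endpoint_id: str, prompt: str) -> str:
    """Send one chat turn to a Friendli Dedicated Endpoint and return the reply."""
    from openai import OpenAI  # installed earlier with `pip install openai`
    client = OpenAI(
        base_url=DEDICATED_BASE_URL,
        api_key=os.environ["FRIENDLI_API_KEY"],
    )
    resp = client.chat.completions.create(
        model=endpoint_id,  # Dedicated Endpoints take the endpoint ID as the model name
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content

# Example usage (requires a deployed endpoint and a valid API key):
# print(chat(os.environ["ENDPOINT_ID"], "Summarize PEP 8 in one sentence."))
```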
Agentic code review with Gemma-4-31B-it
Gemma-4-31B-it's combination of strong coding ability and configurable reasoning makes it well-suited for agents that need to inspect code, reason through it, and produce structured feedback. Below is a minimal review agent that takes a Python function, searches for bugs and anti-patterns, and returns a JSON report.
To call the Friendli Model API, set up a virtual environment:
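For example, assuming Python 3 with the venv module:

```shell
# Create and activate an isolated environment, then install the SDK.
python3 -m venv .venv
source .venv/bin/activate
pip install openai
export FRIENDLI_API_KEY="YOUR_API_KEY"  # the key generated earlier
```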
Run the following Python snippet:
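A minimal sketch of such an agent, assuming the OpenAI-compatible serverless route at api.friendli.ai/serverless/v1 and JSON-mode support via response_format (verify both against the current Friendli docs); the buggy sample function is invented for illustration:

```python
import json
import os

SYSTEM_PROMPT = (
    "You are a strict code reviewer. Inspect the user's Python function for "
    "bugs and anti-patterns. Respond ONLY with JSON of the form "
    '{"summary": str, "issues": [{"line": int, "severity": str, "message": str}]}.'
)

def build_messages(code: str) -> list[dict]:
    """Assemble the chat messages for one review request."""
    return [
        {"role": "system", "content": SYSTEM_PROMPT},
        {"role": "user", "content": f"Review this function:\n```python\n{code}\n```"},
    ]

def review(code: str) -> dict:
    """Call the Friendli Model API and parse the structured review."""
    from openai import OpenAI  # lazy import so the helpers above stay dependency-free
    client = OpenAI(
        base_url="https://api.friendli.ai/serverless/v1",  # assumed serverless route
        api_key=os.environ["FRIENDLI_API_KEY"],
    )
    resp = client.chat.completions.create(
        model="google/gemma-4-31B-it",
        messages=build_messages(code),
        response_format={"type": "json_object"},  # request structured JSON output
    )
    return json.loads(resp.choices[0].message.content)

# Example usage (requires FRIENDLI_API_KEY):
# buggy = "def average(xs):\n    return sum(xs) / len(xs)  # crashes on empty input\n"
# print(json.dumps(review(buggy), indent=2))
```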
Here’s an example of a successful response in JSON format:
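An illustrative response shape (not actual model output) might look like:

```json
{
  "summary": "The function computes a mean but crashes on empty input.",
  "issues": [
    {
      "line": 2,
      "severity": "high",
      "message": "Division by len(xs) raises ZeroDivisionError when xs is empty."
    }
  ]
}
```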
Run Gemma-4-31B-it on FriendliAI
Gemma-4-31B-it beats the other Gemma 4 models and even larger models on the most recognized benchmarks for coding, math, reasoning, and document understanding. It also handles multimodal vision-language understanding and native agentic tooling. Those capabilities come with the best output speed and response times on FriendliAI, making production-grade inference scalable.
Spin up serverless Model APIs without managing any infrastructure, or configure and deploy Dedicated Endpoints on reserved GPU capacity. Either way, you're a single click away from deploying one of the strongest open-weight reasoners.
Written by
FriendliAI Tech & Research
Share
General FAQ
What is FriendliAI?
FriendliAI is a GPU-inference platform that lets you deploy, scale, and monitor large language and multimodal models in production, without owning or managing GPU infrastructure. We offer three things for your AI models: unmatched speed, cost efficiency, and operational simplicity. Find out which product is the best fit for you here.
How does FriendliAI help my business?
Friendli Inference lets you squeeze more tokens per second out of every GPU. Because you need fewer GPUs to serve the same load, the true metric — tokens per dollar — comes out higher even if the hourly GPU rate looks similar on paper. View pricing
Which models and modalities are supported?
Over 550,000 text, vision, audio, and multi-modal models are deployable out of the box. You can also upload custom models or LoRA adapters. Explore models
Can I deploy models from Hugging Face directly?
Yes. A one-click deploy by selecting “Friendli Endpoints” on the Hugging Face Hub will take you to our model deployment page. The page provides an easy-to-use interface for setting up Friendli Dedicated Endpoints, a managed service for generative AI inference. Learn more about our Hugging Face partnership
Still have questions?
If you want a customized solution for the key issue that is slowing your growth, email contact@friendli.ai or click Talk to an engineer; our engineers (not a bot) will reply within one business day.

