Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more

Get help setting up a custom Dedicated Endpoints.

Talk with our engineer to get a quote for reserved GPU instances with discounts.

README

License: apache-2.0

Requirements

This is the unified Gemma 4 architecture (model_type: gemma4_unified), which is newer than the classic gemma4 (e.g. 31B). You need:

  • transformers ≥ 5.10.0 (when gemma4_unified was added)
  • vLLM with gemma4_unified support — at the time of writing this is on the main branch / nightly (uv pip install -U vllm --pre), not yet in a tagged stable release (≤ 0.22.0). It will be in the next stable release.

Usage with vLLM

bash

vllm serve bahadirakdemir/gemma-4-12B-it-text-fp8 \
--quantization modelopt \
--max-model-len 8192 \
--max-num-batched-tokens 8192 \
--gpu-memory-utilization 0.5 \
--limit-mm-per-prompt '{"image": 0, "audio": 0}'

For speculative decoding, pair it with the matching FP8 MTP drafter bahadirakdemir/gemma-4-12B-it-assistant-fp8:

bash

vllm serve bahadirakdemir/gemma-4-12B-it-text-fp8 \
--quantization modelopt \
--max-model-len 8192 \
--max-num-batched-tokens 8192 \
--gpu-memory-utilization 0.5 \
--limit-mm-per-prompt '{"image": 0, "audio": 0}' \
--speculative-config '{"model": "bahadirakdemir/gemma-4-12B-it-assistant-fp8", "num_speculative_tokens": 4}'

Tested with vllm/vllm-openai:gemma4-0505-arm64-cu130 on NVIDIA GB10.

Quantization details

MethodModelOpt FP8 PTQ (E4M3, per-tensor static scales)
Quantizedlanguage-model linears (attention + MLP projections)
Kept in BF16lm_head, tied embeddings, all norms
Calibration32 instruct-style prompts, max length 1024

License: Apache 2.0, inherited from upstream Gemma 4 — see the Gemma 4 license.

Model provider

bahadirakdemir

Model tree

Base

google/gemma-4-12B-it

Quantized

this model

Modalities

Input

Video, Audio, Text, Image

Output

Text

Pricing

Dedicated Endpoints

View details

Supported Functionality

Model APIs

Dedicated Endpoints

Container

More information

Explore FriendliAI today