Run this model inference on single tenant GPU with unmatched speed and reliability at scale.
Get help setting up a custom Dedicated Endpoints.
Talk with our engineer to get a quote for reserved GPU instances with discounts.
README
License: apache-2.0Requirements
This is the unified Gemma 4 architecture (model_type: gemma4_unified), which is
newer than the classic gemma4 (e.g. 31B). You need:
- transformers ≥ 5.10.0 (when
gemma4_unifiedwas added) - vLLM with
gemma4_unifiedsupport — at the time of writing this is on themainbranch / nightly (uv pip install -U vllm --pre), not yet in a tagged stable release (≤ 0.22.0). It will be in the next stable release.
Usage with vLLM
bash
vllm serve bahadirakdemir/gemma-4-12B-it-text-fp8 \--quantization modelopt \--max-model-len 8192 \--max-num-batched-tokens 8192 \--gpu-memory-utilization 0.5 \--limit-mm-per-prompt '{"image": 0, "audio": 0}'
For speculative decoding, pair it with the matching FP8 MTP drafter
bahadirakdemir/gemma-4-12B-it-assistant-fp8:
bash
vllm serve bahadirakdemir/gemma-4-12B-it-text-fp8 \--quantization modelopt \--max-model-len 8192 \--max-num-batched-tokens 8192 \--gpu-memory-utilization 0.5 \--limit-mm-per-prompt '{"image": 0, "audio": 0}' \--speculative-config '{"model": "bahadirakdemir/gemma-4-12B-it-assistant-fp8", "num_speculative_tokens": 4}'
Tested with vllm/vllm-openai:gemma4-0505-arm64-cu130 on NVIDIA GB10.
Quantization details
| Method | ModelOpt FP8 PTQ (E4M3, per-tensor static scales) |
| Quantized | language-model linears (attention + MLP projections) |
| Kept in BF16 | lm_head, tied embeddings, all norms |
| Calibration | 32 instruct-style prompts, max length 1024 |
License: Apache 2.0, inherited from upstream Gemma 4 — see the Gemma 4 license.
Model provider
bahadirakdemir
Model tree
Base
google/gemma-4-12B-it
Quantized
this model
Modalities
Input
Video, Audio, Text, Image
Output
Text
Pricing
Dedicated Endpoints
View detailsSupported Functionality
Model APIs
Dedicated Endpoints
Container
More information