bahadirakdemir/gemma-4-12B-it-text-fp8 API & Inference Endpoint

Requirements

This is the unified Gemma 4 architecture (model_type: gemma4_unified), which is newer than the classic gemma4 (e.g. 31B). You need:

transformers ≥ 5.10.0 (when gemma4_unified was added)
vLLM with gemma4_unified support — at the time of writing this is on the main branch / nightly (uv pip install -U vllm --pre), not yet in a tagged stable release (≤ 0.22.0). It will be in the next stable release.

Usage with vLLM

bash
vllm serve bahadirakdemir/gemma-4-12B-it-text-fp8 \
  --quantization modelopt \
  --max-model-len 8192 \
  --max-num-batched-tokens 8192 \
  --gpu-memory-utilization 0.5 \
  --limit-mm-per-prompt '{"image": 0, "audio": 0}'

For speculative decoding, pair it with the matching FP8 MTP drafter bahadirakdemir/gemma-4-12B-it-assistant-fp8:

bash
vllm serve bahadirakdemir/gemma-4-12B-it-text-fp8 \
  --quantization modelopt \
  --max-model-len 8192 \
  --max-num-batched-tokens 8192 \
  --gpu-memory-utilization 0.5 \
  --limit-mm-per-prompt '{"image": 0, "audio": 0}' \
  --speculative-config '{"model": "bahadirakdemir/gemma-4-12B-it-assistant-fp8", "num_speculative_tokens": 4}'

Tested with vllm/vllm-openai:gemma4-0505-arm64-cu130 on NVIDIA GB10.

Quantization details


Method	ModelOpt FP8 PTQ (E4M3, per-tensor static scales)
Quantized	language-model linears (attention + MLP projections)
Kept in BF16	`lm_head`, tied embeddings, all norms
Calibration	32 instruct-style prompts, max length 1024

License: Apache 2.0, inherited from upstream Gemma 4 — see the Gemma 4 license.

gemma-4-12B-it-text-fp8

Get help setting up a custom Dedicated Endpoints.

README

Requirements

Usage with vLLM

Quantization details

Explore FriendliAI today

gemma-4-12B-it-text-fp8