Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more

Get help setting up a custom Dedicated Endpoints.

Talk with our engineer to get a quote for reserved GPU instances with discounts.

README

License: apache-2.0

Quantization Details

  • Scheme: FP8_DYNAMIC (per-channel static weights + per-token dynamic activations)
  • Format: F8_E4M3
  • Tool: vllm-project/llm-compressor
  • Data-free: No calibration dataset required
  • Skipped layers: lm_head, vision encoder, embed_tokens, MoE routers

Architecture

Gemma 4 26B-A4B is a Mixture-of-Experts model:

  • 26B total parameters / 4B active per token
  • 128 fine-grained experts, top-8 routing
  • Multimodal: text + image input

⚠️ FP8_BLOCK is incompatible with this MoE model due to expert dimension constraints. Only FP8_DYNAMIC is supported.

Usage with vLLM

bash

vllm serve lokeshe09/gemma-4-26B-A4B-it-FP8-Dynamic \
--max-model-len 96000 \
--gpu-memory-utilization 0.90

Quantization Script

python

from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier
model = AutoModelForCausalLM.from_pretrained(
"google/gemma-4-26B-A4B-it",
device_map="cuda:0", torch_dtype="auto", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("google/gemma-4-26B-A4B-it")
oneshot(
model=model,
recipe=QuantizationModifier(
targets="Linear",
scheme="FP8_DYNAMIC",
ignore=["lm_head", "re:.*vision.*", "re:.*embed_tokens.*", "re:.*router.*"],
),
)
model.save_pretrained("gemma-4-26B-A4B-it-FP8-Dynamic")
tokenizer.save_pretrained("gemma-4-26B-A4B-it-FP8-Dynamic")

Quantized by

lokeshe09

Model provider

lokeshe09

lokeshe09

Model tree

Base

google/gemma-4-26B-A4B-it

Quantized

this model

Modalities

Input

Text, Image

Output

Text

Pricing

Dedicated Endpoints

View details

Supported Functionality

Model APIs

Dedicated Endpoints

Container

More information

Explore FriendliAI today