lokeshe09/gemma-4-26B-A4B-it-FP8-Dynamicc API & Inference Endpoint

Quantization Details

Scheme: FP8_DYNAMIC (per-channel static weights + per-token dynamic activations)
Format: F8_E4M3
Tool: vllm-project/llm-compressor
Data-free: No calibration dataset required
Skipped layers: lm_head, vision encoder, embed_tokens, MoE routers

Architecture

Gemma 4 26B-A4B is a Mixture-of-Experts model:

26B total parameters / 4B active per token
128 fine-grained experts, top-8 routing
Multimodal: text + image input

⚠️ FP8_BLOCK is incompatible with this MoE model due to expert dimension constraints. Only FP8_DYNAMIC is supported.

Usage with vLLM

bash
vllm serve lokeshe09/gemma-4-26B-A4B-it-FP8-Dynamic \
    --max-model-len 96000 \
    --gpu-memory-utilization 0.90

Quantization Script

python
from transformers import AutoModelForCausalLM, AutoTokenizer
from llmcompressor import oneshot
from llmcompressor.modifiers.quantization import QuantizationModifier

model = AutoModelForCausalLM.from_pretrained(
    "google/gemma-4-26B-A4B-it",
    device_map="cuda:0", torch_dtype="auto", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained("google/gemma-4-26B-A4B-it")

oneshot(
    model=model,
    recipe=QuantizationModifier(
        targets="Linear",
        scheme="FP8_DYNAMIC",
        ignore=["lm_head", "re:.*vision.*", "re:.*embed_tokens.*", "re:.*router.*"],
    ),
)
model.save_pretrained("gemma-4-26B-A4B-it-FP8-Dynamic")
tokenizer.save_pretrained("gemma-4-26B-A4B-it-FP8-Dynamic")

Quantized by

lokeshe09

gemma-4-26B-A4B-it-FP8-Dynamicc

Get help setting up a custom Dedicated Endpoints.

README

Quantization Details

Architecture

Usage with vLLM

Quantization Script

Quantized by

Explore FriendliAI today

gemma-4-26B-A4B-it-FP8-Dynamicc