Dedicated Endpoints
Run this model inference on single tenant GPU with unmatched speed and reliability at scale.
Get help setting up a custom Dedicated Endpoints.
Talk with our engineer to get a quote for reserved GPU instances with discounts.
README
License: apache-2.0Quantization Details
- Scheme:
FP8_DYNAMIC(per-channel static weights + per-token dynamic activations) - Format: F8_E4M3
- Tool: vllm-project/llm-compressor
- Data-free: No calibration dataset required
- Skipped layers:
lm_head, vision encoder, embed_tokens, MoE routers
Architecture
Gemma 4 26B-A4B is a Mixture-of-Experts model:
- 26B total parameters / 4B active per token
- 128 fine-grained experts, top-8 routing
- Multimodal: text + image input
⚠️
FP8_BLOCKis incompatible with this MoE model due to expert dimension constraints. OnlyFP8_DYNAMICis supported.
Usage with vLLM
bash
vllm serve lokeshe09/gemma-4-26B-A4B-it-FP8-Dynamic \--max-model-len 96000 \--gpu-memory-utilization 0.90
Quantization Script
python
from transformers import AutoModelForCausalLM, AutoTokenizerfrom llmcompressor import oneshotfrom llmcompressor.modifiers.quantization import QuantizationModifiermodel = AutoModelForCausalLM.from_pretrained("google/gemma-4-26B-A4B-it",device_map="cuda:0", torch_dtype="auto", trust_remote_code=True)tokenizer = AutoTokenizer.from_pretrained("google/gemma-4-26B-A4B-it")oneshot(model=model,recipe=QuantizationModifier(targets="Linear",scheme="FP8_DYNAMIC",ignore=["lm_head", "re:.*vision.*", "re:.*embed_tokens.*", "re:.*router.*"],),)model.save_pretrained("gemma-4-26B-A4B-it-FP8-Dynamic")tokenizer.save_pretrained("gemma-4-26B-A4B-it-FP8-Dynamic")
Quantized by
Model provider
lokeshe09
Model tree
Base
google/gemma-4-26B-A4B-it
Quantized
this model
Modalities
Input
Text, Image
Output
Text
Pricing
Dedicated Endpoints
View detailsSupported Functionality
Model APIs
Dedicated Endpoints
Container
More information