Quantization details
| Property | Value |
|---|---|
| Method | W4A16 (int4 weight-only) |
| Strategy | Per-channel (group_size=-1) |
| Scheme | symmetric |
| Tool | llmcompressor |
| Target GPU | A100 (Ampere cc=80) via AllSpark kernel |
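To make the table concrete, here is a minimal pure-Python sketch of symmetric per-channel int4 quantization: each output channel (weight row) gets a single scale and no zero point, which is what group_size=-1 means. The function names and sample weights are illustrative assumptions, and the exact scale formula llmcompressor applies may differ from the `max_abs / 7` used here.

```python
def quantize_row_int4(row):
    """Symmetric int4 quantization of one output channel (one weight row)."""
    # Per-channel: the whole row shares one scale (group_size=-1).
    max_abs = max(abs(w) for w in row)
    # Symmetric: no zero point; the largest magnitude maps to 7.
    scale = max_abs / 7.0 if max_abs > 0 else 1.0
    # Round to the nearest integer and clamp to the signed int4 range [-8, 7].
    q = [max(-8, min(7, round(w / scale))) for w in row]
    return q, scale


def dequantize_row(q, scale):
    """Recover approximate weights; activations stay in 16-bit (the A16 part)."""
    return [v * scale for v in q]


# Hypothetical 2x4 weight matrix, one row per output channel.
weights = [
    [0.70, -0.35, 0.10, 0.00],
    [-2.80, 1.40, 0.40, 2.00],
]
quantized = [quantize_row_int4(row) for row in weights]
restored = [dequantize_row(q, s) for q, s in quantized]
```

Each restored weight differs from the original by at most half of its row's scale, so rows with a large outlier lose more precision than they would under grouped quantization.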
Why per-channel instead of grouped?
On Ampere GPUs, vLLM's only working W4A16 kernel is AllSpark, which only
accepts group_size=-1. Grouped kernels (Marlin, Conch) require
group_size=128 and all weight input dims divisible by 128. The MoE
expert down_proj layers have input_size=2112, which is not divisible by
128 (2112 = 16 × 128 + 64), so grouped quantization fails at serve time
regardless of group size.
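The divisibility constraint can be checked with a few lines of Python. This is only a sketch of the arithmetic; `grouped_quantization_compatible` is a hypothetical helper, not a vLLM or llmcompressor function.

```python
def grouped_quantization_compatible(input_size, group_size=128):
    """Return True if a weight's input dimension splits evenly into groups."""
    return input_size % group_size == 0


down_proj_input_size = 2112  # MoE expert down_proj input size from the text above
remainder = down_proj_input_size % 128  # 2112 = 16 * 128 + 64, so 64 is left over
compatible = grouped_quantization_compatible(down_proj_input_size)  # False
```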
Ignored layers (kept in bf16)
- lm_head
- model.embed_vision.embedding_projection
- All vision tower layers
- All router projection layers
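As a rough illustration of how an ignore list like this one selects layers, the sketch below matches module names against exact names and `re:`-prefixed regular expressions. The patterns and module names are hypothetical; the actual recipe used for this model is not shown in the README, and the `re:` prefix convention is assumed here rather than taken from it.

```python
import re

# Hypothetical patterns mirroring the ignore list above; "re:" marks a regex.
IGNORE = [
    "lm_head",
    "model.embed_vision.embedding_projection",
    "re:.*vision_tower.*",
    "re:.*router.*",
]


def is_ignored(module_name, patterns=IGNORE):
    """Return True if the module should stay in bf16 (skipped by quantization)."""
    for pattern in patterns:
        if pattern.startswith("re:"):
            # Regex pattern: match against the full module name from its start.
            if re.match(pattern[3:], module_name):
                return True
        elif module_name == pattern:
            # Plain pattern: exact name match only.
            return True
    return False


# Hypothetical module names, for illustration only.
names = [
    "lm_head",
    "model.vision_tower.encoder.layers.0.self_attn.q_proj",
    "model.language_model.layers.0.router.proj",
    "model.language_model.layers.0.experts.down_proj",
]
kept_in_bf16 = [n for n in names if is_ignored(n)]
```

With these sample names, only the expert `down_proj` entry would be quantized; the other three match the ignore list and stay in bf16.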
Usage
```python
from transformers import AutoModelForCausalLM, AutoProcessor
import torch

model = AutoModelForCausalLM.from_pretrained(
    "lokeshe09/gemma-4-26B-A4B-it-W4A16-A100",
    device_map="auto",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
processor = AutoProcessor.from_pretrained(
    "lokeshe09/gemma-4-26B-A4B-it-W4A16-A100", trust_remote_code=True
)
```
- Model provider: lokeshe09
- Base model: google/gemma-4-26B-A4B-it (this model is the quantized version)
- Input modalities: Text, Image
- Output modalities: Text