lokeshe09/gemma-4-26B-A4B-it-W4A16-A100 API & Inference Endpoint

Quantization details

Property	Value
Method	W4A16 (int4 weight-only)
Strategy	Per-channel (`group_size=-1`)
Scheme	symmetric
Tool	llmcompressor
Target GPU	A100 (Ampere cc=80) via AllSpark kernel
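
To make the "per-channel, symmetric, int4" rows concrete, here is a minimal pure-Python sketch of that quantization style: each output channel (row) gets one scale and no zero point. The function names and the choice to map the largest magnitude to 7 are illustrative assumptions; llmcompressor's actual scale computation and weight packing may differ.

```python
def quantize_row_int4(row):
    """Symmetric int4 quantization of one output channel (one scale, no zero point)."""
    max_abs = max(abs(x) for x in row)
    # Assumed convention: the largest magnitude maps to 7; guard all-zero rows.
    scale = max_abs / 7 if max_abs > 0 else 1.0
    # Round to the nearest integer and clamp to the signed 4-bit range [-8, 7].
    q = [max(-8, min(7, round(x / scale))) for x in row]
    return q, scale


def quantize_per_channel(weight):
    """Apply quantize_row_int4 to every row of a 2-D weight (list of lists)."""
    q_rows, scales = [], []
    for row in weight:
        q, scale = quantize_row_int4(row)
        q_rows.append(q)
        scales.append(scale)
    return q_rows, scales


def dequantize(q_rows, scales):
    """Reconstruct approximate weights: w_hat = q * scale, row by row."""
    return [[v * s for v in row] for row, s in zip(q_rows, scales)]


# Toy 2x4 weight matrix (made-up values, not taken from the model).
weight = [[0.7, -0.33, 0.12, 0.0],
          [14.0, -2.0, 6.0, 1.2]]
q_rows, scales = quantize_per_channel(weight)
w_hat = dequantize(q_rows, scales)
```

With one scale per row, the reconstruction error of each element is at most half that row's scale, which is why a single outlier in a channel coarsens the whole channel.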

Why per-channel instead of grouped?

On Ampere GPUs, vLLM's only working W4A16 kernel is AllSpark, which accepts only group_size=-1. The grouped kernels (Marlin, Conch) require group_size=128 and every weight input dimension to be divisible by 128. The MoE expert down_proj layers have input_size=2112, which is not divisible by 128, so grouped quantization fails at serve time regardless of group size.
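
The divisibility argument can be checked with a line of arithmetic. The sketch below only verifies the numbers quoted above; `fits_grouped_kernel` is a hypothetical helper, not a vLLM function.

```python
def fits_grouped_kernel(input_size: int, group_size: int = 128) -> bool:
    """Hypothetical check: the input dimension must split into whole groups."""
    return input_size % group_size == 0


# input_size of the MoE expert down_proj layers, as stated in this card.
down_proj_input_size = 2112

# 2112 = 16 * 128 + 64, so the last group would be only half full.
full_groups, remainder = divmod(down_proj_input_size, 128)
grouped_ok = fits_grouped_kernel(down_proj_input_size)
```

Here `remainder` is 64 and `grouped_ok` is `False`, which matches the serve-time failure described above.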

Ignored layers (kept in bf16)

lm_head
model.embed_vision.embedding_projection
All vision tower layers
All router projection layers
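
As a rough illustration of how an ignore list like this partitions a model, the sketch below filters module names with regular expressions. Only `lm_head` and `model.embed_vision.embedding_projection` come from this card; the vision tower, router, and attention module names, and the patterns that match them, are hypothetical and may not match the real checkpoint or the recipe actually used.

```python
import re

# Patterns for layers kept in bf16. The last two are assumed naming schemes.
IGNORE_PATTERNS = [
    r"lm_head",
    r"model\.embed_vision\.embedding_projection",
    r".*vision_tower.*",  # assumption: vision tower modules contain "vision_tower"
    r".*router.*",        # assumption: router projections contain "router"
]


def is_ignored(module_name: str) -> bool:
    """Return True if the full module name matches any ignore pattern."""
    return any(re.fullmatch(p, module_name) for p in IGNORE_PATTERNS)


# Example module names; all but the first two are hypothetical.
module_names = [
    "lm_head",
    "model.embed_vision.embedding_projection",
    "model.vision_tower.encoder.layers.0.self_attn.q_proj",
    "model.language_model.layers.0.router.proj",
    "model.language_model.layers.0.self_attn.q_proj",
]

# Everything not ignored would be quantized to int4.
quantized = [name for name in module_names if not is_ignored(name)]
```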

Usage

python
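from transformers import AutoModelForCausalLM, AutoProcessor
import torch

MODEL_ID = "lokeshe09/gemma-4-26B-A4B-it-W4A16-A100"

# Load the quantized checkpoint; device_map="auto" places layers on the available GPU(s).
model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID,
    device_map="auto",
    torch_dtype=torch.bfloat16,  # dtype for the layers that were not quantized
    trust_remote_code=True,
)
# The processor handles tokenization and image preprocessing for this model.
processor = AutoProcessor.from_pretrained(MODEL_ID, trust_remote_code=True)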
