Quantization details

Property      Value
Method        W4A16 (int4 weight-only)
Strategy      Per-channel (group_size=-1)
Scheme        symmetric
Tool          llmcompressor
Target GPU    A100 (Ampere cc=80) via AllSpark kernel

Why per-channel instead of grouped?

On Ampere GPUs, vLLM's only working W4A16 kernel is AllSpark, and it accepts only group_size=-1. The grouped kernels (Marlin, Conch) require group_size=128 and all weight input dims divisible by 128. The MoE expert down_proj layers have input_size=2112, which is not a multiple of 128 (2112 = 16 × 128 + 64), so grouped quantization fails at serve time regardless of group size: any group size other than 128 is rejected by the kernels, and 128 does not divide 2112.
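The divisibility argument can be checked with a few lines of arithmetic. This is only a sketch of the constraint as described above: the helper name fits_grouped_kernel is hypothetical and is not vLLM's actual validation code.

```python
def fits_grouped_kernel(input_size: int, group_size: int = 128) -> bool:
    """Return True if input_size splits into whole groups of group_size.

    Hypothetical helper that mirrors the constraint described in this
    README; it is not taken from vLLM.
    """
    return input_size % group_size == 0


# Input size of the MoE expert down_proj layers, as stated above.
DOWN_PROJ_INPUT_SIZE = 2112

full_groups, remainder = divmod(DOWN_PROJ_INPUT_SIZE, 128)
print(full_groups, remainder)                     # 16 64
print(fits_grouped_kernel(DOWN_PROJ_INPUT_SIZE))  # False
```

The remainder of 64 is why group_size=128 cannot tile these layers, while per-channel quantization (group_size=-1) places no constraint on the input dimension.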

Ignored layers (kept in bf16)

  • lm_head
  • model.embed_vision.embedding_projection
  • All vision tower layers
  • All router projection layers
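As a rough illustration of how an ignore list like the one above might select layers, the sketch below emulates name matching in plain Python. The "re:" prefix follows the regex convention used in llmcompressor ignore lists, but the matcher is an emulation rather than the library's code, and the vision tower and router patterns and the example module names are assumptions, not names read from this checkpoint.

```python
import re

# Exact names come from the list above; the two regex patterns are
# assumed stand-ins for "all vision tower layers" and "all router
# projection layers".
IGNORE = [
    "lm_head",
    "model.embed_vision.embedding_projection",
    "re:.*vision_tower.*",
    "re:.*router.*",
]


def is_ignored(name: str, patterns=IGNORE) -> bool:
    """Return True if a module name matches any ignore entry."""
    for pattern in patterns:
        if pattern.startswith("re:"):
            # Regex entry: match against the full module name.
            if re.match(pattern[3:], name):
                return True
        elif name == pattern:
            # Plain entry: exact name match.
            return True
    return False


# Hypothetical module names, for illustration only.
example_names = [
    "lm_head",
    "model.vision_tower.encoder.layers.0.self_attn.q_proj",
    "model.language_model.layers.0.router.proj",
    "model.language_model.layers.0.experts.down_proj",
]

kept_in_bf16 = [n for n in example_names if is_ignored(n)]
quantized = [n for n in example_names if not is_ignored(n)]
print(quantized)  # ['model.language_model.layers.0.experts.down_proj']
```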

Usage

python

from transformers import AutoModelForCausalLM, AutoProcessor
import torch

model_id = "lokeshe09/gemma-4-26B-A4B-it-W4A16-A100"

# Load the quantized checkpoint, letting device_map="auto" place the weights.
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    device_map="auto",
    torch_dtype=torch.bfloat16,
    trust_remote_code=True,
)
# Load the matching processor from the same repository.
processor = AutoProcessor.from_pretrained(model_id, trust_remote_code=True)

Model provider

lokeshe09

Model tree

Base: google/gemma-4-26B-A4B-it
Quantized: this model

Modalities

Input: Text, Image
Output: Text
