88plug

Gemma4-E4B-it-W4A16

At a Glance

Property	Value
Base model	`google/gemma-4-e4b-it`
Architecture	Sparse MoE, 128 experts, hybrid sliding+global attention + SigLIP vision
Quant method	data-free RTN (QuantizationModifier; AutoRound blocked)
Quant scheme	W4A16 (4-bit weights, 16-bit activations)
Quant format	compressed-tensors (native vLLM)
Quantized	`language_model.*` — all `Linear` layers (attn + MLP)
Kept BF16	`vision_tower`, `audio_tower`, `multi_modal_projector`, `embed_tokens_per_layer` (PLE), `per_layer_model_projection` (PLE), `lm_head`, norms, embeddings
Disk size	~14 GB
Min GPU	1× RTX 3090 24GB
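
To make the "data-free RTN" and "W4A16" entries above concrete, here is a minimal sketch of symmetric round-to-nearest 4-bit quantization in plain Python. The group size, the sample weights, and the function names are illustrative assumptions; this is not the recipe code used for this release, which relies on QuantizationModifier.

```python
def rtn_quantize_int4(weights, group_size=4):
    """Symmetric round-to-nearest INT4 quantization, one scale per group.

    group_size is an illustrative assumption, not this release's setting.
    """
    q_values, scales = [], []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        max_abs = max(abs(w) for w in group)
        # Map the largest magnitude to 7, inside the signed 4-bit range [-8, 7].
        scale = max_abs / 7 if max_abs > 0 else 1.0
        scales.append(scale)
        q_values.extend(max(-8, min(7, round(w / scale))) for w in group)
    return q_values, scales


def dequantize(q_values, scales, group_size=4):
    """Reconstruct approximate weights from INT4 codes and per-group scales."""
    return [q * scales[i // group_size] for i, q in enumerate(q_values)]


# Hypothetical weights, for illustration only.
weights = [0.12, -0.70, 0.33, 0.05, 1.40, -0.20, 0.00, 0.90]
codes, scales = rtn_quantize_int4(weights)
restored = dequantize(codes, scales)
print(codes)
print([round(w, 3) for w in restored])
```

No calibration data is involved: each scale comes directly from the weights, which is what "data-free" means here. Activations are left in 16-bit, so only the stored weights shrink.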

PLE layers kept at BF16

embed_tokens_per_layer and per_layer_model_projection implement Per-Layer Embeddings (PLE). Ablations show catastrophic output degradation when they are quantized, so both are always excluded.

Memory Requirements

Configuration	BF16	This Quant (W4A16)
Weights (disk/VRAM)	~28 GB	~14 GB
KV cache @ 32k ctx (fp8)	~2.0 GB	~2.0 GB
Total @ 32k ctx	~30 GB	~16 GB
Minimum GPU	A100 40GB	1× RTX 3090 24GB

The 4B active parameters (MoE) keep activation memory low. The full 26B+ parameter count still requires significant weight VRAM, and W4A16 halves that requirement relative to BF16 (~28 GB down to ~14 GB in the table above).
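
The totals in the table are weights plus KV cache. The sketch below redoes that arithmetic and checks it against a GPU budget, assuming the 0.90 memory-utilization fraction used in the Quick Start command. It ignores activation memory and runtime overhead, so treat the result as a lower bound rather than a guarantee.

```python
# Figures taken from the Memory Requirements table, in GB.
WEIGHTS_GB = {"BF16": 28.0, "W4A16": 14.0}
KV_CACHE_GB = 2.0  # fp8 KV cache at 32k context


def total_gb(fmt):
    """Weights plus KV cache for the given format."""
    return WEIGHTS_GB[fmt] + KV_CACHE_GB


def fits(fmt, gpu_gb, utilization=0.90):
    """True if the estimate fits in the fraction of VRAM the server may use."""
    return total_gb(fmt) <= gpu_gb * utilization


for fmt in WEIGHTS_GB:
    print(fmt, total_gb(fmt), "GB; fits a 24 GB card:", fits(fmt, 24))
```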

Quick Start

Tested with vLLM v0.21.0 (image vllm/vllm-openai:v0.21.0-cu129-ubuntu2404). The weights are in compressed-tensors format, so vLLM detects and loads the quantization automatically; no --quantization flag is needed.

vLLM

bash
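# Serve on port 8080 inside the container so the -p 8080:8080 mapping reaches it
# (vLLM listens on 8000 by default). The image's entrypoint already starts the
# OpenAI-compatible server, so only its arguments are passed here.
docker run --gpus device=0 -p 8080:8080 \
  vllm/vllm-openai:v0.21.0-cu129-ubuntu2404 \
  --model 88plug/Gemma4-E4B-it-W4A16 \
  --port 8080 \
  --kv-cache-dtype fp8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90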

Python client

python
from openai import OpenAI

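# The OpenAI client requires a non-empty key; vLLM ignores it unless the
# server was started with --api-key.
client = OpenAI(base_url="http://localhost:8080/v1", api_key="x")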

response = client.chat.completions.create(
    model="88plug/Gemma4-E4B-it-W4A16",
    messages=[{"role": "user", "content": "Explain sparse mixture-of-experts in two sentences."}],
    max_tokens=256,
)
print(response.choices[0].message.content)
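
Since the model also takes image input, a request can carry an image alongside the text. The sketch below only builds the request arguments, assuming the server accepts OpenAI-style image_url content parts. The helper name and the example URL are hypothetical, and image support for this particular quant is not verified here.

```python
def build_image_request(prompt, image_url, model="88plug/Gemma4-E4B-it-W4A16"):
    """Build keyword arguments for client.chat.completions.create()."""
    return {
        "model": model,
        "messages": [{
            "role": "user",
            "content": [
                {"type": "text", "text": prompt},
                {"type": "image_url", "image_url": {"url": image_url}},
            ],
        }],
        "max_tokens": 256,
    }


# Hypothetical image URL, for illustration only.
request = build_image_request("Describe this image.", "https://example.com/cat.png")
# With the server running: response = client.chat.completions.create(**request)
print(request["messages"][0]["content"][1]["type"])
```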

Quantization Design

The recipe targets all Linear modules in the LLM backbone with W4A16 (4-bit symmetric weight quantization, activations remain BF16). The following are excluded and kept at BF16:

Excluded pattern	Reason
`lm_head`	Output projection — quality-sensitive
`.*embed_tokens$`	Token embeddings
`.*norm$`	Layer norms
`.*embed_tokens_per_layer.*`	PLE: per-layer token embeddings; catastrophic if quantized
`.*per_layer_model_projection.*`	PLE: projection into hidden dim; catastrophic if quantized

All self_attn.{q,k,v,o}_proj and mlp.{gate,up,down}_proj layers across all transformer blocks are quantized to W4A16.
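
As a rough illustration of how exclusion patterns like these separate quantized from BF16 modules, the sketch below matches them against a few module names. The module names are hypothetical, the `.*` wrappers on the two PLE patterns are an assumption about the intended regex, and matching with `re.match` is an assumption about how the quantization tool applies them.

```python
import re

# Exclusion patterns from the table; the ".*" wrappers on the two PLE
# patterns are an assumption about the intended regex.
IGNORE_PATTERNS = [
    r"lm_head",
    r".*embed_tokens$",
    r".*norm$",
    r".*embed_tokens_per_layer.*",
    r".*per_layer_model_projection.*",
]


def is_excluded(module_name):
    """True if the module matches an exclusion pattern and stays in BF16."""
    return any(re.match(pattern, module_name) for pattern in IGNORE_PATTERNS)


# Hypothetical module names, for illustration only.
names = [
    "lm_head",
    "model.language_model.embed_tokens",
    "model.language_model.embed_tokens_per_layer",
    "model.language_model.per_layer_model_projection",
    "model.language_model.layers.0.input_layernorm",
    "model.language_model.layers.0.self_attn.q_proj",
    "model.language_model.layers.0.mlp.down_proj",
]
for name in names:
    print(name, "->", "BF16" if is_excluded(name) else "W4A16")
```

Note that `.*embed_tokens$` alone would not catch `embed_tokens_per_layer`, because of the trailing `$`, which is why the PLE modules need their own patterns.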

Calibration: 1024 samples — 512 from HuggingFaceH4/ultrachat_200k (chat) + 512 from wikitext-103-raw-v1 (text), max sequence length 2048.

Competitor Comparables

Model	Source	Format	Compare angle
`google/gemma-4-e4b-it`	official	BF16	quality ceiling
`RedHatAI/gemma-3n-E4B-it-quantized.w4a16`	RedHatAI	compressed-tensors W4A16	same format, prior generation
`88plug/Gemma4-E4B-it-W8A16`	88plug	compressed-tensors W8A16	higher precision variant

First-to-market note: no compressed-tensors W4A16 quant of gemma-4-e4b-it was found at release time, which would make this the first vLLM-native W4A16 for Gemma4 E4B.

Benchmarks

Results pending.

Engine	Format	Batch	ctx	tok/s	TTFT p50	TTFT p99	VRAM
vLLM v0.21.0	W4A16 compressed-tensors	1	32k	pending	pending	pending	pending
vLLM v0.21.0	W4A16 compressed-tensors	8	32k	pending	pending	pending	pending

Hardware: A6000 48 GB, CUDA 12.9, driver 570.
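
For readers who want to fill in the table columns themselves, the sketch below shows one way to derive TTFT p50/p99 and tok/s from raw timings. The nearest-rank percentile definition is an assumption (the source does not say which method will be used), and the sample values are made up, not measurements.

```python
import math


def percentile(samples, p):
    """Nearest-rank percentile: smallest sample with at least p% at or below it."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]


def tokens_per_second(generated_tokens, elapsed_seconds):
    """Decode throughput for one request or one batch."""
    return generated_tokens / elapsed_seconds


# Hypothetical time-to-first-token samples in milliseconds, not measurements.
ttft_ms = [41, 39, 45, 40, 120, 42, 38, 44, 43, 41]
print("TTFT p50:", percentile(ttft_ms, 50))
print("TTFT p99:", percentile(ttft_ms, 99))
print("tok/s:", tokens_per_second(256, 3.2))
```

With few samples, p99 is simply the slowest request, so a meaningful p99 needs a few hundred requests or more.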

Quality Targets

Metric	Target
KL divergence vs BF16	< 0.014
MMLU recovery	≥ 99%
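
The two targets can be checked with a few lines of arithmetic. The sketch below computes KL divergence between the BF16 and quantized next-token distributions at a single position, and MMLU recovery as a percentage of the BF16 score. The logits and scores are hypothetical, and how the KL figure is aggregated across positions (for example, a mean over tokens) is an assumption the source does not specify.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over a list of logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kl_divergence(ref_logits, quant_logits):
    """KL(P_ref || P_quant) in nats for one token position."""
    log_p = log_softmax(ref_logits)
    log_q = log_softmax(quant_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def mmlu_recovery(quant_score, ref_score):
    """Quantized score as a percentage of the BF16 reference score."""
    return 100.0 * quant_score / ref_score


# Hypothetical logits and scores, for illustration only.
ref = [2.0, 1.0, 0.1]
quant = [1.9, 1.1, 0.1]
kl = kl_divergence(ref, quant)
print("KL:", kl, "meets target:", kl < 0.014)
print("MMLU recovery meets target:", mmlu_recovery(69.5, 70.0) >= 99.0)
```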

SGLang Note

SGLang does not natively support compressed-tensors weights. To use SGLang, run the BF16 base model (google/gemma-4-e4b-it) directly:

bash
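# --host 0.0.0.0 makes the server reachable through the published port;
# by default it binds to 127.0.0.1 inside the container.
docker run --gpus device=0 -p 30000:30000 \
  lmsysorg/sglang:v0.5.8-cu129 python3 -m sglang.launch_server \
  --model-path google/gemma-4-e4b-it \
  --host 0.0.0.0 \
  --tp 1 \
  --mem-fraction-static 0.85 \
  --port 30000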

Any SGLang benchmark results therefore reflect BF16 baseline throughput, not this quant. The Benchmarks table currently lists vLLM only.

llama.cpp / GGUF

Convert from the BF16 base checkpoint, not from the compressed-tensors weights. Image input requires a separate mmproj GGUF.

bash
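# convert_hf_to_gguf.py reads a local checkpoint directory, so download first
# (newer huggingface_hub releases also provide `hf download`).
huggingface-cli download google/gemma-4-e4b-it --local-dir gemma-4-e4b-it

# Text model, written as BF16.
python convert_hf_to_gguf.py gemma-4-e4b-it \
  --outtype bf16 --outfile Gemma4-E4B-BF16.gguf
# Vision projector (mmproj) for image input.
python convert_hf_to_gguf.py gemma-4-e4b-it \
  --mmproj --outfile Gemma4-E4B-mmproj.gguf

llama-quantize Gemma4-E4B-BF16.gguf Gemma4-E4B-Q8_0.gguf Q8_0

# --imatrix expects an importance-matrix file, so build one from the
# calibration text before quantizing to IQ4_XS.
llama-imatrix -m Gemma4-E4B-BF16.gguf -f calibration_datav3.txt -o imatrix.dat
llama-quantize --imatrix imatrix.dat \
  Gemma4-E4B-BF16.gguf Gemma4-E4B-IQ4_XS.gguf IQ4_XS

llama-server \
  --model Gemma4-E4B-Q8_0.gguf \
  --mmproj Gemma4-E4B-mmproj.gguf \
  --n-gpu-layers 999 \
  --ctx-size 32768 \
  --port 8081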

Citation

bibtex
@misc{gemma4report,
  title  = {Gemma 4 Technical Report},
  author = {Google DeepMind},
  year   = {2025},
  url    = {https://huggingface.co/google/gemma-4-e4b-it}
}

About

88plug AI Lab produces production-grade compressed-tensors quantizations of frontier LLMs, VLMs, and omni models — built for native vLLM v0.21.0+ deployment with zero extra flags.

W8A16 — INT8 weights + BF16 activations. Near-lossless on any Ampere+ GPU. Runs where FP8 hardware cannot.

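W4A16: the lab's default recipe is AutoRound with iters=200 and a mixed calibration corpus, targeting ≥ 99% MMLU recovery, the quality bar that makes W4A16 viable for production. This release uses data-free RTN instead because AutoRound was blocked (see At a Glance).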

All weights are in compressed-tensors format. vLLM detects quantization automatically from quantization_config in config.json. No --quantization flag required.
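
To see what automatic detection relies on, the sketch below reads the quantization_config block from a config.json string. The sample JSON is a hypothetical, heavily abbreviated config, and the quant_method and format field names are assumptions about what a typical compressed-tensors checkpoint declares. This only illustrates the idea and is not vLLM's loading code.

```python
import json


def detect_quantization(config_text):
    """Return the quant_method declared in a config.json string, or None."""
    config = json.loads(config_text)
    quant_cfg = config.get("quantization_config")
    return quant_cfg.get("quant_method") if quant_cfg else None


# Hypothetical, heavily abbreviated config.json contents.
sample = json.dumps({
    "model_type": "example",
    "quantization_config": {
        "quant_method": "compressed-tensors",
        "format": "pack-quantized",
    },
})
print(detect_quantization(sample))
print(detect_quantization("{}"))
```

Running the same check on a downloaded checkpoint's config.json is a quick way to confirm the quantization metadata is present before serving.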

Also available: Gemma4-E4B-it-W8A16 (INT8, ~5 GB) · Gemma4-E4B-it-W4A16 (INT4, ~14 GB)

Browse all releases → huggingface.co/88plug

Modalities: Text and Image input; Text output.
