88plug

Gemma4-E2B-it-W8A16

Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more

Get help setting up a custom Dedicated Endpoints.

Talk with our engineer to get a quote for reserved GPU instances with discounts.

README

License: apache-2.0

At a Glance

Table
PropertyValue
Base modelgoogle/gemma-4-e2b-it
ArchitectureSparse MoE, 128 experts, hybrid sliding+global attention + SigLIP vision
Quant formatcompressed-tensors (native vLLM)
Quant methodAutoRound W8A16 (RTN, datafree)
Quantizedlanguage_model.* transformer layers
Kept BF16vision_tower, multi_modal_projector, embed_tokens_per_layer (PLE)
Min GPU1× RTX 3080 10GB / RTX 4070

Quick Start

Tested with vLLM v0.21.0 (vllm/vllm-openai:v0.21.0-cu129-ubuntu2404). Weights are in compressed-tensors format — vLLM detects and loads quantization automatically. No --quantization flag needed.

vLLM

bash

docker run --gpus device=0 -p 8080:8080 \
vllm/vllm-openai:v0.21.0-cu129-ubuntu2404 vllm serve \
88plug/Gemma4-E2B-it-W8A16 \
--kv-cache-dtype fp8 \
--max-model-len 32768 \
--gpu-memory-utilization 0.90

Weights are in compressed-tensors format — no --quantization flag needed. Requires vLLM ≥ v0.21.0.

SGLang

bash

docker run --gpus device=0 -p 30000:30000 \
lmsysorg/sglang:v0.5.8-cu129 python -m sglang.launch_server \
--model-path google/gemma-4-e2b-it \
--tp 1 \
--mem-fraction-static 0.85 \
--port 30000

llama.cpp

Fits entirely on an 8 GB GPU with Q4 quantization. VLM requires mmproj GGUF for image input.

bash

python convert_hf_to_gguf.py google/gemma-4-e2b-it \
--outfile Gemma4-E2B-BF16.gguf
python convert_hf_to_gguf.py google/gemma-4-e2b-it \
--mmproj --outfile Gemma4-E2B-mmproj.gguf
llama-quantize Gemma4-E2B-BF16.gguf Gemma4-E2B-Q8_0.gguf Q8_0
llama-quantize --imatrix calibration_datav3.txt \
Gemma4-E2B-BF16.gguf Gemma4-E2B-IQ4_XS.gguf IQ4_XS
llama-server \
--model Gemma4-E2B-Q8_0.gguf \
--mmproj Gemma4-E2B-mmproj.gguf \
--n-gpu-layers 999 \
--ctx-size 32768 \
--port 8081

Benchmarks

Results pending.

Table
EngineFormatBatchctxtok/sTTFT p50TTFT p99VRAM
vLLM v0.21.0W8A16132k
vLLM v0.21.0W8A16832k
SGLang v0.5.8BF16 (baseline)132k
llama.cpp b9297Q8_0 GGUF132k
llama.cpp b9297IQ4_XS GGUF132k

Hardware: A6000 48 GB, CUDA 12.9, driver 570.


Quality Targets

Table
MetricTarget
KL divergence vs BF16< 0.005
MMLU recovery≥ 99.7%

vs. Other Gemma4-E2B Quants

This is the first compressed-tensors W8A16 checkpoint for Gemma4-E2B. At ~2.5 GB it is the smallest vLLM-native multimodal checkpoint that fits on consumer 8 GB GPUs.

Table
QuantMethodSizeGPU CompatibilityNotes
88plug W8A16 (this)compressed-tensors RTN W8A16~2.5 GBAny Ampere+ ≥8 GBFirst W8A16; native vLLM; vision+text
BF16 baselineNone~4.5 GB1× RTX 3080 10GBReference
Community GGUF Q4_K_Mllama.cpp GGUF~2.5 GBCPU / any GPUVision requires mmproj GGUF
Community GGUF Q8_0llama.cpp GGUF~4.5 GBAny GPU ≥6 GBNear-lossless; vision requires mmproj

Limitations

  • Vision tower excluded: SigLIP vision encoder stays BF16 — RTN INT8 not applied to vision components.
  • PLE layers excluded: embed_tokens_per_layer and per_layer_model_projection (Per-Layer Embeddings) kept at BF16 to prevent catastrophic quality loss.
  • RTN (data-free) quantization: No calibration corpus used. W8A16 RTN is near-lossless but has not been AutoRound-calibrated.
  • Benchmark results pending: Throughput and quality benchmarks will be added post-publication.

Citation

bibtex

@misc{gemma4report,
title = {Gemma 4 Technical Report},
author = {Google DeepMind},
year = {2025},
url = {https://huggingface.co/google/gemma-4-e2b-it}
}

About

88plug AI Lab produces production-grade compressed-tensors quantizations of frontier LLMs, VLMs, and omni models — built for native vLLM v0.21.0+ deployment with zero extra flags.

W8A16 — INT8 weights + BF16 activations. Near-lossless on any Ampere+ GPU. Runs where FP8 hardware cannot.

W4A16 — AutoRound with iters=200 and a mixed calibration corpus. Targets ≥ 99% MMLU recovery — the quality bar that makes W4A16 viable for production.

All weights are in compressed-tensors format. vLLM detects quantization automatically from quantization_config in config.json. No --quantization flag required.

Also available: Gemma4-E2B-it-W4A16 (INT4, ~6 GB) · Gemma4-E2B-it-W8A16 (INT8, ~7 GB)

Browse all releases → huggingface.co/88plug

Model provider

88plug

Model tree

Base

this model

Modalities

Input

Text, Image

Output

Text

Pricing

Dedicated Endpoints

View details

Supported Functionality

Model APIs

Dedicated Endpoints

Container

More information

Explore FriendliAI today