88plug

Gemma4-E2B-it-W8A16-NeuralMax

Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more

Get help setting up a custom Dedicated Endpoints.

Talk with our engineer to get a quote for reserved GPU instances with discounts.

README

License: apache-2.0

At a Glance

Table
PropertyValue
Base modelgoogle/gemma-4-e2b-it
ArchitectureSparse MoE, 128 experts, hybrid sliding+global attention + SigLIP vision
Quant formatcompressed-tensors (native vLLM)
Quant methodAutoRound W8A16 (RTN, datafree)
Quantizedlanguage_model.* transformer layers
Kept BF16vision_tower, multi_modal_projector, embed_tokens_per_layer (PLE)
Min GPU1× RTX 3080 10GB / RTX 4070

Quick Start

vLLM

bash

docker run --gpus device=0 -p 8080:8080 \
vllm/vllm-openai:v0.21.0-cu129-ubuntu2404 vllm serve \
88plug/Gemma4-E2B-W8A16 \
--kv-cache-dtype fp8 \
--max-model-len 32768 \
--gpu-memory-utilization 0.90

Weights are in compressed-tensors format — no --quantization flag needed.

SGLang

bash

docker run --gpus device=0 -p 30000:30000 \
lmsysorg/sglang:v0.5.8-cu129 python -m sglang.launch_server \
--model-path google/gemma-4-e2b-it \
--tp 1 \
--mem-fraction-static 0.85 \
--port 30000

llama.cpp

Fits entirely on an 8 GB GPU with Q4 quantization. VLM requires mmproj GGUF for image input.

bash

python convert_hf_to_gguf.py google/gemma-4-e2b-it \
--outfile Gemma4-E2B-BF16.gguf
python convert_hf_to_gguf.py google/gemma-4-e2b-it \
--mmproj --outfile Gemma4-E2B-mmproj.gguf
llama-quantize Gemma4-E2B-BF16.gguf Gemma4-E2B-Q8_0.gguf Q8_0
llama-quantize --imatrix calibration_datav3.txt \
Gemma4-E2B-BF16.gguf Gemma4-E2B-IQ4_XS.gguf IQ4_XS
llama-server \
--model Gemma4-E2B-Q8_0.gguf \
--mmproj Gemma4-E2B-mmproj.gguf \
--n-gpu-layers 999 \
--ctx-size 32768 \
--port 8081

Benchmarks

Results pending.

Table
EngineFormatBatchctxtok/sTTFT p50TTFT p99VRAM
vLLM v0.21.0W8A16132k
vLLM v0.21.0W8A16832k
SGLang v0.5.8BF16 (baseline)132k
llama.cpp b9297Q8_0 GGUF132k
llama.cpp b9297IQ4_XS GGUF132k

Hardware: A6000 48 GB, CUDA 12.9, driver 570.


Quality Targets

Table
MetricTarget
KL divergence vs BF16< 0.005
MMLU recovery≥ 99.7%

Citation

bibtex

@misc{gemma4report,
title = {Gemma 4 Technical Report},
author = {Google DeepMind},
year = {2025},
url = {https://huggingface.co/google/gemma-4-e2b-it}
}

About

Produced by 88plug AI Lab — zero-loss quantizations of frontier omni and voice models.

Model provider

88plug

Model tree

Base

this model

Modalities

Input

Text, Image

Output

Text

Pricing

Dedicated Endpoints

View details

Supported Functionality

Model APIs

Dedicated Endpoints

Container

More information

Explore FriendliAI today