Gemma4-E2B-it-W8A16-NeuralMax API & Inference Endpoint

At a Glance

Table with columns: Property, Value
Property	Value
Base model	`google/gemma-4-e2b-it`
Architecture	Sparse MoE, 128 experts, hybrid sliding+global attention + SigLIP vision
Quant format	compressed-tensors (native vLLM)
Quant method	AutoRound W8A16 (RTN, datafree)
Quantized	`language_model.*` transformer layers
Kept BF16	vision_tower, multi_modal_projector, embed_tokens_per_layer (PLE)
Min GPU	1× RTX 3080 10GB / RTX 4070

Quick Start

vLLM

bash
docker run --gpus device=0 -p 8080:8080 \
  vllm/vllm-openai:v0.21.0-cu129-ubuntu2404 vllm serve \
  88plug/Gemma4-E2B-W8A16 \
  --kv-cache-dtype fp8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90

Weights are in compressed-tensors format — no --quantization flag needed.

SGLang

bash
docker run --gpus device=0 -p 30000:30000 \
  lmsysorg/sglang:v0.5.8-cu129 python -m sglang.launch_server \
  --model-path google/gemma-4-e2b-it \
  --tp 1 \
  --mem-fraction-static 0.85 \
  --port 30000

llama.cpp

Fits entirely on an 8 GB GPU with Q4 quantization. VLM requires mmproj GGUF for image input.

bash
python convert_hf_to_gguf.py google/gemma-4-e2b-it \
  --outfile Gemma4-E2B-BF16.gguf
python convert_hf_to_gguf.py google/gemma-4-e2b-it \
  --mmproj --outfile Gemma4-E2B-mmproj.gguf

llama-quantize Gemma4-E2B-BF16.gguf Gemma4-E2B-Q8_0.gguf Q8_0
llama-quantize --imatrix calibration_datav3.txt \
  Gemma4-E2B-BF16.gguf Gemma4-E2B-IQ4_XS.gguf IQ4_XS

llama-server \
  --model Gemma4-E2B-Q8_0.gguf \
  --mmproj Gemma4-E2B-mmproj.gguf \
  --n-gpu-layers 999 \
  --ctx-size 32768 \
  --port 8081

Benchmarks

Results pending.

Table with columns: Engine, Format, Batch, ctx, tok/s, TTFT p50, TTFT p99, VRAM
Engine	Format	Batch	ctx	tok/s	TTFT p50	TTFT p99	VRAM
vLLM v0.21.0	W8A16	1	32k	—	—	—	—
vLLM v0.21.0	W8A16	8	32k	—

Hardware: A6000 48 GB, CUDA 12.9, driver 570.

Quality Targets

Table with columns: Metric, Target
Metric	Target
KL divergence vs BF16	< 0.005
MMLU recovery	≥ 99.7%

Citation

bibtex
@misc{gemma4report,
  title  = {Gemma 4 Technical Report},
  author = {Google DeepMind},
  year   = {2025},
  url    = {https://huggingface.co/google/gemma-4-e2b-it}
}

About

Produced by 88plug AI Lab — zero-loss quantizations of frontier omni and voice models.

Property

Value

Base model

google/gemma-4-e2b-it

Architecture

Sparse MoE, 128 experts, hybrid sliding+global attention + SigLIP vision

Quant format

compressed-tensors (native vLLM)

Quant method

AutoRound W8A16 (RTN, datafree)

Quantized

language_model.* transformer layers

Kept BF16

vision_tower, multi_modal_projector, embed_tokens_per_layer (PLE)

Min GPU

1× RTX 3080 10GB / RTX 4070

bash

docker run --gpus device=0 -p 8080:8080 \
  vllm/vllm-openai:v0.21.0-cu129-ubuntu2404 vllm serve \
  88plug/Gemma4-E2B-W8A16 \
  --kv-cache-dtype fp8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90

bash

docker run --gpus device=0 -p 30000:30000 \
  lmsysorg/sglang:v0.5.8-cu129 python -m sglang.launch_server \
  --model-path google/gemma-4-e2b-it \
  --tp 1 \
  --mem-fraction-static 0.85 \
  --port 30000

bash

python convert_hf_to_gguf.py google/gemma-4-e2b-it \
  --outfile Gemma4-E2B-BF16.gguf
python convert_hf_to_gguf.py google/gemma-4-e2b-it \
  --mmproj --outfile Gemma4-E2B-mmproj.gguf

llama-quantize Gemma4-E2B-BF16.gguf Gemma4-E2B-Q8_0.gguf Q8_0
llama-quantize --imatrix calibration_datav3.txt \
  Gemma4-E2B-BF16.gguf Gemma4-E2B-IQ4_XS.gguf IQ4_XS

llama-server \
  --model Gemma4-E2B-Q8_0.gguf \
  --mmproj Gemma4-E2B-mmproj.gguf \
  --n-gpu-layers 999 \
  --ctx-size 32768 \
  --port 8081

Engine

Format

Batch

ctx

tok/s

TTFT p50

TTFT p99

VRAM

vLLM v0.21.0

W8A16

32k

—

vLLM v0.21.0

W8A16

32k

—

Metric

Target

KL divergence vs BF16

< 0.005

MMLU recovery

≥ 99.7%

Gemma4-E2B-it-W8A16-NeuralMax

README

At a Glance

Quick Start

vLLM

SGLang

llama.cpp

Benchmarks

Quality Targets

Citation

About

Explore FriendliAI today

README

At a Glance

Quick Start

vLLM

SGLang

llama.cpp

Benchmarks

Quality Targets

Citation

About