88plug

Gemma4-E4B-it-W8A16-NeuralMax

README

License: apache-2.0

At a Glance

Table with columns: Property, Value
Property	Value
Base model	`google/gemma-4-e4b-it`
Architecture	Sparse MoE, 128 experts, hybrid sliding+global attention + SigLIP vision
Quant format	compressed-tensors (native vLLM)
Quant method	AutoRound W8A16 (RTN, datafree)
Quantized	`language_model.*` transformer layers
Kept BF16	vision_tower, multi_modal_projector, embed_tokens_per_layer (PLE)
Min GPU	1× RTX 3090 24GB

PLE layers kept at BF16

embed_tokens_per_layer and per_layer_model_projection implement Per-Layer Embeddings — catastrophic quality loss if quantized. Always excluded.

Quick Start

vLLM

bash
docker run --gpus device=0 -p 8080:8080 \
  vllm/vllm-openai:v0.21.0-cu129-ubuntu2404 vllm serve \
  88plug/Gemma4-E4B-W8A16 \
  --kv-cache-dtype fp8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90

Weights are in compressed-tensors format — no --quantization flag needed.

SGLang

bash
docker run --gpus device=0 -p 30000:30000 \
  lmsysorg/sglang:v0.5.8-cu129 python -m sglang.launch_server \
  --model-path google/gemma-4-e4b-it \
  --tp 1 \
  --mem-fraction-static 0.85 \
  --port 30000

llama.cpp

VLM — requires a separate mmproj GGUF for image input. Text-only is a single GGUF.

bash
python convert_hf_to_gguf.py google/gemma-4-e4b-it \
  --outfile Gemma4-E4B-BF16.gguf
python convert_hf_to_gguf.py google/gemma-4-e4b-it \
  --mmproj --outfile Gemma4-E4B-mmproj.gguf

llama-quantize Gemma4-E4B-BF16.gguf Gemma4-E4B-Q8_0.gguf Q8_0
llama-quantize --imatrix calibration_datav3.txt \
  Gemma4-E4B-BF16.gguf Gemma4-E4B-IQ4_XS.gguf IQ4_XS

llama-server \
  --model Gemma4-E4B-Q8_0.gguf \
  --mmproj Gemma4-E4B-mmproj.gguf \
  --n-gpu-layers 999 \
  --ctx-size 32768 \
  --port 8081

Benchmarks

Results pending.

Table with columns: Engine, Format, Batch, ctx, tok/s, TTFT p50, TTFT p99, VRAM
Engine	Format	Batch	ctx	tok/s	TTFT p50	TTFT p99	VRAM
vLLM v0.21.0	W8A16	1	32k	—	—	—	—
vLLM v0.21.0	W8A16	8	32k	—

Hardware: A6000 48 GB, CUDA 12.9, driver 570.

Quality Targets

Table with columns: Metric, Target
Metric	Target
KL divergence vs BF16	< 0.005
MMLU recovery	≥ 99.7%

Citation

bibtex
@misc{gemma4report,
  title  = {Gemma 4 Technical Report},
  author = {Google DeepMind},
  year   = {2025},
  url    = {https://huggingface.co/google/gemma-4-e4b-it}
}

About

Produced by 88plug AI Lab — zero-loss quantizations of frontier omni and voice models.

Available on FriendliAI

Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more

Model Details

Model Provider

88plug

Model Tree

Base

this model

Input Modalities

TextImage

Output Modalities

Text

Supported Functionality