88plug
Gemma4-E4B-it-W8A16-NeuralMax
Run this model inference on single tenant GPU with unmatched speed and reliability at scale.
Get help setting up a custom Dedicated Endpoints.
Talk with our engineer to get a quote for reserved GPU instances with discounts.
README
License: apache-2.0At a Glance
| Property | Value |
|---|---|
| Base model | google/gemma-4-e4b-it |
| Architecture | Sparse MoE, 128 experts, hybrid sliding+global attention + SigLIP vision |
| Quant format | compressed-tensors (native vLLM) |
| Quant method | AutoRound W8A16 (RTN, datafree) |
| Quantized | language_model.* transformer layers |
| Kept BF16 | vision_tower, multi_modal_projector, embed_tokens_per_layer (PLE) |
| Min GPU | 1× RTX 3090 24GB |
PLE layers kept at BF16
embed_tokens_per_layer and per_layer_model_projection implement Per-Layer Embeddings — catastrophic quality loss if quantized. Always excluded.
Quick Start
vLLM
bash
docker run --gpus device=0 -p 8080:8080 \vllm/vllm-openai:v0.21.0-cu129-ubuntu2404 vllm serve \88plug/Gemma4-E4B-W8A16 \--kv-cache-dtype fp8 \--max-model-len 32768 \--gpu-memory-utilization 0.90
Weights are in compressed-tensors format — no --quantization flag needed.
SGLang
bash
docker run --gpus device=0 -p 30000:30000 \lmsysorg/sglang:v0.5.8-cu129 python -m sglang.launch_server \--model-path google/gemma-4-e4b-it \--tp 1 \--mem-fraction-static 0.85 \--port 30000
llama.cpp
VLM — requires a separate mmproj GGUF for image input. Text-only is a single GGUF.
bash
python convert_hf_to_gguf.py google/gemma-4-e4b-it \--outfile Gemma4-E4B-BF16.ggufpython convert_hf_to_gguf.py google/gemma-4-e4b-it \--mmproj --outfile Gemma4-E4B-mmproj.ggufllama-quantize Gemma4-E4B-BF16.gguf Gemma4-E4B-Q8_0.gguf Q8_0llama-quantize --imatrix calibration_datav3.txt \Gemma4-E4B-BF16.gguf Gemma4-E4B-IQ4_XS.gguf IQ4_XSllama-server \--model Gemma4-E4B-Q8_0.gguf \--mmproj Gemma4-E4B-mmproj.gguf \--n-gpu-layers 999 \--ctx-size 32768 \--port 8081
Benchmarks
Results pending.
| Engine | Format | Batch | ctx | tok/s | TTFT p50 | TTFT p99 | VRAM |
|---|---|---|---|---|---|---|---|
| vLLM v0.21.0 | W8A16 | 1 | 32k | — | — | — | — |
| vLLM v0.21.0 | W8A16 | 8 | 32k | — | — | — | — |
| SGLang v0.5.8 | BF16 (baseline) | 1 | 32k | — | — | — | — |
| llama.cpp b9297 | Q8_0 GGUF | 1 | 32k | — | — | — | — |
| llama.cpp b9297 | IQ4_XS GGUF | 1 | 32k | — | — | — | — |
Hardware: A6000 48 GB, CUDA 12.9, driver 570.
Quality Targets
| Metric | Target |
|---|---|
| KL divergence vs BF16 | < 0.005 |
| MMLU recovery | ≥ 99.7% |
Citation
bibtex
@misc{gemma4report,title = {Gemma 4 Technical Report},author = {Google DeepMind},year = {2025},url = {https://huggingface.co/google/gemma-4-e4b-it}}
About
Produced by 88plug AI Lab — zero-loss quantizations of frontier omni and voice models.
Model provider
88plug
Model tree
Base
this model
Modalities
Input
Text, Image
Output
Text
Pricing
Dedicated Endpoints
View detailsSupported Functionality
Model APIs
Dedicated Endpoints
Container
More information