Gemma4-E2B-it-W8A16 API & Inference Endpoint

Load path (important)

These weights are compressed-tensors (pack-quantized / int-quantized).

Table with columns: Runtime, Supported
Runtime	Supported
vLLM ≥ 0.21	Yes — preferred (auto-detect CT; no `--quantization` flag)
transformers + `compressed-tensors`	Yes for many text models; multimodal may need custom code
Text Generation Inference (TGI)	Not supported for these CT packs
Hugging Face Inference Widget	Often fails — use vLLM locally instead

bash
# Preferred
vllm serve 88plug/<ModelName> --trust-remote-code

Do not deploy via TGI “text-generation-inference” paths — that backend does not load our CT format and produces opaque worker/load errors.

Gemma4-E2B-W8A16

INT8 post-training quantization of google/gemma-4-e2b-it — Google's 2B-active multimodal MoE with 128 experts. The smallest capable multimodal MoE. Runs on any 8 GB GPU.

At a Glance

Table with columns: Property, Value
Property	Value
Base model	`google/gemma-4-e2b-it`
Release tier	Provisional (datafree RTN — re-quant scheduled)
Quant method	datafree RTN W8A16 (AutoRound blocked — KNOWN-FAILURES)
FLAC status	Not measured (T+7d milestone)
Architecture	Sparse MoE, 128 experts, hybrid sliding+global attention + SigLIP vision
Quant format	compressed-tensors (native vLLM)
Quantized	`language_model.*` transformer layers

Quick Start

Tested with vLLM v0.21.0 (vllm/vllm-openai:v0.21.0-cu129-ubuntu2404). Weights are in compressed-tensors format — vLLM detects and loads quantization automatically. No --quantization flag needed.

vLLM

bash
docker run --gpus device=0 -p 8080:8080 \
  vllm/vllm-openai:v0.21.0-cu129-ubuntu2404 vllm serve \
  88plug/Gemma4-E2B-it-W8A16 \
  --kv-cache-dtype fp8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90

Weights are in compressed-tensors format — no --quantization flag needed. Requires vLLM ≥ v0.21.0.

SGLang

bash
docker run --gpus device=0 -p 30000:30000 \
  lmsysorg/sglang:v0.5.8-cu129 python -m sglang.launch_server \
  --model-path google/gemma-4-e2b-it \
  --tp 1 \
  --mem-fraction-static 0.85 \
  --port 30000

llama.cpp

Fits entirely on an 8 GB GPU with Q4 quantization. VLM requires mmproj GGUF for image input.

bash
python convert_hf_to_gguf.py google/gemma-4-e2b-it \
  --outfile Gemma4-E2B-BF16.gguf
python convert_hf_to_gguf.py google/gemma-4-e2b-it \
  --mmproj --outfile Gemma4-E2B-mmproj.gguf

llama-quantize Gemma4-E2B-BF16.gguf Gemma4-E2B-Q8_0.gguf Q8_0
llama-quantize --imatrix calibration_datav3.txt \
  Gemma4-E2B-BF16.gguf Gemma4-E2B-IQ4_XS.gguf IQ4_XS

llama-server \
  --model Gemma4-E2B-Q8_0.gguf \
  --mmproj Gemma4-E2B-mmproj.gguf \
  --n-gpu-layers 999 \
  --ctx-size 32768 \
  --port 8081

Benchmarks

Table with columns: Metric, Score
Metric	Score
Throughput (in2048_out512)	8619 tok/s
Throughput (in512_out256)	8661 tok/s
Throughput (in8192_out512)	8624 tok/s

Results from 88plug benchmark ladder — vLLM v0.21.0, lm-evaluation-harness. No fabricated numbers.

Quality Targets

Table with columns: Metric, Target
Metric	Target
KL divergence vs BF16	< 0.005
MMLU recovery	≥ 99.7%

vs. Other Gemma4-E2B Quants

This is the first compressed-tensors W8A16 checkpoint for Gemma4-E2B. At ~2.5 GB it is the smallest vLLM-native multimodal checkpoint that fits on consumer 8 GB GPUs.

Table with columns: Quant, Method, Size, GPU Compatibility, Notes
Quant	Method	Size	GPU Compatibility	Notes
88plug W8A16 (this)	compressed-tensors RTN W8A16	~2.5 GB	Any Ampere+ ≥8 GB	First W8A16; native vLLM; vision+text
BF16 baseline	None	~4.5 GB	1× RTX 3080 10GB	Reference
Community GGUF Q4_K_M	llama.cpp GGUF	~2.5 GB	CPU / any GPU	Vision requires mmproj GGUF

Limitations

Vision tower excluded: SigLIP vision encoder stays BF16 — RTN INT8 not applied to vision components.
PLE layers excluded: embed_tokens_per_layer and per_layer_model_projection (Per-Layer Embeddings) kept at BF16 to prevent catastrophic quality loss.
RTN (data-free) quantization: No calibration corpus used. W8A16 RTN is near-lossless but has not been AutoRound-calibrated.
Benchmark results pending: Throughput and quality benchmarks will be added post-publication.

Citation

bibtex
@misc{gemma4report,
  title  = {Gemma 4 Technical Report},
  author = {Google DeepMind},
  year   = {2025},
  url    = {https://huggingface.co/google/gemma-4-e2b-it}
}

About

88plug AI Lab ships compressed-tensors quantizations for native vLLM v0.21.0+ deployment.

This release: Provisional tier — datafree RTN (weight-only rounding, no calibration corpus). A gold AutoRound re-quant is scheduled; 88plug architecture forbids new provisional W4A16 uploads.

Browse all releases → huggingface.co/88plug

Load path (important)

These weights are compressed-tensors (pack-quantized / int-quantized).

Table with columns: Runtime, Supported
Runtime	Supported
vLLM ≥ 0.21	Yes — preferred (auto-detect CT; no `--quantization` flag)
transformers + `compressed-tensors`	Yes for many text models; multimodal may need custom code
Text Generation Inference (TGI)	Not supported for these CT packs
Hugging Face Inference Widget	Often fails — use vLLM locally instead

bash
# Preferred
vllm serve 88plug/<ModelName> --trust-remote-code

Do not deploy via TGI “text-generation-inference” paths — that backend does not load our CT format and produces opaque worker/load errors.

Gemma4-E2B-W8A16

INT8 post-training quantization of google/gemma-4-e2b-it — Google's 2B-active multimodal MoE with 128 experts. The smallest capable multimodal MoE. Runs on any 8 GB GPU.

At a Glance

Table with columns: Property, Value
Property	Value
Base model	`google/gemma-4-e2b-it`
Release tier	Provisional (datafree RTN — re-quant scheduled)
Quant method	datafree RTN W8A16 (AutoRound blocked — KNOWN-FAILURES)
FLAC status	Not measured (T+7d milestone)
Architecture	Sparse MoE, 128 experts, hybrid sliding+global attention + SigLIP vision
Quant format	compressed-tensors (native vLLM)
Quantized	`language_model.*` transformer layers

Quick Start

vLLM

bash
docker run --gpus device=0 -p 8080:8080 \
  vllm/vllm-openai:v0.21.0-cu129-ubuntu2404 vllm serve \
  88plug/Gemma4-E2B-it-W8A16 \
  --kv-cache-dtype fp8 \
  --max-model-len 32768 \
  --gpu-memory-utilization 0.90

Weights are in compressed-tensors format — no --quantization flag needed. Requires vLLM ≥ v0.21.0.

SGLang

bash
docker run --gpus device=0 -p 30000:30000 \
  lmsysorg/sglang:v0.5.8-cu129 python -m sglang.launch_server \
  --model-path google/gemma-4-e2b-it \
  --tp 1 \
  --mem-fraction-static 0.85 \
  --port 30000

llama.cpp

Fits entirely on an 8 GB GPU with Q4 quantization. VLM requires mmproj GGUF for image input.

bash
python convert_hf_to_gguf.py google/gemma-4-e2b-it \
  --outfile Gemma4-E2B-BF16.gguf
python convert_hf_to_gguf.py google/gemma-4-e2b-it \
  --mmproj --outfile Gemma4-E2B-mmproj.gguf

llama-quantize Gemma4-E2B-BF16.gguf Gemma4-E2B-Q8_0.gguf Q8_0
llama-quantize --imatrix calibration_datav3.txt \
  Gemma4-E2B-BF16.gguf Gemma4-E2B-IQ4_XS.gguf IQ4_XS

llama-server \
  --model Gemma4-E2B-Q8_0.gguf \
  --mmproj Gemma4-E2B-mmproj.gguf \
  --n-gpu-layers 999 \
  --ctx-size 32768 \
  --port 8081

Benchmarks

Table with columns: Metric, Score
Metric	Score
Throughput (in2048_out512)	8619 tok/s
Throughput (in512_out256)	8661 tok/s
Throughput (in8192_out512)	8624 tok/s

Results from 88plug benchmark ladder — vLLM v0.21.0, lm-evaluation-harness. No fabricated numbers.

Quality Targets

Table with columns: Metric, Target
Metric	Target
KL divergence vs BF16	< 0.005
MMLU recovery	≥ 99.7%

vs. Other Gemma4-E2B Quants

This is the first compressed-tensors W8A16 checkpoint for Gemma4-E2B. At ~2.5 GB it is the smallest vLLM-native multimodal checkpoint that fits on consumer 8 GB GPUs.

Table with columns: Quant, Method, Size, GPU Compatibility, Notes
Quant	Method	Size	GPU Compatibility	Notes
88plug W8A16 (this)	compressed-tensors RTN W8A16	~2.5 GB	Any Ampere+ ≥8 GB	First W8A16; native vLLM; vision+text
BF16 baseline	None	~4.5 GB	1× RTX 3080 10GB	Reference
Community GGUF Q4_K_M	llama.cpp GGUF	~2.5 GB	CPU / any GPU	Vision requires mmproj GGUF

Limitations

Vision tower excluded: SigLIP vision encoder stays BF16 — RTN INT8 not applied to vision components.
PLE layers excluded: embed_tokens_per_layer and per_layer_model_projection (Per-Layer Embeddings) kept at BF16 to prevent catastrophic quality loss.
RTN (data-free) quantization: No calibration corpus used. W8A16 RTN is near-lossless but has not been AutoRound-calibrated.
Benchmark results pending: Throughput and quality benchmarks will be added post-publication.

Citation

bibtex
@misc{gemma4report,
  title  = {Gemma 4 Technical Report},
  author = {Google DeepMind},
  year   = {2025},
  url    = {https://huggingface.co/google/gemma-4-e2b-it}
}

About

88plug AI Lab ships compressed-tensors quantizations for native vLLM v0.21.0+ deployment.

This release: Provisional tier — datafree RTN (weight-only rounding, no calibration corpus). A gold AutoRound re-quant is scheduled; 88plug architecture forbids new provisional W4A16 uploads.

Browse all releases → huggingface.co/88plug

Gemma4-E2B-it-W8A16

README

Load path (important)

Gemma4-E2B-W8A16

At a Glance

Quick Start

vLLM

SGLang

llama.cpp

Benchmarks

Quality Targets

vs. Other Gemma4-E2B Quants

Limitations

Citation

About

Explore FriendliAI today

README

Load path (important)

Gemma4-E2B-W8A16

At a Glance

Quick Start

vLLM

SGLang

llama.cpp

Benchmarks

Quality Targets

vs. Other Gemma4-E2B Quants

Limitations

Citation

About