salve-mundii

gemma4-E4B-opt

Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more

Get help setting up a custom Dedicated Endpoints.

Talk with our engineer to get a quote for reserved GPU instances with discounts.

README

License: apache-2.0

Compression techniques

This checkpoint stacks three independent compression steps. All of them are applied only to the language model — the vision tower, audio tower, multimodal projectors, embeddings and lm_head are left at full precision (see ignore list in config.json), because vLLM cannot serve a quantized vision/audio tower.

Table
StepTechniqueDetail
1. PruningSparseGPT, 10% unstructured sparsityTargets MLP layers only: gate_proj, up_proj, down_proj of every language_model decoder layer. Applied before quantization as a light, Hessian‑aware weight clean‑up.
2. QuantizationGPTQ, W4A164‑bit integer weights, group size 128, symmetric, static activation ordering. Activations stay BF16. Calibrated on 128 samples of ultrachat_200k (max seq len 256).
3. KV cacheFP8 (E4M3)KV cache stored in fp8_e4m3 at serve time (calculate_kv_scales: false). Cuts attention‑cache memory and bandwidth.

Note on the pruning step. The 10% sparsity is unstructured. vLLM ≥0.19 has no sparse kernel, so the weights are served dense — the pruning does not itself accelerate inference. Its role here is regularization/weight clean‑up ahead of GPTQ. The energy and footprint wins below come from the 4‑bit weights + FP8 KV cache, not from the sparsity. Nevertheless, we have found that pruning implicitly reduces the model’s energy consumption by altering its behaviour in such a way that it produces more concise output.

Table
Model NameAvg Tokens per Question (MMMU-Pro)
FP16 baseline485
quant-only (W4A16)603
prune10 (W4A16 + 10% MLP)511

The exact, machine‑readable recipe is in recipe.yaml / recipe.json and the quantization metadata in quantization.yaml.

Results vs. FP16 baseline

Both the baseline and this model were evaluated on a single NVIDIA L4, vLLM 0.19.0, with the same serving profile (max_num_seqs=1024, chunked‑prefill on, speculative decoding off), on the full MMMU‑Pro set (3,460 questions, 0 failed). Energy was measured by NVML (≥10 Hz) integrated over the run window, cross‑checked with CodeCarbon.

Table
MetricFP16 baselineThis modelΔ
MMMU‑Pro accuracy38.21%35.29%−2.92 pp (−7.64% relative)
  – standard (10 options)44.22%40.81%−3.41 pp
  – vision32.20%29.77%−2.43 pp
Energy, full MMMU‑Pro run1,285,158 J75,550 J−94.1%
Tokens per Wh9.7164.917.0×
Wall‑clock for the run17,870 s (~4.96 h)1,057 s (~17.6 min)16.9× faster
Avg GPU power71.9 W71.5 W≈ equal
Checkpoint size on disk14.92 GB9.33 GB−37.5%

Serving

Tested to be served with the official competition image vllm/vllm-openai:v0.20.2 on a single NVIDIA L4. We developed this version of the model targeting the vLLM framework.

bash

vllm serve <model-uri> --config vllm_config.yaml

The bundled vllm_config.yaml pins the tuned profile:

yaml

gpu-memory-utilization: 0.9
max-model-len: 20000
max-num-seqs: 1024
max-num-batched-tokens: 4096
enable-prefix-caching: true
enable-chunked-prefill: true
quantization: compressed-tensors
kv-cache-dtype: fp8_e4m3
limit-mm-per-prompt:
image: 3

Calibration & training tooling

  • Hardware: 1×NVIDIA L4 (sm89), CUDA 12.8
  • Stack: vLLM 0.19.0 · transformers 5.5.4 · llmcompressor 0.10.1.dev148 · torch 2.10.0+cu128
  • Quantization cost: ~4,840 s wall‑clock, 16.8 GB peak VRAM (one‑shot, offloaded Hessians)

License

Released under the Apache License 2.0.

Model provider

salve-mundii

Model tree

Base

google/gemma-4-E4B-it

Quantized

this model

Modalities

Input

Text, Image

Output

Text

Pricing

Dedicated Endpoints

View details

Supported Functionality

Model APIs

Dedicated Endpoints

Container

More information

Explore FriendliAI today