salve-mundii
gemma4-E4B-opt
Run this model inference on single tenant GPU with unmatched speed and reliability at scale.
Get help setting up a custom Dedicated Endpoints.
Talk with our engineer to get a quote for reserved GPU instances with discounts.
README
License: apache-2.0Compression techniques
This checkpoint stacks three independent compression steps. All of them are applied only to the
language model — the vision tower, audio tower, multimodal projectors, embeddings and lm_head
are left at full precision (see ignore list in config.json), because vLLM cannot serve a
quantized vision/audio tower.
| Step | Technique | Detail |
|---|---|---|
| 1. Pruning | SparseGPT, 10% unstructured sparsity | Targets MLP layers only: gate_proj, up_proj, down_proj of every language_model decoder layer. Applied before quantization as a light, Hessian‑aware weight clean‑up. |
| 2. Quantization | GPTQ, W4A16 | 4‑bit integer weights, group size 128, symmetric, static activation ordering. Activations stay BF16. Calibrated on 128 samples of ultrachat_200k (max seq len 256). |
| 3. KV cache | FP8 (E4M3) | KV cache stored in fp8_e4m3 at serve time (calculate_kv_scales: false). Cuts attention‑cache memory and bandwidth. |
Note on the pruning step. The 10% sparsity is unstructured. vLLM ≥0.19 has no sparse kernel, so the weights are served dense — the pruning does not itself accelerate inference. Its role here is regularization/weight clean‑up ahead of GPTQ. The energy and footprint wins below come from the 4‑bit weights + FP8 KV cache, not from the sparsity. Nevertheless, we have found that pruning implicitly reduces the model’s energy consumption by altering its behaviour in such a way that it produces more concise output.
| Model Name | Avg Tokens per Question (MMMU-Pro) |
|---|---|
| FP16 baseline | 485 |
| quant-only (W4A16) | 603 |
| prune10 (W4A16 + 10% MLP) | 511 |
The exact, machine‑readable recipe is in recipe.yaml / recipe.json
and the quantization metadata in quantization.yaml.
Results vs. FP16 baseline
Both the baseline and this model were evaluated on a single NVIDIA L4, vLLM 0.19.0, with the
same serving profile (max_num_seqs=1024, chunked‑prefill on, speculative decoding off), on the
full MMMU‑Pro set (3,460 questions, 0 failed). Energy was measured by NVML (≥10 Hz) integrated
over the run window, cross‑checked with CodeCarbon.
| Metric | FP16 baseline | This model | Δ |
|---|---|---|---|
| MMMU‑Pro accuracy | 38.21% | 35.29% | −2.92 pp (−7.64% relative) |
| – standard (10 options) | 44.22% | 40.81% | −3.41 pp |
| – vision | 32.20% | 29.77% | −2.43 pp |
| Energy, full MMMU‑Pro run | 1,285,158 J | 75,550 J | −94.1% |
| Tokens per Wh | 9.7 | 164.9 | 17.0× |
| Wall‑clock for the run | 17,870 s (~4.96 h) | 1,057 s (~17.6 min) | 16.9× faster |
| Avg GPU power | 71.9 W | 71.5 W | ≈ equal |
| Checkpoint size on disk | 14.92 GB | 9.33 GB | −37.5% |
Serving
Tested to be served with the official competition image
vllm/vllm-openai:v0.20.2 on a single NVIDIA L4. We developed this version of the model targeting the vLLM framework.
bash
vllm serve <model-uri> --config vllm_config.yaml
The bundled vllm_config.yaml pins the tuned profile:
yaml
gpu-memory-utilization: 0.9max-model-len: 20000max-num-seqs: 1024max-num-batched-tokens: 4096enable-prefix-caching: trueenable-chunked-prefill: truequantization: compressed-tensorskv-cache-dtype: fp8_e4m3limit-mm-per-prompt:image: 3
Calibration & training tooling
- Hardware: 1×NVIDIA L4 (sm89), CUDA 12.8
- Stack: vLLM 0.19.0 · transformers 5.5.4 · llmcompressor 0.10.1.dev148 · torch 2.10.0+cu128
- Quantization cost: ~4,840 s wall‑clock, 16.8 GB peak VRAM (one‑shot, offloaded Hessians)
License
Released under the Apache License 2.0.
Model provider
salve-mundii
Model tree
Base
google/gemma-4-E4B-it
Quantized
this model
Modalities
Input
Text, Image
Output
Text
Pricing
Dedicated Endpoints
View detailsSupported Functionality
Model APIs
Dedicated Endpoints
Container
More information