salve-mundii

gemma4-E4B-opt

README

License: apache-2.0

Compression techniques

This checkpoint stacks three independent compression steps. All of them are applied only to the language model — the vision tower, audio tower, multimodal projectors, embeddings and lm_head are left at full precision (see ignore list in config.json), because vLLM cannot serve a quantized vision/audio tower.

Table with columns: Step, Technique, Detail
Step	Technique	Detail
1. Pruning	SparseGPT, 10% unstructured sparsity	Targets MLP layers only: `gate_proj`, `up_proj`, `down_proj` of every `language_model` decoder layer. Applied before quantization as a light, Hessian‑aware weight clean‑up.
2. Quantization	GPTQ, W4A16	4‑bit integer weights, group size 128, symmetric, static activation ordering. Activations stay BF16. Calibrated on 128 samples of `ultrachat_200k` (max seq len 256).
3. KV cache	FP8 (E4M3)	KV cache stored in `fp8_e4m3` at serve time (`calculate_kv_scales: false`). Cuts attention‑cache memory and bandwidth.

Note on the pruning step. The 10% sparsity is unstructured. vLLM ≥0.19 has no sparse kernel, so the weights are served dense — the pruning does not itself accelerate inference. Its role here is regularization/weight clean‑up ahead of GPTQ. The energy and footprint wins below come from the 4‑bit weights + FP8 KV cache, not from the sparsity. Nevertheless, we have found that pruning implicitly reduces the model’s energy consumption by altering its behaviour in such a way that it produces more concise output.

Table with columns: Model Name, Avg Tokens per Question (MMMU-Pro)
Model Name	Avg Tokens per Question (MMMU-Pro)
FP16 baseline	485
quant-only (W4A16)	603
prune10 (W4A16 + 10% MLP)	511

The exact, machine‑readable recipe is in recipe.yaml / recipe.json and the quantization metadata in quantization.yaml.

Results vs. FP16 baseline

Both the baseline and this model were evaluated on a single NVIDIA L4, vLLM 0.19.0, with the same serving profile (max_num_seqs=1024, chunked‑prefill on, speculative decoding off), on the full MMMU‑Pro set (3,460 questions, 0 failed). Energy was measured by NVML (≥10 Hz) integrated over the run window, cross‑checked with CodeCarbon.

Table with columns: Metric, FP16 baseline, This model, Δ
Metric	FP16 baseline	This model	Δ
MMMU‑Pro accuracy	38.21%	35.29%	−2.92 pp (−7.64% relative)
– standard (10 options)	44.22%	40.81%	−3.41 pp
– vision	32.20%	29.77%	−2.43 pp
Energy, full MMMU‑Pro run	1,285,158 J

Serving

Tested to be served with the official competition image vllm/vllm-openai:v0.20.2 on a single NVIDIA L4. We developed this version of the model targeting the vLLM framework.

bash
vllm serve <model-uri> --config vllm_config.yaml

The bundled vllm_config.yaml pins the tuned profile:

yaml
gpu-memory-utilization: 0.9
max-model-len: 20000
max-num-seqs: 1024
max-num-batched-tokens: 4096
enable-prefix-caching: true
enable-chunked-prefill: true
quantization: compressed-tensors
kv-cache-dtype: fp8_e4m3
limit-mm-per-prompt:
  image: 3

Calibration & training tooling

Hardware: 1×NVIDIA L4 (sm89), CUDA 12.8
Stack: vLLM 0.19.0 · transformers 5.5.4 · llmcompressor 0.10.1.dev148 · torch 2.10.0+cu128
Quantization cost: ~4,840 s wall‑clock, 16.8 GB peak VRAM (one‑shot, offloaded Hessians)

License

Released under the Apache License 2.0.

Available on FriendliAI

Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more

Container

Run this model inference with full control and performance in your environment.

Learn more

Model Details

Model Provider

salve-mundii

Model Tree

Base

google/gemma-4-E4B-it

Quantized

this model

Input Modalities

TextImage

Output Modalities

Text

Supported Functionality

Dedicated EndpointsContainer

Explore FriendliAI today

Get started Talk to an engineer