Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more

Get help setting up a custom Dedicated Endpoints.

Talk with our engineer to get a quote for reserved GPU instances with discounts.

README

License: apache-2.0

Model Overview

This model is a pruned and quantized version of google/gemma-4-E4B-it. Through FLAP structured pruning of the FFN layers followed by GPTQ INT4 weight-only quantization, it significantly reduces parameter count and inference memory while maintaining model performance.

Base Model Information

  • Base Model: google/gemma-4-E4B-it
  • Model Type: Multimodal Large Language Model (Text + Image + Audio)
  • Architecture: Gemma4ForConditionalGeneration
  • Layers: 42 layers
  • Data Type: bfloat16 (unquantized parts), INT4 (quantized parts)

Compression Methods

1. FLAP Pruning (Fluctuation-based Adaptive Structured Pruning)

This model follows the FLAP method for FFN structured pruning.

  1. Bias Compensation
    • Adds bias compensation for the average contribution of pruned channels, maintaining good performance without retraining
    • Original Gemma4's down_proj has no bias; bias parameters are added after pruning

2. GPTQ INT4 Quantization (W4A16, weight-only)

This model uses GPTQ INT4 weight-only quantization via GPTQModel for efficient inference with vLLM.

  • Quantization Method: GPTQ (GPT-QModel) W4A16 weight-only
  • Weight Quantization: 4-bit symmetric group-wise quantization
  • Activation: Not quantized (remains in bfloat16)
  • Quantization Format: GPTQ v1 (gptq checkpoint format)
  • Quantization Tool: GPTQModel 7.0.0

Quantization Configuration

ParameterValueDescription
bits4Quantization bit-width (INT4)
group_size128Quantization group size
symTrueSymmetric quantization
desc_actTrueActivation descending order (improves quantization accuracy)
formatgptqGPTQ v1 format, compatible with vLLM
damp_percent0.01Hessian diagonal damping percentage
calibration_samples256Number of calibration samples (WikiText-2)
calibration_seq_len2048Calibration sequence length

Quantized Modules

The following modules in the language model are quantized to INT4:

Module TypeDescription
self_attn.o_projAttention output projection
mlp.gate_projMLP gate projection
mlp.up_projMLP up projection
mlp.down_projMLP down projection

Total: 168 quantized modules (42 layers × 4 modules per layer)

Unquantized Modules (kept in bfloat16)

The following modules are explicitly excluded from quantization via GPTQModel's dynamic configuration and remain in bfloat16:

Module PatternReason
self_attn.q_projQKV projections are sensitive to quantization error; keeping bf16 significantly improves accuracy
self_attn.k_projSame as above
self_attn.v_projSame as above
per_layer_projectionPLE module weights are small (~1.25MB bf16 per layer) but suffer from large quantization error
per_layer_input_gateSame as above
Vision tower weights (model.vision_tower.*)Vision encoder typically does not need quantization
Audio branch weights (model.audio_tower.*)Audio encoder typically does not need quantization
Embeddings (embed_tokens.weight, embed_tokens_per_layer.weight)Embedding layers are not suitable for quantization
LayerNorm/RMSNorm weightsNormalization layers have minimal parameters, no need for quantization
All bias tensors (including FLAP down_proj.bias)Bias terms kept at original precision
Language model head (lm_head)Output projection kept at original precision

GPTQ Quantization Principle

GPTQ employs a layer-wise quantization strategy that minimizes quantization error based on approximate second-order information (Hessian matrix):

markdown

For each layer:
1. Compute the Hessian matrix H⁻¹ for the layer (based on calibration data activations)
2. Quantize weights column by column, using H⁻¹ to correct unquantized columns
to compensate for quantization error:
δ_w = -w_q_err · (H⁻¹_{jj} / [H⁻¹]_{j,.})
3. When desc_act=True, process columns in descending order of activation magnitude,
prioritizing important weights
Quantized weight: W_int4 = quantize(W_bf16, scale, zero_point)
Dequantized: W_bf16 ≈ W_int4 · scale + zero_point
Where scale and zero_point are computed per group of group_size=128

Pruning Configuration

This model adopts a non-uniform pruning strategy, with differentiated processing for Gemma4's YOCO architecture:

Layer RangeRolePruning Ratiointermediate_size
0-3sliding_attention20%8192
4sliding_attention0%10240
5-6full_attention & sliding_attention20%8192
7-8sliding_attention0%10240
9sliding_attention10%9216
10-11sliding_attention & full_attention0%10240
12-13sliding_attention20%8192
14sliding_attention0%10240
15-16sliding_attention20%8192
17full_attention0%10240
18-19sliding_attention20%8192
20-21sliding_attention20%8192
22-23sliding_attention & full_attention0%10240
24-27sliding_attention20%8192
28-29sliding_attention & full_attention0%10240
30-31sliding_attention20%8192
32sliding_attention0%10240
33-34sliding_attention20%8192
35-41full_attention & sliding_attention0%10240

intermediate_size Distribution After Pruning

intermediate_sizeLayer CountDescription
10240 (original)19 layersUnpruned
9216 (10% pruned)1 layerLightly pruned
8192 (20% pruned)22 layersPruned

FFN Parameter Compression: ~15%

Model Structure Changes

Configuration Changes

json

{
"text_config": {
"intermediate_size": 10240,
"intermediate_sizes": [8192, 8192, ..., 10240, 10240],
"flap_pruned": true
},
"quantization_config": {
"quant_method": "gptq",
"bits": 4,
"group_size": 128,
"sym": true,
"desc_act": true,
"format": "gptq",
"checkpoint_format": "gptq",
"dynamic": {
"-:.*per_layer_projection": {},
"-:.*per_layer_input_gate": {},
"-:.*self_attn.q_proj$": {},
"-:.*self_attn.k_proj$": {},
"-:.*self_attn.v_proj$": {}
}
}
}
  • intermediate_sizes: Actual intermediate_size per layer (added after non-uniform pruning)
  • flap_pruned: Indicates the model has undergone FLAP pruning
  • quantization_config: GPTQ INT4 quantization configuration

Weight Changes

  1. gate_proj / up_proj: Rows corresponding to pruned channels are removed
  2. down_proj:
    • Columns corresponding to pruned channels are removed
    • New bias parameter added (bias compensation values)
  3. Quantized weights: Stored as INT4 packed weights (qweight) with corresponding scale/zero-point tensors (scales, qzeros, g_idx)

Usage

[!Note] Deployment command

vLLM Deployment

bash

# **Required** Download the model with all files (**including plugin files**) to local storage
MODEL_DIR=$(python -c "from huggingface_hub import snapshot_download; print(snapshot_download('ISCASRGL/gemma4-lite-round2'))")
# Set PYTHONPATH to include the plugin (required due to model modifications from pruning)
export PYTHONPATH="$MODEL_DIR:$PYTHONPATH"
# Start vLLM service
vllm serve ISCASRGL/gemma4-lite-round2 --config $MODEL_DIR/vllm_config.yaml

vLLM Plugin Description

This model includes vllm_flap_plugin for direct deployment of FLAP-pruned models in vLLM:

  • per-layer intermediate_size: Supports different FFN widths per layer after non-uniform pruning
  • FLAP bias compensation: Adds bias support for down_proj
  • Conditional patch: Only activates when config.flap_pruned=True, does not affect non-FLAP models

Technical Details

Bias Compensation Principle

markdown

Before pruning: FFN(x) = down_proj(act(gate_proj(x)) * up_proj(x))
After pruning: FFN'(x) = down_proj_pruned(h_pruned) + output_bias
Where:
output_bias = Σ_{j∈pruned} E[h_j] × W_down[:, j]
= (E[h] * ~mask) @ W_down.T

GPTQ INT4 Quantization Details

markdown

Quantization process:
1. Collect calibration data activations, compute Hessian matrix H
2. Quantize column by column: w_q = round(w / scale) - zero_point
3. Error compensation: correct remaining unquantized column weights
4. desc_act: process columns in descending order of |H_jj|
Dequantization: W_bf16 ≈ (W_int4 - zero_point) × scale
Group quantization: scale and zero_point computed per group of group_size=128
Symmetric quantization: zero_point = 0, W_bf16 ≈ W_int4 × scale

File Structure

markdown

├── config.json # Model configuration (with quantization_config)
├── model-00001-of-00003.safetensors # Model weights (shard 1)
├── model-00002-of-00003.safetensors # Model weights (shard 2)
├── model-00003-of-00003.safetensors # Model weights (shard 3)
├── model.safetensors.index.json # Weight index
├── quantize_config.json # GPTQ quantization configuration
├── quant_log.csv # Quantization log (per-layer quantization error)
├── generation_config.json # Generation configuration
├── processor_config.json # Multimodal processor configuration
├── tokenizer.json # Tokenizer
├── tokenizer_config.json # Tokenizer configuration
├── chat_template.jinja # Chat template
├── vllm_flap_plugin.egg-info # Plugin metadata
└── vllm_flap_plugin/ # vLLM compatibility plugin
├── __init__.py
└── README.md

Compression Summary

Compression StageMethodReduction
FLAP PruningNon-uniform FFN pruning~15% FFN parameters
GPTQ INT4 QuantizationWeight-only 4-bit quantization~25% memory for quantized modules

Total Model Size: ~9.9 GB (compared to ~16 GB for the original bfloat16 model, approximately 38% reduction)

Model provider

ISCASRGL

Model tree

Base

google/gemma-4-E4B-it

Quantized

this model

Modalities

Input

Text, Image

Output

Text

Pricing

Dedicated Endpoints

View details

Supported Functionality

Model APIs

Dedicated Endpoints

Container

More information

Explore FriendliAI today