Run this model inference on single tenant GPU with unmatched speed and reliability at scale.
Get help setting up a custom Dedicated Endpoints.
Talk with our engineer to get a quote for reserved GPU instances with discounts.
README
License: apache-2.0Model Overview
This model is a pruned and quantized version of google/gemma-4-E4B-it. Through FLAP structured pruning of the FFN layers followed by GPTQ INT4 weight-only quantization, it significantly reduces parameter count and inference memory while maintaining model performance.
Base Model Information
- Base Model: google/gemma-4-E4B-it
- Model Type: Multimodal Large Language Model (Text + Image + Audio)
- Architecture: Gemma4ForConditionalGeneration
- Layers: 42 layers
- Data Type: bfloat16 (unquantized parts), INT4 (quantized parts)
Compression Methods
1. FLAP Pruning (Fluctuation-based Adaptive Structured Pruning)
This model follows the FLAP method for FFN structured pruning.
- Bias Compensation
- Adds bias compensation for the average contribution of pruned channels, maintaining good performance without retraining
- Original Gemma4's
down_projhas no bias; bias parameters are added after pruning
2. GPTQ INT4 Quantization (W4A16, weight-only)
This model uses GPTQ INT4 weight-only quantization via GPTQModel for efficient inference with vLLM.
- Quantization Method: GPTQ (GPT-QModel) W4A16 weight-only
- Weight Quantization: 4-bit symmetric group-wise quantization
- Activation: Not quantized (remains in bfloat16)
- Quantization Format: GPTQ v1 (
gptqcheckpoint format) - Quantization Tool: GPTQModel 7.0.0
Quantization Configuration
| Parameter | Value | Description |
|---|---|---|
bits | 4 | Quantization bit-width (INT4) |
group_size | 128 | Quantization group size |
sym | True | Symmetric quantization |
desc_act | True | Activation descending order (improves quantization accuracy) |
format | gptq | GPTQ v1 format, compatible with vLLM |
damp_percent | 0.01 | Hessian diagonal damping percentage |
calibration_samples | 256 | Number of calibration samples (WikiText-2) |
calibration_seq_len | 2048 | Calibration sequence length |
Quantized Modules
The following modules in the language model are quantized to INT4:
| Module Type | Description |
|---|---|
self_attn.o_proj | Attention output projection |
mlp.gate_proj | MLP gate projection |
mlp.up_proj | MLP up projection |
mlp.down_proj | MLP down projection |
Total: 168 quantized modules (42 layers × 4 modules per layer)
Unquantized Modules (kept in bfloat16)
The following modules are explicitly excluded from quantization via GPTQModel's dynamic configuration and remain in bfloat16:
| Module Pattern | Reason |
|---|---|
self_attn.q_proj | QKV projections are sensitive to quantization error; keeping bf16 significantly improves accuracy |
self_attn.k_proj | Same as above |
self_attn.v_proj | Same as above |
per_layer_projection | PLE module weights are small (~1.25MB bf16 per layer) but suffer from large quantization error |
per_layer_input_gate | Same as above |
Vision tower weights (model.vision_tower.*) | Vision encoder typically does not need quantization |
Audio branch weights (model.audio_tower.*) | Audio encoder typically does not need quantization |
Embeddings (embed_tokens.weight, embed_tokens_per_layer.weight) | Embedding layers are not suitable for quantization |
| LayerNorm/RMSNorm weights | Normalization layers have minimal parameters, no need for quantization |
All bias tensors (including FLAP down_proj.bias) | Bias terms kept at original precision |
Language model head (lm_head) | Output projection kept at original precision |
GPTQ Quantization Principle
GPTQ employs a layer-wise quantization strategy that minimizes quantization error based on approximate second-order information (Hessian matrix):
markdown
For each layer:1. Compute the Hessian matrix H⁻¹ for the layer (based on calibration data activations)2. Quantize weights column by column, using H⁻¹ to correct unquantized columnsto compensate for quantization error:δ_w = -w_q_err · (H⁻¹_{jj} / [H⁻¹]_{j,.})3. When desc_act=True, process columns in descending order of activation magnitude,prioritizing important weightsQuantized weight: W_int4 = quantize(W_bf16, scale, zero_point)Dequantized: W_bf16 ≈ W_int4 · scale + zero_pointWhere scale and zero_point are computed per group of group_size=128
Pruning Configuration
This model adopts a non-uniform pruning strategy, with differentiated processing for Gemma4's YOCO architecture:
| Layer Range | Role | Pruning Ratio | intermediate_size |
|---|---|---|---|
| 0-3 | sliding_attention | 20% | 8192 |
| 4 | sliding_attention | 0% | 10240 |
| 5-6 | full_attention & sliding_attention | 20% | 8192 |
| 7-8 | sliding_attention | 0% | 10240 |
| 9 | sliding_attention | 10% | 9216 |
| 10-11 | sliding_attention & full_attention | 0% | 10240 |
| 12-13 | sliding_attention | 20% | 8192 |
| 14 | sliding_attention | 0% | 10240 |
| 15-16 | sliding_attention | 20% | 8192 |
| 17 | full_attention | 0% | 10240 |
| 18-19 | sliding_attention | 20% | 8192 |
| 20-21 | sliding_attention | 20% | 8192 |
| 22-23 | sliding_attention & full_attention | 0% | 10240 |
| 24-27 | sliding_attention | 20% | 8192 |
| 28-29 | sliding_attention & full_attention | 0% | 10240 |
| 30-31 | sliding_attention | 20% | 8192 |
| 32 | sliding_attention | 0% | 10240 |
| 33-34 | sliding_attention | 20% | 8192 |
| 35-41 | full_attention & sliding_attention | 0% | 10240 |
intermediate_size Distribution After Pruning
| intermediate_size | Layer Count | Description |
|---|---|---|
| 10240 (original) | 19 layers | Unpruned |
| 9216 (10% pruned) | 1 layer | Lightly pruned |
| 8192 (20% pruned) | 22 layers | Pruned |
FFN Parameter Compression: ~15%
Model Structure Changes
Configuration Changes
json
{"text_config": {"intermediate_size": 10240,"intermediate_sizes": [8192, 8192, ..., 10240, 10240],"flap_pruned": true},"quantization_config": {"quant_method": "gptq","bits": 4,"group_size": 128,"sym": true,"desc_act": true,"format": "gptq","checkpoint_format": "gptq","dynamic": {"-:.*per_layer_projection": {},"-:.*per_layer_input_gate": {},"-:.*self_attn.q_proj$": {},"-:.*self_attn.k_proj$": {},"-:.*self_attn.v_proj$": {}}}}
intermediate_sizes: Actual intermediate_size per layer (added after non-uniform pruning)flap_pruned: Indicates the model has undergone FLAP pruningquantization_config: GPTQ INT4 quantization configuration
Weight Changes
- gate_proj / up_proj: Rows corresponding to pruned channels are removed
- down_proj:
- Columns corresponding to pruned channels are removed
- New
biasparameter added (bias compensation values)
- Quantized weights: Stored as INT4 packed weights (
qweight) with corresponding scale/zero-point tensors (scales,qzeros,g_idx)
Usage
[!Note] Deployment command
vLLM Deployment
bash
# **Required** Download the model with all files (**including plugin files**) to local storageMODEL_DIR=$(python -c "from huggingface_hub import snapshot_download; print(snapshot_download('ISCASRGL/gemma4-lite-round2'))")# Set PYTHONPATH to include the plugin (required due to model modifications from pruning)export PYTHONPATH="$MODEL_DIR:$PYTHONPATH"# Start vLLM servicevllm serve ISCASRGL/gemma4-lite-round2 --config $MODEL_DIR/vllm_config.yaml
vLLM Plugin Description
This model includes vllm_flap_plugin for direct deployment of FLAP-pruned models in vLLM:
- per-layer intermediate_size: Supports different FFN widths per layer after non-uniform pruning
- FLAP bias compensation: Adds bias support for
down_proj - Conditional patch: Only activates when
config.flap_pruned=True, does not affect non-FLAP models
Technical Details
Bias Compensation Principle
markdown
Before pruning: FFN(x) = down_proj(act(gate_proj(x)) * up_proj(x))After pruning: FFN'(x) = down_proj_pruned(h_pruned) + output_biasWhere:output_bias = Σ_{j∈pruned} E[h_j] × W_down[:, j]= (E[h] * ~mask) @ W_down.T
GPTQ INT4 Quantization Details
markdown
Quantization process:1. Collect calibration data activations, compute Hessian matrix H2. Quantize column by column: w_q = round(w / scale) - zero_point3. Error compensation: correct remaining unquantized column weights4. desc_act: process columns in descending order of |H_jj|Dequantization: W_bf16 ≈ (W_int4 - zero_point) × scaleGroup quantization: scale and zero_point computed per group of group_size=128Symmetric quantization: zero_point = 0, W_bf16 ≈ W_int4 × scale
File Structure
markdown
├── config.json # Model configuration (with quantization_config)├── model-00001-of-00003.safetensors # Model weights (shard 1)├── model-00002-of-00003.safetensors # Model weights (shard 2)├── model-00003-of-00003.safetensors # Model weights (shard 3)├── model.safetensors.index.json # Weight index├── quantize_config.json # GPTQ quantization configuration├── quant_log.csv # Quantization log (per-layer quantization error)├── generation_config.json # Generation configuration├── processor_config.json # Multimodal processor configuration├── tokenizer.json # Tokenizer├── tokenizer_config.json # Tokenizer configuration├── chat_template.jinja # Chat template├── vllm_flap_plugin.egg-info # Plugin metadata└── vllm_flap_plugin/ # vLLM compatibility plugin├── __init__.py└── README.md
Compression Summary
| Compression Stage | Method | Reduction |
|---|---|---|
| FLAP Pruning | Non-uniform FFN pruning | ~15% FFN parameters |
| GPTQ INT4 Quantization | Weight-only 4-bit quantization | ~25% memory for quantized modules |
Total Model Size: ~9.9 GB (compared to ~16 GB for the original bfloat16 model, approximately 38% reduction)
Model provider
ISCASRGL
Model tree
Base
google/gemma-4-E4B-it
Quantized
this model
Modalities
Input
Text, Image
Output
Text
Pricing
Dedicated Endpoints
View detailsSupported Functionality
Model APIs
Dedicated Endpoints
Container
More information