xv0y5ncu

Gemma-4-E4B-it-GLQ-3.5bpw-mix3-8

Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more

Get help setting up a custom Dedicated Endpoints.

Talk with our engineer to get a quote for reserved GPU instances with discounts.

README

License: apache-2.0

Quick start

bash

pip install glq

python

import glq.hf_integration # registers the GLQ quantization config
from transformers import AutoModelForCausalLM, AutoTokenizer
model = AutoModelForCausalLM.from_pretrained(
"xv0y5ncu/Gemma-4-E4B-it-GLQ-3.5bpw-mix3-8",
torch_dtype="bfloat16",
device_map="cuda",
)
tokenizer = AutoTokenizer.from_pretrained("xv0y5ncu/Gemma-4-E4B-it-GLQ-3.5bpw-mix3-8")

vLLM (pip install vllm + pip install glq):

bash

vllm serve xv0y5ncu/Gemma-4-E4B-it-GLQ-3.5bpw-mix3-8 --quantization glq

Quality vs other 3-bpw-class quants

mmlu_pro at 2% sample (n = 247), all generation with enable_thinking=true:

Table
Variantbpw (param-weighted)accstderrenginevs bf16
bf1616.00.66400.0289vLLM100%
GLQ 3.5 mix 3–8 (this model)3.500.56280.0310vLLM84.8%
GGUF Q3_K_S3.890.46960.0300llama.cpp70.7%
GGUF UD-IQ3_XXS (calibrated)3.750.48180.0295llama.cpp72.6%
GLQ uniform 3.03.000.44130.0311vLLM66.5%

At 3.5 bpw — lower than both GGUF Q3 variants — GLQ outperforms them by ~9pp thanks to sensitivity-driven bit allocation.

Bit allocation summary

markdown

Total quantized weights: 3,945,267,200
3 bpw: 113 sublayers (2,941,255,680 weights, 74.6%) — mostly mlp.{gate,up,down}_proj
4 bpw: 41 sublayers ( 525,598,720 weights, 13.3%)
5 bpw: 35 sublayers ( 200,540,160 weights, 5.1%)
6 bpw: 32 sublayers ( 110,755,840 weights, 2.8%)
7 bpw: 72 sublayers ( 121,896,960 weights, 3.1%)
8 bpw: 49 sublayers ( 45,219,840 weights, 1.1%) — most sensitive: k_proj, per_layer_*
PLE embedding: 4 bpw (separate)
embed_tokens, lm_head, norms: bf16
Average SQNR: 21.77 dB

Quantization details

  • Method: E8 lattice codebook (65 536 entries) + RHT (random Hadamard transform) + LDLQ optimal rounding, with a 256-entry second-stage codebook for residual quantization (RVQ).
  • Calibration: 128 samples × 2048 tokens from C4.
  • Allocator: greedy marginal-gain over per-layer proxy losses, profiled at 3–8 bpw.
  • Quantizer: glq.quantize_model.
  • Hardware: NVIDIA RTX PRO 6000 Blackwell, 96 GB VRAM. Total quantization wall time: ~8 min (profile + allocate + quantize).

License

License: Apache 2.0 — see https://ai.google.dev/gemma/docs/gemma_4_license

Original model: https://huggingface.co/google/gemma-4-E4B-it

Derivative quantization work; the base Gemma 4 model and the GLQ tooling (https://github.com/cnygaard/glq) are both Apache 2.0. Quantizing the weights does not change the license.


🔗 GLQ on GitHub: https://github.com/cnygaard/glq — if you like it, a ⭐ is appreciated.

Model provider

xv0y5ncu

Model tree

Base

google/gemma-4-E4B-it

Quantized

this model

Modalities

Input

Text, Image

Output

Text

Pricing

Dedicated Endpoints

View details

Supported Functionality

Model APIs

Dedicated Endpoints

Container

More information

Explore FriendliAI today