xv0y5ncu

Gemma-4-E4B-it-GLQ-3.5bpw-mix3-8

README

License: apache-2.0

Quick start

bash
pip install glq

python
import glq.hf_integration  # registers the GLQ quantization config
from transformers import AutoModelForCausalLM, AutoTokenizer

model = AutoModelForCausalLM.from_pretrained(
    "xv0y5ncu/Gemma-4-E4B-it-GLQ-3.5bpw-mix3-8",
    torch_dtype="bfloat16",
    device_map="cuda",
)
tokenizer = AutoTokenizer.from_pretrained("xv0y5ncu/Gemma-4-E4B-it-GLQ-3.5bpw-mix3-8")

vLLM (pip install vllm + pip install glq):

bash
vllm serve xv0y5ncu/Gemma-4-E4B-it-GLQ-3.5bpw-mix3-8 --quantization glq

Quality vs other 3-bpw-class quants

mmlu_pro at 2% sample (n = 247), all generation with enable_thinking=true:

Table with columns: Variant, bpw (param-weighted), acc, stderr, engine, vs bf16
Variant	bpw (param-weighted)	acc	stderr	engine	vs bf16
bf16	16.0	0.6640	0.0289	vLLM	100%
GLQ 3.5 mix 3–8 (this model)	3.50	0.5628	0.0310	vLLM	84.8%
GGUF Q3_K_S	3.89	0.4696	0.0300	llama.cpp	70.7%
GGUF UD-IQ3_XXS (calibrated)	3.75	0.4818	0.0295	llama.cpp	72.6%
GLQ uniform 3.0	3.00	0.4413	0.0311	vLLM	66.5%

At 3.5 bpw — lower than both GGUF Q3 variants — GLQ outperforms them by ~9pp thanks to sensitivity-driven bit allocation.

Bit allocation summary

markdown
Total quantized weights: 3,945,267,200
   3 bpw: 113 sublayers (2,941,255,680 weights, 74.6%)  — mostly mlp.{gate,up,down}_proj
   4 bpw:  41 sublayers (  525,598,720 weights, 13.3%)
   5 bpw:  35 sublayers (  200,540,160 weights,  5.1%)
   6 bpw:  32 sublayers (  110,755,840 weights,  2.8%)
   7 bpw:  72 sublayers (  121,896,960 weights,  3.1%)
   8 bpw:  49 sublayers (   45,219,840 weights,  1.1%)  — most sensitive: k_proj, per_layer_*

PLE embedding: 4 bpw (separate)
embed_tokens, lm_head, norms: bf16
Average SQNR: 21.77 dB

Quantization details

Method: E8 lattice codebook (65 536 entries) + RHT (random Hadamard transform) + LDLQ optimal rounding, with a 256-entry second-stage codebook for residual quantization (RVQ).
Calibration: 128 samples × 2048 tokens from C4.
Allocator: greedy marginal-gain over per-layer proxy losses, profiled at 3–8 bpw.
Quantizer: glq.quantize_model.
Hardware: NVIDIA RTX PRO 6000 Blackwell, 96 GB VRAM. Total quantization wall time: ~8 min (profile + allocate + quantize).

License

License: Apache 2.0 — see https://ai.google.dev/gemma/docs/gemma_4_license

Original model: https://huggingface.co/google/gemma-4-E4B-it

Derivative quantization work; the base Gemma 4 model and the GLQ tooling (https://github.com/cnygaard/glq) are both Apache 2.0. Quantizing the weights does not change the license.

🔗 GLQ on GitHub: https://github.com/cnygaard/glq — if you like it, a ⭐ is appreciated.

Available on FriendliAI

Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Learn more

Container

Run this model inference with full control and performance in your environment.

Learn more

Model Details

Model Provider

xv0y5ncu

Model Tree

Base

google/gemma-4-E4B-it

Quantized

this model

Input Modalities

TextImage

Output Modalities

Text

Supported Functionality

Dedicated EndpointsContainer

Explore FriendliAI today

Get started Talk to an engineer