xv0y5ncu
Gemma-4-E4B-it-GLQ-3.5bpw-mix3-8
Run this model inference on single tenant GPU with unmatched speed and reliability at scale.
Get help setting up a custom Dedicated Endpoints.
Talk with our engineer to get a quote for reserved GPU instances with discounts.
README
License: apache-2.0Quick start
bash
pip install glq
python
import glq.hf_integration # registers the GLQ quantization configfrom transformers import AutoModelForCausalLM, AutoTokenizermodel = AutoModelForCausalLM.from_pretrained("xv0y5ncu/Gemma-4-E4B-it-GLQ-3.5bpw-mix3-8",torch_dtype="bfloat16",device_map="cuda",)tokenizer = AutoTokenizer.from_pretrained("xv0y5ncu/Gemma-4-E4B-it-GLQ-3.5bpw-mix3-8")
vLLM (pip install vllm + pip install glq):
bash
vllm serve xv0y5ncu/Gemma-4-E4B-it-GLQ-3.5bpw-mix3-8 --quantization glq
Quality vs other 3-bpw-class quants
mmlu_pro at 2% sample (n = 247), all generation with enable_thinking=true:
| Variant | bpw (param-weighted) | acc | stderr | engine | vs bf16 |
|---|---|---|---|---|---|
| bf16 | 16.0 | 0.6640 | 0.0289 | vLLM | 100% |
| GLQ 3.5 mix 3–8 (this model) | 3.50 | 0.5628 | 0.0310 | vLLM | 84.8% |
| GGUF Q3_K_S | 3.89 | 0.4696 | 0.0300 | llama.cpp | 70.7% |
| GGUF UD-IQ3_XXS (calibrated) | 3.75 | 0.4818 | 0.0295 | llama.cpp | 72.6% |
| GLQ uniform 3.0 | 3.00 | 0.4413 | 0.0311 | vLLM | 66.5% |
At 3.5 bpw — lower than both GGUF Q3 variants — GLQ outperforms them by ~9pp thanks to sensitivity-driven bit allocation.
Bit allocation summary
markdown
Total quantized weights: 3,945,267,2003 bpw: 113 sublayers (2,941,255,680 weights, 74.6%) — mostly mlp.{gate,up,down}_proj4 bpw: 41 sublayers ( 525,598,720 weights, 13.3%)5 bpw: 35 sublayers ( 200,540,160 weights, 5.1%)6 bpw: 32 sublayers ( 110,755,840 weights, 2.8%)7 bpw: 72 sublayers ( 121,896,960 weights, 3.1%)8 bpw: 49 sublayers ( 45,219,840 weights, 1.1%) — most sensitive: k_proj, per_layer_*PLE embedding: 4 bpw (separate)embed_tokens, lm_head, norms: bf16Average SQNR: 21.77 dB
Quantization details
- Method: E8 lattice codebook (65 536 entries) + RHT (random Hadamard transform) + LDLQ optimal rounding, with a 256-entry second-stage codebook for residual quantization (RVQ).
- Calibration: 128 samples × 2048 tokens from C4.
- Allocator: greedy marginal-gain over per-layer proxy losses, profiled at 3–8 bpw.
- Quantizer:
glq.quantize_model. - Hardware: NVIDIA RTX PRO 6000 Blackwell, 96 GB VRAM. Total quantization wall time: ~8 min (profile + allocate + quantize).
License
License: Apache 2.0 — see https://ai.google.dev/gemma/docs/gemma_4_license
Original model: https://huggingface.co/google/gemma-4-E4B-it
Derivative quantization work; the base Gemma 4 model and the GLQ tooling (https://github.com/cnygaard/glq) are both Apache 2.0. Quantizing the weights does not change the license.
🔗 GLQ on GitHub: https://github.com/cnygaard/glq — if you like it, a ⭐ is appreciated.
Model provider
xv0y5ncu
Model tree
Base
google/gemma-4-E4B-it
Quantized
this model
Modalities
Input
Text, Image
Output
Text
Pricing
Dedicated Endpoints
View detailsSupported Functionality
Model APIs
Dedicated Endpoints
Container
More information