xv0y5ncu/Gemma-4-12B-it-GLQ-5.0bpw API & Inference Endpoint

Runtime support

Run this model with Hugging Face Transformers (see Usage below). GLQ does not yet support this model under vLLM.

Usage

```bash
pip install glq
```

```python
import torch, glq.hf_integration              # registers the GLQ quantizer
import transformers as tf
from transformers import AutoConfig, AutoTokenizer

model_id = "xv0y5ncu/Gemma-4-12B-it-GLQ-5.0bpw"
cfg = AutoConfig.from_pretrained(model_id)
# Resolve the model class from the architecture name stored in the config
model = getattr(tf, cfg.architectures[0]).from_pretrained(
    model_id, dtype=torch.float16).to("cuda")
tok = AutoTokenizer.from_pretrained(model_id)

enc = tok.apply_chat_template(
    [{"role": "user", "content": "What is the capital of France?"}],
    add_generation_prompt=True, return_tensors="pt", return_dict=True,
).to("cuda")
out = model.generate(**enc, max_new_tokens=64, do_sample=False)
# Decode only the newly generated tokens, skipping the prompt
print(tok.decode(out[0, enc["input_ids"].shape[1]:], skip_special_tokens=True))
# -> The capital of France is Paris.
```

Gemma-4-12B-it is instruction-tuned and ships with a thinking-mode chat template, so always build prompts with apply_chat_template rather than passing raw text.

Method & fidelity

E8 lattice codebook (65 536 entries) + randomized Hadamard transform + LDLQ, block-diagonal (no power-of-2 padding).
Per-layer mixed precision (4–8 bpw), target average 5.0 bpw, 328 layers.
Average quantization fidelity: SQNR 25.5 dB.
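
To make these numbers easier to read, here is a minimal sketch of how SQNR and a parameter-weighted average bpw are commonly computed. It is not GLQ's implementation: the helper names (`sqnr_db`, `relative_rms_error`, `average_bpw`), the uniform-rounding toy quantizer, and the per-layer sizes are all assumptions made up for illustration. SQNR is taken here as 10·log10(‖W‖² / ‖W − Q‖²), the usual definition.

```python
import math
import random


def sqnr_db(original, quantized):
    """Signal-to-quantization-noise ratio in dB: 10*log10(sum(w^2) / sum((w-q)^2))."""
    signal = sum(w * w for w in original)
    noise = sum((w - q) ** 2 for w, q in zip(original, quantized))
    return 10.0 * math.log10(signal / noise)


def relative_rms_error(sqnr):
    """Relative RMS error implied by an SQNR value given in dB."""
    return 10.0 ** (-sqnr / 20.0)


def average_bpw(layers):
    """Parameter-weighted mean bits per weight over (num_params, bpw) pairs."""
    total_params = sum(n for n, _ in layers)
    return sum(n * b for n, b in layers) / total_params


# Toy stand-in for a quantizer: uniform rounding to a fixed step.
# This is NOT the E8 / Hadamard / LDLQ pipeline described above.
random.seed(0)
weights = [random.gauss(0.0, 1.0) for _ in range(10_000)]
step = 0.25
rounded = [round(w / step) * step for w in weights]
print(f"toy SQNR: {sqnr_db(weights, rounded):.1f} dB")

# An SQNR of 25.5 dB corresponds to a relative RMS error of roughly 5.3%.
print(f"relative RMS error at 25.5 dB: {relative_rms_error(25.5):.3f}")

# Hypothetical per-layer allocation (parameter count, bpw), chosen to average 5.0.
layers = [(2_000_000, 4), (1_000_000, 5), (500_000, 6), (500_000, 8)]
print(f"average bpw: {average_bpw(layers):.2f}")
```

Read this way, 25.5 dB means the quantization error carries about 0.3% of the weight energy, or about 5% in RMS terms. The average bpw is weighted by parameter count, so a few small layers at 8 bpw can coexist with a 5.0 bpw overall target.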

Downstream quality benchmarks (lm-eval / perplexity) have not yet been run for this checkpoint; output has been verified coherent on simple prompts.

License

Released under Apache 2.0, inherited from the base model. Please review and comply with the original Gemma 4 license: https://ai.google.dev/gemma/docs/gemma_4_license

⭐ GLQ on GitHub
