License: apache-2.0
Runtime support
Run this model with Hugging Face Transformers (see Usage below). GLQ does not yet support vLLM for this model.
Usage
bash
pip install glq
python
import torch, glq.hf_integration  # registers the GLQ quantizer
import transformers as tf
from transformers import AutoConfig, AutoTokenizer

model_id = "xv0y5ncu/Gemma-4-12B-it-GLQ-5.0bpw"
cfg = AutoConfig.from_pretrained(model_id)
model = getattr(tf, cfg.architectures[0]).from_pretrained(model_id, dtype=torch.float16).to("cuda")
tok = AutoTokenizer.from_pretrained(model_id)

enc = tok.apply_chat_template(
    [{"role": "user", "content": "What is the capital of France?"}],
    add_generation_prompt=True, return_tensors="pt", return_dict=True,
).to("cuda")
out = model.generate(**enc, max_new_tokens=64, do_sample=False)
print(tok.decode(out[0, enc["input_ids"].shape[1]:], skip_special_tokens=True))
# -> The capital of France is Paris.
gemma-4-12B-it is instruction-tuned and its chat template includes a thinking mode, so always build prompts with apply_chat_template rather than passing raw text.
Method & fidelity
- E8 lattice codebook (65 536 entries) + randomized Hadamard transform + LDLQ, block-diagonal (no power-of-2 padding).
- Per-layer mixed precision (4–8 bpw), target average 5.0 bpw, 328 layers.
- Average quantization fidelity: SQNR 25.5 dB.
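To make the "randomized Hadamard transform, block-diagonal (no power-of-2 padding)" item more concrete, here is a minimal pure-Python sketch. It is not GLQ's implementation. It assumes that "block-diagonal" means splitting a dimension into power-of-two blocks (here, by its binary decomposition) and rotating each block separately, which is one plausible reading. All function names are hypothetical.

```python
import math
import random

def fwht(block):
    """Orthonormal fast Walsh-Hadamard transform of a power-of-two-length list."""
    x = list(block)
    n = len(x)
    h = 1
    while h < n:
        for i in range(0, n, 2 * h):
            for j in range(i, i + h):
                a, b = x[j], x[j + h]
                x[j], x[j + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in x]

def block_sizes(n):
    """Split n into descending powers of two, e.g. 12 -> [8, 4] (assumed scheme)."""
    if n <= 0:
        return []
    sizes = []
    bit = 1 << (n.bit_length() - 1)
    while bit:
        if n & bit:
            sizes.append(bit)
        bit >>= 1
    return sizes

def rht_forward(x, signs):
    """Flip signs at random, then apply an orthonormal Hadamard transform per block."""
    flipped = [v * s for v, s in zip(x, signs)]
    out, start = [], 0
    for size in block_sizes(len(x)):
        out.extend(fwht(flipped[start:start + size]))
        start += size
    return out

def rht_inverse(y, signs):
    """Undo rht_forward: the normalized Hadamard matrix is its own inverse."""
    out, start = [], 0
    for size in block_sizes(len(y)):
        out.extend(fwht(y[start:start + size]))
        start += size
    return [v * s for v, s in zip(out, signs)]

rng = random.Random(0)
x = [rng.gauss(0.0, 1.0) for _ in range(12)]          # length 12 is not a power of two
signs = [rng.choice((-1.0, 1.0)) for _ in range(12)]  # the "randomized" part
y = rht_forward(x, signs)
x_back = rht_inverse(y, signs)
```

Because each block is rotated by an orthonormal matrix, the transform preserves the vector's norm and is exactly invertible, and a length of 12 is handled as blocks of 8 and 4 with no padding to 16.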
Downstream quality benchmarks (lm-eval / perplexity) have not yet been run for this checkpoint; output has been verified coherent on simple prompts.
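Since only weight-level fidelity is reported so far, it may help to see what an SQNR figure measures. The sketch below uses the standard definition (signal energy over quantization-error energy, in decibels). How this checkpoint's 25.5 dB average was aggregated across layers is not stated, so the helper names and the per-tensor usage here are assumptions for illustration.

```python
import math

def sqnr_db(original, quantized):
    """Signal-to-quantization-noise ratio in dB for two equal-length weight lists."""
    signal = sum(w * w for w in original)
    noise = sum((w - q) ** 2 for w, q in zip(original, quantized))
    if noise == 0.0:
        return math.inf  # perfect reconstruction
    return 10.0 * math.log10(signal / noise)

def noise_fraction(sqnr):
    """Error energy as a fraction of signal energy for a given SQNR in dB."""
    return 10.0 ** (-sqnr / 10.0)

# Toy example: one weight is off by 0.1 and the other is reconstructed exactly.
example_db = sqnr_db([1.0, 0.0], [0.9, 0.0])

# Convert the reported 25.5 dB figure into a relative error energy.
reported_fraction = noise_fraction(25.5)
```

Under this definition, 25.5 dB corresponds to error energy of roughly 0.28% of the weight energy, or an RMS error of about 5.3% of the RMS weight magnitude. This says nothing directly about downstream task quality.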
License
Released under Apache 2.0, inherited from the base model. Please review and comply with the original Gemma 4 license: https://ai.google.dev/gemma/docs/gemma_4_license
Model provider: xv0y5ncu
Base model: google/gemma-4-12B-it (this model is a quantized derivative)
Input modalities: Video, Audio, Text, Image
Output modalities: Text