README

License: apache-2.0

Runtime support

Run this model with Hugging Face Transformers (see Usage below). GLQ does not yet support vLLM for this model.

Usage

bash

pip install glq

python

import torch
import glq.hf_integration  # registers the GLQ quantizer
import transformers as tf
from transformers import AutoConfig, AutoTokenizer

model_id = "xv0y5ncu/Gemma-4-12B-it-GLQ-5.0bpw"
cfg = AutoConfig.from_pretrained(model_id)
model = getattr(tf, cfg.architectures[0]).from_pretrained(
    model_id, dtype=torch.float16).to("cuda")
tok = AutoTokenizer.from_pretrained(model_id)
enc = tok.apply_chat_template(
    [{"role": "user", "content": "What is the capital of France?"}],
    add_generation_prompt=True, return_tensors="pt", return_dict=True,
).to("cuda")
out = model.generate(**enc, max_new_tokens=64, do_sample=False)
print(tok.decode(out[0, enc["input_ids"].shape[1]:], skip_special_tokens=True))
# -> The capital of France is Paris.

gemma-4-12B-it is instruction-tuned and ships with a thinking-mode chat template, so always build prompts with apply_chat_template rather than passing raw text.
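For multi-turn conversations, one way to keep using the chat template is to hold the history as a list of role/content dictionaries and pass the whole list to apply_chat_template. The helper below is a minimal sketch: `build_messages` is a hypothetical name (not part of GLQ or Transformers), and it assumes the template accepts the usual "user" and "assistant" roles.

```python
def build_messages(turns, next_user_message):
    """Build a chat-template message list.

    turns: list of (user_text, assistant_text) pairs from earlier in the chat.
    next_user_message: the new user message that the model should answer.
    """
    messages = []
    for user_text, assistant_text in turns:
        messages.append({"role": "user", "content": user_text})
        messages.append({"role": "assistant", "content": assistant_text})
    messages.append({"role": "user", "content": next_user_message})
    return messages


if __name__ == "__main__":
    history = [("What is the capital of France?", "The capital of France is Paris.")]
    messages = build_messages(history, "And of Italy?")
    for message in messages:
        print(message["role"], "->", message["content"])
    # With the tokenizer from Usage loaded, the list would be passed as:
    # enc = tok.apply_chat_template(messages, add_generation_prompt=True,
    #                               return_tensors="pt", return_dict=True)
```

The resulting list replaces the single-message list in the Usage snippet; the rest of the generation code stays the same.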

Method & fidelity

  • E8 lattice codebook (65 536 entries) + randomized Hadamard transform + LDLQ, block-diagonal (no power-of-2 padding).
  • Per-layer mixed precision (4–8 bpw), target average 5.0 bpw, 328 layers.
  • Average quantization fidelity: SQNR 25.5 dB.
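To make the fidelity figure concrete, the sketch below rotates a vector with a randomized Hadamard transform, quantizes it, and measures SQNR. It is an illustration under stated assumptions, not GLQ's implementation: the function names are hypothetical, plain uniform rounding stands in for the E8 codebook, LDLQ is omitted, the transform only handles power-of-two lengths (so it does not reproduce the block-diagonal handling), and SQNR is assumed to be 10·log10(signal power / quantization noise power) measured on the weights, which the card does not specify.

```python
import math
import random


def fwht(values):
    """Orthonormal fast Walsh-Hadamard transform (length must be a power of two)."""
    out = list(values)
    n = len(out)
    if n == 0 or n & (n - 1):
        raise ValueError("length must be a power of two")
    h = 1
    while h < n:
        for start in range(0, n, 2 * h):
            for i in range(start, start + h):
                a, b = out[i], out[i + h]
                out[i], out[i + h] = a + b, a - b
        h *= 2
    scale = 1.0 / math.sqrt(n)
    return [v * scale for v in out]


def randomized_hadamard(values, signs):
    """Flip signs with a fixed random +/-1 vector, then apply the Hadamard transform."""
    return fwht([s * v for s, v in zip(signs, values)])


def inverse_randomized_hadamard(values, signs):
    """Undo randomized_hadamard: the orthonormal transform is its own inverse."""
    return [s * v for s, v in zip(signs, fwht(values))]


def sqnr_db(reference, approx):
    """Signal-to-quantization-noise ratio in dB (assumed definition, see lead-in)."""
    signal = sum(r * r for r in reference)
    noise = sum((r - a) ** 2 for r, a in zip(reference, approx))
    if noise == 0:
        return math.inf
    if signal == 0:
        return -math.inf
    return 10.0 * math.log10(signal / noise)


def relative_rms_error(sqnr):
    """Relative RMS error implied by an SQNR value given in dB."""
    return 10.0 ** (-sqnr / 20.0)


if __name__ == "__main__":
    rng = random.Random(0)
    weights = [rng.gauss(0.0, 1.0) for _ in range(256)]
    signs = [rng.choice((-1.0, 1.0)) for _ in range(256)]

    rotated = randomized_hadamard(weights, signs)
    step = 0.1  # stand-in uniform rounding grid, NOT the E8 lattice codebook
    rotated_q = [round(v / step) * step for v in rotated]
    restored = inverse_randomized_hadamard(rotated_q, signs)

    print(f"SQNR in rotated domain:   {sqnr_db(rotated, rotated_q):.2f} dB")
    print(f"SQNR on restored weights: {sqnr_db(weights, restored):.2f} dB")
    print(f"25.5 dB -> relative RMS error {relative_rms_error(25.5):.3f}")
```

Because the transform is orthonormal, the two SQNR values agree up to floating-point rounding, so error measured in the rotated domain carries over to the original weights. Under the assumed definition, 25.5 dB corresponds to a relative RMS error of roughly 5.3%.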

Downstream quality benchmarks (lm-eval, perplexity) have not yet been run for this checkpoint. Output has only been checked for coherence on simple prompts.

License

Released under Apache 2.0, inherited from the base model. Please review and comply with the original Gemma 4 license: https://ai.google.dev/gemma/docs/gemma_4_license


GLQ on GitHub

Model provider: xv0y5ncu

Model tree

Base: google/gemma-4-12B-it
Quantized: this model

Modalities

Input: Video, Audio, Text, Image
Output: Text
