Luminia

gemma-4-31B-it-qat-bnb-4bit

What this is

Source: unsloth/gemma-4-31B-it-qat-q4_0-unquantized (bf16, ~58 GB) — Google's QAT checkpoint, de-quantized back to bf16.
Method: bitsandbytes 4-bit, replicating the exact quantization_config used by the official Unsloth dynamic-4bit repo:
- bnb_4bit_quant_type = "nf4"
- bnb_4bit_use_double_quant = true
- bnb_4bit_compute_dtype = "bfloat16"
- bnb_4bit_quant_storage = "uint8"
- the full llm_int8_skip_modules list, so embeddings, lm_head, the vision tower, the multimodal projector and all per-layer/vision modules stay in bf16. Only the language-model decoder linears (600 Linear4bit modules across 60 layers) are quantized.
Result: ~17.8 GB on GPU (down from ~58 GB), fits comfortably on a single 24 GB GPU for inference / QLoRA.
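The settings listed above correspond to keyword arguments of `transformers.BitsAndBytesConfig`. The following is a minimal sketch under stated assumptions, not the exact script used for this repo: `load_in_4bit=True` is implied by "bitsandbytes 4-bit" rather than listed, `SKIP_MODULES` is a placeholder that names only `lm_head` (the full list has to be copied from the Unsloth dynamic-4bit repo's config), and `build_bnb_config` is a hypothetical helper name.

```python
# Sketch of the 4-bit settings described in this section, as BitsAndBytesConfig kwargs.

# Placeholder only: the real llm_int8_skip_modules list is much longer and also
# covers the embeddings, vision tower, multimodal projector and per-layer modules.
SKIP_MODULES = ["lm_head"]

QUANT_KWARGS = {
    "load_in_4bit": True,                  # assumption: implied by "bitsandbytes 4-bit"
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_use_double_quant": True,
    "bnb_4bit_compute_dtype": "bfloat16",  # string form is resolved to a torch dtype
    "bnb_4bit_quant_storage": "uint8",
    "llm_int8_skip_modules": SKIP_MODULES,
}


def build_bnb_config():
    """Build the config object; imported lazily because it needs torch installed."""
    from transformers import BitsAndBytesConfig

    return BitsAndBytesConfig(**QUANT_KWARGS)
```

The resulting object would be passed as `quantization_config=` to `from_pretrained` when loading the bf16 source.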

📊 Quantization fidelity — KL-divergence vs the bf16 source (the right metric)

The question that matters for a 4-bit build is how faithfully NF4 preserves each model relative to its own bf16 source. Fidelity is measured as per-token KL(P_bf16 ‖ P_nf4), using one identical harness for both builds, with log_softmax computed in fp32 and a self-KLD = 0 sanity check (n ≈ 65k scored tokens per dataset):

| KLD vs own bf16 source (lower is better) | vanilla → NF4 | QAT → NF4 (this) |
|---|---|---|
| WikiText-2, median (typical token) | 0.203 | 0.227 |
| WikiText-2, mean (incl. worst-case tail) | 1.113 | 0.386 |
| C4, median | 0.161 | 0.199 |
| C4, mean | 0.807 | 0.321 |
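To make the metric concrete, here is a small pure-Python reference for per-token KL(P_ref ‖ P_quant) with a numerically stable log-softmax, a self-KLD check, and the median/mean summary used in the table. It is an illustration only, not the evaluation harness: the function names and the toy logits are made up, and a real run would vectorise this over model logits (for example with `torch.log_softmax` in fp32).

```python
import math
from statistics import mean, median


def log_softmax(logits):
    """Numerically stable log-softmax over one token position's logits."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def token_kl(ref_logits, quant_logits):
    """KL(P_ref || P_quant) in nats for a single token position."""
    log_p = log_softmax(ref_logits)
    log_q = log_softmax(quant_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def summarize(ref_rows, quant_rows):
    """Median and mean per-token KL over aligned rows of logits."""
    kls = [token_kl(r, q) for r, q in zip(ref_rows, quant_rows)]
    return {"median": median(kls), "mean": mean(kls)}


# Toy logits (made up; 4-token vocabulary, 3 positions), not real model output.
ref = [[2.0, 0.5, -1.0, 0.0], [0.1, 0.2, 3.0, -0.5], [1.0, 1.0, 1.0, 1.0]]
quant = [[1.9, 0.6, -1.1, 0.1], [0.0, 0.3, 2.8, -0.4], [4.0, 0.0, 0.0, 0.0]]

self_check = summarize(ref, ref)  # comparing a model with itself must give 0
stats = summarize(ref, quant)
print(self_check, stats)
```

In the toy data a single badly matched position pulls the mean far above the median, which is the same tail effect the mean rows of the table capture.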

Honest takeaways:

✅ Worst-case suppression (the real QAT win): the mean KLD — driven by rare, high-error tokens — is ~2.5–2.9× lower for QAT→NF4. QAT robustness mainly flattens catastrophic quantization errors in the tail.
⚖️ Typical-token precision: the median KLD is slightly higher (~1.1–1.2×). For the average token, NF4-on-QAT is marginally looser, not tighter, than NF4-on-vanilla.
👉 When to prefer this build: if you care about tail stability / fewer catastrophic token errors. For typical-token fidelity it is on par with (a hair behind) the vanilla bnb-4bit. It is not a blanket "3.1× better" model.
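The ratios quoted in the takeaways follow directly from the table. A short check, using only the four table rows (the `ratios` helper is just an illustrative name):

```python
# KLD values copied from the table: (vanilla -> NF4, QAT -> NF4).
kld = {
    ("WikiText-2", "median"): (0.203, 0.227),
    ("WikiText-2", "mean"): (1.113, 0.386),
    ("C4", "median"): (0.161, 0.199),
    ("C4", "mean"): (0.807, 0.321),
}


def ratios(table):
    """Mean rows: how many times lower QAT is. Median rows: how many times higher."""
    out = {}
    for (dataset, stat), (vanilla, qat) in table.items():
        out[(dataset, stat)] = vanilla / qat if stat == "mean" else qat / vanilla
    return out


for key, r in ratios(kld).items():
    print(key, f"{r:.2f}x")
```

The mean ratios come out at about 2.88 (WikiText-2) and 2.51 (C4), and the median ratios at about 1.12 and 1.24, matching the "~2.5–2.9×" and "~1.1–1.2×" figures.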

Why KL-divergence and not perplexity? On this model family perplexity is an unreliable quantization metric. In testing, both NF4 builds scored lower PPL than their own bf16 source, which cannot reflect a genuine fidelity gain because quantization only adds error relative to the source. Most of any PPL gap between the two builds instead reflects the QAT base model differing from vanilla (bf16 PPL ≈ 268 vs ≈ 705) rather than 4-bit fidelity. KL-divergence against each model's own bf16 source is the meaningful measure.
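A toy example shows how perplexity can improve while the distribution still drifts from its source. The numbers below are invented for illustration and are not measurements from either build: the "quantized" distribution happens to put more mass on the true token, so its perplexity drops, yet its KL-divergence from the reference is clearly nonzero.

```python
import math


def kl(p, q):
    """KL(p || q) in nats for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def perplexity(probs_of_true_tokens):
    """exp of the mean negative log-likelihood of the observed tokens."""
    nll = [-math.log(p) for p in probs_of_true_tokens]
    return math.exp(sum(nll) / len(nll))


# Made-up next-token distributions over a 3-token vocabulary.
p_ref = [0.5, 0.3, 0.2]    # stands in for the bf16 source
p_quant = [0.7, 0.2, 0.1]  # stands in for a 4-bit build
true_token = 0

ppl_ref = perplexity([p_ref[true_token]])
ppl_quant = perplexity([p_quant[true_token]])
drift = kl(p_ref, p_quant)

print(f"PPL ref={ppl_ref:.3f} quant={ppl_quant:.3f} KL={drift:.3f}")
```

Perplexity only looks at the probability of the observed token, whereas KL compares the full distributions, which is why the two metrics can disagree.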

⚠️ Honest caveat: NF4 ≠ q4_0

Google's QAT for Gemma targets q4_0 (a llama.cpp/GGUF symmetric int4 scheme), not NF4 double-quant. The QAT robustness transfers partially (it helps the error tail, as the KLD table shows), but this is not the exact error profile the GGUF q4_0 QAT artifact would give. If you want the literal QAT target precision, use the GGUF q4_0 build. This repo is the bitsandbytes/QLoRA-friendly counterpart.
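The mismatch can be illustrated with the two 4-bit grids. This is a simplified sketch, not either library's implementation: the q4_0-style grid is modelled as 16 uniform levels `(q - 8) * d` with a single scale, ignoring the per-block layout used in GGUF, and the NF4 levels are the published NormalFloat4 values rounded to four decimals. Weights that sit exactly on the uniform grid, an idealisation of what q4_0-targeted QAT encourages, are reproduced exactly by that grid but pick up rounding error on the NF4 grid.

```python
# NF4 levels rounded to 4 decimals (illustrative; bitsandbytes stores more digits).
NF4_LEVELS = [
    -1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
    0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0,
]

# Simplified q4_0-style grid: uniform levels (q - 8) * d for q in 0..15, with d = 1/8.
UNIFORM_LEVELS = [(q - 8) / 8 for q in range(16)]


def snap(x, levels):
    """Round x to the nearest representable level."""
    return min(levels, key=lambda v: abs(v - x))


def max_error(weights, levels):
    """Largest absolute rounding error over a set of weights."""
    return max(abs(w - snap(w, levels)) for w in weights)


# Idealised weights lying exactly on the uniform grid (absmax is 1.0, so no rescaling).
weights = list(UNIFORM_LEVELS)

err_uniform = max_error(weights, UNIFORM_LEVELS)
err_nf4 = max_error(weights, NF4_LEVELS)
print(err_uniform, err_nf4)
```

Real QAT weights are not exactly on such a grid, so this overstates the contrast, but it shows why robustness learned for one set of levels carries over only partially to the other.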

Usage

```python
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "Nekochu/gemma-4-31B-it-qat-bnb-4bit"
# The 4-bit quantization_config is saved with the checkpoint, so none is passed here.
model = AutoModelForImageTextToText.from_pretrained(model_id, device_map="auto")
proc = AutoProcessor.from_pretrained(model_id)

# A single text-only user turn in the chat-template message format.
messages = [{"role": "user", "content": [{"type": "text", "text": "Name three primary colors."}]}]
inputs = proc.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
).to(model.device)

out = model.generate(**inputs, max_new_tokens=64)
# Decode only the newly generated tokens, skipping the prompt.
print(proc.tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

Requires transformers, accelerate, bitsandbytes, pillow, torchvision.

Sanity check

Greedy generation after quantization:

Prompt: Name three primary colors. Output: "The three primary colors are red, blue, and yellow."

Prompt: Explain what quantization-aware training is in one sentence. Output: "Quantization-aware training (QAT) is a model training process that simulates the effects of low-precision quantization during the forward pass, allowing the network to learn weights that are more robust to the rounding errors introduced when converting the model to a smaller format."

Reproduction

Quantized with transformers==5.10.2, bitsandbytes==0.49.2 and torch==2.7.1+cu126 on a single A100-80GB: the bf16 source was loaded with the BitsAndBytesConfig described above and written out with model.save_pretrained(...). KL-divergence was evaluated with an identical harness across both NF4 builds and their respective bf16 sources: the same input IDs, a full-model forward pass so that Gemma's final-logit softcapping is applied identically, KLD computed in fp32 via log_softmax, and self-KLD = 0 verified.
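The softcapping detail matters because it changes the distributions being compared. The sketch below assumes the tanh form `cap * tanh(logit / cap)` and an illustrative cap of 30.0; the actual cap for this model is whatever its config specifies, and the logits are made up. Computing KL on raw pre-cap logits gives a different number from computing it on the capped logits the model actually emits, so both sides of the comparison need the same full forward pass.

```python
import math


def softcap(logits, cap):
    """Final-logit softcapping in its tanh form (cap value is an assumption here)."""
    return [cap * math.tanh(x / cap) for x in logits]


def log_softmax(logits):
    """Numerically stable log-softmax."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def kl_from_logits(ref_logits, quant_logits):
    """KL(P_ref || P_quant) in nats, computed from logits."""
    log_p = log_softmax(ref_logits)
    log_q = log_softmax(quant_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


# Made-up pre-cap logits for one token position.
raw_ref = [30.0, 25.0, 0.0]
raw_quant = [28.0, 26.0, 0.0]
CAP = 30.0  # illustrative value, not read from this model's config

kl_raw = kl_from_logits(raw_ref, raw_quant)
kl_capped = kl_from_logits(softcap(raw_ref, CAP), softcap(raw_quant, CAP))
print(f"KL on raw logits={kl_raw:.4f}, KL on capped logits={kl_capped:.4f}")
```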

License

Apache-2.0, inherited from the base model. Gemma is subject to the Gemma Terms of Use.

Model provider

Luminia

Model tree

Base

unsloth/gemma-4-31B-it-qat-q4_0-unquantized

Quantized

this model

Modalities

Input

Text, Image

Output

Text


