What this is
- Source: unsloth/gemma-4-31B-it-qat-q4_0-unquantized (bf16, ~58 GB), Google's QAT checkpoint, de-quantized back to bf16.
- Method: bitsandbytes 4-bit, replicating the exact quantization_config used by the official Unsloth dynamic-4bit repo:
  - bnb_4bit_quant_type = "nf4"
  - bnb_4bit_use_double_quant = true
  - bnb_4bit_compute_dtype = "bfloat16"
  - bnb_4bit_quant_storage = "uint8"
  - the full llm_int8_skip_modules list, so embeddings, lm_head, the vision tower, the multimodal projector and all per-layer/vision modules stay in bf16. Only the language-model decoder linears (600 Linear4bit modules across 60 layers) are quantized.
- Result: ~17.8 GB on GPU (down from ~58 GB), fits comfortably on a single 24 GB GPU for inference / QLoRA.
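The settings in the Method bullet map onto keyword arguments of `transformers.BitsAndBytesConfig`. The following is a minimal sketch under stated assumptions, not the exact script used for this repo: `llm_int8_skip_modules` is a placeholder that contains only `lm_head` (the full list is not reproduced here), and `QUANT_KWARGS` and `build_config` are hypothetical names chosen for illustration.

```python
# Keyword arguments mirroring the quantization_config described above.
QUANT_KWARGS = {
    "load_in_4bit": True,
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_use_double_quant": True,
    "bnb_4bit_compute_dtype": "bfloat16",
    "bnb_4bit_quant_storage": "uint8",
    # PLACEHOLDER: the real list also covers embeddings, the vision tower,
    # the multimodal projector and the per-layer/vision modules.
    "llm_int8_skip_modules": ["lm_head"],
}


def build_config(kwargs=QUANT_KWARGS):
    """Build a BitsAndBytesConfig from the settings above.

    The import is local so that this file can be inspected without
    transformers installed.
    """
    from transformers import BitsAndBytesConfig

    return BitsAndBytesConfig(**kwargs)
```

The resulting object would be passed as `quantization_config=` to `from_pretrained` when loading the bf16 source.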
📊 Quantization fidelity — KL-divergence vs the bf16 source (the right metric)
The question that actually matters for a 4-bit build is: how faithfully does NF4 preserve each model relative to its own bf16 source? Measured as per-token KL(P_bf16 ‖ P_nf4), one identical harness for both, fp32 log_softmax, with a self-KLD = 0 sanity check (n ≈ 65k scored tokens/dataset):
| KLD vs own bf16 source ↓ | vanilla → NF4 | QAT → NF4 (this) |
|---|---|---|
| WikiText-2 — median (typical token) | 0.203 | 0.227 |
| WikiText-2 — mean (incl. worst-case tail) | 1.113 | 0.386 |
| C4 — median | 0.161 | 0.199 |
| C4 — mean | 0.807 | 0.321 |
Honest takeaways:
- ✅ Worst-case suppression (the real QAT win): the mean KLD — driven by rare, high-error tokens — is ~2.5–2.9× lower for QAT→NF4. QAT robustness mainly flattens catastrophic quantization errors in the tail.
- ⚖️ Typical-token precision: the median KLD is slightly higher (~1.1–1.2×). For the average token, NF4-on-QAT is marginally looser, not tighter, than NF4-on-vanilla.
- 👉 When to prefer this build: if you care about tail stability / fewer catastrophic token errors. For typical-token fidelity it is on par with (a hair behind) the vanilla bnb-4bit. It is not a blanket "3.1× better" model.
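The ratios quoted in the takeaways can be recomputed from the table. This short sketch only redoes that arithmetic; the dictionary layout and the `ratio` helper are illustrative names, not part of the evaluation harness.

```python
# (vanilla -> NF4, QAT -> NF4) KLD values copied from the table.
KLD = {
    ("WikiText-2", "median"): (0.203, 0.227),
    ("WikiText-2", "mean"): (1.113, 0.386),
    ("C4", "median"): (0.161, 0.199),
    ("C4", "mean"): (0.807, 0.321),
}


def ratio(dataset, stat):
    """Mean: how many times lower QAT -> NF4 is than vanilla -> NF4.
    Median: how many times higher QAT -> NF4 is than vanilla -> NF4."""
    vanilla, qat = KLD[(dataset, stat)]
    return vanilla / qat if stat == "mean" else qat / vanilla


for dataset in ("WikiText-2", "C4"):
    print(
        f"{dataset}: mean {ratio(dataset, 'mean'):.2f}x lower, "
        f"median {ratio(dataset, 'median'):.2f}x higher for QAT -> NF4"
    )
```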
Why KL-divergence and not perplexity? On this model family perplexity is an unreliable quant metric — in testing, both NF4 builds scored lower PPL than their own bf16 source (impossible for true quantization error), and most of any PPL gap reflects the QAT base model differing from vanilla (bf16 PPL ≈ 268 vs ≈ 705) rather than 4-bit fidelity. KL-divergence vs each model's own bf16 source is the meaningful measure.
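A toy example may help show why the two metrics can disagree. Perplexity only looks at the probability assigned to the one observed token, while KL-divergence compares the whole distribution. The three-way distributions below are invented for illustration and are not measurements from either model.

```python
import math


def kl(p, q):
    """KL(p || q) in nats for two discrete distributions."""
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q) if pi > 0)


def nll(dist, target):
    """Negative log-likelihood of the observed token."""
    return -math.log(dist[target])


# Hypothetical next-token distributions over a 3-token vocabulary.
p_ref = [0.5, 0.3, 0.2]    # stands in for the bf16 model
p_quant = [0.6, 0.2, 0.2]  # stands in for the quantized model
target = 0                 # index of the token that actually occurred

ppl_ref = math.exp(nll(p_ref, target))
ppl_quant = math.exp(nll(p_quant, target))

# The quantized distribution happens to put more mass on the observed
# token, so its perplexity is lower, yet it still differs from the reference.
print(f"PPL ref={ppl_ref:.3f} quant={ppl_quant:.3f}")
print(f"KL(ref || quant)={kl(p_ref, p_quant):.4f}")
```

In this toy case perplexity improves even though the quantized distribution has drifted, which is the kind of effect that makes KL-divergence against the model's own source the safer fidelity measure.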
⚠️ Honest caveat: NF4 ≠ q4_0
Google's QAT for Gemma targets q4_0 (a llama.cpp/GGUF symmetric int4 scheme), not NF4 double-quant. The QAT robustness transfers partially (it helps the error tail, as the KLD table shows), but this is not the exact error profile the GGUF q4_0 QAT artifact would give. If you want the literal QAT target precision, use the GGUF q4_0 build. This repo is the bitsandbytes/QLoRA-friendly counterpart.
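The difference between the two formats can be pictured as two different sets of 16 representable values. The sketch below is a simplified illustration under stated assumptions: the NF4 values are rounded approximations of the bitsandbytes codebook, the q4_0-style grid is reduced to evenly spaced integer codes times a single scale, and block sizes, scale storage and double quantization are all ignored. `uniform_levels`, `nearest` and `gaps` are hypothetical helper names.

```python
# Approximate NF4 code values (rounded to 4 decimals for readability).
NF4_LEVELS = [
    -1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
    0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0,
]


def uniform_levels(scale):
    """Simplified q4_0-style grid: 16 evenly spaced codes times one scale."""
    return [(code - 8) * scale for code in range(16)]


def nearest(x, levels):
    """Round x to the closest representable level."""
    return min(levels, key=lambda v: abs(v - x))


def gaps(levels):
    """Spacing between consecutive levels."""
    return [round(b - a, 4) for a, b in zip(levels, levels[1:])]


uniform = uniform_levels(0.125)
print("uniform gaps:", sorted(set(gaps(uniform))))
print("NF4 gaps:    ", gaps(NF4_LEVELS))
print("0.30 ->", nearest(0.30, uniform), "(uniform) vs", nearest(0.30, NF4_LEVELS), "(NF4)")
```

Weights trained to sit well on an evenly spaced grid do not necessarily sit well on the non-uniform one, which is one way to read the partial transfer described above.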
Usage
from transformers import AutoModelForImageTextToText, AutoProcessor
model_id = "Nekochu/gemma-4-31B-it-qat-bnb-4bit"
model = AutoModelForImageTextToText.from_pretrained(model_id, device_map="auto")
proc = AutoProcessor.from_pretrained(model_id)
messages = [{"role": "user", "content": [{"type": "text", "text": "Name three primary colors."}]}]
inputs = proc.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
).to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(proc.tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
Requires transformers, accelerate, bitsandbytes, pillow, torchvision.
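Before loading the model it can be handy to confirm that these packages are present. The following is a small sketch using only the standard library; `REQUIRED` and `installed_versions` are illustrative names, and the distribution names are assumed to match the package names listed above.

```python
from importlib import metadata

# Distribution names taken from the requirement list above.
REQUIRED = ["transformers", "accelerate", "bitsandbytes", "pillow", "torchvision"]


def installed_versions(packages=REQUIRED):
    """Map each package name to its installed version, or None if missing."""
    versions = {}
    for name in packages:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name}: {version or 'MISSING'}")
```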
Sanity check
Greedy generation after quantization:
Prompt: Name three primary colors.
Output: "The three primary colors are red, blue, and yellow."
Prompt: Explain what quantization-aware training is in one sentence.
Output: "Quantization-aware training (QAT) is a model training process that simulates the effects of low-precision quantization during the forward pass, allowing the network to learn weights that are more robust to the rounding errors introduced when converting the model to a smaller format."
Reproduction
Quantized with transformers==5.10.2, bitsandbytes==0.49.2, torch==2.7.1+cu126 on a single A100-80GB by loading the bf16 source with the BitsAndBytesConfig above and model.save_pretrained(...). KL-divergence evaluated with an identical harness across both NF4 builds and their respective bf16 sources (same input IDs, full-model forward so Gemma final-logit softcapping is applied identically, KLD in fp32 via log_softmax, self-KLD = 0 verified).
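The per-token KLD computation described above can be sketched in plain Python. This is an illustration of the metric only, not the harness itself: the actual evaluation runs on full-model logits in fp32 via `log_softmax`, whereas this version uses Python floats, assumes natural-log units (nats), and introduces the hypothetical helpers `token_kld` and `summarize`.

```python
import math
import statistics


def log_softmax(logits):
    """Numerically stable log-softmax over one token's logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def token_kld(ref_logits, quant_logits):
    """KL(P_ref || P_quant) for a single token position, in nats."""
    log_p = log_softmax(ref_logits)
    log_q = log_softmax(quant_logits)
    return sum(math.exp(lp) * (lp - lq) for lp, lq in zip(log_p, log_q))


def summarize(ref_rows, quant_rows):
    """Mean and median per-token KLD over aligned rows of logits."""
    klds = [token_kld(r, q) for r, q in zip(ref_rows, quant_rows)]
    return statistics.mean(klds), statistics.median(klds)


# Tiny made-up logits for three token positions (not real model outputs).
ref = [[2.0, 1.0, 0.1], [0.5, 0.5, 3.0], [1.0, 1.0, 1.0]]
quant = [[1.9, 1.1, 0.1], [0.4, 0.7, 2.8], [3.0, 0.0, 0.0]]

print("self-KLD:", summarize(ref, ref))        # sanity check: should be zero
print("ref vs quant:", summarize(ref, quant))  # mean is pulled up by the last row
```

The last row is deliberately far off, so the mean ends up well above the median, mirroring the tail-driven behaviour seen in the table.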
License
Apache-2.0, inherited from the base model. Gemma is subject to the Gemma Terms of Use.