What this is
- Source: unsloth/gemma-4-31B-it-qat-q4_0-unquantized (bf16, ~58 GB), Google's QAT checkpoint de-quantized back to bf16.
- Method: bitsandbytes 4-bit, replicating the exact quantization_config used by the official Unsloth dynamic-4bit repo:
  - bnb_4bit_quant_type = "nf4"
  - bnb_4bit_use_double_quant = true
  - bnb_4bit_compute_dtype = "bfloat16"
  - bnb_4bit_quant_storage = "uint8"
  - the full llm_int8_skip_modules list, so embeddings, lm_head, the vision tower, the multimodal projector and all per-layer/vision modules stay in bf16. Only the language-model decoder linears (600 Linear4bit modules across 60 layers) are quantized.
- Result: ~17.8 GB on GPU (down from ~58 GB), which fits comfortably on a single 24 GB GPU for inference / QLoRA.
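The module count and footprint quoted above can be checked on a loaded model. The following is a minimal sketch, assuming the quantized layers are instances of a class named `Linear4bit` (the bitsandbytes layer class) and that the model exposes `named_modules()` and `get_memory_footprint()` as PyTorch / transformers models do. `summarize_quantization` is an illustrative helper name, not part of any library.

```python
def summarize_quantization(model):
    """Count 4-bit linear layers and report the in-memory size of a loaded model."""
    # Match on the class name so this helper does not need to import bitsandbytes.
    n_4bit = sum(
        1
        for _, module in model.named_modules()
        if type(module).__name__ == "Linear4bit"
    )
    # get_memory_footprint() reports bytes; convert to decimal gigabytes.
    size_gb = model.get_memory_footprint() / 1e9
    return {"linear4bit_modules": n_4bit, "footprint_gb": round(size_gb, 1)}
```

Called on the model loaded as in the Usage section, this should report 600 `Linear4bit` modules if the skip list was applied as described. The footprint may differ slightly from ~17.8 GB depending on whether decimal GB or GiB was meant.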
📊 Perplexity comparison (measured)
Identical evaluation harness across all three models: detokenized WikiText-103 paragraphs (80 docs, ~30k tokens), <bos> prepended per window, Gemma final-logit softcapping applied, computed on the inner text decoder.
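The core of such a harness can be sketched in plain Python. This is an illustration under stated assumptions, not the script that produced the numbers: `cap` stands for the model's final-logit softcap value (not stated here; it would come from the model config), windowing and `<bos>` prepending are left to the caller, and the negative log-likelihood is averaged per token across all windows.

```python
import math


def softcap(logits, cap):
    """Soft-capping: squash logits smoothly into (-cap, cap) via tanh."""
    return [cap * math.tanh(x / cap) for x in logits]


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list."""
    m = max(logits)
    log_z = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - log_z for x in logits]


def perplexity(windows, cap):
    """windows: iterable of (logits_per_position, target_ids) pairs.

    logits_per_position[t] holds the raw logits that predict target_ids[t].
    """
    total_nll, n_tokens = 0.0, 0
    for logits_per_position, target_ids in windows:
        for logits, target in zip(logits_per_position, target_ids):
            log_probs = log_softmax(softcap(logits, cap))
            total_nll -= log_probs[target]
            n_tokens += 1
    # Perplexity is the exponential of the mean per-token negative log-likelihood.
    return math.exp(total_nll / n_tokens)
```

As a sanity check, a model that assigns equal logits to every entry of a 4-token vocabulary gets a perplexity of 4 regardless of the cap.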
| Model | What it is | Perplexity ↓ |
|---|---|---|
| unsloth/gemma-4-31B-it-qat-q4_0-unquantized | QAT bf16 (upper bound) | 682.4 |
| Nekochu/gemma-4-31B-it-qat-bnb-4bit (this) | QAT → NF4 | 732.7 |
| unsloth/gemma-4-31B-it-unsloth-bnb-4bit | vanilla it → NF4 (existing) | 2295.6 |
Takeaways:
- This QAT→NF4 model's perplexity is only 7.4 % above that of its bf16 source (the quality upper bound), so quantization barely degraded it.
- Its perplexity is ~3.1× lower than that of the existing vanilla→NF4 bnb-4bit model. The QAT hypothesis holds.
⚠️ The absolute PPL values are inflated by rare proper-noun / encyclopedic tokens in WikiText and are not comparable to numbers from other harnesses. Only the relative ranking (computed under one identical harness) is meaningful.
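The two relative figures in the takeaways follow directly from the table. A short check, using only the three measured values (the dictionary keys are illustrative labels):

```python
ppl = {
    "qat_bf16": 682.4,      # unsloth/gemma-4-31B-it-qat-q4_0-unquantized
    "qat_nf4": 732.7,       # this repo
    "vanilla_nf4": 2295.6,  # unsloth/gemma-4-31B-it-unsloth-bnb-4bit
}

# Relative perplexity increase of QAT→NF4 over its bf16 source, in percent.
rel_increase_pct = (ppl["qat_nf4"] / ppl["qat_bf16"] - 1) * 100
# How many times higher the vanilla→NF4 perplexity is than QAT→NF4.
ratio = ppl["vanilla_nf4"] / ppl["qat_nf4"]

print(f"QAT→NF4 vs bf16: +{rel_increase_pct:.1f} %")   # +7.4 %
print(f"vanilla→NF4 vs QAT→NF4: {ratio:.1f}x")         # 3.1x
```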
⚠️ Honest caveat: NF4 ≠ q4_0
Google's QAT for Gemma targets q4_0 (a llama.cpp/GGUF symmetric int4 scheme), not NF4 with double quantization. The QAT robustness largely transfers (as the table shows), but this is not the exact error profile the GGUF q4_0 QAT artifact would give. If you want the literal QAT target precision, use the GGUF q4_0 build. This repo is the bitsandbytes/QLoRA-friendly counterpart.
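To make the difference concrete, the sketch below round-trips one block of weights through simplified versions of both schemes. It is an illustration only: the NF4 levels are rounded approximations of the bitsandbytes code book, the q4_0 rounding follows my reading of the ggml reference code, and block sizes, fp16 scale storage and double quantization are all ignored. The function names are hypothetical.

```python
# Approximate NF4 code book (16 non-uniform levels in [-1, 1]); see the
# bitsandbytes source for the exact constants.
NF4_LEVELS = [
    -1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
    0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0,
]


def q4_0_roundtrip(block):
    """Symmetric int4: one scale d per block, 16 evenly spaced levels (-8..7) * d."""
    extreme = max(block, key=abs)  # signed value with the largest magnitude
    d = extreme / -8.0             # chosen so the extreme value maps to code 0
    if d == 0.0:
        return [0.0] * len(block)
    codes = [min(15, int(x / d + 8.5)) for x in block]
    return [(c - 8) * d for c in codes]


def nf4_roundtrip(block):
    """NF4: scale by the block absmax, then snap to the nearest code-book level."""
    absmax = max(abs(x) for x in block)
    if absmax == 0.0:
        return [0.0] * len(block)
    return [
        absmax * min(NF4_LEVELS, key=lambda level: abs(x / absmax - level))
        for x in block
    ]
```

Weights that sit exactly on the uniform q4_0 grid, which is what QAT trains toward, can still pick up rounding error under NF4 because its levels are spaced non-uniformly.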
Usage
from transformers import AutoModelForImageTextToText, AutoProcessor
model_id = "Nekochu/gemma-4-31B-it-qat-bnb-4bit"
model = AutoModelForImageTextToText.from_pretrained(model_id, device_map="auto")
proc = AutoProcessor.from_pretrained(model_id)
messages = [{"role": "user", "content": [{"type": "text", "text": "Name three primary colors."}]}]
inputs = proc.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
).to(model.device)
out = model.generate(**inputs, max_new_tokens=64)
print(proc.tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
Requires transformers, accelerate, bitsandbytes, pillow, torchvision.
Sanity check
Greedy generation after quantization:
Prompt: Name three primary colors.
Output: "The three primary colors are red, blue, and yellow."
Prompt: Explain what quantization-aware training is in one sentence.
Output: "Quantization-aware training (QAT) is a model training process that simulates the effects of low-precision quantization during the forward pass, allowing the network to learn weights that are more robust to the rounding errors introduced when converting the model to a smaller format."
Reproduction
Quantized with transformers==5.10.2, bitsandbytes==0.49.2, torch==2.7.1+cu126 on a single A100-80GB: the bf16 source was loaded with the BitsAndBytesConfig above and written out with model.save_pretrained(...). Perplexity was evaluated with the same stack.
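A minimal sketch of that procedure, assuming the settings listed under "What this is". `quantize_and_save` and its parameters are hypothetical names; `skip_modules` must be filled with the full llm_int8_skip_modules list from the Unsloth dynamic-4bit repo, which is not reproduced here. Heavy imports are placed inside the function so the definition can be loaded without a GPU.

```python
def quantize_and_save(source_id, out_dir, skip_modules):
    """Load a bf16 checkpoint with on-the-fly NF4 quantization and save the result."""
    # Requires transformers, accelerate, bitsandbytes and a CUDA GPU.
    import torch
    from transformers import AutoModelForImageTextToText, BitsAndBytesConfig

    quant_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_quant_storage="uint8",
        llm_int8_skip_modules=skip_modules,  # modules listed here stay unquantized
    )
    model = AutoModelForImageTextToText.from_pretrained(
        source_id,
        quantization_config=quant_config,
        device_map="auto",
        # Assumption: recent transformers releases name this `dtype`;
        # older releases call it `torch_dtype`.
        dtype=torch.bfloat16,
    )
    # Writes the 4-bit weights together with the quantization_config.
    model.save_pretrained(out_dir)
    return model
```

The source checkpoint would be passed as `source_id="unsloth/gemma-4-31B-it-qat-q4_0-unquantized"`. Processor and tokenizer files are not handled by this sketch.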
License
Apache-2.0, inherited from the base model. Gemma is subject to the Gemma Terms of Use.