License: apache-2.0

What this is
- Source: `unsloth/gemma-4-31B-it-qat-q4_0-unquantized` (bf16, ~58 GB), Google's QAT checkpoint de-quantized back to bf16.
- Method: bitsandbytes 4-bit, replicating the exact `quantization_config` used by the official Unsloth dynamic-4bit repo:
  - `bnb_4bit_quant_type = "nf4"`
  - `bnb_4bit_use_double_quant = true`
  - `bnb_4bit_compute_dtype = "bfloat16"`
  - `bnb_4bit_quant_storage = "uint8"`
  - the full `llm_int8_skip_modules` list, so embeddings, `lm_head`, the vision tower, the multimodal projector and all per-layer/vision modules stay in bf16. Only the language-model decoder linears (600 `Linear4bit` modules across 60 layers) are quantized.
- Result: ~17.8 GB on GPU (down from ~58 GB), which fits comfortably on a single 24 GB GPU for inference / QLoRA.
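As a rough consistency check on the sizes above, the sketch below estimates what fraction of the parameters would have to be stored in NF4 for a ~58 GB bf16 checkpoint to shrink to ~17.8 GB. It is a back-of-envelope calculation, not a measurement. The assumptions are mine: 1 GB = 1e9 bytes, NF4 block size 64 with 8-bit block constants, double quantization grouping 256 of those constants under one fp32 value, and no other GPU buffers counted.

```python
# Back-of-envelope size check. All constants below are assumptions, not measurements.
BF16_BYTES_PER_PARAM = 2.0

# NF4 with double quantization: 4 bits per weight, plus an 8-bit constant per
# block of 64 weights, plus one fp32 constant per 256 of those blocks.
nf4_bits_per_param = 4 + 8 / 64 + 32 / (64 * 256)
nf4_bytes_per_param = nf4_bits_per_param / 8

source_bytes = 58e9      # "~58 GB" bf16 source, taken from the list above
quantized_bytes = 17.8e9  # "~17.8 GB" result, taken from the list above

total_params = source_bytes / BF16_BYTES_PER_PARAM

# Solve: total * (f * nf4 + (1 - f) * bf16) = quantized_bytes, for the fraction f.
quantized_fraction = (source_bytes - quantized_bytes) / (
    total_params * (BF16_BYTES_PER_PARAM - nf4_bytes_per_param)
)

print(f"NF4 storage: {nf4_bits_per_param:.3f} bits per parameter")
print(f"Implied share of parameters stored in NF4: {quantized_fraction:.1%}")
```

Under these assumptions the large majority of parameters end up in NF4, which is consistent with only the decoder linears being quantized while embeddings and vision modules stay in bf16.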
📊 Perplexity comparison (measured)
Identical evaluation harness across all three models: detokenized WikiText-103 paragraphs (80 docs, ~30k tokens), <bos> prepended per window, Gemma final-logit softcapping applied, computed on the inner text decoder.
| Model | What it is | Perplexity ↓ |
|---|---|---|
| `unsloth/gemma-4-31B-it-qat-q4_0-unquantized` | QAT bf16 (upper bound) | 682.4 |
| `Luminia/gemma-4-31B-it-qat-bnb-4bit` (this) | QAT → NF4 | 732.7 |
| `unsloth/gemma-4-31B-it-unsloth-bnb-4bit` | vanilla it → NF4 (existing) | 2295.6 |
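For readers unfamiliar with the metric, here is a minimal pure-Python sketch of perplexity with final-logit softcapping, the two ingredients named in the harness description. It is not the evaluation script used for the table. The function names are hypothetical, and the cap value in the usage line is illustrative; in practice it would be read from the model config.

```python
import math


def softcap(logits, cap):
    """Squash logits smoothly into (-cap, cap) via cap * tanh(logit / cap)."""
    return [cap * math.tanh(z / cap) for z in logits]


def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary-sized list."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]


def perplexity(per_token_logits, target_ids, cap=None):
    """exp(mean negative log-likelihood) of the target tokens."""
    nll = 0.0
    for logits, target in zip(per_token_logits, target_ids):
        if cap is not None:
            logits = softcap(logits, cap)
        nll -= log_softmax(logits)[target]
    return math.exp(nll / len(target_ids))


# Toy usage: a model that is uniform over a 4-token vocabulary.
uniform_logits = [[0.0, 0.0, 0.0, 0.0]] * 3
print(perplexity(uniform_logits, [0, 1, 2], cap=30.0))  # cap value is illustrative
```

A uniform distribution over V tokens gives a perplexity of V, which is a handy sanity check for any harness.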
Takeaways:
- This QAT→NF4 model's perplexity is only 7.4 % higher than that of its bf16 source (732.7 vs 682.4), so NF4 quantization barely degraded it.
- Its perplexity is ~3.1× lower than that of the existing vanilla→NF4 bnb-4bit model (732.7 vs 2295.6), which is consistent with the QAT hypothesis.
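The two figures in the takeaways follow directly from the table. The short calculation below reproduces them; the dictionary keys are labels chosen here for illustration.

```python
# Perplexities copied from the table above; keys are illustrative labels.
ppl = {
    "qat_bf16": 682.4,
    "qat_nf4": 732.7,
    "vanilla_nf4": 2295.6,
}

# Relative perplexity increase of QAT->NF4 over its bf16 source, in percent.
increase_pct = (ppl["qat_nf4"] / ppl["qat_bf16"] - 1.0) * 100.0

# How many times lower the QAT->NF4 perplexity is than vanilla->NF4.
improvement_ratio = ppl["vanilla_nf4"] / ppl["qat_nf4"]

print(f"QAT->NF4 vs bf16: +{increase_pct:.1f} %")
print(f"vanilla->NF4 vs QAT->NF4: {improvement_ratio:.1f}x")
```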
⚠️ The absolute PPL values are inflated by rare proper-noun / encyclopedic tokens in WikiText and are not comparable to numbers from other harnesses. Only the relative ranking (computed under one identical harness) is meaningful.
⚠️ Honest caveat: NF4 ≠ q4_0
Google's QAT for Gemma targets q4_0 (a llama.cpp/GGUF symmetric int4 scheme), not NF4 with double quantization. The QAT robustness largely transfers (as the table shows), but the resulting error profile is not the one the GGUF q4_0 QAT artifact would give. If you want the literal QAT target precision, use the GGUF q4_0 build. This repo is the bitsandbytes/QLoRA-friendly counterpart.
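To make the difference concrete, the sketch below quantizes one block of weights two ways: onto an evenly spaced symmetric int4 grid, and onto the 16-entry NF4 codebook. It is a simplified illustration, not the llama.cpp q4_0 kernel or the bitsandbytes kernel: the scale choice, rounding, and block sizes of the real implementations differ in detail, and double quantization of the block constants is omitted. The codebook values are the NF4 levels as I recall them from bitsandbytes; treat them as approximate.

```python
# NF4 codebook (normalized to [-1, 1]); levels are denser near zero.
NF4_LEVELS = [
    -1.0, -0.6961928009986877, -0.5250730514526367, -0.39491748809814453,
    -0.28444138169288635, -0.18477343022823334, -0.09105003625154495, 0.0,
    0.07958029955625534, 0.16093020141124725, 0.24611230194568634,
    0.33791524171829224, 0.44070982933044434, 0.5626170039176941,
    0.7229568362236023, 1.0,
]


def quantize_uniform_int4(block):
    """Simplified symmetric int4: one scale per block, evenly spaced levels."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return list(block)
    scale = amax / 7.0
    return [max(-8, min(7, round(x / scale))) * scale for x in block]


def quantize_nf4(block):
    """Simplified NF4: scale by the block absmax, snap to the nearest codebook level."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return list(block)
    return [min(NF4_LEVELS, key=lambda v: abs(v - x / amax)) * amax for x in block]


def max_abs_error(original, dequantized):
    return max(abs(a - b) for a, b in zip(original, dequantized))


# Toy block of 32 evenly spread weights (illustrative data, not model weights).
block = [0.02 * i - 0.31 for i in range(32)]
print("uniform int4 max error:", max_abs_error(block, quantize_uniform_int4(block)))
print("NF4 max error:         ", max_abs_error(block, quantize_nf4(block)))
```

The two grids place their levels differently, so weights trained to sit well on one grid incur a different rounding error on the other. That is the sense in which the NF4 error profile differs from the q4_0 target.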
Usage
```python
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "Nekochu/gemma-4-31B-it-qat-bnb-4bit"
model = AutoModelForImageTextToText.from_pretrained(model_id, device_map="auto")
proc = AutoProcessor.from_pretrained(model_id)

messages = [{"role": "user", "content": [{"type": "text", "text": "Name three primary colors."}]}]
inputs = proc.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt",
).to(model.device)

out = model.generate(**inputs, max_new_tokens=64)
# Decode only the newly generated tokens, skipping the prompt.
print(proc.tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```
Requires transformers, accelerate, bitsandbytes, pillow, torchvision.
Sanity check
Greedy generation after quantization:
- Prompt: Name three primary colors.
  Output: "The three primary colors are red, blue, and yellow."
- Prompt: Explain what quantization-aware training is in one sentence.
  Output: "Quantization-aware training (QAT) is a model training process that simulates the effects of low-precision quantization during the forward pass, allowing the network to learn weights that are more robust to the rounding errors introduced when converting the model to a smaller format."
Reproduction
Quantized with transformers==5.10.2, bitsandbytes==0.49.2, torch==2.7.1+cu126 on a single A100-80GB by loading the bf16 source with the BitsAndBytesConfig above and then calling model.save_pretrained(...). Perplexity was evaluated with the same stack.
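The procedure described above could look roughly like the following. This is a sketch under stated assumptions, not the exact script used for this repo: `quantize_and_save` and its parameter names are hypothetical, the skip list must be supplied by the caller (copied from the Unsloth repo's config), and saving the processor alongside the model is an addition of mine that the description does not mention.

```python
def quantize_and_save(source_id, output_dir, skip_modules):
    """Load a bf16 checkpoint with on-the-fly NF4 quantization and save the result.

    Hypothetical helper; needs a CUDA GPU plus transformers, accelerate and bitsandbytes.
    """
    # Imported lazily so this file can be imported on a machine without the GPU stack.
    import torch
    from transformers import (
        AutoModelForImageTextToText,
        AutoProcessor,
        BitsAndBytesConfig,
    )

    bnb_config = BitsAndBytesConfig(
        load_in_4bit=True,
        bnb_4bit_quant_type="nf4",
        bnb_4bit_use_double_quant=True,
        bnb_4bit_compute_dtype=torch.bfloat16,
        bnb_4bit_quant_storage=torch.uint8,
        # Modules listed here are left unquantized (embeddings, lm_head, vision parts, ...).
        llm_int8_skip_modules=list(skip_modules),
    )

    model = AutoModelForImageTextToText.from_pretrained(
        source_id,
        quantization_config=bnb_config,
        dtype=torch.bfloat16,  # older transformers releases spell this `torch_dtype`
        device_map="auto",
    )
    model.save_pretrained(output_dir)

    # Assumption: ship the processor with the weights so the repo loads on its own.
    AutoProcessor.from_pretrained(source_id).save_pretrained(output_dir)
    return output_dir
```

A call would pass the source repo id, a local output directory, and the skip list, for example `quantize_and_save("unsloth/gemma-4-31B-it-qat-q4_0-unquantized", "./out", skip_modules)`, where `skip_modules` holds the full list taken from the reference config.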
License
Apache-2.0, inherited from the base model. Gemma is subject to the Gemma Terms of Use.
Model provider: Luminia
Model tree: base unsloth/gemma-4-31B-it-qat-q4_0-unquantized; quantized: this model
Modalities: input text and image; output text