License: apache-2.0

What this is

  • Source: unsloth/gemma-4-31B-it-qat-q4_0-unquantized (bf16, ~58 GB) — Google's QAT checkpoint, de-quantized back to bf16.
  • Method: bitsandbytes 4-bit, replicating the exact quantization_config used by the official Unsloth dynamic-4bit repo:
    • bnb_4bit_quant_type = "nf4"
    • bnb_4bit_use_double_quant = true
    • bnb_4bit_compute_dtype = "bfloat16"
    • bnb_4bit_quant_storage = "uint8"
    • the full llm_int8_skip_modules list, so embeddings, lm_head, the vision tower, the multimodal projector and all per-layer/vision modules stay in bf16. Only the language-model decoder linears (600 Linear4bit modules across 60 layers) are quantized.
  • Result: ~17.8 GB on GPU (down from ~58 GB), which fits comfortably on a single 24 GB GPU for inference / QLoRA.
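As a rough consistency check on the sizes above, the sketch below estimates the storage cost of NF4 with double quantization and back-solves what share of the bf16 bytes would have to sit in quantized layers to land at ~17.8 GB. The block sizes (64 weights per first-level block, 256 scales per second-level block) are the usual bitsandbytes/QLoRA defaults and are an assumption here, as is ignoring any runtime overhead; the resulting fraction is an estimate, not a measured number.

```python
# Back-of-envelope size check (assumptions: 64-weight NF4 blocks, 8-bit
# double-quantized scales grouped by 256 with one fp32 constant per group).
BLOCK = 64          # weights sharing one absmax scale (assumed default)
NESTED_BLOCK = 256  # scales sharing one fp32 constant (assumed default)

def nf4_bits_per_weight(block=BLOCK, nested_block=NESTED_BLOCK):
    """4-bit code + 8-bit scale per block + fp32 constant per nested block."""
    return 4 + 8 / block + 32 / (block * nested_block)

def quantized_fraction(bf16_gb, quantized_gb, bits=None):
    """Share of the bf16 bytes that must be in NF4 layers to reach quantized_gb."""
    bits = nf4_bits_per_weight() if bits is None else bits
    shrink = 1 - bits / 16  # bytes saved per bf16 byte that gets quantized
    return (bf16_gb - quantized_gb) / (bf16_gb * shrink)

bits = nf4_bits_per_weight()
fraction = quantized_fraction(58.0, 17.8)
print(f"{bits:.3f} bits per quantized weight")
print(f"about {fraction:.0%} of the bf16 bytes would be in quantized layers")
```

Under these assumptions the reported sizes are consistent with the large majority of weights living in the quantized decoder linears, with the embeddings, lm_head and vision modules accounting for the bf16 remainder.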

📊 Perplexity comparison (measured)

Identical evaluation harness across all three models: detokenized WikiText-103 paragraphs (80 docs, ~30k tokens), <bos> prepended per window, Gemma final-logit softcapping applied, computed on the inner text decoder.
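The core of such a harness can be sketched in a few lines. This is a minimal, dependency-free illustration of the two steps named above (final-logit softcapping, then token-level perplexity), not the script used for the table. The function names, the window layout and the cap value in the example call are hypothetical; in practice the cap would be read from the model configuration.

```python
import math

def softcap(logits, cap):
    """Final-logit softcapping: squash each logit into (-cap, cap) via tanh."""
    return [cap * math.tanh(z / cap) for z in logits]

def log_softmax(logits):
    """Numerically stable log-softmax over one vocabulary row."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(z - m) for z in logits))
    return [z - lse for z in logits]

def perplexity(windows, cap):
    """windows: iterable of (logit_rows, target_ids), one pair per window.

    logit_rows[i] holds the logits predicting target_ids[i]. Perplexity is
    exp of the mean negative log-likelihood over all scored tokens.
    """
    total_nll, count = 0.0, 0
    for logit_rows, target_ids in windows:
        for row, target in zip(logit_rows, target_ids):
            total_nll -= log_softmax(softcap(row, cap))[target]
            count += 1
    return math.exp(total_nll / count)

# Toy check: uniform logits over a 4-token vocabulary give perplexity 4.
toy = [([[0.0, 0.0, 0.0, 0.0]], [2])]
print(perplexity(toy, cap=30.0))  # cap=30.0 is a placeholder value
```

Because perplexity is exponential in the mean loss, small differences in windowing or in whether `<bos>` is prepended can shift the absolute number considerably, which is why the comparison only makes sense under one harness.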

| Model | What it is | Perplexity ↓ |
| --- | --- | --- |
| unsloth/gemma-4-31B-it-qat-q4_0-unquantized | QAT bf16 (upper bound) | 682.4 |
| Luminia/gemma-4-31B-it-qat-bnb-4bit (this) | QAT → NF4 | 732.7 |
| unsloth/gemma-4-31B-it-unsloth-bnb-4bit | vanilla it → NF4 (existing) | 2295.6 |

Takeaways:

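  • The perplexity of this QAT→NF4 model is only 7.4% above that of its bf16 source (682.4 → 732.7), so NF4 quantization degraded it very little.
  • Its perplexity is ~3.1× lower than that of the existing vanilla→NF4 bnb-4bit model (732.7 vs. 2295.6), which supports the hypothesis that a QAT checkpoint tolerates NF4 quantization better than the vanilla instruction-tuned one.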
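Both figures follow directly from the table; the short calculation below reproduces them. The dictionary keys are shorthand labels chosen for this example.

```python
# Perplexities copied from the comparison table.
ppl = {
    "qat_bf16": 682.4,      # unsloth/gemma-4-31B-it-qat-q4_0-unquantized
    "qat_nf4": 732.7,       # this model
    "vanilla_nf4": 2295.6,  # unsloth/gemma-4-31B-it-unsloth-bnb-4bit
}

# Relative increase of this model over its bf16 source, in percent.
overhead_pct = (ppl["qat_nf4"] / ppl["qat_bf16"] - 1) * 100

# How many times higher the vanilla NF4 perplexity is than this model's.
ratio = ppl["vanilla_nf4"] / ppl["qat_nf4"]

print(f"QAT NF4 vs. QAT bf16: +{overhead_pct:.1f}%")
print(f"vanilla NF4 vs. QAT NF4: {ratio:.1f}x")
```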

⚠️ The absolute PPL values are inflated by rare proper-noun / encyclopedic tokens in WikiText and are not comparable to numbers from other harnesses. Only the relative ranking (computed under one identical harness) is meaningful.

⚠️ Honest caveat: NF4 ≠ q4_0

Google's QAT for Gemma targets q4_0 (a llama.cpp/GGUF symmetric int4 scheme), not NF4 with double quantization. The QAT robustness largely transfers (as the table shows), but this is not the exact error profile that the GGUF q4_0 QAT artifact would give. If you want the literal QAT target precision, use the GGUF q4_0 build. This repo is the bitsandbytes/QLoRA-friendly counterpart.
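To make the difference concrete, the sketch below round-trips one block of weights through a simplified version of each scheme. It rests on several assumptions: the NF4 codebook is the 16-entry table as commonly listed for bitsandbytes, the q4_0 routine follows the llama.cpp reference layout as generally described (one scale per block, integer codes 0 to 15 with an offset of 8), scales are kept in full precision, and double quantization and the real block sizes are ignored. It illustrates why the two error profiles differ; it is not either library's actual kernel.

```python
# NF4 codebook (assumed values, as commonly listed for bitsandbytes).
NF4_CODEBOOK = [
    -1.0, -0.6961928009986877, -0.5250730514526367, -0.39491748809814453,
    -0.28444138169288635, -0.18477343022823334, -0.09105003625154495, 0.0,
    0.07958029955625534, 0.16093020141124725, 0.24611230194568634,
    0.33791524171829224, 0.44070982933044434, 0.5626170039176941,
    0.7229568362236023, 1.0,
]

def nf4_roundtrip(block):
    """Scale by absmax, snap to the nearest codebook entry, scale back."""
    absmax = max(abs(x) for x in block)
    if absmax == 0:
        return [0.0] * len(block)
    return [
        min(NF4_CODEBOOK, key=lambda c: abs(c - x / absmax)) * absmax
        for x in block
    ]

def q4_0_roundtrip(block):
    """Uniform 16-level grid: scale d = (largest-magnitude value) / -8."""
    extreme = max(block, key=abs)  # keeps its sign
    d = extreme / -8.0
    if d == 0:
        return [0.0] * len(block)
    out = []
    for x in block:
        q = min(15, int(x / d + 8.5))  # round to nearest code, clamp at 15
        out.append((q - 8) * d)
    return out

block = [0.1, -0.5, 0.25, 0.8, -0.05, 0.0, 0.33, -0.79]
for name, fn in (("nf4", nf4_roundtrip), ("q4_0", q4_0_roundtrip)):
    restored = fn(block)
    worst = max(abs(a - b) for a, b in zip(block, restored))
    print(f"{name}: max abs error {worst:.4f}")
```

NF4 places its levels densely near zero and sparsely near the block maximum, while q4_0 spaces them evenly, so weights trained to sit on one grid do not land exactly on the other.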

Usage

```python
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "Nekochu/gemma-4-31B-it-qat-bnb-4bit"

# The 4-bit quantization config is stored with the checkpoint,
# so no BitsAndBytesConfig needs to be passed here.
model = AutoModelForImageTextToText.from_pretrained(model_id, device_map="auto")
proc = AutoProcessor.from_pretrained(model_id)

messages = [{"role": "user", "content": [{"type": "text", "text": "Name three primary colors."}]}]
inputs = proc.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
).to(model.device)

out = model.generate(**inputs, max_new_tokens=64)
# Decode only the newly generated tokens (everything after the prompt).
print(proc.tokenizer.decode(out[0][inputs["input_ids"].shape[1]:], skip_special_tokens=True))
```

Requires transformers, accelerate, bitsandbytes, pillow, torchvision.
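A quick way to confirm the environment before downloading the weights is to check that each package is importable. The helper below is a small sketch using only the standard library; the function name is hypothetical, and the only non-obvious detail is that pillow is imported as `PIL`.

```python
from importlib.util import find_spec

# pip package name -> top-level module name it installs
REQUIRED = {
    "transformers": "transformers",
    "accelerate": "accelerate",
    "bitsandbytes": "bitsandbytes",
    "pillow": "PIL",
    "torchvision": "torchvision",
}

def missing_packages(required=REQUIRED):
    """Return the pip names whose module cannot be found in this environment."""
    return [pip_name for pip_name, module in required.items()
            if find_spec(module) is None]

missing = missing_packages()
if missing:
    print("Missing packages: pip install " + " ".join(missing))
else:
    print("All required packages are importable.")
```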

Sanity check

Greedy generation after quantization:

Prompt: Name three primary colors.
Output: "The three primary colors are red, blue, and yellow."

Prompt: Explain what quantization-aware training is in one sentence.
Output: "Quantization-aware training (QAT) is a model training process that simulates the effects of low-precision quantization during the forward pass, allowing the network to learn weights that are more robust to the rounding errors introduced when converting the model to a smaller format."

Reproduction

Quantized with transformers==5.10.2, bitsandbytes==0.49.2, torch==2.7.1+cu126 on a single A100-80GB by loading the bf16 source with the BitsAndBytesConfig above and model.save_pretrained(...). Perplexity evaluated with the same stack.
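The procedure described above can be sketched as follows. This is an outline under stated assumptions rather than the exact script: the function names are hypothetical, `skip_modules` must be supplied by the caller (the full list is not reproduced here), and the `dtype` keyword is the spelling used by recent transformers releases (older releases call it `torch_dtype`). The heavy imports are kept inside the function so the settings can be inspected without a GPU.

```python
def bnb_config_kwargs(skip_modules):
    """Settings listed under "What this is", as BitsAndBytesConfig keyword arguments."""
    return {
        "load_in_4bit": True,
        "bnb_4bit_quant_type": "nf4",
        "bnb_4bit_use_double_quant": True,
        "bnb_4bit_compute_dtype": "bfloat16",
        "bnb_4bit_quant_storage": "uint8",
        # Modules named here stay in bf16 (embeddings, lm_head, vision tower, ...).
        "llm_int8_skip_modules": list(skip_modules),
    }

def quantize_and_save(source_id, out_dir, skip_modules):
    """Load the bf16 source with on-the-fly NF4 quantization, then save it."""
    import torch
    from transformers import AutoModelForImageTextToText, BitsAndBytesConfig

    config = BitsAndBytesConfig(**bnb_config_kwargs(skip_modules))
    model = AutoModelForImageTextToText.from_pretrained(
        source_id,
        quantization_config=config,
        dtype=torch.bfloat16,  # assumption: keyword name in recent releases
        device_map="auto",
    )
    model.save_pretrained(out_dir)

# Example call (not executed here; needs a GPU and the ~58 GB source):
# quantize_and_save(
#     "unsloth/gemma-4-31B-it-qat-q4_0-unquantized",
#     "gemma-4-31B-it-qat-bnb-4bit",   # hypothetical output directory
#     skip_modules=[...],               # copy from the reference repo's config
# )
```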

License

Apache-2.0, inherited from the base model. Gemma is subject to the Gemma Terms of Use.

Model provider

Luminia

Model tree

Base

unsloth/gemma-4-31B-it-qat-q4_0-unquantized

Quantized

this model

Modalities

Input

Text, Image

Output

Text
