Nekochu

gemma-4-31B-it-qat-bnb-4bit

What this is

Source: unsloth/gemma-4-31B-it-qat-q4_0-unquantized (bf16, ~58 GB) — Google's QAT checkpoint, de-quantized back to bf16.
Method: bitsandbytes 4-bit, replicating the exact quantization_config used by the official Unsloth dynamic-4bit repo:
- bnb_4bit_quant_type = "nf4"
- bnb_4bit_use_double_quant = true
- bnb_4bit_compute_dtype = "bfloat16"
- bnb_4bit_quant_storage = "uint8"
- the full llm_int8_skip_modules list, so that embeddings, lm_head, the vision tower, the multimodal projector, and all per-layer/vision modules stay in bf16.
Only the language-model decoder linears (600 Linear4bit modules across 60 layers) are quantized.
Result: ~17.8 GB on GPU (down from ~58 GB), which fits comfortably on a single 24 GB GPU for inference / QLoRA.
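The settings listed above can be sketched as keyword arguments for `transformers.BitsAndBytesConfig`. This is a minimal sketch, not the exact script used for this repo: the helper names `build_quant_kwargs` and `quantize_and_save` are hypothetical, the skip list is deliberately left to the caller (it should be copied from the Unsloth repo's `quantization_config`, which is not reproduced here), and saving the processor alongside the model is an assumption.

```python
# Fixed NF4 settings from the list above, as BitsAndBytesConfig keyword arguments.
NF4_KWARGS = {
    "load_in_4bit": True,
    "bnb_4bit_quant_type": "nf4",
    "bnb_4bit_use_double_quant": True,
    "bnb_4bit_compute_dtype": "bfloat16",
    "bnb_4bit_quant_storage": "uint8",
}


def build_quant_kwargs(skip_modules):
    """Merge the fixed NF4 settings with a caller-supplied skip list.

    `skip_modules` must be the full llm_int8_skip_modules list; it is not
    hard-coded here because the source text does not spell it out.
    """
    if not skip_modules:
        raise ValueError("pass the full llm_int8_skip_modules list")
    return {**NF4_KWARGS, "llm_int8_skip_modules": list(skip_modules)}


def quantize_and_save(source_id, out_dir, skip_modules):
    """Load the bf16 source with on-the-fly NF4 quantization, then save it."""
    # Imported lazily so the helpers above work without the GPU stack installed.
    from transformers import (
        AutoModelForImageTextToText,
        AutoProcessor,
        BitsAndBytesConfig,
    )

    config = BitsAndBytesConfig(**build_quant_kwargs(skip_modules))
    model = AutoModelForImageTextToText.from_pretrained(
        source_id, quantization_config=config, device_map="auto"
    )
    model.save_pretrained(out_dir)
    # Assumption: the processor files are copied next to the weights.
    AutoProcessor.from_pretrained(source_id).save_pretrained(out_dir)
```

Calling `quantize_and_save("unsloth/gemma-4-31B-it-qat-q4_0-unquantized", "out", skip_modules)` would need a GPU with enough memory to stream the bf16 source through quantization.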

📊 Perplexity comparison (measured)

All three models were evaluated with the same harness: detokenized WikiText-103 paragraphs (80 docs, ~30k tokens), <bos> prepended to each window, Gemma final-logit softcapping applied, and loss computed on the inner text decoder.
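The pieces of that harness can be illustrated with a small pure-Python sketch. It is not the evaluation code used for the table: the function names are hypothetical, the windows are assumed to be consecutive and non-overlapping, and the softcap value `cap` is a parameter because the card does not state it. The soft-capping formula `cap * tanh(logit / cap)` is the one used for Gemma final logits.

```python
import math


def softcap(logit, cap):
    """Final-logit soft-capping: squashes a logit smoothly into (-cap, cap)."""
    return cap * math.tanh(logit / cap)


def make_windows(token_ids, window_len, bos_id):
    """Cut a token stream into consecutive windows, each starting with <bos>."""
    return [
        [bos_id] + token_ids[i:i + window_len]
        for i in range(0, len(token_ids), window_len)
    ]


def token_nll(logits, target, cap):
    """Negative log-likelihood of `target` under soft-capped logits."""
    capped = [softcap(z, cap) for z in logits]
    m = max(capped)  # subtract the max for a numerically stable log-sum-exp
    log_z = m + math.log(sum(math.exp(z - m) for z in capped))
    return log_z - capped[target]


def perplexity(nlls):
    """Perplexity is the exponential of the mean per-token NLL."""
    return math.exp(sum(nlls) / len(nlls))
```

In a real run the logits would come from the model's forward pass over each window, and only the non-<bos> positions would contribute an NLL term.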

| Model | What it is | Perplexity ↓ |
|---|---|---|
| `unsloth/gemma-4-31B-it-qat-q4_0-unquantized` | QAT bf16 (quality upper bound) | 682.4 |
| `Nekochu/gemma-4-31B-it-qat-bnb-4bit` (this) | QAT → NF4 | 732.7 |
| `unsloth/gemma-4-31B-it-unsloth-bnb-4bit` | vanilla `it` → NF4 (existing) | 2295.6 |

Takeaways:

This QAT→NF4 model's perplexity is only 7.4% higher than that of its bf16 source (732.7 vs 682.4), so NF4 quantization barely degraded it.
Its perplexity is ~3.1× lower than that of the existing vanilla→NF4 bnb-4bit model (732.7 vs 2295.6), which is consistent with the QAT hypothesis.
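The two relative figures follow directly from the table; a short check of the arithmetic (the dictionary keys are just illustrative labels):

```python
# Perplexities copied from the table above.
ppl = {"qat_bf16": 682.4, "qat_nf4": 732.7, "vanilla_nf4": 2295.6}

# Relative perplexity increase of QAT->NF4 over its bf16 source, in percent.
degradation_pct = (ppl["qat_nf4"] / ppl["qat_bf16"] - 1.0) * 100.0

# How many times higher the vanilla->NF4 perplexity is than QAT->NF4.
ratio = ppl["vanilla_nf4"] / ppl["qat_nf4"]

print(f"QAT->NF4 vs bf16: +{degradation_pct:.1f}%")   # +7.4%
print(f"vanilla->NF4 vs QAT->NF4: {ratio:.1f}x")      # 3.1x
```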

⚠️ The absolute PPL values are inflated by rare proper-noun / encyclopedic tokens in WikiText and are not comparable to numbers from other harnesses. Only the relative ranking (computed under one identical harness) is meaningful.

⚠️ Honest caveat: NF4 ≠ q4_0

Google's QAT for Gemma targets q4_0 (a llama.cpp/GGUF symmetric int4 scheme), not NF4 with double quantization. The QAT robustness largely transfers (as the table shows), but the error profile here is not the one the GGUF q4_0 QAT artifact would give. If you want the exact precision that QAT targeted, use the GGUF q4_0 build. This repo is the bitsandbytes/QLoRA-friendly counterpart.
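To make the difference concrete, here is a simplified round-trip sketch of both schemes on a single block of weights. It is illustrative only: the function names are hypothetical, the NF4 levels are rounded values of the bitsandbytes code book, scales are kept in full precision (real q4_0 stores an fp16 scale per 32-weight block, and NF4 double quantization also quantizes the per-block absmax), and blocks are assumed non-empty.

```python
# NF4 code book (16 normal-quantile levels in [-1, 1]), rounded for illustration.
NF4_LEVELS = [
    -1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
    0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0,
]


def q4_0_roundtrip(block):
    """q4_0-style: one scale per block, uniform 16-step grid from -8d to 7d."""
    signed_max = max(block, key=abs)  # largest-magnitude value, sign kept
    d = signed_max / -8.0
    if d == 0.0:
        return [0.0] * len(block)
    out = []
    for x in block:
        q = min(15, int(x / d + 8.5))  # round to the nearest code, clamp to 4 bits
        out.append((q - 8) * d)
    return out


def nf4_roundtrip(block):
    """NF4-style: scale by the block absmax, snap to the nearest fixed level."""
    absmax = max(abs(x) for x in block)
    if absmax == 0.0:
        return [0.0] * len(block)
    return [
        min(NF4_LEVELS, key=lambda level: abs(x / absmax - level)) * absmax
        for x in block
    ]


def max_abs_error(block, roundtrip):
    """Largest reconstruction error of a round-trip function on one block."""
    return max(abs(a - b) for a, b in zip(block, roundtrip(block)))
```

Weights that QAT has nudged onto the uniform q4_0 grid reconstruct exactly under `q4_0_roundtrip` but generally land between the non-uniform NF4 levels, which is why the two error profiles differ.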

Usage

```python
from transformers import AutoModelForImageTextToText, AutoProcessor

model_id = "Nekochu/gemma-4-31B-it-qat-bnb-4bit"
# The 4-bit quantization_config is stored with the checkpoint, so none is passed here.
model = AutoModelForImageTextToText.from_pretrained(model_id, device_map="auto")
proc = AutoProcessor.from_pretrained(model_id)

# Text-only chat turn in the processor's multimodal content format.
messages = [{"role": "user", "content": [{"type": "text", "text": "Name three primary colors."}]}]
inputs = proc.apply_chat_template(
    messages, add_generation_prompt=True, tokenize=True,
    return_dict=True, return_tensors="pt"
).to(model.device)

out = model.generate(**inputs, max_new_tokens=64)
# Decode only the newly generated tokens, skipping the prompt.
prompt_len = inputs["input_ids"].shape[1]
print(proc.tokenizer.decode(out[0][prompt_len:], skip_special_tokens=True))
```

Requires transformers, accelerate, bitsandbytes, pillow, torchvision.
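A quick way to confirm those packages are importable before loading the model, as a small sketch (the `missing_packages` helper is hypothetical; note that pillow is imported as `PIL`):

```python
from importlib.util import find_spec

# pip package name -> importable module name (pillow installs as PIL).
REQUIRED = {
    "transformers": "transformers",
    "accelerate": "accelerate",
    "bitsandbytes": "bitsandbytes",
    "pillow": "PIL",
    "torchvision": "torchvision",
}


def missing_packages(required=REQUIRED):
    """Return the pip names whose module cannot be found in this environment."""
    return [
        pip_name
        for pip_name, module in required.items()
        if find_spec(module) is None
    ]


if __name__ == "__main__":
    missing = missing_packages()
    print("missing:", ", ".join(missing) if missing else "none")
```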

Sanity check

Greedy generation after quantization:

Prompt: Name three primary colors.
Output: "The three primary colors are red, blue, and yellow."

Prompt: Explain what quantization-aware training is in one sentence.
Output: "Quantization-aware training (QAT) is a model training process that simulates the effects of low-precision quantization during the forward pass, allowing the network to learn weights that are more robust to the rounding errors introduced when converting the model to a smaller format."

Reproduction

Quantized with transformers==5.10.2, bitsandbytes==0.49.2, torch==2.7.1+cu126 on a single A100-80GB by loading the bf16 source with the BitsAndBytesConfig above and then calling model.save_pretrained(...). Perplexity was evaluated with the same stack.

License

Apache-2.0, inherited from the base model. Gemma is subject to the Gemma Terms of Use.

Model provider: Nekochu
Base model: unsloth/gemma-4-31B-it-qat-q4_0-unquantized (this model is its quantized derivative)
Modalities: input text and image; output text

