aleksandard

gemma-4-12B-it-int4-MLPonly-AutoRound

Serving with vLLM (verified)

Tested on RTX 5090 (Blackwell, sm120), CUDA 13.

Gemma 4 12B "unified" support landed in vllm-project/vllm#44429 and is not yet in a stable release, so you need a vLLM nightly build. On Blackwell, the FlashInfer sampler fails to JIT-compile, so disable it with VLLM_USE_FLASHINFER_SAMPLER=0.

Install nightly (CUDA 13; use cu129 URLs on CUDA 12.9 hosts):

```bash
uv pip install -U vllm --pre \
  --extra-index-url https://wheels.vllm.ai/nightly/cu130 \
  --extra-index-url https://download.pytorch.org/whl/cu130 \
  --index-strategy unsafe-best-match
```

Serve:

```bash
export VLLM_USE_FLASHINFER_SAMPLER=0
vllm serve <path-to-this-model> \
  --served-model-name gemma4-12b \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90 \
  --host 0.0.0.0 --port 8000
```

The model loads in ~11 GB, leaving plenty of room on a 32 GB card for KV cache (raise --max-model-len accordingly). Recommended sampling for Gemma 4: temperature=1.0, top_p=0.95, top_k=64.
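As a rough back-of-the-envelope check of the headroom described above, the sketch below subtracts the ~11 GB of weights from the share of a 32 GB card that `--gpu-memory-utilization 0.90` hands to vLLM. The function names and the per-token KV cache cost are illustrative assumptions: the real per-token cost depends on the model's layer and head configuration, which this card does not list, and vLLM also spends part of the budget on activations, so the result is best read as an upper bound.

```python
def kv_cache_budget_gb(total_gb: float, utilization: float, weights_gb: float) -> float:
    """Upper bound on memory left for KV cache once the weights are loaded."""
    return total_gb * utilization - weights_gb


def max_cached_tokens(budget_gb: float, kv_bytes_per_token: int) -> int:
    """Number of tokens that fit in the budget for a given per-token KV cost."""
    return int(budget_gb * 1024**3 // kv_bytes_per_token)


# Figures from this card: 32 GB card, 0.90 utilization, ~11 GB of weights.
budget = kv_cache_budget_gb(total_gb=32, utilization=0.90, weights_gb=11)
print(f"KV cache budget: about {budget:.1f} GB")

# HYPOTHETICAL per-token cost, for illustration only (not measured for this model).
assumed_kv_bytes_per_token = 256 * 1024
print(f"Tokens that would fit: {max_cached_tokens(budget, assumed_kv_bytes_per_token)}")
```

Substituting a per-token cost measured from the vLLM startup log would give a more realistic ceiling for `--max-model-len`.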

Quick test:

```bash
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "gemma4-12b",
    "messages": [{"role": "user", "content": "Explain quantization in one paragraph."}],
    "max_tokens": 200, "temperature": 1.0, "top_p": 0.95, "top_k": 64
  }'
```
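The same request can be sent from Python. The following is a minimal sketch using only the standard library; it assumes the server from the serve step is listening on localhost:8000 and that the response follows the OpenAI-style chat-completions shape. The helper names `build_payload` and `chat` are made up for this example.

```python
import json
import urllib.request

# Assumes the vLLM server started above is reachable at this address.
BASE_URL = "http://localhost:8000/v1/chat/completions"


def build_payload(prompt: str, max_tokens: int = 200) -> dict:
    """Build a chat-completions body with the sampling settings recommended above."""
    return {
        "model": "gemma4-12b",  # must match --served-model-name
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
        "temperature": 1.0,
        "top_p": 0.95,
        "top_k": 64,
    }


def chat(prompt: str, url: str = BASE_URL, timeout: float = 120.0) -> str:
    """POST the payload and return the assistant's reply text."""
    request = urllib.request.Request(
        url,
        data=json.dumps(build_payload(prompt)).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    # Assumed OpenAI-compatible response layout.
    return body["choices"][0]["message"]["content"]
```

With the server running, `print(chat("Explain quantization in one paragraph."))` should print the model's reply.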

Usage (transformers)

Also loads under transformers (requires gptqmodel):

```bash
pip install transformers torch gptqmodel optimum
```

```python
from transformers import AutoModelForCausalLM, AutoTokenizer

model_id = "aleksandard/gemma-4-12B-it-int4-MLPonly-AutoRound"
tok = AutoTokenizer.from_pretrained(model_id)
# torch_dtype="auto" uses the dtype recorded in the checkpoint config.
model = AutoModelForCausalLM.from_pretrained(
    model_id, torch_dtype="auto", device_map="cuda"
)

messages = [{"role": "user", "content": "Explain quantization in one paragraph."}]
# return_dict=False returns the prompt token ids as a plain tensor.
ids = tok.apply_chat_template(
    messages, add_generation_prompt=True, return_tensors="pt", return_dict=False
).to("cuda")
out = model.generate(ids, max_new_tokens=256)
# Slice off the prompt tokens so only the newly generated text is printed.
print(tok.decode(out[0][ids.shape[-1]:], skip_special_tokens=True))
```

Notes

vLLM stable (<= 0.22.0) does not serve Gemma 4 dense 12B — it hits a shape mismatch in the attention path caused by Gemma 4's heterogeneous head dimensions (head_dim 256 for sliding-window layers vs 512 for global layers). Use a nightly build as described above.
On Blackwell, VLLM_USE_FLASHINFER_SAMPLER=0 is required to avoid a FlashInfer JIT-compile failure during sampling.

Reproduce

```bash
auto-round \
  --model google/gemma-4-12B-it \
  --scheme W4A16 \
  --iters 0 \
  --disable_opt_rtn \
  --layer_config '{"model.language_model.layers.\d+.self_attn.q_proj":{"bits":16},"model.language_model.layers.\d+.self_attn.k_proj":{"bits":16},"model.language_model.layers.\d+.self_attn.v_proj":{"bits":16},"model.language_model.layers.\d+.self_attn.o_proj":{"bits":16}}' \
  --format auto_gptq \
  --output_dir ./gemma-4-12B-it-int4-MLPonly
```
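To see why this configuration yields an "MLP-only" quantization, the sketch below applies the four `--layer_config` patterns to a few module names. Two things are assumptions: the module names are hypothetical examples that follow the naming used in the patterns, and full-string regex matching is used here for illustration, which may differ from how AutoRound matches internally. It also does not cover which other modules (for example embeddings or non-text towers) AutoRound leaves unquantized.

```python
import re

# Patterns copied from the --layer_config argument; each one is pinned to 16 bits.
KEEP_16BIT_PATTERNS = [
    r"model.language_model.layers.\d+.self_attn.q_proj",
    r"model.language_model.layers.\d+.self_attn.k_proj",
    r"model.language_model.layers.\d+.self_attn.v_proj",
    r"model.language_model.layers.\d+.self_attn.o_proj",
]


def bits_for(module_name: str, default_bits: int = 4) -> int:
    """Return 16 for modules matching a keep pattern, else the W4A16 default of 4.

    ASSUMPTION: full-string regex matching; AutoRound's own matching may differ.
    """
    if any(re.fullmatch(p, module_name) for p in KEEP_16BIT_PATTERNS):
        return 16
    return default_bits


# HYPOTHETICAL module names, following the naming used in the patterns.
for name in [
    "model.language_model.layers.0.self_attn.q_proj",
    "model.language_model.layers.17.self_attn.o_proj",
    "model.language_model.layers.0.mlp.gate_proj",
    "model.language_model.layers.17.mlp.down_proj",
]:
    print(f"{name}: {bits_for(name)} bits")
```

Under these assumptions the attention projections stay at 16 bits and the MLP projections fall through to the 4-bit default. Note that the unescaped dots in the patterns match any character, which is harmless for these names.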

Limitations

This is a quantized derivative; it inherits all limitations and biases of the base model and may show additional deviation due to 4-bit quantization. See the base model card for full details. Quantization was calibration-free (RTN); a calibrated build may recover some quality.
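For readers unfamiliar with RTN (round-to-nearest), the sketch below shows the general idea on one small group of weights: pick a scale from the group's min and max, then round each weight to the nearest 4-bit level with no calibration data involved. This is a generic asymmetric variant written for illustration, not AutoRound's implementation; the group size and symmetric or asymmetric choice used for this model are not stated in this card, and the weights are made up.

```python
def rtn_quantize(weights: list[float], bits: int = 4) -> tuple[list[int], float, float]:
    """Asymmetric round-to-nearest: map floats onto integers in [0, 2**bits - 1]."""
    levels = 2**bits - 1
    lo, hi = min(weights), max(weights)
    scale = (hi - lo) / levels or 1.0  # fall back to 1.0 for constant inputs
    codes = [min(levels, max(0, round((w - lo) / scale))) for w in weights]
    return codes, scale, lo


def rtn_dequantize(codes: list[int], scale: float, lo: float) -> list[float]:
    """Reconstruct approximate weights from the integer codes."""
    return [lo + c * scale for c in codes]


group = [-0.42, -0.10, 0.03, 0.18, 0.37, 0.95]  # made-up weights for illustration
codes, scale, lo = rtn_quantize(group)
restored = rtn_dequantize(codes, scale, lo)
worst = max(abs(a - b) for a, b in zip(group, restored))
print(codes, f"scale={scale:.4f}", f"max error={worst:.4f}")
```

The reconstruction error per weight is bounded by half the scale, which is why a single outlier in a group (here 0.95) coarsens the grid for every other weight. Calibrated methods try to reduce the effect of that rounding on the layer's output.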

License

Apache 2.0, inherited from the base model. This repository changes only the numeric precision of the weights.

Model provider: aleksandard

Model tree: google/gemma-4-12B-it (base); this model (quantized)

Modalities: Video, Audio, Text, Image (input); Text (output)

