Why This Model?
On non-Blackwell GPUs (RTX 3090, 4090, etc.), the Marlin fallback path is used for
NVFP4 inference. The official NVIDIA checkpoint leaves self-attention and MLP in BF16,
which inflates VRAM to ~20 GB and leaves very little room for KV cache on 24 GB cards.
This checkpoint quantizes those layers too, bringing model VRAM down to ~17.8 GB and
enabling a 232K token KV pool on a 24 GB card — comparable to W4A16 setups:
Table with columns: Checkpoint, Self-Attn/MLP, Model VRAM, KV Pool (24 GB)| Checkpoint | Self-Attn/MLP | Model VRAM | KV Pool (24 GB) |
|---|
nvidia/Gemma-4-26B-A4B-NVFP4 | BF16 | 20.07 GB | 74K tokens |
| This model | NVFP4 | 17.81 GB | 232K tokens |
How It Was Created
Quantized with NVIDIA ModelOpt 0.46 using NVFP4_DEFAULT_CFG, which targets all
nn.Linear layers for NVFP4 quantization (W4A4, group_size=16, two-level scaling
with weight_scale + weight_scale_2 + input_scale).
Excluded from quantization (kept in BF16):
lm_head (tied to embed_tokens)
model.embed_vision* (vision embedding projection)
model.language_model.layers.*.router* (MoE router — intentionally excluded by
modelopt defaults for accuracy)
model.vision_tower* (entire vision encoder)
Quantized to NVFP4:
- All self-attention projections (q_proj, k_proj, v_proj, o_proj) — 30 layers
- All MLP projections (gate_proj, up_proj, down_proj) — 30 layers
- All MoE expert projections (gate/up/down per expert) — 30 layers × 128 experts
import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_hf_checkpoint
from transformers import AutoModelForCausalLM, AutoTokenizer
MODEL_ID = "google/gemma-4-26B-A4B-it-qat-q4_0-unquantized"
OUTPUT_DIR = "gemma-4-26b-a4b-nvfp4-full"
model = AutoModelForCausalLM.from_pretrained(
MODEL_ID, torch_dtype=torch.bfloat16, device_map="cpu", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)
model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG)
export_hf_checkpoint(model, dtype=torch.bfloat16, export_dir=OUTPUT_DIR)
tokenizer.save_pretrained(OUTPUT_DIR)
Note: ModelOpt 0.44 fails on Gemma 4's Gemma4TextExperts module (singular vs
plural weight_quantizer attribute mismatch). Version 0.46+ is required.
Running with SGLang
Important: The 26B MoE uses GELU activation (not SiLU). You need the GELU MoE
patch from PR sgl-project/sglang#24280.
Save the patch script as /tmp/patch_gelu_moe_v2.sh.
docker run -d \
--name sglang-nvfp4 \
--gpus all \
--network host \
-e SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-v /tmp/patch_gelu_moe_v2.sh:/patch.sh:ro \
lmsysorg/sglang:gemma4-mtp \
bash -c "bash /patch.sh && \
pip install --no-deps --quiet 'git+https://github.com/huggingface/transformers.git@1423d22f7a3b62e8c70ad67b58ec25cd9b675897' 2>&1 && \
python3 -m sglang.launch_server \
--model-path kunhunjon/gemma-4-26B-A4B-it-qat-NVFP4-full \
--quantization modelopt_fp4 \
--kv-cache-dtype fp8_e5m2 \
--mem-fraction-static 0.91 \
--context-length 262144 \
--swa-full-tokens-ratio 0.05 \
--cuda-graph-max-bs 1 \
--max-running-requests 1 \
--reasoning-parser gemma4 \
--tool-call-parser gemma4 \
--dtype bfloat16 \
--trust-remote-code \
--host 0.0.0.0 --port 30000"
Notes:
--quantization modelopt_fp4 selects the NVFP4 handler with Marlin fallback for
non-Blackwell GPUs (SM80+). On Blackwell (SM100+), the native FP4 CUTLASS path is
used automatically.
--mem-fraction-static 0.91 fits the model + KV pool on 24 GB cards.
Reduce to 0.90 if OOM occurs during CUDA graph capture.
--swa-full-tokens-ratio 0.05 expands the full-attention KV pool 6× vs default
(Gemma 4 uses hybrid sliding window attention).
--cuda-graph-max-bs 1 minimizes CUDA graph memory on 24 GB cards.
- The transformers git commit is needed for
gemma4 architecture support.
Memory on RTX 3090 (24 GB)
Table with columns: Component, VRAM| Component | VRAM |
|---|
| Model weights (NVFP4 + Marlin overhead) | 17.81 GB |
| KV cache (FP8, 232K tokens) | 3.33 GB |
| CUDA graphs (bs=1) | ~0.06 GB |
| Overhead | ~0.79 GB |
| Total | ~22.0 GB |
Accuracy Considerations
NVIDIA's modelopt defaults intentionally exclude self-attention from NVFP4 quantization
for Gemma 4, as 4-bit attention weights can degrade accuracy on retrieval, reasoning,
and long-context tasks more than MLP quantization. This checkpoint overrides that
default, trading some accuracy for reduced VRAM.
The E4B variant (bg-digitalservices/Gemma-4-E4B-IT-NVFP4) also quantizes
self-attention and reports only 1-3% quality loss on GPQA/MMLU Pro. Expect similar
degradation for this 26B model. For production workloads where accuracy is critical,
use nvidia/Gemma-4-26B-A4B-NVFP4 instead (attention stays BF16, but needs more VRAM).
VRAM Tuning
The --mem-fraction-static and --swa-full-tokens-ratio parameters control the
KV cache capacity:
Table with columns: mem-fraction, swa-ratio, Full KV Tokens, SWA Tokens, Notes| mem-fraction | swa-ratio | Full KV Tokens | SWA Tokens | Notes |
|---|
| 0.90 | 0.05 | ~74K | ~3.7K | Safe, more CUDA graph room |
| 0.91 | 0.05 | ~232K | ~11.6K | Recommended for 24 GB |
| 0.92 | 0.05 | OOM risk | — | May fail during graph capture |
[!Note]
This model card is for the new versions of the Gemma 4 family optimized with Quantization-Aware Training (QAT), which allows preserving similar quality to bfloat16 while dramatically reducing the memory requirements to load the model.
Four versions of the QAT checkpoints are available:
- Unquantized QAT checkpoints (Q4_0): Half-precision weights extracted from the QAT pipeline, ideal for custom downstream compilation and research. Available for Gemma 4 E2B, E4B, 12B, 26B A4B, and 31B, and their drafter models.
- GGUF (Q4_0): Ready-to-deploy formats for broad ecosystem compatibility. Available for Gemma 4 E2B, E4B, 12B, 26B A4B, and 31B.
- Mobile-optimized (wNa8o8): A custom schema engineered explicitly for mobile hardware efficiency. It features targeted 2-bit decoding layers, optimized KV caches, and static activations to maximize VRAM savings. Available for Gemma 4 E2B and E4B.
- Compressed Tensors (w4a16): QAT checkpoints serialized in the compressed-tensors format for native, optimized inference with vLLM. Available for Gemma 4 E2B, E4B, 12B, and 31B.
Gemma is a family of open models built by Google DeepMind. Gemma 4 models are multimodal, handling text and image input (with audio supported on E2B, E4B, and 12B) and generating text output. This release includes open-weights models in both pre-trained and instruction-tuned variants. Gemma 4 features a context window of up to 256K tokens and maintains multilingual support in over 140 languages.
Featuring both Dense and Mixture-of-Experts (MoE) architectures, Gemma 4 is well-suited for tasks like text generation, coding, and reasoning. The models are available in five distinct sizes: E2B, E4B, 12B, 26B A4B, and 31B. Their diverse sizes make them deployable in environments ranging from high-end phones to laptops and servers, democratizing access to state-of-the-art AI.
Gemma 4 introduces key capability and architectural advancements:
-
Reasoning – All models in the family are designed as highly capable reasoners, with configurable thinking modes.
-
Extended Multimodalities – Processes Text, Image with variable aspect ratio and resolution support (all models), Video, and Audio (featured natively on the E2B, E4B, and 12B models).
-
Diverse & Efficient Architectures – Offers Dense and Mixture-of-Experts (MoE) variants of different sizes for scalable deployment.
-
Optimized for On-Device – Smaller models are specifically designed for efficient local execution on laptops and mobile devices.
-
Increased Context Window – The small models feature a 128K context window, while the medium models support 256K.
-
Enhanced Coding & Agentic Capabilities – Achieves notable improvements in coding benchmarks alongside native function-calling support, powering highly capable autonomous agents.
-
Native System Prompt Support – Gemma 4 introduces native support for the system role, enabling more structured and controllable conversations.
Models Overview
Gemma 4 models are designed to deliver frontier-level performance at each size, targeting deployment scenarios from mobile and edge devices (E2B, E4B) to consumer GPUs and workstations (12B, 26B A4B, 31B). They are well-suited for reasoning, agentic workflows, coding, and multimodal understanding.
The models employ a hybrid attention mechanism that interleaves local sliding window attention with full global attention, ensuring the final layer is always global. This hybrid design delivers the processing speed and low memory footprint of a lightweight model without sacrificing the deep awareness required for complex, long-context tasks. To optimize memory for long contexts, global layers feature unified Keys and Values, and apply Proportional RoPE (p-RoPE).
Dense Models
Table with columns: Property, E2B, E4B, 12B Unified, 31B Dense| Property | E2B | E4B | 12B Unified | 31B Dense |
|---|
| Total Parameters | 2.3B effective (5.1B with embeddings) | 4.5B effective (8B with embeddings) | 11.95B | 30.7B |
| Layers | 35 | 42 | 48 | 60 |
| Sliding Window | 512 tokens | 512 tokens | 1024 tokens | 1024 tokens |
Mixture-of-Experts (MoE) Model
Table with columns: Property, 26B A4B MoE| Property | 26B A4B MoE |
|---|
| Total Parameters | 25.2B |
| Active Parameters | 3.8B |
| Layers | 30 |
| Sliding Window | 1024 tokens |
| Context Length | 256K tokens |
| Vocabulary Size | 262K |
| Expert Count | 8 active / 128 total and 1 shared |
| Supported Modalities | Text, Image |
Best Practices
1. Sampling Parameters
Use the following standardized sampling configuration across all use cases:
temperature=1.0
top_p=0.95
top_k=64
2. Thinking Mode Configuration
To properly manage the thinking process, use the following control tokens:
- Trigger Thinking: Thinking is enabled by including the
|<<|think|> token at the start of the system prompt. To disable thinking, remove the token.
- Standard Generation: When thinking is enabled, the model will output its internal reasoning followed by the final answer using this structure:
|<<|channel>thought\n[Internal reasoning]<<channel|>
3. Multi-Turn Conversations
- No Thinking Content in History: In multi-turn conversations, the historical model output should only include the final response. Thoughts from previous model turns must not be added before the next user turn begins.
Ethics and Safety
As open models become central to enterprise infrastructure, provenance and security are paramount. Developed by Google DeepMind, Gemma 4 undergoes the same rigorous safety evaluations as our proprietary Gemini models.
Evaluation Approach
Gemma 4 models were developed in partnership with internal safety and responsible AI teams. A range of automated as well as human evaluations were conducted to help improve model safety. These evaluations align with Google's AI principles, as well as safety policies.
Ethical Considerations and Risks
- Bias and Fairness – VLMs trained on large-scale data can reflect socio-cultural biases. Gemma 4 underwent careful scrutiny and evaluations to mitigate bias risks.
- Misinformation and Misuse – Guidelines are provided for responsible use. See the Responsible Generative AI Toolkit.
- Transparency and Accountability – This model card summarizes details on architecture, capabilities, limitations, and evaluation processes.