kunhunjon

gemma-4-26B-A4B-it-qat-NVFP4-full

Dedicated Endpoints

Run this model inference on single tenant GPU with unmatched speed and reliability at scale.

Get help setting up a custom Dedicated Endpoints.

Talk with our engineer to get a quote for reserved GPU instances with discounts.

Why This Model?

On non-Blackwell GPUs (RTX 3090, 4090, etc.), the Marlin fallback path is used for NVFP4 inference. The official NVIDIA checkpoint leaves self-attention and MLP in BF16, which inflates VRAM to ~20 GB and leaves very little room for KV cache on 24 GB cards.

This checkpoint quantizes those layers too, bringing model VRAM down to ~17.8 GB and enabling a 232K token KV pool on a 24 GB card — comparable to W4A16 setups:

Table with columns: Checkpoint, Self-Attn/MLP, Model VRAM, KV Pool (24 GB)
Checkpoint	Self-Attn/MLP	Model VRAM	KV Pool (24 GB)
`nvidia/Gemma-4-26B-A4B-NVFP4`	BF16	20.07 GB	74K tokens
This model	NVFP4	17.81 GB	232K tokens

How It Was Created

Quantized with NVIDIA ModelOpt 0.46 using NVFP4_DEFAULT_CFG, which targets all nn.Linear layers for NVFP4 quantization (W4A4, group_size=16, two-level scaling with weight_scale + weight_scale_2 + input_scale).

Excluded from quantization (kept in BF16):

lm_head (tied to embed_tokens)
model.embed_vision* (vision embedding projection)
model.language_model.layers.*.router* (MoE router — intentionally excluded by modelopt defaults for accuracy)
model.vision_tower* (entire vision encoder)

Quantized to NVFP4:

All self-attention projections (q_proj, k_proj, v_proj, o_proj) — 30 layers
All MLP projections (gate_proj, up_proj, down_proj) — 30 layers
All MoE expert projections (gate/up/down per expert) — 30 layers × 128 experts

python
import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_hf_checkpoint
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "google/gemma-4-26B-A4B-it-qat-q4_0-unquantized"
OUTPUT_DIR = "gemma-4-26b-a4b-nvfp4-full"

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="cpu", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG)

export_hf_checkpoint(model, dtype=torch.bfloat16, export_dir=OUTPUT_DIR)
tokenizer.save_pretrained(OUTPUT_DIR)

Note: ModelOpt 0.44 fails on Gemma 4's Gemma4TextExperts module (singular vs plural weight_quantizer attribute mismatch). Version 0.46+ is required.

Running with SGLang

Important: The 26B MoE uses GELU activation (not SiLU). You need the GELU MoE patch from PR sgl-project/sglang#24280. Save the patch script as /tmp/patch_gelu_moe_v2.sh.

bash
docker run -d \
  --name sglang-nvfp4 \
  --gpus all \
  --network host \
  -e SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v /tmp/patch_gelu_moe_v2.sh:/patch.sh:ro \
  lmsysorg/sglang:gemma4-mtp \
  bash -c "bash /patch.sh && \
  pip install --no-deps --quiet 'git+https://github.com/huggingface/transformers.git@1423d22f7a3b62e8c70ad67b58ec25cd9b675897' 2>&1 && \
  python3 -m sglang.launch_server \
    --model-path kunhunjon/gemma-4-26B-A4B-it-qat-NVFP4-full \
    --quantization modelopt_fp4 \
    --kv-cache-dtype fp8_e5m2 \
    --mem-fraction-static 0.91 \
    --context-length 262144 \
    --swa-full-tokens-ratio 0.05 \
    --cuda-graph-max-bs 1 \
    --max-running-requests 1 \
    --reasoning-parser gemma4 \
    --tool-call-parser gemma4 \
    --dtype bfloat16 \
    --trust-remote-code \
    --host 0.0.0.0 --port 30000"

Notes:

--quantization modelopt_fp4 selects the NVFP4 handler with Marlin fallback for non-Blackwell GPUs (SM80+). On Blackwell (SM100+), the native FP4 CUTLASS path is used automatically.
--mem-fraction-static 0.91 fits the model + KV pool on 24 GB cards. Reduce to 0.90 if OOM occurs during CUDA graph capture.
--swa-full-tokens-ratio 0.05 expands the full-attention KV pool 6× vs default (Gemma 4 uses hybrid sliding window attention).
--cuda-graph-max-bs 1 minimizes CUDA graph memory on 24 GB cards.
The transformers git commit is needed for gemma4 architecture support.

Memory on RTX 3090 (24 GB)

Table with columns: Component, VRAM
Component	VRAM
Model weights (NVFP4 + Marlin overhead)	17.81 GB
KV cache (FP8, 232K tokens)	3.33 GB
CUDA graphs (bs=1)	~0.06 GB
Overhead	~0.79 GB
Total	~22.0 GB

Accuracy Considerations

NVIDIA's modelopt defaults intentionally exclude self-attention from NVFP4 quantization for Gemma 4, as 4-bit attention weights can degrade accuracy on retrieval, reasoning, and long-context tasks more than MLP quantization. This checkpoint overrides that default, trading some accuracy for reduced VRAM.

The E4B variant (bg-digitalservices/Gemma-4-E4B-IT-NVFP4) also quantizes self-attention and reports only 1-3% quality loss on GPQA/MMLU Pro. Expect similar degradation for this 26B model. For production workloads where accuracy is critical, use nvidia/Gemma-4-26B-A4B-NVFP4 instead (attention stays BF16, but needs more VRAM).

VRAM Tuning

The --mem-fraction-static and --swa-full-tokens-ratio parameters control the KV cache capacity:

Table with columns: mem-fraction, swa-ratio, Full KV Tokens, SWA Tokens, Notes
mem-fraction	swa-ratio	Full KV Tokens	SWA Tokens	Notes
0.90	0.05	~74K	~3.7K	Safe, more CUDA graph room
0.91	0.05	~232K	~11.6K	Recommended for 24 GB
0.92	0.05	OOM risk	—	May fail during graph capture

[!Note] This model card is for the new versions of the Gemma 4 family optimized with Quantization-Aware Training (QAT), which allows preserving similar quality to bfloat16 while dramatically reducing the memory requirements to load the model. Four versions of the QAT checkpoints are available:

Unquantized QAT checkpoints (Q4_0): Half-precision weights extracted from the QAT pipeline, ideal for custom downstream compilation and research. Available for Gemma 4 E2B, E4B, 12B, 26B A4B, and 31B, and their drafter models.

GGUF (Q4_0): Ready-to-deploy formats for broad ecosystem compatibility. Available for Gemma 4 E2B, E4B, 12B, 26B A4B, and 31B.

Mobile-optimized (wNa8o8): A custom schema engineered explicitly for mobile hardware efficiency. It features targeted 2-bit decoding layers, optimized KV caches, and static activations to maximize VRAM savings. Available for Gemma 4 E2B and E4B.

Compressed Tensors (w4a16): QAT checkpoints serialized in the compressed-tensors format for native, optimized inference with vLLM. Available for Gemma 4 E2B, E4B, 12B, and 31B.

Gemma is a family of open models built by Google DeepMind. Gemma 4 models are multimodal, handling text and image input (with audio supported on E2B, E4B, and 12B) and generating text output. This release includes open-weights models in both pre-trained and instruction-tuned variants. Gemma 4 features a context window of up to 256K tokens and maintains multilingual support in over 140 languages.

Featuring both Dense and Mixture-of-Experts (MoE) architectures, Gemma 4 is well-suited for tasks like text generation, coding, and reasoning. The models are available in five distinct sizes: E2B, E4B, 12B, 26B A4B, and 31B. Their diverse sizes make them deployable in environments ranging from high-end phones to laptops and servers, democratizing access to state-of-the-art AI.

Gemma 4 introduces key capability and architectural advancements:

Reasoning – All models in the family are designed as highly capable reasoners, with configurable thinking modes.
Extended Multimodalities – Processes Text, Image with variable aspect ratio and resolution support (all models), Video, and Audio (featured natively on the E2B, E4B, and 12B models).
Diverse & Efficient Architectures – Offers Dense and Mixture-of-Experts (MoE) variants of different sizes for scalable deployment.
Optimized for On-Device – Smaller models are specifically designed for efficient local execution on laptops and mobile devices.
Increased Context Window – The small models feature a 128K context window, while the medium models support 256K.
Enhanced Coding & Agentic Capabilities – Achieves notable improvements in coding benchmarks alongside native function-calling support, powering highly capable autonomous agents.
Native System Prompt Support – Gemma 4 introduces native support for the system role, enabling more structured and controllable conversations.

Models Overview

Gemma 4 models are designed to deliver frontier-level performance at each size, targeting deployment scenarios from mobile and edge devices (E2B, E4B) to consumer GPUs and workstations (12B, 26B A4B, 31B). They are well-suited for reasoning, agentic workflows, coding, and multimodal understanding.

The models employ a hybrid attention mechanism that interleaves local sliding window attention with full global attention, ensuring the final layer is always global. This hybrid design delivers the processing speed and low memory footprint of a lightweight model without sacrificing the deep awareness required for complex, long-context tasks. To optimize memory for long contexts, global layers feature unified Keys and Values, and apply Proportional RoPE (p-RoPE).

Dense Models

Table with columns: Property, E2B, E4B, 12B Unified, 31B Dense
Property	E2B	E4B	12B Unified	31B Dense
Total Parameters	2.3B effective (5.1B with embeddings)	4.5B effective (8B with embeddings)	11.95B	30.7B
Layers	35	42	48	60
Sliding Window	512 tokens	512 tokens	1024 tokens	1024 tokens

Mixture-of-Experts (MoE) Model

Table with columns: Property, 26B A4B MoE
Property	26B A4B MoE
Total Parameters	25.2B
Active Parameters	3.8B
Layers	30
Sliding Window	1024 tokens
Context Length	256K tokens
Vocabulary Size	262K
Expert Count	8 active / 128 total and 1 shared
Supported Modalities	Text, Image

Best Practices

1. Sampling Parameters

Use the following standardized sampling configuration across all use cases:

temperature=1.0
top_p=0.95
top_k=64

2. Thinking Mode Configuration

To properly manage the thinking process, use the following control tokens:

Trigger Thinking: Thinking is enabled by including the |<<|think|> token at the start of the system prompt. To disable thinking, remove the token.
Standard Generation: When thinking is enabled, the model will output its internal reasoning followed by the final answer using this structure: |<<|channel>thought\n[Internal reasoning]<<channel|>

3. Multi-Turn Conversations

No Thinking Content in History: In multi-turn conversations, the historical model output should only include the final response. Thoughts from previous model turns must not be added before the next user turn begins.

Ethics and Safety

As open models become central to enterprise infrastructure, provenance and security are paramount. Developed by Google DeepMind, Gemma 4 undergoes the same rigorous safety evaluations as our proprietary Gemini models.

Evaluation Approach

Gemma 4 models were developed in partnership with internal safety and responsible AI teams. A range of automated as well as human evaluations were conducted to help improve model safety. These evaluations align with Google's AI principles, as well as safety policies.

Ethical Considerations and Risks

Bias and Fairness – VLMs trained on large-scale data can reflect socio-cultural biases. Gemma 4 underwent careful scrutiny and evaluations to mitigate bias risks.
Misinformation and Misuse – Guidelines are provided for responsible use. See the Responsible Generative AI Toolkit.
Transparency and Accountability – This model card summarizes details on architecture, capabilities, limitations, and evaluation processes.

Model provider

kunhunjon

Model tree

Base

google/gemma-4-26B-A4B-it-qat-q4_0-unquantized

Quantized

this model

Modalities

Input

Text, Image

Output

Text

Pricing

Dedicated Endpoints

View details

Supported Functionality

Model APIs

Dedicated Endpoints

Container

More information

Model card

Explore FriendliAI today

Get started Talk to an engineer

Why This Model?

This checkpoint quantizes those layers too, bringing model VRAM down to ~17.8 GB and enabling a 232K token KV pool on a 24 GB card — comparable to W4A16 setups:

Table with columns: Checkpoint, Self-Attn/MLP, Model VRAM, KV Pool (24 GB)
Checkpoint	Self-Attn/MLP	Model VRAM	KV Pool (24 GB)
`nvidia/Gemma-4-26B-A4B-NVFP4`	BF16	20.07 GB	74K tokens
This model	NVFP4	17.81 GB	232K tokens

How It Was Created

Excluded from quantization (kept in BF16):

lm_head (tied to embed_tokens)
model.embed_vision* (vision embedding projection)
model.language_model.layers.*.router* (MoE router — intentionally excluded by modelopt defaults for accuracy)
model.vision_tower* (entire vision encoder)

Quantized to NVFP4:

All self-attention projections (q_proj, k_proj, v_proj, o_proj) — 30 layers
All MLP projections (gate_proj, up_proj, down_proj) — 30 layers
All MoE expert projections (gate/up/down per expert) — 30 layers × 128 experts

python
import modelopt.torch.quantization as mtq
from modelopt.torch.export import export_hf_checkpoint
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL_ID = "google/gemma-4-26B-A4B-it-qat-q4_0-unquantized"
OUTPUT_DIR = "gemma-4-26b-a4b-nvfp4-full"

model = AutoModelForCausalLM.from_pretrained(
    MODEL_ID, torch_dtype=torch.bfloat16, device_map="cpu", trust_remote_code=True
)
tokenizer = AutoTokenizer.from_pretrained(MODEL_ID, trust_remote_code=True)

model = mtq.quantize(model, mtq.NVFP4_DEFAULT_CFG)

export_hf_checkpoint(model, dtype=torch.bfloat16, export_dir=OUTPUT_DIR)
tokenizer.save_pretrained(OUTPUT_DIR)

Note: ModelOpt 0.44 fails on Gemma 4's Gemma4TextExperts module (singular vs plural weight_quantizer attribute mismatch). Version 0.46+ is required.

Running with SGLang

Important: The 26B MoE uses GELU activation (not SiLU). You need the GELU MoE patch from PR sgl-project/sglang#24280. Save the patch script as /tmp/patch_gelu_moe_v2.sh.

bash
docker run -d \
  --name sglang-nvfp4 \
  --gpus all \
  --network host \
  -e SGLANG_ALLOW_OVERWRITE_LONGER_CONTEXT_LEN=1 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v /tmp/patch_gelu_moe_v2.sh:/patch.sh:ro \
  lmsysorg/sglang:gemma4-mtp \
  bash -c "bash /patch.sh && \
  pip install --no-deps --quiet 'git+https://github.com/huggingface/transformers.git@1423d22f7a3b62e8c70ad67b58ec25cd9b675897' 2>&1 && \
  python3 -m sglang.launch_server \
    --model-path kunhunjon/gemma-4-26B-A4B-it-qat-NVFP4-full \
    --quantization modelopt_fp4 \
    --kv-cache-dtype fp8_e5m2 \
    --mem-fraction-static 0.91 \
    --context-length 262144 \
    --swa-full-tokens-ratio 0.05 \
    --cuda-graph-max-bs 1 \
    --max-running-requests 1 \
    --reasoning-parser gemma4 \
    --tool-call-parser gemma4 \
    --dtype bfloat16 \
    --trust-remote-code \
    --host 0.0.0.0 --port 30000"

Notes:

--quantization modelopt_fp4 selects the NVFP4 handler with Marlin fallback for non-Blackwell GPUs (SM80+). On Blackwell (SM100+), the native FP4 CUTLASS path is used automatically.
--mem-fraction-static 0.91 fits the model + KV pool on 24 GB cards. Reduce to 0.90 if OOM occurs during CUDA graph capture.
--swa-full-tokens-ratio 0.05 expands the full-attention KV pool 6× vs default (Gemma 4 uses hybrid sliding window attention).
--cuda-graph-max-bs 1 minimizes CUDA graph memory on 24 GB cards.
The transformers git commit is needed for gemma4 architecture support.

Memory on RTX 3090 (24 GB)

Table with columns: Component, VRAM
Component	VRAM
Model weights (NVFP4 + Marlin overhead)	17.81 GB
KV cache (FP8, 232K tokens)	3.33 GB
CUDA graphs (bs=1)	~0.06 GB
Overhead	~0.79 GB
Total	~22.0 GB

Accuracy Considerations

VRAM Tuning

The --mem-fraction-static and --swa-full-tokens-ratio parameters control the KV cache capacity:

Table with columns: mem-fraction, swa-ratio, Full KV Tokens, SWA Tokens, Notes
mem-fraction	swa-ratio	Full KV Tokens	SWA Tokens	Notes
0.90	0.05	~74K	~3.7K	Safe, more CUDA graph room
0.91	0.05	~232K	~11.6K	Recommended for 24 GB
0.92	0.05	OOM risk	—	May fail during graph capture

[!Note] This model card is for the new versions of the Gemma 4 family optimized with Quantization-Aware Training (QAT), which allows preserving similar quality to bfloat16 while dramatically reducing the memory requirements to load the model. Four versions of the QAT checkpoints are available:

Unquantized QAT checkpoints (Q4_0): Half-precision weights extracted from the QAT pipeline, ideal for custom downstream compilation and research. Available for Gemma 4 E2B, E4B, 12B, 26B A4B, and 31B, and their drafter models.

GGUF (Q4_0): Ready-to-deploy formats for broad ecosystem compatibility. Available for Gemma 4 E2B, E4B, 12B, 26B A4B, and 31B.

Mobile-optimized (wNa8o8): A custom schema engineered explicitly for mobile hardware efficiency. It features targeted 2-bit decoding layers, optimized KV caches, and static activations to maximize VRAM savings. Available for Gemma 4 E2B and E4B.

Compressed Tensors (w4a16): QAT checkpoints serialized in the compressed-tensors format for native, optimized inference with vLLM. Available for Gemma 4 E2B, E4B, 12B, and 31B.

Gemma 4 introduces key capability and architectural advancements:

Reasoning – All models in the family are designed as highly capable reasoners, with configurable thinking modes.
Extended Multimodalities – Processes Text, Image with variable aspect ratio and resolution support (all models), Video, and Audio (featured natively on the E2B, E4B, and 12B models).
Diverse & Efficient Architectures – Offers Dense and Mixture-of-Experts (MoE) variants of different sizes for scalable deployment.
Optimized for On-Device – Smaller models are specifically designed for efficient local execution on laptops and mobile devices.
Increased Context Window – The small models feature a 128K context window, while the medium models support 256K.
Enhanced Coding & Agentic Capabilities – Achieves notable improvements in coding benchmarks alongside native function-calling support, powering highly capable autonomous agents.
Native System Prompt Support – Gemma 4 introduces native support for the system role, enabling more structured and controllable conversations.

Models Overview

Dense Models

Table with columns: Property, E2B, E4B, 12B Unified, 31B Dense
Property	E2B	E4B	12B Unified	31B Dense
Total Parameters	2.3B effective (5.1B with embeddings)	4.5B effective (8B with embeddings)	11.95B	30.7B
Layers	35	42	48	60
Sliding Window	512 tokens	512 tokens	1024 tokens	1024 tokens

Mixture-of-Experts (MoE) Model

Table with columns: Property, 26B A4B MoE
Property	26B A4B MoE
Total Parameters	25.2B
Active Parameters	3.8B
Layers	30
Sliding Window	1024 tokens
Context Length	256K tokens
Vocabulary Size	262K
Expert Count	8 active / 128 total and 1 shared
Supported Modalities	Text, Image

Best Practices

1. Sampling Parameters

Use the following standardized sampling configuration across all use cases:

temperature=1.0
top_p=0.95
top_k=64

2. Thinking Mode Configuration

To properly manage the thinking process, use the following control tokens:

Trigger Thinking: Thinking is enabled by including the |<<|think|> token at the start of the system prompt. To disable thinking, remove the token.
Standard Generation: When thinking is enabled, the model will output its internal reasoning followed by the final answer using this structure: |<<|channel>thought\n[Internal reasoning]<<channel|>

3. Multi-Turn Conversations

No Thinking Content in History: In multi-turn conversations, the historical model output should only include the final response. Thoughts from previous model turns must not be added before the next user turn begins.

Ethics and Safety

Evaluation Approach

Ethical Considerations and Risks

Bias and Fairness – VLMs trained on large-scale data can reflect socio-cultural biases. Gemma 4 underwent careful scrutiny and evaluations to mitigate bias risks.
Misinformation and Misuse – Guidelines are provided for responsible use. See the Responsible Generative AI Toolkit.
Transparency and Accountability – This model card summarizes details on architecture, capabilities, limitations, and evaluation processes.

gemma-4-26B-A4B-it-qat-NVFP4-full

Get help setting up a custom Dedicated Endpoints.

README

Why This Model?

How It Was Created

Running with SGLang

Memory on RTX 3090 (24 GB)

Accuracy Considerations

VRAM Tuning

Models Overview

Dense Models

Mixture-of-Experts (MoE) Model

Best Practices

1. Sampling Parameters

2. Thinking Mode Configuration

3. Multi-Turn Conversations

Ethics and Safety

Evaluation Approach

Ethical Considerations and Risks

Explore FriendliAI today

README

Why This Model?

How It Was Created

Running with SGLang

Memory on RTX 3090 (24 GB)

Accuracy Considerations

VRAM Tuning

Models Overview

Dense Models

Mixture-of-Experts (MoE) Model

Best Practices

1. Sampling Parameters

2. Thinking Mode Configuration

3. Multi-Turn Conversations

Ethics and Safety

Evaluation Approach

Ethical Considerations and Risks