License: apache-2.0
Quantization
- Quantizer: NVIDIA ModelOpt 0.44.0
- Model Optimizer examples tag: 0.44.0
- Quantization format: NVFP4
- Quantization hardware: 2x NVIDIA GeForce RTX 5090 GPUs (Blackwell, compute capability 12.0)
- KV-cache quantization: none baked into the checkpoint
- Calibration data: cnn_dailymail, 512 text samples, sequence length 512
- Export format: unified Hugging Face checkpoint
This checkpoint was quantized, calibrated, and exported on Blackwell RTX 5090 GPUs. The packed weights target NVIDIA's native Blackwell NVFP4 execution path in runtimes such as vLLM.
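To make the NVFP4 format more concrete, here is a minimal pure-Python sketch of block-scaled 4-bit "fake" quantization. It assumes the commonly described NVFP4 layout: E2M1 values (magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6) with one scale per 16-element block. As a simplification, it keeps the block scale in full precision rather than FP8 and omits any per-tensor scale, so it illustrates the idea only and is not ModelOpt's actual quantizer or packing code.

```python
# Magnitudes representable by a 4-bit E2M1 float (the sign is a separate bit).
E2M1_LEVELS = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16  # assumed NVFP4 block size


def fake_quantize_block(block):
    """Round one block to E2M1 levels; return (scale, dequantized values)."""
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return 1.0, [0.0] * len(block)
    # Map the largest magnitude in the block onto the largest level (6.0).
    scale = amax / E2M1_LEVELS[-1]
    dequantized = []
    for x in block:
        scaled = abs(x) / scale
        nearest = min(E2M1_LEVELS, key=lambda level: abs(scaled - level))
        dequantized.append((nearest if x >= 0 else -nearest) * scale)
    return scale, dequantized


def fake_quantize(weights):
    """Apply block-wise fake quantization to a flat list of floats."""
    out = []
    for start in range(0, len(weights), BLOCK_SIZE):
        _, dequantized = fake_quantize_block(weights[start:start + BLOCK_SIZE])
        out.extend(dequantized)
    return out


def bits_per_weight(scale_bits=8):
    """Storage cost per weight: 4 value bits plus the amortized block scale."""
    return 4 + scale_bits / BLOCK_SIZE
```

Under these assumptions, `bits_per_weight()` gives 4.5 bits per quantized weight, which is why the packed checkpoint is much smaller than a 16-bit one even though some modules stay unquantized.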
Multimodal files from the source checkpoint were preserved, including
processor_config.json, tokenizer.json, tokenizer_config.json,
chat_template.jinja, and generation_config.json. The ModelOpt export kept
the multimodal projection modules unquantized:
- model.embed_vision*
- model.embed_audio*
- lm_head
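The exclusion patterns above read like shell-style wildcards. As a rough illustration, assuming they are matched against module names the way Python's `fnmatch` matches strings, the sketch below shows which modules would stay unquantized. The module names in `example_modules` are hypothetical and are not taken from the checkpoint.

```python
from fnmatch import fnmatchcase

# Patterns listed in this model card as kept unquantized.
EXCLUDE_PATTERNS = ("model.embed_vision*", "model.embed_audio*", "lm_head")


def is_excluded(module_name, patterns=EXCLUDE_PATTERNS):
    """Return True if the module name matches any exclusion pattern."""
    return any(fnmatchcase(module_name, pattern) for pattern in patterns)


# Hypothetical module names, for illustration only.
example_modules = [
    "model.embed_vision.embedding_projection",
    "model.embed_audio.embedding_projection",
    "lm_head",
    "model.language_model.layers.0.mlp.up_proj",
]

for name in example_modules:
    status = "kept unquantized" if is_excluded(name) else "quantized"
    print(f"{name}: {status}")
```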
The exported processor was smoke-tested with image input using the Gemma image
token <|image|>.
vLLM Usage
Use a vLLM build with ModelOpt NVFP4 support and run on Blackwell-class GPUs for
native NVFP4 execution. Pass quantization="modelopt_fp4" explicitly when
loading this checkpoint.
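Before loading the checkpoint, it can help to confirm that the local GPUs are Blackwell-class. The sketch below is one way to do that, assuming "Blackwell-class" corresponds to a CUDA compute capability with major version 10 or higher (the RTX 5090 used for this export reports 12.0). The helper names are illustrative; the check needs PyTorch only if you call `local_gpu_report`.

```python
def is_blackwell_class(capability):
    """Assumption: compute capability major >= 10 means Blackwell or newer."""
    major, _minor = capability
    return major >= 10


def local_gpu_report():
    """List (name, capability, is_blackwell_class) for each visible GPU."""
    try:
        import torch  # optional dependency, only needed for the local check
    except ImportError:
        return []
    if not torch.cuda.is_available():
        return []
    report = []
    for index in range(torch.cuda.device_count()):
        capability = torch.cuda.get_device_capability(index)
        name = torch.cuda.get_device_name(index)
        report.append((name, capability, is_blackwell_class(capability)))
    return report


if __name__ == "__main__":
    for name, capability, ok in local_gpu_report():
        verdict = "native NVFP4 expected" if ok else "no native NVFP4 path"
        print(f"{name} (compute capability {capability[0]}.{capability[1]}): {verdict}")
```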
Python
```bash
uv pip install -U vllm
```

```python
from vllm import LLM, SamplingParams

llm = LLM(
    model="berkerdooo/gemma-4-12B-it-NVFP4",
    quantization="modelopt_fp4",
    trust_remote_code=True,
)
outputs = llm.generate(
    ["Explain why the sky is blue."],
    SamplingParams(max_tokens=128, temperature=0),
)
print(outputs[0].outputs[0].text)
```
OpenAI-Compatible Server
```bash
vllm serve berkerdooo/gemma-4-12B-it-NVFP4 \
  --quantization modelopt_fp4 \
  --trust-remote-code
```

```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="berkerdooo/gemma-4-12B-it-NVFP4",
    messages=[
        {
            "role": "user",
            "content": "Explain NVFP4 quantization in one paragraph.",
        }
    ],
    temperature=0,
    max_tokens=128,
)
print(response.choices[0].message.content)
```
Multimodal Request
The processor, tokenizer, chat template, and image token from the original Gemma 4 checkpoint are included in this repo. With the vLLM server above, image inputs can be sent through the OpenAI-compatible chat API:
```python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")
response = client.chat.completions.create(
    model="berkerdooo/gemma-4-12B-it-NVFP4",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/cats.png"
                    },
                },
            ],
        }
    ],
    temperature=0,
    max_tokens=128,
)
print(response.choices[0].message.content)
```
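For images that are not reachable by URL, a common pattern is to embed the file as a base64 data URL in the same `image_url` field. The sketch below assumes the server accepts data URLs there, which you should verify against your vLLM version. The helper names and the `cat.png` path in the usage note are hypothetical.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_url(path):
    """Encode a local image file as a base64 data URL."""
    path = Path(path)
    mime_type, _ = mimetypes.guess_type(path.name)
    if mime_type is None or not mime_type.startswith("image/"):
        raise ValueError(f"Not a recognized image file: {path}")
    encoded = base64.b64encode(path.read_bytes()).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def build_image_message(path, prompt):
    """Build a user message with a text part and an inline image part."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": image_to_data_url(path)}},
        ],
    }
```

The result can be passed as `messages=[build_image_message("cat.png", "Describe this image.")]` to `client.chat.completions.create`.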
If you build prompts manually instead of using the server API, use the standard
Gemma 4 multimodal chat template and include the <|image|> token for image
inputs.
Reproduction Command
```bash
CUDA_VISIBLE_DEVICES=0,1 PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
python hf_ptq.py \
  --pyt_ckpt_path google/gemma-4-12B-it \
  --export_path ./gemma-4-12B-it-nvfp4 \
  --qformat nvfp4 \
  --kv_cache_qformat none \
  --dataset cnn_dailymail \
  --calib_size 512 \
  --calib_seq 512 \
  --batch_size 1 \
  --use_seq_device_map \
  --gpu_max_mem_percentage 0.90 \
  --attn_implementation sdpa \
  --skip_generate
```
Verification
- Export completed successfully with peak GPU memory of 23.53 GB on GPU 0 and 0.98 GB on GPU 1 using 2x NVIDIA GeForce RTX 5090 GPUs.
- config.json loads as Gemma4UnifiedForConditionalGeneration.
- AutoProcessor loads as Gemma4UnifiedProcessor.
- A dummy image-text processor call produced input_ids, attention_mask, mm_token_type_ids, pixel_values, and image_position_ids.
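Part of this verification can be repeated without loading any weights. The sketch below assumes the exported `config.json` records the model class under the standard Hugging Face `architectures` key; the function name is illustrative, and the export directory would be the `--export_path` used in the reproduction command.

```python
import json
from pathlib import Path

# Architecture name reported in the verification notes of this model card.
EXPECTED_ARCHITECTURE = "Gemma4UnifiedForConditionalGeneration"


def check_exported_config(export_dir, expected=EXPECTED_ARCHITECTURE):
    """Return True if config.json in export_dir lists the expected architecture."""
    config_path = Path(export_dir) / "config.json"
    config = json.loads(config_path.read_text(encoding="utf-8"))
    # Assumption: the class name is stored under the "architectures" key.
    return expected in config.get("architectures", [])
```

For example, `check_exported_config("./gemma-4-12B-it-nvfp4")` should return `True` for an export that matches the notes above.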
Plain transformers.AutoModelForCausalLM.from_pretrained is not the target
loader for this checkpoint unless the runtime understands ModelOpt-packed NVFP4
weights.
Model provider: berkerdooo
Model tree
- Base: google/gemma-4-12B-it
- Quantized: this model
Modalities
- Input: Video, Audio, Text, Image
- Output: Text