berkerdooo/gemma-4-12B-it-NVFP4 API & Inference Endpoint

Quantization

Quantizer: NVIDIA ModelOpt 0.44.0
Model Optimizer examples tag: 0.44.0
Quantization format: NVFP4
Quantization hardware: 2x NVIDIA GeForce RTX 5090 GPUs (Blackwell, compute capability 12.0)
KV-cache quantization: none baked into the checkpoint
Calibration data: cnn_dailymail, 512 text samples, sequence length 512
Export format: unified Hugging Face checkpoint

This checkpoint was quantized, calibrated, and exported on Blackwell RTX 5090 GPUs. The packed weights target NVIDIA's native Blackwell NVFP4 execution path in runtimes such as vLLM.
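
For readers unfamiliar with the format, the following is a rough sketch of what 4-bit block quantization of this kind does to a group of weights. It assumes the commonly described NVFP4 layout (E2M1 element values, 16-element blocks, one scale per block) and simplifies it: the block scale is kept in full precision instead of being rounded to FP8, and no per-tensor scale is applied. The names E2M1_GRID, BLOCK_SIZE, and fake_quantize_block are illustrative and are not ModelOpt APIs.

```python
import math

# Magnitudes representable by a 4-bit E2M1 float (assumed element format).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16  # assumed number of elements that share one scale


def fake_quantize_block(block):
    """Round one block to the E2M1 grid and map it back to real values.

    Simplification: the per-block scale stays in full precision here,
    whereas a real NVFP4 kernel would store it in a low-precision format.
    """
    amax = max(abs(x) for x in block)
    if amax == 0.0:
        return [0.0] * len(block)
    # Map the largest magnitude in the block onto the largest grid value.
    scale = amax / E2M1_GRID[-1]
    dequantized = []
    for x in block:
        nearest = min(E2M1_GRID, key=lambda g: abs(abs(x) / scale - g))
        dequantized.append(math.copysign(nearest * scale, x))
    return dequantized


if __name__ == "__main__":
    example = [6.0, -3.0, 1.2, 0.0] + [0.0] * (BLOCK_SIZE - 4)
    print(fake_quantize_block(example)[:4])
```

Values that fall between grid points, such as 1.2 in the example, are rounded to the nearest representable magnitude, which is where the quantization error comes from.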

Multimodal files from the source checkpoint were preserved, including processor_config.json, tokenizer.json, tokenizer_config.json, chat_template.jinja, and generation_config.json. The ModelOpt export left the vision and audio embedding modules and the output head unquantized:

model.embed_vision*
model.embed_audio*
lm_head
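
The patterns above read as shell-style globs. A minimal sketch of how such patterns could be matched against module names is shown below. It assumes plain glob semantics via Python's fnmatch, and the module names in the usage note are hypothetical; the matching logic actually used by ModelOpt or vLLM may differ.

```python
from fnmatch import fnmatchcase

# Patterns copied from the list above.
EXCLUDE_PATTERNS = ("model.embed_vision*", "model.embed_audio*", "lm_head")


def is_left_unquantized(module_name):
    """Return True if the module name matches any exclusion pattern."""
    return any(fnmatchcase(module_name, pattern) for pattern in EXCLUDE_PATTERNS)
```

For example, is_left_unquantized("model.embed_vision.proj") would be True for that hypothetical module name, while a hypothetical decoder layer such as "model.layers.0.mlp.up_proj" would not match.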

The exported processor was smoke-tested with image input using the Gemma image token <|image|>.

vLLM Usage

Use a vLLM build with ModelOpt NVFP4 support and run on Blackwell-class GPUs for native NVFP4 execution. Pass quantization="modelopt_fp4" explicitly when loading this checkpoint.
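
Before launching vLLM, it can help to confirm that the local GPUs are Blackwell-class. The sketch below is one way to do that with PyTorch. It assumes compute capability major versions 10 and 12 identify Blackwell parts (12.0 is the RTX 5090 value given above; 10.x for data-center Blackwell is an assumption), and the helper names are illustrative.

```python
# Assumed compute capability major versions for Blackwell-class GPUs.
BLACKWELL_MAJORS = (10, 12)


def is_blackwell_class(capability):
    """capability is a (major, minor) tuple such as (12, 0)."""
    major, _minor = capability
    return major in BLACKWELL_MAJORS


def local_gpu_capabilities():
    """Return the compute capability of each visible GPU, or [] if unavailable."""
    try:
        import torch
    except ImportError:
        return []
    if not torch.cuda.is_available():
        return []
    return [
        torch.cuda.get_device_capability(index)
        for index in range(torch.cuda.device_count())
    ]


if __name__ == "__main__":
    for index, capability in enumerate(local_gpu_capabilities()):
        status = "Blackwell-class" if is_blackwell_class(capability) else "other"
        print(f"GPU {index}: compute capability {capability} ({status})")
```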

Python

bash
uv pip install -U vllm

python
from vllm import LLM, SamplingParams

llm = LLM(
    model="berkerdooo/gemma-4-12B-it-NVFP4",
    quantization="modelopt_fp4",
    trust_remote_code=True,
)

outputs = llm.generate(
    ["Explain why the sky is blue."],
    SamplingParams(max_tokens=128, temperature=0),
)
print(outputs[0].outputs[0].text)

OpenAI-Compatible Server

bash
vllm serve berkerdooo/gemma-4-12B-it-NVFP4 \
  --quantization modelopt_fp4 \
  --trust-remote-code

python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="berkerdooo/gemma-4-12B-it-NVFP4",
    messages=[
        {
            "role": "user",
            "content": "Explain NVFP4 quantization in one paragraph.",
        }
    ],
    temperature=0,
    max_tokens=128,
)

print(response.choices[0].message.content)

Multimodal Request

The processor, tokenizer, chat template, and image token from the original Gemma 4 checkpoint are included in this repo. With the vLLM server above, image inputs can be sent through the OpenAI-compatible chat API:

python
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="berkerdooo/gemma-4-12B-it-NVFP4",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/cats.png"
                    },
                },
            ],
        }
    ],
    temperature=0,
    max_tokens=128,
)

print(response.choices[0].message.content)
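
For a local image rather than a public URL, one common approach with OpenAI-compatible servers is to send the image as a base64 data URL in the same image_url field. The sketch below shows that encoding step. The helper names are illustrative, and it assumes the server accepts data URLs for image inputs.

```python
import base64


def image_bytes_to_data_url(data, mime_type="image/png"):
    """Encode raw image bytes as a data URL usable in an image_url field."""
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{mime_type};base64,{encoded}"


def build_image_message(prompt, data_url):
    """Build a user message with one text part and one image part."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": data_url}},
        ],
    }
```

The returned message can be passed in the messages list of client.chat.completions.create, for example after reading the bytes with open("cats.png", "rb").read(), where cats.png stands in for any local file.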

If you build prompts manually instead of using the server API, use the standard Gemma 4 multimodal chat template and include the <|image|> token for image inputs.

Reproduction Command

bash
CUDA_VISIBLE_DEVICES=0,1 PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
python hf_ptq.py \
  --pyt_ckpt_path google/gemma-4-12B-it \
  --export_path ./gemma-4-12B-it-nvfp4 \
  --qformat nvfp4 \
  --kv_cache_qformat none \
  --dataset cnn_dailymail \
  --calib_size 512 \
  --calib_seq 512 \
  --batch_size 1 \
  --use_seq_device_map \
  --gpu_max_mem_percentage 0.90 \
  --attn_implementation sdpa \
  --skip_generate
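
After the export finishes, a quick check that the processor and tokenizer files listed earlier are present in the export directory can catch an incomplete copy. The following is a minimal sketch; the file list is taken from this README, and missing_export_files is an illustrative helper, not part of ModelOpt.

```python
from pathlib import Path

# Files this README says were preserved from the source checkpoint.
EXPECTED_FILES = (
    "processor_config.json",
    "tokenizer.json",
    "tokenizer_config.json",
    "chat_template.jinja",
    "generation_config.json",
)


def missing_export_files(export_dir):
    """Return the expected file names that are absent from export_dir."""
    root = Path(export_dir)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


if __name__ == "__main__":
    missing = missing_export_files("./gemma-4-12B-it-nvfp4")
    print("missing files:", missing or "none")
```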

Verification

Export completed successfully with peak GPU memory of 23.53 GB on GPU 0 and 0.98 GB on GPU 1 using 2x NVIDIA GeForce RTX 5090 GPUs.
config.json loads as Gemma4UnifiedForConditionalGeneration.
AutoProcessor loads as Gemma4UnifiedProcessor.
A dummy image-text processor call produced input_ids, attention_mask, mm_token_type_ids, pixel_values, and image_position_ids.
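
The processor check above could be repeated along the following lines. This is a sketch under assumptions: it uses the transformers multimodal chat-template message convention ({"type": "image", "image": ...}), which is assumed to apply to this processor, and the helper names are illustrative. The expected key names come from the verification note above.

```python
# Output keys reported by the verification note above.
EXPECTED_KEYS = {
    "input_ids",
    "attention_mask",
    "mm_token_type_ids",
    "pixel_values",
    "image_position_ids",
}


def missing_processor_keys(batch):
    """Return the expected keys that are absent from a processor output."""
    return sorted(EXPECTED_KEYS - set(batch.keys()))


def smoke_test(repo_id="berkerdooo/gemma-4-12B-it-NVFP4"):
    """Run a dummy image-text call; downloads the processor files."""
    from PIL import Image
    from transformers import AutoProcessor

    processor = AutoProcessor.from_pretrained(repo_id)
    image = Image.new("RGB", (224, 224), color="white")  # blank dummy image
    messages = [
        {
            "role": "user",
            "content": [
                {"type": "image", "image": image},
                {"type": "text", "text": "Describe this image."},
            ],
        }
    ]
    # Assumed message format; adjust if the processor expects another layout.
    batch = processor.apply_chat_template(
        messages,
        add_generation_prompt=True,
        tokenize=True,
        return_dict=True,
        return_tensors="pt",
    )
    return missing_processor_keys(batch)
```

Calling smoke_test() should return an empty list if the processor produces every key listed in the verification note.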

This checkpoint is not intended to be loaded with plain transformers.AutoModelForCausalLM.from_pretrained. Use a runtime that understands ModelOpt-packed NVFP4 weights, such as vLLM.
