License: apache-2.0

Quantization

  • Quantizer: NVIDIA ModelOpt 0.44.0
  • Model Optimizer examples tag: 0.44.0
  • Quantization format: NVFP4
  • Quantization hardware: 2x NVIDIA GeForce RTX 5090 GPUs (Blackwell, compute capability 12.0)
  • KV-cache quantization: none baked into the checkpoint
  • Calibration data: cnn_dailymail, 512 text samples, sequence length 512
  • Export format: unified Hugging Face checkpoint

This checkpoint was quantized, calibrated, and exported on Blackwell RTX 5090 GPUs. The packed weights target NVIDIA's native Blackwell NVFP4 execution path in runtimes such as vLLM.
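
To give a rough sense of what NVFP4 packing means for storage, here is a minimal sketch. It assumes the commonly described NVFP4 layout: 4-bit values with one 8-bit scale shared by each block of 16 values. The function names are hypothetical, and the estimate ignores per-tensor scales and the modules that stay unquantized, so it is not a size figure for this checkpoint.

```python
import math


def nvfp4_payload_bytes(num_weights: int, block_size: int = 16) -> int:
    """Approximate bytes for packed 4-bit values plus one 8-bit scale per block.

    Assumption: 16-element blocks with 1-byte scales; per-tensor scales ignored.
    """
    packed_values = math.ceil(num_weights * 4 / 8)      # two 4-bit values per byte
    block_scales = math.ceil(num_weights / block_size)  # one byte per block
    return packed_values + block_scales


def bits_per_weight(block_size: int = 16) -> float:
    """Effective bits per quantized weight under the same assumption."""
    return 4 + 8 / block_size


print(bits_per_weight())               # 4.5
print(nvfp4_payload_bytes(1_000_000))  # 562500
```

Under these assumptions a quantized tensor needs about 4.5 bits per weight, compared with 16 bits for a BF16 tensor.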

Multimodal files from the source checkpoint were preserved, including processor_config.json, tokenizer.json, tokenizer_config.json, chat_template.jinja, and generation_config.json. The ModelOpt export kept the multimodal projection modules and the language-model output head unquantized:

  • model.embed_vision*
  • model.embed_audio*
  • lm_head
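
As an illustration of how the patterns above can be read, the sketch below treats them as shell-style globs and checks module names against them. This is an assumption made for illustration: ModelOpt's own matching logic may differ, and the module names in the example are hypothetical rather than taken from the checkpoint.

```python
from fnmatch import fnmatchcase

# Patterns copied from the list above.
UNQUANTIZED_PATTERNS = ["model.embed_vision*", "model.embed_audio*", "lm_head"]


def is_kept_unquantized(module_name: str, patterns=UNQUANTIZED_PATTERNS) -> bool:
    """Return True if the module name matches any exclusion pattern (glob-style)."""
    return any(fnmatchcase(module_name, pattern) for pattern in patterns)


# Hypothetical module names, used only to exercise the helper.
for name in [
    "model.embed_vision.embedding_projection",
    "model.language_model.layers.0.mlp.down_proj",
    "lm_head",
]:
    print(name, is_kept_unquantized(name))
```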

The exported processor was smoke-tested with image input using the Gemma image token <|image|>.

vLLM Usage

Use a vLLM build with ModelOpt NVFP4 support and run on Blackwell-class GPUs for native NVFP4 execution. Pass quantization="modelopt_fp4" explicitly when loading this checkpoint.

Python

bash

uv pip install -U vllm

python

from vllm import LLM, SamplingParams

# Load the checkpoint through vLLM's ModelOpt NVFP4 path.
llm = LLM(
    model="berkerdooo/gemma-4-12B-it-NVFP4",
    quantization="modelopt_fp4",
    trust_remote_code=True,
)

# Greedy decoding (temperature=0) for a single prompt.
outputs = llm.generate(
    ["Explain why the sky is blue."],
    SamplingParams(max_tokens=128, temperature=0),
)
print(outputs[0].outputs[0].text)

OpenAI-Compatible Server

bash

vllm serve berkerdooo/gemma-4-12B-it-NVFP4 \
    --quantization modelopt_fp4 \
    --trust-remote-code

python

from openai import OpenAI

# Point the client at the local vLLM server; the API key is a placeholder.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

response = client.chat.completions.create(
    model="berkerdooo/gemma-4-12B-it-NVFP4",
    messages=[
        {
            "role": "user",
            "content": "Explain NVFP4 quantization in one paragraph.",
        }
    ],
    temperature=0,
    max_tokens=128,
)
print(response.choices[0].message.content)

Multimodal Request

The processor, tokenizer, chat template, and image token from the original Gemma 4 checkpoint are included in this repo. With the vLLM server above, image inputs can be sent through the OpenAI-compatible chat API:

python

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# A single user turn that mixes a text part with an image part.
response = client.chat.completions.create(
    model="berkerdooo/gemma-4-12B-it-NVFP4",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {
                    "type": "image_url",
                    "image_url": {
                        "url": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/cats.png"
                    },
                },
            ],
        }
    ],
    temperature=0,
    max_tokens=128,
)
print(response.choices[0].message.content)
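
The request above points at a remote image URL. For a local file, one option is to embed the image as a base64 data URL in the same image_url field. The sketch below assumes the server accepts data URLs there; the helper names image_to_data_url and build_image_message are hypothetical.

```python
import base64
import mimetypes
from pathlib import Path


def image_to_data_url(path: str) -> str:
    """Encode a local image file as a base64 data URL."""
    mime, _ = mimetypes.guess_type(str(path))
    if mime is None or not mime.startswith("image/"):
        raise ValueError(f"Cannot infer an image MIME type for {path!r}")
    encoded = base64.b64encode(Path(path).read_bytes()).decode("ascii")
    return f"data:{mime};base64,{encoded}"


def build_image_message(prompt: str, image_path: str) -> dict:
    """Build a user message with one text part and one image part."""
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {"type": "image_url", "image_url": {"url": image_to_data_url(image_path)}},
        ],
    }
```

The returned dictionary can be passed as the single element of messages in client.chat.completions.create, in place of the message shown in the request above.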

If you build prompts manually instead of using the server API, use the standard Gemma 4 multimodal chat template and include the <|image|> token for image inputs.

Reproduction Command

bash

CUDA_VISIBLE_DEVICES=0,1 PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True \
python hf_ptq.py \
    --pyt_ckpt_path google/gemma-4-12B-it \
    --export_path ./gemma-4-12B-it-nvfp4 \
    --qformat nvfp4 \
    --kv_cache_qformat none \
    --dataset cnn_dailymail \
    --calib_size 512 \
    --calib_seq 512 \
    --batch_size 1 \
    --use_seq_device_map \
    --gpu_max_mem_percentage 0.90 \
    --attn_implementation sdpa \
    --skip_generate

Verification

  • Export completed successfully with peak GPU memory of 23.53 GB on GPU 0 and 0.98 GB on GPU 1 using 2x NVIDIA GeForce RTX 5090 GPUs.
  • config.json loads as Gemma4UnifiedForConditionalGeneration.
  • AutoProcessor loads as Gemma4UnifiedProcessor.
  • A dummy image-text processor call produced input_ids, attention_mask, mm_token_type_ids, pixel_values, and image_position_ids.
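
The file-level part of such a check can be scripted. The sketch below only confirms that the files named in this README exist in an export directory. The helper name missing_export_files is hypothetical, and the check does not validate weights or configuration contents.

```python
from pathlib import Path

# File names mentioned in this README; a real export contains more files (e.g. weights).
EXPECTED_FILES = [
    "config.json",
    "processor_config.json",
    "tokenizer.json",
    "tokenizer_config.json",
    "chat_template.jinja",
    "generation_config.json",
]


def missing_export_files(export_dir: str, expected=EXPECTED_FILES) -> list:
    """Return the expected file names that are absent from export_dir."""
    root = Path(export_dir)
    return [name for name in expected if not (root / name).is_file()]
```

For example, missing_export_files("./gemma-4-12B-it-nvfp4") returns an empty list when all six files are present in the export path used by the reproduction command.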

This checkpoint is not intended to be loaded with plain transformers.AutoModelForCausalLM.from_pretrained. Use a runtime that understands ModelOpt-packed NVFP4 weights, such as vLLM.

Model provider: berkerdooo

Base model: google/gemma-4-12B-it (this model is a quantized version of it)

Input modalities: Video, Audio, Text, Image

Output modalities: Text
