README
Quantization Details
This model was quantized using NVIDIA ModelOpt with the following configuration:
| Property | Value |
|---|---|
| Base model | google/gemma-4-12B-it |
| Quant method | NVIDIA ModelOpt (NVFP4) |
| Weight scheme | 4-bit float, block size 16 |
| Input activation | 4-bit float, block size 16 |
| Calibration dataset | CNN DailyMail (512 samples, max_seq_len 1024) |
| Size | ~11 GB (vs ~23 GB BF16) |
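To make "4-bit float, block size 16" concrete, here is a minimal pure-Python sketch of block-wise FP4 quantization. It is an illustration, not the ModelOpt implementation: the function names are hypothetical, and the per-block scale is kept as an ordinary Python float, whereas NVFP4 stores block scales in 8-bit floating point alongside a per-tensor scale.

```python
import math

# Magnitudes representable by a 4-bit E2M1 float (1 sign, 2 exponent, 1 mantissa bit).
E2M1_GRID = (0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0)
BLOCK_SIZE = 16  # matches the block size in the table above


def quantize_block(block):
    """Return (scale, codes); codes are signed values from the E2M1 grid."""
    amax = max(abs(x) for x in block)
    # Map the largest magnitude in the block onto the largest grid value.
    # Assumption: the scale stays a plain float instead of being rounded to FP8.
    scale = amax / E2M1_GRID[-1] if amax > 0 else 1.0
    codes = []
    for x in block:
        nearest = min(E2M1_GRID, key=lambda g: abs(abs(x) / scale - g))
        codes.append(math.copysign(nearest, x))
    return scale, codes


def dequantize_block(scale, codes):
    """Recover approximate values from a scale and its grid codes."""
    return [scale * c for c in codes]


def fake_quantize(values):
    """Quantize and immediately dequantize, one block of 16 values at a time."""
    out = []
    for start in range(0, len(values), BLOCK_SIZE):
        scale, codes = quantize_block(values[start:start + BLOCK_SIZE])
        out.extend(dequantize_block(scale, codes))
    return out


# Storage cost per quantized element: 4 bits plus one 8-bit scale shared by 16 elements.
BITS_PER_WEIGHT = 4 + 8 / BLOCK_SIZE
```

The reported size is larger than 4.5 bits per parameter would suggest because the excluded modules stay in BF16.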
Excluded from Quantization
The following modules are left unquantized in BF16 to preserve accuracy:
- `lm_head`
- `model.embed_vision*`
- `model.embed_audio*`
- All `self_attn` layers (layers 0–47)
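The wildcard entries above can be read as glob patterns over module names. The sketch below shows one way such an exclusion list could be applied. It makes several assumptions: the `*self_attn*` pattern is a guess at how "all `self_attn` layers" would be written, the module names in the usage example are hypothetical, and the actual recipe may match names differently.

```python
from fnmatch import fnmatchcase

# Patterns taken from the list above; "*self_attn*" is an assumed spelling
# of "all self_attn layers".
EXCLUDE_PATTERNS = (
    "lm_head",
    "model.embed_vision*",
    "model.embed_audio*",
    "*self_attn*",
)


def is_excluded(module_name, patterns=EXCLUDE_PATTERNS):
    """Return True if the module name matches any exclusion pattern."""
    return any(fnmatchcase(module_name, p) for p in patterns)


def split_modules(module_names):
    """Partition names into (kept in BF16, quantized to NVFP4)."""
    kept = [n for n in module_names if is_excluded(n)]
    quantized = [n for n in module_names if not is_excluded(n)]
    return kept, quantized
```

For example, with hypothetical names, `split_modules(["lm_head", "model.layers.0.self_attn.q_proj", "model.layers.0.mlp.down_proj"])` keeps the first two in BF16 and leaves only the MLP projection to be quantized.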
Quantization Script
The recipes and scripts used to quantize this model can be found in the following repository:
Model provider
vrfai
Model tree
- Base: google/gemma-4-12B-it
- Quantized: this model
Modalities
- Input: Video, Audio, Text, Image
- Output: Text