Quantization Details

This model was quantized with NVIDIA ModelOpt using the following configuration:

Property             Value
Base model           google/gemma-4-12B-it
Quant method         NVIDIA ModelOpt (NVFP4)
Weight scheme        4-bit float, block size 16
Input activation     4-bit float, block size 16
Calibration dataset  CNN DailyMail (512 samples, max_seq_len 1024)
Size                 ~11 GB (vs ~23 GB BF16)
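
To make the "4-bit float, block size 16" scheme listed above more concrete, here is a minimal pure-Python sketch of block-wise 4-bit float quantization. It is an illustration under stated assumptions, not the ModelOpt implementation: it assumes the 4-bit values use the E2M1 layout (magnitudes 0, 0.5, 1, 1.5, 2, 3, 4, 6), keeps each block's scale as an ordinary Python float rather than a low-precision scale, and omits any per-tensor scaling. The function names are hypothetical.

```python
import math

# Magnitudes representable by a 4-bit E2M1 float
# (1 sign bit, 2 exponent bits, 1 mantissa bit). Assumed layout.
E2M1_LEVELS = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK_SIZE = 16  # block size stated in the configuration above


def quantize_block(block):
    """Quantize one block; return (scale, signed E2M1 codes)."""
    max_abs = max(abs(x) for x in block)
    # Map the largest magnitude in the block onto the largest level (6.0).
    scale = max_abs / 6.0 if max_abs > 0 else 1.0
    codes = []
    for x in block:
        # Round the scaled magnitude to the nearest representable level.
        mag = min(E2M1_LEVELS, key=lambda level: abs(abs(x) / scale - level))
        codes.append(math.copysign(mag, x))
    return scale, codes


def fake_quantize(values):
    """Quantize then dequantize, block by block, to expose rounding error."""
    out = []
    for start in range(0, len(values), BLOCK_SIZE):
        scale, codes = quantize_block(values[start:start + BLOCK_SIZE])
        out.extend(code * scale for code in codes)
    return out


def effective_bits_per_weight(value_bits=4, scale_bits=8, block_size=BLOCK_SIZE):
    """Storage cost per weight, assuming one scale_bits-wide scale per block."""
    return value_bits + scale_bits / block_size
```

Under the assumption of one 8-bit scale per block of 16, `effective_bits_per_weight()` gives 4.5 bits per quantized weight. The reported ~11 GB is larger than that figure alone would suggest because some modules stay in BF16.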

Excluded from Quantization

The following modules are kept in full precision (BF16) to preserve accuracy:

  • lm_head
  • model.embed_vision*
  • model.embed_audio*
  • All self_attn layers (layers 0–47)
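
The exclusion list above reads as a set of wildcard patterns over module names. A minimal sketch of how such patterns could be checked is shown below, using Python's standard `fnmatch` module. The `*self_attn*` pattern and the example module paths are assumptions made for illustration; the actual patterns and module names used in the quantization recipe may differ.

```python
from fnmatch import fnmatchcase

# Patterns mirroring the exclusion list; "*self_attn*" is an assumed
# spelling for "all self_attn layers".
EXCLUDE_PATTERNS = [
    "lm_head",
    "model.embed_vision*",
    "model.embed_audio*",
    "*self_attn*",
]


def is_excluded(module_name):
    """Return True if the module should stay in BF16 (not be quantized)."""
    return any(fnmatchcase(module_name, pattern) for pattern in EXCLUDE_PATTERNS)


# Hypothetical module names, for illustration only.
for name in [
    "lm_head",
    "model.embed_vision.proj",
    "model.language_model.layers.3.self_attn.q_proj",
    "model.language_model.layers.3.mlp.down_proj",
]:
    print(name, "-> keep BF16" if is_excluded(name) else "-> quantize")
```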

Quantization Script

The recipes and scripts used to quantize this model can be found in the following repository:

Model provider

vrfai

Model tree

Base

google/gemma-4-12B-it

Quantized

this model

Modalities

Input

Video, Audio, Text, Image

Output

Text

Supported Functionality

Model APIs

Dedicated Endpoints

Container
