README

Quantization Details

This model was quantized using NVIDIA ModelOpt with the following configuration:

Property               Value
Base model             google/gemma-4-12B-it
Quant method           NVIDIA ModelOpt (FP8 E4M3 - num_bits: (4, 3))
Weight scheme          Per-channel (axis: 0)
Input activation       Dynamic Per-token (type: dynamic)
Calibration dataset    CNN DailyMail (512 samples, max_seq_len 1024)
Calibration algorithm  max
Size                   ~15 GB (vs ~23 GB BF16)
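
To make the scheme in the table concrete, the following is a minimal pure-Python sketch of max-scaled FP8 E4M3 quantization with one scale per row. It is an illustration only, not ModelOpt's implementation: the helper names (round_to_e4m3, quantize_rows, dequantize_rows) are hypothetical, and it assumes the common E4M3 variant with a largest finite value of 448. For a weight matrix laid out as [out_features, in_features], one scale per row corresponds to per-channel scaling along axis 0; for an activation matrix laid out as [tokens, hidden], the same routine applied at inference time corresponds to dynamic per-token scaling.

```python
import math

E4M3_MAX = 448.0  # assumed largest finite FP8 E4M3 value


def round_to_e4m3(x: float) -> float:
    """Round x to the nearest E4M3 grid point, saturating at +/-448."""
    if x == 0.0:
        return 0.0
    a = min(abs(x), E4M3_MAX)
    _, exp = math.frexp(a)   # a = m * 2**exp with 0.5 <= m < 1
    e = max(exp - 1, -6)     # clamp to the smallest normal exponent (subnormal range)
    step = 2.0 ** (e - 3)    # 3 mantissa bits give 8 steps per power of two
    return math.copysign(round(a / step) * step, x)


def quantize_rows(matrix):
    """Quantize each row with its own "max" scale: scale = amax(row) / 448."""
    quantized, scales = [], []
    for row in matrix:
        amax = max(abs(v) for v in row)
        scale = amax / E4M3_MAX if amax > 0 else 1.0
        scales.append(scale)
        quantized.append([round_to_e4m3(v / scale) for v in row])
    return quantized, scales


def dequantize_rows(quantized, scales):
    """Recover approximate real values by multiplying each row by its scale."""
    return [[v * s for v in row] for row, s in zip(quantized, scales)]


# Toy example: two "output channels" with different magnitudes.
weights = [[1.0, -2.0], [0.5, 0.25]]
q, s = quantize_rows(weights)
approx = dequantize_rows(q, s)
```

Because each row gets its own scale, a row with small values is not forced onto the coarse part of the grid by a large value elsewhere in the matrix, which is the usual motivation for per-channel and per-token scaling over a single per-tensor scale.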

Excluded from Quantization

The following modules are kept in full precision (BF16) to preserve accuracy:

  • lm_head
  • model.embed_vision*
  • model.embed_audio*
  • All self_attn layers (layers 0–47)
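
The exclusion list above can be read as a set of name patterns matched against module names. The sketch below shows that idea using Python's standard fnmatch module. It is an assumption-laden illustration rather than the provider's actual recipe: the "*self_attn*" pattern is a guess at how "all self_attn layers" might be expressed, and the module names in the example are hypothetical.

```python
from fnmatch import fnmatchcase

# Patterns taken from the list above; "*self_attn*" is an assumed spelling.
EXCLUDE_PATTERNS = [
    "lm_head",
    "model.embed_vision*",
    "model.embed_audio*",
    "*self_attn*",
]


def is_excluded(module_name: str) -> bool:
    """Return True if the module would stay in BF16 under these patterns."""
    return any(fnmatchcase(module_name, p) for p in EXCLUDE_PATTERNS)


# Hypothetical module names, for illustration only.
for name in [
    "lm_head",
    "model.language_model.layers.0.self_attn.q_proj",
    "model.language_model.layers.0.mlp.down_proj",
]:
    print(name, "-> BF16" if is_excluded(name) else "-> FP8")
```

Under these assumed patterns, only the MLP projection in the example would be quantized to FP8.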

Quantization Script

The recipes and scripts used to quantize this model can be found in the following repository:

Model provider: vrfai

Model tree

Base: google/gemma-4-12B-it
Quantized: this model

Modalities

Input: Video, Audio, Text, Image
Output: Text
