vrfai/gemma-4-12B-it-fp8 API & Inference Endpoint | FriendliAI

Dedicated Endpoints

Run inference for this model on single-tenant GPUs with unmatched speed and reliability at scale.

Get help setting up a custom Dedicated Endpoint.

Talk with our engineers to get a quote for reserved GPU instances with discounts.

README

Quantization Details

This model was quantized with NVIDIA ModelOpt using the following configuration:

Property	Value
Base model	google/gemma-4-12B-it
Quant method	NVIDIA ModelOpt (FP8 E4M3 - `num_bits: (4, 3)`)
Weight scheme	Per-channel (`axis: 0`)
Input activation	Dynamic Per-token (`type: dynamic`)
Calibration dataset	CNN DailyMail (512 samples, max_seq_len 1024)
Calibration algorithm	max
Size	~15 GB (vs ~23 GB BF16)
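
The settings in the table can be illustrated with a small simulation. This is a minimal sketch, not ModelOpt's implementation: it assumes the common FP8 convention of scaling each row so that its absolute maximum maps to 448 (the largest finite E4M3 value) and emulates E4M3 rounding in pure Python. The names `fake_quant_e4m3`, `quantize_rows`, and `dequantize_rows` are hypothetical, as is the example matrix. Applied to a weight matrix, one scale per row corresponds to per-channel scaling along `axis: 0`; applied to an activation matrix at inference time, one scale per row corresponds to dynamic per-token scaling.

```python
import math

E4M3_MAX = 448.0  # largest finite value in FP8 E4M3


def fake_quant_e4m3(x: float) -> float:
    """Round x to the nearest FP8 E4M3 value (4 exponent bits, 3 mantissa bits)."""
    if x == 0.0:
        return 0.0
    x = max(-E4M3_MAX, min(E4M3_MAX, x))  # saturate instead of overflowing
    # frexp gives |x| = m * 2**e with 0.5 <= m < 1, so floor(log2|x|) == e - 1.
    exponent = math.frexp(abs(x))[1] - 1
    exponent = max(exponent, -6)  # below 2**-6 the format is subnormal
    step = 2.0 ** (exponent - 3)  # spacing between neighbours with 3 mantissa bits
    return round(x / step) * step


def quantize_rows(matrix):
    """Quantize each row with its own max-based scale; return (codes, scales)."""
    codes, scales = [], []
    for row in matrix:
        amax = max(abs(v) for v in row)  # "max" calibration: use the absolute maximum
        scale = amax / E4M3_MAX if amax > 0 else 1.0
        codes.append([fake_quant_e4m3(v / scale) for v in row])
        scales.append(scale)
    return codes, scales


def dequantize_rows(codes, scales):
    """Map simulated FP8 codes back to real values."""
    return [[c * s for c in row] for row, s in zip(codes, scales)]


# Hypothetical 2x3 weight matrix: rows are output channels (axis 0).
weights = [[0.5, -2.0, 0.3], [4.0, 1.0, -0.07]]
codes, scales = quantize_rows(weights)
restored = dequantize_rows(codes, scales)
```

In this sketch each row's largest-magnitude entry maps to ±448, and the remaining entries are recovered to within 1/16 of their original value as long as their codes stay in the normal (non-subnormal) range.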

Excluded from Quantization

The following modules are excluded from quantization and kept in BF16 to preserve accuracy:

lm_head
model.embed_vision*
model.embed_audio*
All self_attn layers (layers 0–47)
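
One way to express the exclusion list above is as wildcard patterns matched against module names. The sketch below uses Python's standard `fnmatch` module; the `*self_attn*` pattern, the `is_excluded` helper, and the example module names are assumptions made for illustration and are not taken from the actual quantization recipe.

```python
from fnmatch import fnmatchcase

# Patterns mirroring the exclusion list; the last one is an assumed wildcard
# standing in for "all self_attn layers".
EXCLUDE_PATTERNS = [
    "lm_head",
    "model.embed_vision*",
    "model.embed_audio*",
    "*self_attn*",
]


def is_excluded(module_name: str) -> bool:
    """Return True if the module should stay in BF16 (skip FP8 quantization)."""
    return any(fnmatchcase(module_name, pattern) for pattern in EXCLUDE_PATTERNS)


# Hypothetical module names, for illustration only.
module_names = [
    "lm_head",
    "model.embed_vision.proj",
    "model.layers.0.self_attn.q_proj",
    "model.layers.0.mlp.up_proj",
]

kept_in_bf16 = [name for name in module_names if is_excluded(name)]
quantized_to_fp8 = [name for name in module_names if not is_excluded(name)]
```

With these hypothetical names, only the MLP projection ends up in `quantized_to_fp8`; everything else matches an exclusion pattern.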

Quantization Script

The recipes and scripts used to quantize this model can be found in the following repository:

VinRobotics/model-quantization-recipes

Model provider

vrfai

Model tree

Base: google/gemma-4-12B-it
Quantized: this model

Modalities

Input: Video, Audio, Text, Image
Output: Text
