README
License: apache-2.0

recipe.yaml
| Setting | Value |
|---|---|
| Modifier | QuantizationModifier |
| Targets | Linear |
| Scheme | NVFP4 |
| Ignore Layers | `lm_head`, `re:.*embed.*`, `re:.*router.*`, `re:.*vision_tower.*` |
| Bypass Divisibility Checks | false |
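
To illustrate how the ignore list in the recipe might select layers, here is a minimal sketch. It assumes that entries prefixed with `re:` are regular expressions matched from the start of a module name and that unprefixed entries are exact names; the real matching logic lives in llm-compressor and may differ in detail. The helper `is_ignored` and the module names are hypothetical, chosen only for illustration.

```python
import re

# Ignore list copied from the recipe table.
IGNORE = ["lm_head", "re:.*embed.*", "re:.*router.*", "re:.*vision_tower.*"]


def is_ignored(module_name: str, patterns=IGNORE) -> bool:
    """Return True if module_name matches any ignore entry.

    Assumption: "re:" entries are regexes matched from the start of the
    name; all other entries must equal the name exactly.
    """
    for pattern in patterns:
        if pattern.startswith("re:"):
            if re.match(pattern[3:], module_name):
                return True
        elif pattern == module_name:
            return True
    return False


# Hypothetical module names, not taken from the actual checkpoint.
names = [
    "lm_head",
    "model.language_model.embed_tokens",
    "model.language_model.layers.0.mlp.router.proj",
    "model.vision_tower.encoder.layers.0.self_attn.q_proj",
    "model.language_model.layers.0.self_attn.q_proj",
]

for name in names:
    status = "skipped" if is_ignored(name) else "quantized"
    print(f"{name} -> {status}")
```

Under these assumptions only the last name, an ordinary attention projection in the language model, would be quantized to NVFP4; the output head, embeddings, routers, and the whole vision tower would keep their original precision.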
Memory footprint
| Model | Memory Footprint |
|---|---|
| Original (BF16) | ~49 GB |
| NVFP4 | ~16.5 GB |

| Metric | Value |
|---|---|
| Compression | ~3.0× |
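
The compression figure follows directly from the two footprint values. A short sketch of the arithmetic, using the approximate sizes from the table (the variable names are illustrative):

```python
# Approximate footprints from the table, in GB.
original_gb = 49.0   # BF16 checkpoint
nvfp4_gb = 16.5      # NVFP4 checkpoint

ratio = original_gb / nvfp4_gb        # how many times smaller
saved = 1 - nvfp4_gb / original_gb    # fraction of memory saved

print(f"compression: {ratio:.2f}x")
print(f"memory saved: {saved:.1%}")
```

The ratio is below the 4× that a pure 16-bit to 4-bit conversion would suggest. A likely reason is that the ignored layers stay in their original precision and the quantized weights carry additional scale metadata, though the source does not break the footprint down.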
llm-compressor
An open-source library developed by the vLLM team for optimizing large language models (LLMs) for production deployment: https://github.com/vllm-project/llm-compressor
Model provider: prithivMLmods
Model tree
Base: google/gemma-4-26B-A4B-it-qat-q4_0-unquantized
Quantized: this model
Modalities
Input: Text, Image
Output: Text