Dedicated Endpoints
Run inference for this model on single-tenant GPUs with unmatched speed and reliability at scale.
Get help setting up a custom Dedicated Endpoint.
Talk with one of our engineers to get a quote for reserved GPU instances at a discount.
README
License: apache-2.0
recipe.yaml

    default_stage:
      default_modifiers:
        QuantizationModifier:
          targets: [Linear]
          ignore: [lm_head, 're:.*vision_tower.*', 're:.*embed_vision.*']
          scheme: FP8_DYNAMIC
          bypass_divisibility_checks: false
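In recipe.yaml, the `targets` and `ignore` fields decide which layers are quantized: every `Linear` layer is converted, except `lm_head` and anything in the vision tower or vision embedding. The sketch below mimics that selection rule in plain Python. It assumes that `ignore` entries prefixed with `re:` are regular expressions matched from the start of the module name and that other entries must equal the name exactly. It is an illustration of the rule, not llm-compressor's own matching code, and the module names are hypothetical.

```python
import re

# Values copied from recipe.yaml.
TARGETS = ["Linear"]
IGNORE = ["lm_head", "re:.*vision_tower.*", "re:.*embed_vision.*"]


def matches(name: str, pattern: str) -> bool:
    """Return True if a module name matches one ignore entry.

    Assumption: "re:" entries are regexes applied with re.match;
    all other entries must equal the module name exactly.
    """
    if pattern.startswith("re:"):
        return re.match(pattern[3:], name) is not None
    return name == pattern


def should_quantize(name: str, module_type: str) -> bool:
    """Quantize only targeted module types that no ignore entry matches."""
    if module_type not in TARGETS:
        return False
    return not any(matches(name, p) for p in IGNORE)


# Hypothetical module names and types, for illustration only.
modules = [
    ("model.language_model.layers.0.self_attn.q_proj", "Linear"),
    ("model.language_model.layers.0.mlp.down_proj", "Linear"),
    ("model.vision_tower.encoder.layers.0.mlp.fc1", "Linear"),
    ("model.embed_vision.embedding_projection", "Linear"),
    ("model.language_model.norm", "RMSNorm"),
    ("lm_head", "Linear"),
]

for name, module_type in modules:
    status = "FP8" if should_quantize(name, module_type) else "kept"
    print(f"{status:4}  {name}")
```

Under these assumptions, only the two language-model projections are marked for FP8. The vision components, the norm layer (not a targeted type), and `lm_head` keep their original precision, which is a common choice for layers that are small or sensitive to quantization error.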
llm-compressor
An open-source library from the vLLM team for optimizing large language models (LLMs) for production deployment: https://github.com/vllm-project/llm-compressor
Model provider
prithivMLmods
Model tree
Base
google/gemma-4-31B-it-qat-q4_0-unquantized
Quantized
this model
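This model differs from its base checkpoint through the `FP8_DYNAMIC` scheme named in recipe.yaml. In the llm-compressor preset, as we understand it, weights are stored in FP8 with per-output-channel scales fixed at quantization time, while activation scales are computed on the fly for each token at inference time. The sketch below illustrates only that per-token activation-scaling step. It assumes the FP8 E4M3 format (largest finite magnitude 448) and omits the final cast to FP8. The function names and sample values are hypothetical.

```python
# Largest finite magnitude representable in FP8 E4M3.
FP8_E4M3_MAX = 448.0


def per_token_scales(activations: list[list[float]]) -> list[float]:
    """Compute one scale per token (row) so its largest |value| maps to FP8_E4M3_MAX."""
    scales = []
    for row in activations:
        amax = max(abs(v) for v in row)
        # Guard against an all-zero row, which would otherwise give a zero scale.
        scales.append(amax / FP8_E4M3_MAX if amax > 0 else 1.0)
    return scales


def scale_rows(activations: list[list[float]], scales: list[float]) -> list[list[float]]:
    """Divide each row by its own scale; a real kernel would then cast to FP8."""
    return [[v / s for v in row] for row, s in zip(activations, scales)]


# Hypothetical activations for two tokens with very different ranges.
x = [
    [0.5, -2.0, 1.0],
    [100.0, -896.0, 224.0],
]
scales = per_token_scales(x)
print(scales)
print(scale_rows(x, scales))
```

Because each token gets its own scale, a token with large activations (the second row) does not force the small-valued token (the first row) into the coarse low end of the FP8 range; both rows are stretched to span roughly [-448, 448] before the cast.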
Modalities
Input
Text, Image
Output
Text
Pricing
Dedicated Endpoints
Supported Functionality
Model APIs
Dedicated Endpoints
Container
More information