Dedicated Endpoints
Build and run generative AI models on autopilot


Autopilot LLM endpoints for production

Easily create LLM inference endpoints that are performant, scalable, and cost-effective
“Working with FriendliAI, we created a convenient and dependable service without the need for self-management”
TUNiB
FEATURES & BENEFITS
Superior cost-efficiency and performance with Friendli Engine
Build and serve custom models
Efficient and cost-effective serving with autoscaling
Dedicated GPU resource management

We are excited to announce that FriendliAI has been officially recognized as an Amazon Web Services (AWS) Partner.
You can now find Friendli Dedicated Endpoints on AWS Marketplace, which makes building and serving LLMs seamless and efficient.

Superior cost-efficiency and performance

A performant LLM serving solution is the first step to operating your AI application in the cloud.

Compared to vLLM, we boast:

10x+ faster token generation
5x+ faster initial response time
Run Friendli Engine in the cloud to reduce LLM serving costs by up to 90%.
Our engine achieves 6 times higher throughput. Serve more traffic on fewer GPUs with Friendli Engine.
Our engine generates tokens 10 times faster to guarantee unmatched efficiency and performance in your generative AI operations.

Custom model support

We offer comprehensive support for both open-source and custom LLMs, allowing organizations to deploy models tailored to their unique requirements and domain-specific challenges. With the flexibility to integrate proprietary datasets, businesses can unlock new opportunities for innovation and differentiation in their AI-driven applications. Create a new endpoint with your private Hugging Face Model Hub repository or upload your model directly to Dedicated Endpoints.


Dedicated GPU Resource Management

Friendli Dedicated Endpoints provides dedicated GPU instances, ensuring consistent access to computing resources without contention or performance fluctuations. By eliminating resource sharing, organizations can rely on predictable performance levels for their LLM inference tasks, enhancing productivity and reliability.


Multi-LoRA serving on a single GPU

With our specialized optimization, you can serve multiple LoRA models on a single endpoint using just one GPU. Streamline your operations and maximize resource efficiency. Enjoy greater flexibility and performance as you customize your models with enhanced access and efficiency. Optimize your deployments while maintaining top-tier performance.
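The general idea behind serving several LoRA models from one endpoint can be sketched as follows. This is a minimal, illustrative example of the standard LoRA formulation (a shared base weight plus a small low-rank update per adapter), not a description of Friendli's optimization. The adapter names, matrices, and the `serve` helper are all hypothetical.

```python
def matvec(matrix, vector):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vector)) for row in matrix]


def lora_forward(base_weight, adapter, x):
    """Compute y = W x + scale * B (A x); W is shared by every adapter."""
    base_out = matvec(base_weight, x)
    if adapter is None:
        return base_out  # no adapter selected: plain base model output
    low_rank = matvec(adapter["B"], matvec(adapter["A"], x))
    return [b + adapter["scale"] * d for b, d in zip(base_out, low_rank)]


# One copy of the base weights stays in memory (hypothetical 2x2 layer).
BASE_WEIGHT = [[1.0, 0.0], [0.0, 1.0]]

# Each adapter only stores small rank-1 matrices A (1x2) and B (2x1).
# Names and values are made up for illustration.
ADAPTERS = {
    "support-bot": {"A": [[1.0, 1.0]], "B": [[1.0], [0.0]], "scale": 0.5},
    "legal-bot": {"A": [[1.0, -1.0]], "B": [[0.0], [1.0]], "scale": 2.0},
}


def serve(adapter_name, x):
    """Route a request to the adapter it names; unknown names use the base model."""
    return lora_forward(BASE_WEIGHT, ADAPTERS.get(adapter_name), x)
```

Because every adapter reuses the same base weights, adding another fine-tuned variant in this sketch only costs the memory of its small `A` and `B` matrices.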


Train your model with Friendli Fine-tuning

Optimize your models using enterprise data to achieve business-specific goals. Friendli Fine-Tuning enhances performance, saving both time and resources. Seamlessly deploy your endpoints to serve inference requests, and maximize your business outcomes with tailored, optimized models.


Auto-scale your resources in the cloud

When deploying generative AI in the cloud, it’s important to scale as your business grows. Friendli Dedicated Endpoints employs intelligent auto-scaling mechanisms that dynamically adjust computing resources based on real-time demand and workload patterns.
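To make demand-based scaling concrete, here is a minimal sketch of one common rule: keep enough replicas that each handles roughly a target number of in-flight requests, within configured bounds. The policy fields and the `desired_replicas` function are hypothetical and are not Friendli's actual scaling logic or configuration options.

```python
import math
from dataclasses import dataclass


@dataclass(frozen=True)
class ScalingPolicy:
    # All three fields are illustrative assumptions, not product settings.
    min_replicas: int
    max_replicas: int
    target_requests_per_replica: float


def desired_replicas(in_flight_requests: int, policy: ScalingPolicy) -> int:
    """Return the replica count that meets current demand, clamped to the policy bounds."""
    needed = math.ceil(in_flight_requests / policy.target_requests_per_replica)
    return max(policy.min_replicas, min(policy.max_replicas, needed))
```

With a policy of 1 to 4 replicas and a target of 8 requests per replica, 20 in-flight requests would call for 3 replicas, while a burst of 100 would be capped at 4.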


Test your endpoints in the playground

Experiment with your model’s capabilities in the endpoint playground. Configure parameters like token length, temperature, top P, and frequency penalty.
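For readers unfamiliar with these parameters, the sketch below shows how they are commonly applied during text generation. It is a toy illustration under stated assumptions, not the playground's implementation: the "model" is a fixed table of logits, the frequency penalty follows one widespread convention (subtract the penalty times the number of times a token has already appeared), and all function names are hypothetical.

```python
import math
import random
from collections import Counter


def adjust_logits(logits, generated, frequency_penalty):
    """Lower the logit of each token in proportion to how often it was already generated."""
    counts = Counter(generated)
    return {tok: value - frequency_penalty * counts[tok] for tok, value in logits.items()}


def token_probabilities(logits, temperature, top_p):
    """Apply temperature-scaled softmax, then keep the smallest top set reaching top_p."""
    scaled = {tok: value / temperature for tok, value in logits.items()}  # temperature must be > 0
    peak = max(scaled.values())
    exps = {tok: math.exp(value - peak) for tok, value in scaled.items()}
    total = sum(exps.values())
    ranked = sorted(((value / total, tok) for tok, value in exps.items()), reverse=True)
    kept, cumulative = [], 0.0
    for prob, tok in ranked:
        kept.append((tok, prob))
        cumulative += prob
        if cumulative >= top_p:
            break
    kept_total = sum(prob for _, prob in kept)
    return {tok: prob / kept_total for tok, prob in kept}  # renormalize the kept tokens


def sample(logits, max_tokens, temperature=1.0, top_p=1.0, frequency_penalty=0.0, seed=0):
    """Generate up to max_tokens tokens from a toy model with fixed logits."""
    rng = random.Random(seed)
    generated = []
    for _ in range(max_tokens):
        penalized = adjust_logits(logits, generated, frequency_penalty)
        probs = token_probabilities(penalized, temperature, top_p)
        tokens, weights = zip(*probs.items())
        generated.append(rng.choices(tokens, weights=weights)[0])
    return generated
```

In this sketch, a lower temperature concentrates probability on the most likely token, a lower top P discards unlikely tokens entirely, a positive frequency penalty discourages repetition, and the token length simply bounds how many tokens are generated.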

PRICING

Basic

Sign up
Featured highlights

Get $10 in free credits upon sign up
Build and run generative AI models on autopilot
Configurable autoscaling
Test your endpoints in the playground
Billed monthly

Enterprise

Contact Sales
Featured highlights

Advanced features
Priority access to high-demand GPUs, including A100s and H100s
Monitor endpoints with Metrics & Logs
Dedicated support
Custom pricing

Pricing details

Endpoint

GPU Type     $ / hour
A100 80GB    $3.8
H100 80GB    $5.6

Fine-tuning

Model                          $ / 1M tokens
Models up to 16B parameters    $0.50
Models 16.1B - 72B             $3.00

* We charge based on the total number of tokens processed by your fine-tuning jobs.
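As a rough guide, the listed rates can be turned into a cost estimate like this. The sketch assumes that endpoint charges scale linearly with hours and with the number of GPUs, and it ignores billing granularity, credits, and taxes, none of which are specified here. The function names and tier keys are hypothetical.

```python
# Rates copied from the pricing tables above.
GPU_HOURLY_USD = {"A100 80GB": 3.8, "H100 80GB": 5.6}
FINE_TUNE_USD_PER_1M_TOKENS = {"up_to_16b": 0.50, "16.1b_to_72b": 3.00}


def endpoint_cost(gpu_type: str, hours: float, gpu_count: int = 1) -> float:
    """Estimate endpoint cost in USD, assuming linear per-GPU hourly billing."""
    return GPU_HOURLY_USD[gpu_type] * hours * gpu_count


def fine_tuning_cost(tier: str, tokens_processed: int) -> float:
    """Estimate fine-tuning cost in USD from the total tokens processed by the job."""
    return FINE_TUNE_USD_PER_1M_TOKENS[tier] * tokens_processed / 1_000_000
```

Under these assumptions, one H100 80GB endpoint running for 24 hours comes to about $134.40, and a fine-tuning job that processes 50M tokens on a model of up to 16B parameters comes to $25.00.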

EXPLORE FRIENDLI SUITE

Other ways to run generative AI models with Friendli

Friendli Container

Serve LLM and LMM inferences with Friendli Engine in your private environment

Learn more

Friendli Serverless Endpoints

Fast and affordable API for open-source generative AI

Learn more