
Dedicated Endpoints
Build and run generative AI models on autopilot

Get Started · Read the docs

Autopilot LLM endpoints for production


Easily create LLM inference endpoints that are performant, scalable, and cost-effective


Customer stories

SKT

SKT is South Korea's leading telecom operator, known for its innovative mobile services, extensive 5G infrastructure, and advances in AI development.

CHALLENGE

Building and serving AI agents for SKT’s massive customer base required strict SLAs, high reliability, and the ability to efficiently handle heavy traffic.

SOLUTION

Friendli Dedicated Endpoints enabled exceptional reliability and traffic efficiency while reducing operational costs.

RESULT

Within a few hours of onboarding:

5× LLM throughput

3× cost savings

TUNiB

TUNiB's DearMate chatbot service offers various personas, such as friend, lover, counselor, and coach. The Friendli Dedicated Endpoints managed platform lets TUNiB focus on model training while automating GPU resource management and fault recovery.

Read more
“Friendli Dedicated Endpoints simplifies generative AI model serving and optimizes our service development process.”

FEATURES & BENEFITS

Superior cost-efficiency and performance with Friendli Inference

Build and serve custom models

Efficient and cost-effective serving with autoscaling

Dedicated GPU resource management


We are excited to announce that FriendliAI has been officially recognized as an Amazon Web Services (AWS) Partner.
You can now find Friendli Dedicated Endpoints on AWS Marketplace, making building and serving LLMs seamless and efficient.



Superior cost-efficiency and performance

A performant LLM serving solution is the first step to operating your AI application in the cloud.

Compared to vLLM, we boast:

10x+ faster

token generation

5x+ faster

initial response time

Run Friendli Inference in the cloud to reduce LLM serving costs by up to 90%.

Our engine achieves 6 times higher throughput. Serve more traffic on fewer GPUs with Friendli Inference.

Our engine generates tokens 10 times faster to guarantee unmatched efficiency and performance in your generative AI operations.


Custom model support


We offer comprehensive support for both open-source and custom LLMs, allowing organizations to deploy models tailored to their unique requirements and domain-specific challenges. With the flexibility to integrate proprietary datasets, businesses can unlock new opportunities for innovation and differentiation in their AI-driven applications.

Create a new endpoint with your private Hugging Face Model Hub repository, or upload your model directly to Dedicated Endpoints.
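As a rough illustration of what "create an endpoint from a Hugging Face repository" might look like programmatically, the sketch below builds a request payload. The field names, repository name, and schema are hypothetical, not the actual FriendliAI API; consult the official docs for the real interface.

```python
import json

# Hypothetical payload for creating a Dedicated Endpoint from a private
# Hugging Face repository. All field names here are illustrative, NOT the
# real FriendliAI API schema -- see the official docs for the actual shape.
def build_create_endpoint_payload(hf_repo: str, gpu_type: str = "H100") -> str:
    payload = {
        "name": f"endpoint-{hf_repo.split('/')[-1]}",     # derived display name
        "model": {"source": "huggingface", "repo": hf_repo},
        "instance": {"gpu_type": gpu_type, "count": 1},
    }
    return json.dumps(payload)

# "my-org/my-private-llm" is a placeholder repository identifier.
body = build_create_endpoint_payload("my-org/my-private-llm")
print(body)
```

The same payload shape would apply whether the model comes from a private repository or a direct upload; only the `source` field would differ.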


Dedicated GPU Resource Management

Friendli Dedicated Endpoints provides dedicated GPU instances, ensuring consistent access to computing resources without contention or performance fluctuations. By eliminating resource sharing, organizations can rely on predictable performance levels for their LLM inference tasks, enhancing productivity and reliability.


Multi-LoRA serving on a single GPU


With our specialized optimization, you can serve multiple LoRA models on a single endpoint using just one GPU. Streamline your operations and maximize resource efficiency. Enjoy greater flexibility and performance as you customize your models with enhanced access and efficiency. Optimize your deployments while maintaining top-tier performance.


Auto-scale your resources in the cloud


When deploying generative AI in the cloud, it’s important to scale as your business grows. Friendli Dedicated Endpoints employs intelligent auto-scaling mechanisms that dynamically adjust computing resources based on real-time demand and workload patterns.
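To make "adjust computing resources based on real-time demand" concrete, here is a minimal autoscaling rule of the kind such systems use: size the replica count to current request rate, clamped to configured bounds. This is an illustrative sketch, not FriendliAI's actual scaling policy.

```python
import math

# Illustrative autoscaling rule (NOT FriendliAI's actual policy): scale
# replicas to current demand, clamped between configured min and max.
def desired_replicas(current_rps: float, target_rps_per_replica: float,
                     min_replicas: int = 1, max_replicas: int = 8) -> int:
    wanted = math.ceil(current_rps / target_rps_per_replica)
    return max(min_replicas, min(max_replicas, wanted))

print(desired_replicas(45.0, 10.0))   # scales up to meet demand
print(desired_replicas(0.0, 10.0))    # scales down to the configured floor
```

Real systems add smoothing (cooldown windows, averaging over time) so brief traffic spikes don't cause replica churn.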


Test your endpoints in the playground

Experiment with your model’s capabilities in the endpoint playground. Configure parameters like token length, temperature, top P, and frequency penalty.
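The same parameters the playground exposes can be set in an API request. The sketch below builds a chat-completions payload in the widely used OpenAI-compatible shape; the endpoint identifier is a placeholder, and exact field names should be checked against the Friendli docs.

```python
import json

# Sketch of a chat-completions request carrying the sampling parameters the
# playground exposes. "YOUR_ENDPOINT_ID" is a placeholder; verify field names
# against the official Friendli documentation.
payload = {
    "model": "YOUR_ENDPOINT_ID",
    "messages": [{"role": "user", "content": "Summarize LoRA in one line."}],
    "max_tokens": 128,         # token length: cap on generated tokens
    "temperature": 0.7,        # higher values sample more randomly
    "top_p": 0.9,              # nucleus sampling probability cutoff
    "frequency_penalty": 0.5,  # discourages repeating the same tokens
}
print(json.dumps(payload, indent=2))
```

Tuning in the playground first, then copying the same values into your request payload, keeps interactive experiments and production behavior consistent.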

PRICING

Basic

Get $5 free credits
Featured highlights

Pay-as-you-go

Configurable autoscaling

Fine-tune custom models

Enterprise

Contact for custom pricing
Featured highlights

Everything in the Basic plan

Monitor endpoints with Metrics & Logs

Custom pricing

Pricing details

GPU Type     Basic          Enterprise
B200         $8.9 / hour    Talk to an expert
H200         $4.5 / hour    Talk to an expert
H100         $3.9 / hour    Talk to an expert
A100 80GB    $2.9 / hour    Talk to an expert

Read more from our blogs

One Click from W&B to FriendliAI: Deploy Models as Live Endpoints
  • June 5, 2025
  • 3 min read
Weights & Biases · W&B · AI DevOps

Retrieval Augmented Generation (RAG) with MongoDB and FriendliAI
  • August 6, 2024
  • 6 min read
RAG · MongoDB

Deploying Weights & Biases Model Checkpoints on Friendli Dedicated Endpoints
  • June 20, 2024
  • 4 min read
Tutorial · Weights & Biases · Dedicated Endpoints
EXPLORE FRIENDLI SUITE

Other ways to run generative AI models with Friendli

Friendli Container

Serve LLM and LMM inferences with Friendli Inference in your private environment

Learn more

Friendli Serverless Endpoints

Fast and affordable API for open-source generative AI

Learn more

Products

Friendli Dedicated Endpoints · Friendli Serverless Endpoints · Friendli Container

Solutions

Inference · Use Cases · Models

Developers

Docs · Blog · Research

Company

About us · News · Careers · Patents · Brand Resources · Contact us

Pricing

Contact us:

contact@friendli.ai

FriendliAI Corp:

Redwood City, CA

Hub:

Seoul, Korea

Privacy Policy · Service Level Agreement · Terms of Service · CA Notice

Copyright © 2025 FriendliAI Corp. All rights reserved.