Accelerate Generative AI Inference

Fast, efficient, and reliable generative AI inference solution for production


FriendliAI x Hugging Face strategic partnership

Deploy popular open-source models from the Hugging Face Hub to Friendli Endpoints for lightning-fast, high-performance inference.


Accelerate generative AI inference

Friendli Inference provides fast and low-cost inference

GROUNDBREAKING PERFORMANCE

50-90% cost savings

6× fewer GPUs required [1]

10.7× higher throughput [2]

6.2× lower latency [3]

Our cutting-edge technologies make this possible.


Iteration Batching

Groundbreaking optimization technique developed by us

(Also known as Continuous Batching)

Friendli DNN Library

Optimized GPU kernels for generative AI

Friendli TCache

Intelligently reusing computational results

Native Quantization

Efficient serving without compromising accuracy
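As a rough illustration of the idea behind iteration batching (continuous batching) listed above, the toy simulation below compares it with static batching. It is a sketch under simplifying assumptions, not FriendliAI's implementation: the function names, the batch size, and the output lengths are hypothetical, and the model counts decoding iterations only, ignoring prefill, memory limits, and request arrival times.

```python
from collections import deque


def static_batching_steps(output_lengths, batch_size):
    """Decoding iterations when each batch runs until its longest request ends."""
    steps = 0
    for start in range(0, len(output_lengths), batch_size):
        batch = output_lengths[start:start + batch_size]
        # Short requests sit idle until the longest one in the batch finishes.
        steps += max(batch)
    return steps


def iteration_batching_steps(output_lengths, batch_size):
    """Decoding iterations when the batch is rescheduled at every iteration."""
    waiting = deque(output_lengths)  # tokens still to generate, per queued request
    running = []                     # tokens still to generate, per active request
    steps = 0
    while waiting or running:
        # Fill any free slots with waiting requests before the next iteration.
        while waiting and len(running) < batch_size:
            running.append(waiting.popleft())
        # One iteration: every active request generates one token.
        running = [remaining - 1 for remaining in running]
        # Finished requests leave the batch immediately, freeing their slots.
        running = [remaining for remaining in running if remaining > 0]
        steps += 1
    return steps


if __name__ == "__main__":
    lengths = [2, 8, 2, 8, 2, 8, 2, 8]  # hypothetical output lengths in tokens
    print("static:   ", static_batching_steps(lengths, batch_size=2))
    print("iteration:", iteration_batching_steps(lengths, batch_size=2))
```

In this toy setting, mixing short and long requests makes static batching wait on the longest request in each batch, while iteration-level scheduling refills freed slots right away and needs fewer iterations overall.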

Lightning-fast performance for Large Language Models

Our engine accelerates open-source and custom LLMs. It supports a wide array of quantization techniques, including FP8, INT8, and AWQ, across all models. Take advantage of our optimized open-source models, or use Friendli Inference for your business with a custom model.


Llama 4

DeepSeek R1

Gemma 3

Qwen

All-in-one platform for AI agents

Build and serve compound AI systems for complex tasks

Deploy custom models effortlessly

Serve custom models tailored to your specific needs. You can upload your model or import it from the W&B Registry or the Hugging Face Model Hub.

Monitor and debug LLM performance

Our advanced monitoring and debugging tools empower you to understand your model, identify issues, and optimize performance.

Model-agnostic function calls and structured outputs

Build reliable API integrations for your AI agent using function calling or structured outputs. Our engine guarantees consistent results regardless of the model you use.

Seamless data integration for real-time RAG

Enhance your AI’s knowledge in real time with our Retrieval-Augmented Generation (RAG) system. Keep your agents supplied with up-to-date information to reduce hallucinations.

Integrate predefined tools or provide your own

Extend your AI agent’s abilities with tools. Choose from our extensive library of predefined tools or integrate your own.

Ready for production

Our services are here to help you scale your business with ease.

Guaranteed SLA and performance

Experience peace of mind with consistent performance and high reliability. We are committed to delivering exceptional service so you can focus on growing your business.

Maximum security in our cloud or yours

Protect your data with our robust security measures. Whether you choose our cloud or prefer to operate in your own infrastructure, we prioritize your security and compliance.

Autoscale on growing demands

Stay ahead of the curve with our intelligent autoscaling capabilities. Our system automatically adjusts resources to ensure optimal performance, allowing you to scale as you grow.

CUSTOMER STORIES

FriendliAI can solve your generative AI use case.


South Korea’s leading telecom provider, SK Telecom, operates its LLMs reliably and cost-efficiently, without self-management, by using Friendli Dedicated Endpoints.

NextDay AI’s personalized character chatbots process over 3 trillion tokens per month, incurring high H100 GPU costs. By using Friendli Container, they reduced GPU usage by more than 50%. NextDay AI’s chatbot is ranked among the top 20 generative AI web products by Andreessen Horowitz (a16z).

Friendli Inference’s performance and cost savings consistently exceed our expectations. After exploring open-source options, I cannot overstate the value and peace of mind FriendliAI brings to the table. With FriendliAI, our service grew over six times, making it an essential driver of our growth.

PARTNERSHIPS

Meet FriendliAI’s partners

With our partners, we deliver reliable and efficient solutions customized to your specific needs.

Friendli Suite

The complete platform that unlocks your full generative AI potential

Friendli Dedicated Endpoints

Build and run LLMs/LMMs on autopilot in the cloud

Friendli Container

Serve generative AI in your secure environment

Friendli Serverless Endpoints

Access fast and affordable generative AI inference

Friendli Dedicated Endpoints

Build and run LLMs/LMMs on autopilot in the cloud

Easy and scalable deployment for production workloads

With our user-friendly interface and robust infrastructure, you can move from development to production with minimal effort. Dedicated Endpoints simplify LLM operation, allowing you to focus on your business goals. Our integrated dashboard provides complete insight into endpoint performance over time.

Auto-scale your endpoints efficiently

Our system dynamically adjusts resources based on your real-time demand, ensuring stable performance during peak times and overall cost efficiency. With the added capability to scale down to zero, you can eliminate unnecessary costs during periods of low activity. This intelligent auto-scaling feature prevents both under-provisioning and over-provisioning of expensive GPU resources.
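The scaling behavior described above can be sketched as a simple replica calculation. This is a minimal illustration under stated assumptions, not FriendliAI's autoscaler: the function name, the parameters, and the per-replica capacity figure are all hypothetical, and a real system would also smooth the load signal and account for startup time.

```python
import math


def desired_replicas(requests_per_second, capacity_per_replica,
                     max_replicas, min_replicas=0):
    """Replicas needed for the current load, clamped to [min_replicas, max_replicas].

    capacity_per_replica is a hypothetical figure: the request rate one
    replica can sustain while meeting its latency target.
    """
    if capacity_per_replica <= 0:
        raise ValueError("capacity_per_replica must be positive")
    if requests_per_second <= 0:
        needed = 0  # no traffic: allow scaling down to zero
    else:
        # Round up so the fleet is never under-provisioned for the measured load.
        needed = math.ceil(requests_per_second / capacity_per_replica)
    # The upper clamp bounds cost; the lower clamp keeps warm replicas if desired.
    return max(min_replicas, min(needed, max_replicas))


if __name__ == "__main__":
    for load in (0, 3, 12, 100):
        print(load, "req/s ->", desired_replicas(load, capacity_per_replica=5,
                                                  max_replicas=4), "replicas")
```

Setting `min_replicas=0` models scale-to-zero, which trades idle cost for a cold start on the first request after a quiet period; setting it to 1 keeps one replica warm.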

Dedicated GPU resource management

Dedicated Endpoints provides exclusive access to high-performance GPU resources, ensuring consistent access to computing resources without contention or performance fluctuations. By eliminating resource sharing, you can rely on predictable performance levels to enhance your AI workloads, improving productivity and reliability.

Friendli Container

Serve generative AI in your secure environment

Built to meet privacy and security needs

Running models within your infrastructure allows you to maintain complete control of your data, ensuring that sensitive information never leaves your environment.

Integrate with internal systems

Our solution offers seamless Kubernetes integration, facilitating orchestration and observability. You can easily integrate Prometheus and Grafana for monitoring.

Save big on GPU costs

Whether on-premises or in a managed cluster, Friendli Container processes heavy request loads efficiently, so you need fewer GPUs to serve at a larger scale.

Friendli Serverless Endpoints

Access fast and affordable generative AI inference

250 tokens/sec at $0.1/1M tokens

Serverless Endpoints delivers output tokens at a staggering 250 tokens per second with per-token billing as low as $0.1 per million tokens for the Llama 3.1 8B model.
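To put the quoted figures in concrete terms, the short calculation below estimates cost and generation time from them. It is a back-of-the-envelope sketch: the constants are the page's headline numbers for Llama 3.1 8B (the price is quoted as a lower bound), the function names are hypothetical, and actual billing and speed may differ by model and workload.

```python
# Headline figures quoted above for Llama 3.1 8B; treated here as assumptions.
PRICE_PER_MILLION_TOKENS_USD = 0.1
OUTPUT_TOKENS_PER_SECOND = 250


def token_cost_usd(num_tokens):
    """Cost of num_tokens at the quoted per-million-token price."""
    return num_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS_USD


def generation_seconds(num_output_tokens):
    """Time to stream num_output_tokens at the quoted output speed."""
    return num_output_tokens / OUTPUT_TOKENS_PER_SECOND


if __name__ == "__main__":
    # Hypothetical workload: 50 million tokens in a month.
    print(f"50M tokens: ${token_cost_usd(50_000_000):.2f}")
    # A 1,000-token response at the quoted output speed.
    print(f"1,000 output tokens: {generation_seconds(1_000):.1f} s")
```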

Supports 128K context length

Build complex applications that require in-depth understanding and context retention on Serverless Endpoints. Our Llama 3.1 endpoints support the full 128K context length.

Easily build AI agents with tool-assist

Are you building an AI agent that can search the web, integrate knowledge bases, and solve complex problems using many tools? Serverless Endpoints has it all.

INTEGRATIONS

Seamlessly build and deploy LLM agents with our integrations


Are you ready to build and deploy your generative AI application effortlessly?


1. Testing conducted by FriendliAI in October 2023 using Llama-2-13B running on Friendli Inference. See the detailed results and methodology here.

2. Performance compared to vLLM on a single NVIDIA A100 80GB GPU running AWQ-quantized Mixtral 8x7B from Mistral AI with the following settings: mean input token length = 500, mean output token length = 150. Evaluation conducted by FriendliAI.

3. Performance of Friendli Container compared to vLLM on a single NVIDIA A100 80GB GPU running AWQ-quantized Mixtral 8x7B from Mistral AI with the following settings: mean input token length = 500, mean output token length = 150, mean requests per second = 0.5. Evaluation conducted by FriendliAI.


Read more from our blogs

Explore 370K+ AI Models on FriendliAI's Models Page

How to Compare Multimodal AI Models Side-by-Side

Deploy Multimodal Models from Hugging Face to FriendliAI with Ease
