Accelerate Generative AI Inference

Fast, efficient, and reliable generative AI inference solution for production


FriendliAI x Hugging Face strategic partnership

Deploy popular open-source models from the Hugging Face Hub to Friendli Endpoints for lightning-fast, high-performance inference.

Read the blog

Accelerate generative AI inference

Friendli Inference provides fast and low-cost inference

GROUNDBREAKING PERFORMANCE

50-90% cost savings (fewer GPUs required) [1]

10.7× higher throughput [2]

6.2× lower latency [3]

Powered by FriendliAI’s industry-leading technology

Iteration Batching (also known as continuous batching): a groundbreaking optimization technique developed by us

Friendli DNN Library: optimized GPU kernels for generative AI

Friendli TCache: intelligently reuses computational results

Native Quantization: efficient serving without compromising accuracy

Lightning-fast performance for multimodal models

Our engine accelerates open-source and custom LLMs. It supports a wide array of quantization techniques, including FP8, INT8, and AWQ, across all models. Take advantage of our optimized open-source models or leverage Friendli Inference for your business with a custom model.

DeepSeek R1

Qwen

Llama 4

Mixtral


All-in-one platform for AI agents

Build and serve compound AI systems for complex tasks

Deploy custom models effortlessly

Serve custom models tailored to your specific needs. You can upload your model or import it from either W&B Registry or Hugging Face Model Hub.

Monitor and debug LLM performance

Our advanced monitoring and debugging tools empower you to understand your model, identify issues, and optimize performance.

Model-agnostic function calls and structured outputs

Build reliable API integrations for your AI agent using function calling or structured outputs. Our engine guarantees consistent results regardless of the model you use.
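As an illustration of structured outputs, the sketch below constrains a model's reply to a JSON schema through an OpenAI-compatible chat completions client. The base URL, model identifier, and schema are assumptions for the example, not the definitive FriendliAI API.

```python
# Illustrative sketch: structured output through an OpenAI-compatible client.
# The base URL, model name, and schema below are assumptions for this example.
import json

from openai import OpenAI

client = OpenAI(
    base_url="https://api.friendli.ai/serverless/v1",  # assumed endpoint URL
    api_key="YOUR_FRIENDLI_TOKEN",
)

city_schema = {
    "type": "object",
    "properties": {
        "city": {"type": "string"},
        "population": {"type": "integer"},
    },
    "required": ["city", "population"],
}

response = client.chat.completions.create(
    model="meta-llama-3.1-8b-instruct",  # hypothetical model identifier
    messages=[
        {"role": "user", "content": "Name the largest city in France and its population."}
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {"name": "city_info", "schema": city_schema},
    },
)

# The reply is constrained to match the schema, so it can be parsed directly.
print(json.loads(response.choices[0].message.content))
```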

Integrate predefined tools or provide your own

Empower your AI agent’s abilities with tools. Choose from our extensive library of predefined tools or seamlessly integrate your own.
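As a sketch of bringing your own tool, the snippet below registers a hypothetical get_weather function via standard function calling; the tool definition and model identifier are illustrative assumptions.

```python
# Illustrative sketch: exposing your own tool to the model via function calling.
# The tool definition and model name are hypothetical examples.
from openai import OpenAI

client = OpenAI(
    base_url="https://api.friendli.ai/serverless/v1",  # assumed endpoint URL
    api_key="YOUR_FRIENDLI_TOKEN",
)

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",  # your own tool, implemented on your side
            "description": "Look up the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }
]

response = client.chat.completions.create(
    model="meta-llama-3.1-8b-instruct",  # hypothetical model identifier
    messages=[{"role": "user", "content": "What's the weather in Seoul right now?"}],
    tools=tools,
)

# If the model decides to call the tool, it returns the arguments; your
# application runs get_weather and sends the result back in a follow-up turn.
message = response.choices[0].message
if message.tool_calls:
    call = message.tool_calls[0]
    print(call.function.name, call.function.arguments)
```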


Ready for production

Our services are here to help you scale your business with ease.

Guaranteed SLA and performance

Experience peace of mind with consistent performance and high reliability. We are committed to delivering exceptional service so you can focus on growing your business.

Maximum security in our cloud or yours

Protect your data with our robust security measures. Whether you choose our cloud or prefer to operate within your own infrastructure, we prioritize your security and compliance.

Autoscale on growing demands

Stay ahead of the curve with our intelligent autoscaling capabilities. Our system automatically adjusts resources to ensure optimal performance, allowing you to scale as you grow.


CUSTOMER STORIES

See what’s possible with FriendliAI

South Korea’s leading telecom provider, SKT, powers AI services for millions of users. Right after onboarding, Friendli delivered 5× higher LLM throughput and 3× cost savings. Friendli Dedicated Endpoints met SKT’s strict SLAs with exceptional reliability and efficiency.

NextDay AI’s personalized character chatbots process over 3 trillion tokens per month, incurring high H100 GPU costs. By using Friendli Container, they cut GPU usage by more than 50%. NextDay AI’s chatbot is ranked among the top 20 generative AI web products by Andreessen Horowitz (a16z).

Friendli Inference’s performance and cost savings consistently exceed our expectations. After exploring open-source options, I cannot overstate the value and peace of mind FriendliAI brings to the table. With FriendliAI, our service grew over six times, making it an essential driver of our growth.

PARTNERSHIPS

Meet FriendliAI’s partners

We partner with industry leaders to deliver top-class performance, scalability, and support. These collaborations help us push the boundaries of AI inference so our customers can deploy faster, scale smarter, and run more efficiently. Whether it’s cloud infrastructure or cutting-edge hardware, our ecosystem is designed for production-grade AI.

Friendli Suite

The complete platform that unlocks your full generative AI potential

01

Friendli Dedicated Endpoints

High-performance inference with guaranteed capacity


02

Friendli Serverless Endpoints

Access fast and affordable generative AI inference


03

Friendli Container

Serve generative AI in your own environment

01

Friendli Dedicated Endpoints

High-performance inference with guaranteed capacity

Easy and scalable deployment for production workloads

With our user-friendly interface and robust infrastructure, you can seamlessly transition from development to production with minimal effort. Dedicated Endpoints simplify LLM operations, allowing you to focus on your business goals. Our integrated dashboard provides complete insight into endpoint performance over time.

Auto-scale your endpoints efficiently

Our system dynamically adjusts resources based on your real-time demand, ensuring stable performance during peak times and overall cost efficiency. With the added capability to scale down to zero, you can eliminate unnecessary costs during periods of low activity. This intelligent auto-scaling feature prevents both under-provisioning and over-provisioning of expensive GPU resources.

Dedicated GPU resource management

Dedicated Endpoints provide exclusive access to high-performance GPU resources, ensuring consistent compute without contention or performance fluctuations. Because resources are never shared, you can rely on predictable performance for your AI workloads, improving productivity and reliability.

02

Friendli Serverless Endpoints

Access fast and affordable generative AI inference

Easily build AI agents with tool-assist

With built-in tools like web search, Python interpreter, and calculator, Friendli Serverless Endpoints lets you build advanced AI agents effortlessly.
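As a rough sketch only, the request below assumes built-in tools can be selected by name on a tool-assisted chat completions route; the endpoint path and tool identifiers are assumptions for illustration and may differ from the actual API.

```python
# Rough sketch: asking a serverless endpoint to use built-in tools.
# The endpoint path and tool identifiers below are assumptions, not the documented API.
import requests

resp = requests.post(
    "https://api.friendli.ai/serverless/tools/v1/chat/completions",  # assumed path
    headers={"Authorization": "Bearer YOUR_FRIENDLI_TOKEN"},
    json={
        "model": "meta-llama-3.1-8b-instruct",  # hypothetical model identifier
        "messages": [
            {"role": "user", "content": "What is 17% of 2345, and what is today's EUR/USD rate?"}
        ],
        "tools": [
            {"type": "math:calculator"},  # assumed built-in tool name
            {"type": "web:search"},       # assumed built-in tool name
        ],
    },
    timeout=60,
)
print(resp.json()["choices"][0]["message"]["content"])
```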

Side-by-side comparison

Evaluate and compare the performance of multiple models simultaneously in a single view. Ideal for selecting the right model for your specific use case and understanding trade-offs between outputs.

Deploy with ease

Deploy serverless endpoints to dedicated GPU environments with just a few clicks—scalable, reliable, and ready for production.


03

Friendli Container

Serve generative AI in your own environment

Built to meet privacy and security needs

Running models within your infrastructure allows you to maintain complete control of your data, ensuring that sensitive information never leaves your environment.

Integrate with internal systems

Our solution offers seamless Kubernetes integration, facilitating orchestration and observability. You can easily integrate Prometheus and Grafana for monitoring.
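As a minimal sketch of that monitoring path, the snippet below polls a Prometheus-style text metrics endpoint that the container is assumed to expose; the port, path, and metric handling are illustrative, and in practice a Prometheus scrape job would target the same endpoint and feed Grafana dashboards.

```python
# Minimal sketch: reading Prometheus-style metrics from a running container.
# The port, path, and metric format handling are assumptions for illustration.
import requests

METRICS_URL = "http://localhost:8281/metrics"  # assumed metrics port and path


def scrape_metrics(url: str) -> dict:
    """Parse simple 'name value' lines from a Prometheus text exposition."""
    metrics = {}
    for line in requests.get(url, timeout=5).text.splitlines():
        if not line.strip() or line.startswith("#"):
            continue  # skip blanks and HELP/TYPE comment lines
        name, _, value = line.rpartition(" ")
        try:
            metrics[name] = float(value)
        except ValueError:
            pass  # ignore non-numeric samples in this simple sketch
    return metrics


if __name__ == "__main__":
    for name, value in scrape_metrics(METRICS_URL).items():
        print(f"{name} = {value}")
```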

Plug-and-play AWS integration

Seamlessly launch Friendli Container within AWS environments, including AWS EKS and SageMaker, with a single command.




1. Testing conducted by FriendliAI in October 2023 using Llama-2-13B running on Friendli Inference. See the detailed results and methodology here.
2. Performance compared to vLLM on a single NVIDIA A100 80GB GPU running AWQ-ed Mixtral 8x7B from Mistral AI with the following settings: mean input token length = 500, mean output token length = 150. Evaluation conducted by FriendliAI.
3. Performance of Friendli Container compared to vLLM on a single NVIDIA A100 80GB GPU running AWQ-ed Mixtral 8x7B from Mistral AI with the following settings: mean input token length = 500, mean output token length = 150, mean request per second = 0.5. Evaluation conducted by FriendliAI.