
Supercharge Generative AI Inference

Efficient, fast, and reliable generative AI inference solution for production


TRUSTED BY
  • Upstage
  • Enuma
  • Twelve Labs
  • NextdayAI
  • Nacloud
  • SKT
  • Tunib
  • ScatterLab

Accelerate generative AI inference

Friendli Engine provides fast and low-cost inference

GROUNDBREAKING PERFORMANCE

50-90%

Cost savings

Fewer GPUs required [1]

10.7×

Higher throughput [2]

6.2×

Lower latency [3]

Our cutting-edge technologies make this possible.

Iteration Batching

Groundbreaking optimization technique developed by us

(Also known as Continuous Batching)
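
As a rough illustration of why iteration-level scheduling can help, the toy simulation below counts decoding steps under two policies. It is a simplified sketch, not Friendli Engine's implementation: the function names, the one-token-per-step model, and the example request lengths are all assumptions made for illustration.

```python
from collections import deque


def static_batching_steps(output_lengths, batch_size):
    """Request-level batching: a batch runs until its longest request ends."""
    steps = 0
    for start in range(0, len(output_lengths), batch_size):
        batch = output_lengths[start:start + batch_size]
        steps += max(batch)  # short requests wait for the longest one
    return steps


def iteration_batching_steps(output_lengths, batch_size):
    """Iteration-level batching: free slots are refilled after every step."""
    queue = deque(output_lengths)  # requests waiting to be scheduled
    running = []                   # remaining tokens per in-flight request
    steps = 0
    while queue or running:
        # Admit waiting requests into any free slots before the next step.
        while queue and len(running) < batch_size:
            running.append(queue.popleft())
        # One decoding step yields one token per running request;
        # requests that reach zero remaining tokens leave the batch.
        running = [left - 1 for left in running if left - 1 > 0]
        steps += 1
    return steps


# Hypothetical workload: output lengths in tokens (all positive), batch of 2.
lengths = [2, 8, 2, 8]
print(static_batching_steps(lengths, 2))     # 16
print(iteration_batching_steps(lengths, 2))  # 12
```

In this toy workload the short requests no longer hold a slot idle while the long ones finish, which is where the saved steps come from. Real engines also have to handle prefill, memory limits, and scheduling overhead, none of which are modeled here.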

Friendli DNN Library

Optimized GPU kernels for generative AI

Friendli TCache

Intelligently reusing computational results

Native Quantization

Efficient serving without compromising accuracy

Lightning-fast performance for Large Language Models

Our engine accelerates open-source and custom LLMs. It supports a wide array of quantization techniques, including FP8, INT8, and AWQ, across all models. Take advantage of our optimized open-source models, or use Friendli Engine with a custom model for your business.

Llama 3.1

Arctic

Gemma 2

Mixtral


All-in-one platform for AI agents

Build and serve compound AI systems for complex tasks

Deploy custom models effortlessly

Serve custom models tailored to your specific needs. You can upload your model or import it from either W&B Registry or Hugging Face Model Hub.

Train and fine-tune models

Effortlessly fine-tune and deploy your models. Use PEFT to efficiently tune your models and deploy them using Multi-LoRA serving.

Monitor and debug LLM performance

Our advanced monitoring and debugging tools empower you to understand your model, identify issues, and optimize performance.

Model-agnostic function calls and structured outputs

Build reliable API integrations for your AI agent using function calling or structured outputs. Our engine guarantees consistent results regardless of the model you use.

Seamless data integration for real-time RAG

Enhance your AI’s knowledge in real time with our Retrieval-Augmented Generation (RAG) system. Effortlessly keep your agents current with up-to-date information, reducing hallucinations.

Integrate predefined tools or provide your own

Extend your AI agent’s abilities with tools. Choose from our extensive library of predefined tools or seamlessly integrate your own.


Ready for production

Our services are here to help you scale your business with ease.

Guaranteed SLA and performance

Experience peace of mind with consistent performance and high reliability. We are committed to delivering exceptional service so you can focus on growing your business.

Maximum security in our cloud or yours

Protect your data with our robust security measures. Whether you choose our cloud or prefer to operate in your own infrastructure, we prioritize your security and compliance.

Autoscale with growing demand

Stay ahead of the curve with our intelligent autoscaling capabilities. Our system automatically adjusts resources to ensure optimal performance, allowing you to scale as you grow.


CUSTOMER STORIES

FriendliAI can solve your generative AI use case.

NextDay AI’s personalized character chatbots process ~0.5 trillion tokens per month, incurring high H100 GPU costs. By using Friendli Container, they instantly cut their GPU costs by more than 50%. NextDay AI’s chatbot is ranked among the top 20 generative AI web products by Andreessen Horowitz (a16z).

South Korea’s leading telecom provider, SK Telecom, operates its LLMs reliably and cost-efficiently without self-management by using Friendli Dedicated Endpoints.

PARTNERSHIPS

Meet FriendliAI’s partners

With our partners, we deliver reliable and efficient solutions customized to your specific needs.

AWS
Nvidia
Azure
MongoDB

Friendli Suite

The complete platform that unlocks your full generative AI potential

01

Friendli Dedicated Endpoints

Build and run LLMs/LMMs on autopilot in the cloud


02

Friendli Container

Serve generative AI in your secure environment


03

Friendli Serverless Endpoints

Access fast and affordable generative AI inference

01

Friendli Dedicated Endpoints

Build and run LLMs/LMMs on autopilot in the cloud

Easy and scalable deployment for production workloads

With our user-friendly interface and robust infrastructure, you can seamlessly transition from development to production with minimal effort. Dedicated Endpoints simplify LLM operation, allowing you to focus on your business goals. Our integrated dashboard provides complete insight into endpoint performance over time.

Fine-tune custom models with proprietary datasets

Create highly specialized models tailored to your industry, use case, or company requirements by fine-tuning your AI models with your proprietary datasets. Leverage the Parameter-Efficient Fine-Tuning (PEFT) method to reduce training costs, or try integrating your Weights & Biases account to monitor the training process continuously.

Auto-scale your endpoints efficiently

Our system dynamically adjusts resources based on your real-time demand, ensuring stable performance during peak times and overall cost efficiency. With the added capability to scale down to zero, you can eliminate unnecessary costs during periods of low activity. This intelligent auto-scaling feature prevents both under-provisioning and over-provisioning of expensive GPU resources.
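
The scaling behavior described above can be sketched as a simple replica calculation. This is a minimal illustration under stated assumptions, not the actual Dedicated Endpoints autoscaler: the function name, the requests-per-second load signal, and the per-replica capacity figure are hypothetical.

```python
import math


def desired_replicas(requests_per_sec, capacity_per_replica, max_replicas):
    """Return how many replicas to run for the observed load (hypothetical)."""
    if capacity_per_replica <= 0:
        raise ValueError("capacity_per_replica must be positive")
    if requests_per_sec <= 0:
        return 0  # scale down to zero when there is no traffic
    # Round up so the load never exceeds total capacity (no under-provisioning),
    # then cap at the configured maximum.
    needed = math.ceil(requests_per_sec / capacity_per_replica)
    return min(needed, max_replicas)


print(desired_replicas(0, 10, 4))    # 0
print(desired_replicas(25, 10, 4))   # 3
print(desired_replicas(100, 10, 4))  # 4
```

Rounding up avoids under-provisioning, while sizing to the current load rather than the peak avoids paying for idle GPUs. A production autoscaler would also smooth the load signal and account for replica startup time.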

Dedicated GPU resource management

Dedicated Endpoints provides exclusive access to high-performance GPU resources, ensuring consistent access to computing resources without contention or performance fluctuations. By eliminating resource sharing, you can rely on predictable performance levels to enhance your AI workloads, improving productivity and reliability.

02

Friendli Container

Serve generative AI in your secure environment

Maximum privacy and security

Running models within your infrastructure allows you to maintain complete control of your data, ensuring that sensitive information never leaves your environment.

Integrate with internal systems

Our solution offers seamless Kubernetes integration, facilitating orchestration and observability. You can easily integrate Prometheus and Grafana for monitoring.

Save big on GPU costs

Whether on-premises or through a managed cluster, Friendli Container can process heavy request loads efficiently, so you need fewer GPUs to serve at a larger scale.


03

Friendli Serverless Endpoints

Access fast and affordable generative AI inference

250 tokens/sec at $0.1/1M tokens

Serverless Endpoints delivers output tokens at a staggering 250 tokens per second with per-token billing as low as $0.1 per million tokens for the Llama 3.1 8B model.
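
For a back-of-the-envelope estimate, the figures above can be plugged into a short calculation. This is a sketch under assumptions: it uses the quoted "as low as" price as a flat rate for every billed token and the quoted 250 tokens per second as a steady output rate for a single stream. Actual billing rules and speeds may differ, and the function names are illustrative only.

```python
PRICE_PER_MILLION_TOKENS_USD = 0.1  # quoted "as low as" rate for Llama 3.1 8B
OUTPUT_TOKENS_PER_SEC = 250         # quoted output speed


def estimated_cost_usd(total_tokens):
    """Cost of a token volume at the flat quoted rate (assumption)."""
    return total_tokens / 1_000_000 * PRICE_PER_MILLION_TOKENS_USD


def estimated_stream_seconds(output_tokens):
    """Time to generate the output tokens at the quoted rate (assumption)."""
    return output_tokens / OUTPUT_TOKENS_PER_SEC


print(f"${estimated_cost_usd(5_000_000):.2f}")    # $0.50
print(f"{estimated_stream_seconds(1_000):.1f} s")  # 4.0 s
```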

Supports 128K context length

Build complex applications that require in-depth understanding and context retention on Serverless Endpoints. Our Llama 3.1 endpoints support complete 128K context length handling.

Easily build AI agents with tool-assist

Are you building an AI agent that can search the web, integrate knowledge bases, and solve complex problems using many tools? Serverless Endpoints has it all.


INTEGRATIONS

Seamlessly build and deploy LLM agents with our integrations




1. Testing conducted by FriendliAI in October 2023 using Llama-2-13B running on Friendli Engine. See the detailed results and methodology here.
2. Performance compared to vLLM on a single NVIDIA A100 80GB GPU running AWQ-quantized Mixtral 8x7B from Mistral AI with the following settings: mean input token length = 500, mean output token length = 150. Evaluation conducted by FriendliAI.
3. Performance of Friendli Container compared to vLLM on a single NVIDIA A100 80GB GPU running AWQ-quantized Mixtral 8x7B from Mistral AI with the following settings: mean input token length = 500, mean output token length = 150, mean requests per second = 0.5. Evaluation conducted by FriendliAI.