Friendli Engine
The fastest LLM serving engine on the market

Read the docs

GROUNDBREAKING PERFORMANCE

40–80% cost savings

6× fewer GPUs required [1]

6× higher throughput [2]

4× lower latency [3]

What Friendli Engine offers

Speed up LLM serving and slash costs by 40–80%

Friendli Engine is highly optimized to make LLM serving fast and cost-effective. It’s the fastest on the market, with our performance testing showing that Friendli Engine is significantly faster than vLLM and TensorRT-LLM.

Read the full blog

HIGHLIGHTS

01

Iteration batching (aka continuous batching)

Iteration batching is a new batching technology we invented to handle concurrent generation requests very efficiently. Iteration batching can achieve up to tens of times higher throughput than conventional batching while satisfying the same latency requirement. Our technology is protected by our patents in the US and Korea.

Read the full blog
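The scheduling idea can be illustrated with a toy simulation (illustrative only, not Friendli's implementation): a conventional static batcher holds a batch open until its longest request finishes, while an iteration-level batcher swaps finished requests for waiting ones after every decoding step.

```python
# Toy comparison of static batching vs. iteration-level batching.
# Each request needs some number of decoding steps (tokens to generate).

def static_batching(lengths, batch_size):
    """Each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def iteration_batching(lengths, batch_size):
    """Finished requests leave and waiting ones join at every iteration."""
    waiting = sorted(lengths, reverse=True)  # pop() takes shortest first
    running, steps = [], 0
    while waiting or running:
        while waiting and len(running) < batch_size:
            running.append(waiting.pop())
        steps += 1                              # one decoding iteration
        running = [r - 1 for r in running if r > 1]
    return steps

lengths = [100, 5, 5, 5, 100, 5, 5, 5]  # mixed long and short requests
print(static_batching(lengths, 4))      # → 200: short requests wait on long ones
print(iteration_batching(lengths, 4))   # → 105: slots are recycled immediately
```

Same work, roughly half the GPU iterations in this toy case; the gap widens as request lengths vary more, which is where the "up to tens of times" figure comes from.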

02

Native quantization support

With Friendli Engine, running AWQ-quantized models is seamless. For example, you can run a 4-bit AWQ-quantized Llama 2 70B natively on a single A100 80 GB GPU. Serving AWQ models on Friendli Engine delivers efficient LLM deployment and notable efficiency gains without sacrificing accuracy.

Read the full blog
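The single-GPU claim follows from simple weight-memory arithmetic; note that the KV cache and activations need memory on top of this, so these figures are lower bounds:

```python
# Back-of-the-envelope weight-memory math behind the
# "Llama 2 70B 4-bit on one A100 80 GB" example.

def weight_gib(params_billions: float, bits_per_weight: int) -> float:
    """Memory needed just to hold the model weights, in GiB."""
    return params_billions * 1e9 * bits_per_weight / 8 / 2**30

print(round(weight_gib(70, 16), 1))  # → 130.4 GiB in FP16: exceeds one A100 80 GB
print(round(weight_gib(70, 4), 1))   # → 32.6 GiB at 4-bit AWQ: fits on one GPU
```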

03

Friendli TCache

Friendli TCache intelligently identifies and stores frequently used computational results. Friendli Engine then reuses these cached results, significantly reducing the workload on the GPUs.

Read the full blog
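FriendliAI has not published TCache's internals, but the general principle of memoizing recurring computation instead of redoing it on the GPU can be sketched with a generic cache (hypothetical placeholder code, not TCache's design):

```python
# Generic memoization sketch of the caching principle. `expensive_compute`
# stands in for GPU work that many requests would otherwise repeat.
from functools import lru_cache

gpu_calls = 0

@lru_cache(maxsize=4096)
def expensive_compute(prompt_prefix: str) -> int:
    global gpu_calls
    gpu_calls += 1               # counts how often the GPU is actually hit
    return len(prompt_prefix)    # placeholder for a real computation result

for _ in range(100):             # 100 requests sharing one common prefix
    expensive_compute("You are a helpful assistant.")

print(gpu_calls)  # → 1: the other 99 requests were served from the cache
```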

04

Multi-LoRA serving on a single GPU

Friendli Engine simultaneously supports multiple LoRA models on fewer GPUs (even on just a single GPU!), a remarkable leap in making LLM customization more accessible and efficient.

Read the full blog
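The reason this fits on one GPU is that each LoRA adapter is only a small low-rank delta on top of one shared weight matrix, so many adapters can coexist where a second full model could not. A dependency-free sketch of the math, with illustrative shapes rather than Friendli's actual code:

```python
# Multi-LoRA idea: y = x @ (W + A @ B), where W is shared across all
# tenants and only the tiny A/B matrices differ per request.

def matmul(a, b):
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]

def lora_forward(x, w_base, a, b):
    base = matmul(x, w_base)                 # shared base computation
    delta = matmul(matmul(x, a), b)          # per-adapter low-rank delta
    return [[p + q for p, q in zip(rb, rd)] for rb, rd in zip(base, delta)]

w_base = [[1.0, 0.0], [0.0, 1.0]]            # one shared 2x2 base weight
adapters = {                                 # rank-1 adapter per tenant
    "tenant-a": ([[1.0], [0.0]], [[0.0, 2.0]]),
    "tenant-b": ([[0.0], [1.0]], [[3.0, 0.0]]),
}
x = [[1.0, 1.0]]
for name, (a, b) in adapters.items():
    print(name, lora_forward(x, w_base, a, b))
# tenant-a → [[1.0, 3.0]], tenant-b → [[4.0, 1.0]]: same base, different outputs
```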

05

Deploy generative AI models, including LLMs and more

Friendli Engine supports a wide range of generative AI models.

Supported models:
GPT, Llama, Llama 2 (including Alpaca, Vicuna, and more), Code Llama, Mistral (including Zephyr), Mixtral, MPT, GPT-J, GPT-NeoX (including Pythia, Dolly, and more), Falcon, BLOOM, OPT, BlenderBot, T5 (including T5 v1.1, FLAN, and more), Replit, Phi-2, Solar, Qwen, Stable Diffusion, and more.

HOW TO USE

Three ways to run generative AI models with Friendli Engine:

01

Friendli Serverless Endpoints

Fast and affordable API for open-source LLMs and LMMs

Learn more

02

Friendli Dedicated Endpoints

Run custom LLMs on autopilot with Friendli Dedicated Endpoints

Learn more

03

Friendli Container

Serve LLMs in your private environment

Learn more

[1–3] Testing conducted by FriendliAI in October 2023 using Llama-2-13B running on Friendli Engine. See the detailed results and methodology here.

Supercharge generative AI serving with Friendli Engine