Friendli Inference
The fastest LLM inference engine on the market
What Friendli Inference offers
Speed up LLM serving and cut costs by 50~90%
Friendli Inference is highly optimized to make LLM serving fast and cost-effective. Process LLM inference with Friendli Inference, the fastest engine on the market. Our performance testing shows that Friendli Inference is significantly faster than vLLM and TensorRT-LLM.
Multi-LoRA serving on a single GPU
Friendli Inference simultaneously supports multiple LoRA models on fewer GPUs (even on just a single GPU!), a remarkable leap in making LLM customization more accessible and efficient.
Deploy LLMs and more!
Friendli Inference supports a wide range of generative AI models, including quantized models and MoE.
Key Technology
Iteration batching
(aka continuous batching)
Iteration batching is a batching technology we invented to handle concurrent generation requests efficiently. It can achieve up to tens of times higher LLM inference throughput than conventional batching while satisfying the same latency requirement. The technology is protected by our patents in the US, Korea, and China.
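To make the idea concrete, here is a toy simulation that is not Friendli's implementation. It only counts model iterations, assuming each request needs a known number of decoding steps and that the scheduler may admit or retire requests after every step. The function names and the workload are hypothetical, and real systems must also account for prefill cost and memory limits.

```python
from collections import deque

def static_batching_steps(lengths, batch_size):
    """Conventional batching: each batch runs until its longest request finishes."""
    steps = 0
    for i in range(0, len(lengths), batch_size):
        steps += max(lengths[i:i + batch_size])
    return steps

def iteration_batching_steps(lengths, batch_size):
    """Iteration-level scheduling: after every step, finished requests
    leave the batch and queued requests take the freed slots."""
    queue = deque(lengths)   # remaining tokens per waiting request
    running = []             # remaining tokens per in-flight request
    steps = 0
    while queue or running:
        # Fill free slots before running the next model iteration.
        while queue and len(running) < batch_size:
            running.append(queue.popleft())
        steps += 1
        # Each running request emits one token; drop those that just finished.
        running = [r - 1 for r in running if r > 1]
    return steps

# Hypothetical workload: short and long generations interleaved.
lengths = [2, 8, 2, 8, 2, 8, 2, 8]
static_steps = static_batching_steps(lengths, batch_size=2)
iteration_steps = iteration_batching_steps(lengths, batch_size=2)
print(static_steps, iteration_steps)
```

In this sketch the short requests no longer wait for the long ones sharing their batch, so the same work completes in fewer iterations. The size of the gain depends entirely on the workload.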
DNN library
Friendli DNN Library is a set of optimized GPU kernels designed specifically for generative AI. The library lets Friendli Inference run LLM inference faster across a variety of tensor shapes and data types, and supports quantization, Mixture of Experts, LoRA adapters, and more.
Friendli TCache
Friendli TCache intelligently identifies and stores frequently used computational results. Friendli Inference leverages the cached results, significantly reducing the workload on the GPUs.
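The general principle of reusing stored results can be illustrated with a minimal memoization sketch. This is not how TCache works internally, which the text above does not describe. `ResultCache` and `expensive_prefill` are hypothetical names, and the arithmetic merely stands in for GPU work on a recurring prompt.

```python
class ResultCache:
    """Stores results keyed by input so repeated inputs skip recomputation."""

    def __init__(self, compute):
        self._compute = compute
        self._store = {}
        self.hits = 0
        self.misses = 0

    def get(self, key):
        if key in self._store:
            self.hits += 1
        else:
            self.misses += 1
            self._store[key] = self._compute(key)
        return self._store[key]

def expensive_prefill(prompt_tokens):
    # Hypothetical stand-in for costly GPU computation over a prompt.
    return sum(t * t for t in prompt_tokens)

cache = ResultCache(expensive_prefill)
system_prompt = (101, 7, 42, 9)   # the same prompt arrives three times
results = [cache.get(system_prompt) for _ in range(3)]
print(cache.hits, cache.misses)
```

Only the first request pays for the computation; the later two are served from the cache.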
Speculative decoding
Friendli Inference natively supports speculative decoding, an optimization technique that rapidly speeds up LLM/LMM inference by making educated guesses on future tokens in parallel while generating the current token. Through validation of the generated potential future tokens, speculative decoding ensures identical model outputs at a fraction of the inference time.
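The draft-then-validate loop can be sketched with toy models. This is a simplified greedy variant under stated assumptions, not Friendli's implementation: `target_next` and `draft_next` are hypothetical deterministic stand-ins for a large model and a cheap draft model, and the per-token verification loop represents what a real system would do in a single batched forward pass of the large model.

```python
def target_next(context):
    """Hypothetical large model: deterministic next token for a context."""
    return (sum(context) * 31 + len(context)) % 50

def draft_next(context):
    """Hypothetical draft model: matches the target except at every 4th position."""
    guess = target_next(context)
    return guess if len(context) % 4 else (guess + 1) % 50

def greedy_decode(prompt, n_tokens):
    out = list(prompt)
    for _ in range(n_tokens):
        out.append(target_next(out))
    return out[len(prompt):]

def speculative_decode(prompt, n_tokens, k=4):
    out = list(prompt)
    target_passes = 0
    while len(out) - len(prompt) < n_tokens:
        # 1) The draft model guesses k future tokens.
        draft = []
        for _ in range(k):
            draft.append(draft_next(out + draft))
        # 2) The target validates the guesses (counted as one pass here).
        target_passes += 1
        accepted = []
        for tok in draft:
            expected = target_next(out + accepted)
            if tok == expected:
                accepted.append(tok)
            else:
                accepted.append(expected)  # replace the first wrong guess
                break
        else:
            # All k guesses were right; the target's pass yields one more token.
            accepted.append(target_next(out + accepted))
        out.extend(accepted)
    return out[len(prompt):len(prompt) + n_tokens], target_passes

prompt = [1, 2, 3]
baseline = greedy_decode(prompt, 12)
tokens, passes = speculative_decode(prompt, 12)
print(tokens == baseline, passes)
```

Every emitted token is one the target model would have chosen itself, which is why the output matches plain greedy decoding while requiring fewer target passes than tokens.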
Highlights
Running Quantized Mixtral 8x7B on a Single GPU
We quantized the Mixtral-8x7B-Instruct v0.1 model with AWQ and ran it on a single NVIDIA A100 80GB GPU. Both TTFT and TPOT outperform those of a baseline vLLM system: Friendli Inference achieves at least 4.1x faster response time and 3.8x ~ 23.8x higher token throughput.
Quantized Llama 2 70B on a Single GPU
With Friendli Inference, running AWQ-ed models is seamless. For example, one can run AWQ-ed LLMs natively on Friendli Inference (e.g., Llama 2 70B 4-bit on a single A100 80GB GPU). Running LLMs with AWQ on Friendli Inference enables efficient LLM deployment without sacrificing accuracy.
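A rough back-of-the-envelope estimate shows why 4-bit weights make the single-GPU case plausible. It assumes a nominal 70 billion parameters and counts weight storage only; quantization scales and zero points, the KV cache, and activations all need additional memory that this sketch ignores.

```python
def weight_memory_gb(n_params, bits_per_weight):
    """Approximate weight storage in GB (10^9 bytes); weights only."""
    return n_params * bits_per_weight / 8 / 1e9

N_PARAMS = 70e9        # nominal parameter count, an assumption
GPU_MEMORY_GB = 80     # A100 80GB

fp16_gb = weight_memory_gb(N_PARAMS, 16)
int4_gb = weight_memory_gb(N_PARAMS, 4)
print(f"FP16 weights:  {fp16_gb:.0f} GB, fits on one GPU: {fp16_gb < GPU_MEMORY_GB}")
print(f"4-bit weights: {int4_gb:.0f} GB, fits on one GPU: {int4_gb < GPU_MEMORY_GB}")
```

Under these assumptions the 16-bit weights alone exceed the GPU's memory, while the 4-bit weights leave headroom for the rest of the serving state.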
Even faster TTFT with Friendli TCache
Friendli TCache reuses recurring computations, optimizing TTFT (Time to First Token) by leveraging cached results. We show that Friendli Inference delivers 11.3x to 23x faster TTFT compared to vLLM.
Three ways to run generative AI models with Friendli Inference:
02
Friendli Container
Serve LLM and LMM inferences with Friendli Inference in your private environment
03
Friendli Serverless Endpoints
Call our fast and affordable API for open-source generative AI models