- April 12, 2024
- 2 min read
Easily Migrating LLM Inference Serving from vLLM to Friendli Container

vLLM is an open-source inference engine that provides a starting point for serving your large language models (LLMs). When it comes to production environments, however, various optimizations become crucial, including efficient quantized models and efficient use of computation (e.g., MoE techniques), and this is where vLLM faces challenges. For production, Friendli Container is a much better option, and it is gaining popularity among companies that need to serve LLMs at scale. While vLLM provides an easy entrance to inference serving, this article illustrates how Friendli Container is equally easy to use, with one simple extra step.
Friendli Container: Built for Production
Friendli Container leverages unique optimizations, including the Friendli DNN library optimized for generative AI, iteration batching (also known as continuous batching), efficient quantization, and TCache techniques, making it ideal for production environments. It offers superior performance and handles heavy workloads efficiently. As shown in these two benchmark articles, Friendli Container exhibits roughly 10x faster TTFT (time to first token) and 10x faster TPOT (time per output token) under modest loads while serving an AWQ-quantized Mixtral 8x7B model on an NVIDIA A100 80GB GPU.
Moving to Friendli Container: An Easy Transition
Launching an inference serving container with vLLM is pretty straightforward. As instructed in the blog post and the documentation, you can install it in your local environment with `pip install vllm`, or you can download a pre-built Docker image with:
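For example, assuming the official vllm/vllm-openai image on Docker Hub:

```bash
docker pull vllm/vllm-openai:latest
```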
With the image, you can launch the server with the following command:
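A minimal sketch of that command, using the standard flags from vLLM's Docker documentation; the model name mistralai/Mistral-7B-Instruct-v0.2 is only an illustrative choice:

```bash
docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  --env "HUGGING_FACE_HUB_TOKEN=<your_hf_token>" \
  vllm/vllm-openai:latest \
  --model mistralai/Mistral-7B-Instruct-v0.2
```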
Transitioning from vLLM to Friendli Container is very easy. Here's what you need to do:
- Sign up: Create a Friendli Suite account and generate a Personal Access Token and a Container Secret for user authentication and container activation.
- Download: Pull the trial image for Friendli Container from Friendli's registry after a Docker login with your Personal Access Token, as shown below.
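A sketch of the login-and-pull step, assuming the registry.friendli.ai/trial image path from Friendli's documentation; substitute your own email and token, and check the current docs for the exact path:

```bash
export FRIENDLI_PAT="<your_personal_access_token>"
docker login registry.friendli.ai -u <your_email> -p $FRIENDLI_PAT
docker pull registry.friendli.ai/trial
```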
- Launch Friendli Container: Launching a Friendli Container closely resembles launching a vLLM server. You'll need your Container Secret, plus the model name and port details, as in the sketch below.
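A sketch of the launch command, assuming the FRIENDLI_CONTAINER_SECRET environment variable and the --web-server-port and --hf-model-name options from Friendli's documentation; again, the model name is only illustrative:

```bash
export FRIENDLI_CONTAINER_SECRET="<your_container_secret>"
docker run --gpus all -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -e FRIENDLI_CONTAINER_SECRET=$FRIENDLI_CONTAINER_SECRET \
  registry.friendli.ai/trial \
  --web-server-port 8000 \
  --hf-model-name mistralai/Mistral-7B-Instruct-v0.2
```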
OpenAI Compatible Inference API: Use Your Favorite Tools
Both Friendli Container and vLLM offer an OpenAI-compatible inference API, so you can send text completion requests through cURL with a command that works identically on both.
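For example, assuming a server listening on localhost:8000 and the illustrative model name from above:

```bash
curl -X POST http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-7B-Instruct-v0.2",
    "prompt": "What is machine learning?",
    "max_tokens": 128
  }'
```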
It also allows you to use popular tools like the OpenAI Python SDK seamlessly on either platform.
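A minimal sketch with the OpenAI Python SDK (v1+), pointing the client's base_url at the local server; the api_key value is a placeholder, since local servers typically do not validate it:

```python
from openai import OpenAI

# Point the OpenAI client at the local vLLM or Friendli Container server.
client = OpenAI(
    base_url="http://localhost:8000/v1",
    api_key="EMPTY",  # placeholder; the local server does not validate it
)

# Send the same text completion request as the cURL example above.
completion = client.completions.create(
    model="mistralai/Mistral-7B-Instruct-v0.2",  # illustrative model name
    prompt="What is machine learning?",
    max_tokens=128,
)
print(completion.choices[0].text)
```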
Ready to Take Your LLMs to the Next Level?
Head over to https://friendli.ai/products/container to start your free trial and experience the power of Friendli Containers for high-performance LLM serving!
Written by
FriendliAI Tech & Research
General FAQ
What is FriendliAI?
FriendliAI is a GPU-inference platform that lets you deploy, scale, and monitor large language and multimodal models in production, without owning or managing GPU infrastructure. We offer three things for your AI models: unmatched speed, cost efficiency, and operational simplicity. Find out which product is the best fit for you here.
How does FriendliAI help my business?
Our Friendli Inference allows you to squeeze more tokens per second out of every GPU. Because you need fewer GPUs to serve the same load, the true metric, tokens per dollar, comes out higher even if the hourly GPU rate looks similar on paper. View pricing
Which models and modalities are supported?
Over 380,000 text, vision, audio, and multi-modal models are deployable out of the box. You can also upload custom models or LoRA adapters. Explore models
Can I deploy models from Hugging Face directly?
Yes. One-click deployment is available: selecting “Friendli Endpoints” on the Hugging Face Hub takes you to our model deployment page, which provides an easy-to-use interface for setting up Friendli Dedicated Endpoints, a managed service for generative AI inference. Learn more about our Hugging Face partnership
Still have questions?
If you want a customized solution for the key issue slowing your growth, email contact@friendli.ai or click Contact Sales; our experts (not a bot) will reply within one business day.