
  • April 12, 2024
  • 2 min read

Easily Migrating LLM Inference Serving from vLLM to Friendli Container


vLLM is an open-source inference engine that provides a starting point for serving your large language models (LLMs). In production environments, however, optimizations such as efficient quantized models [link1, link2] and efficient use of computation (e.g., MoE techniques) become crucial, and this is where vLLM falls short. Friendli Container is a much better option for production, and it is gaining popularity among companies that need to serve LLMs at scale. While vLLM offers an easy entrance to inference serving, this article shows that Friendli Container is just as easy to use, with only a simple extra step.

Friendli Container: Built for Production

Friendli Container leverages unique optimizations, including the Friendli DNN library optimized for generative AI, iteration batching (also known as continuous batching), efficient quantization, and TCache techniques, making it ideal for production environments. It offers superior performance and handles heavy workloads efficiently. As shown in these articles (1 and 2), Friendli Container exhibits roughly 10x faster TTFT (time-to-first-token) and 10x faster TPOT (time-per-output-token) under modest loads while serving an AWQ-quantized Mixtral 8x7B model on an NVIDIA A100 80GB GPU.
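To see why iteration batching matters, here is a minimal, illustrative Python sketch of the scheduling idea (our own simplification, not FriendliAI's implementation): requests are admitted into the running batch at every decoding iteration, and finished requests release their slots immediately instead of forcing the whole batch to drain.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Request:
    prompt: str
    max_new_tokens: int
    generated: list = field(default_factory=list)

def decode_step(batch):
    # Stand-in for one forward pass of the model: every active request
    # receives exactly one new token per iteration.
    for req in batch:
        req.generated.append("<tok>")

def serve(incoming: deque, max_batch_size: int = 8):
    batch = []
    while incoming or batch:
        # Admit waiting requests at every iteration rather than waiting
        # for the whole batch to finish -- the key difference between
        # iteration batching and static batching.
        while incoming and len(batch) < max_batch_size:
            batch.append(incoming.popleft())
        decode_step(batch)
        # Finished requests leave immediately, freeing their slots for
        # the next waiting request.
        batch = [r for r in batch if len(r.generated) < r.max_new_tokens]

serve(deque([Request("Hello", 3), Request("San Francisco is", 5)]))
```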

Moving to Friendli Container: An Easy Transition

Launching inference serving containers using vLLM is pretty straightforward. As instructed in the blog post and the documentation, you can install it in your local environment with pip install vllm or pull a pre-built Docker image with:

```bash
docker pull vllm/vllm-openai:latest
```

With the image, you can launch the server with the following command:

```bash
docker run --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model mistralai/Mistral-7B-v0.1
```

Transitioning from vLLM to Friendli Container is very easy. Here's what you need to do:

  1. Sign up: Create a Friendli Suite account and generate a Personal Access Token and a Container Secret for user authentication and container activation.
  2. Download: Pull the trial image for Friendli Container from Friendli's registry by logging in to Docker with your Personal Access Token (first command below).
  3. Launch Friendli Container: Launching a Friendli Container closely resembles launching a vLLM server; you pass your Container Secret and specify the model name and port details (second command below).

Log in to Friendli's registry and pull the trial image:

```bash
export FRIENDLI_TOKEN="YOUR PERSONAL ACCESS TOKEN"
export YOUR_EMAIL="YOUR EMAIL"

docker login registry.friendli.ai -u $YOUR_EMAIL -p $FRIENDLI_TOKEN
docker pull registry.friendli.ai/trial
```

Then launch the container:

```bash
export FRIENDLI_CONTAINER_SECRET="YOUR CONTAINER SECRET"

docker run --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  -e FRIENDLI_CONTAINER_SECRET=$FRIENDLI_CONTAINER_SECRET \
  registry.friendli.ai/trial \
  --hf-model-name mistralai/Mistral-7B-v0.1 \
  --web-server-port 8000
```
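
Once the container is up, you can sanity-check the endpoint before wiring up clients. Here is a quick check, assuming the server exposes the standard OpenAI-compatible model-listing route (vLLM does; we expect the same from an OpenAI-compatible Friendli deployment, but consult the Friendli documentation to confirm):

```bash
# List the models the server is advertising; the response should include
# mistralai/Mistral-7B-v0.1 once the weights have finished loading.
curl http://localhost:8000/v1/models
```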

OpenAI Compatible Inference API: Use Your Favorite Tools

Both Friendli Container and vLLM offer an OpenAI-compatible inference API, so you can send text completion requests through cURL in exactly the same way on either engine:

```bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-7B-v0.1",
    "prompt": "San Francisco is a",
    "max_tokens": 7,
    "temperature": 0
    }'
```

It also lets you use popular tools like the OpenAI Python SDK seamlessly on either platform:

```python
from openai import OpenAI

openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"  # Fill in your Friendli/vLLM endpoint

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)
completion = client.completions.create(
    model="mistralai/Mistral-7B-v0.1",
    prompt="San Francisco is a",
)
print("Completion result:", completion)
```
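
Because the API is OpenAI-compatible, streaming should work through the same SDK as well. Here is a minimal sketch, assuming the endpoint above and stream support on the completions route (both engines advertise OpenAI compatibility, so this is expected to behave the same on either):

```python
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

# Stream tokens as they are generated instead of waiting for the full
# completion; each chunk carries the next piece of generated text.
stream = client.completions.create(
    model="mistralai/Mistral-7B-v0.1",
    prompt="San Francisco is a",
    max_tokens=32,
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].text, end="", flush=True)
print()
```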

Ready to Take Your LLMs to the Next Level?

Head over to https://friendli.ai/products/container to start your free trial and experience the power of Friendli Container for high-performance LLM serving!


Written by

FriendliAI Tech & Research







