
  • April 12, 2024
  • 2 min read

Easily Migrating LLM Inference Serving from vLLM to Friendli Container


vLLM is an open-source inference engine that provides a starting point for serving your large language models (LLMs). In production environments, however, optimizations such as efficient quantized models [link1, link2] and efficient use of computation (e.g., MoE techniques) become crucial, and this is where vLLM falls short. Friendli Container is a much better option for production, and it is gaining popularity among companies that need to serve LLMs at scale. While vLLM offers an easy entrance to inference serving, this article shows that Friendli Container is just as easy to use, with only a simple extra step.

Friendli Container: Built for Production

Friendli Container leverages unique optimizations, including the Friendli DNN library optimized for generative AI, iteration batching (also known as continuous batching), efficient quantization, and TCache techniques, making it ideal for production environments. It offers superior performance and handles heavy workloads efficiently. As shown in these articles (1 and 2), Friendli Container exhibits roughly 10x faster TTFT (time-to-first-token) and 10x faster TPOT (time-per-output-token) under modest loads while serving an AWQ-quantized Mixtral 8x7B model on an NVIDIA A100 80GB GPU.
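To see why iteration batching matters, here is a minimal, illustrative Python sketch of the scheduling idea (our own simplification, not FriendliAI's implementation): requests are admitted into the running batch at every decoding iteration, and finished requests release their slots immediately instead of forcing the whole batch to drain.

```python
from collections import deque
from dataclasses import dataclass, field

@dataclass
class Request:
    prompt: str
    max_new_tokens: int
    generated: list = field(default_factory=list)

def decode_step(batch):
    # Stand-in for one forward pass of the model: every active request
    # receives exactly one new token per iteration.
    for req in batch:
        req.generated.append("<tok>")

def serve(incoming: deque, max_batch_size: int = 8):
    batch = []
    while incoming or batch:
        # Admit waiting requests at every iteration rather than waiting
        # for the whole batch to finish -- the key difference between
        # iteration batching and static batching.
        while incoming and len(batch) < max_batch_size:
            batch.append(incoming.popleft())
        decode_step(batch)
        # Finished requests leave immediately, freeing their slots for
        # the next waiting request.
        batch = [r for r in batch if len(r.generated) < r.max_new_tokens]

serve(deque([Request("Hello", 3), Request("San Francisco is", 5)]))
```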

Moving to Friendli Container: An Easy Transition

Launching inference serving containers using vLLM is pretty straightforward. As instructed in the blog post and the documentation, you can install it in your local environment with pip install vllm or pull a pre-built Docker image with:

```bash
docker pull vllm/vllm-openai:latest
```

With the image, you can launch the server with the following command:

```bash
docker run --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model mistralai/Mistral-7B-v0.1
```

Transitioning from vLLM to Friendli Container is very easy. Here's what you need to do:

  1. Sign up: Create a Friendli Suite account and generate a Personal Access Token and a Container Secret for user authentication and container activation.
  2. Download: Pull the trial image for Friendli Container from Friendli's registry by logging in to Docker with your Personal Access Token (first command below).
  3. Launch Friendli Container: Launching a Friendli Container closely resembles launching a vLLM server; you pass your Container Secret and specify the model name and port details (second command below).

Log in to Friendli's registry and pull the trial image:

```bash
export FRIENDLI_TOKEN="YOUR PERSONAL ACCESS TOKEN"
export YOUR_EMAIL="YOUR EMAIL"

docker login registry.friendli.ai -u $YOUR_EMAIL -p $FRIENDLI_TOKEN
docker pull registry.friendli.ai/trial
```

Then launch the container:

```bash
export FRIENDLI_CONTAINER_SECRET="YOUR CONTAINER SECRET"

docker run --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  -e FRIENDLI_CONTAINER_SECRET=$FRIENDLI_CONTAINER_SECRET \
  registry.friendli.ai/trial \
  --hf-model-name mistralai/Mistral-7B-v0.1 \
  --web-server-port 8000
```
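
Once the container is up, you can sanity-check the endpoint before wiring up clients. Here is a quick check, assuming the server exposes the standard OpenAI-compatible model-listing route (vLLM does; we expect the same from an OpenAI-compatible Friendli deployment, but consult the Friendli documentation to confirm):

```bash
# List the models the server is advertising; the response should include
# mistralai/Mistral-7B-v0.1 once the weights have finished loading.
curl http://localhost:8000/v1/models
```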

OpenAI Compatible Inference API: Use Your Favorite Tools

Both Friendli Container and vLLM offer an OpenAI-compatible inference API, so you can send text completion requests through cURL in exactly the same way on either engine:

```bash
curl http://localhost:8000/v1/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "mistralai/Mistral-7B-v0.1",
    "prompt": "San Francisco is a",
    "max_tokens": 7,
    "temperature": 0
    }'
```

It also lets you use popular tools like the OpenAI Python SDK seamlessly on either platform:

```python
from openai import OpenAI

openai_api_key = "EMPTY"
openai_api_base = "http://localhost:8000/v1"  # Fill in your Friendli/vLLM endpoint

client = OpenAI(
    api_key=openai_api_key,
    base_url=openai_api_base,
)
completion = client.completions.create(
    model="mistralai/Mistral-7B-v0.1",
    prompt="San Francisco is a",
)
print("Completion result:", completion)
```
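
Because the API is OpenAI-compatible, streaming should work through the same SDK as well. Here is a minimal sketch, assuming the endpoint above and stream support on the completions route (both engines advertise OpenAI compatibility, so this is expected to behave the same on either):

```python
from openai import OpenAI

client = OpenAI(api_key="EMPTY", base_url="http://localhost:8000/v1")

# Stream tokens as they are generated instead of waiting for the full
# completion; each chunk carries the next piece of generated text.
stream = client.completions.create(
    model="mistralai/Mistral-7B-v0.1",
    prompt="San Francisco is a",
    max_tokens=32,
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].text, end="", flush=True)
print()
```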

Ready to Take Your LLMs to the Next Level?

Head over to https://friendli.ai/products/container to start your free trial and experience the power of Friendli Container for high-performance LLM serving!


Written by

FriendliAI Tech & Research







