- August 23, 2024
- 5 min read
Friendli Container Part 1: Efficiently Serving LLMs On-Premise
Friendli Container Series: Part 1 of 2
In the ever-evolving landscape of artificial intelligence, efficiently serving large language models (LLMs) has become a pressing challenge. Recognizing that data is both a critical enabler and a potential security concern for AI models, businesses are increasingly seeking strategies that optimize performance while safeguarding data privacy. While Friendli Dedicated Endpoints offer a simple, secure, reliable, and cost-efficient solution, some industry sectors, such as finance, may prefer on-premise options to maintain more stringent privacy safeguards while harnessing the power of LLMs.
This is where Friendli Container comes in. Designed to streamline the deployment of custom generative AI applications, it leverages containerization to simplify the process. Docker, the best-known containerization technology, encapsulates software in a portable format, making it easy to maintain consistency across different environments. Building on the same approach, Friendli Container lets teams set up and manage LLMs seamlessly, keeping them accessible and performant.
In this article, we explore how to efficiently serve LLMs in on-premise environments using Friendli Container. From setting up a container environment to sending chat completion inference requests to Meta’s Llama 3.1 8B Instruct model, we’ll cover the essential steps to get started. Join us as we take a closer look at the process of implementing these solutions.
Leveraging Containers for Running LLMs
Containers, such as Docker, have emerged as a powerful tool for deploying AI models, particularly for deep learning applications that require reproducible environments. By packaging models as containers, developers and data scientists can guarantee consistent performance across diverse operational landscapes, sidestepping the notorious "it works on my machine" problem. These containers can also be orchestrated to manage and scale workloads across virtual machines or high-performance computing servers.
| Container Capabilities | Benefits for ML Deployment |
| --- | --- |
| Reproducibility | Consistent performance |
| Portability | Run anywhere |
| Scalability | Manage high computational needs |
To initiate a model inference engine within a Docker container, for example, a simple `docker` command can be used, often interacting with model repositories and handling environment variables. In production, containers may handle incoming requests with automatic model engine start-up, continuous batching for generation inference, and more, bringing state-of-the-art AI inference optimizations into practical use with efficiency and ease.
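As an illustration, a typical inference engine can be launched with a single `docker run` command. The image name, port, and GPU selection below are placeholders for a generic inference server, not the Friendli-specific invocation (which is covered later in this post):

```sh
# Illustrative only: launch a generic inference server container on the host GPUs,
# exposing its HTTP port and passing a model-repository token via an environment variable.
docker run -d --gpus all \
  -p 8000:8000 \
  -e HF_TOKEN=$HF_TOKEN \
  some-inference-engine-image:latest
```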
What is a Friendli Container?
Friendli Container is a specialized container designed to facilitate the deployment of generative AI models. By combining the flexibility of containerization with optimized configurations for LLMs, Friendli Container enables users to seamlessly run models without the need to delve deeply into the details of complex inference engines.
Key Features of Friendli Container
- Resource Efficiency: Leveraging optimizations including the Friendli DNN library optimized for generative AI, iteration batching (or continuous batching), efficient quantization, and TCache techniques, it offers superior performance and handles heavy workloads efficiently.
- Addressing the On-Premise Needs: Utilizing existing GPU resources for deploying the container, it ensures full control over your data and system security. For those who favor cloud solutions, you can still take advantage of your chosen cloud provider and GPU resources.
- Custom Model Support: Designed to meet your needs for running unique generative AI models, it supports custom models, including fine-tuned versions of popular models, quantized models, and multi-LoRA models.
- Production-level Management: Monitored and maintained using Grafana, it guarantees smooth operations ideal for production environments. Additionally, technical support from FriendliAI experts is available to assist with any issues that may arise during deployment and maintenance.
Setting up the VM Environment
In this blog post, we will use Docker as our containerization tool. Docker offers a streamlined platform for model inference through its Docker containers and images. To start, download and install Docker from the official website, ensuring compatibility with your virtual machine or host system. Verify the installation with `docker --version`.
Next, refer to the CUDA compatibility guide to ensure that you have an appropriate NVIDIA GPU and NVIDIA driver for our public Docker container image (`registry.friendli.ai/trial:latest`). Refer to the NVIDIA Container Toolkit installation guide to learn more about the setup process, and be sure to install the required NVIDIA drivers and configure Docker for the NVIDIA runtime as instructed in the guide.
Then, head over to Friendli Suite and get started by signing up for a free Friendli Container plan. For authentication purposes, you need to generate a Friendli Personal Access Token and Friendli Container Secret first. Generate your access token from Personal Settings > Tokens and your Friendli Container Secret from Container > Container Secrets.
For running a language model, such as Meta-Llama-3.1-8B-Instruct from the Hugging Face model repository, set the environment variable `HF_TOKEN` to authenticate access. Creating a Hugging Face token is a straightforward process that allows you to access models and use APIs from Hugging Face. Also, export the full model repository name, "meta-llama/Meta-Llama-3.1-8B-Instruct", as `HF_MODEL_NAME`.
Commands Summary:
- Install Docker: Follow the official guide.
- Verify Installation: Execute `docker --version`.
- Set Friendli Environment Variables: `FRIENDLI_EMAIL`, `FRIENDLI_TOKEN`, `FRIENDLI_CONTAINER_SECRET`, `FRIENDLI_CONTAINER_IMAGE`.
- Set Hugging Face Environment Variables: `HF_TOKEN`, `HF_MODEL_NAME`.
- Choose GPU Device: `GPU_ENUMERATION`.
- Login to Docker and Pull Friendli Image: Execute `docker login` and `docker pull`.
- Launch Containers: Execute `docker compose up -d` from https://github.com/friendliai/container-resource.
Set up:
```sh
export FRIENDLI_EMAIL="{YOUR FULL ACCOUNT EMAIL ADDRESS}"
export FRIENDLI_TOKEN="{YOUR PERSONAL ACCESS TOKEN e.g. flp_XXX}"

docker login registry.friendli.ai -u $FRIENDLI_EMAIL -p $FRIENDLI_TOKEN
docker pull registry.friendli.ai/trial:latest
```
```sh
export FRIENDLI_CONTAINER_SECRET="{YOUR FRIENDLI CONTAINER SECRET e.g. flc_XXX}"
export HF_TOKEN="{YOUR HUGGING FACE TOKEN e.g. hf_XXX}"
export HF_MODEL_NAME="meta-llama/Meta-Llama-3.1-8B-Instruct"
export FRIENDLI_CONTAINER_IMAGE="registry.friendli.ai/trial"
export GPU_ENUMERATION="{YOUR GPU DEVICE NUMBER e.g. device0}"
```
```sh
git clone https://github.com/friendliai/container-resource
cd container-resource/quickstart/docker-compose
docker compose up -d
```
Results: once `docker compose up -d` completes, the Friendli Container starts up, loads the model, and begins listening for inference requests (port 8000 in the examples below).
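To confirm that everything came up cleanly, you can check the container status and follow its logs with standard Docker commands (the exact container and service names depend on the compose file, so treat the output as indicative):

```sh
# List running containers and confirm the Friendli Container is up.
docker ps

# From the docker-compose project directory, follow the logs to watch
# the model download and engine start-up progress.
docker compose logs -f
```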
A Quick Chat Completion Demo
After successfully setting up the Docker environment and launching the Friendli Container, you're ready to send inference requests to the Llama 3.1 8B Instruct model. For instance, you can query the LLM with "What makes a good leader?" while setting the maximum token limit to 30 and disabling streaming.
Chat completion inference request:
```sh
curl -X POST http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "What makes a good leader?"}], "max_tokens": 30, "stream": false}'
```
Chat completion inference result:
{ "choices": [ { "finish_reason": "length", "index": 0, "logprobs": null, "message": { "content": "A good leader typically exhibits a combination of skills, traits, and qualities that inspire and motivate others to work towards a common goal. Some key characteristics of", "role": "assistant" } } ], "created": 1724040225, "usage": { "completion_tokens": 30, "prompt_tokens": 41, "total_tokens": 71 } }
To repeatedly send inference requests and observe the LLM inference performance, you can run the code below in your terminal:
```sh
while :; do
  curl -X POST http://127.0.0.1:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"messages": [{"role": "user", "content": "What makes a good leader?"}], "max_tokens": 30, "stream": false}'
  sleep 0.5
done
```
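The examples above disable streaming. If you would rather receive tokens as they are generated, the same endpoint can be called with `"stream": true`; this is a sketch assuming the endpoint streams responses in the OpenAI-compatible server-sent-events format, with `-N` telling curl not to buffer the output:

```sh
# Stream tokens as they are generated instead of waiting for the full reply.
curl -N -X POST http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "What makes a good leader?"}], "max_tokens": 30, "stream": true}'
```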
Get Started with Friendli Container!
Deploying and scaling Friendli Container is a straightforward process as explained in the Friendli Container quickstart guide. Once the containers are launched, you can easily send inference requests to the designated on-premise endpoints configured within your container, allowing for seamless integration into your existing workflows. Additionally, if any issues arise during deployment or maintenance, you can always rely on expert support from the FriendliAI team, who are ready to assist you in resolving any technical challenges.
Moreover, Friendli Containers can be monitored using Grafana—a powerful tool for visualizing and analyzing metrics—to ensure smooth operation and easy maintenance. In Part 2 of the Friendli Container blog series, we’ll explore monitoring Friendli Container in depth using Grafana.
For those looking for a more hassle-free solution, FriendliAI also offers Friendli Dedicated Endpoints, a cloud-based option that simplifies the deployment process even further. If you are interested in experiencing the performance of the Friendli Engine without a full deployment, you can explore Friendli Serverless Endpoints, which provide a convenient way to test and utilize the engine’s capabilities.
Written by
FriendliAI Tech & Research