- August 26, 2024
- 6 min read
Friendli Container Part 2: Monitoring with Grafana

Friendli Container Series: Part 2 of 2
In the second part of our two-part series on learning how to use the Friendli Container, we will learn how to monitor important metrics such as throughput and latency through Prometheus and our customizable Grafana templates, which can be downloaded from our GitHub repository: Friendli Container GitHub Repository. Friendli Container is designed to make the deployment of custom generative AI models simpler, faster, and cheaper. Monitoring and maintaining containers with Grafana helps ensure smooth operations, making it well-suited for production-scale environments.
The basics of Friendli Container have been covered in our previous post, with explanations of containers in general: Friendli Container Part 1: Efficiently Serving LLMs On-Premise. If you're already familiar with the general container setup and want to jump directly to the section on Grafana, skip ahead to Get Started with Friendli Container x Grafana.
Technology used
- Friendli Container
- Prometheus
- Grafana Dashboard (with templates)
To effectively monitor and optimize performance, you can integrate Grafana, an open-source analytics and monitoring platform, with Prometheus to observe the performance of Friendli Containers. Friendli Container exports internal metrics in Prometheus text format, and we provide Grafana Dashboard templates that offer enhanced observability, such as the example dashboard described below.
The dashboard visualizes metrics like ‘Requests Throughput’, ‘Latency’, ‘P90 TTFT (Time to First Token)’, ‘Friendli TCache Hit Ratio’, and more from a Friendli Container instance. Friendli TCache optimizes LLM inferencing by caching frequently used computational results, reducing redundant GPU processing. Higher TCache Hit Ratio leads to lower GPU workloads, ensuring faster P90 TTFT, even under varying load conditions.
A Quick Setup
Execute the terminal commands below after acquiring the necessary values as environment variables (e.g. Friendli Personal Access Token) to efficiently run your generative AI model of choice on your GPUs. In this tutorial, we use the Llama 3.1 8B Instruct model to handle the chat completion inference requests.
Refer to the previous blog “Friendli Container Part 1: Efficiently Serving LLMs On-Premise” for detailed instructions on setting up the VM environment.
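As a rough sketch, the core steps boil down to exporting your credentials as environment variables and pulling the trial image (the exact commands are covered in Part 1 and in the GitHub repository; the variable names other than FRIENDLI_CONTAINER_IMAGE are illustrative):

```sh
# Illustrative setup; substitute your own values from Friendli Suite.
export FRIENDLI_CONTAINER_SECRET="<your-container-secret>"     # Friendli Personal Access Token / container secret (assumed variable name)
export FRIENDLI_CONTAINER_IMAGE="registry.friendli.ai/trial"   # trial image referenced in this post
export HF_MODEL_NAME="meta-llama/Meta-Llama-3.1-8B-Instruct"   # model used in this tutorial (assumed variable name)

# Authenticate against the Friendli registry and pull the image.
docker login registry.friendli.ai
docker pull $FRIENDLI_CONTAINER_IMAGE
```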
After pulling the docker image, you can use the docker images command to list all of the pulled images and the docker image inspect $FRIENDLI_CONTAINER_IMAGE command to view a detailed JSON output for the registry.friendli.ai/trial image. You can use the env command to list all of your exported environment variables.
By default, the container will listen for inference requests at TCP port 8000, and a Grafana service will be available at TCP port 3000. You can optionally change the designated ports using environment variables. For example, if you want to use TCP port 8001 for inference requests and port 3001 for Grafana, execute the command below.
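A minimal sketch of that override, assuming the docker-compose setup reads the ports from environment variables (only LOCAL_GRAFANA_PORT appears later in this post; check the quickstart files for the exact variable names):

```sh
# Assumed variable names; verify them against the quickstart docker-compose files.
export FRIENDLI_PORT=8001        # inference API port (default 8000)
export LOCAL_GRAFANA_PORT=3001   # Grafana UI port (default 3000)
```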
Lastly, execute the docker compose up -d command from our GitHub repository to launch a Friendli Container along with two more containers, one from a Grafana image (grafana/grafana) and one from a Prometheus image (prom/prometheus). Try the docker ps command to see a list of all of your running containers. You can also execute the docker compose down command in the container-resource/quickstart/docker-compose directory to stop and remove all of the running containers.
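Put together, a typical session looks roughly like this (assuming you have cloned the repository and are starting from its root):

```sh
# Move into the quickstart compose directory from the repository root.
cd container-resource/quickstart/docker-compose

# Launch the Friendli Container, Grafana, and Prometheus in the background.
docker compose up -d

# Verify that all three containers are running.
docker ps

# When you are done, stop and remove the containers.
docker compose down
```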
Send Chat Completion Inference Requests
Send inference requests to the Llama 3.1 8B Instruct model right away after successfully launching the Friendli Container! For instance, you can query the LLM with the question “If I hang 5 shirts outside and it takes them 5 hours to dry, how long would it take to dry 30 shirts?” by executing the command below.
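A hedged example of such a request, assuming the default inference port 8000 and the OpenAI-compatible /v1/chat/completions endpoint exposed by the container:

```sh
# Assumes the default inference port (8000) and the chat completions endpoint.
curl -X POST http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [
          {
            "role": "user",
            "content": "If I hang 5 shirts outside and it takes them 5 hours to dry, how long would it take to dry 30 shirts?"
          }
        ],
        "max_tokens": 256
      }'
```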
The chat completion inference result comes back as a JSON response containing the model's generated answer.
Get Started with Friendli Container x Grafana
Have you ever wanted to monitor the performance of your generative AI models in real-time? Imagine having the power to visualize and analyze your models’ inference metrics, all in one place. With Friendli Container x Grafana, that’s exactly what you can do! The enhanced observability helps you quickly identify bottlenecks, optimize performance, and ensure smooth, efficient operations.
Grafana is an open-source analytics and monitoring platform that visualizes LLM inference metrics by connecting to data sources like Prometheus. Through docker compose, we were previously able to launch a Grafana container for monitoring the Friendli Container and a Prometheus container configured to scrape metrics from the Friendli Container processes.
Observe your Friendli Container with Grafana by opening http://127.0.0.1:$LOCAL_GRAFANA_PORT/d/friendli-engine in your browser and logging in with username admin and password admin. You can update the password after the initial login, and you can then access the dashboards showing useful engine metrics, such as throughput and latency.
If you cannot open a browser directly on the GPU machine where the Friendli Container is running, you can use SSH to forward requests from the browser running on your PC to the GPU machine. You may also want to use the -l login_name or -p port options to connect to the GPU machine over SSH.
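For example, to forward local port 3123 on your PC to the Grafana port on the GPU machine (replace the login name, SSH port, and host with your own):

```sh
# Forward local port 3123 to Grafana (default port 3000) on the GPU machine.
ssh -L 3123:127.0.0.1:3000 -l <login_name> -p <ssh_port> <gpu-machine-address>
```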
Afterwards, open http://127.0.0.1:$LOCAL_GRAFANA_PORT/d/friendli-engine (for our example above, the URL would be http://127.0.0.1:3123/d/friendli-engine) in your browser and log in to view the dashboard.
Monitor Different Metrics Using the Grafana Dashboard
While the Friendli Container is handling inference requests, the Grafana dashboard provides a comprehensive view of the performance metrics. By default, metrics are served at http://localhost:8281/metrics. You can configure the port number using the command line option --metrics-port. Our supported metrics are categorized into four groups: counters, gauges, histograms, and quantiles. A quick way to inspect the raw metrics directly is shown after the list below.
- Counters: Cumulative metrics that are often used with the rate() Prometheus function to calculate throughput.
  - friendli_requests_total
  - friendli_responses_total
  - friendli_items_total
  - friendli_failure_by_cancel
  - friendli_failure_by_timeout
  - friendli_failure_by_nan_error
  - friendli_failure_by_reject
- Gauges: Dynamic numerical values that go up and down, representing the current value.
  - friendli_current_requests
  - friendli_current_items
  - friendli_current_assigned_items
  - friendli_current_waiting_items
- Histograms: Track the distribution of the following three variables over time.
  - Friendli TCache hit ratio
  - The length of input tokens
  - The length of output tokens
- Quantiles: Display the current p50 (median), p90, and p99 percentiles for the following three variables.
  - Request completion latency (in nanoseconds)
  - Time to first token (TTFT) (in nanoseconds)
  - Request queueing delay (in nanoseconds)
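For a quick look at the raw metrics outside of Grafana, you can query the metrics endpoint directly (assuming the default metrics port):

```sh
# Fetch the Prometheus-format metrics and filter for Friendli-specific series.
curl -s http://localhost:8281/metrics | grep friendli_
```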
Run the code below in your terminal to repeatedly send inference requests to the Friendli Container and observe the LLM inference performance through the Grafana Dashboard:
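(An illustrative version of such a load loop, assuming the default port and the chat completions endpoint used earlier; the exact script in the repository may differ.)

```sh
# Illustrative load loop: send a chat completion request roughly once per second.
while true; do
  curl -s -X POST http://127.0.0.1:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"messages": [{"role": "user", "content": "Tell me a fun fact about space."}], "max_tokens": 128}' \
    > /dev/null
  sleep 1
done
```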
The image below showcases the inference metrics for Llama 3.1 8B Instruct, highlighting the efficiency and responsiveness of the Friendli Container. The overall throughput of 1.13 requests per second (req/s) indicates low traffic, with a steady flow of data being processed, while the P90 latency of 375 milliseconds demonstrates that the majority of requests are handled with minimal delay. The P90 Time to First Token (TTFT) is particularly impressive at 12.6 milliseconds, underscoring the engine's ability to start generating responses almost instantly.
Explore our blog post "The LLM Serving Engine Showdown: Friendli Inference Outshines" for an in-depth comparison of P90 TTFT performance across various LLM inference engines, including vLLM.
Grafana Templates for Friendli Container
One of the most exciting aspects of using Grafana is the ability to customize dashboards to suit your specific needs. Whether you're tracking a space mission or managing container instances, Grafana allows you to design dashboards that deliver the insights you require. This flexibility enables you to visualize data in the ways that are most meaningful for monitoring Friendli Container instances.
A simple way to create new dashboards is by importing JSON files into Grafana. For instance, the friendli-engine-dashboard-per-instance.json file allows you to set up a dashboard that monitors multiple Friendli Container instances. You can download our Grafana templates as JSON files from the Grafana Templates for Friendli Container section of the Friendli Container GitHub Repository. After downloading the template, go to the 'Import dashboard' page in Grafana and upload the JSON file as shown below.
Grafana 'Import dashboard' page, highlighting the setup for importing the 'Per-Instance View (Friendli Inference)' dashboard with Prometheus as the data source
Conclusion
In summary, integrating Grafana with Friendli Container facilitates a comprehensive, real-time monitoring system, which is crucial for maintaining the optimal performance of generative AI models. The Grafana dashboards imported through our templates display visualizations of critical performance indicators, such as requests throughput, latency distributions, and cache hit ratios. By leveraging these observability features, you can fine-tune your generative AI deployments for maximum reliability and scalability, making them well-suited for production-level workloads.
Written by
FriendliAI Tech & Research