- August 26, 2024
- 9 min read
Friendli Container Part 2: Monitoring with Grafana
Friendli Container Series: Part 2 of 2
In the second part of our two-part series on learning how to use the Friendli Container, we will learn how to monitor important metrics such as throughput and latency through Prometheus and our customizable Grafana dashboard templates, which are available in the Friendli Container GitHub repository. Friendli Container is designed to make the deployment of custom generative AI models simpler, faster, and cheaper. Monitoring and maintaining containers with Grafana helps ensure smooth operations, making it well-suited for production-scale environments.
The basics of Friendli Container, along with an introduction to containers in general, were covered in our previous post: Friendli Container Part 1: Efficiently Serving LLMs On-Premise. If you're already familiar with the general container setup and want to jump directly to the Grafana section, skip ahead to Get Started with Friendli Container x Grafana below.
Technology used
- Friendli Container
- Prometheus
- Grafana Dashboard (with templates)
Friendli Container Dashboard Created from our Grafana Template
To effectively monitor and optimize performance, you can integrate Grafana, an open-source analytics and monitoring platform, with Prometheus to observe the performance of Friendli Containers. Friendli Container exports internal metrics in Prometheus text format, and we provide Grafana Dashboard templates that offer enhanced observability, such as the example shown above.
The dashboard visualizes metrics like ‘Requests Throughput’, ‘Latency’, ‘P90 TTFT (Time to First Token)’, ‘Friendli TCache Hit Ratio’, and more from a Friendli Container instance. Friendli TCache optimizes LLM inferencing by caching frequently used computational results, reducing redundant GPU processing. Higher TCache Hit Ratio leads to lower GPU workloads, ensuring faster P90 TTFT, even under varying load conditions.
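Because the container exposes its metrics in the Prometheus text exposition format, a scrape response is easy to inspect programmatically. Below is a minimal sketch of a parser for that format; the metric names in the sample (`requests_total`, `cache_hit_total`) are made-up placeholders, not the exact names exported by Friendli Container.

```python
# Minimal sketch: parsing Prometheus text-format metrics such as those
# scraped from a container's metrics endpoint. Metric names here are
# illustrative placeholders, not Friendli Container's actual metric names.

def parse_prometheus_text(text):
    """Parse simple Prometheus exposition lines into {metric_name: value}."""
    metrics = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):  # skip HELP/TYPE/comment lines
            continue
        name, _, value = line.rpartition(" ")
        try:
            metrics[name] = float(value)
        except ValueError:
            continue  # skip malformed lines
    return metrics

sample = """
# HELP requests_total Total requests served.
# TYPE requests_total counter
requests_total 1520
cache_hit_total 1140
"""
parsed = parse_prometheus_text(sample)
print(parsed["requests_total"])  # 1520.0
```

In practice you rarely parse this by hand; Prometheus scrapes the endpoint for you and Grafana queries Prometheus, but seeing the raw format helps when debugging a scrape target.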
A Quick Setup
Execute the terminal commands below after acquiring the necessary values as environment variables (e.g. Friendli Personal Access Token) to efficiently run your generative AI model of choice on your GPUs. In this tutorial, we use the Llama 3.1 8B Instruct model to handle the chat completion inference requests.
Refer to the previous blog “Friendli Container Part 1: Efficiently Serving LLMs On-Premise” for detailed instructions on setting up the VM environment.
```sh
export FRIENDLI_EMAIL="{YOUR FULL ACCOUNT EMAIL ADDRESS}"
export FRIENDLI_TOKEN="{YOUR PERSONAL ACCESS TOKEN e.g. flp_XXX}"

docker login registry.friendli.ai -u $FRIENDLI_EMAIL -p $FRIENDLI_TOKEN
docker pull registry.friendli.ai/trial:latest
```
```sh
export FRIENDLI_CONTAINER_SECRET="{YOUR FRIENDLI CONTAINER SECRET e.g. flc_XXX}"
export HF_TOKEN="{YOUR HUGGING FACE TOKEN e.g. hf_XXX}"
export HF_MODEL_NAME="meta-llama/Meta-Llama-3.1-8B-Instruct"
export FRIENDLI_CONTAINER_IMAGE="registry.friendli.ai/trial"
export GPU_ENUMERATION="{YOUR GPU DEVICE NUMBER e.g. device0}"
```
After pulling the Docker image, you can use the `docker images` command to list all of the pulled images and the `docker image inspect $FRIENDLI_CONTAINER_IMAGE` command to view detailed JSON output for the `registry.friendli.ai/trial` image. The `env` command lists all of your exported environment variables.
By default, the container will listen for inference requests at TCP port 8000 and a Grafana service will be available at TCP port 3000. You can optionally change the designated ports using the following environment variables. For example, if you want to use TCP port 8001 and port 3001 for Grafana, execute the command below.
```sh
export FRIENDLI_PORT="8001"
export FRIENDLI_GRAFANA_PORT="3001"
```
Lastly, execute the `docker compose up -d` command using the compose file from our GitHub repository to launch a Friendli Container along with two more containers, one from the Grafana image (`grafana/grafana`) and one from the Prometheus image (`prom/prometheus`).
```sh
git clone https://github.com/friendliai/container-resource
cd container-resource/quickstart/docker-compose
docker compose up -d
```
Try the `docker ps` command to see a list of all of your running containers. You can also execute the `docker compose down` command in the `container-resource/quickstart/docker-compose` directory to stop and remove all of the running containers.
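For orientation, the compose setup wires the three services together roughly as sketched below. This is an illustrative fragment only; the actual `docker-compose.yml` lives in the `container-resource` repository and differs in detail (GPU reservations, the Prometheus scrape configuration, and Grafana provisioning are omitted here).

```yaml
# Illustrative sketch -- see container-resource/quickstart/docker-compose
# for the real file. Service and variable names follow this tutorial.
services:
  friendli:
    image: ${FRIENDLI_CONTAINER_IMAGE}
    ports:
      - "${FRIENDLI_PORT:-8000}:8000"   # inference requests
  prometheus:
    image: prom/prometheus
    # configured to scrape metrics from the Friendli Container
  grafana:
    image: grafana/grafana
    ports:
      - "${FRIENDLI_GRAFANA_PORT:-3000}:3000"  # dashboard UI
```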
Send Chat Completion Inference Requests
Once the Friendli Container is up, you can send inference requests to the Llama 3.1 8B Instruct model right away. For instance, you can query the LLM with the question “If I hang 5 shirts outside and it takes them 5 hours to dry, how long would it take to dry 30 shirts?” by executing the command below.
```sh
curl -X POST http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"messages": [{"role": "user", "content": "If I hang 5 shirts outside and it takes them 5 hours to dry, how long would it take to dry 30 shirts?"}]}'
```
Chat completion inference result:
```json
{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "logprobs": null,
      "message": {
        "content": "A classic lateral thinking puzzle!\n\nThe answer is not simply \"30 hours\" because the number of shirts doesn't directly impact the drying time. The drying time remains the same, 5 hours.\n\nThink about it: if you have 5 shirts, it takes 5 hours to dry them. If you have 10 shirts, it will still take approximately 5 hours to dry them. And if you have 30 shirts, it will still take approximately 5 hours to dry them.\n\nSo, the answer is still 5 hours to dry 30 shirts.",
        "role": "assistant"
      }
    }
  ],
  "created": 1724389731,
  "usage": {
    "completion_tokens": 114,
    "prompt_tokens": 38,
    "total_tokens": 152
  }
}
```
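The same request can be sent from Python. The sketch below uses only the standard library and assumes, as in the curl example, a Friendli Container listening on port 8000; the helper names (`build_chat_request`, `send_chat_request`) are our own, not part of any SDK.

```python
# Sketch: sending a chat completion request from Python with the stdlib.
# Assumes a Friendli Container serving at 127.0.0.1:8000 as in this post.
import json
import urllib.request

def build_chat_request(content, role="user"):
    """Build the JSON payload for a /v1/chat/completions request."""
    return {"messages": [{"role": role, "content": content}]}

def send_chat_request(payload, url="http://127.0.0.1:8000/v1/chat/completions"):
    """POST the payload and return the decoded JSON response."""
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)

payload = build_chat_request(
    "If I hang 5 shirts outside and it takes them 5 hours to dry, "
    "how long would it take to dry 30 shirts?"
)
# reply = send_chat_request(payload)  # requires a running container
# print(reply["choices"][0]["message"]["content"])
```

Sending a few of these in a loop is a convenient way to generate traffic so the Grafana dashboard in the next section has data to display.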
Get Started with Friendli Container x Grafana
Have you ever wanted to monitor the performance of your generative AI models in real-time? Imagine having the power to visualize and analyze your models’ inference metrics, all in one place. With Friendli Container x Grafana, that’s exactly what you can do! The enhanced observability helps you quickly identify bottlenecks, optimize performance, and ensure smooth, efficient operations.
Grafana is an open-source analytics and monitoring platform that visualizes LLM inference metrics by connecting to data sources like Prometheus. The `docker compose up -d` command above already launched a Grafana container for monitoring the Friendli Container, together with a Prometheus container configured to scrape metrics from the Friendli Container processes.
Observe your Friendli Container with Grafana by opening `http://127.0.0.1:$FRIENDLI_GRAFANA_PORT/d/friendli-engine` (port 3000 by default) in your browser and logging in with username `admin` and password `admin`. You can update the password after the initial login, and then access the dashboards showing useful engine metrics, such as throughput and latency.