- August 26, 2024
- 6 min read
Friendli Container Part 2: Monitoring with Grafana

Friendli Container Series: Part 2 of 2
In the second part of our two-part series on learning how to use the Friendli Container, we will learn how to monitor important metrics such as throughput and latency through Prometheus and our customizable Grafana templates, which can be downloaded from our GitHub repository: Friendli Container GitHub Repository. Friendli Container is designed to make the deployment of custom generative AI models simpler, faster, and cheaper. Monitoring and maintaining containers with Grafana helps ensure smooth operations, making it well-suited for production-scale environments.
The basics of Friendli Container have been covered in our previous post, with explanations of containers in general: Friendli Container Part 1: Efficiently Serving LLMs On-Premise. If you're already familiar with the general container setup and want to jump directly to the section on Grafana, skip ahead to Get Started with Friendli Container x Grafana.
Technology used
- Friendli Container
- Prometheus
- Grafana Dashboard (with templates)
To effectively monitor and optimize performance, you can integrate Grafana, an open-source analytics and monitoring platform, with Prometheus to observe the performance of Friendli Containers. Friendli Container exports internal metrics in Prometheus text format, and we provide Grafana Dashboard templates that offer enhanced observability, such as the example dashboard described below.
The dashboard visualizes metrics like ‘Requests Throughput’, ‘Latency’, ‘P90 TTFT (Time to First Token)’, ‘Friendli TCache Hit Ratio’, and more from a Friendli Container instance. Friendli TCache optimizes LLM inferencing by caching frequently used computational results, reducing redundant GPU processing. Higher TCache Hit Ratio leads to lower GPU workloads, ensuring faster P90 TTFT, even under varying load conditions.
A Quick Setup
Execute the terminal commands below after acquiring the necessary values as environment variables (e.g. Friendli Personal Access Token) to efficiently run your generative AI model of choice on your GPUs. In this tutorial, we use the Llama 3.1 8B Instruct model to handle the chat completion inference requests.
Refer to the previous blog “Friendli Container Part 1: Efficiently Serving LLMs On-Premise” for detailed instructions on setting up the VM environment.
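As a rough sketch, the core steps boil down to exporting your credentials as environment variables and pulling the trial image (the exact commands are covered in Part 1 and in the GitHub repository; the variable names other than FRIENDLI_CONTAINER_IMAGE are illustrative):

```sh
# Illustrative setup; substitute your own values from Friendli Suite.
export FRIENDLI_CONTAINER_SECRET="<your-container-secret>"     # Friendli Personal Access Token / container secret (assumed variable name)
export FRIENDLI_CONTAINER_IMAGE="registry.friendli.ai/trial"   # trial image referenced in this post
export HF_MODEL_NAME="meta-llama/Meta-Llama-3.1-8B-Instruct"   # model used in this tutorial (assumed variable name)

# Authenticate against the Friendli registry and pull the image.
docker login registry.friendli.ai
docker pull $FRIENDLI_CONTAINER_IMAGE
```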
After pulling the docker image, you can use the docker images command to list all of the pulled images and the docker image inspect $FRIENDLI_CONTAINER_IMAGE command to view a detailed JSON output for the registry.friendli.ai/trial image. You can use the env command to list all of your exported environment variables.
By default, the container will listen for inference requests at TCP port 8000, and a Grafana service will be available at TCP port 3000. You can optionally change the designated ports using environment variables. For example, if you want to use TCP port 8001 for inference requests and port 3001 for Grafana, execute the command below.
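A minimal sketch of that override, assuming the docker-compose setup reads the ports from environment variables (only LOCAL_GRAFANA_PORT appears later in this post; check the quickstart files for the exact variable names):

```sh
# Assumed variable names; verify them against the quickstart docker-compose files.
export FRIENDLI_PORT=8001        # inference API port (default 8000)
export LOCAL_GRAFANA_PORT=3001   # Grafana UI port (default 3000)
```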
Lastly, execute the docker compose up -d command from our GitHub repository to launch a Friendli Container along with two more containers, one from a Grafana image (grafana/grafana) and one from a Prometheus image (prom/prometheus). Try the docker ps command to see a list of all of your running containers. You can also execute the docker compose down command in the container-resource/quickstart/docker-compose directory to stop and remove all of the running containers.
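Put together, a typical session looks roughly like this (assuming you have cloned the repository and are starting from its root):

```sh
# Move into the quickstart compose directory from the repository root.
cd container-resource/quickstart/docker-compose

# Launch the Friendli Container, Grafana, and Prometheus in the background.
docker compose up -d

# Verify that all three containers are running.
docker ps

# When you are done, stop and remove the containers.
docker compose down
```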
Send Chat Completion Inference Requests
Send inference requests to the Llama 3.1 8B Instruct model right away after successfully launching the Friendli Container! For instance, you can query the LLM with the question “If I hang 5 shirts outside and it takes them 5 hours to dry, how long would it take to dry 30 shirts?” by executing the command below.
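A hedged example of such a request, assuming the default inference port 8000 and the OpenAI-compatible /v1/chat/completions endpoint exposed by the container:

```sh
# Assumes the default inference port (8000) and the chat completions endpoint.
curl -X POST http://127.0.0.1:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
        "messages": [
          {
            "role": "user",
            "content": "If I hang 5 shirts outside and it takes them 5 hours to dry, how long would it take to dry 30 shirts?"
          }
        ],
        "max_tokens": 256
      }'
```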
The chat completion inference result comes back as a JSON response containing the model's generated answer.
Get Started with Friendli Container x Grafana
Have you ever wanted to monitor the performance of your generative AI models in real-time? Imagine having the power to visualize and analyze your models’ inference metrics, all in one place. With Friendli Container x Grafana, that’s exactly what you can do! The enhanced observability helps you quickly identify bottlenecks, optimize performance, and ensure smooth, efficient operations.
Grafana is an open-source analytics and monitoring platform that visualizes LLM inference metrics by connecting to data sources like Prometheus. Through docker compose, we were previously able to launch a Grafana container for monitoring the Friendli Container and a Prometheus container configured to scrape metrics from the Friendli Container processes.
Observe your Friendli Container with Grafana by opening http://127.0.0.1:$LOCAL_GRAFANA_PORT/d/friendli-engine in your browser and logging in with username admin and password admin. You can update the password after the initial login, and you can then access the dashboards showing useful engine metrics, such as throughput and latency.
If you cannot open a browser directly on the GPU machine where the Friendli Container is running, you can use SSH to forward requests from the browser running on your PC to the GPU machine. You may also want to use the -l login_name or -p port options to connect to the GPU machine over SSH.
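For example, to forward local port 3123 on your PC to the Grafana port on the GPU machine (replace the login name, SSH port, and host with your own):

```sh
# Forward local port 3123 to Grafana (default port 3000) on the GPU machine.
ssh -L 3123:127.0.0.1:3000 -l <login_name> -p <ssh_port> <gpu-machine-address>
```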
Afterwards, open http://127.0.0.1:$LOCAL_GRAFANA_PORT/d/friendli-engine (for our example above, the URL would be http://127.0.0.1:3123/d/friendli-engine) in your browser and log in to view the dashboard.
Monitor Different Metrics Using the Grafana Dashboard
While the Friendli Container is handling inference requests, the Grafana dashboard provides a comprehensive view of the performance metrics. By default, metrics are served at http://localhost:8281/metrics. You can configure the port number using the command line option --metrics-port. Our supported metrics are categorized into four groups: counters, gauges, histograms, and quantiles. A quick way to inspect the raw metrics directly is shown after the list below.
- Counters: Cumulative metrics that are often used with the rate() Prometheus function to calculate throughput.
  - friendli_requests_total
  - friendli_responses_total
  - friendli_items_total
  - friendli_failure_by_cancel
  - friendli_failure_by_timeout
  - friendli_failure_by_nan_error
  - friendli_failure_by_reject
- Gauges: Dynamic numerical values that go up and down, representing the current value.
  - friendli_current_requests
  - friendli_current_items
  - friendli_current_assigned_items
  - friendli_current_waiting_items
- Histograms: Track the distribution of the following three variables over time.
  - Friendli TCache hit ratio
  - The length of input tokens
  - The length of output tokens
- Quantiles: Display the current p50 (median), p90, and p99 percentiles for the following three variables.
  - Request completion latency (in nanoseconds)
  - Time to first token (TTFT) (in nanoseconds)
  - Request queueing delay (in nanoseconds)
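For a quick look at the raw metrics outside of Grafana, you can query the metrics endpoint directly (assuming the default metrics port):

```sh
# Fetch the Prometheus-format metrics and filter for Friendli-specific series.
curl -s http://localhost:8281/metrics | grep friendli_
```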
Run the code below in your terminal to repeatedly send inference requests to the Friendli Container and observe the LLM inference performance through the Grafana Dashboard:
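(An illustrative version of such a load loop, assuming the default port and the chat completions endpoint used earlier; the exact script in the repository may differ.)

```sh
# Illustrative load loop: send a chat completion request roughly once per second.
while true; do
  curl -s -X POST http://127.0.0.1:8000/v1/chat/completions \
    -H "Content-Type: application/json" \
    -d '{"messages": [{"role": "user", "content": "Tell me a fun fact about space."}], "max_tokens": 128}' \
    > /dev/null
  sleep 1
done
```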
The image below showcases the inference metrics for Llama 3.1 8B Instruct, highlighting the efficiency and responsiveness of the Friendli Container. The overall throughput of 1.13 requests per second (req/s) indicates low traffic, with a steady flow of data being processed, while the P90 latency of 375 milliseconds demonstrates that the majority of requests are handled with minimal delay. The P90 Time to First Token (TTFT) is particularly impressive at 12.6 milliseconds, underscoring the engine's ability to start generating responses almost instantly.
Explore our blog post "The LLM Serving Engine Showdown: Friendli Inference Outshines" for an in-depth comparison of P90 TTFT performance across various LLM inference engines, including vLLM.
Grafana Templates for Friendli Container
One of the most exciting aspects of using Grafana is the ability to customize dashboards to suit your specific needs. Whether you're tracking a space mission or managing container instances, Grafana allows you to design dashboards that deliver the insights you require. This flexibility enables you to visualize data in the ways that are most meaningful for monitoring Friendli Container instances.
A simple way to create new dashboards is by importing JSON files into Grafana. For instance, the friendli-engine-dashboard-per-instance.json file allows you to set up a dashboard that monitors multiple Friendli Container instances. You can download our Grafana templates as JSON files from the Grafana Templates for Friendli Container section of the Friendli Container GitHub Repository. After downloading the template, go to the 'Import dashboard' page in Grafana and upload the JSON file as shown below.
Grafana 'Import dashboard' page, highlighting the setup for importing the 'Per-Instance View (Friendli Inference)' dashboard with Prometheus as the data source
Conclusion
In summary, integrating Grafana with Friendli Container facilitates a comprehensive, real-time monitoring system, which is crucial for maintaining the optimal performance of generative AI models. The Grafana dashboards imported through our templates display visualizations of critical performance indicators, such as requests throughput, latency distributions, and cache hit ratios. By leveraging these observability features, you can fine-tune your generative AI deployments for maximum reliability and scalability, making them well-suited for production-level workloads.
Written by
FriendliAI Tech & Research