May 22, 2024
8 min read

Measuring LLM Serving Performance with LLMServingPerfEvaluator

Tired of Fixing LLM Serving Benchmarks?

Current tools like Locust and llmperf often struggle to create appropriate workloads with rich performance metrics, and are only limited to testing LLM Inference Endpoints. This makes it difficult to accurately assess the actual capabilities of your LLM serving engine. To overcome these shortcomings, the FriendliAI team has designed LLMServingPerfEvaluator to provide a way to correctly benchmark LLM serving performances.

Introducing LLMServingPerfEvaluator: An All-in-One Benchmarking Tool

LLMServingPerfEvaluator is an open-source tool designed by the FriendliAI team specifically for benchmarking LLM serving engines. It overcomes the limitations of existing solutions by offering:

Realistic Workload Generation: LLMServingPerfEvaluator easily defines workloads that mimic real-world usage patterns. LLMServingPerfEvaluator simulates text generation requests arriving at the serving engine according to a Poisson distribution, allowing you to stress test the engine with varying request rates (λ). Soon, it will also provide workload generation with support for controlling the number of concurrent requests (i.e., concurrency vs. request rate mode).
Customizable Request Inputs: LLMServingPerfEvaluator can craft input-output pairs to match your specific use case. You can either synthesize text data with desired lengths of requests using normal or uniform distributions, or leverage pre-existing datasets from Hugging Face.

Measure What Matters with Key Performance Metrics

LLMServingPerfEvaluator provides various metrics by default to accurately measure the performance of your LLM serving engine. Concretely, it provides in-depth insights into your engine's performance with:

Throughput (requests/sec, input_tokens/sec, output_tokens/sec): Gauges the engine's overall end-to-end capacity by measuring the number of requests, input tokens, and output tokens it can handle per second.
TTFT (time to first token): Measures the time taken to produce the very first response token after receiving a request. This includes any queuing delays.
TPOT (time per output token): Analyzes the efficiency of token generation by calculating the average time taken to generate each subsequent token after the first one (excluding TTFT). Also known as inter-token latency.
Latency (end-to-end processing time per request): Provides the complete picture by measuring the total time taken to process a request, encompassing both TTFT and TPOTs.
Token Latency (the average processing time per out token): Measures the average of the time taken to generate each output token including the first token. This value is obtained by dividing the latency by the total number of output tokens.

Effortless Benchmarking with Docker and Grafana

Setting up LLMServingPerfEvaluator is a breeze. Leverage Docker Compose for a streamlined experience and monitor live performance metrics with Grafana.

Benchmarking vLLM vs Friendli: A Step-by-Step Guide

Let's put theory into practice! This guide will walk you through benchmarking vLLM and Friendli using the meta-llama/Meta-Llama-3-8B-Instruct model on a single NVIDIA A100 GPU. If you want to measure the performance of a single specific LLM serving engine (e.g., Friendli Inference), you can refer to the readme in the LLMServingPerfEvaluator repository.

Prerequisites:

Install Docker Compose: https://docs.docker.com/compose/install/
A Basic Understanding of Docker and Docker Compose: https://docs.docker.com/compose/intro/features-uses/
Friendli Container: https://friendli.ai/suite/default-team/container/quickstart
Hugging Face Access Token: https://huggingface.co/docs/hub/security-tokens

Step 0: Clone the repository

bash
git clone https://github.com/friendliai/LLMServingPerfEvaluator.git
cd LLMServingPerfEvaluator

Step 1: Prepare your working directory

bash
mkdir -p workspace/config/request_config
mkdir -p workspace/config/workload_config

Step 2-1. Set up the configuration files: `workload_config.yaml`

Let’s compare vLLM and Friendli with the synthetic workload. We use the workload configuration in the repository.

bash
cp examples/workload_config/dummy.yaml \
workspace/config/workload_config/

The workload configuration is as follow:

yaml
# workspace/config/workload_config/dummy.yaml
type: dummy
dataset_config:
  vocab_size: 128000
  system_prompt_config:
    - name: '1'
      size: 10
  dataset:
    - input:
        - type: uniform
          min: 100
          max: 150
      output:
        - type: uniform
          min: 100
          max: 300
      system_prompt:
        - name: '1'

According to our configuration, each request generates an input using integers from 0 to 127999 (i.e., vocab_size), with the input length uniformly sampled between 100 and 150 (i.e., input min/max) using the system prompt of length 10 integers. And the output length uniformly sampled between 100 and 300 (i.e., output min/max) If you wish to modify the workload, please refer to the document on generating different workloads.

Step 2-2. Set up the configuration files: `request_config.yaml`

As the request body format differs according to each engine, we will use request configuration files in the repository for vLLM and the Friendli Inference.

bash
cp examples/request_config/friendli.yaml \
workspace/config/request_config/
cp examples/request_config/vllm.yaml workspace/config/request_config/

The request configurations are as follow:

yaml
# workspace/config/request_config/vllm.yaml
stream: true
name: model

yaml
# workspace/config/request_config/friendli.yaml
stream: true

Step 2-3. Set up the configuration files: `grafana` & `prometheus`

We use the grafana and prometheus configuration files in the repository. Please copy the grafana directory and prometheus directory in the repository to the workspace. (grafana/provisioning/dashboards/single_engine_dashboard.json does not have to be copied.)

bash
cp -r grafana/ workspace/
cp -r prometheus/ workspace/
rm workspace/grafana/provisioning/dashboards/single_engine_dashboard.json
mv workspace/grafana/provisioning/compare_engines_dashboard.json workspace/grafana/provisioning/dashboards

Now, the directory structure should look like this:

bash
tree workspace/ -a

workspace/
├── config
│   ├── request_config
│   │   ├── friendli.yaml
│   │   └── vllm.yaml
│   └── workload_config
│       └── dummy.yaml
├── grafana
│   └── provisioning
│       ├── dashboards
│       │   ├── compare_engines_dashboard.json
│       │   └── dashboard.yaml
│       └── datasources
│           └── datasource.yaml
└── prometheus
    └── config
        └── prometheus.yml

Step 3. Set up the `docker-compose.yml` and the `.env` file (or environment variables)

Copy the examples/docker_compose/compare-docker-compose.yml file and examples/docker_compose/.compare_env file in the repository to the workspace directory.

bash
cp examples/docker_compose/compare-docker-compose.yml \
workspace/docker-compose.yml
cp examples/docker_compose/.compare_env workspace/.env

When you open the workspace/.env file, there are environment variables for the experiment. The environment variables within the curly braces should be replaced with the actual values.

First of all, HF_HUB_CACHE and HF_TOKEN should be filled out. For more details on the Hugging Face environment variables, please refer to this document.
Second, FRIENDLI_CONTAINER_REPO, FRIENDLI_CONTAINER_TAG, and FRIENDLI_CONTAINER_SECRET should be filled out. Before starting the tutorial, one should obtain the permission for using the - Friendli Inference trial docker image with FRIENDLI_CONTAINER_SECRET.
Third, UID should be filled out. find your UID in the host machine with the following command:

bash
id -u

After fixing the .env file, it should look like:

.env
# workspace/.env
# Experiment Environment Variables
HF_MODEL_NAME=meta-llama/Meta-Llama-3-8B-Instruct # The model repo id of the Hugging Face model
REQUEST_RATES=1,3,5,7,9,11 # The average number of requests per second
DURATION=300 # The duration of the experiment in seconds
TIMEOUT=450 # The timeout of the experiment in seconds
CONFIG_DIR=./config # The directory path to the configuration files: workload_config.yaml, request_config.yaml
RESULT_DIR=./result # The directory path to save the end-to-end metrics

# Experiment Environment Variables for the Hugging Face model
# HF_HUB_CACHE={HF_CACHE_DIR}
# HF_TOKEN={HF_API_TOKEN}

# user id by `id -u` in your host machine
UID={UID}

# Friendli Inference Environment Variables
FRIENDLI_CUDA_VISIBLE_DEVICES=0
FRIENDLI_NUM_DEVICES=1
FRIENDLI_CONTAINER_REPO=registry.friendli.ai/trial
FRIENDLI_CONTAINER_TAG=latest
FRIENDLI_CONTAINER_SECRET={FRIENDLI_CONTAINER_SECRET}

# vllm engine
VLLM_CUDA_VISIBLE_DEVICES=1
VLLM_NUM_DEVICES=1
VLLM_REPO=vllm/vllm-openai
VLLM_TAG=v0.4.2

Now, the directory structure should look like this:

bash
tree workspace/ -a

workspace/
├── .env
├── config
│   ├── request_config
│   │   ├── friendli.yaml
│   │   └── vllm.yaml
│   └── workload_config
│       └── dummy.yaml
├── docker-compose.yml
├── grafana
│   └── provisioning
│       ├── dashboards
│       │   ├── compare_engines_dashboard.json
│       │   └── dashboard.yaml
│       └── datasources
│           └── datasource.yaml
└── prometheus
    └── config
        └── prometheus.yml

You can check whether your docker-compose.yml is correctly filled out with the following command in the workspace directory.:

bash
cd workspace
docker compose config

Step 4. Execute the performance evaluation

At the workspace directory, run the performance evaluation with the following command:

bash
cd workspace
docker compose up -d

Step 5. Monitor the live performance

Access grafana to monitor the live performance of your evaluation at http://localhost:3000. The default username and password are both admin and admin.

Dashboard

After the experiments, you can still access the experiment performance metrics with a .csv file at the workspace/result directory.

Step 6. Cleanup

Clean up the docker compose after the experiments at

bash
docker compose down

Using a Different Hugging Face Model or a Quantized Model

To use a different Hugging Face model or a quantized model, you need to modify the .env file and the docker-compose.yml file.

If you wish to use:

FP16, BF16, or FP32 versions of a model: Set the repository name of the model available on Hugging Face hub to HF_MODEL_NAME in the .env file.
FP8 versions of a model: Set the repository names of the model supported by Friendli Inference and vLLM as separate environment variables and modify the docker-compose.yml file.
- For the Friendli Inference, use the repository of the FP8 versions of the model (e.g., FriendliAI/Meta-Llama-3-8B-fp8).
- For vLLM, use the repository of the non-quantized model (e.g., meta-llama/Meta-Llama-3-8B) and add --quantization fp8 to the command of the vllm-engine service in the docker-compose.yml file.
AWQ-ed models: Set the repository name of the model with the AWQ-applied model, available on the Hugging Face hub(e.g., TheBloke/Llama-2-13B-AWQ) to HF_MODEL_NAME in the .env file. For vLLM, add --quantization awq to the command of the vllm-engine service in the docker-compose.yml file.
MoE models: Set the repository name of the model with the MoE model, available on the Hugging Face hub to HF_MODEL_NAME in the .env file.

For Friendli Inference, we recommend using the policy search feature for optimizing the inference performance of AWQ-ed models, MoE models, or FP8 versions of models, and set the directory path of the searched policy file to POLICY_DIR in the .env file. For more details of policy search, you can refer to this document.

Understanding Your Benchmark Results

The results from LLMServingPerfEvaluator provide valuable insights into your LLM serving engine's performance. Normally, the throughput and latency metrics should be analyzed concurrently, to set a performance level objective (i.e., throughput or latency), and make sure that both of these metrics satisfy the criteria. Here's how to interpret the key metrics:

Throughput:

Throughput

Measured in processed requests per second, input tokens per second, and output tokens per second.
Ideal Scenario: Throughput increases proportionally with the request_rate. This indicates the engine is handling requests efficiently.
Watch Out For: If the throughput plateaus or even dips at higher request rates, it suggests that the engine is overloaded.
In our experiment above, the throughput of vLLM does not increase at request rate 13, meaning that vLLM is unable to keep up with the request rate, while Friendli can handle requests efficiently.

Latency:

Latency

Measures the total time to process a request.
Ideal Scenario: Latency remains relatively stable across different request_rates. This indicates the engine is processing requests promptly for the different rates.
Watch Out For: A significant increase in latency signifies that the engine is struggling to keep up with the workload. If it continues to increase over a period of time under a fixed request rate, it means that the requests are piling up in the queue (i.e., increasing queueing delay) which is an evident factor that the engine is overloaded and unable to process the request rates.
In our experiment above, while the throughput of each engine follows the given request rates(i.e., 7, 9, 11, 13), the latency of vLLM increases as we increase the request rate, meaning that vLLM is struggling to keep up with the request rate of 11 and 13.

Time to First Token (TTFT):

Measures the time taken to generate the first response token after receiving a request.
Ideal Scenario: TTFT remains low and stable across request rates.
Watch Out For: An increase in TTFT indicates queuing delays, suggesting that the engine might be overloaded. This indicates that the engine is struggling to properly handle the batch of requests. Similar to the latency, if it continues to increase, it means that the requests are piling up in the queue, indicating an overloaded engine.

Time per Output Token (TPOT):

Measures the average time taken to generate each subsequent token after the first one.
Ideal Scenario: TPOT remains relatively constant across request rates. This suggests efficient token generation.
Watch Out For: An increase in TPOT indicates the engine is taking longer to generate each token as the batch size increases. This metric does not peak as much as latency and TTFT.

Leveraging Visualization Tools:

Use Grafana to monitor live performance metrics and observe trends.
Analyze the final results through the CSV file (located in workspace/result) for a comprehensive picture of engine performance. For example, you could generate a latency-throughput graph to analyze the two performance metrics concurrently.

By combining these metrics and visualizations, you can identify the optimal request rate for your engine and understand its limitations under a certain load pressure. This information is crucial for optimizing your LLM serving infrastructure and ensuring that it meets your application demands.

Written by

FriendliAI Tech & Research

General FAQ

What is FriendliAI?

FriendliAI is a GPU-inference platform that lets you deploy, scale, and monitor large language and multimodal models in production, without owning or managing GPU infrastructure. We offer three things for your AI models: Unmatched speed, cost efficiency, and operational simplicity. Find out which product is the best fit for you in here.

How does FriendliAI help my business?

Our Friendli Inference allows you to squeeze more tokens-per-second out of every GPU. Because you need fewer GPUs to serve the same load, the true metric—tokens per dollar—comes out higher even if the hourly GPU rate looks similar on paper. View pricing

Which models and modalities are supported?

Over 380,000 text, vision, audio, and multi-modal models are deployable out of the box. You can also upload custom models or LoRA adapters. Explore models

Can I deploy models from Hugging Face directly?

Yes. A one-click deploy by selecting “Friendli Endpoints” on the Hugging Face Hub will take you to our model deployment page. The page provides an easy-to-use interface for setting up Friendli Dedicated Endpoints, a managed service for generative AI inference. Learn more about our Hugging Face partnership

Still have questions?

If you want a customized solution for that key issue that is slowing your growth, contact@friendli.ai or click Talk to an expert — our experts (not a bot) will reply within one business day.