- May 22, 2024
Measuring LLM Serving Performance with LLMServingPerfEvaluator
Tired of Fixing LLM Serving Benchmarks?
Current tools like Locust and llmperf often struggle to create appropriate workloads with rich performance metrics, and are limited to testing LLM inference endpoints. This makes it difficult to accurately assess the actual capabilities of your LLM serving engine. To overcome these shortcomings, the FriendliAI team has designed LLMServingPerfEvaluator to provide a way to correctly benchmark LLM serving performance.
Introducing LLMServingPerfEvaluator: An All-in-One Benchmarking Tool
LLMServingPerfEvaluator is an open-source tool designed by the FriendliAI team specifically for benchmarking LLM serving engines. It overcomes the limitations of existing solutions by offering:
- Realistic Workload Generation: LLMServingPerfEvaluator lets you easily define workloads that mimic real-world usage patterns. It simulates text generation requests arriving at the serving engine according to a Poisson distribution, allowing you to stress test the engine with varying request rates (λ). Soon, it will also support controlling the number of concurrent requests (i.e., a concurrency mode in addition to the request rate mode).
- Customizable Request Inputs: LLMServingPerfEvaluator can craft input-output pairs to match your specific use case. You can either synthesize text data with desired lengths of requests using normal or uniform distributions, or leverage pre-existing datasets from Hugging Face.
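To make the Poisson arrival pattern described above concrete, here is a minimal Python sketch. It is not the tool's implementation: the function name `poisson_arrival_times` and its parameters are hypothetical, and it relies only on the standard fact that a Poisson process with rate λ has exponentially distributed gaps between arrivals.

```python
import random


def poisson_arrival_times(rate, duration, seed=0):
    """Return request send times (in seconds) for a Poisson arrival process.

    rate:     average number of requests per second (lambda)
    duration: length of the experiment in seconds
    """
    rng = random.Random(seed)
    times = []
    t = rng.expovariate(rate)  # exponential gap before the first request
    while t < duration:
        times.append(t)
        t += rng.expovariate(rate)  # exponential gap before the next request
    return times


arrivals = poisson_arrival_times(rate=5.0, duration=300.0)
print(len(arrivals) / 300.0)  # average request rate, close to 5 requests/sec
```

Because the gaps are random, requests arrive in bursts and lulls rather than at fixed intervals, which is what makes this kind of workload a more realistic stress test than a constant-interval load.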
Measure What Matters with Key Performance Metrics
LLMServingPerfEvaluator provides various metrics by default to accurately measure the performance of your LLM serving engine. Concretely, it provides in-depth insights into your engine's performance with:
- Throughput (requests/sec, input_tokens/sec, output_tokens/sec): Gauges the engine's overall end-to-end capacity by measuring the number of requests, input tokens, and output tokens it can handle per second.
- TTFT (time to first token): Measures the time taken to produce the very first response token after receiving a request. This includes any queuing delays.
- TPOT (time per output token): Analyzes the efficiency of token generation by calculating the average time taken to generate each subsequent token after the first one (excluding TTFT). Also known as inter-token latency.
- Latency (end-to-end processing time per request): Provides the complete picture by measuring the total time taken to process a request, encompassing both TTFT and TPOTs.
- Token Latency (the average processing time per output token): Measures the average time taken to generate each output token, including the first token. This value is obtained by dividing the latency by the total number of output tokens.
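The relationships between these metrics can be sketched in a few lines of Python. This is an illustration of the definitions above, not the tool's code: the `RequestTrace` class, the function names, and the choice to divide by a caller-supplied experiment duration for throughput are all assumptions.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTrace:
    sent_at: float            # time the request was sent (seconds)
    token_times: List[float]  # arrival time of each output token (seconds)


def ttft(r):
    # Time from sending the request to receiving the first token.
    return r.token_times[0] - r.sent_at


def latency(r):
    # End-to-end time from sending the request to the last token.
    return r.token_times[-1] - r.sent_at


def tpot(r):
    # Average time per token after the first one (TTFT excluded).
    n = len(r.token_times)
    return (latency(r) - ttft(r)) / (n - 1) if n > 1 else 0.0


def token_latency(r):
    # Latency divided by the total number of output tokens.
    return latency(r) / len(r.token_times)


def throughput(traces, input_lens, duration):
    # Requests, input tokens, and output tokens handled per second.
    return {
        "requests/sec": len(traces) / duration,
        "input_tokens/sec": sum(input_lens) / duration,
        "output_tokens/sec": sum(len(r.token_times) for r in traces) / duration,
    }


trace = RequestTrace(sent_at=0.0, token_times=[0.5, 0.75, 1.0, 1.25, 1.5])
print(ttft(trace), tpot(trace), latency(trace), token_latency(trace))
```

For this single trace, latency equals TTFT plus four inter-token gaps, so TPOT is smaller than token latency whenever the first token is slower than the rest.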
Effortless Benchmarking with Docker and Grafana
Setting up LLMServingPerfEvaluator is a breeze. Leverage Docker Compose for a streamlined experience and monitor live performance metrics with Grafana.
Benchmarking vLLM vs Friendli: A Step-by-Step Guide
Let's put theory into practice! This guide will walk you through benchmarking vLLM and Friendli using the meta-llama/Meta-Llama-3-8B-Instruct model on a single NVIDIA A100 GPU. If you want to measure the performance of a single specific LLM serving engine (e.g., Friendli Inference), you can refer to the readme in the LLMServingPerfEvaluator repository.
Prerequisites:
- Install Docker Compose: https://docs.docker.com/compose/install/
- A Basic Understanding of Docker and Docker Compose: https://docs.docker.com/compose/intro/features-uses/
- Friendli Container: https://suite.friendli.ai/team/ToHNtbdIeAQh/container/quickstart
- Hugging Face Access Token: https://huggingface.co/docs/hub/security-tokens
Step 0: Clone the repository
```bash
git clone https://github.com/friendliai/LLMServingPerfEvaluator.git
cd LLMServingPerfEvaluator
```
Step 1: Prepare your working directory
```bash
mkdir -p workspace/config/request_config
mkdir -p workspace/config/workload_config
```
Step 2-1. Set up the configuration files: workload_config.yaml
Let’s compare vLLM and Friendli with a synthetic workload. We use the workload configuration in the repository.
```bash
cp examples/workload_config/dummy.yaml \
  workspace/config/workload_config/
```
The workload configuration is as follows:
```yaml
# workspace/config/workload_config/dummy.yaml
type: dummy
dataset_config:
  vocab_size: 128000
  system_prompt_config:
    - name: '1'
      size: 10
  dataset:
    - input:
        - type: uniform
          min: 100
          max: 150
      output:
        - type: uniform
          min: 100
          max: 300
      system_prompt:
        - name: '1'
```
According to this configuration, each request's input is generated from integers from 0 to 127999 (i.e., vocab_size). The input length is uniformly sampled between 100 and 150 (i.e., input min/max), and the request uses a system prompt that is 10 integers long. The output length is uniformly sampled between 100 and 300 (i.e., output min/max). If you wish to modify the workload, please refer to the document on generating different workloads.
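As a rough illustration of what such a synthetic request could look like, here is a minimal Python sketch. It is not the tool's generator: the function `make_request`, the field names `input_ids` and `max_tokens`, the inclusive bounds, and the choice to prepend the system prompt on top of the sampled input length are all assumptions made for the example.

```python
import random

VOCAB_SIZE = 128000       # token IDs are drawn from 0..127999
SYSTEM_PROMPT_SIZE = 10   # length of the shared system prompt


def make_request(rng, system_prompt):
    """Build one synthetic request following the dummy workload settings."""
    input_len = rng.randint(100, 150)   # uniform, bounds assumed inclusive
    output_len = rng.randint(100, 300)  # uniform, bounds assumed inclusive
    body = [rng.randrange(VOCAB_SIZE) for _ in range(input_len)]
    # Assumption: the system prompt is prepended to the sampled input.
    return {"input_ids": system_prompt + body, "max_tokens": output_len}


rng = random.Random(0)
system_prompt = [rng.randrange(VOCAB_SIZE) for _ in range(SYSTEM_PROMPT_SIZE)]
request = make_request(rng, system_prompt)
print(len(request["input_ids"]), request["max_tokens"])
```

Reusing the same system prompt across requests mirrors the single named system prompt ('1') in the configuration.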
Step 2-2. Set up the configuration files: request_config.yaml
Because the request body format differs between engines, we use the request configuration files in the repository for vLLM and Friendli Inference.
```bash
cp examples/request_config/friendli.yaml \
  workspace/config/request_config/
cp examples/request_config/vllm.yaml workspace/config/request_config/
```
The request configurations are as follows:
```yaml
# workspace/config/request_config/vllm.yaml
stream: true
name: model
```

```yaml
# workspace/config/request_config/friendli.yaml
stream: true
```
Step 2-3. Set up the configuration files: grafana & prometheus
We use the grafana and prometheus configuration files in the repository. Please copy the grafana directory and prometheus directory in the repository to the workspace. (grafana/provisioning/dashboards/single_engine_dashboard.json does not have to be copied.)
```bash
cp -r grafana/ workspace/
cp -r prometheus/ workspace/
rm workspace/grafana/provisioning/dashboards/single_engine_dashboard.json
mv workspace/grafana/provisioning/compare_engines_dashboard.json workspace/grafana/provisioning/dashboards
```
Now, the directory structure should look like this:
```bash
tree workspace/ -a
workspace/
├── config
│   ├── request_config
│   │   ├── friendli.yaml
│   │   └── vllm.yaml
│   └── workload_config
│       └── dummy.yaml
├── grafana
│   └── provisioning
│       ├── dashboards
│       │   ├── compare_engines_dashboard.json
│       │   └── dashboard.yaml
│       └── datasources
│           └── datasource.yaml
└── prometheus
    └── config
        └── prometheus.yml
```
Step 3. Set up the docker-compose.yml and the .env file (or environment variables)
Copy the examples/docker_compose/compare-docker-compose.yml file and examples/docker_compose/.compare_env file in the repository to the workspace directory.
```bash
cp examples/docker_compose/compare-docker-compose.yml \
  workspace/docker-compose.yml
cp examples/docker_compose/.compare_env workspace/.env
```
When you open the workspace/.env file, there are environment variables for the experiment. The environment variables within the curly braces should be replaced with the actual values.
- First, HF_HUB_CACHE and HF_TOKEN should be filled out. For more details on the Hugging Face environment variables, please refer to this document.
- Second, FRIENDLI_CONTAINER_REPO, FRIENDLI_CONTAINER_TAG, and FRIENDLI_CONTAINER_SECRET should be filled out. Before starting the tutorial, you should obtain permission to use the Friendli Inference trial docker image with FRIENDLI_CONTAINER_SECRET.
- Third, UID should be filled out. Find your UID on the host machine with the following command:
```bash
id -u
```
After fixing the .env file, it should look like:
```.env
# workspace/.env

# Experiment Environment Variables
HF_MODEL_NAME=meta-llama/Meta-Llama-3-8B-Instruct # The model repo id of the Hugging Face model
REQUEST_RATES=1,3,5,7,9,11 # The average number of requests per second
DURATION=300 # The duration of the experiment in seconds
TIMEOUT=450 # The timeout of the experiment in seconds
CONFIG_DIR=./config # The directory path to the configuration files: workload_config.yaml, request_config.yaml
RESULT_DIR=./result # The directory path to save the end-to-end metrics

# Experiment Environment Variables for the Hugging Face model
# HF_HUB_CACHE={HF_CACHE_DIR}
# HF_TOKEN={HF_API_TOKEN}

# user id by `id -u` in your host machine
UID={UID}

# Friendli Inference Environment Variables
FRIENDLI_CUDA_VISIBLE_DEVICES=0
FRIENDLI_NUM_DEVICES=1
FRIENDLI_CONTAINER_REPO=registry.friendli.ai/trial
```
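Since values in curly braces must be replaced before running, a small check can help catch placeholders that were missed. The following Python sketch is only a convenience under stated assumptions: the helper `unfilled_variables` is hypothetical and not part of the repository, and it only recognizes active `NAME={...}` lines.

```python
import re

# Matches active lines of the form NAME={PLACEHOLDER}; commented lines
# start with '#' and therefore never match.
PLACEHOLDER = re.compile(r"^\s*([A-Za-z_][A-Za-z0-9_]*)\s*=\s*\{[^}]*\}")


def unfilled_variables(env_text):
    """Return the names of variables whose value is still a {placeholder}."""
    names = []
    for line in env_text.splitlines():
        match = PLACEHOLDER.match(line)
        if match:
            names.append(match.group(1))
    return names


sample = (
    "HF_MODEL_NAME=meta-llama/Meta-Llama-3-8B-Instruct\n"
    "# HF_TOKEN={HF_API_TOKEN}\n"
    "UID={UID}\n"
)
print(unfilled_variables(sample))  # ['UID']
```

Note that commented-out lines are skipped by this check, so variables such as HF_HUB_CACHE and HF_TOKEN still need to be uncommented and filled in by hand.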
Now, the directory structure should look like this:
```bash
tree workspace/ -a
workspace/
├── .env
├── config
│   ├── request_config
│   │   ├── friendli.yaml
│   │   └── vllm.yaml
│   └── workload_config
│       └── dummy.yaml
├── docker-compose.yml
├── grafana
│   └── provisioning
│       ├── dashboards
│       │   ├── compare_engines_dashboard.json
│       │   └── dashboard.yaml
│       └── datasources
│           └── datasource.yaml
└── prometheus
    └── config
        └── prometheus.yml
```
You can check whether your docker-compose.yml is correctly filled out with the following command in the workspace directory:
```bash
cd workspace
docker compose config
```
Step 4. Execute the performance evaluation
In the workspace directory, run the performance evaluation with the following command:
```bash
cd workspace
docker compose up -d
```
Step 5. Monitor the live performance
Access Grafana to monitor the live performance of your evaluation at http://localhost:3000.
The default username and password are both admin.
After the experiments, you can still access the experiment performance metrics in a .csv file in the workspace/result directory.
Step 6. Cleanup
Clean up the Docker Compose services after the experiments:
```bash
docker compose down
```
Using a Different Hugging Face Model or a Quantized Model
To use a different Hugging Face model or a quantized model, you need to modify the .env file and the docker-compose.yml file.
If you wish to use:
- FP16, BF16, or FP32 versions of a model: Set HF_MODEL_NAME in the .env file to the repository name of the model on the Hugging Face hub.
- FP8 versions of a model: Set the repository names of the model supported by Friendli Inference and vLLM as separate environment variables and modify the docker-compose.yml file.
  - For Friendli Inference, use the repository of the FP8 version of the model (e.g., FriendliAI/Meta-Llama-3-8B-fp8).
  - For vLLM, use the repository of the non-quantized model (e.g., meta-llama/Meta-Llama-3-8B) and add --quantization fp8 to the command of the vllm-engine service in the docker-compose.yml file.
- AWQ-ed models: Set HF_MODEL_NAME in the .env file to the repository name of the AWQ-applied model on the Hugging Face hub (e.g., TheBloke/Llama-2-13B-AWQ). For vLLM, add --quantization awq to the command of the vllm-engine service in the docker-compose.yml file.
- MoE models: Set HF_MODEL_NAME in the .env file to the repository name of the MoE model on the Hugging Face hub.
For Friendli Inference, we recommend using the policy search feature to optimize the inference performance of AWQ-ed models, MoE models, or FP8 versions of models, and setting POLICY_DIR in the .env file to the directory path of the searched policy file. For more details on policy search, you can refer to this document.
Understanding Your Benchmark Results
The results from LLMServingPerfEvaluator provide valuable insights into your LLM serving engine's performance. Normally, throughput and latency should be analyzed together: set a performance level objective (on throughput or on latency) and make sure that both metrics satisfy your criteria. Here's how to interpret the key metrics:
Throughput:
- Measured in processed requests per second, input tokens per second, and output tokens per second.
- Ideal Scenario: Throughput increases proportionally with the request_rate. This indicates the engine is handling requests efficiently.
- Watch Out For: If the throughput plateaus or even dips at higher request rates, it suggests that the engine is overloaded.
- In our experiment above, the throughput of vLLM does not increase at request rate 13, meaning that vLLM is unable to keep up with the request rate, while Friendli can handle requests efficiently.
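One simple way to spot a throughput plateau is to compare the achieved request throughput against the offered request rate. The Python sketch below is an illustration only: the function `saturated_rates`, the 5% tolerance, and the sample numbers are hypothetical and are not taken from the experiment in this post.

```python
def saturated_rates(results, tolerance=0.05):
    """Return the request rates whose achieved throughput falls short.

    results:   list of (request_rate, achieved_requests_per_sec) pairs
    tolerance: allowed relative shortfall before a rate counts as saturated
    """
    return [
        rate
        for rate, achieved in results
        if achieved < rate * (1 - tolerance)
    ]


# Illustrative values only, not measurements.
measurements = [(2, 2.0), (4, 3.9), (6, 5.9), (8, 6.4)]
print(saturated_rates(measurements))  # [8]
```

A small tolerance is useful because Poisson arrivals make the achieved rate fluctuate slightly around the offered rate even when the engine keeps up.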
Latency:
- Measures the total time to process a request.
- Ideal Scenario: Latency remains relatively stable across different request_rates. This indicates the engine is processing requests promptly for the different rates.
- Watch Out For: A significant increase in latency signifies that the engine is struggling to keep up with the workload. If latency continues to increase over time under a fixed request rate, requests are piling up in the queue (i.e., the queueing delay is growing), which is clear evidence that the engine is overloaded and unable to sustain that request rate.
- In our experiment above, while the throughput of each engine follows the given request rates (i.e., 7, 9, 11, 13), the latency of vLLM increases as we increase the request rate, meaning that vLLM is struggling to keep up with the request rates of 11 and 13.
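To check whether latency keeps growing under a fixed request rate, one option is to fit a straight line to latency over time and look at its slope. This is a minimal Python sketch under assumptions: the function `latency_slope` is hypothetical, the samples are illustrative rather than measured, and an ordinary least-squares fit is just one reasonable choice of trend estimate.

```python
def latency_slope(samples):
    """Least-squares slope of latency over request start time.

    samples: list of (start_time_sec, latency_sec) pairs with at least
             two distinct start times.
    A slope near zero suggests stable latency; a clearly positive slope
    suggests requests are queueing up.
    """
    n = len(samples)
    mean_t = sum(t for t, _ in samples) / n
    mean_l = sum(l for _, l in samples) / n
    cov = sum((t - mean_t) * (l - mean_l) for t, l in samples)
    var = sum((t - mean_t) ** 2 for t, _ in samples)
    return cov / var


# Illustrative values only, not measurements.
stable = [(0.0, 2.0), (10.0, 2.0), (20.0, 2.0)]
growing = [(0.0, 2.0), (10.0, 3.0), (20.0, 4.0)]
print(latency_slope(stable), latency_slope(growing))
```

In the second series, latency grows by about 0.1 seconds for every second of experiment time, the kind of steady climb that points to a growing queue.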
Time to First Token (TTFT):
- Measures the time taken to generate the first response token after receiving a request.
- Ideal Scenario: TTFT remains low and stable across request rates.
- Watch Out For: An increase in TTFT indicates queuing delays, suggesting that the engine might be overloaded. This indicates that the engine is struggling to properly handle the batch of requests. Similar to the latency, if it continues to increase, it means that the requests are piling up in the queue, indicating an overloaded engine.
Time per Output Token (TPOT):
- Measures the average time taken to generate each subsequent token after the first one.
- Ideal Scenario: TPOT remains relatively constant across request rates. This suggests efficient token generation.
- Watch Out For: An increase in TPOT indicates the engine is taking longer to generate each token as the batch size increases. This metric does not peak as much as latency and TTFT.
Leveraging Visualization Tools:
- Use Grafana to monitor live performance metrics and observe trends.
- Analyze the final results through the CSV file (located in workspace/result) for a comprehensive picture of engine performance. For example, you could generate a latency-throughput graph to analyze the two performance metrics concurrently.
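As a starting point for analyzing latency and throughput together, the following Python sketch picks the request rate with the highest throughput that still meets a latency target. The column names (`request_rate`, `throughput`, `latency`), the sample rows, and the function `best_rate_under_latency` are all hypothetical; check the header of your actual result file and adapt the names accordingly.

```python
import csv
import io


def best_rate_under_latency(csv_text, max_latency):
    """Return the request rate with the highest throughput whose latency
    stays within max_latency, or None if no row qualifies."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    within_target = [r for r in rows if float(r["latency"]) <= max_latency]
    if not within_target:
        return None
    best = max(within_target, key=lambda r: float(r["throughput"]))
    return float(best["request_rate"])


# Illustrative data with assumed column names, not real results.
sample_csv = """request_rate,throughput,latency
2,2.0,1.5
4,3.9,1.8
6,5.9,2.6
8,6.4,9.0
"""
print(best_rate_under_latency(sample_csv, max_latency=3.0))  # 6.0
```

The same rows could be plotted as a latency-throughput curve; the point where latency rises sharply while throughput flattens marks the load the engine can sustain.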
By combining these metrics and visualizations, you can identify the optimal request rate for your engine and understand its limitations under a certain load pressure. This information is crucial for optimizing your LLM serving infrastructure and ensuring that it meets your application demands.
Written by
FriendliAI Tech & Research