- May 22, 2024
Measuring LLM Serving Performance with LLMServingPerfEvaluator
Tired of Fixing LLM Serving Benchmarks?
Existing tools such as Locust and llmperf often struggle to generate appropriate workloads or report rich performance metrics, and they are limited to testing LLM inference endpoints. This makes it difficult to accurately assess the actual capabilities of your LLM serving engine. To overcome these shortcomings, the FriendliAI team designed LLMServingPerfEvaluator to provide a way to correctly benchmark LLM serving performance.
Introducing LLMServingPerfEvaluator: An All-in-One Benchmarking Tool
LLMServingPerfEvaluator is an open-source tool designed by the FriendliAI team specifically for benchmarking LLM serving engines. It overcomes the limitations of existing solutions by offering:
- Realistic Workload Generation: LLMServingPerfEvaluator lets you easily define workloads that mimic real-world usage patterns. It simulates text generation requests arriving at the serving engine according to a Poisson distribution, allowing you to stress test the engine with varying request rates (λ). Support for controlling the number of concurrent requests (i.e., a concurrency mode alongside the request rate mode) is planned.
- Customizable Request Inputs: LLMServingPerfEvaluator can craft input-output pairs to match your specific use case. You can either synthesize text data with request lengths drawn from normal or uniform distributions, or leverage pre-existing datasets from Hugging Face.
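To make the request rate mode concrete, here is a minimal sketch of Poisson arrivals. It relies on the standard fact that gaps between arrivals in a Poisson process with rate λ are exponentially distributed with mean 1/λ. The function name `poisson_arrival_times` and its parameters are hypothetical and chosen for illustration; this is not LLMServingPerfEvaluator's actual implementation.

```python
import random


def poisson_arrival_times(rate, duration, seed=None):
    """Return request send times (in seconds) within [0, duration).

    Hypothetical helper: inter-arrival gaps are drawn from an exponential
    distribution with mean 1/rate, which yields Poisson arrivals.
    """
    rng = random.Random(seed)
    times = []
    t = rng.expovariate(rate)
    while t < duration:
        times.append(t)
        t += rng.expovariate(rate)
    return times


# Example: about 5 requests/sec for 300 seconds.
arrivals = poisson_arrival_times(rate=5.0, duration=300.0, seed=0)
print(len(arrivals))  # on average close to rate * duration = 1500
```

The actual request count varies from run to run around rate × duration, which is why a fixed request rate still produces bursts and lulls similar to real traffic.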
Measure What Matters with Key Performance Metrics
LLMServingPerfEvaluator provides various metrics by default to accurately measure the performance of your LLM serving engine. Concretely, it provides in-depth insights into your engine's performance with:
- Throughput (requests/sec, input_tokens/sec, output_tokens/sec): Gauges the engine's overall end-to-end capacity by measuring the number of requests, input tokens, and output tokens it can handle per second.
- TTFT (time to first token): Measures the time taken to produce the very first response token after receiving a request. This includes any queuing delays.
- TPOT (time per output token): Analyzes the efficiency of token generation by calculating the average time taken to generate each subsequent token after the first one (excluding TTFT). Also known as inter-token latency.
- Latency (end-to-end processing time per request): Provides the complete picture by measuring the total time taken to process a request, covering both TTFT and the time spent generating all subsequent tokens.
- Token Latency (the average processing time per output token): Measures the average time taken to generate each output token, including the first token. This value is obtained by dividing the latency by the total number of output tokens.
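The relationships between these metrics can be sketched from per-token timestamps. The code below is an illustration under assumptions: `RequestTrace`, `request_metrics`, and `output_token_throughput` are hypothetical names, and the tool's own bookkeeping may differ.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class RequestTrace:
    sent_at: float            # time the request was sent (seconds)
    token_times: List[float]  # arrival time of each output token (seconds)


def request_metrics(trace):
    """Compute per-request metrics following the definitions above."""
    n = len(trace.token_times)
    ttft = trace.token_times[0] - trace.sent_at
    latency = trace.token_times[-1] - trace.sent_at
    # TPOT averages the gaps after the first token, so TTFT is excluded.
    tpot = (latency - ttft) / (n - 1) if n > 1 else 0.0
    # Token latency spreads the full latency over every output token.
    token_latency = latency / n
    return {"ttft": ttft, "tpot": tpot,
            "latency": latency, "token_latency": token_latency}


def output_token_throughput(traces):
    """Output tokens per second over the whole run (first send to last token)."""
    start = min(t.sent_at for t in traces)
    end = max(t.token_times[-1] for t in traces)
    total_tokens = sum(len(t.token_times) for t in traces)
    return total_tokens / (end - start)


trace = RequestTrace(sent_at=0.0, token_times=[0.5, 0.6, 0.7, 0.8, 0.9])
metrics = request_metrics(trace)
print(metrics)
print(output_token_throughput([trace]))
```

In this example the first token takes 0.5 s and the remaining four arrive 0.1 s apart, so token latency (0.9 s / 5 tokens) is noticeably larger than TPOT because it also absorbs the TTFT.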
Effortless Benchmarking with Docker and Grafana
Setting up LLMServingPerfEvaluator is a breeze. Leverage Docker Compose for a streamlined experience and monitor live performance metrics with Grafana.
Benchmarking vLLM vs Friendli: A Step-by-Step Guide
Let's put theory into practice! This guide walks you through benchmarking vLLM and Friendli using the meta-llama/Meta-Llama-3-8B-Instruct model on a single NVIDIA A100 GPU. If you want to measure the performance of a single LLM serving engine (e.g., Friendli Engine), refer to the README in the LLMServingPerfEvaluator repository.
Prerequisites:
- Install Docker Compose: https://docs.docker.com/compose/install/
- A Basic Understanding of Docker and Docker Compose: https://docs.docker.com/compose/intro/features-uses/
- Friendli Container: https://suite.friendli.ai/team/ToHNtbdIeAQh/container/quickstart
- Hugging Face Access Token: https://huggingface.co/docs/hub/security-tokens
Step 0: Clone the repository
```bash
git clone https://github.com/friendliai/LLMServingPerfEvaluator.git
cd LLMServingPerfEvaluator
```
Step 1: Prepare your working directory
```bash
mkdir -p workspace/config/request_config
mkdir -p workspace/config/workload_config
```
Step 2-1. Set up the configuration files: workload_config.yaml
Let’s compare vLLM and Friendli with a synthetic workload, using the workload configuration provided in the repository.
```bash
cp examples/workload_config/dummy.yaml \
  workspace/config/workload_config/
```
The workload configuration is as follows:
```yaml
# workspace/config/workload_config/dummy.yaml
type: dummy
dataset_config:
  vocab_size: 128000
  system_prompt_config:
    - name: '1'
      size: 10
  dataset:
    - input:
        - type: uniform
          min: 100
          max: 150
      output:
        - type: uniform
          min: 100
          max: 300
      system_prompt:
        - name: '1'
```
According to this configuration, each request's input is built from integers from 0 to 127999 (i.e., below vocab_size), with the input length uniformly sampled between 100 and 150 (i.e., input min/max) and a system prompt of 10 integers. The output length is uniformly sampled between 100 and 300 (i.e., output min/max). If you wish to modify the workload, please refer to the document on generating different workloads.
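As a rough illustration of what this configuration describes, the sketch below builds one synthetic request. The function `synthesize_request` and its field names are hypothetical. Two details are assumptions rather than documented behavior: that the min/max bounds are inclusive, and that the 10-token system prompt is prepended in addition to the sampled input length.

```python
import random


def synthesize_request(rng, system_prompt, vocab_size=128000,
                       input_range=(100, 150), output_range=(100, 300)):
    """Build one synthetic request as lists of integer token ids.

    Assumption: bounds are inclusive and the system prompt is added on top
    of the sampled input length; the real tool may count it differently.
    """
    input_len = rng.randint(*input_range)    # uniform, inclusive bounds
    output_len = rng.randint(*output_range)
    body = [rng.randrange(vocab_size) for _ in range(input_len)]
    return {
        "input_tokens": list(system_prompt) + body,
        "max_output_tokens": output_len,
    }


rng = random.Random(0)
# One shared system prompt of size 10, reused by every request that names it.
system_prompt = [rng.randrange(128000) for _ in range(10)]
request = synthesize_request(rng, system_prompt)
print(len(request["input_tokens"]), request["max_output_tokens"])
```

Because the system prompt is shared across requests, every synthetic input starts with the same 10 token ids.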
Step 2-2. Set up the configuration files: request_config.yaml
As the request body format differs according to each engine, we will use request configuration files in the repository for vLLM and the Friendli Engine.
```bash
cp examples/request_config/friendli.yaml \
  workspace/config/request_config/
cp examples/request_config/vllm.yaml workspace/config/request_config/
```
The request configurations are as follows:
```yaml
# workspace/config/request_config/vllm.yaml
stream: true
name: model
```

```yaml
# workspace/config/request_config/friendli.yaml
stream: true
```
Step 2-3. Set up the configuration files: grafana & prometheus
We use the grafana and prometheus configuration files in the repository. Please copy the grafana directory and the prometheus directory in the repository to the workspace directory. (grafana/provisioning/dashboards/single_engine_dashboard.json does not have to be copied.)
```bash
cp -r grafana/ workspace/
cp -r prometheus/ workspace/
rm workspace/grafana/provisioning/dashboards/single_engine_dashboard.json
mv workspace/grafana/provisioning/compare_engines_dashboard.json workspace/grafana/provisioning/dashboards
```
Now, the directory structure should look like this:
```bash
tree workspace/ -a
workspace/
├── config
│   ├── request_config
│   │   ├── friendli.yaml
│   │   └── vllm.yaml
│   └── workload_config
│       └── dummy.yaml
├── grafana
│   └── provisioning
│       ├── dashboards
│       │   ├── compare_engines_dashboard.json
│       │   └── datshboard.yaml
│       └── datasources
│           └── datasource.yaml
└── prometheus
    └── config
        └── prometheus.yml
```
Step 3. Set up the docker-compose.yml and the .env file (or environment variables)
Copy the examples/docker_compose/compare-docker-compose.yml file and the examples/docker_compose/.compare_env file in the repository to the workspace directory.
```bash
cp examples/docker_compose/compare-docker-compose.yml \
  workspace/docker-compose.yml
cp examples/docker_compose/.compare_env workspace/.env
```
The workspace/.env file contains the environment variables for the experiment. Replace each value in curly braces with the actual value.
- First, HF_HUB_CACHE and HF_TOKEN should be filled out. For more details on the Hugging Face environment variables, please refer to this document.
- Second, FRIENDLI_CONTAINER_REPO, FRIENDLI_CONTAINER_TAG, and FRIENDLI_CONTAINER_SECRET should be filled out. Before starting the tutorial, you should obtain permission to use the Friendli Engine trial Docker image with FRIENDLI_CONTAINER_SECRET.
- Third, UID should be filled out. Find your UID on the host machine with the following command:
```bash
id -u
```
After filling out the .env file, it should look like:
```env
# workspace/.env

# Experiment Environment Variables
HF_MODEL_NAME=meta-llama/Meta-Llama-3-8B-Instruct # The model repo id of the Hugging Face model
REQUEST_RATES=1,3,5,7,9,11 # The average number of requests per second
DURATION=300 # The duration of the experiment in seconds
TIMEOUT=450 # The timeout of the experiment in seconds
CONFIG_DIR=./config # The directory path to the configuration files: workload_config.yaml, request_config.yaml
RESULT_DIR=./result # The directory path to save the end-to-end metrics

# Experiment Environment Variables for the Hugging Face model
# HF_HUB_CACHE={HF_CACHE_DIR}
# HF_TOKEN={HF_API_TOKEN}

# user id by `id -u` in your host machine
UID={UID}

# Friendli Engine Environment Variables
FRIENDLI_CUDA_VISIBLE_DEVICES=0
FRIENDLI_NUM_DEVICES=1
FRIENDLI_CONTAINER_REPO=registry.friendli.ai/trial
FRIENDLI_CONTAINER_TAG=latest
FRIENDLI_CONTAINER_SECRET={FRIENDLI_CONTAINER_SECERET}

# vllm engine
VLLM_CUDA_VISIBLE_DEVICES=1
VLLM_NUM_DEVICES=1
VLLM_REPO=vllm/vllm-openai
VLLM_TAG=v0.4.2
```
Now, the directory structure should look like this:
```bash
tree workspace/ -a
workspace/
├── .env
├── config
│   ├── request_config
│   │   ├── friendli.yaml
│   │   └── vllm.yaml
│   └── workload_config
│       └── dummy.yaml
├── docker-compose.yml
├── grafana
│   └── provisioning
│       ├── dashboards
│       │   ├── compare_engines_dashboard.json
│       │   └── datshboard.yaml
│       └── datasources
│           └── datasource.yaml
└── prometheus
    └── config
        └── prometheus.yml
```
You can check whether your docker-compose.yml is correctly filled out by running the following command in the workspace directory:
```bash
cd workspace
docker compose config
```
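As a complementary check, a few lines of Python can flag variables in the .env file that still hold a curly-brace placeholder. This is only a sketch: the helper `unfilled_variables` is hypothetical and not part of the repository, and it assumes placeholders always take the form `NAME={SOMETHING}` on an uncommented line.

```python
import re

# Matches uncommented lines such as "UID={UID}" and captures the variable name.
PLACEHOLDER = re.compile(r"^\s*([A-Za-z_][A-Za-z0-9_]*)\s*=\s*\{[^}]*\}")


def unfilled_variables(env_text):
    """Return names of variables whose value is still a {PLACEHOLDER}."""
    names = []
    for line in env_text.splitlines():
        match = PLACEHOLDER.match(line)
        if match:
            names.append(match.group(1))
    return names


sample = """\
HF_MODEL_NAME=meta-llama/Meta-Llama-3-8B-Instruct
# HF_TOKEN={HF_API_TOKEN}
UID={UID}
FRIENDLI_CONTAINER_TAG=latest
"""
print(unfilled_variables(sample))  # prints ['UID']
```

To check the real file, pass the contents of workspace/.env to the function. Commented-out lines are skipped, so variables that are still commented out need to be reviewed by hand.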
Step 4. Execute the performance evaluation
From the workspace directory, run the performance evaluation with the following command:
```bash
cd workspace
docker compose up -d
```
Step 5. Monitor the live performance
Access Grafana to monitor the live performance of your evaluation at http://localhost:3000. The default username and password are both admin.