Measuring LLM Serving Performance with LLMServingPerfEvaluator


Tired of Fixing LLM Serving Benchmarks?

Existing tools like Locust and llmperf often struggle to generate realistic workloads, report only a limited set of performance metrics, and are restricted to testing LLM inference endpoints. This makes it difficult to accurately assess the actual capabilities of your LLM serving engine. To overcome these shortcomings, the FriendliAI team has designed LLMServingPerfEvaluator to provide a way to correctly benchmark LLM serving performance.

Introducing LLMServingPerfEvaluator: An All-in-One Benchmarking Tool

LLMServingPerfEvaluator is an open-source tool designed by the FriendliAI team specifically for benchmarking LLM serving engines. It overcomes the limitations of existing solutions by offering:

  • Realistic Workload Generation: LLMServingPerfEvaluator lets you define workloads that mimic real-world usage patterns. It simulates text generation requests arriving at the serving engine according to a Poisson distribution, allowing you to stress-test the engine with varying request rates (λ). Support for controlling the number of concurrent requests (i.e., concurrency mode vs. request-rate mode) is coming soon.
  • Customizable Request Inputs: LLMServingPerfEvaluator can craft input-output pairs to match your specific use case. You can either synthesize text data with desired request lengths using normal or uniform distributions, or leverage pre-existing datasets from Hugging Face.
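To illustrate the request-rate mode, Poisson arrivals can be simulated by drawing exponentially distributed inter-arrival gaps. The following is a minimal sketch of the idea, not the tool's actual implementation; the function name is illustrative:

```python
import random

def poisson_arrival_times(rate, duration, seed=42):
    """Generate request arrival times (in seconds) for a Poisson process.

    Inter-arrival gaps of a Poisson process with rate `rate` (requests/sec)
    are exponentially distributed with mean 1/rate.
    """
    rng = random.Random(seed)
    times = []
    t = rng.expovariate(rate)
    while t < duration:
        times.append(t)
        t += rng.expovariate(rate)
    return times

# With rate=5 req/s over a 300 s run, we expect roughly 1500 arrivals.
arrivals = poisson_arrival_times(rate=5, duration=300)
```

At a request rate λ, the engine receives on average λ requests per second, so a 300-second experiment at λ = 5 issues about 1,500 requests in total.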

Measure What Matters with Key Performance Metrics

LLMServingPerfEvaluator provides various metrics by default to accurately measure the performance of your LLM serving engine. Concretely, it provides in-depth insights into your engine's performance with:

  • Throughput (requests/sec, input_tokens/sec, output_tokens/sec): Gauges the engine's overall end-to-end capacity by measuring the number of requests, input tokens, and output tokens it can handle per second.
  • TTFT (time to first token): Measures the time taken to produce the very first response token after receiving a request. This includes any queuing delays.
  • TPOT (time per output token): Analyzes the efficiency of token generation by calculating the average time taken to generate each subsequent token after the first one (excluding TTFT). Also known as inter-token latency.
  • Latency (end-to-end processing time per request): Provides the complete picture by measuring the total time taken to process a request, encompassing both TTFT and TPOTs.
  • Token Latency (the average processing time per output token): Measures the average time taken to generate each output token, including the first. This value is obtained by dividing the latency by the total number of output tokens.
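Given per-token timestamps for one streamed response, these metrics relate to each other as sketched below. This is an illustration of the definitions above, not the tool's API; the function and variable names are hypothetical:

```python
def per_request_metrics(request_sent, token_times):
    """Compute TTFT, TPOT, latency, and token latency from timestamps.

    request_sent : time the request was sent (seconds)
    token_times  : arrival time of each streamed output token (seconds)
    """
    n = len(token_times)
    ttft = token_times[0] - request_sent        # time to first token (incl. queuing)
    latency = token_times[-1] - request_sent    # end-to-end latency
    # TPOT averages the gaps after the first token, i.e., it excludes TTFT.
    tpot = (latency - ttft) / (n - 1) if n > 1 else 0.0
    token_latency = latency / n                 # latency / total output tokens
    return {"ttft": ttft, "tpot": tpot,
            "latency": latency, "token_latency": token_latency}

# First token at 0.5 s, then one token every 0.1 s:
m = per_request_metrics(0.0, [0.5, 0.6, 0.7, 0.8])
# TTFT 0.5 s, TPOT 0.1 s, latency 0.8 s, token latency 0.2 s
```

Note how token latency (0.2 s) sits between TTFT and TPOT: it amortizes the first-token delay across all output tokens.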

Effortless Benchmarking with Docker and Grafana

Setting up LLMServingPerfEvaluator is a breeze. Leverage Docker Compose for a streamlined experience and monitor live performance metrics with Grafana.

Benchmarking vLLM vs Friendli: A Step-by-Step Guide

Let's put theory into practice! This guide will walk you through benchmarking vLLM and Friendli using the meta-llama/Meta-Llama-3-8B-Instruct model on a single NVIDIA A100 GPU. If you want to measure the performance of a single LLM serving engine (e.g., Friendli Engine), refer to the README in the LLMServingPerfEvaluator repository.

Prerequisites:

Step 0: Clone the repository

bash
git clone https://github.com/friendliai/LLMServingPerfEvaluator.git
cd LLMServingPerfEvaluator

Step 1: Prepare your working directory

bash
mkdir -p workspace/config/request_config
mkdir -p workspace/config/workload_config

Step 2-1. Set up the configuration files: workload_config.yaml

Let’s compare vLLM and Friendli with the synthetic workload. We use the workload configuration in the repository.

bash
cp examples/workload_config/dummy.yaml \
workspace/config/workload_config/

The workload configuration is as follows:

yaml
# workspace/config/workload_config/dummy.yaml
type: dummy
dataset_config:
  vocab_size: 128000
  system_prompt_config:
    - name: '1'
      size: 10
  dataset:
    - input:
        - type: uniform
          min: 100
          max: 150
      output:
        - type: uniform
          min: 100
          max: 300
      system_prompt:
        - name: '1'

According to this configuration, each request's input is synthesized from integer token ids between 0 and 127999 (i.e., vocab_size), with the input length uniformly sampled between 100 and 150 (input min/max) and a system prompt of 10 integers prepended. The output length is uniformly sampled between 100 and 300 (output min/max). If you wish to modify the workload, please refer to the document on generating different workloads.
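The sampling described above can be sketched as follows. This is only an illustration of the dummy workload's semantics under the configuration shown; the actual generator lives in the repository, and the function name and request dict shape are assumptions:

```python
import random

rng = random.Random(0)
VOCAB_SIZE = 128_000  # matches vocab_size in dummy.yaml

def make_dummy_request():
    # System prompt: 10 integer token ids (system_prompt_config size: 10).
    system_prompt = [rng.randrange(VOCAB_SIZE) for _ in range(10)]
    # Input length uniform in [100, 150]; token ids uniform in [0, 127999].
    input_len = rng.randint(100, 150)
    input_ids = [rng.randrange(VOCAB_SIZE) for _ in range(input_len)]
    # Requested output length uniform in [100, 300].
    output_len = rng.randint(100, 300)
    return {"prompt": system_prompt + input_ids, "max_tokens": output_len}

req = make_dummy_request()  # prompt of 110-160 ids, max_tokens in [100, 300]
```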

Step 2-2. Set up the configuration files: request_config.yaml

Since the request body format differs between engines, we will use the request configuration files from the repository for vLLM and the Friendli Engine.

bash
cp examples/request_config/friendli.yaml \
workspace/config/request_config/
cp examples/request_config/vllm.yaml workspace/config/request_config/

The request configurations are as follows:

yaml
# workspace/config/request_config/vllm.yaml
stream: true
name: model
yaml
# workspace/config/request_config/friendli.yaml
stream: true

Step 2-3. Set up the configuration files: grafana & prometheus

We use the Grafana and Prometheus configuration files from the repository. Copy the grafana and prometheus directories into the workspace; grafana/provisioning/dashboards/single_engine_dashboard.json is not needed for this comparison, so we remove it.

bash
cp -r grafana/ workspace/
cp -r prometheus/ workspace/
rm workspace/grafana/provisioning/dashboards/single_engine_dashboard.json
mv workspace/grafana/provisioning/compare_engines_dashboard.json workspace/grafana/provisioning/dashboards

Now, the directory structure should look like this:

bash
tree workspace/ -a

workspace/
├── config
│   ├── request_config
│   │   ├── friendli.yaml
│   │   └── vllm.yaml
│   └── workload_config
│       └── dummy.yaml
├── grafana
│   └── provisioning
│       ├── dashboards
│       │   ├── compare_engines_dashboard.json
│       │   └── datshboard.yaml
│       └── datasources
│           └── datasource.yaml
└── prometheus
    └── config
        └── prometheus.yml

Step 3. Set up the docker-compose.yml and the .env file (or environment variables)

Copy the examples/docker_compose/compare-docker-compose.yml file and examples/docker_compose/.compare_env file in the repository to the workspace directory.

bash
cp examples/docker_compose/compare-docker-compose.yml \
workspace/docker-compose.yml
cp examples/docker_compose/.compare_env workspace/.env

When you open the workspace/.env file, there are environment variables for the experiment. The environment variables within the curly braces should be replaced with the actual values.

  • First, HF_HUB_CACHE and HF_TOKEN should be filled out. For more details on the Hugging Face environment variables, please refer to this document.
  • Second, FRIENDLI_CONTAINER_REPO, FRIENDLI_CONTAINER_TAG, and FRIENDLI_CONTAINER_SECRET should be filled out. Before starting the tutorial, you should obtain permission to use the Friendli Engine trial Docker image along with your FRIENDLI_CONTAINER_SECRET.
  • Third, UID should be filled out. Find your UID on the host machine with the following command:
bash
id -u

After filling in the .env file, it should look like this:

.env
# workspace/.env
# Experiment Environment Variables
HF_MODEL_NAME=meta-llama/Meta-Llama-3-8B-Instruct # The model repo id of the Hugging Face model
REQUEST_RATES=1,3,5,7 # The average number of requests per second
DURATION=300 # The duration of the experiment in seconds
TIMEOUT=450 # The timeout of the experiment in seconds
CONFIG_DIR=./config # The directory path to the configuration files: workload_config.yaml, request_config.yaml
RESULT_DIR=./result # The directory path to save the end-to-end metrics

# Experiment Environment Variables for the Hugging Face model
# HF_HUB_CACHE={HF_CACHE_DIR}
# HF_TOKEN={HF_API_TOKEN}

# user id by `id -u` in your host machine
UID={UID}

# Friendli Engine Environment Variables
FRIENDLI_CUDA_VISIBLE_DEVICES=0
FRIENDLI_NUM_DEVICES=1
FRIENDLI_CONTAINER_REPO=registry.friendli.ai/trial
FRIENDLI_CONTAINER_TAG=latest
FRIENDLI_CONTAINER_SECRET={FRIENDLI_CONTAINER_SECRET}

# vllm engine
VLLM_CUDA_VISIBLE_DEVICES=1
VLLM_NUM_DEVICES=1
VLLM_REPO=vllm
VLLM_TAG=v0.4.2

Now, the directory structure should look like this:

bash
tree workspace/ -a

workspace/
├── .env
├── config
│   ├── request_config
│   │   ├── friendli.yaml
│   │   └── vllm.yaml
│   └── workload_config
│       └── dummy.yaml
├── docker-compose.yml
├── grafana
│   └── provisioning
│       ├── dashboards
│       │   ├── compare_engines_dashboard.json
│       │   └── datshboard.yaml
│       └── datasources
│           └── datasource.yaml
└── prometheus
    └── config
        └── prometheus.yml

You can check whether your docker-compose.yml is correctly filled out by running the following command in the workspace directory:

bash
cd workspace
docker compose config

Step 4. Execute the performance evaluation

From the workspace directory, run the performance evaluation with the following command:

bash
cd workspace
docker compose up -d

Step 5. Monitor the live performance

Access Grafana at http://localhost:3000 to monitor the live performance of your evaluation. The default username and password are both admin.