This guide walks you through running a gRPC inference server with Friendli Container and interacting with it through the friendli-client SDK.

Prerequisites

Install friendli-client to use the gRPC client SDK:

pip install friendli-client

Ensure you have the friendli-client SDK version 1.4.1 or higher installed.

Starting the Friendli Container with gRPC

You can run the Friendli Container with a gRPC server for completions by adding the --grpc true option to the command arguments. The server supports response-streaming gRPC, and you can send requests using the friendli-client SDK. To start the Friendli Container with gRPC support, use the following command:

export FRIENDLI_CONTAINER_SECRET="YOUR_FRIENDLI_CONTAINER_SECRET_flc_XXX"

# e.g. Running `NousResearch/Hermes-3-Llama-3.1-8B` on GPU 0 with a trial image.
docker run --gpus '"device=0"' -p 8000:8000 \
  -e FRIENDLI_CONTAINER_SECRET=$FRIENDLI_CONTAINER_SECRET \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  registry.friendli.ai/trial:latest \
  --hf-model-name NousResearch/Hermes-3-Llama-3.1-8B \
  --grpc true

You can change the server port with the --web-server-port argument.
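For example, the command above could be adjusted to serve on port 8001 instead. The port number here is illustrative; note that the published container port in -p must match the value passed to --web-server-port:

```shell
# Illustrative: publish port 8001 and tell the server to listen on it.
docker run --gpus '"device=0"' -p 8001:8001 \
  -e FRIENDLI_CONTAINER_SECRET=$FRIENDLI_CONTAINER_SECRET \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  registry.friendli.ai/trial:latest \
  --hf-model-name NousResearch/Hermes-3-Llama-3.1-8B \
  --grpc true \
  --web-server-port 8001
```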

Sending Requests with the Client SDK

Here is how to use the friendli-client SDK to interact with the gRPC server. This example assumes that the gRPC server is running on 0.0.0.0:8000.
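A minimal sketch follows, assuming the friendli-client interface where passing base_url and use_grpc=True to the Friendli client routes completion requests over gRPC; since the server speaks response-streaming gRPC, stream=True must be set on the request:

```python
from friendli import Friendli

# Assumes the gRPC server from the previous step is listening on 0.0.0.0:8000.
client = Friendli(base_url="0.0.0.0:8000", use_grpc=True)

# The server is response-streaming, so `stream=True` is required.
stream = client.completions.create(
    prompt="Explain why gRPC is fast.",
    stream=True,
    top_k=1,
)

# Print each streamed chunk as it arrives.
for chunk in stream:
    print(chunk.text, end="", flush=True)
```

Running this requires the container from the previous section to be up; otherwise the request will fail to connect.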

Properly Closing the Client

By default, the library closes the underlying HTTP and gRPC connections when the client is garbage-collected. You can close the Friendli or AsyncFriendli client manually with the .close() method, or use a context manager so the connection is closed automatically when the with block exits.
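Both patterns are sketched below, under the same assumptions as the earlier example (a server at 0.0.0.0:8000 and the Friendli client's use_grpc option):

```python
from friendli import Friendli

# Option 1: close the client explicitly, even if an error occurs mid-stream.
client = Friendli(base_url="0.0.0.0:8000", use_grpc=True)
try:
    stream = client.completions.create(prompt="Hello.", stream=True)
    for chunk in stream:
        print(chunk.text, end="", flush=True)
finally:
    client.close()

# Option 2: a context manager closes the connection when the block exits.
with Friendli(base_url="0.0.0.0:8000", use_grpc=True) as client:
    stream = client.completions.create(prompt="Hello.", stream=True)
    for chunk in stream:
        print(chunk.text, end="", flush=True)
```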