This guide walks you through running a gRPC inference server with Friendli Container and interacting with it through the friendli SDK.

Prerequisites

Install friendli to use the gRPC client SDK. Version 1.4.1 or higher is required:

pip install "friendli>=1.4.1"
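If friendli is already installed, you can check which version you have with pip:

pip show friendli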

Starting the Friendli Container with gRPC

You can run the Friendli Container as a gRPC server for completions by adding the --grpc true option to the command arguments. The server supports response-streaming gRPC, and you can send requests using the friendli SDK. To start the Friendli Container with gRPC support, use the following command:

export FRIENDLI_CONTAINER_SECRET="YOUR_FRIENDLI_CONTAINER_SECRET_flc_XXX"

# e.g. Running `NousResearch/Hermes-3-Llama-3.1-8B` on GPU 0 with a trial image.
docker run --gpus '"device=0"' -p 8000:8000 \
  -e FRIENDLI_CONTAINER_SECRET=$FRIENDLI_CONTAINER_SECRET \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  registry.friendli.ai/trial:latest  \
  --hf-model-name NousResearch/Hermes-3-Llama-3.1-8B \
  --grpc true

You can change the port of the server with the --web-server-port argument.
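For example, to serve on port 8001 instead, pass --web-server-port 8001 and update the host-side -p mapping to match the new port:

docker run --gpus '"device=0"' -p 8001:8001 \
  -e FRIENDLI_CONTAINER_SECRET=$FRIENDLI_CONTAINER_SECRET \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  registry.friendli.ai/trial:latest \
  --hf-model-name NousResearch/Hermes-3-Llama-3.1-8B \
  --grpc true \
  --web-server-port 8001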

Sending Requests with the Client SDK

Here is how to use the friendli SDK to interact with the gRPC server. This example assumes that the gRPC server is running on 0.0.0.0:8000.

from friendli import SyncFriendli

client = SyncFriendli()

stream = client.container.chat.complete(
    messages=[
        {"content": "You are a helpful assistant.", "role": "system"},
        {"content": "Hello!", "role": "user"},
    ],
    stream=True,  # Must be True; the gRPC server only supports response streaming.
    top_k=1,
)

for chunk in stream:
    print(chunk.text, end="", flush=True)

Properly Closing the Client

By default, the library closes the underlying HTTP and gRPC connections when the client is garbage-collected. You can close the SyncFriendli or AsyncFriendli client manually with the .close() method, or use it as a context manager so it is closed when the with block exits.
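For example, here is a minimal sketch of closing the client explicitly with .close(), reusing the chat call from above:

from friendli import SyncFriendli

client = SyncFriendli()

try:
    stream = client.container.chat.complete(
        messages=[
            {"content": "You are a helpful assistant.", "role": "system"},
            {"content": "Hello!", "role": "user"},
        ],
        stream=True,  # Must be True; the gRPC server only supports response streaming.
        top_k=1,
    )
    for chunk in stream:
        print(chunk.text, end="", flush=True)
finally:
    client.close()  # Release the underlying HTTP/gRPC connections.

Alternatively, use the client as a context manager: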

from friendli import SyncFriendli

client = SyncFriendli()

with client:
    stream = client.container.chat.complete(
        messages=[
            {"content": "You are a helpful assistant.", "role": "system"},
            {"content": "Hello!", "role": "user"},
        ],
        stream=True,  # Must be True; the gRPC server only supports response streaming.
        top_k=1,
        min_tokens=10,
    )

    for chunk in stream:
        print(chunk.text, end="", flush=True)
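
The same pattern applies to the async client. The sketch below assumes AsyncFriendli mirrors the synchronous API (an awaitable complete() call, async iteration over the stream, and async with for cleanup); check the SDK reference for the exact signatures.

import asyncio

from friendli import AsyncFriendli

client = AsyncFriendli()


async def main():
    # `async with` closes the underlying connections when the block exits.
    async with client:
        stream = await client.container.chat.complete(
            messages=[
                {"content": "You are a helpful assistant.", "role": "system"},
                {"content": "Hello!", "role": "user"},
            ],
            stream=True,  # Must be True; the gRPC server only supports response streaming.
            top_k=1,
        )

        async for chunk in stream:
            print(chunk.text, end="", flush=True)


asyncio.run(main())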