This page is the configuration reference for Friendli Container: how to pass launch options, serve across multiple GPUs, and tune serving for your model. If you haven't run a container yet, start with the Quickstart.

Friendli Container supports direct loading of safetensors checkpoints, compatible with Hugging Face transformers, for many model types. You can find the complete list of supported models on the Supported Models page. If your model is not on the list, please contact support.

Passing Launch Options

Launch options are passed as arguments after the image name in your docker run command:
# Fill in the values of the following variables.
export HF_MODEL_NAME=""  # Hugging Face model name (e.g., "meta-llama/Meta-Llama-3-8B-Instruct")
export FRIENDLI_CONTAINER_SECRET=""  # Friendli container secret

docker run --gpus '"device=0"' -p 8000:8000 \
  -e FRIENDLI_CONTAINER_SECRET=$FRIENDLI_CONTAINER_SECRET \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  registry.friendli.ai/trial \
    --hf-model-name $HF_MODEL_NAME \
    [LAUNCH_OPTIONS]
Replace [LAUNCH_OPTIONS] with the options described in Launch Options. Running the command above starts a Docker container that exposes an HTTP endpoint for handling inference requests.
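As a concrete illustration, the sketch below fills [LAUNCH_OPTIONS] with three options from the Launch Options table. The values (port 9000, batch size 128, a 30-second timeout) and the helper name launch_friendli are assumptions chosen for illustration, not recommendations. Two details are worth noting: the published Docker port is changed to match --web-server-port, and --timeout-microseconds takes microseconds, so a value in seconds is multiplied by 1,000,000.

```shell
# Hypothetical values, chosen for illustration only.
WEB_PORT=9000                                        # value for --web-server-port
TIMEOUT_SECONDS=30
TIMEOUT_MICROSECONDS=$((TIMEOUT_SECONDS * 1000000))  # --timeout-microseconds expects microseconds

LAUNCH_OPTIONS="--web-server-port ${WEB_PORT} --max-batch-size 128 --timeout-microseconds ${TIMEOUT_MICROSECONDS}"

# launch_friendli is a hypothetical helper name; call it to start the container.
launch_friendli() {
  # $LAUNCH_OPTIONS is left unquoted on purpose so it splits into separate arguments.
  docker run --gpus '"device=0"' -p "${WEB_PORT}:${WEB_PORT}" \
    -e FRIENDLI_CONTAINER_SECRET="$FRIENDLI_CONTAINER_SECRET" \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    registry.friendli.ai/trial \
      --hf-model-name "$HF_MODEL_NAME" \
      $LAUNCH_OPTIONS
}
```

The block only defines variables and the helper. Run launch_friendli after setting HF_MODEL_NAME and FRIENDLI_CONTAINER_SECRET to start the container.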

Multi-GPU Serving

Friendli Container supports tensor parallelism and pipeline parallelism for multi-GPU inference.

Tensor Parallelism

Use tensor parallelism when serving large models that exceed the memory capacity of a single GPU. It distributes parts of the model’s weights across multiple GPUs. To use tensor parallelism with Friendli Container:
  1. Specify multiple GPUs for $GPU_ENUMERATION, the value passed to the --gpus option of docker run (e.g., '"device=0,1,2,3"').
  2. Use the --num-devices (or -d) option to specify the tensor parallelism degree (e.g., --num-devices 4).
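The two steps above can be combined as in the following sketch. It assumes tensor parallelism only, so the degree equals the number of GPUs listed. The GPU indices and the helper name launch_friendli_tp are hypothetical. The rest of the command mirrors the docker run invocation shown earlier on this page.

```shell
GPU_IDS="0,1,2,3"                        # hypothetical GPU indices on the host
GPU_ENUMERATION="\"device=${GPU_IDS}\""  # value for --gpus; the inner double quotes are part of the value

# Assumption: tensor parallelism degree = number of GPUs listed (count comma-separated entries).
NUM_DEVICES=$(printf '%s\n' "$GPU_IDS" | tr ',' '\n' | wc -l | tr -d ' ')

# launch_friendli_tp is a hypothetical helper name; call it to start the container.
launch_friendli_tp() {
  docker run --gpus "$GPU_ENUMERATION" -p 8000:8000 \
    -e FRIENDLI_CONTAINER_SECRET="$FRIENDLI_CONTAINER_SECRET" \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    registry.friendli.ai/trial \
      --hf-model-name "$HF_MODEL_NAME" \
      --num-devices "$NUM_DEVICES"
}
```

Deriving --num-devices from the GPU list keeps the two settings from drifting apart when the list changes.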

Examples

The following example runs Llama-3.1-8B-Instruct on a single GPU.
export FRIENDLI_CONTAINER_SECRET=""  # Friendli container secret (leave it if it's already set in your environment)
export HF_TOKEN=""  # Access token from Hugging Face (see the caution below)

docker run -p 8000:8000 --gpus '"device=0"' \
  -e HF_TOKEN=$HF_TOKEN \
  -e FRIENDLI_CONTAINER_SECRET=$FRIENDLI_CONTAINER_SECRET \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  registry.friendli.ai/trial \
    --hf-model-name meta-llama/Llama-3.1-8B-Instruct
Because downloading meta-llama/Llama-3.1-8B-Instruct is restricted to authorized users, you need to provide your Hugging Face User Access Token through the HF_TOKEN environment variable. The same applies to all private repositories.

Quantization

Friendli Container supports online quantization, which quantizes a model instantly when you launch it, as well as serving pre-quantized models. If your model is already quantized or needs to be quantized, check Quantization for more details.

Serving MoE Models

Running MoE (Mixture of Experts) models requires an additional step to search for the execution policy. See Serving MoE Models to learn how to launch Friendli Container for an MoE model.

Options for Running Friendli Container

General Options

| Options | Type | Summary | Default | Required |
| --- | --- | --- | --- | --- |
| --version | - | Print Friendli Container version. | - | |
| --help | - | Print Friendli Container help message. | - | |

Launch Options

| Options | Type | Summary | Default | Required |
| --- | --- | --- | --- | --- |
| --web-server-port | INT | Web server port. | 8000 | |
| --metrics-port | INT | Prometheus metrics export port. | 8281 | |
| --hf-model-name | TEXT | Model name hosted on the Hugging Face Models Hub or a path to a local directory containing a model. When a model name is provided, Friendli Container first checks whether the model is already cached at ~/.cache/huggingface/hub and uses it if available. If not, it downloads the model from the Hugging Face Models Hub before launching the container. When a local path is provided, it loads the model from that location without downloading. This option is only available for models in safetensors format. | - | |
| --tokenizer-file-path | TEXT | Absolute path of the tokenizer file. This option is not needed when tokenizer.json is located under the path specified by --ckpt-path. | - | |
| --tokenizer-add-special-tokens | BOOLEAN | Whether or not to add special tokens in tokenization. Equivalent to Hugging Face Tokenizer's add_special_tokens argument. The default value is false for versions < v1.6.0. | true | |
| --tokenizer-skip-special-tokens | BOOLEAN | Whether or not to remove special tokens in detokenization. Equivalent to Hugging Face Tokenizer's skip_special_tokens argument. | true | |
| --dtype | CHOICE: [bf16, fp16, fp32] | Data type of weights and activations. Choose one of fp16, bf16, or fp32. This argument applies to non-quantized weights and activations. If not specified, Friendli Container follows the value of torch_dtype in the config.json file or assumes fp16. | fp16 | |
| --bad-stop-file-path | TEXT | JSON file path that contains stop sequences or bad words/tokens. | - | |
| --num-request-threads | INT | Thread pool size for handling HTTP requests. | 4 | |
| --timeout-microseconds | INT | Server-side timeout for client requests, in microseconds. | 0 (no timeout) | |
| --ignore-nan-error | BOOLEAN | If set to true, NaN errors are ignored. Otherwise, the server responds with a 400 status code if NaN values are detected while processing a request. | - | |
| --max-batch-size | INT | Maximum number of sequences that can be processed in a batch. | 384 | |
| --num-devices, -d | INT | Number of devices to use for tensor parallelism (the tensor parallelism degree). | 1 | |
| --search-policy | BOOLEAN | Searches for the best engine policy for the given combination of model, hardware, and parallelism degree. Learn more about policy search at Optimizing Inference with Policy Search. | false | |
| --terminate-after-search | BOOLEAN | Terminates the engine container after the policy search. | false | |
| --algo-policy-dir | TEXT | Path to the directory containing the policy file. The default value is the current working directory. Learn more about policy search at Optimizing Inference with Policy Search. | current working dir | |
| --adapter-model | TEXT | Adds an adapter model, given as an adapter name and path in the form <adapter_name>:<adapter_ckpt_path>. The path can also be a model name from the Hugging Face Models Hub. | - | |
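To illustrate the <adapter_name>:<adapter_ckpt_path> format taken by --adapter-model, here is a minimal sketch that mounts a local adapter directory into the container and passes it at launch. The adapter name, both paths, and the helper name launch_friendli_with_adapter are hypothetical. The sketch also assumes the adapter path is resolved inside the container, which is why the directory is mounted with -v.

```shell
ADAPTER_NAME="my-adapter"                     # hypothetical adapter name
ADAPTER_HOST_DIR="$HOME/adapters/my-adapter"  # hypothetical adapter checkpoint directory on the host
ADAPTER_CKPT_PATH="/adapters/my-adapter"      # hypothetical mount point inside the container

# Compose the option value as <adapter_name>:<adapter_ckpt_path>.
ADAPTER_MODEL="${ADAPTER_NAME}:${ADAPTER_CKPT_PATH}"

# launch_friendli_with_adapter is a hypothetical helper name; call it to start the container.
launch_friendli_with_adapter() {
  docker run --gpus '"device=0"' -p 8000:8000 \
    -e FRIENDLI_CONTAINER_SECRET="$FRIENDLI_CONTAINER_SECRET" \
    -v ~/.cache/huggingface:/root/.cache/huggingface \
    -v "${ADAPTER_HOST_DIR}:${ADAPTER_CKPT_PATH}" \
    registry.friendli.ai/trial \
      --hf-model-name "$HF_MODEL_NAME" \
      --adapter-model "$ADAPTER_MODEL"
}
```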

Model Specific Options

T5

| Options | Type | Summary | Default | Required |
| --- | --- | --- | --- | --- |
| --max-input-length | INT | Maximum input length. | - | |
| --max-output-length | INT | Maximum output length. | - | |
Last modified on June 22, 2026