safetensors checkpoints—compatible with Hugging Face transformers—for many model types. You can find the complete list of supported models on the Supported Models page. If your model is not on the list, please contact support.
Passing Launch Options
Launch options are passed as arguments after the image name in your docker run command:
Replace [LAUNCH_OPTIONS] with the options described in Launch Options. Running the command starts a Docker container that exposes an HTTP endpoint for handling inference requests.
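To make the argument ordering concrete, here is a minimal Python sketch that assembles a docker run command with the launch options placed after the image name. The helper name build_docker_run and the image reference are hypothetical placeholders, not part of Friendli Container. The --gpus and -p flags are standard Docker flags, and publishing port 8000 assumes the default --web-server-port. Your real command may need additional Docker arguments that are not shown here.

```python
import shlex


def build_docker_run(image, launch_options, gpu_enumeration="all", port=8000):
    """Assemble a docker run argv list (hypothetical helper for illustration).

    Docker's own flags come before the image name; everything after the
    image name is passed to the container as launch options.
    """
    docker_flags = ["--gpus", gpu_enumeration, "-p", f"{port}:{port}"]
    return ["docker", "run", *docker_flags, image, *launch_options]


if __name__ == "__main__":
    cmd = build_docker_run(
        image="example.invalid/friendli-container:tag",  # placeholder image name
        launch_options=[
            "--hf-model-name", "meta-llama/Llama-3.1-8B-Instruct",
            "--web-server-port", "8000",
        ],
    )
    # Print a shell-safe rendering of the command instead of executing it.
    print(shlex.join(cmd))
```

Printing the command instead of running it makes it easy to confirm that every launch option appears after the image name before you start a container.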
Multi-GPU Serving
Friendli Container supports tensor parallelism and pipeline parallelism for multi-GPU inference.
Tensor Parallelism
Use tensor parallelism when serving large models that exceed the memory capacity of a single GPU. It distributes parts of the model's weights across multiple GPUs. To use tensor parallelism with Friendli Container:
- Specify multiple GPUs for $GPU_ENUMERATION (e.g., '"device=0,1,2,3"').
- Use the --num-devices (or -d) option to specify the tensor parallelism degree (e.g., --num-devices 4).
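The two steps above have to agree with each other: the number of GPUs you expose should match the tensor parallelism degree. The following Python sketch derives both values from one list of GPU indices. The function name tensor_parallel_args is a hypothetical helper, and the sketch assumes that the degree equals the number of GPUs listed, as in the example above.

```python
def tensor_parallel_args(gpu_ids):
    """Return (gpu_enumeration, launch_options) for the given GPU indices.

    Hypothetical helper: assumes the tensor parallelism degree equals the
    number of GPUs exposed to the container.
    """
    if not gpu_ids:
        raise ValueError("at least one GPU index is required")
    if len(set(gpu_ids)) != len(gpu_ids):
        raise ValueError("GPU indices must be unique")

    # Docker expects literal double quotes around a multi-device list,
    # which is why the shell form is written as '"device=0,1,2,3"'.
    gpu_enumeration = '"device=' + ",".join(str(i) for i in gpu_ids) + '"'
    launch_options = ["--num-devices", str(len(gpu_ids))]
    return gpu_enumeration, launch_options


if __name__ == "__main__":
    enumeration, options = tensor_parallel_args([0, 1, 2, 3])
    print(enumeration)
    print(options)
```

Deriving both values from a single list avoids the mismatch where four GPUs are exposed but --num-devices is left at its default of 1.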
Examples
- Deploying Models on a Single GPU
- Deploying Models on Multi-GPU
This example runs Llama-3.1-8B-Instruct on a single GPU.
Quantization
Friendli Container supports online quantization, which quantizes a model instantly when you launch it, as well as serving pre-quantized models. If your model is already quantized or needs to be quantized, check Quantization for more details.
Serving MoE Models
Running MoE (Mixture of Experts) models requires an additional step to search for the execution policy. See Serving MoE Models to learn how to launch Friendli Container for MoE models.
Options for Running Friendli Container
General Options
| Options | Type | Summary | Default | Required |
|---|---|---|---|---|
| --version | - | Print Friendli Container version. | - | ❌ |
| --help | - | Print Friendli Container help message. | - | ❌ |
Launch Options
| Options | Type | Summary | Default | Required |
|---|---|---|---|---|
| --web-server-port | INT | Web server port. | 8000 | ❌ |
| --metrics-port | INT | Prometheus metrics export port. | 8281 | ❌ |
| --hf-model-name | TEXT | Model name hosted on the Hugging Face Models Hub or a path to a local directory containing a model. When a model name is provided, Friendli Container first checks whether the model is already cached at ~/.cache/huggingface/hub and uses it if available. If not, it downloads the model from the Hugging Face Models Hub before launching the container. When a local path is provided, it loads the model from that location without downloading. This option is only available for models in the safetensors format. | - | ❌ |
| --tokenizer-file-path | TEXT | Absolute path of the tokenizer file. This option is not needed when tokenizer.json is located under the path specified at --ckpt-path. | - | ❌ |
| --tokenizer-add-special-tokens | BOOLEAN | Whether to add special tokens in tokenization. Equivalent to Hugging Face Tokenizer's add_special_tokens argument. For versions earlier than v1.6.0, the default value is false. | true | ❌ |
| --tokenizer-skip-special-tokens | BOOLEAN | Whether to remove special tokens in detokenization. Equivalent to Hugging Face Tokenizer's skip_special_tokens argument. | true | ❌ |
| --dtype | CHOICE: [bf16, fp16, fp32] | Data type of weights and activations. Choose one of fp16, bf16, or fp32. This argument applies to non-quantized weights and activations. If not specified, Friendli Container follows the value of torch_dtype in the config.json file or assumes fp16. | fp16 | ❌ |
| --bad-stop-file-path | TEXT | JSON file path that contains stop sequences or bad words/tokens. | - | ❌ |
| --num-request-threads | INT | Thread pool size for handling HTTP requests. | 4 | ❌ |
| --timeout-microseconds | INT | Server-side timeout for client requests, in microseconds. | 0 (no timeout) | ❌ |
| --ignore-nan-error | BOOLEAN | If set to True, ignore NaN errors. Otherwise, respond with a 400 status code if NaN values are detected while processing a request. | - | ❌ |
| --max-batch-size | INT | Maximum number of sequences that can be processed in a batch. | 384 | ❌ |
| --num-devices, -d | INT | Number of devices to use for tensor parallelism (the tensor parallelism degree). | 1 | ❌ |
| --search-policy | BOOLEAN | Searches for the best engine policy for the given combination of model, hardware, and parallelism degree. Learn more about policy search at Optimizing Inference with Policy Search. | false | ❌ |
| --terminate-after-search | BOOLEAN | Terminates the engine container after the policy search. | false | ❌ |
| --algo-policy-dir | TEXT | Path to the directory containing the policy file. The default value is the current working directory. Learn more about policy search at Optimizing Inference with Policy Search. | current working dir | ❌ |
| --adapter-model | TEXT | Add an adapter model with an adapter name and path, in the form `<adapter_name>:<adapter_ckpt_path>`. The path can be a model name from the Hugging Face Models Hub. | - | ❌ |
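
Two of the options above are easy to get wrong by hand: --timeout-microseconds takes microseconds, not seconds, and --adapter-model takes a single name:path string. The Python sketch below shows one way to prepare such values and render them as flag and value pairs. All function names are hypothetical helpers, the adapter name and path in the usage example are placeholders, and the sketch covers only INT and TEXT options because the table does not show how BOOLEAN options are written on the command line.

```python
def seconds_to_microseconds(seconds):
    """Convert a timeout in seconds to the integer microseconds the option expects."""
    if seconds < 0:
        raise ValueError("timeout must be non-negative")
    return int(round(seconds * 1_000_000))


def adapter_spec(adapter_name, adapter_ckpt_path):
    """Join an adapter name and checkpoint path as <adapter_name>:<adapter_ckpt_path>."""
    if not adapter_name or ":" in adapter_name:
        raise ValueError("adapter name must be non-empty and must not contain ':'")
    return f"{adapter_name}:{adapter_ckpt_path}"


def render_launch_options(options):
    """Render a dict of INT/TEXT options as a flat ['--flag', 'value', ...] list."""
    args = []
    for flag, value in options.items():
        args.extend([f"--{flag}", str(value)])
    return args


if __name__ == "__main__":
    launch_options = render_launch_options({
        "web-server-port": 8000,
        "timeout-microseconds": seconds_to_microseconds(30),
        # Placeholder adapter name and path, for illustration only.
        "adapter-model": adapter_spec("my-adapter", "/path/to/adapter"),
    })
    print(launch_options)
```

The resulting list can be appended after the image name in a docker run command.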
Model Specific Options
T5
| Options | Type | Summary | Default | Required |
|---|---|---|---|---|
| --max-input-length | INT | Maximum input length. | - | ✅ |
| --max-output-length | INT | Maximum output length. | - | ✅ |