# friendli api chat-completions create Source: https://friendli.ai/docs/cli/api/chat-completions/create Create chat completions using the Friendli API. Customize your requests with various options like model selection, message input, token limits, and more to generate tailored results. ## Usage ```bash friendli api chat-completions create [OPTIONS] ``` ## Summary Creates chat completions. ## Options | Option | Type | Summary | Default | Required | | --- | --- | --- | --- | --- | | **`--message`**, **`-g`** | TEXT | A message in `ROLE CONTENT` format. Repeat this option to add multiple messages. | - | ✅ | | **`--model`**, **`-m`** | TEXT | The model to use for chat completions. See [here](/guides/serverless_endpoints/pricing) for more about available models and pricing. | - | ✅ | | `--n`, `-n` | INTEGER RANGE | The number of results to generate. | None | ❌ | | `--max-tokens`, `-M` | INTEGER RANGE | The maximum number of tokens to generate. | None | ❌ | | `--stop`, `-S` | TEXT | When one of the stop phrases appears in the generation result, the API will stop generation. The stop phrases are excluded from the result. Repeat this option to use multiple stop phrases. | None | ❌ | | `--temperature`, `-T` | FLOAT RANGE | Sampling temperature. Non-zero positive numbers are allowed. | None | ❌ | | `--top-p`, `-P` | FLOAT RANGE | Tokens comprising the top top\_p probability mass are kept for sampling. | None | ❌ | | `--frequency-penalty`, `-fp` | FLOAT RANGE | Positive values penalize tokens that have been sampled, taking into account their frequency in the preceding text. This penalization diminishes the model's tendency to reproduce identical lines verbatim. | None | ❌ | | `--presence-penalty`, `-pp` | FLOAT RANGE | Positive values penalize tokens that have been sampled at least once in the existing text. | None | ❌ | | `--stream`, `-s` | BOOLEAN | Whether to stream the generation result. | False | ❌ | | `--token`, `-t` | TEXT | Friendli Token for auth. | None | ❌ | | `--team-id` | TEXT | ID of team to run as. | None | ❌ | # friendli api completions create Source: https://friendli.ai/docs/cli/api/completions/create Create text completions using the Friendli API. Customize your completions with various options like prompts, model selection, token limits, and more to create precise, tailored outputs. ## Usage ```bash friendli api completions create [OPTIONS] ``` ## Summary Creates text completions. ## Options | Option | Type | Summary | Default | Required | | --- | --- | --- | --- | --- | | **`--prompt`**, **`-p`** | TEXT | The input text to generate a completion for. | - | ✅ | | **`--model`**, **`-m`** | TEXT | The model to use for completions. See [here](/guides/serverless_endpoints/pricing) for more about available models and pricing. | - | ✅ | | `--n`, `-n` | INTEGER RANGE | The number of results to generate. | None | ❌ | | `--max-tokens`, `-M` | INTEGER RANGE | The maximum number of tokens to generate. | None | ❌ | | `--stop`, `-S` | TEXT | When one of the stop phrases appears in the generation result, the API will stop generation. The stop phrases are excluded from the result. Repeat this option to use multiple stop phrases. | None | ❌ | | `--temperature`, `-T` | FLOAT RANGE | Sampling temperature. Non-zero positive numbers are allowed. | None | ❌ | | `--top-p`, `-P` | FLOAT RANGE | Tokens comprising the top top\_p probability mass are kept for sampling. | None | ❌ | | `--frequency-penalty`, `-fp` | FLOAT RANGE | Positive values penalize tokens that have been sampled, taking into account their frequency in the preceding text. This penalization diminishes the model's tendency to reproduce identical lines verbatim. | None | ❌ | | `--presence-penalty`, `-pp` | FLOAT RANGE | Positive values penalize tokens that have been sampled at least once in the existing text. | None | ❌ | | `--stream`, `-s` | BOOLEAN | Whether to stream the generation result. | False | ❌ | | `--token`, `-t` | TEXT | Friendli Token for auth. | None | ❌ | | `--team` | TEXT | ID of team to run as. | None | ❌ |
# friendli endpoint create Source: https://friendli.ai/docs/cli/endpoint/create Create and deploy new endpoints with the Friendli API. Customize with model selection, GPU configuration, and more to efficiently serve your machine learning models. ## Usage ```bash friendli endpoint create [OPTIONS] ``` ## Summary Creates a new endpoint by deploying a model. ## Options | Option | Type | Summary | Default | Required | | --- | --- | --- | --- | --- | | **`--name`**, **`-n`** | TEXT | The name of the endpoint to create. | - | ✅ | | **`--model`**, **`-m`** | TEXT | The name of the Hugging Face model to deploy. | - | ✅ | | **`--gpu-type`**, **`-gt`** | TEXT | GPU type to serve the deployed model. | - | ✅ | | **`--gpu-count`**, **`-gc`** | INTEGER | The number of GPUs to serve the deployed model. | - | ✅ | # friendli endpoint get Source: https://friendli.ai/docs/cli/endpoint/get Get detailed information about a specific endpoint using the Friendli API. ## Usage ```bash friendli endpoint get ENDPOINT_ID ``` ## Summary Get detailed info of an endpoint. ## Arguments | Argument | Type | Summary | Default | Required | | --- | --- | --- | --- | --- | | **`endpoint_id`** | TEXT | ID of an endpoint to get. | - | ✅ | # friendli endpoint list Source: https://friendli.ai/docs/cli/endpoint/list View all your deployed endpoints with the Friendli API. Easily list endpoints for efficient model management. ## Usage ```bash friendli endpoint list ``` ## Summary List endpoints. # friendli endpoint terminate Source: https://friendli.ai/docs/cli/endpoint/terminate Terminate a running endpoint with the Friendli API using the endpoint ID. Easily manage and stop your deployed models when needed. ## Usage ```bash friendli endpoint terminate ENDPOINT_ID ``` ## Summary Terminate a running endpoint. ## Arguments | Argument | Type | Summary | Default | Required | | --- | --- | --- | --- | --- | | **`endpoint_id`** | TEXT | ID of an endpoint to terminate. | - | ✅ | # Installation Source: https://friendli.ai/docs/cli/installation Install the friendli-client package to access advanced features for AI integration. Supports Python 3.8+, with options for machine learning libraries and Hugging Face checkpoint conversion. You can simply install the `friendli-client` package using `pip`.
```bash pip install friendli-client ``` `friendli-client` requires **python>=3.8**. We recommend using the most up-to-date package. You can check the release history at [PyPI](https://pypi.org/project/friendli-client/#history) and [GitHub](https://github.com/friendliai/friendli-client/releases). You can update the package with: ```bash pip install friendli-client -U ``` If you have a Hugging Face checkpoint and want to convert it to a Friendli-compatible format or apply quantization, you need to install the package with the necessary machine learning library (`mllib`) dependencies. In this case, install the package with the following command: ```sh pip install "friendli-client[mllib]" ``` # friendli login Source: https://friendli.ai/docs/cli/login Sign in to Friendli using the command line interface. ## Usage ```bash friendli login [OPTIONS] ``` ## Summary Sign in to Friendli. ## Options | Option | Type | Summary | Default | Required | | ------- | ------- | -------------- | ------- | -------- | | `--sso` | BOOLEAN | Use SSO login. | False | ❌ | # friendli logout Source: https://friendli.ai/docs/cli/logout Sign out of Friendli using the command line interface. ## Usage ```bash friendli logout ``` ## Summary Sign out. # friendli model convert Source: https://friendli.ai/docs/cli/model/convert Convert Hugging Face model checkpoints to Friendli format for deployment. Includes options for quantization, data type selection, and model optimization using the Friendli API. This command is deprecated and will be removed in future releases. Use the newly created [**friendli-model-optimizer**](https://github.com/friendliai/friendli-model-optimizer) tool instead. ## Usage ```bash friendli model convert [OPTIONS] ``` ## Summary Convert a Hugging Face model checkpoint to Friendli format. When a checkpoint is in the Hugging Face format, it cannot be directly served. It requires conversion to the Friendli format for serving. The conversion process involves copying the original checkpoint and transforming it into a checkpoint in the Friendli format (\*.h5). The `friendli model convert` command is available only when the package is installed with `pip install "friendli-client[mllib]"`. ### Apply quantization If you want to quantize the model along with the conversion, the `--quantize` option should be provided. You can customize the quantization configuration by describing it in a YAML file and providing the path to the file with the `--quant-config-file` option. When the `--quantize` option is used without providing `--quant-config-file`, the following configuration is used by default. ```yaml # Default quantization configuration mode: awq device: cuda:0 seed: 42 offload: true calibration_dataset: path_or_name: lambada format: json split: validation lookup_column_name: text num_samples: 128 max_length: 512 awq_args: quant_bit: 4 quant_group_size: 64 ``` * **`mode`**: Quantization scheme to apply. Defaults to "awq". * **`device`**: Device to run the quantization process. Defaults to "cuda:0". * **`seed`**: Random seed. Defaults to 42. * **`offload`**: When enabled, this option significantly reduces GPU memory usage by offloading model layers onto CPU RAM. Defaults to true. * **`calibration_dataset`** * **`path_or_name`**: Path or name of the dataset. Datasets from either the Hugging Face Datasets Hub or local file system can be used. Defaults to "lambada". * **`format`**: Format of datasets. Defaults to "json". * **`split`**: Which split of the data to load. Defaults to "validation".
* **`lookup_column_name`**: The name of a column in the dataset to be used as calibration inputs. Defaults to "text". * **`num_samples`**: The number of dataset samples to use for calibration. Note that the dataset will be shuffled before sampling. Defaults to 128. * **`max_length`**: The maximum length of a calibration input sequence. Defaults to 512. * **`awq_args`** (Fill in this field only for "awq" mode) * **`quant_bit`**: Bit width of integers to represent weights. Possible values are `4` or `8`. Defaults to 4. * **`quant_group_size`**: Group size of quantized matrices. 64 is the only supported value at this time. Defaults to 64. If you encounter OOM issues when running with AWQ, try enabling the `offload` option. If you set `percentile` in the quant config file to 100, the quantization range will be determined by the maximum absolute values of the activation tensors. Currently, [AWQ](https://arxiv.org/abs/2306.00978) is the only supported quantization scheme. AWQ is supported only for models with one of the following architectures: * `GPTNeoXForCausalLM` * `GPTJForCausalLM` * `LlamaForCausalLM` * `MPTForCausalLM` ## Options | Option | Type | Summary | Default | Required | | --- | --- | --- | --- | --- | | **`--model-name-or-path`**, **`-m`** | TEXT | Hugging Face pretrained model name or path to the saved model checkpoint. | - | ✅ | | **`--output-dir`**, **`-o`** | TEXT | Directory path to save the converted checkpoint and related configuration files. Three files will be created in the directory: `model.h5`, `tokenizer.json`, and `attr.yaml`. The `model.h5` or `model.safetensors` is the converted checkpoint and can be renamed using the `--output-model-filename` option. The `tokenizer.json` is the Friendli-compatible tokenizer file, which should be uploaded along with the checkpoint file to tokenize the model input and output. The `attr.yaml` is the checkpoint attribute file, to be used when uploading the converted model to Friendli. You can designate the file name using the `--output-attr-filename` option. | - | ✅ | | **`--data-type`**, **`-dt`** | CHOICE: \[bf16, fp16, fp32, int8, int4] | The data type of the converted checkpoint. | - | ✅ | | `--cache-dir` | TEXT | Directory for downloading checkpoint. | None | ❌ | | `--dry-run` | BOOLEAN | Only check conversion availability. | False | ❌ | | `--output-model-filename` | TEXT | Name of the converted checkpoint file. The default file name is `model.h5` when `--output-ckpt-file-type` is `hdf5` or `model.safetensors` when `--output-ckpt-file-type` is `safetensors`. | None | ❌ | | `--output-ckpt-file-type` | CHOICE: \[hdf5, safetensors] | File format of the converted checkpoint file. | hdf5 | ❌ | | `--output-attr-filename` | TEXT | Name of the checkpoint attribute file.
| attr.yaml | ❌ | | `--quantize` | BOOLEAN | Quantize the model before conversion | False | ❌ | | `--quant-config-file` | FILENAME | Path to the quantization configuration file. | None | ❌ | # friendli model list Source: https://friendli.ai/docs/cli/model/list View all available models with the Friendli API. Easily list models to streamline your deployment and optimization processes. ## Usage ```bash friendli model list ``` ## Summary List models. # friendli project list Source: https://friendli.ai/docs/cli/project/list List all accessible projects with the Friendli API. Easily manage your available projects for efficient workflow management. ## Usage ```bash friendli project list ``` ## Summary List all accessible projects. # friendli project switch Source: https://friendli.ai/docs/cli/project/switch Switch between project contexts using the Friendli API. Quickly change the active project by providing the project ID for smooth workflow management. ## Usage ```bash friendli project switch PROJECT_ID ``` ## Summary Switch current project context to run as. ## Arguments | Argument | Type | Summary | Default | Required | | ---------------- | ---- | ------------------------ | ------- | -------- | | **`project_id`** | TEXT | ID of project to switch. | - | ✅ | # friendli team list Source: https://friendli.ai/docs/cli/team/list View all available teams with the Friendli API. Easily list teams for project organization. ## Usage ```bash friendli team list ``` ## Summary List teams. # friendli team switch Source: https://friendli.ai/docs/cli/team/switch Switch between team contexts using the Friendli API. Quickly change the active team by providing the team ID for efficient collaboration and management. ## Usage ```bash friendli team switch TEAM_ID ``` ## Summary Switch current team context to run as. ## Arguments | Argument | Type | Summary | Default | Required | | ------------- | ---- | --------------------- | ------- | -------- | | **`team_id`** | TEXT | ID of team to switch. | - | ✅ | # friendli version Source: https://friendli.ai/docs/cli/version Check the installed package version of Friendli using the command line interface. ## Usage ```bash friendli version ``` ## Summary Check the installed package version. # friendli whoami Source: https://friendli.ai/docs/cli/whoami Show my user information of Friendli using the command line interface. ## Usage ```bash friendli whoami ``` ## Summary Show my user info. # CUDA Compatibility Source: https://friendli.ai/docs/guides/container/cuda_compatibility The Friendli Engine supports CUDA-enabled NVIDIA GPUs, which means it relies on a specific version of CUDA and necessitates proper CUDA compute compatibilities. The Friendli Engine supports CUDA-enabled NVIDIA GPUs, which means it relies on a specific version of CUDA and necessitates proper [CUDA compute compatibilities](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capability). To utilize the Friendli Container effectively, ensure that you have the appropriate NVIDIA GPUs and an NVIDIA driver in place. Currently, we publicly offer a single Friendli Container image (`registry.friendli.ai/trial:latest`) equipped with CUDA 12.4, targeting CUDA compute compatibility versions `8.0`, `8.6`, `8.9`, and `9.0`. To make the right choices regarding GPUs and driver versions, consult the [required driver versions](https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#id4) and [GPUs](https://developer.nvidia.com/cuda-gpus) for the CUDA toolkit and compute compatibility. 
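To check whether a machine meets these requirements before pulling the image, you can query the installed driver version and each GPU's compute capability. This is a minimal sketch; the `compute_cap` query field assumes a reasonably recent `nvidia-smi` release, so on older drivers you may need to look up the capability on the NVIDIA GPU list linked above instead.

```sh
# Print each GPU's name and compute capability, plus the installed driver version.
# Compare the compute capability against the versions targeted by the trial image
# (8.0, 8.6, 8.9, 9.0) and the driver version against the CUDA 12.4 requirements.
nvidia-smi --query-gpu=name,compute_cap,driver_version --format=csv
```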
# Inference with gRPC Source: https://friendli.ai/docs/guides/container/inference_with_grpc Run gRPC inference server with Friendli Container and interact with it through friendli-client SDK. This guide will walk you through how to run gRPC inference server with Friendli Container and interact with it through `friendli-client` SDK. ## Prerequisites Install `friendli-client` to use gRPC client SDK: ```sh pip install friendli-client ``` Ensure you have the `friendli-client` SDK version `1.4.1` or higher installed. ## Starting the Friendli Container with gRPC Running the Friendli Container with a gRPC server for completions is available by adding the `--grpc true` option to the command argument. This supports response-streaming gRPC, and you can send requests using our `friendli-client` SDK. To start the Friendli Container with gRPC support, use the following command: ```sh export FRIENDLI_CONTAINER_SECRET="YOUR_FRIENDLI_CONTAINER_SECRET_flc_XXX" # e.g. Running `NousResearch/Hermes-3-Llama-3.1-8B` on GPU 0 with a trial image. docker run --gpus '"device=0"' -p 8000:8000 \ -e FRIENDLI_CONTAINER_SECRET=$FRIENDLI_CONTAINER_SECRET \ -v ~/.cache/huggingface:/root/.cache/huggingface \ registry.friendli.ai/trial:latest \ --hf-model-name NousResearch/Hermes-3-Llama-3.1-8B \ --grpc true ``` You can change the port of the server with `--web-server-port` argument. ## Sending Requests with the Client SDK Here is how to use the `friendli-client` SDK to interact with the gRPC server. This example assumes that the gRPC server is running on `0.0.0.0:8000`. ```python Default from friendli import Friendli client = Friendli(base_url="0.0.0.0:8000", use_grpc=True) stream = client.completions.create( prompt="Explain what gRPC is.", stream=True, # Should be True top_k=1, ) for chunk in stream: print(chunk.text, end="", flush=True) ``` ```python Async # For asynchronous operations, use the following code snippet: import asyncio from friendli import AsyncFriendli client = AsyncFriendli(base_url="0.0.0.0:8000", use_grpc=True) async def run(): stream = await client.completions.create( prompt="Explain what gRPC is.", stream=True, # Should be True top_k=1, ) async for chunk in stream: print(chunk.text, end="", flush=True) asyncio.run(run()) ``` ## Properly Closing the Client By default, the library closes underlying HTTP and gRPC connections when the `client` is garbage-collected. You can manually close the `Friendli` or `AsyncFriendli` client using the `.close()` method or utilize a context manager to ensure proper closure when exiting a `with` block. ```python Default from friendli import Friendli client = Friendli(base_url="0.0.0.0:8000", use_grpc=True) with client: stream = client.completions.create( prompt="Explain what gRPC is.", stream=True, # Should be True top_k=1, min_tokens=10, ) for chunk in stream: print(chunk.text, end="", flush=True) ``` ```python Async import asyncio from friendli import AsyncFriendli client = AsyncFriendli(base_url="0.0.0.0:8000", use_grpc=True) async def run(): async with client: stream = await client.completions.create( prompt="Explain what gRPC is.", stream=True, # Should be True top_k=1, ) async for chunk in stream: print(chunk.text, end="", flush=True) asyncio.run(run()) ``` # Introducing Friendli Container Source: https://friendli.ai/docs/guides/container/introduction While Friendli Serverless Endpoints and Dedicated Endpoints offer convenient cloud-based solutions, some users crave even more control and flexibility. For those pioneers, Friendli Container is the answer. 
While Friendli Serverless Endpoints and Dedicated Endpoints offer convenient cloud-based solutions, some users crave even more control and flexibility. For those pioneers, Friendli Container is the answer. ## What is Friendli Container? Unmatched Control: Friendli Container provides the Friendli Engine, our cutting-edge serving technology, as a Docker container. This means you can: * **Run your own data center or cluster**: Deploy the container on your existing GPU machines, giving you complete control over your infrastructure and data security. * **Choose your own cloud provider**: If you prefer the cloud, you can still leverage your preferred cloud provider and GPUs. * **Customize your environment**: Fine-tune the container configuration to perfectly match your specific needs and workflows. Greater Responsibility, Greater Customization: With Friendli Container, you handle the cluster management, fault tolerance, and scaling. This responsibility comes with these potential benefits: * **Controlled environment**: Keep your data within your own environment, ideal for sensitive applications or meeting compliance requirements. * **Unmatched flexibility**: Tailor your infrastructure and workflows to your specific needs, pushing the boundaries of AI innovation. * **Cost saving opportunities**: Manage your resources on your GPU machines, potentially leading to cost savings compared to cloud-based solutions. Ideal for: * **Data-sensitive users**: Securely run your models within your own infrastructure. * **Control enthusiasts**: Take full control over your AI environment and workflows. * **Existing cluster owners**: Utilize your existing GPU resources for cost-effective generative AI serving. ## Getting Started with Friendli Container: 1. **Generate Your User Token**: Visit the Friendli Container page through the [Friendli Suite](https://suite.friendli.ai) website and generate your unique token. 2. **Login with Docker Client**: Use your token to authenticate with the Docker client on your machine. 3. **Pull the Friendli Container Image**: Run the docker pull command with the provided image name. 4. [**Launch the Friendli Container**](/guides/container/running_friendli_container): Run the docker run command with the desired configuration and credentials. 5. **Expose Your Model**: The container will expose the model for inference. 6. [**Send Inference Requests**](/guides/container/running_friendli_container#sending-inference-requests): Use tools like curl or Python's requests library to send input prompts or data to the container. Take generative AI to the next level with unmatched control, security, and flexibility through Friendli Container. Start your journey today and elevate your AI endeavors on your own terms! # Observability for Friendli Container Source: https://friendli.ai/docs/guides/container/monitoring Observability is an integral part of DevOps. To support this, Friendli Container exports internal metrics in a Prometheus text format. Observability is an integral part of DevOps. To support this, Friendli Container exports internal metrics in a [Prometheus](https://prometheus.io) text format. By default, metrics are served at `http://localhost:8281/metrics`. You can configure the port number using the command line option `--metrics-port`. ## Supported Metrics ### Counters Counters are cumulative metrics whose values monotonically increase. 
They are often used in combination with the Prometheus function [rate()](https://prometheus.io/docs/prometheus/latest/querying/functions/#rate) for calculating the throughput. | Metric Name | Description | | --- | --- | | friendli\_requests\_total | Cumulative number of requests received | | friendli\_responses\_total | Cumulative number of responses sent | | friendli\_items\_total | Cumulative number of items requested | | friendli\_failure\_by\_cancel | Cumulative number of failed requests due to cancellation | | friendli\_failure\_by\_timeout | Cumulative number of failed requests due to timeout | | friendli\_failure\_by\_nan\_error | Cumulative number of failed requests due to NaN error | | friendli\_failure\_by\_reject | Cumulative number of failed requests due to rejection | One inference request may generate multiple results with the `n` field in the request body. Upon receiving such a request, `friendli_requests_total` is increased by 1 and `friendli_items_total` is increased by `n`. ### Gauges Gauges are numerical values that can go up and down to represent the current value. | Metric Name | Description | | --- | --- | | friendli\_current\_requests | Current number of requests in the engine (either assigned or waiting) | | friendli\_current\_items | Current number of items in the engine (either assigned or waiting) | | friendli\_current\_assigned\_items | Current number of items actively processed by the engine | | friendli\_current\_waiting\_items | Current number of items waiting in the internal queue | ### Histograms [Histograms](https://prometheus.io/docs/practices/histograms) are used to track the distribution of variables over time.
| Histogram | Metric Name | Description |
| --- | --- | --- |
| Friendli TCache hit ratio (0 ≤ value ≤ 1) | friendli\_tcache\_hit\_ratio\_bucket | Bucketized number of histogram samples for TCache hit ratio, with `le` label |
| | friendli\_tcache\_hit\_ratio\_count | Total number of histogram samples for TCache hit ratio |
| | friendli\_tcache\_hit\_ratio\_sum | Sum of histogram sample values for TCache hit ratio |
| The length of input tokens (Experimental metric) | friendli\_input\_lengths\_bucket | Bucketized number of histogram samples for length of input tokens, with `le` label |
| | friendli\_input\_lengths\_count | Total number of histogram samples for length of input tokens |
| | friendli\_input\_lengths\_sum | Sum of histogram sample values for length of input tokens |
| The length of output tokens (Experimental metric) | friendli\_output\_lengths\_bucket | Bucketized number of histogram samples for length of output tokens, with `le` label |
| | friendli\_output\_lengths\_count | Total number of histogram samples for length of output tokens |
| | friendli\_output\_lengths\_sum | Sum of histogram sample values for length of output tokens |
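Once a Prometheus server is scraping the container, these bucketized series can be turned into percentiles with `histogram_quantile()`. A minimal sketch, assuming a Prometheus server reachable at `localhost:9090` that already scrapes the metrics endpoint:

```sh
# Ask Prometheus for the p90 of input lengths over the last 5 minutes,
# computed from the bucketized histogram series exported by the engine.
curl -s 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=histogram_quantile(0.9, rate(friendli_input_lengths_bucket[5m]))'
```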
For visualizing histograms using Grafana, [How to visualize Prometheus histograms in Grafana](https://grafana.com/blog/2020/06/23/how-to-visualize-prometheus-histograms-in-grafana) provides useful tips. ### Quantiles Quantiles are used to show the current p50 (median), p90, and p99 percentiles of variables.
| Quantile | Metric Name | Description |
| --- | --- | --- |
| Request completion latency (in nanoseconds) | friendli\_requests\_latencies | Percentile value for request completion latency (`quantile` label is either 0.5, 0.9, or 0.99) |
| | friendli\_requests\_latencies\_count | Total number of samples for request completion latency |
| | friendli\_requests\_latencies\_sum | Sum of sample values for request completion latency |
| Time to first token (TTFT) (in nanoseconds) | friendli\_requests\_ttft | Percentile value for time to first token (TTFT) (`quantile` label is either 0.5, 0.9, or 0.99) |
| | friendli\_requests\_ttft\_count | Total number of samples for time to first token (TTFT) |
| | friendli\_requests\_ttft\_sum | Sum of sample values for time to first token (TTFT) |
| Request queueing delay (in nanoseconds) | friendli\_requests\_queueing\_delays | Percentile value for queueing delay (`quantile` label is either 0.5, 0.9, or 0.99) |
| | friendli\_requests\_queueing\_delays\_count | Total number of samples for queueing delay |
| | friendli\_requests\_queueing\_delays\_sum | Sum of sample values for queueing delay |
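If you just want to eyeball these values without setting up Prometheus, you can read the raw exposition text directly from the metrics endpoint. A quick sketch, assuming the default metrics port of 8281 (adjust if you launched the container with `--metrics-port`):

```sh
# Dump the request latency quantiles currently reported by the engine.
curl -s http://localhost:8281/metrics | grep friendli_requests_latencies
```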
### Info The following information metric always has a value of 1. The metric labels contain useful information in text. | Metric Name | Label | Description | | --- | --- | --- | | friendli\_engine\_version | `version` | Engine version | ## Grafana Dashboard Template ![Grafana Dashboard](https://mintlify.s3.us-west-1.amazonaws.com/friendliai/static/images/guides/container/grafana_template_dashboard_example.png) You can import [the dashboard templates](https://github.com/friendliai/container-resource/tree/main/grafana) to your Grafana instance. The Grafana instance must be connected to a Prometheus instance (or a Prometheus-compatible data source) which is configured to scrape metrics from Friendli Container processes. The dashboard template works with Grafana v8.0.0 or later versions. We recommend using Grafana v10.0.0 or later for the best experience. # Optimizing Inference with Policy Search Source: https://friendli.ai/docs/guides/container/optimizing_inference_with_policy_search For specialized cases like MoE or quantized models, optimizing the execution policy in Friendli Engine can boost inference performance by 1.5x to 2x, improving throughput and reducing latency. ## Introduction For specialized cases, like **serving MoE models (e.g., Mixtral)** or **quantized models**, inference performance can be further optimized through an execution policy search. This step can be skipped, but it is required to get the best speed out of the Friendli Engine. When the Friendli Engine runs with the optimal policy, performance (throughput and latency) can improve by 1.5x to 2x. Therefore, we recommend skipping policy search for simple model testing, and performing policy search for cost analysis or latency analysis in a production service. Policy search is effective only when serving (1) MoE models or (2) AWQ, FP8, or INT8 quantized models. Otherwise, it has no effect. ## Running Policy Search You can run policy search by adding the following options to the launch command of Friendli Container. | Options | Type | Summary | Default | | --- | --- | --- | --- | | `--algo-policy-dir` | TEXT | Path to the directory to save the searched optimal policy file. The default value is the current working directory. | current working dir | | `--search-policy` | BOOLEAN | Runs policy search to find the best Friendli execution policy for the given configuration such as model type, GPU, NVIDIA driver version, quantization scheme, etc. | false | | `--terminate-after-search` | BOOLEAN | Terminates the engine container after policy search.
| false | ### Example: `FriendliAI/Llama-3.1-8B-Instruct-fp8` For example, you can start the policy search for the [FriendliAI/Llama-3.1-8B-Instruct-fp8](https://huggingface.co/FriendliAI/Llama-3.1-8B-Instruct-fp8) model as follows: ```sh export HF_MODEL_NAME="FriendliAI/Llama-3.1-8B-Instruct-fp8" export FRIENDLI_CONTAINER_SECRET="YOUR CONTAINER SECRET" export FRIENDLI_CONTAINER_IMAGE="registry.friendli.ai/trial" export GPU_ENUMERATION='"device=0"' export POLICY_DIR=$PWD/policy mkdir -p $POLICY_DIR docker run \ --gpus $GPU_ENUMERATION \ -p 8000:8000 \ -v ~/.cache/huggingface:/root/.cache/huggingface \ -v $POLICY_DIR:/policy \ -e FRIENDLI_CONTAINER_SECRET=$FRIENDLI_CONTAINER_SECRET \ $FRIENDLI_CONTAINER_IMAGE \ --hf-model-name $HF_MODEL_NAME \ --algo-policy-dir /policy \ --search-policy true ``` ### Example: `mistralai/Mixtral-8x7B-Instruct-v0.1` (TP=4) ```sh export HF_MODEL_NAME="mistralai/Mixtral-8x7B-Instruct-v0.1" export FRIENDLI_CONTAINER_SECRET="YOUR CONTAINER SECRET" export FRIENDLI_CONTAINER_IMAGE="registry.friendli.ai/trial" export GPU_ENUMERATION='"device=0,1,2,3"' export POLICY_DIR=$PWD/policy mkdir -p $POLICY_DIR docker run -p 8000:8000 \ --ipc=host --gpus $GPU_ENUMERATION \ -v ~/.cache/huggingface:/root/.cache/huggingface \ -v $POLICY_DIR:/policy \ -e FRIENDLI_CONTAINER_SECRET=$FRIENDLI_CONTAINER_SECRET \ $FRIENDLI_CONTAINER_IMAGE \ --hf-model-name $HF_MODEL_NAME \ --num-devices 4 \ --algo-policy-dir /policy \ --search-policy true ``` Once the policy search is complete, a policy file will be created in `$POLICY_DIR`. If the policy file already exists, the engine will search only the necessary spaces and update the policy file accordingly. After the policy search, the engine starts serving the endpoint using the policy file. It takes up to several minutes to find the optimal policy for the Llama 2 13B model with an NVIDIA A100 80GB GPU. The estimated time and remaining time will be displayed in stderr when you run the policy search. ## Running Policy Search Without Starting Serving Endpoint To search for the best policy without starting the serving endpoint, launch the engine with the Friendli Container command and include the `--terminate-after-search true` option. ### Example: `FriendliAI/Llama-3.1-8B-Instruct-fp8` ```sh docker run \ --gpus $GPU_ENUMERATION \ -p 8000:8000 \ -v ~/.cache/huggingface:/root/.cache/huggingface \ -v $POLICY_DIR:/policy \ -e FRIENDLI_CONTAINER_SECRET=$FRIENDLI_CONTAINER_SECRET \ $FRIENDLI_CONTAINER_IMAGE \ --hf-model-name FriendliAI/Llama-3.1-8B-Instruct-fp8 \ --algo-policy-dir /policy --search-policy true --terminate-after-search true ``` ### Example: `mistralai/Mixtral-8x7B-Instruct-v0.1` (TP=4) ```sh docker run -p 8000:8000 \ --ipc=host --gpus $GPU_ENUMERATION \ -v ~/.cache/huggingface:/root/.cache/huggingface \ -v $POLICY_DIR:/policy \ -e FRIENDLI_CONTAINER_SECRET=$FRIENDLI_CONTAINER_SECRET \ $FRIENDLI_CONTAINER_IMAGE \ --hf-model-name mistralai/Mixtral-8x7B-Instruct-v0.1 \ --num-devices 4 \ --algo-policy-dir /policy \ --search-policy true --terminate-after-search true ``` ## FAQ: When to Run Policy Search Again? The execution policy depends on the following factors: * Model * GPU * GPU count and parallelism degree (the values of the `--num-devices` and `--num-workers` options) * NVIDIA Driver major version * Friendli Container version You should run policy search again when any of these change in your serving setup.
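Two of these factors are easy to check from the host before deciding whether to reuse an existing policy file. A small sketch, assuming the environment variables from the examples above are still set; the `--version` launch option is documented in the Running Friendli Container guide, and depending on the image it may still require the container secret to be set.

```sh
# NVIDIA driver version on the host (its major version is one of the policy factors).
nvidia-smi --query-gpu=driver_version --format=csv,noheader

# Friendli Container version of the image you are about to run.
docker run --rm -e FRIENDLI_CONTAINER_SECRET=$FRIENDLI_CONTAINER_SECRET \
  $FRIENDLI_CONTAINER_IMAGE --version
```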
# QuickStart: Friendli Container Trial Source: https://friendli.ai/docs/guides/container/quickstart Learn how to get started with Friendli Container in this step-by-step guide. Activate your free trial, get access to the container registry, prepare your container secret, run your Friendli Container, and monitor it using Grafana. ## Introduction [Friendli Container](https://friendli.ai/products/container) enables you to efficiently deploy LLMs of your choice on your infrastructure. With Friendli Container, you can perform high-speed LLM inferencing in a secure and private environment. This tutorial will guide you through the process of running a Friendli Container for your LLM. ## Prerequisites * **Hardware Requirements**: Friendli Container currently only targets x86\_64 architecture and supports NVIDIA GPUs, so please prepare proper GPUs and a compatible driver by referring to [our required CUDA compatibility guide](/guides/container/cuda_compatibility). * **Software Requirements**: Your machine should be able to run containers with the [NVIDIA container toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/index.html). In this tutorial, we will use Docker as the container runtime and make use of [Docker Compose](https://docs.docker.com/compose). * **Model Compatibility**: If your model is in a [safetensors](https://huggingface.co/docs/safetensors/index) format, which is compatible with [Hugging Face transformers](https://huggingface.co/docs/transformers), you can serve the model directly with the Friendli Container. Please check our [Model library](https://friendli.ai/models) for a non-exhaustive list of supported models. This tutorial assumes that your model of choice is uploaded to [Hugging Face](https://huggingface.co) and you have access to it. If the model is gated or private, you need to prepare a [Hugging Face Access Token](https://huggingface.co/settings/tokens). ## Getting Access to Friendli Container ### Activate your Free Trial 1. Sign up for [Friendli Suite](https://suite.friendli.ai). 2. In the 'Friendli Container' section, click the 'Start free trial' button. Now you can use Friendli Container free of charge during the trial period. ### Get Access to the Container Registry Friendli Token is a user credential that is required for logging into our container registry. 1. Go to [Personal settings > Tokens](https://suite.friendli.ai/default-team/settings/tokens) and click 'Create token'. 2. Save the token you just created. ### Prepare your Container Secret Container secret is a secret code that is used to activate Friendli Container. You should pass the container secret as an environment variable to run the container image. 1. Go to [Container > Container Secrets](https://suite.friendli.ai/default-team/container/secrets) and click 'Create secret'. 2. Save the secret you just created. **🔑 Secret Rotation** You can rotate the container secret for security reasons. If you rotate the container secret, a new secret will be created and the previous secret will be automatically revoked in **30** minutes. ## Running Friendli Container ### Pull the Friendli Container Image 1. Log in to the container registry using the email address for your Friendli Suite account and the Friendli Token. ```sh export FRIENDLI_EMAIL="YOUR ACCOUNT EMAIL ADDRESS" export FRIENDLI_TOKEN="YOUR FRIENDLI TOKEN" docker login registry.friendli.ai -u $FRIENDLI_EMAIL -p $FRIENDLI_TOKEN ``` 2. Pull the image.
```sh docker pull registry.friendli.ai/trial ``` ### Run Friendli Container with a HuggingFace Model 1. Clone our [container resource](https://github.com/friendliai/container-resource) git repository. ```sh git clone https://github.com/friendliai/container-resource cd container-resource/quickstart/docker-compose ``` 2. Set up environment variables. ```sh export HF_MODEL_NAME="<...>" # Hugging Face model name (e.g., "meta-llama/Meta-Llama-3-8B-Instruct") export FRIENDLI_CONTAINER_SECRET="<...>" # Friendli container secret ``` If your model is a private or gated one, you also need to provide [HuggingFace Access Token](https://huggingface.co/settings/tokens). ```sh export HF_TOKEN="<...>" # HuggingFace Access Token ``` 3. Launch the Friendli Container. ```sh docker compose up -d ``` By default, the container will listen for inference requests at TCP port 8000 and a Grafana service will be available at TCP port 3000. You can change the designated ports using the following environment variables. For example, if you want to use TCP port 8001 and port 3001 for Grafana, execute the command below. ```sh export FRIENDLI_PORT="8001" export FRIENDLI_GRAFANA_PORT="3001" ``` Even though the machine has multiple GPUs, the container will make use of only one GPU, specifically the first GPU (`device_ids: ['0']`). You can edit `docker-compose.yaml` to change what GPU device the container will use. The downloaded HuggingFace model will be cached in the `$HOME/.cache/huggingface` directory. You may want to clean up this directory after completing this tutorial. ### Send Inference Requests You can now send inference requests to the running container. For information on all parameters that can be used in an inference request, please refer to [this document](/openapi). ```sh Chat Completion curl -X POST http://0.0.0.0:8000/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ "messages": [ {"role": "user", "content": "What makes a good leader?"} ], "max_tokens": 30 }' ``` ```sh Completion curl -X POST http://0.0.0.0:8000/v1/completions \ -H "Content-Type: application/json" \ -d '{ "prompt": "What makes a good leader?", "max_tokens": 30 }' ``` ```sh Tokenization curl -X POST http://0.0.0.0:8000/v1/tokenize \ -H "Content-Type: application/json" \ -d '{ "prompt": "What is generative AI?" }' ``` ```sh Detokenization curl -X POST http://0.0.0.0:8000/v1/detokenize \ -H "Content-Type: application/json" \ -d '{ "tokens": [ 128000, 3923, 374, 1803, 1413, 15592, 30 ] }' ``` Chat completion requests work only if the model's tokenizer config contains a `chat_template`. ### Monitor using Grafana Using your browser, open [http://0.0.0.0:3000/d/friendli-engine](http://0.0.0.0:3000/d/friendli-engine), and login with username `admin` and password `admin`. You can now access the dashboards showing useful engine metrics. ![Grafana Dashboard](https://mintlify.s3.us-west-1.amazonaws.com/friendliai/static/images/guides/container/grafana_template_dashboard_example.png) If you cannot open a browser directly in the GPU machine where the Friendli Container is running, you can use SSH to forward requests from the browser running on your PC to the GPU machine. ```sh # Change these variables to match your environment. LOCAL_GRAFANA_PORT=3000 # The number of the port in your PC. FRIENDLI_GRAFANA_PORT=3000 # The number of the port in the GPU machine. ssh "$GPU_MACHINE_ADDRESS" -L "$LOCAL_GRAFANA_PORT:0.0.0.0:$FRIENDLI_GRAFANA_PORT" ``` where `$GPU_MACHINE_ADDRESS` shall be replaced with the address of the GPU machine. 
You may also want to use `-l login_name` or `-p port` options to connect to the GPU machine using SSH. Then using your browser on the PC, open `http://0.0.0.0:$LOCAL_GRAFANA_PORT/d/friendli-engine`. ## Going Further Congratulations! You can now serve your LLM of choice using your hardware, with the power of the most efficient LLM serving engine on the planet. The following topics will help you go further through your AI endeavors. * **Multi-GPU Serving**: Although this tutorial is limited to using only one GPU, Friendli Container supports tensor parallelism and pipeline parallelism for multi-GPU inference. Check [Multi-GPU Serving](/guides/container/running_friendli_container#multi-gpu-serving) for more information. * **Serving Multi-LoRA Models**: You can deploy multiple customized LLMs without additional GPU resources. See [Serving Multi-LoRA Models](/guides/container/serving_multi_lora_models) to learn how to launch the container with your adapters. * **Serving Quantized Models**: Running quantized models requires an additional step of [execution policy search](/guides/container/optimizing_inference_with_policy_search). See [Serving Quantized Models](/guides/container/serving_quantized_models) to learn how to create an inference endpoint for quantized models. * **Serving MoE Models**: Running MoE (Mixture of Experts) models requires an additional step of [execution policy search](/guides/container/optimizing_inference_with_policy_search). See [Serving MoE Models](/guides/container/serving_moe_models) to learn how to create an inference endpoint for MoE models. If you are stuck or need help going through this tutorial, please ask for support by sending an email to [Support](mailto:support@friendli.ai). # Running Friendli Container Source: https://friendli.ai/docs/guides/container/running_friendli_container Friendli Container enables you to effortlessly deploy your generative AI model on your own machine. This tutorial will guide you through the process of running a Friendli Container. ## Introduction Friendli Container enables you to effortlessly deploy your generative AI model on your own machine. This tutorial will guide you through the process of running a Friendli Container. The current version of Friendli Container supports most of major generative language models. ## Prerequisites * Before you begin, make sure you have signed up for [Friendli Suite](https://suite.friendli.ai). **You can use Friendli Container free of charge for 60 days.** * Friendli Container currently only supports NVIDIA GPUs, so please prepare proper GPUs and a compatible driver by referring to [our required CUDA compatibility guide](/guides/container/cuda_compatibility). * Prepare a Friendli Token following [this guide](#preparing-friendli-token). * Prepare a Friendli Container Secret following [this guide](#preparing-container-secret). ### Preparing Friendli Token Friendli Token is the user credentials for logging into our container registry. 1. Sign in [Friendli Suite](https://suite.friendli.ai). 2. Go to **[Personal settings > Tokens](https://suite.friendli.ai/default-team/settings/tokens)** and click **'Create new token'**. 3. Save your created token value and export it as `FRIENDLI_TOKEN`. ### Preparing Container Secret Container secret is a secret code that is used to activate Friendli Container. You should pass the container secret as an environment variable to run the container image. 1. Sign in [Friendli Suite](https://suite.friendli.ai). 2. 
Go to **[Container > Container Secrets](https://suite.friendli.ai/default-team/container/secrets)** and click **'Create secret'**. 3. Save your created secret value and export it as `FRIENDLI_CONTAINER_SECRET`. **🔑 Secret Rotation** You can rotate the container secret for security reasons. If you rotate the container secret, a new secret will be created and the previous secret will be revoked automatically in 30 minutes. ## Pulling Friendli Container Image Log in to the Docker client using the Friendli Token created as outlined in [Preparing Friendli Token](#preparing-friendli-token). ```sh export FRIENDLI_EMAIL="YOUR ACCOUNT EMAIL ADDRESS" export FRIENDLI_TOKEN="YOUR FRIENDLI TOKEN" docker login registry.friendli.ai -u $FRIENDLI_EMAIL -p $FRIENDLI_TOKEN ``` ```sh docker pull registry.friendli.ai/trial:latest ``` **💰 60-Days Free Trial** During the 60-days free trial period, you can use `registry.friendli.ai/trial` image only. ## Running Friendli Container with Hugging Face Models If your model is in a [`safetensors`](https://huggingface.co/docs/safetensors/index) format, which is compatible with [Hugging Face transformers](https://huggingface.co/docs/transformers), you can serve the model directly with Friendli Container. The current version of Friendli Container supports direct loading of `safetensors` checkpoints for the following models (and corresponding Hugging Face transformers classes): * FLUX * Arctic (`ArcticForCausalLM`) * Baichuan (`BaichuanForCausalLM`) * Blenderbot (`BlenderbotForConditionalGeneration`) * BLOOM (`BloomForCausalLM`) * Cohere (`CohereForCausalLM`) * DBRX (`DbrxForCausalLM`) * DeepSeek (`DeepseekForCausalLM`) * DeepSeek (`DeepseekV2ForCausalLM`) * DeepSeek (`DeepseekV3ForCausalLM`) * EXAONE (`ExaoneForCausalLM`) * Falcon (`FalconForCausalLM`) * Gemma2 (`Gemma2ForCausalLM`) * Gemma (`GemmaForCausalLM`) * GPT2 (`GPT2LMHeadModel`) * GPT-J (`GPTJForCausalLM`) * GPT-NeoX (`GPTNeoXForCausalLM`) * Grok-1 (`Grok1ForCausalLM`) * Llama (`LlamaForCausalLM`) * Mistral (`MistralForCausalLM`) * Mixtral (`MixtralForCausalLM`) * Llama (`MllamaForConditionalGeneration`) * MPT (`MPTForCausalLM`) * MT5 (`MT5ForConditionalGeneration`) * OPT (`OPTForcausalLM`) * Phi3 (`Phi3ForCausalLM`) * Phi (`PhiForCausalLM`) * Phi MoE (`PhiMoEForCausalLM`, `PhimoeForCausalLM`) * Qwen2 (`Qwen2ForCausalLM`) * Qwen2-VL 72B Instruct (`Qwen2VLForConditionalGeneration`) * Solar (`SolarForCausalLM`) * StarCoder2 (`Starcoder2ForCausalLM`) * T5 (`T5ForConditionalGeneration`) If your model does not belong to one of the above model types, please [contact us](https://friendliai.canny.io/supported-model) for support. Here are the instructions to run Friendli Container to serve a Hugging Face model: ```sh # Fill the values of following variables. export HF_MODEL_NAME="" # Hugging Face model name (e.g., "meta-llama/Meta-Llama-3-8B-Instruct") export FRIENDLI_CONTAINER_SECRET="" # Friendli container secret docker run --gpus '"device=0"' -p 8000:8000 \ -e FRIENDLI_CONTAINER_SECRET=$FRIENDLI_CONTAINER_SECRET \ -v ~/.cache/huggingface:/root/.cache/huggingface \ registry.friendli.ai/trial \ --hf-model-name $HF_MODEL_NAME ``` The `[LAUNCH_OPTIONS]` should be replaced with [Launch Options for Friendli Container](#launch-options). By running the above command, you will have a running Docker container that exports an HTTP endpoint for handling inference requests. ### Multi-GPU Serving Friendli Container supports ***tensor parallelism*** and ***pipeline parallelism*** for multi-GPU inference. 
#### Tensor Parallelism Tensor parallelism is employed when serving large models that exceed the memory capacity of a single GPU, by distributing parts of the model's weights across multiple GPUs. To leverage tensor parallelism with the Friendli Container: 1. Specify multiple GPUs for `$GPU_ENUMERATION` (e.g., '"device=0,1,2,3"'). 2. Use `--num-devices` (or `-d`) option to specify the tensor parallelism degree (e.g., `--num-devices 4`). #### Pipeline Parallelism Pipeline parallelism splits a model into multiple segments to be processed across different GPU, enabling the deployment of larger models that would not otherwise fit on a single GPU. To exploit pipeline parallelism with the Friendli Container: 1. Specify multiple GPUs for `$GPU_ENUMERATION` (e.g., '"device=0,1,2,3"'). 2. Use `--num-workers` (or `-n`) option to specify the pipeline parallelism degree (e.g., `--num-workers 4`). **🆚 Choosing between Tensor Parallelism and Pipeline Parallelism** When deploying models with the Friendli Container, you have the flexibility to combine tensor parallelism and pipeline parallelism. We recommend exploring a balance between the two, based on their distinct characteristics. While tensor parallelism involves "expensive" ***all-reduce*** operations to aggregate partial results across all devices, pipeline parallelism relies on "cheaper" ***peer-to-peer*** communication. Thus, in limited network setup, such as PCIe networks, leveraging pipeline parallelism is preferable. Conversely, in rich network setup like NVLink, tensor parallelism is recommended due to its superior parallel computation efficiency. ### Advanced: Serving Quantized Models Running quantized models requires an additional step to search execution policy. See [Serving Quantized Models](/guides/container/serving_quantized_models) to learn how to create an inference endpoint for the quantized model. ### Advanced: Serving MoE Models Running MoE (Mixture of Experts) models requires an additional step to search execution policy. See [Serving MoE Models](/guides/container/serving_moe_models) to learn how to create an inference endpoint for the MoE model. ### Examples This is an example running [Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) with a single GPU. ```sh export FRIENDLI_CONTAINER_SECRET="" # Friendli container secret (leave it if it's already set in your environment) export HF_TOKEN="" # Access token from HuggingFace (see the caution below) docker run -p 8000:8000 --gpus '"device=0"' \ -e HF_TOKEN=$HF_TOKEN \ -e FRIENDLI_CONTAINER_SECRET=$FRIENDLI_CONTAINER_SECRET \ -v ~/.cache/huggingface:/root/.cache/huggingface \ registry.friendli.ai/trial \ --hf-model-name meta-llama/Llama-3.1-8B-Instruct ``` Since downloading `meta-llama/Llama-3.1-8B-Instruct` is allowed only for authorized users, you need to provide your [Hugging Face User Access Token](https://huggingface.co/docs/huggingface_hub/package_reference/environment_variables#hftoken) through `HF_TOKEN` environment variable. It works the same for all private repositories. This is an example running [Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct) with a multi-GPU setup. 
```sh {5, 11} export FRIENDLI_CONTAINER_SECRET="" # Friendli container secret (leave it if it's already set in your environment) export HF_TOKEN="" # Access token from HuggingFace (see the caution below) docker run -p 8000:8000 \ --ipc=host --gpus '"device=0,1"' \ -e HF_TOKEN=$HF_TOKEN \ -e FRIENDLI_CONTAINER_SECRET=$FRIENDLI_CONTAINER_SECRET \ -v ~/.cache/huggingface:/root/.cache/huggingface \ registry.friendli.ai/trial \ --hf-model-name meta-llama/Llama-3.1-70B-Instruct \ --num-devices 2 ``` Since downloading `meta-llama/Llama-3.1-70B-Instruct` is allowed only for authorized users, you need to provide your [Hugging Face User Access Token](https://huggingface.co/docs/huggingface_hub/package_reference/environment_variables#hftoken) through `HF_TOKEN` environment variable. It works the same for all private repositories. ## Sending Inference Requests We can now send inference requests to the running Friendli Container. For information on all parameters that can be used in an inference request, please refer to [this document](/openapi/serverless/chat-completions). ### Examples ```sh cURL curl -X POST http://0.0.0.0:8000/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ "messages": [ {"role": "user", "content": "What makes a good leader?"} ], "max_tokens": 30, "stream": true }' ``` ```python Python SDK # pip install friendli-client from friendli import Friendli client = Friendli(base_url="http://0.0.0.0:8000") stream = client.chat.completions.create( messages=[{"role": "user", "content": "Python is a popular"}], max_tokens=30, stream=True, ) for chunk in stream: print(chunk.text, end="", flush=True) ``` ## Options for Running Friendli Container ### General Options | Options | Type | Summary | Default | Required | | ----------- | ---- | -------------------------------------- | ------- | -------- | | `--version` | - | Print Friendli Container version. | - | ❌ | | `--help` | - | Print Friendli Container help message. | - | ❌ | ### Launch Options | Options | Type | Summary | Default | Required | | --------------------------------- | --------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------- | -------- | | `--web-server-port` | INT | Web server port. | 8000 | ❌ | | `--metrics-port` | INT | Prometheus metrics export port. | 8281 | ❌ | | `--hf-model-name` | TEXT | Model name hosted on the Hugging Face Models Hub or a path to a local directory containing a model. When a model name is provided, Friendli Container first checks if the model is already cached at \~/.cache/huggingface/hub and uses it if available. If not, it will download the model from the Hugging Face Models Hub before creating the inference endpoint. When a local path is provided, it will load the model from the location without downloading. This option is only available for models in a safetensors format. | - | ❌ | | `--tokenizer-file-path` | TEXT | Absolute path of tokenizer file. This option is not needed when `tokenizer.json` is located under the path specified at `--ckpt-path`. 
| - | ❌ | | `--tokenizer-add-special-tokens` | BOOLEAN | Whether or not to add special tokens in tokenization. Equivalent to Hugging Face Tokenizer's `add_special_tokens` argument. The default value is **false** for versions \< v1.6.0. | `true` | ❌ | | `--tokenizer-skip-special-tokens` | BOOLEAN | Whether or not to remove special tokens in detokenization. Equivalent to Hugging Face Tokenizer's `skip_special_tokens` argument. | `true` | ❌ | | `--dtype` | CHOICE: \[bf16, fp16, fp32] | Data type of weights and activations. Choose one of `bf16`, `fp16`, or `fp32`. This argument applies to non-quantized weights and activations. If not specified, Friendli Container follows the value of `torch_dtype` in the `config.json` file or assumes fp16. | fp16 | ❌ | | `--bad-stop-file-path` | TEXT | JSON file path that contains stop sequences or bad words/tokens. | - | ❌ | | `--num-request-threads` | INT | Thread pool size for handling HTTP requests. | 4 | ❌ | | `--timeout-microseconds` | INT | Server-side timeout for client requests, in microseconds. | 0 (no timeout) | ❌ | | `--ignore-nan-error` | BOOLEAN | If set to True, ignore NaN error. Otherwise, respond with a 400 status code if NaN values are detected while processing a request. | - | ❌ | | `--max-batch-size` | INT | Max number of sequences that can be processed in a batch. | 384 | ❌ | | `--num-devices`, `-d` | INT | Number of devices to use (i.e., tensor parallelism degree). | 1 | ❌ | | `--num-workers`, `-n` | INT | Number of workers to use in a pipeline (i.e., pipeline parallelism degree). | 1 | ❌ | | `--search-policy` | BOOLEAN | Searches for the best engine policy for the given combination of model, hardware, and parallelism degree. Learn more about policy search at [Optimizing Inference with Policy Search](/guides/container/optimizing_inference_with_policy_search). | false | ❌ | | `--terminate-after-search` | BOOLEAN | Terminates the engine container after the policy search. | false | ❌ | | `--algo-policy-dir` | TEXT | Path to directory containing the policy file. The default value is the current working directory. Learn more about policy search at [Optimizing Inference with Policy Search](/guides/container/optimizing_inference_with_policy_search). | current working dir | ❌ | | `--adapter-model` | TEXT | Add an adapter model with its adapter name and path in the form `<name>:<path>`. The path can be a name from the Hugging Face model hub. | - | ❌ | ### Model Specific Options #### T5 | Options | Type | Summary | Default | Required | | --- | --- | --- | --- | --- | | `--max-input-length` | INT | Maximum input length. | - | ✅ | | `--max-output-length` | INT | Maximum output length. | - | ✅ | # Running Friendli Container on SageMaker Source: https://friendli.ai/docs/guides/container/running_friendli_container_on_sagemaker Create a real-time inference endpoint in Amazon SageMaker with a Friendli Container backend. By utilizing Friendli Container in your SageMaker pipeline, you'll benefit from the Friendli Engine's speed and resource efficiency. ## Introduction This guide will walk you through creating a [real-time inference endpoint in Amazon SageMaker](https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints.html) with a Friendli Container backend. By utilizing Friendli Container in your SageMaker pipeline, you'll benefit from the Friendli Engine's speed and resource efficiency. We'll explore how to create inference endpoints using both the AWS Console and the boto3 Python SDK.
## General Workflow ![Lora Serving](https://mintlify.s3.us-west-1.amazonaws.com/friendliai/static/images/guides/container/sagemaker_workflow.png) 1. **Create a Model**: Within SageMaker Inference, define a new model by specifying the model artifacts in your S3 bucket and the Friendli container image from ECR. 2. **Configure the Endpoint**: Create a SageMaker Inference endpoint configuration by selecting the instance type and the number of instances required. 3. **Create the Endpoint**: Utilize the configured settings to launch a SageMaker Inference endpoint. 4. **Invoke the Endpoint**: Once deployed, send requests to your endpoint to receive inference responses. ## Prerequisite Before beginning, you need to push the Friendli Container image to an ECR repository on AWS. First, prepare the Friendli Container image by following the instructions in [**Pulling Friendli Container Image**](/guides/container/running_friendli_container/#pulling-friendli-container-image). Then, tag and push the image to the Amazon ECR repository as guided in [**Pushing a Docker image to an Amazon ECR private repository**](https://docs.aws.amazon.com/AmazonECR/latest/userguide/docker-push-ecr-image.html). ## Using the AWS Console Let's delve into the step-by-step instructions for creating an inference endpoint using the AWS Console. ### Step 1: Creating a Model You can start creating a model by clicking on the **Create model** button under **SageMaker > Inference > Models**. Then, configure the model with the following fields: * **Model settings**: * **Model name**: A model name. * **IAM role**: An IAM role that includes the `AmazonSageMakerFullAccess` policy. * **Container definition 1**: * **Container input option**: Select the "Provide model artifacts and inference image location". * **Model Compression Type**: * To use a model in the S3 bucket: * When the model is compressed, select "CompressedModel". * Otherwise, select "UncompressedModel". * When using a model from the Hugging Face hub, any option would work fine. * **Location of inference code image**: Specify the ARN of the ECR repo for the Friendli Container. * **Location of model artifacts** (optional): * To use a model in the S3 bucket: Specify the S3 URI where your model is stored. Ensure the file structure matches the directory format compatible with the `--hf-model-name` option of the Friendli Container. * When using a model from the Hugging Face hub, you can leave this field empty. * **Environment variables**: * Always required: * `FRIENDLI_CONTAINER_SECRET`: Your Friendli Container Secret. Refer to [**Preparing Container Secret**](/guides/container/running_friendli_container/#preparing-container-secret) to learn how to get the container secret. * `SAGEMAKER_MODE`: This should be set to `True`. * `SAGEMAKER_NUM_DEVICES`: Number of devices to use for tensor parallelism degree. * Required when using a model in the S3 bucket: * `SAGEMAKER_USE_S3`: This should be set to `True`. * Required when using a model from the Hugging Face hub: * `SAGEMAKER_HF_MODEL_NAME`: The Hugging Face model name (e.g., `mistralai/Mistral-7B-Instruct-v0.2`) * For private or gated model repos: * `HF_TOKEN`: The Hugging Face secret access token. ### Step 2: Creating an Endpoint Configuration You can start by clicking on the **Create endpoint configuration** button under **SageMaker > Inference > Endpoint configurations**. * **Endpoint configuration**: * **Endpoint configuration name**: The name of this endpoint configuration. 
* **Type of endpoint**: For real-time inference, select "Provisioned". * **Variants**: * To create a "Production" variant, click "Create production variant". * Select the model that you have created in [**Step 1**](#step-1-creating-a-model). * Configure the instance type and count by clicking on "Edit" in the Actions column. * Create the endpoint configuration by clicking "Create endpoint configuration". ### Step 3: Creating SageMaker Inference Endpoint You can start by clicking the **Create endpoint** button under **SageMaker > Inference > Endpoints**. * Select "Use an existing endpoint configuration". * Select the endpoint configuration created in [**Step 2**](#step-2-creating-an-endpoint-configuration). * Finish by clicking on the "Create endpoint" button. ### Step 4: Invoking Endpoint When the endpoint status becomes "In Service", you can invoke the endpoint with the following script, after filling in the endpoint name and the region name: ```python import boto3 import json endpoint_name = "FILL OUT ENDPOINT NAME" region_name = "FILL OUT AWS REGION" sagemaker_runtime = boto3.client("sagemaker-runtime", region_name=region_name) prompt = "Story title: 3 llamas go for a walk\nSummary: The 3 llamas crossed a bridge and something unexpected happened\n\nOnce upon a time" payload = { "prompt": prompt, "max_tokens": 512, "temperature": 0.8, } response = sagemaker_runtime.invoke_endpoint( EndpointName=endpoint_name, Body=json.dumps(payload), ContentType="application/json", ) print(response['Body'].read().decode('utf-8')) ``` ## Using the boto3 SDK Next, let's discover the process for creating a SageMaker endpoint using the boto3 Python SDK. You can achieve this by using the code snippet below. Be sure to fill in the custom fields, customized for your specific use case: ```python import boto3 from sagemaker import get_execution_role sm_client = boto3.client(service_name='sagemaker') runtime_sm_client = boto3.client(service_name='sagemaker-runtime') account_id = boto3.client('sts').get_caller_identity()['Account'] region = boto3.Session().region_name role = get_execution_role() endpoint_name="FILL OUT ENDPOINT NAME" model_name="FILL OUT MODEL NAME" container = "FILL OUT ECR IMAGE NAME" # .dkr.ecr..amazonaws.com/IMAGE instance_type = "ml.g5.12xlarge" # instance type container = { 'Image': container, 'Environment': { "HF_TOKEN": "", "FRIENDLI_CONTAINER_SECRET": "", "SAGEMAKER_HF_MODEL_NAME": "", # e.g) meta-llama/Meta-Llama-3-8B "SAGEMAKER_MODE": "True", # Should be true "SAGEMAKER_NUM_DEVICES": "4", # Number of GPUs in `instance_type` } } endpoint_config_name = 'FILL OUT ENDPOINT CONFIG NAME' # Create a model create_model_response = sm_client.create_model( ModelName=model_name, ExecutionRoleArn=role, Containers=[container], ) # Create an endpoint configuration create_endpoint_config_response = sm_client.create_endpoint_config( EndpointConfigName=endpoint_config_name, ProductionVariants=[ { 'InstanceType': instance_type, 'InitialInstanceCount': 1, 'InitialVariantWeight': 1, 'ModelName': model_name, 'VariantName': 'AllTraffic', }, ], ) endpoint_name = "FILL OUT ENDPOINT NAME" # Create an endpoint sm_client.create_endpoint( EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name, ) sm_client.describe_endpoint(EndpointName=endpoint_name) ``` You can invoke this endpoint by following [**Step 4**](#step-4-invoking-endpoint). 
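Endpoint creation is asynchronous, so the new endpoint stays in the `Creating` status for a few minutes. If you would rather block until it is ready instead of polling `describe_endpoint` yourself, here is a minimal sketch using boto3's built-in SageMaker waiter (the endpoint name is a placeholder, as above):

```python
import boto3

sm_client = boto3.client("sagemaker")
endpoint_name = "FILL OUT ENDPOINT NAME"  # same name used when creating the endpoint

# Block until the endpoint reaches the "InService" status (raises on failure or timeout).
waiter = sm_client.get_waiter("endpoint_in_service")
waiter.wait(EndpointName=endpoint_name)

print(sm_client.describe_endpoint(EndpointName=endpoint_name)["EndpointStatus"])
```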
By following these guides, you'll be able to seamlessly deploy your models using Friendli Container on SageMaker endpoints and leverage their capabilities for real-time inference. # Serving MoE Models Source: https://friendli.ai/docs/guides/container/serving_moe_models Explore the steps to serve Mixture of Experts (MoE) models such as Mixtral 8x7B using Friendli Container. ## Introduction This guide explores the steps to serve Mixture of Experts (MoE) models such as Mixtral 8x7B using Friendli Container. ## Search Optimal Policy and Running Friendli Container To serve MoE models efficiently, you need to run a policy search to explore the optimal execution policy. Learn how to run the policy search at [Running Policy Search](/guides/container/optimizing_inference_with_policy_search#running-policy-search). When the optimal policy is found, it is compiled into a policy file, which can be used for creating serving endpoints, and the engine serves the endpoint using that optimal policy. # Serving Multi-LoRA Models Source: https://friendli.ai/docs/guides/container/serving_multi_lora_models The Friendli Engine introduces an innovative approach to this challenge through Multi-LoRA (Low-Rank Adaptation) serving, a method that allows for the simultaneous serving of multiple LLMs, optimized for specific tasks without the need for extensive retraining. ## Introduction In a world where the demand for highly specialized AI capabilities is surging, the ability to deploy multiple customized large language models (LLMs) without additional GPU resources represents a significant leap forward. The Friendli Engine introduces an innovative approach to this challenge through Multi-LoRA (Low-Rank Adaptation) serving, a method that allows for the simultaneous serving of multiple LLMs, optimized for specific tasks without the need for extensive retraining. This advancement opens new avenues for AI efficiency and adaptability, promising to revolutionize the deployment of AI solutions on constrained hardware. This article provides an overview of efficiently serving Multi-LoRA models with the Friendli Engine. ![Lora Serving](https://mintlify.s3.us-west-1.amazonaws.com/friendliai/static/images/guides/container/lora.png) ## Prerequisite `huggingface-cli` should be installed in your local environment. ```sh pip install "huggingface_hub[cli]" ``` ## Downloading Adapter Checkpoints Each adapter model that you want to serve must be downloaded to your local storage. ```sh # Hugging Face model name of the adapters export ADAPTER_MODEL1="" export ADAPTER_MODEL2="" export ADAPTER_MODEL3="" export ADAPTER_DIR=/tmp/adapter huggingface-cli download $ADAPTER_MODEL1 \ --include "adapter_model.safetensors" "adapter_config.json" \ --local-dir $ADAPTER_DIR/model1 huggingface-cli download $ADAPTER_MODEL2 \ --include "adapter_model.safetensors" "adapter_config.json" \ --local-dir $ADAPTER_DIR/model2 huggingface-cli download $ADAPTER_MODEL3 \ --include "adapter_model.safetensors" "adapter_config.json" \ --local-dir $ADAPTER_DIR/model3 ... ``` This will result in a directory structure like: ``` /tmp/adapter/model1 - adapter_model.safetensors - adapter_config.json /tmp/adapter/model2 - adapter_model.safetensors - adapter_config.json /tmp/adapter/model3 - adapter_model.safetensors - adapter_config.json ``` If an adapter's Hugging Face repo does not contain an `adapter_model.safetensors` checkpoint file, you have to manually convert `adapter_model.bin` into `adapter_model.safetensors`.
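Here is a minimal sketch of that manual conversion (assuming `torch` and `safetensors` are installed, and using the adapter directory from the example above):

```python
import torch
from safetensors.torch import save_file

# Load the legacy PyTorch checkpoint and re-save it in safetensors format.
state_dict = torch.load("/tmp/adapter/model1/adapter_model.bin", map_location="cpu")
state_dict = {name: tensor.contiguous() for name, tensor in state_dict.items()}
save_file(state_dict, "/tmp/adapter/model1/adapter_model.safetensors")
```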
You can also use the [official app](https://huggingface.co/spaces/safetensors/convert) or the [python script](https://github.com/huggingface/safetensors/tree/main/bindings/python) for the conversion. ## Launch Friendli Engine in Container Once you have prepared the adapter model checkpoints, you can serve the Multi-LoRA model with Friendli Container. In addition to the command for running the base model, you have to add the `--adapter-model` argument. * `--adapter-model`: Adds an adapter model with an adapter name and path. The path can also be a model name from the Hugging Face Hub. ```sh # Fill in the values of the following variables. export HF_BASE_MODEL_NAME="" # Hugging Face base model name (e.g., "meta-llama/Llama-2-7b-chat-hf") export FRIENDLI_CONTAINER_SECRET="" # Friendli container secret export FRIENDLI_CONTAINER_IMAGE="" # Friendli container image (e.g., "registry.friendli.ai/trial") export GPU_ENUMERATION="" # GPUs (e.g., '"device=0,1"') export ADAPTER_NAME="" # The adapter's name (a user-defined alias). export ADAPTER_DIR=/tmp/adapter docker run \ --gpus $GPU_ENUMERATION \ -p 8000:8000 \ -v $ADAPTER_DIR:/adapter \ -e FRIENDLI_CONTAINER_SECRET=$FRIENDLI_CONTAINER_SECRET \ $FRIENDLI_CONTAINER_IMAGE \ --hf-model-name $HF_BASE_MODEL_NAME \ --adapter-model $ADAPTER_NAME:/adapter/model1 \ [LAUNCH_OPTIONS] ``` You can find available options for `[LAUNCH_OPTIONS]` at [Running Friendli Container: Launch Options](/guides/container/running_friendli_container#launch-options). If you want to launch with multiple adapters, you can pass `--adapter-model` a comma-separated string (e.g., `--adapter-model "adapter_name_0:/adapter/model1,adapter_name_1:/adapter/model2"`). If a `tokenizer_config.json` file is present in an adapter checkpoint path, the engine uses the chat template defined in that file for the adapter. ### Example: Llama 2 7B Chat + LoRA Adapter This is an example that runs [`meta-llama/Llama-2-7b-chat-hf`](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) with the [`FinGPT/fingpt-forecaster_dow30_llama2-7b_lora`](https://huggingface.co/FinGPT/fingpt-forecaster_dow30_llama2-7b_lora) adapter model. ```sh export ADAPTER_DIR=/tmp/adapter huggingface-cli download FinGPT/fingpt-forecaster_dow30_llama2-7b_lora \ --include "adapter_model.safetensors" "adapter_config.json" \ --local-dir $ADAPTER_DIR/model1 docker run \ --gpus '"device=0"' \ -p 8000:8000 \ -v $ADAPTER_DIR:/adapter \ -e FRIENDLI_CONTAINER_SECRET="YOUR CONTAINER SECRET" \ registry.friendli.ai/trial \ --hf-model-name meta-llama/Llama-2-7b-chat-hf \ --adapter-model adapter-model-name:/adapter/model1 ``` ## Sending Request to Specific Adapter You can generate an inference result from a specific adapter model by specifying `model` in the body of an inference request. For example, assuming you set the launch option `--adapter-model` to `<adapter-name>:<adapter-path>`, you can send a request to the adapter model as follows. ```sh curl -X POST http://0.0.0.0:8000/v1/completions \ -H "Content-Type: application/json" \ -d '{ "model": "adapter-model-name", "prompt": "Python is a language", "max_tokens": 30 }' ``` ## Sending Request to the Base Model If you omit the `model` field in your request, the base model will be used to generate the response. You can send a request to the base model as shown below. ```sh curl -X POST http://0.0.0.0:8000/v1/completions \ -H "Content-Type: application/json" \ -d '{ "prompt": "Python is a language", "max_tokens": 30 }' ``` ## Limitations We only support models compatible with [`peft`](https://github.com/huggingface/peft).
The base model checkpoint and adapter model checkpoints should have the same data type. When serving multiple adapters simultaneously, every adapter model should have the same target modules. In Hugging Face, the target modules are listed in `adapter_config.json`. # Serving Quantized Models Source: https://friendli.ai/docs/guides/container/serving_quantized_models Tutorial for serving quantized models with the Friendli Engine. The Friendli Engine supports FP8, INT8, and AWQ model checkpoints. ## Introduction Quantization is a technique that reduces the precision of a generative AI model's parameters, optimizing memory usage and inference speed while maintaining acceptable accuracy. This tutorial will walk you through the process of serving quantized models with Friendli Container. ## Off-the-Shelf Model Checkpoints from Hugging Face Hub To use model checkpoints that are already quantized and available on Hugging Face Hub, check the following options: * Checkpoints quantized with [friendli-model-optimizer](https://github.com/friendliai/friendli-model-optimizer) * [Quantized model checkpoints by FriendliAI](https://huggingface.co/FriendliAI) * A subset of models quantized with: * [`AutoAWQ`](https://github.com/casper-hansen/AutoAWQ) * [`AutoFP8`](https://github.com/neuralmagic/AutoFP8) * [`llm-compressor`](https://github.com/vllm-project/llm-compressor) For details on how to use these models, go directly to [Serving Quantized Models](#serving-quantized-models). ## Quantizing Your Own Models (FP8/INT8) To quantize your own models with FP8 or INT8, follow these steps: 1. **Install the `friendli-model-optimizer` package** This tool provides model quantization for efficient generative AI serving with the Friendli Engine. Install it using the following command: ```sh pip install "friendli-model-optimizer" ``` 2. **Prepare the Original Model** Ensure you have the original model checkpoint that can be loaded using Hugging Face's [`transformers`](https://github.com/huggingface/transformers) library. 3. **Quantize the Model with Friendli Model Optimizer (FMO)** You can simply run quantization with the command below: ```sh export MODEL_NAME_OR_PATH="" # Hugging Face pretrained model name or directory path of the original model checkpoint. export OUTPUT_DIR="" # Directory path to save the quantized checkpoint and related configurations. export QUANTIZATION_SCHEME="" # Quantization technique to apply. You can use fp8 or int8. export DEVICE="" # Device to run the quantization process. Defaults to "cuda:0". fmo quantize \ --model-name-or-path $MODEL_NAME_OR_PATH \ --output-dir $OUTPUT_DIR \ --mode $QUANTIZATION_SCHEME \ --device $DEVICE ``` When the model checkpoint is successfully quantized, the following files will be created at `$OUTPUT_DIR`. * `config.json` * `model.safetensors` * `special_tokens_map.json` * `tokenizer_config.json` * `tokenizer.json` If the size of the model exceeds **10GB**, multiple sharded checkpoints are generated as follows instead of a single `model.safetensors`. * `model-00001-of-00005.safetensors` * `model-00002-of-00005.safetensors` * `model-00003-of-00005.safetensors` * `model-00004-of-00005.safetensors` * `model-00005-of-00005.safetensors` For more information about FMO, check out [this documentation](https://github.com/friendliai/friendli-model-optimizer). ## Serving Quantized Models ### Search Optimal Policy To serve quantized models efficiently, you need to run a policy search to explore the optimal execution policy.
Learn how to run the policy search at [Running Policy Search](/guides/container/optimizing_inference_with_policy_search#running-policy-search). ### Serving FP8 Models Once you have prepared the quantized model checkpoint, you are ready to create a serving endpoint. ```sh # Fill in the values of the following variables. export HF_MODEL_NAME="" # Quantized model name in Hugging Face Hub or directory path of the quantized model checkpoint. export FRIENDLI_CONTAINER_SECRET="" # Friendli container secret export FRIENDLI_CONTAINER_IMAGE="" # Friendli container image (e.g., "registry.friendli.ai/trial") export GPU_ENUMERATION="" # GPUs (e.g., '"device=0,1"') export POLICY_DIR=$PWD/policy mkdir -p $POLICY_DIR docker run \ --gpus $GPU_ENUMERATION \ -p 8000:8000 \ -v ~/.cache/huggingface:/root/.cache/huggingface \ -v $POLICY_DIR:/policy \ -e FRIENDLI_CONTAINER_SECRET=$FRIENDLI_CONTAINER_SECRET \ $FRIENDLI_CONTAINER_IMAGE \ --hf-model-name $HF_MODEL_NAME \ --algo-policy-dir /policy \ --search-policy true ``` ### Example: `FriendliAI/Llama-3.1-8B-Instruct-fp8` FP8 model serving is only supported by NVIDIA **Ada**, **Hopper**, and **Blackwell** GPU architectures. ```sh # Fill in the values of the following variables. export FRIENDLI_CONTAINER_SECRET="" # Friendli container secret export FRIENDLI_CONTAINER_IMAGE="" # Friendli container image (e.g., "registry.friendli.ai/trial") export GPU_ENUMERATION="" # GPUs (e.g., '"device=0,1"') export POLICY_DIR=$PWD/policy # Make sure the policy search runs against this directory docker run \ --gpus $GPU_ENUMERATION \ -p 8000:8000 \ -v ~/.cache/huggingface:/root/.cache/huggingface \ -v $POLICY_DIR:/policy \ -e FRIENDLI_CONTAINER_SECRET=$FRIENDLI_CONTAINER_SECRET \ $FRIENDLI_CONTAINER_IMAGE \ --hf-model-name FriendliAI/Llama-3.1-8B-Instruct-fp8 --algo-policy-dir /policy --search-policy true ``` # Endpoints Source: https://friendli.ai/docs/guides/dedicated_endpoints/endpoints Endpoints are the actual deployments of your models on your specified GPU resource. export const RoundedBorderBox = ({children, caption}) =>
<div>
  {children}
  {caption && <div>
    {caption}
  </div>}
</div>
; ## What are Endpoints? Endpoints are the actual deployments of your models on a dedicated GPU resource. They provide a stable and efficient interface to serve your models in real-world applications, ensuring high availability and optimized performance. With endpoints, you can manage model versions, scale resources, and seamlessly integrate your model into production environments. ### Key Capabilities of Endpoints: * **Efficient Model Serving**: Deploy models on powerful GPU instances optimized for your use case. * **Flexibility with Multi-LoRA Models**: Serve multiple fine-tuned adapters alongside base models. * **Autoscaling**: Automatically adjust resources to handle varying workloads, ensuring optimal performance and cost efficiency. * **Monitoring and Management**: Check endpoint health, adjust configurations, and view logs directly from the platform. * **Interactive Testing**: Use the integrated playground to test your models before integrating them into applications. * **API Integration**: Access your models via robust OpenAI-compatible APIs, enabling easy integration into any system. ## Creating Endpoints You can create your endpoint by specifying the name, the model, and the instance configuration, consisting of your desired GPU specification. Endpoint Create ## Intelligent Autoscaling Autoscaling Config Our autoscaling system automatically adjusts computational resources based on your traffic patterns, helping you optimize both performance and costs. ### How Autoscaling Works * **Minimum Replicas**: * When set to 0, the endpoint enters sleeping status during periods of inactivity, helping to minimize costs * When set to a value greater than 0, the endpoint maintains at least that number of active replicas at all times * **Maximum Replicas**: Defines the upper limit of replicas that can be created to handle increased traffic load * **Cooldown Period**: The time delay before scaling down an active replica. This ensures the system doesn't prematurely reduce capacity during temporary drops in traffic. ### Benefits of Autoscaling * **Cost Optimization**: Pay only for the resources you need by automatically scaling to zero during idle periods * **Performance Management**: Handle traffic spikes efficiently by automatically adding replicas * **Resource Efficiency**: Maintain optimal resource utilization across varying workload patterns ## Serving Multi-LoRA Models You can serve Multi-LoRA models using Friendli Dedicated Endpoints. For an overview of Multi-LoRA models, refer to our [document on serving Multi-LoRA models with Friendli Container](/guides/container/serving_multi_lora_models). In Friendli Dedicated Endpoints, Multi-LoRA model is supported only in Enterprise plan. For pricing and availability, [Contact sales](https://friendli.ai/contact). ## Checking Endpoint Status After creating the Endpoint, you can view its health status and Endpoint URL on the Endpoint's details page. Endpoint Detail The cost of using dedicated endpoints accumulates from the `INITIALIZING` status. Specifically, charges begin after the `Initializing GPU` phase, where the endpoint waits to acquire the GPU. The endpoint then downloads and loads the model onto the GPU, which usually takes less than a minute. ## Using Playgrounds To test the deployed model via the web, we provide a playground interface where you can interact with the model using a user-friendly chat interface. Simply enter your query, adjust your settings, and generate your responses! 
Endpoint Playground Send inference queries to your model through our [API](/openapi) at the given endpoint address, accessible on the endpoint information tab. {/* TODO: add image for sending APIs */} # Frequently Asked Questions and Troubleshooting Source: https://friendli.ai/docs/guides/dedicated_endpoints/faq While following our tutorials, you might have had questions regarding the details of the requirements and specifications. We have listed the frequently asked questions in this separate document. export const RoundedBorderBox = ({children, caption}) =>
<div>
  {children}
  {caption && <div>
    {caption}
  </div>}
</div>
; While following our tutorials, you might have had questions regarding the details of the requirements and specifications. We have listed the frequently asked questions in this separate document. Please refer to the relevant information below: ## Format Requirements ### General requirements for a model * A model should be in safetensors format. * The model should NOT be nested inside another directory. * Including other arbitrary files (that are not in the list) is totally fine. However, those files will not be downloaded or used. | Required | Filename | Description | | -------- | ------------------------- | -------------------------------------------------------------------------------------------------------------------- | | Yes | *safetensors* | Model weight, e.g. model.safetensors. Use model.safetensors.index.json for split safetensors files | | Yes | config.json | Model config that includes the architecture. ([Supported Models on Friendli](https://friendli.ai/models)) | | Yes | tokenizer.json | Tokenizer for the model | | No | tokenizer\_config.json | Tokenizer config. This should be present & have a `chat_template` field for the Friendli Engine to provide chat APIs | | No | special\_tokens\_map.json | | ### General requirements for a dataset * Read our documentation on the [fine-tuning dataset format](/guides/dedicated_endpoints/fine-tuning#dataset-format) for information on the dataset requirements. ## 3rd-party account integration Personal settings ### How to integrate a Hugging Face account * [Log in to Hugging Face, then navigate to user settings → access tokens → User Access Tokens. Acquire a token.](https://huggingface.co/settings/tokens) * You may use a fine-grained token. In this case, please make sure the token has view permission for the repository you'd like to use. * [Integrate the key in Friendli Suite → Personal settings → Account → Integrations](https://suite.friendli.ai/default-team/settings/account) If you revoke / invalidate the key, you will have to update the key in order not to disrupt ongoing deployments, or to launch a new inference deployment / fine-tuning job. ### How to integrate a W\&B account * [Log in to W\&B, then navigate to user settings → danger zone → API keys.
Acquire a token.](https://wandb.ai/settings#api) * [Integrate the key in Friendli Suite → Personal settings → Account → Integrations](https://suite.friendli.ai/default-team/settings/account) If you revoke / invalidate the key, you will have to update the key in order to not disrupt ongoing deployments, or to launch a new inference deployment / fine-tuning job #### Extra: How to upload a safetensors format model to W\&B using W\&B CLI * Install the cli and log in using the API key → [Command Line Interface | Weights & Biases Documentation](https://docs.wandb.ai/ref/cli) * Upload the model as an W\&B artifact using the command below ``` wandb artifact put -n project/artifact_id --type model /path/to/dir ``` * With all this, the W\&B artifact will look like this: ![W\&B artifact](https://mintlify.s3.us-west-1.amazonaws.com/friendliai/static/images/guides/dedicated_endpoints/faq/wandb_artifact.png) ## Using 3rd-party model ### How to use a W\&B artifact as a model ![W\&B artifact as a model](https://mintlify.s3.us-west-1.amazonaws.com/friendliai/static/images/guides/dedicated_endpoints/faq/wandb_model.png) * Use the full name of the artifact * The *artifact name* must be in the format of: `org/project/artifact_id:version` ### How to use a Hugging Face repository as a model ![HF artifact as a model](https://mintlify.s3.us-west-1.amazonaws.com/friendliai/static/images/guides/dedicated_endpoints/faq/hf_model.png) * Use the repository id of the model. You may select the entry from the list of autocompleted model repositories. * You may choose specific branch, or manually enter a commit hash. ## Using W\&B with Dedicated Fine-tuning * When launching a fine-tuning job, you can designate a W\&B project that the metrics will be exported to. If you provide a W\&B project name that already exists, your job will be added to that project. Otherwise, a new W\&B project will be automatically created in your integrated W\&B account. If the project name is not provided, it defaults to "friendliai". ![W\&B project](https://mintlify.s3.us-west-1.amazonaws.com/friendliai/static/images/guides/dedicated_endpoints/faq/wandb_project.png) * As the training starts, you will be able to see a new "Run" in the project you chose. ![W\&B Run](https://mintlify.s3.us-west-1.amazonaws.com/friendliai/static/images/guides/dedicated_endpoints/faq/wandb_run.png) * By clicking the project, you can easily track & monitor the status of the training job. ![W\&B Log](https://mintlify.s3.us-west-1.amazonaws.com/friendliai/static/images/guides/dedicated_endpoints/faq/wandb_log.png) If new runs are not displayed in your project, please check that the default team is set correctly on [W\&B user settings](https://wandb.ai/settings). ![W\&B Default team](https://mintlify.s3.us-west-1.amazonaws.com/friendliai/static/images/guides/dedicated_endpoints/faq/wandb_default_team.png) ## Troubleshooting ### Can't access the artifact ![Troubleshooting - can't access](https://mintlify.s3.us-west-1.amazonaws.com/friendliai/static/images/guides/dedicated_endpoints/faq/troubleshooting_cant_access.png) * The artifact might be nonexistent, or hidden so that you cannot access it. ### You don't have access to this gated model ![Troubleshooting - no access](https://mintlify.s3.us-west-1.amazonaws.com/friendliai/static/images/guides/dedicated_endpoints/faq/troubleshooting_no_access.png) * The repository is gated. Please follow the steps and gain approval from the owner using Hugging Face Hub. 
### The repository / artifact is invalid ![Troubleshooting - invalid repo](https://mintlify.s3.us-west-1.amazonaws.com/friendliai/static/images/guides/dedicated_endpoints/faq/troubleshooting_invalid_repo.png) ![Troubleshooting - invalid artifact](https://mintlify.s3.us-west-1.amazonaws.com/friendliai/static/images/guides/dedicated_endpoints/faq/troubleshooting_invalid_artifact.png) * The model does not meet the requirements. Please check if the model follows a correct safetensors format. ### The architecture is not supported ![Troubleshooting - unsupported](https://mintlify.s3.us-west-1.amazonaws.com/friendliai/static/images/guides/dedicated_endpoints/faq/troubleshooting_unsupported.png) * The model architecture is not supported. Please refer to [Supported Models on Friendli](https://friendli.ai/models). # Fine-tuning Source: https://friendli.ai/docs/guides/dedicated_endpoints/fine-tuning Effortlessly fine-tune your model with Friendli Dedicated Endpoints, which leverages the Parameter-Efficient Fine-Tuning (PEFT) method to reduce training costs while preserving model quality, similar to full-parameter fine-tuning. export const RoundedBorderBox = ({children, caption}) =>
<div>
  {children}
  {caption && <div>
    {caption}
  </div>}
</div>
; ### In order to fine-tune large generic models for your specific purpose, you may fine-tune models on Friendli Dedicated Endpoints. Effortlessly fine-tune your model with [Friendli Dedicated Endpoints](https://friendli.ai/products/dedicated-endpoints), which leverages the Parameter-Efficient Fine-Tuning (PEFT) method to reduce training costs while preserving model quality, similar to full-parameter fine-tuning. This can make your model become an expert on a specific topic, and prevent hallucinations from your model. ## Table of Contents 1. **[How to Select Your Base Model](#how-to-select-your-base-model)** 2. **[How to Upload Your Dataset](#how-to-upload-your-dataset)** 3. **[How to Create Your Fine-tuning Job](#how-to-create-your-fine-tuning-job)** 4. **[How to Monitor Progress](#how-to-monitor-progress)** 5. **[How to Deploy the Fine-tuned Model](#how-to-deploy-the-fine-tuned-model)** 6. **[Resources](#resources)** By the end of this guide, you will understand how you can effectively fine-tune your generative AI models by using Friendli Dedicated Endpoints. ## How to Select Your Base Model Through our (1) Hugging Face Integration and (2) Weights & Biases (W\&B) Integration, you can select the base model to fine-tune. Explore and find open-source models that are supported on Friendli Dedicated Endpoints [here](https://friendli.ai/models). For guidance on the necessary format and file requirements, especially when using your own models, review the FAQ section on [general requirements for a model](/guides/dedicated_endpoints/faq#general-requirements-for-a-model). * **Hugging Face Model** ![Hugging Face Model](https://mintlify.s3.us-west-1.amazonaws.com/friendliai/static/images/guides/dedicated_endpoints/finetuning/hf_model.png) * **Weights & Biases Model** ![Weights & Biases Model](https://mintlify.s3.us-west-1.amazonaws.com/friendliai/static/images/guides/dedicated_endpoints/finetuning/wandb_model.png) ### Hugging Face Integration Integrate your [Hugging Face account](https://huggingface.co) to access your private repo or a gated repo. Go to [**Personal settings > Account > Hugging Face integration**](https://suite.friendli.ai/default-team/settings/account) and save your [Hugging Face access token](https://huggingface.co/docs/hub/security-tokens). This access token will be used upon creating your fine-tuning jobs. Check our FAQ section on [using a Hugging Face repository as a model](/guides/dedicated_endpoints/faq#how-to-use-a-hugging-face-repository-as-a-model) and [integrating a Hugging Face account](/guides/dedicated_endpoints/faq#how-to-integrate-a-hugging-face-account) for more detailed integration information. ### Weights & Biases (W\&B) Integration Integrate your [Weights & Biases account](https://wandb.ai/site) to access your model artifact. Go to [**Personal settings > Account > Weights & Biases integration**](https://suite.friendli.ai/default-team/settings/account) and save your Weights & Biases API key, which you can obtain [here](https://wandb.ai/settings#api). This API key will be used upon creating your fine-tuning jobs. Check our FAQ section on [using a W\&B artifact as a model](/guides/dedicated_endpoints/faq#how-to-use-a-w-and-b-artifact-as-a-model) and [integrating a W\&B account](/guides/dedicated_endpoints/faq#how-to-integrate-a-w-and-b-account) for more detailed integration information. ## How to Upload Your Dataset Navigate to the 'Datasets' section within your dedicated endpoints project page to upload your fine-tuning dataset. 
Enter the dataset name, then either drag and drop your .jsonl training and validation files or browse for them on your computer. If your files meet the required criteria, the blue 'Upload' button will be activated, allowing you to complete the process. ![Upload Dataset](https://mintlify.s3.us-west-1.amazonaws.com/friendliai/static/images/guides/dedicated_endpoints/finetuning/upload_dataset.png) ![Uploaded Dataset](https://mintlify.s3.us-west-1.amazonaws.com/friendliai/static/images/guides/dedicated_endpoints/finetuning/uploaded_dataset.png) ### Dataset Format The dataset used for fine-tuning should satisfy the following conditions: 1. The dataset must contain a column named **"messages"**, which will be used for fine-tuning. 2. Each row in the "messages" column should be compatible with the chat template of the base model. For example, [`tokenizer_config.json`](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2/blob/41b61a33a2483885c981aa79e0df6b32407ed873/tokenizer_config.json#L42) of `mistralai/Mistral-7B-Instruct-v0.2` is a template that repeats the messages of a user and an assistant. Concretely, each row in the "messages" field should follow a format like: `[{"role": "user", "content": "The 1st user's message"}, {"role": "assistant", "content": "The 1st assistant's message"}]`. In this case, `HuggingFaceH4/ultrachat_200k` is a dataset that is compatible with the chat template. You can access our example dataset ['FriendliAI/gsm8k' on Hugging Face](https://huggingface.co/datasets/FriendliAI/gsm8k) and explore some of our quantized generative AI models on [our Hugging Face page](https://huggingface.co/FriendliAI). ## How to Create Your Fine-tuning Job Navigate to the 'Fine-tuning' section within your dedicated endpoints project page to launch and view your fine-tuning jobs. You can view the training progress in a job's detail page by clicking on the fine-tuning job. To create a new fine-tuning job, follow these steps: 1. Go to your project and click on the **Fine-tuning** tab. 2. Click **New job**. 3. Fill out the job configuration based on the following field descriptions: * **Job name**: Name of fine-tuning job to create. * **Model**: Hugging Face Models repository or Weights & Biases model artifact name. * **Dataset**: Your uploaded fine-tuning dataset. * **Weights & Biases (W\&B)**: Optional for W\&B integration. * **W\&B project**: Your W\&B project name. * **Hyperparameters**: Fine-tuning Hyperparameters. * **`Learning rate`**: Initial learning rate for AdamW optimizer. * **`Batch size`**: Total training batch size. * **Total number of training**: Configure the number of training cycles with either `Number of training epochs` or `Training steps`. * **`Number of training epochs`**: Total number of training epochs. * **`Training steps`**: Total number of training steps. * **`Evaluation steps`**: Number of steps between model evaluation using the validation dataset. * **`LoRA rank`**: The rank of the LoRA parameters (optional). * **`LoRA alpha`**: Scaling factor that determines the influence of the low-rank matrices during fine-tuning (optional). * **`LoRA dropout`**: Dropout rate applied during fine-tuning (optional). 4. Click the **Create** button to create a job with the input configuration. ## How to Monitor Progress After launching the fine-tuning job, you can monitor the job overview, including progress information and fine-tuning configuration. If you have integrated your Weights & Biases (W\&B) account, you can also monitor the training status in your W\&B project. 
Read our FAQ section on [using W\&B with dedicated fine-tuning](/guides/dedicated_endpoints/faq#using-w-and-b-with-dedicated-fine-tuning) to learn more about monitoring you fine-tuning jobs on their platform. ## How to Deploy the Fine-tuned Model Once the fine-tuning process is complete, you can immediately deploy the model by clicking the 'Deploy' button in the top right corner. The name of the fine-tuned LoRA adapter will be the same as your fine-tuning job name. Fine-tuning Done The steps to deploy the fine-tuned model are equivalent to how you would deploy a custom model on Friendli Dedicated Endpoints. For further information, refer to the [Endpoints documentation](/guides/dedicated_endpoints/endpoints) for more detailed information on launching a model. ## Resources * [Supported open-source models](https://friendli.ai/models) * ['FriendliAI/gsm8k' on Hugging Face](https://huggingface.co/datasets/FriendliAI/gsm8k) * [FAQ on general requirements for a model](/guides/dedicated_endpoints/faq#general-requirements-for-a-model) * [FAQ on using a Hugging Face repository as a model](/guides/dedicated_endpoints/faq#how-to-use-a-hugging-face-repository-as-a-model) * [FAQ on integrating a Hugging Face account](/guides/dedicated_endpoints/faq#how-to-integrate-a-hugging-face-account) * [FAQ on using a W\&B artifact as a model](/guides/dedicated_endpoints/faq#how-to-use-a-w-and-b-artifact-as-a-model) * [FAQ on integrating a W\&B account](/guides/dedicated_endpoints/faq#how-to-integrate-a-w-and-b-account) * [FAQ on using W\&B with dedicated fine-tuning](/guides/dedicated_endpoints/faq#using-w-and-b-with-dedicated-fine-tuning) * [Endpoints documentation on model deployment](/guides/dedicated_endpoints/endpoints) # Deploy with Hugging Face Models Source: https://friendli.ai/docs/guides/dedicated_endpoints/huggingface_tutorial Hands-on tutorial for launching and deploying LLMs using Friendli Dedicated Endpoints with Hugging Face models. export const RoundedBorderBox = ({children, caption}) =>
<div>
  {children}
  {caption && <div>
    {caption}
  </div>}
</div>
; #### Hands-on Tutorial Deploying `meta-llama-3-8b-instruct` LLM from Hugging Face using Friendli Dedicated Endpoints ## Introduction With Friendli Dedicated Endpoints, you can easily spin up scalable, secure, and highly available inference deployments, without the need for extensive infrastructure expertise or significant capital expenditures. This tutorial is designed to guide you through the process of launching and deploying LLMs using Friendli Dedicated Endpoints. Through a series of step-by-step instructions and hands-on examples, you'll learn how to: * Select and deploy pre-trained LLMs from Hugging Face repositories * Deploy and manage your models using the Friendli Engine * Monitor and optimize your inference deployments By the end of this tutorial, you'll be equipped with the knowledge and skills necessary to unlock the full potential of LLMs in your applications, products, and services. So, let's get started and explore the possibilities of Friendli Dedicated Endpoints! ## Prerequisites: * A Friendli Suite account with access to [Friendli Dedicated Endpoints](https://suite.friendli.ai) * A Hugging Face account with access to the [meta-llama-3-8b-instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) model ## Step 1: Create a new endpoint 1. Log in to your Friendli Suite account and navigate to the Friendli Dedicated Endpoints dashboard. 2. If not done already, start the free trial for Dedicated Endpoints. 3. Create a new project, then click on the "New Endpoint" button. 4. Fill in the basic information: * Endpoint name: Choose a unique name for your endpoint (e.g., "My New Endpoint"). 5. Select the model: Hugging Face Model Search * Model Repository: Select "Hugging Face" as the model provider. * Model ID: Enter "meta-llama/Meta-Llama-3-8B-Instruct" as the model id. As the search bar loads the list, click on the top result that exactly matches the repository id. By default, the model pulls the latest commit on the default branch of the model. You may manually select a specific branch / tag / commit instead. If you're using your own model, check [Format Requirements](/guides/dedicated_endpoints/faq#format-requirements) for requirements. 6. Select the instance: Select instance * Instance configuration: Choose a suitable instance type based on your performance requirements. We suggest 1x A100 80G for most models. In some cases where the model's size is big, some options may be restricted as they are guaranteed to not run due to insufficient VRAM. Low Memory Warning 7. Edit the configurations: Autoscaling Config
Engine Config * Autoscaling: By default, the autoscaling ranges from 0 to 2 replicas. This means that the deployment will sleep when it's not being used, which reduces cost. * Advanced configuration: Some LLM options including the batch size and token configurations are mutable. For this tutorial, we'll leave it as-is. 8. Click "Create" to create a new endpoint. ## Step 2: Test the endpoint 1. Wait for the deployment to be created and initialized. This may take a few minutes. You may check the status by the indicator under the endpoint's name. Initializing Endpoint 2. In the "Playground" section, you may enter a sample input prompt (e.g., "What is the capital of France?"). 3. Click on the right arrow button to send the inference request. Playground 4. If you are an enterprise user, you can use the "Metrics" and "Logs" section to monitor the endpoint. Metrics
Logs ## Step 3: Send requests by using cURL or Python 1. As instructed in our [API docs](/openapi/serverless/chat-completions), you can send instructions with the following code: ```sh cURL curl -X POST https://api.friendli.ai/dedicated/v1/completions \ -H "Content-Type: application/json" \ -H "Authorization: Bearer $FRIENDLI_TOKEN" \ -d '{ "model": "$ENDPOINT_ID", "prompt": "What is the capital of France?", "max_tokens": 200, "top_k": 1 }' ``` ```python Python import requests import json import os url = 'https://api.friendli.ai/dedicated/v1/completions' payload = json.dumps({ "model": f"{os.environ['ENDPOINT_ID']}", "max_tokens": 200, "top_k": 1, "prompt": "What is the capital of France?" }) headers = { "Content-Type": "application/json", "Accept": "application/json", "Authorization": f"Bearer {os.environ['FRIENDLI_TOKEN']}" } response = requests.request("POST", url, headers=headers, data=payload) print(response.text) ``` ## Step 4: Update the endpoint 1. You can update the model and change almost everything by clicking the update button. # Introducing Friendli Dedicated Endpoints Source: https://friendli.ai/docs/guides/dedicated_endpoints/introduction Friendli Dedicated Endpoints gives you the reins to explore the full potential of your custom generative AI models on the hardware of your choice, whether you're crafting innovative eloquent texts, generating stunning images, or even more. Friendli Dedicated Endpoints (previously known as **PeriFlow Cloud**) gives you the reins to explore the full potential of your custom generative AI models on the hardware of your choice, whether you're crafting innovative eloquent texts, generating stunning images, or even more. ## What are Friendli Dedicated Endpoints? Don't be limited to pre-trained models. Friendli Dedicated Endpoints lets you take center stage: * **Seamless Serving, Powered by the Friendli Engine**: Experience the magic of the Friendli Engine, our patented GPU-optimized serving technology. Sit back and watch as your models come to life with automatically optimized performances, orchestrated seamlessly by Friendli Dedicated Endpoints. * **Choose or Upload Your Model**: Use your own custom models that are tailored to your specific needs and purposes. Otherwise, simply choose from the open-source models available on [HuggingFace](https://huggingface.co). Text generation, image creation, code synthesis – the possibilities are limitless. * **Control Your Instance**: Select the perfect GPU for your model. The GPU resources are dedicated entirely to your generative AI models. No sharing is required. * **Per-second Billing, Worry-free Optimization**: Focus on your creative pursuits, not cost management. Pay only for the seconds your model runs, eliminating the burden of manual optimization. Let Friendli Dedicated Endpoints handle the heavy lifting. * **Proven Reliability for Real-World Success**: Trusted by leading companies, Friendli Dedicated Endpoints delivers robust performance for even the most demanding workloads. ## Getting Started with Friendli Dedicated Endpoints: Ready to step up your generative AI game? Getting started is as simple as: 1. **Sign Up for a Free Account**: Experience the power of Friendli Dedicated Endpoints risk-free. 2. **Choose or Upload Your Model**: Harness your own custom-trained creation or simply select an open-source model. 3. **Launch Your GPU Instance**: Select the perfect GPU for your model. 4. **Get Your Endpoint Address**: Your gateway to unleashing your model's magic. 5. 
**Fine-tune Your Model**: Optionally, you can fine-tune your generic model for your specific needs. 6. **Send Your Input**: Prompt your model, send your queries, and let your creativity flow. 7. **Witness the Magic**: Sit back and marvel as your custom model delivers stunningly fast outputs, tailored to your specific needs. Friendli Dedicated Endpoints is more than just an AI serving platform - it's a launchpad for your creative ambitions. Dive into the website ([https://friendli.ai](https://friendli.ai)) and blog ([https://friendli.ai/blog](https://friendli.ai/blog)) to discover deeper insights, use cases, and customer testimonials. In our documentations, you can find how you can (1) manage your [projects](/guides/dedicated_endpoints/projects) and (2) [models](/guides/dedicated_endpoints/models), and (3) make them come to life on your [endpoints](/guides/dedicated_endpoints/endpoints), as well as to (4) [fine-tune](/guides/dedicated_endpoints/fine-tuning) them for your specific purposes. To quickly have a look at our service, take a look at our [quickstart](/guides/dedicated_endpoints/quickstart) document. Reserve your own GPUs for your model! It's time to run your own models cost-efficiently with Friendli Dedicated Endpoints! ## Additional Resources: * FriendliAI website: [https://friendli.ai](https://friendli.ai) * FriendliAI blog: [https://friendli.ai/blog](https://friendli.ai/blog) # Models Source: https://friendli.ai/docs/guides/dedicated_endpoints/models Within your Friendli Dedicated Endpoints projects you can prepare and manage the models that you wish to deploy. You may upload your models within your project to deploy them directly on your endpoints. Alternatively, you may manage them on the HuggingFace repository or Weights & Biases artifacts, as our endpoints can load models from your project, HuggingFace repositories, and Weights & Biases artifacts. ### Within your project, you can prepare and manage the models that you wish to deploy. You may upload your models within your project to deploy them directly on your endpoints. Alternatively, you may manage them on the HuggingFace repository or Weights & Biases artifacts, as our endpoints can load models from your project, HuggingFace repositories, and Weights & Biases artifacts. * At the moment, we support loading models from your uploaded model, HuggingFace repositories, and Weights & Biases artifacts. ![HuggingFace](https://mintlify.s3.us-west-1.amazonaws.com/friendliai/static/images/guides/dedicated_endpoints/hugging_face.png) # Pricing Source: https://friendli.ai/docs/guides/dedicated_endpoints/pricing Friendli Dedicated Endpoints pricing detail page. Friendli Dedicated Endpoints offer pricing with flexible monthly billing based on actual usage. | Endpoint | GPU Type | Basic | Enterprise | | -------- | --------- | ------------ | ------------- | | | H100 80GB | \$5.6 / hour | Contact sales | | | A100 80GB | \$2.9 / hour | Contact sales | | Fine-tuning | Model | Basic | Enterprise | | ----------- | ------------------ | ------------------ | ------------- | | | Models ≤ 16B | \$0.50 / 1M tokens | Contact sales | | | Models 16.1B - 72B | \$3.00 / 1M tokens | Contact sales | Contact sales for a discounted custom pricing plan for your enterprise. For more information on pricing and feature comparisons between basic and enterprise plans, please visit our [pricing page](https://friendli.ai/pricing/dedicated-endpoints). 
# Projects Source: https://friendli.ai/docs/guides/dedicated_endpoints/projects Friendli Dedicated Endpoints projects are a basic working unit for your team. ### Projects are a basic working unit for your team. You can freely add and remove members to control access to your project. * You can view your list of projects. ![Project List](https://mintlify.s3.us-west-1.amazonaws.com/friendliai/static/images/guides/dedicated_endpoints/projects_list.png) * For project settings, you can view your project ID and manage the members who have access to your project. ![Project Settings](https://mintlify.s3.us-west-1.amazonaws.com/friendliai/static/images/guides/dedicated_endpoints/project_settings.png) * To add a member to your project, simply enter their names or emails and hit the add button. ![Project AddMember](https://mintlify.s3.us-west-1.amazonaws.com/friendliai/static/images/guides/dedicated_endpoints/project_settings_addmember.png) In order for a user to have access to the project, they should have been granted access to Friendli Dedicated Endpoints by the team administrators from the team settings. ![Team Members](https://mintlify.s3.us-west-1.amazonaws.com/friendliai/static/images/guides/dedicated_endpoints/team_members.png) # QuickStart: Friendli Dedicated Endpoints Source: https://friendli.ai/docs/guides/dedicated_endpoints/quickstart Learn how to get started with Friendli Dedicated Endpoints in this step-by-step guide. Create an account, select your project, choose a model you wish to serve, deploy your endpoint, and seamlessly generate text, code, and more with ease. export const RoundedBorderBox = ({children, caption}) =>
<div>
  {children}
  {caption && <div>
    {caption}
  </div>}
</div>
; ## 1. Log In or Sign Up * If you have an account, log in using your preferred SSO or email/password combination. * If you're new to FriendliAI, create an account for free. ![Login](https://mintlify.s3.us-west-1.amazonaws.com/friendliai/static/images/guides/serverless_endpoints/login.png) ## 2. Access Friendli Dedicated Endpoints * On your dashboard, find the "Friendli Dedicated Endpoints" section. * If unauthorized, ask your team admin to provide access to the Friendli Dedicated Endpoints at the team settings. ![Dashboard Unauthorized](https://mintlify.s3.us-west-1.amazonaws.com/friendliai/static/images/guides/dedicated_endpoints/dashboard_unauthorized.png) ![Team Members](https://mintlify.s3.us-west-1.amazonaws.com/friendliai/static/images/guides/dedicated_endpoints/team_members.png) ![Dashboard Authorized](https://mintlify.s3.us-west-1.amazonaws.com/friendliai/static/images/guides/dedicated_endpoints/dashboard_authorized.png) ## 3. Select Your Project * Either create a new project, or choose from your existing projects for your workload. ![Project List](https://mintlify.s3.us-west-1.amazonaws.com/friendliai/static/images/guides/dedicated_endpoints/projects_list.png) ## 4. Prepare Your Model * Choose a model that you wish to serve from HuggingFace, Weights & Biases, or upload your custom model on our cloud. ![HuggingFace](https://mintlify.s3.us-west-1.amazonaws.com/friendliai/static/images/guides/dedicated_endpoints/hugging_face.png) ## 5. Deploy Your Endpoint * Deploy your endpoint, using the model of your choice prepared from step 3, and the instance equipped with your desired GPU specification. * You can also configure your replicas and the max-batch-size for your endpoint. Endpoint Create
Endpoint Detail ## 6. Generate Responses * You can generate your responses in two ways: playground and endpoint address. * Try out and test generating responses on your custom model using a chatGPT-like interface at the playground tab. Endpoint Playground * For general usages, send queries to your model through our [API](/openapi) at the given endpoint address, accessible on the endpoint information tab. ### Generating Responses Through the Endpoint URL Refer to [this guide](/guides/personal_access_tokens) for general instructions on Friendli Token. ```sh cURL # Send inference request to a running Friendli Dedicated Endpoints using a `curl` command. curl -X POST https://api.friendli.ai/dedicated/v1/completions \ -H "Content-Type: application/json" \ -H "Authorization: Bearer $FRIENDLI_TOKEN" \ -d '{ "model": "$ENDPOINT_ID", "prompt": "Python is a popular", "min_tokens": 20, "max_tokens": 30, "top_k": 32, "top_p": 0.8, "n": 3, "no_repeat_ngram": 3, "ngram_repetition_penalty": 1.75 }' ``` ```python Python SDK # pip install friendli-client # Send inference request to a Friendli Dedicated Endpoints using Python SDK. import os from friendli import Friendli client = Friendli( base_url="https://api.friendli.ai/dedicated", token=os.getenv("FRIENDLI_TOKEN"), endpoint_id="ENDPOINT_ID", ) chat_completion = client.chat.completions.create( messages=[ { "role": "user", "content": "Tell me how to make a delicious pancake" } ], stream=False, ) print(chat_completion.choices[0].message.content) ``` {/* TODO: add image for sending APIs */} For a more detailed tutorial for your usage, please refer to our tutorial for using [HuggingFace models](/guides/dedicated_endpoints/huggingface_tutorial) and [W\&B models](/guides/dedicated_endpoints/wandb_tutorial). # Deploy with W&B Models Source: https://friendli.ai/docs/guides/dedicated_endpoints/wandb_tutorial Hands-on tutorial for launching and deploying LLMs using Friendli Dedicated Endpoints with Weights & Biases artifacts. export const RoundedBorderBox = ({children, caption}) =>
<div>
  {children}
  {caption && <div>
    {caption}
  </div>}
</div>
; #### Hands-on Tutorial Deploying `meta-llama-3-8b-instruct` LLM from W\&B using Friendli Dedicated Endpoints ## Introduction With Friendli Dedicated Endpoints, you can easily spin up scalable, secure, and highly available inference deployments, without the need for infrastructure expertise or significant capital expenditures. This tutorial is designed to guide you through the process of launching and deploying LLMs using Friendli Dedicated Endpoints. Through a series of step-by-step instructions and hands-on examples, you'll learn how to: * Select and deploy pre-trained LLMs from W\&B artifacts * Deploy and manage your models using the Friendli Engine * Monitor and optimize your inference deployments By the end of this tutorial, you'll be equipped with the knowledge and skills necessary to unlock the full potential of LLMs in your applications, products, and services. So, let's get started and explore the possibilities of Friendli Dedicated Endpoints! ## Prerequisites: * A Friendli Suite account with access to [Friendli Dedicated Endpoints](https://suite.friendli.ai) * A W\&B account with an api key (as an access token) ## Step 1: Create a new endpoint 1. Log in to your Friendli Suite account and navigate to the Friendli Dedicated Endpoints dashboard. 2. If not done already, start the free trial for Dedicated Endpoints. 3. Create a new project, then click on the "New Endpoint" button. 4. [Integrate your W\&B account with an api key.](https://wandb.ai/settings#api) 5. Fill in the basic information: * Endpoint name: Choose a unique name for your endpoint (e.g., "My New Endpoint"). 6. Select the model: W&B Model Select * Model Repository: Select "Weights & Biases" as the model provider. * Model ID: Enter `friendliai/model-registry/Meta-Llama-3-8B-Instruct:v0` as the model id. If you're using your own model, check [Format Requirements](/guides/dedicated_endpoints/faq#format-requirements) for requirements. 7. Select the instance: Select instance * Instance configuration: Choose a suitable instance type based on your performance requirements. We suggest 1x A100 80G for most models. In some cases where the model's size is big, some options may be restricted as they are guaranteed to not run due to insufficient VRAM. Low Memory Warning 8. Edit the configurations: Autoscaling Config
Engine Config

* Autoscaling: By default, autoscaling ranges from 0 to 2 replicas. This means that the deployment will sleep when it's not being used, which reduces cost.
* Advanced configuration: Some LLM options, including the maximum processing batch size and token configurations, can be updated. For this tutorial, we'll leave them as-is.

9. Click "Create" to create a new endpoint.

## Step 2: Test the endpoint

1. Wait for the deployment to be created and initialized. This may take a few minutes. You can check the status with the indicator under the endpoint's name. Initializing Endpoint
2. In the "Playground" section, you may enter a sample input prompt (e.g., "What is the capital of France?").
3. Click on the right arrow button to send the inference request. Playground
4. If you are an enterprise user, you can use the "Metrics" and "Logs" sections to monitor the endpoint. Metrics
Logs

## Step 3: Send requests using cURL or Python

1. As described in our [API docs](/openapi/serverless/chat-completions), you can send requests with the following code:

```sh cURL
curl -X POST https://api.friendli.ai/dedicated/v1/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $FRIENDLI_TOKEN" \
  --data-raw '{
    "model": "$ENDPOINT_ID",
    "prompt": "What is the capital of France?",
    "max_tokens": 200,
    "top_k": 1
  }'
```

```python Python
import requests
import json
import os

url = 'https://api.friendli.ai/dedicated/v1/completions'

payload = json.dumps({
    "model": os.environ["ENDPOINT_ID"],
    "prompt": "What is the capital of France?",
    "max_tokens": 200,
    "top_k": 1
})
headers = {
    "Content-Type": "application/json",
    "Accept": "application/json",
    "Authorization": f"Bearer {os.environ['FRIENDLI_TOKEN']}"
}

response = requests.request("POST", url, headers=headers, data=payload)
print(response.text)
```

## Step 4: Update the endpoint

1. You can update the model and change most other settings by clicking the update button.

# Image Generation Models

Source: https://friendli.ai/docs/guides/image-generation

Dive into the characteristics of popular Image Generation Models available on Friendli Dedicated Endpoints.

## Visualizing Ideas with Friendli: A Guide to Image Generation

Friendli provides powerful Image Generation capabilities, allowing users to transform text prompts into high-quality visuals with ease. This guide explores how to generate images using Friendli Dedicated Endpoints, including code examples to help you make the most of these powerful tools.

## Model Supports

We support the **Flux Dev** and **Flux Schnell** models. Their fine-tuned and quantized variants are also supported, and adapters are available as well. For a detailed list of models, refer to the models page on our website.

* [Flux Dev](https://friendli.ai/models?baseModel=black-forest-labs/FLUX.1-dev)
* [Flux Schnell](https://friendli.ai/models?baseModel=black-forest-labs/FLUX.1-schnell)

## API Usage

For full API specifications, refer to:

* [Dedicated API Reference](/openapi/dedicated/image-generations)
* [Container API Reference](/openapi/container/image-generations)

## Examples

```python Python
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.friendli.ai/dedicated/v1",
    api_key=os.environ.get("FRIENDLI_TOKEN"),
)

images = client.images.generate(
    # Replace YOUR_ENDPOINT_ID with the ID of your endpoint, e.g. "zbimjgovmlcb"
    model="YOUR_ENDPOINT_ID",
    prompt="An orange Lamborghini driving down a hill road at night with a beautiful ocean view in the background.",
    extra_body={
        "num_inference_steps": 10,
        "guidance_scale": 3.5
    }
)

print(images.data[0].url)
```

```sh cURL
curl -L -X POST "https://api.friendli.ai/dedicated/v1/images/generations" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $FRIENDLI_TOKEN" \
  --data-raw '{
    "model": "YOUR_ENDPOINT_ID",
    "prompt": "An orange Lamborghini driving down a hill road at night with a beautiful ocean view in the background.",
    "num_inference_steps": 10,
    "guidance_scale": 3.5
  }'
```

`guidance_scale` is required when using Friendli Container. For more detail, please refer to the [Container API Reference](/openapi/container/image-generations).
# Unleash the Power of Generative AI with Friendli Suite: Your End-to-End Solution

Source: https://friendli.ai/docs/guides/introduction

Friendli Suite empowers you to explore generative AI with three solutions: Serverless Endpoints for quick access to open-source models, Dedicated Endpoints for deploying custom models on dedicated GPUs, and Containers for secure, on-premise control. Powered by the optimized Friendli Engine, each option ensures fast, cost-efficient AI serving for text, code, and image generation.

export const ServerlessIcon = () => { return ; };

export const ContainerIcon = () => { return ; };

export const DedicatedIcon = () => { return ; };

Welcome to the exciting world of generative AI, where words dance into text, code sparks creation, and images bloom from the imagination. Friendli Suite empowers you to tap into this potential with three distinct offerings, catering to your specific needs and technical expertise. Whether you're a seasoned developer or a curious newcomer, Friendli Suite provides the perfect platform to bring your AI-powered visions to life.

## What is Generative AI Serving?

Before diving into Friendli Suite, let's get familiar with the magic behind the curtain. Generative AI models, including large language models (LLMs), learn from massive datasets of text and code, mimicking human creativity and knowledge. However, utilizing these models in real-world applications requires generative AI serving. Inference serving acts as the bridge between the model and your desired outputs, efficiently processing your prompts and queries to generate text, code, images, and more.

Efficient inference serving is not easy to achieve. It requires actively optimizing many aspects of the system so that user requests can be handled efficiently with a limited amount of resources. Without these optimizations, inference serving can suffer from extremely high latency or wasteful use of many expensive GPUs. To take these optimization hassles off your hands, the Friendli Engine steps in to enable fast and cost-efficient inference serving for your generative AI models.

## Friendli Suite: Your Flexible Gateway to Generative AI Mastery

Now, let's meet the three members of Friendli Suite, each unlocking different doors to AI innovation:

### 1. [Friendli Dedicated Endpoints](/guides/dedicated_endpoints/introduction): Power and Customization at Your Fingertips

Ready to take the reins and unleash the full potential of your own models? Friendli Dedicated Endpoints is for you. This service provides dedicated GPU resources, letting you upload and run your custom generative AI models. Reserve the exact GPU you need and enjoy fine-grained control over your model settings. Pay-per-second billing makes it perfect for regular or resource-intensive workloads.

### 2. [Friendli Container](/guides/container/introduction): On-Premise Control for the AI Purist

Do you prefer the comfort and security of your own data center? Friendli Container is the solution. We provide the Friendli Engine within Docker containers that can be installed on your on-premise GPUs so your data stays within your own secure cluster. This option offers maximum control and security, ideal for advanced users or those with specific data privacy requirements.

### 3. [Friendli Serverless Endpoints](/guides/serverless_endpoints/introduction): Your Quickest Path to Creativity

Imagine a playground for your AI dreams.
Friendli Serverless Endpoints is just that - a simple, click-and-play interface that lets you access popular general-purpose open-source models like Llama 3.1 without any heavy lifting. Choose your model, enter your prompt, and marvel at the generated text, or code outputs. With pay-per-token billing, this is ideal for exploration and experimentation. You can think of it as an AI sampler to try out the abilities of general-purpose AI models. ## [The Friendli Engine](https://friendli.ai/solutions/engine): The Powerhouse Behind the Suite At the heart of each Friendli Suite offering lies the Friendli Engine, a patented GPU-optimized serving engine. This technological marvel is what enables Friendli Suite's superior performance and cost-effectiveness, featuring innovations like continuous batching (iteration batching) that significantly improve resource utilization compared to traditional LLM serving solutions. ## Which Friendli solution is Right for You? Friendli Suite provides flexibility to match your needs: * Level up with your own models: Opt for [Friendli Dedicated Endpoints](/guides/dedicated_endpoints/introduction) for customized models on autopilot. * Embrace on-premise control: Utilize [Friendli Container](/guides/container/introduction) for maximum control and efficiency on your GPUs. * Start quick and simple: Choose [Friendli Serverless Endpoints](/guides/serverless_endpoints/introduction) for exploration and quick projects. No matter your skill level or preferences, Friendli Suite has the perfect option to empower your generative AI journey. Dive in, explore, and unleash the endless possibilities of AI creativity! Remember to explore the resources at [https://friendli.ai/blog](https://friendli.ai/blog) for deeper insights into generative AI and Friendli Suite capabilities. ## Popular Guides Check out popular how to guides and dive into the Friendli Suite. } href="/guides/dedicated_endpoints/quickstart"> Deploy your models with Friendli Dedicated Endpoints, and enjoy the flexibility of customizing your own models. Use the Friendli Engine to generate images, text, and more with extraordinary speed and efficiency. } href="/guides/container/quickstart"> Opt for maximum control with Friendli Container, offering the Friendli Engine in Docker containers installable on your on-premise GPUs, ensuring your data remains within your cluster. } href="/guides/serverless_endpoints/quickstart"> Only a few clicks are required for you to access general-purpose open-source models like Llama 3.1. Enjoy the power of generative AI without any hassle at a blazing speed. # Friendli Documentation Source: https://friendli.ai/docs/guides/overview Get started with FriendliAI products and explore APIs. export const ToolIcon = () => { return ; }; export const ChatIcon = () => { return ; }; export const ServerlessIcon = () => { return ; }; export const ContainerIcon = () => { return ; }; export const DedicatedIcon = () => { return ; }; ## QuickStarts } href="/guides/dedicated_endpoints/quickstart"> Deploy your models with Friendli Dedicated Endpoints, and enjoy the flexibility of customizing your own models. Use the Friendli Engine to generate images, text, and more with extraordinary speed and efficiency. } href="/guides/container/quickstart"> Opt for maximum control with Friendli Container, offering the Friendli Engine in Docker containers installable on your on-premise GPUs, ensuring your data remains within your cluster. 
} href="/guides/serverless_endpoints/quickstart"> Only a few clicks are required for you to access general-purpose open-source models like Llama 3.1. Enjoy the power of generative AI without any hassle at a blazing speed. ## SDK Friendli offers tools for developers to easily integrate AI into various applications. Our solutions support popular frameworks, enabling AI integration from simple chatbots to complex systems. ## Explore APIs } href="/openapi/serverless/chat-completions"> Discover how to generate text through interactive conversations. } href="/openapi/serverless/tool-assisted-chat-completions"> Learn how to enhance responses with tool assisted chat completions using built-in tools. # Personal Access Tokens Source: https://friendli.ai/docs/guides/personal_access_tokens Learn how to manage credentials in Friendli Suite, including using personal access tokens for authentication and authorization. export const RoundedBorderBox = ({children, caption}) =>
{children} {caption && {caption}}
; Effective management of credentials is crucial when using Friendli Suite and its endpoints for authentication and authorization purposes. This guide outlines when the credentials are required and provides instructions on how to manage them. A Friendli Token serves as an alternative method of authorization to signing in with an email and a password. You can generate a new Friendli Token through the [Friendli Suite](https://suite.friendli.ai), at your **"Personal settings"** page. 1. Go to the [Friendli Suite](https://suite.friendli.ai) and sign in with your account. 2. Click the profile icon at the top-right corner of the page. 3. Click **"Personal settings"** menu. Personal settings 4. Go to the **"Tokens"** tab on the navigation bar. 5. Create a new Friendli Token by clicking the **"Create token"** button. 6. Copy the token and save it in a safe place. You will not be able to see this token again once the page is refreshed. Tokens # Advanced Applications on Friendli Serverless Endpoints (Coming Soon!) Source: https://friendli.ai/docs/guides/serverless_endpoints/applications Stay tuned for detailed guides on how to perform tasks like Retrieval-Augmented Generation (RAG), Conditional Image Generation, Fine-tuning Custom Models. Friendli Serverless Endpoints empowers you to unleash the full potential of generative AI models with ease. While we've already covered some exciting applications through text and image generation, we're eager to offer even more possibilities for users like you! This document serves as a preview for upcoming content showcasing advanced applications of Friendli Serverless Endpoints. Stay tuned for detailed guides on how to perform tasks like: * **Retrieval-Augmented Generation (RAG)**: Combine the power of search and generation to create highly relevant and informative text outputs based on real-world data. * **Conditional Image Generation**: Fine-tune your image creations by using specific conditions or attributes as additional prompts, pushing the boundaries of creative control. * **Fine-tuning Custom Models**: Tailor existing models to your specific needs and data for a truly personalized generative AI experience. This is just a glimpse of the advanced applications on the horizon! We're actively working on bringing you comprehensive guides that explain the process, settings, and potential benefits of each approach. In the meantime, feel free to explore the current capabilities of Friendli Serverless Endpoints with text generation. Experiment with different models, settings, and prompts to discover the vast creative and informative potential at your fingertips. We're committed to evolving Friendli Serverless Endpoints into a one-stop platform for all your generative AI needs. Stay tuned for updates and get ready to dive into the world of advanced applications soon! #### For any questions or feedback regarding these upcoming features, please don't hesitate to [reach out to us](https://friendli.ai/contact)! We appreciate your understanding and continuous support as we push the boundaries of generative AI accessibility. # Function Calling Source: https://friendli.ai/docs/guides/serverless_endpoints/function-calling Learn how to do OpenAI compatible function calling on Friendli Serverless Endpoints. Function calling is a powerful feature that connects large language models (LLMs) with external systems to maximize the model’s utility. 
It goes beyond simply relying on model’s learned knowledge and provides the possibility of utilizing real-time data and performing complex tasks. Function calling ## Simple Example In the example below, which consists of 1 to 5 steps, we define a `get_weather` function that retrieves weather information, ask a question that prompts the model to use the tool, and execute the tool to execute the final response. Open In Colab Define a function that the model can call (`get_weather`) with a JSON Schema.\ The function requires the following parameters: * `location`: The location to look up weather information for. * `date`: The date to look up weather information for. This definition is included in the `tools` array and passed to the model. ```python tools = [ { "type": "function", "function": { "name": "get_weather", "parameters": { "type": "object", "properties": { "location": {"type": "string"}, "date": {"type": "string", "format": "date"} }, }, }, } ] ``` When a user asks a question, this request is passed to the model as a `messages` array.\ For example, the request "What's the weather like in Paris today?" would be passed as: ```python from datetime import datetime today = datetime.now() messages = [ {"role": "system", "content": f"You are a helpful assistant. today is {today}."}, {"role": "user", "content": "What's the weather like in Paris today?"} ] ``` Call the model using the `tools` and `messages` defined above. ```python {13-14} from openai import OpenAI import os token = os.getenv("FRIENDLI_TOKEN") or "" client = OpenAI( base_url = "https://api.friendli.ai/serverless/v1", api_key = token ) completion = client.chat.completions.create( model="meta-llama-3.1-8b-instruct", messages=messages, tools=tools, ) print(completion.choices[0].message.tool_calls) ``` The API caller runs the tool based on the function call information of the model.\ For example, the `get_weather` function is executed as follows: ```python import json import random def get_weather(location: str, date: str): temperature = random.randint(60, 80) return {"temperature": temperature, "forecast": "sunny"} tool_call = completion.choices[0].message.tool_calls[0] tool_response = locals()[tool_call.function.name](**json.loads(tool_call.function.arguments)) print(tool_response) ``` ```python Result: {'temperature': 65, 'forecast': 'sunny'} ``` Add the tool's response to the `messages` array and pass it back to the model. 1. Append tool call information 2. Append the tool's execution result This ensures the model has all the necessary information to generate a response. ```python model_response = completion.choices[0].message # Append the response from the model messages.append( { "role": model_response.role, "tool_calls": [ tool_call.model_dump() for tool_call in model_response.tool_calls ] } ) # Append the response from the tool messages.append( { "role": "tool", "content": json.dumps(tool_response), "tool_call_id": tool_call.id } ) print(json.dumps(messages, indent=2)) ``` The model generates the final response based on the tool's output: ```python next_completion = client.chat.completions.create( model="meta-llama-3.1-8b-instruct", messages=messages, tools=tools ) print(next_completion.choices[0].message.content) ``` ```text Final output: According to the forecast, it's going to be a sunny day in Paris with a temperature of 65 degrees. ``` ## Parameters To use function calling, modify the `tool_choice`, `tools`, and `parallel_tool_calls` parameters. 
| Parameter | Description | default | | --------------------- | ---------------------------------------------------------------------------------------------------------------- | ------- | | `tool_choice` | Specifies how the model should choose tools. Has four options: "none", "auto", "required", or named tool choice. | `auto` | | `tools` | The list of tool objects that define the functions the model can call. | - | | `parallel_tool_calls` | Boolean value (`True` or `False`) specifying whether to make tool calls in parallel. | `True` | ### `tool_choice` options The model will automatically choose whether to call a function and which function to call by default.\ However, you can use the `tool_choice` parameter to tell the model to use a function. * `none`: Disables the use of tools. * `auto`: Enables the model to decide whether to use tools and which ones to use. * `required`: Forces the model to use a tool, but the model chooses which one. * Named tool choice: Forces the model to use a specific tool. It must be in the following format: ```json { "type": "function", "function": { "name": "get_current_weather" // The function name you want to specify } } ``` ## Supported models * `deepseek-r1` * `meta-llama-3.3-70b-instruct` * `meta-llama-3.1-8b-instruct` ## References Building an AI Agent for Google Calendar ([Part 1](https://friendli.ai/blog/ai-agent-google-calendar) / [Part 2](https://friendli.ai/blog/calendar-agent-vercel))\ Friendli Tools Blog Series ([Part 1](https://friendli.ai/blog/llm-function-calling) / [Part 2](https://friendli.ai/blog/ai-agents-function-calling) / [Part 3](https://friendli.ai/blog/friendli-tools-llama3-outperforms-gpt4o)) # Integrations Source: https://friendli.ai/docs/guides/serverless_endpoints/integrations Friendli integrates with LangChain, LiteLLM, LlamaIndex, and MongoDB to streamline GenAI application deployment. LangChain and LlamaIndex enable tool calling AI agents and Retrieval-Augmented Generation (RAG), while MongoDB provides memory via vector databases, and LiteLLM boosts performance through load balancing. [Friendli](/guides/introduction) integrates with LangChain, LiteLLM, LlamaIndex, and MongoDB to streamline the deployment of compound GenAI applications. The integration of LangChain and LlamaIndex facilitates tool calling AI agents or Retrieval-Augmented Generation (RAG). MongoDB supports these agentic systems by providing memory with vector databases, while LiteLLM enhances performance through load balancing and evaluation. Get a quick overview of [Friendli Serverless Endpoints'](/guides/serverless_endpoints/introduction) integrations and learn more through the linked resources. ## LangChain [LangChain](https://python.langchain.com/v0.2/docs/introduction) is a framework for developing applications powered by large language models (LLMs). Utilize [Friendli Serverless Endpoints](/guides/serverless_endpoints/quickstart) for LLM inferencing in LangChain by preparing a [Friendli Token](/guides/personal_access_tokens). To install the required packages, run: ``` pip install langchain langchain-community friendli-client ``` Here's a streaming chat sample code to get started with LangChain and FriendliAI: ```python from langchain_community.chat_models.friendli import ChatFriendli llm = ChatFriendli(model="meta-llama-3.3-70b-instruct") for chunk in llm.stream("Tell me a funny joke."): print(chunk.content, end="", flush=True) ``` Output: ``` Here's one: Why couldn't the bicycle stand up by itself? (Wait for it...) Because it was two-tired! 
Hope that brought a smile to your face! ``` #### Resources * [FriendliAI Blog Post on Building RAG Chatbots with Friendli, MongoDB Atlas, and LangChain](https://friendli.ai/blog/rag-chatbot-friendli-mongodb-atlas-langchain) * [FriendliAI Blog Post on Example RAG Application with Friendli and LangChain](https://friendli.ai/blog/chatdocs-rag-friendli-langchain) * [FriendliAI Blog Post on LangChain Integration with Friendli Dedicated Endpoints](https://friendli.ai/blog/langchain-integration-friendli-engine) * [LangChain's Documentation on Friendli](https://python.langchain.com/v0.1/docs/integrations/llms/friendli) ## MongoDB [MongoDB Atlas](https://www.mongodb.com/docs/atlas/getting-started) is a developer data platform offering vector stores and searches for compound GenAI applications, compatible through both LangChain and LlamaIndex. Utilize [Friendli Serverless Endpoints](/guides/serverless_endpoints/quickstart) for LLM inferencing in MongoDB by preparing a [Friendli Token](/guides/personal_access_tokens). To install the required packages, run: ``` pip install pymongo friendli-client langchain langchain-mongodb langchain-community pypdf langchain-openai tiktoken ``` Here's a RAG sample code to get started with MongoDB and FriendliAI using LangChain: ```python # Note: You can find detailed explanation on this code in the blog post below. from pymongo import MongoClient from langchain_mongodb.vectorstores import MongoDBAtlasVectorSearch from langchain_community.chat_models.friendli import ChatFriendli from langchain_community.document_loaders import PyPDFLoader from langchain_openai import OpenAIEmbeddings from langchain_text_splitters import RecursiveCharacterTextSplitter from langchain_core.output_parsers import StrOutputParser from langchain_core.prompts import PromptTemplate from langchain_core.runnables import RunnablePassthrough # Fill in your Cluster URI here. MONGODB_ATLAS_CLUSTER_URI = "{YOUR CLUSTER URI}" client = MongoClient(MONGODB_ATLAS_CLUSTER_URI) # Fill in your DB information here. DB_NAME = "{YOUR DB NAME}" COLLECTION_NAME = "{YOUR COLLECTION NAME}" ATLAS_VECTOR_SEARCH_INDEX_NAME = "{YOUR INDEX NAME}" MONGODB_COLLECTION = client[DB_NAME][COLLECTION_NAME] # Fill in your PDF link here. loader = PyPDFLoader("{YOUR PDF DOCUMENT LINK}") data = loader.load() text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150) docs = text_splitter.split_documents(data) vector_store = MongoDBAtlasVectorSearch.from_documents( documents=docs, embedding=OpenAIEmbeddings(disallowed_special=()), collection=MONGODB_COLLECTION, index_name=ATLAS_VECTOR_SEARCH_INDEX_NAME, ) retriever = vector_store.as_retriever() llm = ChatFriendli(model="meta-llama-3.3-70b-instruct") prompt = PromptTemplate.from_template( """ Use the following pieces of context to answer the question. {context} Question: {question} Helpful Answer: """ ) def format_docs(docs): return "\n\n".join(doc.page_content for doc in docs) rag_chain = ( {"context": retriever | format_docs, "question": RunnablePassthrough()} | prompt | llm | StrOutputParser() ) # Input your user query here. 
rag_chain.invoke("{Sample Query Texts}") ``` #### Resources * [FriendliAI Blog Post on Building RAG Chatbots with Friendli, MongoDB Atlas, and LangChain](https://friendli.ai/blog/rag-chatbot-friendli-mongodb-atlas-langchain) * [FriendliAI Blog Post on RAG with FriendliAI and MongoDB](https://friendli.ai/blog/rag-mongodb-friendli) * [MongoDB's Partner Ecosystem Page on FriendliAI](https://cloud.mongodb.com/ecosystem/friendliai) ## LlamaIndex [LlamaIndex](https://docs.llamaindex.ai/en/stable) is a data framework designed to connect LLMs to custom data sources. Utilize [Friendli Serverless Endpoints](/guides/serverless_endpoints/quickstart) for LLM inferencing in LlamaIndex by preparing a [Friendli Token](/guides/personal_access_tokens). Additionally, an [OpenAI API key](https://platform.openai.com/docs/api-reference/authentication) is required to access the [OpenAI embedding API](https://platform.openai.com/docs/api-reference/embeddings). To install the required packages, run: ``` pip install llama-index-llms-friendli llama-index ``` Here's a RAG streaming chat sample code to get started with LlamaIndex and FriendliAI: ```python from llama_index.llms.friendli import Friendli from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex Settings.llm = Friendli() # Assuming a directory named 'data_folder' stores your pdf file. documents = SimpleDirectoryReader('data_folder').load_data() index = VectorStoreIndex.from_documents(documents) query_engine = index.as_query_engine(streaming=True) # Input your user query here. response = query_engine.query("{Sample Query Texts}") response.print_response_stream() ``` #### Resources * [FriendliAI Blog Post on Building RAG Applications with Friendli and LlamaIndex](https://friendli.ai/blog/llamaindex-rag-app-friendli-engine) * [Google Colab Notebook on Two-Stage Retrieval with LlamaIndex Friendli Integration](https://colab.research.google.com/drive/1_-1aITFQh0UUbRzaRM8FRid_wZHrfIjX?usp=sharing) * [LlamaIndex's Documentation on Friendli](https://docs.llamaindex.ai/en/stable/examples/llm/friendli) ## LiteLLM [LiteLLM](https://docs.litellm.ai/docs) is a versatile platform offering access to 100+ LLMs in the [OpenAI API format](https://platform.openai.com/docs/api-reference/chat/create). Utilize [Friendli Serverless Endpoints](/guides/serverless_endpoints/quickstart) for LLM inferencing in LiteLLM by preparing a [Friendli Token](/guides/personal_access_tokens). To install the required package, run: ``` pip install litellm ``` Here's a streaming chat sample code to get started with LiteLLM and FriendliAI: ```python from litellm import completion response = completion( # Simply change the model ID to use different LLM inference models & engines. model="friendliai/meta-llama-3-70b-instruct", messages=[ {"role": "user", "content": "Hello from LiteLLM"} ], stream=True, ) for chunk in response: print(chunk.choices[0].delta.content, end="", flush=True) ``` Output: ``` Hello from an AI! It's great to meet you, LiteLLM! How's your day going so far? 
``` #### Resources * [FriendliAI Blog Post on LiteLLM Friendli Integration using LiteLLM's Budget Manager](https://friendli.ai/blog/litellm-friendli-integration) * [LiteLLM's Supported Models & Providers Documentation Page on FriendliAI](https://docs.litellm.ai/docs/providers/friendliai) # Introducing Friendli Serverless Endpoints Source: https://friendli.ai/docs/guides/serverless_endpoints/introduction Guide for Friendli Serverless Endpoints, allowing you to seamlessly integrate state-of-the-art AI models into your workflows, regardless of your technical expertise. {/* Welcome to the exciting world of generative AI, where words dance into text, code sparks creation, and images bloom from the imagination. FriendliAI makes this world readily accessible with Friendli Serverless Endpoints, a revolutionary service that puts the power of cutting-edge generative models right at your fingertips. */} This tutorial will guide you through Friendli Serverless Endpoints, allowing you to seamlessly integrate state-of-the-art AI models into your workflows, regardless of your technical expertise. Whether you're a seasoned developer or a curious newcomer, get ready to unlock the limitless potential of generative AI! ## What are Friendli Serverless Endpoints? Imagine there is a powerful racecar (a generative AI model) that needs much maintenance and tuning (infrastructure and technical know-how). Friendli Serverless Endpoints is like a rental service, taking care of the hassle so you can just drive! It provides a simple, serverless interface that connects you to Friendli Engine, a high-performance, cost-effective inference serving engine optimized for generative AI models. With Friendli Serverless Endpoints, you can: * **Access popular open-source models**: Get started with pre-loaded models like Llama 3.1. No need to worry about downloading or optimizing them. * **Build your own workflows**: Integrate these models into your applications with just a few lines of code. Generate creative text formats, code, musical pieces, email, letters, etc. and create stunning images with ease. * **Pay per token, not per GPU**: Unlike traditional solutions that require whole GPU instances, Friendli Serverless Endpoints bills you only for the resources your models actually use. This translates to significant cost savings and efficient resource utilization. * **Focus on what matters**: Forget about infrastructure setup and GPU optimization. Friendli Serverless Endpoints handles the heavy lifting, freeing you to focus on your creative vision and application development. ## Getting Started with Friendli Serverless Endpoints: 1. **Sign up for a free account**: Visit [Friendli Suite](https://suite.friendli.ai) and create your Friendli Suite account. 2. **Choose your model**: Select the pre-loaded model you want to experiment with, such as Llama 3.1 for text generation. 3. **Connect to the endpoint**: Friendli Serverless Endpoints provides simple API documentation for a variety of programming languages. Follow the instructions to integrate the endpoint into your code. 4. **Send your input**: Supply the model with your input text, code, or image prompt. 5. **Witness the magic**: Friendli Serverless Endpoints will utilize Friendli Engine to process your input and generate the desired output, be it text, code, or an image. You can then integrate the generated results into your application or simply marvel at the AI's creativity! 
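To make steps 3 to 5 concrete, here is a minimal sketch that connects to Friendli Serverless Endpoints through the OpenAI-compatible API and sends a prompt. The model ID is only an example, and the `FRIENDLI_TOKEN` environment variable is assumed to hold your Friendli Token:

```python
# Minimal sketch of steps 3-5: choose a model, connect, and send your input.
import os

from openai import OpenAI

client = OpenAI(
    api_key=os.getenv("FRIENDLI_TOKEN"),
    base_url="https://api.friendli.ai/serverless/v1",
)

completion = client.chat.completions.create(
    model="meta-llama-3.1-8b-instruct",
    messages=[{"role": "user", "content": "Write a haiku about serverless AI."}],
)
print(completion.choices[0].message.content)
```

The OpenAI Compatibility guide below covers streaming responses and other SDKs in more detail.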
## Beyond the Basics:

As you gain confidence, Friendli Serverless Endpoints offers even more:

* **Granular control**: Optimize resource usage at the per-token or per-step level for each model, ensuring efficient resource allocation for your specific needs.
  {/* - **Customization**: Build your own custom generative models and seamlessly integrate them into your workflows using Friendli Serverless Endpoints. */}
* **Scalability**: As your needs grow, easily scale your resources without worrying about complex infrastructure management.

Friendli Serverless Endpoints is the perfect springboard for your generative AI journey. Whether you're an experienced developer seeking to integrate AI into your projects or a curious explorer yearning to unleash your creative potential, FriendliAI provides the tools and resources you need to succeed. So, start your engines, take the wheel, and explore the vast possibilities of generative AI with Friendli Serverless Endpoints!

## Additional Resources:

* FriendliAI website: [https://friendli.ai](https://friendli.ai)
* FriendliAI blog: [https://friendli.ai/blog](https://friendli.ai/blog)

# OpenAI Compatibility

Source: https://friendli.ai/docs/guides/serverless_endpoints/openai-compatibility

Friendli Serverless Endpoints is compatible with the OpenAI API standard through the Python API Libraries and the Node API Libraries. Friendli Dedicated Endpoints and Friendli Container are also OpenAI API compatible.

Friendli Serverless Endpoints is compatible with the [OpenAI API standard](https://platform.openai.com/docs/api-reference/chat) through the [Python API Libraries](https://pypi.org/project/openai) and the [Node API Libraries](https://www.npmjs.com/package/openai). [Friendli Dedicated Endpoints](https://friendli.ai/products/dedicated-endpoints) and [Friendli Container](https://friendli.ai/products/container) are also OpenAI API compatible.

Through this guide, you will learn how to:

* Send inference requests to Friendli Serverless Endpoints in Python and Node.js.
* Use chat models supported by Friendli Endpoints.
* Generate streaming chat responses.

## Model Supports

* `deepseek-r1`
* `meta-llama-3.3-70b-instruct`
* `meta-llama-3.1-8b-instruct`
* [and more!](https://friendli.ai/models) You can find more information about each text generation model [here](https://friendli.ai/models). Log in to the [Friendli Suite](https://suite.friendli.ai/login) to create your Friendli Token for this quick tutorial. We will use the *Llama 3.3 70B Instruct* model as an example in this tutorial. ## Quick Guide If you want to integrate Friendli Serverless Endpoints to your application that had been using OpenAI, you can simply switch the following components: **API key**, **model**, and the **base url**. The **API key** is equivalent to your Friendli Token, which you can create [here](https://suite.friendli.ai/default-team/settings/tokens). After choosing your generative text model, you can find the **model id** by pressing the 'More info' icon, or by using the ids listed in the Model Supports section above. Last but not least, change the **base url** to [https://api.friendli.ai/serverless/v1](https://api.friendli.ai/serverless/v1) and you are all set! ## Python This example demonstrates how you can use the OpenAI Python SDK to generate a response. #### Default Example Code ```python import openai import os client = openai.OpenAI( api_key=os.getenv("FRIENDLI_TOKEN"), base_url="https://api.friendli.ai/serverless/v1", ) chat_completion = client.chat.completions.create( model="meta-llama-3.3-70b-instruct", messages=[ {"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Tell me a funny joke."}, ], stream=False, ) print(chat_completion.choices[0].message.content) ``` #### Streaming Example Code ```python import openai import os client = openai.OpenAI( api_key=os.getenv("FRIENDLI_TOKEN"), base_url="https://api.friendli.ai/serverless/v1", ) stream = client.chat.completions.create( model="meta-llama-3.3-70b-instruct", messages=[ {"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Tell me a funny joke."}, ], stream=True, ) for chunk in stream: print(chunk.choices[0].delta.content or "", end="", flush=True) ``` ## Node.js This example demonstrates how you can use the OpenAI Node.js SDK to generate a response. #### Default Example Code ```javascript const OpenAI = require("openai"); const openai = new OpenAI({ apiKey: process.env.FRIENDLI_TOKEN, baseURL: "https://api.friendli.ai/serverless/v1", }); async function getChatCompletion() { try { const chatCompletion = await openai.chat.completions.create({ messages: [ { role: "system", content: "You are a helpful assistant." }, { role: "user", content: "Tell me a funny joke." }, ], model: "meta-llama-3.3-70b-instruct", stream: false, }); process.stdout.write(chatCompletion.choices[0].message.content); } catch (error) { console.error("Error:", error); } } getChatCompletion(); ``` #### Streaming Example Code ```javascript const OpenAI = require("openai"); const openai = new OpenAI({ apiKey: process.env.FRIENDLI_TOKEN, baseURL: "https://api.friendli.ai/serverless/v1", }); async function getChatCompletionStream() { try { const stream = await openai.chat.completions.create({ messages: [ { role: "system", content: "You are a helpful assistant." }, { role: "user", content: "Tell me a funny joke." }, ], model: "meta-llama-3.3-70b-instruct", stream: true, }); for await (const chunk of stream) { process.stdout.write(chunk.choices[0].delta?.content || ""); } } catch (error) { console.error("Error:", error); } } getChatCompletionStream(); ``` ## Results ``` Here's one: Why couldn't the bicycle stand up by itself? (wait for it...) 
Because it was two-tired!

Hope that brought a smile to your face!
```

# Pricing

Source: https://friendli.ai/docs/guides/serverless_endpoints/pricing

Friendli Serverless Endpoints offer a range of models tailored to various tasks.

Friendli Serverless Endpoints offer a range of models tailored to various tasks.

## Text Generation Models

Text generation models provide users with completions and chat completions APIs, with pricing determined on a per-token basis. The following table outlines the pricing details for different text generation models:

| Model Code                  | Price per 1M Tokens    |
| --------------------------- | ---------------------- |
| deepseek-r1                 | Input \$3 · Output \$7 |
| meta-llama-3.3-70b-instruct | \$0.6                  |
| meta-llama-3.1-8b-instruct  | \$0.1                  |

The term "token" refers to an individual unit processed by the model.

# QuickStart: Friendli Serverless Endpoints

Source: https://friendli.ai/docs/guides/serverless_endpoints/quickstart

Learn how to get started with Friendli Serverless Endpoints in this step-by-step guide. Create an account, choose from powerful AI models like Llama 3.1, and seamlessly generate text, code, and more with ease.

export const RoundedBorderBox = ({children, caption}) =>
{children} {caption && {caption}}
;

## 1. Log In or Sign Up

* If you have an account, log in using your preferred SSO or email/password combination.
* If you're new to FriendliAI, create an account for free.

![Login](https://mintlify.s3.us-west-1.amazonaws.com/friendliai/static/images/guides/serverless_endpoints/login.png)

## 2. Access Friendli Serverless Endpoints

* On your dashboard, find the "Friendli Serverless Endpoints" section.
* Click the "Go to playground" button to start generating text.

![Suite Dashboard](https://mintlify.s3.us-west-1.amazonaws.com/friendliai/static/images/guides/serverless_endpoints/dashboard.png)

## 3. Select a Model

* Browse the available generative models.
* Choose the model that best aligns with your desired use case.
* First-time users receive a free trial to explore Friendli Serverless Endpoints without any financial commitment.

Model List

## 4. Generate Responses

1. Enter Your Query:
   * Type in your prompt or question.
   * Alternatively, select from the provided example queries to try out different scenarios.

Chat Prompt

2. Adjust Settings:
   * Refer to the [Text Generation](/guides/serverless_endpoints/text-generation) docs for more details on the settings available for text generation models (these settings also map to API request parameters; see the example after the code samples below).
3. Generate Your Response:
   * Click the "Generate" button to start the generation process.
   * The model will process your query and produce the corresponding text output. That's it!

Chat Settings

### Generating Responses Through the Endpoint URL

If you wish to send your requests through the endpoint URL, you can find it by clicking the 'More Info' button at the top-right corner of the page. Refer to [this guide](/guides/personal_access_tokens) for general instructions on the Friendli Token.

Endpoint URL
```sh cURL
# Send an inference request to a running Friendli Serverless Endpoint using a `curl` command.
curl -X POST https://api.friendli.ai/serverless/v1/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $FRIENDLI_TOKEN" \
  -d '{
    "model": "meta-llama-3.1-8b-instruct",
    "prompt": "Python is a popular",
    "min_tokens": 20,
    "max_tokens": 30,
    "top_k": 32,
    "top_p": 0.8,
    "n": 3,
    "no_repeat_ngram": 3,
    "ngram_repetition_penalty": 1.75
  }'
```

```python Python SDK
# pip install friendli-client
# Send an inference request to Friendli Serverless Endpoints using the Python SDK.
import os
from friendli import Friendli

client = Friendli(token=os.getenv("FRIENDLI_TOKEN"))

chat_completion = client.chat.completions.create(
    model="meta-llama-3.3-70b-instruct",
    messages=[
        {
            "role": "user",
            "content": "Tell me how to make a delicious pancake"
        }
    ],
    stream=False,
)
print(chat_completion.choices[0].message.content)
```
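If you prefer to set the playground's generation settings programmatically, they correspond to request parameters of the same names. The sketch below is a minimal example; it assumes `FRIENDLI_TOKEN` is set and that the Python SDK forwards these sampling parameters (`max_tokens`, `temperature`, `top_p`) to the API just like the HTTP interface does:

```python
# Sketch: playground settings expressed as request parameters.
# Assumes FRIENDLI_TOKEN is set and that the SDK accepts the same sampling
# parameters as the HTTP API fields of the same name.
import os

from friendli import Friendli

client = Friendli(token=os.getenv("FRIENDLI_TOKEN"))

chat_completion = client.chat.completions.create(
    model="meta-llama-3.1-8b-instruct",
    messages=[
        {"role": "user", "content": "Tell me how to make a delicious pancake"}
    ],
    max_tokens=256,   # cap on the length of the generated answer
    temperature=0.7,  # higher values give more creative output
    top_p=0.9,        # nucleus sampling threshold
)
print(chat_completion.choices[0].message.content)
```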
## Additional Tips Check out the [Text Generation](/guides/serverless_endpoints/text-generation) docs for more details. **Ready to unlock the creativity of generative AI? Get started with Friendli Serverless Endpoints today!** # Rate Limits Source: https://friendli.ai/docs/guides/serverless_endpoints/rate_limit Understand the rate limits for Friendli Serverless Endpoints, including Requests per Minute (RPM) and Tokens per Minute (TPM), to ensure efficient usage of resources and balanced performance when interacting with AI models. When interacting with Friendli Serverless Endpoints, it's important to be aware of the rate limits imposed on requests. These limits are in place to regulate the number of requests made within a specified timeframe, ensuring a balanced and efficient use of resources. The rate limits are quantified using two metrics: * **RPM (Requests per Minute):** This measures the maximum number of requests allowed per minute. * **TPM (Tokens per Minute):** TPM represents the maximum estimated tokens processed per minute, providing insight into the computational load. {/* **SPM (Steps per Minute):** SPM signifies the maximum number of inference steps permitted within a minute. */} **RPM** is used for all types of generation models, while **TPM** is used only for text generation models. The information related to the rate limits is included in the response headers as follows: * In all responses * `X-RateLimit-Limit-Requests` * `X-RateLimit-Remaining-Requests` * `X-RateLimit-Reset-Requests` * In text generation responses * `X-RateLimit-Limit-Tokens` * `X-RateLimit-Remaining-Tokens` * `X-RateLimit-Reset-Tokens` {/* In image generation responses - `X-RateLimit-Limit-Steps` - `X-RateLimit-Remaining-Steps` - `X-RateLimit-Reset-Steps` */} The specific rate limits applied depend on the user's subscription plan, with higher-tier plans enjoying fewer restrictions. The following table illustrates the rate limits corresponding to each plan: | Plan | RPM | TPM | | ---------- | -------- | -------- | | Trial | 10 | 50K | | Basic | 10K | 100K | | Enterprise | No limit | No limit | The metrics are measured **per team across all models**. # Structured Outputs Source: https://friendli.ai/docs/guides/serverless_endpoints/structured-outputs Generate structured outputs using FriendliAI's Structured Outputs feature. Large language models (LLMs) excel at creative text generation, but we often face a case where we need LLM outputs to be more structured. This is where our exciting new "structured output" feature comes in. Structured Outputs is also available in [Friendli Dedicated Endpoints](https://friendli.ai/products/dedicated-endpoints) and [Friendli Container](https://friendli.ai/products/container). For more advanced use cases of our Structured Outputs feature, check out our detailed blog post on [Structured Output for LLM Agents](https://friendli.ai/blog/structured-output-llm-agents). ## Structured response modes | Type | Description | Name at OpenAI | | ------------- | ------------------------------------------------------------------- | --------------------------------------------------------------------------------------------- | | `json_schema` | The model returns a JSON object that conforms to the given schema. | [Structured Outputs](https://platform.openai.com/docs/guides/structured-outputs#introduction) | | `json_object` | The model can return any JSON object. 
| [JSON mode](https://platform.openai.com/docs/guides/structured-outputs#json-mode) | | `regex` | The model returns a string that conforms to the given regex schema. | N/A | ## How to use This guide provides a step-by-step example of how to create a structured output response in JSON form.\ In this example, we will use Python and the `pydantic` library to define a schema for the output. Define a schema that contains information about a dish. ```python from pydantic import BaseModel class Result(BaseModel): dish: str cuisine: str calories: int ``` Call structured output and use schema to structure the response. ```python {17-22} import os from openai import OpenAI client = OpenAI( base_url="https://api.friendli.ai/serverless/v1", api_key=os.getenv("FRIENDLI_TOKEN"), ) completion = client.chat.completions.create( model="meta-llama-3.1-8b-instruct", messages=[ { "role": "user", "content": "Suggest a popular Italian dish in JSON format.", }, ], response_format={ "type": "json_schema", "json_schema": { "schema": Result.model_json_schema(), } } ) ``` You can use the output in the following way. ```python response = completion.choices[0].message.content print(response) ``` The code output result is as follows. ```json Result: { "dish": "Spaghetti Bolognese", "cuisine": "Italian", "calories": 540 } ``` This example demonstrates how to generate an arbitrary JSON object response without a predefined schema. ```python {15} import os from openai import OpenAI client = OpenAI( base_url="https://api.friendli.ai/serverless/v1", api_key=os.getenv("FRIENDLI_TOKEN"), ) completion = client.chat.completions.create( model="meta-llama-3.1-8b-instruct", messages=[ {"role": "system", "content": "You MUST answer with JSON."}, {"role": "user", "content": "Generate a lasagna recipe. (very short)"}, ], response_format={"type": "json_object"}, ) print(completion.choices[0].message.content) ``` This example shows how to generate output that matches a specific regular expression pattern. ```python {17-18} import os from openai import OpenAI client = OpenAI( base_url="https://api.friendli.ai/serverless/v1", api_key=os.getenv("FRIENDLI_TOKEN"), ) completion = client.chat.completions.create( model="meta-llama-3.1-8b-instruct", messages=[ { "role": "user", "content": "조선 왕조의 첫번째 왕은 누구입니까 (Who is the first king of the Joseon Dynasty)?", }, ], # Korean characters and numbers are allowed in the response. response_format={"type": "regex", "schema": "[\n ,.?!0-9\uac00-\ud7af]*"}, ) print(completion.choices[0].message.content) ``` ## Supported JSON schemas We ensure super-fast schema-guided generation by disabling JSON schema features that cause computation inefficiencies. We support **all seven standard JSON schema types** (`null`, `boolean`, `number`, `integer`, `string`, `object`, `array`), and **the supported JSON schema keywords are listed below**. Using unsupported or unexpected JSON schema keywords may result in them being ignored, triggering an error, or causing undefined behavior. ### Type-specific keywords * `integer` * `exclusiveMinimum`, `exclusiveMaximum`, `minimum`, `maximum` (Note: these are not supported in `number`) * `string` * `pattern` * `format` * Supported values: `uuid`, `date-time`, `date`, `time` * `object` * `properties` * `additionalProperties` is ignored, and is always set to `False`. * `required`: We support both required and optional properties, but have these limitations: * The sequence of the properties is fixed. * The first property should be `required`. 
If it is not, the first required property is moved to the front.
* `array`
  * `items`
  * `minItems`: We support only `0` or `1` for `minItems`.

### Constant values and enumerated values

`const` and `enum` only support constant values of null, boolean, number, and string.

### Schema composition

We support only `anyOf` for [schema composition](https://json-schema.org/understanding-json-schema/reference/combining).

### Referencing subschemas

We only support referencing (`$ref`) to "internal" subschemas. These subschemas must be defined within `$defs`, and the value of `$ref` must be a valid URI pointing to a subschema. Please refer [here](https://json-schema.org/understanding-json-schema/structuring#defs) for more details.

### Annotation

JSON schema annotations such as `title`, `$comments` or `description` are accepted but ignored.

# Text Generation Models

Source: https://friendli.ai/docs/guides/serverless_endpoints/text-generation

Dive into the characteristics of popular Text Generation Models (TGMs) available on Friendli Serverless Endpoints.

## Unleashing the Power of Language with Friendli Serverless Endpoints

Welcome to the captivating world of Text Generation Models (TGMs)! These AI models learn from massive datasets of text and code, mimicking human language patterns to generate creative and informative outputs. Friendli Serverless Endpoints empowers you to harness the potential of several cutting-edge TGMs through its convenient interface, letting you unlock the magic of words with ease. This guide dives into the characteristics of the popular TGMs available on Friendli Serverless Endpoints:

## Model Supports

* `deepseek-r1`
* `meta-llama-3.3-70b-instruct`
* `meta-llama-3.1-8b-instruct`

Please note that the pricing for each model can be found in the [pricing section](/guides/serverless_endpoints/pricing).

## Llama 3.3 70B Instruct

* **Focus**: Engaging dialogues and interactive experiences.
* **Strengths**:
  * Natural language understanding and human-like response generation in conversational settings.
  * Maintains coherence and context throughout dialogues, fostering seamless interactions.
  * Can adapt to different conversation styles and tones.
* **Example Use Cases**:
  * Building customer service chatbots that understand natural language and offer personalized support.
  * Creating interactive storytelling experiences and AI companions.
  * Developing game AI characters with engaging back-and-forth conversations.

### Examples

When you install `friendli-client`, you can generate chat responses with the Python SDK. You must set the `FRIENDLI_TOKEN` environment variable before initializing the client instance with `client = Friendli()`.
Alternatively, you can provide the value of your Friendli Token as the `token` argument when creating the client, like so:

```python
from friendli import Friendli

client = Friendli(token="YOUR FRIENDLI TOKEN")
```

```python Default
from friendli import Friendli

client = Friendli()

chat_completion = client.chat.completions.create(
    model="meta-llama-3.3-70b-instruct",
    messages=[
        {
            "role": "user",
            "content": "Tell me how to make a delicious pancake"
        }
    ],
    stream=False,
)
print(chat_completion.choices[0].message.content)
```

```python Streaming
from friendli import Friendli

client = Friendli()

stream = client.chat.completions.create(
    model="meta-llama-3.3-70b-instruct",
    messages=[
        {
            "role": "user",
            "content": "Tell me how to make a delicious pancake"
        }
    ],
    stream=True,
)
for chunk in stream:
    print(chunk.choices[0].delta.content or "", end="")
```

```python Async
import asyncio
from friendli import AsyncFriendli

client = AsyncFriendli()

async def main() -> None:
    chat_completion = await client.chat.completions.create(
        model="meta-llama-3.3-70b-instruct",
        messages=[
            {
                "role": "user",
                "content": "Tell me how to make a delicious pancake"
            }
        ],
        stream=False,
    )
    print(chat_completion.choices[0].message.content)

asyncio.run(main())
```

```python Streaming (Async)
import asyncio
from friendli import AsyncFriendli

client = AsyncFriendli()

async def main() -> None:
    stream = await client.chat.completions.create(
        model="meta-llama-3.3-70b-instruct",
        messages=[
            {
                "role": "user",
                "content": "Tell me how to make a delicious pancake"
            }
        ],
        stream=True,
    )
    async for chunk in stream:
        print(chunk.choices[0].delta.content or "", end="")

asyncio.run(main())
```

## Beyond the Models: Generation Settings

Friendli Serverless Endpoints unlocks further customization through various generation settings, allowing you to fine-tune your Text Generation Model (TGM) outputs:

* **max\_tokens**: This defines the maximum number of tokens your TGM generates. Lower values produce concise outputs, while higher values allow for longer narratives.
* **temperature**: Think of temperature as a creativity knob. Higher values promote more imaginative and surprising outputs, while lower values favor safe and predictable responses.
* **top\_p**: This parameter governs the diversity of your output. Lower values focus on the most likely continuation, while higher values encourage exploration of less probable but potentially interesting options.

## Unleashing the Full Potential:

Friendli Serverless Endpoints removes the technical hurdles, letting you focus on exploring the magic of TGMs. Start experimenting with different models and settings, tailoring the outputs to your unique vision. Remember, practice makes perfect – the more you interact with these models, the more you'll understand their strengths and discover the incredible possibilities they hold.

#### Ready to embark on your text generation journey?

Friendli Serverless Endpoints is your gateway to a world of boundless creativity and innovative applications. Sign up today and let the words flow!

# Tool Assisted API (Beta)

Source: https://friendli.ai/docs/guides/serverless_endpoints/tool-assisted-api

Tool Assisted API enhances a model's capabilities by integrating tools that extend its functionality beyond simple conversational interactions. By using this API, the model becomes more dynamic, providing more comprehensive and actionable responses. Currently, Friendli Serverless Endpoints supports a variety of built-in tools specifically designed for Chat Completion tasks.
export const ToolIcon = () => { return ; }; export const ChatIcon = () => { return ; }; export const RoundedBorderBox = ({children, caption}) =>
{children} {caption && {caption}}
; ## What is Tool Assisted API? **Tool Assisted API** enhances a model's capabilities by integrating **tools** that extend its functionality beyond simple conversational interactions. By using this API, the model becomes more dynamic, providing more comprehensive and actionable responses. Currently, **[Friendli Serverless Endpoints](/guides/serverless_endpoints/introduction)** supports a variety of built-in tools specifically designed for **Chat Completion** tasks. *** ### What is Chat Completion? **[Chat completion](/openapi/serverless/chat-completions)** refers to a model's ability to generate responses in a conversation. Given a sequence of messages or conversation turns, the model processes the input and generates a response based on its internal knowledge and training data. * **Example**: * **User**: "What is the capital of France?" * **Model**: "The capital of France is Paris." However, chat completion has its limitations—it is restricted to the knowledge the model has learned during its training and cannot access real-time or external data. *** ### Is Chat Completion Different from Tool Assisted Chat Completion? Yes, **[Tool Assisted Chat Completion](/openapi/serverless/tool-assisted-chat-completions)** goes beyond basic chat completion by integrating external tools to enhance the conversation. This allows the model to access real-time data, perform specific tasks, and interact with external systems in ways that chat completion alone cannot achieve. * **Example**: * **User**: "What is the weather today?" * **Model without Tool Access**: Relies on pre-learned information, potentially giving outdated or generalized answers. * **Model with Tool Access**: Calls a weather API to retrieve live data and responds: "The weather today in New York is 72°F with clear skies." With tool access, the model provides a more accurate and up-to-date response. Additionally, some tasks—such as file processing or complex calculations—cannot be performed by the model alone but can be handled with the help of tools. * **Example**: * **User**: "Can you extract the text from this document?" (provides a file) * **Model without Tool Access**: "I cannot extract data from files directly." * **Model with Tool Access**: Extracts the text from the provided file and responds: "Using the `file:text` tool, I've extracted the following text: \[Text from the file]." When no tools are specified, the model will respond using only its internal knowledge. *** ### Benefits of Tool Assisted Chat Completion Tool Assisted Chat Completion offers several advantages over basic chat completion: * **Real-Time Data Access**: The model can pull live information. * **Extended Capabilities**: The model can perform complex tasks like running calculations, executing code, extracting text from files, and interacting with databases and APIs. *** ### Comparison: Chat Completion vs. Tool Assisted Chat Completion
***

### Comparison: Chat Completion vs. Tool Assisted Chat Completion

| Feature           | **Chat Completion**                               | **Tool Assisted Chat Completion**                                     |
| ----------------- | ------------------------------------------------- | ---------------------------------------------------------------------- |
| **Response Type** | Based on internal knowledge                       | Uses external tools for enhanced, real-time responses                  |
| **Capabilities**  | Limited to pre-learned knowledge                  | Can interact with tools for data retrieval and task execution          |
| **Example**       | "What is the weather today?" (general knowledge)  | "What is the weather today?" (live API result)                         |
| **Use Cases**     | General conversation and Q\&A                     | Complex tasks like real-time updates, data analysis, file processing   |

***

## Built-In Tools

Tool Assisted API automatically selects the best tool to perform an action based on user input when a specific tool is enabled. These tools can handle various operations, such as calculations, statistical analysis, web search, file content extraction, and code execution.

Below is a more detailed description of the available tools in Tool Assisted API and when they are typically used:

### `math:calculator`

**Description:** Performs basic arithmetic operations like addition, subtraction, multiplication, and division, as well as more complex calculations like square roots or exponents. It is useful for any task requiring mathematical computation.

**When Used:** Automatically called when mathematical expressions or calculations are required. Whether you're solving equations, calculating percentages, or handling financial calculations, this tool performs the task for you.

### `math:statistics`

**Description:** Performs statistical analysis, including calculating mean, median, mode, standard deviation, and correlations. It is tailored for situations where you need to analyze or interpret numeric datasets to understand trends or patterns.

**When Used:** Automatically called when analyzing numeric data or generating insights from datasets, like summarizing survey results or calculating probabilities.

### `math:calendar`

**Description:** Handles date-related data, such as calculating date differences or finding specific days in the past or future. It is effective in managing and manipulating calendar-based information.

**When Used:** Automatically called when operations involving dates or time spans are required, such as figuring out how many days are left until an event, determining the day of the week for a specific date, or calculating the duration between two dates.

### `web:search`

**Description:** Retrieves information from the web based on search queries. It fetches information based on keywords and helps gather knowledge or insights from online sources.

**When Used:** Automatically called when you ask questions or seek information that requires external research or the latest data from the web. Whether it is looking up definitions, recent news, or general web searches, this tool handles such tasks effectively.

### `web:url`

**Description:** Extracts specific data from a given website. You can provide a URL, and the tool will fetch the relevant content, including text, metadata, or other embedded information, from that web page.

**When Used:** Automatically called when extracting content from a provided URL, such as fetching text from articles or blog posts.

### `code:python-interpreter`

**Description:** Executes Python code directly within the platform for custom scripts, data processing, or automation.
You can run Python scripts, test snippets of code, or automate tasks through coding logic.

**When Used:** Automatically called when tasks involve writing or running Python scripts, such as custom data manipulations or logic-based automation.

### `file:text`

**Description:** Reads and extracts text from files, supporting only `.txt` and `.pdf` formats. To use this tool, you must provide the file IDs. (For now, only one file is supported.) After uploading a file in the playground, you can copy the file ID by clicking on the files icon in the left sidebar and selecting the option from the dropdown menu next to the uploaded file.

**When Used:** Automatically called when text extraction from a file is requested, such as pulling content from documents or reports.

## Conclusion

* **Chat Completion**: Best for general conversations that rely on the model's pre-existing knowledge.
* **Tool Assisted Chat Completion**: Ideal for real-time, dynamic tasks and more advanced interactions, leveraging external tools to enhance functionality.

***

## Explore APIs

To get started with Tool Assisted Chat Completion, follow this tutorial: **[Tool calling with Serverless Endpoints](/guides/tutorials/tool-calling-with-serverless-endpoints)**.

For more details, check out the API reference documentation below:

* [Chat completions](/openapi/serverless/chat-completions): Discover how to generate text through interactive conversations.
* [Tool assisted chat completions](/openapi/serverless/tool-assisted-chat-completions): Learn how to enhance responses with tool assisted chat completions using built-in tools.

# Build an agent with Gradio

Source: https://friendli.ai/docs/guides/tutorials/build-an-agent-with-gradio

Build and deploy smart AI agents with Friendli Serverless Endpoints and Gradio in under 50 lines.

## Goals

* Build your own AI agent using [**Friendli Serverless Endpoints**](https://friendli.ai/products/serverless-endpoints) and [**Gradio**](https://www.gradio.app) in under 50 lines of code 🤖
* Use tool calling to make your agent even smarter 🤩
* Share your AI agent with the world and gather feedback 🌎

> [**Gradio**](https://www.gradio.app) is the fastest way to demo your model with a friendly web interface.

## Getting Started

1. Head to [**https://suite.friendli.ai**](https://suite.friendli.ai/get-started/serverless-endpoints), and create an account.
2. Grab a [FRIENDLI\_TOKEN](https://suite.friendli.ai/default-team/settings/tokens) to use Friendli Serverless Endpoints within an agent.

## 🚀 Step 1. Prerequisite

Install dependencies.

```
pip install openai gradio
```

## 🚀 Step 2. Launch your agent

Build your own AI agent using **Friendli Serverless Endpoints** and **Gradio**.

* Gradio provides a `ChatInterface` that implements a chatbot UI running the `chat_function`.
* More information about the *chat\_function(message, history)*:

  > *The input function should accept two parameters: a string input message and list of two-element lists of the form \[\[user\_message, bot\_message], ...] representing the chat history, and return a string response.*
* Implement the `chat_function` using Friendli Serverless Endpoints.
  * Here, we used the `meta-llama-3.3-70b-instruct` model.
  * Feel free to explore other available models [here](/guides/serverless_endpoints/text-generation#model-supports).

```python
from openai import OpenAI
import gradio as gr

friendli_client = OpenAI(
    base_url="https://api.friendli.ai/serverless/v1",
    api_key="YOUR FRIENDLI TOKEN"
)

def chat_function(message, history):
    messages = []
    for user, chatbot in history:
        messages.append({"role": "user", "content": user})
        messages.append({"role": "assistant", "content": chatbot})
    messages.append({"role": "user", "content": message})

    stream = friendli_client.chat.completions.create(
        model="meta-llama-3.3-70b-instruct",
        messages=messages,
        stream=True
    )
    res = ""
    for chunk in stream:
        res += chunk.choices[0].delta.content or ""
        yield res

css = """
.gradio-container { max-width: 800px !important; margin-top: 100px !important; }
.pending { display: none !important; }
.sm { box-shadow: None !important; }
#component-2 { height: 400px !important; }
"""

with gr.Blocks(theme=gr.themes.Soft(), css=css) as friendli_agent:
    gr.ChatInterface(chat_function)

friendli_agent.launch()
```

## 🚀 Step 3. Tool Calling (Advanced)

Use tool calling to make your agent even smarter! As an example, we will show you how to make your agent search the web before answering.

* Change the `base_url` to `https://api.friendli.ai/serverless/tools/v1`
* Add the `tools` parameter when calling the chat completion API

```python
from openai import OpenAI
import gradio as gr

friendli_client = OpenAI(
    base_url="https://api.friendli.ai/serverless/tools/v1",
    api_key="YOUR FRIENDLI TOKEN"
)

def chat_function(message, history):
    messages = []
    for user, chatbot in history:
        messages.append({"role": "user", "content": user})
        messages.append({"role": "assistant", "content": chatbot})
    messages.append({"role": "user", "content": message})

    stream = friendli_client.chat.completions.create(
        model="meta-llama-3.3-70b-instruct",
        messages=messages,
        stream=True,
        tools=[{"type": "web:search"}],
    )
    res = ""
    for chunk in stream:
        if chunk.choices is None:
            yield "Waiting for tool response..."
        else:
            res += chunk.choices[0].delta.content or ""
            yield res

css = """
.gradio-container { max-width: 800px !important; margin-top: 100px !important; }
.pending { display: none !important; }
.sm { box-shadow: None !important; }
#component-2 { height: 400px !important; }
"""

with gr.Blocks(theme=gr.themes.Soft(), css=css) as agent:
    gr.ChatInterface(chat_function)

agent.launch()
```

Here is the list of available built-in tools (beta). Feel free to build your agent using the tools below.

* `math:calculator` (tool for calculating arithmetic operations)
* `math:statistics` (tool for analyzing statistic data)
* `math:calendar` (tool for handling date-related data)
* `web:search` (tool for retrieving data through the web search)
* `web:url` (tool for extracting data from a given website)
* `code:python-interpreter` (tool for writing and executing python code)
* `file:text` (tool for extracting text data from a given file)

## 🚀 Step 4. Deploy your agent

For a temporary deployment, change the last line of the code.

```python
agent.launch(share=True)
```

For a permanent deployment, you can use [Hugging Face Spaces](https://huggingface.co/spaces)!

# Build an agent with LangChain

Source: https://friendli.ai/docs/guides/tutorials/build-an-agent-with-langchain

Build an AI agent with LangChain and Friendli Serverless Endpoints, integrating tool calling for dynamic and efficient responses.

## Introduction

This article walks you through creating an agent using LangChain and Friendli Serverless Endpoints.
## Setup

```bash
pip install -qU langchain-openai langchain-community langchain wikipedia
```

Get your Friendli Token from [https://suite.friendli.ai/](https://suite.friendli.ai/) to authenticate your requests.

```python
import getpass
import os

if not os.environ.get("FRIENDLI_TOKEN"):
    os.environ["FRIENDLI_TOKEN"] = getpass.getpass("Enter your Friendli Token: ")
```

## Instantiation

```python
from langchain_openai import ChatOpenAI

llm = ChatOpenAI(
    model="meta-llama-3.1-8b-instruct",
    base_url="https://api.friendli.ai/serverless/v1",
    api_key=os.environ["FRIENDLI_TOKEN"],
)
```

## Create Agent with LangChain

### Step 1. Create Tool

```python
from langchain_community.tools import WikipediaQueryRun
from langchain_community.utilities import WikipediaAPIWrapper

api_wrapper = WikipediaAPIWrapper(top_k_results=1, doc_content_chars_max=100)
wiki = WikipediaQueryRun(api_wrapper=api_wrapper)

tools = [wiki]
```

### Step 2. Create Prompt

```python
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder

prompt = ChatPromptTemplate.from_messages(
    [
        ("system", "You are a helpful assistant"),
        MessagesPlaceholder("chat_history"),
        ("user", "{input}"),
        ("placeholder", "{agent_scratchpad}"),
    ]
)
prompt.messages
```

### Step 3. Create Agent

```python
from langchain.agents import AgentExecutor
from langchain.agents import create_tool_calling_agent

agent = create_tool_calling_agent(llm, tools, prompt)
agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True)
```

### Step 4. Run the Agent

```python
chat_history = []

while True:
    user_input = input("Enter your message: ")
    result = agent_executor.invoke(
        {"input": user_input, "chat_history": chat_history},
    )
    chat_history.append({"role": "user", "content": user_input})
    chat_history.append({"role": "assistant", "content": result["output"]})
```

When you run the code, it waits for your input. After you enter a message, it processes the request and prints the result. When you ask about a topic covered on Wikipedia, it automatically calls the wikipedia tool and outputs the result.

```text final result
Enter your Friendli Token: ··········
Enter your message: hello

> Entering new AgentExecutor chain...
Hello, it's nice to meet you. I'm here to help with any questions or topics you'd like to discuss. Is there something in particular you'd like to talk about, or do you need assistance with something?

> Finished chain.
Enter your message: What does the Linux kernel do?

> Entering new AgentExecutor chain...
Invoking: `wikipedia` with `{'query': 'Linux kernel'}`
responded: The Linux kernel is the core component of the Linux operating system. It acts as a bridge between the computer hardware and the user space applications. The kernel manages the system's hardware resources, such as memory, CPU, and I/O devices. It provides a set of interfaces and APIs that allow user space applications to interact with the hardware.

Page: Linux kernel
Summary: The Linux kernel is a free and open source,: 4  UNIX-like kernel that is
The Linux kernel is a free and open source, UNIX-like kernel that is responsible for managing the system's hardware resources, such as memory, CPU, and I/O devices. It provides a set of interfaces and APIs that allow user space applications to interact with the hardware. The kernel is the core component of the Linux operating system, and it plays a crucial role in ensuring the stability and security of the system.

> Finished chain.
Enter your message: ``` ## Full Example Code ```python import getpass import os from langchain_openai import ChatOpenAI from langchain_community.tools import WikipediaQueryRun from langchain_community.utilities import WikipediaAPIWrapper from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder from langchain.agents import AgentExecutor from langchain.agents import create_tool_calling_agent if not os.environ.get("FRIENDLI_TOKEN"): os.environ["FRIENDLI_TOKEN"] = getpass.getpass("Enter your Friendli Token: ") llm = ChatOpenAI( model="meta-llama-3.1-8b-instruct", base_url="https://api.friendli.ai/serverless/v1", api_key=os.environ["FRIENDLI_TOKEN"], ) api_wrapper = WikipediaAPIWrapper(top_k_results=1, doc_content_chars_max=100) wiki = WikipediaQueryRun(api_wrapper=api_wrapper) tools = [wiki] # Get the prompt to use - you can modify this! prompt = ChatPromptTemplate.from_messages( [ ("system", "You are a helpful assistant"), MessagesPlaceholder("chat_history"), ("user", "{input}"), ("placeholder", "{agent_scratchpad}"), ] ) agent = create_tool_calling_agent(llm, tools, prompt) agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True) chat_history = [] while True: user_input = input("Enter your message: ") result = agent_executor.invoke( {"input": user_input, "chat_history": chat_history}, ) chat_history.append({"role": "user", "content": user_input}) chat_history.append({"role": "assistant", "content": result["output"]}) ``` # Chat docs with LangChain Source: https://friendli.ai/docs/guides/tutorials/chat-docs-with-langchain You can view the content [here](https://friendli.ai/blog/chatdocs-rag-friendli-langchain). # Chat docs with MongoDB Source: https://friendli.ai/docs/guides/tutorials/chat-docs-with-mongodb You can view the content [here](https://friendli.ai/blog/rag-chatbot-friendli-mongodb-atlas-langchain). # Go Playground with Next.js Source: https://friendli.ai/docs/guides/tutorials/go-playground-with-nextjs You can view the content [here](https://friendli.ai/blog/vercel-ai-sdk-playground-tutorial). # RAG app with LlamaIndex Source: https://friendli.ai/docs/guides/tutorials/rag-app-with-llamaindex You can view the content [here](https://friendli.ai/blog/llamaindex-rag-app-friendli-engine). # Tool calling with Serverless Endpoints Source: https://friendli.ai/docs/guides/tutorials/tool-calling-with-serverless-endpoints Build AI agents with Friendli Serverless Endpoints using tool calling for dynamic, real-time interactions with LLMs. ## Goals * Use tool calling to build your own AI agent with [**Friendli Serverless Endpoints**](https://friendli.ai/products/serverless-endpoints) * Check out the examples below to see how you can interact with state-of-the-art language models while letting them search the web, run Python code, etc. * Feel free to make your own custom tools! ## Getting Started 1. Head to [**https://suite.friendli.ai**](https://suite.friendli.ai/get-started/serverless-endpoints), and create an account. 2. Grab a [FRIENDLI\_TOKEN](https://suite.friendli.ai/default-team/settings/tokens) to use Friendli Serverless Endpoints within an agent. ## 🚀 Step 1. Playground UI Experience tool calling on the Playground 1. On your dashboard, click the "Go to Playground" button of **Friendli Serverless Endpoints** 2. Choose a model that best aligns with your desired use case. 3. Click a `web:search` tool calling example and see the response. 😀 ## 🚀 Step 2. Tool Calling Search interesting information using the `web:search` tool. 
This time, let's try it by writing Python code.

1. Turn on the `web:search` tool on the playground.
2. Ask something interesting!

```
Find information on the popular movies currently showing in theaters and provide their ratings.
```

3. Click the "View code" button to use the tool calling in Python/JavaScript.
4. Copy and paste the code into your IDE.
5. Click [**here**](https://suite.friendli.ai/default-team/settings/tokens) to generate a Friendli Token.
6. Fill in the token value in the copied code and run it.

## 🚀 Step 3. Multiple tool calling

Use multiple tools at once to calculate "how long it will take you to buy a house in the San Francisco Bay Area based on your annual salary".

Here is the list of available built-in tools (beta).

* `math:calculator` (tool for calculating arithmetic operations)
* `math:statistics` (tool for analyzing statistic data)
* `math:calendar` (tool for handling date-related data)
* `web:search` (tool for retrieving data through the web search)
* `web:url` (tool for extracting data from a given website)
* `code:python-interpreter` (tool for writing and executing python code)
* `file:text` (tool for extracting text data from a given file)

### Example Answer sheet

```
Prompt:
My annual salary is $100k. How long will it take to buy a house in the San Francisco Bay Area?
(`web:search` & `math:calculator` used)

Answer:
Based on the web search results, the median price of an existing single-family home in the Bay Area is around $1.2 million.
Using a calculator to calculate how long it would take to buy a house in the San Francisco Bay Area with an annual salary of $100,000, we get:

$1,200,000 (house price) / $100,000 (annual salary) = 12 years

So, it would take approximately 12 years to buy a house in the San Francisco Bay Area with an annual salary of $100,000, assuming you save your entire salary each year and don't consider other factors like interest rates, taxes, and living expenses.
```
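
A minimal sketch of sending such a request from code, assuming multiple built-in tools can be enabled in the same `tools` list used in the earlier Gradio example (the token is a placeholder):

```python
from openai import OpenAI

client = OpenAI(
    base_url="https://api.friendli.ai/serverless/tools/v1",
    api_key="YOUR FRIENDLI TOKEN",  # placeholder
)

completion = client.chat.completions.create(
    model="meta-llama-3.3-70b-instruct",
    messages=[
        {
            "role": "user",
            "content": "My annual salary is $100k. How long will it take to buy a house in the San Francisco Bay Area?",
        }
    ],
    # Assumption: several built-in tools may be enabled in one request.
    tools=[{"type": "web:search"}, {"type": "math:calculator"}],
)
print(completion.choices[0].message.content)
```
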
## 🚀 Step 4. Build a custom tool

Build your own creative tool. We will show you how to make a custom tool that retrieves temperature information. (The completed code snippet is provided at the bottom.)

1. **Define a function to use as a custom tool**

```python
def get_temperature(location: str) -> int:
    """Mock function that returns the city temperature"""
    if "new york" in location.lower():
        return 45
    if "san francisco" in location.lower():
        return 72
    return 30
```

2. **Send a function calling inference request**

   1. Add the user's input as a `user` role message.
   2. The information about the custom function (e.g., `get_temperature`) goes into the `tools` option. The function's parameters are described in JSON schema.
   3. The response includes the `arguments` field, which contains the values extracted from the user's input that can be used as parameters of the custom function.

```python
import os

from friendli import Friendli

token = os.environ.get("FRIENDLI_TOKEN") or "YOUR_FRIENDLI_TOKEN"
client = Friendli(token=token)

user_prompt = "I live in New York. What should I wear for today's weather?"

messages = [
    {
        "role": "user",
        "content": user_prompt,
    },
]

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_temperature",
            "description": "Get the temperature information in a given location.",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The name of current location e.g., New York",
                    },
                },
            },
        },
    },
]

chat = client.chat.completions.create(
    model="meta-llama-3.3-70b-instruct",
    messages=messages,
    tools=tools,
    temperature=0,
    frequency_penalty=1,
)
print(chat)
```

3. **Generate the final response using the tool calling results**

   1. Add the `tool_calls` response as an `assistant` role message.
   2. Add the result obtained by calling the `get_temperature` function as a `tool` message, then call the Chat API again.

```python
import json

func_kwargs = json.loads(chat.choices[0].message.tool_calls[0].function.arguments)
temperature_info = get_temperature(**func_kwargs)

messages.append(
    {
        "role": "assistant",
        "tool_calls": [
            tool_call.model_dump()
            for tool_call in chat.choices[0].message.tool_calls
        ]
    }
)
messages.append(
    {
        "role": "tool",
        "content": str(temperature_info),
        "tool_call_id": chat.choices[0].message.tool_calls[0].id
    }
)

chat_w_info = client.chat.completions.create(
    model="meta-llama-3.3-70b-instruct",
    tools=tools,
    messages=messages,
)

for choice in chat_w_info.choices:
    print(choice.message.content)
```

* **Complete Code Snippet**

```python
from friendli import Friendli
import json
import os

token = os.environ.get("FRIENDLI_TOKEN") or "YOUR_FRIENDLI_TOKEN"
client = Friendli(token=token)

user_prompt = "I live in New York. What should I wear for today's weather?"

messages = [
    {
        "role": "user",
        "content": user_prompt,
    },
]

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_temperature",
            "description": "Get the temperature information in a given location.",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The name of current location e.g., New York",
                    },
                },
            },
        },
    },
]

chat = client.chat.completions.create(
    model="meta-llama-3.3-70b-instruct",
    messages=messages,
    tools=tools,
    temperature=0,
    frequency_penalty=1,
)

def get_temperature(location: str) -> int:
    """Mock function that returns the city temperature"""
    if "new york" in location.lower():
        return 45
    if "san francisco" in location.lower():
        return 72
    return 30

func_kwargs = json.loads(chat.choices[0].message.tool_calls[0].function.arguments)
temperature_info = get_temperature(**func_kwargs)

messages.append(
    {
        "role": "assistant",
        "tool_calls": [
            tool_call.model_dump()
            for tool_call in chat.choices[0].message.tool_calls
        ]
    }
)
messages.append(
    {
        "role": "tool",
        "content": str(temperature_info),
        "tool_call_id": chat.choices[0].message.tool_calls[0].id
    }
)

chat_w_info = client.chat.completions.create(
    model="meta-llama-3.3-70b-instruct",
    tools=tools,
    messages=messages,
)

for choice in chat_w_info.choices:
    print(choice.message.content)
```

## 🎉 Congratulations!

Following the above instructions, we've experienced the whole process of defining and using a custom tool to generate an accurate and rich answer from LLM models!

Brainstorm creative ideas for your agent by reading our blog articles!
* [**Building an AI Agent for Google Calendar**](https://friendli.ai/blog/ai-agent-google-calendar) * [**Hassle-free LLM Fine-tuning with FriendliAI and Weights & Biases**](https://friendli.ai/blog/llm-fine-tuning-friendliai-wandb) * [**Building AI Agents Using Function Calling with LLMs**](https://friendli.ai/blog/ai-agents-function-calling) * [**Function Calling: Connecting LLMs with Functions and APIs**](https://friendli.ai/blog/llm-function-calling) # Vision Source: https://friendli.ai/docs/guides/vision Guide to using Friendli's Vision feature for image analysis. Covers usage via Playground and API (URL & Base64 examples). The Vision feature is available when the model supports vision capabilities. Friendli is equipped with a new Vision feature that can understand and analyze images, opening up exciting possibilities for multimodal interactions. This guide explains how to work with images in Friendli, including best practices and code examples. ### How to Use Vision Utilize Friendli's Vision features through the following: * Select and test a vision model at [friendli.ai/playground](https://friendli.ai/playground). * Use the API to process images and receive the model's responses, referring to the methods described in this document. ### Using the API ```python URL-based image {22} import os from openai import OpenAI client = OpenAI( base_url="https://api.friendli.ai/dedicated/v1", api_key=os.environ.get("FRIENDLI_TOKEN"), ) image_url = "https://upload.wikimedia.org/wikipedia/commons/9/9e/Ours_brun_parcanimalierpyrenees_1.jpg" completion = client.chat.completions.create( # Replace YOUR_ENDPOINT_ID with the ID of your endpoint, e.g. "zbimjgovmlcb" model="YOUR_ENDPOINT_ID", messages=[ { "role": "user", "content": [ { "type": "text", "text": "What kind of animal is shown in the image?", }, {"type": "image_url", "image_url": {"url": image_url}}, ], }, ], ) print(completion.choices[0].message.content) ``` ```python Base64-encoded image {28-30} import base64, requests, os from openai import OpenAI client = OpenAI( base_url="https://api.friendli.ai/dedicated/v1", api_key=os.environ.get("FRIENDLI_TOKEN"), ) image_url = "https://upload.wikimedia.org/wikipedia/commons/9/9e/Ours_brun_parcanimalierpyrenees_1.jpg" image_media_type = "image/jpg" image_base64 = base64.standard_b64encode(requests.get(image_url).content).decode( "utf-8" ) completion = client.chat.completions.create( # Replace YOUR_ENDPOINT_ID with the ID of your endpoint, e.g. "zbimjgovmlcb" model="YOUR_ENDPOINT_ID", messages=[ { "role": "user", "content": [ { "type": "text", "text": "What kind of animal is shown in the image?", }, { "type": "image_url", "image_url": { "url": f"data:{image_media_type};base64,{image_base64}" }, }, ], }, ], ) print(completion.choices[0].message.content) ``` # Container chat completions Source: https://friendli.ai/docs/openapi/container/chat-completions post /v1/chat/completions Given a list of messages forming a conversation, the model generates a response. When streaming mode is used (i.e., `stream` option is set to `true`), the response is in MIME type `text/event-stream`. Otherwise, the content type is `application/json`. You can view the schema of the streamed sequence of chunk objects in streaming mode [here](/openapi/container/chat-completions-chunk-object). # Container chat completions chunk object Source: https://friendli.ai/docs/openapi/container/chat-completions-chunk-object Represents a streamed chunk of a chat completions response returned by model, based on the provided input. 
```json Response data: { "id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941", "object": "chat.completion.chunk", "choices": [ { "index": 0, "delta": { "role": "assistant", "content": "This" }, "finish_reason": null, "logprobs": null } ], "usage": null, "created": 1726294381 } data: { "id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941", "object": "chat.completion.chunk", "choices": [ { "index": 0, "delta": { "content": " is" }, "finish_reason": null, "logprobs": null } ], "usage": null, "created": 1726294381 } ... data: { "id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941", "object": "chat.completion.chunk", "choices": [ { "index": 0, "delta": {}, "finish_reason": "stop", "logprobs": null } ], "usage": null, "created": 1726294383 } data: { "id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941", "object": "chat.completion.chunk", "choices": [], "usage": { "prompt_tokens": 8, "completion_tokens": 4, "total_tokens": 12 }, "created": 1726294402 } data: [DONE] ``` ```json With tools data: { "id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941", "object": "chat.completion.chunk", "choices": [ { "index": 0, "delta": { "role": "assistant", "content": "This" }, "finish_reason": null, "logprobs": null } ], "usage": null, "created": 1726294442 } ... data: { "id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941", "object": "chat.completion.chunk", "choices": [ { "index": 0, "delta": { "role": "assistant", "tool_calls": [ { "index": 0, "id": "call_TARbemDG9CFdwuoaQBTRXiYK", "type": "function", "function": { "name": "func", "arguments": "{\"" } } ] }, "finish_reason": null, "logprobs": null } ], "usage": null, "created": 1726294442 } data: { "id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941", "object": "chat.completion.chunk", "choices": [ { "index": 0, "delta": { "role": "assistant", "tool_calls": [ { "index": 0, "type": "function", "function": { "arguments": "arg" } } ] }, "finish_reason": null, "logprobs": null } ], "usage": null, "created": 1726294442 } ... data: { "id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941", "object": "chat.completion.chunk", "choices": [ { "index": 0, "delta": { "role": "assistant", "tool_calls": [ { "index": 0, "type": "function", "function": { "arguments": "}" } } ] }, "finish_reason": null, "logprobs": null } ], "usage": null, "created": 1726294442 } data: { "id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941", "object": "chat.completion.chunk", "choices": [ { "index": 0, "delta": {}, "finish_reason": "tool_calls", "logprobs": null } ], "usage": null, "created": 1726294442 } data: { "id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941", "object": "chat.completion.chunk", "choices": [], "usage": { "prompt_tokens": 468, "completion_tokens": 59, "total_tokens": 527 }, "created": 1726294443 } data: [DONE] ``` A unique ID of the chat completion. The object type, which is always set to `chat.completion.chunk`. The model to generate the completion. The index of the choice in the list of generated choices. Role of the generated message author, in this case `assistant`. The contents of the assistant message. The index of tool call being generated. The ID of the tool call. The type of the tool, which is always set to `function`. The name of the function to call. The arguments for calling the function, generated by the model in JSON format. Ensure to validate these arguments in your code before invoking the function since the model may not always produce valid JSON. Termination condition of the generation. 
`stop` means the API returned the full chat completions generated by the model without running into any limits. `length` means the generation exceeded `max_tokens` or the conversation exceeded the max context length. `tool_calls` means the API has generated tool calls. Available options: `stop`, `length`, `tool_calls` Log probability information for the choice. A list of message content tokens with log probability information. The token. The log probability of this token. A list of integers representing the UTF-8 bytes representation of the token. Useful in instances where characters are represented by multiple tokens and their byte representations must be combined to generate the correct text representation. Can be `null` if there is no bytes representation for the token. List of the most likely tokens and their log probability, at this token position. The token. The log probability of this token. A list of integers representing the UTF-8 bytes representation of the token. Useful in instances where characters are represented by multiple tokens and their byte representations must be combined to generate the correct text representation. Can be `null` if there is no bytes representation for the token. Number of tokens in the prompt. Number of tokens in the generated chat completions. Total number of tokens used in the request (`prompt_tokens` + `completion_tokens`). The Unix timestamp (in seconds) for when the token sampled. # Container completions Source: https://friendli.ai/docs/openapi/container/completions post /v1/completions Generate text based on the given text prompt. When streaming mode is used (i.e., `stream` option is set to `true`), the response is in MIME type `text/event-stream`. Otherwise, the content type is `application/json`. You can view the schema of the streamed sequence of chunk objects in streaming mode [here](/openapi/container/completions-chunk-object). # Container completions chunk object Source: https://friendli.ai/docs/openapi/container/completions-chunk-object Represents a streamed chunk of a completions response returned by model, based on the provided input. ```json Response data: { "id": "cmpl-26a1e10db8544bc3adb488d2d205288b", "object": "text_completion", "choices": [ { "index": 0, "text": " such", "token": 1778, "finish_reason": null, "logprobs": null } ], "created": 1733382157 } data: { "id": "cmpl-26a1e10db8544bc3adb488d2d205288b", "object": "text_completion", "choices": [ { "index": 0, "text": " as", "token": 439, "finish_reason": null, "logprobs": null } ], "created": 1733382157 } ... data: { "id": "cmpl-26a1e10db8544bc3adb488d2d205288b", "object": "text_completion", "choices": [ { "index": 0, "text": "", "finish_reason": "length", "logprobs": null } ], "created": 1733382157 } data: { "id": "cmpl-26a1e10db8544bc3adb488d2d205288b", "object": "text_completion", "choices": [], "usage": { "prompt_tokens": 5, "completion_tokens": 10, "total_tokens": 15 }, "created": 1733382157 } data: [DONE] ``` A unique ID of the completion. The object type, which is always set to `text_completion`. The model to generate the completion. The index of the choice in the list of generated choices. The text. The token. Termination condition of the generation. `stop` means the API returned the full completions generated by the model without running into any limits. `length` means the generation exceeded `max_tokens` or the conversation exceeded the max context length. Available options: `stop`, `length` Log probability information for the choice. 
The starting character position of each token in the generated text, useful for mapping tokens back to their exact location for detailed analysis. The log probabilities of each generated token, indicating the model's confidence in selecting each token. A list of individual tokens generated in the completion, representing segments of text such as words or pieces of words. A list of dictionaries, where each dictionary represents the top alternative tokens considered by the model at a specific position in the generated text, along with their log probabilities. The number of items in each dictionary matches the value of `logprobs`. Number of tokens in the prompt. Number of tokens in the generated completions. Total number of tokens used in the request (`prompt_tokens` + `completion_tokens`). The Unix timestamp (in seconds) for when the token sampled. # Container detokenization Source: https://friendli.ai/docs/openapi/container/detokenization post /v1/detokenize By giving a list of tokens, generate a detokenized output text string. # Container image generations (Beta) Source: https://friendli.ai/docs/openapi/container/image-generations post /v1/images/generations Given a description, the model generates image. # Container overview Source: https://friendli.ai/docs/openapi/container/overview OpenAPI reference of Friendli Container API. ### Inference Discover how to generate text through interactive conversations. Learn how to generate text. Explore the process of breaking down text into smaller tokens for machine processing. Learn how to reconstruct tokenized text back into its original, human-readable form. # Container tokenization Source: https://friendli.ai/docs/openapi/container/tokenization post /v1/tokenize By giving a text input, generate a tokenized output of token IDs. # Dedicated chat completions Source: https://friendli.ai/docs/openapi/dedicated/chat-completions post /dedicated/v1/chat/completions Given a list of messages forming a conversation, the model generates a response. To successfully run an inference request, it is mandatory to enter a **Friendli Token** (e.g. flp\_XXX) value in the **Bearer Token** field. Refer to the [authentication section](/openapi/introduction#authentication) on our introduction page to learn how to acquire this variable and [visit here](https://suite.friendli.ai/default-team/settings/tokens) to generate your token. When streaming mode is used (i.e., `stream` option is set to `true`), the response is in MIME type `text/event-stream`. Otherwise, the content type is `application/json`. You can view the schema of the streamed sequence of chunk objects in streaming mode [here](/openapi/dedicated/chat-completions-chunk-object). # Dedicated chat completions chunk object Source: https://friendli.ai/docs/openapi/dedicated/chat-completions-chunk-object Represents a streamed chunk of a chat completions response returned by model, based on the provided input. ```json Response data: { "id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941", "model": "(endpoint-id)", "object": "chat.completion.chunk", "choices": [ { "index": 0, "delta": { "role": "assistant", "content": "This" }, "finish_reason": null, "logprobs": null } ], "usage": null, "created": 1726294381 } data: { "id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941", "model": "(endpoint-id)", "object": "chat.completion.chunk", "choices": [ { "index": 0, "delta": { "content": " is" }, "finish_reason": null, "logprobs": null } ], "usage": null, "created": 1726294381 } ... 
data: { "id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941", "model": "(endpoint-id)", "object": "chat.completion.chunk", "choices": [ { "index": 0, "delta": {}, "finish_reason": "stop", "logprobs": null } ], "usage": null, "created": 1726294383 } data: { "id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941", "model": "(endpoint-id)", "object": "chat.completion.chunk", "choices": [], "usage": { "prompt_tokens": 8, "completion_tokens": 4, "total_tokens": 12 }, "created": 1726294402 } data: [DONE] ``` ```json With tools data: { "id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941", "model": "(endpoint-id)", "object": "chat.completion.chunk", "choices": [ { "index": 0, "delta": { "role": "assistant", "content": "This" }, "finish_reason": null, "logprobs": null } ], "usage": null, "created": 1726294442 } ... data: { "id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941", "model": "(endpoint-id)", "object": "chat.completion.chunk", "choices": [ { "index": 0, "delta": { "role": "assistant", "tool_calls": [ { "index": 0, "id": "call_TARbemDG9CFdwuoaQBTRXiYK", "type": "function", "function": { "name": "func", "arguments": "{\"" } } ] }, "finish_reason": null, "logprobs": null } ], "usage": null, "created": 1726294442 } data: { "id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941", "model": "(endpoint-id)", "object": "chat.completion.chunk", "choices": [ { "index": 0, "delta": { "role": "assistant", "tool_calls": [ { "index": 0, "type": "function", "function": { "arguments": "arg" } } ] }, "finish_reason": null, "logprobs": null } ], "usage": null, "created": 1726294442 } ... data: { "id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941", "model": "(endpoint-id)", "object": "chat.completion.chunk", "choices": [ { "index": 0, "delta": { "role": "assistant", "tool_calls": [ { "index": 0, "type": "function", "function": { "arguments": "}" } } ] }, "finish_reason": null, "logprobs": null } ], "usage": null, "created": 1726294442 } data: { "id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941", "model": "(endpoint-id)", "object": "chat.completion.chunk", "choices": [ { "index": 0, "delta": {}, "finish_reason": "tool_calls", "logprobs": null } ], "usage": null, "created": 1726294442 } data: { "id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941", "model": "(endpoint-id)", "object": "chat.completion.chunk", "choices": [], "usage": { "prompt_tokens": 468, "completion_tokens": 59, "total_tokens": 527 }, "created": 1726294443 } data: [DONE] ``` A unique ID of the chat completion. The object type, which is always set to `chat.completion.chunk`. The model to generate the completion. For dedicated endpoints, it returns the endpoint id. The index of the choice in the list of generated choices. Role of the generated message author, in this case `assistant`. The contents of the assistant message. The index of tool call being generated. The ID of the tool call. The type of the tool, which is always set to `function`. The name of the function to call. The arguments for calling the function, generated by the model in JSON format. Ensure to validate these arguments in your code before invoking the function since the model may not always produce valid JSON. Termination condition of the generation. `stop` means the API returned the full chat completions generated by the model without running into any limits. `length` means the generation exceeded `max_tokens` or the conversation exceeded the max context length. `tool_calls` means the API has generated tool calls. 
Available options: `stop`, `length`, `tool_calls` Log probability information for the choice. A list of message content tokens with log probability information. The token. The log probability of this token. A list of integers representing the UTF-8 bytes representation of the token. Useful in instances where characters are represented by multiple tokens and their byte representations must be combined to generate the correct text representation. Can be `null` if there is no bytes representation for the token. List of the most likely tokens and their log probability, at this token position. The token. The log probability of this token. A list of integers representing the UTF-8 bytes representation of the token. Useful in instances where characters are represented by multiple tokens and their byte representations must be combined to generate the correct text representation. Can be `null` if there is no bytes representation for the token. Number of tokens in the prompt. Number of tokens in the generated chat completions. Total number of tokens used in the request (`prompt_tokens` + `completion_tokens`). The Unix timestamp (in seconds) for when the token sampled. # Dedicated completions Source: https://friendli.ai/docs/openapi/dedicated/completions post /dedicated/v1/completions Generate text based on the given text prompt. To successfully run an inference request, it is mandatory to enter a **Friendli Token** (e.g. flp\_XXX) value in the **Bearer Token** field. Refer to the [authentication section](/openapi/introduction#authentication) on our introduction page to learn how to acquire this variable and [visit here](https://suite.friendli.ai/default-team/settings/tokens) to generate your token. When streaming mode is used (i.e., `stream` option is set to `true`), the response is in MIME type `text/event-stream`. Otherwise, the content type is `application/json`. You can view the schema of the streamed sequence of chunk objects in streaming mode [here](/openapi/dedicated/completions-chunk-object). # Dedicated completions chunk object Source: https://friendli.ai/docs/openapi/dedicated/completions-chunk-object Represents a streamed chunk of a completions response returned by model, based on the provided input. ```json Response data: { "id": "cmpl-26a1e10db8544bc3adb488d2d205288b", "model": "(endpoint-id)", "object": "text_completion", "choices": [ { "index": 0, "text": " such", "token": 1778, "finish_reason": null, "logprobs": null } ], "created": 1733382157 } data: { "id": "cmpl-26a1e10db8544bc3adb488d2d205288b", "model": "(endpoint-id)", "object": "text_completion", "choices": [ { "index": 0, "text": " as", "token": 439, "finish_reason": null, "logprobs": null } ], "created": 1733382157 } ... data: { "id": "cmpl-26a1e10db8544bc3adb488d2d205288b", "model": "(endpoint-id)", "object": "text_completion", "choices": [ { "index": 0, "text": "", "finish_reason": "length", "logprobs": null } ], "created": 1733382157 } data: { "id": "cmpl-26a1e10db8544bc3adb488d2d205288b", "model": "(endpoint-id)", "object": "text_completion", "choices": [], "usage": { "prompt_tokens": 5, "completion_tokens": 10, "total_tokens": 15 }, "created": 1733382157 } data: [DONE] ``` A unique ID of the completion. The object type, which is always set to `text_completion`. The model to generate the completion. For dedicated endpoints, it returns the endpoint id. The index of the choice in the list of generated choices. The text. The token. Termination condition of the generation. 
`stop` means the API returned the full completions generated by the model without running into any limits. `length` means the generation exceeded `max_tokens` or the conversation exceeded the max context length. Available options: `stop`, `length` Log probability information for the choice. The starting character position of each token in the generated text, useful for mapping tokens back to their exact location for detailed analysis. The log probabilities of each generated token, indicating the model's confidence in selecting each token. A list of individual tokens generated in the completion, representing segments of text such as words or pieces of words. A list of dictionaries, where each dictionary represents the top alternative tokens considered by the model at a specific position in the generated text, along with their log probabilities. The number of items in each dictionary matches the value of `logprobs`. Number of tokens in the prompt. Number of tokens in the generated completions. Total number of tokens used in the request (`prompt_tokens` + `completion_tokens`). The Unix timestamp (in seconds) for when the token sampled. # Dedicated detokenization Source: https://friendli.ai/docs/openapi/dedicated/detokenization post /dedicated/v1/detokenize By giving a list of tokens, generate a detokenized output text string. To successfully run an inference request, it is mandatory to enter a **Friendli Token** (e.g. flp\_XXX) value in the **Bearer Token** field. Refer to the [authentication section](/openapi/introduction#authentication) on our introduction page to learn how to acquire this variable and [visit here](https://suite.friendli.ai/default-team/settings/tokens) to generate your token. # Dedicated create endpoint from W&B artifact Source: https://friendli.ai/docs/openapi/dedicated/endpoint-wandb-artifact-create post /dedicated/v1/endpoint/wandb-artifact-create Create an endpoint from Weights & Biases artifact. To successfully run an inference request, it is mandatory to enter a **Friendli Token** (e.g. flp\_XXX) value in the **Bearer Token** field. Refer to the [authentication section](/openapi/introduction#authentication) on our introduction page to learn how to acquire this variable and [visit here](https://suite.friendli.ai/default-team/settings/tokens) to generate your token. # Dedicated image generations (Beta) Source: https://friendli.ai/docs/openapi/dedicated/image-generations post /dedicated/v1/images/generations Given a description, the model generates image(s). To successfully run an inference request, it is mandatory to enter a **Friendli Token** (e.g. flp\_XXX) value in the **Bearer Token** field. Refer to the [authentication section](/openapi/introduction#authentication) on our introduction page to learn how to acquire this variable and [visit here](https://suite.friendli.ai/default-team/settings/tokens) to generate your token. When streaming mode is used (i.e., `stream` option is set to `true`), the response is in MIME type `text/event-stream`. Otherwise, the content type is `application/json`. You can view the schema of the streamed sequence of chunk objects in streaming mode [here](/openapi/dedicated/completions-chunk-object). This API is currently in **Beta**. While we strive to provide a stable and reliable experience, this feature is still under active development. As a result, you may encounter unexpected behavior or limitations. We encourage you to provide feedback to help us improve the feature before its official release. 
* [Feature request & feedback](https://friendliai.canny.io) * { e.preventDefault(); window.Intercom('showNewMessage'); }}>Contact support # Dedicated overview Source: https://friendli.ai/docs/openapi/dedicated/overview OpenAPI reference of Friendli Dedicated Endpoints API. ### Inference Discover how to generate text through interactive conversations. Learn how to generate text. Explore the process of breaking down text into smaller tokens for machine processing. Learn how to reconstruct tokenized text back into its original, human-readable form. Learn how to generate images. ### Endpoint Create an endpoint from Weights & Biases artifact. # Dedicated tokenization Source: https://friendli.ai/docs/openapi/dedicated/tokenization post /dedicated/v1/tokenize By giving a text input, generate a tokenized output of token IDs. To successfully run an inference request, it is mandatory to enter a **Friendli Token** (e.g. flp\_XXX) value in the **Bearer Token** field. Refer to the [authentication section](/openapi/introduction#authentication) on our introduction page to learn how to acquire this variable and [visit here](https://suite.friendli.ai/default-team/settings/tokens) to generate your token. # Friendli Suite API Reference Source: https://friendli.ai/docs/openapi/introduction OpenAPI reference of Friendli Suite API. You can interact with the API through HTTP requests from any language. export const RoundedBorderBox = ({children, caption}) =>
## Authentication

When using Friendli Suite API for inference requests, you need to provide a **Friendli Token** for authentication and authorization purposes. A Friendli Token serves as an alternative method of authorization to signing in with an email and a password. You can generate a new Friendli Token through the [Friendli Suite](https://suite.friendli.ai), at your **"Personal settings"** page, by following the steps below.

1. Go to the [Friendli Suite](https://suite.friendli.ai) and sign in with your account.
2. Click the profile icon at the top-right corner of the page.
3. Click the **"Personal settings"** menu.
4. Go to the **"Tokens"** tab on the navigation bar.
5. Create a new Friendli Token by clicking the **"Create token"** button.
6. Copy the token and save it in a safe place. You will not be able to see this token again once the page is refreshed.

# Serverless chat completions

Source: https://friendli.ai/docs/openapi/serverless/chat-completions

post /serverless/v1/chat/completions

Given a list of messages forming a conversation, the model generates a response. See available models at [this pricing table](/guides/serverless_endpoints/pricing#text-generation-models).

To successfully run an inference request, it is mandatory to enter a **Friendli Token** (e.g. flp\_XXX) value in the **Bearer Token** field. Refer to the [authentication section](/openapi/introduction#authentication) on our introduction page to learn how to acquire this variable and [visit here](https://suite.friendli.ai/default-team/settings/tokens) to generate your token.

When streaming mode is used (i.e., `stream` option is set to `true`), the response is in MIME type `text/event-stream`. Otherwise, the content type is `application/json`. You can view the schema of the streamed sequence of chunk objects in streaming mode [here](/openapi/serverless/chat-completions-chunk-object).

You can explore examples on the [Friendli Serverless Endpoints](https://suite.friendli.ai/get-started/serverless-endpoints) playground and adjust settings with just a few clicks.

# Serverless chat completions chunk object

Source: https://friendli.ai/docs/openapi/serverless/chat-completions-chunk-object

Represents a streamed chunk of a chat completions response returned by model, based on the provided input.

```json Response data: { "id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941", "model": "meta-llama-3.1-8b-instruct", "object": "chat.completion.chunk", "choices": [ { "index": 0, "delta": { "role": "assistant", "content": "This" }, "finish_reason": null, "logprobs": null } ], "usage": null, "created": 1726294381 } data: { "id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941", "model": "meta-llama-3.1-8b-instruct", "object": "chat.completion.chunk", "choices": [ { "index": 0, "delta": { "content": " is" }, "finish_reason": null, "logprobs": null } ], "usage": null, "created": 1726294381 } ...
data: { "id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941", "model": "meta-llama-3.1-8b-instruct", "object": "chat.completion.chunk", "choices": [ { "index": 0, "delta": {}, "finish_reason": "stop", "logprobs": null } ], "usage": null, "created": 1726294383 } data: { "id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941", "model": "meta-llama-3.1-8b-instruct", "object": "chat.completion.chunk", "choices": [], "usage": { "prompt_tokens": 8, "completion_tokens": 4, "total_tokens": 12 }, "created": 1726294402 } data: [DONE] ``` ```json With tools data: { "id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941", "model": "meta-llama-3.1-8b-instruct", "object": "chat.completion.chunk", "choices": [ { "index": 0, "delta": { "role": "assistant", "content": "This" }, "finish_reason": null, "logprobs": null } ], "usage": null, "created": 1726294442 } ... data: { "id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941", "model": "meta-llama-3.1-8b-instruct", "object": "chat.completion.chunk", "choices": [ { "index": 0, "delta": { "role": "assistant", "tool_calls": [ { "index": 0, "id": "call_TARbemDG9CFdwuoaQBTRXiYK", "type": "function", "function": { "name": "func", "arguments": "{\"" } } ] }, "finish_reason": null, "logprobs": null } ], "usage": null, "created": 1726294442 } data: { "id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941", "model": "meta-llama-3.1-8b-instruct", "object": "chat.completion.chunk", "choices": [ { "index": 0, "delta": { "role": "assistant", "tool_calls": [ { "index": 0, "type": "function", "function": { "arguments": "arg" } } ] }, "finish_reason": null, "logprobs": null } ], "usage": null, "created": 1726294442 } ... data: { "id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941", "model": "meta-llama-3.1-8b-instruct", "object": "chat.completion.chunk", "choices": [ { "index": 0, "delta": { "role": "assistant", "tool_calls": [ { "index": 0, "type": "function", "function": { "arguments": "}" } } ] }, "finish_reason": null, "logprobs": null } ], "usage": null, "created": 1726294442 } data: { "id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941", "model": "meta-llama-3.1-8b-instruct", "object": "chat.completion.chunk", "choices": [ { "index": 0, "delta": {}, "finish_reason": "tool_calls", "logprobs": null } ], "usage": null, "created": 1726294442 } data: { "id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941", "model": "meta-llama-3.1-8b-instruct", "object": "chat.completion.chunk", "choices": [], "usage": { "prompt_tokens": 468, "completion_tokens": 59, "total_tokens": 527 }, "created": 1726294443 } data: [DONE] ``` A unique ID of the chat completion. The object type, which is always set to `chat.completion.chunk`. The model to generate the completion. The index of the choice in the list of generated choices. Role of the generated message author, in this case `assistant`. The contents of the assistant message. The index of tool call being generated. The ID of the tool call. The type of the tool, which is always set to `function`. The name of the function to call. The arguments for calling the function, generated by the model in JSON format. Ensure to validate these arguments in your code before invoking the function since the model may not always produce valid JSON. Termination condition of the generation. `stop` means the API returned the full chat completions generated by the model without running into any limits. `length` means the generation exceeded `max_tokens` or the conversation exceeded the max context length. `tool_calls` means the API has generated tool calls. 
Available options: `stop`, `length`, `tool_calls` Log probability information for the choice. A list of message content tokens with log probability information. The token. The log probability of this token. A list of integers representing the UTF-8 bytes representation of the token. Useful in instances where characters are represented by multiple tokens and their byte representations must be combined to generate the correct text representation. Can be `null` if there is no bytes representation for the token. List of the most likely tokens and their log probability, at this token position. The token. The log probability of this token. A list of integers representing the UTF-8 bytes representation of the token. Useful in instances where characters are represented by multiple tokens and their byte representations must be combined to generate the correct text representation. Can be `null` if there is no bytes representation for the token. Number of tokens in the prompt. Number of tokens in the generated chat completions. Total number of tokens used in the request (`prompt_tokens` + `completion_tokens`). The Unix timestamp (in seconds) for when the token sampled. # Serverless completions Source: https://friendli.ai/docs/openapi/serverless/completions post /serverless/v1/completions Generate text based on the given text prompt. See available models at [this pricing table](/guides/serverless_endpoints/pricing#text-generation-models). To successfully run an inference request, it is mandatory to enter a **Friendli Token** (e.g. flp\_XXX) value in the **Bearer Token** field. Refer to the [authentication section](/openapi/introduction#authentication) on our introduction page to learn how to acquire this variable and [visit here](https://suite.friendli.ai/default-team/settings/tokens) to generate your token. When streaming mode is used (i.e., `stream` option is set to `true`), the response is in MIME type `text/event-stream`. Otherwise, the content type is `application/json`. You can view the schema of the streamed sequence of chunk objects in streaming mode [here](/openapi/serverless/completions-chunk-object). # Serverless completions chunk object Source: https://friendli.ai/docs/openapi/serverless/completions-chunk-object Represents a streamed chunk of a completions response returned by model, based on the provided input. ```json Response data: { "id": "cmpl-26a1e10db8544bc3adb488d2d205288b", "model": "meta-llama-3.1-8b-instruct", "object": "text_completion", "choices": [ { "index": 0, "text": " such", "token": 1778, "finish_reason": null, "logprobs": null } ], "created": 1733382157 } data: { "id": "cmpl-26a1e10db8544bc3adb488d2d205288b", "model": "meta-llama-3.1-8b-instruct", "object": "text_completion", "choices": [ { "index": 0, "text": " as", "token": 439, "finish_reason": null, "logprobs": null } ], "created": 1733382157 } ... data: { "id": "cmpl-26a1e10db8544bc3adb488d2d205288b", "model": "meta-llama-3.1-8b-instruct", "object": "text_completion", "choices": [ { "index": 0, "text": "", "finish_reason": "length", "logprobs": null } ], "created": 1733382157 } data: { "id": "cmpl-26a1e10db8544bc3adb488d2d205288b", "model": "meta-llama-3.1-8b-instruct", "object": "text_completion", "choices": [], "usage": { "prompt_tokens": 5, "completion_tokens": 10, "total_tokens": 15 }, "created": 1733382157 } data: [DONE] ``` A unique ID of the completion. The object type, which is always set to `text_completion`. The model to generate the completion. The index of the choice in the list of generated choices. 
The text. The token. Termination condition of the generation. `stop` means the API returned the full completions generated by the model without running into any limits. `length` means the generation exceeded `max_tokens` or the conversation exceeded the max context length. Available options: `stop`, `length` Log probability information for the choice. The starting character position of each token in the generated text, useful for mapping tokens back to their exact location for detailed analysis. The log probabilities of each generated token, indicating the model's confidence in selecting each token. A list of individual tokens generated in the completion, representing segments of text such as words or pieces of words. A list of dictionaries, where each dictionary represents the top alternative tokens considered by the model at a specific position in the generated text, along with their log probabilities. The number of items in each dictionary matches the value of `logprobs`. Number of tokens in the prompt. Number of tokens in the generated completions. Total number of tokens used in the request (`prompt_tokens` + `completion_tokens`). The Unix timestamp (in seconds) for when the token was sampled. # Serverless detokenization Source: https://friendli.ai/docs/openapi/serverless/detokenization post /serverless/v1/detokenize Generate a detokenized output text string from a given list of tokens. To successfully run an inference request, it is mandatory to enter a **Friendli Token** (e.g. flp\_XXX) value in the **Bearer Token** field. Refer to the [authentication section](/openapi/introduction#authentication) on our introduction page to learn how to acquire this variable and [visit here](https://suite.friendli.ai/default-team/settings/tokens) to generate your token. # Serverless overview Source: https://friendli.ai/docs/openapi/serverless/overview OpenAPI reference of the Friendli Serverless Endpoints API. ### Inference Discover how to generate text through interactive conversations. Learn how to enhance responses with tool assisted chat completions using built-in tools. Learn how to generate text. Explore the process of breaking down text into smaller tokens for machine processing. Learn how to reconstruct tokenized text back into its original, human-readable form. # Serverless tokenization Source: https://friendli.ai/docs/openapi/serverless/tokenization post /serverless/v1/tokenize Generate a tokenized output of token IDs from a given text input. To successfully run an inference request, it is mandatory to enter a **Friendli Token** (e.g. flp\_XXX) value in the **Bearer Token** field. Refer to the [authentication section](/openapi/introduction#authentication) on our introduction page to learn how to acquire this variable and [visit here](https://suite.friendli.ai/default-team/settings/tokens) to generate your token. # Serverless tool assisted chat completions (Beta) Source: https://friendli.ai/docs/openapi/serverless/tool-assisted-chat-completions post /serverless/tools/v1/chat/completions Given a list of messages forming a conversation, the model generates a response. Additionally, the model can utilize built-in tools for tool calls, enhancing its capability to provide more comprehensive and actionable responses. See available models at [this pricing table](/guides/serverless_endpoints/pricing#text-generation-models). To successfully run an inference request, it is mandatory to enter a **Friendli Token** (e.g. flp\_XXX) value in the **Bearer Token** field.
Refer to the [authentication section](/openapi/introduction#authentication) on our introduction page to learn how to acquire this variable and [visit here](https://suite.friendli.ai/default-team/settings/tokens) to generate your token. When streaming mode is used (i.e., `stream` option is set to `true`), the response is in MIME type `text/event-stream`. Otherwise, the content type is `application/json`. You can view the schema of the streamed sequence of chunk objects in streaming mode [here](/openapi/serverless/tool-assisted-chat-completions-chunk-object). You can explore examples on the [Friendli Serverless Endpoints](https://suite.friendli.ai/get-started/serverless-endpoints) playground and adjust settings with just a few clicks. Tool assisted chat completions does not fully support parallel tool calls now. This API is currently in **Beta**. While we strive to provide a stable and reliable experience, this feature is still under active development. As a result, you may encounter unexpected behavior or limitations. We encourage you to provide feedback to help us improve the feature before its official release. * [Feature request & feedback](https://friendliai.canny.io) * { e.preventDefault(); window.Intercom('showNewMessage'); }}>Contact support # Serverless tool assisted chat completions chunk object (Beta) Source: https://friendli.ai/docs/openapi/serverless/tool-assisted-chat-completions-chunk-object Represents a streamed chunk of a tool assisted chat completions response returned by model, based on the provided input. This API is currently in **Beta**. While we strive to provide a stable and reliable experience, this feature is still under active development. As a result, you may encounter unexpected behavior or limitations. We encourage you to provide feedback to help us improve the feature before its official release. * [Feature request & feedback](https://friendliai.canny.io) * { e.preventDefault(); window.Intercom('showNewMessage'); }}>Contact support ```json Response event: tool_status data: { "tool_call_id": "call_3QrfStXSU6fGdOGPcETocIAq", "name": "math:calculator", "status": "STARTED", "parameters": [{ "name": "expression", "value": "150 * 1.60934" }], "result": null, "files": null, "message": null, "error": null, "usage": null, "timestamp": 1726277121 } event: tool_status data: { "tool_call_id": "call_3QrfStXSU6fGdOGPcETocIAq", "name": "math:calculator", "status": "ENDED", "parameters": [{ "name": "expression", "value": "150 * 1.60934" }], "result": "\"{\\\"result\\\": \\\"150 * 1.60934=241.401000000000\\\"}\"", "files": null, "message": null, "error": null, "usage": null, "timestamp": 1726277121 } data: { "id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941", "model": "meta-llama-3.1-8b-instruct", "object": "chat.completion.chunk", "choices": [ { "index": 0, "delta": { "role": "assistant", "content": "To" }, "finish_reason": null, "logprobs": null } ], "created": 1726277121 } ... data: { "id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941", "model": "meta-llama-3.1-8b-instruct", "object": "chat.completion.chunk", "choices": [ { "index": 0, "delta": { "role": "assistant", "content": "." 
}, "finish_reason": null, "logprobs": null } ], "created": 1726277121 } data: { "id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941", "model": "meta-llama-3.1-8b-instruct", "object": "chat.completion.chunk", "choices": [ { "index": 0, "delta": {}, "finish_reason": "stop", "logprobs": null } ], "created": 1726277121 } data: [DONE] ``` ```json Multiple tools event: tool_status data: { "tool_call_id": "call_5X9KQ52bV3CUigqHWleTzD9A", "name": "code:python-interpreter", "status": "STARTED", "parameters": [{ "name": "code", "value": "def is_prime(n): ... \n" }], "result": null, "files": null, "message": null, "error": null, "usage": null, "timestamp": 1726277008 } event: tool_status data: { "tool_call_id": "call_5X9KQ52bV3CUigqHWleTzD9A", "name": "code:python-interpreter", "status": "ENDED", "parameters": [{ "name": "code", "value": "def is_prime(n): ... \n" }], "result": "\"[2, 3, 5, 7, 11, 13, 17]\\n\"", "files": [], "message": null, "error": null, "usage": null, "timestamp": 1726277011 } data: { "id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941", "model": "meta-llama-3.1-8b-instruct", "object": "chat.completion.chunk", "choices": [ { "index": 0, "delta": { "role": "assistant", "content": "Now" }, "finish_reason": null, "logprobs": null } ], "created": 1726277011 } ... event: tool_status data: { "tool_call_id": "call_FgfZYpRoDdPtz3QwLrLZIhdP", "name": "math:calculator", "status": "STARTED", "parameters": [{ "name": "expression", "value": "2 * 3 * 5 * 7 * 11 * 13 * 17" }], "result": null, "files": null, "message": null, "error": null, "usage": null, "timestamp": 1726277012 } event: tool_status data: { "tool_call_id": "call_FgfZYpRoDdPtz3QwLrLZIhdP", "name": "math:calculator", "status": "ENDED", "parameters": [{ "name": "expression", "value": "2 * 3 * 5 * 7 * 11 * 13 * 17" }], "result": "\"{\\\"result\\\": \\\"2 * 3 * 5 * 7 * 11 * 13 * 17=510510\\\"}\"", "files": null, "message": null, "error": null, "usage": null, "timestamp": 1726277016 } data: { "id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941", "model": "meta-llama-3.1-8b-instruct", "object": "chat.completion.chunk", "choices": [ { "index": 0, "delta": { "role": "assistant", "content": "The" }, "finish_reason": null, "logprobs": null } ], "created": 1726277016 } ... data: { "id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941", "model": "meta-llama-3.1-8b-instruct", "object": "chat.completion.chunk", "choices": [ { "index": 0, "delta": { "role": "assistant", "content": "." }, "finish_reason": null, "logprobs": null } ], "created": 1726277016 } data: { "id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941", "model": "meta-llama-3.1-8b-instruct", "object": "chat.completion.chunk", "choices": [ { "index": 0, "delta": {}, "finish_reason": "stop", "logprobs": null } ], "created": 1726277016 } data: [DONE] ``` ```json With custom tool event: tool_status data: { "tool_call_id": "call_iryDFgBCcNoc2ICXuuyZqQUe", "name": "web:search", "status": "STARTED", "parameters": [{ "name": "query", "value": "tallest buildings in the world" }], "result": null, "files": null, "message": null, "error": null, "usage": null, "timestamp": 1726294660 } event: tool_status data: { "tool_call_id": "call_iryDFgBCcNoc2ICXuuyZqQUe", "name": "web:search", "status": "UPDATING", "parameters": [{ "name": "query", "value": "tallest buildings in the world" }], "result": "https://en.wikipedia.org/wiki/List_of_tallest_buildings", "files": null, "message": null, "error": null, "usage": null, "timestamp": 1726294666 } ... 
event: tool_status data: { "tool_call_id": "call_iryDFgBCcNoc2ICXuuyZqQUe", "name": "web:search", "status": "ENDED", "parameters": [{ "name": "query", "value": "tallest buildings in the world" }], "result": "['https://en.wikipedia.org/wiki/List_of_tallest_buildings', ...]", "files": null, "message": null, "error": null, "usage": null, "timestamp": 1726294671 } data: { "id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941", "model": "meta-llama-3.1-8b-instruct", "object": "chat.completion.chunk", "choices": [ { "index": 0, "delta": { "role": "assistant", "content": "The" }, "finish_reason": null, "logprobs": null } ], "created": 1726294672 } ... data: { "id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941", "model": "meta-llama-3.1-8b-instruct", "object": "chat.completion.chunk", "choices": [ { "index": 0, "delta": { "role": "assistant", "tool_calls": [ { "index": 0, "id": "call_yuvrTUk4O2Uh7Hns5ieUcu1S", "type": "function", "function": { "name": "func", "arguments": "{\"" }, } ] }, "finish_reason": null, "logprobs": null } ], "created": 1726294673 } data: { "id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941", "model": "meta-llama-3.1-8b-instruct", "object": "chat.completion.chunk", "choices": [ { "index": 0, "delta": { "role": "assistant", "tool_calls": [ { "index": 0, "type": "function", "function": { "arguments": "arg" } } ] }, "finish_reason": null, "logprobs": null } ], "created": 1726294673 } ... data: { "id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941", "model": "meta-llama-3.1-8b-instruct", "object": "chat.completion.chunk", "choices": [ { "index": 0, "delta": { "role": "assistant", "tool_calls": [ { "index": 0, "type": "function", "function": { "arguments": "}" } } ] }, "finish_reason": null, "logprobs": null } ], "created": 1726294673 } data: { "id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941", "model": "meta-llama-3.1-8b-instruct", "object": "chat.completion.chunk", "choices": [ { "index": 0, "delta": {}, "finish_reason": "tool_calls", "logprobs": null } ], "created": 1726294673 } data: [DONE] ``` A unique ID of the chat completion. The object type, which is always set to `chat.completion.chunk`. The model to generate the completion. The index of the choice in the list of generated choices. Role of the generated message author, in this case `assistant`. The contents of the assistant message. The index of tool call being generated. The ID of the tool call. The type of the tool, which is always set to `function`. The name of the function to call. The arguments for calling the function, generated by the model in JSON format. Ensure to validate these arguments in your code before invoking the function since the model may not always produce valid JSON. Termination condition of the generation. `stop` means the API returned the full chat completions generated by the model without running into any limits. `length` means the generation exceeded `max_tokens` or the conversation exceeded the max context length. `tool_calls` means the API has generated tool calls. Available options: `stop`, `length`, `tool_calls` Log probability information for the choice. A list of message content tokens with log probability information. The token. The log probability of this token. A list of integers representing the UTF-8 bytes representation of the token. Useful in instances where characters are represented by multiple tokens and their byte representations must be combined to generate the correct text representation. Can be `null` if there is no bytes representation for the token. 
List of the most likely tokens and their log probability, at this token position. The token. The log probability of this token. A list of integers representing the UTF-8 bytes representation of the token. Useful in instances where characters are represented by multiple tokens and their byte representations must be combined to generate the correct text representation. Can be `null` if there is no bytes representation for the token. Number of tokens in the prompt. Number of tokens in the generated chat completions. Total number of tokens used in the request (`prompt_tokens` + `completion_tokens`). The Unix timestamp (in seconds) for when the token was sampled. ### `event: tool_status` chunk object `event: tool_status` tracks the execution progress of built-in tools, such as calculator or web search functions. It provides real-time updates on their status and results. The ID of the tool call. The name of the built-in tool. Available options: `math:calculator`, `math:statistics`, `math:calendar`, `web:search`, `web:url`, `code:python-interpreter`, `file:text` Indicates the current execution status of the tool. Available options: `STARTED`, `UPDATING`, `ENDED`, `ERRORED` The name of the tool's function parameter. The value of the tool's function parameter. The output from the tool's execution. The name of the file generated by the tool's execution. URL of the file generated by the tool's execution. Message generated by the tool's execution. The type of error encountered during the tool's execution. The error message. The Unix timestamp (in seconds) for when the event occurred. # LangChain Node.js SDK Source: https://friendli.ai/docs/sdk/integrations/langchain/nodejs Utilize the LangChain Node.js SDK with FriendliAI for seamless integration and enhanced tool calling capabilities in your applications. You can use [**LangChain Node.js SDK**](https://github.com/langchain-ai/langchainjs) to interact with FriendliAI. This makes migration of existing applications already using LangChain particularly easy. ## How to use Before you start, ensure you've already obtained the `FRIENDLI_TOKEN` from the [Friendli Suite](https://suite.friendli.ai/default-team/settings/tokens). Our products are entirely compatible with OpenAI, so we use the `@langchain/openai` package by referring to the FriendliAI `baseURL`. ```bash npm npm i @langchain/core @langchain/openai ``` ```bash yarn yarn add @langchain/core @langchain/openai ``` ```bash pnpm pnpm add @langchain/core @langchain/openai ``` ### Instantiation Now we can instantiate our model object and generate chat completions. We provide usage examples for each type of endpoint.
Choose the one that best suits your needs: ```js Serverless Endpoints import { ChatOpenAI } from "@langchain/openai"; const model = new ChatOpenAI({ model: "meta-llama-3.1-8b-instruct", apiKey: process.env.FRIENDLI_TOKEN, configuration: { baseURL: "https://api.friendli.ai/serverless/v1", }, }); ``` ```js Dedicated Endpoints import { ChatOpenAI } from "@langchain/openai"; const model = new ChatOpenAI({ model: "YOUR_ENDPOINT_ID", apiKey: process.env.FRIENDLI_TOKEN, configuration: { baseURL: "https://api.friendli.ai/dedicated/v1", }, }); ``` ```js Fine-tuned Dedicated Endpoints import { ChatOpenAI } from "@langchain/openai"; const model = new ChatOpenAI({ model: "YOUR_ENDPOINT_ID:YOUR_ADAPTER_ROUTE", apiKey: process.env.FRIENDLI_TOKEN, configuration: { baseURL: "https://api.friendli.ai/dedicated/v1", }, }); ``` ### Runnable interface We support both synchronous and asynchronous runnable methods to generate a response. {/* #### Synchronous methods: #### Asynchronous methods: TODO: Add more examples */} ```js import { HumanMessage, SystemMessage } from "@langchain/core/messages"; const messages = [ new SystemMessage("Translate the following from English into Italian"), new HumanMessage("hi!"), ]; const result = await model.invoke(messages); console.log(result); ``` ### Chaining We can chain our model with a prompt template. Prompt templates convert raw user input to better input to the LLM. ```javascript import { ChatPromptTemplate } from "@langchain/core/prompts"; const prompt = ChatPromptTemplate.fromMessages([ ["system", "You are a world class technical documentation writer."], ["user", "{input}"], ]); const chain = prompt.pipe(model); console.log( await chain.invoke({ input: "how can langsmith help with testing?" }) ); ``` To get the string value instead of the message, we can add an output parser to the chain. ```javascript import { StringOutputParser } from "@langchain/core/output_parsers"; const outputParser = new StringOutputParser(); const chain = prompt.pipe(model).pipe(outputParser); console.log( await chain.invoke({ input: "how can langsmith help with testing?" }) ); ``` ### Tool calling Describe tools and their parameters, and let the model return a tool to invoke with the input arguments. Tool calling is extremely useful for enhancing the model's capability to provide more comprehensive and actionable responses. #### Define tools to use We can define tools with Zod schemas and use them to generate tool calls. ```bash npm npm i zod ``` ```bash yarn yarn add zod ``` ```bash pnpm pnpm add zod ``` ```js import { tool } from "@langchain/core/tools"; import { z } from "zod"; /** * Note that the descriptions here are crucial, as they will be passed along * to the model along with the class name. 
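 * The model relies on these descriptions (together with the parameter names defined below) to decide which tool to call and what arguments to pass.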
*/ const calculatorSchema = z.object({ operation: z .enum(["add", "subtract", "multiply", "divide"]) .describe("The type of operation to execute."), number1: z.number().describe("The first number to operate on."), number2: z.number().describe("The second number to operate on."), }); const calculatorTool = tool( async ({ operation, number1, number2 }) => { // Functions must return strings if (operation === "add") { return `${number1 + number2}`; } else if (operation === "subtract") { return `${number1 - number2}`; } else if (operation === "multiply") { return `${number1 * number2}`; } else if (operation === "divide") { return `${number1 / number2}`; } else { throw new Error("Invalid operation."); } }, { name: "calculator", description: "Can perform mathematical operations.", schema: calculatorSchema, } ); console.log( await calculatorTool.invoke({ operation: "add", number1: 3, number2: 4 }) ); ``` #### Bind tools to the model Now models can generate a tool calling response. ```js const modelWithTools = model.bindTools([calculatorTool]); const messages = [new HumanMessage("What is 3 * 12? Also, what is 11 + 49?")]; const aiMessage = await modelWithTools.invoke(messages); console.log(aiMessage); ``` #### Generate a tool assisted message Use the tool call results to generate a message. ```js messages.push(aiMessage); const toolsByName = { calculator: calculatorTool, }; for (const toolCall of aiMessage.tool_calls) { const selectedTool = toolsByName[toolCall.name]; const toolMessage = await selectedTool.invoke(toolCall); messages.push(toolMessage); } console.log(await modelWithTools.invoke(messages)); ``` For more information on how to use tools, check out the [LangChain documentation](https://js.langchain.com/v0.2/docs/how_to/#tools). # LangChain Python SDK Source: https://friendli.ai/docs/sdk/integrations/langchain/python Utilize the LangChain Python SDK with FriendliAI for easy integration and advanced tool calling in your applications. You can use [**LangChain Python SDK**](https://github.com/langchain-ai/langchain) to interact with FriendliAI. This makes migration of existing applications already using LangChain particularly easy. ## How to use Before you start, ensure you've already obtained the `FRIENDLI_TOKEN` from the [Friendli Suite](https://suite.friendli.ai/default-team/settings/tokens). Our products are entirely compatible with OpenAI, so we use the `langchain-openai` package by referring to the FriendliAI `baseURL`. ```bash pip install -qU langchain-openai langchain ``` ### Instantiation Now we can instantiate our model object and generate chat completions. We provide usage examples for each type of endpoint. Choose the one that best suits your needs: ```python Serverless Endpoints from langchain_openai import ChatOpenAI llm = ChatOpenAI( model="meta-llama-3.1-8b-instruct", base_url="https://api.friendli.ai/serverless/v1", api_key=os.environ["FRIENDLI_TOKEN"], ) ``` ```python Dedicated Endpoints from langchain_openai import ChatOpenAI llm = ChatOpenAI( model="YOUR_ENDPOINT_ID", base_url="https://api.friendli.ai/dedicated/v1", api_key=os.environ["FRIENDLI_TOKEN"], ) ``` ```python Fine-tuned Dedicated Endpoints from langchain_openai import ChatOpenAI llm = ChatOpenAI( model="YOUR_ENDPOINT_ID:YOUR_ADAPTER_ROUTE", base_url="https://api.friendli.ai/dedicated/v1", api_key=os.environ["FRIENDLI_TOKEN"], ) ``` ### Runnable interface We support both synchronous and asynchronous runnable methods to generate a response. 
#### Synchronous methods: ```python invoke result = llm.invoke("Tell me a joke.") print(result.content) ``` ```python stream for chunk in llm.stream("Tell me a joke."): print(chunk.content, end="", flush=True) ``` ```python batch for r in llm.batch(["Tell me a joke.", "Tell me a useless fact."]): print(r.content, "\n\n") ``` #### Asynchronous methods: ```python ainvoke result = await llm.ainvoke("Tell me a joke.") print(result.content) ``` ```python astream async for chunk in llm.astream("Tell me a joke."): print(chunk.content, end="", flush=True) ``` ```python abatch for r in await llm.abatch(["Tell me a joke.", "Tell me a useless fact."]): print(r.content, "\n\n") ``` ### Chaining We can [chain](https://python.langchain.com/v0.2/docs/how_to/sequence) our model with a prompt template. Prompt templates convert raw user input to better input to the LLM. ```python from langchain_core.prompts import ChatPromptTemplate prompt = ChatPromptTemplate.from_messages([ ("system", "You are a world class technical documentation writer."), ("user", "{input}") ]) chain = prompt | llm print(chain.invoke({"input": "how can langsmith help with testing?"})) ``` To get the string value instead of the message, we can add an output parser to the chain. ```python from langchain_core.output_parsers import StrOutputParser output_parser = StrOutputParser() chain = prompt | llm | output_parser print(chain.invoke({"input": "how can langsmith help with testing?"})) ``` ### Tool calling Describe tools and their parameters, and let the model return a tool to invoke with the input arguments. Tool calling is extremely useful for enhancing the model's capability to provide more comprehensive and actionable responses. #### Define tools to use The `@tool` decorator is used to define a tool. If you set `parse_docstring=True`, the tool will parse the docstring to extract the information of arguments. ```python Default from langchain_core.tools import tool @tool def add(a: int, b: int) -> int: """Adds a and b.""" return a + b @tool def multiply(a: int, b: int) -> int: """Multiplies a and b.""" return a * b tools = [add, multiply] ``` ```python Parse Docstring from langchain_core.tools import tool @tool(parse_docstring=True) def add(a: int, b: int) -> int: """Adds a and b. Args: a: The first integer. b: The second integer. """ return a + b @tool(parse_docstring=True) def multiply(a: int, b: int) -> int: """Multiplies a and b. Args: a: The first integer. b: The second integer. """ return a * b tools = [add, multiply] ``` #### Bind tools to the model Now models can generate a tool calling response. ```python from langchain_openai import ChatOpenAI llm = ChatOpenAI( model="meta-llama-3.1-8b-instruct", base_url="https://api.friendli.ai/serverless/v1", api_key=os.environ["FRIENDLI_TOKEN"], ) llm_with_tools = llm.bind_tools(tools) query = "What is 3 * 12? Also, what is 11 + 49?" print(llm_with_tools.invoke(query).tool_calls) ``` #### Generate a tool assisted message Use the tool call results to generate a message. 
```python from langchain_core.messages import HumanMessage, ToolMessage messages = [HumanMessage(query)] ai_msg = llm_with_tools.invoke(messages) messages.append(ai_msg) for tool_call in ai_msg.tool_calls: selected_tool = {"add": add, "multiply": multiply}[tool_call["name"].lower()] tool_output = selected_tool.invoke(tool_call["args"]) messages.append(ToolMessage(tool_output, tool_call_id=tool_call["id"])) print(llm_with_tools.invoke(messages)) ``` For more information on how to use tools, check out the [LangChain documentation](https://python.langchain.com/v0.2/docs/how_to/#tools). # LiteLLM Source: https://friendli.ai/docs/sdk/integrations/litellm LiteLLM SDK supports all FriendliAI models, offering easy integration with serverless, dedicated, and fine-tuned endpoints. You can use [**LiteLLM**](https://github.com/BerriAI/litellm) to interact with FriendliAI. This makes migration of existing applications already using LiteLLM particularly easy. ## How to use Before you start, ensure you've already obtained the `FRIENDLI_TOKEN` from the [Friendli Suite](https://suite.friendli.ai/default-team/settings/tokens). Add `friendliai` prefix to your endpoint name for the `model` parameter. ### Chat completion We provide usage examples for each type of endpoint. Choose the one that best suits your needs. You can specify one of the [available models](/guides/serverless_endpoints/text-generation#model-supports) for the serverless endpoints. ```python Serverless Endpoints from litellm import completion import os os.environ['FRIENDLI_TOKEN'] = "YOUR_FREIENDLI_TOKEN" response = completion( model="friendliai/mixtral-8x7b-instruct-v0-1", messages=[ {"role": "user", "content": "hello from litellm"} ], ) print(response) ``` ```python Dedicated Endpoints from litellm import completion import os os.environ['FRIENDLI_TOKEN'] = "YOUR_FREIENDLI_TOKEN" os.environ['FRIENDLI_API_BASE'] = "https://api.friendli.ai/dedicated/v1" response = completion( model="friendliai/YOUR_ENDPOINT_ID", messages=[ {"role": "user", "content": "hello from litellm"} ], ) print(response) ``` ```python Fine-tuned Dedicated Endpoints from litellm import completion import os os.environ['FRIENDLI_TOKEN'] = "YOUR_FREIENDLI_TOKEN" os.environ['FRIENDLI_API_BASE'] = "https://api.friendli.ai/dedicated/v1" response = completion( model="friendliai/YOUR_ENDPOINT_ID:YOUR_ADAPTER_ROUTE", messages=[ {"role": "user", "content": "hello from litellm"} ], ) print(response) ``` ### Chat completion - Streaming ```python Serverless Endpoints from litellm import completion import os os.environ['FRIENDLI_TOKEN'] = "YOUR_FREIENDLI_TOKEN" response = completion( model="friendliai/mixtral-8x7b-instruct-v0-1", messages=[ {"role": "user", "content": "hello from litellm"} ], stream=True ) for chunk in response: print(chunk) ``` ```python Dedicated Endpoints from litellm import completion import os os.environ['FRIENDLI_TOKEN'] = "YOUR_FREIENDLI_TOKEN" os.environ['FRIENDLI_API_BASE'] = "https://api.friendli.ai/dedicated/v1" response = completion( model="friendliai/YOUR_ENDPOINT_ID", messages=[ {"role": "user", "content": "hello from litellm"} ], stream=True ) for chunk in response: print(chunk) ``` ```python Fine-tuned Dedicated Endpoints from litellm import completion import os os.environ['FRIENDLI_TOKEN'] = "YOUR_FREIENDLI_TOKEN" os.environ['FRIENDLI_API_BASE'] = "https://api.friendli.ai/dedicated/v1" response = completion( model="friendliai/YOUR_ENDPOINT_ID:YOUR_ADAPTER_ROUTE", messages=[ {"role": "user", "content": "hello from litellm"} ], stream=True ) for chunk in 
response: print(chunk) ``` # LlamaIndex Source: https://friendli.ai/docs/sdk/integrations/llamaindex Easily integrate large language models with the LlamaIndex SDK, featuring FriendliAI for seamless interaction. {/* Open In Colab */} You can use [**LlamaIndex**](https://github.com/run-llama/llama_index) to interact with FriendliAI. This makes migration of existing applications already using LlamaIndex particularly easy. ## How to use Before you start, ensure you've already obtained the `FRIENDLI_TOKEN` from the [Friendli Suite](https://suite.friendli.ai/default-team/settings/tokens). ```python pip install llama-index llama-index-llms-friendli ``` ### Instantiation Now we can instantiate our model object and generate chat completions. The default model (i.e. `mixtral-8x7b-instruct-v0-1`) will be used if no model is specified. ```python from llama_index.llms.friendli import Friendli import os os.environ['FRIENDLI_TOKEN'] = "YOUR_FRIENDLI_TOKEN" llm = Friendli(model="meta-llama-3.3-70b-instruct") ``` ### Chat completion Generate a response from a given conversation. ```python Default from llama_index.core.llms import ChatMessage, MessageRole message = ChatMessage(role=MessageRole.USER, content="Tell me a joke.") resp = llm.chat([message]) print(resp) ``` ```python Streaming from llama_index.core.llms import ChatMessage, MessageRole message = ChatMessage(role=MessageRole.USER, content="Tell me a joke.") resp = llm.stream_chat([message]) for r in resp: print(r.delta, end="") ``` ```python Async from llama_index.core.llms import ChatMessage, MessageRole message = ChatMessage(role=MessageRole.USER, content="Tell me a joke.") resp = await llm.achat([message]) print(resp) ``` ```python Async Streaming from llama_index.core.llms import ChatMessage, MessageRole message = ChatMessage(role=MessageRole.USER, content="Tell me a joke.") resp = await llm.astream_chat([message]) async for r in resp: print(r.delta, end="") ``` ### Completion Generate a response from a given prompt. ```python Default prompt = "Draft a cover letter for a role in software engineering." resp = llm.complete(prompt) print(resp) ``` ```python Streaming prompt = "Draft a cover letter for a role in software engineering." resp = llm.stream_complete(prompt) for r in resp: print(r.delta, end="") ``` ```python Async prompt = "Draft a cover letter for a role in software engineering." resp = await llm.acomplete(prompt) print(resp) ``` ```python Async Streaming prompt = "Draft a cover letter for a role in software engineering." resp = await llm.astream_complete(prompt) async for r in resp: print(r.delta, end="") ``` # OpenAI Node.js SDK Source: https://friendli.ai/docs/sdk/integrations/openai/nodejs Easily integrate FriendliAI with the OpenAI Node.js SDK. You can use [**OpenAI Node.js SDK**](https://github.com/openai/openai-node) to interact with FriendliAI. This makes migration of existing applications already using OpenAI particularly easy. ## How to use Before you start, ensure the `baseURL` and `apiKey` refer to FriendliAI. Since our products are entirely compatible with OpenAI SDK, now you are good to follow the examples below. Choose one of the [available models](/guides/serverless_endpoints/text-generation#model-supports) for `model` parameter. ```bash npm npm i openai ``` ```bash yarn yarn add openai ``` ```bash pnpm pnpm add openai ``` ### Chat Completion Chat completion API that generates a response from a given conversation. We provide multiple usage examples. 
Try to find the best one that aligns with your needs: ```ts Default import OpenAI from "openai"; const client = new OpenAI({ baseURL: "https://api.friendli.ai/serverless/v1", apiKey: process.env.FRIENDLI_TOKEN, }); async function main() { const completion = await client.chat.completions.create({ model: "meta-llama-3.1-8b-instruct", messages: [ { role: "system", content: "You are a helpful assistant." }, { role: "user", content: "Hello!" }, ], }); console.log(completion.choices[0]); } main(); ``` ```ts Streaming import OpenAI from "openai"; const client = new OpenAI({ baseURL: "https://api.friendli.ai/serverless/v1", apiKey: process.env.FRIENDLI_TOKEN, }); async function main() { const completion = await client.chat.completions.create({ model: "meta-llama-3.1-8b-instruct", messages: [ { role: "system", content: "You are a helpful assistant." }, { role: "user", content: "Hello!" }, ], stream: true, }); for await (const chunk of completion) { console.log(chunk.choices[0].delta.content); } } main(); ``` ```ts Functions import OpenAI from "openai"; const client = new OpenAI({ baseURL: "https://api.friendli.ai/serverless/v1", apiKey: process.env.FRIENDLI_TOKEN, }); async function main() { const messages = [ { role: "user", content: "What's the weather like in Boston today?" }, ]; const tools = [ { type: "function", function: { name: "get_current_weather", description: "Get the current weather in a given location", parameters: { type: "object", properties: { location: { type: "string", description: "The city and state, e.g. San Francisco, CA", }, unit: { type: "string", enum: ["celsius", "fahrenheit"] }, }, required: ["location"], }, }, }, ]; const completion = await client.chat.completions.create({ model: "meta-llama-3.1-8b-instruct", messages: messages, tools: tools, tool_choice: "auto", }); console.log(completion); } main(); ``` ```ts Logprobs import OpenAI from "openai"; const client = new OpenAI({ baseURL: "https://api.friendli.ai/serverless/v1", apiKey: process.env.FRIENDLI_TOKEN, }); async function main() { const completion = await client.chat.completions.create({ model: "meta-llama-3.1-8b-instruct", messages: [{ role: "user", content: "Hello!" }], logprobs: true, top_logprobs: 2, }); console.log(completion.choices[0].message); console.log(completion.choices[0].logprobs); } main(); ``` ### Tool assisted chat completion This feature is in Beta and available only on the **Serverless Endpoints**. Using tool assisted chat completion API, models can utilize built-in tools prepared for tool calls, enhancing its capability to provide more comprehensive and actionable responses. Available tools are listed [here](/guides/serverless_endpoints/tool-assisted-api#built-in-tools). 
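The tool assisted chat completions route documented in the OpenAPI reference above is `/serverless/tools/v1/chat/completions`, so the client can be pointed at that base URL (the OpenAI Python SDK examples in this document use the same path). A minimal sketch of the client setup:

```ts
import OpenAI from "openai";

// Tool assisted chat completions are served from the dedicated "tools" path.
const client = new OpenAI({
  baseURL: "https://api.friendli.ai/serverless/tools/v1",
  apiKey: process.env.FRIENDLI_TOKEN,
});
```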
```ts Basic import OpenAI from "openai"; const client = new OpenAI({ baseURL: "https://api.friendli.ai/serverless/v1", apiKey: process.env.FRIENDLI_TOKEN, }); async function main() { const messages = [ { role: "user", content: "What is the current average home price in New York City, and if I put 15% down, how much will my mortgage be?", }, ]; const tools = [{ type: "code:python-interpreter" }, { type: "web:search" }]; const completion = await client.chat.completions.create({ model: "meta-llama-3.1-8b-instruct", messages: messages, tools: tools, tool_choice: "auto", stream: true, }); for await (const chunk of completion) { if (chunk.choices === undefined) { console.log(`event: ${chunk.event}, data: ${JSON.stringify(chunk.data)}`); } else { console.log(chunk.choices[0].delta.content); } } } main(); ``` ```ts Advanced (REPL) import OpenAI from "openai"; import * as readline from "node:readline/promises"; const client = new OpenAI({ baseURL: "https://api.friendli.ai/serverless/v1", apiKey: process.env.FRIENDLI_TOKEN, }); const terminal = readline.createInterface({ input: process.stdin, output: process.stdout, }); async function chatbot(input) { const stream = await client.chat.completions.create({ model: "meta-llama-3.1-8b-instruct", messages: [{ role: "user", content: input }], tools: [ { type: "web:url" }, { type: "code:python-interpreter" }, { type: "math:calculator" }, { type: "web:search" }, ], tool_choice: "auto", stream: true, }); for await (const chunk of stream) { if (chunk.choices === undefined) { if (chunk.event === "tool_status") { if (chunk.data.result !== "") { switch (chunk.data.status) { case "STARTED": terminal.write( `⚒️ TOOL CALL: ${chunk.data.name}(${JSON.stringify( chunk.data.parameters )})` ); break; case "ENDED": terminal.write(`🔧 TOOL RESULT: ${chunk.data.result}`); break; case "ERRORED": terminal.write(`🔧 TOOL ERROR: ${chunk.data.error}`); break; case "UPDATING": terminal.write(`🔧 TOOL UPDATE: ${chunk.data.result}`); break; default: terminal.write(`Unknown tool status: ${chunk.data}`); } } terminal.write("\n"); } else { terminal.write("Unknown event", chunk); } } else { terminal.write(chunk.choices[0]?.delta?.content || ""); } } terminal.write("\n"); } while (true) { const input = await terminal.question("You: "); terminal.write(" "); await chatbot(input); } ``` # OpenAI Python SDK Source: https://friendli.ai/docs/sdk/integrations/openai/python Integrate FriendliAI with OpenAI Python SDK for chat, streaming, and more. You can use [**OpenAI Python SDK**](https://github.com/openai/openai-python) to interact with FriendliAI. This makes migration of existing applications already using OpenAI particularly easy. ## How to use Before you start, ensure the `base_url` and `api_key` refer to FriendliAI. Since our products are entirely compatible with OpenAI SDK, now you are good to follow the examples below. Choose one of the [available models](/guides/serverless_endpoints/text-generation#model-supports) for `model` parameter. ```bash pip install -qU openai ``` ### Chat Completion Chat completion API that generates a response from a given conversation. We provide multiple usage examples. Try to find the best one that aligns with your needs. 
```python Default from openai import OpenAI import os client = OpenAI( base_url="https://api.friendli.ai/serverless/v1", api_key=os.environ.get("FRIENDLI_TOKEN") ) completion = client.chat.completions.create( model="meta-llama-3.1-8b-instruct", messages=[ {"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Hello!"} ] ) print(completion.choices[0].message) ``` ```python Streaming from openai import OpenAI import os client = OpenAI( base_url="https://api.friendli.ai/serverless/v1", api_key=os.environ.get("FRIENDLI_TOKEN") ) completion = client.chat.completions.create( model="meta-llama-3.1-8b-instruct", messages=[ {"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Hello!"} ], stream=True ) for chunk in completion: print(chunk.choices[0].delta) ``` ```python Functions from openai import OpenAI import os client = OpenAI( base_url="https://api.friendli.ai/serverless/v1", api_key=os.environ.get("FRIENDLI_TOKEN") ) tools = [ { "type": "function", "function": { "name": "get_current_weather", "description": "Get the current weather in a given location", "parameters": { "type": "object", "properties": { "location": { "type": "string", "description": "The city and state, e.g. San Francisco, CA", }, "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}, }, "required": ["location"], }, } } ] completion = client.chat.completions.create( model="meta-llama-3.1-8b-instruct", messages=[ {"role": "user", "content": "What's the weather like in Boston today?"} ], tools=tools, tool_choice="auto" ) print(completion) ``` ```python Logprobs from openai import OpenAI import os client = OpenAI( base_url="https://api.friendli.ai/serverless/v1", api_key=os.environ.get("FRIENDLI_TOKEN") ) completion = client.chat.completions.create( model="meta-llama-3.1-8b-instruct", messages=[ {"role": "user", "content": "Hello!"} ], logprobs=True, top_logprobs=2 ) print(completion.choices[0].message) print(completion.choices[0].logprobs) ``` ### Tool assisted chat completion This feature is in Beta and available only on the **Serverless Endpoints**. Using tool assisted chat completion API, models can utilize built-in tools prepared for tool calls, enhancing its capability to provide more comprehensive and actionable responses. Available tools are listed [here](/guides/serverless_endpoints/tool-assisted-api#built-in-tools). 
```python Basic from openai import OpenAI import os client = OpenAI( base_url="https://api.friendli.ai/serverless/tools/v1", api_key=os.environ.get("FRIENDLI_TOKEN") ) stream = client.chat.completions.create( model="meta-llama-3.1-8b-instruct", messages=[{"role": "user", "content": "What is the current average home price in New York City, and if I put 15% down, how much will my mortgage be?"}], tools=[ {"type": "web:search"}, {"type": "math:calculator"}, ], stream=True, ) for chunk in stream: if chunk.choices is None: print(f"{chunk.event=}, {chunk.data=}") elif chunk.choices[0].delta.content is not None: print(chunk.choices[0].delta.content, end="") ``` ```python Advanced (REPL) from openai import OpenAI import os client = OpenAI( base_url="https://api.friendli.ai/serverless/tools/v1", api_key=os.environ.get("FRIENDLI_TOKEN") ) class bcolors: OKBLUE = '\033[94m' OKCYAN = '\033[96m' FAIL = '\033[91m' WHITE = '\033[97m' def print_response(response): print(f"{bcolors.OKCYAN}{response}", end='') def print_tool_call(data): print(f"\n{bcolors.OKBLUE}⚒️ TOOL CALL: { data['name']}({data['parameters']})") def print_tool_result(data): print(f"{bcolors.OKBLUE}🔧 TOOL RESULT: {data['result']}") def print_tool_error(data): print(f"{bcolors.FAIL}🔧 TOOL ERROR: {data['error']}", end='') def print_tool_update(data): print(f"{bcolors.OKBLUE}🔧 TOOL UPDATE: {data['result']}") def chatbot(prompt): stream = client.chat.completions.create( model="meta-llama-3.1-8b-instruct", messages=[{"role": "user", "content": prompt}], stream=True, tools=[ {"type": "web:url"}, {"type": "code:python-interpreter"}, {"type": "math:calculator"}, {"type": "web:search"} ] ) for chunk in stream: if chunk.choices is None: if chunk.event == "tool_status": match chunk.data: case {"status": "STARTED"}: print_tool_call(chunk.data) case {"status": "ENDED"}: print_tool_result(chunk.data) case {"status": "ERRORED"}: print_tool_error(chunk.data) case {"status": "UPDATING"}: print_tool_update(chunk.data) elif chunk.choices[0].delta.content is not None: print_response(chunk.choices[0].delta.content) print("\n") print("Welcome to the Tool Inference!") print("To exit, enter 'q'.") while True: user_input = input(f"{bcolors.WHITE}You: ") if user_input.lower() == 'q': break chatbot(user_input) ``` # Friendli Integrations Source: https://friendli.ai/docs/sdk/integrations/overview Effortlessly integrate FriendliAI models into your projects with support for popular SDKs and frameworks. ## Effortless AI integration with popular SDKs Friendli is committed to providing developers with flexible and powerful tools to integrate our AI models into their projects. We support a variety of popular SDKs and frameworks, making it easy to incorporate Friendli's capabilities into existing workflows and applications. Our integration options include LiteLLM for unified LLM interactions, Vercel AI SDK for seamless web application development, LangChain for building complex AI-driven applications, and an OpenAI-compatible API for those familiar with OpenAI's interface. These integrations enable developers to leverage Friendli's AI models across a wide range of use cases, from simple chat applications to sophisticated AI systems, all while maintaining ease of use and compatibility with existing tools and practices. 
openai openai openai openai langchain langchain weaviate weaviate vercel vercel llamaindex litellm litellm # Vercel AI SDK Source: https://friendli.ai/docs/sdk/integrations/vercel-ai-sdk Easily integrate FriendliAI models with the Vercel AI SDK, supporting serverless, dedicated, and fine-tuned endpoints. You can use [**Vercel AI SDK**](https://sdk.vercel.ai) to interact with FriendliAI. This makes migration of existing applications already using Vercel AI SDK particularly easy. ## How to use Before you start, ensure you've already obtained the `FRIENDLI_TOKEN` from the [Friendli Suite](https://suite.friendli.ai/default-team/settings/tokens). ```bash npm npm i ai @friendliai/ai-provider ``` ```bash yarn yarn add ai @friendliai/ai-provider ``` ```bash pnpm pnpm add ai @friendliai/ai-provider ``` ### Instantiation Instantiate your models using a Friendli provider instance. We provide usage examples for each type of endpoint. Choose the one that best suits your needs: ```ts Serverless Endpoints {4,7-9} import { friendli } from '@friendliai/ai-provider'; // Automatically select serverless endpoints const model = friendli("meta-llama-3.3-70b-instruct"); // Or specify a specific serverless endpoint const model = friendli("meta-llama-3.3-70b-instruct", { endpoint: "serverless", }); ``` ```ts Dedicated Endpoints {4,7-9} import { friendli } from '@friendliai/ai-provider'; // Replace YOUR_ENDPOINT_ID with the ID of your endpoint, e.g. "zbimjgovmlcb" const model = friendli("YOUR_ENDPOINT_ID"); // Specify a dedicated endpoint instead of auto-selecting const model = friendli("YOUR_ENDPOINT_ID", { endpoint: "dedicated", }); ``` ```ts Friendli Container {9} import { createFriendli } from "@friendliai/ai-provider"; const friendli = createFriendli({ // Update with the URL where your container is running. baseURL: "http://localhost:8000/v1", }); // Containers do not require a model id. const model = friendli(""); ``` ### Example: Generating text Generate a response with the `generateText` function: ```ts import { friendli } from "@friendliai/ai-provider"; import { generateText } from "ai"; const { text } = await generateText({ model: friendli("meta-llama-3.3-70b-instruct"), prompt: "Write a vegetarian lasagna recipe for 4 people.", }); console.log(text); ``` ### Example: Using Enforcing Patterns (Regex) Specify a specific pattern (e.g., CSV), character sets, or specific language characters (e.g., Korean Hangul characters) for your LLM's output. ```ts {6} import { friendli } from "@friendliai/ai-provider"; import { generateText } from "ai"; const { text } = await generateText({ model: friendli("meta-llama-3.3-70b-instruct", { regex: new RegExp("[\n ,.?!0-9\uac00-\ud7af]*"), }), prompt: "Who is the first king of the Joseon Dynasty?", }); console.log(text); ``` ### Example: Using built-in tools This feature is in Beta and available only on the **Serverless Endpoints**. Using tool assisted chat completion API, models can utilize built-in tools prepared for tool calls, enhancing its capability to provide more comprehensive and actionable responses. Available tools are listed [here](/guides/serverless_endpoints/tool-assisted-api#built-in-tools). 
```ts {6-9} import { friendli } from "@friendliai/ai-provider"; import { streamText } from "ai"; const result = await streamText({ model: friendli("meta-llama-3.3-70b-instruct", { tools: [ {"type": "web:search"}, {"type": "math:calculator"}, ], }), prompt: "Find the current USD to CAD exchange rate and calculate how much $5,000 USD would be in Canadian dollars.", }); for await (const textPart of result.textStream) { console.log(textPart); } ``` ## OpenAI Compatibility You can also use `@ai-sdk/openai` as the APIs are OpenAI-compatible. ```ts import { createOpenAI } from '@ai-sdk/openai'; const friendli = createOpenAI({ baseURL: 'https://api.friendli.ai/serverless/v1', apiKey: process.env.FRIENDLI_TOKEN, }); ``` If you are using dedicated endpoints ```ts import { createOpenAI } from '@ai-sdk/openai'; const friendli = createOpenAI({ baseURL: 'https://api.friendli.ai/dedicated/v1', apiKey: process.env.FRIENDLI_TOKEN, }); ``` ## Further resources * [Implementing a simple streaming chat with Next.js](https://sdk.vercel.ai/examples/next-app/basics/streaming-text-generation) * [Build a Next.js app with the Vercel AI SDK](https://sdk.vercel.ai/docs/getting-started/nextjs-app-router) * [Explore the Vercel AI SDK Core Reference](https://sdk.vercel.ai/docs/ai-sdk-core/overview) # FriendliAI + Weaviate (Node.js) Source: https://friendli.ai/docs/sdk/integrations/weaviate/nodejs Utilize the Weaviate to build applications with less hallucination open-source vector database. Integration with [**Weaviate**](https://github.com/weaviate/weaviate) enables performing Retrieval Augmented Generation (RAG) directly within the Weaviate database. This combines the power of [**Friendli Engine**](https://friendli.ai/solutions/engine) and Weaviate's efficient storage and fast retrieval capabilities to generate personalized and context-aware responses. ## How to use Before you start, ensure you've already obtained the `FRIENDLI_TOKEN` from the [Friendli Suite](https://suite.friendli.ai/default-team/settings/tokens). Also, set up your Weaviate instance following this [guide](https://weaviate.io/developers/weaviate/starter-guides/which-weaviate). Your Weaviate instance must be configured with the FriendliAI generative AI integration (`generative-friendliai`) module. ```bash npm npm i weaviate-client ``` ```bash yarn yarn add weaviate-client ``` ```bash pnpm pnpm add weaviate-client ``` ### Instantiation Now we can instantiate a [Weaviate collection](https://weaviate.io/developers/weaviate/manage-data/collections) using our model. We provide usage examples for each type of endpoint. Choose the one that best suits your needs. You can specify one of the [available models](/guides/serverless_endpoints/text-generation#model-supports) for the serverless endpoints. The default model (i.e. `meta-llama-3.3-70b-instruct`) will be used if no model is specified. ```ts Serverless Endpoints import weaviate from 'weaviate-client' const client = await weaviate.connectToWeaviateCloud( 'WEAVIATE_INSTANCE_URL', // your Weaviate instance URL { authCredentials: new weaviate.ApiKey('WEAVIATE_INSTANCE_APIKEY'), headers: { 'X-Friendli-Api-Key': process.env.FRIENDLI_TOKEN, } } ) await client.collections.create({ name: 'DemoCollection', generative: weaviate.configure.generative.friendliai({ model: 'meta-llama-3.3-70b-instruct' }), // Additional parameters ... 
}); client.close() ``` ```ts Dedicated Endpoints import weaviate from 'weaviate-client' const client = await weaviate.connectToWeaviateCloud( 'WEAVIATE_INSTANCE_URL', // your Weaviate instance URL { authCredentials: new weaviate.ApiKey('WEAVIATE_INSTANCE_APIKEY'), headers: { 'X-Friendli-Api-Key': process.env.FRIENDLI_TOKEN, "X-Friendli-Baseurl": "https://api.friendli.ai/dedicated", } } ) await client.collections.create({ name: 'DemoCollection', generative: weaviate.configure.generative.friendliai({ model: 'YOUR_ENDPOINT_ID' }), // Additional parameters ... }); client.close() ``` ```ts Fine-tuned Dedicated Endpoints import weaviate from 'weaviate-client' const client = await weaviate.connectToWeaviateCloud( 'WEAVIATE_INSTANCE_URL', // your Weaviate instance URL { authCredentials: new weaviate.ApiKey('WEAVIATE_INSTANCE_APIKEY'), headers: { 'X-Friendli-Api-Key': process.env.FRIENDLI_TOKEN, "X-Friendli-Baseurl": "https://api.friendli.ai/dedicated", } } ) await client.collections.create({ name: 'DemoCollection', generative: weaviate.configure.generative.friendliai({ model: 'YOUR_ENDPOINT_ID:YOUR_ADAPTER_ROUTE' }), // Additional parameters ... }); client.close() ``` #### Configurable parameters Configure the following generative parameters to customize the model behavior. ```ts await client.collections.create({ name: 'DemoCollection', generative: weaviate.configure.generative.friendliai({ model: 'meta-llama-3.3-70b-instruct', maxTokens: 500, temperature: 0.7, }), // Additional parameters ... }); ``` ### Retrieval Augmented Generation After configuring Weaviate, perform RAG operations, either with the single prompt or grouped task method. #### Single prompt To generate text for each object in the search results, use the single prompt method. The example below generates outputs for each of the n search results, where n is specified by the limit parameter. When creating a single prompt query, use braces `{}` to interpolate the object properties you want Weaviate to pass on to the language model. For example, to pass on the object's title property, include `{title}` in the query. ```ts let myCollection = client.collections.get('DemoCollection'); const singlePromptResults = await myCollection.generate.nearText( ['A holiday film'], { singlePrompt: `Translate this into French: {title}`, }, { limit: 2, } ); for (const obj of singlePromptResults.objects) { console.log(obj.properties['title']); console.log(`Generated output: ${obj.generated}`); // Note that the generated output is per object } ``` #### Grouped task To generate one text for the entire set of search results, use the grouped task method. In other words, when you have n search results, the generative model generates one output for the entire group. ```ts let myCollection = client.collections.get('DemoCollection'); const groupedTaskResults = await myCollection.generate.nearText( ['A holiday film'], { groupedTask: `Write a fun tweet to promote readers to check out these films.`, }, { limit: 2, } ); console.log(`Generated output: ${groupedTaskResults.generated}`); // Note that the generated output is per query for (const obj of groupedTaskResults.objects) { console.log(obj.properties['title']); } ``` ### Further resources Once the integrations are configured at the collection, the data management and search operations in Weaviate work identically to any other collection. See the following model-agnostic examples: * [How-to manage data guides show how to perform data operations](https://weaviate.io/developers/weaviate/manage-data/create). 
* [How-to search guides show how to perform search operations](https://weaviate.io/developers/weaviate/search/basics). # FriendliAI + Weaviate (Python) Source: https://friendli.ai/docs/sdk/integrations/weaviate/python Utilize the Weaviate to build applications with less hallucination open-source vector database. Integration with [**Weaviate**](https://github.com/weaviate/weaviate) enables performing Retrieval Augmented Generation (RAG) directly within the Weaviate database. This combines the power of [**Friendli Engine**](https://friendli.ai/solutions/engine) and Weaviate's efficient storage and fast retrieval capabilities to generate personalized and context-aware responses. ## How to use Before you start, ensure you've already obtained the `FRIENDLI_TOKEN` from the [Friendli Suite](https://suite.friendli.ai/default-team/settings/tokens). Also, set up your Weaviate instance following this [guide](https://weaviate.io/developers/weaviate/starter-guides/which-weaviate). Your Weaviate instance must be configured with the FriendliAI generative AI integration (`generative-friendliai`) module. ```bash pip install -qU weaviate-client ``` ### Instantiation Now we can instantiate a [Weaviate collection](https://weaviate.io/developers/weaviate/manage-data/collections) using our model. We provide usage examples for each type of endpoint. Choose the one that best suits your needs. You can specify one of the [available models](/guides/serverless_endpoints/text-generation#model-supports) for the serverless endpoints. The default model (i.e. `meta-llama-3.3-70b-instruct`) will be used if no model is specified. ```python Serverless Endpoints import weaviate from weaviate.classes.init import Auth from weaviate.classes.config import Configure import os headers = { "X-Friendli-Api-Key": os.getenv("FRIENDLI_TOKEN"), } client = weaviate.connect_to_weaviate_cloud( cluster_url=weaviate_url, # `weaviate_url`: your Weaviate URL auth_credentials=Auth.api_key(weaviate_key), # `weaviate_key`: your Weaviate API key headers=headers ) client.collections.create( "DemoCollection", generative_config=Configure.Generative.friendliai( model = "meta-llama-3.3-70b-instruct", ) # Additional parameters not shown ) client.close() ``` ```python Dedicated Endpoints import weaviate from weaviate.classes.init import Auth from weaviate.classes.config import Configure import os headers = { "X-Friendli-Api-Key": os.getenv("FRIENDLI_TOKEN"), "X-Friendli-Baseurl": "https://api.friendli.ai/dedicated", } client = weaviate.connect_to_weaviate_cloud( cluster_url=weaviate_url, # `weaviate_url`: your Weaviate URL auth_credentials=Auth.api_key(weaviate_key), # `weaviate_key`: your Weaviate API key headers=headers ) client.collections.create( "DemoCollection", generative_config=Configure.Generative.friendliai( model = "YOUR_ENDPOINT_ID", ) # Additional parameters not shown ) client.close() ``` ```python Fine-tuned Dedicated Endpoints import weaviate from weaviate.classes.init import Auth from weaviate.classes.config import Configure import os headers = { "X-Friendli-Api-Key": os.getenv("FRIENDLI_TOKEN"), "X-Friendli-Baseurl": "https://api.friendli.ai/dedicated", } client = weaviate.connect_to_weaviate_cloud( cluster_url=weaviate_url, # `weaviate_url`: your Weaviate URL auth_credentials=Auth.api_key(weaviate_key), # `weaviate_key`: your Weaviate API key headers=headers ) client.collections.create( "DemoCollection", generative_config=Configure.Generative.friendliai( model = "YOUR_ENDPOINT_ID:YOUR_ADAPTER_ROUTE", ) # Additional parameters not shown ) 
client.close() ``` #### Configurable parameters Configure the following generative parameters to customize the model behavior. ```python from weaviate.classes.config import Configure client.collections.create( "DemoCollection", generative_config=Configure.Generative.friendliai( # These parameters are optional model = "meta-llama-3.3-70b-instruct", max_tokens = 500, temperature = 0.7, ) ) ``` ### Retrieval Augmented Generation After configuring Weaviate, perform RAG operations, either with the single prompt or grouped task method. #### Single prompt To generate text for each object in the search results, use the single prompt method. The example below generates outputs for each of the n search results, where n is specified by the limit parameter. When creating a single prompt query, use braces `{}` to interpolate the object properties you want Weaviate to pass on to the language model. For example, to pass on the object's title property, include `{title}` in the query. ```python collection = client.collections.get("DemoCollection") response = collection.generate.near_text( query="A holiday film", # The model provider integration will automatically vectorize the query single_prompt="Translate this into French: {title}", limit=2 ) for obj in response.objects: print(obj.properties["title"]) print(f"Generated output: {obj.generated}") # Note that the generated output is per object ``` #### Grouped task To generate one text for the entire set of search results, use the grouped task method. In other words, when you have n search results, the generative model generates one output for the entire group. ```python collection = client.collections.get("DemoCollection") response = collection.generate.near_text( query="A holiday film", # The model provider integration will automatically vectorize the query grouped_task="Write a fun tweet to promote readers to check out these films.", limit=2 ) print(f"Generated output: {response.generated}") # Note that the generated output is per query for obj in response.objects: print(obj.properties["title"]) ``` ### Further resources Once the integrations are configured at the collection, the data management and search operations in Weaviate work identically to any other collection. See the following model-agnostic examples: * [How-to manage data guides show how to perform data operations](https://weaviate.io/developers/weaviate/manage-data/create). * [How-to search guides show how to perform search operations](https://weaviate.io/developers/weaviate/search/basics).