# Audio and Speech: Converting Audio to Text Source: https://friendli.ai/docs/guides/audio Guide to using Friendli's Audio and Speech feature for audio analysis and transcription. Covers usage via Playground and API (URL & Base64 examples). Friendli provides audio and speech features through Friendli Dedicated Endpoints, allowing you to convert audio files to text and perform various AI tasks. This guide explains how to use these features with examples for both the Playground and API interfaces. You can find the full list of available models [here](https://friendli.ai/models/search?input=AUDIO). ## ASR - `/v1/audio/transcriptions` Our ASR (Automatic Speech Recognition) service is designed for efficient audio transcription.\ By default, audio input is limited to 30 seconds. If you require support for longer audio inputs, please contact us. ### API Usage Example ```python Python from openai import OpenAI import os client = OpenAI( base_url="https://api.friendli.ai/dedicated/v1", api_key=os.getenv("FRIENDLI_TOKEN"), ) audio_file = open("/path/to/file/audio.mp3", "rb") transcription = client.audio.transcriptions.create( model="YOUR_ENDPOINT_ID", file=audio_file ) print(transcription.text) ``` ```sh cURL curl -X POST https://api.friendli.ai/dedicated/v1/audio/transcriptions \ -H "Authorization: Bearer $FRIENDLI_TOKEN" \ -H 'Content-Type: multipart/form-data' \ -F file=@/path/to/audio/file.mp3 \ -F model="YOUR_ENDPOINT_ID" ``` For more detailed information, please refer to the [API reference](/openapi/dedicated/inference/audio-transcriptions). ### Supported Models We support a variety of powerful ASR models, including: * [openai/whisper-large-v3-turbo](https://huggingface.co/openai/whisper-large-v3-turbo) * [openai/whisper-large-v3](https://huggingface.co/openai/whisper-large-v3) * [openai/whisper-small](https://huggingface.co/openai/whisper-small) * ...and many more. *** ## Audio Modality - `/v1/chat/completions` The audio modality endpoint allows you to combine audio and text inputs, enabling advanced AI tasks. This endpoint is ideal for: * **Complex audio and text analysis** * **Conversational AI** * **Tasks requiring diverse inference**, such as summarization, sentiment analysis, and question answering. By default, audio input is limited to 10 seconds. If you require support for longer audio inputs, please contact us. ```sh Passing a URL curl -X POST https://api.friendli.ai/dedicated/v1/chat/completions \ -H "Authorization: Bearer $FRIENDLI_TOKEN" \ -H "Content-Type: application/json" \ --data @- < Ensure you have the `friendli` SDK version `1.4.1` or higher installed. ## Starting the Friendli Container with gRPC Running the Friendli Container with a gRPC server for completions is available by adding the `--grpc true` option to the launch command. This supports response-streaming gRPC, and you can send requests using our `friendli` SDK. To start the Friendli Container with gRPC support, use the following command: ```sh export FRIENDLI_CONTAINER_SECRET="YOUR_FRIENDLI_CONTAINER_SECRET_flc_XXX" # e.g. Running `NousResearch/Hermes-3-Llama-3.1-8B` on GPU 0 with a trial image. 
docker run --gpus '"device=0"' -p 8000:8000 \ -e FRIENDLI_CONTAINER_SECRET=$FRIENDLI_CONTAINER_SECRET \ -v ~/.cache/huggingface:/root/.cache/huggingface \ registry.friendli.ai/trial:latest \ --hf-model-name NousResearch/Hermes-3-Llama-3.1-8B \ --grpc true ``` You can change the port of the server with `--web-server-port` argument. ## Sending Requests with the Client SDK Here is how to use the `friendli` SDK to interact with the gRPC server. This example assumes that the gRPC server is running on `0.0.0.0:8000`. ```python Default from friendli import SyncFriendli client = SyncFriendli() stream = client.container.chat.complete( messages=[ {"content": "You are a helpful assistant.", "role": "system"}, {"content": "Hello!", "role": "user"}, ], stream=True, # Should be True top_k=1, ) for chunk in stream: print(chunk.text, end="", flush=True) ``` ```python Async # For asynchronous operations, use the following code snippet: import asyncio from friendli import AsyncFriendli client = AsyncFriendli() async def run(): stream = await client.container.chat.complete( messages=[ {"content": "You are a helpful assistant.", "role": "system"}, {"content": "Hello!", "role": "user"}, ], stream=True, # Should be True top_k=1, ) async for chunk in stream: print(chunk.text, end="", flush=True) asyncio.run(run()) ``` ## Properly Closing the Client By default, the library closes underlying HTTP and gRPC connections when the `client` is garbage-collected. You can manually close the `Friendli` or `AsyncFriendli` client using the `.close()` method or utilize a context manager to ensure proper closure when exiting a `with` block. ```python Default from friendli import SyncFriendli client = SyncFriendli() with client: stream = client.container.chat.complete( messages=[ {"content": "You are a helpful assistant.", "role": "system"}, {"content": "Hello!", "role": "user"}, ], stream=True, # Should be True top_k=1, min_tokens=10, ) for chunk in stream: print(chunk.text, end="", flush=True) ``` ```python Async import asyncio from friendli import AsyncFriendli client = AsyncFriendli() async def run(): async with client: stream = await client.container.chat.complete( messages=[ {"content": "You are a helpful assistant.", "role": "system"}, {"content": "Hello!", "role": "user"}, ], stream=True, # Should be True top_k=1, ) async for chunk in stream: print(chunk.text, end="", flush=True) asyncio.run(run()) ``` # Introducing Friendli Container Source: https://friendli.ai/docs/guides/container/introduction While Friendli Serverless Endpoints and Dedicated Endpoints offer convenient cloud-based solutions, some users crave even more control and flexibility. For those pioneers, Friendli Container is the answer. While Friendli Serverless Endpoints and Dedicated Endpoints offer convenient cloud-based solutions, some users crave even more control and flexibility. For those pioneers, Friendli Container is the answer. ## What is Friendli Container? Unmatched Control: Friendli Container provides the Friendli Engine, our cutting-edge serving technology, as a Docker container. This means you can: * **Run your own data center or cluster**: Deploy the container on your existing GPU machines, giving you complete control over your infrastructure and data security. * **Choose your own cloud provider**: If you prefer the cloud, you can still leverage your preferred cloud provider and GPUs. * **Customize your environment**: Fine-tune the container configuration to perfectly match your specific needs and workflows. 
Greater Responsibility, Greater Customization: With Friendli Container, you handle the cluster management, fault tolerance, and scaling. This responsibility comes with these potential benefits: * **Controlled environment**: Keep your data within your own environment, ideal for sensitive applications or meeting compliance requirements. * **Unmatched flexibility**: Tailor your infrastructure and workflows to your specific needs, pushing the boundaries of AI innovation. * **Cost saving opportunities**: Manage your resources on your GPU machines, potentially leading to cost savings compared to cloud-based solutions. Ideal for: * **Data-sensitive users**: Securely run your models within your own infrastructure. * **Control enthusiasts**: Take full control over your AI environment and workflows. * **Existing cluster owners**: Utilize your existing GPU resources for cost-effective generative AI serving. ## Getting Started with Friendli Container: 1. **Generate Your User Token**: Visit the Friendli Container page through the [Friendli Suite](https://friendli.ai/suite) website and generate your unique token. 2. **Login with Docker Client**: Use your token to authenticate with the Docker client on your machine. 3. **Pull the Friendli Container Image**: Run the docker pull command with the provided image name. 4. [**Launch the Friendli Container**](/guides/container/running_friendli_container): Run the docker run command with the desired configuration and credentials. 5. **Expose Your Model**: The container will expose the model for inference. 6. [**Send Inference Requests**](/guides/container/running_friendli_container#sending-inference-requests): Use tools like curl or Python's requests library to send input prompts or data to the container. Take generative AI to the next level with unmatched control, security, and flexibility through Friendli Container. Start your journey today and elevate your AI endeavors on your own terms! # Observability for Friendli Container Source: https://friendli.ai/docs/guides/container/monitoring Observability is an integral part of DevOps. To support this, Friendli Container exports internal metrics in a Prometheus text format. Observability is an integral part of DevOps. To support this, Friendli Container exports internal metrics in a [Prometheus](https://prometheus.io) text format. By default, metrics are served at `http://localhost:8281/metrics`. You can configure the port number using the command line option `--metrics-port`. ## Supported Metrics ### Counters Counters are cumulative metrics whose values monotonically increase. They are often used in combination with Prometheus function [rate()](https://prometheus.io/docs/prometheus/latest/querying/functions/#rate) for calculating the throughput. 
| Metric Name | Description | | --------------------------------- | -------------------------------------------------------- | | friendli\_requests\_total | Cumulative number of requests received | | friendli\_responses\_total | Cumulative number of responses sent | | friendli\_items\_total | Cumulative number of items requested | | friendli\_failure\_by\_cancel | Cumulative number of failed requests due to cancellation | | friendli\_failure\_by\_timeout | Cumulative number of failed requests due to timeout | | friendli\_failure\_by\_nan\_error | Cumulative number of failed requests due to NaN error | | friendli\_failure\_by\_reject | Cumulative number of failed requests due to rejection | One inference request may generate multiple results with the `n` field in the request body. Upon receiving such a request, `friendli_requests_total` is increased by 1 and `friendli_items_total` is increased by `n`. ### Gauges Gauges are numerical values that can go up and down to represent the current value. | Metric Name | Description | | ---------------------------------- | --------------------------------------------------------------------- | | friendli\_current\_requests | Current number of requests in the engine (either assigned or waiting) | | friendli\_current\_items | Current number of items in the engine (either assigned or waiting) | | friendli\_current\_assigned\_items | Current number of items actively processed by the engine | | friendli\_current\_waiting\_items | Current number of items waiting in the internal queue | ### Histograms [Histograms](https://prometheus.io/docs/practices/histograms) are used to track the distribution of variables over time.
| Histogram | Metric Name | Description |
| -------------------------------------------------- | ------------------------------------- | ------------------------------------------------------------------------------------ |
| Friendli TCache hit ratio (0≀value≀1) | friendli\_tcache\_hit\_ratio\_bucket | Bucketized number of histogram samples for TCache hit ratio, with `le` label |
| | friendli\_tcache\_hit\_ratio\_count | Total number of histogram samples for TCache hit ratio |
| | friendli\_tcache\_hit\_ratio\_sum | Sum of histogram sample values for TCache hit ratio |
| The length of input tokens (Experimental metric) | friendli\_input\_lengths\_bucket | Bucketized number of histogram samples for length of input tokens, with `le` label |
| | friendli\_input\_lengths\_count | Total number of histogram samples for length of input tokens |
| | friendli\_input\_lengths\_sum | Sum of histogram sample values for length of input tokens |
| The length of output tokens (Experimental metric) | friendli\_output\_lengths\_bucket | Bucketized number of histogram samples for length of output tokens, with `le` label |
| | friendli\_output\_lengths\_count | Total number of histogram samples for length of output tokens |
| | friendli\_output\_lengths\_sum | Sum of histogram sample values for length of output tokens |
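To make these series concrete, you can scrape the metrics endpoint directly and, if a Prometheus server scrapes the container, derive percentiles from the `_bucket` series. A minimal sketch, assuming the default `--metrics-port` of 8281 and a Prometheus server at `localhost:9090` (both assumptions; adjust to your setup):

```sh
# Inspect the raw histogram samples exposed by the container (default metrics port 8281).
curl -s http://localhost:8281/metrics | grep '^friendli_input_lengths'

# Approximate p99 of input length over the last 5 minutes, queried from a Prometheus
# server that scrapes these metrics (assumed to run at localhost:9090).
curl -s --get 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=histogram_quantile(0.99, rate(friendli_input_lengths_bucket[5m]))'
```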
For visualizing histograms using Grafana, [How to visualize Prometheus histograms in Grafana](https://grafana.com/blog/2020/06/23/how-to-visualize-prometheus-histograms-in-grafana) provides useful tips. ### Quantiles Quantiles are used to show the current p50(median), p90, and p99 percentiles of variables.
| Quantiles | Metric Name | Description |
| --------------------------------------------- | ------------------------------------------- | ----------------------------------------------------------------------------------------------- |
| Request completion latency (in nanoseconds) | friendli\_requests\_latencies | Percentile value for request completion latency (`quantile` label is either 0.5, 0.9, or 0.99) |
| | friendli\_requests\_latencies\_count | Total number of samples for request completion latency |
| | friendli\_requests\_latencies\_sum | Sum of sample values for request completion latency |
| Time to first token (TTFT) (in nanoseconds) | friendli\_requests\_ttft | Percentile value for time to first token (TTFT) (`quantile` label is either 0.5, 0.9, or 0.99) |
| | friendli\_requests\_ttft\_count | Total number of samples for time to first token (TTFT) |
| | friendli\_requests\_ttft\_sum | Sum of sample values for time to first token (TTFT) |
| Request queueing delay (in nanoseconds) | friendli\_requests\_queueing\_delays | Percentile value for queueing delay (`quantile` label is either 0.5, 0.9, or 0.99) |
| | friendli\_requests\_queueing\_delays\_count | Total number of samples for queueing delay |
| | friendli\_requests\_queueing\_delays\_sum | Sum of sample values for queueing delay |
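Since these latency values are exported in nanoseconds, it is often convenient to rescale them at query time. A small sketch, again assuming a Prometheus server at `localhost:9090` scrapes the container (an assumption; adjust the address to your setup):

```sh
# p99 time-to-first-token converted from nanoseconds to milliseconds.
curl -s --get 'http://localhost:9090/api/v1/query' \
  --data-urlencode 'query=friendli_requests_ttft{quantile="0.99"} / 1e6'
```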
### Info The following information metric always has a value of 1. The metric labels contain useful information in text. | Metric Name | Label | Description | | ------------------------- | --------- | -------------- | | friendli\_engine\_version | `version` | Engine version | ## Grafana Dashboard Template ![Grafana Dashboard](https://mintlify.s3.us-west-1.amazonaws.com/friendliai/static/images/guides/container/grafana_template_dashboard_example.png) You can import [the dashboard templates](https://github.com/friendliai/container-resource/tree/main/grafana) to your Grafana instance. The Grafana instance must be connected to a Prometheus instance (or a Prometheus-compatible data source) which is configured to scrape metrics from Friendli Container processes. The dashboard template works with Grafana v8.0.0 or later versions. We recommend using Grafana v10.0.0 or later for the best experience. # Optimizing Inference with Policy Search Source: https://friendli.ai/docs/guides/container/optimizing_inference_with_policy_search For specialized cases like MoE or quantized models, optimizing the execution policy in Friendli Engine can boost inference performance by 1.5x to 2x, improving throughput and reducing latency. ## Introduction For specialized cases, like **serving MoE models (e.g., Mixtral)** or **quantized models**, inference performance can be further optimized through an execution policy search. This step can be skipped, but it is necessary to reach the optimized speed of the Friendli Engine. When the Friendli Engine runs with the optimal policy, performance can increase by 1.5x to 2x (in both throughput and latency). Therefore, we recommend skipping policy search for simple model testing, and performing it for cost analysis, latency analysis, or production serving. Policy search is effective only when serving (1) MoE models or (2) AWQ, FP8, or INT8 quantized models. Otherwise, it has no effect. ## Running Policy Search You can run policy search by adding the following options to the launch command of Friendli Container. | Options | Type | Summary | Default | | -------------------------- | ------- | -------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------- | | `--algo-policy-dir` | TEXT | Path to the directory to save the searched optimal policy file. The default value is the current working directory. | current working dir | | `--search-policy` | BOOLEAN | Runs policy search to find the best Friendli execution policy for the given configuration such as model type, GPU, NVIDIA driver version, quantization scheme, etc. | false | | `--terminate-after-search` | BOOLEAN | Terminates the engine container after policy search. 
| false | ### Example: `FriendliAI/Llama-3.1-8B-Instruct-fp8` For example, you can start the policy search for the [FriendliAI/Llama-3.1-8B-Instruct-fp8](https://huggingface.co/FriendliAI/Llama-3.1-8B-Instruct-fp8) model as follows: ```sh export HF_MODEL_NAME="FriendliAI/Llama-3.1-8B-Instruct-fp8" export FRIENDLI_CONTAINER_SECRET="YOUR CONTAINER SECRET" export FRIENDLI_CONTAINER_IMAGE="registry.friendli.ai/trial" export GPU_ENUMERATION='"device=0"' export POLICY_DIR=$PWD/policy mkdir -p $POLICY_DIR docker run \ --gpus $GPU_ENUMERATION \ -p 8000:8000 \ -v ~/.cache/huggingface:/root/.cache/huggingface \ -v $POLICY_DIR:/policy \ -e FRIENDLI_CONTAINER_SECRET=$FRIENDLI_CONTAINER_SECRET \ $FRIENDLI_CONTAINER_IMAGE \ --hf-model-name $HF_MODEL_NAME \ --algo-policy-dir /policy \ --search-policy true ``` ### Example: `mistralai/Mixtral-8x7B-Instruct-v0.1` (TP=4) ```sh export HF_MODEL_NAME="mistralai/Mixtral-8x7B-Instruct-v0.1" export FRIENDLI_CONTAINER_SECRET="YOUR CONTAINER SECRET" export FRIENDLI_CONTAINER_IMAGE="registry.friendli.ai/trial" export GPU_ENUMERATION='"device=0,1,2,3"' export POLICY_DIR=$PWD/policy mkdir -p $POLICY_DIR docker run -p 8000:8000 \ --ipc=host --gpus $GPU_ENUMERATION \ -v ~/.cache/huggingface:/root/.cache/huggingface \ -v $POLICY_DIR:/policy \ -e FRIENDLI_CONTAINER_SECRET=$FRIENDLI_CONTAINER_SECRET \ $FRIENDLI_CONTAINER_IMAGE \ --hf-model-name $HF_MODEL_NAME \ --num-devices 4 \ --algo-policy-dir /policy \ --search-policy true ``` Once the policy search is complete, a policy file will be created in `$POLICY_DIR`. If the policy file already exists, the engine will search only the necessary spaces and update the policy file accordingly. After the policy search, the engine starts serving the endpoint using the policy file. It takes up to several minutes to find the optimal policy for the Llama 2 13B model on an NVIDIA A100 80GB GPU. The estimated time and remaining time will be displayed on stderr when you run the policy search. ## Running Policy Search Without Starting Serving Endpoint To search for the best policy without starting the serving endpoint, launch the engine with the Friendli Container command and include the `--terminate-after-search true` option. ### Example: `FriendliAI/Llama-3.1-8B-Instruct-fp8` ```sh docker run \ --gpus $GPU_ENUMERATION \ -p 8000:8000 \ -v ~/.cache/huggingface:/root/.cache/huggingface \ -v $POLICY_DIR:/policy \ -e FRIENDLI_CONTAINER_SECRET=$FRIENDLI_CONTAINER_SECRET \ $FRIENDLI_CONTAINER_IMAGE \ --hf-model-name FriendliAI/Llama-3.1-8B-Instruct-fp8 \ --algo-policy-dir /policy --search-policy true --terminate-after-search true ``` ### Example: `mistralai/Mixtral-8x7B-Instruct-v0.1` (TP=4) ```sh docker run -p 8000:8000 \ --ipc=host --gpus $GPU_ENUMERATION \ -v ~/.cache/huggingface:/root/.cache/huggingface \ -v $POLICY_DIR:/policy \ -e FRIENDLI_CONTAINER_SECRET=$FRIENDLI_CONTAINER_SECRET \ $FRIENDLI_CONTAINER_IMAGE \ --hf-model-name mistralai/Mixtral-8x7B-Instruct-v0.1 \ --num-devices 4 \ --algo-policy-dir /policy \ --search-policy true --terminate-after-search true ``` ## FAQ: When to Run Policy Search Again? The execution policy depends on the following factors: * Model * GPU * GPU count and parallelism degree (the values of the `--num-devices` and `--num-workers` options) * NVIDIA Driver major version * Friendli Container version You should run policy search again when any of these change in your serving setup. Otherwise, you can keep reusing the existing policy file, as sketched below. 
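If none of these factors have changed, subsequent launches can simply mount the same policy directory and serve without searching again. A minimal sketch reusing the variable names from the examples above, assuming a policy file already exists in `$POLICY_DIR`:

```sh
# Serve with the previously searched policy; omitting --search-policy means no new search is run.
docker run \
  --gpus $GPU_ENUMERATION \
  -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v $POLICY_DIR:/policy \
  -e FRIENDLI_CONTAINER_SECRET=$FRIENDLI_CONTAINER_SECRET \
  $FRIENDLI_CONTAINER_IMAGE \
  --hf-model-name $HF_MODEL_NAME \
  --algo-policy-dir /policy
```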
# QuickStart: Friendli Container Trial Source: https://friendli.ai/docs/guides/container/quickstart Learn how to get started with Friendli Container in this step-by-step guide. Access to the Container registry, prepare you container secret, run your Friendli Container, and monitor using Grafana. ## Introduction [Friendli Container](https://friendli.ai/products/container) enables you to efficiently deploy LLMs of your choice on your infrastructure. With Friendli Container, you can perform high-speed LLM inferencing in a secure and private environment. This tutorial will guide you through the process of running a Friendli Container for your LLM. ## Prerequisites * **Hardware Requirements**: Friendli Container currently only targets x86\_64 architecture and supports NVIDIA GPUs, so please prepare proper GPUs and a compatible driver by referring to [our required CUDA compatibility guide](/guides/container/cuda_compatibility). * **Software Requirements**: Your machine should be able to run containers with the [NVIDIA container toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/index.html). In this tutorial, we will use Docker as container runtime and make use of [Docker Compose](https://docs.docker.com/compose). * **Model Compatibility**: If your model is in a [safetensors](https://huggingface.co/docs/safetensors/index) format, which is compatible with [Hugging Face transformers](https://huggingface.co/docs/transformers), you can serve the model directly with the Friendli Container. Please check our [Model library](https://friendli.ai/models) for the non-exhaustive list of supported models. This tutorial assumes that your model of choice is uploaded to [Hugging Face](https://huggingface.co) and you have access to it. If the model is gated or private, you need to prepare a [Hugging Face Access Token](https://huggingface.co/settings/tokens). ## Getting Access to Friendli Container ### Activate your Free Trial [Contact sales](https://friendli.ai/contact) to activate your free trial. ### Get Access to the Container Registry Friendli Token is a user credential that is required for logging into our container registry. 1. Go to [Personal settings > Tokens](https://friendli.ai/suite/setting/tokens) and click 'Create token'. 2. Save the token you just created. ### Prepare your Container Secret Container secret is a secret code that is used to activate Friendli Container. You should pass the container secret as an environment variable to run the container image. 1. Go to [Container > Container Secrets](https://suite.friendli.ai/default-team/container/secrets) and click 'Create secret'. 2. Save the secret you just created. **πŸ”‘ Secret Rotation** You can rotate the container secret for security reasons. If you rotate the container secret, a new secret will be created and the previous secret will be automatically revoked in **30** minutes. ## Running Friendli Container ### Pull the Friendli Container Image 1. Log in to the container registry using the email address for your Friendli Suite account and the Friendli Token. ```sh export FRIENDLI_EMAIL="YOUR ACCOUNT EMAIL ADDRESS" export FRIENDLI_TOKEN="YOUR FRIENDLI TOKEN" docker login registry.friendli.ai -u $FRIENDLI_EMAIL -p $FRIENDLI_TOKEN ``` 2. Pull the image. ```sh docker pull registry.friendli.ai/trial ``` ### Run Friendli Container with a HuggingFace Model 1. Clone our [container resource](https://github.com/friendliai/container-resource) git repository. 
```sh git clone https://github.com/friendliai/container-resource cd container-resource/quickstart/docker-compose ``` 2. Set up environment variables. ```sh export HF_MODEL_NAME="<...>" # Hugging Face model name (e.g., "meta-llama/Meta-Llama-3-8B-Instruct") export FRIENDLI_CONTAINER_SECRET="<...>" # Friendli container secret ``` If your model is a private or gated one, you also need to provide [HuggingFace Access Token](https://huggingface.co/settings/tokens). ```sh export HF_TOKEN="<...>" # HuggingFace Access Token ``` 3. Launch the Friendli Container. ```sh docker compose up -d ``` By default, the container will listen for inference requests at TCP port 8000 and a Grafana service will be available at TCP port 3000. You can change the designated ports using the following environment variables. For example, if you want to use TCP port 8001 and port 3001 for Grafana, execute the command below. ```sh export FRIENDLI_PORT="8001" export FRIENDLI_GRAFANA_PORT="3001" ``` Even though the machine has multiple GPUs, the container will make use of only one GPU, specifically the first GPU (`device_ids: ['0']`). You can edit `docker-compose.yaml` to change what GPU device the container will use. The downloaded HuggingFace model will be cached in the `$HOME/.cache/huggingface` directory. You may want to clean up this directory after completing this tutorial. ### Send Inference Requests You can now send inference requests to the running container. For information on all parameters that can be used in an inference request, please refer to [this document](/openapi). ```sh Chat Completion curl -X POST http://0.0.0.0:8000/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ "messages": [ {"role": "user", "content": "What makes a good leader?"} ], "max_tokens": 30 }' ``` ```sh Completion curl -X POST http://0.0.0.0:8000/v1/completions \ -H "Content-Type: application/json" \ -d '{ "prompt": "What makes a good leader?", "max_tokens": 30 }' ``` ```sh Tokenization curl -X POST http://0.0.0.0:8000/v1/tokenize \ -H "Content-Type: application/json" \ -d '{ "prompt": "What is generative AI?" }' ``` ```sh Detokenization curl -X POST http://0.0.0.0:8000/v1/detokenize \ -H "Content-Type: application/json" \ -d '{ "tokens": [ 128000, 3923, 374, 1803, 1413, 15592, 30 ] }' ``` Chat completion requests work only if the model's tokenizer config contains a `chat_template`. ### Monitor using Grafana Using your browser, open [http://0.0.0.0:3000/d/friendli-engine](http://0.0.0.0:3000/d/friendli-engine), and login with username `admin` and password `admin`. You can now access the dashboards showing useful engine metrics. ![Grafana Dashboard](https://mintlify.s3.us-west-1.amazonaws.com/friendliai/static/images/guides/container/grafana_template_dashboard_example.png) If you cannot open a browser directly in the GPU machine where the Friendli Container is running, you can use SSH to forward requests from the browser running on your PC to the GPU machine. ```sh # Change these variables to match your environment. LOCAL_GRAFANA_PORT=3000 # The number of the port in your PC. FRIENDLI_GRAFANA_PORT=3000 # The number of the port in the GPU machine. ssh "$GPU_MACHINE_ADDRESS" -L "$LOCAL_GRAFANA_PORT:0.0.0.0:$FRIENDLI_GRAFANA_PORT" ``` where `$GPU_MACHINE_ADDRESS` shall be replaced with the address of the GPU machine. You may also want to use `-l login_name` or `-p port` options to connect to the GPU machine using SSH. Then using your browser on the PC, open `http://0.0.0.0:$LOCAL_GRAFANA_PORT/d/friendli-engine`. 
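If you prefer to script the inference requests from the earlier step in Python rather than with curl, the container's OpenAI-compatible route can be called with the `requests` library. A minimal sketch, assuming the container listens on port 8000 as configured above and that the response follows the usual chat-completions shape:

```python
# pip install requests
import requests

# Same payload as the chat completion curl example above.
response = requests.post(
    "http://0.0.0.0:8000/v1/chat/completions",
    json={
        "messages": [{"role": "user", "content": "What makes a good leader?"}],
        "max_tokens": 30,
    },
    timeout=60,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```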
## Going Further Congratulations! You can now serve your LLM of choice using your hardware, with the power of the most efficient LLM serving engine on the planet. The following topics will help you go further through your AI endeavors. * **Multi-GPU Serving**: Although this tutorial is limited to using only one GPU, Friendli Container supports tensor parallelism and pipeline parallelism for multi-GPU inference. Check [Multi-GPU Serving](/guides/container/running_friendli_container#multi-gpu-serving) for more information. * **Serving Multi-LoRA Models**: You can deploy multiple customized LLMs without additional GPU resources. See [Serving Multi-LoRA Models](/guides/container/serving_multi_lora_models) to learn how to launch the container with your adapters. * **Serving Quantized Models**: Running quantized models requires an additional step of [execution policy search](/guides/container/optimizing_inference_with_policy_search). See [Serving Quantized Models](/guides/container/serving_quantized_models) to learn how to create an inference endpoint for quantized models. * **Serving MoE Models**: Running MoE (Mixture of Experts) models requires an additional step of [execution policy search](/guides/container/optimizing_inference_with_policy_search). See [Serving MoE Models](/guides/container/serving_moe_models) to learn how to create an inference endpoint for MoE models. If you are stuck or need help going through this tutorial, please ask for support by sending an email to [Support](mailto:support@friendli.ai). # Running Friendli Container Source: https://friendli.ai/docs/guides/container/running_friendli_container Friendli Container enables you to effortlessly deploy your generative AI model on your own machine. This tutorial will guide you through the process of running a Friendli Container. ## Introduction Friendli Container enables you to effortlessly deploy your generative AI model on your own machine. This tutorial will guide you through the process of running a Friendli Container. The current version of Friendli Container supports most of major generative language models. ## Prerequisites * Before you begin, make sure you have signed up for [Friendli Suite](https://friendli.ai/suite). * [Contact sales](https://friendli.ai/contact) to activate your free trial. * Friendli Container currently only supports NVIDIA GPUs, so please prepare proper GPUs and a compatible driver by referring to [our required CUDA compatibility guide](/guides/container/cuda_compatibility). * Prepare a Friendli Token following [this guide](#preparing-friendli-token). * Prepare a Friendli Container Secret following [this guide](#preparing-container-secret). ### Preparing Friendli Token Friendli Token is the user credentials for logging into our container registry. 1. Sign in [Friendli Suite](https://friendli.ai/suite). 2. Go to **[Personal settings > Tokens](https://friendli.ai/suite/setting/tokens)** and click **'Create new token'**. 3. Save your created token value and export it as `FRIENDLI_TOKEN`. ### Preparing Container Secret Container secret is a secret code that is used to activate Friendli Container. You should pass the container secret as an environment variable to run the container image. 1. Sign in [Friendli Suite](https://friendli.ai/suite). 2. Go to **[Container > Container Secrets](https://suite.friendli.ai/default-team/container/secrets)** and click **'Create secret'**. 3. Save your created secret value and export it as `FRIENDLI_CONTAINER_SECRET`. 
**πŸ”‘ Secret Rotation** You can rotate the container secret for security reasons. If you rotate the container secret, a new secret will be created and the previous secret will be revoked automatically in 30 minutes. ## Pulling Friendli Container Image Log in to the Docker client using the Friendli Token created as outlined in [Preparing Friendli Token](#preparing-friendli-token). ```sh export FRIENDLI_EMAIL="YOUR ACCOUNT EMAIL ADDRESS" export FRIENDLI_TOKEN="YOUR FRIENDLI TOKEN" docker login registry.friendli.ai -u $FRIENDLI_EMAIL -p $FRIENDLI_TOKEN ``` ```sh docker pull registry.friendli.ai/trial:latest ``` ## Running Friendli Container with Hugging Face Models If your model is in a [`safetensors`](https://huggingface.co/docs/safetensors/index) format, which is compatible with [Hugging Face transformers](https://huggingface.co/docs/transformers), you can serve the model directly with Friendli Container. Friendli Container supports direct loading of `safetensors` checkpoints for many model types. You can find the complete list of supported models on the [Supported Models page](https://friendli.ai/models/search?products=CONTAINER). If your model is not in the supported model list, please contact us. Here are the instructions to run Friendli Container to serve a Hugging Face model: ```sh # Fill the values of following variables. export HF_MODEL_NAME="" # Hugging Face model name (e.g., "meta-llama/Meta-Llama-3-8B-Instruct") export FRIENDLI_CONTAINER_SECRET="" # Friendli container secret docker run --gpus '"device=0"' -p 8000:8000 \ -e FRIENDLI_CONTAINER_SECRET=$FRIENDLI_CONTAINER_SECRET \ -v ~/.cache/huggingface:/root/.cache/huggingface \ registry.friendli.ai/trial \ --hf-model-name $HF_MODEL_NAME ``` You can append additional `[LAUNCH_OPTIONS]` to this command; see [Launch Options for Friendli Container](#launch-options). By running the above command, you will have a running Docker container that exports an HTTP endpoint for handling inference requests. ### Multi-GPU Serving Friendli Container supports ***tensor parallelism*** and ***pipeline parallelism*** for multi-GPU inference. #### Tensor Parallelism Tensor parallelism is employed when serving large models that exceed the memory capacity of a single GPU, by distributing parts of the model's weights across multiple GPUs. To leverage tensor parallelism with the Friendli Container: 1. Specify multiple GPUs for `$GPU_ENUMERATION` (e.g., '"device=0,1,2,3"'). 2. Use the `--num-devices` (or `-d`) option to specify the tensor parallelism degree (e.g., `--num-devices 4`). #### Pipeline Parallelism Pipeline parallelism splits a model into multiple segments to be processed across different GPUs, enabling the deployment of larger models that would not otherwise fit on a single GPU. To exploit pipeline parallelism with the Friendli Container: 1. Specify multiple GPUs for `$GPU_ENUMERATION` (e.g., '"device=0,1,2,3"'). 2. Use the `--num-workers` (or `-n`) option to specify the pipeline parallelism degree (e.g., `--num-workers 4`). **πŸ†š Choosing between Tensor Parallelism and Pipeline Parallelism** When deploying models with the Friendli Container, you have the flexibility to combine tensor parallelism and pipeline parallelism. We recommend exploring a balance between the two, based on their distinct characteristics. While tensor parallelism involves "expensive" ***all-reduce*** operations to aggregate partial results across all devices, pipeline parallelism relies on "cheaper" ***peer-to-peer*** communication. 
Thus, in limited network setup, such as PCIe networks, leveraging pipeline parallelism is preferable. Conversely, in rich network setup like NVLink, tensor parallelism is recommended due to its superior parallel computation efficiency. ### Advanced: Serving Quantized Models Running quantized models requires an additional step to search execution policy. See [Serving Quantized Models](/guides/container/serving_quantized_models) to learn how to create an inference endpoint for the quantized model. ### Advanced: Serving MoE Models Running MoE (Mixture of Experts) models requires an additional step to search execution policy. See [Serving MoE Models](/guides/container/serving_moe_models) to learn how to create an inference endpoint for the MoE model. ### Examples This is an example running [Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) with a single GPU. ```sh export FRIENDLI_CONTAINER_SECRET="" # Friendli container secret (leave it if it's already set in your environment) export HF_TOKEN="" # Access token from HuggingFace (see the caution below) docker run -p 8000:8000 --gpus '"device=0"' \ -e HF_TOKEN=$HF_TOKEN \ -e FRIENDLI_CONTAINER_SECRET=$FRIENDLI_CONTAINER_SECRET \ -v ~/.cache/huggingface:/root/.cache/huggingface \ registry.friendli.ai/trial \ --hf-model-name meta-llama/Llama-3.1-8B-Instruct ``` Since downloading `meta-llama/Llama-3.1-8B-Instruct` is allowed only for authorized users, you need to provide your [Hugging Face User Access Token](https://huggingface.co/docs/huggingface_hub/package_reference/environment_variables#hftoken) through `HF_TOKEN` environment variable. It works the same for all private repositories. This is an example running [Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct) with a multi-GPU setup. ```sh {5, 11} export FRIENDLI_CONTAINER_SECRET="" # Friendli container secret (leave it if it's already set in your environment) export HF_TOKEN="" # Access token from HuggingFace (see the caution below) docker run -p 8000:8000 \ --ipc=host --gpus '"device=0,1"' \ -e HF_TOKEN=$HF_TOKEN \ -e FRIENDLI_CONTAINER_SECRET=$FRIENDLI_CONTAINER_SECRET \ -v ~/.cache/huggingface:/root/.cache/huggingface \ registry.friendli.ai/trial \ --hf-model-name meta-llama/Llama-3.1-70B-Instruct \ --num-devices 2 ``` Since downloading `meta-llama/Llama-3.1-70B-Instruct` is allowed only for authorized users, you need to provide your [Hugging Face User Access Token](https://huggingface.co/docs/huggingface_hub/package_reference/environment_variables#hftoken) through `HF_TOKEN` environment variable. It works the same for all private repositories. ## Sending Inference Requests We can now send inference requests to the running Friendli Container. For information on all parameters that can be used in an inference request, please refer to [this document](/openapi/serverless/chat-completions). 
### Examples ```sh cURL curl -X POST http://0.0.0.0:8000/v1/chat/completions \ -H "Content-Type: application/json" \ -d '{ "messages": [ {"role": "user", "content": "What makes a good leader?"} ], "max_tokens": 30, "stream": true }' ``` ```python Python SDK # pip install friendli from friendli import SyncFriendli client = SyncFriendli() stream = client.container.chat.complete( messages=[{"role": "user", "content": "Python is a popular"}], max_tokens=30, stream=True, ) for chunk in stream: print(chunk.text, end="", flush=True) ``` ## Options for Running Friendli Container ### General Options | Options | Type | Summary | Default | Required | | ----------- | ---- | -------------------------------------- | ------- | -------- | | `--version` | - | Print Friendli Container version. | - | ❌ | | `--help` | - | Print Friendli Container help message. | - | ❌ | ### Launch Options | Options | Type | Summary | Default | Required | | --------------------------------- | --------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------- | -------- | | `--web-server-port` | INT | Web server port. | 8000 | ❌ | | `--metrics-port` | INT | Prometheus metrics export port. | 8281 | ❌ | | `--hf-model-name` | TEXT | Model name hosted on the Hugging Face Models Hub or a path to a local directory containing a model. When a model name is provided, Friendli Container first checks if the model is already cached at \~/.cache/huggingface/hub and uses it if available. If not, it will download the model from the Hugging Face Models Hub before creating the inference endpoint. When a local path is provided, it will load the model from the location without downloading. This option is only available for models in a safetensors format. | - | ❌ | | `--tokenizer-file-path` | TEXT | Absolute path of tokenizer file. This option is not needed when `tokenizer.json` is located under the path specified at `--ckpt-path`. | - | ❌ | | `--tokenizer-add-special-tokens` | BOOLEAN | Whether or not to add special tokens in tokenization. Equivalent to Hugging Face Tokenizer's `add_special_tokens` argument. The default value is **false** for versions \< v1.6.0. | `true` | ❌ | | `--tokenizer-skip-special-tokens` | BOOLEAN | Whether or not to remove special tokens in detokenization. Equivalent to Hugging Face Tokenizer's `skip_special_tokens` argument. | `true` | ❌ | | `--dtype` | CHOICE: \[bf16, fp16, fp32] | Data type of weights and activations. Choose one of \. This argument applies to non-quantized weights and activations. If not specified, Friendli Container follows the value of `torch_dtype` in `config.json` file or assumes fp16. | fp16 | ❌ | | `--bad-stop-file-path` | TEXT | JSON file path that contains stop sequences or bad words/tokens. | - | ❌ | | `--num-request-threads` | INT | Thread pool size for handling HTTP requests. | 4 | ❌ | | `--timeout-microseconds` | INT | Server-side timeout for client requests, in microseconds. | 0 (no timeout) | ❌ | | `--ignore-nan-error` | BOOLEAN | If set to True, ignore NaN error. 
Otherwise, respond with a 400 status code if NaN values are detected while processing a request. | - | ❌ | | `--max-batch-size` | INT | Max number of sequences that can be processed in a batch. | 384 | ❌ | | `--num-devices`, `-d` | INT | Number of devices to use in tensor parallelism degree. | 1 | ❌ | | `--num-workers`, `-n` | INT | Number of workers to use in a pipeline (i.e., pipeline parallelism degree). | 1 | ❌ | | `--search-policy` | BOOLEAN | Searches for the best engine policy for the given combination of model, hardware, and parallelism degree. Learn more about policy search at [Optimizing Inference with Policy Search](/guides/container/optimizing_inference_with_policy_search). | false | ❌ | | `--terminate-after-search` | BOOLEAN | Terminates engine container after the policy search. | false | ❌ | | `--algo-policy-dir` | TEXT | Path to directory containing the policy file. The default value is the current working directory. Learn more about policy search at [Optimizing Inference with Policy Search](/guides/container/optimizing_inference_with_policy_search). | current working dir | ❌ | | `--adapter-model` | TEXT | Add an adapter model with adapter name and path; \:\. The path can be a name from a Hugging Face model hub. | - | ❌ | ### Model Specific Options #### T5 | Options | Type | Summary | Default | Required | | --------------------- | ---- | ---------------------- | ------- | -------- | | `--max-input-length` | INT | Maximum input length. | - | βœ… | | `--max-output-length` | INT | Maximum output length. | - | βœ… | # Running Friendli Container on SageMaker Source: https://friendli.ai/docs/guides/container/running_friendli_container_on_sagemaker Create a real-time inference endpoint in Amazon SageMaker with Friendli Container backend. By utilizing Friendli Container in your SageMaker pipeline, you'll benefit from the Friendli Engine's speed and resource efficiency. ## Introduction This guide will walk you through creating a [real-time inference endpoint in Amazon SageMaker](https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints.html) with Friendli Container backend. By utilizing Friendli Container in your SageMaker pipeline, you'll benefit from the Friendli Engine's speed and resource efficiency. We'll explore how to create inference endpoints using both the AWS Console and the boto3 Python SDK. ## General Workflow ![Lora Serving](https://mintlify.s3.us-west-1.amazonaws.com/friendliai/static/images/guides/container/sagemaker_workflow.png) 1. **Create a Model**: Within SageMaker Inference, define a new model by specifying the model artifacts in your S3 bucket and the Friendli container image from ECR. 2. **Configure the Endpoint**: Create a SageMaker Inference endpoint configuration by selecting the instance type and the number of instances required. 3. **Create the Endpoint**: Utilize the configured settings to launch a SageMaker Inference endpoint. 4. **Invoke the Endpoint**: Once deployed, send requests to your endpoint to receive inference responses. ## Prerequisite Before beginning, you need to push the Friendli Container image to an ECR repository on AWS. First, prepare the Friendli Container image by following the instructions in [**Pulling Friendli Container Image**](/guides/container/running_friendli_container/#pulling-friendli-container-image). 
Then, tag and push the image to the Amazon ECR repository as guided in [**Pushing a Docker image to an Amazon ECR private repository**](https://docs.aws.amazon.com/AmazonECR/latest/userguide/docker-push-ecr-image.html). ## Using the AWS Console Let's delve into the step-by-step instructions for creating an inference endpoint using the AWS Console. ### Step 1: Creating a Model You can start creating a model by clicking on the **'Create model'** button under **SageMaker > Inference > Models**. Then, configure the model with the following fields: * **Model settings**: * **Model name**: A model name. * **IAM role**: An IAM role that includes the `AmazonSageMakerFullAccess` policy. * **Container definition 1**: * **Container input option**: Select the "Provide model artifacts and inference image location". * **Model Compression Type**: * To use a model in the S3 bucket: * When the model is compressed, select "CompressedModel". * Otherwise, select "UncompressedModel". * When using a model from the Hugging Face hub, any option would work fine. * **Location of inference code image**: Specify the ARN of the ECR repo for the Friendli Container. * **Location of model artifacts** (optional): * To use a model in the S3 bucket: Specify the S3 URI where your model is stored. Ensure the file structure matches the directory format compatible with the `--hf-model-name` option of the Friendli Container. * When using a model from the Hugging Face hub, you can leave this field empty. * **Environment variables**: * Always required: * `FRIENDLI_CONTAINER_SECRET`: Your Friendli Container Secret. Refer to [**Preparing Container Secret**](/guides/container/running_friendli_container/#preparing-container-secret) to learn how to get the container secret. * `SAGEMAKER_MODE`: This should be set to `True`. * `SAGEMAKER_NUM_DEVICES`: Number of devices to use for tensor parallelism degree. * Required when using a model in the S3 bucket: * `SAGEMAKER_USE_S3`: This should be set to `True`. * Required when using a model from the Hugging Face hub: * `SAGEMAKER_HF_MODEL_NAME`: The Hugging Face model name (e.g., `mistralai/Mistral-7B-Instruct-v0.2`) * For private or gated model repos: * `HF_TOKEN`: The Hugging Face secret access token. ### Step 2: Creating an Endpoint Configuration You can start by clicking on the **'Create endpoint configuration'** button under **SageMaker > Inference > Endpoint configurations**. * **Endpoint configuration**: * **Endpoint configuration name**: The name of this endpoint configuration. * **Type of endpoint**: For real-time inference, select "Provisioned". * **Variants**: * To create a "Production" variant, click 'Create production variant'. * Select the model that you have created in [**Step 1**](#step-1-creating-a-model). * Configure the instance type and count by clicking on 'Edit' in the Actions column. * Create the endpoint configuration by clicking 'Create endpoint configuration'. ### Step 3: Creating SageMaker Inference Endpoint You can start by clicking the **'Create endpoint'** button under **SageMaker > Inference > Endpoints**. * Select "Use an existing endpoint configuration". * Select the endpoint configuration created in [**Step 2**](#step-2-creating-an-endpoint-configuration). * Finish by clicking on the 'Create endpoint' button. 
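Endpoint creation usually takes several minutes. Before moving on, you can poll the endpoint status from the command line; a small sketch, assuming the AWS CLI is configured for the same account and region (the endpoint name is a placeholder):

```sh
# The endpoint is ready to invoke once this prints "InService".
aws sagemaker describe-endpoint \
  --endpoint-name "YOUR_ENDPOINT_NAME" \
  --query "EndpointStatus" \
  --output text
```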
### Step 4: Invoking Endpoint When the endpoint status becomes "In Service", you can invoke the endpoint with the following script, after filling in the endpoint name and the region name: ```python import boto3 import json endpoint_name = "FILL OUT ENDPOINT NAME" region_name = "FILL OUT AWS REGION" sagemaker_runtime = boto3.client("sagemaker-runtime", region_name=region_name) prompt = "Story title: 3 llamas go for a walk\nSummary: The 3 llamas crossed a bridge and something unexpected happened\n\nOnce upon a time" payload = { "prompt": prompt, "max_tokens": 512, "temperature": 0.8, } response = sagemaker_runtime.invoke_endpoint( EndpointName=endpoint_name, Body=json.dumps(payload), ContentType="application/json", ) print(response['Body'].read().decode('utf-8')) ``` ## Using the boto3 SDK Next, let's discover the process for creating a SageMaker endpoint using the boto3 Python SDK. You can achieve this by using the code snippet below. Be sure to fill in the custom fields, customized for your specific use case: ```python import boto3 from sagemaker import get_execution_role sm_client = boto3.client(service_name='sagemaker') runtime_sm_client = boto3.client(service_name='sagemaker-runtime') account_id = boto3.client('sts').get_caller_identity()['Account'] region = boto3.Session().region_name role = get_execution_role() endpoint_name="FILL OUT ENDPOINT NAME" model_name="FILL OUT MODEL NAME" container = "FILL OUT ECR IMAGE NAME" # .dkr.ecr..amazonaws.com/IMAGE instance_type = "ml.g5.12xlarge" # instance type container = { 'Image': container, 'Environment': { "HF_TOKEN": "", "FRIENDLI_CONTAINER_SECRET": "", "SAGEMAKER_HF_MODEL_NAME": "", # e.g) meta-llama/Meta-Llama-3-8B "SAGEMAKER_MODE": "True", # Should be true "SAGEMAKER_NUM_DEVICES": "4", # Number of GPUs in `instance_type` } } endpoint_config_name = 'FILL OUT ENDPOINT CONFIG NAME' # Create a model create_model_response = sm_client.create_model( ModelName=model_name, ExecutionRoleArn=role, Containers=[container], ) # Create an endpoint configuration create_endpoint_config_response = sm_client.create_endpoint_config( EndpointConfigName=endpoint_config_name, ProductionVariants=[ { 'InstanceType': instance_type, 'InitialInstanceCount': 1, 'InitialVariantWeight': 1, 'ModelName': model_name, 'VariantName': 'AllTraffic', }, ], ) endpoint_name = "FILL OUT ENDPOINT NAME" # Create an endpoint sm_client.create_endpoint( EndpointName=endpoint_name, EndpointConfigName=endpoint_config_name, ) sm_client.describe_endpoint(EndpointName=endpoint_name) ``` You can invoke this endpoint by following [**Step 4**](#step-4-invoking-endpoint). By following these guides, you'll be able to seamlessly deploy your models using Friendli Container on SageMaker endpoints and leverage their capabilities for real-time inference. # Serving MoE Models Source: https://friendli.ai/docs/guides/container/serving_moe_models Explore the steps to serve Mixture of Experts (MoE) models such as Mixtral 8x7B using Friendli Container. ## Introduction This guide explores the steps to serve Mixture of Experts (MoE) models such as Mixtral 8x7B using Friendli Container. ## Search Optimal Policy and Running Friendli Container To serve MoE models efficiently, it is required to run a policy search to explore the optimal execution policy. Learn how to run the policy search at [Running Policy Search](/guides/container/optimizing_inference_with_policy_search#running-policy-search). 
When the optimal policy is successfully searched, the policy is compiled into a policy file, which can be used for creating serving endpoints. The engine then starts serving the endpoint using the optimal policy. # Serving Multi-LoRA Models Source: https://friendli.ai/docs/guides/container/serving_multi_lora_models The Friendli Engine introduces an innovative approach to this challenge through Multi-LoRA (Low-Rank Adaptation) serving, a method that allows for the simultaneous serving of multiple LLMs, optimized for specific tasks without the need for extensive retraining. ## Introduction In a world where the demand for highly specialized AI capabilities is surging, the ability to deploy multiple customized large language models (LLMs) without additional GPU resources represents a significant leap forward. The Friendli Engine introduces an innovative approach to this challenge through Multi-LoRA (Low-Rank Adaptation) serving, a method that allows for the simultaneous serving of multiple LLMs, optimized for specific tasks without the need for extensive retraining. This advancement opens new avenues for AI efficiency and adaptability, promising to revolutionize the deployment of AI solutions on constrained hardware. This article provides an overview of efficiently serving Multi-LoRA models with the Friendli Engine. ![Lora Serving](https://mintlify.s3.us-west-1.amazonaws.com/friendliai/static/images/guides/container/lora.png) ## Prerequisite `huggingface-cli` should be installed in your local environment. ```sh pip install "huggingface_hub[cli]" ``` ## Downloading Adapter Checkpoints For each adapter model that you want to serve, you have to download it to your local storage. ```sh # Hugging Face model name of the adapters export ADAPTER_MODEL1="" export ADAPTER_MODEL2="" export ADAPTER_MODEL3="" export ADAPTER_DIR=/tmp/adapter huggingface-cli download $ADAPTER_MODEL1 \ --include "adapter_model.safetensors" "adapter_config.json" \ --local-dir $ADAPTER_DIR/model1 huggingface-cli download $ADAPTER_MODEL2 \ --include "adapter_model.safetensors" "adapter_config.json" \ --local-dir $ADAPTER_DIR/model2 huggingface-cli download $ADAPTER_MODEL3 \ --include "adapter_model.safetensors" "adapter_config.json" \ --local-dir $ADAPTER_DIR/model3 ... ``` This will result in a directory structure like: ``` /tmp/adapter/model1 - adapter_model.safetensors - adapter_config.json /tmp/adapter/model2 - adapter_model.safetensors - adapter_config.json /tmp/adapter/model3 - adapter_model.safetensors - adapter_config.json ``` If an adapter's Hugging Face repo does not contain an `adapter_model.safetensors` checkpoint file, you have to manually convert `adapter_model.bin` into `adapter_model.safetensors`. You can use the [official app](https://huggingface.co/spaces/safetensors/convert) or the [python script](https://github.com/huggingface/safetensors/tree/main/bindings/python) for conversion. ## Launch Friendli Engine in Container Once you have prepared the adapter model checkpoints, you can serve the Multi-LoRA model with Friendli Container. In addition to the command for running the base model, you have to add the `--adapter-model` argument. * `--adapter-model`: Add an adapter model with adapter name and path. The path can also be a model name from the Hugging Face Hub. ```sh # Fill the values of following variables. 
export HF_BASE_MODEL_NAME="" # Hugging Face base model name (e.g., "meta-llama/Llama-2-7b-chat-hf") export FRIENDLI_CONTAINER_SECRET="" # Friendli container secret export FRIENDLI_CONTAINER_IMAGE="" # Friendli container image (e.g., "registry.friendli.ai/trial") export GPU_ENUMERATION="" # GPUs (e.g., '"device=0,1"') export ADAPTER_NAME="" # Specify the adapter's name (a user-defined alias). export ADAPTER_DIR=/tmp/adapter docker run \ --gpus $GPU_ENUMERATION \ -p 8000:8000 \ -v $ADAPTER_DIR:/adapter \ -e FRIENDLI_CONTAINER_SECRET=$FRIENDLI_CONTAINER_SECRET \ $FRIENDLI_CONTAINER_IMAGE \ --hf-model-name $HF_BASE_MODEL_NAME \ --adapter-model $ADAPTER_NAME:/adapter/model1 \ [LAUNCH_OPTIONS] ``` You can find available options for `[LAUNCH_OPTIONS]` at [Running Friendli Container: Launch Options](/guides/container/running_friendli_container#launch-options). If you want to launch with multiple adapters, you can use `--adapter-model` with a comma-separated string (e.g., `--adapter-model "adapter_name_0:/adapter/model1,adapter_name_1:/adapter/model2"`). If a `tokenizer_config.json` file is present in an adapter checkpoint path, the engine uses the chat template defined in that adapter's `tokenizer_config.json`. ### Example: Llama 2 7B Chat + LoRA Adapter This is an example that runs [`meta-llama/Llama-2-7b-chat-hf`](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) with the [`FinGPT/fingpt-forecaster_dow30_llama2-7b_lora`](https://huggingface.co/FinGPT/fingpt-forecaster_dow30_llama2-7b_lora) adapter model. ```sh export ADAPTER_DIR=/tmp/adapter huggingface-cli download FinGPT/fingpt-forecaster_dow30_llama2-7b_lora \ --include "adapter_model.safetensors" "adapter_config.json" \ --local-dir $ADAPTER_DIR/model1 docker run \ --gpus '"device=0"' \ -p 8000:8000 \ -v $ADAPTER_DIR:/adapter \ -e FRIENDLI_CONTAINER_SECRET="YOUR CONTAINER SECRET" \ registry.friendli.ai/trial \ --hf-model-name meta-llama/Llama-2-7b-chat-hf \ --adapter-model adapter-model-name:/adapter/model1 ``` ## Sending Request to Specific Adapter You can generate an inference result from a specific adapter model by specifying `model` in the body of an inference request. For example, assuming you set the `--adapter-model` launch option to "\<adapter-name>:\<adapter-path>", you can send a request to the adapter model as follows. ```sh curl -X POST http://0.0.0.0:8000/v1/completions \ -H "Content-Type: application/json" \ -d '{ "model": "adapter-model-name", "prompt": "Python is a language", "max_tokens": 30 }' ``` ## Sending Request to the Base Model If you omit the `model` field in your request, the base model will be used to generate the inference result. You can send a request to the base model as shown below. ```sh curl -X POST http://0.0.0.0:8000/v1/completions \ -H "Content-Type: application/json" \ -d '{ "prompt": "Python is a language", "max_tokens": 30 }' ``` ## Limitations We only support models compatible with [`peft`](https://github.com/huggingface/peft). The base model checkpoint and adapter model checkpoints should have the same data type. When serving multiple adapters simultaneously, each adapter model should have the same target modules. In Hugging Face, the target modules are listed in `adapter_config.json`. # Serving Quantized Models Source: https://friendli.ai/docs/guides/container/serving_quantized_models Tutorial for serving quantized models with Friendli Engine. Friendli Engine supports FP8, INT8, and AWQ-quantized model checkpoints. 
## Limitations

* We only support models compatible with [`peft`](https://github.com/huggingface/peft).
* The base model checkpoint and the adapter model checkpoints should have the same data type.
* When serving multiple adapters simultaneously, each adapter model should have the same target modules. In Hugging Face checkpoints, the target modules are listed in `adapter_config.json`.

# Serving Quantized Models
Source: https://friendli.ai/docs/guides/container/serving_quantized_models
Tutorial for serving quantized models with the Friendli Engine. The Friendli Engine supports FP8, INT8, and AWQ-quantized model checkpoints.

## Introduction

Quantization is a technique that reduces the precision of a generative AI model's parameters, optimizing memory usage and inference speed while maintaining acceptable accuracy. This tutorial will walk you through the process of serving quantized models with Friendli Container.

## Off-the-Shelf Model Checkpoints from Hugging Face Hub

To use model checkpoints that are already quantized and available on Hugging Face Hub, check the following options:

* Checkpoints quantized with [friendli-model-optimizer](https://github.com/friendliai/friendli-model-optimizer)
* [Quantized model checkpoints by FriendliAI](https://huggingface.co/FriendliAI)
* A subset of models quantized with:
  * [`AutoAWQ`](https://github.com/casper-hansen/AutoAWQ)
  * [`AutoFP8`](https://github.com/neuralmagic/AutoFP8)
  * [`llm-compressor`](https://github.com/vllm-project/llm-compressor)

For details on how to use these models, go directly to [Serving Quantized Models](#serving-quantized-models).

## Quantizing Your Own Models (FP8/INT8)

To quantize your own models with FP8 or INT8, follow these steps:

1. **Install the `friendli-model-optimizer` package**

   This tool provides model quantization for efficient generative AI serving with the Friendli Engine. Install it using the following command:

   ```sh
   pip install "friendli-model-optimizer"
   ```

2. **Prepare the Original Model**

   Ensure you have the original model checkpoint that can be loaded using Hugging Face's [`transformers`](https://github.com/huggingface/transformers) library.

3. **Quantize the Model with Friendli Model Optimizer (FMO)**

   You can run quantization with the command below:

   ```sh
   export MODEL_NAME_OR_PATH=""    # Hugging Face pretrained model name or directory path of the original model checkpoint.
   export OUTPUT_DIR=""            # Directory path to save the quantized checkpoint and related configurations.
   export QUANTIZATION_SCHEME=""   # Quantization technique to apply. You can use fp8 or int8.
   export DEVICE=""                # Device to run the quantization process. Defaults to "cuda:0".

   fmo quantize \
     --model-name-or-path $MODEL_NAME_OR_PATH \
     --output-dir $OUTPUT_DIR \
     --mode $QUANTIZATION_SCHEME \
     --device $DEVICE
   ```

   When the model checkpoint is successfully quantized, the following files will be created at `$OUTPUT_DIR`:

   * `config.json`
   * `model.safetensors`
   * `special_tokens_map.json`
   * `tokenizer_config.json`
   * `tokenizer.json`

   If the size of the model exceeds **10GB**, multiple sharded checkpoints are generated instead of a single `model.safetensors`:

   * `model-00001-of-00005.safetensors`
   * `model-00002-of-00005.safetensors`
   * `model-00003-of-00005.safetensors`
   * `model-00004-of-00005.safetensors`
   * `model-00005-of-00005.safetensors`

For more information about FMO, check out [this documentation](https://github.com/friendliai/friendli-model-optimizer).
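Before moving on to serving, it can help to verify that the quantized checkpoint was written as described above. Below is a minimal sketch; the output directory path is illustrative and should match the `$OUTPUT_DIR` you used.

```python
from pathlib import Path

# Same path as --output-dir in the fmo quantize command above (illustrative).
output_dir = Path("/tmp/quantized-model")

# Config and tokenizer files listed in this guide.
for name in [
    "config.json",
    "special_tokens_map.json",
    "tokenizer_config.json",
    "tokenizer.json",
]:
    status = "ok" if (output_dir / name).is_file() else "MISSING"
    print(f"{name}: {status}")

# Either a single model.safetensors or sharded model-XXXXX-of-XXXXX.safetensors files should exist.
weights = sorted(output_dir.glob("model*.safetensors"))
print("weight files:", [p.name for p in weights] or "MISSING")
```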
## Serving Quantized Models

### Search Optimal Policy

To serve quantized models efficiently, you need to run a policy search to explore the optimal execution policy. Learn how to run the policy search at [Running Policy Search](/guides/container/optimizing_inference_with_policy_search#running-policy-search).

### Serving FP8 Models

Once you have prepared the quantized model checkpoint, you are ready to create a serving endpoint.

```sh
# Fill in the values of the following variables.
export HF_MODEL_NAME=""               # Quantized model name in Hugging Face Hub or directory path of the quantized model checkpoint.
export FRIENDLI_CONTAINER_SECRET=""   # Friendli container secret
export FRIENDLI_CONTAINER_IMAGE=""    # Friendli container image (e.g., "registry.friendli.ai/trial")
export GPU_ENUMERATION=""             # GPUs (e.g., '"device=0,1"')
export POLICY_DIR=$PWD/policy

mkdir -p $POLICY_DIR

docker run \
  --gpus $GPU_ENUMERATION \
  -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v $POLICY_DIR:/policy \
  -e FRIENDLI_CONTAINER_SECRET=$FRIENDLI_CONTAINER_SECRET \
  $FRIENDLI_CONTAINER_IMAGE \
  --hf-model-name $HF_MODEL_NAME \
  --algo-policy-dir /policy \
  --search-policy true
```

### Example: `FriendliAI/Llama-3.1-8B-Instruct-fp8`

FP8 model serving is only supported by NVIDIA **Ada**, **Hopper**, and **Blackwell** GPU architectures.

```sh
# Fill in the values of the following variables.
export FRIENDLI_CONTAINER_SECRET=""   # Friendli container secret
export FRIENDLI_CONTAINER_IMAGE=""    # Friendli container image (e.g., "registry.friendli.ai/trial")
export GPU_ENUMERATION=""             # GPUs (e.g., '"device=0,1"')
export POLICY_DIR=$PWD/policy         # Make sure this directory contains the policy search results.

docker run \
  --gpus $GPU_ENUMERATION \
  -p 8000:8000 \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v $POLICY_DIR:/policy \
  -e FRIENDLI_CONTAINER_SECRET=$FRIENDLI_CONTAINER_SECRET \
  $FRIENDLI_CONTAINER_IMAGE \
  --hf-model-name FriendliAI/Llama-3.1-8B-Instruct-fp8 \
  --algo-policy-dir /policy \
  --search-policy true
```

# Dataset Specifications and Upload Guide
Source: https://friendli.ai/docs/guides/dataset
Learn how to upload datasets for fine-tuning models on Friendli.

export const RoundedBorderBox = ({children, caption}) =>
{children} {caption &&

{caption}

}
; ### Uploading Datasets This document explains how to upload datasets for fine-tuning. On Friendli, you can upload datasets via the web interface or the SDK. You can easily upload datasets through the web interface. Files in `.jsonl` and `.parquet` formats are supported, and each dataset should be structured as follows: #### Conversation This is the most basic dataset format. The `role` field can be `system`, `user`, or `assistant`. ``` {"messages": [{"role": "...", "content": "..."}]} ``` #### Alpaca (Beta) Two types of Alpaca datasets are supported as shown below.\ For compatibility with the Conversation format, they are automatically converted according to a template during upload. If you do not want automatic conversion, please convert to the Conversation format before uploading, or use the SDK to upload. ``` {"instruction": "...", "output": "..."} {"instruction": "...", "input": "...", "output": "..."} ``` #### Multi-Modal (Image) For multi-modal inputs, the following three formats are supported for compatibility.\ Currently, the web interface does not support `local path`, `base64`, or `PIL.Image` objects. For these cases, please use the SDK to upload. ``` {"messages": [{"role": "...", "content": [{"type": "text", "text": "..."}, {"type": "image", "image": "https://example.com/image.jpg"}]}]} {"messages": [{"role": "...", "content": [{"type": "text", "text": "..."}, {"type": "image", "image_url": "https://example.com/image.jpg"}]}]} {"messages": [{"role": "...", "content": [{"type": "text", "text": "..."}, {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}]}]} ``` ### How to Upload a Dataset First, go to the **'Datasets'** section in the [Friendli Suite](https://friendli.ai/suite). Click the **'New Dataset'** button to start the upload process.\ From the dropdown, select **'Upload a file directly'** option. Uploading a Dataset Step 1 Click the File Upload Area in the Dataset file section, or drag and drop the file you want to upload. Then click the **'Upload'** button to start uploading. Uploading a Dataset Step 2 The dataset will be uploaded progressively in the background. Once the upload is complete, you can rename it, add splits, and preview each split. Uploading a Dataset Step 3 {/* This content is completely duplicated from "/guides/tutorials/how-to-fine-tune-vlm". */} ## Prerequisites 1. Head to [Friendli Suite](https://friendli.ai/get-started/dedicated-endpoints) and create an account. 2. Issue a **Friendli Token** by going to [Personal settings > Tokens](https://friendli.ai/suite/setting/tokens). Make sure to copy and store it securely in a safe place as you won't be able to see it again after refreshing the page.\ For detailed instructions, see [Personal Access Tokens](/guides/personal_access_tokens). ## Step 1. Prepare Your Dataset Your dataset should be a conversational dataset in `.jsonl` or `.parquet` format, where each line represents a sequence of messages. Each message in the conversation should include a `"role"` (e.g., `system`, `user`, or `assistant`) and `"content"`. For VLM fine-tuning, user content can contain both text and image data (Note that for image data, we support URL and Base64). Here's an example of what it should look like. Note that it's one line but beautified for readability: ```json { "messages": [ { "role": "system", "content": "You are a helpful assistant." 
}, { "role": "user", "content": [ { "type": "image", "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg" }, { "type": "image", "image": "data:image/png;base64," }, { "type": "text", "text": "Describe this image in detail." } ] }, { "role": "assistant", "content": "The image is a bee." } ] } ``` You can access our example dataset ['FriendliAI/gsm8k'](https://huggingface.co/datasets/FriendliAI/gsm8k) (for Chat), ['FriendliAI/sample-vision'](https://huggingface.co/datasets/FriendliAI/sample-vision) (for Chat with image) and explore some of our quantized generative AI models on [our Hugging Face page](https://huggingface.co/FriendliAI). ## Step 2. Upload Your Dataset Once you have prepared your dataset, you can upload it to Friendli using the [Python SDK](/sdk/python-sdk). ### Install the Python SDK First, install the Friendli Python SDK: ```bash # Using pip pip install friendli # Using poetry poetry add friendli ``` ### Upload Your Dataset Use the following code to create a dataset and upload your samples: ```python import os from friendli.friendli import SyncFriendli from friendli.models import Sample TEAM_ID = os.environ["FRIENDLI_TEAM_ID"] PROJECT_ID = os.environ["FRIENDLI_PROJECT_ID"] TOKEN = os.environ["FRIENDLI_TOKEN"] # Read dataset file and parse each line as a Sample with open("dataset.jsonl", "rb") as f: data = [Sample.model_validate_json(line) for line in f] with SyncFriendli( token=TOKEN, x_friendli_team=TEAM_ID, ) as friendli: # Create a new dataset with TEXT and IMAGE modalities with friendli.dataset.create( modality=["TEXT", "IMAGE"], name="my-vlm-dataset", # name of the dataset project_id=PROJECT_ID, ) as dataset: # Upload samples to the dataset # Each line from your dataset file becomes a separate sample dataset.upload_samples( samples=data, split="train", # name of the split to upload to ) ``` ### How It Works Friendli Python SDK doesn't upload your entire dataset file at once. Instead, it processes your dataset more efficiently: 1. **Reads your dataset file line by line**: Each line is parsed as a `Sample` object containing a conversation with messages. 2. **Creates a dataset**: A new dataset is created in your Friendli project with the specified modalities (`TEXT` and `IMAGE`). 3. **Uploads each conversation as a separate sample**: Rather than uploading the entire file, each conversation (line in the dataset file) becomes an individual sample in the dataset. 4. **Organizes by splits**: Samples are organized into splits like "train", "validation", or "test" for different purposes during fine-tuning. ### Environment Variables Make sure to set the required environment variables: ```bash export FRIENDLI_TOKEN="your-friendli-token" export FRIENDLI_TEAM_ID="your-team-id" export FRIENDLI_PROJECT_ID="your-project-id" ``` You can find your Team ID and Project ID in the URL of Friendli Suite, formatted as `https://friendli.ai///...`. ### View Your Dataset To view and edit the datasets you've uploaded, visit [Friendli Suite > Dataset](https://friendli.ai/suite/~/dataset). View Datasets in Friendli Suite View Dataset in Friendli Suite ### Next Steps Now that you have uploaded your dataset, you can proceed to fine-tune your model. Learn more about how to fine-tune and deploy a model using your uploaded dataset. Learn how to fine-tune vision models using multi-modal datasets. # Endpoints Source: https://friendli.ai/docs/guides/dedicated_endpoints/endpoints Endpoints are the actual deployments of your models on your specified GPU resource. 
export const RoundedBorderBox = ({children, caption}) =>
{children} {caption &&

{caption}

}
; ## What are Endpoints? Endpoints are the actual deployments of your models on a dedicated GPU resource. They provide a stable and efficient interface to serve your models in real-world applications, ensuring high availability and optimized performance. With endpoints, you can manage model versions, scale resources, and seamlessly integrate your model into production environments. ### Key Capabilities of Endpoints: * **Efficient Model Serving**: Deploy models on powerful GPU instances optimized for your use case. * **Flexibility with Multi-LoRA Models**: Serve multiple fine-tuned adapters alongside base models. * **Autoscaling**: Automatically adjust resources to handle varying workloads, ensuring optimal performance and cost efficiency. * **Monitoring and Management**: Check endpoint health, adjust configurations, and view logs directly from the platform. * **Interactive Testing**: Use the integrated playground to test your models before integrating them into applications. * **API Integration**: Access your models via robust OpenAI-compatible APIs, enabling easy integration into any system. ## Creating Endpoints You can create your endpoint by specifying the name, the model, and the instance configuration, consisting of your desired GPU specification. Endpoint Create ## Intelligent Autoscaling Autoscaling Config Our autoscaling system automatically adjusts computational resources based on your traffic patterns, helping you optimize both performance and costs. ### How Autoscaling Works * **Minimum Replicas**: * When set to 0, the endpoint enters sleeping status during periods of inactivity, helping to minimize costs * When set to a value greater than 0, the endpoint maintains at least that number of active replicas at all times * **Maximum Replicas**: Defines the upper limit of replicas that can be created to handle increased traffic load * **Cooldown Period**: The time delay before scaling down an active replica. This ensures the system doesn't prematurely reduce capacity during temporary drops in traffic. ### Benefits of Autoscaling * **Cost Optimization**: Pay only for the resources you need by automatically scaling to zero during idle periods * **Performance Management**: Handle traffic spikes efficiently by automatically adding replicas * **Resource Efficiency**: Maintain optimal resource utilization across varying workload patterns ## Serving Multi-LoRA Models You can serve Multi-LoRA models using Friendli Dedicated Endpoints. For an overview of Multi-LoRA models, refer to our [document on serving Multi-LoRA models with Friendli Container](/guides/container/serving_multi_lora_models). In Friendli Dedicated Endpoints, Multi-LoRA model is supported only in Enterprise plan. For pricing and availability, [Contact sales](https://friendli.ai/contact). ## Checking Endpoint Status After creating the Endpoint, you can view its health status and Endpoint URL on the Endpoint's details page. Endpoint Detail The cost of using dedicated endpoints accumulates from the `INITIALIZING` status. Specifically, charges begin after the `Initializing GPU` phase, where the endpoint waits to acquire the GPU. The endpoint then downloads and loads the model onto the GPU, which usually takes less than a minute. ## Using Playgrounds To test the deployed model via the web, we provide a playground interface where you can interact with the model using a user-friendly chat interface. Simply enter your query, adjust your settings, and generate your responses! 
Endpoint Playground

Send inference queries to your model through our [API](/openapi) at the given endpoint address, accessible on the endpoint information tab.

{/* TODO: add image for sending APIs */}

# Frequently Asked Questions and Troubleshooting
Source: https://friendli.ai/docs/guides/dedicated_endpoints/faq
While working through our tutorials, you might have had questions regarding the details of the requirements and specifications. We have listed the frequently asked questions in this separate document.

export const RoundedBorderBox = ({children, caption}) =>
{children} {caption &&

{caption}

}
; While working through our tutorials, you might have had questions regarding the details of the requirements and specifications. We have listed the frequently asked questions in this separate document. Please refer to the relevant information below:

## Format Requirements

### General requirements for a model

* A model should be in safetensors format.
* The model should NOT be nested inside another directory.
* Including other arbitrary files (that are not in the list) is totally fine. However, those files will not be downloaded or used.

| Required | Filename                  | Description                                                                                                           |
| -------- | ------------------------- | --------------------------------------------------------------------------------------------------------------------- |
| Yes      | *safetensors*             | Model weight, e.g. `model.safetensors`. Use `model.safetensors.index.json` for split safetensors files                 |
| Yes      | config.json               | Model config that includes the architecture. ([Supported Models on Friendli](https://friendli.ai/models))              |
| No       | tokenizer.json            | Tokenizer for the model                                                                                                 |
| No       | tokenizer\_config.json    | Tokenizer config. This should be present & have a `chat_template` field for the Friendli Engine to provide chat APIs   |
| No       | special\_tokens\_map.json |                                                                                                                         |

### General requirements for a dataset

* Read our documentation on the [fine-tuning dataset format](/guides/dedicated_endpoints/fine-tuning#dataset-format) for information on the dataset requirements.

## 3rd-party account integration

Personal settings

### How to integrate a Hugging Face account

* [Log in to Hugging Face, then navigate to user settings → access tokens → User Access Tokens. Acquire a token.](https://huggingface.co/settings/tokens)
* You may use a fine-grained token. In this case, please make sure the token has view permission for the repository you'd like to use.
* [Integrate the key in Friendli Suite → Personal settings → Account → Integrations](https://friendli.ai/suite/setting/account)

If you revoke or invalidate the key, you will have to update it in order not to disrupt ongoing deployments, or to launch a new inference deployment or fine-tuning job.

### How to integrate a W\&B account

* [Log in to W\&B, then navigate to user settings → danger zone → API keys. Acquire a token.](https://wandb.ai/authorize)
* [Integrate the key in Friendli Suite → Personal settings → Account → Integrations](https://friendli.ai/suite/setting/account)

If you revoke or invalidate the key, you will have to update it in order not to disrupt ongoing deployments, or to launch a new inference deployment or fine-tuning job.
#### Extra: How to upload a safetensors format model to W\&B using the W\&B CLI

* Install the CLI and log in using the API key → [Command Line Interface | Weights & Biases Documentation](https://docs.wandb.ai/ref/cli)
* Upload the model as a W\&B artifact using the command below:

```
wandb artifact put -n project/artifact_id --type model /path/to/dir
```

* With all this, the W\&B artifact will look like this:

![W\&B artifact](https://mintlify.s3.us-west-1.amazonaws.com/friendliai/static/images/guides/dedicated_endpoints/faq/wandb_artifact.png)

## Using 3rd-party model

### How to use a W\&B artifact as a model

![W\&B artifact as a model](https://mintlify.s3.us-west-1.amazonaws.com/friendliai/static/images/guides/dedicated_endpoints/faq/wandb_model.png)

* Use the full name of the artifact
* The *artifact name* must be in the format: `org/project/artifact_id:version`

### How to use a Hugging Face repository as a model

![HF artifact as a model](https://mintlify.s3.us-west-1.amazonaws.com/friendliai/static/images/guides/dedicated_endpoints/faq/hf_model.png)

* Use the repository id of the model. You may select the entry from the list of autocompleted model repositories.
* You may choose a specific branch, or manually enter a commit hash.

## Using W\&B with Dedicated Fine-tuning

* When launching a fine-tuning job, you can designate a W\&B project that the metrics will be exported to. If you provide a W\&B project name that already exists, your job will be added to that project. Otherwise, a new W\&B project will be automatically created in your integrated W\&B account. If the project name is not provided, it defaults to 'friendliai'.

![W\&B project](https://mintlify.s3.us-west-1.amazonaws.com/friendliai/static/images/guides/dedicated_endpoints/faq/wandb_project.png)

* As the training starts, you will be able to see a new 'Run' in the project you chose.

![W\&B Run](https://mintlify.s3.us-west-1.amazonaws.com/friendliai/static/images/guides/dedicated_endpoints/faq/wandb_run.png)

* By clicking the project, you can easily track & monitor the status of the training job.

![W\&B Log](https://mintlify.s3.us-west-1.amazonaws.com/friendliai/static/images/guides/dedicated_endpoints/faq/wandb_log.png)

If new runs are not displayed in your project, please check that the default team is set correctly on [W\&B user settings](https://wandb.ai/settings).

![W\&B Default team](https://mintlify.s3.us-west-1.amazonaws.com/friendliai/static/images/guides/dedicated_endpoints/faq/wandb_default_team.png)

## Troubleshooting

### Can't access the artifact

![Troubleshooting - can't access](https://mintlify.s3.us-west-1.amazonaws.com/friendliai/static/images/guides/dedicated_endpoints/faq/troubleshooting_cant_access.png)

* The artifact might be nonexistent, or hidden so that you cannot access it.

### You don't have access to this gated model

![Troubleshooting - no access](https://mintlify.s3.us-west-1.amazonaws.com/friendliai/static/images/guides/dedicated_endpoints/faq/troubleshooting_no_access.png)

* The repository is gated. Please follow the steps and gain approval from the owner using Hugging Face Hub.
### The repository / artifact is invalid ![Troubleshooting - invalid repo](https://mintlify.s3.us-west-1.amazonaws.com/friendliai/static/images/guides/dedicated_endpoints/faq/troubleshooting_invalid_repo.png) ![Troubleshooting - invalid artifact](https://mintlify.s3.us-west-1.amazonaws.com/friendliai/static/images/guides/dedicated_endpoints/faq/troubleshooting_invalid_artifact.png) * The model does not meet the requirements. Please check if the model follows a correct safetensors format. ### The architecture is not supported ![Troubleshooting - unsupported](https://mintlify.s3.us-west-1.amazonaws.com/friendliai/static/images/guides/dedicated_endpoints/faq/troubleshooting_unsupported.png) * The model architecture is not supported. Please refer to [Supported Models on Friendli](https://friendli.ai/models). # Fine-tuning Source: https://friendli.ai/docs/guides/dedicated_endpoints/fine-tuning Effortlessly fine-tune your model with Friendli Dedicated Endpoints, which leverages the Parameter-Efficient Fine-Tuning (PEFT) method to reduce training costs while preserving model quality, similar to full-parameter fine-tuning. export const RoundedBorderBox = ({children, caption}) =>
{children} {caption &&

{caption}

}
; ### In order to fine-tune large generic models for your specific purpose, you may fine-tune models on Friendli Dedicated Endpoints. Effortlessly fine-tune your model with [Friendli Dedicated Endpoints](https://friendli.ai/products/dedicated-endpoints), which leverages the Parameter-Efficient Fine-Tuning (PEFT) method to reduce training costs while preserving model quality, similar to full-parameter fine-tuning. This can make your model become an expert on a specific topic, and prevent hallucinations from your model. ## Table of Contents 1. **[How to Select Your Base Model](#how-to-select-your-base-model)** 2. **[How to Upload Your Dataset](#how-to-upload-your-dataset)** 3. **[How to Create Your Fine-tuning Job](#how-to-create-your-fine-tuning-job)** 4. **[How to Monitor Progress](#how-to-monitor-progress)** 5. **[How to Deploy the Fine-tuned Model](#how-to-deploy-the-fine-tuned-model)** 6. **[Resources](#resources)** By the end of this guide, you will understand how you can effectively fine-tune your generative AI models by using Friendli Dedicated Endpoints. ## How to Select Your Base Model Through our (1) Hugging Face Integration and (2) Weights & Biases (W\&B) Integration, you can select the base model to fine-tune. Explore and find open-source models that are supported on Friendli Dedicated Endpoints [here](https://friendli.ai/models/search?products=DEDICATED). For guidance on the necessary format and file requirements, especially when using your own models, review the FAQ section on [general requirements for a model](/guides/dedicated_endpoints/faq#general-requirements-for-a-model). * **Hugging Face Model** ![Hugging Face Model](https://mintlify.s3.us-west-1.amazonaws.com/friendliai/static/images/guides/dedicated_endpoints/finetuning/hf_model.png) * **Weights & Biases Model** ![Weights & Biases Model](https://mintlify.s3.us-west-1.amazonaws.com/friendliai/static/images/guides/dedicated_endpoints/finetuning/wandb_model.png) ### Hugging Face Integration Integrate your [Hugging Face account](https://huggingface.co) to access your private repo or a gated repo. Go to [**Personal settings > Account > Hugging Face integration**](https://friendli.ai/suite/setting/account) and save your [Hugging Face access token](https://huggingface.co/docs/hub/security-tokens). This access token will be used upon creating your fine-tuning jobs. Check our FAQ section on [using a Hugging Face repository as a model](/guides/dedicated_endpoints/faq#how-to-use-a-hugging-face-repository-as-a-model) and [integrating a Hugging Face account](/guides/dedicated_endpoints/faq#how-to-integrate-a-hugging-face-account) for more detailed integration information. ### Weights & Biases (W\&B) Integration Integrate your [Weights & Biases account](https://wandb.ai/site) to access your model artifact. Go to [**Personal settings > Account > Weights & Biases integration**](https://friendli.ai/suite/setting/account) and save your Weights & Biases API key, which you can obtain [here](https://wandb.ai/authorize). This API key will be used upon creating your fine-tuning jobs. Check our FAQ section on [using a W\&B artifact as a model](/guides/dedicated_endpoints/faq#how-to-use-a-w%26b-artifact-as-a-model) and [integrating a W\&B account](/guides/dedicated_endpoints/faq#how-to-integrate-a-w%26b-account) for more detailed integration information. ## How to Upload Your Dataset Navigate to the **'Datasets'** section to upload your fine-tuning dataset. 
Enter the dataset name, then either drag and drop your `.jsonl` or `.parquet` dataset file or browse for them on your computer. If your files meet the required criteria, the blue 'Upload' button will be activated, allowing you to complete the process. Upload dataset (Chat) For more advanced dataset management, including uploading and organizing datasets via API, refer to our [API documentation](/openapi/dataset/overview). You can also upload a dataset with our [Python SDK](/sdk/python-sdk). You can access our example dataset ['FriendliAI/gsm8k'](https://huggingface.co/datasets/FriendliAI/gsm8k) (for Chat), ['FriendliAI/sample-vision'](https://huggingface.co/datasets/FriendliAI/sample-vision) (for Chat with image) and explore some of our quantized generative AI models on [our Hugging Face page](https://huggingface.co/FriendliAI). ### Upload chat with image dataset via Python SDK #### Install Python SDK ```bash pip install friendli ``` #### Upload dataset ```python import os from friendli.friendli import SyncFriendli from friendli.models import Sample TEAM_ID = os.environ["FRIENDLI_TEAM_ID"] PROJECT_ID = os.environ["FRIENDLI_PROJECT_ID"] TOKEN = os.environ["FRIENDLI_TOKEN"] with SyncFriendli( token=TOKEN, x_friendli_team=TEAM_ID, ) as friendli: # Create dataset with friendli.dataset.create( modality=["TEXT", "IMAGE"], name="test-create-dataset-sync", project_id=PROJECT_ID, ) as dataset: # Read dataset with open("dataset.jsonl", "rb") as f: data = [Sample.model_validate_json(line) for line in f] # Add samples to dataset dataset.upload_samples( samples=data, split="train", ) ``` ### Dataset Format The dataset used for fine-tuning should satisfy the following conditions: 1. The dataset must contain a column named **"messages"**, which will be used for fine-tuning. 2. Each row in the "messages" column should be compatible with the chat template of the base model. For example, [`tokenizer_config.json`](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2/blob/41b61a33a2483885c981aa79e0df6b32407ed873/tokenizer_config.json#L42) of `mistralai/Mistral-7B-Instruct-v0.2` is a template that repeats the messages of a user and an assistant. Concretely, each row in the "messages" field should follow a format like: `[{"role": "user", "content": "The 1st user's message"}, {"role": "assistant", "content": "The 1st assistant's message"}]`. In this case, `HuggingFaceH4/ultrachat_200k` is a dataset that is compatible with the chat template. #### Examples Here’s an example of what it should look like. Note that it’s one line but beautified for readability: ```json Chat { "messages": [ { "role": "system", "content": "You are a helpful assistant." }, { "role": "user", "content": "What is the capital of France?" }, { "role": "assistant", "content": "The capital of France is Paris" } ] } ``` ```json Chat with image { "messages": [ { "role": "system", "content": "You are a helpful assistant." }, { "role": "user", "content": [ { "type": "image", "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg" }, { "type": "text", "text": "Describe this image in detail." } ] }, { "role": "assistant", "content": "The image is a bee." } ] } ``` ## How to Create Your Fine-tuning Job Navigate to the **'Fine-tuning'** section to launch and view your fine-tuning jobs. You can view the training progress in a job's detail page by clicking on the fine-tuning job. To create a new fine-tuning job, follow these steps: 1. Go to your project and click on the **Fine-tuning** tab. 2. 
Click **'New job'**. 3. Fill out the job configuration based on the following field descriptions: * **Job name**: Name of fine-tuning job to create. * **Model**: Hugging Face Models repository or Weights & Biases model artifact name. * **Dataset**: Your uploaded fine-tuning dataset. * **Weights & Biases (W\&B)**: Optional for W\&B integration. * **W\&B project**: Your W\&B project name. * **Hyperparameters**: Fine-tuning Hyperparameters. * **`Learning rate`**: Initial learning rate for AdamW optimizer. * **`Batch size`**: Total training batch size. * **Total number of training**: Configure the number of training cycles with either `Number of training epochs` or `Training steps`. * **`Number of training epochs`**: Total number of training epochs. * **`Training steps`**: Total number of training steps. * **`Evaluation steps`**: Number of steps between model evaluation using the validation dataset. * **`LoRA rank`**: The rank of the LoRA parameters (optional). * **`LoRA alpha`**: Scaling factor that determines the influence of the low-rank matrices during fine-tuning (optional). * **`LoRA dropout`**: Dropout rate applied during fine-tuning (optional). 4. Click the **'Create'** button to create a job with the input configuration. ## How to Monitor Progress After launching the fine-tuning job, you can monitor the job overview, including progress information and fine-tuning configuration. If you have integrated your Weights & Biases (W\&B) account, you can also monitor the training status in your W\&B project. Read our FAQ section on [using W\&B with dedicated fine-tuning](/guides/dedicated_endpoints/faq#using-w%26b-with-dedicated-fine-tuning) to learn more about monitoring you fine-tuning jobs on their platform. ## How to Deploy the Fine-tuned Model Once the fine-tuning process is complete, you can immediately deploy the model by clicking the 'Deploy' button in the top right corner. The name of the fine-tuned LoRA adapter will be the same as your fine-tuning job name. Fine-tuning Completed The steps to deploy the fine-tuned model are equivalent to how you would deploy a custom model on Friendli Dedicated Endpoints. For further information, refer to the [Endpoints documentation](/guides/dedicated_endpoints/endpoints) for more detailed information on launching a model. ## Resources * [Supported open-source models](https://friendli.ai/models) * ['FriendliAI/gsm8k' on Hugging Face](https://huggingface.co/datasets/FriendliAI/gsm8k) * [FAQ on general requirements for a model](/guides/dedicated_endpoints/faq#general-requirements-for-a-model) * [FAQ on using a Hugging Face repository as a model](/guides/dedicated_endpoints/faq#how-to-use-a-hugging-face-repository-as-a-model) * [FAQ on integrating a Hugging Face account](/guides/dedicated_endpoints/faq#how-to-integrate-a-hugging-face-account) * [FAQ on using a W\&B artifact as a model](/guides/dedicated_endpoints/faq#how-to-use-a-w%26b-artifact-as-a-model) * [FAQ on integrating a W\&B account](/guides/dedicated_endpoints/faq#how-to-integrate-a-w%26b-account) * [FAQ on using W\&B with dedicated fine-tuning](/guides/dedicated_endpoints/faq#using-w%26b-with-dedicated-fine-tuning) * [Endpoints documentation on model deployment](/guides/dedicated_endpoints/endpoints) # Deploy with Hugging Face Models Source: https://friendli.ai/docs/guides/dedicated_endpoints/huggingface_tutorial Hands-on tutorial for launching and deploying LLMs using Friendli Dedicated Endpoints with Hugging Face models. export const RoundedBorderBox = ({children, caption}) =>
{children} {caption &&

{caption}

}
; #### Hands-on Tutorial Deploying `meta-llama-3-8b-instruct` LLM from Hugging Face using Friendli Dedicated Endpoints ## Introduction With Friendli Dedicated Endpoints, you can easily spin up scalable, secure, and highly available inference deployments, without the need for extensive infrastructure expertise or significant capital expenditures. This tutorial is designed to guide you through the process of launching and deploying LLMs using Friendli Dedicated Endpoints. Through a series of step-by-step instructions and hands-on examples, you'll learn how to: * Select and deploy pre-trained LLMs from Hugging Face repositories * Deploy and manage your models using the Friendli Engine * Monitor and optimize your inference deployments By the end of this tutorial, you'll be equipped with the knowledge and skills necessary to unlock the full potential of LLMs in your applications, products, and services. So, let's get started and explore the possibilities of Friendli Dedicated Endpoints! ## Prerequisites: * A Friendli Suite account with access to [Friendli Dedicated Endpoints](https://friendli.ai/suite) * A Hugging Face account with access to the [meta-llama-3-8b-instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) model ## Step 1: Create a new endpoint 1. Log in to your Friendli Suite account and navigate to the Friendli Dedicated Endpoints dashboard. 2. If not done already, start the free trial for Dedicated Endpoints. 3. Create a new project, then click on the 'New Endpoint' button. 4. Fill in the basic information: * Endpoint name: Choose a unique name for your endpoint (e.g., "My New Endpoint"). 5. Select the model: Hugging Face Model Search * Model Repository: Select "Hugging Face" as the model provider. * Model ID: Enter "meta-llama/Meta-Llama-3-8B-Instruct" as the model id. As the search bar loads the list, click on the top result that exactly matches the repository id. By default, the model pulls the latest commit on the default branch of the model. You may manually select a specific branch / tag / commit instead. If you're using your own model, check [Format Requirements](/guides/dedicated_endpoints/faq#format-requirements) for requirements. 6. Select the instance: Select instance * Instance configuration: Choose a suitable instance type based on your performance requirements. We suggest 1x A100 80G for most models. In some cases where the model's size is big, some options may be restricted as they are guaranteed to not run due to insufficient VRAM. Low Memory Warning 7. Edit the configurations: Autoscaling Config
Engine Config * Autoscaling: By default, the autoscaling ranges from 0 to 2 replicas. This means that the deployment will sleep when it's not being used, which reduces cost. * Advanced configuration: Some LLM options including the batch size and token configurations are mutable. For this tutorial, we'll leave it as-is. 8. Click 'Create' to create a new endpoint. ## Step 2: Test the endpoint 1. Wait for the deployment to be created and initialized. This may take a few minutes. You may check the status by the indicator under the endpoint's name. Initializing Endpoint 2. In the "Playground" section, you may enter a sample input prompt (e.g., "What is the capital of France?"). 3. Click on the right arrow button to send the inference request. Playground 4. If you are an enterprise user, you can use the "Metrics" and "Logs" section to monitor the endpoint. Metrics
Logs ## Step 3: Send requests by using cURL or Python 1. As instructed in our [API docs](/openapi/serverless/chat-completions), you can send instructions with the following code: ```sh cURL curl -X POST https://api.friendli.ai/dedicated/v1/chat/completions \ -H "Content-Type: application/json" \ -H "Authorization: Bearer $FRIENDLI_TOKEN" \ -d '{ "model": "(endpoint-id)", "messages": [ { "role": "user", "content": "What is the capital of France?" } ], "max_tokens": 200, "top_k": 1 }' ``` ```python Python import requests import json import os url = 'https://api.friendli.ai/dedicated/v1/chat/completions' payload = json.dumps({ "model": f"{os.environ['ENDPOINT_ID']}", "messages": [ { "role": "user", "content": "What is the capital of France?" } ], "max_tokens": 200, "top_k": 1 }) headers = { "Content-Type": "application/json", "Accept": "application/json", "Authorization": f"Bearer {os.environ['FRIENDLI_TOKEN']}" } response = requests.request("POST", url, headers=headers, data=payload) print(response.text) ``` ## Step 4: Update the endpoint 1. You can update the model and change almost everything by clicking the update button. # Introducing Friendli Dedicated Endpoints Source: https://friendli.ai/docs/guides/dedicated_endpoints/introduction Friendli Dedicated Endpoints gives you the reins to explore the full potential of your custom generative AI models on the hardware of your choice, whether you're crafting innovative eloquent texts, generating stunning images, or even more. Friendli Dedicated Endpoints (previously known as **PeriFlow Cloud**) gives you the reins to explore the full potential of your custom generative AI models on the hardware of your choice, whether you're crafting innovative eloquent texts, generating stunning images, or even more. ## What are Friendli Dedicated Endpoints? Don't be limited to pre-trained models. Friendli Dedicated Endpoints lets you take center stage: * **Seamless Serving, Powered by the Friendli Engine**: Experience the magic of the Friendli Engine, our patented GPU-optimized serving technology. Sit back and watch as your models come to life with automatically optimized performances, orchestrated seamlessly by Friendli Dedicated Endpoints. * **Choose or Upload Your Model**: Use your own custom models that are tailored to your specific needs and purposes. Otherwise, simply choose from the open-source models available on [HuggingFace](https://huggingface.co). Text generation, image creation, code synthesis – the possibilities are limitless. * **Control Your Instance**: Select the perfect GPU for your model. The GPU resources are dedicated entirely to your generative AI models. No sharing is required. * **Per-second Billing, Worry-free Optimization**: Focus on your creative pursuits, not cost management. Pay only for the seconds your model runs, eliminating the burden of manual optimization. Let Friendli Dedicated Endpoints handle the heavy lifting. * **Proven Reliability for Real-World Success**: Trusted by leading companies, Friendli Dedicated Endpoints delivers robust performance for even the most demanding workloads. ## Getting Started with Friendli Dedicated Endpoints: Ready to step up your generative AI game? Getting started is as simple as: 1. **Sign Up for a Free Account**: Experience the power of Friendli Dedicated Endpoints risk-free. 2. **Choose or Upload Your Model**: Harness your own custom-trained creation or simply select an open-source model. 3. **Launch Your GPU Instance**: Select the perfect GPU for your model. 4. 
**Get Your Endpoint Address**: Your gateway to unleashing your model's magic. 5. **Fine-tune Your Model**: Optionally, you can fine-tune your generic model for your specific needs. 6. **Send Your Input**: Prompt your model, send your queries, and let your creativity flow. 7. **Witness the Magic**: Sit back and marvel as your custom model delivers stunningly fast outputs, tailored to your specific needs. Friendli Dedicated Endpoints is more than just an AI serving platform - it's a launchpad for your creative ambitions. Dive into the website ([https://friendli.ai](https://friendli.ai)) and blog ([https://friendli.ai/blog](https://friendli.ai/blog)) to discover deeper insights, use cases, and customer testimonials. In our documentations, you can find how you can (1) manage your [projects](/guides/dedicated_endpoints/projects) and (2) [models](/guides/dedicated_endpoints/models), and (3) make them come to life on your [endpoints](/guides/dedicated_endpoints/endpoints), as well as to (4) [fine-tune](/guides/dedicated_endpoints/fine-tuning) them for your specific purposes. To quickly have a look at our service, take a look at our [quickstart](/guides/dedicated_endpoints/quickstart) document. Reserve your own GPUs for your model! It's time to run your own models cost-efficiently with Friendli Dedicated Endpoints! ## Additional Resources: * FriendliAI website: [https://friendli.ai](https://friendli.ai) * FriendliAI blog: [https://friendli.ai/blog](https://friendli.ai/blog) # Serving LoRA Models Source: https://friendli.ai/docs/guides/dedicated_endpoints/lora-models Learn how to deploy LoRA models from Hugging Face Hub to Friendli Dedicated Endpoints for efficient inference, including a quick guide for FLUX LoRA models. This document explains how to deploy LoRA models available on Hugging Face to Friendli Dedicated Endpoints. Friendli Dedicated Endpoints support deploying LoRA adapters for both text generation and FLUX models. ## FLUX LoRA Quick Deployment Guide This tutorial demonstrates how to deploy the FLUX LoRA model [multimodalart/flux-tarot-v1](https://huggingface.co/multimodalart/flux-tarot-v1), which is trained to generate images in the style of Rider–Waite Tarot cards. Friendli offers a convenient one-click deployment feature, Deploy-Model, that streamlines the process of serving LoRA adapters from the Hugging Face Hub on Dedicated Endpoints. To deploy a specific model, simply use a URL in the format `https://friendli.ai/deploy-model/{hf-model-id}`. For example, to deploy the FLUX LoRA model mentioned above, use [this link](https://friendli.ai/deploy-model/multimodalart/flux-tarot-v1). This will launch the deployment workflow, allowing you to quickly serve and experiment with the model on Friendli. LoRA Model Deployment Clicking the link above will display a screen like the one shown. Click the 'Deploy now' button here to deploy the LoRA model to Friendli Dedicated Endpoints. Once the deployment is complete, a screen like the one below will appear. Click the 'Go to Suite' button to navigate to the playground where you can use the LoRA model. Original Generated Image LoRA Generated Image ## Advanced: Deploying LoRA Models with Custom Settings While the quick deployment method described above is convenient, you can also deploy LoRA endpoints with custom settings. This allows you to specify the GPU instance type, endpoint name, scaling options, and more. Log in to your Friendli Suite account and navigate to the Friendli Dedicated Endpoints dashboard. 
If not done already, start the free trial for Dedicated Endpoints. Friendli Suite Endpoint List Create a new project, then click on the 'New Endpoint' button. You'll see a screen like the one below. Enter an Endpoint Name, for example, "My New LoRA Endpoint". Create Endpoint Friendli Suite currently supports LoRA adapters trained within the Suite and those available on the Hugging Face Hub. Since this tutorial doesn't cover fine-tuning, we'll focus on deploying LoRA adapters from the Hugging Face Hub. First, in the Base Model section, select "Hugging Face" and choose the base model for the LoRA adapter you want to deploy. There are several ways to find the base model of a LoRA adapter. The most common method is to check the model tree on the Hugging Face model page. In this example, we'll deploy the `predibase/tldr_content_gen` adapter. Base Model Selection On the [Hugging Face model page](https://huggingface.co/predibase/tldr_content_gen) for this adapter, you can find the Model tree on the right side. This shows the base model used. In this case, the adapter is based on the `mistralai/Mistral-7B-v0.1` model. Base Model Selection Enter the identified base model name into the model input field on the Endpoint Create page. Now it's time to select the LoRA adapter. Base Model Selection Once the base model is selected, the 'Add LoRA adapter' button will become active. Click it to open the modal window for adding LoRA adapters. Base Model Selection In this modal, you can choose between "Project adapters" (adapters fine-tuned within Friendli Suite) and "Hugging Face adapters". Select "Hugging Face adapters" and enter the Hugging Face Model ID of the adapter. For this tutorial, it's `predibase/tldr_content_gen`. Base Model Selection After adding the adapter, your screen should look like this. Now, select the instance type, configure the autoscaling options appropriately, and click the 'Create' button. For details on other options, please refer to the [Deploy with Hugging Face Models](/guides/dedicated_endpoints/huggingface_tutorial#step-1%3A-create-a-new-endpoint) documentation. Base Model Selection Once the endpoint is deployed, you'll see a screen like this. Navigate to the Playground page to quickly compare the adapter model and the base model. Base Model Selection In the Playground, use the highlighted dropdown menu to switch between the adapter model and the base model for experimentation and comparison. That's it! You have successfully deployed a LoRA adapter on Friendli Dedicated Endpoints and experimented with it in the Playground. Now you can explore deploying multiple adapters on a single endpoint (Multi-LoRA Endpoints) or use the API to send requests to the model and integrate it into your applications. # Models Source: https://friendli.ai/docs/guides/dedicated_endpoints/models Within your Friendli Dedicated Endpoints projects you can prepare and manage the models that you wish to deploy. You may upload your models within your project to deploy them directly on your endpoints. Alternatively, you may manage them on the HuggingFace repository or Weights & Biases artifacts, as our endpoints can load models from your project, HuggingFace repositories, and Weights & Biases artifacts. ### Within your project, you can prepare and manage the models that you wish to deploy. You may upload your models within your project to deploy them directly on your endpoints. 
Alternatively, you may manage them on the HuggingFace repository or Weights & Biases artifacts, as our endpoints can load models from your project, HuggingFace repositories, and Weights & Biases artifacts. * At the moment, we support loading models from your uploaded model, HuggingFace repositories, and Weights & Biases artifacts. ![HuggingFace](https://mintlify.s3.us-west-1.amazonaws.com/friendliai/static/images/guides/dedicated_endpoints/hugging_face.png) Deploy models from public or private Hugging Face repositories. Load models as Weights & Biases artifacts for easy versioning. Use LoRA-adapted models for efficient fine-tuning and deployment. # Pricing Source: https://friendli.ai/docs/guides/dedicated_endpoints/pricing Friendli Dedicated Endpoints pricing detail page. Friendli Dedicated Endpoints offer pricing with flexible monthly billing based on actual usage. ### Supported Instance Types Pricing is based on the instance type selected for the endpoint. The following instance types are supported for endpoints: | Endpoint | GPU Type | Basic | Enterprise | | -------- | ---------- | ------------ | ------------- | | | H200 141GB | \$5.9 / hour | Contact sales | | | H100 80GB | \$4.9 / hour | Contact sales | | | A100 80GB | \$2.9 / hour | Contact sales | ### Supported Model Sizes Pricing is based on model size and calculated per 1M tokens. | Fine-tuning | Model | Basic | Enterprise | | ----------- | ------------------ | ------------------ | ------------- | | | Models ≀ 16B | \$0.50 / 1M tokens | Contact sales | | | Models 16.1B - 72B | \$3.00 / 1M tokens | Contact sales | Contact sales for a discounted custom pricing plan for your enterprise. For more information on pricing and feature comparisons between basic and enterprise plans, please visit our [pricing page](https://friendli.ai/pricing/dedicated-endpoints). # Projects Source: https://friendli.ai/docs/guides/dedicated_endpoints/projects Friendli Dedicated Endpoints projects are a basic working unit for your team. export const RoundedBorderBox = ({children, caption}) =>
{children} {caption &&

{caption}

}
; ### Projects are a basic working unit for your team. You can freely add and remove members to control access to your project. * You can view your list of projects Project List * For project settings, you can view your project ID. Project Settings * For project members, you can manage the members who have access to your project. Project Members * To add a member to your project, simply enter their names or emails and hit the add button. Add Project Members # QuickStart: Friendli Dedicated Endpoints Source: https://friendli.ai/docs/guides/dedicated_endpoints/quickstart Learn how to get started with Friendli Dedicated Endpoints in this step-by-step guide. Create an account, select your project, choose a model you wish to serve, deploy your endpoint, and seamlessly generate text, code, and more with ease. export const RoundedBorderBox = ({children, caption}) =>
{children} {caption &&

{caption}

}
; ## 1. Log In or Sign Up * If you have an account, log in using your preferred SSO or email/password combination. * If you're new to FriendliAI, create an account for free. Login ## 2. Access Friendli Dedicated Endpoints * On your left sidebar, find the "Dedicated Endpoints" option. * Click the option to access the endpoint list page. Sidebar ## 3. Prepare Your Model * Choose a model that you wish to serve from HuggingFace, Weights & Biases, or upload your custom model on our cloud. ![HuggingFace](https://mintlify.s3.us-west-1.amazonaws.com/friendliai/static/images/guides/dedicated_endpoints/hugging_face.png) ## 4. Deploy Your Endpoint * Deploy your endpoint, using the model of your choice prepared from step 3, and the instance equipped with your desired GPU specification. * You can also configure your replicas and the max-batch-size for your endpoint. Endpoint Create
Endpoint Detail ## 5. Generate Responses * You can generate your responses in two ways: playground and endpoint URL. * Try out and test generating responses on your custom model using a chatGPT-like interface at the playground tab. Endpoint Playground * For general usages, send queries to your model through our [API](/openapi) at the given endpoint address, accessible on the endpoint information tab. ### Generating Responses Through the Endpoint URL Refer to [this guide](/guides/personal_access_tokens) for general instructions on Friendli Token. ```sh cURL curl -X POST https://api.friendli.ai/dedicated/v1/chat/completions \ -H "Content-Type: application/json" \ -H "X-Friendli-Team: $TEAM_ID" \ -H "Authorization: Bearer $FRIENDLI_TOKEN" \ -d '{ "model": "(endpoint-id)", "messages": [ { "role": "user", "content": "Python is a popular" } ] }' ``` ```python Python SDK # pip install friendli import os from friendli import SyncFriendli client = SyncFriendli( token=os.getenv("FRIENDLI_TOKEN"), ) chat_completion = client.dedicated.chat.complete( model="YOUR_ENDPOINT_ID", messages=[ { "role": "user", "content": "Tell me how to make a delicious pancake" } ] ) print(chat_completion.choices[0].message.content) ``` {/* TODO: add image for sending APIs */} For a more detailed tutorial for your usage, please refer to our tutorial for using [HuggingFace models](/guides/dedicated_endpoints/huggingface_tutorial) and [W\&B models](/guides/dedicated_endpoints/wandb_tutorial). # Versions Source: https://friendli.ai/docs/guides/dedicated_endpoints/versions Learn how to use the endpoint versions feature to manage model deployment history. export const RoundedBorderBox = ({children, caption}) =>
{children} {caption &&

{caption}

}
; ## Rollout and Rollback Endpoints Without Downtime The versioning feature in Friendli Dedicated Endpoints helps you manage all changes to your deployed endpoints safely and transparently. When you update the configurationβ€”like changing the model, engine settings, or autoscalingβ€”a new version is created instead of replacing the current one. Each version captures a full snapshot of the deployment, including: * Model name and artifact source * Accelerator type and count * Autoscaling and engine settings * Metadata (creator, timestamps, comments) updated version configuration modal ## Why Use Versioning? * **Zero-Downtime Updates**: Safely apply changes while the current version continues to serve traffic. * **One-Click Rollbacks**: Instantly revert to a previous stable configuration if issues occur. * **Easy-to-Follow History**: Each version shows who made the change, when it was made, and what was changed. This makes audits and debugging easier. ## How to Use Versioning 1. **Initial Deployment**: Deploy your model for the first time via the platform or webhook. This creates version `v0`. Initial version (v0) running 2. **Apply Configuration Updates**: Changing any settingβ€”such as model, accelerator type, or autoscalingβ€”triggers a new version (`v1`, `v2`, etc.). 3. **Browse Version History**: View the full version list by clicking on the 'Versions' tab on the endpoint detail page. You'll see which version is current or in progress. Applying version v2 4. **View Configuration Details**: Click 'View configs' to see a version’s full settings. You can see the updates from the previous version marked with a blue badge for easy comparison. Viewing version v1 details ## How to Rollback to a Previous Version To rollback, select a previous version from the version history and click 'Rollback'. Rollback The system creates a new version (`vN+1`) using the selected version’s settings. This new version will become the current one, allowing you to quickly revert to a known good state. ### When an Update Fails Update failures can occur due to various reasons, such as: * **Configuration Errors**: Invalid settings or unsupported configurations can prevent the update. * **Resource Limitations**: Insufficient resources (like GPU availability) can block the update. * **Network Issues**: Temporary network problems can interrupt the update process. When you attempt to update an endpoint and the process fails, the system will not automatically apply the changes. Instead, it will log the error and allow you to troubleshoot the issue without affecting the live endpoint. This ensures that your endpoint remains operational without disruption. # Deploy with W&B Models Source: https://friendli.ai/docs/guides/dedicated_endpoints/wandb_tutorial Hands-on tutorial for launching and deploying LLMs using Friendli Dedicated Endpoints with Weights & Biases artifacts. export const RoundedBorderBox = ({children, caption}) =>
{children} {caption &&

{caption}

}
; #### Hands-on Tutorial Deploying `meta-llama-3-8b-instruct` LLM from W\&B using Friendli Dedicated Endpoints ## Introduction With Friendli Dedicated Endpoints, you can easily spin up scalable, secure, and highly available inference deployments, without the need for infrastructure expertise or significant capital expenditures. This tutorial is designed to guide you through the process of launching and deploying LLMs using Friendli Dedicated Endpoints. Through a series of step-by-step instructions and hands-on examples, you'll learn how to: * Select and deploy pre-trained LLMs from W\&B artifacts * Deploy and manage your models using the Friendli Engine * Monitor and optimize your inference deployments By the end of this tutorial, you'll be equipped with the knowledge and skills necessary to unlock the full potential of LLMs in your applications, products, and services. So, let's get started and explore the possibilities of Friendli Dedicated Endpoints! ## Prerequisites: * A Friendli Suite account with access to [Friendli Dedicated Endpoints](https://friendli.ai/suite) * A W\&B account with an api key (as an access token) ## Step 1: Create a new endpoint 1. Log in to your Friendli Suite account and navigate to the Friendli Dedicated Endpoints dashboard. 2. If not done already, start the free trial for Dedicated Endpoints. 3. Create a new project, then click on the 'New Endpoint' button. 4. [Integrate your W\&B account with an api key.](https://wandb.ai/authorize) 5. Fill in the basic information: * Endpoint name: Choose a unique name for your endpoint (e.g., "My New Endpoint"). 6. Select the model: W&B Model Select * Model Repository: Select "Weights & Biases" as the model provider. * Model ID: Enter `friendliai/model-registry/Meta-Llama-3-8B-Instruct:v0` as the model id. If you're using your own model, check [Format Requirements](/guides/dedicated_endpoints/faq#format-requirements) for requirements. 7. Select the instance: Select instance * Instance configuration: Choose a suitable instance type based on your performance requirements. We suggest 1x A100 80G for most models. In some cases where the model's size is big, some options may be restricted as they are guaranteed to not run due to insufficient VRAM. Low Memory Warning 8. Edit the configurations: Autoscaling Config
Engine Config * Autoscaling: By default, the autoscaling ranges from 0 to 2 replicas. This means that the deployment will sleep when it's not being used, which reduces cost. * Advanced configuration: Some LLM options including the maximum processing batch size and token configurations can be updated. For this tutorial, we'll leave it as-is. 9. Click 'Create' to create a new endpoint. ## Step 2: Test the endpoint 1. Wait for the deployment to be created and initialized. This may take a few minutes. You may check the status by the indicator under the endpoint's name. Initializing Endpoint 2. In the "Playground" section, you may enter a sample input prompt (e.g., "What is the capital of France?"). 3. Click on the right arrow button to send the inference request. Playground 4. If you are an enterprise user, you can use the "Metrics" and "Logs" section to monitor the endpoint. Metrics
Logs

## Step 3: Send requests using cURL or Python

1. As instructed in our [API docs](/openapi/serverless/chat-completions), you can send inference requests with the following code:

```sh cURL
curl -X POST https://api.friendli.ai/dedicated/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $FRIENDLI_TOKEN" \
  -d '{
    "model": "(endpoint-id)",
    "messages": [
      {
        "role": "user",
        "content": "What is the capital of France?"
      }
    ],
    "max_tokens": 200,
    "top_k": 1
  }'
```

```python Python
import requests
import json
import os

url = 'https://api.friendli.ai/dedicated/v1/chat/completions'

payload = json.dumps({
  "model": f"{os.environ['ENDPOINT_ID']}",
  "messages": [
    {
      "role": "user",
      "content": "What is the capital of France?"
    }
  ],
  "max_tokens": 200,
  "top_k": 1
})
headers = {
  "Content-Type": "application/json",
  "Accept": "application/json",
  "Authorization": f"Bearer {os.environ['FRIENDLI_TOKEN']}"
}

response = requests.request("POST", url, headers=headers, data=payload)
print(response.text)
```

## Step 4: Update the endpoint

1. You can update the model and change almost any setting by clicking the update button.

# Visualizing Ideas with Friendli: A Guide to Image Generation

Source: https://friendli.ai/docs/guides/image-generation

Dive into the characteristics of popular Image Generation Models available on Friendli Dedicated Endpoints.

Friendli provides powerful Image Generation capabilities, allowing users to transform text prompts into high-quality visuals with ease. This guide explores how to generate images using Friendli Dedicated Endpoints, including code examples to help you make the most of these powerful tools.

## Model Supports

We support the **FLUX.1-dev** and **FLUX.1-schnell** models. Their fine-tuned and quantized variants are also supported, and adapters are available as well. For a detailed list of models, refer to the models page on our website.

* [FLUX.1-dev](https://friendli.ai/models/search?baseModel=black-forest-labs/FLUX.1-dev)
* [FLUX.1-schnell](https://friendli.ai/models/search?baseModel=black-forest-labs/FLUX.1-schnell)
* [See all image generation models](https://friendli.ai/models/search?input=TEXT\&output=IMAGE)

## API Usage

For full API specifications, refer to:

* [Dedicated API Reference](/openapi/dedicated/inference/image-generations)
* [Container API Reference](/openapi/container/image-generations)

## Examples

```python Python
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.friendli.ai/dedicated/v1",
    api_key=os.environ.get("FRIENDLI_TOKEN"),
)

images = client.images.generate(
    # Replace YOUR_ENDPOINT_ID with the ID of your endpoint, e.g. "zbimjgovmlcb"
    model="YOUR_ENDPOINT_ID",
    prompt="An orange Lamborghini driving down a hill road at night with a beautiful ocean view in the background.",
    extra_body={
        "num_inference_steps": 10,
        "guidance_scale": 3.5
    }
)

print(images.data[0].url)
```

```sh cURL
curl -L -X POST "https://api.friendli.ai/dedicated/v1/images/generations" \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer $FRIENDLI_TOKEN" \
  --data-raw '{
    "model": "YOUR_ENDPOINT_ID",
    "prompt": "An orange Lamborghini driving down a hill road at night with a beautiful ocean view in the background.",
    "num_inference_steps": 10,
    "guidance_scale": 3.5
  }'
```

`guidance_scale` is required when using Friendli Container. For more details, please refer to the [Container API Reference](/openapi/container/image-generations).
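Since the response contains a URL to the generated image (as in the Python example above), you can fetch the file and save it locally with any HTTP client. The snippet below is a minimal sketch, not an official utility: it assumes the same endpoint setup as the examples above and that the returned URL is directly downloadable; the output filename, prompt, and the use of `requests` are illustrative choices.

```python
# Minimal sketch: generate an image, then download it from the returned URL.
import os

import requests
from openai import OpenAI

client = OpenAI(
    base_url="https://api.friendli.ai/dedicated/v1",
    api_key=os.environ.get("FRIENDLI_TOKEN"),
)

images = client.images.generate(
    model="YOUR_ENDPOINT_ID",  # replace with your endpoint ID
    prompt="A watercolor painting of a lighthouse at sunrise.",
    extra_body={"num_inference_steps": 10, "guidance_scale": 3.5},
)

# The API returns a URL to the generated image; download and save it.
image_url = images.data[0].url
image_bytes = requests.get(image_url, timeout=60).content

with open("generated.png", "wb") as f:
    f.write(image_bytes)

print(f"Saved image from {image_url} to generated.png")
```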
# Unleash the Power of Generative AI with Friendli Suite: Your End-to-End Solution Source: https://friendli.ai/docs/guides/introduction Friendli Suite empowers you to explore generative AI with three solutions: Serverless Endpoints for quick access to open-source models, Dedicated Endpoints for deploying custom models on dedicated GPUs, and Containers for secure, on-premise control. Powered by the optimized Friendli Engine, each option ensures fast, cost-efficient AI serving for text, code, and image generation. export const ServerlessIcon = () => { return ; }; export const ContainerIcon = () => { return ; }; export const DedicatedIcon = () => { return ; }; Welcome to the exciting world of generative AI, where words dance into text, code sparks creation, and images bloom from the imagination. Friendli Suite empowers you to tap into this potential with three distinct offerings, catering to your specific needs and technical expertise. Whether you're a seasoned developer or a curious newcomer, Friendli Suite provides the perfect platform to bring your AI-powered visions to life. ## What is Generative AI Serving? Before diving into Friendli Suite, let's get familiar with the magic behind the curtain. Generative AI models, including large language models (LLMs), learn from massive datasets of text and code, mimicking human creativity and knowledge. However, utilizing these models in real-world applications requires generative AI serving. Inference serving acts as the bridge between the model and your desired outputs, efficiently processing your prompts and queries to generate text, code, images, and more. An efficient inference serving is not a process that can be achieved easily. During the process, one needs to actively optimize the various aspects of the system to optimize how the machine can handle user requests efficiently on the limited amount of resources. Inference serving without optimizations can result in extremely high latencies or unnecessarily over-excessive usage of many expensive GPUs. In order to offload such optimization hassles from your concerns, the Friendli Engine steps in to enable fast and cost-efficient inference serving for your generative-AI models. ## Friendli Suite: Your Flexible Gateway to Generative AI Mastery Now, let's meet the three members of Friendli Suite, each unlocking different doors to AI innovation: ### 1. [Friendli Dedicated Endpoints](/guides/dedicated_endpoints/introduction): Power and Customization at Your Fingertips Ready to take the reins and unleash the full potential of your own models? Friendli Dedicated Endpoints is for you. This service provides dedicated GPU resources, letting you upload and run your custom generative AI models. Reserve the exact GPU you need and enjoy fine-grained control over your model settings. Pay-per-second billing makes it perfect for regular or resource-intensive workloads. ### 2. [Friendli Serverless Endpoints](/guides/serverless_endpoints/introduction): Your Quickest Path to Creativity Imagine a playground for your AI dreams. Friendli Serverless Endpoints is just that - a simple, click-and-play interface that lets you access popular general-purpose open-source models like Llama 3.1 without any heavy lifting. Choose your model, enter your prompt, and marvel at the generated text, or code outputs. With token-based (or time-based) billing, this is ideal for exploration and experimentation. You can think of it as an AI sampler to try out the abilities of general-purpose AI models. ### 3. 
[Friendli Container](/guides/container/introduction): On-Premise Control for the AI Purist Do you prefer the comfort and security of your own data center? Friendli Container is the solution. We provide the Friendli Engine within Docker containers that can be installed on your on-premise GPUs so your data stays within your own secure cluster. This option offers maximum control and security, ideal for advanced users or those with specific data privacy requirements. ## [The Friendli Engine](https://friendli.ai/solutions/engine): The Powerhouse Behind the Suite At the heart of each Friendli Suite offering lies the Friendli Engine, a patented GPU-optimized serving engine. This technological marvel is what enables Friendli Suite's superior performance and cost-effectiveness, featuring innovations like continuous batching (iteration batching) that significantly improve resource utilization compared to traditional LLM serving solutions. ## Which Friendli solution is Right for You? Friendli Suite provides flexibility to match your needs: * Level up with your own models: Opt for [Friendli Dedicated Endpoints](/guides/dedicated_endpoints/introduction) for customized models on autopilot. * Embrace on-premise control: Utilize [Friendli Container](/guides/container/introduction) for maximum control and efficiency on your GPUs. * Start quick and simple: Choose [Friendli Serverless Endpoints](/guides/serverless_endpoints/introduction) for exploration and quick projects. No matter your skill level or preferences, Friendli Suite has the perfect option to empower your generative AI journey. Dive in, explore, and unleash the endless possibilities of AI creativity! Remember to explore the resources at [https://friendli.ai/blog](https://friendli.ai/blog) for deeper insights into generative AI and Friendli Suite capabilities. ## Popular Guides Check out popular how to guides and dive into the Friendli Suite. } href="/guides/dedicated_endpoints/quickstart"> Deploy your models with Friendli Dedicated Endpoints, and enjoy the flexibility of customizing your own models. Use the Friendli Engine to generate images, text, and more with extraordinary speed and efficiency. } href="/guides/serverless_endpoints/quickstart"> Only a few clicks are required for you to access general-purpose open-source models like Llama 3.1. Enjoy the power of generative AI without any hassle at a blazing speed. } href="/guides/container/quickstart"> Opt for maximum control with Friendli Container, offering the Friendli Engine in Docker containers installable on your on-premise GPUs, ensuring your data remains within your cluster. # Friendli Documentation Source: https://friendli.ai/docs/guides/overview Get started with FriendliAI products and explore APIs. export const ToolIcon = () => { return ; }; export const ChatIcon = () => { return ; }; export const ServerlessIcon = () => { return ; }; export const ContainerIcon = () => { return ; }; export const DedicatedIcon = () => { return ; }; ## QuickStarts } href="/guides/dedicated_endpoints/quickstart"> Deploy your models with Friendli Dedicated Endpoints, and enjoy the flexibility of customizing your own models. Use the Friendli Engine to generate images, text, and more with extraordinary speed and efficiency. } href="/guides/serverless_endpoints/quickstart"> Only a few clicks are required for you to access general-purpose open-source models like Llama 3.1. Enjoy the power of generative AI without any hassle at a blazing speed. 
} href="/guides/container/quickstart"> Opt for maximum control with Friendli Container, offering the Friendli Engine in Docker containers installable on your on-premise GPUs, ensuring your data remains within your cluster. ## SDK The official Python SDK for Friendli provides a powerful and flexible way to interact with Friendli's AI services, including Serverless Endpoints, Dedicated Endpoints, and Friendli Container. Friendli offers tools for developers to easily integrate AI into various applications. Our solutions support popular frameworks, enabling AI integration from simple chatbots to complex systems. ## Explore APIs } href="/openapi/serverless/chat-completions"> Discover how to generate text through interactive conversations. } href="/openapi/serverless/tool-assisted-chat-completions"> Learn how to enhance responses with tool assisted chat completions using built-in tools. # Personal Access Tokens Source: https://friendli.ai/docs/guides/personal_access_tokens Learn how to manage credentials in Friendli Suite, including using personal access tokens for authentication and authorization. export const RoundedBorderBox = ({children, caption}) =>
{children} {caption &&

{caption}

}
; Effective management of credentials is crucial when using Friendli Suite and its endpoints for authentication and authorization purposes. This guide outlines when the credentials are required and provides instructions on how to manage them. A Friendli Token serves as an alternative method of authorization to signing in with an email and a password. You can generate a new Friendli Token through the [Friendli Suite](https://friendli.ai/suite), at your **'Personal settings'** page. 1. Go to the [Friendli Suite](https://friendli.ai/suite) and sign in with your account. 2. Click the profile icon at the top-right corner of the page. 3. Click **'Personal settings'** menu. Personal settings 4. Go to the **'Tokens'** tab on the navigation bar. 5. Create a new Friendli Token by clicking the **'Create token'** button. 6. Copy the token and save it in a safe place. You will not be able to see this token again once the page is refreshed. Tokens # Advanced Applications on Friendli Serverless Endpoints (Coming Soon!) Source: https://friendli.ai/docs/guides/serverless_endpoints/applications Stay tuned for detailed guides on how to perform tasks like Retrieval-Augmented Generation (RAG), Conditional Image Generation, Fine-tuning Custom Models. Friendli Serverless Endpoints empowers you to unleash the full potential of generative AI models with ease. While we've already covered some exciting applications through text and image generation, we're eager to offer even more possibilities for users like you! This document serves as a preview for upcoming content showcasing advanced applications of Friendli Serverless Endpoints. Stay tuned for detailed guides on how to perform tasks like: * **Retrieval-Augmented Generation (RAG)**: Combine the power of search and generation to create highly relevant and informative text outputs based on real-world data. * **Conditional Image Generation**: Fine-tune your image creations by using specific conditions or attributes as additional prompts, pushing the boundaries of creative control. * **Fine-tuning Custom Models**: Tailor existing models to your specific needs and data for a truly personalized generative AI experience. This is just a glimpse of the advanced applications on the horizon! We're actively working on bringing you comprehensive guides that explain the process, settings, and potential benefits of each approach. In the meantime, feel free to explore the current capabilities of Friendli Serverless Endpoints with text generation. Experiment with different models, settings, and prompts to discover the vast creative and informative potential at your fingertips. We're committed to evolving Friendli Serverless Endpoints into a one-stop platform for all your generative AI needs. Stay tuned for updates and get ready to dive into the world of advanced applications soon! #### For any questions or feedback regarding these upcoming features, please don't hesitate to [reach out to us](https://friendli.ai/contact)! We appreciate your understanding and continuous support as we push the boundaries of generative AI accessibility. # Function Calling Source: https://friendli.ai/docs/guides/serverless_endpoints/function-calling Learn how to do OpenAI compatible function calling on Friendli Serverless Endpoints. Function calling is a powerful feature that connects large language models (LLMs) with external systems to maximize the model’s utility. 
It goes beyond simply relying on model’s learned knowledge and provides the possibility of utilizing real-time data and performing complex tasks. Function calling ## Simple Example In the example below, which consists of 1 to 5 steps, we define a `get_weather` function that retrieves weather information, ask a question that prompts the model to use the tool, and execute the tool to execute the final response. Open In Colab Define a function that the model can call (`get_weather`) with a JSON Schema.\ The function requires the following parameters: * `location`: The location to look up weather information for. * `date`: The date to look up weather information for. This definition is included in the `tools` array and passed to the model. ```python tools = [ { "type": "function", "function": { "name": "get_weather", "parameters": { "type": "object", "properties": { "location": {"type": "string"}, "date": {"type": "string", "format": "date"} }, }, }, } ] ``` When a user asks a question, this request is passed to the model as a `messages` array.\ For example, the request "What's the weather like in Paris today?" would be passed as: ```python from datetime import datetime today = datetime.now() messages = [ {"role": "system", "content": f"You are a helpful assistant. today is {today}."}, {"role": "user", "content": "What's the weather like in Paris today?"} ] ``` Call the model using the `tools` and `messages` defined above. ```python {13-14} import os from openai import OpenAI token = os.getenv("FRIENDLI_TOKEN") or "" client = OpenAI( base_url = "https://api.friendli.ai/serverless/v1", api_key = token ) completion = client.chat.completions.create( model="meta-llama-3.1-8b-instruct", messages=messages, tools=tools, ) print(completion.choices[0].message.tool_calls) ``` The API caller runs the tool based on the function call information of the model.\ For example, the `get_weather` function is executed as follows: ```python import json import random def get_weather(location: str, date: str): temperature = random.randint(60, 80) return {"temperature": temperature, "forecast": "sunny"} tool_call = completion.choices[0].message.tool_calls[0] tool_response = locals()[tool_call.function.name](**json.loads(tool_call.function.arguments)) print(tool_response) ``` ```python Result: {'temperature': 65, 'forecast': 'sunny'} ``` Add the tool's response to the `messages` array and pass it back to the model. 1. Append tool call information 2. Append the tool's execution result This ensures the model has all the necessary information to generate a response. ```python model_response = completion.choices[0].message # Append the response from the model messages.append( { "role": model_response.role, "tool_calls": [ tool_call.model_dump() for tool_call in model_response.tool_calls ] } ) # Append the response from the tool messages.append( { "role": "tool", "content": json.dumps(tool_response), "tool_call_id": tool_call.id } ) print(json.dumps(messages, indent=2)) ``` The model generates the final response based on the tool's output: ```python next_completion = client.chat.completions.create( model="meta-llama-3.1-8b-instruct", messages=messages, tools=tools ) print(next_completion.choices[0].message.content) ``` ```text Final output: According to the forecast, it's going to be a sunny day in Paris with a temperature of 65 degrees. ``` ## Parameters To use function calling, modify the `tool_choice`, `tools`, and `parallel_tool_calls` parameters. 
| Parameter | Description | default | | --------------------- | ---------------------------------------------------------------------------------------------------------------- | ------- | | `tool_choice` | Specifies how the model should choose tools. Has four options: "none", "auto", "required", or named tool choice. | `auto` | | `tools` | The list of tool objects that define the functions the model can call. | - | | `parallel_tool_calls` | Boolean value (`True` or `False`) specifying whether to make tool calls in parallel. | `True` | ### `tool_choice` options The model will automatically choose whether to call a function and which function to call by default.\ However, you can use the `tool_choice` parameter to tell the model to use a function. * `none`: Disables the use of tools. * `auto`: Enables the model to decide whether to use tools and which ones to use. * `required`: Forces the model to use a tool, but the model chooses which one. * Named tool choice: Forces the model to use a specific tool. It must be in the following format: ```json { "type": "function", "function": { "name": "get_current_weather" // The function name you want to specify } } ``` ## References Building an AI Agent for Google Calendar ([Part 1](https://friendli.ai/blog/ai-agent-google-calendar) / [Part 2](https://friendli.ai/blog/calendar-agent-vercel))\ Friendli Tools Blog Series ([Part 1](https://friendli.ai/blog/llm-function-calling) / [Part 2](https://friendli.ai/blog/ai-agents-function-calling) / [Part 3](https://friendli.ai/blog/friendli-tools-llama3-outperforms-gpt4o)) # Integrations Source: https://friendli.ai/docs/guides/serverless_endpoints/integrations Friendli integrates with LangChain, LiteLLM, LlamaIndex, and MongoDB to streamline GenAI application deployment. LangChain and LlamaIndex enable tool calling AI agents and Retrieval-Augmented Generation (RAG), while MongoDB provides memory via vector databases, and LiteLLM boosts performance through load balancing. [Friendli](/guides/introduction) integrates with LangChain, LiteLLM, LlamaIndex, and MongoDB to streamline the deployment of compound GenAI applications. The integration of LangChain and LlamaIndex facilitates tool calling AI agents or Retrieval-Augmented Generation (RAG). MongoDB supports these agentic systems by providing memory with vector databases, while LiteLLM enhances performance through load balancing and evaluation. Get a quick overview of [Friendli Serverless Endpoints'](/guides/serverless_endpoints/introduction) integrations and learn more through the linked resources. ## LangChain [LangChain](https://python.langchain.com/v0.2/docs/introduction) is a framework for developing applications powered by large language models (LLMs). Utilize [Friendli Serverless Endpoints](/guides/serverless_endpoints/quickstart) for LLM inferencing in LangChain by preparing a [Friendli Token](/guides/personal_access_tokens). To install the required packages, run: ``` pip install -qU langchain-openai langchain ``` Here's a streaming chat sample code to get started with LangChain and FriendliAI: ```python import os from langchain_openai import ChatOpenAI llm = ChatOpenAI( model="meta-llama-3.3-70b-instruct", base_url="https://api.friendli.ai/serverless/v1", api_key=os.environ["FRIENDLI_TOKEN"], ) result = llm.invoke("Tell me a joke.") print(result.content) ``` Output: ``` Here's one: Why couldn't the bicycle stand up by itself? (Wait for it...) Because it was two-tired! Hope that brought a smile to your face! 
``` #### Resources * [FriendliAI Blog Post on Building RAG Chatbots with Friendli, MongoDB Atlas, and LangChain](https://friendli.ai/blog/rag-chatbot-friendli-mongodb-atlas-langchain) * [FriendliAI Blog Post on Example RAG Application with Friendli and LangChain](https://friendli.ai/blog/chatdocs-rag-friendli-langchain) * [FriendliAI Blog Post on LangChain Integration with Friendli Dedicated Endpoints](https://friendli.ai/blog/langchain-integration-friendli-engine) * [LangChain's Documentation on Friendli](https://python.langchain.com/v0.1/docs/integrations/llms/friendli) ## MongoDB [MongoDB Atlas](https://www.mongodb.com/docs/atlas/getting-started) is a developer data platform offering vector stores and searches for compound GenAI applications, compatible through both LangChain and LlamaIndex. Utilize [Friendli Serverless Endpoints](/guides/serverless_endpoints/quickstart) for LLM inferencing in MongoDB by preparing a [Friendli Token](/guides/personal_access_tokens). To install the required packages, run: ``` pip install pymongo friendli-client langchain langchain-mongodb langchain-community pypdf langchain-openai tiktoken ``` Here's a RAG sample code to get started with MongoDB and FriendliAI using LangChain: ```python # Note: You can find detailed explanation on this code in the blog post below. from pymongo import MongoClient from langchain_mongodb.vectorstores import MongoDBAtlasVectorSearch from langchain_community.chat_models.friendli import ChatFriendli from langchain_community.document_loaders import PyPDFLoader from langchain_openai import OpenAIEmbeddings from langchain_text_splitters import RecursiveCharacterTextSplitter from langchain_core.output_parsers import StrOutputParser from langchain_core.prompts import PromptTemplate from langchain_core.runnables import RunnablePassthrough # Fill in your Cluster URI here. MONGODB_ATLAS_CLUSTER_URI = "{YOUR CLUSTER URI}" client = MongoClient(MONGODB_ATLAS_CLUSTER_URI) # Fill in your DB information here. DB_NAME = "{YOUR DB NAME}" COLLECTION_NAME = "{YOUR COLLECTION NAME}" ATLAS_VECTOR_SEARCH_INDEX_NAME = "{YOUR INDEX NAME}" MONGODB_COLLECTION = client[DB_NAME][COLLECTION_NAME] # Fill in your PDF link here. loader = PyPDFLoader("{YOUR PDF DOCUMENT LINK}") data = loader.load() text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150) docs = text_splitter.split_documents(data) vector_store = MongoDBAtlasVectorSearch.from_documents( documents=docs, embedding=OpenAIEmbeddings(disallowed_special=()), collection=MONGODB_COLLECTION, index_name=ATLAS_VECTOR_SEARCH_INDEX_NAME, ) retriever = vector_store.as_retriever() llm = ChatFriendli(model="meta-llama-3.3-70b-instruct") prompt = PromptTemplate.from_template( """ Use the following pieces of context to answer the question. {context} Question: {question} Helpful Answer: """ ) def format_docs(docs): return "\n\n".join(doc.page_content for doc in docs) rag_chain = ( {"context": retriever | format_docs, "question": RunnablePassthrough()} | prompt | llm | StrOutputParser() ) # Input your user query here. 
rag_chain.invoke("{Sample Query Texts}") ``` #### Resources * [FriendliAI Blog Post on Building RAG Chatbots with Friendli, MongoDB Atlas, and LangChain](https://friendli.ai/blog/rag-chatbot-friendli-mongodb-atlas-langchain) * [FriendliAI Blog Post on RAG with FriendliAI and MongoDB](https://friendli.ai/blog/rag-mongodb-friendli) * [MongoDB's Partner Ecosystem Page on FriendliAI](https://cloud.mongodb.com/ecosystem/friendliai) ## LlamaIndex [LlamaIndex](https://docs.llamaindex.ai/en/stable) is a data framework designed to connect LLMs to custom data sources. Utilize [Friendli Serverless Endpoints](/guides/serverless_endpoints/quickstart) for LLM inferencing in LlamaIndex by preparing a [Friendli Token](/guides/personal_access_tokens). Additionally, an [OpenAI API key](https://platform.openai.com/docs/api-reference/authentication) is required to access the [OpenAI embedding API](https://platform.openai.com/docs/api-reference/embeddings). To install the required packages, run: ``` pip install llama-index-llms-friendli llama-index ``` Here's a RAG streaming chat sample code to get started with LlamaIndex and FriendliAI: ```python import os from llama_index.llms.friendli import Friendli from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex os.environ['FRIENDLI_TOKEN'] = "YOUR_FRIENDLI_TOKEN" Settings.llm = Friendli(model="meta-llama-3.3-70b-instruct") # Assuming a directory named 'data_folder' stores your pdf file. documents = SimpleDirectoryReader('data_folder').load_data() index = VectorStoreIndex.from_documents(documents) query_engine = index.as_query_engine(streaming=True) # Input your user query here. response = query_engine.query("{Sample Query Texts}") response.print_response_stream() ``` #### Resources * [FriendliAI Blog Post on Building RAG Applications with Friendli and LlamaIndex](https://friendli.ai/blog/llamaindex-rag-app-friendli-engine) * [Google Colab Notebook on Two-Stage Retrieval with LlamaIndex Friendli Integration](https://colab.research.google.com/drive/1_-1aITFQh0UUbRzaRM8FRid_wZHrfIjX?usp=sharing) * [LlamaIndex's Documentation on Friendli](https://docs.llamaindex.ai/en/stable/examples/llm/friendli) ## LiteLLM [LiteLLM](https://docs.litellm.ai/docs) is a versatile platform offering access to 100+ LLMs in the [OpenAI API format](https://platform.openai.com/docs/api-reference/chat/create). Utilize [Friendli Serverless Endpoints](/guides/serverless_endpoints/quickstart) for LLM inferencing in LiteLLM by preparing a [Friendli Token](/guides/personal_access_tokens). To install the required package, run: ``` pip install litellm ``` Here's a streaming chat sample code to get started with LiteLLM and FriendliAI: ```python from litellm import completion response = completion( # Simply change the model ID to use different LLM inference models & engines. model="friendliai/meta-llama-3.3-70b-instruct", messages=[ {"role": "user", "content": "Hello from LiteLLM"} ], stream=True, ) for chunk in response: print(chunk.choices[0].delta.content, end="", flush=True) ``` Output: ``` Hello from an AI! It's great to meet you, LiteLLM! How's your day going so far? 
``` #### Resources * [FriendliAI Blog Post on LiteLLM Friendli Integration using LiteLLM's Budget Manager](https://friendli.ai/blog/litellm-friendli-integration) * [LiteLLM's Supported Models & Providers Documentation Page on FriendliAI](https://docs.litellm.ai/docs/providers/friendliai) # Introducing Friendli Serverless Endpoints Source: https://friendli.ai/docs/guides/serverless_endpoints/introduction Guide for Friendli Serverless Endpoints, allowing you to seamlessly integrate state-of-the-art AI models into your workflows, regardless of your technical expertise. {/* Welcome to the exciting world of generative AI, where words dance into text, code sparks creation, and images bloom from the imagination. FriendliAI makes this world readily accessible with Friendli Serverless Endpoints, a revolutionary service that puts the power of cutting-edge generative models right at your fingertips. */} This tutorial will guide you through Friendli Serverless Endpoints, allowing you to seamlessly integrate state-of-the-art AI models into your workflows, regardless of your technical expertise. Whether you're a seasoned developer or a curious newcomer, get ready to unlock the limitless potential of generative AI! ## What are Friendli Serverless Endpoints? Imagine there is a powerful racecar (a generative AI model) that needs much maintenance and tuning (infrastructure and technical know-how). Friendli Serverless Endpoints is like a rental service, taking care of the hassle so you can just drive! It provides a simple, serverless interface that connects you to Friendli Engine, a high-performance, cost-effective inference serving engine optimized for generative AI models. With Friendli Serverless Endpoints, you can: * **Access popular open-source models**: Get started with pre-loaded models like Llama 3.1. No need to worry about downloading or optimizing them. * **Build your own workflows**: Integrate these models into your applications with just a few lines of code. Generate creative text formats, code, musical pieces, email, letters, etc. and create stunning images with ease. * **Pay per token (or time), not per GPU**: Unlike traditional solutions that require whole GPU instances, Friendli Serverless Endpoints bills you only for the resources your models actually use. This translates to significant cost savings and efficient resource utilization. * **Focus on what matters**: Forget about infrastructure setup and GPU optimization. Friendli Serverless Endpoints handles the heavy lifting, freeing you to focus on your creative vision and application development. ## Getting Started with Friendli Serverless Endpoints: 1. **Sign up for a free account**: Visit [Friendli Suite](https://friendli.ai/suite) and create your Friendli Suite account. 2. **Choose your model**: Select the pre-loaded model you want to experiment with, such as Llama 3.1 for text generation. 3. **Connect to the endpoint**: Friendli Serverless Endpoints provides simple API documentation for a variety of programming languages. Follow the instructions to integrate the endpoint into your code. 4. **Send your input**: Supply the model with your input text, code, or image prompt. 5. **Witness the magic**: Friendli Serverless Endpoints will utilize Friendli Engine to process your input and generate the desired output, be it text, code, or an image. You can then integrate the generated results into your application or simply marvel at the AI's creativity! 
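To make steps 3 through 5 above concrete, here is a minimal sketch of connecting to a serverless endpoint and sending a prompt from Python. It assumes your Friendli Token is stored in the `FRIENDLI_TOKEN` environment variable and uses the OpenAI-compatible interface covered in the OpenAI Compatibility guide; the model name and prompt are just examples.

```python
# Minimal sketch: connect to Friendli Serverless Endpoints and send a prompt.
# Assumes FRIENDLI_TOKEN is set; the model name below is an example.
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://api.friendli.ai/serverless/v1",
    api_key=os.getenv("FRIENDLI_TOKEN"),
)

completion = client.chat.completions.create(
    model="meta-llama-3.3-70b-instruct",
    messages=[{"role": "user", "content": "Write a haiku about serverless inference."}],
)

# The generated text is ready to drop into your application.
print(completion.choices[0].message.content)
```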
## Beyond the Basics: As you gain confidence, Friendli Serverless Endpoints offers even more: * **Granular control**: Optimize resource usage at the per-token or per-step level for each model, ensuring efficient resource allocation for your specific needs. {/* - **Customization**: Build your own custom generative models and seamlessly integrate them into your workflows using Friendli Serverless Endpoints. */} * **Scalability**: As your needs grow, easily scale your resources without worrying about complex infrastructure management. Friendli Serverless Endpoints is the perfect springboard for your generative AI journey. Whether you're a experienced developer seeking to integrate AI into your projects or a curious explorer yearning to unleash your creative potential, FriendliAI provides the tools and resources you need to succeed. So, start your engines, take the wheel, and explore the vast possibilities of generative AI with Friendli Serverless Endpoints! ## Additional Resources: * FriendliAI website: [https://friendli.ai](https://friendli.ai) * FriendliAI blog: [https://friendli.ai/blog](https://friendli.ai/blog) # OpenAI Compatibility Source: https://friendli.ai/docs/guides/serverless_endpoints/openai-compatibility Friendli Serverless Endpoints is compatible with the OpenAI API standard through the Python API Libraries and the Node API Libraries. Friendli Dedicated Endpoints and Friendli Container are also OpenAI API compatible. Friendli Serverless Endpoints is compatible with the [OpenAI API standard](https://platform.openai.com/docs/api-reference/chat) through the [Python API Libraries](https://pypi.org/project/openai) and the [Node API Libraries](https://www.npmjs.com/package/openai). [Friendli Dedicated Endpoints](https://friendli.ai/products/dedicated-endpoints) and [Friendli Container](https://friendli.ai/products/container) are also OpenAI API compatible. Through this guide, you will learn how to: * Send inference requests to Friendli Serverless Endpoints in Python and Node.js. * Use chat models supported by Friendli Endpoint. * Generate streaming chat responses. ## Model Supports * `K-intelligence/Midm-2.0-Base-Instruct` * `K-intelligence/Midm-2.0-Mini-Instruct` * `deepseek-ai/DeepSeek-R1` * `deepseek-ai/DeepSeek-R1-0528` * `meta-llama/Llama-4-Maverick-17B-128E-Instruct` * `meta-llama/Llama-4-Scout-17B-16E-Instruct` * `meta-llama/Llama-3.3-70B-Instruct` * `meta-llama/Llama-3.1-8B-Instruct` * `Qwen/Qwen3-235B-A22B` * `Qwen/Qwen3-30B-A3B` * `Qwen/Qwen3-32B` * `google/gemma-3-27b-it` * `mistralai/Mistral-Small-3.1-24B-Instruct-2503` * `mistralai/Devstral-Small-2505` * `mistralai/Magistral-Small-2506`
* [and more!](https://friendli.ai/models) You can find more information about each text generation model [here](https://friendli.ai/models). Log in to the [Friendli Suite](https://friendli.ai/login) to create your Friendli Token for this quick tutorial. We will use the *Llama 3.3 70B Instruct* model as an example in this tutorial. ## Quick Guide If you want to integrate Friendli Serverless Endpoints to your application that had been using OpenAI, you can simply switch the following components: **API key**, **model**, and the **base url**. The **API key** is equivalent to your Friendli Token, which you can create [here](https://friendli.ai/suite/setting/tokens). After choosing your generative text model, you can find the **model id** by pressing the 'More info' icon, or by using the ids listed in the Model Supports section above. Last but not least, change the **base url** to [https://api.friendli.ai/serverless/v1](https://api.friendli.ai/serverless/v1) and you are all set! ## Python This example demonstrates how you can use the OpenAI Python SDK to generate a response. #### Default Example Code ```python from openai import OpenAI import os client = OpenAI( api_key=os.getenv("FRIENDLI_TOKEN"), base_url="https://api.friendli.ai/serverless/v1", ) completion = client.chat.completions.create( model="meta-llama-3.3-70b-instruct", messages=[ {"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Tell me a funny joke."}, ], stream=False, ) print(completion.choices[0].message.content) ``` #### Streaming Example Code ```python from openai import OpenAI import os client = OpenAI( api_key=os.getenv("FRIENDLI_TOKEN"), base_url="https://api.friendli.ai/serverless/v1", ) stream = client.chat.completions.create( model="meta-llama-3.3-70b-instruct", messages=[ {"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Tell me a funny joke."}, ], stream=True, ) for chunk in stream: print(chunk.choices[0].delta.content or "", end="", flush=True) ``` ## Node.js This example demonstrates how you can use the OpenAI Node.js SDK to generate a response. #### Default Example Code ```javascript import OpenAI from "openai"; const client = new OpenAI({ apiKey: process.env.FRIENDLI_TOKEN, baseURL: "https://api.friendli.ai/serverless/v1", }); async function main() { const completion = await client.chat.completions.create({ model: "deepseek-r1", messages: [ { role: "system", content: "You are a helpful assistant." }, { role: "user", content: "Tell me a funny joke." }, ], }); console.log(completion.choices[0].message.content); } main().catch(console.error); ``` #### Streaming Example Code ```javascript import OpenAI from "openai"; const client = new OpenAI({ apiKey: process.env.FRIENDLI_TOKEN, baseURL: "https://api.friendli.ai/serverless/v1", }); async function main() { const stream = await client.chat.completions.create({ model: "meta-llama-3.3-70b-instruct", messages: [ { role: "system", content: "You are a helpful assistant." }, { role: "user", content: "Tell me a funny joke." }, ], stream: true, }); for await (const chunk of stream) { process.stdout.write(chunk.choices[0].delta?.content || ""); } } main().catch(console.error); ``` ## Results ``` Here's one: Why couldn't the bicycle stand up by itself? (wait for it...) Because it was two-tired! Hope that brought a smile to your face! ``` # Pricing Source: https://friendli.ai/docs/guides/serverless_endpoints/pricing Friendli Serverless Endpoints offer a range of models tailored to various tasks. 
Friendli Serverless Endpoints offer a flexible, scalable inference solution powered by a wide range of models. You can unlock access to more models and features based on your **usage tier**.

**Important Update**: Effective June 20, 2025, we've introduced new billing options and plan changes:

* Models are now billed **Token-Based** or **Time-Based**, depending on the model.
* The Basic plan has been renamed to the **Starter plan**.
* Existing users can continue using their current serverless models without interruption.

## Usage Tiers

Usage tiers define your limits on usage and scale **monthly** based on your payment history.

| Tiers  | Usage Limits     | Rate Limit (RPM) | Output Token Length | Qualifications                                             |
| ------ | ---------------- | ---------------- | ------------------- | ---------------------------------------------------------- |
| Tier 1 | \$50 / month     | 100              | 2K                  | Valid payment method added                                  |
| Tier 2 | \$500 / month    | 1,000            | 4K                  | Total historical spend of \$50+                             |
| Tier 3 | \$5,000 / month  | 5,000            | 8K                  | Total historical spend of \$500+                            |
| Tier 4 | \$50,000 / month | 10,000           | 16K                 | Total historical spend of \$5,000+                          |
| Tier 5 | Custom           | Custom           | Custom              | Contact [support@friendli.ai](mailto:support@friendli.ai)  |

**Qualifications** only apply to usage within the Serverless Endpoints plan.

'Output Token Length' is how much the model can write in response. It's different from 'Context Length', which is the sum of the input and output tokens.

## Billing Methods

Friendli Serverless Endpoints use two different billing methods, Token-Based or Time-Based, depending on the model type.

### Token-Based Billing

**Pinned models** (such as DeepSeek, Llama, and other popular models) are charged on a per-token basis. These models are billed based on the number of tokens processed, where a "token" refers to an individual unit processed by the model.

### Time-Based Billing

Other models use **time-based billing**, meaning you are charged per second of compute time used to run your inference request.

## Free Models

The following models are available for free for a limited time.

| Model Code                            | Free until |
| ------------------------------------- | ---------- |
| K-intelligence/Midm-2.0-Base-Instruct | August 4th |
| K-intelligence/Midm-2.0-Mini-Instruct | August 4th |

## Pinned Models (Token-Based Billing)

The following **pinned** popular models are billed **per token**:

| Model Code                        | Price per Token                    |
| --------------------------------- | ---------------------------------- |
| deepseek-ai/DeepSeek-R1           | Input \$3 · Output \$7 / 1M tokens |
| meta-llama/Llama-3.3-70B-Instruct | \$0.6 / 1M tokens                  |
| meta-llama/Llama-3.1-8B-Instruct  | \$0.1 / 1M tokens                  |

## Other Models (Time-Based Billing)

Other models are billed **per second of compute time**:

| Model Code                                    | Price per Second |
| --------------------------------------------- | ---------------- |
| deepseek-ai/DeepSeek-R1-0528                  | \$0.004 / second |
| meta-llama/Llama-4-Maverick-17B-128E-Instruct | \$0.004 / second |
| meta-llama/Llama-4-Scout-17B-16E-Instruct     | \$0.002 / second |
| Qwen/Qwen3-235B-A22B                          | \$0.004 / second |
| Qwen/Qwen3-30B-A3B                            | \$0.002 / second |
| Qwen/Qwen3-32B                                | \$0.002 / second |
| google/gemma-3-27b-it                         | \$0.002 / second |
| mistralai/Mistral-Small-3.1-24B-Instruct-2503 | \$0.002 / second |
| mistralai/Devstral-Small-2505                 | \$0.002 / second |
| mistralai/Magistral-Small-2506                | \$0.002 / second |
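To make the two billing methods concrete, here is a small, illustrative cost calculation using the prices listed above. This is a sketch only, not an official cost calculator; actual bills depend on your real token counts and compute time.

```python
# Illustrative cost math using the listed prices (not an official calculator).

# Token-based example: meta-llama/Llama-3.3-70B-Instruct at $0.6 per 1M tokens.
tokens_processed = 1_500_000  # e.g. 1.5M tokens processed in a month
token_based_cost = tokens_processed / 1_000_000 * 0.6
print(f"Token-based cost: ${token_based_cost:.2f}")  # $0.90

# Time-based example: Qwen/Qwen3-32B at $0.002 per second of compute time.
compute_seconds = 3_600  # e.g. one hour of total request compute time
time_based_cost = compute_seconds * 0.002
print(f"Time-based cost: ${time_based_cost:.2f}")    # $7.20
```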
## FAQs

Your usage tier, which determines your rate limits, increases monthly based on your payment history. Need a faster upgrade? Reach out anytime at [support@friendli.ai](mailto:support@friendli.ai); we're happy to help!

Popular models are available to all users, depending on the limits determined by their usage tiers.

You'll receive an alert when approaching your monthly cap. Please contact [support@friendli.ai](mailto:support@friendli.ai) to discuss options for increasing your monthly cap. We may help you (1) pay early to reset your monthly cap, or (2) upgrade your plan to increase your monthly cap and unlock more features.

For more questions, contact [support@friendli.ai](mailto:support@friendli.ai).

# QuickStart: Friendli Serverless Endpoints

Source: https://friendli.ai/docs/guides/serverless_endpoints/quickstart

Learn how to get started with Friendli Serverless Endpoints in this step-by-step guide. Create an account, choose from powerful AI models like Llama 3.1, and seamlessly generate text, code, and more with ease.

export const RoundedBorderBox = ({children, caption}) =>
{children} {caption &&

{caption}

}
;

## 1. Log In or Sign Up

* If you have an account, log in using your preferred SSO or email/password combination.
* If you're new to FriendliAI, create an account for free.

Login

## 2. Access Friendli Serverless Endpoints

* In the left sidebar, find the "Serverless Endpoints" option.
* Click the option to access the playground page.

Sidebar

## 3. Select a Model

* Browse available generative models. Choose the model that best aligns with your desired use case.
* Click on a model that supports Friendli Serverless Endpoints to directly select the endpoint.
* First-time users receive a free trial to explore Friendli Serverless Endpoints without any financial commitment.

Select Model
Select Endpoint ## 4. Generate Responses 1. Enter Your Query: * Type in your prompt or question. Chat Prompt 2. Adjust Settings: * Refer to the [Text Generation](/guides/serverless_endpoints/text-generation) docs for more details on the settings applicable for the text generation models. Chat Parameters 3. Generate Your Response: * Click submit button to start the generation process. * The model will process your query and produce the corresponding text output. That's it! Chat Response ### Generating Responses Through the Endpoint URL If you wish to send your requests through the endpoint URL, you can find the model id by hitting the info button on the top-right corner of the page. Refer to [this guide](/guides/personal_access_tokens) for general instructions on the Friendli Token. Model Info
```sh cURL
curl -X POST https://api.friendli.ai/serverless/v1/chat/completions \
  -H "Authorization: Bearer $FRIENDLI_TOKEN" \
  -H "Content-Type: application/json" \
  -d '{
    "model": "meta-llama-3.1-8b-instruct",
    "messages": [
      {
        "role": "user",
        "content": "Python is a popular"
      }
    ]
  }'
```

```python Python SDK
# pip install friendli
import os

from friendli import SyncFriendli

client = SyncFriendli(token=os.getenv("FRIENDLI_TOKEN"))

chat_completion = client.serverless.chat.complete(
    model="meta-llama-3.3-70b-instruct",
    messages=[
        {
            "role": "user",
            "content": "Tell me how to make a delicious pancake"
        }
    ],
    stream=False,
)
print(chat_completion.choices[0].message.content)
```
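If you prefer token-by-token output, the same `friendli` SDK also supports streaming. The snippet below is a minimal sketch that mirrors the streaming examples in the Text Generation guide; the model name and prompt are placeholders.

```python Streaming
# pip install friendli
# Minimal streaming sketch; each event is a streamed chunk of the chat completion.
import os

from friendli import SyncFriendli

with SyncFriendli(token=os.getenv("FRIENDLI_TOKEN", "")) as client:
    res = client.serverless.chat.stream(
        model="meta-llama-3.3-70b-instruct",
        messages=[
            {"role": "user", "content": "Tell me how to make a delicious pancake"}
        ],
    )
    with res as event_stream:
        for event in event_stream:
            print(event, flush=True)
```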
## Additional Tips Check out the [Text Generation](/guides/serverless_endpoints/text-generation) docs for more details. **Ready to unlock the creativity of generative AI? Get started with Friendli Serverless Endpoints today!** # Structured Outputs Source: https://friendli.ai/docs/guides/serverless_endpoints/structured-outputs Generate structured outputs using FriendliAI's Structured Outputs feature. Large language models (LLMs) excel at creative text generation, but we often face a case where we need LLM outputs to be more structured. This is where our exciting new "structured output" feature comes in. Structured Outputs is also available in [Friendli Dedicated Endpoints](https://friendli.ai/products/dedicated-endpoints) and [Friendli Container](https://friendli.ai/products/container). For more advanced use cases of our Structured Outputs feature, check out our detailed blog post on [Structured Output for LLM Agents](https://friendli.ai/blog/structured-output-llm-agents). ## Structured response modes | Type | Description | Name at OpenAI | | ------------- | ------------------------------------------------------------------- | --------------------------------------------------------------------------------------------- | | `json_schema` | The model returns a JSON object that conforms to the given schema. | [Structured Outputs](https://platform.openai.com/docs/guides/structured-outputs#introduction) | | `json_object` | The model can return any JSON object. | [JSON mode](https://platform.openai.com/docs/guides/structured-outputs#json-mode) | | `regex` | The model returns a string that conforms to the given regex schema. | N/A | ## How to use This guide provides a step-by-step example of how to create a structured output response in JSON form.\ In this example, we will use Python and the `pydantic` library to define a schema for the output. Define a schema that contains information about a dish. ```python from pydantic import BaseModel class Result(BaseModel): dish: str cuisine: str calories: int ``` Call structured output and use schema to structure the response. ```python {17-22} import os from openai import OpenAI client = OpenAI( base_url="https://api.friendli.ai/serverless/v1", api_key=os.getenv("FRIENDLI_TOKEN"), ) completion = client.chat.completions.create( model="meta-llama-3.1-8b-instruct", messages=[ { "role": "user", "content": "Suggest a popular Italian dish in JSON format.", }, ], response_format={ "type": "json_schema", "json_schema": { "schema": Result.model_json_schema(), } } ) ``` You can use the output in the following way. ```python response = completion.choices[0].message.content print(response) ``` The code output result is as follows. ```json Result: { "dish": "Spaghetti Bolognese", "cuisine": "Italian", "calories": 540 } ``` This example demonstrates how to generate an arbitrary JSON object response without a predefined schema. In `json_object` mode, the response may start with `{` or `[` and can be any arbitrary JSON object (dictionary) or array. If you need predictable results, we recommend using `json_schema`. ```python {15} import os from openai import OpenAI client = OpenAI( base_url="https://api.friendli.ai/serverless/v1", api_key=os.getenv("FRIENDLI_TOKEN"), ) completion = client.chat.completions.create( model="meta-llama-3.1-8b-instruct", messages=[ {"role": "system", "content": "You MUST answer with JSON."}, {"role": "user", "content": "Generate a lasagna recipe. 
(very short)"}, ], response_format={"type": "json_object"}, ) print(completion.choices[0].message.content) ``` This example shows how to generate output that matches a specific regular expression pattern. ```python {17-18} import os from openai import OpenAI client = OpenAI( base_url="https://api.friendli.ai/serverless/v1", api_key=os.getenv("FRIENDLI_TOKEN"), ) completion = client.chat.completions.create( model="meta-llama-3.1-8b-instruct", messages=[ { "role": "user", "content": "μ‘°μ„  μ™•μ‘°μ˜ 첫번째 왕은 λˆ„κ΅¬μž…λ‹ˆκΉŒ (Who is the first king of the Joseon Dynasty)?", }, ], # Korean characters and numbers are allowed in the response. response_format={"type": "regex", "schema": "[\n ,.?!0-9\uac00-\ud7af]*"}, ) print(completion.choices[0].message.content) ``` ## Supported JSON schemas We ensure super-fast schema-guided generation by disabling JSON schema features that cause computation inefficiencies. We support **all seven standard JSON schema types** (`null`, `boolean`, `number`, `integer`, `string`, `object`, `array`), and **the supported JSON schema keywords are listed below**. Using unsupported or unexpected JSON schema keywords may result in them being ignored, triggering an error, or causing undefined behavior. ### Type-specific keywords * `integer` * `exclusiveMinimum`, `exclusiveMaximum`, `minimum`, `maximum` (Note: these are not supported in `number`) * `string` * `pattern` * `format` * Supported values: `uuid`, `date-time`, `date`, `time` * `object` * `properties` * `additionalProperties` is ignored, and is always set to `False`. * `required`: We support both required and optional properties, but have these limitations: * The sequence of the properties is fixed. * The first property should be `required`. If not, the first required property is moved to the first. * `array` * `items` * `minItems`: We support only `0` or `1` for `minItems`. ### Constant values and enumerated values `const` and `enum` only support constant values of null, boolean, number, and string. ### Schema composition We support only `anyOf` for [schema composition](https://json-schema.org/understanding-json-schema/reference/combining). ### Referencing subschemas We only support referencing (`$ref`) to "internal" subschemas. These subschemas must be defined within `$defs`, and the value of `$ref` must be a valid URI pointing to a subschema. Please refer [here](https://json-schema.org/understanding-json-schema/structuring#defs) for more details. ### Annotation JSON schema annotations such as `title`, `$comments` or `description` are accepted but ignored. # Text Generation Models Source: https://friendli.ai/docs/guides/serverless_endpoints/text-generation Dive into the characteristics of popular Text Generation Models (TGMs) available on Friendli. ## Unleashing the Power of Language with Friendli Welcome to the captivating world of Text Generation Models (TGMs)! These AI models learn from massive datasets of text and code, mimicking human language patterns to generate creative and informative outputs. Friendli empowers you to harness the potential of several cutting-edge TGMs through its convenient interface, letting you unlock the magic of words with ease. 
This guide dives into the characteristics of popular TGMs available on Friendli Serverless Endpoints: ## Model Supports * `K-intelligence/Midm-2.0-Base-Instruct` * `K-intelligence/Midm-2.0-Mini-Instruct` * `deepseek-ai/DeepSeek-R1` * `deepseek-ai/DeepSeek-R1-0528` * `meta-llama/Llama-4-Maverick-17B-128E-Instruct` * `meta-llama/Llama-4-Scout-17B-16E-Instruct` * `meta-llama/Llama-3.3-70B-Instruct` * `meta-llama/Llama-3.1-8B-Instruct` * `Qwen/Qwen3-235B-A22B` * `Qwen/Qwen3-30B-A3B` * `Qwen/Qwen3-32B` * `google/gemma-3-27b-it` * `mistralai/Mistral-Small-3.1-24B-Instruct-2503` * `mistralai/Devstral-Small-2505` * `mistralai/Magistral-Small-2506` Please note that the pricing for each model can be found in the [pricing section](/guides/serverless_endpoints/pricing).\ In addition, you can deploy any model of your choice. Check out more models we support [here](https://friendli.ai/models)! ## Llama 3.3 70B Instruct * **Focus**: Engaging dialogues and interactive experiences. * **Strengths**: * Natural language understanding and human-like response generation in conversational settings. * Maintains coherence and context throughout dialogues, fostering seamless interactions. * Can adapt to different conversation styles and tones. * **Example Use Cases**: * Building customer service chatbots that understand natural language and offer personalized support. * Creating interactive storytelling experiences and AI companions. * Developing game AI characters with engaging back-and-forth conversations. ### Examples When you install `friendli`, you can generate chat response with Python SDK.\ Refer to [this guide](/guides/personal_access_tokens) for general instructions on the Friendli Token. ```python Default # pip install friendli import os from friendli import SyncFriendli with SyncFriendli( token=os.getenv("FRIENDLI_TOKEN", ""), ) as friendli: res = friendli.serverless.chat.complete( model="meta-llama-3.3-70b-instruct", messages=[ { "role": "user", "content": "Tell me how to make a delicious pancake" }, ], ) print(res) ``` ```python Streaming # pip install friendli import os from friendli import SyncFriendli with SyncFriendli( token=os.getenv("FRIENDLI_TOKEN", ""), ) as friendli: res = friendli.serverless.chat.stream( model="meta-llama-3.3-70b-instruct", messages=[ { "role": "user", "content": "Tell me how to make a delicious pancake" }, ], ) with res as event_stream: for event in event_stream: print(event, flush=True) ``` ```python Async # pip install friendli import asyncio import os from friendli import AsyncFriendli async def main(): async with AsyncFriendli( token=os.getenv("FRIENDLI_TOKEN", ""), ) as friendli: res = await friendli.serverless.chat.complete( model="meta-llama-3.3-70b-instruct", messages=[ { "role": "user", "content": "Tell me how to make a delicious pancake" }, ], ) print(res) asyncio.run(main()) ``` ```python Streaming (Async) # pip install friendli import asyncio import os from friendli import AsyncFriendli async def main(): async with AsyncFriendli( token=os.getenv("FRIENDLI_TOKEN", ""), ) as friendli: res = await friendli.serverless.chat.stream( model="meta-llama-3.3-70b-instruct", messages=[ { "role": "user", "content": "Tell me how to make a delicious pancake" }, ], ) async with res as event_stream: async for event in event_stream: print(event, flush=True) asyncio.run(main()) ``` ## Beyond the Models: Generation Settings: Friendli Serverless Endpoints unlocks further customization through various generation settings, allowing you to fine-tune your Text Generation Model (TGM) 
outputs:

* **max\_tokens**: This defines the maximum number of tokens your TGM generates. Lower values produce concise outputs, while higher values allow for longer narratives.
* **temperature**: Think of temperature as a creativity knob. Higher values promote more imaginative and surprising outputs, while lower values favor safe and predictable responses.
* **top\_p**: This parameter governs the diversity of your output. Lower values focus on the most likely continuation, while higher values encourage exploration of less probable but potentially interesting options.

For more details, see [here](/openapi/serverless/chat-completions).

## Unleashing the Full Potential:

Friendli Serverless Endpoints removes the technical hurdles, letting you focus on exploring the magic of TGMs. Start experimenting with different models and settings, tailoring the outputs to your unique vision. Remember, practice makes perfect: the more you interact with these models, the more you'll understand their strengths and discover the incredible possibilities they hold.

Ready to embark on your text generation journey? Friendli Serverless Endpoints is your gateway to a world of boundless creativity and innovative applications. Sign up today and let the words flow!

# Tool Assisted API

Source: https://friendli.ai/docs/guides/serverless_endpoints/tool-assisted-api

Tool Assisted API enhances a model's capabilities by integrating tools that extend its functionality beyond simple conversational interactions. By using this API, the model becomes more dynamic, providing more comprehensive and actionable responses. Currently, Friendli Serverless Endpoints supports a variety of built-in tools specifically designed for Chat Completion tasks.

export const ToolIcon = () => { return ; };

export const ChatIcon = () => { return ; };

## What is Tool Assisted API?

**Tool Assisted API** enhances a model's capabilities by integrating **tools** that extend its functionality beyond simple conversational interactions. By using this API, the model becomes more dynamic, providing more comprehensive and actionable responses. Currently, **[Friendli Serverless Endpoints](/guides/serverless_endpoints/introduction)** supports a variety of built-in tools specifically designed for **Chat Completion** tasks.

***

### What is Chat Completion?

**[Chat completion](/openapi/serverless/chat-completions)** refers to a model's ability to generate responses in a conversation. Given a sequence of messages or conversation turns, the model processes the input and generates a response based on its internal knowledge and training data.

* **Example**:
  * **User**: "What is the capital of France?"
  * **Model**: "The capital of France is Paris."

However, chat completion has its limitations: it is restricted to the knowledge the model has learned during its training and cannot access real-time or external data.

***

### Is Chat Completion Different from Tool Assisted Chat Completion?

Yes, **[Tool Assisted Chat Completion](/openapi/serverless/tool-assisted-chat-completions)** goes beyond basic chat completion by integrating external tools to enhance the conversation. This allows the model to access real-time data, perform specific tasks, and interact with external systems in ways that chat completion alone cannot achieve.

* **Example**:
  * **User**: "What is the weather today?"
  * **Model without Tool Access**: Relies on pre-learned information, potentially giving outdated or generalized answers.
## Unleashing the Full Potential

Friendli Serverless Endpoints removes the technical hurdles, letting you focus on exploring the magic of TGMs. Start experimenting with different models and settings, tailoring the outputs to your unique vision. Remember, practice makes perfect – the more you interact with these models, the more you'll understand their strengths and discover the incredible possibilities they hold.

Ready to embark on your text generation journey? Friendli Serverless Endpoints is your gateway to a world of boundless creativity and innovative applications. Sign up today and let the words flow!

# Tool Assisted API

Source: https://friendli.ai/docs/guides/serverless_endpoints/tool-assisted-api

Tool Assisted API enhances a model's capabilities by integrating tools that extend its functionality beyond simple conversational interactions. By using this API, the model becomes more dynamic, providing more comprehensive and actionable responses. Currently, Friendli Serverless Endpoints supports a variety of built-in tools specifically designed for Chat Completion tasks.

export const ToolIcon = () => { return ; };

export const ChatIcon = () => { return ; };

## What is Tool Assisted API?

**Tool Assisted API** enhances a model's capabilities by integrating **tools** that extend its functionality beyond simple conversational interactions. By using this API, the model becomes more dynamic, providing more comprehensive and actionable responses. Currently, **[Friendli Serverless Endpoints](/guides/serverless_endpoints/introduction)** supports a variety of built-in tools specifically designed for **Chat Completion** tasks.

***

### What is Chat Completion?

**[Chat completion](/openapi/serverless/chat-completions)** refers to a model's ability to generate responses in a conversation. Given a sequence of messages or conversation turns, the model processes the input and generates a response based on its internal knowledge and training data.

* **Example**:
  * **User**: "What is the capital of France?"
  * **Model**: "The capital of France is Paris."

However, chat completion has its limitations: it is restricted to the knowledge the model has learned during its training and cannot access real-time or external data.

***

### Is Chat Completion Different from Tool Assisted Chat Completion?

Yes, **[Tool Assisted Chat Completion](/openapi/serverless/tool-assisted-chat-completions)** goes beyond basic chat completion by integrating external tools to enhance the conversation. This allows the model to access real-time data, perform specific tasks, and interact with external systems in ways that chat completion alone cannot achieve.

* **Example**:
  * **User**: "What is the weather today?"
  * **Model without Tool Access**: Relies on pre-learned information, potentially giving outdated or generalized answers.
  * **Model with Tool Access**: Calls a weather API to retrieve live data and responds: "The weather today in New York is 72°F with clear skies."

With tool access, the model provides a more accurate and up-to-date response. Additionally, some tasks, such as file processing or complex calculations, cannot be performed by the model alone but can be handled with the help of tools.

* **Example**:
  * **User**: "Can you extract the text from this document?" (provides a file)
  * **Model without Tool Access**: "I cannot extract data from files directly."
  * **Model with Tool Access**: Extracts the text from the provided file and responds: "Using the `file:text` tool, I've extracted the following text: \[Text from the file]."

When no tools are specified, the model will respond using only its internal knowledge.

***

### Benefits of Tool Assisted Chat Completion

Tool Assisted Chat Completion offers several advantages over basic chat completion:

* **Real-Time Data Access**: The model can pull live information.
* **Extended Capabilities**: The model can perform complex tasks like running calculations, executing code, extracting text from files, and interacting with databases and APIs.
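To see the difference in practice, the snippet below sketches a Tool Assisted Chat Completion request with the `friendli` Python SDK, enabling the built-in `web:search` tool (the model name and prompt are examples; the tool calling tutorial later in these docs walks through this flow in detail):

```python
# pip install friendli
import os

from friendli import SyncFriendli

with SyncFriendli(
    token=os.getenv("FRIENDLI_TOKEN", ""),
) as friendli:
    res = friendli.serverless.tool_assisted_chat.complete(
        model="meta-llama-3.3-70b-instruct",
        messages=[
            {"role": "user", "content": "What is the weather today in New York?"},
        ],
        tools=[{"type": "web:search"}],  # let the model call the built-in web search tool
    )
    print(res)
```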
***

### Comparison: Chat Completion vs. Tool Assisted Chat Completion

| Feature           | **Chat Completion**                               | **Tool Assisted Chat Completion**                                     |
| ----------------- | ------------------------------------------------- | --------------------------------------------------------------------- |
| **Response Type** | Based on internal knowledge                       | Uses external tools for enhanced, real-time responses                 |
| **Capabilities**  | Limited to pre-learned knowledge                  | Can interact with tools for data retrieval and task execution         |
| **Example**       | "What is the weather today?" (general knowledge)  | "What is the weather today?" (live API result)                        |
| **Use Cases**     | General conversation and Q\&A                     | Complex tasks like real-time updates, data analysis, file processing  |

***

## Built-In Tools

Tool Assisted API automatically selects the best tool to perform an action based on user input when a specific tool is enabled. These tools can handle various operations, such as calculations, statistical analysis, web search, file content extraction, and code execution. Below is a more detailed description of the available tools in Tool Assisted API and when they are typically used:

### `math:calculator`

**Description:** Performs basic arithmetic operations like addition, subtraction, multiplication, and division, as well as more complex calculations such as square roots or exponents. It is useful for any task requiring mathematical computation.

**When Used:** Automatically called when mathematical expressions or calculations are required. Whether you're solving equations, calculating percentages, or handling financial calculations, this tool performs the task for you.

### `math:statistics`

**Description:** Performs statistical analysis, including calculating mean, median, mode, standard deviation, and correlations. It is tailored for situations where you need to analyze or interpret numeric datasets to understand trends or patterns.

**When Used:** Automatically called when analyzing numeric data or generating insights from datasets, like summarizing survey results or calculating probabilities.

### `math:calendar`

**Description:** Handles date-related data, such as calculating date differences or finding specific days in the past or future. It is effective in managing and manipulating calendar-based information.

**When Used:** Automatically called when operations involving dates or time spans are required, such as figuring out how many days are left until an event, determining the day of the week for a specific date, or calculating the duration between two dates.

### `web:search`

**Description:** Retrieves information from the web based on search queries. It fetches information based on keywords and helps gather knowledge or insights from online sources.

**When Used:** Automatically called when you ask questions or seek information that requires external research or the latest data from the web. Whether it is looking up definitions, recent news, or general web searches, this tool handles such tasks effectively.

### `web:url`

**Description:** Extracts specific data from a given website. You can provide a URL, and the tool will fetch the relevant content, including text, metadata, or other embedded information, from that web page.

**When Used:** Automatically called when extracting content from a provided URL, such as fetching text from articles or blog posts.

### `code:python-interpreter`

**Description:** Executes Python code directly within the platform for custom scripts, data processing, or automation.
You can run Python scripts, test snippets of code, or automate tasks through coding logic. **When Used:** Automatically called when tasks involve writing or running Python scripts, such as custom data manipulations or logic-based automation. ### `file:text` **Description:** Reads and extracts text from files, supporting only `.txt` and `.pdf` formats. To use this tool, you must provide the file IDs. (For now, only one file is supported.) **When Used:** Automatically called when text extraction from a file is requested, such as pulling content from documents or reports. ## Conclusion * **Chat Completion**: Best for general conversations that rely on the model's pre-existing knowledge. * **Tool Assisted Chat Completion**: Ideal for real-time, dynamic tasks and more advanced interactions, leveraging external tools to enhance functionality. *** ## Explore APIs To get started with Tool Assisted Chat Completion, follow this tutorial: **[Tool calling with Serverless Endpoints](/guides/tutorials/tool-calling-with-serverless-endpoints)**. For more details, check out the API Reference documentations below: } href="/openapi/serverless/chat-completions"> Discover how to generate text through interactive conversations. } href="/openapi/serverless/tool-assisted-chat-completions"> Learn how to enhance responses with tool assisted chat completions using built-in tools. # Build an agent with Gradio Source: https://friendli.ai/docs/guides/tutorials/build-an-agent-with-gradio Build and deploy smart AI agents with Friendli Serverless Endpoints and Gradio in under 50 lines. ## Goals * Build your own AI agent using [**Friendli Serverless Endpoints**](https://friendli.ai/products/serverless-endpoints) and [**Gradio**](https://www.gradio.app) less than 50 LoC πŸ€– * Use tool calling to make your agent even smarter 🀩 * Share your AI agent with the world and gather feedback 🌎 > [**Gradio**](https://www.gradio.app) is the fastest way to demo your model with a friendly web interface. ## Getting Started 1. Head to [**https://friendli.ai**](https://friendli.ai/get-started/serverless-endpoints), and create an account. 2. Grab a [Friendli Token](https://friendli.ai/suite/setting/tokens) to use Friendli Serverless Endpoints within an agent. ## πŸš€ Step 1. Prerequisite Install dependencies. ``` pip install openai gradio ``` ## πŸš€ Step 2. Launch your agent Build your own AI agent using **Friendli Serverless Endpoints** and **Gradio**. * Gradio provides a `ChatInterface` that implements a chatbot UI running the `chat_function`. * More information about the *chat\_function(message, history)* > *The input function should accept two parameters: a string input message and list of two-element lists of the form \[\[user\_message, bot\_message], ...] representing the chat history, and return a string response.* * Implement the `chat_function` using Friendli Serverless Endpoints. * Here, we used the `meta-llama-3.3-70b-instruct` model. * Feel free to explore other available models [here](https://friendli.ai/models/search?products=SERVERLESS). 
```python
from openai import OpenAI
import gradio as gr

friendli_client = OpenAI(
    base_url="https://api.friendli.ai/serverless/v1",
    api_key="YOUR FRIENDLI TOKEN"
)

def chat_function(message, history):
    messages = []
    for user, chatbot in history:
        messages.append({"role": "user", "content": user})
        messages.append({"role": "assistant", "content": chatbot})
    messages.append({"role": "user", "content": message})
    stream = friendli_client.chat.completions.create(
        model="meta-llama-3.3-70b-instruct",
        messages=messages,
        stream=True
    )
    res = ""
    for chunk in stream:
        res += chunk.choices[0].delta.content or ""
        yield res

css = """
.gradio-container { max-width: 800px !important; margin-top: 100px !important; }
.pending { display: none !important; }
.sm { box-shadow: None !important; }
#component-2 { height: 400px !important; }
"""

with gr.Blocks(theme=gr.themes.Soft(), css=css) as friendli_agent:
    gr.ChatInterface(chat_function)

friendli_agent.launch()
```

## 🚀 Step 3. Tool Calling (Advanced)

Use tool calling to make your agent even smarter! As an example, we will show you how to make your agent search the web before answering.

* Change the `base_url` to `https://api.friendli.ai/serverless/tools/v1`
* Add the `tools` parameter when calling the chat completion API

```python
from openai import OpenAI
import gradio as gr

friendli_client = OpenAI(
    base_url="https://api.friendli.ai/serverless/tools/v1",
    api_key="YOUR FRIENDLI TOKEN"
)

def chat_function(message, history):
    messages = []
    for user, chatbot in history:
        messages.append({"role": "user", "content": user})
        messages.append({"role": "assistant", "content": chatbot})
    messages.append({"role": "user", "content": message})
    stream = friendli_client.chat.completions.create(
        model="meta-llama-3.3-70b-instruct",
        messages=messages,
        stream=True,
        tools=[{"type": "web:search"}],
    )
    res = ""
    for chunk in stream:
        if chunk.choices is None:
            yield "Waiting for tool response..."
        else:
            res += chunk.choices[0].delta.content or ""
            yield res

css = """
.gradio-container { max-width: 800px !important; margin-top: 100px !important; }
.pending { display: none !important; }
.sm { box-shadow: None !important; }
#component-2 { height: 400px !important; }
"""

with gr.Blocks(theme=gr.themes.Soft(), css=css) as agent:
    gr.ChatInterface(chat_function)

agent.launch()
```

Here is the list of available built-in tools (Beta). Feel free to build your agent using the tools below.

* `math:calculator` (tool for calculating arithmetic operations)
* `math:statistics` (tool for analyzing statistical data)
* `math:calendar` (tool for handling date-related data)
* `web:search` (tool for retrieving data through the web search)
* `web:url` (tool for extracting data from a given website)
* `code:python-interpreter` (tool for writing and executing Python code)
* `file:text` (tool for extracting text data from a given file)

## 🚀 Step 4. Deploy your agent

For a temporary deployment, change the last line of the code.

```python
agent.launch(share=True)
```

For a permanent deployment, you can use [Hugging Face Spaces](https://huggingface.co/spaces)!

# Build an agent with LangChain

Source: https://friendli.ai/docs/guides/tutorials/build-an-agent-with-langchain

Build an AI agent with LangChain and Friendli Serverless Endpoints, integrating tool calling for dynamic and efficient responses.

## Introduction

This tutorial walks you through creating an agent using LangChain and Serverless Endpoints.
## Setup ```bash pip install -qU langchain-openai langchain-community langchain wikipedia ``` Get your [Friendli Token](https://friendli.ai/suite/setting/tokens) to use Friendli Serverless Endpoints. ```python import getpass import os if not os.environ.get("FRIENDLI_TOKEN"): os.environ["FRIENDLI_TOKEN"] = getpass.getpass("Enter your Friendli Token: ") ``` ## Instantiation ```python from langchain_openai import ChatOpenAI llm = ChatOpenAI( model="meta-llama-3.1-8b-instruct", base_url="https://api.friendli.ai/serverless/v1", api_key=os.environ["FRIENDLI_TOKEN"], ) ``` ## Create Agent with LangChain ### Step 1. Create Tool ```python from langchain_community.tools import WikipediaQueryRun from langchain_community.utilities import WikipediaAPIWrapper api_wrapper = WikipediaAPIWrapper(top_k_results=1, doc_content_chars_max=100) wiki = WikipediaQueryRun(api_wrapper=api_wrapper) tools = [wiki] ``` ### Step 2. Create Prompt ```python from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder prompt = ChatPromptTemplate.from_messages( [ ("system", "You are a helpful assistant"), MessagesPlaceholder("chat_history"), ("user", "{input}"), ("placeholder", "{agent_scratchpad}"), ] ) prompt.messages ``` ### Step 3. Create Agent ```python from langchain.agents import AgentExecutor from langchain.agents import create_tool_calling_agent agent = create_tool_calling_agent(llm, tools, prompt) agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True) ``` ### Step 4. Run the Agent ```python chat_history = [] while True: user_input = input("Enter your message: ") result = agent_executor.invoke( {"input": user_input, "chat_history": chat_history}, ) chat_history.append({"role": "user", "content": user_input}) chat_history.append({"role": "assistant", "content": result["output"]}) ``` When you run the code, it will wait for the user's input. After inputting, it will wait and output the result. When you ask a question about a specific wikipedia, it will automatically call the wikipedia tool and output the result. ```text final result Enter your Friendli Token: Β·Β·Β·Β·Β·Β·Β·Β·Β·Β· Enter your message: hello > Entering new AgentExecutor chain... Hello, it's nice to meet you. I'm here to help with any questions or topics you'd like to discuss. Is there something in particular you'd like to talk about, or do you need assistance with something? > Finished chain. Enter your message: What does the Linux kernel do? > Entering new AgentExecutor chain... Invoking: `wikipedia` with `{'query': 'Linux kernel'}` responded: The Linux kernel is the core component of the Linux operating system. It acts as a bridge between the computer hardware and the user space applications. The kernel manages the system's hardware resources, such as memory, CPU, and I/O devices. It provides a set of interfaces and APIs that allow user space applications to interact with the hardware. Page: Linux kernel Summary: The Linux kernel is a free and open source,:β€Š4β€Š UNIX-like kernel that isThe Linux kernel is a free and open source, UNIX-like kernel that is responsible for managing the system's hardware resources, such as memory, CPU, and I/O devices. It provides a set of interfaces and APIs that allow user space applications to interact with the hardware. The kernel is the core component of the Linux operating system, and it plays a crucial role in ensuring the stability and security of the system. > Finished chain. 
Enter your message: ``` ## Full Example Code ```python import getpass import os from langchain_openai import ChatOpenAI from langchain_community.tools import WikipediaQueryRun from langchain_community.utilities import WikipediaAPIWrapper from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder from langchain.agents import AgentExecutor from langchain.agents import create_tool_calling_agent if not os.environ.get("FRIENDLI_TOKEN"): os.environ["FRIENDLI_TOKEN"] = getpass.getpass("Enter your Friendli Token: ") llm = ChatOpenAI( model="meta-llama-3.1-8b-instruct", base_url="https://api.friendli.ai/serverless/v1", api_key=os.environ["FRIENDLI_TOKEN"], ) api_wrapper = WikipediaAPIWrapper(top_k_results=1, doc_content_chars_max=100) wiki = WikipediaQueryRun(api_wrapper=api_wrapper) tools = [wiki] # Get the prompt to use - you can modify this! prompt = ChatPromptTemplate.from_messages( [ ("system", "You are a helpful assistant"), MessagesPlaceholder("chat_history"), ("user", "{input}"), ("placeholder", "{agent_scratchpad}"), ] ) agent = create_tool_calling_agent(llm, tools, prompt) agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True) chat_history = [] while True: user_input = input("Enter your message: ") result = agent_executor.invoke( {"input": user_input, "chat_history": chat_history}, ) chat_history.append({"role": "user", "content": user_input}) chat_history.append({"role": "assistant", "content": result["output"]}) ``` # Chat docs with LangChain Source: https://friendli.ai/docs/guides/tutorials/chat-docs-with-langchain You can view the content [here](https://friendli.ai/blog/chatdocs-rag-friendli-langchain). # Chat docs with MongoDB Source: https://friendli.ai/docs/guides/tutorials/chat-docs-with-mongodb You can view the content [here](https://friendli.ai/blog/rag-chatbot-friendli-mongodb-atlas-langchain). # Go Playground with Next.js Source: https://friendli.ai/docs/guides/tutorials/go-playground-with-nextjs You can view the content [here](https://friendli.ai/blog/vercel-ai-sdk-playground-tutorial). # How to Fine-tune Vision Language Models (VLMs) Source: https://friendli.ai/docs/guides/tutorials/how-to-fine-tune-vlm Fine-tune Vision Language Models (VLMs) on Friendli Dedicated Endpoints using datasets. export const RoundedBorderBox = ({children, caption}) =>
{children} {caption &&

{caption}

}
;

## Introduction

Effortlessly fine-tune your Vision Language Model (VLM) with Friendli Dedicated Endpoints, which leverages the Parameter-Efficient Fine-Tuning (PEFT) method to reduce training costs while preserving model quality, similar to full-parameter fine-tuning. This can make your model an expert on specific visual tasks and improve its ability to understand and describe images accurately.

In this tutorial, we will cover:

* How to upload your image-text dataset for VLM fine-tuning.
* How to fine-tune state-of-the-art VLMs like [Qwen2.5-VL-32B-Instruct](https://huggingface.co/Qwen/Qwen2.5-VL-32B-Instruct) and [gemma-3-27b-it](https://huggingface.co/google/gemma-3-27b-it) on your dataset.
* How to deploy your fine-tuned VLM model.

## Table of Contents

* [Prerequisites](#prerequisites)
* [Step 1. Prepare Your Dataset](#step-1-prepare-your-dataset)
* [Step 2. Upload Your Dataset](#step-2-upload-your-dataset)
* [Step 3. Fine-tune Your VLM](#step-3-fine-tune-your-vlm)
* [Step 4. Monitor Training Progress](#step-4-monitor-training-progress)
* [Step 5. Deploy Your Fine-tuned Model](#step-5-deploy-your-fine-tuned-model)
* [Resources](#resources)

## Prerequisites

1. Head to [Friendli Suite](https://friendli.ai/get-started/dedicated-endpoints) and create an account.
2. Issue a **Friendli Token** by going to [Personal settings > Tokens](https://friendli.ai/suite/setting/tokens). Make sure to copy and store it securely in a safe place as you won't be able to see it again after refreshing the page.\
   For detailed instructions, see [Personal Access Tokens](/guides/personal_access_tokens).

## Step 1. Prepare Your Dataset

Your dataset should be a conversational dataset in `.jsonl` or `.parquet` format, where each line represents a sequence of messages. Each message in the conversation should include a `"role"` (e.g., `system`, `user`, or `assistant`) and `"content"`. For VLM fine-tuning, user content can contain both text and image data (note that for image data, we support URL and Base64 formats).

Here's an example of what it should look like. Note that it's one line but beautified for readability:

```json
{
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": [
        {
          "type": "image",
          "image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"
        },
        {
          "type": "image",
          "image": "data:image/png;base64,"
        },
        {
          "type": "text",
          "text": "Describe this image in detail."
        }
      ]
    },
    {
      "role": "assistant",
      "content": "The image is a bee."
    }
  ]
}
```

You can access our example datasets ['FriendliAI/gsm8k'](https://huggingface.co/datasets/FriendliAI/gsm8k) (for Chat) and ['FriendliAI/sample-vision'](https://huggingface.co/datasets/FriendliAI/sample-vision) (for Chat with image), and explore some of our quantized generative AI models on [our Hugging Face page](https://huggingface.co/FriendliAI).

## Step 2. Upload Your Dataset

Once you have prepared your dataset, you can upload it to Friendli using the [Python SDK](/sdk/python-sdk).
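If you are assembling the dataset programmatically before uploading, each conversation can be built as a Python dictionary and appended to the `.jsonl` file as a single line. Here is a minimal sketch following the example above (the local image path and message contents are placeholders):

```python
import base64
import json

# Hypothetical local image file; replace with your own.
with open("bee.png", "rb") as f:
    image_base64 = base64.standard_b64encode(f.read()).decode("utf-8")

sample = {
    "messages": [
        {"role": "system", "content": "You are a helpful assistant."},
        {
            "role": "user",
            "content": [
                {"type": "image", "image": f"data:image/png;base64,{image_base64}"},
                {"type": "text", "text": "Describe this image in detail."},
            ],
        },
        {"role": "assistant", "content": "The image is a bee."},
    ]
}

# Each sample occupies exactly one line of the dataset file.
with open("dataset.jsonl", "a") as f:
    f.write(json.dumps(sample) + "\n")
```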
### Install the Python SDK First, install the Friendli Python SDK: ```bash # Using pip pip install friendli # Using poetry poetry add friendli ``` ### Upload Your Dataset Use the following code to create a dataset and upload your samples: ```python import os from friendli.friendli import SyncFriendli from friendli.models import Sample TEAM_ID = os.environ["FRIENDLI_TEAM_ID"] PROJECT_ID = os.environ["FRIENDLI_PROJECT_ID"] TOKEN = os.environ["FRIENDLI_TOKEN"] # Read dataset file and parse each line as a Sample with open("dataset.jsonl", "rb") as f: data = [Sample.model_validate_json(line) for line in f] with SyncFriendli( token=TOKEN, x_friendli_team=TEAM_ID, ) as friendli: # Create a new dataset with TEXT and IMAGE modalities with friendli.dataset.create( modality=["TEXT", "IMAGE"], name="my-vlm-dataset", # name of the dataset project_id=PROJECT_ID, ) as dataset: # Upload samples to the dataset # Each line from your dataset file becomes a separate sample dataset.upload_samples( samples=data, split="train", # name of the split to upload to ) ``` ### How It Works Friendli Python SDK doesn't upload your entire dataset file at once. Instead, it processes your dataset more efficiently: 1. **Reads your dataset file line by line**: Each line is parsed as a `Sample` object containing a conversation with messages. 2. **Creates a dataset**: A new dataset is created in your Friendli project with the specified modalities (`TEXT` and `IMAGE`). 3. **Uploads each conversation as a separate sample**: Rather than uploading the entire file, each conversation (line in the dataset file) becomes an individual sample in the dataset. 4. **Organizes by splits**: Samples are organized into splits like "train", "validation", or "test" for different purposes during fine-tuning. ### Environment Variables Make sure to set the required environment variables: ```bash export FRIENDLI_TOKEN="your-friendli-token" export FRIENDLI_TEAM_ID="your-team-id" export FRIENDLI_PROJECT_ID="your-project-id" ``` You can find your Team ID and Project ID in the URL of Friendli Suite, formatted as `https://friendli.ai///...`. ### View Your Dataset To view and edit the datasets you've uploaded, visit [Friendli Suite > Dataset](https://friendli.ai/suite/~/dataset). Datasets
Dataset ## Step 3. Fine-tune Your VLM Go to [Friendli Suite > Fine-tuning](https://friendli.ai/suite/~/fine-tuning), and click the **'New job'** button to create a new job. Fine-tuning
Create a new job

In the job creation form, you'll need to configure the following settings:

1. **Job Name**:
   * Enter a name for your fine-tuning job.
   * If not provided, a name will be automatically generated (e.g., `accomplished-shark`).
2. **Model**:
   * Choose your base model from one of these sources:
     * Hugging Face: Select from models available on Hugging Face.
     * Weights & Biases: Use a model from your W\&B projects.
     * Uploaded model: Use a model you've previously uploaded.
3. **Dataset**:
   * Select the dataset to use.
4. **Weights & Biases Integration** (Optional):
   * Enable W\&B tracking by providing your W\&B project name.
   * This will automatically log training metrics to your W\&B dashboard for comprehensive monitoring and experiment tracking.
   * For detailed setup instructions, see [using W\&B with dedicated fine-tuning](/guides/dedicated_endpoints/faq#using-w%26b-with-dedicated-fine-tuning).
5. **Hyperparameters**:
   * Learning Rate (required): Initial learning rate for the optimizer (e.g., 0.0001).
   * Batch Size (required): Total batch size used for training (e.g., 16).
   * Total amount of training (required), specified as either:
     * Number of Training Epochs: Total number of training epochs to perform (e.g., 1)
     * Training Steps: Total number of training steps to perform (e.g., 1000)
   * Evaluation Steps (required): Number of steps between evaluations of the model using the validation set (e.g., 300).
   * LoRA Rank (optional): Rank of the LoRA parameters (e.g., 16).
   * LoRA Alpha (optional): Scaling factor that determines the influence of the low-rank matrices during fine-tuning (e.g., 32).
   * LoRA Dropout (optional): Dropout rate applied during fine-tuning (e.g., 0.1).

After configuring these settings, click the **'Create'** button at the bottom to start your fine-tuning job.

## Step 4. Monitor Training Progress

You can now monitor your fine-tuning job's progress on Friendli Suite. If you have integrated your Weights & Biases (W\&B) account, you can also monitor the training status in your W\&B project. Read our FAQ section on [using W\&B with dedicated fine-tuning](/guides/dedicated_endpoints/faq#using-w%26b-with-dedicated-fine-tuning) to learn more about monitoring your fine-tuning jobs on their platform.

Fine-tuning job

## Step 5. Deploy Your Fine-tuned Model

Once the fine-tuning process is complete, you can immediately deploy the model by clicking the **'Deploy'** button in the top right corner. The name of the fine-tuned LoRA adapter will be the same as your fine-tuning job name.

Completed

For more information about deploying a model, refer to [Endpoints documentation](/guides/dedicated_endpoints/endpoints).
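Once the endpoint is running, you can query your fine-tuned VLM like any other dedicated endpoint. Below is a minimal sketch using the OpenAI-compatible Chat Completions API (the endpoint ID and image URL are placeholders; see the Vision guide for more details on image inputs):

```python
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://api.friendli.ai/dedicated/v1",
    api_key=os.environ.get("FRIENDLI_TOKEN"),
)

completion = client.chat.completions.create(
    # Replace YOUR_ENDPOINT_ID with the ID of the endpoint serving your fine-tuned model.
    model="YOUR_ENDPOINT_ID",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image in detail."},
                {"type": "image_url", "image_url": {"url": "https://example.com/sample.jpg"}},
            ],
        },
    ],
)
print(completion.choices[0].message.content)
```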
## Resources Explore these additional resources to learn more about VLM fine-tuning and optimization: * [Browse all models supported by FriendliAI](https://friendli.ai/models) * [Example dataset](https://huggingface.co/datasets/FriendliAI/gsm8k) * [FAQ on general requirements for a model](/guides/dedicated_endpoints/faq#general-requirements-for-a-model) * [FAQ on using a Hugging Face repository as a model](/guides/dedicated_endpoints/faq#how-to-use-a-hugging-face-repository-as-a-model) * [FAQ on integrating a Hugging Face account](/guides/dedicated_endpoints/faq#how-to-integrate-a-hugging-face-account) * [FAQ on using a W\&B artifact as a model](/guides/dedicated_endpoints/faq#how-to-use-a-w%26b-artifact-as-a-model) * [FAQ on integrating a W\&B account](/guides/dedicated_endpoints/faq#how-to-integrate-a-w%26b-account) * [FAQ on using W\&B with dedicated fine-tuning](/guides/dedicated_endpoints/faq#using-w%26b-with-dedicated-fine-tuning) * [Endpoints documentation on model deployment](/guides/dedicated_endpoints/endpoints) # RAG app with LlamaIndex Source: https://friendli.ai/docs/guides/tutorials/rag-app-with-llamaindex You can view the content [here](https://friendli.ai/blog/llamaindex-rag-app-friendli-engine). # Tool calling with Serverless Endpoints Source: https://friendli.ai/docs/guides/tutorials/tool-calling-with-serverless-endpoints Build AI agents with Friendli Serverless Endpoints using tool calling for dynamic, real-time interactions with LLMs. export const RoundedBorderBox = ({children, caption}) =>
{children} {caption &&

{caption}

}
; ## Goals * Use tool calling to build your own AI agent with [**Friendli Serverless Endpoints**](https://friendli.ai/products/serverless-endpoints) * Check out the examples below to see how you can interact with state-of-the-art language models while letting them search the web, run Python code, etc. * Feel free to make your own custom tools! ## Getting Started 1. Head to [**https://friendli.ai**](https://friendli.ai/get-started/serverless-endpoints), and create an account. 2. Grab a [Friendli Token](https://friendli.ai/suite/setting/tokens) to use Friendli Serverless Endpoints within an agent. ## πŸš€ Step 1. Playground UI Experience tool calling on the Playground! Sidebar
Web Search Tool 1. On your left sidebar, click the 'Serverless Endpoints' option to access the playground page. 2. You will see models that can be used as Serverless Endpoints. Choose the one you want and select the endpoint. 3. Click 'Tools' button, select Search tool, and enter a query to see the response. πŸ˜€ ## πŸš€ Step 2. Tool Calling Search interesting information using the `web:search` tool. This time, let's try it by writing python code. 1. Add the user's input as an `user` role message. 2. Add the `web:search` tool to the tools option. ```python # pip install friendli import os from friendli import SyncFriendli with SyncFriendli( token=os.getenv("FRIENDLI_TOKEN", ""), ) as friendli: res = friendli.serverless.tool_assisted_chat.complete( model="meta-llama-3.1-8b-instruct", messages=[ { "role": "user", "content": "Find information on the popular movies currently showing in theaters and provide their ratings.", }, ], tools=[{"type": "web:search"}], max_tokens=200, ) print(res) ``` ## πŸš€ Step 3. Multiple tool calling Use multiple tools at once to calculate "How long it will take you to buy a house in the San Francisco Bay Area based on your annual salary". Here is the available built-in tools. * `math:calculator` (tool for calculating arithmetic operations) * `math:statistics` (tool for analyzing statistic data) * `math:calendar` (tool for handling date-related data) * `web:search` (tool for retrieving data through the web search) * `web:url` (tool for extracting data from a given website) * `code:python-interpreter` (tool for writing and executing python code) * `file:text` (tool for extracting text data from a given file) ### Example Answer sheet ``` Prompt: My annual salary is $ 100k. How long it will take to buy a house in San Francisco Bay Area? (`web:search` & `math:calculator` used) Answer: Based on the web search results, the median price of an existing single-family home in the Bay Area is around $1.25 million. Using a calculator to calculate how long it would take to buy a house in the San Francisco Bay Area with an annual salary of $100,000, we get: $1,200,000 (house price) / $100,000 (annual salary) = 12 years So, it would take approximately 12 years to buy a house in the San Francisco Bay Area with an annual salary of $100,000, assuming you save your entire salary each year and don't consider other factors like interest rates, taxes, and living expenses. ``` ## πŸš€ Step 4. Build a custom tool Build your own creative tool. We will show you how to make a custom tool that retrieves temperature information. (Completed code snippet is provided at the bottom) 1. **Define a function for using as a custom tool** ```python def get_temperature(location: str) -> int: """Mock function that returns the city temperature""" if "new york" in location.lower(): return 45 if "san francisco" in location.lower(): return 72 return 30 ``` 2. **Send a function calling inference request** 1. Add the user's input as an `user` role message. 2. The information about the custom function (e.g., `get_temperature`) goes into the tools option. The function's parameters are described in JSON schema. 3. The response includes the `arguments` field, which are values extracted from the user's input that can be used as parameters of the custom function. ```python # pip install friendli import os from friendli import SyncFriendli token = os.environ.get("FRIENDLI_TOKEN") or "YOUR_FRIENDLI_TOKEN" client= SyncFriendli(token=token) user_prompt = "I live in New York. What should I wear for today's weather?" 
messages = [
    {
        "role": "user",
        "content": user_prompt,
    },
]
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_temperature",
            "description": "Get the temperature information in a given location.",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The name of current location e.g., New York",
                    },
                },
            },
        },
    },
]

chat = client.serverless.chat.complete(
    model="meta-llama-3.3-70b-instruct",
    messages=messages,
    tools=tools,
    temperature=0,
    frequency_penalty=1,
)
print(chat)
```

3. **Generate the final response using the tool calling results**

   1. Add the `tool_calls` response as an `assistant` role message.
   2. Add the result obtained by calling the `get_temperature` function as a `tool` message to the Chat API again.

```python
import json

func_kwargs = json.loads(chat.choices[0].message.tool_calls[0].function.arguments)
temperature_info = get_temperature(**func_kwargs)

messages.append(
    {
        "role": "assistant",
        "tool_calls": [
            tool_call.model_dump() for tool_call in chat.choices[0].message.tool_calls
        ],
    }
)
messages.append(
    {
        "role": "tool",
        "content": str(temperature_info),
        "tool_call_id": chat.choices[0].message.tool_calls[0].id,
    }
)

chat_w_info = client.serverless.chat.complete(
    model="meta-llama-3.3-70b-instruct",
    tools=tools,
    messages=messages,
)

for choice in chat_w_info.choices:
    print(choice.message.content)
```

* **Complete Code Snippet**

```python
# pip install friendli
import json
import os

from friendli import SyncFriendli

token = os.environ.get("FRIENDLI_TOKEN") or "YOUR_FRIENDLI_TOKEN"

client = SyncFriendli(token=token)

user_prompt = "I live in New York. What should I wear for today's weather?"

messages = [
    {
        "role": "user",
        "content": user_prompt,
    },
]
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_temperature",
            "description": "Get the temperature information in a given location.",
            "parameters": {
                "type": "object",
                "properties": {
                    "location": {
                        "type": "string",
                        "description": "The name of current location e.g., New York",
                    },
                },
            },
        },
    },
]

chat = client.serverless.chat.complete(
    model="meta-llama-3.3-70b-instruct",
    messages=messages,
    tools=tools,
    temperature=0,
    frequency_penalty=1,
)

def get_temperature(location: str) -> int:
    """Mock function that returns the city temperature"""
    if "new york" in location.lower():
        return 45
    if "san francisco" in location.lower():
        return 72
    return 30

func_kwargs = json.loads(chat.choices[0].message.tool_calls[0].function.arguments)
temperature_info = get_temperature(**func_kwargs)

messages.append(
    {
        "role": "assistant",
        "tool_calls": [
            tool_call.model_dump() for tool_call in chat.choices[0].message.tool_calls
        ],
    }
)
messages.append(
    {
        "role": "tool",
        "content": str(temperature_info),
        "tool_call_id": chat.choices[0].message.tool_calls[0].id,
    }
)

chat_w_info = client.serverless.chat.complete(
    model="meta-llama-3.3-70b-instruct",
    tools=tools,
    messages=messages,
)

for choice in chat_w_info.choices:
    print(choice.message.content)
```

## 🎉 Congratulations!

Following the above instructions, we've walked through the whole process of defining and using a custom tool to generate an accurate and rich answer from an LLM! Brainstorm creative ideas for your agent by reading our blog articles!
* [**Building an AI Agent for Google Calendar**](https://friendli.ai/blog/ai-agent-google-calendar) * [**Hassle-free LLM Fine-tuning with FriendliAI and Weights & Biases**](https://friendli.ai/blog/llm-fine-tuning-friendliai-wandb) * [**Building AI Agents Using Function Calling with LLMs**](https://friendli.ai/blog/ai-agents-function-calling) * [**Function Calling: Connecting LLMs with Functions and APIs**](https://friendli.ai/blog/llm-function-calling) # Deploy from W&B Registry with Webhook Source: https://friendli.ai/docs/guides/tutorials/wandb-registry-with-dedicated-endpoints Hands-on tutorial for launching and deploying LLMs using Friendli Dedicated Endpoints with Weights & Biases artifacts through webhook automation. export const RoundedBorderBox = ({children, caption}) =>
{children} {caption &&

{caption}

}
; ## Introduction This tutorial is designed to guide you through the process of easily deploying your models from the [W\&B Registry](https://docs.wandb.ai/guides/core/registry/) to Friendli Dedicated Endpoints in the W\&B UI. Through a series of step-by-step instructions and hands-on examples, you’ll learn how to: * **Configure a webhook** in W\&B to trigger deployments to Friendli Dedicated Endpoints. * **Create a [webhook automation](https://docs.wandb.ai/guides/core/automations/create-automations/webhook/)** to automatically deploy model artifacts when adding new versions. * **Deploy a model artifact** to Friendli Dedicated Endpoints by adding an alias in the W\&B Registry. * **Understand how adding and removing aliases** affects deployments on Friendli Dedicated Endpoints. ### Why use W\&B webhook automation with Friendli Dedicated Endpoints? W\&B users often rely on W\&B Registry to manage the lifecycle of models – from tracking experiment artifacts to promoting the best-performing models for production use. As a W\&B user, integrating Friendli Dedicated Endpoints directly into this workflow allows you to: * **Streamline deployment**: Transition your models from experimentation to production with minimal effort. By leveraging W\&B’s aliasing system and FriendliAI’s automated infrastructure, you eliminate the need for custom scripts or manual configurations. * **Ensure deployment consistency**: Friendli Dedicated Endpoints include support for `idempotencyKey` to ensure the reliability of automated workflows. Each deployment trigger via webhook automation is tracked with a unique `idempotencyKey`, ensuring that operations like endpoint creation or updates are processed exactly once. It prevents duplicate or conflicting operations, giving you confidence in the consistency of your deployment. By the end of this tutorial, you’ll be equipped with the knowledge and skills necessary to seamlessly transfer your models from W\&B Registry to Friendli Dedicated Endpoints for efficient deployment. So, let’s get started and explore the possibilities of Friendli Dedicated Endpoints! ## Prerequisites * A Friendli Suite account with access to [Friendli Dedicated Endpoints](https://friendli.ai/docs/guides/dedicated_endpoints/introduction). * A [personal access token](https://friendli.ai/docs/guides/personal_access_tokens) generated through Friendli Suite. ## Step 1: Create a secret Weights & Biases Team settings 1. Navigate to the [team’s page](https://wandb.ai/home) on W\&B and click on **Team settings**. 2. Scroll down to the **Team secrets** section and click **New secret**. 3. Go to [Friendli Suite](https://friendli.ai/suite) and navigate to **[Personal settings > Tokens](https://friendli.ai/suite/setting/tokens)** and click **Create new token**. 4. Copy your [personal access token](https://friendli.ai/docs/guides/personal_access_tokens). 5. Return to W\&B and fill in the **Secret** with the personal access token generated through Friendli Suite. Weights & Biases add team secret ## Step 2: Configure a webhook 1. From the same W\&B team settings page, click on **New webhook** in the **Webhooks** section. 2. Fill in the **URL** field with **Friendli Suite Rest API URL** (see more details [here](/openapi/dedicated/endpoint/wandb-artifact-create)) and **Access token** field with the secret already created through Friendli Suite. Configure webhook from Weights & Biases ## Step 3: Create a webhook automation 1. Go to your W\&B Registry Model and click on **View details** of the model you want to deploy. 
2. Click on **Create automation** in the **Automations** section. Create webhook automation from Weights & Biases 3. Select **An artifact alias is added** for the **Event**. 4. Enter an alias you want to use to trigger the deployment for the **Alias regex**. Add an alias from Weights & Biases 5. Select the **Webhooks** for **Action type**. 6. Select the webhook configured with Friendli Dedicated Endpoints for **Webhook**. 7. Fill out the box by referring to the following example for **Payload**. Webhook automation payload example #### Example: Configuration for payload ```json { "wandbArtifactVersionName": "${artifact_version_string}" } ``` | Field | Description | | -------------------------- | ------------------------------------------ | | `wandbArtifactVersionName` | Specific model artifact version from W\&B. | ```json { "wandbArtifactVersionName": "${artifact_version_string}", "name": "Generated from WandB ${project_name}/${artifact_collection_name}", "projectId": "project-id", "idempotencyKey": "${alias}", "accelerator": { "type": "NVIDIA H100", "count": 1 }, "autoscalingPolicy": { "minReplica": 0, "maxReplica": 2, "cooldownPeriod": 300 } } ``` | Field | Description | | -------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------- | | `wandbArtifactVersionName` | Specific model artifact version from W\&B. | | `name` | Name of the endpoint. | | `projectId` | Specific project ID of where the endpoint will be created. | | `idempotencyKey` | Unique value to track which webhook automation triggered an endpoint roll out. Use any unique value, but using the example value provided is recommended. | | `accelerator` | Hardware for the endpoint. | | `autoscalingPolicy` | Autoscaling settings for the endpoint. | To gain more control over GPU resources for an endpoint, configure the `accelerator` field by specifying the desired type and count. This is particularly useful for serving large models that require model or data parallelism. ```json { "wandbArtifactVersionName": "${artifact_version_string}", "name": "Generated from WandB ${project_name}/${artifact_collection_name}", "accelerator": { "type": "NVIDIA H100", "count": 4 }, } ``` | Field | Description | | ------------------- | ---------------------------------- | | `accelerator.type` | Specifies the instance type. | | `accelerator.count` | Specifies the number of instances. | View more details about each field [here](/openapi/dedicated/endpoint/wandb-artifact-create). ## Step 4: Deploy a model artifact Deploy your model artifact to Friendli Dedicated Endpoints by simply adding the alias set in **Step 3** to a model artifact version! Deploy model artifact to Friendli Dedicated Endpoints After adding the alias, you can see the endpoint created in Friendli Dedicated Endpoints. Endpoint created in Friendli Dedicated Endpoints UI ## Step 5: Roll out a model artifact (Advanced) To roll out an endpoint to a new model artifact version, simply add the same alias to the new version you want to deploy. This updates the endpoint to use the new model artifact version. After assigning the alias, the endpoint will update to reflect the new version in Friendli Dedicated Endpoints. Roll out to version 1 in Weights & Biases An `idempotencyKey` is required to roll out an endpoint between different model artifact versions. 
```json {9} { "wandbArtifactVersionName": "${artifact_version_string}", "name": "Generated from WandB ${project_name}/${artifact_collection_name}", "accelerator": { "type": "NVIDIA H100", "count": 1 }, "projectId": "project-id", "idempotencyKey": "${alias}", "autoscalingPolicy": { "minReplica": 0, "maxReplica": 2, "cooldownPeriod": 300 } } ``` ## Step 6: Track the history of deployment versions Use the Friendli Dedicated Endpoints versioning feature to track the history of your model deployments and maintain a clear record of every update. By adding an alias to a model artifact version, you can deploy models and roll out updates across versions efficiently, without needing to create a new endpoint from scratch. * When an alias is reassigned to a different version, the existing endpoint will automatically roll out to the new version. Friendli Dedicated Endpoints Versions In the diagram, * `v0` represents the first deployed version of the model when the endpoint was created. * `v1` is a newer model artifact version that the alias was reassigned to, triggering a rollout to update the endpoint accordingly. View more details about the versioning feature [here](/guides/dedicated_endpoints/versions). ## Frequently Asked Questions The model artifact version will be deployed as the number of aliases added. Within a model collection, only one artifact version can have a given alias at any time. Therefore, adding an alias to a new artifact version will automatically remove it from the previously aliased version with the same alias. One webhook automation is assigned to one Friendli Dedicated Endpoint. Nothing happens to the endpoint. Removing an alias will not delete the endpoint. However, if you add the removed alias to a new model artifact version, the deployed endpoint will roll out to that version. If an `idempotencyKey` is included in the payload, moving an alias to a different model artifact version will reassign the created endpoint to the new version within the same project. When adding an alias to a model artifact version for the first time, an endpoint will be created in either an existing or a new project within your default team of Friendli Suite. If `projectId` is specified, the endpoint will be made in an existing project. Otherwise, a new project will be created. ## Feedback or issue If you have any feedback or issues about the integration with Friendli Dedicated Endpoints, please ask for support by sending an email to [Support](mailto:support@friendli.ai). # Vision: Image understanding with Friendli Source: https://friendli.ai/docs/guides/vision Guide to using Friendli's Vision feature for image analysis. Covers usage via Playground and API (URL & Base64 examples). The Vision feature is available when the model supports vision capabilities. Friendli is equipped with a new Vision feature that can understand and analyze images, opening up exciting possibilities for multi-modal interactions. This guide explains how to work with images in Friendli, including best practices and code examples. ### How to Use Vision Utilize Friendli's Vision features through the following: * Select and test a vision model at [friendli.ai/playground](https://friendli.ai/playground). * Use the API to process images and receive the model's responses, referring to the methods described in this document. ### Supported Image Formats Supports formats supported by the PIL library, including jpg, png and avif. 
* JPEG (.jpeg and .jpg) * PNG (.png) * AVIF (.avif) ### Using the API ```python URL-based image {22} import os from openai import OpenAI client = OpenAI( base_url="https://api.friendli.ai/dedicated/v1", api_key=os.environ.get("FRIENDLI_TOKEN"), ) image_url = "https://upload.wikimedia.org/wikipedia/commons/9/9e/Ours_brun_parcanimalierpyrenees_1.jpg" completion = client.chat.completions.create( # Replace YOUR_ENDPOINT_ID with the ID of your endpoint, e.g. "zbimjgovmlcb" model="YOUR_ENDPOINT_ID", messages=[ { "role": "user", "content": [ { "type": "text", "text": "What kind of animal is shown in the image?", }, {"type": "image_url", "image_url": {"url": image_url}}, ], }, ], stream=False ) print(completion.choices[0].message.content) ``` ```python Base64-encoded image {28-30} import base64, requests, os from openai import OpenAI client = OpenAI( base_url="https://api.friendli.ai/dedicated/v1", api_key=os.environ.get("FRIENDLI_TOKEN"), ) image_url = "https://upload.wikimedia.org/wikipedia/commons/9/9e/Ours_brun_parcanimalierpyrenees_1.jpg" image_media_type = "image/jpg" image_base64 = base64.standard_b64encode(requests.get(image_url).content).decode( "utf-8" ) completion = client.chat.completions.create( # Replace YOUR_ENDPOINT_ID with the ID of your endpoint, e.g. "zbimjgovmlcb" model="YOUR_ENDPOINT_ID", messages=[ { "role": "user", "content": [ { "type": "text", "text": "What kind of animal is shown in the image?", }, { "type": "image_url", "image_url": { "url": f"data:{image_media_type};base64,{image_base64}" }, }, ], }, ], ) print(completion.choices[0].message.content) ``` # Container chat completions Source: https://friendli.ai/docs/openapi/container/chat-completions post /v1/chat/completions Given a list of messages forming a conversation, the model generates a response. When streaming mode is used (i.e., `stream` option is set to `true`), the response is in MIME type `text/event-stream`. Otherwise, the content type is `application/json`. You can view the schema of the streamed sequence of chunk objects in streaming mode [here](/openapi/container/chat-completions-chunk-object). # Container chat completions chunk object Source: https://friendli.ai/docs/openapi/container/chat-completions-chunk-object Represents a streamed chunk of a chat completions response returned by model, based on the provided input. ```json Response data: { "id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941", "object": "chat.completion.chunk", "choices": [ { "index": 0, "delta": { "role": "assistant", "content": "This" }, "finish_reason": null, "logprobs": null } ], "usage": null, "created": 1726294381 } data: { "id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941", "object": "chat.completion.chunk", "choices": [ { "index": 0, "delta": { "content": " is" }, "finish_reason": null, "logprobs": null } ], "usage": null, "created": 1726294381 } ... 
data: { "id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941", "object": "chat.completion.chunk", "choices": [ { "index": 0, "delta": {}, "finish_reason": "stop", "logprobs": null } ], "usage": null, "created": 1726294383 } data: { "id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941", "object": "chat.completion.chunk", "choices": [], "usage": { "prompt_tokens": 8, "completion_tokens": 4, "total_tokens": 12 }, "created": 1726294402 } data: [DONE] ``` ```json With tools data: { "id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941", "object": "chat.completion.chunk", "choices": [ { "index": 0, "delta": { "role": "assistant", "content": "This" }, "finish_reason": null, "logprobs": null } ], "usage": null, "created": 1726294442 } ... data: { "id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941", "object": "chat.completion.chunk", "choices": [ { "index": 0, "delta": { "role": "assistant", "tool_calls": [ { "index": 0, "id": "call_TARbemDG9CFdwuoaQBTRXiYK", "type": "function", "function": { "name": "func", "arguments": "{\"" } } ] }, "finish_reason": null, "logprobs": null } ], "usage": null, "created": 1726294442 } data: { "id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941", "object": "chat.completion.chunk", "choices": [ { "index": 0, "delta": { "role": "assistant", "tool_calls": [ { "index": 0, "type": "function", "function": { "arguments": "arg" } } ] }, "finish_reason": null, "logprobs": null } ], "usage": null, "created": 1726294442 } ... data: { "id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941", "object": "chat.completion.chunk", "choices": [ { "index": 0, "delta": { "role": "assistant", "tool_calls": [ { "index": 0, "type": "function", "function": { "arguments": "}" } } ] }, "finish_reason": null, "logprobs": null } ], "usage": null, "created": 1726294442 } data: { "id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941", "object": "chat.completion.chunk", "choices": [ { "index": 0, "delta": {}, "finish_reason": "tool_calls", "logprobs": null } ], "usage": null, "created": 1726294442 } data: { "id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941", "object": "chat.completion.chunk", "choices": [], "usage": { "prompt_tokens": 468, "completion_tokens": 59, "total_tokens": 527 }, "created": 1726294443 } data: [DONE] ``` A unique ID of the chat completion. The object type, which is always set to `chat.completion.chunk`. The model to generate the completion. The index of the choice in the list of generated choices. Role of the generated message author, in this case `assistant`. The contents of the assistant message. The index of tool call being generated. The ID of the tool call. The type of the tool, which is always set to `function`. The name of the function to call. The arguments for calling the function, generated by the model in JSON format. Ensure to validate these arguments in your code before invoking the function since the model may not always produce valid JSON. Termination condition of the generation. `stop` means the API returned the full chat completions generated by the model without running into any limits. `length` means the generation exceeded `max_tokens` or the conversation exceeded the max context length. `tool_calls` means the API has generated tool calls. Available options: `stop`, `length`, `tool_calls` Log probability information for the choice. A list of message content tokens with log probability information. The token. The log probability of this token. A list of integers representing the UTF-8 bytes representation of the token. 
Useful in instances where characters are represented by multiple tokens and their byte representations must be combined to generate the correct text representation. Can be `null` if there is no bytes representation for the token. List of the most likely tokens and their log probability, at this token position. The token. The log probability of this token. A list of integers representing the UTF-8 bytes representation of the token. Useful in instances where characters are represented by multiple tokens and their byte representations must be combined to generate the correct text representation. Can be `null` if there is no bytes representation for the token. Number of tokens in the prompt. Number of tokens in the generated chat completions. Total number of tokens used in the request (`prompt_tokens` + `completion_tokens`). The Unix timestamp (in seconds) for when the token sampled. # Container completions Source: https://friendli.ai/docs/openapi/container/completions post /v1/completions Generate text based on the given text prompt. When streaming mode is used (i.e., `stream` option is set to `true`), the response is in MIME type `text/event-stream`. Otherwise, the content type is `application/json`. You can view the schema of the streamed sequence of chunk objects in streaming mode [here](/openapi/container/completions-chunk-object). # Container completions chunk object Source: https://friendli.ai/docs/openapi/container/completions-chunk-object Represents a streamed chunk of a completions response returned by model, based on the provided input. ```json Response data: { "id": "cmpl-26a1e10db8544bc3adb488d2d205288b", "object": "text_completion", "choices": [ { "index": 0, "text": " such", "token": 1778, "finish_reason": null, "logprobs": null } ], "created": 1733382157 } data: { "id": "cmpl-26a1e10db8544bc3adb488d2d205288b", "object": "text_completion", "choices": [ { "index": 0, "text": " as", "token": 439, "finish_reason": null, "logprobs": null } ], "created": 1733382157 } ... data: { "id": "cmpl-26a1e10db8544bc3adb488d2d205288b", "object": "text_completion", "choices": [ { "index": 0, "text": "", "finish_reason": "length", "logprobs": null } ], "created": 1733382157 } data: { "id": "cmpl-26a1e10db8544bc3adb488d2d205288b", "object": "text_completion", "choices": [], "usage": { "prompt_tokens": 5, "completion_tokens": 10, "total_tokens": 15 }, "created": 1733382157 } data: [DONE] ``` A unique ID of the completion. The object type, which is always set to `text_completion`. The model to generate the completion. The index of the choice in the list of generated choices. The text. The token. Termination condition of the generation. `stop` means the API returned the full completions generated by the model without running into any limits. `length` means the generation exceeded `max_tokens` or the conversation exceeded the max context length. Available options: `stop`, `length` Log probability information for the choice. The starting character position of each token in the generated text, useful for mapping tokens back to their exact location for detailed analysis. The log probabilities of each generated token, indicating the model's confidence in selecting each token. A list of individual tokens generated in the completion, representing segments of text such as words or pieces of words. A list of dictionaries, where each dictionary represents the top alternative tokens considered by the model at a specific position in the generated text, along with their log probabilities. 
The number of items in each dictionary matches the value of `logprobs`. Number of tokens in the prompt. Number of tokens in the generated completions. Total number of tokens used in the request (`prompt_tokens` + `completion_tokens`). The Unix timestamp (in seconds) for when the token sampled. # Container detokenization Source: https://friendli.ai/docs/openapi/container/detokenization post /v1/detokenize By giving a list of tokens, generate a detokenized output text string. # Container image generations Source: https://friendli.ai/docs/openapi/container/image-generations post /v1/images/generations Given a description, the model generates image. # Container overview Source: https://friendli.ai/docs/openapi/container/overview OpenAPI reference of Friendli Container API. ### Inference Discover how to generate text through interactive conversations. Learn how to generate text. Explore the process of breaking down text into smaller tokens for machine processing. Learn how to reconstruct tokenized text back into its original, human-readable form. Learn how to generate images. # Container tokenization Source: https://friendli.ai/docs/openapi/container/tokenization post /v1/tokenize By giving a text input, generate a tokenized output of token IDs. # Add samples Source: https://friendli.ai/docs/openapi/dataset/add-samples post /beta/dataset/{dataset_id}/split/{split_id}/sample Add samples to dataset. To request successfully, it is required to enter a **Friendli Token** (e.g. flp\_XXX) in the **Bearer Token** field. Refer to the [authentication section](/openapi/introduction#authentication) on our introduction page to learn more and [visit here](https://friendli.ai/suite/setting/tokens) to generate your token. This API is currently in **Beta**. While we strive to provide a stable and reliable experience, this feature is still under active development. As a result, you may encounter unexpected behavior or limitations. We encourage you to provide feedback to help us improve the feature before its official release. * { e.preventDefault(); window.Intercom('show'); }}>Feature request & feedback * { e.preventDefault(); window.Intercom('showNewMessage'); }}>Contact support # Create a new dataset Source: https://friendli.ai/docs/openapi/dataset/create-a-new-dataset post /beta/dataset Create a new dataset. To request successfully, it is required to enter a **Friendli Token** (e.g. flp\_XXX) in the **Bearer Token** field. Refer to the [authentication section](/openapi/introduction#authentication) on our introduction page to learn more and [visit here](https://friendli.ai/suite/setting/tokens) to generate your token. This API is currently in **Beta**. While we strive to provide a stable and reliable experience, this feature is still under active development. As a result, you may encounter unexpected behavior or limitations. We encourage you to provide feedback to help us improve the feature before its official release. * { e.preventDefault(); window.Intercom('show'); }}>Feature request & feedback * { e.preventDefault(); window.Intercom('showNewMessage'); }}>Contact support # Create a new split Source: https://friendli.ai/docs/openapi/dataset/create-a-split post /beta/dataset/{dataset_id}/split Create a new split. To request successfully, it is required to enter a **Friendli Token** (e.g. flp\_XXX) in the **Bearer Token** field. 
Refer to the [authentication section](/openapi/introduction#authentication) on our introduction page to learn more and [visit here](https://friendli.ai/suite/setting/tokens) to generate your token. This API is currently in **Beta**. While we strive to provide a stable and reliable experience, this feature is still under active development. As a result, you may encounter unexpected behavior or limitations. We encourage you to provide feedback to help us improve the feature before its official release. * { e.preventDefault(); window.Intercom('show'); }}>Feature request & feedback * { e.preventDefault(); window.Intercom('showNewMessage'); }}>Contact support # Create a new version Source: https://friendli.ai/docs/openapi/dataset/create-a-version post /beta/dataset/{dataset_id}/version Create a new version. To request successfully, it is required to enter a **Friendli Token** (e.g. flp\_XXX) in the **Bearer Token** field. Refer to the [authentication section](/openapi/introduction#authentication) on our introduction page to learn more and [visit here](https://friendli.ai/suite/setting/tokens) to generate your token. This API is currently in **Beta**. While we strive to provide a stable and reliable experience, this feature is still under active development. As a result, you may encounter unexpected behavior or limitations. We encourage you to provide feedback to help us improve the feature before its official release. * { e.preventDefault(); window.Intercom('show'); }}>Feature request & feedback * { e.preventDefault(); window.Intercom('showNewMessage'); }}>Contact support # Delete a version Source: https://friendli.ai/docs/openapi/dataset/delete-a-version delete /beta/dataset/{dataset_id}/version/{version_id} Delete a version. To request successfully, it is required to enter a **Friendli Token** (e.g. flp\_XXX) in the **Bearer Token** field. Refer to the [authentication section](/openapi/introduction#authentication) on our introduction page to learn more and [visit here](https://friendli.ai/suite/setting/tokens) to generate your token. This API is currently in **Beta**. While we strive to provide a stable and reliable experience, this feature is still under active development. As a result, you may encounter unexpected behavior or limitations. We encourage you to provide feedback to help us improve the feature before its official release. * { e.preventDefault(); window.Intercom('show'); }}>Feature request & feedback * { e.preventDefault(); window.Intercom('showNewMessage'); }}>Contact support # Delete a dataset Source: https://friendli.ai/docs/openapi/dataset/delete-dataset delete /beta/dataset/{dataset_id} Delete a dataset. To request successfully, it is required to enter a **Friendli Token** (e.g. flp\_XXX) in the **Bearer Token** field. Refer to the [authentication section](/openapi/introduction#authentication) on our introduction page to learn more and [visit here](https://friendli.ai/suite/setting/tokens) to generate your token. This API is currently in **Beta**. While we strive to provide a stable and reliable experience, this feature is still under active development. As a result, you may encounter unexpected behavior or limitations. We encourage you to provide feedback to help us improve the feature before its official release. 
* { e.preventDefault(); window.Intercom('show'); }}>Feature request & feedback * { e.preventDefault(); window.Intercom('showNewMessage'); }}>Contact support # Delete samples Source: https://friendli.ai/docs/openapi/dataset/delete-samples post /beta/dataset/{dataset_id}/split/{split_id}/sample/delete Delete samples. To request successfully, it is required to enter a **Friendli Token** (e.g. flp\_XXX) in the **Bearer Token** field. Refer to the [authentication section](/openapi/introduction#authentication) on our introduction page to learn more and [visit here](https://friendli.ai/suite/setting/tokens) to generate your token. This API is currently in **Beta**. While we strive to provide a stable and reliable experience, this feature is still under active development. As a result, you may encounter unexpected behavior or limitations. We encourage you to provide feedback to help us improve the feature before its official release. * { e.preventDefault(); window.Intercom('show'); }}>Feature request & feedback * { e.preventDefault(); window.Intercom('showNewMessage'); }}>Contact support # Delete a split Source: https://friendli.ai/docs/openapi/dataset/delete-split delete /beta/dataset/{dataset_id}/split/{split_id} Delete a split. To request successfully, it is required to enter a **Friendli Token** (e.g. flp\_XXX) in the **Bearer Token** field. Refer to the [authentication section](/openapi/introduction#authentication) on our introduction page to learn more and [visit here](https://friendli.ai/suite/setting/tokens) to generate your token. This API is currently in **Beta**. While we strive to provide a stable and reliable experience, this feature is still under active development. As a result, you may encounter unexpected behavior or limitations. We encourage you to provide feedback to help us improve the feature before its official release. * { e.preventDefault(); window.Intercom('show'); }}>Feature request & feedback * { e.preventDefault(); window.Intercom('showNewMessage'); }}>Contact support # Get dataset info Source: https://friendli.ai/docs/openapi/dataset/get-dataset-info get /beta/dataset/{dataset_id} Get dataset info. To request successfully, it is required to enter a **Friendli Token** (e.g. flp\_XXX) in the **Bearer Token** field. Refer to the [authentication section](/openapi/introduction#authentication) on our introduction page to learn more and [visit here](https://friendli.ai/suite/setting/tokens) to generate your token. This API is currently in **Beta**. While we strive to provide a stable and reliable experience, this feature is still under active development. As a result, you may encounter unexpected behavior or limitations. We encourage you to provide feedback to help us improve the feature before its official release. * { e.preventDefault(); window.Intercom('show'); }}>Feature request & feedback * { e.preventDefault(); window.Intercom('showNewMessage'); }}>Contact support # Get split info Source: https://friendli.ai/docs/openapi/dataset/get-split-info get /beta/dataset/{dataset_id}/split/{split_id} Get split info. To request successfully, it is required to enter a **Friendli Token** (e.g. flp\_XXX) in the **Bearer Token** field. Refer to the [authentication section](/openapi/introduction#authentication) on our introduction page to learn more and [visit here](https://friendli.ai/suite/setting/tokens) to generate your token. This API is currently in **Beta**. While we strive to provide a stable and reliable experience, this feature is still under active development. 
As a result, you may encounter unexpected behavior or limitations. We encourage you to provide feedback to help us improve the feature before its official release. * { e.preventDefault(); window.Intercom('show'); }}>Feature request & feedback * { e.preventDefault(); window.Intercom('showNewMessage'); }}>Contact support # Get version info Source: https://friendli.ai/docs/openapi/dataset/get-version-info get /beta/dataset/{dataset_id}/version/{version_id} Get version info. To request successfully, it is required to enter a **Friendli Token** (e.g. flp\_XXX) in the **Bearer Token** field. Refer to the [authentication section](/openapi/introduction#authentication) on our introduction page to learn more and [visit here](https://friendli.ai/suite/setting/tokens) to generate your token. This API is currently in **Beta**. While we strive to provide a stable and reliable experience, this feature is still under active development. As a result, you may encounter unexpected behavior or limitations. We encourage you to provide feedback to help us improve the feature before its official release. * { e.preventDefault(); window.Intercom('show'); }}>Feature request & feedback * { e.preventDefault(); window.Intercom('showNewMessage'); }}>Contact support # List datasets Source: https://friendli.ai/docs/openapi/dataset/list-datasets get /beta/dataset List datasets. To request successfully, it is required to enter a **Friendli Token** (e.g. flp\_XXX) in the **Bearer Token** field. Refer to the [authentication section](/openapi/introduction#authentication) on our introduction page to learn more and [visit here](https://friendli.ai/suite/setting/tokens) to generate your token. This API is currently in **Beta**. While we strive to provide a stable and reliable experience, this feature is still under active development. As a result, you may encounter unexpected behavior or limitations. We encourage you to provide feedback to help us improve the feature before its official release. * { e.preventDefault(); window.Intercom('show'); }}>Feature request & feedback * { e.preventDefault(); window.Intercom('showNewMessage'); }}>Contact support # List samples Source: https://friendli.ai/docs/openapi/dataset/list-samples get /beta/dataset/{dataset_id}/split/{split_id}/sample List samples. To request successfully, it is required to enter a **Friendli Token** (e.g. flp\_XXX) in the **Bearer Token** field. Refer to the [authentication section](/openapi/introduction#authentication) on our introduction page to learn more and [visit here](https://friendli.ai/suite/setting/tokens) to generate your token. This API is currently in **Beta**. While we strive to provide a stable and reliable experience, this feature is still under active development. As a result, you may encounter unexpected behavior or limitations. We encourage you to provide feedback to help us improve the feature before its official release. * { e.preventDefault(); window.Intercom('show'); }}>Feature request & feedback * { e.preventDefault(); window.Intercom('showNewMessage'); }}>Contact support # List splits Source: https://friendli.ai/docs/openapi/dataset/list-splits get /beta/dataset/{dataset_id}/split List splits. To request successfully, it is required to enter a **Friendli Token** (e.g. flp\_XXX) in the **Bearer Token** field. Refer to the [authentication section](/openapi/introduction#authentication) on our introduction page to learn more and [visit here](https://friendli.ai/suite/setting/tokens) to generate your token. This API is currently in **Beta**. 
While we strive to provide a stable and reliable experience, this feature is still under active development. As a result, you may encounter unexpected behavior or limitations. We encourage you to provide feedback to help us improve the feature before its official release. * { e.preventDefault(); window.Intercom('show'); }}>Feature request & feedback * { e.preventDefault(); window.Intercom('showNewMessage'); }}>Contact support # List versions Source: https://friendli.ai/docs/openapi/dataset/list-versions get /beta/dataset/{dataset_id}/version List versions. To request successfully, it is required to enter a **Friendli Token** (e.g. flp\_XXX) in the **Bearer Token** field. Refer to the [authentication section](/openapi/introduction#authentication) on our introduction page to learn more and [visit here](https://friendli.ai/suite/setting/tokens) to generate your token. This API is currently in **Beta**. While we strive to provide a stable and reliable experience, this feature is still under active development. As a result, you may encounter unexpected behavior or limitations. We encourage you to provide feedback to help us improve the feature before its official release. * { e.preventDefault(); window.Intercom('show'); }}>Feature request & feedback * { e.preventDefault(); window.Intercom('showNewMessage'); }}>Contact support # Dataset overview Source: https://friendli.ai/docs/openapi/dataset/overview OpenAPI reference of Friendli Dataset API. ### Dataset Management (Beta) Discover how to list datasets. Discover how to list versions of a dataset. Discover how to list splits of a dataset version. Discover how to list samples in a dataset split. Discover how to get information about a dataset. Discover how to get information about a dataset version. Discover how to get information about a dataset split. Discover how to create a new dataset. Discover how to create a new version of a dataset. Discover how to create a new split in a dataset. Discover how to add samples to a dataset. Discover how to delete samples from a dataset. Discover how to update samples in a dataset. Discover how to delete a dataset version. Discover how to delete a dataset. Discover how to delete a dataset split. # Update samples Source: https://friendli.ai/docs/openapi/dataset/update-samples put /beta/dataset/{dataset_id}/split/{split_id}/sample Update samples. To request successfully, it is required to enter a **Friendli Token** (e.g. flp\_XXX) in the **Bearer Token** field. Refer to the [authentication section](/openapi/introduction#authentication) on our introduction page to learn more and [visit here](https://friendli.ai/suite/setting/tokens) to generate your token. This API is currently in **Beta**. While we strive to provide a stable and reliable experience, this feature is still under active development. As a result, you may encounter unexpected behavior or limitations. We encourage you to provide feedback to help us improve the feature before its official release. * { e.preventDefault(); window.Intercom('show'); }}>Feature request & feedback * { e.preventDefault(); window.Intercom('showNewMessage'); }}>Contact support # Dedicated create endpoint Source: https://friendli.ai/docs/openapi/dedicated/endpoint/create post /dedicated/beta/endpoint Create a Dedicated Endpoint deployment for a Hugging Face model. To request successfully, it is mandatory to enter a **Friendli Token** (e.g. flp\_XXX) value in the **Bearer Token** field. 
Refer to the [authentication section](/openapi/introduction#authentication) on our introduction page to learn how to acquire this variable and [visit here](https://friendli.ai/suite/setting/tokens) to generate your token. This API is currently in **Beta**. While we strive to provide a stable and reliable experience, this feature is still under active development. As a result, you may encounter unexpected behavior or limitations. We encourage you to provide feedback to help us improve the feature before its official release. * { e.preventDefault(); window.Intercom('show'); }}>Feature request & feedback * { e.preventDefault(); window.Intercom('showNewMessage'); }}>Contact support # Dedicated delete endpoint Source: https://friendli.ai/docs/openapi/dedicated/endpoint/delete delete /dedicated/beta/endpoint/{endpoint_id} Delete an endpoint. To request successfully, it is mandatory to enter a **Friendli Token** (e.g. flp\_XXX) value in the **Bearer Token** field. Refer to the [authentication section](/openapi/introduction#authentication) on our introduction page to learn how to acquire this variable and [visit here](https://friendli.ai/suite/setting/tokens) to generate your token. This API is currently in **Beta**. While we strive to provide a stable and reliable experience, this feature is still under active development. As a result, you may encounter unexpected behavior or limitations. We encourage you to provide feedback to help us improve the feature before its official release. * { e.preventDefault(); window.Intercom('show'); }}>Feature request & feedback * { e.preventDefault(); window.Intercom('showNewMessage'); }}>Contact support # Dedicated get endpoint Source: https://friendli.ai/docs/openapi/dedicated/endpoint/get-spec get /dedicated/beta/endpoint/{endpoint_id} Given an endpoint ID, return its specification. To request successfully, it is mandatory to enter a **Friendli Token** (e.g. flp\_XXX) value in the **Bearer Token** field. Refer to the [authentication section](/openapi/introduction#authentication) on our introduction page to learn how to acquire this variable and [visit here](https://friendli.ai/suite/setting/tokens) to generate your token. This API is currently in **Beta**. While we strive to provide a stable and reliable experience, this feature is still under active development. As a result, you may encounter unexpected behavior or limitations. We encourage you to provide feedback to help us improve the feature before its official release. * { e.preventDefault(); window.Intercom('show'); }}>Feature request & feedback * { e.preventDefault(); window.Intercom('showNewMessage'); }}>Contact support # Dedicated get endpoint status Source: https://friendli.ai/docs/openapi/dedicated/endpoint/get-status get /dedicated/beta/endpoint/{endpoint_id}/status Given an endpoint ID, return its current status. To request successfully, it is mandatory to enter a **Friendli Token** (e.g. flp\_XXX) value in the **Bearer Token** field. Refer to the [authentication section](/openapi/introduction#authentication) on our introduction page to learn how to acquire this variable and [visit here](https://friendli.ai/suite/setting/tokens) to generate your token. This API is currently in **Beta**. While we strive to provide a stable and reliable experience, this feature is still under active development. As a result, you may encounter unexpected behavior or limitations. We encourage you to provide feedback to help us improve the feature before its official release. 
* { e.preventDefault(); window.Intercom('show'); }}>Feature request & feedback * { e.preventDefault(); window.Intercom('showNewMessage'); }}>Contact support # Dedicated get endpoint version Source: https://friendli.ai/docs/openapi/dedicated/endpoint/get-version get /dedicated/beta/endpoint/{endpoint_id}/version Given an endpoint ID, return its version history. To request successfully, it is mandatory to enter a **Friendli Token** (e.g. flp\_XXX) value in the **Bearer Token** field. Refer to the [authentication section](/openapi/introduction#authentication) on our introduction page to learn how to acquire this variable and [visit here](https://friendli.ai/suite/setting/tokens) to generate your token. This API is currently in **Beta**. While we strive to provide a stable and reliable experience, this feature is still under active development. As a result, you may encounter unexpected behavior or limitations. We encourage you to provide feedback to help us improve the feature before its official release. * { e.preventDefault(); window.Intercom('show'); }}>Feature request & feedback * { e.preventDefault(); window.Intercom('showNewMessage'); }}>Contact support # Dedicated list endpoints Source: https://friendli.ai/docs/openapi/dedicated/endpoint/list get /dedicated/beta/endpoint List Dedicated Endpoint deployments. To request successfully, it is mandatory to enter a **Friendli Token** (e.g. flp\_XXX) value in the **Bearer Token** field. Refer to the [authentication section](/openapi/introduction#authentication) on our introduction page to learn how to acquire this variable and [visit here](https://friendli.ai/suite/setting/tokens) to generate your token. This API is currently in **Beta**. While we strive to provide a stable and reliable experience, this feature is still under active development. As a result, you may encounter unexpected behavior or limitations. We encourage you to provide feedback to help us improve the feature before its official release. * { e.preventDefault(); window.Intercom('show'); }}>Feature request & feedback * { e.preventDefault(); window.Intercom('showNewMessage'); }}>Contact support # Dedicated restart endpoint Source: https://friendli.ai/docs/openapi/dedicated/endpoint/restart put /dedicated/beta/endpoint/{endpoint_id}/restart Restart a failed or terminated Dedicated Endpoint. To request successfully, it is mandatory to enter a **Friendli Token** (e.g. flp\_XXX) value in the **Bearer Token** field. Refer to the [authentication section](/openapi/introduction#authentication) on our introduction page to learn how to acquire this variable and [visit here](https://friendli.ai/suite/setting/tokens) to generate your token. This API is currently in **Beta**. While we strive to provide a stable and reliable experience, this feature is still under active development. As a result, you may encounter unexpected behavior or limitations. We encourage you to provide feedback to help us improve the feature before its official release. * { e.preventDefault(); window.Intercom('show'); }}>Feature request & feedback * { e.preventDefault(); window.Intercom('showNewMessage'); }}>Contact support # Dedicated sleep endpoint Source: https://friendli.ai/docs/openapi/dedicated/endpoint/sleep put /dedicated/beta/endpoint/{endpoint_id}/sleep Put a Dedicated Endpoint to sleep mode. To request successfully, it is mandatory to enter a **Friendli Token** (e.g. flp\_XXX) value in the **Bearer Token** field. 
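As a rough illustration of the endpoint-management routes introduced above, the hedged Python sketch below lists deployments, checks an endpoint's status, and restarts it. The `https://api.friendli.ai` prefix and the response shapes are assumptions; treat this as a starting point rather than a reference.

```python Python
# Hypothetical sketch of the beta endpoint-management routes: list
# deployments, check one endpoint's status, then restart it. The
# https://api.friendli.ai prefix and response shapes are assumptions.
import os

import requests

BASE_URL = "https://api.friendli.ai"
HEADERS = {"Authorization": f"Bearer {os.getenv('FRIENDLI_TOKEN')}"}

# List Dedicated Endpoint deployments (GET /dedicated/beta/endpoint).
endpoints = requests.get(f"{BASE_URL}/dedicated/beta/endpoint", headers=HEADERS)
print(endpoints.json())

endpoint_id = "YOUR_ENDPOINT_ID"

# Check the endpoint's current status
# (GET /dedicated/beta/endpoint/{endpoint_id}/status).
status = requests.get(
    f"{BASE_URL}/dedicated/beta/endpoint/{endpoint_id}/status", headers=HEADERS
)
print(status.json())

# Restart a failed or terminated endpoint
# (PUT /dedicated/beta/endpoint/{endpoint_id}/restart).
requests.put(
    f"{BASE_URL}/dedicated/beta/endpoint/{endpoint_id}/restart", headers=HEADERS
)
```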
Refer to the [authentication section](/openapi/introduction#authentication) on our introduction page to learn how to acquire this variable and [visit here](https://friendli.ai/suite/setting/tokens) to generate your token. This API is currently in **Beta**. While we strive to provide a stable and reliable experience, this feature is still under active development. As a result, you may encounter unexpected behavior or limitations. We encourage you to provide feedback to help us improve the feature before its official release. * { e.preventDefault(); window.Intercom('show'); }}>Feature request & feedback * { e.preventDefault(); window.Intercom('showNewMessage'); }}>Contact support # Dedicated terminate endpoint Source: https://friendli.ai/docs/openapi/dedicated/endpoint/terminate put /dedicated/beta/endpoint/{endpoint_id}/terminate Terminate an endpoint. To request successfully, it is mandatory to enter a **Friendli Token** (e.g. flp\_XXX) value in the **Bearer Token** field. Refer to the [authentication section](/openapi/introduction#authentication) on our introduction page to learn how to acquire this variable and [visit here](https://friendli.ai/suite/setting/tokens) to generate your token. This API is currently in **Beta**. While we strive to provide a stable and reliable experience, this feature is still under active development. As a result, you may encounter unexpected behavior or limitations. We encourage you to provide feedback to help us improve the feature before its official release. * { e.preventDefault(); window.Intercom('show'); }}>Feature request & feedback * { e.preventDefault(); window.Intercom('showNewMessage'); }}>Contact support # Dedicated update endpoint Source: https://friendli.ai/docs/openapi/dedicated/endpoint/update put /dedicated/beta/endpoint/{endpoint_id} Update a Dedicated Endpoint deployment with new configuration. To request successfully, it is mandatory to enter a **Friendli Token** (e.g. flp\_XXX) value in the **Bearer Token** field. Refer to the [authentication section](/openapi/introduction#authentication) on our introduction page to learn how to acquire this variable and [visit here](https://friendli.ai/suite/setting/tokens) to generate your token. This API is currently in **Beta**. While we strive to provide a stable and reliable experience, this feature is still under active development. As a result, you may encounter unexpected behavior or limitations. We encourage you to provide feedback to help us improve the feature before its official release. * { e.preventDefault(); window.Intercom('show'); }}>Feature request & feedback * { e.preventDefault(); window.Intercom('showNewMessage'); }}>Contact support # Dedicated wake endpoint Source: https://friendli.ai/docs/openapi/dedicated/endpoint/wake put /dedicated/beta/endpoint/{endpoint_id}/wake Wake up a sleeping Dedicated Endpoint. To request successfully, it is mandatory to enter a **Friendli Token** (e.g. flp\_XXX) value in the **Bearer Token** field. Refer to the [authentication section](/openapi/introduction#authentication) on our introduction page to learn how to acquire this variable and [visit here](https://friendli.ai/suite/setting/tokens) to generate your token. This API is currently in **Beta**. While we strive to provide a stable and reliable experience, this feature is still under active development. As a result, you may encounter unexpected behavior or limitations. We encourage you to provide feedback to help us improve the feature before its official release. 
* { e.preventDefault(); window.Intercom('show'); }}>Feature request & feedback * { e.preventDefault(); window.Intercom('showNewMessage'); }}>Contact support # Dedicated create endpoint from W&B artifact Source: https://friendli.ai/docs/openapi/dedicated/endpoint/wandb-artifact-create post /dedicated/endpoint/wandb-artifact-create Create an endpoint from Weights & Biases artifact. To request successfully, it is mandatory to enter a **Friendli Token** (e.g. flp\_XXX) value in the **Bearer Token** field. Refer to the [authentication section](/openapi/introduction#authentication) on our introduction page to learn how to acquire this variable and [visit here](https://friendli.ai/suite/setting/tokens) to generate your token. # Dedicated audio transcriptions Source: https://friendli.ai/docs/openapi/dedicated/inference/audio-transcriptions post /dedicated/v1/audio/transcriptions Given an audio file, the model transcribes it into text. To request successfully, it is mandatory to enter a **Friendli Token** (e.g. flp\_XXX) value in the **Bearer Token** field. Refer to the [authentication section](/openapi/introduction#authentication) on our introduction page to learn how to acquire this variable and [visit here](https://friendli.ai/suite/setting/tokens) to generate your token. This API is currently in **Beta**. While we strive to provide a stable and reliable experience, this feature is still under active development. As a result, you may encounter unexpected behavior or limitations. We encourage you to provide feedback to help us improve the feature before its official release. * { e.preventDefault(); window.Intercom('show'); }}>Feature request & feedback * { e.preventDefault(); window.Intercom('showNewMessage'); }}>Contact support # Dedicated chat completions Source: https://friendli.ai/docs/openapi/dedicated/inference/chat-completions post /dedicated/v1/chat/completions Given a list of messages forming a conversation, the model generates a response. To request successfully, it is mandatory to enter a **Friendli Token** (e.g. flp\_XXX) value in the **Bearer Token** field. Refer to the [authentication section](/openapi/introduction#authentication) on our introduction page to learn how to acquire this variable and [visit here](https://friendli.ai/suite/setting/tokens) to generate your token. When streaming mode is used (i.e., `stream` option is set to `true`), the response is in MIME type `text/event-stream`. Otherwise, the content type is `application/json`. You can view the schema of the streamed sequence of chunk objects in streaming mode [here](/openapi/dedicated/inference/chat-completions-chunk-object). # Dedicated chat completions chunk object Source: https://friendli.ai/docs/openapi/dedicated/inference/chat-completions-chunk-object Represents a streamed chunk of a chat completions response returned by model, based on the provided input. ```json Response data: { "id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941", "model": "(endpoint-id)", "object": "chat.completion.chunk", "choices": [ { "index": 0, "delta": { "role": "assistant", "content": "This" }, "finish_reason": null, "logprobs": null } ], "usage": null, "created": 1726294381 } data: { "id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941", "model": "(endpoint-id)", "object": "chat.completion.chunk", "choices": [ { "index": 0, "delta": { "content": " is" }, "finish_reason": null, "logprobs": null } ], "usage": null, "created": 1726294381 } ... 
data: { "id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941", "model": "(endpoint-id)", "object": "chat.completion.chunk", "choices": [ { "index": 0, "delta": {}, "finish_reason": "stop", "logprobs": null } ], "usage": null, "created": 1726294383 } data: { "id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941", "model": "(endpoint-id)", "object": "chat.completion.chunk", "choices": [], "usage": { "prompt_tokens": 8, "completion_tokens": 4, "total_tokens": 12 }, "created": 1726294402 } data: [DONE] ``` ```json With tools data: { "id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941", "model": "(endpoint-id)", "object": "chat.completion.chunk", "choices": [ { "index": 0, "delta": { "role": "assistant", "content": "This" }, "finish_reason": null, "logprobs": null } ], "usage": null, "created": 1726294442 } ... data: { "id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941", "model": "(endpoint-id)", "object": "chat.completion.chunk", "choices": [ { "index": 0, "delta": { "role": "assistant", "tool_calls": [ { "index": 0, "id": "call_TARbemDG9CFdwuoaQBTRXiYK", "type": "function", "function": { "name": "func", "arguments": "{\"" } } ] }, "finish_reason": null, "logprobs": null } ], "usage": null, "created": 1726294442 } data: { "id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941", "model": "(endpoint-id)", "object": "chat.completion.chunk", "choices": [ { "index": 0, "delta": { "role": "assistant", "tool_calls": [ { "index": 0, "type": "function", "function": { "arguments": "arg" } } ] }, "finish_reason": null, "logprobs": null } ], "usage": null, "created": 1726294442 } ... data: { "id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941", "model": "(endpoint-id)", "object": "chat.completion.chunk", "choices": [ { "index": 0, "delta": { "role": "assistant", "tool_calls": [ { "index": 0, "type": "function", "function": { "arguments": "}" } } ] }, "finish_reason": null, "logprobs": null } ], "usage": null, "created": 1726294442 } data: { "id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941", "model": "(endpoint-id)", "object": "chat.completion.chunk", "choices": [ { "index": 0, "delta": {}, "finish_reason": "tool_calls", "logprobs": null } ], "usage": null, "created": 1726294442 } data: { "id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941", "model": "(endpoint-id)", "object": "chat.completion.chunk", "choices": [], "usage": { "prompt_tokens": 468, "completion_tokens": 59, "total_tokens": 527 }, "created": 1726294443 } data: [DONE] ``` A unique ID of the chat completion. The object type, which is always set to `chat.completion.chunk`. The model to generate the completion. For dedicated endpoints, it returns the endpoint id. The index of the choice in the list of generated choices. Role of the generated message author, in this case `assistant`. The contents of the assistant message. The index of tool call being generated. The ID of the tool call. The type of the tool, which is always set to `function`. The name of the function to call. The arguments for calling the function, generated by the model in JSON format. Ensure to validate these arguments in your code before invoking the function since the model may not always produce valid JSON. Termination condition of the generation. `stop` means the API returned the full chat completions generated by the model without running into any limits. `length` means the generation exceeded `max_tokens` or the conversation exceeded the max context length. `tool_calls` means the API has generated tool calls. 
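As a quick illustration of consuming this stream, the hedged sketch below uses an OpenAI-compatible Python client to print `delta.content` as it arrives and to accumulate streamed tool-call arguments by their `index`. The client configuration is an assumption; adapt it to your endpoint.

```python Python
# Minimal sketch of consuming the streamed chunks described above with an
# OpenAI-compatible client pointed at a Dedicated Endpoint. The client setup
# is an assumption; adapt the base URL and model to your deployment.
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://api.friendli.ai/dedicated/v1",
    api_key=os.getenv("FRIENDLI_TOKEN"),
)

stream = client.chat.completions.create(
    model="YOUR_ENDPOINT_ID",
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)

tool_args = {}  # tool call index -> accumulated argument string
for chunk in stream:
    if not chunk.choices:
        # The final chunk reports `usage` with an empty `choices` list.
        continue
    delta = chunk.choices[0].delta
    if delta.content:
        print(delta.content, end="", flush=True)
    for tool_call in delta.tool_calls or []:
        if tool_call.function and tool_call.function.arguments:
            tool_args[tool_call.index] = (
                tool_args.get(tool_call.index, "") + tool_call.function.arguments
            )
```

The same loop applies to the Serverless chat completions stream, since the chunk shape is identical apart from the `model` field.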
Available options: `stop`, `length`, `tool_calls` Log probability information for the choice. A list of message content tokens with log probability information. The token. The log probability of this token. A list of integers representing the UTF-8 bytes representation of the token. Useful in instances where characters are represented by multiple tokens and their byte representations must be combined to generate the correct text representation. Can be `null` if there is no bytes representation for the token. List of the most likely tokens and their log probability, at this token position. The token. The log probability of this token. A list of integers representing the UTF-8 bytes representation of the token. Useful in instances where characters are represented by multiple tokens and their byte representations must be combined to generate the correct text representation. Can be `null` if there is no bytes representation for the token. Number of tokens in the prompt. Number of tokens in the generated chat completions. Total number of tokens used in the request (`prompt_tokens` + `completion_tokens`). The Unix timestamp (in seconds) for when the token sampled. # Dedicated completions Source: https://friendli.ai/docs/openapi/dedicated/inference/completions post /dedicated/v1/completions Generate text based on the given text prompt. To request successfully, it is mandatory to enter a **Friendli Token** (e.g. flp\_XXX) value in the **Bearer Token** field. Refer to the [authentication section](/openapi/introduction#authentication) on our introduction page to learn how to acquire this variable and [visit here](https://friendli.ai/suite/setting/tokens) to generate your token. When streaming mode is used (i.e., `stream` option is set to `true`), the response is in MIME type `text/event-stream`. Otherwise, the content type is `application/json`. You can view the schema of the streamed sequence of chunk objects in streaming mode [here](/openapi/dedicated/inference/completions-chunk-object). # Dedicated completions chunk object Source: https://friendli.ai/docs/openapi/dedicated/inference/completions-chunk-object Represents a streamed chunk of a completions response returned by model, based on the provided input. ```json Response data: { "id": "cmpl-26a1e10db8544bc3adb488d2d205288b", "model": "(endpoint-id)", "object": "text_completion", "choices": [ { "index": 0, "text": " such", "token": 1778, "finish_reason": null, "logprobs": null } ], "created": 1733382157 } data: { "id": "cmpl-26a1e10db8544bc3adb488d2d205288b", "model": "(endpoint-id)", "object": "text_completion", "choices": [ { "index": 0, "text": " as", "token": 439, "finish_reason": null, "logprobs": null } ], "created": 1733382157 } ... data: { "id": "cmpl-26a1e10db8544bc3adb488d2d205288b", "model": "(endpoint-id)", "object": "text_completion", "choices": [ { "index": 0, "text": "", "finish_reason": "length", "logprobs": null } ], "created": 1733382157 } data: { "id": "cmpl-26a1e10db8544bc3adb488d2d205288b", "model": "(endpoint-id)", "object": "text_completion", "choices": [], "usage": { "prompt_tokens": 5, "completion_tokens": 10, "total_tokens": 15 }, "created": 1733382157 } data: [DONE] ``` A unique ID of the completion. The object type, which is always set to `text_completion`. The model to generate the completion. For dedicated endpoints, it returns the endpoint id. The index of the choice in the list of generated choices. The text. The token. Termination condition of the generation. 
`stop` means the API returned the full completions generated by the model without running into any limits. `length` means the generation exceeded `max_tokens` or the conversation exceeded the max context length. Available options: `stop`, `length` Log probability information for the choice. The starting character position of each token in the generated text, useful for mapping tokens back to their exact location for detailed analysis. The log probabilities of each generated token, indicating the model's confidence in selecting each token. A list of individual tokens generated in the completion, representing segments of text such as words or pieces of words. A list of dictionaries, where each dictionary represents the top alternative tokens considered by the model at a specific position in the generated text, along with their log probabilities. The number of items in each dictionary matches the value of `logprobs`. Number of tokens in the prompt. Number of tokens in the generated completions. Total number of tokens used in the request (`prompt_tokens` + `completion_tokens`). The Unix timestamp (in seconds) for when the token sampled. # Dedicated detokenization Source: https://friendli.ai/docs/openapi/dedicated/inference/detokenization post /dedicated/v1/detokenize By giving a list of tokens, generate a detokenized output text string. To request successfully, it is mandatory to enter a **Friendli Token** (e.g. flp\_XXX) value in the **Bearer Token** field. Refer to the [authentication section](/openapi/introduction#authentication) on our introduction page to learn how to acquire this variable and [visit here](https://friendli.ai/suite/setting/tokens) to generate your token. # Dedicated image generations Source: https://friendli.ai/docs/openapi/dedicated/inference/image-generations post /dedicated/v1/images/generations Given a description, the model generates image(s). To request successfully, it is mandatory to enter a **Friendli Token** (e.g. flp\_XXX) value in the **Bearer Token** field. Refer to the [authentication section](/openapi/introduction#authentication) on our introduction page to learn how to acquire this variable and [visit here](https://friendli.ai/suite/setting/tokens) to generate your token. This API is currently in **Beta**. While we strive to provide a stable and reliable experience, this feature is still under active development. As a result, you may encounter unexpected behavior or limitations. We encourage you to provide feedback to help us improve the feature before its official release. * { e.preventDefault(); window.Intercom('show'); }}>Feature request & feedback * { e.preventDefault(); window.Intercom('showNewMessage'); }}>Contact support # Dedicated tokenization Source: https://friendli.ai/docs/openapi/dedicated/inference/tokenization post /dedicated/v1/tokenize By giving a text input, generate a tokenized output of token IDs. To request successfully, it is mandatory to enter a **Friendli Token** (e.g. flp\_XXX) value in the **Bearer Token** field. Refer to the [authentication section](/openapi/introduction#authentication) on our introduction page to learn how to acquire this variable and [visit here](https://friendli.ai/suite/setting/tokens) to generate your token. # Dedicated overview Source: https://friendli.ai/docs/openapi/dedicated/overview OpenAPI reference of Friendli Dedicated Endpoints API. ### Inference Discover how to generate text through interactive conversations. Learn how to generate text. 
Explore the process of breaking down text into smaller tokens for machine processing. Learn how to reconstruct tokenized text back into its original, human-readable form. Learn how to generate images. ### Endpoint (Beta) List Dedicated Endpoint deployments. Given an endpoint ID, return its specification. Given an endpoint ID, return its version history. Given an endpoint ID, return its current status. Create a Dedicated Endpoint deployment for a Hugging Face model. Create an endpoint from Weights & Biases artifact. Update a Dedicated Endpoint deployment with new configuration. Terminate an endpoint. Restart a failed or terminated Dedicated Endpoint. Put a Dedicated Endpoint to sleep mode. Wake up a sleeping Dedicated Endpoint. Delete an endpoint. # Complete file upload Source: https://friendli.ai/docs/openapi/file/complete-file-upload patch /beta/file/{file_id} Complete file upload. To request successfully, it is required to enter a **Friendli Token** (e.g. flp\_XXX) in the **Bearer Token** field. Refer to the [authentication section](/openapi/introduction#authentication) on our introduction page to learn more and [visit here](https://friendli.ai/suite/setting/tokens) to generate your token. This API is currently in **Beta**. While we strive to provide a stable and reliable experience, this feature is still under active development. As a result, you may encounter unexpected behavior or limitations. We encourage you to provide feedback to help us improve the feature before its official release. * { e.preventDefault(); window.Intercom('show'); }}>Feature request & feedback * { e.preventDefault(); window.Intercom('showNewMessage'); }}>Contact support # Get file download URL Source: https://friendli.ai/docs/openapi/file/get-file-download-url get /beta/file/{file_id}/download_url Get file download URL. To request successfully, it is required to enter a **Friendli Token** (e.g. flp\_XXX) in the **Bearer Token** field. Refer to the [authentication section](/openapi/introduction#authentication) on our introduction page to learn more and [visit here](https://friendli.ai/suite/setting/tokens) to generate your token. This API is currently in **Beta**. While we strive to provide a stable and reliable experience, this feature is still under active development. As a result, you may encounter unexpected behavior or limitations. We encourage you to provide feedback to help us improve the feature before its official release. * { e.preventDefault(); window.Intercom('show'); }}>Feature request & feedback * { e.preventDefault(); window.Intercom('showNewMessage'); }}>Contact support # Get file info Source: https://friendli.ai/docs/openapi/file/get-file-info get /beta/file/{file_id} Get file info. To request successfully, it is required to enter a **Friendli Token** (e.g. flp\_XXX) in the **Bearer Token** field. Refer to the [authentication section](/openapi/introduction#authentication) on our introduction page to learn more and [visit here](https://friendli.ai/suite/setting/tokens) to generate your token. This API is currently in **Beta**. While we strive to provide a stable and reliable experience, this feature is still under active development. As a result, you may encounter unexpected behavior or limitations. We encourage you to provide feedback to help us improve the feature before its official release. 
* Feature request & feedback * Contact support # Initiate file upload Source: https://friendli.ai/docs/openapi/file/init-file-upload post /beta/file Initiate file upload. To request successfully, it is required to enter a **Friendli Token** (e.g. flp\_XXX) in the **Bearer Token** field. Refer to the [authentication section](/openapi/introduction#authentication) on our introduction page to learn more and [visit here](https://friendli.ai/suite/setting/tokens) to generate your token. This API is currently in **Beta**. While we strive to provide a stable and reliable experience, this feature is still under active development. As a result, you may encounter unexpected behavior or limitations. We encourage you to provide feedback to help us improve the feature before its official release. * Feature request & feedback * Contact support # File overview Source: https://friendli.ai/docs/openapi/file/overview OpenAPI reference of Friendli File API. ### File Management (Beta) Discover how to initiate a file upload. Discover how to complete a file upload. Discover how to get information about a file. Discover how to get a download URL for a file. # Friendli Suite API Reference Source: https://friendli.ai/docs/openapi/introduction OpenAPI reference of Friendli Suite API. You can interact with the API through HTTP requests from any language.
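For example, a request with plain Python and `requests` might look like the hedged sketch below; the model name is illustrative, and the response is assumed to follow the OpenAI-style chat completion schema.

```python Python
# Minimal sketch of a plain HTTP call to the Serverless chat completions
# route from Python. The model name is illustrative, and the response is
# assumed to follow the OpenAI-style chat completion schema.
import os

import requests

response = requests.post(
    "https://api.friendli.ai/serverless/v1/chat/completions",
    headers={"Authorization": f"Bearer {os.getenv('FRIENDLI_TOKEN')}"},
    json={
        "model": "meta-llama-3.1-8b-instruct",  # illustrative model name
        "messages": [{"role": "user", "content": "Hello!"}],
    },
)
print(response.json()["choices"][0]["message"]["content"])
```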
; To send inference requests, send to the URI with the prefix: `https://api.friendli.ai`.\ For more information, visit [FriendliAI](https://friendli.ai). ## Authentication When using Friendli Suite API for inference requests, you need to provide a **Friendli Token** for authentication and authorization purposes. A Friendli Token serves as an alternative method of authorization to signing in with an email and a password. You can generate a new Friendli Token through the [Friendli Suite](https://friendli.ai/suite), at your **'Personal settings'** page by following the steps below. 1. Go to the [Friendli Suite](https://friendli.ai/suite) and sign in with your account. 2. Click the profile icon at the top-right corner of the page. 3. Click **'Personal settings'** menu. Personal settings 4. Go to the **'Tokens'** tab on the navigation bar. 5. Create a new Friendli Token by clicking the **'Create token'** button. 6. Copy the token and save it in a safe place. You will not be able to see this token again once the page is refreshed. Tokens # Serverless chat completions Source: https://friendli.ai/docs/openapi/serverless/chat-completions post /serverless/v1/chat/completions Given a list of messages forming a conversation, the model generates a response. See available models at [this pricing table](/guides/serverless_endpoints/pricing#text-generation-models). To request successfully, it is mandatory to enter a **Friendli Token** (e.g. flp\_XXX) value in the **Bearer Token** field. Refer to the [authentication section](/openapi/introduction#authentication) on our introduction page to learn how to acquire this variable and [visit here](https://friendli.ai/suite/setting/tokens) to generate your token. When streaming mode is used (i.e., `stream` option is set to `true`), the response is in MIME type `text/event-stream`. Otherwise, the content type is `application/json`. You can view the schema of the streamed sequence of chunk objects in streaming mode [here](/openapi/serverless/chat-completions-chunk-object). You can explore examples on the [Friendli Serverless Endpoints](https://friendli.ai/get-started/serverless-endpoints) playground and adjust settings with just a few clicks. # Serverless chat completions chunk object Source: https://friendli.ai/docs/openapi/serverless/chat-completions-chunk-object Represents a streamed chunk of a chat completions response returned by model, based on the provided input. ```json Response data: { "id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941", "model": "meta-llama-3.1-8b-instruct", "object": "chat.completion.chunk", "choices": [ { "index": 0, "delta": { "role": "assistant", "content": "This" }, "finish_reason": null, "logprobs": null } ], "usage": null, "created": 1726294381 } data: { "id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941", "model": "meta-llama-3.1-8b-instruct", "object": "chat.completion.chunk", "choices": [ { "index": 0, "delta": { "content": " is" }, "finish_reason": null, "logprobs": null } ], "usage": null, "created": 1726294381 } ... 
data: { "id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941", "model": "meta-llama-3.1-8b-instruct", "object": "chat.completion.chunk", "choices": [ { "index": 0, "delta": {}, "finish_reason": "stop", "logprobs": null } ], "usage": null, "created": 1726294383 } data: { "id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941", "model": "meta-llama-3.1-8b-instruct", "object": "chat.completion.chunk", "choices": [], "usage": { "prompt_tokens": 8, "completion_tokens": 4, "total_tokens": 12 }, "created": 1726294402 } data: [DONE] ``` ```json With tools data: { "id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941", "model": "meta-llama-3.1-8b-instruct", "object": "chat.completion.chunk", "choices": [ { "index": 0, "delta": { "role": "assistant", "content": "This" }, "finish_reason": null, "logprobs": null } ], "usage": null, "created": 1726294442 } ... data: { "id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941", "model": "meta-llama-3.1-8b-instruct", "object": "chat.completion.chunk", "choices": [ { "index": 0, "delta": { "role": "assistant", "tool_calls": [ { "index": 0, "id": "call_TARbemDG9CFdwuoaQBTRXiYK", "type": "function", "function": { "name": "func", "arguments": "{\"" } } ] }, "finish_reason": null, "logprobs": null } ], "usage": null, "created": 1726294442 } data: { "id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941", "model": "meta-llama-3.1-8b-instruct", "object": "chat.completion.chunk", "choices": [ { "index": 0, "delta": { "role": "assistant", "tool_calls": [ { "index": 0, "type": "function", "function": { "arguments": "arg" } } ] }, "finish_reason": null, "logprobs": null } ], "usage": null, "created": 1726294442 } ... data: { "id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941", "model": "meta-llama-3.1-8b-instruct", "object": "chat.completion.chunk", "choices": [ { "index": 0, "delta": { "role": "assistant", "tool_calls": [ { "index": 0, "type": "function", "function": { "arguments": "}" } } ] }, "finish_reason": null, "logprobs": null } ], "usage": null, "created": 1726294442 } data: { "id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941", "model": "meta-llama-3.1-8b-instruct", "object": "chat.completion.chunk", "choices": [ { "index": 0, "delta": {}, "finish_reason": "tool_calls", "logprobs": null } ], "usage": null, "created": 1726294442 } data: { "id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941", "model": "meta-llama-3.1-8b-instruct", "object": "chat.completion.chunk", "choices": [], "usage": { "prompt_tokens": 468, "completion_tokens": 59, "total_tokens": 527 }, "created": 1726294443 } data: [DONE] ``` A unique ID of the chat completion. The object type, which is always set to `chat.completion.chunk`. The model to generate the completion. The index of the choice in the list of generated choices. Role of the generated message author, in this case `assistant`. The contents of the assistant message. The index of tool call being generated. The ID of the tool call. The type of the tool, which is always set to `function`. The name of the function to call. The arguments for calling the function, generated by the model in JSON format. Ensure to validate these arguments in your code before invoking the function since the model may not always produce valid JSON. Termination condition of the generation. `stop` means the API returned the full chat completions generated by the model without running into any limits. `length` means the generation exceeded `max_tokens` or the conversation exceeded the max context length. `tool_calls` means the API has generated tool calls. 
Available options: `stop`, `length`, `tool_calls` Log probability information for the choice. A list of message content tokens with log probability information. The token. The log probability of this token. A list of integers representing the UTF-8 bytes representation of the token. Useful in instances where characters are represented by multiple tokens and their byte representations must be combined to generate the correct text representation. Can be `null` if there is no bytes representation for the token. List of the most likely tokens and their log probability, at this token position. The token. The log probability of this token. A list of integers representing the UTF-8 bytes representation of the token. Useful in instances where characters are represented by multiple tokens and their byte representations must be combined to generate the correct text representation. Can be `null` if there is no bytes representation for the token. Number of tokens in the prompt. Number of tokens in the generated chat completions. Total number of tokens used in the request (`prompt_tokens` + `completion_tokens`). The Unix timestamp (in seconds) for when the token sampled. # Serverless completions Source: https://friendli.ai/docs/openapi/serverless/completions post /serverless/v1/completions Generate text based on the given text prompt. See available models at [this pricing table](/guides/serverless_endpoints/pricing#text-generation-models). To request successfully, it is mandatory to enter a **Friendli Token** (e.g. flp\_XXX) value in the **Bearer Token** field. Refer to the [authentication section](/openapi/introduction#authentication) on our introduction page to learn how to acquire this variable and [visit here](https://friendli.ai/suite/setting/tokens) to generate your token. When streaming mode is used (i.e., `stream` option is set to `true`), the response is in MIME type `text/event-stream`. Otherwise, the content type is `application/json`. You can view the schema of the streamed sequence of chunk objects in streaming mode [here](/openapi/serverless/completions-chunk-object). # Serverless completions chunk object Source: https://friendli.ai/docs/openapi/serverless/completions-chunk-object Represents a streamed chunk of a completions response returned by model, based on the provided input. ```json Response data: { "id": "cmpl-26a1e10db8544bc3adb488d2d205288b", "model": "meta-llama-3.1-8b-instruct", "object": "text_completion", "choices": [ { "index": 0, "text": " such", "token": 1778, "finish_reason": null, "logprobs": null } ], "created": 1733382157 } data: { "id": "cmpl-26a1e10db8544bc3adb488d2d205288b", "model": "meta-llama-3.1-8b-instruct", "object": "text_completion", "choices": [ { "index": 0, "text": " as", "token": 439, "finish_reason": null, "logprobs": null } ], "created": 1733382157 } ... data: { "id": "cmpl-26a1e10db8544bc3adb488d2d205288b", "model": "meta-llama-3.1-8b-instruct", "object": "text_completion", "choices": [ { "index": 0, "text": "", "finish_reason": "length", "logprobs": null } ], "created": 1733382157 } data: { "id": "cmpl-26a1e10db8544bc3adb488d2d205288b", "model": "meta-llama-3.1-8b-instruct", "object": "text_completion", "choices": [], "usage": { "prompt_tokens": 5, "completion_tokens": 10, "total_tokens": 15 }, "created": 1733382157 } data: [DONE] ``` A unique ID of the completion. The object type, which is always set to `text_completion`. The model to generate the completion. The index of the choice in the list of generated choices. The text. The token. 
Termination condition of the generation. `stop` means the API returned the full completions generated by the model without running into any limits. `length` means the generation exceeded `max_tokens` or the conversation exceeded the max context length. Available options: `stop`, `length` Log probability information for the choice. The starting character position of each token in the generated text, useful for mapping tokens back to their exact location for detailed analysis. The log probabilities of each generated token, indicating the model's confidence in selecting each token. A list of individual tokens generated in the completion, representing segments of text such as words or pieces of words. A list of dictionaries, where each dictionary represents the top alternative tokens considered by the model at a specific position in the generated text, along with their log probabilities. The number of items in each dictionary matches the value of `logprobs`. Number of tokens in the prompt. Number of tokens in the generated completions. Total number of tokens used in the request (`prompt_tokens` + `completion_tokens`). The Unix timestamp (in seconds) for when the token sampled. # Serverless detokenization Source: https://friendli.ai/docs/openapi/serverless/detokenization post /serverless/v1/detokenize By giving a list of tokens, generate a detokenized output text string. To request successfully, it is mandatory to enter a **Friendli Token** (e.g. flp\_XXX) value in the **Bearer Token** field. Refer to the [authentication section](/openapi/introduction#authentication) on our introduction page to learn how to acquire this variable and [visit here](https://friendli.ai/suite/setting/tokens) to generate your token. # Serverless overview Source: https://friendli.ai/docs/openapi/serverless/overview OpenAPI reference of Friendli Serverless Endpoints API. ### Inference Discover how to generate text through interactive conversations. Learn how to enhance responses with tool assisted chat completions using built-in tools. Learn how to generate text. Explore the process of breaking down text into smaller tokens for machine processing. Learn how to reconstruct tokenized text back into its original, human-readable form. # Serverless tokenization Source: https://friendli.ai/docs/openapi/serverless/tokenization post /serverless/v1/tokenize By giving a text input, generate a tokenized output of token IDs. To request successfully, it is mandatory to enter a **Friendli Token** (e.g. flp\_XXX) value in the **Bearer Token** field. Refer to the [authentication section](/openapi/introduction#authentication) on our introduction page to learn how to acquire this variable and [visit here](https://friendli.ai/suite/setting/tokens) to generate your token. # Serverless tool assisted chat completions Source: https://friendli.ai/docs/openapi/serverless/tool-assisted-chat-completions post /serverless/tools/v1/chat/completions Given a list of messages forming a conversation, the model generates a response. Additionally, the model can utilize built-in tools for tool calls, enhancing its capability to provide more comprehensive and actionable responses. See available models at [this pricing table](/guides/serverless_endpoints/pricing#text-generation-models). To request successfully, it is mandatory to enter a **Friendli Token** (e.g. flp\_XXX) value in the **Bearer Token** field. 
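For an end-to-end picture, the hedged Python sketch below streams a tool assisted chat completion and separates `tool_status` events from ordinary chunks. The built-in tool specification format (`{"type": "math:calculator"}`) is an assumption inferred from the tool names in the examples that follow; verify it against the request schema before relying on it.

```python Python
# Hypothetical sketch of streaming a tool assisted chat completion and
# handling the named `tool_status` events alongside ordinary data chunks.
# The tool specification format and model name are assumptions.
import json
import os

import requests

url = "https://api.friendli.ai/serverless/tools/v1/chat/completions"
headers = {"Authorization": f"Bearer {os.getenv('FRIENDLI_TOKEN')}"}
body = {
    "model": "meta-llama-3.1-8b-instruct",  # illustrative model name
    "messages": [{"role": "user", "content": "How far is 150 miles in km?"}],
    "tools": [{"type": "math:calculator"}],  # assumed built-in tool format
    "stream": True,
}

with requests.post(url, headers=headers, json=body, stream=True) as response:
    event = None
    for raw_line in response.iter_lines():
        if not raw_line:
            continue
        line = raw_line.decode("utf-8")
        if line.startswith("event:"):
            # Named events (e.g. `tool_status`) precede their data line.
            event = line.split(":", 1)[1].strip()
            continue
        if not line.startswith("data:"):
            continue
        payload = line.split(":", 1)[1].strip()
        if payload == "[DONE]":
            break
        data = json.loads(payload)
        if event == "tool_status":
            print(f"\n[tool {data['name']}: {data['status']}]")
        elif data.get("choices"):
            delta = data["choices"][0].get("delta", {})
            print(delta.get("content") or "", end="", flush=True)
        event = None
```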
Refer to the [authentication section](/openapi/introduction#authentication) on our introduction page to learn how to acquire this variable and [visit here](https://friendli.ai/suite/setting/tokens) to generate your token. When streaming mode is used (i.e., `stream` option is set to `true`), the response is in MIME type `text/event-stream`. Otherwise, the content type is `application/json`. You can view the schema of the streamed sequence of chunk objects in streaming mode [here](/openapi/serverless/tool-assisted-chat-completions-chunk-object). You can explore examples on the [Friendli Serverless Endpoints](https://friendli.ai/get-started/serverless-endpoints) playground and adjust settings with just a few clicks. Tool assisted chat completions does not fully support parallel tool calls now. This API is currently in **Beta**. While we strive to provide a stable and reliable experience, this feature is still under active development. As a result, you may encounter unexpected behavior or limitations. We encourage you to provide feedback to help us improve the feature before its official release. * { e.preventDefault(); window.Intercom('show'); }}>Feature request & feedback * { e.preventDefault(); window.Intercom('showNewMessage'); }}>Contact support # Serverless tool assisted chat completions chunk object Source: https://friendli.ai/docs/openapi/serverless/tool-assisted-chat-completions-chunk-object Represents a streamed chunk of a tool assisted chat completions response returned by model, based on the provided input. This API is currently in **Beta**. While we strive to provide a stable and reliable experience, this feature is still under active development. As a result, you may encounter unexpected behavior or limitations. We encourage you to provide feedback to help us improve the feature before its official release. * { e.preventDefault(); window.Intercom('show'); }}>Feature request & feedback * { e.preventDefault(); window.Intercom('showNewMessage'); }}>Contact support ```json Response event: tool_status data: { "tool_call_id": "call_3QrfStXSU6fGdOGPcETocIAq", "name": "math:calculator", "status": "STARTED", "parameters": [{ "name": "expression", "value": "150 * 1.60934" }], "result": null, "files": null, "message": null, "error": null, "usage": null, "timestamp": 1726277121 } event: tool_status data: { "tool_call_id": "call_3QrfStXSU6fGdOGPcETocIAq", "name": "math:calculator", "status": "ENDED", "parameters": [{ "name": "expression", "value": "150 * 1.60934" }], "result": "\"{\\\"result\\\": \\\"150 * 1.60934=241.401000000000\\\"}\"", "files": null, "message": null, "error": null, "usage": null, "timestamp": 1726277121 } data: { "id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941", "model": "meta-llama-3.1-8b-instruct", "object": "chat.completion.chunk", "choices": [ { "index": 0, "delta": { "role": "assistant", "content": "To" }, "finish_reason": null, "logprobs": null } ], "created": 1726277121 } ... data: { "id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941", "model": "meta-llama-3.1-8b-instruct", "object": "chat.completion.chunk", "choices": [ { "index": 0, "delta": { "role": "assistant", "content": "." 
}, "finish_reason": null, "logprobs": null } ], "created": 1726277121 } data: { "id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941", "model": "meta-llama-3.1-8b-instruct", "object": "chat.completion.chunk", "choices": [ { "index": 0, "delta": {}, "finish_reason": "stop", "logprobs": null } ], "created": 1726277121 } data: [DONE] ``` ```json Multiple tools event: tool_status data: { "tool_call_id": "call_5X9KQ52bV3CUigqHWleTzD9A", "name": "code:python-interpreter", "status": "STARTED", "parameters": [{ "name": "code", "value": "def is_prime(n): ... \n" }], "result": null, "files": null, "message": null, "error": null, "usage": null, "timestamp": 1726277008 } event: tool_status data: { "tool_call_id": "call_5X9KQ52bV3CUigqHWleTzD9A", "name": "code:python-interpreter", "status": "ENDED", "parameters": [{ "name": "code", "value": "def is_prime(n): ... \n" }], "result": "\"[2, 3, 5, 7, 11, 13, 17]\\n\"", "files": [], "message": null, "error": null, "usage": null, "timestamp": 1726277011 } data: { "id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941", "model": "meta-llama-3.1-8b-instruct", "object": "chat.completion.chunk", "choices": [ { "index": 0, "delta": { "role": "assistant", "content": "Now" }, "finish_reason": null, "logprobs": null } ], "created": 1726277011 } ... event: tool_status data: { "tool_call_id": "call_FgfZYpRoDdPtz3QwLrLZIhdP", "name": "math:calculator", "status": "STARTED", "parameters": [{ "name": "expression", "value": "2 * 3 * 5 * 7 * 11 * 13 * 17" }], "result": null, "files": null, "message": null, "error": null, "usage": null, "timestamp": 1726277012 } event: tool_status data: { "tool_call_id": "call_FgfZYpRoDdPtz3QwLrLZIhdP", "name": "math:calculator", "status": "ENDED", "parameters": [{ "name": "expression", "value": "2 * 3 * 5 * 7 * 11 * 13 * 17" }], "result": "\"{\\\"result\\\": \\\"2 * 3 * 5 * 7 * 11 * 13 * 17=510510\\\"}\"", "files": null, "message": null, "error": null, "usage": null, "timestamp": 1726277016 } data: { "id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941", "model": "meta-llama-3.1-8b-instruct", "object": "chat.completion.chunk", "choices": [ { "index": 0, "delta": { "role": "assistant", "content": "The" }, "finish_reason": null, "logprobs": null } ], "created": 1726277016 } ... data: { "id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941", "model": "meta-llama-3.1-8b-instruct", "object": "chat.completion.chunk", "choices": [ { "index": 0, "delta": { "role": "assistant", "content": "." }, "finish_reason": null, "logprobs": null } ], "created": 1726277016 } data: { "id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941", "model": "meta-llama-3.1-8b-instruct", "object": "chat.completion.chunk", "choices": [ { "index": 0, "delta": {}, "finish_reason": "stop", "logprobs": null } ], "created": 1726277016 } data: [DONE] ``` ```json With custom tool event: tool_status data: { "tool_call_id": "call_iryDFgBCcNoc2ICXuuyZqQUe", "name": "web:search", "status": "STARTED", "parameters": [{ "name": "query", "value": "tallest buildings in the world" }], "result": null, "files": null, "message": null, "error": null, "usage": null, "timestamp": 1726294660 } event: tool_status data: { "tool_call_id": "call_iryDFgBCcNoc2ICXuuyZqQUe", "name": "web:search", "status": "UPDATING", "parameters": [{ "name": "query", "value": "tallest buildings in the world" }], "result": "https://en.wikipedia.org/wiki/List_of_tallest_buildings", "files": null, "message": null, "error": null, "usage": null, "timestamp": 1726294666 } ... 
event: tool_status data: { "tool_call_id": "call_iryDFgBCcNoc2ICXuuyZqQUe", "name": "web:search", "status": "ENDED", "parameters": [{ "name": "query", "value": "tallest buildings in the world" }], "result": "['https://en.wikipedia.org/wiki/List_of_tallest_buildings', ...]", "files": null, "message": null, "error": null, "usage": null, "timestamp": 1726294671 } data: { "id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941", "model": "meta-llama-3.1-8b-instruct", "object": "chat.completion.chunk", "choices": [ { "index": 0, "delta": { "role": "assistant", "content": "The" }, "finish_reason": null, "logprobs": null } ], "created": 1726294672 } ... data: { "id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941", "model": "meta-llama-3.1-8b-instruct", "object": "chat.completion.chunk", "choices": [ { "index": 0, "delta": { "role": "assistant", "tool_calls": [ { "index": 0, "id": "call_yuvrTUk4O2Uh7Hns5ieUcu1S", "type": "function", "function": { "name": "func", "arguments": "{\"" } } ] }, "finish_reason": null, "logprobs": null } ], "created": 1726294673 } data: { "id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941", "model": "meta-llama-3.1-8b-instruct", "object": "chat.completion.chunk", "choices": [ { "index": 0, "delta": { "role": "assistant", "tool_calls": [ { "index": 0, "type": "function", "function": { "arguments": "arg" } } ] }, "finish_reason": null, "logprobs": null } ], "created": 1726294673 } ... data: { "id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941", "model": "meta-llama-3.1-8b-instruct", "object": "chat.completion.chunk", "choices": [ { "index": 0, "delta": { "role": "assistant", "tool_calls": [ { "index": 0, "type": "function", "function": { "arguments": "}" } } ] }, "finish_reason": null, "logprobs": null } ], "created": 1726294673 } data: { "id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941", "model": "meta-llama-3.1-8b-instruct", "object": "chat.completion.chunk", "choices": [ { "index": 0, "delta": {}, "finish_reason": "tool_calls", "logprobs": null } ], "created": 1726294673 } data: [DONE] ``` A unique ID of the chat completion. The object type, which is always set to `chat.completion.chunk`. The model used to generate the completion. The index of the choice in the list of generated choices. Role of the generated message author, in this case `assistant`. The contents of the assistant message. The index of the tool call being generated. The ID of the tool call. The type of the tool, which is always set to `function`. The name of the function to call. The arguments for calling the function, generated by the model in JSON format. Be sure to validate these arguments in your code before invoking the function, since the model may not always produce valid JSON. Termination condition of the generation. `stop` means the API returned the full chat completions generated by the model without running into any limits. `length` means the generation exceeded `max_tokens` or the conversation exceeded the max context length. `tool_calls` means the API has generated tool calls. Available options: `stop`, `length`, `tool_calls` Log probability information for the choice. A list of message content tokens with log probability information. The token. The log probability of this token. A list of integers representing the UTF-8 bytes representation of the token. Useful in instances where characters are represented by multiple tokens and their byte representations must be combined to generate the correct text representation. Can be `null` if there is no bytes representation for the token.
List of the most likely tokens and their log probability, at this token position. The token. The log probability of this token. A list of integers representing the UTF-8 bytes representation of the token. Useful in instances where characters are represented by multiple tokens and their byte representations must be combined to generate the correct text representation. Can be `null` if there is no bytes representation for the token. Number of tokens in the prompt. Number of tokens in the generated chat completions. Total number of tokens used in the request (`prompt_tokens` + `completion_tokens`). The Unix timestamp (in seconds) for when the token was sampled. ### `event: tool_status` chunk object `event: tool_status` tracks the execution progress of built-in tools, such as calculator or web search functions. It provides real-time updates on their status and results. The ID of the tool call. The name of the built-in tool. Available options: `math:calculator`, `math:statistics`, `math:calendar`, `web:search`, `web:url`, `code:python-interpreter`, `file:text` Indicates the current execution status of the tool. Available options: `STARTED`, `UPDATING`, `ENDED`, `ERRORED` The name of the tool's function parameter. The value of the tool's function parameter. The output from the tool's execution. The name of the file generated by the tool's execution. URL of the file generated by the tool's execution. Message generated by the tool's execution. The type of error encountered during the tool's execution. The error message. The Unix timestamp (in seconds) for when the event occurred. # LangChain Node.js SDK Source: https://friendli.ai/docs/sdk/integrations/langchain/nodejs Utilize the LangChain Node.js SDK with FriendliAI for seamless integration and enhanced tool calling capabilities in your applications. You can use [**LangChain Node.js SDK**](https://github.com/langchain-ai/langchainjs) to interact with FriendliAI. This makes migration of existing applications already using LangChain particularly easy. ## How to use Before you start, ensure you've already obtained the `FRIENDLI_TOKEN` from the [Friendli Suite](https://friendli.ai/suite/setting/tokens). Our products are entirely compatible with OpenAI, so we use the `@langchain/openai` package by referring to the FriendliAI `baseURL`. ```bash npm npm i @langchain/core @langchain/openai ``` ```bash yarn yarn add @langchain/core @langchain/openai ``` ```bash pnpm pnpm add @langchain/core @langchain/openai ``` ### Instantiation Now we can instantiate our model object and generate chat completions. We provide usage examples for each type of endpoint.
Choose the one that best suits your needs: ```js Serverless Endpoints import { ChatOpenAI } from "@langchain/openai"; const model = new ChatOpenAI({ model: "meta-llama-3.1-8b-instruct", apiKey: process.env.FRIENDLI_TOKEN, configuration: { baseURL: "https://api.friendli.ai/serverless/v1", }, }); ``` ```js Dedicated Endpoints import { ChatOpenAI } from "@langchain/openai"; const model = new ChatOpenAI({ model: "YOUR_ENDPOINT_ID", apiKey: process.env.FRIENDLI_TOKEN, configuration: { baseURL: "https://api.friendli.ai/dedicated/v1", }, }); ``` ```js Fine-tuned Dedicated Endpoints import { ChatOpenAI } from "@langchain/openai"; const model = new ChatOpenAI({ model: "YOUR_ENDPOINT_ID:YOUR_ADAPTER_ROUTE", apiKey: process.env.FRIENDLI_TOKEN, configuration: { baseURL: "https://api.friendli.ai/dedicated/v1", }, }); ``` ### Runnable interface We support both synchronous and asynchronous runnable methods to generate a response. ```js import { HumanMessage, SystemMessage } from "@langchain/core/messages"; const messages = [ new SystemMessage("Translate the following from English into Italian"), new HumanMessage("hi!"), ]; const result = await model.invoke(messages); console.log(result); ``` ### Chaining We can chain our model with a prompt template. Prompt templates convert raw user input to better input to the LLM. ```javascript import { ChatPromptTemplate } from "@langchain/core/prompts"; const prompt = ChatPromptTemplate.fromMessages([ ["system", "You are a world class technical documentation writer."], ["user", "{input}"], ]); const chain = prompt.pipe(model); console.log( await chain.invoke({ input: "how can langsmith help with testing?" }) ); ``` To get the string value instead of the message, we can add an output parser to the chain. ```javascript import { StringOutputParser } from "@langchain/core/output_parsers"; const outputParser = new StringOutputParser(); const chain = prompt.pipe(model).pipe(outputParser); console.log( await chain.invoke({ input: "how can langsmith help with testing?" }) ); ``` ### Tool calling Describe tools and their parameters, and let the model return a tool to invoke with the input arguments. Tool calling is extremely useful for enhancing the model's capability to provide more comprehensive and actionable responses. #### Define tools to use We can define tools with Zod schemas and use them to generate tool calls. ```bash npm npm i zod ``` ```bash yarn yarn add zod ``` ```bash pnpm pnpm add zod ``` ```js import { tool } from "@langchain/core/tools"; import { z } from "zod"; /** * Note that the descriptions here are crucial, as they will be passed along * to the model along with the class name.
*/ const calculatorSchema = z.object({ operation: z .enum(["add", "subtract", "multiply", "divide"]) .describe("The type of operation to execute."), number1: z.number().describe("The first number to operate on."), number2: z.number().describe("The second number to operate on."), }); const calculatorTool = tool( async ({ operation, number1, number2 }) => { // Functions must return strings if (operation === "add") { return `${number1 + number2}`; } else if (operation === "subtract") { return `${number1 - number2}`; } else if (operation === "multiply") { return `${number1 * number2}`; } else if (operation === "divide") { return `${number1 / number2}`; } else { throw new Error("Invalid operation."); } }, { name: "calculator", description: "Can perform mathematical operations.", schema: calculatorSchema, } ); console.log( await calculatorTool.invoke({ operation: "add", number1: 3, number2: 4 }) ); ``` #### Bind tools to the model Now models can generate a tool calling response. ```js const modelWithTools = model.bindTools([calculatorTool]); const messages = [new HumanMessage("What is 3 * 12? Also, what is 11 + 49?")]; const aiMessage = await modelWithTools.invoke(messages); console.log(aiMessage); ``` #### Generate a tool assisted message Use the tool call results to generate a message. ```js messages.push(aiMessage); const toolsByName = { calculator: calculatorTool, }; for (const toolCall of aiMessage.tool_calls) { const selectedTool = toolsByName[toolCall.name]; const toolMessage = await selectedTool.invoke(toolCall); messages.push(toolMessage); } console.log(await modelWithTools.invoke(messages)); ``` For more information on how to use tools, check out the [LangChain documentation](https://js.langchain.com/v0.2/docs/how_to/#tools). # LangChain Python SDK Source: https://friendli.ai/docs/sdk/integrations/langchain/python Utilize the LangChain Python SDK with FriendliAI for easy integration and advanced tool calling in your applications. You can use [**LangChain Python SDK**](https://github.com/langchain-ai/langchain) to interact with FriendliAI. This makes migration of existing applications already using LangChain particularly easy. ## How to use Before you start, ensure you've already obtained the `FRIENDLI_TOKEN` from the [Friendli Suite](https://friendli.ai/suite/setting/tokens). Our products are entirely compatible with OpenAI, so we use the `langchain-openai` package by referring to the FriendliAI `baseURL`. ```bash pip install -qU langchain-openai langchain ``` ### Instantiation Now we can instantiate our model object and generate chat completions. We provide usage examples for each type of endpoint. Choose the one that best suits your needs: ```python Serverless Endpoints import os from langchain_openai import ChatOpenAI llm = ChatOpenAI( model="meta-llama-3.1-8b-instruct", base_url="https://api.friendli.ai/serverless/v1", api_key=os.environ["FRIENDLI_TOKEN"], ) ``` ```python Dedicated Endpoints import os from langchain_openai import ChatOpenAI llm = ChatOpenAI( model="YOUR_ENDPOINT_ID", base_url="https://api.friendli.ai/dedicated/v1", api_key=os.environ["FRIENDLI_TOKEN"], ) ``` ```python Fine-tuned Dedicated Endpoints import os from langchain_openai import ChatOpenAI llm = ChatOpenAI( model="YOUR_ENDPOINT_ID:YOUR_ADAPTER_ROUTE", base_url="https://api.friendli.ai/dedicated/v1", api_key=os.environ["FRIENDLI_TOKEN"], ) ``` ### Runnable interface We support both synchronous and asynchronous runnable methods to generate a response. 
#### Synchronous methods: ```python invoke result = llm.invoke("Tell me a joke.") print(result.content) ``` ```python stream for chunk in llm.stream("Tell me a joke."): print(chunk.content, end="", flush=True) ``` ```python batch for r in llm.batch(["Tell me a joke.", "Tell me a useless fact."]): print(r.content, "\n\n") ``` #### Asynchronous methods: ```python ainvoke result = await llm.ainvoke("Tell me a joke.") print(result.content) ``` ```python astream async for chunk in llm.astream("Tell me a joke."): print(chunk.content, end="", flush=True) ``` ```python abatch for r in await llm.abatch(["Tell me a joke.", "Tell me a useless fact."]): print(r.content, "\n\n") ``` ### Chaining We can [chain](https://python.langchain.com/v0.2/docs/how_to/sequence) our model with a prompt template. Prompt templates convert raw user input to better input to the LLM. ```python from langchain_core.prompts import ChatPromptTemplate prompt = ChatPromptTemplate.from_messages([ ("system", "You are a world class technical documentation writer."), ("user", "{input}") ]) chain = prompt | llm print(chain.invoke({"input": "how can langsmith help with testing?"})) ``` To get the string value instead of the message, we can add an output parser to the chain. ```python from langchain_core.output_parsers import StrOutputParser output_parser = StrOutputParser() chain = prompt | llm | output_parser print(chain.invoke({"input": "how can langsmith help with testing?"})) ``` ### Tool calling Describe tools and their parameters, and let the model return a tool to invoke with the input arguments. Tool calling is extremely useful for enhancing the model's capability to provide more comprehensive and actionable responses. #### Define tools to use The `@tool` decorator is used to define a tool. If you set `parse_docstring=True`, the tool will parse the docstring to extract the information of arguments. ```python Default from langchain_core.tools import tool @tool def add(a: int, b: int) -> int: """Adds a and b.""" return a + b @tool def multiply(a: int, b: int) -> int: """Multiplies a and b.""" return a * b tools = [add, multiply] ``` ```python Parse Docstring from langchain_core.tools import tool @tool(parse_docstring=True) def add(a: int, b: int) -> int: """Adds a and b. Args: a: The first integer. b: The second integer. """ return a + b @tool(parse_docstring=True) def multiply(a: int, b: int) -> int: """Multiplies a and b. Args: a: The first integer. b: The second integer. """ return a * b tools = [add, multiply] ``` #### Bind tools to the model Now models can generate a tool calling response. ```python import os from langchain_openai import ChatOpenAI llm = ChatOpenAI( model="meta-llama-3.1-8b-instruct", base_url="https://api.friendli.ai/serverless/v1", api_key=os.environ["FRIENDLI_TOKEN"], ) llm_with_tools = llm.bind_tools(tools) query = "What is 3 * 12? Also, what is 11 + 49?" print(llm_with_tools.invoke(query).tool_calls) ``` #### Generate a tool assisted message Use the tool call results to generate a message. 
```python from langchain_core.messages import HumanMessage, ToolMessage messages = [HumanMessage(query)] ai_msg = llm_with_tools.invoke(messages) messages.append(ai_msg) for tool_call in ai_msg.tool_calls: selected_tool = {"add": add, "multiply": multiply}[tool_call["name"].lower()] tool_output = selected_tool.invoke(tool_call["args"]) messages.append(ToolMessage(tool_output, tool_call_id=tool_call["id"])) print(llm_with_tools.invoke(messages)) ``` For more information on how to use tools, check out the [LangChain documentation](https://python.langchain.com/v0.2/docs/how_to/#tools). # LiteLLM Source: https://friendli.ai/docs/sdk/integrations/litellm LiteLLM SDK supports all FriendliAI models, offering easy integration with serverless, dedicated, and fine-tuned endpoints. You can use [**LiteLLM**](https://github.com/BerriAI/litellm) to interact with FriendliAI. This makes migration of existing applications already using LiteLLM particularly easy. ## How to use Before you start, ensure you've already obtained the `FRIENDLI_TOKEN` from the [Friendli Suite](https://friendli.ai/suite/setting/tokens). Add `friendliai` prefix to your endpoint name for the `model` parameter. ### Chat completion We provide usage examples for each type of endpoint. Choose the one that best suits your needs. You can specify one of the [available models](/guides/serverless_endpoints/text-generation#model-supports) for the serverless endpoints. ```python Serverless Endpoints import os from litellm import completion os.environ['FRIENDLI_TOKEN'] = "YOUR_FRIENDLI_TOKEN" response = completion( model="friendliai/meta-llama-3.3-70b-instruct", messages=[ {"role": "user", "content": "hello from litellm"} ], ) print(response) ``` ```python Dedicated Endpoints import os from litellm import completion os.environ['FRIENDLI_TOKEN'] = "YOUR_FRIENDLI_TOKEN" os.environ['FRIENDLI_API_BASE'] = "https://api.friendli.ai/dedicated/v1" response = completion( model="friendliai/YOUR_ENDPOINT_ID", messages=[ {"role": "user", "content": "hello from litellm"} ], ) print(response) ``` ```python Fine-tuned Dedicated Endpoints import os from litellm import completion os.environ['FRIENDLI_TOKEN'] = "YOUR_FRIENDLI_TOKEN" os.environ['FRIENDLI_API_BASE'] = "https://api.friendli.ai/dedicated/v1" response = completion( model="friendliai/YOUR_ENDPOINT_ID:YOUR_ADAPTER_ROUTE", messages=[ {"role": "user", "content": "hello from litellm"} ], ) print(response) ``` ### Chat completion - Streaming ```python Serverless Endpoints import os from litellm import completion os.environ['FRIENDLI_TOKEN'] = "YOUR_FRIENDLI_TOKEN" response = completion( model="friendliai/meta-llama-3.3-70b-instruct", messages=[ {"role": "user", "content": "hello from litellm"} ], stream=True ) for chunk in response: print(chunk) ``` ```python Dedicated Endpoints import os from litellm import completion os.environ['FRIENDLI_TOKEN'] = "YOUR_FRIENDLI_TOKEN" os.environ['FRIENDLI_API_BASE'] = "https://api.friendli.ai/dedicated/v1" response = completion( model="friendliai/YOUR_ENDPOINT_ID", messages=[ {"role": "user", "content": "hello from litellm"} ], stream=True ) for chunk in response: print(chunk) ``` ```python Fine-tuned Dedicated Endpoints import os from litellm import completion os.environ['FRIENDLI_TOKEN'] = "YOUR_FRIENDLI_TOKEN" os.environ['FRIENDLI_API_BASE'] = "https://api.friendli.ai/dedicated/v1" response = completion( model="friendliai/YOUR_ENDPOINT_ID:YOUR_ADAPTER_ROUTE", messages=[ {"role": "user", "content": "hello from litellm"} ], stream=True ) for chunk in response: 
print(chunk) ``` # LlamaIndex Source: https://friendli.ai/docs/sdk/integrations/llamaindex Easily integrate large language models with the LlamaIndex SDK, featuring FriendliAI for seamless interaction. You can use [**LlamaIndex**](https://github.com/run-llama/llama_index) to interact with FriendliAI. This makes migration of existing applications already using LlamaIndex particularly easy. ## How to use Before you start, ensure you've already obtained the `FRIENDLI_TOKEN` from the [Friendli Suite](https://friendli.ai/suite/setting/tokens). ```bash pip install llama-index llama-index-llms-friendli ``` ### Instantiation Now we can instantiate our model object and generate chat completions. The default model (i.e. `meta-llama-3.3-70b-instruct`) will be used if no model is specified. ```python import os from llama_index.llms.friendli import Friendli os.environ['FRIENDLI_TOKEN'] = "YOUR_FRIENDLI_TOKEN" llm = Friendli(model="meta-llama-3.3-70b-instruct") ``` ### Chat completion Generate a response from a given conversation. ```python Default from llama_index.core.llms import ChatMessage, MessageRole message = ChatMessage(role=MessageRole.USER, content="Tell me a joke.") resp = llm.chat([message]) print(resp) ``` ```python Streaming from llama_index.core.llms import ChatMessage, MessageRole message = ChatMessage(role=MessageRole.USER, content="Tell me a joke.") resp = llm.stream_chat([message]) for r in resp: print(r.delta, end="") ``` ```python Async from llama_index.core.llms import ChatMessage, MessageRole message = ChatMessage(role=MessageRole.USER, content="Tell me a joke.") resp = await llm.achat([message]) print(resp) ``` ```python Async Streaming from llama_index.core.llms import ChatMessage, MessageRole message = ChatMessage(role=MessageRole.USER, content="Tell me a joke.") resp = await llm.astream_chat([message]) async for r in resp: print(r.delta, end="") ``` ### Completion Generate a response from a given prompt. ```python Default prompt = "Draft a cover letter for a role in software engineering." resp = llm.complete(prompt) print(resp) ``` ```python Streaming prompt = "Draft a cover letter for a role in software engineering." resp = llm.stream_complete(prompt) for r in resp: print(r.delta, end="") ``` ```python Async prompt = "Draft a cover letter for a role in software engineering." resp = await llm.acomplete(prompt) print(resp) ``` ```python Async Streaming prompt = "Draft a cover letter for a role in software engineering." resp = await llm.astream_complete(prompt) async for r in resp: print(r.delta, end="") ``` # OpenAI Node.js SDK Source: https://friendli.ai/docs/sdk/integrations/openai/nodejs Easily integrate FriendliAI with the OpenAI Node.js SDK. You can use [**OpenAI Node.js SDK**](https://github.com/openai/openai-node) to interact with FriendliAI. This makes migration of existing applications already using OpenAI particularly easy. ## How to use Before you start, ensure the `baseURL` and `apiKey` refer to FriendliAI. Since our products are entirely compatible with the OpenAI SDK, you can follow the examples below. Choose one of the [available models](/guides/serverless_endpoints/text-generation#model-supports) for the `model` parameter. ```bash npm npm i openai ``` ```bash yarn yarn add openai ``` ```bash pnpm pnpm add openai ``` ### Chat Completion Chat completion API that generates a response from a given conversation. We provide multiple usage examples.
Try to find the best one that aligns with your needs: ```ts Default import OpenAI from "openai"; const client = new OpenAI({ baseURL: "https://api.friendli.ai/serverless/v1", apiKey: process.env.FRIENDLI_TOKEN, }); async function main() { const completion = await client.chat.completions.create({ model: "meta-llama-3.1-8b-instruct", messages: [ { role: "system", content: "You are a helpful assistant." }, { role: "user", content: "Hello!" }, ], }); console.log(completion.choices[0]); } main(); ``` ```ts Streaming import OpenAI from "openai"; const client = new OpenAI({ baseURL: "https://api.friendli.ai/serverless/v1", apiKey: process.env.FRIENDLI_TOKEN, }); async function main() { const completion = await client.chat.completions.create({ model: "meta-llama-3.1-8b-instruct", messages: [ { role: "system", content: "You are a helpful assistant." }, { role: "user", content: "Hello!" }, ], stream: true, }); for await (const chunk of completion) { console.log(chunk.choices[0].delta.content); } } main(); ``` ```ts Functions import OpenAI from "openai"; const client = new OpenAI({ baseURL: "https://api.friendli.ai/serverless/v1", apiKey: process.env.FRIENDLI_TOKEN, }); async function main() { const messages = [ { role: "user", content: "What's the weather like in Boston today?" }, ]; const tools = [ { type: "function", function: { name: "get_current_weather", description: "Get the current weather in a given location", parameters: { type: "object", properties: { location: { type: "string", description: "The city and state, e.g. San Francisco, CA", }, unit: { type: "string", enum: ["celsius", "fahrenheit"] }, }, required: ["location"], }, }, }, ]; const completion = await client.chat.completions.create({ model: "meta-llama-3.1-8b-instruct", messages: messages, tools: tools, tool_choice: "auto", }); console.log(completion); } main(); ``` ```ts Logprobs import OpenAI from "openai"; const client = new OpenAI({ baseURL: "https://api.friendli.ai/serverless/v1", apiKey: process.env.FRIENDLI_TOKEN, }); async function main() { const completion = await client.chat.completions.create({ model: "meta-llama-3.1-8b-instruct", messages: [{ role: "user", content: "Hello!" }], logprobs: true, top_logprobs: 2, }); console.log(completion.choices[0].message); console.log(completion.choices[0].logprobs); } main(); ``` ### Tool assisted chat completion This feature is in Beta and available only on the **Serverless Endpoints**. Using tool assisted chat completion API, models can utilize built-in tools prepared for tool calls, enhancing its capability to provide more comprehensive and actionable responses. Available tools are listed [here](/guides/serverless_endpoints/tool-assisted-api#built-in-tools). 
```ts Basic import OpenAI from "openai"; const client = new OpenAI({ baseURL: "https://api.friendli.ai/serverless/tools/v1", apiKey: process.env.FRIENDLI_TOKEN, }); async function main() { const messages = [ { role: "user", content: "What is the current average home price in New York City, and if I put 15% down, how much will my mortgage be?", }, ]; const tools = [{ type: "code:python-interpreter" }, { type: "web:search" }]; const completion = await client.chat.completions.create({ model: "meta-llama-3.1-8b-instruct", messages: messages, tools: tools, tool_choice: "auto", stream: true, }); for await (const chunk of completion) { if (chunk.choices === undefined) { console.log(`event: ${chunk.event}, data: ${JSON.stringify(chunk.data)}`); } else { console.log(chunk.choices[0].delta.content); } } } main(); ``` ```ts Advanced (REPL) import OpenAI from "openai"; import * as readline from "node:readline/promises"; const client = new OpenAI({ baseURL: "https://api.friendli.ai/serverless/tools/v1", apiKey: process.env.FRIENDLI_TOKEN, }); const terminal = readline.createInterface({ input: process.stdin, output: process.stdout, }); async function chatbot(input) { const stream = await client.chat.completions.create({ model: "meta-llama-3.1-8b-instruct", messages: [{ role: "user", content: input }], tools: [ { type: "web:url" }, { type: "code:python-interpreter" }, { type: "math:calculator" }, { type: "web:search" }, ], tool_choice: "auto", stream: true, }); for await (const chunk of stream) { if (chunk.choices === undefined) { if (chunk.event === "tool_status") { if (chunk.data.result !== "") { switch (chunk.data.status) { case "STARTED": terminal.write( `⚒️ TOOL CALL: ${chunk.data.name}(${JSON.stringify( chunk.data.parameters )})` ); break; case "ENDED": terminal.write(`🔧 TOOL RESULT: ${chunk.data.result}`); break; case "ERRORED": terminal.write(`🔧 TOOL ERROR: ${chunk.data.error}`); break; case "UPDATING": terminal.write(`🔧 TOOL UPDATE: ${chunk.data.result}`); break; default: terminal.write(`Unknown tool status: ${chunk.data}`); } } terminal.write("\n"); } else { terminal.write(`Unknown event: ${JSON.stringify(chunk)}`); } } else { terminal.write(chunk.choices[0]?.delta?.content || ""); } } terminal.write("\n"); } while (true) { const input = await terminal.question("You: "); terminal.write(" "); await chatbot(input); } ``` # OpenAI Python SDK Source: https://friendli.ai/docs/sdk/integrations/openai/python Integrate FriendliAI with OpenAI Python SDK for chat, streaming, and more. You can use [**OpenAI Python SDK**](https://github.com/openai/openai-python) to interact with FriendliAI. This makes migration of existing applications already using OpenAI particularly easy. ## How to use Before you start, ensure the `base_url` and `api_key` refer to FriendliAI. Since our products are entirely compatible with the OpenAI SDK, you can follow the examples below. Choose one of the [available models](/guides/serverless_endpoints/text-generation#model-supports) for the `model` parameter. ```bash pip install -qU openai ``` ### Chat Completion Chat completion API that generates a response from a given conversation. We provide multiple usage examples. Try to find the best one that aligns with your needs.
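All of the examples in this section call Serverless Endpoints. If you are serving a model on Dedicated Endpoints instead, the same client works once `base_url` points at the dedicated API and your endpoint ID is passed as the `model`. This is a minimal sketch that follows the pattern used by the other SDK guides in this document (`YOUR_ENDPOINT_ID` is a placeholder for your own endpoint ID):

```python Dedicated Endpoints (sketch)
import os
from openai import OpenAI

# Same OpenAI client, pointed at the Dedicated Endpoints API instead of serverless.
client = OpenAI(
    base_url="https://api.friendli.ai/dedicated/v1",
    api_key=os.environ.get("FRIENDLI_TOKEN")
)

completion = client.chat.completions.create(
    model="YOUR_ENDPOINT_ID",  # the ID of your dedicated endpoint
    messages=[
        {"role": "system", "content": "You are a helpful assistant."},
        {"role": "user", "content": "Hello!"}
    ]
)
print(completion.choices[0].message)
```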
```python Default import os from openai import OpenAI client = OpenAI( base_url="https://api.friendli.ai/serverless/v1", api_key=os.environ.get("FRIENDLI_TOKEN") ) completion = client.chat.completions.create( model="meta-llama-3.1-8b-instruct", messages=[ {"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Hello!"} ] ) print(completion.choices[0].message) ``` ```python Streaming import os from openai import OpenAI client = OpenAI( base_url="https://api.friendli.ai/serverless/v1", api_key=os.environ.get("FRIENDLI_TOKEN") ) completion = client.chat.completions.create( model="meta-llama-3.1-8b-instruct", messages=[ {"role": "system", "content": "You are a helpful assistant."}, {"role": "user", "content": "Hello!"} ], stream=True ) for chunk in completion: print(chunk.choices[0].delta) ``` ```python Functions import os from openai import OpenAI client = OpenAI( base_url="https://api.friendli.ai/serverless/v1", api_key=os.environ.get("FRIENDLI_TOKEN") ) tools = [ { "type": "function", "function": { "name": "get_current_weather", "description": "Get the current weather in a given location", "parameters": { "type": "object", "properties": { "location": { "type": "string", "description": "The city and state, e.g. San Francisco, CA", }, "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]}, }, "required": ["location"], }, } } ] completion = client.chat.completions.create( model="meta-llama-3.1-8b-instruct", messages=[ {"role": "user", "content": "What's the weather like in Boston today?"} ], tools=tools, tool_choice="auto" ) print(completion) ``` ```python Logprobs import os from openai import OpenAI client = OpenAI( base_url="https://api.friendli.ai/serverless/v1", api_key=os.environ.get("FRIENDLI_TOKEN") ) completion = client.chat.completions.create( model="meta-llama-3.1-8b-instruct", messages=[ {"role": "user", "content": "Hello!"} ], logprobs=True, top_logprobs=2 ) print(completion.choices[0].message) print(completion.choices[0].logprobs) ``` ### Tool assisted chat completion This feature is in Beta and available only on the **Serverless Endpoints**. Using tool assisted chat completion API, models can utilize built-in tools prepared for tool calls, enhancing its capability to provide more comprehensive and actionable responses. Available tools are listed [here](/guides/serverless_endpoints/tool-assisted-api#built-in-tools). 
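The examples below consume the response as a stream. The API reference above notes that requests without the `stream` option return a regular `application/json` chat completion, so a non-streaming call is also possible. This is a minimal hedged sketch of that form, mirroring the `tool_assisted_chat.complete` call in the Friendli Python SDK guide later in this document; check the API reference for the exact response shape:

```python Non-streaming (sketch)
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.friendli.ai/serverless/tools/v1",
    api_key=os.environ.get("FRIENDLI_TOKEN")
)

# Without stream=True, the endpoint should return a single chat completion
# object after the built-in tool calls have finished running.
completion = client.chat.completions.create(
    model="meta-llama-3.1-8b-instruct",
    messages=[{"role": "user", "content": "What is 3 + 6?"}],
    tools=[{"type": "math:calculator"}]
)
print(completion.choices[0].message.content)
```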
```python Basic import os from openai import OpenAI client = OpenAI( base_url="https://api.friendli.ai/serverless/tools/v1", api_key=os.environ.get("FRIENDLI_TOKEN") ) stream = client.chat.completions.create( model="meta-llama-3.1-8b-instruct", messages=[{"role": "user", "content": "What is the current average home price in New York City, and if I put 15% down, how much will my mortgage be?"}], tools=[ {"type": "web:search"}, {"type": "math:calculator"}, ], stream=True, ) for chunk in stream: if chunk.choices is None: print(f"{chunk.event=}, {chunk.data=}") elif chunk.choices[0].delta.content is not None: print(chunk.choices[0].delta.content, end="") ``` ```python Advanced (REPL) import os from openai import OpenAI client = OpenAI( base_url="https://api.friendli.ai/serverless/tools/v1", api_key=os.environ.get("FRIENDLI_TOKEN") ) class bcolors: OKBLUE = '\033[94m' OKCYAN = '\033[96m' FAIL = '\033[91m' WHITE = '\033[97m' def print_response(response): print(f"{bcolors.OKCYAN}{response}", end='') def print_tool_call(data): print(f"\n{bcolors.OKBLUE}⚒️ TOOL CALL: { data['name']}({data['parameters']})") def print_tool_result(data): print(f"{bcolors.OKBLUE}🔧 TOOL RESULT: {data['result']}") def print_tool_error(data): print(f"{bcolors.FAIL}🔧 TOOL ERROR: {data['error']}", end='') def print_tool_update(data): print(f"{bcolors.OKBLUE}🔧 TOOL UPDATE: {data['result']}") def chatbot(prompt): stream = client.chat.completions.create( model="meta-llama-3.1-8b-instruct", messages=[{"role": "user", "content": prompt}], stream=True, tools=[ {"type": "web:url"}, {"type": "code:python-interpreter"}, {"type": "math:calculator"}, {"type": "web:search"} ] ) for chunk in stream: if chunk.choices is None: if chunk.event == "tool_status": match chunk.data: case {"status": "STARTED"}: print_tool_call(chunk.data) case {"status": "ENDED"}: print_tool_result(chunk.data) case {"status": "ERRORED"}: print_tool_error(chunk.data) case {"status": "UPDATING"}: print_tool_update(chunk.data) elif chunk.choices[0].delta.content is not None: print_response(chunk.choices[0].delta.content) print("\n") print("Welcome to the Tool Inference!") print("To exit, enter 'q'.") while True: user_input = input(f"{bcolors.WHITE}You: ") if user_input.lower() == 'q': break chatbot(user_input) ``` # Friendli Integrations Source: https://friendli.ai/docs/sdk/integrations/overview Effortlessly integrate FriendliAI models into your projects with support for popular SDKs and frameworks. ## Effortless AI integration with popular SDKs Friendli is committed to providing developers with flexible and powerful tools to integrate our AI models into their projects. We support a variety of popular SDKs and frameworks, making it easy to incorporate Friendli's capabilities into existing workflows and applications. Our integration options include LiteLLM for unified LLM interactions, Vercel AI SDK for seamless web application development, LangChain for building complex AI-driven applications, and an OpenAI-compatible API for those familiar with OpenAI's interface. These integrations enable developers to leverage Friendli's AI models across a wide range of use cases, from simple chat applications to sophisticated AI systems, all while maintaining ease of use and compatibility with existing tools and practices.
# Vercel AI SDK Source: https://friendli.ai/docs/sdk/integrations/vercel-ai-sdk Easily integrate FriendliAI models with the Vercel AI SDK, supporting serverless, dedicated, and fine-tuned endpoints. You can use [**Vercel AI SDK**](https://sdk.vercel.ai) to interact with FriendliAI. This makes migration of existing applications already using Vercel AI SDK particularly easy. ## How to use Before you start, ensure you've already obtained the `FRIENDLI_TOKEN` from the [Friendli Suite](https://friendli.ai/suite/setting/tokens). ```bash npm npm i ai @friendliai/ai-provider ``` ```bash yarn yarn add ai @friendliai/ai-provider ``` ```bash pnpm pnpm add ai @friendliai/ai-provider ``` ### Instantiation Instantiate your models using a Friendli provider instance. We provide usage examples for each type of endpoint. Choose the one that best suits your needs: ```ts Serverless Endpoints {4,7-9} import { friendli } from '@friendliai/ai-provider'; // Automatically select serverless endpoints const model = friendli("meta-llama-3.3-70b-instruct"); // Or specify a specific serverless endpoint const model = friendli("meta-llama-3.3-70b-instruct", { endpoint: "serverless", }); ``` ```ts Dedicated Endpoints {4,7-9} import { friendli } from '@friendliai/ai-provider'; // Replace YOUR_ENDPOINT_ID with the ID of your endpoint, e.g. "zbimjgovmlcb" const model = friendli("YOUR_ENDPOINT_ID"); // Specify a dedicated endpoint instead of auto-selecting const model = friendli("YOUR_ENDPOINT_ID", { endpoint: "dedicated", }); ``` ```ts Friendli Container {9} import { createFriendli } from "@friendliai/ai-provider"; const friendli = createFriendli({ // Update with the URL where your container is running. baseURL: "http://localhost:8000/v1", }); // Containers do not require a model id. const model = friendli(""); ``` ### Example: Generating text Generate a response with the `generateText` function: ```ts import { friendli } from "@friendliai/ai-provider"; import { generateText } from "ai"; const { text } = await generateText({ model: friendli("meta-llama-3.3-70b-instruct"), prompt: "Write a vegetarian lasagna recipe for 4 people.", }); console.log(text); ``` ### Example: Using Enforcing Patterns (Regex) Specify a specific pattern (e.g., CSV), character sets, or specific language characters (e.g., Korean Hangul characters) for your LLM's output. ```ts {6} import { friendli } from "@friendliai/ai-provider"; import { generateText } from "ai"; const { text } = await generateText({ model: friendli("meta-llama-3.3-70b-instruct", { regex: new RegExp("[\n ,.?!0-9\uac00-\ud7af]*"), }), prompt: "Who is the first king of the Joseon Dynasty?", }); console.log(text); ``` ### Example: Using built-in tools This feature is in Beta and available only on the **Serverless Endpoints**. Using tool assisted chat completion API, models can utilize built-in tools prepared for tool calls, enhancing its capability to provide more comprehensive and actionable responses. Available tools are listed [here](/guides/serverless_endpoints/tool-assisted-api#built-in-tools).
```ts {6-9} import { friendli } from "@friendliai/ai-provider"; import { streamText } from "ai"; const result = await streamText({ model: friendli("meta-llama-3.3-70b-instruct", { tools: [ {"type": "web:search"}, {"type": "math:calculator"}, ], }), prompt: "Find the current USD to CAD exchange rate and calculate how much $5,000 USD would be in Canadian dollars.", }); for await (const textPart of result.textStream) { console.log(textPart); } ``` ## OpenAI Compatibility You can also use `@ai-sdk/openai` as the APIs are OpenAI-compatible. ```ts import { createOpenAI } from '@ai-sdk/openai'; const friendli = createOpenAI({ baseURL: 'https://api.friendli.ai/serverless/v1', apiKey: process.env.FRIENDLI_TOKEN, }); ``` If you are using dedicated endpoints: ```ts import { createOpenAI } from '@ai-sdk/openai'; const friendli = createOpenAI({ baseURL: 'https://api.friendli.ai/dedicated/v1', apiKey: process.env.FRIENDLI_TOKEN, }); ``` ## Further resources * [Implementing a simple streaming chat with Next.js](https://sdk.vercel.ai/examples/next-app/basics/streaming-text-generation) * [Build a Next.js app with the Vercel AI SDK](https://sdk.vercel.ai/docs/getting-started/nextjs-app-router) * [Explore the Vercel AI SDK Core Reference](https://sdk.vercel.ai/docs/ai-sdk-core/overview) # FriendliAI + Weaviate (Node.js) Source: https://friendli.ai/docs/sdk/integrations/weaviate/nodejs Utilize Weaviate, an open-source vector database, to build applications with less hallucination. Integration with [**Weaviate**](https://github.com/weaviate/weaviate) enables performing Retrieval Augmented Generation (RAG) directly within the Weaviate database. This combines the power of [**Friendli Engine**](https://friendli.ai/solutions/engine) and Weaviate's efficient storage and fast retrieval capabilities to generate personalized and context-aware responses. ## How to use Before you start, ensure you've already obtained the `FRIENDLI_TOKEN` from the [Friendli Suite](https://friendli.ai/suite/setting/tokens). Also, set up your Weaviate instance following this [guide](https://weaviate.io/developers/weaviate/starter-guides/which-weaviate). Your Weaviate instance must be configured with the FriendliAI generative AI integration (`generative-friendliai`) module. ```bash npm npm i weaviate-client ``` ```bash yarn yarn add weaviate-client ``` ```bash pnpm pnpm add weaviate-client ``` ### Instantiation Now we can instantiate a [Weaviate collection](https://weaviate.io/developers/weaviate/manage-data/collections) using our model. We provide usage examples for each type of endpoint. Choose the one that best suits your needs. You can specify one of the [available models](/guides/serverless_endpoints/text-generation#model-supports) for the serverless endpoints. The default model (i.e. `meta-llama-3.3-70b-instruct`) will be used if no model is specified. ```ts Serverless Endpoints import weaviate from 'weaviate-client' const client = await weaviate.connectToWeaviateCloud( 'WEAVIATE_INSTANCE_URL', // your Weaviate instance URL { authCredentials: new weaviate.ApiKey('WEAVIATE_INSTANCE_APIKEY'), headers: { 'X-Friendli-Api-Key': process.env.FRIENDLI_TOKEN, } } ) await client.collections.create({ name: 'DemoCollection', generative: weaviate.configure.generative.friendliai({ model: 'meta-llama-3.3-70b-instruct' }), // Additional parameters ...
}); client.close() ``` ```ts Dedicated Endpoints import weaviate from 'weaviate-client' const client = await weaviate.connectToWeaviateCloud( 'WEAVIATE_INSTANCE_URL', // your Weaviate instance URL { authCredentials: new weaviate.ApiKey('WEAVIATE_INSTANCE_APIKEY'), headers: { 'X-Friendli-Api-Key': process.env.FRIENDLI_TOKEN, "X-Friendli-Baseurl": "https://api.friendli.ai/dedicated", } } ) await client.collections.create({ name: 'DemoCollection', generative: weaviate.configure.generative.friendliai({ model: 'YOUR_ENDPOINT_ID' }), // Additional parameters ... }); client.close() ``` ```ts Fine-tuned Dedicated Endpoints import weaviate from 'weaviate-client' const client = await weaviate.connectToWeaviateCloud( 'WEAVIATE_INSTANCE_URL', // your Weaviate instance URL { authCredentials: new weaviate.ApiKey('WEAVIATE_INSTANCE_APIKEY'), headers: { 'X-Friendli-Api-Key': process.env.FRIENDLI_TOKEN, "X-Friendli-Baseurl": "https://api.friendli.ai/dedicated", } } ) await client.collections.create({ name: 'DemoCollection', generative: weaviate.configure.generative.friendliai({ model: 'YOUR_ENDPOINT_ID:YOUR_ADAPTER_ROUTE' }), // Additional parameters ... }); client.close() ``` #### Configurable parameters Configure the following generative parameters to customize the model behavior. ```ts await client.collections.create({ name: 'DemoCollection', generative: weaviate.configure.generative.friendliai({ model: 'meta-llama-3.3-70b-instruct', maxTokens: 500, temperature: 0.7, }), // Additional parameters ... }); ``` ### Retrieval Augmented Generation After configuring Weaviate, perform RAG operations, either with the single prompt or grouped task method. #### Single prompt To generate text for each object in the search results, use the single prompt method. The example below generates outputs for each of the n search results, where n is specified by the limit parameter. When creating a single prompt query, use braces `{}` to interpolate the object properties you want Weaviate to pass on to the language model. For example, to pass on the object's title property, include `{title}` in the query. ```ts let myCollection = client.collections.get('DemoCollection'); const singlePromptResults = await myCollection.generate.nearText( ['A holiday film'], { singlePrompt: `Translate this into French: {title}`, }, { limit: 2, } ); for (const obj of singlePromptResults.objects) { console.log(obj.properties['title']); console.log(`Generated output: ${obj.generated}`); // Note that the generated output is per object } ``` #### Grouped task To generate one text for the entire set of search results, use the grouped task method. In other words, when you have n search results, the generative model generates one output for the entire group. ```ts let myCollection = client.collections.get('DemoCollection'); const groupedTaskResults = await myCollection.generate.nearText( ['A holiday film'], { groupedTask: `Write a fun tweet to promote readers to check out these films.`, }, { limit: 2, } ); console.log(`Generated output: ${groupedTaskResults.generated}`); // Note that the generated output is per query for (const obj of groupedTaskResults.objects) { console.log(obj.properties['title']); } ``` ### Further resources Once the integrations are configured at the collection, the data management and search operations in Weaviate work identically to any other collection. See the following model-agnostic examples: * [How-to manage data guides show how to perform data operations](https://weaviate.io/developers/weaviate/manage-data/create). 
* [How-to search guides show how to perform search operations](https://weaviate.io/developers/weaviate/search/basics). # FriendliAI + Weaviate (Python) Source: https://friendli.ai/docs/sdk/integrations/weaviate/python Utilize Weaviate, an open-source vector database, to build applications with less hallucination. Integration with [**Weaviate**](https://github.com/weaviate/weaviate) enables performing Retrieval Augmented Generation (RAG) directly within the Weaviate database. This combines the power of [**Friendli Engine**](https://friendli.ai/solutions/engine) and Weaviate's efficient storage and fast retrieval capabilities to generate personalized and context-aware responses. ## How to use Before you start, ensure you've already obtained the `FRIENDLI_TOKEN` from the [Friendli Suite](https://friendli.ai/suite/setting/tokens). Also, set up your Weaviate instance following this [guide](https://weaviate.io/developers/weaviate/starter-guides/which-weaviate). Your Weaviate instance must be configured with the FriendliAI generative AI integration (`generative-friendliai`) module. ```bash pip install -qU weaviate-client ``` ### Instantiation Now we can instantiate a [Weaviate collection](https://weaviate.io/developers/weaviate/manage-data/collections) using our model. We provide usage examples for each type of endpoint. Choose the one that best suits your needs. You can specify one of the [available models](/guides/serverless_endpoints/text-generation#model-supports) for the serverless endpoints. The default model (i.e. `meta-llama-3.3-70b-instruct`) will be used if no model is specified. ```python Serverless Endpoints import weaviate import os from weaviate.classes.init import Auth from weaviate.classes.config import Configure headers = { "X-Friendli-Api-Key": os.getenv("FRIENDLI_TOKEN"), } client = weaviate.connect_to_weaviate_cloud( cluster_url=weaviate_url, # `weaviate_url`: your Weaviate URL auth_credentials=Auth.api_key(weaviate_key), # `weaviate_key`: your Weaviate API key headers=headers ) client.collections.create( "DemoCollection", generative_config=Configure.Generative.friendliai( model = "meta-llama-3.3-70b-instruct", ) # Additional parameters not shown ) client.close() ``` ```python Dedicated Endpoints import weaviate import os from weaviate.classes.init import Auth from weaviate.classes.config import Configure headers = { "X-Friendli-Api-Key": os.getenv("FRIENDLI_TOKEN"), "X-Friendli-Baseurl": "https://api.friendli.ai/dedicated", } client = weaviate.connect_to_weaviate_cloud( cluster_url=weaviate_url, # `weaviate_url`: your Weaviate URL auth_credentials=Auth.api_key(weaviate_key), # `weaviate_key`: your Weaviate API key headers=headers ) client.collections.create( "DemoCollection", generative_config=Configure.Generative.friendliai( model = "YOUR_ENDPOINT_ID", ) # Additional parameters not shown ) client.close() ``` ```python Fine-tuned Dedicated Endpoints import weaviate import os from weaviate.classes.init import Auth from weaviate.classes.config import Configure headers = { "X-Friendli-Api-Key": os.getenv("FRIENDLI_TOKEN"), "X-Friendli-Baseurl": "https://api.friendli.ai/dedicated", } client = weaviate.connect_to_weaviate_cloud( cluster_url=weaviate_url, # `weaviate_url`: your Weaviate URL auth_credentials=Auth.api_key(weaviate_key), # `weaviate_key`: your Weaviate API key headers=headers ) client.collections.create( "DemoCollection", generative_config=Configure.Generative.friendliai( model = "YOUR_ENDPOINT_ID:YOUR_ADAPTER_ROUTE", ) # Additional parameters not shown )
client.close() ``` #### Configurable parameters Configure the following generative parameters to customize the model behavior. ```python from weaviate.classes.config import Configure client.collections.create( "DemoCollection", generative_config=Configure.Generative.friendliai( # These parameters are optional model = "meta-llama-3.3-70b-instruct", max_tokens = 500, temperature = 0.7, ) ) ``` ### Retrieval Augmented Generation After configuring Weaviate, perform RAG operations, either with the single prompt or grouped task method. #### Single prompt To generate text for each object in the search results, use the single prompt method. The example below generates outputs for each of the n search results, where n is specified by the limit parameter. When creating a single prompt query, use braces `{}` to interpolate the object properties you want Weaviate to pass on to the language model. For example, to pass on the object's title property, include `{title}` in the query. ```python collection = client.collections.get("DemoCollection") response = collection.generate.near_text( query="A holiday film", # The model provider integration will automatically vectorize the query single_prompt="Translate this into French: {title}", limit=2 ) for obj in response.objects: print(obj.properties["title"]) print(f"Generated output: {obj.generated}") # Note that the generated output is per object ``` #### Grouped task To generate one text for the entire set of search results, use the grouped task method. In other words, when you have n search results, the generative model generates one output for the entire group. ```python collection = client.collections.get("DemoCollection") response = collection.generate.near_text( query="A holiday film", # The model provider integration will automatically vectorize the query grouped_task="Write a fun tweet to promote readers to check out these films.", limit=2 ) print(f"Generated output: {response.generated}") # Note that the generated output is per query for obj in response.objects: print(obj.properties["title"]) ``` ### Further resources Once the integrations are configured at the collection, the data management and search operations in Weaviate work identically to any other collection. See the following model-agnostic examples: * [How-to manage data guides show how to perform data operations](https://weaviate.io/developers/weaviate/manage-data/create). * [How-to search guides show how to perform search operations](https://weaviate.io/developers/weaviate/search/basics). # Friendli Python SDK Source: https://friendli.ai/docs/sdk/python-sdk Interact with Friendli AI services using the official Python SDK for seamless integration with your applications. ## Introduction The [Friendli Python SDK](https://github.com/friendliai/friendli-python) provides a powerful and flexible way to interact with FriendliAI services, including Serverless Endpoints, Dedicated Endpoints, and Container. This allows developers to easily integrate their Python applications with FriendliAI. 
## Installation The SDK can be installed with either pip or poetry: ```bash # Using pip pip install friendli # Using poetry poetry add friendli ``` ## Authentication Authentication is done using a Friendli Token, which can be generated from the [Friendli Suite](https://friendli.ai/suite) in your Personal Settings: ```python import os from friendli import SyncFriendli with SyncFriendli( token=os.environ["FRIENDLI_TOKEN"], ) as friendli: # Your code here ``` For detailed instructions on generating a Friendli Token, see the [Personal Access Tokens](/guides/personal_access_tokens) guide. ## Chat Completions The SDK supports chat completions across all deployment types. Choose the deployment option that best fits your needs. ```python Serverless Endpoints import os from friendli import SyncFriendli with SyncFriendli( token=os.environ["FRIENDLI_TOKEN"], ) as friendli: res = friendli.serverless.chat.complete( messages=[ { "content": "You are a helpful assistant.", "role": "system", }, { "content": "Hello!", "role": "user", }, ], model="meta-llama-3.1-8b-instruct", max_tokens=200, ) print(res) ``` ```python Dedicated Endpoints import os from friendli import SyncFriendli with SyncFriendli( token=os.environ["FRIENDLI_TOKEN"], ) as friendli: res = friendli.dedicated.chat.complete( messages=[ { "content": "You are a helpful assistant.", "role": "system", }, { "content": "Hello!", "role": "user", }, ], model="YOUR_ENDPOINT_ID", max_tokens=200, ) print(res) ``` ```python Container Deployment from friendli import SyncFriendli with SyncFriendli() as friendli: res = friendli.container.chat.complete( messages=[ { "content": "You are a helpful assistant.", "role": "system", }, { "content": "Hello!", "role": "user", }, ], max_tokens=200, ) print(res) ``` ### Asynchronous Chat Completions ```python Serverless Endpoints import asyncio import os from friendli import AsyncFriendli async def main(): async with AsyncFriendli( token=os.environ["FRIENDLI_TOKEN"], ) as friendli: res = await friendli.serverless.chat.complete( messages=[ { "content": "You are a helpful assistant.", "role": "system", }, { "content": "Hello!", "role": "user", }, ], model="meta-llama-3.1-8b-instruct", max_tokens=200, ) print(res) asyncio.run(main()) ``` ```python Dedicated Endpoints import asyncio import os from friendli import AsyncFriendli async def main(): async with AsyncFriendli( token=os.environ["FRIENDLI_TOKEN"], ) as friendli: res = await friendli.dedicated.chat.complete( messages=[ { "content": "You are a helpful assistant.", "role": "system", }, { "content": "Hello!", "role": "user", }, ], model="YOUR_ENDPOINT_ID", max_tokens=200, ) print(res) asyncio.run(main()) ``` ```python Container Deployment import asyncio from friendli import AsyncFriendli async def main(): async with AsyncFriendli() as friendli: res = await friendli.container.chat.complete( messages=[ { "content": "You are a helpful assistant.", "role": "system", }, { "content": "Hello!", "role": "user", }, ], max_tokens=200, ) print(res) asyncio.run(main()) ``` ### Tool-Assisted Chat Completions Tool-assisted chat completions are only available for Serverless endpoints. 
### Tool-Assisted Chat Completions

Tool-assisted chat completions are only available for Serverless endpoints.

```python
import os

from friendli import SyncFriendli

with SyncFriendli(
    token=os.environ["FRIENDLI_TOKEN"],
) as friendli:
    res = friendli.serverless.tool_assisted_chat.complete(
        messages=[
            {
                "content": "What is 3 + 6?",
                "role": "user",
            },
        ],
        model="meta-llama-3.1-8b-instruct",
        max_tokens=200,
        tools=[
            {
                "type": "math:calculator",
            },
        ],
    )
    print(res)
```

## Advanced Features

### Streaming Responses

The SDK supports streaming responses using server-sent events, which can be consumed using a simple `for` loop:

```python
import os

from friendli import SyncFriendli

with SyncFriendli(
    token=os.environ["FRIENDLI_TOKEN"],
) as friendli:
    res = friendli.serverless.chat.stream(
        messages=[
            {
                "content": "You are a helpful assistant.",
                "role": "system",
            },
            {
                "content": "Hello!",
                "role": "user",
            },
        ],
        model="meta-llama-3.1-8b-instruct",
        max_tokens=200,
    )
    with res as event_stream:
        for event in event_stream:
            # Process each chunk as it arrives
            print(event, flush=True)
```
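Streaming also works with the asynchronous client. The sketch below assumes the async client mirrors the synchronous streaming interface shown above, with an awaitable `stream` call yielding an event stream that supports `async with` and `async for`; consult the SDK repository for the exact signatures.

```python
import asyncio
import os

from friendli import AsyncFriendli


async def main():
    async with AsyncFriendli(
        token=os.environ["FRIENDLI_TOKEN"],
    ) as friendli:
        res = await friendli.serverless.chat.stream(
            messages=[
                {"content": "You are a helpful assistant.", "role": "system"},
                {"content": "Hello!", "role": "user"},
            ],
            model="meta-llama-3.1-8b-instruct",
            max_tokens=200,
        )
        # Assumed to mirror the synchronous event-stream interface.
        async with res as event_stream:
            async for event in event_stream:
                print(event, flush=True)


asyncio.run(main())
```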
### Custom Retry Strategy

You can customize retry behavior for operations that support retries:

```python
import os

from friendli import SyncFriendli
from friendli.utils import BackoffStrategy, RetryConfig

with SyncFriendli(
    token=os.environ["FRIENDLI_TOKEN"],
) as friendli:
    res = friendli.serverless.chat.complete(
        messages=[
            {
                "content": "You are a helpful assistant.",
                "role": "system",
            },
            {
                "content": "Hello!",
                "role": "user",
            },
        ],
        model="meta-llama-3.1-8b-instruct",
        max_tokens=200,
        retries=RetryConfig("backoff", BackoffStrategy(1, 50, 1.1, 100), False),
    )
    # Handle response
    print(res)
```

### Error Handling

The SDK provides comprehensive error handling with detailed exception information:

```python
import os

from friendli import SyncFriendli, models

with SyncFriendli(
    token=os.environ["FRIENDLI_TOKEN"],
) as friendli:
    try:
        res = friendli.dedicated.endpoint.create(
            advanced={
                "tokenizer_add_special_tokens": True,
                "tokenizer_skip_special_tokens": False,
            },
            hf_model_repo="",
            instance_option_id="",
            name="",
            project_id="",
        )
        # Handle response
        print(res)
    except models.HTTPValidationError as e:
        # Handle validation errors
        print(f"Validation error: {e.data}")
    except models.SDKError as e:
        # Handle general SDK errors
        print(f"Error {e.status_code}: {e.message}")
```

### Custom Logging

You can pass your own logger to the client class to help troubleshoot and diagnose issues during API interactions. This is especially useful when you encounter unexpected behavior or errors.

```python
import logging
import os

from friendli import SyncFriendli

# Configure your custom logger, for example:
logger = logging.getLogger(__name__)
logging.basicConfig(
    format="[%(filename)s:%(lineno)s - %(funcName)s()] %(message)s",
    level=logging.INFO,
    handlers=[logging.StreamHandler()],
)

with SyncFriendli(
    token=os.environ["FRIENDLI_TOKEN"],
    debug_logger=logger,  # Pass your logger here
) as friendli:
    # Your code here
    pass
```

## Beta Features

### Dataset Management (Beta)

Our SDK provides a straightforward way to create, retrieve, and update datasets within your projects. Datasets can contain samples across various modalities, such as text, images, and more, allowing flexible and comprehensive dataset construction for your fine-tuning and validation workflows.

```python
import os

from friendli.friendli import SyncFriendli
from friendli.models import Sample

TEAM_ID = os.environ["FRIENDLI_TEAM_ID"]
PROJECT_ID = os.environ["FRIENDLI_PROJECT_ID"]
TOKEN = os.environ["FRIENDLI_TOKEN"]

with SyncFriendli(
    token=TOKEN,
    x_friendli_team=TEAM_ID,
) as friendli:
    # Create dataset
    with friendli.dataset.create(
        modality=["TEXT", "IMAGE"],
        name="test-create-dataset-sync",
        project_id=PROJECT_ID,
    ) as dataset:
        # Read dataset
        with open("dataset.jsonl", "rb") as f:
            data = [Sample.model_validate_json(line) for line in f]

        # Add samples to dataset
        dataset.upload_samples(
            samples=data,
            split="train",
        )
```

### File Management (Beta)

You can upload files to and download files from our database. This feature is primarily designed for storing sample files related to datasets, with additional use cases planned for the future.

```python
import io
import os
from hashlib import sha256

import httpx

from friendli import SyncFriendli

TEAM_ID = os.environ["FRIENDLI_TEAM_ID"]
PROJECT_ID = os.environ["FRIENDLI_PROJECT_ID"]
TOKEN = os.environ["FRIENDLI_TOKEN"]

with SyncFriendli(
    token=TOKEN,
) as friendli:
    # Read data from file
    with open("lorem.txt", "rb") as f:
        data = f.read()

    # Initiate upload
    init_upload_res = friendli.file.init_upload(
        digest=f"sha256:{sha256(data).hexdigest()}",
        name="lorem.txt",
        project_id=PROJECT_ID,
        size=len(data),
        x_friendli_team=TEAM_ID,
    )

    # Upload to S3
    if init_upload_res.upload_url is not None:
        httpx.post(
            url=init_upload_res.upload_url,
            data=init_upload_res.aws,
            files={"file": io.BytesIO(data)},
            timeout=60,
        ).raise_for_status()

    # Complete upload
    friendli.file.complete_upload(
        file_id=init_upload_res.file_id,
        x_friendli_team=TEAM_ID,
    )

    # Get download URL
    get_download_url_res = friendli.file.get_download_url(
        file_id=init_upload_res.file_id,
        x_friendli_team=TEAM_ID,
    )
    print(get_download_url_res.download_url)
```

## Further Resources

For complete API documentation, advanced usage examples, and detailed reference information, please visit the [Friendli Python SDK GitHub repository](https://github.com/friendliai/friendli-python).