# Changelog
Source: https://friendli.ai/docs/changelog
Product updates and announcements
## September, 2025
#### Custom Chat Template Support
We now support custom chat formatting: you can provide a custom [Jinja](https://jinja.palletsprojects.com/en/stable/) template when creating Dedicated Endpoints. [Read more](https://friendli.ai/docs/guides/dedicated_endpoints/endpoints#custom-chat-templates)
***
#### 4-Bit Online Quantization Support
We now support 4-bit online quantization for selected models when creating Dedicated Endpoints. [Read more](https://friendli.ai/docs/guides/dedicated_endpoints/online-quantization)
#### Reasoning Parsing Support
We now support reasoning parsing for supported models. When the feature is enabled, the response provides a separate `reasoning_content` field rather than including the reasoning content in the `content` field. [Read more](https://friendli.ai/docs/guides/reasoning#reasoning-parsing-with-friendli)
#### Model Deprecations
We have deprecated the following serverless models.
* `K-intelligence/Midm-2.0-Base-Instruct`
* `K-intelligence/Midm-2.0-Mini-Instruct`
#### B200 Hardware Support
We now support NVIDIA B200 GPUs alongside existing A100, H100, and H200 GPUs. [Read more](https://friendli.ai/pricing/dedicated-endpoints)
## August, 2025
#### New built-in integration w/ Linkup
New built-in web-search tool integration with Linkup has been added. [Read more](https://friendli.ai/blog/linkup-partnership)
#### New auto-scaling type *'Request count'* added
Enterprise plan users can now choose to scale their endpoints based on request count. The request count scaling strategy adjusts the number of workers according to the total number of requests queued and in progress.
#### Increased output token limits for reasoning models on Serverless endpoints
We have increased the output token limits for reasoning models on Serverless endpoints, allowing longer reasoning outputs to be generated.
***
#### New endpoint feature *'N-GRAM speculative decoding'*
Users can now enable N-GRAM speculative decoding for their endpoints. For predictable tasks, this can deliver substantial performance gains. [Read more](https://friendli.ai/blog/n-gram-speculative-decoding)
#### Model releases
We now support the following serverless models.
* `Qwen/Qwen3-235B-A22B-Thinking-2507`
* `Qwen/Qwen3-235B-A22B-Instruct-2507`
* `skt/A.X-4.0`
* `skt/A.X-3.1`
* `naver-hyperclovax/HyperCLOVAX-SEED-Think-14B`
## July, 2025
#### New endpoint feature *'Online quantization'*
Users can now quantize their model endpoints without any preparation and accelerate inference. [Read more](https://friendli.ai/blog/online-quantization)
#### Model releases
LG AI Research has partnered with FriendliAI to bring the latest version of EXAONE 4.0. [Read more](https://friendli.ai/blog/lg-ai-research-partnership-exaone-4.0)
* `LGAI-EXAONE/EXAONE-4.0.1-32B`
#### Model releases
We now support the following serverless model.
* `deepseek-ai/DeepSeek-R1-0528`
# CUDA Compatibility
Source: https://friendli.ai/docs/guides/container/cuda_compatibility
The Friendli Engine supports CUDA-enabled NVIDIA GPUs, which means it relies on a specific version of CUDA and necessitates proper [CUDA compute compatibilities](https://docs.nvidia.com/cuda/cuda-c-programming-guide/index.html#compute-capability).
To utilize the Friendli Container effectively, ensure that you have the appropriate NVIDIA GPUs and an NVIDIA driver in place.
Currently, we publicly offer a single Friendli Container image (`registry.friendli.ai/trial:latest`) equipped with CUDA 12.4, targeting CUDA compute compatibility versions `8.0`, `8.6`, `8.9`, and `9.0`.
To make the right choices regarding GPUs and driver versions, consult the [required driver versions](https://docs.nvidia.com/cuda/cuda-toolkit-release-notes/index.html#id4) and [GPUs](https://developer.nvidia.com/cuda-gpus) for the CUDA toolkit and compute compatibility.
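As a quick sanity check on the target machine, you can query each GPU's model name, compute capability, and driver version with `nvidia-smi` (a minimal sketch; the `compute_cap` query field is available in reasonably recent NVIDIA drivers):
```sh
# List GPU name, CUDA compute capability, and installed driver version.
nvidia-smi --query-gpu=name,compute_cap,driver_version --format=csv
```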
# Inference with gRPC
Source: https://friendli.ai/docs/guides/container/inference_with_grpc
Run a gRPC inference server with Friendli Container and interact with it through the `friendli` SDK.
This guide will walk you through how to run a gRPC inference server with Friendli Container and interact with it through the `friendli` SDK.
## Prerequisites
Install `friendli` to use gRPC client SDK:
```sh
pip install friendli
```
Ensure you have the `friendli` SDK version `1.4.1` or higher installed.
## Starting the Friendli Container with gRPC
You can run the Friendli Container with a gRPC server for completions by adding the `--grpc true` option to the launch command.
This supports response-streaming gRPC, and you can send requests using our `friendli` SDK.
To start the Friendli Container with gRPC support, use the following command:
```sh
export FRIENDLI_CONTAINER_SECRET="YOUR_FRIENDLI_CONTAINER_SECRET_flc_XXX"
# e.g. Running `NousResearch/Hermes-3-Llama-3.1-8B` on GPU 0 with a trial image.
docker run --gpus '"device=0"' -p 8000:8000 \
-e FRIENDLI_CONTAINER_SECRET=$FRIENDLI_CONTAINER_SECRET \
-v ~/.cache/huggingface:/root/.cache/huggingface \
registry.friendli.ai/trial:latest \
--hf-model-name NousResearch/Hermes-3-Llama-3.1-8B \
--grpc true
```
You can change the server port with the `--web-server-port` argument.
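For example, a sketch of serving gRPC on port 8080 instead (reusing the command above; the port number is illustrative):
```sh
docker run --gpus '"device=0"' -p 8080:8080 \
  -e FRIENDLI_CONTAINER_SECRET=$FRIENDLI_CONTAINER_SECRET \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  registry.friendli.ai/trial:latest \
  --hf-model-name NousResearch/Hermes-3-Llama-3.1-8B \
  --grpc true \
  --web-server-port 8080
```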
## Sending Requests with the Client SDK
Here is how to use the `friendli` SDK to interact with the gRPC server.
This example assumes that the gRPC server is running on `0.0.0.0:8000`.
```python Default
from friendli import SyncFriendli

client = SyncFriendli()

stream = client.container.chat.complete(
    messages=[
        {"content": "You are a helpful assistant.", "role": "system"},
        {"content": "Hello!", "role": "user"},
    ],
    stream=True,  # Should be True
    top_k=1,
)

for chunk in stream:
    print(chunk.text, end="", flush=True)
```
```python Async
# For asynchronous operations, use the following code snippet:
import asyncio

from friendli import AsyncFriendli

client = AsyncFriendli()


async def run():
    stream = await client.container.chat.complete(
        messages=[
            {"content": "You are a helpful assistant.", "role": "system"},
            {"content": "Hello!", "role": "user"},
        ],
        stream=True,  # Should be True
        top_k=1,
    )
    async for chunk in stream:
        print(chunk.text, end="", flush=True)


asyncio.run(run())
```
## Properly Closing the Client
By default, the library closes underlying HTTP and gRPC connections when the `client` is garbage-collected.
You can manually close the `Friendli` or `AsyncFriendli` client using the `.close()` method or utilize a context manager to ensure proper closure when exiting a `with` block.
```python Default
from friendli import SyncFriendli

client = SyncFriendli()

with client:
    stream = client.container.chat.complete(
        messages=[
            {"content": "You are a helpful assistant.", "role": "system"},
            {"content": "Hello!", "role": "user"},
        ],
        stream=True,  # Should be True
        top_k=1,
        min_tokens=10,
    )
    for chunk in stream:
        print(chunk.text, end="", flush=True)
```
```python Async
import asyncio

from friendli import AsyncFriendli

client = AsyncFriendli()


async def run():
    async with client:
        stream = await client.container.chat.complete(
            messages=[
                {"content": "You are a helpful assistant.", "role": "system"},
                {"content": "Hello!", "role": "user"},
            ],
            stream=True,  # Should be True
            top_k=1,
        )
        async for chunk in stream:
            print(chunk.text, end="", flush=True)


asyncio.run(run())
```
# Introducing Friendli Container
Source: https://friendli.ai/docs/guides/container/introduction
While Friendli Serverless Endpoints and Dedicated Endpoints offer convenient cloud-based solutions, some users crave even more control and flexibility. For those pioneers, Friendli Container is the answer.
## What is Friendli Container?
Unmatched Control: Friendli Container provides the Friendli Engine, our cutting-edge serving technology, as a Docker container. This means you can:
* **Run your own data center or cluster**: Deploy the container on your existing GPU machines, giving you complete control over your infrastructure and data security.
* **Choose your own cloud provider**: If you prefer the cloud, you can still leverage your preferred cloud provider and GPUs.
* **Customize your environment**: Fine-tune the container configuration to perfectly match your specific needs and workflows.
Greater Responsibility, Greater Customization: With Friendli Container, you handle the cluster management, fault tolerance, and scaling. This responsibility comes with these potential benefits:
* **Controlled environment**: Keep your data within your own environment, ideal for sensitive applications or meeting compliance requirements.
* **Unmatched flexibility**: Tailor your infrastructure and workflows to your specific needs, pushing the boundaries of AI innovation.
* **Cost saving opportunities**: Manage your resources on your GPU machines, potentially leading to cost savings compared to cloud-based solutions.
Ideal for:
* **Data-sensitive users**: Securely run your models within your own infrastructure.
* **Control enthusiasts**: Take full control over your AI environment and workflows.
* **Existing cluster owners**: Utilize your existing GPU resources for cost-effective generative AI serving.
## Getting Started with Friendli Container
1. **Generate Your User Token**: Visit the Friendli Container page through the [Friendli Suite](https://friendli.ai/suite) website and generate your unique token.
2. **Login with Docker Client**: Use your token to authenticate with the Docker client on your machine.
3. **Pull the Friendli Container Image**: Run the docker pull command with the provided image name.
4. [**Launch the Friendli Container**](/guides/container/running_friendli_container): Run the docker run command with the desired configuration and credentials.
5. **Expose Your Model**: The container will expose the model for inference.
6. [**Send Inference Requests**](/guides/container/running_friendli_container#sending-inference-requests): Use tools like curl or Python's requests library to send input prompts or data to the container.
Take generative AI to the next level with unmatched control, security, and flexibility through Friendli Container.
Start your journey today and elevate your AI endeavors on your own terms!
# Observability for Friendli Container
Source: https://friendli.ai/docs/guides/container/monitoring
Observability is an integral part of DevOps. To support this, Friendli Container exports internal metrics in a [Prometheus](https://prometheus.io) text format.
By default, metrics are served at `http://localhost:8281/metrics`. You can configure the port number using the command line option `--metrics-port`.
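For a quick sanity check from the machine running the container, you can scrape the endpoint directly (assuming the default port):
```sh
# List all Friendli metrics currently exported (default port 8281).
curl -s http://localhost:8281/metrics | grep '^friendli'
```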
## Supported Metrics
### Counters
Counters are cumulative metrics whose values monotonically increase.
They are often used in combination with the Prometheus function [rate()](https://prometheus.io/docs/prometheus/latest/querying/functions/#rate) for calculating throughput.
| Metric Name | Description |
| --------------------------------- | -------------------------------------------------------- |
| friendli\_requests\_total | Cumulative number of requests received |
| friendli\_responses\_total | Cumulative number of responses sent |
| friendli\_items\_total | Cumulative number of items requested |
| friendli\_failure\_by\_cancel | Cumulative number of failed requests due to cancellation |
| friendli\_failure\_by\_timeout | Cumulative number of failed requests due to timeout |
| friendli\_failure\_by\_nan\_error | Cumulative number of failed requests due to NaN error |
| friendli\_failure\_by\_reject | Cumulative number of failed requests due to rejection |
One inference request may generate multiple results with the `n` field in the request body.
Upon receiving such a request, `friendli_requests_total` is increased by 1 and `friendli_items_total` is increased by `n`.
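For example, assuming a Prometheus server that scrapes this container is reachable at `$PROMETHEUS_URL` (an illustrative variable), per-second request and item throughput over the last minute can be computed with `rate()`:
```sh
# Per-second request throughput over the last minute.
curl -sG "$PROMETHEUS_URL/api/v1/query" \
  --data-urlencode 'query=rate(friendli_requests_total[1m])'

# Per-second item throughput over the last minute.
curl -sG "$PROMETHEUS_URL/api/v1/query" \
  --data-urlencode 'query=rate(friendli_items_total[1m])'
```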
### Gauges
Gauges are numerical values that can go up and down to represent the current value.
| Metric Name | Description |
| ---------------------------------- | --------------------------------------------------------------------- |
| friendli\_current\_requests | Current number of requests in the engine (either assigned or waiting) |
| friendli\_current\_items | Current number of items in the engine (either assigned or waiting) |
| friendli\_current\_assigned\_items | Current number of items actively processed by the engine |
| friendli\_current\_waiting\_items  | Current number of items waiting in the internal queue                 |
### Histograms
[Histograms](https://prometheus.io/docs/practices/histograms) are used to track the distribution of variables over time.
| Histogram                                          | Metric Name                          | Description                                                                         |
| -------------------------------------------------- | ------------------------------------ | ----------------------------------------------------------------------------------- |
| TCache hit ratio                                    | friendli\_tcache\_hit\_ratio\_bucket | Bucketized number of histogram samples for TCache hit ratio, with `le` label         |
|                                                     | friendli\_tcache\_hit\_ratio\_count  | Total number of histogram samples for TCache hit ratio                               |
|                                                     | friendli\_tcache\_hit\_ratio\_sum    | Sum of histogram sample values for TCache hit ratio                                  |
| The length of input tokens (Experimental metric)    | friendli\_input\_lengths\_bucket     | Bucketized number of histogram samples for length of input tokens, with `le` label   |
|                                                     | friendli\_input\_lengths\_count      | Total number of histogram samples for length of input tokens                         |
|                                                     | friendli\_input\_lengths\_sum        | Sum of histogram sample values for length of input tokens                            |
| The length of output tokens (Experimental metric)   | friendli\_output\_lengths\_bucket    | Bucketized number of histogram samples for length of output tokens, with `le` label  |
|                                                     | friendli\_output\_lengths\_count     | Total number of histogram samples for length of output tokens                        |
|                                                     | friendli\_output\_lengths\_sum       | Sum of histogram sample values for length of output tokens                           |
For visualizing histograms using Grafana, [How to visualize Prometheus histograms in Grafana](https://grafana.com/blog/2020/06/23/how-to-visualize-prometheus-histograms-in-grafana) provides useful tips.
### Quantiles
Quantiles are used to show the current p50 (median), p90, and p99 percentiles of variables.
| Quantiles                                     | Metric Name                                 | Description                                                                                     |
| --------------------------------------------- | ------------------------------------------- | ------------------------------------------------------------------------------------------------ |
| Request completion latency (in nanoseconds)   | friendli\_requests\_latencies               | Percentile value for request completion latency (`quantile` label is either 0.5, 0.9, or 0.99)    |
|                                               | friendli\_requests\_latencies\_count        | Total number of samples for request completion latency                                            |
|                                               | friendli\_requests\_latencies\_sum          | Sum of sample values for request completion latency                                               |
| Time to first token (TTFT) (in nanoseconds)   | friendli\_requests\_ttft                    | Percentile value for time to first token (TTFT) (`quantile` label is either 0.5, 0.9, or 0.99)    |
|                                               | friendli\_requests\_ttft\_count             | Total number of samples for time to first token (TTFT)                                            |
|                                               | friendli\_requests\_ttft\_sum               | Sum of sample values for time to first token (TTFT)                                               |
| Request queueing delay (in nanoseconds)       | friendli\_requests\_queueing\_delays        | Percentile value for queueing delay (`quantile` label is either 0.5, 0.9, or 0.99)                |
|                                               | friendli\_requests\_queueing\_delays\_count | Total number of samples for queueing delay                                                        |
|                                               | friendli\_requests\_queueing\_delays\_sum   | Sum of sample values for queueing delay                                                           |
### Info
The following information metric always has a value of 1. The metric labels contain useful information in text.
| Metric Name | Label | Description |
| ------------------------- | --------- | -------------- |
| friendli\_engine\_version | `version` | Engine version |
## Grafana Dashboard Template
You can import [the dashboard templates](https://github.com/friendliai/container-resource/tree/main/grafana) to your Grafana instance.
The Grafana instance must be connected to a Prometheus instance (or a Prometheus-compatible data source) which is configured to scrape metrics from Friendli Container processes.
The dashboard template works with Grafana v8.0.0 or later versions. We recommend using Grafana v10.0.0 or later for the best experience.
# Optimizing Inference with Policy Search
Source: https://friendli.ai/docs/guides/container/optimizing_inference_with_policy_search
For specialized cases like MoE or quantized models, optimizing the execution policy in Friendli Engine can boost inference performance by 1.5x to 2x, improving throughput and reducing latency.
## Introduction
For specialized cases, like **serving MoE models (e.g., Mixtral)** or **quantized models**, performance of inference can be further optimized through an execution policy search.
This process can be skipped, but it is necessary to get the optimal speed out of the Friendli Engine.
When the Friendli Engine runs with the optimal policy, performance can improve by 1.5x to 2x in both throughput and latency.
Therefore, we recommend skipping policy search for quick model testing, but running it when analyzing cost or latency for a production service.
Policy search is effective only when serving (1) MoE models or (2) AWQ, FP8, or INT8 quantized models. Otherwise, it has no effect.
## Running Policy Search
You can run policy search by adding the following options to the launch command of Friendli Container.
| Options | Type | Summary | Default |
| -------------------------- | ------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------- |
| `--algo-policy-dir` | TEXT | Path to the directory to save the searched optimal policy file. The default value is the current working directory. | current working dir |
| `--search-policy` | BOOLEAN | Runs policy search to find the best Friendli execution policy for the given configuration such as model type, GPU, NVIDIA driver version, quantization scheme, etc. | false |
| `--terminate-after-search` | BOOLEAN | Terminates engine container after policy search. | false |
### Example: `FriendliAI/Llama-3.1-8B-Instruct-fp8`
For example, you can start the policy search for [FriendliAI/Llama-3.1-8B-Instruct-fp8](https://huggingface.co/FriendliAI/Llama-3.1-8B-Instruct-fp8) model as follows:
```sh
export HF_MODEL_NAME="FriendliAI/Llama-3.1-8B-Instruct-fp8"
export FRIENDLI_CONTAINER_SECRET="YOUR CONTAINER SECRET"
export FRIENDLI_CONTAINER_IMAGE="registry.friendli.ai/trial"
export GPU_ENUMERATION='"device=0"'
export POLICY_DIR=$PWD/policy
mkdir -p $POLICY_DIR
docker run \
--gpus $GPU_ENUMERATION \
-p 8000:8000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-v $POLICY_DIR:/policy \
-e FRIENDLI_CONTAINER_SECRET=$FRIENDLI_CONTAINER_SECRET \
$FRIENDLI_CONTAINER_IMAGE \
--hf-model-name $HF_MODEL_NAME \
--algo-policy-dir /policy \
--search-policy true
```
### Example: `mistralai/Mixtral-8x7B-Instruct-v0.1` (TP=4)
```sh
export HF_MODEL_NAME="mistralai/Mixtral-8x7B-Instruct-v0.1"
export FRIENDLI_CONTAINER_SECRET="YOUR CONTAINER SECRET"
export FRIENDLI_CONTAINER_IMAGE="registry.friendli.ai/trial"
export GPU_ENUMERATION='"device=0,1,2,3"'
export POLICY_DIR=$PWD/policy
mkdir -p $POLICY_DIR
docker run -p 8000:8000 \
--ipc=host --gpus $GPU_ENUMERATION \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-v $POLICY_DIR:/policy \
-e FRIENDLI_CONTAINER_SECRET=$FRIENDLI_CONTAINER_SECRET \
$FRIENDLI_CONTAINER_IMAGE \
--hf-model-name $HF_MODEL_NAME \
--num-devices 4 \
--algo-policy-dir /policy \
--search-policy true
```
Once the policy search is complete, a policy file will be created in `$POLICY_DIR`.
If the policy file already exists, the engine will search only the necessary spaces and update the policy file accordingly.
After the policy search completes, the engine starts serving the endpoint using the policy file.
It takes up to several minutes to find the optimal policy for the Llama 2 13B model on an NVIDIA A100 80GB GPU.
The estimated and remaining time will be displayed in stderr while the policy search runs.
## Running Policy Search Without Starting Serving Endpoint
To search for the best policy without starting the serving endpoint, launch the engine with the Friendli Container command and include the `--terminate-after-search true` option.
### Example: `FriendliAI/Llama-3.1-8B-Instruct-fp8`
```sh
docker run \
--gpus $GPU_ENUMERATION \
-p 8000:8000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-v $POLICY_DIR:/policy \
-e FRIENDLI_CONTAINER_SECRET=$FRIENDLI_CONTAINER_SECRET \
$FRIENDLI_CONTAINER_IMAGE \
--hf-model-name FriendliAI/Llama-3.1-8B-Instruct-fp8 \
--algo-policy-dir /policy \
--search-policy true \
--terminate-after-search true
```
### Example: `mistralai/Mixtral-8x7B-Instruct-v0.1` (TP=4)
```sh
docker run -p 8000:8000 \
--ipc=host --gpus $GPU_ENUMERATION \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-v $POLICY_DIR:/policy \
-e FRIENDLI_CONTAINER_SECRET=$FRIENDLI_CONTAINER_SECRET \
$FRIENDLI_CONTAINER_IMAGE \
--hf-model-name mistralai/Mixtral-8x7B-Instruct-v0.1 \
--num-devices 4 \
--algo-policy-dir /policy \
--search-policy true \
--terminate-after-search true
```
## FAQ: When to Run Policy Search Again?
The execution policy depends on the following factors:
* Model
* GPU
* GPU count and parallelism degree (The value for `--num-devices` and `--num-workers` options)
* NVIDIA Driver major version
* Friendli Container version
You should run policy search again when any of these are changed from your serving setup.
# QuickStart: Friendli Container Trial
Source: https://friendli.ai/docs/guides/container/quickstart
Learn how to get started with Friendli Container in this step-by-step guide. Access the container registry, prepare your container secret, run your Friendli Container, and monitor it using Grafana.
## Introduction
[Friendli Container](https://friendli.ai/products/container) enables you to efficiently deploy LLMs of your choice on your infrastructure.
With Friendli Container, you can perform high-speed LLM inferencing in a secure and private environment.
This tutorial will guide you through the process of running a Friendli Container for your LLM.
## Prerequisites
* **Hardware Requirements**: Friendli Container currently only targets x86\_64 architecture and supports NVIDIA GPUs, so please prepare proper GPUs and a compatible driver by referring to [our required CUDA compatibility guide](/guides/container/cuda_compatibility).
* **Software Requirements**: Your machine should be able to run containers with the [NVIDIA container toolkit](https://docs.nvidia.com/datacenter/cloud-native/container-toolkit/latest/index.html). In this tutorial, we will use Docker as container runtime and make use of [Docker Compose](https://docs.docker.com/compose).
* **Model Compatibility**: If your model is in a [safetensors](https://huggingface.co/docs/safetensors/index) format, which is compatible with [Hugging Face transformers](https://huggingface.co/docs/transformers), you can serve the model directly with the Friendli Container. Please check our [Model library](https://friendli.ai/models) for the non-exhaustive list of supported models.
This tutorial assumes that your model of choice is uploaded to [Hugging Face](https://huggingface.co) and you have access to it.
If the model is gated or private, you need to prepare a [Hugging Face Access Token](https://huggingface.co/settings/tokens).
## Getting Access to Friendli Container
### Activate your Free Trial
[Contact sales](https://friendli.ai/contact) to activate your free trial.
### Get Access to the Container Registry
Friendli Token is a user credential that is required for logging into our container registry.
1. Go to [Personal settings > Tokens](https://friendli.ai/suite/setting/tokens) and click 'Create token'.
2. Save the token you just created.
### Prepare your Container Secret
Container secret is a secret code that is used to activate Friendli Container.
You should pass the container secret as an environment variable to run the container image.
1. Go to [Container > Container Secrets](https://friendli.ai/suite/default-team/container/secrets) and click 'Create secret'.
2. Save the secret you just created.
**Secret Rotation**
You can rotate the container secret for security reasons.
If you rotate the container secret, a new secret will be created and the previous secret will be automatically revoked in **30** minutes.
## Running Friendli Container
### Pull the Friendli Container Image
1. Log in to the container registry using the email address for your Friendli Suite account and the Friendli Token.
```sh
export FRIENDLI_EMAIL="YOUR ACCOUNT EMAIL ADDRESS"
export FRIENDLI_TOKEN="YOUR FRIENDLI TOKEN"
docker login registry.friendli.ai -u $FRIENDLI_EMAIL -p $FRIENDLI_TOKEN
```
2. Pull the image.
```sh
docker pull registry.friendli.ai/trial
```
### Run Friendli Container with a Hugging Face Model
1. Clone our [container resource](https://github.com/friendliai/container-resource) git repository.
```sh
git clone https://github.com/friendliai/container-resource
cd container-resource/quickstart/docker-compose
```
2. Set up environment variables.
```sh
export HF_MODEL_NAME="<...>" # Hugging Face model name (e.g., "meta-llama/Meta-Llama-3-8B-Instruct")
export FRIENDLI_CONTAINER_SECRET="<...>" # Friendli container secret
```
If your model is a private or gated one, you also need to provide [Hugging Face Access Token](https://huggingface.co/settings/tokens).
```sh
export HF_TOKEN="<...>" # Hugging Face Access Token
```
3. Launch the Friendli Container.
```sh
docker compose up -d
```
By default, the container will listen for inference requests at TCP port 8000 and a Grafana service will be available at TCP port 3000.
You can change the designated ports using the following environment variables.
For example, if you want to use TCP port 8001 and port 3001 for Grafana, execute the command below.
```sh
export FRIENDLI_PORT="8001"
export FRIENDLI_GRAFANA_PORT="3001"
```
Even if the machine has multiple GPUs, the container will use only one GPU, specifically the first GPU (`device_ids: ['0']`).
You can edit `docker-compose.yaml` to change what GPU device the container will use.
The downloaded Hugging Face model will be cached in the `$HOME/.cache/huggingface` directory.
You may want to clean up this directory after completing this tutorial.
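If you want to reclaim disk space afterwards, a minimal sketch (the `models--<org>--<name>` layout is the standard Hugging Face hub cache structure; the repository name below is illustrative):
```sh
# Check how much space the Hugging Face cache is using.
du -sh ~/.cache/huggingface/hub

# Remove a specific cached model once you no longer need it.
rm -rf ~/.cache/huggingface/hub/models--meta-llama--Meta-Llama-3-8B-Instruct
```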
### Send Inference Requests
You can now send inference requests to the running container.
For information on all parameters that can be used in an inference request, please refer to [this document](/openapi).
```sh Chat Completion
curl -X POST http://0.0.0.0:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "user", "content": "What makes a good leader?"}
],
"max_tokens": 30
}'
```
```sh Completion
curl -X POST http://0.0.0.0:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"prompt": "What makes a good leader?",
"max_tokens": 30
}'
```
```sh Tokenization
curl -X POST http://0.0.0.0:8000/v1/tokenize \
-H "Content-Type: application/json" \
-d '{
"prompt": "What is generative AI?"
}'
```
```sh Detokenization
curl -X POST http://0.0.0.0:8000/v1/detokenize \
-H "Content-Type: application/json" \
-d '{
"tokens": [
128000,
3923,
374,
1803,
1413,
15592,
30
]
}'
```
Chat completion requests work only if the model's tokenizer config contains a `chat_template`.
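A quick way to check this for a public Hugging Face model is to look for a `chat_template` entry in its `tokenizer_config.json` (a sketch using the Hugging Face raw-file URL; the model name is illustrative):
```sh
# Prints a count of 1 or more if a chat template is defined, 0 otherwise.
curl -s https://huggingface.co/NousResearch/Hermes-3-Llama-3.1-8B/raw/main/tokenizer_config.json \
  | grep -c '"chat_template"'
```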
### Monitor using Grafana
Using your browser, open [http://0.0.0.0:3000/d/friendli-engine](http://0.0.0.0:3000/d/friendli-engine), and log in with username `admin` and password `admin`.
You can now access the dashboards showing useful engine metrics.
If you cannot open a browser directly in the GPU machine where the Friendli Container is running, you can use SSH to forward requests from the browser running on your PC to the GPU machine.
```sh
# Change these variables to match your environment.
LOCAL_GRAFANA_PORT=3000 # The number of the port in your PC.
FRIENDLI_GRAFANA_PORT=3000 # The number of the port in the GPU machine.
ssh "$GPU_MACHINE_ADDRESS" -L "$LOCAL_GRAFANA_PORT:0.0.0.0:$FRIENDLI_GRAFANA_PORT"
```
where `$GPU_MACHINE_ADDRESS` should be replaced with the address of the GPU machine.
You may also want to use `-l login_name` or `-p port` options to connect to the GPU machine using SSH.
Then using your browser on the PC, open `http://0.0.0.0:$LOCAL_GRAFANA_PORT/d/friendli-engine`.
## Going Further
Congratulations! You can now serve your LLM of choice using your hardware, with the power of the most efficient LLM serving engine on the planet.
The following topics will help you go further through your AI endeavors.
* **Multi-GPU Serving**: Although this tutorial is limited to using only one GPU, Friendli Container supports tensor parallelism and pipeline parallelism for multi-GPU inference. Check [Multi-GPU Serving](/guides/container/running_friendli_container#multi-gpu-serving) for more information.
* **Serving Multi-LoRA Models**: You can deploy multiple customized LLMs without additional GPU resources. See [Serving Multi-LoRA Models](/guides/container/serving_multi_lora_models) to learn how to launch the container with your adapters.
* **Serving Quantized Models**: Running quantized models requires an additional step of [execution policy search](/guides/container/optimizing_inference_with_policy_search). See [Serving Quantized Models](/guides/container/serving_quantized_models) to learn how to create an inference endpoint for quantized models.
* **Serving MoE Models**: Running MoE (Mixture of Experts) models requires an additional step of [execution policy search](/guides/container/optimizing_inference_with_policy_search). See [Serving MoE Models](/guides/container/serving_moe_models) to learn how to create an inference endpoint for MoE models.
If you are stuck or need help going through this tutorial, please ask for support by sending an email to [Support](mailto:support@friendli.ai).
# Running Friendli Container
Source: https://friendli.ai/docs/guides/container/running_friendli_container
Friendli Container enables you to effortlessly deploy your generative AI model on your own machine. This tutorial will guide you through the process of running a Friendli Container.
## Introduction
Friendli Container enables you to effortlessly deploy your generative AI model on your own machine.
This tutorial will guide you through the process of running a Friendli Container.
The current version of Friendli Container supports most major generative language models.
## Prerequisites
* Before you begin, make sure you have signed up for [Friendli Suite](https://friendli.ai/suite).
* [Contact sales](https://friendli.ai/contact) to activate your free trial.
* Friendli Container currently only supports NVIDIA GPUs, so please prepare proper GPUs and a compatible driver by referring to [our required CUDA compatibility guide](/guides/container/cuda_compatibility).
* Prepare a Friendli Token following [this guide](#preparing-friendli-token).
* Prepare a Friendli Container Secret following [this guide](#preparing-container-secret).
### Preparing Friendli Token
The Friendli Token is the user credential for logging into our container registry.
1. Sign in to [Friendli Suite](https://friendli.ai/suite).
2. Go to **[Personal settings > Tokens](https://friendli.ai/suite/setting/tokens)** and click **'Create new token'**.
3. Save your created token value and export it as `FRIENDLI_TOKEN`.
### Preparing Container Secret
Container secret is a secret code that is used to activate Friendli Container.
You should pass the container secret as an environment variable to run the container image.
1. Sign in to [Friendli Suite](https://friendli.ai/suite).
2. Go to **[Container > Container Secrets](https://friendli.ai/suite/default-team/container/secrets)** and click **'Create secret'**.
3. Save your created secret value and export it as `FRIENDLI_CONTAINER_SECRET`.
**Secret Rotation**
You can rotate the container secret for security reasons.
If you rotate the container secret, a new secret will be created and the previous secret will be revoked automatically in 30 minutes.
## Pulling Friendli Container Image
Log in to the Docker client using the Friendli Token created as outlined in [Preparing Friendli Token](#preparing-friendli-token).
```sh
export FRIENDLI_EMAIL="YOUR ACCOUNT EMAIL ADDRESS"
export FRIENDLI_TOKEN="YOUR FRIENDLI TOKEN"
docker login registry.friendli.ai -u $FRIENDLI_EMAIL -p $FRIENDLI_TOKEN
```
```sh
docker pull registry.friendli.ai/trial:latest
```
## Running Friendli Container with Hugging Face Models
If your model is in a [`safetensors`](https://huggingface.co/docs/safetensors/index) format, which is compatible with [Hugging Face transformers](https://huggingface.co/docs/transformers), you can serve the model directly with Friendli Container.
Friendli Container supports direct loading of `safetensors` checkpoints for many model types. You can find the complete list of supported models on the [Supported Models page](https://friendli.ai/models/search?products=CONTAINER).
If your model is not in the supported model list, please [contact us](https://friendli.ai/contact).
Here are the instructions to run Friendli Container to serve a Hugging Face model:
```sh
# Fill the values of following variables.
export HF_MODEL_NAME="" # Hugging Face model name (e.g., "meta-llama/Meta-Llama-3-8B-Instruct")
export FRIENDLI_CONTAINER_SECRET="" # Friendli container secret
docker run --gpus '"device=0"' -p 8000:8000 \
-e FRIENDLI_CONTAINER_SECRET=$FRIENDLI_CONTAINER_SECRET \
-v ~/.cache/huggingface:/root/.cache/huggingface \
registry.friendli.ai/trial \
--hf-model-name $HF_MODEL_NAME
```
You can append additional launch options to the end of the command; see [Launch Options for Friendli Container](#launch-options) for the available options.
By running the above command, you will have a running Docker container that exports an HTTP endpoint for handling inference requests.
### Multi-GPU Serving
Friendli Container supports ***tensor parallelism*** and ***pipeline parallelism*** for multi-GPU inference.
#### Tensor Parallelism
Tensor parallelism is employed when serving large models that exceed the memory capacity of a single GPU, by distributing parts of the model's weights across multiple GPUs.
To leverage tensor parallelism with the Friendli Container:
1. Specify multiple GPUs for `$GPU_ENUMERATION` (e.g., '"device=0,1,2,3"').
2. Use `--num-devices` (or `-d`) option to specify the tensor parallelism degree (e.g., `--num-devices 4`).
#### Pipeline Parallelism
Pipeline parallelism splits a model into multiple segments to be processed across different GPUs, enabling the deployment of larger models that would not otherwise fit on a single GPU.
To exploit pipeline parallelism with the Friendli Container:
1. Specify multiple GPUs for `$GPU_ENUMERATION` (e.g., '"device=0,1,2,3"').
2. Use `--num-workers` (or `-n`) option to specify the pipeline parallelism degree (e.g., `--num-workers 4`).
**Choosing between Tensor Parallelism and Pipeline Parallelism**
When deploying models with the Friendli Container, you have the flexibility to combine tensor parallelism and pipeline parallelism.
We recommend exploring a balance between the two, based on their distinct characteristics.
While tensor parallelism involves "expensive" ***all-reduce*** operations to aggregate partial results across all devices, pipeline parallelism relies on "cheaper" ***peer-to-peer*** communication.
Thus, in limited network setups, such as PCIe-based systems, pipeline parallelism is preferable.
Conversely, in rich network setups like NVLink, tensor parallelism is recommended due to its superior parallel computation efficiency.
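For example, a sketch of combining both on a single 8-GPU machine, with tensor parallelism degree 4 and pipeline parallelism degree 2 (the device numbering is illustrative):
```sh
docker run -p 8000:8000 \
  --ipc=host --gpus '"device=0,1,2,3,4,5,6,7"' \
  -e FRIENDLI_CONTAINER_SECRET=$FRIENDLI_CONTAINER_SECRET \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  registry.friendli.ai/trial \
  --hf-model-name $HF_MODEL_NAME \
  --num-devices 4 \
  --num-workers 2
```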
### Advanced: Serving Quantized Models
Running quantized models requires an additional step to search execution policy. See [Serving Quantized Models](/guides/container/serving_quantized_models) to learn how to create an inference endpoint for the quantized model.
### Advanced: Serving MoE Models
Running MoE (Mixture of Experts) models requires an additional step to search execution policy. See [Serving MoE Models](/guides/container/serving_moe_models) to learn how to create an inference endpoint for the MoE model.
### Examples
This is an example running [Llama-3.1-8B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-8B-Instruct) with a single GPU.
```sh
export FRIENDLI_CONTAINER_SECRET="" # Friendli container secret (leave it if it's already set in your environment)
export HF_TOKEN="" # Access token from Hugging Face (see the caution below)
docker run -p 8000:8000 --gpus '"device=0"' \
-e HF_TOKEN=$HF_TOKEN \
-e FRIENDLI_CONTAINER_SECRET=$FRIENDLI_CONTAINER_SECRET \
-v ~/.cache/huggingface:/root/.cache/huggingface \
registry.friendli.ai/trial \
--hf-model-name meta-llama/Llama-3.1-8B-Instruct
```
Since downloading `meta-llama/Llama-3.1-8B-Instruct` is allowed only for authorized users, you need to provide your [Hugging Face User Access Token](https://huggingface.co/docs/huggingface_hub/package_reference/environment_variables#hftoken) through `HF_TOKEN` environment variable.
It works the same for all private repositories.
This is an example running [Llama-3.1-70B-Instruct](https://huggingface.co/meta-llama/Llama-3.1-70B-Instruct) with a multi-GPU setup.
```sh {5, 11}
export FRIENDLI_CONTAINER_SECRET="" # Friendli container secret (leave it if it's already set in your environment)
export HF_TOKEN="" # Access token from Hugging Face (see the caution below)
docker run -p 8000:8000 \
--ipc=host --gpus '"device=0,1"' \
-e HF_TOKEN=$HF_TOKEN \
-e FRIENDLI_CONTAINER_SECRET=$FRIENDLI_CONTAINER_SECRET \
-v ~/.cache/huggingface:/root/.cache/huggingface \
registry.friendli.ai/trial \
--hf-model-name meta-llama/Llama-3.1-70B-Instruct \
--num-devices 2
```
Since downloading `meta-llama/Llama-3.1-70B-Instruct` is allowed only for authorized users, you need to provide your [Hugging Face User Access Token](https://huggingface.co/docs/huggingface_hub/package_reference/environment_variables#hftoken) through `HF_TOKEN` environment variable.
It works the same for all private repositories.
## Sending Inference Requests
We can now send inference requests to the running Friendli Container.
For information on all parameters that can be used in an inference request, please refer to [this document](/openapi/serverless/chat-completions).
### Examples
```sh curl
curl -X POST http://0.0.0.0:8000/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"messages": [
{"role": "user", "content": "What makes a good leader?"}
],
"max_tokens": 30,
"stream": true
}'
```
```python OpenAI Python SDK
import os

from openai import OpenAI

client = OpenAI(
    # The local container endpoint does not require a real OpenAI key;
    # a placeholder value keeps the SDK from raising on a missing key.
    api_key=os.environ.get("OPENAI_API_KEY", "EMPTY"),
    base_url="http://0.0.0.0:8000/v1",
)

completion = client.chat.completions.create(
    model="",
    messages=[
        {"role": "user", "content": "What makes a good leader?"}
    ],
    max_tokens=30,
    stream=True,
)

for chunk in completion:
    print(chunk.choices[0].delta.content, end="", flush=True)
```
```python Friendli Python SDK
from friendli import SyncFriendli

client = SyncFriendli()

stream = client.container.chat.complete(
    messages=[{"role": "user", "content": "Python is a popular"}],
    max_tokens=30,
    stream=True,
)

for chunk in stream:
    print(chunk.text, end="", flush=True)
```
## Options for Running Friendli Container
### General Options
| Options | Type | Summary | Default | Required |
| ----------- | ---- | -------------------------------------- | ------- | -------- |
| `--version` | - | Print Friendli Container version. | - | β |
| `--help` | - | Print Friendli Container help message. | - | β |
### Launch Options
| Options | Type | Summary | Default | Required |
| --------------------------------- | --------------------------- | ----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------- | ------------------- | -------- |
| `--web-server-port` | INT | Web server port. | 8000 | β |
| `--metrics-port` | INT | Prometheus metrics export port. | 8281 | β |
| `--hf-model-name` | TEXT | Model name hosted on the Hugging Face Models Hub or a path to a local directory containing a model. When a model name is provided, Friendli Container first checks if the model is already cached at \~/.cache/huggingface/hub and uses it if available. If not, it will download the model from the Hugging Face Models Hub before creating the inference endpoint. When a local path is provided, it will load the model from the location without downloading. This option is only available for models in a safetensors format. | - | β |
| `--tokenizer-file-path` | TEXT | Absolute path of tokenizer file. This option is not needed when `tokenizer.json` is located under the path specified at `--ckpt-path`. | - | β |
| `--tokenizer-add-special-tokens` | BOOLEAN | Whether or not to add special tokens in tokenization. Equivalent to Hugging Face Tokenizer's `add_special_tokens` argument. The default value is **false** for versions \< v1.6.0. | `true` | β |
| `--tokenizer-skip-special-tokens` | BOOLEAN | Whether or not to remove special tokens in detokenization. Equivalent to Hugging Face Tokenizer's `skip_special_tokens` argument. | `true` | β |
| `--dtype`                         | CHOICE: \[bf16, fp16, fp32] | Data type of weights and activations. Choose one of `bf16`, `fp16`, or `fp32`. This argument applies to non-quantized weights and activations. If not specified, Friendli Container follows the value of `torch_dtype` in the `config.json` file or assumes fp16. | fp16 | β |
| `--bad-stop-file-path` | TEXT | JSON file path that contains stop sequences or bad words/tokens. | - | β |
| `--num-request-threads` | INT | Thread pool size for handling HTTP requests. | 4 | β |
| `--timeout-microseconds` | INT | Server-side timeout for client requests, in microseconds. | 0 (no timeout) | β |
| `--ignore-nan-error` | BOOLEAN | If set to True, ignore NaN error. Otherwise, respond with a 400 status code if NaN values are detected while processing a request. | - | β |
| `--max-batch-size` | INT | Max number of sequences that can be processed in a batch. | 384 | β |
| `--num-devices`, `-d`             | INT                         | Number of devices to use for tensor parallelism (i.e., tensor parallelism degree). | 1 | β |
| `--num-workers`, `-n` | INT | Number of workers to use in a pipeline (i.e., pipeline parallelism degree). | 1 | β |
| `--search-policy` | BOOLEAN | Searches for the best engine policy for the given combination of model, hardware, and parallelism degree. Learn more about policy search at [Optimizing Inference with Policy Search](/guides/container/optimizing_inference_with_policy_search). | false | β |
| `--terminate-after-search` | BOOLEAN | Terminates engine container after the policy search. | false | β |
| `--algo-policy-dir` | TEXT | Path to directory containing the policy file. The default value is the current working directory. Learn more about policy search at [Optimizing Inference with Policy Search](/guides/container/optimizing_inference_with_policy_search). | current working dir | β |
| `--adapter-model`                 | TEXT                        | Add an adapter model as an adapter name and path pair in the form `NAME:PATH`. The path can be a model name from the Hugging Face model hub. | - | β |
### Model Specific Options
#### T5
| Options | Type | Summary | Default | Required |
| --------------------- | ---- | ---------------------- | ------- | -------- |
| `--max-input-length` | INT | Maximum input length. | - | β |
| `--max-output-length` | INT | Maximum output length. | - | β |
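As a hedged sketch of how these options combine with the general launch options (the T5 checkpoint name is illustrative, and actual support depends on your container image version):
```sh
docker run --gpus '"device=0"' -p 8000:8000 \
  -e FRIENDLI_CONTAINER_SECRET=$FRIENDLI_CONTAINER_SECRET \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  registry.friendli.ai/trial \
  --hf-model-name google/flan-t5-xl \
  --max-input-length 512 \
  --max-output-length 256
```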
# Running Friendli Container on SageMaker
Source: https://friendli.ai/docs/guides/container/sagemaker_integration
Create a real-time inference endpoint in Amazon SageMaker with Friendli Container backend. By utilizing Friendli Container in your SageMaker pipeline, you'll benefit from the Friendli Engine's speed and resource efficiency.
## Introduction
This guide will walk you through creating a [real-time inference endpoint in Amazon SageMaker](https://docs.aws.amazon.com/sagemaker/latest/dg/realtime-endpoints.html) with Friendli Container backend.
By utilizing Friendli Container in your SageMaker pipeline, you'll benefit from the Friendli Engine's speed and resource efficiency.
We'll explore how to create inference endpoints using both the AWS Console and the boto3 Python SDK.
## General Workflow
1. **Create a Model**: Within SageMaker Inference, define a new model by specifying the model artifacts in your S3 bucket and the Friendli container image from ECR.
2. **Configure the Endpoint**: Create a SageMaker Inference endpoint configuration by selecting the instance type and the number of instances required.
3. **Create the Endpoint**: Utilize the configured settings to launch a SageMaker Inference endpoint.
4. **Invoke the Endpoint**: Once deployed, send requests to your endpoint to receive inference responses.
## Prerequisite
Before beginning, you need to push the Friendli Container image to an ECR repository on AWS.
First, prepare the Friendli Container image by following the instructions in [**Pulling Friendli Container Image**](/guides/container/running_friendli_container/#pulling-friendli-container-image).
Then, tag and push the image to the Amazon ECR repository as guided in [**Pushing a Docker image to an Amazon ECR private repository**](https://docs.aws.amazon.com/AmazonECR/latest/userguide/docker-push-ecr-image.html).
## Using the AWS Console
Let's delve into the step-by-step instructions for creating an inference endpoint using the AWS Console.
### Step 1: Creating a Model
You can start creating a model by clicking on the **'Create model'** button under **SageMaker > Inference > Models**.
Then, configure the model with the following fields:
* **Model settings**:
* **Model name**: A model name.
* **IAM role**: An IAM role that includes the `AmazonSageMakerFullAccess` policy.
* **Container definition 1**:
* **Container input option**: Select the "Provide model artifacts and inference image location".
* **Model Compression Type**:
* To use a model in the S3 bucket:
* When the model is compressed, select "CompressedModel".
* Otherwise, select "UncompressedModel".
* When using a model from the Hugging Face hub, any option would work fine.
* **Location of inference code image**: Specify the ARN of the ECR repo for the Friendli Container.
* **Location of model artifacts** (optional):
* To use a model in the S3 bucket: Specify the S3 URI where your model is stored. Ensure the file structure matches the directory format compatible with the `--hf-model-name` option of the Friendli Container.
* When using a model from the Hugging Face hub, you can leave this field empty.
* **Environment variables**:
* Always required:
* `FRIENDLI_CONTAINER_SECRET`: Your Friendli Container Secret. Refer to [**Preparing Container Secret**](/guides/container/running_friendli_container/#preparing-container-secret) to learn how to get the container secret.
* `SAGEMAKER_MODE`: This should be set to `True`.
* `SAGEMAKER_NUM_DEVICES`: Number of devices to use for tensor parallelism degree.
* Required when using a model in the S3 bucket:
* `SAGEMAKER_USE_S3`: This should be set to `True`.
* Required when using a model from the Hugging Face hub:
* `SAGEMAKER_HF_MODEL_NAME`: The Hugging Face model name (e.g., `mistralai/Mistral-7B-Instruct-v0.2`)
* For private or gated model repos:
* `HF_TOKEN`: The Hugging Face secret access token.
### Step 2: Creating an Endpoint Configuration
You can start by clicking on the **'Create endpoint configuration'** button under **SageMaker > Inference > Endpoint configurations**.
* **Endpoint configuration**:
* **Endpoint configuration name**: The name of this endpoint configuration.
* **Type of endpoint**: For real-time inference, select "Provisioned".
* **Variants**:
* To create a "Production" variant, click 'Create production variant'.
* Select the model that you have created in [**Step 1**](#step-1-creating-a-model).
* Configure the instance type and count by clicking on 'Edit' in the Actions column.
* Create the endpoint configuration by clicking 'Create endpoint configuration'.
### Step 3: Creating SageMaker Inference Endpoint
You can start by clicking the **'Create endpoint'** button under **SageMaker > Inference > Endpoints**.
* Select "Use an existing endpoint configuration".
* Select the endpoint configuration created in [**Step 2**](#step-2-creating-an-endpoint-configuration).
* Finish by clicking on the 'Create endpoint' button.
### Step 4: Invoking Endpoint
When the endpoint status becomes "In Service", you can invoke the endpoint with the following script, after filling in the endpoint name and the region name:
```python
import boto3
import json

endpoint_name = "FILL OUT ENDPOINT NAME"
region_name = "FILL OUT AWS REGION"

sagemaker_runtime = boto3.client("sagemaker-runtime", region_name=region_name)

prompt = "Story title: 3 llamas go for a walk\nSummary: The 3 llamas crossed a bridge and something unexpected happened\n\nOnce upon a time"

payload = {
    "prompt": prompt,
    "max_tokens": 512,
    "temperature": 0.8,
}

response = sagemaker_runtime.invoke_endpoint(
    EndpointName=endpoint_name,
    Body=json.dumps(payload),
    ContentType="application/json",
)

print(response['Body'].read().decode('utf-8'))
```
## Using the boto3 SDK
Next, let's discover the process for creating a SageMaker endpoint using the boto3 Python SDK.
You can achieve this by using the code snippet below. Be sure to fill in the custom fields, customized for your specific use case:
```python
import boto3
from sagemaker import get_execution_role

sm_client = boto3.client(service_name='sagemaker')
runtime_sm_client = boto3.client(service_name='sagemaker-runtime')

account_id = boto3.client('sts').get_caller_identity()['Account']
region = boto3.Session().region_name
role = get_execution_role()

endpoint_name = "FILL OUT ENDPOINT NAME"
model_name = "FILL OUT MODEL NAME"
container = "FILL OUT ECR IMAGE NAME"  # .dkr.ecr..amazonaws.com/IMAGE
instance_type = "ml.g5.12xlarge"  # instance type

container = {
    'Image': container,
    'Environment': {
        "HF_TOKEN": "",
        "FRIENDLI_CONTAINER_SECRET": "",
        "SAGEMAKER_HF_MODEL_NAME": "",  # e.g., meta-llama/Meta-Llama-3-8B
        "SAGEMAKER_MODE": "True",  # Should be true
        "SAGEMAKER_NUM_DEVICES": "4",  # Number of GPUs in `instance_type`
    },
}

endpoint_config_name = 'FILL OUT ENDPOINT CONFIG NAME'

# Create a model
create_model_response = sm_client.create_model(
    ModelName=model_name,
    ExecutionRoleArn=role,
    Containers=[container],
)

# Create an endpoint configuration
create_endpoint_config_response = sm_client.create_endpoint_config(
    EndpointConfigName=endpoint_config_name,
    ProductionVariants=[
        {
            'InstanceType': instance_type,
            'InitialInstanceCount': 1,
            'InitialVariantWeight': 1,
            'ModelName': model_name,
            'VariantName': 'AllTraffic',
        },
    ],
)

endpoint_name = "FILL OUT ENDPOINT NAME"

# Create an endpoint
sm_client.create_endpoint(
    EndpointName=endpoint_name,
    EndpointConfigName=endpoint_config_name,
)

sm_client.describe_endpoint(EndpointName=endpoint_name)
```
You can invoke this endpoint by following [**Step 4**](#step-4-invoking-endpoint).
By following these guides, you'll be able to seamlessly deploy your models using Friendli Container on SageMaker endpoints and leverage their capabilities for real-time inference.
# Serving MoE Models
Source: https://friendli.ai/docs/guides/container/serving_moe_models
Explore the steps to serve Mixture of Experts (MoE) models such as Mixtral 8x7B using Friendli Container.
## Introduction
This guide explores the steps to serve Mixture of Experts (MoE) models such as Mixtral 8x7B using Friendli Container.
## Search Optimal Policy and Running Friendli Container
To serve MoE models efficiently, you need to run a policy search to find the optimal execution policy.
Learn how to run the policy search at [Running Policy Search](/guides/container/optimizing_inference_with_policy_search#running-policy-search).
When the optimal policy is found, it is compiled into a policy file, which can be used for creating serving endpoints.
The engine then starts serving the endpoint using the optimal policy.
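As a concrete sketch (mirroring the Mixtral 8x7B example from the policy search guide), the following command runs the policy search on four GPUs and then serves the endpoint with the resulting policy file:
```sh
export HF_MODEL_NAME="mistralai/Mixtral-8x7B-Instruct-v0.1"
export FRIENDLI_CONTAINER_SECRET="YOUR CONTAINER SECRET"
export POLICY_DIR=$PWD/policy

mkdir -p $POLICY_DIR

docker run -p 8000:8000 \
  --ipc=host --gpus '"device=0,1,2,3"' \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -v $POLICY_DIR:/policy \
  -e FRIENDLI_CONTAINER_SECRET=$FRIENDLI_CONTAINER_SECRET \
  registry.friendli.ai/trial \
  --hf-model-name $HF_MODEL_NAME \
  --num-devices 4 \
  --algo-policy-dir /policy \
  --search-policy true
```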
# Serving Multi-LoRA Models
Source: https://friendli.ai/docs/guides/container/serving_multi_lora_models
The Friendli Engine introduces an innovative approach to this challenge through Multi-LoRA (Low-Rank Adaptation) serving, a method that allows for the simultaneous serving of multiple LLMs, optimized for specific tasks without the need for extensive retraining.
## Introduction
In a world where the demand for highly specialized AI capabilities is surging, the ability to deploy multiple customized large language models (LLMs) without additional GPU resources represents a significant leap forward.
The Friendli Engine introduces an innovative approach to this challenge through Multi-LoRA (Low-Rank Adaptation) serving, a method that allows for the simultaneous serving of multiple LLMs, optimized for specific tasks without the need for extensive retraining.
This advancement opens new avenues for AI efficiency and adaptability, promising to revolutionize the deployment of AI solutions on constrained hardware.
This article provides an overview of efficiently serving Multi-LoRA models with the Friendli Engine.
## Prerequisite
`huggingface-cli` should be installed in your local environment.
```sh
pip install "huggingface_hub[cli]"
```
## Downloading Adapter Checkpoints
Each adapter model that you want to serve must be downloaded to your local storage.
```sh
# Hugging Face model name of the adapters
export ADAPTER_MODEL1=""
export ADAPTER_MODEL2=""
export ADAPTER_MODEL3=""
export ADAPTER_DIR=/tmp/adapter
huggingface-cli download $ADAPTER_MODEL1 \
--include "adapter_model.safetensors" "adapter_config.json" \
--local-dir $ADAPTER_DIR/model1
huggingface-cli download $ADAPTER_MODEL2 \
--include "adapter_model.safetensors" "adapter_config.json" \
--local-dir $ADAPTER_DIR/model2
huggingface-cli download $ADAPTER_MODEL3 \
--include "adapter_model.safetensors" "adapter_config.json" \
--local-dir $ADAPTER_DIR/model3
...
```
This will result in a directory structure like:
```
/tmp/adapter/model1
- adapter_model.safetensors
- adapter_config.json
/tmp/adapter/model2
- adapter_model.safetensors
- adapter_config.json
/tmp/adapter/model3
- adapter_model.safetensors
- adapter_config.json
```
If an adapter's Hugging Face repo does not contain an `adapter_model.safetensors` checkpoint file, you have to manually convert `adapter_model.bin` into `adapter_model.safetensors`.
You can use the [official app](https://huggingface.co/spaces/safetensors/convert) or the [Python script](https://github.com/huggingface/safetensors/tree/main/bindings/python) for conversion.
## Launch Friendli Engine in Container
Once you have prepared the adapter model checkpoints, you can serve the Multi-LoRA model with Friendli Container.
In addition to the command for running the base model, you have to add the `--adapter-model` argument.
* `--adapter-model`: Add an adapter model with an adapter name and path. The path can also be a model name from the Hugging Face hub.
```sh
# Fill the values of following variables.
export HF_BASE_MODEL_NAME="" # Hugging Face base model name (e.g., "meta-llama/Llama-2-7b-chat-hf")
export FRIENDLI_CONTAINER_SECRET="" # Friendli container secret
export FRIENDLI_CONTAINER_IMAGE="" # Friendli container image (e.g., "registry.friendli.ai/trial")
export GPU_ENUMERATION="" # GPUs (e.g., '"device=0,1"')
export ADAPTER_NAME="" # Specify the adapter's name (a user-defined alias).
export ADAPTER_DIR=/tmp/adapter
docker run \
--gpus $GPU_ENUMERATION \
-p 8000:8000 \
-v $ADAPTER_DIR:/adapter \
-e FRIENDLI_CONTAINER_SECRET=$FRIENDLI_CONTAINER_SECRET \
$FRIENDLI_CONTAINER_IMAGE \
--hf-model-name $HF_BASE_MODEL_NAME \
--adapter-model $ADAPTER_NAME:/adapter/model1 \
[LAUNCH_OPTIONS]
```
You can find available options for `[LAUNCH_OPTIONS]` at [Running Friendli Container: Launch Options](/guides/container/running_friendli_container#launch-options).
If you want to launch with multiple adapters, you can pass `--adapter-model` a comma-separated string
(e.g. `--adapter-model "adapter_name_0:/adapter/model1,adapter_name_1:/adapter/model2"`).
If a `tokenizer_config.json` file is present in an adapter checkpoint path, the engine uses the chat template from that `tokenizer_config.json` for the adapter.
### Example: Llama 2 7B Chat + LoRA Adapter
This is an example that runs [`meta-llama/Llama-2-7b-chat-hf`](https://huggingface.co/meta-llama/Llama-2-7b-chat-hf) with [`FinGPT/fingpt-forecaster_dow30_llama2-7b_lora`](https://huggingface.co/FinGPT/fingpt-forecaster_dow30_llama2-7b_lora) adapter model.
```sh
export ADAPTER_DIR=/tmp/adapter
huggingface-cli download FinGPT/fingpt-forecaster_dow30_llama2-7b_lora \
--include "adapter_model.safetensors" "adapter_config.json" \
--local-dir $ADAPTER_DIR/model1
docker run \
--gpus '"device=0"' \
-p 8000:8000 \
-v $ADAPTER_DIR:/adapter \
-e FRIENDLI_CONTAINER_SECRET="YOUR CONTAINER SECRET" \
registry.friendli.ai/trial \
--hf-model-name meta-llama/Llama-2-7b-chat-hf \
--adapter-model adapter-model-name:/adapter/model1
```
## Sending Request to Specific Adapter
You can generate an inference result from a specific adapter model by specifying `model` in the body of an inference request.
For example, assuming you set the `--adapter-model` launch option to `<adapter-name>:<adapter-path>`, you can send a request to that adapter as follows.
```sh
curl -X POST http://0.0.0.0:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"model": "adapter-model-name",
"prompt": "Python is a language",
"max_tokens": 30
}'
```
## Sending Request to the Base Model
If you omit the `model` field in your request, the base model will be used to generate the inference result.
You can send a request to the base model as shown below.
```sh
curl -X POST http://0.0.0.0:8000/v1/completions \
-H "Content-Type: application/json" \
-d '{
"prompt": "Python is a language",
"max_tokens": 30
}'
```
## Limitations
We only support models compatible with [`peft`](https://github.com/huggingface/peft).
Base model checkpoint and adapter model checkpoint should have the same datatype.
When serving multiple adapters simultaneously, each adapter model should have the same target modules. In Hugging Face, the target modules are listed in `adapter_config.json`.
# Serving Quantized Models
Source: https://friendli.ai/docs/guides/container/serving_quantized_models
Tutorial for serving quantized models with Friendli Engine. Friendli Engine supports FP8, INT8, and AWQ model checkpoints.
## Introduction
Quantization is a technique that reduces the precision of a generative AI model's parameters, optimizing memory usage and inference speed while maintaining acceptable accuracy.
This tutorial will walk you through the process of serving quantized models with Friendli Container.
## Off-the-Shelf Model Checkpoints from Hugging Face Hub
To use model checkpoints that are already quantized and available on Hugging Face Hub, check the following options:
* Checkpoints quantized with [friendli-model-optimizer](https://github.com/friendliai/friendli-model-optimizer)
* [Quantized model checkpoints by FriendliAI](https://huggingface.co/FriendliAI)
* a subset of models quantized with:
* [`AutoAWQ`](https://github.com/casper-hansen/AutoAWQ)
* [`AutoFP8`](https://github.com/neuralmagic/AutoFP8)
* [`llm-compressor`](https://github.com/vllm-project/llm-compressor)
For details on how to use these models, go directly to [Serving Quantized Models](#serving-quantized-models).
## Quantizing Your Own Models (FP8/INT8)
To quantize your own models with FP8 or INT8, follow these steps:
1. **Install the `friendli-model-optimizer` package**
This tool provides model quantization for efficient generative AI serving with Friendli Engine. Install it using the following command:
```sh
pip install "friendli-model-optimizer"
```
2. **Prepare the Original Model**
Ensure you have the original model checkpoint that can be loaded using Hugging Face's [`transformers`](https://github.com/huggingface/transformers) library.
3. **Quantize Model with Friendli Model Optimizer (FMO)**
You can simply run quantization with the command below:
```sh
export MODEL_NAME_OR_PATH="" # Hugging Face pretrained model name or directory path of the original model checkpoint.
export OUTPUT_DIR="" # Directory path to save the quantized checkpoint and related configurations.
export QUANTIZATION_SCHEME="" # Quantization techniques to apply. You can use fp8, int8.
export DEVICE="" # Device to run the quantization process. Defaults to "cuda:0".
fmo quantize \
--model-name-or-path $MODEL_NAME_OR_PATH \
--output-dir $OUTPUT_DIR \
--mode $QUANTIZATION_SCHEME \
--device $DEVICE
```
When the model checkpoint is successfully quantized, the following files will be created at `$OUTPUT_DIR`.
* `config.json`
* `model.safetensors`
* `special_tokens_map.json`
* `tokenizer_config.json`
* `tokenizer.json`
If the size of the model exceeds **10GB**, multiple sharded checkpoints are generated as follows instead of a single `model.safetensors`.
* `model-00001-of-00005.safetensors`
* `model-00002-of-00005.safetensors`
* `model-00003-of-00005.safetensors`
* `model-00004-of-00005.safetensors`
* `model-00005-of-00005.safetensors`
For more information about FMO, check out [this documentation](https://github.com/friendliai/friendli-model-optimizer).
## Serving Quantized Models
### Search Optimal Policy
To serve quantized models efficiently, you need to run a policy search to find the optimal execution policy.
Learn how to run the policy search at [Running Policy Search](/guides/container/optimizing_inference_with_policy_search#running-policy-search).
### Serving FP8 Models
Once you have prepared the quantized model checkpoint, you are ready to create a serving endpoint.
```sh
# Fill the values of following variables.
export HF_MODEL_NAME="" # Quantized model name in Hugging Face Hub or directory path of the quantized model checkpoint.
export FRIENDLI_CONTAINER_SECRET="" # Friendli container secret
export FRIENDLI_CONTAINER_IMAGE="" # Friendli container image (e.g., "registry.friendli.ai/trial")
export GPU_ENUMERATION="" # GPUs (e.g., '"device=0,1"')
export POLICY_DIR=$PWD/policy
mkdir -p $POLICY_DIR
docker run \
--gpus $GPU_ENUMERATION \
-p 8000:8000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-v $POLICY_DIR:/policy \
-e FRIENDLI_CONTAINER_SECRET=$FRIENDLI_CONTAINER_SECRET \
$FRIENDLI_CONTAINER_IMAGE \
--hf-model-name $HF_MODEL_NAME \
--algo-policy-dir /policy \
--search-policy true
```
### Example: `FriendliAI/Llama-3.1-8B-Instruct-fp8`
FP8 model serving is only supported by NVIDIA **Ada**, **Hopper**, and **Blackwell** GPU architectures.
```sh
# Fill the values of following variables.
export FRIENDLI_CONTAINER_SECRET="" # Friendli container secret
export FRIENDLI_CONTAINER_IMAGE="" # Friendli container image (e.g., "registry.friendli.ai/trial")
export GPU_ENUMERATION="" # GPUs (e.g., '"device=0,1"')
export POLICY_DIR=$PWD/policy
mkdir -p $POLICY_DIR
# Make sure to run the policy search so that the policy file is available in $POLICY_DIR.
docker run \
--gpus $GPU_ENUMERATION \
-p 8000:8000 \
-v ~/.cache/huggingface:/root/.cache/huggingface \
-v $POLICY_DIR:/policy \
-e FRIENDLI_CONTAINER_SECRET=$FRIENDLI_CONTAINER_SECRET \
$FRIENDLI_CONTAINER_IMAGE \
--hf-model-name FriendliAI/Llama-3.1-8B-Instruct-fp8 \
--algo-policy-dir /policy \
--search-policy true
```
# Autoscaling
Source: https://friendli.ai/docs/guides/dedicated_endpoints/autoscaling
Autoscaling is a feature that automatically adjusts the number of replicas serving your endpoint based on traffic.
## Intelligent Autoscaling
Our autoscaling system automatically adjusts computational resources based on your traffic patterns, helping you optimize both performance and costs.
### How Autoscaling Works
* **Minimum Replicas**:
* When set to 0, the endpoint enters sleeping status during periods of inactivity, helping to minimize costs
* When set to a value greater than 0, the endpoint maintains at least that number of active replicas at all times
* **Maximum Replicas**: Defines the upper limit of replicas that can be created to handle increased traffic load
* **Cooldown Period**: Measured in seconds; if no requests are received during this period, the endpoint transitions to sleeping status.
### Autoscaling types
We highly recommend using the **Default** autoscaling type, as it performs stably across most workloads.
Please note that other configurations may cause performance degradation or unexpected charges if you do not have a proper understanding of your workload characteristics.
We provide **2 types of autoscaling**, but only the **Default** option is available for non-Enterprise plans.
* **Default** (Recommended): This is the best choice for the majority of users. It operates reliably across most workloads with no configuration required, leveraging our internal expertise to provide a balanced approach to performance and cost.
* **Request count** (Enterprise plan only): This is an advanced option for users who have a deep understanding of their workload characteristics and require granular control over scaling behavior.
* As users define the number of requests a single worker will handle, cost prediction becomes more straightforward and intuitive.
* This method can serve as a foundation for implementing your own custom autoscaling logic by dynamically changing the threshold via an API, targeting custom metrics.
### Benefits of Autoscaling
* **Cost Optimization**: Pay only for the resources you need by automatically scaling to zero during idle periods
* **Performance Management**: Handle traffic spikes efficiently by automatically adding replicas
* **Resource Efficiency**: Maintain optimal resource utilization across varying workload patterns
# Dataset Specifications and Upload Guide
Source: https://friendli.ai/docs/guides/dedicated_endpoints/dataset
Learn how to upload datasets on Friendli.
### Uploading Datasets
This document explains how to upload datasets. On Friendli, you can upload datasets via the web interface or the SDK.
Through the web interface, files in `.jsonl` and `.parquet` formats are supported, and each dataset should be structured as follows:
#### Conversation
This is the most basic dataset format. The `role` field can be `system`, `user`, or `assistant`.
```
{"messages": [{"role": "...", "content": "..."}]}
```
#### Alpaca (Beta)
Two types of Alpaca datasets are supported as shown below.\
For compatibility with the Conversation format, they are automatically converted according to a template during upload. If you do not want automatic conversion, please convert to the Conversation format before uploading, or use the SDK to upload.
```
{"instruction": "...", "output": "..."}
{"instruction": "...", "input": "...", "output": "..."}
```
#### Multi-Modal (Image)
For multi-modal inputs, the following three formats are supported for compatibility.\
Currently, the web interface does not support `local path`, `base64`, or `PIL.Image` objects. For these cases, please use the SDK to upload.
```
{"messages": [{"role": "...", "content": [{"type": "text", "text": "..."}, {"type": "image", "image": "https://example.com/image.jpg"}]}]}
{"messages": [{"role": "...", "content": [{"type": "text", "text": "..."}, {"type": "image", "image_url": "https://example.com/image.jpg"}]}]}
{"messages": [{"role": "...", "content": [{"type": "text", "text": "..."}, {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}]}]}
```
### How to Upload a Dataset
First, go to the **'Datasets'** section in the [Friendli Suite](https://friendli.ai/suite).
Click the **'New Dataset'** button to start the upload process.\
From the dropdown, select **'Upload a file directly'** option.
Click the File Upload Area in the Dataset file section, or drag and drop the file you want to upload. Then click the **'Upload'** button to start uploading.
The dataset will be uploaded progressively in the background. Once the upload is complete, you can rename it, add splits, and preview each split.
## Prerequisites
1. Head to [Friendli Suite](https://friendli.ai/get-started/dedicated-endpoints) and create an account.
2. Issue a **Friendli Token** by going to [Personal settings > Tokens](https://friendli.ai/suite/setting/tokens).
Make sure to copy and store it securely in a safe place as you won't be able to see it again after refreshing the page.\
For detailed instructions, see [Personal Access Tokens](/guides/suite/personal_access_tokens).
## Step 1. Prepare Your Dataset
Your dataset should be a conversational dataset in `.jsonl` or `.parquet` format, where each line represents a sequence of messages. Each message in the conversation should include a `"role"` (e.g., `system`, `user`, or `assistant`) and `"content"`. For VLM fine-tuning, user content can contain both text and image data (Note that for image data, we support URL and Base64).
Here's an example of what it should look like. Note that it's one line but beautified for readability:
```json
{
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": [
{
"type": "image",
"image": "https://huggingface.co/datasets/huggingface/documentation-images/resolve/main/bee.jpg"
},
{
"type": "image",
"image": "data:image/png;base64,"
},
{
"type": "text",
"text": "Describe this image in detail."
}
]
},
{
"role": "assistant",
"content": "The image is a bee."
}
]
}
```
You can access our example datasets ['FriendliAI/gsm8k'](https://huggingface.co/datasets/FriendliAI/gsm8k) (for chat) and ['FriendliAI/sample-vision'](https://huggingface.co/datasets/FriendliAI/sample-vision) (for chat with images), and explore some of our quantized generative AI models on [our Hugging Face page](https://huggingface.co/FriendliAI).
## Step 2. Upload Your Dataset
Once you have prepared your dataset, you can upload it to Friendli using the [Python SDK](/sdk/python-sdk).
### Install the Python SDK
First, install the Friendli Python SDK:
```bash
# Using pip
pip install friendli
# Using poetry
poetry add friendli
```
### Upload Your Dataset
Use the following code to create a dataset and upload your samples:
```python
import os
from friendli.friendli import SyncFriendli
from friendli.models import Sample
TEAM_ID = os.environ["FRIENDLI_TEAM_ID"]
PROJECT_ID = os.environ["FRIENDLI_PROJECT_ID"]
TOKEN = os.environ["FRIENDLI_TOKEN"]
# Read dataset file and parse each line as a Sample
with open("dataset.jsonl", "rb") as f:
data = [Sample.model_validate_json(line) for line in f]
with SyncFriendli(
token=TOKEN,
x_friendli_team=TEAM_ID,
) as friendli:
# Create a new dataset with TEXT and IMAGE modalities
with friendli.dataset.create(
modality=["TEXT", "IMAGE"],
name="my-vlm-dataset", # name of the dataset
project_id=PROJECT_ID,
) as dataset:
# Upload samples to the dataset
# Each line from your dataset file becomes a separate sample
dataset.upload_samples(
samples=data,
split="train", # name of the split to upload to
)
```
### How It Works
The Friendli Python SDK doesn't upload your entire dataset file at once. Instead, it processes your dataset more efficiently:
1. **Reads your dataset file line by line**: Each line is parsed as a `Sample` object containing a conversation with messages.
2. **Creates a dataset**: A new dataset is created in your Friendli project with the specified modalities (`TEXT` and `IMAGE`).
3. **Uploads each conversation as a separate sample**: Rather than uploading the entire file, each conversation (line in the dataset file) becomes an individual sample in the dataset.
4. **Organizes by splits**: Samples are organized into splits like "train", "validation", or "test" for different purposes (see the sketch below).
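For example, you could divide a single file into train and validation splits before uploading. The sketch below reuses the `Sample` class and `upload_samples` call from the snippet above; the 90/10 split ratio and the dataset name are illustrative assumptions:
```python
import os
from friendli.friendli import SyncFriendli
from friendli.models import Sample

TEAM_ID = os.environ["FRIENDLI_TEAM_ID"]
PROJECT_ID = os.environ["FRIENDLI_PROJECT_ID"]
TOKEN = os.environ["FRIENDLI_TOKEN"]

# Parse the dataset file line by line, as in the snippet above.
with open("dataset.jsonl", "rb") as f:
    data = [Sample.model_validate_json(line) for line in f]

# Illustrative 90/10 split of the parsed samples.
split_idx = int(len(data) * 0.9)
train_samples, val_samples = data[:split_idx], data[split_idx:]

with SyncFriendli(token=TOKEN, x_friendli_team=TEAM_ID) as friendli:
    with friendli.dataset.create(
        modality=["TEXT", "IMAGE"],
        name="my-vlm-dataset-with-splits",  # hypothetical dataset name
        project_id=PROJECT_ID,
    ) as dataset:
        dataset.upload_samples(samples=train_samples, split="train")
        dataset.upload_samples(samples=val_samples, split="validation")
```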
### Environment Variables
Make sure to set the required environment variables:
```bash
export FRIENDLI_TOKEN="your-friendli-token"
export FRIENDLI_TEAM_ID="your-team-id"
export FRIENDLI_PROJECT_ID="your-project-id"
```
You can find your Team ID and Project ID in the URL of Friendli Suite, formatted as `https://friendli.ai/<team-id>/<project-id>/...`.
### View Your Dataset
To view and edit the datasets you've uploaded, visit [Friendli Suite > Dataset](https://friendli.ai/suite/~/dataset).
# Deploy with Hugging Face Models
Source: https://friendli.ai/docs/guides/dedicated_endpoints/deploy_with_huggingface
Hands-on tutorial for launching and deploying LLMs using Friendli Dedicated Endpoints with Hugging Face models.
#### Hands-on Tutorial
Deploying `meta-llama-3-8b-instruct` LLM from Hugging Face using Friendli Dedicated Endpoints
## Introduction
With Friendli Dedicated Endpoints, you can easily spin up scalable, secure, and highly available inference deployments, without the need for extensive infrastructure expertise or significant capital expenditures.
This tutorial is designed to guide you through the process of launching and deploying LLMs using Friendli Dedicated Endpoints. Through a series of step-by-step instructions and hands-on examples, you'll learn how to:
* Select and deploy pre-trained LLMs from Hugging Face repositories
* Deploy and manage your models using the Friendli Engine
* Monitor and optimize your inference deployments
By the end of this tutorial, you'll be equipped with the knowledge and skills necessary to unlock the full potential of LLMs in your applications, products, and services. So, let's get started and explore the possibilities of Friendli Dedicated Endpoints!
## Prerequisites:
* A Friendli Suite account with access to [Friendli Dedicated Endpoints](https://friendli.ai/suite)
* A Hugging Face account with access to the [meta-llama-3-8b-instruct](https://huggingface.co/meta-llama/Meta-Llama-3-8B-Instruct) model
## Step 1: Create a new endpoint
1. Log in to your Friendli Suite account and navigate to the Friendli Dedicated Endpoints dashboard.
2. If not done already, start the free trial for Dedicated Endpoints.
3. Create a new project, then click on the 'New Endpoint' button.
4. Fill in the basic information:
* Endpoint name: Choose a unique name for your endpoint (e.g., "My New Endpoint").
5. Select the model:
* Model Repository: Select "Hugging Face" as the model provider.
* Model ID: Enter "meta-llama/Meta-Llama-3-8B-Instruct" as the model id. As the search bar loads the list, click on the top result that exactly matches the repository id.
By default, the latest commit on the model's default branch is pulled. You may manually select a specific branch, tag, or commit instead.
If you're using your own model, check [Format Requirements](/guides/dedicated_endpoints/faq#format-requirements) for requirements.
6. Select the instance:
* Instance configuration: Choose a suitable instance type based on your performance requirements. We suggest 1x A100 80G for most models.
For large models, some instance options may be unavailable because they are guaranteed not to run due to insufficient VRAM.
7. Edit the configurations:
* Autoscaling: By default, the autoscaling ranges from 0 to 2 replicas. This means that the deployment will sleep when it's not being used, which reduces cost.
* Advanced configuration: Some LLM options including the batch size and token configurations are mutable. For this tutorial, we'll leave it as-is.
8. Click 'Create' to create a new endpoint.
## Step 2: Test the endpoint
1. Wait for the deployment to be created and initialized. This may take a few minutes.
You may check the status by the indicator under the endpoint's name.
2. In the "Playground" section, you may enter a sample input prompt (e.g., "What is the capital of France?").
3. Click on the right arrow button to send the inference request.
4. If you are an Enterprise plan user, you can use the "Metrics" and "Logs" section to monitor the endpoint.
## Step 3: Send requests by using curl or Python
1. As instructed in our [API docs](/openapi/serverless/chat-completions), you can send inference requests with the following code:
```python OpenAI Python SDK
import os
from openai import OpenAI
client = OpenAI(
api_key=os.getenv("FRIENDLI_TOKEN"),
base_url="https://api.friendli.ai/dedicated/v1",
)
chat_completion = client.chat.completions.create(
model="YOUR_ENDPOINT_ID",
messages=[
{
"role": "user",
"content": "Tell me how to make a delicious pancake"
}
]
)
print(chat_completion.choices[0].message.content)
```
```python Friendli Python SDK
import os
from friendli import SyncFriendli
client = SyncFriendli(
token=os.getenv("FRIENDLI_TOKEN"),
)
chat_completion = client.dedicated.chat.complete(
model="YOUR_ENDPOINT_ID",
messages=[
{
"role": "user",
"content": "Tell me how to make a delicious pancake"
}
]
)
print(chat_completion.choices[0].message.content)
```
```sh curl
curl -X POST https://api.friendli.ai/dedicated/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $FRIENDLI_TOKEN" \
-d '{
"model": "(endpoint-id)",
"messages": [
{
"role": "user",
"content": "What is the capital of France?"
}
],
"max_tokens": 200,
"top_k": 1
}'
```
2. You can update the model and change almost every configuration by clicking the update button.
# Deploy with W&B Models
Source: https://friendli.ai/docs/guides/dedicated_endpoints/deploy_with_wandb
Hands-on tutorial for launching and deploying LLMs using Friendli Dedicated Endpoints with Weights & Biases artifacts.
#### Hands-on Tutorial
Deploying `meta-llama-3-8b-instruct` LLM from W\&B using Friendli Dedicated Endpoints
## Introduction
With Friendli Dedicated Endpoints, you can easily spin up scalable, secure, and highly available inference deployments, without the need for infrastructure expertise or significant capital expenditures.
This tutorial is designed to guide you through the process of launching and deploying LLMs using Friendli Dedicated Endpoints. Through a series of step-by-step instructions and hands-on examples, you'll learn how to:
* Select and deploy pre-trained LLMs from W\&B artifacts
* Deploy and manage your models using the Friendli Engine
* Monitor and optimize your inference deployments
By the end of this tutorial, you'll be equipped with the knowledge and skills necessary to unlock the full potential of LLMs in your applications, products, and services. So, let's get started and explore the possibilities of Friendli Dedicated Endpoints!
## Prerequisites:
* A Friendli Suite account with access to [Friendli Dedicated Endpoints](https://friendli.ai/suite)
* A W\&B account with an API key (used as an access token)
## Step 1: Create a new endpoint
1. Log in to your Friendli Suite account and navigate to the Friendli Dedicated Endpoints dashboard.
2. If not done already, start the free trial for Dedicated Endpoints.
3. Create a new project, then click on the 'New Endpoint' button.
4. [Integrate your W\&B account with an API key.](https://wandb.ai/authorize)
5. Fill in the basic information:
* Endpoint name: Choose a unique name for your endpoint (e.g., "My New Endpoint").
6. Select the model:
* Model Repository: Select "Weights & Biases" as the model provider.
* Model ID: Enter `friendliai/model-registry/Meta-Llama-3-8B-Instruct:v0` as the model id.
If you're using your own model, check [Format Requirements](/guides/dedicated_endpoints/faq#format-requirements) for requirements.
7. Select the instance:
* Instance configuration: Choose a suitable instance type based on your performance requirements. We suggest 1x A100 80G for most models.
For large models, some instance options may be unavailable because they are guaranteed not to run due to insufficient VRAM.
8. Edit the configurations:
* Autoscaling: By default, the autoscaling ranges from 0 to 2 replicas. This means that the deployment will sleep when it's not being used, which reduces cost.
* Advanced configuration: Some LLM options including the maximum processing batch size and token configurations can be updated. For this tutorial, we'll leave it as-is.
9. Click 'Create' to create a new endpoint.
## Step 2: Test the endpoint
1. Wait for the deployment to be created and initialized. This may take a few minutes.
You may check the status by the indicator under the endpoint's name.
2. In the "Playground" section, you may enter a sample input prompt (e.g., "What is the capital of France?").
3. Click on the right arrow button to send the inference request.
4. If you are an Enterprise plan user, you can use the "Metrics" and "Logs" section to monitor the endpoint.
## Step 3: Send requests by using curl or Python
1. As instructed in our [API docs](/openapi/serverless/chat-completions), you can send inference requests with the following code:
```python OpenAI Python SDK
import os
from openai import OpenAI
client = OpenAI(
api_key=os.getenv("FRIENDLI_TOKEN"),
base_url="https://api.friendli.ai/dedicated/v1",
)
chat_completion = client.chat.completions.create(
model="YOUR_ENDPOINT_ID",
messages=[
{
"role": "user",
"content": "What is the capital of France?"
}
],
max_tokens=200,
extra_body={
"top_k": 1
}
)
print(chat_completion.choices[0].message.content)
```
```python Friendli Python SDK
import os
from friendli import SyncFriendli
client = SyncFriendli(
token=os.getenv("FRIENDLI_TOKEN"),
)
chat_completion = client.dedicated.chat.complete(
model="YOUR_ENDPOINT_ID",
messages=[
{
"role": "user",
"content": "What is the capital of France?"
}
],
max_tokens=200,
top_k=1
)
print(chat_completion.choices[0].message.content)
```
```sh curl
curl -X POST https://api.friendli.ai/dedicated/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $FRIENDLI_TOKEN" \
-d '{
"model": "YOUR_ENDPOINT_ID",
"messages": [
{
"role": "user",
"content": "What is the capital of France?"
}
],
"max_tokens": 200,
"top_k": 1
}'
```
2. You can update the model and change almost every configuration by clicking the update button.
# Endpoints
Source: https://friendli.ai/docs/guides/dedicated_endpoints/endpoints
Endpoints are the actual deployments of your models on your specified GPU resource.
## What are Endpoints?
Endpoints are the actual deployments of your models on a dedicated GPU resource.
They provide a stable and efficient interface to serve your models in real-world applications, ensuring high availability and optimized performance.
With endpoints, you can manage model versions, scale resources, and seamlessly integrate your model into production environments.
### Key Capabilities of Endpoints:
* **Efficient Model Serving**: Deploy models on powerful GPU instances optimized for your use case.
* **Flexibility with Multi-LoRA Models**: Serve multiple fine-tuned adapters alongside base models.
* **Autoscaling**: Automatically adjust resources to handle varying workloads, ensuring optimal performance and cost efficiency.
* **Monitoring and Management**: Check endpoint health, adjust configurations, and view logs directly from the platform.
* **Interactive Testing**: Use the integrated playground to test your models before integrating them into applications.
* **API Integration**: Access your models via robust OpenAI-compatible APIs, enabling easy integration into any system.
## Creating Endpoints
You can create your endpoint by specifying the name, the model, and the instance configuration, consisting of your desired GPU specification.
## Selecting Instance
Instance selection depends on your model size and workload.
If the selected GPU type or count is insufficient for the model size, the instance may not be selectable. If the configuration is expected to prevent you from fully leveraging the capabilities of the Friendli Inference Engine, you'll see a *'TIGHT MEMORY'* warning. In such cases, we recommend enabling Online Quantization or increasing the GPU count.
## Available Features
### Online Quantization
This feature description has moved to the [Online Quantization](/guides/dedicated_endpoints/online-quantization) page. Please refer to that page for more information.
### Speculative Decoding
This feature description has moved to the [Speculative Decoding](/guides/dedicated_endpoints/speculative-decoding) page. Please refer to that page for more information.
### Serving Multi-LoRA Models
This feature description has moved to the [Multi-LoRA Serving](/guides/dedicated_endpoints/multi-lora-serving) page. Please refer to that page for more information.
### Custom Chat Templates
Customize chat formatting by uploading your own [Jinja](https://jinja.palletsprojects.com/en/stable/) templates when creating Dedicated Endpoint instances. This overrides the model's default chat template and gives you full control over how inputs and outputs are displayed.
### Reasoning Parsing
You can configure the default behavior of an endpoint by setting the `parse_reasoning` configuration during its creation. This default will apply when the corresponding argument is not explicitly provided in incoming requests.
For more details, refer to the [Reasoning Parsing with Friendli](/guides/reasoning#reasoning-parsing-with-friendli) documentation.
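As an illustration, a per-request override could look like the sketch below. The request-level field name is assumed here to mirror the endpoint configuration name (`parse_reasoning`); please check the linked reasoning guide for the authoritative parameter.
```sh
# Assumption: the request body accepts a field mirroring the endpoint's
# `parse_reasoning` configuration; verify the exact name in the reasoning guide.
curl -X POST https://api.friendli.ai/dedicated/v1/chat/completions \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $FRIENDLI_TOKEN" \
-d '{
"model": "YOUR_ENDPOINT_ID",
"messages": [{"role": "user", "content": "What is 17 * 24?"}],
"parse_reasoning": true
}'
```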
## Checking Endpoint Status
After creating the Endpoint, you can view its health status and Endpoint URL on the Endpoint's details page.
The cost of using dedicated endpoints accumulates from the *'INITIALIZING'* status.
Specifically, charges begin after the *'Initializing GPU'* phase, where the endpoint waits to acquire the GPU.
The endpoint then downloads and loads the model onto the GPU, which usually takes less than a minute.
### Model max context length and KV cache size
**Model max context length** is determined by how the model was trained, while **KV cache size** depends on memory and is affected by your instance type and Online Quantization setting.
In some workloads, having a KV cache size smaller than the model max context length may still work fine.\
However, to fully leverage the performance of the Friendli Inference Engine, we recommend first enabling [Online Quantization](/guides/dedicated_endpoints/endpoints#online-quantization) (as it doesn't require changing your instance), and then, if needed, selecting a GPU with more VRAM or increasing the number of GPUs.
## Using Playgrounds
To test the deployed model via the web, we provide a playground interface where you can interact with the model using a user-friendly chat interface.
Simply enter your query, adjust your settings, and generate your responses!
Send inference queries to your model through our [API](/openapi) at the given endpoint address, accessible on the endpoint information tab.
# Frequently Asked Questions and Troubleshooting
Source: https://friendli.ai/docs/guides/dedicated_endpoints/faq
While following through our tutorials, you might have had questions regarding the details of the requirements and specifications. We have listed out the frequently asked questions as a separate document.
## Integrations
1. Log in to Hugging Face, then navigate to [Access Tokens](https://huggingface.co/settings/tokens).
2. Create a new token. You may use a fine-grained token. In this case, please make sure the token has view permission for the repository you'd like to use.
3. Integrate the key in Friendli Suite → Personal settings → [Integrations](https://friendli.ai/suite/setting/integrations).
If you revoke or invalidate the key, you will have to update it to avoid disrupting ongoing deployments or to launch new inference deployments.
1. Log in to your W\&B account at the [authorization page](https://wandb.ai/authorize), then navigate to User Settings, and scroll to the API Keys section.
2. Acquire a token.
3. Integrate the key in Friendli Suite → Personal settings → [Integrations](https://friendli.ai/suite/setting/integrations).
If you revoke or invalidate the key, you will have to update it to avoid disrupting ongoing deployments or to launch new inference deployments.
## Using 3rd-party models
* Make sure to use the full name of the artifact.
* The *artifact name* must be in the format of `org/project/artifact_id:version`
1. Install the CLI and log in with your API key. See the [W\&B CLI documentation](https://docs.wandb.ai/ref/cli) for details.
2. Upload the model as a W\&B artifact using the command below:
```bash
wandb artifact put -n project/artifact_id --type model /path/to/dir
```
3. With all this done, the W\&B artifact is ready to be used as the model source.
* Use the repository id of the model. You may select the entry from the list of autocompleted model repositories.
* You may choose a specific branch, or manually enter a commit hash.
## Format Requirements
* A model should be in safetensors format.
* The model should NOT be nested inside another directory.
* Including other arbitrary files (that are not in the list) is totally fine. However, those files will not be downloaded or used.
| Required | Filename | Description |
| -------- | ------------------------- | -------------------------------------------------------------------------------------------------------------------- |
| Yes | *safetensors* | Model weight, e.g. model.safetensors. Use model.safetensors.index.json for split safetensors files |
| Yes | config.json | Model config that includes the architecture. ([Supported Models on Friendli](https://friendli.ai/models)) |
| No | tokenizer.json | Tokenizer for the model |
| No | tokenizer\_config.json | Tokenizer config. This should be present & have a `chat_template` field for the Friendli Engine to provide chat APIs |
| No | special\_tokens\_map.json | |
The dataset should satisfy the following conditions:
1. The dataset must contain a column named **"messages"**.
2. Each row in the "messages" column should be compatible with the chat template of the base model.
For example, [`tokenizer_config.json`](https://huggingface.co/mistralai/Mistral-7B-Instruct-v0.2/blob/41b61a33a2483885c981aa79e0df6b32407ed873/tokenizer_config.json#L42)
of `mistralai/Mistral-7B-Instruct-v0.2` is a template that repeats the messages of a user and an assistant.
Concretely, each row in the "messages" field should follow a format like: `[{"role": "user", "content": "The 1st user's message"}, {"role": "assistant", "content": "The 1st assistant's message"}]`.
In this case, `HuggingFaceH4/ultrachat_200k` is a dataset that is compatible with the chat template.
## Troubleshooting
### Inference Request Errors
Below is a table of common error codes you might encounter when making inference-related API requests.
| Code | Name | Cause | Suggested Solution |
| ----- | ----------------------- | ------------------------------------------------------------------------------------------------------------------------ | -------------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `400` | *Bad Request* | The request is malformed or missing required fields. | Check your request payload. Ensure it is valid JSON with all required fields. |
| `401` | *Unauthorized* | Missing or invalid API key. The request lacks proper authentication. | Include a valid Friendli token in the `Authorization` header. Verify the token is active and correct. |
| `403` | *Forbidden* | The API key is valid but does not have permission to access the endpoint. | Ensure your token has access rights to the endpoint. Use the correct team token or add the `X-Friendli-Team` header if needed. |
| `404` | *Not Found* | The specified endpoint or resource does not exist. This typically occurs when the `endpoint_id` or `team_id` is invalid. | Verify the `endpoint_id` and model name in your request. Ensure they match an existing, non-deleted deployment. Also check for typos in your endpoint ID or team ID. |
| `422` | *Unprocessable Entity* | The request is syntactically correct but semantically invalid (e.g. exceeding token limits, invalid parameter values). | Adjust your request (e.g. reduce `max_tokens`, correct parameter values) and try again. |
| `429` | *Too Many Requests* | You have exceeded rate limits for your plan. | Reduce request frequency or upgrade your plan for higher limits. Wait before retrying after a 429 error. |
| `500` | *Internal Server Error* | A server-side error occurred while processing the request. | Retry the request after a short delay. If the error persists, check endpoint health in the overview dashboard or contact FriendliAI support. |
#### Quick checklist before retrying
* Verify the endpoint URL, `endpoint_id`, and (if applicable) `X-Friendli-Team` header
* Include the `Authorization` header with a valid token
* Confirm the target deployment exists, is healthy, and is not deleted
* Validate request JSON and required fields; reduce `max_tokens` if needed
* Check rate limits; add retry with backoff when receiving `429` (see the sketch below)
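As a sketch of the last item, you can wrap your requests in a simple exponential backoff loop. The example below posts to the dedicated chat completions API shown elsewhere in this guide; the endpoint ID is a placeholder and the retry parameters are illustrative:
```python
import os
import time

import requests

URL = "https://api.friendli.ai/dedicated/v1/chat/completions"
HEADERS = {
    "Content-Type": "application/json",
    "Authorization": f"Bearer {os.environ['FRIENDLI_TOKEN']}",
}
payload = {
    "model": "YOUR_ENDPOINT_ID",
    "messages": [{"role": "user", "content": "What is the capital of France?"}],
    "max_tokens": 200,
}

# Retry on 429 (rate limit) with exponential backoff; raise on other HTTP errors.
for attempt in range(5):
    response = requests.post(URL, headers=HEADERS, json=payload)
    if response.status_code == 429:
        time.sleep(2 ** attempt)  # wait 1s, 2s, 4s, ... before retrying
        continue
    response.raise_for_status()
    print(response.json()["choices"][0]["message"]["content"])
    break
```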
### Model Selection Errors
* The artifact might be nonexistent, or hidden so that you cannot access it.
* The repository is gated. Please follow the steps and gain approval from the owner on the Hugging Face Hub.
* The model does not meet the requirements. Please check if the model follows a correct safetensors format. See the [format requirements](#format-requirements) for details.
* The model architecture is not supported. Please refer to the [Supported Models](https://friendli.ai/models/search?products=DEDICATED) page.
This page may not cover all cases. If your issue persists, please contact support.
# Introducing Friendli Dedicated Endpoints
Source: https://friendli.ai/docs/guides/dedicated_endpoints/introduction
Friendli Dedicated Endpoints let you deploy and run generative AI models, custom or open source, on dedicated GPU hardware.
Friendli Dedicated Endpoints let you run custom or open-source generative AI models on dedicated GPU hardware, without sharing resources or managing infrastructure.
## What are Friendli Dedicated Endpoints?
* **Powered by the Friendli Engine**: Serve models effortlessly with the Friendli Engine, our patented GPU-optimized serving technology. Friendli Dedicated Endpoints automatically orchestrate resources for high-performance inference.
* **Bring Your Own Model**: Run your own model or choose any available model from [Hugging Face](https://huggingface.co) and [Weights & Biases](https://wandb.ai).
* **Dedicated Resources**: Select the GPU type for your workload. Each instance is fully dedicated to your model.
* **Reliable at Scale**: Trusted by leading companies, Friendli Dedicated Endpoints deliver robust performance for production workloads.
* **Per-second Billing**: Pay only for the time your model runs. No manual optimization required; Friendli handles efficiency for you.
## Getting Started:
1. **Sign Up**: Create a Friendli Suite account with free credits.
2. **Choose Your Model**: Upload your own or choose one from Hugging Face and Weights & Biases.
3. **Launch an Instance**: Select the perfect GPU for your model.
4. **Get Your Endpoint Address**: Use it to send requests to your model.
5. **Send Your Input**: Prompt your model and receive responses.
Friendli Dedicated Endpoints is more than just an AI serving platform; it provides a reliable, high-performance, and cost-efficient way to run your own models.
Explore more in our documentation:
* [Models](/guides/dedicated_endpoints/models)
* [Endpoints](/guides/dedicated_endpoints/endpoints)
* [Quickstart](/guides/dedicated_endpoints/quickstart)
## Additional Resources:
* FriendliAI website: [https://friendli.ai](https://friendli.ai)
* FriendliAI blog: [https://friendli.ai/blog](https://friendli.ai/blog)
# Serving LoRA Models
Source: https://friendli.ai/docs/guides/dedicated_endpoints/lora-models
Learn how to deploy LoRA models from Hugging Face Hub to Friendli Dedicated Endpoints for efficient inference, including a quick guide for FLUX LoRA models.
This document explains how to deploy LoRA models available on Hugging Face to Friendli Dedicated Endpoints.
Friendli Dedicated Endpoints support deploying LoRA adapters for both text generation and FLUX models.
## FLUX LoRA Quick Deployment Guide
This tutorial demonstrates how to deploy the FLUX LoRA model [multimodalart/flux-tarot-v1](https://huggingface.co/multimodalart/flux-tarot-v1), which is trained to generate images in the style of Rider-Waite Tarot cards.
Friendli offers a convenient one-click deployment feature, Deploy-Model, that streamlines the process of serving LoRA adapters from the Hugging Face Hub on Dedicated Endpoints. To deploy a specific model, simply use a URL in the format `https://friendli.ai/deploy-model/{hf-model-id}`.
For example, to deploy the FLUX LoRA model mentioned above, use [this link](https://friendli.ai/deploy-model/multimodalart/flux-tarot-v1). This will launch the deployment workflow, allowing you to quickly serve and experiment with the model on Friendli.
Clicking the link above will display a screen like the one shown. Click the 'Deploy now' button here to deploy the LoRA model to Friendli Dedicated Endpoints.
Once the deployment is complete, a screen like the one below will appear. Click the 'Go to Suite' button to navigate to the playground where you can use the LoRA model.
Original generated image vs. LoRA generated image.
## Advanced: Deploying LoRA Models with Custom Settings
While the quick deployment method described above is convenient, you can also deploy LoRA endpoints with custom settings. This allows you to specify the GPU instance type, endpoint name, scaling options, and more.
Log in to your Friendli Suite account and navigate to the Friendli Dedicated Endpoints dashboard.
If not done already, start the free trial for Dedicated Endpoints.
Create a new project, then click on the 'New Endpoint' button.
You'll see a screen like the one below. Enter an Endpoint Name, for example, "My New LoRA Endpoint".
Friendli Suite currently supports LoRA adapters trained within the Suite and those available on the Hugging Face Hub.
Since this tutorial doesn't cover fine-tuning, we'll focus on deploying LoRA adapters from the Hugging Face Hub.
First, in the Base Model section, select "Hugging Face" and choose the base model for the LoRA adapter you want to deploy.
There are several ways to find the base model of a LoRA adapter. The most common method is to check the model tree on the Hugging Face model page.
In this example, we'll deploy the `predibase/tldr_content_gen` adapter.
On the [Hugging Face model page](https://huggingface.co/predibase/tldr_content_gen) for this adapter, you can find the Model tree on the right side. This shows the base model used.
In this case, the adapter is based on the `mistralai/Mistral-7B-v0.1` model.
Enter the identified base model name into the model input field on the Endpoint Create page.
Now it's time to select the LoRA adapter.
Once the base model is selected, the 'Add LoRA adapter' button will become active. Click it to open the modal window for adding LoRA adapters.
In this modal, you can choose between "Project adapters" (adapters fine-tuned within Friendli Suite) and "Hugging Face adapters".
Select "Hugging Face adapters" and enter the Hugging Face Model ID of the adapter. For this tutorial, it's `predibase/tldr_content_gen`.
After adding the adapter, your screen should look like this. Now, select the instance type, configure the autoscaling options appropriately, and click the 'Create' button.
For details on other options, please refer to the [Deploy with Hugging Face Models](/guides/dedicated_endpoints/deploy_with_huggingface#step-1%3A-create-a-new-endpoint) documentation.
Once the endpoint is deployed, you'll see a screen like this. Navigate to the Playground page to quickly compare the adapter model and the base model.
In the Playground, use the highlighted dropdown menu to switch between the adapter model and the base model for experimentation and comparison.
That's it! You have successfully deployed a LoRA adapter on Friendli Dedicated Endpoints and experimented with it in the Playground.
Now you can explore deploying multiple adapters on a single endpoint (Multi-LoRA Endpoints) or use the API to send requests to the model and integrate it into your applications.
# Models
Source: https://friendli.ai/docs/guides/dedicated_endpoints/models
Within your Friendli Dedicated Endpoints projects you can prepare and manage the models that you wish to deploy. You may upload your models within your project to deploy them directly on your endpoints. Alternatively, you may manage them on the Hugging Face repository or Weights & Biases artifacts, as our endpoints can load models from your project, Hugging Face repositories, and Weights & Biases artifacts.
### Within your project, you can prepare and manage the models that you wish to deploy.
You may upload your models within your project to deploy them directly on your endpoints. Alternatively, you may manage them on the Hugging Face repository or Weights & Biases artifacts, as our endpoints can load models from your project, Hugging Face repositories, and Weights & Biases artifacts.
* At the moment, we support loading models from your uploaded model, Hugging Face repositories, and Weights & Biases artifacts.
Deploy models from public or private Hugging Face repositories.
Load models as Weights & Biases artifacts for easy versioning.
Use LoRA-adapted models for efficient deployment.
# Multi-LoRA Serving
Source: https://friendli.ai/docs/guides/dedicated_endpoints/multi-lora-serving
Multi-LoRA Serving is a feature that allows you to serve multiple LoRA models on an endpoint.
## Serving Multi-LoRA Models
You can serve Multi-LoRA models using Friendli Dedicated Endpoints.
For an overview of Multi-LoRA models, refer to our [document on serving Multi-LoRA models with Friendli Container](/guides/container/serving_multi_lora_models).
In Friendli Dedicated Endpoints, Multi-LoRA models are supported only on the Enterprise plan. For pricing and availability, [contact sales](https://friendli.ai/contact).
# Online Quantization
Source: https://friendli.ai/docs/guides/dedicated_endpoints/online-quantization
Online Quantization is a feature that automatically quantizes your model at runtime, with no offline preparation required.
### Online Quantization
Skip the hassle of preparing a quantized model. By enabling Online Quantization, your model will be automatically quantized to the target precision at runtime using Friendli's proprietary method, preserving quality while improving speed and cost-efficiency. We currently support two precision levels, 4BIT and 8BIT.\
This allows you to select lower-VRAM GPU instances without performance loss.
Some models (e.g., those already quantized) may not be compatible with Online Quantization.\
Not all models support all target precisions. Some may only support 8BIT.\
In certain cases, specific GPU instance types may not be available when this option is enabled.
# Plans and Pricing
Source: https://friendli.ai/docs/guides/dedicated_endpoints/pricing
Friendli Dedicated Endpoints pricing detail page.
Friendli Dedicated Endpoints offer pricing with flexible monthly billing based on actual usage.
### Supported Instance Types
Pricing is based on the instance type selected for the endpoint. The following instance types are supported for endpoints:
| GPU Type | Basic | Enterprise |
| --------- | ------------ | ------------- |
| B200 | \$8.9 / hour | Contact sales |
| H200 | \$4.5 / hour | Contact sales |
| H100 | \$3.9 / hour | Contact sales |
| A100 80GB | \$2.9 / hour | Contact sales |
Contact sales for a discounted custom pricing plan for your enterprise.
For more information on pricing and feature comparisons between Basic and Enterprise plans, please visit our [pricing page](https://friendli.ai/pricing/dedicated-endpoints).
# QuickStart: Friendli Dedicated Endpoints
Source: https://friendli.ai/docs/guides/dedicated_endpoints/quickstart
Learn how to get started with Friendli Dedicated Endpoints in this step-by-step guide. Create an account, select your project, choose a model you wish to serve, deploy your endpoint, and seamlessly generate text, code, and more with ease.
## 1. Log In or Sign Up
* If you have an account, log in using your preferred SSO or email/password combination.
* If you're new to FriendliAI, create an account for free.
## 2. Access Friendli Dedicated Endpoints
* On your left sidebar, find the "Dedicated Endpoints" option.
* Click the option to access the endpoint list page.
## 3. Prepare Your Model
* Choose a model that you wish to serve from Hugging Face, Weights & Biases, or upload your custom model on our cloud.
## 4. Deploy Your Endpoint
* Deploy your endpoint, using the model of your choice prepared from step 3, and the instance equipped with your desired GPU specification.
* You can also configure your replicas and the max-batch-size for your endpoint.
## 5. Generate Responses
* You can generate your responses in two ways: playground and endpoint URL.
* Try out and test generating responses from your custom model using a ChatGPT-like interface in the playground tab.
* For general usage, send queries to your model through our [API](/openapi) at the given endpoint address, accessible on the endpoint information tab.
### Generating Responses Through the Endpoint URL
Refer to [this guide](/guides/suite/personal_access_tokens) for general instructions on Friendli Token.
```python OpenAI Python SDK
import os
from openai import OpenAI
client = OpenAI(
api_key=os.getenv("FRIENDLI_TOKEN"),
base_url="https://api.friendli.ai/dedicated/v1",
)
chat_completion = client.chat.completions.create(
model="YOUR_ENDPOINT_ID",
messages=[
{
"role": "user",
"content": "Tell me how to make a delicious pancake"
}
]
)
print(chat_completion.choices[0].message.content)
```
```python Friendli Python SDK
import os
from friendli import SyncFriendli
client = SyncFriendli(
token=os.getenv("FRIENDLI_TOKEN"),
)
chat_completion = client.dedicated.chat.complete(
model="YOUR_ENDPOINT_ID",
messages=[
{
"role": "user",
"content": "Tell me how to make a delicious pancake"
}
]
)
print(chat_completion.choices[0].message.content)
```
```sh curl
curl -X POST https://api.friendli.ai/dedicated/v1/chat/completions \
-H "Content-Type: application/json" \
-H "X-Friendli-Team: $TEAM_ID" \
-H "Authorization: Bearer $FRIENDLI_TOKEN" \
-d '{
"model": "YOUR_ENDPOINT_ID",
"messages": [
{
"role": "user",
"content": "Python is a popular"
}
]
}'
```
For a more detailed tutorial for your usage, please refer to our tutorial for using [Hugging Face models](/guides/dedicated_endpoints/deploy_with_huggingface) and [W\&B models](/guides/dedicated_endpoints/deploy_with_wandb).
# Speculative Decoding
Source: https://friendli.ai/docs/guides/dedicated_endpoints/speculative-decoding
Speculative Decoding is a feature that speeds up generation by predicting future tokens in advance.
### Speculative Decoding
#### N-gram speculative decoding
You may toggle the switch to activate N-gram speculative decoding. When enabled, past tokens are leveraged to pre-generate future tokens. For predictable tasks, this can deliver substantial performance gains.
You can also set the `Maximum N-gram Size`, which defines how many tokens are predicted in advance. We recommend keeping the default value of 3.
Higher values can further reduce latency when successful. However, predicting too many tokens at once may lower prediction efficiency and, in extreme cases, even increase latency.
#### Advanced speculative decoding (Coming soon)
We'll soon be releasing an advanced speculative decoding feature that requires training but, in most cases, delivers better performance than the N-gram method. If you're interested, please contact us for more details.
# Versioning
Source: https://friendli.ai/docs/guides/dedicated_endpoints/versioning
Learn how to use the endpoint versions feature to manage model deployment history.
## Rollout and Rollback Endpoints Without Downtime
The versioning feature in Friendli Dedicated Endpoints helps you manage all changes to your deployed endpoints safely and transparently. When you update the configuration (such as the model, engine settings, or autoscaling), a new version is created instead of replacing the current one.
Each version captures a full snapshot of the deployment, including:
* Model name and artifact source
* Accelerator type and count
* Autoscaling and engine settings
* Metadata (creator, timestamps, comments)
## Why Use Versioning?
* **Zero-Downtime Updates**: Safely apply changes while the current version continues to serve traffic.
* **One-Click Rollbacks**: Instantly revert to a previous stable configuration if issues occur.
* **Easy-to-Follow History**: Each version shows who made the change, when it was made, and what was changed. This makes audits and debugging easier.
## How to Use Versioning
1. **Initial Deployment**: Deploy your model for the first time via the platform or webhook. This creates version `v0`.
2. **Apply Configuration Updates**: Changing any setting, such as the model, accelerator type, or autoscaling, triggers a new version (`v1`, `v2`, etc.).
3. **Browse Version History**: View the full version list by clicking on the 'Versions' tab on the endpoint detail page. You'll see which version is current or in progress.
4. **View Configuration Details**: Click 'View configs' to see a version's full settings. You can see the updates from the previous version marked with a blue badge for easy comparison.
## How to Rollback to a Previous Version
To rollback, select a previous version from the version history and click 'Rollback'.
The system creates a new version (`vN+1`) using the selected version's settings. This new version will become the current one, allowing you to quickly revert to a known good state.
### When an Update Fails
Update failures can occur due to various reasons, such as:
* **Configuration Errors**: Invalid settings or unsupported configurations can prevent the update.
* **Resource Limitations**: Insufficient resources (like GPU availability) can block the update.
* **Network Issues**: Temporary network problems can interrupt the update process.
When you attempt to update an endpoint and the process fails, the system will not automatically apply the changes.
Instead, it will log the error and allow you to troubleshoot the issue without affecting the live endpoint. This ensures that your endpoint remains operational without disruption.
# Multi-modality
Source: https://friendli.ai/docs/guides/multi-modality
Use Friendli to handle text, image, audio, and video modalities.
Friendli supports multimodal workflows across text, image, audio, and video.
Use the comprehensive guides below to get started with each modality.
## Quick Navigation
* [Image Generation](#image-generation) - Generate images from text prompts
* [Vision (Image Understanding)](#vision-image-understanding) - Analyze and understand images
* [Video Understanding](#video-understanding) - Process and analyze video content
* [Audio and Speech](#audio-and-speech) - Convert audio to text and analyze audio
### Image Generation
Transform text prompts into high-quality visuals with Friendli's image generation capabilities.
#### Representative Models
We support various trending image generation models including:
* [FLUX.1-dev](https://friendli.ai/models/search?baseModel=black-forest-labs/FLUX.1-dev)
* [FLUX.1-schnell](https://friendli.ai/models/search?baseModel=black-forest-labs/FLUX.1-schnell)
* [See all image generation models](https://friendli.ai/models/search?input=TEXT\&output=IMAGE)
#### API Usage
```bash curl
curl -L -X POST "https://api.friendli.ai/dedicated/v1/images/generations" \
-H "Content-Type: application/json" \
-H "Authorization: Bearer $FRIENDLI_TOKEN" \
--data-raw '{
"model": "YOUR_ENDPOINT_ID",
"prompt": "An orange Lamborghini driving down a hill road at night with a beautiful ocean view in the background.",
"num_inference_steps": 10,
"guidance_scale": 3.5
}'
```
```python Friendli Python SDK
import os
from friendli import SyncFriendli
with SyncFriendli(
token=os.environ.get("FRIENDLI_TOKEN"),
) as friendli:
images = friendli.dedicated.images.generate(
model="YOUR_ENDPOINT_ID",
prompt="An orange Lamborghini driving down a hill road at night with a beautiful ocean view in the background.",
num_inference_steps=10,
guidance_scale=3.5
)
print(images.data[0].url)
```
```python OpenAI Python SDK
import os
from openai import OpenAI
client = OpenAI(
base_url="https://api.friendli.ai/dedicated/v1",
api_key=os.environ.get("FRIENDLI_TOKEN"),
)
images = client.images.generate(
model="YOUR_ENDPOINT_ID",
prompt="An orange Lamborghini driving down a hill road at night with a beautiful ocean view in the background.",
extra_body={
"num_inference_steps": 10,
"guidance_scale": 3.5
}
)
print(images.data[0].url)
```
`guidance_scale` is required when using Friendli Container. For more detail, please refer to the [Container API Reference](/openapi/container/image-generations).
### Vision (Image Understanding)
Analyze and understand images using Friendli's vision capabilities.
#### Representative Models
We support various trending vision models including:
* [Qwen2.5-VL](https://friendli.ai/models/search?input=IMAGE\&output=TEXT)
* [InternVL3](https://friendli.ai/models/search?input=IMAGE\&output=TEXT)
* [See all vision models](https://friendli.ai/models/search?input=IMAGE\&output=TEXT)
#### Supported Image Formats
We support the image formats handled by the PIL library:
* JPEG (.jpeg and .jpg)
* PNG (.png)
* AVIF (.avif)
#### API Usage
```python URL-based image
import os
from openai import OpenAI
client = OpenAI(
base_url="https://api.friendli.ai/dedicated/v1",
api_key=os.environ.get("FRIENDLI_TOKEN"),
)
image_url = "https://upload.wikimedia.org/wikipedia/commons/9/9e/Ours_brun_parcanimalierpyrenees_1.jpg"
completion = client.chat.completions.create(
model="YOUR_ENDPOINT_ID",
messages=[
{
"role": "user",
"content": [
{
"type": "text",
"text": "What kind of animal is shown in the image?",
},
{"type": "image_url", "image_url": {"url": image_url}},
],
},
],
stream=False
)
print(completion.choices[0].message.content)
```
```python Base64-encoded image
import base64, requests, os
from openai import OpenAI
client = OpenAI(
base_url="https://api.friendli.ai/dedicated/v1",
api_key=os.environ.get("FRIENDLI_TOKEN"),
)
image_url = "https://upload.wikimedia.org/wikipedia/commons/9/9e/Ours_brun_parcanimalierpyrenees_1.jpg"
image_media_type = "image/jpeg"
image_base64 = base64.standard_b64encode(requests.get(image_url).content).decode(
"utf-8"
)
completion = client.chat.completions.create(
model="YOUR_ENDPOINT_ID",
messages=[
{
"role": "user",
"content": [
{
"type": "text",
"text": "What kind of animal is shown in the image?",
},
{
"type": "image_url",
"image_url": {
"url": f"data:{image_media_type};base64,{image_base64}"
},
},
],
},
],
)
print(completion.choices[0].message.content)
```
### Video Understanding
Process and analyze video content with Friendli's video understanding capabilities.
#### Representative Models
We support various video understanding models including:
* [Qwen2.5-VL](https://friendli.ai/models/search?input=VIDEO\&output=TEXT)
* [See all video models](https://friendli.ai/models/search?input=VIDEO\&output=TEXT)
#### Video Requirements
* Videos must be hosted at publicly accessible URLs
* HTTPS URLs are recommended for security
* Consider video file size and processing time implications
* Some models may have specific resolution or duration requirements
#### API Usage
By default, the video fetching timeout is 30 seconds. To increase the timeout value, please contact us.
```python Single Video Input
import os
from openai import OpenAI
client = OpenAI(
base_url="https://api.friendli.ai/dedicated/v1",
api_key=os.environ.get("FRIENDLI_TOKEN"),
)
video_url = "http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/ForBiggerFun.mp4"
completion = client.chat.completions.create(
model="YOUR_ENDPOINT_ID",
messages=[
{
"role": "user",
"content": [
{
"type": "text",
"text": "What's in this video?",
},
{
"type": "video_url",
"video_url": {"url": video_url},
},
],
},
],
temperature=0,
max_tokens=100,
)
print(completion.choices[0].message.content)
```
```python Multi-Video Input
import os
from openai import OpenAI
client = OpenAI(
base_url="https://api.friendli.ai/dedicated/v1",
api_key=os.environ.get("FRIENDLI_TOKEN"),
)
video_url_1 = "http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/ForBiggerFun.mp4"
video_url_2 = "http://commondatastorage.googleapis.com/gtv-videos-bucket/sample/BigBuckBunny.mp4"
completion = client.chat.completions.create(
model="YOUR_ENDPOINT_ID",
messages=[
{
"role": "user",
"content": [
{
"type": "text",
"text": "Describe the characters in each video concisely.",
},
{
"type": "video_url",
"video_url": {"url": video_url_2},
},
{
"type": "video_url",
"video_url": {"url": video_url_1},
},
],
},
],
temperature=0,
max_tokens=100,
)
print(completion.choices[0].message.content)
```
### Audio and Speech
Convert audio files to text and perform various AI tasks with Friendli's audio capabilities.
#### Representative Models
We support various trending audio models including:
* [Whisper Large V3](https://friendli.ai/models/search?input=AUDIO\&output=TEXT)
* [Qwen2-Audio](https://friendli.ai/models/search?input=AUDIO\&output=TEXT)
* [Ultravox](https://friendli.ai/models/search?input=AUDIO\&output=TEXT)
* [See all audio models](https://friendli.ai/models/search?input=AUDIO\&output=TEXT)
#### Supported Audio Formats
Our platform supports a wide range of audio formats compatible with the **librosa library**:
* **MP3** (.mp3)
* **WAV** (.wav)
* **FLAC** (.flac)
* **OGG** (.ogg)
* And many other standard audio formats
#### API Usage
By default, audio input is limited to 30 seconds. To enable longer audio inputs, please contact us.
```bash curl
curl -X POST https://api.friendli.ai/dedicated/v1/audio/transcriptions \
-H "Authorization: Bearer $FRIENDLI_TOKEN" \
-H 'Content-Type: multipart/form-data' \
-F file=@/path/to/audio/file.mp3 \
-F model="YOUR_ENDPOINT_ID"
```
```python Friendli Python SDK
from friendli import SyncFriendli
import os
with SyncFriendli(
token=os.getenv("FRIENDLI_TOKEN"),
) as friendli:
audio_file = open("/path/to/file/audio.mp3", "rb")
transcription = friendli.dedicated.audio.transcriptions.create(
model="YOUR_ENDPOINT_ID",
file=audio_file
)
print(transcription.text)
```
```python OpenAI Python SDK
from openai import OpenAI
import os
client = OpenAI(
base_url="https://api.friendli.ai/dedicated/v1",
api_key=os.getenv("FRIENDLI_TOKEN"),
)
audio_file = open("/path/to/file/audio.mp3", "rb")
transcription = client.audio.transcriptions.create(
model="YOUR_ENDPOINT_ID",
file=audio_file
)
print(transcription.text)
```
### API References
For detailed API specifications, refer to:
* [Image Generation API Reference](/openapi/dedicated/inference/image-generations)
* [Image/Video/Audio Understanding API Reference](/openapi/dedicated/inference/chat-completions)
* [Audio Transcriptions API Reference](/openapi/dedicated/inference/audio-transcriptions)
# OpenAI Compatibility
Source: https://friendli.ai/docs/guides/openai-compatibility
Friendli Engine is compatible with the OpenAI API standard through the Python API Libraries and the Node API Libraries.
Friendli Dedicated Endpoints, Serverless Endpoints, and Container are [OpenAI-compatible](/openapi/introduction).
Existing applications can migrate with minimal effort, still using the official OpenAI SDKs.
### Specify the base URL and API key
Initialize the OpenAI client using Friendli's base URL and your Friendli token (API key).
* **Serverless Endpoints**: `https://api.friendli.ai/serverless/v1`.
* **Dedicated Endpoints**: `https://api.friendli.ai/dedicated/v1`.
* **Container**: your own container's URL (e.g., `http://HOST:PORT/v1`).
Get your Friendli token in [Friendli Suite > Settings > Tokens](https://friendli.ai/suite/setting/tokens).
```python Python
import os
from openai import OpenAI

client = OpenAI(
    api_key=os.getenv("FRIENDLI_TOKEN"),
    base_url="https://api.friendli.ai/serverless/v1",
)
```
```javascript Node.js
import OpenAI from "openai";

const client = new OpenAI({
  apiKey: process.env.FRIENDLI_TOKEN,
  baseURL: "https://api.friendli.ai/serverless/v1",
});
```
## Usage
Choose any model available on Friendli Serverless Endpoints, Dedicated Endpoints, or Container.
#### Completions API
Generate text completions using a simple prompt-based approach.
```python Python
from openai import OpenAI
import os
client = OpenAI(
api_key=os.getenv("FRIENDLI_TOKEN"),
base_url="https://api.friendli.ai/serverless/v1",
)
completion = client.completions.create(
model="meta-llama-3.3-70b-instruct",
prompt="Tell me a funny joke about programming.",
max_tokens=100,
temperature=0.7,
)
print(completion.choices[0].text)
```
```javascript Node.js
import OpenAI from "openai";
const client = new OpenAI({
apiKey: process.env.FRIENDLI_TOKEN,
baseURL: "https://api.friendli.ai/serverless/v1",
});
async function main() {
const completion = await client.completions.create({
model: "meta-llama-3.3-70b-instruct",
prompt: "Tell me a funny joke about programming.",
max_tokens: 100,
temperature: 0.7,
});
console.log(completion.choices[0].text);
}
main().catch(console.error);
```
#### Chat Completions API
Generate chat completions using a conversational message-based approach.
```python Python
from openai import OpenAI
import os
client = OpenAI(
api_key=os.getenv("FRIENDLI_TOKEN"),
base_url="https://api.friendli.ai/serverless/v1",
)
completion = client.chat.completions.create(
model="meta-llama-3.3-70b-instruct",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Tell me a funny joke."},
],
stream=False,
)
print(completion.choices[0].message.content)
```
```javascript Node.js
import OpenAI from "openai";
const client = new OpenAI({
apiKey: process.env.FRIENDLI_TOKEN,
baseURL: "https://api.friendli.ai/serverless/v1",
});
async function main() {
const completion = await client.chat.completions.create({
model: "meta-llama-3.3-70b-instruct",
messages: [
{ role: "system", content: "You are a helpful assistant." },
{ role: "user", content: "Tell me a funny joke." },
],
});
console.log(completion.choices[0].message.content);
}
main().catch(console.error);
```
#### Streaming Mode
Receive responses in real-time, enabling better user experience for long responses.
```python Python
from openai import OpenAI
import os
client = OpenAI(
api_key=os.getenv("FRIENDLI_TOKEN"),
base_url="https://api.friendli.ai/serverless/v1",
)
stream = client.chat.completions.create(
model="meta-llama-3.3-70b-instruct",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Tell me a funny joke."},
],
stream=True,
)
for chunk in stream:
print(chunk.choices[0].delta.content or "", end="", flush=True)
```
```javascript Node.js
import OpenAI from "openai";
const client = new OpenAI({
apiKey: process.env.FRIENDLI_TOKEN,
baseURL: "https://api.friendli.ai/serverless/v1",
});
async function main() {
const stream = await client.chat.completions.create({
model: "meta-llama-3.3-70b-instruct",
messages: [
{ role: "system", content: "You are a helpful assistant." },
{ role: "user", content: "Tell me a funny joke." },
],
stream: true,
});
for await (const chunk of stream) {
process.stdout.write(chunk.choices[0].delta?.content || "");
}
}
main().catch(console.error);
```
# Friendli Documentation
Source: https://friendli.ai/docs/guides/overview
Get started with FriendliAI products and explore APIs.
Let your team focus on building great AI products.
FriendliAI makes sure your AI runs fast, affordably, and reliably at scale.
### Start building
**For teams requiring production-scale AI without infra worries:**
} href="/guides/dedicated_endpoints/quickstart">
Reliable, high-performance inference with dedicated GPU resources.
Predictable, efficient scaling with full observability at scale.
**For teams seeking instant access to popular models:**
} href="/guides/serverless_endpoints/quickstart">
Instant API access to popular open-source models.
Fast, affordable inference with simple pay-as-you-go pricing.
**For teams prioritizing security and compliance:**
} href="/guides/container/quickstart">
On-premise, containerized solutions with data protection and governance controls.
Kubernetes-native, designed for enhanced privacy, security, and governance.
### Resources
Learn how to interact with Friendli products programmatically via the official Python SDK.
Learn how to use Friendli Suite, our all-in-one platform with a feature-rich web console.
Browse 440k+ models supported by Friendli.
API references for all endpoints.
Build AI agents with Friendli products.
Check technical insights from the Friendli team.
# Reasoning
Source: https://friendli.ai/docs/guides/reasoning
Friendli offers comprehensive, model-agnostic reasoning parsing. No need for custom parsers.
Friendli offers comprehensive, **model-agnostic reasoning parsing**. No need for custom parsers.
Leverage reasoning to build great AI products and let Friendli handle the complexity of reasoning.
## What is Reasoning?
Reasoning models are LLMs trained to "think" before answering, enhancing the precision of their answers.
This enables LLMs to excel in complex problem solving and multi-step planning for agentic workflows.
When a model performs reasoning, the reasoning content is included in its response.
#### What makes reasoning parsing tedious?
Different models handle reasoning in different ways.
Some models always generate reasoning, while others expose it as an optional feature.
The format also varies. The reasoning content may be wrapped in `<think>` tags or other model-specific tokens.
As a result, separating reasoning content from the response can be non-trivial.
## Reasoning Model Types
* **Always Reasoning Models**: Reasoning is enabled by default. (e.g., DeepSeek-R1)
* **Controllable Reasoning Models**: Reasoning can be toggled on or off. (e.g., Qwen3-32B)
#### Usage: Always Reasoning Models
```bash curl
curl -X POST https://api.friendli.ai/serverless/v1/chat/completions \
-H "Authorization: Bearer $FRIENDLI_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek-ai/DeepSeek-R1-0528",
"messages": [
{
"role": "user",
"content": "Does technology expand or limit human freedom?"
}
]
}'
```
```python Friendli Python SDK
# pip install friendli
import os
from friendli import SyncFriendli
client = SyncFriendli(token=os.getenv("FRIENDLI_TOKEN"))
completion = client.serverless.chat.complete(
model="deepseek-ai/DeepSeek-R1-0528",
messages=[
{
"role": "user",
"content": "Tell me how to make a delicious pancake"
}
]
)
print(completion.choices[0].message)
```
```python OpenAI Python SDK
import os
from openai import OpenAI
client = OpenAI(
base_url="https://api.friendli.ai/serverless/v1",
api_key=os.environ.get("FRIENDLI_TOKEN")
)
completion = client.chat.completions.create(
model="deepseek-ai/DeepSeek-R1-0528",
messages=[
{"role": "user", "content": "Does technology expand or limit human freedom?"}
]
)
print(completion.choices[0].message)
```
#### Usage: Controllable Reasoning Models
These models let you control reasoning via the `enable_thinking` parameter.
Setting it to `true` enables reasoning, while `false` returns empty `<think>` tags.
Important: Support for the `enable_thinking` parameter is model-specific, even among controllable reasoning models. Refer to the model card or release notes for details.
```bash {12-14} curl
curl -X POST https://api.friendli.ai/serverless/v1/chat/completions \
-H "Authorization: Bearer $FRIENDLI_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-32B",
"messages": [
{
"role": "user",
"content": "Does technology expand or limit human freedom?"
}
],
"chat_template_kwargs": {
"enable_thinking": true
}
}'
```
```python {16-18} Friendli Python SDK
# pip install friendli
import os
from friendli import SyncFriendli
client = SyncFriendli(token=os.getenv("FRIENDLI_TOKEN"))
completion = client.serverless.chat.complete(
model="Qwen/Qwen3-32B",
messages=[
{
"role": "user",
"content": "Tell me how to make a delicious pancake"
}
],
chat_template_kwargs={
"enable_thinking": True
}
)
print(completion.choices[0].message)
```
```python {14-18} OpenAI Python SDK
import os
from openai import OpenAI
client = OpenAI(
base_url="https://api.friendli.ai/serverless/v1",
api_key=os.environ.get("FRIENDLI_TOKEN")
)
completion = client.chat.completions.create(
model="Qwen/Qwen3-32B",
messages=[
{"role": "user", "content": "Does technology expand or limit human freedom?"}
],
extra_body={
"chat_template_kwargs": {
"enable_thinking": True
}
}
)
print(completion.choices[0].message)
```
## Reasoning Parsing with Friendli
Friendli deterministically separates reasoning content from the model response.
Enable parsing with the following two parameters in the Chat Completions API:
* `parse_reasoning` (boolean): Enables reasoning parsing.
* `include_reasoning` (boolean): Effective when reasoning parsing is enabled. Decides whether the parsed reasoning content is included in the response.
When using Dedicated Endpoints, you can set a default value for `parse_reasoning` at the endpoint level.
For the OpenAI SDK, place the parameters inside `extra_body`.
The reasoning content tokens are included in the token usage and billing, even when `include_reasoning` is `false`.
For more detailed information, please refer to the [Chat Completions API](/openapi/dedicated/inference/chat-completions) documentation.
#### Parse Reasoning: On vs Off
The following shows how responses differ when `parse_reasoning` is on vs off.
```json Parse On
{
"choices": [
{
"finish_reason": "stop",
"index": 0,
"logprobs": null,
"message": {
"content": "Hello! How can I assist you today? π",
"reasoning_content": "Okay, the user just said \"hello.\" I need to respond appropriately. Let's keep it simple and welcoming. Let's make sure there are no typos and the tone is warm.\n",
"role": "assistant"
}
}
],
// ...
}
```
```json Parse Off
{
"choices": [
{
"finish_reason": "stop",
"index": 0,
"logprobs": null,
"message": {
"content": "Okay, the user said \"hello.\" I need to respond appropriately.Let's keep it simple and welcoming. Let's make sure there are no typos and the tone is warm.\nHello! How can I assist you today? π",
"role": "assistant"
}
}
],
// ...
}
```
#### Response Schema
* `parse_reasoning = false`: Reasoning text remains inline in `choices[].message.content`.
* `parse_reasoning = true`:
* `include_reasoning = true`: Reasoning text moves to `choices[].message.reasoning_content`.
* `include_reasoning = false`: Reasoning text is removed from `choices[].message.content`.
#### Streaming Response Schema
`delta.reasoning_content` streams reasoning tokens. `delta.content` streams answer tokens.
When `parse_reasoning` is `true` and `stream` is `true`:
* If `include_reasoning` is `false`, no `delta.reasoning_content` is sent.
* If `include_reasoning` is `true`, both `delta.reasoning_content` and `delta.content` are sent.
```json
data: {
"choices": [
{ "index": 0, "delta": { "reasoning_content": "Let's break the problem down..." } }
]
}
data: {
"choices": [
{ "index": 0, "delta": { "content": "The result is 1554." } }
]
}
data: [DONE]
```
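As an illustrative sketch (not an excerpt from the API reference), the following shows one way to consume such a parsed stream with the OpenAI Python SDK. The model, prompt, and parameters are the same as in the examples on this page; the parsing parameters go through `extra_body`.

```python Streaming (OpenAI Python SDK)
import os

from openai import OpenAI

client = OpenAI(
    base_url="https://api.friendli.ai/serverless/v1",
    api_key=os.environ.get("FRIENDLI_TOKEN"),
)

stream = client.chat.completions.create(
    model="Qwen/Qwen3-32B",
    messages=[{"role": "user", "content": "Solve 37 * 42."}],
    stream=True,
    extra_body={
        "chat_template_kwargs": {"enable_thinking": True},
        "parse_reasoning": True,
        "include_reasoning": True,
    },
)

for chunk in stream:
    delta = chunk.choices[0].delta
    # Reasoning tokens arrive in `delta.reasoning_content`; answer tokens in `delta.content`.
    reasoning = getattr(delta, "reasoning_content", None)
    if reasoning:
        print(reasoning, end="", flush=True)
    if delta.content:
        print(delta.content, end="", flush=True)
```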
## Examples
#### Usage: Always Reasoning Models
```bash curl
curl -X POST https://api.friendli.ai/serverless/v1/chat/completions \
-H "Authorization: Bearer $FRIENDLI_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"model": "deepseek-ai/DeepSeek-R1-0528",
"messages": [
{ "role": "user", "content": "Explain why the sky is blue." }
],
"parse_reasoning": true
}'
```
```python Friendli Python SDK
import os
from friendli import SyncFriendli
client = SyncFriendli(token=os.getenv("FRIENDLI_TOKEN"))
completion = client.serverless.chat.complete(
model="deepseek-ai/DeepSeek-R1-0528",
messages=[
{"role": "user", "content": "Explain why the sky is blue."}
],
parse_reasoning=True,
)
print(completion.choices[0].message)
```
```python OpenAI Python SDK
import os
from openai import OpenAI
client = OpenAI(
base_url="https://api.friendli.ai/serverless/v1",
api_key=os.environ.get("FRIENDLI_TOKEN")
)
completion = client.chat.completions.create(
model="deepseek-ai/DeepSeek-R1-0528",
messages=[
{"role": "user", "content": "Explain why the sky is blue."}
],
extra_body={
"parse_reasoning": True
}
)
print(completion.choices[0].message)
```
#### Usage: Controllable Reasoning Models
```bash curl
curl -X POST https://api.friendli.ai/serverless/v1/chat/completions \
-H "Authorization: Bearer $FRIENDLI_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"model": "Qwen/Qwen3-32B",
"messages": [
{ "role": "user", "content": "Solve 37 * 42." }
],
"chat_template_kwargs": { "enable_thinking": true },
"parse_reasoning": true,
"include_reasoning": true
}'
```
```python Friendli Python SDK
import os
from friendli import SyncFriendli
client = SyncFriendli(token=os.getenv("FRIENDLI_TOKEN"))
completion = client.serverless.chat.complete(
model="Qwen/Qwen3-32B",
messages=[
{"role": "user", "content": "Solve 37 * 42."}
],
chat_template_kwargs={"enable_thinking": True},
parse_reasoning=True,
include_reasoning=True,
)
print(completion.choices[0].message)
```
```python OpenAI Python SDK
import os
from openai import OpenAI
client = OpenAI(
base_url="https://api.friendli.ai/serverless/v1",
api_key=os.environ.get("FRIENDLI_TOKEN")
)
completion = client.chat.completions.create(
model="Qwen/Qwen3-32B",
messages=[
{"role": "user", "content": "Solve 37 * 42."}
],
extra_body={
"chat_template_kwargs": {"enable_thinking": True},
"parse_reasoning": True,
"include_reasoning": True,
}
)
print(completion.choices[0].message)
```
# Integrations
Source: https://friendli.ai/docs/guides/serverless_endpoints/integrations
Friendli integrates with LangChain, LiteLLM, LlamaIndex, and MongoDB to streamline GenAI application deployment. LangChain and LlamaIndex enable tool calling AI agents and Retrieval-Augmented Generation (RAG), while MongoDB provides memory via vector databases, and LiteLLM boosts performance through load balancing.
[Friendli](/guides/overview) integrates with LangChain, LiteLLM, LlamaIndex, and MongoDB
to streamline the deployment of compound GenAI applications.
The integration of LangChain and LlamaIndex facilitates
tool calling AI agents or Retrieval-Augmented Generation (RAG).
MongoDB supports these agentic systems by providing memory with vector databases,
while LiteLLM enhances performance through load balancing and evaluation.
Get a quick overview of [Friendli Serverless Endpoints'](/guides/serverless_endpoints/introduction) integrations
and learn more through the linked resources.
## LangChain
[LangChain](https://python.langchain.com/v0.2/docs/introduction) is a framework for developing applications powered by large language models (LLMs).
Utilize [Friendli Serverless Endpoints](/guides/serverless_endpoints/quickstart) for LLM inferencing in LangChain by preparing a [Friendli Token](/guides/suite/personal_access_tokens).
To install the required packages, run:
```
pip install -qU langchain-openai langchain
```
Here's a streaming chat sample code to get started with LangChain and FriendliAI:
```python
import os
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(
model="meta-llama-3.3-70b-instruct",
base_url="https://api.friendli.ai/serverless/v1",
api_key=os.environ["FRIENDLI_TOKEN"],
)
result = llm.invoke("Tell me a joke.")
print(result.content)
```
Output:
```
Here's one:
Why couldn't the bicycle stand up by itself?
(Wait for it...)
Because it was two-tired!
Hope that brought a smile to your face!
```
#### Resources
* [FriendliAI Blog Post on Building RAG Chatbots with Friendli, MongoDB Atlas, and LangChain](https://friendli.ai/blog/rag-chatbot-friendli-mongodb-atlas-langchain)
* [FriendliAI Blog Post on Example RAG Application with Friendli and LangChain](https://friendli.ai/blog/chatdocs-rag-friendli-langchain)
* [FriendliAI Blog Post on LangChain Integration with Friendli Dedicated Endpoints](https://friendli.ai/blog/langchain-integration-friendli-engine)
* [LangChain's Documentation on Friendli](https://python.langchain.com/v0.1/docs/integrations/llms/friendli)
## MongoDB
[MongoDB Atlas](https://www.mongodb.com/docs/atlas/getting-started) is a developer data platform offering vector stores and searches for compound GenAI applications,
compatible through both LangChain and LlamaIndex.
Utilize [Friendli Serverless Endpoints](/guides/serverless_endpoints/quickstart) for LLM inferencing in MongoDB by preparing a [Friendli Token](/guides/suite/personal_access_tokens).
To install the required packages, run:
```
pip install pymongo friendli-client langchain langchain-mongodb langchain-community pypdf langchain-openai tiktoken
```
Here's a RAG sample code to get started with MongoDB and FriendliAI using LangChain:
```python
# Note: You can find detailed explanation on this code in the blog post below.
from pymongo import MongoClient
from langchain_mongodb.vectorstores import MongoDBAtlasVectorSearch
from langchain_community.chat_models.friendli import ChatFriendli
from langchain_community.document_loaders import PyPDFLoader
from langchain_openai import OpenAIEmbeddings
from langchain_text_splitters import RecursiveCharacterTextSplitter
from langchain_core.output_parsers import StrOutputParser
from langchain_core.prompts import PromptTemplate
from langchain_core.runnables import RunnablePassthrough
# Fill in your Cluster URI here.
MONGODB_ATLAS_CLUSTER_URI = "{YOUR CLUSTER URI}"
client = MongoClient(MONGODB_ATLAS_CLUSTER_URI)
# Fill in your DB information here.
DB_NAME = "{YOUR DB NAME}"
COLLECTION_NAME = "{YOUR COLLECTION NAME}"
ATLAS_VECTOR_SEARCH_INDEX_NAME = "{YOUR INDEX NAME}"
MONGODB_COLLECTION = client[DB_NAME][COLLECTION_NAME]
# Fill in your PDF link here.
loader = PyPDFLoader("{YOUR PDF DOCUMENT LINK}")
data = loader.load()
text_splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=150)
docs = text_splitter.split_documents(data)
vector_store = MongoDBAtlasVectorSearch.from_documents(
documents=docs,
embedding=OpenAIEmbeddings(disallowed_special=()),
collection=MONGODB_COLLECTION,
index_name=ATLAS_VECTOR_SEARCH_INDEX_NAME,
)
retriever = vector_store.as_retriever()
llm = ChatFriendli(model="meta-llama-3.3-70b-instruct")
prompt = PromptTemplate.from_template(
"""
Use the following pieces of context to answer the question.
{context}
Question: {question}
Helpful Answer:
"""
)
def format_docs(docs):
return "\n\n".join(doc.page_content for doc in docs)
rag_chain = (
{"context": retriever | format_docs, "question": RunnablePassthrough()}
| prompt
| llm
| StrOutputParser()
)
# Input your user query here.
rag_chain.invoke("{Sample Query Texts}")
```
#### Resources
* [FriendliAI Blog Post on Building RAG Chatbots with Friendli, MongoDB Atlas, and LangChain](https://friendli.ai/blog/rag-chatbot-friendli-mongodb-atlas-langchain)
* [FriendliAI Blog Post on RAG with FriendliAI and MongoDB](https://friendli.ai/blog/rag-mongodb-friendli)
* [MongoDB's Partner Ecosystem Page on FriendliAI](https://cloud.mongodb.com/ecosystem/friendliai)
## LlamaIndex
[LlamaIndex](https://docs.llamaindex.ai/en/stable) is a data framework designed to connect LLMs to custom data sources.
Utilize [Friendli Serverless Endpoints](/guides/serverless_endpoints/quickstart) for LLM inferencing in LlamaIndex by preparing a [Friendli Token](/guides/suite/personal_access_tokens).
Additionally, an [OpenAI API key](https://platform.openai.com/docs/api-reference/authentication) is required to access the [OpenAI embedding API](https://platform.openai.com/docs/api-reference/embeddings).
To install the required packages, run:
```
pip install llama-index-llms-friendli llama-index
```
Here's a RAG streaming chat sample code to get started with LlamaIndex and FriendliAI:
```python
import os
from llama_index.llms.friendli import Friendli
from llama_index.core import Settings, SimpleDirectoryReader, VectorStoreIndex
os.environ['FRIENDLI_TOKEN'] = "YOUR_FRIENDLI_TOKEN"
Settings.llm = Friendli(model="meta-llama-3.3-70b-instruct")
# Assuming a directory named 'data_folder' stores your pdf file.
documents = SimpleDirectoryReader('data_folder').load_data()
index = VectorStoreIndex.from_documents(documents)
query_engine = index.as_query_engine(streaming=True)
# Input your user query here.
response = query_engine.query("{Sample Query Texts}")
response.print_response_stream()
```
#### Resources
* [FriendliAI Blog Post on Building RAG Applications with Friendli and LlamaIndex](https://friendli.ai/blog/llamaindex-rag-app-friendli-engine)
* [Google Colab Notebook on Two-Stage Retrieval with LlamaIndex Friendli Integration](https://colab.research.google.com/drive/1_-1aITFQh0UUbRzaRM8FRid_wZHrfIjX?usp=sharing)
* [LlamaIndex's Documentation on Friendli](https://docs.llamaindex.ai/en/stable/examples/llm/friendli)
## LiteLLM
[LiteLLM](https://docs.litellm.ai/docs) is a versatile platform offering access to 100+ LLMs in the [OpenAI API format](https://platform.openai.com/docs/api-reference/chat/create).
Utilize [Friendli Serverless Endpoints](/guides/serverless_endpoints/quickstart) for LLM inferencing in LiteLLM by preparing a [Friendli Token](/guides/suite/personal_access_tokens).
To install the required package, run:
```
pip install litellm
```
Here's a streaming chat sample code to get started with LiteLLM and FriendliAI:
```python
from litellm import completion
response = completion(
# Simply change the model ID to use different LLM inference models & engines.
model="friendliai/meta-llama-3.3-70b-instruct",
messages=[
{"role": "user", "content": "Hello from LiteLLM"}
],
stream=True,
)
for chunk in response:
print(chunk.choices[0].delta.content, end="", flush=True)
```
Output:
```
Hello from an AI! It's great to meet you, LiteLLM! How's your day going so far?
```
#### Resources
* [FriendliAI Blog Post on LiteLLM Friendli Integration using LiteLLM's Budget Manager](https://friendli.ai/blog/litellm-friendli-integration)
* [LiteLLM's Supported Models & Providers Documentation Page on FriendliAI](https://docs.litellm.ai/docs/providers/friendliai)
# Introducing Friendli Serverless Endpoints
Source: https://friendli.ai/docs/guides/serverless_endpoints/introduction
Guide for Friendli Serverless Endpoints, allowing you to seamlessly integrate state-of-the-art AI models into your workflows, regardless of your technical expertise.
{/* Welcome to the exciting world of generative AI, where words dance into text, code sparks creation, and images bloom from the imagination. FriendliAI makes this world readily accessible with Friendli Serverless Endpoints, a revolutionary service that puts the power of cutting-edge generative models right at your fingertips. */}
This tutorial will guide you through Friendli Serverless Endpoints, allowing you to seamlessly integrate state-of-the-art AI models into your workflows, regardless of your technical expertise. Whether you're a seasoned developer or a curious newcomer, get ready to unlock the limitless potential of generative AI!
## What are Friendli Serverless Endpoints?
Imagine there is a powerful racecar (a generative AI model) that requires a great deal of maintenance and tuning (infrastructure and technical know-how). Friendli Serverless Endpoints is like a rental service, taking care of the hassle so you can just drive! It provides a simple, serverless interface that connects you to Friendli Engine, a high-performance, cost-effective inference serving engine optimized for generative AI models. With Friendli Serverless Endpoints, you can:
* **Access popular open-source models**: Get started with pre-loaded models like Llama 3.1. No need to worry about downloading or optimizing them.
* **Build your own workflows**: Integrate these models into your applications with just a few lines of code. Generate creative text formats, code, musical pieces, email, letters, etc. and create stunning images with ease.
* **Pay per token (or time), not per GPU**: Unlike traditional solutions that require whole GPU instances, Friendli Serverless Endpoints bills you only for the resources your models actually use. This translates to significant cost savings and efficient resource utilization.
* **Focus on what matters**: Forget about infrastructure setup and GPU optimization. Friendli Serverless Endpoints handles the heavy lifting, freeing you to focus on your creative vision and application development.
## Getting Started with Friendli Serverless Endpoints:
1. **Sign up for a free account**: Visit [Friendli Suite](https://friendli.ai/suite) and create your Friendli Suite account.
2. **Choose your model**: Select the pre-loaded model you want to experiment with, such as Llama 3.1 for text generation.
3. **Connect to the endpoint**: Friendli Serverless Endpoints provides simple API documentation for a variety of programming languages. Follow the instructions to integrate the endpoint into your code.
4. **Send your input**: Supply the model with your input text, code, or image prompt.
5. **Witness the magic**: Friendli Serverless Endpoints will utilize Friendli Engine to process your input and generate the desired output, be it text, code, or an image. You can then integrate the generated results into your application or simply marvel at the AI's creativity!
## Beyond the Basics:
As you gain confidence, Friendli Serverless Endpoints offers even more:
* **Granular control**: Optimize resource usage at the per-token or per-step level for each model, ensuring efficient resource allocation for your specific needs.
{/* - **Customization**: Build your own custom generative models and seamlessly integrate them into your workflows using Friendli Serverless Endpoints. */}
* **Scalability**: As your needs grow, easily scale your resources without worrying about complex infrastructure management.
Friendli Serverless Endpoints is the perfect springboard for your generative AI journey. Whether you're an experienced developer seeking to integrate AI into your projects or a curious explorer yearning to unleash your creative potential, FriendliAI provides the tools and resources you need to succeed.
So, start your engines, take the wheel, and explore the vast possibilities of generative AI with Friendli Serverless Endpoints!
## Additional Resources:
* FriendliAI website: [https://friendli.ai](https://friendli.ai)
* FriendliAI blog: [https://friendli.ai/blog](https://friendli.ai/blog)
# Plans and Pricing
Source: https://friendli.ai/docs/guides/serverless_endpoints/pricing
Friendli Serverless Endpoints offer a range of models tailored to various tasks.
Friendli Serverless Endpoints offer a flexible, scalable inference solution powered by a wide range of models. You can unlock access to more models and features based on your **usage tier**.
**Important Update**: Effective June 20, 2025, we've introduced new billing options and plan changes:
* Models are now billed **Token-Based** or **Time-Based**, depending on the model.
* The Basic plan has been renamed to the **Starter plan**.
* Existing users can continue using their current serverless models without interruption.
## Usage Tiers
Usage tiers define your limits on usage and scale **monthly** based on your payment history.
| Tiers | Usage Limits | Rate Limit (RPM) | Output Token Length | Qualifications |
| ------ | ---------------- | ---------------- | --------------------------------------------- | --------------------------------------------------------- |
| Tier 1 | \$50 / month | 100 | 2K / 8K (if reasoning model) | Valid payment method added |
| Tier 2 | \$500 / month | 1,000 | 4K / 8K (if reasoning model) | Total historical spend of \$50+ |
| Tier 3 | \$5,000 / month | 5,000 | 8K / 16K (if reasoning model) | Total historical spend of \$500+ |
| Tier 4 | \$50,000 / month | 10,000 | 16K / 32K (if reasoning model) | Total historical spend of \$5,000+ |
| Tier 5 | Custom | Custom | Custom | Contact [support@friendli.ai](mailto:support@friendli.ai) |
**Qualifications** only apply to usage within the Serverless Endpoints plan.
'Output Token Length' is how much the model can write in response. It's different from 'Context Length', which is the sum of the input and output tokens.
## Billing Methods
Friendli Serverless Endpoints use two different billing methods, Token-Based or Time-Based, depending on the model type.
### Token-Based Billing
In a **token-based billing model**, charges are determined by the number of tokens processed, where each "token" represents an individual unit processed by the model.
| Model Code | Price per Token |
| --------------------------------- | ------------------------------------ |
| LGAI-EXAONE/EXAONE-4.0.1-32B | Input \$0.6 Β· Output \$1 / 1M tokens |
| meta-llama/Llama-3.3-70B-Instruct | \$0.6 / 1M tokens |
| meta-llama/Llama-3.1-8B-Instruct | \$0.1 / 1M tokens |
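As an illustrative calculation, a request to `meta-llama/Llama-3.3-70B-Instruct` that processes 10,000 tokens in total would cost 10,000 × \$0.6 / 1M = \$0.006.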
### Time-Based Billing
In a **time-based billing model**, charges are determined by the compute time required to run your inference request, measured in milliseconds.
Non-compute latencies, such as network delays or queueing time, are excluded, ensuring you are charged only for the actual model execution time.
A serverless endpoint model can be in either a **Warm** status, where it's ready to handle requests instantly, or a **Cold** status, where it is inactive and requires time to start up.
When a model in a cold status receives a request, it undergoes a "warm-up" process that typically takes 7-30 seconds, depending on the model's size.
During this period, requests will be queued, but this warm-up delay is not included in your billable compute time.
| Model Code | Price per Second |
| --------------------------------------------- | ---------------- |
| skt/A.X-4.0 | \$0.002 / second |
| skt/A.X-3.1 | \$0.002 / second |
| naver-hyperclovax/HyperCLOVAX-SEED-Think-14B | \$0.002 / second |
| deepseek-ai/DeepSeek-R1-0528 | \$0.004 / second |
| meta-llama/Llama-4-Maverick-17B-128E-Instruct | \$0.004 / second |
| meta-llama/Llama-4-Scout-17B-16E-Instruct | \$0.002 / second |
| Qwen/Qwen3-235B-A22B-Thinking-2507 | \$0.004 / second |
| Qwen/Qwen3-235B-A22B-Instruct-2507 | \$0.004 / second |
| Qwen/Qwen3-30B-A3B | \$0.002 / second |
| Qwen/Qwen3-32B | \$0.002 / second |
| google/gemma-3-27b-it | \$0.002 / second |
| mistralai/Mistral-Small-3.1-24B-Instruct-2503 | \$0.002 / second |
| mistralai/Devstral-Small-2505 | \$0.002 / second |
| mistralai/Magistral-Small-2506 | \$0.002 / second |
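As an illustrative calculation, a request served by `Qwen/Qwen3-32B` whose model execution takes 2.5 seconds of compute time would cost 2.5 × \$0.002 = \$0.005; warm-up and network time are not billed.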
## FAQs
Your usage tier, which determines your rate limits, increases monthly based on your payment history. Need a faster upgrade? Reach out anytime at [support@friendli.ai](mailto:support@friendli.ai); we're happy to help!
Popular models are available to all users, depending on the limits determined by their usage tiers.
You'll receive an alert when approaching your monthly cap. Please contact [support@friendli.ai](mailto:support@friendli.ai) to discuss options for increasing your monthly cap. We may help you (1) pay early to reset your monthly cap, or (2) upgrade your plan to increase your monthly cap and unlock more features.
For more questions, contact [support@friendli.ai](mailto:support@friendli.ai).
# QuickStart: Friendli Serverless Endpoints
Source: https://friendli.ai/docs/guides/serverless_endpoints/quickstart
Learn how to get started with Friendli Serverless Endpoints in this step-by-step guide. Create an account, choose from powerful AI models like Llama 3.1, and seamlessly generate text, code, and more with ease.
## 1. Log In or Sign Up
* If you have an account, log in using your preferred SSO or email/password combination.
* If you're new to FriendliAI, create an account for free.
## 2. Access Friendli Serverless Endpoints
* On your left sidebar, find the "Serverless Endpoints" option.
* Click the option to access the playground page.
## 3. Select a Model
* Browse available generative models. Choose the model that best aligns with your desired use case.
* Click on a model that supports Friendli Serverless Endpoints to directly select the endpoint.
* First-time users receive a free trial to explore Friendli Serverless Endpoints without any financial commitment.
## 4. Generate Responses
1. Enter Your Query:
* Type in your prompt or question.
2. Adjust Settings:
* Refer to the [Chat Completions API Reference](/openapi/serverless/chat-completions) for more details on the settings applicable for the text generation models.
3. Generate Your Response:
* Click the submit button to start the generation process.
* The model will process your query and produce the corresponding text output. That's it!
### Generating Responses Through the Endpoint URL
If you wish to send your requests through the endpoint URL, you can find the model ID by clicking the info button in the top-right corner of the page.
Refer to [this guide](/guides/suite/personal_access_tokens) for general instructions on the Friendli Token.
```python OpenAI Python SDK
import os
from openai import OpenAI
client = OpenAI(
api_key=os.getenv("FRIENDLI_TOKEN"),
base_url="https://api.friendli.ai/serverless/v1",
)
chat_completion = client.chat.completions.create(
model="meta-llama-3.1-8b-instruct",
messages=[
{
"role": "user",
"content": "Tell me how to make a delicious pancake"
}
]
)
print(chat_completion.choices[0].message.content)
```
```python Friendli Python SDK
import os
from friendli import SyncFriendli
client = SyncFriendli(token=os.getenv("FRIENDLI_TOKEN"))
chat_completion = client.serverless.chat.complete(
model="meta-llama-3.3-70b-instruct",
messages=[
{
"role": "user",
"content": "Tell me how to make a delicious pancake"
}
],
stream=False,
)
print(chat_completion.choices[0].message.content)
```
```sh curl
curl -X POST https://api.friendli.ai/serverless/v1/chat/completions \
-H "Authorization: Bearer $FRIENDLI_TOKEN" \
-d '{
"model": "meta-llama-3.1-8b-instruct",
"messages": [
{
"role": "user",
"content": "Python is a popular"
}
]
}'
```
## Additional Tips
Check out the [Chat Completions API Reference](/openapi/serverless/chat-completions) docs for more details.
**Ready to unlock the creativity of generative AI? Get started with Friendli Serverless Endpoints today!**
# Tool Assisted API
Source: https://friendli.ai/docs/guides/serverless_endpoints/tool-assisted-api
Tool Assisted API enhances a model's capabilities by integrating tools that extend its functionality beyond simple conversational interactions. By using this API, the model becomes more dynamic, providing more comprehensive and actionable responses. Currently, Friendli Serverless Endpoints supports a variety of built-in tools specifically designed for Chat Completion tasks.
## What is Tool Assisted API?
**Tool Assisted API** enhances a model's capabilities by integrating **tools** that extend its functionality beyond simple conversational interactions.
By using this API, the model becomes more dynamic, providing more comprehensive and actionable responses.
Currently, [**Friendli Serverless Endpoints**](/guides/serverless_endpoints/introduction) supports a variety of built-in tools specifically designed for **Chat Completion** tasks.
***
### What is Chat Completion?
[**Chat completion**](/openapi/serverless/chat-completions) refers to a model's ability to generate responses in a conversation. Given a sequence of messages or conversation turns, the model processes the input and generates a response based on its internal knowledge and training data.
* **Example**:
* **User**: "What is the capital of France?"
* **Model**: "The capital of France is Paris."
However, chat completion has its limitations: it is restricted to the knowledge the model has learned during its training and cannot access real-time or external data.
***
### Is Chat Completion Different from Tool Assisted Chat Completion?
Yes, [**Tool Assisted Chat Completion**](/openapi/serverless/tool-assisted-chat-completions) goes beyond basic chat completion by integrating external tools to enhance the conversation. This allows the model to access real-time data, perform specific tasks, and interact with external systems in ways that chat completion alone cannot achieve.
* **Example**:
* **User**: "What is the weather today?"
* **Model without Tool Access**: Relies on pre-learned information, potentially giving outdated or generalized answers.
* **Model with Tool Access**: Calls a weather API to retrieve live data and responds: "The weather today in New York is 72°F with clear skies."
With tool access, the model provides a more accurate and up-to-date response.
Additionally, some tasks, such as file processing or complex calculations, cannot be performed by the model alone but can be handled with the help of tools.
* **Example**:
* **User**: "Can you extract the text from this document?" (provides a file)
* **Model without Tool Access**: "I cannot extract data from files directly."
* **Model with Tool Access**: Extracts the text from the provided file and responds: "Using the `file:text` tool, I've extracted the following text: \[Text from the file]."
When no tools are specified, the model will respond using only its internal knowledge.
***
### Benefits of Tool Assisted Chat Completion
Tool Assisted Chat Completion offers several advantages over basic chat completion:
* **Real-Time Data Access**: The model can pull live information.
* **Extended Capabilities**: The model can perform complex tasks like running calculations, executing code, extracting text from files, and interacting with databases and APIs.
***
### Comparison: Chat Completion vs. Tool Assisted Chat Completion
| Feature | **Chat Completion** | **Tool Assisted Chat Completion** |
| ----------------- | ------------------------------------------------ | -------------------------------------------------------------------- |
| **Response Type** | Based on internal knowledge | Uses external tools for enhanced, real-time responses |
| **Capabilities** | Limited to pre-learned knowledge | Can interact with tools for data retrieval and task execution |
| **Example** | "What is the weather today?" (general knowledge) | "What is the weather today?" (live API result) |
| **Use Cases** | General conversation and Q\&A | Complex tasks like real-time updates, data analysis, file processing |
***
## Integrated Tools
Tool Assisted API can also make use of integrated tools, which operate in the same way as built-in tools but require a connection to external services.
Because they leverage specialized APIs or platforms, integrated tools are more likely to be production-ready compared to built-in tools, offering higher performance and broader functionality.
### `linkup:search`
**Description:**
[Linkup](https://www.linkup.so) seamlessly integrates with models served by FriendliAI to provide real-time web search capabilities. This enables AI applications to retrieve up-to-date facts, events, and information beyond the model's training data, complete with accurate citations.
Grounding responses in real-time data, Linkup improves precision, accuracy and factuality while delivering production-ready state-of-the-art web information retrieval.
**When Used:**
Automatically called when you need to retrieve current information from the web, such as recent news, real-time data, or up-to-date facts.
This tool is particularly useful for tasks requiring accurate and reliable web search results.
This tool requires integration setup for SDK/API usage. You can obtain your Linkup API key from [app.linkup.so](https://app.linkup.so) and integrate it in the [Friendli Suite > Personal Settings > Integrations](https://friendli.ai/suite/setting/integrations).
For more details, see the [Linkup integration guide](/sdk/integrations/linkup#for-sdk%2Fapi-usage).
***
## Built-In Tools
Tool Assisted API automatically selects the best tool to perform an action based on user input when a specific tool is enabled. Built-in tools are available without any external integration, making them free to use and instantly accessible.
See the list below for available built-in tools.
### `web:search`
**Description:**
Retrieves information from the web based on search queries. It fetches information based on keywords and helps gather
knowledge or insights from online sources.
**When Used:**
Automatically called when you ask questions or seek information that requires external research or the latest data from the web.\
However, compared to specialized web search tools that offer advanced reasoning and enhanced accuracy with citations, this built-in tool may have limitations for complex queries and production applications.
### `math:calculator`
**Description:**
Performs basic arithmetic operations like addition, subtraction, multiplication, and division, as well as more complex calculations like square roots or exponents.
It is useful for any tasks requiring mathematical computation.
**When Used:**
Automatically called when mathematical expressions or calculations are required.\
Whether you're solving equations, calculating percentages, or handling financial calculations, this tool performs the task for you.
### `code:python-interpreter`
**Description:**
Executes Python code directly within the platform for custom scripts, data processing, or automation.
You can run Python scripts, test snippets of code, or automate tasks through coding logic.
**When Used:**
Automatically called when tasks involve writing or running Python scripts, such as custom data manipulations or logic-based automation.
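For orientation only, here is a rough sketch of what enabling a built-in tool might look like from the OpenAI Python SDK. The base URL and the `tools` payload shown here are assumptions for illustration; the authoritative request shape is in the Tool Assisted Chat Completions API reference linked below.

```python
# Sketch only: the base URL and tool payload are assumptions, not taken from
# the API reference. Confirm the exact request shape before using in production.
import os

from openai import OpenAI

client = OpenAI(
    # Assumed tool-assisted endpoint; see the API reference for the exact path.
    base_url="https://api.friendli.ai/serverless/tools/v1",
    api_key=os.environ.get("FRIENDLI_TOKEN"),
)

completion = client.chat.completions.create(
    model="meta-llama-3.3-70b-instruct",
    messages=[{"role": "user", "content": "What is the weather today in New York?"}],
    # Enable a built-in tool; the model decides when to call it.
    tools=[{"type": "web:search"}],
)
print(completion.choices[0].message.content)
```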
## Conclusion
* **Chat Completion**: Best for general conversations that rely on the model's pre-existing knowledge.
* **Tool Assisted Chat Completion**: Ideal for real-time, dynamic tasks and more advanced interactions, leveraging external tools to enhance functionality.
***
## Explore APIs
To get started with Tool Assisted Chat Completion, follow this tutorial: [**Tool calling with Serverless Endpoints**](/guides/tutorials/tool-calling-with-serverless-endpoints).
For more details, check out the API reference documentation below:
[Chat Completions API](/openapi/serverless/chat-completions)
Discover how to generate text through interactive conversations.
[Tool Assisted Chat Completions API](/openapi/serverless/tool-assisted-chat-completions)
Learn how to enhance responses with tool assisted chat completions using built-in tools.
# Structured Outputs
Source: https://friendli.ai/docs/guides/structured-outputs
Generate structured outputs using FriendliAI's Structured Outputs feature.
Friendli offers structured outputs capability with two core guarantees:
* **Model-agnostic**: Supported on **all** chat-capable models on Friendli.
* **High schema fidelity**: Generates outputs that reliably conform to your provided schemas.
## What is Structured Outputs?
Structured Outputs ensures LLMs return predictable, machine-readable results (e.g., JSON) instead of free-form text. This is essential for workflows that require validation or downstream automation.
## Structured Outputs with Friendli
* **Schema-aligned generation**: High-accuracy adherence to your JSON Schema.
* **Flexible modes**: Choose strict or loose JSON mode, or apply regex constraints as needed.
* **OpenAI compatible**: Use standard `response_format` options with OpenAI SDKs.
#### Structured Outputs Parameters
| Type | Description | Name at OpenAI |
| ------------- | ------------------------------------------------------------------- | --------------------------------------------------------------------------------------------- |
| `json_schema` | The model returns a JSON object that conforms to the given schema. | [Structured Outputs](https://platform.openai.com/docs/guides/structured-outputs#introduction) |
| `json_object` | The model can return any JSON object. | [JSON mode](https://platform.openai.com/docs/guides/structured-outputs#json-mode) |
| `regex` | The model returns a string that conforms to the given regex schema. | N/A |
#### Supported JSON schemas
We support **all seven standard JSON schema types** (`null`, `boolean`, `number`, `integer`, `string`, `object`, `array`). The supported JSON schema keywords are listed below.
Using unsupported or unexpected JSON schema keywords may result in them being ignored, triggering an error, or causing undefined behavior.
#### Type-specific keywords
* `integer`
* `exclusiveMinimum`, `exclusiveMaximum`, `minimum`, `maximum` (Note: these are not supported in `number`)
* `string`
* `pattern`
* `format`
* Supported values: `uuid`, `date-time`, `date`, `time`
* `object`
* `properties`
* `additionalProperties` is ignored, and is always set to `False`.
* `required`: We support both required and optional properties, but have these limitations:
* The sequence of the properties is fixed.
* The first property should be `required`. If not, the first required property is moved to the front.
* `array`
* `items`
* `minItems`: We support only `0` or `1` for `minItems`.
#### Constant values and enumerated values
`const` and `enum` only support constant values of `null`, `boolean`, `number`, and `string`.
#### Schema composition
We support only `anyOf` for [schema composition](https://json-schema.org/understanding-json-schema/reference/combining).
#### Referencing subschemas
We only support referencing (`$ref`) to "internal" subschemas. These subschemas must be defined within `$defs`, and the value of `$ref` must be a valid URI pointing to a subschema.
#### Annotations
JSON schema annotations such as `title` or `description` are accepted but ignored.
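Putting these rules together, here is an illustrative hand-written schema (expressed as a Python dict you could pass as the `schema` value under `json_schema`) that stays within the supported subset: required properties, an `anyOf` composition, a `pattern`-constrained string, and a `$ref` to an internal `$defs` subschema.

```python
# Illustrative schema within the supported subset (hand-written example,
# not taken from the API reference).
schema = {
    "type": "object",
    "properties": {
        "name": {"type": "string"},
        # anyOf composition: an amount that may be a number or null.
        "amount": {"anyOf": [{"type": "number"}, {"type": "null"}]},
        # Reference to an internal subschema defined under $defs.
        "origin": {"$ref": "#/$defs/country"},
    },
    "required": ["name", "amount", "origin"],
    "$defs": {
        "country": {
            "type": "object",
            # Two-letter uppercase country code, constrained via pattern.
            "properties": {"code": {"type": "string", "pattern": "^[A-Z]{2}$"}},
            "required": ["code"],
        }
    },
}
```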
## Simple Example
This example provides a step-by-step guide of how to create a structured output response in JSON format.
We use Python and the `pydantic` library to define a schema for the output in this example.
Define a schema that contains information about a dish.
```python
from pydantic import BaseModel
class Result(BaseModel):
dish: str
cuisine: str
calories: int
```
Call structured output and use schema to structure the response.
```python {17-22} OpenAI Python SDK
import os
from openai import OpenAI
client = OpenAI(
base_url="https://api.friendli.ai/serverless/v1",
api_key=os.getenv("FRIENDLI_TOKEN"),
)
completion = client.chat.completions.create(
model="meta-llama/Llama-3.1-8B-Instruct",
messages=[
{
"role": "user",
"content": "Suggest a popular Italian dish in JSON format.",
},
],
response_format={
"type": "json_schema",
"json_schema": {
"schema": Result.model_json_schema(),
}
}
)
```
```python {15-20} Friendli Python SDK
import os
from friendli import SyncFriendli
with SyncFriendli(
token=os.getenv("FRIENDLI_TOKEN"),
) as friendli:
completion = friendli.serverless.chat.complete(
model="meta-llama/Llama-3.1-8B-Instruct",
messages=[
{
"role": "user",
"content": "Suggest a popular Italian dish in JSON format.",
},
],
response_format={
"type": "json_schema",
"json_schema": {
"schema": Result.model_json_schema(),
}
}
)
```
```bash {12-25} curl*
curl -X POST https://api.friendli.ai/serverless/v1/chat/completions \
-H "Authorization: Bearer $FRIENDLI_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"messages": [
{
"role": "user",
"content": "Suggest a popular Italian dish in JSON format."
}
],
"response_format": {
"type": "json_schema",
"json_schema": {
"schema": {
"type": "object",
"properties": {
"dish": {"type": "string"},
"cuisine": {"type": "string"},
"calories": {"type": "integer"}
},
"required": ["dish", "cuisine", "calories"]
}
}
}
}'
```
You can use the output in the following way.
```python
response = completion.choices[0].message.content
print(response)
```
The code output result is as follows.
```json Result:
{
"dish": "Spaghetti Bolognese",
"cuisine": "Italian",
"calories": 540
}
```
This example demonstrates how to generate an arbitrary JSON object response without a predefined schema.
In `json_object` mode, the response may start with `{` or `[` and can be any arbitrary JSON object (dictionary) or array. If you need predictable results, we recommend using `json_schema`.
```bash curl
curl -X POST https://api.friendli.ai/serverless/v1/chat/completions \
-H "Authorization: Bearer $FRIENDLI_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"messages": [
{"role": "system", "content": "You MUST answer with JSON."},
{"role": "user", "content": "Generate a lasagna recipe. (very short)"}
],
"response_format": {"type": "json_object"}
}'
```
```python OpenAI Python SDK
import os
from openai import OpenAI
client = OpenAI(
base_url="https://api.friendli.ai/serverless/v1",
api_key=os.getenv("FRIENDLI_TOKEN"),
)
completion = client.chat.completions.create(
model="meta-llama/Llama-3.1-8B-Instruct",
messages=[
{"role": "system", "content": "You MUST answer with JSON."},
{"role": "user", "content": "Generate a lasagna recipe. (very short)"},
],
response_format={"type": "json_object"},
)
print(completion.choices[0].message.content)
```
```python Friendli Python SDK
import os
from friendli import SyncFriendli
with SyncFriendli(
token=os.getenv("FRIENDLI_TOKEN"),
) as friendli:
completion = friendli.serverless.chat.complete(
model="meta-llama/Llama-3.1-8B-Instruct",
messages=[
{"role": "system", "content": "You MUST answer with JSON."},
{"role": "user", "content": "Generate a lasagna recipe. (very short)"},
],
response_format={"type": "json_object"},
)
print(completion.choices[0].message.content)
```
This example shows how to generate output that matches a specific regular expression pattern.
```bash curl
curl -X POST https://api.friendli.ai/serverless/v1/chat/completions \
-H "Authorization: Bearer $FRIENDLI_TOKEN" \
-H "Content-Type: application/json" \
-d '{
"model": "meta-llama/Llama-3.1-8B-Instruct",
"messages": [
{
"role": "user",
"content": "μ‘°μ μμ‘°μ 첫λ²μ§Έ μμ λꡬμ λκΉ (Who is the first king of the Joseon Dynasty)?"
}
],
"response_format": {
"type": "regex",
"schema": "[\\n ,.?!0-9\\uac00-\\ud7af]*"
}
}'
```
```python OpenAI Python SDK
import os
from openai import OpenAI
client = OpenAI(
base_url="https://api.friendli.ai/serverless/v1",
api_key=os.getenv("FRIENDLI_TOKEN"),
)
completion = client.chat.completions.create(
model="meta-llama/Llama-3.1-8B-Instruct",
messages=[
{
"role": "user",
"content": "μ‘°μ μμ‘°μ 첫λ²μ§Έ μμ λꡬμ λκΉ (Who is the first king of the Joseon Dynasty)?",
},
],
# Korean characters and numbers are allowed in the response.
response_format={"type": "regex", "schema": "[\n ,.?!0-9\uac00-\ud7af]*"},
)
print(completion.choices[0].message.content)
```
```python Friendli Python SDK
import os
from friendli import SyncFriendli
with SyncFriendli(
token=os.getenv("FRIENDLI_TOKEN"),
) as friendli:
completion = friendli.serverless.chat.complete(
model="meta-llama/Llama-3.1-8B-Instruct",
messages=[
{
"role": "user",
"content": "μ‘°μ μμ‘°μ 첫λ²μ§Έ μμ λꡬμ λκΉ (Who is the first king of the Joseon Dynasty)?",
},
],
# Korean characters and numbers are allowed in the response.
response_format={"type": "regex", "schema": "[\n ,.?!0-9\uac00-\ud7af]*"},
)
print(completion.choices[0].message.content)
```
## Advanced Examples
For more advanced use cases, see our blog: [Structured Output for LLM Agents](https://friendli.ai/blog/structured-output-llm-agents).
# Free Credits
Source: https://friendli.ai/docs/guides/suite/free_credits
Guide to using free credits and how to apply coupons.
We offer free credits to help you get started with your trials.
## Receiving When You Start
* Serverless Endpoints: Free credits are granted when you start a Trial plan during onboarding.
* Dedicated Endpoints: Free credits are provided when you start a Basic plan during onboarding.
If you didn't claim your free credits during onboarding, you can still do so later. Go to the **Team Settings > Billing** page and start product plans to receive them.
Access to the billing page is restricted to admins only.
## Redeem Free Coupons
You can also get additional free credits by using coupons.
Go to the **Team Settings > Billing** page and click the **'Redeem free coupon'** button on the top-right corner of the page.
Once you successfully enter the coupon code, the additional credits will be applied immediately.
**Notes**
* If the coupon has expired: Check the expiration date. The free coupon may no longer be valid.
* If the coupon has already been used: It may have already been redeemed by someone on your team.
* If the coupon does not exist: Double-check the code to ensure it is correct.
* If the coupon limit has been reached: The coupon may no longer be available if all available uses have already been claimed.
Enter your coupon code to apply the credits to your team account. Credits are team-based, meaning all members in your team can use them.
Access to the billing page is restricted to admins only.
If you encounter any issues, please contact support.
# Personal Access Tokens
Source: https://friendli.ai/docs/guides/suite/personal_access_tokens
Learn how to manage credentials in Friendli Suite, including using personal access tokens for authentication and authorization.
Effective management of credentials is crucial when using Friendli Suite and its endpoints for authentication and authorization purposes.
This guide outlines when the credentials are required and provides instructions on how to manage them.
A Friendli Token serves as an alternative method of authorization to signing in with an email and a password.
You can generate a new Friendli Token through the [Friendli Suite](https://friendli.ai/suite), at your **'Personal settings'** page.
1. Go to the [Friendli Suite](https://friendli.ai/suite) and sign in with your account.
2. Click the profile icon at the top-right corner of the page.
3. Click **'Personal settings'** menu.
4. Go to the **'Tokens'** tab on the navigation bar.
5. Create a new Friendli Token by clicking the **'Create token'** button.
6. Copy the token and save it in a safe place. You will not be able to see this token again once the page is refreshed.
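Once created, the token is typically stored as an environment variable and sent as a Bearer credential on API requests. A minimal sketch using the OpenAI-compatible client, assuming the token you just copied is exported as `FRIENDLI_TOKEN`:
```python
import os
from openai import OpenAI

# The Friendli Token authorizes API requests in place of signing in with email and password.
client = OpenAI(
    base_url="https://api.friendli.ai/serverless/v1",
    api_key=os.getenv("FRIENDLI_TOKEN"),
)
```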
# Projects
Source: https://friendli.ai/docs/guides/suite/projects
Collaborate securely with your teammates.
Collaborate securely with your teammates.
## Using Projects
Projects are scoped to a team. Only members added to a project can access and manage its resources.
* Control who can manage and access resources.
* Group endpoints and related resources in one place.
* Ensure only the right people can create, update, or terminate endpoints.
Project access currently applies only to Dedicated Endpoints.
## Viewing Projects
You can view all projects from the **Team Overview**.
You may see project resources, but to **create**, **update**, or **terminate** endpoints, you must be added as a project member.
## Adding Members
To collaborate, teammates must be added as project members.
Being a team member does not automatically give access to project endpoints.
* Admins and Owners can add or remove members.
To add a member, enter their name or email and click **Add**.
## Project Usage
You can view monthly usage for each project.
This shows total resource consumption and helps track costs.
## Project Settings
From Project Settings, you can:
* View the project ID and creation date.
* Edit the project name.
* Archive the project when it's no longer needed.
The last remaining project cannot be archived.
## Permission
Project permissions depend on your team role.
| Permission | Owner/Admin | Member |
| ----------------------------- | ----------- | ------ |
| View projects                 | ✅           | ✅      |
| View monthly project usage    | ✅           | ✅      |
| Access project settings       | ✅           | ❌      |
| Edit project names            | ✅           | ❌      |
| Add or remove project members | ✅           | ❌      |
| Archive projects              | ✅           | ❌      |
You can view your team role in [Personal Settings → Teams](https://friendli.ai/suite/setting/teams).
If you are not added as a project member, you cannot create, update, or terminate endpoints in that project. This often applies to invited users who joined the team but were not added to a project. Contact a team owner or admin for access.
# Supported Models
Source: https://friendli.ai/docs/guides/supported-models
You can view the content [here](https://friendli.ai/models).
# Tool Calling
Source: https://friendli.ai/docs/guides/tool-calling
Friendli supports tool calling for a wide range of open-source and commercial models.
Friendli provides OpenAI-compatible tool calling with two core guarantees:
* **Broad model coverage**: Works across most chat-capable models. No custom parsers required.
* **High accuracy**: Ensures reliable tool-call response that aligns with your provided schemas.
## What is Tool Calling?
Tool calling (also called function calling) connects LLMs to external systems, enabling real-time data access and action execution, a capability essential for agentic workflows.
## Broad Model Coverage
Friendli supports tool calling for a wide range of open-source and commercial models.
You can browse available models on our [Models page](https://friendli.ai/models) and try them out with the Playground.
## Tool Calling with Friendli
Here is how to use tool calling with Friendli.
#### Tool Calling Parameters
To enable tool calling, use the `tools`, `tool_choice`, and `parallel_tool_calls` parameters.
| Parameter             | Description                                                             | Default |
| --------------------- | ---------------------------------------------------------------------- | ------- |
| `tools` | The list of tool objects that define the functions the model can call. | - |
| `tool_choice` | Determines the tool calling behavior of the model. | `auto` |
| `parallel_tool_calls` | Whether to let the model issue tool calls in parallel. | `True` |
By default, the model decides whether to call a function and which one to use.
With the `tool_choice` parameter, you can explicitly instruct the model to use a specific function.
* `none`: Disable the use of tools.
* `auto`: Enable the model to decide whether to use tools and which ones to use.
* `required`: Force the model to use a tool, but the model chooses which one.
* Named tool choice: Force the model to use a specific tool. It must be in the following format:
```json
{
"type": "function",
"function": {
"name": "get_current_weather" // The function name you want to specify
}
}
```
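For illustration, here is a minimal sketch of passing these parameters to a chat completions request; the `get_current_weather` tool definition below is a hypothetical example, not a built-in function:
```python
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.friendli.ai/serverless/v1",
    api_key=os.getenv("FRIENDLI_TOKEN"),
)

# Hypothetical tool definition used only to demonstrate the parameters.
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_current_weather",
            "parameters": {
                "type": "object",
                "properties": {"location": {"type": "string"}},
            },
        },
    }
]

completion = client.chat.completions.create(
    model="meta-llama-3.1-8b-instruct",
    messages=[{"role": "user", "content": "What's the weather in Paris right now?"}],
    tools=tools,
    # Force the model to call the named function instead of letting it decide.
    tool_choice={"type": "function", "function": {"name": "get_current_weather"}},
    parallel_tool_calls=False,
)
print(completion.choices[0].message.tool_calls)
```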
#### Response Schema
Friendli follows the OpenAI function calling schema. Tool calls are returned in `choices[].message.tool_calls[]`, with each item containing a `function.name` and JSON-stringified `function.arguments`. After executing a tool, append a new message with role `tool`, the matching `tool_call_id`, and the tool result in `content`.
## Simple Example
The example below walks through five steps:
1. Define a tool (`get_weather`) that retrieves weather information.
2. Ask a question that triggers tool use.
3. Let the model select the tool.
4. Execute the tool.
5. Generate the final answer using the tool result.
Define a function that the model can call (`get_weather`) with a JSON Schema.\
The function requires the following parameters:
* `location`: The location to look up weather information for.
* `date`: The date to look up weather information for.
This definition is included in the `tools` array and passed to the model.
```python
tools = [
{
"type": "function",
"function": {
"name": "get_weather",
"parameters": {
"type": "object",
"properties": {
"location": {"type": "string"},
"date": {"type": "string", "format": "date"}
},
},
},
}
]
```
When a user asks a question, this request is passed to the model as a `messages` array.\
For example, the request "What's the weather like in Paris today?" would be passed as:
```python
from datetime import datetime
today = datetime.now()
messages = [
{"role": "system", "content": f"You are a helpful assistant. today is {today}."},
{"role": "user", "content": "What's the weather like in Paris today?"}
]
```
Call the model using the `tools` and `messages` defined above.
```python OpenAI Python SDK
import os
from openai import OpenAI
token = os.getenv("FRIENDLI_TOKEN") or ""
client = OpenAI(
base_url = "https://api.friendli.ai/serverless/v1",
api_key = token
)
completion = client.chat.completions.create(
model="meta-llama-3.1-8b-instruct",
messages=messages,
tools=tools,
)
print(completion.choices[0].message.tool_calls)
```
```python Friendli Python SDK
import os
from friendli import SyncFriendli
with SyncFriendli(
token=os.getenv("FRIENDLI_TOKEN"),
) as friendli:
completion = friendli.serverless.chat.complete(
model="meta-llama-3.1-8b-instruct",
messages=messages,
tools=tools,
)
print(completion.choices[0].message.tool_calls)
```
The API caller runs the tool based on the function call information returned by the model.\
For example, the `get_weather` function is executed as follows:
```python
import json
import random
def get_weather(location: str, date: str):
temperature = random.randint(60, 80)
return {"temperature": temperature, "forecast": "sunny"}
tool_call = completion.choices[0].message.tool_calls[0]
tool_response = locals()[tool_call.function.name](**json.loads(tool_call.function.arguments))
print(tool_response)
```
```python Result:
{'temperature': 65, 'forecast': 'sunny'}
```
Add the tool's response to the `messages` array and pass it back to the model.
1. Append tool call information
2. Append the tool's execution result
This ensures the model has all the necessary information to generate a response.
```python
model_response = completion.choices[0].message
# Append the response from the model
messages.append(
{
"role": model_response.role,
"tool_calls": [
tool_call.model_dump()
for tool_call in model_response.tool_calls
]
}
)
# Append the response from the tool
messages.append(
{
"role": "tool",
"content": json.dumps(tool_response),
"tool_call_id": tool_call.id
}
)
print(json.dumps(messages, indent=2))
```
The model generates the final response based on the tool's output:
```python OpenAI Python SDK
next_completion = client.chat.completions.create(
model="meta-llama-3.1-8b-instruct",
messages=messages,
tools=tools
)
print(next_completion.choices[0].message.content)
```
```python Friendli Python SDK
next_completion = friendli.serverless.chat.complete(
model="meta-llama-3.1-8b-instruct",
messages=messages,
tools=tools
)
print(next_completion.choices[0].message.content)
```
```text Final output:
According to the forecast, it's going to be a sunny day in Paris with a temperature of 65 degrees.
```
## Advanced Examples
See the following blog posts to learn more about how to use tool calling with Friendli:
* Building an AI Agent for Google Calendar ([Part 1](https://friendli.ai/blog/ai-agent-google-calendar) / [Part 2](https://friendli.ai/blog/calendar-agent-vercel))
* Friendli Tools Blog Series ([Part 1](https://friendli.ai/blog/llm-function-calling) / [Part 2](https://friendli.ai/blog/ai-agents-function-calling) / [Part 3](https://friendli.ai/blog/friendli-tools-llama3-outperforms-gpt4o))
# Build an agent with Gradio
Source: https://friendli.ai/docs/guides/tutorials/build-an-agent-with-gradio
Build and deploy smart AI agents with Friendli Serverless Endpoints and Gradio in under 50 lines.
## Goals
* Build your own AI agent using [**Friendli Serverless Endpoints**](https://friendli.ai/products/serverless-endpoints) and [**Gradio**](https://www.gradio.app) in less than 50 LoC
* Use tool calling to make your agent even smarter
* Share your AI agent with the world and gather feedback
> [**Gradio**](https://www.gradio.app) is the fastest way to demo your model with a friendly web interface.
## Getting Started
1. Head to [**https://friendli.ai**](https://friendli.ai/get-started/serverless-endpoints), and create an account.
2. Grab a [Friendli Token](https://friendli.ai/suite/setting/tokens) to use Friendli Serverless Endpoints within an agent.
## Step 1. Prerequisite
Install dependencies.
```bash
pip install openai gradio
```
## Step 2. Launch your agent
Build your own AI agent using **Friendli Serverless Endpoints** and **Gradio**.
* Gradio provides a `ChatInterface` that implements a chatbot UI running the `chat_function`.
* More information about the *chat\_function(message, history)*
> *The input function should accept two parameters: a string input message and list of two-element lists of the form \[\[user\_message, bot\_message], ...] representing the chat history, and return a string response.*
* Implement the `chat_function` using Friendli Serverless Endpoints.
* Here, we used the `meta-llama-3.3-70b-instruct` model.
* Feel free to explore other available models [here](https://friendli.ai/models/search?products=SERVERLESS).
```python
from openai import OpenAI
import gradio as gr
friendli_client = OpenAI(
base_url="https://api.friendli.ai/serverless/v1",
api_key="YOUR FRIENDLI TOKEN"
)
def chat_function(message, history):
messages = []
for user, chatbot in history:
messages.append({"role" : "user", "content": user})
messages.append({"role" : "assistant", "content": chatbot})
messages.append({"role": "user", "content": message})
stream = friendli_client.chat.completions.create(
model="meta-llama-3.3-70b-instruct",
messages=messages,
stream=True
)
res = ""
for chunk in stream:
res += chunk.choices[0].delta.content or ""
yield res
css = """
.gradio-container {
max-width: 800px !important;
margin-top: 100px !important;
}
.pending {
display: none !important;
}
.sm {
box-shadow: None !important;
}
#component-2 {
height: 400px !important;
}
"""
with gr.Blocks(theme=gr.themes.Soft(), css=css) as friendli_agent:
gr.ChatInterface(chat_function)
friendli_agent.launch()
```
## Step 3. Tool Calling (Advanced)
Use tool calling to make your agent even smarter! As an example, we will show you how to make your agent search the web before answering.
* Change the `base_url` to `https://api.friendli.ai/serverless/tools/v1`
* Add `tools` parameter when calling chat completion API
```python
from openai import OpenAI
import gradio as gr
friendli_client = OpenAI(
base_url="https://api.friendli.ai/serverless/tools/v1",
api_key="YOUR FRIENDLI TOKEN"
)
def chat_function(message, history):
messages = []
for user, chatbot in history:
messages.append({"role" : "user", "content": user})
messages.append({"role" : "assistant", "content": chatbot})
messages.append({"role": "user", "content": message})
stream = friendli_client.chat.completions.create(
model="meta-llama-3.3-70b-instruct",
messages=messages,
stream=True,
tools=[{"type": "web:search"}],
)
res = ""
for chunk in stream:
if chunk.choices is None:
yield "Waiting for tool response..."
else:
res += chunk.choices[0].delta.content or ""
yield res
css = """
.gradio-container {
max-width: 800px !important;
margin-top: 100px !important;
}
.pending {
display: none !important;
}
.sm {
box-shadow: None !important;
}
#component-2 {
height: 400px !important;
}
"""
with gr.Blocks(theme=gr.themes.Soft(), css=css) as agent:
gr.ChatInterface(chat_function)
agent.launch()
```
Here is the list of available built-in tools (Beta). Feel free to build your agent using the tools below.
* `linkup:search` (tool for high-quality, AI-powered web search with real-time data and improved accuracy)
* `math:calculator` (tool for calculating arithmetic operations)
* `math:statistics` (tool for analyzing statistic data)
* `math:calendar` (tool for handling date-related data)
* `web:search` (tool for retrieving data through the web search)
* `web:url` (tool for extracting data from a given website)
* `code:python-interpreter` (tool for writing and executing python code)
* `file:text` (tool for extracting text data from a given file)
## Step 4. Deploy your agent
For a temporary deployment, change the last line of the code.
```python
agent.launch(share=True)
```
For a permanent deployment, you can use [Hugging Face Spaces](https://huggingface.co/spaces)!
# Build an agent with LangChain
Source: https://friendli.ai/docs/guides/tutorials/build-an-agent-with-langchain
Build an AI agent with LangChain and Friendli Serverless Endpoints, integrating tool calling for dynamic and efficient responses.
## Introduction
This tutorial walks you through creating an Agent using LangChain and Serverless Endpoints.
## Setup
```bash
pip install -qU langchain-openai langchain-community langchain wikipedia
```
Get your [Friendli Token](https://friendli.ai/suite/setting/tokens) to use Friendli Serverless Endpoints.
```python
import getpass
import os
if not os.environ.get("FRIENDLI_TOKEN"):
os.environ["FRIENDLI_TOKEN"] = getpass.getpass("Enter your Friendli Token: ")
```
## Instantiation
```python
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(
model="meta-llama-3.1-8b-instruct",
base_url="https://api.friendli.ai/serverless/v1",
api_key=os.environ["FRIENDLI_TOKEN"],
)
```
## Create Agent with LangChain
### Step 1. Create Tool
```python
from langchain_community.tools import WikipediaQueryRun
from langchain_community.utilities import WikipediaAPIWrapper
api_wrapper = WikipediaAPIWrapper(top_k_results=1, doc_content_chars_max=100)
wiki = WikipediaQueryRun(api_wrapper=api_wrapper)
tools = [wiki]
```
### Step 2. Create Prompt
```python
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
prompt = ChatPromptTemplate.from_messages(
[
("system", "You are a helpful assistant"),
MessagesPlaceholder("chat_history"),
("user", "{input}"),
("placeholder", "{agent_scratchpad}"),
]
)
prompt.messages
```
### Step 3. Create Agent
```python
from langchain.agents import AgentExecutor
from langchain.agents import create_tool_calling_agent
agent = create_tool_calling_agent(llm, tools, prompt)
agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True)
```
### Step 4. Run the Agent
```python
chat_history = []
while True:
user_input = input("Enter your message: ")
result = agent_executor.invoke(
{"input": user_input, "chat_history": chat_history},
)
chat_history.append({"role": "user", "content": user_input})
chat_history.append({"role": "assistant", "content": result["output"]})
```
When you run the code, it waits for your input.
After you enter a message, it processes the request and outputs the result.
When you ask about a specific Wikipedia topic, the agent automatically calls the Wikipedia tool and uses the result in its answer.
```text final result
Enter your Friendli Token: ··········
Enter your message: hello
> Entering new AgentExecutor chain...
Hello, it's nice to meet you. I'm here to help with any questions or topics you'd like to discuss. Is there something in particular you'd like to talk about, or do you need assistance with something?
> Finished chain.
Enter your message: What does the Linux kernel do?
> Entering new AgentExecutor chain...
Invoking: `wikipedia` with `{'query': 'Linux kernel'}`
responded: The Linux kernel is the core component of the Linux operating system. It acts as a bridge between the computer hardware and the user space applications. The kernel manages the system's hardware resources, such as memory, CPU, and I/O devices. It provides a set of interfaces and APIs that allow user space applications to interact with the hardware.
Page: Linux kernel
Summary: The Linux kernel is a free and open source, UNIX-like kernel that is
The Linux kernel is a free and open source, UNIX-like kernel that is responsible for managing the system's hardware resources, such as memory, CPU, and I/O devices. It provides a set of interfaces and APIs that allow user space applications to interact with the hardware. The kernel is the core component of the Linux operating system, and it plays a crucial role in ensuring the stability and security of the system.
> Finished chain.
Enter your message:
```
## Full Example Code
```python
import getpass
import os
from langchain_openai import ChatOpenAI
from langchain_community.tools import WikipediaQueryRun
from langchain_community.utilities import WikipediaAPIWrapper
from langchain_core.prompts import ChatPromptTemplate, MessagesPlaceholder
from langchain.agents import AgentExecutor
from langchain.agents import create_tool_calling_agent
if not os.environ.get("FRIENDLI_TOKEN"):
os.environ["FRIENDLI_TOKEN"] = getpass.getpass("Enter your Friendli Token: ")
llm = ChatOpenAI(
model="meta-llama-3.1-8b-instruct",
base_url="https://api.friendli.ai/serverless/v1",
api_key=os.environ["FRIENDLI_TOKEN"],
)
api_wrapper = WikipediaAPIWrapper(top_k_results=1, doc_content_chars_max=100)
wiki = WikipediaQueryRun(api_wrapper=api_wrapper)
tools = [wiki]
# Get the prompt to use - you can modify this!
prompt = ChatPromptTemplate.from_messages(
[
("system", "You are a helpful assistant"),
MessagesPlaceholder("chat_history"),
("user", "{input}"),
("placeholder", "{agent_scratchpad}"),
]
)
agent = create_tool_calling_agent(llm, tools, prompt)
agent_executor = AgentExecutor(agent=agent, tools=tools, verbose=True)
chat_history = []
while True:
user_input = input("Enter your message: ")
result = agent_executor.invoke(
{"input": user_input, "chat_history": chat_history},
)
chat_history.append({"role": "user", "content": user_input})
chat_history.append({"role": "assistant", "content": result["output"]})
```
# Chat docs with LangChain
Source: https://friendli.ai/docs/guides/tutorials/chat-docs-with-langchain
You can view the content [here](https://friendli.ai/blog/chatdocs-rag-friendli-langchain).
# Chat docs with MongoDB
Source: https://friendli.ai/docs/guides/tutorials/chat-docs-with-mongodb
You can view the content [here](https://friendli.ai/blog/rag-chatbot-friendli-mongodb-atlas-langchain).
# Go Playground with Next.js
Source: https://friendli.ai/docs/guides/tutorials/go-playground-with-nextjs
You can view the content [here](https://friendli.ai/blog/vercel-ai-sdk-playground-tutorial).
# RAG app with LlamaIndex
Source: https://friendli.ai/docs/guides/tutorials/rag-app-with-llamaindex
You can view the content [here](https://friendli.ai/blog/llamaindex-rag-app-friendli-engine).
# Tool calling with Serverless Endpoints
Source: https://friendli.ai/docs/guides/tutorials/tool-calling-with-serverless-endpoints
Build AI agents with Friendli Serverless Endpoints using tool calling for dynamic, real-time interactions with LLMs.
## Goals
* Use tool calling to build your own AI agent with [**Friendli Serverless Endpoints**](https://friendli.ai/products/serverless-endpoints)
* Check out the examples below to see how you can interact with state-of-the-art language models while letting them search the web, run Python code, etc.
* Feel free to make your own custom tools!
## Getting Started
1. Head to [**https://friendli.ai**](https://friendli.ai/get-started/serverless-endpoints), and create an account.
2. Grab a [Friendli Token](https://friendli.ai/suite/setting/tokens) to use Friendli Serverless Endpoints within an agent.
## Step 1. Playground UI
Experience tool calling on the Playground!
1. On your left sidebar, click the 'Serverless Endpoints' option to access the playground page.
2. You will see models that can be used as Serverless Endpoints. Choose the one you want and select the endpoint.
3. Click the 'Tools' button, select the Search tool, and enter a query to see the response.
## Step 2. Tool Calling
Search interesting information using the `web:search` tool.
This time, let's try it by writing Python code.
1. Add the user's input as a `user` role message.
2. Add the `web:search` tool to the tools option.
```python
# pip install friendli
import os
from friendli import SyncFriendli
with SyncFriendli(
token=os.getenv("FRIENDLI_TOKEN", ""),
) as friendli:
res = friendli.serverless.tool_assisted_chat.complete(
model="meta-llama-3.1-8b-instruct",
messages=[
{
"role": "user",
"content": "Find information on the popular movies currently showing in theaters and provide their ratings.",
},
],
tools=[{"type": "web:search"}],
max_tokens=200,
)
print(res)
```
## Step 3. Multiple tool calling
Use multiple tools at once to calculate how long it would take to buy a house in the San Francisco Bay Area based on your annual salary. Here are the available built-in tools; a sketch of a combined request follows the example answer sheet below.
* `math:calculator` (tool for calculating arithmetic operations)
* `web:search` (tool for retrieving data through the web search)
* `code:python-interpreter` (tool for writing and executing python code)
### Example Answer sheet
```
Prompt: My annual salary is $ 100k. How long it will take to buy a house in San Francisco Bay Area? (`web:search` & `math:calculator` used)
Answer: Based on the web search results, the median price of an existing single-family home in the Bay Area is around $1.25 million.
Using a calculator to calculate how long it would take to buy a house in the San Francisco Bay Area with an annual salary of $100,000, we get:
$1,200,000 (house price) / $100,000 (annual salary) = 12 years
So, it would take approximately 12 years to buy a house in the San Francisco Bay Area with an annual salary of $100,000,
assuming you save your entire salary each year and don't consider other factors like interest rates, taxes, and living expenses.
```
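Below is a minimal sketch of such a combined request, reusing the `tool_assisted_chat` pattern from Step 2; it assumes that multiple built-in tools can be listed together in the same `tools` array:
```python
# pip install friendli
import os
from friendli import SyncFriendli

with SyncFriendli(
    token=os.getenv("FRIENDLI_TOKEN", ""),
) as friendli:
    res = friendli.serverless.tool_assisted_chat.complete(
        model="meta-llama-3.1-8b-instruct",
        messages=[
            {
                "role": "user",
                "content": "My annual salary is $100k. How long will it take to buy a house in the San Francisco Bay Area?",
            },
        ],
        # Combine web search results with calculator operations.
        tools=[{"type": "web:search"}, {"type": "math:calculator"}],
    )
    print(res)
```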
## Step 4. Build a custom tool
Build your own creative tool. We will show you how to make a custom tool that retrieves temperature information. (The complete code snippet is provided at the bottom.)
1. **Define a function for using as a custom tool**
```python
def get_temperature(location: str) -> int:
"""Mock function that returns the city temperature"""
if "new york" in location.lower():
return 45
if "san francisco" in location.lower():
return 72
return 30
```
2. **Send a function calling inference request**
1. Add the user's input as a `user` role message.
2. The information about the custom function (e.g., `get_temperature`) goes into the tools option. The function's parameters are described in JSON schema.
3. The response includes the `arguments` field, which are values extracted from the user's input that can be used as parameters of the custom function.
```python
# pip install friendli
import os
from friendli import SyncFriendli
token = os.environ.get("FRIENDLI_TOKEN") or "YOUR_FRIENDLI_TOKEN"
client = SyncFriendli(token=token)
user_prompt = "I live in New York. What should I wear for today's weather?"
messages = [
{
"role": "user",
"content": user_prompt,
},
]
tools=[
{
"type": "function",
"function": {
"name": "get_temperature",
"description": "Get the temperature information in a given location.",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "The name of current location e.g., New York",
},
},
},
},
},
]
chat = client.serverless.chat.complete(
model="meta-llama-3.3-70b-instruct",
messages=messages,
tools=tools,
temperature=0,
frequency_penalty=1,
)
print(chat)
```
3. **Generate the final response using the tool calling results**
1. Add the `tool_calls` response as an `assistant` role message.
2. Add the result obtained by calling the `get_temperature` function as a `tool` message to the Chat API again.
```python
import json
func_kwargs = json.loads(chat.choices[0].message.tool_calls[0].function.arguments)
temperature_info = get_temperature(**func_kwargs)
messages.append(
{
"role": "assistant",
"tool_calls": [
tool_call.model_dump()
for tool_call in chat.choices[0].message.tool_calls
]
}
)
messages.append(
{
"role": "tool",
"content": str(temperature_info),
"tool_call_id": chat.choices[0].message.tool_calls[0].id
}
)
chat_w_info = client.serverless.chat.complete(
model="meta-llama-3.3-70b-instruct",
tools=tools,
messages=messages,
)
for choice in chat_w_info.choices:
print(choice.message.content)
```
* **Complete Code Snippet**
```python
# pip install friendli
import json
import os
from friendli import SyncFriendli
token = os.environ.get("FRIENDLI_TOKEN") or "YOUR_FRIENDLI_TOKEN"
client = SyncFriendli(token=token)
user_prompt = "I live in New York. What should I wear for today's weather?"
messages = [
{
"role": "user",
"content": user_prompt,
},
]
tools=[
{
"type": "function",
"function": {
"name": "get_temperature",
"description": "Get the temperature information in a given location.",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "The name of current location e.g., New York",
},
},
},
},
},
]
chat = client.serverless.chat.complete(
model="meta-llama-3.3-70b-instruct",
messages=messages,
tools=tools,
temperature=0,
frequency_penalty=1,
)
def get_temperature(location: str) -> int:
"""Mock function that returns the city temperature"""
if "new york" in location.lower():
return 45
if "san francisco" in location.lower():
return 72
return 30
func_kwargs = json.loads(chat.choices[0].message.tool_calls[0].function.arguments)
temperature_info = get_temperature(**func_kwargs)
messages.append(
{
"role": "assistant",
"tool_calls": [
tool_call.model_dump()
for tool_call in chat.choices[0].message.tool_calls
]
}
)
messages.append(
{
"role": "tool",
"content": str(temperature_info),
"tool_call_id": chat.choices[0].message.tool_calls[0].id
}
)
chat_w_info = client.serverless.chat.complete(
model="meta-llama-3.3-70b-instruct",
tools=tools,
messages=messages,
)
for choice in chat_w_info.choices:
print(choice.message.content)
```
## Congratulations!
By following the steps above, you have walked through the whole process of defining and using a custom tool to get an accurate and rich answer from LLMs!
Brainstorm creative ideas for your agent by reading our blog articles!
* [**Building an AI Agent for Google Calendar**](https://friendli.ai/blog/ai-agent-google-calendar)
* [**Hassle-free LLM Fine-tuning with FriendliAI and Weights & Biases**](https://friendli.ai/blog/llm-fine-tuning-friendliai-wandb)
* [**Building AI Agents Using Function Calling with LLMs**](https://friendli.ai/blog/ai-agents-function-calling)
* [**Function Calling: Connecting LLMs with Functions and APIs**](https://friendli.ai/blog/llm-function-calling)
# Deploy from W&B Registry with Webhook
Source: https://friendli.ai/docs/guides/tutorials/wandb-registry-with-dedicated-endpoints
Hands-on tutorial for launching and deploying LLMs using Friendli Dedicated Endpoints with Weights & Biases artifacts through webhook automation.
## Introduction
This tutorial is designed to guide you through the process of easily deploying your models from the [W\&B Registry](https://docs.wandb.ai/guides/core/registry/) to Friendli Dedicated Endpoints in the W\&B UI. Through a series of step-by-step instructions and hands-on examples, you'll learn how to:
* **Configure a webhook** in W\&B to trigger deployments to Friendli Dedicated Endpoints.
* **Create a [webhook automation](https://docs.wandb.ai/guides/core/automations/create-automations/webhook/)** to automatically deploy model artifacts when adding new versions.
* **Deploy a model artifact** to Friendli Dedicated Endpoints by adding an alias in the W\&B Registry.
* **Understand how adding and removing aliases** affects deployments on Friendli Dedicated Endpoints.
### Why use W\&B webhook automation with Friendli Dedicated Endpoints?
W\&B users often rely on W\&B Registry to manage the lifecycle of models, from tracking experiment artifacts to promoting the best-performing models for production use. As a W\&B user, integrating Friendli Dedicated Endpoints directly into this workflow allows you to:
* **Streamline deployment**: Transition your models from experimentation to production with minimal effort. By leveraging W\&B's aliasing system and FriendliAI's automated infrastructure, you eliminate the need for custom scripts or manual configurations.
* **Ensure deployment consistency**: Friendli Dedicated Endpoints include support for `idempotencyKey` to ensure the reliability of automated workflows. Each deployment trigger via webhook automation is tracked with a unique `idempotencyKey`, ensuring that operations like endpoint creation or updates are processed exactly once. It prevents duplicate or conflicting operations, giving you confidence in the consistency of your deployment.
By the end of this tutorial, you'll be equipped with the knowledge and skills necessary to seamlessly transfer your models from W\&B Registry to Friendli Dedicated Endpoints for efficient deployment. So, let's get started and explore the possibilities of Friendli Dedicated Endpoints!
## Prerequisites
* A Friendli Suite account with access to [Friendli Dedicated Endpoints](/guides/dedicated_endpoints/introduction).
* A [personal access token](/guides/suite/personal_access_tokens) generated through Friendli Suite.
## Step 1: Create a secret
1. Navigate to the [team's page](https://wandb.ai/home) on W\&B and click on **Team settings**.
2. Scroll down to the **Team secrets** section and click **New secret**.
3. Go to [Friendli Suite](https://friendli.ai/suite) and navigate to **[Personal settings > Tokens](https://friendli.ai/suite/setting/tokens)** and click **Create new token**.
4. Copy your [personal access token](/guides/suite/personal_access_tokens).
5. Return to W\&B and fill in the **Secret** with the personal access token generated through Friendli Suite.
## Step 2: Configure a webhook
1. From the same W\&B team settings page, click on **New webhook** in the **Webhooks** section.
2. Fill in the **URL** field with **Friendli Suite Rest API URL** (see more details [here](/openapi/dedicated/endpoint/wandb-artifact-create)) and **Access token** field with the secret already created through Friendli Suite.
## Step 3: Create a webhook automation
1. Go to your W\&B Registry Model and click on **View details** of the model you want to deploy.
2. Click on **Create automation** in the **Automations** section.
3. Select **An artifact alias is added** for the **Event**.
4. Enter an alias you want to use to trigger the deployment for the **Alias regex**.
5. Select the **Webhooks** for **Action type**.
6. Select the webhook configured with Friendli Dedicated Endpoints for **Webhook**.
7. Fill out the box by referring to the following example for **Payload**.
#### Example: Configuration for payload
```json
{
"wandbArtifactVersionName": "${artifact_version_string}"
}
```
| Field | Description |
| -------------------------- | ------------------------------------------ |
| `wandbArtifactVersionName` | Specific model artifact version from W\&B. |
```json
{
"wandbArtifactVersionName": "${artifact_version_string}",
"name": "Generated from WandB ${project_name}/${artifact_collection_name}",
"projectId": "project-id",
"idempotencyKey": "${alias}",
"accelerator": {
"type": "NVIDIA H100",
"count": 1
},
"autoscalingPolicy": {
"minReplica": 0,
"maxReplica": 2,
"cooldownPeriod": 300
}
}
```
| Field | Description |
| -------------------------- | --------------------------------------------------------------------------------------------------------------------------------------------------------- |
| `wandbArtifactVersionName` | Specific model artifact version from W\&B. |
| `name` | Name of the endpoint. |
| `projectId` | Specific project ID of where the endpoint will be created. |
| `idempotencyKey` | Unique value to track which webhook automation triggered an endpoint roll out. Use any unique value, but using the example value provided is recommended. |
| `accelerator` | Hardware for the endpoint. |
| `autoscalingPolicy` | Autoscaling settings for the endpoint. |
To gain more control over GPU resources for an endpoint, configure the `accelerator` field by specifying the desired type and count. This is particularly useful for serving large models that require model or data parallelism.
```json
{
"wandbArtifactVersionName": "${artifact_version_string}",
"name": "Generated from WandB ${project_name}/${artifact_collection_name}",
"accelerator": {
"type": "NVIDIA H100",
"count": 4
  }
}
```
| Field | Description |
| ------------------- | ---------------------------------- |
| `accelerator.type` | Specifies the instance type. |
| `accelerator.count` | Specifies the number of instances. |
View more details about each field [here](/openapi/dedicated/endpoint/wandb-artifact-create).
## Step 4: Deploy a model artifact
Deploy your model artifact to Friendli Dedicated Endpoints by simply adding the alias set in **Step 3** to a model artifact version!
After adding the alias, you can see the endpoint created in Friendli Dedicated Endpoints.
## Step 5: Roll out a model artifact (Advanced)
To roll out an endpoint to a new model artifact version, simply add the same alias to the new version you want to deploy. This updates the endpoint to use the new model artifact version. After assigning the alias, the endpoint will update to reflect the new version in Friendli Dedicated Endpoints.
An `idempotencyKey` is required to roll out an endpoint between different model artifact versions.
```json {9}
{
"wandbArtifactVersionName": "${artifact_version_string}",
"name": "Generated from WandB ${project_name}/${artifact_collection_name}",
"accelerator": {
"type": "NVIDIA H100",
"count": 1
},
"projectId": "project-id",
"idempotencyKey": "${alias}",
"autoscalingPolicy": {
"minReplica": 0,
"maxReplica": 2,
"cooldownPeriod": 300
}
}
```
## Step 6: Track the history of deployment versions
Use the Friendli Dedicated Endpoints versioning feature to track the history of your model deployments and maintain a clear record of every update. By adding an alias to a model artifact version, you can deploy models and roll out updates across versions efficiently, without needing to create a new endpoint from scratch.
* When an alias is reassigned to a different version, the existing endpoint will automatically roll out to the new version.
In the diagram,
* `v0` represents the first deployed version of the model when the endpoint was created.
* `v1` is a newer model artifact version that the alias was reassigned to, triggering a rollout to update the endpoint accordingly.
View more details about the versioning feature [here](/guides/dedicated_endpoints/versioning).
## Frequently Asked Questions
A model artifact version is deployed once for each alias added. Within a model collection, only one artifact version can hold a given alias at any time. Therefore, adding an alias to a new artifact version will automatically remove it from the previously aliased version with the same alias. One webhook automation is assigned to one Friendli Dedicated Endpoint.
Nothing happens to the endpoint. Removing an alias will not delete the endpoint. However, if you add the removed alias to a new model artifact version, the deployed endpoint will roll out to that version.
If an `idempotencyKey` is included in the payload, moving an alias to a different model artifact version will reassign the created endpoint to the new version within the same project.
When adding an alias to a model artifact version for the first time, an endpoint will be created in either an existing or a new project within your default team of Friendli Suite. If `projectId` is specified, the endpoint will be made in an existing project. Otherwise, a new project will be created.
## Feedback or issue
If you have any feedback or issues about the integration with Friendli Dedicated Endpoints, please ask for support by sending an email to [Support](mailto:support@friendli.ai).
# Container chat completions
Source: https://friendli.ai/docs/openapi/container/chat-completions
post /v1/chat/completions
Given a list of messages forming a conversation, the model generates a response.
Given a list of messages forming a conversation, the model generates a response.
When streaming mode is used (i.e., `stream` option is set to `true`), the response is in MIME type `text/event-stream`. Otherwise, the content type is `application/json`.
You can view the schema of the streamed sequence of chunk objects in streaming mode [here](/openapi/container/chat-completions-chunk-object).
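As an illustration, here is a minimal streaming sketch against a running Friendli Container using the OpenAI-compatible client. The address `http://localhost:8000/v1`, the placeholder API key, and the empty model name are assumptions; adjust them to however your container is exposed and configured:
```python
from openai import OpenAI

# Assumption: the container listens on localhost:8000 and does not require a real API key.
client = OpenAI(base_url="http://localhost:8000/v1", api_key="EMPTY")

# With stream=True the server responds with text/event-stream chunks.
stream = client.chat.completions.create(
    model="",  # the container serves the model it was launched with
    messages=[{"role": "user", "content": "Hello!"}],
    stream=True,
)
for chunk in stream:
    if chunk.choices:  # the final usage-only chunk has an empty choices list
        print(chunk.choices[0].delta.content or "", end="", flush=True)
```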
# Container chat completions chunk object
Source: https://friendli.ai/docs/openapi/container/chat-completions-chunk-object
Represents a streamed chunk of a chat completions response returned by model, based on the provided input.
Represents a streamed chunk of a chat completions response returned by model, based on the provided input.
```json Response
data: {
"id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941",
"object": "chat.completion.chunk",
"choices": [
{
"index": 0,
"delta": { "role": "assistant", "content": "This" },
"finish_reason": null,
"logprobs": null
}
],
"usage": null,
"created": 1726294381
}
data: {
"id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941",
"object": "chat.completion.chunk",
"choices": [
{
"index": 0,
"delta": { "content": " is" },
"finish_reason": null,
"logprobs": null
}
],
"usage": null,
"created": 1726294381
}
...
data: {
"id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941",
"object": "chat.completion.chunk",
"choices": [
{
"index": 0,
"delta": {},
"finish_reason": "stop",
"logprobs": null
}
],
"usage": null,
"created": 1726294383
}
data: {
"id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941",
"object": "chat.completion.chunk",
"choices": [],
"usage": {
"prompt_tokens": 8,
"completion_tokens": 4,
"total_tokens": 12
},
"created": 1726294402
}
data: [DONE]
```
```json With tools
data: {
"id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941",
"object": "chat.completion.chunk",
"choices": [
{
"index": 0,
"delta": { "role": "assistant", "content": "This" },
"finish_reason": null,
"logprobs": null
}
],
"usage": null,
"created": 1726294442
}
...
data: {
"id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941",
"object": "chat.completion.chunk",
"choices": [
{
"index": 0,
"delta": {
"role": "assistant",
"tool_calls": [
{
"index": 0,
"id": "call_TARbemDG9CFdwuoaQBTRXiYK",
"type": "function",
"function": { "name": "func", "arguments": "{\"" }
}
]
},
"finish_reason": null,
"logprobs": null
}
],
"usage": null,
"created": 1726294442
}
data: {
"id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941",
"object": "chat.completion.chunk",
"choices": [
{
"index": 0,
"delta": {
"role": "assistant",
"tool_calls": [
{
"index": 0,
"type": "function",
"function": { "arguments": "arg" }
}
]
},
"finish_reason": null,
"logprobs": null
}
],
"usage": null,
"created": 1726294442
}
...
data: {
"id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941",
"object": "chat.completion.chunk",
"choices": [
{
"index": 0,
"delta": {
"role": "assistant",
"tool_calls": [
{
"index": 0,
"type": "function",
"function": { "arguments": "}" }
}
]
},
"finish_reason": null,
"logprobs": null
}
],
"usage": null,
"created": 1726294442
}
data: {
"id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941",
"object": "chat.completion.chunk",
"choices": [
{
"index": 0,
"delta": {},
"finish_reason": "tool_calls",
"logprobs": null
}
],
"usage": null,
"created": 1726294442
}
data: {
"id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941",
"object": "chat.completion.chunk",
"choices": [],
"usage": {
"prompt_tokens": 468,
"completion_tokens": 59,
"total_tokens": 527
},
"created": 1726294443
}
data: [DONE]
```
A unique ID of the chat completion.
The object type, which is always set to `chat.completion.chunk`.
The model to generate the completion.
The index of the choice in the list of generated choices.
Role of the generated message author, in this case `assistant`.
The contents of the assistant message.
The index of tool call being generated.
The ID of the tool call.
The type of the tool, which is always set to `function`.
The name of the function to call.
The arguments for calling the function, generated by the model in JSON format.
Ensure to validate these arguments in your code before invoking the function since the model may not always produce valid JSON.
Termination condition of the generation.
`stop` means the API returned the full chat completions generated by the model without running into any limits.
`length` means the generation exceeded `max_tokens` or the conversation exceeded the max context length.
`tool_calls` means the API has generated tool calls.
Available options: `stop`, `length`, `tool_calls`
Log probability information for the choice.
A list of message content tokens with log probability information.
The token.
The log probability of this token.
A list of integers representing the UTF-8 bytes representation of the token. Useful in instances where characters are represented by multiple tokens and their byte representations must be combined to generate the correct text representation. Can be `null` if there is no bytes representation for the token.
List of the most likely tokens and their log probability, at this token position.
The token.
The log probability of this token.
A list of integers representing the UTF-8 bytes representation of the token. Useful in instances where characters are represented by multiple tokens and their byte representations must be combined to generate the correct text representation. Can be `null` if there is no bytes representation for the token.
Number of tokens in the prompt.
Number of tokens in the generated chat completions.
Total number of tokens used in the request (`prompt_tokens` + `completion_tokens`).
The Unix timestamp (in seconds) for when the token was sampled.
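As an illustration, here is a minimal sketch of reassembling a streamed tool call from these chunks on the client side; it assumes `stream` is an iterable of chunk objects like the ones shown above (for example, from an OpenAI-compatible client called with `stream=True`):
```python
# Accumulate the tool call name and its JSON-stringified arguments across chunks.
name = ""
arguments = ""
for chunk in stream:
    if not chunk.choices:
        continue  # the final usage-only chunk has an empty choices list
    delta = chunk.choices[0].delta
    for tool_call in delta.tool_calls or []:
        if tool_call.function.name:
            name = tool_call.function.name
        arguments += tool_call.function.arguments or ""
print(name, arguments)  # `arguments` is a complete JSON string once the stream ends
```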
# Container completions
Source: https://friendli.ai/docs/openapi/container/completions
post /v1/completions
Generate text based on the given text prompt.
Generate text based on the given text prompt.
When streaming mode is used (i.e., `stream` option is set to `true`), the response is in MIME type `text/event-stream`. Otherwise, the content type is `application/json`.
You can view the schema of the streamed sequence of chunk objects in streaming mode [here](/openapi/container/completions-chunk-object).
# Container completions chunk object
Source: https://friendli.ai/docs/openapi/container/completions-chunk-object
Represents a streamed chunk of a completions response returned by model, based on the provided input.
Represents a streamed chunk of a completions response returned by model, based on the provided input.
```json Response
data: {
"id": "cmpl-26a1e10db8544bc3adb488d2d205288b",
"object": "text_completion",
"choices": [
{
"index": 0,
"text": " such",
"token": 1778,
"finish_reason": null,
"logprobs": null
}
],
"created": 1733382157
}
data: {
"id": "cmpl-26a1e10db8544bc3adb488d2d205288b",
"object": "text_completion",
"choices": [
{
"index": 0,
"text": " as",
"token": 439,
"finish_reason": null,
"logprobs": null
}
],
"created": 1733382157
}
...
data: {
"id": "cmpl-26a1e10db8544bc3adb488d2d205288b",
"object": "text_completion",
"choices": [
{
"index": 0,
"text": "",
"finish_reason": "length",
"logprobs": null
}
],
"created": 1733382157
}
data: {
"id": "cmpl-26a1e10db8544bc3adb488d2d205288b",
"object": "text_completion",
"choices": [],
"usage": {
"prompt_tokens": 5,
"completion_tokens": 10,
"total_tokens": 15
},
"created": 1733382157
}
data: [DONE]
```
A unique ID of the completion.
The object type, which is always set to `text_completion`.
The model to generate the completion.
The index of the choice in the list of generated choices.
The text.
The token.
Termination condition of the generation.
`stop` means the API returned the full completions generated by the model without running into any limits.
`length` means the generation exceeded `max_tokens` or the conversation exceeded the max context length.
Available options: `stop`, `length`
Log probability information for the choice.
The starting character position of each token in the generated text, useful for mapping tokens back to their exact location for detailed analysis.
The log probabilities of each generated token, indicating the model's confidence in selecting each token.
A list of individual tokens generated in the completion, representing segments of text such as words or pieces of words.
A list of dictionaries, where each dictionary represents the top alternative tokens considered by the model at a specific position in the generated text, along with their log probabilities. The number of items in each dictionary matches the value of `logprobs`.
Number of tokens in the prompt.
Number of tokens in the generated completions.
Total number of tokens used in the request (`prompt_tokens` + `completion_tokens`).
The Unix timestamp (in seconds) for when the token was sampled.
# Container detokenization
Source: https://friendli.ai/docs/openapi/container/detokenization
post /v1/detokenize
By giving a list of tokens, generate a detokenized output text string.
By giving a list of tokens, generate a detokenized output text string.
# Container image generations
Source: https://friendli.ai/docs/openapi/container/image-generations
post /v1/images/generations
Given a description, the model generates image.
Given a description, the model generates image.
# Container overview
Source: https://friendli.ai/docs/openapi/container/overview
OpenAPI reference of Friendli Container API.
OpenAPI reference of Friendli Container API.
### Inference
Discover how to generate text through interactive conversations.
Learn how to generate text.
Explore the process of breaking down text into smaller tokens for machine processing.
Learn how to reconstruct tokenized text back into its original, human-readable form.
Learn how to generate images.
# Container tokenization
Source: https://friendli.ai/docs/openapi/container/tokenization
post /v1/tokenize
By giving a text input, generate a tokenized output of token IDs.
By giving a text input, generate a tokenized output of token IDs.
# Add samples
Source: https://friendli.ai/docs/openapi/dataset/add-samples
post /beta/dataset/{dataset_id}/split/{split_id}/sample
Add samples to dataset.
Add samples to dataset.
To request successfully, it is required to enter a **Friendli Token** (e.g. flp\_XXX) in the **Bearer Token** field.
Refer to the [authentication section](/openapi/introduction#authentication) on our introduction page to learn more and [visit here](https://friendli.ai/suite/setting/tokens) to generate your token.
This API is currently in **Beta**.
While we strive to provide a stable and reliable experience, this feature is still under active development.
As a result, you may encounter unexpected behavior or limitations.
We encourage you to provide feedback to help us improve the feature before its official release.
* Feature request & feedback
* Contact support
# Create a new dataset
Source: https://friendli.ai/docs/openapi/dataset/create-a-new-dataset
post /beta/dataset
Create a new dataset.
Create a new dataset.
To request successfully, it is required to enter a **Friendli Token** (e.g. flp\_XXX) in the **Bearer Token** field.
Refer to the [authentication section](/openapi/introduction#authentication) on our introduction page to learn more and [visit here](https://friendli.ai/suite/setting/tokens) to generate your token.
This API is currently in **Beta**.
While we strive to provide a stable and reliable experience, this feature is still under active development.
As a result, you may encounter unexpected behavior or limitations.
We encourage you to provide feedback to help us improve the feature before its official release.
* Feature request & feedback
* Contact support
# Create a new split
Source: https://friendli.ai/docs/openapi/dataset/create-a-split
post /beta/dataset/{dataset_id}/split
Create a new split.
Create a new split.
To request successfully, it is required to enter a **Friendli Token** (e.g. flp\_XXX) in the **Bearer Token** field.
Refer to the [authentication section](/openapi/introduction#authentication) on our introduction page to learn more and [visit here](https://friendli.ai/suite/setting/tokens) to generate your token.
This API is currently in **Beta**.
While we strive to provide a stable and reliable experience, this feature is still under active development.
As a result, you may encounter unexpected behavior or limitations.
We encourage you to provide feedback to help us improve the feature before its official release.
* Feature request & feedback
* Contact support
# Create a new version
Source: https://friendli.ai/docs/openapi/dataset/create-a-version
post /beta/dataset/{dataset_id}/version
Create a new version.
Create a new version.
To request successfully, it is required to enter a **Friendli Token** (e.g. flp\_XXX) in the **Bearer Token** field.
Refer to the [authentication section](/openapi/introduction#authentication) on our introduction page to learn more and [visit here](https://friendli.ai/suite/setting/tokens) to generate your token.
This API is currently in **Beta**.
While we strive to provide a stable and reliable experience, this feature is still under active development.
As a result, you may encounter unexpected behavior or limitations.
We encourage you to provide feedback to help us improve the feature before its official release.
* Feature request & feedback
* Contact support
# Delete a version
Source: https://friendli.ai/docs/openapi/dataset/delete-a-version
delete /beta/dataset/{dataset_id}/version/{version_id}
Delete a version.
Delete a version.
To request successfully, it is required to enter a **Friendli Token** (e.g. flp\_XXX) in the **Bearer Token** field.
Refer to the [authentication section](/openapi/introduction#authentication) on our introduction page to learn more and [visit here](https://friendli.ai/suite/setting/tokens) to generate your token.
This API is currently in **Beta**.
While we strive to provide a stable and reliable experience, this feature is still under active development.
As a result, you may encounter unexpected behavior or limitations.
We encourage you to provide feedback to help us improve the feature before its official release.
* Feature request & feedback
* Contact support
# Delete a dataset
Source: https://friendli.ai/docs/openapi/dataset/delete-dataset
delete /beta/dataset/{dataset_id}
Delete a dataset.
Delete a dataset.
To request successfully, it is required to enter a **Friendli Token** (e.g. flp\_XXX) in the **Bearer Token** field.
Refer to the [authentication section](/openapi/introduction#authentication) on our introduction page to learn more and [visit here](https://friendli.ai/suite/setting/tokens) to generate your token.
This API is currently in **Beta**.
While we strive to provide a stable and reliable experience, this feature is still under active development.
As a result, you may encounter unexpected behavior or limitations.
We encourage you to provide feedback to help us improve the feature before its official release.
# Delete samples
Source: https://friendli.ai/docs/openapi/dataset/delete-samples
post /beta/dataset/{dataset_id}/split/{split_id}/sample/delete
Delete samples.
To request successfully, it is required to enter a **Friendli Token** (e.g. flp\_XXX) in the **Bearer Token** field.
Refer to the [authentication section](/openapi/introduction#authentication) on our introduction page to learn more and [visit here](https://friendli.ai/suite/setting/tokens) to generate your token.
This API is currently in **Beta**.
While we strive to provide a stable and reliable experience, this feature is still under active development.
As a result, you may encounter unexpected behavior or limitations.
We encourage you to provide feedback to help us improve the feature before its official release.
# Delete a split
Source: https://friendli.ai/docs/openapi/dataset/delete-split
delete /beta/dataset/{dataset_id}/split/{split_id}
Delete a split.
To request successfully, it is required to enter a **Friendli Token** (e.g. flp\_XXX) in the **Bearer Token** field.
Refer to the [authentication section](/openapi/introduction#authentication) on our introduction page to learn more and [visit here](https://friendli.ai/suite/setting/tokens) to generate your token.
This API is currently in **Beta**.
While we strive to provide a stable and reliable experience, this feature is still under active development.
As a result, you may encounter unexpected behavior or limitations.
We encourage you to provide feedback to help us improve the feature before its official release.
# Get dataset info
Source: https://friendli.ai/docs/openapi/dataset/get-dataset-info
get /beta/dataset/{dataset_id}
Get dataset info.
To request successfully, it is required to enter a **Friendli Token** (e.g. flp\_XXX) in the **Bearer Token** field.
Refer to the [authentication section](/openapi/introduction#authentication) on our introduction page to learn more and [visit here](https://friendli.ai/suite/setting/tokens) to generate your token.
This API is currently in **Beta**.
While we strive to provide a stable and reliable experience, this feature is still under active development.
As a result, you may encounter unexpected behavior or limitations.
We encourage you to provide feedback to help us improve the feature before its official release.
# Get split info
Source: https://friendli.ai/docs/openapi/dataset/get-split-info
get /beta/dataset/{dataset_id}/split/{split_id}
Get split info.
To request successfully, it is required to enter a **Friendli Token** (e.g. flp\_XXX) in the **Bearer Token** field.
Refer to the [authentication section](/openapi/introduction#authentication) on our introduction page to learn more and [visit here](https://friendli.ai/suite/setting/tokens) to generate your token.
This API is currently in **Beta**.
While we strive to provide a stable and reliable experience, this feature is still under active development.
As a result, you may encounter unexpected behavior or limitations.
We encourage you to provide feedback to help us improve the feature before its official release.
# Get version info
Source: https://friendli.ai/docs/openapi/dataset/get-version-info
get /beta/dataset/{dataset_id}/version/{version_id}
Get version info.
To request successfully, it is required to enter a **Friendli Token** (e.g. flp\_XXX) in the **Bearer Token** field.
Refer to the [authentication section](/openapi/introduction#authentication) on our introduction page to learn more and [visit here](https://friendli.ai/suite/setting/tokens) to generate your token.
This API is currently in **Beta**.
While we strive to provide a stable and reliable experience, this feature is still under active development.
As a result, you may encounter unexpected behavior or limitations.
We encourage you to provide feedback to help us improve the feature before its official release.
# List datasets
Source: https://friendli.ai/docs/openapi/dataset/list-datasets
get /beta/dataset
List datasets.
To request successfully, it is required to enter a **Friendli Token** (e.g. flp\_XXX) in the **Bearer Token** field.
Refer to the [authentication section](/openapi/introduction#authentication) on our introduction page to learn more and [visit here](https://friendli.ai/suite/setting/tokens) to generate your token.
This API is currently in **Beta**.
While we strive to provide a stable and reliable experience, this feature is still under active development.
As a result, you may encounter unexpected behavior or limitations.
We encourage you to provide feedback to help us improve the feature before its official release.
# List samples
Source: https://friendli.ai/docs/openapi/dataset/list-samples
get /beta/dataset/{dataset_id}/split/{split_id}/sample
List samples.
To request successfully, it is required to enter a **Friendli Token** (e.g. flp\_XXX) in the **Bearer Token** field.
Refer to the [authentication section](/openapi/introduction#authentication) on our introduction page to learn more and [visit here](https://friendli.ai/suite/setting/tokens) to generate your token.
This API is currently in **Beta**.
While we strive to provide a stable and reliable experience, this feature is still under active development.
As a result, you may encounter unexpected behavior or limitations.
We encourage you to provide feedback to help us improve the feature before its official release.
# List splits
Source: https://friendli.ai/docs/openapi/dataset/list-splits
get /beta/dataset/{dataset_id}/split
List splits.
To request successfully, it is required to enter a **Friendli Token** (e.g. flp\_XXX) in the **Bearer Token** field.
Refer to the [authentication section](/openapi/introduction#authentication) on our introduction page to learn more and [visit here](https://friendli.ai/suite/setting/tokens) to generate your token.
This API is currently in **Beta**.
While we strive to provide a stable and reliable experience, this feature is still under active development.
As a result, you may encounter unexpected behavior or limitations.
We encourage you to provide feedback to help us improve the feature before its official release.
# List versions
Source: https://friendli.ai/docs/openapi/dataset/list-versions
get /beta/dataset/{dataset_id}/version
List versions.
To request successfully, it is required to enter a **Friendli Token** (e.g. flp\_XXX) in the **Bearer Token** field.
Refer to the [authentication section](/openapi/introduction#authentication) on our introduction page to learn more and [visit here](https://friendli.ai/suite/setting/tokens) to generate your token.
This API is currently in **Beta**.
While we strive to provide a stable and reliable experience, this feature is still under active development.
As a result, you may encounter unexpected behavior or limitations.
We encourage you to provide feedback to help us improve the feature before its official release.
# Dataset overview
Source: https://friendli.ai/docs/openapi/dataset/overview
OpenAPI reference of Friendli Dataset API.
### Dataset Management (Beta)
Discover how to list datasets.
Discover how to list versions of a dataset.
Discover how to list splits of a dataset version.
Discover how to list samples in a dataset split.
Discover how to get information about a dataset.
Discover how to get information about a dataset version.
Discover how to get information about a dataset split.
Discover how to create a new dataset.
Discover how to create a new version of a dataset.
Discover how to create a new split in a dataset.
Discover how to add samples to a dataset.
Discover how to delete samples from a dataset.
Discover how to update samples in a dataset.
Discover how to delete a dataset version.
Discover how to delete a dataset.
Discover how to delete a dataset split.
# Update samples
Source: https://friendli.ai/docs/openapi/dataset/update-samples
put /beta/dataset/{dataset_id}/split/{split_id}/sample
Update samples.
To request successfully, it is required to enter a **Friendli Token** (e.g. flp\_XXX) in the **Bearer Token** field.
Refer to the [authentication section](/openapi/introduction#authentication) on our introduction page to learn more and [visit here](https://friendli.ai/suite/setting/tokens) to generate your token.
This API is currently in **Beta**.
While we strive to provide a stable and reliable experience, this feature is still under active development.
As a result, you may encounter unexpected behavior or limitations.
We encourage you to provide feedback to help us improve the feature before its official release.
# Dedicated create endpoint
Source: https://friendli.ai/docs/openapi/dedicated/endpoint/create
post /dedicated/beta/endpoint
Create a Dedicated Endpoint deployment for a Hugging Face model.
To request successfully, it is mandatory to enter a **Friendli Token** (e.g. flp\_XXX) value in the **Bearer Token** field.
Refer to the [authentication section](/openapi/introduction#authentication) on our introduction page to learn how to acquire this variable and [visit here](https://friendli.ai/suite/setting/tokens) to generate your token.
This API is currently in **Beta**.
While we strive to provide a stable and reliable experience, this feature is still under active development.
As a result, you may encounter unexpected behavior or limitations.
We encourage you to provide feedback to help us improve the feature before its official release.
# Dedicated delete endpoint
Source: https://friendli.ai/docs/openapi/dedicated/endpoint/delete
delete /dedicated/beta/endpoint/{endpoint_id}
Delete an endpoint.
To request successfully, it is mandatory to enter a **Friendli Token** (e.g. flp\_XXX) value in the **Bearer Token** field.
Refer to the [authentication section](/openapi/introduction#authentication) on our introduction page to learn how to acquire this variable and [visit here](https://friendli.ai/suite/setting/tokens) to generate your token.
This API is currently in **Beta**.
While we strive to provide a stable and reliable experience, this feature is still under active development.
As a result, you may encounter unexpected behavior or limitations.
We encourage you to provide feedback to help us improve the feature before its official release.
# Dedicated get endpoint
Source: https://friendli.ai/docs/openapi/dedicated/endpoint/get-spec
get /dedicated/beta/endpoint/{endpoint_id}
Given an endpoint ID, return its specification.
To request successfully, it is mandatory to enter a **Friendli Token** (e.g. flp\_XXX) value in the **Bearer Token** field.
Refer to the [authentication section](/openapi/introduction#authentication) on our introduction page to learn how to acquire this variable and [visit here](https://friendli.ai/suite/setting/tokens) to generate your token.
This API is currently in **Beta**.
While we strive to provide a stable and reliable experience, this feature is still under active development.
As a result, you may encounter unexpected behavior or limitations.
We encourage you to provide feedback to help us improve the feature before its official release.
# Dedicated get endpoint status
Source: https://friendli.ai/docs/openapi/dedicated/endpoint/get-status
get /dedicated/beta/endpoint/{endpoint_id}/status
Given an endpoint ID, return its current status.
To request successfully, it is mandatory to enter a **Friendli Token** (e.g. flp\_XXX) value in the **Bearer Token** field.
Refer to the [authentication section](/openapi/introduction#authentication) on our introduction page to learn how to acquire this variable and [visit here](https://friendli.ai/suite/setting/tokens) to generate your token.
This API is currently in **Beta**.
While we strive to provide a stable and reliable experience, this feature is still under active development.
As a result, you may encounter unexpected behavior or limitations.
We encourage you to provide feedback to help us improve the feature before its official release.
# Dedicated get endpoint version
Source: https://friendli.ai/docs/openapi/dedicated/endpoint/get-version
get /dedicated/beta/endpoint/{endpoint_id}/version
Given an endpoint ID, return its version history.
To request successfully, it is mandatory to enter a **Friendli Token** (e.g. flp\_XXX) value in the **Bearer Token** field.
Refer to the [authentication section](/openapi/introduction#authentication) on our introduction page to learn how to acquire this variable and [visit here](https://friendli.ai/suite/setting/tokens) to generate your token.
This API is currently in **Beta**.
While we strive to provide a stable and reliable experience, this feature is still under active development.
As a result, you may encounter unexpected behavior or limitations.
We encourage you to provide feedback to help us improve the feature before its official release.
# Dedicated list endpoints
Source: https://friendli.ai/docs/openapi/dedicated/endpoint/list
get /dedicated/beta/endpoint
List Dedicated Endpoint deployments.
To request successfully, it is mandatory to enter a **Friendli Token** (e.g. flp\_XXX) value in the **Bearer Token** field.
Refer to the [authentication section](/openapi/introduction#authentication) on our introduction page to learn how to acquire this variable and [visit here](https://friendli.ai/suite/setting/tokens) to generate your token.
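For illustration, the snippet below lists Dedicated Endpoint deployments with a plain HTTP GET. It is a minimal sketch under two assumptions: the beta management routes share the `https://api.friendli.ai` prefix, and the Friendli Token is read from a `FRIENDLI_TOKEN` environment variable.

```python Example
import os

import requests

# Minimal sketch: list Dedicated Endpoint deployments (beta API).
# Assumes the beta management routes share the https://api.friendli.ai prefix.
FRIENDLI_TOKEN = os.environ["FRIENDLI_TOKEN"]

response = requests.get(
    "https://api.friendli.ai/dedicated/beta/endpoint",
    headers={"Authorization": f"Bearer {FRIENDLI_TOKEN}"},
    timeout=30,
)
response.raise_for_status()
print(response.json())
```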
This API is currently in **Beta**.
While we strive to provide a stable and reliable experience, this feature is still under active development.
As a result, you may encounter unexpected behavior or limitations.
We encourage you to provide feedback to help us improve the feature before its official release.
# Dedicated restart endpoint
Source: https://friendli.ai/docs/openapi/dedicated/endpoint/restart
put /dedicated/beta/endpoint/{endpoint_id}/restart
Restart a failed or terminated Dedicated Endpoint.
To request successfully, it is mandatory to enter a **Friendli Token** (e.g. flp\_XXX) value in the **Bearer Token** field.
Refer to the [authentication section](/openapi/introduction#authentication) on our introduction page to learn how to acquire this variable and [visit here](https://friendli.ai/suite/setting/tokens) to generate your token.
This API is currently in **Beta**.
While we strive to provide a stable and reliable experience, this feature is still under active development.
As a result, you may encounter unexpected behavior or limitations.
We encourage you to provide feedback to help us improve the feature before its official release.
# Dedicated sleep endpoint
Source: https://friendli.ai/docs/openapi/dedicated/endpoint/sleep
put /dedicated/beta/endpoint/{endpoint_id}/sleep
Put a Dedicated Endpoint to sleep mode.
To request successfully, it is mandatory to enter a **Friendli Token** (e.g. flp\_XXX) value in the **Bearer Token** field.
Refer to the [authentication section](/openapi/introduction#authentication) on our introduction page to learn how to acquire this variable and [visit here](https://friendli.ai/suite/setting/tokens) to generate your token.
This API is currently in **Beta**.
While we strive to provide a stable and reliable experience, this feature is still under active development.
As a result, you may encounter unexpected behavior or limitations.
We encourage you to provide feedback to help us improve the feature before its official release.
# Dedicated terminate endpoint
Source: https://friendli.ai/docs/openapi/dedicated/endpoint/terminate
put /dedicated/beta/endpoint/{endpoint_id}/terminate
Terminate an endpoint.
To request successfully, it is mandatory to enter a **Friendli Token** (e.g. flp\_XXX) value in the **Bearer Token** field.
Refer to the [authentication section](/openapi/introduction#authentication) on our introduction page to learn how to acquire this variable and [visit here](https://friendli.ai/suite/setting/tokens) to generate your token.
This API is currently in **Beta**.
While we strive to provide a stable and reliable experience, this feature is still under active development.
As a result, you may encounter unexpected behavior or limitations.
We encourage you to provide feedback to help us improve the feature before its official release.
# Dedicated update endpoint
Source: https://friendli.ai/docs/openapi/dedicated/endpoint/update
put /dedicated/beta/endpoint/{endpoint_id}
Update a Dedicated Endpoint deployment with new configuration.
To request successfully, it is mandatory to enter a **Friendli Token** (e.g. flp\_XXX) value in the **Bearer Token** field.
Refer to the [authentication section](/openapi/introduction#authentication) on our introduction page to learn how to acquire this variable and [visit here](https://friendli.ai/suite/setting/tokens) to generate your token.
This API is currently in **Beta**.
While we strive to provide a stable and reliable experience, this feature is still under active development.
As a result, you may encounter unexpected behavior or limitations.
We encourage you to provide feedback to help us improve the feature before its official release.
# Dedicated wake endpoint
Source: https://friendli.ai/docs/openapi/dedicated/endpoint/wake
put /dedicated/beta/endpoint/{endpoint_id}/wake
Wake up a sleeping Dedicated Endpoint.
To request successfully, it is mandatory to enter a **Friendli Token** (e.g. flp\_XXX) value in the **Bearer Token** field.
Refer to the [authentication section](/openapi/introduction#authentication) on our introduction page to learn how to acquire this variable and [visit here](https://friendli.ai/suite/setting/tokens) to generate your token.
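As a sketch of how the sleep, wake, and status routes fit together, the example below puts an endpoint to sleep, wakes it later, and then checks its status. The `https://api.friendli.ai` prefix for the beta routes and the placeholder endpoint ID are assumptions.

```python Example
import os

import requests

# Minimal sketch: sleep, wake, and then check a Dedicated Endpoint (beta API).
# ENDPOINT_ID is a placeholder; the https://api.friendli.ai prefix is assumed.
FRIENDLI_TOKEN = os.environ["FRIENDLI_TOKEN"]
ENDPOINT_ID = "YOUR_ENDPOINT_ID"
BASE = "https://api.friendli.ai/dedicated/beta/endpoint"
HEADERS = {"Authorization": f"Bearer {FRIENDLI_TOKEN}"}

resp = requests.put(f"{BASE}/{ENDPOINT_ID}/sleep", headers=HEADERS, timeout=30)
resp.raise_for_status()

# ... later, when the endpoint is needed again ...
resp = requests.put(f"{BASE}/{ENDPOINT_ID}/wake", headers=HEADERS, timeout=30)
resp.raise_for_status()

# Check the status route to confirm the endpoint is coming back up.
status = requests.get(f"{BASE}/{ENDPOINT_ID}/status", headers=HEADERS, timeout=30)
print(status.json())
```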
This API is currently in **Beta**.
While we strive to provide a stable and reliable experience, this feature is still under active development.
As a result, you may encounter unexpected behavior or limitations.
We encourage you to provide feedback to help us improve the feature before its official release.
# Dedicated create endpoint from W&B artifact
Source: https://friendli.ai/docs/openapi/dedicated/endpoint/wandb-artifact-create
post /dedicated/endpoint/wandb-artifact-create
Create an endpoint from Weights & Biases artifact.
To request successfully, it is mandatory to enter a **Friendli Token** (e.g. flp\_XXX) value in the **Bearer Token** field.
Refer to the [authentication section](/openapi/introduction#authentication) on our introduction page to learn how to acquire this variable and [visit here](https://friendli.ai/suite/setting/tokens) to generate your token.
# Dedicated audio transcriptions
Source: https://friendli.ai/docs/openapi/dedicated/inference/audio-transcriptions
post /dedicated/v1/audio/transcriptions
Given an audio file, the model transcribes it into text.
To request successfully, it is mandatory to enter a **Friendli Token** (e.g. flp\_XXX) value in the **Bearer Token** field.
Refer to the [authentication section](/openapi/introduction#authentication) on our introduction page to learn how to acquire this variable and [visit here](https://friendli.ai/suite/setting/tokens) to generate your token.
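A minimal request sketch is shown below. The multipart field names (`file`, `model`) follow the common OpenAI-style transcription schema and are assumptions here, as is using the endpoint ID as the model value; adjust them to the actual request schema of your endpoint.

```python Example
import os

import requests

# Minimal sketch: transcribe a local audio file with a Dedicated Endpoint.
# The multipart field names ("file", "model") are assumptions borrowed from the
# OpenAI-style transcription schema; ENDPOINT_ID is a placeholder.
FRIENDLI_TOKEN = os.environ["FRIENDLI_TOKEN"]
ENDPOINT_ID = "YOUR_ENDPOINT_ID"

with open("sample.wav", "rb") as audio:
    response = requests.post(
        "https://api.friendli.ai/dedicated/v1/audio/transcriptions",
        headers={"Authorization": f"Bearer {FRIENDLI_TOKEN}"},
        files={"file": audio},
        data={"model": ENDPOINT_ID},
        timeout=120,
    )
response.raise_for_status()
print(response.json())
```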
This API is currently in **Beta**.
While we strive to provide a stable and reliable experience, this feature is still under active development.
As a result, you may encounter unexpected behavior or limitations.
We encourage you to provide feedback to help us improve the feature before its official release.
# Dedicated chat completions
Source: https://friendli.ai/docs/openapi/dedicated/inference/chat-completions
post /dedicated/v1/chat/completions
Given a list of messages forming a conversation, the model generates a response.
To request successfully, it is mandatory to enter a **Friendli Token** (e.g. flp\_XXX) value in the **Bearer Token** field.
Refer to the [authentication section](/openapi/introduction#authentication) on our introduction page to learn how to acquire this variable and [visit here](https://friendli.ai/suite/setting/tokens) to generate your token.
When streaming mode is used (i.e., `stream` option is set to `true`), the response is in MIME type `text/event-stream`. Otherwise, the content type is `application/json`.
You can view the schema of the streamed sequence of chunk objects in streaming mode [here](/openapi/dedicated/inference/chat-completions-chunk-object).
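For illustration, here is a minimal streaming sketch using Python's `requests`. It assumes the endpoint ID is passed as the `model` value and that the request follows the OpenAI-style `messages` schema; both are assumptions to adapt to your setup.

```python Example
import json
import os

import requests

# Minimal sketch: stream a chat completion from a Dedicated Endpoint.
# ENDPOINT_ID is a placeholder; the OpenAI-style messages schema is assumed.
FRIENDLI_TOKEN = os.environ["FRIENDLI_TOKEN"]
ENDPOINT_ID = "YOUR_ENDPOINT_ID"

with requests.post(
    "https://api.friendli.ai/dedicated/v1/chat/completions",
    headers={"Authorization": f"Bearer {FRIENDLI_TOKEN}"},
    json={
        "model": ENDPOINT_ID,
        "messages": [{"role": "user", "content": "Hello!"}],
        "stream": True,
    },
    stream=True,
    timeout=60,
) as response:
    response.raise_for_status()
    for line in response.iter_lines(decode_unicode=True):
        if not line or not line.startswith("data: "):
            continue
        payload = line[len("data: "):]
        if payload == "[DONE]":
            break
        chunk = json.loads(payload)
        for choice in chunk.get("choices", []):
            print(choice["delta"].get("content", ""), end="", flush=True)
```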
# Dedicated chat completions chunk object
Source: https://friendli.ai/docs/openapi/dedicated/inference/chat-completions-chunk-object
Represents a streamed chunk of a chat completions response returned by the model, based on the provided input.
```json Response
data: {
"id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941",
"model": "(endpoint-id)",
"object": "chat.completion.chunk",
"choices": [
{
"index": 0,
"delta": { "role": "assistant", "content": "This" },
"finish_reason": null,
"logprobs": null
}
],
"usage": null,
"created": 1726294381
}
data: {
"id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941",
"model": "(endpoint-id)",
"object": "chat.completion.chunk",
"choices": [
{
"index": 0,
"delta": { "content": " is" },
"finish_reason": null,
"logprobs": null
}
],
"usage": null,
"created": 1726294381
}
...
data: {
"id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941",
"model": "(endpoint-id)",
"object": "chat.completion.chunk",
"choices": [
{
"index": 0,
"delta": {},
"finish_reason": "stop",
"logprobs": null
}
],
"usage": null,
"created": 1726294383
}
data: {
"id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941",
"model": "(endpoint-id)",
"object": "chat.completion.chunk",
"choices": [],
"usage": {
"prompt_tokens": 8,
"completion_tokens": 4,
"total_tokens": 12
},
"created": 1726294402
}
data: [DONE]
```
```json With tools
data: {
"id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941",
"model": "(endpoint-id)",
"object": "chat.completion.chunk",
"choices": [
{
"index": 0,
"delta": { "role": "assistant", "content": "This" },
"finish_reason": null,
"logprobs": null
}
],
"usage": null,
"created": 1726294442
}
...
data: {
"id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941",
"model": "(endpoint-id)",
"object": "chat.completion.chunk",
"choices": [
{
"index": 0,
"delta": {
"role": "assistant",
"tool_calls": [
{
"index": 0,
"id": "call_TARbemDG9CFdwuoaQBTRXiYK",
"type": "function",
"function": { "name": "func", "arguments": "{\"" }
}
]
},
"finish_reason": null,
"logprobs": null
}
],
"usage": null,
"created": 1726294442
}
data: {
"id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941",
"model": "(endpoint-id)",
"object": "chat.completion.chunk",
"choices": [
{
"index": 0,
"delta": {
"role": "assistant",
"tool_calls": [
{
"index": 0,
"type": "function",
"function": { "arguments": "arg" }
}
]
},
"finish_reason": null,
"logprobs": null
}
],
"usage": null,
"created": 1726294442
}
...
data: {
"id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941",
"model": "(endpoint-id)",
"object": "chat.completion.chunk",
"choices": [
{
"index": 0,
"delta": {
"role": "assistant",
"tool_calls": [
{
"index": 0,
"type": "function",
"function": { "arguments": "}" }
}
]
},
"finish_reason": null,
"logprobs": null
}
],
"usage": null,
"created": 1726294442
}
data: {
"id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941",
"model": "(endpoint-id)",
"object": "chat.completion.chunk",
"choices": [
{
"index": 0,
"delta": {},
"finish_reason": "tool_calls",
"logprobs": null
}
],
"usage": null,
"created": 1726294442
}
data: {
"id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941",
"model": "(endpoint-id)",
"object": "chat.completion.chunk",
"choices": [],
"usage": {
"prompt_tokens": 468,
"completion_tokens": 59,
"total_tokens": 527
},
"created": 1726294443
}
data: [DONE]
```
A unique ID of the chat completion.
The object type, which is always set to `chat.completion.chunk`.
The model used to generate the completion. For dedicated endpoints, this is the endpoint ID.
The index of the choice in the list of generated choices.
Role of the generated message author, in this case `assistant`.
The contents of the assistant message.
The index of tool call being generated.
The ID of the tool call.
The type of the tool, which is always set to `function`.
The name of the function to call.
The arguments for calling the function, generated by the model in JSON format.
Be sure to validate these arguments in your code before invoking the function, since the model may not always produce valid JSON.
Termination condition of the generation.
`stop` means the API returned the full chat completions generated by the model without running into any limits.
`length` means the generation exceeded `max_tokens` or the conversation exceeded the max context length.
`tool_calls` means the API has generated tool calls.
Available options: `stop`, `length`, `tool_calls`
Log probability information for the choice.
A list of message content tokens with log probability information.
The token.
The log probability of this token.
A list of integers representing the UTF-8 bytes representation of the token. Useful in instances where characters are represented by multiple tokens and their byte representations must be combined to generate the correct text representation. Can be `null` if there is no bytes representation for the token.
List of the most likely tokens and their log probability, at this token position.
The token.
The log probability of this token.
A list of integers representing the UTF-8 bytes representation of the token. Useful in instances where characters are represented by multiple tokens and their byte representations must be combined to generate the correct text representation. Can be `null` if there is no bytes representation for the token.
Number of tokens in the prompt.
Number of tokens in the generated chat completions.
Total number of tokens used in the request (`prompt_tokens` + `completion_tokens`).
The Unix timestamp (in seconds) for when the token was sampled.
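To make the chunk layout concrete, the sketch below folds a sequence of already-parsed chunk objects into the final assistant message, concatenating `delta.content` fragments and grouping tool call `function.arguments` fragments by their `index`. It assumes the chunks have been parsed from the `data:` lines shown above.

```python Example
# Minimal sketch: fold parsed chat.completion.chunk objects into a final message.
# `chunks` is assumed to be an iterable of dicts parsed from the `data:` lines above.
def accumulate(chunks):
    content_parts = []
    tool_calls = {}  # tool call index -> {"id", "name", "arguments"}
    finish_reason = None
    usage = None

    for chunk in chunks:
        if chunk.get("usage"):
            usage = chunk["usage"]
        for choice in chunk.get("choices", []):
            delta = choice.get("delta", {})
            if delta.get("content"):
                content_parts.append(delta["content"])
            for call in delta.get("tool_calls") or []:
                entry = tool_calls.setdefault(
                    call["index"], {"id": None, "name": None, "arguments": ""}
                )
                if call.get("id"):
                    entry["id"] = call["id"]
                function = call.get("function", {})
                if function.get("name"):
                    entry["name"] = function["name"]
                entry["arguments"] += function.get("arguments", "")
            if choice.get("finish_reason"):
                finish_reason = choice["finish_reason"]

    return "".join(content_parts), tool_calls, finish_reason, usage
```

Remember to parse each accumulated `arguments` string with `json.loads` before invoking your function, and to handle the case where it is not valid JSON.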
# Dedicated completions
Source: https://friendli.ai/docs/openapi/dedicated/inference/completions
post /dedicated/v1/completions
Generate text based on the given text prompt.
To request successfully, it is mandatory to enter a **Friendli Token** (e.g. flp\_XXX) value in the **Bearer Token** field.
Refer to the [authentication section](/openapi/introduction#authentication) on our introduction page to learn how to acquire this variable and [visit here](https://friendli.ai/suite/setting/tokens) to generate your token.
When streaming mode is used (i.e., `stream` option is set to `true`), the response is in MIME type `text/event-stream`. Otherwise, the content type is `application/json`.
You can view the schema of the streamed sequence of chunk objects in streaming mode [here](/openapi/dedicated/inference/completions-chunk-object).
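As an illustration, a minimal non-streaming request might look like the following. The `prompt` and `max_tokens` fields follow the common OpenAI-style completions schema and, like the endpoint ID used as the `model` value, should be treated as assumptions.

```python Example
import os

import requests

# Minimal sketch: non-streaming text completion against a Dedicated Endpoint.
# The body fields (prompt, max_tokens) follow the OpenAI-style schema and are
# assumptions for illustration; ENDPOINT_ID is a placeholder.
FRIENDLI_TOKEN = os.environ["FRIENDLI_TOKEN"]
ENDPOINT_ID = "YOUR_ENDPOINT_ID"

response = requests.post(
    "https://api.friendli.ai/dedicated/v1/completions",
    headers={"Authorization": f"Bearer {FRIENDLI_TOKEN}"},
    json={"model": ENDPOINT_ID, "prompt": "Once upon a time", "max_tokens": 64},
    timeout=60,
)
response.raise_for_status()
print(response.json()["choices"][0]["text"])
```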
# Dedicated completions chunk object
Source: https://friendli.ai/docs/openapi/dedicated/inference/completions-chunk-object
Represents a streamed chunk of a completions response returned by the model, based on the provided input.
```json Response
data: {
"id": "cmpl-26a1e10db8544bc3adb488d2d205288b",
"model": "(endpoint-id)",
"object": "text_completion",
"choices": [
{
"index": 0,
"text": " such",
"token": 1778,
"finish_reason": null,
"logprobs": null
}
],
"created": 1733382157
}
data: {
"id": "cmpl-26a1e10db8544bc3adb488d2d205288b",
"model": "(endpoint-id)",
"object": "text_completion",
"choices": [
{
"index": 0,
"text": " as",
"token": 439,
"finish_reason": null,
"logprobs": null
}
],
"created": 1733382157
}
...
data: {
"id": "cmpl-26a1e10db8544bc3adb488d2d205288b",
"model": "(endpoint-id)",
"object": "text_completion",
"choices": [
{
"index": 0,
"text": "",
"finish_reason": "length",
"logprobs": null
}
],
"created": 1733382157
}
data: {
"id": "cmpl-26a1e10db8544bc3adb488d2d205288b",
"model": "(endpoint-id)",
"object": "text_completion",
"choices": [],
"usage": {
"prompt_tokens": 5,
"completion_tokens": 10,
"total_tokens": 15
},
"created": 1733382157
}
data: [DONE]
```
A unique ID of the completion.
The object type, which is always set to `text_completion`.
The model used to generate the completion. For dedicated endpoints, this is the endpoint ID.
The index of the choice in the list of generated choices.
The text.
The token.
Termination condition of the generation.
`stop` means the API returned the full completions generated by the model without running into any limits.
`length` means the generation exceeded `max_tokens` or the conversation exceeded the max context length.
Available options: `stop`, `length`
Log probability information for the choice.
The starting character position of each token in the generated text, useful for mapping tokens back to their exact location for detailed analysis.
The log probabilities of each generated token, indicating the model's confidence in selecting each token.
A list of individual tokens generated in the completion, representing segments of text such as words or pieces of words.
A list of dictionaries, where each dictionary represents the top alternative tokens considered by the model at a specific position in the generated text, along with their log probabilities. The number of items in each dictionary matches the value of `logprobs`.
Number of tokens in the prompt.
Number of tokens in the generated completions.
Total number of tokens used in the request (`prompt_tokens` + `completion_tokens`).
The Unix timestamp (in seconds) for when the token was sampled.
# Dedicated detokenization
Source: https://friendli.ai/docs/openapi/dedicated/inference/detokenization
post /dedicated/v1/detokenize
Given a list of tokens, generate a detokenized output text string.
To request successfully, it is mandatory to enter a **Friendli Token** (e.g. flp\_XXX) value in the **Bearer Token** field.
Refer to the [authentication section](/openapi/introduction#authentication) on our introduction page to learn how to acquire this variable and [visit here](https://friendli.ai/suite/setting/tokens) to generate your token.
# Dedicated image generations
Source: https://friendli.ai/docs/openapi/dedicated/inference/image-generations
post /dedicated/v1/images/generations
Given a description, the model generates image(s).
To request successfully, it is mandatory to enter a **Friendli Token** (e.g. flp\_XXX) value in the **Bearer Token** field.
Refer to the [authentication section](/openapi/introduction#authentication) on our introduction page to learn how to acquire this variable and [visit here](https://friendli.ai/suite/setting/tokens) to generate your token.
This API is currently in **Beta**.
While we strive to provide a stable and reliable experience, this feature is still under active development.
As a result, you may encounter unexpected behavior or limitations.
We encourage you to provide feedback to help us improve the feature before its official release.
# Dedicated tokenization
Source: https://friendli.ai/docs/openapi/dedicated/inference/tokenization
post /dedicated/v1/tokenize
Given a text input, generate a tokenized output of token IDs.
To request successfully, it is mandatory to enter a **Friendli Token** (e.g. flp\_XXX) value in the **Bearer Token** field.
Refer to the [authentication section](/openapi/introduction#authentication) on our introduction page to learn how to acquire this variable and [visit here](https://friendli.ai/suite/setting/tokens) to generate your token.
# Dedicated overview
Source: https://friendli.ai/docs/openapi/dedicated/overview
OpenAPI reference of Friendli Dedicated Endpoints API.
### Inference
Discover how to generate text through interactive conversations.
Learn how to generate text.
Explore the process of breaking down text into smaller tokens for machine processing.
Learn how to reconstruct tokenized text back into its original, human-readable form.
Learn how to generate images.
### Endpoint (Beta)
List Dedicated Endpoint deployments.
Given an endpoint ID, return its specification.
Given an endpoint ID, return its version history.
Given an endpoint ID, return its current status.
Create a Dedicated Endpoint deployment for a Hugging Face model.
Create an endpoint from Weights & Biases artifact.
Update a Dedicated Endpoint deployment with new configuration.
Terminate an endpoint.
Restart a failed or terminated Dedicated Endpoint.
Put a Dedicated Endpoint to sleep mode.
Wake up a sleeping Dedicated Endpoint.
Delete an endpoint.
# Complete file upload
Source: https://friendli.ai/docs/openapi/file/complete-file-upload
patch /beta/file/{file_id}
Complete file upload.
To request successfully, it is required to enter a **Friendli Token** (e.g. flp\_XXX) in the **Bearer Token** field.
Refer to the [authentication section](/openapi/introduction#authentication) on our introduction page to learn more and [visit here](https://friendli.ai/suite/setting/tokens) to generate your token.
This API is currently in **Beta**.
While we strive to provide a stable and reliable experience, this feature is still under active development.
As a result, you may encounter unexpected behavior or limitations.
We encourage you to provide feedback to help us improve the feature before its official release.
# Get file download URL
Source: https://friendli.ai/docs/openapi/file/get-file-download-url
get /beta/file/{file_id}/download_url
Get file download URL.
To request successfully, it is required to enter a **Friendli Token** (e.g. flp\_XXX) in the **Bearer Token** field.
Refer to the [authentication section](/openapi/introduction#authentication) on our introduction page to learn more and [visit here](https://friendli.ai/suite/setting/tokens) to generate your token.
This API is currently in **Beta**.
While we strive to provide a stable and reliable experience, this feature is still under active development.
As a result, you may encounter unexpected behavior or limitations.
We encourage you to provide feedback to help us improve the feature before its official release.
# Get file info
Source: https://friendli.ai/docs/openapi/file/get-file-info
get /beta/file/{file_id}
Get file info.
To request successfully, it is required to enter a **Friendli Token** (e.g. flp\_XXX) in the **Bearer Token** field.
Refer to the [authentication section](/openapi/introduction#authentication) on our introduction page to learn more and [visit here](https://friendli.ai/suite/setting/tokens) to generate your token.
This API is currently in **Beta**.
While we strive to provide a stable and reliable experience, this feature is still under active development.
As a result, you may encounter unexpected behavior or limitations.
We encourage you to provide feedback to help us improve the feature before its official release.
# Initiate file upload
Source: https://friendli.ai/docs/openapi/file/init-file-upload
post /beta/file
Initiate file upload.
To request successfully, it is required to enter a **Friendli Token** (e.g. flp\_XXX) in the **Bearer Token** field.
Refer to the [authentication section](/openapi/introduction#authentication) on our introduction page to learn more and [visit here](https://friendli.ai/suite/setting/tokens) to generate your token.
This API is currently in **Beta**.
While we strive to provide a stable and reliable experience, this feature is still under active development.
As a result, you may encounter unexpected behavior or limitations.
We encourage you to provide feedback to help us improve the feature before its official release.
# File overview
Source: https://friendli.ai/docs/openapi/file/overview
OpenAPI reference of Friendli File API.
### File Management (Beta)
Discover how to initiate a file upload.
Discover how to complete a file upload.
Discover how to get information about a file.
Discover how to get a download URL for a file.
# API Reference
Source: https://friendli.ai/docs/openapi/introduction
OpenAPI reference of Friendli Suite API. You can interact with the API through HTTP requests from any language.
Send inference requests to the URI with the prefix `https://api.friendli.ai`.
For more information, visit [FriendliAI](https://friendli.ai).
## Authentication
When using Friendli Suite API for inference requests, you need to provide a **Friendli Token** for authentication and authorization purposes.
A Friendli Token serves as an alternative method of authorization to signing in with an email and a password.
You can generate a new Friendli Token through the [Friendli Suite](https://friendli.ai/suite), at your **'Personal settings'** page by following the steps below.
1. Go to the [Friendli Suite](https://friendli.ai/suite) and sign in with your account.
2. Click the profile icon at the top-right corner of the page.
3. Click **'Personal settings'** menu.
4. Go to the **'Tokens'** tab on the navigation bar.
5. Create a new Friendli Token by clicking the **'Create token'** button.
6. Copy the token and save it in a safe place. You will not be able to see this token again once the page is refreshed.
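Once you have a token, pass it as a Bearer token in the `Authorization` header of every request. The sketch below calls the serverless chat completions route documented later in this reference; the model name is taken from the response examples in this document and the non-streaming response shape follows the OpenAI-style schema, so treat both as assumptions and check the pricing table for currently available models.

```python Example
import os

import requests

# Minimal sketch: authenticate a request with the Friendli Token as a Bearer token.
# The model name is taken from the examples in this reference and may change over time.
FRIENDLI_TOKEN = os.environ["FRIENDLI_TOKEN"]  # e.g. flp_XXX, created under Personal settings > Tokens

response = requests.post(
    "https://api.friendli.ai/serverless/v1/chat/completions",
    headers={"Authorization": f"Bearer {FRIENDLI_TOKEN}"},
    json={
        "model": "meta-llama-3.1-8b-instruct",
        "messages": [{"role": "user", "content": "Hello!"}],
    },
    timeout=60,
)
response.raise_for_status()
print(response.json()["choices"][0]["message"]["content"])
```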
# Serverless chat completions
Source: https://friendli.ai/docs/openapi/serverless/chat-completions
post /serverless/v1/chat/completions
Given a list of messages forming a conversation, the model generates a response.
See available models at [this pricing table](/guides/serverless_endpoints/pricing#billing-methods).
To request successfully, it is mandatory to enter a **Friendli Token** (e.g. flp\_XXX) value in the **Bearer Token** field.
Refer to the [authentication section](/openapi/introduction#authentication) on our introduction page to learn how to acquire this variable and [visit here](https://friendli.ai/suite/setting/tokens) to generate your token.
When streaming mode is used (i.e., `stream` option is set to `true`), the response is in MIME type `text/event-stream`. Otherwise, the content type is `application/json`.
You can view the schema of the streamed sequence of chunk objects in streaming mode [here](/openapi/serverless/chat-completions-chunk-object).
You can explore examples on the [Friendli Serverless Endpoints](https://friendli.ai/get-started/serverless-endpoints) playground and adjust settings with just a few clicks.
# Serverless chat completions chunk object
Source: https://friendli.ai/docs/openapi/serverless/chat-completions-chunk-object
Represents a streamed chunk of a chat completions response returned by the model, based on the provided input.
```json Response
data: {
"id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941",
"model": "meta-llama-3.1-8b-instruct",
"object": "chat.completion.chunk",
"choices": [
{
"index": 0,
"delta": { "role": "assistant", "content": "This" },
"finish_reason": null,
"logprobs": null
}
],
"usage": null,
"created": 1726294381
}
data: {
"id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941",
"model": "meta-llama-3.1-8b-instruct",
"object": "chat.completion.chunk",
"choices": [
{
"index": 0,
"delta": { "content": " is" },
"finish_reason": null,
"logprobs": null
}
],
"usage": null,
"created": 1726294381
}
...
data: {
"id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941",
"model": "meta-llama-3.1-8b-instruct",
"object": "chat.completion.chunk",
"choices": [
{
"index": 0,
"delta": {},
"finish_reason": "stop",
"logprobs": null
}
],
"usage": null,
"created": 1726294383
}
data: {
"id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941",
"model": "meta-llama-3.1-8b-instruct",
"object": "chat.completion.chunk",
"choices": [],
"usage": {
"prompt_tokens": 8,
"completion_tokens": 4,
"total_tokens": 12
},
"created": 1726294402
}
data: [DONE]
```
```json With tools
data: {
"id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941",
"model": "meta-llama-3.1-8b-instruct",
"object": "chat.completion.chunk",
"choices": [
{
"index": 0,
"delta": { "role": "assistant", "content": "This" },
"finish_reason": null,
"logprobs": null
}
],
"usage": null,
"created": 1726294442
}
...
data: {
"id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941",
"model": "meta-llama-3.1-8b-instruct",
"object": "chat.completion.chunk",
"choices": [
{
"index": 0,
"delta": {
"role": "assistant",
"tool_calls": [
{
"index": 0,
"id": "call_TARbemDG9CFdwuoaQBTRXiYK",
"type": "function",
"function": { "name": "func", "arguments": "{\"" }
}
]
},
"finish_reason": null,
"logprobs": null
}
],
"usage": null,
"created": 1726294442
}
data: {
"id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941",
"model": "meta-llama-3.1-8b-instruct",
"object": "chat.completion.chunk",
"choices": [
{
"index": 0,
"delta": {
"role": "assistant",
"tool_calls": [
{
"index": 0,
"type": "function",
"function": { "arguments": "arg" }
}
]
},
"finish_reason": null,
"logprobs": null
}
],
"usage": null,
"created": 1726294442
}
...
data: {
"id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941",
"model": "meta-llama-3.1-8b-instruct",
"object": "chat.completion.chunk",
"choices": [
{
"index": 0,
"delta": {
"role": "assistant",
"tool_calls": [
{
"index": 0,
"type": "function",
"function": { "arguments": "}" }
}
]
},
"finish_reason": null,
"logprobs": null
}
],
"usage": null,
"created": 1726294442
}
data: {
"id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941",
"model": "meta-llama-3.1-8b-instruct",
"object": "chat.completion.chunk",
"choices": [
{
"index": 0,
"delta": {},
"finish_reason": "tool_calls",
"logprobs": null
}
],
"usage": null,
"created": 1726294442
}
data: {
"id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941",
"model": "meta-llama-3.1-8b-instruct",
"object": "chat.completion.chunk",
"choices": [],
"usage": {
"prompt_tokens": 468,
"completion_tokens": 59,
"total_tokens": 527
},
"created": 1726294443
}
data: [DONE]
```
A unique ID of the chat completion.
The object type, which is always set to `chat.completion.chunk`.
The model used to generate the completion.
The index of the choice in the list of generated choices.
Role of the generated message author, in this case `assistant`.
The contents of the assistant message.
The index of tool call being generated.
The ID of the tool call.
The type of the tool, which is always set to `function`.
The name of the function to call.
The arguments for calling the function, generated by the model in JSON format.
Be sure to validate these arguments in your code before invoking the function, since the model may not always produce valid JSON.
Termination condition of the generation.
`stop` means the API returned the full chat completions generated by the model without running into any limits.
`length` means the generation exceeded `max_tokens` or the conversation exceeded the max context length.
`tool_calls` means the API has generated tool calls.
Available options: `stop`, `length`, `tool_calls`
Log probability information for the choice.
A list of message content tokens with log probability information.
The token.
The log probability of this token.
A list of integers representing the UTF-8 bytes representation of the token. Useful in instances where characters are represented by multiple tokens and their byte representations must be combined to generate the correct text representation. Can be `null` if there is no bytes representation for the token.
List of the most likely tokens and their log probability, at this token position.
The token.
The log probability of this token.
A list of integers representing the UTF-8 bytes representation of the token. Useful in instances where characters are represented by multiple tokens and their byte representations must be combined to generate the correct text representation. Can be `null` if there is no bytes representation for the token.
Number of tokens in the prompt.
Number of tokens in the generated chat completions.
Total number of tokens used in the request (`prompt_tokens` + `completion_tokens`).
The Unix timestamp (in seconds) for when the token was sampled.
# Serverless completions
Source: https://friendli.ai/docs/openapi/serverless/completions
post /serverless/v1/completions
Generate text based on the given text prompt.
See available models at [this pricing table](/guides/serverless_endpoints/pricing#billing-methods).
To request successfully, it is mandatory to enter a **Friendli Token** (e.g. flp\_XXX) value in the **Bearer Token** field.
Refer to the [authentication section](/openapi/introduction#authentication) on our introduction page to learn how to acquire this variable and [visit here](https://friendli.ai/suite/setting/tokens) to generate your token.
When streaming mode is used (i.e., `stream` option is set to `true`), the response is in MIME type `text/event-stream`. Otherwise, the content type is `application/json`.
You can view the schema of the streamed sequence of chunk objects in streaming mode [here](/openapi/serverless/completions-chunk-object).
# Serverless completions chunk object
Source: https://friendli.ai/docs/openapi/serverless/completions-chunk-object
Represents a streamed chunk of a completions response returned by the model, based on the provided input.
```json Response
data: {
"id": "cmpl-26a1e10db8544bc3adb488d2d205288b",
"model": "meta-llama-3.1-8b-instruct",
"object": "text_completion",
"choices": [
{
"index": 0,
"text": " such",
"token": 1778,
"finish_reason": null,
"logprobs": null
}
],
"created": 1733382157
}
data: {
"id": "cmpl-26a1e10db8544bc3adb488d2d205288b",
"model": "meta-llama-3.1-8b-instruct",
"object": "text_completion",
"choices": [
{
"index": 0,
"text": " as",
"token": 439,
"finish_reason": null,
"logprobs": null
}
],
"created": 1733382157
}
...
data: {
"id": "cmpl-26a1e10db8544bc3adb488d2d205288b",
"model": "meta-llama-3.1-8b-instruct",
"object": "text_completion",
"choices": [
{
"index": 0,
"text": "",
"finish_reason": "length",
"logprobs": null
}
],
"created": 1733382157
}
data: {
"id": "cmpl-26a1e10db8544bc3adb488d2d205288b",
"model": "meta-llama-3.1-8b-instruct",
"object": "text_completion",
"choices": [],
"usage": {
"prompt_tokens": 5,
"completion_tokens": 10,
"total_tokens": 15
},
"created": 1733382157
}
data: [DONE]
```
A unique ID of the completion.
The object type, which is always set to `text_completion`.
The model used to generate the completion.
The index of the choice in the list of generated choices.
The text.
The token.
Termination condition of the generation.
`stop` means the API returned the full completions generated by the model without running into any limits.
`length` means the generation exceeded `max_tokens` or the conversation exceeded the max context length.
Available options: `stop`, `length`
Log probability information for the choice.
The starting character position of each token in the generated text, useful for mapping tokens back to their exact location for detailed analysis.
The log probabilities of each generated token, indicating the model's confidence in selecting each token.
A list of individual tokens generated in the completion, representing segments of text such as words or pieces of words.
A list of dictionaries, where each dictionary represents the top alternative tokens considered by the model at a specific position in the generated text, along with their log probabilities. The number of items in each dictionary matches the value of `logprobs`.
Number of tokens in the prompt.
Number of tokens in the generated completions.
Total number of tokens used in the request (`prompt_tokens` + `completion_tokens`).
The Unix timestamp (in seconds) for when the token was sampled.
# Serverless detokenization
Source: https://friendli.ai/docs/openapi/serverless/detokenization
post /serverless/v1/detokenize
Given a list of tokens, generate a detokenized output text string.
To request successfully, it is mandatory to enter a **Friendli Token** (e.g. flp\_XXX) value in the **Bearer Token** field.
Refer to the [authentication section](/openapi/introduction#authentication) on our introduction page to learn how to acquire this variable and [visit here](https://friendli.ai/suite/setting/tokens) to generate your token.
# Serverless overview
Source: https://friendli.ai/docs/openapi/serverless/overview
OpenAPI reference of Friendli Serverless Endpoints API.
### Inference
Discover how to generate text through interactive conversations.
Learn how to enhance responses with tool assisted chat completions using built-in tools.
Learn how to generate text.
Explore the process of breaking down text into smaller tokens for machine processing.
Learn how to reconstruct tokenized text back into its original, human-readable form.
# Serverless tokenization
Source: https://friendli.ai/docs/openapi/serverless/tokenization
post /serverless/v1/tokenize
Given a text input, generate a tokenized output of token IDs.
To request successfully, it is mandatory to enter a **Friendli Token** (e.g. flp\_XXX) value in the **Bearer Token** field.
Refer to the [authentication section](/openapi/introduction#authentication) on our introduction page to learn how to acquire this variable and [visit here](https://friendli.ai/suite/setting/tokens) to generate your token.
# Serverless tool assisted chat completions
Source: https://friendli.ai/docs/openapi/serverless/tool-assisted-chat-completions
post /serverless/tools/v1/chat/completions
Given a list of messages forming a conversation, the model generates a response. Additionally, the model can utilize built-in tools for tool calls, enhancing its capability to provide more comprehensive and actionable responses.
See available models at [this pricing table](/guides/serverless_endpoints/pricing#billing-methods).
To request successfully, it is mandatory to enter a **Friendli Token** (e.g. flp\_XXX) value in the **Bearer Token** field.
Refer to the [authentication section](/openapi/introduction#authentication) on our introduction page to learn how to acquire this variable and [visit here](https://friendli.ai/suite/setting/tokens) to generate your token.
When streaming mode is used (i.e., `stream` option is set to `true`), the response is in MIME type `text/event-stream`. Otherwise, the content type is `application/json`.
You can view the schema of the streamed sequence of chunk objects in streaming mode [here](/openapi/serverless/tool-assisted-chat-completions-chunk-object).
You can explore examples on the [Friendli Serverless Endpoints](https://friendli.ai/get-started/serverless-endpoints) playground and adjust settings with just a few clicks.
Tool assisted chat completions does not currently fully support parallel tool calls.
This API is currently in **Beta**.
While we strive to provide a stable and reliable experience, this feature is still under active development.
As a result, you may encounter unexpected behavior or limitations.
We encourage you to provide feedback to help us improve the feature before its official release.
* Feature request & feedback
* Contact support
# Serverless tool assisted chat completions chunk object
Source: https://friendli.ai/docs/openapi/serverless/tool-assisted-chat-completions-chunk-object
Represents a streamed chunk of a tool assisted chat completions response returned by model, based on the provided input.
This API is currently in **Beta**.
While we strive to provide a stable and reliable experience, this feature is still under active development.
As a result, you may encounter unexpected behavior or limitations.
We encourage you to provide feedback to help us improve the feature before its official release.
* Feature request & feedback
* Contact support
```json Response
event: tool_status
data: {
"tool_call_id": "call_3QrfStXSU6fGdOGPcETocIAq",
"name": "math:calculator",
"status": "STARTED",
"parameters": [{ "name": "expression", "value": "150 * 1.60934" }],
"result": null,
"files": null,
"message": null,
"error": null,
"usage": null,
"timestamp": 1726277121
}
event: tool_status
data: {
"tool_call_id": "call_3QrfStXSU6fGdOGPcETocIAq",
"name": "math:calculator",
"status": "ENDED",
"parameters": [{ "name": "expression", "value": "150 * 1.60934" }],
"result": "\"{\\\"result\\\": \\\"150 * 1.60934=241.401000000000\\\"}\"",
"files": null,
"message": null,
"error": null,
"usage": null,
"timestamp": 1726277121
}
data: {
"id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941",
"model": "meta-llama-3.1-8b-instruct",
"object": "chat.completion.chunk",
"choices": [
{
"index": 0,
"delta": { "role": "assistant", "content": "To" },
"finish_reason": null,
"logprobs": null
}
],
"created": 1726277121
}
...
data: {
"id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941",
"model": "meta-llama-3.1-8b-instruct",
"object": "chat.completion.chunk",
"choices": [
{
"index": 0,
"delta": { "role": "assistant", "content": "." },
"finish_reason": null,
"logprobs": null
}
],
"created": 1726277121
}
data: {
"id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941",
"model": "meta-llama-3.1-8b-instruct",
"object": "chat.completion.chunk",
"choices": [
{
"index": 0,
"delta": {},
"finish_reason": "stop",
"logprobs": null
}
],
"created": 1726277121
}
data: [DONE]
```
```json Multiple tools
event: tool_status
data: {
"tool_call_id": "call_5X9KQ52bV3CUigqHWleTzD9A",
"name": "code:python-interpreter",
"status": "STARTED",
"parameters": [{ "name": "code", "value": "def is_prime(n): ... \n" }],
"result": null,
"files": null,
"message": null,
"error": null,
"usage": null,
"timestamp": 1726277008
}
event: tool_status
data: {
"tool_call_id": "call_5X9KQ52bV3CUigqHWleTzD9A",
"name": "code:python-interpreter",
"status": "ENDED",
"parameters": [{ "name": "code", "value": "def is_prime(n): ... \n" }],
"result": "\"[2, 3, 5, 7, 11, 13, 17]\\n\"",
"files": [],
"message": null,
"error": null,
"usage": null,
"timestamp": 1726277011
}
data: {
"id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941",
"model": "meta-llama-3.1-8b-instruct",
"object": "chat.completion.chunk",
"choices": [
{
"index": 0,
"delta": { "role": "assistant", "content": "Now" },
"finish_reason": null,
"logprobs": null
}
],
"created": 1726277011
}
...
event: tool_status
data: {
"tool_call_id": "call_FgfZYpRoDdPtz3QwLrLZIhdP",
"name": "math:calculator",
"status": "STARTED",
"parameters": [{ "name": "expression", "value": "2 * 3 * 5 * 7 * 11 * 13 * 17" }],
"result": null,
"files": null,
"message": null,
"error": null,
"usage": null,
"timestamp": 1726277012
}
event: tool_status
data: {
"tool_call_id": "call_FgfZYpRoDdPtz3QwLrLZIhdP",
"name": "math:calculator",
"status": "ENDED",
"parameters": [{ "name": "expression", "value": "2 * 3 * 5 * 7 * 11 * 13 * 17" }],
"result": "\"{\\\"result\\\": \\\"2 * 3 * 5 * 7 * 11 * 13 * 17=510510\\\"}\"",
"files": null,
"message": null,
"error": null,
"usage": null,
"timestamp": 1726277016
}
data: {
"id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941",
"model": "meta-llama-3.1-8b-instruct",
"object": "chat.completion.chunk",
"choices": [
{
"index": 0,
"delta": { "role": "assistant", "content": "The" },
"finish_reason": null,
"logprobs": null
}
],
"created": 1726277016
}
...
data: {
"id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941",
"model": "meta-llama-3.1-8b-instruct",
"object": "chat.completion.chunk",
"choices": [
{
"index": 0,
"delta": { "role": "assistant", "content": "." },
"finish_reason": null,
"logprobs": null
}
],
"created": 1726277016
}
data: {
"id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941",
"model": "meta-llama-3.1-8b-instruct",
"object": "chat.completion.chunk",
"choices": [
{
"index": 0,
"delta": {},
"finish_reason": "stop",
"logprobs": null
}
],
"created": 1726277016
}
data: [DONE]
```
```json With custom tool
event: tool_status
data: {
"tool_call_id": "call_iryDFgBCcNoc2ICXuuyZqQUe",
"name": "web:search",
"status": "STARTED",
"parameters": [{ "name": "query", "value": "tallest buildings in the world" }],
"result": null,
"files": null,
"message": null,
"error": null,
"usage": null,
"timestamp": 1726294660
}
event: tool_status
data: {
"tool_call_id": "call_iryDFgBCcNoc2ICXuuyZqQUe",
"name": "web:search",
"status": "UPDATING",
"parameters": [{ "name": "query", "value": "tallest buildings in the world" }],
"result": "https://en.wikipedia.org/wiki/List_of_tallest_buildings",
"files": null,
"message": null,
"error": null,
"usage": null,
"timestamp": 1726294666
}
...
event: tool_status
data: {
"tool_call_id": "call_iryDFgBCcNoc2ICXuuyZqQUe",
"name": "web:search",
"status": "ENDED",
"parameters": [{ "name": "query", "value": "tallest buildings in the world" }],
"result": "['https://en.wikipedia.org/wiki/List_of_tallest_buildings', ...]",
"files": null,
"message": null,
"error": null,
"usage": null,
"timestamp": 1726294671
}
data: {
"id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941",
"model": "meta-llama-3.1-8b-instruct",
"object": "chat.completion.chunk",
"choices": [
{
"index": 0,
"delta": { "role": "assistant", "content": "The" },
"finish_reason": null,
"logprobs": null
}
],
"created": 1726294672
}
...
data: {
"id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941",
"model": "meta-llama-3.1-8b-instruct",
"object": "chat.completion.chunk",
"choices": [
{
"index": 0,
"delta": {
"role": "assistant",
"tool_calls": [
{
"index": 0,
"id": "call_yuvrTUk4O2Uh7Hns5ieUcu1S",
"type": "function",
"function": { "name": "func", "arguments": "{\"" },
}
]
},
"finish_reason": null,
"logprobs": null
}
],
"created": 1726294673
}
data: {
"id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941",
"model": "meta-llama-3.1-8b-instruct",
"object": "chat.completion.chunk",
"choices": [
{
"index": 0,
"delta": {
"role": "assistant",
"tool_calls": [
{
"index": 0,
"type": "function",
"function": { "arguments": "arg" }
}
]
},
"finish_reason": null,
"logprobs": null
}
],
"created": 1726294673
}
...
data: {
"id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941",
"model": "meta-llama-3.1-8b-instruct",
"object": "chat.completion.chunk",
"choices": [
{
"index": 0,
"delta": {
"role": "assistant",
"tool_calls": [
{
"index": 0,
"type": "function",
"function": { "arguments": "}" }
}
]
},
"finish_reason": null,
"logprobs": null
}
],
"created": 1726294673
}
data: {
"id": "chatcmpl-4b71d12c86d94e719c7e3984a7bb7941",
"model": "meta-llama-3.1-8b-instruct",
"object": "chat.completion.chunk",
"choices": [
{
"index": 0,
"delta": {},
"finish_reason": "tool_calls",
"logprobs": null
}
],
"created": 1726294673
}
data: [DONE]
```
A unique ID of the chat completion.
The object type, which is always set to `chat.completion.chunk`.
The model to generate the completion.
The index of the choice in the list of generated choices.
Role of the generated message author, in this case `assistant`.
The contents of the assistant message.
The index of tool call being generated.
The ID of the tool call.
The type of the tool, which is always set to `function`.
The name of the function to call.
The arguments for calling the function, generated by the model in JSON format.
Be sure to validate these arguments in your code before invoking the function, since the model may not always produce valid JSON.
Termination condition of the generation.
`stop` means the API returned the full chat completions generated by the model without running into any limits.
`length` means the generation exceeded `max_tokens` or the conversation exceeded the max context length.
`tool_calls` means the API has generated tool calls.
Available options: `stop`, `length`, `tool_calls`
Log probability information for the choice.
A list of message content tokens with log probability information.
The token.
The log probability of this token.
A list of integers representing the UTF-8 bytes representation of the token. Useful in instances where characters are represented by multiple tokens and their byte representations must be combined to generate the correct text representation. Can be `null` if there is no bytes representation for the token.
List of the most likely tokens and their log probability, at this token position.
The token.
The log probability of this token.
A list of integers representing the UTF-8 bytes representation of the token. Useful in instances where characters are represented by multiple tokens and their byte representations must be combined to generate the correct text representation. Can be `null` if there is no bytes representation for the token.
Number of tokens in the prompt.
Number of tokens in the generated chat completions.
Total number of tokens used in the request (`prompt_tokens` + `completion_tokens`).
The Unix timestamp (in seconds) for when the token was sampled.
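As a minimal illustration of the `bytes` fields described above, the per-token byte arrays can be concatenated and decoded to recover text that spans multiple tokens; the snippet below is a plain-Python sketch, not tied to any SDK.
```python
# Minimal sketch: recombining per-token UTF-8 byte arrays from `logprobs`.
token_bytes = [[240, 159], [145, 141]]  # two tokens that jointly encode one character
text = bytes(b for token in token_bytes for b in token).decode("utf-8")
print(text)  # the combined character
```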
### `event: tool_status` chunk object
`event: tool_status` tracks the execution progress of built-in tools, such as calculator or web search functions.
It provides real-time updates on their status and results.
The ID of the tool call.
The name of the built-in tool.
Available options: `linkup:search`, `math:calculator`, `math:statistics`, `math:calendar`, `web:search`, `web:url`, `code:python-interpreter`, `file:text`
Indicates the current execution status of the tool.
Available options: `STARTED`, `UPDATING`, `ENDED`, `ERRORED`
The name of the tool's function parameter.
The value of the tool's function parameter.
The output from the tool's execution.
The name of the file generated by the tool's execution.
URL of the file generated by the tool's execution.
Message generated by the tool's execution.
The type of error encountered during the tool's execution.
The error message.
The Unix timestamp (in seconds) for when the event occurred.
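To make the statuses above concrete, a minimal sketch of handling a decoded `tool_status` payload is shown below; it assumes the event's `data` field has already been parsed into a Python dict.
```python
# Minimal sketch: branching on the `status` field of a parsed tool_status payload.
def handle_tool_status(data: dict) -> None:
    status = data["status"]
    if status == "STARTED":
        print(f"{data['name']} started with {data['parameters']}")
    elif status == "UPDATING":
        print(f"{data['name']} update: {data['result']}")
    elif status == "ENDED":
        print(f"{data['name']} result: {data['result']}")
    elif status == "ERRORED":
        print(f"{data['name']} error: {data['error']}")
```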
# Langchain Node.js SDK
Source: https://friendli.ai/docs/sdk/integrations/langchain/nodejs
Utilize the LangChain Node.js SDK with FriendliAI for seamless integration and enhanced tool calling capabilities in your applications.
You can use [**LangChain Node.js SDK**](https://github.com/langchain-ai/langchainjs) to interact with FriendliAI.
This makes migration of existing applications already using LangChain particularly easy.
## How to use
Before you start, ensure you've already obtained the `FRIENDLI_TOKEN` from the [Friendli Suite](https://friendli.ai/suite/setting/tokens).
Our products are entirely compatible with OpenAI, so we use the `@langchain/openai` package by referring to the FriendliAI `baseURL`.
```bash npm
npm i @langchain/core @langchain/openai
```
```bash yarn
yarn add @langchain/core @langchain/openai
```
```bash pnpm
pnpm add @langchain/core @langchain/openai
```
### Instantiation
Now we can instantiate our model object and generate chat completions.
We provide usage examples for each type of endpoint. Choose the one that best suits your needs:
```js Serverless Endpoints
import { ChatOpenAI } from "@langchain/openai";
const model = new ChatOpenAI({
model: "meta-llama-3.1-8b-instruct",
apiKey: process.env.FRIENDLI_TOKEN,
configuration: {
baseURL: "https://api.friendli.ai/serverless/v1",
},
});
```
```js Dedicated Endpoints
import { ChatOpenAI } from "@langchain/openai";
const model = new ChatOpenAI({
model: "YOUR_ENDPOINT_ID",
apiKey: process.env.FRIENDLI_TOKEN,
configuration: {
baseURL: "https://api.friendli.ai/dedicated/v1",
},
});
```
```js Fine-tuned Dedicated Endpoints
import { ChatOpenAI } from "@langchain/openai";
const model = new ChatOpenAI({
model: "YOUR_ENDPOINT_ID:YOUR_ADAPTER_ROUTE",
apiKey: process.env.FRIENDLI_TOKEN,
configuration: {
baseURL: "https://api.friendli.ai/dedicated/v1",
},
});
```
### Runnable interface
We support both synchronous and asynchronous runnable methods to generate a response.
```js
import { HumanMessage, SystemMessage } from "@langchain/core/messages";
const messages = [
new SystemMessage("Translate the following from English into Italian"),
new HumanMessage("hi!"),
];
const result = await model.invoke(messages);
console.log(result);
```
### Chaining
We can chain our model with a prompt template.
Prompt templates convert raw user input to better input to the LLM.
```javascript
import { ChatPromptTemplate } from "@langchain/core/prompts";
const prompt = ChatPromptTemplate.fromMessages([
["system", "You are a world class technical documentation writer."],
["user", "{input}"],
]);
const chain = prompt.pipe(model);
console.log(
await chain.invoke({ input: "how can langsmith help with testing?" })
);
```
To get the string value instead of the message, we can add an output parser to the chain.
```javascript
import { StringOutputParser } from "@langchain/core/output_parsers";
const outputParser = new StringOutputParser();
const chain = prompt.pipe(model).pipe(outputParser);
console.log(
await chain.invoke({ input: "how can langsmith help with testing?" })
);
```
### Tool calling
Describe tools and their parameters, and let the model return a tool to invoke with the input arguments.
Tool calling is extremely useful for enhancing the model's capability to provide more comprehensive and actionable responses.
#### Define tools to use
We can define tools with Zod schemas and use them to generate tool calls.
```bash npm
npm i zod
```
```bash yarn
yarn add zod
```
```bash pnpm
pnpm add zod
```
```js
import { tool } from "@langchain/core/tools";
import { z } from "zod";
/**
* Note that the descriptions here are crucial, as they will be passed along
* to the model along with the class name.
*/
const calculatorSchema = z.object({
operation: z
.enum(["add", "subtract", "multiply", "divide"])
.describe("The type of operation to execute."),
number1: z.number().describe("The first number to operate on."),
number2: z.number().describe("The second number to operate on."),
});
const calculatorTool = tool(
async ({ operation, number1, number2 }) => {
// Functions must return strings
if (operation === "add") {
return `${number1 + number2}`;
} else if (operation === "subtract") {
return `${number1 - number2}`;
} else if (operation === "multiply") {
return `${number1 * number2}`;
} else if (operation === "divide") {
return `${number1 / number2}`;
} else {
throw new Error("Invalid operation.");
}
},
{
name: "calculator",
description: "Can perform mathematical operations.",
schema: calculatorSchema,
}
);
console.log(
await calculatorTool.invoke({ operation: "add", number1: 3, number2: 4 })
);
```
#### Bind tools to the model
Now models can generate a tool calling response.
```js
const modelWithTools = model.bindTools([calculatorTool]);
const messages = [new HumanMessage("What is 3 * 12? Also, what is 11 + 49?")];
const aiMessage = await modelWithTools.invoke(messages);
console.log(aiMessage);
```
#### Generate a tool assisted message
Use the tool call results to generate a message.
```js
messages.push(aiMessage);
const toolsByName = {
calculator: calculatorTool,
};
for (const toolCall of aiMessage.tool_calls) {
const selectedTool = toolsByName[toolCall.name];
const toolMessage = await selectedTool.invoke(toolCall);
messages.push(toolMessage);
}
console.log(await modelWithTools.invoke(messages));
```
For more information on how to use tools, check out the [LangChain documentation](https://js.langchain.com/v0.2/docs/how_to/#tools).
# LangChain Python SDK
Source: https://friendli.ai/docs/sdk/integrations/langchain/python
Utilize the LangChain Python SDK with FriendliAI for easy integration and advanced tool calling in your applications.
You can use [**LangChain Python SDK**](https://github.com/langchain-ai/langchain) to interact with FriendliAI.
This makes migration of existing applications already using LangChain particularly easy.
## How to use
Before you start, ensure you've already obtained the `FRIENDLI_TOKEN` from the [Friendli Suite](https://friendli.ai/suite/setting/tokens).
Our products are entirely compatible with OpenAI, so we use the `langchain-openai` package by referring to the FriendliAI `baseURL`.
```bash
pip install -qU langchain-openai langchain
```
### Instantiation
Now we can instantiate our model object and generate chat completions.
We provide usage examples for each type of endpoint. Choose the one that best suits your needs:
```python Serverless Endpoints
import os
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(
model="meta-llama-3.1-8b-instruct",
base_url="https://api.friendli.ai/serverless/v1",
api_key=os.environ["FRIENDLI_TOKEN"],
)
```
```python Dedicated Endpoints
import os
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(
model="YOUR_ENDPOINT_ID",
base_url="https://api.friendli.ai/dedicated/v1",
api_key=os.environ["FRIENDLI_TOKEN"],
)
```
```python Fine-tuned Dedicated Endpoints
import os
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(
model="YOUR_ENDPOINT_ID:YOUR_ADAPTER_ROUTE",
base_url="https://api.friendli.ai/dedicated/v1",
api_key=os.environ["FRIENDLI_TOKEN"],
)
```
### Runnable interface
We support both synchronous and asynchronous runnable methods to generate a response.
#### Synchronous methods:
```python invoke
result = llm.invoke("Tell me a joke.")
print(result.content)
```
```python stream
for chunk in llm.stream("Tell me a joke."):
print(chunk.content, end="", flush=True)
```
```python batch
for r in llm.batch(["Tell me a joke.", "Tell me a useless fact."]):
print(r.content, "\n\n")
```
#### Asynchronous methods:
```python ainvoke
result = await llm.ainvoke("Tell me a joke.")
print(result.content)
```
```python astream
async for chunk in llm.astream("Tell me a joke."):
print(chunk.content, end="", flush=True)
```
```python abatch
for r in await llm.abatch(["Tell me a joke.", "Tell me a useless fact."]):
print(r.content, "\n\n")
```
### Chaining
We can [chain](https://python.langchain.com/v0.2/docs/how_to/sequence) our model with a prompt template.
Prompt templates convert raw user input to better input to the LLM.
```python
from langchain_core.prompts import ChatPromptTemplate
prompt = ChatPromptTemplate.from_messages([
("system", "You are a world class technical documentation writer."),
("user", "{input}")
])
chain = prompt | llm
print(chain.invoke({"input": "how can langsmith help with testing?"}))
```
To get the string value instead of the message, we can add an output parser to the chain.
```python
from langchain_core.output_parsers import StrOutputParser
output_parser = StrOutputParser()
chain = prompt | llm | output_parser
print(chain.invoke({"input": "how can langsmith help with testing?"}))
```
### Tool calling
Describe tools and their parameters, and let the model return a tool to invoke with the input arguments.
Tool calling is extremely useful for enhancing the model's capability to provide more comprehensive and actionable responses.
#### Define tools to use
The `@tool` decorator is used to define a tool.
If you set `parse_docstring=True`, the tool will parse the docstring to extract the information of arguments.
```python Default
from langchain_core.tools import tool
@tool
def add(a: int, b: int) -> int:
"""Adds a and b."""
return a + b
@tool
def multiply(a: int, b: int) -> int:
"""Multiplies a and b."""
return a * b
tools = [add, multiply]
```
```python Parse Docstring
from langchain_core.tools import tool
@tool(parse_docstring=True)
def add(a: int, b: int) -> int:
"""Adds a and b.
Args:
a: The first integer.
b: The second integer.
"""
return a + b
@tool(parse_docstring=True)
def multiply(a: int, b: int) -> int:
"""Multiplies a and b.
Args:
a: The first integer.
b: The second integer.
"""
return a * b
tools = [add, multiply]
```
#### Bind tools to the model
Now models can generate a tool calling response.
```python
import os
from langchain_openai import ChatOpenAI
llm = ChatOpenAI(
model="meta-llama-3.1-8b-instruct",
base_url="https://api.friendli.ai/serverless/v1",
api_key=os.environ["FRIENDLI_TOKEN"],
)
llm_with_tools = llm.bind_tools(tools)
query = "What is 3 * 12? Also, what is 11 + 49?"
print(llm_with_tools.invoke(query).tool_calls)
```
#### Generate a tool assisted message
Use the tool call results to generate a message.
```python
from langchain_core.messages import HumanMessage, ToolMessage
messages = [HumanMessage(query)]
ai_msg = llm_with_tools.invoke(messages)
messages.append(ai_msg)
for tool_call in ai_msg.tool_calls:
selected_tool = {"add": add, "multiply": multiply}[tool_call["name"].lower()]
tool_output = selected_tool.invoke(tool_call["args"])
messages.append(ToolMessage(tool_output, tool_call_id=tool_call["id"]))
print(llm_with_tools.invoke(messages))
```
For more information on how to use tools, check out the [LangChain documentation](https://python.langchain.com/v0.2/docs/how_to/#tools).
# Linkup
Source: https://friendli.ai/docs/sdk/integrations/linkup
Find and access high-quality web content using the Linkup API, integrated with Friendli Serverless Endpoints for seamless interaction.
**Linkup** provides real-time web search capabilities. With Linkup integration in Friendli, you can easily enhance your AI applications with up-to-date facts, recent events, and current information that goes beyond what your model was trained on.
You can use Linkup's real-time web search through Friendli Serverless Endpoints with just a few simple steps.
## How to use
### For Playground Testing
1. Create an account at [**https://friendli.ai**](https://friendli.ai).
2. Subscribe to the free trial of the Friendli Serverless Endpoints product. ([guide](/guides/suite/free_credits#receiving-when-you-start))
3. Go to **Serverless Endpoints** from your Project and click the **'Try'** button to open the playground.
4. In the playground, open the **Tools** panel and select **Search the web (Linkup)** to test the integration.
### For SDK/API Usage
1. Go to [**https://app.linkup.so**](https://app.linkup.so), and get your **Linkup API key** (free tier available).
2. In Friendli Suite, open **Personal settings > Integrations** and add your Linkup API key.
In the following code snippet, `FRIENDLI_TOKEN` refers to your **Personal Access Token**, which you can obtain from **Personal settings > Settings > Tokens** ([guide](/guides/suite/personal_access_tokens)).
```bash curl {13-15}
curl --request POST \
--url https://api.friendli.ai/serverless/tools/v1/chat/completions \
--header "Authorization: Bearer $FRIENDLI_TOKEN" \
--header 'Content-Type: application/json' \
--data '{
"model": "meta-llama-3.1-8b-instruct",
"messages": [
{
"content": "Find information on the popular movies currently showing in theaters and provide their ratings.",
"role": "user"
}
],
"tools": [
{ "type": "linkup:search" }
]
}'
```
```python python {17-19}
import os
from openai import OpenAI
client = OpenAI(
api_key=os.getenv("FRIENDLI_TOKEN"),
base_url="https://api.friendli.ai/serverless/tools/v1",
)
completion = client.chat.completions.create(
model="meta-llama-3.1-8b-instruct",
messages=[
{
"role": "user",
"content": "Find information on the popular movies currently showing in theaters and provide their ratings."
}
],
tools=[
{"type": "linkup:search"}
],
stream=False
)
print(completion.choices[0].message.content)
```
## Notes & Caveats
* **Playground vs. SDK/API**: In the playground, you can test web search functionality using a Linkup-sponsored API key. However, for SDK usage and API calls, you must provide your own Linkup API key.
* Make sure your Linkup integration is enabled in your Friendli account before calling the API; otherwise the `linkup:search` tool will return an error.
* Linkup and Friendli both have rate limits; handle retries and backoff accordingly (see the sketch below).
* Keep API keys and tokens secret (use environment variables or secret managers).
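As a rough sketch of the retry advice above (the retry count and delays are illustrative, not an official recommendation):
```python
# Hedged sketch: simple exponential backoff around a tool-assisted request.
import os
import time

from openai import OpenAI

client = OpenAI(
    api_key=os.getenv("FRIENDLI_TOKEN"),
    base_url="https://api.friendli.ai/serverless/tools/v1",
)

def chat_with_retry(messages, max_retries=3):
    for attempt in range(max_retries):
        try:
            return client.chat.completions.create(
                model="meta-llama-3.1-8b-instruct",
                messages=messages,
                tools=[{"type": "linkup:search"}],
            )
        except Exception:  # narrow this to rate-limit errors in real code
            if attempt == max_retries - 1:
                raise
            time.sleep(2 ** attempt)  # back off: 1s, 2s, 4s, ...
```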
# LiteLLM
Source: https://friendli.ai/docs/sdk/integrations/litellm
LiteLLM SDK supports all FriendliAI models, offering easy integration with serverless, dedicated, and fine-tuned endpoints.
You can use [**LiteLLM**](https://github.com/BerriAI/litellm) to interact with FriendliAI.
This makes migration of existing applications already using LiteLLM particularly easy.
## How to use
Before you start, ensure you've already obtained the `FRIENDLI_TOKEN` from the [Friendli Suite](https://friendli.ai/suite/setting/tokens).
Add the `friendliai/` prefix to your model or endpoint name in the `model` parameter.
### Chat completion
We provide usage examples for each type of endpoint. Choose the one that best suits your needs.
You can specify one of the [available models](https://friendli.ai/models/search?products=SERVERLESS) for the serverless endpoints.
```python Serverless Endpoints
import os
from litellm import completion
os.environ['FRIENDLI_TOKEN'] = "YOUR_FRIENDLI_TOKEN"
response = completion(
model="friendliai/meta-llama-3.3-70b-instruct",
messages=[
{"role": "user", "content": "hello from litellm"}
],
)
print(response)
```
```python Dedicated Endpoints
import os
from litellm import completion
os.environ['FRIENDLI_TOKEN'] = "YOUR_FRIENDLI_TOKEN"
os.environ['FRIENDLI_API_BASE'] = "https://api.friendli.ai/dedicated/v1"
response = completion(
model="friendliai/YOUR_ENDPOINT_ID",
messages=[
{"role": "user", "content": "hello from litellm"}
],
)
print(response)
```
```python Fine-tuned Dedicated Endpoints
import os
from litellm import completion
os.environ['FRIENDLI_TOKEN'] = "YOUR_FRIENDLI_TOKEN"
os.environ['FRIENDLI_API_BASE'] = "https://api.friendli.ai/dedicated/v1"
response = completion(
model="friendliai/YOUR_ENDPOINT_ID:YOUR_ADAPTER_ROUTE",
messages=[
{"role": "user", "content": "hello from litellm"}
],
)
print(response)
```
### Chat completion - Streaming
```python Serverless Endpoints
import os
from litellm import completion
os.environ['FRIENDLI_TOKEN'] = "YOUR_FRIENDLI_TOKEN"
response = completion(
model="friendliai/meta-llama-3.3-70b-instruct",
messages=[
{"role": "user", "content": "hello from litellm"}
],
stream=True
)
for chunk in response:
print(chunk)
```
```python Dedicated Endpoints
import os
from litellm import completion
os.environ['FRIENDLI_TOKEN'] = "YOUR_FRIENDLI_TOKEN"
os.environ['FRIENDLI_API_BASE'] = "https://api.friendli.ai/dedicated/v1"
response = completion(
model="friendliai/YOUR_ENDPOINT_ID",
messages=[
{"role": "user", "content": "hello from litellm"}
],
stream=True
)
for chunk in response:
print(chunk)
```
```python Fine-tuned Dedicated Endpoints
import os
from litellm import completion
os.environ['FRIENDLI_TOKEN'] = "YOUR_FRIENDLI_TOKEN"
os.environ['FRIENDLI_API_BASE'] = "https://api.friendli.ai/dedicated/v1"
response = completion(
model="friendliai/YOUR_ENDPOINT_ID:YOUR_ADAPTER_ROUTE",
messages=[
{"role": "user", "content": "hello from litellm"}
],
stream=True
)
for chunk in response:
print(chunk)
```
# LlamaIndex
Source: https://friendli.ai/docs/sdk/integrations/llamaindex
Easily integrate large language models with the LlamaIndex SDK, featuring FriendliAI for seamless interaction.
You can use [**LlamaIndex**](https://github.com/run-llama/llama_index) to interact with FriendliAI.
This makes migration of existing applications already using LlamaIndex particularly easy.
## How to use
Before you start, ensure you've already obtained the `FRIENDLI_TOKEN` from the [Friendli Suite](https://friendli.ai/suite/setting/tokens).
```bash
pip install llama-index llama-index-llms-friendli
```
### Instantiation
Now we can instantiate our model object and generate chat completions.
The default model (i.e. `meta-llama-3.3-70b-instruct`) will be used if no model is specified.
```python
import os
from llama_index.llms.friendli import Friendli
os.environ['FRIENDLI_TOKEN'] = "YOUR_FRIENDLI_TOKEN"
llm = Friendli(model="meta-llama-3.3-70b-instruct")
```
### Chat completion
Generate a response from a given conversation.
```python Default
from llama_index.core.llms import ChatMessage, MessageRole
message = ChatMessage(role=MessageRole.USER, content="Tell me a joke.")
resp = llm.chat([message])
print(resp)
```
```python Streaming
from llama_index.core.llms import ChatMessage, MessageRole
message = ChatMessage(role=MessageRole.USER, content="Tell me a joke.")
resp = llm.stream_chat([message])
for r in resp:
print(r.delta, end="")
```
```python Async
from llama_index.core.llms import ChatMessage, MessageRole
message = ChatMessage(role=MessageRole.USER, content="Tell me a joke.")
resp = await llm.achat([message])
print(resp)
```
```python Async Streaming
from llama_index.core.llms import ChatMessage, MessageRole
message = ChatMessage(role=MessageRole.USER, content="Tell me a joke.")
resp = await llm.astream_chat([message])
async for r in resp:
print(r.delta, end="")
```
### Completion
Generate a response from a given prompt.
```python Default
prompt = "Draft a cover letter for a role in software engineering."
resp = llm.complete(prompt)
print(resp)
```
```python Streaming
prompt = "Draft a cover letter for a role in software engineering."
resp = llm.stream_complete(prompt)
for r in resp:
print(r.delta, end="")
```
```python Async
prompt = "Draft a cover letter for a role in software engineering."
resp = await llm.acomplete(prompt)
print(resp)
```
```python Async Streaming
prompt = "Draft a cover letter for a role in software engineering."
resp = await llm.astream_complete(prompt)
async for r in resp:
print(r.delta, end="")
```
# OpenAI Node.js SDK
Source: https://friendli.ai/docs/sdk/integrations/openai/nodejs
Easily integrate FriendliAI with the OpenAI Node.js SDK.
You can use [**OpenAI Node.js SDK**](https://github.com/openai/openai-node) to interact with FriendliAI.
This makes migration of existing applications already using OpenAI particularly easy.
## How to use
Before you start, ensure the `baseURL` and `apiKey` refer to FriendliAI.
Since our products are fully compatible with the OpenAI SDK, you can follow the examples below as-is.
Choose one of the [available models](https://friendli.ai/models/search?products=SERVERLESS) for the `model` parameter.
```bash npm
npm i openai
```
```bash yarn
yarn add openai
```
```bash pnpm
pnpm add openai
```
### Chat Completion
Chat completion API that generates a response from a given conversation.
We provide multiple usage examples. Try to find the best one that aligns with your needs:
```ts Default
import OpenAI from "openai";
const client = new OpenAI({
baseURL: "https://api.friendli.ai/serverless/v1",
apiKey: process.env.FRIENDLI_TOKEN,
});
async function main() {
const completion = await client.chat.completions.create({
model: "meta-llama-3.1-8b-instruct",
messages: [
{ role: "system", content: "You are a helpful assistant." },
{ role: "user", content: "Hello!" },
],
});
console.log(completion.choices[0]);
}
main();
```
```ts Streaming
import OpenAI from "openai";
const client = new OpenAI({
baseURL: "https://api.friendli.ai/serverless/v1",
apiKey: process.env.FRIENDLI_TOKEN,
});
async function main() {
const completion = await client.chat.completions.create({
model: "meta-llama-3.1-8b-instruct",
messages: [
{ role: "system", content: "You are a helpful assistant." },
{ role: "user", content: "Hello!" },
],
stream: true,
});
for await (const chunk of completion) {
console.log(chunk.choices[0].delta.content);
}
}
main();
```
```ts Functions
import OpenAI from "openai";
const client = new OpenAI({
baseURL: "https://api.friendli.ai/serverless/v1",
apiKey: process.env.FRIENDLI_TOKEN,
});
async function main() {
const messages = [
{ role: "user", content: "What's the weather like in Boston today?" },
];
const tools = [
{
type: "function",
function: {
name: "get_current_weather",
description: "Get the current weather in a given location",
parameters: {
type: "object",
properties: {
location: {
type: "string",
description: "The city and state, e.g. San Francisco, CA",
},
unit: { type: "string", enum: ["celsius", "fahrenheit"] },
},
required: ["location"],
},
},
},
];
const completion = await client.chat.completions.create({
model: "meta-llama-3.1-8b-instruct",
messages: messages,
tools: tools,
tool_choice: "auto",
});
console.log(completion);
}
main();
```
```ts Logprobs
import OpenAI from "openai";
const client = new OpenAI({
baseURL: "https://api.friendli.ai/serverless/v1",
apiKey: process.env.FRIENDLI_TOKEN,
});
async function main() {
const completion = await client.chat.completions.create({
model: "meta-llama-3.1-8b-instruct",
messages: [{ role: "user", content: "Hello!" }],
logprobs: true,
top_logprobs: 2,
});
console.log(completion.choices[0].message);
console.log(completion.choices[0].logprobs);
}
main();
```
### Tool assisted chat completion
This feature is in Beta and available only on the **Serverless Endpoints**.
Using the tool assisted chat completion API, models can utilize built-in tools for tool calls, enhancing their capability to provide more comprehensive and actionable responses.
Available tools are listed [here](/guides/serverless_endpoints/tool-assisted-api#built-in-tools).
```ts Basic
import OpenAI from "openai";
const client = new OpenAI({
baseURL: "https://api.friendli.ai/serverless/tools/v1",
apiKey: process.env.FRIENDLI_TOKEN,
});
async function main() {
const messages = [
{
role: "user",
content:
"What is the current average home price in New York City, and if I put 15% down, how much will my mortgage be?",
},
];
const tools = [{ type: "code:python-interpreter" }, { type: "web:search" }];
const completion = await client.chat.completions.create({
model: "meta-llama-3.1-8b-instruct",
messages: messages,
tools: tools,
tool_choice: "auto",
stream: true,
});
for await (const chunk of completion) {
if (chunk.choices === undefined) {
console.log(`event: ${chunk.event}, data: ${JSON.stringify(chunk.data)}`);
} else {
console.log(chunk.choices[0].delta.content);
}
}
}
main();
```
```ts Advanced (REPL)
import OpenAI from "openai";
import * as readline from "node:readline/promises";
const client = new OpenAI({
baseURL: "https://api.friendli.ai/serverless/tools/v1",
apiKey: process.env.FRIENDLI_TOKEN,
});
const terminal = readline.createInterface({
input: process.stdin,
output: process.stdout,
});
async function chatbot(input) {
const stream = await client.chat.completions.create({
model: "meta-llama-3.1-8b-instruct",
messages: [{ role: "user", content: input }],
tools: [
{ type: "web:url" },
{ type: "code:python-interpreter" },
{ type: "math:calculator" },
{ type: "web:search" },
],
tool_choice: "auto",
stream: true,
});
for await (const chunk of stream) {
if (chunk.choices === undefined) {
if (chunk.event === "tool_status") {
if (chunk.data.result !== "") {
switch (chunk.data.status) {
case "STARTED":
terminal.write(
`⚙️ TOOL CALL: ${chunk.data.name}(${JSON.stringify(
chunk.data.parameters
)})`
);
break;
case "ENDED":
terminal.write(`🔧 TOOL RESULT: ${chunk.data.result}`);
break;
case "ERRORED":
terminal.write(`🔧 TOOL ERROR: ${chunk.data.error}`);
break;
case "UPDATING":
terminal.write(`🔧 TOOL UPDATE: ${chunk.data.result}`);
break;
default:
terminal.write(`Unknown tool status: ${chunk.data}`);
}
}
terminal.write("\n");
} else {
terminal.write(`Unknown event: ${JSON.stringify(chunk)}`);
}
} else {
terminal.write(chunk.choices[0]?.delta?.content || "");
}
}
terminal.write("\n");
}
while (true) {
const input = await terminal.question("You: ");
terminal.write(" ");
await chatbot(input);
}
```
# OpenAI Python SDK
Source: https://friendli.ai/docs/sdk/integrations/openai/python
Integrate FriendliAI with OpenAI Python SDK for chat, streaming, and more.
You can use [**OpenAI Python SDK**](https://github.com/openai/openai-python) to interact with FriendliAI.
This makes migration of existing applications already using OpenAI particularly easy.
## How to use
Before you start, ensure the `base_url` and `api_key` refer to FriendliAI.
Since our products are fully compatible with the OpenAI SDK, you can follow the examples below as-is.
Choose one of the [available models](https://friendli.ai/models/search?products=SERVERLESS) for the `model` parameter.
```bash
pip install -qU openai
```
### Chat Completion
Chat completion API that generates a response from a given conversation.
We provide multiple usage examples. Try to find the best one that aligns with your needs.
```python Default
import os
from openai import OpenAI
client = OpenAI(
base_url="https://api.friendli.ai/serverless/v1",
api_key=os.environ.get("FRIENDLI_TOKEN")
)
completion = client.chat.completions.create(
model="meta-llama-3.1-8b-instruct",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Hello!"}
]
)
print(completion.choices[0].message)
```
```python Streaming
import os
from openai import OpenAI
client = OpenAI(
base_url="https://api.friendli.ai/serverless/v1",
api_key=os.environ.get("FRIENDLI_TOKEN")
)
completion = client.chat.completions.create(
model="meta-llama-3.1-8b-instruct",
messages=[
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "Hello!"}
],
stream=True
)
for chunk in completion:
print(chunk.choices[0].delta)
```
```python Functions
import os
from openai import OpenAI
client = OpenAI(
base_url="https://api.friendli.ai/serverless/v1",
api_key=os.environ.get("FRIENDLI_TOKEN")
)
tools = [
{
"type": "function",
"function": {
"name": "get_current_weather",
"description": "Get the current weather in a given location",
"parameters": {
"type": "object",
"properties": {
"location": {
"type": "string",
"description": "The city and state, e.g. San Francisco, CA",
},
"unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
},
"required": ["location"],
},
}
}
]
completion = client.chat.completions.create(
model="meta-llama-3.1-8b-instruct",
messages=[
{"role": "user", "content": "What's the weather like in Boston today?"}
],
tools=tools,
tool_choice="auto"
)
print(completion)
```
```python Logprobs
import os
from openai import OpenAI
client = OpenAI(
base_url="https://api.friendli.ai/serverless/v1",
api_key=os.environ.get("FRIENDLI_TOKEN")
)
completion = client.chat.completions.create(
model="meta-llama-3.1-8b-instruct",
messages=[
{"role": "user", "content": "Hello!"}
],
logprobs=True,
top_logprobs=2
)
print(completion.choices[0].message)
print(completion.choices[0].logprobs)
```
### Tool assisted chat completion
This feature is in Beta and available only on the **Serverless Endpoints**.
Using the tool assisted chat completion API, models can utilize built-in tools for tool calls, enhancing their capability to provide more comprehensive and actionable responses.
Available tools are listed [here](/guides/serverless_endpoints/tool-assisted-api#built-in-tools).
```python Basic
import os
from openai import OpenAI
client = OpenAI(
base_url="https://api.friendli.ai/serverless/tools/v1",
api_key=os.environ.get("FRIENDLI_TOKEN")
)
stream = client.chat.completions.create(
model="meta-llama-3.1-8b-instruct",
messages=[{"role": "user", "content": "What is the current average home price in New York City, and if I put 15% down, how much will my mortgage be?"}],
tools=[
{"type": "web:search"},
{"type": "math:calculator"},
],
stream=True,
)
for chunk in stream:
if chunk.choices is None:
print(f"{chunk.event=}, {chunk.data=}")
elif chunk.choices[0].delta.content is not None:
print(chunk.choices[0].delta.content, end="")
```
```python Advanced (REPL)
import os
from openai import OpenAI
client = OpenAI(
base_url="https://api.friendli.ai/serverless/tools/v1",
api_key=os.environ.get("FRIENDLI_TOKEN")
)
class bcolors:
OKBLUE = '\033[94m'
OKCYAN = '\033[96m'
FAIL = '\033[91m'
WHITE = '\033[97m'
def print_response(response):
print(f"{bcolors.OKCYAN}{response}", end='')
def print_tool_call(data):
    print(f"\n{bcolors.OKBLUE}⚙️ TOOL CALL: {data['name']}({data['parameters']})")
def print_tool_result(data):
    print(f"{bcolors.OKBLUE}🔧 TOOL RESULT: {data['result']}")
def print_tool_error(data):
    print(f"{bcolors.FAIL}🔧 TOOL ERROR: {data['error']}", end='')
def print_tool_update(data):
    print(f"{bcolors.OKBLUE}🔧 TOOL UPDATE: {data['result']}")
def chatbot(prompt):
stream = client.chat.completions.create(
model="meta-llama-3.1-8b-instruct",
messages=[{"role": "user", "content": prompt}],
stream=True,
tools=[
{"type": "web:url"},
{"type": "code:python-interpreter"},
{"type": "math:calculator"},
{"type": "web:search"}
]
)
for chunk in stream:
if chunk.choices is None:
if chunk.event == "tool_status":
match chunk.data:
case {"status": "STARTED"}:
print_tool_call(chunk.data)
case {"status": "ENDED"}:
print_tool_result(chunk.data)
case {"status": "ERRORED"}:
print_tool_error(chunk.data)
case {"status": "UPDATING"}:
print_tool_update(chunk.data)
elif chunk.choices[0].delta.content is not None:
print_response(chunk.choices[0].delta.content)
print("\n")
print("Welcome to the Tool Inference!")
print("To exit, enter 'q'.")
while True:
user_input = input(f"{bcolors.WHITE}You: ")
if user_input.lower() == 'q':
break
chatbot(user_input)
```
# Friendli Integrations
Source: https://friendli.ai/docs/sdk/integrations/overview
Effortlessly integrate FriendliAI models into your projects with support for popular SDKs and frameworks.
## Effortless AI integration with popular SDKs
Friendli is committed to providing developers with flexible and powerful tools to integrate our AI models into their projects.
We support a variety of popular SDKs and frameworks,
making it easy to incorporate Friendli's capabilities into existing workflows and applications.
Our integration options include LiteLLM for unified LLM interactions, Vercel AI SDK for seamless web application development,
LangChain for building complex AI-driven applications, and an OpenAI-compatible API for those familiar with OpenAI's interface.
These integrations enable developers to leverage Friendli's AI models across a wide range of use cases,
from simple chat applications to sophisticated AI systems,
all while maintaining ease of use and compatibility with existing tools and practices.
# Vercel AI SDK
Source: https://friendli.ai/docs/sdk/integrations/vercel-ai-sdk
Easily integrate FriendliAI models with the Vercel AI SDK, supporting serverless, dedicated, and fine-tuned endpoints.
You can use [**Vercel AI SDK**](https://sdk.vercel.ai) to interact with FriendliAI.
This makes migration of existing applications already using Vercel AI SDK particularly easy.
## How to use
Before you start, ensure you've already obtained the `FRIENDLI_TOKEN` from the [Friendli Suite](https://friendli.ai/suite/setting/tokens).
```bash npm
npm i ai @friendliai/ai-provider
```
```bash yarn
yarn add ai @friendliai/ai-provider
```
```bash pnpm
pnpm add ai @friendliai/ai-provider
```
### Instantiation
Instantiate your models using a Friendli provider instance.
We provide usage examples for each type of endpoint. Choose the one that best suits your needs:
```ts Serverless Endpoints {4,7-9}
import { friendli } from '@friendliai/ai-provider';
// Automatically select serverless endpoints
const model = friendli("meta-llama-3.3-70b-instruct");
// Or specify a specific serverless endpoint
const model = friendli("meta-llama-3.3-70b-instruct", {
endpoint: "serverless",
});
```
```ts Dedicated Endpoints {4,7-9}
import { friendli } from '@friendliai/ai-provider';
// Replace YOUR_ENDPOINT_ID with the ID of your endpoint, e.g. "zbimjgovmlcb"
const model = friendli("YOUR_ENDPOINT_ID");
// Specify a dedicated endpoint instead of auto-selecting
const model = friendli("YOUR_ENDPOINT_ID", {
endpoint: "dedicated",
});
```
```ts Friendli Container {9}
import { createFriendli } from "@friendliai/ai-provider";
const friendli = createFriendli({
// Update with the URL where your container is running.
baseURL: "http://localhost:8000/v1",
});
// Containers do not require a model id.
const model = friendli("");
```
### Example: Generating text
Generate a response with the `generateText` function:
```ts
import { friendli } from "@friendliai/ai-provider";
import { generateText } from "ai";
const { text } = await generateText({
model: friendli("meta-llama-3.3-70b-instruct"),
prompt: "Write a vegetarian lasagna recipe for 4 people.",
});
console.log(text);
```
### Example: Using Enforcing Patterns (Regex)
Constrain your LLM's output to a specific pattern (e.g., CSV), a character set, or language-specific characters (e.g., Korean Hangul).
```ts {6}
import { friendli } from "@friendliai/ai-provider";
import { generateText } from "ai";
const { text } = await generateText({
model: friendli("meta-llama-3.3-70b-instruct", {
regex: new RegExp("[\n ,.?!0-9\uac00-\ud7af]*"),
}),
prompt: "Who is the first king of the Joseon Dynasty?",
});
console.log(text);
```
### Example: Using built-in tools
This feature is in Beta and available only on the **Serverless Endpoints**.
Using the tool assisted chat completion API, models can utilize built-in tools for tool calls, enhancing their capability to provide more comprehensive and actionable responses.
Available tools are listed [here](/guides/serverless_endpoints/tool-assisted-api#built-in-tools).
```ts {6-9}
import { friendli } from "@friendliai/ai-provider";
import { streamText } from "ai";
const result = await streamText({
model: friendli("meta-llama-3.3-70b-instruct", {
tools: [
{"type": "web:search"},
{"type": "math:calculator"},
],
}),
prompt: "Find the current USD to CAD exchange rate and calculate how much $5,000 USD would be in Canadian dollars.",
});
for await (const textPart of result.textStream) {
console.log(textPart);
}
```
## OpenAI Compatibility
You can also use `@ai-sdk/openai` as the APIs are OpenAI-compatible.
```ts
import { createOpenAI } from '@ai-sdk/openai';
const friendli = createOpenAI({
baseURL: 'https://api.friendli.ai/serverless/v1',
apiKey: process.env.FRIENDLI_TOKEN,
});
```
If you are using Dedicated Endpoints:
```ts
import { createOpenAI } from '@ai-sdk/openai';
const friendli = createOpenAI({
baseURL: 'https://api.friendli.ai/dedicated/v1',
apiKey: process.env.FRIENDLI_TOKEN,
});
```
## Further resources
* [Implementing a simple streaming chat with Next.js](https://sdk.vercel.ai/examples/next-app/basics/streaming-text-generation)
* [Build a Next.js app with the Vercel AI SDK](https://sdk.vercel.ai/docs/getting-started/nextjs-app-router)
* [Explore the Vercel AI SDK Core Reference](https://sdk.vercel.ai/docs/ai-sdk-core/overview)
# FriendliAI + Weaviate (Node.js)
Source: https://friendli.ai/docs/sdk/integrations/weaviate/nodejs
Use Weaviate, an open-source vector database, to build applications with less hallucination.
Integration with [**Weaviate**](https://github.com/weaviate/weaviate) enables performing Retrieval Augmented Generation (RAG) directly within the Weaviate database.
This combines the power of [**Friendli Engine**](https://friendli.ai/solutions/engine) and Weaviate's efficient storage and fast retrieval capabilities to generate personalized and context-aware responses.
## How to use
Before you start, ensure you've already obtained the `FRIENDLI_TOKEN` from the [Friendli Suite](https://friendli.ai/suite/setting/tokens).
Also, set up your Weaviate instance following this [guide](https://weaviate.io/developers/weaviate/starter-guides/which-weaviate).
Your Weaviate instance must be configured with the FriendliAI generative AI integration (`generative-friendliai`) module.
```bash npm
npm i weaviate-client
```
```bash yarn
yarn add weaviate-client
```
```bash pnpm
pnpm add weaviate-client
```
### Instantiation
Now we can instantiate a [Weaviate collection](https://weaviate.io/developers/weaviate/manage-data/collections) using our model.
We provide usage examples for each type of endpoint. Choose the one that best suits your needs.
You can specify one of the [available models](https://friendli.ai/models/search?products=SERVERLESS) for the serverless endpoints.
The default model (i.e. `meta-llama-3.3-70b-instruct`) will be used if no model is specified.
```ts Serverless Endpoints
import weaviate from 'weaviate-client'
const client = await weaviate.connectToWeaviateCloud(
'WEAVIATE_INSTANCE_URL', // your Weaviate instance URL
{
authCredentials: new weaviate.ApiKey('WEAVIATE_INSTANCE_APIKEY'),
headers: {
'X-Friendli-Api-Key': process.env.FRIENDLI_TOKEN,
}
}
)
await client.collections.create({
name: 'DemoCollection',
generative: weaviate.configure.generative.friendliai({
model: 'meta-llama-3.3-70b-instruct'
}),
// Additional parameters ...
});
client.close()
```
```ts Dedicated Endpoints
import weaviate from 'weaviate-client'
const client = await weaviate.connectToWeaviateCloud(
'WEAVIATE_INSTANCE_URL', // your Weaviate instance URL
{
authCredentials: new weaviate.ApiKey('WEAVIATE_INSTANCE_APIKEY'),
headers: {
'X-Friendli-Api-Key': process.env.FRIENDLI_TOKEN,
"X-Friendli-Baseurl": "https://api.friendli.ai/dedicated",
}
}
)
await client.collections.create({
name: 'DemoCollection',
generative: weaviate.configure.generative.friendliai({
model: 'YOUR_ENDPOINT_ID'
}),
// Additional parameters ...
});
client.close()
```
```ts Fine-tuned Dedicated Endpoints
import weaviate from 'weaviate-client'
const client = await weaviate.connectToWeaviateCloud(
'WEAVIATE_INSTANCE_URL', // your Weaviate instance URL
{
authCredentials: new weaviate.ApiKey('WEAVIATE_INSTANCE_APIKEY'),
headers: {
'X-Friendli-Api-Key': process.env.FRIENDLI_TOKEN,
"X-Friendli-Baseurl": "https://api.friendli.ai/dedicated",
}
}
)
await client.collections.create({
name: 'DemoCollection',
generative: weaviate.configure.generative.friendliai({
model: 'YOUR_ENDPOINT_ID:YOUR_ADAPTER_ROUTE'
}),
// Additional parameters ...
});
client.close()
```
#### Configurable parameters
Configure the following generative parameters to customize the model behavior.
```ts
await client.collections.create({
name: 'DemoCollection',
generative: weaviate.configure.generative.friendliai({
model: 'meta-llama-3.3-70b-instruct',
maxTokens: 500,
temperature: 0.7,
}),
// Additional parameters ...
});
```
### Retrieval Augmented Generation
After configuring Weaviate, perform RAG operations, either with the single prompt or grouped task method.
#### Single prompt
To generate text for each object in the search results, use the single prompt method.
The example below generates outputs for each of the n search results, where n is specified by the limit parameter.
When creating a single prompt query, use braces `{}` to interpolate the object properties you want Weaviate to pass on to the language model.
For example, to pass on the object's title property, include `{title}` in the query.
```ts
let myCollection = client.collections.get('DemoCollection');
const singlePromptResults = await myCollection.generate.nearText(
['A holiday film'],
{
singlePrompt: `Translate this into French: {title}`,
},
{
limit: 2,
}
);
for (const obj of singlePromptResults.objects) {
console.log(obj.properties['title']);
console.log(`Generated output: ${obj.generated}`); // Note that the generated output is per object
}
```
#### Grouped task
To generate one text for the entire set of search results, use the grouped task method.
In other words, when you have n search results, the generative model generates one output for the entire group.
```ts
let myCollection = client.collections.get('DemoCollection');
const groupedTaskResults = await myCollection.generate.nearText(
['A holiday film'],
{
groupedTask: `Write a fun tweet to promote readers to check out these films.`,
},
{
limit: 2,
}
);
console.log(`Generated output: ${groupedTaskResults.generated}`); // Note that the generated output is per query
for (const obj of groupedTaskResults.objects) {
console.log(obj.properties['title']);
}
```
### Further resources
Once the integration is configured for the collection, data management and search operations in Weaviate work identically to any other collection.
See the following model-agnostic examples:
* [How-to manage data guides show how to perform data operations](https://weaviate.io/developers/weaviate/manage-data/create).
* [How-to search guides show how to perform search operations](https://weaviate.io/developers/weaviate/search/basics).
# FriendliAI + Weaviate (Python)
Source: https://friendli.ai/docs/sdk/integrations/weaviate/python
Use Weaviate, an open-source vector database, to build applications with less hallucination.
Integration with [**Weaviate**](https://github.com/weaviate/weaviate) enables performing Retrieval Augmented Generation (RAG) directly within the Weaviate database.
This combines the power of [**Friendli Engine**](https://friendli.ai/solutions/engine) and Weaviate's efficient storage and fast retrieval capabilities to generate personalized and context-aware responses.
## How to use
Before you start, ensure you've already obtained the `FRIENDLI_TOKEN` from the [Friendli Suite](https://friendli.ai/suite/setting/tokens).
Also, set up your Weaviate instance following this [guide](https://weaviate.io/developers/weaviate/starter-guides/which-weaviate).
Your Weaviate instance must be configured with the FriendliAI generative AI integration (`generative-friendliai`) module.
```bash
pip install -qU weaviate-client
```
### Instantiation
Now we can instantiate a [Weaviate collection](https://weaviate.io/developers/weaviate/manage-data/collections) using our model.
We provide usage examples for each type of endpoint. Choose the one that best suits your needs.
You can specify one of the [available models](https://friendli.ai/models/search?products=SERVERLESS) for the serverless endpoints.
The default model (i.e. `meta-llama-3.3-70b-instruct`) will be used if no model is specified.
```python Serverless Endpoints
import weaviate
import os
from weaviate.classes.init import Auth
from weaviate.classes.config import Configure
headers = {
"X-Friendli-Api-Key": os.getenv("FRIENDLI_TOKEN"),
}
client = weaviate.connect_to_weaviate_cloud(
cluster_url=weaviate_url, # `weaviate_url`: your Weaviate URL
auth_credentials=Auth.api_key(weaviate_key), # `weaviate_key`: your Weaviate API key
headers=headers
)
client.collections.create(
"DemoCollection",
generative_config=Configure.Generative.friendliai(
model = "meta-llama-3.3-70b-instruct",
)
# Additional parameters not shown
)
client.close()
```
```python Dedicated Endpoints
import weaviate
import os
from weaviate.classes.init import Auth
from weaviate.classes.config import Configure
headers = {
"X-Friendli-Api-Key": os.getenv("FRIENDLI_TOKEN"),
"X-Friendli-Baseurl": "https://api.friendli.ai/dedicated",
}
client = weaviate.connect_to_weaviate_cloud(
cluster_url=weaviate_url, # `weaviate_url`: your Weaviate URL
auth_credentials=Auth.api_key(weaviate_key), # `weaviate_key`: your Weaviate API key
headers=headers
)
client.collections.create(
"DemoCollection",
generative_config=Configure.Generative.friendliai(
model = "YOUR_ENDPOINT_ID",
)
# Additional parameters not shown
)
client.close()
```
```python Fine-tuned Dedicated Endpoints
import weaviate
import os
from weaviate.classes.init import Auth
from weaviate.classes.config import Configure
headers = {
"X-Friendli-Api-Key": os.getenv("FRIENDLI_TOKEN"),
"X-Friendli-Baseurl": "https://api.friendli.ai/dedicated",
}
client = weaviate.connect_to_weaviate_cloud(
cluster_url=weaviate_url, # `weaviate_url`: your Weaviate URL
auth_credentials=Auth.api_key(weaviate_key), # `weaviate_key`: your Weaviate API key
headers=headers
)
client.collections.create(
"DemoCollection",
generative_config=Configure.Generative.friendliai(
model = "YOUR_ENDPOINT_ID:YOUR_ADAPTER_ROUTE",
)
# Additional parameters not shown
)
client.close()
```
#### Configurable parameters
Configure the following generative parameters to customize the model behavior.
```python
from weaviate.classes.config import Configure
client.collections.create(
"DemoCollection",
generative_config=Configure.Generative.friendliai(
# These parameters are optional
model = "meta-llama-3.3-70b-instruct",
max_tokens = 500,
temperature = 0.7,
)
)
```
### Retrieval Augmented Generation
After configuring Weaviate, perform RAG operations, either with the single prompt or grouped task method.
#### Single prompt
To generate text for each object in the search results, use the single prompt method.
The example below generates outputs for each of the n search results, where n is specified by the `limit` parameter.
When creating a single prompt query, use braces `{}` to interpolate the object properties you want Weaviate to pass on to the language model.
For example, to pass on the object's title property, include `{title}` in the query.
```python
collection = client.collections.get("DemoCollection")
response = collection.generate.near_text(
query="A holiday film", # The model provider integration will automatically vectorize the query
single_prompt="Translate this into French: {title}",
limit=2
)
for obj in response.objects:
print(obj.properties["title"])
print(f"Generated output: {obj.generated}") # Note that the generated output is per object
```
#### Grouped task
To generate one text for the entire set of search results, use the grouped task method.
In other words, when you have n search results, the generative model generates one output for the entire group.
```python
collection = client.collections.get("DemoCollection")
response = collection.generate.near_text(
query="A holiday film", # The model provider integration will automatically vectorize the query
    grouped_task="Write a fun tweet to encourage readers to check out these films.",
limit=2
)
print(f"Generated output: {response.generated}") # Note that the generated output is per query
for obj in response.objects:
print(obj.properties["title"])
```
### Further resources
Once the integration is configured for a collection, data management and search operations in Weaviate work the same as for any other collection.
See the following model-agnostic examples, as well as the short sketch after this list:
* [How-to manage data guides show how to perform data operations](https://weaviate.io/developers/weaviate/manage-data/create).
* [How-to search guides show how to perform search operations](https://weaviate.io/developers/weaviate/search/basics).
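For instance, the RAG queries above assume `DemoCollection` already contains objects with a `title` property. A minimal sketch of populating the collection with the v4 client's batch API is shown below; the property name and sample values are illustrative, not taken from the official guides.
```python
collection = client.collections.get("DemoCollection")
source_objects = [
    {"title": "It's a Wonderful Life"},
    {"title": "Home Alone"},
]
# Batch-insert the objects; the configured vectorizer (if any) embeds them automatically
with collection.batch.dynamic() as batch:
    for obj in source_objects:
        batch.add_object(properties=obj)
```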
# Friendli Python SDK
Source: https://friendli.ai/docs/sdk/python-sdk
Interact with Friendli AI services using the official Python SDK for seamless integration with your applications.
## Introduction
The [Friendli Python SDK](https://github.com/friendliai/friendli-python) provides a powerful and flexible way to interact with FriendliAI services, including Serverless Endpoints, Dedicated Endpoints, and Container. This allows developers to easily integrate their Python applications with FriendliAI.
## Installation
The SDK can be installed with either pip or poetry:
```bash
# Using pip
pip install friendli
# Using poetry
poetry add friendli
```
## Authentication
Authentication is done using a Friendli Token, which can be generated from the [Friendli Suite](https://friendli.ai/suite) in your Personal Settings:
```python
import os
from friendli import SyncFriendli

with SyncFriendli(
    token=os.environ["FRIENDLI_TOKEN"],
) as friendli:
    # Your code here
    pass
```
For detailed instructions on generating a Friendli Token, see the [Personal Access Tokens](/guides/suite/personal_access_tokens) guide.
## Chat Completions
The SDK supports chat completions across all deployment types. Choose the deployment option that best fits your needs.
```python Serverless Endpoints
import os
from friendli import SyncFriendli
with SyncFriendli(
token=os.environ["FRIENDLI_TOKEN"],
) as friendli:
res = friendli.serverless.chat.complete(
messages=[
{
"content": "You are a helpful assistant.",
"role": "system",
},
{
"content": "Hello!",
"role": "user",
},
],
model="meta-llama-3.1-8b-instruct",
max_tokens=200,
)
print(res)
```
```python Dedicated Endpoints
import os
from friendli import SyncFriendli
with SyncFriendli(
token=os.environ["FRIENDLI_TOKEN"],
) as friendli:
res = friendli.dedicated.chat.complete(
messages=[
{
"content": "You are a helpful assistant.",
"role": "system",
},
{
"content": "Hello!",
"role": "user",
},
],
model="YOUR_ENDPOINT_ID",
max_tokens=200,
)
print(res)
```
```python Container Deployment
from friendli import SyncFriendli
with SyncFriendli() as friendli:
res = friendli.container.chat.complete(
messages=[
{
"content": "You are a helpful assistant.",
"role": "system",
},
{
"content": "Hello!",
"role": "user",
},
],
max_tokens=200,
)
print(res)
```
### Asynchronous Chat Completions
```python Serverless Endpoints
import asyncio
import os
from friendli import AsyncFriendli
async def main():
async with AsyncFriendli(
token=os.environ["FRIENDLI_TOKEN"],
) as friendli:
res = await friendli.serverless.chat.complete(
messages=[
{
"content": "You are a helpful assistant.",
"role": "system",
},
{
"content": "Hello!",
"role": "user",
},
],
model="meta-llama-3.1-8b-instruct",
max_tokens=200,
)
print(res)
asyncio.run(main())
```
```python Dedicated Endpoints
import asyncio
import os
from friendli import AsyncFriendli
async def main():
async with AsyncFriendli(
token=os.environ["FRIENDLI_TOKEN"],
) as friendli:
res = await friendli.dedicated.chat.complete(
messages=[
{
"content": "You are a helpful assistant.",
"role": "system",
},
{
"content": "Hello!",
"role": "user",
},
],
model="YOUR_ENDPOINT_ID",
max_tokens=200,
)
print(res)
asyncio.run(main())
```
```python Container Deployment
import asyncio
from friendli import AsyncFriendli
async def main():
async with AsyncFriendli() as friendli:
res = await friendli.container.chat.complete(
messages=[
{
"content": "You are a helpful assistant.",
"role": "system",
},
{
"content": "Hello!",
"role": "user",
},
],
max_tokens=200,
)
print(res)
asyncio.run(main())
```
### Tool-Assisted Chat Completions
Tool-assisted chat completions are available only on Serverless Endpoints.
```python
import os
from friendli import SyncFriendli
with SyncFriendli(
token=os.environ["FRIENDLI_TOKEN"],
) as friendli:
res = friendli.serverless.tool_assisted_chat.complete(
messages=[
{
"content": "What is 3 + 6?",
"role": "user",
},
],
model="meta-llama-3.1-8b-instruct",
max_tokens=200,
tools=[
{
"type": "math:calculator",
},
],
)
print(res)
```
## Advanced Features
### Streaming Responses
The SDK supports streaming responses via server-sent events, which can be consumed with a simple `for` loop:
```python
import os
from friendli import SyncFriendli
with SyncFriendli(
token=os.environ["FRIENDLI_TOKEN"],
) as friendli:
res = friendli.serverless.chat.stream(
messages=[
{
"content": "You are a helpful assistant.",
"role": "system",
},
{
"content": "Hello!",
"role": "user",
},
],
model="meta-llama-3.1-8b-instruct",
max_tokens=200,
)
with res as event_stream:
for event in event_stream:
# Process each chunk as it arrives
print(event, flush=True)
```
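Asynchronous streaming follows the same pattern with `AsyncFriendli`. The snippet below is a sketch that mirrors the synchronous example; the exact async stream interface may differ slightly between SDK versions, so verify it against your installed release.
```python
import asyncio
import os
from friendli import AsyncFriendli

async def main():
    async with AsyncFriendli(
        token=os.environ["FRIENDLI_TOKEN"],
    ) as friendli:
        res = await friendli.serverless.chat.stream(
            messages=[
                {"content": "You are a helpful assistant.", "role": "system"},
                {"content": "Hello!", "role": "user"},
            ],
            model="meta-llama-3.1-8b-instruct",
            max_tokens=200,
        )
        async with res as event_stream:
            async for event in event_stream:
                # Process each chunk as it arrives
                print(event, flush=True)

asyncio.run(main())
```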
### Custom Retry Strategy
You can customize retry behavior for operations that support retries:
```python
import os
from friendli import SyncFriendli
from friendli.utils import BackoffStrategy, RetryConfig
with SyncFriendli(
token=os.environ["FRIENDLI_TOKEN"],
) as friendli:
res = friendli.serverless.chat.complete(
messages=[
{
"content": "You are a helpful assistant.",
"role": "system",
},
{
"content": "Hello!",
"role": "user",
},
],
model="meta-llama-3.1-8b-instruct",
max_tokens=200,
retries=RetryConfig("backoff", BackoffStrategy(1, 50, 1.1, 100), False),
)
# Handle response
print(res)
```
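In the SDK's retry utilities, the positional arguments to `BackoffStrategy` generally correspond to the initial interval, maximum interval, exponent, and maximum elapsed time (in milliseconds), and the trailing `False` disables retries on connection errors. If you prefer one policy for every call, the client constructor typically also accepts a `retry_config` argument; the snippet below is a sketch under that assumption, so confirm the parameter name against your installed SDK version.
```python
import os
from friendli import SyncFriendli
from friendli.utils import BackoffStrategy, RetryConfig

# Sketch: apply a single retry policy to every request made through this client
with SyncFriendli(
    token=os.environ["FRIENDLI_TOKEN"],
    retry_config=RetryConfig("backoff", BackoffStrategy(1, 50, 1.1, 100), False),
) as friendli:
    # All supported operations now retry with the configured backoff by default
    pass
```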
### Error Handling
The SDK provides comprehensive error handling with detailed exception information:
```python
import os
from friendli import SyncFriendli, models
with SyncFriendli(
token=os.environ["FRIENDLI_TOKEN"],
) as friendli:
try:
res = friendli.dedicated.endpoint.create(
advanced={
"tokenizer_add_special_tokens": True,
"tokenizer_skip_special_tokens": False,
},
hf_model_repo="",
instance_option_id="",
name="",
project_id="",
)
# Handle response
print(res)
except models.HTTPValidationError as e:
# Handle validation errors
print(f"Validation error: {e.data}")
except models.SDKError as e:
# Handle general SDK errors
print(f"Error {e.status_code}: {e.message}")
```
### Custom Logging
You can pass your own logger to the client class to help troubleshoot and diagnose issues during API interactions. This is especially useful when you encounter unexpected behavior or errors.
```python
import logging
import os
from friendli import SyncFriendli
# Configure your custom logger, for example:
logger = logging.getLogger(__name__)
logging.basicConfig(
format="[%(filename)s:%(lineno)s - %(funcName)s()] %(message)s",
level=logging.INFO,
handlers=[logging.StreamHandler()],
)
with SyncFriendli(
    server_url=SERVER_URL,  # `SERVER_URL`: your server URL (optional override of the default)
    token=TOKEN,  # `TOKEN`: your Friendli Token
    debug_logger=logger,  # Pass your logger here
) as friendli:
# Your code here
pass
```
## Beta Features
### Dataset Management (Beta)
Our SDK provides a straightforward way to create, retrieve, and update datasets within your projects. Datasets can contain samples across various modalities, such as text, images, and more, allowing flexible and comprehensive dataset construction for your fine-tuning and validation workflows.
```python
import os
from friendli.friendli import SyncFriendli
from friendli.models import Sample
TEAM_ID = os.environ["FRIENDLI_TEAM_ID"]
PROJECT_ID = os.environ["FRIENDLI_PROJECT_ID"]
TOKEN = os.environ["FRIENDLI_TOKEN"]
with SyncFriendli(
token=TOKEN,
x_friendli_team=TEAM_ID,
) as friendli:
# Create dataset
with friendli.dataset.create(
modality=["TEXT", "IMAGE"],
name="test-create-dataset-sync",
project_id=PROJECT_ID,
) as dataset:
# Read dataset
with open("dataset.jsonl", "rb") as f:
data = [Sample.model_validate_json(line) for line in f]
# Add samples to dataset
dataset.upload_samples(
samples=data,
split="train",
)
```
### File Management (Beta)
You can upload files to and download files from our database. This feature is primarily designed for storing sample files related to datasets, with additional use cases planned for the future.
```python
import io
import os
from hashlib import sha256
import httpx
from friendli import SyncFriendli
TEAM_ID = os.environ["FRIENDLI_TEAM_ID"]
PROJECT_ID = os.environ["FRIENDLI_PROJECT_ID"]
TOKEN = os.environ["FRIENDLI_TOKEN"]
with SyncFriendli(
token=TOKEN,
) as friendli:
# Read data from file
with open("lorem.txt", "rb") as f:
data = f.read()
# Initiate upload
init_upload_res = friendli.file.init_upload(
digest=f"sha256:{sha256(data).hexdigest()}",
name="lorem.txt",
project_id=PROJECT_ID,
size=len(data),
x_friendli_team=TEAM_ID,
)
# Upload to S3
if init_upload_res.upload_url is not None:
httpx.post(
url=init_upload_res.upload_url,
data=init_upload_res.aws,
files={"file": io.BytesIO(data)},
timeout=60,
).raise_for_status()
# Complete upload
friendli.file.complete_upload(
file_id=init_upload_res.file_id,
x_friendli_team=TEAM_ID,
)
# Get download URL
get_download_url_res = friendli.file.get_download_url(
file_id=init_upload_res.file_id,
x_friendli_team=TEAM_ID,
)
print(get_download_url_res.download_url)
```
## Further Resources
For complete API documentation, advanced usage examples, and detailed reference information, please visit the [Friendli Python SDK GitHub repository](https://github.com/friendliai/friendli-python).